Pf8: An open dataset of Plasmodium falciparum genome variation in 33,325 worldwide samples
15 Apr 2025

MalariaGEN et al.

Wellcome Open Research, 2025; - - DOI: -

Parasite

About the Plasmodium falciparum version 8 data

This page provides information about the Pf8 dataset, which contains genome variation data on 33,325 worldwide samples of Plasmodium falciparum. Key publication details will be posted here as soon as they are available.

Open the Pf8 app to view summary information about contributing studies, countries, and resistance profiles.

Open the HaploAtlas app to study and track genetic mutations across any gene in the P. falciparum genome.

Background and previous releases

This latest dataset is based on genome variation data from the MalariaGEN network and expands the sample size by approximately 60% over the previously released Pf7 dataset (published 2023). It includes samples that were previously released through the Pf3k Project, the Plasmodium falciparum Community Project, the GenRe Mekong Project and Pf7. It comprises multiple partner studies, each with its own research objectives and led by a local investigator. Genome sequencing is performed centrally, and partner studies are free to analyse and publish the genetic data produced on their own samples, in line with MalariaGEN’s guiding principles on equitable data sharing. Details of the 99 contributing partner studies, including description, contact information and key people, are available as an Appendix in the Supplementary Materials.

About the version 8 data pipeline

Details of the methods can be found in the accompanying paper.

Content of the data release

This release contains details on contributing partner studies, sample metadata and key sample attributes inferred from genomic data, and genomic data including raw sequence reads. Further details and analytical results can be found in the accompanying data release paper.

These data are available open access under the terms of the Creative Commons Attribution 4.0 International license (CC-BY 4.0). Publications using these data should acknowledge and cite the source of the data. Citation details will be made available here as soon as the paper is published.

  • Sample provenance and sequencing metadata: sample information including partner study information, location and year of collection, ENA accession numbers, and QC information for 33,325 samples from 34 countries.
  • CNV calls: amplification calls for genes CRT, GCH1, MDR1 and PM2_PM3, and deletion calls for HRP2 and HRP3.
  • Tandem duplication breakpoints: genomic coordinates of breakpoints used for faceaway read-based calling.
  • Measure of complexity of infections: characterisation of within-host diversity (Fws) for 24,409 QC pass samples.
  • Drug resistance marker genotypes: genotypes at known markers of drug resistance for 24,409 samples, containing amino acid and copy number genotypes at six loci: crt, dhfr, dhps, mdr1, kelch13, plasmepsin 2-3.
  • Inferred resistance status classification: classification of 24,409 QC pass samples into different types of resistance to 10 drugs or combinations of drugs (chloroquine, pyrimethamine, sulfadoxine, mefloquine, artemisinin, piperaquine, sulfadoxine-pyrimethamine for treatment of uncomplicated malaria, sulfadoxine-pyrimethamine for intermittent preventive treatment in pregnancy, artesunate-mefloquine, dihydroartemisinin-piperaquine) and of resistance to RDT detection (hrp2 and hrp3 gene deletions).
  • Drug resistance markers to inferred resistance status: details of the heuristics utilised to map genetic markers to resistance status classification.
  • Reference genome: the version of the 3D7 reference genome fasta file used for mapping (PlasmoDB-54-Pfalciparum3D7-Genome.fasta).
  • Annotation file: the version of the 3D7 reference annotation gff file used for genome annotations (PlasmoDB-55_Pfalciparum3D7.gff.gz).
  • Genetic distances: Pf8_mean_genotype_distance.npy. Genetic distance matrix comparing all 33,325 samples (numpy array).
  • SNP-only genetic distances: Pf8_mean_genotype_distance_snp_only.npy. Genetic distance matrix comparing all 33,325 samples using SNP-only call set (numpy array).
  • Short variants genotypes: Genotype calls on 12,493,205 SNPs and short indels in all 33,325 samples from 34 countries, available both as VCF and zarr files (Pf8.zarr.zip).
  • SNP-only genotypes: Genotype calls on 10,821,552 SNPs in all 33,325 samples from 34 countries, available both as VCF and zarr files (pf8_snp_only_clean_zarr.zip).
  • CRAM files: Compressed sequencing data files for all 33,325 samples.
  • gVCF files: genomic VCF files containing both variant and non-variant regions for all 33,325 Pf8 samples.
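
As an illustration of how the genetic distance matrices listed above might be queried, the following is a minimal Python sketch. It assumes the .npy file holds a square, symmetric samples-by-samples matrix in the same order as the sample metadata; that layout is an assumption and should be checked against the README. A small synthetic matrix stands in for the real file, and the helper nearest_neighbours is a hypothetical name introduced here for illustration.

```python
import os
import tempfile

import numpy as np

# Synthetic stand-in for the release file: a small symmetric matrix with a
# zero diagonal. The real file's dtype and layout are assumptions here.
rng = np.random.default_rng(0)
n_samples = 6
upper = np.triu(rng.random((n_samples, n_samples)), k=1)
synthetic = (upper + upper.T).astype(np.float32)

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "toy_mean_genotype_distance.npy")
np.save(path, synthetic)


def nearest_neighbours(dist, index, k=3):
    """Return indices of the k samples closest to sample `index`, excluding itself."""
    row = np.asarray(dist[index], dtype=float).copy()
    row[index] = np.inf  # never report the sample as its own neighbour
    return np.argsort(row)[:k]


# mmap_mode="r" reads rows on demand instead of loading the whole array,
# which matters for a matrix with over a billion entries.
dist = np.load(path, mmap_mode="r")
neighbours = nearest_neighbours(dist, index=0, k=3)
print(neighbours, dist[0, neighbours])
```

With the real data, path would point at the downloaded Pf8_mean_genotype_distance.npy, and row indices would need to be mapped to sample identifiers through the sample metadata file.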

A README file describes in fine detail all the files included in the release, the format and interpretation of each column, and contains some tips and tricks for accessing genotype data in VCF and zarr files.
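
The README's own tips are not reproduced here. As a generic illustration of what reading genotype calls from a VCF involves, the sketch below parses the GT field from a tiny synthetic VCF excerpt using only the Python standard library. The sample names, positions and alleles are invented, the diploid-style GT encoding is an assumption about the Pf8 files, and parse_genotypes is a hypothetical helper. For the full release files a dedicated VCF or zarr library would be far more practical.

```python
import io

# Synthetic three-sample VCF excerpt; positions, alleles and sample names are
# invented for illustration and are not taken from Pf8.
TOY_VCF = """\
##fileformat=VCFv4.2
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3
Pf3D7_01_v3\t1000\t.\tA\tG\t50\tPASS\t.\tGT:AD\t0/0:30,0\t0/1:12,9\t1/1:0,25
Pf3D7_01_v3\t2000\t.\tC\tT,G\t50\tPASS\t.\tGT:AD\t./.:0,0,0\t0/2:8,0,7\t1/1:0,19,0
"""


def parse_genotypes(handle):
    """Yield (chrom, pos, {sample: allele-index tuple or None}) per record."""
    samples = []
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information headers
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the nine fixed columns
            continue
        # FORMAT column says where GT sits inside each sample entry
        gt_index = fields[8].split(":").index("GT")
        calls = {}
        for sample, entry in zip(samples, fields[9:]):
            gt = entry.split(":")[gt_index].replace("|", "/")
            alleles = gt.split("/")
            # "." marks a missing call; report it as None
            calls[sample] = None if "." in alleles else tuple(int(a) for a in alleles)
        yield fields[0], int(fields[1]), calls


records = list(parse_genotypes(io.StringIO(TOY_VCF)))
for chrom, pos, calls in records:
    het = [s for s, gt in calls.items() if gt is not None and len(set(gt)) > 1]
    print(chrom, pos, "heterozygous calls:", het)
```

Allele index 0 is the reference allele and indices 1, 2, ... refer to the comma-separated ALT alleles, so a call of 0/2 at the second record mixes the reference with the second alternate allele.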

The Pf8 user guide is a useful companion to these data, providing information on how to use the malariagen_data Python package to access data in the cloud using free computer services and Jupyter Notebooks without having to first download the resource locally.