Pf7: An open dataset of Plasmodium falciparum - v.7.0

Released on 8 Dec 2022.

This page provides information about the Pf7 dataset, which contains genome variation data on more than 20,000 worldwide samples of Plasmodium falciparum.

Open the Pf7 app to view summary information about contributing studies, countries, and resistance profiles.

This release contains details on contributing partner studies, sample metadata and key sample attributes inferred from genomic data, and genomic data including raw sequence reads. A description of the dataset can be found here.

These data are available open access. Publications using these data should acknowledge and cite the source of the data using the following format: “This publication uses MalariaGEN data as described in ‘Pf7: an open dataset of Plasmodium falciparum genome variation in 20,000 worldwide samples’. MalariaGEN et al, Wellcome Open Research 2023, 8:22. https://doi.org/10.12688/wellcomeopenres.18681.1.”

Data sets

Study information

Details of the 82 contributing partner studies, including description, contact information and key people

Download study information

Sample provenance and sequencing metadata

Sample information including partner study information, location and year of collection, ENA accession numbers, and QC information for 20,864 samples from 33 countries.

Download sample provenance and sequencing metadata
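
As a rough sketch of how the sample metadata might be summarised once downloaded, the snippet below tallies QC pass samples per country using only the Python standard library. The tab-delimited layout and the column names `Sample`, `Country` and `QC pass` are assumptions made for illustration, as is the `True`/`False` encoding of the QC flag; the README documents the actual columns.

```python
import csv
import io
from collections import Counter


def qc_pass_by_country(handle, country_col="Country", qc_col="QC pass"):
    """Count QC pass samples per country in a tab-delimited metadata file.

    The column names and the "True"/"False" encoding are assumed here,
    not confirmed by this page.
    """
    counts = Counter()
    for row in csv.DictReader(handle, delimiter="\t"):
        if row[qc_col].strip().lower() == "true":
            counts[row[country_col]] += 1
    return counts


# Tiny made-up table standing in for the real metadata file.
example = io.StringIO(
    "Sample\tCountry\tQC pass\n"
    "S1\tGhana\tTrue\n"
    "S2\tGhana\tFalse\n"
    "S3\tKenya\tTrue\n"
)
example_counts = qc_pass_by_country(example)
```

For the real file, the same function could be called with an open file handle, for example `qc_pass_by_country(open(path, newline=""))`, after checking the column names against the README.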

Measure of complexity of infections

Characterisation of within-host diversity (FWS) for 16,203 QC pass samples.

Download measure of complexity of infections

Drug resistance marker genotypes

Genotypes at known markers of drug resistance for 16,203 samples, containing amino acid and copy number genotypes at six loci: crt, dhfr, dhps, mdr1, kelch13, plasmepsin 2-3.

Download drug resistance marker genotypes

Inferred resistance status classification

Classification of 16,203 QC pass samples by inferred resistance to 10 drugs or drug combinations (chloroquine, pyrimethamine, sulfadoxine, mefloquine, artemisinin, piperaquine, sulfadoxine-pyrimethamine for treatment of uncomplicated malaria, sulfadoxine-pyrimethamine for intermittent preventive treatment in pregnancy, artesunate-mefloquine, dihydroartemisinin-piperaquine) and by hrp2 and hrp3 gene deletions, which are relevant to RDT detection.

Download inferred resistance status classification

Drug resistance markers to inferred resistance status

Details of the heuristics used to map genetic markers to the resistance status classification.

Download drug resistance markers to inferred resistance status
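
To give a feel for what a marker-to-status heuristic can look like, here is a deliberately simplified single-marker sketch based on the well-known association between the crt K76T substitution and chloroquine resistance. It is not the release's actual rule set, which is defined in the downloadable mapping file. The function name, the status labels, and the `/` separator for mixed calls are all hypothetical.

```python
def classify_chloroquine(crt_76):
    """Illustrative single-marker rule; NOT the release's actual heuristics.

    crt_76 is the amino acid call at crt codon 76, e.g. "K", "T" or "K/T".
    None, "" or "-" stand for a missing call (encoding assumed).
    """
    if crt_76 in (None, "", "-"):
        return "Undetermined"
    alleles = set(crt_76.split("/"))
    if alleles == {"K"}:
        return "Sensitive"   # wild-type lysine at codon 76
    if alleles == {"T"}:
        return "Resistant"   # K76T substitution
    return "Undetermined"    # mixed or unexpected call
```

Under these assumptions, a missing or mixed call is reported as undetermined rather than forced into one of the two classes.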

CRT haplotypes

Full crt gene haplotypes for 16,203 QC pass samples. These are available at ftp://ngs.sanger.ac.uk/production/malaria/Resource/34/Pf7_crt_haplotypes.txt.

CSP C-terminal haplotypes

Full csp C-terminal haplotypes for 16,203 QC pass samples plus 6 lab strains. These are available at ftp://ngs.sanger.ac.uk/production/malaria/Resource/34/Pf7_csp_c_terminal_haplotypes.txt.

EBA175 calls

eba175 allelic type calls for 16,203 QC pass samples.

Download EBA175 calls

Reference genome

The version of the 3D7 reference genome fasta file used for mapping. This is available at ftp://ngs.sanger.ac.uk/production/malaria/Resource/34/Pfalciparum.genome.fasta

Annotation file

The version of the 3D7 reference annotation gff file used for genome annotations. This is available at ftp://ngs.sanger.ac.uk/production/malaria/Resource/34/Pfalciparum_replace_Pf3D7_MIT_v3_with_Pf_M76611.gff.

Genetic distances

Genetic distance matrix comparing all 20,864 samples. This is available at ftp://ngs.sanger.ac.uk/production/malaria/Resource/34/Pf7_genetic_distance_matrix.npy
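
A matrix covering all 20,864 samples is large (roughly 3.5 GB if the values are stored as 8-byte floats, which is an assumption), so it may be preferable to memory-map the file instead of loading it whole. The sketch below uses NumPy's `np.load` with `mmap_mode="r"` and assumes the `.npy` file holds a plain square numeric array; the helper names are hypothetical, and how rows map to samples should be checked in the README.

```python
import numpy as np


def load_distance_matrix(path, expected_n=None):
    """Memory-map a square distance matrix stored in .npy format."""
    dist = np.load(path, mmap_mode="r")
    if dist.ndim != 2 or dist.shape[0] != dist.shape[1]:
        raise ValueError(f"expected a square matrix, got shape {dist.shape}")
    if expected_n is not None and dist.shape[0] != expected_n:
        raise ValueError(f"expected {expected_n} samples, got {dist.shape[0]}")
    return dist


def nearest_neighbour(dist, i):
    """Return the index of the sample closest to sample i, excluding i."""
    row = np.array(dist[i], dtype=float)  # copies only one row into memory
    row[i] = np.inf                       # ignore the self-distance
    return int(np.nanargmin(row))         # skip any NaN entries
```

With the real file, `load_distance_matrix("Pf7_genetic_distance_matrix.npy", expected_n=20864)` would also confirm that the matrix has the sample count stated above.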

Short variant genotypes

Genotype calls on 10,145,661 SNPs and short indels in all 20,864 samples from 33 countries, available both as VCF (ftp://ngs.sanger.ac.uk/production/malaria/Resource/34/Pf7_vcf/) and zarr (ftp://ngs.sanger.ac.uk/production/malaria/Resource/34/Pf7.zarr.zip) files.

Release notes

A README file describes in detail every file included in the release and the format and interpretation of each column, and gives tips for accessing genotype data in VCF and zarr files.

NOTE: You may need to download a free FTP client to access the FTP links.
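
As an alternative to a dedicated FTP client, the files can in principle be fetched from a script. The sketch below uses Python's standard `urllib.request`, which handles `ftp://` URLs. The base directory is copied from the links on this page; the helper names `pf7_url` and `download` are hypothetical, and the download itself needs network access that permits FTP.

```python
import shutil
import urllib.request

# Base directory copied from the FTP links on this page.
PF7_BASE = "ftp://ngs.sanger.ac.uk/production/malaria/Resource/34/"


def pf7_url(filename):
    """Build the FTP URL for a file in the Pf7 resource directory."""
    return PF7_BASE + filename.lstrip("/")


def download(filename, dest=None):
    """Stream one file to disk and return the local path.

    Requires network access; large files may take a long time.
    """
    dest = dest or filename.rsplit("/", 1)[-1]
    with urllib.request.urlopen(pf7_url(filename)) as response:
        with open(dest, "wb") as out:
            shutil.copyfileobj(response, out)
    return dest
```

For example, `download("Pf7_crt_haplotypes.txt")` would save the crt haplotype file into the current directory under the same name.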

Access the data on Google Cloud

Open access

Our approach to sharing data

Data package contact

Citations

Publications using these data should acknowledge and cite the source of the data using the following format:

“This publication uses data from the MalariaGEN Plasmodium falciparum Community Project as described in ‘Pf7: an open dataset of Plasmodium falciparum genome variation in 20,000 worldwide samples’. MalariaGEN et al, Wellcome Open Research 2023, 8:22. DOI: https://doi.org/10.12688/wellcomeopenres.18681.1”