The initial study and data description are published in: Band G et al. (2013). Imputation-based meta-analysis of severe malaria in three African populations. PLoS Genet. 9:e1003509.
This data release contains three separate data packages of SNP genotype data for cases and controls from three populations: Gambia, Kenya and Malawi.
These data have been deposited in the European Genome-phenome Archive (EGA) under EGA Study Code EGAS00001000807.
- All cases have been diagnosed with malaria in a hospital.
- Controls were sampled from the general population and from new births.
- All samples are unrelated (but see Readme files for further details).
The information provided here is common to each of the three country datasets and where differences exist these are noted.
Data set structure
Each data package contains:
Files are all in the ‘oxford’ format suitable for use with SNPTEST v2: http://mathgen.stats.ox.ac.uk/genetics_software/snptest/snptest.html
For a more detailed description of the file formats used in SNPTEST, see http://www.stats.ox.ac.uk/~marchini/software/gwas/file_format.html
Samples
- Gambia_650Y_HM3_information.sample
- Kenya_HM3_information.sample
- Malawi_1M_HM3.sample
These are space-delimited files.
These files contain information on the samples. The table below gives an example of the data:
ID_1 | ID_2 | missing | sex | ethnicity | country | control | MALARIA | rs334 | PCA_1 | … | PCA_10 |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | D | D | D | B | B | D | C | C | C |
WTCCC130585 | WTCCC130585 | 0.076 | M | MANDINKA | GM | 0 | 1 | TT | 0.0481 | … | -0.00276 |
- 0 = N/A
- D = discrete variable
- B = binary variable
- C = continuous variable
Row 3 onwards are the sample data and they are listed in the same row order as in the other files in this dataset. Please do not change the sample sort order of this file or the other files.
Headers:
- ID_1 and ID_2 describe the sample_ids and are identical
- Missing: proportion of SNP data missing
- Sex: M = male, F = female, NA = missing
- Ethnicity: only ethnic information for the major ethnic groups is provided and all other groups have been pooled together and labelled as “OTHER”
- Country_code: GM, KE, MW (ISO code for Gambia, Kenya and Malawi respectively)
- Control: sample collected from the general population (0=NO, 1=YES)
- Malaria: sample collected from a patient with severe malaria (0=NO, 1=YES)
- rs334: probable HbS (rs334) genotype for each individual typed using the Sequenom iPLEX platform.
- For further details on this SNP see: http://www.ensembl.org/Homo_sapiens/Variation/Explore?r=11:5226502-5227502;v=rs334;vdb=variation;vf=328 and http://www.ncbi.nlm.nih.gov/SNP/snp_ref.cgi?rs=rs334
- The genotype data are provided with respect to the plus ‘+’ strand
- T: Major allele/ancestral allele/reference allele
- A: Minor allele/alternative allele/non-reference allele
- The genome position with respect to NCBI build 36 is 11:5204808
- Where we were unable to determine a genotype the data are represented by NA
- PCA_x: PCA_1 to PCA_10 are the first 10 principal components used in the above paper to control for population structure in the genome-wide association analysis (GWAS). Missing values are set to NA. Samples with missing PCs are those that were excluded from the GWAS analyses in the above paper; these samples also appear in the exclusion lists.
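As a rough illustration of the sample file layout described above, the following Python sketch reads a space-delimited .sample file, separates the column-type row from the data rows, and counts copies of the rs334 minor allele (A). The function names `read_sample_file` and `hbs_allele_count` are hypothetical, and the inline example is abbreviated to two PC columns; it is not taken from a real file.

```python
import io

def read_sample_file(handle):
    """Parse an Oxford-format .sample file into (column_types, rows).

    Row 1 is the header, row 2 holds the column types (0, D, B, C),
    and row 3 onwards are the samples, kept in their original order.
    """
    lines = [line.split() for line in handle if line.strip()]
    header, types, data = lines[0], lines[1], lines[2:]
    column_types = dict(zip(header, types))
    rows = [dict(zip(header, fields)) for fields in data]
    return column_types, rows

def hbs_allele_count(genotype):
    """Count copies of the rs334 minor allele (A); NA genotypes give None."""
    if genotype == "NA":
        return None
    return genotype.count("A")

# Abbreviated example modelled on the table above (assumed layout).
example = io.StringIO(
    "ID_1 ID_2 missing sex ethnicity country control MALARIA rs334 PCA_1 PCA_2\n"
    "0 0 0 D D D B B D C C\n"
    "WTCCC130585 WTCCC130585 0.076 M MANDINKA GM 0 1 TT 0.0481 -0.00276\n"
)
column_types, samples = read_sample_file(example)
cases = [s["ID_1"] for s in samples if s["MALARIA"] == "1"]
print(column_types["rs334"], cases, hbs_allele_count(samples[0]["rs334"]))
```

All values are kept as strings so that NA entries survive unchanged; convert numeric columns only after deciding how missing values should be handled.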
Excluded samples:
- Gambia_650Y_HM3_samples.excluded
- Kenya_HM3_samples.excluded
- Malawi_1M_HM3_samples.excluded
These files contain lists of samples that were excluded from final analysis due to various QC criteria including:
- pass_rate
- heterozygosity
Note that there is no header line or column data-type row at the beginning of these files.
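One way to apply an exclusion list without disturbing the sample order is to build a keep/drop mask rather than deleting rows. The sketch below assumes each line of the .excluded file starts with a sample ID (the exact layout is not documented here, so check the ReadMe files); the function names and the IDs S1 to S4 are made up for illustration.

```python
import io

def read_excluded_ids(handle):
    """Collect sample IDs from an exclusion list that has no header rows.

    Assumption: the sample ID is the first whitespace-separated token
    on each non-empty line.
    """
    return {line.split()[0] for line in handle if line.strip()}

def keep_mask(sample_ids, excluded):
    """Return one boolean per sample, preserving the original order."""
    return [sample_id not in excluded for sample_id in sample_ids]

# Hypothetical IDs, for illustration only.
excluded = read_excluded_ids(io.StringIO("S2\nS4\n"))
mask = keep_mask(["S1", "S2", "S3", "S4"], excluded)
print(mask)  # [True, False, True, False]
```

Because the mask has the same length and order as the sample file, it can be applied to the per-sample probability triplets in the genotype files without re-sorting anything.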
Genotypes
These files contain posterior probabilities of genotyping calls from imputation using IMPUTE2 into the HapMap 3 (release #2) reference panel obtained from the IMPUTE webpage (http://mathgen.stats.ox.ac.uk/impute/impute_v2.html#reference).
Coordinates refer to NCBI build 36.
A directory called ‘gen’ contains the SNP data by chromosome.
Files are provided per chromosome, with names of the form:
- Gambia_650Y_HM3_imputed.??.gen.gz
- Kenya_650Y_HM3_imputed.??.gen.gz
- Malawi_HM3_imputed.??.gen.gz
where ?? represents the zero-padded chromosome number.
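The naming pattern can be reproduced with a small helper. This is only a sketch: `gen_filename` is a hypothetical name, and it assumes the chromosome number is padded to two digits, as the ?? placeholder suggests.

```python
def gen_filename(prefix, chromosome):
    """Build a per-chromosome file name, e.g. <prefix>.01.gen.gz (assumed two-digit padding)."""
    return f"{prefix}.{chromosome:02d}.gen.gz"

print(gen_filename("Gambia_650Y_HM3_imputed", 1))   # Gambia_650Y_HM3_imputed.01.gen.gz
print(gen_filename("Gambia_650Y_HM3_imputed", 22))  # Gambia_650Y_HM3_imputed.22.gen.gz
```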
Each file contains the posterior probabilities of genotyping calls for HapMap3 SNPs.
The file format is as per SNPTEST v2:
SNP1 | rs1 | 1000 | A | C | 1 | 0 | 0 | 1 | 0 | 0 | … |
SNP2 | rs2 | 2000 | G | T | 1 | 0 | 0 | 0 | 1 | 0 | … |
SNP3 | rs3 | 3000 | C | T | 1 | 0 | 0 | 0 | 1 | 0 | … |
There are no header rows.
Columns correspond to:
- Column1: SNP name
- Column2: rs id
- Column3: chromosomal position
- Column4 and 5: SNP alleles from Illumina Manifests
- Column6 … ColumnN: contain the posterior probabilities of imputation. Each individual is represented by 3 values corresponding to the genotypes AA, AB and BB respectively.
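To make the column layout concrete, here is a minimal Python sketch that splits one genotype line into its five SNP fields and the per-sample probability triplets, then derives the expected count of the second allele (B) as P(AB) + 2·P(BB). It assumes the fields are whitespace-delimited in the actual files; `parse_gen_line` and `b_allele_dosage` are illustrative names, not part of SNPTEST.

```python
def parse_gen_line(line):
    """Split one .gen line into SNP metadata and per-sample probability triplets."""
    fields = line.split()
    snp_id, rsid, position, allele_a, allele_b = fields[:5]
    probs = [float(value) for value in fields[5:]]
    if len(probs) % 3 != 0:
        raise ValueError("expected three probabilities (AA, AB, BB) per sample")
    triplets = [tuple(probs[i:i + 3]) for i in range(0, len(probs), 3)]
    return {
        "snp_id": snp_id,
        "rsid": rsid,
        "position": int(position),
        "alleles": (allele_a, allele_b),
        "probabilities": triplets,
    }

def b_allele_dosage(triplet):
    """Expected number of B alleles given (P(AA), P(AB), P(BB))."""
    p_aa, p_ab, p_bb = triplet
    return p_ab + 2.0 * p_bb

# Second example row from the table above, truncated to two samples.
record = parse_gen_line("SNP2 rs2 2000 G T 1 0 0 0 1 0")
print([b_allele_dosage(t) for t in record["probabilities"]])  # [0.0, 1.0]
```

For the compressed per-chromosome files, the same function can be applied to each line read through `gzip.open(path, "rt")`; the i-th triplet belongs to the i-th data row of the sample file.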
Other files
A ReadMe file:
- ReadMe_Gambia_Illumina-Imputation_data_EGAS00001000807.txt
- ReadMe_Kenya_Illumina-Imputation_data_EGAS00001000807.txt
- ReadMe_Malawi_Illumina-Imputation_data_EGAS00001000807.txt
A directory called ‘bgen’:
- Gambia_650Y_HM3_imputed.bgen
- Kenya_650Y_HM3_imputed.bgen
- Malawi_HM3_imputed.bgen
Genotype probabilities of imputation output in bgen format (http://mathgen.stats.ox.ac.uk/genetics_software/snptest/snptest.html#input_file_formats)
A directory called ‘info’:
- Gambia_650Y_HM3_imputed.info.gz
- Kenya_650Y_HM3_imputed.info.gz
- Malawi_HM3_imputed.info.gz
These files provide imputation information.
Note: The .bgen file can be converted into other formats using qctool, e.g. from bgen to gen format. See the qctool manual for more details (http://www.well.ox.ac.uk/~gav/qctool/#tutorial). Example command:
./qctool -g Gambia_650Y_HM3_imputed.bgen -og Gambia_650Y_HM3_imputed.gen
Extra information specific to Kenya
Kenya_HM3_information.families
This file describes samples in trios or partial trios.
Example format:
family_ID | sample_ID | father_ID | mother_ID | family_member |
---|---|---|---|---|
Kenya_fam_02 | MLCP1_1M1300381 | MLCP1_1M1424842 | MLCP1_1M1424843 | Index |
Kenya_fam_02 | MLCP1_1M1424842 | NA | NA | Father |
Kenya_fam_02 | MLCP1_1M1424843 | NA | NA | Mother |
- family_id: id to group family members
- sample_id: unique individual_id that maps to the Kenya_HM3_information.sample file
- father_id: individual_id of the father that maps to the individual_id field
- mother_id: individual_id of the mother that maps to the individual_id field
- family_member: relationship of the individual to the family index member
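The family file can be grouped into trios along the following lines. This is a sketch under assumptions: the file is taken to be whitespace-delimited with a header row matching the example table, and `read_families` and `complete_trios` are hypothetical helper names.

```python
import io
from collections import defaultdict

def read_families(handle):
    """Group family-file records by family_ID (assumes a header row)."""
    rows = [line.split() for line in handle if line.strip()]
    header, data = rows[0], rows[1:]
    families = defaultdict(list)
    for fields in data:
        record = dict(zip(header, fields))
        families[record["family_ID"]].append(record)
    return dict(families)

def complete_trios(families):
    """Return (index, father, mother) IDs where both parents are listed as samples."""
    trios = []
    for members in families.values():
        ids = {m["sample_ID"] for m in members}
        for m in members:
            if (m["family_member"] == "Index"
                    and m["father_ID"] in ids and m["mother_ID"] in ids):
                trios.append((m["sample_ID"], m["father_ID"], m["mother_ID"]))
    return trios

# Rows from the example table above.
example = io.StringIO(
    "family_ID sample_ID father_ID mother_ID family_member\n"
    "Kenya_fam_02 MLCP1_1M1300381 MLCP1_1M1424842 MLCP1_1M1424843 Index\n"
    "Kenya_fam_02 MLCP1_1M1424842 NA NA Father\n"
    "Kenya_fam_02 MLCP1_1M1424843 NA NA Mother\n"
)
print(complete_trios(read_families(example)))
```

Partial trios, where a parent ID is NA or the parent has no row of its own, are simply skipped by `complete_trios`.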
Data sets
Gambia
EGA Data Study: EGAS00001000807
EGA Data Set ID: EGAD00010000572 (1,533 controls and 1,247 cases)
Method: Illumina 650Y with 1000G imputation
Kenya
EGA Data Study: EGAS00001000807
EGA Data Set ID: EGAD00010000570 (1,544 controls and 1,711 cases)
Method: Illumina 2.5M with 1000G imputation
Malawi
EGA Data Study: EGAS00001000807
EGA Data Set ID: Not yet available (2,239 controls and 1,451 cases)
Method: Illumina 1.2M with 1000G imputation
Release notes
Samples may also be included in other data releases
9 Oct 2015
Some of the samples included in this data release may also be present in other MalariaGEN data releases where different genotyping technologies or chip designs were used. The sample_ids provide the primary way to identify these samples between the different data releases.
Malawi data set
10 Oct 2015
Please note that this dataset has been prepared for release by MalariaGEN and will be released as soon as the relevant ethics committee confirms the range of acceptable research uses.
Citations
The initial study and data description are published in:
Band G et al. (2013). Imputation-based meta-analysis of severe malaria in three African populations. PLoS Genet. 9:e1003509.