This page contains information about the phase 1 preview data release from the Anopheles gambiae 1000 Genomes project. This data release comprises variant calls on 103 samples collected in Uganda.
Any use of Project data is subject to the Terms of Use.
Data sets
Downloads
This data release comprises variant call data, available as either VCF or HDF5 format files, and other supporting data files, including a table of sample metadata.
All of the data files included in this release can be downloaded from the Wellcome Trust Sanger Institute public FTP site.
The same data files are also available from Amazon S3. See the following URL for a list of file locations:
If you are downloading files, please use the Sanger FTP site where possible. The ag1000g-eu S3 bucket is hosted in the eu-west-1 region, and so is fastest and most cost-efficient when downloading data into AWS compute resources hosted in the same region.
NOTE: Many browsers no longer support links to FTP sites. If you have difficulty opening the FTP links, you may need to change your browser settings or use a dedicated FTP client.
Known issues
Incorrect data associated with missing calls (HDF5 only)
2 Jun 2014: In the HDF5 format files, where there is a missing genotype call, other call data fields (e.g., GQ, AD, DP) may have incorrect values due to a bug in the format conversion software. This applies only to missing genotype calls; for all other calls, the call data fields in the HDF5 format files are correct and correspond to the data in the VCF format files.
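One way to guard against this issue is to overwrite the affected call data fields wherever the genotype call is missing. The following is a minimal sketch, not part of the release: it assumes the call data have already been loaded into NumPy arrays, that a missing allele is coded as -1, and that the genotype array has shape (variants, samples, ploidy). The function name mask_missing_calls and its parameters are hypothetical, so check the coding used in the files you downloaded before relying on it.

```python
import numpy as np


def mask_missing_calls(genotype, fields, missing=-1, fill=-1):
    """Blank out call data fields wherever the genotype call is missing.

    genotype: integer array of shape (variants, samples, ploidy).
    fields:   dict mapping a field name (e.g. "GQ", "DP", "AD") to an array
              whose first two dimensions are (variants, samples).
    missing:  ASSUMED code for a missing allele; verify against your files.
    fill:     value written over the unreliable entries.
    """
    genotype = np.asarray(genotype)
    # Treat a call as missing if any of its alleles is missing (conservative).
    is_missing = np.any(genotype == missing, axis=-1)
    cleaned = {}
    for name, values in fields.items():
        values = np.array(values, copy=True)  # leave the caller's data intact
        values[is_missing] = fill             # also covers trailing dims (e.g. AD)
        cleaned[name] = values
    return is_missing, cleaned
```

The returned boolean mask can also be used on its own to exclude missing calls from downstream summaries instead of overwriting values.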
Missing filters on FS, MQ, QD and ReadPosRankSum
30 May 2014: Four of the FILTER annotations that are declared in the header of the VCF were not actually applied to the variants due to an error in the VCF processing pipeline. These FILTER annotations are:
##FILTER=<ID=FS,Description="FS > 60">
##FILTER=<ID=MQ,Description="MQ < 40">
##FILTER=<ID=QD,Description="QD < 5">
##FILTER=<ID=ReadPosRankSum,Description="ReadPosRankSum < -8">
If you use these data, we recommend that you apply these variant filters yourself prior to any analysis. If you use GATK to apply these filters, you must use JEXL expressions with the correct value type. These are all Float fields, so, for example, the correct expression for the FS filter is "FS > 60.0".
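As an alternative to GATK, the four filters can be applied directly to the INFO column of each VCF data line. The sketch below is illustrative only: the thresholds are copied from the FILTER header lines above, while the names FILTERS, parse_info, failed_filters and refilter_line are hypothetical. It also assumes that a variant lacking one of the annotations is simply not evaluated for that filter, which may differ from how other tools treat missing values.

```python
# Thresholds copied from the FILTER header lines of this release.
# Each predicate returns True when the variant FAILS the filter.
FILTERS = {
    "FS": lambda v: v > 60.0,
    "MQ": lambda v: v < 40.0,
    "QD": lambda v: v < 5.0,
    "ReadPosRankSum": lambda v: v < -8.0,
}


def parse_info(info):
    """Parse a VCF INFO column into a dict; flag entries map to True."""
    fields = {}
    for item in info.split(";"):
        if not item or item == ".":
            continue
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def failed_filters(info):
    """Return the IDs of the four filters that this INFO column fails."""
    fields = parse_info(info)
    failed = []
    for name, fails in FILTERS.items():
        raw = fields.get(name)
        if raw is None or raw is True or raw == ".":
            continue  # ASSUMPTION: skip the filter when the annotation is absent
        if fails(float(raw)):
            failed.append(name)
    return failed


def refilter_line(line):
    """Return a VCF line (without trailing newline) with the filters applied."""
    line = line.rstrip("\n")
    if line.startswith("#"):
        return line  # header lines pass through unchanged
    cols = line.split("\t")
    current = [] if cols[6] in (".", "PASS", "") else cols[6].split(";")
    for name in failed_filters(cols[7]):
        if name not in current:
            current.append(name)
    if current:
        cols[6] = ";".join(current)
    return "\t".join(cols)
```

A variant that already carries other FILTER values keeps them, and the newly failed filter IDs are appended. A variant that fails none of the four filters is returned with its FILTER column unchanged.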
Multiallelic filter
30 May 2014: This preview release is a subset of a larger callset which will be released in the near future. The Multiallelic filter was applied to the larger callset, and so some variants annotated in this preview release as Multiallelic will actually only have two segregating alleles.
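To check whether a variant annotated as Multiallelic really has more than two alleles among the samples in this preview release, the alleles observed in the genotype calls can be counted directly. This is a minimal sketch under stated assumptions: it reads one tab-delimited VCF data line, relies on GT being the first subfield of each sample column (as the VCF specification requires when GT is present), and uses the hypothetical function names observed_alleles and is_multiallelic_in_subset.

```python
import re


def observed_alleles(sample_columns):
    """Return the set of allele indices called across the given sample columns."""
    alleles = set()
    for column in sample_columns:
        genotype = column.split(":", 1)[0]      # GT is the first subfield
        for allele in re.split(r"[/|]", genotype):
            if allele != ".":                   # ignore missing alleles
                alleles.add(int(allele))
    return alleles


def is_multiallelic_in_subset(line):
    """True if more than two distinct alleles are called on this VCF data line."""
    cols = line.rstrip("\n").split("\t")
    return len(observed_alleles(cols[9:])) > 2
```

Variants for which this returns False but whose FILTER column contains Multiallelic are candidates for the situation described above.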
Open access
Archived
Data package contact
Citations
To cite these data directly, please use the following citation format:
The Anopheles gambiae 1000 Genomes Consortium (2014): Ag1000G phase 1 preview data release. MalariaGEN. http://www.malariagen.net/data_package/aag1000g-phase-1-preview-data-release/