Higher throughput, better accuracy, and lower costs of DNA sequencing technology have revolutionized the field of genetics. Building on these technological advances, the 1000 Genomes Project marks a new era of human genetics. The ambitious goal of this international project is to build a detailed map of human genetic variation by sequencing 2500 individuals from five major population groups. The first insights into the project's results became available upon completion of the pilot phase, which covered several hundred individuals (The 1000 Genomes Project Consortium 2010).
While sequencing costs drop, data management costs are rising. The tremendous amount of sequencing data from thousands of genomes, each over 3 billion DNA base pairs long, raises important challenges for storage and analysis. To tackle this, EBI developed a dedicated computing platform to manipulate and share large-scale data. Furthermore, although sequencing is becoming cheaper, obtaining the sequences of 2500 genomes remains a burden. The pilot project assessed two cost-containment strategies: low-coverage (4x) sequencing of the whole genome and high-coverage (50x) sequencing of exon-targeted regions (8140 exons were included).
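To get a feel for the scale, here is a back-of-the-envelope sketch of the raw sequencing yield implied by these numbers. It only multiplies individuals by coverage by target size; the function name `raw_bases` is made up for illustration, the 3 billion base pair genome length is the round figure used above, and read overlap, quality filtering, and file-format overhead are all ignored.

```python
GENOME_BP = 3_000_000_000  # approximate human genome length (round figure from the text)

def raw_bases(n_individuals, coverage, target_bp=GENOME_BP):
    """Expected number of sequenced bases: individuals x coverage x target size."""
    return n_individuals * coverage * target_bp

# Whole project at low coverage: 2500 individuals at 4x.
full_project = raw_bases(2500, 4)

# One deeply sequenced genome at 42x, the depth used for the trios in the pilot.
one_deep_genome = raw_bases(1, 42)

# How many 4x genomes fit into the sequencing effort of a single 42x genome.
genomes_per_deep_genome = one_deep_genome / raw_bases(1, 4)

print(f"2500 genomes at 4x: {full_project:.1e} bases")
print(f"one genome at 42x:  {one_deep_genome:.2e} bases")
print(f"4x genomes per 42x genome: {genomes_per_deep_genome}")
```

Under these assumptions, the sequencing effort spent on one 42x genome covers roughly ten individuals at 4x, which is the trade-off behind the low-coverage strategy.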
According to the pilot study, the low-coverage whole-genome sequencing approach performs reasonably well. Sequencing many individuals increases the power to detect variants of different frequencies in the population. The number and accuracy of called genotypes are comparable to those obtained at 15x coverage of exon-enriched samples. Furthermore, the pilot study included whole-genome sequencing at 42x of two mother-father-child trios. This made it possible to estimate the accuracy and completeness of low-coverage samples. The analysis of trio data subsampled to 4x retrieved about 90% of SNP variants and genotypes. The main issue with the low-coverage approach is missing data. The pilot study overcomes this limitation with imputation methods that infer missing data from known data for other individuals.
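A toy calculation illustrates why 4x coverage leaves gaps in a single genome yet works across many individuals. This is a deliberately simplified sketch, not the consortium's calling pipeline: it assumes read depth at a site follows a Poisson distribution, that half of the reads at a heterozygous site carry the variant allele, and that a site counts as detected when at least two such reads are seen in one individual. The function names and the two-read threshold are assumptions made for illustration.

```python
import math

def poisson_cdf(k, mean):
    """P(X <= k) for X ~ Poisson(mean)."""
    return sum(math.exp(-mean) * mean**i / math.factorial(i) for i in range(k + 1))

def p_het_detected(coverage, min_alt_reads=2):
    """Chance that a heterozygous site shows at least min_alt_reads variant reads.

    Variant-carrying reads are modelled as Poisson(coverage / 2).
    """
    return 1.0 - poisson_cdf(min_alt_reads - 1, coverage / 2)

def p_variant_seen(coverage, n_carriers, min_alt_reads=2):
    """Chance that at least one of n_carriers heterozygous carriers passes the threshold."""
    miss = 1.0 - p_het_detected(coverage, min_alt_reads)
    return 1.0 - miss**n_carriers

for cov in (4, 15, 42):
    print(f"{cov}x, single carrier: {p_het_detected(cov):.3f}")

for carriers in (1, 2, 5, 10):
    print(f"4x, {carriers} carriers: {p_variant_seen(4, carriers):.3f}")
```

In this model a single carrier at 4x is missed fairly often, but the chance of missing a variant altogether shrinks quickly as more carriers are sequenced. Rare variants carried by one or two individuals remain the hard case, which is where imputation and deeper sequencing help.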
The pilot studies alone reveal an incredible amount of variation in the human genome. An individual genome contains on average about 375 loss-of-function variants and tens of thousands of mutations in coding regions, split about equally between those that change the encoded amino acid and those that do not. As expected, most high-frequency variants found in the pilot study were already present in public databases. In addition, the study reports about 8 million novel variants. The authors explain the excess of lower-frequency variants in the exon data by purifying selection, taking a neutral coalescent model with constant population size as the baseline. This interpretation is not optimal, because population growth, which the model does not take into account, produces a similar signature. Most of the novel variants were found in populations with African ancestry, which is not surprising, as most human diversity lies in African populations. Better resolution for African populations would therefore be advantageous for analyses.
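To make the "excess of lower-frequency variants" concrete: under the standard neutral model with constant population size, the expected number of sites at which the derived allele appears i times in a sample of n chromosomes is proportional to 1/i. The sketch below computes that expectation and compares it with an observed frequency spectrum. The function names are illustrative and the observed counts are invented for the example; they are not data from the paper.

```python
def neutral_sfs_proportions(n):
    """Expected share of segregating sites with derived-allele count i = 1..n-1.

    Standard neutral model, constant population size: share is proportional to 1/i.
    """
    weights = [1.0 / i for i in range(1, n)]
    total = sum(weights)
    return [w / total for w in weights]

def singleton_excess(observed_counts):
    """Observed minus expected share of singletons.

    observed_counts[i - 1] is the number of sites whose derived allele appears i times.
    """
    n = len(observed_counts) + 1
    expected = neutral_sfs_proportions(n)[0]
    observed = observed_counts[0] / sum(observed_counts)
    return observed - expected

# Made-up spectrum for a sample of 5 chromosomes (derived counts 1 to 4).
hypothetical_counts = [60, 20, 12, 8]

print("expected shares:", [round(p, 3) for p in neutral_sfs_proportions(5)])
print("singleton excess:", round(singleton_excess(hypothetical_counts), 3))
```

A positive excess means more rare variants than the constant-size neutral model predicts. Both purifying selection and recent population growth push the spectrum in that direction, so this statistic alone cannot tell the two explanations apart.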
It is often said that a genome project is never finished. This applies not only to bridging gaps in the sequence, but also to the difficulty of finding the right reference genome for many differing individual genomes. The 1000 Genomes Project Consortium reports a brand-new piece of the genome spanning 3.7 million DNA base pairs. This fragment was found in great ape and other human sequences available in public databases.
To conclude, I believe that the 1000 Genomes initiative is a major breakthrough in human medical genetics. Open access to this tremendous amount of variation data will foster genome-wide association studies. In addition, such data are an important contribution to studies of human evolution. I look forward to 2012, when full-scale results are expected.
Durbin, R., et al. (2010). A map of human genome variation from population-scale sequencing. Nature, 467(7319), 1061-1073. DOI: 10.1038/nature09534