Reconstructing prehistoric African population structure


The highest genetic diversity in humans is found in Africa, in line with Africa being the cradle of humanity. While the three articles we discussed previously during this tutorial (1,2,3) mainly focused on determining the most parsimonious “out-of-Africa” scenarios based on genetic diversity data, this article (Skoglund et al. 2017 4) investigates the population structure of Africa prior to the expansion of food producers (i.e. herders and farmers). In order to reconstruct the prehistoric population structure, the authors analyzed the genomes from 16 ancient African individuals who lived up to 8100 years ago (including 15 newly sequenced genomes), as well as SNP genotypes from 584 present-day Africans, and 300 high coverage genomes from 142 worldwide populations. This is the first study to gather and analyze such a high number of ancient genomes, thereby providing an unpreceded insight into the prehistoric human population structure.


An ancient cline of southern and eastern African hunter-gatherers

The authors used principal component analysis (PCA) and automated clustering in order to relate the 16 ancient individuals to present-day sub-Saharan Africans. This reveals that while the two ancient South African individuals share ancestry with present-day South Africans (Khoe-San), 11 of the 12 ancient individuals living in eastern and south-central Africa between ≈8100 and ≈400 years BP form a gradient of relatedness to the eastern African Hadza on one extremity and to Khoe-San on the other. This genetic cline is also correlated with geography along a North-South axis. Another pattern which emerged from this analysis is the lack of heterogeneity between the seven ancient individuals from Malawi, indicating a long-standing and distinctive population in ancient Malawi which persisted for at least 5000 years but which is extinct today.

Subsequently, the authors built a model where ancient and present-day African population trace their ancestry to a putative set of nine ancestral populations. They then used data from both ancient and present-day populations showing substantial ancestry to major lineages present in Africa today as proxies for these ancestral populations. These proxy populations consisted of three ancient Near Eastern populations representative of Anatolia, the Levant and Irak, respectively, and six African populations representative of different components of ancestry (western African, southern African before agriculture, northeastern African before agriculture, central African rainforest hunter-gatherer, eastern African early pastoralist context and distinctive ancestry found in Nilotic speakers today). By using qpAdm (a generalization of f4 symmetry statistics), they tested for 1-, 2- or 3-source models and admixture proportions for all other ancient and present-day African populations, with a set of 10 non-African populations as outgroups. We note that the f4 statistics are poorly explained in this article, making it hard for a non-initiated reader to grasp its meaning and the relevance of the results. The main finding from this analysis is that ancestry closely related to the ancient southern Africans was present much farther north and east in the past than is apparent today.

Displacement of forager populations in eastern Africa

Unsupervised clustering and formal ancestry estimation both indicate that present-day Hadza in Tanzania can be modeled as deriving all their ancestry from a lineage related to ancient eastern Africans such as Ethiopia_4500BP. However the contribution of this lineage to present-day Bantu speakers in eastern Africans is small, who instead trace their ancestry to a lineage related to present-day western Africans and additional ancestry components. In present-day Malawians, population replacement by incoming food producers seems to have been almost complete as witnessed by a near absence of ancestry from the ancient individuals sampled, and by most of their ancestry coming from the Bantu expansion of western African origin.

Importantly, of all ancient individuals analyzed, only a 600 BP individual from Zanzibar has a genetic profile similar to present-day Bantu speakers, with even more western African ancestry. Using linkage disequilibrium, the authors estimate that the admixture between western- and eastern-African-related lineages occurred 800-400 years ago. This indicates that there was genetic isolation between early farmers and previously established foragers during the Bantu expansion into eastern Africa, and that this barrier disappeared over time as mixture occurred. However this delayed admixture did not occur in all African populations, as shown in present-day Malawians who display no signs of admixture from previously established hunter-gatherers.

Early Levantine farmer-related admixture in a ≈3100-year-old pastoralist from Tanzania

The authors compared estimated the ancestry component from a 3100 BP individual from Tanzania and found that 38% of her ancestry was related to the pre-pottery farmers of the Levant (10000 BP), indicating a critical contribution of Levant-Neolithic-related populations to present-day eastern Africans. The best fitting ancestry component model in Somali indicates that they have ancestry from the 3100 BP Tanzanian individual but also Dinka-related ancestry as well as 16% ancestry related to Iranian-Neolithic-related ancestry. This suggests that ancestry related to the Iranian Neolithic appeared in eastern Africa after an earlier gene flow related to Levant Neolithic populations.

Direct evidence of migration bringing pastoralism to eastern and southern Africa

All three ancient southern Africans show affinities to the ancestry predominant in present-day Tuu speakers in the southern Kalahari. Among them, the 1200 BP sample from western Cape found in a pastoralist context has a similar ancestry composition as present-day pastoralists like the Nama, with affinity to three groups: Khoe-San, western Eurasians and eastern Africans. This is in line with the hypothesis of a non-Bantu-related population transporting eastern African and Levantine ancestry to southern Africa by at least 1200 BP. Using their model to determine the proportions of different ancestries present in western cape 1200 BP, they find mainly a mixture of non-southern African population. This is consistent with the hypothesis that the Savanna Pastoral Neolithic archaeological tradition in eastern Africa is a possible source for the spread of herding to southern Africa.

The earliest divergences among modern human populations

Previous studies indicate that the primary ancestry in the San population (southern Africa) comes from a lineage that separated from all other lineages present in modern humans, before separation of the different modern human lineages. While Skoglund et al. obtain a similar model in absence of admixture, the tree-like representation is a poor fit since ancient southern Africans (2000 BP) were not strictly an outgroup of all other African populations and several examples also show inconsistencies with this model. In order to find models that fit the data, the authors performed admixture graph modeling of the allele frequency correlations and found two parsimonious models. In the first one, present-day western Africans have ancestry from a basal African lineage that contributed more to the Mende than in did to the Yoruba, with the other source of western African ancestry being related to eastern Africans and non-Africans. In the second model, gene flow over long periods of time and over long distances has connected southern and eastern Africa to other groups in western Africa.

A selective sweep targeting a taste receptor locus in southern Africa

The authors then searched for the genomic signature of natural selection in ancient genomes, by searching for regions of greater allele frequency differentiation between ancient and present-day populations than predicted by the genome-wide background. To do this, the researchers compared the two ancient southern African genomes (2000 BP) to six present-day San genomes with minimal recent mixture. Since the small number of ancient genomes does not allow to infer changing allele frequencies at single loci, a scan for high allele frequency differentiation was conducted in 500 kb windows using 10kb steps. This led to the identification of the most differentiated locus which overlapped a cluster of eight taste-receptor genes. Although it is reported that taste receptors have already been identified as targets of natural selection as they affect the ability to detect poisonous compounds in plants, we must be wary that any analysis is bound to find something with such huge datasets, and that the biological interpretation of such finding may not be as straight-forward.

Polygenic adaptation

Skoglund et al. tested for evidence of selection on specific functional gene categories between present-day San and the two ancient genomes from southern Africa using allele frequency differentiation estimation. The functional category with the most extreme allele frequency differentiation between present-day San and the ancient southern Africans corresponded to “response to radiation”. In order to control that this was not a general inflated allele frequency differentiation, the same statistics were used using the Mbuti central African rainforest hunter-gatherer for which no enrichment for “response to radiation” was found. Instead, the top category for Mbutis was “response to growth”. Based on this, the authors speculate that the small stature of hunter-gatherer populations may be an acquired adaptation.



This study brings a first and unique view on the genetic makeup of prehistoric Africans. It is indeed a feat realized by 44 authors from institutions in 11 countries, which take advantage of 15 newly sequenced ancient genomes in addition to the only one that was previously available. The results indicate that an ancient lineage related to the San had a wider distribution in the past, depict two plausible scenarios of gene flow that led to the earliest divergences among modern populations and give new insights into the spread of herding and farming within Africa . As a side note, we noticed that all ancient individuals come from eastern or southern Africa, probably because this is where conditions were most favorable for the conservation of these ancient remains, although this could also introduce some biases, it seems to be the only possible way to go.


  1. Pagani et al 2016 Genomic analyses inform on migration events during the peopling of Eurasia. Nature 538: 238–242 (corresponding blog post)
  2. Mallick et al. 2016 The Simons Genome Diversity Project: 300 genomes from 142 diverse populations. Nature 538: 201–206 (corresponding blog post)
  3. Malaspinas et al 2016 A genomic history of Aboriginal Australia. Nature 538: 207–214 (corresponding blog post)
  4. Skoglund et al 2017 (and references therein) Reconstructing Prehistoric African Population Structure. Cell 171: 59–71.e21
Posted in genomics, human | Leave a comment

A genomic history of Aboriginal Australia;

Blogpost on:

Malaspinas et al 2016 A genomic history of Aboriginal Australia. Nature 538: 207–214.


Prior to the publication of Malaspinas et al. 2016, investigation of Aboriginal Australian genome sequences had been quite limited. In fact, only 3 whole genome sequences from Aboriginals had been analyzed, 2 of these obtained with limited information concerning their place of origin (Rasmussen et al. 2011).

Malaspinas et al. 2016 is the first comprehensive study aimed at uncovering how the settlement of Australia occurred. The study combines genomic, linguistic and archeological studies in order to obtain more detailed information on how the settlement of Australia occurred.

For the largest part of the past 100000 years, Tasmania, New Guinea and Australia were part of the same continent known as Sahul. This continent was detached from mainland Asia, and its settlement process by human populations still remains poorly understood.

Previous archaeological evidence has led to the hypothesis that the settlement of Australia occurred from an African emigration wave, which predates the African emigration wave that settled in Eurasia (Lahr, M. et al. 1994). This has been coined the 2 Out of Africa event hypothesis (2OoA). Yet, other genetic studies support the notion that one major migration out of Africa (OoA) followed by 1 or 2 independent migratory waves led to the settlement of the modern Eurasian and Oceanic continents respectively.

The authors find that the data collected in the study more closely fits a model of single out of Africa dispersal (OoA), followed by divergence of Eurasians from Australo-Papuans. Finally the divergence of Aboriginals and Papuans from their common ancestral population ensued between 25000 and 40000 years ago.


The study is based on 108 newly-sequenced Aboriginal and Papuan genomes (83 Aboriginals and 25 Papuans) and genotype data for 45 additional Papuans. Moreover, SNP genotype data on Aboriginal Australians from Arnhem Land and from the European Collection of Cell Cultures Panel defined in previous studies was taken advantage of for admixture studies.

Colonization of Sahul

The authors use sparse non-negative matrix factorization (sNMF) on the combined datasets in order to determine the genomic ancestry proportions of Papuans and Aboriginals (Frichot, E. et al. 2014). The authors find that Aboriginals are mainly a mix of European, East Asian, New Guinean and Aboriginal ancestry. The most significantly contributing ancestry proportions stemming from Europeans and from Aboriginal ancestry. As expected, individuals from the Australian coastline displayed higher proportion of European ancestry compared to individuals from the desertic Australian inland.

Papuans instead display a majority of genomic ancestry stemming from New Guineans and East Asians. The proportion of New/Guinean ancestry in Aboriginal Australians is related to the distance from Papua, with Northeastern Australians containing a significantly higher proportion of Papuan ancestry compared to Southwestern Australians (Fig. 2a).

Based on f3 statistics, multidimensional scaling analyses (MDS) and genomic ancestry proportion inference, the authors show that Australians and New-Guineans are more similar to each other than to the other populations analyzed in the study (Fig. 2b,c) This favors the hypothesis that they share a common ancestral population which settled the continent of Sahul.

For the subsequent analyses the authors mask data stemming from non-Aboriginal ancestry or select samples based on their Aboriginal Ancestry. Specifically, the authors filter the information from the ancestry proportions and maintain only loci in which both loci show Aboriginal ancestry (Suppl Inf S06).

In order to shed some light on whether the settlement of Australia proceed through one or 2 separate founding waves, the authors use a simulation-based framework initially presented in (Excoffier et al. 2013). Specifically, this composite likelihood method compares the observed joint site frequency spectrum (SFS) to the expected one, allowing inference of coalescence based on SNPs (Excoffier et al. 2013).

The SFS approximation and the MDS analysis results both suggests that, a one wave founding model followed by divergence of a common ancestor into Papuan and Aboriginal populations fits the data more closely (Fig. 2a,b, Fig.3).

Fig.1 (Malaspinas et al. 2016) Describes locations for analyzed Australo-Papuans datasets

Fig.2 (Malaspinas et al. 2016) Australian Aboriginal Ancestry. A)Analysis of admixture in Australo-Papuans by sNMF. B-C) MDS analysis and f3-statistics to assess relationships within the Aboriginal population and between Australo-Papuans




Archaic Admixture

Next, the authors focus on characterizing the extent of archaic (Neanderthal and Denisovan) admixture contributing to the Australian and Papuan genomes. They do so based on the previously described SFS modeling-based approach, a D-statistics based on goodness-of-fit analysis (Green et al. 2010) and a putative archaic haplotype derivation method (Suppl Inf. Section 10).

D-statistics test was initially used in genetics (Green et al. 2010) in order to determine the extent of admixture between 3 populations. It compares which pair in a trio of tested populations is more closely related based on SNPs. The archaic haplotype derivation method used instead, is based on enhanced D-statistics (Meyer, M. et al. 2012) and linkage disequilibrium approaches (Wall, JD. et al. 2013)

Based on the 3 approaches the authors report that Aboriginal and Papuan genomes display an accumulation of Denisovan introgressed genes compared to non-Africans, and have the highest proportion of putatively Denisovan derived haplotypes compared to non-Africans. Additionally, they show that the estimated number of Denisovan derived haplotypes correlates with the proportion of Australo-Papuan ancestry across individuals (Ext. Fig. 3a,b,c). In summary the evidence indicates that Denisovan admixture predates the split of Australo-Papuans and the widespread Eurasian admixture into Aboriginal Australians.

Ext. Fig 3 (Malaspinas et al. 2016) Archaic (Denisovan and Neanderthal) genome introgressed haplotypes. A) Analysis of putative introgressed Denisovan and Neanderthal sites in European, East Asian, Australo-Papuan and South American Populations. B-C) Analysis of estimated archaic haplotype number in world populations. D-E) metrics of archaic derived haplotypes in populations of study.


Out of Africa

In order to determine whether the OoA or 2OoA wave scenario is more likely, the authors peform D-statistic on the following trios: Aboriginals and Eurasians compared to Africans VS Aboriginals and Eurasians compared to Ust’-Ishim (proxy for modern human from Asian (Fu, Q. et al. 2014)).

The authors find that if not accounting for Denisovan admixture, Africans and Ust’-Ishim are closer to Eurasians than to Aboriginal Australians, supporting a 2OoA model. Yet when accounting for the previously identified Denisovan admixture events, the test results indicate that Aboriginals and Eurasians are equally related to Ust’-Ishim favouring a OoA wave model. The same is seen when taking into account Denisovan admixture and considering populations across the whole world (Ext. Data Fig. 4a,b).

Implementation of the SFS-analysis and accounting for moderate Denisovan admixture also shows a more accurate fit of the data to the OoA model. The most accurately fitting model of the SFS-analysis shows that, first, Australo-Papuan divergence from Eurasians most likely occurred about 58000 years ago, and European divergence from East Asians occurred about 42000 years ago (Fig. 4).

Multiple Sequential Markovian Coalescence analysis also supports a model in which Australo-Papuans and Eurasians split from one ancestral population (Ext. Fig. 4, Ext. Fig. 6).


Fig. 3 (Malaspinas et al. 2016) SFS based modelling approximation of Australia settlement founding waves







Fig. 4 (Malaspinas et al. 2016) SFS based modelling approximation of most likely OoA migration


Genetic Structure of Aboriginal Australians

Subsequent investigation of mitochondrial DNA (mtDNA) and Y chromosome between-group variation shows that male-mediated migration was a driving factor in the substructuring of Aboriginals. MDS analyses performed on Aboriginals masked and non-masked for non-Aboriginal ancestry and geographic location of samples (Ext. Fig. 7a,b)suggests a population separation between Southwestern and Northeastern groups. This is in line with the model proposed by the SFS analysis of Australian continent settlement.

Usage of an ulterior modelling approach based on a three layer neural network (Bishop, CM. 1996,; Heaton, J. 2011), reveals that a majority of the gene flow took place along the Australian coasts (Ext. Fig. 7e-g). This result is consistent with the hypothesis that desertic internal Australian regions formed a natural barrier to gene flow.

Implementation of Bayesian statistics approaches for European, East Asian and Papuan admixture among Aboriginals reveals that Papuan admixture predated both East Asian and European admixture (Ext. Fig. 8a). Additionally, local ancestry inference based on tract length also underlines that Papuan gene flow into Aboriginals occurred before European and East Asian gene influx.

Ext. Fig. 4 A)D-statistics based proposed model for African emigration models. B) Sum of squared errors for the possible odels predicted by D-statistics. C)MSMC derived crosscoalescence rates determined by analysis of pairs of individuals. D) Assessment of archaic admixture on cross coalescence presented in C by modelling


Pama-Nyungan languages and genetic structure

Next, the study investigates how closely linguistic demographics could reflect genetic relationships by comparing phylogenetic trees obtained based on linguistics and trees based on Fixation index masking for Eurasian tracts (Ext. Fig. 7). Analysis of the common patterns identified in the two trees by distance-matrices and correlation analysis of linguistics and genetics reveal an initial divergence of populations beginning about 30000 years ago, followed by population size changes and highly reduced gene flow from Northeastern to Southwestern Australia, due in great part to the desertic geographic barrier.


Ext. fig. 7 A-B) MDS analysis performed when including all genome sequences (A) and when masking non-boriginal variants. C-D) Comparison of phylogenetic trees computed based on genetic and linguistic analysis respectively.








Selection in Aboriginal Australians

Finally, the authors perform scanning analyses to identify Aboriginal genomic regions which diverge highly in allele frequency since the split from Papuans (about 10000-30000 years ago), followed by identification of genomic regions which diverge highly based on geo-ecological location. Based on this analysis the authors identify two mutations which may have played a role in Aboriginal adaptation to the arid and desert Australian interior.


The authors analyses supports a model by which a OoA movement, followed by the split of an ancestral population which first colonized Australia and predated the physical separation of Sahul into Mainland Australia from the Papuan/New Guinea islands.



Rasmussen, M. et al. An Aboriginal Australian genome reveals separate human dispersals into Asia. Science 334, 94–98 (2011)

Davidson, I. The colonization of Australia and its adjacent islands and the evolution of modern cognition. Curr. Anthropol. 51, S177–S189 (2010)

Lahr, M. M. & Foley, R. Multiple dispersals and modern human origins. Evol. Anthropol. Issues News Rev . 3, 48–60 (1994)

Frichot, E., Mathieu, F., Trouillon, T., Bouchard, G. & François, O. Fast and efficient estimation of individual ancestry coefficients. Genetics 196, 973–983 (2014)

Excoffier, L. Dupanloup, I. Huerta-Sanchez, E. Sousa, VC. Foll ,M. et al. Robust Demographic Inference from Genomic and SNP data. Plos Genetics 10, 1-17(2013)

Green RE, Krause J, Briggs AW, et al. A draft sequence of the Neandertal genome. Science. (56 co-authors). 2010;328(5979):710–722.

Meyer M, et al. A high-coverage genome sequence from an archaic denisovan individual. Science. 2012;338:222–226. doi: 10.1126/science.1224344


Wall JD, et al. Higher levels of Neanderthal ancestry in East Asians than in Europeans. Genetics. 2013;194:199–209


Fu Q, Li H, Moorjani P, Jay F, Slepchenko SM, Bondarev AA, Johnson PL, Aximu-Petri A, Prüfer K, de Filippo C, et al. 2014. Genome sequence of a 45,000-year-old modern human from western Siberia. Nature 514: 445–449

Bishop, CM 1996. Neural Networks For Pattern Recognition. 1 edition. Oxford New York: Clarendon Press

Heaton, J. 2011: Programming neural networks with Encog3 in Java. 2 edition. Heaton Research, Inc.

Posted in Uncategorized | Leave a comment

A journey through the The Simons Genome Diversity Project: more genomes sequenced, more diverse populations


Since the first genome of Bacteriophage MS21 was completely sequenced, in 1976, until 2001 when the first draft of human genome2 was released, a lot of work was done to improve and to make accessible different methods to get inside of the genetics of various organisms. For human genome, this step was a very important one and the Human Genome Project was declared complete in 20033. During the last years, more and more projects are involved in deciphering the human wanderlust. To all of previous studies, we can add The Simons Genome Diversity Project, that brought us more information by sequencing 300 new genomes from 142 diverse populations. One of the aim was to chose populations that differ in genetics, language and culture. The study shows that some of the populations separated 100000 years ago and reveals more information about the ancestors of Australian, New Guinean and Andamanese people.


One of the most important thing in discovering the real human peopling of the Earth is to sequence as many as possible genomes, but from individuals coming from diverse populations, that could differ in many aspects. In this study, the 300 samples were prepared by using PCR-free library, through Illumina Ltd. method and the median coverage it was 42-fold (Figure S1.1; Supplementary Data Table 1). The method is using an improved genome coverage to identify the greatest number of variants with some of them previously reported. The single-sample genotypes was made by using the reference-bias free modification of GATK, but the some preprocessing was conducted for eliminating some adapter sequences. For increasing the data accuracy, it was used a filtering system, highly specific to the SGDP dataset. The levels are from 0 to 9 for each sample as a single character and the first level is the best for having a good balance between sensitivity and low error rate, but level 9 is good to be used when there is needed to low the errors rates (Figure S2.1).

The first part of the study is offering us more information about the time needed for the worldwide populations subjected to the study to get separated. The pairwise sequential Markovian coalescent (PSMC) and multiple sequentially Markovian coalescent (MSMC) was used to interpret the changes in size of the populations and the split time, the phased haplotypes of split time estimation were made by using the SHAPEIT and IMPUTE2. The filter used was the level 1. From the Figure 2a we can see evidence about the ancestors of some present populations that were isolated by at least 100kya, that could be an obstacle of certain mutations across the ancestors of all populations. The gene flow continued until around 50kya among the great majority of ancestral populations. The graphs show the moments when the substructure of different populations starts: in the Figure 2a, we can see that the substructure between french and africans start around 200 kya. In the next ones there is a comparison between only africans (the Yoruba separated from KhoeSan 87kya, from Mbuti 56kya and from the Dinka 19kya) or only non-Africans (the oldest substructure is from 50kya, taking part during or shortly after the deepest part of the shared non-African bottleneck 40-60kya). For the Figure 2d-f, it was used the PSMC and PS1 that show the effective population sizes inferred and the cross-coalescence rates inferred.

By using the neighbours-joining tree (pairwise divergence per nucleotide) and FST, Mallick et. al could reconfirm the previous studies regarding the fact that the deepest splits happened among the Africans. Previous studies showed that all non-Africans today possess Neanderthal ancestry and Figure 1c shows that the higher proportion of Neanderthal ancestry we can find it in East Asians. If we compare the EuroAsians between them, the South Asians have highest Denisovan ancestry (heatmap from Figure 1d). Another result is that there are more Denisovan ancestry in eastern than in western EuroAsians. If we take Australia, New Guinea or Oceania we can see that the results from other studies are confirmed by having more ancestry than in mainland Eurasians. In the Figure 3 the deeper the split is, the more divergent is the early dispersal ancestry. By using the cross-population coalescence pattern and allele frequency correlations, the best model is that the Australian, New-Guinean and Andamanese history doesn’t involve ancestry from an early- diverging source. In this study there is no archeological data taken in consideration regarding southern Asia or Australia. So, by using only the data from this study, it is released that the Australians, New Guineans and Andamanese are lacking in an analogous deep ancestry component. All the data referring to Australians seems to be consistent with descending
from a common homogeneous population since separation from New Guineans. Also, New Guineans, Australians and Andamanese appear as part of an eastern clade together with mainland EastAsians.

The 3P-CLR was used to scan the genome for positive selection. In the end, 38 of the largest peaks emerged for selection in the common ancestors of all modern humans. These peaks are the sweeps at the time that the archeological data shows an accelerated evidence of behavioral modernity. This data does not search for the sweeps on chromosome X or in repetitive or difficult-to-analyze sections of the genome.

For the rate of mutation accumulation between the non-Africans (grouped in America, CentralAsiaSiberia, EastAsia, WestEurasia, Oceania) and sub-Saharan Africans (grouped in Pygmy, Khoesan and Africa) it was supposed to be quite equal, but this study revealed an significant average of 0,5% difference. For this part, they used a highly restriction to the samples, by choosing only the samples processed in the same way and the highest level of filtering, pooling the samples from the same regions together. The one strength of this experiment is the fact that they avoid the bias due to different heterozygosity level in different populations (the heterozygosity is higher in Africans), by using only the chromosome X for males. Although, they map everything to chimpanzee, which is equally distant to all present populations. There are differences in observations related to other studies, by having a different rate of CCT>CTT mutation, that is close to Africans in Europeans, but not in East Asians. This could be explained by the decrease in generation interval in non-Africans since separation. Previous studies5 showed a higher X-to-autosome heterozygosity ratio in sub-Saharan Africans than in non-Africans. Mallick et al. confirmed this results by adding more populations to be analyzed: Khoesan for sud-Saharan Africans and New Guineans, Australians, Native Americans, Near Easternes and indigenous Siberians for the non-Africans. The only one exception, that showed a lower X-to-autosome heterozygosity ratio in sud-Saharan African than in non-Africans is in Pygmies (eastern Mbuti and western Biaka). In the Figure 1b through a scatterplot we can observe the two primary clusters: sud-Saharan Africans and all other populations, but without a big difference among the groups, except of the Pygmies with a high autosomal heterozygosity. If we compare the two Pygmies populations with a lower X-to-autosome ratio, we can see that the Mbuti are closer to non-Africans than to Africans, even if in the Neighbour-joining tree based on pairwise divergence, they are integrated to the Africans. The reduction of the X-to-autosome ratio in the non-African compared to African populations could be explained by the repeated waves of male mixture in already mixed population, but in the Pygmy populations, the strongest argument is the sex-biased gene flow supported by the anthropological data.

In the last part, Mallick et al. shows that the non-Africans are presenting a higher accumulation of mutations. This can be explained in two ways: the rate of mutations in non-Africans is increasing by acceleration of it or by a deceleration within Africans. The Extended Data Table 1, shows that none of the populations with strong signals of non-Africans could be in fact a deceleration of Africans. The acceleration in non-Africans could be caused by many possibilities: the life history traits (eg. generation interval) could change after the dispersal of modern humans outside of Africa, increasing the latitudes conquered by the humans or the colder climates, the gene conversion (GC to A or T alleles) was more effective in Africans or a Neanderthal admixture into the ancestors of non-Africans, that could accumulate more mutations than in the modern humans after separation (but there are not clear evidence about this fact).


The Simons Genome Diversity Project is bringing more information by studying 300 new genomes, from 142 diverse populations, that shows an acceleration of accumulation of mutations in non-Africans compared to Africans. Also, the Pygmies seem to be the only African group with a low X-to-Autosome diversity ratio. Regarding the ancestors, the highest proportion of Neanderthal it was present in EastAsians and an excess of Denisovan in some SouthAsians compared to other Euroasians.


  1. Min Jou, W., Haegeman, G., Ysebaert, M., Fiers, W., Nucleotide Sequence of the Gene Coding for Bacteriophage MS2 Coat Protein, Nat., 237, 5350, pp. 82-88, 1972
  3. International HapMap Consortium. The International HapMap Project. Nat., 426,789–796, 2003.
  4. Keinan, A., Mullikin, J. C., Patterson, N. & Reich, D., Accelerated genetic drift on chromosome X during the human dispersal out of Africa, Nat Genet 41, 66­‐70, doi:10.1038/ng.303, 2009.
  5. Mallick, S., Li, H., Lipson, M., Mathieson, I., Gymrek, M., Racimo, F., Zhao, M., Chennagiri, N., Nordenfelt, S., Tandon, A. et al., The Simons Genome Diversity Project: 300 genomes from 142 diverse populations, Nat., 538: 201–206, 2016.
Posted in genomics, human | Tagged , , | Leave a comment

Genomic analyses inform on migration events during the peopling of Eurasia


In the past two decades, considerable research effort has been made to sequence the human genome and subsequently trying to unveil the demographic history underlying the genetic patterns of diversity we observe today across the globe. Here we discuss a recent research article by Pagani et al. 1 that addresses genomic diversity and historic migration patterns of human populations in Eurasia. The first human genome was sequenced in 2003 by the Human Genome Project2 and larger projects rapidly followed, such as HAPMAP3 and the 1000 Genomes Project4, largely due to the considerable technological improvement of sequencing technologies. Despite being extremely useful tools for a number of studies, these genome databases have some important sampling caveats that limit their use to address some particular topics. Indeed, HAPMAP sampled a reduced number of populations whereas the 1000 Genomes sampled a large number of populations but did not attempt to sample individuals of “pure” ancestry. For instance, the sampling in North America focused considerably on city-based individuals that were found to have a very diverse recent ancestry thus blurring the signal of ancient colonisation history. Importantly, in the studied paper, a considerable effort was made on sampling a broad panel of 447 unrelated individuals of pure ancestry from 148 distinct populations, particularly including previously unstudied regions like Siberia and western Asia.

One of the main topics of the demographic history of humans that has long been of interest to researchers is the Out of Africa (OoA) of Anatomically Modern Humans (AMH) – a turning point in which humans dispersed from Africa and colonised Eurasia and ultimately Oceania and the Americas. Among other aspects, the number of OoA events has been the focus of discussion from which two major hypotheses emerged. The first, arguably the most wide-accepted, advocates for a single OoA event estimated at around 40 to 80 kya which gave origin to all extant non-african populations. The second hypothesis, dubbed the multiple-dispersal model5, considers multiple migration waves, more or less successful in settling in new continents, and possible admixture events between them at various points in time, which appears to be supported by previously described fossil evidence6,7. Interestingly, Tucci & Akey8 argue that these theories are not necessarily mutually exclusive but rather complementary as there could have been several failed or low-success OoA events followed by a major one that effectively colonised and subsisted in most continents.

In this study, Pagani et al. argue in favour of a multiple-dispersal scenario based on small remaining genetic contributions in the genomes of extant Papuans from an extinct lineage of AMH OoA earlier than the main OoA 75 kya.

Genetic structure and barriers across space

To obtain the first insight on the genetic structure among the sampled genomes, Pagani et al. employed two different approaches: first, treating SNP as independent markers (with ADMIXTURE9) and second taking into account linkage blocks (with fineSTRUCTURE10). Both strategies identified the major biogeographic groups of populations despite differences in resolution, defining 14 main genetic clusters across the globe (Extended Data Figure 1C). The detailed output from fineSTRUCTURE was interestingly used for a range of analyses from spatial patterns of genetic differentiation (Figure 1), co-ancestry (Extended Data Figure 3) and demographic history reconstruction (Extended Data Figure 7).

Taking advantage of their detailed sampling from Eurasia to Sahul, the authors employed a spatially explicit framework to study genetic differences and gene flow between populations as well as their association with environmental/geographic features at a large scale. Figure 1 illustrates this by representing the magnitude of the gradient of allele frequencies from SNPs across space, allowing to pinpoint the regions of major genetic gradients, i.e. potential barriers to gene flow, specifically mountain ranges, deserts and large water masses. These were consistent in broad strokes among the different analyses with the fineSTRUCTURE output (Figure S2.2.2-I) as well as the complementary migration-based EEMS (Estimating Effective Migration Surfaces; Extended Data Figure 5H). Importantly, the authors tested whether the geographic gaps in their sampling could bias the interpolation of barriers and showed their model remained robust in the face of new gaps (Extended Data Figure 5E-G).

On a second stage, Pagani et al. measured the association between the gradients of allele frequencies (termed as SNPs in Figure 1) and fineSTRUCTURE with three environmental barriers – elevation, temperature and precipitation – to determine the relative importance of the role each played in shaping the genetic patterns observed today. As one can see in the inset of Figure 1, SNPs indicated that elevation and precipitation had a strong spatial correlation with genetic differences whereas fineSTRUCTURE gave higher support to precipitation and temperature. This dissimilarity is likely due to the fact that the latter, as explained above, is dependent on linkage patterns. Linkage blocks are physical associations of loci that recombination renders temporary, unless they are specifically maintained by selection. Thus, current neutral linkage patterns reflect relatively recent demographic history, whereas the bulk of raw allelic frequencies reveal older patterns that influenced the majority of the genome. In the same sense, when taking into account only the rare variants (i.e. more recent), the association of SNPs with elevation was reduced (Figure S2.2.2-II).

The authors conclude these observations by suggesting that elevation contributed to shaping old migration routes (as confirmed by patterns of isolation by distance; Extended Data Figure 5A-C) but has not recently impeded the persistence of human populations. On the other hand, precipitation seems to be of paramount importance as populations continue to this day to avoid inhabiting low-precipitation regions such as deserts.

Despite the credibility of the conclusions, we raised some important questions on the analysis that could bias the interpretation. First, the authors did not address the innate correlation between the environmental variables (ex.: elevation and temperature) nor how or whether it was taken into account. Additionally, it is unclear which time period was used for temperature and precipitation as the study spans 120 thousand years of demographic history. Both these points could change the relative importance of a given variable, and should therefore have been specified clearly in the main text.

Selection screening

The authors scanned the genomes for evidence of purifying and positive selection through a series of different approaches and identified multiple candidate loci, some of which had been identified as targets of positive selection in previous studies. Additionally, the authors highlighted different levels of inter-population purifying selection, such as on olfactory receptor genes in Asians. Interestingly, they identified significantly stronger purifying selection in pigmentation and immune response genes in Africans than in the remaining populations, with the single exception of Papuans for the pigmentation genes (Extended Data Figure 6B). However, the authors did not discuss the possible factors behind such selective forces nor how this section on selection contributed to the main storyline and conclusions of the study.

Demographic history of Papuans

The results of fineSTRUCTURE were summarised with ChromoPainter and revealed very interesting patterns of haplotype co-ancestry and length as well as proportion of shared genome between populations. Leading is the observation that African populations display the highest co-ancestry (Extended Data Figure 3) and the shortest haplotypes (Figure S2.2.1-III), confirming their status as the oldest and most diverse populations. Short haplotypes reflect multiple recombination events through time indicating older ancestry. Thus, the most surprising observation was that Papuans have the shortest average haplotype length of all non-African populations (Figure S2.2.1-III), as well as the shortest African-inherited haplotypes (Extended Data Figure 7), which suggests an older ancestry with Africans than that of the remaining populations.

To investigate this further, the authors used multiple sequential Markovian coalescent (MSMC) to determine mean split times between genomes of Papuans and other populations, and it is represented in Figure 2A. This figure depicts the proportion of genome coalescing between populations over time (in logarithmic axis). However, it is important to take into account that for these calculations they used a generation time of 30 years, whereas the selection scans were done with a 25 years’ generation time. The latter is the most commonly used in the literature and no justification is given for this change. This analysis revealed an old split between the Papuan and African at about 90 kya (represented as Koinanbe in Figure 2A, red line), predating the split between Eurasian and African estimated at 75 kya (black line) and between Papuan and Eurasian at 40 kya (blue line). Despite the possible fluctuation in the absolute split times due to the chosen generation time, the relative differences between them is in line with Papuans harboring high amounts of short haplotypes, all suggesting an older population split than previously thought.

To explain the demographic history behind the observed patterns, the authors propose that a previously unknown admixture event took place in Sahul with either an archaic non-AMH (different from Denisovan and Neanderthal) or with a AMH resulted from an extinct OoA (xOoA). The latter hypothesis, which fits into the multiple-dispersal model explained earlier, would have taken place after the split of AMH with Neanderthal but before the main OoA.

Using coalescent simulations, the authors tried to replicate the split times by adding varying amounts of admixture with a non-AMH or with an AMH from a xOoA. There was no plausible scenario simulated of archaic admixture with non-AMH that could mirror the observed data. On the other hand, including in Papuans a genomic component that diverged from the main human lineage prior to the main OoA, replicated somewhat similar population split times. It is noteworthy that the main text indicates the “observed shift in the African-Papuan MSMC split curve can be qualitatively reproduced” under these conditions. In detail, it obtained a 3ky difference between the Papuan-African and Papuan-Eurasian splits (Figure S2.2.8-III) whereas the observed time-gap between the two is actually 15 kya (Figure 2A). The authors suggest that they may not be able to reach a comparable gap due to higher complexities of the demographic model that were not simulated within this study, such as population expansion and bottlenecks. Although this explanation appears reasonable, we believe it ought to have been made clear in the main text of the article.

To discern the weight of admixture with non-AMH, the authors masked putatively introgressed Denisovan haplotypes in Papuan genomes, which did not change the split times estimated between Papuans and the other populations (dashed lines in Figure 2A). Furthermore, the authors confirmed that MSMC behaved linearly through multiple events of admixture by studying populations with known admixture proportions in time (African Americans and Central and East Asians; Extended Data Figure 8), which allowed the calculation that the hypothesized xOoA would have split from most Africans around 120 kya (Supplementary Information 2.2.4).

On a supplementary line of examination, Pagani et al. looked at the age of African haplotypes in Papuans not present in other Eurasian populations by accessing the density of non-African alleles (nAAs) within them. The rationale behind this lies on the assumption that the rate of accumulation of nAAs, i.e. alleles not found among African genomes, within a haplotype of determined African origin in a non-African genome is proportional to the split date of that given population with Africans. First, this analysis revealed that Papuans had an overall higher amount of nAAs within African haplotypes along the genome than Eurasians (Figure 2B), indicating an older coalescent time with the Africans. Further, the proportions of nAAs within African haplotypes in Papuans were modeled under demographic scenarios of single and multiple-dispersal. The results showed that a xOoA of AMH that split around 120 kya from Africans was necessary to explain the constant elevated proportions of nAAs in Papuans (Figure 2D).

Combining results from the different approaches, the authors support an xOoA that split from Africans around 120 kya, and conclude by estimating it contributes to approximately 2% in contemporary Papuan genomes.


In this wide-ranging study, Pagani et al. discussed three main topics of human evolutionary biology in Eurasia using their extensive sampling: i) detect main geographic barriers to gene flow, ii) identify loci and ultimately pathways under selective pressure and iii) propose an extinct Out of Africa event earlier than 75 kya.

The latter was arguably the most important finding of this study with, as described above, the description of a 2% contribution in the genome of Papuans from an early xOoA. The authors provided multiple lines of compelling evidence pointing to an extinct Out of Africa expansion around 120 kya from Africans that admixed with the main OoA later in Sahul. The complete scenario is described in Extended Data Figure 10.

Nevertheless, the results presented in this paper and their associated methods are consistently poorly detailed and/or not self-explanatory. Such a paper covering a trendy topic in a high impact journal should be less indigestible for neophytes or even to fellow evolutionary biologists. Furthermore, the connection between the three main sets of analyses of the study (geographic barriers to gene flow, selection screening and the possibility of an xOoA) seems to be lacking as there is no global discussion bringing all points together.

Written by Ana Paula Machado and Clément Train.

Studied papers
Tucci & Akey 2016 Population genetics: A map of human wanderlust. Nature 538: 179–180
Pagani et al 2016 Genomic analyses inform on migration events during the peopling of Eurasia. Nature 538: 238–242


  1. Pagani, L. et al. Genomic analyses inform on migration events during the peopling of Eurasia. Nature 538, 238–242 (2016).
  2. International Human Genome Sequencing Consortium. Finishing the euchromatic sequence of the human genome. Nature 431, 931–945 (2004).
  3. International HapMap Consortium. The International HapMap Project. Nature 426, 789–796 (2003).
  4. 1000 Genomes Project Consortium et al. A map of human genome variation from population-scale sequencing. Nature 467, 1061–1073 (2010).
  5. Lahr, M. M. & Foley, R. Multiple dispersals and modern human origins. Evolutionary Anthropology: Issues, News, and Reviews 3, 48–60 (2005).
  6. Groucutt, H. S. et al. Rethinking the dispersal of Homo sapiens out of Africa. Evol. Anthropol. 24, 149–164 (2015).
  7. Liu, W. et al. The earliest unequivocally modern humans in southern China. Nature 526, 696–699 (2015).
  8. Tucci, S. & Akey, J. M. Population genetics: A map of human wanderlust. Nature 538, 179–180 (2016).
  9. Alexander, D. H., Novembre, J. & Lange, K. Fast model-based estimation of ancestry in unrelated individuals. Genome Res. 19, 1655–1664 (2009).
  10. Lawson, D. J., Hellenthal, G., Myers, S. & Falush, D. Inference of population structure using dense haplotype data. PLoS Genet. 8, e1002453 (2012).
Posted in genomics, human | 2 Comments

Papers to discuss Autumn 2017: Human demography and Convergent evolution

This Autumn, we will discuss two series of papers:

Series 1: Human demography

News & Views: Tucci & Akey 2016 Population genetics: A map of human wanderlust. Nature 538: 179–180

6 Oct 2017: Pagani et al 2016 Genomic analyses inform on migration events during the peopling of Eurasia. Nature 538: 238–242

2 7 Oct 2017: Mallick et al. 2016 The Simons Genome Diversity Project: 300 genomes from 142 diverse populations. Nature 538: 201–206

3 Nov 2017: Malaspinas et al 2016 A genomic history of Aboriginal Australia. Nature 538: 207–214

10 Nov 2017: Skoglund et al 2017 Reconstructing Prehistoric African Population Structure. Cell 171: 59–71.e21

Series 2: Convergent evolution

17 Nov 2017: Levin & Mishmar 2017 The genomic landscape of evolutionary convergence in mammals, birds and reptiles. Nature Ecology & Evolution 1: 0041

24 Nov 2017: Burga et al 2017 A genetic signature of the evolution of loss of flight in the Galapagos cormorant. Science 356: eaal3345 (also see preprint comment by Berger and Bejerano)

8 Dec 2017: Reid et al 2017 The genomic landscape of rapid repeated evolutionary adaptation to toxic pollution in wild fish. Science 354: 1305-1308

15 Dec 2017: Huang et al 2016 Convergent evolution of caffeine in plants by co-option of exapted ancestral enzymes. PNAS 113: 10613–10618

Posted in paper list | Leave a comment

Papers to discuss Spring 2017

Edit: this Spring’s series has been canceled, but I’m leaving the article list up for the record.

For this Spring, the students will need to chose 3 of these 4 series of articles to discuss:

Evo-Devo genomics:

Lin et al 2016 The seahorse genome and the evolution of its specialized morphology. Nature 540: 395–399

Van Belleghem et al 2017 Complex modular architecture around a simple toolkit of wing pattern genes. Nature Ecology & Evolution 1: 0052

Human populations:

Pagani et al 2016 Genomic analyses inform on migration events during the peopling of Eurasia. Nature 538: 238–242

Mallick et al. 2016 The Simons Genome Diversity Project: 300 genomes from 142 diverse populations. Nature 538: 201–206

Malaspinas et al 2016 A genomic history of Aboriginal Australia. Nature 538: 207–214

Convergent evolution:

Yeaman et al 2016 Convergent local adaptation to climate in distantly related conifers. Science 353: 1431-1433

Fukushima et al 2016 Genome of the pitcher plant Cephalotus reveals genetic changes associated with carnivory. Nature Ecology & Evolution 1: 0059

Reid et al 2017 The genomic landscape of rapid repeated evolutionary adaptation to toxic pollution in wild fish. Science 354: 1305-1308


de Manuel et al 2016 Chimpanzee genomic diversity reveals ancient admixture with bonobos. Science 354: 477-481

Roux et al 2016 Shedding Light on the Grey Zone of Speciation along a Continuum of Genomic Divergence. PLoS Biol 14: e2000234

Posted in paper list | Leave a comment

Coregulation of tandem duplicate genes slows evolution of subfunctionalization in mammals

Gene duplications are main contributors of genome evolution, but most of the duplicates are redundant and go through pseudogenization. There are several mechanisms proposed to explain how young duplicates survive long-term and escape from being degraded. Among these, dosage-balance model likely to explain the importance of shared expression levels of young duplicate genes. An alternative model indicates sub-functionalization (new copies shares the initial functions) or neo-functionalization (new copy gains new function) as the main mechanisms of the survival of new duplicate. However, it is largely unknown the survival of gene duplication in mammals. In this current study, by using RNA-seq profiles of different human and mouse tissues, authors show that sub-functionalization is a slowly evolving and rare event. Most of the young duplicates are shown to have decreased level of expression, thereby providing initial survival and long-term preservation in the genome.

Figure 1 Expression profiles of duplicate genes. Examples of Sub-or Neofunctionalization (A) and asymmetrically expressed gene pairs (B) are shown. In sub-functionalized example, SLC4A2 was shown to be expressed in Lung, Kidney, Liver and Testis, whereas SLC4A3 is expressed in Cortex, Heart and Testis. In asymmetrically expressed gene example, CRB1 is shown to be expressed higher in all tissues that examined.

In order to understand the process of long-term survival after gene duplication, they analyzed RNAseq data of 46 human tissues (from Genotype Tissue Expression, GTEx) and26 mouse tissues. With a computational pipeline (More than %80 coding sequence similarity and more than %50 average sequence similarity), 1444 duplicate gene pairs are identified. These gene pairs are classified as major gene and minor gene, for the higher or lower expression level, respectively. In addition, if a gene pair is at least two-fold higher expressed in minimum one tissue, then it is classified as sub- or neo-functionalized (Figure 1A). Moreover, if a gene pair is expressed more than the other pair in 1/3 of the tissues that examined, it is considered as asymmetrically expressed duplicate (AED) as shown in Figure 1B. Synonymous divergence (ds) was used to estimate divergence time, human-mouse split was shown as 0.45 ds and origin of placental mammals was shown as 0.7 ds.

Figure 2 Sub-functionalized or neo-functionalized genes dating back before the emergence of placental mammals.

Some gene pairs (Mostly of ds < 0.7) are shown to be neo or sub functionalized, yet there are very few examples of neo or sub-functionalization in lately occurred duplication events (Figure 2A-C). In addition, as it is expected that sub-functionalized genes would be under strong selective constraint comparing with non-divergent genes, Kolmogorov-Smirnov test showed that sub-functionalized genes have high fraction rare variants (Figure 2D). Since functionalization would rather give new functions to the gene pairs, authors examined if one of the gene pairs is associated with any disease. There is indeed a correlation that indicating an increase of both minor gene specific disease and minor gene associated disease, when there is a sub-functionalization event (Figure 2E).

The duplicates that are risen within placental mammals, most duplicate pairs are shown to be AEDs other than sub-functionalized and within AEDs, very few minor genes are associated with disease in contrast to what was shown in Figure 2E. All these results indicate that, sub-functionalization is a slowly evolving event, although it was shown that duplicates on different chromosomes have higher rates or neo- or sub-functionalization when it is compared with duplicates that are in tandem arrays. This brings the question, whether separation of the duplicates is a facilitating process for sub-functionalization.

Figure 3 Genomic Location of the duplicates and expression correlation. It is shown that most of the young duplicates are located in same chromosome and are closely located to each other, whereas the older duplicates tend to locate on different chromosomes. Depending on how closely the duplicates locate on the chromosome (both in human and mouse), there is a higher of expression correlation of the duplicates.

Supporting this idea, authors indicated that 87% young gene pairs with ds < 0.1 are found in tandem arrays in the same chromosomes (Figure 3A). The rest of the duplicates found on different chromosomes are most likely separated by the result of chromosomal rearrangements and they have diverged expression pattern due to the genomic separation (Figure 3B). It is shown that the more genomic distance of the duplicates increases, the less expression correlation of the duplicates is observed. Notably, it is also shown that duplicates in mouse have a similar correlation with human duplicates, indicating the negative relation between genomic distance and expression correlation is not human specific (Figure 3C). This data supports what was previously shown about the coregulation of closely located genes in the genome and it is once shown in Figure 3D, as neighbor duplicates have higher expression correlation comparing with duplicates on different chromosomes and singletons. In addition, whole-genome chromosome conformation capture (Hi-C) shows that neighboring duplicates have higher connectivity and more promoter-promoter links comparing with neighboring singletons (Figure 3D).

So far, it is shown that expression sub-functionalization is a slowly evolving process and duplicates that are in tandem arrays are mostly coregulated. As an alternative explanation, if dosage sharing is crucial for the preservation of newborn duplicates, it must be shown that there is a shared and lower expression of the duplicates. In order to prove this hypothesis, the authors investigated the human duplicates since human-macaque split with RNA-seq results of six different tissues. It is obvious that, the sum of expression levels of human major and minor duplicate is corresponding to the expression level of macaque singleton ortholog (Figure 4A). This data proves that dosage sharing is a fast evolving event, contributing to the preservation of duplicates in the genome.

Figure 4 Dosage sharing and multi-step model of how duplicate genes are preserved. Summed expression of human young duplicate is similar to the expression of macaque ortholog.

Overall, in this current study the mechanism of how duplicated genes are preserved is explained with a multi-step model (Figure 4C). According to the model, after a duplication event happens, expression dosage is shared between two duplicates which was also suggested for whole genome duplications. In this process, there is a tight competition between dosage sharing and mutational degradation of one of the duplicates. After this important step, minor gene of the asymmetrically expressed duplicate can be lost slowly under reduced constraint. In an alternative long-term scenario, chromosomal rearrangements would happen to separate the coregulation of these tandem duplicates and providing different expression pattern and/or protein adaptation which will cause long-term survival of the duplicated genes. To sum up, this study shows that rapid dose sharing is a fundamental first step after the duplication of a gene and it can be followed by a slow evolving subfunctionalization event of the duplicate.


Xun Lan and Jonathan K. Pritchard

Science 20 May 2016:
Vol. 352, Issue 6288, pp. 1009-1013
DOI: 10.1126/science.aad8411

Posted in evolution | Leave a comment

Peppered moth melanism mutation is a transposable element

One of the most known examples of natural selection in action is the evolution of the peppered moth (Biston betularia), the rapid replacement of the light-colored form of the moth (typica) by a dark-colored form (carbonaria) (Fig. 1) during 1800s in Britain. The first live specimen of the carbonaria form was found in 1848 and its frequency had increased drastically until late 1800s. In 1895, 98% of the moth population in Manchester was the carbonaria form (reviewed in Clarke et al., 1985). Such a phenomenon 36 years after the publication of Darwin’s On the Origin of Species, attracted biologists’ attention. J.W. Tutt first proposed “Differential bird predation hypothesis” in 1896, which is confirmed by a series of experiments by Kettlewell during mid 1950s (reviewed in Cook and Saccheri, 2013). The hypothesis states that the industrial revolution in Britain resulted in blackened trees by soot, so that birds can easily spot light-colored moths on soot-darkened trees while dark-colored moths are camouflaged. However, genetic events giving rise to carbonaria phenotype remained elusive until recently. Researchers from University of Liverpool and Wellcome Trust Sanger Institute now reported in Nature that the mutation causing the peppered moth industrial melanism is the insertion of a large, tandemly repeated transposable element into first intron of gene cortex.

Figure 1. The dark-colored form, carbonaria (top) and the light-colored form, typica (bottom) of Biston betularia

The term industrial melanism refers to darkening of species in response to pollutants. It is widespread in many Lepidoptera species (moths and butterflies). Initial experiments identified that melanism in Biston betularia is determined by a single locus dominant allele (reviewed in Cook and Saccheri, 2013). However, the molecular identity of the gene determining the melanism in peppered moths was completely unknown. In order to determine the gene identity, van’t Hof and Saccheri looked for associations between genetic polymorphisms within sixteen genes previously implicated in melanisation pattern differences in other insects and the carbonaria morph by the candidate gene approach (van’t Hof and Saccheri, 2010). However, this earlier study showed that the carbonaria gene is not a structural variant of a canonical melanisation pathway gene. One year after the failure of the candidate gene approach, Saccheri group constructed a linkage map to identify the chromosomal region containing the carbonariatypica polymorphisms. In 2011, they coarsely localized the carbonaria locus to a <400-kilobase region orthologous to a segment of silkworm (Bombyx mori) chromosome 17 (van’t Hof et al., 2011). However, what the gene is and what it does was still a mystery.

The same group now reported that they have found the gene and the mutation event causing the industrial melanism in Biston betularia (van’t Hof et al., 2016). By using a larger population sample and more closely spaced genetic markers, they narrowed down the carbonaria candidate region to ~100 kb region in Biston betularia genome. The candidate region is the orthologue of Drosophila cortex (cort) gene. As a distant member of the Cdc20 protein family, Drosophila cort gene encodes for a cell-cycle regulator and is shown to be important in regulating oocyte meiosis (Chu et al., 2001), but it is not involved in wing patterning or development. Unlike Drosophila cort, two of multiple alternative first exons (1A and 1B) in Biston betularia cortex are strongly expressed in developing wing disks. In addition, cortex gene has a very large first intron and eight non-first exons.

After identification of the gene, authors compared one carbonaria to three typica haplotypes to identify the first set of carbonaria specific polymorphisms. This initial alignment revealed 87 melanisation candidate polymorphisms concentrated within the large first intron of the gene. However, natural selection increases not only the frequency of the favored allele in carbonaria but also the frequency of the neutral alleles linked to the causal allele. In an earlier study, they have also shown that Biston betularia melanism was originated from a single recent mutation (van’t Hof et al., 2011). Having screened more typica individuals, they further eliminated rare variants and were eventually able to find one polymorphism unique to carbonaria, a very large insert in the first intron of the gene.

The size of the causative large insert is 21,925 nucleotides long and is composed of a roughly 9-kb essentially non-repetitive sequence. The nature of the insert indicated that it is a class II transposable element (TE) – DNA transposon. The transposition of class II TEs are catalyzed by transposases that cut the DNA at the target site in a staggered fashion producing 5′ or 3′ DNA overhangs that are duplicated after transposition. Another hallmark of class II TEs is short inverted repeats at two ends of TE. Sequence analysis of the insert and comparison with the typica haplotypes revealed that both short inverted repeats (6 bp) and duplication of the target site (4 bp) are present in the carbonaria insert (Fig. 2).

Figure 2. The structure of the insert, shown in the carbonaria sequence, corresponds to a class II DNA transposon, with direct repeats resulting from target site duplication (black nucleotides) next to inverted repeats (red nucleotides). Typica haplotypes (lower sequence) lack the 4-base target site duplication, the inverted repeats and the core insert sequence. The transposon consists of ∼9 kb tandemly repeated two and one-third times (repeat unit (RU)1–RU3), with three short tandem subrepeat units (green dots, SRU1–SRU9) within each repeat unit.

To estimate the age of mutation event, the authors looked at 200 kb either side of the carb-TE insert. The idea is to track recombination events that have eroded the ancestral carbonaria haplotype. Given the ancestral state of carbonaria haplotype and recombination rate, how many years do we need to explain the observed haplotypes that are shuffled version of the ancestral one? Simulations based on this assumption predicted the most likely date of the mutation as 1819, shortly before it was first seen in the wild (1848) (Fig. 3).

Figure 3. Probability density for the age of the carb-TE mutation inferred from the recombination pattern in the carbonaria haplotypes (maximum density at 1819 shown by dotted line; first record of carbonaria in 1848 shown by dashed line).

The next question is how the carbonaria – TE leads to the melanisation of Biston betularia. TEs localized in introns effects the expression of the gene through several mechanisms. In order to test this possibility, first they checked tissue-specific expression of cortex splice isoforms and alternative first exons. They have identified two first exons, 1A and 1B, which are expressed highly in developing wing discs. Comparison of the abundance of 1A and 1B-initiated full transcripts between different genetic backgrounds (homozygous carbonariac/c, homozygous typicat/t, and heterozygous individuals – c/t) revealed that 1B expression is significantly higher in carbonaria background (c/c > c/t > t/t) (Fig. 4), whereas 1A-initiated full transcript does not show a significant difference between genotypes. In addition, cumulative expression of all splice-isoforms increases starting from the sixth larval instar (La6) until day 6 prepupa (Cr6) with highest value on day 4 prepupa (Cr4). Surprisingly, a phase of rapid wing disc morphogenesis also occurs in the same time interval, possibly indicating a function of cortex in wing pattern melanisation.

Figure 4. Tukey plot for relative expression of cortex 1B full transcript in developing wings of the three carbonaria-locus genotypes (c/c, c/t and t/t) produced within the progeny of a c/t x c/t cross. Genotypes differ significantly for the transcript (P < 0.001)

As mentioned earlier, Drosophila cort encodes for a distant member of the Cdc20 protein family (Chu et al., 2001). Members of the Cdc20 protein family activate an “E3” ubiquitin ligase, the anaphase-promoting complex (APC) and present its substrates. APC then ubiquitinates presented cell-cycle proteins, causing their degradation. This proteolysis destroys a panel of proteins including cyclins, allowing the cell cycle to progress. Degrons, short linear motifs located anywhere in the protein, are important for substrate recognition in proteolysis. A single shared site in lepidopterans and non-lepidopterans cortex binding the same degron sequence is also predicted for both 1A and 1B full isoforms, indicating a shared function of cortex between D. melanogaster and B. betularia. However, we still need further evidence to understand the exact connection between cell-cycle protein degradation and melanisation.

In conclusion, we now know that the industrial melanism mutation event in British peppered moth is the insertion of a large, tandemly repeated, transposable element into the first intron of the gene cortex. Although we still do not know the molecular mechanisms connecting cortex gene and the melanism in peppered moths, the discovery of causative mutation as a transposable element is breakthrough in the peppered moth story. In addition, it provides a spectacular evidence for the importance of transposable elements in adaptive evolution.


Chu, T., Henrion, G., Haegeli, V., and Strickland, S. (2001). Cortex, a drosophila gene required to complete oocyte meiosis, is a member of the Cdc20/fizzy protein family. Genesis 29, 141–152.

Clarke, C.A., Mani, G.S., and Wynne, G. (1985). Evolution in reverse: clean air and the peppered moth. Biol. J. Linn. Soc. 26, 189–199.

Cook, L.M., and Saccheri, I.J. (2013). The peppered moth and industrial melanism: evolution of a natural selection case study. Heredity (Edinb). 110, 207–212.

van’t Hof, A.E., Edmonds, N., Dalíková, M., Marec, F., and Saccheri, I.J. (2011). Industrial melanism in British peppered moths has a singular and recent mutational origin. Science 332, 958–960.

van’t Hof, A.E., Campagne, P., Rigden, D.J., Yung, C.J., Lingley, J., Quail, M.A., Hall, N., Darby, A.C., and Saccheri, I.J. (2016). The industrial melanism mutation in British peppered moths is a transposable element. Nature 534, 102–105.

van’t Hof, A.E., and Saccheri, I.J. (2010). Industrial melanism in the peppered moth is not associated with genetic variation in canonical melanisation gene candidates. PLoS One 5.



Posted in evolution | Tagged , , | Leave a comment

The spotted gar genome illuminates vertebrate evolution and facilitates human-teleost comparisons

The spotted gar genome illuminates vertebrate evolution and facilitates human-teleost comparisons

About 450 mya bony vertebrates radiated into Lobe-finned fish, from which tetrapods appeared later, and Ray-finned fish, which include Teleost (Fig.1). Nowadays they make up to 96 percent of all fish in the planet. Among the latter some species such as zebrafish (Dario renio) and medaka (Oryzias latipes) are used as model organisms in biomedical research in order to try to understand which is the genetic basis of certain human diseases. However, the transferability between the models is difficult given the phylogenetic distance between tetrapods (humans) and Ray-finned fish. For this reason, the authors decided to sequence the genome of the Spotted Gar (Lepisosteus oculatos), that can act as a bridge as it split off from the teleosts before the TGN (Teleost Genome Duplication). During vertebrate evolution two other genome duplications happened in the vertebrate linage: VGD1 and VGD2.

Fig1: Spotted gar is a ray-finned fish that diverged from teleost fishes before the TGD. Gar connects teleosts to lobe-finned vertebrates, such as coelacanth, and tetrapods, including human, by clarifying evolution after the two earlier rounds of vertebrate genome duplication (VGD1 and VGD2) that occurred before the divergence of ray-finned and lobe-finned fishes 450 million years ago (MYA)

Genes duplicates derived from the TGN are called Ohnologs. They were named by after Susumu Ohno, who showed in his work genome duplication may play an important role in evolution. The resulting paralogs (a special case of homology when duplicate genes or regions are in the same genome) are associated with development, signaling and gene regulation [2 sentences edited by Marc Robinson-Rechavi]. In addition ohnologs, which amount to about 20 to 35% of genes in the human genome, are frequently implicated in cancer and genetic diseases. Evolution acts on these duplicates and usually they can evolve in three different ways. Mechanisms that lead to preservation of duplicates are sub functionalization (partitioning of ancestral gene functions on the duplicates), neofunctionalization (assigning a novel function to one of the duplicates) and dosage selection (preserving genes to maintain dosage balance between interconnected components). Therefore the most likely outcome is non-functionalization of one duplicate genes due to the lack of selective constraint on preserving both. Because of the asymmetric evolution of ohnologs, TGD, and the speed at which the genome of teleost has evolved, connecting teleost sequences to human sequences can be challenging.
The authors thought, however, that the genome of the Gar can solve these problems due to its slow genetic evolution. Using this “Gar Bridge” allows to clarify the evolution of orthologs (genes in different species that evolved from a common ancestral gene by speciation) in humans such as: (i) Hox and Parahox genes, involved in the formation of body segments during embryogenesis; (ii) The SCPP genes (Calcium binding phosphoproteins), involved in the mineralization of tissues; (iii) miRNA genes, small non-coding RNA molecules that function in RNA silencing and post-transcriptional regulation of gene expression; (iv) CNEs (Conserved Non-coding Elements), regulatory sequences than in previous comparisons between tetrapod and teleost have never appeared. Finally, by the use of transcriptome data they tried to quantify the sum of expression domains and the levels of expression of the TGD-duplicate genes to figure out how these genes evolved.

Genome assembly and annotation
The authors sequenced the genome of one adult female gar to 90x coverage using Illumina technology. By anchoring a scaffold to a meiotic map they captured 94% of assembled bases in 29 linkage groups (LGs). Next, they constructed a gene set composed of 21,433 high confidence protein-coding genes and discovered that 20% of the genome is repetitive with Transposable Elements (TE) that are found in both teleost and lobe-finned fishes. Thanks to this they could clarify the phylogenetic origins of the TE.

The Gar lineage evolved slowly
The authors have made a Bayesian phylogenetic analysis using 243 one-to-one orthologs from 25 jawed vertebrates (Fig.2). Thanks to an evolutionary rate analysis, they showed that the proteins of the sister group of Holostei have evolved more slowly than those of the other vertebrates included in the analysis. These results suggest that the TGD maybe played a role in the rapid evolution of Teleost. The latter is confirmed by the greater branch lengths of the three teleost species used as outgroup.

Fig2: Bayesian phylogeny inferred from 243 proteins with a one-to-one orthology ratio from 25 jawed (gnathostome) vertebrates using PhyloBayes under the CAT + GTR + Γ4 model with rooting on cartilaginous fishes. Node support is shown as posterior probability (first number at each node) and bootstrap support from maximum-likelihood analysis (second number at each node).

Gar inform the evolution of bony vertebrate karyotypes
The karyotype of Gar (n2=58), which is composed of micro- and macro-chromosomes, was aligned to those of human, chicken and medaka, a teleost fish. Microchromosomes are present in a wide range of vertebrate classes but not in mammals and teleost. Probably they are the product of an evolutionary process that minimizes the DNA content (mostly through the number of repeats) and maximizes the recombination rate of them. The authors chose the Gar because its genome is the first that does not belong to teleost or lobe finned fish. They could demonstrate a high degree of one-to-one synteny (co-localization of genetic loci on the same chromosome) comparing gar to the chicken genome. This adds support to the hypothesis that the bony ancestor possessed both micro and macro chromosomes. They explain the absence of microchromosomes in teleost by fusion processes that occurred after the divergence from Gar followed by the TGD. In fact, if you look at the comparisons made between Gar and Medaka chromosomes, the synteny relationship is one-to-two meaning that the chromosome sequences are conserved, but are now located on different chromosomes. This confirms that after the fusion and the TGD, teleostei’s chromosomes where subjected to rearrangements and rediploidization and that the radiation of Holostei sister group happened before the genome duplication (Fig.3).

Fig.3: Gar-chicken-medaka comparisons illuminate the karyotype evolution leading to modern teleosts. The genome of the bony vertebrate ancestor contained both macro- and microchromosomes, some of which remain largely conserved in chicken and gar, for example, macrochromosome Loc2-GgaZ and microchromosomes Loc20-Gga15 and Loc21-Gga17. All three chromosomes possess double-conserved synteny with medaka chromosomes Ola9 and Ola12, which is explained by chromosome fusion in the lineage leading to teleosts after divergence from gar, followed by TGD duplication of the fusion chromosome and subsequent intrachromosomal rearrangements and rediploidization.

Gar clarifies vertebrate gene family evolution
Molecular and physiological mechanisms are shared between vertebrates and this allows to highlight the different types of evolution to which genes were subjected. Despite this after a genome duplication is possible that some ohnologs lineages went lost. The analysis of gar genome allowed to find ancestral genes belonging to VGD1 VGD2 and to clarify the functions of some gene families. For instance, they analyzed the hox family and were able to identify four clusters The number of hox genes that it possesses is greater compared to the ones of tetrapod and teleost. The latter in fact lack some hox orthologs, highlighting that were lost independently in the two groups. The hox genes are very important during embryonic development and intuitively one would think that these have to be more preserved than others. Surprisingly, in my opinion, this study reveal that the teleost, instead of 82 expected Hox cluster genes, have fewer than 50 indicating a massive gene loss after the TGD. The same results were obtained by analyzing circadian clocks, specifically opsin; the MHC’s family; the immunoglobulin genes; the Toll-like receptors. All these genes have shown that gar’s genome can act as a bridge between teleosts and tetrapods, as it possesses characteristics of both.

Gar uncover evolution of vertebrate mineralized tissues
The authors chose this class of proteins because they are preserved for almost all vertebrates. In gar they have an important role as the epidermis is composed of ganoid scales and then formed by ganoin, an “ancestor” of the enamel. However, the evolution of the Scpp (Secretory Calcium-binding Phosphoproteins) was not clear. Gar contain the largest gene number of Scpp, 35, and thanks to this big gene repertory made possible to identify orthologs which with a teleost-tetrapod comparison was not possible to find. The Ambn, Enam and Amel genes, respectively encode ameloblastin, aenamelin and amelogenin. They had been found in the lobe finned fish but not in teleost. These are, however, present in the transcriptome of gar and showing sequence similarity with zebrafish Scpp genes. This suggests that teleost may have different orthologs and that the common ancestor of bony vertebrates had a rich repertoire of Spcc genes. On one hand gar has kept it on the other hand teleosts and tetrapods suffered a loss of subsets of these genes.

Gar connects vertebrate microRNAomes
miRNA is a small non-coding RNA molecule (containing about 22 nucleotides) that functions in RNA silencing and post-transcriptional regulation of gene expression. This gene class has suffered the same evolutionary fate of others mentioned previously. Some sequences have become tetrapod or teleost-specific. The gar genome enabled to identify 107 families. In my opinion the authors did an interesting discover: TGD did not lead to the miRNA loss in teleost. Indeed, the retention rate is higher compared to some protein coding genes, shading new light to the hypothesis that “miRna genes are likely to be retained after a duplication owing their incorporation into multiple gene regulatory networks”. This is evidence of how very often we focus on the evolution of coding sequences of DNA when regulatory mechanisms and non-coding sequences seem to have greater importance.

Gar highlights hidden orthology of cis-regulatory elements
Conserved non-coding element (CNE) are non-coding regions of the genome identified by conventional alignment of genomic sequences from two or more species.
These regions are widely studied because it is unclear the role they play. However, are often considered as cis-acting regulatory sequence (acting on the same molecule of DNA that they regulate). The authors analyzed the evolution of these sequences close to developmental Hox and Parahox genes considering that, during embryonic development, gene expression must be controlled precisely both spatially and temporally. This control is brought about, in large part, by the combinatorial interaction of specific transcription factors with cis-regulatory modules. They chose CNS65, a limb enhancer, because in previous alignments its sequence has been shown to be conserved in humans and chicken but not in teleost. Again using gar CNS65 was possible to find an ortholog in zebrafish. They tested if this cryptic CNS65 enhancer preserves the ancestral function by generate transgenic zebrafish and mice embryos. What they discovered is that the ancestral function was also maintained in zebrafish but with different spatial dynamics. Using mouse embryos, gar CNS65 drives expression of forelimbs and hind limbs in the early stages of development and just later its function is restricted to the distal portion. In zebrafish CNS65 it is only active in the development of the forelimbs (Fig.4).

Fig.4: Gar CNS65 drives expression throughout the early mouse forelimbs and hindlimbs (arrows) at stage E10.5 (left). At later stages (E12.5), gar CNS65 activity is restricted to the proximal portion of the limb and is absent in developing digits (middle). Zebrafish CNS65 drives reporter expression in developing mouse limbs at E10.5 but only in forelimbs (right).

This is an example of partial loss of the original function, a mechanism that during evolution is more frequent than the gaining of a new function. Besides CNS65 they had found 108 other limb-enhancer in common with humans, compared to 81 that had been found previously with the teleost alignment confirming the presence of hidden orthology (Fig.5).

Fig.5: The gar bridge principle of vertebrate CNE connectivity from human through gar to teleosts. Hidden orthology is uncovered for elements that do not directly align between human and teleosts but become evident when first aligning tetrapod genomes to gar, and then aligning gar and teleost genomes

This shows that the latter have suffered the loss of a great number of limb enhancer. In the future, gar will be the ideal candidate to study the limb-to-fin transition.

Gar illuminate gene expression evolution following the TGD
Initially I spoke of evolutionary path that ohnologs (paralog) genes may have after the duplication of the genome. Here the authors were able to get two very clear, I think also very rare, examples as they evolved. The gene slc1a3 went to a neo-functionalization. In gar is expressed only in brain, bone and testis while in medaka, that was chosen by the authors as the representative of the teleost, a ohnolog is mainly expressed in the brain and the other in the liver (Fig.6.c). Completely different fate hit the gpr22 gene that has undergone sub-functionalization. In gar is expressed in the brain and in the heart while in the medaka one ohnolog is expressed in the brain and the other in the heart (Fig.6.d).

Fig.6: (c) Neofunctionalized ohnologs for slc1a3 showing new expression in liver. (d) Subfunctionalized TGD orthologs of gpr22 with one expressed in brain as in gar and the other expressed in heart as in gar. In c and d, the r values denote the correlation of the expression profile of each ohnolog with the gar pattern.

This second mechanism is what you would expect with more chances: an ancestral gene sub-function tends to be partitioned between the TGD-derived paralogs. The authors have also seen that the same mechanism occurs regarding the level of gene expression where a ohnologs pair tends to evolve the same level of expression of the pre-duplication gene.

The “Gar-bridges” led to the identification of many ortholog and paralog genes and clarify their fate during evolution. Previously the lack of direct connection between teleost and tetrapod genomes often lead to the wrong use of the word “innovation” on one group or the other. I think that this work is an excellent starting point to connect the evolution of genetic, developmental and physiological mechanisms that made the human genome evolve to its present state. To fully understand the differences between human and model organisms used in biomedicine it is crucial to create very powerful and close-to-reality models. For these reasons, this path should not stop here because the gar is only one species of Holostei – which is composed of nine species and two orders. The study of their genome and also that of other so-called “primitive” fish can help to shine more light on even the striking points that have emerged from this study. Perhaps the outcome of other comparative studies can give even more emphasis to these results or maybe provide answers that may now be counterintuitive.


Braasch I, Gehrke AR, Smith JJ, Kawasaki K, Manousaki T, Pasquier J, Amores A, Desvignes T, Batzel P, Catchen J, Berlin AM, Campbell MS, Barrell D, Martin KJ, Mulley JF, Ravi V, Lee AP, Nakamura T, Chalopin D, Fan S, Wcisel D, Cañestro C, Sydes J, Beaudry FE, Sun Y, Hertel J, Beam MJ, Fasold M, Ishiyama M, Johnson J, Kehr S, Lara M, Letaw JH, Litman GW, Litman RT, Mikami M, Ota T, Saha NR, Williams L, Stadler PF, Wang H, Taylor JS, Fontenot Q, Ferrara A, Searle SM, Aken B, Yandell M, Schneider I, Yoder JA, Volff JN, Meyer A, Amemiya CT, Venkatesh B, Holland PW, Guiguen Y, Bobe J, Shubin NH, Di Palma F, Alföldi J, Lindblad-Toh K, & Postlethwait JH (2016). The spotted gar genome illuminates vertebrate evolution and facilitates human-teleost comparisons. Nature genetics, 48 (4), 427-37 PMID: 26950095

Posted in Uncategorized | Leave a comment

ExAC presents a catalogue of human protein-coding genetic variation

Exploration of variability of human genomes represents a key step in the holy grail of human genetics – to link genotypes with phenotypes, it also provides insights to human evolution and history. For this purpose Exome Aggregation Consortium (ExAC) have been founded; to capture variability of human exomes using next-generation sequencing. The first ExAC dataset of 63,358 individuals was released 20th of October 2014. Recently, a paper describing updated version of the dataset was published : Analysis of protein-coding genetic variation in 60,706 humans.

Authors made a great work on the reproductibility of the downstream analyses they have performed and generally on the availability of data. All the code is well documented in blogpost and available in GitHub repository. All figures in this blogpost I plotted by my own!


ExAC is composed of almost ten fold more individuals and previous dataset of the similar kind Fig 1a. 91,000 individuals were sequenced, of which 60,706 have been kept after quality filtering. Finnish population was excluded from European due to bottleneck they have gone though.

ExAC was targeting individuals with various genetic background. Principal component analysis have shown very strong geographical pattern in the dataset (Fig 1b). I expected a continuum of haplotypes in the environment without strong geographic obstacle (like European-Latino continuum). The gaps between South Asian samples and the rest Europen samples on the PCA plot is most likely caused by the absence of samples from Middle-East Asia. Middle-East Asian samples have just a colour, but no data points. Central Asians do not even have a colour.

Figure 1: Size and diversity of ExAC dataset a, ExAC dataset is almost ten fold bigger than datasets of similar kind: 1000 Genomes project and Exome Sequencing Project (ESP), but more importantly, it captures a far greater diversity of human populations compared to ESP and 1000 Genomes. b, The geographic signal of populations visualized using Principal component analysis (PCA). The first principal component get all the variability of African samples and it does not tells much about the rest of the dataset (Extended Data Figure 5 in the paper), therefore the second and third principal component has been show.

A 45 million nucleotide positions with sufficient coverage (>10x in at least 80% of individuals) are present in ExAC. These positions correspond to 18 million possible synonymous variants (in theory) of which ExAC is capturing 1.4 million (7.5%).

The size of ExAC allows to observe…

…mutational reoccurence: 43% of synonymous de Novo variants identified in previous studies were also identified in ExAC, which is a first direct evidence of mutational reoocuarence.

…multiple allels: 7.9% of high quality polymorphic sites are multiallelic, which is fairly close to Poisson expectation (whatever it means…)

…a LOT of variants after all the filtering, 7,404,909 high-quality variants were identified of which 317,381 indels. The density of variant is on the average one over eight bases. 99% of the variants had frequency bellow 1% and 54% of the variants are singletons (i.e. only one individual carries the variant).

…a selection effects The proportion of singletons among polymorphisms can serve as a measure of purifying selection acting on the polymorphisms of given size. The Figure 2 shows that indels that are not affecting open reading frame (ORF) have significantly less singleton variants than indels that actually affect ORF. There is also significant difference between indels of different sizes that are affecting ORF, but we (our topic group) have not found any possible explanation for this pattern.

…saturation of alleles in CpG sites: CpG sites have very high rate of transitions, therefore capturing all possible variants is substantially easier than for other sites. A subset of 20,000 individuals of ExAC dataset shows saturation of alleles – all non-lethal possible synonymous CpG transition variants are present. ExAC is the first dataset showing a saturation of human variation.

Figure 2: Indel frequencies with respect to the size a, Frequency of deletions is higher and smaller indels are more probable than greater. If we take into account the greater probability of smaller indels, frequency of indels that not shifting open reading frame is bit higher than frequency of indels than are not. b, Proportion of singletons in total number of indels (as proxy for strength of selection) is significantly and consistently lower in all indels that are not shifting open reading frame (-6, -3, +3, +6).

Deletireous alleles

Authors introduce a mutability adjusted proportion singleton (MAPS) metric as a measure of selection. This metric is correcting on biases caused by the different mutational rates allowing comparisons of categories with various mutational speed. Comparison across different functional classes have shown at Figure 3. MAPS shows higher values for categories predicted to be deleterious by conservation-based methods.

Figure 3: MAPS values of different functional classes. MAPS is highest for nonense substiturions and it also consistent with PolyPhen and Combined Annotation Dependent Depletion (CADD) classification.

Rare diseases

Average ExAC individual carries ~54 variants reported as Mendelian disease causing. Approximately 41 of these alleles were identified with frequency greater than one, therefore it is not expected to be caused by problem is variant calling, but in miss-classification of variants in the database. Evidence of 192 previously variants were manually curated of those only 9 had sufficient evidence in disease association. High allele frequencies were identified mainly in previously underrepresented categories Latino and South Asian.

ExAC have shown importance of matching reference population in identification disease-causing variant. An example is recessive disease North American Indian childhood cirrhosis previously reported to be caused by CIRH1A p.R565W. This variant was identified in homozygotic state in four individuals in Latino population, none of them having a record of liver problems during childhood.


ExAC shows the importance of diversity of sampled population in capturing the real link between genotype and phenotype. Even ExAC provides a lot of new insights, there are still populations that are underrepresented or not represented at all.

Given the richness of ExAC and the effort of authors in data sharing and availability, I guess that it will be a great resource for various analyses in the future for a lot of researchers around the globe.

Lek M, Karczewski KJ, Minikel EV, Samocha KE, Banks E, Fennell T, O’Donnell-Luria AH, Ware JS, Hill AJ, Cummings BB, Tukiainen T, Birnbaum DP, Kosmicki JA, Duncan LE, Estrada K, Zhao F, Zou J, Pierce-Hoffman E, Berghout J, Cooper DN, Deflaux N, DePristo M, Do R, Flannick J, Fromer M, Gauthier L, Goldstein J, Gupta N, Howrigan D, Kiezun A, Kurki MI, Moonshine AL, Natarajan P, Orozco L, Peloso GM, Poplin R, Rivas MA, Ruano-Rubio V, Rose SA, Ruderfer DM, Shakir K, Stenson PD, Stevens C, Thomas BP, Tiao G, Tusie-Luna MT, Weisburd B, Won HH, Yu D, Altshuler DM, Ardissino D, Boehnke M, Danesh J, Donnelly S, Elosua R, Florez JC, Gabriel SB, Getz G, Glatt SJ, Hultman CM, Kathiresan S, Laakso M, McCarroll S, McCarthy MI, McGovern D, McPherson R, Neale BM, Palotie A, Purcell SM, Saleheen D, Scharf JM, Sklar P, Sullivan PF, Tuomilehto J, Tsuang MT, Watkins HC, Wilson JG, Daly MJ, MacArthur DG, & Exome Aggregation Consortium. (2016). Analysis of protein-coding genetic variation in 60,706 humans. Nature, 536 (7616), 285-91 PMID: 27535533

Posted in evolution, genomics, human | Tagged , , | Leave a comment