Papers to discuss Spring 2017

Edit: this Spring’s series has been canceled, but I’m leaving the article list up for the record.

For this Spring, the students will need to choose 3 of these 4 series of articles to discuss:

Evo-Devo genomics:

Lin et al 2016 The seahorse genome and the evolution of its specialized morphology. Nature 540: 395–399

Van Belleghem et al 2017 Complex modular architecture around a simple toolkit of wing pattern genes. Nature Ecology & Evolution 1: 0052

Human populations:

Pagani et al 2016 Genomic analyses inform on migration events during the peopling of Eurasia. Nature 538: 238–242

Mallick et al. 2016 The Simons Genome Diversity Project: 300 genomes from 142 diverse populations. Nature 538: 201–206

Malaspinas et al 2016 A genomic history of Aboriginal Australia. Nature 538: 207–214

Convergent evolution:

Yeaman et al 2016 Convergent local adaptation to climate in distantly related conifers. Science 353: 1431-1433

Fukushima et al 2017 Genome of the pitcher plant Cephalotus reveals genetic changes associated with carnivory. Nature Ecology & Evolution 1: 0059

Reid et al 2016 The genomic landscape of rapid repeated evolutionary adaptation to toxic pollution in wild fish. Science 354: 1305-1308


de Manuel et al 2016 Chimpanzee genomic diversity reveals ancient admixture with bonobos. Science 354: 477-481

Roux et al 2016 Shedding Light on the Grey Zone of Speciation along a Continuum of Genomic Divergence. PLoS Biol 14: e2000234


Coregulation of tandem duplicate genes slows evolution of subfunctionalization in mammals

Gene duplications are major contributors to genome evolution, but most duplicates are redundant and are lost through pseudogenization. Several mechanisms have been proposed to explain how young duplicates survive in the long term and escape degradation. Among these, the dosage-balance model emphasizes the importance of shared expression levels between young duplicate genes. Alternative models propose sub-functionalization (the copies partition the ancestral functions between them) or neo-functionalization (one copy gains a new function) as the main mechanisms by which new duplicates survive. However, how duplicate genes survive in mammals is largely unknown. In this study, using RNA-seq profiles from different human and mouse tissues, the authors show that sub-functionalization is a slowly evolving and rare event. Most young duplicates instead show decreased expression levels, which appears to provide initial survival and long-term preservation in the genome.

Figure 1 Expression profiles of duplicate genes. Examples of sub- or neo-functionalized (A) and asymmetrically expressed (B) gene pairs are shown. In the sub-functionalized example, SLC4A2 is expressed in lung, kidney, liver and testis, whereas SLC4A3 is expressed in cortex, heart and testis. In the asymmetrically expressed example, CRB1 is expressed at a higher level than its duplicate in all tissues examined.

To understand how duplicates survive in the long term, the authors analyzed RNA-seq data from 46 human tissues (from the Genotype-Tissue Expression project, GTEx) and 26 mouse tissues. Using a computational pipeline (more than 80% coding sequence similarity and more than 50% average sequence similarity), they identified 1444 duplicate gene pairs. Within each pair, the gene with the higher expression level is called the major gene and the one with the lower expression level the minor gene. A pair is classified as sub- or neo-functionalized if each gene is expressed at least two-fold higher than its partner in at least one tissue (Figure 1A). A pair is instead considered an asymmetrically expressed duplicate (AED) if one gene is expressed more highly than the other in at least 1/3 of the tissues examined (Figure 1B). Synonymous divergence (ds) was used to estimate divergence time: the human-mouse split corresponds to a ds of 0.45 and the origin of placental mammals to a ds of 0.7.
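
The classification rules above can be sketched in a few lines of Python. This is only an illustration of the two-fold and 1/3 rules as described in this summary, not the authors' pipeline: the function name `classify_pair`, the pseudocount and the toy expression values are assumptions, and the original analysis may apply additional filters.

```python
from typing import Sequence


def classify_pair(expr_a: Sequence[float], expr_b: Sequence[float],
                  fold: float = 2.0, aed_fraction: float = 1 / 3,
                  pseudocount: float = 1.0) -> str:
    """Classify a duplicate pair from per-tissue expression values.

    expr_a and expr_b hold one value per tissue, in the same tissue order.
    The pseudocount (an assumption) avoids division by zero.
    """
    if len(expr_a) != len(expr_b) or len(expr_a) == 0:
        raise ValueError("expression vectors must be non-empty and of equal length")

    a_higher = 0  # tissues where gene A is >= fold higher than gene B
    b_higher = 0  # tissues where gene B is >= fold higher than gene A
    for a, b in zip(expr_a, expr_b):
        ratio = (a + pseudocount) / (b + pseudocount)
        if ratio >= fold:
            a_higher += 1
        elif ratio <= 1 / fold:
            b_higher += 1

    n_tissues = len(expr_a)
    # Each copy dominates somewhere: candidate sub- or neo-functionalization.
    if a_higher >= 1 and b_higher >= 1:
        return "sub/neo-functionalized"
    # One copy dominates in at least 1/3 of tissues, the other never does.
    if max(a_higher, b_higher) >= aed_fraction * n_tissues:
        return "AED"
    return "unclassified"


# Toy values (hypothetical, six tissues each).
partitioned = classify_pair([50, 40, 30, 1, 1, 20], [1, 1, 1, 45, 35, 20])
asymmetric = classify_pair([30, 25, 40, 10, 12, 9], [5, 4, 6, 8, 11, 9])
similar = classify_pair([10, 10, 10], [9, 11, 10])
print(partitioned, asymmetric, similar)
```

Checking the sub/neo-functionalized rule first keeps the two classes mutually exclusive, which matches the way the text treats them as alternatives.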

Figure 2 Sub-functionalized or neo-functionalized genes dating back before the emergence of placental mammals.

Some gene pairs (mostly with ds > 0.7, i.e. predating the origin of placental mammals) are sub- or neo-functionalized, but there are very few examples of sub- or neo-functionalization among recent duplications (Figure 2A-C). Sub-functionalized genes are expected to be under stronger selective constraint than non-diverged duplicates, and a Kolmogorov-Smirnov test indeed showed that sub-functionalized genes carry a higher fraction of rare variants (Figure 2D). Because functional divergence makes each copy individually important, the authors also examined whether the genes of a pair are associated with disease. Sub-functionalization is indeed associated with an increase in both minor-gene-specific and minor-gene-associated disease (Figure 2E).

Among duplicates that arose within placental mammals, most pairs are AEDs rather than sub-functionalized, and within AEDs very few minor genes are associated with disease, in contrast to the pattern in Figure 2E. Together, these results indicate that sub-functionalization evolves slowly. Duplicates located on different chromosomes, however, show higher rates of neo- or sub-functionalization than duplicates in tandem arrays. This raises the question of whether physical separation of the duplicates facilitates sub-functionalization.

Figure 3 Genomic location of the duplicates and expression correlation. Most young duplicates are located close to each other on the same chromosome, whereas older duplicates tend to be located on different chromosomes. In both human and mouse, the closer the duplicates are on the chromosome, the higher their expression correlation.

Supporting this idea, the authors report that 87% of young gene pairs with ds < 0.1 are found in tandem arrays on the same chromosome (Figure 3A). The remaining duplicates, found on different chromosomes, were most likely separated by chromosomal rearrangements and have diverged in expression pattern as a result of this genomic separation (Figure 3B). The greater the genomic distance between the duplicates, the lower their expression correlation. Notably, mouse duplicates show a similar trend, indicating that the negative relationship between genomic distance and expression correlation is not specific to humans (Figure 3C). These data agree with earlier reports of coregulation of neighboring genes, and Figure 3D shows it again: neighboring duplicates have higher expression correlation than duplicates on different chromosomes and singletons. In addition, whole-genome chromosome conformation capture (Hi-C) shows that neighboring duplicates have higher connectivity and more promoter-promoter links than neighboring singletons (Figure 3D).
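
To make "expression correlation" concrete, here is a minimal sketch of a Pearson correlation between the tissue expression profiles of two duplicates. The choice of Pearson correlation, the function name `pearson` and the toy profiles are assumptions made for illustration; the summary does not state which correlation measure the authors used.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two expression profiles (one value per tissue)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have equal length and at least two tissues")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical profiles over five tissues.
gene = [10, 50, 5, 80, 20]
tandem_partner = [12, 45, 6, 70, 25]      # neighbor in a tandem array
separated_partner = [60, 5, 40, 10, 30]   # copy on another chromosome

print(pearson(gene, tandem_partner) > pearson(gene, separated_partner))
```

In this toy example the tandem partner tracks the gene across tissues while the separated copy does not, which is the pattern the text describes for real duplicates.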

So far, the results show that expression sub-functionalization evolves slowly and that duplicates in tandem arrays are mostly coregulated. Alternatively, if dosage sharing is crucial for the preservation of newborn duplicates, the duplicates should show shared and reduced expression. To test this hypothesis, the authors examined human genes that duplicated after the human-macaque split, using RNA-seq data from six tissues. The summed expression of the human major and minor duplicates corresponds to the expression level of the macaque singleton ortholog (Figure 4A). These data suggest that dosage sharing evolves quickly and contributes to the preservation of duplicates in the genome.
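
The dosage-sharing comparison can be illustrated with a short Python sketch: for each tissue, sum the expression of the two human duplicates and divide by the expression of the macaque singleton. A ratio near 1 is what dosage sharing predicts, whereas a ratio near 2 would mean both copies kept the ancestral level. The function name and all numbers are hypothetical, and this is not the authors' actual analysis.

```python
from statistics import median
from typing import Sequence


def dosage_sharing_ratio(major: Sequence[float], minor: Sequence[float],
                         singleton: Sequence[float]) -> float:
    """Median over tissues of (major + minor) / singleton expression.

    Tissues where the singleton ortholog is not expressed are skipped.
    """
    if not (len(major) == len(minor) == len(singleton)):
        raise ValueError("all profiles must cover the same tissues")
    ratios = [(a + b) / s for a, b, s in zip(major, minor, singleton) if s > 0]
    if not ratios:
        raise ValueError("singleton is not expressed in any tissue")
    return median(ratios)


# Hypothetical expression values in six tissues.
human_major = [60, 30, 45, 20, 80, 10]
human_minor = [35, 22, 50, 12, 25, 9]
macaque_singleton = [100, 50, 90, 30, 110, 20]

ratio = dosage_sharing_ratio(human_major, human_minor, macaque_singleton)
print(f"median summed/singleton ratio: {ratio:.2f}")
```

The median is used here so that a single outlier tissue does not dominate the summary; that choice is also an assumption.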

Figure 4 Dosage sharing and multi-step model of how duplicate genes are preserved. The summed expression of young human duplicates is similar to the expression of the macaque ortholog.

Overall, the study explains how duplicated genes are preserved with a multi-step model (Figure 4C). According to the model, after a duplication event the expression dosage is shared between the two duplicates, as was also suggested for whole-genome duplications. At this stage there is tight competition between dosage sharing and mutational degradation of one of the duplicates. After this step, the minor gene of an asymmetrically expressed pair can be lost slowly under reduced constraint. In an alternative long-term scenario, chromosomal rearrangements separate the tandem duplicates and break their coregulation, allowing divergent expression patterns and/or protein adaptation that lead to long-term survival of both genes. In sum, the study shows that rapid dosage sharing is a fundamental first step after gene duplication, which can be followed by slowly evolving sub-functionalization.


Xun Lan and Jonathan K. Pritchard

Science 20 May 2016:
Vol. 352, Issue 6288, pp. 1009-1013
DOI: 10.1126/science.aad8411


Peppered moth melanism mutation is a transposable element

One of the best-known examples of natural selection in action is the evolution of the peppered moth (Biston betularia): the rapid replacement of the light-colored form of the moth (typica) by a dark-colored form (carbonaria) (Fig. 1) during the 1800s in Britain. The first live specimen of the carbonaria form was found in 1848, and its frequency increased drastically until the late 1800s. By 1895, 98% of the moth population in Manchester was the carbonaria form (reviewed in Clarke et al., 1985). Such a phenomenon, 36 years after the publication of Darwin’s On the Origin of Species, attracted biologists’ attention. J.W. Tutt first proposed the “differential bird predation hypothesis” in 1896, and it was later supported by a series of experiments by Kettlewell during the mid-1950s (reviewed in Cook and Saccheri, 2013). The hypothesis states that soot from the industrial revolution in Britain blackened the trees, so that birds could easily spot light-colored moths on soot-darkened trees while dark-colored moths were camouflaged. However, the genetic events giving rise to the carbonaria phenotype remained elusive until recently. Researchers from the University of Liverpool and the Wellcome Trust Sanger Institute have now reported in Nature that the mutation causing industrial melanism in the peppered moth is the insertion of a large, tandemly repeated transposable element into the first intron of the gene cortex.

Figure 1. The dark-colored form, carbonaria (top) and the light-colored form, typica (bottom) of Biston betularia

The term industrial melanism refers to the darkening of species in response to pollutants. It is widespread among Lepidoptera (moths and butterflies). Early experiments showed that melanism in Biston betularia is determined by a dominant allele at a single locus (reviewed in Cook and Saccheri, 2013). However, the molecular identity of the gene was completely unknown. To identify it, van’t Hof and Saccheri used a candidate gene approach, looking for associations between the carbonaria morph and genetic polymorphisms within sixteen genes previously implicated in melanisation pattern differences in other insects (van’t Hof and Saccheri, 2010). This study showed that the carbonaria gene is not a structural variant of a canonical melanisation pathway gene. One year after the failure of the candidate gene approach, the Saccheri group constructed a linkage map to identify the chromosomal region containing the carbonaria-typica polymorphism. In 2011, they coarsely localized the carbonaria locus to a <400-kilobase region orthologous to a segment of silkworm (Bombyx mori) chromosome 17 (van’t Hof et al., 2011). However, what the gene is and what it does remained a mystery.

The same group has now reported that they have found the gene and the mutation causing industrial melanism in Biston betularia (van’t Hof et al., 2016). Using a larger population sample and more closely spaced genetic markers, they narrowed the carbonaria candidate region down to a ~100 kb region of the Biston betularia genome. The candidate region corresponds to the orthologue of the Drosophila cortex (cort) gene. As a distant member of the Cdc20 protein family, Drosophila cort encodes a cell-cycle regulator and has been shown to be important in regulating oocyte meiosis (Chu et al., 2001), but it is not known to be involved in wing patterning or development. In contrast, Biston betularia cortex has multiple alternative first exons, two of which (1A and 1B) are strongly expressed in developing wing discs. In addition, the cortex gene has a very large first intron and eight non-first exons.

After identifying the gene, the authors compared one carbonaria haplotype with three typica haplotypes to identify a first set of carbonaria-specific polymorphisms. This initial alignment revealed 87 candidate melanisation polymorphisms concentrated within the large first intron of the gene. Most of these are expected to be passengers rather than the cause, because natural selection increases not only the frequency of the favored allele but also the frequency of neutral alleles linked to it. In an earlier study, the group had also shown that Biston betularia melanism originated from a single recent mutation (van’t Hof et al., 2011). By screening more typica individuals, they eliminated rare variants that also occur in typica and were eventually left with one polymorphism unique to carbonaria: a very large insert in the first intron of the gene.

The causative insert is 21,925 nucleotides long and is composed of a roughly 9-kb, essentially non-repetitive sequence that is tandemly repeated about two and one-third times. The nature of the insert indicates that it is a class II transposable element (TE), that is, a DNA transposon. The transposition of class II TEs is catalyzed by transposases that cut the DNA at the target site in a staggered fashion, producing 5′ or 3′ overhangs; filling in these overhangs after insertion duplicates the target site. Another hallmark of class II TEs is short inverted repeats at the two ends of the TE. Sequence analysis of the insert and comparison with the typica haplotypes revealed that both short inverted repeats (6 bp) and a duplication of the target site (4 bp) are present in the carbonaria insert (Fig. 2).

Figure 2. The structure of the insert, shown in the carbonaria sequence, corresponds to a class II DNA transposon, with direct repeats resulting from target site duplication (black nucleotides) next to inverted repeats (red nucleotides). Typica haplotypes (lower sequence) lack the 4-base target site duplication, the inverted repeats and the core insert sequence. The transposon consists of ∼9 kb tandemly repeated two and one-third times (repeat unit (RU)1–RU3), with three short tandem subrepeat units (green dots, SRU1–SRU9) within each repeat unit.
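
The two hallmarks described above, a target site duplication flanking the insert and inverted repeats at its ends, are easy to check on a sequence. The sketch below is only an illustration: the function names and the toy sequences are made up and are not the real carb-TE sequence; only the 4 bp and 6 bp lengths come from the text.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def class_ii_hallmarks(left_flank: str, insert: str, right_flank: str,
                       tsd_len: int = 4, tir_len: int = 6) -> tuple:
    """Check for a target site duplication (TSD) and terminal inverted repeats (TIRs).

    TSD: the last tsd_len bases before the insert are repeated directly after it.
    TIR: the first tir_len bases of the insert equal the reverse complement
    of its last tir_len bases.
    """
    tsd_present = left_flank[-tsd_len:].upper() == right_flank[:tsd_len].upper()
    tir_present = insert[:tir_len].upper() == reverse_complement(insert[-tir_len:])
    return tsd_present, tir_present


# Toy example (hypothetical sequences).
left = "TTGACCTA"                            # ends with the 4-bp target site CCTA
element = "CAGTGC" + "ACGTACGTTTGA" + "GCACTG"  # 6-bp inverted repeats at both ends
right = "CCTAGGAT"                           # starts with the duplicated CCTA

print(class_ii_hallmarks(left, element, right))
```

Comparing against a haplotype without the insert, as the authors did with typica, is what shows that the target site is present only once in the ancestral sequence.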

To estimate the age of the mutation, the authors examined 200 kb on either side of the carb-TE insert. The idea is to track the recombination events that have eroded the ancestral carbonaria haplotype: given the ancestral carbonaria haplotype and the recombination rate, how many years are needed to explain the observed haplotypes, which are shuffled versions of the ancestral one? Simulations based on this reasoning gave 1819 as the most likely date of the mutation, shortly before carbonaria was first recorded in the wild (1848) (Fig. 3).

Figure 3. Probability density for the age of the carb-TE mutation inferred from the recombination pattern in the carbonaria haplotypes (maximum density at 1819 shown by dotted line; first record of carbonaria in 1848 shown by dashed line).
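
The logic of dating a mutation from haplotype erosion can be illustrated with a much simpler model than the authors' simulations. Assume (my simplification) that each sampled carbonaria haplotype independently escapes recombination in a flanking window with probability (1 - c) per generation, so after t generations it is still intact with probability (1 - c)^t. Given the number of intact haplotypes in a sample, we can then look for the t that maximizes the binomial likelihood. The function names, the recombination probability c and the counts below are all hypothetical.

```python
import math


def log_likelihood(t: int, intact: int, total: int, c: float) -> float:
    """Binomial log-likelihood (up to a constant) of observing `intact` unrecombined
    haplotypes out of `total`, t generations after the mutation."""
    p_intact = (1.0 - c) ** t
    # Clamp to avoid log(0) at the extremes.
    p_intact = min(max(p_intact, 1e-12), 1.0 - 1e-12)
    return intact * math.log(p_intact) + (total - intact) * math.log(1.0 - p_intact)


def most_likely_age(intact: int, total: int, c: float,
                    max_generations: int = 2000) -> int:
    """Grid search for the number of generations with the highest likelihood."""
    return max(range(1, max_generations + 1),
               key=lambda t: log_likelihood(t, intact, total, c))


# Hypothetical data: 30 of 100 haplotypes still carry the intact ancestral window,
# with a recombination probability of 0.006 per generation across that window.
age = most_likely_age(intact=30, total=100, c=0.006)
print(f"most likely age: about {age} generations")
```

If one assumes one generation per year (an assumption here), generations convert directly to years, and subtracting the estimate from the sampling year gives a calendar date comparable to the one in Fig. 3.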

The next question is how the carb-TE leads to melanisation in Biston betularia. TEs located in introns can affect the expression of the host gene through several mechanisms. To test this possibility, the authors first examined the tissue-specific expression of cortex splice isoforms and alternative first exons. They identified two first exons, 1A and 1B, that are highly expressed in developing wing discs. Comparison of the abundance of 1A- and 1B-initiated full transcripts between genotypes (homozygous carbonaria, c/c; homozygous typica, t/t; and heterozygous, c/t) revealed that 1B expression is significantly higher in the carbonaria background (c/c > c/t > t/t) (Fig. 4), whereas the 1A-initiated full transcript does not differ significantly between genotypes. In addition, the cumulative expression of all splice isoforms rises from the sixth larval instar (La6) and remains elevated until day 6 prepupa (Cr6), peaking on day 4 prepupa (Cr4). Notably, a phase of rapid wing disc morphogenesis occurs in the same time interval, possibly indicating a function of cortex in wing pattern melanisation.

Figure 4. Tukey plot for relative expression of cortex 1B full transcript in developing wings of the three carbonaria-locus genotypes (c/c, c/t and t/t) produced within the progeny of a c/t x c/t cross. Genotypes differ significantly for the transcript (P < 0.001)

As mentioned earlier, Drosophila cort encodes a distant member of the Cdc20 protein family (Chu et al., 2001). Members of the Cdc20 protein family activate an “E3” ubiquitin ligase, the anaphase-promoting complex (APC), and present its substrates to it. The APC then ubiquitinates the presented cell-cycle proteins, causing their degradation. This proteolysis destroys a panel of proteins including cyclins, allowing the cell cycle to progress. Degrons, short linear motifs that can be located anywhere in the protein, are important for substrate recognition in proteolysis. A single degron-binding site shared between lepidopteran and non-lepidopteran cortex proteins is predicted for both the 1A and 1B full isoforms, suggesting a shared function of cortex between D. melanogaster and B. betularia. However, further evidence is needed to understand the exact connection between cell-cycle protein degradation and melanisation.

In conclusion, we now know that the industrial melanism mutation in the British peppered moth is the insertion of a large, tandemly repeated transposable element into the first intron of the gene cortex. Although we still do not know the molecular mechanisms connecting cortex to melanism in peppered moths, the discovery that the causative mutation is a transposable element is a breakthrough in the peppered moth story. In addition, it provides spectacular evidence for the importance of transposable elements in adaptive evolution.


Chu, T., Henrion, G., Haegeli, V., and Strickland, S. (2001). Cortex, a Drosophila gene required to complete oocyte meiosis, is a member of the Cdc20/fizzy protein family. Genesis 29, 141–152.

Clarke, C.A., Mani, G.S., and Wynne, G. (1985). Evolution in reverse: clean air and the peppered moth. Biol. J. Linn. Soc. 26, 189–199.

Cook, L.M., and Saccheri, I.J. (2013). The peppered moth and industrial melanism: evolution of a natural selection case study. Heredity (Edinb). 110, 207–212.

van’t Hof, A.E., Edmonds, N., Dalíková, M., Marec, F., and Saccheri, I.J. (2011). Industrial melanism in British peppered moths has a singular and recent mutational origin. Science 332, 958–960.

van’t Hof, A.E., Campagne, P., Rigden, D.J., Yung, C.J., Lingley, J., Quail, M.A., Hall, N., Darby, A.C., and Saccheri, I.J. (2016). The industrial melanism mutation in British peppered moths is a transposable element. Nature 534, 102–105.

van’t Hof, A.E., and Saccheri, I.J. (2010). Industrial melanism in the peppered moth is not associated with genetic variation in canonical melanisation gene candidates. PLoS One 5.




The spotted gar genome illuminates vertebrate evolution and facilitates human-teleost comparisons


About 450 million years ago (mya), bony vertebrates radiated into lobe-finned fish, from which tetrapods later arose, and ray-finned fish, which include the teleosts (Fig.1). Teleosts now make up about 96 percent of all fish on the planet. Some teleost species, such as zebrafish (Danio rerio) and medaka (Oryzias latipes), are used as model organisms in biomedical research to understand the genetic basis of human diseases. However, transferring findings between these models and humans is difficult, given the phylogenetic distance between tetrapods (including humans) and ray-finned fish. For this reason, the authors sequenced the genome of the spotted gar (Lepisosteus oculatus), which can act as a bridge because it split off from the teleosts before the TGD (teleost genome duplication). Two earlier genome duplications also occurred in the vertebrate lineage: VGD1 and VGD2.

Fig1: Spotted gar is a ray-finned fish that diverged from teleost fishes before the TGD. Gar connects teleosts to lobe-finned vertebrates, such as coelacanth, and tetrapods, including human, by clarifying evolution after the two earlier rounds of vertebrate genome duplication (VGD1 and VGD2) that occurred before the divergence of ray-finned and lobe-finned fishes 450 million years ago (MYA)

Gene duplicates derived from whole-genome duplications such as the TGD are called ohnologs. They are named after Susumu Ohno, who argued that genome duplication may play an important role in evolution. The resulting paralogs (a special case of homology in which the duplicate genes or regions are in the same genome) are associated with development, signaling and gene regulation [2 sentences edited by Marc Robinson-Rechavi]. In addition, ohnologs, which amount to about 20 to 35% of genes in the human genome, are frequently implicated in cancer and genetic diseases. Duplicates can be preserved through three mechanisms: sub-functionalization (partitioning of ancestral gene functions between the duplicates), neo-functionalization (acquisition of a novel function by one of the duplicates) and dosage selection (preserving genes to maintain dosage balance between interconnected components). Nevertheless, the most likely outcome is non-functionalization of one of the duplicates, owing to the lack of selective constraint on preserving both. Because of the asymmetric evolution of ohnologs, the TGD itself, and the speed at which teleost genomes have evolved, connecting teleost sequences to human sequences can be challenging.
The authors reasoned, however, that the gar genome can solve these problems because of its slow evolution. This “gar bridge” helps clarify the evolution of human orthologs (genes in different species that evolved from a common ancestral gene by speciation) such as: (i) Hox and ParaHox genes, involved in the formation of body segments during embryogenesis; (ii) the SCPP genes (secretory calcium-binding phosphoproteins), involved in the mineralization of tissues; (iii) miRNA genes, small non-coding RNA molecules that function in RNA silencing and post-transcriptional regulation of gene expression; (iv) CNEs (conserved non-coding elements), including regulatory sequences that previous direct comparisons between tetrapods and teleosts had failed to detect. Finally, using transcriptome data, they quantified the expression domains and expression levels of TGD-duplicated genes to work out how these genes evolved.

Genome assembly and annotation
The authors sequenced the genome of one adult female gar to 90x coverage using Illumina technology. By anchoring scaffolds to a meiotic map, they placed 94% of assembled bases in 29 linkage groups (LGs). Next, they constructed a gene set of 21,433 high-confidence protein-coding genes and found that 20% of the genome is repetitive, containing transposable elements (TEs) of types found in both teleosts and lobe-finned fishes. This allowed them to clarify the phylogenetic origins of these TEs.

The Gar lineage evolved slowly
The authors performed a Bayesian phylogenetic analysis using 243 one-to-one orthologs from 25 jawed vertebrates (Fig.2). An evolutionary rate analysis showed that the proteins of Holostei, the sister group of teleosts to which gar belongs, have evolved more slowly than those of the other vertebrates included in the analysis. These results suggest that the TGD may have played a role in the rapid evolution of teleosts, which is consistent with the longer branch lengths of the teleost species in the tree.

Fig2: Bayesian phylogeny inferred from 243 proteins with a one-to-one orthology relationship from 25 jawed (gnathostome) vertebrates using PhyloBayes under the CAT + GTR + Γ4 model with rooting on cartilaginous fishes. Node support is shown as posterior probability (first number at each node) and bootstrap support from maximum-likelihood analysis (second number at each node).

Gar informs the evolution of bony vertebrate karyotypes
The karyotype of gar (2n = 58), which is composed of micro- and macrochromosomes, was compared with those of human, chicken and medaka, a teleost fish. Microchromosomes are present in a wide range of vertebrate classes but not in mammals or teleosts. They are probably the product of an evolutionary process that minimizes DNA content (mostly by reducing the number of repeats) and maximizes recombination rate. The authors chose gar because its genome is the first from a bony vertebrate that belongs neither to the teleosts nor to the lobe-finned vertebrates. Comparing gar with the chicken genome, they demonstrated a high degree of one-to-one synteny (co-localization of genetic loci on the same chromosome). This supports the hypothesis that the bony vertebrate ancestor possessed both micro- and macrochromosomes. They explain the absence of microchromosomes in teleosts by chromosome fusions that occurred after the divergence from gar and were followed by the TGD. Indeed, in comparisons between gar and medaka chromosomes the synteny relationship is one-to-two: each gar chromosome segment is conserved in medaka but is found in two copies on different chromosomes. This indicates that after the fusions and the TGD, teleost chromosomes underwent rearrangements and rediploidization, and that Holostei, the sister group of teleosts, diverged before the genome duplication (Fig.3).

Fig.3: Gar-chicken-medaka comparisons illuminate the karyotype evolution leading to modern teleosts. The genome of the bony vertebrate ancestor contained both macro- and microchromosomes, some of which remain largely conserved in chicken and gar, for example, macrochromosome Loc2-GgaZ and microchromosomes Loc20-Gga15 and Loc21-Gga17. All three chromosomes possess double-conserved synteny with medaka chromosomes Ola9 and Ola12, which is explained by chromosome fusion in the lineage leading to teleosts after divergence from gar, followed by TGD duplication of the fusion chromosome and subsequent intrachromosomal rearrangements and rediploidization.

Gar clarifies vertebrate gene family evolution
Vertebrates share many molecular and physiological mechanisms, which makes it possible to compare how their genes have evolved. After a genome duplication, however, some ohnolog lineages can be lost. Analysis of the gar genome recovered ancestral genes dating from VGD1 and VGD2 and clarified the history of several gene families. For instance, the authors analyzed the Hox family and identified four clusters in gar. Gar has more Hox genes than either tetrapods or teleosts, both of which lack some Hox orthologs, showing that these genes were lost independently in the two groups. Hox genes are very important during embryonic development, so intuitively one would expect them to be better preserved than other genes. Surprisingly, in my opinion, this study reveals that teleosts have fewer than 50 Hox cluster genes instead of the 82 expected, indicating massive gene loss after the TGD. Similar results were obtained for circadian clock genes (specifically opsins), the MHC family, the immunoglobulin genes and the Toll-like receptors. All these gene families show that the gar genome can act as a bridge between teleosts and tetrapods, as it shares characteristics with both.

Gar uncovers evolution of vertebrate mineralized tissues
The authors examined the secretory calcium-binding phosphoproteins (Scpp) because they are conserved in almost all vertebrates. In gar they have an important role because the body is covered with ganoid scales, which contain ganoin, an “ancestor” of enamel. However, the evolution of the Scpp genes was not clear. Gar has the largest Scpp gene repertoire, 35 genes, and this large repertoire made it possible to identify orthologs that a direct teleost-tetrapod comparison could not find. The Ambn, Enam and Amel genes encode ameloblastin, enamelin and amelogenin, respectively. They had been found in lobe-finned vertebrates but not in teleosts. They are, however, present in the gar transcriptome and show sequence similarity with zebrafish Scpp genes. This suggests that teleosts may have divergent orthologs and that the common ancestor of bony vertebrates had a rich repertoire of Scpp genes. Gar has retained this repertoire, whereas teleosts and tetrapods each lost subsets of these genes.

Gar connects vertebrate microRNAomes
A miRNA is a small non-coding RNA molecule (about 22 nucleotides long) that functions in RNA silencing and post-transcriptional regulation of gene expression. This gene class has met the same evolutionary fate as the others mentioned above: some sequences have become tetrapod- or teleost-specific. The gar genome made it possible to identify 107 families. In my opinion the authors made an interesting discovery: the TGD did not lead to miRNA loss in teleosts. Indeed, the retention rate is higher than for some protein-coding genes, shedding new light on the hypothesis that “miRNA genes are likely to be retained after a duplication owing to their incorporation into multiple gene regulatory networks”. This is a reminder that we very often focus on the evolution of coding sequences when regulatory mechanisms and non-coding sequences may be of greater importance.

Gar highlights hidden orthology of cis-regulatory elements
Conserved non-coding elements (CNEs) are non-coding regions of the genome identified by conventional alignment of genomic sequences from two or more species.
These regions are widely studied because their role is unclear, but they are often considered to be cis-acting regulatory sequences (acting on the same DNA molecule as the genes they regulate). The authors analyzed the evolution of CNEs close to the developmental Hox and ParaHox genes, given that gene expression during embryonic development must be controlled precisely in both space and time. This control is brought about, in large part, by the combinatorial interaction of specific transcription factors with cis-regulatory modules. They chose CNS65, a limb enhancer, because previous alignments had shown its sequence to be conserved in humans and chicken but not in teleosts. Using gar CNS65 as a bridge, they were able to find an ortholog in zebrafish. They then tested whether this cryptic CNS65 enhancer preserves the ancestral function by generating transgenic zebrafish and mouse embryos. They found that the ancestral function is maintained in zebrafish, but with different spatial dynamics. In mouse embryos, gar CNS65 drives expression in forelimbs and hindlimbs in the early stages of development, and only later is its activity restricted to the proximal portion of the limb. Zebrafish CNS65, in contrast, is active only in the developing forelimbs (Fig.4).

Fig.4: Gar CNS65 drives expression throughout the early mouse forelimbs and hindlimbs (arrows) at stage E10.5 (left). At later stages (E12.5), gar CNS65 activity is restricted to the proximal portion of the limb and is absent in developing digits (middle). Zebrafish CNS65 drives reporter expression in developing mouse limbs at E10.5 but only in forelimbs (right).

This is an example of partial loss of the original function, a mechanism that is more frequent during evolution than the gain of a new function. Besides CNS65, they found 108 other limb enhancers shared with humans, compared to the 81 that had previously been found with the teleost alignment, confirming the presence of hidden orthology (Fig.5).

Fig.5: The gar bridge principle of vertebrate CNE connectivity from human through gar to teleosts. Hidden orthology is uncovered for elements that do not directly align between human and teleosts but become evident when first aligning tetrapod genomes to gar, and then aligning gar and teleost genomes

This shows that teleosts have lost a great number of limb enhancers. In the future, gar will be the ideal candidate for studying the fin-to-limb transition.
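
The gar bridge principle described above can be sketched as the composition of two alignments. This is only a toy illustration: the element identifiers and the three mapping tables are made up, and real alignments are of course not simple dictionaries.

```python
# Toy illustration of the "gar bridge": all element IDs below are hypothetical.
human_to_gar = {"hsCNE1": "garCNE1", "hsCNE2": "garCNE2", "hsCNE3": "garCNE3"}
gar_to_teleost = {"garCNE1": "drCNE1", "garCNE2": "drCNE2"}
human_to_teleost_direct = {"hsCNE1": "drCNE1"}  # found by direct alignment


def bridge_orthologs(a_to_bridge, bridge_to_c):
    """Compose two alignments through the bridge species."""
    return {a: bridge_to_c[b] for a, b in a_to_bridge.items() if b in bridge_to_c}


def hidden_orthologs(a_to_bridge, bridge_to_c, direct):
    """Orthologs recovered through the bridge but missed by direct alignment."""
    bridged = bridge_orthologs(a_to_bridge, bridge_to_c)
    return {a: c for a, c in bridged.items() if a not in direct}


bridged = bridge_orthologs(human_to_gar, gar_to_teleost)
hidden = hidden_orthologs(human_to_gar, gar_to_teleost, human_to_teleost_direct)
print("via gar:", bridged)
print("hidden orthology:", hidden)
```

In this toy case hsCNE2 has no direct teleost hit but is connected through gar, while hsCNE3 has a gar ortholog and no teleost counterpart at all.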

Gar illuminates gene expression evolution following the TGD
Earlier I mentioned the evolutionary paths that ohnologs (paralogs arising from a genome duplication) may follow after the duplication. Here the authors obtained two very clear, and I think very rare, examples of how they evolved. The gene slc1a3 underwent neo-functionalization: in gar it is expressed only in brain, bone and testis, while in medaka, which the authors chose as the representative teleost, one ohnolog is mainly expressed in the brain and the other in the liver (Fig.6.c). A completely different fate befell the gpr22 gene, which underwent sub-functionalization: in gar it is expressed in the brain and the heart, while in medaka one ohnolog is expressed in the brain and the other in the heart (Fig.6.d).

Fig.6: (c) Neofunctionalized ohnologs for slc1a3 showing new expression in liver. (d) Subfunctionalized TGD orthologs of gpr22 with one expressed in brain as in gar and the other expressed in heart as in gar. In c and d, the r values denote the correlation of the expression profile of each ohnolog with the gar pattern.

This second mechanism is the more likely outcome: the sub-functions of an ancestral gene tend to be partitioned between the TGD-derived paralogs. The authors also saw that the same holds for the level of gene expression, where an ohnolog pair tends to evolve toward the expression level of the pre-duplication gene.
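
To make the r values of Fig.6 more concrete, here is a minimal sketch of how the expression profile of each ohnolog can be correlated with the gar pattern. The expression values are invented to mimic the gpr22 case (brain and heart in gar, one tissue per ohnolog in medaka), and a plain Pearson correlation is assumed; the exact procedure used by the authors may differ.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Hypothetical expression levels, in the tissue order given below.
tissues = ["brain", "heart", "liver", "bone", "testis"]
gar = [10.0, 8.0, 0.0, 0.0, 0.0]        # single pre-duplication-like gene
ohnolog_a = [9.0, 0.0, 0.0, 0.0, 0.0]   # medaka copy kept in brain
ohnolog_b = [0.0, 7.0, 0.0, 0.0, 0.0]   # medaka copy kept in heart
summed = [a + b for a, b in zip(ohnolog_a, ohnolog_b)]

r_a = pearson(gar, ohnolog_a)
r_b = pearson(gar, ohnolog_b)
r_sum = pearson(gar, summed)
print(f"r(ohnolog a) = {r_a:.2f}, r(ohnolog b) = {r_b:.2f}, r(a + b) = {r_sum:.2f}")
```

With sub-functionalization, each copy alone correlates only partly with the gar profile, whereas the two copies taken together recover it almost completely.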

The “gar bridges” led to the identification of many orthologous and paralogous genes and clarified their fate during evolution. Previously, the lack of a direct connection between teleost and tetrapod genomes often led to the wrong use of the word “innovation” for one group or the other. I think that this work is an excellent starting point for connecting the evolution of the genetic, developmental and physiological mechanisms that made the human genome evolve to its present state. To fully understand the differences between humans and the model organisms used in biomedicine, it is crucial to create very powerful and close-to-reality models. For these reasons, this path should not stop here, because the gar is only one species of Holostei, which is composed of nine species and two orders. The study of their genomes, and of those of other so-called “primitive” fish, can help to shed more light on the striking points that have emerged from this study. Perhaps the outcome of other comparative studies will give even more emphasis to these results, or provide answers that may now seem counterintuitive.


Braasch I, Gehrke AR, Smith JJ, Kawasaki K, Manousaki T, Pasquier J, Amores A, Desvignes T, Batzel P, Catchen J, Berlin AM, Campbell MS, Barrell D, Martin KJ, Mulley JF, Ravi V, Lee AP, Nakamura T, Chalopin D, Fan S, Wcisel D, Cañestro C, Sydes J, Beaudry FE, Sun Y, Hertel J, Beam MJ, Fasold M, Ishiyama M, Johnson J, Kehr S, Lara M, Letaw JH, Litman GW, Litman RT, Mikami M, Ota T, Saha NR, Williams L, Stadler PF, Wang H, Taylor JS, Fontenot Q, Ferrara A, Searle SM, Aken B, Yandell M, Schneider I, Yoder JA, Volff JN, Meyer A, Amemiya CT, Venkatesh B, Holland PW, Guiguen Y, Bobe J, Shubin NH, Di Palma F, Alföldi J, Lindblad-Toh K, & Postlethwait JH (2016). The spotted gar genome illuminates vertebrate evolution and facilitates human-teleost comparisons. Nature genetics, 48 (4), 427-37 PMID: 26950095


ExAC presents a catalogue of human protein-coding genetic variation

Exploring the variability of human genomes is a key step toward the holy grail of human genetics, linking genotypes with phenotypes; it also provides insights into human evolution and history. For this purpose the Exome Aggregation Consortium (ExAC) was founded, to capture the variability of human exomes using next-generation sequencing. The first ExAC dataset of 63,358 individuals was released on 20 October 2014. Recently, a paper describing an updated version of the dataset was published: Analysis of protein-coding genetic variation in 60,706 humans.

The authors did great work on the reproducibility of their downstream analyses and, more generally, on the availability of data. All the code is well documented in a blog post and available in a GitHub repository. I plotted all the figures in this blog post myself!


ExAC comprises almost ten times more individuals than previous datasets of a similar kind (Fig 1a). In total, 91,000 individuals were sequenced, of which 60,706 were kept after quality filtering. The Finnish population was separated from the other Europeans because of the bottleneck it has gone through.

ExAC targeted individuals with various genetic backgrounds. Principal component analysis shows a very strong geographical pattern in the dataset (Fig 1b). I expected a continuum of haplotypes wherever there is no strong geographic obstacle (like the European-Latino continuum). The gap between the South Asian samples and the European samples on the PCA plot is most likely caused by the absence of samples from the Middle East. Middle Eastern samples have a colour in the legend, but no data points. Central Asians do not even have a colour.

Figure 1: Size and diversity of the ExAC dataset. a, The ExAC dataset is almost ten times bigger than datasets of a similar kind, the 1000 Genomes Project and the Exome Sequencing Project (ESP); more importantly, it captures a far greater diversity of human populations than either. b, The geographic signal of populations visualized using principal component analysis (PCA). The first principal component captures the variability of African samples and does not tell much about the rest of the dataset (Extended Data Figure 5 in the paper), therefore the second and third principal components are shown.

A total of 45 million nucleotide positions with sufficient coverage (>10x in at least 80% of individuals) are present in ExAC. These positions correspond to 18 million theoretically possible synonymous variants, of which ExAC captures 1.4 million (7.5%).

The size of ExAC makes it possible to observe…

…mutational recurrence: 43% of synonymous de novo variants identified in previous studies were also identified in ExAC, which is the first direct evidence of mutational recurrence.

…multiple alleles: 7.9% of high-quality polymorphic sites are multiallelic, which is fairly close to the Poisson expectation (whatever that means…)

…a LOT of variants: after all the filtering, 7,404,909 high-quality variants were identified, of which 317,381 are indels. The density of variants is on average one per eight bases. 99% of the variants have a frequency below 1%, and 54% of the variants are singletons (i.e. only one individual carries the variant).

…selection effects: the proportion of singletons among polymorphisms can serve as a measure of the purifying selection acting on polymorphisms of a given size. Figure 2 shows that indels that do not affect the open reading frame (ORF) have significantly fewer singleton variants than indels that do affect the ORF. There is also a significant difference between ORF-affecting indels of different sizes, but we (our topic group) have not found any possible explanation for this pattern.

…saturation of alleles at CpG sites: CpG sites have a very high rate of transitions, so capturing all possible variants is substantially easier than for other sites. A subset of 20,000 individuals from the ExAC dataset already shows saturation of alleles: all possible non-lethal synonymous CpG transition variants are present. ExAC is the first dataset to show saturation of human variation.

Figure 2: Indel frequencies with respect to size. a, Deletions are more frequent than insertions, and smaller indels are more probable than larger ones. Taking the greater probability of smaller indels into account, the frequency of indels that do not shift the open reading frame is a bit higher than the frequency of indels that do. b, The proportion of singletons among indels (a proxy for the strength of selection) is significantly and consistently lower for all indels that do not shift the open reading frame (-6, -3, +3, +6).

Deleterious alleles

The authors introduce the mutability-adjusted proportion of singletons (MAPS) metric as a measure of selection. This metric corrects for biases caused by different mutation rates, allowing comparisons between categories with different mutability. A comparison across functional classes is shown in Figure 3. MAPS shows higher values for categories predicted to be deleterious by conservation-based methods.

Figure 3: MAPS values of different functional classes. MAPS is highest for nonsense substitutions and is also consistent with the PolyPhen and Combined Annotation Dependent Depletion (CADD) classifications.
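
The idea of a mutability-adjusted singleton proportion can be sketched as follows. This is a simplified illustration and not the ExAC code: it assumes that the expected singleton proportion is predicted from mutability with a straight line fitted on synonymous variants, and all numbers (mutability bins, allele counts) are invented.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x


def proportion_singleton(allele_counts):
    """Fraction of variants seen exactly once in the dataset."""
    return sum(1 for ac in allele_counts if ac == 1) / len(allele_counts)


# Hypothetical calibration data: synonymous variants grouped by mutability.
syn_mutability = [1.0, 2.0, 3.0, 4.0]
syn_singleton_prop = [0.60, 0.55, 0.50, 0.45]  # more mutable, fewer singletons
slope, intercept = fit_line(syn_mutability, syn_singleton_prop)


def adjusted_proportion_singleton(allele_counts, mean_mutability):
    """Observed singleton proportion minus the value expected from mutability."""
    expected = intercept + slope * mean_mutability
    return proportion_singleton(allele_counts) - expected


# Hypothetical allele counts for two categories with the same mutability.
nonsense_counts = [1, 1, 1, 1, 2, 1, 1, 3, 1, 1]
neutral_counts = [1, 2, 1, 5, 1, 3, 1, 2, 1, 4]
maps_nonsense = adjusted_proportion_singleton(nonsense_counts, 3.0)
maps_neutral = adjusted_proportion_singleton(neutral_counts, 3.0)
print(f"nonsense: {maps_nonsense:.2f}, neutral-like: {maps_neutral:.2f}")
```

A category under stronger purifying selection keeps its variants rare, so its adjusted value ends up above that of a neutral-like category with the same mutability.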

Rare diseases

The average ExAC individual carries ~54 variants reported as causing Mendelian disease. Approximately 41 of these alleles have a frequency greater than 1%, so they are not expected to result from problems in variant calling but from misclassification of variants in the database. The evidence for 192 previously reported variants was manually curated; of those, only 9 had sufficient evidence of disease association. High allele frequencies were found mainly in the previously underrepresented Latino and South Asian populations.

ExAC shows the importance of a matching reference population for identifying disease-causing variants. An example is the recessive disease North American Indian childhood cirrhosis, previously reported to be caused by CIRH1A p.R565W. This variant was found in the homozygous state in four individuals from the Latino population, none of whom had a record of liver problems during childhood.


ExAC shows the importance of sampling diverse populations for capturing the real link between genotype and phenotype. Even though ExAC provides a lot of new insights, there are still populations that are underrepresented or not represented at all.

Given the richness of ExAC and the authors’ effort in data sharing and availability, I expect that it will be a great resource for various analyses by many researchers around the globe.

Lek M, Karczewski KJ, Minikel EV, Samocha KE, Banks E, Fennell T, O’Donnell-Luria AH, Ware JS, Hill AJ, Cummings BB, Tukiainen T, Birnbaum DP, Kosmicki JA, Duncan LE, Estrada K, Zhao F, Zou J, Pierce-Hoffman E, Berghout J, Cooper DN, Deflaux N, DePristo M, Do R, Flannick J, Fromer M, Gauthier L, Goldstein J, Gupta N, Howrigan D, Kiezun A, Kurki MI, Moonshine AL, Natarajan P, Orozco L, Peloso GM, Poplin R, Rivas MA, Ruano-Rubio V, Rose SA, Ruderfer DM, Shakir K, Stenson PD, Stevens C, Thomas BP, Tiao G, Tusie-Luna MT, Weisburd B, Won HH, Yu D, Altshuler DM, Ardissino D, Boehnke M, Danesh J, Donnelly S, Elosua R, Florez JC, Gabriel SB, Getz G, Glatt SJ, Hultman CM, Kathiresan S, Laakso M, McCarroll S, McCarthy MI, McGovern D, McPherson R, Neale BM, Palotie A, Purcell SM, Saleheen D, Scharf JM, Sklar P, Sullivan PF, Tuomilehto J, Tsuang MT, Watkins HC, Wilson JG, Daly MJ, MacArthur DG, & Exome Aggregation Consortium. (2016). Analysis of protein-coding genetic variation in 60,706 humans. Nature, 536 (7616), 285-91 PMID: 27535533


Papers to discuss Autumn 2016

This Autumn, we will continue to discuss papers in related series:

Series 1: human genome evolution


  1. Sulem et al 2015 Identification of a large set of rare complete human knockouts. Nature Genetics 47: 448–452
  2. Henn et al 2016 Distance from sub-Saharan Africa predicts mutational load in diverse human genomes. PNAS 113: E440-E449
  3. Lek et al 2016 Analysis of protein-coding genetic variation in 60,706 humans. Nature 536: 285–291

Series 2: moth coloration


  1. van’t Hof et al 2016 The industrial melanism mutation in British peppered moths is a transposable element. Nature 534: 102–105
  2. Nadeau et al 2016 The gene cortex controls mimicry and crypsis in butterflies and moths. Nature 534: 106–110

Series 3: gene and genome duplication


  1. Braasch et al 2016 The spotted gar genome illuminates vertebrate evolution and facilitates human-teleost comparisons. Nature Genetics 48: 427–437
  2. Lien et al 2016 The Atlantic salmon genome provides insights into rediploidization. Nature 533: 200–205
  3. Lan and Pritchard 2016 Coregulation of tandem duplicate genes slows evolution of subfunctionalization in mammals. Science 352: 1009-1013



The genetic sex-determination system predicts adult sex ratios in tetrapods

Genetic sex determination, i.e. the determination of sexual phenotypes by the effect of sex-determining genes, is found in the majority of vertebrates. Sex determination genes have evolved multiple times independently and can be located on different chromosomes. Depending on whether the presence of the sex determining region (SDR) determines female or male sex, genetic systems of sex determination are called ZW or XY systems respectively, and the sex which is heterozygous for the SDR is called the heterogametic sex. Lower fitness in the heterogametic sex has long been observed in interspecific hybrids in a wide range of animal and even plant species, an observation called Haldane’s rule. In this paper the authors find a similar pattern in (non-hybrid) tetrapod species: by comparing the adult sex ratio (ASR) in XY and ZW systems in 344 tetrapod species, they find that the ASR is skewed towards the homogametic sex (towards females in an XY system and towards males in a ZW system).

This observation is based on a dataset containing known genetic sex determination systems and adult sex ratios (ASRs) of species across the vertebrate phylogeny. Within amphibians and reptiles (in which both XY and ZW systems are found), the authors show that ASRs in ZW systems are significantly more male biased than in XY systems and that the proportion of species with male-biased ASRs is greater in ZW than in XY systems. Furthermore, these observations hold true for the combined dataset of amphibians, reptiles, mammals (which have a conserved XY system and female-biased ASRs), and birds (which have a conserved ZW system and male-biased ASRs).

It is important to test whether these observations are actually caused by the genetic sex-determination system (GSD) or whether there are other factors which could systematically influence the ASR:

– ASRs could be influenced by body size and breeding latitude through correlated life history traits like development, growth and reproductive ecology.

– Differences in body size and dispersal between sexes can lead to differences in mortality which influence ASRs.

The authors account for potential effects of sex-biased dispersal, body size, breeding latitude and sexual size dimorphism in a phylogenetically corrected multi-predictor analysis. Although they do find a significant correlation between sexual size dimorphism and ASR as well as between sex-biased dispersal and ASR, the effect of the GSD remains significant in all cases. Because the dataset for sex-biased dispersal is limited to 32 species in total, which is less than 10% of the number of species in the complete dataset, it is not included in the main multi-predictor model.

Another important factor is the effect of phylogenetic relatedness between species: The effects of GSDs on ASRs of more closely related species are more likely to be correlated due to shared genetic and phenotypic traits.

To account for this, phylogenetic corrections, which are based on composite phylogenies of different tetrapod groups, are applied. As these composite phylogenies don’t include branch length information, different methods are used to assign arbitrary branch lengths, which has surprisingly little effect on the results. Two different methods are applied to account for phylogenetic relatedness across samples: phylogenetic generalized least squares (PGLS) models to test for differences in ASRs between XY and ZW taxa, and Pagel’s discrete method (PDM) to test the fit of dependent and independent models of transitions in ASR bias and GSD. As the second model implies, the number of transitions between GSDs should be more important than the phylogenetic relatedness between species. The authors claim to take this into account by rerunning their analyses while reducing three large groups with a known shared sexual system (mammals, birds and snakes) to a single datapoint each, which leaves the significant differences in ASRs between GSDs unchanged.

I wonder whether it would also make a difference to reduce further groups, which share non-independent evolution of SDRs, to single datapoints. For example this dataset includes five species of lizards from the family Lacertidae, which are assumed to share a conserved GSD (Rovatsos et al. 2016) and 9 lizard species of the genus Anolis included in the dataset are likely to share a common sex chromosome system (Gamble et al. 2014). Furthermore in many amphibians and reptiles nothing is known about synteny across sex chromosomes and it is likely that a rigorous reduction of GSDs with common ancestry into single datapoints would reduce the number of independent observations and thus statistical power.

However, the number of relevant datapoints in amphibians is fairly limited anyway: Amphibian species with an XY sex determination system show no significant ASR bias (or even a slight male bias after phylogenetic correction). Thus the observed effect within amphibians relies on data for only 11 species with a ZW system.

There are good reasons to be careful when making general conclusions from this dataset:

Sex reversal is common in some amphibian species, which could bias the observed ASRs. Furthermore, although the authors claim to have included only species with known GSDs, the GSD of amphibians with homomorphic, microscopically indistinguishable sex chromosomes is difficult to determine and a frequent subject of scientific disagreement.

One example for this is Bufo viridis. The ASR of B. viridis is strongly male biased (0.70), and the GSD is supposed to be a ZW system based on the entry from However, the claim that B. viridis is female heterogametic is based on a single study, which detected that all seven females examined in a single Moldavian population were heterozygous for a chromosomal inversion. Such a pattern has never been found in any other green toad population, but instead multiple sex linked genetic markers have been developed, which show male-heterogametic segregation patterns in crosses from different B. viridis populations as well as in the closely related species B. siculus, B. balearicus and B. variabilis (Stöck et al. 2011). In my opinion it would be more appropriate to assign B. viridis to species with XY system, which would result in a decrease in the overall differences in ASRs between both groups.

Possible reasons for the effect of the sex-determination system on adult sex ratios

In general, a skewed adult sex ratio can have two different causes: a skewed gametic sex ratio, or higher mortality of one sex resulting in a different sex ratio in adults. In more detail, six potential, not mutually exclusive explanations of how the GSD could bias adult sex ratios are proposed and discussed:

– Sexual selection in males could increase mortality.

This would be expected to result in a bias towards females in XY and ZW systems and cannot explain male biased ASRs in ZW systems.

– Recessive deleterious mutations on X/Z chromosomes or Y/W specific deleterious mutations.

Recombination suppression on sex chromosomes leads to degeneration of the sex-linked region on Y/W chromosomes, which can result in adverse fitness effects caused by either deleterious mutations on the Y/W, or deleterious recessive mutations on the hemizygous part of the X/Z chromosome.

Based on a population genetic model they develop, the authors claim that the accumulation of deleterious mutations may not be enough to cause the observed adult sex-ratio bias. However, they admit that many of their parameter estimates are very crude and results may vary when other factors are taken into account, like large differences in the rate of deleterious mutations.

The number of deleterious mutations is expected to increase with increasing sex chromosome differentiation and degeneration. Sex chromosome differentiation in tetrapods spans a wide range from completely homomorphic sex chromosomes in many lizards and amphibians but also in some families of snakes and birds to complete loss of the Y chromosome in some mammals. It would thus be interesting to look if there is an association between variable sex chromosome degeneration and skews in the ASR within groups with homologous sex chromosomes.

– Imperfect dosage compensation.

In the heterogametic sex, genes located in the hemizygous region of the X/Z chromosome are present in only one functional copy. In order to reach similar expression levels as in the homogametic sex, the expression of these genes has to be increased. However, research has shown that not all genes are upregulated in the same way and as a result many sex chromosomal genes have lower expression levels in the heterogametic than in the homogametic sex.

This explanation is unlikely to result in a general pattern across tetrapods, because there are different mechanisms of dosage compensation in vertebrates: mammals deactivate one X chromosome in females to compensate for gene loss on the Y chromosome, while birds show incomplete dosage compensation on a gene-by-gene basis. Since one X is deactivated in the homogametic sex in mammals, we would expect to find sex-specific fitness differences based on dosage compensation only for non-mammals.

– Meiotic drive:

Meiotic drive systems are genetic variants which favor their own transmission by distorting sex ratios at meiosis. The authors point out that the observed skews in ASR are unlikely to be caused by meiotic drive, because the sex ratio at birth does not predict the adult sex ratio in mammals and birds. However, there is little information on sex ratio at birth in reptiles or amphibians. Furthermore, a better measure for the effect of meiotic drive would be the gametic sex ratio, since the sex ratio may already be skewed at birth due to sex-specific differences in embryonic mortality.

– More rapid degeneration of Y and W chromosomes during lifetime:

The authors propose that the Y/W may be more affected by further degeneration during lifetime (for example by increased telomere shortening or loss of epigenetic marks). To my knowledge this is rather speculative, as I am not aware of any results supporting this hypothesis.

– Sexually antagonistic selection:

Loci that are beneficial to only one sex but may be detrimental to the other are expected to accumulate on sex chromosomes. In an XY system, male-beneficial loci are expected to be found in linkage disequilibrium with the SDR, which ensures that they are exclusively transmitted to males. The positive fitness effects of these Y/W-linked sexually antagonistic mutations would thus result in a positive skew towards the heterogametic sex (although the evolution of recombination suppression may introduce further degeneration of the Y/W chromosome, which can be detrimental). Furthermore, the authors develop a model for sexually antagonistic selection on loci located on X/Z chromosomes and conclude that there are no robust generalizations about the direction of the skew of the adult sex ratio resulting from these loci.

The authors point out that there is no clear support for any of these hypotheses. Further research could test the assumptions of some of them: recessive deleterious mutations on X/Z chromosomes or Y/W-specific deleterious mutations, imperfect dosage compensation and sexually antagonistic selection are all related to sex chromosome degeneration and recombination suppression. Although it is difficult to comparatively quantify sex chromosome degeneration across species, more high-quality sequences of sex chromosomes are becoming available, and it may soon be possible to link sex chromosome degeneration at the gene level to sex-specific fitness differences. A very crude proxy for this would be to include in the analysis whether sex chromosomes are microscopically distinguishable (heteromorphic) or indistinguishable (homomorphic), and to test whether this explains significant variance in ASRs. Further research could also clarify whether there is a connection between ASR and sex ratio at birth, or even better gametic sex ratio, in amphibians or reptiles, which could be indicative of meiotic drive.


Overall, I am skeptical that treating sexual systems as a simple binary character (male or female heterogametic) adequately represents the diversity of tetrapod sex chromosome systems, and I expect that fitness differences should be more related to sex chromosome degeneration than to the GSD itself. Although a significant proportion of the interspecific variation in ASRs is explained by the GSD in groups with variable sex determination systems, there are multiple possible confounding factors (like sex reversal, problems in determining GSDs, uncertainty of common ancestry of GSDs), which could easily lead to biases in the relatively small number of observations in these groups.


Gamble T, Geneva AJ, Glor RE, Zarkower D (2014). Anolis sex chromosomes are derived from a single ancestral pair. Evolution.68(4):1027-41

Rovatsos M, Jasna V, Altmanova M, Johnson Pokorna M (2016). Conservation of sex chromosomes in lacertid lizards. Molecular Ecology.

Stöck M, Croll D, Dumas Z, Biollay S, Wang J, Perrin N (2011). A cryptic heterogametic transition revealed by sex-linked DNA markers in Palearctic green toads. Journal of Evolutionary Biology. 24:1064-1070


Identification of a large set of rare complete human knockouts

High-throughput genotyping and sequencing have led to the discovery of numerous sequence variants associated with human traits and diseases. An important type of variant involved is the Loss of Function (LoF) mutation (frameshift indels, stop-gain and essential splice-site variants), which is predicted to completely disrupt the function of a protein-coding gene. In the case of Mendelian recessive diseases, for the condition to occur, the LoF variants must be biallelic, i.e. affecting both copies of a gene. The affected gene is then defined as a “knockout”.

By studying the Icelandic population, the authors aim to identify rare LoF mutations (Minor Allele Frequency, MAF < 2%) present in individuals participating in various disease projects. They then investigate how frequently these LoF mutations are homozygous (i.e. knockout) in the germline genome.

The Icelandic population

Iceland is well suited for genetic studies for three main reasons. The island was colonized around the 9th century by 8-20 thousand settlers, and the population has since grown to around 320’000 inhabitants today. The initial founder effect and rare genetic admixture make the Icelandic population a genetic isolate. In addition to this unusual genetic isolation, Iceland benefits from a genealogical database containing family histories reaching centuries back in time, as well as broad access to nationwide healthcare information.

These characteristics led to the development of large-scale genomic studies of Icelanders by deCODE Genetics. This biopharmaceutical company has published various studies, including this paper, related to genetic variants and diseases in Icelanders.

Loss of function mutations and rare complete knockouts

The authors sequenced the whole genomes of 2’626 Icelanders participating in various disease projects and identified variants in protein-coding genes. These variants were annotated with their predicted impact on the gene: LoF, moderate or low impact. A total of 6’795 LoF mutations in 4’924 genes were identified, with most of these variants (6’285) being rare (MAF < 2%).

The identified LoF variants were imputed into an additional 101’584 chip-genotyped and phased Icelanders, allowing the identification of the number of knockout genes in the population. Authors found that 1’485 previously identified LoF mutations (MAF <2%) are contributing to the knockout of 1’171 genes and that 8’041 individuals possess at least 1 of these knockout genes. Out of these 1’171 genes, 88 had been already linked by previous studies to conditions through a recessive mode of inheritance.

Double transmission deficit of LoF variants

Because knockout genes should be deleterious for an organism, we expect a deficit of homozygotes for these genes in the population due to embryonic/fetal, perinatal or juvenile lethality. To investigate whether such a deficit was present, the authors calculated the transmission probability of LoF variants from parents to their offspring.

Under Mendelian inheritance, the expected probability that two heterozygous parents both transmit the LoF allele to an offspring (i.e. double transmission) is 25%. However, the results show a statistically significant deficit: the observed double transmission probability is 23.6%.
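
As a rough sketch of what such a deficit test can look like, the observed double transmission rate can be compared to the Mendelian 25% with a normal approximation to the binomial. The offspring counts below are hypothetical and only chosen to reproduce the 23.6% rate; the paper's own test and sample sizes are not given in this post.

```python
from math import erfc, sqrt


def transmission_deficit_test(n_offspring, n_double, expected=0.25):
    """Two-sided z-test of the observed double transmission rate against `expected`."""
    observed = n_double / n_offspring
    standard_error = sqrt(expected * (1 - expected) / n_offspring)
    z = (observed - expected) / standard_error
    p_two_sided = erfc(abs(z) / sqrt(2))
    return observed, z, p_two_sided


# Hypothetical counts: offspring of two heterozygous carriers, and how many
# of them received the LoF allele from both parents.
observed, z, p = transmission_deficit_test(n_offspring=10_000, n_double=2_360)
print(f"observed = {observed:.3f}, z = {z:.2f}, p = {p:.4f}")
```

With these made-up counts the deficit of 1.4 percentage points is already significant; with fewer offspring the same rate would not be.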

The rare LoF mutations were ranked according to the Residual Variation Intolerance Score (RVIS) percentiles and essentiality score percentiles. Both measures attempt to classify genes according to their tolerance to functional variation, with the lowest ranks corresponding to the genes most sensitive to mutations. As expected, the lowest double transmission rate was found for the most sensitive genes (first percentile), suggesting that a homozygous state of LoF mutation in these genes is deleterious.

Tissue specific expression of knockout genes

The authors investigated whether genes were more likely to be knocked out when expressed in specific tissues. Using information from previous studies on the genes that are highly expressed in 1 or more, but not all, of 27 tissues, they calculated the fraction of these genes that were knocked out for each tissue. They found that the brain and placenta were the tissues with the lowest fraction of knockout genes (3.1% and 3.9%, respectively), and that testis, small intestine and duodenum showed the highest fraction of biallelic LoF mutations (5.8%, 6.4%, and 6.9% respectively).

Conclusion and Comments

The characteristics of the Icelandic population and the incredibly large sample size (~ 1/3 of the total population) allowed the authors to identify a large number of new and rare LoF mutations. Some of these mutations were shown to contribute to the knockout of an unexpectedly large number of genes in an unexpectedly large number of people. This study is the first to shed light on the astonishing number of knockouts present in human populations. In addition, by investigating the transmission probability, a deficit in homozygous loss-of-function offspring was identified, especially when LoF mutations affected essential genes. This result was expected because of the predicted deleterious effect of biallelic LoF mutations.

Besides the aforementioned interesting results of the paper, some aspects were slightly disappointing. First, I was expecting authors to focus more on the genotype-phenotype aspects. Even if they pinpoint a deficit in double transmission, suggesting deleterious consequences for the organism, authors did not discuss the function of the identified knockout genes and their effect on the phenotype. Second, the paper was not an easy read. Many results were only mentioned without additional information on the methods or data used, and it was sometimes difficult to link them with the main aim of the study. Additionally, figures were sometimes misleading because of different axis scales or incomplete legends.

Finally, the authors suggested that important tissues, such as the brain, have fewer knockouts than other tissues, writing that “genes that are highly expressed in the brain are less often completely knocked out than other genes”. However, this result is questionable, as we have no measure of the number of knockout genes that we would expect to be expressed in each tissue by chance alone. In other words, the brain could have fewer expressed knockout genes than other tissues simply because the total number of genes expressed in the brain is lower. We therefore do not know whether the lower number of knockout genes in the brain is due to chance or to biological reasons.

Nevertheless, this study opens the door to understanding how many gene knockouts occur without phenotypic consequences in humans, what the function and essentiality of these genes are, and what role the environment plays in shaping the phenotype. The classical search for genetic variants associated with a phenotype, as in GWAS, could be reversed by first identifying individuals with the same genetic variants and then precisely phenotyping them.

Sulem, P., Helgason, H., Oddson, A., Stefansson, H., Gudjonsson, S., Zink, F., Hjartarson, E., Sigurdsson, G., Jonasdottir, A., Jonasdottir, A., Sigurdsson, A., Magnusson, O., Kong, A., Helgason, A., Holm, H., Thorsteinsdottir, U., Masson, G., Gudbjartsson, D., & Stefansson, K. (2015). Identification of a large set of rare complete human knockouts. Nature Genetics, 47 (5), 448-452. DOI: 10.1038/ng.3243

Posted in genomics, human

Supergenes and social organization in a bird species




Cindy Dupuis, Xinji Li, Casper van der Kooi


The development of new molecular methods and next-generation sequencing techniques has advanced our knowledge of the genetic basis of phenotypic polymorphism. Over the course of recent years, scientific studies have documented large genomic regions with drastic phenotypic effects, the so-called supergenes. A supergene is a set of genes on the same chromosome that exhibit close genetic linkage and are thus inherited as one unit.

The evolution of a supergene requires that multiple loci with complementary effects become linked (i.e. they are genetically clustered and recombination between the loci is suppressed) and that optimal alleles at the linked loci are combined. Genetic clustering of different loci can occur when, via mutation, an adaptive interaction between two closely placed loci is created. In addition, gene duplications or translocations that generate a series of (novel) complementary genes can give rise to supergenes. The probability of a recombination event occurring between two loci depends on various factors. It is small when the loci are located close together, because the chance of recombination between two loci generally increases with the physical distance between them. Given the large size of supergenes, additional mechanisms nonetheless seem important. Suppressed recombination can, for instance, be maintained by structural differences, such as inversions, between the supergene and its homologous chromosomal region.
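The link between distance and recombination can be illustrated with Haldane's map function, a standard textbook model that assumes no crossover interference. The sketch below is only an illustration: the function name and the example distances are ours, not taken from the papers discussed here.

```python
import math

def haldane_recombination_fraction(map_distance_morgans: float) -> float:
    """Expected recombination fraction between two loci under
    Haldane's map function (assumes no crossover interference)."""
    return 0.5 * (1.0 - math.exp(-2.0 * map_distance_morgans))

# Illustrative genetic distances in centimorgans (hypothetical values).
for d_cm in (0.1, 1, 10, 50, 200):
    r = haldane_recombination_fraction(d_cm / 100.0)
    print(f"{d_cm:>6} cM -> recombination fraction {r:.4f}")
```

Tightly linked loci recombine rarely, while the recombination fraction approaches 0.5 for distant loci. A supergene spanning several megabases is therefore unlikely to stay intact through proximity alone, which is why inversions matter.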

An interesting example of a supergene in an invertebrate is the case documented by Purcell et al. (2014). They described a large, nonrecombining region that is associated with social organisation in an ant species. The nonrecombining region was found to make up most of one chromosome and was hence aptly called the ‘social chromosome’. They found a structurally similar region with similar effects in another ant species; however, the two regions exhibit no homology, suggesting parallel evolution of the social chromosome. Examples of vertebrate social systems determined by supergenes were, to our knowledge, unknown.

Two recent articles (Küpper et al., 2016; Lamichhaney et al., 2016) revealed a single supergene controlling alternative male mating tactics in the ruff (Philomachus pugnax). The studies were carried out independently by two research groups, but reach almost the same conclusions. The ruff is a lekking wader known for the great diversity of male plumage color and for its behavioral polymorphism. Three types of males can be distinguished; these types are characterized by differences in territoriality and behavior that are highly correlated with differences in nuptial plumage and body size. Predominantly dark-colored Independent males are the most common (80-95% of males); these males defend small territories on a lek. Smaller, lighter-colored Satellite males (5-20%) are non-territorial and less faithful to a particular lek. Satellite males make use of the territories of resident Independent males, who largely tolerate them. The third type are the Faeder males, which are very rare (<1% of males). Faeder males lack male display, are small and resemble the unornamented females; however, they have disproportionately large testes.

Previous studies using pedigrees of large, captive populations showed that the reproductive polymorphism follows a single-locus autosomal pattern of inheritance (Lank et al., 1995; Lank et al., 2013). The dominant Faeder allele controls development into Faeder males, whereas the Satellite allele (which is dominant to Independent) determines whether a male develops into a Satellite or an Independent. Ekblom et al. (2012) studied nucleotide sequence variation and gene expression in ornamental feathers from 5 Independent and 6 Satellite males using transcriptome sequencing. No significant expression divergence of pre-identified coloration candidate genes was found, but many genetic markers showed nucleotide differentiation between the two morphs. Later, Farrell et al. (2013) used linkage analysis and comparative mapping to locate the Faeder locus, and found linkage to microsatellite markers on avian chromosome 11, a region that includes the Melanocortin-1 receptor (MC1R) gene. MC1R is a strong candidate for alternative male morph determination because it is considered to be important in plumage coloration.

Using the captive population that was previously phenotyped, Küpper et al. set out to determine the genomic structure underlying morph divergence in P. pugnax. The first step in their analysis was to generate and annotate the full genome of one Independent male. Next, the authors identified SNPs in the population using RAD sequencing. More than one million SNPs could be distinguished, and the Faeder and Satellite loci could be placed on a genetic map based on 3,948 SNPs. Interestingly, both morphs mapped to the same region on chromosome 11, but exhibited clear structural differences. This was corroborated by a GWAS analysis of 41 unrelated Satellite, Independent and Faeder males from a natural population.


To characterize the genomic region more precisely, they conducted whole-genome sequencing of a small set of Independent, Satellite and Faeder males. They showed that the region on chromosome 11 was highly differentiated between Satellite and Faeder morphs and that this region contained greater nucleotide variation than the adjacent regions. Using read orientation, they found clear evidence for an inversion of the chromosomal region between the different morphs. Interestingly, they found that one breakpoint occurs within an essential gene, CENPN (encoding centromere protein N, recessive lethal), which implies that individuals homozygous for the inversion are not viable, an observation that is confirmed by breeding experiments. The authors also suggested that a recombination event or gene conversion occurred between the Satellite and Independent alleles.
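The consequence of recessive lethality for offspring ratios can be worked out with a simple Mendelian cross. The sketch below is our own illustration, not an analysis from the paper. It assumes single-locus inheritance with alleles labelled I (ancestral), S and F (inversion haplotypes), and it assumes that any genotype carrying two inversion haplotypes is inviable. The text above only states this for inversion homozygotes, so extending it to F/S compounds is an assumption.

```python
from collections import Counter
from itertools import product

def offspring_genotypes(parent1, parent2):
    """Mendelian genotype proportions for a single-locus cross.
    Each parent is a pair of allele labels, e.g. ("S", "I")."""
    counts = Counter(tuple(sorted(pair)) for pair in product(parent1, parent2))
    total = sum(counts.values())
    return {genotype: n / total for genotype, n in counts.items()}

def viable_proportions(proportions, inversion_alleles=("S", "F")):
    """Drop genotypes with two inversion haplotypes (assumed lethal here)
    and renormalise the remaining proportions."""
    viable = {g: p for g, p in proportions.items()
              if not all(allele in inversion_alleles for allele in g)}
    total = sum(viable.values())
    return {g: p / total for g, p in viable.items()}

# Hypothetical cross between two carriers of the Satellite haplotype.
at_conception = offspring_genotypes(("S", "I"), ("S", "I"))
surviving = viable_proportions(at_conception)
print(at_conception)
print(surviving)
```

Under these assumptions a quarter of the zygotes from a carrier-by-carrier cross are lost, and two thirds of the surviving offspring carry the inversion instead of the usual three quarters.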


By comparing gene sequences among morphs, the authors discovered that 78% of the gene sequences differed between morphs, and that those differences had the potential to change the encoded protein. Among the divergent genes, some were found to be involved in hormone production, like HSD17B2, which encodes an enzyme that inactivates testosterone and estradiol. Because it varies between morphs, this enzyme may alter steroid metabolism and partly explain why plumage patterns and behavior differ between morphs. The MC1R gene was also found within the altered genomic region. This gene is considered an important locus controlling color polymorphism, and could be at the source of the reduced melanin levels in Satellites. The PLCG2 gene, which has been rearranged in Faeders, was found to be a candidate gene for the rather feminine appearance and non-aggressive behavior of Faeders. Presumably, this gene is part of a cascade leading to the development of the impressive plumage typical of the other male morphs.


In a second article, Lamichhaney et al. (2016) studied a natural ruff population using whole-genome sequencing. They first established a high-quality reference genome assembly from an Independent male and conducted functional annotation based on both evidence data and de novo gene predictions. Then, whole-genome resequencing and SNP calling were performed for 15 Independent, 9 Satellite and 1 Faeder males. Their genome-wide screen for genetic differentiation (FST) between male morphs identified a 4.5-Mb region, based on which Independents and Satellites could be phylogenetically clustered as distinct groups. Screening for structural variants identified a 4.5-Mb inversion in Satellites that perfectly overlapped with the differentiated region. In addition, PCR-based sequencing confirmed the positions of the proximal and distal breakpoints and identified a 2,108-bp insertion of a repetitive sequence at the distal breakpoint. Diagnostic tests showed that Satellite males were heterozygous (S/I), while most Independent males were homozygous (I/I). They suggested that the Independent allele represents the ancestral state, which is consistent with the conserved synteny among birds.
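To give an idea of what such an FST screen measures, here is a minimal sketch using the textbook heterozygosity definition FST = (H_T - H_S) / H_T for two equally weighted groups. It is not the estimator used in the paper, and the allele frequencies are hypothetical.

```python
import numpy as np

def fst_two_pops(p1, p2):
    """Per-site FST = (H_T - H_S) / H_T for two equally weighted groups,
    given the frequency of one allele in each group."""
    p1 = np.asarray(p1, dtype=float)
    p2 = np.asarray(p2, dtype=float)
    p_bar = (p1 + p2) / 2.0
    h_t = 2.0 * p_bar * (1.0 - p_bar)  # expected heterozygosity, pooled
    h_s = (2.0 * p1 * (1.0 - p1) + 2.0 * p2 * (1.0 - p2)) / 2.0  # mean within groups
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(h_t > 0, (h_t - h_s) / h_t, np.nan)

# Hypothetical SNPs: the first two lie outside the inversion (similar frequencies),
# the last two are diagnostic for the inversion (absent in I/I, 0.5 in S/I males).
independents = [0.30, 0.55, 0.0, 0.0]
satellites = [0.32, 0.50, 0.5, 0.5]
print(fst_two_pops(independents, satellites))
```

Because Satellites are heterozygous, a diagnostic SNP reaches a frequency of at most 0.5 in that group, so FST against I/I Independents peaks at 1/3 under this definition rather than at 1. The differentiated region still stands out clearly against a background close to zero.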

The comparison between Faeder and Independent males showed that the genetic differentiation was equally strong across the same region, creating a mirror image of the differentiation pattern between Satellites and Independents. Accordingly, the region could be subdivided into two parts: region A, where Satellite and Faeder chromosomes were closely related to each other and less closely related to Independent, and region B, where the Satellite and Independent loci were more closely related and divergent from Faeder. Since an inversion is expected to reduce recombination within the region between the wild-type (I) and mutant alleles (either S or F), the disruption of the differentiation pattern might be the result of one or two recombination events between an Independent and a Faeder-like chromosome. The divergence time between the Independent allele and the Satellite or Faeder alleles was estimated to be approximately 4 million years, using nucleotide divergence and estimated mutation rates for birds. The last recombination event was estimated to have occurred 520,000 ± 20,000 years ago.
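Dating of this kind usually rests on the molecular-clock relation T = d / (2μ), where d is the per-site divergence between two haplotypes and μ the mutation rate per site per year (the factor 2 accounts for mutations accumulating on both lineages). The sketch below only shows the arithmetic. The input values are hypothetical and were chosen to land in the same order of magnitude as the estimate quoted above; they are not the values used by the authors.

```python
def divergence_time_years(pairwise_divergence, mutation_rate_per_site_per_year):
    """Molecular-clock estimate T = d / (2 * mu): mutations accumulate
    independently on both lineages since the split."""
    return pairwise_divergence / (2.0 * mutation_rate_per_site_per_year)

# Hypothetical inputs, NOT taken from the paper.
d = 0.01        # 1% sequence divergence between inverted and ancestral haplotypes
mu = 1.25e-9    # substitutions per site per year
print(f"{divergence_time_years(d, mu):,.0f} years")
```

The estimate scales inversely with the assumed mutation rate, so the uncertainty on μ carries over directly to the inferred age of the inversion.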

To better understand the genetic consequences of the inversion and relate it to the phenotypic variation in male ruffs, the authors searched for candidate mutations among the genes in the inverted region. Mutations in several genes with important functions were found on Satellite and Faeder chromosomes, including the abovementioned CENPN, HSD17B2 and MC1R genes as well as SDR42E1 (the latter is important for the metabolism of sex hormones). Derived missense mutations in MC1R were found to be associated with the Satellite and Faeder alleles, hinting at a potential mechanism explaining the male plumage polymorphism during the breeding season.

In conclusion, these two studies demonstrated the presence of a genomic inversion that led to the evolution of a supergene. This supergene determines the complex phenotypic variation in male ruffs. Together, the two papers contribute to our understanding of supergenes, complex phenotypes and social organization.


Küpper C, Stocks M, Risse JE, Dos Remedios N, Farrell LL, McRae SB, Morgan TC, Karlionova N, Pinchuk P, Verkuil YI, Kitaysky AS, Wingfield JC, Piersma T, Zeng K, Slate J, Blaxter M, Lank DB, & Burke T (2016). A supergene determines highly divergent male reproductive morphs in the ruff. Nature genetics, 48 (1), 79-83 PMID: 26569125

Posted in evolution, genomics, Uncategorized

Reconstructing human population history : ancestry and admixture

Understanding the evolutionary history of our own species, and how migration and mixture of ancestral populations have shaped modern human populations, is a key question in evolutionary biology. Here we present three articles related to this topic, the first two dealing with India and the third focusing on a single Ethiopian group:

1) Moorjani et al 2013 Genetic Evidence for Recent Population Mixture in India AJHG 93: 422–438

2) Basu et al 2016 Genomic reconstruction of the history of extant populations of India reveals five distinct ancestral components and a complex structure PNAS online before print

3) Van Dorp et al 2015 Evidence for a Common Origin of Blacksmiths and Cultivators in the Ethiopian Ari within the Last 4500 Years: Lessons for Clustering-Based Inference PLOS Genetics 11(8): e1005397

All of them use genome-wide data from microarrays. After a brief summary of each paper, showing their similarities and differences, we discuss their methodological approaches.

Ancestral populations of India

The aim of the first two articles is to understand the history of the populations of the Indian subcontinent. The first one (Moorjani et al 2013) reports data for more than 570 individuals sampled from 73 groups living in India. The authors filtered the data by removing all individuals with evidence of recent admixture or recent ancestry from outside India. The populations included in the analysis can be classified into two linguistic categories: those speaking Indo-European languages and those speaking Dravidian languages.

Figure 1: map of the sampled populations (A) and PCA of 70 Indian groups and some non-Indian groups, highlighting the “Indian cline” (B)

Previous genetic evidence indicates that most of the groups of India descend from a mixture of two distinct ancestral populations: Ancestral North Indians (ANI) and Ancestral South Indians (ASI). Three different hypotheses exist for the date of mixture of these two populations:

1) ANI arrived through a migration prior to agriculture, about 30,000-40,000 years ago

2) ANI arrived with the spread of agriculture, which probably began between 8,000 and 9,000 years ago

3) ANI arrived very recently (3,000-4,000 years ago) when the Indo-European languages presumably began to be spoken in India.

To demonstrate the admixed origin of Indian groups and estimate the proportion of each ancestry in each population, they use a PCA and the f4 ratio, a statistic that infers mixture proportions from correlations in allele frequencies among groups. They showed that all populations are admixed and lie along an “Indian cline”, a gradient of ANI ancestry ranging from 17% to 71%. These results correlate well with geography and language, with the northern Indo-European populations having more ANI ancestry than the southern Dravidian ones. They then use linkage disequilibrium (LD) to estimate the dates of admixture: LD blocks are longer if the admixture is more recent. By fitting an exponential function to the decay of LD (the pattern expected after a sudden cessation of admixture), they estimated that admixture occurred between 1,856 and 4,176 years ago, supporting the third hypothesis. These results correspond with demographic and cultural changes observed in India, where the establishment of the caste system led to strong endogamy that rapidly stopped the admixture. Moreover, they found that Indo-European groups have more recent admixture dates, which could be explained by multiple waves of mixture in these populations. Another finding of this paper is that the aboriginal Andaman Islanders (Onge) belong to a sister group of ASI.
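The dating logic can be sketched as follows: after a pulse of admixture n generations ago, admixture LD between markers separated by a genetic distance d (in Morgans) is expected to decay roughly as exp(-n d), so the decay rate of the fitted curve estimates n. The example below is a toy illustration on simulated values, not the authors' pipeline. The function name, the noise model and the figure of 29 years per generation are assumptions made for this sketch.

```python
import numpy as np

def estimate_admixture_generations(distance_morgans, ld_statistic):
    """Fit ld = A * exp(-n * d) by linear regression on log(ld)
    and return n, the estimated number of generations since admixture."""
    d = np.asarray(distance_morgans, dtype=float)
    log_ld = np.log(np.asarray(ld_statistic, dtype=float))
    slope, _intercept = np.polyfit(d, log_ld, 1)
    return -slope

# Simulated decay curve (hypothetical): admixture 100 generations ago,
# with multiplicative noise so that the statistic stays positive.
rng = np.random.default_rng(0)
true_generations = 100
d = np.linspace(0.001, 0.05, 50)   # 0.1 to 5 cM
ld = 0.02 * np.exp(-true_generations * d) * np.exp(rng.normal(0.0, 0.05, d.size))

n_hat = estimate_admixture_generations(d, ld)
years_per_generation = 29          # assumed generation time
print(f"estimated {n_hat:.1f} generations, about {n_hat * years_per_generation:.0f} years")
```

A single exponential only describes a single pulse of admixture. With several waves the curve becomes a mixture of exponentials and the fitted date tends to reflect the most recent events, which is one way to read the younger dates found in Indo-European groups.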

The second article (Basu et al 2016) has the same focus region and uses the same basic dataset, except that the authors kept all populations in the analyses, including the Austro-Asiatic (AA) and Tibeto-Burman (TB) speakers. They first ran ADMIXTURE on all populations and showed that islanders and mainland populations have distinct ancestral components (islanders share ancestry with Oceanic peoples such as Papuans). They then ran the same analysis on mainland populations only (thus excluding populations from the Andaman and Nicobar islands). The best model was composed of four ancestral components (ANI, ASI, and the ancestral AA and TB), and they found that several present-day populations are almost pure representatives of these ancestral components (figure 2).

Fig. 2: PCA of the 18 mainland Indian populations, with the four clusters identified by the authors circled (A). Admixture plot of mainland Indian populations with four ancestral components (K = 4, the most parsimonious) (B).

They further estimated the time and extent of admixture using the degree of fragmentation (due to recombination) of haplotype blocks originating from a donor population in the recipient population. In each population, the distribution was again fitted with an exponential curve. They showed that admixture abruptly came to an end about 1575 years ago in upper-caste populations, most likely due to the establishment of endogamy, while tribal populations seem to have continued admixing until 1500-1000 years ago.

In short, although they share a common topic, these two papers propose divergent versions of the history of the Indian population: while the first considers a priori that Austro-Asiatic and Tibeto-Burman speakers are not components of the ancestral populations of India and focuses only on the mixture between the ANI and ASI components, the second claims that the genetic structure of the Indian population is the result of admixture events between four ancestral components. However, the two views converge on the idea that admixture was a common phenomenon in India that ceased rapidly with the establishment of the caste system, which enforced endogamy.

Common origin of two subgroups of Ari people

The third paper investigates the history of human populations at a smaller scale, focusing on a single ethnic group, the Ari people of Ethiopia. The Ari are composed of two socially and genetically distinct subgroups: the cultivators (Aric) and the blacksmiths (Arib). Anthropologists have proposed two alternative hypotheses to explain the division of the Ari: under the remnant hypothesis (RN), the blacksmiths are the remnants of an indigenous group that was assimilated by the more recently arrived cultivators, whereas the marginalization (MA) hypothesis proposes that the two groups share a common ancestry but the blacksmiths were recently marginalized because of their activity. While anthropologists traditionally favour the MA hypothesis, recent genetic studies have provided support for the RN hypothesis. In this article the authors apply a new methodology to the same genetic dataset and find evidence for the MA hypothesis. They show that when ADMIXTURE, fineSTRUCTURE or CHROMOPAINTER analyses are run on a complete dataset of 237 samples from 12 Ethiopian and neighbouring populations, the Arib are grouped into a single homogeneous cluster. But when the patterns of haplotype sharing are inferred by modelling the Ari as a genetic mixture of all other groups, excluding themselves, the genetic differences between Arib and Aric disappear. In fact, their analyses reveal that the two Ari groups show the same mixture events with non-Ari populations (figure 3).

Fig. 3 : Top : Inferred ancestry composition of recipient groups when forming each group as mixtures of (a) all sampled groups, (b) all sampled groups except the Ari. Bottom : TVD XY values comparing the painting profiles for all pairwise comparisons of groups X, Y under each analysis, with scale at far right. Ari groups (ARIb/ARIc) are highlighted with black outlines in each plot.

To explain this pattern they propose that the genetic differentiation of the blacksmiths is due to a bottleneck effect. Their hypothesis is supported by the fact that identity-by-descent (IBD) sharing is stronger in blacksmiths than in cultivators, which is consistent with reduced genetic diversity in the blacksmiths. Using the D-statistic, they also show that the Arib and Aric are more closely related to each other than they are to any other Ethiopian group. They therefore conclude that the observed genetic differentiation between the Arib and Aric does not represent separate ancestry but is rather the result of strong genetic drift due to a bottleneck induced by the social marginalization of the blacksmiths.

Methodological discussion

What stands out from reading these three articles is that the choice of a proper methodology is crucial within a hypothesis-testing framework. While the two articles on Indian populations use the same initial dataset, the way they filter and analyse it results in very different conclusions. The inclusion or exclusion of some populations from an admixture analysis, or the choice of outgroup for an f4 ratio estimation, directly affects the output of these analyses and can lead the authors to tell very different stories. Before dismissing or putting forward a hypothesis, it is important to be aware of the limitations of the method used to produce the results. For example, the authors of the second paper on India’s ancestral populations claim to demonstrate a more complex history than shown in the first paper, but their result is based solely on a clustering analysis (implemented in various software packages such as STRUCTURE or ADMIXTURE).

The basic principle of these STRUCTURE/ADMIXTURE-like programs is to take the K most different groups of the dataset, consider them as the pure ancestral groups and force the others to be a combination of those. This means that the results depend on the populations and on the number of clusters K given to the program. There are different methods to determine which K provides the best fit to the data (cross-validation error, delta K …), but in numerous cases the inferred mixture proportions are wrong. Only in very simple cases, like the genetic history of African Americans (well explained in Daniel Falush’s blog), which involves three clearly defined and very differentiated ancestral populations (West Africans, Europeans and Native Americans), can we be confident in the results of the clustering analysis.


Fig. 4: Admixture plot of the African American population (ASW) with its three ancestral populations, West Africans (YRI), Europeans (CEU) and Native Americans (MEX). Source: Daniel Falush’s blog

But in many cases the history is more complex, and no current population actually corresponds to a pure ancestral population because of multiple waves of admixture. In this case the most differentiated groups are only the most extreme groups in the dataset, which does not mean that they are pure or ancestral. This is well explained in Razib Khan’s blog using the simple example of Uygurs and Europeans: it is known that the Uygurs are a recently mixed group (between Europeans and Asians), but if K is fixed to 2 with only Uygurs and Europeans, STRUCTURE will form two different clusters at 100% levels, one with the Uygurs and one with the Europeans. This is why, in the second paper, the apparently pure AAA, ATB, ASI and ANI populations and all the implications drawn from the clustering are probably meaningless. In fact, when using the f4 ratio (as in the first paper), all groups are found to be admixed to a certain extent (with the smallest rate of admixture being 17%).

This critique of clustering analysis is a key element of the study on the Ari people, where the authors point out that results from such methods should not be taken for granted but interpreted with caution. Indeed, this kind of method cannot discriminate between the alternative scenarios of recent mixture of separate populations and shared ancestry followed by population divergence. Support for one of these hypotheses should therefore rely on additional tests. Instead of directly accepting the story suggested by a clustering analysis, a more reasonable workflow is to use other methods to address the specific implications of each hypothesis. This is exactly what is done in the third article where, as explained previously, the authors constrain the analysis of mixture by forbidding self-ancestry in the two groups of interest, which removes the confounding effect of a recent bottleneck. In such complex cases, combining PCA and STRUCTURE-like analyses with F-statistics and simulations allows more robust conclusions to be drawn. Indeed, statistics such as Fst or Dxy that estimate the genetic differentiation between two populations can be simulated under alternative scenarios representing competing hypotheses (figure 5). These simulated statistics can subsequently be compared with the ones estimated from real data to favour one hypothesis over the other. Simulations can also give an idea of how difficult it is to discriminate between the different hypotheses, which helps avoid over-interpretation of the results. In the second paper, where the authors put forward a new hypothesis, radically different from the classical hypothesis of anthropology and other genetic studies, additional tests like these seem necessary to strengthen their conclusions.
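As a toy illustration of this simulation logic, the sketch below lets allele frequencies drift from a common ancestral pool under a simple Wright-Fisher model and compares Fst between two large populations with Fst between a large and a bottlenecked population. It is not the simulation framework used by the authors, and the population sizes, number of generations and number of SNPs are hypothetical.

```python
import numpy as np

def drift(freqs, n_e, generations, rng):
    """Wright-Fisher drift: binomial resampling of 2 * n_e allele copies
    at every site, repeated for the given number of generations."""
    p = freqs.copy()
    for _ in range(generations):
        p = rng.binomial(2 * n_e, p) / (2 * n_e)
    return p

def mean_fst(p1, p2):
    """Ratio-of-averages Fst = sum(H_T - H_S) / sum(H_T) over all sites."""
    p_bar = (p1 + p2) / 2
    h_t = 2 * p_bar * (1 - p_bar)
    h_s = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    return (h_t - h_s).sum() / h_t.sum()

rng = np.random.default_rng(1)
ancestral = rng.uniform(0.1, 0.9, size=5000)   # shared ancestral frequencies
generations = 50                                # hypothetical time since the split

large_a = drift(ancestral, 5000, generations, rng)
large_b = drift(ancestral, 5000, generations, rng)
bottlenecked = drift(ancestral, 100, generations, rng)

fst_no_bottleneck = mean_fst(large_a, large_b)
fst_bottleneck = mean_fst(large_a, bottlenecked)
print(f"Fst, two large populations:      {fst_no_bottleneck:.4f}")
print(f"Fst, large vs bottlenecked pop.: {fst_bottleneck:.4f}")
```

Both pairs share exactly the same ancestry in this toy model, yet the bottlenecked population ends up far more differentiated. Strong differentiation alone therefore cannot tell separate ancestry apart from a shared origin followed by drift, which is the confounding effect discussed above.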

Fig. 5: Differences in inferred ancestry under analyses A and B using F XY from real data (top) and from simulated data (bottom, under the MA and RN hypotheses). Here the MA hypothesis is clearly the closest to the real data.

Although it was not mentioned in any of the articles, the quality of the data and the way they were obtained, i.e. the genotyping technology, should also be treated with caution. Indeed, all three studies use microarrays designed from European populations. These microarrays consist of thousands of DNA spots, each containing a predefined sequence known to be polymorphic in Europeans; only the complementary sequence can hybridize to a spot and be genotyped. Using these microarrays to study the history of non-European populations may therefore be problematic, as only SNPs that are variable in Europeans are targeted, probably leading to the exclusion of information that is meaningful for non-European populations. Today, with next-generation sequencing (NGS), there are many alternatives, such as RAD sequencing or whole-genome sequencing, that allow tens of thousands of non-predefined SNPs to be genotyped.


To conclude, the take-home messages from these three articles are:

– Social systems leading to endogamy can rapidly and dramatically modify the genetic structure of human populations.

– It is difficult to reconstruct the ancestry of human populations, especially when it involves a complex process with multiple waves of admixture.

– Clustering methods are designed to find structure in a genetic dataset, but they do not necessarily reflect real shared ancestry. Further tests using other methods are required to robustly support one hypothesis.

Posted in genomics, human, PNAS