Annotating Ralstonia genomes: what have our students been up to? #biocuration #bioinformatics

Last week of annotation by our students (see list of annotation topics). Let’s see what they’re up to. This is a snapshot of an unfinished annotation.

In which of the three replicons are the core genes? And first, are they chromosomes or plasmids? GC content, GC skew, and codon usage all indicate that the first replicon is a chromosome, the second a chromid (or megaplasmid or secondary chromosome), and the third a plasmid. So far, no more than 5 of the 206 core genes are found on the chromid or the plasmid, and all the others are on the chromosome, confirming its chromosome status. The next step is to check the replication and partitioning machinery.
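
To make two of these indicators concrete, here is a minimal sketch of how GC content and codon usage can be computed for a sequence. It is only an illustration, not the pipeline the students used; the function names and the toy coding sequence are assumptions.

```python
from collections import Counter


def gc_content(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def codon_usage(cds):
    """Relative frequency of each codon in an in-frame coding sequence."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(codons)
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


# Toy coding sequence (hypothetical): ATG GCC GCC TAA
toy_cds = "ATGGCCGCCTAA"
print(gc_content(toy_cds))
print(codon_usage(toy_cds))
```

Comparing such profiles between replicons is one way to see whether a replicon "looks like" the main chromosome or like more recently acquired DNA.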

Ralstonia 705 has a nitrogen fixation system, but it cannot enter into symbiosis with plants, since it lacks the genes for symbiosis, such as those for Nod factors. It can use nitrate respiration, with a denitrification process, although it lacks nitrous oxide reductase. Relative to closely related strains such as Burkholderia vietnamiensis, which fixes N2, gene content and order are well conserved (see Fig. 1), and the genes are on the chromosome as in R705. In contrast, relative to symbiotic bacteria which are also closely related, gene order is not conserved, and the genes are on a plasmid.

Fig. 1 Gene order comparison between Ralstonia 705 and B. vietnamiensis, focused on the Nif gene cluster

In Ralstonia 743, we found almost an entire phage, in one block in the genome, and the integration regions AttL and AttR. It’s on the chromosome, whereas there are no plasmids found in the other replicons. There are two other regions with phage sequences, but not a complete phage. We also found 5 genomic islands on the chromosome, and 2 on the chromid (chromosome 2), which include many heavy metal transporters. In one of the islands, an interesting resistance to polymyxin, an antibiotic, was also found.

Both Ralstonia have many transporters, generally the same ones. There are notably many heavy metal transporters, which are often conserved in other bacteria and organized in operons. For example, Co, Ni, As and Hg all have such conserved operons.

Concerning the degradation of aromatic compounds, the 705 strain has one almost complete operon for the aerobic benzoyl-CoA degradation pathway, and 743 has two such operons. BoxE is missing in all operons, but PcaF is always present and seems to have the same beta-ketoadipyl-CoA thiolase function. Thus both strains can apparently carry out complete aerobic degradation of benzoyl-CoA. For the degradation of toluene, 743 has the complete pathway, whereas 705 lacks the end of the pathway, and its partial operon is bracketed by two transposases which are absent from 743. This raises the question of how the difference appeared, especially since at least one other Ralstonia strain also has the partial operon (Müller et al 2003).

Staying on aromatic compounds, we looked for transporters. 743 has a chlorobenzene transporter, whereas we didn’t find any in 705. Based on BLAST hits and its gene neighborhood on the chromosome, this transporter also seems able to transport styrene; we checked this styrene transport against the KEGG pathways mapped to Ralstonia genes. We also found a benzoate transporter in both strains, with an interesting conserved neighborhood (Fig. 2) containing benzoate-CoA ligase and subunits of an ABC transporter, as well as a transporter for 4-hydroxybenzoate in a different chromosomal location. Overall, this partially explains how these strains degrade aromatic compounds in the environment.

Fig 2 Benzoate transporter (blue box) and conserved genomic environment (red box) in 705 (top) and 743 (bottom).

Still on aromatics, 743 seems to have many operons for aromatic degradation, mostly through benzoate metabolism. There is notably a complete Tod operon for toluene degradation (Fig. 3), on the chromid. Benzoate is converted to catechol through the BenABCD operon, which is entirely present. We also found a large PAAX operon present in all Ralstonia, but without a clearly described function; of note, the same gene names are annotated in E. coli, but with very limited sequence conservation, so function may or may not be conserved. PAAX neighbors Box on the chromid. 743 also seems to degrade phthalates and protocatechuates.

Fig 3 Tod operon in strain 743

Finally, in 705, most of these pathways are conserved. But there is a gene missing for the degradation of toluene (Fig. 4), which is consistent with preliminary unpublished experimental results. There are mcb genes to degrade chlorobenzene. There is a locus with all genes to degrade aminophenol which is inside a mobile CLC element.

Fig 4 Reconstructed aromatic metabolism of strain 705

Genomic islands of 705: the CLC island is obviously there, but seems misassembled, because a small insertion surrounded by duplicate sequences, which should be in the island, is in the “unplaced” contigs. There is a large deletion (≈10 kb) relative to the CLC island of P. knackmussii. We found another genomic island, which apparently belongs to the Tn4371 family; the integration and conjugal transfer genes are well conserved, whereas the rest of the island is not (Fig. 5). The non-conserved regions seem to contain arsenic resistance and a multidrug efflux transporter. There are clearly more genomic islands, but without clear homology to known elements in the ICEberg database. This illustrates how genomic islands are still poorly known in general. We also found two rather well conserved Mu-like phages; one has most of the structural proteins, so is probably quite young, whereas the other is more degraded. There are also more phage remains which are difficult to identify further. They are all on chromosome 1; no phages were detected on the other replicons. Although we didn’t find any specific resistance genes inside the phages, we did find a multidrug resistance gene and a Vibrio cholerae toxin gene right next to phages.

Fig 5 Genomic island Tn4371, compared between Ralstonia 705 and Ralstonia oxalatica

And now for something completely different: comparing different assemblies of 705 and 743, to improve the assembly. Thus we improved the alignment between the strains. Improving the assembly of the chromid was easier based on a published strain (H16) than based on our two strains. Interestingly, the differences between our two strains have odd GC content, a direction which is being pursued.

Also in comparisons, the phylogeny of Ralstonia seems to differ between the 16S phylogeny and the one based on core genes (unique copy, present in all 47 species of Burkholderiaceae compared). We also built a phylogeny based on the ICE elements, one based on 23S, and one based on 6 randomly selected genes. This work is ongoing.
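
For readers unfamiliar with the "unique copy, present in all species" criterion, here is a small sketch of how such core genes can be selected from a table of gene family copy numbers. The data structure, family names and genome names are hypothetical; the actual selection in the course may have been done differently.

```python
def single_copy_core(gene_counts, genomes):
    """Return gene families with exactly one copy in every genome.

    gene_counts maps family name -> {genome name: copy number};
    a genome missing from the inner dict counts as zero copies.
    """
    core = []
    for family, counts in gene_counts.items():
        if all(counts.get(genome, 0) == 1 for genome in genomes):
            core.append(family)
    return sorted(core)


# Hypothetical toy data: three genomes instead of the 47 compared here.
genomes = ["genome_A", "genome_B", "genome_C"]
gene_counts = {
    "famA": {"genome_A": 1, "genome_B": 1, "genome_C": 1},  # single-copy core
    "famB": {"genome_A": 2, "genome_B": 1, "genome_C": 1},  # duplicated in A
    "famC": {"genome_A": 1, "genome_B": 1},                 # absent from C
}
print(single_copy_core(gene_counts, genomes))
```

Only families passing both filters are kept, because duplicated or missing genes make it hard to be sure that the same gene is being compared across species.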

So, an emerging picture of dynamic evolution of Ralstonia with three replicons carrying heavy metal resistance and aromatic compound transport and degradation, for a species found in contaminated environments.

Posted in annotation, students | Leave a comment

Annotating two genomes #biocuration

The students have started annotating the two best genomes from the assembly. For this, most of them are using the GenDB platform to annotate specific systems:

  • Secretion systems;
  • Transporters;
  • Prophages and genomic islands;
  • Aromatic compound metabolism;
  • Alternative respiration;
  • Replicons;
  • N-fixation and N-respiration.

Most of these are being annotated by different pairs of students in each of the two strains.

In addition, since we have two strains of the same species with good enough assemblies, and more students than in previous years, some student pairs are annotating comparisons between the strains:

  • SNP calling;
  • Genome comparisons between strains and with closest neighbors;
  • Reconstruction of aromatic compound pathway evolution;
  • Phylogenetic supertree of close species.

Depending on topics, this allows students to explore different aspects of bioinformatics and of biocuration, while getting to see the biological relevance of the genome sequencing (see the feedback of students on our course last year). Until we get results, you can see the highlights of annotation from last year.

Posted in annotation, students | 2 Comments

Closing the assembly #bioinformatics

We left you with assemblies numbering tens of contigs. The next step was for our students to order them. We chose to go ahead only for strains 705 and 743. For strain 705 we had an optical map to help order the contigs, and for 743 we combined assemblies, as well as similarity to closely related bacteria, since it was in any case the assembly with the fewest contigs.

The students then designed PCR primers at the extremities of these contigs, at the end of the autumn semester. These primers were ordered, and at the beginning of the spring semester the students performed all the combinatorial PCRs:

For 705: 92 contigs, 95 PCRs, 75 positive PCRs, 42 gaps closed by sequencing these PCR products. Thus we were left with 50 contigs to annotate, which were mapped to 3 replicons plus “unmapped”.

For 743: 50 contigs, 46 PCRs, 24 positive PCRs, 16 gaps closed by sequencing these PCR products. Thus we were left with 30 contigs to annotate, which were mapped to 2 replicons plus “unmapped” (but probably a third replicon).

Posted in assembly, students | 1 Comment

Progress in genome assembly: 3 strains, 6 QC parameters, 3 software, 2 k-mers #bioinformatics

Restarting this blog after a pause due to other duties, extra motivated by the acceptance of our first paper

This autumn, our students worked hard to make their millions of reads into assembled genomes.

The students have worked on a combination of different strains, quality score and read length thresholds for quality control, assembly software, and k-mer length for the assembly:

First, quality control of the reads. Example before trimming:

See the big dip on the right? That’s quality going down at the end of the reads. Then we trimmed with fastq-mcf, with a quality threshold of 20 or 30, and a minimum read length after trimming of 150, 200 or 250 nucleotides. After trimming, we obtain the following:

per_base_quality trimmed
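
To illustrate what the two trimming parameters above do, here is a simplified sketch of 3' quality trimming followed by a minimum length filter. It is not the algorithm implemented in fastq-mcf, only an illustration of the idea; the function names and the toy read are assumptions.

```python
def phred33(qual_string):
    """Decode a FASTQ quality string using the Phred+33 offset."""
    return [ord(ch) - 33 for ch in qual_string]


def trim_read(seq, quals, min_qual=20, min_len=150):
    """Trim low-quality bases from the 3' end, then apply a length filter.

    seq is the base string, quals a list of Phred scores of the same length.
    Returns the trimmed (seq, quals), or None if the read becomes too short.
    """
    end = len(seq)
    while end > 0 and quals[end - 1] < min_qual:
        end -= 1
    if end < min_len:
        return None
    return seq[:end], quals[:end]


# Toy read (hypothetical): 8 good bases followed by 4 poor ones.
seq = "ACGTACGTACGT"
quals = phred33("IIIIIIII####")  # 'I' decodes to Q40, '#' to Q2
print(trim_read(seq, quals, min_qual=20, min_len=5))   # keeps the first 8 bases
print(trim_read(seq, quals, min_qual=20, min_len=10))  # None: too short after trimming
```

Raising the quality threshold removes more of the read ends, and raising the minimum length discards more reads entirely, which is why the students tested several combinations.
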
After assembly with diverse parameters, we get a large variation of assemblies, whose N50 varies from 19’271 bp to 148’738 bp, and whose total length varies from 1.03 Mb to 6.17 Mb for one strain. We chose the best assemblies based on N50, total length and number of contigs >1kb.
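
As a reminder of what N50 measures: sort the contigs from longest to shortest and add up their lengths; N50 is the length of the contig at which the running total reaches half of the assembly length. A minimal sketch, with a toy list of contig lengths as an assumption:

```python
def n50(lengths):
    """Length of the contig at which the cumulative length, going from the
    longest contig down, first reaches half of the total assembly length."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty assembly


def contigs_longer_than(lengths, threshold=1000):
    """Number of contigs strictly longer than the threshold (in bp)."""
    return sum(1 for length in lengths if length > threshold)


# Toy contig lengths in bp (hypothetical), total 300.
toy = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy))
print(contigs_longer_than([500, 1500, 3000]))
```

A higher N50 means that more of the assembly sits in long contigs, but it says nothing about correctness, which is why total length and contig number were also considered.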

We kept the following assemblies:

Assembly parameters are given as minimum read length after trimming, assembler, and k-mer.

Bacterium  Contigs > 1 kb  N50 (bp)  Total length (bp)  Min. read length  Assembler  k-mer
705        92              145562    6142164            250 nt            Spades     79
705        93              148738    6149199            150 nt            Spades     91
705        101             108208    6160442            150 nt            Edena      79
705        116             108277    6144546            150 nt            Velvet     79
743        50              313482    7399154            200 nt            Spades     81
743        102             128710    7249125            150 nt            Edena      75
743        100             121247    7233116            150 nt            Velvet     75
757        82              185576    6218735            250 nt            Spades     87
757        90              159912    6144964            200 nt            Spades     73
757        98              118871    6162918            150 nt            Edena      83
757        110             113990    6146891            150 nt            Velvet     91

The first line for each bacterial strain is considered the best assembly.
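
One simple rule that happens to reproduce this choice for the table above is to prefer the assembly with the fewest contigs > 1 kb, breaking ties by the higher N50. This rule is an assumption made for illustration; as noted, the actual choice also took total length into account. The values are copied from the table.

```python
# (strain, contigs > 1 kb, N50, total length, parameters) copied from the table.
assemblies = [
    ("705", 92, 145562, 6142164, "250nt Spades 79"),
    ("705", 93, 148738, 6149199, "150nt Spades 91"),
    ("705", 101, 108208, 6160442, "150nt Edena 79"),
    ("705", 116, 108277, 6144546, "150nt Velvet 79"),
    ("743", 50, 313482, 7399154, "200nt Spades 81"),
    ("743", 102, 128710, 7249125, "150nt Edena 75"),
    ("743", 100, 121247, 7233116, "150nt Velvet 75"),
    ("757", 82, 185576, 6218735, "250nt Spades 87"),
    ("757", 90, 159912, 6144964, "200nt Spades 73"),
    ("757", 98, 118871, 6162918, "150nt Edena 83"),
    ("757", 110, 113990, 6146891, "150nt Velvet 91"),
]


def best_per_strain(rows):
    """Pick, for each strain, the row with the fewest contigs (ties: highest N50)."""
    best = {}
    for row in rows:
        strain, n_contigs, n50_value = row[0], row[1], row[2]
        key = (n_contigs, -n50_value)
        if strain not in best or key < best[strain][0]:
            best[strain] = (key, row)
    return {strain: row for strain, (key, row) in best.items()}


for strain, row in best_per_strain(assemblies).items():
    print(strain, row[4])
```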

Posted in assembly, students | 1 Comment

First paper from the course accepted with 8 students as co-authors

Excellent news this morning! The first paper describing a genome sequenced, assembled and annotated in this course has been accepted for publication:

Ryo Miyazaki, Claire Bertelli, Paola Benaglio, Jonas Canton, Nicoló De Coi, Walid H. Gharib, Bebeka Gjoksi, Alexander Goesmann, Gilbert Greub, Keith Harshman, Burkhard Linke, Josip Mikulic, Linda Mueller, Damien Nicolas, Marc Robinson-Rechavi, Carlo Rivolta, Clémence Roggo, Shantanu Roy, Vladimir Sentchilo, Alexandra Von Siebenthal, Laurent Falquet, and Jan Roelof van der Meer. Comparative genome analysis of Pseudomonas knackmussii B13, the first bacterium known to degrade chloroaromatic compounds. Environmental Microbiology

Pseudomonas knackmussii B13 was the first strain to be isolated in 1974 that could degrade chlorinated aromatic hydrocarbons. This discovery was the prologue for subsequent characterization of numerous bacterial metabolic pathways, for genetic and biochemical studies, and which spurred ideas for pollutant bioremediation. In this study we determined the complete genome sequence of B13 using next generation sequencing technologies and optical mapping. Genome annotation indicated that B13 has a variety of metabolic pathways for degrading monoaromatic hydrocarbons including chlorobenzoate, aminophenol, anthranilate, and hydroxyquinol, but not polyaromatic compounds. Comparative genome analysis revealed that B13 is closest to Pseudomonas denitrificans and Pseudomonas aeruginosa. The B13 genome contains at least 8 genomic islands (prophages and integrative conjugative elements – ICE), which were absent in closely related pseudomonads. We confirm that two ICE are identical copies of the 103-kb self-transmissible element ICEclc that carries the genes for chlorocatechol metabolism. Comparison of ICEclc showed it is composed of a variable and a “core” region, which is very conserved among proteobacterial genomes, suggesting a widely distributed family of so far uncharacterized ICE. Resequencing of two spontaneous B13 mutants revealed a number of single nucleotide substitutions, as well as excision of a large 220 kb region and a prophage, which drastically change the host metabolic capacity and survivability.

The names of the master students are in bold in the author list.

Posted in article | 3 Comments

Our students have the reads from three bacterial genomes

Starting a new year of Sequence a genome!

We have selected 3 bacterial strains isolated from contaminated ground water, all from the same species. The students have grown the bacteria, isolated the DNA, and brought it to our sequencing facility. They don’t get to do the final library preparation themselves, but they do get a guided tour of our sequencing machines: Illumina HiSeq and MiSeq, PacBio and IonTorrent.

Photo of the class by Jan van der Meer

And now the sequencing is done, and the students are ready to assemble: 4.5, 5.3 and 6.3 million reads, respectively for each of our three bugs.

In preparation for this difficult exercise, we had the hardest session of the year: Introduction to Unix and using the cluster! This will allow them to perform the quality control and assembly steps on our 16 million reads. The students will test different assemblers, quality control cut-offs, and k-mer sizes, which will hopefully allow us to select the best assembly for each of the three strains by the end of this semester.

Posted in sequencing | Leave a comment

How do our students feel after two semesters of genome sequencing?

I interviewed our students on the last day of annotation on their impressions and take-home message from the two semesters of this class, from extracting the bacterial DNA to summarizing their annotation of a specific subset of the genome. The following summary is of course subjective.

There is a strong contrast in the experience of the students between the sequence assembly in the autumn, and the annotation in the spring.

The sequence assembly was more abstract, with theoretical explanations of algorithms which were often difficult to follow for the students (De Bruijn graphs, k-mers, and friends), and all the work needed to be done on a distant cluster using the dreaded Unix command-line interface. So the students often felt that they were copy-pasting commands and using ready-made scripts without truly understanding what they were doing nor why.

Another source of frustration is that, although we try to provide a full A to Z experience, the sequencing is done without them in a facility to which we just send the DNA. Then, miracle!, the sequences are on the cluster a few weeks later. On the plus side, several students appreciated getting much closer to these new and fashionable techniques. Illumina and PacBio stopped being just names they had heard, and became more real.

After the bioinformatics part of the assembly, we did some experiments which were a bit frustrating because they take quite a bit of time for a small progress. I’m told a lot of experimental biology is like that. 😉

Overall, the assembly was more interesting for the students who have chosen to specialize in Bioinformatics in the master, relative to the others. One student called assembly “putting pieces of the puzzle together”. Some bioinformatics students also liked the experimental part of the course, which they otherwise don’t have during their master. Of note, one non bioinformatics student found the bioinformatics of assembly “hard but cool”.

On the other hand, almost all the students appreciated the annotation, which is closer to biology, and allowed them to use multiple bioinformatics tools independently. By using these tools to pursue a biological aim, several students felt that they improved their understanding of the tools (e.g., flavors of Blast, differences between SwissProt and Pfam). It also allowed them to have more initiative, to pursue leads which they found interesting. And then, in the words of a student, “we see how the bacteria use their genome”.

A few students would have appreciated spending more time on the annotation: they feel that they barely scratched the surface, found tantalizing leads, then had to stop because the semester was ending. We will consider in the future allowing motivated students to continue for extra credit.

A difference between the two semesters which was purely practical, but influenced the experience of the students, is that in autumn classes were every two weeks, whereas in spring they were every week over half of the semester. The weekly rhythm was unanimously judged more adequate, allowing students to carry on from one session to the next more easily.

While many students found the assembly more difficult and less interesting, many also appreciated having seen the whole process from beginning to end. Or as one student put it, “from A to W”, since we didn’t really have time to finish everything which we would have liked to. But isn’t it always so in research…

Those students who are doing microbiology appreciated that the whole course gave them a better knowledge and understanding of what goes into a bacterial genome and gene annotations, since they are confronted with these data in their work at some point or another.

Finally, one student, after telling me everything which he didn’t like about the course (Unix! algorithms! we didn’t have time to finish!), concluded “It was challenging but interesting” with a smile.

I can only encourage colleagues to organize similar courses everywhere.

Posted in students | 1 Comment

Highlights of our annotation: heavy metal resistance, lateral gene transfer, genomic islands, mega plasmids!

This was the last day of annotation for our students, where they had to write up their results. Next week, oral presentations of the results. In the meantime, I asked each group of 2 or 3 students for some highlights of their annotation efforts. Please keep in mind that this is all preliminary.

Some highlights from our annotation of the Neochlamydia genome:

We clarified which amino acids could be synthesized. The striking result is that very few amino acids can be synthesized: this intracellular bacterium lives inside amoebae, and is very dependent on synthesis or transport by the amoeba.


Presence / absence of amino acid synthesis pathways: Green: present; red: absent; yellow: maybe present. Temporary figure, results still being updated.

A comparison of type III secretion among Chlamydiae showed that the structural proteins are very well conserved. On the other hand, CopB and CopC are less conserved. Despite low sequence conservation, CopB can be identified by conserved synteny. But CopD appears absent from not only our Neochlamydia, but from most Chlamydiae, whereas it is reported in complex with CopB in Yersinia and Salmonella. Three new genes were identified while studying this system, inserted in the conserved syntenic region: a transposase, a heat shock protein, and a potential secreted effector.

Conservation of the type III secretion system among several bacteria

While looking for prophages in vain, we found a super cool 55 kb genomic island, bordered by magnificent direct repeats (34 bp). Integrase, recombinase, and multiple antibiotic resistance genes, how cool can it be? (The students were quite excited by this find.)

Some highlights from our annotation of the Ralstonia genome:

Replicons I and II (3.5 and 1.5 Mb) follow the GC skew pattern of a real chromosome, whereas replicon III (0.5 Mb) doesn’t have a clear pattern. tRNA genes are found only on replicon I, which has a DnaA-dependent initiation of replication, whereas II and III have a plasmid-type Rep system.

Chromosome III probably corresponds to the mega plasmid pHG1 of Ralstonia eutropha. In general, many similarities of our strain with R. eutropha. E.g., also has genes for lithoautotrophic growth.


R. eutropha pHG1 on top, our Ralstonia chromosome III on bottom. Generated using Artemis.

Looking for transporters, found heavy metal transport in both bacteria, more in Ralstonia, as expected from its isolation. Two copies of mercury resistance cluster on chromosome III, which suggests lateral transfer of the whole system with the plasmid. Surprisingly (for us), four copies of the operon for Cobalt-zinc-cadmium resistance (genes CzcA-B-C), which from phylogeny seem to come from three operon duplications shared by other species, and one lateral gene transfer.

Posted in annotation | 1 Comment

Bacterial genome annotation strategies

Into our third week of genome annotation, a look at the strategies of different student pairs. In all of these, annotation is still ongoing.

Integrated phages and genomic islands:

  • Neochlamydia: No results with PHAST. 🙁 GC-skew confirmed the order of contigs, but showed some cases of inversion. Notably, once the assembly was corrected, it allowed us to detect a zone with a recombination, which also includes a recombinase and a beta-lactamase (which could play a role in resistance to beta-lactams). Through keyword search, two more regions were found, which are being investigated.
  • Ralstonia: This time, PHAST worked: 4 regions, of which 3 appear to be whole phages, and one cryptic. Now annotating all genes in these phages, one by one. Interesting finds: lysozymes!

PHAST results


Nitrogen metabolism:

Started with a keyword search in KEGG, for genes involved in nitrogen metabolism, nitrogen fixation or nitrate respiration. Then found these genes in the Ralstonia genome, and annotated the clusters: they were clearly clustered functionally, maybe operons. Interestingly, they are located on different replicons.


Transporters:

Started with a keyword search for “transporter” in the automatic annotations of GenDB. First annotated manually the 117 such genes in Neochlamydia, but not the >600 in Ralstonia. This allowed us to find several Neochlamydia operons in which one or two genes had an automatic “transporter” annotation, but the other genes didn’t. Given the large number of transporters, and the biological origin of these bacteria, we decided to focus on heavy metal transporters. Restarted keyword searches followed by manual annotation in both genomes, focusing on keywords associated with heavy metals.
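
The keyword strategy can be sketched as a simple case-insensitive search over product descriptions. This is not the GenDB interface, just an illustration; the gene identifiers, product names and keyword list below are made up.

```python
def keyword_hits(annotations, keywords):
    """Return gene IDs whose product description contains any keyword
    (case-insensitive substring match)."""
    lowered = [keyword.lower() for keyword in keywords]
    return [
        gene_id
        for gene_id, product in annotations.items()
        if any(keyword in product.lower() for keyword in lowered)
    ]


# Hypothetical automatic annotations; gene IDs and products are invented.
annotations = {
    "gene_0001": "Cobalt-zinc-cadmium resistance protein CzcA",
    "gene_0002": "ABC transporter ATP-binding protein",
    "gene_0003": "DNA polymerase III subunit alpha",
    "gene_0004": "Mercuric transport protein MerT",
}
print(keyword_hits(annotations, ["transporter"]))
print(keyword_hits(annotations, ["cobalt", "cadmium", "mercur", "arsen"]))
```

Note that in this toy example the plain “transporter” search misses two genes that are clearly involved in metal transport, which is the kind of gap that manual annotation of the neighboring genes is meant to fill.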

Type III secretion system:

Starting from a figure from a Waddlia genome paper, searched for homologs in Neochlamydia by BLASTP:


Clicking will take you to the list of Supplementary materials. This is Fig S4.

Amino acid synthesis:

Starting with the primary amino-acid list from KEGG, found the pathways. GenDB has the KEGG pathways pre-annotated automatically. Focused on those pathways which were already largely covered by the automatic annotation, manually annotating and completing them.


Replicons:

Starting with two review articles (Egan et al 2005, Jha et al 2012), then further relevant articles (Salanoubat et al. 2002, Gibbs et al. 2006), made a list of relevant proteins involved in DNA replication. Ralstonia has several replicons, so we searched for the housekeeping genes to help define the main chromosome. Indeed most are on chromosome I. 🙂 Now searching for the origins of replication using GC-skew; works for two of the replicons but not the third. Also looked for replication proteins near the peaks of GC-skew (yes! ongoing…). The combination of replication genes and (even weak) GC-skew peaks is allowing us to define replication origins for all replicons. Future plans: look for the consensus DnaA-binding box; compare to the sequence of replication origins in closely related bacteria. Also plan to check further annotations to understand why the bacterium kept these additional chromosomes / mega-plasmids.
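
For the GC-skew part, a common approach is to plot the cumulative G minus C count along the replicon: in many bacteria the leading strand is enriched in G, so the curve tends to reach its minimum near the origin and its maximum near the terminus. The sketch below assumes this convention; the function names, window size and toy sequence are hypothetical, and real replicons give much noisier curves.

```python
def cumulative_gc_skew(seq, window):
    """Cumulative sum of (G - C) counts over consecutive windows."""
    seq = seq.upper()
    total = 0
    curve = []
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        total += chunk.count("G") - chunk.count("C")
        curve.append(total)
    return curve


def predicted_origin(seq, window):
    """Approximate origin position: end of the window where the
    cumulative skew curve is minimal (first minimum if there are ties)."""
    curve = cumulative_gc_skew(seq, window)
    idx = min(range(len(curve)), key=curve.__getitem__)
    return min((idx + 1) * window, len(seq))


# Toy sequence (hypothetical): C-rich stretch followed by a G-rich stretch.
toy = "CCCA" * 5 + "GGGA" * 5
print(cumulative_gc_skew(toy, 4))
print(predicted_origin(toy, 4))
```

When the skew signal is weak, as for the third replicon, the minimum of the curve is unreliable on its own, which is why it is combined with the position of replication genes.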

Degradation of aromatic compounds:

Started with a review article, searched for gene names: no luck, so switched to EC numbers, not much better, so switched to Gene Ontology terms; annotating genes which have hits to these GO terms in the automatic annotation. In parallel, starting from aromatic pathways in KEGG, and annotating them in detail.


CRISPR:

Checked literature to find protein domains and genes which are involved: Cas and Cse. Then used CRISPR Finder on the DNA Fasta files of the genome assemblies, which provided putative CRISPRs: none for Neochlamydia, 9 in Ralstonia, of which 5 unmapped. Now confirming these 9, based on Cse and Cas genes which should be in the neighborhood: annotating genes around the putative CRISPRs.


All in all, quite a diversity of starting points, from the literature to automatic annotations, and of efforts, which can be concentrated on few genes, on long lists, or on specific DNA regions.

Of note, this summary is not a recommendation on the best manner to annotate a bacterial genome, but a snapshot of the strategies and results of our students during their annotation effort.

Posted in annotation | Leave a comment

Starting to annotate our two genomes

We have now started annotating our two genomes with our students. We use GenDB, a semi-automated system which provides a first run of automatic annotations, and integration of results from many bioinformatics analyses (BLAST against various databases, trans-membrane domains, etc.) as well as a genome browser and an interface to KEGG.

Our Neochlamydia and our Ralstonia have never been sequenced before, but we know where we found them, and we know related genomes, which gives us some clues to their biology. The Ralstonia was isolated from a contaminated groundwater aquifer. The Neochlamydia was also isolated from the environment, and both might have heavy metal resistance for example.

The students are now grouped in pairs (and one trio), and are annotating different sub-systems using GenDB as well as information from outside resources, such as StringDB. The systems are:

  • Amino acid biosynthesis
  • Conservation of the type III secretion system
  • Prophages & genomic islands
  • CRISPR elements
  • Carbon metabolism
  • Mobile genetic elements (ICE, Tn) and prophages
  • Respiration and N-metabolism

This process of annotation will take us four 4-hour sessions (we’re half way through!). Each annotation goes through the steps of automated annotation, suggestion by a student, verification by a teacher who provides feedback, final annotation by a student, and validation by a teacher. In the process, we learn more about these bacteria, and the students learn about the labor-intensive process of manual curation of biological data. And since none of the teachers knows everything from microbiology to bioinformatics, we all learn to be better genome biologists.

We will update this blog with the results of the annotation process.

Posted in annotation | Leave a comment