Update on annotation of Pseudomonas protegens: confirmations and surprises #biocuration

After 2 weeks of annotation, some of our students presented their preliminary results to the class. Here is a summary.

Phosphate and nitrate uptake and metabolism: using KEGG and keyword searches, the students identified genes to annotate; of these, 45 have already been confirmed. There were some surprising results, such as a plant-specific nitrilase, which is also found in a Rhizobium (UniProtKB/Swiss-Prot) and in other Pseudomonas (but not well annotated there). They also identified a Pseudomonas-specific duplication of PhoA2. For the rest, much of the annotation is confirmatory.

Partial phylogeny from phylogeny.fr (default parameters), showing PhoA2 duplication

Genomic islands: the students checked base composition with 5 different tools: 3 in IslandViewer, 1 in INDiGenIUS and 1 in SigHunt. Each method finds somewhat different regions, so the students decided to focus on those supported by several methods. They did not consider regions which on first examination appeared to come from phages, as another student group is annotating these. The putative regions were checked by BLAST: as expected, there were no hits in Pseudomonas, and integrase genes were present.

IslandViewer graphical result
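
To give an idea of what "checking base composition" can mean, here is a minimal sketch in Python. It is not the method used by IslandViewer, INDiGenIUS or SigHunt; it only assumes that a genomic island may show up as a window whose GC content deviates from the rest of the genome. The function names, the window size and the z-score cutoff are illustrative choices, not values from the class.

```python
from statistics import mean, pstdev


def gc_fraction(seq):
    """Fraction of G or C bases in a sequence (case-insensitive)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def atypical_windows(genome, window=1000, step=1000, z_cutoff=2.0):
    """Return (start, gc, z) for windows with unusual GC content.

    z is the deviation of the window GC content from the mean over all
    windows, in units of standard deviation. The cutoff is an arbitrary
    illustrative value (assumption), not a recommended threshold.
    """
    starts = list(range(0, len(genome) - window + 1, step))
    gcs = [gc_fraction(genome[s:s + window]) for s in starts]
    if not gcs:
        return []
    mu, sigma = mean(gcs), pstdev(gcs)
    if sigma == 0:
        return []  # perfectly uniform composition: nothing stands out
    hits = []
    for s, gc in zip(starts, gcs):
        z = (gc - mu) / sigma
        if abs(z) >= z_cutoff:
            hits.append((s, gc, z))
    return hits
```

Calling `atypical_windows(genome)` on a genome string returns candidate regions to inspect further, for example by BLAST as the students did. Real tools use richer signatures (codon usage, k-mer frequencies) than plain GC content.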

Insect killing (coolest topic title): starting from the observation of an insect toxin cluster, Fit, in Pseudomonas fluorescens (Péchy-Tarr et al 2008), the students found by BLAST first an unannotated gene, and then a potential cluster around it, with a structure very similar to the known Fit cluster. Sequence homology confirms that this is probably an insect toxin cluster in Pseudomonas protegens, and the genes were annotated as Fit-A to Fit-H. Neighboring genes are not conserved with the 2008 publication of Fit in P. fluorescens PF05, and a synteny analysis in fact shows an inversion in this region between our strain and PF05.

cluster of genes homologous to Fit, probably insect toxin cluster (picture from GenDB)

Dotplot of synteny between P. fluorescens and protegens; in the center, the inversion around the Fit cluster

Chemotaxis: the students found well-characterized genes from KEGG, spread among several clusters of putative chemotaxis genes. This topic appears difficult, because shared domains may lead to automatic “chemotaxis” annotations for genes which in fact have different functions. They also found many chemotaxis transducers. During the presentation, the group annotating motility was able to contribute chemotaxis-related genes which they had identified themselves. The two groups now know that they need to coordinate their work. A lot of work is left on a tough annotation topic…

Posted in annotation, bioinformatics, students

Good assembly of genomes with #PacBio: no difference between our strains

A new year of the class Sequence a genome with our master students has started! In September, the students cultivated two strains likely to be Pseudomonas protegens, extracted their DNA and prepared it for PacBio sequencing.

All of the students produced DNA of sufficient quality and quantity:

DNA samples produced by our students (provided by Vladimir Sentchilo)

After pooling and ethanol precipitation, the samples yielded two batches of DNA which are suitable for sequencing. Fragment size is about 19 kb, which is good for PacBio sequencing, and purity was assessed by absorbance ratios: 260/280 = 1.9 and 260/230 = 2.2.

Our first session of bioinformatics simply consisted of checking the assembly with simple visualization and quality control. We only work with one strain, because the sequences show that the two strains are in fact identical.

We started with a rapid crowd-sourcing of what the students expect of a bacterial genome:

  • a single chromosome
  • few introns
  • circular chromosome
  • small genome
  • an origin and a terminus of replication
  • presence of operons
  • presence of plasmids
  • species-specific GC content.

A pretty good list. As it turns out, our PacBio sequences assembled into one contig of 6.778 Mb. Thus we have an average-sized genome, no plasmids, and an apparently perfect assembly.

Using Artemis, the students were able to explore the genome a little and check its GC skew: all is well.

DNAplotter of Artemis, GC skew (provided by Mark Szenteczki and Paula Zganiacz)
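
For readers who have not met GC skew before, here is a minimal Python sketch of the idea. It is not what Artemis does internally; it assumes the usual definition, (G - C) / (G + C) per window, and the common convention that in many bacterial chromosomes the cumulative skew reaches its minimum near the origin of replication and its maximum near the terminus. Function names and window size are illustrative.

```python
def gc_skew(seq):
    """GC skew of a sequence: (G - C) / (G + C), or 0.0 if no G or C."""
    seq = seq.upper()
    g, c = seq.count("G"), seq.count("C")
    return (g - c) / (g + c) if (g + c) else 0.0


def cumulative_skew(genome, window=1000):
    """List of (window_end_position, cumulative_skew) along the sequence."""
    total = 0.0
    profile = []
    for start in range(0, len(genome) - window + 1, window):
        total += gc_skew(genome[start:start + window])
        profile.append((start + window, total))
    return profile


def skew_extremes(genome, window=1000):
    """Positions where the cumulative skew is minimal and maximal.

    Under the convention stated above (an assumption, to be checked for
    each organism), these approximate the origin and the terminus.
    """
    profile = cumulative_skew(genome, window)
    min_pos = min(profile, key=lambda p: p[1])[0]
    max_pos = max(profile, key=lambda p: p[1])[0]
    return min_pos, max_pos
```

On a well-assembled circular chromosome one expects a single clear minimum and a single clear maximum; several sign changes of the skew can hint at a misassembly or a rearrangement.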

The students have launched mapping of PacBio shorter reads onto the genome to check distribution of reads, but results on the cluster will take a few more days… So that’s it for now.

Posted in assembly, bioinformatics, experimental, students

#Bioinformatics treatment of bacterial #RNAseq data

The students of the Sequence a genome class have been advancing in their analysis of bacterial RNA-seq (see design in previous post). Let’s present here the basic analysis, common to all biological questions which can then be studied using these data. For this, the students needed to venture first into Unix (using the Vital-IT cluster), and then into R.

First, quality control using FastQC. Take-home message: super good quality, we keep everything:

FastQC plot of one of our RNA-seq samples. They all look like this.

Second, mapping with Bowtie, which is largely sufficient since bacteria don’t have introns (so no complex reconstruction of transcripts). A lot of annoying time and explanations were spent on horrid formats and format conversions. But then we get this nice mapping that we can visualize in IGV:

Visualization of reads mapping to a little portion of the chromosome which we sequenced and annotated in the previous semester, visualized with IGV.

Third, counting reads with HTSeq; again, since we have no issues with splicing, simple counting works. This is what the counts look like in a rapid PCA; the samples group by condition, which is a good sanity check. There is signal!

PCA of read counts
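
What "simple counting" amounts to can be sketched in a few lines of Python. This is a toy illustration and not HTSeq's implementation: genes and reads are represented as plain coordinate tuples (a simplification assumed here; real data would come from annotation and alignment files), and a read is counted for a gene only if it overlaps exactly one gene. The labels `__ambiguous` and `__no_feature` are names chosen for this sketch.

```python
from collections import Counter


def count_reads(genes, reads):
    """Count reads per gene from simple coordinates.

    genes: list of (name, start, end), inclusive coordinates
    reads: list of (start, end), inclusive coordinates of mapped reads
    A read overlapping several genes is counted as ambiguous; a read
    overlapping none is counted as having no feature.
    """
    counts = Counter({name: 0 for name, _, _ in genes})
    for r_start, r_end in reads:
        hits = [
            name
            for name, g_start, g_end in genes
            if r_start <= g_end and g_start <= r_end
        ]
        if len(hits) == 1:
            counts[hits[0]] += 1
        elif hits:
            counts["__ambiguous"] += 1
        else:
            counts["__no_feature"] += 1
    return dict(counts)
```

With no splicing to worry about, each read is a single interval, which is why this kind of overlap test is enough for bacteria. The linear scan over genes is fine for a toy example but would be slow on a full data set.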

Finally, the students have investigated differential expression between conditions using edgeR on the counts. And lo and behold, there are differences:

Differential expression between pairs of conditions, with genes significant at FDR<0.05 highlighted in red.
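
The FDR threshold in the figure deserves a word. The sketch below shows, in plain Python, how p-values can be adjusted for multiple testing with the Benjamini-Hochberg procedure; it assumes this is the kind of FDR used here, which the post does not state. The function name is illustrative, and in practice the students relied on edgeR rather than on code like this.

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, in the input order.

    Each p-value is multiplied by m / rank (m tests, rank in ascending
    order), then made monotone by taking the running minimum from the
    largest p-value downwards.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

A gene would then be highlighted when its adjusted value is below 0.05, for example `[q < 0.05 for q in benjamini_hochberg(pvalues)]`. With thousands of genes tested, this correction is what keeps the expected proportion of false positives among the highlighted genes under control.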

Many thanks to the students who provided the figures used here, from their work.

Posted in bioinformatics, rnaseq

Experimental preparation of #RNAseq samples

This winter, the Master students performed the experimental part of our RNA-seq. This blog post is a bit late, so they have in the meantime started the bioinformatics analysis, which will be reported in a future blog post (hopefully less late than this one).

So, four conditions to study gene expression:

Growth of bacteria in: liquid medium with succinate; liquid medium with toluene; sand with succinate; sand with toluene.

And after extraction the RNA looks good:

QC of RNA, 4 conditions with 4 replicates each

Quality control of the rRNA depletion process (Bioanalyzer report): no rRNA, no degradation.

Quantification shows a very good yield.

Quality Control (Bioanalyzer): near optimal size distribution; similar profiles for all samples.

So all these samples were sent to our Genomic Technologies Facility, and the single-end reads found their way to our Vital-IT cluster.

Posted in experimental, rnaseq, sequencing

Starting annotation of the P. veronii genome #biocuration

After completing genome assembly (see previous blog post), our students have now started 4 sessions of annotation. They have chosen among 15 proposed topics, of interest either for all bacterial genomes, or specifically for this isolate from a solvent-contaminated soil. Annotation will be done by pairs of students, using by default the GenDB system, which provides a first-pass automatic annotation as well as access to various bioinformatics tools.

The topics chosen are:

  • solvent and/or antibiotics resistance
  • anaerobic respiration
  • chaperones and heat shock proteins
  • metal resistances
  • aerobic respiration
  • prophage regions
  • transposons types and copy numbers
  • genomic islands
  • aromatic compound metabolism
  • chemotaxis and flagella
  • secretion systems (type II, III, IV…)
  • any toxin proteins? What are they?

A sample view of the GenDB window for a random region of the P. veronii genome:

Earlier in the year we rapidly presented the theory behind different annotation resources (Blast, GO, KEGG, etc). We provide the students with a simple guide to annotation, which of course must be followed intelligently (if it were automatic, we’d automate it, not have students do it), and 7 teachers are available to guide this annotation effort. Some of us are microbiologists, some are bioinformaticians, and some even know both fields. 😉

Schema from the annotation guide provided to our students

Posted in annotation

Finishing PacBio assembly, and a presentation by our local PacBio representative

In this first session of the course “Sequence a genome” since we sent DNA to the sequencing facility, we were visited by Gerrit Kuhn, who supports PacBio in our area. He gave a nice presentation of how PacBio works, of the workflow, and of the specific requirements during sample preparation and how they influence data quality. While the sleek animations are expected from a corporate presentation, kudos to Gerrit for being open, giving insight into the advantages and specificities of the PacBio workflow, and answering all questions straightforwardly. (Updated paragraph.)

Gerrit Kuhn of PacBio Switzerland talking to our students.

Gerrit Kuhn of PacBio Switzerland talking some more to our students.

After our first pass of PacBio, we have 9 contigs, with the longest at 5’815’706 bp. We have asked the students to use Mauve to look at our assembly, and compare it to a version of the Pseudomonas veronii genome in NCBI, which has 63 contigs.

Screenshot of Mauve comparison of our contigs (top) with the NCBI genome (bottom). Red lines separate contigs, colored blocks are recognized as similar between the two genomes.

We have two very large contigs, unitig-3 and -5, with strong similarity to the NCBI genome, and some other contigs with almost none. Notice the spaghetti of relations between contig blocks, due to the fragmented assembly in the NCBI genome.

We also compared our contigs with each other. We are thus able to eliminate 3 contigs which are small and entirely redundant with larger contigs within our assembly. We also find that the two largest contigs have enough overlap to join them, which gives us a main chromosome of 6.8 Mb. 🙂 We are not able to circularize it, though.

Two other groups of contigs can be joined into additional molecules, plasmids or secondary chromosomes, to be determined at annotation (in a few weeks). A group of 3 contigs forms an apparent mega-plasmid, which we can circularize thanks to similarity at the ends of the contigs; this also has high similarity to NCBI contigs. The other potential plasmid, formed of two contigs which cannot be circularized, has no similarity to the NCBI sequence, and ends with potential transposons (sequences found elsewhere in the genome).

Thus overall, the PacBio experiment seems to have worked: the DNA extracted by our students and sequenced in our facility has produced a usable assembly. In the next weeks we will annotate this chromosome and these two plasmids.

Posted in assembly

A new adventure: PacBio sequencing and RNA-seq in the classroom

This class has been running since 2010-11, on the following principle:

  • autumn semester: isolate bacterial DNA, sequence using Illumina, assemble;
  • spring semester: close assembly gaps, annotate genome.

As anyone following genomics knows, the times they are a’changing again and again, so this is less and less state-of-the-art. So we have decided to try a new course plan this year, taking advantage of the progress in bacterial genome sequencing with long PacBio reads.

Our new principle is, hopefully:

  • autumn semester: isolate bacterial DNA, sequence using PacBio, assembly trivial, annotate genome;
  • spring semester: RNA-seq under 2 growth conditions, experiments, Illumina sequencing and bioinformatics analysis.

“Hopefully” because PacBio on bacteria is not yet routine, depending on the genus and the growth conditions. We are thus trying two different bacteria, a Pseudomonas which has a cool story for the RNA-seq part, and a Caulobacter which has been shown to work with PacBio. Preliminary studies on the Pseudomonas are somewhat discouraging for the PacBio sequencing, but we will still try, with adaptations of the protocol. We will also keep the possibility of reverting to Illumina sequencing plus assembly, but we would like to avoid that (if Caulobacter is plan B, this is plan C).

And of course, we have never done RNA-seq with master students, so this year will be a new adventure, comparable to our first course in 2010. Stressful and exciting. 🙂

Posted in course plan, sequencing

Last session of annotation: using literature, KEGG, Blast and UniProt (mostly)

Following last week’s summary of some student annotation topics, here are some more. I didn’t have time to interview all the students, so some topics were not yet covered, and some only very rapidly, but it gives an idea of what we’re up to.

Motility: swarming, swimming, twitching: The students searched for the flagellin gene first, and found that it was well annotated automatically. They then annotated neighboring genes on the chromosome, and checked the literature and KEGG for flagella proteins. They found a cluster of genes automatically named with the prefix fli*, near the flagellin hag gene. Through KEGG, they also identified flg genes, which are in a different cluster in the genome. Both clusters have nearby chemotaxis genes. An intriguing observation, not yet resolved at the time of writing, is that the motA and motB genes both appear to be duplicated in another region of the genome, but the annotation is unclear; could there be two flagella in this species? Such a configuration seems to exist in other species, notably for swarming within biofilms. The students also found pilus / twitching genes, as well as two genes involved in swarming (starting from annotations in the Pseudomonas database).

Organic acid metabolism: Starting with a literature review (Badri & Vivanco 2009), the students found organic acids which are secreted by the plant and can be taken up by Pseudomonas. They then looked in KEGG for pathways which can use these compounds, and found two main pathways in Pseudomonas: TCA and Glyoxylate and dicarboxylate metabolism (and pyruvate metabolism, but they gave it lower priority and didn’t have time to annotate it). After checking annotation of these two pathways, they looked for uptake of organic acids, starting from the literature plus KEGG again. They found three uptake systems, and are annotating their regulation.

Metabolism pathways to the TCA, from KEGG

Quorum sensing/Gac system: Starting with a literature search for the GacA/GacS system, the students found an article (Cheung et al 2013) with a transcriptome study of this system and associated genes. They then selected the most upregulated genes in response to the KO of GacS, and identified them in our genome: they are organised in clusters as expected, conserved with the observations in Pseudomonas fluorescens. They then identified the two small RNAs which regulate these clusters.

From pseudomonas.com (click to go to original)

Secondary metabolite production: phloroglucinols (DAPG), phenazines, hydrogen cyanide: The students checked the annotation of genes involved in the production of these three classes of compounds. They notably found a candidate gene cluster for the biosynthesis of phloroglucinols, confirmed it by BLAST, and annotated its regulation.

Polysaccharide synthesis: From a literature search, the students found well characterized genes in other species. They checked these gene names in UniProtKB and KEGG, compared them to our genome by BLAST, and found interesting potential operons. Notably, they found potential small regulatory genes in front of one operon, which were not in the expected conserved genes. They also found an intriguing massive cluster of 20 genes, with two functions: polysaccharide synthesis or lipopolysaccharide synthesis.

Type VI secretion: The students started from KEGG, and found good BLAST hits in our genome. They found genes in common with the genomic island annotation group (see previous post), but the overlap is only partial, which is intriguing: the secretion system was expected to be either entirely inside the genomic island, or entirely outside it. They also found regulatory genes for the secretion system, which they are annotating.

That’s it for now. Of course, it’s a work in progress, and an account based on 5 min interviews with students who are working on their annotation. It gives an idea of the biology that we can start to extract from a genome with a class of master students, in a few hours. Next week the students will present their results in formal presentations to the whole class.

Posted in annotation, bioinformatics, students

Second paper from the Sequence a genome class published

After a first paper in June last year, a second paper from our class of sequencing, assembling and annotating bacterial genomes with Master students is out:

Sequencing and characterizing the genome of Estrella lausannensis as an undergraduate project: training students and biological insights
Claire Bertelli, Sébastien Aeby, Bérénice Chassot, James Clulow, Olivier Hilfiker, Samuel Rappo, Sébastien Ritzmann, Paolo Schumacher, Céline Terrettaz, Paola Benaglio, Laurent Falquet, Laurent Farinelli, Walid H. Gharib, Alexander Goesmann, Keith Harshman, Burkhard Linke, Ryo Miyazaki, Carlo Rivolta, Marc Robinson-Rechavi, Jan Roelof van der Meer and Gilbert Greub
Front. Microbiol. doi: 10.3389/fmicb.2015.00101

The master student names are in bold.

In this paper, we report both the genome and a description of the class organisation during the first year that the class operated, with the challenges of this new approach at the time:

Organization of the course “Sequence a genome” in 2010–2011

Moreover, all the biological conclusions and associated figures are drawn directly from the student reports on their annotation effort:

Map of the 9.1-kb Estrella plasmid

We also included a comparison of similar genomics hands-on courses around the world:

Comparison of genomics courses and teaching initiatives

We hope that this will be helpful not only to microbiologists, but also to teachers everywhere who want to implement similar classes.

Posted in article, course plan

Our first student genome paper is out: Miyazaki et al Environ Microbiol

The first paper describing a genome sequenced, assembled and annotated by students in our course, which was accepted in April (that blogpost includes the full list of authors with student names in bold), is now online:

Comparative genome analysis of Pseudomonas knackmussii B13, the first bacterium known to degrade chloroaromatic compounds

Article first published online: 28 MAY 2014 | DOI: 10.1111/1462-2920.12498
Environmental Microbiology

Here are snapshots of the main figures; clicking on them will take you to Wiley Environ Microbiol:

Whole genome map of P. knackmussii B13 compared with related Pseudomonas species

Phylogenetic analysis of P. knackmussii B13 with other species of the Pseudomonas genus

Detailed comparisons of ICEclc with putative ICE regions in five other bacterial genomes

The five gene regions in the B13 genome encoding flagella components

Posted in article