Progress in genome assembly: 3 strains, 6 QC parameters, 3 software packages, 2 k-mers #bioinformatics

We are restarting this blog after a pause due to other duties, with extra motivation from the acceptance of our first paper.

This autumn, our students worked hard to make their millions of reads into assembled genomes.

The students have worked on a combination of different strains, quality score and read length thresholds for quality control, assembly software, and k-mer length for the assembly:

[Figure: 131119_Assembly_Page_26]

First, quality control of the reads. Example before trimming:

[Figure: per-base quality before trimming]

See the big dip on the right? That’s the base quality going down at the end of the reads. We then trimmed the reads with fastq-mcf, using a quality threshold of 20 or 30 and a minimum read length after trimming of 150, 200 or 250 nucleotides. After trimming, we obtain the following:

[Figure: per-base quality after trimming]

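
To make the two trimming parameters concrete, here is a minimal Python sketch of 3'-end quality trimming followed by a minimum-length filter. It is a simplified illustration, not fastq-mcf's actual algorithm (which also handles adapters, among other things); the function names `phred_scores` and `trim_read` are hypothetical, and Phred+33 quality encoding is an assumption.

```python
def phred_scores(quality_string, offset=33):
    """Convert a FASTQ quality string to a list of Phred scores (Phred+33 assumed)."""
    return [ord(char) - offset for char in quality_string]


def trim_read(seq, quals, min_quality=20, min_length=150):
    """Remove low-quality bases from the 3' end of a read.

    Returns (trimmed_seq, trimmed_quals), or None when the trimmed read
    is shorter than min_length and should be discarded.
    """
    end = len(seq)
    # Walk back from the 3' end while the last kept base is below the threshold.
    while end > 0 and quals[end - 1] < min_quality:
        end -= 1
    if end < min_length:
        return None
    return seq[:end], quals[:end]


if __name__ == "__main__":
    # Toy read: five good bases ('I' = Q40) followed by three poor ones ('#' = Q2).
    read = "ACGTACGT"
    quals = phred_scores("IIIII###")
    print(trim_read(read, quals, min_quality=20, min_length=4))
```

With a quality threshold of 20 or 30 and a minimum length of 150, 200 or 250, the same function would cover the six quality-control combinations mentioned above.
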
Assembling with these different parameters gave widely varying assemblies: for one strain, N50 ranged from 19’271 bp to 148’738 bp, and total length from 1.03 Mb to 6.17 Mb. We chose the best assemblies based on N50, total length and number of contigs >1 kb.
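
For readers unfamiliar with these metrics, the sketch below shows one common way to compute them from a list of contig lengths. N50 is taken here as the length of the contig at which the cumulative length, going from longest to shortest, first reaches half of the total assembly length. The function names are hypothetical, and whether the values reported in this post were computed over all contigs or only those above 1 kb is not stated, so treat that as an open assumption.

```python
def n50(contig_lengths):
    """Return the N50 of a set of contig lengths (0 for an empty assembly)."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        # The first contig that brings the running sum to half the total defines N50.
        if running >= half_total:
            return length
    return 0


def summarize(contig_lengths, min_contig=1000):
    """Collect the three metrics used to compare assemblies."""
    return {
        "n_contigs_over_min": sum(1 for length in contig_lengths if length > min_contig),
        "n50": n50(contig_lengths),
        "total_length": sum(contig_lengths),
    }


if __name__ == "__main__":
    # Toy assembly with five contigs (lengths in bp).
    print(summarize([100, 200, 300, 400, 500]))
```
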

We kept the following assemblies:

| Bacterium | N contigs > 1000 bp | N50 (bp) | Total length (bp) | Min. read length (nt) | Assembler | k-mer |
|---|---|---|---|---|---|---|
| 705 | 92 | 145562 | 6142164 | 250 | SPAdes | 79 |
| 705 | 93 | 148738 | 6149199 | 150 | SPAdes | 91 |
| 705 | 101 | 108208 | 6160442 | 150 | Edena | 79 |
| 705 | 116 | 108277 | 6144546 | 150 | Velvet | 79 |
| 743 | 50 | 313482 | 7399154 | 200 | SPAdes | 81 |
| 743 | 102 | 128710 | 7249125 | 150 | Edena | 75 |
| 743 | 100 | 121247 | 7233116 | 150 | Velvet | 75 |
| 757 | 82 | 185576 | 6218735 | 250 | SPAdes | 87 |
| 757 | 90 | 159912 | 6144964 | 200 | SPAdes | 73 |
| 757 | 98 | 118871 | 6162918 | 150 | Edena | 83 |
| 757 | 110 | 113990 | 6146891 | 150 | Velvet | 91 |

The first line for each bacterial strain is considered the best assembly.
