< 3rd term

Genome assembly

Last update on the 11th of December, 2017

The genome of bacterium Buchnera aphidicola is assembled de novo and compared with reference genome via BLAST. Processed spreadsheet data is contained in table.ods

List of downloads
File Link
Spreadsheet table table.ods

Read preprocessing

Reads were taken in ENA under accession SRR4240378. First, all Illumina adapters were combined into single file with for i in /P/y16/term3/block3/adapters/*; do cat $i >> adapters.fasta; echo -e "\n" >> adapters.fasta; done. Then, all adapters were removed with Trimmomatic: java -jar /usr/share/java/trimmomatic.jar SE -phred33 SRR4240378.fastq adapter_removed.fastq ILLUMINACLIP:adapters.fasta:2:7:7. Input reads of 4420587 reduced to 4338744 (98,15%). Next, low quality end bases were removed as well as reads shorter than 30 with java -jar /usr/share/java/trimmomatic.jar SE -phred33 adapter_removed.fastq trimmed.fastq LEADING:20 TRAILING:20 MINLEN:30. Number of reads reduced to 4161689 then 94,14% of original reads persist.

Assembly

29-mers were prepared with velveth assembly 29 -short -fastq trimmed.fastq and assembled with velvetg assembly. To spot N50 and top-3 extended contigs, following command was executed infoseq contigs.fa -only -name -length > length_cont.txt and resulting file was processed with spreadsheet software. The properties of assembly are described in table 1. The point is that infoseq and velvet log information differ in contig length. Here and next the data obtained with infoseq is used.

Table 1. Main properties of velvet assembly with 29-mers.
Property Length Coverage
N50 14938 N/A
top1 55727 27,43
top2 32758 29,53
top3 26619 27,35
total 667222 28,94

Total number of contigs is 176, 52 of which exhibited coverage 5-fold lower than average one. The distribution of their lengths is shown in fig. 1.

Fig. 1. Distribution of contig lengths with aberrant coverage.
Most contigs are of 100 nts long.

BLAST analysis

Top-3 contigs were extracted with seqret -seq @mylist.txt -out stdout | seqretsplit -seq stdin -auto and were exposed to B. aphidicola chromosome under CP009253 with megablast. Hit maps of top3 contigs are shown in figs. 2-4.

Fig. 2. Hit map of top1 contig against chromosome.
Query contig contains some stretches that do not align on chromosome (aka indels). Whole length of contig is aligned.
Fig. 3. Hit map of top2 contig against chromosome.
Query contig contains some stretches that do not align on chromosome (aka indels). Whole length of contig is aligned. As bacterial chromosome is circular no embarassment arises.
Fig. 4. Hit map of top3 contig against chromosome.
Contig segment 0-12Kb didn't aligned on chromosome. 12-26Kb aligned properly.

Main properties of alignments are shown in table 2.

Table 2. Properties of contigs' alignments on chromosome.
contig mismatches gaps chromosome query coverage identity
top1 8803 1242 528276-473503 77.00% 76.00%
top2 5103 618 627104-613658
17919-2004
79.00% 78.00%
top3 2907 435 126623-140555 52.00% 75.00%

Contigs 1 and 2 are aligned reversely whereas contig 3 alligned forwardly, thus indicating inversion of some segment. As few contigs selected it is impossible to guess which contigs are inverted over others. First two contigs exibit some segments that are not presented in reference genome. Whereas third contig aligns almost totally. Third contig exhibits fewer mismatches and gaps but query coverage is relatively low. However, identities are almost equal among contigs.

This data makes possible to suggest that presented genome evolved steadily from reference one (close values of identity) with several huge genome rearrangements (indels as can be seen in hit maps).

Assembler variations

I sought to compare assembly properties using various input data and assemblers. First, I assembled reads with velvet and 25-mers (velveth 25 25 -short -fastq trimmed.fastq; velvetg 25). Then I chose only half of reads with split -l 8323378 trimmed.fastq and then assembled only first half of reads. Finally, I used an alternative assembler SPAdes (spades.py -s trimmed.fastq -k 29 -o spb). Contigs were processed with infoseq and put into spreadsheet. The results for comparison are shown in table 3.

Table 3. Assembly results.
Parameter Velvet_29 Velvet_25 Velvet_half SPAdes
N50 14938 4576 3878 8703
top1 length 55727 26656 13817 23193
top1 coverage 27,43 46,58 16,07 26,74
top2 length 32758 18679 12314 20478
top2 coverage 29,53 43,67 14,71 31,10
top3 length 26619 14875 10755 19104
top3 coverage 27,35 41,24 14,83 28,40
total length 667222 698915 660998 666685
total coverage 28,94 41,76 14,2 28,26
contigs 176 671 355 197

First, some impressions of assemblers. Velvet works significantly faster, than SPAdes (2 mins vs 18 mins). However, contigs' lengths derived from assembler and from infoseq differ in case of Velvet and remains the same in case of SPAdes, thus suggesting Velvet to measure length incorrectly.

Velvet_29 generated assembly with the highest N50. Using 25-mers, a 4-fold increase in amount of contigs and 1,5-fold increase in total coverage was observed along with 30Kb more included in assembly, but N50 decreased drastically. Velvet_half generated 2 times more contigs and exhibited 2 times less coverage, than velvet_29. as anticipated. SPAdes run resulted in relatively close coverage, number of contigs and total length to velvet_29, but N50 was 1,7 times lower. The distributions of lengths and coverages are exhibited on figs. 5 and 6.

Fig. 5. Distribution of contig lengths.
All assemblies contain few extended contigs and many shortened thus indicating putative common feature of assemblers based on de Brujin graphs.
Fig. 6. Distribution of contig coverages.
Few very high values are omitted (4125, 3075 for velvet_29 and 1463, 553,5 for SPAdes). All assemblies exhibits few high values, plateus equal to overall coverage and low values.

Hence, three conclusions emerge. 29-mers is more preferable than 25-mers concerning N50 and therefore assemble quality. Reads are shuffled in random way thus any huge enough part can be chosen to assemble the genome relatively good. And, finally, Velvet deals better with assembling than SPAdes concerning time and N50 although exhibiting some mistakes in length count.