< 3^rd term

Genome assembly

Last update on the 11^th of December, 2017

The genome of bacterium Buchnera aphidicola is assembled de novo and compared with reference genome via BLAST. Processed spreadsheet data is contained in table.ods

List of downloads
File	Link
Spreadsheet table	table.ods

Read preprocessing

Reads were taken in ENA under accession SRR4240378. First, all Illumina adapters were combined into single file with for i in /P/y16/term3/block3/adapters/*; do cat $i >> adapters.fasta; echo -e "\n" >> adapters.fasta; done. Then, all adapters were removed with Trimmomatic: java -jar /usr/share/java/trimmomatic.jar SE -phred33 SRR4240378.fastq adapter_removed.fastq ILLUMINACLIP:adapters.fasta:2:7:7. Input reads of 4420587 reduced to 4338744 (98,15%). Next, low quality end bases were removed as well as reads shorter than 30 with java -jar /usr/share/java/trimmomatic.jar SE -phred33 adapter_removed.fastq trimmed.fastq LEADING:20 TRAILING:20 MINLEN:30. Number of reads reduced to 4161689 then 94,14% of original reads persist.

Assembly

29-mers were prepared with velveth assembly 29 -short -fastq trimmed.fastq and assembled with velvetg assembly. To spot N50 and top-3 extended contigs, following command was executed infoseq contigs.fa -only -name -length > length_cont.txt and resulting file was processed with spreadsheet software. The properties of assembly are described in table 1. The point is that infoseq and velvet log information differ in contig length. Here and next the data obtained with infoseq is used.

Table 1. Main properties of velvet assembly with 29-mers.
Property	Length	Coverage
N50	14938	N/A
top1	55727	27,43
top2	32758	29,53
top3	26619	27,35
total	667222	28,94

Total number of contigs is 176, 52 of which exhibited coverage 5-fold lower than average one. The distribution of their lengths is shown in fig. 1.

BLAST analysis

Top-3 contigs were extracted with seqret -seq @mylist.txt -out stdout | seqretsplit -seq stdin -auto and were exposed to B. aphidicola chromosome under CP009253 with megablast. Hit maps of top3 contigs are shown in figs. 2-4.

Main properties of alignments are shown in table 2.

Table 2. Properties of contigs' alignments on chromosome.
contig	mismatches	gaps	chromosome	query coverage	identity
top1	8803	1242	528276-473503	77.00%	76.00%
top2	5103	618	627104-613658 17919-2004	79.00%	78.00%
top3	2907	435	126623-140555	52.00%	75.00%

Contigs 1 and 2 are aligned reversely whereas contig 3 alligned forwardly, thus indicating inversion of some segment. As few contigs selected it is impossible to guess which contigs are inverted over others. First two contigs exibit some segments that are not presented in reference genome. Whereas third contig aligns almost totally. Third contig exhibits fewer mismatches and gaps but query coverage is relatively low. However, identities are almost equal among contigs.

This data makes possible to suggest that presented genome evolved steadily from reference one (close values of identity) with several huge genome rearrangements (indels as can be seen in hit maps).

Assembler variations

I sought to compare assembly properties using various input data and assemblers. First, I assembled reads with velvet and 25-mers (velveth 25 25 -short -fastq trimmed.fastq; velvetg 25). Then I chose only half of reads with split -l 8323378 trimmed.fastq and then assembled only first half of reads. Finally, I used an alternative assembler SPAdes (spades.py -s trimmed.fastq -k 29 -o spb). Contigs were processed with infoseq and put into spreadsheet. The results for comparison are shown in table 3.

Table 3. Assembly results.
Parameter	Velvet_29	Velvet_25	Velvet_half	SPAdes
N50	14938	4576	3878	8703
top1 length	55727	26656	13817	23193
top1 coverage	27,43	46,58	16,07	26,74
top2 length	32758	18679	12314	20478
top2 coverage	29,53	43,67	14,71	31,10
top3 length	26619	14875	10755	19104
top3 coverage	27,35	41,24	14,83	28,40
total length	667222	698915	660998	666685
total coverage	28,94	41,76	14,2	28,26
contigs	176	671	355	197

First, some impressions of assemblers. Velvet works significantly faster, than SPAdes (2 mins vs 18 mins). However, contigs' lengths derived from assembler and from infoseq differ in case of Velvet and remains the same in case of SPAdes, thus suggesting Velvet to measure length incorrectly.

Velvet_29 generated assembly with the highest N50. Using 25-mers, a 4-fold increase in amount of contigs and 1,5-fold increase in total coverage was observed along with 30Kb more included in assembly, but N50 decreased drastically. Velvet_half generated 2 times more contigs and exhibited 2 times less coverage, than velvet_29. as anticipated. SPAdes run resulted in relatively close coverage, number of contigs and total length to velvet_29, but N50 was 1,7 times lower. The distributions of lengths and coverages are exhibited on figs. 5 and 6.

Hence, three conclusions emerge. 29-mers is more preferable than 25-mers concerning N50 and therefore assemble quality. Reads are shuffled in random way thus any huge enough part can be chosen to assemble the genome relatively good. And, finally, Velvet deals better with assembling than SPAdes concerning time and N50 although exhibiting some mistakes in length count.