Genome assembly
Last update on the 11th of December, 2017The genome of bacterium Buchnera aphidicola is assembled de novo and compared with reference genome via BLAST. Processed spreadsheet data is contained in table.ods
File | Link |
---|---|
Spreadsheet table | table.ods |
Read preprocessing
Reads were taken in ENA under accession SRR4240378. First, all Illumina adapters were combined into
single file with for i in /P/y16/term3/block3/adapters/*; do cat $i >> adapters.fasta; echo -e "\n" >> adapters.fasta; done
.
Then, all adapters were removed with Trimmomatic:
java -jar /usr/share/java/trimmomatic.jar SE -phred33 SRR4240378.fastq adapter_removed.fastq ILLUMINACLIP:adapters.fasta:2:7:7
.
Input reads of 4420587 reduced to 4338744 (98,15%). Next, low quality end bases were removed as well as reads shorter than 30
with java -jar /usr/share/java/trimmomatic.jar SE -phred33 adapter_removed.fastq trimmed.fastq LEADING:20 TRAILING:20 MINLEN:30
.
Number of reads reduced to 4161689 then 94,14% of original reads persist.
Assembly
29-mers were prepared with velveth assembly 29 -short -fastq trimmed.fastq
and assembled with velvetg assembly
.
To spot N50 and top-3 extended contigs, following command was executed infoseq contigs.fa -only -name -length > length_cont.txt
and resulting file was processed with spreadsheet software. The properties of assembly are described in table 1. The point is that infoseq and velvet
log information differ in contig length. Here and next the data obtained with infoseq is used.
Property | Length | Coverage |
---|---|---|
N50 | 14938 | N/A |
top1 | 55727 | 27,43 |
top2 | 32758 | 29,53 |
top3 | 26619 | 27,35 |
total | 667222 | 28,94 |
Total number of contigs is 176, 52 of which exhibited coverage 5-fold lower than average one. The distribution of their lengths is shown in fig. 1.
BLAST analysis
Top-3 contigs were extracted with seqret -seq @mylist.txt -out stdout | seqretsplit -seq stdin -auto
and were
exposed to B. aphidicola chromosome under CP009253 with megablast. Hit maps of top3 contigs are shown in figs. 2-4.
Main properties of alignments are shown in table 2.
contig | mismatches | gaps | chromosome | query coverage | identity |
---|---|---|---|---|---|
top1 | 8803 | 1242 | 528276-473503 | 77.00% | 76.00% |
top2 | 5103 | 618 | 627104-613658 17919-2004 |
79.00% | 78.00% |
top3 | 2907 | 435 | 126623-140555 | 52.00% | 75.00% |
Contigs 1 and 2 are aligned reversely whereas contig 3 alligned forwardly, thus indicating inversion of some segment. As few contigs selected it is impossible to guess which contigs are inverted over others. First two contigs exibit some segments that are not presented in reference genome. Whereas third contig aligns almost totally. Third contig exhibits fewer mismatches and gaps but query coverage is relatively low. However, identities are almost equal among contigs.
This data makes possible to suggest that presented genome evolved steadily from reference one (close values of identity) with several huge genome rearrangements (indels as can be seen in hit maps).
Assembler variations
I sought to compare assembly properties using various input data and assemblers. First,
I assembled reads with velvet and 25-mers (velveth 25 25 -short -fastq trimmed.fastq; velvetg 25
).
Then I chose only half of reads with split -l 8323378 trimmed.fastq
and then assembled only
first half of reads. Finally, I used an alternative assembler SPAdes (spades.py -s trimmed.fastq -k 29 -o spb
).
Contigs were processed with infoseq and put into spreadsheet. The results for comparison are shown in table 3.
Parameter | Velvet_29 | Velvet_25 | Velvet_half | SPAdes |
---|---|---|---|---|
N50 | 14938 | 4576 | 3878 | 8703 |
top1 length | 55727 | 26656 | 13817 | 23193 |
top1 coverage | 27,43 | 46,58 | 16,07 | 26,74 |
top2 length | 32758 | 18679 | 12314 | 20478 |
top2 coverage | 29,53 | 43,67 | 14,71 | 31,10 |
top3 length | 26619 | 14875 | 10755 | 19104 |
top3 coverage | 27,35 | 41,24 | 14,83 | 28,40 |
total length | 667222 | 698915 | 660998 | 666685 |
total coverage | 28,94 | 41,76 | 14,2 | 28,26 |
contigs | 176 | 671 | 355 | 197 |
First, some impressions of assemblers. Velvet works significantly faster, than SPAdes (2 mins vs 18 mins). However, contigs' lengths derived from assembler and from infoseq differ in case of Velvet and remains the same in case of SPAdes, thus suggesting Velvet to measure length incorrectly.
Velvet_29 generated assembly with the highest N50. Using 25-mers, a 4-fold increase in amount of contigs and 1,5-fold increase in total coverage was observed along with 30Kb more included in assembly, but N50 decreased drastically. Velvet_half generated 2 times more contigs and exhibited 2 times less coverage, than velvet_29. as anticipated. SPAdes run resulted in relatively close coverage, number of contigs and total length to velvet_29, but N50 was 1,7 times lower. The distributions of lengths and coverages are exhibited on figs. 5 and 6.
Hence, three conclusions emerge. 29-mers is more preferable than 25-mers concerning N50 and therefore assemble quality. Reads are shuffled in random way thus any huge enough part can be chosen to assemble the genome relatively good. And, finally, Velvet deals better with assembling than SPAdes concerning time and N50 although exhibiting some mistakes in length count.