Assembly de novoI created a subdirectory /nfs/srv/databases/ngs/potanina.darya/pr14. This directory contains the sample of RNA sequencing of thale cress (Arabidopsis thaliana) and all prodused during making the tasks files. Materials preparingI downloaded a few files contain sequences of adapters from /P/y16/term3/block3/adapters/ and joined them by seqret in adapters.fa. Reads preparation | TrimmomaticRNA sequencing reads was cleaned of adapters by Trimmomatic installed on kodomo. The processing step ILLUMINACLIP has syntax: ILLUMINACLIP:<fastaWithAdaptersEtc>:<seed mismatches>:<palindrome clip threshold>:<simple clip threshold>.
Output:
According the task reads cut of adapters was processed by steps SLIDINGWINDOW and MINLEN of Trimmomatic with given values SLIDINGWINDOW:5:28 (the length of window is 5 nucleotides, the min average quality is 28) and MINLEN:32 (to leave only more then 32 nucleotides length reads).
Output:
Then for checking I conducted quality control of received reads by FastQC installed on kodomo.
The result you can see here:
Outcome: the original file's size (F.fastq) is 1022153 bit or 3869869 db, after cutting of adapters (F_without_adapters.fastq): 1007715 bit or 3869420 db, after cleaning (F_cleaned.fastq): 829597 bit or 3444237 db. Accembly | VelvethFor the genome assembly I prepaired K-mers by Velveth (k = 31). The syntax is: velveth output_directory hash_length [[-file_format][-read_type] filename]. In this case hash_length = 31, file_format = fastq, read_type = short.
Output files:
Then accembly of contigs was run by velvetg.
Output:
Output files:
As we can see in output N50 = 27, the length of the longest contig = 657. Analysed 'stats.txt' I found out some data about contigs (see table 1). All values of length are measured in k-mers. To recieve real lengthes of contigs you should add (k - 1) = 30 to shown values. The one interesting point is that arabidopsis' genome assembly differs from other organisms' genome assemblies, as I have noticed studying the work of other students. There are short contigs and naturally the number of nodes of final graph is high in arabidopsis. But the majority of students worked with other organisms received the final graph contained less nodes because of long contigs. I don't know what this differences is caused of, perhaps it is so because of the characteristics of the initial data. Table 1.
AnnotationThe longest contigFirstly I received the longest contig (found in 'stats.txt') from 'contigs.fa'. It's name is NODE_49528_length_657_cov_10.707763 (find by comand: $ grep "length_657" contigs.fa). It was cut off by 'seqretsplit' and saved in FASTA file. Using BLAST I found the most similar sequences. I used MegaBlast, E-value = 0.0001, word size = 256, db - Nucleotide collection (nr/nt). There were 6 findings (see fig 1). The best finding was Arabidopsis thaliana Coatomer epsilon subunit mRNA (see fig 2). Figure 1. Figure 2. The best covered contigMost of the contigs with high coverage are 1-10 nucleotide length, I think it is useless to align them. So I chose contig longer than 30 which has the highest coverage in 'stats.txt' - this one: NODE_76045_length_37_cov_505.621613 (length = 37, cov = 505.621613). It was analised by MegaBlast (word size = 16). There were 4 findings (see fig 3). The best finding is Arabidopsis thaliana RALF-like 5 (RALFL5), mRNA (see fig 4). Figure 3. Figure 4. The least covered contigI chose the longest contig has minimum coverage in 'stats.txt' - NODE_472902_length_69_cov_1.000000 (length = 69, cov = 1.000000) and analised it by MegaBlast (word size = 16). There were 4 findings (see fig 5). Three findings have the same characteristics so I chose as the best one this: Arabidopsis thaliana egg cell-secreted-like protein (DUF1278) mRNA (see the first one in fig 6). Figure 5. Figure 6.
© Darya Potanina, 2017 |