
BLAST

Last update on the 30th of October, 2017

Here I run BLAST searches against nucleotide databases to infer taxonomy and to find genes and protein orthologs.

List of downloads
File                                     Link
BLAST of proteins in sample organism     table3.ods
Distance metrics                         distance.ods

Brief description of tasks

The following tasks were done and reported via Google Forms:

  • taxonomy identification of sample nucleotide sequence;
  • comparison of 3 BLAST algorithms;
  • protein homology in sample genome;
  • gene spotting in sample genome scaffold.

Some tasks needed additional information, which is stored on this page.

Taxonomy of sample sequence

Fig. 1. Tree of findings.

Various BLAST algorithms

Fig. 2. Diagram for megablast findings.
Fig. 3. Diagram for discontiguous megablast findings.
Fig. 4. Diagram for blastn findings.

Protein homology

table3.ods

Gene finding

Fig. 5. Diagram for blastx findings.

Virus kinship by sequence similarity

distance.ods

I chose five viruses: banana streak OL, MY, GF, UA and IM. To estimate their kinship I ran an all-against-all tblastx search, which compares a translated query against a translated database. The resulting table was cleaned with a Python script to remove identical findings. To choose thresholds for narrowing the data further, alignment lengths were sorted and binned with a step of 10 nt (fig. 6) in LibreOffice Calc. The boundary was set to 130 nt because bin counts drop sharply from that point on. Further analysis showed identities above 30%; to cut off low-quality alignments, the identity threshold was set to 40%. A Python script then filtered the table by these thresholds and by an E-value cutoff of 1E-3.
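The cleaning script itself is not shown on this page, so the following is only a minimal sketch of what such a filter might look like. It assumes BLAST tabular output (`-outfmt 6`) with the default twelve columns, reads "identical findings" as hits of a sequence against itself, and keeps alignments that are at least 130 nt long, at least 40% identical and have an E-value of at most 1E-3. The function names and the sample rows are invented for illustration.

```python
import csv
import io

# Default column order of BLAST tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def read_hits(handle):
    """Parse tab-separated BLAST hits into dicts with numeric fields."""
    hits = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(COLUMNS, row))
        hit["pident"] = float(hit["pident"])
        hit["length"] = int(hit["length"])
        hit["evalue"] = float(hit["evalue"])
        hit["bitscore"] = float(hit["bitscore"])
        hits.append(hit)
    return hits


def filter_hits(hits, min_length=130, min_identity=40.0, max_evalue=1e-3):
    """Drop self-hits and alignments failing any threshold (assumed directions)."""
    kept = []
    for hit in hits:
        if hit["qseqid"] == hit["sseqid"]:
            continue  # assumption: "identical findings" means self-hits
        if hit["length"] < min_length:
            continue
        if hit["pident"] < min_identity:
            continue
        if hit["evalue"] > max_evalue:
            continue
        kept.append(hit)
    return kept


# Invented sample rows, not real tblastx output.
SAMPLE_ROWS = [
    ("OL", "OL", "100.0", "2400", "0", "0", "1", "7200", "1", "7200", "0.0", "5000"),
    ("OL", "MY", "55.2", "310", "139", "0", "100", "1029", "90", "1019", "1e-80", "290"),
    ("OL", "GF", "35.0", "200", "130", "0", "1", "600", "1", "600", "1e-10", "80"),
    ("MY", "UA", "48.0", "90", "47", "0", "1", "270", "1", "270", "1e-6", "60"),
    ("GF", "IM", "42.0", "150", "87", "0", "1", "450", "1", "450", "0.5", "30"),
]
SAMPLE = "\n".join("\t".join(row) for row in SAMPLE_ROWS)

kept = filter_hits(read_hits(io.StringIO(SAMPLE)))
for hit in kept:
    print(hit["qseqid"], hit["sseqid"], hit["length"], hit["pident"])
```

In this toy input only the OL-MY row survives: the others are a self-hit, an alignment with too low identity, one that is too short and one with too high an E-value.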

Fig. 6. Alignment length bins counts.
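The binning behind fig. 6 was done in LibreOffice Calc. A rough Python equivalent is sketched below; it assumes that each bin is labelled by its lower edge and includes that edge, which may differ from the spreadsheet setup. The function name and the example lengths are made up.

```python
from collections import Counter


def bin_lengths(lengths, step=10):
    """Count alignments per bin; each key is the lower edge of its bin."""
    counts = Counter((length // step) * step for length in lengths)
    return dict(sorted(counts.items()))


# Made-up alignment lengths, for illustration only.
example_lengths = [12, 18, 25, 131, 139, 140]
print(bin_lengths(example_lengths))
# prints {10: 2, 20: 1, 130: 2, 140: 1}
```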

The resulting table was investigated further. First, relations between several properties were plotted (fig. 7). Alignment length and bit-score correlated strongly (fig. 7a), which is expected since low-quality alignments had been removed. Identity and bit-score correlated weakly (fig. 7b), with points concentrated at low values. However, a group of several alignments showed a strong correlation, indicating putatively homologous DNA segments.

Fig. 7. Relations between properties of BLAST-derived alignments.
(a) Strong correlation between alignment length and its bit-score. (b) Weak correlation between identity % and bit-score. (c) Two approaches for distance counting correlate weakly.
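The page does not say how the strength of these correlations was judged. One common way to put a number on it is the Pearson correlation coefficient, sketched below as an assumption rather than as the method actually used. The function name is hypothetical and the example values are invented.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Invented alignment lengths and bit-scores, for illustration only.
lengths = [130, 150, 200, 310, 420]
bit_scores = [60.0, 75.0, 110.0, 160.0, 230.0]
print(round(pearson(lengths, bit_scores), 3))
```

A value near 1 would correspond to the strong relation in fig. 7a, a value near 0 to the weak one in fig. 7b.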

Based on these observations, a model for calculating distances between species was proposed. Its central part is the ratio of bit-score to alignment length, which indicates the quality of each aligned nucleotide: a larger ratio means a better alignment. For convenience, the ratio was inverted (alignment length divided by bit-score) so that it reflects distance. Two approaches were proposed for calculating the distance between a pair of species. In the first, the distance is the ratio of the summed alignment lengths to the summed bit-scores for the given pair. In the second, the average of the ratios over all alignments of the pair is taken. Both approaches were applied. The resulting distances correlate weakly (fig. 7c) and nonlinearly, suggesting that the two approaches give discrepant kinship results.
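The two distance definitions can be written down in a few lines. This is a sketch under the assumption that the second approach averages the inverted ratio (alignment length divided by bit-score) over the alignments of a pair. The function names and the example alignments are invented and are not taken from distance.ods.

```python
def ratio_of_sums(alignments):
    """First approach: summed alignment lengths over summed bit-scores."""
    total_length = sum(length for length, _ in alignments)
    total_bits = sum(bits for _, bits in alignments)
    return total_length / total_bits


def average_ratio(alignments):
    """Second approach: mean of per-alignment length / bit-score ratios."""
    ratios = [length / bits for length, bits in alignments]
    return sum(ratios) / len(ratios)


# Invented (alignment length, bit-score) pairs for one pair of species.
pair_alignments = [(100, 50.0), (300, 300.0)]
print(f"{ratio_of_sums(pair_alignments):.3f}")
print(f"{average_ratio(pair_alignments):.3f}")
```

For this toy pair the first approach gives 400 / 350 and the second gives (2.0 + 1.0) / 2, so the two definitions can disagree even on the same alignments: the ratio of sums is dominated by long, high-scoring alignments, while the average ratio weights every alignment equally.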

I then looked for software to build trees for both approaches. MEGA can build a tree from a custom distance matrix (fig. 8).

Fig. 8. Trees of viruses relationships.
The evolutionary history was inferred using the Neighbor-Joining method[1]. Evolutionary analyses were conducted in MEGA7[2]. (a) Tree based on ratio-of-sums approach. (b) Tree based on average-ratio approach.
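The trees in fig. 8 were built in MEGA. To show what the neighbor-joining step does with a distance matrix, here is a bare-bones sketch of the algorithm of Saitou and Nei. It is not MEGA's implementation, the function name and the return format are hypothetical, and the five-taxon distances are toy values, not the ones in distance.ods.

```python
from itertools import combinations


def neighbor_joining(names, dist):
    """Join taxa pairwise; dist maps frozenset({a, b}) to a distance.

    Returns (joins, last_edge): joins is a list of
    (node_a, node_b, branch_a, branch_b, new_node) and last_edge is
    (node_a, node_b, length) for the final two remaining nodes.
    """
    d = dict(dist)
    nodes = list(names)
    joins = []
    while len(nodes) > 2:
        n = len(nodes)
        # Sum of distances from each node to all other current nodes.
        r = {a: sum(d[frozenset((a, b))] for b in nodes if b != a) for a in nodes}
        # Choose the pair that minimises the Q criterion.
        a, b = min(combinations(nodes, 2),
                   key=lambda p: (n - 2) * d[frozenset(p)] - r[p[0]] - r[p[1]])
        d_ab = d[frozenset((a, b))]
        branch_a = d_ab / 2 + (r[a] - r[b]) / (2 * (n - 2))
        branch_b = d_ab - branch_a
        new = f"({a},{b})"
        # Distances from the new internal node to every remaining node.
        for k in nodes:
            if k not in (a, b):
                d[frozenset((new, k))] = (
                    d[frozenset((a, k))] + d[frozenset((b, k))] - d_ab) / 2
        nodes = [k for k in nodes if k not in (a, b)] + [new]
        joins.append((a, b, branch_a, branch_b, new))
    a, b = nodes
    return joins, (a, b, d[frozenset((a, b))])


# Toy distances between five hypothetical taxa.
raw = {("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
       ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
       ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3}
toy_dist = {frozenset(pair): value for pair, value in raw.items()}

joins, last_edge = neighbor_joining(["a", "b", "c", "d", "e"], toy_dist)
for node_a, node_b, branch_a, branch_b, new_node in joins:
    print(f"{node_a}:{branch_a} + {node_b}:{branch_b} -> {new_node}")
print("last edge:", last_edge)
```

Ties in the Q criterion are broken by input order here, so with equal scores the resulting topology can depend on how the taxa are listed.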

Neither tree nor approach can be ruled out outright. However, a quick look at the first distance matrix revealed close kinship for the GF-IM and GF-UA pairs but a large distance for IM-UA. I found this strange, so I consider the average-ratio approach more plausible.

BLAST data and calculated values are stored in the distance.ods file.

References

  1. Saitou N. and Nei M. (1987). The neighbor-joining method: A new method for reconstructing phylogenetic trees. Molecular Biology and Evolution 4:406-425.
  2. Kumar S., Stecher G., and Tamura K. (2016). MEGA7: Molecular Evolutionary Genetics Analysis version 7.0 for bigger datasets. Molecular Biology and Evolution 33:1870-1874.