
BLAST

Last updated on the 30th of October, 2017

Here I perform BLAST searches in nucleotide databases to infer taxonomy, find genes and identify protein orthologs.

List of downloads
File	Link
BLAST of proteins in sample organism	table3.ods
Distance metrics	distance.ods

Brief description of tasks

The following tasks were done and reported via Google Forms:

taxonomy identification of sample nucleotide sequence;
comparison of 3 BLAST algorithms;
protein homology in sample genome;
gene spotting in sample genome scaffold.

Some tasks needed additional information, which is stored on this page.

Taxonomy of sample sequence

Various BLAST algorithms

Protein homology

table3.ods

Gene finding

Virus kinship by sequence similarity

distance.ods

I chose five viruses: banana streak OL, MY, GF, UA and IM. To estimate the kinship of these viruses, I ran an all-against-all tblastx search, which compares a translated query against a translated database. The resulting table was cleaned with a Python script to remove identical hits. To choose thresholds for narrowing the data set further, alignment lengths were sorted and binned with a step of 10 nt (fig. 6) in LibreOffice Calc. The length boundary was set to 130 nt because the bin counts drop sharply from that point on. Further analysis showed that identities exceeded 30%; to cut off low-quality alignments, the identity threshold was set to 40%. A Python script then filtered the table with these thresholds and an E-value cutoff of 1E-3.
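The cleaning and filtering steps can be sketched roughly as follows. This is not the original script. It assumes that the tblastx table is in the standard 12-column BLAST+ tabular format (-outfmt 6), that "identical hits" means self-hits and repeated rows, and that hits are kept when length is at least 130, identity at least 40% and E-value at most 1E-3. All function names and the example rows are made up for illustration.

```python
import csv
import io
from collections import Counter

# Standard 12 columns of BLAST+ tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]

# Made-up rows for illustration only; these are NOT values from distance.ods.
EXAMPLE = "\n".join("\t".join(row) for row in [
    ["OL", "OL", "100.0", "300", "0", "0", "1", "900", "1", "900", "0.0", "700"],
    ["OL", "MY", "45.0", "150", "80", "0", "1", "450", "1", "450", "1e-20", "120"],
    ["OL", "MY", "45.0", "150", "80", "0", "1", "450", "1", "450", "1e-20", "120"],
    ["OL", "GF", "35.0", "200", "130", "0", "1", "600", "1", "600", "1e-10", "90"],
    ["MY", "GF", "50.0", "60", "30", "0", "1", "180", "1", "180", "1e-5", "40"],
    ["MY", "UA", "60.0", "140", "56", "0", "1", "420", "1", "420", "0.5", "30"],
])


def read_hits(handle):
    """Parse tab-separated BLAST hits into dictionaries with numeric fields."""
    hits = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(COLUMNS, row))
        hit["pident"] = float(hit["pident"])
        hit["length"] = int(hit["length"])
        hit["evalue"] = float(hit["evalue"])
        hit["bitscore"] = float(hit["bitscore"])
        hits.append(hit)
    return hits


def drop_self_and_duplicates(hits):
    """Assumed meaning of 'identical hits': self-hits and repeated rows."""
    seen = set()
    kept = []
    for hit in hits:
        if hit["qseqid"] == hit["sseqid"]:
            continue  # a genome aligned to itself
        key = (hit["qseqid"], hit["sseqid"], hit["qstart"], hit["qend"],
               hit["sstart"], hit["send"])
        if key in seen:
            continue  # same alignment reported again
        seen.add(key)
        kept.append(hit)
    return kept


def bin_lengths(hits, step=10):
    """Count alignments per length bin; each key is the lower bound of a bin."""
    return Counter((hit["length"] // step) * step for hit in hits)


def filter_hits(hits, min_length=130, min_identity=40.0, max_evalue=1e-3):
    """Keep hits passing all three thresholds (direction of each is assumed)."""
    return [hit for hit in hits
            if hit["length"] >= min_length
            and hit["pident"] >= min_identity
            and hit["evalue"] <= max_evalue]


if __name__ == "__main__":
    cleaned = drop_self_and_duplicates(read_hits(io.StringIO(EXAMPLE)))
    print(sorted(bin_lengths(cleaned).items()))
    print(len(filter_hits(cleaned)))
```

The bin counts play the role of the histogram built in LibreOffice Calc; with a real table, read_hits would be given an open file instead of the example string.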

The resulting table was then investigated further. First, relations between several properties were plotted (fig. 7). Alignment length and bit-score showed a strong correlation (fig. 7a), which is expected since low-quality alignments had been removed. Identity and bit-score showed a weak correlation (fig. 7b), with points concentrated at low values. However, a group of several alignments showed a strong correlation, indicating putatively homologous DNA segments.
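The page does not say how the strength of these correlations was judged; it may have been read off the scatter plots. One possible way to put a number on it is the Pearson correlation coefficient, sketched below as an assumption rather than as the method actually used. The function name is hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

Applied to the alignment length and bit-score columns, a value near 1 would correspond to the strong relation seen in fig. 7a.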

Based on these observations, a model for calculating distances between species was proposed. Its central part is the ratio of bit-score to alignment length, which indicates the quality per aligned nucleotide: a bigger ratio means a better alignment. For convenience, this ratio was inverted so that it reflects the distance between the two aligned sequences. Two approaches were proposed for calculating the distance between a pair of species. In the first, the distance is the ratio of the sum of alignment lengths to the sum of bit-scores for the given pair. In the second, the average of the ratios over all alignments is taken. Both approaches were applied. The resulting distances correlate weakly (fig. 7c), with a nonlinear dependence, suggesting that the two approaches give discrepant kinship results.
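A minimal sketch of the two distance definitions follows. It assumes that the averaged ratio in the second approach is the inverted one (length divided by bit-score), that hits in both directions (A against B and B against A) are pooled for a pair, and that all bit-scores are positive. The function names and the numbers in the usage note are made up.

```python
from collections import defaultdict


def pair_key(a, b):
    """Unordered species pair, so that A-B and B-A hits are pooled (assumption)."""
    return tuple(sorted((a, b)))


def pair_distances(hits):
    """Compute both distances for every species pair.

    hits: iterable of (query, subject, alignment_length, bit_score).
    Returns two dicts keyed by species pair:
      sum_ratio  = sum of lengths / sum of bit-scores   (first approach)
      mean_ratio = mean of length / bit-score per hit   (second approach)
    """
    groups = defaultdict(list)
    for query, subject, length, bitscore in hits:
        groups[pair_key(query, subject)].append((length, bitscore))

    sum_ratio = {}
    mean_ratio = {}
    for pair, rows in groups.items():
        total_length = sum(length for length, _ in rows)
        total_score = sum(score for _, score in rows)
        sum_ratio[pair] = total_length / total_score
        mean_ratio[pair] = sum(length / score for length, score in rows) / len(rows)
    return sum_ratio, mean_ratio
```

For two made-up hits of a pair, (150, 100.0) and (200, 50.0), the first approach gives 350 / 150, about 2.33, while the second gives (1.5 + 4.0) / 2 = 2.75. The two definitions differ whenever the alignments of a pair vary in quality, because the first weights long, high-scoring alignments more heavily.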

I sought to build trees for both approaches. MEGA can build a tree from a custom distance matrix (fig. 8).
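Tree-building programs usually take pairwise distances as a triangular matrix. The sketch below only arranges pairwise values into a lower-left triangle as tab-separated text; it does not reproduce the exact file syntax MEGA expects, which should be taken from the MEGA documentation. The function name and the example values in the test are hypothetical.

```python
def lower_left_matrix(taxa, dist):
    """Format pairwise distances as a lower-left triangular matrix.

    taxa: list of species labels in the desired row order.
    dist: dict mapping an alphabetically sorted (label, label) tuple to a distance.
    Each output line holds a label followed by its distances to all earlier taxa.
    A pair missing from dist raises KeyError.
    """
    lines = []
    for i, row_taxon in enumerate(taxa):
        values = []
        for col_taxon in taxa[:i]:
            key = tuple(sorted((row_taxon, col_taxon)))
            values.append(f"{dist[key]:.3f}")
        lines.append("\t".join([row_taxon] + values))
    return "\n".join(lines)
```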

Neither tree nor approach can be ruled out outright. However, a quick look at the first distance matrix revealed close kinship for the GF-IM and GF-UA pairs but a large distance for IM-UA. I found this strange, so I assume the average-ratio approach to be more plausible.
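One way to make this kind of observation concrete is a triangle-inequality check: if two species are both close to a third, they would be expected not to be further apart than the sum of those two distances. This is my reading of why the pattern looks strange, not a criterion stated on this page, and the function name is hypothetical.

```python
from itertools import combinations


def triangle_violations(taxa, dist):
    """Return the pairs whose distance exceeds the sum of the other two sides.

    taxa: list of species labels.
    dist: dict mapping an alphabetically sorted (label, label) tuple to a distance.
    """
    def d(a, b):
        return dist[tuple(sorted((a, b)))]

    violations = []
    for a, b, c in combinations(taxa, 3):
        # Sort the three sides of the triangle from shortest to longest.
        sides = sorted([(d(a, b), (a, b)), (d(a, c), (a, c)), (d(b, c), (b, c))])
        (short1, _), (short2, _), (longest, pair) = sides
        if longest > short1 + short2:
            violations.append(pair)
    return violations
```

A distance matrix that yields an empty list is at least internally consistent in this sense; a non-empty list points at the pairs worth a second look.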

The BLAST data and calculated values are stored in the distance.ods file.

References

Saitou N. and Nei M. (1987). The neighbor-joining method: A new method for reconstructing phylogenetic trees. Molecular Biology and Evolution 4:406-425.
Kumar S., Stecher G., and Tamura K. (2016). MEGA7: Molecular Evolutionary Genetics Analysis version 7.0 for bigger datasets. Molecular Biology and Evolution 33:1870-1874.