BLAST and similarity search
Last updated on 1 May 2017

This work is an introduction to BLAST, software used to find regions of similarity between sequences. Here I assess the homology of several proteins and work out the domain organisation of homologous proteins.
File | Link |
---|---|
Jalview project with all alignments of this task | project_cobalt.jvp |
Long version of Table 1 | long_table.txt |
BLAST search strategy | blast_ss.asn |
Domain search strategy | I-set_search_strategy.asn |
Homology evaluation
How can several proteins homologous to merA[1] be found? BLAST eases the task. Using a heuristic algorithm, it builds local alignments between the query sequence and the sequences of a chosen database, then outputs a table of hit proteins, their local alignments and measures of alignment "quality" (similarity and statistical significance). I ran the query (see the BLAST search strategy) and chose seven hits with an e-value below 1,0E-05, one with an e-value above 0,001 and one with an e-value above 1. Their main properties are shown in Table 1.
Uniprot AC | Protein name | Coverage, % | Identity, % | E-value | Bit score | Homology |
---|---|---|---|---|---|---|
A0A126V644 | Mercuric reductase | 100 | 100 | N/A | N/A | N/A |
P16171.1 | Mercuric reductase | 97 | 47,198 | 1,88E‑139 | 416 | Y |
P85207.2 | Dihydrolipoyl dehydrogenase | 95 | 37,826 | 1,72E‑79 | 255 | Y |
P54533.1 | Dihydrolipoyl dehydrogenase | 93 | 30,968 | 1,93E‑57 | 198 | Y |
P72740.3 | Dihydrolipoyl dehydrogenase | 93 | 29,59 | 5,75E‑47 | 170 | Y |
Q60151.1 | Glutathione reductase | 86 | 30,876 | 2,95E‑38 | 145 | Y |
Q5XC60.1 | Probable NADH oxidase | 68 | 25 | 3,92E‑14 | 75,5 | Y |
P08655.1 | Uncharacterized 19.7 kDa protein in mercuric resistance operon | 26 | 27,778 | 6,98E‑12 | 65,1 | Y |
A9NFF6.2 | Ferredoxin-NADP reductase | 44 | 25 | 0,037 | 37,4 | N |
Q7TUJ1.1 | tRNA uridine 5-carboxymethylaminomethyl modification enzyme MnmG (Glucose-inhibited division protein A) | 6 | 50 | 2,3 | 32 | N |
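
The way the hits were picked by e-value can be sketched in Python. This is only an illustration under my own assumptions: the names `HITS`, `parse_evalue` and `evalue_bin` are hypothetical, the e-values are copied from Table 1 (with a plain hyphen in the exponent), and the three bins follow the thresholds mentioned in the text. The bins describe how hits were selected, not whether they are homologous; that judgement also relies on domain architecture and the blocks of homology.

```python
# E-values copied from Table 1, written with a decimal comma as in the table.
HITS = [
    ("P16171.1", "1,88E-139"),
    ("P85207.2", "1,72E-79"),
    ("P54533.1", "1,93E-57"),
    ("P72740.3", "5,75E-47"),
    ("Q60151.1", "2,95E-38"),
    ("Q5XC60.1", "3,92E-14"),
    ("P08655.1", "6,98E-12"),
    ("A9NFF6.2", "0,037"),
    ("Q7TUJ1.1", "2,3"),
]


def parse_evalue(text):
    """Convert a decimal-comma string such as '1,88E-139' to a float."""
    return float(text.replace(",", "."))


def evalue_bin(evalue):
    """Assign an e-value to one of the selection bins used in the text."""
    if evalue < 1e-5:
        return "below 1E-05"
    if evalue > 1:
        return "above 1"
    if evalue > 1e-3:
        return "above 0.001"
    return "between 1E-05 and 0.001"


# Group accession numbers by bin, keeping the order of Table 1.
bins = {}
for accession, raw in HITS:
    bins.setdefault(evalue_bin(parse_evalue(raw)), []).append(accession)

for label, accessions in bins.items():
    print(label, accessions)
```
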
I downloaded the regions of the hit sequences that were covered by the local alignments with the query. I then added the query sequence and uploaded the file to the COBALT multiple alignment service. I chose COBALT because it builds multiple alignments using information about conserved domains and local similarities, which suits my aims best. The resulting alignment was downloaded and annotated with putative blocks of homology (fig. 1).
So, the main question: which of these sequences are homologous to merA? I considered the first four hits unequivocally homologous: they have very small e-values and high bit scores, similar names and functions, the same domain architecture, and they take part in most blocks of homology. The ninth hit was rejected immediately: it shows only a short region of similarity, a low bit score and a large e-value, and it has a different function (although it is involved in two blocks). As for the eighth hit, although its coverage is fairly high, it has a low identity percentage and an unfavourable e-value and bit score. The seventh hit is much better conserved than the eighth (zero gaps and a much lower e-value) and performs the same function as the query protein, although its identity percentage is close to that of the eighth hit and its coverage is lower. The fifth and sixth hits were considered homologous to the query because they take part in most of the homology blocks, have favourable e-values and bit scores, and show a similar domain organisation.
Extended properties are available in the long table. In the end, the first seven hits were judged homologous to merA and the last two were not.
Domains
BLAST can also compare just two sequences in order to find similar regions. The result is a dot matrix view with two axes, in which diagonals mark similar, putatively homologous, parts of the two proteins. To try this function out, I chose the proteins W5Q1Q5 and W5QG41. Both contain the immunoglobulin I-set domain, which is mostly associated with cell adhesion[2]. These proteins are found in sheep and may take part in ATP binding, Ser/Thr kinase activity[3] or muscle contraction[4]. They were chosen after a long search for a sufficiently simple example of domain duplication.
The main parameter values used for the BLAST run are stored in the search strategy (E-value threshold: 1E-05, word size: 2). The resulting dot matrix is shown in fig. 2.
Several evolutionary events can be inferred from the picture: indels in sectors A4 and B1, a partial duplication in the B5-A2 region, and full homology of B2 with A2A3, B3 with A5, B5 with A2, and B2B3 with A1. Region B4 is not homologous to any of the counterpart regions.
These observations can be summarised as follows: B2B3 ≈ A1, B2 ≈ A2A3, B3 ≈ A5, B5 ≈ A2. By transitivity, A1 ≈ A2A3A5. Here are letter-coded representations of both sequences: W5Q1Q5 = KLNKLMN, W5QG41 = JKLNOK, in which the same letter stands for homologous regions and different letters stand for non-homologous ones. The diagonals and the letter-coded representations show two I-set domains in W5Q1Q5 and only one in W5QG41, with several indels and non-homologous regions.
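
To make the link between letter codes and diagonals concrete, here is a small Python sketch that builds a dot matrix directly from the two letter-coded strings. It is a toy model under my own assumptions: `dot_matrix` and `diagonals` are hypothetical helper names, a dot means an exact match of `word` consecutive letters, and the input is the letter codes rather than the real amino acid sequences. BLAST itself scores alignments instead of looking for exact matches, so this only mimics the shape of the picture.

```python
def dot_matrix(seq_a, seq_b, word=1):
    """Return the set of (i, j) where seq_a[i:i+word] equals seq_b[j:j+word]."""
    dots = set()
    for i in range(len(seq_a) - word + 1):
        for j in range(len(seq_b) - word + 1):
            if seq_a[i:i + word] == seq_b[j:j + word]:
                dots.add((i, j))
    return dots


def diagonals(dots):
    """Merge adjacent dots into diagonal runs (start_i, start_j, length)."""
    runs = []
    for i, j in sorted(dots):
        if (i - 1, j - 1) in dots:
            continue  # this dot continues a run that started earlier
        length = 1
        while (i + length, j + length) in dots:
            length += 1
        runs.append((i, j, length))
    return runs


W5Q1Q5 = "KLNKLMN"  # letter code from the text
W5QG41 = "JKLNOK"   # letter code from the text

runs = diagonals(dot_matrix(W5Q1Q5, W5QG41))
for start_a, start_b, length in runs:
    print(start_a, start_b, W5Q1Q5[start_a:start_a + length])
```

In this toy model the longest run is KLN, which is found once in W5QG41. In W5Q1Q5 the second copy shows up as two shorter runs, KL and N, on the same rows of W5QG41 but shifted by one position because of the inserted M. This agrees with the reading of two I-set domains in W5Q1Q5 against one in W5QG41.
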
References
- MerA page on this site;
- Pfam article about I-set domain;
- Uniprot AC: W5QG41;
- Uniprot AC: W5Q1Q5.