
BLAST and similarity search

Last update on the 1st of May, 2017

This work is an introduction to BLAST, software used to find regions of similarity between sequences. Here I assess the homology of several proteins and work out the domain organisation of homologous proteins.

List of downloads
File | Link
Jalview project with all alignments of this task | project_cobalt.jvp
Long version of Table 1 | long_table.txt
BLAST search strategy | blast_ss.asn
Domain search strategy | I-set_search_strategy.asn

Homology evaluation

project_cobalt.jvp, blast_ss.asn, long_table.txt

How can one find several proteins homologous to merA[1]? BLAST eases the task. Using a heuristic algorithm, it builds local alignments between the query sequence and the sequences of a chosen database, and outputs a table of proteins, their local alignments and measures of alignment quality such as identity, E-value and bit score. I ran the query (see the BLAST search strategy) and chose seven hits with an E-value below 1.0E-05, one with an E-value above 0.001 and one with an E-value above 1. Their main properties are shown in Table 1.

Table 1. Properties of several BLAST hits.
Uniprot AC | Protein name | Coverage, % | Identity, % | E-value | Bit score | Homology
A0A126V644 | Mercuric reductase | 100 | 100 | N/A | N/A | N/A
P16171.1 | Mercuric reductase | 97 | 47.198 | 1.88E-139 | 416 | Y
P85207.2 | Dihydrolipoyl dehydrogenase | 95 | 37.826 | 1.72E-79 | 255 | Y
P54533.1 | Dihydrolipoyl dehydrogenase | 93 | 30.968 | 1.93E-57 | 198 | Y
P72740.3 | Dihydrolipoyl dehydrogenase | 93 | 29.59 | 5.75E-47 | 170 | Y
Q60151.1 | Glutathione reductase | 86 | 30.876 | 2.95E-38 | 145 | Y
Q5XC60.1 | Probable NADH oxidase | 68 | 25 | 3.92E-14 | 75.5 | Y
P08655.1 | Uncharacterized 19.7 kDa protein in mercuric resistance operon | 26 | 27.778 | 6.98E-12 | 65.1 | Y
A9NFF6.2 | Ferredoxin-NADP reductase | 44 | 25 | 0.037 | 37.4 | N
Q7TUJ1.1 | tRNA uridine 5-carboxymethylaminomethyl modification enzyme MnmG (Glucose-inhibited division protein A) | 6 | 50 | 2.3 | 32 | N
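
The way the hits fall into the three E-value ranges used for selection can be illustrated with a short sketch. The accessions and E-values are copied from Table 1; the names `evalue_group` and `group_hits` and the group labels are made up for this illustration and are not part of BLAST.

```python
# (accession, E-value) pairs copied from Table 1; the query row is omitted.
HITS = [
    ("P16171.1", 1.88e-139),
    ("P85207.2", 1.72e-79),
    ("P54533.1", 1.93e-57),
    ("P72740.3", 5.75e-47),
    ("Q60151.1", 2.95e-38),
    ("Q5XC60.1", 3.92e-14),
    ("P08655.1", 6.98e-12),
    ("A9NFF6.2", 0.037),
    ("Q7TUJ1.1", 2.3),
]


def evalue_group(evalue):
    """Assign an E-value to one of the ranges used when choosing hits."""
    if evalue < 1e-5:
        return "below 1e-5"
    if evalue > 1:
        return "above 1"
    if evalue > 1e-3:
        return "above 0.001"
    return "between 1e-5 and 0.001"


def group_hits(hits):
    """Map each range label to the accessions that fall into it."""
    groups = {}
    for accession, evalue in hits:
        groups.setdefault(evalue_group(evalue), []).append(accession)
    return groups


if __name__ == "__main__":
    for label, accessions in group_hits(HITS).items():
        print(f"{label}: {len(accessions)} hit(s): {', '.join(accessions)}")
```

The thresholds only sort the hits into ranges; whether a hit is actually homologous is still decided by inspecting the alignment.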

I downloaded the regions of the hit sequences that were included in the local alignments with the query. Then I added the query sequence to them and uploaded the file to the COBALT multiple alignment service. I chose COBALT because it builds multiple alignments using information about conserved domains and local similarities, which suits my aims best. The resulting alignment was downloaded and annotated with putative blocks of homology (fig. 1).
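
Cutting out the aligned regions and assembling a FASTA file for upload could look roughly like this. It is only a sketch: the sequences, names and coordinates below are invented placeholders rather than the real hits, the helper names are hypothetical, and the coordinates are assumed to be 1-based and inclusive, as they appear in BLAST reports.

```python
def extract_region(sequence, start, end):
    """Return the part of a sequence given by 1-based inclusive coordinates."""
    if not 1 <= start <= end <= len(sequence):
        raise ValueError(f"invalid region {start}-{end} for length {len(sequence)}")
    return sequence[start - 1:end]


def to_fasta(records, width=60):
    """Format (name, sequence) pairs as FASTA text with fixed-width lines."""
    lines = []
    for name, sequence in records:
        lines.append(f">{name}")
        for i in range(0, len(sequence), width):
            lines.append(sequence[i:i + width])
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder data: NOT the real merA query or hit sequences.
    query = ("query", "MKTAYIAKQRQISFVKSHFSRQ")
    hits = [
        # (name, full sequence, aligned start, aligned end)
        ("hit_1", "GGMKTAYLAKQRQISFIKSHWW", 3, 20),
        ("hit_2", "MRTAYIGKQRAA", 1, 10),
    ]
    records = [query]
    for name, sequence, start, end in hits:
        region = extract_region(sequence, start, end)
        records.append((f"{name}/{start}-{end}", region))
    print(to_fasta(records), end="")
```

Putting the coordinates into the record name keeps track of which part of each hit went into the multiple alignment.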

Fig. 1. Multiple alignment of query and hit sequences. Putative blocks of homology are outlined in black. Colouring scheme: ClustalX.

So, the main question: which of these sequences are homologous to merA? I concluded that the first four hits are unequivocally homologous: they have very low E-values, high bit scores, similar names and functions, exhibit the same domain architecture and take part in most blocks of homology. The ninth hit was rejected immediately: it shows only a short region of similarity, a low bit score, a high E-value and a different function (although it is involved in two blocks). As for the eighth hit, although its coverage is fairly high, it has low identity and an unfavourable E-value and bit score. The seventh hit is much better conserved (zero gaps) and has a far lower E-value than the eighth, and it is associated with the same function as the query protein (mercuric resistance), although its identity is similar to that of the eighth hit and its coverage is lower. The fifth and sixth hits were judged homologous to the query because they take part in most of the homology blocks, have favourable E-values and bit scores and show a similar domain organisation.

Extended properties are available in the long table. In the end, the first seven hits were judged homologous to merA and the last two were not.

Domains

I-set_search_strategy.asn

BLAST can do the same work with only two sequences in order to find similar regions. The result includes a dot matrix view with two axes, in which diagonals correspond to similar, putatively homologous parts of the two proteins. To try out this function, I chose the proteins W5Q1Q5 and W5QG41. Both contain the immunoglobulin I-set domain, which is mostly associated with cell adhesion[2]. These proteins are found in sheep and may take part in ATP binding, Ser/Thr kinase activity[3] or muscle contraction[4]. They were chosen after a long search for a sufficiently simple example of domain duplication.

The main parameters used in the BLAST run are stored in the search strategy (E-value threshold: 1E-05, word size: 2). The resulting dot matrix is shown in fig. 2.

Fig. 2. Dot matrix of W5Q1Q5 (horizontal axis) and W5QG41 (vertical axis) with labeled blocks of homology.
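
To show why a duplicated region produces parallel diagonals in a dot matrix, here is a simplified sketch. It is not the BLAST algorithm: it only marks positions where words of the given size are exactly identical, with no scoring matrix, extension or E-value filtering. The toy sequences are hypothetical and the function names are made up for this illustration.

```python
from collections import Counter


def dot_matrix(seq_x, seq_y, word_size=2):
    """Return the set of 0-based (x, y) positions where identical words start."""
    # Index every word of seq_x by its start positions.
    index = {}
    for x in range(len(seq_x) - word_size + 1):
        index.setdefault(seq_x[x:x + word_size], []).append(x)
    # Look up every word of seq_y in that index.
    dots = set()
    for y in range(len(seq_y) - word_size + 1):
        for x in index.get(seq_y[y:y + word_size], []):
            dots.add((x, y))
    return dots


def diagonal_sizes(dots):
    """Count the dots on each diagonal, identified by the offset x - y."""
    return Counter(x - y for x, y in dots)


def render(dots, len_x, len_y):
    """Draw the matrix as text: seq_x runs horizontally, seq_y vertically."""
    rows = []
    for y in range(len_y):
        rows.append("".join("*" if (x, y) in dots else "." for x in range(len_x)))
    return "\n".join(rows)


if __name__ == "__main__":
    seq_x = "GKYTGKYT"  # toy sequence with a duplicated region
    seq_y = "GKYT"      # toy sequence with a single copy
    dots = dot_matrix(seq_x, seq_y, word_size=2)
    print(render(dots, len(seq_x), len(seq_y)))
    print(dict(diagonal_sizes(dots)))
```

In this toy case the single copy in the vertical sequence matches both copies in the horizontal one, so two parallel diagonals appear at different offsets.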

Several evolutionary events can be inferred from the picture: indels in sectors A4 and B1, a partial duplication in the B5-A2 region, and full homology between B2 and A2A3, B3 and A5, B5 and A2, and B2B3 and A1. The B4 region is not homologous to any region of the other protein.

These observations can be summarised as follows: B2B3 ≈ A1, B2 ≈ A2A3, B3 ≈ A5, B5 ≈ A2. Because homology is transitive, it follows that A1 ≈ A2A3A5. Here are letter-coded representations of both sequences: W5Q1Q5 = KLNKLMN, W5QG41 = JKLNOK, in which identical letters stand for homologous regions and different letters for non-homologous ones. The diagonals and the letter-coded representations show two I-set domains in W5Q1Q5 and only one in W5QG41, with several indels and non-homologous regions.
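
The letter-coded representations can be compared mechanically, as in the small sketch below. The two strings are taken from the text above; the helper names are invented for this illustration, and the comparison treats each letter as an indivisible region, which is a simplification.

```python
from collections import Counter


def repeated_regions(code):
    """Letters that occur more than once, i.e. candidate duplicated regions."""
    return sorted(letter for letter, count in Counter(code).items() if count > 1)


def longest_common_run(a, b):
    """Longest run of consecutive letters shared by both representations."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common run that ended at the previous letters.
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


if __name__ == "__main__":
    w5q1q5 = "KLNKLMN"
    w5qg41 = "JKLNOK"
    print("Repeated in W5Q1Q5:", repeated_regions(w5q1q5))
    print("Repeated in W5QG41:", repeated_regions(w5qg41))
    print("Only in W5Q1Q5:", sorted(set(w5q1q5) - set(w5qg41)))
    print("Only in W5QG41:", sorted(set(w5qg41) - set(w5q1q5)))
    print("Longest shared run:", longest_common_run(w5q1q5, w5qg41))
```

Letters that repeat within one string point to duplicated regions, and letters found in only one string mark regions with no counterpart in the other protein.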

References

  1. MerA page on this site;
  2. Pfam article about I-set domain;
  3. Uniprot AC: W5QG41;
  4. Uniprot AC: W5Q1Q5.