< 2nd term

Rational pairwise alignment

Last update on the 24th of April, 2017

In this work I observe and discuss local and global alignments, aligning homologous and non-homologous proteins, draw graphs and count score matrix. The software I used is Jalview for displaying sequence alignments, EMBOSS package in Bash shell environment for aligning sequences and calculating their properties, LibreOffice Calc for processing them, LibreOffice Draw for creating 2D models, Paint 3D for creating 3D models and personally developed python scripts for making html-tables and counting score matrix. The list of downloads is put just below the paragraph.

List of downloads
File Link
Jalview project with all alignments of this task total.jvp
Script for theatre scoring matrix theatre_count.py
Seat scheme oscar.txt

Global and local pairwise alignment

total.jvp: local_vs_global

A case when the whole sequences are aligned relatively to each other is called global alignment, whereas local alignment provides the longest homologous segment of both sequences. The first type is calculated with Needleman-Wunsch algorithm, the second one with Smith-Waterman algorithm[1]. Both algorithms are compared on the case of aligning two homologous heat shock 70 kDa proteins[2]. EMBOSS package contains two programs for it: needle for global alignment and water for local alignment. Both alignments were computed. For global alignment there are two variants: with applying end indel penalties and without. Then, EMBOSS infoalign program calculated main properties of alignments. The result is presented in the table 1.

Table 1. Global vs local alignment properties.
Algorithm
End penalty applying
Uniprot ID
Sequence length
Alignment lenth
Number of indels
Number of gaps
Gap share, %
Number of 100%
conserved residues
Percent of 100%
conserved residues
Number of functionally
conserved residues
Percent of functionally
conserved residues
Needle Y DNAK1_PROMP 665 669 2 4 0,60 290 43,35 436 65,17
Needle Y DNAK_RHILW 639 669 5 30 4,48 290 43,35 436 65,17
Needle N DNAK1_PROMP 665 672 3 7 1,04 290 43,15 433 64,43
Needle N DNAK_RHILW 639 672 3 8 1,19 290 43,15 433 64,43
Water N/A DNAK1_PROMP 635 642 3 7 1,09 290 45,17 432 67,29
Water N/A DNAK_RHILW 634 642 3 8 1,25 290 45,17 432 67,29

Relative alignments are shown in the Fig. 1. It is clearly seen that they are quite identical, but differ it the end of the alignment. Global alignment without end indel penalties is ended with non-homologous "tail" with large end indel, the other part is identical to local alignment. When the algorithm were building the global alignment with applied end penalties it tried to avoid end indels, so from 626th position till the end it differs from other alignments exhibiting no end indels.

Fig. 1. Global and local alignments of DNAK1_PROMP and DNAK_RHILW. ClustalX colour scheme, 100% identity.

So, the local and global alignment present here almost identical alignments differing in the end of them. The global one shows homology between ending residues, whereas the local one does not feel the need in it. The parameters which EMBOSS sets as default for computing alignments are presented in the table 2.

Table 2. Default EMBOSS parameters.
Attribute Global Local
Score matrix BLOSUM62 BLOSUM62
Indel opening penalty 10 10
Indel extension penalty 0,5 0,5
End gap penalty applying No N/A
End gap opening penalty 10 N/A
End gap extending penalty 0,5 N/A

Pairwise alignment of non-homologous proteins

total.jvp: non_homo

Here are taken five pairs of non-homologous proteins, one of which in each pair is merA[3], compared to pair of homologous HSP70 proteins. All pairs were aligned with water (local alignment). Properties of each alignment are presented in the table 3.

Table 3. Local pairwise alignment of non-homologous and homologous proteins, latter is bold.
Uniprot ID
Sequence length
Alignment lenth
Number of indels
Number of gaps
Gap share, %
Number of 100%
conserved residues
Percent of 100%
conserved residues
Number of functionally
conserved residues
Percent of functionally
conserved residues
A0A126V644_9RHOB 218 237 6 19 8,02 46 19,41 80 33,76
TPIS_FLAPJ 192 237 6 45 18,99 46 19,41 80 33,76
A0A126V644_9RHOB 330 379 8 49 12,93 82 21,64 121 31,93
A0A109QC07_9BRAD 279 379 13 100 26,39 82 21,64 121 31,93
A0A126V644_9RHOB 124 133 1 9 6,77 31 23,31 48 36,09
Y2262_MYCBO 101 133 7 32 24,06 31 23,31 48 36,09
A0A126V644_9RHOB 306 387 12 81 20,93 79 20,41 131 33,85
PLED_CAUCN 339 387 10 48 12,40 79 20,41 131 33,85
A0A126V644_9RHOB 424 452 5 28 6,19 89 19,69 151 33,41
C0ZGQ5_BREBN 336 452 14 116 25,66 89 19,69 151 33,41
DNAK1_PROMP 635 642 3 7 1,09 290 45,17 432 67,29
DNAK_RHILW 634 642 3 8 1,25 290 45,17 432 67,29

The significant difference of shares is clearly seen. Aligned homologous proteins hold two times bigger percents of conserved and functionally conserved positions and an order of magnitude less gap percent. The observation is strengthened with images of relative alignments in the fig. 2.

Fig. 2. Local pairwise alignment of non-homologous and homologous proteins. Coloured by percentage identity with conservation value of 50%.

The used colour scheme allows to estimate a conservation of the alignment. It is supposed that applied colour scheme would colour functionally identical residues in pale blue; however it is just partially true. The lowerest pair exhibits more fully conserved segments, that are also the most extended among all six alignments. The draw also shows other differences: smaller number and less extended indels, more functionally conserved residues as it comes from the table 3. It seems possible to say that alignment programs provide capability to distinguish homologous and non-homologous proteins on the basis of described features.

Comparing multiple and pairwise alignments

total.jvp: cross-compare

In the previous work[4] the multiple alignment of six HSP70 proteins were built. DNAK1_PROMP and DNAK_RHILW sequences were extracted and put to the comparison with local and global alignments of them. Three groups were aligned over each other; the result is shown in the fig. 3.

Fig. 3. Multiple and pairwise alignments of DNAK1_PROMP and DNAK_RHILW. ClustalX colour scheme, 100% identity.

All of the alignments are mostly identical. However, there are several mismatches. In multiple alignment DNAK1_PROMP ASP78 and GLU79 are put in one column with DNAK_RHILW PRO80 and THR81, whereas in pairwise they are put in one column with GLU78 and ASP79 (inversion?). The second example is DNAK_RHILW GLU104 and ALA105 homologued with DNAK1_PROMP LEU102 and SER103 in pairwise and PRO104 and PHE105 in multiple alignments. Finally, there is a difference in the end of the alignment: from 622th till the end multiple alignment looks like it was applied with end indel penalties.

In my opinion, the most plausible alignment is pairwise local one. Multiple-derived has some problems in homology of inner residues (homology of pairwise questionable residues seems more probable) and acts strange in the end of sequences. Comparing local and global alignments, it has to be said that in this particular case there is no necessity in aligning the end "tail" as it is quite undefinite what has happend to it during the evolution.

Graphs as alignment models

Naturally, the math model of an alignment is an oriented graph with weighted arrows (edges). The best alignment is that which weighs most. There are several approaches of building such graphs depending on the type of alignment and indel penalty scoring model. Here two approaches are discussed: global alignment with affine penalties and local alignment with linear penalties.

Global alignment with affine penalties

Affine penalties are so that the opening of the indel scores -g, extending of indel scores -h, h < < g. The basic model is a "Manhattan with Broadway and three layers" as it is stated in the lecture presentation[5]. A basic unit of such graph with labeled weights of arrows is shown in the fig. 4.

Fig. 4. A basic unit of oriented graph for global alignment with affine penalties. All weights are labeled.

Yellow arrow states for homology of residues, green arrows for opening the indel, red arrows for extending the indel. Red and green arrows are co-directed with that sequence, which exhibits residues against indel in the other one. Blue arrows mean the end of the indel, so they weighs zero. Sij describes the rate at which one residue changes to another over time. It is usually taken from a substitution matrix. This basic unit must be multiplied for necessary number of times to complete the graph for particular similarity map of two sequences. Such example is presented in the fig. 5.

Fig. 5. An example of oriented graph for two sequences (global alignment, affine penalties).

During the algorithm processing each node is being given a weight from the start to the end of the graph in such pattern: input nodes are observed, weights of routes from each of them to the taken node are calculated, the largest number is chosen and is given to the node with reference to the previous input node. Finally, all nodes are weighted and labeled in the "best" order.

Local alignment with linear penalties

A graph of this alignment is easier to draw (fig. 6). Black arrows state for gaps (or indels), diagonal blue — for homology, curved grey — for the change of the start and end points of the alignment (as it is allowed to choose only a segment of each sequence to align with no penalties). Actually, grey curved arrows come from the start of the graph to each other node and to the end of graph from each other node. However, only some of them are drawn because of expected fuzziness in the other case. Weights of each arrow are also labeled in the fig. 6.

Fig. 6. An oriented graph for local alignment with linear penalties

The algorithm process is similar to the previously described, so there is no need in doing that again. An example of particular alignment is shown in the fig. 7.

Fig. 7. An example of local alignment with linear penalties.

Scoring matrix

theatre_count.py, oscar.txt

As it is mentioned above, scoring (substitution) matrix reflects tendency of residues to change to others in homologous proteins. Each score is calculated in such a way[6]: S = λ * log (observed frequency/expected frequency). The logarithm base is a question of no value, it must be huger than 1, λ is a convenient factor to make numbers look pleasant to one's eyes.

The common example of such matrix is a matrix of men and women amicableness regarding to pair-making in theatre. Here, occurrence of each pair type — MM, MF and FF (M is for male, F is for female) — in particular theatre or hall is counted and compared to the expected occurrence, which is calculated as a multiplication of particular sex frequencies.

A sample case is calculating scoring matrix on the basis of the first central block of guests at the 87th Academy awards ceremony (fig. 8).

Fig. 8. 87th Academy awards ceremony, view from the stage[7].

In order to do that, seats should be mapped with M and F respectively to the sex of each guest and space for empty ones. As the only neural network I was able to use is my brain, I did that with my own eyes. The result is shown in the fig. 9 and text file.

Fig. 9. Seat scheme.
FFMMMFMFFFFFMFFM
MFFFFFMFFFFMFFMMF
FFMFFMMMFMMFFMFM
FMFMMMMMMFMFMMFMF
MFM FMMMFMMMMMFFFM
MFMFMMMFFMFMFMMFMFF
MFMFMFMFMFFMMMMMFF
MFFMMFMF FMMFFMFMFM
MFMFMFFMFMMMFMFMFF
MFFFFFMFFMMFMMFMFM 

Of course, manual count of scores is boring and cojucted with high mistake probability. So, I developed a python script for this task. It takes any text file filename with mapping as described above and do all stuff drawing the table to stdout and filename.table. The result of its running is shown in the table 4. The logarithm base is 2, the convenience factor is 10, the result numbers are rounded to integers.

Table 4. Scoring matrix for the 87th Academy awards ceremony.
M F Sum
M -4 3 -1
F 3 -3 -1
Sum -1 -1 -1

References

  1. Wikipedia article on sequence alignment;
  2. Wikipedia article on HSP70 family;
  3. MerA page on this site;
  4. Alignment as evolution's reflection page on this site;
  5. Lecture presentation on Kodomo server;
  6. Wikipedia article on substitution matrix;
  7. Oscar 2015 photos from the Dolby stage.