< 2^nd term

Rational pairwise alignment

Last update on the 24^th of April, 2017

In this work I observe and discuss local and global alignments, aligning homologous and non-homologous proteins, draw graphs and count score matrix. The software I used is Jalview for displaying sequence alignments, EMBOSS package in Bash shell environment for aligning sequences and calculating their properties, LibreOffice Calc for processing them, LibreOffice Draw for creating 2D models, Paint 3D for creating 3D models and personally developed python scripts for making html-tables and counting score matrix. The list of downloads is put just below the paragraph.

List of downloads
File	Link
Jalview project with all alignments of this task	total.jvp
Script for theatre scoring matrix	theatre_count.py
Seat scheme	oscar.txt

Global and local pairwise alignment

total.jvp: local_vs_global

A case when the whole sequences are aligned relatively to each other is called global alignment, whereas local alignment provides the longest homologous segment of both sequences. The first type is calculated with Needleman-Wunsch algorithm, the second one with Smith-Waterman algorithm^[1]. Both algorithms are compared on the case of aligning two homologous heat shock 70 kDa proteins^[2]. EMBOSS package contains two programs for it: needle for global alignment and water for local alignment. Both alignments were computed. For global alignment there are two variants: with applying end indel penalties and without. Then, EMBOSS infoalign program calculated main properties of alignments. The result is presented in the table 1.

Table 1. Global vs local alignment properties.
Algorithm	End penalty applying	Uniprot ID	Sequence length	Alignment lenth	Number of indels	Number of gaps	Gap share, %	Number of 100% conserved residues	Percent of 100% conserved residues	Number of functionally conserved residues	Percent of functionally conserved residues
Needle	Y	DNAK1_PROMP	665	669	2	4	0,60	290	43,35	436	65,17
Needle	Y	DNAK_RHILW	639	669	5	30	4,48	290	43,35	436	65,17
Needle	N	DNAK1_PROMP	665	672	3	7	1,04	290	43,15	433	64,43
Needle	N	DNAK_RHILW	639	672	3	8	1,19	290	43,15	433	64,43
Water	N/A	DNAK1_PROMP	635	642	3	7	1,09	290	45,17	432	67,29
Water	N/A	DNAK_RHILW	634	642	3	8	1,25	290	45,17	432	67,29

Relative alignments are shown in the Fig. 1. It is clearly seen that they are quite identical, but differ it the end of the alignment. Global alignment without end indel penalties is ended with non-homologous "tail" with large end indel, the other part is identical to local alignment. When the algorithm were building the global alignment with applied end penalties it tried to avoid end indels, so from 626^th position till the end it differs from other alignments exhibiting no end indels.

So, the local and global alignment present here almost identical alignments differing in the end of them. The global one shows homology between ending residues, whereas the local one does not feel the need in it. The parameters which EMBOSS sets as default for computing alignments are presented in the table 2.

Table 2. Default EMBOSS parameters.
Attribute	Global	Local
Score matrix	BLOSUM62	BLOSUM62
Indel opening penalty	10	10
Indel extension penalty	0,5	0,5
End gap penalty applying	No	N/A
End gap opening penalty	10	N/A
End gap extending penalty	0,5	N/A

Pairwise alignment of non-homologous proteins

total.jvp: non_homo

Here are taken five pairs of non-homologous proteins, one of which in each pair is merA^[3], compared to pair of homologous HSP70 proteins. All pairs were aligned with water (local alignment). Properties of each alignment are presented in the table 3.

Table 3. Local pairwise alignment of non-homologous and homologous proteins, latter is bold.
Uniprot ID	Sequence length	Alignment lenth	Number of indels	Number of gaps	Gap share, %	Number of 100% conserved residues	Percent of 100% conserved residues	Number of functionally conserved residues	Percent of functionally conserved residues
A0A126V644_9RHOB	218	237	6	19	8,02	46	19,41	80	33,76
TPIS_FLAPJ	192	237	6	45	18,99	46	19,41	80	33,76
A0A126V644_9RHOB	330	379	8	49	12,93	82	21,64	121	31,93
A0A109QC07_9BRAD	279	379	13	100	26,39	82	21,64	121	31,93
A0A126V644_9RHOB	124	133	1	9	6,77	31	23,31	48	36,09
Y2262_MYCBO	101	133	7	32	24,06	31	23,31	48	36,09
A0A126V644_9RHOB	306	387	12	81	20,93	79	20,41	131	33,85
PLED_CAUCN	339	387	10	48	12,40	79	20,41	131	33,85
A0A126V644_9RHOB	424	452	5	28	6,19	89	19,69	151	33,41
C0ZGQ5_BREBN	336	452	14	116	25,66	89	19,69	151	33,41
DNAK1_PROMP	635	642	3	7	1,09	290	45,17	432	67,29
DNAK_RHILW	634	642	3	8	1,25	290	45,17	432	67,29

The significant difference of shares is clearly seen. Aligned homologous proteins hold two times bigger percents of conserved and functionally conserved positions and an order of magnitude less gap percent. The observation is strengthened with images of relative alignments in the fig. 2.

The used colour scheme allows to estimate a conservation of the alignment. It is supposed that applied colour scheme would colour functionally identical residues in pale blue; however it is just partially true. The lowerest pair exhibits more fully conserved segments, that are also the most extended among all six alignments. The draw also shows other differences: smaller number and less extended indels, more functionally conserved residues as it comes from the table 3. It seems possible to say that alignment programs provide capability to distinguish homologous and non-homologous proteins on the basis of described features.

Comparing multiple and pairwise alignments

total.jvp: cross-compare

In the previous work^[4] the multiple alignment of six HSP70 proteins were built. DNAK1_PROMP and DNAK_RHILW sequences were extracted and put to the comparison with local and global alignments of them. Three groups were aligned over each other; the result is shown in the fig. 3.

All of the alignments are mostly identical. However, there are several mismatches. In multiple alignment DNAK1_PROMP ASP78 and GLU79 are put in one column with DNAK_RHILW PRO80 and THR81, whereas in pairwise they are put in one column with GLU78 and ASP79 (inversion?). The second example is DNAK_RHILW GLU104 and ALA105 homologued with DNAK1_PROMP LEU102 and SER103 in pairwise and PRO104 and PHE105 in multiple alignments. Finally, there is a difference in the end of the alignment: from 622^th till the end multiple alignment looks like it was applied with end indel penalties.

In my opinion, the most plausible alignment is pairwise local one. Multiple-derived has some problems in homology of inner residues (homology of pairwise questionable residues seems more probable) and acts strange in the end of sequences. Comparing local and global alignments, it has to be said that in this particular case there is no necessity in aligning the end "tail" as it is quite undefinite what has happend to it during the evolution.

Graphs as alignment models

Naturally, the math model of an alignment is an oriented graph with weighted arrows (edges). The best alignment is that which weighs most. There are several approaches of building such graphs depending on the type of alignment and indel penalty scoring model. Here two approaches are discussed: global alignment with affine penalties and local alignment with linear penalties.

Global alignment with affine penalties

Affine penalties are so that the opening of the indel scores -g, extending of indel scores -h, h < < g. The basic model is a "Manhattan with Broadway and three layers" as it is stated in the lecture presentation^[5]. A basic unit of such graph with labeled weights of arrows is shown in the fig. 4.

Yellow arrow states for homology of residues, green arrows for opening the indel, red arrows for extending the indel. Red and green arrows are co-directed with that sequence, which exhibits residues against indel in the other one. Blue arrows mean the end of the indel, so they weighs zero. S_ij describes the rate at which one residue changes to another over time. It is usually taken from a substitution matrix. This basic unit must be multiplied for necessary number of times to complete the graph for particular similarity map of two sequences. Such example is presented in the fig. 5.

During the algorithm processing each node is being given a weight from the start to the end of the graph in such pattern: input nodes are observed, weights of routes from each of them to the taken node are calculated, the largest number is chosen and is given to the node with reference to the previous input node. Finally, all nodes are weighted and labeled in the "best" order.

Local alignment with linear penalties

A graph of this alignment is easier to draw (fig. 6). Black arrows state for gaps (or indels), diagonal blue — for homology, curved grey — for the change of the start and end points of the alignment (as it is allowed to choose only a segment of each sequence to align with no penalties). Actually, grey curved arrows come from the start of the graph to each other node and to the end of graph from each other node. However, only some of them are drawn because of expected fuzziness in the other case. Weights of each arrow are also labeled in the fig. 6.

The algorithm process is similar to the previously described, so there is no need in doing that again. An example of particular alignment is shown in the fig. 7.

Scoring matrix

theatre_count.py, oscar.txt

As it is mentioned above, scoring (substitution) matrix reflects tendency of residues to change to others in homologous proteins. Each score is calculated in such a way^[6]: S = λ * log (observed frequency/expected frequency). The logarithm base is a question of no value, it must be huger than 1, λ is a convenient factor to make numbers look pleasant to one's eyes.

The common example of such matrix is a matrix of men and women amicableness regarding to pair-making in theatre. Here, occurrence of each pair type — MM, MF and FF (M is for male, F is for female) — in particular theatre or hall is counted and compared to the expected occurrence, which is calculated as a multiplication of particular sex frequencies.

A sample case is calculating scoring matrix on the basis of the first central block of guests at the 87^th Academy awards ceremony (fig. 8).

In order to do that, seats should be mapped with M and F respectively to the sex of each guest and space for empty ones. As the only neural network I was able to use is my brain, I did that with my own eyes. The result is shown in the fig. 9 and text file.

Fig. 9. Seat scheme.

FFMMMFMFFFFFMFFM
MFFFFFMFFFFMFFMMF
FFMFFMMMFMMFFMFM
FMFMMMMMMFMFMMFMF
MFM FMMMFMMMMMFFFM
MFMFMMMFFMFMFMMFMFF
MFMFMFMFMFFMMMMMFF
MFFMMFMF FMMFFMFMFM
MFMFMFFMFMMMFMFMFF
MFFFFFMFFMMFMMFMFM

Of course, manual count of scores is boring and cojucted with high mistake probability. So, I developed a python script for this task. It takes any text file filename with mapping as described above and do all stuff drawing the table to stdout and filename.table. The result of its running is shown in the table 4. The logarithm base is 2, the convenience factor is 10, the result numbers are rounded to integers.

Table 4. Scoring matrix for the 87^th Academy awards ceremony.
	M	F	Sum
M	-4	3	-1
F	3	-3	-1
Sum	-1	-1	-1

References

Wikipedia article on sequence alignment;
Wikipedia article on HSP70 family;
MerA page on this site;
Alignment as evolution's reflection page on this site;
Lecture presentation on Kodomo server;
Wikipedia article on substitution matrix;
Oscar 2015 photos from the Dolby stage.