Alignment as evolution's reflection

Last updated on the 18th of April, 2017

Here I get to grips with a true bioinformatics method: alignment. I used Jalview, a dedicated program for multiple sequence alignment, bash and a spreadsheet to retrieve and process the data, and Python to generate the HTML tables. The files available for download are listed just below this paragraph.

List of downloads
File	Link
Jalview project with all alignments of this task	align_project.jvp
Bash script for manual evolution section	mut_script.sh
Modified bash script for nucleotide spin-off section	nuc_mut_script.sh
All files, including the sequences in *.fasta, in one zip archive	align1_archive.zip

Basic features of Jalview

Project: alignment1

I built a 6-sequence alignment of randomly chosen proteins of the HSP70^[1] family from Eukaryota, Bacteria and Archaea. Fig. 1 shows the alignment coloured with the ClustalX scheme under the condition Identity Threshold = 100%. With this setting, only the columns in which every sequence has the same amino acid residue are coloured.

I marked three kinds of features in the alignment. First, label C marks columns conserved at 80% or more, that is, columns in which at least 80% of the amino acid residues are the same. Second, label F marks functionally conserved columns, in which all residues share the same physico-chemical properties and belong to one class. For example, at position 181 both Ile and Val are aliphatic, at position 189 both Asp and Glu are negatively charged, and at position 471 both Ser and Thr are polar and carry an -OH group. Third, label G marks the columns with gaps.
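The three labels can be illustrated with a small Python sketch. It is not the procedure used in Jalview: the grouping of residues into classes below is an assumption made for illustration, as are the function names and the toy alignment.

```python
from collections import Counter

GAP = "-"

# Assumed physico-chemical classes (illustrative grouping, not taken from Jalview).
RESIDUE_CLASSES = {
    "aliphatic": "AGILV",
    "sulfur": "CM",
    "hydroxyl": "ST",
    "negative": "DE",
    "positive": "HKR",
    "amide": "NQ",
    "aromatic": "FWY",
    "proline": "P",
}
CLASS_OF = {res: name for name, members in RESIDUE_CLASSES.items() for res in members}


def label_column(column, threshold=0.8):
    """Return the labels (C, F, G) that apply to one alignment column."""
    labels = []
    residues = [r for r in column if r != GAP]
    if residues:
        # Share of the most frequent residue among all rows of the column.
        top_count = Counter(residues).most_common(1)[0][1]
        if top_count / len(column) >= threshold:
            labels.append("C")
    if len(residues) == len(column):
        # No gaps: functionally conserved if every residue is in one class.
        classes = {CLASS_OF.get(r) for r in residues}
        if len(classes) == 1 and None not in classes:
            labels.append("F")
    else:
        labels.append("G")
    return labels


def label_alignment(sequences, threshold=0.8):
    """Label every column of an alignment given as equal-length strings."""
    if len({len(s) for s in sequences}) > 1:
        raise ValueError("aligned sequences must have equal length")
    return [label_column(col, threshold) for col in zip(*sequences)]


# Toy alignment (hypothetical, not the HSP70 data).
toy = ["MIDS-", "MVETK", "MIDSK"]
print(label_alignment(toy))
```

In this sketch a gap counts as a mismatch when the 80% threshold is checked, and a column with a gap is never labelled F; both choices are assumptions.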

Then I calculated the main properties of this alignment with the EMBOSS infoalign program. I ran the program three times, with the options -identity 100.0, -identity 70.0 and -plurality 100.0. The first run counts the 100% identical residues, the second the 70% identical ones, and the third the 100% functionally identical ones. I also added the options for counting the gap positions (-gapcount) and for reporting the alignment length (-alignlength). The resulting TSV tables were imported into spreadsheet software, merged and processed. The results are given in table 1.

Table 1. Main properties of alignment 1. All percentages are relative to the alignment length.
Uniprot AC	Domain	Sequence length	Alignment length	Number of gap positions	Percentage of gap positions	Number of 100% conserved positions	Percentage of 100% conserved positions	Number of 70% conserved positions	Percentage of 70% conserved positions	Number of functionally conserved positions	Percentage of functionally conserved positions
B5ZWQ2	Bacteria	639	724	85	11.74	140	19.34	325	44.89	270	37.29
A9A135	Archaea	636	724	88	12.15	140	19.34	316	43.65	270	37.29
O24581	Eukaryota	663	724	61	8.43	140	19.34	304	41.99	270	37.29
Q7V1H4	Bacteria	665	724	59	8.15	140	19.34	312	43.09	270	37.29
Q18GZ4	Archaea	641	724	83	11.46	140	19.34	321	44.34	270	37.29
A0T0H7	Eukaryota	613	724	111	15.33	140	19.34	317	43.78	270	37.29
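The derived columns of table 1 appear to follow from the raw counts by simple arithmetic. The Python sketch below assumes that the number of gap positions is the alignment length minus the sequence length and that each percentage is taken relative to the alignment length; both assumptions are consistent with the values in the table. The function names are hypothetical.

```python
ALIGNMENT_LENGTH = 724

# Sequence lengths copied from table 1.
SEQUENCE_LENGTHS = {
    "B5ZWQ2": 639,
    "A9A135": 636,
    "O24581": 663,
    "Q7V1H4": 665,
    "Q18GZ4": 641,
    "A0T0H7": 613,
}


def gap_positions(sequence_length, alignment_length=ALIGNMENT_LENGTH):
    """Columns in which this sequence carries a gap (assumed definition)."""
    return alignment_length - sequence_length


def percent_of_alignment(count, alignment_length=ALIGNMENT_LENGTH):
    """Share of the alignment length, rounded to two decimals."""
    return round(100 * count / alignment_length, 2)


for accession, length in SEQUENCE_LENGTHS.items():
    gaps = gap_positions(length)
    print(accession, gaps, percent_of_alignment(gaps))
```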

The numbers in table 1 deserve some comment. The numbers of 100% conserved and of functionally conserved positions are constant, whereas the numbers of 70% conserved positions and of gaps vary. The first two attributes amount to 19.34% and 37.29% of the alignment length (roughly 22% and 42% of an average sequence length), respectively, which may be a sign of a rather stable sequence length. The number of gaps depends on the sequence and alignment lengths and varies between 8% and 15% of the alignment, which is rather significant. The number of 70% conserved positions is more interesting: it varies around 316 (rounded average) with a standard deviation of 7.3, or 0.023 in relative terms. The corresponding percentages, taken relative to each sequence's own length rather than to the alignment length used in table 1, average 49.19, with a standard deviation of 2.3 and a relative one of 0.047. Such small relative standard deviations mean that these values vary little, which may be evidence of protein conservation.
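The statistics for the 70% conserved positions can be rechecked with a short Python sketch. It assumes the sample standard deviation (n - 1 in the denominator) and, for the percentages, division by each sequence's own length; these are assumptions about how the spreadsheet values were obtained, chosen because they agree with the quoted figures to within rounding. The function name is hypothetical.

```python
from statistics import mean, stdev

# Numbers of 70% conserved positions and sequence lengths from table 1,
# in the same row order.
CONSERVED_70 = [325, 316, 304, 312, 321, 317]
SEQUENCE_LENGTHS = [639, 636, 663, 665, 641, 613]


def summarize(values):
    """Return (mean, sample standard deviation, relative standard deviation)."""
    average = mean(values)
    deviation = stdev(values)
    return average, deviation, deviation / average


# Statistics of the raw counts.
count_stats = summarize(CONSERVED_70)

# Statistics of the percentages relative to each sequence's own length.
percentages = [100 * c / n for c, n in zip(CONSERVED_70, SEQUENCE_LENGTHS)]
percent_stats = summarize(percentages)

print("counts:      mean=%.1f sd=%.1f relative=%.3f" % count_stats)
print("percentages: mean=%.2f sd=%.1f relative=%.3f" % percent_stats)
```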

However, even a cursory look reveals extended indels in the sequences alongside extended blocks of conserved positions. All these observations can be combined as follows: even though the sequences are taken from three different domains of life, their function is conserved, and this is reflected in the amino acid sequences.

Manual evolution

Project: alignment2_1 and manually fixed alignment2_2; mut_script.sh

The experiment was to mutate a protein sequence artificially, align the result by computer and then fix the alignment manually. The mutations were made with a bash script powered by EMBOSS on the section of the merA^[2] protein from residue 121 to residue 220. The fixed alignment is shown in fig. 2.

I made several changes to the alignment. For instance, I moved the whole TGAAI block in columns 37-41 one position to the right to match the letters in the p0 sequence, which made the deletion of A at position 37 (p0-p1) visible. Another example is the shift of E from p0-p1:68 to p0-p1:69, which lines up the whole column of E and brings out the insertion of Q at p2-p3:68 (these residues are now moved one position to the right). There were many more fixes besides these.

The mutations were set up so that only seven point mutations occur per generation, over seven generations in total. Table 2 lists the first ten mutated positions.

Table 2. The first ten mutations in the experiment with the mutated protein.
Position	Mutation	Parental generation	Offspring generation
4	Insertion of I	p0	p1
5	Insertion of W	p2	p3
6	Insertion of F	p6	p7
10	Insertion of W	p4	p5
17	Insertion of D	p6	p7
22	Deletion of V	p2	p3
24	Substitution of G by P	p3	p4
26	Substitution of T by L	p4	p5
28	Insertion of C	p1	p2
28	Deletion of C	p2	p3
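The mutation scheme described above, seven point mutations per generation over seven generations, can be sketched in Python. This is not mut_script.sh, which relies on EMBOSS; the function names, the equal probability of substitutions, insertions and deletions, and the placeholder start sequence are all assumptions made for illustration.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def point_mutation(sequence, rng):
    """Apply one random substitution, insertion or deletion (assumed equally likely)."""
    kind = rng.choice(["substitution", "insertion", "deletion"])
    if not sequence:
        kind = "insertion"  # nothing to substitute or delete in an empty sequence
    if kind == "insertion":
        pos = rng.randrange(len(sequence) + 1)
        return sequence[:pos] + rng.choice(AMINO_ACIDS) + sequence[pos:]
    pos = rng.randrange(len(sequence))
    if kind == "deletion":
        return sequence[:pos] + sequence[pos + 1:]
    # Substitution: pick a residue different from the current one.
    new_residue = rng.choice([a for a in AMINO_ACIDS if a != sequence[pos]])
    return sequence[:pos] + new_residue + sequence[pos + 1:]


def evolve(sequence, generations=7, mutations_per_generation=7, seed=0):
    """Return the lineage [p0, p1, ..., pN] as a list of sequences."""
    rng = random.Random(seed)
    lineage = [sequence]
    for _ in range(generations):
        current = lineage[-1]
        for _ in range(mutations_per_generation):
            current = point_mutation(current, rng)
        lineage.append(current)
    return lineage


# Placeholder start sequence (hypothetical, not the merA fragment).
start = "MTGAAIEVLKQ" * 9
for index, seq in enumerate(evolve(start)):
    print("p%d" % index, len(seq))
```

Fixing the seed makes the lineage reproducible, so the true history of every mutation is known and can be compared with what an alignment program recovers.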

To recap, an alignment can only approximate a possible path of evolution. The experiment showed that the algorithms were unable to reconstruct correctly even this simple example of an evolutionary process.

Nucleotide spin-off

Project: align_urease, align_urease_cds1, align_urease_cds2; nuc_mut_script.sh

This experiment is more complicated than the previous one: the mutations are applied to the CDS, while the protein sequence is what is examined. I chose urease subunit gamma^[3] and retrieved its nucleotide sequence from GenBank^[4]. The CDS turned out to lie on the complementary DNA strand, so I ran the EMBOSS revseq program before starting the mutation process. Because the CDS mutates randomly, nonsense mutations occur, so the script from the previous experiment was modified to weed out unsuitable sequences. The protein alignment is shown in fig. 3.
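The two preparatory steps, taking the reverse complement and rejecting sequences with nonsense mutations, can be sketched in Python. This is not nuc_mut_script.sh or revseq: the function names are hypothetical, the standard codon assignments are assumed, input is expected as uppercase A, C, G, T only, and "unsuitable" is taken to mean a stop codon anywhere before the last complete codon.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (third base varies fastest).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino
    for codon, amino in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(cds):
    """Translate complete codons only; a trailing partial codon is ignored."""
    usable = len(cds) - len(cds) % 3
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, usable, 3))


def has_premature_stop(cds):
    """True if a stop codon occurs before the last complete codon."""
    protein = translate(cds)
    body = protein[:-1] if protein.endswith("*") else protein
    return "*" in body


# Toy CDS stored on the complementary strand (hypothetical, not ureA).
stored = "TTACCATTTCAT"
cds = reverse_complement(stored)
print(cds, translate(cds), has_premature_stop(cds))
```

A mutated CDS for which has_premature_stop returns True would be discarded and the mutation step repeated; whether the original script retries in this way is an assumption.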

The alignment exhibits large stretches of identical positions. Some of them, such as WNRSQ and GIDRN at positions 36-40, recur across the generations. This points to randomly occurring nucleotide indels that cause several frameshifts. Although selection pressure would not tolerate such a situation in nature, the artificial conditions release the full power of the mutation process. Because of the many frameshifts, fixing the protein alignment manually seemed unnecessary, so I fixed the nucleotide alignment instead. All alignments are stored in the common alignment project. Table 3 lists the first ten mutated positions, including block mutations.

Table 3. The first ten mutations in the experiment with the mutated CDS.
Position	Mutation	Parental generation	Offspring generation
2	Substitution of N by K	p4	p5
3	Substitution of L by R	p0	p1
3	Substitution of R by Q	p2	p3
3	Substitution of Q by P	p4	p5
4	Substitution of S by A	p2	p3
4	Substitution of A by S	p4	p5
5-6	Substitution of PR by AE	p5	p6
8	Insertion of K	p4	p5
15-20	Substitution of SLAAIV by QSGGDC	p1	p2
21-23	Substitution of ARG by RTR	p1	p2

Some controversial statements

Here are several statements about evolution and alignment that deserve comment.

Only nucleotide sequences evolve.

Of course, changes accumulate in nucleotide sequences, but for proteins the selection pressure acts on the amino acid sequences, because they are the actual functional embodiment of the nucleotide sequences.

Homologous proteins show similar structure.

This statement seems obvious, but there is a paper^[5] that describes a variety of proteins similar in sequence yet dissimilar in structure. It also provides examples of homologous proteins, for instance in this figure^[6], that show significant structural dissimilarities. Indeed, a protein can be folded and processed in many ways, so the final structures of homologous proteins may vary.

Only mutations in gametes are inherited.

This is partially true: this is how inheritance works in sexual reproduction. In asexual reproduction only the mutations that occur in spore cells are inherited, and in vegetative reproduction somatic mutations may be inherited. And, setting the reproduction of organisms aside, mutations are inherited by the somatic cells of a particular organism through ordinary mitosis.

Examining Craig Venter's claims

In 2010 C. Venter's team announced the creation of a synthetic bacterial cell. The paper^[7] was quite adequate, which cannot be said of Mr. Venter's interview with CNN^[8]. Here are some extracts from the interview, each followed by a comment.

We built it from four bottles of chemicals.

Of course, there were 4 "bottles" of dNTPs, but the DNA segments were synthesized chemically and then joined in yeast cells by means of overlaps deliberately kept in each segment. So the machinery was somewhat more sophisticated than Mr. Venter stated.

So it's the first living self-replicating cell that we have on the planet whose DNA was made chemically and designed in the computer.

Frankly, the described process of DNA synthesis is quite familiar in bioengineering. Moreover, the genome was not designed in the computer but taken from a real living cell. There is no step forward in it.

So it has no genetic ancestors. Its parent is a computer.

There was a genetic ancestor: the bacterium from which the genome sequence was taken. And, of course, neither a computer nor a human can be called the direct parent of a bacterium.

References

1. Wikipedia article on the HSP70 family;
2. A page about merA on this site;
3. The UniProt page of ureA;
4. Nucleotide sequence of the particular ureA CDS;
5. Mickey Kosloff, Rachel Kolodny, Sequence-similar, structure-dissimilar protein pairs in the PDB, Proteins 2008 May 1; 71(2): 891–902, doi: 10.1002/prot.21770;
6. Figure 05 in [5];
7. J. Craig Venter et al., Creation of a bacterial cell controlled by a chemically synthesized genome, Science 02 Jul 2010: Vol. 329, Issue 5987, pp. 52-56, doi: 10.1126/science.1190719;
8. Scientist: 'We didn't create life from scratch', from CNN reports.