
Alignment as evolution's reflection

Last update on the 18th of April, 2017

Here I get to grips with a core bioinformatics method: alignment. I used Jalview, a dedicated program for multiple sequence alignment, bash and a spreadsheet for retrieving and processing the data, and Python to generate HTML tables. The files available for download are listed just below this paragraph.

List of downloads
File Link
Jalview project with all alignments of this task align_project.jvp
Bash script for manual evolution section mut_script.sh
Modified bash script for nucleotide spin-off section nuc_mut_script.sh
All files including sequences in *.fasta in one zip folder align1_archive.zip

Basic features of Jalview

Project: alignment1

I built a 6-sequence alignment of randomly chosen proteins of the HSP70[1] family from Eukaryota, Bacteria and Archaea. The alignment, coloured with the ClustalX scheme under the condition Identity Threshold = 100%, is presented in fig. 1. With this setting only columns in which the amino acid residue is identical in every sequence are coloured.

Fig. 1. Mapped alignment 1. C — >80% conserved, F — functionally conserved, G — gaps.

I mapped three examples of conservation in the alignment. The first, labelled C, marks residues conserved at 80% or more: at least 80% of the amino acid residues in the column are the same. The second, labelled F, marks functionally conserved residues: all residues in the column share the same physico-chemical properties and belong to one class. Thus at position 181 both Ile and Val are aliphatic, at position 189 both Asp and Glu are negatively charged, and at position 471 both Ser and Thr are polar with an -OH group. The third, labelled G, marks columns with gaps.
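The three labels can be illustrated with a minimal Python sketch. It is not how Jalview or ClustalX decide colouring: the residue classes in `RESIDUE_CLASS`, the function name `label_column` and the rule that a column with a gap is never labelled F are assumptions made for this example only.

```python
from collections import Counter

# Simplified physico-chemical classes. This grouping is an assumption made
# for illustration, not the exact scheme used by Jalview or ClustalX.
RESIDUE_CLASS = {}
for class_name, residues in {
    "aliphatic": "AVLIM",
    "aromatic": "FWY",
    "negative": "DE",
    "positive": "KRH",
    "polar_hydroxyl": "ST",
    "polar_amide": "NQ",
    "special": "CGP",
}.items():
    for residue in residues:
        RESIDUE_CLASS[residue] = class_name


def label_column(column, threshold=0.8):
    """Return the set of labels (C, F, G) that apply to one alignment column."""
    labels = set()
    if "-" in column:
        labels.add("G")  # the column contains at least one gap
    residues = [r for r in column if r != "-"]
    if not residues:
        return labels
    # C: the most frequent residue fills at least `threshold` of the column
    top_count = Counter(residues).most_common(1)[0][1]
    if top_count / len(column) >= threshold:
        labels.add("C")
    # F: no gaps and every residue belongs to the same class
    classes = {RESIDUE_CLASS.get(r) for r in residues}
    if "G" not in labels and len(classes) == 1 and None not in classes:
        labels.add("F")
    return labels


if __name__ == "__main__":
    # Hypothetical six-residue columns, one residue per aligned sequence
    for column in ("IIVVIV", "DDDDDE", "SS-TST"):
        print(column, sorted(label_column(column)))
```

With six sequences, the 80% threshold means that at least five of the six residues in a column must be identical.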

Then I calculated the main properties of this alignment with the EMBOSS infoalign program. I ran the program three times, with the options -identity 100.0, -identity 70.0 and -plurality 100.0. The first run reports the number of 100% identical residues, the second the number at 70%, and the third the number of 100% functionally identical residues. I also added options for counting gap positions (-gapcount) and reporting the alignment length (-alignlength). The resulting TSV tables were exported to spreadsheet software, merged and processed. The results are given in table 1.

Table 1. Main properties of alignment 1.
Uniprot AC | Domain | Sequence length | Alignment length | Gap positions | Gap positions, % | 100% conserved positions | 100% conserved positions, % | 70% conserved positions | 70% conserved positions, % | Functionally conserved positions | Functionally conserved positions, %
B5ZWQ2 | Bacteria | 639 | 724 | 85 | 11.74 | 140 | 19.34 | 325 | 44.89 | 270 | 37.29
A9A135 | Archaea | 636 | 724 | 88 | 12.15 | 140 | 19.34 | 316 | 43.65 | 270 | 37.29
O24581 | Eukaryota | 663 | 724 | 61 | 8.43 | 140 | 19.34 | 304 | 41.99 | 270 | 37.29
Q7V1H4 | Bacteria | 665 | 724 | 59 | 8.15 | 140 | 19.34 | 312 | 43.09 | 270 | 37.29
Q18GZ4 | Archaea | 641 | 724 | 83 | 11.46 | 140 | 19.34 | 321 | 44.34 | 270 | 37.29
A0T0H7 | Eukaryota | 613 | 724 | 111 | 15.33 | 140 | 19.34 | 317 | 43.78 | 270 | 37.29

There is something to say about the numbers in table 1. The numbers of 100% conserved and functionally conserved positions are constant across the sequences, whereas the numbers of 70% conserved positions and of gaps vary. The first two show percentages of 22 and 41, respectively (table 1 lists 19.34 and 37.29 relative to the alignment length), which may be a sign of a rather stable sequence length. The number of gaps depends on sequence and alignment length and ranges from 8% to 15%, which is rather significant. The number of 70% conserved positions is more interesting: it stays close to 316 (rounded average) with a standard deviation of 7.3, or 0.023 in relative terms. The corresponding percentages, taken relative to sequence length rather than alignment length, have an average of 49.19, a standard deviation of 2.3 and a relative standard deviation of 0.047. Such small relative standard deviations mean that these datasets vary little, which may be evidence of protein conservation.
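The statistics for the 70% conserved positions can be recomputed from table 1 with a short Python sketch. The data are copied from the table. The helper name `summarise` is hypothetical, and the use of the sample standard deviation (`statistics.stdev`) is an assumption, chosen because it matches the quoted value of 7.3.

```python
from statistics import mean, stdev

ALIGNMENT_LENGTH = 724

# Uniprot AC -> (sequence length, number of 70% conserved positions),
# copied from table 1
ROWS = {
    "B5ZWQ2": (639, 325),
    "A9A135": (636, 316),
    "O24581": (663, 304),
    "Q7V1H4": (665, 312),
    "Q18GZ4": (641, 321),
    "A0T0H7": (613, 317),
}


def summarise(values):
    """Return (mean, sample standard deviation, relative standard deviation)."""
    m = mean(values)
    s = stdev(values)
    return m, s, s / m


counts = [count for _, count in ROWS.values()]
# Percentages relative to the alignment length, as listed in table 1
pct_of_alignment = [100 * count / ALIGNMENT_LENGTH for count in counts]
# Percentages relative to each sequence's own length, as used in the text
pct_of_sequence = [100 * count / length for length, count in ROWS.values()]

if __name__ == "__main__":
    for name, values in [
        ("counts", counts),
        ("% of alignment length", pct_of_alignment),
        ("% of sequence length", pct_of_sequence),
    ]:
        m, s, rel = summarise(values)
        print(f"{name}: mean={m:.2f} stdev={s:.2f} relative={rel:.3f}")
```

Dividing by the sequence length instead of the alignment length removes the contribution of gap columns, which is why these percentages are higher than the ones in table 1.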

However, even a cursory look reveals extended indels in the sequences alongside extended blocks of conserved positions. All these observations can be combined as follows: even though the sequences are taken from three different domains of life, their function is entirely conserved, and this is reflected in the amino acid sequences.

Manual evolution

Project: alignment2_1 and manually fixed alignment2_2; mut_script.sh

The experiment was to mutate a protein sequence artificially, align the result by computer and fix the alignment manually. The mutations were made with a bash script powered by EMBOSS on the section of the merA[2] protein from residue 121 to residue 220. The fixed alignment is shown in fig. 2.

Fig. 2. Fixed alignment 2. p0: original sequence, p1-p7: successively mutated descendants. ClustalX, 100% identity.

I made several changes to the alignment. For instance, I moved the whole TGAAI block at positions 37-41 one position to the right to match the letters in the p0 sequence. This made the deletion of A at position 37 (p0-p1) visible. Another example is the shift of E in p0-p1:68 to p0-p1:69 to match the whole column of E and make the insertion of Q in p2-p3:68 explicit (these are now moved one position to the right). There were many more fixes of this kind.

The mutations were set up so that only seven point mutations occur in each generation, over seven tested generations. Information about the first ten mutated positions is presented in table 2.

Table 2. Information about first ten mutations in experiment with mutated protein.
Position Mutation Parental generation Seed generation
4 Insertion of I p0 p1
5 Insertion of W p2 p3
6 Insertion of F p6 p7
10 Insertion of W p4 p5
17 Insertion of D p6 p7
22 Deletion of V p2 p3
24 Substitution of G by P p3 p4
26 Substitution of T by L p4 p5
28 Insertion of C p1 p2
28 Deletion of C p2 p3
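The mutation scheme behind table 2 (seven point mutations per generation over seven generations) can be imitated with a minimal Python sketch. This is not the author's mut_script.sh, which relies on EMBOSS. The function names, the equal probability of substitutions, insertions and deletions, and the placeholder starting sequence are all assumptions.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def mutate_once(seq, rng):
    """Apply one random point mutation: substitution, insertion or deletion."""
    kind = rng.choice(["substitution", "insertion", "deletion"])
    if not seq:
        kind = "insertion"  # nothing to substitute or delete in an empty sequence
    if kind == "insertion":
        pos = rng.randrange(len(seq) + 1)
        return seq[:pos] + rng.choice(AMINO_ACIDS) + seq[pos:]
    pos = rng.randrange(len(seq))
    if kind == "deletion":
        return seq[:pos] + seq[pos + 1:]
    # Substitution: choose a residue that differs from the current one
    new_residue = rng.choice([a for a in AMINO_ACIDS if a != seq[pos]])
    return seq[:pos] + new_residue + seq[pos + 1:]


def evolve(seq, generations=7, mutations_per_generation=7, seed=0):
    """Return the lineage [p0, p1, ..., pN] of successively mutated sequences."""
    rng = random.Random(seed)
    lineage = [seq]
    for _ in range(generations):
        for _ in range(mutations_per_generation):
            seq = mutate_once(seq, rng)
        lineage.append(seq)
    return lineage


if __name__ == "__main__":
    # Placeholder 100-residue sequence, NOT the real merA fragment
    p0 = "ACDEFGHIKLMNPQRSTVWY" * 5
    for index, sequence in enumerate(evolve(p0)):
        print(f">p{index}\n{sequence}")
```

The output is in FASTA-like form, so the generations can be fed to an alignment program and compared with the known mutation history.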

To recap, the alignment can only emulate one possible course of evolution. The experiment showed that the algorithms cannot correctly restore even a simple example of an evolutionary process.

Nucleotide spin-off

Project: align_urease, align_urease_cds1, align_urease_cds2; nuc_mut_script.sh

This experiment is more complicated than the previous one: the CDS is mutated, while the protein sequence is examined. I chose urease subunit gamma[3] and retrieved its nucleotide sequence from GenBank[4]. The CDS turned out to lie on the complementary DNA strand, so I ran the EMBOSS revseq program before starting the mutation process. Because the CDS mutates randomly, nonsense mutations occur, so the script from the previous experiment was modified to weed out unsuitable sequences. The protein alignment is shown in fig. 3.

Fig. 3. Alignment 3. p0: original sequence, p1-p7: successively mutated descendants. Coloured by percentage identity.

The alignment exhibits large stretches of identical positions. Some of them, such as WNRSQ and GIDRN at positions 36-40, recur over the generations. This points to randomly occurring nucleotide indels that cause several frameshifts. Although such a situation would not survive selection pressure in nature, the artificial conditions release the full power of the mutation process. Because of the many frameshifts it seemed unnecessary to fix the protein alignment manually; instead I fixed the nucleotide alignment. All alignments are stored in the common alignment project. Information about the first ten mutated positions, including block mutations, is provided in table 3.

Table 3. The information about first ten mutations in experiment with mutated CDS.
Position Mutation Parental generation Seed generation
2 Substitution of N by K p4 p5
3 Substitution of L by R p0 p1
3 Substitution of R by Q p2 p3
3 Substitution of Q by P p4 p5
4 Substitution of S by A p2 p3
4 Substitution of A by S p4 p5
5-6 Substitution of PR by AE p5 p6
8 Insertion of K p4 p5
15-20 Substitution of SLAAIV by QSGGDC p1 p2
21-23 Substitution of ARG by RTR p1 p2
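The two preparatory steps of this section, taking the reverse complement of the CDS and weeding out sequences with nonsense mutations, can be sketched in Python. This is not nuc_mut_script.sh, which uses EMBOSS revseq and a bash filter. The function names, the use of the standard codon assignments and the rule "no stop codon before the last codon" are assumptions about what counts as a suitable sequence.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino
    for codon, amino in zip(product(BASES, repeat=3), AMINO)
}


def reverse_complement(dna):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(cds):
    """Translate a CDS codon by codon; incomplete trailing bases are ignored."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    codons = (cds[i:i + 3] for i in range(0, usable, 3))
    # Codons with unexpected characters are rendered as X
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def is_suitable(cds):
    """Accept a mutated CDS only if no stop codon occurs before the last codon."""
    protein = translate(cds)
    return "*" not in protein[:-1]


if __name__ == "__main__":
    # Hypothetical toy sequence stored on the complementary strand
    stored = "TTATTTCAT"
    cds = reverse_complement(stored)
    print(cds, translate(cds), is_suitable(cds))
```

A filter of this kind rejects premature stop codons but still lets frameshifts through, which is consistent with the frameshifted blocks visible in the protein alignment.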

Some controversial statements

Below are several statements about evolution and alignment that deserve comment.

Only nucleotide sequences evolve.

Of course, changes accumulate in nucleotide sequences, but for proteins the selection pressure acts on the amino acid sequences, because they are the functional embodiment of the nucleotide sequences.

Homologous proteins show similar structure.

This statement seems obvious, but there is a paper[5] that describes a variety of proteins similar in sequence yet dissimilar in structure. It also provides examples of homologous proteins, for instance in this figure[6], which show significant structural dissimilarities. Indeed, a protein can be folded and processed in many ways, so the final structures of homologous proteins may vary.

Only mutations in gametes are inherited.

This is partially true: this is how inheritance works in sexual reproduction. In asexual reproduction only mutations that occur in spore cells are inherited; in vegetative reproduction somatic mutations may be inherited. And, setting the reproduction of organisms aside, mutations are inherited by the somatic cells of a given organism through ordinary mitosis.

Examining Craig Venter's claims

In 2010 C. Venter's team announced the creation of a synthetic bacterial cell. The paper[7] was quite adequate, which cannot be said of Mr. Venter's interview with CNN[8]. Here are some extracts from the interview, each followed by a rebuttal.

We built it from four bottles of chemicals.

Of course, there were 4 "bottles" of dNTPs, but the DNA segments were synthesized chemically and joined in yeast cells by means of overlaps deliberately preserved in each segment. So the machinery was somewhat more sophisticated than Mr. Venter stated.

So it's the first living self-replicating cell that we have on the planet whose DNA was made chemically and designed in the computer.

Frankly, the described process of DNA synthesis is quite familiar in bioengineering. Moreover, the genome was not designed in the computer but taken from a real living cell. There is no step forward in this.

So it has no genetic ancestors. Its parent is a computer.

There was a genetic ancestor: the bacterium of the first type. And, of course, neither a computer nor a human can be called a direct parent of a bacterium.

References

  1. Wikipedia article on HSP70 family;
  2. A page about merA on this site;
  3. A Uniprot page of ureA;
  4. Nucleotide sequence of particular ureA CDS;
  5. Mickey Kosloff, Rachel Kolodny, Sequence-similar, structure-dissimilar protein pairs in the PDB, Proteins 2008 May 1; 71(2): 891–902, doi: 10.1002/prot.21770;
  6. Figure 05 in [5];
  7. Daniel G. Gibson et al., Creation of a bacterial cell controlled by a chemically synthesized genome, Science 02 Jul 2010: Vol. 329, Issue 5987, pp. 52-56, doi: 10.1126/science.1190719;
  8. Scientist: 'We didn't create life from scratch', from CNN reports.