Alignment as evolution's reflection
Last update on the 18th of April, 2017
Here I am getting to grips with a true bioinformatics method: alignment. The software I used was Jalview, a dedicated program for multiple sequence alignment, bash and a spreadsheet for retrieving and processing the data, and Python to generate the HTML tables. The files available for download are listed just below this paragraph.
File | Link |
---|---|
Jalview project with all alignments of this task | align_project.jvp |
Bash script for manual evolution section | mut_script.sh |
Modified bash script for nucleotide spin-off section | nuc_mut_script.sh |
All files including sequences in *.fasta in one zip folder | align1_archive.zip |
Basic features of Jalview
Project: alignment1
I built a 6-sequence alignment of randomly chosen proteins of the HSP70[1] family from Eukaryota, Bacteria and Archaea. The alignment, coloured with the ClustalX scheme under the condition Identity Threshold = 100%, is presented in fig. 1. With this setting, only columns in which every sequence has the same amino acid residue are coloured.
I marked three kinds of columns in the alignment. First, with label C, are columns conserved at 80% or more, meaning that at least 80% of the amino acid residues in the column are the same. Second, with label F, are functionally conserved columns, meaning that all residues in the column share the same physico-chemical properties and belong to one class. For example, at position 181 both Ile and Val are aliphatic, at position 189 both Asp and Glu are negatively charged, and at position 471 both Ser and Thr are polar with an -OH group. Third, with label G, I marked the columns containing gaps.
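The three labels can be illustrated with a minimal Python sketch. It is not how Jalview or ClustalX assign colours. The residue classes in `RESIDUE_CLASSES`, the function names and the choice to count the 80% share over all rows (gaps included) are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical residue classes, for illustration only; real schemes differ in detail.
RESIDUE_CLASSES = {
    "aliphatic": set("AVLIM"),
    "aromatic": set("FWY"),
    "negative": set("DE"),
    "positive": set("KRH"),
    "hydroxyl": set("ST"),
    "amide": set("NQ"),
    "special": set("CGP"),
}
CLASS_OF = {aa: name for name, members in RESIDUE_CLASSES.items() for aa in members}


def classify_column(column, identity_threshold=0.8):
    """Return the set of labels (C, F, G) that apply to one alignment column."""
    labels = set()
    if "-" in column:
        labels.add("G")  # the column contains at least one gap
    residues = [aa for aa in column if aa != "-"]
    if not residues:
        return labels
    # C: the most frequent residue fills at least 80% of the rows (assumption:
    # the share is counted over all rows, gaps included).
    top_count = Counter(residues).most_common(1)[0][1]
    if top_count / len(column) >= identity_threshold:
        labels.add("C")
    # F: no gaps, and every residue belongs to the same class.
    classes = {CLASS_OF.get(aa) for aa in residues}
    if "G" not in labels and len(classes) == 1 and None not in classes:
        labels.add("F")
    return labels


def classify_alignment(rows):
    """Classify every column of an alignment given as equal-length strings."""
    return [classify_column("".join(column)) for column in zip(*rows)]
```

Calling `classify_alignment` on the aligned sequences as plain strings gives one label set per column, which can then be compared with the manual annotation in Jalview.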
Then I calculated the main properties of this alignment with the EMBOSS infoalign program. I ran the program three times, with the options -identity 100.0, -identity 70.0 and -plurality 100.0. The first run displayed the number of 100% identical residues, the second did the same for 70%, and the third was for 100% functionally identical residues. I also added options for counting the number of gap positions (-gapcount) and the length of the alignment (-alignlength).
The resulting TSV tables were exported to spreadsheet software, merged and processed. The results are shown in table 1.
Uniprot AC | Domain | Sequence length | Alignment length | Number of gap positions | Percent of gap positions | Number of 100% conserved positions | Percent of 100% conserved positions | Number of 70% conserved positions | Percent of 70% conserved positions | Number of functionally conserved positions | Percent of functionally conserved positions |
---|---|---|---|---|---|---|---|---|---|---|---|
B5ZWQ2 | Bacteria | 639 | 724 | 85 | 11.74 | 140 | 19.34 | 325 | 44.89 | 270 | 37.29 |
A9A135 | Archaea | 636 | 724 | 88 | 12.15 | 140 | 19.34 | 316 | 43.65 | 270 | 37.29 |
O24581 | Eukaryota | 663 | 724 | 61 | 8.43 | 140 | 19.34 | 304 | 41.99 | 270 | 37.29 |
Q7V1H4 | Bacteria | 665 | 724 | 59 | 8.15 | 140 | 19.34 | 312 | 43.09 | 270 | 37.29 |
Q18GZ4 | Archaea | 641 | 724 | 83 | 11.46 | 140 | 19.34 | 321 | 44.34 | 270 | 37.29 |
A0T0H7 | Eukaryota | 613 | 724 | 111 | 15.33 | 140 | 19.34 | 317 | 43.78 | 270 | 37.29 |
A few remarks on the numbers in table 1. The numbers of 100% conserved and functionally conserved positions are clearly constant, whilst the numbers of 70% conserved positions and of gaps vary. The first two correspond to roughly 22% and 41% of the sequence length, respectively, which may be a sign of a rather stable sequence length. The number of gaps depends on the sequence and alignment lengths and ranges from 8% to 15% of the alignment, which is rather significant. The number of 70% conserved positions is more interesting: it stays close to 316 (rounded mean) with a standard deviation of 7.3, or 0.023 in relative terms. The corresponding percentages, computed relative to each sequence's own length rather than the alignment length used in table 1, have a mean of 49.19, a standard deviation of 2.3 and a relative standard deviation of 0.047. Such small relative standard deviations mean that these datasets vary little, which may be evidence of protein conservation.
However, even a cursory look reveals extended indels in the sequences alongside extended blocks of conserved positions. All these observations can be combined as follows: even though the sequences are taken from three different domains of life, their function is entirely conserved, which is reflected in the amino acid sequences.
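The statistics for the 70% conserved positions can be rechecked with a short Python sketch. The input values are copied from table 1. The helper name `summarize` and the choice of the sample standard deviation (n - 1 in the denominator) are assumptions; under them the figures quoted above are reproduced to within rounding.

```python
from statistics import mean, stdev

# Values copied from table 1, in row order.
sequence_lengths = [639, 636, 663, 665, 641, 613]
conserved_70 = [325, 316, 304, 312, 321, 317]


def summarize(values):
    """Return mean, sample standard deviation and relative standard deviation."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation, n - 1 in the denominator
    return m, s, s / m


# Statistics of the raw counts of 70% conserved positions.
count_mean, count_sd, count_rel = summarize(conserved_70)

# The same counts as percentages of each sequence's own length.
percents = [100 * c / n for c, n in zip(conserved_70, sequence_lengths)]
pct_mean, pct_sd, pct_rel = summarize(percents)

print(f"counts:   mean={count_mean:.1f} sd={count_sd:.1f} relative={count_rel:.3f}")
print(f"percents: mean={pct_mean:.2f} sd={pct_sd:.1f} relative={pct_rel:.3f}")
```

Dividing by the alignment length (724) instead of the sequence lengths would give the percentages listed in table 1, so the two sets of percentages differ only in the chosen denominator.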
Manual evolution
Project: alignment2_1 and manually fixed alignment2_2; mut_script.sh
The experiment was to artificially mutate a protein sequence, align the results by computer and then fix the alignment manually. The mutations were made with a bash script powered by EMBOSS on the section of the merA[2] protein from residue 121 to residue 220. The fixed alignment is shown in fig. 2.
I made several changes to the alignment. For instance, I moved the whole TGAAI block at positions 37-41 one position to the right to match the letters in the p0 sequence, so that the deletion of A at position 37 between p0 and p1 became visible. Another example is the shift of E in p0-p1:68 to p0-p1:69, which lines up the whole column of E and makes the insertion of Q in p2-p3:68 stand out (these are now moved one position to the right). There were many more fixes besides these.
The mutations were made so that only seven point mutations occur in each generation, over a span of seven generations. Information about the first ten mutated positions is presented in table 2.
Position | Mutation | Parental generation | Seed generation |
---|---|---|---|
4 | Insertion of I | p0 | p1 |
5 | Insertion of W | p2 | p3 |
6 | Insertion of F | p6 | p7 |
10 | Insertion of W | p4 | p5 |
17 | Insertion of D | p6 | p7 |
22 | Deletion of V | p2 | p3 |
24 | Substitution of G by P | p3 | p4 |
26 | Substitution of T by L | p4 | p5 |
28 | Insertion of C | p1 | p2 |
28 | Deletion of C | p2 | p3 |
To recap, the alignment can only emulate a possible course of evolution. The experiment showed that the algorithms were unable to correctly reconstruct even this simple example of an evolutionary process.
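The mutation scheme of this experiment (seven point mutations per generation over seven generations) can be sketched in Python. This is a stand-in for illustration, not the EMBOSS-based mut_script.sh. The function names and the equal probability of substitutions, insertions and deletions are assumptions.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def mutate_once(seq, rng):
    """Apply one random point mutation: substitution, insertion or deletion."""
    kinds = ["substitution", "insertion"]
    if len(seq) > 1:
        kinds.append("deletion")  # never delete the last remaining residue
    kind = rng.choice(kinds)
    pos = rng.randrange(len(seq))
    if kind == "substitution":
        new = rng.choice([aa for aa in AMINO_ACIDS if aa != seq[pos]])
        return seq[:pos] + new + seq[pos + 1:]
    if kind == "insertion":
        return seq[:pos] + rng.choice(AMINO_ACIDS) + seq[pos:]
    return seq[:pos] + seq[pos + 1:]


def evolve(seed, generations=7, mutations_per_generation=7, rng=None):
    """Return the lineage p0, p1, ..., each derived from the previous generation."""
    rng = rng or random.Random()
    lineage = [seed]
    for _ in range(generations):
        seq = lineage[-1]
        for _ in range(mutations_per_generation):
            seq = mutate_once(seq, rng)
        lineage.append(seq)
    return lineage
```

Passing a seeded generator, for example `evolve(fragment, rng=random.Random(1))`, makes a run repeatable, so the true mutation history stays available for comparison with what the alignment program reconstructs.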
Nucleotide spin-off
Project: align_urease, align_urease_cds1, align_urease_cds2; nuc_mut_script.sh
This experiment is more complicated than the previous one: the mutations are applied to the CDS, while the protein sequence is what gets examined.
I chose urease subunit gamma[3] and retrieved its nucleotide sequence from GenBank[4]. The CDS turned out to lie on the complementary DNA strand, so I ran the EMBOSS revseq program before starting the mutation process.
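The effect of that step, obtaining the reverse complement of the retrieved strand, can be shown with a small Python sketch. The function name is hypothetical, and only unambiguous A, C, G and T bases are handled.

```python
# Translation table for complementing unambiguous DNA bases in either case.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.translate(COMPLEMENT)[::-1]
```

Applying the function twice returns the original sequence, which makes a quick sanity check before mutating.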
As the CDS mutates randomly, nonsense mutations occur, so the script from the previous experiment was modified to weed out unsuitable sequences. The protein alignment is shown in fig. 3.
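One way such a filter could work is sketched below in Python: translate each mutated CDS and reject it if a stop codon appears before the final codon. The source does not describe the actual criterion used in nuc_mut_script.sh, so the function names and the use of the standard genetic code are assumptions.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(cds):
    """Translate a CDS in frame 1; unknown codons become X, stops become *."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, usable, 3))


def has_premature_stop(cds):
    """True if a stop codon occurs anywhere before the last translated codon."""
    return "*" in translate(cds)[:-1]


def keep_viable(candidates):
    """Keep only the mutated CDS variants without a premature stop codon."""
    return [cds for cds in candidates if not has_premature_stop(cds)]
```

A filter of this kind removes nonsense mutations but leaves frameshifts in place, as long as the shifted frame does not run into a stop codon.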
The alignment exhibits large stretches of identical positions. Some of them alternate across generations, such as WNRSQ and GIDRN at positions 36-40. This points to randomly occurring nucleotide indels that cause several frameshifts. Although such a situation would not survive selection pressure in nature, the artificial conditions unleash the full power of the mutation process. Because of the many frameshifts it seemed unnecessary to fix the protein alignment manually; I fixed the nucleotide alignment instead. All alignments are stored in the common alignment project. Information about the first ten mutated positions, including block mutations, is provided in table 3.
Position | Mutation | Parental generation | Seed generation |
---|---|---|---|
2 | Substitution of N by K | p4 | p5 |
3 | Substitution of L by R | p0 | p1 |
3 | Substitution of R by Q | p2 | p3 |
3 | Substitution of Q by P | p4 | p5 |
4 | Substitution of S by A | p2 | p3 |
4 | Substitution of A by S | p4 | p5 |
5-6 | Substitution of PR by AE | p5 | p6 |
8 | Insertion of K | p4 | p5 |
15-20 | Substitution of SLAAIV by QSGGDC | p1 | p2 |
21-23 | Substitution of ARG by RTR | p1 | p2 |
Some controversial statements
Here are several statements about evolution and alignment that deserve comment.
Only nucleotide sequences evolve.
Of course, changes accumulate in nucleotide sequences, but where proteins are concerned, selection pressure acts on the amino acid sequences, because they are the real functional embodiment of the nucleotide sequences.
Homologous proteins show similar structure.
This statement seems obvious, but there is a paper[5] that describes a variety of proteins similar in sequence yet dissimilar in structure. It also provides examples of homologous proteins, for example in this figure[6], where significant structural dissimilarities are visible. Indeed, a protein can be folded and processed in many ways, so the final structures of homologous proteins may vary.
Only mutations in germ cells are inherited.
This is partially true: this is how inheritance works in sexual reproduction. In asexual reproduction only those mutations that occur in spore cells are inherited, and in vegetative reproduction somatic mutations may be inherited. And, setting the reproduction of the organism aside, mutations are inherited by the somatic cells of a particular organism through ordinary mitosis.
Revealing Craig Venter's affirmations
In 2010 C. Venter's team announced the creation of a synthetic bacterial cell. The paper[7] was quite adequate, which cannot be said of Mr. Venter's interview with CNN[8]. Here are some extracts from the interview, each followed by a comment.
We built it from four bottles of chemicals.
Of course, there were four "bottles" of dNTPs, but the DNA segments were synthesized chemically and then joined in yeast cells by means of overlaps deliberately kept in each segment. So the machinery was somewhat more sophisticated than Mr. Venter stated.
So it's the first living self-replicating cell that we have on the planet whose DNA was made chemically and designed in the computer.
Frankly, the described process of DNA synthesis is quite familiar in bioengineering. Moreover, the genome was not designed in the computer but taken from a real living cell. There is no step forward in that.
So it has no genetic ancestors. Its parent is a computer.
There was a genetic ancestor: the bacterium of the first type. And, of course, neither a computer nor a human can be called a direct parent of a bacterium.
References
1. Wikipedia article on the HSP70 family;
2. A page about merA on this site;
3. The Uniprot page of ureA;
4. Nucleotide sequence of the particular ureA CDS;
5. Mickey Kosloff, Rachel Kolodny, Sequence-similar, structure-dissimilar protein pairs in the PDB, Proteins 2008 May 1; 71(2): 891–902, doi: 10.1002/prot.21770;
6. Figure 5 in [5];
7. J. Craig Venter et al., Creation of a bacterial cell controlled by a chemically synthesized genome, Science 02 Jul 2010: Vol. 329, Issue 5987, pp. 52-56, doi: 10.1126/science.1190719;
8. Scientist: 'We didn't create life from scratch', from CNN reports.