Genome alignment
Last update on the 13th of November, 2017
This task is about the intraspecies alignment of three genomes. Major genome rearrangements are described.
File | Link |
---|---|
Hit table analysis | table.ods |
Genomes to choose
I chose the species Aeromonas hydrophila because it was the first one in which I found genome-wide rearrangements between strains. More information about the strains is given in table 1.
Strain | Accession | Genome length |
---|---|---|
GYK1 | CP016392.1 | 5Mb |
4AK4 | CP006579.1 | 4,5Mb |
AL09-71 | CP007566.1 | 5Mb |
Blast2seq on the NCBI web page was used to build identity maps and obtain hit tables. Since homology is to be evaluated, megablast was used. To clean up the identity maps, the initial word length was set to 48. The identity maps for the corresponding strain pairs are shown in figs. 1-3.
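For reproducibility, a comparable local run can be sketched as follows. This is only an approximation of the web-page settings: it assumes BLAST+ is installed, the FASTA and output file names are hypothetical, and the remaining defaults of the command-line tool may differ from those of the web form.

```python
import shlex


def megablast_command(query_fasta, subject_fasta, out_path, word_size=48):
    """Build a blastn command line comparing two sequences with megablast."""
    return [
        "blastn",
        "-task", "megablast",             # megablast mode, as in the text
        "-query", query_fasta,
        "-subject", subject_fasta,        # pairwise (blast2seq-like) comparison
        "-word_size", str(word_size),     # initial word length of 48
        "-outfmt", "6",                   # tabular hit table
        "-out", out_path,
    ]


# Hypothetical file names, used for illustration only.
cmd = megablast_command("4AK4.fasta", "GYK1.fasta", "4AK4_vs_GYK1.tsv")
print(shlex.join(cmd))

# To actually run it (requires BLAST+ on PATH):
# import subprocess; subprocess.run(cmd, check=True)
```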
As can be seen from fig. 3, the GYK1 and AL09 genomes do not exhibit genome rearrangements: both chromosomes are circular, and different strands are stored in the databank. However, 4AK4 vs GYK1 (and AL09) exhibits extensive genome rearrangements, all of which are inverted regions. Mapping those regions separately seems unnecessary, as they are obvious from the identity maps. The map in fig. 2 looks like a transposed version of the map in fig. 1. This observation is supported by the fact that 4AK4 lacks 0,5 Mb compared with the other strains. Syntenic regions stretch along the diagonals of the maps.
Homology definition
To define homology properties, the hit tables of 4AK4 vs GYK1 and 4AK4 vs AL09 were processed with spreadsheet software. GYK1 and AL09 appear to be highly conserved, so there is no need to compare them.
Dots on an identity map are local alignments. The stretched lines that reflect homologous regions are sets of such alignments interleaved with non-similar regions. The longest alignment is 20 Kb, but the homologous regions are ten times longer.
BLAST searches for the best local alignments. At the scale of bacterial genomes, short identical DNA segments are easily found by chance alone. To filter out these artifacts, as well as short alignments that fall outside homologous groups, an algorithm was developed.
First, the distances between alignments are defined. Each alignment is a line segment in the Genome1-0-Genome2 coordinate system and has start and end coordinates. The distance between two consecutive alignments in the hit table is the distance between the end of the previous alignment and the start of the next one. It is calculated in a vector manner, i.e. using both coordinates.
A group of alignments is defined as a homologous group if the distances between alignment i and the consecutive alignments i+1, i+2 and i+3 are less than set values.
The longest alignment is 20 Kb, so the distances are 20 Kb, 60 Kb and 80 Kb, respectively.
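The grouping rule can be illustrated with a minimal Python sketch. It is one possible reading of the description, not the spreadsheet procedure itself. The `Hit` class and the function names are hypothetical, the distance is assumed to be the Euclidean length of the vector from the end of one alignment to the start of the next, and hits skipped inside the look-ahead window are assumed to be artifacts and are dropped. Alignment orientation is not checked here.

```python
from dataclasses import dataclass
from math import hypot


@dataclass
class Hit:
    q_start: int  # start on genome 1
    q_end: int    # end on genome 1
    s_start: int  # start on genome 2
    s_end: int    # end on genome 2


# Thresholds (bp) for the 1st, 2nd and 3rd following alignment, from the text.
THRESHOLDS = (20_000, 60_000, 80_000)


def gap(prev: Hit, nxt: Hit) -> float:
    """Length of the vector from the end of `prev` to the start of `nxt`."""
    return hypot(nxt.q_start - prev.q_end, nxt.s_start - prev.s_end)


def group_hits(hits, thresholds=THRESHOLDS):
    """Greedily chain hits (in hit-table order) into candidate homologous groups."""
    groups = []
    i = 0
    while i < len(hits):
        group = [hits[i]]
        last = i
        while True:
            nxt = None
            # Look at the next 1, 2 and 3 hits; take the first one close enough.
            for k, limit in enumerate(thresholds, start=1):
                j = last + k
                if j < len(hits) and gap(hits[last], hits[j]) < limit:
                    nxt = j
                    break
            if nxt is None:
                break  # no continuation found: the group ends here
            group.append(hits[nxt])  # hits between `last` and `nxt` are dropped
            last = nxt
        groups.append(group)
        i = last + 1
    return groups
```

Groups that consist of a single short hit can then be discarded as chance matches, and the remaining ones inspected by hand.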
Then the blocks were manually checked depending on the context and alignment orientation. The processed file is table.ods. Some properties of the homology are presented in table 2. Note that the true coverage, including non-similar regions inside homologous blocks, is not counted due to limitations of the current protocol.
Strains | Coverage 1 | Coverage 2 | Identity |
---|---|---|---|
4AK4 vs AL09 | 65,32 | 58,91 | 87,75 |
4AK4 vs GYK1 | 50,96 | 46,63 | 87,04 |
The coverage is far from acceptable, but it is not the real value. The identity, however, is very high considering bacterial mutation rates.
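How the values in table 2 were derived in the spreadsheet is not spelled out, so the following Python sketch shows only one plausible way to obtain such numbers from a filtered hit table. It assumes that coverage is the percentage of each genome covered by aligned segments (overlaps counted once) and that identity is the alignment-length-weighted mean of per-hit identity; the function names and the tuple layout are hypothetical.

```python
def merged_length(intervals):
    """Total length covered by 1-based inclusive intervals, overlaps counted once."""
    total = 0
    cur_start = cur_end = None
    # Normalise reversed intervals (minus-strand hits) before sorting.
    for start, end in sorted((min(a, b), max(a, b)) for a, b in intervals):
        if cur_end is None or start > cur_end + 1:
            if cur_end is not None:
                total += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start + 1
    return total


def summarize(hits, len1, len2):
    """hits: tuples (q_start, q_end, s_start, s_end, pct_identity, aln_length)."""
    hits = list(hits)
    cov1 = 100 * merged_length((h[0], h[1]) for h in hits) / len1
    cov2 = 100 * merged_length((h[2], h[3]) for h in hits) / len2
    aligned = sum(h[5] for h in hits)
    identity = sum(h[4] * h[5] for h in hits) / aligned if aligned else 0.0
    return cov1, cov2, identity


# Toy example with made-up numbers (not taken from table.ods).
example = [(1, 100, 201, 300, 90.0, 100), (51, 150, 400, 301, 80.0, 100)]
print(summarize(example, 1000, 2000))  # (15.0, 10.0, 85.0)
```

As in the protocol described above, non-similar stretches inside homologous regions do not contribute to the coverage in this sketch.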
Pangenome analysis
To further investigate the genome rearrangements, a pangenome was built with NPG explorer (NPGE). NPGE was run with the recommended MIN_IDENTITY value of 0.829.
An annotation mismatch was found (fig. 4) in the gene for DNA polymerase III subunit beta. 4AK4 uses the "common" start and stop codons, whereas GYK1 and AL09 use alternative ones that are similar to each other. Interestingly, the downstream sequence of the 4AK4 gene is identical to that of the other strains and has an insertion before the TAA stop codon that contains the more common TGA codon. Thus, the annotator apparently chose the more common stop codon in the case of 4AK4. The difference in start codons can be explained only by the annotation algorithm. It should be mentioned that under translation table 11 all these codons are valid. Also, GYK1 and AL09 agree completely in the gene sequence (judging by the sequence and taking into account the shared s and h blocks).