
Genome alignment

Last update on the 13th of November, 2017

This task is about intraspecies alignment of three genomes. Major genome rearrangements are described.

List of downloads
File Link
Hit table analysis table.ods

Genomes to choose

I chose Aeromonas hydrophila because it was the first species I found whose strains exhibit genome-wide rearrangements. More information about the strains is given in table 1.

Table 1. A. hydrophila strains.
Strain Accession Genome length
GYK1 CP016392.1 5 Mb
4AK4 CP006579.1 4.5 Mb
AL09-71 CP007566.1 5 Mb

Blast2seq on the NCBI web page was used to build identity maps and obtain hit tables. Since homology between strains of one species is being evaluated, megablast was used. To clean up the identity maps, the initial word size was set to 48. The identity maps for the corresponding strain pairs are shown in figs. 1-3.

Fig. 1. Identity map of 4AK4 (horizontal) and GYK1 (vertical).
Several huge regions are reversed. All regions exhibit high concordance.
Fig. 2. Identity map of 4AK4 (horizontal) and AL09 (vertical).
Same as fig. 1.
Fig. 3. Identity map of GYK1 (horizontal) and AL09 (vertical).
Genomes are almost identical. No regions of differing direction.

As seen in fig. 3, the GYK1 and AL09 genomes do not exhibit genome rearrangements: both chromosomes are circular and differing strands are stored in the databank. However, 4AK4 shows extensive rearrangements relative to GYK1 (and AL09), all of them reversed regions. Mapping these regions separately seems unnecessary, as they are obvious from the identity maps. The map in fig. 2 looks like a transposed version of the map in fig. 1. This observation is supported by the fact that 4AK4 is 0.5 Mb shorter than the other strains. Syntenic regions stretch along the diagonals of the maps.

Homology definition

table.ods

To characterize homology, the hit tables of 4AK4 vs GYK1 and 4AK4 vs AL09 were processed in spreadsheet software. GYK1 and AL09 appear to be highly conserved, so there is no need to compare them.

Dots on an identity map are local alignments. The stretched lines that reflect homologous regions are sets of such alignments interleaved with non-similar regions. The longest alignment is 20 kb, but the homologous regions are ten times longer.

BLAST searches for the best local alignments. At the scale of bacterial genomes, short identical DNA segments readily occur by chance. To remove these artifacts, as well as short alignments that fall outside any homologous group, an algorithm was developed.

First, the distances between alignments are defined. Each alignment is a line segment in the Genome1-0-Genome2 coordinate system and has start and end coordinates. The distance between two consecutive alignments in the hit table is the distance between the end of the previous alignment and the start of the next one. It is computed in a vector manner, that is, from the coordinates in both genomes.
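The distance definition above can be sketched in Python. This is only an illustration: it assumes that the vector distance is the Euclidean length of the vector from the end point of one alignment to the start point of the next. The names Hit and gap_distance are hypothetical and do not come from table.ods.

```python
import math
from typing import NamedTuple


class Hit(NamedTuple):
    """One local alignment as a segment in the Genome1-0-Genome2 plane."""
    q_start: int  # start coordinate in genome 1
    q_end: int    # end coordinate in genome 1
    s_start: int  # start coordinate in genome 2
    s_end: int    # end coordinate in genome 2


def gap_distance(prev: Hit, nxt: Hit) -> float:
    """Distance from the end point of prev to the start point of nxt.

    Assumption: the Euclidean norm is used as the vector distance.
    """
    dq = nxt.q_start - prev.q_end  # gap along genome 1
    ds = nxt.s_start - prev.s_end  # gap along genome 2
    return math.hypot(dq, ds)


a = Hit(1_000, 5_000, 2_000, 6_000)
b = Hit(8_000, 12_000, 10_000, 14_000)
# The gap is 3000 bp along genome 1 and 4000 bp along genome 2.
print(gap_distance(a, b))
```

Because the end of one alignment is compared with the start of the next in both genomes, a hit that jumps far away in only one genome still gets a large distance.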

A group of alignments is defined as a homologous group if the distances between alignment i and the subsequent alignments i+1, i+2 and i+3 are less than set values. The longest alignment is 20 kb, so the distance limits are 20 kb, 60 kb and 80 kb, respectively. The blocks were then checked manually, taking into account context and alignment orientation. The processed file is table.ods. Some homology properties are presented in table 2. Note that the true coverage, which would include the non-similar regions inside homologous blocks, is not calculated because of a limitation of the current protocol.
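One possible reading of this grouping rule is sketched below. The actual processing was done in a spreadsheet and checked manually, so the code is an assumption-laden illustration, not the procedure behind table.ods. It assumes that the group is extended greedily from its last member, that the first of the next three hits within its limit (20 kb, 60 kb, 80 kb) is taken, and that hits skipped on the way are treated as artifacts. The names Hit, gap_distance and group_hits are hypothetical.

```python
import math
from typing import List, NamedTuple, Sequence


class Hit(NamedTuple):
    q_start: int  # start in genome 1
    q_end: int    # end in genome 1
    s_start: int  # start in genome 2
    s_end: int    # end in genome 2


def gap_distance(prev: Hit, nxt: Hit) -> float:
    # Assumption: Euclidean distance from the end of prev to the start of nxt.
    return math.hypot(nxt.q_start - prev.q_end, nxt.s_start - prev.s_end)


# Limits for the hits at offsets +1, +2 and +3, as given in the text.
LIMITS = (20_000, 60_000, 80_000)


def group_hits(hits: Sequence[Hit], limits=LIMITS) -> List[List[int]]:
    """Return groups of hit indices; hits skipped inside a group are dropped."""
    groups = []
    i = 0
    while i < len(hits):
        group = [i]
        last = i
        while True:
            for k, limit in enumerate(limits, start=1):
                j = last + k
                if j < len(hits) and gap_distance(hits[last], hits[j]) < limit:
                    group.append(j)  # extend the group, skipping hits in between
                    last = j
                    break
            else:
                break  # none of the next three hits is close enough
        groups.append(group)
        i = last + 1
    return groups


hits = [
    Hit(0, 10_000, 0, 10_000),
    Hit(15_000, 25_000, 15_000, 25_000),
    Hit(26_000, 26_100, 3_000_000, 3_000_100),  # stray short hit far away
    Hit(30_000, 40_000, 30_000, 40_000),
    Hit(2_000_000, 2_010_000, 2_000_000, 2_010_000),
]
print(group_hits(hits))
```

In this toy input the stray hit at index 2 is skipped, and the distant hit at index 4 forms a group of its own. Single-hit groups would still need the manual check described above.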

Table 2. Homology properties of strains.
Strains Coverage 1 Coverage 2 Identity
4AK4 vs AL09 65.32 58.91 87.75
4AK4 vs GYK1 50.96 46.63 87.04

The coverage is far from acceptable, but it is not the real value, since the non-similar regions inside homologous blocks are not counted. The identity, however, is very high given bacterial mutation rates.
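The text does not state exactly how the values in table 2 were computed. As a hedged sketch, the code below assumes that coverage is the percentage of a genome covered by the union of the retained alignments, and that identity is the length-weighted mean of the per-alignment identities. The function names and the toy numbers are hypothetical and are not taken from table.ods.

```python
from typing import Iterable, Tuple


def covered_length(intervals: Iterable[Tuple[int, int]]) -> int:
    """Length of the union of 1-based inclusive intervals.

    Reversed intervals (start > end) are normalized first.
    """
    total = 0
    cur_start = cur_end = None
    for start, end in sorted((min(a, b), max(a, b)) for a, b in intervals):
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                total += cur_end - cur_start + 1  # close the previous run
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)  # overlapping interval, extend the run
    if cur_end is not None:
        total += cur_end - cur_start + 1
    return total


def coverage_percent(intervals, genome_length: int) -> float:
    """Assumed definition: share of the genome covered by alignments, in %."""
    return 100 * covered_length(intervals) / genome_length


def weighted_identity(hits: Iterable[Tuple[float, int]]) -> float:
    """Assumed definition: mean identity (%) weighted by alignment length."""
    hits = list(hits)
    return sum(ident * length for ident, length in hits) / sum(l for _, l in hits)


toy_intervals = [(1, 100), (51, 150), (400, 301)]  # the last one is reversed
print(covered_length(toy_intervals))
print(coverage_percent(toy_intervals, 1000))
print(weighted_identity([(90.0, 100), (80.0, 300)]))
```

Merging overlapping intervals before summing avoids counting the same genome position twice. Whether the spreadsheet did this is not known.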

Pangenome analysis

To further investigate the genome rearrangements, the pangenome was built with NPG explorer (NPGE). NPGE was run with the recommended MIN_IDENTITY value of 0.829.

An annotation mismatch was found (fig. 4) in the gene for DNA polymerase III subunit beta. 4AK4 uses "common" start and stop codons, whereas GYK1 and AL09 use alternative ones, the same in both strains. Interestingly, the downstream sequence of the 4AK4 gene is identical to that of the other strains, but 4AK4 has an insertion before the TAA stop that contains the more common TGA codon. Thus the annotator chose the more common stop codon in the case of 4AK4. The difference in start codons can be explained only by the annotator's algorithm. It should be mentioned that under translation table 11 all these codons are possible. Also, GYK1 and AL09 agree completely in gene sequence (judging by the sequence and by the shared s and h blocks).
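The statement about translation table 11 can be checked with a short Python sketch. The start and stop codon sets below are written out by hand from NCBI translation table 11 (bacterial, archaeal and plant plastid code). The dictionary of annotated codons repeats the values shown in fig. 4, and the names used are hypothetical.

```python
# Start and stop codons of NCBI translation table 11, listed by hand.
TABLE11_START_CODONS = {"TTG", "CTG", "ATT", "ATC", "ATA", "ATG", "GTG"}
TABLE11_STOP_CODONS = {"TAA", "TAG", "TGA"}

# Annotated (start, stop) codons of the DNAP III beta gene, as in fig. 4.
annotated = {
    "4AK4": ("ATG", "TGA"),
    "GYK1": ("GTG", "TAA"),
    "AL09": ("GTG", "TAA"),
}


def codons_valid(start: str, stop: str) -> bool:
    """True if both codons are allowed as start and stop under table 11."""
    return (start.upper() in TABLE11_START_CODONS
            and stop.upper() in TABLE11_STOP_CODONS)


for strain, (start, stop) in annotated.items():
    print(strain, start, stop, codons_valid(start, stop))
```

All three annotations pass this check, so the check alone cannot decide which annotation is correct.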

Fig. 4. Annotation mismatch displayed in qnpge (open image in a new tab for better quality).
A) Blocks containing the DNAP III beta gene, showing its homology structure in the 3 strains.
B) Differences in start codon annotation. 4AK4 uses ATG, GYK1 and AL09 use GTG.
C) Differences in stop codon annotation. 4AK4 uses TGA (opal), others use TAA (ochre).