EMBOSS and proteome review
Last update on the 26th of March, 2017
EMBOSS is an open-source software package for sequence analysis in molecular biology. It provides a collection of command-line tools for working with bioinformatics data. This work is a simple analysis of two proteomes taken from UniProt, carried out with EMBOSS and the spreadsheet software LibreOffice Calc.
Proteomes
The studied proteomes are those of Halocynthiibacter arcticus and Escherichia coli strain K12. Basic information about the proteomes is given in table 1.
Attribute | H. arcticus | E. coli K12 |
---|---|---|
Proteome ID | UP000070371 | UP000000625 |
Number of sequences | 3882 | 4306 |
Number of residues | 1158560 | 1356195 |
The workflow is as follows: download the proteomes from UniProt in FASTA format and count each amino acid residue with the EMBOSS wordcount programme. Then process the counts in spreadsheet software to compare residue frequencies between the two proteomes. The result is shown in table 2.
Amino acid residue | H. arcticus, % | E. coli K12, % | Difference (H. arcticus - E. coli K12), % |
---|---|---|---|
A | 11,08 | 9,51 | 1,57 |
L | 9,97 | 10,67 | -0,71 |
G | 8,07 | 7,37 | 0,69 |
V | 7,26 | 7,07 | 0,19 |
S | 5,97 | 5,80 | 0,17 |
E | 5,93 | 5,76 | 0,16 |
I | 5,86 | 6,01 | -0,16 |
R | 5,84 | 5,51 | 0,33 |
D | 5,75 | 5,15 | 0,60 |
T | 5,75 | 5,40 | 0,35 |
P | 4,69 | 4,43 | 0,26 |
F | 4,05 | 3,89 | 0,16 |
K | 3,99 | 4,41 | -0,42 |
Q | 3,31 | 4,44 | -1,12 |
N | 3,18 | 3,95 | -0,76 |
M | 2,76 | 2,82 | -0,06 |
Y | 2,33 | 2,85 | -0,51 |
H | 2,00 | 2,27 | -0,26 |
W | 1,31 | 1,53 | -0,22 |
C | 0,92 | 1,16 | -0,24 |
U | 0,00 | 0,00022 | -0,00022 |
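The percentages in table 2 were calculated in a spreadsheet; the same step could also be scripted. The following is a minimal Python sketch, assuming the wordcount output has one word and its count per line, separated by whitespace. The function names and the sample data are hypothetical and are not taken from the real proteomes.

```python
def parse_wordcount(text):
    """Parse 'word<whitespace>count' lines into a dict of integer counts."""
    counts = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or unexpected lines
        word, count = fields
        counts[word] = int(count)
    return counts


def percent_frequencies(counts):
    """Convert absolute counts into percentages of the total count."""
    total = sum(counts.values())
    return {word: 100.0 * count / total for word, count in counts.items()}


# Made-up sample standing in for a real wordcount output file.
sample = "A\t6\nL\t3\nG\t1\n"
freqs = percent_frequencies(parse_wordcount(sample))

# Print residues from the most to the least frequent.
for residue, pct in sorted(freqs.items(), key=lambda item: -item[1]):
    print(f"{residue}\t{pct:.2f}")
```

With real data, the text would be read from the file named by the -outfile qualifier instead of the inline sample.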
In both H. arcticus and E. coli K12 the three most frequent residues are alanine, leucine and glycine, and the three least frequent are histidine, tryptophan and cysteine (selenocysteine is not taken into account). The most frequent residue is leucine in E. coli K12 and alanine in H. arcticus. The largest difference in favour of E. coli is for glutamine; the largest in favour of H. arcticus is for alanine.
According to the data in table 2, it seems quite possible that the differences in frequency are not random. First, if all 21 residues were equally likely, each would have a frequency of 100 / 21 ≅ 4,76 percent, whereas the observed frequencies range from 0 to 11,08 percent. Second, for each residue except U I calculated the ratio of the difference in frequency to the average frequency in the two bacteria, expressed as a percentage (download the table here). The average of these ratios is 10,4% with a maximum of 29,01% and a minimum of 2,12%. Such numbers allow us to suspect that the differences in frequency are not random, which can be explained by the uneven distribution of residues in proteins.
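The ratio described above can be sketched in Python as follows. This assumes the ratio uses the absolute value of the difference, which the text does not state explicitly. The input values are the rounded percentages from table 2, so the results differ slightly from the 29,01% and 2,12% quoted above, which were presumably computed from unrounded data.

```python
def relative_difference(freq_a, freq_b):
    """Absolute difference of two frequencies as a percentage of their mean."""
    mean = (freq_a + freq_b) / 2
    return abs(freq_a - freq_b) / mean * 100


# Rounded values from table 2: (H. arcticus %, E. coli K12 %).
table2 = {
    "A": (11.08, 9.51),
    "Q": (3.31, 4.44),
    "M": (2.76, 2.82),
}

ratios = {res: relative_difference(a, b) for res, (a, b) in table2.items()}

# Frequency each residue would have if all 21 were equally likely.
uniform = 100 / 21

for res, ratio in sorted(ratios.items(), key=lambda item: -item[1]):
    print(f"{res}\t{ratio:.2f}%")
print(f"uniform\t{uniform:.2f}%")
```

Among these three residues, glutamine gives the largest ratio and methionine the smallest.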
User's guide
EMBOSS contains two programmes for counting "words" in sequences: wordcount and compseq. An inexperienced user can easily be confused about which one to use. At first glance they are almost the same: the input is a file with one or more sequences, the output is a file with word counts, and the word size can be set. However, the devil is in the details. All the information below comes from the EMBOSS help resources. Execution time was measured with the bash time command while counting all words of length 2 in the H. arcticus proteome.
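The timings reported here were taken with the bash time command. As an alternative, a comparable wall-clock measurement could be made from Python. This is only a sketch: the helper name time_command is hypothetical, and a trivial stand-in command is used so that the example runs without EMBOSS installed.

```python
import subprocess
import sys
import time


def time_command(args):
    """Run a command and return its wall-clock duration in seconds."""
    start = time.perf_counter()
    subprocess.run(args, check=True)
    return time.perf_counter() - start


# Stand-in command (a Python process that does nothing) instead of EMBOSS.
elapsed = time_command([sys.executable, "-c", "pass"])
print(f"{elapsed:.3f} sec")
```

To time one of the programmes, the stand-in list would be replaced by the real command and its qualifiers. Unlike bash time, this measures only elapsed wall-clock time and does not separate user and system CPU time.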
Wordcount
Input: sequence(s) filename.
Output: tab-delimited file with two columns: word and count, ranked by count (download the example here).
Qualifiers: see table 3.
Execution time: 1 min 7 sec.
Qualifier | Description |
---|---|
Standard | |
-sequence | Input sequence(s) filename |
-wordsize | Integer length of words |
-outfile | Output filename |
Optional | |
-mincount | Minimum wordcount to report |
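The qualifiers in table 3 could be assembled into a command line as sketched below. The helper name wordcount_command and the file names are hypothetical. The actual call is left commented out because it requires EMBOSS to be installed.

```python
import shlex


def wordcount_command(sequence, wordsize, outfile, mincount=None):
    """Build the argument list for a wordcount run from the table 3 qualifiers."""
    args = [
        "wordcount",
        "-sequence", sequence,
        "-wordsize", str(wordsize),
        "-outfile", outfile,
    ]
    if mincount is not None:
        args += ["-mincount", str(mincount)]
    return args


# Hypothetical file names; word size 1 counts single residues.
cmd = wordcount_command("proteome.fasta", 1, "proteome.wordcount")
print(shlex.join(cmd))

# To actually run it (requires EMBOSS on the PATH):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list, not as a single shell string, avoids quoting problems with unusual file names.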
Compseq
Input: sequence(s) filename.
Output: tab-delimited file. Lines starting with # carry information about the expected frequencies (if no file is given for that purpose via the -infile qualifier, an even distribution is assumed), any qualifier-modified processing (e.g. see the -frame qualifier), the names of the input sequences, and the table headers. Lines without # give the word size, the total count, and a 5-column table: word, observed count, observed frequency, expected frequency and observed/expected ratio. The table is ranked by count. The programme also counts "other" words, e.g. those containing non-canonical residues (download the example here).
Qualifiers: see table 4.
Execution time: 0,236 sec.
Qualifier | Description |
---|---|
Standard | |
-sequence | Input sequence(s) filename |
-word | Integer length of words |
-outfile | Output filename |
Optional (not all) | |
-infile | Name of a file previously produced by compseq, used to set the expected word frequencies for the current run. The word size must be the same. |
-frame | If set to 0 (default), compseq counts all words by moving a window of "word" length one position at a time. If set to another integer, the window moves by the "word" length, starting from that position. |
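The effect of the -frame qualifier can be illustrated with a small Python sketch. It is not compseq's own implementation: it assumes frames are numbered from 1, so that frame 1 starts at the first residue, and the function name count_words is hypothetical.

```python
from collections import Counter


def count_words(sequence, word, frame=0):
    """Count words of length `word`.

    frame == 0: the window moves one position at a time (overlapping words).
    frame >= 1: the window moves by `word`, starting at position `frame`
                (1-based, an assumption of this sketch).
    """
    last_start = len(sequence) - word
    if frame == 0:
        starts = range(0, last_start + 1)
    else:
        starts = range(frame - 1, last_start + 1, word)
    return Counter(sequence[i:i + word] for i in starts)


seq = "ATGGCCATG"  # made-up nucleotide sequence
sliding = count_words(seq, 3)          # all overlapping 3-letter words
codons = count_words(seq, 3, frame=1)  # codons in the first reading frame

print(sliding)
print(codons)
```

With frame=0 the nine-letter sequence yields seven overlapping words, while frame=1 yields only the three codons ATG, GCC and ATG.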
Comparison
compseq is more flexible than wordcount: it has more qualifiers, can count codons in nucleotide sequences via the -frame qualifier, and can compare frequencies against a separate file via -infile. In the test above compseq was also much faster than wordcount (0,236 sec against 1 min 7 sec). However, the output of wordcount is simpler and more convenient to parse with scripts and spreadsheet software, although compseq output can be read back by compseq itself. For the task above I would rather use wordcount: its raw output is simpler to import into a spreadsheet through the import wizard, and calculating frequencies yourself in the spreadsheet or with a script is easier than coping with compseq's own tab-delimited markup.
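For readers who still prefer compseq, its output can be parsed with a short script. The Python sketch below relies only on the layout described in the Compseq section: comment lines start with #, word size and total count sit on two-field lines, and table rows have five tab-separated fields. The function name parse_compseq and the sample text are hypothetical, and the sample is not real compseq output.

```python
def parse_compseq(text):
    """Split compseq-style output into metadata and 5-column table rows."""
    meta, rows = {}, []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = [field.strip() for field in line.split("\t")]
        if len(fields) == 5:
            word, count, observed, expected, ratio = fields
            rows.append(
                (word, int(count), float(observed), float(expected), float(ratio))
            )
        elif len(fields) == 2:
            meta[fields[0]] = fields[1]  # e.g. word size, total count
    return meta, rows


# Made-up sample following the described layout.
sample = (
    "# Illustrative comment line\n"
    "Word size\t1\n"
    "Total count\t10\n"
    "# Word\tObs Count\tObs Frequency\tExp Frequency\tObs/Exp Frequency\n"
    "A\t6\t0.6\t0.05\t12.0\n"
    "L\t4\t0.4\t0.05\t8.0\n"
)

meta, rows = parse_compseq(sample)
print(meta)
for row in rows:
    print(row)
```

This takes more code than reading the two-column wordcount output, which is consistent with the preference stated above.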