
EMBOSS and proteome review

Last updated on 26 March 2017

EMBOSS is an open-source software package for sequence analysis in molecular biology. It provides a set of command-line tools for working with bioinformatics data. This work is a simple analysis of proteomes taken from UniProt, carried out with EMBOSS and the spreadsheet program LibreOffice Calc.

Proteomes

The proteomes studied are those of Halocynthiibacter arcticus and Escherichia coli strain K12. Basic information about them is given in Table 1.

Table 1. Basic proteome information
Attribute	H. arcticus	E. coli K12
Proteome ID	UP000070371	UP000000625
Number of sequences	3882	4306
Number of residues	1158560	1356195

The procedure is as follows: download the proteomes from UniProt in FASTA format and count each amino acid residue with the EMBOSS programme wordcount. The results are then processed in a spreadsheet to compare residue frequencies in the two proteomes. The outcome is shown in Table 2.
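The counting and comparison step can also be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the wordcount and spreadsheet route used in this work; the function names and the toy sequences are invented for the example.

```python
from collections import Counter


def read_fasta_sequences(text):
    """Yield each sequence in FASTA-formatted text as one string."""
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            # A header line closes the previous record, if there was one.
            if chunks:
                yield "".join(chunks)
            chunks = []
        elif line:
            chunks.append(line)
    if chunks:
        yield "".join(chunks)


def residue_percentages(fasta_text):
    """Return {residue: percent of all residues} for a whole proteome."""
    counts = Counter()
    for sequence in read_fasta_sequences(fasta_text):
        counts.update(sequence.upper())
    total = sum(counts.values())
    return {residue: 100.0 * n / total for residue, n in counts.items()}


def percentage_differences(first, second):
    """Difference in percentage points, first minus second, per residue."""
    residues = set(first) | set(second)
    return {r: first.get(r, 0.0) - second.get(r, 0.0) for r in residues}


# Toy data standing in for the two downloaded proteome files (invented).
toy_a = ">protein1\nAAGL\n>protein2\nALAG\n"
toy_b = ">protein1\nALLG\n>protein2\nLLQG\n"

freq_a = residue_percentages(toy_a)
freq_b = residue_percentages(toy_b)
diff = percentage_differences(freq_a, freq_b)

# Print residue, both percentages and the difference, most frequent first.
for residue in sorted(diff, key=lambda r: -freq_a.get(r, 0.0)):
    print(residue,
          round(freq_a.get(residue, 0.0), 2),
          round(freq_b.get(residue, 0.0), 2),
          round(diff[residue], 2))
```

With real data, the two toy strings would be replaced by the contents of the downloaded FASTA files.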

Table 2. Residue frequency in proteomes
Amino acid residue	H. arcticus, %	E. coli K12, %	Difference (H. arcticus − E. coli K12), %
A	11,08	9,51	1,57
L	9,97	10,67	-0,71
G	8,07	7,37	0,69
V	7,26	7,07	0,19
S	5,97	5,80	0,17
E	5,93	5,76	0,16
I	5,86	6,01	-0,16
R	5,84	5,51	0,33
D	5,75	5,15	0,60
T	5,75	5,40	0,35
P	4,69	4,43	0,26
F	4,05	3,89	0,16
K	3,99	4,41	-0,42
Q	3,31	4,44	-1,12
N	3,18	3,95	-0,76
M	2,76	2,82	-0,06
Y	2,33	2,85	-0,51
H	2,00	2,27	-0,26
W	1,31	1,53	-0,22
C	0,92	1,16	-0,24
U	0,00	0,00022	-0,00022

In both H. arcticus and E. coli K12 the three most frequent residues are alanine, leucine and glycine, and the three least frequent are histidine, tryptophan and cysteine (selenocysteine is not taken into account). The most frequent residue is leucine in E. coli K12 and alanine in H. arcticus. The largest difference in favour of E. coli K12 is for glutamine; the largest in favour of H. arcticus is for alanine.

The data in Table 2 suggest that the differences in frequency are not random. First, if residues were drawn at random with equal probability, each would account for 100 / 21 ≅ 4,76 percent, whereas the observed frequencies range from 0 to 11,08 percent. Second, for each residue except U I calculated the ratio of the difference in frequency to the mean frequency in the two bacteria and expressed it as a percentage (download the table here). These ratios average 10,4%, with a maximum of 29,01% and a minimum of 2,12%. Such numbers allow us to suspect that the differences are not random, which could be explained by an uneven distribution of residues across proteins.
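The two calculations in this paragraph can be reproduced roughly as follows. The sketch assumes the ratio uses the absolute value of the difference, and it takes the rounded percentages from Table 2 for three residues, so the results differ slightly from the figures quoted above, which were computed from unrounded values. The names are invented for the example.

```python
# Percentages copied from Table 2 for three residues (decimal commas
# replaced by points): residue -> (H. arcticus, E. coli K12).
table2_sample = {
    "A": (11.08, 9.51),
    "Q": (3.31, 4.44),
    "M": (2.76, 2.82),
}


def relative_difference_percent(first, second):
    """|first - second| divided by the mean of the two, in percent."""
    mean = (first + second) / 2.0
    return 100.0 * abs(first - second) / mean


# Expected share of each residue if all 21 letters were equally likely.
uniform_share = 100.0 / 21

ratios = {
    residue: relative_difference_percent(a, b)
    for residue, (a, b) in table2_sample.items()
}

for residue, ratio in ratios.items():
    print(residue, round(ratio, 2))
print("uniform share:", round(uniform_share, 2))
```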

User's guide

EMBOSS contains two programmes for counting "words" in sequences: wordcount and compseq. A novice user can easily be unsure which one to use. Basically, they are almost the same: the input is a file with one or more sequences, the output is a file with word counts, and the word size can be set. However, the devil is in the details. All the information below comes from the EMBOSS help resources. Execution time was measured with the bash time command while counting all words of length 2 in the H. arcticus proteome.
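A comparable timing test could be scripted in Python instead of bash. The sketch below is not the procedure used here: it only builds the two command lines from the standard qualifiers listed in Tables 3 and 4 and times them when EMBOSS and the input file are present. The file names are hypothetical.

```python
import os
import shutil
import subprocess
import sys
import time


def wordcount_command(fasta, wordsize, outfile):
    """Argument list for wordcount, using its standard qualifiers."""
    return ["wordcount", "-sequence", fasta,
            "-wordsize", str(wordsize), "-outfile", outfile]


def compseq_command(fasta, word, outfile):
    """Argument list for compseq, using its standard qualifiers."""
    return ["compseq", "-sequence", fasta,
            "-word", str(word), "-outfile", outfile]


def time_command(argv):
    """Run a command and return its wall-clock duration in seconds."""
    start = time.perf_counter()
    subprocess.run(argv, check=True)
    return time.perf_counter() - start


# Hypothetical file names; the proteome file is not part of this sketch.
commands = [
    wordcount_command("h_arcticus.fasta", 2, "h_arcticus.wordcount"),
    compseq_command("h_arcticus.fasta", 2, "h_arcticus.compseq"),
]

# Self-check of the timer that needs no EMBOSS installation.
elapsed = time_command([sys.executable, "-c", "pass"])
print(f"python no-op: {elapsed:.3f} s")

for argv in commands:
    # Only run when the programme is on PATH and the input file exists.
    if shutil.which(argv[0]) and os.path.exists(argv[2]):
        print(argv[0], f"{time_command(argv):.3f} s")
    else:
        print("skipped:", " ".join(argv))
```

Wall-clock time measured this way corresponds to the "real" figure reported by the bash time command, not to user or system CPU time.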

Wordcount

Input: sequence(s) filename.
Output: tab-delimited file with two columns: word and count, ranked by count (download the example here).
Qualifiers: see table 3.
Execution time: 1 min 7 sec.

Table 3. Qualifiers for wordcount.
Qualifier	Description
Standard
-sequence	Input sequence(s) filename
-wordsize	Integer length of words
-outfile	Output filename
Optional
-mincount	Minimum wordcount to report
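Because the wordcount output is just two tab-separated columns, a script can read it with very little code. The sketch below assumes exactly that layout (word, then count); the sample text is invented and is not real EMBOSS output.

```python
import csv
import io


def read_wordcount(handle):
    """Parse two tab-separated columns (word, count) into a dict."""
    counts = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or incomplete lines
        counts[row[0]] = int(row[1])
    return counts


def to_percent(counts):
    """Convert raw counts into percentages of the total."""
    total = sum(counts.values())
    return {word: 100.0 * n / total for word, n in counts.items()}


# Invented sample in the layout described above.
sample = "A\t6\nL\t3\nG\t1\n"

counts = read_wordcount(io.StringIO(sample))
percent = to_percent(counts)
print(percent)
```

For a real file, `io.StringIO(sample)` would be replaced by `open(path, newline="")`, where `path` is the name given to -outfile.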

Compseq

Input: sequence(s) filename.
Output: tab-delimited file. Lines starting with # hold comments: how the expected frequencies were obtained (an even distribution is assumed unless a file is supplied for that purpose; see the -infile qualifier), any qualifier-modified processing (e.g. see the -frame qualifier), the names of the input sequences, and the table headers. The remaining lines give the word size, the total count and a 5-column table: word, observed count, observed frequency, expected frequency and observed/expected ratio. The table is ranked by the count. The programme also reports "other" words, e.g. those containing non-canonical residues (download the example here).
Qualifiers: see table 4.
Execution time: 0,236 sec.
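Reading compseq output in a script takes a little more care than reading wordcount output. The sketch below follows the layout described above: comment lines are skipped, two summary lines are picked out, and the rest is read as 5-column rows. The labels "Word size" and "Total count" and the sample text are assumptions made for the example, not copied from a real run, so they should be checked against an actual output file.

```python
def read_compseq(lines):
    """Return (word_size, total_count, rows) from compseq-style lines."""
    word_size = None
    total = None
    rows = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # comments and blank lines carry no table data
        # Drop empty fields in case columns are separated by several tabs.
        fields = [f for f in line.split("\t") if f]
        if fields[0] == "Word size":        # assumed label
            word_size = int(fields[1])
        elif fields[0] == "Total count":    # assumed label
            total = int(fields[1])
        elif len(fields) == 5:
            word, count, obs, exp, ratio = fields
            rows.append((word, int(count), float(obs), float(exp), float(ratio)))
    return word_size, total, rows


# Invented sample in the described layout, not real EMBOSS output.
sample = (
    "# comment lines start with a hash\n"
    "Word size\t1\n"
    "Total count\t10\n"
    "# Word\tObs Count\tObs Frequency\tExp Frequency\tObs/Exp Frequency\n"
    "A\t6\t0.6\t0.5\t1.2\n"
    "L\t4\t0.4\t0.5\t0.8\n"
)

word_size, total, rows = read_compseq(sample.splitlines())
print(word_size, total)
for row in rows:
    print(row)
```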

Table 4. Qualifiers for compseq.
Qualifier	Description
Standard
-sequence	Input sequence(s) filename
-word	Integer length of words
-outfile	Output filename
Optional (selected)
-infile	Name of a file previously produced by compseq, used to set the expected word frequencies for the current run. The word size must be the same.
-frame	If set to 0 (default), compseq counts all words found by moving a window of "word" length one position at a time. If set to another integer, the window moves by the "word" length each time, starting from the position given by that integer.
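The effect of -frame is easier to see on a small example. The function below only imitates the behaviour described in Table 4 and is not compseq's own code; it assumes that frames are numbered from 1, so that frame 1 starts at the first position of the sequence. The sequence is invented.

```python
from collections import Counter


def count_words(sequence, word, frame=0):
    """Count words of length `word` in `sequence`.

    frame=0 slides the window one position at a time; any other value
    starts at that frame and moves the window by `word` each time.
    """
    last_start = len(sequence) - word + 1
    if frame == 0:
        starts = range(0, last_start)
    else:
        # Assumption: frames are numbered from 1, so frame 1 is index 0.
        starts = range(frame - 1, last_start, word)
    return Counter(sequence[i:i + word] for i in starts)


dna = "ATGGCCATG"  # invented nucleotide sequence

overlapping = count_words(dna, 3)        # every window of length 3
codons = count_words(dna, 3, frame=1)    # codons in the first frame
print(overlapping)
print(codons)
```

With word length 3 and a non-zero frame, the words counted are the codons of that reading frame, which is how compseq can be used for codon counts in nucleotide sequences.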

Comparison

compseq is more flexible than wordcount: it has more qualifiers, can count codons in nucleotide sequences via the -frame qualifier, and can compare frequencies against those stored in a separate file. In the test above compseq was also much faster than wordcount (0,236 sec against 1 min 7 sec). However, the output of wordcount is simpler and more convenient to parse with scripts and spreadsheet software, whereas compseq output is parsed most readily by compseq itself. For the task above I would rather use wordcount: its raw data are simpler to import into a spreadsheet through the import wizard, and calculating frequencies yourself in a spreadsheet or script is easier than coping with the results compseq provides in its own tab-delimited markup.