Proteomes and EMBOSS


Last updated: 28-03-2017.

Part one. Proteomes of Escherichia coli (strain K12) and Salmonella typhimurium (strain LT2 / SGSC1412 / ATCC 700720)

Basic information about the proteomes of Escherichia coli (strain K12) and Salmonella typhimurium (strain LT2 / SGSC1412 / ATCC 700720) is collected in Table 1, which was compiled from the UniProt database[0][1]. As the table shows, the two proteomes are similar in size: Salmonella typhimurium has about 5% more proteins and about 5% more amino acid residues than Escherichia coli. Table 2 gives the detailed amino acid composition of both proteomes. According to it, the most common amino acids in both Salmonella typhimurium and Escherichia coli are Leucine, Alanine and Glycine. Most likely this is because these amino acids play a "skeletal" role in almost every protein. The rarest of the 20 standard amino acids are Histidine, Tryptophan and Cysteine. I suggest they are last but not least: these amino acids primarily play a functional role and need not be present in large quantities. The largest differences in favor of Salmonella typhimurium are observed for Alanine (0.26 percentage points) and Arginine (0.14 percentage points). In favor of Escherichia coli, Glutamate (0.16 percentage points) and Asparagine (0.13 percentage points) stand out.

[Download excel file containing data related to Table 2]

Parameter | Value
Escherichia coli (strain K12) proteome ID | UP000000625
Salmonella typhimurium (strain LT2 / SGSC1412 / ATCC 700720) proteome ID | UP000001014
Number of proteins in the Escherichia coli (strain K12) proteome | 4306
Number of proteins in the Salmonella typhimurium (strain LT2 / SGSC1412 / ATCC 700720) proteome | 4533
Total number of amino acid residues in the Escherichia coli (strain K12) proteome | 1356195
Total number of amino acid residues in the Salmonella typhimurium (strain LT2 / SGSC1412 / ATCC 700720) proteome | 1421444

Table 1. Some information about proteomes.
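
The two counts in Table 1 (number of proteins and total number of residues) can be recomputed from a proteome FASTA file. The block below is a minimal sketch, assuming one header line starting with ">" per protein; the function name proteome_stats and the demo sequences are hypothetical and are not taken from the original work.

```python
from io import StringIO


def proteome_stats(handle):
    """Return (number_of_proteins, total_residues) for a FASTA stream."""
    n_proteins = 0
    n_residues = 0
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            n_proteins += 1  # every header line starts a new protein
        else:
            n_residues += len(line)  # sequence lines hold one letter per residue
    return n_proteins, n_residues


# Made-up two-protein example, not real proteome data.
demo = StringIO(">demo1\nMKLA\nGG\n>demo2\nMAV\n")
demo_stats = proteome_stats(demo)
print(demo_stats)
```

For a real file the same function can be called as proteome_stats(open("K12.fasta")), assuming the proteome was saved under that name.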

Residue (single-letter code) | Content in Salmonella typhimurium, % | Content in Escherichia coli, % | Difference (Salmonella typhimurium minus Escherichia coli), percentage points
L | 10.65 | 10.67 | -0.02
A | 9.77 | 9.51 | 0.26
G | 7.37 | 7.37 | 0.00
V | 7.02 | 7.07 | -0.05
I | 5.93 | 6.01 | -0.08
S | 5.82 | 5.80 | 0.02
R | 5.65 | 5.51 | 0.14
E | 5.60 | 5.76 | -0.16
T | 5.49 | 5.40 | 0.09
D | 5.21 | 5.15 | 0.06
P | 4.47 | 4.43 | 0.04
Q | 4.39 | 4.44 | -0.05
K | 4.31 | 4.41 | -0.10
F | 3.87 | 3.89 | -0.03
N | 3.81 | 3.95 | -0.13
Y | 2.88 | 2.85 | 0.03
M | 2.78 | 2.82 | -0.04
H | 2.29 | 2.27 | 0.03
W | 1.52 | 1.53 | -0.01
C | 1.16 | 1.16 | 0.00
U | 0.0002 | 0.0002 | 0.00

Table 2. Detailed information about amino acid residues in considered proteomes.
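
The kind of calculation behind Table 2 can be sketched as follows: count every residue letter in each proteome, convert the counts to percentages, and subtract one column from the other. This is only an illustration under assumptions; the original table was prepared in Excel, the function names are hypothetical, and the two demo sequences are made up.

```python
from collections import Counter
from io import StringIO


def residue_percentages(handle):
    """Return {residue: percentage of all residues} for a FASTA stream."""
    counts = Counter()
    for line in handle:
        line = line.strip()
        if line and not line.startswith(">"):
            counts.update(line.upper())  # count each letter of a sequence line
    total = sum(counts.values())
    if total == 0:
        return {}
    return {aa: 100.0 * n / total for aa, n in counts.items()}


def percentage_difference(first, second):
    """Return first minus second, in percentage points, for every residue."""
    residues = set(first) | set(second)
    return {aa: first.get(aa, 0.0) - second.get(aa, 0.0) for aa in residues}


# Made-up one-protein "proteomes", not real data.
salty = residue_percentages(StringIO(">demo_salty\nAALG\n"))
k12 = residue_percentages(StringIO(">demo_k12\nALGG\n"))
diff = percentage_difference(salty, k12)

# Print rows sorted from the largest positive to the largest negative difference.
for aa in sorted(diff, key=diff.get, reverse=True):
    print(f"{aa} {salty.get(aa, 0.0):.2f} {k12.get(aa, 0.0):.2f} {diff[aa]:+.2f}")
```

Differences computed from unrounded percentages can deviate by 0.01 from what the rounded columns suggest, which is a possible explanation for rows such as N and H in Table 2.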

Part two. Wordcount and compseq comparison

The information for this part was obtained using the help command and other standard tools on the Kodomo machine. Let's start with the most important parameter: run time. The measurements were carried out for two files, K12.fasta and SALTY.fasta, containing the proteomes of Escherichia coli (strain K12) and Salmonella typhimurium (strain LT2 / SGSC1412 / ATCC 700720) respectively, and are presented in Table 3. As can be seen, Compseq is roughly 7 to 9 times faster than Wordcount, so if a very large amount of data has to be processed, Compseq is preferable.

Wordcount is quite "laconic": it reports only the count of each letter in the proteome. It does have one unique option, however: -mincount sets the minimum word count to report (an integer, 1 or more).

Compseq reports the count of each letter in the proteome, calculates the frequency of each letter and compares it to the expected frequency. It also has a lot of interesting options: -ignorebz (boolean) ignores the rarely used codes B (Asparagine or Aspartic acid) and Z (Glutamine or Glutamic acid); -reverse (boolean) also counts words in the reverse complement of a nucleic acid sequence; -zerocount (boolean) controls whether words with a zero count are displayed, and leaving them out makes the output smaller. Not all options are listed here; the -help option gives more information.

Filename | Wordcount (run time) | Compseq (run time)
K12.fasta | 0m 0.116s | 0m 0.016s
SALTY.fasta | 0m 0.140s | 0m 0.016s

Table 3. Run time of wordcount and compseq.
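
The speed ratio can be checked directly from the values in Table 3, and a similar measurement can be repeated with a small timing helper. The sketch below is an assumption-laden illustration: the exact command lines used for the original measurements are not given, so the wordcount and compseq invocations (word size 1, output file names) are guesses, and the helper time_command is hypothetical.

```python
import subprocess
import sys
import time


def time_command(argv, repeats=5):
    """Run a command several times; return the best wall-clock time in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run(argv, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        best = min(best, time.perf_counter() - start)
    return best


# Assumed command lines (not taken from the original); they require EMBOSS
# to be installed, so they are only defined here, not executed.
WORDCOUNT = ["wordcount", "-sequence", "K12.fasta", "-wordsize", "1",
             "-outfile", "K12.wordcount"]
COMPSEQ = ["compseq", "-sequence", "K12.fasta", "-word", "1",
           "-outfile", "K12.composition"]

# Run times from Table 3, in seconds: (wordcount, compseq).
table3 = {"K12.fasta": (0.116, 0.016), "SALTY.fasta": (0.140, 0.016)}
ratios = {name: w / c for name, (w, c) in table3.items()}
for name, ratio in ratios.items():
    print(f"{name}: wordcount/compseq = {ratio:.2f}")

# Demonstrate the helper with a command that is always available.
elapsed = time_command([sys.executable, "-c", "pass"], repeats=2)
print(f"python -c pass: {elapsed:.3f} s")
```

On a machine with EMBOSS installed, time_command(WORDCOUNT) and time_command(COMPSEQ) would give comparable figures. Run times as short as 0.016 s are sensitive to system noise, which is why the helper keeps the best of several runs.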

References

[0] Escherichia coli (strain K12) proteome, UniProt.
[1] Salmonella typhimurium (strain LT2 / SGSC1412 / ATCC 700720) proteome, UniProt.

© Simon Galkin, 2016