
UniProt database

Last update on the 14th of March, 2017

This work is about studying the UniProt database of proteins with an example of mercuric reductase.

Information extracted from the protein record

The data in Table 1 are taken from the original UniProt record[1]. The record cross-references the KEGG and KO (KEGG Orthology) databases; the organism itself was discovered and sequenced by Korean scientists[2]. The protein is inferred from homology and is described as having two domains: an FAD/NAD-binding domain and a pyridine nucleotide-disulphide oxidoreductase dimerization domain. Its ligand is FAD, and the protein exhibits oxidoreductase activity (detoxification of mercury ions). No further information is available because the protein has not been studied experimentally. The record has also yet to be reviewed.

Table 1. Main information about my protein
Attribute          Value
UniProt ID         A0A126V644_9RHOB
UniProt AC         A0A126V644
RefSeq ID          WP_039003479.1
PDB ID             -
Length             477 AA
Molecular weight   49770 Da
Submitted name     Mercuric reductase

UniRef clusters

UniRef provides clustered sets of sequences from UniProtKB and UniParc in order to reduce the size of the database and speed up similarity searches[3].

UniRef100

UniRef100 contains all UniProtKB records and selected UniParc records. In UniRef100, all identical sequences and sub-fragments that are 11 or more amino acid residues long are placed into a single record.

The corresponding cluster (UniRef100_A0A126V644) contains only one sequence, the original one, which is typical for UniRef100.

UniRef90

UniRef90 is built by clustering UniRef100 records so that each cluster contains all sequences that have at least 90% identity to, and 80% overlap with, the longest sequence (the seed sequence) of the cluster.

The corresponding cluster (UniRef90_A0A126V644) also contains a single sequence. Thus, UniProtKB and UniParc contain no records closely similar to the one discussed. This may be due to poor sampling of related organisms or to a unique evolutionary pathway.

UniRef50

UniRef50 is built by clustering UniRef90 seed sequences: each member has at least 50% identity to, and 80% overlap with, the seed sequence of the cluster.
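The two sets of cut-offs can be compared with a minimal sketch. The function name `joins_cluster` and the example values are hypothetical; the thresholds are the ones quoted in this section, and the real clustering pipeline is certainly more involved than a pair of comparisons.

```python
# Cut-offs quoted in the text: (minimum identity, minimum overlap)
# relative to the seed (longest) sequence of the cluster.
UNIREF_THRESHOLDS = {
    "UniRef90": (0.90, 0.80),
    "UniRef50": (0.50, 0.80),
}


def joins_cluster(level, identity, overlap):
    """Return True if a sequence meets both cut-offs for the given level."""
    min_identity, min_overlap = UNIREF_THRESHOLDS[level]
    return identity >= min_identity and overlap >= min_overlap


# Hypothetical sequence: 62% identical to a seed, covering 85% of its length.
print(joins_cluster("UniRef90", 0.62, 0.85))  # False
print(joins_cluster("UniRef50", 0.62, 0.85))  # True
```

A sequence like this one would stay alone at the 90% level but could share a cluster at the 50% level, which is the situation observed for the protein discussed here.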

The corresponding cluster (UniRef50_M9RA81) contains 71 records: 55 from UniProtKB and 16 from UniParc. They represent different species, though not in every case. To look for a pattern in the species classification, I mapped the UniProtKB records and analyzed them in spreadsheet software. Unfortunately, the UniParc records could not be mapped properly, but this is not a big problem, since some of these records are derived from UniProtKB. The results are as follows: 50 of the 55 species belong to the class Alphaproteobacteria, 1 to the class Gammaproteobacteria, and 4 have an undetermined taxonomic position. Within Alphaproteobacteria, 46 belong to the order Rhodobacterales (42 in the family Rhodobacteraceae and 4 in the family Hyphomonadaceae), while the orders Parvularculales, Rhizobiales, Sphingomonadales and Rhodospirillales contain one record each. This is evidence that the Rhodobacteraceae sequences form a largely monophyletic group. The inclusion of other groups may be explained by several factors (in order of decreasing likelihood): relatedness within the order and class, gene transfer, insufficient study, or an algorithm fault.
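The same tally could be done in a few lines of Python instead of a spreadsheet. This is only a sketch: `tally_rank` is a hypothetical helper, the four lineages are made-up examples rather than the real export, and it assumes every lineage has already been aligned to the ranks phylum, class, order, family.

```python
from collections import Counter


def tally_rank(lineages, rank_index):
    """Count records per taxon at one position of the lineage."""
    counts = Counter()
    for lineage in lineages:
        # Lineages that stop above the requested rank are counted separately.
        taxon = lineage[rank_index] if len(lineage) > rank_index else "unassigned"
        counts[taxon] += 1
    return counts


# Hypothetical lineages (phylum, class, order, family), not the real data.
records = [
    ("Proteobacteria", "Alphaproteobacteria", "Rhodobacterales", "Rhodobacteraceae"),
    ("Proteobacteria", "Alphaproteobacteria", "Rhodobacterales", "Hyphomonadaceae"),
    ("Proteobacteria", "Alphaproteobacteria", "Rhizobiales"),
    ("Proteobacteria", "Gammaproteobacteria"),
]

class_counts = tally_rank(records, 1)
order_counts = tally_rank(records, 2)
print(class_counts.most_common())
print(order_counts.most_common())
```

Real lineage strings contain a variable number of intermediate ranks, so the alignment step is the part that needs the most care in practice.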

UniProt search queries

Name search

Query "name:mercuric name:reductase", link to the result is here. After processing the result with spreadsheet software, I came up with the following figures:

  • 4187 records in total;
  • 79 records are for unclassified organisms, metagenomes, etc.;
  • 3963 records belong to Bacteria;
  • 84 records are for Archaea;
  • 35 records are for Fungi;
  • 5 records are for Viridiplantae (4 for Ricinus communis and 1 for Medicago truncatula);
  • 21 records are for other Eukaryota, such as Haptophyceae, Stramenopiles, Ciliophora, Foraminifera and Choanoflagellida.

Query 'name:"mercuric reductase"', link. The Archaea count decreased by 1 and the Bacteria count by 201; 14 unclassified records were also lost. In total there are 3971 records, 216 fewer. Query "name:"mercuric reductase" NOT name:mercuric NOT name:reductase" returns no results. In other words, this type of protein is widespread in Bacteria and rare in the other domains, which may reflect the particular environmental and biochemical properties of Bacteria.
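The counts for the two queries can be cross-checked with a short calculation. The numbers are the ones reported above; the variable names are only illustrative.

```python
# Record counts reported for the query name:mercuric name:reductase.
loose_query = {
    "unclassified": 79,
    "Bacteria": 3963,
    "Archaea": 84,
    "Fungi": 35,
    "Viridiplantae": 5,
    "other Eukaryota": 21,
}
# Records lost when switching to the phrase query name:"mercuric reductase".
lost = {"Archaea": 1, "Bacteria": 201, "unclassified": 14}

total_loose = sum(loose_query.values())
total_phrase = total_loose - sum(lost.values())
bacteria_share = (loose_query["Bacteria"] - lost["Bacteria"]) / total_phrase

print(total_loose)   # 4187
print(total_phrase)  # 3971
print(f"{bacteria_share:.1%}")  # 94.7%
```

The group counts add up to the reported total, and Bacteria account for roughly 95% of the records returned by the phrase query.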

Investigating the bacterium

Search query "name:"mercuric reductase" organism:halocynthiibacter" returns only one record, the sequence discussed here. Query "name:mercuric organism:halocynthiibacter" adds one more record: the mercuric transport protein periplasmic component, A0A126V427_9RHOB.

Family search

Query "name:"mercuric reductase" taxonomy:rhodobacteraceae" returns 145 records, all with an annotation score of at most 3 out of 5. Thus, MerA from the family Rhodobacteraceae has not been well studied experimentally.

Phylum search

Query "name:"mercuric reductase" taxonomy:proteobacteria" returns 1988 MerA records from the phylum Proteobacteria, of which only 2 have evidence at protein level (MERA_PSEAI and MERA_ACIFR). So this protein has received little experimental attention in Proteobacteria.
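The queries in this section all follow the same pattern, which a small helper can make explicit. `build_query` is a hypothetical function written for this report; it only assembles query strings in the field:value style used above and does not contact UniProt. The field names are taken from the text and may differ on the current UniProt site.

```python
def build_query(name, phrase=True, **filters):
    """Compose a query string in the field:value style used in the text."""
    if phrase:
        # Phrase search: the words must occur together, in this order.
        parts = [f'name:"{name}"']
    else:
        # Term search: each word may occur anywhere in the name.
        parts = [f"name:{word}" for word in name.split()]
    parts += [f"{field}:{value}" for field, value in filters.items()]
    return " ".join(parts)


print(build_query("mercuric reductase", phrase=False))
print(build_query("mercuric reductase", organism="halocynthiibacter"))
print(build_query("mercuric reductase", taxonomy="rhodobacteraceae"))
print(build_query("mercuric reductase", taxonomy="proteobacteria"))
```

Keeping the name part fixed and changing only the taxonomic filter makes it easier to compare record counts across the organism, family and phylum levels.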

Lysozyme search

Table 2. Several search queries on lysozyme
Query                                  Number of records   Approximate number of species in group
name:lysozyme                          17921               -
name:lysozyme taxonomy:ciliophora      15                  ~3.5k[4]
name:lysozyme taxonomy:viridiplantae   17                  >350k[5]

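According to Table 2, the two groups have a similar number of lysozyme records (15 and 17), but Ciliophora has at least 100 times fewer species. Relative to the number of species, lysozyme is therefore much more common in Ciliophora than in Viridiplantae, which may be explained by differences in nutrition and habitat.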
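The comparison can be made explicit by dividing the record counts by the species counts from Table 2. This is a rough sketch: the species numbers are approximations, and the Viridiplantae figure is a lower bound, so the computed ratio is a lower bound as well.

```python
# (lysozyme records, approximate species count) from Table 2.
# ">350k" is a lower bound, so the Viridiplantae density is an upper bound.
groups = {
    "Ciliophora": (15, 3_500),
    "Viridiplantae": (17, 350_000),
}

density = {name: records / species for name, (records, species) in groups.items()}
for name, value in density.items():
    print(f"{name}: {value * 1000:.2f} records per 1000 species")

ratio = density["Ciliophora"] / density["Viridiplantae"]
print(f"ratio: about {ratio:.0f}x")  # ratio: about 88x
```

Record counts also depend on how intensively each group has been sequenced, so the ratio describes the database content rather than biology alone.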

Inhibitor interference

Query "name:trypsin" (13241 records) returns not only trypsin records but also other entries such as trypsin inhibitors. Query "name:trypsin name:inhibitor" (2962 records) returns only trypsin inhibitor records, a 22.37% share.
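The share quoted above follows directly from the two record counts; the variable names in this short check are only illustrative.

```python
trypsin_records = 13241    # query name:trypsin
inhibitor_records = 2962   # query name:trypsin name:inhibitor

share = inhibitor_records / trypsin_records
remaining = trypsin_records - inhibitor_records
print(f"{share:.2%} inhibitor records, {remaining} other records")
```

More than a fifth of the hits for the plain name query are inhibitors rather than trypsins, so the name field alone is a poor filter for the enzyme itself.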

RefSeq vs UniProt

The RefSeq record WP_039003479.1 and the UniProt record A0A126V644_9RHOB describe the same MerA protein from the same organism. What data can these files provide? The information common to both is the protein name, length, organism taxonomy and amino acid sequence. RefSeq additionally provides regions mapped onto the sequence (this set is wider than the set of domains and includes all registered meaningful "parts" of the protein), whereas UniProt additionally provides the article title, database references and domains. Both files also contain service information for internal use, comments and some keywords. In this sense, UniProt is a more universal database than RefSeq.

Historical research

This link provides information about previous versions of the file for this particular MerA protein. The current version is the 4th. The tool allows different versions to be compared. From the first to the second version, database references and keywords were added; from the second to the third, references to the Gene Ontology database and information on domains in the feature table were added; from the third to the fourth, references to the Kyoto Encyclopedia of Genes and Genomes (KEGG) and KEGG Orthology databases were added and the feature table was edited. Some minor and service lines were also edited, whereas the sequence remained unchanged. The development took 4 months (from June to October). The current version is still unreviewed and is stored in TrEMBL.
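A comparison of this kind can be reproduced offline with Python's standard `difflib` module. The sketch below is not the UniProt tool itself: `added_lines` is a hypothetical helper, and the two version fragments use placeholder identifiers rather than the real lines of the record.

```python
import difflib


def added_lines(old_version, new_version):
    """Return the lines that appear in new_version but not in old_version."""
    diff = difflib.ndiff(old_version, new_version)
    # ndiff marks inserted lines with the two-character prefix "+ ".
    return [line[2:] for line in diff if line.startswith("+ ")]


# Toy flat-file fragments with placeholder identifiers (not the real record).
version_3 = [
    "DR   RefSeq; PLACEHOLDER_1.",
    "DR   GO; PLACEHOLDER_2.",
]
version_4 = [
    "DR   RefSeq; PLACEHOLDER_1.",
    "DR   KEGG; PLACEHOLDER_3.",
    "DR   KO; PLACEHOLDER_4.",
    "DR   GO; PLACEHOLDER_2.",
]

for line in added_lines(version_3, version_4):
    print(line)
```

With real files, the same function would list the cross-reference and feature lines introduced by each new version.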

Disulfide bonds in UniProt files

A special section of the feature table in a UniProt file provides information about disulfide bonds[6]. Some examples are presented below (Fig. 1-5).

Fig. 1. Q9U8R2, intrachain bonds, uncertainty about which cysteines form the bonds.
FT   DISULFID     33    123       Or C-33 with C-124.
FT   DISULFID     42     59       Or C-44 with C-59.
FT   DISULFID     44    141       Or C-42 with C-141.
FT   DISULFID    124    132       Or C-123 with C-132.
Fig. 2. P09478, intrachain bonds, indicating specific information, properties or function.
FT   DISULFID    149    163       {ECO:0000250}.
FT   DISULFID    222    223       Associated with receptor activation.
FT                                {ECO:0000250}.
Fig. 3. P83658, interchain bonds, antiparallel homodimer.
FT   DISULFID      6     29       {ECO:0000305}.
FT   DISULFID      7      7       Interchain (with C-12).
FT   DISULFID     12     12       Interchain (with C-7).
FT   DISULFID     20     26       {ECO:0000305}.
FT   DISULFID     25     50       {ECO:0000305}.
FT   DISULFID     38     57       {ECO:0000305}.
Fig. 4. P22029, interchain bonds, heterodimer.
FT   DISULFID     80     80       Interchain (with C-75 in beta chain).
FT   DISULFID    103    120
Fig. 5. Q96HE7, redox-active center.
FT   DISULFID    394    397       Redox-active. {ECO:0000244|PDB:3AHQ,
FT                                ECO:0000244|PDB:3AHR,
FT                                ECO:0000269|PubMed:20834232}.

The "syntax" is not complicated, but it is extensive. It is remarkable how such a system of rules has evolved.
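To show how regular these lines are, here is a minimal parsing sketch. It assumes the exact flat-file layout shown in Fig. 1-5 (feature key, two numeric positions, optional note, continuation lines indented under the note); other layouts, or positions that are not plain numbers, are not handled. The name `parse_disulfides` is hypothetical, and the sample lines are copied from Fig. 2 and Fig. 3 purely for illustration.

```python
import re

# A feature line: "FT", the key, two numeric positions and an optional note.
FEATURE = re.compile(r"^FT   (\S+)\s+(\d+)\s+(\d+)\s*(.*)$")
# A continuation line: "FT" followed by a long run of spaces, then more note text.
CONTINUATION = re.compile(r"^FT\s{10,}(\S.*)$")


def parse_disulfides(lines):
    """Collect DISULFID features as dicts with start, end, note, interchain."""
    bonds = []
    current = None  # the DISULFID feature currently being read, if any
    for line in lines:
        feature = FEATURE.match(line)
        continuation = CONTINUATION.match(line)
        if feature:
            key, start, end, note = feature.groups()
            current = None
            if key == "DISULFID":
                current = {"start": int(start), "end": int(end), "note": note}
                bonds.append(current)
        elif continuation and current is not None:
            current["note"] = (current["note"] + " " + continuation.group(1)).strip()
    for bond in bonds:
        bond["interchain"] = bond["note"].startswith("Interchain")
    return bonds


sample = [
    "FT   DISULFID    149    163       {ECO:0000250}.",
    "FT   DISULFID    222    223       Associated with receptor activation.",
    "FT                                {ECO:0000250}.",
    "FT   DISULFID      7      7       Interchain (with C-12).",
]
bonds = parse_disulfides(sample)
for bond in bonds:
    print(bond)
```

An interchain bond is recognisable both by its note and by the identical start and end positions, since only one cysteine of the pair lies in the described chain.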

References

  1. UniProt record;
  2. Yung Mi Lee et al, Complete genome sequence of Halocynthiibacter arcticus PAMC 20958T from an Arctic marine sediment sample, Journal of Biotechnology 224 (2016) 12–13, doi: 10.1016/j.jbiotec.2016.03.005;
  3. UniProt help article on UniRef;
  4. Wikipedia article on Ciliate;
  5. Wikipedia article on Viridiplantae;
  6. UniProt help article on disulfide bonds.