HMM profiles
Last update on the 18th of April, 2018Here I construct a HMM profile for TUDOR domain in Mammalians and test its predictive power.
File | Link |
---|---|
Seed alignment of Mammalian proteins | selected_mam.fasta |
HMM profile | profile_mam.hmm |
Spreadsheet | results.ods |
Profile construction
I chose to search TUDOR domain (PF00567) in Mammalians in SwissProt database. There are 79 TUDOR proteins in SwissProt 41 out of which
are Mammalian ones. Seed alignment for PF00567 was fetched from Pfam, then Mammalian sequences were filtered with filter-alignment.py
. Only 16 sequences were
filtered.
The HMM profile was built with hmm2build -g --amino
, then calibrated with hmm2calibrate
. Alignment and profile files can be accessed at
the table of downloads.
Search for domain
The search was done with hmm2search --domE 1000 --domT -50 -A 0
for gathering comparable amounts of TPs and FPs. The results were processed as described
in the task and put into spreadsheet. The statistics of findings is presented in table 1.
Property | HMM | BLAST |
---|---|---|
Real | 41 | 41 |
Found | 84 | 40 |
TP | 41 | 14 |
FP | 43 | 26 |
TN | 556969 | 556959 |
FN | 0 | 27 |
SwissProt size | 557012 | 557012 |
PPV | 0.488095 | 0.350000 |
NPV | 1.000000 | 0.999952 |
Sensitivity | 1.000000 | 0.341463 |
Specificity | 0.999923 | 0.999953 |
Accuracy | 0.999923 | 0.999905 |
TUDORs | 76 | 38 |
The HMM profile recalls all Mammalian TUDOR proteins along with the rest of them for other taxa. So it is necessary to choose a cut-off for score of findings (fig. 1). The ROC-curve (fig. 2)was built based on sampling cut-off from -60 to 210 with step 10.
Based on these charts the cut-off score of 70 was chosen as it gives the highest sensitivity with the lowest FDR in its neighbourhood area. The statistics was estimated for the list of found proteins: TP = 32/41, FN = 9/41, FP = 9/43, TN = 34/43, sensitivity = 0.78, specificity = 0.79. These values are quite satisfying. However, it seems impossible to develop an accurate enough predictor for such a tight evolutionary group of given domain as sequences are very similar among Metazoans. Another point is small size of sample group.
Comparison to BLAST
The protein TDRD7_MOUSE was queried against SwissProt with blastp (default settings). The results are in table 1. It found fewer proteins with twice as much FPs than TPs. The BLAST totally loses to HMM in low PPV and sensitivity. So it seems sane to use HMM for given task.