< 4th term

HMM profiles

Last update on the 18th of April, 2018

Here I construct a HMM profile for TUDOR domain in Mammalians and test its predictive power.

Table of downloads
File Link
Seed alignment of Mammalian proteins selected_mam.fasta
HMM profile profile_mam.hmm
Spreadsheet results.ods

Profile construction

I chose to search TUDOR domain (PF00567) in Mammalians in SwissProt database. There are 79 TUDOR proteins in SwissProt 41 out of which are Mammalian ones. Seed alignment for PF00567 was fetched from Pfam, then Mammalian sequences were filtered with filter-alignment.py. Only 16 sequences were filtered. The HMM profile was built with hmm2build -g --amino, then calibrated with hmm2calibrate. Alignment and profile files can be accessed at the table of downloads.

Search for domain

The search was done with hmm2search --domE 1000 --domT -50 -A 0 for gathering comparable amounts of TPs and FPs. The results were processed as described in the task and put into spreadsheet. The statistics of findings is presented in table 1.

Table 1. Comparative statistics of HMM and BLAST search.
Property HMM BLAST
Real 41 41
Found 84 40
TP 41 14
FP 43 26
TN 556969 556959
FN 0 27
SwissProt size 557012 557012
PPV 0.488095 0.350000
NPV 1.000000 0.999952
Sensitivity 1.000000 0.341463
Specificity 0.999923 0.999953
Accuracy 0.999923 0.999905
TUDORs 76 38

The HMM profile recalls all Mammalian TUDOR proteins along with the rest of them for other taxa. So it is necessary to choose a cut-off for score of findings (fig. 1). The ROC-curve (fig. 2)was built based on sampling cut-off from -60 to 210 with step 10.

Fig. 1. Histogram of finding scores.
There are several steps in the chart indicating plausible cut-offs.
Fig. 2. The ROC-curve.
The chosen step is marked with red dashed lines. It stands for cut-off score of 70.

Based on these charts the cut-off score of 70 was chosen as it gives the highest sensitivity with the lowest FDR in its neighbourhood area. The statistics was estimated for the list of found proteins: TP = 32/41, FN = 9/41, FP = 9/43, TN = 34/43, sensitivity = 0.78, specificity = 0.79. These values are quite satisfying. However, it seems impossible to develop an accurate enough predictor for such a tight evolutionary group of given domain as sequences are very similar among Metazoans. Another point is small size of sample group.

Comparison to BLAST

The protein TDRD7_MOUSE was queried against SwissProt with blastp (default settings). The results are in table 1. It found fewer proteins with twice as much FPs than TPs. The BLAST totally loses to HMM in low PPV and sensitivity. So it seems sane to use HMM for given task.