Protein motifs: PSI-BLAST and PROSITE
Last update on the 6th of April, 2018
We build a homologous family for a chosen protein with PSI-BLAST and refine the PROSITE pattern of a protein family for Proteobacteria.
PSI-BLAST
The chosen protein has UniProt accession O05886. It is a ribosome hibernation promotion factor (HPF) from Mycobacterium tuberculosis. This protein is required for the aggregation of 70S ribosomes into 100S ribosomes during stationary phase, which leads to translation inactivation. According to supfam and CPDF, this protein belongs to the ribosome binding protein Y family (YfiA, RaiA, SpotY), whose members are stress-response proteins that arrest translation by binding to the A-site.
PSI-BLAST was run with the default e-value inclusion threshold of 0.005. The results are presented in table 1.
Iteration | Hits above threshold | Worst hit above threshold | Worst hit e-value | Best hit below threshold | Best hit below threshold e-value |
---|---|---|---|---|---|
1 | 20 | P17161.1 | 0.003 | P17160.1 | 0.005 |
2 | 28 | P9WMA8.1 | 0.003 | B4L535.1 | 0.073 |
3 | 28 | P9WMA8.1 | 3.00E-19 | P33621.1 | 0.014 |
The hit list stabilized after the second iteration. In the final iteration the gap between the e-values of the borderline hits above and below the threshold (3.00E-19 versus 0.014) is large enough to suppose that the hits above the threshold form a homologous family. UniProt analysis showed that all hits are involved in translation regulation and ribosome binding. 24 of the 28 hits are HPF proteins from various bacteria, 2 are YfiA proteins (P0AD51, P71346), and the remaining 2 are the chloroplastic ribosome-binding factor PSRP1 (P19954), involved in light-dependent control of protein synthesis, and the dormancy-associated translation inhibitor (P9WMA8), induced in response to hypoxia and low levels of nitric oxide. Thus, most of the proteins are involved in translation inhibition during stationary phase or under adverse conditions.
PSI-BLAST standalone
To run PSI-BLAST from the command line, I downloaded the executables from the NCBI website, unpacked them and added their directory to the path with
export PATH=$PATH:$HOME/ncbi-blast-2.7.1+/bin
So no real installation is needed.
The program uses the common BLAST+ options such as
-query, -db, -remote, -outfmt
as well as PSI-BLAST-specific ones such as -num_iterations, -out_pssm, -inclusion_ethresh.
The first sets the number of iterations (the default is 1; 0 means iterate until convergence), the second names the file in which the checkpoint (PSSM) is stored, and the last
sets the e-value inclusion threshold (default is 0.002).
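The three PSI-BLAST-specific options can be combined in a local (non-remote) run. The following is only a sketch under stated assumptions: run_psiblast_local is a hypothetical wrapper name, the output file names are made up for illustration, and a locally installed protein database (for example one called swissprot) is assumed to exist.

```shell
# Hypothetical wrapper around a local psiblast run (not part of BLAST+).
# Assumes a locally installed protein database; output names are illustrative.
run_psiblast_local() {
    query=$1            # FASTA file with the query protein
    db=$2               # name of a locally installed BLAST protein database
    iterations=${3:-0}  # 0 = iterate until convergence
    # PSIBLAST may be overridden (e.g. PSIBLAST=echo) to preview the arguments
    "${PSIBLAST:-psiblast}" \
        -query "$query" \
        -db "$db" \
        -num_iterations "$iterations" \
        -inclusion_ethresh 0.005 \
        -out_pssm "${query%.*}.pssm" \
        -outfmt 7 \
        -out "${query%.*}_psiblast.txt"
}
```

Calling it as PSIBLAST=echo run_psiblast_local hpf_myctu.fasta swissprot prints the arguments instead of starting a search, which is a cheap way to check the command before running it.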
I executed
psiblast -remote -query hpf_myctu.fasta -db swissprot -inclusion_ethresh 0.005 -outfmt 7 -out psiblast_cli.txt
(-num_iterations is incompatible with -remote). Although the options matched those of the web run, the program found 31 hits, the worst of which had e-values greater than 1: -inclusion_ethresh only controls which hits enter the profile, while the reporting cutoff is set separately with -evalue. Adding the option -evalue 0.005 reduced the list to 20 hits, and setting it to 0.1 extended the list to 23 hits.
So the agreement between the web-based and the standalone (yet remotely executed) runs is not complete. The command-line run takes less user time and needs no clicks to start the next iteration, whereas the web interface is much more intuitive.
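The hit counts compared above can also be taken from a single tabular output file rather than by rerunning the search with different -evalue settings. The sketch below assumes the default -outfmt 7 column layout, where column 2 is the subject accession and column 11 is the e-value; count_hits is a hypothetical helper name.

```shell
# Hypothetical helper: count distinct subject accessions with e-value <= threshold
# in a psiblast -outfmt 7 file (default columns assumed: 2 = subject, 11 = e-value).
count_hits() {
    awk -F '\t' -v t="${2:-10}" '
        /^#/ || NF < 12 { next }                # skip comment and blank lines
        ($11 + 0) <= (t + 0) { seen[$2] = 1 }   # remember each subject once
        END { n = 0; for (s in seen) n++; print n }
    ' "$1"
}
```

For example, count_hits psiblast_cli.txt 0.005 prints the number of distinct hits at or below the 0.005 cutoff. If the file holds several iterations, the hits of all iterations are pooled, so for a per-iteration count the file would have to be split first.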
Pattern refinement
I chose the RL1 family to get its pattern. The sequence of RL1_PASMU was scanned with PROSITE. It yielded a ribosomal protein L1 signature (PS01199).
It is defined as follows: [IMGV]-x(2)-[LIVA]-x(2,3)-[LIVMY]-[GAS]-x(2)-[LMSF]-[GSNH]-[PTKR]-[KRAVG]-[GN]-x-[LIMF]-P-[DENSTKQPRAGVI]
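To make the pattern syntax easier to read, it can be translated into an extended regular expression and tried on a sequence with grep. This is a minimal sketch, not a replacement for the PROSITE scanner: prosite_to_ere and matches_pattern are hypothetical helper names, only the basic syntax elements are handled (x, repeat counts in parentheses, excluded residues in braces, the < and > anchors at the pattern ends), and the sequence is assumed to be a single uppercase line.

```shell
# Hypothetical helper: translate a PROSITE pattern into a POSIX extended regex.
prosite_to_ere() {
    printf '%s\n' "$1" | sed \
        -e 's/{\([^}]*\)}/[^\1]/g' \
        -e 's/x/./g' \
        -e 's/(/{/g' -e 's/)/}/g' \
        -e 's/-//g' \
        -e 's/^</^/' -e 's/>$/$/'
    # in order: {ABC} -> [^ABC], x -> any residue, (n) or (n,m) -> {n} or {n,m},
    # drop the "-" separators, map the < and > anchors to ^ and $
}

# Hypothetical helper: succeed if the one-line sequence contains a match.
matches_pattern() {
    printf '%s\n' "$2" | grep -Eq "$(prosite_to_ere "$1")"
}
```

For the PS01199 pattern the translation is [IMGV].{2}[LIVA].{2,3}[LIVMY][GAS].{2}[LMSF][GSNH][PTKR][KRAVG][GN].[LIMF]P[DENSTKQPRAGVI].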
The pattern looks unwieldy. To refine it for Proteobacteria, I went through several steps and produced 4 other patterns. First, I took the original alignment of 8 bacteria, refined the pattern according to the alignment (fig. 1; this pattern is named small), ran the PROSITE scan and collected the matches. Then I extended the small pattern to the right (named long) and did the same. The next pattern (named full) I obtained from the alignment of all 18 bacteria presented. The RL1 proteins of Proteobacteria that were not found by this pattern I added to the alignment, which produced the extended pattern. False positives (FPs) and false negatives (FNs) were obtained with
diff -u FILE_MATCH FILE_UNIPROT | grep -e "-[[:alnum:]]\{6\}"
("+" instead of "-" for FNs), where FILE_MATCH is the file with the accessions of the matches and FILE_UNIPROT is the file with the Proteobacteria RL1 accessions from UniProt. True positives (TPs) were counted arithmetically from these numbers. The results are presented in table 2.
Pattern | TP | FP | FN |
---|---|---|---|
original | 222 | 448 | 5 |
small | 210 | 0 | 217 |
long | 200 | 0 | 227 |
full | 245 | 0 | 182 |
extended | 354 | 79 | 73 |
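The counts in table 2 came from the diff and grep pipeline described before the table. As an alternative sketch, the same three numbers can be obtained in one pass with sort and comm, which does not depend on the order of the accessions in the files. count_tp_fp_fn is a hypothetical helper name, and both files are assumed to hold one accession per line.

```shell
# Hypothetical helper: classify accessions as TP / FP / FN with comm.
# Both input files are assumed to contain one accession per line.
count_tp_fp_fn() {
    match_file=$1      # accessions matched by the pattern
    uniprot_file=$2    # Proteobacteria RL1 accessions from UniProt
    tmp_match=$(mktemp) || return 1
    tmp_ref=$(mktemp) || return 1
    # comm needs sorted input; sort -u also removes duplicate accessions
    LC_ALL=C sort -u "$match_file" > "$tmp_match"
    LC_ALL=C sort -u "$uniprot_file" > "$tmp_ref"
    tp=$(LC_ALL=C comm -12 "$tmp_match" "$tmp_ref" | wc -l | tr -d ' ')  # in both files
    fp=$(LC_ALL=C comm -23 "$tmp_match" "$tmp_ref" | wc -l | tr -d ' ')  # matched only
    fn=$(LC_ALL=C comm -13 "$tmp_match" "$tmp_ref" | wc -l | tr -d ' ')  # UniProt only
    rm -f "$tmp_match" "$tmp_ref"
    echo "TP=$tp FP=$fp FN=$fn"
}
```

Calling count_tp_fp_fn FILE_MATCH FILE_UNIPROT prints a single line such as TP=... FP=... FN=... for one pattern.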
The original pattern yielded a huge number of false positives because it is defined for all RL1 proteins. The small pattern was quite specific, yet it missed about half of the proteins. The long pattern found slightly fewer proteins than the small one, as expected. The full pattern added a batch of new TPs. None of these three patterns yielded any FPs, which is quite good. The extended pattern yielded many more TPs but also many FPs. Since the extended pattern is the least complex pattern that describes all proteins in the extended dataset (so narrowing it further would reduce the number of TPs), it seems impossible to define a specific pattern for proteobacterial RL1 proteins. The best compromise lies somewhere between the full and extended patterns and could be obtained by extending the full dataset with some of the FNs of the full pattern.
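The trade-off discussed above can be put into two numbers per pattern: precision, TP / (TP + FP), and recall, TP / (TP + FN). These metrics are not used in the original analysis; the sketch below only restates the table 2 counts in that form, and precision_recall is a hypothetical helper name.

```shell
# Hypothetical helper: precision and recall from TP, FP and FN counts.
precision_recall() {
    awk -v tp="$1" -v fpos="$2" -v fneg="$3" 'BEGIN {
        # guard against division by zero when a pattern matches nothing
        if (tp + fpos == 0 || tp + fneg == 0) { print "undefined"; exit }
        printf "precision=%.3f recall=%.3f\n", tp / (tp + fpos), tp / (tp + fneg)
    }'
}
```

For the extended row, precision_recall 354 79 73 prints precision=0.818 recall=0.829, while the small row (210, 0, 217) gives precision=1.000 recall=0.492.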