
Motif search with MEME suite

Last update on the 29th of March, 2018

Putative transcription factor binding motifs were searched for and analyzed in the upstream regions of E. coli genes related to purine biosynthesis.

Table of downloads
File Link
Upstream sequences upstream.fasta
MEME report meme.html
TOMTOM report tomtom.txt
FIMO report, related python and R scripts, tables fimo_out.zip

Data description

upstream.fasta

The chosen organism is E. coli strain K12 (ECOLI in UniProt, U00096.3 in EMBL). UniProt contains 17 reviewed proteins related to purine biosynthesis, of which only 10 were taken for further processing (see table 1).

Table 1. Properties of selected proteins.
The upstream-100 sequences are actually 102 nt long because of an arithmetic mistake. This hardly matters for the present analysis, since 102 is not much greater than 100. Here and below these regions are called upstream-100.
Entry Entry name Protein names Gene names Gene coordinates Upstream 100 coordinates
P0ADG7 IMDH_ECOLI Inosine-5'-monophosphate dehydrogenase guaB complement(2632604..2634070) complement(2634071..2634172)
P04079 GUAA_ECOLI GMP synthase [glutamine-hydrolyzing] guaA complement(2630958..2632535) complement(2634536..2634637)
P0AB89 PUR8_ECOLI Adenylosuccinate lyase purB complement(1190616..1191986) complement(1191987..1192088)
P0ACP7 PURR_ECOLI HTH-type transcriptional repressor PurR purR 1737844..1738869 1737742..1737843
P15254 PUR4_ECOLI Phosphoribosylformylglycinamidine synthase purL complement(2691656..2695543) complement(2695544..2695645)
P0AG16 PUR1_ECOLI Amidophosphoribosyltransferase purF complement(2428721..2430238) complement(2430239..2430340)
P08179 PUR3_ECOLI Phosphoribosylglycinamide formyltransferase purN 2622234..2622872 2622132..2622233
P0A7D4 PURA_ECOLI Adenylosuccinate synthetase purA 4404687..4405985 4404585..4404686
P33221 PURT_ECOLI Formate-dependent phosphoribosylglycinamide formyltransferase purT 1930881..1932059 1930779..1930880
P37051 PURU_ECOLI Formyltetrahydrofolate deformylase purU complement(1287782..1288624) complement(1288625..1288726)
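
The coordinate arithmetic behind the last column of table 1 can be sketched as follows. This is a minimal illustration, not the script actually used; the function names `upstream_region` and `region_length` are hypothetical, and coordinates are assumed to be 1-based and inclusive, as in EMBL feature tables. It shows what an exact 100 nt window would be and why the regions in table 1 are 102 nt long.

```python
def upstream_region(gene_start, gene_end, strand, length=100):
    """Return (start, end) of the `length` bases immediately upstream of a gene.

    Coordinates are 1-based and inclusive.
    strand is "+" for a forward gene and "-" for a complement(...) gene.
    """
    if strand == "+":
        # Upstream lies just before the gene start.
        return gene_start - length, gene_start - 1
    if strand == "-":
        # For a complement gene, upstream lies just after the gene end.
        return gene_end + 1, gene_end + length
    raise ValueError("strand must be '+' or '-'")


def region_length(start, end):
    # Inclusive coordinates: both ends count.
    return end - start + 1


# purR (forward strand) and guaB (complement strand) from table 1.
purr_up = upstream_region(1737844, 1738869, "+")
guab_up = upstream_region(2632604, 2634070, "-")
print(purr_up, region_length(*purr_up))
print(guab_up, region_length(*guab_up))

# Length of the purR region actually listed in table 1.
print(region_length(1737742, 1737843))
```

With inclusive coordinates the length is end - start + 1, so a window that starts 102 positions before the gene, as in table 1, contains 102 bases rather than 100.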

First, the upstream-100 sequences were extracted from the EMBL file with descseq and written to genes.fasta. MEME was then run as follows: ememe -dataset genes.fasta -outdir result -nmotifs 3 -revcomp Y. The three motifs found are presented in table 2.

Table 2. Found motifs and their properties.
Number Logo E-value Occurrence
1 1.7E-2 10/10
2 9.7E+2 6/10
3 1.5E+3 10/10

The only plausible motif is the first one, with an E-value of 0.017 and occurrences in all the given sequences. Unlike the other two, it has high information content (many bits per position), and it is also the longest. The second motif has a somewhat lower E-value than the third (9.7E+2, see table 2) and is present in only six sequences. The third motif has the highest E-value and yet occurs in every sequence. This is consistent with the meaning of the E-value: the expected number of motifs with a log likelihood ratio at least as good in a random set of sequences with the same properties. A weak motif can therefore still be placed in each sequence, but only once per sequence.

Each occurrence of a motif also has a P-value: the probability that a random string has a match score against the position-specific scoring matrix equal to or higher than the observed one. The P-values of motif 1 differed significantly from those of the other motifs, both in the Kruskal-Wallis rank test (p-value = 0.039 against motif 2) and in Dunn's test for multiple comparisons (p-value = 0.00027 against motif 3). The distribution of P-values for all three motifs is shown in fig. 1.

Fig. 1. Box plots of the P-values of the three motifs' occurrences in each sequence.
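
The per-occurrence P-value described above can be illustrated with a toy example. The sketch below is not how MEME computes it internally: it assumes a uniform background (all four bases equally likely), enumerates every possible string by brute force, and uses an invented 3-column scoring matrix (`PSSM`) rather than anything from meme.txt. The function names are hypothetical.

```python
from itertools import product

# Toy log-odds PSSM for a motif of width 3: one dict per position.
# The scores are invented for illustration and are not taken from meme.txt.
PSSM = [
    {"A": 2, "C": -1, "G": -1, "T": -2},
    {"A": -2, "C": 2, "G": -1, "T": -1},
    {"A": -1, "C": -1, "G": 2, "T": -2},
]


def score(site, pssm=PSSM):
    """Sum the per-position scores of a site of the same width as the PSSM."""
    return sum(column[base] for column, base in zip(pssm, site))


def site_p_value(site, pssm=PSSM):
    """Fraction of all equally likely strings scoring at least as high as `site`."""
    threshold = score(site, pssm)
    width = len(pssm)
    hits = sum(
        1
        for candidate in product("ACGT", repeat=width)
        if score(candidate, pssm) >= threshold
    )
    return hits / 4 ** width


print(score("ACG"), site_p_value("ACG"))  # best possible site
print(score("ACC"), site_p_value("ACC"))  # one mismatch at the last position
```

Brute-force enumeration is only feasible for very short motifs; for a 21 nt motif such as motif 1 a dynamic-programming calculation over score distributions would be needed instead.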

Comparison with real motif

The E. coli PurR DNA-binding transcriptional repressor was found in RegulonDB[1]. It is reported to regulate 8 of the 10 genes under study; the remaining two are purT and purU. Among the occurrences of the motif, purT has the second highest P-value and purU the highest. Since even these P-values are still low, this suggests that purT, and possibly purU as well, can be regulated by the PurR repressor.

The reported motif looks quite similar to motif 1 (fig. 2). It is shorter (16 vs 21 nt) and differs in base scores at several positions.

Fig. 2. Comparison of found and reported motifs' logos.
a. Motif 1; b. PurR repressor motif. The offset is -5 for motif 1.

Querying the motifs

tomtom.txt

TOMTOM[2] is a tool from the MEME suite that compares motifs against a given database. The online version[3] takes the meme.txt file as input and reports all matches in the database together with their E-values and some other information. All three motifs were queried as a single file against Swiss Regulon DB for E. coli; the text output is available in the tomtom.txt file.

The three motifs matched 10, 11 and 19 reported motifs, respectively. For motif 1, the PurR_17_3 motif was found with the lowest E-value (2.3E-5). The remaining matches, including those for motifs 2 and 3, have high E-values. Motif 3 did yield one match with E-value = 0.05, but the reported motif is much longer (26 vs 10 nt). Furthermore, a 10-mer can occur frequently in the genome by chance, so this match is not a plausible finding. The high E-values of the other matches might be explained by the weakness of the queried motifs, the small size of the database (87 motifs) and the high specificity of the motifs.
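
The remark about a 10-mer occurring frequently by chance can be backed by a rough estimate. The sketch below assumes independent, equiprobable bases and a genome length of 4,641,652 bp (the commonly cited length of the U00096.3 sequence, not stated in this report); `expected_exact_matches` is a hypothetical helper name. It counts exact matches of one fixed word only, so a degenerate motif would match more often than this.

```python
GENOME_LENGTH = 4_641_652  # assumed length of E. coli K12 (U00096.3), in bp


def expected_exact_matches(word_length, genome_length=GENOME_LENGTH, both_strands=True):
    """Expected number of exact matches of one fixed word in a random genome.

    Assumes independent, equiprobable bases, which real genomes only approximate.
    """
    positions = genome_length - word_length + 1
    per_strand = positions / 4 ** word_length
    return 2 * per_strand if both_strands else per_strand


print(round(expected_exact_matches(10), 2))  # motif 3 width
print(expected_exact_matches(26))            # width of the matched database motif
```

Under these assumptions a fixed 10-mer is expected several times per genome on the two strands combined, whereas a fixed 26-mer is expected essentially never, which is why a match between the two carries little weight.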

Overall, the distributions of p-values are quite similar across the queried motifs (p-values in Dunn comparisons > 0.4), see fig. 3.

Fig. 3. Boxplot of p-values of found motifs with TOMTOM tool across queried motifs.

Genome-wide search of motifs

fimo_out.zip

To search for motifs 1-3 in the bacterial genome, FIMO was used instead of MAST because its output is easier to parse (the MAST output was loosely organized). Moreover, unlike MAST, FIMO searches for each motif individually and reports a separate p-value for each match. The genome file in FASTA format and the genomic annotation in GFF format were taken from the NCBI entry NC_000913.3 for E. coli strain K12.

FIMO was run with fimo meme.txt ecoli.fasta; the HTML output is in fimo.html. To determine which motifs occur in upstream-100 sequences, the fimo.gff output was intersected with the upstream-100 annotation. The annotation was obtained with bedtools flank -i sequence.gff3 -l 100 -s -g genome.gff -r 0 > flank.gff, and fimo.gff was corrected with a Python script so that start positions are less than end positions. The intersection was done with bedtools intersect -a fimo_corr.gff -b flank.gff -loj > inter.tab, then filtered with a Python script and reorganized into a readable report.
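
The coordinate correction step can be sketched as below. This is not the original script (which is in fimo_out.zip) but a minimal reconstruction of the idea, assuming standard nine-column tab-separated GFF with start in column 4 and end in column 5; the function names are hypothetical.

```python
def fix_gff_coordinates(line):
    """Return a GFF line whose start (column 4) is not greater than its end (column 5).

    Comment lines and lines with fewer than nine tab-separated fields
    are returned unchanged.
    """
    if line.startswith("#"):
        return line
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 9:
        return line
    start, end = int(fields[3]), int(fields[4])
    if start > end:
        # Swap so that bedtools sees a valid interval; the strand column is kept.
        fields[3], fields[4] = str(end), str(start)
    newline = "\n" if line.endswith("\n") else ""
    return "\t".join(fields) + newline


def fix_gff_file(src_path, dst_path):
    """Copy a GFF file, correcting reversed coordinates line by line."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            dst.write(fix_gff_coordinates(line))


# Example with a made-up minus-strand record whose coordinates are reversed.
record = "NC_000913.3\tfimo\tnucleotide_motif\t200\t180\t50.1\t-\t.\tName=1\n"
print(fix_gff_coordinates(record), end="")
```

A call such as `fix_gff_file("fimo.gff", "fimo_corr.gff")` would then produce the corrected file used in the intersection.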

The motifs found were marked as "purine" if the corresponding gene is regulated by PurR (RegulonDB) or its GO process contains the word "purine"; otherwise they were marked as "not purine". The two groups differ in the p-values of the motifs found (Wilcoxon test, p-value = 1.057e-07), see fig. 4.

Fig. 4. Boxplot of p-values of found motifs with FIMO tool across queried motifs divided by annotated involvement in purine biosynthesis.

As can be seen, there is at least one strong motif in the "not purine" group, which leaves room for annotating the corresponding gene as involved in purine synthesis.
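
The labelling rule behind the two groups in fig. 4 can be written down compactly. The sketch below is an illustration only: `label_gene` and its arguments are hypothetical, and it assumes the RegulonDB lookup has already been reduced to a boolean and the GO processes to a list of strings.

```python
def label_gene(regulated_by_purr, go_processes):
    """Return "purine" or "not purine" following the rule described above.

    regulated_by_purr: True if RegulonDB lists the gene as regulated by PurR.
    go_processes: iterable of GO biological process names for the gene.
    """
    mentions_purine = any("purine" in term.lower() for term in go_processes)
    return "purine" if regulated_by_purr or mentions_purine else "not purine"


print(label_gene(True, []))
print(label_gene(False, ["purine nucleotide biosynthetic process"]))
print(label_gene(False, ["cell division"]))
```

The match is a plain case-insensitive substring test, so any GO term that merely mentions purine puts the gene in the "purine" group.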

References

  1. RegulonDB record for purR repressor;
  2. Shobhit Gupta, JA Stamatoyannopoulos, Timothy Bailey and William Stafford Noble, "Quantifying similarity between motifs", Genome Biology, 8(2):R24, 2007;
  3. TOMTOM online tool.