
Motif search with MEME suite

Last updated on the 29th of March, 2018

Putative transcription factor binding motifs were found and analyzed in the upstream regions of E. coli genes related to purine biosynthesis.

Table of downloads
File	Link
Upstream sequences	upstream.fasta
MEME report	meme.html
TOMTOM report	tomtom.txt
FIMO report, related python and R scripts, tables	fimo_out.zip

Data description

upstream.fasta

The chosen organism is E. coli strain K12 (ECOLI in Uniprot and U00096.3 in EMBL). Uniprot contains 17 reviewed proteins related to purine biosynthesis. Only 10 of them were taken for further processing (see table 1).

Table 1. Properties of selected proteins.
The upstream-100 sequences are 102 nt long because of an arithmetic mistake. This matters little for the present purpose, since 102 is not much greater than 100. Here and below these regions are called upstream-100.
Entry	Entry name	Protein names	Gene names	Gene coordinates	Upstream 100 coordinates
P0ADG7	IMDH_ECOLI	Inosine-5'-monophosphate dehydrogenase	guaB	complement(2632604..2634070)	complement(2634071..2634172)
P04079	GUAA_ECOLI	GMP synthase [glutamine-hydrolyzing]	guaA	complement(2630958..2632535)	complement(2634536..2634637)
P0AB89	PUR8_ECOLI	Adenylosuccinate lyase	purB	complement(1190616..1191986)	complement(1191987..1192088)
P0ACP7	PURR_ECOLI	HTH-type transcriptional repressor PurR	purR	1737844..1738869	1737742..1737843
P15254	PUR4_ECOLI	Phosphoribosylformylglycinamidine synthase	purL	complement(2691656..2695543)	complement(2695544..2695645)
P0AG16	PUR1_ECOLI	Amidophosphoribosyltransferase	purF	complement(2428721..2430238)	complement(2430239..2430340)
P08179	PUR3_ECOLI	Phosphoribosylglycinamide formyltransferase	purN	2622234..2622872	2622132..2622233
P0A7D4	PURA_ECOLI	Adenylosuccinate synthetase	purA	4404687..4405985	4404585..4404686
P33221	PURT_ECOLI	Formate-dependent phosphoribosylglycinamide formyltransferase	purT	1930881..1932059	1930779..1930880
P37051	PURU_ECOLI	Formyltetrahydrofolate deformylase	purU	complement(1287782..1288624)	complement(1288625..1288726)
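
The coordinate arithmetic behind table 1 can be sketched as follows. This is only an illustration: the function names are hypothetical and the original extraction was done with an EMBOSS tool, not with this code. Coordinates are assumed to be 1-based and inclusive, as in the EMBL feature table. Requesting 100 nt gives a 100 nt region, while the intervals listed in table 1 correspond to a length of 102.

```python
def upstream_region(gene_start, gene_end, strand, length=100):
    """Return 1-based inclusive (start, end) of the region upstream of a gene.

    For a forward-strand gene the region ends right before gene_start;
    for a complement-strand gene it starts right after gene_end.
    """
    if strand == "+":
        return gene_start - length, gene_start - 1
    if strand == "-":
        return gene_end + 1, gene_end + length
    raise ValueError("strand must be '+' or '-'")


def region_length(region):
    """Length of a 1-based inclusive interval."""
    start, end = region
    return end - start + 1


# purR (forward strand) and purB (complement strand), gene coordinates from table 1
purR_exact = upstream_region(1737844, 1738869, "+")       # a true 100 nt region
purR_table = upstream_region(1737844, 1738869, "+", 102)  # interval as listed in table 1
purB_table = upstream_region(1190616, 1191986, "-", 102)  # interval as listed in table 1

print(purR_exact, region_length(purR_exact))
print(purR_table, region_length(purR_table))
print(purB_table, region_length(purB_table))
```

With an inclusive interval the length is end - start + 1, which is where off-by-one (or here off-by-two) slips typically come from.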

First, the upstream-100 sequences were extracted from the EMBL file with descseq and put into the file genes.fasta. Then the MEME program was run as follows: ememe -dataset genes.fasta -outdir result -nmotifs 3 -revcomp Y. The three motifs found are presented in table 2.

Table 2. Found motifs and their properties.
Number	E-value	Occurrence
1	1.7E-2	10/10
2	9.7E+2	6/10
3	1.5E+3	10/10

The only plausible motif is the first one, with an E-value of 0.017 and occurrences in all given sequences. Unlike the other two, it has a high information content (in bits), and it is also the longest. The second motif, whose E-value of 970 is somewhat lower than that of the third, is present in only six sequences. The third motif has the highest E-value and yet occurs in every sequence. This is consistent with the definition of the E-value: the expected number of motifs that are no worse than the given one in terms of log likelihood ratio in a set of sequences with the same properties. So it does occur in each sequence, but only once.

Each motif occurrence also has a P-value: the probability that a random string matches the position-specific scoring matrix with the same or a higher score. The P-values of motif 1 differ significantly from those of the other motifs, both in the Kruskal-Wallis rank test (p-value = 0.039 against motif 2) and in Dunn's test for multiple comparisons (p-value = 0.00027 against motif 3). The distributions of P-values for all 3 motifs are presented in fig. 1.
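
The definition of the occurrence P-value can be illustrated on a toy example. The sketch below assumes a uniform background and uses a made-up 3-column scoring matrix (it is not taken from the MEME report). It simply enumerates all possible strings, which is only feasible for very short motifs; the real tools use more efficient methods.

```python
from itertools import product

# Hypothetical 3-column log-odds matrix, for illustration only
PSSM = [
    {"A": 2, "C": -1, "G": -1, "T": -1},
    {"A": -1, "C": 2, "G": -1, "T": -1},
    {"A": -1, "C": -1, "G": 2, "T": 0},
]


def score(site, pssm=PSSM):
    """Sum of the per-position scores of a site against the matrix."""
    return sum(column[base] for column, base in zip(pssm, site))


def site_p_value(site, pssm=PSSM):
    """Fraction of all equally likely strings scoring at least as high as `site`."""
    observed = score(site, pssm)
    all_sites = list(product("ACGT", repeat=len(pssm)))
    hits = sum(1 for candidate in all_sites if score(candidate, pssm) >= observed)
    return hits / len(all_sites)


print(site_p_value("ACG"))  # best possible site: only 1 of 64 strings scores as high
print(site_p_value("ACT"))  # second best: 2 of 64 strings score as high
```

The lower the P-value, the less likely the match is to arise by chance, which is why the P-values of different motifs can be compared as in fig. 1.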

Comparison with real motif

The E. coli PurR DNA-binding transcriptional repressor was found in RegulonDB^[1]. It is reported to regulate 8 out of the 10 genes under survey. The remaining two are purT and purU. The former has the second highest motif p-value and the latter the highest. Both values are still low, which strongly suggests that purT can be regulated by the PurR repressor (and purU as well).

The reported motif looks quite similar to motif 1 (fig. 2). It is shorter (16 vs 21 nt) and differs in base scores at several positions.

Querying the motifs

tomtom.txt

TOMTOM^[2] is one of the MEME suite programs, used for comparing motifs against a given database. The online version^[3] takes a meme.txt file as input and reports all matches in the database together with their E-values and some other information. All 3 motifs were queried as a single file against Swiss Regulon DB for E. coli; the text output can be found in the tomtom.txt file.

The motifs yielded 10, 11 and 19 database motifs, respectively. For motif 1, the PurR_17_3 motif was found with the lowest E-value (2.3E-5). The remaining hits, including those for motifs 2 and 3, have high E-values. It should be mentioned that motif 3 yielded one hit with E-value = 0.05, but the reported motif is much longer (26 vs 10 nt). Furthermore, a 10-mer can occur frequently in the genome, so this finding is not plausible. The high E-values of the other hits might be explained by the weakness of the queried motifs, the small size of the database (87 motifs) and the high specificity of the motifs.

Overall, the distributions of p-values are quite similar across the queried motifs (p-values in Dunn comparisons > 0.4), see fig. 3.

Genome-wide search of motifs

fimo_out.zip

To search for motifs 1-3 in the bacterial genome, the FIMO program was used instead of MAST because its output is easier to parse (the MAST output was poorly organized). Moreover, unlike MAST, FIMO looks for individual motifs and yields an individual p-value for each match. The genome file in FASTA format and the genomic annotation in GFF format were taken from the NCBI entry NC_000913.3 for E. coli strain K12.

FIMO was run with fimo meme.txt ecoli.fasta; the formatted HTML output is in fimo.html. To determine which motifs occur in upstream-100 sequences, the fimo.gff output was intersected with the upstream-100 annotations. The annotation was obtained with bedtools flank -i sequence.gff3 -l 100 -s -g genome.gff -r 0 > flank.gff, and fimo.gff was corrected with a python script so that start positions are less than end positions. The intersection was done with bedtools intersect -a fimo_corr.gff -b flank.gff -loj > inter.tab, then filtered with a python script and reorganized into a readable report.
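
The coordinate correction of fimo.gff might look like the sketch below. The actual script is in fimo_out.zip and may differ; the function names here are hypothetical. The only assumption is the standard tab-separated GFF layout, with start in column 4 and end in column 5.

```python
def fix_gff_line(line):
    """Swap start (column 4) and end (column 5) of a GFF line if start > end."""
    # Comment, directive and empty lines are passed through unchanged
    if line.startswith("#") or not line.strip():
        return line
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 5:
        return line
    start, end = int(fields[3]), int(fields[4])
    if start > end:
        fields[3], fields[4] = str(end), str(start)
    return "\t".join(fields) + "\n"


def fix_gff(in_path, out_path):
    """Write a corrected copy of a GFF file, e.g. fimo.gff -> fimo_corr.gff."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            dst.write(fix_gff_line(line))


# Made-up record with start > end, as may happen for a reverse-strand match
example = "NC_000913.3\tfimo\tnucleotide_motif\t120\t100\t45.2\t-\t.\tName=1\n"
print(fix_gff_line(example), end="")
```

The strand column is left untouched, so the orientation of each match is still available after the swap. The correction is needed because bedtools expects start positions not to exceed end positions.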

The motifs found were marked as "purine" if the particular gene is regulated by purR (RegulonDB) or its GO process contains the word "purine"; otherwise they were marked as "not purine". These two groups differ in the p-values of the motifs obtained (Wilcoxon test, p-value = 1.057e-07), see fig. 4.

As can be seen, there is at least one strong motif in the "not purine" group, which leaves room for annotating the corresponding gene as involved in purine synthesis.
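
The "purine" / "not purine" labeling rule described above can be sketched as a small function. Everything in the example data is a placeholder: geneX and geneY and their GO terms are invented, the regulon set lists only two of the table 1 genes, and the case-insensitive match is an assumption. The real filtering script is in fimo_out.zip.

```python
def label_gene(gene, purr_regulon, go_processes):
    """Return "purine" if the gene is in the PurR regulon or any of its
    GO process terms contains the word "purine"; otherwise "not purine"."""
    if gene in purr_regulon:
        return "purine"
    terms = go_processes.get(gene, [])
    if any("purine" in term.lower() for term in terms):
        return "purine"
    return "not purine"


# Placeholder inputs for illustration only
purr_regulon = {"purB", "purF"}
go_processes = {
    "geneX": ["purine nucleotide biosynthetic process"],  # invented annotation
    "geneY": ["cell division"],                           # invented annotation
}

for gene in ["purB", "geneX", "geneY"]:
    print(gene, label_gene(gene, purr_regulon, go_processes))
```

A gene with no GO annotation at all falls into the "not purine" group under this rule, which is one way a real purine-related gene could end up there.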

References

[1] RegulonDB record for the PurR repressor;
[2] Shobhit Gupta, JA Stamatoyannopoulos, Timothy Bailey and William Stafford Noble, "Quantifying similarity between motifs", Genome Biology, 8(2):R24, 2007;
[3] TOMTOM online tool.