MEME


MEME - Motif discovery tool

MEME version 4.3.0 (Release date: Sat Sep 26 01:51:56 PDT 2009)

For further information on how to interpret these results or to get a copy of the MEME software please access http://meme.sdsc.edu.

This file may be used as input to the MAST algorithm for searching sequence databases for matches to groups of motifs. MAST is available for interactive use and downloading at http://meme.sdsc.edu.

REFERENCE

If you use this program in your research, please cite:

Timothy L. Bailey and Charles Elkan, "Fitting a mixture model by expectation maximization to discover motifs in biopolymers", Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology, pp. 28-36, AAAI Press, Menlo Park, California, 1994.

TRAINING SET

DATAFILE= meme/meme.fasta
ALPHABET= ACGT
Sequence name	Weight	Length
PURT_PSEPK	1.0000	101
PUR4_PSEPK	1.0000	101
PURA_PSEPK	1.0000	101
FOLD1_PSEPK	1.0000	101
FOLD2_PSEPK	1.0000	101
PUR2_PSEPK	1.0000	101
PUR9_PSEPK	1.0000	101
PUR5_PSEPK	1.0000	101
PUR7_PSEPK	1.0000	101

COMMAND LINE SUMMARY

This information can also be useful in the event you wish to report a
problem with the MEME software.

command: meme meme/meme.fasta -mod zoops -nmotifs 3 -prior dirichlet -revcomp -nostatus -dna -oc meme/

model:	mod=	zoops	nmotifs=	3	evt=	inf
object function= E-value of product of p-values
width:	minw=	8	maxw=	50	minic=	0.00
width:	wg=	11	ws=	1	endgaps=	yes
nsites:	minsites=	2	maxsites=	9	wnsites=	0.8
theta:	prob=	1	spmap=	uni	spfuzz=	0.5
em:	prior=	dirichlet	b=	0.01	maxiter=	50
	distance=	1e-05
data:	n=	909	N=	9
strands: + -
sample:	seed=	0	seqfrac=	1
Letter frequencies in dataset:
A 0.193 C 0.307 G 0.307 T 0.193
Background letter frequencies (from dataset with add-one prior applied): A 0.193 C 0.307 G 0.307 T 0.193

MOTIF 1 width = 42 sites = 9 llr = 170 E-value = 1.4e+002

SEQUENCE LOGO

PNG LOGOS require CONVERT from ImageMagick; see MEME installation guide

Information Content
28.3 (bits)

Relative Entropy
27.2 (bits)


NAME	STRAND	START	P-VALUE	SITES
PURT_PSEPK	+	30	1.07e-14	`GTTGATGGTG`	`TAGAATAGCGGTCCTTTTTCAGCGGGTGGCGCCTAGCCTGGC`	`GCGGCCCCGG`
FOLD2_PSEPK	+	10	3.09e-11	`TACTGCGTGA`	`TGCGAAAACGGCACCTTTGCAGGTGCCGTTTTTTTTGCCCGT`	`GACTAGTGGC`
PUR9_PSEPK	-	32	6.00e-11	`CGGTGGAGTC`	`AGCAAAAAAGGCGCCTCTGTTCAGGGAGTCGCCTTTTCTGGA`	`TGGGATTCTG`
FOLD1_PSEPK	-	3	1.88e-10	`CCTGCTTGTA`	`CGGGGCCGCTGCGACCTTGGCCCGGATTGCTTTTTGCCTGGA`	`ACC`
PUR4_PSEPK	-	46	5.90e-10	`CGGGACAGCC`	`TCAGGAAGGGGTGTGCTTAGAGGCCGTGCATTCTAGCCTAAT`	`TCGATGGCTT`
PUR7_PSEPK	+	35	1.20e-09	`GGTGGCGAAT`	`CAGGAACACGGGCTCGACTGGCGCGAAATCGCCTAGCCCAAC`	`ACCTCAAGCA`
PURA_PSEPK	+	37	2.13e-08	`CGGCCCTGAG`	`AAGGGCCGCGGCGTTTTCATTTGTGGGCATGTCTGTGCTGGC`	`CAATATCCAC`
PUR5_PSEPK	+	24	5.79e-08	`TGCAATGTAG`	`TGGTACTGCTGTCGGGCTCCGGCAGCAACCTGCAAGCCCTGA`	`TCGACAGCTG`
PUR2_PSEPK	+	17	5.79e-08	`AATACGATGG`	`TGGATTCACCGCCGCATTCGCGGGCAAGCCCGCTCCCACAGT`	`GTTCGGCGCA`

Motif 1 block diagrams

Name	Lowest p-value	Motifs
PURT_PSEPK	1.07e-14	
FOLD2_PSEPK	3.09e-11	
PUR9_PSEPK	6.00e-11	-1
FOLD1_PSEPK	1.88e-10	-1
PUR4_PSEPK	5.90e-10	-1
PUR7_PSEPK	1.20e-09	
PURA_PSEPK	2.13e-08	
PUR5_PSEPK	5.79e-08	
PUR2_PSEPK	5.79e-08	

SCALE
1	25	50	75	100


Motif 1 regular-expression

[TAC][GA][GC][GA][AG][ACT][AC][GA]C[GT]G[CT][CG][CTG][CGT][TCG][TC][TC][GACT][GCT][ACGT][GC][GC][GCT][GC][GAC][AT][GA][CTG][CT][GT][TCG][CT]T[AT][GT][CG]C[TC][GA][GA][ACT]

Time 0.53 secs.

MOTIF 2 width = 15 sites = 7 llr = 78 E-value = 4.4e+003

SEQUENCE LOGO

Information Content
16.1 (bits)

Relative Entropy
16.1 (bits)


NAME	STRAND	START	P-VALUE	SITES
PUR9_PSEPK	+	9	1.92e-08	`AGCAGTACG`	`ACTTGTTGTAAGCCA`	`GAATCCCATC`
PUR4_PSEPK	-	16	1.71e-07	`GGCTTTCGGC`	`ACCTGCTCTAAGCCA`	`CGCCTGCCGG`
PURT_PSEPK	-	83	1.21e-06	`TTC`	`AGGTCCTCGAAGGCA`	`TCCGGGGCCG`
PUR5_PSEPK	-	0	4.19e-06	`ACTACATTGC`	`AGGTCTTGCTCGGCA`
FOLD2_PSEPK	+	75	9.71e-06	`TTCGCGAGCG`	`AGCGGTTGCACGGCC`	`CCCGCGGTCG`
PURA_PSEPK	-	2	1.34e-05	`CGGCGTCAGA`	`TCAGCCTGGAAGGCA`	`GA`
FOLD1_PSEPK	-	77	2.19e-05	`GAATCTTGT`	`ACCTGTTAAACGCTG`	`GTCAGATCGG`

Motif 2 block diagrams

Name	Lowest p-value	Motifs
PUR9_PSEPK	1.92e-08	
PUR4_PSEPK	1.71e-07	-2
PURT_PSEPK	1.21e-06	-2
PUR5_PSEPK	4.19e-06	-2
FOLD2_PSEPK	9.71e-06	
PURA_PSEPK	1.34e-05	-2
FOLD1_PSEPK	2.19e-05	-2

SCALE
1	25	50	75	100


Motif 2 regular-expression

A[CG][CG][TG][GC][TC]T[GC][CGT]A[AC]G[GC]CA

Time 0.74 secs.

MOTIF 3 width = 8 sites = 2 llr = 22 E-value = 3.1e+004

SEQUENCE LOGO

Information Content
16.0 (bits)

Relative Entropy
15.6 (bits)


NAME	STRAND	START	P-VALUE	SITES
PUR5_PSEPK	+	82	1.96e-05	`GCTGCCAAGG`	`GCAGGACA`	`GCCCGGTGCG`
FOLD1_PSEPK	+	50	1.96e-05	`CCCCGTACAA`	`GCAGGACA`	`ACCGTCGCGC`

Motif 3 block diagrams

Name	Lowest p-value	Motifs
PUR5_PSEPK	1.96e-05	
FOLD1_PSEPK	1.96e-05	

SCALE
1	25	50	75	100


Motif 3 regular-expression

GCAGGACA

Time 0.90 secs.

SUMMARY OF MOTIFS

Combined block diagrams: non-overlapping sites with p-value < 0.0001

Name	Combined p-value	Motifs
PURT_PSEPK	1.41e-13	-2
PUR4_PSEPK	4.71e-10	-2, -1
PURA_PSEPK	3.32e-07	-2
FOLD1_PSEPK	1.40e-10	-1, -2
FOLD2_PSEPK	1.74e-09	
PUR2_PSEPK	3.34e-04	
PUR9_PSEPK	7.20e-12	-1
PUR5_PSEPK	6.16e-09	-2
PUR7_PSEPK	1.26e-05	

SCALE
1	25	50	75	100

Motif summary in machine readable format.

Stopped because nmotifs = 3 reached.

CPU: kodomo.fbb.msu.ru

EXPLANATION OF MEME RESULTS

The MEME results consist of:

The version of MEME and the date it was released.
The reference to cite if you use MEME in your research.
A description of the sequences you submitted (the "training set") showing the name, "weight" and length of each sequence.
The command line summary detailing the parameters with which you ran MEME.
Information on each of the motifs MEME discovered, including:
1. A summary line showing the width, number of occurrences, log likelihood ratio and statistical significance of the motif.
2. A sequence LOGO.
3. The information content of the motif.
4. The relative entropy of the motif.
5. Downloadable LOGO files suitable for publication.
6. The occurrences of the motif sorted by p-value and aligned with each other.
7. Block diagrams of the occurrences of the motif within each sequence in the training set.
8. The motif in BLOCKS or FASTA format.
9. A position-specific scoring matrix (PSSM) for use by the MAST database search program.
10. The position specific probability matrix (PSPM) describing the motif.
11. A regular expression describing the motif.
A summary of motifs showing an optimized (non-overlapping) tiling of all of the motifs onto each of the sequences in the training set.
The reason why MEME stopped and the name of the CPU on which it ran.
This explanation of how to interpret MEME results.

MOTIFS

For each motif that it discovers in the training set, MEME prints the following information:

Summary Line
This line gives the width ('width'), number of occurrences in the training set ('sites'), log likelihood ratio ('llr') and 'E-value' of the motif. Each motif describes a pattern of a fixed width; no gaps are allowed in MEME motifs. MEME numbers the motifs consecutively from one as it finds them, and it usually finds the most statistically significant (low E-value) motifs first.

The statistical significance of a motif is based on its log likelihood ratio, its width and number of occurrences, the background letter frequencies (given in the command line summary), and the size of the training set. The E-value is an estimate of the expected number of motifs with the given log likelihood ratio (or higher), and with the same width and number of occurrences, that one would find in a similarly sized set of random sequences. (In random sequences each position is independent, with letters chosen according to the background letter frequencies.)

The log likelihood ratio is the logarithm of the ratio of the probability of the occurrences of the motif given the motif model (likelihood given the motif) to their probability given the background model (likelihood given the null model). (Normally the background model is a 0-order Markov model using the background letter frequencies, but higher order Markov models may be specified via the -bfile option to MEME.)
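
As a rough illustration of the log likelihood ratio, the sketch below sums ln(p/f) over every letter of every site, where p is the observed column frequency and f is the background frequency from the command line summary. It uses the two Motif 3 sites listed above. The function name is hypothetical, and using plain observed frequencies (no prior) is a simplifying assumption, so this is not necessarily the exact computation MEME performs.

```python
import math

# Background letter frequencies copied from the command line summary.
BACKGROUND = {"A": 0.193, "C": 0.307, "G": 0.307, "T": 0.193}

def log_likelihood_ratio(sites, background):
    """Sum ln(p / f) over all sites and columns.

    p is the observed frequency of the letter in its column (assumption:
    no prior is applied); f is the 0-order background frequency.
    """
    width = len(sites[0])
    llr = 0.0
    for col in range(width):
        column = [site[col] for site in sites]
        for letter in column:
            p = column.count(letter) / len(column)
            llr += math.log(p / background[letter])
    return llr

# The two occurrences of Motif 3 (PUR5_PSEPK and FOLD1_PSEPK).
motif3_sites = ["GCAGGACA", "GCAGGACA"]
llr = log_likelihood_ratio(motif3_sites, BACKGROUND)
print(f"llr = {llr:.2f}")
```

Under these assumptions the result is close to the value llr = 22 reported in the Motif 3 summary line.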

Sequence LOGO

MEME motifs are represented by position-specific probability matrices that specify the probability of each possible letter appearing at each possible position in an occurrence of the motif. These are displayed as "sequence LOGOS", containing stacks of letters at each position in the motif. The total height of the stack is the "information content" of that position in the motif in bits. The height of the individual letters in a stack is the probability of the letter at that position multiplied by the total information content of the stack.

Note: The MEME LOGO differs from those produced by the Weblogo program because a small-sample correction is NOT applied. However, MEME LOGOs in PNG and encapsulated postscript (EPS) formats with small-sample correction (SSC) are available by clicking on one of the links named "With SSC" (EPS or PNG) under Download LOGO. The MEME LOGOs without small sample correction are similarly available. Error bars are included in the LOGOs with small-sample correction.

The information content of each motif position is computed as described in the paper by Schneider and Stephens, "Sequence Logos: A New Way to Display Consensus Sequences" but the small-sample correction, e(n), is set to zero for the LOGO displayed in the MEME output. The corrected information content of position i is given by

           R(i) for amino acids   = log2(20) - (H(i) + e(n))   (1a) 
           R(i) for nucleic acids =    2    - (H(i) + e(n))    (1b)

where H(i) is the entropy of position i,

           H(i) = - (Sum f(a,i) * log2[ f(a,i) ]).             (2)

Here, f(a,i) is the frequency of base or amino acid a at position i, and e(n) is the small-sample correction for an alignment of n letters. The height of letter a in column i is given by

           height = f(a,i) * R(i)                              (3)

The approximation for the small-sample correction, e(n), is given by:

           e(n) = (s-1) / (2 * ln(2) * n),                     (4)

where s is 4 for nucleotides, 20 for amino acids, and n is the number of sequences in the alignment.
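
Equations (1b) to (4) can be sketched in a few lines of Python. The function names and the example column below are hypothetical and chosen only for illustration; passing n=None sets e(n) to zero, which corresponds to the uncorrected LOGO displayed in the MEME output.

```python
import math

def column_entropy(freqs):
    """Equation (2): H(i) = -sum f(a,i) * log2 f(a,i), skipping zero frequencies."""
    return -sum(f * math.log2(f) for f in freqs.values() if f > 0)

def small_sample_correction(n, s=4):
    """Equation (4): e(n) = (s - 1) / (2 * ln(2) * n)."""
    return (s - 1) / (2 * math.log(2) * n)

def information_content(freqs, n=None, s=4):
    """Equation (1): R(i) = log2(s) - (H(i) + e(n)); e(n) = 0 when n is None."""
    e_n = small_sample_correction(n, s) if n is not None else 0.0
    return math.log2(s) - (column_entropy(freqs) + e_n)

def letter_heights(freqs, n=None, s=4):
    """Equation (3): height of each letter = f(a,i) * R(i)."""
    r = information_content(freqs, n, s)
    return {a: f * r for a, f in freqs.items()}

# Hypothetical DNA column in which C and G are equally likely.
column = {"A": 0.0, "C": 0.5, "G": 0.5, "T": 0.0}
print(information_content(column))        # uncorrected, as in the MEME LOGO
print(information_content(column, n=9))   # with small-sample correction for 9 sites
print(letter_heights(column))
```

A fully conserved DNA column has H(i) = 0 and therefore contributes 2 bits, which is consistent with the 16.0 bits reported for the 8-column Motif 3, whose two sites are identical.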

The letters in the logos are colored as follows. For DNA sequences, the letter categories contain one letter each. For proteins, the categories are based on the biochemical properties of the various amino acids. The categories and their colors are:

NUCLEIC ACIDS	COLOR
A	RED
C	BLUE
G	ORANGE
T	GREEN

AMINO ACIDS	COLOR	PROPERTIES
ACFILVWM	BLUE	Most hydrophobic [Kyte and Doolittle, 1982]
NQST	GREEN	Polar, non-charged, non-aliphatic residues
DE	MAGENTA	Acidic
KR	RED	Positively charged
H	PINK
G	ORANGE
P	YELLOW
Y	TURQUOISE

J. Kyte and R. Doolittle, 1982. "A Simple Method for Displaying the Hydropathic Character of a Protein", J. Mol Biol. 157, 105-132.

Note: the "text" output format of MEME preserves the historical MEME format where LOGOS are replaced by a simplified probability matrix, a relative entropy plot, and a multi-level consensus sequence.

Information Content
This is the information content of the motif in bits. It is equal to the sum of the uncorrected information content, R(), in the columns of the LOGO. This is equal to the relative entropy of the motif relative to a uniform background frequency model.
Relative Entropy
This is the relative entropy of the motif, computed in bits and relative to the background letter frequencies given in the command line summary. It is equal to the log-likelihood ratio (llr) divided by the product of the number of occurrences (sites) of the motif and ln(2):
```
re = llr / (sites * ln(2))
```
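As a quick check of this formula, the sketch below applies it to the llr and sites values printed in the three motif summary lines and prints the result next to the relative entropy reported for each motif. The function name is hypothetical.

```python
import math

def relative_entropy(llr, sites):
    """re = llr / (sites * ln(2)), in bits."""
    return llr / (sites * math.log(2))

# (llr, sites, reported relative entropy) taken from the motif sections above.
reported = {1: (170, 9, 27.2), 2: (78, 7, 16.1), 3: (22, 2, 15.6)}

for motif, (llr, sites, re_reported) in reported.items():
    re_computed = relative_entropy(llr, sites)
    print(f"motif {motif}: computed {re_computed:.1f} bits, reported {re_reported} bits")
```

The computed and reported values differ slightly, most visibly for Motif 3, presumably because the llr in the summary line is rounded to an integer.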
Occurrences of the Motif
MEME displays the occurrences (sites) of the motif in the training set. The sites are shown aligned with each other, and the ten sequence positions preceding and following each site are also shown. Each site is identified by the name of the sequence where it occurs, the strand (if both strands of DNA sequences are being used), and the position in the sequence where the site begins. When the DNA strand is specified, '+' means the sequence in the training set, and '-' means the reverse complement of the training set sequence. (For '-' strands, the 'start' position is actually the position on the positive strand where the site ends.) The sites are listed in order of increasing p-value, so the most statistically significant sites come first. The p-value of a site is computed from the match score of the site with the position-specific scoring matrix for the motif. The p-value gives the probability of a random string (generated from the background letter frequencies) having the same match score or higher. (This is referred to as the position p-value by the MAST algorithm.)
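The strand convention can be illustrated with the PUR9_PSEPK rows above. The sketch below rebuilds the first 34 letters of the '+' strand from the Motif 2 row, then checks that the flank printed after the '-' strand Motif 1 site (start 32) is the reverse complement of the ten '+' strand letters just before position 32. Treating 'start' as a 0-based offset is an inference from this particular output, not something the text states, and the helper name is hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string over the alphabet ACGT."""
    return seq.translate(COMPLEMENT)[::-1]

# '+' strand of PUR9_PSEPK, pieced together from the Motif 2 row:
# left flank + site (start 9, '+') + right flank.
plus_fragment = "AGCAGTACG" + "ACTTGTTGTAAGCCA" + "GAATCCCATC"

# Flank printed after the Motif 1 site of PUR9_PSEPK, which lies on the '-' strand.
minus_flank = "TGGGATTCTG"
start = 32  # Motif 1 'start' for PUR9_PSEPK (assumed to be a 0-based offset)

plus_view = reverse_complement(minus_flank)
print(plus_view)
print(plus_fragment[start - 10:start] == plus_view)
```

If the assumption holds, the flank that follows a '-' strand site in the table sits immediately to the left of 'start' on the '+' strand, which matches the remark that 'start' marks where the site ends when read along the '-' strand.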
Block Diagrams of Motif Occurrences
The occurrences of the motif in the training set sequences are shown with MAST-style block diagrams. One diagram is printed for each sequence showing all the occurrences of the motif in that sequence. The sequences are sorted by the lowest p-value among all occurrences of the motif in a given sequence. (The p-value of an occurrence is the probability of a single random subsequence the length of the motif, generated according to the 0-order background model, having a score at least as high as the score of the occurrence.) When the DNA strand is specified, '+' means the motif appears from left to right on the sequence, and '-' means the motif appears from right to left on the complementary strand. A sequence position scale is shown at the end of each table of block diagrams. Very long sequences are shown with thick lines connecting the motifs and are not drawn to scale.
Motif in BLOCKS format or FASTA format
For use with BLOCKS tools, MEME prints the occurrences of the motif in BLOCKS format.
You can convert these blocks to PSSMs (position-specific scoring matrices), LOGOS (color representations of the motifs) and phylogenetic trees, and search them against a database of other blocks, by pasting everything from the "BL" line to the "//" line (inclusive) into the Multiple Alignment Processor.
If you include the -print_fasta switch on the command line, MEME prints the motif sites in FASTA format instead of BLOCKS format.
Position-Specific Scoring Matrix
The position-specific scoring matrix corresponding to the motif is printed for use by database search programs such as MAST. This matrix is a log-odds matrix calculated by taking 100 times the log (base 2) of the ratio p/f at each position in the motif, where p is the probability of a particular letter at that position in the motif, and f is the background frequency of the letter (given in the command line summary section). This is the same matrix that is used above in computing the p-values of the occurrences of the motif in the Occurrences of the Motif and Block Diagrams of Motif Occurrences sections. The scoring matrix is printed "sideways": columns correspond to the letters in the alphabet (in the same order as shown in the simplified motif) and rows correspond to the positions of the motif, position one first. The scoring matrix is preceded by a line starting with "log-odds matrix:" and containing the length of the alphabet, width of the motif, number of characters in the training set, the scoring threshold (obsolete) and the motif E-value.
Note: The probability p used to compute the PSSM is not exactly the same as the corresponding value in the Position Specific Probability Matrix (PSPM). The values of p used to compute the PSSM take into account the motif prior, whereas the values in the PSPM are just the observed frequencies of letters in the motif sites.
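The log-odds calculation can be sketched as follows for a single motif position. Because the exact motif prior is not given here, the sketch substitutes a simple background-weighted pseudocount; that smoothing step, its default strength, and both function names are assumptions made for illustration and should not be read as MEME's actual prior.

```python
import math

# Background letter frequencies copied from the command line summary.
BACKGROUND = {"A": 0.193, "C": 0.307, "G": 0.307, "T": 0.193}

def smooth(counts, background, pseudocount=0.01):
    """Hypothetical stand-in for the motif prior.

    Adds pseudocount * f to each letter count so that no probability is zero.
    """
    total = sum(counts.values()) + pseudocount
    return {a: (counts.get(a, 0) + pseudocount * background[a]) / total
            for a in background}

def log_odds_row(probs, background):
    """One PSSM row: 100 * log2(p / f), rounded to an integer, for each letter."""
    return {a: round(100 * math.log2(probs[a] / background[a]))
            for a in background}

# Letter counts in the first column of the seven Motif 2 sites (six A, one T).
counts = {"A": 6, "T": 1}
row = log_odds_row(smooth(counts, BACKGROUND), BACKGROUND)
print(row)
```

Letters that are enriched relative to the background receive positive scores and depleted letters receive negative scores; a letter whose motif probability equals its background frequency scores zero.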
Position-Specific Probability Matrix
The motif itself is a position-specific probability matrix giving, for each position in the pattern, the observed frequency ("probability") of each possible letter. The probability matrix is printed "sideways": columns correspond to the letters in the alphabet (in the same order as shown in the simplified motif) and rows correspond to the positions of the motif, position one first. The motif is preceded by a line starting with "letter-probability matrix:" and containing the length of the alphabet, width of the motif, number of occurrences of the motif, and the E-value of the motif.
Note: Earlier versions of MEME gave the posterior probabilities--the probability after applying a prior on letter frequencies--rather than the observed frequencies. These versions of MEME also gave the number of possible positions for the motif rather than the actual number of occurrences. The output from these earlier versions of MEME can be distinguished by "n=" rather than "nsites=" in the line preceding the matrix.
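Since the matrix holds observed letter frequencies, it can be rebuilt from the aligned sites. The sketch below does this for the seven Motif 2 sites listed above and prints the rows in the "sideways" layout (one row per motif position, columns in the order A, C, G, T). The function name and the six-decimal print format are illustrative choices.

```python
ALPHABET = "ACGT"

# The seven occurrences of Motif 2, copied from the sites table.
MOTIF2_SITES = [
    "ACTTGTTGTAAGCCA",
    "ACCTGCTCTAAGCCA",
    "AGGTCCTCGAAGGCA",
    "AGGTCTTGCTCGGCA",
    "AGCGGTTGCACGGCC",
    "TCAGCCTGGAAGGCA",
    "ACCTGTTAAACGCTG",
]

def letter_probability_matrix(sites, alphabet=ALPHABET):
    """Return one row per motif position with the observed frequency of each letter."""
    matrix = []
    for column in zip(*sites):  # iterate over aligned columns
        matrix.append([column.count(a) / len(sites) for a in alphabet])
    return matrix

pspm = letter_probability_matrix(MOTIF2_SITES)
for row in pspm:
    print("  ".join(f"{p:.6f}" for p in row))
```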
Regular Expression
This is a regular expression (RE) describing the motif. In each column, all letters with observed frequencies greater than 0.2 are shown; less-frequent letters are not included in the RE. MEME regular expressions are interpreted as follows: single letters match that letter; groups of letters in square brackets match any of the letters in the group. Regular expressions can be used for searching for the motif in sequences (using, for example, PatMatch), but the search accuracy will usually be better with the PSSM (using, for example, MAST).
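The rule above can be sketched as follows, again using the seven Motif 2 sites. Ordering the letters inside a bracket by decreasing frequency, with ties broken alphabetically, is an assumption that happens to reproduce the Motif 2 expression printed earlier; the function name is hypothetical.

```python
import re
from collections import Counter

# The seven occurrences of Motif 2, copied from the sites table.
MOTIF2_SITES = [
    "ACTTGTTGTAAGCCA",
    "ACCTGCTCTAAGCCA",
    "AGGTCCTCGAAGGCA",
    "AGGTCTTGCTCGGCA",
    "AGCGGTTGCACGGCC",
    "TCAGCCTGGAAGGCA",
    "ACCTGTTAAACGCTG",
]

def motif_regex(sites, threshold=0.2):
    """Keep, per column, every letter whose observed frequency exceeds threshold."""
    parts = []
    for column in zip(*sites):
        counts = Counter(column)
        kept = [a for a, c in counts.items() if c / len(sites) > threshold]
        # Assumed ordering: most frequent first, ties in alphabetical order.
        kept.sort(key=lambda a: (-counts[a], a))
        # For DNA at least one letter always exceeds 0.2, so kept is never empty.
        parts.append(kept[0] if len(kept) == 1 else "[" + "".join(kept) + "]")
    return "".join(parts)

pattern = motif_regex(MOTIF2_SITES)
print(pattern)

# Count how many of the training sites the expression itself matches.
matches = [bool(re.fullmatch(pattern, site)) for site in MOTIF2_SITES]
print(sum(matches), "of", len(MOTIF2_SITES), "sites match")
```

Not every training site matches the expression, because letters at or below the 0.2 threshold are dropped; this is one reason searches with the PSSM are usually more accurate.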
Motif Summary Tiling
The motif summary tiling is done using the same algorithm as used by MAST. The motif occurrences shown in the motif summary may not be exactly the same as those reported in each motif section, because only occurrences with a position p-value less than 0.0001 that do not overlap other, more significant motif occurrences are shown. The format of the machine-readable motif summary is:
```
[sequence_name combined_p-value number_of_motif_occurrences [motif_number start_of_motif position_p-value]+]+
```
See the documentation for MAST output for the definition of position and combined p-values.
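
A minimal parser for this format might look like the sketch below. It assumes the fields are separated by whitespace and that a signed motif number indicates the strand. The class names, the function name and the sample text are all hypothetical, since the machine-readable summary itself is not reproduced in this file.

```python
from dataclasses import dataclass, field

@dataclass
class Occurrence:
    motif: int        # signed motif number (sign assumed to encode the strand)
    start: int        # start of the motif in the sequence
    p_value: float    # position p-value

@dataclass
class SequenceSummary:
    name: str
    combined_p_value: float
    occurrences: list = field(default_factory=list)

def parse_motif_summary(text):
    """Parse whitespace-separated records of the documented summary format."""
    tokens = text.split()
    records = []
    i = 0
    while i < len(tokens):
        name = tokens[i]
        combined = float(tokens[i + 1])
        count = int(tokens[i + 2])
        i += 3
        record = SequenceSummary(name, combined)
        for _ in range(count):
            record.occurrences.append(
                Occurrence(int(tokens[i]), int(tokens[i + 1]), float(tokens[i + 2]))
            )
            i += 3
        records.append(record)
    return records

# Hypothetical sample text, invented only to exercise the parser.
sample = """
SEQ_A 1.0e-10 2 1 30 1.0e-12 -2 83 1.0e-06
SEQ_B 2.0e-04 0
"""
for rec in parse_motif_summary(sample):
    print(rec.name, rec.combined_p_value, len(rec.occurrences))
```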