Signals. Motifs. PWM.

		
Task 1: To search for motifs, we were asked to find in the UniProt database 8-10 genes of one bacterium that are involved in purine biosynthesis (keyword search: "Purine biosynthesis"). Below are the data on the proteins found in Shigella dysenteriae Sd197 (UniProt mnemonic: SHIDS).

		
Entry | Entry name | Protein names | Gene names | Coordinates
Q32HB3 | PURT_SHIDS | Formate-dependent phosphoribosylglycinamide formyltransferase | purT SDY_1135 | 1053879..1055057
Q32D15 | PUR4_SHIDS | Phosphoribosylformylglycinamidine synthase | purL SDY_2747 | complement(2545533..2549420)
Q328F6 | PURA_SHIDS | Adenylosuccinate synthetase | purA SDY_4412 | complement(4127842..4129140)
Q32JK7 | FOLD_SHIDS | Bifunctional protein FolD | folD SDY_0281 | complement(274853..275719)
Q32D55 | GUAA_SHIDS | GMP synthase [glutamine-hydrolyzing] | guaA SDY_2703 | complement(2494237..2495814)
Q32AH9 | PUR9_SHIDS | Bifunctional purine biosynthesis protein PurH | purH SDY_3720 | 3450399..3451988
Q32FB3 | PURR_SHIDS | HTH-type transcriptional repressor PurR | purR SDY_1884 | 1718130..1719155
Q32D69 | PUR5_SHIDS | Phosphoribosylformylglycinamidine cyclo-ligase | purM SDY_2688 | 2480836..2481873
Q32D90 | PUR7_SHIDS | Phosphoribosylaminoimidazole-succinocarboxamide synthase | purC SDY_2664 | complement(2456494..2457207)

		
Shigella dysenteriae Sd197, complete genome.

Using the coordinates of the coding sequences in the genome, the coordinates of the 100 nucleotides preceding each gene (immediately 5' of its start codon) were determined. These nucleotides make up the upstream regions of the genes. The upstream regions were then gathered into a single file, upstream.fasta, using the descseq command.
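The coordinate arithmetic for this step can be sketched as follows. This is only an illustration, not the procedure actually used in the report: the function names are hypothetical, coordinates are taken as 1-based and inclusive as in the table, and the circularity of the genome is ignored. The point it shows is that for a gene annotated as complement(...) the start codon lies at the larger coordinate, so the upstream region follows the gene on the forward strand and has to be reverse-complemented.

```python
def upstream_interval(start, end, strand, length=100):
    """Return 1-based inclusive (lo, hi) of the region upstream of a gene.

    start/end are the annotated coordinates (start <= end);
    strand is "+" for a plain location and "-" for complement(...).
    """
    if strand == "+":
        # Start codon begins at `start`; upstream lies just before it.
        return start - length, start - 1
    # On the complementary strand the start codon is at `end`,
    # so upstream lies just after it on the forward strand.
    return end + 1, end + length


def reverse_complement(seq):
    """Reverse complement of a DNA string (A, C, G, T only)."""
    table = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(table)[::-1]


def upstream_sequence(genome, start, end, strand, length=100):
    """Cut the upstream region out of a genome string, read 5'->3' toward the gene."""
    lo, hi = upstream_interval(start, end, strand, length)
    region = genome[lo - 1:hi]  # convert 1-based inclusive to a Python slice
    return region if strand == "+" else reverse_complement(region)


# Coordinates from the table: purT (forward) and purL (complement).
print(upstream_interval(1053879, 1055057, "+"))
print(upstream_interval(2545533, 2549420, "-"))
```

For purT this gives positions 1053779..1053878, and for purL positions 2549421..2549520, to be read on the complementary strand.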

The MEME program was then run to search for motifs in these sequences. Command line: ememe upstream.fasta -nmotifs 3 -revcomp

Full result of the MEME program.

For each motif MEME reports its information content, a numerical parameter that helps to judge whether the motif is a real signal or a chance find. As a rough estimate, a motif with information content I corresponds to a fully specified word of length n = I / 2 (each fully conserved position contributes 2 bits), and such a word is expected to occur by chance about once per 4 ^ n nucleotide pairs of the genome. The E-value estimates how many motifs of the same or greater weight would be expected by chance in random sequences, so the smaller the E-value, the more likely the motif is real. LOGO diagrams are also shown: the height of each column is the information content of that position, that is, its contribution to the total I of the motif, and the height of each letter equals the information content of the column multiplied by the probability of finding that letter at that position. The LOGO therefore displays the most probable sequence of the motif.
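The column and letter heights of a LOGO can be illustrated with a small calculation. This is a minimal sketch under a simplifying assumption: all four nucleotides are taken as equally likely in the background, and no small-sample correction is applied, so the numbers may differ from those reported by MEME. The function names are hypothetical.

```python
import math


def column_information(freqs):
    """Information content (bits) of one motif position.

    freqs maps each nucleotide to its probability at this position.
    With a uniform background the maximum is log2(4) = 2 bits.
    """
    entropy = -sum(p * math.log2(p) for p in freqs.values() if p > 0)
    return 2.0 - entropy


def logo_heights(freqs):
    """Letter heights in a LOGO column: column information times letter probability."""
    ic = column_information(freqs)
    return {base: p * ic for base, p in freqs.items()}


conserved = {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}   # always A
mixed = {"A": 0.5, "C": 0.0, "G": 0.0, "T": 0.5}       # A or T equally often
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}  # no preference

for name, column in [("conserved", conserved), ("mixed", mixed), ("uniform", uniform)]:
    print(name, column_information(column), logo_heights(column))
```

A fully conserved position carries 2 bits, a position split evenly between two letters carries 1 bit, and a position with no preference carries 0 bits. This is where the estimate n = I / 2 comes from.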

Motif 1 (found in 8 sequences, E-value 2.2e+000, information content 16.6 bits)

Motif 2 (found in 8 sequences, E-value 1.9e+002, information content 15.8 bits)

Motif 3 (found in 2 sequences, E-value 1.6e+004, information content 17.0 bits)

None of the motifs found can be confidently considered real: all of them have high E-values (greater than 1) and only moderate information content.
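The rough rule n = I / 2 can be applied to the three reported information contents to see what "moderate" means here. The sketch below is only an order-of-magnitude estimate. The function name is hypothetical, and the genome length is an assumption: the largest coordinate in the table above (4129140) is used as a lower bound, not the true length.

```python
def chance_spacing(information_bits):
    """Expected distance (bp) between chance matches of a word of length n = I / 2."""
    n = information_bits / 2
    return 4 ** n


# Information content values reported by MEME for the three motifs.
motifs = [("Motif 1", 16.6), ("Motif 2", 15.8), ("Motif 3", 17.0)]

# Assumption: lower bound on the genome length, taken from the
# largest gene coordinate in the table (end of purA).
genome_length_lower_bound = 4129140

for name, bits in motifs:
    spacing = chance_spacing(bits)
    expected = genome_length_lower_bound / spacing
    print(f"{name}: n = {bits / 2:.1f}, one chance match per ~{spacing:,.0f} bp, "
          f"at least ~{expected:.0f} chance matches in the genome")
```

By this estimate each motif would be expected to match by chance roughly once per 10^5 bp, that is, tens of times in a genome of several million base pairs, which agrees with the high E-values.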
		


© Popov Nikita 2016