!RUNS ON LINUX ONLY!
Requires EMBOSS and MAFFT installed

Starting with having downloaded fasta and feature table files of Pyrococus abyssi chromosome assembly (NCBI AC AL096836.1)

Obtaining the CDS coordinate table of the P. abyssi chromosome with features2CDSs.py script:

Obtaining CDS upstream sequences for SD PWM construction with MEME

Selecting and filtering CDSs with no "hypothetical" or "putative" in their names:

Filtering selected CDSs to be at least 300 nt long:

Randomly sampling 50 CDS of the filtered set:

Retrieving 40 nt upstream sequences of the sampled CDSs from the chromosome fasta file:

Adding the reverse complement of 16S rRNA 3'-prime end (containing aSD sequence) to the upstreams:

According to the chromosome's gbk file the 16S rRNA sequence is confined to 205039..206541 positions, however NCBI blastn search across Thermococci genomes has shown the end coordinate to be truncated. 12 Thermococci blastn-found 16S rRNA sequences, annotated as complete, has been obtained as thermococci_16S.fasta to build an alignment with the 16S rRNA of Pyrococcus abyssi:

So, based on the alignment above, coordinate 206547 has been chosen as the end:

upstreams.fasta file, comprising 40 nt upstreams of 50 protein coding genes and 40 nt reverse-complemented 16S rRNA gene of P. abyssi, is now passed to the MEME program for motif finding:
To find the short motif MEME is run with the options: classic mode, zoops, 1 motif to find, 0-order background model, motif strictly 5 nt long, search given strand only
To find the long motif MEME is run with the options: classic mode, zoops, 1 motif to find, 0-order background model, motif 5-10 nt long, search given strand only

Searching for SD sites in CDS upstream regions with the constructed PWMs using FIMO

$I.$ At first, SD motif sites will be searched for in the upstreams of the selected set of good CDSs (>= 300 nt long, no hypothetical/putative/NA in the product name feature)

Randomly sampling 500 CDSs from the 972 good CDSs and, provided that SD motif sites are expected to be found somewhere within 20 nt upstream to start-codons, taking their 20 nt upstream sequences:

Making a fasta file of the selected upstreams:

500_good_cds_20.fasta, comprising the 500 upstream 20 nt long sequences selected, is now passed to the FIMO program for motif searching with the help of the built PWMs:
To search for the short (more strict, 5 nt long) motif FIMO is run with the options: PWM SD_short.txt, scan given strand only, p-value threshold 0.001 (and then 0.01)
To search for the long (less strict, 9 nt long) motif FIMO is run with the options: PWM SD_long.txt, scan given strand only, p-value threshold 0.001 (and then 0.01)

As while using p-value threshold 0.001 with both PWMs there have been found much less motifs than expected from previous studies, for the next analysis only the sets of motifs found by using threshold 0.01 (fimo_20_short_01.tsv and fimo_20_long_01.tsv) are considered

Refining and analysing the set of good motifs found with the short (more strict) PWM

FDR has been chosen to be 0.1, so all the findings with q-value above 0.1 are being dismissed:

There are some sequences with more than 1 motif found, for these CDS upstreams a motif with the best p-value is being chosen:

The most of sites found seem reliable, except for the 11 ones having a C nt in their sequences.
A histogram of distances to the start codon is shown below:

Extracting the sequences of the motifs found and filtered to build the motif LOGO:

The LOGO built with no adjustment for composition:

Making a table for protein coding genes with the SD sites found:

Refining and analysing the set of good motifs found with the long (less strict) PWM

FDR has been chosen to be 0.1, so all the findings with q-value above 0.1 are being dismissed:

There are some sequences with more than 1 motif found, for these CDS upstreams a motif with the best p-value is being chosen:

As the motif, that has been searched for, is 9 nt long, there are plenty of matching variants.
To compare the performance of the long PWM with that of the short one, the frequencies of the most expected variants of the SD conserved region, as well as the frequency of a quite unxepected C nt in this region, are being estimated:

A histogram of distances to the start-codon is shown below:

Extracting the sequences of the motifs found and filtered to build the motif LOGO:

The LOGO is built with no adjustment for composition:

Making a table for protein coding genes with the SD sites found:

$II.$ Now SD motif sites will be searched for in all the CDSs of P. abyssi GE5 chromosome

Using the CDS table, obtained earlier with the features2CDSs.py script:

Making a fasta file of all the upstreams:

all_cds_20_15.fasta file, comprising 20 nt upstream - 15 nt downstream sequences of all the 1783 CDSs, is now passed to the FIMO program for motif searching with the help of the short (the more strict) PWM:
FIMO is run with the options: PWM SD_short.txt, scan given strand only, p-value threshold 0.001 (and then 0.01)

As while using 0.001 as the p-value cutoff, there have been find only 587 sites all of them having the same sequence, for the next analysis only the set of sites found by using the p-value cutoff 0.01 (fimo_all_20_15_01.tsv) is considered

Refining and analysing the set of motifs found across all the peri-CDS regions of the chromosome

FDR has been chosen to be 0.1, so all the findings with q-value above 0.1 are being dismissed:

There are some sequences with more than 1 motif found, for these CDS upstreams a motif with the best p-value is being chosen:

All the sites found seem to be reliable according to their sequences.
A histogram of the distances to annotated start-codon is shown below:

It can be seen there is a significant amount of SD sites found upstream of start-codon (marked with "ATG" in the histogram). Some of that finding might truly be the SD sites, implying wrong start-codon annotation, while the others should be the artifacts of multiplicity problem (FDR = 0.1, which allows as many as 10% of sites to be false positives)

Making a table for all the protein coding genes of P. abyssi GE5 chromosome with the SD sites found:

The check of the unique sequence distribution of sites for those ones found downstream to annotated start-codon hasn't shown anything special, so there is no great reason to consider that sites false positives:

It is interesting whether there is a greater share of poorly annotated genes among those with SD found downstream to annotated start-codon. Hence applying the chi2-test for independence of variables:

The test has shown no significant dependency between quality of gene name annotation and the position of SD site, which undermines the assumption of wrong start-codon annotation

Extracting the sequences of the motifs found and filtered to build the motif LOGO:

The LOGO is built with no adjustment for composition:

Analysis complete