getorf 


Function

   Finds and extracts open reading frames (ORFs)

Description

   This program finds and outputs the sequences of open reading frames
   (ORFs).

   The ORFs can be defined as regions of a specified minimum size between
   STOP codons or between START and STOP codons.

   The ORFs can be output as the nucleotide sequence or as the
   translation.

   The program can also output the region around the START codon, the
   initial STOP codon or the ending STOP codon of an ORF, for those
   analysing the properties of these regions.

   The START and STOP codons are defined in the Genetic Code tables. A
   suitable Genetic Code table can be selected for the organism you are
   investigating.
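
   The two ORF definitions above can be illustrated with a minimal
   Python sketch. It is not getorf's implementation: it scans only the
   forward strand, hard-codes the STOP and START codons of the standard
   code with 'AUG' as the only START, and the names find_orfs and _keep
   are hypothetical. STOP codons are not counted as part of the ORF.

```python
# Illustrative sketch only; not getorf's actual algorithm.
STOPS = {"TAA", "TAG", "TGA"}   # STOP codons of the standard code (assumed)
STARTS = {"ATG"}                # single START codon, as in table 0 (assumed)


def _keep(orfs, begin, start_at, end, minsize, require_start):
    """Record the 0-based half-open region [first, end) if it is long enough."""
    first = start_at if require_start else begin
    if first is not None and end > first and end - first >= minsize:
        orfs.append((first + 1, end))   # report 1-based inclusive positions


def find_orfs(seq, minsize=30, require_start=False):
    """Return sorted (start, end) pairs for forward-strand ORFs.

    require_start=False: an ORF is a region free of STOP codons.
    require_start=True:  an ORF runs from the first START codon in such a
                         region to the end of that region.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        begin = frame      # start of the current stop-free region
        start_at = None    # first START codon seen in that region, if any
        pos = frame
        while pos + 3 <= len(seq):
            codon = seq[pos:pos + 3]
            if codon in STOPS:
                _keep(orfs, begin, start_at, pos, minsize, require_start)
                begin, start_at = pos + 3, None
            elif start_at is None and codon in STARTS:
                start_at = pos
            pos += 3
        # A region may also run to the end of the sequence without a STOP.
        _keep(orfs, begin, start_at, pos, minsize, require_start)
    return sorted(orfs)


if __name__ == "__main__":
    demo = "TAAATGAAACCCTAGGG"
    print(find_orfs(demo, minsize=6))                      # between STOPs
    print(find_orfs(demo, minsize=6, require_start=True))  # START to STOP
```

   The same scan applied to the reverse complement would give the
   reverse-sense ORFs; that step is left out here to keep the sketch
   short.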

Usage

   Here is a sample session with getorf


% getorf -minsize 300 
Finds and extracts open reading frames (ORFs)
Input nucleotide sequence(s): tembl:eclaci
protein output sequence(s) [eclaci.orf]: 
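
   For use from a script, the same session can be run without prompts. The
   following Python sketch only builds the argument list, using qualifiers
   documented in this page ('-sequence', '-outseq', '-minsize', '-find',
   '-table' and '-auto'). The helper name getorf_command is hypothetical,
   and actually running the command assumes that EMBOSS and the 'tembl'
   example database are installed.

```python
import shlex


def getorf_command(sequence, outseq, minsize=30, find=0, table=0):
    """Build an argument list for a non-interactive getorf run."""
    return [
        "getorf",
        "-sequence", sequence,
        "-outseq", outseq,
        "-minsize", str(minsize),
        "-find", str(find),
        "-table", str(table),
        "-auto",                # turn off prompts
    ]


if __name__ == "__main__":
    cmd = getorf_command("tembl:eclaci", "eclaci.orf", minsize=300)
    print(shlex.join(cmd))
    # To execute it (requires an EMBOSS installation):
    #   import subprocess
    #   subprocess.run(cmd, check=True)
```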


Command line arguments

   Standard (Mandatory) qualifiers:
  [-sequence]          seqall     Nucleotide sequence(s) filename and optional
                                  format, or reference (input USA)
  [-outseq]            seqoutall  [.] Protein sequence
                                  set(s) filename and optional format (output
                                  USA)

   Additional (Optional) qualifiers:
   -table              menu       [0] Code to use (Values: 0 (Standard); 1
                                  (Standard (with alternative initiation
                                  codons)); 2 (Vertebrate Mitochondrial); 3
                                  (Yeast Mitochondrial); 4 (Mold, Protozoan,
                                  Coelenterate Mitochondrial and
                                  Mycoplasma/Spiroplasma); 5 (Invertebrate
                                  Mitochondrial); 6 (Ciliate Macronuclear and
                                  Dasycladacean); 9 (Echinoderm
                                  Mitochondrial); 10 (Euplotid Nuclear); 11
                                  (Bacterial); 12 (Alternative Yeast Nuclear);
                                  13 (Ascidian Mitochondrial); 14 (Flatworm
                                  Mitochondrial); 15 (Blepharisma
                                  Macronuclear); 16 (Chlorophycean
                                  Mitochondrial); 21 (Trematode
                                  Mitochondrial); 22 (Scenedesmus obliquus);
                                  23 (Thraustochytrium Mitochondrial))
   -minsize            integer    [30] Minimum nucleotide size of ORF to
                                  report (Any integer value)
   -maxsize            integer    [1000000] Maximum nucleotide size of ORF to
                                  report (Any integer value)
   -find               menu       [0] This is a small menu of possible output
                                  options. The first four options are to
                                  select either the protein translation or the
                                  original nucleic acid sequence of the open
                                  reading frame. There are two possible
                                  definitions of an open reading frame: it can
                                  either be a region that is free of STOP
                                  codons or a region that begins with a START
                                  codon and ends with a STOP codon. The last
                                  three options are probably only of interest
                                  to people who wish to investigate the
                                  statistical properties of the regions around
                                  potential START or STOP codons. The last
                                  option assumes that ORF lengths are
                                  calculated between two STOP codons. (Values:
                                  0 (Translation of regions between STOP
                                  codons); 1 (Translation of regions between
                                  START and STOP codons); 2 (Nucleic sequences
                                  between STOP codons); 3 (Nucleic sequences
                                  between START and STOP codons); 4
                                  (Nucleotides flanking START codons); 5
                                  (Nucleotides flanking initial STOP codons);
                                  6 (Nucleotides flanking ending STOP codons))

   Advanced (Unprompted) qualifiers:
   -[no]methionine     boolean    [Y] START codons at the beginning of protein
                                  products will usually code for Methionine,
                                  despite what the codon will code for when it
                                  is internal to a protein. This qualifier
                                  sets all such START codons to code for
                                  Methionine by default.
   -circular           boolean    [N] Is the sequence circular
   -[no]reverse        boolean    [Y] Set this to be false if you do not wish
                                  to find ORFs in the reverse complement of
                                  the sequence.
   -flanking           integer    [100] If you have chosen one of the options
                                  of the type of sequence to find that gives
                                  the flanking sequence around a STOP or START
                                  codon, this allows you to set the number of
                                  nucleotides either side of that codon to
                                  output. If the region of flanking
                                  nucleotides crosses the start or end of the
                                  sequence, no output is given for this codon.
                                  (Any integer value)

   Associated qualifiers:

   "-sequence" associated qualifiers
   -sbegin1            integer    Start of each sequence to be used
   -send1              integer    End of each sequence to be used
   -sreverse1          boolean    Reverse (if DNA)
   -sask1              boolean    Ask for begin/end/reverse
   -snucleotide1       boolean    Sequence is nucleotide
   -sprotein1          boolean    Sequence is protein
   -slower1            boolean    Make lower case
   -supper1            boolean    Make upper case
   -sformat1           string     Input sequence format
   -sdbname1           string     Database name
   -sid1               string     Entryname
   -ufo1               string     UFO features
   -fformat1           string     Features format
   -fopenfile1         string     Features file name

   "-outseq" associated qualifiers
   -osformat2          string     Output seq format
   -osextension2       string     File name extension
   -osname2            string     Base file name
   -osdirectory2       string     Output directory
   -osdbname2          string     Database name to add
   -ossingle2          boolean    Separate file for each entry
   -oufo2              string     UFO features
   -offormat2          string     Features format
   -ofname2            string     Features file name
   -ofdirectory2       string     Output directory

   General qualifiers:
   -auto               boolean    Turn off prompts
   -stdout             boolean    Write standard output
   -filter             boolean    Read standard input, write standard output
   -options            boolean    Prompt for standard and additional values
   -debug              boolean    Write debug output to program.dbg
   -verbose            boolean    Report some/full command line options
   -help               boolean    Report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   -warning            boolean    Report warnings
   -error              boolean    Report errors
   -fatal              boolean    Report fatal errors
   -die                boolean    Report dying program messages

Input file format

   getorf reads any nucleic acid sequence USA.

  Input files for usage example

   'tembl:eclaci' is a sequence entry in the example nucleic acid
   database 'tembl'

  Database entry: tembl:eclaci

ID   ECLACI     standard; DNA; PRO; 1113 BP.
XX
AC   V00294;
XX
SV   V00294.1
XX
DT   09-JUN-1982 (Rel. 01, Created)
DT   10-FEB-1999 (Rel. 58, Last updated, Version 2)
XX
DE   E. coli laci gene (codes for the lac repressor).
XX
KW   DNA binding protein; repressor.
XX
OS   Escherichia coli
OC   Bacteria; Proteobacteria; gamma subdivision; Enterobacteriaceae;
OC   Escherichia.
XX
RN   [1]
RP   1-1113
RX   MEDLINE; 78246991.
RA   Farabaugh P.J.;
RT   "Sequence of the lacI gene";
RL   Nature 274:765-769(1978).
XX
DR   SWISS-PROT; P03023; LACI_ECOLI.
XX
CC   KST ECO.LACI
XX
FH   Key             Location/Qualifiers
FH
FT   source          1..1113
FT                   /db_xref="taxon:562"
FT                   /organism="Escherichia coli"
FT   CDS             31..1113
FT                   /db_xref="SWISS-PROT:P03023"
FT                   /note="reading frame"
FT                   /transl_table=11
FT                   /protein_id="CAA23569.1"
FT                   /translation="MKPVTLYDVAEYAGVSYQTVSRVVNQASHVSAKTREKVEAAMAEL
FT                   NYIPNRVAQQLAGKQSLLIGVATSSLALHAPSQIVAAIKSRADQLGASVVVSMVERSGV
FT                   EACKAAVHNLLAQRVSGLIINYPLDDQDAIAVEAACTNVPALFLDVSDQTPINSIIFSH
FT                   EDGTRLGVEHLVALGHQQIALLAGPLSSVSARLRLAGWHKYLTRNQIQPIAEREGDWSA
FT                   MSGFQQTMQMLNEGIVPTAMLVANDQMALGAMRAITESGLRVGADISVVGYDDTEDSSC
FT                   YIPPSTTIKQDFRLLGQTSVDRLLQLSQGQAVKGNQLLPVSLVKRKTTLAPNTQTASPR
FT                   ALADSLMQLARQVSRLESGQ"
XX
SQ   Sequence 1113 BP; 249 A; 304 C; 322 G; 238 T; 0 other;
     ccggaagaga gtcaattcag ggtggtgaat gtgaaaccag taacgttata cgatgtcgca        60
     gagtatgccg gtgtctctta tcagaccgtt tcccgcgtgg tgaaccaggc cagccacgtt       120
     tctgcgaaaa cgcgggaaaa agtggaagcg gcgatggcgg agctgaatta cattcccaac       180
     cgcgtggcac aacaactggc gggcaaacag tcgttgctga ttggcgttgc cacctccagt       240
     ctggccctgc acgcgccgtc gcaaattgtc gcggcgatta aatctcgcgc cgatcaactg       300
     ggtgccagcg tggtggtgtc gatggtagaa cgaagcggcg tcgaagcctg taaagcggcg       360
     gtgcacaatc ttctcgcgca acgcgtcagt gggctgatca ttaactatcc gctggatgac       420
     caggatgcca ttgctgtgga agctgcctgc actaatgttc cggcgttatt tcttgatgtc       480
     tctgaccaga cacccatcaa cagtattatt ttctcccatg aagacggtac gcgactgggc       540
     gtggagcatc tggtcgcatt gggtcaccag caaatcgcgc tgttagcggg cccattaagt       600
     tctgtctcgg cgcgtctgcg tctggctggc tggcataaat atctcactcg caatcaaatt       660
     cagccgatag cggaacggga aggcgactgg agtgccatgt ccggttttca acaaaccatg       720
     caaatgctga atgagggcat cgttcccact gcgatgctgg ttgccaacga tcagatggcg       780
     ctgggcgcaa tgcgcgccat taccgagtcc gggctgcgcg ttggtgcgga tatctcggta       840
     gtgggatacg acgataccga agacagctca tgttatatcc cgccgtcaac caccatcaaa       900
     caggattttc gcctgctggg gcaaaccagc gtggaccgct tgctgcaact ctctcagggc       960
     caggcggtga agggcaatca gctgttgccc gtctcactgg tgaaaagaaa aaccaccctg      1020
     gcgcccaata cgcaaaccgc ctctccccgc gcgttggccg attcattaat gcagctggca      1080
     cgacaggttt cccgactgga aagcgggcag tga                                   1113
//

Output file format

   The output is a sequence file containing predicted open reading frames
   longer than the minimum size, which defaults to 30 bases (i.e. 10
   amino acids).

  Output files for usage example

  File: eclaci.orf

>ECLACI_1 [735 - 1112] E. coli laci gene (codes for the lac repressor).
GHRSHCDAGCQRSDGAGRNARHYRVRAARWCGYLGSGIRRYRRQLMLYPAVNHHQTGFSP
AGANQRGPLAATLSGPGGEGQSAVARLTGEKKNHPGAQYANRLSPRVGRFINAAGTTGFP
TGKRAV
>ECLACI_2 [1 - 1110] E. coli laci gene (codes for the lac repressor).
PEESQFRVVNVKPVTLYDVAEYAGVSYQTVSRVVNQASHVSAKTREKVEAAMAELNYIPN
RVAQQLAGKQSLLIGVATSSLALHAPSQIVAAIKSRADQLGASVVVSMVERSGVEACKAA
VHNLLAQRVSGLIINYPLDDQDAIAVEAACTNVPALFLDVSDQTPINSIIFSHEDGTRLG
VEHLVALGHQQIALLAGPLSSVSARLRLAGWHKYLTRNQIQPIAEREGDWSAMSGFQQTM
QMLNEGIVPTAMLVANDQMALGAMRAITESGLRVGADISVVGYDDTEDSSCYIPPSTTIK
QDFRLLGQTSVDRLLQLSQGQAVKGNQLLPVSLVKRKTTLAPNTQTASPRALADSLMQLA
RQVSRLESGQ*
>ECLACI_3 [465 - 49] (REVERSE SENSE) E. coli laci gene (codes for the lac repressor).
RRNISAGSFHSNGILVIQRIVNDQPTDALREKIVHRRFTGFDAASFYHRHHHAGTQLIGA
RFNRRDNLRRRVQGQTGGGNANQQRLFARQLLCHAVGNVIQLRHRRFHFFPRFRRNVAGL
VHHAGNGLIRDTGILCDIV

   The name of the ORF sequences is constructed from the name of the
   input sequence with an underscore character ('_') and a unique ordinal
   number of the ORF found appended. The description of the output ORF
   sequence is constructed from the description of the input sequence
   with the start and end positions of the ORF prepended.

   The unique number appended to the name is used only to create new
   unique sequence names; it does not imply any order, position or
   sense-strand of the ORFs.

   If the ORF has been found in the reverse sense, then the start
   position will be greater than the end position. The numbering uses the
   forward-sense positions, but read in the reverse sense. For example,
   >ECLACI_3 [465 - 49] in the output above is a reverse-sense ORF
   running from position 465 to 49. The description will also contain
   '(REVERSE SENSE)'.
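
   The naming convention described above can be read back by a script.
   The following Python sketch parses a header line of the form shown in
   the example output; the function name parse_orf_header and the
   dictionary keys are hypothetical, and the pattern is based only on the
   three headers shown in this document.

```python
import re

# >NAME_N [start - end] optionally followed by ' (REVERSE SENSE)'
HEADER = re.compile(
    r"^>(?P<source>\S+)_(?P<number>\d+) "
    r"\[(?P<start>\d+) - (?P<end>\d+)\]"
    r"(?P<reverse> \(REVERSE SENSE\))?"
)


def parse_orf_header(line):
    """Split a getorf-style header into its documented parts."""
    match = HEADER.match(line)
    if match is None:
        raise ValueError("not a getorf-style header: %r" % line)
    start, end = int(match["start"]), int(match["end"])
    return {
        "source": match["source"],        # name of the input sequence
        "number": int(match["number"]),   # ordinal number of the ORF
        "start": start,
        "end": end,
        "reverse": match["reverse"] is not None,
        "length": abs(end - start) + 1,   # nucleotides spanned
    }


if __name__ == "__main__":
    header = (">ECLACI_3 [465 - 49] (REVERSE SENSE) "
              "E. coli laci gene (codes for the lac repressor).")
    print(parse_orf_header(header))
```

   For ECLACI_3 the span is 417 nucleotides, which matches the 139
   residues of the translation shown above.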

   If the sequence has been specified as a circular genome (using the
   command-line switch '-circular'), then ORFs can potentially continue
   past the 'end' of the input sequence (the breakpoint of the circular
   genome) and into the 'start' of the sequence again. This is dealt with
   by appending the sequence to itself three times and reporting long
   ORFs that are found in this extended sequence. Any ORF that is longer
   than three times the sequence length (i.e. one that continues without
   hitting a STOP at any point in the genome) will be reported as being a
   maximum of three times the length of the input sequence. Note that the
   end position of an ORF in circular genomes can appear to be greater
   than the input sequence length if the ORF crosses the breakpoint. If the ORF
   crosses the breakpoint, then the text '(ORF crosses the breakpoint)'
   will be added to the description of the output sequence.
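
   When post-processing such output, an end position beyond the sequence
   length has to be mapped back onto the circular sequence. The following
   Python sketch shows one way to do this for forward-sense ORFs by
   indexing into the sequence repeated three times. The function names
   are hypothetical, and the sketch assumes that the start position lies
   within the original sequence; it does not reproduce how getorf itself
   searches the extended sequence.

```python
def crosses_breakpoint(seq, end):
    """True if a forward-sense ORF ending at 'end' runs past the sequence end."""
    return end > len(seq)


def circular_region(seq, start, end):
    """Return the nucleotides of a forward-sense ORF [start - end], 1-based.

    'end' may exceed len(seq) when the ORF crosses the breakpoint, up to
    three times the sequence length.
    """
    n = len(seq)
    if not (1 <= start <= n and start <= end <= 3 * n):
        raise ValueError("positions fall outside the repeated sequence")
    extended = seq * 3                 # the sequence repeated three times
    return extended[start - 1:end]


if __name__ == "__main__":
    genome = "AAACCCGGGT"              # toy circular sequence, 10 bases
    print(circular_region(genome, 8, 13))
    print(crosses_breakpoint(genome, 13))
```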

Data files

   The START and STOP codons used by getorf are defined in the Genetic
   Code data files. By default, Genetic Code file EGC.0 is used.

   The default file EGC.0 is the 'Standard Code' with the rarely used
   alternate START codons omitted; it has only the normal 'AUG' START
   codon. The 'Standard Code' with the rarely used alternate START codons
   included is Genetic Code file EGC.1.

   It is expected that users will sometimes wish to customise a Genetic
   Code file. To do this, use the program embossdata.

   EMBOSS data files are distributed with the application and stored in
   the standard EMBOSS data directory, which is defined by the EMBOSS
   environment variable EMBOSS_DATA.

   To see the available EMBOSS data files, run:

% embossdata -showall

   To fetch one of the data files (for example 'Exxx.dat') into your
   current directory for you to inspect or modify, run:

% embossdata -fetch -file Exxx.dat

   Users can provide their own data files in their own directories.
   Project specific files can be put in the current directory, or for
   tidier directory listings in a subdirectory called ".embossdata".
   Files for all EMBOSS runs can be put in the user's home directory, or
   again in a subdirectory called ".embossdata".

   The directories are searched in the following order:
     * . (your current directory)
     * .embossdata (under your current directory)
     * ~/ (your home directory)
     * ~/.embossdata
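
   The search order listed above can be written out as a short Python
   sketch. It covers only the four user locations in the list; any
   fallback to the standard EMBOSS data directory is not modelled, and
   the function names are hypothetical.

```python
from pathlib import Path


def data_file_candidates(filename, cwd, home):
    """Return the user-level locations for a data file, in search order."""
    cwd, home = Path(cwd), Path(home)
    return [
        cwd / filename,                    # .
        cwd / ".embossdata" / filename,    # .embossdata
        home / filename,                   # ~/
        home / ".embossdata" / filename,   # ~/.embossdata
    ]


def locate_data_file(filename, cwd=None, home=None):
    """Return the first candidate that exists, or None if there is none."""
    cwd = Path.cwd() if cwd is None else cwd
    home = Path.home() if home is None else home
    for candidate in data_file_candidates(filename, cwd, home):
        if candidate.is_file():
            return candidate
    return None


if __name__ == "__main__":
    for path in data_file_candidates("EGC.0", Path.cwd(), Path.home()):
        print(path)
```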

   The Genetic Code data files are based on the NCBI genetic code tables.
   Their names and descriptions are:

   EGC.0
          Standard (Differs from EGC.1 in that it only has initiation site
          'AUG')

   EGC.1
          Standard

   EGC.2
          Vertebrate Mitochondrial

   EGC.3
          Yeast Mitochondrial

   EGC.4
          Mold, Protozoan, Coelenterate Mitochondrial and
          Mycoplasma/Spiroplasma

   EGC.5
          Invertebrate Mitochondrial

   EGC.6
          Ciliate Macronuclear and Dasycladacean

   EGC.9
          Echinoderm Mitochondrial

   EGC.10
          Euplotid Nuclear

   EGC.11
          Bacterial

   EGC.12
          Alternative Yeast Nuclear

   EGC.13
          Ascidian Mitochondrial

   EGC.14
          Flatworm Mitochondrial

   EGC.15
          Blepharisma Macronuclear

   EGC.16
          Chlorophycean Mitochondrial

   EGC.21
          Trematode Mitochondrial

   EGC.22
          Scenedesmus obliquus

   EGC.23
          Thraustochytrium Mitochondrial

   The format of these files is very simple.

   It consists of several lines of optional comments, each starting with
   a '#' character.

   These are followed by the line: 'Genetic Code [n]', where 'n' is the
   number of the genetic code file.

   This is followed by the description of the code and then by four lines
   giving the IUPAC one-letter code of the translated amino acid, the
   start codons (indicated by an 'M') and the three bases of the codon,
   lined up one on top of the other.

   For example:

------------------------------------------------------------------------------
# Genetic Code Table
#
# Obtained from: http://www.ncbi.nlm.nih.gov/collab/FT/genetic_codes.html
# and: http://www3.ncbi.nlm.nih.gov/htbin-post/Taxonomy/wprintgc?mode=c
#
# Differs from Genetic Code [1] only in that the initiation sites have been
# changed to only 'AUG'

Genetic Code [0]
Standard

AAs  =   FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG
Starts = -----------------------------------M----------------------------
Base1  = TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG
Base2  = TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG
Base3  = TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG
------------------------------------------------------------------------------
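
   Because the five data lines are aligned column by column, a file in
   this format can be read with a few lines of Python. The sketch below
   is an illustration based only on the example above, not the parser
   used by EMBOSS; the function name parse_genetic_code is hypothetical.

```python
EXAMPLE = """\
# Genetic Code Table
Genetic Code [0]
Standard

AAs  =   FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG
Starts = -----------------------------------M----------------------------
Base1  = TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG
Base2  = TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG
Base3  = TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG
"""


def parse_genetic_code(text):
    """Return (codon -> amino acid, START codons, STOP codons)."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip comments, blank lines, and the title and description lines.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        fields[key.strip()] = value.strip()
    # Read each column top to bottom to obtain the three bases of a codon.
    codons = ["".join(bases) for bases in
              zip(fields["Base1"], fields["Base2"], fields["Base3"])]
    table = dict(zip(codons, fields["AAs"]))
    starts = {c for c, flag in zip(codons, fields["Starts"]) if flag == "M"}
    stops = {c for c, aa in table.items() if aa == "*"}
    return table, starts, stops


if __name__ == "__main__":
    table, starts, stops = parse_genetic_code(EXAMPLE)
    print(len(table), sorted(starts), sorted(stops))
```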

Notes

   If you have selected one of the options to report the region around a
   START or STOP codon, then note that any such region that crosses the
   beginning or end of the sequence will not be reported.
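
   This rule can be sketched in Python as follows. The function name is
   hypothetical, and the sketch assumes that the reported region consists
   of the codon itself plus the requested number of flanking nucleotides
   on each side; the default of 100 is the documented default of
   '-flanking'.

```python
def flanking_region(seq, codon_start, flanking=100):
    """Return the codon plus 'flanking' bases on each side, or None.

    codon_start is the 1-based position of the first base of the codon.
    None is returned when the region would cross either end of the
    sequence, mirroring the rule that such regions are not reported.
    """
    begin = codon_start - 1 - flanking    # 0-based, inclusive
    end = codon_start - 1 + 3 + flanking  # 0-based, exclusive
    if begin < 0 or end > len(seq):
        return None
    return seq[begin:end]


if __name__ == "__main__":
    seq = "CCCCCATGGGGGG"
    print(flanking_region(seq, 6, flanking=3))   # region fits
    print(flanking_region(seq, 6, flanking=6))   # crosses the start
```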

References

   None.

Warnings

   None.

Diagnostic Error Messages

   None.

Exit status

   It always exits with status 0.

Known bugs

   '-sbegin' and '-send' do not work with this program.

See also

   Program name                        Description
   checktrans   Reports STOP codons and ORF statistics of a protein sequence
   marscan      Finds MAR/SAR sites in nucleic sequences
   plotorf      Plot potential open reading frames
   showorf      Pretty output of DNA translations
   sixpack      Display a DNA sequence with 6-frame translation and ORFs
   syco         Synonymous codon usage Gribskov statistic plot
   tcode        Fickett TESTCODE statistic to identify protein-coding DNA
   wobble       Wobble base plot

Author(s)

   Gary Williams (gwilliam@rfcgr.mrc.ac.uk)
   MRC Rosalind Franklin Centre for Genomics Research Wellcome Trust
   Genome Campus, Hinxton, Cambridge, CB10 1SB, UK

History

   2000 - written - Gary Williams

   November 2002 - added indication of reverse sense ORFs

   November 2002 - added indication of ORFs that cross the breakpoint at
   position 1 in circular genomes.

Target users

   This program is intended to be used by everyone and everything, from
   naive users to embedded scripts.

Comments

   None