Formatrpsdb: Build databases for RPS Blast


Introduction
------------

Formatrpsdb is a utility that converts a collection of input 
sequences into a database suitable for use with Reverse
Position Specific (RPS) Blast. Each input sequence, together
with its position-specific scoring matrix (PSSM), is ASN.1
encoded into a PssmWithParameters (or 'scoremat') object 
and resides in a separate file. Scoremat objects can be created
using the blastpgp binary in the Standalone BLAST distribution.
Formatrpsdb is given a list of these files and produces the 
corresponding database. 

Formatrpsdb is designed to perform the work of formatdb,
makemat and copymat simultaneously, without generating the 
large number of intermediate files these utilities would need 
to create an RPS Blast database. Further, scoremat objects 
are in more general use than the binary format makemat requires. 
It is hoped that direct manipulation of scoremat objects 
will encourage conversion of more diverse sequence 
collections into RPS Blast databases.

Databases generated by formatrpsdb are binary compatible 
with databases generated by formatdb/makemat/copymat, 
although the database files will in general not be byte-
for-byte identical.


Relevant Documents
------------------

Information on RPS Blast, as well as instructions for creating 
RPS Blast databases using formatdb/makemat/copymat:

Information on rpsblast

Information on formatdb

Information on blastpgp

The ASN.1 specification for PssmWithParameters is available
in the NCBI C toolkit sources, in tools/scoremat.asn


Preconditions for Use
---------------------

This section assumes some familiarity with the documents
listed above.

An RPS Blast database consists of two groups of files. The first
group is a standard protein database generated by formatdb
(RPS Blast cannot use nucleotide databases). The second group
of files contains precomputations used to speed up RPS Blast
searches of the standard protein database. Previously, formatdb
would build the first group of files, and makemat/copymat would 
be used to build the second group (the 'RPS data files'). 

As mentioned above, formatrpsdb performs all of these steps in 
a single pass. However, the collection of sequences passed to 
formatrpsdb must already be consistent in several important ways:

        - All sequences must use the same protein alphabet.

        - All scores in all PSSMs must be scaled by the same
          factor.

        - If the scoremat does not contain a PSSM, it must
          contain a set of residue frequencies that formatrpsdb
          can use to create a PSSM manually. The PSSM creation
          process is identical to that performed by makemat,
          and requires a scaling factor, gap existence and
          extension penalties, and an underlying score matrix.
          These must be provided as command line options to formatrpsdb, 
          or each scoremat can contain one or more of these values
          (which will be used in place of the values specified as
          input arguments). If a sequence contains both a PSSM and 
          residue frequencies, the latter will be ignored (see the 
          command line options below).

Regarding the last requirement, a collection of sequences
passed to formatrpsdb may include a mixture of sequences for which
a PSSM is available and sequences for which only the residue
frequencies are available. The present version of formatrpsdb
requires that all parameters (scale factor, gap open/extend,
underlying score matrix), whether appearing within a scoremat or
supplied from the command line, be the same for all sequences.

Prebuilt collections of sequences that satisfy these criteria
are available from NCBI, along with tools capable of building
compliant sequence files. Further, blastpgp is capable of reading
and writing scoremat files containing residue frequencies.


Command Line Options
--------------------

A list of the command line options and the current version of 
formatrpsdb may be obtained by executing formatrpsdb with a 
single dash as its only argument, as in:

    formatrpsdb -

The formatrpsdb options are listed below:

  -t  Title for database file [String]  Optional

This will be printed by utilities like fastacmd as the
title of the generated database.

  -i  Input file containing list of ASN.1 Scoremat filenames [File In]

Each Scoremat file contains the score matrix (or residue
frequencies) and identification data for a single sequence. 
Filenames should appear one per line in this file, and the 
corresponding sequences will be added to the database in 
the order listed in this file. There are no restrictions 
on the filenames that appear in the list.
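
The list file described above can be prepared with a few lines of
shell. The following is a minimal sketch, not part of formatrpsdb:
the function names, the directory argument and the '.asn' suffix are
assumptions chosen for illustration. Note that the shell expands the
pattern in sorted order, which then becomes the order of sequences
in the database.

```shell
#!/bin/sh
# Hypothetical helper: write one scoremat filename per line into a
# list file. The '.asn' suffix is an assumption, not a requirement.
make_scoremat_list() {
    dir=$1
    out=$2
    : > "$out"                        # start with an empty list file
    for f in "$dir"/*.asn; do
        [ -f "$f" ] || continue       # pattern matched nothing
        printf '%s\n' "$f" >> "$out"  # one filename per line
    done
}

# Hypothetical helper: report listed files that cannot be read.
# Returns nonzero if any listed file is missing or unreadable.
check_scoremat_list() {
    list=$1
    status=0
    while IFS= read -r name; do
        [ -n "$name" ] || continue    # ignore blank lines
        if [ ! -r "$name" ]; then
            echo "missing: $name" >&2
            status=1
        fi
    done < "$list"
    return $status
}
```

A typical sequence would be to call make_scoremat_list, then
check_scoremat_list, and only then pass the list to the '-i' option.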

  -l  Logfile name: [File Out]  Optional
    default = formatrpsdb.log

Status and error information will be written to this file.

  -o  Create index files for database [T/F]  Optional
    default = F

If the "-o" option is TRUE and the sequence identifiers 
within each scoremat allow it, formatrpsdb will generate
index files for the generated database. These will allow
retrieval of individual sequences by utilities like fastacmd.

  -v  Database volume size in millions of letters [Integer]  Optional
    default = 0
    range from 0 to 

This option breaks up large collections of sequences into 
'volumes' (each with a maximum size of 1 billion letters). 

  -b  Scoremat files are binary [T/F]  Optional
    default = F

The scoremat ASN.1 format allows sequence data in human-readable
text format or a more compact binary format. Setting this option
to 'T' signals to formatrpsdb that all of the scoremat files 
listed in the file for '-i' option contain binary ASN.1 scoremat data.
If set to 'F', scoremat files will all be treated as containing 
ASCII text ASN.1.
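
Because one '-b' setting applies to every file in the list, it can
help to check which kind of file is at hand before running
formatrpsdb. The sketch below is only a heuristic and rests on an
assumption: that a text ASN.1 scoremat begins with a type name
followed by '::=' (for example 'PssmWithParameters ::= {'), while a
binary file does not. The function name is hypothetical.

```shell
#!/bin/sh
# Heuristic, not a parser: print F if the start of the file looks
# like text ASN.1, T otherwise. Assumes text files contain '::='
# within their first 256 bytes.
guess_binary_flag() {
    # NUL bytes are removed so the data can be held in a variable.
    first=$(head -c 256 "$1" | tr -d '\000')
    case $first in
        *'::='*) echo F ;;
        *)       echo T ;;
    esac
}
```

If the function prints different values for different files in the
list, the collection mixes text and binary scoremats and should be
converted to a single form first.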

  -f  Threshold for extending hits for RPS database [Real]  Optional
    default = 11.0

Formatrpsdb builds a Blast lookup table while the database is 
being generated. This table indexes each input sequence for 
searches using RPS Blast. The argument to '-f' specifies the
threshold value; groups of letters in any input sequence which
score above this value are added to the lookup table.

Note that fractional threshold values (e.g. '10.5') are allowed
for this argument.

  -n  Base name of output database 
      (same as input file if not specified) [String]  Optional

By default, the database generated will consist of a collection
of files whose prefix matches that of the filename specified in
the '-i' option. To give the database files a different prefix,
specify the desired string for this option.

  -S  For scoremats that contain only residue frequencies, the 
      scaling factor to apply when creating PSSMs [Real]  Optional
      default = 100.0

When given a scoremat file that does not contain a PSSM,
formatrpsdb looks for a set of residue frequencies in the file, 
and attempts to create a PSSM using those residue frequencies. 
The creation process requires a scale factor for the computed 
scores, provided by this argument.

  -G  The gap opening penalty (if not present in the scoremat) 
      [Integer]  Optional
      default = 11
  -E  The gap extension penalty (if not present in the scoremat) 
      [Integer]  Optional
      default = 1

If an input file does not contain gap opening and extension 
penalties, the values of these two arguments will be substituted.
These are primarily intended for scoremat files that contain
only residue frequencies.

  -U  Underlying score matrix (if not present in the 
      scoremat) [String]  Optional
      default = BLOSUM62

If an input file does not contain the name of the NCBI standard 
score matrix from which residue frequencies were derived, the 
matrix name specified by the -U option will be substituted.
This is primarily intended for scoremat files that contain only 
residue frequencies.
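
For scoremats that contain only residue frequencies, the four
options above are usually given together. The following sketch
prints (but does not run) such a command line; the function name is
hypothetical, and the fallback values are simply the defaults
documented above.

```shell
#!/bin/sh
# Hypothetical helper: print a formatrpsdb command line for a list
# of frequency-only scoremats. Arguments after the list file are
# optional and fall back to the documented defaults.
rps_freq_command() {
    list=$1
    scale=${2:-100.0}        # -S  scaling factor
    gap_open=${3:-11}        # -G  gap opening penalty
    gap_extend=${4:-1}       # -E  gap extension penalty
    matrix=${5:-BLOSUM62}    # -U  underlying score matrix
    echo "formatrpsdb -i $list -S $scale -G $gap_open -E $gap_extend -U $matrix"
}

rps_freq_command list
```

Values stored inside a scoremat take precedence over these
arguments, so the printed command only supplies what the files
themselves leave out.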


Examples of Use
---------------

Given a set of three sequence files 'scoremat1', 'scoremat2'
and 'scoremat3', along with a text file 'list' consisting
of the three lines

scoremat1
scoremat2
scoremat3

the command to create an RPS Blast database is

    formatrpsdb -i list 

which creates the files 

    list.pin list.psq list.phr list.rps list.loo list.aux

The first three files are a standard non-indexed protein database,
and the last three are RPS data files. To index the database for
retrieval of individual sequences, use

    formatrpsdb -i list -o T

which will add the files

    list.pin list.psd list.psi

To instead call this database 'mydb', use

    formatrpsdb -i list -o T -n mydb

which will create 'mydb.*' instead of 'list.*'.
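
After a run, the presence of the six files named in the first
example can be confirmed with a short check. This is a sketch under
the assumption of a single-volume database; the function name is
hypothetical, and file names may differ when the '-v' option splits
the database into volumes.

```shell
#!/bin/sh
# Hypothetical helper: verify that the six database files listed in
# the first example exist for a given base name. Returns nonzero if
# any file is missing.
check_rps_db() {
    base=$1
    missing=0
    for ext in pin psq phr rps loo aux; do
        if [ ! -f "$base.$ext" ]; then
            echo "missing: $base.$ext" >&2
            missing=1
        fi
    done
    return $missing
}
```

For the examples above this would be called as 'check_rps_db list'
or 'check_rps_db mydb'.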


Additional Information and Help
-------------------------------

Please direct bug reports, inquiries for assistance, and requests 
for new features to blast-help@ncbi.nlm.nih.gov


Last updated July 23 2004