Formatrpsdb: Build databases for RPS Blast

Introduction
------------

Formatrpsdb is a utility that converts a collection of input sequences
into a database suitable for use with Reverse Position Specific (RPS)
Blast. Each input sequence, together with its position-specific scoring
matrix (PSSM), is ASN.1 encoded into a PssmWithParameters (or
'scoremat') object and resides in a separate file. Scoremat objects can
be created using the blastpgp binary in the Standalone BLAST
distribution. Formatrpsdb is given a list of these files and produces
the corresponding database.

Formatrpsdb is designed to perform the work of formatdb, makemat and
copymat simultaneously, without generating the large number of
intermediate files these utilities would need to create an RPS Blast
database. Further, scoremat objects are in more general use than the
binary format makemat requires. It is hoped that direct manipulation of
scoremat objects will encourage conversion of more diverse sequence
collections into RPS Blast databases.

Databases generated by formatrpsdb are binary compatible with databases
generated by formatdb/makemat/copymat, although the database files will
in general not be byte-for-byte identical.

Relevant Documents
------------------

Information on RPS Blast, as well as instructions for creating RPS
Blast databases using formatdb/makemat/copymat:

    Information on rpsblast
    Information on formatdb
    Information on blastpgp

The ASN.1 specification for PssmWithParameters is available in the NCBI
C toolkit sources, in tools/scoremat.asn

Preconditions for Use
---------------------

This section assumes some familiarity with the documents previously
specified.

An RPS Blast database consists of two groups of files. The first group
is a standard protein database generated by formatdb (RPS Blast cannot
use nucleotide databases). The second group of files contains
precomputations used to speed up RPS Blast searches of the standard
protein database.
Previously, formatdb would build the first group of files, and
makemat/copymat would be used to build the second group (the 'RPS data
files'). As mentioned above, formatrpsdb performs all of these steps in
a single pass. However, the collection of sequences passed to
formatrpsdb must already be consistent in several important ways:

- All sequences must use the same protein alphabet.

- All scores in all PSSMs must be scaled by the same factor.

- If the scoremat does not contain a PSSM, it must contain a set of
  residue frequencies that formatrpsdb can use to create a PSSM
  manually. The PSSM creation process is identical to that performed
  by makemat, and requires a scaling factor, gap existence and
  extension penalties, and an underlying score matrix. These must be
  provided as command line options to formatrpsdb, or each scoremat
  can contain one or more of these values (which will be used in place
  of the values specified as input arguments). If a sequence contains
  both a PSSM and residue frequencies, the latter will be ignored (see
  the command line options below).

Regarding the last requirement, a collection of sequences passed to
formatrpsdb may include a mixture of sequences for which a PSSM is
available and sequences for which only the residue frequencies are
available. The present version of formatrpsdb requires that all
parameters (scale factor, gap open/extend, underlying score matrix),
whether appearing within a scoremat or supplied from the command line,
be the same for all sequences.

Prebuilt collections of sequences that satisfy these criteria are
available from NCBI, along with tools capable of building compliant
sequence files. Further, blastpgp is capable of reading and writing
scoremat files containing residue frequencies.
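
The requirements above concern the contents of the scoremats. A simpler
mechanical check, that every file named in the list given to
formatrpsdb exists and is non-empty, can be made beforehand. The
following is a minimal shell sketch of such a check. The function name
check_scoremat_list is hypothetical and is not part of any NCBI
distribution; skipping blank lines and resolving names relative to the
current directory are assumptions of this sketch, not documented
formatrpsdb behaviour.

```shell
#!/bin/sh
# check_scoremat_list LIST   (hypothetical helper, not an NCBI tool)
# LIST holds one scoremat filename per line.
# Returns 0 if every listed file exists and is non-empty,
#         1 if at least one entry is missing or empty,
#         2 if LIST itself cannot be read.
check_scoremat_list() {
    list=$1
    bad=0
    if [ ! -r "$list" ]; then
        echo "cannot read list file: $list" >&2
        return 2
    fi
    # The '|| [ -n "$name" ]' part keeps a final line that lacks a newline.
    while IFS= read -r name || [ -n "$name" ]; do
        # Assumption: blank lines are ignored.
        if [ -z "$name" ]; then
            continue
        fi
        if [ ! -s "$name" ]; then
            echo "missing or empty scoremat file: $name" >&2
            bad=1
        fi
    done < "$list"
    return "$bad"
}
```

A possible use is 'check_scoremat_list list && formatrpsdb -i list', so
that formatrpsdb is only started when every listed file is present.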

Command Line Options
--------------------

A list of the command line options and the current version for
formatrpsdb may be obtained by executing formatrpsdb without options,
as in:

    formatrpsdb -

The formatrpsdb options are listed below:

  -t  Title for database file [String]  Optional

      This will be printed by utilities like fastacmd as the title of
      the generated database.

  -i  Input file containing list of ASN.1 Scoremat filenames [File In]

      Each Scoremat file contains the score matrix (or residue
      frequencies) and identification data for a single sequence.
      Filenames should appear one per line in this file, and the
      corresponding sequences will be added to the database in the
      order listed in this file. There are no restrictions on the
      filenames that appear in the list.

  -l  Logfile name: [File Out]  Optional
      default = formatrpsdb.log

      Status and error information will be written to this file.

  -o  Create index files for database [T/F]  Optional
      default = F

      If the "-o" option is TRUE and the sequence identifiers within
      each scoremat allow it, formatrpsdb will generate index files for
      the generated database. These will allow retrieval of individual
      sequences by utilities like fastacmd.

  -v  Database volume size in millions of letters [Integer]  Optional
      default = 0
      range from 0 to

      This option breaks up large collections of sequences into
      'volumes' (each with a maximum size of 1 billion letters).

  -b  Scoremat files are binary [T/F]  Optional
      default = F

      The scoremat ASN.1 format allows sequence data in human-readable
      text format or a more compact binary format. Setting this option
      to 'T' signals to formatrpsdb that all of the scoremat files
      listed in the file for '-i' option contain binary ASN.1 scoremat
      data. If set to 'F', scoremat files will all be treated as
      containing ASCII text ASN.1

  -f  Threshold for extending hits for RPS database [Real]  Optional
      default = 11.0

      Formatrpsdb builds a Blast lookup table while the database is
      being generated.
      This table indexes each input sequence for searches using RPS
      Blast. The argument to '-f' specifies the threshold value; groups
      of letters in any input sequence which score above this value are
      added to the lookup table. Note that fractional threshold values
      (e.g. '10.5') are allowed for this argument.

  -n  Base name of output database (same as input file if not
      specified) [String]  Optional

      By default, the database generated will consist of a collection
      of files whose prefix matches that of the filename specified in
      the '-i' option. To give the database files a different prefix,
      specify the desired string for this option.

  -S  For scoremats that contain only residue frequencies, the scaling
      factor to apply when creating PSSMs [Real]  Optional
      default = 100.0

      When given a scoremat file that does not contain a PSSM,
      formatrpsdb looks for a set of residue frequencies in the file,
      and attempts to create a PSSM using those residue frequencies.
      The creation process requires a scale factor for the computed
      scores, provided by this argument.

  -G  The gap opening penalty (if not present in the scoremat)
      [Integer]  Optional
      default = 11

  -E  The gap extension penalty (if not present in the scoremat)
      [Integer]  Optional
      default = 1

      If an input file does not contain gap opening and extension
      penalties, the values of these two arguments will be substituted.
      These are primarily intended for scoremat files that contain only
      residue frequencies.

  -U  Underlying score matrix (if not present in the scoremat)
      [String]  Optional
      default = BLOSUM62

      If an input file does not contain the name of the NCBI standard
      score matrix from which residue frequencies were derived, the
      matrix name specified by the -U option will be substituted. This
      is primarily intended for scoremat files that contain only
      residue frequencies.
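
As an illustration of how several of these options combine, the
following shell sketch chooses the value for '-b' by inspecting the
files named in the list and then prints (without running) a formatrpsdb
command line. The function names scoremat_encoding and build_rpsdb_cmd
are hypothetical. The sketch assumes that a text ASN.1 scoremat
contains the string 'PssmWithParameters ::=' and that a binary one does
not; this heuristic is an assumption and is not taken from the
formatrpsdb sources. It also assumes every listed file exists.

```shell
#!/bin/sh
# scoremat_encoding LIST   (hypothetical helper)
# Prints "text", "binary" or "mixed" for the files named in LIST.
# Assumption: text ASN.1 scoremats contain 'PssmWithParameters ::=';
# any file without that string is counted as binary.
scoremat_encoding() {
    n_text=0
    n_bin=0
    while IFS= read -r name || [ -n "$name" ]; do
        if [ -z "$name" ]; then
            continue
        fi
        if grep -q 'PssmWithParameters ::=' "$name"; then
            n_text=$((n_text + 1))
        else
            n_bin=$((n_bin + 1))
        fi
    done < "$1"
    if [ "$n_text" -gt 0 ] && [ "$n_bin" -gt 0 ]; then
        echo mixed
    elif [ "$n_bin" -gt 0 ]; then
        echo binary
    else
        echo text
    fi
}

# build_rpsdb_cmd LIST BASENAME   (hypothetical helper)
# Prints a formatrpsdb command line using only the -i, -b, -o and -n
# options described above. Fails if the list mixes text and binary
# files, since '-b' applies to all files in the list.
build_rpsdb_cmd() {
    list=$1
    base=$2
    enc=$(scoremat_encoding "$list")
    case $enc in
        text)   bflag=F ;;
        binary) bflag=T ;;
        *)      echo "list mixes text and binary scoremats: $list" >&2
                return 1 ;;
    esac
    echo "formatrpsdb -i $list -b $bflag -o T -n $base"
}
```

Printing the command instead of executing it allows it to be inspected
first; the printed line can then be run as is, with further options
such as '-t' or '-l' appended by hand.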

Examples of Use
---------------

Given a set of three sequence files 'scoremat1', 'scoremat2' and
'scoremat3', along with a text file 'list' consisting of the three
lines

    scoremat1
    scoremat2
    scoremat3

the command to create an RPS Blast database is

    formatrpsdb -i list

which creates the files

    list.pin  list.psq  list.phr  list.rps  list.loo  list.aux

The first three files are a standard non-indexed protein database, and
the last three are RPS data files. To index the database for retrieval
of individual sequences, use

    formatrpsdb -i list -o T

which will add the files

    list.pin  list.psd  list.psi

To instead call this database 'mydb', use

    formatrpsdb -i list -o T -n mydb

which will create 'mydb.*' instead of 'list.*'

Additional Information and Help
-------------------------------

Please direct bug reports, inquiries for assistance, and requests for
new features to blast-help@ncbi.nlm.nih.gov

Last updated July 23 2004