Copyright 1990, 1994, 1995, 1996 Gerald Z. Hertz
May be copied for noncommercial purposes.
Author:
Gerald Z. Hertz
Dept. of Molecular, Cellular, and Developmental Biology
University of Colorado
Campus Box 347
Boulder, CO 80309-0347
hertz@boulder.colorado.edu
PATSER (version 3b)
This program scores the L-mers (subsequences of length L) of the
indicated sequences against the indicated alignment or weight matrix.
The elements of an alignment matrix are simply the number of times
that the indicated letter is observed at the indicated position of a
sequence alignment. Such elements must be processed before the matrix
can be used to score an L-mer (e.g., Hertz et al., 1990, CABIOS, 6:81-92).
A weight matrix is a matrix whose elements are in a form
considered appropriate for scoring an L-mer.
Each element of an alignment matrix is converted to an element of a
weight matrix by first adding pseudo-counts in proportion to the a
priori probability of the corresponding letter (see option "-b" in
section 1 below) and dividing by the total number of sequences plus
the total number of pseudo-counts. The resulting frequency is
normalized by the a priori probability for the corresponding letter.
The final quotient is converted to an element of a weight matrix by
taking its natural logarithm. The use of pseudo-counts here differs
from previous versions of this program by being proportional to the a
priori probability.
Version 3 of this program differs from previous versions by also
numerically estimating the p-value of the scores. The p-value
calculated here is the probability of observing a particular score or
higher at a particular sequence position and does NOT account for the
total amount of sequence being scored. P-values are estimated by the
method described in Staden, 1989, CABIOS, p. 89--96. The relative
value for each element of the weight matrix is approximated by
integers in a range determined by the "-R" and "-M" options (section 6
below). The p-value is calculated for each possible integer score and
the values are stored. The actual scores for the sequences are
determined from the true weight matrix. The true scores are converted
to their corresponding integer values and their p-values are looked up.
Matrices can be either horizontal or vertical. In a horizontal
matrix, the columns correspond to the positions within the pattern,
and the rows correspond to the letters. Each row begins with the
corresponding letter (or integer, if the "-i" option is used). In a
vertical matrix, the rows correspond to the positions within the
pattern, and the columns correspond to the letters. The first row
contains the letters (or integers, if the "-i" option is used)
corresponding to each column. In both types of matrices, spaces,
tabs, and vertical bars (|) are ignored. The output of the "consensus"
and "wconsensus" programs consists of horizontal alignment matrices.
The input files can contain comments according to the following
convention. The portion of a line following a ';', '%', or '#' is
considered a comment and is ignored. Comments can begin anywhere in a
line and always end at the end of the line. The output of this
program is sent to the standard output.
The following options can be determined on the command line.
0) -h: print these directions.
1) Matrix options.
-m filename: (default name is "matrix") file containing the matrix.
-w: the matrix is a weight matrix (default: alignment matrix)
-b number: a non-negative number indicating the total number of
pseudo-counts added to each alignment position (default: 1).
Before converting an alignment matrix to a weight matrix, the
total pseudo-counts multiplied by the a priori probability
(see section 3 below) of the corresponding letter is added
to each matrix element.
-v: the matrix is a vertical matrix (default: horizontal matrix).
-p: print the weight matrix derived from the alignment matrix.
2) -f filename: this file (default: read from the standard input) contains
the names of the sequences. The corresponding sequence may follow
its name if the sequence is enclosed between backslashes (\).
Otherwise, the sequence is assumed to be in a separate file having
the indicated name.
In the sequences, whitespace, slashes (/), periods, dashes (unless
part of an integer when the "-i" option is used), and comments
beginning with ';', '%', or '#' are ignored. When using letter
characters (i.e., with the "-a" or "-A" alphabet option), integers
are also ignored so that the sequence file can contain positional
information. When using integer characters (i.e., with the "-i"
alphabet option) the integers must be separated by whitespace.
A "-c" preceding the name of a sequence file indicates that the
corresponding sequence is circular.
3) Alphabet options---the three options in this section are mutually
exclusive (default: "-a alphabet"). The a priori probabilities mentioned
below are used when converting an alignment matrix to a weight matrix.
-a filename: file containing the alphabet and normalization information.
Each line contains a letter (a symbol in the alphabet) followed by an
optional normalization number (default: 1.0). The normalization is
based on the relative a priori probabilities of the letters. For
nucleic acids, this might be be the genomic frequency of the bases
or the frequencies observed in the data used to generate the alignment.
In nucleic acid alphabets, a letter and its complement appear on the
same line, separated by a colon (a letter can be its own complement,
e.g. when using a dimer alphabet). Complementary letters may use the
same normalization number. Only the standard 26 letters are
permissible; however, when the "-CS" option is used, the alphabet is
case sensitive so that a total of 52 different characters are possible.
POSSIBLE LINE FORMATS WITHOUT COMPLEMENTARY LETTERS:
letter
letter normalization
POSSIBLE LINE FORMATS WITH COMPLEMENTARY LETTERS:
letter:complement
letter:complement normalization
letter:complement normalization:complement's_normalization
-i filename: same as the "-a" option, except that the symbols of
the alphabet are represented by integers rather than by letters.
Any integer permitted by the machine is a permissible symbol.
-A alphabet_and_normalization_information: same as "-a" option, except
information appears on the command line (e.g., -A a:t 3 c:g 2).
4) Alphabet modifiers indicating whether ascii alphabets are case
sensitive---the two options in this section are mutually exclusive
with each other and with the "-i" option (default: ascii alphabets are
case insensitive).
-CS: ascii alphabets are case sensitive.
-CM: ascii alphabets are case insensitive, but mark the location
of lowercase letters by printing a line containing their locations.
This option is useful when lowercase letters indicate a functional
landmark such as a transcriptional start in a DNA sequence.
5) Options for adjusting or restricting which scores are printed.
-c: also score the complementary sequences. The complements are
determined by the program and are not explicitly stated in the
sequence files.
-l number: lower threshold for printing scores, inclusive.
-u number: upper threshold for printing scores, exclusive.
-t: just print the top score for each sequence.
6) Options indicating how unrecognized symbols are treated (default: -d1).
Symbols are letters when option "-a" or "-A" is used;
symbols are integers when option "-i" is used.
The following three options are mutually exclusive.
-d0: treat unrecognized symbols as errors and exit the program.
-d1: treat unrecognized symbols as discontinuities, but print a warning.
Treating a symbol as a discontinuity means that any L-mer
containing the unrecognized symbol will be ignored.
-d2: treat unrecognized symbols as discontinuities, and print NO warning.
7) Options for adjusting the estimation of p-value.
If the -R option is set to zero, the p-value is not estimated.
-R number: the range for approximating a column of the weight matrix with
integers (default: 10000). This number is the difference
between the largest and smallest integers used to estimate
the scores. Higher values increase precision, but will take
longer to calculate the table of p-values.
-M number: the minimum score for approximating p-values (default: 0).
Higher values will increase precision,
but may miss interesting scores.