Name
Description
Performs inter-conversions between various formats of
position-specific scoring matrices (PSSM).
The program also performs a statistical analysis of the original
matrix to provide different position-specific scores (weight,
frequencies, information contents), general statistics (E-value, total
information content), and synthetic descriptions (consensus).
A PSSM can be used to represent the binding specificity of a transcription
factor or the conserved residues of a protein domain.
Each row of the matrix corresponds to one residue (nucleotide or
amino-acid depending on the sequence type). Each column corresponds
to one position in the alignment. The value within each cell
represents the frequency of the corresponding residue at that position.
INPUT/OUTPUT FORMATS
Some formats are supported only for input, others only for output. More
formats are accepted for input, because the typical use of this
program is to convert a PSSM obtained from a database (e.g. TRANSFAC)
or from a pattern-discovery program (e.g. consensus, gibbs, meme,
MotifSampler, ...) into a matrix suitable either for scanning (with
matrix-scan) or for computing statistical parameters (see the return
fields below).
tab (input/output)
Tab-delimited file. One row per residue, one column per position. The first
column of each row indicates the residue, the following columns
give the frequency of that residue at the corresponding position
of the matrix.
The tab format accepts a user-specified set of return fields (option
-return), providing different statistics on the matrix (counts,
frequencies, weights, information, other parameters: see description
below).
; MET4 matrix, from Gonze et al. (2005). Bioinformatics 21, 3490-500.
A | 7 9 0 0 16 0 1 0 0 11 6 9 6 1 8
C | 5 1 4 16 0 15 0 0 0 3 5 5 0 2 0
G | 4 4 1 0 0 0 15 0 16 0 3 0 0 2 0
T | 0 2 11 0 0 1 0 16 0 2 2 2 10 11 8
//
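As an illustration, a matrix in this layout can be read with a few lines of Python. This is a minimal sketch, not the program's own parser: the function name parse_tab_matrix is hypothetical, and it assumes that the residue is separated from the counts by a "|" character as in the example above, that lines starting with ";" are comments, and that "//" ends a matrix.

```python
def parse_tab_matrix(text):
    """Parse one tab-format count matrix into {residue: [count per position]}.

    Assumptions (not taken from the program itself): ';' starts a comment,
    '|' separates the residue from its counts, '//' ends the matrix.
    """
    counts = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(";"):
            continue                      # skip blank lines and comments
        if line.startswith("//"):
            break                         # end of the matrix
        residue, _, values = line.partition("|")
        counts[residue.strip()] = [int(v) for v in values.split()]
    return counts


MET4 = """\
; MET4 matrix, from Gonze et al. (2005). Bioinformatics 21, 3490-500.
A | 7 9 0 0 16 0 1 0 0 11 6 9 6 1 8
C | 5 1 4 16 0 15 0 0 0 3 5 5 0 2 0
G | 4 4 1 0 0 0 15 0 16 0 3 0 0 2 0
T | 0 2 11 0 0 1 0 16 0 2 2 2 10 11 8
//
"""

matrix = parse_tab_matrix(MET4)
width = len(matrix["A"])
# Sum of the counts in each column (number of sites contributing to it).
column_totals = [sum(row[j] for row in matrix.values()) for j in range(width)]
print(width, column_totals)
```

In this example every column sums to the same total, as expected for a matrix built from a fixed set of aligned sites.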
patser (output)
This format can be used as input to scan sequences with
patser, the pattern-matching program developed by Jerry Hertz.
This is actually the same format as tab (described above), but
the only return field is the count matrix.
assembly (input)
Output file from the program pattern-assembly. One assembly
file can contain zero, one or several assemblies. Each
assembly is converted to a position-specific scoring matrix by
taking, for each residue at each position, the score of the
most significant pattern (oligonucleotide) containing that
residue in this position of the assembly.
consensus (input/output)
Output file from consensus, the pattern-discovery program
developed by Jerry Hertz (Hertz et al., Comput Appl Biosci,
1990:6, 81-92). This file contains one or several matrices, plus
additional information on the parameters used for pattern
discovery (e.g. prior residue frequencies).
gibbs (input)
Output file from gibbs, the pattern-discovery program
developed by Andrew Neuwald (Lawrence et al. Science, 1993:
262, 208-214; Neuwald, et al. Protein Sci, 1995: 4, 1618-1632)
meme (input)
Output file from MEME, the pattern-discovery program developed by
Tim Bailey. This file contains one or several matrices, plus
additional information on the parameters used for pattern
discovery (e.g. prior residue frequencies).
MotifSampler (input/output)
Output file from MotifSampler, the pattern-discovery program
developed by Gert Thijs (Thijs et al. Bioinformatics, 2001:17,
1113-1122).
TRANSFAC (input/output)
Format used in the TRANSFAC database;
(http://www.gene-regulation.com/pub/databases.html)
cb (input)
Cluster-Buster output file (usual extension .cb). The header
line starts with a > (like in fasta format). The matrix is
then printed "vertically" on the following lines: each column
corresponds to one residue, and each row to a position in the
alignment.
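Reading such a "vertical" matrix amounts to transposing it into the one-row-per-residue layout used elsewhere in this document. The sketch below is only an illustration: the function name parse_cb is hypothetical, the column order A, C, G, T is an assumption, and the sample input is made up for the example (it is not taken from a real Cluster-Buster file).

```python
def parse_cb(text, alphabet="ACGT"):
    """Read 'vertical' matrices: one header line starting with '>', then
    one line per position with one value per residue.

    The residue order of the columns (alphabet) is an assumption.
    Returns {matrix name: {residue: [value per position]}}.
    """
    matrices = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()       # header line, fasta-like
            matrices[name] = {r: [] for r in alphabet}
        elif name is not None:
            # One row per position: append each value to its residue's row.
            for residue, value in zip(alphabet, line.split()):
                matrices[name][residue].append(float(value))
    return matrices


# Hypothetical input, made up for this illustration.
SAMPLE = """\
>example_motif
7 5 4 0
9 1 4 2
0 4 1 11
"""

parsed = parse_cb(SAMPLE)
print(parsed["example_motif"])
```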
feature (input)
Output file from convert-features.
This format makes it possible to obtain a PSSM from a list of (supposedly
pre-aligned) sites. These sites can themselves have been
collected by scanning sequences with a matrix (matrix-scan) or
by searching string-based patterns in a sequence
(dna-pattern).
Converting features to matrices can for example be useful for
iterative refinement of a matrix (collecting sites with a
matrix, and building a new matrix from those sites).
Another application is to detect oligomers or dyads in a
sequence set, and build a matrix from these.
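Building a count matrix from pre-aligned sites can be sketched as follows. This is a simplified illustration, not the program's implementation: the function name count_matrix_from_sites and the example sites are hypothetical, and the sketch assumes DNA sites of equal length, ignoring any character outside the alphabet.

```python
def count_matrix_from_sites(sites, alphabet="ACGT"):
    """Count residue occurrences at each position of pre-aligned sites.

    Returns {residue: [count per position]}. Characters that are not in
    the alphabet (e.g. 'N') are ignored, which is an assumption.
    """
    if not sites:
        raise ValueError("at least one site is required")
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("sites must be pre-aligned and of equal length")
    counts = {residue: [0] * width for residue in alphabet}
    for site in sites:
        for j, residue in enumerate(site.upper()):
            if residue in counts:
                counts[residue][j] += 1
    return counts


# Hypothetical sites, made up for this illustration.
sites = ["TCACGTGA", "GCACGTGA", "TCACGTGC"]
site_counts = count_matrix_from_sites(sites)
for residue, row in site_counts.items():
    print(residue, row)
```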
clustal (input)
Output file from the popular multiple alignment program clustalw.
RETURN FIELDS FOR THE TAB-DELIMITED OUTPUT FORMAT
- counts
-
Each cell of the matrix indicates the number of occurrences of the
residue at a given position of the alignment.
- profile
-
The matrix is printed vertically (each matrix column becomes a row in
the output text). Additional parameters (consensus, information) are
indicated beside each position, and a histogram is drawn.
- crude frequencies
-
Relative frequencies are calculated as the counts of residues divided
by the total count of the column.
-
Fij=Cij/SUMi(Cij)
-
where
- Cij
-
is the absolute frequency (counts) of residue i at position j of the alignment
- Fij
-
is the relative frequency of residue i at position j of the alignment
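The calculation of crude frequencies can be illustrated with a short Python sketch. The function name crude_frequencies is hypothetical, the counts are the first three columns of the MET4 matrix shown earlier, and the sketch assumes that no column is empty.

```python
def crude_frequencies(counts):
    """Fij = Cij / SUMi(Cij), computed column by column.

    counts maps each residue to its list of counts per position.
    Assumes every column has a non-zero total.
    """
    width = len(next(iter(counts.values())))
    totals = [sum(row[j] for row in counts.values()) for j in range(width)]
    return {
        residue: [row[j] / totals[j] for j in range(width)]
        for residue, row in counts.items()
    }


# First three columns of the MET4 matrix.
counts = {"A": [7, 9, 0], "C": [5, 1, 4], "G": [4, 4, 1], "T": [0, 2, 11]}
freqs = crude_frequencies(counts)
print(freqs["A"][0])  # 7 / 16
```

Within each column the frequencies sum to 1. A residue that was never observed at a position gets a frequency of exactly 0.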
- frequencies corrected with pseudo-weights
-
Relative frequencies can be corrected by a pseudo-weight (b) to reduce
the bias due to the small number of observations.
-
F''ij=(Cij+b*Pi)/[SUMi(Cij)+b]
-
where
- Pi
-
is the prior frequency for residue i
- b
-
is the pseudo-weight, which is "shared" between residues according to
their prior frequencies.
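The effect of the pseudo-weight can be seen in a small Python sketch, reading the formula as (Cij + b*Pi) / (SUMi(Cij) + b). The function name corrected_frequencies is hypothetical, and the values b = 1 and uniform priors of 0.25 are chosen only for the illustration, not taken from the program's defaults.

```python
def corrected_frequencies(counts, priors, pseudo=1.0):
    """F''ij = (Cij + b*Pi) / (SUMi(Cij) + b), with b = pseudo.

    counts maps each residue to its counts per position; priors maps each
    residue to its prior frequency Pi.
    """
    width = len(next(iter(counts.values())))
    totals = [sum(row[j] for row in counts.values()) for j in range(width)]
    return {
        residue: [
            (row[j] + pseudo * priors[residue]) / (totals[j] + pseudo)
            for j in range(width)
        ]
        for residue, row in counts.items()
    }


# First column of the MET4 matrix; priors and pseudo-weight are illustrative.
counts = {"A": [7], "C": [5], "G": [4], "T": [0]}
priors = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
corrected = corrected_frequencies(counts, priors, pseudo=1.0)
print(corrected["T"][0])  # (0 + 0.25) / (16 + 1)
```

The corrected frequencies still sum to 1 in each column, provided the priors sum to 1, and an unobserved residue (T here) no longer has a frequency of zero.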
- weights
-
Weights are calculated according to the formula from Hertz (1999), as
the natural logarithm of the ratio between the relative frequency
(corrected for pseudo-weights) and the prior residue probability.
-
Wij=ln(F''ij/Pi)
- information
-
The crude information content is calculated according to the formula
from Hertz (1999).
-
Iij = Fij*ln(Fij/Pi)
-
In addition, we calculate a "corrected" information content which
takes pseudo-weights into account.
-
I''ij = F''ij*ln(F''ij/Pi)
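Both variants can be compared on one column with the Python sketch below. The function names are hypothetical, the pseudo-weight and priors are illustrative, and the sketch assumes the usual convention that a cell with a frequency of 0 contributes 0 (the limit of F*ln(F/P) when F tends to 0), which the text above does not state explicitly.

```python
import math


def cell_information(freq, prior):
    """F * ln(F / P), taken as 0 when F is 0 (assumed convention)."""
    return 0.0 if freq == 0 else freq * math.log(freq / prior)


def column_information(column_counts, priors, pseudo=0.0):
    """Information per residue for one column.

    pseudo=0 gives the crude information Iij; pseudo>0 gives the
    corrected information I''ij.
    """
    total = sum(column_counts.values())
    info = {}
    for residue, count in column_counts.items():
        freq = (count + pseudo * priors[residue]) / (total + pseudo)
        info[residue] = cell_information(freq, priors[residue])
    return info


# First column of the MET4 matrix; priors and pseudo-weight are illustrative.
column = {"A": 7, "C": 5, "G": 4, "T": 0}
priors = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
crude = column_information(column, priors)
corrected = column_information(column, priors, pseudo=1.0)
print(crude["T"], corrected["T"])
```

Note that a cell contributes negative information whenever the residue is less frequent than expected from its prior.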
- P-value
-
The P-value indicates the probability of observing at least Cij
occurrences of a residue at a given position of the matrix. It is
calculated with the binomial formula:
-
Pij = SUM(k=Cij..C.j) [C.j!/(k!(C.j-k)!)] * Pi^k * (1-Pi)^(C.j-k)
-
where
- Cij
-
is the number of occurrences of residue i at position j of
the matrix.
- C.j
-
is the sum of all residue occurrences at position j of the
matrix.
- Pi
-
is the prior probability of residue i.
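The binomial tail can be computed directly, as in the Python sketch below. The function name binomial_pvalue is hypothetical; the sum runs from k = Cij to k = C.j, with C.j trials and success probability Pi, and the prior of 0.25 is an illustrative value.

```python
from math import comb


def binomial_pvalue(c_ij, c_j, p_i):
    """Probability of at least c_ij occurrences among c_j observations,
    each being residue i with probability p_i (upper binomial tail)."""
    return sum(
        comb(c_j, k) * p_i**k * (1 - p_i) ** (c_j - k)
        for k in range(c_ij, c_j + 1)
    )


# Column 5 of the MET4 matrix: A observed 16 times out of 16 sites.
print(binomial_pvalue(16, 16, 0.25))   # equals 0.25 ** 16
# Column 1 of the MET4 matrix: A observed 7 times out of 16 sites.
print(binomial_pvalue(7, 16, 0.25))
```

For a perfectly conserved position the sum reduces to the single term Pi^C.j, and for Cij = 0 the P-value is 1.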
- parameters
-
Returns a series of parameters associated with the matrix. The list of
parameters to be exported depends on the input format (each pattern
discovery program returns specific parameters, which are more or less
related to each other but not identical).
-
Some additional parameters are optionally calculated:
- consensus
-
The degenerate consensus is calculated by collecting, at each
position, the list of residues with a positive weight. Unlike in
most applications, this consensus is thus weighted by prior residue
frequencies: a residue with a high frequency might not be represented
in the consensus if this frequency does not significantly exceed the
expected frequency. Uppercase letters are used to highlight weights >= 1.
-
The consensus is exported as a regular expression, and with the IUPAC
code for ambiguous nucleotides (http://www.chem.qmw.ac.uk/iupac/misc/naseq.html).
-
A (Adenine)
C (Cytosine)
G (Guanine)
T (Thymine)
R = A or G (puRines)
Y = C or T (pYrimidines)
W = A or T (Weak hydrogen bonding)
S = G or C (Strong hydrogen bonding)
M = A or C (aMino group at common position)
K = G or T (Keto group at common position)
H = A, C or T (not G)
B = G, C or T (not A)
V = G, A or C (not T)
D = G, A or T (not C)
N = G, A, C or T (aNy)
-
The strict consensus indicates, at each position, the residue with the
highest positive weight.
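The two kinds of consensus can be sketched in Python as follows. This is an illustration under several assumptions that the description above leaves open: the function name consensus is hypothetical, a position is written in uppercase when its highest weight is >= 1 and in lowercase otherwise, the same case rule is applied to the strict consensus, and a position without any positive weight is written "n". The weights in the example are made up.

```python
# IUPAC codes, keyed by the alphabetically sorted set of residues.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AG": "R", "CT": "Y", "AT": "W", "CG": "S", "AC": "M", "GT": "K",
    "ACT": "H", "CGT": "B", "ACG": "V", "AGT": "D", "ACGT": "N",
}


def consensus(weights, strict=False):
    """Build a consensus from a weight matrix {residue: [weight per position]}.

    Degenerate mode: IUPAC code of all residues with a positive weight.
    Strict mode: the residue with the highest positive weight.
    Case rule and the 'n' fallback are assumptions of this sketch.
    """
    width = len(next(iter(weights.values())))
    letters = []
    for j in range(width):
        positive = {r: w[j] for r, w in weights.items() if w[j] > 0}
        if not positive:
            letters.append("n")           # no residue exceeds its prior
            continue
        best = max(positive, key=positive.get)
        code = best if strict else IUPAC["".join(sorted(positive))]
        letters.append(code.upper() if positive[best] >= 1 else code.lower())
    return "".join(letters)


# Hypothetical weights, made up for this illustration.
weights = {
    "A": [1.2, 0.3, -2.0],
    "C": [-1.0, -0.5, -2.0],
    "G": [-2.0, 0.4, -2.0],
    "T": [-2.0, -1.0, -0.1],
}
print(consensus(weights))               # degenerate consensus
print(consensus(weights, strict=True))  # strict consensus
```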
- information
-
The total information is calculated by summing the information content
of all the cells of the matrix. This parameter is already returned by
the program consensus (Hertz), but not by other programs.
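As a worked example, the total information of a small matrix can be computed with the Python sketch below. The function name total_information is hypothetical, the matrix and the uniform priors are made up, and the sketch assumes that crude frequencies are summed and that cells with a frequency of 0 contribute 0; the program may instead sum the corrected values.

```python
import math


def total_information(counts, priors):
    """Sum of Fij * ln(Fij / Pi) over all cells of a count matrix.

    Uses crude frequencies; cells with a zero count contribute 0
    (both choices are assumptions of this sketch).
    """
    width = len(next(iter(counts.values())))
    total = 0.0
    for j in range(width):
        column_total = sum(row[j] for row in counts.values())
        for residue, row in counts.items():
            if row[j] > 0:
                freq = row[j] / column_total
                total += freq * math.log(freq / priors[residue])
    return total


# Made-up matrix: three perfectly conserved positions (A, C, G), 10 sites.
conserved = {"A": [10, 0, 0], "C": [0, 10, 0], "G": [0, 0, 10], "T": [0, 0, 0]}
priors = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
print(total_information(conserved, priors))  # 3 * ln(4)
```

With uniform priors, each perfectly conserved position contributes ln(4), and a position where all four residues are equally frequent contributes 0.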
- logo
-
Sequence logo, a visual representation of the motif, where each column
of the matrix is represented as a stack of letters whose size is
proportional to the corresponding residue frequency. The total height
of each column is proportional to its information content.
Sequence logos are generated using the freeware
program WebLogo.