---
output: pdf_document
---
[RSAT](RSAT_home.cgi) - [oligonucleotide analysis](oligo-analysis_form.cgi) manual
----------------------------------------------------------------------------------

#### Name

#### Description

#### Options

**Input sequence:**\
 The sequence that will be analyzed. Multiple sequences can be entered
at once with several [sequence formats](help.formats.html).

**Format:**\
 Input sequence format. [Various standards](help.formats.html) are
supported.

**Sequence type:**\
 Input sequence type

-   DNA (default)\
     Only A, C, G, and T residues are accepted. Oligomers that contain
    undefined (N) or partly defined (IUPAC code) nucleotides are
    discarded from the counts.
-   protein\
     Oligopeptide analysis instead of oligonucleotide analysis. This
    inactivates the grouping of oligomers with their reverse
    complements, and modifies the alphabet size.
-   other\
     Any type of letter found in the input sequence is considered
    valid. This allows the analysis of texts in human language.

**Purge sequences (highly recommended)**\
 When checked, large duplicated regions (\>= 40 bp alignment with fewer
than 3 mismatches) are filtered out before analysis. Purging is
essential for any motif discovery process, to avoid a bias due to
non-independence of sequences. Purging is performed with the programs
`mkvtree` and `vmatch` developed by [Stefan
Kurtz](http://www.zbh.uni-hamburg.de/kurtz)
([kurtz@zbh.uni-hamburg.de](mailto:kurtz@zbh.uni-hamburg.de)).

**Oligonucleotide size:**\
 The analysis can be performed with oligonucleotides of any size between
1 and 8. Selecting size 1 amounts to counting the alphabet utilization
within the input sequences. For the detection of regulatory sites, we
recommend starting with an analysis of hexanucleotides (size=6), and
then scanning sizes between 4 and 8. When a pattern is significantly
overrepresented, it generally shows up in the analyses at several
sizes.
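
The counting step itself can be sketched in a few lines of Python. This is a minimal illustration, not the RSAT implementation: the function name `count_oligos` and its parameters are hypothetical, and the sketch assumes DNA input, where any window containing a letter other than A, C, G, or T is skipped.

```python
from collections import Counter

DNA_ALPHABET = set("ACGT")

def count_oligos(sequences, size):
    """Count every oligonucleotide of the given size on a single strand.

    Windows containing N or IUPAC ambiguity codes are skipped, as for the
    DNA sequence type. Overlapping occurrences are all counted.
    """
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for start in range(len(seq) - size + 1):
            word = seq[start:start + size]
            if set(word) <= DNA_ALPHABET:
                counts[word] += 1
    return counts

if __name__ == "__main__":
    seqs = ["GATAAGNGATAAG", "ccgataag"]
    print(count_oligos(seqs, 1))  # size 1: residue usage
    print(count_oligos(seqs, 6))  # hexanucleotides
```

With size 1 the result is simply the residue composition of the input; larger sizes slide a window of that width along each sequence.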

**Count on:** (single or both strands)\
 By selecting "both strands", the occurrences of each oligonucleotide
are summed on both strands. This makes it possible to detect elements
which act in an orientation-insensitive way (as is generally the case
for yeast upstream elements).

**Group reverse complement together in the output**\
 (only valid for two-strand analysis). This parameter does not affect
the counting itself, only the output format. If this option is
NOT checked, two separate lines are used to show a word and its reverse
complement. This is redundant but might be useful for compatibility with
other programs.
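
As an illustration of the two options above, the following Python sketch sums single-strand counts over both strands and then groups each word with its reverse complement. It is not the RSAT code: the function names, the `word|revcomp` label format, and the choice to count reverse-palindromic words (such as `AAGCTT`) only once are assumptions made for the example.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(oligo):
    """Reverse complement of an upper-case DNA word."""
    return oligo.translate(COMPLEMENT)[::-1]

def sum_both_strands(single_strand_counts):
    """Add to each word the count of its reverse complement.

    Assumption: a word that is its own reverse complement is counted once.
    """
    both = Counter()
    for word, n in single_strand_counts.items():
        rc = reverse_complement(word)
        both[word] += n
        if rc != word:
            both[rc] += n
    return both

def group_reverse_complements(both_strand_counts):
    """Collapse each word and its reverse complement into one output line."""
    grouped = {}
    for word, n in both_strand_counts.items():
        first, second = sorted((word, reverse_complement(word)))
        grouped[f"{first}|{second}"] = n
    return grouped

if __name__ == "__main__":
    single = {"GATAAG": 3, "CTTATC": 1, "AAGCTT": 2}
    both = sum_both_strands(single)
    print(both)
    print(group_reverse_complements(both))
```

Grouping changes only how many lines are printed: each member of a pair already carries the same two-strand count before it is collapsed.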

**Prevent overlapping matches**\
 Periodic patterns (e.g. AAAAAA, ATATAT) have an aggregative tendency,
i.e. each occurrence of such a pattern strongly favors additional
occurrences in its immediate vicinity. This introduces a bias to most
statistics (binomial, log-likelihood). A simple way to correct for this
bias is to avoid counting mutually overlapping occurrences more than
once.\
 For example, `TATATATATATA` would represent

-   2 occurrences of `TATATA` when self-overlap is prevented
-   4 occurrences of `TATATA` when self-overlap is allowed

Note that the Z-score introduces a correction for self-overlapping
patterns (see van Helden et al., 1999), but Z-scores are only valid for
very large sequence sets (for example a set of 6000 downstream
sequences), and are not appropriate for small gene clusters such as
those extracted from DNA chip experiments.
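
The difference between the two counting modes can be explored with a short Python sketch. The function name `count_occurrences` is hypothetical, and the sketch assumes that overlap is prevented by scanning from left to right and skipping past each accepted match; the actual program may resolve overlaps differently.

```python
def count_occurrences(sequence, pattern, allow_overlap=True):
    """Count occurrences of `pattern` in `sequence` on a single strand.

    When overlap is not allowed, the scan jumps past each match, so no
    two counted occurrences share a position (greedy, left to right).
    """
    count = 0
    position = 0
    last_start = len(sequence) - len(pattern)
    while position <= last_start:
        if sequence.startswith(pattern, position):
            count += 1
            position += 1 if allow_overlap else len(pattern)
        else:
            position += 1
    return count

if __name__ == "__main__":
    seq = "TATATATATATA"
    print(count_occurrences(seq, "TATATA", allow_overlap=True))
    print(count_occurrences(seq, "TATATA", allow_overlap=False))
```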

**Background model**\
 Various probabilistic models can be used to estimate the expected
frequency of each oligonucleotide.

-   **Predefined background frequencies:** Compare oligo frequencies
    observed in the query sequence to those of a reference sequence (the
    background model).

    Pre-calculated tables are used to estimate expected oligonucleotide
    frequencies (background frequencies). These tables were obtained by
    counting all oligonucleotide frequencies (from size 1 to 8) in
    different sequence types, for each organism.

    -   **upstream**: all upstream regions, allowing overlap with
        upstream ORFs.
    -   **upstream-noorf**: all upstream regions, preventing overlap
        with upstream ORFs (sequences are clipped to discard upstream
        ORF sequences).
-   **Markov model:** expected word (oligonucleotide) frequencies are
    calculated on the basis of the subword frequencies observed in the
    input sequence set. This calculation takes into account the higher
    order dependencies between neighbouring residues.

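    For example, with a Markov chain of order 4, the expected frequency
    of the hexanucleotide GATAAG is

    $$P(\mathrm{GATAAG}) = P(\mathrm{GATA}) \cdot P(\mathrm{A} \mid \mathrm{GATA}) \cdot P(\mathrm{G} \mid \mathrm{ATAA})$$

    Thus

    $$P(\mathrm{GATAAG}) = \frac{P(\mathrm{GATAA}) \cdot P(\mathrm{ATAAG})}{P(\mathrm{ATAA})}$$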

    For words of size k, the highest possible order is k-2. A Markov
    order of 0 amounts to using observed residue frequencies for
    calculating expected oligomer frequencies (no dependency between
    neighbouring residues).

    The higher the Markov order, the more stringent the analysis:
    specificity is increased, but there is a loss of sensitivity, i.e.
    some relevant patterns might be overlooked. The optimal Markov order
    depends on the size of the sequence set. For small gene families
    (e.g. 10 sequences of 800 bp), taking an order \> 1 would result in a
    loss of sensitivity. For sequence sets of 1 Mb, a Markov chain of
    order 3 is optimal for hexanucleotides.

-   **Lexicon partitioning:** Expected word frequencies are calculated
    on the basis of subword frequencies, in a similar (but not
    identical) way to the "dictionary" approach developed by Harmen
    Bussemaker. Each word is segmented in 2 subwords in all possible
    ways:

                    GATAAG  G & ATAAG
                            GA & TAAG
                            GAT & AAG
                            GATA & AG
                            GATAA & G

    The expected frequency of each segmented pair is the product of
    expected frequencies of its members. The expected word frequency is
    the maximum expected pair frequency.

-   **Residue frequencies from input sequence:** (Note: this is
    equivalent to a Markov chain with order 0).
-   **Equiprobable residues:** This option gives very poor results and
    should never be used in practice. I leave it there only for didactic
    purposes (to allow anyone to test how badly it performs).
-   **Upload your own expected frequency file:**

    You can upload your own table of expected frequencies. This option
    can be useful if you are working with an organism which is not
    supported on the web server.

    **File format:** The expected frequency file must be a tab-delimited
    text file, with one row per oligonucleotide. The first column
    contains the oligonucleotide, the second column the expected
    frequency. Oligonucleotides must be of the size selected for the
    analysis.
    [Examples](data/genomes/Saccharomyces_cerevisiae/oligo-frequencies)
    can be found in the [Data folder](data/).

    **How to generate an expected frequency file?**\
     An expected frequency file can be generated with oligo-analysis
    itself.

    -   Enter your background sequence file with the *sequence file
        upload* option.
    -   Select the appropriate *oligonucleotide size*.
    -   Select the option ***Count on single strand***. This option has
        to be selected, even if you count on both strands for the
        subsequent analyses. The expected frequency files must only
        contain the single-strand counts, and when required,
        oligo-analysis calculates double-strand frequencies by summing
        expected frequencies for each pair of reverse complements.
    -   In the *return* options, select "Frequencies", and **deselect**
        all other options (especially the probabilities).
    -   In the *return* options, set the threshold on significance to
        "none" (instead of the default value 0).
    -   Select email output and enter your email address.
    -   Run the analysis.
    -   When the result arrives in your mailbox, save the result as a
        text file on your hard drive.
    -   Check the format of the expected frequency file (you will
        probably need to edit the file to remove the email header).

    You can now use this file to specify your own expected frequencies
    for the analysis of other sequences.
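
The Markov and lexicon models described in this list can be illustrated with a short Python sketch. It rests on several assumptions: the function names are hypothetical, conditional probabilities are estimated as ratios of observed subword frequencies (a standard Markov chain estimate), and the lexicon variant uses observed subword frequencies for the two members of each pair. It should be read as a reading aid, not as the program's actual calculation.

```python
from collections import Counter

def subword_frequencies(sequences, max_size):
    """Relative frequency of every word of size 1..max_size (single strand)."""
    freq = {}
    for size in range(1, max_size + 1):
        counts = Counter(
            seq[i:i + size]
            for seq in sequences
            for i in range(len(seq) - size + 1)
        )
        total = sum(counts.values())
        for word, n in counts.items():
            freq[word] = n / total
    return freq

def markov_expected(word, freq, order):
    """Expected frequency of `word` under a Markov chain of the given order.

    Product of the frequencies of all subwords of size order+1, divided by
    the frequencies of the inner subwords of size order.
    """
    k = len(word)
    if not 0 <= order <= k - 2:
        raise ValueError("order must be between 0 and len(word) - 2")
    expected = 1.0
    for i in range(k - order):
        expected *= freq.get(word[i:i + order + 1], 0.0)
    if order > 0:
        for i in range(1, k - order):
            inner = freq.get(word[i:i + order], 0.0)
            if inner == 0.0:
                return 0.0
            expected /= inner
    return expected

def lexicon_expected(word, freq):
    """Maximum, over all two-part splits, of the product of subword frequencies."""
    return max(
        freq.get(word[:cut], 0.0) * freq.get(word[cut:], 0.0)
        for cut in range(1, len(word))
    )

if __name__ == "__main__":
    seqs = ["GATAAGATAAGGATAACGATAAG", "TTGATAAGCCATAAGA"]
    freq = subword_frequencies(seqs, 5)
    for order in range(0, 5):
        print(order, markov_expected("GATAAG", freq, order))
    print("lexicon", lexicon_expected("GATAAG", freq))
```

With order 0 the function reduces to the product of residue frequencies, which matches the "residue frequencies from input sequence" option.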

**Pseudo-frequency for the background model**\

**Return:**\
 Various measures of oligonucleotide distribution can be returned.

-   **Occurrences:** a simple count of the number of occurrences of each
    oligonucleotide. Overlapping matches are detected and summed in the
    counting.
-   **Frequencies:** relative frequencies, i.e. the number of
    occurrences of the oligonucleotide divided by the sum of occurrences
    for all oligonucleotides.
-   **Matching sequences:** the number of sequences from the input set
    which contain at least one occurrence of the oligonucleotide.
-   **Ratio:** observed/expected occurrence ratio. This ratio can be
    used as a rough indicator of over-representation, but it tends to
    overrate patterns with a very small number of expected occurrences.
    For instance, observing 1 occurrence when expecting 0.1 gives a
    very high ratio of 10, although this is quite likely to occur at
    random (probability \~10%). For comparison, observing 20
    occurrences when expecting 10 has a probability of \~0.3%,
    although the ratio is only 2!
-   **Proba:** probabilities. Different statistics are calculated (see
    below for [details of calculation](#proba)).
    -   **Expected occurrences (exp\_occ):** the number of occurrences
        expected for the considered oligonucleotide within the set of
        sequences. The calculation of this value depends on the
        probabilistic model selected by the user (see above).
    -   **Occurrence probability (occ\_pro):** the probability to have N
        or more occurrences, given the expected number of occurrences
        (where N is the observed number of occurrences).
    -   **Expected matching sequences (exp\_ms):** the expected number
        of sequences with at least one occurrence.
    -   **Matching sequence probability (ms\_pro):** the probability to
        have L or more sequences with at least one occurrence of the
        oligonucleotide, given the probabilistic model (where L is the
        observed number of matching sequences).
    -   **Significance index (sig):** this is a conversion of the
        occurrence probability, taking into account the number of
        possible oligonucleotides (which varies with oligo size) and
        doing a logarithmic transformation. The highest sig corresponds
        to the most overrepresented oligonucleotide. Sig values higher
        than 0 indicate overrepresentation.
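
The relation between the ratio, the occurrence probability, and the significance index can be checked numerically with the sketch below. Two assumptions are made for illustration only: the occurrence probability is approximated with a Poisson tail (the program may use another distribution, such as the binomial), and the significance index is taken as `-log10(occ_pro * number of possible oligonucleotides)`. The function names are hypothetical.

```python
import math

def poisson_tail(observed, expected):
    """P(X >= observed) for X ~ Poisson(expected); Poisson is an assumption."""
    below = sum(
        math.exp(-expected) * expected ** k / math.factorial(k)
        for k in range(observed)
    )
    return max(0.0, 1.0 - below)

def significance(p_value, n_oligos):
    """Assumed form of the index: -log10 of p_value times the number of oligos."""
    return -math.log10(p_value * n_oligos)

if __name__ == "__main__":
    n_hexanucleotides = 4 ** 6  # single-strand count of possible 6-mers
    for observed, expected in [(1, 0.1), (20, 10)]:
        p = poisson_tail(observed, expected)
        print(
            f"obs={observed} exp={expected} ratio={observed / expected:.1f} "
            f"occ_pro={p:.4f} sig={significance(p, n_hexanucleotides):.2f}"
        )
```

The two cases from the ratio example show why the ratio alone is misleading: the larger ratio comes with the larger, less significant probability.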

**Thresholds:**\

#### Probabilities

* * * * *
