RSAT - footprint-discovery manual

NAME
VERSION
DESCRIPTION
AUTHORS
CATEGORY
USAGE
EXAMPLES
INPUT FORMAT
OUTPUT FORMAT
REFERENCES
- Description of the footprint-discovery method
- Inference of co-regulation network from the footprints
SEE ALSO
WISH LIST
OPTIONS

NAME

footprint-discovery

VERSION

$program_version

DESCRIPTION

Detect phylogenetic footprints by applying dyad-analysis in promoters of a set of orthologous genes.

Adapted from the procedure described in Janky & van Helden (2008).

AUTHORS

Rekin's Janky <Rekins.Janky@vib.be>
Jacques van Helden <jacques.van.helden@ulb.ac.be>

USAGE

 footprint-discovery [-i inputfile] [-o output_prefix] \
                 -org query_organism -taxon ref_taxon \
                 -q query_gene [-q query_gene2 ...] \
                 [-v #] [...]

EXAMPLES

Single-gene footprint discovery

Discover conserved motifs in the promoters of the orthologs of lexA in Enterobacteriaceae.

 footprint-discovery  -v 1 -org Escherichia_coli_GCF_000005845.2_ASM584v2 -taxon Enterobacteriaceae \
                -lth occ 1 -lth occ_sig 0 -uth rank 50 \
                -return occ,proba,rank -filter  \
                -bg_model taxfreq -q lexA

Analysis of a few genes

Discover conserved motifs in the promoters of the orthologs of lexA, recA and uvrA in Enterobacteriaceae, analyzing each gene separately.

 footprint-discovery  -v 1 -org Escherichia_coli_GCF_000005845.2_ASM584v2 -taxon Enterobacteriaceae \
                -lth occ 1 -lth occ_sig 0 -uth rank 50 \
                -return occ,proba,rank -filter  \
                -bg_model taxfreq \
                -sep_genes -q lexA -q recA -q uvrA

Note the option -sep_genes indicating that the genes have to be analyzed separately rather than grouped.

The genes can also be specified in a file with the option -genes.

Footprint discovery applied iteratively to each gene of a genome

Iterate footprint discovery for each gene separately.

 footprint-discovery  -v 1 -org Escherichia_coli_GCF_000005845.2_ASM584v2 -taxon Enterobacteriaceae \
                -lth occ 1 -lth occ_sig 0 -uth rank 50 \
                -return occ,proba,rank -filter \
                -bg_model taxfreq -all_genes -sep_genes

INPUT FORMAT

The program takes as input a query organism, a reference taxon, and one or several query genes.

OUTPUT FORMAT

The output consists of a set of files containing the results of the different steps of the analysis.

[prefix]_log.txt

Log file listing the analysis parameters and the output file names

[prefix]_query_genes.tab

List of query genes (one or several genes can be entered)

[prefix]_ortho_bbh.tab

List of orthologous genes

[prefix]_ortho_seq.fasta

Promoter sequences of the orthologous genes

[prefix]_ortho_seq_purged.fasta

Purged promoter sequences (for motif discovery)

[prefix]_ortho_filter_dyads.tab

Dyads found in the query genes (for dyad filtering)

[prefix]_ortho_dyads.tab

Significant dyads found in the promoters of orthologous genes (the footprints)

[prefix]_ortho_dyads.asmb

Assembled dyads

[prefix]_ortho_dyads.png

Feature-map

NOTE : 'ortho' is replaced by 'leaders' in the filename prefix with option -infer_operons

REFERENCES

Description of the footprint-discovery method

Janky, R. and van Helden, J. (2008). Evaluation of phylogenetic footprint discovery for the prediction of bacterial cis-regulatory elements. BMC Bioinformatics 9:37 [Pubmed 18215291].

Inference of co-regulation network from the footprints

Brohee, S., Janky, R., Abdel-Sater, F., Vanderstocken, G., Andre, B. and van Helden, J. (2011). Unraveling networks of co-regulated genes on the sole basis of genome sequences. Nucleic Acids Res. [Pubmed 21572103] [Open access]

WISH LIST

The following options are not yet implemented, but this should be done soon.

-taxa: Specify a file containing a list of taxa, each of which will be analyzed separately. The results are stored in a separate folder for each taxon. The folder name is defined automatically.
-all_taxa: Automatically analyze all the taxa, and store each result in a separate folder (the folder name is defined automatically).

OPTIONS

-lth field value

Lower threshold for dyad-analysis.

See the manual of dyad-analysis for a description of the fields on which a threshold can be imposed.

-uth field value

Upper threshold for dyad-analysis.

See the manual of dyad-analysis for a description of the fields on which a threshold can be imposed.

-return dyad_return_fields

Return fields for dyad-analysis. This argument is passed to dyad-analysis for the discovery of dyads in promoters of orthologous genes.

Multiple fields can be entered either by calling this argument iteratively or by entering multiple fields separated by commas.

Type dyad-analysis -help to obtain the list of supported return fields.

-bg_model taxfreq|org_list|monads|file

Choose among alternative background models (see Janky & van Helden, 2008).

Supported background model types:

monads

Expected dyad frequencies are estimated by taking the product of the monad frequencies observed in the input sequence set. Example:

   F_exp(CAGn{10}GTA) = F_obs(CAG) * F_obs(GTA)
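
The monads formula above can be illustrated with a minimal Python sketch. It is not the dyad-analysis implementation: the function names are hypothetical, words are counted on a single strand with overlapping positions, and the real program may count differently (for example on both strands).

```python
from collections import Counter

def monad_frequencies(sequences, k=3):
    """Relative frequency of each k-letter word (monad) in the input sequences.

    Assumption: single-strand counting of overlapping words.
    """
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

def expected_dyad_frequency(freqs, left, right):
    """F_exp(left n{s} right) = F_obs(left) * F_obs(right).

    The spacing s does not enter the product.
    """
    return freqs.get(left, 0.0) * freqs.get(right, 0.0)

# Toy input sequence set (made up for the illustration)
sequences = ["CAGCAGGTA", "GTACAG"]
freqs = monad_frequencies(sequences)
f_exp = expected_dyad_frequency(freqs, "CAG", "GTA")
print(f_exp)
```

In this toy set, CAG accounts for 3 of the 11 trinucleotide positions and GTA for 2, so the expected frequency of CAGn{10}GTA is the product of 3/11 and 2/11.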

taxfreq

Only valid in combination with the option -taxon.

Expected dyad frequencies are computed by summing the frequencies of all dyads in the non-coding upstream sequences of all genes for all the organisms of the reference taxon.

org_list

Only valid in combination with the option -org_list.

Expected dyad frequencies are computed by summing the frequencies of all dyads in the non-coding upstream sequences of all genes for each organism of the user-specified list.

file

Only valid in combination with the option -bgfile.

Specifies that the background model used for dyad-analysis is read from a file given as argument (with the option -bgfile, see below).

-bgfile

File containing the word frequencies to be used as the background model for dyad-analysis. This option must be used in combination with the option -bg_model file.

-filter

Only accept dyads found in the promoter of the query gene, in the query organism. (option selected by default)

-no_filter

Accept all dyads, even if they are not found in the promoter of the query gene, in the query organism. (will cancel -filter option if selected)
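
The effect of -filter versus -no_filter can be sketched as a set membership test. This is only an illustration of the principle: the function name and the dyad lists are hypothetical, and the sketch does not show how or at which step the program applies the filter through dyad-analysis.

```python
def filter_dyads(ortholog_dyads, query_promoter_dyads):
    """Keep only the dyads that also occur in the promoter of the query gene."""
    allowed = set(query_promoter_dyads)
    return [dyad for dyad in ortholog_dyads if dyad in allowed]

# Hypothetical dyads detected in the promoters of the orthologs
ortholog_dyads = ["CAGn{10}GTA", "CTGn{10}CAG", "AAAn{5}TTT"]

# Hypothetical dyads with at least one occurrence in the query gene promoter
query_promoter_dyads = {"CTGn{10}CAG", "CAGn{10}GTA", "GGGn{0}CCC"}

with_filter = filter_dyads(ortholog_dyads, query_promoter_dyads)  # -filter
without_filter = list(ortholog_dyads)                             # -no_filter
print(with_filter)
print(without_filter)
```

With the filter, AAAn{5}TTT is discarded because it is absent from the query promoter; without it, all three dyads are kept.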

-max_dyad_degree #

Maximal dyad degree for network inference. Default: 20.

Some dyads are found significant in a very large number of genes, for various reasons (binding motifs of global factors, low-complexity motifs). These "ubiquitous" dyads create many links in the network, which makes it difficult to extract clusters of putatively co-regulated genes. To circumvent this problem, we discard "hub" dyads, i.e. dyads found in the footprints of too many query genes.
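
The principle of discarding hub dyads can be sketched as follows. The sketch assumes that the degree of a dyad is the number of query genes in whose footprints it is found, and that a dyad is kept when its degree is lower than or equal to the threshold; both points are assumptions, and the function name and data are hypothetical.

```python
from collections import defaultdict

def discard_hub_dyads(gene_to_dyads, max_degree=20):
    """Remove dyads found in the footprints of more than max_degree genes.

    Assumption: the threshold is inclusive (degree <= max_degree is kept).
    """
    degree = defaultdict(int)
    for dyads in gene_to_dyads.values():
        for dyad in set(dyads):  # count each dyad once per gene
            degree[dyad] += 1
    return {
        gene: [dyad for dyad in dyads if degree[dyad] <= max_degree]
        for gene, dyads in gene_to_dyads.items()
    }

# Hypothetical footprints for three query genes
footprints = {
    "lexA": ["CTGn{10}CAG", "AAAn{5}TTT"],
    "recA": ["CTGn{10}CAG", "AAAn{5}TTT"],
    "uvrA": ["AAAn{5}TTT", "ACGn{3}TGC"],
}

# A low threshold is used here so that the toy data contains a hub
kept = discard_hub_dyads(footprints, max_degree=2)
print(kept)
```

Here AAAn{5}TTT is found in three genes and is discarded, so lexA and recA remain linked only by CTGn{10}CAG.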

-v #

Level of verbosity (detail in the warning messages during execution)

-h

Display full help message.

-help

Same as -h

-org query_organism

Query organism, to which the query genes belong.

-taxon reference_taxon

Reference taxon, in which orthologous genes have to be collected.

Alternatively, reference organisms can be specified with the option -org_list.

-org_list organisms_list_file

This option gives the possibility to analyse a user-specified set of reference organisms rather than a full taxon.

File format: the first word of each line is used as organism ID. Any subsequent text is ignored. The comment char is ";".

This option is incompatible with the option "-taxon".
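
The file format expected by -org_list can be illustrated with a small reader. This is a sketch, not the parser used by the program: it assumes that the comment character appears at the start of the line and that empty lines are skipped. The second organism ID in the example is a placeholder.

```python
def read_organism_list(lines):
    """Return the organism IDs: the first word of each non-comment line."""
    organisms = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";"):
            continue  # skip empty lines and comment lines
        organisms.append(line.split()[0])  # any subsequent text is ignored
    return organisms

example_file = """; reference organisms for the analysis
Escherichia_coli_GCF_000005845.2_ASM584v2   query organism, kept as reference
Some_other_organism_ID

; end of list
""".splitlines()

organisms = read_organism_list(example_file)
print(organisms)
```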

-no_purge

This option can only be used in combination with the option -org_list. It gives the possibility to analyse a given set of sequences while managing sequence redundancy through a list of non-redundant organisms.

The file format is one organism per line; the comment char is ";".

-orthologs_list file

This option gives the possibility to analyse a user-specified set of orthologs for specific reference organisms instead of using the BBH set of orthologs provided by RSAT.

The query genes included here will be the ones analyzed by the program.

File format: Tab delimited file with three columns.

  ID of the query gene (in the query organism)
  ID of the reference gene
  ID of the reference organism

Further columns will be ignored. The comment char is ";".

This option is incompatible with the options "-taxon" and "-bg_model taxfreq".
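
The three-column format of the -orthologs_list file can be illustrated with a small reader. This is a sketch under assumptions: comment lines start with ";", empty lines are skipped, and lines with fewer than three columns are rejected. All gene and organism IDs in the example are placeholders.

```python
from collections import defaultdict

def read_orthologs_list(lines):
    """Map each query gene ID to its (reference gene, reference organism) pairs."""
    orthologs = defaultdict(list)
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line.strip() or line.startswith(";"):
            continue  # skip empty lines and comment lines
        fields = line.split("\t")
        if len(fields) < 3:
            raise ValueError(f"line {number}: expected at least 3 columns")
        query_gene, ref_gene, ref_organism = fields[:3]  # further columns ignored
        orthologs[query_gene].append((ref_gene, ref_organism))
    return dict(orthologs)

example_file = [
    "; query_gene\tref_gene\tref_organism",
    "geneA\trefgene1\tOrganism_one\textra column",
    "geneA\trefgene2\tOrganism_two",
    "geneB\trefgene3\tOrganism_one",
]

orthologs = read_orthologs_list(example_file)
print(sorted(orthologs))
```

The keys of the resulting table are the query genes, which matches the statement that the query genes included in the file are the ones analyzed by the program.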

-q query

Query gene.

This option can be used iteratively on the command line to specify multiple genes.

-genes

Specify a file containing a list of genes. Multiple genes can also be specified by using the option -q iteratively.

-all_genes

Automatically analyze all the genes of a query genome, and store each result in a separate folder (the folder name is defined automatically).

-max_genes

Maximal number of genes to analyze.

-skip #

Skip the first # genes (useful for quick testing and for resuming interrupted tasks).

-last #

Stop after having treated the first # genes (useful for quick testing).

-o output_root_dir

Main output directory. The results will be dispatched in sub-directories, defined according to the taxon, query organism and query gene name(s).

If the main output dir is not specified, the program automatically sets it to "footprints".

-batch

Generate one command per query gene, and post it on the queue of a PC cluster.

-dry

Dry run: print the commands but do not execute them.

-nodie

Do not die in case a sub-program returns an error.

The option -nodie allows you to circumvent problems with specific sub-tasks, but this is not recommended because the results may be incomplete.

-sep_genes

Search footprints for each query gene separately. The results are stored in a separate folder for each gene. The folder name is defined automatically.

By default, when several query genes are specified, the program collects orthologs and analyzes their promoters altogether. The option -sep_genes automates the detection of footprints in a set of genes that are treated separately.

-infer_operons

Infer operons in order to retrieve the promoters of the predicted operon leader genes rather than those located immediately upstream of the orthologs. This method uses a threshold on the intergenic distance.

-dist_thr value

Specify here the intergenic distance threshold in base pairs. Pairs of adjacent genes with an intergenic distance less than or equal to this value are predicted to belong to the same operon (default: 55).
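
The distance-based rule can be sketched as follows. This is not the infer-operons implementation. It assumes that all genes lie on the same strand and are listed in transcription order, and that the intergenic distance is the number of bases between the end of one gene and the start of the next. Gene names and coordinates are made up.

```python
def infer_operons(genes, dist_thr=55):
    """Group adjacent genes into predicted operons.

    genes: list of (name, start, end) tuples, same strand, transcription order.
    Two adjacent genes are grouped if their intergenic distance <= dist_thr.
    """
    operons = []
    for gene in genes:
        if operons:
            previous_end = operons[-1][-1][2]
            distance = gene[1] - previous_end - 1
            if distance <= dist_thr:
                operons[-1].append(gene)
                continue
        operons.append([gene])  # this gene starts a new operon
    return operons

genes = [
    ("geneA", 100, 400),
    ("geneB", 430, 900),    # 29 bp after geneA: same operon
    ("geneC", 1100, 1500),  # 199 bp after geneB: new operon
    ("geneD", 1556, 2000),  # 55 bp after geneC: same operon (inclusive)
]

operons = infer_operons(genes)
leaders = [operon[0][0] for operon in operons]
print(leaders)
```

With -infer_operons, the promoters retrieved would be those of the leader genes (geneA and geneC in this toy example) rather than those immediately upstream of geneB or geneD.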

-task

Specify a subset of tasks to be executed.

By default, the program runs all necessary tasks. However, in some cases, it can be useful to select one or several tasks to be executed separately. For instance, after having collected all the promoter sequences of ortholog genes, one might desire to run the pattern detection with various parameter values without having to retrieve the same sequences each time.

Beware: task selection requires expertise, because most tasks depend on the prior execution of some other tasks in the workflow. Selecting tasks before their prerequisite tasks have been completed will provoke fatal errors.

Supported tasks:

For all footprint programs (footprint-discovery, footprint-scan).

all

Run all supported tasks. If no task is specified, all tasks are performed.

operons

Infer operons (using infer-operons). This option should be used only for Bacteria.

query_seq

Retrieve upstream sequence of the query genes (using retrieve-seq).

orthologs

Identify the orthologs of the query genes in the selected taxon (using get-orthologs).

ortho_seq

Retrieve upstream sequences of the orthologs (using retrieve-seq-multigenome).

purge

Purge upstream sequences of the orthologs (using purge-seq).

index

Generate an HTML index with links to the result files. This option is used for the web interface, but can also be convenient to index results, especially when several genes or taxa are analyzed (options -genes, -all_genes, -all_taxa).

With the option -sep_genes, one index is generated for each gene separately. An index summarizing the results for all genes can be generated using the option -synthesis.

synthesis

Generate an HTML table with links to the individual result files. For footprint-discovery, the table contains one row per query gene and one column per output type (sequences, dyads, maps, ...). For footprint-scan, it contains one line per TF-gene interaction.

Tasks specific to footprint-discovery

filter_dyads: Detect all dyads present with at least one occurrence in the upstream sequence of the query gene (using dyad-analysis). Those dyads will be used as filter if the option -filter has been specified.
dyads: Detect significantly over-represented dyads in upstream sequences of orthologs (using dyad-analysis).
map: Draw feature maps showing the location of over-represented dyads in upstream sequences of promoters (using feature-map).
network: Infer a co-regulation network from the footprints, as described in Brohee et al. (2011).
index: Generate an index file for each gene separately. The index file is in the gene-specific directory; it is complementary to the general index file generated with the task "synthesis".

Tasks specific to footprint-scan

orthologs_tf: Orthologous genes will be obtained for the genes related to the specified transcription factors. This task should be executed before the task orthologs when a TF is specified. See the description of the option -tf for more information.
occ_sig: Compute the significance of the number of matrix hit occurrences as a function of the weight score (using matrix-scan and matrix-scan-quick).
occ_sig_graph: Generate graphs showing the distributions of occurrences and their significances, as a function of the weight score (using XYgraph).
scan: Scan upstream sequences to detect hits above a given threshold (using matrix-scan).
map: Draw the feature map of the hits (using feature-map).

-rand

When the option -rand is activated, the program replaces each ortholog by a gene selected at random in the genome where this ortholog was found.

This option is used (for example by footprint-scan and footprint-discovery) to perform negative controls, i.e. to check the rate of false positives in randomly selected promoters of the reference taxon.

-map_format

Format for the feature map.

Supported: any format supported by the program feature-map.

-index

Deprecated, replaced by the task "index".

-synthesis

This option generates synthetic tables (in tab-delimited text and HTML) for all the results. It should be combined with the option -sep_genes. The synthetic tables contain one row per gene, and one column per parameter. They summarize the results (maximal significance, top-ranking motifs) and give pointers to the detailed result files.