footprint-scan
Scan promoters of orthologous genes with one or several position-specific scoring matrices (PSSM) in order to detect enriched motifs, and thereby predict phylogenetically conserved target genes.
footprint-scan [-m matrix_inputfile] [-o outputfile] [-v #] [...]
The analysis can be performed either on a single gene, or several genes separately (option -sep_genes), or on a group of genes altogether.
Query genes can be entered on the command line (option -q) or in a text file (option -genes). Alternatively, the option -all_genes will run the analysis on all the genes of a genome.
footprint-scan requires a collection of (at least one) position-specific scoring matrices (PSSM).
All the formats supported by matrix-scan can be used to enter the matrices. However, we recommend using the TRANSFAC format, which supports multiple matrices (we usually want to scan promoters with a full collection of matrices), and associates an identifier with each matrix (e.g. the name of the transcription factor).
The following example shows a text file describing two matrices, representing the binding motifs annotated in RegulonDB for AgaR and AraC, respectively. Motifs must be separated by a line containing a double slash (//).
The complete file can be downloaded from RegulonDB (http://regulondb.ccg.unam.mx/).
AC ECK12_ECK120012515_AgaR.24
XX
ID ECK12_ECK120012515_AgaR.24
XX
P0 A T C G
1 5 0 1 5
2 6 1 4 0
3 4 0 5 2
4 5 4 0 2
5 4 6 0 1
6 1 5 3 2
7 0 2 8 1
8 4 1 1 5
9 4 5 1 1
10 3 8 0 0
11 5 6 0 0
12 1 8 1 1
13 2 0 4 5
14 4 5 2 0
15 3 8 0 0
16 3 8 0 0
17 0 2 9 0
18 0 2 2 7
19 3 7 1 0
20 4 7 0 0
21 3 8 0 0
22 3 4 0 4
23 3 4 0 4
24 3 4 3 1
25 3 8 0 0
XX
//
AC ECK12_ECK120012316_AraC.18
XX
ID ECK12_ECK120012316_AraC.18
XX
P0 A T C G
1 0 10 0 3
2 7 4 1 1
3 0 6 5 2
4 2 2 3 6
5 0 0 6 7
6 9 0 0 4
7 0 2 9 2
8 2 7 3 1
9 9 3 0 1
10 7 4 0 2
11 4 8 0 1
12 3 3 5 2
13 2 10 0 1
14 2 7 1 3
15 6 1 6 0
16 0 11 2 0
17 1 0 3 9
18 1 5 5 2
19 5 2 0 6
XX
//
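As a quick sanity check before scanning, the content of a TRANSFAC-formatted file can be listed with a few lines of awk. The sketch below is not part of RSAT: the function name list_transfac_matrices is hypothetical, and it assumes the layout shown above (an ID line, a P0 header, one numbered row per position, and a double slash closing each matrix).

```shell
# Hypothetical helper (not an RSAT command): print the identifier and the
# number of positions (count rows) of each matrix in a TRANSFAC-formatted file.
list_transfac_matrices() {
    awk '
        $1 == "ID"      { id = $2; width = 0 }    # start of a new matrix
        $1 ~ /^[0-9]+$/ { width++ }               # one count row per position
        $1 == "//"      { if (id != "") print id "\t" width; id = "" }
    ' "$1"
}
```

For instance, list_transfac_matrices RegulonDB_matrices_transfac_format.txt would print one line per matrix, which makes it easy to spot a missing separator or a truncated matrix.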
The result comprises several files: orthologs, upstream sequences, matrix-scan results and feature maps. By default, a directory is created for each query gene, with a name indicating the parameters:
footprints/[taxon]/[Organism]/[gene]
Alternatively, the output folder can be specified manually with the option -o.
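To locate the results of a given run from a script, the default path can be rebuilt from the template above. This is only a sketch: the function name default_footprint_dir is hypothetical, and it assumes that a directory given with -o simply replaces the leading "footprints" component, which this manual suggests but does not spell out.

```shell
# Build the default result directory for one query gene, following the
# template footprints/[taxon]/[Organism]/[gene]. The optional 4th argument
# stands for a main directory given with -o (assumed default: footprints).
default_footprint_dir() {
    taxon=$1; organism=$2; gene=$3; main_dir=${4:-footprints}
    printf '%s/%s/%s/%s\n' "$main_dir" "$taxon" "$organism" "$gene"
}

# Prints footprints/Enterobacteriales/Escherichia_coli_K12/sodA
default_footprint_dir Enterobacteriales Escherichia_coli_K12 sodA
```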
Let us assume that we have a collection of PSSMs annotated for a given organism (e.g. the matrices for all the Escherichia coli transcription factors annotated in RegulonDB). We would like to scan the promoters of orthologs of a given gene, in order to predict the transcription factors that might be involved in its regulation. The program will count the hits for each matrix, and report those showing a significant enrichment in the promoters of its orthologs.
In this example, we use a slightly higher verbosity than usual (-v 2) in order to keep track of the progress of the analysis. This also reports the commands that are executed, and allows us to examine all their parameters.
footprint-scan -v 2 -org Escherichia_coli_K12 \
-taxon Enterobacteriales -q sodA -q lexA -q araC \
-bgfile ${RSAT}/public_html/data/taxon_frequencies/Enterobacteriales/dyads_3nt_sp0-20_upstream-noorf_Enterobacteriales-noov-1str.freq.gz \
-m RegulonDB_matrices_transfac_format.txt \
-matrix_format transfac \
-matrix_suffix RegulonDB \
-sep_genes
footprint-scan -v 2 -org Escherichia_coli_K12 \
-taxon Enterobacteriales -q sodA \
-bgfile ${RSAT}/public_html/data/taxon_frequencies/Enterobacteriales/dyads_3nt_sp0-20_upstream-noorf_Enterobacteriales-noov-1str.freq.gz \
-m RegulonDB_matrices.tab \
-matrix_format tab \
-matrix_suffix RegulonDB \
-sep_genes
Given a PSSM, we would like to detect new putative binding sites for a given transcription factor. The usual approach would be to retrieve all upstream sequences of the organism of interest and then search for high-scoring sites with matrix-scan. However, a high score in one sequence does not mean that the site is a real binding site.
As we know, sequences with a functional relevance might be conserved across some branches of the phylogeny. We thus expect binding sites with a functional relevance to be conserved in a group of closely related orthologous sequences. footprint-scan can search for putative binding sites in the whole set of upstream regions of an organism while evaluating whether the detected binding sites are conserved (over-represented) in the respective orthologous sequences.
footprint-scan -v 2 -org Escherichia_coli_K12 \
-taxon Enterobacteriales -all_genes \
-bgfile ${RSAT}/public_html/data/taxon_frequencies/Enterobacteriales/dyads_3nt_sp0-20_upstream-noorf_Enterobacteriales-noov-1str.freq.gz \
-m MetJ_Regulon_matrix.tab \
-matrix_format tab \
-matrix_suffix RegulonDB \
-sep_genes
The difference between footprint-scan and footprint-discovery is that footprint-scan requires prior knowledge of the motifs (in the form of position-specific matrices), whereas footprint-discovery performs ab initio motif discovery.
When the option -rand is activated, footprint-scan scans random selections of promoters rather than promoters of orthologs.
This option serves to perform negative controls in order to estimate empirically the rate of false predictions and check its correspondence with the theoretical estimation of the significance.
The random selections are done by passing the option -rand to the program get-orthologs.
Return Cis-Regulatory elements Enriched-Regions (CRER).
Calculate the statistical significance of the number of hits in
windows of variable sizes. The number of hits is the sum of
matches above a predefined threshold set on hits p-values, for
all matrices and on both strands (if -2str). The maximum size
for a CRER is defined by the option -crer_max.
The prior probability to find an instance of the motif is the
same for all matrices, and corresponds to the chosen pval
threshold. Within a region of maximal CRER size, subwindows are
defined between each pair of hits, and the observed number of matches in
a subwindow is the sum of hits above the threshold. The
significance of the observed number of matches in a subwindow is
estimated by calculating a P-value using the binomial
distribution (Aerts et al., 2003).
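The binomial P-value mentioned above can be illustrated with a small awk computation. This is a sketch under stated assumptions, not the code used by matrix-scan: the function name binomial_pvalue is hypothetical, the success probability is taken to be the pval threshold, and the number of trials (for example the number of scanned positions in the subwindow, multiplied by the number of matrices and strands) is an assumption, since this manual does not detail how trials are counted.

```shell
# binomial_pvalue K N P
# Probability of observing at least K hits among N independent trials,
# each succeeding with probability P (0 <= P < 1).
binomial_pvalue() {
    awk -v k="$1" -v n="$2" -v p="$3" 'BEGIN {
        term = (1 - p) ^ n                 # P(X = 0)
        tail = 0
        for (x = 0; x <= n; x++) {
            if (x >= k) tail += term       # accumulate the upper tail P(X >= k)
            term *= (n - x) / (x + 1) * p / (1 - p)   # P(X = x+1) from P(X = x)
        }
        printf "%.6g\n", tail
    }'
}
```

A call such as binomial_pvalue 3 500 1e-4 would give the probability of seeing three or more hits in 500 trials when each trial has a probability of 1e-4. Summing the upper tail directly, rather than computing one minus the lower tail, avoids losing precision for very small P-values.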
Minimal CRER size in bps
Pval cutoff for selecting CRERs
Maximal CRER size in bps
The manual is still very incomplete, Jacques van Helden needs to revise and complete it.
On the basis of the existing Web service for footprint-discovery.
Alejandra Medina-Rivera will implement the Web interface. It would be more convenient to program the Web page after the Web services, in order to benefit from the support of Web services (including the token). To be checked with Morgane Thomas-Chollier & Olivier Sand.
It would be worth preparing a tutorial (or a chapter in Methods in Molecular Biology) to explain in detail the interpretation of the result.
The tutorial could cover the 3 interfaces (command-line, Web services and Web form).
After having detected the motifs in the different sequences, analyze their co-occurrences in order to report the factors having sites in the same sequences (putatively interacting factors). Actually, this option should be implemented in matrix-scan rather than footprint-scan, because it applies to any type of analysis.
Add name of upstream neighbour to the synthetic tables, in order to detect pairs of genes sharing the same promoter.
Matrix file. This argument is mandatory.
This argument can be used iteratively to scan the sequence with multiple matrices.
Matrix format. Default is tab. This argument is mandatory.
Matrix suffix. This argument is mandatory.
The matrix suffix indicates the nature of the matrix file. For example, if your matrix file contains a single matrix for a transcription factor (say LexA), you can indicate it with
-matrix_suffix LexA
whereas if your matrix file contains all the matrices from the RegulonDB database, you can specify
-matrix_suffix RegulonDB
The matrix suffix will be concatenated to the output prefix, in order to maintain separate output files for distinct analyses performed on the same promoter sequences. For example, if you successively run the analysis with the matrix LexA, and then with the matrix CRP, you don't want to lose the results of the first scan when running the second one.
Most matrices are derived from specific TFBS, so they represent the preferential sequence where a TF binds. This option will search for all the genomes in the given taxon where there is an ortholog for the specified TF. Orthologs for the query genes will only be retrieved if the organism has an ortholog for the TF.
-tf gene_name
If the option -matrix_table is used, indicate that the TF names are given in the table file (instead of giving the name of a specific TF) with:
-tf file
Pseudo-count for the matrix (default: 1). See matrix-scan for details.
Background model file.
Format of background model file. For supported formats see: convert-background-model -h
Calculate background model from the input sequence set.
Order of the markov chain for the background model.
This option is incompatible with the option -bgfile.
Size of the sliding window for the background model calculation. When this option is specified, the matrix pseudo-count is equally distributed.
The background model is calculated locally at each step of the scan, by computing transition frequencies from a sliding window centred around the considered segment. The model is thus updated at each scanned position. This model is called "adaptive". Note that the sliding window must be large enough to train the local Markov model. The required sequence length increases exponentially with the Markov order. This option is thus usually suitable for low order models only (-markov 0 to 1).
This option is incompatible with the option -bgfile.
Pseudo frequency for the background model. Value must be a real between 0 and 1.
If this option is not specified, the pseudo-frequency value depends on the background calculation.
For -bginput and -window, the pseudo frequency is automatically calculated from the length (L) of the sequence following this formula:
sqrt(L)/(L+sqrt(L))
For -bgfile, default value is 0.05.
In other cases, if the length (L) of the training sequence is known (e.g. all promoters for the considered organism), the value can be set manually by using the option -bg_pseudo. In such a case, the background pseudo-frequency might be set, as suggested by Thijs et al., to the following value:
sqrt(L)/(L+sqrt(L))
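As a worked example of this formula, the value can be computed with awk before being passed to -bg_pseudo. The function name bg_pseudo_from_length is hypothetical, and the sequence length used below is an arbitrary illustration, not a recommended setting.

```shell
# bg_pseudo_from_length L
# Print sqrt(L)/(L+sqrt(L)) for a training sequence of L base pairs.
bg_pseudo_from_length() {
    awk -v L="$1" 'BEGIN { s = sqrt(L); printf "%.6f\n", s / (L + s) }'
}

# With L = 10000: sqrt(L) = 100, so the value is 100/10100, printed as 0.009901
bg_pseudo_from_length 10000
```

The value decreases as the training sequence gets longer, so the pseudo-frequency weighs less when more data supports the background model.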
Filter out TF-interactions that are not present in the query organism. The option -filter_pval can be used to set the threshold for the detected sites.
Background model file for the scanning of query sequences for filtering.
Set the threshold to filter out TF-interactions that are not present in the query organism.
Threshold on the site p-value: the over-representation of binding sites is only reported when the individual sites pass this threshold. The default is 1e-4.
Threshold set on the occurrence significance (over-representation) for scores that have a p-value equal to or smaller than the one given as threshold with the option -pval.
All genes are completely analyzed, but only the genes that pass both thresholds (pval and occ_sig) will be included in the synthesis table and HTML page of the matrix.
Default is set to 5.
Additional options passed to matrix-scan for the test of over-representation of matrix hits.
Supported threshold fields for the matches: score pval eval sig normw proba_M proba_B rank crer_sites crer_size
Supported threshold fields for score distributions: occ occ_sum inv_cum exp_occ occ_pval occ_eval occ_sig occ_sig_rank
Examples:
To return only the "best" score for each gene. -occ_sig_opt '-uth rank 1'
To analyze the distribution only above a weight threshold of 7. -occ_sig_opt '-lth score 7'
To analyze the distribution for sites having a P-value threshold of 1e-3. -occ_sig_opt '-uth pval 1e-3'
Note: the argument passed to matrix-scan is delimited by single quotes, and can thus not contain any quote.
Draw reference lines on the significance profile plots, to highlight some particular values.
- horizontal axis (Y=0), in violet
- vertical axis (X=0), in violet
- the weight value associated with maximal significance (only weights >=0 are considered), in red
Additional options passed to XYgraph for drawing the occurrence significance graph.
Note: the argument passed to XYgraph is delimited by single quotes, and can thus not contain any quote.
Format for the occurrence plots (occurrence frequencies, occurrence significance). Supported: all formats supported by the program XYgraph.
Additional options passed to matrix-scan for site detection and feature-map drawing.
Examples:
Scan sequences with an upper threshold of 0.001 on pval. -scan_opt '-uth pval 0.001'
Note: the argument passed to matrix-scan is delimited by single quotes, and can thus not contain any quote.
Default: sites are filtered with a threshold of 1e-4 on the p-value.
Additional options passed to feature-map for feature-map drawing.
Examples:
Change the thickness of the maps -map_opt '-mapthick 12'
Write the weight score above each site (also activate the auto adjustment of map thickness to ensure there is enough space for drawing the labels). -map_opt '-label score -mapthick auto'
Note: the argument passed to feature-map is delimited by single quotes, and can thus not contain any quote.
Default= " -mlen 300 -mspacing 2"
When the option -rand is activated, the program replaces each ortholog by a gene selected at random in the genome where this ortholog was found.
This option is used (for example by footprint-scan and footprint-discovery) to perform negative controls, i.e. check the rate of false positives in randomly selected promoters of the reference taxon.
A table providing the paths to matrix files (one file per row) + optional columns to specify parameters (factor name, format) for each matrix.
The matrix list is provided as a tab-delimited text file, where each row specifies one matrix.
- The first column indicates the path to a file containing a single matrix (in the format specified with -matrix_format).
- The second column (optional) indicates a common name for the matrix (e.g. transcription factor name) which will be displayed in the synthetic report tables. If the option '-tf file' is used, this column must indicate the name of the transcription factor on which the taxonomic filter will be applied (i.e. the analysis will only be carried out in species of the taxon where an ortholog has been found for the factor).
- The third column (optional) indicates the format of each matrix, in case the search would be done with matrices obtained from different sources (e.g. TRANSFAC, consensus, meme). Note that if the file contains a third column, the option -matrix_format cannot be used.
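The three-column layout described above can be written and checked from the shell. The following sketch is illustrative only: both function names are hypothetical, and the file paths are placeholders (the factor names LexA and CRP and the format names transfac and tab are those used elsewhere in this manual).

```shell
# Write a two-row matrix table: path, factor name, matrix format,
# separated by tabs. Paths are placeholders for illustration.
write_example_matrix_table() {
    printf 'matrices/LexA.tf\tLexA\ttransfac\n'  > "$1"
    printf 'matrices/CRP.tab\tCRP\ttab\n'       >> "$1"
}

# Check that every row has between one and three tab-separated columns
# and a non-empty path in the first column. Returns non-zero otherwise.
check_matrix_table() {
    awk -F '\t' '
        NF < 1 || NF > 3 || $1 == "" { printf "line %d: malformed row\n", NR; bad = 1 }
        END { exit (bad ? 1 : 0) }
    ' "$1"
}
```

Running check_matrix_table on the table before launching a long analysis catches rows where spaces were typed instead of tabs, since such rows end up with a single oversized first column or too many fields.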
Generate one footprint-scan command per matrix and post it on the queue of a PC cluster.
Skip the first # matrices in the matrix_table (useful for quick testing and for resuming interrupted tasks when using a matrix_table or when several matrices are entered with the option -m).
Stop after having treated the first # matrices in the matrix table (useful for quick testing when using a matrix_table or when several matrices are entered with the option -m).
Level of verbosity (detail in the warning messages during execution)
Display full help message.
Same as -h
Query organism, to which the query genes belong.
Reference taxon, in which orthologous genes have to be collected.
Alternatively, reference organisms can be specified with the option -org_list.
This option makes it possible to analyse a user-specified set of reference organisms rather than a full taxon.
File format: the first word of each line is used as organism ID. Any subsequent text is ignored. The comment char is ";".
This option is incompatible with the option "-taxon".
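The file format described above (first word of each line, ";" as comment character) can be previewed with a short awk filter. This is a sketch rather than the parser used by the program: the function name read_org_list is hypothetical, and it assumes that a comment is a line whose first non-blank character is ";".

```shell
# Hypothetical helper: print the organism IDs that would be read from an
# organism list file (first word of each line; empty lines and lines
# starting with ";" are skipped).
read_org_list() {
    awk '
        NF == 0     { next }       # empty line
        $1 ~ /^;/   { next }       # comment line
        { print $1 }               # first word = organism ID
    ' "$1"
}
```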
This option can only be used in combination with the -org_list option. It makes it possible to analyse a given set of sequences while managing sequence redundancy, using a list of non-redundant organisms.
The file format is one organism per line; the comment char is ";".
This option makes it possible to analyse a user-specified set of orthologs for specific reference organisms instead of using the BBH set of orthologs provided by RSAT.
The query genes included here will be the ones analyzed by the program.
File format: Tab delimited file with three columns.
ID of the query gene (in the query organism)
ID of the reference gene
ID of the reference organism
Further columns will be ignored. The comment char is ";".
This option is incompatible with the options "-taxon" and "-bg_model taxfreq".
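Since the query genes are taken from the first column of this file, it can be useful to list them before starting the analysis. A minimal sketch, assuming that comment lines start with ";" and that rows with fewer than three tab-separated columns should be ignored (the function name list_query_genes is hypothetical):

```shell
# Hypothetical helper: list the distinct query gene IDs (column 1) found
# in a tab-delimited ortholog file with at least three columns per row.
list_query_genes() {
    awk -F '\t' '
        /^;/    { next }         # comment line
        NF >= 3 { print $1 }     # column 1: ID of the query gene
    ' "$1" | sort -u
}
```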
Query gene.
This option can be used iteratively on the command line to specify multiple genes.
Specify a file containing a list of genes. Multiple genes can also be specified by using iteratively the option -q.
Automatically analyze all the genes of a query genome, and store each result in a separate folder (the folder name is defined automatically).
Maximal number of genes to analyze.
Skip the first # genes (useful for quick testing and for resuming interrupted tasks).
Stop after having treated the first # genes (useful for quick testing).
Main output directory. The results will be dispatched in sub-directories, defined according to the taxon, query organism and query gene name(s).
If the main output dir is not specified, the program automatically sets it to "footprints".
Generate one command per query gene, and post it on the queue of a PC cluster.
Dry run: print the commands but do not execute them.
Do not die in case a sub-program returns an error.
The option -nodie allows you to circumvent problems with specific sub-tasks, but this is not recommended because the results may be incomplete.
Search footprints for each query gene separately. The results are stored in a separate folder for each gene. The folder name is defined automatically.
By default, when several query genes are specified, the program collects orthologs and analyzes their promoters altogether. The option -sep_genes automates the detection of footprints in a set of genes that are treated separately.
Infer operons in order to retrieve the promoters of the predicted operon leader genes rather than those located immediately upstream of the orthologs. This method uses a threshold on the intergenic distance.
Specify here the intergenic distance threshold in base pairs. Pairs of adjacent genes with an intergenic distance equal to or less than this value are predicted to be within the same operon (default: 55).
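The distance rule can be illustrated on a toy gene list. The sketch below is not the infer-operons program: the function name assign_operon_leaders and the input layout are hypothetical, and it assumes that the genes are listed in transcription order on a single strand, with the distance to the upstream neighbour in the second column.

```shell
# assign_operon_leaders FILE [MAX_DIST]
# Toy input: one gene per line, "name<TAB>distance to upstream neighbour (bp)".
# A gene whose distance exceeds MAX_DIST (default 55) starts a new operon;
# otherwise it is assigned to the operon of the previous gene.
assign_operon_leaders() {
    awk -F '\t' -v max_dist="${2:-55}" '
        NR == 1 || ($2 + 0) > (max_dist + 0) { leader = $1 }   # gap too large: new operon
        { print $1 "\t" leader }                                # gene and its predicted leader
    ' "$1"
}
```

With distances of 30, 200 and 55 bp for the second, third and fourth genes, the second gene joins the operon of the first, the third starts a new operon, and the fourth joins the third, since a distance equal to the threshold still counts as within the operon.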
Specify a subset of tasks to be executed.
By default, the program runs all necessary tasks. However, in some cases, it can be useful to select one or several tasks to be executed separately. For instance, after having collected all the promoter sequences of ortholog genes, one might desire to run the pattern detection with various parameter values without having to retrieve the same sequences each time.
Beware: task selection requires expertise, because most tasks depend on the prior execution of some other tasks in the workflow. Selecting tasks before their prerequisite tasks have been completed will provoke fatal errors.
Supported tasks:
Run all supported tasks. If no task is specified, all tasks are performed.
Infer operons (using infer-operons). This option should be used only for Bacteria.
Retrieve upstream sequence of the query genes (using retrieve-seq).
Identify the orthologs of the query genes in the selected taxon (using get-orthologs).
Retrieve upstream sequences of the orthologs (using retrieve-seq-multigenome).
Purge upstream sequences of the orthologs (using purge-seq).
Generate an HTML index with links to the result files. This option is used for the web interface, but can also be convenient to index results, especially when several genes or taxa are analyzed (options -genes, -all_genes, -all_taxa).
With the option -sep_genes, one index is generated for each gene separately. An index summarizing the results for all genes can be generated using the option -synthesis.
Generate an HTML table with links to the individual result files. For footprint-discovery, the table contains one row per query gene and one column per output type (sequences, dyads, maps, ...); for footprint-scan, it contains one line per TF-gene interaction.
Detect all dyads present with at least one occurrence in the upstream sequence of the query gene (using dyad-analysis). Those dyads will be used as filter if the option -filter has been specified.
Detect dyads significantly over-represented in upstream sequences of orthologs (using dyad-analysis).
Draw feature maps showing the location of over-represented dyads in upstream sequences of promoters (using feature-map).
Infer a co-regulation network from the footprints, as described in Brohee et al. (2011).
Generate an index file for each gene separately. The index file is in the gene-specific directory, it is complementary to the general index file generated with the task "synthesis".
Orthologous genes will be obtained for the genes related to the specified transcription factors. This task should be executed before the option -orthologs when a TF is specified. See the description of the -tf option for more information.
Compute the significance of number of matrix hit occurrences as a function of the weight score (using matrix-scan and matrix-scan-quick).
Generate graphs showing the distributions of occurrences and their significances, as a function of the weight score (using XYgraph).
Scan upstream sequences to detect hits above a given threshold (using matrix-scan).
Draw the feature map of the hits (using feature-map).
Format for the feature map.
Supported: any format supported by the program feature-map.
Deprecated, replaced by the task "index".
This option generates synthetic tables (in tab-delimited text and HTML) for all the results. It should be combined with the option -sep_genes. The synthetic tables contain one row per gene, and one column per parameter. They summarize the results (maximal significance, top-ranking motifs) and give pointers to the detailed result files.