---
title: "RSAT - matrix-clustering manual"
output:
  html_document:
    toc: yes
    toc_depth: 3
  pdf_document:
    toc: yes
    toc_depth: 3
css: course.css
---
matrix-clustering
$program_version
Taking as input a set of position-specific scoring matrices, identify clusters of similar matrices and build consensus motifs by merging the matrices that belong to the same cluster.
The clustering step relies on MCL, the graph-based clustering algorithm developed by Stijn Van Dongen. MCL must be installed and its path indicated in the RSAT configuration file ($RSAT/RSAT_config.props). MCL can be installed with an RSAT makefile:
cd $RSAT
make -f makefiles/install_software.mk install_mcl
Several R packages are required by matrix-clustering to convert the hierarchical tree into different output formats and to manipulate the exported dendrogram.
RJSONIO : http://cran.r-project.org/web/packages/RJSONIO/index.html
ctc : http://www.bioconductor.org/packages/release/bioc/html/ctc.html
dendextend : http://cran.r-project.org/web/packages/dendextend/index.html
Visualizing the logo forest requires the JavaScript D3 (Data-Driven Documents) library. The user can select an option to load the functions of this library directly from the server (see option -d3_base).
D3 : http://d3js.org/
Jacques van Helden
The following collaborator contributed to the definition of requirements for this program.
Morgane Thomas-Chollier
util
matrix-clustering [-i inputfile] [-o outputfile] [-v #] [...]
compare-matrices
The program compare-matrices is used by matrix-clustering to measure pairwise similarities and define the best alignment (offset, strand) between each pair of matrices.
-v #
Level of verbosity (detail in the warning messages during execution)
-h
Display full help message
-help
Same as -h
-i input matrix file
The input file contains a set of position-specific scoring matrices.
-matrix_format matrix_format
Specify the input matrix format.
Supported matrix formats
Since the program takes several matrices as input, it only accepts matrices in formats supporting several matrices per file (transfac, tf, tab, clusterbuster, cb, infogibbs, meme, stamp, uniprobe).
For a description of these formats, see the help of convert-matrix.
-title graph_title
Title displayed on top of the report page.
-display_title
If this option is selected, the title is displayed in the trees and in the result table. This is useful when the user wants to compare motifs from different sources (files).
-root_matrices_only
When this option is selected, matrix-clustering returns a file with the motifs at the root of each cluster. This saves time and memory because the branch motifs, heatmaps, and trees are not exported.
-o output_prefix
Prefix for the output files.
Mandatory option, since matrix-clustering returns several output files (pairwise matrix comparisons, matrix clusters).
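As a minimal sketch of how the input and output options fit together, the snippet below assembles a command line and prints it without running it. The file name motifs.tf and the prefix results/motif_clusters are hypothetical placeholders; only options described in this manual are used.

```shell
# Hypothetical file names; replace them with your own.
input="motifs.tf"
prefix="results/motif_clusters"

# Assemble the command from options described in this manual.
cmd="matrix-clustering -v 1 -i $input -matrix_format transfac -o $prefix"

# Dry run: print the command so it can be inspected first.
echo "$cmd"

# To run it for real (requires a working RSAT installation), uncomment:
# $cmd
```

All output files would then be expected to share the given prefix.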
-heatmap
Display consensus of merged matrices on the internal branches of the tree.
-export format
Specify format for the output tree.
The hierarchical tree in JSON format is always exported, since it is required to display the logo tree with the D3 library. Additional formats are available as options to enable visualization with classical phylogeny analysis tools.
Supported tree formats
(JSON, newick)
JSON (default)
File format used by the D3 library to visualize the logo forest in HTML.
newick (optional)
Widely used textual format to describe phylogenetic trees.
-task tasks
Specify one or several tasks to be run. If this option is not specified, all the tasks are run.
Note that some tasks depend on other ones. This option should thus be used with caution, by experienced users only.
Supported tasks: (all, comparison, clustering)
all
Execute all the parts of the program (default)
clustering
Skip the matrix comparison step and only execute the clustering step.
Assumes the user already has the description table and comparison table exported by the program compare-matrices.
This option saves time once all the comparisons between the input motifs have been done.
-label
Option to select the matrix label fields displayed in the html tree
Supported labels
(name, consensus, id)
-quick
With this option, the motif comparison step is done with the program compare-matrices-quick (implemented in C) rather than the classic version compare-matrices (implemented in Perl). The quick version runs 100 times faster, but does not implement all the options of the Perl version.
We suggest using this option for large sets of input motifs (more than 300 motifs).
NOTE: for the moment, the only threshold used in the quick version is Ncor.
-clone_input
If this option is selected, the input motif database is exported in the results folder.
NOTE: take into account the input file size.
-max_matrix
This option specifies the maximum number of matrices that can be clustered in the same analysis. If there are more matrices than the specified number, the program reports an error.
This parameter can be useful when the user analyses a large dataset of matrices.
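Before launching a large analysis, it can help to count the matrices in the input file and compare the count with the intended limit. The sketch below assumes a TRANSFAC-like layout in which every matrix record ends with a line starting with "//". The toy file content and the limit of 300 are made up for illustration.

```shell
# Write a tiny two-matrix file in a TRANSFAC-like layout (made-up counts).
tmpfile=$(mktemp)
cat > "$tmpfile" <<'EOF'
AC  toy_motif_1
ID  toy_motif_1
P0  A  C  G  T
01  8  1  1  0
02  0  9  0  1
//
AC  toy_motif_2
ID  toy_motif_2
P0  A  C  G  T
01  1  1  7  1
02  0  0  0 10
//
EOF

max_matrix=300   # hypothetical limit, as it would be passed to -max_matrix

# Assumption: one "//" terminator line per matrix record.
n_matrices=$(grep -c '^//' "$tmpfile")

if [ "$n_matrices" -gt "$max_matrix" ]; then
  verdict="too many matrices"
else
  verdict="ok"
fi
echo "$n_matrices matrices: $verdict"
rm -f "$tmpfile"
```

Other input formats use different record separators, so the grep pattern would have to be adapted accordingly.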
-hclust_method
Option to select the agglomeration rule for hierarchical clustering.
Supported agglomeration rules:
complete (default)
Compute inter-cluster distances based on the two most distant nodes.
average
Compute inter-cluster distances as the average distance between nodes belonging to the respective clusters.
single
Compute inter-cluster distances based on the closest nodes.
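The three agglomeration rules can be compared on a toy example. The sketch below takes the four distances between the members of two hypothetical clusters, A = {m1, m2} and B = {m3, m4}, and derives the inter-cluster distance under each rule. The motif names and distance values are invented. This only illustrates the definitions above, not the code used internally by the program.

```shell
# Toy cross-cluster distances between A = {m1, m2} and B = {m3, m4}.
# Columns: member of A, member of B, distance (made-up values).
distances="m1 m3 0.2
m1 m4 0.6
m2 m3 0.4
m2 m4 0.8"

linkage=$(printf '%s\n' "$distances" | LC_ALL=C awk '
  NR == 1 { min = $3; max = $3 }
  {
    if ($3 < min) min = $3   # closest pair    -> single linkage
    if ($3 > max) max = $3   # most distant    -> complete linkage
    sum += $3                # all pairs       -> average linkage
  }
  END { printf "complete=%.2f average=%.2f single=%.2f", max, sum / NR, min }
')
echo "$linkage"
# prints: complete=0.80 average=0.50 single=0.20
```

With the same input, the complete rule gives the largest inter-cluster distance and the single rule the smallest, which is why the complete rule tends to yield more compact clusters.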
-top X
Only analyze the first X motifs of the input file. This option is convenient for quick testing before starting the full analysis.
-skip X
Skip the first X motifs of the input file. This option is convenient for testing the program on a subset of the motifs before starting the full analysis.
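A quick test on a subset could be prepared along the following lines. This is a dry-run sketch: the commands are only printed. The input file name, the output prefixes, and the subset size of 20 are hypothetical.

```shell
input="motifs.tf"   # hypothetical input file
n=20                # hypothetical subset size

# Test on the first n motifs only.
cmd_top="matrix-clustering -v 1 -i $input -matrix_format transfac -top $n -o results/test_top$n"

# Test on everything except the first n motifs.
cmd_skip="matrix-clustering -v 1 -i $input -matrix_format transfac -skip $n -o results/test_skip$n"

# Dry run: print both commands for inspection.
printf '%s\n' "$cmd_top" "$cmd_skip"
```

Using a distinct output prefix per test keeps the trial results separate from those of the full analysis.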
-consensus_labels
Option to select the labels displayed in the consensus alignment picture
Default: consensus, id, strand
Supported labels
(consensus, id, strand, number)
-uth param upper_threshold
Threshold on some parameter (-lth: lower, -uth: upper threshold).
Threshold parameters are passed to compare-classes.
In addition, if a threshold is defined on the (single) metric used as clustering score (option -score), this threshold will be used to decide whether motifs should be aligned or not. If two motifs have a similarity score lower (or a distance score higher) than the selected threshold, their alignment will be skipped. The status of each motif (Aligned or Non-aligned) is reported in the file prefix_matrix_alignment_table.tab.
Suggested thresholds:
cor >= 0.7
Ncor >= 0.4
-score metric
Select the metric which will be used to cluster the motifs.
Supported metrics : cor, Ncor
Default: Ncor
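The effect of a lower threshold on the clustering score can be illustrated with a toy table of pairwise Ncor values. The sketch below labels each pair as Aligned or Non-aligned using the suggested threshold Ncor >= 0.4. On the command line this would presumably be written as "-lth Ncor 0.4 -score Ncor", a syntax inferred from the "-uth param upper_threshold" signature. The motif names and scores are invented, and the real program reports a status per motif rather than per pair, so this only shows the comparison rule.

```shell
# Toy pairwise scores: motif1 motif2 Ncor (made-up values).
pairs="m1 m2 0.85
m1 m3 0.32
m2 m3 0.40"

threshold=0.4   # lower threshold on Ncor (suggested value)

# A pair whose similarity reaches the threshold is aligned, otherwise skipped.
status=$(printf '%s\n' "$pairs" | LC_ALL=C awk -v lth="$threshold" '
  { print $1, $2, ($3 >= lth ? "Aligned" : "Non-aligned") }
')
echo "$status"
# prints:
# m1 m2 Aligned
# m1 m3 Non-aligned
# m2 m3 Aligned
```

For a distance-like metric, the comparison would be reversed and an upper threshold (-uth) would be used instead.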