This tutorial assumes that you are familiar with the concepts developed in the following parts of the theoretical course.
It is better to follow the corresponding tutorials before this one.
The program matrix-clustering discovers and aligns groups of similar motifs within motif collections, and displays the results in several motif-representation formats.
Transcription factor binding motifs (TFBM) are classically represented either as consensus strings (strict consensus, IUPAC or regular expressions), or as position-specific scoring matrices (PSSM).
Thousands of curated TFBMs are available in specialized databases (JASPAR, RegulonDB, TRANSFAC, etc.). These PSSMs were traditionally built from collections of transcription factor binding sites (TFBS) obtained by various experimental methods (e.g. EMSA, DNase footprinting, SELEX).
TFBMs can also be discovered ab initio from genome-scale datasets: promoters of co-expressed genes, ChIP-seq peaks, phylogenetic footprints, etc.
Motif collections (databases as well as ab initio motif discovery results) sometimes contain groups of similar motifs, for different reasons: curation of alternative motifs for the same TF; homologous proteins sharing a particular DNA-binding domain; motifs discovered with analytic workflows combining several algorithms (e.g. RSAT peak-motifs, or MEME-ChIP).
The RSAT tool matrix-clustering addresses the need to identify and organize such groups of similar motifs, and includes several features which will be illustrated in this tutorial.
In this tutorial, we explain how to tune the parameters and interpret the results of matrix-clustering.
Goal: clustering a set of partly redundant motifs discovered by various algorithms.
Data set: To illustrate the use of motif clustering to filter out redundancy, we will analyze a set of motifs discovered with the workflow peak-motifs, which combines several motif discovery approaches:
We ran these algorithms on a set of peak sequences obtained by pulling down genomic regions bound by the transcription factor Oct4 in mouse ES cells. This experiment had been performed in the context of a wider study, where Chen and colleagues characterized the binding location of 12 transcription factors involved in mouse embryonic stem cell differentiation (Chen et al., 2008).
In this example,
Beware: this study case requires letting the program run for a while, and mobilizes substantial computing resources. It is thus only available for the command-line version of the tool.
Goal: cluster all motifs from specific sections (Vertebrates, Insects) of the JASPAR database (Sandelin et al., 2004), in order to identify groups of similar motifs, which may be bound by TFs of the same family (e.g. several factors of the Hox family).
Note that in some cases factors with unrelated DNA-binding domains may recognize partly similar motifs (e.g. yeast Pho4p and Met4p recognize a motif centered on CACGTG, but differing by the flanking residues).
The set of 21 motifs discovered with peak-motifs in Oct4 ChIP-seq.
JASPAR vertebrate and insect motifs are available in the data section of RSAT.
JASPAR insect core motifs (136 matrices, 2013 release)
JASPAR vertebrate core motifs (263 matrices, 2013 release)
NOTE: This section will be done using the command line
Open a terminal.
Move to a directory where the result will be saved.
Execute this command (matrix-clustering demo):
matrix-clustering -v 2 \
  -i $RSAT/public_html/demo_files/peak-motifs_Oct4_matrices.tf \
  -matrix_format tf -cons -hclust_method average \
  -lth Ncor 0.4 -lth cor 0.6 -lth w 5 -label name,consensus \
  -o results/peak-motifs_Oct4/Oct4_analysis_Ncor_0.4_cor_0.6/Oct4_analysis_Ncor_0.4_cor_0.6
Execute this command (changing the threshold values affects the resulting clusters):
matrix-clustering -v 2 \
  -i $RSAT/public_html/demo_files/peak-motifs_Oct4_matrices.tf \
  -matrix_format tf -cons -hclust_method average \
  -lth Ncor 0.55 -lth cor 0.625 -lth w 5 -label name,consensus \
  -o results/peak-motifs_Oct4/Oct4_analysis_Ncor_0.55_cor_0.625/Oct4_analysis_Ncor_0.55_cor_0.625
You can read the help for the parameters of matrix-clustering by typing in the terminal: matrix-clustering -h
Note: You can change the lower threshold values with the parameter -lth [cor|Ncor|w]. Example: -lth Ncor 0.6 -lth cor 0.75 -lth w 5
Note: You can change the input motif file with the parameter -i. Example: -i folder/motif_file.tf -matrix_format tf
Note: You can change the output folder name with the parameter -o. Example: -o folder/prefix
Note: You can change the clustering method with the parameter -hclust_method [average|complete|single]. Example: -hclust_method single
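The notes above can be combined into a small wrapper script. The following is a minimal sketch under stated assumptions, not part of RSAT: the function names output_prefix and run_clustering, the DRY_RUN variable, and the /path/to/rsat placeholder are hypothetical, and the output prefix extends the naming scheme of this tutorial with the linkage rule so that runs with different methods do not overwrite each other. Only options already used in this tutorial appear in the command.

```shell
#!/bin/sh
# Hypothetical helper (not an RSAT command): build and optionally run the
# matrix-clustering command of this tutorial for given parameter values.
# /path/to/rsat is a placeholder used only when $RSAT is not defined.
INPUT="${RSAT:-/path/to/rsat}/public_html/demo_files/peak-motifs_Oct4_matrices.tf"

# Output prefix derived from the thresholds and the linkage rule.
output_prefix() {
  name="Oct4_analysis_Ncor_${1}_cor_${2}_${3}"
  echo "results/peak-motifs_Oct4/${name}/${name}"
}

# Usage: run_clustering NCOR COR [average|complete|single]
run_clustering() {
  ncor="$1"; cor="$2"; method="${3:-average}"
  prefix="$(output_prefix "$ncor" "$cor" "$method")"
  # Store the command as positional parameters so it can be printed and run.
  set -- matrix-clustering -v 2 -i "$INPUT" \
    -matrix_format tf -cons -hclust_method "$method" \
    -lth Ncor "$ncor" -lth cor "$cor" -lth w 5 \
    -label name,consensus -o "$prefix"
  echo "$*"
  # Execute only if DRY_RUN is empty and the tool is installed.
  if [ -z "${DRY_RUN:-}" ] && command -v matrix-clustering >/dev/null 2>&1; then
    "$@"
  fi
}

# Print (without running) one command per agglomeration rule.
DRY_RUN=1
for method in average complete single; do
  run_clustering 0.4 0.6 "$method"
done
```

Unsetting DRY_RUN makes the same loop launch the three analyses, which is one way to compare the effect of the agglomeration rule on the same data and thresholds.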
NOTE: This section will be done using the command line
Open a terminal.
Move to a directory where the negative control results will be saved.
Execute this command:
convert-matrix -i $RSAT/public_html/demo_files/peak-motifs_Oct4_matrices.tf -from tf -to tf -perm 1 -o peak-motifs_Oct4_matrices_permuted.tf
You can read the help for the parameters of convert-matrix by typing in the terminal: convert-matrix -h
Execute this command:
matrix-clustering -v 2 -i peak-motifs_Oct4_matrices_permuted.tf \
  -matrix_format tf -cons -hclust_method average \
  -lth Ncor 0.4 -lth cor 0.6 -lth w 5 -label name,consensus \
  -o results/peak-motifs_Oct4/Oct4_analysis_Ncor_0.4_cor_0.6_permuted/Oct4_analysis_Ncor_0.4_cor_0.6_permuted
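To make the idea behind this negative control concrete, the sketch below shuffles the columns (positions) of a toy count matrix with awk. It assumes, as we understand the -perm option, that the permutation acts on matrix columns: each position keeps its residue counts, but the order of the positions is lost. The toy matrix, the file names and the function permute_columns are purely illustrative; this is not how convert-matrix is implemented.

```shell
#!/bin/sh
# Toy count matrix (illustrative values): one row per residue,
# one column per motif position.
cat > toy_matrix.tab <<'EOF'
A 10 0 0 7
C 0 9 1 1
G 0 1 0 2
T 0 0 9 0
EOF

# Usage: permute_columns FILE SEED
# Applies the same random column order to every row of the matrix.
permute_columns() {
  awk -v seed="$2" '
    BEGIN { srand(seed) }
    NR == 1 {
      # Fisher-Yates shuffle of the position columns (fields 2..NF).
      n = NF - 1
      for (i = 1; i <= n; i++) order[i] = i + 1
      for (i = n; i > 1; i--) {
        j = int(rand() * i) + 1
        tmp = order[i]; order[i] = order[j]; order[j] = tmp
      }
    }
    {
      # Keep the residue label first, then print the columns in shuffled order.
      line = $1
      for (i = 1; i <= n; i++) line = line " " $(order[i])
      print line
    }
  ' "$1"
}

permute_columns toy_matrix.tab 42 > toy_matrix_permuted.tab
cat toy_matrix_permuted.tab
```

Because the columns themselves are unchanged, a motif dominated by a single residue looks much the same before and after shuffling, which is worth keeping in mind when interpreting the clusters obtained from permuted matrices.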
NOTE: This section will be done using the command line
Open a terminal.
Move to a directory where the result will be saved.
Execute the next commands:
JASPAR insect core (136 motifs):
matrix-clustering -v 2 \
  -i $RSAT/data/motif_databases/JASPAR/jaspar_core_insects_2013-11.tf \
  -matrix_format tf -cons -hclust_method average \
  -lth Ncor 0.5 -lth cor 0.7 -lth w 5 -label name,consensus \
  -o results/JASPAR_insects/Jaspar_insects
JASPAR vertebrate core (263 motifs):
matrix-clustering -v 2 -i $RSAT/data/motif_databases/JASPAR/jaspar_core_vertebrates_2013-11.tf -matrix_format tf -cons -hclust_method average -lth Ncor 0.5 -lth cor 0.7 -lth w 5 -label name,consensus -o results/JASPAR_vertebrates/Jaspar_vertebrates
Remember: you can read the help for the parameters of matrix-clustering by typing: matrix-clustering -h
Note: these datasets are larger than the one used in study case 1, so matrix-clustering will take much longer to analyze them.
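Since these runs are long, it can be worth checking the input files before launching them. The sketch below counts the matrices in a TRANSFAC-format file and compares the count with the expected collection size (136 and 263, as stated above). It assumes that each matrix entry carries exactly one line starting with "AC"; the function names count_matrices and check_collection and the /path/to/rsat placeholder are hypothetical.

```shell
#!/bin/sh
# Count matrices in a TRANSFAC-format file, assuming one "AC" (accession)
# line per matrix entry.
count_matrices() {
  grep -c '^AC[[:space:]]' "$1" || true
}

# Usage: check_collection FILE EXPECTED_COUNT
check_collection() {
  if [ ! -r "$1" ]; then
    echo "missing: $1"
    return 1
  fi
  found="$(count_matrices "$1")"
  if [ "$found" -eq "$2" ]; then
    echo "ok: $1 ($found matrices)"
  else
    echo "unexpected: $1 has $found matrices, expected $2"
    return 1
  fi
}

# /path/to/rsat is a placeholder used only when $RSAT is not defined.
JASPAR_DIR="${RSAT:-/path/to/rsat}/data/motif_databases/JASPAR"
check_collection "$JASPAR_DIR/jaspar_core_insects_2013-11.tf" 136 || true
check_collection "$JASPAR_DIR/jaspar_core_vertebrates_2013-11.tf" 263 || true
```

A mismatch does not necessarily indicate a problem (the local copy may come from a different release), but it is a cheap signal to look at the file before spending computing time on it.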
In this example we will study a set of motifs discovered in a ChIP-seq experiment targeting the TF Oct4 (Pou5f1), an essential TF in cell fate decision, ES cells and early embryonic development. Oct4 binds the canonical sequence 5'-ATGCAAAT-3'.
In ES cells, Oct4 often interacts with another TF, Sox2, which binds to an adjacent Sox motif 5'-CATTGTA-3'. Together, both TFs coregulate specific genes.
During the analysis of Oct4 or Sox2 binding peaks, the so-called SOCT motif is usually found, which is a composite motif encompassing both the Oct and Sox motifs (Figure 1).
This file links to all the output files and shows each cluster found in the analysis separately.
You will find the following files:
NOTE: the number of clusters in the analysis of the randomized set of motifs may vary.
NOTE: this dataset contains two A-rich motifs that are nearly identical to each other (position_6nt_m2 and position_7nt_m3). After randomization these motifs remain very similar, so they will probably still be grouped together.
Figure 1. 3D model showing the cooperative binding of the TFs Sox2 and Oct4, which interact closely to bind DNA. Together, they recognize a composite motif called the SOCT motif (SOx+OCT).
Oct4 ChIP-seq discovered motifs table WITH thresholds
Figure 2. Table with the 21 motifs discovered by peak-motifs in the Oct4 ChIP-seq peaks and analyzed with matrix-clustering. Thresholds: Ncor >= 0.4; cor >= 0.6.
Cluster 1 logo tree
Figure 3. Logo tree of cluster 1 found in the Oct4 ChIP-seq motifs. The hierarchical tree displays the logo alignment in both orientations. A branch-wise motif is computed for each branch.
Cluster 1 branch-motifs table
Figure 4. Branch-motif table for cluster 1. You can download the motif in TRANSFAC format or the logo in both orientations by clicking on them.
Logo Forest
Figure 5. Logo Forest with the 21 motifs discovered by peak-motifs in the Oct4 ChIP-seq peaks. Using a combination of threshold values (cor = 0.6; Ncor = 0.4), these motifs were separated into 6 clusters, each displayed as a tree.
Branch-motif analysis
Figure 6. The logo tree of cluster one showing the branch-motifs
Figure 7. Three examples of consensus trees obtained with the same data (21 motifs discovered by peak-motifs in the Oct4 ChIP-seq peaks) and the same threshold values (cor >= 0.6; Ncor >= 0.4).
In this picture we only change the hierarchical clustering agglomeration rule. From top to bottom: average, complete, single linkage. Each cluster is represented with a different color. Observe how the number of clusters and the tree topology change depending on the selected method.
Logo tree for cluster_1
Logo tree for cluster_3
Figure 8. Logo trees for clusters 1 and 3, which correspond to the SOCT and Oct4 motifs respectively. The threshold parameters used were: Ncor >= 0.55 and cor >= 0.625.
Summary table
Figure 9. Summary table of an example of matrix-clustering results with randomized motifs
matrix-clustering can be used to group motifs bound by TFs belonging to the same family, because such TFs usually share a common DNA-binding domain.
We will analyze separately two sections of the 'core' JASPAR database, corresponding to insect and vertebrate motifs.
Note that even curated databases may contain several motifs for the same TF, because more than one technique may have been used to discover the binding motif.
Before the analysis with matrix-clustering, the motifs of each collection were compared to each other using RSAT compare-matrices. The resulting motif-to-motif similarities were visualized as a network using Cytoscape, where each motif is represented as a vertex and each edge denotes a pair of similar motifs, with color and thickness reflecting the similarity score (Figures 10 and 11).
This network representation gives a visual impression of the groups, but it does not partition the data into proper clusters. We will therefore run matrix-clustering to study the similar motifs.
results/JASPAR_insects
results/JASPAR_vertebrates
Figure 10. Motif-to-motif network done with Cytoscape for all the JASPAR insect motifs. Notice the big group encompassing most of the motifs of the network.
Figure 11. Motif-to-motif network done with Cytoscape for all the JASPAR vertebrate motifs.
Figure 12. Logo tree showing the highly similar motifs visualized in the motif-to-motif network of insect motifs.
Figure 13. Some clusters resulting from matrix-clustering are highlighted in the motif-to-motif network of vertebrate motifs.