RSA-tools - Tutorials - matrix-clustering

Contents

  1. Prerequisite
  2. Introduction
  3. Study cases
  4. Test sets
  5. Tuning matrix-clustering parameters
  6. Interpreting the results
  7. Additional exercises
  8. References


Prerequisite

This tutorial assumes that you are familiar with the concepts developed in the following parts of the theoretical course.

  1. PSSM theory
  2. Motif comparison

We recommend following the corresponding tutorials before this one.

  1. Position-specific scoring matrices.
  2. peak-motifs: discover cis-regulatory motifs and predict putative TFBS from a set of peak sequences identified by high-throughput methods such as ChIP-seq.


Introduction

The program matrix-clustering identifies groups of similar motifs within motif collections, aligns the motifs of each group, and displays the results in several motif-representation formats.

Transcription factor binding motifs (TFBMs) are classically represented either as consensus strings (strict consensus, IUPAC or regular expressions), or as position-specific scoring matrices (PSSMs).
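As a minimal sketch of how these two representations relate, the following Python snippet derives a strict consensus from a count matrix by taking the most frequent residue at each position, and converts the counts to relative frequencies. The matrix values are invented for illustration, and the function names are hypothetical; they are not part of RSAT.

```python
# Hypothetical count matrix: one dict per position (column), with the
# number of A, C, G and T observed at that position in the aligned sites.
# The values are made up for illustration, not taken from any database.
counts = [
    {"A": 0, "C": 18, "G": 1, "T": 1},
    {"A": 19, "C": 0, "G": 1, "T": 0},
    {"A": 1, "C": 17, "G": 1, "T": 1},
    {"A": 0, "C": 1, "G": 19, "T": 0},
    {"A": 1, "C": 1, "G": 0, "T": 18},
    {"A": 0, "C": 0, "G": 20, "T": 0},
]


def strict_consensus(matrix):
    """Return the most frequent residue at each position."""
    return "".join(max(column, key=column.get) for column in matrix)


def frequencies(matrix):
    """Convert counts to relative frequencies, column by column."""
    result = []
    for column in matrix:
        total = sum(column.values())
        result.append({base: n / total for base, n in column.items()})
    return result


consensus = strict_consensus(counts)
freqs = frequencies(counts)
print(consensus)
```

The consensus string is compact but discards the residual variability at each position, which the matrix retains.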

Thousands of curated TFBMs are available in specialized databases (JASPAR, RegulonDB, TRANSFAC, etc.). These PSSMs were traditionally built from collections of transcription factor binding sites (TFBSs) obtained by various experimental methods (e.g. EMSA, DNase footprinting, SELEX).

TFBMs can also be discovered ab initio from genome-scale datasets: promoters of co-expressed genes, ChIP-seq peaks, phylogenetic footprints, etc.

Motif collections (databases as well as ab initio motif discovery results) sometimes contain groups of similar motifs, for different reasons: curation of alternative motifs for the same TF; homologous proteins sharing a particular DNA-binding domain; motifs discovered with analytic workflows combining several algorithms (e.g. RSAT peak-motifs, or MEME-ChIP).

The RSAT tool matrix-clustering addresses this redundancy by grouping similar motifs. It includes several features, which are illustrated in this tutorial.

  1. Support for a large series of alternative metrics to compute the distances between matrices (Pearson correlation, Euclidean distance, SSD, Sandelin-Wasserman, logo dot product, and length-normalized versions of these scores).
  2. Possibility to combine several of these similarity metrics in order to define an integrative threshold.
  3. The set of input motifs is split into separate clusters, each of which can be displayed interactively.
  4. User-friendly display of motif trees with aligned logos and consensuses.
  5. At each level of the hierarchical tree, all the descendant matrices are aligned (multiple alignment), and a merged motif is computed (branch motif), which can be represented schematically as a branch logo or a branch consensus.
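To make some of the metrics listed above more concrete, here is a minimal Python sketch computing the Pearson correlation, the sum of squared distances (SSD) and the Euclidean distance between two frequency matrices that are already aligned and of equal width. The length normalization shown (score multiplied by the fraction of the total alignment length covered by the aligned columns) is an assumption made for illustration; the function names and example values are hypothetical, and this is not the RSAT implementation.

```python
import math


def flatten(matrix):
    """Concatenate all columns of a matrix into one list of values."""
    return [value for column in matrix for value in column]


def pearson(m1, m2):
    """Pearson correlation between the cells of two aligned matrices."""
    x, y = flatten(m1), flatten(m2)
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def ssd(m1, m2):
    """Sum of squared differences between corresponding cells."""
    return sum((a - b) ** 2 for a, b in zip(flatten(m1), flatten(m2)))


def euclidean(m1, m2):
    """Euclidean distance, i.e. the square root of the SSD."""
    return math.sqrt(ssd(m1, m2))


def length_normalized(score, aligned_width, width1, width2):
    """Assumed normalization: weight a score by the aligned fraction
    of the total alignment length (width1 + width2 - aligned_width)."""
    total_width = width1 + width2 - aligned_width
    return score * aligned_width / total_width


# Hypothetical aligned frequency matrices: one [A, C, G, T] list per column.
motif_a = [[0.1, 0.7, 0.1, 0.1], [0.8, 0.1, 0.05, 0.05], [0.1, 0.1, 0.7, 0.1]]
motif_b = [[0.2, 0.6, 0.1, 0.1], [0.7, 0.1, 0.1, 0.1], [0.1, 0.2, 0.6, 0.1]]

cor = pearson(motif_a, motif_b)
dist = euclidean(motif_a, motif_b)
# Example: 3 aligned columns between motifs of widths 6 and 5.
ncor = length_normalized(cor, aligned_width=3, width1=6, width2=5)
print(cor, dist, ncor)
```

Under this assumed normalization, a high correlation restricted to a short overlap between two long motifs yields a much lower normalized score, which is why the raw and normalized versions can rank the same pair differently.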

In this tutorial, we explain how to tune the parameters and interpret the results of matrix-clustering. We cover the following topics.

  1. Motif redundancy: examples in motif-discovery results and in motif databases.
  2. Thresholds: setting thresholds on a combination of similarity measures to define the groups of similar motifs.
  3. Impact of parameters: some examples showing how changing the values of the parameters can affect cluster composition or tree topology.
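The idea of a combined threshold, and its effect on cluster composition, can be sketched as follows. In this Python example, a pair of motifs is considered similar only if every selected metric reaches its lower threshold, and motifs are then grouped with a deliberately simplified linkage rule. The scores, the threshold values and the grouping rule are assumptions chosen for illustration; matrix-clustering itself builds a hierarchical tree, which this sketch does not reproduce.

```python
def passes_thresholds(scores, lower_thresholds):
    """True only if every selected metric reaches its lower threshold."""
    return all(scores[metric] >= value
               for metric, value in lower_thresholds.items())


def group_motifs(names, pair_scores, lower_thresholds):
    """Group motifs connected by at least one pair passing all thresholds.

    Simplified grouping for illustration only (not the tool's algorithm).
    """
    parent = {name: name for name in names}

    def find(name):
        # Follow parent links up to the representative of the group.
        while parent[name] != name:
            name = parent[name]
        return name

    for (a, b), scores in pair_scores.items():
        if passes_thresholds(scores, lower_thresholds):
            parent[find(a)] = find(b)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical pairwise scores between three motifs.
names = ["m1", "m2", "m3"]
pair_scores = {
    ("m1", "m2"): {"cor": 0.92, "Ncor": 0.75},
    ("m1", "m3"): {"cor": 0.70, "Ncor": 0.35},
    ("m2", "m3"): {"cor": 0.40, "Ncor": 0.20},
}

# Threshold on the correlation only, then on both metrics together.
permissive = group_motifs(names, pair_scores, {"cor": 0.6})
strict = group_motifs(names, pair_scores, {"cor": 0.6, "Ncor": 0.4})
print(permissive)
print(strict)
```

With the correlation alone, the three motifs fall into a single group; adding the threshold on the normalized correlation separates m3, whose similarity to m1 rests on a short overlap.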


Study cases

Study case 1

Goal: clustering a set of partly redundant motifs discovered by various algorithms.

Data set: To illustrate the use of motif clustering to filter out redundancy, we will analyze a set of motifs discovered with the workflow peak-motifs, which combines several motif discovery approaches:

  1. oligo-analysis
  2. position-analysis
  3. local-word-analysis
  4. dyad-analysis

We ran these algorithms on a set of peak sequences obtained by pulling down genomic regions bound by the transcription factor Oct4 in mouse ES cells. This experiment was performed in the context of a wider study, where Chen and colleagues characterized the binding locations of 12 transcription factors involved in mouse embryonic stem cell differentiation (Chen et al., 2008).

In this example, we will proceed in three steps.

  1. We will run matrix-clustering with its default parameters, which were tuned to give a suitable result in most cases.
  2. We will then run the same analysis with alternative threshold values, in order to analyze the impact of this crucial parameter on the result.
  3. Finally, we will run a negative control by randomizing the motifs (random permutations of the columns for each matrix) and running the clustering. In principle, the clustering algorithm should assign each matrix to a singleton (cluster containing only one element).
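The randomization used for the negative control can be sketched in a few lines of Python. The snippet below shuffles the columns of a matrix, which preserves the content of each column (and thus its information content) while destroying the order of the positions. The matrix values and the function name are hypothetical; this illustrates the principle rather than the RSAT procedure.

```python
import random


def permute_columns(matrix, rng):
    """Return a copy of the matrix with its columns in random order."""
    columns = list(matrix)  # copy, so the input matrix is left untouched
    rng.shuffle(columns)
    return columns


# Hypothetical count matrix: one [A, C, G, T] list per position.
motif = [
    [0, 18, 1, 1],
    [19, 0, 1, 0],
    [1, 17, 1, 1],
    [0, 1, 19, 0],
    [1, 1, 0, 18],
    [0, 0, 20, 0],
]

rng = random.Random(42)  # fixed seed, for a reproducible control
randomized = permute_columns(motif, rng)

for column in randomized:
    print(column)
```

Because each column is kept intact, the randomized matrices have the same residue composition as the originals, so any cluster found among them reflects compositional similarity rather than a shared binding motif.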

Study case 2

Beware: this study case requires the program to run for a while and uses substantial computing resources. It is thus only available for the command-line version of the tool.

Goal: cluster all motifs from specific sections (Vertebrates, Insects) of the JASPAR database (Sandelin et al., 2004), in order to identify groups of similar motifs, which may be bound by TFs of the same family (e.g. several factors of the Hox family).

Note that in some cases, factors with unrelated DNA-binding domains may recognize partly similar motifs (e.g. yeast Pho4p and Met4p both recognize a motif centered on CACGTG, but with different flanking residues).


Test sets


Tuning matrix-clustering parameters


Interpreting the results


References

  1. Chen, X., Xu, H., Yuan, P., Fang, F., Huss, M., Vega, V. B., Wong, E., Orlov, Y. L., Zhang, W., Jiang, J., Loh, Y. H., Yeo, H. C., Yeo, Z. X., Narang, V., Govindarajan, K. R., Leong, B., Shahab, A., Ruan, Y., Bourque, G., Sung, W. K., Clarke, N. D., Wei, C. L. and Ng, H. H. (2008). Integration of external signaling pathways with the core transcriptional network in embryonic stem cells. Cell 133, 1106-17. [Pubmed 18555785].
  2. Thomas-Chollier, M., Herrmann, C., Defrance, M., Sand, O., Thieffry, D. and van Helden, J. (2011). RSAT peak-motifs: motif analysis in full-size ChIP-seq datasets. Nucleic Acids Research, doi:10.1093/nar/gkr1104, 9. [Open access].
  3. Mathelier, A., Zhao, X., Zhang, A. W., Parcy, F., Worsley-Hunt, R., Arenillas, D. J., Buchman, S., Chen, C.-y., Chou, A., Ienasescu, H., Lim, J., Shyr, C., Tan, G., Zhou, M., Lenhard, B., Sandelin, A. and Wasserman, W. W. (2013). JASPAR 2014: an extensively expanded and updated open-access database of transcription factor binding profiles. Nucleic Acids Research. [Open access].

For suggestions or information requests, please contact Jaime Castro-Mondragon or Jacques van Helden.