Medina-Rivera et al., 2010
Theoretical and empirical quality assessment of transcription factor binding motifs
Supplementary material
Alejandra Medina-Rivera, Cei Abreu-Goodger, Morgane Thomas-Chollier, Heladia Salgado-Osorio, Julio Collado-Vides and Jacques van Helden
Abstract
Position-specific scoring matrices are routinely used to predict
transcription factor binding sites in genome sequences. However, their
reliability in predicting novel binding sites can be far from optimal,
due to the use of a small number of training sites or the
inappropriate choice of parameters when building the matrix or when
scanning sequences with it. Measures of matrix quality such as E-value
and information content rely on theoretical models, and may fail in
the context of full genome sequences. We propose a method, implemented
in the program matrix-quality, that combines theoretical and empirical
score distributions to assess the predictive capability of
position-specific matrices. We applied matrix-quality to evaluate the
matrices of 60 transcription factors from RegulonDB, detected poorly
predictive motifs, and quantified the improvements obtained by
applying multi-genome motif discovery. Interestingly, we show that the
method reveals differences between global and specific regulators, and
can highlight the enrichment of binding sites in sequence sets
obtained from high-throughput ChIP-chip and ChIP-seq experiments. Our
method has many applications, from improving motif collections in
transcription factor databases to the analysis of data coming from new
sequencing technologies for characterizing genomes, gene regulation,
and transcription factor-DNA interactions.
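As a rough illustration of what comparing theoretical and empirical score distributions can look like, here is a minimal Python sketch. It is not the matrix-quality implementation. The function names, the toy binding sites, the uniform background, and the pseudocount scheme are all assumptions made for this example, and the exhaustive enumeration of words is only practical for short motifs.

```python
import itertools
import math

ALPHABET = "ACGT"


def build_pssm(sites, background, pseudocount=1.0):
    """Build a log-odds weight matrix (natural log) from aligned sites.

    The pseudocount is distributed according to the background frequencies
    (an assumed, commonly used scheme).
    """
    width = len(sites[0])
    pssm = []
    for pos in range(width):
        column = {}
        for base in ALPHABET:
            count = sum(1 for site in sites if site[pos] == base)
            freq = (count + pseudocount * background[base]) / (len(sites) + pseudocount)
            column[base] = math.log(freq / background[base])
        pssm.append(column)
    return pssm


def score(pssm, word):
    """Sum the per-position weights of a word with the same width as the matrix."""
    return sum(column[base] for column, base in zip(pssm, word))


def theoretical_tail(pssm, background, threshold):
    """P(score >= threshold) under the background model.

    Computed by enumerating every possible word, so only usable for short motifs.
    """
    total = 0.0
    for word in itertools.product(ALPHABET, repeat=len(pssm)):
        if score(pssm, word) >= threshold:
            total += math.prod(background[base] for base in word)
    return total


def empirical_tail(pssm, sequence, threshold):
    """Fraction of sequence windows whose score is >= threshold."""
    width = len(pssm)
    windows = [sequence[i:i + width] for i in range(len(sequence) - width + 1)]
    hits = sum(1 for window in windows if score(pssm, window) >= threshold)
    return hits / len(windows)


# Hypothetical toy data, for illustration only.
background = {base: 0.25 for base in ALPHABET}
sites = ["TGACA", "TGACT", "TGTCA", "TCACA"]
pssm = build_pssm(sites, background)

sequence = "TGACATGACA"
threshold = score(pssm, "TGACA")
print("theoretical:", theoretical_tail(pssm, background, threshold))
print("empirical:  ", empirical_tail(pssm, sequence, threshold))
```

In this sketch, an empirical tail frequency well above the theoretical tail probability at high scores means that the scanned sequence contains more high-scoring windows than the background model predicts.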
Case Studies