Medina-Rivera et al., 2010

Theoretical and empirical quality assessment of transcription factor binding motifs

Supplementary material

Alejandra Medina-Rivera, Cei Abreu-Goodger, Morgane Thomas-Chollier, Heladia Salgado-Osorio, Julio Collado-Vides and Jacques van Helden

Abstract

Position-specific scoring matrices are routinely used to predict transcription factor binding sites in genome sequences. However, their reliability in predicting novel binding sites can be far from optimal, due to the use of a small number of training sites or the inappropriate choice of parameters when building the matrix or when scanning sequences with it. Measures of matrix quality such as E-value and information content rely on theoretical models, and may fail in the context of full genome sequences. We propose a method, implemented in the program matrix-quality, that combines theoretical and empirical score distributions to assess the predictive capability of position-specific matrices. We applied matrix-quality to evaluate the matrices of 60 transcription factors from RegulonDB, detected poorly predictive motifs, and quantified the improvements obtained by applying multi-genome motif discovery. Interestingly, we show that the method reveals differences between global and specific regulators, and can highlight the enrichment of binding sites in sequence sets obtained from high-throughput ChIP-chip and ChIP-seq experiments. Our method has many applications, from improving motif collections in transcription factor databases to the analysis of data coming from new sequencing technologies for characterizing genomes, gene regulation, and transcription factor-DNA interactions.
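To make the idea of comparing theoretical and empirical score distributions concrete, here is a minimal Python sketch. It is not the matrix-quality implementation: the function names, the pseudocount scheme, the uniform background, and the toy sites and sequence are all assumptions made for illustration, and the theoretical distribution is obtained by brute-force enumeration of all words, which is only practical for short matrices.

```python
import math
from itertools import product

ALPHABET = "ACGT"

def build_pssm(sites, background, pseudo=1.0):
    """Log-odds weights from aligned sites (hypothetical helper).
    The pseudocount is distributed according to the background frequencies."""
    pssm = []
    for i in range(len(sites[0])):
        column = [site[i] for site in sites]
        weights = {}
        for base in ALPHABET:
            freq = (column.count(base) + pseudo * background[base]) / (len(sites) + pseudo)
            weights[base] = math.log(freq / background[base])
        pssm.append(weights)
    return pssm

def score(pssm, word):
    """Sum of the position-specific weights for one word."""
    return sum(col[base] for col, base in zip(pssm, word))

def theoretical_tail(pssm, background, threshold):
    """P(score >= threshold) for a random word drawn from an
    independent-base background, by enumerating all 4^w words."""
    total = 0.0
    for word in product(ALPHABET, repeat=len(pssm)):
        if score(pssm, word) >= threshold:
            total += math.prod(background[base] for base in word)
    return total

def empirical_tail(pssm, sequences, threshold):
    """Fraction of scanned positions whose score is >= threshold."""
    w = len(pssm)
    scores = [score(pssm, seq[i:i + w])
              for seq in sequences
              for i in range(len(seq) - w + 1)]
    return sum(s >= threshold for s in scores) / len(scores)

# Toy data, invented for illustration only.
sites = ["TGACA", "TGACA", "TGACT", "TGTCA"]
background = {base: 0.25 for base in ALPHABET}
pssm = build_pssm(sites, background)
threshold = score(pssm, "TGACA")

p_theoretical = theoretical_tail(pssm, background, threshold)
p_empirical = empirical_tail(pssm, ["TGACATGACA"], threshold)
print(p_theoretical, p_empirical)
```

In this toy case the empirical frequency of high-scoring positions is far above the frequency expected under the background model, which is the kind of gap that signals enrichment of binding sites in a sequence set. When the two curves coincide, the matrix does not distinguish the scanned sequences from random ones.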

Case Studies