ReMap - A regulatory map of transcription factor binding sites

Public datasets sources

We analysed 668 ChIP-seq experiments present the GEO repository starting from July 2008. Those datasets were complemented by 28 datasets present in ArrayExpress. After applying quality filters (described below) we retained 395 datasets. Here we define a “dataset” as a ChIP-seq experiment in a given GEO series (e.g. GSE41561), for a given TF (e.g. ESR1), in a particular biological condition (i.e. cell line, tissue type, disease state or experimental conditions ; e.g. MCF-7). Datasets were labeled by concatenating these three pieces of information such as GSE41561.ESR1.MCF-7.

For each transcription factor and each cell type, two tables recapitulate the datasets used. A green table represents the datasets used to create the catalogue of binding sites and a red table the datasets filtered out after applying quality filters.

Read mapping & Peak calling

Reads were mapped using Bowtie 2 with the following parameters (--end-to-end --sensitive) against the human genome (GRCh37/hg19 assembly). We used MACS to identify transcription factor binding sites. We retained peaks with stringent thresholds (p-value: 1e-5, enrichment: 10, FDR: 0.01) and dismissed others. However the original raw MACS output can be downloaded from the site.

For each dataset, mapping statistics are available (number of mapped and unmapped reads) as well as peak calling information (number of peaks identified before and after applying thresholds).

Integration of ENCODE data

To complete and validate our catalogue of binding sites, we integrated the ENCODE release V3 (August 2013) representing all ENCODE TF ChIP-seq experiments passing quality assessments. ENCODE data provides binding sites clustered by factor, based on 690 datasets and 161 TFs (including 3 TFs removed and 3 others renamed). Similar TF binding sites between Public datasets and ENCODE dataset were merged.

Catalogues of All peaks and Non-redundant peaks

After consistent peak calling, we identified 8.9 million sites bound by transcription factors (13 million with ENCODE data included). These numbers include overlapping sites for indentical TFs which were studied in various conditions. To address this we merged overlapping TF binding sites for similar TFs obtaining a catalogue of 5.4 million non-redundant binding sites (8.8 million with ENCODE data).

Annotation and classification of transcription factors

Function and description of transcription factors present in this catalogue (Public, Public+Encode) were retrieved from HGNC and RefSeq databases. Each transcription factor was also annotated using the classification of human transcription factors allowing users to filter specific TFs based on the characteristics of their DNA-binding domains.

Genomic visualization of peaks and analyses

To perform a de novo motifs analysis for each TF present in our catalogue, we provide a link to the Regulatory Sequence Analysis Tools. A link to the UCSC Genome Browser was also added to facilitate genomic integration of the binding sites with other genome annotations. Our BED tracks allow for the visualization of our catalogues of binding sites on the human genome. Finally, different analyses such as the quality of datasets and DNA constraint analysis are provided for each transcription factor.

Datasets quality assessment

As not every ChIP-seq datasets are equal in terms of quality, we used four different metrics based on ENCODE ChIP-seq guidelines to retain high quality datasets for downstream analyses. First we used the normalized strand cross-correlation coefficient (NSC) which is a normalized ratio between the fragment-length cross-correlation peak and the background cross-correlation, and the relative strand cross-correlation coefficient (RSC), a ratio between the fragment-length peak and the read-length peak to exclude low quality datasets. We also used the fraction of reads in peaks (FRiP) and the number of peaks identified in each dataset to filter datasets.

A bubble plot is available on each page representing the quality of selected dataset(s). Dataset(s) are plotted in a 2D vizualization with NSC and RSC as x- and y-axis, the size of points is correlated with the number of peaks identified in the dataset and colours highlight the datasets conserved (green) or excluded (red) from the catalogue of binding sites.

DNA constraints under non-redundant peaks

DNA conservation around transcription factor binding sites was computed using the Genomic Evolutionary Rate Profiling (GERP) score obtained and extracted from the Ensembl Compara database for Ensembl v73 using the Compara Perl API. Detailed information can be found at Ensembl and the Sidow Lab.
A plot is available for each transcription factor representing the average DNA constraint for each nucleotide under peak summits.

Downloading peaks and sequences

For each transcription factor, cell type and dataset, we provide files to download peaks in BED format and sequences in FASTA format. The entire catalogue is openly accessible to the community and available to download in BED format. Please do not hesitate to contact us if you need this catalogue in a different format.

Annotation Tool

We provide an annotation tool to annotate user's submitted regions with our catalogue of transcription factor binding sites and to compute statistical enrichments of TFs in these regions.

To do this, we first use intersectBed tool from the BEDTools suite to identify which binding sites of our catalogue fall into the submitted regions. Then, we compare the number of these overlapping regions with the number of overlaps obtained with random regions (same size and number as the submitted regions) to calculate the enrichment of TFs.

Several settings are used by this tool :

the first option allows you to extend genes (in upstream and/or downstream) before processing the overlap of our catalogue with regions defined around the genes submitted.
This option is available only if a list of Ensembl IDs is provided.
the two next options indicate the minimum percentage of overlap required, and on which data you want to apply this percentage (submitted data or our catalogue). By default, each of our binding sites must overlap your regions by at least 10% to be take into account.
the last option indicate if the minimum percentage of overlap needs to be applied on both data (submitted data and our catalogue).

Highcharts

This site uses the Highcharts library following the Highcharts licence under the Creative Commons Attribution-NonCommercial 3.0 License (CC BY-NC) as a non-profit / University site (Aix-Marseille University).

ReMap 2015

How the ReMap catalogue was constructed ?

Which information can you find on ReMap ?

How the annotation tool works ?

Credits