Exercise

K-mer occurrences were counted for three k-mer sizes (\(k=6\), \(7\), and \(8\)) in three types of sequences.

  1. CEBPA peaks: a set of ChIP-seq peaks obtained by immunoprecipitating the mouse transcription factor CEBPA in liver tissue (Smith et al., 2010).
  2. Genomic occurrences: k-mer occurrences counted in the whole mouse genome.
  3. Random regions: regions picked at random in the mouse genome, matching the CEBPA peaks in number and size.

The goal of this practical is to detect over-represented k-mers in the CEBPA peaks, in order to predict putative transcription factor binding motifs.

We will approach the problem by drawing plots that compare k-mer frequencies in the sequences of interest (CEBPA peaks) with those in each of the other sequence types. We will then progressively refine the statistics in order to assess the significance of the k-mer over-representation.

  1. Explore k-mer occurrences in CEBPA peaks and in the whole genome.

    1. Load the count table for 6-mer occurrences in CEBPA peaks.
    2. Compute the sum of occurrences and the sum of frequencies.
    3. Plot a histogram of the distribution of k-mer occurrences.
    4. Do the same with genomic occurrences.
  2. Explore k-mer occurrences in random regions.

    1. Each student should draw a random number \(i\) between 1 and 8 with the R function sample().
    2. Load the count table for 6-mer occurrences in the \(i^{th}\) repetition of random regions.
    3. Compute the sum of occurrences and the sum of frequencies.
    4. Plot a histogram of the distribution of k-mer occurrences.
  3. Compare k-mer occurrences between CEBPA peaks and random regions.

    1. Draw an XY plot of k-mer occurrences in CEBPA peaks versus random regions.
    2. Compute comparison statistics to detect over-represented k-mers in CEBPA peaks relative to random regions.
  4. MA plots

    1. Compute the ratio between the k-mer occurrences found in peaks and those found in your random regions.
    2. Compute the mean of the occurrences in peaks and in your random regions.
    3. Plot a graph comparing the occurrence ratio with the average occurrence number.
    4. Compute the log2-ratio of occurrences, and draw a graph equivalent to the MA-plot.
  5. Compute the p-value of k-mer over-representation.

    1. With a Poisson distribution.
    2. With a binomial distribution.
  6. Homework

    1. Based on genomic occurrences, compute the occurrences that would be expected by chance in peaks of the same size as CEBPA peaks.
    2. Analyse over-represented words, and compare them with the mouse CEBPA motif annotated in the JASPAR database (http://jaspar.genereg.net/).
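
Steps 4 to 6 above can be sketched as follows. This is only an illustrative outline written in Python (the practical itself relies on R), and all function names are hypothetical. It assumes that the expected count of a k-mer in the peaks is its background frequency multiplied by the total number of k-mer occurrences in the peaks, that the p-value is the upper tail \(P(X \ge x)\) of the chosen distribution, and that a pseudo-count of 1 is added before taking logarithms to avoid \(\log_2(0)\). None of these choices is prescribed by the exercise.

```python
import math


def expected_count(bg_count, bg_total, peak_total):
    """Expected occurrences in peaks, scaled from a background frequency.

    Assumption: expected = (bg_count / bg_total) * peak_total.
    """
    return bg_count / bg_total * peak_total


def log2_ratio_and_mean(peak_count, bg_count, pseudo=1.0):
    """Return (M, A) as in an MA plot.

    M is the log2 ratio of the two counts, A the mean of their log2 values.
    The pseudo-count is an assumption made here to handle zero counts.
    """
    log_peak = math.log2(peak_count + pseudo)
    log_bg = math.log2(bg_count + pseudo)
    return log_peak - log_bg, (log_peak + log_bg) / 2


def poisson_upper_tail(x, lam):
    """P(X >= x) for X ~ Poisson(lam), with lam > 0."""
    if x <= 0:
        return 1.0
    total = 0.0
    i = x
    while True:
        # Each term is evaluated in log space to avoid overflow.
        term = math.exp(-lam + i * math.log(lam) - math.lgamma(i + 1))
        total += term
        # Past the mean the terms only decrease; stop once negligible.
        if i > lam and term <= 1e-17 * total:
            break
        i += 1
    return min(total, 1.0)


def binomial_upper_tail(x, n, p):
    """P(X >= x) for X ~ Binomial(n, p), with 0 < p < 1."""
    if x <= 0:
        return 1.0
    if x > n:
        return 0.0
    total = 0.0
    log_n_fact = math.lgamma(n + 1)
    for i in range(x, n + 1):
        log_term = (log_n_fact - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                    + i * math.log(p) + (n - i) * math.log1p(-p))
        term = math.exp(log_term)
        total += term
        # Past the mean n*p the terms only decrease; stop once negligible.
        if i > n * p and term <= 1e-17 * total:
            break
    return min(total, 1.0)


# Made-up numbers for a single k-mer, for illustration only.
peak_count, peak_total = 120, 50_000
bg_count, bg_total = 3_000, 2_500_000

lam = expected_count(bg_count, bg_total, peak_total)
m_value, a_value = log2_ratio_and_mean(peak_count, lam)
p_poisson = poisson_upper_tail(peak_count, lam)
p_binomial = binomial_upper_tail(peak_count, peak_total, bg_count / bg_total)
```

In R, the same tail probabilities are available through the built-in functions ppois() and pbinom() with lower.tail = FALSE, evaluated at \(x - 1\) to obtain \(P(X \ge x)\). Because one test is run per k-mer (4096 tests for \(k=6\)), the resulting p-values would normally need a multiple-testing correction before any k-mer is declared significant.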

Datasets

K-mer occurrences in CEBPA peaks from Smith et al. (2010) in the mouse genome (Mus musculus).

Solutions