Datasets

K-mer occurrences in CEBPA peaks from Smith et al. (2010) in the mouse genome (Mus musculus).

Count distributions

K-mer occurrences in the peaks

K-mer occurrences in random genomic regions

          mean min max    sum
peaks 90.26060   1 426 187381
rand  86.99903   1 435 180262

Estimating the background from k-mer frequencies in random peaks

Build a table to compare k-mer occurrences between peaks and random genome regions

       Row.names  identifier.x   obs_freq.x occ.x ovl_occ.x forbocc.x
aaaaaa    aaaaaa aaaaaa|tttttt 0.0009417790   178       232       856
aaaaac    aaaaac aaaaac|gttttt 0.0008729974   165         0       798
aaaaag    aaaaag aaaaag|cttttt 0.0010052697   190         0       942
aaaaat    aaaaat aaaaat|attttt 0.0010211424   193         0       921
aaaaca    aaaaca aaaaca|tgtttt 0.0016454678   311         6      1525
aaaacc    aaaacc aaaacc|ggtttt 0.0006560708   124         0       613
        identifier.y  obs_freq.y occ.y ovl_occ.y forbocc.y
aaaaaa aaaaaa|tttttt 0.002096105   385       564      1893
aaaaac aaaaac|gttttt 0.001497218   275         0      1354
aaaaag aaaaag|cttttt 0.001535329   282         0      1378
aaaaat aaaaat|attttt 0.002313882   425         3      2104
aaaaca aaaaca|tgtttt 0.002003550   368        10      1808
aaaacc aaaacc|ggtttt 0.001012664   186         0       918
[1] 2079
       peaks rand    peak.freq   rand.freq    mean.freq
aaaaaa   178  385 0.0009499362 0.002135780 0.0015428582
aaaaac   165  275 0.0008805589 0.001525557 0.0012030581
aaaaag   190  282 0.0010139769 0.001564390 0.0012891832
aaaaat   193  425 0.0010299870 0.002357679 0.0016938332
aaaaca   311  368 0.0016597200 0.002041473 0.0018505965
aaaacc   124  186 0.0006617533 0.001031831 0.0008467924
          identifier     obs_freq occ ovl_occ forbocc
aaaaaa aaaaaa|tttttt 0.0009417790 178     232     856
aaaaac aaaaac|gttttt 0.0008729974 165       0     798
aaaaag aaaaag|cttttt 0.0010052697 190       0     942
aaaaat aaaaat|attttt 0.0010211424 193       0     921
aaaaca aaaaca|tgtttt 0.0016454678 311       6    1525
aaaacc aaaacc|ggtttt 0.0006560708 124       0     613
          identifier    obs_freq occ ovl_occ forbocc
aaaaaa aaaaaa|tttttt 0.002096105 385     564    1893
aaaaac aaaaac|gttttt 0.001497218 275       0    1354
aaaaag aaaaag|cttttt 0.001535329 282       0    1378
aaaaat aaaaat|attttt 0.002313882 425       3    2104
aaaaca aaaaca|tgtttt 0.002003550 368      10    1808
aaaacc aaaacc|ggtttt 0.001012664 186       0     918
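
As a minimal sketch of how the peak.freq, rand.freq and mean.freq columns can be derived, the Python snippet below divides each count by the column total reported in the summary table (187381 for peaks, 180262 for random regions) and averages the two frequencies. The function name `relative_frequencies` is hypothetical, and treating a k-mer absent from one set as a zero count is an assumption about how the merge was done.

```python
# Counts copied from the first rows of the tables above.
peak_occ = {"aaaaaa": 178, "aaaaac": 165, "aaaaag": 190}
rand_occ = {"aaaaaa": 385, "aaaaac": 275, "aaaaag": 282}

# Totals taken from the "sum" column of the summary table.
PEAK_TOTAL = 187381
RAND_TOTAL = 180262


def relative_frequencies(peak_occ, rand_occ, peak_total, rand_total):
    """Return {kmer: (peak_freq, rand_freq, mean_freq)}.

    Assumption: a k-mer missing from one set is counted as 0 there.
    """
    table = {}
    for kmer in sorted(set(peak_occ) | set(rand_occ)):
        peak_freq = peak_occ.get(kmer, 0) / peak_total
        rand_freq = rand_occ.get(kmer, 0) / rand_total
        table[kmer] = (peak_freq, rand_freq, (peak_freq + rand_freq) / 2)
    return table


freqs = relative_frequencies(peak_occ, rand_occ, PEAK_TOTAL, RAND_TOTAL)
for kmer, (pf, rf, mf) in freqs.items():
    print(f"{kmer}  {pf:.10f}  {rf:.9f}  {mf:.10f}")
```

For the rows shown, these values agree with the frequency table above to the displayed precision.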

Occurrence ratios

[1]   0 Inf
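
The range from 0 to Inf arises when a k-mer is missing from one of the two sets. The Python sketch below illustrates this with hypothetical helper names (`occurrence_ratio`, `log2_ratio`); the "absent" rows are invented for illustration and are not taken from the data.

```python
import math


def occurrence_ratio(peak_freq, rand_freq):
    """Ratio of peak to random frequency, with explicit handling of zeros."""
    if rand_freq == 0:
        # No occurrence in the background: the ratio is unbounded
        # (or undefined when the k-mer is absent from both sets).
        return math.inf if peak_freq > 0 else math.nan
    return peak_freq / rand_freq


def log2_ratio(peak_freq, rand_freq):
    """log2 of the occurrence ratio; +/-inf for the degenerate cases."""
    ratio = occurrence_ratio(peak_freq, rand_freq)
    if math.isnan(ratio):
        return math.nan
    if ratio == 0:
        return -math.inf
    if math.isinf(ratio):
        return math.inf
    return math.log2(ratio)


# Real row from the frequency table above (aaaaaa).
print(occurrence_ratio(0.0009499362, 0.002135780))
# Invented rows: absent from the random regions, then absent from the peaks.
print(occurrence_ratio(2.7e-05, 0.0))
print(occurrence_ratio(0.0, 2.7e-05))
```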

Log2-ratios

M-A plot

Finally, I prefer to keep the mean occurrences on the X axis rather than log2(mean occurrences).
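
A possible way to compute the plot coordinates is sketched below in Python. It assumes that M is the log2-ratio of the peak and random frequencies and that A is the plain mean of the two occurrence counts (not its log2); the function name `ma_point` and the choice to skip k-mers with a zero count are assumptions, not something stated in the text.

```python
import math

PEAK_TOTAL = 187381  # "sum" column of the summary table, peaks
RAND_TOTAL = 180262  # "sum" column of the summary table, random regions


def ma_point(peak_occ, rand_occ, peak_total=PEAK_TOTAL, rand_total=RAND_TOTAL):
    """Return (A, M) for one k-mer, or None when M is undefined.

    A = mean of the two occurrence counts (X axis).
    M = log2(peak_freq / rand_freq) (Y axis).
    """
    if peak_occ <= 0 or rand_occ <= 0:
        return None  # log2-ratio is infinite or undefined
    m = math.log2((peak_occ / peak_total) / (rand_occ / rand_total))
    a = (peak_occ + rand_occ) / 2
    return a, m


# Counts copied from the tables above.
rows = {"aaaaaa": (178, 385), "aaaaca": (311, 368)}
points = {kmer: ma_point(p, r) for kmer, (p, r) in rows.items()}
for kmer, (a, m) in points.items():
    print(f"{kmer}  A={a:.1f}  M={m:.3f}")
```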

Log-likelihood ratio (LLR)

\[\mathrm{LLR} = f_{exp} \cdot \log_2(f_{obs}/f_{exp})\]

The log-likelihood ratio effectively reduces the impact of small-number fluctuations: rare k-mers (left side of the LLR plot) get very low scores, whereas the ratio and the log2-ratio tend to over-emphasize them.
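
The damping effect can be checked numerically. The Python sketch below applies the formula exactly as written above, taking the peak frequency as f_obs and the random-region frequency as f_exp (an assumption consistent with this section). The function name `llr` is hypothetical, and the "rare k-mer" values are invented for illustration.

```python
import math


def llr(f_obs, f_exp):
    """LLR = f_exp * log2(f_obs / f_exp), following the formula above."""
    if f_exp <= 0:
        return math.nan      # expectation of zero: score undefined
    if f_obs <= 0:
        return -math.inf     # log2(0) diverges
    return f_exp * math.log2(f_obs / f_exp)


# Real row (aaaaaa): frequencies from the table above.
frequent = llr(0.0009499362, 0.002135780)

# Invented rare k-mer: 4-fold enrichment, but tiny frequencies.
rare = llr(4e-06, 1e-06)

print(frequent, rare)
```

In this sketch the invented rare k-mer has a log2-ratio of 2 but an LLR of only about 2e-06, because the log-ratio is weighted by a very small frequency.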

Compute p-value of over-representation with the Poisson law
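
A minimal Python sketch of this computation follows, using only the standard library. It assumes that the p-value is the upper tail P(X >= x) of a Poisson distribution whose mean is the expected number of occurrences, itself taken as the random-region frequency times the total k-mer count in the peaks. The function name `poisson_upper_tail` is hypothetical.

```python
import math


def poisson_upper_tail(x, lam, rel_tol=1e-15):
    """P(X >= x) for X ~ Poisson(lam), with x a non-negative integer."""
    if lam < 0:
        raise ValueError("lam must be non-negative")
    if x <= 0:
        return 1.0
    if lam == 0:
        return 0.0
    if x <= lam:
        # Tail is large: subtract the lower tail P(X <= x - 1) from 1.
        lower = sum(
            math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
            for k in range(x)
        )
        return max(0.0, 1.0 - lower)
    # x > lam: terms decrease monotonically, so sum them directly
    # to keep precision for very small p-values.
    term = math.exp(-lam + x * math.log(lam) - math.lgamma(x + 1))
    total = term
    k = x
    while True:
        k += 1
        term *= lam / k
        total += term
        if term <= rel_tol * total:
            break
    return min(total, 1.0)


# aaaaca: 311 occurrences in peaks, rand.freq = 0.002041473 (table above).
expected = 0.002041473 * 187381
p_value = poisson_upper_tail(311, expected)
print(expected, p_value)
```

Here the observed count is below the expectation, so the p-value of over-representation is close to 1.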

Intermediate interpretation

So far, we performed all our analyses using a random selection of genomic regions (“random peaks”) as background sequences to estimate the expected number of occurrences of each k-mer in the peaks. These random peaks were selected with the same sizes as the actual peaks, so the total number of occurrences is expected to be roughly the same as in the peaks (small differences may arise from N characters in the genomic sequences).

However, the results are problematic: the random expectation is estimated from a small sequence set, so the counts fluctuate, especially for rare k-mers. We even noticed that some hexamers have zero occurrences in the random peaks.

Estimating the background from genomic k-mer frequencies

          identifier    obs_freq     occ ovl_occ  forbocc
aaaaaa aaaaaa|tttttt 0.001888701 5013666 6890306 25068330
aaaaac aaaaac|gttttt 0.001292811 3431840       0 17159200
aaaaag aaaaag|cttttt 0.001506224 3998357       0 19991785
aaaaat aaaaat|attttt 0.002029702 5387959   23054 26939795
aaaaca aaaaca|tgtttt 0.001889655 5016197  216873 25080985
aaaacc aaaacc|ggtttt 0.000887235 2355216       0 11776080
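
With genome-wide frequencies, the expected count of each k-mer in the peaks can be estimated without sampling random regions. The Python sketch below assumes that the expectation is the genomic obs_freq multiplied by the total k-mer count in the peaks (187381, from the summary table), and that the genomic and peak tables group reverse-complement pairs in the same way. The function name `expected_occurrences` is hypothetical.

```python
# Genomic frequencies copied from the table above.
genome_freq = {
    "aaaaaa": 0.001888701,
    "aaaaac": 0.001292811,
    "aaaaca": 0.001889655,
}

# Observed occurrences in the peaks (from the peak table).
peak_occ = {"aaaaaa": 178, "aaaaac": 165, "aaaaca": 311}

PEAK_TOTAL = 187381  # "sum" column of the summary table, peaks


def expected_occurrences(genome_freq, peak_total):
    """Expected count per k-mer = genomic frequency * total peak k-mers."""
    return {kmer: freq * peak_total for kmer, freq in genome_freq.items()}


exp_occ = expected_occurrences(genome_freq, PEAK_TOTAL)
for kmer in sorted(exp_occ):
    ratio = peak_occ[kmer] / exp_occ[kmer]
    print(f"{kmer}  obs={peak_occ[kmer]}  exp={exp_occ[kmer]:.1f}  ratio={ratio:.3f}")
```

Because every genomic frequency shown here rests on millions of occurrences, the expectation no longer suffers from zero counts or small-sample fluctuations.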