K-mer occurrences in CEBPA peaks from Smith et al. (2010) in the mouse genome (Mus musculus).
| Data type | k | repeat | Table |
|---|---|---|---|
| CEBPA peaks | 6 | | CEBPA_mm9_SWEMBL_R0.12_6nt-noov-2str.tab |
| genomic occurrences | 6 | full genome | mm10_genome_6nt-noov-2str.tab |
| Random regions | 6 | 01 | random-genome-fragments_mm10_repeat01_6nt-noov-2str.tab |
| Random regions | 6 | 02 | random-genome-fragments_mm10_repeat02_6nt-noov-2str.tab |
| Random regions | 6 | 03 | random-genome-fragments_mm10_repeat03_6nt-noov-2str.tab |
| Random regions | 6 | 04 | random-genome-fragments_mm10_repeat04_6nt-noov-2str.tab |
| Random regions | 6 | 05 | random-genome-fragments_mm10_repeat05_6nt-noov-2str.tab |
| Random regions | 6 | 06 | random-genome-fragments_mm10_repeat06_6nt-noov-2str.tab |
| Random regions | 6 | 07 | random-genome-fragments_mm10_repeat07_6nt-noov-2str.tab |
| Random regions | 6 | 08 | random-genome-fragments_mm10_repeat08_6nt-noov-2str.tab |

| Data type | k | repeat | Table |
|---|---|---|---|
| CEBPA peaks | 7 | | CEBPA_mm9_SWEMBL_R0.12_7nt-noov-2str.tab |
| genomic occurrences | 7 | full genome | mm10_genome_7nt-noov-2str.tab |
| Random regions | 7 | 01 | random-genome-fragments_mm10_repeat01_7nt-noov-2str.tab |
| Random regions | 7 | 02 | random-genome-fragments_mm10_repeat02_7nt-noov-2str.tab |
| Random regions | 7 | 03 | random-genome-fragments_mm10_repeat03_7nt-noov-2str.tab |
| Random regions | 7 | 04 | random-genome-fragments_mm10_repeat04_7nt-noov-2str.tab |
| Random regions | 7 | 05 | random-genome-fragments_mm10_repeat05_7nt-noov-2str.tab |
| Random regions | 7 | 06 | random-genome-fragments_mm10_repeat06_7nt-noov-2str.tab |
| Random regions | 7 | 07 | random-genome-fragments_mm10_repeat07_7nt-noov-2str.tab |
| Random regions | 7 | 08 | random-genome-fragments_mm10_repeat08_7nt-noov-2str.tab |

| Data type | k | repeat | Table |
|---|---|---|---|
| CEBPA peaks | 8 | | CEBPA_mm9_SWEMBL_R0.12_8nt-noov-2str.tab |
| genomic occurrences | 8 | full genome | mm10_genome_8nt-noov-2str.tab |
| Random regions | 8 | 01 | random-genome-fragments_mm10_repeat01_8nt-noov-2str.tab |
| Random regions | 8 | 02 | random-genome-fragments_mm10_repeat02_8nt-noov-2str.tab |
| Random regions | 8 | 03 | random-genome-fragments_mm10_repeat03_8nt-noov-2str.tab |
| Random regions | 8 | 04 | random-genome-fragments_mm10_repeat04_8nt-noov-2str.tab |
| Random regions | 8 | 05 | random-genome-fragments_mm10_repeat05_8nt-noov-2str.tab |
| Random regions | 8 | 06 | random-genome-fragments_mm10_repeat06_8nt-noov-2str.tab |
| Random regions | 8 | 07 | random-genome-fragments_mm10_repeat07_8nt-noov-2str.tab |
| Random regions | 8 | 08 | random-genome-fragments_mm10_repeat08_8nt-noov-2str.tab |

| | mean | min | max | sum |
|---|---|---|---|---|
| peaks | 90.26060 | 1 | 426 | 187381 |
| rand | 86.99903 | 1 | 435 | 180262 |
Row.names identifier.x obs_freq.x occ.x ovl_occ.x forbocc.x
aaaaaa aaaaaa aaaaaa|tttttt 0.0009417790 178 232 856
aaaaac aaaaac aaaaac|gttttt 0.0008729974 165 0 798
aaaaag aaaaag aaaaag|cttttt 0.0010052697 190 0 942
aaaaat aaaaat aaaaat|attttt 0.0010211424 193 0 921
aaaaca aaaaca aaaaca|tgtttt 0.0016454678 311 6 1525
aaaacc aaaacc aaaacc|ggtttt 0.0006560708 124 0 613
identifier.y obs_freq.y occ.y ovl_occ.y forbocc.y
aaaaaa aaaaaa|tttttt 0.002096105 385 564 1893
aaaaac aaaaac|gttttt 0.001497218 275 0 1354
aaaaag aaaaag|cttttt 0.001535329 282 0 1378
aaaaat aaaaat|attttt 0.002313882 425 3 2104
aaaaca aaaaca|tgtttt 0.002003550 368 10 1808
aaaacc aaaacc|ggtttt 0.001012664 186 0 918
[1] 2079
peaks rand peak.freq rand.freq mean.freq
aaaaaa 178 385 0.0009499362 0.002135780 0.0015428582
aaaaac 165 275 0.0008805589 0.001525557 0.0012030581
aaaaag 190 282 0.0010139769 0.001564390 0.0012891832
aaaaat 193 425 0.0010299870 0.002357679 0.0016938332
aaaaca 311 368 0.0016597200 0.002041473 0.0018505965
aaaacc 124 186 0.0006617533 0.001031831 0.0008467924
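
The frequency columns above are consistent with dividing each count by the corresponding column total from the summary table (187381 for the peaks, 180262 for the random regions), then averaging the two frequencies. Below is a minimal Python sketch of that computation, assuming those totals are indeed the denominators. The function name `relative_frequencies` is hypothetical and is not taken from the original analysis.

```python
# Totals taken from the "sum" column of the summary table above.
PEAK_TOTAL = 187381
RAND_TOTAL = 180262

# Occurrence counts (peaks, rand) for the six hexamers listed above.
counts = {
    "aaaaaa": (178, 385),
    "aaaaac": (165, 275),
    "aaaaag": (190, 282),
    "aaaaat": (193, 425),
    "aaaaca": (311, 368),
    "aaaacc": (124, 186),
}


def relative_frequencies(counts, peak_total, rand_total):
    """Return {kmer: (peak_freq, rand_freq, mean_freq)}."""
    freqs = {}
    for kmer, (peaks, rand) in counts.items():
        peak_freq = peaks / peak_total
        rand_freq = rand / rand_total
        freqs[kmer] = (peak_freq, rand_freq, (peak_freq + rand_freq) / 2)
    return freqs


freqs = relative_frequencies(counts, PEAK_TOTAL, RAND_TOTAL)
for kmer, (peak_freq, rand_freq, mean_freq) in freqs.items():
    print(f"{kmer}\t{peak_freq:.10f}\t{rand_freq:.9f}\t{mean_freq:.10f}")
```

Note that these frequencies differ slightly from the `obs_freq` columns of the input tables, which were presumably computed against a different total.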
identifier obs_freq occ ovl_occ forbocc
aaaaaa aaaaaa|tttttt 0.0009417790 178 232 856
aaaaac aaaaac|gttttt 0.0008729974 165 0 798
aaaaag aaaaag|cttttt 0.0010052697 190 0 942
aaaaat aaaaat|attttt 0.0010211424 193 0 921
aaaaca aaaaca|tgtttt 0.0016454678 311 6 1525
aaaacc aaaacc|ggtttt 0.0006560708 124 0 613
identifier obs_freq occ ovl_occ forbocc
aaaaaa aaaaaa|tttttt 0.002096105 385 564 1893
aaaaac aaaaac|gttttt 0.001497218 275 0 1354
aaaaag aaaaag|cttttt 0.001535329 282 0 1378
aaaaat aaaaat|attttt 0.002313882 425 3 2104
aaaaca aaaaca|tgtttt 0.002003550 368 10 1808
aaaacc aaaacc|ggtttt 0.001012664 186 0 918
[1] 0 Inf
Finally, I prefer to keep the mean occurrences on the X axis rather than log2(mean occurrences).
\[\mathrm{LLR} = f_{exp} \cdot \log_2(f_{obs}/f_{exp})\]
The log-likelihood ratio is effective in reducing the impact of small-number fluctuations: rare k-mers (left side of the LLR plot) obtain very low scores, whereas the ratio and the log2-ratio tend to put a strong emphasis on them.
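
To make this contrast concrete, the sketch below computes the ratio, the log2-ratio and the LLR, using the formula exactly as written above. It assumes that f_obs is the k-mer frequency in the peaks (`peak.freq`) and f_exp its frequency in the random regions (`rand.freq`). The function name `llr` is hypothetical, and the third entry is a made-up rare k-mer added purely for illustration.

```python
import math


def llr(f_obs, f_exp):
    """LLR as defined above: f_exp * log2(f_obs / f_exp)."""
    return f_exp * math.log2(f_obs / f_exp)


# (f_obs, f_exp) pairs. The first two come from the peak.freq and rand.freq
# columns shown earlier; the third is a hypothetical rare k-mer with the same
# ratio as "aaaaaa" but 100-fold lower frequencies.
examples = {
    "aaaaaa": (0.0009499362, 0.002135780),
    "aaaaca": (0.0016597200, 0.002041473),
    "rare (hypothetical)": (0.0009499362 / 100, 0.002135780 / 100),
}

for name, (f_obs, f_exp) in examples.items():
    ratio = f_obs / f_exp
    print(
        f"{name}: ratio={ratio:.3f} "
        f"log2-ratio={math.log2(ratio):.3f} "
        f"LLR={llr(f_obs, f_exp):.7f}"
    )
```

The hypothetical rare k-mer has the same ratio and log2-ratio as `aaaaaa`, but its LLR is 100 times closer to zero, because the logarithm is weighted by the frequency.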
So far, we performed all our analyses using a random selection of genomic regions (“random peaks”) as background sequences, in order to estimate the expected number of occurrences of each k-mer in the peaks. These random peaks were selected with the same sizes as the actual peaks, so the total number of occurrences is expected to be roughly the same as in the peaks (small differences may arise from the presence of N characters in the genomic sequences).
However, the results are problematic: the random expectation is estimated from a small sequence set, so the counts can fluctuate, especially for rare k-mers. We even noticed that some hexamers have zero occurrences in the random peaks.
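
The size of these fluctuations can be roughly quantified. The sketch below assumes, as a simplification that is not stated in the original analysis, that the count of a k-mer in a random sequence set follows a Poisson distribution. Under that assumption the relative fluctuation (standard deviation divided by mean) is 1/sqrt(expected count), and the probability of observing no occurrence at all is exp(-expected count). Both function names are hypothetical.

```python
import math


def poisson_cv(expected_count):
    """Relative fluctuation (sd / mean) of a Poisson-distributed count."""
    return 1 / math.sqrt(expected_count)


def poisson_p_zero(expected_count):
    """Probability of observing zero occurrences under a Poisson model."""
    return math.exp(-expected_count)


# Expected counts spanning the range of the "rand" row of the summary table
# (min 1, mean about 87, max 435).
for expected in (1, 10, 87, 435):
    print(
        f"expected={expected:>3}  "
        f"cv={poisson_cv(expected):.2f}  "
        f"P(count=0)={poisson_p_zero(expected):.2e}"
    )
```

Under this model, a k-mer expected only once in the random peaks has a relative fluctuation of 100% and is absent about 37% of the time, which is the situation that drives a peaks/random ratio to infinity.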
identifier obs_freq occ ovl_occ forbocc
aaaaaa aaaaaa|tttttt 0.001888701 5013666 6890306 25068330
aaaaac aaaaac|gttttt 0.001292811 3431840 0 17159200
aaaaag aaaaag|cttttt 0.001506224 3998357 0 19991785
aaaaat aaaaat|attttt 0.002029702 5387959 23054 26939795
aaaaca aaaaca|tgtttt 0.001889655 5016197 216873 25080985
aaaacc aaaacc|ggtttt 0.000887235 2355216 0 11776080
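
One way such genome-wide frequencies could serve as background is to multiply each frequency by the total number of k-mer occurrences in the peaks, which gives an expected count that rests on millions of genomic occurrences rather than a few hundred. The sketch below illustrates this under the assumption that the peak total is the 187381 reported in the summary table. The function name `expected_occurrences` is hypothetical, and this is not necessarily the procedure followed in the original analysis.

```python
# Total hexamer occurrences in the peaks ("sum" column of the summary table).
PEAK_TOTAL = 187381

# Genome-wide frequencies (obs_freq) and peak counts (occ) from the tables above.
genome_freq = {
    "aaaaaa": 0.001888701,
    "aaaaac": 0.001292811,
    "aaaaca": 0.001889655,
}
peak_occ = {"aaaaaa": 178, "aaaaac": 165, "aaaaca": 311}


def expected_occurrences(freq, total):
    """Expected count = background frequency x total occurrences in the peaks."""
    return freq * total


for kmer, freq in genome_freq.items():
    exp_occ = expected_occurrences(freq, PEAK_TOTAL)
    ratio = peak_occ[kmer] / exp_occ
    print(f"{kmer}: observed={peak_occ[kmer]} expected={exp_occ:.1f} ratio={ratio:.2f}")
```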