# Practicals - motif occurrences - choosing a theoretical distribution

This is the continuation of the tutorial
"motif occurrences - multitesting
corrections". In the previous tutorial, we computed the P-value
using the binomial distribution.

The goal of this tutorial is to compute the P-value with
alternative theoretical distribution, and to compare the results.

## Detecting exceptional words with the Poisson distribution

The Poisson distribution requires one parameter: *lambda*, the
expected mean of the distribution.

The expected number of occurrences is obtained by multiplying the
prior probabiliy (*p*) by the number of trials (*N*) of the
binomial (these have been computed in the previous tutorial).

## Compute the P-value using the Poisson distribution
m.est <- N*oligos$exp_freq
oligos$Pval.Poisson <- ppois(q=oligos$occ-1,lambda=m.est, lower.tail=F)
oligos[1:10,] ## Check the first 10 rows
## Plot the Poisson versus binomial P-values
plot(oligos$Pval.binom,oligos$Pval.Poisson, log='xy',col='blue',
xlab='Binomial P-value',ylab='Poisson P-value',
panel.first=c(
grid(lty='solid',col='#BBBBBB'),
abline(v=1,col='#888888'),
abline(h=1,col='#888888'),
abline(a=0,b=1,col='#888888')
))

## Detecting exceptional words with the normal distribution

The normal distribution requires one parameter: *m*, the
mean, and *s*, the standard deviation.

We will estimate the mean and sd of the normal by taking the
theoretical mean and sd of the binomial distribution.

## Compute the P-value using the Poisson distribution
sd.est <- sqrt(N*oligos$exp_freq*(1-oligos$exp_freq))
oligos$Pval.normal <- pnorm(q=oligos$occ-0.5,m=m.est, sd=sd.est, lower.tail=F)
oligos[1:10,] ## Check the first 10 rows
## Plot the normal versus binomial P-values
plot(oligos$Pval.binom,oligos$Pval.normal, log='xy',col='blue',
xlab='Binomial P-value',ylab='normal P-value',
panel.first=c(
grid(lty='solid',col='#BBBBBB'),
abline(v=1,col='#888888'),
abline(h=1,col='#888888'),
abline(a=0,b=1,col='#888888')
))

The huge discrepancies between the binomial P-value and its normal
aproximation can be explained by the fact that we are far from the
conditions of convergence, because for most words the expected mean is
much lwoer than 10. This can be easily checked by plotting the
histogram of expected occurrences.

## Plot a histogram showing the distribution of expected occurrences for all hexanucleotides
hist(oligos$exp_occ, breaks=100)
abline(v=10,col='red')

## Script with more sophisticated treatment

A script with a more complete treatment of this data set is provided
in the scripts repository.