Practicals - motif occurrences - choosing a theoretical distribution

This is the continuation of the tutorial "motif occurrences - multitesting corrections". In the previous tutorial, we computed the P-value using the binomial distribution.

The goal of this tutorial is to compute the P-value with alternative theoretical distribution, and to compare the results.

Detecting exceptional words with the Poisson distribution

The Poisson distribution requires one parameter: lambda, the expected mean of the distribution.

The expected number of occurrences is obtained by multiplying the prior probabiliy (p) by the number of trials (N) of the binomial (these have been computed in the previous tutorial).

## Compute the P-value using the Poisson distribution
m.est <- N*oligos$exp_freq
oligos$Pval.Poisson <- ppois(q=oligos$occ-1,lambda=m.est, lower.tail=F)
oligos[1:10,] ## Check the first 10 rows

## Plot the Poisson versus binomial P-values
plot(oligos$Pval.binom,oligos$Pval.Poisson, log='xy',col='blue',
    xlab='Binomial P-value',ylab='Poisson P-value',
    panel.first=c(
       grid(lty='solid',col='#BBBBBB'),
       abline(v=1,col='#888888'),
       abline(h=1,col='#888888'),
       abline(a=0,b=1,col='#888888')
    ))

Detecting exceptional words with the normal distribution

The normal distribution requires one parameter: m, the mean, and s, the standard deviation.

We will estimate the mean and sd of the normal by taking the theoretical mean and sd of the binomial distribution.

## Compute the P-value using the Poisson distribution
sd.est <- sqrt(N*oligos$exp_freq*(1-oligos$exp_freq))
oligos$Pval.normal <- pnorm(q=oligos$occ-0.5,m=m.est, sd=sd.est, lower.tail=F)
oligos[1:10,] ## Check the first 10 rows

## Plot the normal versus binomial P-values
plot(oligos$Pval.binom,oligos$Pval.normal, log='xy',col='blue',
    xlab='Binomial P-value',ylab='normal P-value',
    panel.first=c(
       grid(lty='solid',col='#BBBBBB'),
       abline(v=1,col='#888888'),
       abline(h=1,col='#888888'),
       abline(a=0,b=1,col='#888888')
    ))

The huge discrepancies between the binomial P-value and its normal aproximation can be explained by the fact that we are far from the conditions of convergence, because for most words the expected mean is much lwoer than 10. This can be easily checked by plotting the histogram of expected occurrences.

## Plot a histogram showing the distribution of expected occurrences for all hexanucleotides
hist(oligos$exp_occ, breaks=100)
abline(v=10,col='red')

Script with more sophisticated treatment

A script with a more complete treatment of this data set is provided in the scripts repository.