In the tutorial on word probabilities, we assumed that the number of oligonucleotide occurrences observed would folow a binomial distribution.
We will now test this assumption by analyzing the distribution of word occurrences observed in biological sequences. We will fit a binomial distribution on the observed count distribution, and test the goodness of the fitting.
The table below shows the distribution of occurrences of the word GATAA in a set of 1000 sequences of 800 base pairs each.
occurrences | # sequences |
---|---|
0 | 223 |
1 | 337 |
2 | 247 |
3 | 119 |
4 | 54 |
5 | 13 |
6 | 6 |
7 | 0 |
8 | 1 |
distrib <- data.frame(occ=0:8, obs=c(223, 337, 247, 119, 54, 13, 6, 0, 1) ) print(distrib)
We will later add some columns to this data frame, with the theoretical distributions (binomial, Poisson and normal, respectively).
The relative frequencies are the fraction of sequences containing a given number of occurrences. These are obtained by dividing the number of sequences (obs) by the total number of sequences (N).
We will first calculate the number of sequences. Actually this number was given above (1000), but we will check that our data contains the right number of ovservations.
N <- sum(distrib$obs) ## Number of sequences print(N)We can now compute the relative frequencies.
distrib$f <- distrib$obs/N print(distrib)
L <- 800 ## sequence length k <- 4 ## word length pos.per.seq <- L - k + 1 ## Number of possible positions for the word in one sequence print(pos.per.seq) n <- pos.per.seq
Another possibility would be to estimate the word probability on the basis of the observed distribution itself: we can calculate the average number of occurrences per sequence, and divide this number by the number of positions per sequences to obtain the average number of occurrences per position, i.e. an estimator of the word probability. The total number of occurrences can be obtained by multiplying each value of occurrence (from 0 to 8) by he number of sequences where these occurrences have been observed. This can be done very easily with R, by multiplying the two vectors.
## A vector giving the product of each occurrence ## value by the corresponding number of sequences distrib$occ * distrib$obs ## The sum of this vector is the total number of occurrences occ.total <- sum(distrib$occ*distrib$obs) print(occ.total) ## Average occurrences per sequence occ.per.seq <- occ.total/N print(occ.per.seq) ## Average occurrences per position occ.per.pos <- occ.per.seq/pos.per.seq print(occ.per.pos)
Notice that this number is clearly different from what would have been obtained under a model of equiprobable tetranucleotide (1/4^4=0.0039).
## Read the description of the binomial density function help(dbinom) ## Calculate the binomial density for each value between 0 and 8 distrib$dbinom <- dbinom(0:8, size=pos.per.seq, prob=occ.per.pos) ## Calculate expected occurrences for 1000 sequences distrib$exp <- N*dbinom(0:8, size=pos.per.seq, prob=occ.per.pos) ## Print the distribution and compare the observed and expected distributions print(distrib)
X11(width=8,height=6) ## Plot the observed distribution plot(distrib$occ, # X axis values distrib$obs, # Y axis values type='h', # Plot type: histogram-like lwd=5, # Line width col="#0000BB", # Line color: blue main='GATAA occurrences', # graph title xlab='Occurrences', # Label for the X axis ylab='Number of sequences', # Label for the Y axis ylim=c(0,400), # Limits of the Y axis panel.first=grid(col='#000000') # Grid )Let us now draw the theoretical distribution on top of the observed distribution.
## Draw the binomial distribution over the observed distribution lines(distrib$occ, distrib$exp, type='l', lwd=2, col='#00BB00' )
We will compute the chi2 statistics manually, in order to get familiar with the details of the computation.
We will first apply the raw formula of the chi2: chi2_{obs} = SUM [(obs -exp) ^{2} / exp]
distrib$diff <- distrib$obs - distrib$exp distrib$diff2 <- distrib$diff^2 distrib$chi2 <- distrib$diff^2 / distrib$exp print(distrib) chi2.obs <- sum(distrib$chi2) print(chi2.obs)
BEWARE: this result is not what we are supposed to obtain. Why ?
We applied the raw formula, but we forgot to check two essential conditions for being able to apply Pearson's chi-squared test.
The domain of definition of the binomial distribution encompasses alll possible values from 0 to n (the number of trials). In the scripts above, we only computed the binomial density (dbinom) and the corresponding expected values (E(x) = N*dbinom())from 0 to 8. All the values from 9 to n where thus ignored so far.
A bad way to solve this problem would be to include in our computation of the chi-square all the possible values from 0 to n. This would fix the problem of the sum of expected values, but raise another problem, since at the right tail of the distribution, the values are decreasing very rapidly, so that expected values will be too small to meet the condition of applicability that all expected values are >= 5.
The correct way to solve this is to merge all the classes on the right tail (from x=9 to x=n) together with the last class where an observation was found (x=8). This ast class will thus from now on represent all the possible values from x=8 to x=n.
Basically, you should perform the same steps as for the binomial, except that you will use other parameters to fit the theoretical distribution on the observed one.