Statistics for Bioinformatics
Practicals - Word Count Distributions

In the tutorial on word probabilities, we assumed that the number of oligonucleotide occurrences observed would folow a binomial distribution.

We will now test this assumption by analyzing the distribution of word occurrences observed in biological sequences. We will fit a binomial distribution on the observed count distribution, and test the goodness of the fitting.

Tutorial: occurrence probabilities

The table below shows the distribution of occurrences of the word GATAA in a set of 1000 sequences of 800 base pairs each.
occurrences# sequences


  1. Fit a binomial, a Poisson and a normal distribution on on the observed distribution.
  2. Draw the observed and fitted distributions and compare the fittings obtained with the different theoretical distributions.
  3. Use a Q-Q plot to compare each fitted distribution with the observed one


Encoding the data

The first step is to encode in R the data provided in the table. We can do this with a data frame, containing one column for the word counts, and one column for the number of seqences where these word counts have been observed.

We will later add some columns to this data frame, with the theoretical distributions (binomial, Poisson and normal, respectively).

Computing relative frequencies

The relative frequencies are the fraction of sequences containing a given number of occurrences. These are obtained by dividing the number of sequences (obs) by the total number of sequences (N).

We will first calculate the number of sequences. Actually this number was given above (1000), but we will check that our data contains the right number of ovservations.

We can now compute the relative frequencies.
distrib$f <- distrib$obs/N

Fitting a binomial

We need 2 parameters to characterize a binomial
  1. the number of trials
  2. the probability of success at each trial
Number of trials
In our case, a trial is a position in a sequence. Each sequence contains L=800bp and the considered word (GATAA) contains k=4 letters, there are n = L - k + 1 possible positions for a GATAA in each sequence (the three last position cannot contain a tetranucleotide).
Probability of succcess (occurrence) at each trial (position)
How to estimate the probability to find a GATAA at a given position of a sequence ? One possibility would be to estimate that all the tetranucleotides have the same probability, but we have no reason a priori to make such a strong assumption.

Another possibility would be to estimate the word probability on the basis of the observed distribution itself: we can calculate the average number of occurrences per sequence, and divide this number by the number of positions per sequences to obtain the average number of occurrences per position, i.e. an estimator of the word probability. The total number of occurrences can be obtained by multiplying each value of occurrence (from 0 to 8) by he number of sequences where these occurrences have been observed. This can be done very easily with R, by multiplying the two vectors.

Notice that this number is clearly different from what would have been obtained under a model of equiprobable tetranucleotide (1/4^4=0.0039).

Computing the theoretical distribution
We have now the two parameters required for fitting a binomial distribution on our observed distribution.

Drawing the observed and theoretical distribution

We will plot the observed distribution as vertical bars, and superimpose the theoretical distribution as a frequency polygon. You can draw a very simple plot with the following command: But this is not very aesthetic. The plot can be improved in the following way (the comments can be copy-pasted to R, they will be ignored). Let us now draw the theoretical distribution on top of the observed distribution.

Testing the goodness of the fit

We will compute the chi2 statistics manually, in order to get familiar with the details of the computation.

We will first apply the raw formula of the chi2: chi2obs = SUM [(obs -exp) 2 / exp]

distrib$diff <- distrib$obs - distrib$exp
distrib$diff2 <- distrib$diff^2
distrib$chi2 <- distrib$diff^2 / distrib$exp


chi2.obs <- sum(distrib$chi2)

BEWARE: this result is not what we are supposed to obtain. Why ?

We applied the raw formula, but we forgot to check two essential conditions for being able to apply Pearson's chi-squared test.

We will see how to treat these two problems.

Sums of obervation and expectation must be equal

The domain of definition of the binomial distribution encompasses alll possible values from 0 to n (the number of trials). In the scripts above, we only computed the binomial density (dbinom) and the corresponding expected values (E(x) = N*dbinom())from 0 to 8. All the values from 9 to n where thus ignored so far.

A bad way to solve this problem would be to include in our computation of the chi-square all the possible values from 0 to n. This would fix the problem of the sum of expected values, but raise another problem, since at the right tail of the distribution, the values are decreasing very rapidly, so that expected values will be too small to meet the condition of applicability that all expected values are >= 5.

The correct way to solve this is to merge all the classes on the right tail (from x=9 to x=n) together with the last class where an observation was found (x=8). This ast class will thus from now on represent all the possible values from x=8 to x=n.

All expected values should be >= 5


Fitting a Poisson and a normal distribution

Fit a Poisson and a normal distribution on the word occurrences provided above. Plot the observed and fitted distributions, and compare them with a Q-Q plot.

Basically, you should perform the same steps as for the binomial, except that you will use other parameters to fit the theoretical distribution on the observed one.

Goodness of fit

Estimate the goodness of fit for the 3 fitted distributions (binomial, Poisson, and normal).

Test of significance

On the basis of the 3 fitted distributions, estimate the probability to observe at least 3, 4, 10 and 25 occurrences, respectively, of the word GATAA in a 800bp sequence.

Distributions of occurrences in upstream sequences

The occurrences of some hexanucleotides were counted in all the yeast upstream sequences. The result can be found in the data directory, in the file word_counts/ For each of these words:
  1. Draw an histogram with the observed distribution.
  2. Fit a binomial distribution on this histogram.
  3. Fit a Poisson distribution on this histogram.
  4. Fit a normal distribution on this histogram.
What do you conclude from these drawings ?
Jacques van Helden (