Introduction

In this practical, we will inspect the statistical tests used to compare a set of genes of interest to a set of reference genes. This practical is essentially a tutorial, based on the result returned by David in the previous practical Handling genomic coordinates.

In this tutorial, we hade submitted a set of predicted E2F target genesc(see file M00920_targets.txt) to the Web tool DAVID, to compare it to various catalogues of functional annotations (Gene Ontology, KEGG, …). DAVID returned a table reporting the functional classes for which our gene set showed significant enrichment.

The goal of this tutorial is to reproduce the calculation of the significance. We will show two distinct ways to model the problem:

  1. Hypergeometric test: drawing at random a certain number of balls from an urn containing marked and non-marked balls (hypergeometric test).
  2. Fisher’s exact test: testing the independence between regulation (genes belonging or not to the E2F predicted gene set) and class membership (genes annotated or not as involved in cell cycle, in the GO annotations).

Click to open the image in a separate window


Tutorial

Hypergeometric test

In a first time, we model the association between genes and GO class using a hypergeometric distribution. The classical example for the hypergeometric is the ranomd selection of “k” balls in an urn containing “m” marked and “n” non-marked balls, and the observation that the selection contains “x” marked ball.

To illustrate the way to perform the hypergeometric test, we will re-analyze the second row of the David result displayed above: correspondence between a set of predicted E2F target genes and the genes annotated in the functional class “cell cycle” of the GOTERM_BP_FAT terms.

We define the parameters of the hypergeometric test in the following way:

Variale Description
\(m = 611\) number of “marked” elements, i.e. total number of genes annotated for the selected GO term (cell cycle in GOTERM_BP_FAT annotations).
\(N = 13588\) total number of genes with some annotation in GOTERM_BP_FAT. Note: this is lower than the total number of human genes, since many genes are of totally unknown function.
\(n = N - m\) number of “non-marked” elements, i.e. the number of genes that have some annotation in GOERM_BP_FAT, but are not associated to the selected GO term (cell cycle).
\(k = 59\) Size of the selection, i.e. number of genes predicted as E2F targets, and associated to at least one “Biological Process” in the Gene Ontology.
\(x = 19\) number of “marked” elements in the selection, i.e. number of genes predicted as E2F targets AND associated to the process “cell cycle” in GO annotations.
g <- 75 ## Number of submitted genes
k <- 59 ## Size of the selection, i.e. submitted genes with at least one annotation in GO biological processes
m <- 611 ## Number of "marked" elements, i.e. genes associated to this biological process
N <- 13588 ## Total number of genes with some annotation in GOTERM_BP_FAT.  
n <- N - m ## Number of "non-marked" elements, i.e. genes not associated to this biological process
x <- 19 ## Number of "marked" elements in the selection, i.e. genes of the group of interest that are associated to this biological process

Small exercise: random expectation

Compute the following statistics:

  1. the percent of the gene selection involved in the process.
  2. the percentage of the biological process covered by our gene selection.
  3. expected number of marked elements in the selection
  4. “fold enrichment”.
Answer
<< Hide | Show >>
## Percent of the gene selection involved in "cell cycle". This
## corresponds to the "%" column returned by DAVID.
(percent.selection <- x / g * 100)
## [1] 25.33333
## Percent, among submitted genes with at least one annotation, which
## are involved in "cell cycle".
(percent.selection <- x / k * 100) 
## [1] 32.20339
## Percent of the biological process covered by the gene set.
(percent.process <- x / m * 100) 
## [1] 3.109656
## Random expectation
(marked.proportion <- m / N)
## [1] 0.04496615
(exp.x <- k * marked.proportion)
## [1] 2.653003
## Fold enrichment, as computed by David
(fold.enrichment <-  (x / k ) / (m / N)) 
## [1] 7.161697
Interpretation

The calculation of the proportion of marked genes indicates that 4.5% of the genes annotated in GO_BP_FAT are associated to cell cycle. It is not surprizing that cell cycle is mobilizing an important fraction of the gene products, this was already shown for yeast in a pioneering microarray study (Spellmann et al., 1998).

Our selection contains 59 genes with at least some annotation in GO_BP_FAT. If we would have selected 59 genes at random, we would thus expect 59*4.5% of them to be annotated as “cell cycle”.

Drawing the hypergeometric distribution

Use the plot() and the dhyper() functions to draw the distribution of probability densities, i.e. the probability to observe a particular value of \(x\): \(P(X=x)\).

<< Hide | Show >>
## Define the range of possible values for x.
## The number of marked elements in the selection can neither be 
## higher than the size of the selection, nor higher than the 
## number of marked elements.
x.range <- 0:min(k,m) 

help(dhyper) ## Always read the documentation of a function before using it. 

## Compute the distribution of density P(X=x)
dens <- dhyper(x=x.range, m=m, n=n, k=k)

## Plot the distributon of hypergeometric densities
plot (x.range, dens, type="h", lwd=2, col="blue", main="Hypergeometric density", xlab="x = marked elements in the selection", ylab="density = P(X=x)", ylim=c(0, max(dens)*1.25))

## Draw an arrow indicating the expected value
arrows(exp.x, max(dens)*1.15, exp.x, max(dens)*1.05, lwd=2, col='darkgreen', angle=30, length=0.1, code=2)
text(exp.x, max(dens)*1.20, labels=paste("exp=", round(exp.x, digits=2)), col="darkgreen", font=2)

## Draw an arrow indicating the observed value
arrows(x, max(dens)*1.15, x, max(dens)*1.05, lwd=2, col='red', angle=30, length=0.1, code=2)
text(x, max(dens)*1.20, labels=paste("x=", x), col="red", font=2)