- Introduction
- Generating random data sets
- Density distributions of p-values
- Estimating the proportions of truly null and truly alternative hypotheses
- Why do we need to correct for multiple testing?
- Expected number of false positives (e-value)
- Family-Wise Error Rate
- Controlling the False Discovery Rate (FDR) with the q-value
- Bibliographic references

Most analyses in current bioinformatics consist of performing thousands, or even millions, of statistical tests in parallel. This applies to the analysis of microarrays (e.g. selecting differentially expressed genes), motif detection (e.g. discovering motifs in the promoters of co-expressed genes), sequence similarity searches (comparing a query sequence against the millions of peptide sequences currently available in Uniprot), etc.

In a previous practical (Selecting differentially expressed genes), we applied a Welch test to select differentially expressed genes from a microarray series containing 190 ALL samples. By doing this, we successively tested, for more than 22,000 probesets, the equality of the mean expression values between two classes of ALL. For each gene, we computed a **nominal p-value**, which indicates the probability of obtaining by chance a difference at least as large as the one observed in the data. This p-value can be interpreted as an estimate of the **False Positive Risk** (**FPR**): the probability of considering an observation as significant whereas it is not. However, we did not take into account an important factor: since the same test was successively applied to 22,283 probesets, the risk of a false positive was taken anew for each probeset. This situation is classically denoted as **multiple testing**. For example, if we accept an individual risk of 1%, we expect to observe 1% * 22,000 = 220 false positives when the same risk is taken for each probeset of the microarray series.

We will thus need to be more restrictive if we want to control the false positives. Several methods have been proposed to control the risk of false positives in situations of multiple testing. In this practical, we will investigate the practical consequences of multiple testing and explore some of the proposed solutions.

In order to get an intuition of the problems arising from multiple testing, we will generate three datasets where no difference is expected to be found between the mean expression values of two groups of samples.

- Reload the normalized expression matrix from DenBoer, as described in the practical "Selecting differentially expressed genes".
- Generate an expression matrix of the same size, in a data frame named "rnorm.matrix", and fill it with random values sampled from a normal distribution of mean *m=0* and standard deviation *sd=1*.
- Create a second data frame of the same size, name it "denboer.permuted.values", and fill it with the actual values of the DenBoer expression matrix, sampled in random order. Random sampling can be done with the R function `sample()`. Read the help page of this function to learn how to use it in this context.
- Create a vector named "denboer.permuted.subtypes" and fill it with a random sampling of the cancer subtypes (these values can be found in `pheno$Sample_title`).
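The three negative controls can be sketched in R as follows. This is a sketch only: `denboer.expr` is an assumed name for the reloaded DenBoer expression matrix (adapt it to the name you used); `pheno$Sample_title` comes from the practical.

```r
set.seed(42)  # for reproducibility of this sketch
n.probesets <- 22283  # dimensions of the DenBoer matrix
n.samples <- 190

# Negative control 1: values drawn from a standard normal distribution
rnorm.matrix <- as.data.frame(
  matrix(rnorm(n.probesets * n.samples, mean = 0, sd = 1),
         nrow = n.probesets, ncol = n.samples))

# Negative control 2: the actual DenBoer values, permuted in random order
# ("denboer.expr" is an assumed name for the reloaded expression matrix)
# denboer.permuted.values <- as.data.frame(
#   matrix(sample(unlist(denboer.expr)),
#          nrow = n.probesets, ncol = n.samples))

# Negative control 3: a random permutation of the subtype labels
# denboer.permuted.subtypes <- sample(pheno$Sample_title)
```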

In the context of a Welch test, the p-value indicates the probability of obtaining by chance a *t* statistic greater than the one measured from the samples. Under the null hypothesis (i.e. if the data were sampled from populations with equal means), a p-value ≤ 5% is expected once every 20 tests, a p-value ≤ 50% once every two tests, etc. The goal of this exercise is to test whether the p-values returned by our multiple Welch tests correspond to this expectation.

For each of the 3 control sets prepared in the previous section, and for the actual data set from DenBoer 2009, apply the following procedure, and compare the results.

- Run the Welch test on each probeset, using our custom function `t.test.multi()` (as we did in the practical on Selecting differentially expressed genes), to compare the two subtypes "hyperdiploid" and "TEL-AML1".
- Draw a plot comparing the mean expression of "hyperdiploid" (abscissa) and "TEL-AML1" (ordinate), and highlight the probesets declared significant with a p-value threshold of 1%. Count the number of significant genes and compare it to the random expectation.
- Draw a histogram of the p-value density, by chunks of 5%, and draw a line representing the theoretical expectation for this distribution.
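The histogram and its theoretical expectation can be sketched as follows, here with p-values simulated under the null hypothesis; in the practical, use the `P.value` column returned by `t.test.multi()` instead.

```r
set.seed(42)
p.values <- runif(22283)  # stand-in for the Welch test p-values

# Histogram by chunks of 5%. With freq = FALSE the bars are on a density
# scale, so under the null hypothesis (uniform p-values) the expected
# height of every bar is 1.
hist(p.values, breaks = seq(0, 1, by = 0.05), freq = FALSE,
     xlab = "p-value", main = "Density of p-values")
abline(h = 1, col = "blue", lwd = 2)  # theoretical expectation
```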

As discussed in the solution of the previous section, the density histogram of p-values suggests that the DenBoer dataset contains a mixture of probesets belonging to one of two categories:

- Genes whose mean expression value does not differ between the two considered cancer types (hyperdiploid and TEL-AML1). The corresponding probesets are called "truly null", because they fit the null hypothesis *H*_{0}: m_{hyper} = m_{TEL-AML1}. Their number is denoted *m*_{0}.
- Genes expressed differentially between the two considered cancer types. The corresponding probesets are called "truly alternative", because they fit the alternative hypothesis *H*_{1}: m_{hyper} ≠ m_{TEL-AML1}. Their number is denoted *m*_{1}.

Storey and Tibshirani (2003) proposed an elegant way to estimate the number of probesets belonging to these two categories.

The next exercise will lead you, step by step, to discover how Storey and Tibshirani estimate the numbers of truly null (m_{0}) and truly alternative (m_{1}) hypotheses.

- In the result table of the Welch test (DenBoer dataset), count the number of probesets having a p-value ≥ 0.6. Let us call this threshold p-value *λ* ("lambda"), and the number of probesets above it *n*_{λ}.
- We can reasonably assume that the truly alternative probesets will tend to be concentrated in the low range of p-values on the density histogram. Consequently, we can suppose that the area of the histogram covering p-values from 0.6 to 1 is principally made of truly null probesets. On the basis of the number *n*_{λ} (with *λ* = 0.6), try to estimate:
  - the total number *m*_{0} of truly null probesets;
  - the total number *m*_{1} of truly alternative probesets;
  - the proportion *π*_{0} ("pi zero") of truly null probesets among all probesets.
- After having estimated these three parameters for the DenBoer expression dataset, do the same for the three negative controls.
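A sketch of this estimation, illustrated with simulated p-values (90% truly null, 10% truly alternative); with real data, replace `p.values` by the `P.value` column of the `t.test.multi()` result.

```r
set.seed(42)
m <- 22283
m1.true <- round(0.1 * m)  # simulated number of truly alternative probesets
p.values <- c(runif(m - m1.true),       # truly null: uniform p-values
              rbeta(m1.true, 0.5, 20))  # truly alternative: p-values near 0

lambda <- 0.6
n.lambda <- sum(p.values >= lambda)  # number of probesets with p >= lambda

# Truly null p-values are uniformly distributed, so a fraction (1 - lambda)
# of the m0 truly null probesets is expected above lambda.
m0 <- n.lambda / (1 - lambda)  # estimated number of truly null probesets
m1 <- m - m0                   # estimated number of truly alternative probesets
pi0 <- m0 / m                  # estimated proportion of truly null probesets
```

With this simulation, pi0 should come out close to the 90% used to generate the data.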

A classical approach in statistical tests is to define *a
priori* a level of significance, i.e. a threshold on the
p-value. We will reject the null hypothesis if a test returns
a lower p-value than the threshold, and accept it
otherwise. Rejecting the null hypothesis means that we
consider the test as significant. In the current case, this
means that rejected hypotheses will correspond to
differentially expressed genes.

However, we are confronted with a problem: with the dataset from DenBoer (2009), we ran 22,283 tests in parallel. This means that, if we decide to use a classical p-value threshold of 1%, we accept, for each probeset, a risk of 1% of considering it significant whereas it follows the null hypothesis. Since this risk is taken 22,283 times, we expect an average of about 223 false positives (0.01 * 22,283) even for a dataset where all probesets follow the null hypothesis.

Count the numbers of significant probesets in the four datasets described above.
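A sketch of this count at a 1% threshold, again with p-values simulated under the null hypothesis; replace `p.values` by the `P.value` column of each result table.

```r
set.seed(42)
m <- 22283
p.values <- runif(m)  # one fully null control set

alpha <- 0.01
n.significant <- sum(p.values < alpha)  # positives at the 1% threshold
expected.fp <- alpha * m                # expected false positives: ~223
```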

The E-value is the expected number of false positives. This is the simplest and most intuitive correction for multiple testing: given a threshold on the nominal p-value and the number of tests performed, we can easily compute the expected number of false positives.

- The result table of the function `t.test.multi()` contains a column indicating the p-value. For the 4 datasets (DenBoer + the three negative controls), use the column "P.value" of the `t.test.multi()` result to compute the **E-value** (**expected number of false positives**) associated with each probeset. Compare it to the E-value indicated in the `t.test.multi()` result.
- Count the number of probesets declared "positive" with a threshold of 1% on the E-value.
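The E-value computation itself is a one-liner: multiply the nominal p-value by the number of tests performed. A sketch, where `welch.result` is an assumed name for the result table (here filled with simulated null p-values):

```r
set.seed(42)
m <- 22283
welch.result <- data.frame(P.value = runif(m))  # stand-in result table

# E-value = nominal p-value * number of tests performed
welch.result$E.value <- welch.result$P.value * m

# Probesets declared "positive" with a threshold of 1% on the E-value
n.positive <- sum(welch.result$E.value < 0.01)
```

Note how stringent this is: an E-value below 0.01 requires a nominal p-value below 0.01/22,283, so a fully null dataset should yield essentially no positives.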

The **Family-Wise Error Rate (FWER)** is the probability of obtaining at least one false positive by chance: P(FP ≥ 1), for a given threshold on the p-value and taking into account the number of tests performed.

- For each one of the datasets analyzed above, add to the `t.test.multi()` result table a column indicating the FWER.
- Count the number of probesets declared "positive" with a threshold of 1% on FWER. Compare it with the number of false positives when the control was exerted at the level of nominal p-value, and E-value, respectively.
- Draw a plot comparing the E-value and FWER to the nominal p-value, for all the probesets of the DenBoer dataset.
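Assuming independent tests, the FWER for a nominal p-value *p* can be computed as 1 - (1 - p)^m (the Šidák formula; the simpler Bonferroni bound min(1, p * m) is an upper approximation). A sketch, with the same assumed `welch.result` table as above:

```r
set.seed(42)
m <- 22283
welch.result <- data.frame(P.value = runif(m))  # stand-in result table

# Probability of at least one false positive among m independent tests
welch.result$FWER <- 1 - (1 - welch.result$P.value)^m

# Probesets declared "positive" with a threshold of 1% on the FWER
n.positive <- sum(welch.result$FWER < 0.01)
```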

The **q-value** estimates the **False Discovery Rate** (**FDR**), i.e. the expected proportion of false positives *among the cases declared positive*. This makes an essential distinction with the **False Positive Rate** (**FPR**), which indicates the proportion of false positives among all the tests performed.

Under the null hypothesis, the FDR can be estimated in the following way (Benjamini and Hochberg, 1995):

- Sort the elements by increasing p-value.
- For each rank *i* of the sorted list, compute *q(i)* = *π*_{0} · *p(i)* · *m* / *i*.
- Choose a given significance level *q** (in our case, let us choose *q** = 1%). Let *k* be the largest *i* for which *q(i)* ≤ *q**. Reject all hypotheses *H*_{(i)} with *i = 1, 2, ..., k*.

- Compute the q-value for the DenBoer dataset and for the three control tests described above.
- Generate a plot comparing the nominal p-value (abscissa) with the different corrections (E-value, FWER, q-value).
- Count the number of probesets declared significant at a level of 1%, and compare it to the numbers of positives detected above, when the control was performed at the level of the p-value, E-value or FWER.

- Benjamini Y, Hochberg Y (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Statist Soc B 57(1): 289-300.
- Storey JD, Tibshirani R (2003). Statistical significance for genomewide studies. Proc Natl Acad Sci U S A 100(16): 9440-9445. [PMID 12883005]
- Den Boer ML, van Slegtenhorst M, De Menezes RX, Cheok MH, Buijs-Gladdines JG, Peters ST, Van Zutven LJ, Beverloo HB, Van der Spek PJ, Escherich G et al. (2009). A subtype of childhood acute lymphoblastic leukaemia with poor treatment outcome: a genome-wide classification study. Lancet Oncol 10(2): 125-134.

Denis Puthier (http://tagc.univ-mrs.fr/tagc/index.php/research/developmental-networks/d-puthier) & Jacques van Helden (TAGC, Université de la Méditerranée).