Statistics for Bioinformatics
Practicals - Student test

Introduction

Golub et al. (1999) measured the level of expression of human genes in 38 patients suffering from two cancer types: acute lymphoblastic leukemia (ALL, 27 patients) and acute myeloid leukemia (AML, 11 patients).

The aim of such experiments is to detect a subset of genes which can be used for diagnostic tests, in order to asses whether a new patient suffers from either of these cancer types.

For this practical, we will use a pre-normalized and pre-filtered subset of 3051 genes, and select those which show a significant difference in expression between AML and ALL patients.

We will thus apply a Student test to each gene g independendly, to test the null hypothesis.

Theory

This tutorial is an application of concepts seen in the following chapters of the course:

  1. Hypothesis testing
  2. Conformity tests
  3. Multitesting

Prerequisite

This tutorial assumes you already executed the script config.R as described in the configuration page.

Tutorial

Loading the data

The pre-normalized data can be loaded from the web site of the course. The variable golub.expr contains the normalized and standardized microarray data, with 38 chips (one per column) and 3051 genes. Each chip has been centered (on the median) and scaled (the standard deviation was estimated on the basis of the inter-quartile range).

We also need to define the cancer type for each chip.

Applying a single Student test

R contains a specific package for hypothesis testing, which of course includes the classical Student test. We will start by arbitrary selecting one gene, and testing whether this particular gene is differentially expressed between ALL and AML patients.

Exercise

Select the gene on the 347th row of the golub dataset, and test the following null hypothesis.

  1. Calculate the sample means, standard deviation, t.obs, P.value, and E.value with standard R functions.

  2. How do you interpret the results of the Student test ? Is the gene g differentially expressed between AML and ALL patients ? Give an interpretation for the P-value and E-value.

  3. Apply the Student test with the the t.test() function implemented in R.

  4. Apply the Welch test with the the t.test() function implemented in R.

  5. Discuss about the meaning of the different attributes, between students and with the teacher.

  6. Compare the results obtained with the Student-Fischer and the Welch test, respectively.

Solution

We will first select a gene, and get the expression values for patients of the type ALL (sample 1) and AML (sample 2), respectively.

In order to familiarize ourselves with the parameters of a Student test, we will first perform manually all the steps to compute the Student statistics (tobs).

Actually, there is no need to perform all those steps manually, since R contains a predefined function t.test() which computes the Student as well as the Welch test.

Interpretation

TO BE WRITTEN

Multiple testing

One simple way to select differentially expressed genes would be to apply the previous test to each gene successively. This would however be very inefficient in terms of calculation time. You can easily check this by inserting the previous code in a loop. If it takes too much time, press the Esc key to interrupt the calculation.

Multi-testing

Multiple testing with the function t.test.multi()

In general, loops are not very efficient in R, and should be relplaced, whenever possible, by directly applying computations to either columns or rows of a matrix, with the function apply(). If you want to know more about this function, call its help with help(apply).

As a matter of illustration, we will test a R program that uses the function apply() in order to run Welch's t-test on all rows of a table. All calculations are performed in parallel, by using the R apply function. This script can be loaded from the R scripts directory, with the following command.

For the sake of curiosity, you can also have a look at the R code.

We will now apply the t-test to each gene of the data set, and store the result in a variable called golub.t.result.

The computation takes ~4 seconds for 3051 genes on my laptop. The result of t.test.multi is a data frame with 3051 rows (one per gene) and 9 columns (one per statistics).

As is always the case with multiple testing, nominal P-values have to be interpreted with caution: since we applied the same test to 3051 genes, a P-value of 0.01 is expected to occur 30 times by chance alone. One classical way to correct this effect is to calculate and E-value, i.e. the product of the P-value by the number of tests. This is done automatically by t.test.multi, and the result is stored in the column "E-value".

We can now select genes according to the result of the t-test. Let us chose a threshold of E-value ≤ 1.

Naming the samples

In pevious tutorials, the gene expression profiles were analyzed in a "blind" way, without explicitly taking into accout the nature of the samples (patient types, cell types, ...).

The selection of differentially expressed genes relies on the exitence of different subgroups of samples (e.g. cell types, subtypes).

The Golub data set comes along with an information table describing some features of each sample.

We will rename the samples in order to explicitly indicate their cell types:

Storing the result

We will store the selection of 243 differentially expressed genes in a text file. These selected genes will be used for the tutorial on clustering. Open the exported file with a text editor to check its content. There shoul be 38 columns (one per sample), and 243 rows (one per selected gene).

Exercise

Use the t.test function to check the results obtained with t.test.multi for the genes number 1, 2, 317 and 3051.

T-test with robust estimators

As mentionned above, one drawback of the t-test is that the average and variance of the two samples can be drastically affected by outliers. The function t.test.multi includes an option robust.est to use robust estimators of central tendency (the median) and dispersion (inter-quartile-range). We will use this option and compare the results with the standard Welch test.

Note that this calculation is slower than with the standard estimators. On my laptop it takes 11 seconds instead of 4. We will now compare the results.

Heat map of the selected gene profiles

We can now draw a heat map to visualize the profiles of expression of the 243 genes selected as significant in our multiple Student test.

Principal component analysis of the selected profiles

Discussion

Discuss the results, between students first, and then with the teacher.

Additional material

References

  1. DeRisi JL, Iyer VR, Brown PO (1997). Exploring the metabolic and genetic control of gene expression on a genomic scale. Science. Oct 24;278(5338):680-6. PMID: 9381177.
  2. Gasch AP, Spellman PT, Kao CM, Carmel-Harel O, Eisen MB, Storz G, Botstein D, Brown PO (2000). Genomic expression programs in the response of yeast cells to environmental changes. Mol Biol Cell. Dec;11(12):4241-57. PMID: 11102521.
  3. Cui, X, and Churchill, G.A. (2003). Statistical tests for differential expression in cDNA microarray experiments. Genome Biology 4: 210.1-210.10.
  4. PMID:12702200
  5. Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., Coller, H., Loh, M. L., Downing, J. R., Caligiuri, M. A., Bloomfield, C. D. & Lander, E. S. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286(5439), 531-7. PMID:10521349

Jacques van Helden (jvhelden@ulb.ac.be)