Selecting differentially expressed genes with R or TmeV


  1. Retrieving the den Boer normalized dataset
  2. Loading data into R
  3. Basics about Welch's t test
  4. Applying Welch's t-test to the den Boer dataset
  5. Significance Analysis of Microarrays (SAM)

Retrieving the den Boer normalized dataset

Here we will use the GSE13425 experiment which which was retrieved from the Gene Expression Omnibus (GEO) public database. In this experiment, the authors were interested in the molecular classification of acute lymphoblastic leukemia (ALL) that are characterized by the abnormal clonal proliferation, within the bone marrow, of lymphoid progenitors blocked at a precise stage of their differentiation.

Data were produced using Affymetrix geneChips (Affymetrix Human Genome U133A Array, HGU133A). Informations related to this platform are available on GEO website under identifier GPL96.

[back to contents]

Loading data into R

Start R.

View solution| Hide solution


The dataset from DenBoer contains 190 samples belonging to various tumour classes. We can already notice that there is an important imbalance between the sizes of the tumour classes: T-ALL, pre-B ALL, TEL-AML1 and hyperdiploid are each represented by more than 40 samples, whereas other classes (e.g. BCR-ABL, E2A-rearranged) are represented by a handful of samples.

The number of samples per group is a very important factor for selecting differentially expressed genes: in general, the power of the tests (i.e. the capacity to detect effectively differentially expressed genes) increases with group sizes.

[back to contents]

Basics about Welch's t test

Welch's test is a variant of the classical Student test, whose goal is to test the equality between two means.

where mg,1 and mg,2 represent the respective mean expression values for a given gene (g) in two populations (for example, all existing patients suffering from T-ALL versus all patients suffering from pre-B ALL). Of course, we do not dispose of measurments for all the patients suffering from these two types of ALL in the world (the population). We only dispose of two sets of samples, covering 36 (T-ALL) and 44 (pre-B ALL) patients, respectively. On the basis of these samples, we will estimate how likely it is that genes g is generally expressed at similar levels in the populations from which the samples were drawn.

The essential difference between Student and Welch is that the proper Student test relies on the assumption that the two sampled populations have the same variance, whereas Welch's test is designed to treat populations with unequal variances.

When detecting differentially expressed genes, we cannot assume equal variance. Indeed, a typical case would be that a gene of interest is expressed at very low level in the first group, and high level in the second group. The inter-individual fluctuations in expression values are expected to be larger when the gene is expressed at a high level than when it is poorly expressed. It is thus generally recommended to use Welch rather than Student test when analyzing microarray expression profiles.

Welch's t-test defines the t statistic by the following formula:


The t.test() function can be used to calculate this score (and additional informations such as p.value). This function returns an S3 object whose slots can be listed using the names() function and accessed using the $ operator (such as with lists in R).

In order to get an intuition of the t statistics, let us create artificial datasets and run a Welch test on it. In the following example x and y can be viewed as the values for gene g in class x and y.

## Generate 4 random numbers following a normal distribution, to
## simulate the samples 1 and 2. We deliberately set the means to 
## the same values (to fall under the null hypothesis), but 
## different standard deviations.
x <- rnorm(n=4, mean=6, s=1)
y <- rnorm(n=4, mean=6, s=2)

## Run the Welch test (this is specified by indicating that we don't expect equal variances)
denboer.welch <- t.test(x,y, var.equal=FALSE)
## note: each student should have a different result, since values were generated at random

## Retrieve the t statistics

What would be the expected distribution of t when comparing the expression levels in class x and y for 10000 genes whose values would be randomly drawn from normal distributions with mean 0 and standard deviations of 1 and 2, resp ?

## Run iteratively 10000 Welch tests with random values
t.values <- replicate(10000,t.test(rnorm(4, 0, 1),rnorm(4, 0, 2))$statistic)
dens <- density(t.values)

We can now compute the empirical frequency of t statistics with absolute value above or equal to 3.

length(t.values[abs(t.values) >= 3])/10000

We can visualize this by coloring the tails of the density function on a plot.

lower.bound <- 1
upper.bound <- tail(which(dens$x <= -3),1)
lower.bound <- head(which(dens$x >= 3),1)
upper.bound <- length(dens$x)

Now we can check if the empirical distribution we have drawn fits the theorical t distribution.

Student's t-distribution has the probability density function given by:


The degree of freedom, which is related to the number of samples and variance in class 1 and 2, is the parameter that will control the shape of the probability density function.

The degree of freedom can be computed using the following formula.

Here we will consider that variance is equal to 1 (as x and y were drawn from a normal distribution with mean 0 and standard deviation 1)

So we can compute the degree of freedom using R and draw the theorical density of the t-distribution.

## Compute the degrees of freedom for Welch test
N1 <- length(x)
N2 <- length(y)
var1 <- var(x) ## Since we ignore the variance of the population, we estimate it from the sample
var2 <- var(y)
df <- (var1/N1+var2/N2)^2/(var1^2/(N1^2*(N1-1))+var2^2/(N2^2*(N1-1)))

## Compare the degrees of freedom computed by ourselves with the df
## reported by the t.test() function.
[back to contents]

Applying Welch's t-test to the den Boer dataset

We would like to define genes that discriminate between "TEL-AML1" tumors and "hyperdiploid" tumors.

One possibility would be to iterate over all probesets, and to successively run the R method t.test() on each one. This would however be quite inefficient, and the results would not be very easy to handle, since it would be a list of objects of the class t.test.

Instead, we will use a custom function that runs Student or Welch test in parallel on all the elements of a data table.

Running t-tests on each row of a data matrix

## Load the library for multiple t tests
source(file.path(dir.util, "util_student_test_multi.R"))

For the sake of curiosity, you can also have a look at the R code.

We will now run the Welch test on all the genes of the Den Boer dataset. We will apply the test to select genes differentially expressed between the subtypes "hyperdiploid" and "TEL-AML1".

## Define a Boolean vector indicating which samples belong to the two selected subtypes. <- pheno$Sample.title == "hyperdiploid" |  pheno$Sample.title == "TEL-AML1"

## Define a vector with the sample types for the two selected cancer subtype
cancer.type <- as.vector(pheno[, "Sample.title"])

## Run the Welch test on each probesets of the DenBoer expression matrix
denboer.welch <- t.test.multi(expr.matrix[], cancer.type, volcano.plot=FALSE)

## Inspect the result table

## Select genes with a stringent threshold on E-value
threshold <- 0.05
significant.probesets <- denboer.welch$E.value <= threshold
table(significant.probesets) ## Count the number of significant probesets

Comparing sample means

We will compare the mean expression value between the two subtypes, and highlight the significant genes.

## Plot the gene-wise means
plot(denboer.welch[, c("mean.TEL-AML1", "mean.hyperdiploid")], col="darkgray")
abline(a=0,b=1, col="black")
lines(denboer.welch[significant.probesets, c("mean.TEL-AML1", "mean.hyperdiploid")], type='p', col="red") ## Highlight significant genes

How do you explain that the regions covered by gray (non-significant) and red (significant) probesets overlap on the mean-mean plot ?

View solution| Hide solution [back to contents]

Significance Analysis of Microarrays (SAM)

The most popular method for differential expression analysis of microarray data is "Significance Analysis of Microarrays" (SAM). SAM is implemented in several software including R (library siggenes). Here we will use the MultiExperiment Viewer (MeV).

Installing Mev

Type the command below in a terminal (this are shell/bash commands)

mkdir -p ~/bin
cd ~/bin
wget ""
tar xvfz MeV_4_8_1_r2727_linux.tar.gz
cd -
echo -e "\nalias tmev=\"cd ~/bin/MeV_4_8_1/; sh -c 'nohup ./ &'; cd -\"" >> ~/.bashrc
source ~/.bashrc

Running MeV