Here we will use the GSE13425 experiment which which was retrieved from the Gene Expression Omnibus (GEO) public database. In this experiment, the authors were interested in the molecular classification of acute lymphoblastic leukemia (ALL) that are characterized by the abnormal clonal proliferation, within the bone marrow, of lymphoid progenitors blocked at a precise stage of their differentiation.
Data were produced using Affymetrix geneChips (Affymetrix Human Genome U133A Array, HGU133A). Informations related to this platform are available on GEO website under identifier GPL96.[back to contents]
?read.table url.base <- "http://www.bigre.ulb.ac.be/courses/statistics_bioinformatics/data/gene_expression/denboer_2009/" ## Load expression values expr.matrix <- read.table(file.path(url.base, "GSE13425_Norm_Whole.txt"),sep="\t", head=T, row=1) ## Load phenotypic data pheno <- read.table(file.path(url.base, 'phenoData_GSE13425.tab'), sep='\t', head=TRUE, row=1) ## Load Absent/Marginal/Present (AMP) calls amp <- read.table(file.path(url.base, "GSE13425_AMP_Whole.txt"),sep="\t", head=T, row=1)
The file GSE13425_Norm_Whole.txt contains genes as rows and samples as columns. Data were previously normalized using rma algorithm (they are thus transformed in logarithm base 2). The GSE13425_phenoData.txt file contains phenotypic data about samples. The GSE13425_AMP_Whole.txt file contains information about A/P/M calls (genes as rows and samples as columns).
dir.output <- "~/GSE13425" ## Define the output directory. You can adapt this to your local configuration. dir.create(dir.output, showWarnings=FALSE, recurs=TRUE) ## Create the directory if it does not exist yet setwd(dir.output)
The dataset from DenBoer contains 190 samples belonging to various tumour classes. We can already notice that there is an important imbalance between the sizes of the tumour classes: T-ALL, pre-B ALL, TEL-AML1 and hyperdiploid are each represented by more than 40 samples, whereas other classes (e.g. BCR-ABL, E2A-rearranged) are represented by a handful of samples.
The number of samples per group is a very important factor for selecting differentially expressed genes: in general, the power of the tests (i.e. the capacity to detect effectively differentially expressed genes) increases with group sizes.[back to contents]
Welch's test is a variant of the classical Student test, whose goal is to test the equality between two means.
The essential difference between Student and Welch is that the proper Student test relies on the assumption that the two sampled populations have the same variance, whereas Welch's test is designed to treat populations with unequal variances.
When detecting differentially expressed genes, we cannot assume equal variance. Indeed, a typical case would be that a gene of interest is expressed at very low level in the first group, and high level in the second group. The inter-individual fluctuations in expression values are expected to be larger when the gene is expressed at a high level than when it is poorly expressed. It is thus generally recommended to use Welch rather than Student test when analyzing microarray expression profiles.
Welch's t-test defines the t statistic by the following formula:
The t.test() function can be used to calculate this score (and additional informations such as p.value). This function returns an S3 object whose slots can be listed using the names() function and accessed using the $ operator (such as with lists in R).
In order to get an intuition of the t statistics, let us create artificial datasets and run a Welch test on it. In the following example x and y can be viewed as the values for gene g in class x and y.
## Generate 4 random numbers following a normal distribution, to ## simulate the samples 1 and 2. We deliberately set the means to ## the same values (to fall under the null hypothesis), but ## different standard deviations. x <- rnorm(n=4, mean=6, s=1) y <- rnorm(n=4, mean=6, s=2) ## Run the Welch test (this is specified by indicating that we don't expect equal variances) denboer.welch <- t.test(x,y, var.equal=FALSE) print(denboer.welch) ## note: each student should have a different result, since values were generated at random ## Retrieve the t statistics names(denboer.welch) denboer.welch$statistic
What would be the expected distribution of t when comparing the expression levels in class x and y for 10000 genes whose values would be randomly drawn from normal distributions with mean 0 and standard deviations of 1 and 2, resp ?
## Run iteratively 10000 Welch tests with random values t.values <- replicate(10000,t.test(rnorm(4, 0, 1),rnorm(4, 0, 2))$statistic) dens <- density(t.values) plot(dens,xlim=c(-5,5))
We can now compute the empirical frequency of t statistics with absolute value above or equal to 3.
length(t.values[abs(t.values) >= 3])/10000
We can visualize this by coloring the tails of the density function on a plot.
plot(dens,xlim=c(-5,5)) lower.bound <- 1 upper.bound <- tail(which(dens$x <= -3),1) plot(dens) polygon(c(dens$x[lower.bound:upper.bound],dens$x[upper.bound]),c(dens$y[lower.bound:upper.bound],0),col="tan") lower.bound <- head(which(dens$x >= 3),1) upper.bound <- length(dens$x) polygon(c(dens$x[lower.bound],dens$x[lower.bound:upper.bound]),c(0,dens$y[lower.bound:upper.bound]),col="tan")
Now we can check if the empirical distribution we have drawn fits the theorical t distribution.
Student's t-distribution has the probability density function given by:
The degree of freedom, which is related to the number of samples and variance in class 1 and 2, is the parameter that will control the shape of the probability density function.
The degree of freedom can be computed using the following formula.
Here we will consider that variance is equal to 1 (as x and y were drawn from a normal distribution with mean 0 and standard deviation 1)
So we can compute the degree of freedom using R and draw the theorical density of the t-distribution.
## Compute the degrees of freedom for Welch test N1 <- length(x) N2 <- length(y) var1 <- var(x) ## Since we ignore the variance of the population, we estimate it from the sample var2 <- var(y) df <- (var1/N1+var2/N2)^2/(var1^2/(N1^2*(N1-1))+var2^2/(N2^2*(N1-1))) ## Compare the degrees of freedom computed by ourselves with the df ## reported by the t.test() function. names(denboer.welch) print(denboer.welch$parameter)[back to contents]
We would like to define genes that discriminate between "TEL-AML1" tumors and "hyperdiploid" tumors.
One possibility would be to iterate over all probesets, and to successively run the R method t.test() on each one. This would however be quite inefficient, and the results would not be very easy to handle, since it would be a list of objects of the class t.test.
Instead, we will use a custom function that runs Student or Welch test in parallel on all the elements of a data table.
## Load the library for multiple t tests source('http://www.bigre.ulb.ac.be/courses/statistics_bioinformatics/R-files/config.R') source(file.path(dir.util, "util_student_test_multi.R"))
For the sake of curiosity, you can also have a look at the R code.
We will now run the Welch test on all the genes of the Den Boer dataset. We will apply the test to select genes differentially expressed between the subtypes "hyperdiploid" and "TEL-AML1".
## Define a Boolean vector indicating which samples belong to the two selected subtypes. samples.to.keep <- pheno$Sample.title == "hyperdiploid" | pheno$Sample.title == "TEL-AML1" ## Define a vector with the sample types for the two selected cancer subtype cancer.type <- as.vector(pheno[samples.to.keep, "Sample.title"]) print(cancer.type) table(cancer.type) ## Run the Welch test on each probesets of the DenBoer expression matrix denboer.welch <- t.test.multi(expr.matrix[samples.to.keep], cancer.type, volcano.plot=FALSE) ## Inspect the result table dim(denboer.welch) names(denboer.welch) ## Select genes with a stringent threshold on E-value threshold <- 0.05 significant.probesets <- denboer.welch$E.value <= threshold table(significant.probesets) ## Count the number of significant probesets
We will compare the mean expression value between the two subtypes, and highlight the significant genes.
## Plot the gene-wise means plot(denboer.welch[, c("mean.TEL-AML1", "mean.hyperdiploid")], col="darkgray") grid() abline(a=0,b=1, col="black") lines(denboer.welch[significant.probesets, c("mean.TEL-AML1", "mean.hyperdiploid")], type='p', col="red") ## Highlight significant genes
How do you explain that the regions covered by gray (non-significant) and red (significant) probesets overlap on the mean-mean plot ?
The most popular method for differential expression analysis of microarray data is "Significance Analysis of Microarrays" (SAM). SAM is implemented in several software including R (library siggenes). Here we will use the MultiExperiment Viewer (MeV).
Type the command below in a terminal (this are shell/bash commands)
mkdir -p ~/bin cd ~/bin wget "http://freefr.dl.sourceforge.net/project/mev-tm4/mev-tm4/MeV%204.8.1/MeV_4_8_1_r2727_linux.tar.gz" tar xvfz MeV_4_8_1_r2727_linux.tar.gz cd - echo -e "\nalias tmev=\"cd ~/bin/MeV_4_8_1/; sh -c 'nohup ./tmev.sh &'; cd -\"" >> ~/.bashrc source ~/.bashrc tmev