Selecting differentially expressed genes with R or TmeV

Retrieving the den Boer normalized dataset

Here we will use the GSE13425 experiment which which was retrieved from the Gene Expression Omnibus (GEO) public database. In this experiment, the authors were interested in the molecular classification of acute lymphoblastic leukemia (ALL) that are characterized by the abnormal clonal proliferation, within the bone marrow, of lymphoid progenitors blocked at a precise stage of their differentiation.

Data were produced using Affymetrix geneChips (Affymetrix Human Genome U133A Array, HGU133A). Informations related to this platform are available on GEO website under identifier GPL96.

[back to contents]

Start R.

• Have a look at the description of the read.table() function.
• We will now load three data tables into R using the read.table function. The function allows us to directly read the tables from the Web server. We wll successively load 3 files providing complementary information.
• the expression matrix (GSE13425_Norm_Whole.txt),
• the A/P/M matrix (GSE13425_AMP_Whole.txt)
• phenotypic data (GSE13425_phenoData.txt)
?read.table
url.base <- "http://www.bigre.ulb.ac.be/courses/statistics_bioinformatics/data/gene_expression/denboer_2009/"

## Load expression values

## Load phenotypic data
pheno <- read.table(file.path(url.base, 'phenoData_GSE13425.tab'), sep='\t', head=TRUE, row=1)

## Load Absent/Marginal/Present (AMP) calls


The file GSE13425_Norm_Whole.txt contains genes as rows and samples as columns. Data were previously normalized using rma algorithm (they are thus transformed in logarithm base 2). The GSE13425_phenoData.txt file contains phenotypic data about samples. The GSE13425_AMP_Whole.txt file contains information about A/P/M calls (genes as rows and samples as columns).

• We will now define a directory to store the results. In principle this directory has been created in the previosu tutorial (normalization of Affymetrix data), but it is worth checking that it still exists.
• dir.output <- "~/GSE13425" ## Define the output directory. You can adapt this to your local configuration.
dir.create(dir.output, showWarnings=FALSE, recurs=TRUE) ## Create the directory if it does not exist yet
setwd(dir.output)

• How many rows and columns does the data object contains ?
• Does it correspond to the dimensions of the A/P/M matrix ?
• What informations are available about samples ?
• How many samples from each tumor subtype are present in the DenBoer dataset ?

View solution| Hide solution

Interpretation

The dataset from DenBoer contains 190 samples belonging to various tumour classes. We can already notice that there is an important imbalance between the sizes of the tumour classes: T-ALL, pre-B ALL, TEL-AML1 and hyperdiploid are each represented by more than 40 samples, whereas other classes (e.g. BCR-ABL, E2A-rearranged) are represented by a handful of samples.

The number of samples per group is a very important factor for selecting differentially expressed genes: in general, the power of the tests (i.e. the capacity to detect effectively differentially expressed genes) increases with group sizes.

[back to contents]

Basics about Welch's t test

Welch's test is a variant of the classical Student test, whose goal is to test the equality between two means.

H0: mg,1 = mg,2
where mg,1 and mg,2 represent the respective mean expression values for a given gene (g) in two populations (for example, all existing patients suffering from T-ALL versus all patients suffering from pre-B ALL). Of course, we do not dispose of measurments for all the patients suffering from these two types of ALL in the world (the population). We only dispose of two sets of samples, covering 36 (T-ALL) and 44 (pre-B ALL) patients, respectively. On the basis of these samples, we will estimate how likely it is that genes g is generally expressed at similar levels in the populations from which the samples were drawn.

The essential difference between Student and Welch is that the proper Student test relies on the assumption that the two sampled populations have the same variance, whereas Welch's test is designed to treat populations with unequal variances.

When detecting differentially expressed genes, we cannot assume equal variance. Indeed, a typical case would be that a gene of interest is expressed at very low level in the first group, and high level in the second group. The inter-individual fluctuations in expression values are expected to be larger when the gene is expressed at a high level than when it is poorly expressed. It is thus generally recommended to use Welch rather than Student test when analyzing microarray expression profiles.

Welch's t-test defines the t statistic by the following formula:

where

• $$\ \bar{X}_{i}$$ is the sample mean,
• $$\ s^2_{i}$$ the sample variance,
• $$\ N_{i}$$ the sample size.

The t.test() function can be used to calculate this score (and additional informations such as p.value). This function returns an S3 object whose slots can be listed using the names() function and accessed using the $operator (such as with lists in R). In order to get an intuition of the t statistics, let us create artificial datasets and run a Welch test on it. In the following example x and y can be viewed as the values for gene g in class x and y. ## Generate 4 random numbers following a normal distribution, to ## simulate the samples 1 and 2. We deliberately set the means to ## the same values (to fall under the null hypothesis), but ## different standard deviations. x <- rnorm(n=4, mean=6, s=1) y <- rnorm(n=4, mean=6, s=2) ## Run the Welch test (this is specified by indicating that we don't expect equal variances) denboer.welch <- t.test(x,y, var.equal=FALSE) print(denboer.welch) ## note: each student should have a different result, since values were generated at random ## Retrieve the t statistics names(denboer.welch) denboer.welch$statistic



What would be the expected distribution of t when comparing the expression levels in class x and y for 10000 genes whose values would be randomly drawn from normal distributions with mean 0 and standard deviations of 1 and 2, resp ?

## Run iteratively 10000 Welch tests with random values
t.values <- replicate(10000,t.test(rnorm(4, 0, 1),rnorm(4, 0, 2))$statistic) dens <- density(t.values) plot(dens,xlim=c(-5,5))  We can now compute the empirical frequency of t statistics with absolute value above or equal to 3. length(t.values[abs(t.values) >= 3])/10000  We can visualize this by coloring the tails of the density function on a plot. plot(dens,xlim=c(-5,5)) lower.bound <- 1 upper.bound <- tail(which(dens$x <= -3),1)
plot(dens)
polygon(c(dens$x[lower.bound:upper.bound],dens$x[upper.bound]),c(dens$y[lower.bound:upper.bound],0),col="tan") lower.bound <- head(which(dens$x >= 3),1)
upper.bound <- length(dens$x) polygon(c(dens$x[lower.bound],dens$x[lower.bound:upper.bound]),c(0,dens$y[lower.bound:upper.bound]),col="tan")


Now we can check if the empirical distribution we have drawn fits the theorical t distribution.

Student's t-distribution has the probability density function given by:

where

• $$\ \nu$$ is the number of degrees of freedom.
• $$\ \Gamma$$ is the Gamma function.

The degree of freedom, which is related to the number of samples and variance in class 1 and 2, is the parameter that will control the shape of the probability density function.

The degree of freedom can be computed using the following formula.

Here we will consider that variance is equal to 1 (as x and y were drawn from a normal distribution with mean 0 and standard deviation 1)

So we can compute the degree of freedom using R and draw the theorical density of the t-distribution.

## Compute the degrees of freedom for Welch test
N1 <- length(x)
N2 <- length(y)
var1 <- var(x) ## Since we ignore the variance of the population, we estimate it from the sample
var2 <- var(y)
df <- (var1/N1+var2/N2)^2/(var1^2/(N1^2*(N1-1))+var2^2/(N2^2*(N1-1)))

## Compare the degrees of freedom computed by ourselves with the df
## reported by the t.test() function.
names(denboer.welch)
print(denboer.welch$parameter)  [back to contents] Applying Welch's t-test to the den Boer dataset We would like to define genes that discriminate between "TEL-AML1" tumors and "hyperdiploid" tumors. One possibility would be to iterate over all probesets, and to successively run the R method t.test() on each one. This would however be quite inefficient, and the results would not be very easy to handle, since it would be a list of objects of the class t.test. Instead, we will use a custom function that runs Student or Welch test in parallel on all the elements of a data table. Running t-tests on each row of a data matrix ## Load the library for multiple t tests source('http://www.bigre.ulb.ac.be/courses/statistics_bioinformatics/R-files/config.R') source(file.path(dir.util, "util_student_test_multi.R"))  For the sake of curiosity, you can also have a look at the R code. We will now run the Welch test on all the genes of the Den Boer dataset. We will apply the test to select genes differentially expressed between the subtypes "hyperdiploid" and "TEL-AML1". ## Define a Boolean vector indicating which samples belong to the two selected subtypes. samples.to.keep <- pheno$Sample.title == "hyperdiploid" |  pheno$Sample.title == "TEL-AML1" ## Define a vector with the sample types for the two selected cancer subtype cancer.type <- as.vector(pheno[samples.to.keep, "Sample.title"]) print(cancer.type) table(cancer.type) ## Run the Welch test on each probesets of the DenBoer expression matrix denboer.welch <- t.test.multi(expr.matrix[samples.to.keep], cancer.type, volcano.plot=FALSE) ## Inspect the result table dim(denboer.welch) names(denboer.welch) ## Select genes with a stringent threshold on E-value threshold <- 0.05 significant.probesets <- denboer.welch$E.value <= threshold
table(significant.probesets) ## Count the number of significant probesets


Comparing sample means

We will compare the mean expression value between the two subtypes, and highlight the significant genes.

## Plot the gene-wise means
plot(denboer.welch[, c("mean.TEL-AML1", "mean.hyperdiploid")], col="darkgray")
grid()
abline(a=0,b=1, col="black")
lines(denboer.welch[significant.probesets, c("mean.TEL-AML1", "mean.hyperdiploid")], type='p', col="red") ## Highlight significant genes


How do you explain that the regions covered by gray (non-significant) and red (significant) probesets overlap on the mean-mean plot ?

View solution| Hide solution [back to contents]

Significance Analysis of Microarrays (SAM)

The most popular method for differential expression analysis of microarray data is "Significance Analysis of Microarrays" (SAM). SAM is implemented in several software including R (library siggenes). Here we will use the MultiExperiment Viewer (MeV).

Installing Mev

Type the command below in a terminal (this are shell/bash commands)

mkdir -p ~/bin
cd ~/bin
wget "http://freefr.dl.sourceforge.net/project/mev-tm4/mev-tm4/MeV%204.8.1/MeV_4_8_1_r2727_linux.tar.gz"
tar xvfz MeV_4_8_1_r2727_linux.tar.gz
cd -
echo -e "\nalias tmev=\"cd ~/bin/MeV_4_8_1/; sh -c 'nohup ./tmev.sh &'; cd -\"" >> ~/.bashrc
source ~/.bashrc
tmev


Running MeV

• Load the file using File > Load data > Select file loader > Tab delimited.
• Browse to file "GSE13425_Norm_TEL-AML1vsHyperdip.txt", click on the upper-leftmost expression value and click on the "load" button.
• Select Adjust data > Gene/Rows adjustment > median center Genes/Rows
• Select Analysis > Statistics > SAM
• Set all samples from GSM338746.CEL.gz to GSM338789.CEL.gz to class B.
• Set the number of permutations to 500, select "Construct hierachical clustering" and click " OK"
• Accept default parameters for hierarchical clustering
• Set the delta value to 2 and click OK.
• Select Analysis results > SAM > Hierarchical trees > All Significant genes
• Select Display > Set color scale limits and set lower limit to -4, midpoint value to 0 and upper limit to 4.
• Select Display > Set Element Size > 5x2
• To store the resulting file right click to select the whole gene tree and select "Save cluster"