- Retrieving the den Boer normalized dataset
- Loading data into R
- Basics about Welch's
*t*test - Applying Welch's t-test to the den Boer dataset
- Significance Analysis of Microarrays (SAM)

Here we will use the GSE13425 experiment which which was retrieved from the Gene Expression Omnibus (GEO) public database. In this experiment, the authors were interested in the molecular classification of acute lymphoblastic leukemia (ALL) that are characterized by the abnormal clonal proliferation, within the bone marrow, of lymphoid progenitors blocked at a precise stage of their differentiation.

Data were produced using Affymetrix geneChips (Affymetrix Human Genome U133A Array, HGU133A). Informations related to this platform are available on GEO website under identifier GPL96.

[back to contents]Start R.

- Have a look at the description of the
`read.table()`function. - We will now load three data tables into R using
the
`read.table`function. The function allows us to directly read the tables from the Web server. We wll successively load 3 files providing complementary information.- the expression matrix (
*GSE13425_Norm_Whole.txt*), - the A/P/M matrix (
*GSE13425_AMP_Whole.txt*) - phenotypic data (
*GSE13425_phenoData.txt*)

?read.table url.base <- "http://www.bigre.ulb.ac.be/courses/statistics_bioinformatics/data/gene_expression/denboer_2009/" ## Load expression values expr.matrix <- read.table(file.path(url.base, "GSE13425_Norm_Whole.txt"),sep="\t", head=T, row=1) ## Load phenotypic data pheno <- read.table(file.path(url.base, 'phenoData_GSE13425.tab'), sep='\t', head=TRUE, row=1) ## Load Absent/Marginal/Present (AMP) calls amp <- read.table(file.path(url.base, "GSE13425_AMP_Whole.txt"),sep="\t", head=T, row=1)

The file

*GSE13425_Norm_Whole.txt*contains genes as rows and samples as columns. Data were previously normalized using`rma`algorithm (they are thus transformed in logarithm base 2). The*GSE13425_phenoData.txt*file contains phenotypic data about samples. The*GSE13425_AMP_Whole.txt*file contains information about A/P/M calls (genes as rows and samples as columns). - the expression matrix (
- We will now define a directory to store the results. In principle this directory has been created in the previosu tutorial (normalization of Affymetrix data), but it is worth checking that it still exists.

dir.output <- "~/GSE13425" ## Define the output directory. You can adapt this to your local configuration. dir.create(dir.output, showWarnings=FALSE, recurs=TRUE) ## Create the directory if it does not exist yet setwd(dir.output)

- How many rows and columns does the
*data*object contains ? - Does it correspond to the dimensions of the A/P/M matrix ?
- What informations are available about samples ?
- How many samples from each tumor subtype are present in the DenBoer dataset ?

View solution| Hide solution

The dataset from DenBoer contains 190 samples belonging to various tumour classes. We can already notice that there is an important imbalance between the sizes of the tumour classes: T-ALL, pre-B ALL, TEL-AML1 and hyperdiploid are each represented by more than 40 samples, whereas other classes (e.g. BCR-ABL, E2A-rearranged) are represented by a handful of samples.

The number of samples per group is a very important factor for selecting differentially expressed genes: in general, the power of the tests (i.e. the capacity to detect effectively differentially expressed genes) increases with group sizes.

[back to contents]Welch's test is a variant of the classical Student test, whose goal is to test the equality between two means.

The essential difference between Student and Welch is that the proper Student test relies on the assumption that the two sampled populations have the same variance, whereas Welch's test is designed to treat populations with unequal variances.

When detecting differentially expressed genes, we cannot assume equal variance. Indeed, a typical case would be that a gene of interest is expressed at very low level in the first group, and high level in the second group. The inter-individual fluctuations in expression values are expected to be larger when the gene is expressed at a high level than when it is poorly expressed. It is thus generally recommended to use Welch rather than Student test when analyzing microarray expression profiles.

Welch's t-test defines the *t* statistic by the following
formula:

where

- \(\ \bar{X}_{i} \) is the sample mean,
- \(\ s^2_{i} \) the sample variance,
- \(\ N_{i} \) the sample size.

The `t.test()` function can be used to calculate this score
(and additional informations such as p.value). This function returns
an S3 object whose slots can be listed using the `names()`
function and accessed using the $ operator (such as with lists in
R).

In order to get an intuition of the *t* statistics, let us
create artificial datasets and run a Welch test on it. In the
following example x and y can be viewed as the values for gene g in
class x and y.

## Generate 4 random numbers following a normal distribution, to ## simulate the samples 1 and 2. We deliberately set the means to ## the same values (to fall under the null hypothesis), but ## different standard deviations. x <- rnorm(n=4, mean=6, s=1) y <- rnorm(n=4, mean=6, s=2) ## Run the Welch test (this is specified by indicating that we don't expect equal variances) denboer.welch <- t.test(x,y, var.equal=FALSE) print(denboer.welch) ## note: each student should have a different result, since values were generated at random ## Retrieve the t statistics names(denboer.welch) denboer.welch$statistic

What would be the expected distribution of *t* when comparing
the expression levels in class x and y for 10000 genes whose values
would be randomly drawn from normal distributions with mean 0 and
standard deviations of 1 and 2, resp ?

## Run iteratively 10000 Welch tests with random values t.values <- replicate(10000,t.test(rnorm(4, 0, 1),rnorm(4, 0, 2))$statistic) dens <- density(t.values) plot(dens,xlim=c(-5,5))

We can now compute the empirical frequency of *t*
statistics with absolute value above or equal to 3.

length(t.values[abs(t.values) >= 3])/10000

We can visualize this by coloring the tails of the density function on a plot.

plot(dens,xlim=c(-5,5)) lower.bound <- 1 upper.bound <- tail(which(dens$x <= -3),1) plot(dens) polygon(c(dens$x[lower.bound:upper.bound],dens$x[upper.bound]),c(dens$y[lower.bound:upper.bound],0),col="tan") lower.bound <- head(which(dens$x >= 3),1) upper.bound <- length(dens$x) polygon(c(dens$x[lower.bound],dens$x[lower.bound:upper.bound]),c(0,dens$y[lower.bound:upper.bound]),col="tan")

Now we can check if the empirical distribution we have drawn
fits the theorical *t* distribution.

Student's t-distribution has the probability density function given by:

where

- \(\ \nu \) is the number of degrees of freedom.
- \(\ \Gamma \) is the Gamma function.

The degree of freedom, which is related to the number of samples and variance in class 1 and 2, is the parameter that will control the shape of the probability density function.

The degree of freedom can be computed using the following formula.

Here we will consider that variance is equal to 1 (as x and y were drawn from a normal distribution with mean 0 and standard deviation 1)

So we can compute the degree of freedom using R and draw the theorical density of the t-distribution.

## Compute the degrees of freedom for Welch test N1 <- length(x) N2 <- length(y) var1 <- var(x) ## Since we ignore the variance of the population, we estimate it from the sample var2 <- var(y) df <- (var1/N1+var2/N2)^2/(var1^2/(N1^2*(N1-1))+var2^2/(N2^2*(N1-1))) ## Compare the degrees of freedom computed by ourselves with the df ## reported by the t.test() function. names(denboer.welch) print(denboer.welch$parameter)[back to contents]

We would like to define genes that discriminate between "TEL-AML1" tumors and "hyperdiploid" tumors.

One possibility would be to iterate over all probesets, and to successively run the R method t.test() on each one. This would however be quite inefficient, and the results would not be very easy to handle, since it would be a list of objects of the class t.test.

Instead, we will use a custom function that runs Student or Welch test in parallel on all the elements of a data table.

## Load the library for multiple t tests source('http://www.bigre.ulb.ac.be/courses/statistics_bioinformatics/R-files/config.R') source(file.path(dir.util, "util_student_test_multi.R"))

For the sake of curiosity, you can also have a look at the R code.

We will now run the Welch test on all the genes of the Den Boer dataset. We will apply the test to select genes differentially expressed between the subtypes "hyperdiploid" and "TEL-AML1".

## Define a Boolean vector indicating which samples belong to the two selected subtypes. samples.to.keep <- pheno$Sample.title == "hyperdiploid" | pheno$Sample.title == "TEL-AML1" ## Define a vector with the sample types for the two selected cancer subtype cancer.type <- as.vector(pheno[samples.to.keep, "Sample.title"]) print(cancer.type) table(cancer.type) ## Run the Welch test on each probesets of the DenBoer expression matrix denboer.welch <- t.test.multi(expr.matrix[samples.to.keep], cancer.type, volcano.plot=FALSE) ## Inspect the result table dim(denboer.welch) names(denboer.welch) ## Select genes with a stringent threshold on E-value threshold <- 0.05 significant.probesets <- denboer.welch$E.value <= threshold table(significant.probesets) ## Count the number of significant probesets

We will compare the mean expression value between the two subtypes, and highlight the significant genes.

## Plot the gene-wise means plot(denboer.welch[, c("mean.TEL-AML1", "mean.hyperdiploid")], col="darkgray") grid() abline(a=0,b=1, col="black") lines(denboer.welch[significant.probesets, c("mean.TEL-AML1", "mean.hyperdiploid")], type='p', col="red") ## Highlight significant genes

How do you explain that the regions covered by gray (non-significant) and red (significant) probesets overlap on the mean-mean plot ?

The most popular method for differential expression analysis of microarray data is "Significance Analysis of Microarrays" (SAM). SAM is implemented in several software including R (library siggenes). Here we will use the MultiExperiment Viewer (MeV).

Type the command below in a terminal (this are shell/bash commands)

mkdir -p ~/bin cd ~/bin wget "http://freefr.dl.sourceforge.net/project/mev-tm4/mev-tm4/MeV%204.8.1/MeV_4_8_1_r2727_linux.tar.gz" tar xvfz MeV_4_8_1_r2727_linux.tar.gz cd - echo -e "\nalias tmev=\"cd ~/bin/MeV_4_8_1/; sh -c 'nohup ./tmev.sh &'; cd -\"" >> ~/.bashrc source ~/.bashrc tmev

- Load the file using File > Load data > Select file loader > Tab delimited.
- Browse to file "GSE13425_Norm_TEL-AML1vsHyperdip.txt", click on the upper-leftmost expression value and click on the "load" button.
- Select Adjust data > Gene/Rows adjustment > median center Genes/Rows
- Select Analysis > Statistics > SAM
- Set all samples from GSM338746.CEL.gz to GSM338789.CEL.gz to class B.
- Set the number of permutations to 500, select "Construct hierachical clustering" and click " OK"
- Accept default parameters for hierarchical clustering
- Set the delta value to 2 and click OK.
- Select Analysis results > SAM > Hierarchical trees > All Significant genes
- Select Display > Set color scale limits and set lower limit to -4, midpoint value to 0 and upper limit to 4.
- Select Display > Set Element Size > 5x2
- To store the resulting file right click to select the whole gene tree and select "Save cluster"