AMU :: M2 BBSG :: ASG1 :: 2012/2013 :: Denis Puthier & Jacques van Helden

Statistics for Bioinformatics - Practicals - Supervised classification


Contents


Abbreviations

ALLacute lymphoblastic leukemia
AMLacute myeloblastic leukemia
DEGdifferentiall expressed genes
GEOGene Expression Omnibus database
KNNK-nearest neighbours
LDAlinear discriminant analysis
QDAquadratic discriminant analysis
PCAprincipal component anlaysis
[back to contents]

R configuration

We first need to define the URL of the course (dir.base), from which we will download some pieces of R code and the data sets to analyze.

## Specify the URL of the base for the course
dir.base <- 'http://www.bigre.ulb.ac.be/courses/statistics_bioinformatics'
The following command loads a general configuration file, specifying the input directories (data, R utilities) and creating output directories on your computer (dir.results, dir.figures)..
## Load the general configuration file
source(file.path(dir.base, 'R-files', 'config.R'))
[back to contents]

Goal of this tutorial

In this tutorial, we will get familiar with the basic concepts of supervised classification.
  1. Distinction between unsupervised (clustering) and supervised classification
  2. Training, testing and prediction
  3. Cross-validation, leave-one-out
  4. Over-dimensionality
  5. Overfitting
[back to contents]

Study case

For the practical, we will use a cohort comprized of 190 samples from patients suffering from Acute Lymphoblastic Leukemia (ALL) from DenBoer (2009). The raw data has previously been retrieved from the Gene Expression Omnibus (GEO) database (http://www.ncbi.nlm.nih.gov/geo/).

Data loading

We can now load the profile table and check its dimensions.

## Define the location of data directory and file containing expression profiles
dir.denboer <- file.path(dir.base, 'data', 'gene_expression','denboer_2009')
file.expr.table <- file.path(dir.denboer, 'GSE13425_Norm_Whole.txt')

## Load the expression profiles (one row per gene, one column per sample).
##
## BEWARE, the whole file makes 20Mb. 
## It can take a few minutes to be downloaded from the Web site and loaded in R.
expr.matrix <- read.table(file.expr.table, sep = "\t", head = T, row = 1)
print(dim(expr.matrix))
## Should give this: 22283   190

Once the whole data set has been loaded, the data frame "expr.matrix" should contain 22,283 rows (genes) and 190 columns (samples).

We can now load the sample descriptions ("phenoData").

## Load the table describing each sample 
## (one row per sample, one column per description field).
pheno <- read.table(file.path(dir.denboer, 'phenoData_GSE13425.tab'), sep='\t', head=TRUE, row=1)
dim(pheno)
## [1] 190   4

## We can check the content of the phenoData table by
## consulting its column names.
names(pheno)
[1] "Sample.title"               "Sample.source.name.ch1"    
[3] "Sample.characteristics.ch1" "Sample.description"        

The column Sample.title indicates the cancer subtype corresponding to each sample. We can count the number of samples per subtype, and display them by decreasing group size.

## Print the number of samples per cancer type
print(data.frame("n"=sort(table(pheno$Sample.title),decreasing=T)))
                          n
hyperdiploid             44
pre-B ALL                44
TEL-AML1                 43
T-ALL                    36
E2A-rearranged (EP)       8
BCR-ABL                   4
E2A-rearranged (E-sub)    4
MLL                       4
BCR-ABL + hyperdiploidy   1
E2A-rearranged (E)        1
TEL-AML1 + hyperdiploidy  1
For the sake of visualization, we will define short labels corresponding to each ALL subtype, and assign these short labels to the samples.
## Define an abbreviated name for each canceer subtype 
## (will be used later visualization on the plots)
group.abbrev <- c(
      'BCR-ABL + hyperdiploidy'='Bch',
      'BCR-ABL'='Bc',
      'E2A-rearranged (E)'='BE',
      'E2A-rearranged (E-sub)'='BEs',
      'E2A-rearranged (EP)'='BEp',
      'MLL'='BM',
      'T-ALL'='T',
      'TEL-AML1 + hyperdiploidy'='Bth',
      'TEL-AML1'='Bt',
      'hyperdiploid'='Bh',
      'pre-B ALL'='Bo'
                   )
sample.subtypes <- as.vector(pheno$Sample.title)
sample.labels <- group.abbrev[sample.subtypes]
names(sample.labels) <- names(expr.matrix)


## Check the label for a random selection of 10 samples. 
## Each run should give a different result
sample(sample.labels, size=10)
GSM338831 GSM338722 GSM338816 GSM338838 GSM338752 GSM338819 GSM338855 GSM338851 
     "Bo"      "Bt"      "Bo"      "Bo"      "Bh"      "Bo"      "Bo"      "Bo" 
GSM338672 GSM338693 
      "T"       "T" 

We can also define group-specific colors and assign them to samples.

## Define group-specific colors
group.colors <- c(
                    'BCR-ABL + hyperdiploidy'='cyan',
                    'BCR-ABL'='black',
                    'E2A-rearranged (E)'='darkgray',
                    'E2A-rearranged (E-sub)'='green',
                    'E2A-rearranged (EP)'='orange',
                    'MLL'='#444400',
                    'T-ALL'='violet',
                    'TEL-AML1 + hyperdiploidy'='#000066',
                    'TEL-AML1'='darkgreen',
                    'hyperdiploid'='red',
                    'pre-B ALL'='blue'
                    )

## Assign group-specific colors to patients
sample.colors <- group.colors[as.vector(pheno$Sample.title)]
names(sample.colors) <- names(expr.matrix)
sample(sample.colors,size=20)
  GSM338802   GSM338760   GSM338821   GSM338846   GSM338750   GSM338735 
    "green"       "red"      "blue"      "blue"       "red" "darkgreen" 
  GSM338724   GSM338768   GSM338776   GSM338730   GSM338753   GSM338706 
"darkgreen"       "red"       "red" "darkgreen"       "red" "darkgreen" 
  GSM338720   GSM338701   GSM338708   GSM338828   GSM338771   GSM338805 
"darkgreen"    "violet" "darkgreen"      "blue"       "red"     "black" 
  GSM338705   GSM338716 
"darkgreen" "darkgreen" 
[back to contents]

Reducing the dimensionality of the data - feature selection

An omnipresent problem with microarray data is the large dimensionality of the data space. The dataset we are analyzing contains 190 samples (the subjects), each characterized by 22,283 gene expression values (the variables).

The number of variables (also called features, the genes in our case) thus exceeds by far the number of objects (samples in this case). This sitation is qualified of over-dimensionality, and poses serious problems for classification. In unsupervised classification (clustering), the relationships between objects will be affected, since a vast majority of the features are likely to be uninformative, but will however contribute to the computed (dis)similarity metrics (whichever metrics is used).

A pair of genes

To get some feeling about the data, we will compare the expression levels of two arbitrarily genes selected (the 236th and the 1213th rows of the profile table).

## Plot the expression profiles of two arbitrarily selected genes
g1 <- 236
g2 <- 1213
x <- as.vector(as.matrix(expr.matrix[g1,]))
y <- as.vector(as.matrix(expr.matrix[g2,]))
plot(x,y,
       col=sample.colors,
       type='n',
       panel.first=grid(col='black'), 
         main='Den Boer (2009), two selected genes', 
         xlab=paste('gene', g1), ylab=paste('gene', g2))
text(x, y,labels=sample.labels,col=sample.colors,pch=0.5)
legend('topleft',col=group.colors, 
         legend=names(group.colors),pch=1,cex=0.7,bg='white',bty='o')
     

Each dot corresponds to one sample. Since the genes were initially selected at random, we don't expect them to be particularly good at discriminating the different subtypes of ALL. However, we can already see some emerging patterns: the T-ALL (labelled "T") show a trends to strongly express the 236th gene, and, to a lesser extent, several hyperdiploid B-cells (label "Bh") show a high expression of the 1213th gene. Nevertheless, it would be impossible to draw a boundary that would perfectly separate two subtypes in this plot.

Variance filter

In order to reduce the dimensionality of the data set, we will sort the genes according to their variance, and retain a defined number of the top-ranking genes. Note that this ranking is a very rudimentary way to select a subset of genes. Indeed, nothing guarantees us that the genes showing the largest inter-individual fluctuations of expression have anything related to cancer type. In further sections, we will present alternative methods of variable ordering, which will explicitly take into account the inter-group variance.

The R function apply() can be used to apply any function to each row (or alternatively each column) of a data table. We set the second argument (margin) to 1, thereby indicating that the function (third argument) must be applied to each row of the input table (first argument).

## Compute sample-wise variance
var.per.sample <- apply(expr.matrix, 2, var)
head(var.per.sample)

## Inspect the distribution of sample-wise variance
hist(var.per.sample, breaks=20)
GSM338666 GSM338667 GSM338668 GSM338669 GSM338670 GSM338671 
 2.636681  2.636490  2.749702  2.436275  2.634296  2.538188 
## Compute gene-wise variance
var.per.gene <- apply(expr.matrix, 1, var)

## Inspect the distribution of gene-wise variance
hist(var.per.gene, breaks=100)

We notice that gene-wise variances have a wide dispersion. the histogram shows a right-skewed distribution, with a long tail due to the presence of a few genes with high variance.

If we sort genes by decreasing variance, we can see that there is a strong difference between the top and the bottom of the list.

## Sort genes per decreasing variance
genes.by.decr.var <- sort(var.per.gene,decreasing=TRUE)

## Print the 5 genes with highest variance
head(genes.by.decr.var)
     CD9|201005_at  IL23A|211796_s_at     CD3D|213539_at S100A8|202917_s_at   KLF4|221841_s_at 
          6.042537           5.929345           5.696317           5.506174           5.245490 
 HLA-DRA|208894_at 
          5.160823 
## Print the 5 genes with lowest variance
tail(genes.by.decr.var)
 C7orf25|53202_at     FMO3|40665_at     SCLY|59705_at NA|AFFX-DapX-5_at  HSD17B6|37512_at NA|AFFX-LysX-5_at 
      0.007035592       0.007000936       0.006630802       0.006543372       0.006439234       0.005077204 

We can then select an arbitrary number of top-ranking genes in the list.

## Select the 30 top-ranking genes in the list sorted by variance.
## This list of genes will be used below as training variables for
## supervised classification.
top.nb <- 20 ## This number can be changed for testing
genes.selected.by.var <- names(genes.by.decr.var[1:top.nb])

## Check the names of the first selected genes
head(genes.selected.by.var, n=20) 
 [1] "CD9|201005_at"       "IL23A|211796_s_at"   "CD3D|213539_at"      "S100A8|202917_s_at" 
 [5] "KLF4|221841_s_at"    "HLA-DRA|208894_at"   "NA|216379_x_at"      "IL23A|210915_x_at"  
 [9] "HLA-DRA|210982_s_at" "ITM2A|202746_at"     "RPS4Y1|201909_at"    "NA|209771_x_at"     
[13] "NPY|206001_at"       "SH3BP5|201811_x_at"  "NRN1|218625_at"      "LCK|204891_s_at"    
[17] "DDX3Y|205000_at"     "IL23A|213193_x_at"   "DEFA1|205033_s_at"   "TCF4|212386_at"     

In the next sections, we will compare methods for selecting genes on the basis of different criteria (variance, T-test, ANOVA test, step-forward procedure in Linear Discriminant Analysis). For this purpose, we will create a table indicating the values and the rank of each gene (rows of the table) according to eachs election criterion (columns).

## Create a data frame to store gene values and ranks 
## for different selection criteria.
gene.ranks <- data.frame(var=var.per.gene)
head(gene.ranks)
                         var
DDR1|1007_s_at   0.188580722
RFC2|1053_at     0.060546361
HSPA6|117_at     0.456587060
PAX8|121_at      0.027817178
GUCA1A|1255_g_at 0.007829279
UBA7|1294_at     0.190982366
## Beware, we rank according to minus variance, because 
## we want to associate the lowest ranks to the highest variances
gene.ranks$var.rank <- rank(-gene.ranks$var, ties.method='random')
head(gene.ranks)
                         var var.rank
DDR1|1007_s_at   0.188580722     5369
RFC2|1053_at     0.060546361    10659
HSPA6|117_at     0.456587060     1971
PAX8|121_at      0.027817178    17936
GUCA1A|1255_g_at 0.007829279    22265
UBA7|1294_at     0.190982366     5306
## Check the rank of the 5 genes with highest and lowest variance, resp.
gene.ranks[names(genes.by.decr.var[1:5]),]
                        var var.rank
CD9|201005_at      6.043017        1
IL23A|211796_s_at  5.928425        2
CD3D|213539_at     5.696326        3
S100A8|202917_s_at 5.508163        4
KLF4|221841_s_at   5.245057        5
#1# Print the bottom 5 genes of the variance-sorted table
gene.ranks[names(tail(genes.by.decr.var)),]
C7orf25|53202_at  0.007035592    22278
FMO3|40665_at     0.007000936    22279
SCLY|59705_at     0.006630802    22280
NA|AFFX-DapX-5_at 0.006543372    22281
HSD17B6|37512_at  0.006439234    22282
NA|AFFX-LysX-5_at 0.005077204    22283

We have no specific reason to think that genes having a high variance will be specially good at discriminating ALL subtypes. Indeed, a high variance might a priori either reflect cell-type specificities, or differences between samples that result from any other effect (individual patient genomes, transcriptome, condition, ...). For the sake of curiosity, let us plot samples on a XY plot where the abcsissa represents the top-raking, and the ordinate the second top-ranking gene in the variance-ordered list.

## Plot the expression profiles of the two genes with highest variance
g1 <- names(genes.by.decr.var[1])
print(g1)
[1] "CD9|201005_at"
(g2 <- names(genes.by.decr.var[2]))
[1] "IL23A|211796_s_at"
x <- as.vector(as.matrix(expr.matrix[g1,]))
y <- as.vector(as.matrix(expr.matrix[g2,]))
plot(x,y,
       col=sample.colors,
       type='n',
       panel.first=grid(col='black'), 
       main="2 genes with the highest variance", 
       xlab=paste('gene', g1), 
       ylab=paste('gene', g2))
text(x, y,labels=sample.labels,col=sample.colors,pch=0.5)
legend('topright',col=group.colors, 
         legend=names(group.colors),pch=1,cex=0.7,bg='white',bty='o')

Somewhat surprizingly plot shows that, despite the roughness of our ranking criterion (variance across all cancer types), the two top-ranking genes have each some discriminant power:

Selecting differentially expressed genes (DEG) as predictor variables

Variable ordering by Welch t-test (2-groups test)

The T-test tests the hypothesis of equality between the means of two populations. We can use the Welch version of the t-test (which assumes that groups can have different variances) in order to select genes differentially expressed between two goups of samples.

We thus need to define two groups of samples (multi-group comparisons will be treated by ANOVA). For the dataset on ALL? we have several subtypes of cancer. we will perform pairwise comparisons of mean for a selection of subtypes. In each case, a given subtype (e.g. T-cells) will be compared to all other subtypes pooled together.

We will successvely run Welch's t-test for the main subtypes ("Bo", "Bh", "Bt", "T").

In each case, apply the test in to each gene, using the function t.test.mult() defined in the R utilities of this course.

## Load a utility to apply Welch test on each row of a table
source(file.path(dir.util, "util_student_test_multi.R"))

## Define a vector indicating whether each sample 
## belongs to the subtype of interest (e.g. "Bo") or not.
current.group <- "Bo"
one.group.vs.others<- sample.labels
one.group.vs.others[sample.labels != current.group] <- "other"
print(table(one.group.vs.others))
one.group.vs.others
   Bo other 
   44   146 
## Test the mean equality between Bo subtype and all other subtypes 
welch.one.group.vs.others <- t.test.multi(expr.matrix, one.group.vs.others)
[1] "Thu Aug 23 09:19:08 2012 - Multiple t-test started"
[1] "Thu Aug 23 09:19:11 2012 - Multiple t-test done"
names(welch.one.group.vs.others)
 [1] "mean.other"    "mean.Bo"       "means.diff"    "var.est.other"
 [5] "var.est.Bo"    "sd.est.other"  "sd.est.Bo"     "st.err.diff"  
 [9] "t.obs"         "df.welch"      "P.value"       "E.value"      
[13] "sig"          
## Update the gene rank table
test.name <- paste(current.group, '.vs.others.sig', sep='')
gene.ranks[,test.name] <- welch.one.group.vs.others$sig ## Add a column with significances
gene.ranks[,paste(test.name, ".rank", sep="")] <- rank(-welch.one.group.vs.others$sig, , ties.method='random') ## Add a column with significance ranks

## Do the same Welch test for the 3 other majority groups
for (current.group in c("Bh", "Bt", "T")) {
    one.group.vs.others<- sample.labels
    one.group.vs.others[sample.labels != current.group] <- "other"
    
    ## Test the mean equality between Bo subtype and all other subtypes 
    welch.one.group.vs.others <- t.test.multi(expr.matrix, one.group.vs.others)
    
    ## Update the gene rank table
    test.name <- paste(current.group, '.vs.others.sig', sep='')
    gene.ranks[,test.name] <- welch.one.group.vs.others$sig
    gene.ranks[,paste(test.name, ".rank", sep="")] <- rank(-welch.one.group.vs.others$sig, , ties.method='random')
  }


## Check the resulting gene table
head(gene.ranks)
head(gene.ranks[order(gene.ranks$T.vs.others.sig.rank),])
tail(gene.ranks[order(gene.ranks$T.vs.others.sig.rank),])

## Store the gene rank table in a text file (tab-separated columns)
write.table(gene.ranks, file=file.path(dir.results, 'DenBoer_gene_ranks.tab'), sep='\t', quote=F, col.names=NA)
## Plot variance against significance of the Welch test for the 4 different groups
plot(gene.ranks[,c("var", "Bo.vs.others.sig", "Bh.vs.others.sig", "Bt.vs.others.sig", "T.vs.others.sig")], col="grey")


## Plot ranks of variance against significance of the Welch test for the 4 different groups
plot(gene.ranks[,c("var.rank", "Bo.vs.others.sig.rank", "Bh.vs.others.sig.rank", "Bt.vs.others.sig.rank", "T.vs.others.sig.rank")], col="grey")

An obvious observation: the gene-wise variance and the 4 results of Welch tests (one cancer type against all other types) return very different values, and rank the gene in completely different ways. This is not surprizing, the variance and the 4 significances indicate different properties of the genes. Gene selection will thus have to be adapted to the specific purpose of the classification.

[back to contents]

Principal Component Analysis

Before starting the proper process of supervised classification, we can apply a method called Principal Component Analysis (PCA) to evaluate the repartition of the information between the mutiple variables, and to inspect the "intrinsic" structure of the data, i.e. the structure inherent to the numbers in the data table, irrespective of the labels (cancer subtypes) attached to the various samples.

Purpose of PCA

We can do the exercise of extending this 2-dimensional plot to a 3-dimensional plot, where the third dimension represents the expression level of a third gene. With an effort of imagination, we can mentally extend this 3D plot to a 4-dimensional plot, where each dimension would represent a different gene. It is likely that the groups of genes will progressively become more separated as the number of dimension increases. However, for the sake of graphical representation, it is difficult to work with more than 2 (or at most 3) dimensions.

The purpose of Principal Component Analysis (PCA) is to capture the largest part of the variance of a data set in a minimal number of dimensions.

Applying PCA transformation with stats::prcomp()

The R method stats::prcomp() performs a PCA transformation of an input table. We can feed it with the expression table, but we need to transpose it first (using the R function t()), in order to provide our objects of interest (the samples) as rows, and the variables (genes) as columns.

## load the stats library to use the princomp() and prcomp() function
library(stats) 

## Perform the PCA transformation
expr.prcomp <- prcomp(t(expr.matrix),cor=TRUE)

## Analyze the content of the prcomp result: 
## the result of the method prcomp() is an object 
## belonging to the class "prcomp"
class(expr.prcomp) 
## Get the field names of the promp objects
names(expr.prcomp) 

## Get the attributes of the promp objects
attributes(expr.prcomp) 

Repartition of the standard deviation along the components

A first information is the repartition of the variance (or its squared root, the standard deviation) between the components, which can be displayed on a plot.

plot(expr.prcomp, main='Den Boer (2009), Variance  per component', xlab='Component')

The standard deviation barplot (Figure~\ref{DenBoer_pca_variance}) highlights that the first component captures more or less twice as much standard deviation of the whole dataset as the second one. We can measure the relative importance of the standard deviations.

## Get the standard deviation and variance per principal component
sd.per.pc <- expr.prcomp$sdev
var.per.pc <- sd.per.pc^2

## Display the percentage of total variance explained by each 
sd.per.pc.percent <- sd.per.pc/sum(sd.per.pc)
var.per.pc.percent <- var.per.pc/sum(var.per.pc)
barplot(var.per.pc.percent[1:10], main='Den Boer (2009), Percent of variance  per component', xlab='Component', ylab='Percent variance', col='#BBDDFF')

Analysis of the first versus second component

We can generate a biplot, where each sample appears as a dot, and the X and Y axes respectively represent the first and second components of the PCA-transformed data. The R function stats::biplot() automatically generates the biplot, but the result is somewhat confusing, because the biplot displays both the units (samples) and a projection of the variables (genes) on the two first components.

biplot(expr.prcomp,var.axes=FALSE,
         panel.first=grid(col='black'), 
         main=paste('PCA; Den Boer (2009); ',
           ncol(expr.matrix), 'samples *', nrow(expr.matrix), 'genes', sep=' '), 
         xlab='First component', ylab='Second component')

Let us now generate a custom plot, with labels and colors indicating the subtype of ALL.

## Plot components PC1 and PC2
plot(expr.prcomp$x[,1:2],
       col=sample.colors,
       type='n',
       panel.first=grid(col='black'), 
         main=paste('PCA; Den Boer (2009); ',
           ncol(expr.matrix), 'samples *', nrow(expr.matrix), 'genes', sep=' '), 
         xlab='PC1', ylab='PC2')
text(expr.prcomp$x[,1:2],labels=sample.labels,col=sample.colors,pch=0.5)
legend('bottomleft',col=group.colors, 
         legend=names(group.colors),pch=1,cex=0.7,bg='white',bty='o')

The coloring and cancer-type labels highlight a very interesting property of the PCA result: the two first components of the PCA-transformed data perfectly separate the T-ALL samples from all the other subtypes. All T-ALL cells appear in one elongated cloud on the left side of the plot. The other cloud seems to contain some organization as well: subtypes are partly intermingled but there are obvious groupings.

Analysis of the second versus third component

The following figure shows that the second and third components capture information related to the cancer subtypes: the third component separates quite well TEL-AML1 (top) from hyperdiploid (bottom) samples, whereas the pre-B have intermediate values. The separation is however less obvious than the T-ALL versus all other subtypes that we observed in the two first components.

## Plot components PC2 and PC3
plot(expr.prcomp$x[,2:3],
       col=sample.colors,
       type='n',
       panel.first=grid(col='black'), 
         main=paste('PCA; Den Boer (2009); ',
           ncol(expr.matrix), 'samples *', nrow(expr.matrix), 'genes', sep=' '), 
         xlab='PC2', ylab='PC3')
text(expr.prcomp$x[,2:3],labels=sample.labels,col=sample.colors,pch=0.5)     
legend('bottomleft',col=group.colors, 
         legend=names(group.colors),pch=1,cex=0.7,bg='white',bty='o')

Exercise

Generate plots of a few additional components (PC4, PC5,...) and try to evaluate if they further separate subtypes of cancers.

Discussion of the PCA results

In the "historical" dataset from Golub (1999), three sample groups (AML, T-ALL and B-ALL) appeared almost perfectly separated by a simple display of the two first components of the PCA-transformed data. The dataset from DenBoer (2009) is less obvious to interpret, due to the nature of the data itself: firstly, all the samples come from the same cell type (lymphoblasts), whereas the main separation of Golub was between myeloblasts and lymphoblasts. Secondly, Den Boer and co-workers defined a wider variety of subtypes. However, we see that a simple PCA transformation already reveals that the raw expression data is well structured. It is quite obvious that we will have no difficulty to find a signature discriminating T-ALL from other ALL subtypes, since T-ALL are already separated on the plane of the two first components. We would however like to further refine the diagnostics, by training a classifier to discriminate between the other subtypes as well.

Note that until now we did not apply any training: PCA transformation is a "blind" (unsupervised) approach, performing a rotation in the variable space without using any information about pre-defined classes. Since we dispose of such information for the 190 samples of the training set, we can use a family of other approaches, generically called supervised classification, that will rely on class membership of the training samples in order to train a classifier, which will further be used to assign new samples to the same classes.

[back to contents]

Linear Discriminant Analysis (LDA)

Training the classifier

As a first trial, we will train a LDA classifier using as variables the 20 top-raking genes in the list of genes sorted by variance.

## Load the library containing the linear discriminant analysis function
library(MASS)

## Create a list of gene names sorted by decreasing variance
sorted.names <- rownames(gene.ranks)[order(gene.ranks$var, decreasing=TRUE)]


## Select the 20 top-ranking genes sorted by decreasing variance
top.variables <- 20
selected.genes <- sorted.names[1:top.variables]
## Train the classifier
lda.classifier <- lda(t(expr.matrix[selected.genes,]),sample.labels,CV=FALSE) 

Evaluating the hit rate by Leave-One-Out (LOO) cross-validation}
## Use the MASS:lda() function with the cross-validation option
lda.loo <- lda(t(expr.matrix[selected.genes,]),sample.labels,CV=TRUE) 

## Collect the LOO prediction result in a vector
loo.predicted.class <- as.vector(lda.loo$class)
print(loo.predicted.class)
table(loo.predicted.class)

## Build a contingency table of known versus predicted class
lda.loo.xtab <- table(sample.labels, loo.predicted.class)
print(lda.loo.xtab)

The result looks like this:

             loo.predicted.class
sample.labels Bc BE BEp BEs Bh BM Bo Bt  T
          Bc   0  0   0   0  0  0  4  0  0
          Bch  0  0   0   0  0  0  0  0  0
          BE   0  0   0   0  0  0  0  0  0
          BEp  0  1   6   0  0  0  1  0  0
          BEs  0  0   0   3  0  0  1  0  0
          Bh   0  0   0   1 40  0  3  0  0
          BM   0  0   0   0  0  3  1  0  0
          Bo   3  0   2   2  7  1 25  4  0
          Bt   0  0   2   1  0  0  1 39  0
          Bth  0  0   0   0  0  0  0  0  0
          T    0  0   0   0  0  0  0  0 36
## Display the contingency table as a heat map
image(lda.loo.xtab)
library(lattice)
levelplot(lda.loo.xtab)

## Compute the hit rate
hits <- sample.labels == loo.predicted.class
errors <- sample.labels != loo.predicted.class

## Compute the number of hits 
## (we need to omit NA values because LDA fails to assign a group to some objects).
(nb.hits <- sum(na.omit(hits))) ## this should give 152
(nb.pred <- length(na.omit(hits))) ## This should give 187
(hit.rate <- nb.hits / nb.pred ) ## This should give 0.81

Testing the random expectation with permutation tests}

Exercises

  1. Use the R function sample() to randomly permute the training labels (store the result in a vector called "sample.labels.perm"), and test the hit rate of a classifier trained with these randomized labels.
  2. Use the R function sample() to randomly permute the values of the input table (store the result in a data frame called "expr.perm"), and test the hit rate of a classifier trained with these randomized data.

Solutions

## Permute the training labels
sample.labels.perm <- as.vector(sample(sample.labels))

## Compare original training groups and permuted labels.
table(sample.labels, sample.labels.perm)
             sample.labels.perm
sample.labels Bc Bch BE BEp BEs Bh BM Bo Bt Bth  T
          Bc   0   0  0   0   0  0  0  2  0   1  1
          Bch  0   0  0   0   0  0  0  0  1   0  0
          BE   0   0  0   0   0  0  0  0  0   0  1
          BEp  0   0  0   1   0  3  0  3  1   0  0
          BEs  0   0  0   0   0  1  0  1  1   0  1
          Bh   1   0  0   1   2 18  0  7 10   0  5
          BM   0   0  0   0   0  2  0  1  0   0  1
          Bo   1   0  0   3   1  7  2 11 13   0  6
          Bt   1   1  0   2   0  6  2 10 10   0 11
          Bth  0   0  0   1   0  0  0  0  0   0  0
          T    1   0  1   0   1  7  0  9  7   0 10
## Run LDA in cross-validation (LOO) mode with the permuted labels
lda.loo.labels.perm <- lda(t(expr.matrix[selected.genes,]),sample.labels.perm,CV=TRUE) 

## Build a contingency table of known versus predicted class
loo.predicted.class.labels.perm <- as.vector(lda.loo.labels.perm$class)
lda.loo.labels.perm.xtab <- table(sample.labels.perm, loo.predicted.class.labels.perm)
print(lda.loo.labels.perm.xtab)

The particular values sould vary at each trial, since the sample labels are permuted at random.

## Display the contingency table as a heat map
image(lda.loo.labels.perm.xtab)

## Levelplot uses a nice color palette
levelplot(lda.loo.labels.perm.xtab)


## Compute the number of hits 
## (we need to omit NA values because LDA fails to assign a group to some objects).
hits.label.perm <- sample.labels.perm == loo.predicted.class.labels.perm
(nb.hits.label.perm <- sum(na.omit(hits.label.perm))) ## This gives 25 this time, but should give different results at each trial
(nb.pred.label.perm <- length(na.omit(hits.label.perm))) ## This should give 187
(hit.rate.label.perm <- nb.hits.label.perm / nb.pred.label.perm ) ## This gives 0.13 this time, should give different results at each trial

Exercises

Exercise: Impact of the number of training variables

  1. Test the impact of the number of variables used for training the classifier. Draw a plot representing the hit rate (Y axis) as a function of the number of top genes used for the LDA classification, using different selection criteria:
    1. Gene-wise variance
    2. P-value of a Welch test (T-cells against all other types)

Random expectation

  1. Generate a randomized expression matrix by sampling the values of the Den Boer dataset with the R function sample() (as we did in the practical on multiple testing corrections).
  2. Run the same procedure with this randomized matrix as we did above with the original expression matrix, and calculate hit rate with increasing number of variables selected by
    • variance
    • Welch test (T-cells against all other types)
  3. Compare the hit rates obtained with the permuted matrix and with the original expression matrix.

Denis Puthier (http://tagc.univ-mrs.fr/tagc/index.php/research/developmental-networks/d-puthier) & Jacques van Helden (TAGC, Université de la Méditerranée).