AMU :: M2 BBSG :: ASG1 :: 2012/2013 :: Denis Puthier & Jacques van Helden

# Statistics for Bioinformatics - Practicals - Supervised classification

## Abbreviations

 ALL acute lymphoblastic leukemia AML acute myeloblastic leukemia DEG differentiall expressed genes GEO Gene Expression Omnibus database KNN K-nearest neighbours LDA linear discriminant analysis QDA quadratic discriminant analysis PCA principal component anlaysis
[back to contents]

## R configuration

• Open a terminal
• Start R

We first need to define the URL of the course (dir.base), from which we will download some pieces of R code and the data sets to analyze.

## Specify the URL of the base for the course
dir.base <- 'http://www.bigre.ulb.ac.be/courses/statistics_bioinformatics'

The following command loads a general configuration file, specifying the input directories (data, R utilities) and creating output directories on your computer (dir.results, dir.figures)..
## Load the general configuration file
source(file.path(dir.base, 'R-files', 'config.R'))

[back to contents]

## Goal of this tutorial

In this tutorial, we will get familiar with the basic concepts of supervised classification.
1. Distinction between unsupervised (clustering) and supervised classification
2. Training, testing and prediction
3. Cross-validation, leave-one-out
4. Over-dimensionality
5. Overfitting
[back to contents]

## Study case

For the practical, we will use a cohort comprized of 190 samples from patients suffering from Acute Lymphoblastic Leukemia (ALL) from DenBoer (2009). The raw data has previously been retrieved from the Gene Expression Omnibus (GEO) database (http://www.ncbi.nlm.nih.gov/geo/).

We can now load the profile table and check its dimensions.

## Define the location of data directory and file containing expression profiles
dir.denboer <- file.path(dir.base, 'data', 'gene_expression','denboer_2009')
file.expr.table <- file.path(dir.denboer, 'GSE13425_Norm_Whole.txt')

## Load the expression profiles (one row per gene, one column per sample).
##
## BEWARE, the whole file makes 20Mb.
## It can take a few minutes to be downloaded from the Web site and loaded in R.
expr.matrix <- read.table(file.expr.table, sep = "\t", head = T, row = 1)
print(dim(expr.matrix))
## Should give this: 22283   190


Once the whole data set has been loaded, the data frame "expr.matrix" should contain 22,283 rows (genes) and 190 columns (samples).

We can now load the sample descriptions ("phenoData").

## Load the table describing each sample
## (one row per sample, one column per description field).
dim(pheno)
## [1] 190   4

## We can check the content of the phenoData table by
## consulting its column names.
names(pheno)

[1] "Sample.title"               "Sample.source.name.ch1"
[3] "Sample.characteristics.ch1" "Sample.description"


The column Sample.title indicates the cancer subtype corresponding to each sample. We can count the number of samples per subtype, and display them by decreasing group size.

## Print the number of samples per cancer type
print(data.frame("n"=sort(table(pheno$Sample.title),decreasing=T)))   n hyperdiploid 44 pre-B ALL 44 TEL-AML1 43 T-ALL 36 E2A-rearranged (EP) 8 BCR-ABL 4 E2A-rearranged (E-sub) 4 MLL 4 BCR-ABL + hyperdiploidy 1 E2A-rearranged (E) 1 TEL-AML1 + hyperdiploidy 1  For the sake of visualization, we will define short labels corresponding to each ALL subtype, and assign these short labels to the samples. ## Define an abbreviated name for each canceer subtype ## (will be used later visualization on the plots) group.abbrev <- c( 'BCR-ABL + hyperdiploidy'='Bch', 'BCR-ABL'='Bc', 'E2A-rearranged (E)'='BE', 'E2A-rearranged (E-sub)'='BEs', 'E2A-rearranged (EP)'='BEp', 'MLL'='BM', 'T-ALL'='T', 'TEL-AML1 + hyperdiploidy'='Bth', 'TEL-AML1'='Bt', 'hyperdiploid'='Bh', 'pre-B ALL'='Bo' ) sample.subtypes <- as.vector(pheno$Sample.title)
sample.labels <- group.abbrev[sample.subtypes]
names(sample.labels) <- names(expr.matrix)

## Check the label for a random selection of 10 samples.
## Each run should give a different result
sample(sample.labels, size=10)

GSM338831 GSM338722 GSM338816 GSM338838 GSM338752 GSM338819 GSM338855 GSM338851
"Bo"      "Bt"      "Bo"      "Bo"      "Bh"      "Bo"      "Bo"      "Bo"
GSM338672 GSM338693
"T"       "T"


We can also define group-specific colors and assign them to samples.

## Define group-specific colors
group.colors <- c(
'BCR-ABL + hyperdiploidy'='cyan',
'BCR-ABL'='black',
'E2A-rearranged (E)'='darkgray',
'E2A-rearranged (E-sub)'='green',
'E2A-rearranged (EP)'='orange',
'MLL'='#444400',
'T-ALL'='violet',
'TEL-AML1 + hyperdiploidy'='#000066',
'TEL-AML1'='darkgreen',
'hyperdiploid'='red',
'pre-B ALL'='blue'
)

## Assign group-specific colors to patients
sample.colors <- group.colors[as.vector(pheno$Sample.title)] names(sample.colors) <- names(expr.matrix) sample(sample.colors,size=20)   GSM338802 GSM338760 GSM338821 GSM338846 GSM338750 GSM338735 "green" "red" "blue" "blue" "red" "darkgreen" GSM338724 GSM338768 GSM338776 GSM338730 GSM338753 GSM338706 "darkgreen" "red" "red" "darkgreen" "red" "darkgreen" GSM338720 GSM338701 GSM338708 GSM338828 GSM338771 GSM338805 "darkgreen" "violet" "darkgreen" "blue" "red" "black" GSM338705 GSM338716 "darkgreen" "darkgreen"  [back to contents] ## Reducing the dimensionality of the data - feature selection An omnipresent problem with microarray data is the large dimensionality of the data space. The dataset we are analyzing contains 190 samples (the subjects), each characterized by 22,283 gene expression values (the variables). The number of variables (also called features, the genes in our case) thus exceeds by far the number of objects (samples in this case). This sitation is qualified of over-dimensionality, and poses serious problems for classification. In unsupervised classification (clustering), the relationships between objects will be affected, since a vast majority of the features are likely to be uninformative, but will however contribute to the computed (dis)similarity metrics (whichever metrics is used). ### A pair of genes To get some feeling about the data, we will compare the expression levels of two arbitrarily genes selected (the 236th and the 1213th rows of the profile table). ## Plot the expression profiles of two arbitrarily selected genes g1 <- 236 g2 <- 1213 x <- as.vector(as.matrix(expr.matrix[g1,])) y <- as.vector(as.matrix(expr.matrix[g2,])) plot(x,y, col=sample.colors, type='n', panel.first=grid(col='black'), main='Den Boer (2009), two selected genes', xlab=paste('gene', g1), ylab=paste('gene', g2)) text(x, y,labels=sample.labels,col=sample.colors,pch=0.5) legend('topleft',col=group.colors, legend=names(group.colors),pch=1,cex=0.7,bg='white',bty='o')  Each dot corresponds to one sample. Since the genes were initially selected at random, we don't expect them to be particularly good at discriminating the different subtypes of ALL. However, we can already see some emerging patterns: the T-ALL (labelled "T") show a trends to strongly express the 236th gene, and, to a lesser extent, several hyperdiploid B-cells (label "Bh") show a high expression of the 1213th gene. Nevertheless, it would be impossible to draw a boundary that would perfectly separate two subtypes in this plot. ### Variance filter In order to reduce the dimensionality of the data set, we will sort the genes according to their variance, and retain a defined number of the top-ranking genes. Note that this ranking is a very rudimentary way to select a subset of genes. Indeed, nothing guarantees us that the genes showing the largest inter-individual fluctuations of expression have anything related to cancer type. In further sections, we will present alternative methods of variable ordering, which will explicitly take into account the inter-group variance. The R function apply() can be used to apply any function to each row (or alternatively each column) of a data table. We set the second argument (margin) to 1, thereby indicating that the function (third argument) must be applied to each row of the input table (first argument). ## Compute sample-wise variance var.per.sample <- apply(expr.matrix, 2, var) head(var.per.sample) ## Inspect the distribution of sample-wise variance hist(var.per.sample, breaks=20)  GSM338666 GSM338667 GSM338668 GSM338669 GSM338670 GSM338671 2.636681 2.636490 2.749702 2.436275 2.634296 2.538188  ## Compute gene-wise variance var.per.gene <- apply(expr.matrix, 1, var) ## Inspect the distribution of gene-wise variance hist(var.per.gene, breaks=100)  We notice that gene-wise variances have a wide dispersion. the histogram shows a right-skewed distribution, with a long tail due to the presence of a few genes with high variance. If we sort genes by decreasing variance, we can see that there is a strong difference between the top and the bottom of the list. ## Sort genes per decreasing variance genes.by.decr.var <- sort(var.per.gene,decreasing=TRUE) ## Print the 5 genes with highest variance head(genes.by.decr.var)   CD9|201005_at IL23A|211796_s_at CD3D|213539_at S100A8|202917_s_at KLF4|221841_s_at 6.042537 5.929345 5.696317 5.506174 5.245490 HLA-DRA|208894_at 5.160823  ## Print the 5 genes with lowest variance tail(genes.by.decr.var)   C7orf25|53202_at FMO3|40665_at SCLY|59705_at NA|AFFX-DapX-5_at HSD17B6|37512_at NA|AFFX-LysX-5_at 0.007035592 0.007000936 0.006630802 0.006543372 0.006439234 0.005077204  We can then select an arbitrary number of top-ranking genes in the list. ## Select the 30 top-ranking genes in the list sorted by variance. ## This list of genes will be used below as training variables for ## supervised classification. top.nb <- 20 ## This number can be changed for testing genes.selected.by.var <- names(genes.by.decr.var[1:top.nb]) ## Check the names of the first selected genes head(genes.selected.by.var, n=20)   [1] "CD9|201005_at" "IL23A|211796_s_at" "CD3D|213539_at" "S100A8|202917_s_at" [5] "KLF4|221841_s_at" "HLA-DRA|208894_at" "NA|216379_x_at" "IL23A|210915_x_at" [9] "HLA-DRA|210982_s_at" "ITM2A|202746_at" "RPS4Y1|201909_at" "NA|209771_x_at" [13] "NPY|206001_at" "SH3BP5|201811_x_at" "NRN1|218625_at" "LCK|204891_s_at" [17] "DDX3Y|205000_at" "IL23A|213193_x_at" "DEFA1|205033_s_at" "TCF4|212386_at"  In the next sections, we will compare methods for selecting genes on the basis of different criteria (variance, T-test, ANOVA test, step-forward procedure in Linear Discriminant Analysis). For this purpose, we will create a table indicating the values and the rank of each gene (rows of the table) according to eachs election criterion (columns). ## Create a data frame to store gene values and ranks ## for different selection criteria. gene.ranks <- data.frame(var=var.per.gene) head(gene.ranks)   var DDR1|1007_s_at 0.188580722 RFC2|1053_at 0.060546361 HSPA6|117_at 0.456587060 PAX8|121_at 0.027817178 GUCA1A|1255_g_at 0.007829279 UBA7|1294_at 0.190982366  ## Beware, we rank according to minus variance, because ## we want to associate the lowest ranks to the highest variances gene.ranks$var.rank <- rank(-gene.ranks$var, ties.method='random') head(gene.ranks)   var var.rank DDR1|1007_s_at 0.188580722 5369 RFC2|1053_at 0.060546361 10659 HSPA6|117_at 0.456587060 1971 PAX8|121_at 0.027817178 17936 GUCA1A|1255_g_at 0.007829279 22265 UBA7|1294_at 0.190982366 5306  ## Check the rank of the 5 genes with highest and lowest variance, resp. gene.ranks[names(genes.by.decr.var[1:5]),]   var var.rank CD9|201005_at 6.043017 1 IL23A|211796_s_at 5.928425 2 CD3D|213539_at 5.696326 3 S100A8|202917_s_at 5.508163 4 KLF4|221841_s_at 5.245057 5  #1# Print the bottom 5 genes of the variance-sorted table gene.ranks[names(tail(genes.by.decr.var)),]  C7orf25|53202_at 0.007035592 22278 FMO3|40665_at 0.007000936 22279 SCLY|59705_at 0.006630802 22280 NA|AFFX-DapX-5_at 0.006543372 22281 HSD17B6|37512_at 0.006439234 22282 NA|AFFX-LysX-5_at 0.005077204 22283  We have no specific reason to think that genes having a high variance will be specially good at discriminating ALL subtypes. Indeed, a high variance might a priori either reflect cell-type specificities, or differences between samples that result from any other effect (individual patient genomes, transcriptome, condition, ...). For the sake of curiosity, let us plot samples on a XY plot where the abcsissa represents the top-raking, and the ordinate the second top-ranking gene in the variance-ordered list. ## Plot the expression profiles of the two genes with highest variance g1 <- names(genes.by.decr.var[1]) print(g1)  [1] "CD9|201005_at"  (g2 <- names(genes.by.decr.var[2]))  [1] "IL23A|211796_s_at"  x <- as.vector(as.matrix(expr.matrix[g1,])) y <- as.vector(as.matrix(expr.matrix[g2,])) plot(x,y, col=sample.colors, type='n', panel.first=grid(col='black'), main="2 genes with the highest variance", xlab=paste('gene', g1), ylab=paste('gene', g2)) text(x, y,labels=sample.labels,col=sample.colors,pch=0.5) legend('topright',col=group.colors, legend=names(group.colors),pch=1,cex=0.7,bg='white',bty='o')  Somewhat surprizingly plot shows that, despite the roughness of our ranking criterion (variance across all cancer types), the two top-ranking genes have each some discriminant power: • The expression level of CD9 (abcsissa of the plot) gives a pretty good separation between some cancer types with low expression levels (T and Bt), and some other types characterized by a high expression level (Bh, BEs, BEp). Cancer type Bo is however dispersed over the whole range of CD9 expression. • IL23A clearly separates the subtype "T" (high levels) from all other cancer subtypes (low levels). ### Selecting differentially expressed genes (DEG) as predictor variables #### Variable ordering by Welch t-test (2-groups test) The T-test tests the hypothesis of equality between the means of two populations. We can use the Welch version of the t-test (which assumes that groups can have different variances) in order to select genes differentially expressed between two goups of samples. H0: m1 = m2 We thus need to define two groups of samples (multi-group comparisons will be treated by ANOVA). For the dataset on ALL? we have several subtypes of cancer. we will perform pairwise comparisons of mean for a selection of subtypes. In each case, a given subtype (e.g. T-cells) will be compared to all other subtypes pooled together. We will successvely run Welch's t-test for the main subtypes ("Bo", "Bh", "Bt", "T"). H0(Bo): mBo = mothers H0(Bh):mBh = mothers H0(Bt):mBt = mothers H0(T):mT = mothers In each case, apply the test in to each gene, using the function t.test.mult() defined in the R utilities of this course. ## Load a utility to apply Welch test on each row of a table source(file.path(dir.util, "util_student_test_multi.R")) ## Define a vector indicating whether each sample ## belongs to the subtype of interest (e.g. "Bo") or not. current.group <- "Bo" one.group.vs.others<- sample.labels one.group.vs.others[sample.labels != current.group] <- "other" print(table(one.group.vs.others))  one.group.vs.others Bo other 44 146  ## Test the mean equality between Bo subtype and all other subtypes welch.one.group.vs.others <- t.test.multi(expr.matrix, one.group.vs.others)  [1] "Thu Aug 23 09:19:08 2012 - Multiple t-test started" [1] "Thu Aug 23 09:19:11 2012 - Multiple t-test done"  names(welch.one.group.vs.others)   [1] "mean.other" "mean.Bo" "means.diff" "var.est.other" [5] "var.est.Bo" "sd.est.other" "sd.est.Bo" "st.err.diff" [9] "t.obs" "df.welch" "P.value" "E.value" [13] "sig"  ## Update the gene rank table test.name <- paste(current.group, '.vs.others.sig', sep='') gene.ranks[,test.name] <- welch.one.group.vs.others$sig ## Add a column with significances
gene.ranks[,paste(test.name, ".rank", sep="")] <- rank(-welch.one.group.vs.others$sig, , ties.method='random') ## Add a column with significance ranks ## Do the same Welch test for the 3 other majority groups for (current.group in c("Bh", "Bt", "T")) { one.group.vs.others<- sample.labels one.group.vs.others[sample.labels != current.group] <- "other" ## Test the mean equality between Bo subtype and all other subtypes welch.one.group.vs.others <- t.test.multi(expr.matrix, one.group.vs.others) ## Update the gene rank table test.name <- paste(current.group, '.vs.others.sig', sep='') gene.ranks[,test.name] <- welch.one.group.vs.others$sig
gene.ranks[,paste(test.name, ".rank", sep="")] <- rank(-welch.one.group.vs.others$sig, , ties.method='random') } ## Check the resulting gene table head(gene.ranks) head(gene.ranks[order(gene.ranks$T.vs.others.sig.rank),])
tail(gene.ranks[order(gene.ranks$T.vs.others.sig.rank),]) ## Store the gene rank table in a text file (tab-separated columns) write.table(gene.ranks, file=file.path(dir.results, 'DenBoer_gene_ranks.tab'), sep='\t', quote=F, col.names=NA)  ## Plot variance against significance of the Welch test for the 4 different groups plot(gene.ranks[,c("var", "Bo.vs.others.sig", "Bh.vs.others.sig", "Bt.vs.others.sig", "T.vs.others.sig")], col="grey") ## Plot ranks of variance against significance of the Welch test for the 4 different groups plot(gene.ranks[,c("var.rank", "Bo.vs.others.sig.rank", "Bh.vs.others.sig.rank", "Bt.vs.others.sig.rank", "T.vs.others.sig.rank")], col="grey")  An obvious observation: the gene-wise variance and the 4 results of Welch tests (one cancer type against all other types) return very different values, and rank the gene in completely different ways. This is not surprizing, the variance and the 4 significances indicate different properties of the genes. Gene selection will thus have to be adapted to the specific purpose of the classification. [back to contents] ## Principal Component Analysis Before starting the proper process of supervised classification, we can apply a method called Principal Component Analysis (PCA) to evaluate the repartition of the information between the mutiple variables, and to inspect the "intrinsic" structure of the data, i.e. the structure inherent to the numbers in the data table, irrespective of the labels (cancer subtypes) attached to the various samples. ### Purpose of PCA We can do the exercise of extending this 2-dimensional plot to a 3-dimensional plot, where the third dimension represents the expression level of a third gene. With an effort of imagination, we can mentally extend this 3D plot to a 4-dimensional plot, where each dimension would represent a different gene. It is likely that the groups of genes will progressively become more separated as the number of dimension increases. However, for the sake of graphical representation, it is difficult to work with more than 2 (or at most 3) dimensions. The purpose of Principal Component Analysis (PCA) is to capture the largest part of the variance of a data set in a minimal number of dimensions. ### Applying PCA transformation with stats::prcomp() The R method stats::prcomp() performs a PCA transformation of an input table. We can feed it with the expression table, but we need to transpose it first (using the R function t()), in order to provide our objects of interest (the samples) as rows, and the variables (genes) as columns. ## load the stats library to use the princomp() and prcomp() function library(stats) ## Perform the PCA transformation expr.prcomp <- prcomp(t(expr.matrix),cor=TRUE) ## Analyze the content of the prcomp result: ## the result of the method prcomp() is an object ## belonging to the class "prcomp" class(expr.prcomp)  ## Get the field names of the promp objects names(expr.prcomp) ## Get the attributes of the promp objects attributes(expr.prcomp)  ### Repartition of the standard deviation along the components A first information is the repartition of the variance (or its squared root, the standard deviation) between the components, which can be displayed on a plot. plot(expr.prcomp, main='Den Boer (2009), Variance per component', xlab='Component')  The standard deviation barplot (Figure~\ref{DenBoer_pca_variance}) highlights that the first component captures more or less twice as much standard deviation of the whole dataset as the second one. We can measure the relative importance of the standard deviations. ## Get the standard deviation and variance per principal component sd.per.pc <- expr.prcomp$sdev
var.per.pc <- sd.per.pc^2

## Display the percentage of total variance explained by each
sd.per.pc.percent <- sd.per.pc/sum(sd.per.pc)
var.per.pc.percent <- var.per.pc/sum(var.per.pc)
barplot(var.per.pc.percent[1:10], main='Den Boer (2009), Percent of variance  per component', xlab='Component', ylab='Percent variance', col='#BBDDFF')


### Analysis of the first versus second component

We can generate a biplot, where each sample appears as a dot, and the X and Y axes respectively represent the first and second components of the PCA-transformed data. The R function stats::biplot() automatically generates the biplot, but the result is somewhat confusing, because the biplot displays both the units (samples) and a projection of the variables (genes) on the two first components.

biplot(expr.prcomp,var.axes=FALSE,
panel.first=grid(col='black'),
main=paste('PCA; Den Boer (2009); ',
ncol(expr.matrix), 'samples *', nrow(expr.matrix), 'genes', sep=' '),
xlab='First component', ylab='Second component')


Let us now generate a custom plot, with labels and colors indicating the subtype of ALL.

## Plot components PC1 and PC2
plot(expr.prcomp$x[,1:2], col=sample.colors, type='n', panel.first=grid(col='black'), main=paste('PCA; Den Boer (2009); ', ncol(expr.matrix), 'samples *', nrow(expr.matrix), 'genes', sep=' '), xlab='PC1', ylab='PC2') text(expr.prcomp$x[,1:2],labels=sample.labels,col=sample.colors,pch=0.5)
legend('bottomleft',col=group.colors,
legend=names(group.colors),pch=1,cex=0.7,bg='white',bty='o')


The coloring and cancer-type labels highlight a very interesting property of the PCA result: the two first components of the PCA-transformed data perfectly separate the T-ALL samples from all the other subtypes. All T-ALL cells appear in one elongated cloud on the left side of the plot. The other cloud seems to contain some organization as well: subtypes are partly intermingled but there are obvious groupings.

### Analysis of the second versus third component

The following figure shows that the second and third components capture information related to the cancer subtypes: the third component separates quite well TEL-AML1 (top) from hyperdiploid (bottom) samples, whereas the pre-B have intermediate values. The separation is however less obvious than the T-ALL versus all other subtypes that we observed in the two first components.

## Plot components PC2 and PC3
plot(expr.prcomp$x[,2:3], col=sample.colors, type='n', panel.first=grid(col='black'), main=paste('PCA; Den Boer (2009); ', ncol(expr.matrix), 'samples *', nrow(expr.matrix), 'genes', sep=' '), xlab='PC2', ylab='PC3') text(expr.prcomp$x[,2:3],labels=sample.labels,col=sample.colors,pch=0.5)
legend('bottomleft',col=group.colors,
legend=names(group.colors),pch=1,cex=0.7,bg='white',bty='o')


### Exercise

Generate plots of a few additional components (PC4, PC5,...) and try to evaluate if they further separate subtypes of cancers.

### Discussion of the PCA results

In the "historical" dataset from Golub (1999), three sample groups (AML, T-ALL and B-ALL) appeared almost perfectly separated by a simple display of the two first components of the PCA-transformed data. The dataset from DenBoer (2009) is less obvious to interpret, due to the nature of the data itself: firstly, all the samples come from the same cell type (lymphoblasts), whereas the main separation of Golub was between myeloblasts and lymphoblasts. Secondly, Den Boer and co-workers defined a wider variety of subtypes. However, we see that a simple PCA transformation already reveals that the raw expression data is well structured. It is quite obvious that we will have no difficulty to find a signature discriminating T-ALL from other ALL subtypes, since T-ALL are already separated on the plane of the two first components. We would however like to further refine the diagnostics, by training a classifier to discriminate between the other subtypes as well.

Note that until now we did not apply any training: PCA transformation is a "blind" (unsupervised) approach, performing a rotation in the variable space without using any information about pre-defined classes. Since we dispose of such information for the 190 samples of the training set, we can use a family of other approaches, generically called supervised classification, that will rely on class membership of the training samples in order to train a classifier, which will further be used to assign new samples to the same classes.

[back to contents]

## Linear Discriminant Analysis (LDA)

### Training the classifier

As a first trial, we will train a LDA classifier using as variables the 20 top-raking genes in the list of genes sorted by variance.

## Load the library containing the linear discriminant analysis function
library(MASS)

## Create a list of gene names sorted by decreasing variance

### Testing the random expectation with permutation tests}

#### Exercises

1. Use the R function sample() to randomly permute the training labels (store the result in a vector called "sample.labels.perm"), and test the hit rate of a classifier trained with these randomized labels.
2. Use the R function sample() to randomly permute the values of the input table (store the result in a data frame called "expr.perm"), and test the hit rate of a classifier trained with these randomized data.

#### Solutions

## Permute the training labels
sample.labels.perm <- as.vector(sample(sample.labels))

## Compare original training groups and permuted labels.
table(sample.labels, sample.labels.perm)

             sample.labels.perm
sample.labels Bc Bch BE BEp BEs Bh BM Bo Bt Bth  T
Bc   0   0  0   0   0  0  0  2  0   1  1
Bch  0   0  0   0   0  0  0  0  1   0  0
BE   0   0  0   0   0  0  0  0  0   0  1
BEp  0   0  0   1   0  3  0  3  1   0  0
BEs  0   0  0   0   0  1  0  1  1   0  1
Bh   1   0  0   1   2 18  0  7 10   0  5
BM   0   0  0   0   0  2  0  1  0   0  1
Bo   1   0  0   3   1  7  2 11 13   0  6
Bt   1   1  0   2   0  6  2 10 10   0 11
Bth  0   0  0   1   0  0  0  0  0   0  0
T    1   0  1   0   1  7  0  9  7   0 10

## Run LDA in cross-validation (LOO) mode with the permuted labels
lda.loo.labels.perm <- lda(t(expr.matrix[selected.genes,]),sample.labels.perm,CV=TRUE)

## Build a contingency table of known versus predicted class
loo.predicted.class.labels.perm <- as.vector(lda.loo.labels.perm\$class)
lda.loo.labels.perm.xtab <- table(sample.labels.perm, loo.predicted.class.labels.perm)
print(lda.loo.labels.perm.xtab)


The particular values sould vary at each trial, since the sample labels are permuted at random.

## Display the contingency table as a heat map
image(lda.loo.labels.perm.xtab)

## Levelplot uses a nice color palette
levelplot(lda.loo.labels.perm.xtab)

## Compute the number of hits
## (we need to omit NA values because LDA fails to assign a group to some objects).
hits.label.perm <- sample.labels.perm == loo.predicted.class.labels.perm
(nb.hits.label.perm <- sum(na.omit(hits.label.perm))) ## This gives 25 this time, but should give different results at each trial
(nb.pred.label.perm <- length(na.omit(hits.label.perm))) ## This should give 187
(hit.rate.label.perm <- nb.hits.label.perm / nb.pred.label.perm ) ## This gives 0.13 this time, should give different results at each trial


## Exercises

### Exercise: Impact of the number of training variables

1. Test the impact of the number of variables used for training the classifier. Draw a plot representing the hit rate (Y axis) as a function of the number of top genes used for the LDA classification, using different selection criteria:
1. Gene-wise variance
2. P-value of a Welch test (T-cells against all other types)

### Random expectation

1. Generate a randomized expression matrix by sampling the values of the Den Boer dataset with the R function sample() (as we did in the practical on multiple testing corrections).
2. Run the same procedure with this randomized matrix as we did above with the original expression matrix, and calculate hit rate with increasing number of variables selected by
• variance
• Welch test (T-cells against all other types)
3. Compare the hit rates obtained with the permuted matrix and with the original expression matrix.

Denis Puthier (http://tagc.univ-mrs.fr/tagc/index.php/research/developmental-networks/d-puthier) & Jacques van Helden (TAGC, Université de la Méditerranée).