Practicals - Clustering with R

For this, we will use the expression profiles of genes differentially expressed between two cancer types (AML and ALL). These genes were previously selected on the basis of a Student test, as described in the tutorial on the Student test.

```r
dir.golub.data <- file.path(dir.course, 'data', 'gene_expression', 'golub_1999')
golub.expr.E.1 <- read.table(file.path(dir.golub.data, 'golub_standardized_filtered_student_E1.tab'))

## Check the dimension of the table containing
## the selected expression profiles
dim(golub.expr.E.1)
```

This should give the following result:

```r
> dim(golub.expr.E.1)
[1] 243  38
```

We have thus selected 243 genes, showing a significantly different level of expression between the 27 patients suffering from ALL (acute lymphoblastic leukemia) and the 11 patients suffering from AML (acute myeloblastic leukemia).

In principle, these libraries should be installed on your machine before the beginning of the practicals. Assuming this is the case, you can load the libraries with the following instructions.

```r
library(combinat)
library(maptree)
library(e1071)
```

If the libraries are loaded properly, you do not need to install them, and you can skip the next section.

To install the required libraries, log in as system administrator,
open **R** and type the following command.

```r
install.packages("combinat")
install.packages("maptree")
install.packages("e1071")
```

As usual, you need to load the configuration file before starting this tutorial. See the configuration page if you forgot how to do this.

Hierarchical clustering is performed in two steps:

- Calculate a matrix representing the distances between the objects (genes)
- Build a tree from the distance matrix

```r
## Read the help for the function dist()
help(dist)

## Calculate Euclidean distances
## (note: the method name recognized by dist() is spelled "euclidean")
d.euclidian <- dist(golub.expr.E.1, method="euclidean")

## Have a look at the attributes of the distance matrix
attributes(d.euclidian)
```

For the correlation coefficient, the maximal value is 1, reached when two profiles are perfectly correlated. We can thus define a "correlation distance" accordingly, as one minus the correlation.

```r
## cor() computes correlations between columns; we transpose the table
## so that the correlations are calculated between genes (rows)
d.correlation <- as.dist(1 - cor(t(golub.expr.E.1)))
```

The method `cor()` returns an object of class
`matrix`. For the hierarchical clustering, we need an object of
class `dist`. The method `as.dist()` is used to cast
(reformat) a matrix into a `dist` object.
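As a minimal illustration of this casting, using a small made-up matrix rather than the Golub data:

```r
## Made-up example: 10 observations of 3 variables
x <- matrix(rnorm(30), ncol=3)

## cor() returns a square, symmetric matrix
cor.mat <- cor(x)
class(cor.mat)           ## "matrix" (plus "array" in recent R versions)

## as.dist() keeps only the lower triangle, as a "dist" object
d <- as.dist(1 - cor.mat)
class(d)                 ## "dist"
```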

```r
## Read the help on the hierarchical clustering method
help(hclust)

## Build a tree with the Euclidean distance
tree.euclidian <- hclust(d.euclidian)

## Plot the tree
plot(tree.euclidian)
```

The plot is barely readable, due to the large number of genes. We can slightly decrease the overlap between labels by decreasing the font size.

```r
x11(width=16, height=8)
plot(tree.euclidian, cex=0.7)
```

The package `stats` contains another useful function, `cutree()`, which cuts the tree at a chosen level and assigns each object to a cluster.

```r
## Cut the tree to the level of 7 branches
## Each object (gene) is assigned to one of the 7 clusters
clusters.euclidian <- cutree(tree.euclidian, k=7)

## Count the number of genes per cluster
table(clusters.euclidian)
```

Another library, `maptree`, contains the method `clip.clust()`, which prunes the tree at a given number of clusters and plots the result.

```r
library(maptree)
pr <- clip.clust(tree.euclidian, k=7)
plot(pr, main="pruned tree, k=7")
```

- Build a tree with the "correlation distance", and compare the two trees.
- Which agglomeration rule was used for building the trees? Which other rules are supported by `hclust()`? Use `help(hclust)` to answer these questions.
- Select one distance metric (for example the correlation distance), and compare visually the trees obtained with different agglomeration rules.

```r
## kmeans operates on the data matrix itself, not on a distance matrix
clusters.kmeans <- kmeans(golub.expr.E.1, 7)
```

Beware: each time you run the kmeans clustering, you can obtain a different result, since the initial centers are placed randomly in the variable space.
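If you need reproducible results, one option is to fix the random seed before running the clustering; another is to let `kmeans()` try several random starts and keep the best solution, via its `nstart` argument.

```r
## Fix the seed to make a single kmeans run reproducible
set.seed(42)
clusters.kmeans <- kmeans(golub.expr.E.1, 7)

## Alternatively, run 20 random starts and keep the solution
## with the lowest total within-cluster sum of squares
clusters.kmeans <- kmeans(golub.expr.E.1, 7, nstart=20)
```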

The simplest way to compare two clustering results is with a contingency table. For instance, we could count the number of common genes between each cluster obtained by kmeans with each cluster obtained by hierarchical clustering.

```r
cont.table <- table(clusters.kmeans$cluster, as.vector(clusters.euclidian))
print(cont.table)
```

The package `e1071` contains the method `matchClasses()`, which finds the optimal match between two classifications.

```r
library(e1071)

## Find optimal match between the two classifications
class.match <- matchClasses(as.matrix(cont.table), method="exact")

## Print the confusion table, with columns permuted to maximize the diagonal
print(cont.table[, class.match])
```

- Perform a first kmeans clustering with 4 clusters.
- Perform a second kmeans clustering and compare the results between the two trials (build a confusion table, maximize its diagonal and calculate the hit rate).
- Repeat this operation 100 times and calculate the average hit rate.
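A hedged sketch of these exercises follows. The hit rate is taken here as the fraction of genes on the diagonal of the permuted confusion table; `matchClasses()` itself also prints this percentage.

```r
library(e1071)

## One trial: two independent kmeans runs with 4 clusters
k1 <- kmeans(golub.expr.E.1, 4)$cluster
k2 <- kmeans(golub.expr.E.1, 4)$cluster
conf <- table(k1, k2)
perm <- matchClasses(as.matrix(conf), method="exact")
hit.rate <- sum(diag(conf[, perm])) / sum(conf)
print(hit.rate)

## Repeat 100 times and average the hit rates
hit.rates <- replicate(100, {
  c1 <- kmeans(golub.expr.E.1, 4)$cluster
  c2 <- kmeans(golub.expr.E.1, 4)$cluster
  conf <- table(c1, c2)
  perm <- matchClasses(as.matrix(conf), method="exact", verbose=FALSE)
  sum(diag(conf[, perm])) / sum(conf)
})
mean(hit.rates)
```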

- Perform a hierarchical clustering on the samples (columns) rather than on the genes. For this, you can use the transpose function `t()` from **R**.
- Compare the result of this clustering with the known classes of leukemia (ALL or AML).
- Test different agglomeration rules (single, average, complete, ward) and distance metrics (Euclidean, Manhattan, correlation, ...), and compare each result with the known cancer types. Which clustering method and distance metric give the best results?
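One possible combination for the sample-wise clustering is sketched below (Euclidean distance, average linkage; other combinations are left for the exercise).

```r
## Transpose the table so that samples (columns) become the objects
d.samples <- dist(t(golub.expr.E.1), method="euclidean")

## Build and plot the sample tree
tree.samples <- hclust(d.samples, method="average")
plot(tree.samples, main="Clustering of the 38 samples")

## The leaf labels (column names) can then be compared
## with the known cancer types (ALL vs AML)
```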

Jacques van Helden (jvhelden@ulb.ac.be)