Statistics for Bioinformatics
Practicals - Clustering with R

In this tutorial, we apply different clustering methods implemented in R to classify genes according to their expression profiles.

For this, we will use the expression profiles of genes differentially expressed between two cancer types (AML and ALL). These genes were previously selected on the basis of a Student test, as described in the tutorial on the Student test.

This should give the following result:

We have thus selected 243 genes, showing a significantly different level of expression between the 27 patients suffering from ALL (acute lymphoblastic leukemia) and the 11 patients suffering from AML (acute myeloblastic leukemia).


For this tutorial, we will use some R libraries, providing methods which are not part of the basic R package.

Loading libraries

In principle, these libraries should be installed on your machine before the beginning of the practicals. Assuming this is the case, you can load the libraries with the following instructions.

If the libraries are loaded properly, you do not need to install them, and you can skip the next section.
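The original loading instructions are not shown here; as a sketch, and assuming the packages used later in this tutorial (maptree and e1071, in addition to the base package stats), the loading step would look like this:

```r
## Load the libraries used in this practical.
## "stats" is part of base R; "maptree" and "e1071" are add-on packages.
library(stats)    # dist(), hclust(), cutree(), kmeans()
library(maptree)  # clip.clust(), draw.clust()
library(e1071)    # matchClasses()
```

If any of these calls fails with an error such as "there is no package called ...", proceed to the installation section below.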

Installing libraries

If the libraries are not installed on your machine, you can easily install them yourself, provided you have an internet connection. You also need system administrator privileges. If this is not the case, contact your system administrator.

To install the required libraries, log in as system administrator, open R and type the following command.
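The original command is not shown here; a minimal sketch, assuming the add-on packages named in this tutorial (maptree and e1071) are the ones to install:

```r
## Run this in an R session started with administrator privileges.
install.packages(c("maptree", "e1071"))
```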

Hierarchical clustering

As usual, you need to load the configuration file before starting this tutorial. See the configuration page if you forgot how to do this.

Hierarchical clustering is performed in two steps:

  1. Calculate a matrix representing the distances between the objects (genes)
  2. Build a tree from the distance matrix

Calculating distances

Euclidean distance

The most familiar distance metric is the Euclidean distance. R includes a function dist(), which automatically calculates the Euclidean distance between each pair of rows of an input matrix.
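As a sketch (the object name expr.matrix and the simulated values below are placeholders for the real matrix of 243 selected genes by 38 patients):

```r
## Stand-in for the real data: 243 selected genes (rows) x 38 patients (columns).
set.seed(1)
expr.matrix <- matrix(rnorm(243 * 38), nrow = 243,
                      dimnames = list(paste("g", 1:243, sep = ""), NULL))

## dist() computes distances between ROWS of the matrix (here, between genes).
dist.euclidian <- dist(expr.matrix, method = "euclidean")
```

The result is an object of class dist, containing one value per pair of genes.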

Coefficient of correlation

The Euclidean distance is probably not the best metric to reflect the biological concept of co-regulation. Pearson's coefficient of correlation reflects this concept better, but it gives an indication of similarity rather than distance. A similarity s can however easily be converted into a distance d by subtracting it from its maximal value: d = max(s) - s.

For the coefficient of correlation, the maximal value is 1. We can thus calculate the "correlation distance" as d = 1 - r.
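A sketch of this conversion, using a simulated stand-in for the expression matrix:

```r
## Stand-in for the real data: 243 genes (rows) x 38 patients (columns).
set.seed(1)
expr.matrix <- matrix(rnorm(243 * 38), nrow = 243)

## cor() correlates the COLUMNS of its input, so we transpose
## the matrix to obtain gene-by-gene correlations.
cor.matrix <- cor(t(expr.matrix))

## Similarity -> distance: subtract from the maximal similarity (1 for Pearson r).
dist.cor.matrix <- 1 - cor.matrix
```

Note that this "distance" ranges from 0 (perfectly correlated profiles) to 2 (perfectly anti-correlated profiles).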

The method cor() returns an object of class matrix. For hierarchical clustering, we need an object of class dist. The method as.dist() is used to cast (reformat) a matrix into a dist object.
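Combining the two steps (again with a simulated stand-in for the data):

```r
set.seed(1)
expr.matrix <- matrix(rnorm(243 * 38), nrow = 243)  # stand-in for the real data

## Convert the correlation matrix into a dist object, as required by hclust().
dist.cor <- as.dist(1 - cor(t(expr.matrix)))
```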

Building and plotting the tree

The plot is barely readable, due to the large number of genes. We can slightly decrease the overlap between labels by decreasing the font size.
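A sketch of the tree construction and plotting (expr.matrix is a simulated stand-in for the real data; the cex value is one possible choice):

```r
set.seed(1)
expr.matrix <- matrix(rnorm(243 * 38), nrow = 243,
                      dimnames = list(paste("g", 1:243, sep = ""), NULL))

## Build the tree from the distance matrix (default: complete linkage).
tree <- hclust(dist(expr.matrix))

plot(tree)             # leaf labels overlap heavily
plot(tree, cex = 0.3)  # smaller font for the leaf labels
```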

Pruning the tree

The package stats contains another useful function, cutree(), which prunes the tree, i.e. cuts the branches at a certain level.
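For example (with a simulated stand-in for the data, and an arbitrary choice of 4 clusters):

```r
set.seed(1)
expr.matrix <- matrix(rnorm(243 * 38), nrow = 243)
tree <- hclust(dist(expr.matrix))

## Cut the tree so as to obtain 4 clusters.
clusters <- cutree(tree, k = 4)
table(clusters)  # number of genes per cluster

## Alternatively, cut all branches at a given height: cutree(tree, h = 10)
```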

Another library, maptree, contains the method clip.clust(), which makes it possible to plot only the main branches of the tree.
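One possible usage, assuming the clip.clust() interface takes the hclust object, the data, and a number of clusters k (check help(clip.clust) for the exact arguments in your maptree version):

```r
library(maptree)

set.seed(1)
expr.matrix <- matrix(rnorm(243 * 38), nrow = 243,
                      dimnames = list(paste("g", 1:243, sep = ""), NULL))
tree <- hclust(dist(expr.matrix))

## Clip the tree to its 4 main branches and draw the result.
pruned <- clip.clust(tree, data = expr.matrix, k = 4)
draw.clust(pruned)
```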

Exercises on hierarchical clustering

  1. Build a tree with the "correlation distance", and compare it with the tree obtained with the Euclidean distance.
  2. Which agglomeration rule was used for building the trees? Which other rules are supported by hclust()? Use help(hclust) to answer these questions.
  3. Select one distance metric (for example the correlation distance), and compare visually the trees obtained with different agglomeration rules.

K-means clustering

K-means clustering is implemented in R, as part of the library mva. It is very easy to use. Beware: each time you run the k-means clustering, you can obtain a different result, since the initial centers are placed randomly in the variable space.
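A sketch (expr.matrix is a simulated stand-in for the data; the number of centers is an arbitrary choice):

```r
set.seed(1)  # fix the random seed to make this particular run reproducible
expr.matrix <- matrix(rnorm(243 * 38), nrow = 243)

## Partition the genes into 4 clusters.
km <- kmeans(expr.matrix, centers = 4)
table(km$cluster)  # number of genes per cluster
```

Without set.seed(), repeated calls to kmeans() generally return different partitions, since the initial centers are drawn at random.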

Comparing the results of two clustering methods

The simplest way to compare two clustering results is with a contingency table. For instance, we could count the number of common genes between each cluster obtained by kmeans with each cluster obtained by hierarchical clustering.
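For instance (with a simulated stand-in for the data):

```r
set.seed(1)
expr.matrix <- matrix(rnorm(243 * 38), nrow = 243)

## Two partitions of the same genes, one per clustering method.
km.clusters <- kmeans(expr.matrix, centers = 4)$cluster
hc.clusters <- cutree(hclust(dist(expr.matrix)), k = 4)

## Contingency table: genes in common between each pair of clusters.
table(km.clusters, hc.clusters)
```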

The package e1071 contains a useful function, matchClasses(), which allows you to find the optimal matches between two clustering results. This is done by permuting columns and rows of the contingency table in order to maximize the values on the diagonal.
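One possible way to use it (continuing from the contingency table above; the hit-rate calculation is a suggested interpretation, not part of matchClasses() itself):

```r
library(e1071)

set.seed(1)
expr.matrix <- matrix(rnorm(243 * 38), nrow = 243)
km.clusters <- kmeans(expr.matrix, centers = 4)$cluster
hc.clusters <- cutree(hclust(dist(expr.matrix)), k = 4)
conf <- table(km.clusters, hc.clusters)

## Find the column permutation that maximizes the diagonal.
best <- matchClasses(conf, method = "exact")

## Fraction of genes assigned to matched clusters by both methods.
sum(diag(conf[, best])) / sum(conf)
```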


Reproducibility of the k-means clustering

Using the same dataset as above,
  1. perform a first kmeans clustering with 4 clusters
  2. perform a second kmeans clustering and compare the results between the two trials (build a confusion table, maximize its diagonal and calculate the hit rate)
  3. Repeat this operation 100 times and calculate the average hit rate.
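The steps above could be sketched as follows (expr.matrix is a simulated stand-in for the real data; the function hit.rate is a hypothetical helper introduced here for illustration):

```r
library(e1071)

set.seed(1)
expr.matrix <- matrix(rnorm(243 * 38), nrow = 243)

## One trial: two k-means runs, matched via the contingency table.
hit.rate <- function(x, k = 4) {
  c1 <- kmeans(x, centers = k)$cluster
  c2 <- kmeans(x, centers = k)$cluster
  conf <- table(c1, c2)
  best <- matchClasses(conf, method = "exact", verbose = FALSE)
  sum(diag(conf[, best])) / sum(conf)
}

## Average hit rate over 100 trials.
mean(replicate(100, hit.rate(expr.matrix)))
```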

Evaluation of distance metrics and agglomeration rules for hierarchical clustering

Using the same data set as above,
  1. Perform a hierarchical clustering on the samples (columns) rather than on the genes. For this, you can use the transpose function t() from R.
  2. Compare the result of this clustering with the known classes of leukemia (ALL or AML).
  3. Test different agglomeration rules (single, average, complete, ward) and distance metrics (Euclidean, Manhattan, correlation, ...), and compare each result with the known cancer types. Which clustering method and distance metric give the best results?
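A possible starting point for the first two steps (the simulated matrix and the label vector cancer.type are stand-ins for the real data, which contain 27 ALL and 11 AML patients):

```r
set.seed(1)
n.all <- 27
n.aml <- 11
expr.matrix <- matrix(rnorm(243 * (n.all + n.aml)), nrow = 243)
cancer.type <- c(rep("ALL", n.all), rep("AML", n.aml))  # known classes

## t() transposes the matrix, so the samples become rows
## and are thus the objects being clustered.
sample.tree <- hclust(dist(t(expr.matrix)), method = "average")
sample.clusters <- cutree(sample.tree, k = 2)

## Compare the two clusters with the known cancer types.
table(sample.clusters, cancer.type)
```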

Jacques van Helden