Statistics for Bioinformatics
Practicals - Microarray analysis

Introduction

The analysis of microarray data has become a major subject in bioinformatics. The number of papers on this topic is steadily increasing, as is the quantity of available data. Biologists and bioinformaticians are thus more and more frequently confronted with the task of analyzing gene expression profiles or related data types (ChIP-on-chip experiments, phylogenetic profiles, ...).

This type of analysis belongs to the field of multivariate analysis: the basic data set is a table with one row per object (e.g. a gene), and one column per variable (e.g. condition, tissue type, patient).
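As a minimal sketch of this layout, the following R code builds a small expression table from simulated values. The gene names, condition names and numbers are made up for illustration and do not come from any of the course datasets.

```r
## Simulated expression table: one row per gene, one column per condition.
## All names and values are hypothetical (not taken from the course datasets).
set.seed(1)
n.genes <- 5
n.conditions <- 4
expr <- matrix(rnorm(n.genes * n.conditions),
               nrow = n.genes, ncol = n.conditions,
               dimnames = list(paste0("gene", 1:n.genes),
                               paste0("cond", 1:n.conditions)))

dim(expr)        ## number of objects (rows) and of variables (columns)
expr["gene2", ]  ## expression profile of one gene across all conditions
expr[, "cond3"]  ## measurements of all genes in one condition
```

Selecting a row gives the profile of one object, whereas selecting a column gives all the values measured for one variable.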

A large variety of statistical and machine learning approaches can be applied to answer different questions about multivariate data. We will present here some of the basic questions that can be addressed to interpret such data, and some methods to answer them.

The concepts have been introduced during the lectures. The aim of these tutorials is to gain practical experience by using programs (R, TMEV) to answer these questions.

Configuration

Before starting the tutorials, you need to go through the following initialization steps.

  1. If you need a special configuration of the HTTP proxy (computer rooms of the USTL and ULB), follow the configuration steps as described earlier.
  2. After this, you must have an open R session in which the configuration file has been loaded.
      
      ## Load default configuration file for this course
      source('http://pedagogix-tagc.univ-mrs.fr/courses/statistics_bioinformatics/R-files/config.R')
      	  
  3. Execute the following commands to specify the directories where you will store the results and figures.
      
      ## Specify the directory to store your results for the analysis of
      ## yeast data on carbon sources (Gasch, 2000)
      dir.gasch <- file.path(dir.results, 'microarrays', 'gasch_2000')
      dir.create(dir.gasch, showWarnings=FALSE, recursive=TRUE)
      
      ## Specify the directory to store your results for the analysis of
      ## the dataset from Golub et al. (1999) on ALL-B versus ALL-T
      ## signatures.
      dir.golub <- file.path(dir.results, 'microarrays', 'golub_1999')
      dir.create(dir.golub, showWarnings=FALSE, recursive=TRUE)
      
      ## Specify the directory to store your results for the analysis of
      ## the dataset from DenBoer et al. (2009) on discrimination between
      ## various types of ALL-T.
      dir.denboer <- file.path(dir.results, 'microarrays', 'denboer_2009')
      dir.create(dir.denboer, showWarnings=FALSE, recursive=TRUE)
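
Once these directories exist, results and figures can be written into them. The sketch below shows one possible way to do so; the file names and the table content are hypothetical, and the fallback value for dir.results is an assumption added only so that the example also runs outside the course configuration.

```r
## Assumption: dir.results is normally defined by the course configuration file.
## The fallback below only serves to run this sketch outside that setup.
if (!exists("dir.results")) {
  dir.results <- file.path(tempdir(), "results")
}
dir.golub <- file.path(dir.results, "microarrays", "golub_1999")
dir.create(dir.golub, showWarnings = FALSE, recursive = TRUE)

## Export a small (simulated) table to the result directory
example.table <- data.frame(gene = paste0("gene", 1:3),
                            value = c(0.5, -1.2, 2.1))
out.file <- file.path(dir.golub, "example_table.tab")
write.table(example.table, file = out.file, sep = "\t",
            quote = FALSE, row.names = FALSE)

## Export a figure to the same directory
fig.file <- file.path(dir.golub, "example_plot.pdf")
pdf(file = fig.file, width = 7, height = 5)
plot(example.table$value, type = "h",
     main = "Example figure (simulated values)")
dev.off()
```

Building every path with file.path() from the directory variables keeps the code independent of the operating system and of the location of the result folder.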
      	     

Questions and approaches treated in these practicals

Question:   Normalization of the raw measurements
Approaches: Median centring, local regression, ...

Question:   Fitting a normal distribution on the expression chips
Approaches: Fitting

Question:   Selection of significantly regulated genes in a single chip
Approaches: Standardization, significance test

Question:   Measure the correlation between samples
Approaches: Correlation

Question:   Selection of differentially expressed genes
Approaches: Student test

Question:   Identification of groups of co-expressed genes
Approaches: Clustering

Question:   Graphical representations of large data sets (profiles, clusters, ...)
Approaches:
  • Principal component analysis
  • Multidimensional scaling
  • Expression profiles
  • Heat maps
    • Profile heat maps
    • Correlation heat maps

Question:   Prediction of cancer types from expression profiles of patients
Approaches: Supervised classification (discriminant analysis, SVM, ...)
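
To give a first idea of how several of the approaches listed above look in R, here is a minimal sketch that relies only on base R functions. The data are simulated (100 hypothetical genes measured in two groups of 5 samples), so the numbers have no biological meaning, and the choices made here (Student test with equal variances, average-linkage clustering on a correlation-based distance) are illustrative assumptions rather than the procedure followed in the practicals.

```r
## Simulated data (hypothetical): 100 genes x 10 samples, two groups of 5 samples
set.seed(42)
n.genes <- 100
group <- rep(c("A", "B"), each = 5)
expr <- matrix(rnorm(n.genes * 10, mean = 8, sd = 1), nrow = n.genes,
               dimnames = list(paste0("gene", 1:n.genes),
                               paste0("sample", 1:10)))
## Shift the first 10 genes upwards in group B to mimic differential expression
expr[1:10, group == "B"] <- expr[1:10, group == "B"] + 3

## Normalization by median centring: subtract the median of each sample (column)
expr.centred <- sweep(expr, 2, apply(expr, 2, median), "-")

## Standardization of a single chip: z-scores over all genes of the first sample
z.sample1 <- (expr.centred[, 1] - mean(expr.centred[, 1])) / sd(expr.centred[, 1])

## Correlation between samples (10 x 10 matrix)
sample.cor <- cor(expr.centred)

## Differentially expressed genes: Student t-test (equal variances) for each gene
p.values <- apply(expr.centred, 1, function(x) {
  t.test(x[group == "A"], x[group == "B"], var.equal = TRUE)$p.value
})

## Clustering of genes: hierarchical clustering on 1 - Pearson correlation
gene.dist <- as.dist(1 - cor(t(expr.centred)))
gene.tree <- hclust(gene.dist, method = "average")
gene.clusters <- cutree(gene.tree, k = 2)

## Principal component analysis of the samples (samples as rows)
pca <- prcomp(t(expr.centred))
plot(pca$x[, 1:2], col = ifelse(group == "A", "blue", "red"),
     main = "PCA of simulated samples")
```

Since one test is performed per gene, the resulting p-values would normally require a correction for multiple testing (for example with p.adjust()) before selecting genes.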

Jacques van Helden (jvhelden@ulb.ac.be)