Practicals - Standardization

In this tutorial, we will load a microarray data set (cDNA microarrays), and fit a normal distribution on the log ratio values.

**Standardization** is a transformation of the data set which aims
at reducing it to a standard normal distribution, i.e. centred on 0
and with a dispersion of 1. The classical way to standardize consists
of two steps:

- **centring:** subtracting the mean from each value;
- **scaling:** dividing the centred value by the standard deviation.

$$z = \frac{x - m_{est}}{s_{est}}$$

where

- *x* is the original value (here, a log-ratio);
- *z* is the standardized value (the z-score);
- *m*_{est} is a sample-based estimate of the population mean;
- *s*_{est} is a sample-based estimate of the population standard deviation.

Robust estimates can be used for both parameters:

- the median provides a robust estimate of the mean;
- the inter-quartile range (IQR) provides a robust estimate of the standard deviation.
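As a minimal sketch of the two steps, assuming a small made-up vector of log-ratios (the name `x` and its values are illustrative only, they are not taken from the data set), centring and scaling could be written as follows.

```r
## Hypothetical log-ratio values (for illustration only)
x <- c(-1.2, -0.4, 0.1, 0.3, 0.8, 2.2)

## Step 1 (centring): subtract the estimated mean from each value
x.centred <- x - mean(x)

## Step 2 (scaling): divide the centred values by the estimated standard deviation
z <- x.centred / sd(x)

## The standardized values have a mean of 0 (up to rounding errors)
## and a standard deviation of 1
print(mean(z))
print(sd(z))
```

The built-in function `scale()` performs the same two steps with the mean and the standard deviation, so `as.vector(scale(x))` should return the same values as `z`.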

Standardization is closely related to normal fitting. Indeed, the first step of standardization is to fit a normal curve to the data, in order to estimate the central tendency (for the centring) and the dispersion (for the scaling).

In microarray analysis, standardization is typically performed on a chip-per-chip basis (chip-wise standardization). Chip-wise standardization relies on assumptions at the level of the gene population (the whole set of genes spotted on one chip): the average value is brought to 0, which means that up- and down-regulation are assumed to compensate each other over the whole chip. Whilst this is often the case under precisely defined experimental conditions (e.g. when comparing the expression of cells fed with one carbon source versus other carbon sources, only a few genes are expected to respond), there are some experimental conditions where a global effect on transcription would be expected (e.g. RNA polymerase mutants). Thus, even at the chip level, the user has to decide whether standardization is appropriate, depending on his/her biological knowledge of the experiment.
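The consequence of this assumption can be illustrated with simulated values. The sketch below assumes two hypothetical chips (the variable names, the number of genes and the distribution parameters are invented for the illustration): one without global effect, and one where all the genes are shifted by -1 on the log-ratio scale.

```r
## Simulated log-ratios for two hypothetical chips (NOT real data)
set.seed(42)
n.genes <- 5000
local.response  <- rnorm(n.genes, mean=0, sd=0.5)   # no global effect
global.response <- rnorm(n.genes, mean=-1, sd=0.5)  # all genes shifted down

## Centring brings both chips to a mean of 0 ...
local.centred  <- local.response - mean(local.response)
global.centred <- global.response - mean(global.response)

## ... so the global shift is not visible anymore after centring
print(c(before=mean(global.response), after=mean(global.centred)))
```

In this simulation, the centred values of the second chip cannot be distinguished from those of the first one: the global effect has been erased by the centring.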

Some programs also propose a gene-wise standardization, but this relies on a very strong biological assumption: it assumes that (1) all the genes are on average unregulated (in other terms, that the up- and down-regulations compensate each other within each profile), and that (2) the variations in expression of all the genes can be considered equivalent, i.e. whether a gene has a strongly or weakly varying expression profile, its standard deviation will be equal to 1 after standardization. I can hardly imagine any situation where all the genes would fulfill these requirements, and I thus recommend avoiding gene-wise standardization.
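The second assumption can be illustrated with a toy example. The sketch below assumes two invented expression profiles (the names `strong` and `weak` and their values are hypothetical), which have the same shape but differ 15-fold in amplitude.

```r
## Hypothetical expression profiles (log-ratios) of two genes over 5 conditions
strong <- c(-3, -1.5, 0, 1.5, 3)      # strongly varying gene
weak   <- c(-0.2, -0.1, 0, 0.1, 0.2)  # weakly varying gene
profiles <- rbind(strong, weak)

## Gene-wise standardization: centre and scale each row (gene).
## scale() works on columns, so the matrix is transposed before and after.
profiles.z <- t(scale(t(profiles)))

## After standardization, both profiles have a standard deviation of 1
## and the difference of amplitude is lost
print(round(profiles.z, 3))
print(apply(profiles.z, 1, sd))
```

After gene-wise standardization, the two rows contain the same values, although the original profiles differed strongly in amplitude.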

For this practical, we selected 13 experiments where only a few genes are expected to respond. We will apply chip-wise standardization to each chip independently, and analyze the result.

A convenient way to handle microarray samples is to standardize
each sample, i.e. to bring its mean (or an estimate thereof) to 0, and
its standard deviation (or an estimate thereof) to 1. The standardized
values are usually referred to as **z-scores**.

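In the tutorial on fitting, we saw that the median provides a robust estimate of the mean, and that the inter-quartile range (IQR), once divided by the IQR of the standard normal distribution (about 1.349), provides a robust estimate of the standard deviation.

In the tutorial below, we will use these robust estimators for the chip-wise standardization of the carbon source experiments from Gasch et al. (2000).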
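To see why robust estimators are preferable here, the sketch below compares them with the classical estimators on simulated values. The simulated chip is an assumption made for the illustration: 950 unregulated genes (mean 0, standard deviation 0.5) plus 50 strongly up-regulated genes, which play the role of outliers.

```r
## Simulated chip (hypothetical values, NOT the real data):
## 950 unregulated genes plus 50 strongly up-regulated genes
set.seed(123)
x <- c(rnorm(950, mean=0, sd=0.5), rnorm(50, mean=4, sd=0.5))

## Classical estimates are pulled towards the regulated genes
m.classical <- mean(x)
s.classical <- sd(x)

## Robust estimates stay closer to the parameters of the unregulated genes
m.robust <- median(x)
s.robust <- IQR(x) / (qnorm(0.75) - qnorm(0.25))

print(c(mean=m.classical, median=m.robust))
print(c(sd=s.classical, iqr.based=s.robust))
```

In this simulation, the median and the IQR-based estimate are much closer to the parameters of the unregulated genes (0 and 0.5) than the mean and the standard deviation.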

- This tutorial assumes you have already loaded the script `config.R`, as described in the configuration page.
- The concepts and practice of the tutorial on fitting should have been acquired before following the present tutorial.

We will perform the same last steps as in the solution of the previous tutorial, in order to obtain these robust estimates, and use them to convert the log-ratios into z-scores.

```r
## Select the chip labelled "galactose" in a separate variable
gal <- as.vector(carbon$galactose)

## Robust estimates of the mean and standard deviation
m.est <- median(gal, na.rm=TRUE)
iqr <- IQR(gal, na.rm=TRUE)
iqr.normal.range <- qnorm(0.75) - qnorm(0.25)
s.est <- iqr / iqr.normal.range

## Convert log-ratios into z-scores
gal.z <- (gal - m.est) / s.est
```

We can check that this transformation did not affect the shape of the distribution, but that the new distribution is nicely centred on 0, and that it is well fitted by a standard normal distribution, i.e. a normal distribution with a mean of 0 and a standard deviation of 1.

```r
## Plot the distribution of z-scores
## The option "freq=FALSE" displays the density instead of the counts on the Y axis
gal.z.hist <- hist(gal.z, breaks=60, col="#BBBBFF", border="#BBBBFF", freq=FALSE)

## Draw the standard normal distribution
lines(gal.z.hist$mids, dnorm(gal.z.hist$mids), col="darkgreen", lwd=2)
```

Not surprisingly, this histogram has the same shape as the one obtained in the tutorial on fitting. However, its range (abscissa) is quite different: the standardized data have a median of 0, and the IQR of a standard normal distribution (1.348980).
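The same conversion can be applied to each chip independently, and the median and IQR can be checked numerically. The sketch below is only an illustration: it assumes that the chips are stored as the columns of a data frame, and it uses a simulated data frame (`chips`, with invented column names) rather than the real data. The function name `standardize.robust` is also hypothetical.

```r
## Robust standardization of one chip:
## the median is used for centring, the normalized IQR for scaling
standardize.robust <- function(x) {
  m.est <- median(x, na.rm=TRUE)
  s.est <- IQR(x, na.rm=TRUE) / (qnorm(0.75) - qnorm(0.25))
  (x - m.est) / s.est
}

## Simulated log-ratios for 3 hypothetical chips (NOT the real data)
set.seed(1)
chips <- data.frame(
  chipA = rnorm(1000, mean=0.3, sd=0.5),
  chipB = rnorm(1000, mean=-0.1, sd=1.2),
  chipC = rnorm(1000, mean=0, sd=0.8)
)

## Apply the standardization to each column (chip) independently
chips.z <- as.data.frame(lapply(chips, standardize.robust))

## Check the median and the IQR of each standardized chip
print(sapply(chips.z, median))
print(sapply(chips.z, IQR))
```

Whatever the original mean and dispersion of a chip, its standardized values should have a median of 0 (up to rounding errors) and an IQR equal to `qnorm(0.75) - qnorm(0.25)`.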

The standardized data are convenient for selecting genes that are significantly up- or down-regulated in a single microarray chip, as will be illustrated in the tutorial on significance testing.

- Gasch AP, Spellman PT, Kao CM, Carmel-Harel O, Eisen MB, Storz G, Botstein D, Brown PO (2000). Genomic expression programs in the response of yeast cells to environmental changes. Mol Biol Cell. Dec;11(12):4241-57. PMID: 11102521.

Jacques van Helden (jvhelden@ulb.ac.be)