In the tutorial above, we followed step by step the procedure to standardized a single chip, and discussed about the results. A typical analysis usually includes a series of microarray samples, corrsponding to multiple conditions (or patients), or to succesive points in a time series.
The R scripts associated to the course include a utility for the standardization of a series of DNA chips. The input is supposed to be a table with 1 column per chip and 1 row per gene, where values are log-ratios. The carbon data set that was loaded above corresponds to this format.
To load the standardization utility, type the following command.
source(paste(dir.util, "util_chip_standardization.R", sep="/"))
We will first apply the usual standardization, which consists in using the sample mean as estimator of central tenency, and the sample standard deviation as estimator of dispersion. We will then evaluate the quality of the result.
## standardize carbon.standard <- standardize.chips(carbon, quartile.est=FALSE)
We can now inspect the result.
## Contents of the resulting object names(carbon.standard) ## Mean value for each chip carbon.standard$m.est ## Standard deviation for each chip carbon.standard$s.est ## z-scores: calculate the data size and print the first 5 rows dim(carbon.standard$z) carbon.standard$z[1:5,] ## Column names for the extr attribtue names(carbon.standard$extr) ## Extreme values for the 5 first genes carbon.standard$extr[1:5,] ## E-values: calculate the data size and print the first 5 rows dim(carbon.standard$E.value) carbon.standard$E.value[1:5,] ## We can now compare the box plots before and after standardization x11(width=12,height=8) par(mai=c(1.5,1,1,0.1)) boxplot(carbon,col='#0000BB',main='before standardization',las=2) export.plot(file.path(dir.gasch, 'carbon_boxplots_log_ratios'),export.formats='pdf',width=10,height=8) x11(width=12,height=8) par(mai=c(1.5,1,1,0.1)) boxplot(carbon.standard$z, col='#BB00FF',main='standardized',las=2) export.plot(file.path(dir.gasch, 'carbon_boxplots_z'),export.formats='pdf',width=10,height=8)
The result object contains different types of information.
## Plot the histogram of standardized values for the first chip br <- 0.2*(-50:25) ## Scale of the curve hist(carbon.standard$z[,1], breaks=br) ## Draw vertical lines at 0, -1 and 1 abline(v=c(-1,0,1),col="blue",lwd=2)
As discussed in the tutorial on standardization, a weakness of the classical estimators is their sensitivity to outliers: if the sample contains a few points with a very high or a very low value, the mean and standard deviation will be strongly affected.
This problem can be circumvented by using robust estimators: the median can be used as an estimator of the mean, and the dispersion can be estimated on the basis of the inter-quartile range.
## Standardize with robust (quartile) estimates carbon.standard.robust <- standardize.chips(carbon, quartile.est=TRUE)
## select genes with an E-value <= 1 in at least one chip selected <- carbon.standard$extr$E.value.min <= 1 selected.robust <- carbon.standard.robust$extr$E.value.min <= 1 ## count selected genes sum(selected) sum(selected.robust)We will store the selected genes in a text file. These selected genes will be used for the tutorial on clustering.
## Go to the export directory setwd(dir.export) ## Create a sub-directory for Gasch data dir.create("gasch2000") setwd("gasch2000") getwd() ## Save selected genes for further analysis write.table(carbon.standard$z[selected,], file="Gasch_Evalue_1.tab",sep='\t', quote=FALSE,col.names=TRUE,row.names=T) write.table(carbon.standard.robust$z[selected.robust,], file="Gasch_robust_Evalue_1.tab",sep='\t', quote=FALSE,col.names=TRUE) list.files()
Note that we saved two files: one with the classical and one with the robust estimators. Each of these file contains z-scores.