The first papers on microarray analysis were using a naive approach based on "fold change" to select up- and down-regulated genes. The principle was to consider as up-regulated all the gene whose expression ratio was above a certain threshold, typically 2, and as down-regulated all those whose ratio was below the inverse threshold (typically 1/2).
In the historical publication on the DNA chip technology (deRisi et al., 1997) he magical choice of a threshold was based on the observation that, in some control chip, no more than 19 genes (out of 6300) passed this value. The rate of false positive seemed thus acceptable.
This "fold-change" threshold is of course arbitrary, and presents several drawbacks.
It seems thus necessary to improve the criterion for selecting regulated genes.
Ideally, one would like to dispose of several repeats of each condition, in order to apply the classical tests of differential analysis (t-test, ANOVA). This is however not always possible, and expression data is currently available for a lot of different conditions, but with a single repetition per condition. Although not ideal for statisticians, this data certainly contain a lot of biological information, and we can try to make the best use of it.
We will present below a strategy for selecting up- and down-regulated genes from a single chip experiment, with no repetition. This strategy can be applied to each chip of a series (e.g. time series, set of alternative substrates, ...), to select all the genes responding to one or another of these conditions.
The principle is to standardize each chip (centring and scaling), so that the fold-ratios can be converted in z-scores. under assumption of normality, each z-score can be associated to some P-value.
Of course, the assumption of normality is problematic, since microarray data usually do not pass a normality test. However, we will see that the fitting is not so bad, provided one chooses robust estimators of central tendency (median rather than mean) and dispersion (inter-quartile range rather than standard deviation).
In the previous tutorial (fitting), we fitted a theoretical normal distribution on the expression values of a microarray sample (one chip). We saw that the observed distribution of expression log-rations roughly follows a normal distribution, especially when using robust estimates of central tendency and dispersion.
We will assume that the fitted normal distribution represents the random fluctuations coming from different sources (biological conditions, spotting of the microarray, measurement of light intensities, ...).
If this is the case, we can consider that the genes which significantly discard from this normal distribution (the outliers) do so for a non-random reason. A good reason for showing non-random distribution in the log-ratios of expression is that these genes are up- or down-regulated.
In this tutorial, we will apply this reasoning to estimate, for each gene, the significance of its deviation from the normal distribution, in order to assess if this gene is likely to be up- or down-regulated.
This tutorial is an application of concepts seen in the following chapters of the course.
## Suppress the NA values from the gal vector gal <- na.omit(gal) gal.z <- na.omit(gal.z) length(gal) length(gal.z) ## This should give 5993 ## Calculate the P-value associated to down-regulation (right tail) gal.Pval.down <- pnorm(gal.z, lower.tail=T) ## Calculate the P-value associated to up-regulation (right tail) gal.Pval.up <- pnorm(gal.z, lower.tail=F)
To facilitate our inspection of the results, we will assemble the different statistics (log-ratio, z-score, Pvalue, and the next ones to be calculated) in a single data frame.
## Summarize the result in a data frame
gal.sig.test <- data.frame(row.names=names(gal),
## Display 20 elements of the data frame
The two P-values (Pval.up and Pval.down) calculated above represent two distinct biological questions: the right tail test (Pval.up) aims at detecting up-regulated genes, i.e. those which are expressed more intensely in presence of galactose than with other carbon source. Conversely, the left tail test (Pval.down) aims at detecting down-regulated genes, i.e. those being expressed less intensely in presence of galactose than with other carbons sources.
Each gene will be either considered as up-regulated (if its z-score is very high, compared to the random fluctuations), or down-regulated (if its z-score is very low, compared to the random expectation), but not both together.
We could thus calculate a single P-value, reflecting either the left (Pval.down) or right (Pval.up) tail of the normal distribution, depending on the negative (down-regulation) or positive (up-regulation) value of the z-score.
In statistical textbooks, the practice for a two-tail test is to "share" the risk (P-value) between the two tails. However, in our case, we want to maintain as two separate questions the selection of up- and down-regulated genes. Indeed, we expect that these genes will potentially regulated distinct metabolic pathways (in particular galactose utilization, versus utilization of other sources). Thus, since each group of selected genes will be analyzed separately, we calculate the P-value on either the right tail (for genes with a positive z-score), or the left tail (genes with negative z-score), without dividing the P-value by 2.
## Add a column in the table to indicate whether the test is performed gal.sig.test$tail <- NA up <- gal.sig.test$z >= 0 gal.sig.test[up, "tail"] <- "right" down <- gal.sig.test$z < 0 gal.sig.test[down, "tail"] <- "left" ## Check the result print(gal.sig.test[180:199,]) ## Create a column for a unisuqe Pval (either left or right depending on z-score sign) gal.sig.test$Pval <- NA gal.sig.test[up, "Pval"] <- gal.sig.test[up, "Pval.up"] gal.sig.test[down, "Pval"] <- gal.sig.test[down, "Pval.down"] ## Display 20 rows of the table print(gal.sig.test[180:199,])
For the analysis of the results, it is convenient to sort genes by z-score, in order to have the most strongly up-regulated genes at the top of the list, and the most strongly down-regulated at the bottom.
## Sort the rows by decreasing value of z-score z.order <- order(gal.sig.test$z, decreasing=F) gal.sig.test <- gal.sig.test[z.order,] ## Display the 20 top rows of the table print(gal.sig.test[1:20,]) ## Display the 20 bottom rows of the table N <- nrow(gal.sig.test) print(gal.sig.test[(N-20):N,])
With the P-value as calculated above, we estimate the risk, for each gene separately, to consider it as significant whereas it is not. however, since we apply the same test to a large number of genes (~6,000), we have to introduce corrections for multi-testing.
## Number of tests in the multi-test series T <- nrow(na.omit(gal.sig.test)) ## Calculate the E-value ## The E-value represents the expected number of false positive genes ## for a series of N tests (e.g. one test per gene in the microarray) gal.sig.test$Eval <- gal.sig.test$Pval*T ## Calculate the Family-Wise Error Rate (FWER) ## The FWER represents the probability to have at least one false ## positive in a series of N tests. gal.sig.test$fwer <- pbinom(0,T,gal.sig.test$Pval, lower.tail=F) ## Check the result with the 20 top and 20 bottom rows print(gal.sig.test[1:20,]) print(gal.sig.test[(T-20):T,]) ## Check values at intermediate levels of the table print(gal.sig.test[80:100,]) print(gal.sig.test[1000:1019,]) print(gal.sig.test[5000:5019,])
Note that for most genes, the FWER is very high (FWER=1). This means that these genes should in no case be considered as significantly up- or down-regulated. Actually, in a control experiment (where the same sample has been hybridized on the green and red channels), all FWER should be close to 1, and all genes should be considered as non-significant.
Since we proposed two corrections for multi-testing (E-value and FWER), one could wonder which one is the most appropriate. The interpretation of these to corrections differs.
Despite this difference, we will see that he two corrections give similar results within their range of interest, i.e. for very low P-values (which correspond to high significance).
A simple way to compare the statistics is to plot them.
## Compare the E-value and the FWER plot(gal.sig.test$Eval, gal.sig.test$fwer, type="b", col="blue", xlab="E-value", ylab="FWER" ) grid(col="black")
This graph is not very informative, because for almost all the genes, the FWER is 1, and the E-value is high (varying from 1 to T). Typically, we are interested in the very few genes which are on the left of the graph, those with very low FWER and E-value. A simple way to highlight these low values, is to use logarithmic scales.
## Compare the E-value and the FWER plot(gal.sig.test$Eval, gal.sig.test$fwer, col="blue", log="xy", xlab="E-value", ylab="FWER", main="Multitesting corrections", panel.first=c(grid(col="black"), abline(a=0,b=1,col="darkgreen"), abline(h=1,col="darkred"), abline(v=1,col="darkred") ) ) ## Plot the E-value and FWER versus nominal P-value plot(gal.sig.test$Pval, gal.sig.test$Eval, col='green', pch=20, panel.first=c(grid(col='lightgray',lty='solid',equilogs=F), abline(h=1,col='black')), xlab='Nominal P-value', ylab='Multitesting corrected', log='xy' ) lines(gal.sig.test$Pval, gal.sig.test$fwer, col='darkblue', pch=1,type='p') legend('bottomright', legend=c('E-value','FWER'),col=c('green','darkblue'),pch=c(20,1)) ## Save the plot export.plot(file.path(dir.gasch, 'gal_Eval_vs_FWER'),export.formats="pdf",height=8,width=8)
On the logarithmic plot, we clearly see that the E-value ad FWER are almost un-distinguishable below 10-1.
We can now select the genes which are significantly up- and down-regulated, respectively.
## Specify an upper threshold on the E-value Eval.threshold <- 1 ## This means that we accept more or less one false positive for this chip. ## select genes with an Evalue below the thresold gal.selected <- gal.sig.test$Eval < Eval.threshold ## This returns a boolean vector, indicating, for each gene, whether ## it is or not up-regulated print (gal.selected) ## Count the number of significantly regulated (up- and down- togther) genes sum(gal.selected) ## Select up-regulated genes gal.selected.up <- gal.selected & gal.sig.test$z > 0 sum(gal.selected.up) print(gal.sig.test[gal.selected.up,]) ## Select down-regulated genes gal.selected.down <- gal.selected & gal.sig.test$z < 0 sum(gal.selected.down) print(gal.sig.test[gal.selected.down,])
The KEGG database provides a very convenient tool for mapping a collection of genes against predefined pathway maps. We will now map the significant genes against KEGG pathway maps, in order to understand which metabolic pathways are up- and down-regulated by galactose, respectively.