The Snf2 dataset

The RNA-Seq dataset used in this practical was produced by Gierliński et al ([@pmid26206307, @pmid27022035]). It is composed of 48 samples of a wild-type (WT) yeast strain and 48 samples of a Snf2 knock-out mutant strain. The prepared (unstranded) RNA-Seq libraries were pooled and sequenced on seven lanes of a single flow-cell of an Illumina HiSeq 2000, resulting in a total of 1 billion 50-bp single-end reads across the 96 samples. The reads were cleaned, mapped and counted to generate a count matrix containing 7126 rows (genes). The primary objective of the study was to check whether the observed distributions of gene read counts were consistent with theoretical models (e.g. the negative binomial). More information can be found in the original paper (pdf).

Loading the dataset

R can download data directly from the Web. The expression matrix and the phenotypic information will be loaded into R with the read.table() function. Both tables will be converted into data.frame objects when loaded into R. The ‘count.table’ object will contain the counts for each gene (row) and each sample (column).

# Download data files from the Web site (only if not done yet)
url.counts <- "http://jvanheld.github.io/stats_avec_RStudio_EBA/practicals/yeast_2x48_replicates/data/"

## Local paths: create a directory to store the results
dir.snf2 <- ("~/ASG/practicals/rnaseq_snf2_Schurch_2015")
dir.counts <- file.path(dir.snf2, "data")
file.counts <- file.path(dir.counts, "counts.txt")
file.expDesign <- file.path(dir.counts, "expDesign.txt")

## Create a directory to download the dataset if it does not exist
dir.create(dir.counts, showWarnings = FALSE, recursive = TRUE)

## Download the data files if required
if (!file.exists(file.counts)) {
  message("Downloading count table from ", url.counts)
  download.file(url=file.path(url.counts, "counts.txt"), destfile = file.counts)
}
if (!file.exists(file.expDesign)) {
  message("Downloading design table from ", url.counts)
  download.file(url=file.path(url.counts, "expDesign.txt"), destfile = file.expDesign)
}

# Load the count table
count.table <- read.table(file=file.counts, sep="\t", header=TRUE, row.names=1)
# View(count.table)

Phenotypic data

The dataset contains RNA-Seq count data for a wild type strain (WT) and a Snf2 mutant, with 48 biological replicates for each genotype.

All phenotypic information is provided in a dedicated file. Note that the resulting data.frame encodes the ‘strain’ column as a factor1.

# Load experimental design file
expDesign <- read.table(file=file.expDesign, sep="\t", header=TRUE)
#View(expDesign)

# Check the first and last line of the phenotypic data
head(expDesign)
  label strain
1   WT1     WT
2   WT2     WT
3   WT3     WT
4   WT4     WT
5   WT5     WT
6   WT6     WT
tail(expDesign)
   label strain
91 Snf43    Snf
92 Snf44    Snf
93 Snf45    Snf
94 Snf46    Snf
95 Snf47    Snf
96 Snf48    Snf
## Count the number of samples in each class
table(expDesign$strain)

Snf  WT 
 48  48 
## Define a strain-specific color for each sample,
## and add it as a supplementary column to the phenotypic data
col.strain <- c("WT"="green","Snf"="orange") # Choose one color per strain
expDesign$color <- col.strain[as.vector(expDesign$strain)]
  • Draw a barplot showing the total number of reads in each sample. Use either the colSums() or the apply() function (run help(colSums) if you don't know these functions). A possible solution is sketched below.
  • What can you conclude from this diagram?

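One possible solution (a minimal sketch relying on the count.table and expDesign objects created above):

## Possible solution: total number of reads per sample
reads.per.sample <- colSums(count.table)   # equivalent to apply(count.table, 2, sum)
barplot(reads.per.sample, col=expDesign$color, las=2, cex.names=0.5,
        main="Total read counts per sample", ylab="Number of reads")

The total number of reads (library size) clearly differs between samples, which is one of the motivations for the normalization step applied later in this practical.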

Descriptive statistics

Basic statistics

Before going further in the analysis, we will compute some descriptive statistics on the dataset. At this stage we only compute statistics per sample, since statistics per gene are meaningful only after library-wise normalization of the counts.

## Dimensions
nb.samples <- ncol(count.table)
print(nb.samples)
[1] 96
nb_genes <- nrow(count.table)
print(nb_genes)
[1] 7126
dim(count.table)
[1] 7126   96
## Min, Max, median (...). 
## Here on the first 4 samples
head(summary(count.table[,1:4]))
      WT1              WT2              WT3              WT4        
 Min.   :     0   Min.   :     0   Min.   :     0   Min.   :     0  
 1st Qu.:    49   1st Qu.:    88   1st Qu.:    74   1st Qu.:   104  
 Median :   224   Median :   371   Median :   297   Median :   430  
 Mean   :   837   Mean   :  1107   Mean   :   903   Mean   :  1464  
 3rd Qu.:   561   3rd Qu.:   853   3rd Qu.:   670   3rd Qu.:  1019  
 Max.   :188825   Max.   :196804   Max.   :172119   Max.   :328674  
## A magic trick to convert column-wise summaries into a data.frame.
## The do.call() call combines the per-sample summaries into one table with one column per sample;
## we transpose it to obtain one row per sample and one column per statistic.
stats.per.sample <- data.frame(t(do.call(cbind, lapply(count.table, summary))))
head(stats.per.sample)
    Min. X1st.Qu. Median Mean X3rd.Qu.   Max.
WT1    0     49.0    224  837      561 189000
WT2    0     88.0    371 1110      853 197000
WT3    0     74.2    297  903      670 172000
WT4    0    104.0    430 1460     1020 329000
WT5    0     84.0    353 1120      819 225000
WT6    0    153.0    667 2050     1600 357000
## We can now add some columns to the stats per sample
stats.per.sample$libsum <- apply(count.table, 2, sum) ## libsum
# Add some percentiles
stats.per.sample$perc05 <- apply(count.table, 2, quantile, 0.05)
stats.per.sample$perc10 <- apply(count.table, 2, quantile, 0.10)
stats.per.sample$perc90 <- apply(count.table, 2, quantile, 0.90)
stats.per.sample$perc95 <- apply(count.table, 2, quantile, 0.95)
stats.per.sample$zeros <- apply(count.table==0, 2, sum)
stats.per.sample$percent.zeros <- 100*stats.per.sample$zeros/nrow(count.table)

# View(stats.per.sample)
kable(stats.per.sample[sample(1:ncol(count.table), size = 10),],
      caption = "**Table: statistics per sample. ** We only display a random selection of 10 samples. ")
Table: statistics per sample. We only display a random selection of 10 samples.
|       | Min. | X1st.Qu. | Median | Mean | X3rd.Qu. | Max.   | libsum   | perc05 | perc10 | perc90 | perc95 | zeros | percent.zeros |
|-------|------|----------|--------|------|----------|--------|----------|--------|--------|--------|--------|-------|---------------|
| Snf21 | 0    | 112.0    | 463    | 1340 | 1060     | 202000 | 9531827  | 0      | 2      | 2408   | 4274   | 505   | 7.09          |
| Snf26 | 0    | 124.0    | 506    | 1380 | 1140     | 210000 | 9812933  | 0      | 3      | 2576   | 4334   | 467   | 6.55          |
| WT29  | 0    | 83.0     | 353    | 1140 | 816      | 242000 | 8158307  | 0      | 2      | 1918   | 3487   | 514   | 7.21          |
| Snf16 | 0    | 120.0    | 493    | 1270 | 1080     | 182000 | 9073818  | 0      | 2      | 2368   | 4008   | 503   | 7.06          |
| Snf7  | 0    | 161.0    | 668    | 1750 | 1520     | 243000 | 12479715 | 0      | 4      | 3320   | 5671   | 449   | 6.30          |
| WT26  | 0    | 86.2     | 375    | 1300 | 914      | 291000 | 9231334  | 0      | 2      | 2260   | 4147   | 537   | 7.54          |
| Snf31 | 0    | 74.0     | 335    | 1070 | 788      | 199000 | 7605035  | 0      | 1      | 1892   | 3361   | 569   | 7.99          |
| Snf42 | 0    | 75.0     | 325    | 968  | 752      | 164000 | 6894818  | 0      | 1      | 1771   | 3084   | 587   | 8.24          |
| Snf18 | 0    | 107.0    | 439    | 1170 | 985      | 160000 | 8358221  | 0      | 2      | 2204   | 3772   | 512   | 7.18          |
| Snf9  | 0    | 115.0    | 495    | 1410 | 1110     | 211000 | 10025640 | 0      | 2      | 2530   | 4496   | 481   | 6.75          |

Distributions

Histograms of counts per gene

The summary only displays a few key values (mean, median, quartiles). In order to get a better intuition of the data, we can draw a histogram of all values.

par(mfrow=c(3,1))

hist(as.matrix(count.table), col="blue", border="white", breaks=100)

hist(as.matrix(count.table), col="blue", border="white",
     breaks=20000, xlim=c(0,2000), main="Counts per gene",
     xlab="Counts (truncated axis)", ylab="Number of genes", 
     las=1, cex.axis=0.7)

epsilon <- 1 # pseudo-count to avoid problems with log(0)
hist(as.matrix(log2(count.table + epsilon)), breaks=100, col="blue", border="white",
     main="Log2-transformed counts per gene", xlab="log2(counts+1)", ylab="Number of genes", 
     las=1, cex.axis=0.7)
Histograms of counts per gene. Top: raw counts; the scale is determined by the gene with the highest count, which is apparently an outlier. Middle: raw counts, with the X axis truncated at 2000 in order to display a representative range despite the outliers. Bottom: log2-transformed counts per gene, with a pseudo-count of 1 to avoid minus-infinite values resulting from zero counts.

par(mfrow=c(1,1))

Interpretation

  • The top histogram is not very informative so far, apparently due to the presence of a few very high count values, that impose a very large scale on the \(X\) axis.
  • The middle histogram shows the representative range. Note the height of the first bin, which includes the zero counts.
  • The logarithmic transformation (bottom histogram) improves the readability. Note that we added a pseudo-count of 1 to avoid problems with the log transformation of zero counts (which gives \(-\infty\)).

Boxplots of gene count distributions per sample

To get better insights into the distribution per sample, boxplots offer a good perspective.

## Boxplots
boxplot(log2(count.table + epsilon), col=expDesign$color, pch=".", 
        horizontal=TRUE, cex.axis=0.5,
        las=1, ylab="Samples", xlab="log2(Counts +1)")
Box plots of non-normalized log2(counts) per sample.

Density plots

Another way to get an intuition of the value distributions is to use the plotDensity() function, which draws one density curve for each sample.

## Density
## We will require one function from the affy package
if(!require("affy")){
  source("http://bioconductor.org/biocLite.R")
  biocLite("affy")  
}
library(affy)
plotDensity(log2(count.table + epsilon), lty=1, col=expDesign$color, lwd=2)
grid()
legend("topright", legend=names(col.strain), col=col.strain, lwd=2)
Densities of log2(counts). Each curve corresponds to one sample.

Beware: the R function plotDensity() does not display the actual distribution of your values, but a smoothed estimate of it (a kernel density). The representation therefore generally looks smoother than the actual data. It is important to realize that, in some particular cases, the smoothing can produce extrapolated values beyond the observed range, which can be misleading.
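
To visualize the effect of this smoothing, one can overlay a kernel density estimate (the base R density() function) on a plain histogram for a single sample. This is a small illustrative sketch, not part of the original analysis; it reuses the count.table and epsilon objects defined above.

## Histogram of one sample, with its smoothed density estimate overlaid
x <- log2(count.table[, "WT1"] + epsilon)
hist(x, breaks=100, freq=FALSE, col="grey", border="white",
     main="WT1", xlab="log2(counts + 1)", ylab="Density")
lines(density(x), col="red", lwd=2)   # kernel density estimate (smoothed curve)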

Scatter plots

nb.pairs <- 6


## Define a function to draw a scatter plot for a pair of variables (samples) with density colors
plotFun <- function(x,y){ 
  dns <- densCols(x,y); 
  points(x,y, col=dns, pch=".", panel.first=grid());  
#  abline(a=0, b=1, col="brown")
  }

## Plot the scatter plot for a few pairs of variables selected at random
set.seed(123) # forces the random number generator to produce fixed results. Should generally not be used, except for the sake of demonstration with a particular selection. 
pairs(log2(count.table[,sample(ncol(count.table), nb.pairs)] + epsilon), 
      panel=plotFun, lower.panel = NULL)
Scatter plot of log2-counts for a random selection of samples.

The pairs() command draws a scatter plot for each pair of columns of the input dataset; here we only represented 6 randomly selected samples. The plots show a fairly good reproducibility between samples of the same genotype (WT or Snf2 mutant, respectively): all points are aligned along the diagonal, with a relatively wider dispersion at the bottom, corresponding to small-count fluctuations.

In contrast, on all the plots comparing a WT and a Snf2 sample, we can see some points (genes) that clearly depart from the diagonal.

Eliminating undetected genes

All the genes of the S. cerevisiae genome were quantified. However, only a fraction of them were expressed, and some were too weakly expressed to be detected in any of the samples. As a result, the count table may contain rows with only zero values (null counts).

  • What is the percentage of genes having null counts in each sample? Draw a barplot.
  • Some genes were not detected in any of the samples. Count them, and delete them from the count.table data.frame. A possible solution is sketched below.

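A possible solution (a sketch based on the count.table and expDesign objects defined above):

## Percentage of null counts per sample
prop.null <- apply(count.table, 2, function(x) 100 * mean(x == 0))
barplot(prop.null, main="Percentage of null counts per sample",
        horiz=TRUE, cex.names=0.5, las=1, col=expDesign$color,
        ylab="Samples", xlab="% of null counts")

## Genes that were never detected (zero counts in all samples)
undetected <- rowSums(count.table) == 0
sum(undetected)        # number of genes with only null counts
count.table <- count.table[!undetected, ]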

Selecting random samples

One of the questions that will drive the analysis will be to define the impact of the number of biological samples on the results.

The original study contained 48 replicates per genotype, what happens if we select a smaller number?

Each attendee of this course selects a given number (e.g. 3, 4, 5, 10, 15, 20, 35, 40, 45…) and adapts the code below to run the analysis with that number of replicates per genotype. At the end, we will compare the results (number of genes, significance, …).

nb.replicates <- 10 ## Each attendee chooses a number (3,4,5,10,15 or 20)

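## Random sampling of the WT replicates (columns 1 to 48)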
samples.WT <- sample(1:48, size=nb.replicates, replace=FALSE)

## Random sampling of the Snf2 replicates (columns 49 to 96)
samples.Snf2 <- sample(49:96, size=nb.replicates, replace=FALSE)

selected.samples <- c(samples.WT, samples.Snf2)

# Don't forget to update colors
col.pheno.selected <- expDesign$color[selected.samples]

Differential analysis with DESeq2

In this section we will search for genes whose expression is affected by the snf2 knock-out. You will first need to install the DESeq2 Bioconductor library and then load it.

## Install the library if needed then load it
if(!require("DESeq2")){
  install.packages("lazyeval")
  install.packages("ggplot2")
  
  source("http://bioconductor.org/biocLite.R")
  biocLite("DESeq2")
}
library("DESeq2")

Creating a DESeqDataSet dataset

We will then create a DESeqDataSet object using the DESeqDataSetFromMatrix() function. Get some help about the DESeqDataSet class and have a look at some important accessor and analysis methods: counts(), colData(), estimateSizeFactors(), sizeFactors(), estimateDispersions() and nbinomWaldTest().

## Use the DESeqDataSetFromMatrix to create a DESeqDataSet object
dds0 <- DESeqDataSetFromMatrix(countData = count.table[,selected.samples ], colData = expDesign[selected.samples,], design = ~ strain)
print(dds0)
class: DESeqDataSet 
dim: 6887 20 
metadata(1): version
assays(1): counts
rownames(6887): 15s_rrna 21s_rrna ... ty(gua)m2 ty(gua)o
rowData names(0):
colnames(20): WT26 WT42 ... Snf42 Snf28
colData names(3): label strain color
## What kind of object is it ?
is(dds0)
[1] "DESeqDataSet"               "RangedSummarizedExperiment"
[3] "SummarizedExperiment"       "Vector"                    
[5] "Annotated"                 
isS4(dds0)
[1] TRUE
## What does it contain ?
# The list of slot names
slotNames(dds0)
[1] "design"             "dispersionFunction" "rowRanges"         
[4] "colData"            "assays"             "NAMES"             
[7] "elementMetadata"    "metadata"          
## Get some help about the "DESeqDataSet" class.
## NOT RUN
#?"DESeqDataSet-class"

Normalization

The normalization procedure (RLE) is implemented through the estimateSizeFactors function.

How is the scaling factor computed ?

Given a matrix with \(p\) columns (samples) and \(n\) rows (genes), this function estimates the size factors as follows: each count is divided by the geometric mean of its row (gene). For each sample (column), the median (or, if requested, another location estimator) of these ratios, skipping the genes with a geometric mean of zero, is used as the size factor for this column.

The scaling factor for sample \(j\) is thus obtained as:

\[sf_{j} = \text{median}_{g}\left(\frac{K_{g,j}}{\left(\prod_{j'=1}^{p} K_{g,j'}\right)^{1/p}}\right) \]

### Let's implement such a function
### cds is a DESeqDataSet object
estimSf <- function (cds){
    # Get the count matrix
    cts <- counts(cds)
    
    # Compute the geometric mean
    geomMean <- function(x) prod(x)^(1/length(x))

    # Compute the geometric mean of each row (gene)
    gm.mean  <-  apply(cts, 1, geomMean)
    
    # Zero values are set to NA (avoid subsequent division by 0)
    gm.mean[gm.mean == 0] <- NA
    
    # Divide each row by its corresponding geometric mean
    # sweep(x, MARGIN, STATS, FUN = "-", check.margin = TRUE, ...)
    # MARGIN: 1 or 2 (rows or columns)
    # STATS: a vector of length nrow(x) or ncol(x), depending on MARGIN
    # FUN: the function to be applied
    cts <- sweep(cts, 1, gm.mean, FUN="/")
    
    # Compute the median over the columns
    med <- apply(cts, 2, median, na.rm=TRUE)
    
    # Return the scaling factor
    return(med)
}

Now, check that the results obtained with our function are the same as those produced by DESeq2. The method implementing the normalization for the DESeqDataSet class is estimateSizeFactors().

## Normalizing using the estimateSizeFactors() method for an object of class "DESeqDataSet"
dds.norm <-  estimateSizeFactors(dds0)
sizeFactors(dds.norm)
 WT26  WT42  WT48  WT21  WT43  WT20  WT29  WT24   WT5  WT36 Snf12  Snf2 
1.049 0.864 0.919 1.135 0.797 0.680 0.944 0.982 0.953 0.841 1.047 1.170 
Snf16 Snf43 Snf40 Snf30 Snf27 Snf41 Snf42 Snf28 
1.330 1.316 1.158 1.012 0.888 1.318 0.914 1.043 
## Now get the scaling factors with our home-made function
head(estimSf(dds0)) 
 WT26  WT42  WT48  WT21  WT43  WT20 
1.049 0.864 0.919 1.135 0.797 0.680 
all(round(estimSf(dds0),6) == round(sizeFactors(dds.norm), 6))
[1] TRUE
## Checking the normalization
par(mfrow=c(2,2),cex.lab=0.7)
boxplot(log2(counts(dds.norm)+epsilon),  col=col.pheno.selected, cex.axis=0.7, 
        las=1, xlab="log2(counts)", horizontal=TRUE, main="Raw counts")
boxplot(log2(counts(dds.norm, normalized=TRUE)+epsilon),  col=col.pheno.selected, cex.axis=0.7, 
        las=1, xlab="log2(normalized counts)", horizontal=TRUE, main="Normalized counts") 
plotDensity(log2(counts(dds.norm)+epsilon),  col=col.pheno.selected, 
            xlab="log2(counts)", cex.lab=0.7, panel.first=grid()) 
plotDensity(log2(counts(dds.norm, normalized=TRUE)+epsilon), col=col.pheno.selected, 
            xlab="log2(normalized counts)", cex.lab=0.7, panel.first=grid()) 
Impact of the count normalization.

Modeling read counts

Let us imagine that we produced many RNA-Seq experiments from the same sample (technical replicates). For each gene \(g\), the measured read counts would be expected to vary rather slightly around the expected mean, and would probably be well modeled by a Poisson distribution. However, when working with biological replicates, more variation is intrinsically expected. Indeed, the measured expression values for each gene fluctuate more widely, due to the combination of biological and technical factors: inter-individual variation in gene regulation, sample purity, cell-synchronization issues or responses to the environment (e.g. heat shock).

The Poisson distribution has a single parameter, \(\lambda\), which is both its expected mean and its variance. Thus, in most cases, the Poisson distribution is not expected to fit the count distribution across biological replicates very well, since we expect some over-dispersion (greater variability) due to biological noise.

As a consequence, when working with RNA-Seq data, many of the current approaches for differential expression calling rely on an alternative distribution: the negative binomial (note that this also holds for other sequencing-based approaches with replicates, e.g. ChIP-Seq).
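
A quick simulation illustrates this over-dispersion (a sketch only, with arbitrary parameter values; the mu/size parametrization of rnbinom() is detailed further below):

## Poisson vs negative binomial counts with the same expected mean
set.seed(42)                                 # arbitrary seed, for reproducibility
m <- 100                                     # expected mean count for one gene
pois.counts <- rpois(10000, lambda=m)
nb.counts <- rnbinom(10000, mu=m, size=5)    # size = dispersion parameter (arbitrary value)
var(pois.counts)   # close to the mean (~100): no over-dispersion
var(nb.counts)     # much larger (~ m + m^2/size = 2100): over-dispersed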

What is the negative binomial ?

The negative binomial distribution is a discrete distribution that gives the probability of observing \(x\) failures before a target number of successes \(n\) is reached. As we will see below, the negative binomial can also be used to model over-dispersed data (the over-dispersion being measured relative to the Poisson model).

The probability of \(x\) failures before \(n\) success

First, given a Bernoulli trial with a probability \(p\) of success, the negative binomial distribution describes the probability of observing \(x\) failures before a target number of successes \(n\) is reached. The parameters of the distribution are thus \(p\) and \(n\) (in the dnbinom() function of R, \(n\) and \(p\) correspond to the arguments size and prob, respectively).

\[P_{NegBin}(x; n, p) = \binom{x+n-1}{x}\cdot p^n \cdot (1-p)^x = C^{x}_{x+n-1}\cdot p^n \cdot (1-p)^x \]

In this formula, \(p^n\) denotes the probability of observing \(n\) successes, \((1-p)^x\) the probability of \(x\) failures, and the binomial coefficient \(C^{x}_{x+n-1}\) indicates the number of possible ways to arrange the \(x\) failures among the \(x+n-1\) trials that precede the last one (the problem statement imposes that the last trial is a success).

The negative binomial distribution has expected value \(n\frac{q}{p}\) and variance \(n\frac{q}{p^2}\), where \(q = 1-p\). Some examples of using this distribution in R are provided below.

Particular case: when \(n=1\) the negative binomial corresponds to the geometric distribution, which models the probability distribution of observing the first success after \(x\) failures: \(P_{NegBin}(x; 1, p) = P_{geom}(x; p) = p \cdot (1-p)^x\).

par(mfrow=c(1,1))

## Some intuition about the negative binomial parameterized using n and p.
## The simple case, one success (see geometric distribution)
# Let's have a look at the density
p <- 1/6 # the probability of success
n <- 1   # target for number of successful trials

# The density function
plot(0:10, dnbinom(0:10, n, p), type="h", col="blue", lwd=4)
Negative binomial distribution (density) for \(n=1\) and \(p=1/6\).

# The probability of zero failures before one success,
# i.e. the probability of success at the first trial
dnbinom(0, n , p)
[1] 0.167
## i.e. the probability of at most 5 failures before one success. 
sum(dnbinom(0:5, n , p)) # == pnbinom(5, 1, p)
[1] 0.665
## The probability of at most 10 failures before one success 
sum(dnbinom(0:10, n , p)) # == pnbinom(10, 1, p)
[1] 0.865
## The probability of more than 10 failures before one success
1-sum(dnbinom(0:10, n , p)) # == 1 - pnbinom(10, 1, p)
[1] 0.135
## With two successes
## The probability of x failures before two successes (e.g. two sixes)
n <- 2
plot(0:30, dnbinom(0:30, n, p), type="h", col="blue", lwd=2,
     main="Negative binomial density",
     ylab="P(x; n,p)",
     xlab=paste("x = number of failures before", n, "successes"))

# Expected value
q <- 1-p
(ev <- n*q/p)
[1] 10
abline(v=ev, col="darkgreen", lwd=2)

# Variance 
(v <- n*q/p^2)
[1] 60
arrows(x0=ev-sqrt(v), y0 = 0.04, x1=ev+sqrt(v), y1=0.04, col="brown", lwd=2, code=3, length=0.2, angle=20)
Negative binomial distribution (density) for \(n=2\) and \(p=1/6\). The vertical line indicates the expected value; the arrow spans one standard deviation on each side of it.

Using mean and dispersion

The second way of parameterizing the distribution is with the mean value \(m\) and the dispersion parameter \(r\) (in the dnbinom() function of R, \(m\) and \(r\) correspond to the arguments mu and size, respectively). The variance of the distribution can then be computed as \(m + m^2/r\).

Note that \(m\) can be deduced from \(n\) and \(p\): \(m = n\frac{q}{p}\), with \(q = 1-p\) (this is the expected value given above).

n <- 10
p <- 1/6
q <- 1-p
mu <- n*q/p

all(dnbinom(0:100, mu=mu, size=n) == dnbinom(0:100, size=n, prob=p))
[1] TRUE

Modelling read counts through a negative binomial

To perform differential expression calls, DESeq2 assumes that, for each gene, the read counts are generated by a negative binomial distribution. One problem here is to estimate, for each gene, the two parameters of the negative binomial distribution: mean and dispersion.

  • The mean will be estimated from the observed normalized counts in both conditions.

  • The first step will be to compute a gene-wise dispersion. When the number of available samples is insufficient to obtain a reliable estimate of the variance for each gene, DESeq2 applies a shrinkage strategy, which assumes that genes with similar expression levels (counts) have similar dispersion (note that this is a strong assumption). DESeq2 regresses the gene-wise dispersion estimates onto the means of the normalized counts to obtain a dispersion estimate that is subsequently used to build the negative binomial model for each gene.

## Performing estimation of dispersion parameter
dds.disp <- estimateDispersions(dds.norm)

## A diagnostic plot which
## shows the mean of normalized counts (x axis)
## and the dispersion estimate (y axis) for each gene
plotDispEsts(dds.disp)


Performing differential expression call

Now that a negative binomial model has been fitted for each gene, the nbinomWaldTest() function can be used to test for differential expression. The results() function then returns a table (a DESeqResults object) containing the nominal p-values, as well as FDR values (adjusted p-values computed with the Benjamini–Hochberg procedure).

alpha <- 0.0001
wald.test <- nbinomWaldTest(dds.disp)
res.DESeq2 <- results(wald.test, alpha=alpha, pAdjustMethod="BH")

## What kind of object is returned by results()?
class(res.DESeq2)
[1] "DESeqResults"
attr(,"package")
[1] "DESeq2"
is(res.DESeq2) # a DataFrame (S4), not a basic data.frame
[1] "DESeqResults"    "DataFrame"       "DataTable"       "SimpleList"     
[5] "DataTableORNULL" "List"            "Vector"          "Annotated"      
slotNames(res.DESeq2)
[1] "rownames"        "nrows"           "listData"        "elementType"    
[5] "elementMetadata" "metadata"       
head(res.DESeq2)
log2 fold change (MAP): strain WT vs Snf 
Wald test p-value: strain WT vs Snf 
DataFrame with 6 rows and 6 columns
          baseMean log2FoldChange     lfcSE      stat    pvalue      padj
         <numeric>      <numeric> <numeric> <numeric> <numeric> <numeric>
15s_rrna     17.79         0.3235    0.4260     0.759   0.44768    0.5795
21s_rrna    121.72         0.2596    0.4108     0.632   0.52733    0.6551
hra1          2.31         0.7676    0.3931     1.953   0.05086    0.1025
icr1        131.94        -0.2572    0.0962    -2.673   0.00751    0.0199
lsr1        215.12         0.3479    0.3217     1.081   0.27957    0.4062
nme1         25.27        -0.0496    0.2507    -0.198   0.84324    0.8998
## The column names of the results table.
## Note that the padj column
## contains FDR values (computed with the Benjamini–Hochberg procedure)
colnames(res.DESeq2)
[1] "baseMean"       "log2FoldChange" "lfcSE"          "stat"          
[5] "pvalue"         "padj"          
## Order the table by increasing adjusted p-value (padj)
res.DESeq2 <- res.DESeq2[order(res.DESeq2$padj),]
head(res.DESeq2)
log2 fold change (MAP): strain WT vs Snf 
Wald test p-value: strain WT vs Snf 
DataFrame with 6 rows and 6 columns
         baseMean log2FoldChange     lfcSE      stat    pvalue      padj
        <numeric>      <numeric> <numeric> <numeric> <numeric> <numeric>
yar071w      1459           4.04    0.1049      38.5  0.00e+00  0.00e+00
ygr234w      2690           4.09    0.0977      41.8  0.00e+00  0.00e+00
yil121w       813           2.75    0.0551      49.9  0.00e+00  0.00e+00
yml123c      4276           4.45    0.1148      38.8  0.00e+00  0.00e+00
yhr215w      1158           4.16    0.1120      37.1 1.68e-301 2.31e-298
yor290c       765           6.95    0.1910      36.4 1.93e-289 2.21e-286
## Draw a histogram of the adjusted p-values
hist(res.DESeq2$padj, breaks=20, col="grey", main="DESeq2 adjusted p-value distribution", xlab="DESeq2 adjusted p-value", ylab="Number of genes")
Histogram of the adjusted p-values reported by DESeq2.

Volcano plot

alpha <- 0.01 # Threshold on the adjusted p-value
cols <- densCols(res.DESeq2$log2FoldChange, -log10(res.DESeq2$padj))
plot(res.DESeq2$log2FoldChange, -log10(res.DESeq2$padj), col=cols, panel.first=grid(),
     main="Volcano plot", xlab="Effect size: log2(fold-change)", ylab="-log10(adjusted p-value)",
     pch=20, cex=0.6)
abline(v=0)
abline(v=c(-1,1), col="brown")
abline(h=-log10(alpha), col="brown")

gn.selected <- abs(res.DESeq2$log2FoldChange) > 2 & res.DESeq2$padj < alpha 
text(res.DESeq2$log2FoldChange[gn.selected],
     -log10(res.DESeq2$padj)[gn.selected],
     lab=rownames(res.DESeq2)[gn.selected ], cex=0.4)
Volcano plot of DESeq2 results. Abscissa: log2(fold-change). Ordinate: significance (\(-log_{10}\) of the adjusted p-value).

Check the expression levels of the most differentially expressed gene

It may be important to check the validity of our analysis by simply inspecting the expression levels of the most significant gene.

gn.most.sign <- rownames(res.DESeq2)[1]
gn.most.diff.val <- counts(dds.norm, normalized=T)[gn.most.sign,]
barplot(gn.most.diff.val, col=col.pheno.selected, main=gn.most.sign, las=2, cex.names=0.5)
Barplot of the normalized counts per sample for the most significant gene.

Looking at the results with a MA plot

One popular diagram in DNA chip (microarray) analysis is the M versus A plot (MA plot) between two conditions \(a\) and \(b\). In this representation:

  • M (Minus) is the log ratio of counts calculated for any gene. \[M_g = log2(\bar{x}_{g,a}) - log2(\bar{x}_{g,b})\]
  • A (Add) is the average log count, which corresponds to an estimate of the gene expression level. \[A_g = \frac{1}{2}(log2(\bar{x}_{g,a}) + log2(\bar{x}_{g,b}))\]
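
These two quantities can be computed by hand from the normalized counts. Here is a minimal sketch, not part of the original code, using the dds.norm object and the pseudo-count epsilon defined above (epsilon only avoids log2(0) for genes with a null mean in one strain):

## Hand-made MA plot from the normalized counts
norm.counts <- counts(dds.norm, normalized=TRUE)
mean.WT <- rowMeans(norm.counts[, colData(dds.norm)$strain == "WT"])
mean.Snf <- rowMeans(norm.counts[, colData(dds.norm)$strain == "Snf"])
M <- log2(mean.WT + epsilon) - log2(mean.Snf + epsilon)
A <- 0.5 * (log2(mean.WT + epsilon) + log2(mean.Snf + epsilon))
plot(A, M, col=densCols(A, M), pch=20, cex=0.4, panel.first=grid(),
     xlab="A = mean log2(normalized counts)", ylab="M = log2(fold-change), WT vs Snf")
abline(h=0, col="brown")

The plotMA() function of DESeq2 draws a similar representation directly from the results table: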
## Draw a MA plot.
## Significant genes (low adjusted p-values) are highlighted
plotMA(res.DESeq2, colNonSig = "blue")
abline(h=c(-1:1), col="red")
MA plot. The abscissa indicates the mean of the normalized counts; the ordinate the log2(fold-change).


Hierarchical clustering

To check that the selected genes distinguish well between the WT and Snf2 mutant samples, we will perform a hierarchical clustering using the heatmap.2() function from the gplots library.

## We select gene names based on FDR (1%)
gene.kept <- rownames(res.DESeq2)[res.DESeq2$padj <= alpha & !is.na(res.DESeq2$padj)]

## We retrieve the log2-transformed raw counts for the genes of interest (all 96 samples)
count.table.kept <- log2(count.table + epsilon)[gene.kept, ]
dim(count.table.kept)
[1] 2319   96
## Install the gplots library if needed then load it
if(!require("gplots")){
  install.packages("gplots")
}
library("gplots")

## Perform the hierarchical clustering with
## A distance based on Pearson-correlation coefficient
## and average linkage clustering as agglomeration criteria
heatmap.2(as.matrix(count.table.kept), 
          scale="row", 
          hclustfun=function(x) hclust(x, method="average"), 
          distfun=function(x) as.dist((1-cor(t(x)))/2), 
          trace="none", 
          density.info="none", 
          labRow="",
          cexCol=0.7)
Heatmap of the genes declared significant with DESeq2. Rows correspond to genes, columns to samples.

Functional enrichment

We will now perform a functional enrichment analysis using the list of induced genes. This step will be performed with the gProfileR R library.

library(gProfileR)

res.DESeq2.df <- na.omit(data.frame(res.DESeq2))
induced.sign <- rownames(res.DESeq2.df)[res.DESeq2.df$log2FoldChange >= 2 &  res.DESeq2.df$padj < alpha]
# head(induced.sign)
# names(term.induced)

term.induced <- gprofiler(query=induced.sign, organism="scerevisiae")
term.induced <- term.induced[order(term.induced$p.value),]
# term.induced$p.value
kable(term.induced[1:10,c("term.name",
                      "term.size",
                      "query.size",
                      "overlap.size",
                      "recall",
                      "precision",
                      "p.value", 
                      "intersection")], 
      format.args=list(digits=3), caption="**Table: functional analysis with gProfileR.** ")
Table: functional analysis with gProfileR.
term.name term.size query.size overlap.size recall precision p.value intersection
41 RNA-DNA hybrid ribonuclease activity 48 74 26 0.351 0.542 0 YAR009C,YBL005W-B,YDR098C-B,YDR261C-D,YDR316W-B,YDR365W-B,YER138C,YER160C,YGR027W-B,YGR038C-B,YHR214C-B,YJR027W,YJR029W,YLR157C-B,YLR227W-B,YML039W,YML045W,YMR045C,YMR050C,YNL284C-B,YOL103W-B,YOR142W-B,YPL257W-B,YPR137C-B,YPR158C-D,YPR158W-B
7 DNA integration 50 74 26 0.351 0.520 0 YAR009C,YBL005W-B,YDR098C-B,YDR261C-D,YDR316W-B,YDR365W-B,YER138C,YER160C,YGR027W-B,YGR038C-B,YHR214C-B,YJR027W,YJR029W,YLR157C-B,YLR227W-B,YML039W,YML045W,YMR045C,YMR050C,YNL284C-B,YOL103W-B,YOR142W-B,YPL257W-B,YPR137C-B,YPR158C-D,YPR158W-B
28 RNA-directed DNA polymerase activity 52 74 26 0.351 0.500 0 YAR009C,YBL005W-B,YDR098C-B,YDR261C-D,YDR316W-B,YDR365W-B,YER138C,YER160C,YGR027W-B,YGR038C-B,YHR214C-B,YJR027W,YJR029W,YLR157C-B,YLR227W-B,YML039W,YML045W,YMR045C,YMR050C,YNL284C-B,YOL103W-B,YOR142W-B,YPL257W-B,YPR137C-B,YPR158C-D,YPR158W-B
44 aspartic-type peptidase activity 54 74 25 0.338 0.463 0 YBL005W-B,YDR098C-B,YDR261C-D,YDR316W-B,YDR365W-B,YER138C,YER160C,YGR027W-B,YGR038C-B,YHR214C-B,YJR027W,YJR029W,YLR157C-B,YLR227W-B,YML039W,YML045W,YMR045C,YMR050C,YNL284C-B,YOL103W-B,YOR142W-B,YPL257W-B,YPR137C-B,YPR158C-D,YPR158W-B
46 aspartic-type endopeptidase activity 54 74 25 0.338 0.463 0 YBL005W-B,YDR098C-B,YDR261C-D,YDR316W-B,YDR365W-B,YER138C,YER160C,YGR027W-B,YGR038C-B,YHR214C-B,YJR027W,YJR029W,YLR157C-B,YLR227W-B,YML039W,YML045W,YMR045C,YMR050C,YNL284C-B,YOL103W-B,YOR142W-B,YPL257W-B,YPR137C-B,YPR158C-D,YPR158W-B
40 endoribonuclease activity, producing 5’-phosphomonoesters 63 74 26 0.351 0.413 0 YAR009C,YBL005W-B,YDR098C-B,YDR261C-D,YDR316W-B,YDR365W-B,YER138C,YER160C,YGR027W-B,YGR038C-B,YHR214C-B,YJR027W,YJR029W,YLR157C-B,YLR227W-B,YML039W,YML045W,YMR045C,YMR050C,YNL284C-B,YOL103W-B,YOR142W-B,YPL257W-B,YPR137C-B,YPR158C-D,YPR158W-B
27 DNA-directed DNA polymerase activity 64 74 26 0.351 0.406 0 YAR009C,YBL005W-B,YDR098C-B,YDR261C-D,YDR316W-B,YDR365W-B,YER138C,YER160C,YGR027W-B,YGR038C-B,YHR214C-B,YJR027W,YJR029W,YLR157C-B,YLR227W-B,YML039W,YML045W,YMR045C,YMR050C,YNL284C-B,YOL103W-B,YOR142W-B,YPL257W-B,YPR137C-B,YPR158C-D,YPR158W-B
26 DNA polymerase activity 68 74 26 0.351 0.382 0 YAR009C,YBL005W-B,YDR098C-B,YDR261C-D,YDR316W-B,YDR365W-B,YER138C,YER160C,YGR027W-B,YGR038C-B,YHR214C-B,YJR027W,YJR029W,YLR157C-B,YLR227W-B,YML039W,YML045W,YMR045C,YMR050C,YNL284C-B,YOL103W-B,YOR142W-B,YPL257W-B,YPR137C-B,YPR158C-D,YPR158W-B
37 endonuclease activity, active with either ribo- or deoxyribonucleic acids and producing 5’-phosphomonoesters 71 74 26 0.351 0.366 0 YAR009C,YBL005W-B,YDR098C-B,YDR261C-D,YDR316W-B,YDR365W-B,YER138C,YER160C,YGR027W-B,YGR038C-B,YHR214C-B,YJR027W,YJR029W,YLR157C-B,YLR227W-B,YML039W,YML045W,YMR045C,YMR050C,YNL284C-B,YOL103W-B,YOR142W-B,YPL257W-B,YPR137C-B,YPR158C-D,YPR158W-B
39 endoribonuclease activity 77 74 26 0.351 0.338 0 YAR009C,YBL005W-B,YDR098C-B,YDR261C-D,YDR316W-B,YDR365W-B,YER138C,YER160C,YGR027W-B,YGR038C-B,YHR214C-B,YJR027W,YJR029W,YLR157C-B,YLR227W-B,YML039W,YML045W,YMR045C,YMR050C,YNL284C-B,YOL103W-B,YOR142W-B,YPL257W-B,YPR137C-B,YPR158C-D,YPR158W-B

And now using the list of repressed genes.

res.DESeq2.df <- na.omit(data.frame(res.DESeq2))
repressed.sign <- rownames(res.DESeq2.df)[res.DESeq2.df$log2FoldChange <= -2 &  res.DESeq2.df$padj < alpha]
head(repressed.sign)
[1] "yer081w" "ypl025c" "ymr122c" "yor348c" "ygr087c" "yml122c"
term.repressed <- gprofiler(query=repressed.sign, organism="scerevisiae")
term.repressed <- term.repressed[order(term.repressed$p.value),]
kable(head(term.repressed[,c("p.value", "term.name","intersection")], 10))

Assess the effect of sample number on differential expression call

Using a loop, randomly select, 10 times each, 2, 5, 10, 15, ... 45 samples from the WT and Snf2 KO groups. Perform the differential expression calls and draw a diagram showing the number of differentially expressed genes as a function of the number of replicates. A possible skeleton is sketched below.
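
A possible skeleton for this exercise (a sketch only: it assumes the count.table and expDesign objects defined above, uses the DESeq() wrapper, which chains the size factor, dispersion and Wald test steps detailed earlier, and can take a while to run for the larger sample sizes):

## Number of genes declared significant as a function of the number of replicates
alpha <- 0.01                       # significance threshold on the adjusted p-value
replicate.sizes <- c(2, 5, 10, 15)  # extend up to 45 if time permits
n.iterations <- 10                  # number of random selections per size
nb.signif <- data.frame()

for (n in replicate.sizes) {
  for (i in 1:n.iterations) {
    ## Randomly pick n WT (columns 1 to 48) and n Snf2 (columns 49 to 96) samples
    selected <- c(sample(1:48, n), sample(49:96, n))
    dds.i <- DESeqDataSetFromMatrix(countData = count.table[, selected],
                                    colData = expDesign[selected, ],
                                    design = ~ strain)
    res.i <- results(DESeq(dds.i))
    nb.signif <- rbind(nb.signif,
                       data.frame(n = n,
                                  signif = sum(res.i$padj < alpha, na.rm = TRUE)))
  }
}

## Distribution of the number of significant genes for each number of replicates
boxplot(signif ~ n, data = nb.signif,
        xlab = "Number of replicates per genotype",
        ylab = paste("Genes with adjusted p-value <", alpha))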

## Create a directory to store the results that will be obtained below
dir.results <- file.path(dir.snf2, "results")
dir.create(dir.results, showWarnings = FALSE, recursive = TRUE)

## Export the table with statistics per sample.
write.table(stats.per.sample, file=file.path(dir.results, "stats_per_sample.tsv"),
            quote=FALSE, sep="\t", col.names =NA, row.names = TRUE)

# Export the DESeq2 result table
DESeq2.table <- file.path(dir.results, "yeast_Snf2_vs_WT_DESeq2_diff.tsv")
write.table(res.DESeq2, file=DESeq2.table, col.names = NA, row.names = TRUE, sep="\t", quote = FALSE)

  1. A factor is a vector with levels (categories), which permits efficient storage and indexing, but can in some cases lead to misleading effects. To circumvent this, we will sometimes have to convert the factor to a vector, with the R command as.vector().
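
For instance (a small illustration, assuming the strain column was read as a factor, which was the read.table() default before R 4.0):

class(expDesign$strain)            # "factor"
levels(expDesign$strain)           # the categories: "Snf" "WT"
head(as.vector(expDesign$strain))  # back to a plain character vector: "WT" "WT" ...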