Statistics for Bioinformatics
Practicals - Student Conformity Test

Introduction

Theory

This tutorial is an application of concepts seen in the following chapters of the course:

Prerequisite

  • This tutorial assumes you have already executed the script config.R, as described in the configuration page.

    Tutorial

    Question

    Microarrays were used to measure the level of expression of all the yeast genes in two different culture media: (1) minimal medium (measured in the green channel of the microarray); (2) minimal medium + methionine (measured in the red channel of the microarray). Three repetitions of the experiment were performed, and the log-ratios log10(Red/Green) were calculated for each microarray. For a given gene, we obtain the following values of log-ratio: 2.0, 3.1, 0.3.

    1. Is this gene significantly activated by methionine?
    2. How many false positives would we expect with this level of significance if the test were applied to 6200 genes?
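
The quantities computed step by step in the script can be summarized as follows. This is a sketch under the assumption that the reference mean is 0 (no change in expression) and that the alternative is one-sided (activation); the symbols mirror the variable names used in the script (ref.mean, sd.est, t.obs, P.value, G, E.value).

```latex
\begin{align*}
H_0 &: \mu = \mu_0 = 0 \qquad \text{versus} \qquad H_1 : \mu > \mu_0 \\
\hat{\sigma} &= \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2} \\
t_{obs} &= \frac{\bar{x} - \mu_0}{\hat{\sigma} / \sqrt{n}} \\
P &= \Pr\left(T_{n-1} \ge t_{obs}\right) \\
E &= P \cdot G
\end{align*}
```

Here T_{n-1} denotes a Student variable with n-1 degrees of freedom, and G is the number of genes tested.
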
    ## Descriptive statistics
    
    ## The sample is stored in a vector called x
    x <- c(2.0, 3.1, 0.3) 
    print(x) ## Check the sample
    
    ## Calculate sample size
    n <- length(x)
    print(n) ## Check sample size
    
    ## Calculate the sample mean
    sample.mean <- mean(x)
    
    ## Calculate the standard deviation of the sample
    ## (denominator n, without correction).
    ## The explicit formula is shown for didactic purposes only.
    sample.var <- mean((x - sample.mean)^2)
    sample.sd <- sqrt(sample.var)
    print(sample.sd)
    
    ## Estimate the standard deviation of the population
    ## This can be done by applying the correction on the sample standard deviation
    print(sample.sd*sqrt(n/(n-1)))
    
    ## Faster way: use the R function sd(), which automatically performs
    ## the n/(n-1) correction
    sd.est <- sd(x) 
    print(sd.est)
    
    ## Calculate the standard error of the mean
    standard.error <- sd.est/sqrt(n)
    print(standard.error)
    
    ## Calculate the observed Student statistic t.obs
    ref.mean <- 0
    t.obs <- (sample.mean - ref.mean )/standard.error
    print(t.obs)
    
    ## Draw the density curves of the Student distribution, and compare them to the normal distribution
    y <- seq(from=-5,to=5,by=0.1)
    
    ## Draw the normal distribution
    plot(y, dnorm(y), type="l", col="darkblue", panel.first=grid(col="black"))
    
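    ## help(dt)
    ## Overlay the Student densities for increasing degrees of freedom (d);
    ## the counter i is only used to pick a different color for each curve
    i <- 0
    for (d in c(1,2,3,4,5,10,100,1000)) {
         i <- i+1
         lines(y, dt(y,df=d),type="l",col=i)
    }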
    
    ## Calculate the P-value of t.obs
    ## (one-sided test: probability of the upper tail, with n-1 degrees of freedom)
    P.value <- pt(t.obs,df=n-1,lower.tail=FALSE)
    print(P.value)
    
    ## E-value: number of false positives expected if the same
    ## P-value threshold were applied to G genes
    G <- 6200
    E.value <- P.value*G
    print(E.value)
    
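    ## Same test with the built-in R function t.test()
    ## (one-sided alternative; the reference mean defaults to mu=0)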
    t.test(x,alternative="greater")
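
As a cross-check, the manual steps can be wrapped in a small function and compared with the result of t.test(). The following is a minimal sketch, not part of the original script: the function name student.conformity and its argument n.tests are hypothetical names introduced here for illustration.

```r
## Hypothetical helper (not part of the original script): one-sided
## Student conformity test of H0: mu = ref.mean against H1: mu > ref.mean.
student.conformity <- function(x, ref.mean = 0, n.tests = 1) {
  n <- length(x)
  standard.error <- sd(x) / sqrt(n)       ## sd() uses the n-1 denominator
  t.obs <- (mean(x) - ref.mean) / standard.error
  P.value <- pt(t.obs, df = n - 1, lower.tail = FALSE)
  list(t.obs = t.obs, df = n - 1, P.value = P.value,
       E.value = P.value * n.tests)       ## expected false positives
}

x <- c(2.0, 3.1, 0.3)
manual <- student.conformity(x, ref.mean = 0, n.tests = 6200)
builtin <- t.test(x, mu = 0, alternative = "greater")

## The two approaches should agree up to numerical precision
print(all.equal(manual$t.obs, unname(builtin$statistic)))
print(all.equal(manual$P.value, builtin$p.value))
print(manual)
```

With only three replicates, the P-value is roughly 0.08, so the gene would not be declared significant at the usual 0.05 threshold. Applied to 6200 genes, the same threshold would be expected to yield several hundred false positives.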