Denis Puthier and Jacques van Helden
This tutorial is just a brief tour of the language capabilities and is intended to give some clues to begin with the R programming language. For a more detailed overview, see R for beginners (E. Paradis).
R is an object-oriented programming language. You can easily create basic objects of class vector, matrix, data.frame, list, factor,...
Below, we create a vector x that contains one value. You can see the content of x by simply calling it.
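A minimal sketch of this step (the value 15 is an arbitrary choice made for illustration):

```r
# Assign the value 15 to x (15 is an arbitrary illustrative value)
x <- 15
# Typing the name of an object prints its content
x
```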
Alternatively, you can assign a value to x by using the "=" operator. However, "<-" is generally preferred.
In R, anything on a line after a hash mark (#) is a comment and is ignored by the interpreter.
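The two points above can be sketched as follows (the values 22 and 33 are arbitrary, chosen only for illustration):

```r
# Assignment with "=" has the same effect here as "<-"
x = 22
x
# x <- 33   this whole line is a comment, so it is never executed
x           # x still holds the value assigned with "="
```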
Instructions can be separated by semicolons (;) or new lines.
x <- 12; y <- 13
x; y
Once values are assigned to an object, R stores this object in memory. Previously created objects can be listed using the ls function.
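As a small illustration (the object names and values are arbitrary, and the variable objs is introduced here only to hold the result):

```r
x <- 12
y <- 13
# ls() returns a character vector with the names of the objects
# defined in the current environment
objs <- ls()
objs
"x" %in% objs   # TRUE when an object named x exists
```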
Objects can be deleted using the rm (remove) function.
rm(x)
rm(y)
ls()
In the above section we have created vectors containing numeric data. We have also used functions (ls and rm). We can use numerous functions to perform specific tasks. When calling a function, we will use this generic syntax:
NameOfTheFunction(arg1=a, arg2=b, ...)
To access the documentation of a given function, use the help function (or the question mark). The documentation gives you an overview of the function: its usage, its arguments, the value it returns, and examples.
For instance to get information about the substr function (used to extract part of a character string) use one of the following instructions:
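Both forms below request the same help page. This is a minimal sketch; how the page is displayed (pager, browser, or console) depends on your R setup.

```r
help(substr)   # full form
?substr        # shorthand for the same request
```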
When calling a function, the name of the arguments can be omitted if they are placed as expected. For instance if one wants to extract character 2 to 4 in the string "microarray":
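A short sketch of such a call, relying on the argument order of substr (x, then start, then stop):

```r
# Arguments are matched by position: x = "microarray", start = 2, stop = 4
res <- substr("microarray", 2, 4)
res   # "icr"
```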
If the arguments are not in the expected order their names are mandatory (note that, for convenience, they can be abbreviated but the abbreviation used should be unambiguous):
substr(st=2, st=4, x="microarray")  # ambiguous: R throws an error message
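For comparison, here is a sketch of calls with named arguments that do work, followed by the ambiguous call wrapped in tryCatch so that the error message is captured instead of stopping the script. The variable names a, b and msg are introduced here for illustration only.

```r
# Full argument names: the order no longer matters
a <- substr(start = 2, stop = 4, x = "microarray")
# Unambiguous abbreviations: "sta" can only mean start, "sto" can only mean stop
b <- substr(sta = 2, sto = 4, x = "microarray")
a; b
# "st" could mean start or stop, so R refuses the call;
# tryCatch captures the error message rather than halting
msg <- tryCatch(substr(st = 2, st = 4, x = "microarray"),
                error = function(e) conditionMessage(e))
msg
```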
The function c is used to combine values into a vector. A vector can contain several values of the same mode. Most frequently, the mode will be one of: "numeric", "character" or "logical".
mic <- c("Agilent", "Affy")  # a character vector
mic
class(mic)  # or is(mic)
num <- c(1, 2, 3)  # a numeric vector
num
class(num)
bool <- c(T, F, T)  # a logical vector
class(bool)
The rep function repeats a value as many times as requested.
The seq (sequence) function is used to generate regular sequences of numbers.
rep(3, 5)
seq(0, 10, by = 2)
seq(0, 10, length.out = 7)
The rnorm (random normal) function is used to generate normally distributed values with mean equal to 'mean' (default 0) and standard deviation equal to 'sd' (default 1).
Additional distributions are available, for instance runif (random uniform) and rpois (random Poisson).
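A brief sketch of these generators. The sample size and the parameter values (min, max, lambda, mean, sd) are arbitrary choices for illustration, as are the variable names u, p and n.

```r
set.seed(1)                        # fix the seed so that the draws are reproducible
u <- runif(5, min = 0, max = 1)    # 5 values drawn uniformly between 0 and 1
p <- rpois(5, lambda = 3)          # 5 Poisson counts with mean 3
n <- rnorm(5, mean = 10, sd = 2)   # 5 normal values with mean 10 and sd 2
u; p; n
```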
set.seed(1)
x <- round(rnorm(10), 2)
x
x[1:3]
x[c(2, 6)]
which(x > 0)     # returns the positions containing positive values
x[which(x > 0)]  # returns the requested positive values (using a vector of integers)
x > 0            # returns TRUE/FALSE for each position
x[x > 0]         # same result as x[which(x > 0)]
nm <- paste("pos", 1:10, sep = "_")
nm
names(x) <- nm
x
x["pos_10"]      # indexing with the names of the elements
To replace values in a vector, simply use the <- operator on the indexed positions. Note that in R, missing values are denoted NA (Not Available).
x[1:2] <- c(10, 11)
x
x[4:6] <- NA
x
is.na(x)         # returns TRUE if the position is NA
x <- na.omit(x)  # to delete NA values (or x[!is.na(x)])
x
R is intended to handle large data sets and to retrieve information using a concise syntax. Thanks to an internal feature of R called vectorization, numerous operations can be written without a loop:
x <- 0:10
y <- 20:30
x + y
x^2
A factor looks like a vector. It is used to store categorical variables. A vector can be converted to a factor using the as.factor function. The levels function can be used to extract the names of the categories and to rename them.
x <- rep(c("good", "bad"), 5)
x
x <- as.factor(x)
x  # note that levels are displayed now
levels(x)
levels(x) <- 0:1
x
table(x)
Matrix objects are intended to store 2-dimensional datasets. Each value will be of the same mode. As with vectors, one can use names, numeric vectors or a logical vector for indexing this object. One can index rows or columns or both.
x <- matrix(1:10, ncol = 2)
colnames(x) <- c("ctrl", "trmt")
row.names(x) <- paste("gene", 1:5, sep = "_")
x
x[, 1]   # first column
x[1, ]   # first row
x[1, 2]  # row 1 and column 2
x[c(T, F, T, T, T), ]  # rows selected with a logical vector
Note that the syntax below, which uses a logical matrix, is also frequently used to extract or replace part of a matrix.
x > 2 & x < 8
x[x > 2 & x < 8] <- NA
A data.frame is very similar to a matrix, except that each column can have its own mode (a column with characters, a column with logicals, a column with numerics, ...).
Columns from a data.frame can also be extracted using the $ operator.
x <- as.data.frame(x)
x
x$ctrl
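To illustrate columns of different modes, here is a small sketch built from scratch. The data frame df, its column names and its values are hypothetical and serve only as an example.

```r
df <- data.frame(
  gene     = c("gene_1", "gene_2", "gene_3"),  # character column
  ctrl     = c(1.2, 3.4, 5.6),                 # numeric column
  detected = c(TRUE, FALSE, TRUE),             # logical column
  stringsAsFactors = FALSE                     # keep strings as characters
)
sapply(df, class)   # one class per column
df$ctrl             # extract a column with the $ operator
df[df$detected, ]   # keep the rows where detected is TRUE
```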
Objects of class list can store any type of object. They should be indexed with the "[[" or $ operators.
l1 <- list(A = x, B = rnorm(10))
l1
l1[[1]]
l1[[2]]
l1$A
The apply family of functions is used to loop through the rows and columns of a matrix (or data.frame) or through the elements of a list.
x <- matrix(rnorm(20), ncol = 4)
apply(x, MARGIN = 1, min)  # extract min value for each row (MARGIN=1)
apply(x, MARGIN = 2, min)  # extract min value for each column (MARGIN=2)
The lapply function is used for lists (or data.frames).
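A minimal sketch of lapply on a small list. The list l2 and its content are hypothetical; sapply is shown only as a closely related variant that simplifies the result.

```r
l2 <- list(A = 1:5, B = c(2.5, 3.5), C = 10)   # illustrative list
res <- lapply(l2, mean)   # applies mean to each element, returns a list
res
sapply(l2, mean)          # same computation, simplified to a named numeric vector
```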
The tapply function typically takes a vector and a factor as arguments. Let's say we have values (x) related to three categories ("good", "bad", "medium"). We can compute different statistics for each category:
cat <- rep(c("good", "bad", "medium"), 5)
cat <- as.factor(cat)
x <- rnorm(length(cat))
x[cat == "good"] <- x[cat == "good"] + 2
x[cat == "medium"] <- x[cat == "medium"] + 1
boxplot(x ~ cat)
tapply(x, cat, sd)
tapply(x, cat, mean)
tapply(x, cat, length)
R offers a large variety of high-level graphics functions (plot, boxplot, barplot, hist, pairs, image, ...). The generated graphics can be modified using low-level functions (points, text, lines, abline, rect, legend, ...).
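The distinction can be sketched on simulated data. The vectors a and b, the threshold of 1 and the colors are arbitrary choices for illustration.

```r
set.seed(1)
a <- rnorm(50)
b <- a + rnorm(50, sd = 0.5)
# High-level call: opens a new plot with axes and labels
plot(a, b, pch = 16, main = "High-level plot, low-level additions")
# Low-level calls: add elements to the existing plot
abline(h = 0, col = "red")                       # horizontal reference line
points(a[b > 1], b[b > 1], col = "blue", cex = 1.5)  # circle the points with b > 1
legend("topleft", legend = c("all points", "b > 1"),
       col = c("black", "blue"), pch = c(16, 1))
```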
path <- system.file("swirldata", package = "marray")
getwd()      # the current working directory
setwd(path)  # set working directory to "path"
getwd()      # the working directory has changed
dir()        # list files and directories in the current working directory
#file.show("swirl.1.spot")  # this file contains a header
d <- read.table("swirl.1.spot", header = T, sep = "\t", row.names = 1)
is(d)
colnames(d)
G <- d$Gmedian
R <- d$Rmedian
plot(R, G, pch = 16, cex = 0.5, col = "red")
R <- log2(R)
G <- log2(G)
M <- R - G
A <- R + G
plot(A, M, pch = 16, cex = 0.5)
low <- lowess(M ~ A)
lines(low, col = "blue", lwd = 2)  # lwd: line width
abline(h = 0, col = "red")         # h: horizontal
abline(h = -1, col = "green")
abline(h = 1, col = "green")
# We will only add gene names (here a numeric) for a subset of strongly induced/repressed genes
subset <- abs(M) > 1
points(A[subset], M[subset], col = "red")
gn <- 1:nrow(d)
text(A[subset], M[subset], lab = gn[subset], cex = 0.4, pos = 2)