Introduction
General Overview
R (http://cran.at.r-project.org) is a comprehensive statistical environment and programming language for professional data analysis and graphical display. The associated Bioconductor project provides many additional R packages for statistical data analysis in different life science areas, such as tools for microarray, next-generation sequence and genome analysis. The R software is free and runs on all common operating systems.
Scope of this Manual
This R tutorial provides a condensed introduction to the usage of the R environment and its utilities for general data analysis and clustering. It also introduces a subset of packages from the Bioconductor project. The included packages are a 'personal selection' of the author of this manual and do not reflect the full utility spectrum of the R/Bioconductor projects. Many packages were chosen because the author uses them often for his own teaching and research. To obtain a broad overview of available R packages, it is strongly recommended to consult the official Bioconductor and R project sites. Due to the rapid development of most packages, it is also important to be aware that this manual will often not be fully up to date. Because of this and many other reasons, it is absolutely critical to use the original documentation of each package (PDF manual or vignette) as the primary source of documentation. Users are welcome to send suggestions for improving this manual directly to its author.
Format of this Manual
A practical, though not always easy to read, copy & paste format has been chosen throughout this manual. In this format all commands are represented in code boxes, where the comments are given in blue color. To save space, several commands are often concatenated on one line and separated with a semicolon ';'. All comments/explanations start with the standard comment sign '#' to prevent them from being interpreted by R as commands. This way several commands can be pasted with their comment text into the R console to demo the different functions and analysis steps. Commands starting with a '$' sign need to be executed from a Unix or Linux shell. Windows users can simply ignore them. Commands highlighted in red color are considered essential knowledge. They are important for someone interested in a quick start with R and Bioconductor. Where relevant, the output generated by R is given in green color. The remaining commands in black color often provide more in-depth information about a certain topic.
Installation of the R Software and R Packages
Basic R Usage
$ R # Starts the R console under Unix/Linux. The R GUI versions under Windows and Mac OS X can be started by double-clicking their icons.
object <- function(arguments) # This general R command syntax uses the assignment operator '<-' to assign the data generated by the command on its right to the object on its left.
object = function(arguments) # A more recently introduced assignment operator is '='. Both assign the same way; '<-' additionally works in the reverse direction as '->'. For consistency reasons one should use only one of them.
assign("x", function(arguments)) # Has the same effect, but uses the assignment function instead of the assignment operator.
source("my_script") # Command to execute an R script, here 'my_script'. For example, generate a text file 'my_script' with the command 'print(1:100)', then execute it with the source function.
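The example in the comment above can be reproduced directly from the console (a minimal sketch; the file name 'my_script' follows the text):

```r
writeLines("print(1:100)", "my_script") # Creates the example script file containing one print command
source("my_script")                     # Executes the script; the numbers 1 to 100 are printed
unlink("my_script")                     # Deletes the example file again
```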
x <- edit(data.frame()) # Starts an empty GUI spreadsheet editor for manual data entry.
x <- edit(x) # Opens existing data frame (table) 'x' in the GUI spreadsheet editor.
x <- scan(w="c") # Lets you enter values from the keyboard or by copy & paste and assigns them to vector 'x'.
q() # Quits R console.
R Startup Behavior
The R environment is controlled by hidden files in the startup directory: .RData, .Rhistory and .Rprofile (optional)
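As an illustration, a '.Rprofile' file in the startup directory can set session options and define startup/shutdown hooks. The specific settings below are arbitrary examples, not recommendations:

```r
# Example content of a .Rprofile file
options(digits=5)                                           # Sets the default number of printed digits
.First <- function() cat("Session started:", date(), "\n")  # Executed at the beginning of each session
.Last  <- function() cat("Session closed:", date(), "\n")   # Executed when quitting R
```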
Finding Help
Various online manuals, including several that are useful for beginners, are available on the R project site.
Documentation within R can be found with the following commands:
?my_function # or 'help(my_function)'; opens the documentation for a function.
apropos("topic") # Finds all function names containing a given term.
example(heatmap) # Executes the examples for function 'heatmap'.
help.search("topic") # Searches the help system for documentation.
RSiteSearch('regression', restrict='functions', matchesPerPage=100)
# Searches for key words or phrases in the R-help mailing list archives, help pages, vignettes or task views, using the search engine
# at http://search.r-project.org, and views the results in a web browser.
help.start()
# Starts the local HTML interface. The link 'Packages' provides a list of all installed packages. After initiating 'help.start()' in a session, the
# '?my_function' commands will open as HTML pages.
sessionInfo() # Prints version information about R and all loaded packages. The generated output should be provided when sending questions or bug reports to the R and BioC mailing lists.
$ R -h # or 'R --help'; provides help on the R environment; more detailed information is on page 90 of 'An Introduction to R'.

Basics on Functions and Packages
R contains most arithmetic functions like mean, median, sum, prod,
sqrt, length, log, etc. An extensive list of R functions can be found on
the function and variable index page.
Many R functions and datasets are stored in separate packages, which
are only available after loading them into an R session. Information
about installing new packages can be found in the administrative section of this manual.
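As a minimal sketch (the package name 'ggplot2' serves only as a placeholder example here), installing and loading a CRAN package looks like this:

```r
install.packages("ggplot2") # Downloads and installs a package from CRAN; requires internet access
library(ggplot2)            # Loads the installed package into the current session
```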
Loading of libraries/packages
library() # Shows all libraries available on a system.
library(my_package) # Loads a particular package.
library(help=my_package) # Lists all functions/objects of a package.
search() # Lists which packages are currently loaded.
Information and management of objects
ls() or objects() # Lists the R objects created during the session. They are stored in the file '.RData' when exiting R and saving the workspace.
rm(my_object1, my_object2, ...) # Removes objects.
rm(list = ls()) # Removes all objects without warning!
str(object) # Displays object types and structure of an R object.
ls.str(pattern="") # Lists object type info on all objects in a session.
print(ls.str(), max.level=0) # If a session contains very long list objects then one can simplify the output with this command.
lsf.str(pattern="") # Lists object type info on all functions in a session.
class(object) # Prints the object type.
mode(object) # Prints the storage mode of an object.
summary(object) # Generic summary info for all kinds of objects.
attributes(object) # Returns an object's attribute list.
gc() # Causes the garbage collection to take place. This is sometimes useful to clean up memory allocations after deleting large objects.
length(object) # Provides length of object.
.Last.value # Prints the value of the last evaluated expression.
Reading and changing directories
dir() # Reads content of current working directory.
getwd() # Returns current working directory.
setwd("/home/user") # Changes current working directory to the specified directory.
System commands under Linux
R IN/OUTPUT & BATCH Mode
One can redirect R input and output with '|', '>' and '<' from the shell command line.
$ R --slave < my_infile > my_outfile
# The argument '--slave' makes R run as 'quietly' as possible. This option is intended to support programs which use R to compute results for them.
# For example, if my_infile contains 'x <- c(1:100); x', the result for this R expression will be written to 'my_outfile' (or STDOUT).
$ R CMD BATCH [options] my_script.R [outfile]
# Syntax for running R programs in BATCH mode from the command line. The output file lists the commands from the script file and their outputs.
# If no outfile is specified, the name used is that of 'infile' with '.Rout' appended. To stop all the usual R command-line information from being
# written to the outfile, add 'options(echo=FALSE)' as the first line of the my_script.R file. If the command is run like this: 'R CMD BATCH --no-save my_script.R',
# then nothing will be saved in the .RData file, which can often get very large. More on this can be found on the help pages: '$ R CMD BATCH --help' or '> ?BATCH'.
$ echo 'sqrt(3)' | R --slave # Calculates the square root of 3 in R and prints the result to STDOUT.
Executing Shell & Perl commands from R with system() function.
Remember, single escapes (e.g. '\n') need to be double escaped in R (e.g. '\\n')
system("ls -al") # Prints the content of the current working directory.
system("perl -ne 'print if (/my_pattern1/ ? ($c=1) : ($c > 0)); print if (/my_pattern2/ ? ($d = 1) : ($d > 0));' my_infile.txt > my_outfile.txt")
# Runs a Perl one-liner that extracts two patterns from an external text file and writes the result into a new file.
Reading and Writing Data from/to Files
Import from files
read.delim("clipboard", header=T) # Command to copy & paste tables from Excel or other programs into R. If the 'header' argument is set to FALSE, then the first line of the data set will not be used as column titles.
read.delim(pipe("pbpaste")) # Command to copy & paste on Mac OS X systems.
file.show("my_file") # Prints content of file to screen, allows scrolling.
scan("my_file") # Reads vector/array into vector from file or keyboard.
my_frame <- read.table(file="my_table") # Reads in a table and assigns it to a data frame.
my_frame <- read.table(file="my_table", header=TRUE, sep="\t")
# Same as above, but with info on column headers and field separators. If you want to import the data in character mode, then include this argument: colClasses = "character".
my_frame <- read.delim("ftp://ftp.arabidopsis.org/home/tair/Microarrays/Affymetrix/affy_ATH1_array_elements2009729.txt", na.strings = "", fill=TRUE, header=T, sep="\t")
# The function read.delim() is often more flexible for importing tables with empty fields and long character strings (e.g. gene descriptions).
cat(month.name, file="zzz.txt", sep="\n"); x <- readLines("zzz.txt"); x <- x[c(grep("^J", as.character(x), perl = TRUE))]; t(as.data.frame(strsplit(x, "u")))
# A somewhat more advanced example for retrieving specific lines from an external file with a regular expression. In this example an external file is created with the 'cat'
# function, all lines of this file are imported into a vector with 'readLines', the specific elements (lines) are then retrieved with the 'grep' function, and the resulting
# lines are split into subfields with 'strsplit'.
Export to files
write.table(iris, "clipboard", sep="\t", col.names=NA, quote=F)
# Command to copy & paste from R into Excel or other programs. It writes the data of an R data frame object into the clipboard from where it can be pasted into other applications.
zz <- pipe('pbcopy', 'w'); write.table(iris, zz, sep="\t", col.names=NA, quote=F); close(zz) # Command to copy & paste from R into Excel or other programs on Mac OS X systems.
write.table(my_frame, file="my_file", sep="\t", col.names = NA)
# Writes a data frame to a tab-delimited text file. The argument 'col.names = NA' makes sure that the titles align with the columns when row/index names are exported (default).
save(x, file="my_file.txt"); load(file="my_file.txt") # Commands to save an R object to an external file and to read it in again from this file.
files <- list.files(pattern=".txt$"); for(i in files) { x <- read.table(i, header=TRUE, row.names=1, comment.char = "A", sep="\t"); assign(print(i, quote=FALSE), x); write.table(x, paste(i, c(".out"), sep=""), quote=FALSE, sep="\t", col.names = NA) }
# Batch import and export of many files. First, the *.txt file names in the current directory are assigned to a list (the '$' sign is used to anchor the string '.txt' to the end of the names).
# Second, the files are imported one-by-one using a for loop where the original names are assigned to the generated data frames with the 'assign' function. Read ?read.table to
# understand the arguments 'row.names=1' and 'comment.char = "A"'. Third, the data frames are exported using their names for file naming and appending '*.out'.
HTML(my_frame, file = "my_table.html")
# Writes a data frame to an HTML table. Subsequent exports to the same file will arrange several tables in one HTML document. In order to access this
# function one needs to load the library 'R2HTML' first. This library is usually not installed by default.
write(x, file="my_file") # Writes matrix data to a file.
sink("My_R_Output") # Redirects all subsequent R output to a file 'My_R_Output' without showing it in the R console anymore.
sink() # Restores normal R output behavior.
Interfacing with Google Docs
R Objects
Data Types
 Numeric data: 1, 2, 3
x <- c(1, 2, 3); x; is.numeric(x); as.character(x) # Creates a numeric vector, checks for the data type and converts it into a character vector.
 Character data: "a", "b" , "c"
x <- c("1", "2", "3"); x; is.character(x); as.numeric(x) # Creates a character vector, checks for the data type and converts it into a numeric vector.
 Logical data: TRUE, FALSE, TRUE
1:10 < 5 # Returns a logical vector that is TRUE for the values smaller than 5.
Object Types in R
 vectors: ordered collections of numeric, character, complex or logical values
 factors: special vectors with grouping information for their components
 data frames: two-dimensional structures with different data types
 matrices: two-dimensional structures with data of the same type
 arrays: multidimensional arrays of vectors
 lists: general form of vectors with different types of elements
 functions: pieces of code
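These object types can be illustrated with the following one-liners (a sketch; all object names are arbitrary):

```r
v <- c(1.5, 2.5, 3.5)                  # vector: ordered collection of values of one type
f <- factor(c("a", "b", "a"))          # factor: vector with grouping information (levels)
df <- data.frame(x=1:2, y=c("a", "b")) # data frame: columns may have different types
m <- matrix(1:4, 2, 2)                 # matrix: two dimensions, one data type
a <- array(1:8, dim=c(2, 2, 2))        # array: here three dimensions
l <- list(numbers=v, groups=f)         # list: components of different types
fun <- function(x) x^2                 # function: a reusable piece of code
```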
Naming Rules
 Object, row and column names should not start with a number.
 Avoid spaces in object, row and column names.
 Avoid special characters like '#'.
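The make.names() function illustrates these rules by repairing invalid names (a small sketch):

```r
make.names(c("1st_col", "col two", "col#3")) # A leading digit gets an 'X' prefix; spaces and special characters become '.'
# Returns: "X1st_col" "col.two" "col.3"
```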
General Subsetting Rules
 Subsetting syntax:
my_object[row] # Subsetting of one dimensional objects, like vectors and factors.
my_object[row, col] # Subsetting of two dimensional objects, like matrices and data frames.
my_object[row, col, dim] # Subsetting of three dimensional objects, like arrays.
There are three possibilities to subset data objects:
(1) Subsetting by positive or negative index/position numbers
my_object <- 1:26; names(my_object) <- LETTERS # Creates a vector sample with named elements.
my_object[1:4] # Returns the elements 1-4.
my_object[-c(1:4)] # Excludes the elements 1-4.
(2) Subsetting by same length logical vectors
my_logical <- my_object > 10 # Generates a logical vector as an example.
my_object[my_logical] # Returns the elements where my_logical contains TRUE values.
(3) Subsetting by field names
my_object[c("B", "K", "M")] # Returns the elements with element titles: B, K and M
Calling a single column or list component by its name with the '$' sign.
iris$Species # Returns the 'Species' column in the sample data frame 'iris'.
iris[,c("Species")] # Has the same effect as the previous step.
Assigning values to object components
zzz <- iris[,1:3]; zzz <- 0 # Reassignment syntax to create/replace an entire object.
zzz <- iris[,1:3]; zzz[,] <- 0 # Populates all fields in an object with zeros.
zzz <- iris[,1:3]; zzz[zzz < 4] <- 0 # Populates only the specified fields with zeros.
Basic Operators and Calculations
 Comparison operators
 equal: ==
 not equal: !=
 greater/less than: > <
 greater/less than or equal: >= <=
Example:
1 == 1 # Returns TRUE.
 Logical operators
x <- 1:10; y <- 10:1 # Creates the sample vectors 'x' and 'y'.
x > y & x > 5 # Returns TRUE where both comparisons return TRUE.
x == y | x != y # Returns TRUE where at least one comparison is TRUE.
!x > y # The '!' sign returns the negation (opposite) of a logical vector.
 Calculations
 Four basic arithmetic functions: addition, subtraction, multiplication and division
1 + 1; 1 - 1; 1 * 1; 1 / 1 # Returns the results of basic arithmetic calculations.
x <- 1:10; sum(x); mean(x); sd(x); sqrt(x) # Calculates for the vector x its sum, mean, standard deviation and square roots.
# A list of the basic R functions can be found on the function and variable index page.
x <- 1:10; y <- 1:10; x + y # Calculates the sum for each element in the vectors x and y.
apply(iris[,1:3], 1, mean) # Calculates the mean values across the columns 1-3 for each row in the sample data frame 'iris'. With the argument setting '1',
# row-wise iterations are performed and with '2' column-wise iterations.
tapply(iris[,4], iris$Species, mean) # Calculates the mean values for the 4th column based on the grouping information in the 'Species' column of the 'iris' data frame.
sapply(x, sqrt) # Calculates the square root for each element in the vector x. Generates the same result as 'sqrt(x)'
Regular expressions
R's regular expression utilities work similarly to those in other languages. To learn how to use them in R, one can consult the main help page on this topic with: ?regexp.
?regexp # Opens general help page for regular expression support in R.
month.name[grep("A", month.name)] # The grep function can be used for finding patterns in strings, here letter 'A' in vector 'month.name'.
gsub('(i.*a)', '\\1_xxx', iris$Species, perl = TRUE)
# Example for using regular expressions to substitute a pattern by another one using a back reference. Remember: single escapes '\' need to be double escaped '\\' in R.
Vectors
General information
Vectors are ordered collections of 'atomic' (same data type) components or modes of the following four types: numeric, character, complex and logical. Missing values are indicated by 'NA'. R inserts them automatically in blank fields.
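The effect of 'NA' values on calculations can be checked with a few commands (a minimal sketch):

```r
x <- c(2.3, NA, 4.4) # Vector containing a missing value
is.na(x)             # Identifies the missing position: FALSE TRUE FALSE
mean(x)              # Returns NA, because the missing value propagates
mean(x, na.rm=TRUE)  # Removes NAs before the calculation: 3.35
```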
x <- c(2.3, 3.3, 4.4) # Example for creating a numeric vector with an ordered collection of numbers using the c() function.
z <- scan(file="my_file") # Reads data line-wise from a file and assigns it to vector 'z'.
vector[rows] # Syntax to access vector sections.
z <- 1:10; z; as.character(z) # The function 'as.character' changes the data mode from numeric to character.
as.numeric(character) # The function 'as.numeric' changes the data mode from character to numeric.
d <- as.integer(x); d # Transforms numeric data into integers.
x <- 1:100; sample(x, 5) # Selects a random sample of size 5 from a vector.
x <- as.integer(runif(100, min=1, max=5)); sort(x); rev(sort(x)); order(x); x[order(x)]
# Generates random integers from 1 to 4. The sort() function sorts the items by size. The rev() function reverses the order. The order() function returns the corresponding
# indices for a sorted object. The order() function is usually the one that needs to be used for sorting complex objects, such as data frames or lists.
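For example, the rows of a data frame can be sorted with order() like this (using the built-in 'iris' data set):

```r
iris_sorted <- iris[order(iris$Sepal.Length), ] # Reorders all rows by increasing values of one column
head(iris_sorted, 3)                            # Shows the three rows with the smallest 'Sepal.Length'
```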
x <- rep(1:10, times=2); x; unique(x) # The unique() function removes the duplicated entries in a vector.
sample(1:10, 5, replace=TRUE) # Returns a set of randomly selected elements from a vector (here 5 numbers from 1 to 10) with replacement; set 'replace=FALSE' to sample without replacement.
Sequences
R has several facilities to create sequences of numbers:
1:30 # Generates a sequence of integers from 1 to 30.
letters; LETTERS; month.name; month.abb # Generates lower case letters, capital letters, month names and abbreviated month names, respectively.
2*1:10 # Creates a sequence of even numbers.
seq(1, 30, by=0.5) # Same as before, but with 0.5 increments.
seq(length=100, from=20, by=0.5) # Creates number sequence with specified start and length.
rep(LETTERS[1:8], times=5) # Replicates given sequence or vector x times.
paste(LETTERS[1:8], 1:12, sep="") # The command 'paste' merges vectors after converting to characters.
x <- paste(rep("A", times=12), 1:12, sep=""); y <- paste(rep("B", times=12), 1:12, sep=""); append(x, y) # Possibility to build a plate location vector in R (better example under 'arrays').
x <- 1:100; x[2:23] # The values in the square brackets select a vector range.
x <- 1:100; x[-(2:23)] # Prints everything except the values in the square brackets.
x[5] <- 99 # Replaces the value at position 5 with '99'.
x <- 1:10; y <- c(x[1:5], 99, x[6:10]); y # Inserts a new value at a defined position of the vector.
letters=="c" # Returns a logical vector with TRUE at the position of "c" and FALSE elsewhere.
which(rep(letters, 2)=="c") # Returns the index numbers where "c" occurs in the 'letters' vector. For retrieving the indices of several strings provided by a query vector, use the following 'match' function.
match(c("c","g"), letters)
# Returns the index numbers for "c" and "g" in the 'letters' vector. If the query vector (here 'c("c","g")') contains entries that are duplicated in the target vector,
# then this syntax returns only the first occurrence(s) for each duplicate. To retrieve the indices for all duplicates, use the following '%in%' function.
x <- rep(1:10, 2); y <- c(2,4,6); x %in% y
# The function '%in%' returns a logical vector. This syntax allows the subsetting of vectors and data frames with a query vector ('y') containing entries that are duplicated
# in the target vector ('x'). The resulting logical vector can be used for the actual subsetting step of vectors and data frames.
Finding Identical and Non-Identical Entries between Vectors
intersect(month.name[1:4], month.name[3:7]) # Returns identical entries of two vectors.
month.name[month.name %in% month.name[3:7]]
# Returns the same result as in the previous step. The vector comparison with %in% first returns a logical vector of identical items that is then used to subset the first vector.
setdiff(x=month.name[1:4], y=month.name[3:7]); setdiff(month.name[3:7], month.name[1:4])
# Returns the unique entries occurring only in the first vector. Note: if the argument names are not used, as in the second example, then the order of the arguments is important.
union(month.name[1:4], month.name[3:7]) # Joins two vectors without duplicating identical entries.
x <- c(month.name[1:4], month.name[3:7]); x[duplicated(x)] # Returns the duplicated entries.
Factors
Factors are vector objects that contain grouping (classification) information of its components.
animalf <- factor(c("dog", "cat", "mouse", "dog", "dog", "cat")) # Creates the factor 'animalf' from a vector.
animalf # Prints out factor 'animalf', this lists first all components and then the different levels (unique entries); alternatively one can print only levels with 'levels(animalf)'.
table(animalf) # Creates frequency table for levels.
 Function 'tapply' applies calculation on all members (replicates) of a level.
weight <- c(102, 50, 5, 101, 103, 52) # Creates a new vector with weight values for 'animalf' (both need to have the same length).
mean <- tapply(weight, animalf, mean) # Applies a function (length, mean, median, sum, sterr, etc.) to all level values; 'length' provides the number of entries (replicates) in each level.
 Function 'cut' divides a numeric vector into size intervals.
y <- 1:200; interval <- cut(y, right=F, breaks=c(1, 2, 6, 11, 21, 51, 101, length(y)+1), labels=c("1", "2-5", "6-10", "11-20", "21-50", "51-100", ">=101")); table(interval)
# Prints the counts for the specified size intervals (breaks) in the numeric vector 1:200.
plot(interval, ylim=c(0,110), xlab="Intervals", ylab="Count", col="green"); text(labels=as.character(table(interval)), x=seq(0.7, 8, by=1.2), y=as.vector(table(interval))+2)
# Plots the size interval counts as a bar diagram.
Matrices and Arrays
 Matrices are two-dimensional data objects consisting of rows and columns. Arrays are similar, but they can have one, two or more dimensions. In contrast to data frames (see below), one can store only a single data type in the same object (e.g. numeric or character).
x <- matrix(1:30, 3, 10, byrow = T) # Lays out the vector (1:30) in a 3 by 10 matrix. The argument 'byrow' defines whether the matrix is filled by rows or columns.
dim(x) <- c(3,5,2) # Transforms the above matrix into a multidimensional array.
x <- array(1:25, dim=c(5,5)) # Creates a 5 by 5 array ('x') and fills it with the values 1-25.
y <- c(x) # Writes the array into a vector.
x[c(1:5),3] # Writes values from 3rd column into vector structure.
mean(x[c(1:5),3]) # Calculates mean of 3rd column.
 Subsetting matrices and arrays
array[rows, columns] # Syntax to access columns and rows of matrices and two-dimensional arrays.
as.matrix(iris) # Many functions in R require matrices as input. If something doesn't work, then try to convert the object into a matrix with the as.matrix() function.
i <- array(c(1:5,5:1), dim=c(5,2)) # Creates a 5 by 2 index array ('i') and fills it with the values 1-5, 5-1.
x[i] # Extracts the corresponding elements from 'x' that are indexed by 'i'.
x[i] <- 0 # Replaces those elements with zeros.
array1 <- array(scan(file="my_array_file", sep="\t"), c(4,3)) # Reads data from 'my_array_file' and writes it into a 4 by 3 array.
 Subsetting arrays with more than two dimensions
array[rows, columns, dimensions] # Syntax to access columns, rows and dimensions in arrays with more than two dimensions.
x <- array(1:250, dim=c(10,5,5)); x[2:5,3,] # Example to generate a 10x5x5 array and how to retrieve slices from all sub-arrays.
 Merging arrays: example for building location tables for microtiter plates
Z <- array(1:12, dim=c(12,8)); X <- array(c("A","B","C","D","E","F","G","H"), dim=c(8,12)); Y <- paste(t(X), Z, sep=""); Y # Creates the 96 well labels A1-A12 through H1-H12.
M <- array(Y, c(96,1)) # Writes the 96 labels into a single column.
 Script for mapping 24/48/96 to 384 well plates
 Script for mapping 384 to 1536 well plates
 Appending arrays and matrices
cbind(matrix1, matrix2) # Appends columns of matrices with same number of rows.
rbind(matrix1, matrix2) # Appends rows of matrices with same number of columns.
c(array1, array2) # Clears dimension attributes from arrays and concatenates them in vector format. The function 'as.vector(my_array1)' achieves similar result.
 Calculations between arrays
Z <- array(1:12, dim=c(12,8)); X <- array(12:1, dim=c(12,8)) # Creates the arrays Z and X with the same dimensions.
calarray <- Z/X # Divides the Z by X elements and assigns the result to the array 'calarray'.
t(my_array) # Transposes 'my_array'; a more flexible transpose function is 'aperm(my_array, perm)'.
 Information about arrays
dim(X); nrow(X); ncol(X) # Returns the dimensions, the number of rows and the number of columns of array 'X'.

Data Frames
Data frames are two-dimensional data objects that are composed of rows and columns. They are very similar to matrices. The main difference is that data frames can store different data types, whereas matrices allow only one data type (e.g. numeric or character).
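The difference becomes visible when a mixed-type data frame is coerced to a matrix (a small sketch under current R defaults; the column names are arbitrary):

```r
df <- data.frame(num=1:3, char=c("a", "b", "c")) # A data frame preserves each column's data type
sapply(df, class)                                # 'num' is integer, 'char' is character
m <- as.matrix(df)                               # A matrix allows only one data type,
mode(m)                                          # so all values are coerced to character here
```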
 Constructing data frames
my_frame <- data.frame(y1=rnorm(12), y2=rnorm(12), y3=rnorm(12), y4=rnorm(12)) # Creates a data frame with four columns of 12 random numbers each.
rownames(my_frame) <- month.name[1:12] # Assigns the row (index) names. These names need to be unique.
names(my_frame) <- c("y4", "y3", "y2", "y1") # Assigns new column titles.
names(my_frame)[c(1,2)] <- c("y3", "y4") # Changes the titles of specific columns.
my_frame <- data.frame(IND=row.names(my_frame), my_frame) # Generates a new column with the title "IND" containing the row names.
my_frame[,2:5]; my_frame[,-1] # Different possibilities to remove column(s) from a data frame.
Accessing and slicing data frame sections
my_frame[rows, columns] # Generic syntax to access columns and rows in data frames.
dim(my_frame) # Gives dimensions of data frame.
length(my_frame); length(my_frame$y1) # Provides number of columns or rows of data frame, respectively.
colnames(my_frame); rownames(my_frame) # Gives column and row names of data frame.
row.names(my_frame) # Prints row names or indexing column of data frame.
my_frame[order(my_frame$y2, decreasing=TRUE), ] # Sorts the rows of a data frame by the specified columns, here 'y2'; for increasing order use 'decreasing=FALSE'.
my_frame[order(my_frame[,4], my_frame[,3]),] # Subsequent subsorts can be performed by specifying additional columns. By adding a '-' sign (e.g. 'order(-my_frame[,4])'), one can reverse the sort order of a numeric column.
my_frame$y1 # Notation to print entire column of a data frame as vector or factor.
my_frame$y1[2:4] # Notation to access column element(s) of a data frame.
v <- my_frame[,4]; v[3] # Notation for returning the value of an individual cell. In this example the corresponding column is first assigned to a vector and then the desired field is accessed by its index number.
my_frame[1:5,] # Notation to view only the first five rows of all columns.
my_frame[, 1:2] # Notation to view all rows of the first two columns.
my_frame[, c(1,3)] # Notation to view all rows of the specified columns.
my_frame[1:5,1:2] # Notation to view only the first five rows of the columns 1-2.
my_frame["August",] # Notation to retrieve row values by their index name (here "August").
x <- data.frame(row.names=LETTERS[1:10], letter=letters[1:10], Month=month.name[1:10]); x; match(c("c","g"), x[,1])
# Returns the matching index numbers of a data frame or vector using the 'match' function.
# This syntax returns for duplicates only the index of their first occurrence. To return all, use the following syntax.
x[x[, 2] %in% month.name[3:7],] # Subsets a data frame with a query vector using the '%in%' function. This returns all occurrences of duplicates.
as.vector(as.matrix(my_frame[1,])) # Returns one row entry of a data frame as vector.
as.data.frame(my_list) # Returns a list object as data frame if the list components can be converted into a two dimensional object.
 Calculations on data frames
summary(my_frame) # Prints a handy summary for a data frame.
colMeans(my_frame[,2:5]) # Calculates the mean for each numeric column.
data.frame(my_frame, mean=apply(my_frame[,2:5], 1, mean), ratio=(my_frame[,2]/my_frame[,3]))
# The apply function performs row-wise or column-wise calculations on data frames or matrices (here mean and ratios). The results are returned as vectors.
# In this example, they are appended to the original data frame with the data.frame function. The argument '1' in the apply function specifies row-wise calculations.
# If '2' is selected, then the calculations are performed column-wise.
aggregate(my_frame, by=list(c("G1","G1","G1","G1","G2","G2","G2","G2","G3","G3","G3","G4")), FUN=mean)
# The aggregate function can be used to compute the mean (or any other stats) for data groups specified under the argument 'by'.
cor(my_frame[,2:4]); cor(t(my_frame[,2:4])) # Syntax to calculate a correlation matrix based on all-against-all rows and all-against-all columns.
x <- matrix(rnorm(48), 12, 4, dimnames=list(month.name, paste("t", 1:4, sep=""))); corV <- cor(x["August",], t(x), method="pearson"); y <- cbind(x, correl=corV[1,]); y[order(y[,5], decreasing=TRUE), ]
# Commands to perform a correlation search and ranking across all rows (a matrix object is required here). First, an example matrix 'x' is created. Second, the
# correlation values for the "August" row against all other rows are calculated. Finally, the resulting vector with the correlation values is merged with the
# original matrix 'x' and sorted by decreasing correlation values.
merge(frame1, frame2, by.x = "frame1col_name", by.y = "frame2col_name", all = TRUE)
# Merges two data frames (tables) by common field entries. To obtain only the common rows, change 'all = TRUE' to 'all = FALSE'. To merge on row names (indices),
# refer to it with "row.names" or '0', e.g.: 'by.x = "row.names", by.y = "row.names"'.
my_frame1 <- data.frame(title1=month.name[1:8], title2=1:8); my_frame2 <- data.frame(title1=month.name[4:12], title2=4:12); merge(my_frame1, my_frame2, by.x = "title1", by.y = "title1", all = TRUE)
# Example for merging two data frames by a common field.
 Fast computations on large data frames and matrices
myDF <- as.data.frame(matrix(rnorm(100000), 10000, 10)) # Creates an example data frame.
myCol <- c(1,1,1,2,2,2,3,3,4,4); myDFmean <- t(aggregate(t(myDF), by=list(myCol), FUN=mean, na.rm=T)[,-1])
colnames(myDFmean) <- tapply(names(myDF), myCol, paste, collapse="_"); myDFmean[1:4,]
# The aggregate function can be used to perform calculations (here mean) across rows for any combination of column selections (here myCol) after transposing the
# data frame. However, this will be very slow for data frames with millions of rows.
myList <- tapply(colnames(myDF), c(1,1,1,2,2,2,3,3,4,4), list)
names(myList) <- sapply(myList, paste, collapse="_"); myDFmean <- sapply(myList, function(x) colMeans(t(myDF[,x]))); myDFmean[1:4,]
# Faster alternative for performing the aggregate computation of the previous step. However, this step will still be very slow for very large
# data sets, due to the sapply loop over the row elements.
myList <- tapply(colnames(myDF), c(1,1,1,2,2,2,3,3,4,4), list)
myDFmean <- sapply(myList, function(x) rowSums(myDF[,x])/length(x)); colnames(myDFmean) <- sapply(myList, paste, collapse="_")
myDFmean[1:4,] # By using the rowSums or rowMeans functions one can perform the above aggregate computations 100-1000 times faster by avoiding a loop over the rows altogether.
myDFsd < sqrt((rowSums((myDFrowMeans(myDF))^2)) / (length(myDF)1)); myDFsd[1:4]
# Similarly, one can compute the standard deviation for large data frames by avoiding loops.
# This approach is about 100 times faster than the loopbased alternatives: sd(t(myDF)) or apply(myDF, 1, sd).
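As a quick, self-contained sanity check (using a small stand-in data frame rather than the large 'myDF' above), the vectorized standard deviation formula can be compared against the loop-based apply() version:

```r
## Small example data frame (stand-in for the large myDF above)
myDF <- as.data.frame(matrix(rnorm(50), 10, 5))
## Vectorized row-wise standard deviation
myDFsd_fast <- sqrt(rowSums((myDF - rowMeans(myDF))^2) / (length(myDF) - 1))
## Loop-based equivalent for comparison
myDFsd_slow <- apply(myDF, 1, sd)
all.equal(as.numeric(myDFsd_fast), as.numeric(myDFsd_slow)) # returns TRUE
```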
Conditional selections
my_frame[!duplicated(my_frame[,2]),] # Removes rows with duplicated values in selected column.
my_frame[my_frame$y2 > my_frame$y3,]
# Prints all rows of data frame where values of col1 > col2. Comparison operators are: == (equal), != (not equal), >= (greater than or equal), etc. Logical operators are & (and), | (or) and ! (not).
x <- 0.5:10; x[x<1.0] <- 1/x[x<1.0] # Replaces all values in vector or data frame that are below 1 with their reciprocal value.
x <- data.frame(month=month.abb[1:12], AB=LETTERS[1:2], no1=1:48, no2=1:24); x[x$month == "Apr" & (x$no1 == x$no2 | x$no1 > x$no2),]
# Prints all records of frame 'x' that contain 'Apr' AND have equal values in columns 'no1' and 'no2' OR have greater values in column 'no1'.
x[x[,1] %in% c("Jun", "Aug"),] # Retrieves rows with column matches specified in a query vector.
x[c(grep("\\d{2}", as.character(x$no1), perl = TRUE)),] # Possibility to print out all rows of a data frame where a regular expression matches (here all double digit values in col 'no1').
myrows <- NULL; for(i in 1:4) myrows <- unique(c(myrows, grep("\\d{2}", as.character(x[,i]), perl = TRUE))); x[sort(myrows),] # Same as above, but searches all columns (1-4) using a for loop (see below).
z <- data.frame(chip1=letters[1:25], chip2=letters[25:1], chip3=letters[1:25]); z; y <- apply(z, 1, function(x) sum(x == "m") > 2); z[y,]
# Identifies in a data frame ('z') all those rows that contain a certain number of identical fields (here 'm' > 2).
z <- data.frame(chip1=1:25, chip2=25:1, chip3=1:25); c <- data.frame(z, count=apply(z[,1:3], 1, FUN <- function(x) sum(x >= 5))); c
# Counts in each row of a data frame the number of fields that are above or below a specified value and appends this information to the data frame. By default rows with "NA" values will
# be ignored. To work around this limitation, one can replace the NA fields with a value that doesn't affect the result, e.g.: x[is.na(x)] <- -1.
x <- data.frame(matrix(rep(c("P","A","M"),20),10,5)); x; index <- x == "P"; cbind(x, Pcount=rowSums(index)); x[rowSums(index)>=2,]
# Example of how one can count occurrences of strings across rows. In this example the occurrences of "P" in a data frame of "PMA" values are counted by converting to a data frame of
# logical values and then counting the 'TRUE' occurrences with the 'rowSums' function, which does in this example the same as: 'cbind(x, count=apply(x=="P",1,sum))'.
Lists
Lists are ordered collections of objects that can be of different
modes (e.g. numeric vector, array, etc.). A name can be assigned to each
list component.
my_list <- list(name="Fred", wife="Mary", no.children=3, child.ages=c(4,7,9)) # Example of how to create a list.
attributes(my_list); names(my_list) # Prints attributes and names of all list components.
my_list[2]; my_list[[2]] # Returns component 2 of list; The [[..]] operator returns the object type stored in this component, while '[..]' returns a single component list.
my_list$wife # Named components in lists can also be called with their name. Command returns same component as 'my_list[[2]]'.
my_list[[4]][2] # Returns from the fourth list component the second entry.
length(my_list[[4]]) # Prints the length of the fourth list component.
my_list$wife <- 1:12 # Replaces the content of the existing component with new content.
my_list$wife <- NULL # Deletes list component 'wife'.
my_list <- c(my_list, list(my_title2=month.name[1:12])) # Appends new list component to existing list.
my_list <- c(my_name1=my_list1, my_name2=my_list2, my_name3=my_list3) # Concatenates lists.
my_list <- c(my_title1=my_list[[1]], list(my_title2=month.name[1:12])) # Concatenates existing and new components into a new list.
unlist(my_list); data.frame(unlist(my_list)); matrix(unlist(my_list)); data.frame(my_list) # Useful commands to convert lists into other objects, such as vectors, data frames or matrices.
my_frame <- data.frame(y1=rnorm(12), y2=rnorm(12), y3=rnorm(12), y4=rnorm(12)); my_list <- apply(my_frame, 1, list); my_list <- lapply(my_list, unlist); my_list
# Stores the rows of a data frame as separate vectors in a list container.
mylist <- list(a=letters[1:10], b=letters[10:1], c=letters[1:3]); lapply(names(mylist), function(x) c(x, mylist[[x]]))
# Syntax to access list component names with the looping function 'lapply'. In this example the list component names are prepended to the corresponding vectors.
Missing Values
Missing values are represented in R data objects by the missing
value placeholder 'NA'. Many R functions either fail or return 'NA'
when they are applied to data objects containing missing values.
Several 'na.action' options (e.g. 'na.omit', 'na.exclude', 'na.fail') are available to control this behavior.
x <- 1:10; x <- x[1:12]; z <- data.frame(x, y=12:1) # Creates sample data frame with 'NA' values.
is.na(x) # Finds out which elements of x contain "NA" fields.
na.omit(z) # Removes rows with 'NA' fields in numeric data objects.
x <- letters[1:10]; print(x); x <- x[1:12]; print(x); x[!is.na(x)] # A similar result can be achieved on character data objects with the function 'is.na()'.
apply(z, 1, mean, na.rm=T) # Uses available values to perform calculation while ignoring the 'NA' values; the default 'na.rm=F' returns 'NA' for rows with missing values.
z[is.na(z)] <- 0 # Replaces 'NA' values with zeros. To replace the special <NA> signs in character columns of data frames, convert the data frame with as.matrix() to a matrix, perform the substitution on this level and then convert it back to a data frame with as.data.frame().
Some Great R Functions
 This section contains a small collection of extremely useful R functions.
 The unique() function makes vector entries unique:
unique(iris$Sepal.Length); length(unique(iris$Sepal.Length))
# The first step returns the unique entries for the Sepal.Length column in the iris data set and the second step counts the number of its unique entries.
 The table() function counts the occurrence of entries in a vector. It is the most basic "clustering function":
my_counts <- table(iris$Sepal.Length, exclude=NULL)[as.character(iris$Sepal.Length)]; cbind(iris, CLSZ=my_counts)[1:4,] # Counts identical entries in the Sepal.Length column of the iris data set and aligns the counts with the original data frame. Note: to count NAs, set the 'exclude' argument to NULL.
myvec <- c("a", "a", "b", "c", NA, NA); table(factor(myvec, levels=c(unique(myvec), "z"), exclude=NULL)) # Additional count levels can be specified by turning the test vector into a factor and specifying them with the 'levels' argument.
 The combn() function creates all combinations of elements:
labels <- paste("Sample", 1:5, sep=""); combn(labels, m=2, FUN=paste, collapse="-")
# Example to create all possible pairwise combinations of sample labels. The number of samples to consider in the comparisons can be controlled with the 'm' argument.
allcomb <- lapply(seq(along=labels), function(x) combn(labels, m=x, simplify=FALSE, FUN=paste, collapse="-")); unlist(allcomb) # Creates all possible combinations of sample labels.
 The aggregate() function computes any type of summary statistics of data subsets that are grouped together:
aggregate(iris[,1:4], by=list(iris$Species), FUN=mean, na.rm=T) # Computes the mean (or any other stats) for the data groups defined by the 'Species' column in the iris data set.
t(aggregate(t(iris[,1:4]), by=list(c(1,1,2,2)), FUN=mean, na.rm=T)[,-1])
# The function can also perform computations for column aggregates (here: columns 1-2 and 3-4) after transposing the data frame. A much faster alternative is given in the data frame section.
 The apply() function simplifies the coding of iterative tasks (loops). In the following example, apply() is first used to compute a row-wise calculation on a matrix. In a second example, apply() is called by the custom function colAg()
to perform similar calculations, while allowing the user to select any
combination of column aggregates with an easy-to-handle grouping vector.
myMA < matrix(rnorm(100000), 10000, 10, dimnames=list(1:10000, paste("C", 1:10, sep=""))) # Creates a sample data set.
apply(myMA, 1, mean)[1:4] # Computes the mean for each row in myMA.
source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/colAg.R") # Imports the colAg() function.
colAg(myMA=myMA, group=c(1,1,1,2,2,2,3,3,4,4), myfct=mean)[1:4,]
# Applies a computation given under the myfct argument (here: mean) for the column aggregates specified under the group argument
# (here: c(1,1,1,2,2,2,3,3,4,4)). The columns in the resulting object are named after the chosen aggregates. Note: the function can
# only perform those calculations that can be applied to sets of two or more values, such as mean, sum, sd, min and max. Much faster
# to compute, but less flexible, alternatives are given in the data frame section.
 The %in% operator returns a logical vector indicating which entries of one vector occur in a second vector. In a subsetting context with '[ ]', it can be used to intersect matrices, data frames and lists:
my_frame <- data.frame(Month=month.name, N=1:12); my_query <- c("May", "August") # Generates a sample data frame and a sample query vector.
my_frame[my_frame$Month %in% my_query, ] # Returns the rows in my_frame where the "Month" column and my_query contain identical entries.
 The merge() function joins data frames based on a common key column:
frame1 <- iris[sample(1:length(iris[,1]), 30), ]
my_result < merge(frame1, iris, by.x = 0, by.y = 0, all = TRUE); dim(my_result)
# Merges two data frames (tables) by common field entries, here row names (by.x=0). To obtain only the common rows,
# change 'all = TRUE' to 'all = FALSE'. To merge on specific columns, refer to them by their position numbers or their column names (e.g. "Species").
Introduction to R Graphics
R provides comprehensive graphics utilities for visualizing and
exploring scientific data. In addition, several powerful graphics
environments extend these utilities. These include the grid, lattice and ggplot2
packages. The grid package is part of R's base distribution. It
provides the low-level infrastructure for many graphics packages,
including lattice and ggplot2. Extensive information on graphics
utilities in R can be found on the Graphics Task Page, the R Graph Gallery and the R Graphical Manual. Another useful reference for graphics procedures is Paul Murrell's book R Graphics. Interactive graphics in R can be generated with rggobi (GGobi) and iplots.
Base Graphics
 The following list provides an overview of some very useful plotting
functions in R's base graphics. To get familiar with their usage, it is
recommended to carefully read their help documentation with ?myfct as well as the help on the functions plot and par.

 plot: generic xy plotting
 barplot: bar plots
 boxplot: box-and-whisker plots
 hist: histograms
 pie: pie charts
 dotchart: Cleveland dot plots
 image, heatmap, contour, persp: functions to generate image-like plots
 qqnorm, qqline, qqplot: distribution comparison plots
 pairs, coplot: display of multivariate data
The lattice package developed by Deepayan Sarkar implements in R the
Trellis graphics system from S-PLUS. The environment greatly simplifies
many complicated high-level plotting tasks, such as automatically
arranging complex graphical features in one or several plots. The syntax
of the package is similar to R's base graphics; however, high-level
lattice functions return an object of class "trellis", which can be either plotted directly or stored in an
object. The command library(help=lattice) will open a list of all functions available in the lattice package, while ?myfct and example(myfct) can be used to access and/or demo their documentation. Important functions for accessing and changing global parameters are: ?lattice.options and ?trellis.device.
ggplot2 is another more recently developed graphics system for R, based on the grammar of graphics
theory. The environment streamlines many graphics routines for the user
to generate with minimum effort complex multilayered plots. Its
syntax is centered around the main ggplot function, while the convenience function qplot provides many shortcuts. The ggplot function accepts two arguments: the data set to be plotted and the corresponding aesthetic mappings provided by the aes function. Additional plotting parameters such as geometric objects (e.g. points, lines, bars) are passed on by appending them with '+'
as separator. For instance, the following command will generate a
scatter plot for the first two columns of the iris data frame: ggplot(iris, aes(iris[,1], iris[,2])) + geom_point(). A list of the available geom_* functions can be found here. The settings of the plotting theme can be accessed with the command theme_get(). Their settings can be changed with the opts() function.
To benefit from the many convenience features built into ggplot2, the
expected input data class is usually a data frame where all labels for
the plot are provided by the column titles and/or grouping factors in
additional column(s).
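As a minimal sketch of this expected input format, base R's stack() function can reshape a 'wide' data frame into the long form with labels in dedicated columns (the column names used here are made up for the example):

```r
df_wide <- data.frame(A=rnorm(10), B=rnorm(10)) # 'wide' format: one column per group
df_long <- stack(df_wide) # 'long' format: a 'values' column plus an 'ind' grouping factor
head(df_long)
## With data in this shape, a call such as ggplot(df_long, aes(ind, values)) + geom_boxplot()
## picks up the axis labels and groupings directly from the column information.
```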
The following graphics sections
demonstrate how to generate different types of plots first with R's base
graphics device and then with the lattice and ggplot2 packages.

Scatter Plots
y <- as.data.frame(matrix(runif(30), ncol=3, dimnames=list(letters[1:10], LETTERS[1:3]))) # Generates a sample data set.
plot(y[,1], y[,2]) # Plots two vectors (columns) in form of a scatter plot against each other.
plot(y[,1], y[,2], type="n", main="Plot of Labels"); text(y[,1], y[,2], rownames(y)) # Prints instead of symbols the row names.
plot(y[,1], y[,2], pch=20, col="red", main="Plot of Symbols and Labels"); text(y[,1]+0.03, y[,2], rownames(y)) # Plots both symbols and their labels.
grid(5, 5, lwd = 2) # Adds a grid to existing plot.
op <- par(mar=c(8,8,8,8), bg="lightblue")
plot(y[,1], y[,2], type="p", col="red", cex.lab=1.2, cex.axis=1.2, cex.main=1.2, cex.sub=1, lwd=4, pch=20, xlab="x label", ylab="y label", main="My Main", sub="My Sub")
par(op)
# The previous commands demonstrate the usage of the most important plotting parameters. The 'mar' argument specifies the margin sizes around the plotting area
# in this order: c(bottom, left, top, right). The color of the plotted symbols can be controlled with the 'col' argument. The plotting symbols can be selected
# with the 'pch' argument, while their size is controlled by the 'cex' argument and line widths by the 'lwd' argument. A selection palette for 'pch' plotting symbols can be opened with the command
# 'example(points)'. As alternative, one can plot any character string by passing it on to 'pch', e.g.: pch=".". The font sizes of the different text components
# can be changed with the 'cex.*' arguments. Please consult the '?par' documentation for more details on fine tuning R's plotting behavior.
plot(y[,1], y[,2]); myline <- lm(y[,2]~y[,1], data=y[,1:2]); abline(myline, lwd=2) # Adds a regression line to the plot.
summary(myline) # Prints summary about regression line.
plot(y[,1], y[,2], log="xy") # Generates the same plot as above, but on log scale.
plot(y[,1], y[,2]); text(y[1,1], y[1,2], expression(sum(frac(1,sqrt(x^2*pi)))), cex=1.3) # Adds a mathematical formula to the plot.
plot(y) # Produces all possible scatter plots for all-against-all columns in a matrix or a data frame. The column headers of the matrix or data frame are used as axis titles.
pairs(y) # Alternative way to produce all possible scatter plots for all-against-all columns in a matrix or a data frame.
library(scatterplot3d); scatterplot3d(y[,1:3], pch=20, color="red") # Plots a 3D scatter plot for first three columns in y.
library(geneplotter); smoothScatter(y[,1], y[,2]) # Same as above, but generates a smooth scatter plot that shows the density of the data points.
Scatter Plot Generated with Base Graphics
library(lattice)
xyplot(1:10 ~ 1:10) # Simple scatter plot.
xyplot(1:10 ~ 1:10 | rep(LETTERS[1:5], each=2), as.table=TRUE) # Plots subcomponents specified by grouping vector after '|' in separate panels. The argument as.table controls the order of the panels.
myplot <- xyplot(Petal.Width ~ Sepal.Width | Species, data = iris); print(myplot) # Assigns plotting function to an object and executes it.
xyplot(Petal.Width ~ Sepal.Width | Species, data = iris, layout = c(3, 1, 1)) # Changes layout of individual plots.
## Change plotting parameters
show.settings() # Shows global plotting parameters in a set of sample plots.
default <- trellis.par.get(); mytheme <- default; names(mytheme) # Stores the global plotting parameters in list mytheme and prints its component titles.
mytheme["background"][[1]][[2]] <- "grey" # Sets background to grey.
mytheme["strip.background"][[1]][[2]] <- "transparent" # Sets background of title bars to transparent.
trellis.par.set(mytheme) # Sets global parameters to 'mytheme'.
show.settings() # Shows custom settings.
xyplot(1:10 ~ 1:10 | rep(LETTERS[1:5], each=2), as.table=TRUE, layout=c(1,5,1), col=c("red", "blue"))
trellis.par.set(default)
Scatter Plot Generated with lattice
library(ggplot2)
ggplot(iris, aes(Sepal.Length, Sepal.Width)) + geom_point() # Plots two vectors (columns) in form of a scatter plot against each other.
ggplot(iris, aes(Sepal.Length, Sepal.Width)) + geom_point(aes(color = Species), size=4) # Plots larger dots and colors them with default color scheme.
ggplot(iris, aes(Sepal.Length, Sepal.Width)) + geom_point(aes(color = Species), size=4) + ylim(2,4) + xlim(4,8) + scale_color_manual(values=rainbow(10)) # Colors dots with custom scheme.
ggplot(iris, aes(Sepal.Length, Sepal.Width, label=1:150)) + geom_text() + opts(title = "Plot of Labels") # Prints a set of custom labels instead of symbols and adds a title to the plot.
ggplot(iris, aes(Sepal.Length, Sepal.Width, label=1:150)) + geom_point() + geom_text(hjust=0.5, vjust=0.5) # Prints both symbols and labels.
ggplot(iris, aes(Sepal.Length, Sepal.Width)) + geom_point() + opts(panel.background=theme_rect(fill = "white", colour = "black")) # Changes background color to white.
ggplot(iris, aes(Sepal.Length, Sepal.Width)) + geom_point() + stat_smooth(method="lm", se=FALSE) # Adds a regression line to the plot.
ggplot(iris, aes(Sepal.Length, Sepal.Width)) + geom_point() + coord_trans(x = "log2", y = "log2") # Plots axes in log scale.
Scatter Plot Generated with ggplot2

Line Plots
y <- as.data.frame(matrix(runif(30), ncol=3, dimnames=list(letters[1:10], LETTERS[1:3]))) # Generates a sample data set.
plot(y[,1], type="l", lwd=2, col="blue") # Plots a line graph instead.
split.screen(c(1,1))
plot(y[,1], ylim=c(0,1), xlab="Measurement", ylab="Intensity", type="l", lwd=2, col=1)
for(i in 2:length(y[1,])) { screen(1, new=FALSE); plot(y[,i], ylim=c(0,1), type="l", lwd=2, col=i, xaxt="n", yaxt="n", ylab="", xlab="", main="", bty="n") }
close.screen(all=TRUE)
# Plots a line graph for all columns in data frame 'y'. The 'split.screen()' function is used in this example in a for loop to overlay several line
# graphs in the same plot. More details on this topic are provided in the 'Arranging Plots' section.
# A very nice line plot function for time series data is available in the Mfuzz library.
Line Plot Generated with Base Graphics
xyplot(Sepal.Length ~ Sepal.Width | Species, data=iris, type="a", layout=c(1,3,1))
parallel(~iris[1:4] | Species, iris) # Plots data for each species in iris data set in separate line plot.
parallel(~iris[1:4] | Species, iris, horizontal.axis = FALSE, layout = c(1, 3, 1)) # Changes layout of plot.

ggplot(iris, aes(Sepal.Length, Sepal.Width)) + geom_line(aes(color=Species), size=1) # Plots lines for three samples in the 'Species' column in a single plot.
ggplot(iris, aes(Sepal.Length, Sepal.Width)) + geom_line(aes(color=Species), size=1) + facet_wrap(~Species, ncol=1) # Plots three line plots, one for each sample in Species column.
y <- as.data.frame(matrix(runif(30), ncol=3, dimnames=list(letters[1:10], LETTERS[1:3]))) # Generates a sample data set.
barplot(as.matrix(y[1:4,]), ylim=c(0,max(y[1:4,])+0.1), beside=T)
# Generates a bar diagram for the first four rows in y. The barplot() function expects as input format a matrix, and it uses by default the column titles for labeling the bars.
text(labels=round(as.vector(as.matrix(y[1:4,])),2), x=seq(1.5, 13, by=1)+sort(rep(c(0,1,2), 4)), y=as.vector(as.matrix(y[1:4,]))+0.02)
# Adds corresponding values on top of each bar.
ysub <- as.matrix(y[1:4,]); myN <- length(ysub[,1])
mycol1 <- gray(1:(myN+1)/(myN+1))[-(myN+1)]
mycol2 <- sample(colors(),myN); barplot(ysub, beside=T, ylim=c(0,max(ysub)*1.2), col=mycol2, main="Bar Plot", sub="data: ysub")
legend("topright", legend=row.names(ysub), cex=1.3, bty="n", pch=15, pt.cex=1.8, col=mycol2, ncol=myN)
# Generates a bar plot with a legend for the first four rows of the input matrix 'y'. To plot the diagram in black and white, set the 'col'
# argument to 'col=mycol1'. The argument 'ncol' controls the number of columns that are used for printing the legend.
par(mar=c(10.1, 4.1, 4.1, 2.1)); par(xpd=TRUE);
barplot(ysub, beside=T, ylim=c(0,max(ysub)*1.2), col=mycol2, main="Bar Plot"); legend(x=4.5, y=0.3, legend=row.names(ysub), cex=1.3, bty="n", pch=15, pt.cex=1.8, col=mycol2, ncol=myN)
# Same as above, but places the legend below the bar plot. The arguments 'x' and 'y' control the placement of the legend and the 'mar' argument
# specifies the margin sizes around the plotting area in this order: c(bottom, left, top, right).
bar <- barplot(x <- abs(rnorm(10,2,1)), names.arg = letters[1:10], col="red", ylim=c(0,5))
stdev <- x/5; arrows(bar, x, bar, x + stdev, length=0.15, angle = 90)
arrows(bar, x, bar, x - stdev, length=0.15, angle = 90)
# Creates bar plot with standard deviation bars. The example in the 'barplot' documentation provides a very useful outline of this function.
source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/mortgage.R") # Imports a function that plots a loan amortization table as bar plot.
Bar Plot with Error Bars Generated with Base Graphics

y <- matrix(sample(1:10, 40, replace=TRUE), ncol=4, dimnames=list(letters[1:10], LETTERS[1:4])) # Creates a sample data set.
barchart(y, auto.key=list(adj = 1), freq=T, xlab="Counts", horizontal=TRUE, stack=FALSE, groups=TRUE) # Plots all data in a single bar plot. The TRUE/FALSE arguments control the layout of the plot.
barchart(y, col="grey", layout = c(2, 2, 1), xlab="Counts", as.table=TRUE, horizontal=TRUE, stack=FALSE, groups=FALSE)
# Plots the different data components in separate bar plots. The TRUE/FALSE and layout arguments allow one to rearrange the bar plots in different ways.
## (A) Sample Set: the following transforms the iris data set into a ggplot2-friendly format
iris_mean <- aggregate(iris[,1:4], by=list(Species=iris$Species), FUN=mean) # Calculates the mean values for the aggregates given by the Species column in the iris data set.
iris_sd <- aggregate(iris[,1:4], by=list(Species=iris$Species), FUN=sd) # Calculates the standard deviations for the aggregates given by the Species column in the iris data set.
convertDF <- function(df=df, mycolnames=c("Species", "Values", "Samples")) { myfactor <- rep(colnames(df)[-1], each=length(df[,1])); mydata <- as.vector(as.matrix(df[,-1])); df <- data.frame(df[,1], mydata, myfactor); colnames(df) <- mycolnames; return(df) } # Defines function to convert data frames into a ggplot2-friendly format.
df_mean <- convertDF(iris_mean, mycolnames=c("Species", "Values", "Samples")) # Converts iris_mean.
df_sd <- convertDF(iris_sd, mycolnames=c("Species", "Values", "Samples")) # Converts iris_sd.
limits <- aes(ymax = df_mean[,2] + df_sd[,2], ymin = df_mean[,2] - df_sd[,2]) # Defines standard deviation limits.
## (B) Bar plots of data stored in df_mean
ggplot(df_mean, aes(Samples, Values, fill = Species)) + geom_bar(position="dodge") # Plots bar sets defined by 'Species' column next to each other.
ggplot(df_mean, aes(Samples, Values, fill = Species)) + geom_bar(position="dodge") + coord_flip() + opts(axis.text.y=theme_text(angle=0, hjust=1)) # Plots bars and labels sideways.
ggplot(df_mean, aes(Samples, Values, fill = Species)) + geom_bar(position="stack") # Plots same data set as stacked bars.
ggplot(df_mean, aes(Samples, Values)) + geom_bar(aes(fill = Species)) + facet_wrap(~Species, ncol=1) # Plots data sets below each other.
ggplot(df_mean, aes(Samples, Values, fill = Species)) + geom_bar(position="dodge") + geom_errorbar(limits, position="dodge") # Generates the same plot as before, but with error bars.
## (C) Customizing colors
library(RColorBrewer); display.brewer.all() # Select a color scheme and pass it on to 'scale_*' arguments.
ggplot(df_mean, aes(Samples, Values, fill=Species, color=Species)) + geom_bar(position="dodge") + geom_errorbar(limits, position="dodge") + scale_fill_brewer(pal="Greys") + scale_color_brewer(pal = "Greys") # Generates the same plot as before, but with grey color scheme.
ggplot(df_mean, aes(Samples, Values, fill=Species, color=Species)) + geom_bar(position="dodge") + geom_errorbar(limits, position="dodge") + scale_fill_manual(values=c("red", "green3", "blue")) + scale_color_manual(values=c("red", "green3", "blue")) # Uses custom colors passed on as vectors.

Pie Charts
y <- table(rep(c("cat", "mouse", "dog", "bird", "fly"), c(1,3,3,4,2))) # Creates a sample data set.
pie(y, col=rainbow(length(y), start=0.1, end=0.8), main="Pie Chart", clockwise=T) # Plots a simple pie chart.
pie(y, col=rainbow(length(y), start=0.1, end=0.8), labels=NA, main="Pie Chart", clockwise=T)
legend("topright", legend=row.names(y), cex=1.3, bty="n", pch=15, pt.cex=1.8, col=rainbow(length(y), start=0.1, end=0.8), ncol=1)
# Same pie chart as above, but with legend.

df <- data.frame(variable=rep(c("cat", "mouse", "dog", "bird", "fly")), value=c(1,3,3,4,2)) # Creates a sample data set.
ggplot(df, aes(x = "", y = value, fill = variable)) + geom_bar(width = 1) + coord_polar("y", start=pi / 3) + opts(title = "Pie Chart") # Plots pie chart.
ggplot(df, aes(x = variable, y = value, fill = variable)) + geom_bar(width = 1) + coord_polar("y", start=pi / 3) + opts(title = "Pie Chart") # Plots wind rose pie chart.
y <- matrix(rnorm(50), 10, 5, dimnames=list(paste("g", 1:10, sep=""), paste("t", 1:5, sep="")))
image(y) # The image function plots matrices as heatmaps.
## Sample data
library(lattice); library(gplots)
y <- lapply(1:4, function(x) matrix(rnorm(50), 10, 5, dimnames=list(paste("g", 1:10, sep=""), paste("t", 1:5, sep=""))))
## Plot single heatmap:
levelplot(y[[1]])
## Arrange several heatmaps in one plot
x1 <- levelplot(y[[1]], col.regions=colorpanel(40, "darkblue", "yellow", "white"), main="colorpanel")
x2 <- levelplot(y[[2]], col.regions=heat.colors(75), main="heat.colors")
x3 <- levelplot(y[[3]], col.regions=rainbow(75), main="rainbow")
x4 <- levelplot(y[[4]], col.regions=redgreen(75), main="redgreen")
print(x1, split=c(1,1,2,2))
print(x2, split=c(2,1,2,2), newpage=FALSE)
print(x3, split=c(1,2,2,2), newpage=FALSE)
print(x4, split=c(2,2,2,2), newpage=FALSE)

Venn Diagrams
Computation of Venn intersects for 2-20 samples and plotting of 2-5 way Venn diagrams.
 Computation of Venn intersects
The following imports several functions from the overLapper.R script for computing Venn intersects and plotting Venn diagrams (old version: vennDia.R).
These functions are relatively generic and scalable by supporting the
computation of Venn intersects for 2-20 or more samples. The upper limit of
around 20 samples is unavoidable because the complexity of Venn
intersects increases exponentially with the sample number n according to this relationship: (2^n) - 1.
A useful feature of the actual plotting step is the possibility to
combine the counts from several Venn comparisons with the same number of
test sets in a single Venn diagram. The overall workflow of the method is to first compute for a list of sample sets their Venn intersects using the overLapper
function, which organizes the result sets in a list object.
Subsequently, the Venn counts are computed and plotted as bar or Venn
diagrams. The current implementation of the plotting function, vennPlot, supports Venn diagrams for 2-5 sample sets. To analyze larger numbers of sample sets, the Intersect Plot
methods often provide reasonable alternatives. These methods are much
more scalable than Venn diagrams, but lack their restrictive intersect
logic. Additional Venn diagram resources are provided by limma, gplots, vennerable, eVenn, VennDiagram, shapes, C Seidel (online) and Venny (online).
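The (2^n) - 1 relationship mentioned above can be illustrated with a short snippet showing how quickly the number of distinct Venn intersect sets grows with the number of samples n:

```r
n <- c(2, 3, 4, 5, 10, 20)
data.frame(samples=n, venn_intersects=2^n - 1) # number of distinct Venn intersect sets per n
## For 2 samples there are 3 intersect sets, for 5 samples 31, and for 20 samples already 1,048,575.
```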
#################
## Sample data ##
#################
source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/overLapper.R") # Imports required functions.
setlist <- list(A=sample(letters, 18), B=sample(letters, 16), C=sample(letters, 20), D=sample(letters, 22), E=sample(letters, 18), F=sample(letters, 22, replace=T))
# To work with the overLapper function, the sample sets (here six) need to be stored in a list object where the different components are named by unique identifiers,
# here 'A to F'. These names are used as sample labels in all subsequent data sets and plots.
sets <- read.delim("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/Samples/sets.txt")
setlistImp <- lapply(colnames(sets), function(x) as.character(sets[sets[,x]!="", x]))
names(setlistImp) <- colnames(sets)
# Example of how a list of test sets can be imported from an external table file stored in tab-delimited format. Such a file can be easily created from a
# spreadsheet program, such as Excel. As a reminder, copy & paste from external programs into R is also possible (see read.delim function).
OLlist <- overLapper(setlist=setlist, sep="", type="vennsets"); OLlist; names(OLlist)
# With the setting type="vennsets", the overLapper function computes all Venn Intersects for the six test samples in setlist and stores the results in
# the Venn_List component of the returned OLlist object. By default, duplicates are removed from the test sets. The setting keepdups=TRUE will retain
# duplicates by appending a counter to each entry. When assigning the value "intersects" to the type argument then the function will compute Regular
# Intersects instead of Venn Intersects. The Regular Intersect approach (not compatible with Venn diagrams!) is described in the next section. Both
# analyses return a present-absent matrix in the Intersect_Matrix component of OLlist. Each overlap set in the Venn_List data set is labeled according to
# the sample names provided in setlist. For instance, the composite name 'ABC' indicates that the entries are restricted to A, B and C. The separator used
# for naming the intersect samples can be specified under the sep argument. By adding the argument cleanup=TRUE, one can minimize formatting issues in the
# sample sets. This setting will convert all characters in the sample sets to upper case and remove leading/trailing spaces.
#############################
## Bar plot of Venn counts ##
#############################
olBarplot(OLlist=OLlist, horiz=T, las=1, cex.names=0.6, main="Venn Bar Plot")
# Generates a bar plot for the Venn counts of the six test sample sets. In contrast to Venn diagrams, bar plots scale to larger numbers of sample sets. The
# layout of the plot can be adjusted by changing the default values of the argument: margins=c(4,10,3,1). The minimum number of counts to consider in the plot
# can be set with the mincount argument (default is 0). The bars themselves are colored by complexity levels using the default setting: mycol=OLlist$Complexity_Levels.
#########################
## 2-way Venn diagrams ##
#########################
setlist2 <- setlist[1:2]; OLlist2 <- overLapper(setlist=setlist2, sep="_", type="vennsets")
OLlist2$Venn_List; counts <- sapply(OLlist2$Venn_List, length); vennPlot(counts=counts)
# Plots a non-proportional 2-way Venn diagram. The main graphics features of the vennPlot function can be controlled by the following arguments (here with 2-way defaults):
# mymain="Venn Diagram": main title; mysub="default": subtitle; ccol=c("black","black","red"): color of counts; lcol=c("red","green"): label color; lines=c("red","green"):
# line color; mylwd=3: line width; ccex=1.0: font size of counts; lcex=1.0: font size of labels. Note: the vector lengths provided for the arguments ccol, lcol and lines
# should match the number of their corresponding features in the plot, e.g. 3 ccol values for a 2-way Venn diagram and 7 for a 3-way Venn diagram. The argument setlabels
# allows one to provide a vector of custom sample labels. However, assigning the proper names in the original test set list is much more effective for tracking purposes.
#########################
## 3-way Venn diagrams ##
#########################
setlist3 <- setlist[1:3]; OLlist3 <- overLapper(setlist=setlist3, sep="_", type="vennsets")
counts <- list(sapply(OLlist3$Venn_List, length), sapply(OLlist3$Venn_List, length))
vennPlot(counts=counts, mysub="Top: var1; Bottom: var2", yoffset=c(0.3, 0.2))
# Plots a non-proportional 3-way Venn diagram. The results from several Venn comparisons can be combined in a single Venn diagram by assigning to the count argument a list
# with several count vectors. The positional offset of the count sets in the plot can be controlled with the yoffset argument. The argument setting colmode=2 allows one to assign
# different colors to each count set. For instance, with colmode=2 one can assign to ccol a color vector or a list, such as ccol=c("blue", "red") or ccol=list(1:8, 8:1).
#########################
## 4-way Venn diagrams ##
#########################
setlist4 <- setlist[1:4]
OLlist4 <- overLapper(setlist=setlist4, sep="_", type="vennsets")
counts <- list(sapply(OLlist4$Venn_List, length), sapply(OLlist4$Venn_List, length))
vennPlot(counts=counts, mysub="Top: var1; Bottom: var2", yoffset=c(0.3, 0.2))
# Plots a non-proportional 4-way Venn diagram. The setting type="circle" returns an incomplete 4-way Venn diagram as circles. This representation misses two overlap sectors,
# but is sometimes easier to navigate than the default ellipse version.
#########################
## 5-way Venn diagrams ##
#########################
setlist5 <- setlist[1:5]; OLlist5 <- overLapper(setlist=setlist5, sep="_", type="vennsets")
counts <- sapply(OLlist5$Venn_List, length)
vennPlot(counts=counts, ccol=c(rep(1,30),2), lcex=1.5, ccex=c(rep(1.5,5), rep(0.6,25),1.5))
# Plots a non-proportional 5-way Venn diagram.
################################
## Export and other utilities ##
################################
OLexport <- as.matrix(unlist(sapply(OLlist5[[4]], paste, collapse=" ")))
write.table(OLexport, file="test.xls", col.names=F, quote=F, sep="\t")
# Exports intersect data in tabular format to a file.
OLexport <- data.frame(Venn_Comp=rep(names(OLlist5[[4]]), sapply(OLlist5[[4]], length)), IDs=unlist(OLlist5[[4]]))
write.table(OLexport, file="test.xls", row.names=F, quote=F, sep="\t")
# Same as above, but exports to an alternative tabular format.
tapply(counts, OLlist5[[3]], function(x) rev(sort(x)))
# Sorts the overlap results within each complexity level by their size. This allows one to identify the sample set combinations with the largest intersect
# within each complexity level.
sapply(names(setlist), function(x) table(setlist[[x]])[table(setlist[[x]])!=1])
# Command to identify and count duplicated objects in the original sample set object 'setlist'. In the given example, only set 'F' contains duplications.
# Their frequency is provided in the result.
vennPlot(counts, mymain="", mysub="", ccol="white", lcol="white") # Returns an empty Venn diagram without counts or labels.
################################################################################
## Typical analysis routine for sets of differentially expressed genes (DEGs) ##
################################################################################
ratio <- matrix(sample(seq(-5, 5, by=0.1), 100, replace=T), 100, 4, dimnames=list(paste("g", 1:100, sep=""), paste("DEG", 1:4, sep="")), byrow=T)
# Creates a sample matrix of gene expression log2 ratios. This could be any data type!
setlistup <- sapply(colnames(ratio), function(x) rownames(ratio[ratio[,x] >= 1,]))
setlistdown <- sapply(colnames(ratio), function(x) rownames(ratio[ratio[,x] <= -1,]))
# Identifies all genes with at least a two-fold up or down regulation and stores the corresponding gene identifiers in setlistup and setlistdown, respectively.
OLlistup <- overLapper(setlist=setlistup, sep="_", type="vennsets")
OLlistdown <- overLapper(setlist=setlistdown, sep="_", type="vennsets")
counts <- list(sapply(OLlistup$Venn_List, length), sapply(OLlistdown$Venn_List, length))
vennPlot(counts=counts, ccol=c("red", "blue"), colmode=2, mysub="Top: DEG UP; Bottom: DEG Down", yoffset=c(0.3, 0.2))
# Performs Venn analysis for the four sets stored in setlistup and setlistdown. The argument setting colmode=2 allows one to assign different colors to each count set.
# For instance, with colmode=2 one can assign to ccol a color vector or a list, such as ccol=c("blue", "red") or ccol=list(1:8, 8:1).
Sample Venn Diagrams (here 4- and 5-way)
Intersect Plots
Collection of methods for analyzing and visualizing intersect relationships among large numbers of sample sets.
#############################################################################################
## All Possible Intersects: for comprehensive overlap analyses of 2-20 or more sample sets ##
#############################################################################################
source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/overLapper.R")
# To compute all possible intersects among more than two samples at the same time, such as common in AB, ABC, ABCD and so on, the overLapper function can
# be used. Note: processing more than 20 sets with this function can take a long time, because the complexity of all possible intersects increases with the
# number of sample sets n according to this relationship: (2^n) - 1 (includes self comparisons). This means there are 63 possible intersects for n=6, 1,023
# for n=10, 32,767 for n=15 and 33,554,431 for n=25. With the current implementation, the computation time is about 0.1 second for n=10 and 2.5 seconds for n=15.
setlist <- lapply(1:6, function(x) sample(letters, 20)); names(setlist) <- LETTERS[seq(along=setlist)]
# To work with the overLapper function, the sample sets need to be stored in a list object where the different components are named by unique identifiers: A, B, C, ..., Z.
OLlist <- overLapper(setlist=setlist, complexity=1:length(setlist), sep="", type="intersects"); OLlist; names(OLlist)
# Computes all possible intersects for the samples stored in 'setlist'. The results are returned as a list where each overlap component is labeled by the
# corresponding sample names. For instance, the name 'ABC' indicates that the entries are common in samples A, B and C. The separator used for naming the
# intersect samples can be specified under the 'sep' argument. The complexity level range to consider can be controlled with the 'complexity' argument, which
# can have values from 2 to the total number of samples.
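As a quick sanity check of the (2^n) - 1 relationship quoted above, the intersect counts for a few set numbers can be computed directly in R (a minimal sketch, not part of the overLapper workflow):

```r
# Number of possible intersects (including self comparisons) for n sample
# sets, following the (2^n) - 1 relationship described above.
n <- c(6, 10, 15, 25)
intersect_counts <- 2^n - 1
intersect_counts  # 63 1023 32767 33554431
```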
OLlist[[2]]; OLlist[[3]] # Returns the corresponding intersect matrix and complexity levels.
OLexport <- as.matrix(unlist(sapply(OLlist[[4]], paste, collapse=" ")))
write.table(OLexport, file="test.xls", col.names=F, quote=F, sep="\t")
# Exports intersect data in tabular format to a file.
counts <- sapply(OLlist[[4]], length) # Counts the number of elements in each intersect component.
tapply(counts, OLlist[[3]], function(x) rev(sort(x)))
# Sorts the overlap results within each complexity level by their size. This allows one to identify the sample set combinations with the largest intersect within
# each complexity level.
x11(height=12, width=8); olBarplot(OLlist=OLlist, horiz=T, las=1, cex.names=0.6, main="Intersect Bar Plot")
# Plots the counts as bar diagram. More details on this function are provided in the Venn diagram section.
#####################################################################################################
## All Pairwise Intersects: to identify and cluster overlaps among two to thousands of sample sets ##
#####################################################################################################
setlist <- lapply(11:30, function(x) sample(letters, x, replace=TRUE)); names(setlist) <- paste("S", seq(along=setlist), sep="")
# To work with the following function, the sample sets (here 20) need to be stored in a list object (here 'setlist') where each component is named by a
# unique identifier (here S1-S20).
setlist <- sapply(setlist, unique)
olMA <- sapply(names(setlist), function(x) sapply(names(setlist), function(y) sum(setlist[[x]] %in% setlist[[y]]))); olMA
# Creates an intersect matrix for all pairwise sample comparisons stored in 'setlist'. This approach scales well up to several thousands of sample sets.
library("gplots"); heatmap.2(olMA, trace="none", Colv="none", Rowv="none", dendrogram="none", col=colorpanel(40, "darkred", "orange", "yellow"))
# Plots the intersect matrix as heatmap.
heatmap.2(olMA, trace="none", col=colorpanel(40, "darkred", "orange", "yellow"))
# Hierarchical clustering of the rows and columns of the intersect matrix 'olMA'. The result is plotted as a heatmap with two identical dendrograms representing
# the outcome of the hierarchical clustering. The latter is internally performed by calls of heatmap.2() to the functions dist() and hclust() using their
# default settings: euclidean distances and complete linkage. Note: the distance matrix used for clustering in this example is based on the row-to-row (column-to-column)
# similarities in the olMA object. The next example shows how one can use the olMA object directly as distance matrix for clustering after transforming the
# intersect counts into similarity measures, such as Jaccard or Rand indices.
diffMA1 <- sapply(names(setlist), function(x) sapply(names(setlist), function(y) sum(!setlist[[x]] %in% setlist[[y]])))
diffMA2 <- sapply(names(setlist), function(x) sapply(names(setlist), function(y) sum(!setlist[[y]] %in% setlist[[x]])))
jaccardMA <- olMA/(diffMA1+diffMA2+olMA)
hr <- hclust(as.dist(1-jaccardMA), method="complete", members=NULL)
heatmap.2(jaccardMA, Rowv=as.dendrogram(hr), Colv=as.dendrogram(hr), trace="none", col=colorpanel(40, "darkred", "orange", "yellow"))
# Uses the olMA for clustering after transforming its intersect counts into Jaccard (Tanimoto) similarity indices. This transformation can give reasonable
# results for sample sets with large size differences. Many alternative similarity measures for set comparisons could be considered here.
setlist[["S1"]][setlist[["S1"]] %in% setlist[["S2"]]]
# Example for returning the intersect entries for one pairwise comparison, here S1 and S2.
name1 <- matrix(colnames(olMA), length(olMA[,1]), length(olMA[,1]), byrow=T)
name2 <- matrix(colnames(olMA), length(olMA[,1]), length(olMA[,1]), byrow=F)
mynames <- paste(name1, name2, sep="_"); mynames <- lapply(strsplit(mynames, "_"), sort)
mynames <- sapply(mynames, paste, collapse="_"); olV <- as.vector(olMA)
names(olV) <- mynames; olV <- olV[!duplicated(names(olV))]; sort(olV)
# Converts the intersect matrix 'olMA' into a named vector sorted by intersect size. This allows one to identify the combinations of pairwise sample comparisons
# that have the largest or smallest intersects.
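The Jaccard transformation used above can be verified on a small toy example (made-up sets, not part of the manual's test data):

```r
# Jaccard index for two toy sets: |A intersect B| / |A union B|,
# built from the same three count terms used for jaccardMA above.
A <- c("a","b","c","d"); B <- c("c","d","e")
ol    <- sum(A %in% B)              # elements of A also in B: 2
diff1 <- sum(!A %in% B)             # elements of A not in B: 2
diff2 <- sum(!B %in% A)             # elements of B not in A: 1
jaccard <- ol/(diff1 + diff2 + ol)  # 2/(2+1+2) = 0.4
jaccard
```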
######################################################################################################
## Presence-absence matrices: to identify memberships of items across large numbers of sample sets  ##
######################################################################################################
setlist <- lapply(11:30, function(x) sample(letters, x, replace=TRUE)); names(setlist) <- paste("S", seq(along=setlist), sep="") # Generates a sample data set.
source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/overLapper.R") # Imports the required overLapper() function.
paMA <- overLapper(setlist=setlist, type="intersects", complexity=2)[[2]]
paMA[names(rev(sort(rowSums(paMA)))),]
# Creates a presence-absence matrix of all items across their sample sets. In this matrix the presence information is
# indicated by ones and its rows are sorted by the presence frequencies. Alternative versions of the presence-absence matrix can be returned by setting the 'type' argument to 1, 2 or 3.
library("gplots"); heatmap.2(paMA, trace="none", Colv="none", Rowv="none", dendrogram="none", col=c("white", "gray")) # Plots the presence-absence matrix as heatmap.
rowSums(paMA) # Returns the presence frequencies of the items in the sample sets.
sapply(rownames(paMA), function(x) colnames(paMA)[paMA[x, ]==1])
# Returns for each item the names of sample sets where it is present. The opposite result for absent calls can be returned by changing '==1' to '==0'.
Intersect-based Heatmap with Clustering Using Jaccard Similarity Measure
Histograms
x <- rnorm(100); hist(x, freq=FALSE); curve(dnorm(x), add=TRUE)
# Plots a random distribution as histogram and overlays a curve ('freq=FALSE' ensures a density histogram; 'add=TRUE' enables 'overplotting').
plot(x <- 1:50, dbinom(x, size=50, prob=.33), type="h")
# Creates a pin diagram for a binomial distribution with n=50 and p=0.33.
Basic Histogram Generated with Base Graphics
ggplot2 (see this page for more details)
ggplot(iris, aes(x=Sepal.Width)) + geom_histogram(aes(fill = ..count..), binwidth=0.2) # Plots a histogram for the second column in the 'iris' data set.
ggplot(iris, aes(x=Sepal.Width)) + geom_histogram(aes(y = ..density.., fill = ..count..), binwidth=0.2) + geom_density() # Density representation.

Density Plots
plot(density(rnorm(10)), xlim=c(-2,2), ylim=c(0,1), col="red") # Plots a random distribution in form of a density plot.
split.screen(c(1,1))
plot(density(rnorm(10)), xlim=c(-2,2), ylim=c(0,1), col="red")
screen(1, new=FALSE)
plot(density(rnorm(10)), xlim=c(-2,2), ylim=c(0,1), col="green", xaxt="n", yaxt="n", ylab="", xlab="", main="", bty="n")
close.screen(all=TRUE)
# Several density plots can be overlayed with the 'split.screen' function. In this example two density plots are plotted at the same scale ('xlim/ylim') and
# overlayed on the same graphics device using 'screen(1, new=FALSE)'.


Box Plots
y <- as.data.frame(matrix(runif(300), ncol=10, dimnames=list(1:30, LETTERS[1:10]))) # Generates a sample data set.
boxplot(y, col=1:length(y))
# Creates a box plot. A box plot (also known as a box-and-whisker diagram) is a graphical representation of a five-number summary, which consists of the smallest
# observation, lower quartile, median, upper quartile and largest observation.
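The five-number summary underlying a box plot can also be computed directly with base R's fivenum() function, shown here on a simple vector:

```r
# fivenum() returns: minimum, lower hinge, median, upper hinge, maximum,
# i.e. the five values a box-and-whisker diagram is drawn from.
fivenum(1:10)  # 1.0 3.0 5.5 8.0 10.0
```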

ROC Curves

Feature Maps for DNA/Protein Sequences
## The featureMap.R script plots simple feature maps of biological sequences based on provided mapping coordinates.
source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/featureMap.txt")
## Alternative approach for plotting simple chromosome maps
plot(x <- rnorm(40, 2e+07, sd=1e+07), y <- rep(1, times=40), type="h", col="blue", xaxt="n", yaxt="n", bty="n")
abline(h=0.78, col="green", lwd=12)
lines(a <- rnorm(5, 2e+07, sd=1e+07), b <- rep(1, times=5), type="h", col="red", lwd=2)


Miscellaneous Plotting Utilities
plot(1:10); lines(1:10, lty=1)
# Superimposes lines onto existing plots. Using the plotted values will connect the data points. Different line types can be specified with the
# argument 'lty=2'. More on this can be found in the documentation for 'par'.
plot(x <- 1:10, y <- 1:10); abline(1,1, col="green"); abline(-1,1, col="red"); abline(v=5, col="blue"); abline(h=5, col="brown")
# Function 'abline' adds lines in different colors to an x-y plot.

Arranging Plots
x11(height=6, width=12, pointsize=12) # Defines the size of the plotting device.
par(mfrow=c(2,3)); for(i in 1:6) { plot(1:10) }
# With the argument 'par(mfrow=c(nrow,ncol))' one can define how several plots are arranged next to each other on the same plotting device.
nf <- layout(matrix(c(1,2,3,3), 2, 2, byrow=TRUE), c(3,7), c(5,5), respect=TRUE)
layout.show(nf)
for(i in 1:3) { barplot(1:10) }
# The 'layout()' function allows one to divide the plotting device into variable numbers of rows and columns with the column widths and the row heights specified in the respective arguments.
split.screen(c(1,1))
plot(density(rnorm(10)), xlim=c(-2,2), ylim=c(0,1), col="red")
screen(1, new=FALSE)
plot(density(rnorm(10)), xlim=c(-2,2), ylim=c(0,1), col="green", xaxt="n", yaxt="n", ylab="", xlab="", main="", bty="n")
close.screen(all=TRUE)
# Possibility to overlay independent plots with the 'split.screen' function. In this example two density plots are plotted at the same scale ('xlim/ylim')
# and overlayed on the same graphics device using 'screen(1, new=FALSE)'. Split.screen is also very useful for generating multiple plots next to each
# other on a single device by setting the layout setting to several screens, e.g. split.screen(c(2,2)).
 Example of Arranging Plots Using Base Graphics

Customize XY Axes
op <- par(mar=c(6,6,6,8)); barplot(1:10, names.arg=letters[1:10], cex.names=1.5, col=3, yaxt="n")
# Creates a bar plot without y-axis and with wider margins around the plot.
axis(2, las=1, at=seq(0,10,0.5), lwd=2, cex.axis=1.5); axis(3, las=1, lwd=2, cex.axis=1.5); axis(4, las=1, at=seq(0,10,2), labels=month.name[1:6], lwd=2, cex.axis=1.5); par(op)
# Adds three custom axes.
curve(sin(x), -pi, pi, axes=F, ylab="", xlab=""); axis(1, pos=0); axis(2, pos=0)
# Plots a function in a Cartesian-style coordinate system.
Examples of Customizing XY Axes
Color Selection Utilities
y <- as.data.frame(matrix(runif(30), ncol=3, dimnames=list(letters[1:10], LETTERS[1:3]))) # Generates a sample data set.
cat("palette():\n"); palette(); palette(rainbow(20, start=0.1, end=0.2))
cat("palette(rainbow(20, start=0.1, end=0.2)):\n"); palette(); palette("default")
# Prints and modifies the color palette which is used when the argument 'col=' has a numeric index. The last step sets the palette back to its default setting.
par(mfrow=c(1,2)); barplot(as.matrix(y[1:8,]), beside=T, col=1:8)
palette(rainbow(8, start=0.1, end=0.8)); barplot(as.matrix(y[1:8,]), beside=T, col=1:8)
palette("default")
# Example to call the default color palette with a numeric vector in the 'col=' argument. In the second plot a modified palette is called the same way.
example(rainbow) # The function rainbow() allows one to select various predefined color schemes.
barplot(as.matrix(y[1:10,]), beside=T, col=rainbow(10, start=0.1, end=0.2))
# Example to select 10 colors with the rainbow() function. The start and end values need to be between 0 and 1. The wider their distance, the more diverse the resulting colors are.
barplot(as.matrix(y[1:10,]), beside=T, col=gray(c(1:10)/10)) # The gray() function allows one to select any type of gray shade by providing values from 0 to 1.
colors()[1:10]; col2rgb(colors()[1:10]) # The colors() function allows one to select colors by their name. The col2rgb() function translates them into the RGB color code.
library(RColorBrewer); example(brewer.pal) # Shows how to select color schemes with the RColorBrewer library.
Adding Text
y <- as.data.frame(matrix(runif(30), ncol=3, dimnames=list(letters[1:10], LETTERS[1:3]))) # Generates a sample data set.
plot(y[,1], y[,2]); text(0.2, 0.2, "My Text") # Adds text at defined coordinates of the plot.
plot(y[,1], y[,2]); mtext("My Text", 4) # Adds text to one of the four margin areas (side: 1-4) of the plot.
plot(y[,1], y[,2], pch=20, col="red", main="Plot of Symbols and Labels"); text(y[,1]+0.03, y[,2], rownames(y))
# Plots a scatter plot with the corresponding labels (here row names) next to the data points.

Interactive Functions
locator() # Allows the user to click on an existing graph with the left mouse button. The system returns the corresponding x-y coordinates after clicking the right mouse button.
text(locator(1), "My_Outlier", adj=0) # Adds the text "My_Outlier" to the graph at the position where the left mouse button is clicked.
Saving Graphics to Files
jpeg("test.jpeg"); plot(1:10, 1:10); dev.off()
# After the 'jpeg("test.jpeg")' command all graphs are redirected to the file "test.jpeg" in JPEG format. To export images with the highest quality,
# the default setting "quality = 75" needs to be changed to 100%. The actual image data are not written to the file until the 'dev.off()' command is executed!
pdf("test.pdf"); plot(1:10, 1:10); dev.off() # Same as above, but for PDF format. The pdf and svg formats often provide the best image quality, since they scale to any size without pixelation.
png("test.png"); plot(1:10, 1:10); dev.off() # Same as above, but for PNG format.
postscript("test.ps"); plot(1:10, 1:10); dev.off() # Same as above, but for PostScript format.
library("RSvgDevice"); devSVG("test.svg"); plot(1:10, 1:10); dev.off() # Generates Scalable Vector Graphics (SVG) files that can be edited in vector graphics programs, such as InkScape.
Importing Graphics into R
library(pixmap); x <- read.pnm(file="my_image.pnm"); plot(x); print(x)
# Images need to be in pnm format. This format can be obtained in ImageMagick like this: 'convert my_image.jpg my_image.pnm'.
Graphics Exercises

Writing Your Own Functions
 The R language allows users to create their own function objects. The basic syntax for defining new functions is:
name <- function(arg_1, arg_2, ...) expression # Basic syntax for functions.
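A minimal worked example of this syntax (the function name and behavior are illustrative, not from the manual):

```r
# Computes the coefficient of variation of a numeric vector; the optional
# 'digits' argument controls the rounding of the result.
myCV <- function(x, digits=2) {
    round(sd(x)/mean(x), digits)
}
myCV(c(2, 4, 6, 8))  # returns 0.52
```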
 A much more detailed introduction into writing functions in R is available in the Programming in R section of this manual.

R Web Applications
 Various webbased R services and R web development kits are available:

 R Web Services
 R Cloud Computing
 R Web Development Kits
R Basics Exercises
 Slide Show (R Code)
 The following exercises introduce a variety of useful data analysis utilities available in R.
 Import from spreadsheet programs (e.g. Excel)
 Download the following molecular weight and subcellular targeting
tables from the TAIR site, import the files into Excel and save them as
tab delimited text files.
 Import the tables into R
my_mw <- read.delim(file="MolecularWeight_tair7.xls", header=T, sep="\t") # Imports molecular weight table.
my_target <- read.delim(file="TargetP_analysis_tair7.xls", header=T, sep="\t") # Imports subcellular targeting table.
 Online import
 Check tables and rename gene ID columns
my_mw[1:4,]; my_target[1:4,]; colnames(my_target)[1] <- "ID"; colnames(my_mw)[1] <- "ID"
# Prints out the first four rows of the two data frames and renames their gene ID columns.
 Merge the two tables based on common ID field
my_mw_target <- merge(my_mw, my_target, by.x="ID", by.y="ID", all.x=T)
 Shorten one table before the merge and then remove the non-matching rows (NAs) in the merged file
my_mw_target2 <- merge(my_mw, my_target[1:40,], by.x="ID", by.y="ID", all.x=T) # To remove non-matching rows, use the argument setting 'all=F'.
na.omit(my_mw_target2) # Removes rows containing "NAs" (non-matching rows).
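The effect of the 'all' arguments of merge() can be seen on a small toy example (made-up data frames, not the TAIR tables):

```r
# merge() with all.x=TRUE keeps every row of the first data frame, filling
# non-matches with NA; all=FALSE returns only the rows with common IDs.
df1 <- data.frame(ID=c("a","b","c"), MW=c(10,20,30))
df2 <- data.frame(ID=c("b","c","d"), Loc=c("C","M","N"))
m_left  <- merge(df1, df2, by="ID", all.x=TRUE)  # 3 rows; row 'a' has NA in Loc
m_inner <- merge(df1, df2, by="ID", all=FALSE)   # 2 rows: the common IDs 'b' and 'c'
```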
Problem 1: How can the merge function in the previous step be executed so that only the common rows among the two data frames are returned? Prove that both methods, the two-step version with na.omit and your method, return identical results.
Problem 2: Replace all NAs in the data frame my_mw_target2 with zeros.
 Apply several combinatorial queries (filters) on the data set. For example:
my_mw_target[my_mw_target[,2]>100000 & my_mw_target[,4]=="C",] # Retrieves all records with MW greater than 100,000 and targeting to chloroplasts.
dim(my_mw_target[my_mw_target[,2]>100000 & my_mw_target[,4]=="C",]) # Determines the number of matches for the above query.
Problem 3: How many protein entries in the my_mw_target data frame have a MW of greater than 4,000 and less than 5,000? Subset the data frame accordingly and sort it by MW to check that your result is correct.
 Use a regular expression in a substitute function to generate a separate ID column that lacks the gene model extensions
my_mw_target3 <- data.frame(loci=gsub("\\..*", "", as.character(my_mw_target[,1]), perl=TRUE), my_mw_target)
my_mw_target3 <- data.frame(loci=gsub("(^.*)\\.\\d{1,}", "\\1", as.character(my_mw_target[,1]), perl=TRUE), my_mw_target) # Same as above, but with a backreference.
Problem 4: Retrieve those rows in my_mw_target3 where the second column contains the following identifiers: c("AT5G52930.1", "AT4G18950.1", "AT1G15385.1", "AT4G36500.1", "AT1G67530.1"). Use the '%in%' function for this query. As an alternative approach, assign the second column to the row index of the data frame and then perform the same query again using the row index. Explain the difference between the two methods.
 Calculate the number of gene models per locus with the table() function
my_mw_target4 <- cbind(my_mw_target3, Freq=table(my_mw_target3[,1])[my_mw_target3[,1]])
# The table() function counts the number of duplicates for each locus and cbind() appends the information to the data frame.
my_mw_target4[order(my_mw_target4$Freq),][1:30,] # Sorts by count information.
length(unique(my_mw_target4[,1])) # Removes duplicates and prints the number of unique loci.
sum(!duplicated(my_mw_target4[,1])) # Gives the same result as the previous command; remember: the function 'duplicated' returns a logical vector.
length(my_mw_target4[,2]) - length(unique(my_mw_target4[,1])) # Calculates the number of duplicated entries.
sum(duplicated(my_mw_target4[,1])) # Generates the same result as the previous command.
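The counting step with table() can be illustrated on a toy ID vector (hypothetical locus IDs, not the TAIR data):

```r
# table() counts each ID's frequency; indexing the resulting table with the
# original vector expands those counts back to one entry per row.
loci <- c("AT1G01010", "AT1G01010", "AT1G01020")
freq <- table(loci)[loci]
as.vector(freq)  # 2 2 1
```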
 Perform a variety of calculations across columns
data.frame(my_mw_target4, avg_AA_WT=(my_mw_target4[,3]/my_mw_target4[,4]))[1:40,] # Calculates the average AA weight per protein.
data.frame(my_mw_target4, mean=apply(my_mw_target4[,6:9], 1, mean))[1:40,] # Calculates the mean across several fields of each row.
data.frame(my_mw_target4, mean=apply(my_mw_target4[,6:9], 1, mean), stdev=apply(my_mw_target4[,6:9], 1, sd, na.rm=T))[1:40,]
# Calculates the mean and standard deviation across several fields of each row.
mean(my_mw_target4[,c(1,2,5)], na.rm=T) # Calculates the mean for all numeric columns.
data.frame(my_mw_target4[1:40,], ttest=apply(my_mw_target4[1:40,6:9], 1, function(x) {t.test(x[1:2], x[3:4])$p.value}))
# Calculates a two-sample t-test for two slices in each row of the data frame and records the corresponding p-values. To calculate a paired t-test,
# add the argument 'paired=T'.
 Export data frame to Excel
write.table(my_mw_target4, file="my_file.xls", quote=F, sep="\t", col.names=NA) # Writes the data frame 'my_mw_target4' into a tab-delimited text file that can be opened with Excel.
 Plotting samples
boxplot(my_mw_target4[,3], my_mw_target4[,4], names=c("MW","CC"), col=c("blue","red"), main="Box Plot") # Generates box plots for MW and AA counts.
plot(density(my_mw_target4[,3]), col="blue", main="Density Plot") # Plots a density distribution plot for MW.
barplot(my_mw_target4[1:5,3], beside=F, col=c("red", "green", "blue", "magenta", "black"), names.arg=as.vector(my_mw_target4[1:5,1]), main="MW Bar Plot", ylim=c(0,230000)) # Bar plot of MW for the first five loci.
source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/overLapper.R") # Imports the Venn diagram function.
setlist <- list(A=sample(my_mw_target4[,1], 1000), B=sample(my_mw_target4[,1], 1100), C=sample(my_mw_target4[,1], 1200), D=sample(my_mw_target4[,1], 1300), E=sample(my_mw_target4[,1], 1400)) # Generates 5 sets of random IDs.
OLlist <- overLapper(setlist=setlist, sep="_", type="vennsets")
counts <- sapply(OLlist$Venn_List, length); vennPlot(counts=counts, ccol=c(rep(1,30),2), lcex=1.5, ccex=c(rep(1.5,5), rep(0.6,25),1.5)) # Plots a 5-way Venn diagram for the random ID sets.
OLexport <- as.matrix(unlist(sapply(OLlist[[4]], paste, collapse=" ")))
write.table(OLexport, file="OLexport.xls", col.names=F, quote=F, sep="\t") # Exports intersect data in tabular format to a file.
pdf("venn_plot.pdf"); vennPlot(counts=counts); dev.off() # Saves the plot image to a PDF file.
Problem 5: Generate two key lists each with 4 random sample sets. Compute their overlap counts and plot the results for both lists in one Venn diagram.
Problem 6: Write all commands from the previous exercises into an R script (exerciseRbasics.R) and execute it with the source() function like this: source("exerciseRbasics.R"). This will execute all of the above commands and generate the corresponding output files in the current working directory.
HT Sequence Analysis with R and Bioconductor
 This section of the manual is available on the HTSeq site.
Programming in R
 This section of the manual is available on the Programming in R site.
Bioconductor
Introduction
BioConductor is an open source and open development software project for the analysis of genome data (e.g.
sequence, microarray, annotation and many other data types). This
section of the manual provides a brief introduction into the usage and
utilities of a subset of packages from the BioConductor project. The
included packages are a 'personal selection' of the author of this
manual that does not reflect the full utility spectrum of the
BioConductor project. The introduced packages were chosen because the
author uses them often for his own teaching and research. To obtain a
broad overview of available BioConductor packages, it is strongly
recommended to consult its official project site.
Due to the rapid development of many packages, it is also important to
be aware that this manual will often not be fully up-to-date. Because of
this and many other reasons, it is absolutely critical to use the
original documentation of each package (PDF manual or vignette) as
primary source of documentation. Users are welcome to send suggestions
for improving this manual directly to its author.

Finding Help
 The instructions for installing BioConductor packages are available in the administrative section of this manual. Documentation for BioConductor packages can be found in the vignette of each package. A listing of the available packages is available on the BiocViews page. Annotation libraries can be found on the meta data page. The BioC Mail GMANE Archive
can be searched to find answers to questions that have been discussed
on the mailing list. Another valuable information resource is the BioConductor Book. The basic R help functions provide additional information about packages and their functions:

library(affy) # Loads a particular package (here affy package).
library(help=affy) # Lists all functions/objects of a package (here affy package).
library() # Lists all libraries/packages that are available on a system.
openVignette() # Provides documentation on packages.
sessionInfo() # Prints version information about R and all loaded packages. The generated output should be provided when sending questions or bug reports to the R and BioC mailing lists.
?my_topic # Opens documentation on a function.
help.search("topic") # Searches help system for documentation.
search() # Shows loaded packages and attached data frame in current search path.
system.file(package="affy") # Shows location of an installed package.
library("tools"); vigSrc <- list.files(pattern="Rnw$", system.file("doc", package="GOstats"), full.names=TRUE); vigSrc; for (v in vigSrc) Stangle(v) # Extracts R code from a vignette and saves it to file(s) in your current working directory.
Affy Packages
 BioConductor provides extensive resources for the analysis of Affymetrix data. The most important ones are introduced here.
Affy
The Affy package provides basic methods for analyzing Affymetrix
oligonucleotide arrays. The following steps outline the usage of the
Affy package and associated packages.
 Obtaining scaled expression values with 5 different methods (MAS5,
RMA, GCRMA, Plier & dChip). A comparison of three of these methods
including references is available in this slide show.
Move your CEL files into your working directory and make sure that
the corresponding CDF library for your chips is available on your
system.
library(affy) # Loads the affy package.
mydata <- ReadAffy() # Reads all *.CEL (*.cel) files in your current working directory and stores them into the AffyBatch object 'mydata'.
mydata <- ReadAffy(widget=TRUE) # Opens a file browser to select specific CEL files.
eset <- rma(mydata) # Creates normalized and background corrected expression values using the RMA method. The generated data are stored as ExpressionSet class in the 'eset' object. For large data sets use the more memory efficient justRMA() function.
eset <- mas5(mydata) # Uses the expresso (MAS 5.0 method) module instead of RMA; this is much slower than RMA!
eset_gcrma <- gcrma(mydata) # Use this command to acquire gcrma data. The 'library(gcrma)' needs to be loaded first.
eset_plier <- justPlier(mydata) # Use this command to acquire plier data. The 'library(plier)' needs to be loaded first.
eset_PMA <- mas5calls(mydata) # Generates MAS 5.0 P/M/A calls. The command creates an ExpressionSet with P/M/A calls in the 'exprs' slot and the Wilcoxon p-values in the 'se.exprs' slot. To access them see below.
eset <- expresso(mydata, normalize.method="invariantset", bg.correct=FALSE, pmcorrect.method="pmonly", summary.method="liwong") # Generates expression calls similar to the dChip (MBEI) method from Li and Wong.
library(affycoretools); affystart(plot=TRUE, express="mas5") # Handy function to normalize all CEL files in the current working directory, perform QC plots and export the normalized data to file. Works for mas5, rma and gcrma.
 Exporting data from 'ExpressionSet' objects
write.exprs(eset, file="mydata.txt") # Writes expression values to text file in working directory.
x <- data.frame(exprs(eset), exprs(eset_PMA), assayDataElement(eset_PMA, "se.exprs")); x <- x[,sort(names(x))]; write.table(x, file="mydata_PMA.xls", quote=FALSE, col.names=NA, sep="\t") # Writes expression values, P/M/A calls and Wilcoxon p-values to a text file in the working directory. Remember: the old command for accessing the Wilcoxon p-values in BioC versions <2.0 was 'se.exprs(eset_PMA)'.
 Obtaining single probe-level data from 'AffyBatch' objects (see also the documentation '?ReadAffy')
mypm <- pm(mydata) # Retrieves the PM intensity values for single probes.
mymm <- mm(mydata) # Retrieves the MM intensity values for single probes.
myaffyids <- probeNames(mydata) # Retrieves the Affy IDs.
result <- data.frame(myaffyids, mypm, mymm) # Combines the above information in a data frame.
 Working with 'ExpressionSet' objects (see Biobase Manual)
eset; pData(eset) # Provides summary information of ExpressionSet object 'eset' and lists the analyzed file names.
exprs(eset)[1:2,1:4]; exprs(eset)[c("244901_at","244902_at"),1:4] # Retrieves specific rows and fields of an ExpressionSet object. To learn more about this class, consult the ExpressionSet manual with the command '?ExpressionSet'.
test <- as.data.frame(exprs(eset)); eset2 <- new("ExpressionSet", exprs=as.matrix(test), annotation="ath1121501"); eset2 # Example for creating an ExpressionSet object from a data frame. To create the object from an external file, import the table with the read.delim() function first and then convert it accordingly.
data.frame(eset) # Prints content of 'eset' as data frame to STDOUT.
exprs(eset_PMA)[1:2,1:2]; assayDataElement(eset_PMA, "se.exprs")[1:2,1:2] # Prints from the ExpressionSet generated by 'mas5calls(mydata)' the P/M/A calls from its 'exprs' slot and the p-values from its 'se.exprs' slot.
 Retrieving annotation data for Affy IDs (see Annotation Package Manual)
library(ath1121501.db) # Opens library with annotation data.
library(help=ath1121501.db) # Shows availability and syntax for annotation data.
ath1121501() # Provides a summary of the available annotation data sets of an annotation library.
library(ath1121501cdf); ls(ath1121501cdf) # Retrieves all Affy IDs for a chip in vector format.
x <- c("245265_at", "260744_at", "259561_at", "254759_at", "267181_at") # Generates a sample data set of Affy ID numbers.
contents(ath1121501ACCNUM)[1:40] # Retrieves locus ID numbers for Affy IDs.
myAnnot <- data.frame(ACCNUM=sapply(contents(ath1121501ACCNUM), paste, collapse=", "), SYMBOL=sapply(contents(ath1121501SYMBOL), paste, collapse=", "), DESC=sapply(contents(ath1121501GENENAME), paste, collapse=", ")); myAnnot[x, ] # Organizes an entire set of annotation features in a data frame, here: ACCNUM, SYMBOL and GENENAME. The annotations from probe sets with several mappings are collapsed into single strings.
mget(x, ath1121501GO, ifnotfound=NA) # Retrieves GO information for Affy IDs.
 Accessing probe sequence data
library(ath1121501probe) # Opens library with probe sequence data.
print.data.frame(ath1121501probe[1:22,]) # Prints probe sequences and their positions for first two Affy IDs.
library(Biostrings); pm <- DNAStringSet(ath1121501probe$sequence); names(pm) <- ath1121501probe$Probe.Set.Name # Stores the probe sequences as a DNAStringSet object. See the HTSeq manual for more details.
Visualization and quality controls
 Additional information on this topic can be found in the affyPLM QC Manual, the arrayQualityMetrics package and on the WEHI Affy Lab site.
library(affyQCReport); QCReport(mydata, file="ExampleQC.pdf") # Generates a comprehensive QC report for the AffyBatch object 'mydata' in PDF format. See affyQCReport for details.
deg <- AffyRNAdeg(mydata); summaryAffyRNAdeg(deg); plotAffyRNAdeg(deg) # Performs RNA degradation analysis. It averages on each chip the probes relative to the 5'/3' position on the target genes. A summary list and a plot are returned.
image(mydata[ ,1]) # Reconstructs image with log intensities of first chip.
hist(mydata[ ,1:2]) # Plots histogram of PM intensities for 1st and 2nd array.
hist(log2(pm(mydata[,1])), breaks=100, col="blue") # Plots a histogram of the PM ('pm') or MM ('mm') log intensities of the 1st array.
boxplot(mydata,col="red") # Generates a box plot of unnormalized log intensity values.
boxplot(data.frame(exprs(eset)),col="blue", main="Normalized Data") # Generates a box plot of normalized log intensity values.
mva.pairs(pm(mydata)[,c(1,4)]) # Creates an MA plot for the unnormalized data. An MA plot is a plot of the log-intensity ratios (M values) versus the log-intensity averages (A values) between selected chips (here chips 1 and 4).
mva.pairs(exprs(eset)[,c(1,4)]) # Creates an MA plot for the normalized data.
Simpleaffy
 The simpleaffy package automates many repetitive high-level analysis steps.
 Creating expression values from CEL files
Move CEL files into working directory.
Create a whitespace-delimited table file "covdesc.txt": the first column (no title) contains the CEL file names; the second column (titled "treatment") contains the treatment names.
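As a sketch, such a minimal "covdesc.txt" file can also be written from within R. The CEL file and treatment names below are hypothetical placeholders; a real file would list your own arrays:

```r
# Hypothetical CEL file and treatment names; adjust them to your experiment.
covdesc <- data.frame(treatment=c("Ctrl", "Ctrl", "CMP1", "CMP1"),
                      row.names=c("ctrl1.cel", "ctrl2.cel", "cmp1.cel", "cmp2.cel"))
# 'col.names=NA' leaves the first column (the CEL file names) untitled, as required.
write.table(covdesc, file="covdesc.txt", quote=FALSE, sep=" ", col.names=NA)
read.table("covdesc.txt", header=TRUE) # Re-imports the table for checking the layout.
```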
library(simpleaffy) # Loads "affy" and "simpleaffy" packages.
raw.data <- read.affy("covdesc.txt") # Reads the data in the working directory that are specified in the "covdesc.txt" file; for help on this function type "?read.affy".
x.rma <- call.exprs(raw.data, "rma") # Creates expression values using the RMA method. The function 'call.exprs' always returns log2 expression values; for help on this function type "?call.exprs". One can also use the "gcrma" method here after loading it with the command 'library(gcrma)'.
x.mas <- call.exprs(raw.data, "mas5") # Computes expression values using the MAS 5.0 method.
write.exprs(x.rma, file="mydata.txt") # Writes the expression values to the file "mydata.txt".
 Quality control steps
The "qc" function provides several QC stats for chips. To use it, follow these steps:
x <- read.affy("covdesc.txt") # Reads the data in the working directory.
x.mas5 <- call.exprs(x, "mas5") # Calculates expression values with the MAS 5.0 method, which is required for the next step!
qc <- qc(x, x.mas5) # Calls the "qc" function, which generates an object containing scale factors, GAPDH/Actin 3'/5' ratios, target intensity, percent present, average/min/max/mean background intensity and more; for more details type "?qc".
x.mas5@description@preprocessing # Creates QC output like this.
getGapdh3(cleancdfname(cdfName(x))) # To find out which GAPDH/Actin probe sets are used on a given chip; for others use getGapdhM, getGapdh5, getBioB, getBioC...
plot(qc) # Creates summary plot of QC data. See description on page 5 of vignette "simpleaffy".
 Filtering by expression values
get.array.subset(x.rma, "treatment", c("CMP1","Ctrl")) # When R loaded the *.CEL files it also read the experiment layout from the covdesc.txt file (Ctrl and CMP1). The "get.array.subset" function makes it easy to select subsets of arrays.
results <- pairwise.comparison(x.rma, "treatment", c("1", "2"), raw.data) # Computes the mean intensities of the replicates from two treatments (means), calculates log2 fold changes between the means (fc), performs a t-test (tt) and writes out the P/M/A calls (calls). Type "?pairwise.comparison" for more help on this function.
write.table(data.frame(means(results), fc(results), tt(results), calls(results)), file="my_comp.txt", sep="\t") # Command to export all 'pairwise.comparison' data into one table.
sort(abs(fc(results)), decreasing=TRUE)[1:100] # Prints the 100 highest fold changes (fc); the accessors means(), tt() and calls() retrieve the mean intensities, t-test results and P/M/A calls accordingly.
significant <- pairwise.filter(results, min.exp=log2(10), min.exp.no=2, fc=log2(8), min.present.no=4, tt=0.001, present.by.group=FALSE) # The function 'pairwise.filter' takes the output from 'pairwise.comparison' and filters for significant changes. Filter arguments:
 min.exp: minimum expression cut-off
 min.exp.no: occurrence of 'min.exp' on at least this number of chips
 min.present.no: present calls on at least this number of chips
 present.by.group: If true, then the present count is restricted to replicate groups and not to all chips of an experiment!
 fc: A gene must show a log2 fold change greater than this to be called significant. A 2.5-fold change can be specified like this: 'fc=log2(2.5)'.
 tt: A gene must be changing with a p-score less than this to be called significant.
Type "?pairwise.filter" for more info on the different arguments of this function.
write.table(data.frame(means(significant), fc(significant), tt(significant), calls(significant)), file="my_comp.txt", col.names=NA, sep="\t") # Exports all 'pairwise.filter' data into one table.
 Plotting results
plot(significant, type="scatter") # Plots significant changes from above as scatter plot. Alternative plotting types: type="ma" or type="volcano".
plot(results, significant, type="scatter") # Plots the means of the two replicate groups as a scatter plot and highlights all genes that meet the criteria of the 'significant' filter. Meaning of colors: red - all present; orange - all present in one group or the other; yellow - all that remain.
plot(results,type="scatter") # Plots means of the two replicate groups as scatter plot.
png(file="figure1.png"); plot(significant,type="scatter"); dev.off() # Writes scatter plot to image file.
 Exercise simpleaffy (Command Summary)
Analysis of Differentially Expressed Genes
 Various packages are available for analyzing preprocessed data from dual-color and Affymetrix arrays.
Limma
 Limma is a software package for the analysis of gene expression
microarray data, especially the use of linear models for analyzing
designed experiments and the assessment of differential expression. The
package includes preprocessing capabilities for two-color spotted
arrays. The differential expression methods apply to all array platforms
and treat Affymetrix, single-channel and two-channel experiments in a
unified way. The methods are described in Smyth 2004 and in the limma manual. An illustrated introduction for the GUI packages can be found at WEHI.
 Tcl/Tk Requirements
 Basic usage
library(limma) # Loads commandline limma package.
limmaUsersGuide() # Opens the PDF manual for limma.
The cDNA and affy sample data sets of the manual can be downloaded from the limmaGUI and affylmGUI pages.
library(affylmGUI); affylmGUI() # Opens affylmGUI for affy data. For a quick start, follow the instructions for the Estrogen data set.
library(limmaGUI); limmaGUI() # Opens limmaGUI for cDNA array data. For a quick start, follow the instructions for the Swirl Zebrafish data set.
 Data objects in limma
There are four main data objects created and used by limma:
 RGList: for cDNA data created by function 'read.maimages()'
 MAList: for cDNA data created by functions MA.RG() or 'normalizeWithinArrays()'
 MArrayLM: created by function 'lmFit()'
 TestResults: created by function 'decideTests()'
More details on this can be found in paragraph 13 of the limma PDF manual ('limmaUsersGuide()').
Limma: Dual Color Arrays
Reading cDNA array data
To make the following commands work, save and extract the SWIRL
cDNA microarray sample data into your R working directory. For a quick
demonstration of the analysis of this data set, one can copy&paste
or source the following commandline summary into the R terminal: my_swirl_commands.txt.
Requirements
 Have all intensity data files in one directory. Supported formats are: ArrayVision, ImaGene, GenePix, QuantArray, SMD (QuantArray) or SPOT. If an intensity data file
format is not supported then one can specify the corresponding column
names during the data import into R (see below). Type '?read.maimages'
for more information about supported formats.
 Targets file: Format of targets file. This file defines which RNA sample was hybridized to each channel of each array.
 Only for SPOT: separate gene list file
 Optional: spot type file for identifying special probes such as controls.
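A minimal sketch of such a targets file, written from within R. All file and sample names below are hypothetical; a real targets file would list your own arrays and the RNA sources hybridized to each channel:

```r
# Hypothetical two-array dye-swap layout; Cy3/Cy5 name the RNA on each channel.
targets <- data.frame(SlideNumber=c(1, 2),
                      FileName=c("array1.spot", "array2.spot"),
                      Cy3=c("WT", "mutant"),
                      Cy5=c("mutant", "WT"))
write.table(targets, file="Targets.txt", quote=FALSE, sep="\t", row.names=FALSE)
read.delim("Targets.txt") # Shows the table that readTargets("Targets.txt") would return.
```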
Reading intensity data
targets <- readTargets("Targets.txt") # Reads the targets information from the file 'Targets.txt' and assigns it to a targets data frame.
RG <- read.maimages(targets$FileName, source="spot", sep="\t", path="./") # Writes the intensity data into an 'RGList' object. The argument 'targets$FileName' specifies the intensity files; 'source="spot"' specifies the image analysis program (e.g. SPOT); 'path="./"' provides the path to the directory where the intensity files are located; and 'sep="\t"' specifies the field separator.
RG <- read.maimages(targets$FileName, annotation="My_spot_labels", columns=list(Rf="Rmean", Gf="Gmean", Rb="morphR", Gb="morphG")) # If an intensity data file format is not supported, then one can specify the corresponding column names as shown here. The provided example is equivalent to the previous command with the exception of the additional argument annotation="My_spot_labels". This argument allows the import of a spot ID or annotation column, which can be useful for analyzing arrays with an unavailable printing layout or incomplete intensity data files. In those cases, the print-tip-specific functions in the following steps will not work (e.g. use 'loess' normalization instead of 'printtiploess').
RG_combined <- cbind(RG1, RG2) # One can combine different 'RGList' objects (e.g. RG1, RG2) into one large object. The command 'RG[,1]' shows only the first array, while 'RG[1:100,]' gives the data of the first 100 genes.
Spot quality weights
RG <- read.maimages(targets$FileName, source="spot", wt.fun=wtarea(100)) # Assigns spot quality weights from 0 to 1 to each spot and writes the information as a 'weights' component into the 'RGList' object. The provided example with 'wt.fun=wtarea(100)' gives the full weight of 1 to spots with 100 pixels. Spots with zero pixels or twice the ideal size are given a weight of 0. All spots with a weight of 0 are considered flagged and limma will ignore them in the subsequent analysis. The appropriate way of computing quality weights depends on the image analysis software. The command '?QualityWeight' provides more information on the available weight functions.
Gene lists and print layout
RG$genes <- readGAL() # The output of some image analysis programs contains intensity, gene name and print layout information all in one file. For the image analysis programs SPOT, GenePix, ArrayVision and others, this information is located in a separate file from where it needs to be written into the 'RGList' using the 'readGAL' and 'getLayout' functions. The function 'readGAL' reads this information by default from a *.gal file in the working directory. This file contains the columns 'Block', 'Column', 'Row', 'ID' and 'Name' (optional). If such a file is not provided, then one can create it in R using the printing layout information of a given array. Example for an array with 4 by 8 subgrids (pins), each with 20 by 21 spots: x <- 1:32; c1 <- c(); for(i in x) c1 <- c(c1, rep(i, times=420)); x <- 1:21; c2 <- c(); for(i in x) c2 <- c(c2, rep(i, times=20)); c3 <- rep(1:21, times=20); test.gal <- data.frame(Block=c1, Row=rep(c2, times=32), Column=rep(c3, times=32), ID=rep("my_ID", times=13440), Name=rep("my_Description", times=13440)); test.gal[1:20,]
RG$printer <- getLayout(RG$genes) # Extracts the print layout (number of pins or subgrids) after the gene list is available and adds this information as the $printer component to the RGList.
Spot type file
Control spot information can be added to the RGList object with
the spot type file (SpotTypes.txt). Regular expressions can be used here
to associate this information with the RG$genes component of the
RGList. The incorporation of the spot type information has the advantage
that controls can be highlighted in plots or separated in certain
analysis steps. The format of this file is specified in the limma PDF
manual.
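As an illustration, a small spot type file with hypothetical control IDs can be generated from within R; the '*' entries are wildcards matched against the ID/Name columns of RG$genes:

```r
# Hypothetical spot types; the first row defines the default type 'gene',
# and 'RC*' is a made-up pattern for ratio control spots.
spottypes <- data.frame(SpotType=c("gene", "ratio_control", "blank"),
                        ID=c("*", "RC*", "blank"),
                        Name=c("*", "*", "*"),
                        Color=c("black", "red", "yellow"))
write.table(spottypes, file="SpotTypes.txt", quote=FALSE, sep="\t", row.names=FALSE)
read.delim("SpotTypes.txt") # Shows the table that readSpotTypes("SpotTypes.txt") would read.
```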
spottypes <- readSpotTypes("SpotTypes.txt") # Reads in the file SpotTypes.txt.
RG$genes$Status <- controlStatus(spottypes, RG) # Adds a 'Status' column to the RG$genes component of the RGList.
 Quality plots
Recommended quality exploration steps are the imageplot() function applied to the raw log-ratios and an MA plot (plotMA) of the raw data for each array.
spottypes <- readSpotTypes("SpotTypes.txt") # Same as above.
RG$genes$Status <- controlStatus(spottypes, RG) # Same as above.
plotMA(RG, array=1) # Creates an MA plot, which is a plot of the log-intensity ratios (M values) versus the log-intensity averages of both channels (A values). To display the plots for all arrays next to each other, one can use the layout function 'par(mfrow=c(2,2))' and plot the different arrays in the same plotting device by changing the array number in the argument 'array=1'.
imageplot(log2(RG$R[,1]), RG$printer, low="white", high="red")
# Creates an image of color shades that represent the values of a
statistic in the microarray printing format. This function can be used
to explore spatial effects across the microarray.
plotPrintTipLoess(RG) # Co-plots the unnormalized RG data in the form of MA plots with the loess curves for each print tip.
 Normalization
Limma integrates several within- and between-array normalization
methods. The commands '?normalizeWithinArrays' and
'?normalizeBetweenArrays' provide information about the different
methods. The method 'printtiploess' is the default. Use 'loess' instead
if your arrays have no print-tip groups (e.g. Agilent) or have fewer than
150 spots per print tip.
MA <- normalizeWithinArrays(RG, method="printtiploess") # Background corrects and normalizes the expression log-ratios of the RGList and assigns the results to an MAList object. The default background correction method is bc.method="subtract". Use bc.method="none" to ignore the background in the analysis. If the quality weights should be ignored in this step, add the 'weights=NULL' argument to the command. For analyzing arrays with an unavailable printing layout or incomplete intensity data files, one should use a print-tip-unspecific normalization method here, such as 'loess'.
plotPrintTipLoess(MA) # Co-plots the normalized MA data in the form of MA plots with loess curves for each print tip. Compare the results with the unnormalized RG data from above.
boxplot(as.data.frame(MA$M)) # Creates a box-and-whisker plot of the M values. This is a graphical summary of the M value distribution across the arrays. The two hinges in the middle represent the first and third quartiles, the lines (whiskers) represent the largest and smallest observations within a distance of 1.5 times the box height, and everything beyond that is printed as extreme values in the form of dots. Inconsistent spreads of hinges and whiskers between arrays can indicate normalization issues. Between-array normalization may improve this situation (see the limma manual).
MA <- normalizeBetweenArrays(MA) # If the box-and-whisker plot indicates different spreads of M values, as in the above example, then one can use this secondary between-array normalization step.
 Background correction
By default, the above 'normalizeWithinArrays' function performs a
background correction by subtracting the background values from the
expression intensities.
RGb <- backgroundCorrect(RG, method="subtract"); MA <- normalizeWithinArrays(RGb) # These two commands perform the same operation as the above 'normalizeWithinArrays' command.
RG <- backgroundCorrect(RG, method="normexp", offset=50) # The more advanced background correction method 'normexp' adjusts the expression values adaptively for background values and results in strictly positive expression values. This method is particularly effective for obtaining more robust ratios for weakly expressed genes.
 Linear models and differential expression of single and dual color data
Limma provides advanced statistical methods for linear modelling of
microarray data and for identifying differentially expressed genes. The
approach fits a linear model to the data and uses an empirical Bayes
method for assessing differential expression. It is described in Smyth 2004. One or two experiment definition matrices need to be specified during the analysis: a design matrix defining the RNA samples and a contrast matrix
(optional for simple experiments) defining the comparisons to be
performed. How to set up these matrices is described in the limma manual
and on this page from Natalie Thorne. Another useful manual is Advanced Linear Modeling and Time Series Analysis.
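The role of the design matrix can be illustrated with base R alone. The following sketch assumes a hypothetical six-array experiment with three treatment groups (two replicates each) and shows how model.matrix() turns group labels into the 0/1 indicator columns that lmFit() expects; a contrast matrix is then defined on these column names (with limma's makeContrasts()):

```r
# Base-R sketch of a design matrix for a hypothetical 6-array, 3-group experiment.
group <- factor(c(1, 1, 2, 2, 3, 3))   # Treatment group of each array.
design <- model.matrix(~ -1 + group)   # '-1' drops the intercept: one indicator column per group.
colnames(design) <- c("group1", "group2", "group3")
design                                 # A 6 x 3 matrix; each row marks the group of one array.
```

Each row contains a single 1 marking the group membership of that array, which is why the fitted coefficients correspond to group means.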
 cDNA common reference design (one-sample test)
The swirl experiment is a classical one-sample test where a mutant
is compared to a WT sample with four replicates that include 2 dye
swaps. The following commands will work with the above objects obtained
under steps '4.1-4.4'.
design <- c(-1, 1, -1, 1) # Creates the appropriate design matrix.
fit <- lmFit(MA, design) # Fits a linear model for each gene based on the given series of arrays. The returned list object 'fit' contains the average M value ($coefficients) for each gene and their standard deviations ($sigma). The ordinary t-statistics can be computed with this command: ordinary.t <- fit$coef / fit$stdev.unscaled / fit$sigma. However, the empirical Bayes moderated t-statistics should be the preferred method for limma users (see below).
plotMA(fit); abline(0, 0, col="blue") # Plots an MA plot with a baseline.
fit <- eBayes(fit) # Computes the empirical Bayes statistics for differential expression. The moderated t-statistic uses standard deviations that have been shrunk towards a pooled standard deviation value.
qqt(fit$t,df=fit$df.prior+fit$df.residual,pch=16,cex=0.2); abline(0,1) # Plots the quantiles of a data sample against the theoretical quantiles of a Student's t distribution.
options(digits=3) # Sets the number of digits to print in numeric output to 3 digits.
topTable(fit, adjust="fdr", sort.by="B", number=10) # Generates a list of the top 10 ('number=10') differentially expressed genes sorted by B values ('sort.by="B"'). The summary table contains the following information: logFC is the log2 fold change, AveExpr is the average expression value across all arrays and channels, the moderated t-statistic (t) is the ratio of the logFC to its standard error, P.Value is the associated p-value, adj.P.Val is the p-value adjusted for multiple testing and the B value (B) is the empirical Bayes log-odds of differential expression (the higher the better). Usually one wants to base gene selection on adj.P.Val rather than the t or B values. More details on this can be found in the limma PDF manual (type 'limmaUsersGuide()') or on this FAQ page.
If the swirl data set was normalized with the 'printtiploess' method
and the between-array method, then the top down-regulated gene is BMP2,
which is the mutated gene of this sample.
x <- topTable(fit, adjust="fdr", sort.by="P", number=100); x[x$adj.P.Val < 0.01,] # Filters out candidates that have adjusted p-values < 0.01.
plotMA(fit); ord <- order(fit$lods, decreasing=TRUE); top30 <- ord[1:30]; text(fit$Amean[top30], fit$coef[top30], labels=fit$genes[top30, "Name"], cex=0.8, col="blue") # Plots an MA plot and highlights the 30 top changes.
Limma: Affymetrix Arrays
To enable the following analysis steps, users need to save and extract this archive of 6 CEL files into their R working directory. These sample files are from the Cold Stress Time Course of the AtGenExpress site (FTP batch download). This affy_targets.txt
file is required to select specific CEL files during the analysis. For a
quick demonstration of the analysis of this data set, one can
copy&paste the following commands into the R terminal.
library(limma) # Loads limma library.
targets <- readTargets("affy_targets.txt") # Reads the targets information from the file 'affy_targets.txt' and assigns it to a targets data frame.
library(affy); data <- ReadAffy(filenames=targets$FileName) # Reads the CEL files (specified in 'targets') into an AffyBatch object.
eset <- rma(data) # Normalizes the data with the 'rma' function and assigns them to an ExpressionSet object (see above).
# exprs(eset) <- log2(exprs(eset)) # If eset contains absolute intensity values, like MAS5 results, then they should be transformed to log2 (or loge) values for limma. RMA/GCRMA generate log2 values, while MAS5 produces absolute values.
pData(eset) # Lists the analyzed file names.
write.exprs(eset, file="affy_all.txt") # Exports all affy expression values to a tab-delimited text file. The MAS 5.0 P/M/A calls can be retrieved with the simpleaffy package or with the affy package like this: 'eset <- mas5calls(data); write.exprs(eset, file="my_PMA.txt")'.
design <- model.matrix(~ -1+factor(c(1,1,2,2,3,3))) # Creates the appropriate design matrix. Alternatively, such a design matrix can be created in any spreadsheet program and then imported into R.
colnames(design) <- c("group1", "group2", "group3") # Assigns the column names.
fit <- lmFit(eset, design) # Fits a linear model for each gene based on the given series of arrays.
contrast.matrix <- makeContrasts(group2-group1, group3-group2, group3-group1, levels=design) # Creates the appropriate contrast matrix to perform all pairwise comparisons. Alternatively, such a contrast matrix can be created in any spreadsheet program and then imported into R. For complex experiments one can also use this function to compute a contrast matrix with all possible pairwise comparisons.
fit2 <- contrasts.fit(fit, contrast.matrix) # Computes estimated coefficients and standard errors for the given set of contrasts.
fit2 <- eBayes(fit2) # Computes moderated t-statistics and log-odds of differential expression by empirical Bayes shrinkage of the standard errors towards a common value.
topTable(fit2, coef=1, adjust="fdr", sort.by="B", number=10) # Generates a list of the top 10 ('number=10') differentially expressed genes sorted by B values ('sort.by="B"') for one of the three comparison groups ('coef=1') in this sample set. The summary table contains the following information: logFC is the log2 fold change, AveExpr is the average expression value across all arrays and channels, the moderated t-statistic (t) is the ratio of the logFC to its standard error, P.Value is the associated p-value, adj.P.Val is the p-value adjusted for multiple testing and the B value (B) is the log-odds that a gene is differentially expressed (the higher the better). Usually one wants to base gene selection on the adjusted p-value rather than the t or B values. More details on this can be found in the limma PDF manual (type 'limmaUsersGuide()') or on this FAQ page.
write.table(topTable(fit2, coef=1, adjust="fdr", sort.by="B", number=50000), file="limma_complete.xls", row.names=FALSE, sep="\t") # Exports the complete limma statistics table for the first comparison group ('coef=1') to a tab-delimited text file.
results <- decideTests(fit2, p.value=0.05); vennDiagram(results) # Creates a Venn diagram of all changed genes with a p-value equal to or less than 0.05.
x <- topTable(fit2, coef=1, adjust="fdr", sort.by="P", number=50000); y <- x[x$adj.P.Val < 0.05,]; y; print("Number of genes in this list:"); length(y$ID) # Filters out the candidates that have adjusted p-values < 0.05 in each group ('coef=1') and provides the number of candidates for each list. These numbers should be identical to the sum of the values in each circle of the above Venn diagram.
x <- topTable(fit2, coef=1, adjust="fdr", sort.by="P", number=50000); y <- x[x$adj.P.Val < 0.01 & (x$logFC > 1 | x$logFC < -1) & x$AveExpr > 10,]; y; print("Number of genes in this list:"); length(y$ID) # Same as above, but with a complex filter: adjusted p-value < 0.01 AND at least a 2-fold change AND an average expression value A > 10.
results <- decideTests(fit2, p.value=0.000005); heatDiagram(results, fit2$coef, primary=1) # Plots heat diagram gene expression profiles for the genes that are significantly differentially expressed in the primary condition (this is not a cluster analysis heat map). Genes are sorted by differential expression under the primary condition. The argument 'primary=1' selects the first contrast column in the 'results' matrix as the primary condition. The plotted genes can be extracted like this: 'results[results[,1]==1,]'. More information on this function can be found in the limma manual.
For visualizing the analyzed affy data, use the same visualization and quality control steps as described in the affy package section.
RankProd
 The RankProd
package contains functions for the identification of differentially
expressed genes using the rank product non-parametric method from Breitling et al., 2004.
It generates a list of up- or down-regulated genes based on the
estimated percentage of false positive predictions (pfp), which is also
known as the false discovery rate (FDR). The attraction of this method is
its ability to analyze data sets from different origins (e.g.
laboratories) or variable environments.
 Required data objects
 'data': object of type matrix or data frame containing expression values in log2 scale.
 'cl': vector of length ncol(data) with the class labels of the samples/treatments.
 'origin': vector of length ncol(data) with the origin labels of the samples.
The origin vector is only required for analyzing data from multiple
origins!
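The three objects can be sketched with base R for a hypothetical experiment with four arrays (two classes, two laboratories); a real 'data' matrix would contain your own log2 expression values rather than random placeholders:

```r
set.seed(1)
# Placeholder log2 expression matrix: 10 hypothetical genes x 4 arrays.
data <- matrix(rnorm(40), nrow=10, ncol=4,
               dimnames=list(paste0("gene", 1:10), paste0("array", 1:4)))
cl <- c(0, 0, 1, 1)      # Class labels: 0 = control, 1 = treatment.
origin <- c(1, 2, 1, 2)  # Origin labels; only needed for multiple-origin data.
stopifnot(length(cl) == ncol(data), length(origin) == ncol(data))
```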
 Differential expression analysis with Affymetrix data from a single origin
library(RankProd) # Loads the required library.
data(arab) # Loads sample data matrix or data frame 'arab' that contains log2 intensities of 500 genes (rows) for 10 samples (columns).
my_expr_matrix <- exprs(my_eset) # When the analysis is started from Affymetrix CEL files, one can create the required expression matrix or data frame for input into RankProd from an ExpressionSet object with the 'exprs' function.
arab.sub <- arab[, which(arab.origin == 1)] # Generates a sub-data set for the single-origin analysis.
cl <- arab.cl[which(arab.origin == 1)] # Generates the 'cl' vector with the classes (contrast groups) labeled with '0' and '1'.
RP.out <- RP(arab.sub, cl, num.perm = 100, logged = TRUE, na.rm = FALSE, plot = FALSE, gene.names = arab.gnames, rand = 123) # Performs the rank product analysis for single-origin data. If the input data are not log transformed, use 'logged=FALSE'.
topGene(RP.out, cutoff = 0.05, logged = TRUE, logbase = 2, gene.names = arab.gnames) # The function 'topGene' outputs the identified genes with user-specified filters: 'cutoff': pfp threshold; 'num.gene': number of top genes identified; 'gene.names': vector with Affy IDs in the order of the input data set. For log2-scaled data use the arguments 'logged=TRUE' and 'logbase=2'. The generated 'Table 1' lists the up-regulated genes and 'Table 2' the down-regulated genes. The four columns in each table contain the following information: 'gene index' = position in the original data set, 'RP/Rsum' = rank product, 'FC' = fold change of the averaged expression levels, and 'pfp' = estimated pfp or FDR.
plotRP(RP.out, cutoff = 0.05) # Plots the pfp (FDR) versus the number of identified up- and down-regulated genes. Genes within the specified cutoff range are plotted in red.
 Differential expression analysis with Affymetrix data from multiple origins
cl <- arab.cl # Assigns the 'cl' vector with the classes (contrast groups) labeled with '0' and '1'. The object 'arab.cl' is part of the sample data set 'arab'.
origin <- arab.origin # Assigns the 'origin' vector with the origin labels of the samples. The object 'arab.origin' is part of the sample data set 'arab'.
RP.adv.out <- RPadvance(arab, cl, origin, num.perm = 100, logged = TRUE, gene.names = arab.gnames, rand = 123) # The function 'RPadvance' performs the rank product analysis for multiple-origin data, which requires the above 'origin' vector.
topGene(RP.adv.out, cutoff = 0.05, logged = TRUE, logbase = 2, gene.names = arab.gnames) # Generates the 'topGene' list (see above).
 Differential expression analysis with cDNA arrays
 SAM
 SAM is part of the 'siggenes' package.
library(siggenes) # Loads siggenes package.
?sam # Opens help page for SAM.
sam(data,cl....) # Basic syntax for running SAM.
 R/maanova
 Digital Gene Expression (DGE)
Additional Dual Color Array Packages
 BioConductor provides various packages for the analysis of dual-color cDNA arrays.
marray
 Exploratory analysis for two-color spotted microarray data.
 The following gives a very brief summary on how to use this package:

library(marray) # Loads marray library.
... # Continue...
Chromosome maps
 Several BioConductor packages are available for displaying genomic
information graphically. The following commands illustrate some of the
chromosome plotting utilities from the geneplotter package.

library(annotate); library(geneplotter); library(hgu95av2); newChrom <- buildChromLocation("hgu95av2"); newChrom; cPlot(newChrom)
# This displays all genes on the chromosomes of an organism. Genes
# encoded by the antisense strand are represented by lines below the
# chromosomes.
data(sample.ExpressionSet); myeset <- sample.ExpressionSet; cColor(featureNames(sample.ExpressionSet), "red", newChrom) # This highlights in the above plot a set of genes of interest in red color (e.g. expressed genes of an experiment).
cPlot(newChrom,c("1","2"), fg="yellow", scale="relative"); cColor(featureNames(myeset), "red", newChrom) # Plots just a specific set of chromosomes. 
Gene Ontologies
General
Several packages are available for Gene Ontology (GO) analysis. The
following examples of the different packages use the GO annotations from
Arabidopsis.

 Basic GO usage with GOstats
library(GOstats); library(GO.db); library(ath1121501.db); library(annotate) # Loads the required libraries.
goann <- as.list(GOTERM) # Retrieves full set of GO annotations.
zz <- eapply(GOTERM, function(x) x@Ontology); table(unlist(zz)) # Calculates the number of annotations for each ontology category.
?GOTERM # To find out how to access the different GO components.
GOTERM$"GO:0003700"; GOMFPARENTS$"GO:0003700"; GOMFCHILDREN$"GO:0003700" # Shows how to print out the GO annotations for one entry and how to retrieve its direct parents and children.
GOMFANCESTOR$"GO:0003700"; GOMFOFFSPRING$"GO:0003700" # Prints out complete lineages of parents and children for a GO ID.
goterms <- unlist(eapply(GOTERM, function(x) x@Term)); goterms[grep("molecular_function", goterms)]
# Retrieves all GO terms and prints out only those matching a search
# string given in the grep function. The same can be done for the
# definition field with 'x@Definition'. A set of GO IDs can be provided as
# well: goterms[GOMFANCESTOR$"GO:0005507"]
go_df <- data.frame(GOID=unlist(eapply(GOTERM, function(x) x@GOID)), Term=unlist(eapply(GOTERM, function(x) x@Term)), Ont=unlist(eapply(GOTERM, function(x) x@Ontology))) # Generates a data frame of the commonly used GO components: GOID, GO Term and Ontology Type.
affyGO <- eapply(ath1121501GO, getOntology, "MF"); table(sapply(affyGO, length))
# Retrieves MF GO terms for all probe IDs of a chosen Affy chip and
# calculates how many probes have multiple GO terms associated. Use the "BP"
# and "CC" arguments to retrieve BP/CC GO terms.
affyGOdf <- data.frame(unlist(affyGO)); affyGOdf <- data.frame(AffyID=row.names(affyGOdf), GOID=affyGOdf[,1]); affyGOdf <- merge(affyGOdf, go_df, by.x="GOID", by.y="GOID", all.x=T)
# Converts the above MF list object into a data frame. The AffyID occurrence
# counts are appended to the AffyIDs. The last step merges the two data
# frames: 'affyGOdf' and 'go_df'.
unique(lookUp("GO:0004713", "ath1121501", "GO2ALLPROBES")) # Retrieves all Affy IDs that are associated with a GO node.
z <- affyGO[c("254759_at", "260744_at")]; as.list(GOTERM)[z[[1]]] # Retrieves GO IDs for a set of Affy IDs and then the corresponding GO term for the first Affy ID.
a <- data.frame(unlist(z)); a <- data.frame(ID=row.names(a), a); b <- data.frame(goterms[as.vector(unlist(z))]); b <- data.frame(ID=row.names(b), b); merge(b, a, by.x = "ID", by.y = "unlist.z.") # Merges Affy ID, GO ID and GO annotation information.
affyEv < eapply(ath1121501GO, getEvidence); table(unlist(affyEv, use.names = FALSE)) # Provides evidence code information for each gene and summarizes the result.
test1 <- eapply(ath1121501GO, dropECode, c("IEA", "NR")); table(unlist(sapply(test1, getEvidence), use.names = FALSE)) # This example shows how one can remove certain evidence codes (e.g. IEA, NR) from the analysis.
library(ath1121501.db)
affySample <- c("266592_at", "266703_at", "266199_at", "246949_at", "267370_at", "267115_s_at", "266489_at", "259845_at", "266295_at", "262632_at")
geneSample <- as.vector(unlist(mget(affySample, ath1121501ACCNUM, ifnotfound=NA)))
library(ath1121501cdf); affyUniverse <- ls(ath1121501cdf)
geneUniverse <- as.vector(unlist(mget(affyUniverse, ath1121501ACCNUM, ifnotfound=NA)))
params <- new("GOHyperGParams", geneIds = geneSample, universeGeneIds = geneUniverse, annotation="ath1121501", ontology = "MF", pvalueCutoff = 0.5, conditional = FALSE, testDirection = "over")
hgOver <- hyperGTest(params); summary(hgOver); htmlReport(hgOver, file = "MyhyperGresult.html")
# Example of how to test a sample set of probe set keys for
# over-representation of GO terms using a hypergeometric distribution test
# with the function hyperGTest(). For more information, read the
# GOstatsHyperG manual.
GOHyperGAll
 GOHyperGAll function:
To test a sample population of genes for over-representation of GO
terms, the function 'GOHyperGAll' computes for all GO nodes a
hypergeometric distribution test and returns the corresponding raw and
Bonferroni-corrected p-values (notes about implementation).
A subsequent filter function performs a GO Slim analysis using default
or custom GO Slim categories. The method has been published in Plant Physiol (2008) 147, 41-57.
GOHyperGAll provides similar utilities as the hyperGTest function in
the GOstats package from BioConductor. The main difference is that
GOHyperGAll simplifies the usage of custom chip-to-gene and gene-to-GO
mappings.
To demo the utility of the GOHyperGAll function, run it as shown below.
The initial sample data generation step takes some time (~10 min),
since it needs to generate the required data objects for all three
ontologies. This needs to be done only once for every custom gene-to-GO
annotation.
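The hypergeometric test underlying this approach can be illustrated with base R's phyper() function. The gene counts below are made-up values for illustration only, not output of the GOHyperGAll script:

```r
# Hypothetical example: a GO node is annotated to 40 genes in a
# universe of 20000 genes; a sample of 100 genes hits 8 of them.
# P(X >= 8) for X ~ Hypergeometric(m=40 annotated, n=19960 rest, k=100 drawn):
p <- phyper(8 - 1, m = 40, n = 20000 - 40, k = 100, lower.tail = FALSE)
round(p, 6) # Raw enrichment p-value for this single GO node.
```

GOHyperGAll then repeats such a test for every node in the GO network and adjusts the resulting p-values for multiple testing.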
(1.1) Import all required functions with the following source() command
source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/GOHyperGAll.txt")
(2.1) Using genetoGO mappings from geneontology.org
readGOorg(myfile = "gene_association.tair", colno = c(5,11,9), org = "Arabidopsis"); gene2GOlist(rootUK=T)
# Download the required annotation table from geneontology.org
# and unzip it. Then point the 'readGOorg()' function to this file name.
# The two functions generate 4 data frames with the assigned gene-to-GO
# mappings and 3 lists containing the gene-to-GO-OFFSPRING associations.
# When the processes are completed, 6 files will be saved in your working
# directory. They can be reloaded in future R sessions with the 'load'
# command below. If the argument 'rootUK' is set to TRUE, then the root
# nodes are treated as terminal nodes to account for the new assignment of
# unknown genes to the root nodes.
(2.2) Using gene-to-GO mappings from BioC. Note: users should execute either step (2.1) or (2.2), but not both!
sampleDFgene2GO(lib="ath1121501.db"); gene2GOlist(rootUK=T)
# Similar to above, but the gene-to-GO mappings are obtained from BioC.
# The generated 4 sample data frames and 3 list objects can be reloaded in
# future R sessions with the 'load' command below.
(2.3) Obtain AffyID-to-GeneID mappings when working with AffyIDs
AffyID2GeneID(map = "ftp://ftp.arabidopsis.org/home/tair/Microarrays/Affymetrix/affy_ATH1_array_elements20101220.txt")
# When working with AffyIDs, this function creates an AffyID-to-GeneID
# mapping data frame using by default the TAIR mappings for the
# Arabidopsis ATH1 chip. To use the function for the mappings of other
# chips, one needs to create the corresponding decoding data frame
# 'affy2locusDF'.
(3.1) Reloading required data objects from local files
loadData(); load(file="MF_node_affy_list"); load(file="BP_node_affy_list"); load(file="CC_node_affy_list")
# This step makes future sessions much faster, since it allows one to skip
# the previous data generation steps (2.1-2.3). A sample data set is
# available here: ArabSampleGOHyperGAll.zip (Jan 2010).
(3.2) Obtain a sample set of GeneIDs
test_sample <- unique(as.vector(GO_MF_DF[1:40,2])) # When working with GeneIDs.
test_sample <- AffyID2GeneID(affyIDs=affy_sample, probe2gene=1)
# When working with AffyIDs, one can use the function 'AffyID2GeneID'
# to obtain for a set of AffyIDs their corresponding GeneIDs from the data
# frame 'affy2locusDF' (see above). For probe sets that match several
# loci, only the first locus ID will be used if the argument 'probe2gene'
# is set to 1. To demo the function, one can use the following sample
# AffyIDs: affy_sample <- c("266592_at", "266703_at", "266199_at",
# "246949_at", "267370_at", "267115_s_at", "266489_at", "259845_at",
# "266295_at", "262632_at")
(4.1) Perform phyper test, goSlim subsetting and plotting of results
GOHyperGAll_result <- GOHyperGAll(gocat="MF", sample=test_sample, Nannot=2); GOHyperGAll_result[1:10,-8]
# The function 'GOHyperGAll()' performs the phyper test against all
# nodes in the GO network. It returns raw and adjusted p-values. The
# Bonferroni correction is used as the p-value adjustment method according
# to Boyle et al, 2004 (online).
# The argument 'Nannot' defines the minimum number of direct annotations
# per GO node from the sample set to determine the number of tested
# hypotheses for the p-value adjustment. The argument 'gocat' can be
# assigned the values "MF", "BP" and "CC". Omitting the '-8' index
# will provide the sample keys matching at every GO node.
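The Bonferroni adjustment used here can be reproduced on any vector of raw p-values with base R's p.adjust() function; the values below are made up for illustration:

```r
# Toy vector of raw p-values from four independently tested GO nodes.
pvals <- c(0.0001, 0.003, 0.02, 0.5)
# Bonferroni adjustment: each p-value is multiplied by the number of
# tests and capped at 1, i.e. pmin(1, pvals * length(pvals)).
p_bonf <- p.adjust(pvals, method = "bonferroni")
p_bonf # 0.0004 0.012 0.08 1
```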
subset <- GOHyperGAll_Subset(GOHyperGAll_result, sample=test_sample, type="goSlim"); subset[,-8]
# The function 'GOHyperGAll_Subset()' subsets the GOHyperGAll results
# by assigned GO nodes or custom goSlim categories. The argument 'type'
# can be assigned the values "goSlim" or "assigned". The optional argument
# 'myslimv' can be used to provide a custom goSlim vector. Omitting the
# '-8' index will show the sample keys matching at every GO node.
pie(subset[subset$SampleMatch>0, 3], labels=as.vector(subset[subset$SampleMatch>0, 1]), main=unique(as.vector(subset[subset$SampleMatch>0, 7]))) # Plots a pie chart of the subsetted results.
(4.2) Reduce GO Term Redundancy in 'GOHyperGAll_results'
simplifyDF <- GOHyperGAll_Simplify(GOHyperGAll_result, gocat="MF", cutoff=0.001, correct=T)
# The result data frame 'GOHyperGAll_result' often contains several
# connected GO terms with significant scores, which can complicate the
# interpretation of large sample sets. To reduce the redundancy, the
# function 'GOHyperGAll_Simplify' subsets the data frame
# 'GOHyperGAll_result' by a user-specified p-value cutoff and removes from
# it all GO nodes with overlapping children sets (OFFSPRING), while the
# best scoring nodes remain in the data frame.
data.frame(GOHyperGAll_result[GOHyperGAll_result[,1] %in% simplifyDF[,1], -8], GO_OL_Match=simplifyDF[,2])
# This command returns the redundancy-reduced data set. The column
# 'GO_OL_Match' provides the number of accessions that match the connected
# nodes.
(4.3) Batch Analysis of Many Gene Clusters
BatchResult <- GOCluster_Report(CL_DF=CL_DF, method="all", id_type="gene", CLSZ=10, cutoff=0.001, gocats=c("MF", "BP", "CC"), recordSpecGO=c("GO:0003674", "GO:0008150", "GO:0005575"))
# The function 'GOCluster_Report' performs the three GO analyses in batch
# mode: 'GOHyperGAll', 'GOHyperGAll_Subset' or 'GOHyperGAll_Simplify'. It
# processes many groups of genes (e.g. gene expression clusters) and
# returns the results conveniently organized in a single data frame. The
# gene sets need to be provided in a data frame of the format specified at
# the end of the GOHyperGAll script. CLSZ: minimum cluster size to
# consider; method: "all", "slim" or "simplify"; gocat: "MF", "BP" or
# "CC"; cutoff: adjusted p-value cutoff; recordSpecGO: argument to include
# one specific GOID in each of the 3 ontologies, e.g.:
# recordSpecGO=c("GO:0003674", "GO:0008150", "GO:0005575").
GSEA
 Gene Set Enrichment Analysis (GSEA)
The following function converts the objects, generated by the above
GOHyperGAll script, into the gene set format (*.gmt) for the GSEA tool
from the Broad Institute. This way the custom GO files of the
GOHyperGAll method can be used as gene sets for the Java or R releases
of the GSEA program.
(A) Import the required file format conversion function with the following source() command
source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/GOHyperGAll.txt") # Imports the GOhyper2GSEA() function.
GOhyper2GSEA(myfile=c("MF_node_affy_list", "BP_node_affy_list", "CC_node_affy_list"), type="all")
# Converts XX_node_affy_list or GO_XX_DF files into GO_XX_ALL.gmt and
# GO_XX_TERM.gmt files, respectively. Use the argument "all" for
# XX_node_affy_list files and "terminal" for GO_XX_DF files. The former
# contain all GO assignments, whereas the latter contain only the direct
# GO assignments.
(B) Import the generated gene set files (*.gmt) along with the gene
expression intensity data (*.txt) or preranked gene lists (*.rnk) into
the Java or R GSEA program, and follow the instructions on the GSEA User Guide. A detailed description of the data formats is provided on the Data Formats page.
goTools and goCluster
 Getting started with goTools
my_list <- list(L1=c("245265_at", "260744_at", "259561_at", "254759_at", "267181_at"), L2=c("256497_at", "250942_at", "265611_at", "247488_at", "255148_at")) # Creates a working list with two components, each consisting of Arabidopsis Affy IDs.
library("goTools", verbose = FALSE) # Loads the required library.
ontoCompare(my_list, probeType = "ath1121501", goType="MF") # Provides compositional differences of the molecular function GOs between the Affy ID vectors contained in my_list.
ontoCompare(my_list, probeType = "ath1121501", goType="MF", plot=TRUE) # Plots the above MF GO compositions in bar diagram.
par(mar = c(5, 8, 5, 8)); res2 <- ontoCompare(my_list["L1"], probeType = "ath1121501", method="TIDS", goType="MF", plot=FALSE); ontoPlot(res2, cex = 0.7) # Plots a pie chart for the "L1" gene list component in my_list.
 goCluster demo
library("goCluster", verbose = FALSE) # Loads the goCluster package.
testdata <- eset[c(1:400),c(1:4)] # Selects 400 genes from previously generated ExpressionSet object (e.g. simpleaffy) with four chips.
test <- new("goCluster") # Creates goCluster object.
testConfigured <- config(test)
# The config method starts the configuration process and stores the
# results in the object 'testConfigured'. The required configuration
# information needs to be provided in the interactive mode (use
# 'ath1121501').
setup(testConfigured)[["algo"]]; print(testConfigured) # Retrieves the generated configuration information.
testExecuted <- execute(testConfigured) # Analyzes the configured object.
print(testExecuted) # Prints the above analysis object to give a result summary.
display(testExecuted, selection = 3) # Displays the result in the form of a heat map.
testExecuted2 <- testExecuted; testExecuted2@visu <- new("clusterVisualHeatmap"); testExecuted2@visu <- execute(testExecuted2@visu, testExecuted2); display(testExecuted2, selection = "GO:0009535") # Displays the result for a chosen GO identifier.
KEGG Pathway Analysis
 KEGG Pathway Script: Script to map Arabidopsis locus IDs to KEGG pathways. It can be easily adjusted to any organism.
Motif Identification in Promoter Regions
 COSMO from Oliver Bembom et al.: allows the discovery of motifs in unaligned DNA sequences
 rGADEM: de novo motif discovery
Phylogenetic Analysis
Cheminformatics in R
 ChemmineR is an R package for mining drug-like compound and screening data sets.
Protein Structure Analysis
 Bio3D in R: utilities for the analysis of protein structure and sequence data.
MS Data Analysis
GenomeWide Association Studies (GWAS)
 SimHap: A comprehensive modelling framework and a multiple-imputation approach to haplotypic analysis of population-based data.
 GenABEL: R library for genome-wide association analysis.
 crlmm: Genotype Calling (CRLMM) and Copy Number Analysis tool for Affymetrix SNP 5.0 and 6.0 and Illumina arrays (Carvalho et al., 2009).
 GGtools: Software and data for genetical genomics.
 GeneticsBase: Classes and functions for handling genetic data.
 PLINK:
Free, open-source whole-genome association analysis toolset, designed
to perform a range of basic, large-scale analyses in a computationally
efficient manner.
BioConductor Exercises
The following exercises demonstrate several very useful BioConductor tools for single-color and dual-color array analysis.
 Slide Show: Technology Overview (R Code)
 Install R, BioC and required libraries
 Download a sample set of Affymetrix cel files
 Right-click this link and save its content to your computer: Workshop.zip. These sample CEL files are from the GEO data set: GSE5621
 After unpacking this file archive one should see six *.cel files.
 Generate RMA expression data, MAS5 P/M/A calls and export results to Excel
library(affy); mydata <- ReadAffy() # Reads cel files in current working directory into the AffyBatch object 'mydata'.
library(ath1121501probe); print.data.frame(ath1121501probe[1:22,]) # Prints probe sequences and their positions for the first two Affy IDs.
eset_rma <- rma(mydata) # Generates RMA expression values and stores them as an ExpressionSet.
exprs(eset_rma)[1:4,] # Prints first 4 rows in data frame structure.
# exprs(eset_rma) <- 2^(exprs(eset_rma)) # If needed, unlog the RMA expression values.
mydf <- 2^exprs(eset_rma); myList <- tapply(colnames(mydf), c(1,1,2,2,3,3), list); names(myList) <- sapply(myList, paste, collapse="_"); mymean <- sapply(myList, function(x) rowMeans(mydf[,x])) # Generic approach for calculating mean values for any sample combination.
eset_pma <- mas5calls(mydata) # Generates MAS 5.0 P/M/A calls.
my_frame <- data.frame(exprs(eset_rma), exprs(eset_pma), assayDataElement(eset_pma, "se.exprs")) # Combines RMA intensities, P/M/A calls and their Wilcoxon p-values in one data frame.
my_frame <- my_frame[, sort(names(my_frame))] # Sorts columns by cel file name.
write.table(my_frame, file="my_file.xls", sep="\t", col.names = NA) # Exports data to a text file that can be imported into Excel.
 Add annotation information
library("ath1121501.db") # Loads required annotation package.
Annot <- data.frame(ACCNUM=sapply(contents(ath1121501ACCNUM), paste, collapse=", "), SYMBOL=sapply(contents(ath1121501SYMBOL), paste, collapse=", "), DESC=sapply(contents(ath1121501GENENAME), paste, collapse=", ")) # Constructs a data frame containing the gene IDs, gene symbols and descriptions for all probe sets on the chip.
all <- merge(Annot, my_frame, by.x=0, by.y=0, all=T) # Merges everything with the above expression data.
write.table(all, file="my_annot_file.xls", sep="\t", col.names = NA) # Exports data to text file that can be imported into Excel.
 Visualization and quality control
d <- cor(2^exprs(eset_rma), method="pearson"); plot(hclust(dist(1-d))) # Generates a correlation matrix for all-against-all chip comparisons and plots the corresponding cluster tree.
library(affyQCReport); QCReport(mydata, file="ExampleQC.pdf") # Generates a comprehensive QC report for the AffyBatch object 'mydata' in PDF format. See affyQCReport for details.
eset <- eset_rma # To work with the following commands.
 Continue with commands in section "Visualization and quality controls".
 Simple Affy (command summary)
 Save this covdesc.txt to your R working directory.
 Generate expression data with RMA and MAS5.
 Filter each of the three data sets with the following parameters:
2-fold changes, present in all 4 chips and p-score less than 0.001.
 Write the results into separate files.
 Create scatter plots for the filtered data sets and save them to external image files.
 Identify the overlap of the significant changes between the RMA and MAS5 data.
 Perform simpleaffy QC checks: scaling factor, percent present calls, etc.
 Identify differentially expressed genes (DEGs) with the limma package
 Follow the instructions in the Limma Section for Affy Data.
 cDNA microarray users can save and extract the SWIRL
cDNA microarray sample data. For a quick demonstration of the analysis
of this data set, one can copy&paste or source the following
commandline summary into the R terminal: my_swirl_commands.txt.
 Test gene sets for overrepresented GO terms using the GOHyperGAll script
 Download and unzip the annotation objects for Arabidopsis (March 2010) into your working directory.
 Then continue with the following commands:
source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/GOHyperGAll.txt") # Imports the GOHyperGAll functions.
loadData(); load(file="MF_node_affy_list"); load(file="BP_node_affy_list"); load(file="CC_node_affy_list") # Loads the downloaded annotation objects for Arabidopsis.
GOHyperGAll_result <- GOHyperGAll(gocat="MF", sample=unique(as.vector(GO_MF_DF[1:40,2])), Nannot=2); GOHyperGAll_result[1:10,-8] # Performs the enrichment test for the provided set of gene identifiers.
CL_DF <- data.frame(geneID=GO_MF_DF[1:400,2], ClusterID=sort(rep(1:4,100)), ClusterSize=rep(100,400)) # Creates a sample data set for batch processing.
BatchResult <- GOCluster_Report(CL_DF=CL_DF, method="all", id_type="gene", CLSZ=10, cutoff=0.001, gocats=c("MF", "BP", "CC"), recordSpecGO=c("GO:0003674", "GO:0008150", "GO:0005575")); BatchResult[1:4,-10]
# Performs all three GO analyses for many sample sets at once. When the
# method argument is set to "slim", the goSlim method is used.
write.table(BatchResult, file="GO_Batch.xls", sep="\t", col.names = NA) # Exports batch result to Excel file.
 Clustering of differentially expressed genes
my_fct <- function(x) hclust(x, method="complete") # Creates a function to perform hierarchical clustering (complete linkage).
heatmap(as.matrix(2^exprs(eset)[1:40,]), col = cm.colors(256), hclustfun=my_fct) # Plots heatmap with dendrograms.
 In the last step, replace 'exprs(eset)[1:40,]' with the matrix of differentially expressed genes from the limma analysis.
 Convert the above commands into an R script and execute it with the source() function
source("myscript.R") # Executes all of the above commands and generates the corresponding output files in the current working directory.
Clustering and Data Mining in R
Introduction [ Slide Show ]
Data Preprocessing
Generate a sample data set
y <- matrix(rnorm(50), 10, 5, dimnames=list(paste("g", 1:10, sep=""), paste("t", 1:5, sep=""))) # Creates a sample data matrix.
 Data centering and scaling
scale(t(y)); yscaled <- t(scale(t(y))); apply(yscaled, 1, sd)
# The function scale() centers and/or scales numeric matrices
# column-wise. When used with its default settings, the function returns
# columns that have a mean close to zero and a standard deviation of one.
# For row-wise scaling the data set needs to be transposed first using the
# t() function. Another transposing step is necessary to return the data
# set in its original orientation (same row/column assignment). Centering
# and scaling are common data transformation steps for many clustering
# techniques.
 Obtain a distance matrix
dist(y, method = "euclidean")
# Computes and returns a distance matrix for the rows of the data
# matrix 'y' using the specified distance measure, here 'euclidean'.
c <- cor(t(y), method="spearman"); d <- as.dist(1-c); d
# To obtain correlation-based distances, one has to first compute a
# correlation matrix with the cor() function. Since this step iterates
# across matrix columns, a transpose 't()' step is required to obtain row
# distances. In a second step the distance matrix object of class "dist"
# can be obtained with the as.dist() function.
Hierarchical Clustering (HC)
The basic hierarchical clustering functions in R are hclust, flashClust, agnes and diana. hclust and agnes perform agglomerative hierarchical clustering, while diana performs divisive hierarchical clustering. flashClust is a greatly speed-improved (50-100 times faster) version of hclust. The pvclust
package can be used for assessing the uncertainty in hierarchical
cluster analyses. It provides approximately unbiased p-values as well as
bootstrap p-values. As an introduction to R's standard hierarchical
clustering utilities one should read the help pages of the following
functions: hclust, dendrogram, as.dendrogram, cutree and heatmap.
An example for subclustering (subsetting) heatmaps based on selected
tree nodes is given in the last part of this section (see zoom into heatmaps).
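A minimal sketch of pvclust usage on random sample data (the arguments shown are standard pvclust options; the low 'nboot' value is only to keep the demo fast, use 1000 or more in practice):

```r
library(pvclust) # Bootstrap-based assessment of cluster uncertainty.
y <- matrix(rnorm(500), 50, 10) # Sample data; pvclust clusters the columns.
pv <- pvclust(y, method.hclust="complete", method.dist="correlation", nboot=100)
plot(pv) # Dendrogram annotated with AU/BP p-values at each node.
pvrect(pv, alpha=0.95) # Draws boxes around clusters supported with AU p-value >= 0.95.
```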

 Clustering with hclust
y <- matrix(rnorm(50), 10, 5, dimnames=list(paste("g", 1:10, sep=""), paste("t", 1:5, sep=""))) # Creates a sample data set.
c <- cor(t(y), method="spearman"); d <- as.dist(1-c); hr <- hclust(d, method = "complete", members=NULL)
# This is a short example of clustering the rows of the generated
# sample matrix 'y' with hclust. Seven different clustering methods can be
# selected with the 'method' argument: ward, single, complete, average,
# mcquitty, median and centroid. The object returned by hclust is a list
# of class hclust which describes the tree generated by the clustering
# process with the following components: merge, height, order, labels,
# method, call and dist.method.
par(mfrow = c(2, 2)); plot(hr, hang = 0.1); plot(hr, hang = -1)
# The generated tree can be plotted with the plot() function. When the
# hang argument is set to '-1', all leaves end on one line and their
# labels hang down from 0. More details on the plotting behavior are
# provided in the hclust help document (?hclust).
plot(as.dendrogram(hr), edgePar=list(col=3, lwd=4), horiz=T) # To plot trees horizontally, the hclust tree has to be transformed into a dendrogram object.
unclass(hr) # Prints the full content of the hclust object.
str(as.dendrogram(hr)) # Prints dendrogram structure as text.
hr$labels[hr$order] # Prints the row labels in the order they appear in the tree.
par(mfrow=c(2,1)); hrd1 <- as.dendrogram(hr); plot(hrd1); hrd2 <- reorder(hrd1, sample(1:10)); plot(hrd2); labels(hrd1); labels(hrd2) # Example of how to reorder a dendrogram and print out its labels.
library(ctc); hc2Newick(hr) # The 'hc2Newick' function of the BioC ctc library can convert a hclust object into the Newick tree file format for export into external programs.
 Tree subsetting (see also Dynamic Tree Cut package)
mycl <- cutree(hr, h=max(hr$height)/2); mycl[hr$labels[hr$order]]
# Cuts the tree into discrete cluster groups (bins) at the height
# specified by the 'h' argument. Alternatively, one can cut based on a
# desired number of clusters, which can be specified with the 'k'
# argument.
rect.hclust(hr, k=5, border="red") # Cuts dendrogram at specified level and draws rectangles around the resulting clusters.
subcl <- names(mycl[mycl==3]); subd <- as.dist(as.matrix(d)[subcl,subcl]); subhr <- hclust(subd, method = "complete"); par(mfrow = c(1, 2)); plot(hr); plot(subhr)
# This example shows how one can subset a tree by cutting it at a given
# branch point provided by the cutree() function. The identified row IDs
# are then used to subset the distance matrix and recluster it with
# hclust.
lapply(as.dendrogram(hr), unlist); lapply(as.dendrogram(hr)[[1]], unlist)
# Returns the indices of the cluster members of the two main branches.
# The second step returns the members of the next two descendent clusters
# of the first main branch.
example(hclust)
# The last step in this example shows how the 'members' argument can be
# used to reconstruct a hierarchical tree above the cut points using the
# cluster centers. More details on this can be found in the hclust help
# document.
source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/selectNodes.R")
Uniclusters <- nodes2Cluster(myhclust=hr, rootdist=2)
clusterQCselect <- clusterQC(clusterlist=Uniclusters, simMatrix=as.matrix(c), method="min", cutoff=0.7, output="df"); clusterQCselect
# The function 'nodes2Cluster' cuts the tree at all height levels as
# defined in 'hr$height', and the second function 'clusterQC' selects
# clusters based on a given minimum cutoff in an all-against-all
# comparison within each cluster. The 'method' argument allows one to
# calculate the cutoff value with one of the following functions: "min",
# "qu1", "median", "mean", "qu3" and "max". The end result can be returned
# as a list (output="list") or as a data frame (output="df"). The argument
# 'rootdist' defines the starting distance from the root of the tree to
# avoid testing of very large clusters.
 Tree coloring and zooming into branches
source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/dendroCol.R") # Imports the tree coloring function.
dend_colored <- dendrapply(as.dendrogram(hr), dendroCol, keys=subcl, xPar="edgePar", bgr="red", fgr="blue", lwd=2, pch=20)
# In this example the dendrogram for the above 'hr' object is colored
# with the imported 'dendroCol()' function based on the identifiers
# provided in its 'keys' argument. If 'xPar' is set to 'nodePar', the
# labels are colored instead of the leaves.
par(mfrow = c(1, 3)); plot(dend_colored, horiz=T); plot(dend_colored, horiz=T, type="tr"); plot(dend_colored, horiz=T, edgePar=list(lwd=2), xlim=c(3,0), ylim=c(1,3))
# Plots the colored tree in different formats. The last command shows how
# one can zoom into the tree with the 'xlim' and 'ylim' arguments, which
# is possible since R 2.8.
z <- as.dendrogram(hr); attr(z[[2]][[2]],"edgePar") <- list(col="blue", lwd=4, pch=NA); plot(z, horiz=T) # This example shows how one can manually color tree elements.
 Plot heatmaps with the heatmap() function
y <- matrix(rnorm(50), 10, 5, dimnames=list(paste("g", 1:10, sep=""), paste("t", 1:5, sep=""))) # Creates a sample data set.
source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/my.colorFct.R")
# Imports a color function to obtain heat maps in red and green. Use this
# color function in the heatmap color argument like this:
# 'col=my.colorFct()'.
hr <- hclust(as.dist(1-cor(t(y), method="pearson")), method="complete") # Clusters rows by Pearson correlation as distance method.
hc <- hclust(as.dist(1-cor(y, method="spearman")), method="complete") # Clusters columns by Spearman correlation as distance method.
heatmap(y, Rowv=as.dendrogram(hr), Colv=as.dendrogram(hc), col=my.colorFct(), scale="row")
# The previous two hclust objects are used to obtain a heatmap where
# the rows are clustered by Pearson and the columns by Spearman
# correlation distances. The 'scale' argument allows one to scale the
# input matrix by rows or columns. The default is by rows, and "none"
# turns the scaling off.
mycl <- cutree(hr, h=max(hr$height)/1.5); mycol <- sample(rainbow(256)); mycol <- mycol[as.vector(mycl)]; heatmap(y, Rowv=as.dendrogram(hr), Colv=as.dendrogram(hc), col=my.colorFct(), scale="row", ColSideColors=heat.colors(length(hc$labels)), RowSideColors=mycol)
# The arguments 'ColSideColors' and 'RowSideColors' allow one to annotate
# a heatmap with color bars. In this example the row colors correspond to
# cluster bins that were obtained by the cutree() function.
dend_colored <- dendrapply(as.dendrogram(hr), dendroCol, keys=c("g1", "g2"), xPar="edgePar", bgr="black", fgr="red", pch=20); heatmap(y, Rowv=dend_colored, Colv=as.dendrogram(hc), col=my.colorFct(), scale="row", ColSideColors=heat.colors(length(hc$labels)), RowSideColors=mycol)
# In addition, one can color the tree leaves or labels by providing a
# colored dendrogram object, here 'dend_colored' from above.
heatmap(y, Rowv=as.dendrogram(hr), Colv=as.dendrogram(hc), col=my.colorFct(), scale="row", labRow = "", labCol="") # The arguments 'labRow' and 'labCol' allow one to provide custom labels or to omit the printing of the labels as shown here.
heatmap(y, Rowv=as.dendrogram(hr), Colv=NA, col=my.colorFct()) # If 'Rowv' or 'Colv' is set to 'NA', then the row and/or the column clustering is turned off.
ysort <- y[hr$labels[hr$order], hc$labels[hc$order]]; heatmap(ysort, Rowv=NA, Colv=NA, col=my.colorFct()) # This example shows how the rows and columns of a heatmap can be sorted by the hclust trees without printing them.
x11(height=4, width=4); heatmap(matrix(rep(seq(min(as.vector(y)), max(as.vector(y)), length.out=10), 2), 2, byrow=T, dimnames=list(1:2, round(seq(min(as.vector(y)), max(as.vector(y)), length.out=10), 1))), col=my.colorFct(), Colv=NA, Rowv=NA, labRow="", main="Color Legend") # Prints a color legend in a separate window.
 Plot heatmaps with the image() or heatmap.2() functions
image(scale(t(ysort)), col=my.colorFct())
# A plot very similar to the previous one can be obtained with the
image() function, which plots the values of a given matrix as a grid of
colored rectangles.
x11(height=12); par(mfrow=c(1,2)); plot(as.dendrogram(hr), horiz=T, yaxt="n", edgePar=list(col="grey", lwd=4)); image(scale(t(ysort)), col=my.colorFct(), xaxt="n", yaxt="n")
# To create very large heatmaps that are still readable, one can plot
the dendrogram next to the heatmap generated by the image() function.
library("gplots") # Loads the gplots library that contains the heatmap.2() function.
heatmap.2(y, Rowv=as.dendrogram(hr), Colv=as.dendrogram(hc), col=redgreen(75), scale="row", ColSideColors=heat.colors(length(hc$labels)), RowSideColors=mycol, trace="none", key=T, cellnote=round(t(scale(t(y))),1))
# The heatmap.2() function contains many additional plotting features
that are not available in the basic heatmap() function. For instance,
the 'key' argument adds a color key within the same plot, and numeric
values can be added to each cell with the 'cellnote' argument. In
addition, heatmap.2() scales to very large formats for viewing complex
heatmaps.
library(RColorBrewer); mypalette <- rev(brewer.pal(9,"Blues")[-1]) # Selects an alternative color scheme that is more appropriate for representing continuous values (e.g. chip intensities).
heatmap.2(y, Rowv=as.dendrogram(hr), dendrogram="row", Colv=F, col=mypalette, scale="row", trace="none", key=T, cellnote=round(t(scale(t(y))),1))
# Example illustrating how to turn off the row or column reordering in
heatmap.2(). The settings 'dendrogram="row", Colv=F' turn off the column
reordering and suppress the column tree. To turn off the reordering of
both dimensions, use the settings 'dendrogram="none", Rowv=F, Colv=F'.
 Zooming into heatmaps by subclustering selected tree nodes
y <- matrix(rnorm(500), 100, 5, dimnames=list(paste("g", 1:100, sep=""), paste("t", 1:5, sep=""))) # Creates a sample data set.
hr <- hclust(as.dist(1-cor(t(y), method="pearson")), method="complete"); hc <- hclust(as.dist(1-cor(y, method="spearman")), method="complete") # Generates row and column dendrograms.
mycl <- cutree(hr, h=max(hr$height)/1.5); mycolhc <- rainbow(length(unique(mycl)), start=0.1, end=0.9); mycolhc <- mycolhc[as.vector(mycl)] # Cuts the tree and creates a color vector for the clusters.
library(gplots); myheatcol <- redgreen(75) # Assign your favorite heatmap color scheme. Some useful examples: colorpanel(40, "darkblue", "yellow", "white"); heat.colors(75); cm.colors(75); rainbow(75); redgreen(75); library(RColorBrewer); rev(brewer.pal(9,"Blues")[-1]). Type demo.col(20) to see more color schemes.
heatmap.2(y,
Rowv=as.dendrogram(hr), Colv=as.dendrogram(hc), col=myheatcol,
scale="row", density.info="none", trace="none", RowSideColors=mycolhc) # Creates heatmap for entire data set where the obtained clusters are indicated in the color bar.
x11(height=6, width=2); names(mycolhc) <- names(mycl); barplot(rep(10, max(mycl)), col=unique(mycolhc[hr$labels[hr$order]]), horiz=T, names=unique(mycl[hr$order])) # Prints the color key for the cluster assignments. The numbers next to the color boxes correspond to the cluster numbers in 'mycl'.
clid <- c(1,2); ysub <- y[names(mycl[mycl%in%clid]),]; hrsub <- hclust(as.dist(1-cor(t(ysub), method="pearson")), method="complete") # Select subcluster number (here: clid=c(1,2)) and generate the corresponding dendrogram.
x11();
heatmap.2(ysub, Rowv=as.dendrogram(hrsub), Colv=as.dendrogram(hc),
col=myheatcol, scale="row", density.info="none", trace="none",
RowSideColors=mycolhc[mycl%in%clid]) # Create heatmap for chosen subcluster.
data.frame(Labels=rev(hrsub$labels[hrsub$order])) # Print out row labels in same order as shown in the heatmap.
Bootstrap Analysis in Hierarchical Clustering
The pvclust package assesses the uncertainty in hierarchical cluster
analysis by calculating p-values for each cluster via multiscale
bootstrap resampling. The method provides two types of p-values. The
approximately unbiased p-value (AU) is computed by multiscale bootstrap
resampling. It is less biased than the second one, the bootstrap
probability (BP), which is computed by normal bootstrap resampling.
library(pvclust) # Loads the required pvclust package.
y <- matrix(rnorm(500), 50, 10, dimnames=list(paste("g", 1:50, sep=""), paste("t", 1:10, sep=""))) # Creates a sample data set.
pv < pvclust(scale(t(y)), method.dist="correlation", method.hclust="complete", nboot=10)
# This step performs the hierarchical cluster analysis with multiscale
bootstrap with 10 repetitions, complete linkage for cluster joining and a
correlation-based dissimilarity matrix. Usually, one should set 'nboot'
to at least 1000 bootstrap repetitions.
plot(pv, hang=-1) # Plots a dendrogram where the red numbers represent the AU p-values and the green ones the BP values.
pvrect(pv, alpha=0.95) # This command highlights with red rectangles all clusters in the dendrogram that have an AU value above 95%.
clsig <- unlist(pvpick(pv, alpha=0.95, pv="au", type="geq", max.only=TRUE)$clusters) # The function pvpick() picks significant clusters.
source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/dendroCol.R") # Import tree coloring function. The details on this function are provided in the hclust section of this manual.
dend_colored < dendrapply(as.dendrogram(pv$hclust), dendroCol, keys=clsig, xPar="edgePar", bgr="black", fgr="red", pch=20) # Creates dendrogram object where the significant clusters are labeled in red.
plot(dend_colored, horiz=T) # Plots a dendrogram where the significant clusters are highlighted in red.
QT Clustering
QT (quality threshold) clustering is a partitioning method that
forms clusters based on a maximum cluster diameter. It iteratively
identifies the largest cluster below the threshold and removes its items
from the data set until all items are assigned. The method was
developed by Heyer et al. (1999) for the clustering of gene expression data.
library(flexclust) # Loads flexclust library.
x <- matrix(10*runif(1000), ncol=2) # Creates a sample data set.
cl1 <- qtclust(x, radius=3) # Uses 3 as the maximum distance of the points to the cluster centers.
cl2 <- qtclust(x, radius=1) # Uses 1 as the maximum distance of the points to the cluster centers.
par(mfrow=c(2,1)); plot(x, col=predict(cl1), xlab="", ylab=""); plot(x, col=predict(cl2), xlab="", ylab="") # Plots the two cluster results.
data.frame(name_index=1:nrow(x), cl=attributes(cl1)$cluster) # Provides the row-to-cluster assignment information.
K-Means & PAM
K-means, PAM (partitioning around medoids) and clara are related
partitioning algorithms that cluster data points into a predefined
number of K clusters. They do this by assigning each data point to its
nearest centroid and then recomputing the cluster centroids. In the next
step the data points are assigned to the nearest adjusted centroid.
This procedure is repeated until the cluster assignments are stable.
K-means uses the average of all the points in a cluster for centering,
while PAM uses the most centrally located point. Commonly used R
functions for K-means clustering are: kmeans() of the stats package,
kcca() of the flexclust package and trimkmeans() of the trimcluster package. PAM clustering is available in the pam() function from the cluster package. The clara() function of the same package is a PAM wrapper for clustering very large data sets.
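The assign-and-recompute loop described above can be sketched in a few lines of base R. This is only a toy illustration of the standard (Lloyd) iteration, not a replacement for kmeans(); the variable names and the fixed 20 iterations are arbitrary choices for this sketch.

```r
set.seed(1)
x <- matrix(rnorm(100), 50, 2)                  # 50 points in 2 dimensions
k <- 3
cen <- x[sample(nrow(x), k), ]                  # random initial centroids
for(iter in 1:20) {
    # assign each point to its nearest centroid (squared Euclidean distance)
    d <- sapply(1:k, function(j) rowSums((x - matrix(cen[j,], nrow(x), 2, byrow=TRUE))^2))
    cl <- max.col(-d)                           # column index of the smallest distance
    # recompute each centroid as the mean of its assigned points; keep old centroid if a cluster is empty
    cen <- t(sapply(1:k, function(j) if(any(cl==j)) colMeans(x[cl==j, , drop=FALSE]) else cen[j,]))
}
table(cl)                                       # cluster sizes after the iterations
```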
 K-means
km <- kmeans(t(scale(t(y))), 3); km$cluster
# K-means clustering is a partitioning method where the number of
clusters is predefined. The basic R implementation requires the data
matrix as input and uses Euclidean distance. In contrast to pam(), this
implementation does not accept a distance matrix. In the given example
the row-wise centered data matrix is provided.
mycol <- sample(rainbow(256)); mycol <- mycol[as.vector(km$cluster)]; heatmap(y, Rowv=as.dendrogram(hr), Colv=as.dendrogram(hc), col=my.colorFct(), scale="row", ColSideColors=heat.colors(length(hc$labels)), RowSideColors=mycol)
# This example shows how the obtained K-means clusters can be compared
with the hierarchical clustering results by highlighting them in the
heatmap color bar.
 PAM (partitioning around medoids)
library(cluster) # Loads the required cluster library.
y <- matrix(rnorm(50), 10, 5, dimnames=list(paste("g", 1:10, sep=""), paste("t", 1:5, sep=""))) # Creates a sample data set.
pamy <- pam(y, 4); pamy$clustering # Clusters the data into 4 clusters using the default Euclidean distance method.
mydist <- as.dist(1-cor(t(y), method="pearson")) # Generates a distance matrix using Pearson correlation as distance method.
pamy <- pam(mydist, 4); pamy$clustering # Same as above, but uses the provided distance matrix.
pam(mydist, 4, cluster.only=TRUE) # Same as before, but with faster computation.
plot(pamy) # Generates cluster plot.
 Clara (clustering large applications: PAM method for large data sets)
clarax <- clara(t(scale(t(y))), 4); clarax$clustering
# Clusters the above data into 4 clusters using the default Euclidean
distance method. If the data set consists of gene expression values,
then row-wise scaling of y is recommended in combination with Euclidean
distances.
Fuzzy Clustering
In contrast to strict/hard clustering approaches, fuzzy clustering
allows multiple cluster memberships for the clustered items. This is
commonly achieved by partitioning the membership assignments among
clusters with positive weights that sum to one for each item. Several R
libraries contain implementations of fuzzy clustering algorithms. The
library e1071 contains the cmeans (fuzzy C-means) and cshell (fuzzy C-shell) clustering functions, and the cluster library provides the fanny function, a fuzzy implementation of the k-medoids method described above.
 Fuzzy clustering with the cluster library
library(cluster) # Loads the cluster library.
y <- matrix(rnorm(50), 10, 5, dimnames=list(paste("g", 1:10, sep=""), paste("t", 1:5, sep=""))) # Creates a sample data set.
fannyy <- fanny(y, k=4, metric = "euclidean", memb.exp = 1.5); round(fannyy$membership, 2); fannyy$clustering
# Computes membership coefficients (weights) for 4 clusters and stores
them in the 'membership' component of the resulting fanny object. The
corresponding hard clustering result is provided in the 'clustering'
slot, which is formed by assigning each point to the cluster with the
highest coefficient. If the distance method is set to
metric="SqEuclidean", then the function performs fuzzy C-means
clustering. The membership exponent can be controlled with the argument
'memb.exp' (default is 2). Values close to 1 give 'crispier' clusterings,
whereas larger values increase the fuzziness of the results. If a
clustering results in complete fuzziness, then the function returns a
warning.
mydist <- as.dist(1-cor(t(y), method="pearson")) # Generates a distance matrix using Pearson correlation as distance method.
fannyy <- fanny(mydist, k=4, memb.exp = 1.5); round(fannyy$membership, 2); fannyy$clustering # Example using a correlation-based distance matrix.
fannyyMA <- round(fannyy$membership, 2) > 0.10; apply(fannyyMA, 1, which) # Returns multiple cluster memberships for coefficients above a certain value (here >0.1).
 Fuzzy clustering with the e1071 library
library(e1071) # Loads the e1071 library.
y <- matrix(rnorm(50), 10, 5, dimnames=list(paste("g", 1:10, sep=""), paste("t", 1:5, sep=""))) # Creates a sample data set.
cmeansy <- cmeans(y, 4, method="cmeans", m=1.5); round(cmeansy$membership, 2); cmeansy$cluster
# Uses the cmeans clustering method to compute membership coefficients
(weights) for 4 clusters and stores them in the 'membership' component
of the resulting object. The result of this example should be identical
to the above clustering with the fanny function when its metric argument
is set to "SqEuclidean". The cmeans function does not accept a distance
matrix as input, but the fanny function does.
cmeansyMA <- round(cmeansy$membership, 2) > 0.10; apply(cmeansyMA, 1, which) # Returns multiple cluster memberships for coefficients above a certain value (here >0.1).
cshelly <- cshell(y, 4, method="cshell", m=1.5); round(cshelly$membership, 2); cshelly$cluster
# Uses the cshell clustering method to compute membership coefficients
(weights) for 4 clusters and stores them in the 'membership' component
of the resulting object.
Self-Organizing Map (SOM)
The self-organizing map (SOM), also known as Kohonen network, is a
popular artificial neural network algorithm in the unsupervised learning
area. The approach iteratively assigns all items in a data matrix to a
specified number of representatives and then updates each representative
with the mean of its assigned data points. Widely used R packages for SOM
clustering and visualization are: class (part of R), SOM and kohonen. The SOM package, which is introduced here, provides utilities similar to the GeneCluster software from the Broad Institute.
library(som) # Loads the required SOM library.
y <- matrix(rnorm(50), 10, 5, dimnames=list(paste("g", 1:10, sep=""), paste("t", 1:5, sep=""))) # Creates a sample data set.
y <- t(scale(t(y))) # Normalizes the data so that each row has a mean close to zero and a standard deviation of one.
y.som < som(y, xdim = 5, ydim = 6, topol = "hexa", neigh = "gaussian") # Performs SOM clustering.
plot(y.som) # Plots SOM clustering result.
y.som$visual # Provides the assignment of row items to the SOM clusters. Remember that the coordinate counting starts here at zero!
Principal Component Analysis (PCA)
Principal component analysis (PCA) is a data reduction technique
that simplifies multidimensional data sets to 2 or 3 dimensions
for plotting purposes and visual variance analysis. The following
commands introduce the basic usage of the prcomp() function. A closely
related function is princomp(). The Bioconductor library pcaMethods provides many additional PCA functions. For viewing PCA plots in 3D, one can use the scatterplot3d library or the made4 library.
z1 <- rnorm(10000, mean=1, sd=1); z2 <- rnorm(10000, mean=3, sd=3); z3 <- rnorm(10000, mean=5, sd=5); z4 <- rnorm(10000, mean=7, sd=7); z5 <- rnorm(10000, mean=9, sd=9); mydata <- matrix(c(z1, z2, z3, z4, z5), 2500, 20, byrow=T, dimnames=list(paste("R", 1:2500, sep=""), paste("C", 1:20, sep=""))) # Generates a sample matrix of five discrete clusters that have very different mean and standard deviation values.
pca <- prcomp(mydata, scale=T)
# Performs principal component analysis after scaling the data. It
returns a list with class "prcomp" that contains five components: (1)
the standard deviations (sdev) of the principal components, (2) the
matrix of eigenvectors (rotation), (3) the principal component data (x),
(4) the centering (center) and (5) scaling (scale) used.
summary(pca) # Prints variance summary for all principal components.
x11(height=6, width=12, pointsize=12); par(mfrow=c(1,2)) # Set plotting parameters.
mycolors < c("red", "green", "blue", "magenta", "black") # Define plotting colors.
plot(pca$x, pch=20, col=mycolors[sort(rep(1:5, 500))]) # Plots scatter plot for the first two principal components that are stored in pca$x[,1:2].
plot(pca$x, type="n"); text(pca$x, rownames(pca$x), cex=0.8, col=mycolors[sort(rep(1:5, 500))]) # Same as above, but prints labels.
library(geneplotter); smoothScatter(pca$x) # Same as above, but generates a smooth scatter plot that shows the density of the data points.
pairs(pca$x[,1:4], pch=20, col=mycolors[sort(rep(1:5, 500))]) # Plots scatter plots for all combinations between the first four principal components.
biplot(pca) # Plots a scatter plot for the first two principal components plus the corresponding eigenvectors that are stored in pca$rotation.
library(scatterplot3d) # Loads library scatterplot3d.
scatterplot3d(pca$x[,1:3], pch=20, color=mycolors[sort(rep(1:5, 500))]) # Same as above, but plots the first three principal components in 3D scatter plot.
library(rgl); rgl.open(); offset <- 50; par3d(windowRect=c(offset, offset, 640+offset, 640+offset)); rm(offset); rgl.clear(); rgl.viewpoint(theta=45, phi=30, fov=60, zoom=1); spheres3d(pca$x[,1], pca$x[,2], pca$x[,3], radius=0.3, color=mycolors, alpha=1, shininess=20); aspect3d(1, 1, 1); axes3d(col='black'); title3d("", "", "PC1", "PC2", "PC3", col='black'); bg3d("white")
# Creates an interactive 3D scatter plot with OpenGL. The rgl library
needs to be installed for this. To save a snapshot of the graph, one can
use the command rgl.snapshot("test.png").
Multidimensional Scaling (MDS)
Multidimensional scaling (MDS) algorithms start with a matrix of
item-item distances and then assign coordinates for each item in a
low-dimensional space to represent the distances graphically. cmdscale()
is the base function for MDS in R. Additional MDS functions are
sammon() and isoMDS() of the MASS library.
loc <- cmdscale(eurodist) # Performs MDS analysis on the geographic distances between European cities.
plot(loc[,1], -loc[,2], type="n", xlab="", ylab="", main="cmdscale(eurodist)"); text(loc[,1], -loc[,2], rownames(loc), cex=0.8) # Plots the MDS results in a 2D plot. The minus is required in this example to flip the plotting orientation.
Bicluster Analysis
Biclustering (also co-clustering or two-mode clustering) is an
unsupervised clustering technique which allows simultaneous clustering
of the rows and columns of a matrix. The goal of biclustering is to find
subgroups of rows and columns which are as similar as possible to each
other and as different as possible from the remaining data points. The biclust package, which is introduced here, contains a collection of bicluster algorithms, data preprocessing and visualization methods (Detailed User Manual). An algorithm that allows the integration of different data types is implemented in the cMonkey R program (BMC Bioinformatics 2006, 7, 280). A comparison of several bicluster algorithms for clustering gene expression data has been published by Prelic et al (2006).
Since most biclustering algorithms expect the input data matrix to be
properly preprocessed, it is especially important to carefully read the
manual pages for the different functions.
 Plaid model biclustering
library(biclust) # Loads the biclust library.
data(BicatYeast) # Loads sample data from Barkow et al. (Bioinformatics, 22, 1282-1283).
res1 <- biclust(BicatYeast, method=BCPlaid())
# Performs plaid model biclustering as described in Turner et al., 2003.
This algorithm models the data matrix as a sum of layers; the model is
fitted to the data by minimizing the error.
summary(res1); show(res1)
# The summary() function returns the sizes of all clusters found and
the show() function provides an overview of the applied method.
names(attributes(res1)) #
Converts res1 object of class Biclust into a list and returns the names
of its components (slots). The results for the row clustering are
stored in the "RowxNumber" slot and the results for the column
clustering are stored in the "NumberxCol" slot.
BicatYeast[res1@RowxNumber[,2], res1@NumberxCol[2,]] # Extracts values for bicluster number 2 from original data set.
myrows <- attributes(res1)$RowxNumber; mycols <- attributes(res1)$NumberxCol; myclusters <- lapply(1:length(myrows[1,]), function(x) list(rownames(BicatYeast[myrows[,x], ]), colnames(BicatYeast[, mycols[x,]]))); names(myclusters) <- paste("CL", 1:length(myclusters), sep="_"); myclusters[2] # Organizes all identified row and column clusters in a list object, here 'myclusters'.
writeBiclusterResults("results.txt", res1, "My Result", dimnames(BicatYeast)[1][[1]], dimnames(BicatYeast)[2][[1]]) # Writes bicluster results to a file.
bubbleplot(BicatYeast, res1, showLabels=T)
# Draws a bubble plot where each bicluster is represented as a circle
(bubble). Up to three biclustering result sets can be compared in one
plot (here one). The sets are then labeled with different colors. The
color brightness represents the bicluster homogeneity (darker, less
homogeneous). The bubble sizes represent the size of the biclusters, as
(number of rows)x(number of columns). The bubble location is a
2D projection of the row and column profiles.
parallelCoordinates(x=BicatYeast, bicResult=res1, number=4) # Plots the profiles for the 4th cluster. The cluster number is specified under the number argument.
drawHeatmap(BicatYeast, res1, number=4)
# Draws a heatmap for the BicatYeast data matrix with the rows and
columns of the selected 4th bicluster at the top-left of the plot.
 XMotifs biclustering
y <- discretize(BicatYeast, nof=10) # Converts the input matrix with continuous data into a matrix with discrete values of 10 levels (nof).
res2 <- biclust(y, method=BCXmotifs(), alpha=0.05, number=50)
# Performs Xmotifs biclustering based on the framework by Murali and
Kasif (2003). It searches for rows with constant values over a set of
columns, which results in a submatrix where all rows in a discretized
matrix have the same value profile. Usually the algorithm needs a
discrete matrix to perform well.
bubbleplot(BicatYeast, res1, res2, showLabels=F) # Compares the two bicluster results in form of a bubble plot.
jaccardind(res1, res2) # Calculates the similarity of two clustering results using an adaptation of the Jaccard index (values from 0-1). An overview of related methods is available on this cluster validity algorithm page.
 Bimax biclustering
y <- BicatYeast; y[y >= 1 | y <= -1] <- 1; y[y != 1] <- 0
# Transforms the input data matrix into binary values. Here the log2
ratios with at least a two-fold change are set to 1 and the rest is set to 0.
res3 <- biclust(y, method=BCBimax(), minr=15, minc=10, number=100) # Performs Bimax biclustering based on an algorithm described by Prelic et al (2006). This method searches in a binary matrix for submatrices containing only ones.
 Spectral biclustering
res4 <- biclust(BicatYeast, method=BCSpectral(), numberOfEigenvalues=2)
# Performs spectral biclustering as described in Kluger et al., 2003.
Spectral biclustering assumes that normalized microarray data matrices
have a checkerboard structure that can be discovered by applying
singular value decomposition (SVD) to the data.
 CC biclustering
res5 <- biclust(y, method=BCCC(), delta=1.5, alpha=1, number=50)
# Performs CC biclustering based on the framework by Cheng and Church
(2000). It searches for submatrices with a score lower than a specific
threshold in a standardized data matrix.
Network Analysis
 WGCNA: an R package for weighted correlation network analysis
 BioNet: routines for the functional analysis of biological networks
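The central step in weighted correlation network analysis is to raise the absolute gene-gene correlation matrix to a soft-thresholding power, which yields a weighted adjacency matrix. A minimal base-R sketch of this idea follows; the WGCNA package itself provides dedicated adjacency, topological overlap and module detection functions, and the power beta=6 is only an assumed example value.

```r
set.seed(1)
y <- matrix(rnorm(200), 20, 10)              # toy expression matrix: 20 genes x 10 samples
beta <- 6                                    # soft-thresholding power (assumed example value)
adj <- abs(cor(t(y)))^beta                   # weighted adjacency from gene-gene correlations
diag(adj) <- 0                               # removes self-connections
conn <- rowSums(adj)                         # weighted connectivity of each gene
```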
Support Vector Machines (SVM)
Support vector machines (SVMs) are supervised machine learning
classification methods. SVM implementations are available in the R
packages kernlab and e1071.
The e1071 package contains an interface to the C++ libsvm
implementation from Chih-Chung Chang and Chih-Jen Lin. In addition to
SVMs, the e1071 package includes a comprehensive set of machine learning
utilities, such as functions for latent class analysis, bootstrapping,
short-time Fourier transform, fuzzy clustering, shortest path
computation, bagged clustering, naive Bayes classification, etc. The kernlab
package contains additional functions for spectral clustering, kernel
PCA, etc.
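A minimal classification example with the svm() function from the e1071 package is sketched below, using R's built-in iris data set with default parameters (radial kernel). It only illustrates the basic train/predict workflow; for real applications the data should be split into training and test sets.

```r
library(e1071)                                # provides svm(), an interface to libsvm
data(iris)
model <- svm(Species ~ ., data=iris)          # trains an SVM classifier on all four measurements
pred <- predict(model, iris[, -5])            # predicts the class labels for the training data
table(pred, iris$Species)                     # confusion matrix of predicted vs. true species
```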
Similarity Measures for Clustering Results
To measure the similarities among clustering results, one can
compare the numbers of identical and unique item pairs appearing in
their clusters. This can be achieved by counting the number of item
pairs found in both clustering sets (a) as well as the pairs appearing only in the first (b) or the second (c)
set. With this information it is possible to calculate a similarity
coefficient, such as the Jaccard Index. The latter is defined as the
size of the intersection divided by the size of the union of two sample
sets: a/(a+b+c). In the case of
partitioning results, the Jaccard Index measures how frequently pairs of
items are joined together in two clustering data sets and how often
pairs are observed only in one set. Related coefficients are the Rand Index and the Adjusted Rand Index. These indices also consider the number of pairs (d)
that are not joined together in any of the clusters in both sets. A
variety of alternative similarity coefficients can be considered for
comparing clustering results. An overview of available methods is given
on this cluster validity page. In addition, the Consense library contains a variety of functions for comparing cluster sets, and the mclust02 library contains an implementation of the variation of information criterion described by M. Meila (J Mult Anal 98, 873-895).
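The a/(a+b+c) definition above can be illustrated with a small self-contained base-R function that counts co-clustered item pairs directly. This is only a toy version with a hypothetical name, jaccard_pairs; the cindex() script used in this section is the more efficient implementation.

```r
# Jaccard index between two clustering vectors, computed from item pairs
jaccard_pairs <- function(cl1, cl2) {
    ut <- upper.tri(matrix(0, length(cl1), length(cl1)))     # selects all unordered item pairs
    p1 <- outer(cl1, cl1, "==")[ut]                          # is each pair co-clustered in set 1?
    p2 <- outer(cl2, cl2, "==")[ut]                          # is each pair co-clustered in set 2?
    a <- sum(p1 & p2)                                        # pairs joined in both sets
    b <- sum(p1 & !p2)                                       # pairs joined only in set 1
    c <- sum(!p1 & p2)                                       # pairs joined only in set 2
    a / (a + b + c)
}
jaccard_pairs(c(1,1,2,2), c(1,1,2,2))   # identical partitions give 1
```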
source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/clusterIndex.R") # Imports the cindex() function for computing the Jaccard Index or the Rand Index.
library(cluster); y <- matrix(rnorm(5000), 1000, 5, dimnames=list(paste("g", 1:1000, sep=""), paste("t", 1:5, sep=""))); clarax <- clara(y, 49); clV1 <- clarax$clustering; clarax <- clara(y, 50); clV2 <- clarax$clustering # Creates two sample cluster data sets stored in the named vectors clV1 and clV2.
cindex(clV1=clV1, clV2=clV2, self=FALSE, minSZ=1, method="jaccard")
# Computes the Jaccard Index for clV1 and clV2, where values close to 0
indicate low similarities and values close to 1 high similarities among
the evaluated cluster sets. The Rand and Adjusted Rand Indices can be
returned by assigning the values "rand" or "arand", respectively, to the
method argument. By setting the argument self=TRUE, one can allow self
comparisons, where pairs with different and identical labels are
counted. This way clusters with single items will be considered as
well. If it is set to FALSE, only pairs with different labels are
considered. The implemented cindex() function is relatively speed- and
memory-efficient. Two clustering results with 1000 items each,
partitioned into 50 clusters, will be processed in about 0.05 seconds.
To focus the analysis on clusters above a minimum size limit, one can
use the minSZ argument. For instance, the setting minSZ=3 will only
consider clusters with 3 or more items.
clVlist <- lapply(3:12, function(x) clara(y[1:30, ], k=x)$clustering); names(clVlist) <- paste("k", "=", 3:12); d <- sapply(names(clVlist), function(x) sapply(names(clVlist), function(y) cindex(clV1=clVlist[[y]], clV2=clVlist[[x]], method="jaccard")[[3]])); hv <- hclust(as.dist(1-d)); plot(as.dendrogram(hv), edgePar=list(col=3, lwd=4), horiz=T, main="Similarities of 10 Clara Clustering Results for k: 3-12")
# Example for clustering entire cluster result sets. First, 10 sample
cluster results are created with Clara using k-values from 3 to 12. The
results are stored as named clustering vectors in a list object. Then a
nested sapply loop is used to generate a similarity matrix of Jaccard
Indices for the clustering results. After converting the result into a
distance matrix, hierarchical clustering is performed with hclust.
Clustering Exercises
 Slide Show (R Code)
 Install required libraries
 The following exercises demonstrate several useful clustering and data mining utilities in R.
 Import a sample data set
 Download from GEO the Arabidopsis IAA treatment series "GSE1110" in TXT format. The direct link to the download is:
 Import the data set into R
temp <- readLines("GSE1110_series_matrix.txt"); cat(temp[-grep("^!|^\"$", temp)], file="GSE1110clean.txt", sep="\n"); mydata <- read.delim("GSE1110clean.txt", header=T, sep="\t") # These import commands include a cleanup step to get rid of annotation lines and corrupted return signs.
rownames(mydata) <- mydata[,1]; mydata <- as.matrix(mydata[,-1]) # Assigns row indices and converts the data into a matrix object.
 Filtering
mydata <- mydata[apply(mydata>100, 1, sum)/length(mydata[1,])>0.5 & apply(log2(mydata), 1, IQR)>1.5,] # Retrieves all rows with high intensities (50% > 100) and high variability (IQR>1.5).
 Hierarchical clustering
source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/my.colorFct.R") # Import an alternative color scheme for the heatmap function.
mydatascale <- t(scale(t(mydata))) # Centers and scales the data.
hr <- hclust(as.dist(1-cor(t(mydatascale), method="pearson")), method="complete") # Clusters rows by Pearson correlation.
hc <- hclust(as.dist(1-cor(mydatascale, method="spearman")), method="complete") # Clusters columns by Spearman correlation.
heatmap(mydata, Rowv=as.dendrogram(hr), Colv=as.dendrogram(hc), col=my.colorFct(), scale="row") # Plot the data table as heatmap and the cluster results as dendrograms.
mycl <- cutree(hr, h=max(hr$height)/1.5); mycolhc <- sample(rainbow(256)); mycolhc <- mycolhc[as.vector(mycl)]; heatmap(mydata, Rowv=as.dendrogram(hr), Colv=as.dendrogram(hc), col=my.colorFct(), scale="row", RowSideColors=mycolhc) # Cuts the tree at a specific height and colors the corresponding clusters in the heatmap color bar.
 Obtain significant clusters by pvclust bootstrap analysis
library(pvclust) # Loads the required pvclust package.
pv <- pvclust(scale(t(mydata)), method.dist="correlation", method.hclust="complete", nboot=10)
# Performs the hierarchical cluster analysis. Due to time restrictions,
we are using here only 10 bootstrap repetitions. Usually, one should use
at least 1000 repetitions.
plot(pv, hang=-1); pvrect(pv, alpha=0.95) # Plots the result as a dendrogram where the significant clusters are highlighted with red rectangles.
clsig <- unlist(pvpick(pv, alpha=0.95, pv="au", type="geq", max.only=TRUE)$clusters) # Retrieves members of significant clusters.
source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/dendroCol.R") # Import tree coloring function.
dend_colored <- dendrapply(as.dendrogram(pv$hclust), dendroCol, keys=clsig, xPar="edgePar", bgr="black", fgr="red", pch=20) # Creates a dendrogram object where the significant clusters are labeled in red.
heatmap(mydata, Rowv=dend_colored, Colv=as.dendrogram(hc), col=my.colorFct(), scale="row", RowSideColors=mycolhc)
# Plot the heatmap from above, but with the significant clusters in red
and the cluster bins from the tree cutting step in the color bar.
x11(height=12); heatmap.2(mydata,
Rowv=dend_colored, Colv=as.dendrogram(hc), col=my.colorFct(),
scale="row", trace="none", RowSideColors=mycolhc) # Plot heatmap with heatmap.2() function which scales better for many entries.
mydatasort <- mydata[pv$hclust$labels[pv$hclust$order], hc$labels[hc$order]] # Sorts the rows of the data table by 'dend_colored' and its columns by 'hc'.
x11(height=16, width=12); par(mfrow=c(1,2)); plot(dend_colored, horiz=T, yaxt="n"); image(scale(t(mydatasort)), col=my.colorFct(), xaxt="n",yaxt="n") # Plots the heatmap with the bootstrap tree in a larger format using the image() function instead of heatmap().
pdf("pvclust.pdf", height=21, width=6); plot(dend_colored, horiz=T, yaxt="n"); dev.off(); pdf("heatmap.pdf", height=20, width=6); image(scale(t(mydatasort)), col=my.colorFct(), xaxt="n",yaxt="n"); dev.off() # Saves the graphical results to two PDF files: 'pvclust.pdf' and 'heatmap.pdf'.
 Compare PAM (K-means) with hierarchical clustering
library(cluster) # Loads required library.
mydist <- t(scale(t(mydata))) # Centers and scales the data.
mydist <- as.dist(1-cor(t(mydist), method="pearson")) # Generates a distance matrix using Pearson correlation as distance method.
pamy <- pam(mydist, max(mycl)) # Clusters the distance matrix into as many clusters as obtained by the tree cutting step (6).
mycolkm <- sample(rainbow(256)); mycolkm <- mycolkm[as.vector(pamy$clustering)]; heatmap(mydata, Rowv=dend_colored, Colv=as.dendrogram(hc), col=my.colorFct(), scale="row", RowSideColors=mycolkm) # Compares the PAM clustering results with the hierarchical clustering by labeling them in the heatmap color bar.
pdf("pam.pdf", height=20, width=20); heatmap(mydata, Rowv=dend_colored, Colv=as.dendrogram(hc), col=my.colorFct(), scale="row", RowSideColors=mycolkm); dev.off() # Saves the graphical results to the PDF file: 'pam.pdf'.
 Compare SOM with hierarchical clustering
library(som) # Loads required library.
y <- t(scale(t(mydata))) # Centers and scales the data.
y.som < som(y, xdim = 2, ydim = 3, topol = "hexa", neigh = "gaussian") # Performs SOM clustering.
plot(y.som) # Plots results.
pdf("som.pdf"); plot(y.som); dev.off() # Save plot to PDF: 'som.pdf'.
somclid <- as.numeric(paste(y.som$visual[,1], y.som$visual[,2], sep=""))+1 # Returns SOM cluster assignment in order of input data.
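The paste() construct above concatenates the two SOM grid coordinates of each row into a single cluster id. A tiny base-R sketch on toy coordinates (not taken from the dataset above) shows what it produces; note that the resulting ids are unique but not necessarily consecutive:

```r
xy <- data.frame(x=c(0, 0, 1, 1), y=c(0, 1, 0, 1))  # toy SOM grid coordinates
id <- as.numeric(paste(xy$x, xy$y, sep="")) + 1     # "00"->1, "01"->2, "10"->11, "11"->12
id
```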
mycolsom <- sample(rainbow(256)); mycolsom <- mycolsom[somclid]; heatmap(mydata,
Rowv=dend_colored, Colv=as.dendrogram(hc), col=my.colorFct(),
scale="row", RowSideColors=mycolsom) # Compare SOM clustering results with hierarchical clustering by labeling it in heatmap color bar.
pdf("somhc.pdf", height=20,
width=20); heatmap(mydata, Rowv=dend_colored, Colv=as.dendrogram(hc),
col=my.colorFct(), scale="row", RowSideColors=mycolsom); dev.off() # Save graphical results to PDF file: 'somhc.pdf'.
 Compare PCA with SOM
pca <- prcomp(mydata, scale=T) # Performs principal component analysis after scaling the data.
summary(pca) # Prints variance summary for all principal components.
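The variance summary can also be accessed programmatically via the importance matrix returned by summary(). A small self-contained sketch on simulated data (variable names are illustrative):

```r
set.seed(1)
m <- matrix(rnorm(200), nrow=20)     # toy data: 20 rows, 10 columns
pca <- prcomp(m, scale.=TRUE)        # PCA on scaled data
imp <- summary(pca)$importance       # rows: sd, proportion of variance, cumulative proportion
imp["Proportion of Variance", 1:3]   # variance explained by the first three components
```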
library(scatterplot3d) # Loads 3D library.
scatterplot3d(pca$x[,1:3], pch=20, color=mycolsom) # Plots PCA result in 3D. The SOM clusters are highlighted in their color.
pdf("pcasom.pdf"); scatterplot3d(pca$x[,1:3], pch=20, color=mycolsom); dev.off() # Saves PCA plot in PDF format: 'pcasom.pdf'.
 Compare MDS with HC, SOM and K-means
loc <- cmdscale(mydist, k = 3) # Performs MDS analysis and returns results for three dimensions.
x11(height=8, width=8, pointsize=12); par(mfrow=c(2,2)) # Sets plotting parameters.
plot(loc[,1:2], pch=20, col=mycolsom, main="MDS vs SOM 2D") # Plots MDS-SOM comparison in 2D. The SOM clusters are highlighted in their color.
scatterplot3d(loc, pch=20, color=mycolsom, main="MDS vs SOM 3D") # Plots MDS-SOM comparison in 3D.
scatterplot3d(loc, pch=20, color=mycolhc, main="MDS vs HC 3D") # Plots MDS-HC comparison.
scatterplot3d(loc, pch=20, color=mycolkm, main="MDS vs KM 3D") # Plots MDS-KM comparison.
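To illustrate cmdscale() in isolation, the following self-contained sketch on simulated data (variable names are illustrative) checks how well the three-dimensional embedding preserves the original distances:

```r
set.seed(1)
m <- matrix(rnorm(40), nrow=10)           # 10 points in 4 dimensions
d <- dist(m)                              # Euclidean distance matrix
loc <- cmdscale(d, k=3)                   # classical MDS into 3 dimensions
cor(as.vector(d), as.vector(dist(loc)))   # close to 1 when distances are well preserved
```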
 Fuzzy clustering
library(cluster) # Loads cluster library.
fannyy <- fanny(mydist, k=max(mycl), memb.exp=1.5); round(fannyy$membership, 2); fannyy$clustering # Performs fuzzy clustering with as many membership coefficients as clusters obtained by the tree cutting step in HC. The hard clustering results are provided in the 'clustering' slot.
fannyyMA <- round(fannyy$membership, 2) > 0.3; apply(fannyyMA, 1, which) # Returns multiple cluster memberships for coefficients above a certain value (here >0.3).
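A self-contained toy example of fanny() on simulated data ('cluster' ships with R; variable names are illustrative), showing that the membership coefficients of each row sum to one:

```r
library(cluster)
set.seed(1)
m <- matrix(rnorm(40), nrow=10)           # toy data: 10 rows, 4 columns
fy <- fanny(dist(m), k=2, memb.exp=1.5)   # fuzzy clustering into 2 clusters
round(fy$membership, 2)                   # membership coefficients per row
fy$clustering                             # hard cluster assignment
```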
 Biclustering
Administration
 Installation of R
Download and install R for your operating system from CRAN.
R interfaces: RStudio (additional options)
 Installation of BioConductor Packages
The latest instructions for installing BioConductor packages are available on the BioC Installation
page. Only the essential steps are given here. To install BioConductor
packages, execute from the R console the following commands:
source("http://bioconductor.org/biocLite.R") # Sources the biocLite.R installation script.
biocLite() # Installs the biocLite.R default set of BioConductor packages.
biocLite(c("pkg1", "pkg2")) # Command to install additional packages from BioC.
update.packages(repos=biocinstallRepos(), ask=FALSE) # Bioconductor packages, especially those in the development branch, are updated fairly regularly. To update all of your installed packages, run this command after sourcing 'biocLite.R'.
 Installation of CRAN Packages (Task View)
To install CRAN packages, execute from the R console the following command:
install.packages(c("pkg1", "pkg2")) # The selected package name(s) need to be provided within the quotation marks.
Alternatively, a package can be downloaded and then installed from the compressed file like this:
install.packages("pkg.zip", repos=NULL)
List all packages installed on a system:
row.names(installed.packages())
On UNIX-based systems one can also execute the following shell command (not from R):
$ R CMD INSTALL pkg_0.1.tar.gz
# This installs packages by default into a system-wide directory. To install packages into a user account directory (not system-wide), one can specify a custom directory with the argument '-l my_lib_dir'. A locally installed library can be loaded in R like this: 'library(my_lib, lib.loc="/home/user/my_lib_dir/")'. To manage libraries by default in a custom directory, one can generate a '.Renviron' file containing the line: R_LIBS=/your_home_dir/Rlib
