R & Bioconductor Manual
R Basics
Introduction
General Overview
R (http://cran.at.rproject.org)
is a comprehensive statistical environment and programming language for
professional data analysis and graphical display. The associated Bioconductor
project provides many additional R packages for statistical data
analysis in different life science areas, such as tools for microarray, next generation
sequence and genome analysis. The R software is free and runs on all common operating systems.
Scope of this Manual
This R tutorial provides a condensed introduction into the usage of
the R environment and its utilities for general data analysis and
clustering. It also introduces a subset of packages from the
Bioconductor project. The included packages are a 'personal selection'
of the author of this manual that does not reflect the full utility
specturm of the R/Bioconductor projects. Many packages were chosen,
because the author uses them often for his own teaching and research. To
obtain a broad overview of available R packages, it is strongly
recommended to consult the official Bioconductor and R
project sites. Due to the rapid development of most packages, it is
also important to be aware that this manual will often not be fully
uptodate. Because of this and many other reasons, it is absolutely
critical to use the original documentation of each package (PDF manual
or vignette) as primary source of documentation. Users are welcome to
send suggestions for improving this manual directly to its author.
Format of this Manual
A not always very easy to read, but practical copy & paste
format has been chosen throughout this manual. In this format all
commands are represented in code boxes, where the comments are given in blue color. To save space, often several commands are
concatenated on one line and separated with a semicolon ';'. All comments/explanations start with the standard comment sign '#' to prevent them
from being interpreted by R as commands. This way several commands can
be pasted with their comment text into the R console to demo the
different functions and analysis steps. Commands starting with a '$'
sign need to be executed from a Unix or Linux shell. Windows users can
simply ignore them. Commands highlighted in red color are considered essential knowledge. They are important for someone interested in a quick start with R and
Bioconductor. Where relevant, the output generated by R is given in green color.
Installation of the R Software and R Packages  The installation instructions are provided in the Administrative Section of this manual.
 R working environments with syntax highlighting support and utilities to send code to the R console:
R Projects and Interfaces
Basic R Usage
$ R # Starts the R console under Unix/Linux. The R GUI versions under
Windows and Mac OS X can be started by doubleclicking their icons.
object
< function(arguments) # This general R command syntax uses the
assignment operator '<' (or '=') to assign data generated by a
command to its right to object on its left.
object =
function(arguments) # A more recently introduced assignment operator is
'='. Both of them work the same way and in both directions. For
consistency reasons one should use only one of them.
assign("x", function(arguments)) # Has the same effect, but uses the assignment function instead of the assignment operator.
source("my_script")
# Command to execute an R script, here 'my_script'. For example,
generate a text file 'my_script' with the command 'print(1:100)', then
execute it with the source function.
x < edit(data.frame()) # Starts empty GUI spreadsheet editor for manual data entry.
x < edit(x) # Opens existing data frame (table) 'x' in GUI spreadsheet editor.
x < scan(w="c") # Lets you enter values from the keyboard or by copy&paste and assigns them to vector 'x'.
q() # Quits R console.
R Startup Behavior
The R environment is controlled by hidden files in the startup directory: .RData, .Rhistory and .Rprofile (optional)
Finding Help
Various online manuals are available on the R project site. Very useful manuals for beginners are:
Documentation within R can be found with the following commands
?function # or 'help(function)' opens documentation on a function
apropos(function) # Finds all functions containing a given term.
example(heatmap) # executes examples for function 'heatmap'
help.search("topic") # searches help system for documentation
RSiteSearch('regression', restrict='functions', matchesPerPage=100)
# Searches for key words or phrases in the Rhelp mailing list
archives, help pages, vignettes or task views, using the search engine
# at http://search.rproject.org and view them in a web browser.
help.start()
# Starts local HTML interface. The link 'Packages' provides a list of
all installed packages. After initiating 'start.help()' in a session the
'?function' commands will open as HTML pages.
sessionInfo() # Prints
version information about R and all loaded packages. The generated
output should be provided when sending questions or bug reports to the R
and BioC mailing lists.
$ R h # or 'R help'; provides help on R environment, more detailed information on page 90 of 'An Introduction to R'

Basics on Functions and Packages
R contains most arithmetic functions like mean, median, sum, prod,
sqrt, length, log, etc. An extensive list of R functions can be found on
the function and variable index page.
Many R functions and datasets are stored in separate packages, which
are only available after loading them into an R session. Information
about installing new packages can be found in the administrative section of this manual.
Loading of libraries/packages
library() # Shows all libraries available on a system.
library(my_package) # Loads a particular package.
library(help=mypackage) # Lists all functions/objects of a package. (v < vignette("mypackage")) # Opens vignette of a package. Stangle(v$file) # Writes code chunks of vignette to a file named mypackage.R for loading into R IDE (e.g. RStudio).
search() # Lists which packages are currently loaded.
Information and management of objects
ls() or objects() # Lists R objects created during session, they are
stored in file '.RData' when exiting R and the workspace is saved.
rm(my_object1, my_object2, ...) # Removes objects.
rm(list = ls()) # Removes all objects without warning!
str(object) # Displays object types and structure of an R object.
ls.str(pattern="") # Lists object type info on all objects in a session.
print(ls.str(), max.level=0) # If a session contains very long list objects then one can simplify the output with this command.
lsf.str(pattern="") # Lists object type info on all functions in a session.
class(object) # Prints the object type.
mode(object) # Prints the storage mode of an object.
summary(object) # Generic summary info for all kinds of objects.
attributes(object) # Returns an object's attribute list.
gc()
# Causes the garbage collection to take place. This is sometimes useful
to clean up memory allocations after deleting large objects.
length(object) # Provides length of object.
.Last.value # Prints the value of the last evaluated expression.
 Reading and changing directories
dir() # Reads content of current working directory.
getwd() # Returns current working directory.
setwd("/home/user") # Changes current working directory to the specified directory.
System commands under Linux
R IN/OUTPUT & BATCH Mode
One can redirect R input and output with ' ', ' >' and ' <' from the Shell command line. More details on this topic can be found here.
$ R slave < my_infile > my_outfile # The argument 'slave'
makes R run as 'quietly' as possible. This option is intended to support
programs which use R to compute results for them.
# For example,
if my_infile contains 'x < c(1:100); x;' the result for this R
expression will be written to 'my_outfile' (or STDOUT).
$ R CMD BATCH [options] my_script.R [outfile]
# Syntax for running R programs in BATCH mode from the commandline.
The output file lists the commands from the script file and their
outputs. If no outfile is specified, the name used
# is that of
'infile' and '.Rout' is appended to outfile. To stop all the usual R
command line information from being written to the outfile, add this as
first line to my_script.R file:
# 'options(echo=FALSE)'. If the
command is run like this 'R CMD BATCH nosave my_script.R', then
nothing will be saved in the .Rdata file which can get often very large.
More on this
# can be found on the help pages: '$ R CMD BATCH help' or '> ?BATCH'.
$ echo 'sqrt(3)'  R slave # calculates sqrt of 3 in R and prints it to STDOUT.
Executing Shell & Perl commands from R with system() function.
Remember, single escapes (e.g. '\n') need to be double escaped in R (e.g. '\\n')
system("ls al") # Prints content of current working directory.
system("perl
ne 'print if (/my_pattern1/ ? ($c=1) : ($c > 0)); print if
(/my_pattern2/ ? ($d = 1) : ($d > 0));' my_infile.txt >
my_outfile.txt")
# Runs Perl oneliner that extracts two patterns from external text file and writes result into new file.
Reading and Writing Data from/to Files
Import from files
read.delim("clipboard", header=T) # Command to copy&paste tables from
Excel or other programs into R. If the 'header' argument is set to
FALSE, then the first line of the data set will not be used as column
titles.
read.delim(pipe("pbpaste")) # Command to copy & paste on Mac OS X systems.
file.show("my_file") # Prints content of file to screen, allows scrolling.
scan("my_file") # Reads vector/array into vector from file or keyboard.
my_frame < read.table(file="my_table") # Reads in table and assigns it to data frame.
my_frame < read.table(file="my_table", header=TRUE, sep="\t")
# Same as above, but with info on column headers and field separators.
If you want to import the data in character mode, then include this
argument: colClasses = "character".
my_frame <
read.delim("ftp://ftp.arabidopsis.org/home/tair/Microarrays/Affymetrix/affy_ATH1_array_elements20101220.txt",
na.strings = "", fill=TRUE, header=T, check.names=F, sep="\t")
# The function
read.delim() is often more flexible for importing tables with empty
fields and long character strings (e.g. gene descriptions).
cat(month.name,
file="zzz.txt", sep="\n"); x < readLines("zzz.txt"); x <
x[c(grep("^J", as.character(x), perl = TRUE))];
t(as.data.frame(strsplit(x,"u")))
# A somewhat more advanced
example for retrieving specific lines from an external file with a
regular expression. In this example an external file is created with the
'cat'
# function, all lines of this file are imported into a
vector with 'readLines', the specific elements (lines) are then retieved
with the 'grep' function, and the resulting lines are
# split into subfields with 'strsplit'.
Export to files
write.table(iris, "clipboard", sep="\t", col.names=NA, quote=F)
# Command to copy&paste from R into Excel or other programs. It
writes the data of an R data frame object into the clipbroard from where
it can be pasted into other applications.
zz < pipe('pbcopy',
'w'); write.table(iris, zz, sep="\t", col.names=NA, quote=F); close(zz) #
Command to copy & paste from R into Excel or other programs on Mac OS
X systems.
write.table(my_frame, file="my_file", sep="\t", col.names
= NA)
# Writes data frame to a tabdelimited text file. The argument
'col.names = NA' makes sure that the titles align with columns when
row/index names are exported (default).
save(x, file="my_file.txt");
load(file="file.txt") # Commands to save R object to an external file
and to read it in again from this file.
files < list.files(pattern=".txtquot;);
for(i in files) { x < read.table(i, header=TRUE, row.names=1,
comment.char = "A", sep="\t"); assign(print(i, quote=FALSE), x);
write.table(x, paste(i, c(".out"), sep=""), quote=FALSE, sep="\t", col.names = NA) }
# Batch import and export of many files. First, the *.txt file names in
the current directory are assigned to a list ($ sign is used to anchor
string '*.txt' to end of names).
# Second, the files are imported
onebyone using a for loop where the original names are assigned to the
generated data frames with the 'assign' function. Read ?read.table to
# understand arguments 'row.names=1' and 'comment.char = "A"'. Third,
the data frames are exported using their names for file naming and
appending '*.out'.
HTML(my_frame, file = "my_table.html")
#
Writes data frame to HTML table. Subsequent exports to the same file
will arrange several tables in one HTML document. In order to access
this function one needs to load the library 'R2HTML' first.
# This library is usually not installed by default.
write(x, file="my_file") # Writes matrix data to a file.
sink("My_R_Output") # Redirects all subsequent R output to a file 'My_R_Output' without showing it in the R console anymore.
sink() # Restores normal R output behavior.
dput(myfct, "my_file") # Writes function definition to a file.
Interfacing with Google Docs
R Objects
Data Types
x < c(1, 2, 3); x; is.numeric(x);
as.character(x) # Creates a numeric vector, checks for the data type and
# converts it into a character vector.
 Character data: "a", "b" , "c"
x < c("1", "2", "3"); x; is.character(x); as.numeric(x)
# Creates a character vector, checks for the data type and
# converts it
into a numeric vector.
 Logical data: TRUE, FALSE, TRUE
1:10 < 5 # Returns TRUE where x is < 5.
Object Types in R
 vectors: ordered collection of numeric, character, complex and logical values.
 factors: special type vectors with grouping information of its components
 data frames: two dimensional structures with different data types
 matrices: two dimensional structures with data of same type
 arrays: multidimensional arrays of vectors
 lists: general form of vectors with different types of elements
 functions: piece of code
Naming Rules
 Object, row and column names should not start with a number.
 Avoid spaces in object, row and column names.
 Avoid special characters like '#'.
General Subsetting Rules
 Subsetting syntax:
my_object[row] # Subsetting of one dimensional objects, like vectors and factors.
my_object[row, col] # Subsetting of two dimensional objects, like matrices and data frames.
my_object[row, col, dim] # Subsetting of three dimensional objects, like arrays.
There are three possibilities to subset data objects:
(1) Subsetting by positive or negative index/position numbers
my_object < 1:26; names(my_object) < LETTERS # Creates a vector sample with named elements.
my_object[1:4] # Returns the elements 14.
my_object[c(1:4)] # Excludes elements 14.
(2) Subsetting by same length logical vectors
my_logical < my_object > 10 # Generates a logical vector as example.
my_object[my_logical] # Returns the elements where my_logical contains TRUE values.
(3) Subsetting by field names
my_object[c("B", "K", "M")] # Returns the elements with element titles: B, K and M
Calling a single column or list component by its name with the '$' sign.
iris$Species # Returns the 'Species' column in the sample data frame 'iris'.
iris[,c("Species")] # Has the same effect as the previous step.
Assigning values to object components
zzz < iris[,1:3]; zzz < 0 # Reassignment syntax to create/replace an entire object.
zzz < iris[,1:3]; zzz[,] < 0 # Populates all fields in an object with zeros.
zzz < iris[,1:3]; zzz[zzz < 4] < 0 # Populates only specified fields with zeros
Basic Operators and Calculations
 Comparison operators
 equal: ==
 not equal: !=
 greater/less than: > <
 greater/less than or equal: >= <=
Example:
Logical operators
x < 1:10; y < 10:1 # Creates the sample vectors 'x' and 'y'.
x > y & x > 5 # Returns TRUE where both comparisons return TRUE.
x == y  x != y # Returns TRUE where at least one comparison is TRUE.
!x > y # The '!' sign returns the negation (opposite) of a logical vector.
 Calculations [ Function Index ]
 Four basic arithmetic functions: addition, subtraction, multiplication and division
1 + 1; 1  1; 1 * 1; 1 / 1 # Returns results of basic arithmetic calculations.
x < 1:10; sum(x); mean(x), sd(x); sqrt(x) # Calculates for
the vector x its sum, mean, standard deviation
# and square root. A list
of the basic R functions can be found on the function and variable index
page.
x < 1:10; y < 1:10; x + y # Calculates the sum for each element in the vectors x and y.
apply(iris[,1:3], 1, mean) # Calculates the mean values for the
columns 13 in the sample data frame 'iris'.
# With the argument setting
'1', rowwise iterations are performed and with '2' columnwise
iterations.
tapply(iris[,4], iris$Species, mean) # Calculates the
mean values for the 4th column based on the grouping
# information in the
'Species' column in the 'iris' data frame.
sapply(x, sqrt) # Calculates the square root for each element in the vector x. Generates the same result as 'sqrt(x)'
Regular expressions
R's regular expression utilities work
similar as in other languages. To learn how to use them in R, one can
consult the main help page on this topic with: ?regexp.
?regexp # Opens general help page for regular expression support in R.
month.name[grep("A", month.name)] # The grep function can be used for finding patterns in strings, here letter
# 'A' in vector 'month.name'.
gsub('(i.*a)', '\\1_xxx', iris$Species, perl = TRUE)
# Example for using regular
expressions to substitute a pattern by another one using a back
reference.
# Remember: single escapes '\' need to be double escaped '\\'
in R.
Vectors
General information
Vectors are ordered collection of 'atomic' (same data type)
components or modes of the following four types: numeric, character,
complex and logical. Missing values are indicated by 'NA'. R inserts
them automatically in blank fields.
x < c(2.3, 3.3, 4.4) # Example for creating a numeric vector with ordered collection of numbers using the c() function.
z < scan(file="my_file") # Reads data linewise from file and assigns them to vector 'z'.
vector[rows] # Syntax to access vector sections.
z < 1:10; z; as.character(z) # The function 'as.character' changes the data mode from numeric to character.
as.numeric(character) # The function 'as.numeric' changes the data mode from character to numeric.
d < as.integer(x); d # Transforms numeric data into integers.
x < 1:100; sample(x, 5) # Selects a random sample of size of 5 from a vector.
x < as.integer(runif(100, min=1, max=5)); sort(x); rev(sort(x)); order(x); x[order(x)]
# Generates random integers from 1 to 4. The sort() function sorts the
items by size. The rev() function reverses the order. The order()
function returns the corresponding
# indices for a sorted object.
The order() function is usually the one that needs to be used for
sorting complex objects, such as data frames or lists.
x < rep(1:10, times=2); x; unique(x) # The unique() function removes the duplicated entries in a vector.
sample(1:10,
5, replace=TRUE) # Returns a set of randomly selected elements from a
vector (here 5 numbers from 1 to 10) using either with or without
replacement.
Sequences
R has several facilities to create sequences of numbers:
1:30 # Generates a sequence of integers from 1 to 30.
letters;
LETTERS; month.name; month.abb # Generates lower case letters, capital
letters, month names and abbreviated month names, respectively.
2*1:10 # Creates a sequence of even numbers.
seq(1, 30, by=0.5) # Same as before, but with 0.5 increments.
seq(length=100, from=20, by=0.5) # Creates number sequence with specified start and length.
rep(LETTERS[1:8], times=5) # Replicates given sequence or vector x times.
Character Vectors
paste(LETTERS[1:8], 1:12, sep="") # The command 'paste' merges vectors after converting to characters.
x
< paste(rep("A", times=12), 1:12, sep=""); y < paste(rep("B",
times=12), 1:12, sep=""); append(x,y) # Possibility to build plate
location vector in R (better example under 'arrays').
Subsetting Vectors
x < 1:100; x[2:23] # Values in square brackets select vector range.
x < 1:100; x[(2:23)] # Prints everything except the values in square brackets.
x[5] < 99 # Replaces value at position 5 with '99'.
x < 1:10; y < c(x[1:5],99,x[6:10]); y # Inserts new value at defined position of vector.
letters=="c" # Returns logical vector of "FALSE" and "TRUE" strings.
which(rep(letters,2)=="c")
# Returns index numbers where "c" occurs in the 'letters' vector. For
retrieving indices
# of several strings provided by query vector, use the
following 'match' function.
match(c("c","g"), letters)
# Returns
index numbers for "c" and "g" in the 'letters' vector. If the query
vector (here 'c("c","g")') contains entries
# that are duplicated in the
target vector, then this syntax returns only the first
occurence(s) for each duplicate.
# To retrieve the indices for all
duplicates, use the following '%in%' function.
x < rep(1:10, 2); y < c(2,4,6); x %in% y
# The function '%in%' returns a logical vector. This syntax allows the
subsetting of vectors and data frames with a
# query vector ('y')
containing entries that are duplicated in the target
# vector ('x'). The resulting logical vector can be used for the actual subsetting step of vectors and data frames.
Finding Identical and NonIdentical Entries between Vectors
intersect(month.name[1:4], month.name[3:7]) # Returns identical entries of two vectors.
month.name[month.name %in% month.name[3:7]]
# Returns the same result as in the
previous step. The vector comparison with %in% returns first a logical
vector of
# identical items that is then used to subset the first vector.
setdiff(x=month.name[1:4], y=month.name[3:7]); setdiff(month.name[3:7], month.name[1:4])
# Returns the unique entries
occuring only in the first vector. Note: if the argument names are not
used, as in the
# second example, then the order of the arguments is important.
union(month.name[1:4], month.name[3:7]) # Joins two vectors without duplicating identical entries.
x < c(month.name[1:4], month.name[3:7]); x[duplicated(x)] # Returns duplicated entries
Factors
Factors are vector objects that contain grouping (classification) information of its components
animalf < factor(c("dog", "cat", "mouse", "dog", "dog", "cat")) # Creates factor 'animalf' from vector.
animalf # Prints out factor 'animalf',
this lists first all components and then the different levels (unique
entries); alternatively one can print only levels with
'levels(animalf)'.
table(animalf) # Creates frequency table for levels.
Function tapply applies calculation on all members of a level
weight < c(102, 50, 5, 101, 103, 52) # Creates new vector with weight values for 'animalf' (both need to have same length).
mean
< tapply(weight, animalf, mean) # Applies function (length, mean,
median, sum, sterr, etc) to all level values; 'length'
# provides the
number of entries (replicates) in each level.
Function cut divides a numeric vector into size intervals
y < 1:200; interval < cut(y, right=F, breaks=c(1, 2, 6, 11, 21,
51, 101, length(y)+1), labels=c("1","25","610", "1120", "2150",
"51100", ">=101")); table(interval)
# Prints the counts for the specified size intervals (beaks) in the numeric vector: 1:200.
plot(interval,
ylim=c(0,110), xlab="Intervals", ylab="Count", col="green");
text(labels=as.character(table(interval)), x=seq(0.7, 8, by=1.2),
y=as.vector(table(interval))+2)
# Plots the size interval counts as bar diagram.
Matrices and Arrays
Matrices are two dimensional data objects consisting of rows and
columns. Arrays are similar, but they can have one, two or more
dimensions. In contrast to data frames (see below), one can store only a
single data type in the same object ( e.g. numeric or character).
x < matrix(1:30, 3, 10, byrow = T) # Lays out vector
(1:30) in 3 by 10 matrix. The argument 'byrow' defines whether the
matrix
# is filled by row or columns.
dim(x) < c(3,5,2) # Transforms above matrix into multidimensional array.
x < array(1:25, dim=c(5,5)) # Creates 5 by 5 array ('x') and fills it with values 125.
y < c(x) # Writes array into vector.
x[c(1:5),3] # Writes values from 3rd column into vector structure.
mean(x[c(1:5),3]) # Calculates mean of 3rd column.
Subsetting matrices and arrays
array[rows, columns] # Syntax to access columns and rows of matrices and twodimensional arrays.
as.matrix(iris)
# Many functions in R require matrices as input. If something doesn't
work then try to convert the object into a
# matrix with the as.matrix()
function.
i < array(c(1:5,5:1),dim=c(3,2)) # Creates 5 by 2 index array ('i') and fills it with the values 15, 51.
x[i] # Extracts the corresponding elements from 'x' that are indexed by 'i'.
x[i] < 0 # Replaces those elements by zeros.
array1
< array(scan(file="my_array_file", sep="\t"), c(4,3)) # Reads data
from 'my_array_file' and writes it into 4 by 3 array.
Subsetting arrays with more than two dimensions
array[rows, columns, dimensions] # Syntax to access columns, rows and dimensions in arrays with more than two dimensions.
x < array(1:250, dim=c(10,5,5)); x[2:5,3,] # Example to generate 10x5x5 array and how to retrieve slices from all subarrays.
Merging arrays: example for building location tables for microtiter plates
Z
< array(1:12, dim=c(12,8)); X <
array(c("A","B","C","D","E","F","G","H"), dim=c(8,12)); Y <
paste(t(X), Z, sep=""); Y
# Creates array with 12 AxHx columns.
M < array(Y, c(96,1)) # Writes 12 AxHx columns into one.
Script for mapping 24/48/96 to 384 well plates
Script for mapping 384 to 1536 well plates
Appending arrays and matrices
cbind(matrix1, matrix2) # Appends columns of matrices with same number of rows.
rbind(matrix1, matrix2) # Appends rows of matrices with same number of columns.
c(array1, array2) # Clears dimension
attributes from arrays and concatenates them in vector format. The
function
#
'as.vector(my_array1)' achieves similar result.
Calculations between arrays
Z < array(1:12, dim=c(12,8)); X < array(12:1, dim=c(12,8)) # Creates arrays Z and X with same dimensions.
calarray < Z/X # Divides Z/X elements and assigns result to array 'calarray'.
t(my_array) # Transposes 'my_array'; a more flexible transpose function is 'aperm(my_array, perm)'.
Information about arrays
dim(X); nrow(X); ncol(X) # Returns number of rows and columns of array 'X'.

Data Frames
Data frames are two dimensional data objects that are composed of
rows and columns. They are very similar to matrices. The main difference
is that data frames can store different data types, whereas matrices
allow only one data type (e.g. numeric or character).
Constructing data frames
my_frame < data.frame(y1=rnorm(12), y2=rnorm(12), y3=rnorm(12), y4=rnorm(12)) # Creates data frame with vectors 112 and 121.
rownames(my_frame) < month.name[1:12] # Assigns row (index) names. These names need to be unique.
names(my_frame) < c("y4", "y3", "y2", "y1") # Assigns new column titles.
names(my_frame)[c(1,2)] < c("y3", "y4") # Changes titles of specific columns.
my_frame < data.frame(IND=row.names(my_frame), my_frame) # Generates new column with title "IND" containing the row names.
my_frame[,2:5]; my_frame[,1] # Different possibilities to remove column(s) from a data frame.
Accessing and slicing data frame sections
my_frame[rows, columns] # Generic syntax to access columns and rows in data frames.
dim(my_frame) # Gives dimensions of data frame.
length(my_frame); length(my_frame$y1) # Provides number of columns or rows of data frame, respectively.
colnames(my_frame); rownames(my_frame) # Gives column and row names of data frame.
row.names(my_frame) # Prints row names or indexing column of data frame.
my_frame[order(my_frame$y2, decreasing=TRUE), ] # Sorts the rows of a data frame by the specified columns, here 'y2';
# for increasing order use 'decreasing=FALSE'.
my_frame[order(my_frame[,4], my_frame[,3]),] # Subsequent subsorts can be performed by specifying additional columns. By adding a "" sign one can reverse the sort order.
my_frame$y1 # Notation to print entire column of a data frame as vector or factor.
my_frame$y1[2:4] # Notation to access column element(s) of a data frame.
v <my_frame[,4]; v[3] # Notation for returning the value of an
individual cell. In this example the corresponding
# column is first
assigned to a vector and then the desired field is accessed by its index number.
my_frame[1:5,] # Notation to view only the first five rows of all columns.
my_frame[, 1:2] # Notation to view all rows of the first two columns.
my_frame[, c(1,3)] # Notation to view all rows of the specified columns.
my_frame[1:5,1:2] # Notation to view only the first five rows of the columns 12.
my_frame["August",] # Notation to retrieve row values by their index name (here "August").
x < data.frame(row.names=LETTERS[1:10], letter=letters[1:10], Month=month.name[1:10]); x; match(c("c","g"), x[,1])
# Returns matching index numbers of data frame or vector using 'match' function.
# This syntax returns for duplicates only the index of their first occurence. To return all, use the following syntax.
x[x[, 2] %in% month.name[3:7],] # Subsets a data frame with a query vector using the '%in%' function. This returns all occurences of duplicates.
as.vector(as.matrix(my_frame[1,])) # Returns one row entry of a data frame as vector.
as.data.frame(my_list) # Returns a list object as data frame if the list components can be converted into a two dimensional object.
Calculations on data frames
summary(my_frame) # Prints a handy summary for a data frame.
mean(my_frame) # Calculates the mean for all columns.
data.frame(my_frame, mean=apply(my_frame[,2:5], 1, mean), ratio=(my_frame[,2]/my_frame[,3]))
# The apply function performs
rowwise or columnwise calculations on data frames or matrices (here
mean and ratios). The results are returned as vectors.
# In this example, they are appended to the original data frame with
the data.frame function. The argument '1' in the apply function
specifies rowwise calculations.
# If '2' is selected, then the calculations are performed columnwise.
aggregate(my_frame, by=list(c("G1","G1","G1","G1","G2","G2","G2","G2","G3","G3","G3","G4")), FUN=mean)
# The aggregate function can be used to compute the mean (or any other stats) for data groups specified under the argument 'by'.
cor(my_frame[,2:4]); cor(t(my_frame[,2:4])) # Syntax to calculate a
correlation matrix based on allagainstall rows and allagainstall
columns.
x < matrix(rnorm(48), 12, 4, dimnames=list(month.name, paste("t",
1:4, sep=""))); corV < cor(x["August",], t(x), method="pearson"); y
< cbind(x, correl=corV[1,]); y[order(y[,5]), ]
# Commands to perform a correlation
search and ranking across all rows (matrix object required here). First,
an example matrix 'x' is created. Second, the
# correlation values for the "August" row against all other rows are
calculated. Finally, the resulting vector with the correlation values is
merged with the
# original matrix 'x' and sorted by decreasing correlation values.
merge(frame1, frame2, by.x = "frame1col_name", by.y = "frame2col_name", all = TRUE)
# Merges two data frames (tables) by
common field entries. To obtain only the common rows, change 'all =
TRUE' to 'all = FALSE'. To merge on row names (indices),
# refer to it with "row.names" or '0', e.g.: 'by.x = "row.names", by.y = "row.names"'.
my_frame1 < data.frame(title1=month.name[1:8], title2=1:8);
my_frame2 < data.frame(title1=month.name[4:12], title2=4:12);
merge(my_frame1, my_frame2, by.x = "title1", by.y = "title1", all =
TRUE)
# Example for merging two data frames by common field.
Fast computations on large data frames and matrices
myDF < as.data.frame(matrix(rnorm(100000), 10000, 10)) # Creates an example data frame.
myCol < c(1,1,1,2,2,2,3,3,4,4); myDFmean < t(aggregate(t(myDF), by=list(myCol), FUN=mean, na.rm=T)[,1])
colnames(myDFmean) < tapply(names(myDF), myCol, paste, collapse="_"); myDFmean[1:4,]
# The aggregate function can be used
to perform calculations (here mean) across rows for any combination of
column selections (here myCol) after transposing the
# data frame. However, this will be very slow for data frames with millions of rows.
myList < tapply(colnames(myDF), c(1,1,1,2,2,2,3,3,4,4), list)
names(myList) < sapply(myList, paste, collapse="_"); myDFmean <
sapply(myList, function(x) mean(as.data.frame(t(myDF[,x]))));
myDFmean[1:4,]
# Faster alternative for performing
the aggregate computation of the previous step. However, this step will
be still very slow for very large
# data sets, due to the sapply loop over the row elements.
myList < tapply(colnames(myDF), c(1,1,1,2,2,2,3,3,4,4), list)
myDFmean < sapply(myList, function(x) rowSums(myDF[,x])/length(x));
colnames(myDFmean) < sapply(myList, paste, collapse="_")
myDFmean[1:4,] # By using the rowSums
or rowMeans functions one can perform the above aggregate computations
1001000 times faster by avoiding a loop over the rows altogether.
myDFsd < sqrt((rowSums((myDFrowMeans(myDF))^2)) / (length(myDF)1)); myDFsd[1:4]
# Similarly, one can compute the standard deviation for large data frames by avoiding loops.
# This approach is about 100 times faster than the loopbased alternatives: sd(t(myDF)) or apply(myDF, 1, sd).
Table of Contents
Conditional selections
my_frame[!duplicated(my_frame[,2]),] # Removes rows with duplicated values in selected column.
my_frame[my_frame$y2 > my_frame$y3,]
# Prints all rows of data frame
where values of col1 > col2. Comparison operators are: == (equal), !=
(not equal), >= (greater than or equal), etc. Logical operators are
& (and),  (or) and ! (not).
x < 0.5:10; x[x<1.0] < 1/x[x<1.0] # Replaces all values in vector or data frame that are below 1 with their reciprocal value.
x <data.frame(month=month.abb[1:12], AB=LETTERS[1:2], no1=1:48,
no2=1:24); x[x$month == "Apr" & (x$no1 == x$no2  x$no1 >
x$no2),]
# Prints all records of frame 'x'
that contain 'Apr' AND have equal values in columns 'no1' and 'no2' OR
have greater values in column 'no1'.
x[x[,1] %in% c("Jun", "Aug"),] # Retrieves rows with column matches specified in a query vector.
x[c(grep("\\d{2}", as.character(x$no1), perl = TRUE)),] #
Possibility to print out all rows of a data frame where a regular
expression matches (here all double digit values in col 'no1').
x[c(grep("\\d{2}", as.character(for(i in 1:4){x[,i]}), perl = TRUE)),] # Same as above, but searches all columns (14) using a for loop (see below).
z < data.frame(chip1=letters[1:25], chip2=letters[25:1],
chip3=letters[1:25]); z; y < apply(z, 1, function(x) sum(x == "m")
> 2); z[y,]
# Identifies in a data frame ('z') all those rows that contain a certain number of identical fields (here 'm' > 2).
z < data.frame(chip1=1:25, chip2=25:1, chip3=1:25); c <
data.frame(z, count=apply(z[,1:3], 1, FUN < function(x) sum(x >=
5))); c
# Counts in each row of a data frame
the number of fields that are above or below a specified value and
appends this information to the data frame. By default rows with "NA"
values will
# be ignored. To work around this limitation, one can replace the NA
fields with a value that doesn't affect the result, e.g.: x[is.na(x)]
< 1.
x < data.frame(matrix(rep(c("P","A","M"),20),10,5)); x; index < x
== "P"; cbind(x, Pcount=rowSums(index)); x[rowSums(index)>=2,]
# Example of how one can count
occurances of strings across rows. In this example the occurances of "P"
in a data frame of "PMA" values are counted by converting to a data
frame of
# logical values and then counting the 'TRUE' occurences with the
'rowSums' function, which does in this example the same as: 'cbind(x,
count=apply(x=="P",1,sum))'.
Reformatting data frames with reshape.2 and splitting/apply routines with plyr
############## ## reshape2 ## ############## ## Some of these operations are important for plotting routines with the ggplot2 library. ## melt: rbinds many columns
library(reshape2) (iris_mean < aggregate(iris[,1:4], by=list(Species=iris$Species), FUN=mean)) Species Sepal.Length Sepal.Width Petal.Length Petal.Width 1 setosa 5.006 3.428 1.462 0.246 2 versicolor 5.936 2.770 4.260 1.326 3 virginica 6.588 2.974 5.552 2.026
(df_mean < melt(iris_mean, id.vars=c("Species"), variable.name = "Samples")) # See ?melt.data.frame for details. Species Samples value 1 setosa Sepal.Length 5.006 2 versicolor Sepal.Length 5.936 3 virginica Sepal.Length 6.588 4 setosa Sepal.Width 3.428 5 versicolor Sepal.Width 2.770 6 virginica Sepal.Width 2.974 ...
## dcast: cbinds row aggregates (reverses melt result)
dcast(df_mean, formula = Species ~ ...) Species Sepal.Length Sepal.Width Petal.Length Petal.Width 1 setosa 5.006 3.428 1.462 0.246 2 versicolor 5.936 2.770 4.260 1.326 3 virginica 6.588 2.974 5.552 2.026
## colsplit: splits strings into columns
x < c("a_1_4", "a_2_3", "b_2_5", "c_3_9") colsplit(x, "_", c("trt", "time1", "time2")) trt time1 time2 1 a 1 4 2 a 2 3 3 b 2 5 4 c 3 9
########## ## plyr ## ##########
library(plyr) ddply(.data=iris, .variables=c("Species"), mean=mean(Sepal.Length), summarize) Species mean 1 setosa 5.006 2 versicolor 5.936 3 virginica 6.588
ddply(.data=iris, .variables=c("Species"), mean=mean(Sepal.Length), transform) Sepal.Length Sepal.Width Petal.Length Petal.Width Species mean 1 5.1 3.5 1.4 0.2 setosa 5.006 2 4.9 3.0 1.4 0.2 setosa 5.006 3 4.7 3.2 1.3 0.2 setosa 5.006 4 4.6 3.1 1.5 0.2 setosa 5.006 ...
## Usage with parallel package library(parallel); library(doMC); registerDoMC(2) # 2 cores test < ddply(.data=iris, .variables=c("Species"), mean=mean(Sepal.Length), summarize, parallel=TRUE)
Lists
Lists are ordered collections of objects that can be of different
modes (e.g. numeric vector, array, etc.). A name can be assigned to each
list component.
my_list < list(name="Fred", wife="Mary", no.children=3, child.ages=c(4,7,9)) # Example of how to create a list.
attributes(my_list); names(my_list) # Prints attributes and names of all list components.
my_list[2]; my_list[[2]] # Returns
component 2 of list; The [[..]] operator returns the object type stored
in this component, while '[..]' returns a single component list.
my_list$wife # Named components in lists can also be called with their name. Command returns same component as 'my_list[[2]]'.
my_list[[4]][2] # Returns from the fourth list component the second entry.
length(my_list[[4]]) # Prints the length of the fourth list component.
my_list$wife < 1:12 # Replaces the content of the existing component with new content.
my_list$wife < NULL # Deletes list component 'wife'.
my_list < c(my_list, list(my_title2=month.name[1:12])) # Appends new list component to existing list.
my_list < c(my_name1=my_list1, my_name2=my_list2, my_name3=my_list3) # Concatenates lists.
my_list < c(my_title1=my_list[[1]], list(my_title2=month.name[1:12])) # Concatenates existing and new components into a new list.
unlist(my_list); data.frame(unlist(my_list)); matrix(unlist(my_list)); data.frame(my_list) # Useful commands to convert lists into other objects, such as vectors, data frames or matrices.
my_frame < data.frame(y1=rnorm(12),y2=rnorm(12), y3=rnorm(12),
y4=rnorm(12)); my_list < apply(my_frame, 1, list); my_list <
lapply(my_list, unlist); my_list
# Stores the rows of a data frame as separate vectors in a list container.
mylist < list(a=letters[1:10], b=letters[10:1], c=letters[1:3]); lapply(names(mylist), function(x) c(x, mylist[[x]]))
# Syntax to access list component
names with the looping function 'lapply'. In this example the list
component names are prepended to the corresponding vectors.
Table of Contents
Missing Values
Missing values are represented in R data objects by the missing
value place holder ' NA'. The default behavior for many R functions on
data objects with missing values is ' na.fail' which returns the value
' NA'. Several 'na.action' options are available to change this behavior.
x < 1:10; x < x[1:12]; z < data.frame(x,y=12:1) # Creates sample data.frame with 'NA' values
is.na(x) # Finds out which elements of x contain "NA" fields.
na.omit(z) # Removes rows with 'NA' fields in numeric data objects.
x < letters[1:10]; print(x); x < x[1:12]; print(x); x[!is.na(x)] # A similar result can be achieved on character data objects with the function 'is.na()'.
apply(z, 1, mean, na.rm=T) #
Uses available values to perform calculation while ignoring the 'NA'
values; the default 'na.rm=F' returns 'NA' for rows with missing values.
z[is.na(z)] < 0
# Replaces 'NA' values with zeros. To replace the special <NA>
signs in character columns of data frames, convert the data frame with
as.matrix() to a matrix,
# perform the substitution on this level and then convert it back to a data frame with as.data.frame().
Some Great R Functions
This sections contains a small collection of extremely useful R functions.
The unique() function makes vector entries unique:
unique(iris$Sepal.Length); length(unique(iris$Sepal.Length))
# The first step returns the unique
entries for the Sepal.Length column in the iris data set and the second
step
# counts the number of its unique entries.
The table() function counts the occurrence of entries in a vector. It is the most basic "clustering function":
my_counts < table(iris$Sepal.Length, exclude=NULL)[iris$Sepal.Length]; cbind(iris, CLSZ=my_counts)[1:4,]
#
Counts identical entries in the Sepal.Length column of the iris data
set and aligns the counts with the original
# data frame. Note: to count
NAs, set the 'exclude' argument to NULL.
myvec < c("a", "a", "b", "c", NA, NA); table(factor(myvec, levels=c(unique(myvec), "z"), exclude=NULL))
#
Additional count levels can be specified by turning the test vector
into a factor and specifying them with the 'levels' argument.
The combn() function creates all combinations of elements:
labels < paste("Sample", 1:5, sep=""); combn(labels, m=2, FUN=paste, collapse="")
# Example to create all possible pairwise combinations of sample
labels. The number of samples to consider in the
# comparisons can be
controlled with the 'm' argument.
allcomb < lapply(seq(along=labels), function(x) combn(labels, m=x,
simplify=FALSE, FUN=paste, collapse="")); unlist(allcomb)
# Creates all possible combinations of sample labels.
The aggregate() function computes any type of summary statistics of data subsets that are grouped together:
aggregate(iris[,1:4], by=list(iris$Species), FUN=mean, na.rm=T) # Computes the mean (or any other stats) for the data groups
# defined by the 'Species' column in the iris data set.
t(aggregate(t(iris[,1:4]), by=list(c(1,1,2,2)), FUN=mean, na.rm=T)[,1])
# The function can also perform
computations for column aggregates (here: columns 12 and 34) after
transposing the data
# frame. A much faster alternative is given in the data frame section.
The apply() function simplifies the coding of iterative tasks (loops). In the following example, apply() is first used to compute a rowwise calculation on a matrix. In a second example apply() is called by the custom function colAg()
to perform similar calculations, but allowing the user to select any
combination of column aggregates with an easy to handle grouping vector.
myMA < matrix(rnorm(100000), 10000, 10, dimnames=list(1:10000, paste("C", 1:10, sep=""))) # Creates a sample data set.
apply(myMA, 1, mean)[1:4] # Computes the mean for each row in myMA.
source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/colAg.R") # Imports the colAg() function.
colAg(myMA=myMA, group=c(1,1,1,2,2,2,3,3,4,4), myfct=mean)[1:4,]
# Applies a computation given under the myfct argument (here: mean)
for the column aggregates specified under the group argument
# (here: c(1,1,1,2,2,2,3,3,4,4)). The columns in the resulting object
are named after the chosen aggregates. Note: the function can
# only perform those calculations that can be applied to sets of two
or more values, such as mean, sum, sd, min and max. Much faster
# to compute, but less flexible, alternatives are given in the
data frame section .
The %in% function returns the intersect between two vectors. In a subsetting context with ' [ ]', it can be used to intersect matrices, data frames and lists:
my_frame < data.frame(Month=month.name, N=1:12); my_query < c("May", "August") # Generates a sample data frame and a sample query vector.
my_frame[my_frame$Month %in% my_query, ] # Returns the rows in my_frame where the "Months" column and my_query contain identical entries.
The merge() function joins data frames based on a common key column:
frame1 < iris[sample(1:length(iris[,1]), 30), ]
my_result < merge(frame1, iris, by.x = 0, by.y = 0, all = TRUE); dim(my_result)
# Merges two data frames (tables) by common field entries, here row names (by.x=0). To obtain only the common rows,
# change 'all = TRUE' to 'all = FALSE'. To merge on specific columns,
refer to them by their position numbers or their column
# names (e.g.
"Species").
Table of Contents
Introduction to R Graphics
R provides comprehensive graphics utilities for visualizing and
exploring scientific data. In addition, several powerful graphics
environments extend these utilities. These include the grid, lattice and ggplot2
packages. The grid package is part of R's base distribution. It
provides the lowlevel infrastructure for many graphics packages,
including lattice and ggplot2. Extensive information on graphics
utilities in R can be found on the Graphics Task Page, the R Graph Gallery and the R Graphical Manual. Another useful reference for graphics procedures is Paul Murrell's book R Graphics. Interactive graphics in R can be generated with rggobi (GGobi) and iplots.
Base Graphics
The following list provides an overview of some very useful plotting
functions in R's base graphics. To get familiar with their usage, it is
recommended to carefully read their help documentation with ?myfct as well as the help on the functions plot and par.

 plot: generic xy plotting
 barplot: bar plots
 boxplot: boxandwhisker plot
 hist: histograms
 pie: pie charts
 dotchart: cleveland dot plots
 image, heatmap, contour, persp: functions to generate imagelike plots
 qqnorm, qqline, qqplot: distribution comparison plots
 pairs, coplot: display of multivariant data
Lattice [ Manuals: lattice, Intro, book ]
The lattice package developed by Deepayan Sarkar implements in R the
Trellis graphics system from SPlus. The environment greatly simplifies
many complicated highlevel plotting tasks, such as automatically
arranging complex graphical features in one or several plots. The syntax
of the package is similar to R's base graphics; however, highlevel
lattice
functions return an object
of class "trellis", that can be either plotted directly or stored in an
object. The command library(help=lattice) will open a list of all functions available in the lattice package, while ?myfct and example(myfct) can be used to access and/or demo their documentation. Important functions for accessing and changing global parameters are: ?lattice.options and ?trellis.device.
ggplot2 [ Manuals: ggplot2, Docs, Intro and book ]
ggplot2 is another more recently developed graphics system for R, based on the grammar of graphics
theory. The environment streamlines many graphics routines for the user
to generate with minimum effort complex multilayered plots. Its
syntax is centered around the main ggplot function, while the convenience function qplot provides many shortcuts. The ggplot function accepts two arguments: the data set to be plotted and the corresponding aesthetic mappings provided by the aes function. Additional plotting parameters such as geometric objects (e.g. points, lines, bars) are passed on by appending them with ' +'
as separator. For instance, the following command will generate a
scatter plot for the first two columns of the iris data frame: ggplot(iris, aes(iris[,1], iris[,2])) + geom_point(). A list of the available geom_* functions can be found here. The settings of the plotting theme can be accessed with the command theme_get(). Their settings can be changed with the opts()function.
To benefit from the many convenience features built into ggplot2, the
expected input data class is usually a data frame where all labels for
the plot are provided by the column titles and/or grouping factors in
additional column(s).
The following graphics sections
demonstrate how to generate different types of plots first with R's base
graphics device and then with the lattice and ggplot2 packages.

Scatter Plots
Base Graphics
y < as.data.frame(matrix(runif(30), ncol=3, dimnames=list(letters[1:10], LETTERS[1:3]))) # Generates a sample data set.
plot(y[,1], y[,2]) # Plots two vectors (columns) in form of a scatter plot against each other.
plot(y[,1], y[,2], type="n", main="Plot of Labels"); text(y[,1], y[,2], rownames(y)) # Prints instead of symbols the row names.
plot(y[,1], y[,2], pch=20, col="red", main="Plot of Symbols and Labels"); text(y[,1]+0.03, y[,2], rownames(y))
# Plots both, symbols plus their labels.
grid(5, 5, lwd = 2) # Adds a grid to existing plot.
op < par(mar=c(8,8,8,8), bg="lightblue")
plot(y[,1],
y[,2], type="p", col="red", cex.lab=1.2, cex.axis=1.2, cex.main=1.2,
cex.sub=1, lwd=4, pch=20, xlab="x label", ylab="y label", main="My
Main", sub="My Sub")
par(op)
#
The previous commands demonstrates the usage of the most important
plotting parameters. The 'mar' argument specifies the
# margin sizes
around the plotting area
in this order: c(bottom, left, top, right). The color of the plotted
symbols can be
# controlled with the 'col' argument. The plotting symbols
can be selected
with the 'pch' argument, while their size is
# controlled by the 'lwd'
argument. A selection palette for 'pch' plotting symbols can be opened
with the command
'example(points)'.
# As alternative, one can plot any character string
by passing it on to 'pch', e.g.: pch=".". The font sizes of the
different text
# components
can be changed with the 'cex.*' arguments. Please consult the '?par'
documentation for more details on fine tuning R's
# plotting behavior.
plot(y[,1], y[,2]); myline < lm(y[,2]~y[,1], data=y[,1:2]); abline(myline, lwd=2) # Adds a regression line to the plot.
summary(myline) # Prints summary about regression line.
plot(y[,1], y[,2], log="xy") # Generates the same plot as above, but on log scale.
plot(y[,1], y[,2]); text(y[1,1], y[1,2], expression(sum(frac(1,sqrt(x^2*pi)))), cex=1.3) # Adds a mathematical formula to the plot.
plot(y) #
Produces all possible scatter plots for allagainstall columns in a
matrix or a data frame. The column headers of
# the matrix or data frame
are used as axis titles.
pairs(y) # Alternative way to produce all possible scatter plots for allagainstall columns in a matrix or a data frame.
library(scatterplot3d); scatterplot3d(y[,1:3], pch=20, color="red") # Plots a 3D scatter plot for first three columns in y.
library(geneplotter); smoothScatter(y[,1], y[,2]) # Same as above, but generates a smooth scatter plot that shows the density
# of the data points.
Scatter Plot Generated with Base Graphics
lattice
library(lattice)
xyplot(1:10 ~ 1:10) # Simple scatter plot.
xyplot(1:10 ~ 1:10  rep(LETTERS[1:5], each=2), as.table=TRUE) #
Plots subcomponents specified by grouping vector after '' in separate
panels. The argument as.table controls the order of the panels.
myplot < xyplot(Petal.Width ~ Sepal.Width  Species , data = iris); print(myplot) # Assigns plotting function to an object and executes it.
xyplot(Petal.Width ~ Sepal.Width  Species , data = iris, layout = c(3, 1, 1)) # Changes layout of individual plots.
## Change plotting parameters
show.settings() # Shows global plotting parameters in a set of sample plots.
default < trellis.par.get(); mytheme < default; names(mytheme) # Stores the global plotting parameters in list mytheme and prints its component titles.
mytheme["background"][[1]][[2]] < "grey" # Sets background to grey
mytheme["strip.background"][[1]][[2]] < "transparent" # Sets background of title bars to transparent.
trellis.par.set(mytheme) # Sets global parameters to 'mytheme'.
show.settings() # Shows custom settings.
xyplot(1:10 ~ 1:10  rep(LETTERS[1:5], each=2), as.table=TRUE, layout=c(1,5,1), col=c("red", "blue"))
trellis.par.set(default)
Scatter Plot Generated with lattice
ggplot2
library(ggplot2)
ggplot(iris, aes(Sepal.Length, Sepal.Width)) + geom_point() # Plots two vectors (columns) in form of a scatter plot against each other.
ggplot(iris, aes(Sepal.Length, Sepal.Width)) + geom_point(aes(color = Species), size=4) # Plots larger dots and colors them with default color scheme.
ggplot(iris, aes(Sepal.Length, Sepal.Width)) + geom_point(aes(color =
Species), size=4) + ylim(2,4) + xlim(4,8) +
scale_color_manual(values=rainbow(10)) # Colors dots with custom scheme.
ggplot(iris, aes(Sepal.Length, Sepal.Width, label=1:150)) + geom_text() + opts(title = "Plot of Labels") # Print instead of symbols a set of custom labels and adds a title to the plot.
ggplot(iris, aes(Sepal.Length, Sepal.Width, label=1:150)) + geom_point() + geom_text(hjust=0.5, vjust=0.5) # Prints both symbols and labels.
ggplot(iris, aes(Sepal.Length, Sepal.Width)) + geom_point() +
opts(panel.background=theme_rect(fill = "white", colour = "black")) # Changes background color to white.
ggplot(iris, aes(Sepal.Length, Sepal.Width)) + geom_point() + stat_smooth(method="lm", se=FALSE) # Adds a regression line to the plot.
ggplot(iris, aes(Sepal.Length, Sepal.Width)) + geom_point() + coord_trans(x = "log2", y = "log2") # Plots axes in log scale.
Scatter Plot Generated with ggplot2

Line Plots
Base Graphics
y < as.data.frame(matrix(runif(30), ncol=3, dimnames=list(letters[1:10], LETTERS[1:3]))) # Generates a sample data set.
plot(y[,1], type="l", lwd=2, col="blue") # Plots a line graph instead.
split.screen(c(1,1))
plot(y[,1], ylim=c(0,1), xlab="Measurement", ylab="Intensity", type="l", lwd=2, col=1)
for(i in 2:length(y[1,])) screen(1, new=FALSE); plot(y[,i], ylim=c(0,1),
type="l", lwd=2, col=i, xaxt="n", yaxt="n", ylab="", xlab="", main="",
bty="n")
close.screen(all=TRUE)
# Plots a line graph for all columns
in data frame 'y'. The 'split.screen()' function is used in this
example in a for loop
# to overlay several line
graphs in the same plot. More details on this topic are provided in the 'Arranging Plots' section.
# A very nice line plot function for time series data is available in the Mfuzz library.
Line Plot Generated with Base Graphics
lattice
xyplot(Sepal.Length ~ Sepal.Width  Species, data=iris, type="a", layout=c(1,3,1))
parallel(~iris[1:4]  Species, iris) # Plots data for each species in iris data set in separate line plot.
parallel(~iris[1:4]  Species, iris, horizontal.axis = FALSE, layout = c(1, 3, 1)) # Changes layout of plot.
Scatter Plot Generated with lattice
ggplot2
ggplot(iris, aes(Sepal.Length, Sepal.Width)) + geom_line(aes(color=Species), size=1) # Plots lines for three samples in the 'Species' column in a single plot.
ggplot(iris, aes(Sepal.Length, Sepal.Width)) + geom_line(aes(color=Species), size=1) + facet_wrap(~Species, ncol=1) # Plots three line plots, one for each sample in Species column.
Scatter Plot Generated with ggplot2

Bar Plots
Base Graphics
y < as.data.frame(matrix(runif(30), ncol=3, dimnames=list(letters[1:10], LETTERS[1:3]))) # Generates a sample data set.
barplot(as.matrix(y[1:4,]), ylim=c(0,max(y[1:4,])+0.1), beside=T)
# Generates a bar diagram for the
first four rows in y. The barplot() function expects as input format a
matrix, and
# it uses by default the column titles for labeling the bars.
text(labels=round(as.vector(as.matrix(y[1:4,])),2), x=seq(1.5, 13,
by=1)+sort(rep(c(0,1,2), 4)), y=as.vector(as.matrix(y[1:4,]))+0.02)
# Adds corresponding values on top of each bar.
ysub < as.matrix(y[1:4,]); myN < length(ysub[,1])
mycol1 < gray(1:(myN+1)/(myN+1))[(myN+1)]
mycol2 < sample(colors(),myN); barplot(ysub, beside=T,
ylim=c(0,max(ysub)*1.2), col=mycol2, main="Bar Plot", sub="data: ysub")
legend("topright", legend=row.names(ysub), cex=1.3, bty="n", pch=15, pt.cex=1.8, col=mycol2, ncol=myN)
# Generates a bar plot with a legend
for the first four rows of the input matrix 'y'. To plot the diagram in
black and
# white, set the 'col' argument to 'col=col1". The argument 'ncol' controls the number of columns that are used for printing the legend.
par(mar=c(10.1, 4.1, 4.1, 2.1)); par(xpd=TRUE);
barplot(ysub, beside=T, ylim=c(0,max(ysub)*1.2), col=mycol2, main="Bar
Plot"); legend(x=4.5, y=0.3, legend=row.names(ysub), cex=1.3, bty="n",
pch=15, pt.cex=1.8, col=mycol2, ncol=myN)
# Same as above, but places the
legend below the bar plot. The arguments 'x' and 'y' control the
placement of the legend
# and the 'mar' argument specifies the margin sizes around the plotting area in this order: c(bottom, left, top, right).
bar < barplot(x < abs(rnorm(10,2,1)), names.arg = letters[1:10], col="red", ylim=c(0,5))
stdev < x/5; arrows(bar, x, bar, x + stdev, length=0.15, angle = 90)
arrows(bar, x, bar, x + (stdev), length=0.15, angle = 90)
# Creates bar plot with standard
deviation bars. The example in the 'barplot' documentation provides a
very useful outline
# of this function.
source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/mortgage.R")
# Imports a function that plots a loan amortization table as bar plot.
Bar Plot with Error Bars Generated with Base Graphics

lattice
y < matrix(sample(1:10, 40, replace=TRUE), ncol=4, dimnames=list(letters[1:10], LETTERS[1:4])) # Creates a sample data set.
barchart(y, auto.key=list(adj = 1), freq=T, xlab="Counts", horizontal=TRUE, stack=FALSE, groups=TRUE) # Plots all data in a single bar plot. The TRUE/FALSE arguments control the layout of the plot.
barchart(y, col="grey", layout = c(2, 2, 1), xlab="Counts", as.table=TRUE, horizontal=TRUE, stack=FALSE, groups=FALSE)
# Plots the different data components in separate bar plots. The
TRUE/FALSE and layout arguments allow to rearrange the bar plots in
different ways.
Bar Plot Generated with lattice
ggplot2
## (A) Sample Set: the following transforms the iris data set into a ggplot2friendly format
iris_mean < aggregate(iris[,1:4], by=list(Species=iris$Species),
FUN=mean) # Calculates the mean values for the aggregates given by the
Species column in the iris data set.
iris_sd < aggregate(iris[,1:4], by=list(Species=iris$Species),
FUN=sd) # Calculates the standard deviations for the aggregates given by
the Species column in the iris data set.
convertDF < function(df=df, mycolnames=c("Species", "Values",
"Samples")) { myfactor < rep(colnames(df)[1], each=length(df[,1]));
mydata < as.vector(as.matrix(df[,1])); df < data.frame(df[,1],
mydata, myfactor); colnames(df) < mycolnames; return(df) } #
Defines function to convert data frames into ggplot2friendly format.
df_mean < convertDF(iris_mean, mycolnames=c("Species", "Values", "Samples")) # Converts iris_mean.
df_sd < convertDF(iris_sd, mycolnames=c("Species", "Values", "Samples")) # Converts iris_sd.
limits < aes(ymax = df_mean[,2] + df_sd[,2], ymin=df_mean[,2]  df_sd[,2]) # Define standard deviation limits.
## (B) Bar plots of data stored in df_mean
ggplot(df_mean, aes(Samples, Values, fill = Species)) +
geom_bar(position="dodge") # Plots bar sets defined by 'Species' column
next to each other.
ggplot(df_mean, aes(Samples, Values, fill = Species)) +
geom_bar(position="dodge") + coord_flip() +
opts(axis.text.y=theme_text(angle=0, hjust=1)) # Plots bars and labels
sideways.
ggplot(df_mean, aes(Samples, Values, fill = Species)) + geom_bar(position="stack") # Plots same data set as stacked bars.
ggplot(df_mean, aes(Samples, Values)) + geom_bar(aes(fill = Species)) +
facet_wrap(~Species, ncol=1) # Plots data sets below each other.
ggplot(df_mean, aes(Samples, Values, fill = Species)) +
geom_bar(position="dodge") + geom_errorbar(limits, position="dodge") #
Generates the same plot as before, but with error bars.
# (C) Customizing colors
library(RColorBrewer); display.brewer.all() # Select a color scheme and pass it on to 'scale_*' arguments.
ggplot(df_mean, aes(Samples, Values, fill=Species, color=Species)) +
geom_bar(position="dodge") + geom_errorbar(limits, position="dodge") +
scale_fill_brewer(pal="Greys") + scale_color_brewer(pal = "Greys") #
Generates the same plot as before, but with grey color scheme.
ggplot(df_mean, aes(Samples, Values, fill=Species, color=Species)) +
geom_bar(position="dodge") + geom_errorbar(limits, position="dodge") +
scale_fill_manual(values=c("red", "green3", "blue")) +
scale_color_manual(values=c("red", "green3", "blue")) # Uses custom
colors passed on as vectors.
Bar Plot Generated with ggplot2

Pie Charts
Base Graphics
y < table(rep(c("cat", "mouse", "dog", "bird", "fly"), c(1,3,3,4,2))) # Creates a sample data set.
pie(y, col=rainbow(length(y), start=0.1, end=0.8), main="Pie Chart", clockwise=T) # Plots a simple pie chart.
pie(y, col=rainbow(length(y), start=0.1, end=0.8), labels=NA, main="Pie Chart", clockwise=T)
legend("topright", legend=row.names(y), cex=1.3, bty="n", pch=15,
pt.cex=1.8, col=rainbow(length(y), start=0.1, end=0.8), ncol=1)
# Same pie chart as above, but with legend.
Pie Chart Generated with Base Graphics

ggplot2 (see this page for more details)
df < data.frame(variable=rep(c("cat", "mouse", "dog", "bird", "fly")), value=c(1,3,3,4,2)) # Creates a sample data set.
ggplot(df, aes(x = "", y = value, fill = variable)) + geom_bar(width =
1) + coord_polar("y", start=pi / 3) + opts(title = "Pie Chart") # Plots pie chart.
ggplot(df, aes(x = variable, y = value, fill = variable)) +
geom_bar(width = 1) + coord_polar("y", start=pi / 3) + opts(title = "Pie
Chart") # Plots wind rose pie chart.
Wind Rose Pie Chart Generated with ggplot2

Heatmaps
Base Graphics
y < matrix(rnorm(50), 10, 5, dimnames=list(paste("g", 1:10, sep=""), paste("t", 1:5, sep="")))
image(y) # The image function plots matrices as heatmaps.
Heatmap Generated with Base Graphics
lattice
## Sample data
library(lattice); library(gplots)
y < lapply(1:4, function(x) matrix(rnorm(50), 10, 5, dimnames=list(paste("g", 1:10, sep=""), paste("t", 1:5, sep=""))))
## Plot single heatmap:
levelplot(y[[1]])
## Arrange several heatmaps in one plot
x1 < levelplot(y[[1]], col.regions=colorpanel(40, "darkblue", "yellow", "white"), main="colorpanel")
x2 < levelplot(y[[2]], col.regions=heat.colors(75), main="heat.colors")
x3 < levelplot(y[[3]], col.regions=rainbow(75), main="rainbow")
x4 < levelplot(y[[4]], col.regions=redgreen(75), main="redgreen")
print(x1, split=c(1,1,2,2))
print(x2, split=c(2,1,2,2), newpage=FALSE)
print(x3, split=c(1,2,2,2), newpage=FALSE)
print(x4, split=c(2,2,2,2), newpage=FALSE)
Several Heatmaps in One Plot Generated with lattice

Venn Diagrams
Computation of Venn intersects for 220 samples and plotting of 25 way Venn diagrams.
Computation of Venn intersects
The following imports several functions from the overLapper.R script for computing Venn intersects and plotting Venn diagrams (old version: vennDia.R).
These functions are relatively generic and scalable by supporting the
computation of Venn intersects of 220 or more samples. The upper limit
around 20 samples is unavoidable because the complexity of Venn
intersects increases exponentially with the sample number n according to this relationship: (2^n)  1.
A useful feature of the actual plotting step is the possiblity to
combine the counts from several Venn comparisons with the same number of
test sets in a single Venn diagram. The overall workflow of the method
is to first compute for a list of samples sets their Venn intersects
using the overLapper
function, which organizes the result sets in a list object.
Subsequently, the Venn counts are computed and plotted as bar or Venn
diagrams. The current implementation of the plotting function, vennPlot, supports Venn diagrams for 25 sample sets. To analyze larger numbers of sample sets, the Intersect Plot
methods often provide reasonable alternatives. These methods are much
more scalable than Venn diagrams, but lack their restrictive intersect
logic. Additional Venn diagram resources are provided by limma, gplots, vennerable, eVenn, VennDiagram, shapes, C Seidel (online) and Venny (online).
#################
## Sample data ##
#################
source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/overLapper.R") # Imports required functions.
setlist < list(A=sample(letters, 18), B=sample(letters, 16),
C=sample(letters, 20), D=sample(letters, 22), E=sample(letters, 18),
F=sample(letters, 22, replace=T))
# To work with the overLapper
function, the sample sets (here six) need to be stored in a list object
where the different
# compontents are named by unique identifiers, here 'A to F'. These names are used as sample labels in all subsequent data
# sets and plots.
sets < read.delim("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/Samples/sets.txt")
setlistImp < lapply(colnames(sets), function(x) as.character(sets[sets[,x]!="", x]))
names(setlistImp) < colnames(sets)
# Example how a list of test sets
can be imported from an external table file stored in tab delimited
format. Such
# a file can be easily created from a spreadsheet program, such as Excel. As a reminder, copy & paste
from external
# programs into R is also possible (see read.delim
function).
OLlist < overLapper(setlist=setlist, sep="", type="vennsets"); OLlist; names(OLlist)
# With the setting type="vennsets",
the overLapper function computes all Venn Intersects for the six test
samples in
# setlist and stores the results in the Venn_List component of the returned OLlist object. By default,
duplicates are
# removed from the test sets. The setting keepdups=TRUE
will retain duplicates by appending a counter to each entry. When
# assigning the
value "intersects" to the type argument then the function will compute
Regular
# Intersects instead of Venn Intersects. The Regular Intersect
approach (not compatible with Venn diagrams!) is described
# in the next
section. Both analyses return a presentabsent matrix in the Intersect_Matrix
component of OLlist. Each overlap
# set in the Venn_List data set is
labeled according to the sample names provided in setlist. For instance, the composite
# name 'ABC' indicates that the entries are restricted to A, B and C. The
seperator used for naming the intersect samples
# can be specified under the sep
argument. By adding the argument cleanup=TRUE, one can minimize
formatting issues in the
# sample sets. This setting will convert all characters in the sample sets to upper case and remove leading/trailing spaces.
#############################
## Bar plot of Venn counts ##
#############################
olBarplot(OLlist=OLlist, horiz=T, las=1, cex.names=0.6, main="Venn Bar Plot")
# Generates a bar plot for the Venn counts of the six test sample sets.
In contrast to Venn diagrams, bar plots scale
# to larger numbers of
sample sets. The layout of the plot can be adjusted by changing
the default values of the argument:
# margins=c(4,10,3,1). The minimum
number of counts to consider in the plot can be set with the
mincount argument
# (default is 0). The bars themselves are colored by
complexity levels using the default setting:
mycol=OLlist$Complexity_Levels.
#########################
## 2way Venn diagrams ##
#########################
setlist2 < setlist[1:2]; OLlist2 < overLapper(setlist=setlist2, sep="_", type="vennsets")
OLlist2$Venn_List; counts < sapply(OLlist2$Venn_List, length); vennPlot(counts=counts)
#
Plots a nonproportional 2way Venn diagram. The main graphics features
of the vennPlot function can be controlled by
# the following arguments
(here with 2way defaults): mymain="Venn Diagram": main title;
mysub="default": subtitle;
# ccol=c("black","black","red"): color of
counts; lcol=c("red","green"): label color; lines=c("red","green"):
# line color; mylwd=3: line width; ccex=1.0: font size of counts;
lcex=1.0: font size of labels. Note: the vector
# lengths provided for the
arguments ccol, lcol and lines should match the number of
their corresponding features
# in the plot, e.g. 3 ccol values for a 2way
Venn diagram and 7 for a 3way Venn diagram. The argument setlabels
# allows to provide a vector of custom sample labels. However,
assigning the proper names in the original test set list
# is much more
effective for tracking purposes.
#########################
## 3way Venn diagrams ##
#########################
setlist3 < setlist[1:3]; OLlist3 < overLapper(setlist=setlist3, sep="_", type="vennsets")
counts < list(sapply(OLlist3$Venn_List, length), sapply(OLlist3$Venn_List, length))
vennPlot(counts=counts, mysub="Top: var1; Bottom: var2", yoffset=c(0.3, 0.2))
#
Plots a nonproportional 3way Venn diagram. The results from several
Venn comparisons can be combined in a
# single Venn diagram by assigning
to the count argument a list with several count vectors. The
positonal offset
# of the count sets in the plot can be controlled with
the yoffset argument. The argument setting colmode=2 allows
# to assign different colors to each count set. For instance, with colmode=2 one
can assign to ccol a color vector
# or a list, such as ccol=c("blue",
"red") or ccol=list(1:8, 8:1).
#########################
## 4way Venn diagrams ##
#########################
setlist4 < setlist[1:4]
OLlist4 < overLapper(setlist=setlist4, sep="_", type="vennsets")
counts < list(sapply(OLlist4$Venn_List, length), sapply(OLlist4$Venn_List, length))
vennPlot(counts=counts, mysub="Top: var1; Bottom: var2", yoffset=c(0.3, 0.2))
#
Plots a nonproportional 4way Venn diagram. The setting type="circle"
returns an incomplete 4way Venn diagram as
# circles. This representation
misses two overlap sectors, but is sometimes easier to navigate than the default
# ellipse version.
#########################
## 5way Venn diagrams ##
#########################
setlist5 < setlist[1:5]; OLlist5 < overLapper(setlist=setlist5, sep="_", type="vennsets")
counts < sapply(OLlist5$Venn_List, length)
vennPlot(counts=counts, ccol=c(rep(1,30),2), lcex=1.5, ccex=c(rep(1.5,5), rep(0.6,25),1.5)) # Plots a nonproportional 5way Venn diagram.
################################
## Export and other utilities ##
################################
OLexport < as.matrix(unlist(sapply(OLlist5[[4]], paste, collapse=" ")))
write.table(OLexport, file="test.xls", col.names=F, quote=F, sep="\t") # Exports intersect data in tabular format to a file.
OLexport < data.frame(Venn_Comp=rep(names(OLlist5[[4]]), sapply(OLlist5[[4]], length)), IDs=unlist(OLlist5[[4]]))
write.table(OLexport, file="test.xls", row.names=F, quote=F, sep="\t") # Same as above, but exports to an alternative tabular format.
tapply(counts, OLlist5[[3]], function(x) rev(sort(x)))
# Sorts the overlap results within each complexity level by their size.
This allows to identify the sample set
# combinations with the largest
intersect within each complexity level.
sapply(names(setlist), function(x) table(setlist[[x]])[table(setlist[[x]])!=1])
#
Command to identify and count duplicated objects in the original sample
set object 'setlist'. In the given example,
# only set 'F' contains
duplications. Their frequency is provided in the result.
vennPlot(counts, mymain="", mysub="", ccol="white", lcol="white") # Returns an empty Venn diagram without counts or labels.
################################################################################
## Typical analysis routine for sets of differentially expressed genes (DEGs) ##
################################################################################
ratio < matrix(sample(seq(5, 5, by=0.1), 100, replace=T), 100, 4,
dimnames=list(paste("g", 1:100, sep=""), paste("DEG", 1:4, sep="")),
byrow=T)
# Creates a sample matrix of gene expression log2 ratios. This could be any data type!
setlistup < sapply(colnames(ratio), function(x) rownames(ratio[ratio[,x]>=1,]))
setlistdown < sapply(colnames(ratio), function(x) rownames(ratio[ratio[,x]<=1,]))
#
Identifies all genes with at least a two fold up or down regulation and
stores the corresponding gene identifiers
# in setlistup and setlistdown,
respectively.
OLlistup < overLapper(setlist=setlistup, sep="_", type="vennsets")
OLlistdown < overLapper(setlist=setlistdown, sep="_", type="vennsets")
counts < list(sapply(OLlistup$Venn_List, length), sapply(OLlistdown$Venn_List, length))
vennPlot(counts=counts, ccol=c("red", "blue"), colmode=2, mysub="Top: DEG UP; Bottom: DEG Down", yoffset=c(0.3, 0.2))
#
Performs Venn analysis for the four sets stored in setlistup and
setlistdown. The argument setting colmode=2 allows
# to assign different
colors to each count set.
# For instance, with colmode=2 one can
assign to ccol a color vector or a list, such as ccol=c("blue", "red")
or ccol=list(1:8, 8:1).

Sample Venn Diagrams (here 4 and 5way)

Intersect Plots
Collection of methods for analyzing and visualizing intersect relationships among large numbers of sample sets.
#############################################################################################
## All Possible Intersects: for comprehensive overlap analyses of 220 or more sample sets ##
#############################################################################################
source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/overLapper.R")
#
To compute all possible intersects among more than two samples at the
same time  such as common in AB, ABC,
# ABCD and so on  the overLapper
function can be used. Note: processing more than 20 sets with
this function can take
# a long time, because the complexity of all
possible intersects increases with the number of sample sets n
according
# to this relationship: (2^n)  1 (includes self comparisons).
This means there are 63 possible intersects for n=6, 1,023
# for
n=10, 32,767 for n=15 and 33,554,431 for n=25. With the current
implementation, the computation time is about 0.1
# second for n=10 and
2.5 seconds for n=15.
setlist < lapply(1:6, function(x) sample(letters, 20)); names(setlist) < LETTERS[seq(along=setlist)]
#
To work with the overLapper function, the sample sets need to be stored
in a list object where the different
# components are named by unique
identifiers: A, B, C, ..., Z.
OLlist < overLapper(setlist=setlist, complexity=1:length(setlist), sep="", type="intersects"); OLlist; names(OLlist)
#
Computes all possible intersects for the samples stored in 'setlist'.
The results are returned as a list where each
# overlap component is
labeled by the corresponding sample names. For instance, the
name 'ABC' indicates that the
# entries are common in samples A, B and
C. The seperator used for naming the intersect samples can be
specified under
# the 'sep' argument. The complexity level range to
consider can be controlled with the 'complexity' argument, which
# can have values from 2 to the total number of samples.
OLlist[[2]]; OLlist[[3]] # Returns the corresponding intersect matrix and complexity levels.
OLexport < as.matrix(unlist(sapply(OLlist[[4]], paste, collapse=" ")))
write.table(OLexport, file="test.xls", col.names=F, quote=F, sep="\t") # Exports intersect data in tabular format to a file.
counts < sapply(OLlist[[4]], length) # Counts the number of elements in each intersect component.
tapply(counts, OLlist[[3]], function(x) rev(sort(x)))
# Sorts the overlap results within each complexity level by their size.
This allows to identify the sample set
# combinations with the largest
intersect within each complexity level.
x11(height=12, width=8); olBarplot(OLlist=OLlist, horiz=T, las=1, cex.names=0.6, main="Intersect Bar Plot")
# Plots the counts as bar diagram. More details on this function are provided in the Venn diagram section.
#####################################################################################################
## All Pairwise Intersects: to identify and cluster overlaps among two to thousands of sample sets ##
#####################################################################################################
setlist
< lapply(11:30, function(x) sample(letters, x, replace=TRUE));
names(setlist) < paste("S", seq(along=setlist), sep="")
# To work with the following function, the sample sets (here 20) need
to be stored in a list object (here 'setlist')
# where each component is
named by a unique identifier (here S1S20).
setlist < sapply(setlist, unique)
olMA < sapply(names(setlist), function(x) sapply(names(setlist), function(y) sum(setlist[[x]] %in% setlist[[y]]))); olMA
#
Creates an intersect matrix for all pairwise sample comparisons stored
in 'setlist'. This approach scales well up to
# 3several thousands of
sample sets.
library("gplots"); heatmap.2(olMA, trace="none",
Colv="none", Rowv="none", dendrogram="none", col=colorpanel(40,
"darkred", "orange", "yellow")) # Plots intersect matrix as heatmap.
heatmap.2(olMA, trace="none", col=colorpanel(40, "darkred", "orange", "yellow"))
#
Hierarchical clustering of the rows and columns of the intersect matrix
'olMA'. The result is plotted as heatmap
# with two identical dendrograms
representing the outcome of the hierarchical clustering. The
latter is internally
# performed by calls of heatmap.2() to the functions
dist() and hclust() using their default settings: euclidean
# distances and complete linkage. Note: the distance matrix used for
clustering in this examples is based on the
# rowtorow
(columntocolumn) similarities in the olMA object. The next
example shows how one can use the olMA
# object directly as distance
matrix for clustering after transforming the intersect counts into similarity measures,
# such as Jaccard or Rand indices.
diffMA1 < sapply(names(setlist), function(x) sapply(names(setlist), function(y) sum(!setlist[[x]] %in% setlist[[y]])))
diffMA2 < sapply(names(setlist), function(x) sapply(names(setlist), function(y) sum(!setlist[[y]] %in% setlist[[x]])))
jaccardMA < olMA/(diffMA1+diffMA2+olMA)
hr < hclust(as.dist(1jaccardMA), method = "complete", members=NULL)
heatmap.2(jaccardMA,
Rowv=as.dendrogram(hr), Colv=as.dendrogram(hr), trace="none",
col=colorpanel(40, "darkred", "orange", "yellow"))
#
Uses the olMA for clustering after transfroming its intersect counts
into Jaccard (Tanimoto) similarity indices.
# This transformation can give
reasonable results for sample sets with large size
differences. Many alternative
# similarity measures for set comparisons
could be considered here.
setlist[["S1"]][setlist[["S1"]] %in% setlist[["S2"]]]
# Example for returning the intersect entries for one pairwise comparison, here S1 and S2.
name1 < matrix(colnames(olMA), length(olMA[,1]), length(olMA[,1]), byrow=T)
name2 < matrix(colnames(olMA), length(olMA[,1]), length(olMA[,1]), byrow=F)
mynames < paste(name1, name2, sep="_"); mynames < lapply(strsplit(mynames, "_"), sort)
mynames < sapply(mynames, paste, collapse="_"); olV < as.vector(olMA)
names(olV) < mynames; olV < olV[!duplicated(names(olV))]; sort(olV)
#
Converts the intersect matrix 'olMA' into a named vector sorted by
intersect size. This allows to identify the
# combinations of pairwise
sample comparisons that have the largest or smallest intersects.
######################################################################################################
## Presenceabsence matrices: to indentify memberships of items across large numbers of sample sets ##
######################################################################################################
setlist
< lapply(11:30, function(x) sample(letters, x, replace=TRUE));
names(setlist) < paste("S", seq(along=setlist), sep="")
# Generates a sample data set.
source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/overLapper.R")
# Imports the required overLapper() function.
paMA < overLapper(setlist=setlist, type="intersects", complexity=2)[[2]]
paMA[names(rev(sort(rowSums(paMA)))),]
# Creates a presenceabsence matrix of all items across their sample sets. In this matrix the presence information is
# indicated by ones and its rows are sorted by the presence
frequencies. Alternative versions of the presentabsent
@ matrix can be
returned by setting the 'type' argument to 1, 2 or 3.
library("gplots"); heatmap.2(paMA, trace="none", Colv="none", Rowv="none", dendrogram="none", col=c("white", "gray"))
# Plots the presentabsent matrix as heatmap.
rowSums(paMA) # Returns the presence frequencies of the items in the sample sets.
sapply(rownames(paMA), function(x) colnames(paMA)[paMA[x, ]==1])
#
Returns for each item the names of sample sets where it is present. The
opposite result for absent calls can be
# returned by changing '==1' to
'==0'.
Intersectbased Heatmap with Clustering Using Jaccard Similarity Measure
Histograms
Base Graphics
x < rnorm(100); hist(x, freq=FALSE); curve(dnorm(x), add=TRUE) #
Plots a random distribution as histogram and overlays curve ('freq=F':
ensures density histogram; 'add=T': enables 'overplotting').
plot(x<1:50, dbinom(x,size=50,prob=.33), type="h") # Creates pin diagram for binomial distribution with n=50 and p=30.
Basic Histogram Generated with Base Graphics
ggplot2 (see this page for more details)
ggplot(iris, aes(x=Sepal.Width)) + geom_histogram(aes(fill = ..count..), binwidth=0.2) # Plots histogram for second column in 'iris' data set.
ggplot(iris, aes(x=Sepal.Width)) + geom_histogram(aes(y = ..density.., fill = ..count..), binwidth=0.2) + geom_density() # Density representation.
Histogram Generated with ggplot2

Density Plots
plot(density(rnorm(10)), xlim=c(2,2), ylim=c(0,1), col="red") # Plots a random distribution in form of a density plot.
split.screen(c(1,1))
plot(density(rnorm(10)), xlim=c(2,2), ylim=c(0,1), col="red")
screen(1, new=FALSE)
plot(density(rnorm(10)), xlim=c(2,2), ylim=c(0,1), col="green", xaxt="n", yaxt="n", ylab="", xlab="", main="",bty="n")
close.screen(all = TRUE)
#
Several density plots can be overlayed with the 'split.screen'
function. In ths example two density plots are plotted at the same scale
('xlim/ylim') and
# overlayed on the same graphics device using 'screen(1,new=FALSE))'.


Box Plots
y < as.data.frame(matrix(runif(300), ncol=10, dimnames=list(1:30, LETTERS[1:10]))) # Generates a sample data set.
boxplot(y, col=1:length(y))
#
Creates a box plot. A box plot (also known as a boxandwhisker
diagram) is a graphical representation of a fivenumber summary, which
consists of the smallest
# observation, lower quartile, median, upper quartile and largest observation.
Basic Box Plot Generated with Base Graphics

ROC Plots
A variety of libraries are available for these tasks: ROCR, ROC, nonbinROC, pROC, ggplot2

Feature Maps for DNA/Protein Sequences
## The featureMap.R script plots simple feature maps of biological sequences based on provided mapping coordinates.
source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/featureMap.txt")
## Alternative approach for plotting simple chromosome maps
plot(x < rnorm(40,2e+07,sd=1e+07), y < rep(1,times=40), type="h", col="blue", xaxt="n", yaxt="n", bty="n")
abline(h=0.78, col="green", lwd=12)
lines(a < rnorm(5,2e+07,sd=1e+07), b < rep(1,times=5), type="h", col="red", lwd=2)


Miscellaeneous Plotting Utilities
plot(1:10); lines(1:10, lty=1)
# Superimposes lines onto existing plots. The usage of plotted values
will connect the data points. Different line types can be specified with
# argument "lty = 2". More on this can be found in the documentation for 'par'.
plot(x
< 1:10, y < 1:10); abline(1,1, col="green"); abline(1,1,
col="red"); abline(v=5, col="blue"); abline(h=5, col="brown") # Function 'abline' adds lines in different colors to xyplot.

Arranging PlotsBase Graphics
x11(height=6, width=12, pointsize=12) # Defines the size of the plotting device.
par(mfrow=c(2,3)); for(i in 1:6) { plot(1:10) } #
With the argument 'par(mfrow=c(nrow,ncol))' one can define how several
plots are arranged next to each other on the same plotting device.
nf < layout(matrix(c(1,2,3,3), 2, 2, byrow=TRUE), c(3,7), c(5,5), respect=TRUE)
layout.show(nf)
for(i in 1:3) { barplot(1:10) }
#
The 'layout()' function allows to devide the plotting device into
variable numbers of rows and columns with the columnwidths and the
rowheights specified in the respective arguments.
split.screen(c(1,1))
plot(density(rnorm(10)), xlim=c(2,2), ylim=c(0,1), col="red")
screen(1, new=FALSE)
plot(density(rnorm(10)), xlim=c(2,2), ylim=c(0,1), col="green", xaxt="n", yaxt="n", ylab="", xlab="", main="",bty="n")
close.screen(all = TRUE)
# Possibility to overlay independent plots with 'split.screen'
function. In ths example two density plots are plotted at the same scale
('xlim/ylim')
# and overlayed on the same graphics device using
'screen(1,new=FALSE))'. Split.screen is also very useful for generating
multiple plots next to each
# other on a single device by setting the layout setting to several screens, e.g.: split.screen(c(2,2))

 Example of Arranging Plots Using Base Graphics
ggplot2
library(grid); library(ggplot2) # Requires grid library. dsmall < diamonds[sample(nrow(diamonds), 1000), ] # Creates a small sample data set. a < ggplot(dsmall, aes(color, price/carat)) + geom_jitter(size=4, alpha = I(1 / 1.5), aes(color=color)) b < ggplot(dsmall, aes(color, price/carat, color=color)) + geom_boxplot() c < ggplot(dsmall, aes(color, price/carat, fill=color)) + geom_boxplot() + opts(legend.position = "none") grid.newpage() # Open a new page on grid device pushViewport(viewport(layout = grid.layout(2, 2))) # Assign to device viewport with 2 by 2 grid layout print(a, vp = viewport(layout.pos.row = 1, layout.pos.col = 1:2)) print(b, vp = viewport(layout.pos.row = 2, layout.pos.col = 1)) print(c, vp = viewport(layout.pos.row = 2, layout.pos.col = 2, width=0.3, height=0.3, x=0.8, y=0.8))

Customize XY Axes
op < par(mar=c(6,6,6,8)); barplot(1:10, names.arg=letters[1:10], cex.names=1.5, col=3, yaxt = "n") # Creates bar plot without yaxis and wider margins around plot.
axis(2,
las = 1, at=seq(0,10,0.5), lwd=2, cex.axis=1.5); axis(3, las = 1,
lwd=2, cex.axis=1.5); axis(4, las = 1, at=seq(0,10,2),
labels=month.name[1:6], lwd=2, cex.axis=1.5); par(op) # Adds three custom axes.
curve(sin(x), pi, pi, axes=F, ylab="", xlab=""); axis(1, pos=0); axis(2, pos=0) # Plots a function in Cartesianstyle coordinate system.

Examples of Customizing XY Axes
Color Selection Utilities
For more details on this topic consult Earl Glynn's Color Chart
y < as.data.frame(matrix(runif(30), ncol=3, dimnames=list(letters[1:10], LETTERS[1:3]))) # Generates a sample data set.
cat("palette():\n"); palette(); palette(rainbow(20, start=0.1, end=0.2))
cat("palette(rainbow(20, start=0.1, end=0.2)):\n"); palette(); palette("default")
#
Prints and modifies the color palette which is used when the argument
'col=' has a numeric index. The last step sets the palette back to its
default setting.
par(mfrow=c(1,2)); barplot(as.matrix(y[1:8,]), beside=T, col=1:8)
palette(rainbow(8, start=0.1, end=0.8)); barplot(as.matrix(y[1:8,]), beside=T, col=1:8)
palette("default")
#
Example to call the default color palette with a numeric vector in the
'col=' argument'. In the second plot a modified palette is called the
same way.
example(rainbow) # The function rainbow() allows to select various predefined color schemes.
barplot(as.matrix(y[1:10,]), beside=T, col=rainbow(10, start=0.1, end=0.2))
#
Example to select 10 colors with the rainbow() function. The start and
end values need to be between 0 and 1. The wider their distance the more
diverse are the resulting colors.
barplot(as.matrix(y[1:10,]), beside=T, col=gray(c(1:10)/10)) # The gray() function allows to select any type of gray shades by providing values from 0 to 1.
colors()[1:10]; col2rgb(colors()[1:10]) # The colors() function allows to select colors by their name. The col2rgb() can translates them into the RGB color code.
library(RColorBrewer); example(brewer.pal) # Shows how to select color schemes with the RColorBrewer library.
Adding Text
y < as.data.frame(matrix(runif(30), ncol=3, dimnames=list(letters[1:10], LETTERS[1:3]))) # Generates a sample data set.
plot(y[,1], y[,2]); text(0.2,0.2,"My Text") # Adds text at defined coordinates of plot.
plot(y[,1], y[,2]); mtext("My Text", 4) # Adds text to one of the four margin areas (side: 14) of the plot.
plot(y[,1], y[,2], pch=20, col="red", main="Plot of Symbols and Labels"); text(y[,1]+0.03, y[,2], rownames(y))
# Plots a scatter plot with the corresponding labels (here row names) next to the data points.

Interactive Functions
locator() #
Allows user to click on existing graph with left mouse button. System
returns the corresponding xycoordinates after clicking on right mouse
button.
text(locator(1), "My_Outlier", adj=0) # Adds text "My_Outlier" to graph at position where left mouse button is clicked.

Saving Graphics to Files
jpeg("test.jpeg"); plot(1:10, 1:10); dev.off()
#
After the 'jpeg("test.jpeg")' command all graphs are redirected to the
file "test.jpeg" in JPEG format. To export images with the highest
quality,
# the default setting "quality = 75" needs to be changed
to 100%. The actual image data are not written to the file until the
'dev.off()' command is executed!
pdf("test.pdf"); plot(1:10, 1:10); dev.off() #
Same as above, but for pdf format. The pdf and svg formats provide
often the best image quality, since they scale to any size without
pixelation.
png("test.png"); plot(1:10, 1:10); dev.off() # Same as above, but for png format.
postscript("test.ps"); plot(1:10, 1:10); dev.off() # Same as above, but for PostScript format. svg("test.svg"); plot(1:10, 1:10); dev.off() # Generates Scalable Vector Graphics (SVG) files that can be edited in vector graphics programs, such as InkScape.
Importing Graphics into R
library(pixmap); x < read.pnm(file="my_image.pnm"); plot(x); print(x) # Images need to be in pnm format. This format can be obtained in ImageMagick like this: 'convert my_image.jpg my_image.pnm'.

Graphics Exercises

Writing Your Own Functions
The R language allows users to create their own function objects. The basic syntax for defining new functions is:
name < function(arg_1, arg_2, ...) expression # Basic syntax for functions.
 A much more detailed introduction into writing functions in R is available in the Programming in R section of this manual.

R Web Applications
 Various webbased R services and R web development kits are available:

 R Web Services
 R Cloud Computing
 R Web Development Kits
R Basics Exercises
[ Slide Show ] [ R Code ]
The following exercises introduce a variety of useful data analysis utilities available in R.
Import from spreadsheet programs (e.g. Excel)
 Download the following molecular weight and subcelluar targeting
tables from the TAIR site, import the files into Excel and save them as
tab delimited text files.
my_mw < read.delim(file="MolecularWeight_tair7.xls", header=T, sep="\t") # Imports molecular weight table.
my_target < read.delim(file="TargetP_analysis_tair7.xls", header=T, sep="\t") # Imports subcelluar targeting table.
 Check tables and rename gene ID columns
my_mw[1:4,]; my_target[1:4,]; colnames(my_target)[1] < "ID"; colnames(my_mw)[1] < "ID" # Prints out the first four rows of the two data frames and renames their gene ID columns.
 Merge the two tables based on common ID field
my_mw_target < merge(my_mw, my_target, by.x="ID", by.y="ID", all.x=T)

 Shorten one table before the merge and then remove the nonmatching rows (NAs) in the merged file
my_mw_target2 < merge(my_mw, my_target[1:40,], by.x="ID", by.y="ID", all.x=T) # To remove nonmatching rows, use the argument setting 'all=F'.
na.omit(my_mw_target2) # Removes rows containing "NAs" (nonmatching rows).
Problem 1:
How can the merge function in the previous step be executed so that
only the common rows among the two data frames are returned? Prove that
both methods  the two step version with na.omit and your method  return identical results.
Problem 2: Replace all NAs in the data frame my_mw_target2 with zeros.
 Apply several combinatorial queries (filters) on the data set. For example:
my_mw_target[my_mw_target[,2]>100000 & my_mw_target[,4]=="C",] # Retrieves all records with MW greater than 100000 and targeting to chloroplasts.
dim(my_mw_target[my_mw_target[,2]>100000 & my_mw_target[,4]=="C", ]) # Determines the number of matches for the above query.
Problem 3: How many protein entries in the my_mw_target
data frame have a MW of greater then 4000 and less then 5000. Subset
the data frame accordingly and sort it by MW to check that your result
is correct.
 Use a regular expression in a substitute function to generate a separate ID column that lacks the gene model extensions
my_mw_target3 < data.frame(loci=gsub("\\..*", "", as.character(my_mw_target[,1]), perl = TRUE), my_mw_target)
my_mw_target3 < data.frame(loci=gsub("(^.*)\\.\\d{1,}", "\\1", as.character(my_mw_target[,1]), perl = TRUE), my_mw_target) # Same as above but with backreference.
Problem 4: Retrieve those rows in my_mw_target3 where the second column contains the following identifiers: c("AT5G52930.1", "AT4G18950.1", "AT1G15385.1", "AT4G36500.1", "AT1G67530.1").
Use the '%in%' function for this query. As an alternative approach,
assign the second column to the row index of the data frame and then
perform the same query again using the row index. Explain the difference
of the two methods.
 Calculate the number of gene models per loci with the table() function
my_mw_target4 < cbind(my_mw_target3, Freq=table(my_mw_target3[,1])[as.character(my_mw_target3[,1])])
# The table() function counts the number of duplicates for each loci and cbind() appends the information to the data frame.
my_mw_target4[order(my_mw_target4$Freq),][1:30,] # Sorts by count information.
length(unique(my_mw_target4[,1])) # Removes duplicates and prints the number of unique loci.
sum(!duplicated(my_mw_target4[,1])) # Gives the same result as previous command; remember: the function 'duplicated' returns a logical vector.
length(my_mw_target4[,2])length(unique(my_mw_target4[,1])) # Calculates the number of duplicated entries.
sum(duplicated(my_mw_target4[,1])) # Generates the same result as the previous command.

 Perform a variety calculations across columns
data.frame(my_mw_target4, avg_AA_WT=(my_mw_target4[,3]/my_mw_target4[,4]))[1:40,] # Calculates the average AA weight per protein.
data.frame(my_mw_target4, mean=apply(my_mw_target4[,6:9], 1, mean))[1:40,] # Calculates the mean across several fields of each row.
data.frame(my_mw_target4, mean=apply(my_mw_target4[,6:9], 1, mean), stdev=apply(my_mw_target4[,6:9], 1, sd, na.rm=T))[1:40,]
# Calculates the mean and standard deviation across several fields of each row.
mean(my_mw_target4[,c(1,2,5)], na.rm=T) # Calculates the mean for all numeric columns.
data.frame(my_mw_target4[1:40,], ttest=apply(my_mw_target4[1:40,6:9], 1, function(x) {t.test(x[1:2], x[3:4])$p.value} ))
# Calculates a twosample ttest for two slices in each row of the data
frame and records the corresponding pvalues. To calculate a paired
ttest,
# add the argument 'paired=T'.
 Export data frame to Excel
write.table(my_mw_target4, file="my_file.xls", quote=F, sep="\t", col.names = NA) # Writes the data frame 'my_mw_target4' into a tabdelimited text file that can be opened with Excel.
boxplot(my_mw_target4[,3], my_mw_target4[,4], names=c("MW","CC"), col=c("blue","red"), main="Box Plot") # Generates box plots for MW and AA counts.
plot(density(my_mw_target4[,3]), col="blue", main="Density Plot") # Plots density distribution plots for MW.
barplot(my_mw_target4[1:5,3],
beside = F, col = c("red", "green", "blue", "magenta", "black"),
names.arg=as.vector(my_mw_target4[1:5,1]), main="MW Bar Plot",
ylim=c(0,230000))
# Bar plot of MW for first five loci.
source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/overLapper.R") # Imports the venn diagram function.
setlist
< list(A=sample(my_mw_target4[,1], 1000),
B=sample(my_mw_target4[,1], 1100), C=sample(my_mw_target4[,1], 1200),
D=sample(my_mw_target4[,1], 1300), E=sample(my_mw_target4[,1], 1400))
# Generates 5 sets of random IDs.
OLlist < overLapper(setlist=setlist, sep="_", type="vennsets")
counts
< sapply(OLlist$Venn_List, length); vennPlot(counts=counts,
ccol=c(rep(1,30),2), lcex=1.5, ccex=c(rep(1.5,5), rep(0.6,25),1.5)) # Plots a 5way venn diagram for the random ID sets.
OLexport < as.matrix(unlist(sapply(OLlist[[4]], paste, collapse=" ")))
write.table(OLexport, file="OLexport.xls", col.names=F, quote=F, sep="\t") # Exports intersect data in tabular format to a file.
pdf("venn_plot.pdf"); vennPlot(counts=counts); dev.off() # Saves the plot image to a PDF file.
Problem 5:
Generate two key lists each with 4 random sample sets. Compute their
overlap counts and plot the results for both lists in one venn diagram.
Problem 6: Write all commands from the previous exercises into an R script ( exerciseRbasics.R) and execute it with the source() function like this: source("exerciseRbasics.R"). This will execute all of the above commands and generates the corresponding output files in the current working directory.
HT Sequence Analysis with R and Bioconductor
 This section of the manual is available on the HTSeq site.
Programming in R
 This section of the manual is available on the Programming in R site.
Bioconductor
Introduction
Bioconductor is an open source and open development software project for the analysis of genome data ( e.g.
sequence, microarray, annotation and many other data types). This
section of the manual provides a brief introduction into the usage and
utilities of a subset of packages from the Bioconductor project. The
included packages are a 'personal selection' of the author of this
manual that does not reflect the full utility specturm of the
Bioconductor project. The introduced packages were chosen, because the
author uses them often for his own teaching and research. To obtain a
broad overview of available Bioconductor packages, it is strongly
recommended to consult its official project site.
Due to the rapid development of many packages, it is also important to
be aware that this manual will often not be fully uptodate. Because of
this and many other reasons, it is absolutley critical to use the
original documentation of each package (PDF manual or vignette) as
primary source of documentation. Users are welcome to send suggestions
for improving this manual directly to its author.

Finding Help
The instructions for installing BioConductor packages are available in the administrative section of this manual. Documentation for Bioconductor packages can be found in the vignette of each package. A listing of the available packages is available on the Bioc Package page. Annotation libraries can be found here. The Bioc Mail GMANE Archive
can be searched to find answers to questions that have been discussed
on the mailing list. Another valuable information resource is the Bioconductor Book. The basic R help functions provide additional information about packages and their functions:
library(affy) # Loads a particular package (here affy package).
library(help=affy) # Lists all functions/objects of a package (here affy package).
library() # Lists all libraries/packages that are available on a system.
openVignette() # Provides documentation on packages.
sessionInfo()
# Prints version information about R and all loaded packages. The generated output should be provided when sending
# questions or bug reports to the R and BioC mailing lists.
?function # Opens documentation on a function.
help.search("topic") # Searches help system for documentation.
search() # Shows loaded packages and attached data frame in current search path.
system.file(package="affy") # Shows location of an installed package.
library("tools"); vigSrc < list.files(pattern="Rnw$", system.file("doc",package="GOstats"), full.names=TRUE); vigSrc; for (v in vigSrc) Stangle(v) # Extracts R code from a vignette and saves it to file(s) in your current working directory.
Affy Packages
BioConductor provides extensive resources for the analysis of Affymetrix data. The most important ones are introduced here.
Affy
The Affy package provides basic methods for analyzing affymetrix
oligonucleotide arrays. The following steps outline the usage of the
Affy package and associated packages.
(1) Obtaining scaled expression values with 5 different methods (MAS5,
RMA, GCRMA, Plier & dChip).
A comparison of three of these methods
including references is available in this slide show. Move your CEL files into your working directory and make sure that
the corresponding CDF library for your chips is available on your
system.
library(affy) # Loads the affy package.
mydata < ReadAffy()
# Reads all *.CEL (*.cel) files in your current working directory and stores them into the AffyBatch object 'mydata'.
mydata < ReadAffy(widget=TRUE)
# Opens file browser to select specific CEL files.
eset < rma(mydata)
# Creates normalized and background corrected expression values using the RMA method. The generated data are stored as
# ExpressionSet class in the 'eset' object. For large data sets use the more memory efficient justRMA() function.
eset < mas5(mydata)
# Uses MAS 5.0 normalization method, which is much slower than RMA!
eset_gcrma < gcrma(mydata)
# Use this command to acquire gcrma data. The 'library(gcrma)' needs to be loaded first.
eset_plier < justPlier(mydata)
# Use this command to generate plier data. The 'library(plier)' needs to be loaded first.
eset_PMA < mas5calls(mydata)
# Generates MAS 5.0 P/M/A calls. The command creates ExpressionSet with P/M/A calls in the 'exprs' slot and the
# wilcoxon pvalues in the 'se.exprs' slot. To access them see below.
eset < expresso(mydata, normalize.method="invariantset", bg.correct=FALSE, pmcorrect.method="pmonly", summary.method="liwong")
# Generates expression calls similar to dChip (MBEI) method from Li and Wong.
library(affycoretools); affystart(plot=T, express="mas5")
# Handy function to normalize all cel files in current working directory, perform qc plots and export normalized data to file.
# Works for mas5, rma and gcrma.
(2) Exporting data from ExpressionSet objects
write.exprs(eset, file="mydata.txt") # Writes expression values to text file in working directory.
x < data.frame(exprs(eset), exprs(eset_PMA), assayDataElement(eset_PMA, "se.exprs"))
x < x[,sort(names(x))]; write.table(x, file="mydata_PMA.xls", quote=F, col.names = NA, sep="\t")
# Writes expression values, PMA values and wilcoxon pvalues to text file in working directory. Remember: the old
# command for accessing the wilcoxon pvalues in BioC versions <2.0 was 'se.exprs(eset_PMA))'.
(3) Obtaining single probelevel data from 'affybatch' objects (see also documentation ' ?ReadAffy')
mypm < pm(mydata) # Retrieves PM intensity values for single probes
mymm < mm(mydata) # Retrieves MM intensity values for single probes
myaffyids < probeNames(mydata) # Retrieves Affy IDs
result < data.frame(myaffyids, mypm, mymm) # Combines the above information in data frame
(4) Working with 'ExpressionSet' objects (see Biobase Manual)
eset; pData(eset) # Provides summary information of ExpressionSet object 'eset' and lists the analyzed file names.
exprs(eset)[1:2,1:4]; exprs(eset)[c("244901_at","244902_at"),1:4]
# Retrieves specific rows and fields of ExpressionSet object. To learn more about this format class, consult
# ExpressionSet manual with command '?ExpressionSet'.
test < as.data.frame(exprs(eset)); eset2 <new("ExpressionSet", exprs = as.matrix(test), annotation="ath1121501"); eset2
# Example for creating an ExpressionSet object from a data frame. To create the object from an external file, use
# the read.delim() function first and then convert it accordingly.
data.frame(eset) # Prints content of 'eset' as data frame to STDOUT.
exprs(eset_PMA)[1:2,1:2]; assayDataElement(eset_PMA, "se.exprs")[1:2,1:2]
# Returns from ExpressionSet of 'mas5calls(mydata)' the PMA values from its 'exprs' slot and the pvalues from its
# 'se.exprs' slot.
(5) Retrieving annotation data for Affy IDs (see Annotation Package Manual)
library(ath1121501.db) # Opens library with annotation data.
library(help=ath1121501.db) # Shows availability and syntax for annotation data.
ath1121501() # Provides a summary about the available annotation data sets of an annotation library.
library(ath1121501cdf); ls(ath1121501cdf) # Retrieves all Affy IDs for a chip in vector format.
x < c("245265_at", "260744_at", "259561_at", "254759_at", "267181_at") # Generates sample data set of Affy ID numbers.
contents(ath1121501ACCNUM)[1:40] # Retrieves locus ID numbers for Affy IDs.
myAnnot < data.frame(ACCNUM=sapply(contents(ath1121501ACCNUM), paste, collapse=", "),
SYMBOL=sapply(contents(ath1121501SYMBOL), paste, collapse=", "),
DESC=sapply(contents(ath1121501GENENAME), paste, collapse=", "))
myAnnot[x, ]
# Organizes full set of annotation features in a data frame, here: ACCNUM, SYMBOL and GENENAME. The annotations
# from probe sets with several mappings are collapsed into single strings.
mget(x, ath1121501GO, ifnotfound=NA) # Retrieves GO information for Affy IDs.
(6) Accessing probe sequence data
library(ath1121501probe) # Opens library with probe sequence data.
print.data.frame(ath1121501probe[1:22,]) # Prints probe sequences and their positions for first two Affy IDs.
library(Biostrings)
pm < DNAStringSet(ath1121501probe$sequence); names(pm) < ath1121501probe$Probe.Set.Name
# Stores probe sequences as DNAStringSet object. See HTSeq manual for more details.
Visualization and quality controls
Additional information on this topic can be found in the affyPLM QC Manual, the arrayQualityMetrics package and on the WEHI Affy Lab site.
library(affyQCReport)
QCReport(mydata, file="ExampleQC.pdf")
# Generates a comprehensive QC report for the AffyBatch object 'mydata' in PDF format. See affyQCReport for details.
deg < AffyRNAdeg(mydata); summaryAffyRNAdeg(deg); plotAffyRNAdeg(deg)
# Performs RNA degradation analysis. It averages on each chip the probes relative to the 5'/3' position on the target
# genes. A summary list and a plot are returned.
image(mydata[ ,1]) # Reconstructs image with log intensities of first chip.
hist(mydata[ ,1:2]) # Plots histogram of PM intensities for 1st and 2nd array.
hist(log2(pm(mydata[,1])), breaks=100, col="blue") # Plos bar histogram of the PM ('pm') or MM ('mm') log intensities of 1st array.
boxplot(mydata,col="red") # Generates a box plot of unnormalized log intensity values.
boxplot(data.frame(exprs(eset)), col="blue", main="Normalized Data") # Generates a box plot of normalized log intensity values.
mva.pairs(pm(mydata)[,c(1,4)])
# Creates MAplot for unnormalized data. A MAplot is a plot of logintensity ratios (Mvalues) versus logintensity
# averages (Avalues) between selected chips (here '[1,4]').
mva.pairs(exprs(eset)[,c(1,4)]) # Creates MAplot for normalized data.
Simpleaffy
The simpleaffy package automates many repetitive highlevel analysis steps.
(1) Creating expression values from CEL files
Move CEL files into working directory and create whitespace delimited table file " covdesc.txt": first column no title then CEL file names; second column title "treatment", then treatment names.
library(simpleaffy) # Loads "affy" and "simpleaffy" packages.
raw.data < read.affy("covdesc.txt")
# Reads data in working directory that are specified in "covdesc.txt" file; for help on this function type "?read.affy".
x.rma < call.exprs(raw.data, "rma")
# Creates expression values using RMA method. The function 'call.exprs' returns always log2 expression values; for
# help on this function type "?call.exprs". One can also use here the "gcrma" method after loading it with the command 'library(gcrma)'.
x.mas < call.exprs(raw.data, "mas5") # Computes expression values using MAS 5.0 method.
write.exprs(x.rma, file="mydata.txt") # Writes expression values to file "mydata.txt".
(2) Quality control steps
The "qc" function provides several qc stats for chips. To use it, one follows the following steps.
x < read.affy("covdesc.txt") # Reads data in working directory.
x.mas5 < call.exprs(x,"mas5") # Calculates expression values with MAS 5.0 method which is required for the next step!
qc < qc(x,x.mas5)
# Calls "qc" function which generates object containing scale factors, GAPDHActin_3'/5' ratios, target intensity,
# percent present, average, min, max, mean background int and more; for more details type "?qc".
x.mas5@description@preprocessing # Creates QC output like this.
getGapdh3(cleancdfname(cdfName(x)))
# To find out which GAPDH/Actin probe sets are used on a given chip; for others use getGapdhM, getGapdh5, getBioB, getBioC...
plot(qc) # Creates summary plot of QC data. See description on page 5 of vignette "simpleaffy".
(3) Filtering by expression values
get.array.subset(x.rma, "treatment",c("CMP1","Ctrl"))
# When R loaded *.CEL files it also reads experiment layout from covdesc.txt file (Ctrl and CMP1).
# The "get.array.subset" function makes it easy to select subsets of arrays.
results < pairwise.comparison(x.rma, "treatment", c("1", "2"), raw.data)
# Computes mean intensities of replicates from two treatments (means), calculates log2fold changes between
# means (fc), performs ttest (tt) and writes out PMA calls (calls). Type "?pairwise.comparison" for more help on this function.
write.table(data.frame(means(results), fc(results), tt(results), calls(results)), file="my_comp.txt", sep="\t")
# Command to export all 'pairwise.comparison' data into one table.
sort(abs(fc(results)),decreasing=TRUE)[1:100]
# Prints out 100 highest foldchanges (fc), for means of intensities (means), for ttest (tt), for PMA (calls).
significant < pairwise.filter(results, min.exp=log2(10), min.exp.no=2, fc=log2(8), min.present.no=4, tt= 0.001, present.by.group=FALSE)
# Function 'pairwise.filter' takes the output from 'pairwise.comparison' and filters for significant changes. Filter Arguments:
# min.exp: minimum expression cut off
# min.exp.no: occurence of 'min.exp' in at least this number of chips
# min.present.no: present calls on at least this number of chips
# present.by.group: If true, then present count is restricted to replicate groups and not all chips of an experiment!
# fc: A gene must show a log2 fold change greater than this to be called significant. A 2.5fold change can be specified
# like this: 'fc=log2(2.5)'
# tt: A gene must be changing with a pscore less than this to be called significant
# Type "?pairwise.filter" for more info on the different arguments of this function.
write.table(data.frame(means(significant), fc(significant), tt(significant), calls(significant)), file="my_comp.txt", col.names = NA, sep="\t")
# Exports all 'pairwise.filter' data into one table.
(4) Plotting results
plot(significant, type="scatter")
# Plots significant changes from above as scatter plot. Alternative plotting types: type="ma" or type="volcano".
plot(results,significant, type="scatter")
# Plots means of the two replicate groups as scatter plot and highlights all genes that meet criteria of 'significant'
# filter. Meaning of colors: red  all present, orange  all present in one group or the other, yellow  all that remain.
plot(results,type="scatter") # Plots means of the two replicate groups as scatter plot.
png(file="figure1.png"); plot(significant,type="scatter"); dev.off() # Writes scatter plot to image file.
(5) Exercise simpleaffy ( Command Summary)
Analysis of Differentially Expressed Genes
Limma
Limma is a software package for the analysis of gene expression
microarray data, especially the use of linear models for analysing
designed experiments and the assessment of differential expression. The
package includes preprocessing capabilities for twocolor spotted
arrays. The differential expression methods apply to all array platforms
and treat Affymetrix, single channel and two channel experiments in a
unified way. The methods are described in Smyth 2004 and in the limma manual. An illustrated introduction for the GUI packages can be found at WEHI.
library(limma) # Loads commandline limma package.
limmaUsersGuide() # Opens pdf manual for limma.
## The cDNA and affy sample data sets of the manual can be downloaded from the limmaGUI and affylmGUI pages.
library(affylmGUI); affylmGUI()
# Opens affylmGUI for affy data. For a quick start, follow the instructions for the Estrogen data set.
library(limmaGUI); limmaGUI()
# Opens limmaGUI for cDNA array data. For a quick start, follow the instructions for the Swirl Zebrafish data set.
There are four main data objects created and used by limma:
 RGList: for cDNA data created by function read.maimages()
 MAList: for cDNA data created by functions MA.RG() or normalizeWithinArrays()
 MArrayLM: created by function lmFit()
 TestResults: created by function decideTests()
More details on this can be found in paragraph 13 of the limma PDF manual (limmaUsersGuide).
(1) Reading cDNA array data
To make the following commands work, save and extract the SWIRL cDNA microarray sample data into your R working directory. For a quick demonstration of the analysis of this data set, one can copy & paste or source the following commandline summary into the R terminal: my_swirl_commands.txt.
Requirements
 Have all intensity data files in one directory. Supported formats are: ArrayVision, ImaGene, GenePix, QuantArray, SMD (QuantArray) or SPOT. If an intensity data file
format is not supported then one can specify the corresponding column
names during the data import into R (see below). Type '?read.maimages'
for more information about supported formats.
 Targets file: Format of targets file. This file defines which RNA sample was hybridized to each channel of each array.
 Only for SPOT: separate gene list file
 Optional: spot type file for identifying special probes such as controls.
############################
## Reading intensity data ##
############################
targets < readTargets("Targets.txt") # Reads targets information from file 'Targets.txt' and assigns it to targets frame.
RG < read.maimages(targets$FileName, source="spot", sep="\t", path="./")
# Writes intensity data into 'RGList' list object. The argument 'targets$FileName' specifies the intensity files;
# 'source="spot"' specifies the image analysis program (e.g. spot); 'path="./"' provides the path to the directory where the intensity
# files are located and 'sep="\t"' specifies the field separator.
RG < read.maimages(targets$FileName, annotation="My_spot_labels", columns=list(Rf="Rmean", Gf="Gmean", Rb="morphR", Gb="morphG"))
# If an intensity data file format is not supported then one can specify the corresponding column names as shown here. The
# provided example is equivalent to the previous command with the exception of the additional argument: annotation="My_spot_labels".
# This argument allows the import of a spot ID or annotation column. This can be useful for analyzing arrays with unavailable
# printing layout or incomplete intensity data files. In those cases, all printtip specific functions of the following steps
# will not work (e.g. use 'loess' normalizaten instead of 'printtiploess').
RG_combined < cbind(RG1, RG2)
# One can combine different 'RGList' objects (e.g. RG1, RG2) into one large object. The command
# 'RG[,1]' shows only the first array, while 'RG[1:100,]' gives the data of the first 100 genes.
##########################
## Spot quality weights ##
##########################
RG < read.maimages(targets$FileName, source="spot", wt.fun=wtarea(100))
# Assigns spot quality weights from 0 to 1 to each spot and writes information as 'weight' component into 'RGList' object.
# Provided example with 'wt.fun=wtarea(100)' gives full weight of 1 to spots with 100 pixels. Spots with zero pixels or
# twice the ideal size are given weight of 0. All spots with weight of 0 are considered as flagged and limma will ignore
# them in the subsequent analysis. The appropriate way of computing quality weights depends on the image analysis software.
# The command '?QualityWeight' provides more information on available weight functions.
#################################
## Gene lists and print layout ##
#################################
RG$genes < readGAL()
# The output of some image analysis programs contains intensity, gene name and print layout information all in one file. In
# the cases of the image analysis programs SPOT, GenePix, ArrayVision and others, this information is located in a separate file
# from where it needs to be written into the 'RGList' using the 'readGal' and 'getLayout' functions. The function 'readGal'
# reads this information by default from a *.gal file in the working directory. This file contains the columns: 'Block', 'Column',
# 'Row', 'ID' and 'Name' (optional). If such a file is not provided then one can create it in R using the printing layout information
# of a given array. Example of an array with 4 by 8 subgrids (pins) each with 20 by 21 spots:
x < 1:32; c1 < c(); for(i in x) c1 < c(c1, rep(i, times=420)); x < 1:21; c2 < c(); for(i in x) c2 < c(c2, rep(i, times=20)); c3 < rep(1:21, times=20); test.gal < data.frame(Block=c1, Row=rep(c2, times=32), Column=rep(c3, times=32), ID=rep("my_ID", times=13440), Name=rep("my_Description", times=13440)); test.gal[1:20,]
RG$printer < getLayout(RG$genes)
# Extracts print layout (number of pins or subgrids) after the gene list is available and adds this information as
# $printer component to RGList.
####################
## Spot type file ##
####################
## Control spot information can be added to the RGList object with the spot type file (SpotTypes.txt). Regular expressions can
## be used here to associate this information with the RG$genes component of the RGList. The incorporation of the spot type
## information has the advantage that controls can be highlighted in plots or separated in certain analysis steps. The format
## of this file is specified in the limma pdf manual.
spottypes < readSpotTypes("SpotTypes.txt") # Reads in the file SpotTypes.txt.
RG$genes$Status < controlStatus(spottypes, RG) # Adds a 'Status' column to the RG$genes component in RGList.
(2) Quality plots
Recommended quality exploration steps are the imageplot() function of the raw logratios and an MAplot ( plotMA) of the raw data for each array.
spottypes < readSpotTypes("SpotTypes.txt") # Same as above.
RG$genes$Status < controlStatus(spottypes, RG) # Same as above.
plotMA(RG, array=1)
# Creates MAplot which is a plot of the logintensity ratios (Mvalues) versus the logintensity averages of both
# channels (Avalues). To display plots for all arrays next to each other, one can use the layout function 'par(mfrow=c(2,2))'
# and plot the different arrays in the same plotting device by changing the array numbers in the argument 'array=1'.
imageplot(log2(RG$R[,1]), RG$printer, low="white", high="red")
# Creates an image of color shades that represent the values of a statistic in the microarray printing format. This function
# can be used to explore spatial effects across the microarray.
plotPrintTipLoess(RG) # Coplots unnormalized RG data in form of MAplots with the loess curves for each printtip.
(3) Normalization
Limma integrates several within and betweenarray normalization methods. The commands ?normalizeWithinArrays and ?normalizeBetweenArrays provide information about the different methods. The method printtiploess is the default. Use 'loess' instead if your arrays have no printtip groups (e.g. Agilent) or have less than 150 spots per printtip.
MA < normalizeWithinArrays(RG, method="printtiploess")
# Background corrects and normalizes the expression logratios of the RGList and assigns the results to a MAList
# object. The default background correction method is bc.method="subtract". Use bc.method="none" to ignore the background
# in the analysis. If the quality weights should be ignored in this step, add to the command the 'weights=NULL' argument.
# For analyzing arrays with unavailable printing layout or incomplete intensity data files, one should use here a printtip
# unspecific normalization method such as 'loess'.
plotPrintTipLoess(MA)
# Coplots normalized MA data in form of MAplots with loess curves for each printtip. Compare results with unnormalized
# RG data from above.
boxplot(as.data.frame(MA$M))
# Creates boxandwhisker plot of MA values. This is a graphical summary of the MA distribution between the arrays. The two
# hinges in the middle represent the first and third quartile, the lines (whiskers) represent the largest and smallest
# observation in a distance of 1.5 times the box height and everything beyond that is printed as extreme values in form of dots.
# Inconsistent spreads of hinges and whiskers between arrays can indicate normalization issues. Betweenarray normalization may
# improve this situaltion (see limma MAnual).
MA < normalizeBetweenArrays(MA)
# If boxandwhisker plot indicates different spreads of Mvalues, as in the above example, then one can use this secondary
# betweenarray normalization step.
(4) Background correction
The above normalizeWithinArrays function performs by default a background correction by subtracting the background values from the expression intensities.
RGb < backgroundCorrect(RG, method="subtract")
MA < normalizeWithinArrays(RGb)
# These two commands perform the same operation as the above 'normalizeWithinArrays' command.
RG < backgroundCorrect(RG, method="normexp", offset=50)
# This more advanced background correction method 'normexp' adjusts the expression values adaptively for background
# values and results in strictly positive expression values. This method is particularly effective for obtaining more
# robust ratios for low expressed genes.
(5) Linear models and differential expression of dual color data
Limma provides advanced statistical methods for linear modelling of microarray data and for identifying differentially expressed genes. The approach fits a linear model to the data and uses an empirical Bayes method for assessing differential expression. It is described in Smyth 2004. One or two experiment definition matrices need to be specified during the analysis: a design matrix defining the RNA samples and a contrast matrix (optional for simple experiments) defining the comparisons to be performed. How to set up these matrices is described in the limma manual and on this page from Natalie Thorne. Another useful manual is Advanced Linear Modeling and TimeSeries Analysis.
cDNA data set with common reference design
The swirl experiment is a classical one sample test where a mutant is compared to a WT sample with four replicates that include 2 dye swaps. The following commands will work with the objects obtained in the above steps.
design < c(1,1,1,1) # Creates appropriate design matrix.
fit < lmFit(MA, design)
# Fits a linear model for each gene based on the given series of arrays. The returned list object 'fit' contains the average
# Mvalue ($coefficients) for each gene and their standard deviations ($sigma). The ordinary tstatistics can be computed
# with this command: ordinary.t < fit$coef / fit$stdev.unscaled / fit$sigma. However, the empirical Bayes moderated tstatistics
# should be the preferred method for limma users (see below).
plotMA(fit); abline(0,0,col="blue") # Plots MAplot with baseline.
fit < eBayes(fit)
# Computes empirical Bayes statistics for differential expression. This moderated tstatistics uses standard deviations
# that have been shrunk towards a pooled standard deviation value.
qqt(fit$t,df=fit$df.prior+fit$df.residual,pch=16,cex=0.2); abline(0,1)
# Plots the quantiles of a data sample against the theoretical quantiles of a Student's t distribution.
options(digits=3) # Sets the number of digits to print in numeric output to 3 digits.
topTable(fit, adjust="fdr", sort.by="B", number=10)
# Generates list of top 10 ('number=10') differentially expressed genes sorted by Bvalues ('sort.by=B'). The summary table
# contains the following information: logFC is the log2fold change, AveExpr is the average expression value accross all
# arrays and channels, the moderated tstatistic (t) is the logFC to its standard error, the P.Value is the associated pvalue,
# the adj.P.Value is the pvalue adjusted for multiple testing and the Bvalue (B) is the empirical Bayes logodds of differential
# expression (thehigherthebetter). Usually one wants to base gene selection on the adj.P.Value rather than the t or Bvalues.
# More details on this can be found in the limma PDF manual (type 'limmaUsersGuide()') or on this FAQ page. If the swirl data set was
# normalized with the 'printtiploess' method and the betweenarray method, then the top downregulated gene is BMP2 which is the
# mutated gene of this sample.
x < topTable(fit, adjust="fdr", sort.by="P", number=100); x[x$adj.P.Val < 0.01,]
# Filters out candidates that have Pvalues < 0.01.
plotMA(fit); ord < order(fit$lods, decreasing=TRUE); top30 < ord[1:30]; text(fit$Amean[top30], fit$coef[top30], labels=fit$genes[top30, "Name"], cex=0.8, col="blue") # Plots MAplot and highlights 30 top changes.
Limma: Affymetrix Arrays
To enable the following analysis steps, users need to save and extract this archive of 6 cel files into their R working directory. These sample files are from the Cold Stress Time Course of the AtGenExpress site ( ftp batch download). This affy_targets.txt file is required to select specific cel files during the analysis. For a quick demonstration of the analysis of this data set, one can copy & paste the following commands into the R terminal.
library(limma) # Loads limma library.
targets < readTargets("affy_targets.txt") # Reads targets information from file 'affy_targets.txt' and assigns it to targets frame.
library(affy); data < ReadAffy(filenames=targets$FileName) # Reads CEL files (specified in 'targets') into AffyBatch object.
eset < rma(data) # Normalizes data with 'rma' function and assigns them to ExpressionSet object (see above).
# exprs(eset) < log2(exprs(eset))
# If eset contains absolute intensity values like MAS5 results, then they should be transformed to log2 (or loge) values for
# limma. RMA/GCRMA generate log2 values and MAS5 produces absolute values.
pData(eset) # Lists the analyzed file names.
write.exprs(eset, file="affy_all.txt")
# Exports all affy expression values to tab delimited text file. The MAS 5.0 P/M/A calls can be retrieved with the simpleaffy
# package or with the affy package like this: 'eset < mas5calls(data); write.exprs(eset, file="my_PMA.txt")'.
design < model.matrix(~ 1+factor(c(1,1,2,2,3,3)))
# Creates appropriate design matrix. Alternatively, such a design matrix can be created in any spreadsheet program and then
# imported into R.
colnames(design) < c("group1", "group2", "group3") # Assigns column names.
fit < lmFit(eset, design) # Fits a linear model for each gene based on the given series of arrays.
contrast.matrix < makeContrasts(group2group1, group3group2, group3group1, levels=design)
# Creates appropriate contrast matrix to perform all pairwise comparisons. Alternatively, such a contrast matrix can
# be created in any spreadsheet program and then imported into R. For complex experiments one can also use this function
# to compute a contrast matrix with all possible pairwise comparisons.
fit2 < contrasts.fit(fit, contrast.matrix) # Computes estimated coefficients and standard errors for a given set of contrasts.
fit2 < eBayes(fit2)
# Computes moderated tstatistics and logodds of differential expression by empirical Bayes shrinkage of the standard
# errors towards a common value.
topTable(fit2, coef=1, adjust="fdr", sort.by="B", number=10)
# Generates list of top 10 ('number=10') differentially expressed genes sorted by Bvalues ('sort.by=B') for each of the three
# comparison groups ('coef=1') in this sample set. The summary table contains the following information: logFC is the
# log2fold change, the AveExpr is the average expression value accross all arrays and channels, the moderated tstatistic (t)
# is the logFC to its standard error, the P.Value is the associated pvalue, the adj.P.Value is the pvalue adjusted for multiple
# testing and the Bvalue (B) is the logodds that a gene is differentially expressed (thehigherthebetter). Usually one
# wants to base gene selection on the adjusted Pvalue rather than the t or Bvalues. More details on this can be found in
# the limma PDF manual (type 'limmaUsersGuide()') or on this FAQ page.
write.table(topTable(fit2, coef=1, adjust="fdr", sort.by="B", number=50000), file="limma_complete.xls", row.names=F, sep="\t")
# Exports complete limma statistics table for first comparison group ('coef=1') to tab delimited text file.
results < decideTests(fit2, p.value=0.05); vennDiagram(results)
# Creates venn diagram of all changed genes with pvalue equal or less than 0.05.
x < topTable(fit2, coef=1, adjust="fdr", sort.by="P", number=50000); y < x[x$adj.P.Val < 0.05,]; y; print("Number of genes in this list:"); length(y$ID)
# Filters out candidates that have Pvalues < 0.05 in each group ('coef=1') and provides the number of candidates for each
# list. These numbers should be identical with the sum of the values in each circle of the above venn diagram.
x < topTable(fit2, coef=1, adjust="fdr", sort.by="P", number=50000); y < x[x$adj.P.Val < 0.01 & (x$logFC > 1  x$logFC < 1) & x$AveExpr > 10,]; y; print("Number of genes in this list:"); length(y$ID)
# Same as above but with complex filter: Pvalue < 0.01 AND at least 2fold change AND expression value A > 10.
results < decideTests(fit2, p.value=0.000005); heatDiagram(results, fit2$coef, primary=1)
# This function plots heat diagram gene expression profiles for genes which are significantly differentially expressed in the
# primary condition (this is not a cluster analysis heat map). Genes are sorted by differential expression under the primary
# condition. The argument 'primary=1' selects the first contrast column in the 'results' matrix as primary condition. The plotted
# genes can be extracted like this 'results[results[,1]==1,]'. More information on this function can be found in the limma manual.
For visualizing the analyzed affy data, use the same visualization and quality control steps as described in the affy package section.
RankProd
The RankProd
package contains functions for the identification of differentially
expressed genes using the rank product nonparametric method from Breitling et al., 2004.
It generates a list of up or downregulated genes based on the
estimated percentage of false positive predictions (pfp), which is also
known as false discovery rate (FDR). The attraction of this method is
its ability to analyze data sets from different origins (e.g.
laboratories) or variable environments.
Required data objects:
 'data' object of type matrix or data frame containing expression values in log2 scale.
 'cl' vector of length ncol(data) with class lables of samples/treatments.
 'origin' vector of length ncol(data) with origin labels of samples.
The origin vector is only required for analyzing data from multiple
origins!
(1) Differential expression analysis with Affymetrix data from single origin
library(RankProd) # Loads the required library.
data(arab) # Loads sample data matrix or data frame 'arab' that contains log2 intensities of 500 genes (rows) for 10 samples (columns).
my_expr_matrix < exprs(my_eset)
# When the analysis is started from Affy Cel files, one can create the required expression matrix or data frame for input
# into RankProd from an ExpressionSet object with the 'exprs' function.
arab.sub < arab[, which(arab.origin == 1)] # Generates subdata set for single origin analysis.
cl < arab.cl[which(arab.origin == 1)] # Generates 'cl' vector with classes (contrast groups) labeled with '0' and '1'.
RP.out < RP(arab.sub, cl, num.perm = 100, logged = TRUE, na.rm = FALSE, plot = FALSE, gene.names = arab.gnames, rand = 123)
# Performs rank product analysis for singleorigin data. If the input data are not log transformed use 'logged=FALSE'.
topGene(RP.out, cutoff = 0.05, logged = TRUE, logbase = 2, gene.names = arab.gnames)
# The function 'topGene' outputs identified genes with userspecified filters: 'cutoff': pfp threshold; 'num.gene': number
# of top genes identified; 'gene.names': vector with Affy IDs in order of input data set. For log2 scaled data use the
# arguments 'logged=TRUE' and 'logbase=2'. The generated 'Table 1' lists the upregulated genes and the 'Table 2' the
# downregulated genes. The four columns in each table contain the following information: (1) 'gene index' = position in
# original data set, (2) 'RP/Rsum' = rank product, 'FC' = fold change of averaged expression levels, 'pfp' = estimated pfp or FDR.
plotRP(RP.out, cutoff = 0.05)
# Plots pfp (FDR) versus number of identified up and downregulated genes. Genes within the specified cutoff range are plotted in red.
(2) Differential expression analysis with Affymetrix data from multiple origins
cl < arab.cl
# Assigns 'cl' vector with classes (contrast groups) labeled with '0' and '1'. Object 'arab.cl' is part of sample
# data set 'arab'.
origin < arab.origin # Assigns 'origin' vector with origin labels of samples. Object 'arab.origin' is part of sample data set 'arab'.
RP.adv.out < RPadvance(arab, cl, origin, num.perm = 100, logged = TRUE, gene.names = arab.gnames, rand = 123)
# The function 'RPadvance' performs the rank product analysis for multipleorigin data that requires the above 'origin' vector.
topGene(RP.adv.out, cutoff = 0.05, logged = TRUE, logbase = 2, gene.names = arab.gnames) # Generates the 'topGene' list (see above).
(3) Differential expression analysis with cDNA arrays: see vignette RankProd
SAM
SAM is part of the 'siggenes' package.
library(siggenes) # Loads siggenes package.
?sam # Opens help page for SAM.
sam(data,cl....) # Basic syntax for running SAM.
R/maaova
Digital Gene Expression (DGE)
Additional Dual Color Array Packages
Bioconductor provides various additional packages for the analysis of dualcolor cDNA arrays.
Marray
Exploratory analysis for twocolor spotted microarray data...

Chromosome maps
Several BioConductor packages are available for displaying genomic
information graphically. The following commands illustrate some of the
chromosome plotting utilities from the geneplotter package. Much more detailed information on this topic is available in the NG Sequence Manual.
library(annotate); library(geneplotter); library(hgu95av2); newChrom < buildChromLocation("hgu95av2"); newChrom; cPlot(newChrom) # This displays all genes on the chromosomes of an organisms. Genes encoded by the antisense strands are represented by lines below the chromosomes.
data(sample.ExpressionSet); myeset < sample.ExpressionSet; cColor(featureNames(sample.ExpressionSet), "red", newChrom) # This highlights in the above plot a set of genes of interest in red color (e.g. expressed genes of an experiment).
cPlot(newChrom,c("1","2"), fg="yellow", scale="relative"); cColor(featureNames(myeset), "red", newChrom) # Plots just a specific set of chromosomes.

Gene Ontologies
General
Several packages are available for Gene Ontology (GO) analysis. The
following examples of the different packages use the GO annotations from
Arabidopsis.
 Basic GO usage with GOstats
###################################
## Basic usage of GO information ##
###################################
library(GOstats); library(GO.db); library(ath1121501.db); library(annotate) # Loads the required libraries.
goann < as.list(GOTERM) # Retrieves full set of GO annotations.
zz < eapply(GOTERM, function(x) x@Ontology); table(unlist(zz)) # Calculates the number of annotations for each ontology category.
?GOTERM # To find out, how to access the different GO components.
GOTERM$"GO:0003700"; GOMFPARENTS$"GO:0003700"; GOMFCHILDREN$"GO:0003700"
# Shows how to print out the GO annotations for one entry and how to retrieve its direct parents and children.
GOMFANCESTOR$"GO:0003700"; GOMFOFFSPRING$"GO:0003700" # Prints out complete lineages of parents and children for a GO ID.
goterms < unlist(eapply(GOTERM, function(x) x@Term)); goterms[grep("molecular_function", goterms)]
# Retrieves all GO terms and prints out only those matching a search string given in the grep function. The same can
# be done for the definition field with 'x@Definition'. A set of GO IDs can be provided as well: goterms[GOMFANCESTOR$"GO:0005507"]
go_df < data.frame(GOID=unlist(eapply(GOTERM, function(x) x@GOID)), Term=unlist(eapply(GOTERM, function(x) x@Term)), Ont=unlist(eapply(GOTERM, function(x) x@Ontology)))
# Generates data frame of the commonly used GO components: GOID, GO Term and Ontology Type.
affyGO < eapply(ath1121501GO, getOntology, "MF"); table(sapply(affyGO, length))
# Retrieves MF GO terms for all probe IDs of a chosen Affy chip and calculates how many probes have multiple GO terms
# associated. Use "BP" and "CC" arguments to retrieve BP/CC GO terms.
affyGOdf < data.frame(unlist(affyGO)); affyGOdf < data.frame(AffyID=row.names(affyGOdf), GOID=affyGOdf[,1]); affyGOdf < merge(affyGOdf, go_df, by.x="GOID", by.y="GOID", all.x=T)
# Converts above MF list object into a data frame. The AffyID occurence counts are appended to AffyIDs. The last step
# merges the two data frames: 'affyGOdf' and 'go_df'.
unique(lookUp("GO:0004713", "ath1121501", "GO2ALLPROBES")) # Retrieves all Affy IDs that are associated with a GO node.
z < affyGO[c("254759_at", "260744_at")]; as.list(GOTERM)[z[[1]]]
# Retrieves GO IDs for set of Affy IDs and then the corresponding GO term for first Affy ID.
a < data.frame(unlist(z)); a < data.frame(ID=row.names(a), a); b < data.frame(goterms[as.vector(unlist(z))]); b < data.frame(ID=row.names(b), b); merge(b, a, by.x = "ID", by.y="unlist.z.")
# Merges Affy ID, GO ID and GO annotation information.
affyEv < eapply(ath1121501GO, getEvidence); table(unlist(affyEv, use.names = FALSE))
# Provides evidence code information for each gene and summarizes the result.
test1 < eapply(ath1121501GO, dropECode, c("IEA", "NR")); table(unlist(sapply(test1, getEvidence), use.names = FALSE))
# This example shows how one can remove certain evidence codes (e.g. IEA, IEP) from the analysis.
##############################################
## GO term enrichment analysis with GOstats ##
##############################################
## Example of how to test a sample set of probe set keys for overrepresentation of GO terms using a hypergeometric distribution
## test with the function hyperGTest(). For more information, read the GOstatsHyperG manual.
library(ath1121501.db);
library(ath1121501cdf)
affySample < c("266592_at", "266703_at", "266199_at", "246949_at", "267370_at", "267115_s_at", "266489_at", "259845_at", "266295_at", "262632_at")
geneSample < as.vector(unlist(mget(affySample, ath1121501ACCNUM, ifnotfound=NA)))
affyUniverse < ls(ath1121501cdf)
geneUniverse < as.vector(unlist(mget(affyUniverse, ath1121501ACCNUM, ifnotfound=NA)))
params < new("GOHyperGParams", geneIds = geneSample, universeGeneIds = geneUniverse, annotation="ath1121501", ontology = "MF", pvalueCutoff = 0.5, conditional = FALSE, testDirection = "over")
hgOver < hyperGTest(params)
summary(hgOver)
htmlReport(hgOver, file = "MyhyperGresult.html")

GOHyperGAll
GOHyperGAll function: To test a sample population of genes for overrepresentation of GO terms, the function 'GOHyperGAll' computes for all GO nodes a hypergeometric distribution test and returns the corresponding raw and Bonferroni corrected pvalues (notes about implementation). A subsequent filter function performs a GO Slim analysis using default or custom GO Slim categories. The method has been published in Plant Physiol (2008) 147, 4157). GOHyperGAll provides similar utilities as the hyperGTest function in the GOstats package from BioConductor. The main difference is that GOHyperGAll simplifies the usage of custom chiptogene and genetoGO mappings.
To demo the utility of the GOHyperGAll function, run it as shown below. The initial sample data generation step takes some time (~10 min), since it needs to generate the required data objects for all three ontologies. This needs to be done only once for every custom genetoGO annotation.
#############################################################################
## (1.1) Import all required functions with the following source() command ##
#############################################################################
source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/GOHyperGAll.txt")
###########################################################
## (2.1) Using genetoGO mappings from geneontology.org ##
###########################################################
readGOorg(myfile = "gene_association.tair", colno = c(5,11,9), org = "Arabidopsis"); gene2GOlist(rootUK=T)
# Download the required annotation table from geneontology.org and unzip it. Then point the 'readGOorg()' function to
# this file name. The two functions generate 4 data frames with the assigned genetoGO mappings and 3 lists containing
# the genetoGOOFFSPRING associations. When the processes are completed, 6 files will be saved in your working directory!
# They can be reloaded in future R sessions with the 'load' command below. If the argument 'rootUK' is set to TRUE, then
# the root nodes are treated as terminal nodes to account for the new assignment of unknown genes to the root nodes.
###############################################
## (2.2) Using genetoGO mappings from Bioc ##
###############################################
## Note: users should execute either step (2.1) or (2.2), but not both!
sampleDFgene2GO(lib="ath1121501.db"); gene2GOlist(rootUK=T)
# Similar as above, but the genetoGO mappings are obtained from BioC. The generated 4 sample data frame and 3
# list objects can be reloaded in future R sessions with the 'load' command below.
######################################################################
## (2.3) Obtain AffyIDtoGeneID mappings when working with AffyIDs ##
######################################################################
AffyID2GeneID(map = "ftp://ftp.arabidopsis.org/home/tair/Microarrays/Affymetrix/affy_ATH1_array_elements20101220.txt")
# When working with AffyIDs, this function creates a AffyIDtoGeneID mapping data frame using by default the TAIR
# mappings for the Arabidopsis ATH1 chip. To use the function for the mappings of other chips, one needs to create the
# corresponding decoding data frame 'affy2locusDF'.
############################################################
## (3.1) Reloading required data objects from local files ##
############################################################
loadData(); load(file="MF_node_affy_list"); load(file="BP_node_affy_list"); load(file="CC_node_affy_list")
# This step makes future sessions much faster, since it allows to skip the previous data generation steps (2.12.3).
# A sample data set is available here: ArabSampleGOHyperGAll.zip (Jan 2010).
##########################################
## (3.2) Obtain a sample set of GeneIDs ##
##########################################
test_sample < unique(as.vector(GO_MF_DF[1:40,2])) # When working with GeneIDs.
test_sample < AffyID2GeneID(affyIDs=affy_sample, probe2gene=1)
# When working with AffyIDs, one can use the function 'AffyID2GeneID' to obtain for a set of AffyIDs their corresponding
# GeneIDs from the data frame 'affy2locusDF' (see above). For probe sets that match several loci, only the first locus
# ID will be used if the argument 'probe2gene' is set to 1. To demo the function, one can use the following sample
affy_sample < c("266592_at", "266703_at", "266199_at", "246949_at", "267370_at", "267115_s_at", "266489_at", "259845_at", "266295_at", "262632_at") # AffyID sample
##########################################################################
## (4.1) Perform phyper test, goSlim subsetting and plotting of results ##
##########################################################################
GOHyperGAll_result < GOHyperGAll(gocat="MF", sample=test_sample, Nannot=2); GOHyperGAll_result[1:10,8]
# The function 'GOHyperGAll()' performs the phyper test against all nodes in the GO network. It returns raw and
# adjusted pvalues. The Bonferroni correction is used as pvalues adjustment method according to Boyle et al, 2004 (online).
# The argument 'Nannot' defines the minimum number of direct annotations per GO node from the sample set to determine
# the number of tested hypotheses for the pvalue adjustment. The argument 'gocat' can be assigned the values "MF", "BP"
# and "CC". Omitting the '8' delimiter will provide the sample keys matching at every GO node.
subset < GOHyperGAll_Subset(GOHyperGAll_result, sample=test_sample, type="goSlim"); subset[,8]
# The function 'GOHyperGAll_Subset()' subsets the GOHyperGAll results by assigned GO nodes or custom goSlim categories.
# The argument 'type' can be assigned the values "goSlim" or "assigned". The optional argument 'myslimv' can be used to
# provide a custom goSlim vector. Omitting the '8' delimiter will show the sample keys matching at every GO node.
pie(subset[subset$SampleMatch>0 ,3], labels=as.vector(subset[subset$SampleMatch>0 ,1]), main=unique(as.vector(subset[subset$SampleMatch>0, 7]))) # Plots pie chart of subsetted results.
##############################################################
## (4.2) Reduce GO Term Redundancy in 'GOHyperGAll_results' ##
##############################################################
simplifyDF < GOHyperGAll_Simplify(GOHyperGAll_result, gocat="MF", cutoff=0.001, correct=T)
# The result data frame 'GOHyperGAll_result' often contains several connected GO terms with significant scores which
# can complicate the interpretation of large sample sets. To reduce the redundancy, the function 'GOHyperGAll_Simplify'
# subsets the data frame 'GOHyperGAll_result' by a user specified pvalue cutoff and removes from it all GO nodes with
# overlapping children sets (OFFSPRING), while the best scoring nodes remain in the data frame.
data.frame(GOHyperGAll_result[GOHyperGAll_result[,1] %in% simplifyDF[,1], 8], GO_OL_Match=simplifyDF[,2])
# This command returns the redundancy reduced data set. The column 'GO_OL_Match' provides the number of accessions that
# match the connected nodes.
################################################
## (4.3) Batch Analysis of Many Gene Clusters ##
################################################
BatchResult < GOCluster_Report(CL_DF=CL_DF, method="all", id_type="gene", CLSZ=10, cutoff=0.001, gocats=c("MF", "BP", "CC"), recordSpecGO=c("GO:0003674", "GO:0008150", "GO:0005575"))
# The function 'GOCluster_Report' performs the three GO analyses in batch mode: 'GOHyperGAll', 'GOHyperGAll_Subset'
# or 'GOHyperGAll_Simplify'. It processes many groups of genes (e.g. gene expression clusters) and returns the results
# conveniently organized in a single data frame. The gene sets need to be provided in a data frame of the format
# specified at the end of the GOHyperGAll script. CLSZ: minimum cluster size to consider; method: "all", "slim" or
# "simplify"; gocat: "MF", "BP" or "CC"; cutoff: adjusted pvalue cutoff; recordSpecGO: argument to include one specific
# GOID in each of the 3 ontologies, e.g: recordSpecGO=c("GO:0003674", "GO:0008150", "GO:0005575").
GSEA
Gene Set Enrichment Analysis (GSEA) The following function converts the objects, generated by the above GOHyperGAll script, into the gene set format (*.gmt) for the GSEA tool from the Broad Institute. This way the custom GO files of the GOHyperGAll method can be used as gene sets for the Java or R releases of the GSEA program.
(1) Import the required file format conversion function with the following source() command
source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/GOHyperGAll.txt") # Imports the GOhyper2GSEA() function.
GOhyper2GSEA(myfile=c("MF_node_affy_list", "BP_node_affy_list", "CC_node_affy_list"), type="all")
# Converts XX_node_affy_list or GO_XX_DF files into GO_XX_ALL.gmt and GO_XX_TERM.gmt files, respecitively. Use the argument
# "all" for XX_node_affy_list files and "terminal" for GO_XX_DF files. The former contain all GO assignments, whereas the
# latter contain only the direct GO assignments.
(2) Import the generated gene set files (*.gmt) along with the gene expression intensity data (*.txt) or preranked gene lists (*.rnk) into the Java or R GSEA program, and follow the instructions on the GSEA User Guide. A detailed description of the data formats is provided on the Data Formats page.

GOTools and goCluster
Getting started with goTools
my_list < list(L1=c("245265_at", "260744_at", "259561_at", "254759_at", "267181_at"), L2=c("256497_at", "250942_at", "265611_at", "247488_at", "255148_at")) # Creates a working list with two components each consisting of Arabidopsis Affy IDs.
library("goTools", verbose = FALSE) # Loads the required library.
ontoCompare(my_list, probeType = "ath1121501", goType="MF")
# Provides compositional differences of the molecular function GOs between the Affy ID vectors contained in my_list.
ontoCompare(my_list, probeType = "ath1121501", goType="MF", plot=TRUE) # Plots the above MF GO compositions in bar diagram.
par(mar = c(5, 8, 5, 8)); res2 < ontoCompare(my_list["L1"], probeType = "ath1121501", method="TIDS", goType="MF", plot=FALSE); ontoPlot(res2, cex = 0.7) # Plots pie chart for "L1" gene list component in my_list.

goCluster demo
library("goCluster", verbose = FALSE) # Loads the goCluster package.
testdata < eset[c(1:400),c(1:4)] # Selects 400 genes from previously generated ExpressionSet object (e.g. simpleaffy) with four chips.
test < new("goCluster") # Creates goCluster object.
testConfigured < config(test)
# The configmethod starts the configuration process and stores the results in object 'testConfigured'. The required
# configuration information needs to be provided in the interactive mode (use 'ath1121501').
setup(testConfigured)[["algo"]]; print(testConfigured) # Retrieves the generated configuration information.
testExecuted < execute(testConfigured) # Analyzes the configured object.
print(testExecuted) # Prints the above analysis object to give a result summary.
display(testExecuted, selection = 3) # Displays result in form of heat map.
testExecuted2 < testExecuted; testExecuted2@visu < new("clusterVisualHeatmap"); testExecuted2@visu < execute(testExecuted2@visu, testExecuted2); display(testExecuted2, selection = "GO:0009535") # Displays result for chosen GO identifier.

KEGG Pathway Analysis
 KEGG Pathway Script: Script to map Arabidopsis locus IDs to KEGG pathways. It can be easily adjusted to any organism.
Motif Identification in Promoter Regions
 COSMO from Oliver Bembom et al.: allows to discover motifs in unaligned DNA sequences
 rGADEM: de novo motif discovery
Phylogenetic Analysis
Cheminformatics in R
 ChemmineR is an R package for mining druglike compound and screening data sets.
Protein Structure Analysis
 Bio3D in R: utilities for the analysis of protein structure and sequence data.
MS Data Analysis
GenomeWide Association Studies (GWAS)
 SimHap: A comprehensive modelling framework and a multipleimputation approach to haplotypic analysis of populationbased data.
 GenABEL: R library for Genomewide association analysis.
 crlmm: Genotype Calling (CRLMM) and Copy Number Analysis tool for Affymetrix SNP 5.0 and 6.0 and Illumina arrays (Carvalho et al., 2009).
 GGtools: Software and data for genetical genomics.
 PLINK:
Free, opensource whole genome association analysis toolset, designed
to perform a range of basic, largescale analyses in a computationally
efficient manner.
BioConductor Exercises
Slide Show: Technology Overview ( R Code)
Install of R and required Bioconductor libraries
 Download a sample set of Affymetrix cel files
 Rightclick this link and save its content to your computer: Workshop.zip. These sample CEL files are from the GEO data set: GSE5621
 After unpacking this file archive one should see six *.cel files.
 Generate RMA expression data, MAS5 P/M/A calls and export results to Excel
library(affy); mydata < ReadAffy() # Reads cel files in current working directory into affybatch object 'mydata'.
library(ath1121501probe); print.data.frame(ath1121501probe[1:22,]) # Prints probe sequences and their positions for first two Affy IDs.
eset_rma < rma(mydata) # Generates RMA expression values and stores them as ExpressionSet.
exprs(eset_rma)[1:4,] # Prints first 4 rows in data frame structure.
# exprs(eset_rma) < 2^(exprs(eset_rma)) # If needed unlog RMA expression values.
mydf < 2^exprs(eset_rma); myList < tapply(colnames(mydf), c(1,1,2,2,3,3), list); names(myList) < sapply(myList, paste, collapse="_"); mymean < sapply(myList, function(x) rowMeans(mydf[,x])) # Generic approach for calculating mean values for any sample combination.
eset_pma < mas5calls(mydata) # Generates MAS 5.0 P/M/A calls.
my_frame < data.frame(exprs(eset_rma), exprs(eset_pma), assayDataElement(eset_pma, "se.exprs"))
# Combine RMA intensities, P/M/A calls plus their wilcoxon pvalues in one data frame.
my_frame < my_frame[, sort(names(my_frame))] # Sorts columns by cel file name.
write.table(my_frame, file="my_file.xls", sep="\t", col.names = NA) # Exports data to text file that can be imported into Excel.

 Add annotation information
library("ath1121501.db") # Loads required annotation package.
Annot < data.frame(ACCNUM=sapply(contents(ath1121501ACCNUM), paste, collapse=", "), SYMBOL=sapply(contents(ath1121501SYMBOL), paste, collapse=", "), DESC=sapply(contents(ath1121501GENENAME), paste, collapse=", ")) # Constructs a data frame containing the gene IDs, gene symbols and descriptions for all probe sets on the chip.
all < merge(Annot, my_frame, by.x=0, by.y=0, all=T) # Merges everything with above expression data.
write.table(all, file="my_annot_file.xls", sep="\t", col.names = NA) # Exports data to text file that can be imported into Excel.
 Visualization and quality control
d < cor(2^exprs(eset_rma), method="pearson"); plot(hclust(dist(1d))) # Generates a correlation matrix for allagainstall chip comparisons.
library(affyQCReport); QCReport(mydata, file="ExampleQC.pdf") # Generates a comprehensive QC report for the AffyBatch object 'mydata' in PDF format. See affyQCReport for details.
eset < eset_rma # Reassignment to work with following commands.
 Save this covdesc.txt to your R working directory.
 Generate expression data with RMA and MAS5.
 Filter each of the three data sets with the following parameters:
2fold changes, present in all 4 chips and pscore less than 0.001.
 Write the results into separate files.
 Create scatter plots for the filtered data sets and save them to external image files.
 Identify the overlap of the significant changes between the RMA and MAS5 data.
 Perform simpleaffy QC checks: scaling factor, percent present calls, etc.
 Identify differtially expressed genes (DEGs) with the limma package
 Follow the instructions in the Limma Section for Affy Data.
 cDNA microarray users can save and extract the SWIRL
cDNA microarray sample data. For a quick demonstration of the analysis
of this data set, one can copy&paste or source the following
commandline summary into the R terminal: my_swirl_commands.txt.
 Test gene sets for overrepresented GO terms using the GOHyperGAll script
source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/GOHyperGAll.txt") # Imports the GOHyperGAll functions.
loadData(); load(file="MF_node_affy_list"); load(file="BP_node_affy_list"); load(file="CC_node_affy_list")
# Loads the downloaded annotation objects for Arabidopsis.
GOHyperGAll_result < GOHyperGAll(gocat="MF", sample=unique(as.vector(GO_MF_DF[1:40,2])), Nannot=2); GOHyperGAll_result[1:10,8]
# Performs the enrichment test for provided set of gene identifiers.
CL_DF < data.frame(geneID=GO_MF_DF[1:400,2], ClusterID=sort(rep(1:4,100)), ClusterSize=rep(100,400))
# Create sample data set for batch processing.
BatchResult < GOCluster_Report(CL_DF=CL_DF, method="all", id_type="gene", CLSZ=10, cutoff=0.001, gocats=c("MF", "BP", "CC"), recordSpecGO=c("GO:0003674", "GO:0008150", "GO:0005575")); BatchResult[1:4,10]
# Performs all three GO analyses for many sample set at once. When the method argument is set to "slim" then the goSlim method is used.
write.table(BatchResult, file="GO_Batch.xls", sep="\t", col.names = NA)
# Exports batch result to Excel file.
 Clustering of differentially expressed genes
my_fct < function(x) hclust(x, method="complete") # Defines function to perform hierarchical clustering (complete linkage).
heatmap(as.matrix(2^exprs(eset)[1:40,]), col = cm.colors(256), hclustfun=my_fct) # Plots heatmap with dendrograms.
# Replace in last step 'exprs(eset)[1:40,]' by matrix of differentially expressed genes from limma analysis.
 Convert the above commands into an R script and execute it with the source() function
source("myscript.R") # Executes all of the above commands and generates the corresponding output files in the current working directory.
Clustering and Data Mining in R
Introduction
[ Slide Show ]
R contains many functions and libraries for clustering of large data
sets. A very useful overview of clustering utilities in R is available
on the Cluster Task Page and for machine learning algorithms on the Machine Learning Task Page and the MLInterfaces package.
Data Preprocessing
Generate a sample data set
y < matrix(rnorm(50), 10, 5, dimnames=list(paste("g", 1:10, sep=""), paste("t", 1:5, sep=""))) # Creates a sample data matrix.
Data centering and scaling
scale(t(y)); yscaled < t(scale(t(y))); apply(yscaled, 1, sd)
# The function scale() centers and/or scales numeric matrices columnwise. When used with its default settings, the
# function returns columns that have a mean close to zero and a standard deviation of one. For rowwise scaling the
# data set needs to be transposed first using the t() function. Another transposing step is necessary to return the
# data set in its original orientation (same row/column assignment). Centering and scaling are common data transformation
# steps for many clustering techniques.
Obtain a distance matrix
dist(y, method = "euclidean")
# Computes and returns a distance matrix for the rows in the data matrix 'y' using the specified distance measure, here 'euclidean'.
c < cor(t(y), method="spearman"); d < as.dist(1c); d
# To obtain correlationbased distances, one has to compute first a correlation matrix with the cor() function. Since this
# step iterates across matrix columns, a transpose 't()' step is required to obtain row distances. In a second step the distance
# matrix object of class "dist" can be obtained with the as.dist() function.
Hierarchical Clustering (HC)
The basic hierarchical clustering functions in R are hclust, flashClust, agnes and diana. Hclust and agnes perform agglomerative hierarchical clustering, while diana performs divisive hierarchical clustering. flashClust is a highly speed improved (50100 faster) version of hclust. The pvclust
package can be used for assessing the uncertainty in hierarchical
cluster analyses. It provides approximately unbiased pvalues as well as
bootstrap pvalues. As an introduction into R's standard hierarchical
clustering utilities one should read the help pages on the following
functions: hclust, dendrogram, as.dendrogram, cutree and heatmap.
An example for subclustering (subsetting) heatmaps based on selected
tree nodes is given in the last part of this section (see zoom into heatmaps).
Clustering with hclust
y < matrix(rnorm(50), 10, 5, dimnames=list(paste("g", 1:10, sep=""), paste("t", 1:5, sep=""))) # Creates a sample data set.
c < cor(t(y), method="spearman"); d < as.dist(1c); hr < hclust(d, method = "complete", members=NULL)
#
This is a short example of clustering the rows of the generated sample
matrix 'y' with hclust. Seven different cluster joining
# methods can be
selected with the 'method' argument: ward, single, complete, average,
mcquitty, median and centroid. The
# object returned by hclust is a list
of class hclust which describes the tree generated by the clustering
process with the
# following components: merge, height, order, labels,
method, call and dist.method.
par(mfrow = c(2, 2)); plot(hr,
hang = 0.1); plot(hr, hang = 1)
# The generated tree can be plotted
with the plot() function. When the hang argument is set to '1' then all
leaves end on
# one line and their labels hang down from 0. More details
on the plotting behavior is provided in the hclust help document
(?hclust).
plot(as.dendrogram(hr), edgePar=list(col=3, lwd=4),
horiz=T)
# To plot trees horizontally, the hclust tree has to be
transformed into a dendrogram object.
unclass(hr) # Prints the full content of the hclust object.
str(as.dendrogram(hr)) # Prints dendrogram structure as text.
hr$labels[hr$order] # Prints the row labels in the order they appear in the tree.
par(mfrow=c(2,1));
hrd1 < as.dendrogram(hr); plot(hrd1); hrd2 < reorder(hrd1,
sample(1:10)); plot(hrd2); labels(hrd1); labels(hrd2)
# Example to
reorder a dendrogram and print out its labels.
library(ctc);
hc2Newick(hr)
# The 'hc2Newick' function of the Bioc 'ctc' library can
convert a hclust object into the Newick tree file format for export
# into
external programs.
Hierarchical Clustering Dendrogram Generated by hclust
Tree subsetting (see also Dynamic Tree Cut package)
mycl < cutree(hr, h=max(hr$height)/2); mycl[hr$labels[hr$order]]
# Cuts tree into discrete cluster groups (bins) at specified height in tree that is specified by 'h' argument.
# Alternatively, one can cut based on a desired number of clusters that can be specified with the 'k' argument.
plot(hr); rect.hclust(hr, k=5, border="red") # Cuts dendrogram at specified level and draws rectangles around the resulting clusters.
subcl < names(mycl[mycl==3]); subd < as.dist(as.matrix(d)[subcl,subcl]); subhr < hclust(subd, method = "complete"); par(mfrow = c(1, 2)); plot(hr); plot(subhr)
# This example shows how one can subset a tree by cutting it at a given branch point provided by the cutree() function.
# The identified row IDs are then used to subset the distance matrix and recluster it with hclust.
lapply(as.dendrogram(hr), unlist); lapply(as.dendrogram(hr)[[1]], unlist)
# Returns the indices of the cluster members of the two main branches. The second step returns the members of the next two
# descendent clusters of the first main branch.
example(hclust)
# The last step in this example shows how the 'members' argument can be used to reconstruct a hierarchical
# tree above the cut points using the cluster centers. More details on this can be found in the hclust help document.
source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/selectNodes.R"); Uniclusters < nodes2Cluster(myhclust=hr, rootdist=2); clusterQCselect < clusterQC(clusterlist=Uniclusters, simMatrix=as.matrix(c), method="min", cutoff=0.7, output="df"); clusterQCselect
# The function 'nodes2Cluster' cuts the tree at all height levels as defined in 'hr$height' and the second function 'clusterQC'
# selects clusters based on a given minimum cutoff in an allagainstall comparison within each cluster. The 'method' argument
# allows to calculate the cutoff value with one of the following functions: "min", "qu1", "median", "mean", "qu3" and "max". The
# end result can be returned as list (output="list") or as data frame (output="df"). The argument 'rootdist' defines the starting
# distance from the root of the tree to avoid testing of very large clusters.
Tree coloring and zooming into branches
source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/dendroCol.R") # Import tree coloring function.
dend_colored < dendrapply(as.dendrogram(hr), dendroCol, keys=subcl, xPar="edgePar", bgr="red", fgr="blue", lwd=2, pch=20)
# In this example the dendrogram for the above 'hr' object is colored with the imported 'dendroCol()' function based on
# the identifiers provided in its 'keys' argument. If 'xPar' is set to 'nodePar' then the labels are colored instead of
# the leaves.
par(mfrow = c(1, 3)); plot(dend_colored, horiz=T); plot(dend_colored, horiz=T, type="tr")
plot(dend_colored, horiz=T, edgePar=list(lwd=2), xlim=c(3,0), ylim=c(1,3))
# Plots the colored tree in different formats. The last command shows how one can zoom into the tree with the 'xlim and ylim'
# arguments, which is possible since R 2.8.
z < as.dendrogram(hr); attr(z[[2]][[2]],"edgePar") < list(col="blue", lwd=4, pch=NA); plot(z, horiz=T)
# This example shows how one can manually color tree elements.
Dendrogram/Tree Coloring Examples with Zooming into Trees
Plot heatmaps with the heatmap() function
y < matrix(rnorm(50), 10, 5, dimnames=list(paste("g", 1:10, sep=""), paste("t", 1:5, sep=""))) # Creates a sample data set.
source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/my.colorFct.R")
# Imports color function to obtain heat maps in red and green. Use this color function in heatmap color argument like
# this: 'col=my.colorFct()'.
hr < hclust(as.dist(1cor(t(y), method="pearson")), method="complete") # Clusters rows by Pearson correlation as distance method.
hc < hclust(as.dist(1cor(y, method="spearman")), method="complete") # Clusters columns by Spearman correlation as distance method.
heatmap(y, Rowv=as.dendrogram(hr), Colv=as.dendrogram(hc), col=my.colorFct(), scale="row")
# The previous two hclust objects are used to obtain a heatmap where the rows are clusterd by Pearson and the columns by
# Spearman correlation distances. The 'scale' argument allows to scale the input matrix by rows or columns. The default is
# by rows and "none" turns the scaling off.
mycl < cutree(hr, h=max(hr$height)/1.5); mycol < sample(rainbow(256)); mycol < mycol[as.vector(mycl)]; heatmap(y, Rowv=as.dendrogram(hr), Colv=as.dendrogram(hc), col=my.colorFct(), scale="row", ColSideColors=heat.colors(length(hc$labels)), RowSideColors=mycol)
# The arguments 'ColSideColors' and 'RowSideColors' allow to annotate a heatmap with a color bar. In this example the row
# colors correspond to cluster bins that were obtained by the cutree() function.
dend_colored < dendrapply(as.dendrogram(hr), dendroCol, keys=c("g1", "g2"), xPar="edgePar", bgr="black", fgr="red", pch=20); heatmap(y, Rowv=dend_colored, Colv=as.dendrogram(hc), col=my.colorFct(), scale="row", ColSideColors=heat.colors(length(hc$labels)), RowSideColors=mycol)
# In addition, one can color the tree leaves or labels by providing a colored dendrogram object, here 'dend_colored' from above.
heatmap(y, Rowv=as.dendrogram(hr), Colv=as.dendrogram(hc), col=my.colorFct(), scale="row", labRow = "", labCol="")
# The arguments 'labRow' and 'labCol' allow to provide custom labels or to omit the printing of the labels as shown here.
heatmap(y, Rowv=as.dendrogram(hr), Colv=NA, col=my.colorFct())
# If 'Rowv' or 'Cowv' is set to 'NA' then the row and/or the column clustering is turned off.
ysort < y[hr$labels[hr$order], hc$labels[hc$order]]; heatmap(ysort, Rowv=NA, Colv=NA, col=my.colorFct())
# This example shows how the rows and columns of a heatmap can be sorted by the hclust trees without printing them.
x11(height=4, width=4); heatmap(matrix(rep(seq(min(as.vector(y)), max(as.vector(y)), length.out=10),2), 2, byrow=T, dimnames=list(1:2, round(seq(min(as.vector(y)), max(as.vector(y)), length.out=10),1))), col=my.colorFct(), Colv=NA, Rowv=NA, labRow="", main="Color Legend")
# Prints color legend in a separate window.
Plot heatmaps with the image() or heatmap.2() functions
image(scale(t(ysort)), col=my.colorFct())
# A very similar plot as before can be obtained by the image() function which plots the values of a given matrix as a grid
# of colored rectangles.
x11(height=12);
par(mfrow=c(1,2)); plot(as.dendrogram(hr), horiz=T, yaxt="n",
edgePar=list(col="grey", lwd=4)); image(scale(t(ysort)),
col=my.colorFct(), xaxt="n", yaxt="n")
#
To create very large heatmaps that are still readable, one can plot the
dendrogram and the heatmap of the image() function next to it.
library("gplots") # Loads the gplots library that contains the heatmap.2() function.
heatmap.2(y,
Rowv=as.dendrogram(hr), Colv=as.dendrogram(hc), col=redgreen(75),
scale="row", ColSideColors=heat.colors(length(hc$labels)), RowSideColors=mycol, trace="none", key=T, cellnote=round(t(scale(t(y))),1))
# The heatmap.2() function contains many additional plotting features
that are not available in the basic heatmap() function.
# For instance, the 'key' argument allows to add a color key within the same plot. Numeric values can be added to each cell
# with the 'cellnote' argument. In addition, heatmap.2() scales to very large formats for viewing complex heatmaps.
library(RColorBrewer); mypalette < rev(brewer.pal(9,"Blues")[1])
# Select an alternative color scheme that is more appropriate for representing continous values (e.g. chip intensities).
heatmap.2(y,
Rowv=as.dendrogram(hr), dendrogram="row", Colv=F, col=mypalette,
scale="row", trace="none", key=T, cellnote=round(t(scale(t(y))),1)) #
Example to illustrate how to turn off the row or column reordering in
heatmap.2. The settings "dendrogram="row", Colv=F" turn off the column
reordering and no column tree is shown. To turn off the reordering of
both dimensions, use the following settings: "dendrogram="none", Rowv=F,
Colv=F".
Zooming into heatmaps by subclustering selected tree nodes
y < matrix(rnorm(500), 100, 5, dimnames=list(paste("g", 1:100, sep=""), paste("t", 1:5, sep=""))) # Creates a sample data set.
hr < hclust(as.dist(1cor(t(y), method="pearson")), method="complete"); hc < hclust(as.dist(1cor(y, method="spearman")), method="complete")
# Generates row and column dendrograms.
mycl < cutree(hr, h=max(hr$height)/1.5); mycolhc < rainbow(length(unique(mycl)), start=0.1, end=0.9); mycolhc < mycolhc[as.vector(mycl)]
# Cuts the tree and creates color vector for clusters.
library(gplots); myheatcol < redgreen(75)
# Assign your favorite heatmap color scheme. Some useful examples: colorpanel(40, "darkblue", "yellow", "white");
# heat.colors(75); cm.colors(75); rainbow(75); redgreen(75); library(RColorBrewer); rev(brewer.pal(9,"Blues")[1]).
# Type demo.col(20) to see more color schemes.
heatmap.2(y, Rowv=as.dendrogram(hr), Colv=as.dendrogram(hc), col=myheatcol, scale="row", density.info="none", trace="none", RowSideColors=mycolhc)
# Creates heatmap for entire data set where the obtained clusters are indicated in the color bar.
x11(height=6, width=2); names(mycolhc) < names(mycl); barplot(rep(10, max(mycl)), col=unique(mycolhc[hr$labels[hr$order]]), horiz=T, names=unique(mycl[hr$order]))
# Prints color key for cluster assignments. The numbers next to the color boxes correspond to the cluster numbers in 'mycl'.
clid < c(1,2); ysub < y[names(mycl[mycl%in%clid]),]; hrsub < hclust(as.dist(1cor(t(ysub), method="pearson")), method="complete")
# Select subcluster number (here: clid=c(1,2)) and generate corresponding dendrogram.
x11(); heatmap.2(ysub, Rowv=as.dendrogram(hrsub), Colv=as.dendrogram(hc), col=myheatcol, scale="row", density.info="none", trace="none", RowSideColors=mycolhc[mycl%in%clid])
# Create heatmap for chosen subcluster.
data.frame(Labels=rev(hrsub$labels[hrsub$order]))
# Print out row labels in same order as shown in the heatmap.
Row and Column Trees Plotted with Heatmap of Source Data
Bootstrap Analysis in Hierarchical Clustering
The pvclust package allows to assess the uncertainty in hierarchical cluster analysis by calculating for each cluster pvalues via multiscale bootstrap resampling. The method provides two types of pvalues. The approximately unbiased pvalue (AU) is computed by multiscale bootstrap resampling. It is a less biased pvalue than than the second one, bootstrap probability (BP), which is computed by normal bootstrap resampling.
library(pvclust) # Loads the required pvclust package.
y < matrix(rnorm(500), 50, 10, dimnames=list(paste("g", 1:50, sep=""), paste("t", 1:10, sep=""))) # Creates a sample data set.
pv < pvclust(scale(t(y)), method.dist="correlation", method.hclust="complete", nboot=10)
# This step performs the hierarchical cluster analysis with multiscale bootstrap with 10 repetitions, complete linkage for
# cluster joining and a correlationbased dissimilarity matrix. Usually, one should set 'nboot' to at least 1000 bootstrap
# repetitions.
plot(pv, hang=1) # Plots a dendrogram where the red numbers represent the AU pvalues and the green ones the BP values.
pvrect(pv, alpha=0.95) # This command highlights with red rectangles all clusters in the dendrogram that have an AU value above 95%.
clsig < unlist(pvpick(pv, alpha=0.95, pv="au", type="geq", max.only=TRUE)$clusters)
# The function 'pvpick' allows to pick significant clusters.
source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/dendroCol.R")
# Import tree coloring function. The details on this function are provided in the hclust section of this manual.
dend_colored < dendrapply(as.dendrogram(pv$hclust), dendroCol, keys=clsig, xPar="edgePar", bgr="black", fgr="red", pch=20)
# Creates dendrogram object where the significant clusters are labeled in red.
plot(dend_colored, horiz=T) # Plots a dendrogram where the significant clusters are highlighted in red.
QT Clustering
QT (quality threshold) clustering is a partitioning method that forms clusters based on a maximum cluster diameter. It iteratively identifies the largest cluster below the threshold and removes its items from the data set until all items are assigned. The method was developed by Heyer et al. (1999) for the clustering of gene expression data.
library(flexclust) # Loads flexclust library.
x < matrix(10*runif(1000), ncol=2) # Creates a sample data set.
cl1 < qtclust(x, radius=3) # Uses 3 as the maximum distance of the points to the cluster centers.
cl2 < qtclust(x, radius=1) # Uses 1 as the maximum distance of the points to the cluster centers.
par(mfrow=c(2,1)); plot(x, col=predict(cl1), xlab="", ylab=""); plot(x, col=predict(cl2), xlab="", ylab="") # Plots the two cluster results.
data.frame(name_index=1:nrow(x), cl=attributes(cl1)$cluster) # Provides the rowtocluster assignment information.
qtclust Clustering Result
KMeans & PAM
Kmeans, PAM (partitioning around medoids) and clara are related
partition clustering algorithms that cluster data points into a
predefined number of K clusters. They do this by associating each data
point to its nearest centroids and then recomputing the cluster
centroids. In the next step the data points are associated with the
nearest adjusted centroid. This procedure continues until the cluster
assignments are stable. Kmeans uses the average of all the points in a
cluster for centering, while PAM uses the most centrally located point.
Commonly used R functions for Kmeans clustering are: kmeans() of the
stats package, kcca() of the flexclust package and trimkmeans() of the trimcluster package. PAM clustering is available in the pam() function from the cluster package. The clara() function of the same package is a PAM wrapper for clustering very large data sets.
Kmeans
km < kmeans(t(scale(t(y))), 3)
# Kmeans clustering is a partitioning method where the number of clusters is predefined. The basic R implementation
# requires as input the data matrix and uses Euclidean distance. In contrast to pam(), the implementation does not allow
# to provide a distance matrix. In the given example the rowwise centered data matrix is provided.
km$cluster
g1 g2 g3 g4 g5 g6 g7 g8 g9 g10 g11 g12 g13 g14 g15 g16 g17 g18 g19 g20 g21 g22 g23 g24 g25 g26 g27 g28 g29 g30 g31 g32
1 2 2 1 3 3 3 1 3 1 2 3 1 2 1 2 3 2 3 3 1 1 3 3 3 3 1 3 3 3 2 2
g33 g34 g35 g36 g37 g38 g39 g40 g41 g42 g43 g44 g45 g46 g47 g48 g49 g50
2 1 1 1 2 1 2 3 3 3 2 1 3 3 1 3 2 1
mycol < sample(rainbow(256)); mycol < mycol[as.vector(km$cluster)]; heatmap(y, Rowv=as.dendrogram(hr), Colv=as.dendrogram(hc), col=my.colorFct(), scale="row", ColSideColors=heat.colors(length(hc$labels)), RowSideColors=mycol)
# This example shows how the obtained Kmeans clusters can be compared with the hierarchical clustering results by highlighting
# them in the heatmap color bar.
PAM (partitioning around medoids)
library(cluster) # Loads the required cluster library.
y < matrix(rnorm(50), 10, 5, dimnames=list(paste("g", 1:10, sep=""), paste("t", 1:5, sep=""))) # Creates a sample data set.
pamy < pam(y, 4); pamy$clustering # Clusters data into 4 clusters using default Euclidean as distance method.
g1 g2 g3 g4 g5 g6 g7 g8 g9 g10
1 2 3 1 4 2 1 1 2 3
mydist < as.dist(1cor(t(y), method="pearson")) # Generates distance matrix using Pearson correlation as distance method.
pamy < pam(mydist, 4); pamy$clustering # Same a above, but uses provided distance matrix.
g1 g2 g3 g4 g5 g6 g7 g8 g9 g10
1 2 3 1 3 2 4 1 4 3
pam(mydist, 4, cluster.only=TRUE) # Same as before, but with faster computation.
plot(pamy) # Generates cluster plot.
Clara (clustering large applications: PAM method for large data sets)
clarax < clara(t(scale(t(y))), 4)
# Clusters above data into 4 clusters using default Euclidean as distance method. If the data set consists of gene expression
# values, then rowwise scaling of y is recommend in combination with Euclidean distances.
clarax$clustering
g1 g2 g3 g4 g5 g6 g7 g8 g9 g10
1 2 3 1 3 2 4 1 2 3
Fuzzy Clustering
In contrast to strict/hard clustering approaches, fuzzy clustering allows multiple cluster memberships of the clustered items. This is commonly achieved by partitioning the membership assignments among clusters by positive weights that sum up for each item to one. Several R libraries contain implementations of fuzzy clustering algorithms. The library e1071 contains the cmeans (fuzzy Cmeans) and cshell (fuzzy Cshell) clustering functions. And the cluster library provides the fanny function, which is a fuzzy implementation of the above described kmedoids method.
Fuzzy clustering with the cluster library
l ibrary(cluster) # Loads the cluster library.
y < matrix(rnorm(50), 10, 5, dimnames=list(paste("g", 1:10, sep=""), paste("t", 1:5, sep=""))) # Creates a sample data set.
fannyy < fanny(y, k=4, metric = "euclidean", memb.exp = 1.5)
# Computes membership coefficients (weights) for 4 clusters and stores them in the 'membership' component of the resulting
# fanny object. The corresponding hard clustering result is provided in the 'clustering' slot, which is formed by assigning
# each point to the cluster with the highest coefficient. If the distance method is set to metric="SqEuclidean", then the
# function performs fuzzy Cmeans clustering. The membership exponent can be controlled with the argument 'memb.exp'
# (default is 2). Values close to 1 give 'crispier' clustering, whereas larger values increase the fuzzyness of the results.
# If a clustering results in complete fuzzyness, then the functions returns a warning.
round(fannyy$membership, 2)
[,1] [,2] [,3] [,4]
g1 0.80 0.07 0.08 0.04
g2 0.09 0.80 0.08 0.03
g3 0.92 0.02 0.04 0.01
g4 0.76 0.09 0.10 0.04
g5 0.10 0.04 0.77 0.09
g6 0.00 0.00 0.01 0.98
g7 0.12 0.06 0.65 0.17
g8 0.13 0.16 0.29 0.41
g9 0.07 0.03 0.85 0.05
g10 0.03 0.94 0.02 0.01
fannyy$clustering
g1 g2 g3 g4 g5 g6 g7 g8 g9 g10
1 2 1 1 3 4 3 4 3 2
mydist < as.dist(1cor(t(y), method="pearson")) # Generates distance matrix using Pearson correlation as distance method.
fannyy < fanny(mydist, k=4, memb.exp = 1.5); round(fannyy$membership, 2); fannyy$clustering
# Example for a using a correlationbased distance matrix.
fannyyMA < round(fannyy$membership, 2) > 0.10; apply(fannyyMA, 1, which)
# Returns multiple cluster memberships for coefficient above a certain value (here >0.1)
$g1
[1] 1 4
$g2
[1] 2
$g3
[1] 1
$g4
[1] 3
$g5
[1] 4
$g6
[1] 1
$g7
[1] 4
$g8
[1] 2
$g9
[1] 3 4
$g10
[1] 3
Fuzzy clustering with the e1071 library
library(e1071) # Loads the e1071 library.
y < matrix(rnorm(50), 10, 5, dimnames=list(paste("g", 1:10, sep=""), paste("t", 1:5, sep=""))) # Creates a sample data set.
cmeansy < cmeans(y, 4, method="cmeans", m=1.5); round(cmeansy$membership, 2); cmeansy$cluster
# Uses the cmeans clustering method to compute membership coefficients (weights) for 4 clusters and stores them in
# the 'membership' component of the resulting object. The result of this example should be identical to the above
# clustering with the fanny function when its metric argument is set to "SqEuclidean". The cmeans function does not
# accept a distance matrix as input, but the fanny function does.
cmeansyMA < round(cmeansy$membership, 2) > 0.10; apply(cmeansyMA, 1, which) # Returns multiple cluster memberships for coefficient above a certain value (here >0.1).
cshelly < cshell(y, 4, method="cshell", m=1.5); round(cshelly$membership, 2); cshelly$cluster
# Uses the cshell clustering method to compute membership coefficients (weights) for 4 clusters and stores them in
# the 'membership' component of the resulting object.
SelfOrganizing Map (SOM)
Selforganizing map (SOM), also known as Kohonen network, is a
popular artificial neural network algorithm in the unsupervised learning
area. The approach iteratively assigns all items in a data matrix to a
specified number of representatives and then updates each representative
by the mean of its assigned data points. Widely used R packages for SOM
clustering and visualization are: class (part of R), SOM and kohonen. The SOM package, that is introduced here, provides similar utilities as the GeneCluster software from the Broad Institute.
library(som) # Loads the required SOM library.
y < matrix(rnorm(50), 10, 5, dimnames=list(paste("g", 1:10, sep=""), paste("t", 1:5, sep=""))) # Creates a sample data set.
y < t(scale(t(y))) # Normalizes the data so that each row has a mean close to zero and a standard deviation of one.
y.som < som(y, xdim = 5, ydim = 6, topol = "hexa", neigh = "gaussian") # Performs SOM clustering.
plot(y.som) # Plots SOM clustering result.
y.som$visual # Provides the assignment of rows items to the SOM clusters. Remember that the coordinate counting starts here at zero!
x y qerror
1 3 0 0.6037758
2 4 3 0.9750106
3 2 2 1.0911974
4 0 2 0.7000283
5 2 2 1.2006966
6 0 3 0.5653868
7 1 0 0.7291410
8 3 4 1.5029060
9 2 3 1.5038304
10 2 4 0.4960032
SOM Plot
Principal Component Analysis (PCA)
Principal components analysis (PCA) is a data reduction technique
that allows to simplify multidimensional data sets to 2 or 3 dimensions
for plotting purposes and visual variance analysis. The following
commands introduce the basic usage of the prcomp() function. A very
related function is princomp(). The BioConductor library pcaMethods provides many additional PCA functions. For viewing PCA plots in 3D, one can use the scatterplot3d library or the made4 library.
z1 < rnorm(10000, mean=1, sd=1); z2 < rnorm(10000, mean=3, sd=3); z3 < rnorm(10000, mean=5, sd=5); z4 < rnorm(10000, mean=7, sd=7); z5 < rnorm(10000, mean=9, sd=9); mydata < matrix(c(z1, z2, z3, z4, z5), 2500, 20, byrow=T, dimnames=list(paste("R", 1:2500, sep=""), paste("C", 1:20, sep="")))
# Generates sample matrix of five discrete clusters that have very different mean and standard deviation values.
pca < prcomp(mydata, scale=T)
# Performs principal component analysis after scaling the data. It returns a list with class "prcomp" that contains five
# components: (1) the standard deviations (sdev) of the principal components, (2) the matrix of eigenvectors (rotation),
# (3) the principal component data (x), (4) the centering (center) and (5) scaling (scale) used.
summary(pca) # Prints variance summary for all principal components.
summary(pca)$importance[, 1:6] # Accesses subset of components.
PC1 PC2 PC3 PC4 PC5 PC6
Standard deviation 2.181627 0.9806665 0.9785759 0.9581832 0.9483988 0.9419336
Proportion of Variance 0.237970 0.0480900 0.0478800 0.0459100 0.0449700 0.0443600
Cumulative Proportion 0.237970 0.2860600 0.3339400 0.3798500 0.4248200 0.4691800
x11(height=6, width=12, pointsize=12); par(mfrow=c(1,2)) # Set plotting parameters.
mycolors < c("red", "green", "blue", "magenta", "black") # Define plotting colors.
plot(pca$x, pch=20, col=mycolors[sort(rep(1:5, 500))])
# Plots scatter plot for the first two principal components that are stored in pca$x[,1:2].
plot(pca$x, type="n"); text(pca$x, rownames(pca$x), cex=0.8, col=mycolors[sort(rep(1:5, 500))])
# Same as above, but prints labels.
library(geneplotter); smoothScatter(pca$x) # Same as above, but generates a smooth scatter plot that shows the density of the data points.
pairs(pca$x[,1:4], pch=20, col=mycolors[sort(rep(1:5, 500))])
# Plots scatter plots for all combinations between the first four principal components.
biplot(pca)
# Plots a scatter plot for the first two principal components plus the corresponding eigen vectors that are stored in pca$rotation.
library(scatterplot3d) # Loads library scatterplot3d.
scatterplot3d(pca$x[,1:3], pch=20, color=mycolors[sort(rep(1:5, 500))])
# Same as above, but plots the first three principal components in 3D scatter plot.
library(rgl); rgl.open(); offset < 50; par3d(windowRect=c(offset, offset, 640+offset, 640+offset)); rm(offset); rgl.clear(); rgl.viewpoint(theta=45, phi=30, fov=60, zoom=1); spheres3d(pca$x[,1], pca$x[,2], pca$x[,3], radius=0.3, color=mycolors, alpha=1, shininess=20); aspect3d(1, 1, 1); axes3d(col='black'); title3d("", "", "PC1", "PC2", "PC3", col='black'); bg3d("white")
# Creates an interactive 3D scatter plot with Open GL. The rgl library needs to be installed for this. To save a snapshot
# of the graph, one can use the command rgl.snapshot("test.png").
Table of Contents
Scatter Plots for First Four Principal Components: pairs(pca$x[,1:4])
Multidimensional Scaling (MDS)
Multidimensional scaling (MDS) algorithms start with a matrix of
itemitem distances and then assign coordinates for each item in a
lowdimensional space to represent the distances graphically. Cmdscale()
is the base function for MDS in R. Additional MDS functions are
sammon() and isoMDS() of the MASS library.
loc < cmdscale(eurodist) # Performs MDS analysis on the geographic distances between European cities.
plot(loc[,1], loc[,2], type="n", xlab="", ylab="", main="cmdscale(eurodist)")
text(loc[,1], loc[,2], rownames(loc), cex=0.8)
# Plots the MDS results in 2D plot. The minus is required in this example to flip the plotting orientation.
Table of Contents
Map of Europe Generated by MDS Clustering of CitytoCity Distances
Bicluster Analysis
Biclustering (also coclustering or twomode clustering) is an
unsupervised clustering technique which allows simultaneous clustering
of the rows and columns of a matrix. The goal of biclustering is to find
subgroups of rows and columns which are as similar as possible to each
other and as different as possible to the remaining data points. The biclust package, that is introduced here, contains a collection of bicluster algorithms, data preprocessing and visualization methods (Detailed User Manual). An algorithm that allows the integration of different data types is implemented in the cMonkey R program (BMC Bioinformatics 2006, 7, 280). A comparison of several bicluster algorithms for clustering gene expression data has been published by Prelic et al (2006).
Since most biclustering algorithms expect the input data matrix to be
properly preprocessed, it is especially important to carefully read the
manual pages for the different functions.
Plaid model biclustering
library(biclust) # Loads the biclust library.
data(BicatYeast) # Loads sample data from Barkow et al. (Bioinformatics, 22, 12821283).
res1 < biclust(BicatYeast, method=BCPlaid())
# Performs plaid model biclustering as described in Turner et al, 2003. This algorithm models data matrices to a sum of
# layers, the model is fitted to data through minimization of error.
summary(res1); show(res1)
# The summary() functions returns the sizes of all clusters found and the show() function provides an overview of the
# method applied.
names(attributes(res1))
# Converts res1 object of class Biclust into a list and returns the names of its components (slots). The results for the
# row clustering are stored in the "RowxNumber" slot and the results for the column clustering are stored in the "NumberxCol" slot.
BicatYeast[res1@RowxNumber[,2], res1@NumberxCol[2,]] # Extracts values for bicluster number 2 from original data set.
myrows < attributes(res1)$RowxNumber; mycols < attributes(res1)$NumberxCol; myclusters < lapply(1:length(myrows[1,]), function(x) list(rownames(BicatYeast[myrows[,x], ]), colnames(BicatYeast[, mycols[x,]]))); names(myclusters) < paste("CL", 1:length(myclusters), sep="_"); myclusters[2]
# Organizes all identified row and column clusters in a list object, here 'myclusters'.
writeBiclusterResults("results.txt", res1, "My Result", dimnames(BicatYeast)[1][[1]], dimnames(BicatYeast)[2][[1]])
# Writes bicluster results to a file.
bubbleplot(BicatYeast, res1, showLabels=T)
# Draws a bubble plot where each bicluster is represented as a circle (bubble). Up to three biclustering result sets can be
# compared in one plot (here one). The sets are then labled with different colors. The color brightness representes the
# bicluster homogeneity (darker, less homogeneous). The bubble sizes represent the size of the biclusters, as
# (number of rows)x(number of columns). The bubble location is a 2Dprojection of row and column profiles.
parallelCoordinates(x=BicatYeast, bicResult=res1, number=4)
# Plots the profiles for the 4th cluster. The cluster number is specified under the number argument.
drawHeatmap(BicatYeast, res1, number=4)
# Draws a heatmap for the BicatYeast data matrix with the rows and columns of the selected 4th bicluster at the topleft of plot.
XMotifs biclustering
y < discretize(BicatYeast, nof=10) # Converts input matrix with continuous data into matrix with discrete values of 10 levels (nof).
res2 < biclust(y, method=BCXmotifs(), alpha=0.05, number=50)
# Performs XMotifs biclustering based on the framework by Murali and Kasif (2003). It searches for rows with constant values
# over a set of columns which results in a submatrix where all rows in a descretized matrix have the same value profile.
# Usually the algorihm needs a discrete matrix to perform well.
bubbleplot(BicatYeast, res1, res2, showLabels=F) # Compares the two bicluster results in form of a bubble plot.
jaccardind(res1, res2)
# Calculates the similarity for two clustering results using an adaptation of the Jaccard index (values from 01). An overview
# of related methods is available on this cluster validity algorithm page.
Bimax biclustering
y < BicatYeast; y[y >= 1  y <= 1] < 1; y[y != 1] < 0
# Transforms input data matrix into binary values. Here the log2 ratios with at least a twofold change are set to 1 and
# the rest is set to 0.
res3 < biclust(y, method=BCBimax(), minr=15, minc=10, number=100)
# Performs Bimax biclustering based on an algorithm described by Prelic et al (2006). This method searches in a binary
# matrix for submatrices containing only ones.
Spectral biclustering
res4 < biclust(BicatYeast, method=BCSpectral(), numberOfEigenvalues=2)
# Performs spectral biclustering as described in Kluger et al, 2003. Spectral biclustering supposes that normalized microarray
# data matrices have a checkerboard structure that can be discovered by the use of svd decomposition in eigenvectors, applied to genes.
CC biclustering
res5 < biclust(y, method=BCCC(), delta=1.5, alpha=1, number=50)
# Performs CC biclustering based on the framework by Cheng and Church (2000). Searches for submatrices with a score lower than
# a specific threshold in a standardized data matrix.
Network Analysis
 WGCNA: an R package for weighted correlation network analysis
 BioNet: routines for the functional analysis of biological networks
Support Vector Machines (SVM)
Support vector machines (SVMs) are supervised machine learning
classification methods. SVM implementations are available in the R
packages kernlab and e1071.
The e1071 package contains an interface to the C++ libsvm
implementation from ChihChung Chang and ChihJen Lin. In addition to
SVMs, the e1071 package includes a comprehensive set of machine learning
utilities, such as functions for latent class analysis, bootstrapping,
short time Fourier transform, fuzzy clustering, shortest path
computation, bagged clustering, naive Bayes classifier, etc. The kernlab
package contains additional functions for spectral clustering, kernel
PCA, etc.
An excellent introduction into the usage of SVMs in R is available in David Meyer's SVM article.
Similarity Measures for Clustering Results
To measure the similarities among clustering results, one can
compare the numbers of identical and unique item pairs appearing in
their clusters. This can be achieved by counting the number of item
pairs found in both clustering sets (a) as well as the pairs appearing only in the first (b) or the second (c)
set. With this information it is possible to calculate a similarity
coefficient, such as the Jaccard Index. The latter is defined as the
size of the intersect divided by the size of the union of two sample
sets: a/(a+b+c). In case of
partitioning results, the Jaccard Index measures how frequently pairs of
items are joined together in two clustering data sets and how often
pairs are observed only in one set. Related coefficient are the Rand Index and the Adjusted Rand Index. These indices also consider the number of pairs (d)
that are not joined together in any of the clusters in both sets. A
variety of alternative similarity coefficients can be considered for
comparing clustering results. An overview of available methods is given
on this cluster validity page. In addition, the Consense library contains a variety of functions for comparing cluster sets, and the mclust02 library contains an implementation of the variation of information criterion described by M. Meila (J Mult Anal 98, 873895).
####################################
## Basic usage of cindex function ##
####################################
source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/clusterIndex.R")
# Imports the cindex() function for computing the Jaccard Index or the Rand Index.
library(cluster); y < matrix(rnorm(5000), 1000, 5, dimnames=list(paste("g", 1:1000, sep=""), paste("t", 1:5, sep=""))); clarax < clara(y, 49); clV1 < clarax$clustering; clarax < clara(y, 50); clV2 < clarax$clustering
# Creates two sample cluster data sets stored in the named vectors clV1 and clV2.
ci < cindex(clV1=clV1, clV2=clV2, self=FALSE, minSZ=1, method="jaccard")
# Computes the Jaccard Index for clV1 and clV2, where values close to 0 indicate low similarities and values close to 1
# high similarities among the evaluated cluster sets. The Rand and Adjusted Rand Indices can be returned by assigning to
# the methods argument the values rand or arand, respectively. By setting the argument self=TRUE, one can allow self
# comparisons, where pairs with different and identical labels are counted. This way clusters with single items will be
# considered as well. If it is set to FALSE only pairs with different labels are considered. The implemented cindex() function
# is relatively speed and memory efficient. Two clustering results with 1000 items each partitioned into 50 clusters will
# be processed in about 0.05 seconds. To focus the analysis on clusters above a minimum size limit, one can use the minSZ
# argument. For instance, the setting minSZ=3 will only consider clusters with 3 or more items.
ci[2:3] # Returns Jaccard index and variables used to compute it.
$variables
a b c
10519 4186 3813
$Jaccard_Index
[1] 0.5680419
ci[[1]][1:4,1:4] # Returns slice of intersect matrix.
1 2 3 4
1 20 0 0 0
2 0 49 0 0
3 0 0 3 0
4 0 0 0 15
################################
## Clustering cluster results ##
################################
clVlist < lapply(3:12, function(x) clara(y[1:30, ], k=x)$clustering); names(clVlist) < paste("k", "=", 3:12)
d < sapply(names(clVlist), function(x) sapply(names(clVlist), function(y) cindex(clV1=clVlist[[y]], clV2=clVlist[[x]], method="jaccard")[[3]]))
hv < hclust(as.dist(1d))
plot(as.dendrogram(hv), edgePar=list(col=3, lwd=4), horiz=T, main="Similarities of 10 Clara Clustering Results for k: 312")
# Example for clustering entire cluster result sets. First, 10 sample cluster results are created with Clara using kvalues
# from 3 to 12. The results are stored as named clustering vectors in a list object. Then a nested sapply loop is used to
# generate a similarity matrix of Jaccard Indices for the clustering results. After converting the result into a distance
# matrix, hierarchical clustering is performed with hclust.
Table of Contents
Clustering of 10 Cluster Results Using the Jaccard Index as Similarity Measure
Clustering Exercises
Slide Show ( R Code)
Install required libraries
The following exercises demonstrate several useful clustering and data mining utilities in R.
Sample data download:
 Download from GEO the Arabidopsis IAA treatment series "GSE1110" in TXT format. The direct link to the download is:
 Import the data set into R
temp < readLines("GSE1110_series_matrix.txt"); cat(temp[grep("^!^\"$", temp)], file="GSE1110clean.txt", sep="\n"); mydata < read.delim("GSE1110clean.txt", header=T, sep="\t")
# These import commands include a cleanup step to get rid of annotation lines and corrupted return signs.
rownames(mydata) < mydata[,1]; mydata < as.matrix(mydata[,1])
# Assigns row indices and converts the data into a matrix object.
mydata < mydata[apply(mydata > 100, 1, sum)/length(mydata[1,])>0.5 & apply(log2(mydata), 1, IQR) > 1.5, ]
# Retrieves all rows with high intensities (50% > 100) and high variability (IQR > 1.5).
dim(mydata) # Dimension of final sample matrix
[1] 123 22
mydata[1:10,1:10] # Slice of sample matrix
GSM18228 GSM18229 GSM18230 GSM18231 GSM18232 GSM18233 GSM18290 GSM18291 GSM18292 GSM18293
244932_at 4053.0 4130.4 3291.8 3383.1 4229.4 4338.2 903.2 1068.1 3609.3 3466.3
244933_at 2526.0 1800.0 1070.8 812.0 2345.0 2463.6 363.2 391.4 1074.6 823.3
244938_at 1789.7 1581.9 2043.4 2050.7 977.4 978.7 590.6 499.2 2091.9 2021.0
244939_at 749.0 759.1 936.4 947.0 531.7 604.5 378.9 492.5 1208.5 1065.2
244970_at 2377.0 2259.4 2053.1 2121.5 1401.8 1159.6 445.0 561.5 2243.1 2210.0
244982_at 2121.9 1953.2 5629.2 5240.5 1789.8 1997.7 1516.2 1592.3 5961.3 6067.3
244995_at 492.4 485.0 514.6 450.0 529.0 606.2 114.3 128.4 516.2 627.6
245002_at 2953.1 2866.9 6638.1 6562.6 4573.5 4481.8 1963.0 1885.5 7413.7 7784.2
245004_at 5447.6 5178.7 9046.1 8752.3 5121.3 4736.0 2730.6 2845.7 10264.3 9902.2
245008_at 2064.7 2267.7 1969.5 2277.2 1702.5 1844.0 698.1 546.3 2403.0 2553.8
source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/my.colorFct.R")
# Import an alternative color scheme for the heatmap function.
mydatascale < t(scale(t(mydata))) # Centers and scales data.
hr < hclust(as.dist(1cor(t(mydatascale), method="pearson")), method="complete") # Cluster rows by Pearson correlation.
hc < hclust(as.dist(1cor(mydatascale, method="spearman")), method="complete")
# Clusters columns by Spearman correlation.
heatmap(mydata, Rowv=as.dendrogram(hr), Colv=as.dendrogram(hc), col=my.colorFct(), scale="row")
# Plot the data table as heatmap and the cluster results as dendrograms.
mycl < cutree(hr, h=max(hr$height)/1.5); mycolhc < sample(rainbow(256)); mycolhc < mycolhc[as.vector(mycl)]; heatmap(mydata, Rowv=as.dendrogram(hr), Colv=as.dendrogram(hc), col=my.colorFct(), scale="row", RowSideColors=mycolhc)
# Cut the tree at specific height and color the corresponding clusters in the heatmap color bar.
Tree Cutting of Hierarchical Clustering Result Shown in Color Bar on Left
 Obtain significant clusters by pvclust bootstrap analysis
library(pvclust); library(gplots) # Loads the required packages.
pv < pvclust(scale(t(mydata)), method.dist="correlation", method.hclust="complete", nboot=10)
# Perform the hierarchical cluster analysis. Due to time resrictions, we are using here only 10 bootstrap repetitions.
# Usually, one should use at least 1000 repetitions.
plot(pv, hang=1); pvrect(pv, alpha=0.95)
# Plots result as a dendrogram where the significant clusters are highlighted with red rectangles.
clsig < unlist(pvpick(pv, alpha=0.95, pv="au", type="geq", max.only=TRUE)$clusters) # Retrieve members of significant clusters.
source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/dendroCol.R") # Import tree coloring function.
dend_colored < dendrapply(as.dendrogram(pv$hclust), dendroCol, keys=clsig, xPar="edgePar", bgr="black", fgr="red", pch=20)
# Create dendrogram object where the significant clusters are labeled in red.
heatmap(mydata, Rowv=dend_colored, Colv=as.dendrogram(hc), col=my.colorFct(), scale="row", RowSideColors=mycolhc)
# Plot the heatmap from above, but with the significant clusters in red and the cluster bins from the tree cutting step in
# the color bar.
x11(height=12); heatmap.2(mydata, Rowv=dend_colored, Colv=as.dendrogram(hc), col=my.colorFct(), scale="row", trace="none", RowSideColors=mycolhc) # Plot heatmap with heatmap.2() function which scales better for many entries.
mydatasort < mydata[pv$hclust$labels[pv$hclust$order], hc$labels[hc$order]] # Sort rows in data table by 'dend_colored' and its colums by 'hc'.
x11(height=16, width=12); par(mfrow=c(1,2)); plot(dend_colored, horiz=T, yaxt="n"); image(scale(t(mydatasort)), col=my.colorFct(), xaxt="n",yaxt="n") # Plot heatmap with bootstrap tree in larger format using instead of heatmap the image function.
pdf("pvclust.pdf", height=21, width=6); plot(dend_colored, horiz=T, yaxt="n"); dev.off(); pdf("heatmap.pdf", height=20, width=6); image(scale(t(mydatasort)), col=my.colorFct(), xaxt="n",yaxt="n"); dev.off()
# Save graphical results to two PDF files: 'pvclust.pdf' and'heatmap.pdf'.
 Compare PAM (Kmeans) with hierarchical clustering
library(cluster) # Loads required library.
mydist < t(scale(t(mydata))) # Center and scale data.
mydist < as.dist(1cor(t(mydist), method="pearson")) # Generates distance matrix using Pearson correlation as distance method.
pamy < pam(mydist, max(mycl)) # Clusters distance matrix into as many clusters as obtained by tree cutting step (6).
mycolkm < sample(rainbow(256)); mycolkm < mycolkm[as.vector(pamy$clustering)]
heatmap(mydata, Rowv=dend_colored, Colv=as.dendrogram(hc), col=my.colorFct(), scale="row", RowSideColors=mycolkm)
# Compare PAM clustering results with hierarchical clustering by labeling it in heatmap color bar.
pdf("pam.pdf", height=20, width=20); heatmap(mydata, Rowv=dend_colored, Colv=as.dendrogram(hc), col=my.colorFct(), scale="row", RowSideColors=mycolkm); dev.off()
# Save graphical results to PDF file 'pvclust.pdf'.
Table of Contents
Comparison of 3 Clustering Results: (1) hclust (row tree), pvclust (red branches in row tree) and pam (color bar)
 Compare SOM with hierarchical clustering
library(som) # Loads required library.
y < t(scale(t(mydata))) # Center and scale data.
y.som < som(y, xdim = 2, ydim = 3, topol = "hexa", neigh = "gaussian") # Performs SOM clustering.
plot(y.som) # Plots results.
pdf("som.pdf"); plot(y.som); dev.off() # Save plot to PDF 'som.pdf'.
somclid < as.numeric(paste(y.som$visual[,1], y.som$visual[,2], sep=""))+1 # Returns SOM cluster assignment in order of input data.
mycolsom < sample(rainbow(256)); mycolsom < mycolsom[somclid]; heatmap(mydata, Rowv=dend_colored, Colv=as.dendrogram(hc), col=my.colorFct(), scale="row", RowSideColors=mycolsom)
# Compare SAM clustering results with hierarchical clustering by labeling it in heatmap color bar.
pdf("somhc.pdf", height=20, width=20); heatmap(mydata, Rowv=dend_colored, Colv=as.dendrogram(hc), col=my.colorFct(), scale="row", RowSideColors=mycolsom); dev.off() # Save graphical results to PDF file 'somhc.pdf'.
pca < prcomp(mydata, scale=T) # Performs principal component analysis after scaling the data.
summary(pca) # Prints variance summary for all principal components.
library(scatterplot3d) # Loads 3D library.
scatterplot3d(pca$x[,1:3], pch=20, color=mycolsom) # Plots PCA result in 3D. The SOM clusters are highlighted in their color.
pdf("pcasom.pdf"); scatterplot3d(pca$x[,1:3], pch=20, color=mycolsom); dev.off() # Saves PCA plot in PDF format 'pcasom.pdf'.
 Compare MDS with HC, SOM and Kmeans
loc < cmdscale(mydist, k = 3) # Performs MDS analysis and returns results for three dimensions.
x11(height=8, width=8, pointsize=12); par(mfrow=c(2,2)) # Sets plotting parameters.
plot(loc[,1:2], pch=20, col=mycolsom, main="MDS vs SOM 2D")
# Plots MDSSOM comparison in 2D. The SOM clusters are highlighted in their color.
scatterplot3d(loc, pch=20, color=mycolsom, main="MDS vs SOM 3D") # Plots MDSSOM comparison in 3D.
scatterplot3d(loc, pch=20, color=mycolhc, main="MDS vs HC 3D") # Plots MDSHC comparison.
scatterplot3d(loc, pch=20, color=mycolkm, main="MDS vs KM 3D") # Plots MDSKM comparison.
Comparison of MDS Clustering with SOM, HC, and Kmeans Results
 Scoring similarities among clustering results with Jaccard coefficient
names(somclid) < row.names(mydata)
clist < list(hc=mycl, pam=pamy$clustering, som=somclid) # Store clustering results in list.
d < sapply(names(clist), function(x) sapply(names(clist), function(y) cindex(clV1=clist[[y]], clV2=clist[[x]], method="jaccard")[[3]]))
# Compute similarity matrix of Jaccard coefficients.
round(d, 2)
hc pam som
hc 1.00 0.56 0.62
pam 0.56 1.00 0.48
som 0.62 0.48 1.00
hv < hclust(as.dist(1d)) # Hierarchical clustering with hclust
plot(as.dendrogram(hv), edgePar=list(col=3, lwd=4), horiz=TRUE)
# Plot similarity among clustering results as tree.
Similarity Among Clustering Results
library(cluster) # Loads cluster library.
fannyy < fanny(mydist, k= max(mycl), memb.exp = 1.5); round(fannyy$membership, 2); fannyy$clustering
# Performs fuzzy clustering with as many coefficients as clusters were obtained by tree cutting step in HC. The hard
# clustering results are provided in the 'clustering' slot.
fannyyMA < round(fannyy$membership, 2) > 0.3; apply(fannyyMA, 1, which)
# Returns multiple cluster memberships for coefficient above a certain value (here >0.3).
Administration
Download and install R for your operating system from CRAN.
The latest instructions for installing Bioconductor packages are available on the Bioc Installation page. Only the essential steps are given here. To install Bioconductor packages, execute from the R console the following commands:
source("http://bioconductor.org/biocLite.R") # Sources the biocLite.R install script.
biocLite() # Installs the biocLite.R default set of Bioconductor packages.
biocLite(c("pkg1", "pkg2")) # Command to install specific packages from Bioc.
update.packages(repos=biocinstallRepos(), ask=FALSE)
# Bioconductor packages, especially those in the development branch, are updated fairly regularly. To update all
# of your installed packages, run this command after sourcing "biocLite.R".
To install CRAN packages, execute from the R console the following command:
install.packages(c("pkg1", "pkg2")) # The selected package name(s) need to be provided within quotation marks.
Alternatively, a package can be downloaded and then installed in compressed form like this:
install.packages("pkg.zip", repos=NULL)
List all packages installed on a system:
row.names(installed.packages())
On UNIXbased systems one can also execute the following shell command (not from R):
$ R CMD INSTALL pkg_0.1.tar.gz
# This installs packages by default into a systemwide directory. To install packages into a user account
# directory (not systemwide), one can specify a custom directory with the argument 'l my_lib_dir'. A locally
# installed library can be called in R like this: 'library(my_lib, lib.loc= "/home/user/my_lib_dir/")'.
# To manage libraries by default in a custom directory, one can generate a '.Reviron' file containing the
# line: R_LIBS=/your_home_dir/Rlib
