R & Bioconductor Manual

Author: Thomas Girke, UC Riverside
 

Contents

  1 R Basics
    1.1 Introduction
    1.2 Finding Help
    1.3 Basics on Functions and Packages
    1.4 System commands under Linux
    1.5 Reading and Writing Data from/to Files
    1.6 R Objects
      1.6.1 Data and Object Types
      1.6.2 General Subsetting Rules
      1.6.3 Basic Operators and Calculations
      1.6.4 Regular expressions
      1.6.5 Vectors
      1.6.6 Factors
      1.6.7 Matrices and Arrays
      1.6.8 Data Frames
      1.6.9 Lists
    1.7 Missing Values
    1.8 Some Great R Functions
    1.9 Graphical Procedures
      1.9.1 Introduction to R Graphics
      1.9.2 Scatter Plots
      1.9.3 Line Plots
      1.9.4 Bar Plots
      1.9.5 Pie Charts
      1.9.6 Heatmaps
      1.9.7 Venn Diagrams
      1.9.8 Intersect Plots
      1.9.9 Histograms
      1.9.10 Density Plots
      1.9.11 Box Plots
      1.9.12 ROC Plots
      1.9.13 Feature Maps for DNA/Protein Sequences
      1.9.14 Miscellaneous Plotting Utilities
      1.9.15 Arranging Plots
      1.9.16 Customize X-Y Axes
      1.9.17 Color Selection Utilities
      1.9.18 Adding Text
      1.9.19 Interactive Functions
      1.9.20 Saving Graphics to Files
      1.9.21 Importing Graphics into R
      1.9.22 Graphics Exercises
    1.10 Writing Your Own Functions
    1.11 R Web Applications
    1.12 R Basics Exercises
  2 HT Sequence Analysis with R and Bioconductor
  3 Programming in R
  4 Bioconductor
    4.1 Introduction
    4.2 Finding Help
    4.3 Affy Packages
      4.3.1 Affy
      4.3.2 Visualization and quality controls
      4.3.3 Simpleaffy
    4.4 Analysis of Differentially Expressed Genes
      4.4.1 Limma
        4.4.1.1 Limma: Dual Color Arrays
        4.4.1.2 Limma: Affymetrix Arrays
      4.4.2 RankProd
    4.5 Additional Dual Color Array Packages
      4.5.1 Marray
    4.6 Chromosome maps
    4.7 Gene Ontologies
      4.7.1 General
      4.7.2 GOHyperGAll
      4.7.3 GSEA
      4.7.4 GOTools and goCluster
    4.8 KEGG Pathway Analysis
    4.9 Motif Identification in Promoter Regions
    4.10 Phylogenetic Analysis
    4.11 Cheminformatics in R
    4.12 Protein Structure Analysis
    4.13 MS Data Analysis
    4.14 Genome-Wide Association Studies (GWAS)
    4.15 BioConductor Exercises
  5 Clustering and Data Mining in R
    5.1 Introduction
    5.2 Data Preprocessing
    5.3 Hierarchical Clustering (HC)
    5.4 Bootstrap Analysis in Hierarchical Clustering
    5.5 QT Clustering
    5.6 K-Means & PAM
    5.7 Fuzzy Clustering
    5.8 Self-Organizing Map (SOM)
    5.9 Principal Component Analysis (PCA)
    5.10 Multidimensional Scaling (MDS)
    5.11 Bicluster Analysis
    5.12 Network Analysis
    5.13 Support Vector Machines (SVM)
    5.14 Similarity Measures for Clustering Results
    5.15 Clustering Exercises
  6 Administration


R Basics


Introduction

General Overview
R (http://cran.at.r-project.org) is a comprehensive statistical environment and programming language for professional data analysis and graphical display. The associated Bioconductor project provides many additional R packages for statistical data analysis in different life science areas, such as tools for microarray, next generation sequence and genome analysis. The R software is free and runs on all common operating systems.

Scope of this Manual
This R tutorial provides a condensed introduction to the usage of the R environment and its utilities for general data analysis and clustering. It also introduces a subset of packages from the Bioconductor project. The included packages are a 'personal selection' of the author of this manual that does not reflect the full utility spectrum of the R/Bioconductor projects. Many packages were chosen because the author uses them often for his own teaching and research. To obtain a broad overview of available R packages, it is strongly recommended to consult the official Bioconductor and R project sites. Due to the rapid development of most packages, it is also important to be aware that this manual will often not be fully up-to-date. Because of this and many other reasons, it is absolutely critical to use the original documentation of each package (PDF manual or vignette) as the primary source of documentation. Users are welcome to send suggestions for improving this manual directly to its author.

Format of this Manual
A practical, if not always easy to read, copy & paste format has been chosen throughout this manual. In this format all commands are represented in code boxes, where the comments are given in blue color. To save space, several commands are often concatenated on one line and separated with a semicolon ';'. All comments/explanations start with the standard comment sign '#' to prevent them from being interpreted by R as commands. This way several commands can be pasted with their comment text into the R console to demo the different functions and analysis steps. Commands starting with a '$' sign need to be executed from a Unix or Linux shell. Windows users can simply ignore them. Commands highlighted in red color are considered essential knowledge. They are important for anyone interested in a quick start with R and Bioconductor. Where relevant, the output generated by R is given in green color.

Installation of the R Software and R Packages
  • The installation instructions are provided in the Administrative Section of this manual.
  • R working environments with syntax highlighting support and utilities to send code to the R console are covered under 'R Projects and Interfaces' below.


R Projects and Interfaces

Basic R Usage

$ R # Starts the R console under Unix/Linux. The R GUI versions under Windows and Mac OS X can be started by double-clicking their icons.
object <- function(arguments) # This general R command syntax uses the assignment operator '<-' (or '=') to assign data generated by a command to its right to object on its left.
object = function(arguments) # A more recently introduced assignment operator is '='. Both of them work the same way and in both directions. For consistency reasons one should use only one of them.
assign("x", function(arguments)) # Has the same effect, but uses the assignment function instead of the assignment operator.
source("my_script") # Command to execute an R script, here 'my_script'. For example, generate a text file 'my_script' with the command 'print(1:100)', then execute it with the source function.
x <- edit(data.frame()) # Starts empty GUI spreadsheet editor for manual data entry.
x <- edit(x) # Opens existing data frame (table) 'x' in GUI spreadsheet editor.
x <- scan(what="character") # Lets you enter values from the keyboard or by copy&paste and assigns them to vector 'x'.
q() # Quits R console.

R Startup Behavior

The R environment is controlled by hidden files in the startup directory: .RData, .Rhistory and .Rprofile (optional)
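As a sketch of what such a startup file can contain, the '.Rprofile' below sets a few optional preferences. The specific settings are only illustrations, not required defaults.

```r
# Example .Rprofile (optional startup file in the user's home or working directory).
# Everything below is an optional illustration of what a startup file may contain.
options(digits=4) # Example: print numbers with 4 significant digits by default.
.First <- function() cat("R session started:", date(), "\n") # Runs at session start.
.Last <- function() cat("Session ended.\n") # Runs at session exit.
```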

Finding Help

Various online manuals are available on the R project site; several of them are very useful for beginners. Documentation within R can be found with the following commands:

?function # or 'help(function)' opens documentation on a function
apropos("topic") # Finds all functions containing a given term (the search term needs to be quoted).
example(heatmap) # executes examples for function 'heatmap'
help.search("topic") # searches help system for documentation
RSiteSearch('regression', restrict='functions', matchesPerPage=100)
    # Searches for key words or phrases in the R-help mailing list archives, help pages, vignettes or task views, using the search engine
    # at http://search.r-project.org and view them in a web browser.

help.start() # Starts local HTML interface. The link 'Packages' provides a list of all installed packages. After initiating 'help.start()' in a session the '?function' commands will open as HTML pages.
sessionInfo() # Prints version information about R and all loaded packages. The generated output should be provided when sending questions or bug reports to the R and BioC mailing lists.
$ R -h # or 'R --help'; provides help on R environment, more detailed information on page 90 of 'An Introduction to R'


Basics on Functions and Packages

R contains most arithmetic functions like mean, median, sum, prod, sqrt, length, log, etc. An extensive list of R functions can be found on the function and variable index page. Many R functions and datasets are stored in separate packages, which are only available after loading them into an R session. Information about installing new packages can be found in the administrative section of this manual.
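A few of the arithmetic functions mentioned above, applied to a small sample vector:

```r
x <- c(4, 16, 25) # Sample numeric vector.
mean(x) # Arithmetic mean: 15.
median(x) # Middle value: 16.
sum(x) # Sum of all elements: 45.
prod(x) # Product of all elements: 1600.
sqrt(x) # Element-wise square roots: 2 4 5.
length(x) # Number of elements: 3.
```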

Loading of libraries/packages

library() # Shows all libraries available on a system.
library(my_package) # Loads a particular package.
library(help=mypackage) # Lists all functions/objects of a package.
(v <- vignette("mypackage")) # Opens vignette of a package.
Stangle(v$file) # Writes code chunks of vignette to a file named mypackage.R for loading into R IDE (e.g. RStudio).
search() # Lists which packages are currently loaded.



Information and management of objects


ls() # Or 'objects()'. Lists the R objects created during a session. They are stored in the file '.RData' when exiting R with the workspace saved.
rm(my_object1, my_object2, ...) # Removes objects.
rm(list = ls()) # Removes all objects without warning!
str(object) # Displays object types and structure of an R object.
ls.str(pattern="") # Lists object type info on all objects in a session.
print(ls.str(), max.level=0) # If a session contains very long list objects then one can simplify the output with this command.
lsf.str(pattern="") # Lists object type info on all functions in a session.
class(object) # Prints the object type.
mode(object) # Prints the storage mode of an object.
summary(object) # Generic summary info for all kinds of objects.
attributes(object) # Returns an object's attribute list.
gc() # Causes the garbage collection to take place. This is sometimes useful to clean up memory allocations after deleting large objects.
length(object) # Provides length of object.
.Last.value # Prints the value of the last evaluated expression.

Reading and changing directories
dir() # Reads content of current working directory.
getwd() # Returns current working directory.
setwd("/home/user") # Changes current working directory to the specified directory.

System commands under Linux


R IN/OUTPUT & BATCH Mode
One can redirect R input and output with '|', '>' and '<' from the Shell command line.

$ R --slave < my_infile > my_outfile # The argument '--slave' makes R run as 'quietly' as possible. This option is intended to support programs which use R to compute results for them.
    # For example, if my_infile contains 'x <- c(1:100); x;' the result for this R expression will be written to 'my_outfile' (or STDOUT).
$ R CMD BATCH [options] my_script.R [outfile]
    # Syntax for running R programs in BATCH mode from the command-line. The output file lists the commands from the script file and their outputs. If no outfile is specified, the name used
    # is that of 'infile' and '.Rout' is appended to outfile. To stop all the usual R command line information from being written to the outfile, add this as first line to my_script.R file:
    # 'options(echo=FALSE)'. If the command is run like this 'R CMD BATCH --no-save my_script.R', then nothing will be saved in the .RData file, which can often get very large. More on this
    # can be found on the help pages: '$ R CMD BATCH --help' or '> ?BATCH'.
$ echo 'sqrt(3)' | R --slave # calculates sqrt of 3 in R and prints it to STDOUT.


Executing Shell & Perl commands from R with system() function.
Remember, single escapes (e.g. '\n') need to be double escaped in R (e.g. '\\n')

system("ls -al") # Prints content of current working directory.
system("perl -ne 'print if (/my_pattern1/ ? ($c=1) : (--$c > 0)); print if (/my_pattern2/ ? ($d = 1) : (--$d > 0));' my_infile.txt > my_outfile.txt")
    # Runs Perl one-liner that extracts two patterns from external text file and writes result into new file.

Reading and Writing Data from/to Files

Import from files

read.delim("clipboard", header=T) # Command to copy&paste tables from Excel or other programs into R. If the 'header' argument is set to FALSE, then the first line of the data set will not be used as column titles.
read.delim(pipe("pbpaste")) # Command to copy & paste on Mac OS X systems.
file.show("my_file") # Prints content of file to screen, allows scrolling.
scan("my_file") # Reads vector/array into vector from file or keyboard.
my_frame <- read.table(file="my_table") # Reads in table and assigns it to data frame.
my_frame <- read.table(file="my_table", header=TRUE, sep="\t")
   # Same as above, but with info on column headers and field separators. If you want to import the data in character mode, then include this argument: colClasses = "character".
my_frame <- read.delim("ftp://ftp.arabidopsis.org/home/tair/Microarrays/Affymetrix/affy_ATH1_array_elements-2010-12-20.txt", na.strings = "", fill=TRUE, header=T, check.names=F, sep="\t")
   # The function read.delim() is often more flexible for importing tables with empty fields and long character strings (e.g. gene descriptions).
cat(month.name, file="zzz.txt", sep="\n"); x <- readLines("zzz.txt"); x <- x[c(grep("^J", as.character(x), perl = TRUE))]; t(as.data.frame(strsplit(x,"u")))
   # A somewhat more advanced example for retrieving specific lines from an external file with a regular expression. In this example an external file is created with the 'cat'
   # function, all lines of this file are imported into a vector with 'readLines', the specific elements (lines) are then retrieved with the 'grep' function, and the resulting lines are
   # split into sub-fields with 'strsplit'.




Export to files

write.table(iris, "clipboard", sep="\t", col.names=NA, quote=F)
   # Command to copy&paste from R into Excel or other programs. It writes the data of an R data frame object into the clipboard from where it can be pasted into other applications.
zz <- pipe('pbcopy', 'w'); write.table(iris, zz, sep="\t", col.names=NA, quote=F); close(zz) # Command to copy & paste from R into Excel or other programs on Mac OS X systems.
write.table(my_frame, file="my_file", sep="\t", col.names = NA)
   # Writes data frame to a tab-delimited text file. The argument 'col.names = NA' makes sure that the titles align with columns when row/index names are exported (default).
save(x, file="my_file.txt"); load(file="my_file.txt") # Commands to save an R object to an external file and to read it back in from this file.
files <- list.files(pattern=".txt$"); for(i in files) { x <- read.table(i, header=TRUE, row.names=1, comment.char = "A", sep="\t"); assign(print(i, quote=FALSE), x);
write.table(x, paste(i, c(".out"), sep=""), quote=FALSE, sep="\t", col.names = NA) }
   # Batch import and export of many files. First, the *.txt file names in the current directory are assigned to a list ($ sign is used to anchor string '*.txt' to end of names).
   # Second, the files are imported one-by-one using a for loop where the original names are assigned to the generated data frames with the 'assign' function. Read ?read.table to
   # understand arguments 'row.names=1' and 'comment.char = "A"'. Third, the data frames are exported using their names for file naming and appending '*.out'.

HTML(my_frame, file = "my_table.html")
   # Writes data frame to HTML table. Subsequent exports to the same file will arrange several tables in one HTML document. In order to access this function one needs to load the library 'R2HTML' first.
   # This library is usually not installed by default.

write(x, file="my_file") # Writes matrix data to a file.
sink("My_R_Output") # Redirects all subsequent R output to a file 'My_R_Output' without showing it in the R console anymore.
sink() # Restores normal R output behavior.

dump("myfct", "my_file") # Writes function definition to a file.


Interfacing with Google Docs
Google Docs imports and exports are described here: RGoogleDocs Usage.


R Objects

Data and Object Types


Data Types
      • Numeric data: 1, 2, 3
x <- c(1, 2, 3); x; is.numeric(x); as.character(x) # Creates a numeric vector, checks for the data type and
   # converts it into a character vector.


      • Character data: "a", "b" , "c"
x <- c("1", "2", "3"); x; is.character(x); as.numeric(x) # Creates a character vector, checks for the data type and
   # converts it into a numeric vector.


      • Complex data: 1+2i, 4i, 3+0i
      • Logical data: TRUE, FALSE, TRUE
1:10 < 5 # Returns TRUE for the positions where the values are less than 5.
Object Types in R
  • vectors: ordered collections of numeric, character, complex or logical values
  • factors: special vectors that store grouping information about their components
  • data frames: two-dimensional structures that can hold columns of different data types
  • matrices: two-dimensional structures with data of the same type
  • arrays: multidimensional structures with data of the same type
  • lists: general form of vectors that can contain elements of different types
  • functions: pieces of reusable code
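The object types listed above can be created and inspected with the class() function, for example:

```r
v <- c(1, 2, 3); class(v) # Vector of mode "numeric".
f <- factor(c("a", "b", "a")); class(f) # Factor: "factor".
df <- data.frame(x=1:3, y=c("a", "b", "c")); class(df) # Data frame: "data.frame".
m <- matrix(1:6, nrow=2); class(m) # Matrix ("matrix" "array" in R >= 4.0).
a <- array(1:24, dim=c(2, 3, 4)); class(a) # Array: "array".
l <- list(1, "a", TRUE); class(l) # List: "list".
fct <- function(x) x^2; class(fct) # Function: "function".
```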

Naming Rules
  • Object, row and column names should not start with a number.
  • Avoid spaces in object, row and column names.
  • Avoid special characters like '#'.
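Base R's make.names() function converts arbitrary strings into names that follow these rules:

```r
make.names(c("1st_column", "my column", "size#"))
   # Returns "X1st_column" "my.column" "size.": names starting with a number get an "X"
   # prefix, and spaces and special characters are replaced by dots.
```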


General Subsetting Rules

      Subsetting syntax:
my_object[row] # Subsetting of one dimensional objects, like vectors and factors.
my_object[row, col] # Subsetting of two dimensional objects, like matrices and data frames.
my_object[row, col, dim] # Subsetting of three dimensional objects, like arrays.


There are three possibilities to subset data objects:

(1) Subsetting by positive or negative index/position numbers

my_object <- 1:26; names(my_object) <- LETTERS # Creates a vector sample with named elements.
my_object[1:4] # Returns the elements 1-4.
my_object[-c(1:4)] # Excludes elements 1-4.


(2) Subsetting by same length logical vectors

my_logical <- my_object > 10 # Generates a logical vector as example.
my_object[my_logical] # Returns the elements where my_logical contains TRUE values.

(3) Subsetting by field names

my_object[c("B", "K", "M")] # Returns the elements with element titles: B, K and M


Calling a single column or list component by its name with the '$' sign.

iris$Species # Returns the 'Species' column in the sample data frame 'iris'.
iris[,c("Species")] # Has the same effect as the previous step.


Assigning values to object components

zzz <- iris[,1:3]; zzz <- 0 # Reassignment syntax to create/replace an entire object.
zzz <- iris[,1:3]; zzz[,] <- 0 # Populates all fields in an object with zeros.
zzz <- iris[,1:3]; zzz[zzz < 4] <- 0 # Populates only specified fields with zeros

Basic Operators and Calculations

      Comparison operators
      • equal: ==
      • not equal: !=
      • greater/less than: > <
      • greater/less than or equal: >= <=
      • Example:
1 == 1 # Returns TRUE.

Logical operators
      • AND: &
x <- 1:10; y <- 10:1 # Creates the sample vectors 'x' and 'y'.
x > y & x > 5 # Returns TRUE where both comparisons return TRUE.


      • OR: |
x == y | x != y # Returns TRUE where at least one comparison is TRUE.

      • NOT: !
!x > y # The '!' sign returns the negation (opposite) of a logical vector.

      Calculations [ Function Index ]

      • Four basic arithmetic functions: addition, subtraction, multiplication and division
1 + 1; 1 - 1; 1 * 1; 1 / 1 # Returns results of basic arithmetic calculations.

      • Calculations on vectors
x <- 1:10; sum(x); mean(x); sd(x); sqrt(x) # Calculates for the vector x its sum, mean, standard deviation
   # and square root.
A list of the basic R functions can be found on the function and variable index page.
x <- 1:10; y <- 1:10; x + y # Calculates the sum for each element in the vectors x and y.


      • Iterative calculations
apply(iris[,1:3], 1, mean) # Calculates the row-wise mean values across columns 1-3 in the sample data frame 'iris'.
   # With the argument setting '1', row-wise iterations are performed and with '2' column-wise iterations.

tapply(iris[,4], iris$Species, mean) # Calculates the mean values for the 4th column based on the grouping
   # information in the 'Species' column in the 'iris' data frame.

sapply(x, sqrt) # Calculates the square root for each element in the vector x. Generates the same result as 'sqrt(x)'


Regular expressions

R's regular expression utilities work similarly to those in other languages. To learn how to use them in R, one can consult the main help page on this topic with: ?regexp.

?regexp # Opens general help page for regular expression support in R.
month.name[grep("A", month.name)] # The grep function can be used for finding patterns in strings, here letter
   # 'A' in vector 'month.name'.

gsub('(i.*a)', '\\1_xxx', iris$Species, perl = TRUE)
   # Example for using regular expressions to substitute a pattern by another one using a back reference.
   # Remember: single escapes '\' need to be double escaped '\\' in R.

Vectors

General information

Vectors are ordered collections of 'atomic' (same data type) components or modes of the following four types: numeric, character, complex and logical. Missing values are indicated by 'NA'. R inserts them automatically in blank fields.
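The behavior of 'NA' values in calculations can be demonstrated with a short example; many functions offer the argument 'na.rm=TRUE' to ignore them:

```r
x <- c(2, NA, 6) # Vector containing a missing value.
is.na(x) # Identifies missing values: FALSE TRUE FALSE.
mean(x) # Returns NA, because NAs propagate through calculations.
mean(x, na.rm=TRUE) # Returns 4 after removing the missing value.
x[!is.na(x)] # Removes the NA entry, leaving 2 and 6.
```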

x <- c(2.3, 3.3, 4.4) # Example for creating a numeric vector with ordered collection of numbers using the c() function.
z <- scan(file="my_file") # Reads data line-wise from file and assigns them to vector 'z'.
vector[rows] # Syntax to access vector sections.
z <- 1:10; z; as.character(z) # The function 'as.character' changes the data mode from numeric to character.
as.numeric(as.character(z)) # The function 'as.numeric' changes the data mode from character back to numeric.
d <- as.integer(x); d # Transforms numeric data into integers.
x <- 1:100; sample(x, 5) # Selects a random sample of size of 5 from a vector.
x <- as.integer(runif(100, min=1, max=5)); sort(x); rev(sort(x)); order(x); x[order(x)]
   # Generates random integers from 1 to 4. The sort() function sorts the items by size. The rev() function reverses the order. The order() function returns the corresponding
   # indices for a sorted object. The order() function is usually the one that needs to be used for sorting complex objects, such as data frames or lists.

x <- rep(1:10, times=2); x; unique(x) # The unique() function removes the duplicated entries in a vector.
sample(1:10, 5, replace=TRUE) # Returns a set of randomly selected elements from a vector (here 5 numbers from 1 to 10), either with replacement ('replace=TRUE') or without ('replace=FALSE').

Sequences

R has several facilities to create sequences of numbers:

1:30 # Generates a sequence of integers from 1 to 30.
letters; LETTERS; month.name; month.abb # Generates lower case letters, capital letters, month names and abbreviated month names, respectively.
2*1:10 # Creates a sequence of even numbers.
seq(1, 30, by=0.5) # Generates a number sequence from 1 to 30 in 0.5 increments.
seq(length=100, from=20, by=0.5) # Creates number sequence with specified start and length.
rep(LETTERS[1:8], times=5) # Replicates given sequence or vector x times.


Character Vectors

paste(LETTERS[1:8], 1:12, sep="") # The command 'paste' merges vectors after converting to characters.
x <- paste(rep("A", times=12), 1:12, sep=""); y <- paste(rep("B", times=12), 1:12, sep=""); append(x,y) # Possibility to build plate location vector in R (better example under 'arrays').



Subsetting Vectors

x <- 1:100; x[2:23] # Values in square brackets select vector range.
x <- 1:100; x[-(2:23)] # Prints everything except the values in square brackets.
x[5] <- 99 # Replaces value at position 5 with '99'.
x <- 1:10; y <- c(x[1:5],99,x[6:10]); y # Inserts new value at defined position of vector.
letters=="c" # Returns a logical vector of TRUE/FALSE values, with TRUE at the position where "c" occurs.
which(rep(letters,2)=="c") # Returns index numbers where "c" occurs in the 'letters' vector. For retrieving indices
   # of several strings provided by query vector, use the following 'match' function.

match(c("c","g"), letters)
   # Returns index numbers for "c" and "g" in the 'letters' vector. If the query vector (here 'c("c","g")') contains entries
   # that are duplicated in the target vector, then this syntax returns only the first occurrence(s) for each duplicate.
   # To retrieve the indices for all duplicates, use the following '%in%' function.
x <- rep(1:10, 2); y <- c(2,4,6); x %in% y

   # The function '%in%' returns a logical vector. This syntax allows the subsetting of vectors and data frames with a
   # query vector ('y') containing entries that are duplicated in the target
   # vector ('x'). The resulting logical vector can be used for the actual subsetting step of vectors and data frames.



Finding Identical and Non-Identical Entries between Vectors

intersect(month.name[1:4], month.name[3:7]) # Returns identical entries of two vectors.
month.name[month.name %in% month.name[3:7]]
   # Returns the same result as in the previous step. The vector comparison with %in% returns first a logical vector of
   # identical items that is then used to subset the first vector.

setdiff(x=month.name[1:4], y=month.name[3:7]); setdiff(month.name[3:7], month.name[1:4])
   # Returns the unique entries occurring only in the first vector. Note: if the argument names are not used, as in the
   # second example, then the order of the arguments is important.

union(month.name[1:4], month.name[3:7]) # Joins two vectors without duplicating identical entries.
x <- c(month.name[1:4], month.name[3:7]); x[duplicated(x)] # Returns duplicated entries

Factors

Factors are vector objects that contain grouping (classification) information about their components.

animalf <- factor(c("dog", "cat", "mouse", "dog", "dog", "cat")) # Creates factor 'animalf' from vector.
animalf # Prints out factor 'animalf', this lists first all components and then the different levels (unique entries); alternatively one can print only levels with 'levels(animalf)'.
table(animalf) # Creates frequency table for levels.

Function tapply applies calculation on all members of a level
weight <- c(102, 50, 5, 101, 103, 52) # Creates new vector with weight values for 'animalf' (both need to have same length).
mean <- tapply(weight, animalf, mean) # Applies a function (e.g. length, mean, median, sum) to the values of each level;
   # 'length' provides the number of entries (replicates) in each level.



Function cut divides a numeric vector into size intervals
y <- 1:200; interval <- cut(y, right=F, breaks=c(1, 2, 6, 11, 21, 51, 101, length(y)+1), labels=c("1","2-5","6-10", "11-20", "21-50", "51-100", ">=101")); table(interval)
   # Prints the counts for the specified size intervals ('breaks') in the numeric vector 1:200.
plot(interval, ylim=c(0,110), xlab="Intervals", ylab="Count", col="green"); text(labels=as.character(table(interval)), x=seq(0.7, 8, by=1.2), y=as.vector(table(interval))+2)
   # Plots the size interval counts as bar diagram.


Matrices and Arrays

Matrices are two dimensional data objects consisting of rows and columns. Arrays are similar, but they can have one, two or more dimensions. In contrast to data frames (see below), one can store only a single data type in the same object (e.g. numeric or character).
x <- matrix(1:30, 3, 10, byrow = T) # Lays out vector (1:30) in 3 by 10 matrix. The argument 'byrow' defines whether the matrix
   # is filled by row or columns.

dim(x) <- c(3,5,2) # Transforms above matrix into multidimensional array.
x <- array(1:25, dim=c(5,5)) # Creates 5 by 5 array ('x') and fills it with values 1-25.
y <- c(x) # Writes array into vector.
x[c(1:5),3] # Writes values from 3rd column into vector structure.
mean(x[c(1:5),3]) # Calculates mean of 3rd column.


Subsetting matrices and arrays
array[rows, columns] # Syntax to access columns and rows of matrices and two-dimensional arrays.
as.matrix(iris) # Many functions in R require matrices as input. If something doesn't work then try to convert the object into a
   # matrix with the as.matrix() function.

i <- array(c(1:5,5:1), dim=c(5,2)) # Creates 5 by 2 index array ('i') and fills it with the values 1-5, 5-1.
x[i] # Extracts the corresponding elements from 'x' that are indexed by 'i'.
x[i] <- 0 # Replaces those elements by zeros.
array1 <- array(scan(file="my_array_file", sep="\t"), c(4,3)) # Reads data from 'my_array_file' and writes it into 4 by 3 array.


Subsetting arrays with more than two dimensions
array[rows, columns, dimensions] # Syntax to access columns, rows and dimensions in arrays with more than two dimensions.
x <- array(1:250, dim=c(10,5,5)); x[2:5,3,] # Example to generate 10x5x5 array and how to retrieve slices from all sub-arrays.



Merging arrays: example for building location tables for microtiter plates

Z <- array(1:12, dim=c(12,8)); X <- array(c("A","B","C","D","E","F","G","H"), dim=c(8,12)); Y <- paste(t(X), Z, sep=""); Y
   # Creates array with 12 Ax-Hx columns.

M <- array(Y, c(96,1)) # Writes 12 Ax-Hx columns into one.


Script for mapping 24/48/96 to 384 well plates


source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/384_96_48_24conversion.txt")
   # Prints plate mappings for 24/48/96 well plates to 384 well plates.


Script for mapping 384 to 1536 well plates

source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/1536_384conversion.txt")
   # Prints plate mappings for 384 well plates to 1536 well plates.


Appending arrays and matrices
cbind(matrix1, matrix2) # Appends columns of matrices with same number of rows.
rbind(matrix1, matrix2) # Appends rows of matrices with same number of columns.
c(array1, array2) # Clears dimension attributes from arrays and concatenates them in vector format. The function
   # 'as.vector(my_array1)' achieves a similar result.


Calculations between arrays
Z <- array(1:12, dim=c(12,8)); X <- array(12:1, dim=c(12,8)) # Creates arrays Z and X with same dimensions.
calarray <- Z/X # Divides Z/X elements and assigns result to array 'calarray'.
t(my_array) # Transposes 'my_array'; a more flexible transpose function is 'aperm(my_array, perm)'.


Information about arrays
dim(X); nrow(X); ncol(X) # Returns the dimensions, the number of rows and the number of columns of array 'X', respectively.

Data Frames

Data frames are two dimensional data objects that are composed of rows and columns. They are very similar to matrices. The main difference is that data frames can store different data types, whereas matrices allow only one data type (e.g. numeric or character).

Constructing data frames
my_frame <- data.frame(y1=rnorm(12), y2=rnorm(12), y3=rnorm(12), y4=rnorm(12)) # Creates data frame with four columns (y1-y4) containing 12 random numbers each.
rownames(my_frame) <- month.name[1:12] # Assigns row (index) names. These names need to be unique.
names(my_frame) <- c("y4", "y3", "y2", "y1") # Assigns new column titles.
names(my_frame)[c(1,2)] <- c("y3", "y4") # Changes titles of specific columns.
my_frame <- data.frame(IND=row.names(my_frame), my_frame) # Generates new column with title "IND" containing the row names.
my_frame[,2:5]; my_frame[,-1] # Different possibilities to remove column(s) from a data frame.



Accessing and slicing data frame sections

my_frame[rows, columns] # Generic syntax to access columns and rows in data frames.
dim(my_frame) # Gives dimensions of data frame.
length(my_frame); length(my_frame$y1) # Provides number of columns or rows of data frame, respectively.
colnames(my_frame); rownames(my_frame) # Gives column and row names of data frame.
row.names(my_frame) # Prints row names or indexing column of data frame.
my_frame[order(my_frame$y2, decreasing=TRUE), ] # Sorts the rows of a data frame by the specified columns, here 'y2';
   # for increasing order use 'decreasing=FALSE'.

my_frame[order(my_frame[,4], -my_frame[,3]),] # Subsequent sub-sorts can be performed by specifying additional columns. By adding a "-" sign one can reverse the sort order.
my_frame$y1 # Notation to print entire column of a data frame as vector or factor.
my_frame$y1[2:4] # Notation to access column element(s) of a data frame.
v <- my_frame[,4]; v[3] # Notation for returning the value of an individual cell. In this example the corresponding
   # column is first assigned to a vector and then the desired field is accessed by its index number.
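For comparison, a direct single-step cell access returns the same value as the two-step approach above (minimal sketch):

```r
my_frame <- data.frame(y1=rnorm(12), y2=rnorm(12), y3=rnorm(12), y4=rnorm(12)) # Example data frame.
my_frame[3, 4] # Accesses the cell in row 3, column 4 directly.
identical(my_frame[3, 4], my_frame[,4][3]) # Both approaches return the same value.
```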

my_frame[1:5,] # Notation to view only the first five rows of all columns.
my_frame[, 1:2] # Notation to view all rows of the first two columns.
my_frame[, c(1,3)] # Notation to view all rows of the specified columns.
my_frame[1:5,1:2] # Notation to view only the first five rows of the columns 1-2.
my_frame["August",] # Notation to retrieve row values by their index name (here "August").
x <- data.frame(row.names=LETTERS[1:10], letter=letters[1:10], Month=month.name[1:10]); x; match(c("c","g"), x[,1])
   # Returns matching index numbers of data frame or vector using 'match' function.
   # This syntax returns for duplicates only the index of their first occurrence. To return all, use the following syntax.

x[x[, 2] %in% month.name[3:7],] # Subsets a data frame with a query vector using the '%in%' function. This returns all occurrences of duplicates.
as.vector(as.matrix(my_frame[1,])) # Returns one row entry of a data frame as vector.
as.data.frame(my_list) # Returns a list object as data frame if the list components can be converted into a two dimensional object.

Calculations on data frames

summary(my_frame) # Prints a handy summary for a data frame.
colMeans(my_frame[,-1]) # Calculates the mean for all numeric columns. Note: mean() itself no longer accepts data frames in current R versions.
data.frame(my_frame, mean=apply(my_frame[,2:5], 1, mean), ratio=(my_frame[,2]/my_frame[,3]))
   # The apply function performs row-wise or column-wise calculations on data frames or matrices (here mean and ratios). The results are returned as vectors.
   # In this example, they are appended to the original data frame with the data.frame function. The argument '1' in the apply function specifies row-wise calculations.
   # If '2' is selected, then the calculations are performed column-wise.
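A minimal sketch of the two MARGIN settings on a small matrix:

```r
myMA <- matrix(1:6, nrow=2) # Columns are (1,2), (3,4), (5,6).
apply(myMA, 2, mean) # MARGIN=2 computes column means: 1.5 3.5 5.5
apply(myMA, 1, mean) # MARGIN=1 computes row means: 3 4
```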

aggregate(my_frame, by=list(c("G1","G1","G1","G1","G2","G2","G2","G2","G3","G3","G3","G4")), FUN=mean)
   # The aggregate function can be used to compute the mean (or any other stats) for data groups specified under the argument 'by'.
cor(my_frame[,2:4]); cor(t(my_frame[,2:4])) # Syntax to calculate a correlation matrix based on all-against-all rows and all-against-all columns.
x <- matrix(rnorm(48), 12, 4, dimnames=list(month.name, paste("t", 1:4, sep=""))); corV <- cor(x["August",], t(x), method="pearson"); y <- cbind(x, correl=corV[1,]); y[order(-y[,5]), ]
   # Commands to perform a correlation search and ranking across all rows (matrix object required here). First, an example matrix 'x' is created. Second, the
   # correlation values for the "August" row against all other rows are calculated. Finally, the resulting vector with the correlation values is merged with the
   # original matrix 'x' and sorted by decreasing correlation values.

merge(frame1, frame2, by.x = "frame1col_name", by.y = "frame2col_name", all = TRUE)
   # Merges two data frames (tables) by common field entries. To obtain only the common rows, change 'all = TRUE' to 'all = FALSE'. To merge on row names (indices),
   # refer to it with "row.names" or '0', e.g.: 'by.x = "row.names", by.y = "row.names"'.
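The row-name merge described above can be sketched with two small example frames:

```r
df1 <- data.frame(a=1:3, row.names=c("r1","r2","r3")) # First frame with row names.
df2 <- data.frame(b=4:6, row.names=c("r2","r3","r4")) # Second frame sharing some row names.
merge(df1, df2, by=0, all=TRUE) # 'by=0' (or by="row.names") merges on the indices; the former row names appear in a 'Row.names' column.
```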

my_frame1 <- data.frame(title1=month.name[1:8], title2=1:8); my_frame2 <- data.frame(title1=month.name[4:12], title2=4:12); merge(my_frame1, my_frame2, by.x = "title1", by.y = "title1", all = TRUE)
   # Example for merging two data frames by common field.



Fast computations on large data frames and matrices
myDF <- as.data.frame(matrix(rnorm(100000), 10000, 10)) # Creates an example data frame.
myCol <- c(1,1,1,2,2,2,3,3,4,4); myDFmean <- t(aggregate(t(myDF), by=list(myCol), FUN=mean, na.rm=T)[,-1])
colnames(myDFmean) <- tapply(names(myDF), myCol, paste, collapse="_"); myDFmean[1:4,]
   # The aggregate function can be used to perform calculations (here mean) across rows for any combination of column selections (here myCol) after transposing the
   # data frame. However, this will be very slow for data frames with millions of rows.

myList <- tapply(colnames(myDF), c(1,1,1,2,2,2,3,3,4,4), list)
names(myList) <- sapply(myList, paste, collapse="_"); myDFmean <- sapply(myList, function(x) apply(myDF[,x], 1, mean)); myDFmean[1:4,]
   # Faster alternative for performing the aggregate computation of the previous step. However, this step will still be very slow for very large
   # data sets, due to the apply loop over the row elements. Note: the mean(data.frame) shortcut used here previously is defunct in current R versions.

myList <- tapply(colnames(myDF), c(1,1,1,2,2,2,3,3,4,4), list)
myDFmean <- sapply(myList, function(x) rowSums(myDF[,x])/length(x)); colnames(myDFmean) <- sapply(myList, paste, collapse="_")
myDFmean[1:4,] # By using the rowSums or rowMeans functions one can perform the above aggregate computations 100-1000 times faster by avoiding a loop over the rows altogether.
myDFsd <- sqrt((rowSums((myDF-rowMeans(myDF))^2)) / (length(myDF)-1)); myDFsd[1:4]
   # Similarly, one can compute the standard deviation for large data frames by avoiding loops.
   # This approach is about 100 times faster than the loop-based alternative apply(myDF, 1, sd).
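As a sanity check (a small sketch, not part of the original manual), the vectorized row-wise standard deviation agrees with the loop-based apply() version:

```r
myDF <- as.data.frame(matrix(rnorm(1000), 100, 10)) # Small example data frame.
sd_fast <- sqrt(rowSums((myDF - rowMeans(myDF))^2) / (ncol(myDF) - 1)) # Vectorized row-wise sd; the subtraction works because the vector length equals nrow(myDF).
sd_slow <- apply(myDF, 1, sd) # Loop-based equivalent.
all.equal(as.numeric(sd_fast), as.numeric(sd_slow)) # Both give the same result.
```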

Table of Contents

Conditional selections
my_frame[!duplicated(my_frame[,2]),] # Removes rows with duplicated values in selected column.
my_frame[my_frame$y2 > my_frame$y3,]
   # Prints all rows of data frame where values of col1 > col2. Comparison operators are: == (equal), != (not equal), >= (greater than or equal), etc. Logical operators are & (and), | (or) and ! (not).
x <- 0.5:10; x[x<1.0] <- -1/x[x<1.0] # Replaces all values in vector or data frame that are below 1 with their reciprocal value.
x <-data.frame(month=month.abb[1:12], AB=LETTERS[1:2], no1=1:48, no2=1:24); x[x$month == "Apr" & (x$no1 == x$no2 | x$no1 > x$no2),]
   # Prints all records of frame 'x' that contain 'Apr' AND have equal values in columns 'no1' and 'no2' OR have greater values in column 'no1'.
x[x[,1] %in% c("Jun", "Aug"),] # Retrieves rows with column matches specified in a query vector.
x[c(grep("\\d{2}", as.character(x$no1), perl = TRUE)),] # Possibility to print out all rows of a data frame where a regular expression matches (here all double digit values in col 'no1').
x[unique(unlist(lapply(1:4, function(i) grep("\\d{2}", as.character(x[,i]), perl=TRUE)))),] # Same as above, but searches all columns (1-4) with an lapply loop. Note: a for loop returns NULL and therefore cannot be used inside as.character().
z <- data.frame(chip1=letters[1:25], chip2=letters[25:1], chip3=letters[1:25]); z; y <- apply(z, 1, function(x) sum(x == "m") > 2); z[y,]
   # Identifies in a data frame ('z') all those rows that contain a certain number of identical fields (here 'm' > 2).
z <- data.frame(chip1=1:25, chip2=25:1, chip3=1:25); zcount <- data.frame(z, count=apply(z[,1:3], 1, function(x) sum(x >= 5))); zcount
   # Counts in each row of a data frame the number of fields that are above or below a specified value and appends this information to the data frame. By default rows with "NA" values will
   # be ignored. To work around this limitation, one can replace the NA fields with a value that doesn't affect the result, e.g.: x[is.na(x)] <- 1.
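A small sketch of the NA workaround mentioned above:

```r
z <- data.frame(chip1=c(1, NA, 7), chip2=c(6, 2, NA), chip3=c(8, 9, 3)) # Example data with NA fields.
z[is.na(z)] <- 1 # Replaces NAs with a value that does not satisfy the condition x >= 5.
data.frame(z, count=apply(z, 1, function(x) sum(x >= 5))) # Counts per row without NA interference: 2, 1, 1.
```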

x <- data.frame(matrix(rep(c("P","A","M"),20),10,5)); x; index <- x == "P"; cbind(x, Pcount=rowSums(index)); x[rowSums(index)>=2,]
   # Example of how one can count occurrences of strings across rows. In this example the occurrences of "P" in a data frame of "PMA" values are counted by converting to a data frame of
   # logical values and then counting the 'TRUE' occurrences with the 'rowSums' function, which does in this example the same as: 'cbind(x, count=apply(x=="P",1,sum))'.



Reformatting data frames with reshape2 and splitting/apply routines with plyr

##############
## reshape2 ##
##############
## Some of these operations are important for plotting routines with the ggplot2 library.
## melt: rbinds many columns

library(reshape2)
(iris_mean <- aggregate(iris[,1:4], by=list(Species=iris$Species), FUN=mean))
     Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1     setosa        5.006       3.428        1.462       0.246
2 versicolor        5.936       2.770        4.260       1.326
3  virginica        6.588       2.974        5.552       2.026


(df_mean <- melt(iris_mean, id.vars=c("Species"), variable.name = "Samples")) # See ?melt.data.frame for details.
      Species      Samples value
1      setosa Sepal.Length 5.006
2  versicolor Sepal.Length 5.936
3   virginica Sepal.Length 6.588
4      setosa  Sepal.Width 3.428
5  versicolor  Sepal.Width 2.770
6   virginica  Sepal.Width 2.974
...


## dcast: cbinds row aggregates (reverses melt result)

dcast(df_mean, formula = Species ~ ...)
     Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1     setosa        5.006       3.428        1.462       0.246
2 versicolor        5.936       2.770        4.260       1.326
3  virginica        6.588       2.974        5.552       2.026



## colsplit: splits strings into columns

x <- c("a_1_4", "a_2_3", "b_2_5", "c_3_9")
colsplit(x, "_", c("trt", "time1", "time2"))
  trt time1 time2
1   a     1     4
2   a     2     3
3   b     2     5
4   c     3     9


##########
## plyr ##
##########

library(plyr)
ddply(.data=iris, .variables=c("Species"), .fun=summarize, mean=mean(Sepal.Length))

     Species  mean
1     setosa 5.006
2 versicolor 5.936
3  virginica 6.588


ddply(.data=iris, .variables=c("Species"), .fun=transform, mean=mean(Sepal.Length))
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species  mean
1          5.1         3.5          1.4         0.2  setosa 5.006
2          4.9         3.0          1.4         0.2  setosa 5.006
3          4.7         3.2          1.3         0.2  setosa 5.006
4          4.6         3.1          1.5         0.2  setosa 5.006
...


## Usage with parallel package
library(parallel); library(doMC); registerDoMC(2) # 2 cores
test <- ddply(.data=iris, .variables=c("Species"), .fun=summarize, mean=mean(Sepal.Length), .parallel=TRUE) # Note: the argument is named '.parallel' in plyr.


Lists

Lists are ordered collections of objects that can be of different modes (e.g. numeric vector, array, etc.). A name can be assigned to each list component.

my_list <- list(name="Fred", wife="Mary", no.children=3, child.ages=c(4,7,9)) # Example of how to create a list.
attributes(my_list); names(my_list) # Prints attributes and names of all list components.
my_list[2]; my_list[[2]] # Returns component 2 of list; The [[..]] operator returns the object type stored in this component, while '[..]' returns a single component list.
my_list$wife # Named components in lists can also be called with their name. Command returns same component as 'my_list[[2]]'.
my_list[[4]][2] # Returns from the fourth list component the second entry.
length(my_list[[4]]) # Prints the length of the fourth list component.
my_list$wife <- 1:12 # Replaces the content of the existing component with new content.
my_list$wife <- NULL # Deletes list component 'wife'.
my_list <- c(my_list, list(my_title2=month.name[1:12])) # Appends new list component to existing list.
my_list <- c(my_name1=my_list1, my_name2=my_list2, my_name3=my_list3) # Concatenates lists.
my_list <- c(my_title1=my_list[[1]], list(my_title2=month.name[1:12])) # Concatenates existing and new components into a new list.
unlist(my_list); data.frame(unlist(my_list)); matrix(unlist(my_list)); data.frame(my_list) # Useful commands to convert lists into other objects, such as vectors, data frames or matrices.
my_frame <- data.frame(y1=rnorm(12),y2=rnorm(12), y3=rnorm(12), y4=rnorm(12)); my_list <- apply(my_frame, 1, list); my_list <- lapply(my_list, unlist); my_list
   # Stores the rows of a data frame as separate vectors in a list container.
mylist <- list(a=letters[1:10], b=letters[10:1], c=letters[1:3]); lapply(names(mylist), function(x) c(x, mylist[[x]]))
   # Syntax to access list component names with the looping function 'lapply'. In this example the list component names are prepended to the corresponding vectors.
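In addition to unlist(), the base function do.call() is commonly used to collapse a list of equal-length vectors into a matrix (a minimal sketch):

```r
my_list <- list(a=1:3, b=4:6) # Example list of two equal-length vectors.
do.call(rbind, my_list) # Row-binds all components into a 2 x 3 matrix; the component names become row names.
```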



Missing Values

Missing values are represented in R data objects by the missing value placeholder 'NA'. Many R functions return 'NA' when they are applied to data objects containing missing values. Several 'na.action' options (e.g. 'na.omit', 'na.exclude', 'na.fail') are available to change this behavior.

x <- 1:10; x <- x[1:12]; z <- data.frame(x,y=12:1) # Creates sample data.frame with 'NA' values
is.na(x) # Finds out which elements of x contain "NA" fields.
na.omit(z) # Removes rows with 'NA' fields in numeric data objects.
x <- letters[1:10]; print(x); x <- x[1:12]; print(x); x[!is.na(x)] # A similar result can be achieved on character data objects with the function 'is.na()'.
apply(z, 1, mean, na.rm=T) # Uses available values to perform calculation while ignoring the 'NA' values; the default 'na.rm=F' returns 'NA' for rows with missing values.
z[is.na(z)] <- 0
   # Replaces 'NA' values with zeros. To replace the special <NA> signs in character columns of data frames, convert the data frame with as.matrix() to a matrix,
   # perform the substitution on this level and then convert it back to a data frame with as.data.frame().
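A small sketch of the character-column workaround described above; note that all columns are of mode character after the round-trip:

```r
df <- data.frame(txt=c("x", NA, "z"), no=1:3, stringsAsFactors=FALSE) # Data frame with an NA in a character column.
m <- as.matrix(df); m[is.na(m)] <- "" # Converts to a character matrix and performs the substitution on this level.
df <- as.data.frame(m, stringsAsFactors=FALSE); df # Converts back; numeric columns need to be re-coerced afterwards, e.g. with as.numeric().
```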


Some Great R Functions

This section contains a small collection of extremely useful R functions.

The unique() function makes vector entries unique:

unique(iris$Sepal.Length); length(unique(iris$Sepal.Length))
   # The first step returns the unique entries for the Sepal.Length column in the iris data set and the second step
   # counts the number of its unique entries.



The table() function counts the occurrence of entries in a vector. It is the most basic "clustering function":
my_counts <- table(iris$Sepal.Length, exclude=NULL)[as.character(iris$Sepal.Length)]; cbind(iris, CLSZ=my_counts)[1:4,]
   # Counts identical entries in the Sepal.Length column of the iris data set and aligns the counts with the original
   # data frame. Note: to count NAs, set the 'exclude' argument to NULL.

myvec <- c("a", "a", "b", "c", NA, NA); table(factor(myvec, levels=c(unique(myvec), "z"), exclude=NULL))
   # Additional count levels can be specified by turning the test vector into a factor and specifying them with the 'levels' argument.



The combn() function creates all combinations of elements:
labels <- paste("Sample", 1:5, sep=""); combn(labels, m=2, FUN=paste, collapse="-")
   # Example to create all possible pairwise combinations of sample labels. The number of samples to consider in the
   # comparisons can be controlled with the 'm' argument.

allcomb <- lapply(seq(along=labels), function(x) combn(labels, m=x, simplify=FALSE, FUN=paste, collapse="-")); unlist(allcomb)
   # Creates all possible combinations of sample labels.



The aggregate() function computes any type of summary statistics of data subsets that are grouped together:
aggregate(iris[,1:4], by=list(iris$Species), FUN=mean, na.rm=T) # Computes the mean (or any other stats) for the data groups
   # defined by the 'Species' column in the iris data set.

t(aggregate(t(iris[,1:4]), by=list(c(1,1,2,2)), FUN=mean, na.rm=T)[,-1])
   # The function can also perform computations for column aggregates (here: columns 1-2 and 3-4) after transposing the data
   # frame. A much faster alternative is given in the data frame section.



The apply() function simplifies the coding of iterative tasks (loops). In the following example, apply() is first used to compute a row-wise calculation on a matrix. In a second example apply() is called by the custom function colAg() to perform similar calculations, but allowing the user to select any combination of column aggregates with an easy to handle grouping vector.
myMA <- matrix(rnorm(100000), 10000, 10, dimnames=list(1:10000, paste("C", 1:10, sep=""))) # Creates a sample data set.
apply(myMA, 1, mean)[1:4] # Computes the mean for each row in myMA.
source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/colAg.R") # Imports the colAg() function.
colAg(myMA=myMA, group=c(1,1,1,2,2,2,3,3,4,4), myfct=mean)[1:4,]
   # Applies a computation given under the myfct argument (here: mean) for the column aggregates specified under the group argument
   # (here: c(1,1,1,2,2,2,3,3,4,4)). The columns in the resulting object are named after the chosen aggregates. Note: the function can
   # only perform those calculations that can be applied to sets of two or more values, such as mean, sum, sd, min and max. Much faster
   # to compute, but less flexible, alternatives are given in the data frame section.

The %in% operator returns a logical vector indicating which elements of one vector are present in a second vector. In a subsetting context with '[ ]', it can be used to intersect matrices, data frames and lists:

my_frame <- data.frame(Month=month.name, N=1:12); my_query <- c("May", "August") # Generates a sample data frame and a sample query vector.
my_frame[my_frame$Month %in% my_query, ] # Returns the rows in my_frame where the "Month" column and my_query contain identical entries.


The merge() function joins data frames based on a common key column:

frame1 <- iris[sample(1:length(iris[,1]), 30), ]
my_result <- merge(frame1, iris, by.x = 0, by.y = 0, all = TRUE); dim(my_result)
   # Merges two data frames (tables) by common field entries, here row names (by.x=0). To obtain only the common rows,
   # change 'all = TRUE' to 'all = FALSE'. To merge on specific columns, refer to them by their position numbers or their column
   # names (e.g. "Species").


Graphical Procedures

[ Slide Show ]  [ R Code ]

Introduction to R Graphics

R provides comprehensive graphics utilities for visualizing and exploring scientific data. In addition, several powerful graphics environments extend these utilities. These include the grid, lattice and ggplot2 packages. The grid package is part of R's base distribution. It provides the low-level infrastructure for many graphics packages, including lattice and ggplot2. Extensive information on graphics utilities in R can be found on the Graphics Task Page, the R Graph Gallery and the R Graphical Manual. Another useful reference for graphics procedures is Paul Murrell's book R Graphics. Interactive graphics in R can be generated with rggobi (GGobi) and iplots.

Base Graphics
The following list provides an overview of some very useful plotting functions in R's base graphics. To get familiar with their usage, it is recommended to carefully read their help documentation with ?myfct as well as the help on the functions plot and par.
      • plot: generic x-y plotting
      • barplot: bar plots
      • boxplot: box-and-whisker plot
      • hist: histograms
      • pie: pie charts
      • dotchart: Cleveland dot plots
      • image, heatmap, contour, persp: functions to generate image-like plots
      • qqnorm, qqline, qqplot: distribution comparison plots
      • pairs, coplot: display of multivariate data
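A minimal demo of two functions from the list above (base R only):

```r
x <- rnorm(100) # Random sample.
hist(x, main="Histogram") # Histogram of the sample.
qqnorm(x); qqline(x) # Normal quantile-quantile plot with a reference line.
```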
Lattice  [ Manuals: lattice, Intro, book ]
The lattice package developed by Deepayan Sarkar implements in R the Trellis graphics system from S-Plus. The environment greatly simplifies many complicated high-level plotting tasks, such as automatically arranging complex graphical features in one or several plots. The syntax of the package is similar to R's base graphics; however, high-level lattice functions return an object of class "trellis", that can be either plotted directly or stored in an object. The command library(help=lattice) will open a list of all functions available in the lattice package, while ?myfct and example(myfct) can be used to access and/or demo their documentation. Important functions for accessing and changing global parameters are: ?lattice.options and ?trellis.device.

ggplot2 [ Manuals: ggplot2, DocsIntro and book ]
ggplot2 is another more recently developed graphics system for R, based on the grammar of graphics theory. The environment streamlines many graphics routines for the user to generate with minimum effort complex multi-layered plots. Its syntax is centered around the main ggplot function, while the convenience function qplot provides many shortcuts. The ggplot function accepts two arguments: the data set to be plotted and the corresponding aesthetic mappings provided by the aes function. Additional plotting parameters such as geometric objects (e.g. points, lines, bars) are passed on by appending them with '+' as separator. For instance, the following command will generate a scatter plot for the first two columns of the iris data frame: ggplot(iris, aes(iris[,1], iris[,2])) + geom_point(). A list of the available geom_* functions can be found here. The settings of the plotting theme can be accessed with the command theme_get(). Their settings can be changed with the opts() function (replaced by theme() in newer ggplot2 versions). To benefit from the many convenience features built into ggplot2, the expected input data class is usually a data frame where all labels for the plot are provided by the column titles and/or grouping factors in additional column(s).

The following graphics sections demonstrate how to generate different types of plots first with R's base graphics device and then with the lattice and ggplot2 packages.  

Scatter Plots

Base Graphics

y <- as.data.frame(matrix(runif(30), ncol=3, dimnames=list(letters[1:10], LETTERS[1:3]))) # Generates a sample data set.
plot(y[,1], y[,2]) # Plots two vectors (columns) in form of a scatter plot against each other.
plot(y[,1], y[,2], type="n", main="Plot of Labels"); text(y[,1], y[,2], rownames(y)) # Prints instead of symbols the row names.
plot(y[,1], y[,2], pch=20, col="red", main="Plot of Symbols and Labels"); text(y[,1]+0.03, y[,2], rownames(y))
   # Plots both, symbols plus their labels.

grid(5, 5, lwd = 2) # Adds a grid to existing plot.
op <- par(mar=c(8,8,8,8), bg="lightblue")
plot(y[,1], y[,2], type="p", col="red", cex.lab=1.2, cex.axis=1.2, cex.main=1.2, cex.sub=1, lwd=4, pch=20, xlab="x label", ylab="y label", main="My Main", sub="My Sub")
par(op)
   # The previous commands demonstrate the usage of the most important plotting parameters. The 'mar' argument specifies the
   # margin sizes around the plotting area in this order: c(bottom, left, top, right). The color of the plotted symbols can be
   # controlled with the 'col' argument. The plotting symbols can be selected with the 'pch' argument, while their size is
   # controlled by the 'cex' argument. A selection palette for 'pch' plotting symbols can be opened with the command 'example(points)'.
   # As an alternative, one can plot any character string by passing it on to 'pch', e.g.: pch=".". The font sizes of the different text
   # components can be changed with the 'cex.*' arguments. Please consult the '?par' documentation for more details on fine tuning R's
   # plotting behavior.

plot(y[,1], y[,2]); myline <- lm(y[,2] ~ y[,1]); abline(myline, lwd=2) # Adds a regression line to the plot.
summary(myline) # Prints summary about regression line.
plot(y[,1], y[,2], log="xy") # Generates the same plot as above, but on log scale.
plot(y[,1], y[,2]); text(y[1,1], y[1,2], expression(sum(frac(1,sqrt(x^2*pi)))), cex=1.3) # Adds a mathematical formula to the plot.
plot(y) # Produces all possible scatter plots for all-against-all columns in a matrix or a data frame. The column headers of
   # the matrix or data frame are used as axis titles.

pairs(y) # Alternative way to produce all possible scatter plots for all-against-all columns in a matrix or a data frame.
library(scatterplot3d); scatterplot3d(y[,1:3], pch=20, color="red") # Plots a 3D scatter plot for first three columns in y.
library(geneplotter); smoothScatter(y[,1], y[,2]) # Same as above, but generates a smooth scatter plot that shows the density
   # of the data points.

Scatter Plot Generated with Base Graphics
lattice

library(lattice)
xyplot(1:10 ~ 1:10) # Simple scatter plot.
xyplot(1:10 ~ 1:10 | rep(LETTERS[1:5], each=2), as.table=TRUE) # Plots subcomponents specified by grouping vector after '|' in separate panels. The argument as.table controls the order of the panels.
myplot <- xyplot(Petal.Width ~ Sepal.Width | Species , data = iris); print(myplot) # Assigns plotting function to an object and executes it.
xyplot(Petal.Width ~ Sepal.Width | Species , data = iris, layout = c(3, 1, 1)) # Changes layout of individual plots.

## Change plotting parameters
show.settings() # Shows global plotting parameters in a set of sample plots.
default <- trellis.par.get(); mytheme <- default; names(mytheme) # Stores the global plotting parameters in list mytheme and prints its component titles.
mytheme["background"][[1]][[2]] <- "grey" # Sets background to grey
mytheme["strip.background"][[1]][[2]] <- "transparent" # Sets background of title bars to transparent.
trellis.par.set(mytheme) # Sets global parameters to 'mytheme'.
show.settings() # Shows custom settings.
xyplot(1:10 ~ 1:10 | rep(LETTERS[1:5], each=2), as.table=TRUE, layout=c(1,5,1), col=c("red", "blue"))
trellis.par.set(default)

Scatter Plot Generated with lattice

ggplot2

library(ggplot2)
ggplot(iris, aes(Sepal.Length, Sepal.Width)) + geom_point() # Plots two vectors (columns) in form of a scatter plot against each other.
ggplot(iris, aes(Sepal.Length, Sepal.Width)) + geom_point(aes(color = Species), size=4) # Plots larger dots and colors them with default color scheme.
ggplot(iris, aes(Sepal.Length, Sepal.Width)) + geom_point(aes(color = Species), size=4) + ylim(2,4) + xlim(4,8) + scale_color_manual(values=rainbow(10)) # Colors dots with custom scheme.
ggplot(iris, aes(Sepal.Length, Sepal.Width, label=1:150)) + geom_text() + ggtitle("Plot of Labels") # Prints instead of symbols a set of custom labels and adds a title to the plot.
ggplot(iris, aes(Sepal.Length, Sepal.Width, label=1:150)) + geom_point() + geom_text(hjust=-0.5, vjust=0.5) # Prints both symbols and labels.
ggplot(iris, aes(Sepal.Length, Sepal.Width)) + geom_point() + theme(panel.background=element_rect(fill = "white", colour = "black")) # Changes background color to white.
ggplot(iris, aes(Sepal.Length, Sepal.Width)) + geom_point() + stat_smooth(method="lm", se=FALSE) # Adds a regression line to the plot.
ggplot(iris, aes(Sepal.Length, Sepal.Width)) + geom_point() + coord_trans(x = "log2", y = "log2") # Plots axes in log scale.


Scatter Plot Generated with ggplot2

Line Plots

Base Graphics

y <- as.data.frame(matrix(runif(30), ncol=3, dimnames=list(letters[1:10], LETTERS[1:3]))) # Generates a sample data set.
plot(y[,1], type="l", lwd=2, col="blue") # Plots a line graph instead.
split.screen(c(1,1))
plot(y[,1], ylim=c(0,1), xlab="Measurement", ylab="Intensity", type="l", lwd=2, col=1)
for(i in 2:length(y[1,])) { screen(1, new=FALSE); plot(y[,i], ylim=c(0,1), type="l", lwd=2, col=i, xaxt="n", yaxt="n", ylab="", xlab="", main="", bty="n") }
close.screen(all=TRUE)
   # Plots a line graph for all columns in data frame 'y'. The 'split.screen()' function is used in this example in a for loop
   # to overlay several line graphs in the same plot. More details on this topic are provided in the 'Arranging Plots' section.
   # A very nice line plot function for time series data is available in the Mfuzz library.

Line Plot Generated with Base Graphics

lattice

xyplot(Sepal.Length ~ Sepal.Width | Species, data=iris, type="a", layout=c(1,3,1))
parallelplot(~iris[1:4] | Species, iris) # Plots data for each species in iris data set in separate line plot. Note: parallelplot() replaces the deprecated parallel() function in current lattice versions.
parallelplot(~iris[1:4] | Species, iris, horizontal.axis = FALSE, layout = c(1, 3, 1)) # Changes layout of plot.

Line Plot Generated with lattice




ggplot2

ggplot(iris, aes(Sepal.Length, Sepal.Width)) + geom_line(aes(color=Species), size=1) # Plots lines for three samples in the 'Species' column in a single plot.
ggplot(iris, aes(Sepal.Length, Sepal.Width)) + geom_line(aes(color=Species), size=1) + facet_wrap(~Species, ncol=1)  # Plots three line plots, one for each sample in Species column.

Line Plot Generated with ggplot2

Bar Plots

Base Graphics

y <- as.data.frame(matrix(runif(30), ncol=3, dimnames=list(letters[1:10], LETTERS[1:3]))) # Generates a sample data set.
barplot(as.matrix(y[1:4,]), ylim=c(0,max(y[1:4,])+0.1), beside=T)
   # Generates a bar diagram for the first four rows in y. The barplot() function expects as input format a matrix, and
   # it uses by default the column titles for labeling the bars.

text(labels=round(as.vector(as.matrix(y[1:4,])),2), x=seq(1.5, 13, by=1)+sort(rep(c(0,1,2), 4)), y=as.vector(as.matrix(y[1:4,]))+0.02)
   # Adds corresponding values on top of each bar.
ysub <- as.matrix(y[1:4,]); myN <- length(ysub[,1])
mycol1 <- gray(1:(myN+1)/(myN+1))[-(myN+1)]
mycol2 <- sample(colors(),myN); barplot(ysub, beside=T, ylim=c(0,max(ysub)*1.2), col=mycol2, main="Bar Plot", sub="data: ysub")
legend("topright", legend=row.names(ysub), cex=1.3, bty="n", pch=15, pt.cex=1.8, col=mycol2, ncol=myN)
   # Generates a bar plot with a legend for the first four rows of the input matrix 'y'. To plot the diagram in black and
   # white, set the 'col' argument to 'col=mycol1'. The argument 'ncol' controls the number of columns that are used for printing the legend.

par(mar=c(10.1, 4.1, 4.1, 2.1)); par(xpd=TRUE);
barplot(ysub, beside=T, ylim=c(0,max(ysub)*1.2), col=mycol2, main="Bar Plot"); legend(x=4.5, y=-0.3, legend=row.names(ysub), cex=1.3, bty="n", pch=15, pt.cex=1.8, col=mycol2, ncol=myN)
   # Same as above, but places the legend below the bar plot. The arguments 'x' and 'y' control the placement of the legend
   # and the 'mar' argument specifies the margin sizes around the plotting area in this order: c(bottom, left, top, right).

bar <- barplot(x <- abs(rnorm(10,2,1)), names.arg = letters[1:10], col="red", ylim=c(0,5))
stdev <- x/5; arrows(bar, x, bar, x + stdev, length=0.15, angle = 90)
arrows(bar, x, bar, x + -(stdev), length=0.15, angle = 90)
   # Creates bar plot with standard deviation bars. The example in the 'barplot' documentation provides a very useful outline
   # of this function.

source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/mortgage.R")
   # Imports a function that plots a loan amortization table as bar plot.

Bar Plot with Error Bars Generated with Base Graphics

lattice

y <- matrix(sample(1:10, 40, replace=TRUE), ncol=4, dimnames=list(letters[1:10], LETTERS[1:4])) # Creates a sample data set.
barchart(y, auto.key=list(adj = 1), freq=T, xlab="Counts", horizontal=TRUE, stack=FALSE, groups=TRUE) # Plots all data in a single bar plot. The TRUE/FALSE arguments control the layout of the plot.
barchart(y, col="grey", layout = c(2, 2, 1), xlab="Counts", as.table=TRUE, horizontal=TRUE, stack=FALSE, groups=FALSE)
   # Plots the different data components in separate bar plots. The TRUE/FALSE and layout arguments allow to rearrange the bar plots in different ways.

Bar Plot Generated with lattice
ggplot2

## (A) Sample Set: the following transforms the iris data set into a ggplot2-friendly format 
iris_mean <- aggregate(iris[,1:4], by=list(Species=iris$Species), FUN=mean)
# Calculates the mean values for the aggregates given by the Species column in the iris data set.
iris_sd <- aggregate(iris[,1:4], by=list(Species=iris$Species), FUN=sd) # Calculates the standard deviations for the aggregates given by the Species column in the iris data set.
convertDF <- function(df=df, mycolnames=c("Species", "Values", "Samples")) { myfactor <- rep(colnames(df)[-1], each=length(df[,1])); mydata <- as.vector(as.matrix(df[,-1])); df <- data.frame(df[,1], mydata, myfactor); colnames(df) <- mycolnames; return(df) } # Defines function to convert data frames into ggplot2-friendly format.
df_mean <- convertDF(iris_mean, mycolnames=c("Species", "Values", "Samples")) # Converts iris_mean.
df_sd <- convertDF(iris_sd, mycolnames=c("Species", "Values", "Samples")) # Converts iris_sd.
limits <- aes(ymax = df_mean[,2] + df_sd[,2], ymin=df_mean[,2] - df_sd[,2]) # Define standard deviation limits.

## (B) Bar plots of data stored in df_mean
ggplot(df_mean, aes(Samples, Values, fill = Species)) + geom_bar(stat="identity", position="dodge") # Plots bar sets defined by 'Species' column next to each other. Note: stat="identity" is required in current ggplot2 versions when plotting precomputed values.
ggplot(df_mean, aes(Samples, Values, fill = Species)) + geom_bar(stat="identity", position="dodge") + coord_flip() + theme(axis.text.y=element_text(angle=0, hjust=1)) # Plots bars and labels sideways.
ggplot(df_mean, aes(Samples, Values, fill = Species)) + geom_bar(stat="identity", position="stack") # Plots same data set as stacked bars.
ggplot(df_mean, aes(Samples, Values)) + geom_bar(aes(fill = Species), stat="identity") + facet_wrap(~Species, ncol=1) # Plots data sets below each other.
ggplot(df_mean, aes(Samples, Values, fill = Species)) + geom_bar(stat="identity", position="dodge") + geom_errorbar(limits, position="dodge") # Generates the same plot as before, but with error bars.

# (C) Customizing colors
library(RColorBrewer); display.brewer.all() # Select a color scheme and pass it on to 'scale_*' arguments.
ggplot(df_mean, aes(Samples, Values, fill=Species, color=Species)) + geom_bar(stat="identity", position="dodge") + geom_errorbar(limits, position="dodge") + scale_fill_brewer(palette="Greys") + scale_color_brewer(palette="Greys") # Generates the same plot as before, but with a grey color scheme.
ggplot(df_mean, aes(Samples, Values, fill=Species, color=Species)) + geom_bar(stat="identity", position="dodge") + geom_errorbar(limits, position="dodge") + scale_fill_manual(values=c("red", "green3", "blue")) + scale_color_manual(values=c("red", "green3", "blue")) # Uses custom colors passed on as vectors.

Bar Plot Generated with ggplot2

Pie Charts

Base Graphics

y <- table(rep(c("cat", "mouse", "dog", "bird", "fly"), c(1,3,3,4,2))) # Creates a sample data set.
pie(y, col=rainbow(length(y), start=0.1, end=0.8), main="Pie Chart", clockwise=T) # Plots a simple pie chart.
pie(y, col=rainbow(length(y), start=0.1, end=0.8), labels=NA, main="Pie Chart", clockwise=T)
legend("topright", legend=row.names(y), cex=1.3, bty="n", pch=15, pt.cex=1.8, col=rainbow(length(y), start=0.1, end=0.8), ncol=1)
   # Same pie chart as above, but with legend.

Pie Chart Generated with Base Graphics


ggplot2 (see this page for more details)

df <- data.frame(variable=rep(c("cat", "mouse", "dog", "bird", "fly")), value=c(1,3,3,4,2)) # Creates a sample data set.
ggplot(df, aes(x = "", y = value, fill = variable)) + geom_bar(width = 1, stat="identity") + coord_polar("y", start=pi / 3) + ggtitle("Pie Chart") # Plots pie chart.
ggplot(df, aes(x = variable, y = value, fill = variable)) + geom_bar(width = 1, stat="identity") + coord_polar("y", start=pi / 3) + ggtitle("Pie Chart") # Plots wind rose pie chart.

Wind Rose Pie Chart Generated with ggplot2

Heatmaps

Base Graphics

y <- matrix(rnorm(50), 10, 5, dimnames=list(paste("g", 1:10, sep=""), paste("t", 1:5, sep="")))
image(y) # The image function plots matrices as heatmaps.

Heatmap Generated with Base Graphics

lattice


## Sample data
library(lattice); library(gplots)
y <- lapply(1:4, function(x) matrix(rnorm(50), 10, 5, dimnames=list(paste("g", 1:10, sep=""), paste("t", 1:5, sep=""))))

## Plot single heatmap:
levelplot(y[[1]])

## Arrange several heatmaps in one plot

x1 <- levelplot(y[[1]], col.regions=colorpanel(40, "darkblue", "yellow", "white"), main="colorpanel")
x2 <- levelplot(y[[2]], col.regions=heat.colors(75), main="heat.colors")
x3 <- levelplot(y[[3]], col.regions=rainbow(75), main="rainbow")
x4 <- levelplot(y[[4]], col.regions=redgreen(75), main="redgreen")
print(x1, split=c(1,1,2,2))
print(x2, split=c(2,1,2,2), newpage=FALSE)
print(x3, split=c(1,2,2,2), newpage=FALSE)
print(x4, split=c(2,2,2,2), newpage=FALSE)

Several Heatmaps in One Plot Generated with lattice

Venn Diagrams

Computation of Venn intersects for 2-20 samples and plotting of 2-5 way Venn diagrams.

Computation of Venn intersects
The following imports several functions from the overLapper.R script for computing Venn intersects and plotting Venn diagrams (old version: vennDia.R). These functions are relatively generic and scalable, supporting the computation of Venn intersects for 2-20 or more samples. The upper limit of around 20 samples is unavoidable because the complexity of Venn intersects increases exponentially with the sample number n according to the relationship (2^n) - 1. A useful feature of the actual plotting step is the possibility of combining the counts from several Venn comparisons with the same number of test sets in a single Venn diagram. The overall workflow of the method is to first compute for a list of sample sets their Venn intersects using the overLapper function, which organizes the result sets in a list object. Subsequently, the Venn counts are computed and plotted as bar or Venn diagrams. The current implementation of the plotting function, vennPlot, supports Venn diagrams for 2-5 sample sets. To analyze larger numbers of sample sets, the Intersect Plot methods often provide reasonable alternatives. These methods are much more scalable than Venn diagrams, but lack their restrictive intersect logic. Additional Venn diagram resources are provided by limma, gplots, Vennerable, eVenn, VennDiagram, shapes, C Seidel (online) and Venny (online).

#################
## Sample data ##
#################

source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/overLapper.R") # Imports required functions.
setlist <- list(A=sample(letters, 18), B=sample(letters, 16), C=sample(letters, 20), D=sample(letters, 22), E=sample(letters, 18), F=sample(letters, 22, replace=T))
   # To work with the overLapper function, the sample sets (here six) need to be stored in a list object where the different
   # components are named by unique identifiers, here 'A to F'. These names are used as sample labels in all subsequent data
   # sets and plots.

sets <- read.delim("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/Samples/sets.txt")
setlistImp <- lapply(colnames(sets), function(x) as.character(sets[sets[,x]!="", x]))
names(setlistImp) <- colnames(sets)
   # Example of how a list of test sets can be imported from an external table file stored in tab-delimited format. Such
   # a file can easily be created in a spreadsheet program, such as Excel. As a reminder, copy & paste from external
   # programs into R is also possible (see read.delim function).

OLlist <- overLapper(setlist=setlist, sep="", type="vennsets"); OLlist; names(OLlist)
   # With the setting type="vennsets", the overLapper function computes all Venn Intersects for the six test samples in
   # setlist and stores the results in the Venn_List component of the returned OLlist object. By default, duplicates are
   # removed from the test sets. The setting keepdups=TRUE will retain duplicates by appending a counter to each entry. When
   # assigning the value "intersects" to the type argument then the function will compute Regular
   # Intersects instead of Venn Intersects. The Regular Intersect approach (not compatible with Venn diagrams!) is described
   # in the next section. Both analyses return a presence-absence matrix in the Intersect_Matrix component of OLlist. Each overlap
   # set in the Venn_List data set is labeled according to the sample names provided in setlist. For instance, the composite
   # name 'ABC' indicates that the entries are restricted to A, B and C. The separator used for naming the intersect samples
   # can be specified under the sep argument. By adding the argument cleanup=TRUE, one can minimize formatting issues in the
   # sample sets. This setting will convert all characters in the sample sets to upper case and remove leading/trailing spaces.


#############################
## Bar plot of Venn counts ##
#############################

olBarplot(OLlist=OLlist, horiz=T, las=1, cex.names=0.6, main="Venn Bar Plot")
   # Generates a bar plot for the Venn counts of the six test sample sets. In contrast to Venn diagrams, bar plots scale
   # to larger numbers of sample sets. The layout of the plot can be adjusted by changing the default values of the argument:
   # margins=c(4,10,3,1). The minimum number of counts to consider in the plot can be set with the mincount argument
   # (default is 0). The bars themselves are colored by complexity levels using the default setting: mycol=OLlist$Complexity_Levels.


#########################
## 2-way Venn diagrams ##
#########################
setlist2 <- setlist[1:2]; OLlist2 <- overLapper(setlist=setlist2, sep="_", type="vennsets")
OLlist2$Venn_List; counts <- sapply(OLlist2$Venn_List, length); vennPlot(counts=counts)
   # Plots a non-proportional 2-way Venn diagram. The main graphics features of the vennPlot function can be controlled by
   # the following arguments (here with 2-way defaults): mymain="Venn Diagram": main title; mysub="default": subtitle; 
   # ccol=c("black","black","red"): color of counts; lcol=c("red","green"): label color; lines=c("red","green"):
   # line color; mylwd=3: line width; ccex=1.0: font size of counts; lcex=1.0: font size of labels. Note: the vector
   # lengths provided for the arguments ccol, lcol and lines should match the number of their corresponding features
   # in the plot, e.g. 3 ccol values for a 2-way Venn diagram and 7 for a 3-way Venn diagram. The argument setlabels
   # allows one to provide a vector of custom sample labels. However, assigning the proper names in the original test set list
   # is much more effective for tracking purposes.


#########################
## 3-way Venn diagrams ##
#########################

setlist3 <- setlist[1:3]; OLlist3 <- overLapper(setlist=setlist3, sep="_", type="vennsets")
counts <- list(sapply(OLlist3$Venn_List, length), sapply(OLlist3$Venn_List, length))
vennPlot(counts=counts, mysub="Top: var1; Bottom: var2", yoffset=c(0.3, -0.2))
   # Plots a non-proportional 3-way Venn diagram. The results from several Venn comparisons can be combined in a
   # single Venn diagram by assigning to the count argument a list with several count vectors. The positional offset
   # of the count sets in the plot can be controlled with the yoffset argument. The argument setting colmode=2 allows
   # one to assign different colors to each count set. For instance, with colmode=2 one can assign to ccol a color vector
   # or a list, such as ccol=c("blue", "red") or ccol=list(1:8, 8:1).


#########################
## 4-way Venn diagrams ##
#########################

setlist4 <- setlist[1:4]
OLlist4 <- overLapper(setlist=setlist4, sep="_", type="vennsets")
counts <- list(sapply(OLlist4$Venn_List, length), sapply(OLlist4$Venn_List, length))
vennPlot(counts=counts, mysub="Top: var1; Bottom: var2", yoffset=c(0.3, -0.2))
   # Plots a non-proportional 4-way Venn diagram. The setting type="circle" returns an incomplete 4-way Venn diagram as
   # circles. This representation misses two overlap sectors, but is sometimes easier to navigate than the default
   # ellipse version.


#########################
## 5-way Venn diagrams ##
#########################

setlist5 <- setlist[1:5]; OLlist5 <- overLapper(setlist=setlist5, sep="_", type="vennsets")
counts <- sapply(OLlist5$Venn_List, length)
vennPlot(counts=counts, ccol=c(rep(1,30),2), lcex=1.5, ccex=c(rep(1.5,5), rep(0.6,25),1.5)) # Plots a non-proportional 5-way Venn diagram.

################################
## Export and other utilities ##
################################

OLexport <- as.matrix(unlist(sapply(OLlist5[[4]], paste, collapse=" ")))
write.table(OLexport, file="test.xls", col.names=F, quote=F, sep="\t") # Exports intersect data in tabular format to a file.
OLexport <- data.frame(Venn_Comp=rep(names(OLlist5[[4]]), sapply(OLlist5[[4]], length)), IDs=unlist(OLlist5[[4]]))
write.table(OLexport, file="test.xls", row.names=F, quote=F, sep="\t") # Same as above, but exports to an alternative tabular format.
tapply(counts, OLlist5[[3]], function(x) rev(sort(x)))
   # Sorts the overlap results within each complexity level by their size. This makes it easy to identify the sample set
   # combinations with the largest intersect within each complexity level.

sapply(names(setlist), function(x) table(setlist[[x]])[table(setlist[[x]])!=1])
   # Command to identify and count duplicated objects in the original sample set object 'setlist'. In the given example,
   # only set 'F' contains duplicates. Their frequency is provided in the result.

vennPlot(counts, mymain="", mysub="", ccol="white", lcol="white") # Returns an empty Venn diagram without counts or labels.

################################################################################
## Typical analysis routine for sets of differentially expressed genes (DEGs) ##
################################################################################

ratio <- matrix(sample(seq(-5, 5, by=0.1), 100, replace=T), 100, 4, dimnames=list(paste("g", 1:100, sep=""), paste("DEG", 1:4, sep="")), byrow=T)
   # Creates a sample matrix of gene expression log2 ratios. This could be any data type!

setlistup <- sapply(colnames(ratio), function(x) rownames(ratio[ratio[,x]>=1,]))
setlistdown <- sapply(colnames(ratio), function(x) rownames(ratio[ratio[,x]<=-1,]))
   # Identifies all genes with at least two-fold up- or down-regulation and stores the corresponding gene identifiers
   # in setlistup and setlistdown, respectively.

OLlistup <- overLapper(setlist=setlistup, sep="_", type="vennsets")
OLlistdown <- overLapper(setlist=setlistdown, sep="_", type="vennsets")
counts <- list(sapply(OLlistup$Venn_List, length), sapply(OLlistdown$Venn_List, length))
vennPlot(counts=counts, ccol=c("red", "blue"), colmode=2, mysub="Top: DEG UP; Bottom: DEG Down", yoffset=c(0.3, -0.2))
   # Performs Venn analysis for the four sets stored in setlistup and setlistdown. The argument setting colmode=2 allows
   # one to assign different colors to each count set. For instance, with colmode=2 one can assign to ccol a color vector
   # or a list, such as ccol=c("blue", "red") or ccol=list(1:8, 8:1).

Sample Venn Diagrams (here 4- and 5-way)

Intersect Plots

Collection of methods for analyzing and visualizing intersect relationships among large numbers of sample sets.

#############################################################################################
## All Possible Intersects: for comprehensive overlap analyses of 2-20 or more sample sets ##
#############################################################################################

source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/overLapper.R")
   # To compute all possible intersects among more than two samples at the same time - such as common in AB, ABC,
   # ABCD and so on - the overLapper function can be used. Note: processing more than 20 sets with this function can take
   # a long time, because the complexity of all possible intersects increases with the number of sample sets n according
   # to this relationship: (2^n) - 1 (includes self comparisons). This means there are 63 possible intersects for n=6, 1,023
   # for n=10, 32,767 for n=15 and 33,554,431 for n=25. With the current implementation, the computation time is about 0.1
   # second for n=10 and 2.5 seconds for n=15.
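The exponential growth stated in the comment above can be verified directly with a one-line sketch:

```r
## Number of possible intersects (2^n - 1) for different sample numbers n.
n <- c(6, 10, 15, 25)
setNames(2^n - 1, paste0("n=", n)) # Returns 63, 1023, 32767 and 33554431.
```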
setlist <- lapply(1:6, function(x) sample(letters, 20)); names(setlist) <- LETTERS[seq(along=setlist)]
   # To work with the overLapper function, the sample sets need to be stored in a list object where the different
   # components are named by unique identifiers: A, B, C, ..., Z.

OLlist <- overLapper(setlist=setlist, complexity=1:length(setlist), sep="-", type="intersects"); OLlist; names(OLlist)
   # Computes all possible intersects for the samples stored in 'setlist'. The results are returned as a list where each
   # overlap component is labeled by the corresponding sample names. For instance, the name 'A-B-C' indicates that the
   # entries are common in samples A, B and C. The separator used for naming the intersect samples can be specified under
   # the 'sep' argument. The complexity level range to consider can be controlled with the 'complexity' argument, which
   # can have values from 2 to the total number of samples.

OLlist[[2]]; OLlist[[3]] # Returns the corresponding intersect matrix and complexity levels.
OLexport <- as.matrix(unlist(sapply(OLlist[[4]], paste, collapse=" ")))
write.table(OLexport, file="test.xls", col.names=F, quote=F, sep="\t") # Exports intersect data in tabular format to a file.
counts <- sapply(OLlist[[4]], length) # Counts the number of elements in each intersect component.
tapply(counts, OLlist[[3]], function(x) rev(sort(x)))
   # Sorts the overlap results within each complexity level by their size. This makes it easy to identify the sample set
   # combinations with the largest intersect within each complexity level.

x11(height=12, width=8); olBarplot(OLlist=OLlist, horiz=T, las=1, cex.names=0.6, main="Intersect Bar Plot")
   # Plots the counts as bar diagram. More details on this function are provided in the Venn diagram section.

#####################################################################################################
## All Pairwise Intersects: to identify and cluster overlaps among two to thousands of sample sets ##
#####################################################################################################

setlist <- lapply(11:30, function(x) sample(letters, x, replace=TRUE)); names(setlist) <- paste("S", seq(along=setlist), sep="")
   # To work with the following function, the sample sets (here 20) need to be stored in a list object (here 'setlist')
   # where each component is named by a unique identifier (here S1-S20).

setlist <- sapply(setlist, unique)
olMA <- sapply(names(setlist), function(x) sapply(names(setlist), function(y) sum(setlist[[x]] %in% setlist[[y]]))); olMA
   # Creates an intersect matrix for all pairwise sample comparisons stored in 'setlist'. This approach scales well up to
   # several thousand sample sets.

library("gplots"); heatmap.2(olMA, trace="none", Colv="none", Rowv="none", dendrogram="none", col=colorpanel(40, "darkred", "orange", "yellow")) # Plots intersect matrix as heatmap.
heatmap.2(olMA, trace="none", col=colorpanel(40, "darkred", "orange", "yellow"))
   # Hierarchical clustering of the rows and columns of the intersect matrix 'olMA'. The result is plotted as heatmap
   # with two identical dendrograms representing the outcome of the hierarchical clustering. The latter is internally
   # performed by calls of heatmap.2() to the functions dist() and hclust() using their default settings: euclidean
   # distances and complete linkage. Note: the distance matrix used for clustering in this example is based on the
   # row-to-row (column-to-column) similarities in the olMA object. The next example shows how one can use the olMA
   # object directly as distance matrix for clustering after transforming the intersect counts into similarity measures,
   # such as Jaccard or Rand indices.

diffMA1 <- sapply(names(setlist), function(x) sapply(names(setlist), function(y) sum(!setlist[[x]] %in% setlist[[y]])))
diffMA2 <- sapply(names(setlist), function(x) sapply(names(setlist), function(y) sum(!setlist[[y]] %in% setlist[[x]])))
jaccardMA <- olMA/(diffMA1+diffMA2+olMA)
hr <- hclust(as.dist(1-jaccardMA), method = "complete", members=NULL)
heatmap.2(jaccardMA, Rowv=as.dendrogram(hr), Colv=as.dendrogram(hr), trace="none", col=colorpanel(40, "darkred", "orange", "yellow"))
   # Uses the olMA for clustering after transforming its intersect counts into Jaccard (Tanimoto) similarity indices.
   # This transformation can give reasonable results for sample sets with large size differences. Many alternative
   # similarity measures for set comparisons could be considered here.

setlist[["S1"]][setlist[["S1"]] %in% setlist[["S2"]]]
   # Example for returning the intersect entries for one pairwise comparison, here S1 and S2.
name1 <- matrix(colnames(olMA), length(olMA[,1]), length(olMA[,1]), byrow=T)
name2 <- matrix(colnames(olMA), length(olMA[,1]), length(olMA[,1]), byrow=F)
mynames <- paste(name1, name2, sep="_"); mynames <- lapply(strsplit(mynames, "_"), sort)
mynames <- sapply(mynames, paste, collapse="_"); olV <- as.vector(olMA)
names(olV) <- mynames; olV <- olV[!duplicated(names(olV))]; sort(olV)
   # Converts the intersect matrix 'olMA' into a named vector sorted by intersect size. This makes it easy to identify the
   # combinations of pairwise sample comparisons that have the largest or smallest intersects.


######################################################################################################
## Presence-absence matrices: to identify memberships of items across large numbers of sample sets ##
######################################################################################################

setlist <- lapply(11:30, function(x) sample(letters, x, replace=TRUE)); names(setlist) <- paste("S", seq(along=setlist), sep="")
    # Generates a sample data set.
source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/overLapper.R")
   # Imports the required overLapper() function.
paMA <- overLapper(setlist=setlist, type="intersects", complexity=2)[[2]]
paMA[names(rev(sort(rowSums(paMA)))),]
   # Creates a presence-absence matrix of all items across their sample sets. In this matrix the presence information is
   # indicated by ones and its rows are sorted by the presence frequencies. Alternative versions of the presence-absence
   # matrix can be returned by setting the 'type' argument to 1, 2 or 3.

library("gplots"); heatmap.2(paMA, trace="none", Colv="none", Rowv="none", dendrogram="none", col=c("white", "gray"))
   # Plots the presence-absence matrix as heatmap.

rowSums(paMA) # Returns the presence frequencies of the items in the sample sets.
sapply(rownames(paMA), function(x) colnames(paMA)[paMA[x, ]==1])
   # Returns for each item the names of sample sets where it is present. The opposite result for absent calls can be
   # returned by changing '==1' to '==0'.


Intersect-based Heatmap with Clustering Using Jaccard Similarity Measure

Histograms

Base Graphics

x <- rnorm(100); hist(x, freq=FALSE); curve(dnorm(x), add=TRUE) # Plots a random distribution as histogram and overlays curve ('freq=F': ensures density histogram; 'add=T': enables 'overplotting').
plot(x<-1:50, dbinom(x,size=50,prob=.33), type="h") # Creates pin diagram for binomial distribution with n=50 and p=0.33.
Basic Histogram Generated with Base Graphics



ggplot2 (see this page for more details)

ggplot(iris, aes(x=Sepal.Width)) + geom_histogram(aes(fill = ..count..), binwidth=0.2) # Plots histogram for second column in 'iris' data set.
ggplot(iris, aes(x=Sepal.Width)) + geom_histogram(aes(y = ..density.., fill = ..count..), binwidth=0.2) + geom_density() # Density representation.
Histogram Generated with ggplot2



Density Plots

plot(density(rnorm(10)), xlim=c(-2,2), ylim=c(0,1), col="red") # Plots a random distribution in form of a density plot.
split.screen(c(1,1))
plot(density(rnorm(10)), xlim=c(-2,2), ylim=c(0,1), col="red")
screen(1, new=FALSE)
plot(density(rnorm(10)), xlim=c(-2,2), ylim=c(0,1), col="green", xaxt="n", yaxt="n", ylab="", xlab="", main="",bty="n")
close.screen(all = TRUE)
   # Several density plots can be overlaid with the 'split.screen' function. In this example two density plots are plotted at the same scale ('xlim/ylim') and
   # overlaid on the same graphics device using 'screen(1, new=FALSE)'.


Box Plots

y <- as.data.frame(matrix(runif(300), ncol=10, dimnames=list(1:30, LETTERS[1:10]))) # Generates a sample data set.
boxplot(y, col=1:length(y))
   # Creates a box plot. A box plot (also known as a box-and-whisker diagram) is a graphical representation of a five-number summary, which consists of the smallest
   # observation, lower quartile, median, upper quartile and largest observation.
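The five-number summary underlying each box can also be computed directly with the fivenum() function; a minimal sketch on a fixed vector:

```r
## Five-number summary as used by boxplot(): min, lower hinge, median, upper hinge, max.
fivenum(1:9) # Returns 1 3 5 7 9.
boxplot(1:9) # The corresponding single-box plot.
```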

Basic Box Plot Generated with Base Graphics

ROC Plots

A variety of libraries are available for these tasks: ROCR, ROC, nonbinROC, pROC and ggplot2.
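Independently of these packages, the points of an ROC curve are simple to compute in base R. The following is a minimal sketch; the 'labels' (known 0/1 outcomes) and 'scores' (prediction scores) vectors are invented for illustration:

```r
## Minimal ROC computation in base R. 'labels' holds the known 0/1 outcomes and
## 'scores' the corresponding prediction scores (both made up for this example).
labels <- c(1, 1, 0, 1, 0, 0, 1, 0)
scores <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3)
ord <- order(scores, decreasing=TRUE) # Ranks predictions from most to least confident.
tpr <- cumsum(labels[ord]) / sum(labels) # True positive rate at each score cutoff.
fpr <- cumsum(1 - labels[ord]) / sum(1 - labels) # False positive rate at each score cutoff.
plot(c(0, fpr), c(0, tpr), type="l", xlab="False positive rate", ylab="True positive rate", main="ROC Plot")
abline(0, 1, lty=2) # Diagonal expected for a random classifier.
```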

Feature Maps for DNA/Protein Sequences

## The featureMap.R script plots simple feature maps of biological sequences based on provided mapping coordinates.
source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/featureMap.txt")

## Alternative approach for plotting simple chromosome maps
plot(x <- rnorm(40,2e+07,sd=1e+07), y <- rep(1,times=40), type="h", col="blue", xaxt="n", yaxt="n", bty="n")
abline(h=0.78, col="green", lwd=12)
lines(a <- rnorm(5,2e+07,sd=1e+07), b <- rep(1,times=5), type="h", col="red", lwd=2)



Miscellaneous Plotting Utilities

plot(1:10); lines(1:10, lty=1)
   # Superimposes lines onto existing plots. Using the plotted values connects the data points. Different line types can be specified with the
   # argument 'lty=2'. More on this can be found in the documentation for 'par'.

plot(x <- 1:10, y <- 1:10); abline(-1,1, col="green"); abline(1,1, col="red"); abline(v=5, col="blue"); abline(h=5, col="brown") # Function 'abline' adds lines in different colors to x-y-plot.

Arranging Plots

Base Graphics

x11(height=6, width=12, pointsize=12) # Defines the size of the plotting device.
par(mfrow=c(2,3)); for(i in 1:6) { plot(1:10) } # With the argument 'par(mfrow=c(nrow,ncol))' one can define how several plots are arranged next to each other on the same plotting device.
nf <- layout(matrix(c(1,2,3,3), 2, 2, byrow=TRUE), c(3,7), c(5,5), respect=TRUE)
layout.show(nf)
for(i in 1:3) { barplot(1:10) }
   # The 'layout()' function allows one to divide the plotting device into variable numbers of rows and columns, with the column-widths and the row-heights specified in the respective arguments.
split.screen(c(1,1))
plot(density(rnorm(10)), xlim=c(-2,2), ylim=c(0,1), col="red")
screen(1, new=FALSE)
plot(density(rnorm(10)), xlim=c(-2,2), ylim=c(0,1), col="green", xaxt="n", yaxt="n", ylab="", xlab="", main="",bty="n")
close.screen(all = TRUE)
   # Independent plots can be overlaid with the 'split.screen' function. In this example two density plots are plotted at the same scale ('xlim/ylim')
   # and overlaid on the same graphics device using 'screen(1, new=FALSE)'. Split.screen is also very useful for generating multiple plots next to each
   # other on a single device by setting the layout to several screens, e.g.: split.screen(c(2,2)).


Example of Arranging Plots Using Base Graphics
ggplot2

library(grid); library(ggplot2) # Requires grid library.
dsmall <- diamonds[sample(nrow(diamonds), 1000), ] # Creates a small sample data set.
a <- ggplot(dsmall, aes(color, price/carat)) + geom_jitter(size=4, alpha = I(1 / 1.5), aes(color=color))
b <- ggplot(dsmall, aes(color, price/carat, color=color)) + geom_boxplot()
c <- ggplot(dsmall, aes(color, price/carat, fill=color)) + geom_boxplot() + theme(legend.position = "none")
grid.newpage() # Open a new page on grid device
pushViewport(viewport(layout = grid.layout(2, 2))) # Assign to device viewport with 2 by 2 grid layout
print(a, vp = viewport(layout.pos.row = 1, layout.pos.col = 1:2))
print(b, vp = viewport(layout.pos.row = 2, layout.pos.col = 1))
print(c, vp = viewport(layout.pos.row = 2, layout.pos.col = 2, width=0.3, height=0.3, x=0.8, y=0.8))
Example of Arranging Plots Using ggplot2

Customize X-Y Axes

op <- par(mar=c(6,6,6,8)); barplot(1:10, names.arg=letters[1:10], cex.names=1.5, col=3, yaxt = "n") # Creates bar plot without y-axis and wider margins around plot.
axis(2, las = 1, at=seq(0,10,0.5), lwd=2, cex.axis=1.5); axis(3, las = 1, lwd=2, cex.axis=1.5); axis(4, las = 1, at=seq(0,10,2), labels=month.name[1:6], lwd=2, cex.axis=1.5); par(op) # Adds three custom axes.
curve(sin(x), -pi, pi, axes=F, ylab="", xlab=""); axis(1, pos=0); axis(2, pos=0) # Plots a function in Cartesian-style coordinate system.

Examples of Customizing X-Y Axes


Color Selection Utilities

For more details on this topic consult Earl Glynn's Color Chart

y <- as.data.frame(matrix(runif(30), ncol=3, dimnames=list(letters[1:10], LETTERS[1:3]))) # Generates a sample data set.
cat("palette():\n"); palette(); palette(rainbow(20, start=0.1, end=0.2))
cat("palette(rainbow(20, start=0.1, end=0.2)):\n"); palette(); palette("default")
   # Prints and modifies the color palette which is used when the argument 'col=' has a numeric index. The last step sets the palette back to its default setting.
par(mfrow=c(1,2)); barplot(as.matrix(y[1:8,]), beside=T, col=1:8)
palette(rainbow(8, start=0.1, end=0.8)); barplot(as.matrix(y[1:8,]), beside=T, col=1:8)
palette("default")
   # Example of calling the default color palette with a numeric vector in the 'col=' argument. In the second plot a modified palette is called the same way.
example(rainbow) # The function rainbow() allows one to select various predefined color schemes.
barplot(as.matrix(y[1:10,]), beside=T, col=rainbow(10, start=0.1, end=0.2))
   # Example of selecting 10 colors with the rainbow() function. The start and end values need to be between 0 and 1. The wider their distance, the more diverse the resulting colors.
barplot(as.matrix(y[1:10,]), beside=T, col=gray(c(1:10)/10)) # The gray() function allows one to select any type of gray shade by providing values from 0 to 1.
colors()[1:10]; col2rgb(colors()[1:10]) # The colors() function allows one to select colors by name. The col2rgb() function translates them into RGB color codes.
library(RColorBrewer); example(brewer.pal) # Shows how to select color schemes with the RColorBrewer library.

Adding Text

y <- as.data.frame(matrix(runif(30), ncol=3, dimnames=list(letters[1:10], LETTERS[1:3]))) # Generates a sample data set.
plot(y[,1], y[,2]); text(0.2,0.2,"My Text") # Adds text at defined coordinates of plot.
plot(y[,1], y[,2]); mtext("My Text", 4) # Adds text to one of the four margin areas (side: 1-4) of the plot.
plot(y[,1], y[,2], pch=20, col="red", main="Plot of Symbols and Labels"); text(y[,1]+0.03, y[,2], rownames(y))
   # Plots a scatter plot with the corresponding labels (here row names) next to the data points.

Interactive Functions

locator() # Allows user to click on existing graph with left mouse button. System returns the corresponding x-y-coordinates after clicking on right mouse button.
text(locator(1), "My_Outlier", adj=0) # Adds text "My_Outlier" to graph at position where left mouse button is clicked.

Saving Graphics to Files

jpeg("test.jpeg"); plot(1:10, 1:10); dev.off()
   # After the 'jpeg("test.jpeg")' command all graphs are redirected to the file "test.jpeg" in JPEG format. To export images with the highest quality,
   # the default setting 'quality = 75' needs to be changed to 100. The actual image data are not written to the file until the 'dev.off()' command is executed!

pdf("test.pdf"); plot(1:10, 1:10); dev.off() # Same as above, but for PDF format. The pdf and svg formats often provide the best image quality, since they scale to any size without pixelation.
png("test.png"); plot(1:10, 1:10); dev.off() # Same as above, but for png format.
postscript("test.ps"); plot(1:10, 1:10); dev.off() # Same as above, but for PostScript format.
svg("test.svg"); plot(1:10, 1:10); dev.off() # Generates Scalable Vector Graphics (SVG) files that can be edited in vector graphics programs, such as Inkscape.

Importing Graphics into R

library(pixmap); x <- read.pnm(file="my_image.pnm"); plot(x); print(x) # Images need to be in pnm format. This format can be obtained in ImageMagick like this: 'convert my_image.jpg my_image.pnm'.

Graphics Exercises



Writing Your Own Functions

The R language allows users to create their own function objects. The basic syntax for defining new functions is:

name <- function(arg_1, arg_2, ...) expression # Basic syntax for functions.
    A much more detailed introduction to writing functions in R is available in the Programming in R section of this manual.
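A minimal example following this syntax (the function name myFun and its behavior are chosen arbitrarily for illustration):

```r
## Defines a simple function that returns the rounded mean and standard deviation of a numeric vector.
myFun <- function(x, roundDigits=2) { c(mean=round(mean(x), roundDigits), sd=round(sd(x), roundDigits)) }
myFun(1:10) # Returns the named vector: mean 5.5, sd 3.03.
```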


R Web Applications

      Various web-based R services and R web development kits are available.

R Basics Exercises

[ Slide Show ]  [ R Code ] 

The following exercises introduce a variety of useful data analysis utilities available in R.

Import from spreadsheet programs (e.g. Excel)

  • Download the following molecular weight and subcellular targeting tables from the TAIR site, import the files into Excel and save them as tab delimited text files.
  • Import the tables into R
my_mw <- read.delim(file="MolecularWeight_tair7.xls", header=T, sep="\t") # Imports molecular weight table.
my_target <- read.delim(file="TargetP_analysis_tair7.xls", header=T, sep="\t") # Imports subcellular targeting table.

  • Online import
my_mw <- read.delim(file="http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/Samples/MolecularWeight_tair7.xls", header=T, sep="\t")
   # Online import of molecular weight table from TAIR.

my_target <- read.delim(file="http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/Samples/TargetP_analysis_tair7.xls", header=T, sep="\t")
   # Online import of subcellular targeting table from TAIR.


  • Check tables and rename gene ID columns
my_mw[1:4,]; my_target[1:4,]; colnames(my_target)[1] <- "ID"; colnames(my_mw)[1] <- "ID" # Prints out the first four rows of the two data frames and renames their gene ID columns.


  • Merge the two tables based on common ID field
my_mw_target <- merge(my_mw, my_target, by.x="ID", by.y="ID", all.x=T)

  • Shorten one table before the merge and then remove the non-matching rows (NAs) in the merged file
my_mw_target2 <- merge(my_mw, my_target[1:40,], by.x="ID", by.y="ID", all.x=T) # To remove non-matching rows, use the argument setting 'all=F'.
na.omit(my_mw_target2) # Removes rows containing "NAs" (non-matching rows).
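The behavior of merge's 'all' arguments can be illustrated with two toy data frames (the IDs and values below are made up):

```r
df1 <- data.frame(ID=c("g1","g2","g3"), MW=c(10, 20, 30)) # Toy table with three IDs.
df2 <- data.frame(ID=c("g2","g3","g4"), Loc=c("C","M","N")) # Toy table with partially overlapping IDs.
merged_all <- merge(df1, df2, by="ID", all.x=TRUE) # Keeps all rows of df1; the non-matching row 'g1' contains an NA.
merged_common <- merge(df1, df2, by="ID") # The default 'all=FALSE' returns only the rows common to both tables.
```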

Problem 1: How can the merge function in the previous step be executed so that only the common rows among the two data frames are returned? Prove that both methods - the two-step version with na.omit and your method - return identical results.

Problem 2: Replace all NAs in the data frame my_mw_target2 with zeros.

  • Apply several combinatorial queries (filters) on the data set. For example:
my_mw_target[my_mw_target[,2]>100000 & my_mw_target[,4]=="C",] # Retrieves all records with MW greater than 100000 and targeting to chloroplasts.
dim(my_mw_target[my_mw_target[,2]>100000 & my_mw_target[,4]=="C", ]) # Determines the number of matches for the above query.


Problem 3: How many protein entries in the my_mw_target data frame have a MW of greater than 4000 and less than 5000? Subset the data frame accordingly and sort it by MW to check that your result is correct.

  • Use a regular expression in a substitute function to generate a separate ID column that lacks the gene model extensions
my_mw_target3 <- data.frame(loci=gsub("\\..*", "", as.character(my_mw_target[,1]), perl = TRUE), my_mw_target)
my_mw_target3 <- data.frame(loci=gsub("(^.*)\\.\\d{1,}", "\\1", as.character(my_mw_target[,1]), perl = TRUE), my_mw_target) # Same as above but with backreference.

Problem 4: Retrieve those rows in my_mw_target3 where the second column contains the following identifiers: c("AT5G52930.1", "AT4G18950.1", "AT1G15385.1", "AT4G36500.1", "AT1G67530.1"). Use the '%in%' function for this query. As an alternative approach, assign the second column to the row index of the data frame and then perform the same query again using the row index. Explain the difference between the two methods.

  • Calculate the number of gene models per locus with the table() function
my_mw_target4 <- cbind(my_mw_target3, Freq=table(my_mw_target3[,1])[as.character(my_mw_target3[,1])])
   # The table() function counts the number of duplicates for each locus and cbind() appends the information to the data frame.
my_mw_target4[order(-my_mw_target4$Freq),][1:30,] # Sorts by count information.
length(unique(my_mw_target4[,1])) # Removes duplicates and prints the number of unique loci.
sum(!duplicated(my_mw_target4[,1])) # Gives the same result as previous command; remember: the function 'duplicated' returns a logical vector.
length(my_mw_target4[,2])-length(unique(my_mw_target4[,1])) # Calculates the number of duplicated entries.
sum(duplicated(my_mw_target4[,1])) # Generates the same result as the previous command.
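The logic of the above counting commands can be checked on a small made-up ID vector:

```r
ids <- c("AT1G01010.1", "AT1G01010.2", "AT1G01020.1") # Three made-up gene model IDs from two loci.
loci <- gsub("\\..*", "", ids) # Strips the gene model extensions.
table(loci)[loci] # Returns for each entry the number of gene models of its locus: 2, 2, 1.
length(unique(loci)) # Number of unique loci: 2.
sum(duplicated(loci)) # Number of duplicated entries: 1.
```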


  • Perform a variety of calculations across columns
data.frame(my_mw_target4, avg_AA_WT=(my_mw_target4[,3]/my_mw_target4[,4]))[1:40,] # Calculates the average AA weight per protein.
data.frame(my_mw_target4, mean=apply(my_mw_target4[,6:9], 1, mean))[1:40,] # Calculates the mean across several fields of each row.
data.frame(my_mw_target4, mean=apply(my_mw_target4[,6:9], 1, mean), stdev=apply(my_mw_target4[,6:9], 1, sd, na.rm=T))[1:40,]
   # Calculates the mean and standard deviation across several fields of each row.

colMeans(my_mw_target4[,-c(1,2,5)], na.rm=T) # Calculates the mean for all numeric columns. Note: in current R versions 'mean()' no longer works on entire data frames, therefore 'colMeans()' is used here.
data.frame(my_mw_target4[1:40,], ttest=apply(my_mw_target4[1:40,6:9], 1, function(x) {t.test(x[1:2], x[3:4])$p.value} ))
   # Calculates a two-sample t-test for two slices in each row of the data frame and records the corresponding p-values. To calculate a paired t-test,
   # add the argument 'paired=T'.
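The row-wise t-test construct can be tested on a small random matrix (made-up data):

```r
set.seed(1) # For reproducible random numbers.
mat <- matrix(rnorm(24), nrow=6) # Toy matrix with 6 genes (rows) and 4 samples (columns).
pvals <- apply(mat, 1, function(x) t.test(x[1:2], x[3:4])$p.value) # Two-sample t-test per row.
pvals # Returns one p-value per gene, each between 0 and 1.
```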

  • Export data frame to Excel
write.table(my_mw_target4, file="my_file.xls", quote=F, sep="\t", col.names = NA) # Writes the data frame 'my_mw_target4' into a tab-delimited text file that can be opened with Excel.


  • Plotting samples
boxplot(my_mw_target4[,3], my_mw_target4[,4], names=c("MW","CC"), col=c("blue","red"), main="Box Plot") # Generates box plots for MW and AA counts.
plot(density(my_mw_target4[,3]), col="blue", main="Density Plot") # Plots density distribution plots for MW.
barplot(my_mw_target4[1:5,3], beside = F, col = c("red", "green", "blue", "magenta", "black"), names.arg=as.vector(my_mw_target4[1:5,1]), main="MW Bar Plot", ylim=c(0,230000))
   # Bar plot of MW for first five loci.
source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/overLapper.R") # Imports the venn diagram function.
setlist <- list(A=sample(my_mw_target4[,1], 1000), B=sample(my_mw_target4[,1], 1100), C=sample(my_mw_target4[,1], 1200), D=sample(my_mw_target4[,1], 1300), E=sample(my_mw_target4[,1], 1400))
   # Generates 5 sets of random IDs.
OLlist <- overLapper(setlist=setlist, sep="_", type="vennsets")
counts <- sapply(OLlist$Venn_List, length); vennPlot(counts=counts, ccol=c(rep(1,30),2), lcex=1.5, ccex=c(rep(1.5,5), rep(0.6,25),1.5)) # Plots a 5-way venn diagram for the random ID sets.
OLexport <- as.matrix(unlist(sapply(OLlist[[4]], paste, collapse=" ")))
write.table(OLexport, file="OLexport.xls", col.names=F, quote=F, sep="\t") # Exports intersect data in tabular format to a file.
pdf("venn_plot.pdf"); vennPlot(counts=counts); dev.off() # Saves the plot image to a PDF file.


Problem 5: Generate two key lists each with 4 random sample sets. Compute their overlap counts and plot the results for both lists in one venn diagram.

Problem 6: Write all commands from the previous exercises into an R script (exerciseRbasics.R) and execute it with the source() function like this: source("exerciseRbasics.R"). This will execute all of the above commands and generate the corresponding output files in the current working directory.


HT Sequence Analysis with R and Bioconductor

    This section of the manual is available on the HT-Seq site.

Programming in R

    This section of the manual is available on the Programming in R site.

Bioconductor

Introduction

Bioconductor is an open source and open development software project for the analysis of genome data (e.g. sequence, microarray, annotation and many other data types). This section of the manual provides a brief introduction to the usage and utilities of a subset of packages from the Bioconductor project. The included packages are a 'personal selection' of the author of this manual that does not reflect the full utility spectrum of the Bioconductor project. The introduced packages were chosen because the author uses them often for his own teaching and research. To obtain a broad overview of available Bioconductor packages, it is strongly recommended to consult its official project site. Due to the rapid development of many packages, it is also important to be aware that this manual will often not be fully up-to-date. For this and many other reasons, it is absolutely critical to use the original documentation of each package (PDF manual or vignette) as the primary source of documentation. Users are welcome to send suggestions for improving this manual directly to its author.


Finding Help

The instructions for installing BioConductor packages are available in the administrative section of this manual. Documentation for Bioconductor packages can be found in the vignette of each package. A listing of the available packages is available on the Bioc Package page. Annotation libraries can be found here. The Bioc Mail GMANE Archive can be searched to find answers to questions that have been discussed on the mailing list. Another valuable information resource is the Bioconductor Book. The basic R help functions provide additional information about packages and their functions:


library(affy) # Loads a particular package (here affy package).
library(help=affy) # Lists all functions/objects of a package (here affy package).
library() # Lists all libraries/packages that are available on a system.
openVignette() # Provides documentation on packages.
sessionInfo()
   # Prints version information about R and all loaded packages. The generated output should be provided when sending
   # questions or bug reports to the R and BioC mailing lists.

?rma # Opens documentation on a function (here 'rma'). Note that reserved words need quotes, e.g. '?"function"'.
help.search("topic") # Searches help system for documentation.
search() # Shows loaded packages and attached data frames in the current search path.
system.file(package="affy") # Shows location of an installed package.
library("tools"); vigSrc <- list.files(pattern="Rnw$", system.file("doc",package="GOstats"), full.names=TRUE); vigSrc; for (v in vigSrc) Stangle(v) # Extracts R code from a vignette and saves it to file(s) in your current working directory.

Affy Packages

BioConductor provides extensive resources for the analysis of Affymetrix data. The most important ones are introduced here.

Affy

The Affy package provides basic methods for analyzing Affymetrix oligonucleotide arrays. The following steps outline the usage of the Affy package and associated packages.

(1) Obtaining scaled expression values with 5 different methods (MAS5, RMA, GCRMA, Plier & dChip).
A comparison of three of these methods including references is available in this slide show. Move your CEL files into your working directory and make sure that the corresponding CDF library for your chips is available on your system.

library(affy) # Loads the affy package.
mydata <- ReadAffy()
   # Reads all *.CEL (*.cel) files in your current working directory and stores them into the AffyBatch object 'mydata'.

mydata <- ReadAffy(widget=TRUE)
   # Opens file browser to select specific CEL files.
eset <- rma(mydata)
   # Creates normalized and background corrected expression values using the RMA method. The generated data are stored as
   # ExpressionSet class in the 'eset' object. For large data sets use the more memory efficient justRMA() function.

eset <- mas5(mydata)
   # Uses MAS 5.0 normalization method, which is much slower than RMA!

eset_gcrma <- gcrma(mydata)
   # Use this command to acquire gcrma data. The 'library(gcrma)' needs to be loaded first.

eset_plier <- justPlier(mydata)
   # Use this command to generate plier data. The 'library(plier)' needs to be loaded first.
eset_PMA <- mas5calls(mydata)
   # Generates MAS 5.0 P/M/A calls. The command creates an ExpressionSet with the P/M/A calls in the 'exprs' slot and the
   # Wilcoxon p-values in the 'se.exprs' slot. To access them see below.

eset <- expresso(mydata, normalize.method="invariantset", bg.correct=FALSE, pmcorrect.method="pmonly", summary.method="liwong")
   # Generates expression calls similar to dChip (MBEI) method from Li and Wong.
library(affycoretools); affystart(plot=T, express="mas5")
   # Handy function to normalize all cel files in current working directory, perform qc plots and export normalized data to file.
   # Works for mas5, rma and gcrma.


(2) Exporting data from ExpressionSet objects

write.exprs(eset, file="mydata.txt") # Writes expression values to text file in working directory.
x <- data.frame(exprs(eset), exprs(eset_PMA), assayDataElement(eset_PMA, "se.exprs"))
x <- x[,sort(names(x))]; write.table(x, file="mydata_PMA.xls", quote=F, col.names = NA, sep="\t")
   # Writes expression values, PMA values and Wilcoxon p-values to a text file in the working directory. Remember: the old
   # command for accessing the Wilcoxon p-values in BioC versions <2.0 was 'se.exprs(eset_PMA)'.

(3) Obtaining single probe-level data from 'affybatch' objects (see also documentation '?ReadAffy')

mypm <- pm(mydata) # Retrieves PM intensity values for single probes
mymm <- mm(mydata) # Retrieves MM intensity values for single probes
myaffyids <- probeNames(mydata) # Retrieves Affy IDs
result <- data.frame(myaffyids, mypm, mymm) # Combines the above information in data frame

    (4) Working with 'ExpressionSet' objects (see Biobase Manual)

    eset; pData(eset) # Provides summary information of ExpressionSet object 'eset' and lists the analyzed file names.
    exprs(eset)[1:2,1:4]; exprs(eset)[c("244901_at","244902_at"),1:4]
       # Retrieves specific rows and fields of ExpressionSet object. To learn more about this format class, consult
       # ExpressionSet manual with command '?ExpressionSet'.

    test <- as.data.frame(exprs(eset)); eset2 <- new("ExpressionSet", exprs = as.matrix(test), annotation="ath1121501"); eset2
       # Example for creating an ExpressionSet object from a data frame. To create the object from an external file, use
       # the read.delim() function first and then convert it accordingly.

    data.frame(eset) # Prints content of 'eset' as data frame to STDOUT.
    exprs(eset_PMA)[1:2,1:2]; assayDataElement(eset_PMA, "se.exprs")[1:2,1:2]
       # Returns from the ExpressionSet generated by 'mas5calls(mydata)' the PMA values from its 'exprs' slot and the p-values from its
       # 'se.exprs' slot.

      (5) Retrieving annotation data for Affy IDs (see Annotation Package Manual)

      library(ath1121501.db) # Opens library with annotation data.
      library(help=ath1121501.db) # Shows availability and syntax for annotation data.
      ath1121501() # Provides a summary about the available annotation data sets of an annotation library.
      library(ath1121501cdf); ls(ath1121501cdf) # Retrieves all Affy IDs for a chip in vector format.
      x <- c("245265_at", "260744_at", "259561_at", "254759_at", "267181_at") # Generates sample data set of Affy ID numbers.
      contents(ath1121501ACCNUM)[1:40] # Retrieves locus ID numbers for Affy IDs.
      myAnnot <- data.frame(ACCNUM=sapply(contents(ath1121501ACCNUM), paste, collapse=", "),
                                          SYMBOL=sapply(contents(ath1121501SYMBOL), paste, collapse=", "),
                                          DESC=sapply(contents(ath1121501GENENAME), paste, collapse=", "))
      myAnnot[x, ]
         # Organizes full set of annotation features in a data frame, here: ACCNUM, SYMBOL and GENENAME. The annotations
         # from probe sets with several mappings are collapsed into single strings.

      mget(x, ath1121501GO, ifnotfound=NA) # Retrieves GO information for Affy IDs.

        (6) Accessing probe sequence data

        library(ath1121501probe) # Opens library with probe sequence data.
        print.data.frame(ath1121501probe[1:22,]) # Prints probe sequences and their positions for first two Affy IDs.
        library(Biostrings)
        pm <- DNAStringSet(ath1121501probe$sequence); names(pm) <- ath1121501probe$Probe.Set.Name
           # Stores probe sequences as DNAStringSet object. See HT-Seq manual for more details.



        Visualization and quality controls

        Additional information on this topic can be found in the affyPLM QC Manual, the arrayQualityMetrics package and on the WEHI Affy Lab site.

        library(affyQCReport)
        QCReport(mydata, file="ExampleQC.pdf")
           # Generates a comprehensive QC report for the AffyBatch object 'mydata' in PDF format. See affyQCReport for details.
        deg <- AffyRNAdeg(mydata); summaryAffyRNAdeg(deg); plotAffyRNAdeg(deg)
           # Performs RNA degradation analysis. It averages on each chip the probes relative to the 5'/3' position on the target
           # genes. A summary list and a plot are returned.

        image(mydata[ ,1]) # Reconstructs image with log intensities of first chip.
        hist(mydata[ ,1:2]) # Plots histogram of PM intensities for 1st and 2nd array.
        hist(log2(pm(mydata[,1])), breaks=100, col="blue") # Plots a histogram of the PM ('pm') or MM ('mm') log intensities of the 1st array.
        boxplot(mydata,col="red") # Generates a box plot of un-normalized log intensity values.
        boxplot(data.frame(exprs(eset)), col="blue", main="Normalized Data") # Generates a box plot of normalized log intensity values.
        mva.pairs(pm(mydata)[,c(1,4)])
           # Creates MA-plot for un-normalized data. An MA-plot is a plot of log-intensity ratios (M-values) versus log-intensity
           # averages (A-values) between selected chips (here '[1,4]').

        mva.pairs(exprs(eset)[,c(1,4)]) # Creates MA-plot for normalized data.

        Simpleaffy

        The simpleaffy package automates many repetitive high-level analysis steps.

        (1) Creating expression values from CEL files
        Move the CEL files into the working directory and create a white-space delimited table file "covdesc.txt": the first column (no title) contains the CEL file names; the second column (title "treatment") contains the treatment names.
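        A minimal covdesc.txt for two treatment groups might look like this (the file and treatment names below are hypothetical):

```
treatment
chip1.CEL control
chip2.CEL control
chip3.CEL treated
chip4.CEL treated
```

        Note that the header line contains only the title of the second column, since the first column (the CEL file names) has no title.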

        library(simpleaffy) # Loads "affy" and "simpleaffy" packages.
        raw.data <- read.affy("covdesc.txt")
           # Reads data in working directory that are specified in "covdesc.txt" file; for help on this function type "?read.affy".
        x.rma <- call.exprs(raw.data, "rma")
           # Creates expression values using the RMA method. The function 'call.exprs' always returns log2 expression values; for
           # help on this function type "?call.exprs". One can also use the "gcrma" method here after loading it with the command 'library(gcrma)'.

        x.mas <- call.exprs(raw.data, "mas5") # Computes expression values using MAS 5.0 method.
        write.exprs(x.rma, file="mydata.txt") # Writes expression values to file "mydata.txt".


        (2) Quality control steps
        The "qc" function provides several QC stats for chips. To use it, follow these steps.

        x <- read.affy("covdesc.txt") # Reads data in working directory.
        x.mas5 <- call.exprs(x,"mas5") # Calculates expression values with MAS 5.0 method which is required for the next step!
        qc <- qc(x,x.mas5)
           # Calls "qc" function which generates object containing scale factors, GAPDH-Actin_3'/5' ratios, target intensity,
           # percent present, average, min, max, mean background intensity and more; for more details type "?qc".

        x.mas5@description@preprocessing # Creates QC output like this.
        getGapdh3(cleancdfname(cdfName(x)))
           # To find out which GAPDH/Actin probe sets are used on a given chip; for others use getGapdhM, getGapdh5, getBioB, getBioC...
        plot(qc) # Creates summary plot of QC data. See description on page 5 of vignette "simpleaffy".
        (3) Filtering by expression values

        get.array.subset(x.rma, "treatment",c("CMP1","Ctrl"))
           # When R loads the *.CEL files it also reads the experiment layout from the covdesc.txt file (Ctrl and CMP1).
           # The "get.array.subset" function makes it easy to select subsets of arrays.

        results <- pairwise.comparison(x.rma, "treatment", c("1", "2"), raw.data)
           # Computes mean intensities of replicates from two treatments (means), calculates log2-fold changes between
           # means (fc), performs t-test (tt) and writes out PMA calls (calls). Type "?pairwise.comparison" for more help on this function.

        write.table(data.frame(means(results), fc(results), tt(results), calls(results)), file="my_comp.txt", sep="\t")
           # Command to export all 'pairwise.comparison' data into one table.
        sort(abs(fc(results)),decreasing=TRUE)[1:100]
           # Prints out the 100 highest absolute fold-changes (fc). The same can be done for the means of intensities (means), the t-test results (tt) and the PMA calls (calls).
        significant <- pairwise.filter(results, min.exp=log2(10), min.exp.no=2, fc=log2(8), min.present.no=4, tt= 0.001, present.by.group=FALSE)
           # Function 'pairwise.filter' takes the output from 'pairwise.comparison' and filters for significant changes. Filter Arguments:
           # min.exp: minimum expression cut off
           # min.exp.no: occurrence of 'min.exp' in at least this number of chips
           # min.present.no: present calls on at least this number of chips
           # present.by.group: If true, then present count is restricted to replicate groups and not all chips of an experiment!
           # fc: A gene must show a log2 fold change greater than this to be called significant. A 2.5-fold change can be specified
           # like this: 'fc=log2(2.5)'
           # tt: A gene must be changing with a p-score less than this to be called significant
           # Type "?pairwise.filter" for more info on the different arguments of this function.

        write.table(data.frame(means(significant), fc(significant), tt(significant), calls(significant)), file="my_comp.txt", col.names = NA, sep="\t")
           # Exports all 'pairwise.filter' data into one table.
        (4) Plotting results

        plot(significant, type="scatter")
           # Plots significant changes from above as scatter plot. Alternative plotting types: type="ma" or type="volcano".

        plot(results,significant, type="scatter")
           # Plots means of the two replicate groups as scatter plot and highlights all genes that meet criteria of 'significant'
           # filter. Meaning of colors: red - all present, orange - all present in one group or the other, yellow - all that remain.

        plot(results,type="scatter") # Plots means of the two replicate groups as scatter plot.
        png(file="figure1.png"); plot(significant,type="scatter"); dev.off() # Writes scatter plot to image file.

        (5) Exercise simpleaffy (Command Summary)


        Analysis of Differentially Expressed Genes

        Limma

        Limma is a software package for the analysis of gene expression microarray data, especially the use of linear models for analysing designed experiments and the assessment of differential expression. The package includes pre-processing capabilities for two-color spotted arrays. The differential expression methods apply to all array platforms and treat Affymetrix, single channel and two channel experiments in a unified way. The methods are described in Smyth 2004 and in the limma manual. An illustrated introduction for the GUI packages can be found at WEHI.

        • Tcl/Tk Requirements
        The limmaGUI and affylmGUI packages require Tcl/Tk. On Windows, simply install ActiveTcl.

        • Basic usage
        library(limma) # Loads command-line limma package.
        limmaUsersGuide() # Opens pdf manual for limma.

        ## The cDNA and affy sample data sets of the manual can be downloaded from the limmaGUI and affylmGUI pages.
        library(affylmGUI); affylmGUI()
          # Opens affylmGUI for affy data. For a quick start, follow the instructions for the Estrogen data set.
        library(limmaGUI); limmaGUI()
          # Opens limmaGUI for cDNA array data. For a quick start, follow the instructions for the Swirl Zebrafish data set.


        • Data objects in limma
        There are four main data objects created and used by limma:
        1. RGList: for cDNA data created by function read.maimages()
        2. MAList: for cDNA data created by functions MA.RG() or normalizeWithinArrays()
        3. MArrayLM: created by function lmFit()
        4. TestResults: created by function decideTests()
        More details on this can be found in paragraph 13 of the limma PDF manual (limmaUsersGuide).
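        Conceptually, the M and A values stored in an MAList are simple transformations of the red (R) and green (G) channel intensities. The following base-R sketch (with made-up intensities; not limma's implementation, which additionally handles background correction and weights) shows the underlying arithmetic:

```r
R <- c(1200, 300, 800); G <- c(600, 600, 800) # Made-up red and green channel intensities for three spots.
M <- log2(R/G) # Log-intensity ratios: 1, -1, 0.
A <- (log2(R) + log2(G))/2 # Average log-intensities of both channels.
```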


        Limma: Dual Color Arrays

        (1) Reading cDNA array data


        To make the following commands work, save and extract the SWIRL cDNA microarray sample data into your R working directory. For a quick demonstration of the analysis of this data set, one can copy & paste or source the following command-line summary into the R terminal: my_swirl_commands.txt.

        Requirements
        • Have all intensity data files in one directory. Supported formats are: ArrayVision, ImaGene, GenePix, QuantArray, SMD (QuantArray) or SPOT. If an intensity data file format is not supported then one can specify the corresponding column names during the data import into R (see below). Type '?read.maimages' for more information about supported formats.
        • Targets file: Format of targets file. This file defines which RNA sample was hybridized to each channel of each array.
        • Only for SPOT: separate gene list file
        • Optional: spot type file for identifying special probes such as controls.

        ############################
        ## Reading intensity data ##
        ############################

        targets <- readTargets("Targets.txt") # Reads targets information from file 'Targets.txt' and assigns it to targets frame.
        RG <- read.maimages(targets$FileName, source="spot", sep="\t", path="./")
           # Writes intensity data into 'RGList' list object. The argument 'targets$FileName' specifies the intensity files;
           # 'source="spot"' specifies the image analysis program (e.g. spot); 'path="./"' provides the path to the directory where the intensity 
           # files are located and 'sep="\t"' specifies the field separator.

        RG <- read.maimages(targets$FileName, annotation="My_spot_labels", columns=list(Rf="Rmean", Gf="Gmean", Rb="morphR", Gb="morphG"))
           # If an intensity data file format is not supported then one can specify the corresponding column names as shown here. The
           # provided example is equivalent to the previous command with the exception of the additional argument: annotation="My_spot_labels".
           # This argument allows the import of a spot ID or annotation column. This can be useful for analyzing arrays with unavailable
           # printing layout or incomplete intensity data files. In those cases, all print-tip specific functions of the following steps
           # will not work (e.g. use 'loess' normalization instead of 'printtiploess').

        RG_combined <- cbind(RG1, RG2)
           # One can combine different 'RGList' objects (e.g. RG1, RG2) into one large object. The command
           # 'RG[,1]' shows only the first array, while 'RG[1:100,]' gives the data of the first 100 genes.


        ##########################
        ## Spot quality weights ##
        ##########################

        RG <- read.maimages(targets$FileName, source="spot", wt.fun=wtarea(100))
           # Assigns spot quality weights from 0 to 1 to each spot and writes information as 'weight' component into 'RGList' object.
           # Provided example with 'wt.fun=wtarea(100)' gives full weight of 1 to spots with 100 pixels. Spots with zero pixels or
           # twice the ideal size are given weight of 0. All spots with weight of 0 are considered as flagged and limma will ignore
           # them in the subsequent analysis. The appropriate way of computing quality weights depends on the image analysis software.
           # The command '?QualityWeight' provides more information on available weight functions.

        #################################
        ## Gene lists and print layout ##
        #################################

        RG$genes <- readGAL()
           # The output of some image analysis programs contains intensity, gene name and print layout information all in one file. In
           # the cases of the image analysis programs SPOT, GenePix, ArrayVision and others, this information is located in a separate file
           # from where it needs to be written into the 'RGList' using the 'readGAL' and 'getLayout' functions. The function 'readGAL'
           # reads this information by default from a *.gal file in the working directory. This file contains the columns: 'Block', 'Column',
           # 'Row', 'ID' and 'Name' (optional). If such a file is not provided then one can create it in R using the printing layout information
           # of a given array. Example of an array with 4 by 8 sub-grids (pins) each with 20 by 21 spots:
        c1 <- rep(1:32, each=420) # Block (sub-grid/pin) index: 4 x 8 = 32 blocks with 420 spots each.
        c2 <- rep(rep(1:20, each=21), times=32) # Row index within each block (20 rows).
        c3 <- rep(rep(1:21, times=20), times=32) # Column index within each block (21 columns).
        test.gal <- data.frame(Block=c1, Row=c2, Column=c3, ID=rep("my_ID", times=13440), Name=rep("my_Description", times=13440))
        test.gal[1:20,]


        RG$printer <- getLayout(RG$genes)
           # Extracts print layout (number of pins or subgrids) after the gene list is available and adds this information as
           # $printer component to RGList.


        ####################
        ## Spot type file ##
        ####################
        ## Control spot information can be added to the RGList object with the spot type file (SpotTypes.txt). Regular expressions can
        ## be used here to associate this information with the RG$genes component of the RGList. The incorporation of the spot type
        ## information has the advantage that controls can be highlighted in plots or separated in certain analysis steps. The format
        ## of this file is specified in the limma pdf manual.

        spottypes <- readSpotTypes("SpotTypes.txt") # Reads in the file SpotTypes.txt.
        RG$genes$Status <- controlStatus(spottypes, RG) # Adds a 'Status' column to the RG$genes component in RGList.

        (2) Quality plots

        Recommended quality exploration steps are the imageplot() function of the raw log-ratios and an MA-plot (plotMA) of the raw data for each array.

        spottypes <- readSpotTypes("SpotTypes.txt") # Same as above.
        RG$genes$Status <- controlStatus(spottypes, RG) # Same as above.
        plotMA(RG, array=1)
           # Creates MA-plot which is a plot of the log-intensity ratios (M-values) versus the log-intensity averages of both
           # channels (A-values). To display plots for all arrays next to each other, one can use the layout function 'par(mfrow=c(2,2))'
           # and plot the different arrays in the same plotting device by changing the array numbers in the argument 'array=1'.

        imageplot(log2(RG$R[,1]), RG$printer, low="white", high="red")
           # Creates an image of color shades that represent the values of a statistic in the microarray printing format. This function
           # can be used to explore spatial effects across the microarray.

        plotPrintTipLoess(RG) # Co-plots unnormalized RG data in form of MA-plots with the loess curves for each print-tip.

        (3) Normalization

        Limma integrates several within- and between-array normalization methods. The commands ?normalizeWithinArrays and ?normalizeBetweenArrays provide information about the different methods. The method 'printtiploess' is the default. Use 'loess' instead if your arrays have no print-tip groups (e.g. Agilent) or have fewer than 150 spots per print-tip.
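        To illustrate what a loess-based normalization does conceptually, the following base-R sketch (with synthetic data; not limma's implementation) fits a loess curve through an MA cloud and subtracts the fitted trend from the M-values:

```r
set.seed(123) # For reproducible synthetic data.
A <- runif(500, 6, 14) # Made-up average log-intensities.
M <- 0.1*(A - 10) + rnorm(500, sd=0.2) # Made-up log-ratios with an intensity-dependent bias.
fit <- loess(M ~ A) # Fits a loess curve through the MA cloud.
M_norm <- M - fitted(fit) # Subtracting the fitted trend centers the log-ratios around zero.
```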

        MA <- normalizeWithinArrays(RG, method="printtiploess")
           # Background corrects and normalizes the expression log-ratios of the RGList and assigns the results to a MAList
           # object. The default background correction method is bc.method="subtract". Use bc.method="none" to ignore the background
           # in the analysis. If the quality weights should be ignored in this step, add to the command the 'weights=NULL' argument.
           # For analyzing arrays with unavailable printing layout or incomplete intensity data files, one should use here a print-tip
           # unspecific normalization method such as 'loess'.

        plotPrintTipLoess(MA)
           # Co-plots the normalized MA data in the form of MA-plots with loess curves for each print-tip. Compare the results with
           # the unnormalized RG data from above.

        boxplot(as.data.frame(MA$M))
           # Creates a box-and-whisker plot of the M-values. This is a graphical summary of the M-value distributions across the arrays. The two
           # hinges of each box represent the first and third quartile, the lines (whiskers) extend to the largest and smallest
           # observation within 1.5 times the box height, and everything beyond that is plotted as extreme values in the form of dots.
           # Inconsistent spreads of hinges and whiskers between arrays can indicate normalization issues. Between-array normalization may
           # improve this situation (see the limma manual).

        MA <- normalizeBetweenArrays(MA)
           # If box-and-whisker plot indicates different spreads of M-values, as in the above example, then one can use this secondary
           # between-array normalization step.


        (4) Background correction

        The above normalizeWithinArrays function performs by default a background correction by subtracting the background values from the expression intensities.

        RGb <- backgroundCorrect(RG, method="subtract")
        MA <- normalizeWithinArrays(RGb)
           # These two commands perform the same operation as the above 'normalizeWithinArrays' command.
        RG <- backgroundCorrect(RG, method="normexp", offset=50)
           # This more advanced background correction method 'normexp' adjusts the expression values adaptively for background
           # values and results in strictly positive expression values. This method is particularly effective for obtaining more
           # robust ratios for low expressed genes.

          (5) Linear models and differential expression of dual color data

          Limma provides advanced statistical methods for linear modelling of microarray data and for identifying differentially expressed genes. The approach fits a linear model to the data and uses an empirical Bayes method for assessing differential expression. It is described in Smyth 2004. One or two experiment definition matrices need to be specified during the analysis: a design matrix defining the RNA samples and a contrast matrix (optional for simple experiments) defining the comparisons to be performed. How to set up these matrices is described in the limma manual and on this page from Natalie Thorne. Another useful manual is Advanced Linear Modeling and Time-Series Analysis.
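          As a minimal base-R sketch of what such a design matrix can look like, the following uses a hypothetical three-group, two-replicate layout (the actual matrices for a given experiment must follow the limma manual):

```r
# Hypothetical layout: three sample groups with two replicates each.
groups <- factor(c(1, 1, 2, 2, 3, 3))
# model.matrix() from base R builds the design matrix; '-1' drops the
# intercept so each column indicates membership in one group.
design <- model.matrix(~ -1 + groups)
colnames(design) <- c("group1", "group2", "group3")
design # 6 rows (arrays) x 3 columns (groups), entries 0/1
```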

          cDNA data set with common reference design
          The swirl experiment is a classic one-sample test in which a mutant is compared to a wild-type (WT) sample using four replicates, two of which are dye swaps. The following commands will work with the objects obtained in the above steps.


          design <- c(-1,1,-1,1) # Creates appropriate design matrix.
          fit <- lmFit(MA, design)
             # Fits a linear model for each gene based on the given series of arrays. The returned list object 'fit' contains the average
             #  M-value ($coefficients) for each gene and their standard deviations ($sigma). The ordinary t-statistics can be computed
             # with this command: ordinary.t <- fit$coef / fit$stdev.unscaled / fit$sigma. However, the empirical Bayes moderated t-statistics
             # should be the preferred method for limma users (see below).

          plotMA(fit); abline(0,0,col="blue") # Plots MA-plot with baseline.
          fit <- eBayes(fit)
             # Computes empirical Bayes statistics for differential expression. These moderated t-statistics use standard deviations
             # that have been shrunk towards a pooled standard deviation value.
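          The shrinkage idea can be illustrated with a toy base-R calculation (the prior values below are hypothetical; eBayes estimates the real prior df and variance from the data):

```r
# Posterior variances are a weighted average of each gene's sample variance
# and a common prior variance (Smyth 2004): s2_post = (d0*s0^2 + d*s^2)/(d0 + d).
d    <- 3                     # residual degrees of freedom per gene
d0   <- 4; s0_2 <- 0.05       # hypothetical prior df and prior variance
s2   <- c(0.01, 0.05, 0.40)   # sample variances of three toy genes
s2_post <- (d0 * s0_2 + d * s2) / (d0 + d)
round(s2_post, 4) # extreme variances are pulled towards the prior value 0.05
```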

          qqt(fit$t,df=fit$df.prior+fit$df.residual,pch=16,cex=0.2); abline(0,1)
             # Plots the quantiles of a data sample against the theoretical quantiles of a Student's t distribution.
          options(digits=3) # Sets the number of digits to print in numeric output to 3 digits.
          topTable(fit, adjust="fdr", sort.by="B", number=10)
             # Generates a list of the top 10 ('number=10') differentially expressed genes sorted by B-values ('sort.by="B"'). The summary table
             # contains the following information: logFC is the log2-fold change, AveExpr is the average expression value across all
             # arrays and channels, the moderated t-statistic (t) is the ratio of the logFC to its standard error, the P.Value is the associated
             # p-value, the adj.P.Value is the p-value adjusted for multiple testing and the B-value (B) is the empirical Bayes log-odds of
             # differential expression (the higher the better). Usually one wants to base gene selection on the adj.P.Value rather than the
             # t- or B-values. More details on this can be found in the limma PDF manual (type 'limmaUsersGuide()') or on this FAQ page. If the
             # swirl data set was normalized with the 'printtiploess' method and the between-array method, then the top downregulated gene is
             # BMP2, which is the mutated gene of this sample.
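          The FDR adjustment reported in the adj.P.Val column corresponds to base R's p.adjust() with the Benjamini-Hochberg method, shown here on a toy p-value vector:

```r
# Toy raw p-values for five genes, already sorted for readability.
p <- c(0.001, 0.01, 0.02, 0.2, 0.9)
p_adj <- p.adjust(p, method = "fdr") # Benjamini-Hochberg adjustment
p_adj # adjusted values are never smaller than the raw p-values
```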

          x <- topTable(fit, adjust="fdr", sort.by="P", number=100); x[x$adj.P.Val < 0.01,]
             # Retains only the candidates with adjusted P-values below 0.01.
          plotMA(fit); ord <- order(fit$lods, decreasing=TRUE); top30 <- ord[1:30]; text(fit$Amean[top30], fit$coef[top30], labels=fit$genes[top30, "Name"], cex=0.8, col="blue") # Plots MA-plot and highlights 30 top changes.


          Limma: Affymetrix Arrays
          To enable the following analysis steps, users need to save and extract this archive of 6 CEL files into their R working directory. These sample files are from the Cold Stress Time Course of the AtGenExpress site (ftp batch download). This affy_targets.txt file is required to select specific CEL files during the analysis. For a quick demonstration of the analysis of this data set, one can copy & paste the following commands into the R terminal.

          library(limma) # Loads limma library.
          targets <- readTargets("affy_targets.txt") # Reads targets information from file 'affy_targets.txt' and assigns it to targets frame.
          library(affy); data <- ReadAffy(filenames=targets$FileName) # Reads CEL files (specified in 'targets') into AffyBatch object.
          eset <- rma(data) # Normalizes data with 'rma' function and assigns them to ExpressionSet object (see above).
          # exprs(eset) <- log2(exprs(eset))
             # If eset contains absolute intensity values, like MAS5 results, then they should be transformed to log2 (or natural log) values for
             # limma. RMA/GCRMA generate log2 values, whereas MAS5 produces absolute values.

          pData(eset) # Lists the analyzed file names.
          write.exprs(eset, file="affy_all.txt")
             # Exports all affy expression values to tab delimited text file. The MAS 5.0 P/M/A calls can be retrieved with the simpleaffy
             # package or with the affy package like this: 'eset <- mas5calls(data); write.exprs(eset, file="my_PMA.txt")'.

          design <- model.matrix(~ -1+factor(c(1,1,2,2,3,3)))
             # Creates appropriate design matrix. Alternatively, such a design matrix can be created in any spreadsheet program and then
             # imported into R.

          colnames(design) <- c("group1", "group2", "group3") # Assigns column names.
          fit <- lmFit(eset, design) # Fits a linear model for each gene based on the given series of arrays.
          contrast.matrix <- makeContrasts(group2-group1, group3-group2, group3-group1, levels=design)
             # Creates appropriate contrast matrix to perform all pairwise comparisons. Alternatively, such a contrast matrix can
             # be created in any spreadsheet program and then imported into R. For complex experiments one can also use this function
             # to compute a contrast matrix with all possible pairwise comparisons.

          fit2 <- contrasts.fit(fit, contrast.matrix) # Computes estimated coefficients and standard errors for a given set of contrasts.
          fit2 <- eBayes(fit2)
             # Computes moderated t-statistics and log-odds of differential expression by empirical Bayes shrinkage of the standard
             # errors towards a common value.

          topTable(fit2, coef=1, adjust="fdr", sort.by="B", number=10)
             # Generates list of top 10 ('number=10') differentially expressed genes sorted by B-values ('sort.by=B') for each of the three
             # comparison groups ('coef=1') in this sample set. The summary table contains the following information: logFC is the
             # log2-fold change, the AveExpr is the average expression value across all arrays and channels, the moderated t-statistic (t)
             # is the ratio of the logFC to its standard error, the P.Value is the associated p-value, the adj.P.Value is the p-value adjusted
             # for multiple testing and the B-value (B) is the log-odds that a gene is differentially expressed (the higher the better). Usually
             # one wants to base gene selection on the adjusted P-value rather than the t- or B-values. More details on this can be found in
             # the limma PDF manual (type 'limmaUsersGuide()') or on this FAQ page.

          write.table(topTable(fit2, coef=1, adjust="fdr", sort.by="B", number=50000), file="limma_complete.xls", row.names=F, sep="\t")
             # Exports the complete limma statistics table for the first comparison group ('coef=1') to a tab-delimited text file.
          results <- decideTests(fit2, p.value=0.05); vennDiagram(results)
             # Creates a Venn diagram of all changed genes with (adjusted) p-values less than or equal to 0.05.

          x <- topTable(fit2, coef=1, adjust="fdr", sort.by="P", number=50000); y <- x[x$adj.P.Val < 0.05,]; y; print("Number of genes in this list:"); length(y$ID)
             # Retains the candidates with adjusted P-values < 0.05 for the chosen group ('coef=1') and provides the number of candidates in the
             # list. Repeating this for each 'coef' should yield numbers identical to the sums of the values in the circles of the above Venn diagram.

          x <- topTable(fit2, coef=1, adjust="fdr", sort.by="P", number=50000); y <- x[x$adj.P.Val < 0.01 & (x$logFC > 1 | x$logFC < -1) & x$AveExpr > 10,]; y; print("Number of genes in this list:"); length(y$ID)
             # Same as above but with a complex filter: adjusted P-value < 0.01 AND at least a 2-fold change AND average expression (A) > 10.

          results <- decideTests(fit2, p.value=0.000005); heatDiagram(results, fit2$coef, primary=1)
             # This function plots heat diagram gene expression profiles for genes which are significantly differentially expressed in the
             # primary condition (this is not a cluster analysis heat map). Genes are sorted by differential expression under the primary
             # condition. The argument 'primary=1' selects the first contrast column in the 'results' matrix as primary condition. The plotted
             # genes can be extracted like this 'results[results[,1]==1,]'. More information on this function can be found in the limma manual.

          For visualizing the analyzed affy data, use the same visualization and quality control steps as described in the affy package section.

          RankProd

          The RankProd package contains functions for the identification of differentially expressed genes using the rank product non-parametric method from Breitling et al., 2004. It generates a list of up- or down-regulated genes based on the estimated percentage of false positive predictions (pfp), which is also known as false discovery rate (FDR). The attraction of this method is its ability to analyze data sets from different origins (e.g. laboratories) or variable environments.
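          The idea behind the rank product statistic can be sketched in base R on a toy fold-change matrix (hypothetical data; the RP function used below handles the real computation, including permutation-based pfp estimation):

```r
set.seed(123)
# Toy matrix of log2 fold changes: 5 genes (rows) x 4 replicates (columns).
fc <- matrix(rnorm(20), nrow = 5,
             dimnames = list(paste0("gene", 1:5), paste0("rep", 1:4)))
# Rank the genes within each replicate (rank 1 = strongest up-regulation).
ranks <- apply(-fc, 2, rank)
# Rank product = geometric mean of the per-replicate ranks; small values
# indicate consistent up-regulation across replicates.
RP <- apply(ranks, 1, function(r) prod(r)^(1 / length(r)))
sort(RP)
```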

          Required data objects:
          1. 'data' object of type matrix or data frame containing expression values in log2 scale.
          2. 'cl' vector of length ncol(data) with class labels of the samples/treatments.
          3. 'origin' vector of length ncol(data) with origin labels of samples. The origin vector is only required for analyzing data from multiple origins!
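          A minimal sketch of these three objects for a hypothetical experiment with six arrays from two laboratories (random toy values):

```r
# 10 genes x 6 arrays of log2-scale expression values (toy data).
data <- matrix(rnorm(60), nrow = 10, ncol = 6)
cl <- c(0, 0, 0, 1, 1, 1)     # class labels, one per column of 'data'
origin <- c(1, 1, 2, 1, 2, 2) # origin labels; only needed for multi-origin data
stopifnot(length(cl) == ncol(data), length(origin) == ncol(data))
```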


          (1) Differential expression analysis with Affymetrix data from single origin


          library(RankProd) # Loads the required library.
          data(arab) # Loads sample data matrix or data frame 'arab' that contains log2 intensities of 500 genes (rows) for 10 samples (columns).
          my_expr_matrix <- exprs(my_eset)
             # When the analysis is started from Affy Cel files, one can create the required expression matrix or data frame for input
             # into RankProd from an ExpressionSet object with the 'exprs' function.

          arab.sub <- arab[, which(arab.origin == 1)] # Generates sub-data set for single origin analysis.
          cl <- arab.cl[which(arab.origin == 1)] # Generates 'cl' vector with classes (contrast groups) labeled with '0' and '1'.
          RP.out <- RP(arab.sub, cl, num.perm = 100, logged = TRUE, na.rm = FALSE, plot = FALSE, gene.names = arab.gnames, rand = 123)
             # Performs rank product analysis for single-origin data. If the input data are not log transformed use 'logged=FALSE'.

          topGene(RP.out, cutoff = 0.05, logged = TRUE, logbase = 2, gene.names = arab.gnames)
             # The function 'topGene' outputs identified genes with user-specified filters: 'cutoff': pfp threshold; 'num.gene': number
             # of top genes identified; 'gene.names': vector with Affy IDs in order of input data set. For log2 scaled data use the
             # arguments 'logged=TRUE' and 'logbase=2'. The generated 'Table 1' lists the up-regulated genes and the 'Table 2' the
             # down-regulated genes. The four columns in each table contain the following information: (1) 'gene index' = position in
             # original data set, (2) 'RP/Rsum' = rank product, (3) 'FC' = fold change of the averaged expression levels, (4) 'pfp' = estimated pfp or FDR.

          plotRP(RP.out, cutoff = 0.05)
             # Plots pfp (FDR) versus number of identified up- and down-regulated genes. Genes within the specified cutoff range are plotted in red.



          (2) Differential expression analysis with Affymetrix data from multiple origins

          cl <- arab.cl
             # Assigns 'cl' vector with classes (contrast groups) labeled with '0' and '1'. Object 'arab.cl' is part of sample
             # data set 'arab'.

          origin <- arab.origin # Assigns 'origin' vector with origin labels of samples. Object 'arab.origin' is part of sample data set 'arab'.
          RP.adv.out <- RPadvance(arab, cl, origin, num.perm = 100, logged = TRUE, gene.names = arab.gnames, rand = 123)
             # The function 'RPadvance' performs the rank product analysis for multiple-origin data that requires the above 'origin' vector.
          topGene(RP.adv.out, cutoff = 0.05, logged = TRUE, logbase = 2, gene.names = arab.gnames) # Generates the 'topGene' list (see above).

            (3) Differential expression analysis with cDNA arrays: see vignette RankProd


            SAM

            SAM is part of the 'siggenes' package.


            library(siggenes) # Loads siggenes package.
            ?sam # Opens help page for SAM.
            sam(data, cl, ...) # Basic syntax for running SAM: 'data' is an expression matrix or data frame and 'cl' a vector of class labels.

            R/maanova


            Digital Gene Expression (DGE)


            Additional Dual Color Array Packages

            Bioconductor provides various additional packages for the analysis of dual-color cDNA arrays.

            Marray

            Exploratory analysis for two-color spotted microarray data...


            Chromosome maps

            Several BioConductor packages are available for displaying genomic information graphically. The following commands illustrate some of the chromosome plotting utilities from the geneplotter package. Much more detailed information on this topic is available in the NG Sequence Manual.

            library(annotate); library(geneplotter); library(hgu95av2); newChrom <- buildChromLocation("hgu95av2"); newChrom; cPlot(newChrom) # This displays all genes on the chromosomes of an organism. Genes encoded by the antisense strand are represented by lines below the chromosomes.
            data(sample.ExpressionSet); myeset <- sample.ExpressionSet; cColor(featureNames(sample.ExpressionSet), "red", newChrom) # This highlights in the above plot a set of genes of interest in red color (e.g. expressed genes of an experiment).
            cPlot(newChrom,c("1","2"), fg="yellow", scale="relative"); cColor(featureNames(myeset), "red", newChrom) # Plots just a specific set of chromosomes.
                   

              Gene Ontologies

              General

              Several packages are available for Gene Ontology (GO) analysis. The following examples of the different packages use the GO annotations from Arabidopsis.
              Basic GO usage with GOstats
              ###################################
              ## Basic usage of GO information ##
              ###################################

              library(GOstats); library(GO.db); library(ath1121501.db); library(annotate) # Loads the required libraries.
              goann <- as.list(GOTERM) # Retrieves full set of GO annotations.
              zz <- eapply(GOTERM, function(x) x@Ontology); table(unlist(zz)) # Calculates the number of annotations for each ontology category.
              ?GOTERM # Opens the help page explaining how to access the different GO components.
              GOTERM$"GO:0003700"; GOMFPARENTS$"GO:0003700"; GOMFCHILDREN$"GO:0003700"
                 # Shows how to print out the GO annotations for one entry and how to retrieve its direct parents and children.

              GOMFANCESTOR$"GO:0003700"; GOMFOFFSPRING$"GO:0003700" # Prints out complete lineages of parents and children for a GO ID.
              goterms <- unlist(eapply(GOTERM, function(x) x@Term)); goterms[grep("molecular_function", goterms)]
                 # Retrieves all GO terms and prints out only those matching a search string given in the grep function. The same can
                 # be done for the definition field with 'x@Definition'. A set of GO IDs can be provided as well: goterms[GOMFANCESTOR$"GO:0005507"]
              go_df <- data.frame(GOID=unlist(eapply(GOTERM, function(x) x@GOID)), Term=unlist(eapply(GOTERM, function(x) x@Term)), Ont=unlist(eapply(GOTERM, function(x) x@Ontology)))
                 # Generates data frame of the commonly used GO components: GOID, GO Term and Ontology Type.

              affyGO <- eapply(ath1121501GO, getOntology, "MF"); table(sapply(affyGO, length))
                 # Retrieves MF GO terms for all probe IDs of a chosen Affy chip and calculates how many probes have multiple GO terms
                 # associated. Use "BP" and "CC" arguments to retrieve BP/CC GO terms.

              affyGOdf <- data.frame(unlist(affyGO)); affyGOdf <- data.frame(AffyID=row.names(affyGOdf), GOID=affyGOdf[,1]); affyGOdf <- merge(affyGOdf, go_df, by.x="GOID", by.y="GOID", all.x=T)
                 # Converts the above MF list object into a data frame. The AffyID occurrence counts are appended to the AffyIDs. The last step
                 # merges the two data frames: 'affyGOdf' and 'go_df'.

              unique(lookUp("GO:0004713", "ath1121501", "GO2ALLPROBES")) # Retrieves all Affy IDs that are associated with a GO node.
              z <- affyGO[c("254759_at", "260744_at")]; as.list(GOTERM)[z[[1]]]
                 # Retrieves GO IDs for set of Affy IDs and then the corresponding GO term for first Affy ID.
              a <- data.frame(unlist(z)); a <- data.frame(ID=row.names(a), a); b <- data.frame(goterms[as.vector(unlist(z))]); b <- data.frame(ID=row.names(b), b); merge(b, a, by.x = "ID", by.y="unlist.z.")
                 # Merges Affy ID, GO ID and GO annotation information.
              affyEv <- eapply(ath1121501GO, getEvidence); table(unlist(affyEv, use.names = FALSE))
                 # Provides evidence code information for each gene and summarizes the result.
              test1 <- eapply(ath1121501GO, dropECode, c("IEA", "NR")); table(unlist(sapply(test1, getEvidence), use.names = FALSE))
                 # This example shows how one can remove certain evidence codes (e.g. IEA, IEP) from the analysis.

              ##############################################
              ## GO term enrichment analysis with GOstats ##
              ##############################################
              ## Example of how to test a sample set of probe set keys for over-representation of GO terms using a hypergeometric distribution
              ## test with the function hyperGTest(). For more information, read the GOstatsHyperG manual.

              library(ath1121501.db);
              library(ath1121501cdf)
              affySample <- c("266592_at", "266703_at", "266199_at", "246949_at", "267370_at", "267115_s_at", "266489_at", "259845_at", "266295_at", "262632_at")
              geneSample <- as.vector(unlist(mget(affySample, ath1121501ACCNUM, ifnotfound=NA)))
              affyUniverse <- ls(ath1121501cdf)
              geneUniverse <- as.vector(unlist(mget(affyUniverse, ath1121501ACCNUM, ifnotfound=NA)))
              params <- new("GOHyperGParams", geneIds = geneSample, universeGeneIds = geneUniverse, annotation="ath1121501", ontology = "MF", pvalueCutoff = 0.5, conditional = FALSE, testDirection = "over")
              hgOver <- hyperGTest(params)
              summary(hgOver)
              htmlReport(hgOver, file = "MyhyperGresult.html")


                GOHyperGAll  

                GOHyperGAll function: To test a sample population of genes for over-representation of GO terms, the function 'GOHyperGAll' computes for all GO nodes a hypergeometric distribution test and returns the corresponding raw and Bonferroni-corrected p-values (notes about implementation). A subsequent filter function performs a GO Slim analysis using default or custom GO Slim categories. The method has been published in Plant Physiol (2008) 147, 41-57. GOHyperGAll provides similar utilities to the hyperGTest function in the GOstats package from BioConductor. The main difference is that GOHyperGAll simplifies the usage of custom chip-to-gene and gene-to-GO mappings.
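                The test performed at each node can be reproduced with base R's phyper() function. The numbers below are toy values: a universe of 10,000 genes, 200 of them annotated at the node of interest, and a sample of 100 genes with 12 hits:

```r
# P(X >= 12) when drawing 100 genes from a universe of 10,000 that
# contains 200 genes annotated at the GO node of interest.
p <- phyper(12 - 1, m = 200, n = 10000 - 200, k = 100, lower.tail = FALSE)
p # raw p-value for over-representation at this node
```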

                To demo the utility of the GOHyperGAll function, run it as shown below. The initial sample data generation step takes some time (~10 min), since it needs to generate the required data objects for all three ontologies. This needs to be done only once for every custom gene-to-GO annotation.

                #############################################################################
                ## (1.1) Import all required functions with the following source() command ##

                #############################################################################
                source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/GOHyperGAll.txt")

                ###########################################################
                ## (2.1) Using gene-to-GO mappings from geneontology.org ##
                ###########################################################

                readGOorg(myfile = "gene_association.tair", colno = c(5,11,9), org = "Arabidopsis"); gene2GOlist(rootUK=T)
                   # Download the required annotation table from geneontology.org and unzip it. Then point the 'readGOorg()' function to
                   # this file name. The two functions generate 4 data frames with the assigned gene-to-GO mappings and 3 lists containing
                   # the gene-to-GO-OFFSPRING associations. When the processes are completed, 6 files will be saved in your working directory!
                   # They can be reloaded in future R sessions with the 'load' command below. If the argument 'rootUK' is set to TRUE, then
                   # the root nodes are treated as terminal nodes to account for the new assignment of unknown genes to the root nodes.


                ###############################################
                ## (2.2) Using gene-to-GO mappings from Bioc ##
                ###############################################
                ## Note: users should execute either step (2.1) or (2.2), but not both!

                sampleDFgene2GO(lib="ath1121501.db"); gene2GOlist(rootUK=T)
                   # Similar to the above, but the gene-to-GO mappings are obtained from BioC. The 4 generated sample data frames and 3
                   # list objects can be reloaded in future R sessions with the 'load' command below.


                ######################################################################
                ## (2.3) Obtain AffyID-to-GeneID mappings when working with AffyIDs ##
                ######################################################################

                AffyID2GeneID(map = "ftp://ftp.arabidopsis.org/home/tair/Microarrays/Affymetrix/affy_ATH1_array_elements-2010-12-20.txt")
                   # When working with AffyIDs, this function creates an AffyID-to-GeneID mapping data frame using by default the TAIR
                   # mappings for the Arabidopsis ATH1 chip. To use the function for the mappings of other chips, one needs to create the
                   # corresponding decoding data frame 'affy2locusDF'.


                ############################################################
                ## (3.1) Reloading required data objects from local files ##
                ############################################################

                loadData(); load(file="MF_node_affy_list"); load(file="BP_node_affy_list"); load(file="CC_node_affy_list")
                   # This step makes future sessions much faster, since it allows one to skip the previous data generation steps (2.1-2.3).
                   # A sample data set is available here: ArabSampleGOHyperGAll.zip (Jan 2010).

                ##########################################
                ## (3.2) Obtain a sample set of GeneIDs ##
                ##########################################

                test_sample <- unique(as.vector(GO_MF_DF[1:40,2])) # When working with GeneIDs.
                test_sample <- AffyID2GeneID(affyIDs=affy_sample, probe2gene=1)
                   # When working with AffyIDs, one can use the function 'AffyID2GeneID' to obtain for a set of AffyIDs their corresponding
                   # GeneIDs from the data frame 'affy2locusDF' (see above). For probe sets that match several loci, only the first locus
                   # ID will be used if the argument 'probe2gene' is set to 1. To demo the function, one can use the following sample vector:
                affy_sample <- c("266592_at", "266703_at", "266199_at", "246949_at", "267370_at", "267115_s_at", "266489_at", "259845_at", "266295_at", "262632_at") # AffyID sample


                ##########################################################################
                ## (4.1) Perform phyper test, goSlim subsetting and plotting of results ##
                ##########################################################################

                GOHyperGAll_result <- GOHyperGAll(gocat="MF", sample=test_sample, Nannot=2); GOHyperGAll_result[1:10,-8]
                   # The function 'GOHyperGAll()' performs the phyper test against all nodes in the GO network. It returns raw and
                   # adjusted p-values. The Bonferroni correction is used as the p-value adjustment method according to Boyle et al, 2004 (online).
                   # The argument 'Nannot' defines the minimum number of direct annotations per GO node from the sample set to determine
                   # the number of tested hypotheses for the p-value adjustment. The argument 'gocat' can be assigned the values "MF", "BP"
                   # and "CC". Omitting the '-8' column index will also display the sample keys matching at every GO node.

                subset <- GOHyperGAll_Subset(GOHyperGAll_result, sample=test_sample, type="goSlim"); subset[,-8]
                   # The function 'GOHyperGAll_Subset()' subsets the GOHyperGAll results by assigned GO nodes or custom goSlim categories.
                   # The argument 'type' can be assigned the values "goSlim" or "assigned". The optional argument 'myslimv' can be used to
                   # provide a custom goSlim vector. Omitting the '-8' column index will also show the sample keys matching at every GO node.

                pie(subset[subset$SampleMatch>0 ,3], labels=as.vector(subset[subset$SampleMatch>0 ,1]), main=unique(as.vector(subset[subset$SampleMatch>0, 7]))) # Plots pie chart of subsetted results.

                ##############################################################
                ## (4.2) Reduce GO Term Redundancy in 'GOHyperGAll_results' ##
                ##############################################################

                simplifyDF <- GOHyperGAll_Simplify(GOHyperGAll_result, gocat="MF", cutoff=0.001, correct=T)
                   # The result data frame 'GOHyperGAll_result' often contains several connected GO terms with significant scores which
                   # can complicate the interpretation of large sample sets. To reduce the redundancy, the function 'GOHyperGAll_Simplify'
                   # subsets the data frame 'GOHyperGAll_result' by a user-specified p-value cutoff and removes from it all GO nodes with
                   # overlapping children sets (OFFSPRING), while the best scoring nodes remain in the data frame.

                data.frame(GOHyperGAll_result[GOHyperGAll_result[,1] %in% simplifyDF[,1], -8], GO_OL_Match=simplifyDF[,2])
                   # This command returns the redundancy reduced data set. The column 'GO_OL_Match' provides the number of accessions that
                   # match the connected nodes.


                ################################################
                ## (4.3) Batch Analysis of Many Gene Clusters ##
                ################################################

                BatchResult <- GOCluster_Report(CL_DF=CL_DF, method="all", id_type="gene", CLSZ=10, cutoff=0.001, gocats=c("MF", "BP", "CC"), recordSpecGO=c("GO:0003674", "GO:0008150", "GO:0005575"))
                   # The function 'GOCluster_Report' performs the three GO analyses in batch mode: 'GOHyperGAll', 'GOHyperGAll_Subset'
                   # or 'GOHyperGAll_Simplify'. It processes many groups of genes (e.g. gene expression clusters) and returns the results
                   # conveniently organized in a single data frame. The gene sets need to be provided in a data frame of the format
                   # specified at the end of the GOHyperGAll script. CLSZ: minimum cluster size to consider; method: "all", "slim" or
                   # "simplify"; gocat: "MF", "BP" or "CC"; cutoff: adjusted p-value cutoff; recordSpecGO: argument to include one specific
                   # GOID in each of the 3 ontologies, e.g: recordSpecGO=c("GO:0003674", "GO:0008150", "GO:0005575").



                GSEA

                Gene Set Enrichment Analysis (GSEA): The following function converts the objects generated by the above GOHyperGAll script into the gene set format (*.gmt) used by the GSEA tool from the Broad Institute. This way the custom GO files of the GOHyperGAll method can be used as gene sets for the Java or R releases of the GSEA program.
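                The *.gmt format itself is simple: one tab-delimited line per gene set containing the set name, a description field, and then the member IDs. A base-R sketch with two hypothetical GO sets (the file name 'my_sets.gmt' is illustrative):

```r
# Two hypothetical gene sets keyed by GO ID.
sets <- list("GO:0003700" = c("254759_at", "260744_at"),
             "GO:0004713" = c("266592_at", "266703_at", "266199_at"))
# Build one tab-delimited line per set: name, description ("na"), members.
gmt <- vapply(names(sets),
              function(n) paste(c(n, "na", sets[[n]]), collapse = "\t"),
              character(1))
writeLines(gmt, "my_sets.gmt") # writes a minimal *.gmt file
```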

                (1) Import the required file format conversion function with the following source() command

                source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/GOHyperGAll.txt") # Imports the GOhyper2GSEA() function.
                GOhyper2GSEA(myfile=c("MF_node_affy_list", "BP_node_affy_list", "CC_node_affy_list"), type="all")
                   # Converts XX_node_affy_list or GO_XX_DF files into GO_XX_ALL.gmt and GO_XX_TERM.gmt files, respectively. Use the argument
                   # "all" for XX_node_affy_list files and "terminal" for GO_XX_DF files. The former contain all GO assignments, whereas the
                   # latter contain only the direct GO assignments.

                (2) Import the generated gene set files (*.gmt) along with the gene expression intensity data (*.txt) or pre-ranked gene lists (*.rnk) into the Java or R GSEA program, and follow the instructions on the GSEA User Guide. A detailed description of the data formats is provided on the Data Formats page.
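                The *.gmt format itself is simple: one gene set per line, with the set name, a short description and the member genes separated by tabs. The following base-R sketch illustrates this layout; the helper name 'writeGMT' and the toy gene sets are made up for illustration (the actual conversion of the GOHyperGAll objects is done by the GOhyper2GSEA() function above):

```r
# Sketch of the *.gmt gene set layout: one tab-delimited line per set in the
# order set name, description, member genes. The helper 'writeGMT' and the toy
# sets are hypothetical; GOhyper2GSEA() performs this conversion in practice.
writeGMT <- function(genesets, descriptions, file) {
    lines <- mapply(function(name, desc, genes) paste(c(name, desc, genes), collapse="\t"),
                    names(genesets), descriptions, genesets)
    writeLines(lines, con=file)
}
mysets <- list("GO:0003674"=c("245265_at", "260744_at"),
               "GO:0008150"=c("256497_at", "250942_at", "265611_at"))
writeGMT(mysets, descriptions=c("molecular_function", "biological_process"), file="my_sets.gmt")
readLines("my_sets.gmt")  # two lines, one per gene set
```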

                goTools and goCluster

                Getting started with goTools

                my_list <- list(L1=c("245265_at", "260744_at", "259561_at", "254759_at", "267181_at"), L2=c("256497_at", "250942_at", "265611_at", "247488_at", "255148_at")) # Creates a working list with two components each consisting of Arabidopsis Affy IDs.
                library("goTools", verbose = FALSE) # Loads the required library.
                ontoCompare(my_list, probeType = "ath1121501", goType="MF")
                   # Provides compositional differences of the molecular function GOs between the Affy ID vectors contained in my_list.
                ontoCompare(my_list, probeType = "ath1121501", goType="MF", plot=TRUE) # Plots the above MF GO compositions in a bar diagram.
                par(mar = c(5, 8, 5, 8)); res2 <- ontoCompare(my_list["L1"], probeType = "ath1121501", method="TIDS", goType="MF", plot=FALSE); ontoPlot(res2, cex = 0.7) # Plots pie chart for "L1" gene list component in my_list.

                goCluster demo

                library("goCluster", verbose = FALSE) # Loads the goCluster package.
                testdata <- eset[c(1:400),c(1:4)] # Selects 400 genes from previously generated ExpressionSet object (e.g. simpleaffy) with four chips.
                test <- new("goCluster") # Creates goCluster object.
                testConfigured <- config(test)
                   # The config-method starts the configuration process and stores the results in object 'testConfigured'. The required
                   # configuration information needs to be provided in the interactive mode (use 'ath1121501').

                setup(testConfigured)[["algo"]]; print(testConfigured) # Retrieves the generated configuration information.
                testExecuted <- execute(testConfigured) # Analyzes the configured object.
                print(testExecuted) # Prints the above analysis object to give a result summary.
                display(testExecuted, selection = 3) # Displays the result in the form of a heat map.
                testExecuted2 <- testExecuted; testExecuted2@visu <- new("clusterVisualHeatmap"); testExecuted2@visu <- execute(testExecuted2@visu, testExecuted2); display(testExecuted2, selection = "GO:0009535") # Displays result for chosen GO identifier.


                  KEGG Pathway Analysis

                  • KEGG Pathway Script: Script to map Arabidopsis locus IDs to KEGG pathways. It can be easily adjusted to any organism.
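                  The KEGG REST service returns such gene-to-pathway mappings as two-column tab-delimited text (gene ID, pathway ID). A minimal parsing sketch in base R; the sample lines below are invented for illustration, and a real analysis would download the full mapping instead (e.g. from https://rest.kegg.jp/link/pathway/ath):

```r
# Parses a KEGG-style gene-to-pathway mapping (two tab-delimited columns) into
# a list of pathways per locus. The sample lines are invented for illustration;
# a real query would download them from the KEGG REST service.
kegg_lines <- c("ath:AT1G01090\tpath:ath00010",
                "ath:AT1G01090\tpath:ath00620",
                "ath:AT1G22410\tpath:ath00010")
map <- read.table(text=kegg_lines, sep="\t", col.names=c("gene", "pathway"), stringsAsFactors=FALSE)
pathways_per_gene <- split(map$pathway, map$gene)
pathways_per_gene  # list with one character vector of pathway IDs per locus
```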


                  Motif Identification in Promoter Regions

                  • COSMO from Oliver Bembom et al.: a tool for discovering motifs in unaligned DNA sequences
                  • rGADEM: de novo motif discovery


                  Phylogenetic Analysis


                  Cheminformatics in R

                  • ChemmineR is an R package for mining drug-like compound and screening data sets.


                  Protein Structure Analysis

                  • Bio3D in R: utilities for the analysis of protein structure and sequence data.


                  MS Data Analysis


                  Genome-Wide Association Studies (GWAS)

                  • SimHap: A comprehensive modelling framework and a multiple-imputation approach to haplotypic analysis of population-based data.
                  • GenABEL: R library for Genome-wide association analysis.
                  • crlmm: Genotype Calling (CRLMM) and Copy Number Analysis tool for Affymetrix SNP 5.0 and 6.0 and Illumina arrays (Carvalho et al., 2009).
                  • GGtools: Software and data for genetical genomics.
                  • PLINK: Free, open-source whole genome association analysis toolset, designed to perform a range of basic, large-scale analyses in a computationally efficient manner.


                  BioConductor Exercises

                  Slide Show: Technology Overview (R Code)

                  Installation of R and required Bioconductor libraries

                  • Download a sample set of Affymetrix cel files
                  • Right-click this link and save its content to your computer: Workshop.zip. These sample CEL files are from the GEO data set: GSE5621
                  • After unpacking this file archive one should see six *.cel files.

                  • Generate RMA expression data, MAS5 P/M/A calls and export results to Excel
                  library(affy); mydata <- ReadAffy() # Reads cel files in current working directory into affybatch object 'mydata'.
                  library(ath1121501probe); print.data.frame(ath1121501probe[1:22,]) # Prints the probe sequences and their positions for the first two Affy IDs.
                  eset_rma <- rma(mydata) # Generates RMA expression values and stores them as ExpressionSet.
                  exprs(eset_rma)[1:4,] # Prints first 4 rows in data frame structure.
                  # exprs(eset_rma) <- 2^(exprs(eset_rma)) # If needed unlog RMA expression values.
                  mydf <- 2^exprs(eset_rma); myList <- tapply(colnames(mydf), c(1,1,2,2,3,3), list); names(myList) <- sapply(myList, paste, collapse="_"); mymean <- sapply(myList, function(x) rowMeans(mydf[,x])) # Generic approach for calculating mean values for any sample combination.
                  eset_pma <- mas5calls(mydata) # Generates MAS 5.0 P/M/A calls.
                  my_frame <- data.frame(exprs(eset_rma), exprs(eset_pma), assayDataElement(eset_pma, "se.exprs"))
                     # Combines the RMA intensities and P/M/A calls plus their Wilcoxon p-values in one data frame.
                  my_frame <- my_frame[, sort(names(my_frame))] # Sorts columns by cel file name.
                  write.table(my_frame, file="my_file.xls", sep="\t", col.names = NA) # Exports data to text file that can be imported into Excel.

                  • Add annotation information
                  library("ath1121501.db") # Loads required annotation package.
                  Annot <- data.frame(ACCNUM=sapply(contents(ath1121501ACCNUM), paste, collapse=", "), SYMBOL=sapply(contents(ath1121501SYMBOL), paste, collapse=", "), DESC=sapply(contents(ath1121501GENENAME), paste, collapse=", ")) # Constructs a data frame containing the gene IDs, gene symbols and descriptions for all probe sets on the chip.
                  all <- merge(Annot, my_frame, by.x=0, by.y=0, all=T) # Merges everything with above expression data.
                  write.table(all, file="my_annot_file.xls", sep="\t", col.names = NA) # Exports data to text file that can be imported into Excel.

                  • Visualization and quality control
                  d <- cor(2^exprs(eset_rma), method="pearson"); plot(hclust(dist(1-d))) # Generates a correlation matrix for all-against-all chip comparisons and plots the corresponding cluster tree.
                  library(affyQCReport); QCReport(mydata, file="ExampleQC.pdf") # Generates a comprehensive QC report for the AffyBatch object 'mydata' in PDF format. See affyQCReport for details.
                  eset <- eset_rma # Reassignment to work with following commands.

                  • Save this covdesc.txt to your R working directory.
                  • Generate expression data with RMA and MAS5.
                  • Filter each of the three data sets with the following parameters: 2-fold changes, present in all 4 chips and p-score less than 0.001.
                  • Write the results into separate files.
                  • Create scatter plots for the filtered data sets and save them to external image files.
                  • Identify the overlap of the significant changes between the RMA and MAS5 data.
                  • Perform simpleaffy QC checks: scaling factor, percent present calls, etc.

                  • Identify differentially expressed genes (DEGs) with the limma package
                  • Follow the instructions in the Limma Section for Affy Data.
                  • cDNA microarray users can save and extract the SWIRL cDNA microarray sample data. For a quick demonstration of the analysis of this data set, one can copy and paste or source the following command-line summary into the R terminal: my_swirl_commands.txt.

                  • Test gene sets for overrepresented GO terms using the GOHyperGAll script
                  source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/GOHyperGAll.txt") # Imports the GOHyperGAll functions.
                  loadData(); load(file="MF_node_affy_list"); load(file="BP_node_affy_list"); load(file="CC_node_affy_list")
                     # Loads the downloaded annotation objects for Arabidopsis.

                  GOHyperGAll_result <- GOHyperGAll(gocat="MF", sample=unique(as.vector(GO_MF_DF[1:40,2])), Nannot=2); GOHyperGAll_result[1:10,-8]
                     # Performs the enrichment test for provided set of gene identifiers.

                  CL_DF <- data.frame(geneID=GO_MF_DF[1:400,2], ClusterID=sort(rep(1:4,100)), ClusterSize=rep(100,400))
                     # Create sample data set for batch processing.

                  BatchResult <- GOCluster_Report(CL_DF=CL_DF, method="all", id_type="gene", CLSZ=10, cutoff=0.001, gocats=c("MF", "BP", "CC"), recordSpecGO=c("GO:0003674", "GO:0008150", "GO:0005575")); BatchResult[1:4,-10]
                     # Performs all three GO analyses for many sample sets at once. When the method argument is set to "slim", the goSlim method is used.

                  write.table(BatchResult, file="GO_Batch.xls", sep="\t", col.names = NA)
                     # Exports batch result to Excel file.
                  • Clustering of differentially expressed genes
                  my_fct <- function(x) hclust(x, method="complete") # Defines function to perform hierarchical clustering (complete linkage).
                  heatmap(as.matrix(2^exprs(eset)[1:40,]), col = cm.colors(256), hclustfun=my_fct) # Plots heatmap with dendrograms.
                     # In the last step, replace 'exprs(eset)[1:40,]' with the matrix of differentially expressed genes from the limma analysis.

                  • Convert the above commands into an R script and execute it with the source() function
                  source("myscript.R") # Executes all of the above commands and generates the corresponding output files in the current working directory.


                    Clustering and Data Mining in R

                    Introduction

                    [ Slide Show ]

                    R contains many functions and libraries for clustering of large data sets. A very useful overview of clustering utilities in R is available on the Cluster Task Page and for machine learning algorithms on the Machine Learning Task Page and the MLInterfaces package.

                    Data Preprocessing

                    Generate a sample data set

                    y <- matrix(rnorm(50), 10, 5, dimnames=list(paste("g", 1:10, sep=""), paste("t", 1:5, sep=""))) # Creates a sample data matrix.

                    Data centering and scaling

                    scale(t(y)); yscaled <- t(scale(t(y))); apply(yscaled, 1, sd)
                       # The function scale() centers and/or scales numeric matrices column-wise. When used with its default settings, the
                       # function returns columns that have a mean close to zero and a standard deviation of one. For row-wise scaling the
                       # data set needs to be transposed first using the t() function. Another transposing step is necessary to return the
                       # data set in its original orientation (same row/column assignment). Centering and scaling are common data transformation
                       # steps for many clustering techniques.
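                     The behavior described above can be checked directly on a small sample matrix:

```r
# Checks the scale() behavior described above: column-wise scaling returns
# columns with mean ~0 and standard deviation 1; double transposition applies
# the same transformation row-wise.
y <- matrix(rnorm(50), 10, 5, dimnames=list(paste("g", 1:10, sep=""), paste("t", 1:5, sep="")))
ys <- scale(y)                 # column-wise centering and scaling
round(colMeans(ys), 10)        # zero within numerical precision
apply(ys, 2, sd)               # all equal to one
yscaled <- t(scale(t(y)))      # row-wise equivalent via t()
apply(yscaled, 1, sd)          # row standard deviations are now one
```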

                      Obtain a distance matrix

                      dist(y, method = "euclidean")
                         # Computes and returns a distance matrix for the rows in the data matrix 'y' using the specified distance measure, here 'euclidean'.
                      c <- cor(t(y), method="spearman"); d <- as.dist(1-c); d
                          # To obtain correlation-based distances, one first has to compute a correlation matrix with the cor() function. Since this
                         # step iterates across matrix columns, a transpose 't()' step is required to obtain row distances. In a second step the distance
                         # matrix object of class "dist" can be obtained with the as.dist() function.
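                       A quick sanity check of this distance measure on constructed rows shows its range: perfectly correlated rows obtain a distance of 0 and perfectly anti-correlated rows a distance of 2:

```r
# Sanity check of the 1-cor distance: row 'b' is a linear transformation of 'a'
# (correlation 1, distance 0), row 'c' is the negative of 'a' (correlation -1,
# distance 2).
m <- rbind(a=1:5, b=2*(1:5)+3, c=-(1:5))
dm <- as.matrix(as.dist(1 - cor(t(m), method="pearson")))
round(dm, 2)  # dm["a","b"] is 0 and dm["a","c"] is 2
```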

                      Hierarchical Clustering (HC)

                       The basic hierarchical clustering functions in R are hclust, flashClust, agnes and diana. Hclust and agnes perform agglomerative hierarchical clustering, while diana performs divisive hierarchical clustering. flashClust is a highly speed-improved (50-100 times faster) version of hclust. The pvclust package can be used for assessing the uncertainty in hierarchical cluster analyses. It provides approximately unbiased p-values as well as bootstrap p-values. As an introduction to R's standard hierarchical clustering utilities one should read the help pages on the following functions: hclust, dendrogram, as.dendrogram, cutree and heatmap. An example of sub-clustering (subsetting) heatmaps based on selected tree nodes is given in the last part of this section (see zoom into heatmaps).

                      Clustering with hclust

                      y <- matrix(rnorm(50), 10, 5, dimnames=list(paste("g", 1:10, sep=""), paste("t", 1:5, sep=""))) # Creates a sample data set.
                      c <- cor(t(y), method="spearman"); d <- as.dist(1-c); hr <- hclust(d, method = "complete", members=NULL)
                         # This is a short example of clustering the rows of the generated sample matrix 'y' with hclust. Seven different cluster joining
                         # methods can be selected with the 'method' argument: ward, single, complete, average, mcquitty, median and centroid. The
                         # object returned by hclust is a list of class hclust which describes the tree generated by the clustering process with the
                         # following components: merge, height, order, labels, method, call and dist.method.

                      par(mfrow = c(2, 2)); plot(hr, hang = 0.1); plot(hr, hang = -1)
                         # The generated tree can be plotted with the plot() function. When the hang argument is set to '-1' then all leaves end on
                          # one line and their labels hang down from 0. More details on the plotting behavior are provided in the hclust help document (?hclust).

                      plot(as.dendrogram(hr), edgePar=list(col=3, lwd=4), horiz=T)
                         # To plot trees horizontally, the hclust tree has to be transformed into a dendrogram object.
                      unclass(hr) # Prints the full content of the hclust object.
                      str(as.dendrogram(hr)) # Prints dendrogram structure as text.
                      hr$labels[hr$order] # Prints the row labels in the order they appear in the tree.
                      par(mfrow=c(2,1)); hrd1 <- as.dendrogram(hr); plot(hrd1); hrd2 <- reorder(hrd1, sample(1:10)); plot(hrd2); labels(hrd1); labels(hrd2)
                         # Example to reorder a dendrogram and print out its labels.
                      library(ctc); hc2Newick(hr)
                         # The 'hc2Newick' function of the Bioc 'ctc' library can convert a hclust object into the Newick tree file format for export
                         # into external programs.


                      Hierarchical Clustering Dendrogram Generated by hclust

                      Tree subsetting (see also Dynamic Tree Cut package)

                      mycl <- cutree(hr, h=max(hr$height)/2); mycl[hr$labels[hr$order]]
                          # Cuts the tree into discrete cluster groups (bins) at the height specified by the 'h' argument. Alternatively,
                          # one can cut based on a desired number of clusters, specified with the 'k' argument.
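                       The difference between the two cutting modes can be illustrated on a small toy tree (sample data only):

```r
# Toy illustration of the two cutree() modes: cutting at a given height 'h'
# versus requesting a fixed number of clusters with 'k'.
set.seed(3)
y <- matrix(rnorm(50), 10, 5, dimnames=list(paste("g", 1:10, sep=""), paste("t", 1:5, sep="")))
hr <- hclust(dist(y), method="complete")
cutree(hr, h=max(hr$height)/2)   # bins defined by the height cutoff
cutree(hr, k=3)                  # exactly three bins
```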

                      plot(hr); rect.hclust(hr, k=5, border="red") # Cuts dendrogram at specified level and draws rectangles around the resulting clusters.
                      subcl <- names(mycl[mycl==3]); subd <- as.dist(as.matrix(d)[subcl,subcl]); subhr <- hclust(subd, method = "complete"); par(mfrow = c(1, 2)); plot(hr); plot(subhr)
                         # This example shows how one can subset a tree by cutting it at a given branch point provided by the cutree() function.
                         # The identified row IDs are then used to subset the distance matrix and re-cluster it with hclust.

                      lapply(as.dendrogram(hr), unlist); lapply(as.dendrogram(hr)[[1]], unlist)
                          # Returns the indices of the cluster members of the two main branches. The second step returns the members of the next two
                          # descendant clusters of the first main branch.

                      example(hclust)
                         # The last step in this example shows how the 'members' argument can be used to reconstruct a hierarchical
                         # tree above the cut points using the cluster centers. More details on this can be found in the hclust help document.

                      source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/selectNodes.R"); Uniclusters <- nodes2Cluster(myhclust=hr, rootdist=2); clusterQCselect <- clusterQC(clusterlist=Uniclusters, simMatrix=as.matrix(c), method="min", cutoff=0.7, output="df"); clusterQCselect
                         # The function 'nodes2Cluster' cuts the tree at all height levels as defined in 'hr$height' and the second function 'clusterQC'
                         # selects clusters based on a given minimum cutoff in an all-against-all comparison within each cluster. The 'method' argument
                          # allows one to calculate the cutoff value with one of the following functions: "min", "qu1", "median", "mean", "qu3" and "max". The
                         # end result can be returned as list (output="list") or as data frame (output="df"). The argument 'rootdist' defines the starting
                         # distance from the root of the tree to avoid testing of very large clusters.


                      Tree coloring and zooming into branches

                      source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/dendroCol.R") # Import tree coloring function.
                      dend_colored <- dendrapply(as.dendrogram(hr), dendroCol, keys=subcl, xPar="edgePar", bgr="red", fgr="blue", lwd=2, pch=20)
                         # In this example the dendrogram for the above 'hr' object is colored with the imported 'dendroCol()' function based on
                         # the identifiers provided in its 'keys' argument. If 'xPar' is set to 'nodePar' then the labels are colored instead of
                         # the leaves.

                      par(mfrow = c(1, 3)); plot(dend_colored, horiz=T); plot(dend_colored, horiz=T, type="tr")
                      plot(dend_colored, horiz=T, edgePar=list(lwd=2), xlim=c(3,0), ylim=c(1,3))
                          # Plots the colored tree in different formats. The last command shows how one can zoom into the tree with the 'xlim' and 'ylim'
                         # arguments, which is possible since R 2.8.

                      z <- as.dendrogram(hr); attr(z[[2]][[2]],"edgePar") <- list(col="blue", lwd=4, pch=NA); plot(z, horiz=T)
                         # This example shows how one can manually color tree elements.

                      Dendrogram/Tree Coloring Examples with Zooming into Trees

                      Plot heatmaps with the heatmap() function

                      y <- matrix(rnorm(50), 10, 5, dimnames=list(paste("g", 1:10, sep=""), paste("t", 1:5, sep=""))) # Creates a sample data set.
                      source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/my.colorFct.R")
                         # Imports color function to obtain heat maps in red and green. Use this color function in heatmap color argument like
                         # this: 'col=my.colorFct()'.

                      hr <- hclust(as.dist(1-cor(t(y), method="pearson")), method="complete") # Clusters rows by Pearson correlation as distance method.
                      hc <- hclust(as.dist(1-cor(y, method="spearman")), method="complete") # Clusters columns by Spearman correlation as distance method.
                      heatmap(y, Rowv=as.dendrogram(hr), Colv=as.dendrogram(hc), col=my.colorFct(), scale="row")
                          # The previous two hclust objects are used to obtain a heatmap where the rows are clustered by Pearson and the columns by
                          # Spearman correlation distances. The 'scale' argument allows one to scale the input matrix by rows or columns. The default is
                         # by rows and "none" turns the scaling off.

                      mycl <- cutree(hr, h=max(hr$height)/1.5); mycol <- sample(rainbow(256)); mycol <- mycol[as.vector(mycl)]; heatmap(y, Rowv=as.dendrogram(hr), Colv=as.dendrogram(hc), col=my.colorFct(), scale="row", ColSideColors=heat.colors(length(hc$labels)), RowSideColors=mycol)
                          # The arguments 'ColSideColors' and 'RowSideColors' allow one to annotate a heatmap with a color bar. In this example the row
                         # colors correspond to cluster bins that were obtained by the cutree() function.

                      dend_colored <- dendrapply(as.dendrogram(hr), dendroCol, keys=c("g1", "g2"), xPar="edgePar", bgr="black", fgr="red", pch=20); heatmap(y, Rowv=dend_colored, Colv=as.dendrogram(hc), col=my.colorFct(), scale="row", ColSideColors=heat.colors(length(hc$labels)), RowSideColors=mycol) 
                         # In addition, one can color the tree leaves or labels by providing a colored dendrogram object, here 'dend_colored' from above.
                      heatmap(y, Rowv=as.dendrogram(hr), Colv=as.dendrogram(hc), col=my.colorFct(), scale="row", labRow = "", labCol="")
                          # The arguments 'labRow' and 'labCol' allow one to provide custom labels or, as shown here, to omit printing the labels.
                      heatmap(y, Rowv=as.dendrogram(hr), Colv=NA, col=my.colorFct())
                          # If 'Rowv' or 'Colv' is set to 'NA' then the row and/or the column clustering is turned off.
                      ysort <- y[hr$labels[hr$order], hc$labels[hc$order]]; heatmap(ysort, Rowv=NA, Colv=NA, col=my.colorFct())
                         # This example shows how the rows and columns of a heatmap can be sorted by the hclust trees without printing them.
                      x11(height=4, width=4); heatmap(matrix(rep(seq(min(as.vector(y)), max(as.vector(y)), length.out=10),2), 2, byrow=T, dimnames=list(1:2, round(seq(min(as.vector(y)), max(as.vector(y)), length.out=10),1))), col=my.colorFct(), Colv=NA, Rowv=NA, labRow="", main="Color Legend")
                         # Prints color legend in a separate window.


                      Plot heatmaps with the image() or heatmap.2() functions

                      image(scale(t(ysort)), col=my.colorFct())
                         # A very similar plot as before can be obtained by the image() function which plots the values of a given matrix as a grid
                         # of colored rectangles.

                      x11(height=12); par(mfrow=c(1,2)); plot(as.dendrogram(hr), horiz=T, yaxt="n", edgePar=list(col="grey", lwd=4)); image(scale(t(ysort)), col=my.colorFct(), xaxt="n", yaxt="n")
                          # To create very large heatmaps that are still readable, one can plot the dendrogram next to the heatmap generated by the image() function.
                      library("gplots") # Loads the gplots library that contains the heatmap.2() function.
                      heatmap.2(y, Rowv=as.dendrogram(hr), Colv=as.dendrogram(hc), col=redgreen(75), scale="row", ColSideColors=heat.colors(length(hc$labels)), RowSideColors=mycol, trace="none", key=T, cellnote=round(t(scale(t(y))),1))
                         # The heatmap.2() function contains many additional plotting features that are not available in the basic heatmap() function.
                          # For instance, the 'key' argument allows one to add a color key within the same plot. Numeric values can be added to each cell
                         # with the 'cellnote' argument. In addition, heatmap.2() scales to very large formats for viewing complex heatmaps.

                      library(RColorBrewer); mypalette <- rev(brewer.pal(9,"Blues")[-1])
                          # Select an alternative color scheme that is more appropriate for representing continuous values (e.g. chip intensities).
                      heatmap.2(y, Rowv=as.dendrogram(hr), dendrogram="row", Colv=F, col=mypalette, scale="row", trace="none", key=T, cellnote=round(t(scale(t(y))),1)) # Example to illustrate how to turn off the row or column reordering in heatmap.2. The settings "dendrogram="row", Colv=F" turn off the column reordering and no column tree is shown. To turn off the re-ordering of both dimensions, use the following settings: "dendrogram="none", Rowv=F, Colv=F".


                      Zooming into heatmaps by sub-clustering selected tree nodes

                      y <- matrix(rnorm(500), 100, 5, dimnames=list(paste("g", 1:100, sep=""), paste("t", 1:5, sep=""))) # Creates a sample data set.
                      hr <- hclust(as.dist(1-cor(t(y), method="pearson")), method="complete"); hc <- hclust(as.dist(1-cor(y, method="spearman")), method="complete") 
                         # Generates row and column dendrograms.
                      mycl <- cutree(hr, h=max(hr$height)/1.5); mycolhc <- rainbow(length(unique(mycl)), start=0.1, end=0.9); mycolhc <- mycolhc[as.vector(mycl)]
                         # Cuts the tree and creates color vector for clusters.
                      library(gplots); myheatcol <- redgreen(75)
                         # Assign your favorite heatmap color scheme. Some useful examples: colorpanel(40, "darkblue", "yellow", "white");
                         # heat.colors(75); cm.colors(75); rainbow(75); redgreen(75); library(RColorBrewer); rev(brewer.pal(9,"Blues")[-1]).
                         # Type demo.col(20) to see more color schemes.

                      heatmap.2(y, Rowv=as.dendrogram(hr), Colv=as.dendrogram(hc), col=myheatcol, scale="row", density.info="none", trace="none", RowSideColors=mycolhc)
                         # Creates heatmap for entire data set where the obtained clusters are indicated in the color bar.
                      x11(height=6, width=2); names(mycolhc) <- names(mycl); barplot(rep(10, max(mycl)), col=unique(mycolhc[hr$labels[hr$order]]), horiz=T, names=unique(mycl[hr$order]))
                         # Prints color key for cluster assignments. The numbers next to the color boxes correspond to the cluster numbers in 'mycl'.
                      clid <- c(1,2); ysub <- y[names(mycl[mycl%in%clid]),]; hrsub <- hclust(as.dist(1-cor(t(ysub), method="pearson")), method="complete")
                         # Select sub-cluster number (here: clid=c(1,2)) and generate corresponding dendrogram.
                      x11(); heatmap.2(ysub, Rowv=as.dendrogram(hrsub), Colv=as.dendrogram(hc), col=myheatcol, scale="row", density.info="none", trace="none", RowSideColors=mycolhc[mycl%in%clid])
                         # Create heatmap for chosen sub-cluster.
                      data.frame(Labels=rev(hrsub$labels[hrsub$order]))
                         # Prints the row labels in the same order as shown in the heatmap.


                      Row and Column Trees Plotted with Heatmap of Source Data

                        Bootstrap Analysis in Hierarchical Clustering 

                        The pvclust package allows one to assess the uncertainty in hierarchical cluster analyses by calculating p-values for each cluster via multiscale bootstrap resampling. The method provides two types of p-values. The approximately unbiased p-value (AU), computed by multiscale bootstrap resampling, is less biased than the second one, the bootstrap probability (BP), which is computed by ordinary bootstrap resampling.

                        library(pvclust) # Loads the required pvclust package.
                        y <- matrix(rnorm(500), 50, 10, dimnames=list(paste("g", 1:50, sep=""), paste("t", 1:10, sep=""))) # Creates a sample data set.
                        pv <- pvclust(scale(t(y)), method.dist="correlation", method.hclust="complete", nboot=10)
                           # This step performs the hierarchical cluster analysis with multiscale bootstrap with 10 repetitions, complete linkage for
                           # cluster joining and a correlation-based dissimilarity matrix. Usually, one should set 'nboot' to at least 1000 bootstrap
                           # repetitions.

                        plot(pv, hang=-1) # Plots a dendrogram where the red numbers represent the AU p-values and the green ones the BP values.
                        pvrect(pv, alpha=0.95) # This command highlights with red rectangles all clusters in the dendrogram that have an AU value above 95%.
                        clsig <- unlist(pvpick(pv, alpha=0.95, pv="au", type="geq", max.only=TRUE)$clusters)
                           # The function 'pvpick' allows one to pick significant clusters.
                        source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/dendroCol.R")
                           # Import tree coloring function. The details on this function are provided in the hclust section of this manual.
                        dend_colored <- dendrapply(as.dendrogram(pv$hclust), dendroCol, keys=clsig, xPar="edgePar", bgr="black", fgr="red", pch=20)
                           # Creates dendrogram object where the significant clusters are labeled in red.
                        plot(dend_colored, horiz=T) # Plots a dendrogram where the significant clusters are highlighted in red.

                        QT Clustering 

                        QT (quality threshold) clustering is a partitioning method that forms clusters based on a maximum cluster diameter. It iteratively identifies the largest cluster below the threshold and removes its items from the data set until all items are assigned. The method was developed by Heyer et al. (1999) for the clustering of gene expression data.
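                        The iterative procedure described above can be sketched in a few lines of base R. This toy implementation (the function name 'qtSketch' is made up; use flexclust's qtclust() below for real work) forms for every remaining point the candidate cluster of all points within the distance threshold, keeps the largest candidate, removes its members and repeats:

```r
# Toy QT-style clustering sketch ('qtSketch' is a hypothetical name; use
# flexclust::qtclust() in practice). In each round the largest candidate
# cluster below the distance threshold is kept and its members are removed.
qtSketch <- function(x, radius) {
    d <- as.matrix(dist(x))                 # pairwise Euclidean distances
    cl <- rep(NA, nrow(x)); k <- 0
    remaining <- seq_len(nrow(x))
    while (length(remaining) > 0) {
        cand <- lapply(remaining, function(i) remaining[d[i, remaining] <= radius])
        best <- cand[[which.max(sapply(cand, length))]]
        k <- k + 1; cl[best] <- k
        remaining <- setdiff(remaining, best)
    }
    cl
}
set.seed(1)
x <- matrix(10*runif(200), ncol=2)
cl <- qtSketch(x, radius=3)
table(cl)  # cluster sizes, largest cluster first
```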

                        library(flexclust) # Loads flexclust library.
                        x <- matrix(10*runif(1000), ncol=2) # Creates a sample data set.
                        cl1 <- qtclust(x, radius=3) # Uses 3 as the maximum distance of the points to the cluster centers.
                        cl2 <- qtclust(x, radius=1) # Uses 1 as the maximum distance of the points to the cluster centers.
                        par(mfrow=c(2,1)); plot(x, col=predict(cl1), xlab="", ylab=""); plot(x, col=predict(cl2), xlab="", ylab="") # Plots the two cluster results.
                        data.frame(name_index=1:nrow(x), cl=attributes(cl1)$cluster) # Provides the row-to-cluster assignment information.

                        qtclust Clustering Result

                        K-Means & PAM 

                        K-means, PAM (partitioning around medoids) and clara are related partitioning algorithms that cluster data points into a predefined number of K clusters. They do this by assigning each data point to its nearest centroid and then recomputing the cluster centroids. In the next step the data points are reassigned to the nearest adjusted centroid. This procedure continues until the cluster assignments are stable. K-means uses the average of all points in a cluster as its center, while PAM uses the most centrally located point. Commonly used R functions for K-means clustering are kmeans() of the stats package, kcca() of the flexclust package and trimkmeans() of the trimcluster package. PAM clustering is available in the pam() function of the cluster package. The clara() function of the same package is a PAM wrapper for clustering very large data sets.
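                        The assign-and-recompute iteration described above can be sketched in base R. This is a didactic Lloyd-style version with a hypothetical name 'kmeansSketch', not the optimized algorithm used by kmeans():

```r
# Didactic sketch of the K-means iteration described above (use kmeans() in
# practice): assign each point to the nearest centroid, recompute the centroids
# as cluster means, and repeat until the assignments no longer change.
kmeansSketch <- function(x, k, iter.max=100) {
    centers <- x[sample(nrow(x), k), , drop=FALSE]  # random initial centroids
    cl <- rep(0, nrow(x))
    for (iter in seq_len(iter.max)) {
        d <- sapply(seq_len(k), function(j)
            rowSums((x - matrix(centers[j, ], nrow(x), ncol(x), byrow=TRUE))^2))
        newcl <- max.col(-d)                        # index of the nearest centroid
        if (all(newcl == cl)) break                 # assignments stable: converged
        cl <- newcl
        for (j in seq_len(k))
            if (any(cl == j)) centers[j, ] <- colMeans(x[cl == j, , drop=FALSE])
    }
    list(cluster=cl, centers=centers)
}
set.seed(7)
x <- rbind(matrix(rnorm(100), ncol=2), matrix(rnorm(100, mean=5), ncol=2))
km <- kmeansSketch(x, k=2)
table(km$cluster)  # two clusters separating the two simulated groups
```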

                        K-means

                        km <- kmeans(t(scale(t(y))), 3)
                           # K-means clustering is a partitioning method where the number of clusters is pre-defined. The basic R implementation
                           # requires a data matrix as input and uses the Euclidean distance. In contrast to pam(), the implementation does not
                           # allow the user to provide a distance matrix. In the given example the row-wise scaled data matrix is provided.


                        km$cluster
                        g1  g2  g3  g4  g5  g6  g7  g8  g9 g10 g11 g12 g13 g14 g15 g16 g17 g18 g19 g20 g21 g22 g23 g24 g25 g26 g27 g28 g29 g30 g31 g32
                         1   2   2   1   3   3   3   1   3   1   2   3   1   2   1   2   3   2   3   3   1   1   3   3   3   3   1   3   3   3   2   2
                        g33 g34 g35 g36 g37 g38 g39 g40 g41 g42 g43 g44 g45 g46 g47 g48 g49 g50
                          2   1   1   1   2   1   2   3   3   3   2   1   3   3   1   3   2   1


                        mycol <- sample(rainbow(256)); mycol <- mycol[as.vector(km$cluster)]; heatmap(y, Rowv=as.dendrogram(hr), Colv=as.dendrogram(hc), col=my.colorFct(), scale="row", ColSideColors=heat.colors(length(hc$labels)), RowSideColors=mycol)
                           # This example shows how the obtained K-means clusters can be compared with the hierarchical clustering results by highlighting
                           # them in the heatmap color bar.


                          PAM (partitioning around medoids)

                          library(cluster) # Loads the required cluster library.
                          y <- matrix(rnorm(50), 10, 5, dimnames=list(paste("g", 1:10, sep=""), paste("t", 1:5, sep=""))) # Creates a sample data set.
                          pamy <- pam(y, 4); pamy$clustering # Clusters data into 4 clusters using default Euclidean as distance method.
                           g1  g2  g3  g4  g5  g6  g7  g8  g9 g10
                            1   2   3   1   4   2   1   1   2   3


                          mydist <- as.dist(1-cor(t(y), method="pearson")) # Generates distance matrix using Pearson correlation as distance method.
                          pamy <- pam(mydist, 4); pamy$clustering # Same as above, but uses the provided distance matrix.
                           g1  g2  g3  g4  g5  g6  g7  g8  g9 g10
                            1   2   3   1   3   2   4   1   4   3


                          pam(mydist, 4, cluster.only=TRUE) # Same as before, but with faster computation.
                          plot(pamy) # Generates cluster plot.

                          Clara (clustering large applications: PAM method for large data sets)

                          clarax <- clara(t(scale(t(y))), 4)
                             # Clusters the above data into 4 clusters using the default Euclidean distance method. If the data set consists of gene expression
                             # values, then row-wise scaling of y is recommended in combination with Euclidean distances.


                          clarax$clustering
                           g1  g2  g3  g4  g5  g6  g7  g8  g9 g10
                            1   2   3   1   3   2   4   1   2   3


                          Fuzzy Clustering 

                          In contrast to strict/hard clustering approaches, fuzzy clustering allows multiple cluster memberships for the clustered items. This is commonly achieved by assigning to each item positive membership weights that sum to one across the clusters. Several R libraries contain implementations of fuzzy clustering algorithms. The e1071 library contains the cmeans (fuzzy C-means) and cshell (fuzzy C-shell) clustering functions, and the cluster library provides the fanny function, a fuzzy implementation of the k-medoids (PAM) method described above.

                          Fuzzy clustering with the cluster library

                          library(cluster) # Loads the cluster library.
                          y <- matrix(rnorm(50), 10, 5, dimnames=list(paste("g", 1:10, sep=""), paste("t", 1:5, sep=""))) # Creates a sample data set.
                          fannyy <- fanny(y, k=4, metric = "euclidean", memb.exp = 1.5)
                             # Computes membership coefficients (weights) for 4 clusters and stores them in the 'membership' component of the resulting
                             # fanny object. The corresponding hard clustering result is provided in the 'clustering' slot, which is formed by assigning
                             # each point to the cluster with the highest coefficient. If the distance method is set to metric="SqEuclidean", then the
                             # function performs fuzzy C-means clustering. The membership exponent can be controlled with the argument 'memb.exp'
                             # (default is 2). Values close to 1 give 'crisper' clustering, whereas larger values increase the fuzziness of the results.
                             # If a clustering result is completely fuzzy, then the function returns a warning.


                          round(fannyy$membership, 2)
                              [,1] [,2] [,3] [,4]
                          g1  0.80 0.07 0.08 0.04
                          g2  0.09 0.80 0.08 0.03
                          g3  0.92 0.02 0.04 0.01
                          g4  0.76 0.09 0.10 0.04
                          g5  0.10 0.04 0.77 0.09
                          g6  0.00 0.00 0.01 0.98
                          g7  0.12 0.06 0.65 0.17
                          g8  0.13 0.16 0.29 0.41
                          g9  0.07 0.03 0.85 0.05
                          g10 0.03 0.94 0.02 0.01

                          fannyy$clustering 
                          g1  g2  g3  g4  g5  g6  g7  g8  g9 g10
                            1   2   1   1   3   4   3   4   3   2



                          mydist <- as.dist(1-cor(t(y), method="pearson")) # Generates distance matrix using Pearson correlation as distance method.
                          fannyy <- fanny(mydist, k=4, memb.exp = 1.5); round(fannyy$membership, 2); fannyy$clustering
                             # Example using a correlation-based distance matrix.

                          fannyyMA <- round(fannyy$membership, 2) > 0.10; apply(fannyyMA, 1, which)
                             # Returns multiple cluster memberships for coefficients above a certain value (here >0.1).

                          $g1
                          [1] 1 4

                          $g2
                          [1] 2

                          $g3
                          [1] 1

                          $g4
                          [1] 3

                          $g5
                          [1] 4

                          $g6
                          [1] 1

                          $g7
                          [1] 4

                          $g8
                          [1] 2

                          $g9
                          [1] 3 4

                          $g10
                          [1] 3

                          Fuzzy clustering with the e1071 library

                          library(e1071) # Loads the e1071 library.
                          y <- matrix(rnorm(50), 10, 5, dimnames=list(paste("g", 1:10, sep=""), paste("t", 1:5, sep=""))) # Creates a sample data set.
                          cmeansy <- cmeans(y, 4, method="cmeans", m=1.5); round(cmeansy$membership, 2); cmeansy$cluster
                             # Uses the c-means clustering method to compute membership coefficients (weights) for 4 clusters and stores them in
                             # the 'membership' component of the resulting object. The result of this example should be identical to the above
                             # clustering with the fanny function when its metric argument is set to "SqEuclidean". The cmeans function does not
                             # accept a distance matrix as input, but the fanny function does.

                          cmeansyMA <- round(cmeansy$membership, 2) > 0.10; apply(cmeansyMA, 1, which) # Returns multiple cluster memberships for coefficients above a certain value (here >0.1).
                          cshelly <- cshell(y, 4, method="cshell", m=1.5); round(cshelly$membership, 2); cshelly$cluster
                             # Uses the c-shell clustering method to compute membership coefficients (weights) for 4 clusters and stores them in
                             # the 'membership' component of the resulting object.

                          Self-Organizing Map (SOM) 

                          The self-organizing map (SOM), also known as a Kohonen network, is a popular artificial neural network algorithm for unsupervised learning. The approach iteratively assigns all items in a data matrix to a specified number of representatives and then updates each representative by the mean of its assigned data points. Widely used R packages for SOM clustering and visualization are: class (part of R), SOM and kohonen. The SOM package, which is introduced here, provides utilities similar to those of the GeneCluster software from the Broad Institute.

                          library(som) # Loads the required SOM library.
                          y <- matrix(rnorm(50), 10, 5, dimnames=list(paste("g", 1:10, sep=""), paste("t", 1:5, sep=""))) # Creates a sample data set.
                          y <- t(scale(t(y))) # Normalizes the data so that each row has a mean close to zero and a standard deviation of one.
                          y.som <- som(y, xdim = 5, ydim = 6, topol = "hexa", neigh = "gaussian") # Performs SOM clustering.

                          plot(y.som) # Plots SOM clustering result.
                          y.som$visual # Provides the assignment of the row items to the SOM clusters. Remember that the coordinate counting starts here at zero!

                             x y    qerror
                          1  3 0 0.6037758
                          2  4 3 0.9750106
                          3  2 2 1.0911974
                          4  0 2 0.7000283
                          5  2 2 1.2006966
                          6  0 3 0.5653868
                          7  1 0 0.7291410
                          8  3 4 1.5029060
                          9  2 3 1.5038304
                          10 2 4 0.4960032

                          SOM Plot

                          Principal Component Analysis (PCA) 

                          Principal component analysis (PCA) is a data reduction technique that simplifies multidimensional data sets to 2 or 3 dimensions for plotting purposes and visual variance analysis. The following commands introduce the basic usage of the prcomp() function. A closely related function is princomp(). The Bioconductor package pcaMethods provides many additional PCA functions. For viewing PCA plots in 3D, one can use the scatterplot3d library or the made4 library.

                          z1 <- rnorm(10000, mean=1, sd=1); z2 <- rnorm(10000, mean=3, sd=3); z3 <- rnorm(10000, mean=5, sd=5); z4 <- rnorm(10000, mean=7, sd=7); z5 <- rnorm(10000, mean=9, sd=9); mydata <- matrix(c(z1, z2, z3, z4, z5), 2500, 20, byrow=T, dimnames=list(paste("R", 1:2500, sep=""), paste("C", 1:20, sep="")))
                             # Generates sample matrix of five discrete clusters that have very different mean and standard deviation values.

                          pca <- prcomp(mydata, scale=T)
                             # Performs principal component analysis after scaling the data. It returns a list with class "prcomp" that contains five
                             # components: (1) the standard deviations (sdev) of the principal components, (2) the matrix of eigenvectors (rotation),
                             # (3) the principal component data (x), (4) the centering (center) and (5) scaling (scale) used.
                          summary(pca) # Prints variance summary for all principal components.

                          summary(pca)$importance[, 1:6] # Accesses subset of components.
                                                      PC1       PC2       PC3       PC4       PC5       PC6
                          Standard deviation     2.181627 0.9806665 0.9785759 0.9581832 0.9483988 0.9419336
                          Proportion of Variance 0.237970 0.0480900 0.0478800 0.0459100 0.0449700 0.0443600
                          Cumulative Proportion  0.237970 0.2860600 0.3339400 0.3798500 0.4248200 0.4691800


                          x11(height=6, width=12, pointsize=12); par(mfrow=c(1,2)) # Set plotting parameters.
                          mycolors <- c("red", "green", "blue", "magenta", "black") # Define plotting colors.
                          plot(pca$x, pch=20, col=mycolors[sort(rep(1:5, 500))])
                             # Plots scatter plot for the first two principal components that are stored in pca$x[,1:2].

                          plot(pca$x, type="n"); text(pca$x, rownames(pca$x), cex=0.8, col=mycolors[sort(rep(1:5, 500))])
                             # Same as above, but prints labels.
                          library(geneplotter); smoothScatter(pca$x) # Same as above, but generates a smooth scatter plot that shows the density of the data points.
                          pairs(pca$x[,1:4], pch=20, col=mycolors[sort(rep(1:5, 500))])
                             # Plots scatter plots for all combinations between the first four principal components.
                          biplot(pca)
                             # Plots a scatter plot for the first two principal components plus the corresponding eigenvectors that are stored in pca$rotation.
                          library(scatterplot3d) # Loads library scatterplot3d.
                          scatterplot3d(pca$x[,1:3], pch=20, color=mycolors[sort(rep(1:5, 500))])
                             # Same as above, but plots the first three principal components in 3D scatter plot.
                          library(rgl); rgl.open(); offset <- 50; par3d(windowRect=c(offset, offset, 640+offset, 640+offset)); rm(offset); rgl.clear(); rgl.viewpoint(theta=45, phi=30, fov=60, zoom=1); spheres3d(pca$x[,1], pca$x[,2], pca$x[,3], radius=0.3, color=mycolors, alpha=1, shininess=20); aspect3d(1, 1, 1); axes3d(col='black'); title3d("", "", "PC1", "PC2", "PC3", col='black'); bg3d("white")
                             # Creates an interactive 3D scatter plot with OpenGL. The rgl library needs to be installed for this. To save a snapshot
                             # of the graph, one can use the command rgl.snapshot("test.png").

                          Scatter Plots for First Four Principal Components: pairs(pca$x[,1:4])

                          Multidimensional Scaling (MDS) 

                          Multidimensional scaling (MDS) algorithms start with a matrix of item-item distances and then assign coordinates for each item in a low-dimensional space to represent the distances graphically. The base R function for MDS is cmdscale(). Additional MDS functions are sammon() and isoMDS() from the MASS library.

                          loc <- cmdscale(eurodist) # Performs MDS analysis on the geographic distances between European cities.
                          plot(loc[,1], -loc[,2], type="n", xlab="", ylab="", main="cmdscale(eurodist)")
                          text(loc[,1], -loc[,2], rownames(loc), cex=0.8)
                             # Plots the MDS results in 2D plot. The minus is required in this example to flip the plotting orientation.

                          Map of Europe Generated by MDS Clustering of City-to-City Distances

                          Bicluster Analysis 

                          Biclustering (also co-clustering or two-mode clustering) is an unsupervised clustering technique that allows simultaneous clustering of the rows and columns of a matrix. The goal of biclustering is to find subgroups of rows and columns that are as similar as possible to each other and as different as possible from the remaining data points. The biclust package, which is introduced here, contains a collection of bicluster algorithms, data preprocessing and visualization methods (Detailed User Manual). An algorithm that allows the integration of different data types is implemented in the cMonkey R program (BMC Bioinformatics 2006, 7, 280). A comparison of several bicluster algorithms for clustering gene expression data has been published by Prelic et al. (2006). Since most biclustering algorithms expect the input data matrix to be properly preprocessed, it is especially important to carefully read the manual pages of the different functions.


                          Plaid model biclustering

                          library(biclust) # Loads the biclust library.
                          data(BicatYeast) # Loads sample data from Barkow et al. (Bioinformatics, 22, 1282-1283).
                          res1 <- biclust(BicatYeast, method=BCPlaid())
                             # Performs plaid model biclustering as described in Turner et al., 2003. This algorithm models the data matrix as a sum of
                             # layers; the model is fitted to the data by minimizing the error.

                          summary(res1); show(res1)
                             # The summary() function returns the sizes of all clusters found and the show() function provides an overview of the
                             # method applied.

                          names(attributes(res1))
                             # Converts res1 object of class Biclust into a list and returns the names of its components (slots). The results for the
                             # row clustering are stored in the "RowxNumber" slot and the results for the column clustering are stored in the "NumberxCol" slot.

                          BicatYeast[res1@RowxNumber[,2], res1@NumberxCol[2,]] # Extracts values for bicluster number 2 from original data set.
                          myrows <- attributes(res1)$RowxNumber; mycols <- attributes(res1)$NumberxCol; myclusters <- lapply(1:length(myrows[1,]), function(x) list(rownames(BicatYeast[myrows[,x], ]), colnames(BicatYeast[, mycols[x,]]))); names(myclusters) <- paste("CL", 1:length(myclusters), sep="_"); myclusters[2]
                             # Organizes all identified row and column clusters in a list object, here 'myclusters'.
                          writeBiclusterResults("results.txt", res1, "My Result", dimnames(BicatYeast)[1][[1]], dimnames(BicatYeast)[2][[1]])
                             # Writes bicluster results to a file.
                          bubbleplot(BicatYeast, res1, showLabels=T)
                             # Draws a bubble plot where each bicluster is represented as a circle (bubble). Up to three biclustering result sets can be
                             # compared in one plot (here one). The sets are then labeled with different colors. The color brightness represents the
                             # bicluster homogeneity (darker, less homogeneous). The bubble sizes represent the size of the biclusters, as
                             # (number of rows)x(number of columns). The bubble location is a 2D-projection of row and column profiles.

                          parallelCoordinates(x=BicatYeast, bicResult=res1, number=4)
                             # Plots the profiles for the 4th cluster. The cluster number is specified under the number argument.
                          drawHeatmap(BicatYeast, res1, number=4)
                             # Draws a heatmap for the BicatYeast data matrix with the rows and columns of the selected 4th bicluster at the top-left of plot.

                            XMotifs biclustering

                            y <- discretize(BicatYeast, nof=10) # Converts input matrix with continuous data into matrix with discrete values of 10 levels (nof).
                            res2 <- biclust(y, method=BCXmotifs(), alpha=0.05, number=50)
                               # Performs XMotifs biclustering based on the framework by Murali and Kasif (2003). It searches for rows with constant values
                               # over a set of columns, which results in a submatrix where all rows of the discretized matrix have the same value profile.
                               # Usually the algorithm needs a discrete matrix to perform well.

                            bubbleplot(BicatYeast, res1, res2, showLabels=F) # Compares the two bicluster results in form of a bubble plot.
                            jaccardind(res1, res2)
                               # Calculates the similarity for two clustering results using an adaptation of the Jaccard index (values from 0-1). An overview
                               # of related methods is available on this cluster validity algorithm page.

                            Bimax biclustering

                            y <- BicatYeast; y[y >= 1 | y <= -1] <- 1; y[y != 1] <- 0
                               # Transforms input data matrix into binary values. Here the log2 ratios with at least a two-fold change are set to 1 and
                               # the rest is set to 0.

                            res3 <- biclust(y, method=BCBimax(), minr=15, minc=10, number=100)
                               # Performs Bimax biclustering based on an algorithm described by Prelic et al (2006). This method searches in a binary
                               # matrix for sub-matrices containing only ones.

                            Spectral biclustering

                            res4 <- biclust(BicatYeast, method=BCSpectral(), numberOfEigenvalues=2)
                               # Performs spectral biclustering as described in Kluger et al., 2003. Spectral biclustering assumes that normalized microarray
                               # data matrices have a checkerboard structure that can be discovered by applying singular value decomposition (SVD) to the data.

                            CC biclustering

                            res5 <- biclust(y, method=BCCC(), delta=1.5, alpha=1, number=50)
                               # Performs CC biclustering based on the framework by Cheng and Church (2000). Searches for submatrices with a score lower than
                               # a specific threshold in a standardized data matrix.


                            Network Analysis

                            • WGCNA: an R package for weighted correlation network analysis
                            • BioNet: routines for the functional analysis of biological networks


                            Support Vector Machines (SVM) 

                            Support vector machines (SVMs) are supervised machine learning methods for classification. SVM implementations are available in the R packages kernlab and e1071. The e1071 package contains an interface to the C++ libsvm implementation from Chih-Chung Chang and Chih-Jen Lin. In addition to SVMs, the e1071 package includes a comprehensive set of machine learning utilities, such as functions for latent class analysis, bootstrapping, short-time Fourier transform, fuzzy clustering, shortest path computation, bagged clustering, naive Bayes classification, etc. The kernlab package contains additional functions for spectral clustering, kernel PCA, etc.

                            An excellent introduction to the usage of SVMs in R is available in David Meyer's SVM article.
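As a minimal sketch of the e1071 interface, an SVM classifier can be trained and applied as shown below. The example uses R's built-in iris data set and default-ish parameters (radial kernel, cost=1), which are illustrative choices, not recommendations from this manual.

```r
library(e1071) # Loads the e1071 library with its interface to libsvm.
model <- svm(Species ~ ., data=iris, kernel="radial", cost=1) # Trains an SVM classifier with a radial basis kernel.
summary(model) # Reports the kernel, the cost parameter and the number of support vectors.
pred <- predict(model, iris[ , -5]) # Predicts class labels, here for the training data itself.
table(pred, iris$Species) # Confusion matrix of predicted versus true labels.
```

For an honest performance estimate one would predict on held-out data or use the cross argument of svm() for cross-validation, rather than scoring the training set as done here.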


                            Similarity Measures for Clustering Results 

                            To measure the similarities among clustering results, one can compare the numbers of identical and unique item pairs appearing in their clusters. This is achieved by counting the number of item pairs found in both clustering sets (a) as well as the pairs appearing only in the first (b) or the second (c) set. With this information it is possible to calculate a similarity coefficient, such as the Jaccard Index. The latter is defined as the size of the intersection divided by the size of the union of two sample sets: a/(a+b+c). For partitioning results, the Jaccard Index measures how frequently pairs of items are joined together in two clustering data sets and how often pairs are observed in only one set. Related coefficients are the Rand Index and the Adjusted Rand Index. These indices also consider the number of pairs (d) that are not joined together in any of the clusters in both sets. A variety of alternative similarity coefficients can be considered for comparing clustering results. An overview of available methods is given on this cluster validity page. In addition, the Consense library contains a variety of functions for comparing cluster sets, and the mclust02 library contains an implementation of the variation of information criterion described by M. Meila (J Mult Anal 98, 873-895).
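The pair counts a, b and c defined above can also be computed directly in base R. In the following toy example (the six item labels and the two clustering vectors are made up), the pairs a-b and d-e are joined in both clusterings, two pairs are specific to each set, and the Jaccard Index is therefore 2/(2+2+2):

```r
clV1 <- c(a=1, b=1, c=1, d=2, e=2, f=3) # First toy clustering of six items.
clV2 <- c(a=1, b=1, c=2, d=2, e=2, f=3) # Second toy clustering of the same items.
pairs <- t(combn(length(clV1), 2)) # Enumerates all item pairs.
in1 <- clV1[pairs[, 1]] == clV1[pairs[, 2]] # Is a pair joined in the first clustering?
in2 <- clV2[pairs[, 1]] == clV2[pairs[, 2]] # Is a pair joined in the second clustering?
a <- sum(in1 & in2); b <- sum(in1 & !in2); c <- sum(!in1 & in2) # Counts of shared and set-specific pairs.
a / (a + b + c) # Jaccard Index: here 2/(2+2+2) = 0.33.
```

This brute-force enumeration of all pairs is fine for small examples; for large data sets the cindex() function used in this section computes the same quantity much more efficiently from the cluster contingency table.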

                            ####################################
                            ## Basic usage of cindex function ##
                            ####################################

                            source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/clusterIndex.R")
                               # Imports the cindex() function for computing the Jaccard Index or the Rand Index.
                            library(cluster); y <- matrix(rnorm(5000), 1000, 5, dimnames=list(paste("g", 1:1000, sep=""), paste("t", 1:5, sep=""))); clarax <- clara(y, 49); clV1 <- clarax$clustering; clarax <- clara(y, 50); clV2 <- clarax$clustering
                               # Creates two sample cluster data sets stored in the named vectors clV1 and clV2.
                            ci <- cindex(clV1=clV1, clV2=clV2, self=FALSE, minSZ=1, method="jaccard")
                               # Computes the Jaccard Index for clV1 and clV2, where values close to 0 indicate low similarities and values close to 1
                               # high similarities among the evaluated cluster sets. The Rand and Adjusted Rand Indices can be returned by assigning to
                               # the method argument the values rand or arand, respectively. By setting the argument self=TRUE, one can allow self
                               # comparisons, where pairs with different and identical labels are counted. This way clusters with single items will be
                               # considered as well. If it is set to FALSE, only pairs with different labels are considered. The implemented cindex() function
                               # is relatively fast and memory efficient. Two clustering results with 1000 items each partitioned into 50 clusters will
                               # be processed in about 0.05 seconds. To focus the analysis on clusters above a minimum size limit, one can use the minSZ
                               # argument. For instance, the setting minSZ=3 will only consider clusters with 3 or more items.


                            ci[2:3] # Returns Jaccard index and variables used to compute it.
                            $variables
                                a     b     c
                            10519  4186  3813

                            $Jaccard_Index
                            [1] 0.5680419



                            ci[[1]][1:4,1:4] # Returns slice of intersect matrix.
                               1  2 3  4
                            1 20  0 0  0
                            2  0 49 0  0
                            3  0  0 3  0
                            4  0  0 0 15


                            ################################
                            ## Clustering cluster results ##
                            ################################

                            clVlist <- lapply(3:12, function(x) clara(y[1:30, ], k=x)$clustering); names(clVlist) <- paste("k", "=", 3:12)
                            d <- sapply(names(clVlist), function(x) sapply(names(clVlist), function(y) cindex(clV1=clVlist[[y]], clV2=clVlist[[x]], method="jaccard")[[3]]))
                            hv <- hclust(as.dist(1-d))
                            plot(as.dendrogram(hv), edgePar=list(col=3, lwd=4), horiz=T, main="Similarities of 10 Clara Clustering Results for k: 3-12")
                               # Example for clustering entire cluster result sets. First, 10 sample cluster results are created with Clara using k-values
                               # from 3 to 12. The results are stored as named clustering vectors in a list object. Then a nested sapply loop is used to
                               # generate a similarity matrix of Jaccard Indices for the clustering results. After converting the result into a distance
                               # matrix, hierarchical clustering is performed with hclust.

                            Clustering of 10 Cluster Results Using the Jaccard Index as Similarity Measure


                            Clustering Exercises

                            Slide Show (R Code)

                            Install required libraries

                            The following exercises demonstrate several useful clustering and data mining utilities in R.

                            Sample data download:
                            • Import the data set into R
                            temp <- readLines("GSE1110_series_matrix.txt"); cat(temp[-grep("^!|^\"$", temp)], file="GSE1110clean.txt", sep="\n"); mydata <- read.delim("GSE1110clean.txt", header=T, sep="\t")
                               # These import commands include a cleanup step that removes annotation lines and corrupted return characters.
                            rownames(mydata) <- mydata[,1]; mydata <- as.matrix(mydata[,-1])
                               # Assigns row indices and converts the data into a matrix object.

                            • Filtering
                            mydata <- mydata[apply(mydata > 100, 1, sum)/length(mydata[1,])>0.5 & apply(log2(mydata), 1, IQR) > 1.5, ]
                               # Retrieves all rows with high intensities (values > 100 in at least 50% of the columns) and high variability (IQR > 1.5).

                            dim(mydata) # Dimension of final sample matrix
                            [1] 123  22

                            mydata[1:10,1:10] # Slice of sample matrix
                                      GSM18228 GSM18229 GSM18230 GSM18231 GSM18232 GSM18233 GSM18290 GSM18291 GSM18292 GSM18293
                            244932_at   4053.0   4130.4   3291.8   3383.1   4229.4   4338.2    903.2   1068.1   3609.3   3466.3
                            244933_at   2526.0   1800.0   1070.8    812.0   2345.0   2463.6    363.2    391.4   1074.6    823.3
                            244938_at   1789.7   1581.9   2043.4   2050.7    977.4    978.7    590.6    499.2   2091.9   2021.0
                            244939_at    749.0    759.1    936.4    947.0    531.7    604.5    378.9    492.5   1208.5   1065.2
                            244970_at   2377.0   2259.4   2053.1   2121.5   1401.8   1159.6    445.0    561.5   2243.1   2210.0
                            244982_at   2121.9   1953.2   5629.2   5240.5   1789.8   1997.7   1516.2   1592.3   5961.3   6067.3
                            244995_at    492.4    485.0    514.6    450.0    529.0    606.2    114.3    128.4    516.2    627.6
                            245002_at   2953.1   2866.9   6638.1   6562.6   4573.5   4481.8   1963.0   1885.5   7413.7   7784.2
                            245004_at   5447.6   5178.7   9046.1   8752.3   5121.3   4736.0   2730.6   2845.7  10264.3   9902.2
                            245008_at   2064.7   2267.7   1969.5   2277.2   1702.5   1844.0    698.1    546.3   2403.0   2553.8


                            • Hierarchical clustering
                            source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/my.colorFct.R")
                               # Import an alternative color scheme for the heatmap function.

                            mydatascale <- t(scale(t(mydata))) # Centers and scales data.
                            hr <- hclust(as.dist(1-cor(t(mydatascale), method="pearson")), method="complete") # Cluster rows by Pearson correlation.
                            hc <- hclust(as.dist(1-cor(mydatascale, method="spearman")), method="complete")
                               # Clusters columns by Spearman correlation.
                            heatmap(mydata, Rowv=as.dendrogram(hr), Colv=as.dendrogram(hc), col=my.colorFct(), scale="row")
                               # Plot the data table as heatmap and the cluster results as dendrograms.
                            mycl <- cutree(hr, h=max(hr$height)/1.5); mycolhc <- sample(rainbow(256)); mycolhc <- mycolhc[as.vector(mycl)]; heatmap(mydata, Rowv=as.dendrogram(hr), Colv=as.dendrogram(hc), col=my.colorFct(), scale="row", RowSideColors=mycolhc)
                               # Cut the tree at specific height and color the corresponding clusters in the heatmap color bar.

                            Tree Cutting of Hierarchical Clustering Result Shown in Color Bar on Left
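                            The tree-cutting step can be sketched on a small random matrix (illustrative data): cutree() returns one cluster label per row, which is then mapped to a color vector for the heatmap color bar:

                            ```r
                            set.seed(2)
                            x <- matrix(rnorm(30), nrow=10)                # Toy data: 10 rows, 3 columns.
                            hr_toy <- hclust(as.dist(1 - cor(t(x), method="pearson")), method="complete")
                            cl <- cutree(hr_toy, h=max(hr_toy$height)/1.5) # One integer cluster label per row.
                            cols <- rainbow(max(cl))[cl]                   # Maps cluster labels to colors.
                            ```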


                            • Obtain significant clusters by pvclust bootstrap analysis
                            library(pvclust); library(gplots) # Loads the required packages.
                            pv <- pvclust(scale(t(mydata)), method.dist="correlation", method.hclust="complete", nboot=10)
                               # Performs the hierarchical cluster analysis. Due to time restrictions, only 10 bootstrap
                               # repetitions are used here. Usually, one should use at least 1000 repetitions.

                            plot(pv, hang=-1); pvrect(pv, alpha=0.95)
                               # Plots result as a dendrogram where the significant clusters are highlighted with red rectangles.

                            clsig <- unlist(pvpick(pv, alpha=0.95, pv="au", type="geq", max.only=TRUE)$clusters) # Retrieve members of significant clusters.
                            source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/dendroCol.R") # Import tree coloring function.
                            dend_colored <- dendrapply(as.dendrogram(pv$hclust), dendroCol, keys=clsig, xPar="edgePar", bgr="black", fgr="red", pch=20)
                               # Create dendrogram object where the significant clusters are labeled in red.
                            heatmap(mydata, Rowv=dend_colored, Colv=as.dendrogram(hc), col=my.colorFct(), scale="row", RowSideColors=mycolhc)
                               # Plot the heatmap from above, but with the significant clusters in red and the cluster bins from the tree cutting step in
                               # the color bar.

                            x11(height=12); heatmap.2(mydata, Rowv=dend_colored, Colv=as.dendrogram(hc), col=my.colorFct(), scale="row", trace="none", RowSideColors=mycolhc) # Plots the heatmap with the heatmap.2() function, which scales better for large numbers of entries.
                            mydatasort <- mydata[pv$hclust$labels[pv$hclust$order], hc$labels[hc$order]] # Sorts the rows of the data table by 'dend_colored' and its columns by 'hc'.
                            x11(height=16, width=12); par(mfrow=c(1,2)); plot(dend_colored, horiz=T, yaxt="n"); image(scale(t(mydatasort)), col=my.colorFct(), xaxt="n",yaxt="n") # Plots the heatmap with the bootstrap tree in a larger format, using the image() function instead of heatmap().
                            pdf("pvclust.pdf", height=21, width=6); plot(dend_colored, horiz=T, yaxt="n"); dev.off(); pdf("heatmap.pdf", height=20, width=6); image(scale(t(mydatasort)), col=my.colorFct(), xaxt="n",yaxt="n"); dev.off()
                               # Save graphical results to two PDF files: 'pvclust.pdf' and 'heatmap.pdf'.


                            • Compare PAM (K-means) with hierarchical clustering
                            library(cluster) # Loads required library.
                            mydist <- t(scale(t(mydata))) # Center and scale data.
                            mydist <- as.dist(1-cor(t(mydist), method="pearson")) # Generates distance matrix using Pearson correlation as distance method.
                            pamy <- pam(mydist, max(mycl)) # Clusters distance matrix into as many clusters as obtained by tree cutting step (6).
                            mycolkm <- sample(rainbow(256)); mycolkm <- mycolkm[as.vector(pamy$clustering)]
                            heatmap(mydata, Rowv=dend_colored, Colv=as.dendrogram(hc), col=my.colorFct(), scale="row", RowSideColors=mycolkm)
                               # Compare PAM clustering results with hierarchical clustering by labeling it in heatmap color bar.

                            pdf("pam.pdf", height=20, width=20); heatmap(mydata, Rowv=dend_colored, Colv=as.dendrogram(hc), col=my.colorFct(), scale="row", RowSideColors=mycolkm); dev.off()
                               # Save graphical results to PDF file 'pam.pdf'.


                            Comparison of 3 Clustering Results: (1) hclust (row tree), (2) pvclust (red branches in row tree), and (3) pam (color bar)
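                            The behavior of pam() on a precomputed distance matrix can be checked with two well-separated toy groups (illustrative data, not the array set):

                            ```r
                            library(cluster) # Provides pam().
                            set.seed(3)
                            x <- rbind(matrix(rnorm(20, mean=0), ncol=2),  # 10 points around 0
                                       matrix(rnorm(20, mean=6), ncol=2))  # 10 points around 6
                            pamy_toy <- pam(dist(x), k=2) # pam() accepts a 'dist' object directly.
                            table(pamy_toy$clustering)    # Cluster sizes.
                            ```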


                            • Compare SOM with hierarchical clustering
                            library(som) # Loads required library.
                            y <- t(scale(t(mydata))) # Center and scale data.
                            y.som <- som(y, xdim = 2, ydim = 3, topol = "hexa", neigh = "gaussian") # Performs SOM clustering.
                            plot(y.som) # Plots results.
                            pdf("som.pdf"); plot(y.som); dev.off() # Save plot to PDF 'som.pdf'.
                            somclid <- as.numeric(paste(y.som$visual[,1], y.som$visual[,2], sep=""))+1 # Returns SOM cluster assignment in order of input data.
                            mycolsom <- sample(rainbow(256)); mycolsom <- mycolsom[somclid]; heatmap(mydata, Rowv=dend_colored, Colv=as.dendrogram(hc), col=my.colorFct(), scale="row", RowSideColors=mycolsom)
                               # Compares SOM clustering results with hierarchical clustering by labeling them in the heatmap color bar.
                            pdf("somhc.pdf", height=20, width=20); heatmap(mydata, Rowv=dend_colored, Colv=as.dendrogram(hc), col=my.colorFct(), scale="row", RowSideColors=mycolsom); dev.off() # Save graphical results to PDF file 'somhc.pdf'.


                            • Compare PCA with SOM
                            pca <- prcomp(mydata, scale=T) # Performs principal component analysis after scaling the data.
                            summary(pca) # Prints variance summary for all principal components.
                            library(scatterplot3d) # Loads 3D library.
                            scatterplot3d(pca$x[,1:3], pch=20, color=mycolsom) # Plots PCA result in 3D. The SOM clusters are highlighted in their color.
                            pdf("pcasom.pdf"); scatterplot3d(pca$x[,1:3], pch=20, color=mycolsom); dev.off() # Saves PCA plot in PDF format 'pcasom.pdf'.
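                            The variance summary printed by summary(pca) can be recomputed from the 'sdev' slot; with scale=TRUE the component variances sum to the number of variables (a minimal sketch on toy data):

                            ```r
                            set.seed(4)
                            x <- matrix(rnorm(100), nrow=20)                # Toy data: 20 observations, 5 variables.
                            pca_toy <- prcomp(x, scale=TRUE)
                            varprop <- pca_toy$sdev^2 / sum(pca_toy$sdev^2) # Proportion of variance per component.
                            ```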

                            • Compare MDS with HC, SOM and K-means
                            loc <- cmdscale(mydist, k = 3) # Performs MDS analysis and returns results for three dimensions.
                            x11(height=8, width=8, pointsize=12); par(mfrow=c(2,2)) # Sets plotting parameters.
                            plot(loc[,1:2], pch=20, col=mycolsom, main="MDS vs SOM 2D")
                               # Plots MDS-SOM comparison in 2D. The SOM clusters are highlighted in their color.

                            scatterplot3d(loc, pch=20, color=mycolsom, main="MDS vs SOM 3D") # Plots MDS-SOM comparison in 3D.
                            scatterplot3d(loc, pch=20, color=mycolhc, main="MDS vs HC 3D") # Plots MDS-HC comparison.
                            scatterplot3d(loc, pch=20, color=mycolkm, main="MDS vs KM 3D") # Plots MDS-KM comparison.

                            Comparison of MDS Clustering with SOM, HC, and K-means Results
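                            cmdscale() returns a coordinate matrix with one row per object and k columns; a quick check on toy distances (assumed data):

                            ```r
                            set.seed(5)
                            d_toy <- dist(matrix(rnorm(30), nrow=10)) # Distances among 10 toy points.
                            loc_toy <- cmdscale(d_toy, k=3)           # Embeds the 10 points in 3 dimensions.
                            ```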



                            • Scoring similarities among clustering results with Jaccard coefficient
                            names(somclid) <- row.names(mydata)
                            clist <- list(hc=mycl, pam=pamy$clustering, som=somclid) # Store clustering results in list.
                            d <- sapply(names(clist), function(x) sapply(names(clist), function(y) cindex(clV1=clist[[y]], clV2=clist[[x]], method="jaccard")[[3]]))
                               # Computes a similarity matrix of Jaccard coefficients. Note that cindex() is not part of
                               # base R and needs to be loaded beforehand.
                            round(d, 2)
                                  hc  pam  som
                            hc  1.00 0.56 0.62
                            pam 0.56 1.00 0.48
                            som 0.62 0.48 1.00
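                            What cindex() reports can be reproduced by hand: the Jaccard coefficient of two clusterings counts the item pairs grouped together in both, divided by the pairs grouped together in at least one. The jaccard() helper below is a hypothetical illustration, not the cindex() implementation:

                            ```r
                            # Hypothetical helper: pairwise Jaccard coefficient between two cluster label vectors.
                            jaccard <- function(a, b) {
                              lt <- lower.tri(matrix(0, length(a), length(a)))
                              pa <- outer(a, a, "==")[lt] # TRUE for item pairs co-clustered in 'a'.
                              pb <- outer(b, b, "==")[lt] # TRUE for item pairs co-clustered in 'b'.
                              sum(pa & pb) / sum(pa | pb)
                            }
                            jaccard(c(1, 1, 2, 2), c(1, 1, 2, 3)) # Co-clustered pairs {1,2} and {3,4} vs only {1,2}: 0.5
                            ```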

                            hv <- hclust(as.dist(1-d)) # Hierarchical clustering with hclust
                            plot(as.dendrogram(hv), edgePar=list(col=3, lwd=4), horiz=TRUE)
                               # Plots similarity among clustering results as a tree.

                            Similarity Among Clustering Results


                            • Fuzzy clustering
                            library(cluster) # Loads cluster library.
                            fannyy <- fanny(mydist, k= max(mycl), memb.exp = 1.5); round(fannyy$membership, 2); fannyy$clustering
                               # Performs fuzzy clustering with as many coefficients as clusters were obtained by tree cutting step in HC. The hard
                               # clustering results are provided in the 'clustering' slot.

                            fannyyMA <- round(fannyy$membership, 2) > 0.3; apply(fannyyMA, 1, which)
                               # Returns multiple cluster memberships for coefficient above a certain value (here >0.3).
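                            The membership matrix returned by fanny() has one row per object, with coefficients summing to one, which is why the >0.3 cutoff can assign an object to more than one cluster; a toy run (illustrative data):

                            ```r
                            library(cluster) # Provides fanny().
                            set.seed(7)
                            x <- rbind(matrix(rnorm(10, mean=0), ncol=2),  # 5 points around 0
                                       matrix(rnorm(10, mean=4), ncol=2))  # 5 points around 4
                            f_toy <- fanny(dist(x), k=2, memb.exp=1.5)
                            multi <- apply(f_toy$membership > 0.3, 1, which) # Cluster memberships above the 0.3 cutoff.
                            ```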

                            • Biclustering
                            Follow the examples given in the Bicluster Analysis section.

                            Administration

                            • Installation of R
                            Download and install R for your operating system from CRAN. A popular graphical interface to R is RStudio.

                            The latest instructions for installing Bioconductor packages are available on the Bioc Installation page. Only the essential steps are given here. To install Bioconductor packages, execute from the R console the following commands:

                            source("http://bioconductor.org/biocLite.R") # Sources the biocLite.R install script.
                            biocLite() # Installs the biocLite.R default set of Bioconductor packages.
                            biocLite(c("pkg1", "pkg2")) # Command to install specific packages from Bioc.
                            update.packages(repos=biocinstallRepos(), ask=FALSE)
                               # Bioconductor packages, especially those in the development branch, are updated fairly regularly. To update all
                               # of your installed packages, run this command after sourcing "biocLite.R".


                            To install CRAN packages, execute from the R console the following command:

                            install.packages(c("pkg1", "pkg2")) # The selected package name(s) need to be provided within quotation marks.
                            Alternatively, a package can be downloaded and then installed in compressed form like this:

                            install.packages("pkg.zip", repos=NULL)
                            List all packages installed on a system:

                            row.names(installed.packages())
                            On UNIX-based systems one can also execute the following shell command (not from R):

                            $ R CMD INSTALL pkg_0.1.tar.gz
                               # This installs packages by default into a system-wide directory. To install packages into a user account
                               # directory (not system-wide), one can specify a custom directory with the argument '-l my_lib_dir'. A locally
                               # installed library can be called in R like this: 'library(my_lib, lib.loc= "/home/user/my_lib_dir/")'.
                               # To manage libraries by default in a custom directory, one can generate a '.Renviron' file containing the
                               # line: R_LIBS=/your_home_dir/Rlib



                             