HT Sequence Analysis with R and Bioconductor

Authors: Tyler Backman, Rebecca Sun and Thomas Girke, UC Riverside

This manual is outdated. An up-to-date version can be found here.

Contents

  1. 1 Introduction
  2. 2 Scope of this Manual
  3. 3 NGS Workflows
  4. 4 Table of Contents
  5. 5 Workshops
  6. 6 R Basics
  7. 7 String Handling Utilities in R's Base Distribution
  8. 8 Bioconductor Packages
    1. 8.1 Biostrings
      1. 8.1.1 Handling Single Sequences with XString
      2. 8.1.2 Comparing and Subsetting XString Objects
      3. 8.1.3 Subfeature Retrieval with StringViews
      4. 8.1.4 Masking Sequences
      5. 8.1.5 Handling Many Sequences with XStringSet
      6. 8.1.6 Comparing and Subsetting of XStringSet Objects
      7. 8.1.7 Sequence Import and Export
      8. 8.1.8 Handling of Sequence and Quality Data with QualityScaleXStringSet
      9. 8.1.9 Basic Sequence Manipulation Utilities
      10. 8.1.10 Trimming of Flanking Sequences (e.g. Adaptors)
      11. 8.1.11 Pattern Search and Alignment Tools for Genome Mapping
      12. 8.1.12 Computing Pairwise Sequence Alignments
      13. 8.1.13 Multiple Sequence Alignments (MSAs)
      14. 8.1.14 Position Weight Matrices (PWM)
      15. 8.1.15 Analysis Routines with Biostrings
        1. 8.1.15.1 Mapping of Affy Probes to Transcriptome
        2. 8.1.15.2 Analyzing Assembly Results
          1. 8.1.15.2.1 Compute N50 Values for Assembly Results
          2. 8.1.15.2.2 Plot Cumulative Length Distribution of Contigs
    2. 8.2 Handling Sequence Ranges with IRanges and GenomicRanges
      1. 8.2.1 IRanges Overview
      2. 8.2.2 GenomicRanges Overview
      3. 8.2.3 Analysis Routines with IRanges/GenomicRanges
        1. 8.2.3.1 Compute Intron and Intergenic Ranges from GFF/GTF Annotations
        2. 8.2.3.2 Computing Absolute and Relative Overlaps Among Ranges
        3. 8.2.3.3 Parsing Gene/Intergenic Regions Based on GTF Input (IRanges Version)
        4. 8.2.3.4 Parsing Genomic Features Based on GFF Input (old non-IRanges version)
    3. 8.3 Motif Discovery
      1. 8.3.1 cosmo
      2. 8.3.2 BCRANK
      3. 8.3.3 Additional Motif Discovery Packages
    4. 8.4 ShortRead
      1. 8.4.1 Function Overview
      2. 8.4.2 Illumina Pipeline
      3. 8.4.3 Quality Scores
      4. 8.4.4 ShortRead Objects
        1. 8.4.4.1 SolexaPath
        2. 8.4.4.2 ShortReadQ
        3. 8.4.4.3 AlignedRead
      5. 8.4.5 ShortRead Sequence Filters
      6. 8.4.6 Analysis Routines with ShortRead
        1. 8.4.6.1 Import Reads, Filter, and Export
        2. 8.4.6.2 SmallRNA Profiling
        3. 8.4.6.3 Trimming and Sorting Reads by 5' Multiplexing Adapters
        4. 8.4.6.4 Trimming Low Quality 3' Tails from Reads
        5. 8.4.6.5 Quality Reports of FASTQ Files 
    5. 8.5 Rsamtools
      1. 8.5.1 BWA Alignment
      2. 8.5.2 Rsamtools Bam Parsing
      3. 8.5.3 Rsamtools SNP Calling
      4. 8.5.4 Analysis Routines with Rsamtools
        1. 8.5.4.1 SNP/INDEL Analysis and Annotation
    6. 8.6 BSgenome
      1. 8.6.1 BSgenome Use
    7. 8.7 rtracklayer
    8. 8.8 biomaRt
      1. 8.8.1 biomaRt Use
    9. 8.9 ChIP-Seq Analysis Packages
      1. 8.9.1 chipseq
      2. 8.9.2 BayesPeak
      3. 8.9.3 PICS
      4. 8.9.4 ChIPpeakAnno
      5. 8.9.5 Additional ChIP-Seq Packages
    10. 8.10 RNA-Seq Analysis
      1. 8.10.1 Counting Reads that Overlap with Annotation Ranges 
      2. 8.10.2 Differential Gene Expression Analysis with DESeq
      3. 8.10.3 Differential Gene Expression Analysis with edgeR
      4. 8.10.4 Detection of Alternative Splice Junctions
    11. 8.11 DNA-Methylation Data Analysis
    12. 8.12 HT-Seq Data Visualization
  9. 9 Command Line Alignment Utilities
    1. 9.1 SOAP
      1. 9.1.1 Running SOAP
      2. 9.1.2 Parsing Output with R
    2. 9.2 MAQ
      1. 9.2.1 Running MAQ
      2. 9.2.2 Parsing Output with R
    3. 9.3 Bowtie
      1. 9.3.1 Running Bowtie
      2. 9.3.2 Parsing Output with R
    4. 9.4 BWA
  10. 10 Exercises
  11. 11 Session Info
  12. 12 Installing Packages

Introduction

[ Sequencing Technologies ]   [ Latest Slides from NGS Analysis Workshop ]

High throughput sequencing (HT-Seq or HTS), also known as next generation sequencing (NGS), presents a wide spectrum of opportunities for genome research. Unfortunately, many existing bioinformatic tools do not scale well to large datasets consisting of tens of millions of sequences generated by technologies like Illumina/Solexa, Roche/454, ABI/SOLiD and Helicos. The Bioconductor project fills this gap by providing a rapidly growing suite of well designed R packages for analyzing traditional and HT-Seq datasets. These 'BioC-Seq' packages make it possible to analyze such sequences with impressive speed. This performance is achieved by using memory efficient string containers and by performing the time consuming computations with calls to external programs that are implemented in compiled languages (e.g. C/C++). Together these packages form a novel framework that allows researchers to develop efficient pipelines by performing complex data analysis in a high-level data analysis and programming environment.

Scope of this Manual

This manual is intended for users who have a basic knowledge of the R environment and would like to use R/Bioconductor to perform general or HT sequencing analysis. For new users, it is recommended to first review the material covered in the "R Basics" section (see below). To obtain a more comprehensive overview of R's sequence bioinformatics utilities, Robert Gentleman's book "R Programming for Bioinformatics" is an excellent choice.

The packages introduced here are a 'personal selection' by the authors of this manual and do not reflect the full utility spectrum of the Bioconductor project for this field of application. The introduced packages were chosen because the authors use them often for their own teaching and research. To obtain a broad overview of available Bioconductor packages, it is strongly recommended to consult the official project site (http://bioconductor.org). Due to the rapid development of many packages, it is also important to be aware that this manual will often not be fully up-to-date. Because of this and many other reasons, it is absolutely critical to use the original documentation of each package (PDF manual or vignette) as the primary source of documentation. Users are welcome to send suggestions for improving this manual directly to the authors.

Table of Contents

NGS Workflows

Readers of this manual might also be interested in the recently released systemPipeR package, which provides convenience utilities for building end-to-end analysis pipelines with automated report generation for a variety of next generation sequencing (NGS) applications such as RNA-Seq, ChIP-Seq, VAR-Seq and Ribo-Seq.

Table of Contents

Workshops

Workshops on HT sequence data analysis using BioC-Seq resources are announced on the Bioconductor workshop site. The Institute for Integrative Genome Biology (IIGB) at UCR also offers workshops in this field. In addition, many other institutions organize similar workshops. The best way to find out about upcoming events in a certain location is to run a Google search on this topic.

Table of Contents

R Basics

The R & BioConductor manual provides an introduction to the usage of the R environment and its basic command syntax, and the Programming in R manual gives a more advanced overview of R's programming syntax.

Table of Contents

String Handling Utilities in R's Base Distribution

The R base distribution contains various functions for handling and manipulating strings. These utilities are very useful for basic sequence analysis tasks. However, it is often much more effective to use dedicated Bioconductor sequence analysis packages, such as Biostrings, for these tasks. This manual primarily introduces functions and data objects from the Bioconductor project. Nevertheless, the following commands provide a brief summary of some very useful string handling functions available in R's base distribution.

#####################
## String Matching ##
#####################
myseq <- c("ATGCAGACATAGTG", "ATGAACATAGATCC", "GTACAGATCAC") # Creates a sample sequence data set.
myseq[grep("ATG", myseq)] # String searching with regular expression support.
pos1 <- regexpr("AT", myseq) # Searches 'myseq' for first match of pattern "AT".
as.numeric(pos1); attributes(pos1)$match.length # Returns position information of matches.
pos2 <- gregexpr("AT", myseq) # Searches 'myseq' for all matches of pattern "AT".
as.numeric(pos2[[1]]); attributes(pos2[[1]])$match.length # Returns position information of matches in first sequence.
gsub("^ATG", "atg", myseq) # String substitution with regular expression support.

########################
## Positional Parsing ##
########################
nchar(myseq) # Computes length of strings.
substring(myseq[1], c(1,3), c(2,5)) # Positional parsing of several fragments from one string.
substring(myseq, c(1,4,7), c(2,6,10)) # Positional parsing of many strings.

############################
## Reverse and Complement ##
############################
myseq_comp <- chartr("ATGC", "TACG", myseq) # Returns complement for given DNA sequences.
substring(myseq[1], 1:nchar(myseq[1]), 1:nchar(myseq[1])) # Vectorizes single sequence.
x <- strsplit(myseq_comp, "", fixed=TRUE) # Vectorizes many sequences.
x <- lapply(x, rev) # Reverses vectors.
myseq_revcomp <- sapply(x, paste, collapse="") # Collapses vectors to strings. Final result: reverse and complement of myseq.

#########################################
## Translate DNA Sequence into Protein ##
#########################################
y <- c("ATGCATTGGACGTTAG") # Creates sample DNA sequence.
AAdf <- read.table(file="http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/AA.txt", header=T, sep="\t") # Imports the genetic code.
AAv <- AAdf[,2]; names(AAv) <- AAdf[,1] # Creates named vector with genetic code
y <- gsub("(...)", "\\1_", y) # Inserts "_" after each triplet.
y <- unlist(strsplit(y, "_")) # Splits on "_" and returns vectorized triplets.
y <- y[grep("^...$", y)] # Removes incomplete triplets.
AAv[y] # Translation into protein by name-based subsetting.

#############################
## Create Random Sequences ##
#############################
sapply(1:100, function(x) paste(sample(c("A","T","G","C"), 20, replace=T), collapse="")) # Creates 100 random DNA sequences with 20 residues.
sapply(1:100, function(x) paste(sample(1:40, 20, replace=T), collapse=" ")) # Creates 100 random score sets with 20 elements.

###################################################
## Sequence Printing and Writing in FASTA Format ##
###################################################
myseq <- sapply(1:12, function(x) paste(sample(c("A","T","G","C"), 40, replace=T), collapse=""))
writeLines(myseq) # Prints the sequences to the screen, one per line.
writeLines(strtrim(myseq, 10)) # The strtrim() function prints only the first section of each sequence (here the first 10 characters).
writeLines(strwrap(myseq, indent = 20)) # The strwrap() function provides wrapping, indenting and prefixing utilities.
myname <- paste(">", month.name, sep="") # Creates sample names for the sequences.
writeLines(as.vector(t(cbind(myname, myseq))), "myseq.fasta") # Writes the sequences in FASTA format to a file.

Bioconductor Packages

Biostrings

Documentation

The Biostrings package from Bioconductor provides an advanced environment for efficient sequence management and analysis in R. It contains many speed- and memory-efficient string containers, string matching algorithms and other utilities for fast manipulation of large sets of biological sequences. The objects and functions provided by Biostrings form the basis for many other sequence analysis packages.

Handling Single Sequences with XString

The XString class represents the different types of biosequence strings through its subclasses: BString, RNAString, DNAString and AAString.

library(Biostrings)
b <- BString("I store any set of characters. The other XString objects store only the IUPAC characters for each biosequence type.")

d <- DNAString("GCATAT-TAC") # Creates DNAString object.
r <- RNAString("GCAUAU-UAC") # Creates RNAString object.
r <- RNAString(d) # Converts d into RNAString object.
p <- AAString("HCWYHH")

Comparing and Subsetting XString Objects

Several useful utilities are available to access, subset and manipulate XString objects.

replaceLetterAt(d, c(1), "N") # Replaces residues at specified positions.
length(d) # Returns the length of the sequence that is stored in d.
toString(d) # Converts DNAString to character vector of length 1.
d[2:6] # Prints d from positions 2 to 6. This makes positional parsing of sequences very easy!
d[length(d):1] # Prints the reverse string of d.
subseq(d, start=3, end=6) # Note: the subseq() function from the IRanges package is more efficient for subsetting large sequences.
d[1:4]==d[1:4] # Checks whether two sequences are identical. In this case it returns TRUE.
d[1:4]==d[1:5] # Returns FALSE.
d == r # Interprets DNA and RNA sequences the same way. As a result, the given example returns TRUE.
d[7] <- DNAString("N") # Replaces base 7 with a "N".
subseq(d, start=7, width=1) <- DNAString("-") # Replaces base 7 with a "-".

Subfeature Retrieval with StringViews

The XStringViews class provides utilities to manage and view subfeatures from sequences stored in XString objects.

dlong <- DNAString("GCATATTACAATCGATCCGCATATTAC")
dsub <- Views(dlong, start = c(1,6,11,20), end = c(5,8,18,27)) # Generates XStringViews object with four views (subfeatures).
dsub[1:2] # Returns first two views.
dsub[[1]] # Returns first view as XString object.
subject(dsub) # Returns the subject sequence that the views are defined on.
paste(dsub, collapse="_") # Collapses views into a single sequence.
attributes(dsub) # Returns attribute list of XStringViews object.
start(dsub); end(dsub); width(dsub) # Functions to return the different XStringViews components as vectors.
sapply(as.list(dsub), toString) # Returns sequences in XStringView object as vector.
dsub[dsub == DNAString("ATCGATCC")] # Subsetting by a sequence string.

Masking Sequences

Biostrings makes it possible to mask XString objects in order to exclude undesired regions from consideration by various analysis tools, e.g. repetitive sequences in the mapping of short reads. The advantage of this masking approach is that it does not result in any information loss, because masks can be turned on and off at any time. In addition, the process is very memory efficient.

d <- DNAString("CGCTATCTGACTGCCGCGCC") # Creates sample sequence.
maskMotif(d, "GC") # Function for masking a sequence by a pattern.
m1 <- Mask(mask.width=nchar(d), start=c(1,17), end=c(3,20)) # Defines mask m1.
m2 <- Mask(mask.width=nchar(d), start=c(4,12), end=c(7,14)) # Defines mask m2
m12 <- append(m1, m2) # Joins two masks in one container.
names(m12) <- c("A", "B") # Assigns names to mask components.
active(m12)["B"] <- FALSE # Turns one mask off.
masks(d) <- m12 # Applies masks to DNAString object. No information is lost by this action!
d; nchar(d); length(d) # Returns some information about masked object.
active(m12)["B"] <- TRUE # Turns mask B back on.
masks(d) <- m12 # Reapplies modified mask object to d.
gaps(d) # Reverses all the masks.
reverseComplement(d) # Returns the reverse and complement for a masked sequence object.
dfrag <- as(d, "XStringViews") # Turns MaskedXString object into an XStringViews object.
sapply(as.list(dfrag), toString) # Retrieves all unmasked sections in d as strings in vector.
injectHardMask(d, "N") # Injects/fills the masked regions of a sequence with a user-specified letter.
masks(d) <- NULL # Removes all masks from DNAString object. Alternative command: unmasked().

Handling Many Sequences with XStringSet

The XStringSet class represents many sequences in a single container. The subclasses for the different types of biosequences are: BStringSet, RNAStringSet, DNAStringSet and AAStringSet.

dset <- DNAStringSet(c("GCATATTAC", "AATCGATCC", "GCATATTAC")) # The corresponding XStringSet containers allow to store many biosequences in one object.
names(dset) <- c("seq1", "seq2", "seq3") # Assigns names to the elements in a XStringSet object.
rset <- RNAStringSet(c("GCAUAUUAC", "UUUAAAA", "GCAUAUUAC"))
rset <- RNAStringSet(dset) # Converts dset into RNAStringSet object.
pset <- AAStringSet(c("HCWYHH", "SPPRHA"))

Comparing and Subsetting of XStringSet Objects

Several useful utilities are available to access, subset and manipulate XStringSet objects.

dset[1:2] # Returns the first two sequence entries from dset.
append(dset, dset); c(dset, dset) # Appends/concatenates two XStringSet objects.
unlist(dset) # Collapses many sequences to a single one stored in a DNAString container.
DNAStringSet(dset, start=c(1,2,3), end=c(4,8,5)) # Sequence subsetting by positions provided under start and end.
d==dset[[1]] # The [[ subsetting operator returns a single entry as XString object.
lapply(as.list(dset), toString) # Converts XStringSet to list. If the name field is populated then the list elements will be named accordingly.
names(dset); width(dset) # Returns the name and length of each sequences, respectively.
dsetv <- sapply(as.list(dset), toString); table(dsetv) # Returns the sequences in XStringSet as named vector and then counts the number of occurrences of each sequence.
dset[!duplicated(dset)] # Removes the duplicated sequence in dset.

Table of Contents

Sequence Import and Export

The set of read.XStringSet import functions allows importing one or many FASTA-formatted biosequences from an external file directly into XStringSet objects.

################################################
## Import and Export for XStringSet Container ##
################################################
myseq1 <- read.DNAStringSet("ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Halobacterium_sp_uid217/AE004437.ffn","fasta") # Imports all open reading frames of the Halobacterium genome from NCBI into DNAStringSet object. If this returns an error message then first download the sequences to your desktop and import them from there.
myseq1[grep("phosphoesterase", names(myseq1))] # Retrieves all sequences containing the term "phosphoesterase" in their name field.
myseq2 <- myseq1[grep("^ATGCTT", sapply(as.list(myseq1), toString))] # Retrieves all sequences starting with "ATGCTT".
write.XStringSet(myseq2, file="myseq2.txt", format="fasta", width=80) # Writes sequences to file in FASTA format.
write.XStringSet(myqual, file="myqual.txt", format="fasta", width=80) # Writes the quality scores stored in the 'myqual' object (created in the quality score section below) to a file.

Handling of Sequence and Quality Data with QualityScaleXStringSet

The QualityScaleXStringSet class represents many sequences plus their quality data in a single container. The subclasses for the different types of biosequences are: QualityScaledBStringSet, QualityScaledDNAStringSet, QualityScaledRNAStringSet and QualityScaledAAStringSet.

##########################
## About Quality Scores ##
##########################
phred <- 1:9
phreda <- paste(sapply(as.raw((phred)+33), rawToChar), collapse=""); phreda
as.integer(charToRaw(phreda))-33 # Phred quality scores are integers from 0-50 that are stored as ASCII characters after adding 33. The basic R functions rawToChar and charToRaw can be used to interconvert among their representations. Solexa scores are similar, but use an offset of 64.
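## The same interconversion applies to the older Solexa encoding. The following lines are a minimal sketch (not part of the original example) that only assumes the offset of 64 mentioned above and uses base R functions.
solexa <- 1:9 # Sample Solexa quality scores.
solexaa <- paste(sapply(as.raw(solexa+64), rawToChar), collapse=""); solexaa # Encodes the scores as ASCII characters using the Solexa offset of 64.
as.integer(charToRaw(solexaa))-64 # Converts the ASCII characters back to integer Solexa scores.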

##########################################
## Quality Score Handling in Biostrings ##
##########################################
dset <- DNAStringSet(sapply(1:100, function(x) paste(sample(c("A","T","G","C"), 20, replace=T), collapse=""))) # Creates random sample sequence.
myqlist <- lapply(1:100, function(x) sample(1:40, 20, replace=T)) # Creates random Phred score list.
myqual <- sapply(myqlist, function(x) toString(PhredQuality(x))) # Converts integer scores into ASCII characters.
myqual <- PhredQuality(myqual) # Converts to a PhredQuality object.
lapply(seq(along=myqual), function(x) as.integer(charToRaw(as.character(myqual[x])))) # Converts Phred scores from ASCII decimal back to integer representation.
library(ShortRead); setSR <- ShortReadQ(sread=dset, quality=FastqQuality(BStringSet(myqual)), id=BStringSet(myqual)) # Stores the sequences and their quality scores in a ShortReadQ object (the quality strings are reused here as placeholder ids).
myMA <- as(quality(setSR), "matrix") # Fast conversion of all quality ASCII values into a numeric matrix by using the ShortReadQ function from the ShortRead library.
boxplot(as.data.frame((myMA))) # Creates box plot of quality values.
dsetq1 <- QualityScaledDNAStringSet(dset, myqual) # Combines DNAStringSet and quality data in QualityScaledDNAStringSet object.
dsetq1[rowSums(myMA < 20) <= 4] # Filters for sequences that contain not more than 4 bases with a quality score below 20.
dset <- DNAStringSet(dsetq1) # Extract sequence component from QualityScaledDNAStringSet object.
qset <- quality(dsetq1) # Extract quality component from QualityScaledDNAStringSet object.
DNAStringSet(dset[1:2], start=c(1,5), end=c(4,9)) # Positional subsetting for sequence data.
BStringSet(qset[1:2], start=c(1,5), end=c(4,9)) # Positional subsetting for corresponding quality data.

###########################################################
## Example for Injecting Ns at Positions with Low Scores ##
###########################################################
XStringSetInject <- function(myseq=dset[[1]], myqual=qset[1], cutoff=20, mychar="N") { # Defines character inject function XStringSetInject.
    mypos <- which(as.integer(myqual)<cutoff)
    return(toString(replaceLetterAt(myseq, which(as.integer(myqual)<cutoff),
    rep(mychar, length(mypos)))))
}
dvec <- sapply(seq(along=dset), function(x) XStringSetInject(myseq=dset[[x]], myqual=qset[x], cutoff=20)) # Apply XStringSetInject function to all sequences.
dset2 <- DNAStringSet(dvec)
names(dset2) <- paste("d", seq(along=dvec), sep="")
names(myqual) <- paste("d", seq(along=dvec), sep="")
dsetq2 <- QualityScaledDNAStringSet(dset2, myqual) # Reassemble results in QualityScaledDNAStringSet object.

####################################
## Filter Sequences by Ns (Score) ##
####################################
Ncount <- alphabetFrequency(DNAStringSet(dvec))[,15] # Generates N counts for each sequence in dvec.
dsetq2[Ncount <= 2] # Returns only sequences with 0-2 Ns.
length(dsetq2[Ncount <= 2]) # Returns the number of sequences that have 0-2 Ns.

Basic Sequence Manipulation Utilities

The Biostrings package contains a wide spectrum of functions for performing many of the basic sequence transformation and manipulation routines, such as computing frequency tables, reverse & complements, translating DNA into protein sequences, etc. The following examples introduce a subset of these basic sequence analysis functions.

#####################
## Getting Started ##
#####################
library(help=Biostrings) # Lists all functions provided by Biostrings.
myseq1 <- read.DNAStringSet("ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Halobacterium_sp_uid217/AE004437.ffn", "fasta") # Imports a sample sequence data set: all open reading frames of the Halobacterium genome from NCBI. If this returns an error message then first download the sequences to your desktop and import them from there.

######################
## Useful Data Sets ##
######################
GENETIC_CODE # Prints genetic code information.
IUPAC_CODE_MAP # Returns IUPAC nucleotide ambiguity codes.
data(BLOSUM80) # Loads BLOSUM80 substitution matrix. More on this: ?substitution.matrices
?substitution.matrices # Detailed information on provided substitution matrices.

#####################################################
## Sequence Variants (SNPs) and Consensus Analysis ##
#####################################################
data(phiX174Phage) # Imports six versions of the complete genome for bacteriophage phi X174 as DNAStringSet object.
mymm <- which(rowSums(t(consensusMatrix(phiX174Phage))[,1:4] == 6)!=1) # Creates a consensus matrix and returns the mismatch (SNP) positions.
DNAStringSet(phiX174Phage, start=mymm[1], end=mymm[1]+10) # Prints the sequence sections for the first SNP region in the phiX174Phage sequences.
consensusString(DNAStringSet(phiX174Phage, start=mymm[1], end=mymm[1]+10)) # Prints the consensus sequence for the corresponding SNP region.
mysnp <- t(consensusMatrix(phiX174Phage))[mymm,1:4]; rownames(mysnp) <- paste("SNP", mymm, sep="_"); mysnp # Returns the consensus matrix data only for the SNP positions.
mysnp2 <- lapply(seq(along=mymm), function(x) sapply(as.list(DNAStringSet(phiX174Phage, start=mymm[x], end=mymm[x])), toString))
mysnp2 <- as.data.frame(mysnp2)
colnames(mysnp2) <- paste("SNP", mymm, sep="_"); mysnp2 # Returns all SNPs in a data frame and names them according to their positions.

#############################################
## Reverse and Complement of DNA Sequences ##
#############################################
reverseComplement(myseq1) # Returns the reverse and complement for many BioC object types, including XString, XStringViews and XStringSet.
reverseComplement(dsub) # Does the same for XStringViews objects and adjusts all start and end coordinates accordingly.

#########################################
## Residue Frequency (e.g. GC Content) ##
#########################################
alphabetFrequency(myseq1)[1:4,] # Computes the residue frequency for all sequences in myseq1.
data.frame(ID=names(myseq1), GC=rowSums(alphabetFrequency(myseq1)[,c(2,3)]/width(myseq1))*100)[1:4,] # Returns GC content as data frame.
trinucleotideFrequency(myseq1[1:4]) # Returns the triplet frequency for the first four sequences.

##########################################
## Translate DNA into Protein Sequences ##
##########################################
translate(myseq1[1]) # Translates a single ORF and returns result as AAStringSet object.
translate(myseq1) # Translates all ORFs in myseq1. 

Trimming of Flanking Sequences (e.g. Adaptors)

Many cloning and barcoding approaches append flanking sequences (e.g. adaptors) to the actual sequences of a biosample. These artificial sequence fragments need to be removed to allow a reliable matching of the sequences against their corresponding genomes or transcriptomes. The trimLRPatterns() function is a very efficient and flexible utility for removing flanking patterns from both ends of sequences.

myseq <-DNAStringSet(c("CCCATGCAGACATAGTG", "CCCATGAACATAGATCC", "CCCGTACAGATCACGTG")); names(myseq) <- letters[1:3] # Creates sample sequences.
trimLRPatterns(Lpattern ="CCC", Rpattern="AGTG", subject=myseq, max.Lmismatch = 0.33, max.Rmismatch = 1) # Removes the specified R/L-patterns. The number of maximum mismatches can be specified for each pattern individually. Indel matches can be specified with the arguments: with.Lindels and with.Rindels.
trimLRPatterns(Lpattern = "CCC", Rpattern="AGTG", subject=myseq, max.Lmismatch=c(0,0,0), max.Rmismatch=c(1,0,0)) # To remove partial adaptors, the number of mismatches for all possible partial matches can be specified by providing a numeric vector of length nchar(mypattern). The numbers specify the number of mismatches for each partial match. Negative numbers can be used to prevent trimming of a minimum fragment length, e.g. the most terminal nucleotides.
trimLRPatterns(Lpattern = "CCC", Rpattern="AGTG", subject=myseq, max.Lmismatch=c(0,0,0), max.Rmismatch=c(1,0,0), ranges=T) # With the setting 'ranges=TRUE' one can retrieve the corresponding trimming coordinates.

Pattern Search and Alignment Tools for Genome Mapping

Biostrings contains a wide spectrum of sequence search and pattern matching tools. This section provides only a brief introduction into a subset of these tools.

######################
## Pattern Matching ##
######################
## The matchPattern() and vmatchPattern() functions search a (usually short) query sequence against one or many subject sequences, respectively. The number of mismatches and gaps can be specified by the user.
mypos <- matchPattern("ATGGTG", myseq1[[1]], max.mismatch=1) # Finds all matches of a pattern in a reference sequence.
mypos # The results are stored as XStringViews object.
tmp <- c(DNAStringSet("ATGGTG"), DNAStringSet(mypos)); tmp # Uses DNAStringSet to show a 'pseudo' alignment of the query sequence on the top and the hits below.
consensusMatrix(tmp) # Returns a consensus  matrix for query and hits.

myvpos <- vmatchPattern("ATGGCT", myseq1, max.mismatch=1) # Finds all matches of a pattern in a set of reference sequences.
myvpos # The results are stored as MIndex object.
myvcount <- countPattern("ATGGCT", myseq1[[1]], max.mismatch=1) # Counts only the corresponding matches.
Views(myseq1[[1]], start(myvpos[[1]]), end(myvpos[[1]])) # Retrieves the result for single entry as XStringViews object.
lapply(seq(along=myseq1), function(x) as.character(Views(myseq1[[x]], start(myvpos[[x]]), end(myvpos[[x]])))) # Retrieves the sequences of all matches.

######################################################
## Pattern Matching with Regular Expression Support ##
######################################################
myseq <- DNAStringSet(c("ATGCAGACATAGTG", "ATGAACATAGATCC", "GTACAGATCAC")) # Creates a sample sequence data set.
myseq[grep("^ATG", myseq, perl=TRUE)] # String searching with regular expression support.
pos1 <- regexpr("AT", myseq) # Searches 'myseq' for first match of pattern "AT".
as.numeric(pos1); attributes(pos1)$match.length # Returns position information of matches.
pos2 <- gregexpr("AT", myseq) # Searches 'myseq' for all matches of pattern "AT".
as.numeric(pos2[[1]]); attributes(pos2[[1]])$match.length # Returns position information of matches in first sequence.
DNAStringSet(gsub("^ATG", "NNN", myseq)) # String substitution with regular expression support.

###################################
## Fast Matching with matchPDict ##
###################################
## The matchPDict function allows fast matching of many short sequences stored in a dictionary against a long reference sequence (e.g. chromosome).

## (1) Matching of Sequences to Reference
mydict <- DNAStringSet(sapply(1:1000, function(x) paste(sample(c("A","T","G","C"), 8, replace=T), collapse=""))) # Creates random sample sequences.
names(mydict)<- paste("d", 1:1000, sep="") # Names the sequences.
mypdict <- PDict(mydict) # Creates a PDict dictionary. Allows only sequences of same length.
mychr <- DNAString(sapply(1, function(x) paste(sample(c("A","T","G","C"), 20000, replace=T), collapse=""))) # Creates sample chromosome.
mysearch <- matchPDict(mypdict, mychr, max.mismatch=0) # Searches all dictionary entries against chromosome.

## (2) Accessing the Results
unlist(mysearch)[1:10] # Returns first 10 matches.
start_index <- startIndex(mysearch) # Returns start positions of matches.
end_index <- endIndex(mysearch) # Returns end positions of matches.
count_index <- countIndex(mysearch) # Counts the number of matches for each sequence.
with_match <- which(count_index >= 1) # Retrieves index of sequences with at least one match.
Views(mychr, start=unlist(start_index[with_match]), end=unlist(end_index[with_match])) # Returns matches as XStringViews object.
mydict[with_match] # Returns corresponding dictionary entries.

## (3) Coverage Plots
mycov <- as.integer(coverage(mysearch, 1, length(mychr)))
max(mycov); min(mycov)
sum(mycov != 0) / length(mycov)
plotCoverage <- function(coverage, start, end, mycol="red") { # Defines coverage plotting function.
                plot.new()
                plot.window(c(start, end), c(0, 10))
                axis(1)
                axis(2)
                axis(4)
                lines(start:end, coverage[start:end], type="l", col=mycol)
        }
plotCoverage(mycov, start=1, end=20000, mycol="red") # Plots coverage result.

## (4) Plot Fragments onto Chromosome
source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/featureMap.txt") # Imports plotting function.
mapDF <- data.frame(as.data.frame(unlist(mysearch))[,1:2], strand=1) # Creates coordinate data frame.
colnames(mapDF) <- c("startx", "endx", "strand")
genelength <- length(mychr) # Defines chromosome length.
mapDF[,2] <- mapDF$endx + 200 # To see the matches better in the plot, their length is here extended by an arbitrary value (200).
mapDF[mapDF[,2]>genelength,2] <- genelength
Mysub <- paste("Chromosome length", genelength, "bp") # Specifies subtitle of plot.
Mymain <- "Chromosome Plot for Sense Alignments" # Specifies title of plot.
plotDF <- transformDF(plotDF=mapDF, genelength, method="arrange") # Transforms coordinates for plotting.
featurePlot(plotDF=plotDF, geneline=6, genecol="green", featureline=2, featurecol=sample(c(2,4,8), length(plotDF[,1]), replace=T), showxy=F) # Plots the feature map for the match coordinates that are stored in 'mysearch'. The feature color argument allows to color each feature in a different color.

## (5) Allowing Mismatches with matchPDict 
mychr <- DNAString("CTAGGCTGTATCCTAAACACCGACAAGGTCTAGCCTGTATGCTAGCTAGTGTCCGCTCGGAAGCCGCT")
mydict <- DNAStringSet(c("GTAGGCTGTATCCTA", "CGCTCGGAAGCCGCT"))
names(mydict) <- paste("d", seq(along=mydict), sep="")

mypdict <- PDict(mydict, tb.end=1) # Uses a trusted band of nucleotides that need to match exactly (here only first nucleotide).
tb(mypdict); tail(mypdict); width(mypdict) # Methods for accessing the different trusted band components; '?PDict' provides more detailed information.
mysearch <- matchPDict(mypdict, mychr, max.mismatch=1) # Allows one mismatch outside of trusted band.
table(cut(countIndex(mysearch), c(0, 1:10), right=FALSE)) # Summary of matches.

Computing Pairwise Sequence Alignments

The pairwise sequence alignment function from Biostrings provides a flexible environment for computing optimal pairwise alignments using different dynamic programming algorithms.

s1 <- DNAString("GGGCCC"); s2 <- DNAString("GGGTTCCC") # Creates sample sequences.
myalign <- pairwiseAlignment(s1, s2, type="local") # Computes pairwise local alignment with Smith-Waterman dynamic programming algorithm.
myalign <- pairwiseAlignment(s1, s2, type="global") # Computes global alignment with Needleman-Wunsch algorithm.
pairwiseAlignment(c("GGGCCC", "GAGGCC", "CGAGA"), s2, type="local", scoreOnly=T) # The function can also be used for sequence searching by providing the 'sequence database' as vector in the first argument.
attributes(myalign)[-5] # Returns the components of an alignment object.
subject(myalign); pattern(myalign) # Returns aligned sequences with gaps. Many more helper functions for alignments can be found in Biostrings' Alignment Manual.

sq1 <- SolexaQuality(as.integer(c(22,22,22,12,12,12)))
sq2 <- SolexaQuality(as.integer(c(22,22,22,0,0,0,0,0)))

pairwiseAlignment(s1, s2, patternQuality = sq1, subjectQuality=sq2) # Uses a quality-based method for generating the substitution matrix.

Multiple Sequence Alignments (MSAs)

Multiple sequence alignments (MSAs) can be imported as FASTA formatted alignments into R using the read.XStringSet function. Almost all MSA programs support this format (e.g. ClustalW, T-Coffee, MUSCLE, Multalin). This import method will store the MSAs as XStringSet objects. The following examples introduce a variety of useful analysis routines on MSAs using this object class.

###########################
## Import of MSAs into R ##
###########################
library(Biostrings)
p450 <- read.AAStringSet("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/Samples/p450.mul", "fasta") # Imports MSA of P450 sequences from a FASTA formatted file and stores the result as AAStringSet object.
as.matrix(p450) # Converts MSA into a character matrix.
consensusMatrix(p450) # Computes the residue frequencies for each MSA column and returns the results as matrix.

####################################
## Viewing of MSAs in HTML Format ##
####################################
## Viewing large alignments in a web browser can be very efficient. The following example outputs an HTML formatted alignment with consensus line.
StringSet2html <- function(msa=p450, file="p450.html", start=1, end=length(p450[[1]]), counter=20, browser=TRUE, ...) {
        if(class(msa)=="AAStringSet") msa <- AAStringSet(msa, start=start, end=end)
        if(class(msa)=="DNAStringSet") msa <- DNAStringSet(msa, start=start, end=end)
        msavec <- sapply(msa, toString)
        offset <- (counter-1)-nchar(nchar(msavec[1]))
        legend <- paste(paste(paste(paste(rep(" ", offset), collapse=""), format(seq(0, 
        nchar(msavec[1]), by=counter)[-1])), collapse=""), collapse="")

        consensus <- consensusString(msavec, ambiguityMap=".", ...)
        msavec <- paste(msavec, rowSums(as.matrix(msa) != "-"), sep="  ")
        msavec <- paste(format(c("", names(msa), "Consensus"), justify="left"), c(legend, msavec,
        consensus), sep="  ")

        msavec <- c("<html><pre>", msavec, "</pre></html>")
        writeLines(msavec, file)
        if(browser==TRUE) { browseURL(file) }
}
StringSet2html(msa=p450, file="p450.html", start=1, end=length(p450[[1]]), counter=20, browser=T, threshold=1.0) # Writes MSA to file "p450.html", which can be viewed in a browser.
StringSet2html(msa=p450, file="p450.html", start=450, end=470, counter=20, browser=T, threshold=1.0) # Same as above, but for alignment positions 450-470.

##########################################################
## Computation of Similarity/Distance Matrices for MSAs ##
##########################################################
masim <- sapply(names(p450), function(x) sapply(names(p450), function(y) pid(PairwiseAlignedXStringSet(p450[[x]], p450[[y]]), type="PID2")))
masim # Returns a percent identity matrix. The identity method can be specified under the "type" argument, see help for pid() function.
madist <- 1-(masim/100)
madist # Returns corresponding distance matrix.
disttree <- hclust(as.dist(madist))
plot(as.dendrogram(disttree), edgePar=list(col=3, lwd=4), horiz=T) # Computes and plots distance tree for MSA.

####################################################
## Mapping Sequence Positions to Gapped Alignments ##
####################################################
## Remove gaps from sequences, e.g. for pattern matching
removeGaps <- function(align=p450[[1]], gap="-") {
        myseq <- maskMotif(align, gap)
        myseq <- paste(as(myseq, "XStringViews"), collapse="")
        return(myseq)
}
p450seq <- AAStringSet(sapply(p450, function(x) removeGaps(align=x)))
mypos1 <- matchPattern("FSAGKRICAG", p450seq[[1]]); mypos1 # Removes gaps from sequences and provides a pattern matching example for P450 heme binding site.
p450gapmask <- maskMotif(p450[[1]], "-")

## Mapping sequence positions to gapped alignments
seqAlignmap <- function(seq=p450seq[1], align=p450[1], gap="-") {
        seqindex <- 1:length(seq[[1]])
        gapindex <- which(as.matrix(align)==gap)
        alignindex <- (1:length(align[[1]]))[-gapindex]
        names(alignindex) <- seqindex
        return(alignindex)
}
alignindex <- seqAlignmap(seq=p450seq[1], align=p450[1], gap="-") # Returns a named vector with the sequence coordinates in the name field and the corresponding alignment coordinates in the data field.
alignindex <- alignindex[start(mypos1)]:alignindex[end(mypos1)]; alignindex # Returns only the alignment positions for the above pattern match result.
AAStringSet(p450, start=min(alignindex), end=max(alignindex)) # Returns the alignment segment for the above pattern match result.

#############################################################
## Simple Example for Sliding Window Conservation Analysis ##
#############################################################
## (A) Compute conservation vector using a simple conservation model.
consCol <- function(myalign=p450) {
        myalign <- as.matrix(myalign)
        consV <- sapply(seq(along=myalign[1,]), function(x) max(table(myalign[,x][!myalign[,x]
        %in% "-"]))/length(myalign[,1]))

        return(consV)
}
consV <- consCol(myalign=p450)

## (B) Perform sliding window analysis on conservation vector.
slidingWindow <- function(mydata=consV, win=c(1, 100), start=1, end=length(consV)) {
        mydata <- mydata[start:end]
        windex <- t(sapply(0:length(mydata), function(x) win+x))
        mywind <- sapply(seq(along=mydata), function(x) mean(mydata[windex[x,1]:windex[x,2]],
        na.rm=TRUE))

        mywind <- mywind[1:(length(mywind)-win[2])]
        return(mywind)
}
slideV <- slidingWindow(mydata=consV, win=c(1, 20)) # Returns conservation vector for window size 20.

## (C) Plot sliding window conservation data
plot(slideV, type="l", lwd=2, col="red", main="Sliding Window Conservation Analysis", sub="Window Size: 20", xlab="Alignment Position", ylab="Conservation: max=1.0")
heme <- AAStringSet(p450, start=454, end=463); heme
text(454, 0.71, toString(heme[[1]]), cex=0.8) # The highly conserved cytochrome P450 heme binding signature maps to positions 454-463, which is located in the last big peak of the plot.

Table of Contents

Position Weight Matrices (PWM)

This section introduces functions for creating and matching position weight matrices (PWMs), and related utilities.

################################################################
## Searching for Patterns with Position Weight Matrices (PWM) ##
################################################################
?matchPWM # Gives an overview of functions for PWMs.
patterns <- DNAStringSet(c("GCT", "GGT", "GCA")) # Sample set of DNA fragments.
pwm <- PWM(patterns) # Computes a PWM for DNA fragments.
library(seqLogo); seqLogo(t(t(pwm) * 1/colSums(pwm))) # Plots pwm as sequence logo.
chr <- DNAString("AAAGCTAAAGGTAAAGCAAAA") # Creates a sample chromosome for searching.
matchPWM(pwm, chr, min.score=0.9) # Searches sample chromosome for PWM matches and returns matches that score better than the min.score cutoff.
PWMscoreStartingAt(pwm, chr, starting.at=c(4,10,16)) # Scores the PWM at predefined positions in the chromosome.

#################################
## Testing for Enriched Motifs ##
#################################
## Enrichment/depletion of PWM matches in sequence sets relative to
## frequency in search space (e.g. all promoters of genome)

db <- DNAStringSet(sapply(1:100, function(x) paste(sample(c("A","T","G","C"), 20, replace=T), collapse=""))) # Sample sequence set (e.g. all promoter sequences).
set <- sample(db, 10) # Test data set (e.g. peak sequences from a ChIP-Seq experiment).
patterns <- DNAStringSet(c("GCTAGC", "GGTACC", "GCATGC"))
pwm <- PWM(patterns)
source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/enrichMotif.R") # Imports enrichment function.
enrichTest(pwm=pwm, revcomp=FALSE, cutoff=0.9, set=set, db=db, occurrence=1) # Returns enrichment p-values based on hypergeometric distribution. The occurrence argument sets the minimum number of motif matches per sequence to consider. If the argument 'revcomp' is set to TRUE, the function will consider the matches for the plus and minus strands.
sapply(set, function(x) matchPWM(pwm, x, min.score=0.9)) # Returns the corresponding matches.
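## For transparency, the hypergeometric test underlying such an enrichment analysis can be sketched with base R alone. The following lines are an illustration (they are not part of the sourced enrichMotif.R script): they count the sequences with at least one PWM match in the test set and in the search space and then compute an enrichment p-value with phyper().
db_hits <- sum(sapply(db, function(x) countPWM(pwm, x, min.score="90%") >= 1)) # Number of database sequences with at least one match.
set_hits <- sum(sapply(set, function(x) countPWM(pwm, x, min.score="90%") >= 1)) # Number of test sequences with at least one match.
phyper(set_hits-1, db_hits, length(db)-db_hits, length(set), lower.tail=FALSE) # Probability of observing at least set_hits matching sequences in a random sample of length(set) sequences.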

Analysis Routines with Biostrings

The following examples introduce a variety of useful analysis routines that can be accomplished with functions provided by the Biostrings package.

Mapping of Affy Probes to Transcriptome
Example for aligning Affy probe sets back to their target mRNAs. The code retrieves the matching counts and mapping coordinates for all probes and probe sets. In addition, it identifies probe sets matching several genes and/or genes matching several probe sets. The given analysis routine can be easily adjusted to the needs of RNA-Seq data sets for obtaining read counts per mRNA.

##########################################
## Mapping Affy Probes to Transcriptome ##
##########################################

## (1) Import mRNA/cDNA sequences (here Arabidopsis)
library(Biostrings)
mygenes <- read.DNAStringSet("ftp://ftp.arabidopsis.org/home/tair/Sequences/blast_datasets/TAIR9_blastsets/TAIR9_cdna_20090619", "fasta") # If this returns an error message then first download the sequences to your desktop and import them from there.
names(mygenes) <- gsub(" \\|.*", "", names(mygenes)) # Extracts locus IDs from description line for sequence naming.

## (2) Create pattern dictionary from probe sets
library(ath1121501probe)
mydict <- DNAStringSet(ath1121501probe$sequence) # Obtains the sequences of the probes from ath1121501 chip.
mynames <- ath1121501probe$Probe.Set.Name # Obtains the corresponding probe set names.
# The next 4 lines create unique names for the probes by appending a counter.
mynames <- table(mynames)[mynames]
mycount <- unlist(sapply(mynames[!duplicated(names(mynames))], function(x) seq(1, x)))
mynames <- paste(names(mynames), mycount, sep="_")
names(mydict) <- mynames
mydict <- PDict(mydict) # Creates a PDict dictionary.

## (3) Count matches for each mRNA sequence
sapply(names(mygenes)[1:40], function(x) sum(countPDict(mydict, mygenes[[x]]))) # Returns matches for first 40 mRNAs.

## (4) Search with pattern dictionary each mRNA sequence (step 5 provides a much faster solution!)
hitCollect <- function(mydict=mydict, myref=mygenes[[1]]) { # Defines search function for many query sequences (here mRNAs)
        mymatch <- matchPDict(mydict, myref)
        hits <- sum(countIndex(mymatch)>=1)
        if(hits==0) { return(NULL) } else { return(unlist(mymatch)) } }
myhitList <- lapply(names(mygenes)[1:10], function(x) hitCollect(mydict=mydict, mygenes[[x]]))
names(myhitList) <- names(mygenes)[1:10]; myhitList # The last two lines perform the search for the first 10 mRNA sequences and organize the results in a list object. Searching all 39,000 mRNAs will take about 14 hours. 
countList <- sapply(names(myhitList), function(x) table(gsub("_at_.*", "_at", names(myhitList[[x]])))); countList # Counts for each mRNA the number of matching probes that belong to the same probe set.
data.frame(locusID=rep(names(countList), sapply(countList, length)), affyID=unlist(sapply(countList, names)), Nprobes=unlist(sapply(countList, as.vector))) # Converts countList into a data frame. With this data structure one can easily determine with tapply() all probe sets matching several genes and/or genes matching several probe sets. 

## (5) Same as step (4) but with much faster computation by using the concatenation of all mRNAs as reference
myconc <- BString(toString(mygenes)); myconc <- maskMotif(myconc, ", "); myconc <- DNAString(injectHardMask(myconc, "+")) # Concatenates sequences to a single string where each component is separated by '++'.
myhitList <- matchPDict(mydict, myconc) # Performs search against the single reference containing all mRNAs. This way the mapping of all probe sets will take only a few seconds. Mappings across mRNA boundaries are prevented by inserted "+" marks.
mymask <- maskMotif(myconc, "+") # Creates a mask for mRNA boundaries labeled with "+".
mRNAloc <- data.frame(start=start(as(mymask, "XStringViews")), end=end(as(mymask, "XStringViews")), IDs=names(mygenes)) # Returns mRNA start and end positions in reference.
myhitDF <- as.data.frame(unlist(myhitList)); myhitDF <- myhitDF[order(myhitDF$start), ]
mRNAends <- mRNAloc$end; myhitends <- myhitDF$end
mylength <- sapply(mRNAends, function(x) { mylength <- sum(myhitends <= x); myhitends <<- myhitends[!myhitends <= x]; return(mylength) })
tmphitDF <- cbind(myhitDF, mRNAloc[rep(seq(along=mRNAloc[,1]), mylength),])
finalhitDF <- cbind(start=tmphitDF[,1]-tmphitDF[,5]+1, end=tmphitDF[,2]-tmphitDF[,5]+1, tmphitDF[,c(4,7)]); finalhitDF[1:28,] # The previous five lines map the hits back to the corresponding mRNAs.
countList <- tapply(gsub("_at_.*", "_at", finalhitDF$names), as.vector(finalhitDF$IDs), table); countList[1:20] # Counts for each mRNA the number of matching probes that belong to the same probe set.
data.frame(locusID=rep(names(countList), sapply(countList, length)), affyID=unlist(sapply(countList, names)), Nprobes=unlist(sapply(countList, as.vector)))[1:40,] # Converts countList into a data frame. With this data structure one can easily determine with tapply() all probe sets matching several genes and/or genes matching several probe sets.
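## As a hypothetical follow-up to the tapply() hint in the comments above (the object name countDF is an assumption of this sketch), the following lines identify probe sets matching several genes and genes matched by several probe sets.
countDF <- data.frame(locusID=rep(names(countList), sapply(countList, length)), affyID=unlist(sapply(countList, names)), Nprobes=unlist(sapply(countList, as.vector))) # Stores the data frame from the previous line under a name.
genesPerSet <- tapply(as.character(countDF$locusID), as.character(countDF$affyID), function(x) length(unique(x))) # Number of distinct genes matched by each probe set.
genesPerSet[genesPerSet > 1] # Probe sets matching several genes.
setsPerGene <- tapply(as.character(countDF$affyID), as.character(countDF$locusID), function(x) length(unique(x))) # Number of distinct probe sets matching each gene.
setsPerGene[setsPerGene > 1] # Genes matched by several probe sets.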


Analyzing Assembly Results
With growing read lengths, de novo assemblies of genome or transcriptome data are nowadays a routine task in the NGS field. Due to their large data sizes, assemblies in this area often require very extensive memory and CPU resources which are often only available on high performance compute systems. Commonly used assembly software for NGS data are SSAKE, SHARCGS, MIRA, VCAKE, Newbler, Celera Assembler, Euler, Velvet, ABySS, ALLPATHS2, ALLPATHS-LG, SOAPdenovo, Trinity, Ray and others. The following code samples are useful for evaluating and comparing the assembly results from these non-R assembly tools.

Compute N50 Values for Assembly Results
The N50 is a weighted median statistic such that 50% of the entire assembly is contained in contigs equal to or larger than this value, usually given in base pairs (bp). Formally, the N50 is calculated as follows: store the lengths of all contigs in a vector L of positive integers. Create a vector Lr where each integer is repeated as many times as given by the sum of its values in L. After this the N50 value is computed as the median of Lr. For example: if L = c(2, 2, 3, 3, 5, 7), then Lr consists of four 2's, six 3's, five 5's, and seven 7's. The N50 of L is the median of Lr, which is 5. In R the N50 value can be computed according to this definition as shown in the code box below. However, for large contig sets it is much more time and memory efficient to use the cumulative sum method introduced in the next subsection.

L <- c(2, 2, 3, 3, 5, 7) # Stores the length of each contig in a vector.
Lr <- unlist(tapply(L, L, function(x) rep(x[1], sum(x))))
median(Lr) # Returns N50 value for L

Plot Cumulative Length Distribution of Contigs
With very large contig sets the above implementation for computing the N50 value can become inefficient. A more practical estimation is to sort L decreasingly, compute its cumulative sum and then identify the length value at which the cumulative sum covers at least 50% of the combined length of all contigs or of the length of a known target/reference sequence. Note that comparisons among N50 values from different assemblies are only meaningful when the same combined length value is used for their calculation. Thus, a known target length value can often be a good solution for comparing assembly results.
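
As an illustration of this cumulative sum approach, the following minimal sketch (not part of the original code) computes the N50 value for the small example vector L from above relative to the combined length of its contigs.

L <- c(2, 2, 3, 3, 5, 7) # Example contig lengths from above.
Lsorted <- sort(L, decreasing=TRUE) # Sorts the contig lengths decreasingly.
Lsorted[which(cumsum(Lsorted) >= 0.5 * sum(L))[1]] # Returns the first length at which the cumulative sum covers at least 50% of the combined contig length. For this example the result is 5, in agreement with the median-based definition above.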

The following contigStats function calculates the N50 values according to the more efficient cumulative sum method. In addition, it generates a distribution plot of the cumulative contig lengths, which can be a very informative visual representation for comparing assembly results. In this plot, the N50 value is the contig size (Y-axis) at 50% of the assembly coverage (X-axis).

## (A) Import contigs in FASTA format with the read.DNAStringSet function.
## The following example creates two test assembly sets of random sequences.
library(Biostrings)
assembly1 <- DNAStringSet(sapply(1:5000, function(x) paste(sample(c("A","T","G","C"), sample(200:1000, 1), replace=T), collapse="")))

assembly2 <- DNAStringSet(sapply(1:4500, function(x) paste(sample(c("A","T","G","C"), sample(200:800, 1), replace=T), collapse="")))
 
## (B) Store the contig length data for each assembly as vectors in a list

N <- list(assembly1=width(assembly1),
assembly2=width(assembly2))

## (C) Define combined length value for each assembly set
reflength <- sapply(N, sum) # Example uses the combined length of the contigs for each assembly. Alternatively, one can pass on here a known target length value.

## (D) Compute N50 values and plot cumulative length distribution with contigStats function
source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/contigStats.R")
contigStats(N=N, reflength=reflength, style="ggplot2") # Plots cumulative contig lengths with the ggplot2 library. Alternatively, the same plot can be generated with R's base graphics by assigning "base" to the style argument. The title and axis labels can be changed with the arguments main, xlab and ylab, respectively.
stats <- contigStats(N=N, reflength=reflength, style="data") # Returns a list with cumulative length vectors for each assembly and a contig summary matrix.
stats[["Contig_Stats"]] # Returns the contig summary matrix with the N50 values in the second column.


Handling Sequence Ranges with IRanges and GenomicRanges

Integer ranges are commonly used to represent coordinates of alignment positions, or various annotation features (e.g. genes) within genomes. They are central to many applications in the genome annotation and NGS analysis areas, such as RNA-Seq, ChIP-Seq and SNP-Seq.
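
As a minimal illustration of this idea (using made-up coordinates rather than real annotation data), the following sketch counts how many read alignment ranges overlap each of two toy gene ranges; this count-by-overlap operation is at the core of RNA-Seq read counting.

library(IRanges)
genes <- IRanges(start=c(100, 500), end=c(300, 800), names=c("gene1", "gene2")) # Toy gene coordinates (assumed for this example).
reads <- IRanges(start=c(120, 150, 510, 780, 900), width=36) # Toy read alignment positions.
countOverlaps(genes, reads) # Returns the number of reads overlapping each gene.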

IRanges Overview

Documentation

IRanges provides the low-level infrastructure and containers for handling sets of integer ranges within Bioconductor's BioC-Seq domain. Its classes and methods provide support for many higher-level packages such as GenomicRanges, ShortRead, Rsamtools, etc.

#######################################
## Looping Over Sequences of Numbers ##
#######################################

library(IRanges) # Load IRanges library.
aggregate(1:100, start = 1:90, width = 10, FUN = median) # Example for looping over a numeric vector to perform a sliding window analysis (here median) for a window of width 10. Related looping functions are shiftApply and endoapply.
x <- Rle(1:10, 1:10) # Takes vector and stores it in run-length encoded format.
runsum(x, k = 3) # The ‘runsum’, ‘runmean’, ‘runwtsum’, ‘runq’ functions calculate the sum, mean, weighted sum, and order statistics for fixed width running windows.

#####################
## Sequence Ranges ##
#####################

ranges1 <- IRanges(start=c(1,2,3), end=c(2,3,4), names=month.name[1:3]) # Constructs an IRanges object with 3 ranges: 1-2, 2-3, and 3-4. These ranges are often used to represent genome features or alignment coordinates.
start(ranges1); end(ranges1); width(ranges1); names(ranges1) # Methods for accessing the sub-components of an IRanges object.
ranges2 <- IRanges(start=c(9,15,21), end=c(10,19,50),
names=month.name[6:8])
rangeList <- IRangesList(ranges1, ranges2) # Constructs a list of IRanges objects, and can be used to associate a set of ranges to filter, trim, or align each string in XStringSet objects.
ir <- IRanges(c(1, 8, 14, 15, 19, 34, 40), width = c(12, 6, 6, 15, 6, 2, 7)) # Sample IRanges set.
reduce(ir) # Collapses overlapping ranges (e.g. overlapping reads or genes) to continuous ranges.
reduce(ir, min.gapwidth=5) # Same as above, but with a minimum distance (gap) of 5 among ranges.
disjoin(ir) # Returns disjoint ranges.
gaps(ir) # Returns uncovered regions (gaps) in an IRanges object.
chr <- DNAString("GCATATTACAATCGATCCGCATATTAC")
irextract <- IRanges(start = c(1, 25), width = 3)
Views(chr, irextract) # Example for extracting sub-sequences and storing them in a Views object.
seqselect(chr, irextract) # Example for extracting and joining sequences (e.g. exons).
ol <- findOverlaps(ir, reduce(ir), minoverlap=1); as.matrix(ol) # Identifies overlapping ranges.
nearest(IRanges(37, 38), ir) # Identifies the nearest neighbor range.
precede(IRanges(37, 38), ir) # Identifies the nearest non-overlapping downstream neighbor.
follow(IRanges(37, 38), ir) # Identifies the nearest non-overlapping upstream neighbor.

#########################
## Run Length Encoding ##
#########################

xRle <- Rle(c(1,1,2,3:30)) # Constructs an Rle (run length encoded) object. This is an efficient compressed way to store a long integer string with a lot of repeated elements, and is often used to represent genomic coverage of short read alignments for each residue in the genome.
as.vector(xRle); as.numeric(xRle) # Functions to coerce an Rle object into a numeric vector.
slice(xRle, 2) # Returns a Views object where the Rle values are at least 2.
xRle[slice(xRle, 2)] # Returns the corresponding section in the Rle object.
cov <- coverage(ranges2) # Computes coverage of the ranges2 IRanges object.
Views(xRle, irextract) # Subsetting example for Rle objects.
seqselect(xRle, irextract) # Example for extracting and joining Rle objects.

##############################
## Plotting IRanges Objects ##
##############################
plotRanges <- function(x, xlim = x, main = deparse(substitute(x)), col = "black", sep = 0.5, ...) {
        height <- 1
        if (is(xlim, "Ranges")) xlim <- c(min(start(xlim)), max(end(xlim)))
        bins <- disjointBins(IRanges(start(x), end(x) + 1))
        plot.new()
        plot.window(xlim, c(0, max(bins) * (height + sep)))
        ybottom <- bins * (sep + height) - height
        rect(start(x) - 0.5, ybottom, end(x) + 0.5, ybottom + height, col = col, ...)
        title(main)
        axis(1)
}
ir <- IRanges(c(1, 8, 14, 15, 19, 34, 40), width = c(12, 6, 6, 15, 6, 2, 7))
plotRanges(ir)
plotRanges(reduce(ir))
plotRanges(disjoin(ir))


#####################################################################
## Finding the Start/End Positions of Continuous Ranges in Vectors ##
#####################################################################
rangeStartEnd <- function(rangev) {
        ## Range start positions
        upshift <- c(0, rangev[-length(rangev)])
        tmp <- rangev!=upshift
        storage.mode(tmp) <- "numeric"
        start <- rangev
        start[rangev!=tmp] <- 0
        ## Range end positions
        downshift <- c(rangev[-1],0)
        tmp <- rangev!=downshift
        storage.mode(tmp) <- "numeric"
        end <- rangev
        end[rangev!=tmp] <- 0
        return(IRanges(start=which(start==1), end=which(end==1)))
}
x <- c(1,1,1,0,0,0,1,0,1,0,1,1,1,1,0) # Sample data set.
rangeStartEnd(rangev=x) # Returns the start and end positions of continuous ranges.
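## For comparison, the IRanges constructor also accepts a logical vector and returns the ranges of
## consecutive TRUE runs, which should give the same result for the sample vector above:
IRanges(as.logical(x)) # Converts the runs of 1s in 'x' into an IRanges object.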


GenomicRanges Overview

Documentation

The GenomicRanges package serves as the foundation for representing genomic locations within the Bioconductor project. It is built upon the IRanges infrastructure and defines three major data containers - GRanges, GRangesList and GappedAlignments - which support other important BioC-Seq packages including ShortRead, Rsamtools, rtracklayer, GenomicFeatures and BSgenome. Compared to the IRanges container, the GRanges/GRangesList classes are more flexible and extensible, allowing them to store additional information about sequence ranges, such as chromosome identifiers (sequence space), strand information and annotation data.

######################################
## A.1 Constructing GRanges Objects ##
######################################

library(GenomicRanges)
gr <- GRanges(seqnames = Rle(c("chr1", "chr2", "chr1", "chr3"), c(1, 3, 2, 4)), ranges = IRanges(1:10, end = 7:16, names = head(letters, 10)), strand = Rle(strand(c("-", "+", "*", "+", "-")), c(1, 2, 2, 3, 2)), score = 1:10, GC = seq(1, 0, length = 10)) # Example of creating a GRanges object with its constructor function.
library(rtracklayer)
gff <- import.gff("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/Samples/gff3.gff", asRangedData=FALSE, version="3") # Imports a simplified GFF3 genome annotation file and stores it as GRanges object. For importing additional annotation or aligned read formats, such as bed or wig, consult the help documentation with '?import.gff'.
seqlengths(gff) <- end(ranges(gff[which(elementMetadata(gff)[,"type"]=="chromosome"),])) # Populates slot for chromosome length data.
names(gff) <- 1:length(gff) # Assigns names to corresponding slot.
gff # When viewing GRanges objects, the information is separated by '|' symbols into a left and right hand region. The genomic coordinates (seqnames, ranges and strand) are located on the left-hand side and the element metadata (annotation) columns on the right. The genomic coordinates are essential components of this class, whereas the annotation columns are optional and customizable by the user.
gff_rd <- as(gff, "RangedData") # Coerces GRanges object to RangedData class.
as(gff_rd, "GRanges") # Coerces
RangedData object to GRanges class.

#############################################################
## A.2 Accessor and Subsetting Methods for GRanges Objects ##
#############################################################
gff[1:4]; gff[1:4, c("type", "Name")]; gff[2] <- gff[3] # The subsetting and replacement syntax of GRanges objects is similar to other R objects.
c(gff[1:2], gff[401:402]) # GRanges objects can be concatenated with the c() function.
seqnames(gff); ranges(gff); strand(gff); seqlengths(gff) # Accessor functions for genomic coordinate and sequence length slots.
start(gff[1:4]); end(gff[1:4]); width(gff[1:4]) # Direct access methods for IRanges components of GRanges objects.
elementMetadata(gff); elementMetadata(gff)[, "type"] # Accessor functions/syntax for metadata component which behaves like a data frame. 
gff[subgene_index <- which(elementMetadata(gff)[ ,"type"] == "gene")] # Returns only ranges for 'gene' start and end positions.

##############################################
## A.3 Useful Utilities for GRanges Objects ##
##############################################
## Note, range operations on GRanges objects are strand specific. To make them strand
## insensitive, one can erase the strand information with: strand(gff) <- "*"
shift(gff[2:5], 5) # Moves ranges by a specified value (here 5).
resize(gff[2:5], 5) # Sets the width of the ranges to a specified value (here 5).
reduce(gff) # Collapses overlapping ranges to continuous ranges.
gaps(gff) # Returns uncovered regions.
intersect(gff[2], gff[4]); union(gff[2], gff[4]) # Intersect and union of range overlaps.
disjoin(gff) # Returns disjoint ranges.
coverage(gff) # Returns coverage of ranges.
findOverlaps(gff, gff[1:4]) # Returns the index pairings for the overlapping ranges among a query and a subject GRanges object.
countOverlaps(gff, gff[1:4]) # Counts the ranges in the query object that overlap with the ranges in the subject object.
subsetByOverlaps(gff, gff[1:4]) # Returns only ranges in query object that are overlapping with the ranges in the subject object.

#########################################################
## B.1 Constructing and Subsetting GRangesList Objects ##
#########################################################

sp <- split(gff) # Stores every annotation range in GRanges object in a separate component of a GRangesList object.
split(gff, seqnames(gff)) # Stores ranges for each chromosome in a separate GRangesList component.
unlist(sp) # Returns data as GRanges object.
sp[1:4, "type"] # The subsetting of GRangesList objects is very similar to the syntax used for GRanges objects.
lapply(sp, length); sapply(sp, length) # Looping over GRangesList objects works similarly to standard list objects.

#################################
## C.1 GappedAlignments Object ##
#################################

library(Rsamtools)
aln1_file <- system.file("extdata", "ex1.bam", package = "Rsamtools")
aln1 <- readGappedAlignments(aln1_file); aln1 # Imports BAM file with function from Rsamtools library and stores it as GappedAlignment object.

aln2 <- scanBam(aln1_file) # Note: the GappedAlignments object does not contain all data components of a SAM/BAM file. To import all information stored in these files, the scanBam import function can be used instead.
rname(aln1); strand(aln1); cigar(aln1); qwidth(aln1); start(aln1); end(aln1); width(aln1); ngap(aln1) # Basic accessor methods for GappedAlignments objects.

Analysis Routines with IRanges/GenomicRanges

Compute Intron and Intergenic Ranges from GFF/GTF Annotations
The following code shows how one can compute intergenic, intron and overlapping gene ranges from GFF3/GTF files and then combine the results with existing ranges. Both the input and output objects are of class GRanges.

###########################################################################################
## Generate Ranges Omitted in GFF/GTF Annotations Including Intron and Intergenic Ranges ##
###########################################################################################
library(rtracklayer); library(GenomicRanges); library(Rsamtools)
source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/gffMod.R") # Imports gffFeat function. It can also be called by old name: 'gffMod'.
system("wget ftp://ftp.arabidopsis.org/home/tair/Genes/TAIR10_genome_release/TAIR10_gff3/TAIR10_GFF3_genes.gff") # Or manually download this file to your current working directory.
gff <- import.gff("TAIR10_GFF3_genes.gff", asRangedData=FALSE)          
gffmod <- getFeat(x=gff, format="gff", range_types=c("intergenic", "gene_red")) # Argument 'range_types' supports the following values: intergenic, gene_red, intron, gene and exon. The input format ("gff" or "gtf") can be specified under the 'format' argument. Note: the computation of the intron ranges will take some time.
gffmod # Returns GRanges object.
Computing Absolute and Relative Overlaps Among Ranges
The following introduces a function for identifying different types of overlaps among gene, annotation and alignment ranges and computing their absolute/relative lengths. This operation is often useful for various analysis routines in the ChIP-Seq and RNA-Seq areas. 

###################################################################################
## Identification of Different Overlap Types and their Absolute/Relative Lengths ##
###################################################################################

source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/rangeoverlapper.R
") # Imports olRanges function.
library(GenomicRanges)
grq <- GRanges(seqnames = Rle(c("chr1", "chr2", "chr1", "chr3"), c(1, 3, 2, 4)), ranges = IRanges(seq(1, 100, by=10), end = seq(30, 120, by=10)), strand = Rle(strand(c("-", "+", "-")), c(1, 7, 2)))
grs <- shift(grq[c(2,5,6)], 5) # Creates two sample GRanges data sets.
myol <- olRanges(query=grq, subject=grs, output="gr"); myol # Runs olRanges function on grq and grs, which both need to be of the same class. GRanges or IRanges are supported. The function returns a GRanges object when "gr" is assigned to the output argument and a data frame when "df" is assigned. The four overlap types recorded in the last column of the output ("olup", "oldown", "inside" and "contained") are illustrated here.
myol[elementMetadata(myol)[, "OLpercQ"] > 50] # Returns overlaps that cover at least 50% of the query ranges.
myol[elementMetadata(myol)[, "OLpercS"] > 50] # Returns overlaps that cover at least 50% of the subject ranges.
 

#######################################
## Older Alternative for Data Frames ##
#######################################

source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/rangeoverlapper.R
") # Imports rangeOL function. 
featureDF <- data.frame(Start=c(1, 20, 90, 150, 180), End=c(40, 30, 105, 160, 200), Feature="mRNA") # Sample data set.
rangeOL(featureDF=featureDF, label="ID1", start=25, end=35) # Runs rangeOL function for custom feature "ID1" with start=25 and end=35. The four overlap types olup, oldown, inside and contained are explained here.
lapply(seq(along=featureDF[,1]), function(x) rangeOL(featureDF=featureDF, label=featureDF[x,3], start=as.numeric(featureDF[x,1]), end=as.numeric(featureDF[x,2]))) # Example for running the function on many custom features.

Parsing Gene/Intergenic Regions Based on GTF Input (IRanges Version)
GTF (gene transfer format) is a standard file format for storing the mappings of genomic features in a tabular text file. The feature coordinates in the GTF file can be used to parse their corresponding sequence segments from chromosome sequences (GTF Format Definition). The following example is based on the very efficient range objects provided by the IRanges library.

########################################
## Coordinates for Intergenic Regions ##
########################################
## (A.1) Import/parsing of GTF containing only the start and end positions for each gene of all chromosomes.
## The GTF file needs to be downloaded from http://uswest.ensembl.org/info/data/ftp/index.html
library(Biostrings); library(IRanges)
gtf2genePos <- function(myfile="Homo_sapiens.GRCh37.61.gtf") {
        gtf <- read.delim(myfile, header=FALSE)
        colnames(gtf) <- c("seqname", "source", "feature", "start", "end", "score", "strand", "frame", "attributes")
        chronly <- c(1:22, "X", "Y", "MT")
        gtf <- gtf[as.character(gtf$seqname) %in% chronly, ] # Cleanup to remove non-chromosome rows
        gtf[, length(gtf[1,])] <- gsub(" {1,}|gene_id|;.*", "", gtf$attributes)
        gtf1 <- gtf[order(gtf$seqname, gtf$start), ]
        gtf1 <- gtf1[!duplicated(gtf1$attributes), ] # Returns only rows with most left gene positions.
        gtf2 <- gtf[order(gtf$seqname, -gtf$start), ]
        gtf2 <- gtf2[!duplicated(gtf2$attributes), ] # Returns only rows with most right gene positions.
        gtf1 <- gtf1[order(gtf1$attributes), ]
        gtf2 <- gtf2[order(gtf2$attributes), ]
        gtf1[,5] <- gtf2[,5] # Assigns proper end positions
        gtfgenepos <- gtf1 # Returns gtf containing only start/end positions for all genes
        gtfgenepos <- gtfgenepos[order(gtfgenepos$seqname, gtfgenepos$start), ]
        return(gtfgenepos)
}
genepos <- gtf2genePos(myfile="Homo_sapiens.GRCh37.59.gtf") # This will take some time due to the large size of the GTF file for human.


## (A.2) Download and import chromosome sequence (here chr 1 from human)

system("wget ftp://ftp.ensembl.org/pub/current_fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.57.dna.chromosome.1.fa.gz") # Note: system commands only relevant for Linux OSs.
system("gunzip Homo_sapiens.GRCh37.57.dna.chromosome.1.fa.gz")
chr1 <- read.DNAStringSet(file="Homo_sapiens.GRCh37.57.dna.chromosome.1.fa", "fasta") # Note: the file name needs to match the chromosome FASTA file downloaded in the previous step.
unlink("Homo_sapiens.GRCh37.57.dna.chromosome.1.fa")


## (B.1) Parsing of gene sequences
chrgenepos <- genepos[genepos$seqname==1,] # Selects entries for chr 1.
ir <- IRanges(start=chrgenepos$start, end=chrgenepos$end, names=chrgenepos$attributes) # Creates IRanges object from GTF entries for chr 1.
geneschr1 <- Views(chr1[[1]], start=start(ir), end=end(ir)) # Returns gene sequences in Views object.
geneschr1 <- DNAStringSet(geneschr1); names(geneschr1) <- names(ir) # Returns sequences as DNAStringSet object.


## (B.2) Parsing of intergenic sequences
## (B.2.1) Collapse overlapping gene positions to continuous ranges.

irred <- reduce(ir) # Collapses ranges of overlapping genes to continuous range.
ol <- findOverlaps(ir, irred) # Identifies overlaps among initial and reduced range data sets.
names(irred) <- tapply(names(ir), as.matrix(ol)[,2], paste, collapse="_") # Assigns proper names to reduced data set.

## (B.2.2) Convert gene positions to intergenic positions
intergenic <- irred
start(intergenic) <- c(1, end(irred)[-length(irred)]+1)
end(intergenic) <- start(irred) - 1
names(intergenic) <- c("start", paste(names(irred)[-length(irred)], names(irred)[-1], sep="_:_"))
mywidth <- 100 # Note: mywidth needs to be set to distance from end of last gene to chromosome end!
intergenic <- c(intergenic, IRanges(start=end(irred)[length(irred)]+1, width=mywidth, names="end"))

## (B.2.3) Parsing of intergenic regions
interchr1 <- Views(chr1[[1]], start=start(intergenic), end=end(intergenic))
interchr1 <- DNAStringSet(interchr1)
names(interchr1) <- names(intergenic)

Parsing Genomic Features Based on GFF Input (old non-IRanges version)
GFF (general feature format) is a standard file format for storing the mappings of genomic features in a tabular text file. The feature coordinates in the GFF file can be used to parse their corresponding sequence segments from chromosome sequences (GFF3 Format Definition).

################################################
## Import GFF and Extract Feature Coordinates ##
################################################

## (1) Import chromosome sequences and corresponding GFF3
library(Biostrings)
read.gff3 <- function(file="gff3.txt") {
        gff <- read.delim(file)
        colnames(gff) <- c("seqid", "source", "type", "start", "end", "score", "strand", "phase", "attributes")
        return(gff)
}
gff <- read.gff3(file="http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/Samples/gff3.txt")
chr <- read.DNAStringSet(file="http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/Samples/genome_sample.txt", "fasta") # Imports a sample gff and the corresponding chromosome sequences (here a highly simplified example!)

## (2) Extract feature coordinates of interest
gff2Feature <- function(gff=gff, feature="exon", discont=TRUE) {
        features <- gff[gff[,3]==feature,]
        if(discont==TRUE) { # Sorting of discontinuous features is important to assure
                            # correct feature order (e.g. exons of mRNAs).
                sortfeature <- tapply(1:length(features[,1]), features[,9], function(x) x[order(features[x,4])])
                features <- features[as.numeric(unlist(sortfeature)),]
        }
        return(features)
}

## (3) Parsing of continuous sequences (e.g. genes, promoters)
seqParse <- function(chr=chr, features=gene, chrid="Chr1", revcomp="-") {
        features <- features[features[,1]==chrid, ]
        fragment <- Views(chr[[which(names(chr) == chrid)]], start = features[, 4],
        end = features[, 5])

        names(fragment) <- features[, 9]
        if(revcomp != FALSE) {
                revcompindex <- which(names(fragment) %in%
                unique(features[features[,7]==revcomp, 9]))

                revcompseq <- reverseComplement(fragment[revcompindex])
                fragment <- c(DNAStringSet(fragment[-revcompindex]), DNAStringSet(revcompseq))
        }
        return(fragment)
}
gene <- gff2Feature(gff=gff, feature="gene", discont=FALSE)
gene1chr <- seqParse(chr=chr, features=gene, chrid="Chr1", revcomp="-") # Returns gene sequences from the first chromosome. The reverse complement will be returned for all entries with a "-" in the 'strand' column of the gff when "-" is assigned to the revcomp argument.
geneallchr <- sapply(c(names(chr)), function(x) seqParse(chr=chr, features=gene, chrid=x, revcomp="-")) # Same as in previous step but for all chromosomes.

## (4) Parsing of discontinuous features and joining them to continuous sequences: e.g. exon to mRNA sequences
discseqParse <- function(chr=chr, features=exons, chrid="Chr1", revcomp="-") {
        features <- features[features[,1]==chrid, ]
        fragment <- Views(chr[[which(names(chr) == chrid)]], start = features[, 4],
        end = features[, 5])

        fragment <- DNAStringSet(paste(fragment, collapse=""))
        fragmentindex <- features[features[,1]==chrid, 5] - features[features[,1]==chrid, 4] + 1
        mylength <- tapply(fragmentindex, features[features[,1]==chrid, 9], sum)
        mylength <- sapply(seq(along=mylength), function(x) sum(mylength[1:x]))
        mystart <- c(1, mylength[-length(mylength)] + 1)
        fragment <- Views(fragment[[1]], start = mystart, end = mylength)
        names(fragment) <- unique(features[features[,1]==chrid, 9])
        if(revcomp != FALSE) {
                revcompindex <- which(names(fragment) %in%
                unique(features[features[,7]==revcomp, 9]))

                revcompseq <- reverseComplement(fragment[revcompindex])
                fragment <- c(DNAStringSet(fragment[-revcompindex]), DNAStringSet(revcompseq))
        }
        return(fragment)
}
exons <- gff2Feature(gff=gff, feature="exon", discont=TRUE)
exons[, 9] <- gsub("^.*=", "", exons[,9]) # Note: the last column provides the exon-to-gene associations. This may need to be adjusted for other GFFs!!  
cdna1chr <- discseqParse(chr=chr, features=exons, chrid="Chr1", revcomp="-") # Parses all UTR/exon sequences and returns them as continuous cDNA sequences in proper orientation.
cdnaallchr <- sapply(c(names(chr)), function(x) discseqParse(chr=chr, features=exons, chrid=x, revcomp="-")) # Same as in previous step but for all chromosomes.
eval(parse(text=paste("c(", paste("cdnaallchr[[", 1:length(cdnaallchr), "]]", sep="", collapse=", "), ")", sep=""))) # Returns sequences in a single DNAStringSet object.

## (5) CDS parsing, ORF assembly and translation into proteins
cds <- gff2Feature(gff=gff, feature="CDS", discont=TRUE)
cds[, 9] <- gsub(",.*", "", cds[,9]) # Note: the last column provides the exon-to-gene associations. This may need to be adjusted for other GFFs!!
cds <- sapply(c(names(chr)), function(x) discseqParse(chr=chr, features=cds, chrid=x, revcomp="-"))
cds <- eval(parse(text=paste("c(", paste("cds[[", 1:length(cds), "]]", sep="", collapse=", "), ")", sep="")))
mypep <- AAStringSet(sapply(seq(along=cds), function(x) toString(translate(cds[[x]]))))
names(mypep) <- names(cds)

Motif Discovery

cosmo

Documentation

The cosmo package allows searching a set of unaligned DNA sequences for a shared motif that may function as a transcription factor binding site. The algorithm extends the popular motif discovery tool MEME (Bailey and Elkan, 1995) by allowing the search to be supervised with a set of constraints that the motif to be discovered must satisfy.

#################################################
## (A) Simulating Sequences Containing a Motif ##
#################################################
library(cosmo)
data(motifPWM); attributes(motifPWM) # Loads a sample position weight matrix (PWM) containing 8 positions.
plot(motifPWM) # Plots the PWM as sequence logo. 
data(transMats) # Loads the transition matrices for a second-order Markov model to define background.
simSeqs <- rseq(20, 100, 1, motifPWM, transMats, "ZOOPS") # Generates 20 sequences of 100bp each according to the ZOOPS model, using the imported position weight matrix and background distribution. Available models for motif occurrences: zero-or-one-occurrence-per-sequence (ZOOPS), one-occurrence-per-sequence (OOPS) and two-component mixture model (TCM) with any number of motif occurrences.
library(Biostrings); test <- DNAStringSet(sapply(simSeqs$seq, function(x) unlist(x[[1]]))) # Example for converting cosmo's sequence format (list) into a DNAStringSet object.
DNAStringSet(test, start=attributes(simSeqs$motifs)$pos, end=attributes(simSeqs$motifs)$pos+7) # Returns motif positions for all sequences.
names(test) <- LETTERS[1:20]; test <- lapply(as.list(test), toString); lapply(names(test), function(x) list(seq=test[[x]], desc=x)) # Converts DNAStringSet back into cosmo format. Note: the name slot needs to be populated for this step.

########################################
## (B) Defining Constraint Parameters ##
########################################
## Define first constraint set
conSet1 <- makeConSet(numInt = 3, type = c("B", "V", "B"), length = c(3, NA, 3)) # Defines a constraint set consisting of 2x terminal 3bp intervals and one variable interval in the middle.
boundCon1 <- makeBoundCon(lower = 1, upper = 2) # Require the information content across the first interval to be bounded between 1.0 and 2.0.
boundCon2 <- makeBoundCon(lower = 0, upper = 1) # Same for the second interval.
palCon1 <- makePalCon(int1 = 1, int2 = 3, errBnd = 0.05) # Require the 1st and 3rd interval to be palindromes of each other.
constraint <- list(boundCon1, boundCon2, palCon1) # Organize constraints in a list.
int <- list(1, 2, NA)
conSet1 <- addCon(conSet = conSet1, constraint = constraint, int = int) # Combines everything in one constraint set.
## Define second constraint set
conSet2 <- makeConSet(numInt = 1, type = "V", length = NA)
subCon1 <- makeSubMotifCon(submotif = "TATA", minfreq = 0.9)
# Constructs a second constraint set that requires the motif to contain the submotif TATA.
conSet2 <- addCon(conSet = conSet2, constraint = subCon1, int = NA)

########################################
## (C) Define Background Markov Model ##
########################################
## By default, the Markov model for the distribution of background nucleotides is estimated from the input sequences supplied by the user. However, it is also possible to estimate this model from a separate, usually larger set of sequences, such as all promoters of a genome.
bgFile <- system.file("Exfiles", "bgSeqs", package = "cosmo") # Loads a sample background sequence data set provided by the cosmo library.
tm <- bgModel(bgFile) # Estimates the background Markov model from the sample sequences. For custom data sets, one can point here to a FASTA file containing the background sequences.

######################
## (D) Motif Search ##
######################
res <- cosmo(seqs = simSeqs$seq, constraints = list(conSet1, conSet2), minW = 7, maxW = 8, models = c("OOPS", "TCM"), transMat = tm$transMat) # Searches the simulated sequences from step (A) for a shared motif, considering as candidate constraint sets the two constraint sets constructed under (B). As candidate model types OOPS and TCM will be tested, and as candidate motif widths 7 through 8.
summary(res); names(attributes(res)) # The summary() method prints a detailed summary of the result.
res@cand; res@sel # The cand slot summarizes the model selection process and the sel slot the chosen motif, which has in this case a width of 8 and meets the OOPS model criteria. 
summary(res@motifs) # Prints motif matches in sequences with an E-value for the alignment.
plot(res) # Plots the sequence logo for the identified PWM.
x11(); plot(res, type="prob") # Plots a map of the sequences with their motif matches.


BCRANK

Documentation

BCRANK is a method that takes a ranked list of genomic regions as input and outputs short DNA sequences that are overrepresented in some part of the list. The algorithm was developed for detecting transcription factor (TF) binding sites in a large number of enriched regions from high-throughput ChIP-chip or ChIP-seq experiments, but it can be applied to any ranked list of DNA sequences.

##########################################
## Predicting Overrepresented Sequences ##
##########################################
library(BCRANK)
fastaFile <- system.file("Exfiles/USF1_small.fa", package = "BCRANK") # Sample FASTA file with ranked ChIP regions provided by the package.
set.seed(0)
BCRANKout <- bcrank(fastaFile, restarts=25, use.P1=TRUE, use.P2=TRUE) # Runs the BCRANK motif search (can take some time).
data(BCRANKout) # Alternatively, loads the precomputed result object shipped with the package.
topMotif <- toptable(BCRANKout, 1) # Extracts the top-ranked motif.
weightMatrix <- pwm(topMotif, normalize = FALSE) # Position weight matrix of the top motif (raw counts).
weightMatrixNormalized <- pwm(topMotif, normalize = TRUE) # Normalized position weight matrix (column frequencies).
library(seqLogo)
seqLogo(weightMatrixNormalized) # Plots the motif as a sequence logo.
plot(topMotif) # Plots the top motif object.

##############################
## Predicting Binding Sites ##
##############################
topConsensus <- as.character(toptable(BCRANKout)[1, "Consensus"]) # Consensus sequence of the top-ranked motif.
print(topConsensus)
bindingSites <- matchingSites(fastaFile, topConsensus) # Reports matches of the consensus in the input sequences.
nrSites <- nrow(bindingSites)
cat("Number predicted binding sites:", nrSites, "\n")
print(bindingSites[1:15, ]) # Shows the first 15 predicted binding sites.

Additional Motif Discovery Packages

ShortRead

Documentation

The ShortRead package provides input, quality control, filtering, parsing and manipulation functionality for short read sequences produced by high-throughput sequencing technologies. While support is provided for many sequencing technologies, this package is primarily focused on Solexa/Illumina reads.


Function Overview


 SolexaPath("path"): Parses a Solexa output folder and identifies important files.
 analysisPath(SolexaPathObject): Returns a list of GERALD output folders.
 compose(filter1, filter2, filter3, ...): Returns a sequence filter based on a combination of various filter objects.
 readAligned(SolexaPathObject, "filename", filter=myfilter): Reads in an Eland export file, which contains sequences, alignment coordinates and alignment-normalized quality scores, based on a given filter object.
 sread(AlignedReadObject): Returns the sequences contained in an AlignedRead object (as returned by readAligned).
 coverage(AlignedReadObject, start=startnt, stop=stopnt): Calculates coverage of an alignment over each base in the alignment genome and compresses the data to save memory.

Illumina Pipeline


The Illumina Pipeline is a set of software tools for the unix platform provided to users of the Illumina Genome Analyzer. The pipeline reads the images produced by the sequencer, analyzes them, performs base calling, and (optionally) aligns sequences to a reference genome. The user is provided with a set of sequences corresponding to each read, along with quality scores representing the accuracy of each base call.

Quality Scores


A numeric Phred score represents the error probability of a given base call. When a nucleotide sequence is produced by sequencing, random error results in the possibility that any given base call may be incorrect. Thus, a quality score is provided for each base. A more detailed outline is given in this slide show from Illumina.

The Phred score is calculated from the error probability of a given base call:
  • Phred score = -10 * log10(error probability)
 Error Probability   Phred Score
 1                   0
 0.1                 10
 0.01                20
 0.001               30
 0.0001              40

When quality scores are used to represent a long sequence (such as in a FASTQ file), they are often represented in the ASCII alphabet by adding 33 to Phred scores and 64 to Illumina scores (the Illumina pipeline produces Phred scores but uses a different ASCII offset). For example, a Phred score of 40 can be represented as the ASCII character "I" (40+33 = ASCII #73), and an Illumina score of 40 as "h" (40+64 = ASCII #104).

Here is a link to a reference on the ASCII alphabet: http://www.asciitable.com/
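To make the conversion concrete, a brief R sketch (not part of the original code) that reproduces the table above and the two ASCII encodings:

errorProb <- c(1, 0.1, 0.01, 0.001, 0.0001) # error probabilities from the table above
phred <- -10 * log10(errorProb) # Phred scores: 0, 10, 20, 30, 40
rawToChar(as.raw(phred + 33)) # Phred+33 (Sanger) ASCII encoding: "!+5?I"
rawToChar(as.raw(phred + 64)) # Phred+64 (Illumina) ASCII encoding: "@JT^h"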

ShortRead Objects

SolexaPath
This object stores output file locations from the Illumina Pipeline. The SolexaPath function which creates this object automatically finds the subfolders for each output file type.

system("wget http://biocluster.ucr.edu/~tbackman/sample_flowcell.tgz") # download sample flowcell
system("cp /home/tbackman/.html/ .") # Alternative- copy from biocluster folder
system("tar xvfz sample_flowcell.tgz") # uncompress sample flowcell
library(ShortRead)

sp <- SolexaPath("sample_flowcell") # Creates a SolexaPath for a pipeline output folder # general format: sp <- SolexaPath("/path/to/folder")
baseCallPath(sp) # returns base call path subfolder
analysisPath(sp) # returns GERALD/Eland alignment subfolder
imageAnalysisPath(sp) # Returns firecrest image analysis subfolder

#####################################
# Import Reads from Illumina Folder #
#####################################
lane <- 7

# read in all tiles of data
qualityInputFiles <- paste("s_", lane, "_[0-9]+_prb.txt", sep="")
sequenceInputFiles <- paste("s_", lane, "_[0-9]+_seq.txt", sep="")

reads <- readBaseQuality(baseCallPath(sp), seqPattern=sequenceInputFiles, prbPattern=qualityInputFiles,type="Solexa")
ShortReadQ
This object stores sequences, quality scores, and ids for a series of reads.

library(ShortRead)
system("wget http://biocluster.ucr.edu/~tbackman/query.fastq") # download a sample illumina read set in fastq format
system("cp /home/tbackman/.html/query.fastq .") # Alternative- copy from biocluster folder
reads <- readFastq("query.fastq") # Reads in a FASTQ file from the current folder

id(reads) # outputs read ids as a BStringSet
sread(reads) # outputs read sequences as a DNAStringSet
quality(reads) # outputs read quality scores as a BStringSet-based quality object
writeFastq(reads, "output.fastq") # writes ShortReadQ object to output file
AlignedRead
This object stores the genome alignment coordinates (chromosome, position, strand) for each of a series of reads.

system("wget http://biocluster.ucr.edu/~tbackman/output.soap") # download a sample alignment result produced by soap
system("cp /home/tbackman/.html/output.soap .") # Alternative- copy from biocluster folder

## Import AlignedRead Object

alignedReads <- readAligned("./", pattern="output.soap", type="SOAP")
   # Reads in an external alignment (change type depending on program)
   # general format: readAligned("/path/to/folder", pattern="filename", type="AlignmentTool")
alignedReads[1] # choose a read and inspect its alignment coordinates

## AlignedRead Inspection and Alignment QC
length(alignedReads) # total number of hits
length(unique(id(alignedReads))) # count number of query sequences which aligned
length(unique(id(alignedReads[srduplicated(id(alignedReads))]))) # count reads producing multiple hits
table(table(as.character(id(alignedReads))) == 1)["TRUE"] # count reads with unique (single hit) alignment (assumes unique ids)
length(unique(position(alignedReads))) # count unique positions hit
length(unique(position(alignedReads)[duplicated(position(alignedReads))])) # count positions with multiple aligning reads

ShortRead Sequence Filters

ShortRead provides functions that return sequence filter objects, which allow you to filter your input sequences by various criteria. It is also possible to create new filter objects.

 srFilter: Creates a custom filter (see the vignettes).
 chromosomeFilter("chromosome name regex"): Selects reads which align to chromosome(s) matching the given regular expression.
 strandFilter("+/-"): Selects reads which align to the + or - strand of the genome.
 nFilter(# of Ns): Selects reads which contain fewer than a given number of N characters.
 polynFilter(# of given nt): Selects only reads which contain fewer than a given number of the same nucleotide.
 srdistanceFilter(DNAStringSet, DNAString): Performs pairwise alignment of each DNAStringSet string against DNAString and calculates the edit distance.
 uniqueFilter(): Removes duplicate occurrences of a given nucleotide sequence from a set of reads.

Depending on the nature of the filter, they may work with different object types. Consult the Reference Manual Vignette for instructions on building filters, and additional built-in filter functions.
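As an illustration of srFilter from the table above, the following is a minimal sketch (not from the ShortRead vignette; the 60% GC cutoff and object names are arbitrary) of a custom filter:

gcFilter <- srFilter(function(x) { # custom filter function; 'x' is the object to be filtered (here a ShortReadQ)
    freq <- alphabetFrequency(sread(x), baseOnly=TRUE) # base counts for each read
    (freq[,"G"] + freq[,"C"]) / width(sread(x)) < 0.6 # TRUE for reads with less than 60% GC content
}, name="GC content below 60%")
reads <- reads[gcFilter(reads)] # apply the custom filter like any built-in filter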

# Create several filters:
filter1 <- nFilter(threshold=1) # keep only reads without Ns
filter2 <- polynFilter(threshold=20, nuc=c("A")) # remove reads with 20 or more As
filter3 <- strandFilter("+") # choose reads which align to positive strand

filter <- compose(filter1, filter2, filter3) # Combine filters into one
alignedReads <- alignedReads[filter(alignedReads)] # filter alignedReads object with all 3 filters at once

Analysis Routines with ShortRead

The following examples introduce a variety of useful analysis routines that can be accomplished with functions provided by the ShortRead package.
Import Reads, Filter, and Export

##################
## Import Reads ##
##################
# Hint: download with wget and import with readFastq
# Reads: http://biocluster.ucr.edu/~tbackman/query.fastq


##################
## Filter Reads ##
##################
# Hint: construct a filter object and apply to reads (any type of filter)

##################
## Export Reads ##
##################
# Hint: store filtered reads in new object and export with writeFastq
Answers: Click here
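In case the linked answer page is unavailable, a minimal sketch of one possible solution is shown below (the choice of filter is arbitrary):

library(ShortRead)
reads <- readFastq("query.fastq") # import the downloaded FASTQ file
myfilter <- nFilter(threshold=1) # example filter: keep only reads without Ns
filtered <- reads[myfilter(sread(reads))] # apply the filter to the read sequences
writeFastq(filtered, "filtered.fastq") # export the filtered reads to a new FASTQ file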

fastq Quality Inspection

## Get sample fastq file
system("wget
http://biocluster.ucr.edu/~tbackman/query.fastq") # download some test Illumina reads from Arabidopsis
system("cp /home/tbackman/.html/query.fastq .") # Alternative- copy from biocluster folder
library(ShortRead)

## plot per-cycle quality boxplot
reads <- readFastq("query.fastq")
   # Read in datafile (replace query.fastq with your filename)
if (length(reads) > 1000000) {
   reads <- sample(reads,1000000) # if more than a million reads, sample randomly
}
qual <- FastqQuality(quality(quality(reads))) # get quality scores
readM <- as(qual, "matrix") # convert scores to matrix
pdf(file="boxplot.pdf") # Save box plot as boxplot.pdf in current folder
boxplot(as.data.frame((readM)), outline = FALSE, main="Per Cycle Read Quality", xlab="Cycle", ylab="Phred Quality")
   # build per cycle boxplot
dev.off()
system("cp boxplot.pdf ~/.html") # If you're on our biocluster server, this copies the output to your public web space where you can access it such as: http://biocluster.ucr.edu/~tbackman/boxplot.pdf (replace tbackman with your username)


## qa object quality control
qa <- qa(dirPath=".", pattern="query.fastq", type=c("fastq")) # read in fastq file
qa[['readCounts']] # inspect read yield
qa[['baseCalls']] # inspect base composition
qa[['frequentSequences']] # inspect most common sequences

## generate HTML quality report from qa object in qcReport folder (must have X server connected)
report(qa, dest="qcReport", type="html") # sample: http://biocluster.ucr.edu/~tbackman/qcReport/

SmallRNA Profiling

##################################
## Import Reads From Fastq file ##
##################################
system("wget http://biocluster.ucr.edu/~tbackman/srnareads.fastq") # download some test Illumina reads from Arabidopsis
system("cp /home/tbackman/.html/srnareads.fastq .") # Alternative- copy from biocluster folder
library(ShortRead)
reads <- readFastq("srnareads.fastq")
id(reads) # inspect read ids
sread(reads) # inspect read sequences
quality(reads) # inspect quality scores

######################
## Filter Sequences ##
######################
# Construct several filters
filter1 <- nFilter(threshold=3) # keep only reads with fewer than 3 Ns
filter2 <- polynFilter(threshold=20, nuc=c("A", "C", "T", "G")) # remove reads with 20 or more of the same letter
filter <- compose(filter1, filter2) # Combine filters into one
reads <- reads[filter(sread(reads))] # filter sequences with all filters at once
   # since these filters work on the sequences themselves, it is necessary to pass them the sequences
   # with the sread(reads) function, rather than the entire ShortReadQ object
reads # inspect filtered reads

###################
## Trim Adapters ##
###################
seqs <- sread(reads) # get sequence list
qual <- quality(reads) # get quality score list
qual <- quality(qual) # strip quality score type
seqs <- DNAStringSet(seqs, start=4) # trim first 3 bases from sequence
qual <- BStringSet(qual, start=4) # trim first 3 bases from Qscore
   # This is only necessary if you need to shorten the read by a fixed
   # distance at the beginning. This is done here because the samples
   # are multiplexed, and we wish to omit the multiplexing tag.
adapter <- "TGTAGGCACCATCACTCGGGCACCAAGGA"
   # This is the adapter sequence to be trimmed from the end of your smallRNA reads
mismatchVector <- c(rep(2,length(DNAString(adapter))))
   # This is a vector of equal length to the adapter with "2" for each value.
   #  This will allow the adapter to match the end of the sequence with any offset
   #  and up to 2 mismatches.
trimCoords <- trimLRPatterns(Rpattern=adapter, subject=seqs, max.Rmismatch=mismatchVector, ranges=T)
   # Trim sequences looking for a right end pattern
   # Gets IRanges object with trimmed coordinates
seqs <- DNAStringSet(seqs, start=start(trimCoords), end=end(trimCoords))
qual <- BStringSet(qual, start=start(trimCoords), end=end(trimCoords))
   # Use IRanges coordinates to trim sequences and quality scores
qual <- SFastqQuality(qual) # reapply quality score type
trimmed <- ShortReadQ(sread=seqs, quality=qual, id=id(reads))
   # Rebuild reads object with trimmed sequences and quality scores
   
#######################################
## Choose Size Range and Make Unique ##
#######################################
maxSize <- 30
minSize <- 19
trimmed <- trimmed[width(sread(trimmed)) <= maxSize] # remove all reads longer than maxSize
trimmed <- trimmed[minSize <= width(sread(trimmed))] # remove all reads shorter than minSize
uniquef <- uniqueFilter() # create filter
unique <- trimmed[uniquef(trimmed)] # run unique filter (eliminate duplicates)
   # keep working with the non-unique reads if analyzing expression level data
   
######################
## Export Sequences ##
######################
writeFastq(trimmed, "filename.fastq") # write reads to filename.fastq

###########################################################
## Perform Alignment With Bowtie Using Unix Command Line ##
###########################################################
system("wget http://biocluster.ucr.edu/~tbackman/a_thaliana.ebwt.zip") # Download your reference genome in fasta format with wget (this is a prebuilt Bowtie Tair9 reference)
system("cp /home/tbackman/.html/a_thaliana.ebwt.zip .") # Alternative- copy from biocluster folder
system("unzip a_thaliana.ebwt.zip") # uncompress reference genome
system("bowtie -q a_thaliana filename.fastq output.bowtie") # perform bowtie alignment with default options

######################
## Import Alignment ##
######################
alignedReads <- readAligned("./", pattern="output.bowtie", type="Bowtie") # read in alignment
table(width(sread(alignedReads))) # view table of SmallRNA Size distribution

#########################
## Find Coverage Peaks ##
#########################
coverage <- coverage(alignedReads) # calculate genomic coverage and returns an IRanges Rle object
islands <- slice(coverage[2], lower = 2) # look for regions on chrom 1 of 2x coverage or higher
islands # inspect regions

###########################################
## Visualize Alignment In Genome Browser ##
###########################################
library(rtracklayer)
chr1ranges <- as(coverage[2], "RangedData") # convert coverage for chromosome 1 into rtracklayer object
names(chr1ranges) <- "chr1"
# use chromosome names compatible with UCSC genome browser
export(chr1ranges, con="chromsome1.bed") # create output track for viewing in genome browser
system("cp chromsome1.bed ~/.html") # If you're on our biocluster server, this copies the output to your public web space where you can access it such as: http://biocluster.ucr.edu/~tbackman/chromsome1.wig (replace tbackman with your username)

If you are working on a remote server, you will need to copy this file locally before you can upload it to a genome browser. If you have trouble with the copy, you can download a sample here: chromsome1.wig

Go to a genome browser, upload your track file, and view it. You can try the Arabidopsis UCSC browser (http://epigenomics.mcdb.ucla.edu/cgi-bin/hgGateway?org=A.+thaliana&db=araTha1), by uploading your file via the "add your own custom tracks" button. After uploading, browse to the region you would like to see. A read pileup in the test dataset for the workshop can be found at chr1:17,825,707-17,825,775. Also, try inspecting regions of high coverage you found in the previous step.

Note: In this case, the UCSC genome browser is not using the same version of the genome (TAIR9) so your read loci may not correlate with the correct annotation features.


Trimming and Sorting Reads by 5' Multiplexing Adapters
Next generation sequencing technologies frequently produce a much larger volume of reads than necessary for a given library. Multiple libraries can therefore be combined on a single lane and identified by unique 5' sequencing adapters, which are easily trimmed and sorted in R.

##################################
## Import Reads From Fastq file ##
##################################
system("wget http://biocluster.ucr.edu/~tbackman/srnareads.fastq") # download some test Illumina reads from Arabidopsis
system("cp /home/tbackman/.html/srnareads.fastq .") # Alternative- copy from biocluster folder
library(ShortRead)
reads <- readFastq("srnareads.fastq")

##################################
## Inspect Adapter Distribution ##
##################################
seqs <- sread(reads) # get sequence list
adapterLength <- 4 # this is the length of your adapters
seqs <- DNAStringSet(seqs, end=adapterLength)
sort(table(as.character(seqs)), decreasing=TRUE)[1:10] # show top 10 adapter sequences
topAdapters <- rownames(sort(table(as.character(seqs)), decreasing=TRUE)[1:10])

# iterate through each of these top 10 adapters and write output to fastq files
for(adapter in topAdapters) {
    ###################
    ## Trim Adapters ##
    ###################
    seqs <- sread(reads) # get sequence list
    qual <- quality(reads) # get quality score list
    qual <- quality(qual) # strip quality score type
    mismatchVector <- 0 # allow no mismatches
   
    trimCoords <- trimLRPatterns(Lpattern=adapter, subject=seqs, max.Lmismatch=mismatchVector, ranges=T)
       # Trim sequences looking for a 5' pattern
       # Gets IRanges object with trimmed coordinates
   
    ########################################
    ## Generate trimmed ShortReadQ object ##
    ########################################
    seqs <- DNAStringSet(seqs, start=start(trimCoords), end=end(trimCoords))
    qual <- BStringSet(qual, start=start(trimCoords), end=end(trimCoords))
       # Use IRanges coordinates to trim sequences and quality scores
    qual <- SFastqQuality(qual) # reapply quality score type
    trimmed <- ShortReadQ(sread=seqs, quality=qual, id=id(reads))
       # Rebuild reads object with trimmed sequences and quality scores
 
   trimmed <- trimmed[start(trimCoords) == adapterLength + 1]
         # keep only reads which trimmed the full adapter

      
    ###############################
    ## Write reads to Fastq file ##
    ###############################
    outputFileName <- paste(adapter, ".fastq", sep="")
    writeFastq(trimmed, outputFileName)
       # export filtered and trimmed reads to fastq file
    print(paste("wrote", length(trimmed), "reads to file", outputFileName))
}

Trimming Low Quality 3' Tails from Reads
Many Illumina reads have very low quality in the last few cycles, which can pose a problem for later analysis steps which are not quality score aware. In these cases, removing low quality 3' tails from a read can be beneficial.

##################################
## Import Reads From Fastq file ##
##################################
system("wget http://biocluster.ucr.edu/~tbackman/query.fastq") # download some test Illumina reads
system("cp /home/tbackman/.html/query.fastq .") # Alternative- copy from biocluster folder
library(ShortRead)
reads <- readFastq("query.fastq")

########################################
## Inject Ns at low quality positions ##
########################################
qualityCutoff <- 10 # remove read tails with quality lower than this
seqs <- sread(reads) # get sequence list
qual <- PhredQuality(quality(quality(reads))) # get quality score list as PhredQuality
myqual_mat <- matrix(charToRaw(as.character(unlist(qual))), nrow=length(qual), byrow=TRUE) # convert quality score to matrix
at <- myqual_mat < charToRaw(as.character(PhredQuality(as.integer(qualityCutoff)))) # find positions of low quality
letter_subject <- DNAString(paste(rep.int("N", width(seqs)[1]), collapse="")) # create a string of Ns as long as one read
letter <- as(Views(letter_subject, start=1, end=rowSums(at)), "DNAStringSet") # trim to length needed for each read
injectedseqs <- replaceLetterAt(seqs, at, letter) # inject Ns at low quality positions

####################################
## Get coordinates of polyN tails ##
####################################
adapter <- paste(rep("N", max(width(injectedseqs))), sep="", collapse="")
mismatchVector <- c(rep(0,width(adapter))) # allow no mismatches at each adapter offset
trimCoords <- trimLRPatterns(Rpattern=adapter, subject=injectedseqs, max.Rmismatch=mismatchVector, ranges=T)
   # Trim sequences looking for a right end pattern (polyN in this case)
   # Gets IRanges object with trimmed coordinates

##########################################################################
## Apply trimming coordinates from injected reads to non-injected reads ##
##########################################################################
seqs <- DNAStringSet(seqs, start=start(trimCoords), end=end(trimCoords))
qual <- BStringSet(qual, start=start(trimCoords), end=end(trimCoords))
   # Use IRanges coordinates to trim sequences and quality scores
qual <- SFastqQuality(qual) # reapply quality score type
trimmed <- ShortReadQ(sread=seqs, quality=qual, id=id(reads))
   # Rebuild reads object with trimmed sequences and quality scores

Quality Reports of FASTQ Files 
The following seeFastq/seeFastqPlot functions generate and plot a series of useful quality statistics for a set of FASTQ files including per cycle quality box plots, base proportions, base-level quality trends, relative k-mer diversity, length and occurrence distribution of reads, number of reads above quality cutoffs and mean quality distribution. The functions allow processing of reads with variable length, but most plots are only meaningful if the read positions in the FASTQ file are aligned with the sequencing cycles. For instance, constant length clipping of the reads on either end or variable length clipping on the 3' end maintains this relationship, while variable length clipping on the 5' end without reversing the reads erases it. Related resources on this topic can be found here: qrqc, FastQC and FASTX-Toolkit

##################################
## Generate FASTQ quality plots ##
##################################
library(ShortRead); library(ggplot2)
## Import seeFastq/seeFastqPlot functions
source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/fastqQuality.R")
## Optional download of sample fastq files
system("wget http://biocluster.ucr.edu/~tgirke/HTML_Presentations/Manuals/Rngsapps/chipseqBioc2012/data.zip"); system("unzip data.zip")
## Specify fastq files names in named vector (here for all *.fastq files in 'data' directory of downloaded sample data set)
fastq <- list.files("data", "*.fastq$"); fastq <- paste("data/", fastq, sep="")
names(fastq) <- paste("flowcell6_lane", 1:length(fastq), sep="_") # Values in name slot will be used for sample labeling in plot!
## Compute quality stats
fqlist <- seeFastq(fastq=fastq, batchsize=100000, klength=8) # Returns summary stats in a relatively small list object that can be saved to disk with 'save()' and reloaded with 'load()' for later plotting. The argument 'klength' specifies the k-mer length and 'batchsize' the number of reads to random sample from each fastq file.  
## Plot quality stats 
seeFastqPlot(fqlist)
seeFastqPlot(fqlist[4:1], arrange=c(1,2,3,4,6,7)) # Example for plotting specific rows and columns.
## Output plot to PDF
pdf("fastqReport.pdf", height=18, width=4*length(fastq))
seeFastqPlot(fqlist)
dev.off()

## To run the functions from the Unix command-line, instead of the R console, download this run script seeFastq.R to your current working directory, specify your FASTQ files in a file of this format fastqFiles.txt and then execute:
Rscript seeFastq.R -i fastqFiles.txt -o fastqReport.pdf -a all -c all -s 100000 -k 8 # The results will be written to a PDF file called fastqReport.pdf. The summary statistics will be stored in a separate *.list file, here 'fastqReport.pdf.list', that can be used to plot the results later without revisiting the fastq files. This is supported by running the script with the '-l' argument and assigning to it the *.list file, but dropping the '-i' option. To plot only specific rows and columns, one can assign them to the '-a' and '-c' arguments using this syntax: '-a 1_2_3_4_5_8_6_7 -c 1_2_3_4'. The arguments '-s' and '-k' specify the random sampling size and the k-mer length, respectively.

Example report generated by seeFastq/seeFastqPlot for 4 FASTQ files


Rsamtools

Documentation
Samtools Website
BWA (Burrows-Wheeler Alignment) Website

Rsamtools provides functions for parsing and inspecting samtools BAM formatted binary alignment data. SAM/BAM is quickly becoming a universal standard alignment format, and is now supported by a wide variety of alignment tools.

BWA Alignment

The Burrows-Wheeler Alignment (BWA) Tool is a powerful short read alignment tool which can generate gapped short read output in the SAM format, well suited for use with the Rsamtools importer functions.

These commands should be run on the unix command line, or within R wrapped in a system("put command in here"):

## Option 1 of 2: Download test data from internet
wget http://biocluster.ucr.edu/~tbackman/genome.fasta # download a test reference genome (TAIR9 Chromosome 1)
wget http://biocluster.ucr.edu/~tbackman/query.fastq # download some test Illumina reads from Arabidopsis

## Option 2 of 2: If logged into biocluster.ucr.edu copy files directly
cp /home/tbackman/.html/genome.fasta . # copy from biocluster folder
cp /home/tbackman/.html/query.fastq . # copy from biocluster folder

## perform alignment
bwa index -a is genome.fasta # index reference genome
bwa aln genome.fasta query.fastq > output.sai # perform alignment
bwa samse -n -1 genome.fasta output.sai query.fastq > output.sam # generate SAM formatted alignment output
samtools view -bST genome.fasta output.sam -o output.bam # use samtools to generate bam file

Rsamtools Bam Parsing

Bam file data elements (name: description; R data object type)
 qname: read id; character vector
 flag: a numeric value which encodes extra alignment details for use in sorting and filtering (see samtools man page); integer vector
 rname: reference chromosome/sequence to which the read aligned; factor
 strand: alignment strand; factor
 pos: alignment start coordinate (left most, i.e. the 3' end for reads on the - strand); integer vector
 qwidth: read width; integer vector
 mapq: mapping quality; integer vector
 cigar: extended samtools CIGAR string; character vector
 mrnm: reference chromosome/sequence to which the opposite read of the pair aligned; factor
 mpos: alignment start coordinate of the opposite read of the pair; integer vector
 isize: paired end insert size; integer vector
 seq: query sequence in 5' to 3' direction; DNAStringSet
 qual: read quality score; PhredQuality
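The flag field packs several yes/no alignment properties into a single integer as bit flags. A brief sketch (illustrative flag values, plain integer arithmetic) of how individual bits can be decoded:

flags <- c(0L, 4L, 16L, 83L) # illustrative SAM flag values
(flags %/% 4L) %% 2L == 1L # TRUE where bit 0x4 (read unmapped) is set
(flags %/% 16L) %% 2L == 1L # TRUE where bit 0x10 (read aligned to reverse strand) is set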

library(Rsamtools) # load Rsamtools library
sortBam("output.bam", "output.sorted") # sort entries in bam file
indexBam("output.sorted.bam") # index reads in bam file

## Option 1 of 2: Import all elements from bam file
param <- ScanBamParam()

## Option 2 of 2: Select specific elements from bam file to import
which <- RangesList(Chr1 = IRanges(c(1, 10000), c(20000,30000)))
# select only elements you need:
what <- c("pos", "strand", "rname", "seq", "qual", "qname")
param <- ScanBamParam(which = which, what = what)

## Read in sorted indexed bam file
bam <- scanBam("output.sorted.bam", index="output.sorted.bam", param=param)

## Inspect and extract elements from bam object
names(bam) # shows ranges you selected with "which"
names(bam[[1]]) # selects first range and shows elements selected with "what"
bam[[1]]$seq # selects a specific element from a specific range

## Read in sorted indexed bam file as GappedAlignments object
bamAlignment <- readBamGappedAlignments("output.sorted.bam", index="output.sorted.bam")

Rsamtools SNP Calling


This step requires the sorted and indexed alignment and genome from previous steps

## unix command to extract pileup (check samtools options first)
system("samtools pileup -vcf genome.fasta output.sorted.bam > raw.pileup")

## read in snps and indels
snps <- readPileup("raw.pileup", variant="SNP")
indels <- readPileup("raw.pileup", variant="indel")

## inspect and filter snps
snps[1:5] # see first 5 snps
table(seqnames(snps)) # table snps per chromosome
snps <- snps[elementMetadata(snps)$coverage > 3] # keep only snps with coverage > 3

## write SNP spreadsheet
write.table(as.data.frame(snps), "snps.xls")

Analysis Routines with Rsamtools


SNP/INDEL Analysis and Annotation
[ Slide Show ]   [ R Code ]
###################################################
### chunk number 1: install-packages 
###################################################
## #line 92 "SNPAnno.Rnw"
## source("http://bioconductor.org/biocLite.R")
## biocLite(c("GenomicFeatures","rtracklayer","GenomicRanges","Rsamtools"))


###################################################
### chunk number 2: install-mac 
###################################################
## #line 102 "SNPAnno.Rnw"
## install.packages("SNPAnno_1.0.tar.gz", repos=NULL, type="source")


###################################################
### chunk number 3: install-windows 
###################################################
## #line 110 "SNPAnno.Rnw"
## install.packages("SNPAnno.zip", repos=NULL)


###################################################
### chunk number 4: load-library
###################################################
#line 115 "SNPAnno.Rnw"
library(SNPAnno)


###################################################
### chunk number 5: example1-snp
###################################################
#line 119 "SNPAnno.Rnw"
snpfile <-system.file("examples","s.vcf",package="SNPAnno")
snpfile
snpgr <-vcf2GR(snpfile)
chrid="dmel_mitochondrion_genome"
snpgr <-snpgr[seqnames(snpgr)==chrid]
snpgr[1:3]


###################################################
### chunk number 6:  example1-txdb
###################################################
## #line 128 "SNPAnno.Rnw"
## txdb <- makeTranscriptDbFromBiomart("ensembl", "dmelanogaster_gene_ensembl")
## txdb


###################################################
### chunk number 7:  example1-txdb
###################################################
## #line 133 "SNPAnno.Rnw"
## library(biomaRt)
## listMarts()
## ensembl = useMart("ensembl")#Choose to query the Ensembl BioMart database
## listDatasets(ensembl)#list dataset in database


###################################################
### chunk number 8: example1-txdb
###################################################
#line 140 "SNPAnno.Rnw"
txdbfile <-system.file("examples", "dm5.25.sqlite",package="SNPAnno")
txdb <-loadFeatures(txdbfile)
txdb


###################################################
### chunk number 9: example1-txdb
###################################################
#line 146 "SNPAnno.Rnw"
genomefile <-system.file("examples", "dm.fasta",package="SNPAnno")
genomesequence <-read.DNAStringSet(genomefile)


###################################################
### chunk number 10: example1-cds
###################################################
#line 151 "SNPAnno.Rnw"
cds <-GetCDSfromTXDb(txdb,chrid)
cds[1:2]


###################################################
### chunk number 11: example1-nonsynonymous
###################################################
#line 156 "SNPAnno.Rnw"
nonsynonymous<- NonSynDetection(snpgr,chr=genomesequence, features=cds, chrid=chrid, revcomp="-")
nonsynonymous[1:10]


###################################################
### chunk number 12: example1-coding
###################################################
#line 161 "SNPAnno.Rnw"
codingGR <- GetCDSfromTXDb(txdb,chrid)
coding <-CodingMutation(codingGR, snpgr)
coding[1:10]


###################################################
### chunk number 13: example1_intergenic
###################################################
#line 167 "SNPAnno.Rnw"
intergenicGR <-IntergenicfromtxDB(txdb,chrid)
intergenicGR[1:3]


###################################################
### chunk number 14: example1_intergenic
###################################################
#line 172 "SNPAnno.Rnw"
intergenic <-IntergenicMutation(intergenicGR,snpgr)
intergenic[34:38]


###################################################
### chunk number 15:  example1_intron
###################################################
## #line 177 "SNPAnno.Rnw"
## intronGR <-intronfromtxDB(txdb,chrid)
## intronGR


###################################################
### chunk number 16:  example1_splicesite
###################################################
## #line 182 "SNPAnno.Rnw"
## splicesite <-SpliceSiteMutation(intronGR, snpgr)


###################################################
### chunk number 17: example1_all
###################################################
#line 186 "SNPAnno.Rnw"
SNPAnnotation(snpfile=snpfile,
               type="vcf",
               source="transcriptDB",
               annosource=txdb,
               detection=c("nosynonymous","intergenic","coding"),
               genomesequence=genomesequence,
               chrid=chrid)


###################################################
### chunk number 18: example2
###################################################
## #line 197 "SNPAnno.Rnw"
## library(SNPAnno)


###################################################
### chunk number 19: example2_snp
###################################################
#line 201 "SNPAnno.Rnw"
snpfile <-system.file("examples","snp.pileup",package="SNPAnno")
snpgr <-pileupSNP2GR(snpfile)
highqualityreads <-supportreads(snpgr,10)
snpgr <-GRanges(seqnames=seqnames(snpgr),ranges=ranges(snpgr),strand="*",values(snpgr)["ref"], 
               values(snpgr)["alt"],values(snpgr)["consensusQuality"],values(snpgr)["snpQuality"],
               values(snpgr)["maxMappingQuality"],values(snpgr)["coverage"],
               highqualityreads =highqualityreads)
snpgr <-snpgr[seqnames(snpgr)=="Chr1"]
snpgr[1:3]


###################################################
### chunk number 20: example2_gff
###################################################
#line 210 "SNPAnno.Rnw"
gfffile <-system.file("examples","TAIR10_part.gff3",package="SNPAnno")
gff <- import.gff(gfffile, asRangedData=FALSE)
elementMetadata(gff)["type"] <-as.character( as.data.frame(elementMetadata(gff)[,"type"])[,1])
elementMetadata(gff)["group"] <- gsub("^.*?=(.*?)($|,|;|-).*", "\\1",
            elementMetadata(gff)[,"group"])


###################################################
### chunk number 21: example2_genome
###################################################
#line 218 "SNPAnno.Rnw"
genomefile <-system.file("examples", "genome.fasta",package="SNPAnno")
genomesequence <-read.DNAStringSet(genomefile)


###################################################
### chunk number 22: example2_cds
###################################################
#line 223 "SNPAnno.Rnw"
chrid="Chr1"
cds <-GetCDSfromGFF(gff,chrid)
cds[1:3]


###################################################
### chunk number 23: example2_Nonsynonymous
###################################################
#line 229 "SNPAnno.Rnw"
nonsynonymous<- NonSynDetection(snpgr,
          chr=genomesequence, features=cds, chrid=chrid, revcomp="-")
nonsynonymous[1:10]


###################################################
### chunk number 24: example2_cds
###################################################
#line 235 "SNPAnno.Rnw"
coding <-CodingMutation(cds, snpgr)
coding[1:10]


###################################################
### chunk number 25: example2_cds
###################################################
#line 240 "SNPAnno.Rnw"
intergenicGR <-Intergenicfromgff(gff,chrid)
intergenicGR[1:2]


###################################################
### chunk number 26: example2_inter
###################################################
#line 245 "SNPAnno.Rnw"
intergenic <-IntergenicMutation(intergenicGR,snpgr)
intergenic[1:10] 


###################################################
### chunk number 27: example2_intron
###################################################
#line 250 "SNPAnno.Rnw"
intronGR <-intronfromgff(gff,chrid)
intronGR[1:2]


###################################################
### chunk number 28: example2_splicesite
###################################################
#line 255 "SNPAnno.Rnw"
splicesite <-SpliceSiteMutation(intronGR, snpgr) 
splicesite[10:20]


###################################################
### chunk number 29: example2_all
###################################################
## #line 260 "SNPAnno.Rnw"
SNPAnnotation(snpfile=snpfile, type="pileup", source="gff", annosource=gff, detection=c("nosynonymous","splicesite","intergenic","coding"), genomesequence=genomesequence, chrid="Chr1")

Additional tools for SNP analysis: 


BSgenome

Documentation

BSgenome provides an object-oriented infrastructure for interacting with Biostrings-based genome sequences. BSgenome packages exist for many common genomes and can also be created for custom genomes. See the "How to forge a BSgenome data package" vignette for instructions on creating a new BSgenome package if a prebuilt one does not exist for your organism.

BSgenome Use

library(BSgenome)
available.genomes()
   # Get list of all available pre-built genomes to choose from
# source("http://bioconductor.org/biocLite.R")
# biocLite("BSgenome.Celegans.UCSC.ce2")
   # Run these commands to install a genome you haven't used before, or to update the version
library(BSgenome.Celegans.UCSC.ce2)
   # load the Caenorhabditis elegans genome
Celegans
   # Inspect overview of data provided with this package
masknames(Celegans)
   # Returns a list of built in masks defined for sequences in the genome
   # AGAPS = assembly gaps
   # AMB = intra-contig ambiguities
   # RM = RepeatMasker
   # TRF = Repeat Regions found via Tandem Repeats Finder
chr1 <- Celegans$chrI
   # load chromosome 1 into memory with default masks
length(chr1)
   # returns length of chromosome 1
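
Individual subsequences can also be retrieved directly with getSeq, without assigning a whole chromosome to a separate object. The lines below are a minimal sketch based on the C. elegans package loaded above:

getSeq(Celegans, "chrI", start=1, end=50)
   # extracts the first 50 bases of chromosome I; omitting start/end returns the entire chromosome
alphabetFrequency(chr1, baseOnly=TRUE)
   # returns the base composition of the chromosome object loaded above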

rtracklayer

Documentation

rtracklayer provides an interface for exporting annotation feature data to various genome browsers and file formats (such as GFF). See the Small RNA Profiling exercise for an example of using rtracklayer to visualize alignment coverage.
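
The following minimal sketch illustrates typical import and export calls; the file names are placeholders, and the optional browser upload assumes internet access and genome information assigned to the object:

library(rtracklayer)
feat <- import.gff("my_features.gff", asRangedData=FALSE)
   # imports annotation features from a GFF file (placeholder name) into a GRanges object
export.bed(feat, "my_features.bed")
   # exports the same features in BED format for viewing in a genome browser
# session <- browserSession("UCSC")
# track(session, "my_features") <- feat
   # optional: uploads the features as a custom track to a UCSC Genome Browser session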

biomaRt

Documentation

The biomaRt package provides an interface to a growing collection of databases implementing the BioMart software suite (http://www.biomart.org). The package enables online retrieval of large amounts of data in a uniform way without the need to know the underlying database schemas. These data are retrieved automatically via the Internet, so it is recommended to cache them locally, or to check versions, if your code could be adversely affected by updates to these data.

biomaRt Use


library("biomaRt")
listMarts() # get list of all available data sources
ensembl <- useMart("ensembl") # connect to the Ensembl BioMart database
listDatasets(ensembl) # list available datasets
cfamiliaris_gene_ensembl <- useDataset("cfamiliaris_gene_ensembl", mart= ensembl)
   # use domestic dog gene data
listAttributes(cfamiliaris_gene_ensembl)
   # show attributes in this dataset
listFilters(cfamiliaris_gene_ensembl)
   # show available filters
cfamiliaris_gene_ensembl <- useMart("ensembl","cfamiliaris_gene_ensembl")
   # alternative: select the mart and dataset in a single step
getBM(attributes=c("chromosome_name"), mart= cfamiliaris_gene_ensembl)
   # retrieve data regarding specific attributes in table/dataframe
   # format
   # optional filters= option allows for specific filters to be applied
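
For restricted queries, the filters and values arguments of getBM can be combined. The call below is a minimal sketch; the attribute and filter names are examples and should be confirmed with listAttributes and listFilters:

getBM(attributes=c("ensembl_gene_id", "chromosome_name", "start_position", "end_position"),
      filters="chromosome_name", values="1", mart=cfamiliaris_gene_ensembl)
   # retrieves gene identifiers and coordinates restricted to chromosome 1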

ChIP-Seq Analysis Packages

Bioconductor provides various packages for analyzing and visualizing ChIP-Seq data. Only a small selection of these packages is introduced here. Additional useful introductions to this topic are: BioC ChIP-seq Case Study and BioC ChIP-Seq.

chipseq

Documentation

The chipseq package combines a variety of HT-Seq packages into a pipeline for ChIP-Seq data analysis.

#################
## Data Import ##
#################
library(chipseq); library(lattice)
data(cstest); cstest # Loads a reduced test data set containing the start positions and orientation of the aligned reads for three mouse chromosomes.
cstest; cstest$ctcf # cstest is a list-like object of class "GenomeDataList". Each of its components contains the data from one Illumina lane.

#####################
## Extending Reads ##
#####################
## HT sequencing data often contain only part of the sequence of each ChIP-Seq pull-down fragment. Therefore, it can be useful in certain cases to extend the reads to cover the entire binding site of the DNA binding proteins.
estimate.mean.fraglen(cstest$ctcf, method="SISSR") # The function 'estimate.mean.fraglen' implements several methods to estimate the mean fragment length. See help file for details.
ext <- resize(cstest$ctcf, width = 200) # Returns an IRanges object of read intervals extended to 200 bases.

#################################
## Coverage, Islands and Depth ##
#################################
cov <- coverage(ext); cov # Counts the read coverage for each base. The results are stored in a run-length encoded form as Rle object.
islands <- slice(cov, lower = 1) # Returns read islands where the read coverage is equal to or above a specified bound.
viewSums(islands) # Returns the sum of coverage in the islands.
table(viewSums(islands) / 200) # Returns the number of estimated reads of the corresponding islands.
viewMaxs(head(islands)) # Returns the maximum coverage depth within each island.
table(viewMaxs(islands)) # Counts the corresponding islands.

## Function to perform above analysis on many data sets with seqapply()
islandReadSummary <- function(x) {
    g <- resize(x, width = 200)
    s <- slice(coverage(g), lower = 1)
    tab <- table(viewSums(s) / 200)
    df <- DataFrame(tab)
    colnames(df) <- c("chromosome", "nread", "count")
    df$nread <- as.integer(df$nread)
    df

}
head(islandReadSummary(cstest$ctcf)) # Returns same result as before.
nread.islands <- seqapply(cstest, islandReadSummary) # Performs analysis on all chromosomes.
nread.islands <- stack(nread.islands, "sample") # Returns results as data frame.

## Plots the distribution of read numbers against the number of islands.
xyplot(log(count) ~ nread | sample, as.data.frame(nread.islands),
       subset = (chromosome == "chr10" & nread <= 40),
       layout = c(1, 2), pch = 16, type = c("p", "g"))

##############################
## Chromosome Coverage Plot ##
##############################

plotCov <- function(mycov=cov, mychr=1, mypos=c(1,1000), mymain="Coverage", ...) {
    op <- par(mar=c(8,3,6,1))
    plot(as.numeric(mycov[[mychr]][mypos[1]:mypos[2]]), type="l",
    lwd=1, col="blue", ylab="", main=mymain, xlab="", xaxt="n", ...)
    axis(1, las = 2, at=seq(1,mypos[2]-mypos[1], length.out= 10),
    labels=as.integer(seq(mypos[1], mypos[2], length.out= 10)))
    par(op)
    }
plotCov(mycov=cov, mychr="chr10", mypos=c(3000000,4000000))

##############################
## Peak Calling and Summary ##
##############################
peakCutoff(cov, fdr = 0.0001) # Using Poisson-based approach for estimating the noise distribution, the 'peakCutoff' function returns a cutoff value for a specific FDR.
peaks <- slice(cov, lower = 8); peaks # Based on the above calculated coverage value ~7 at an FDR of 0.0001, one can choose a coverage >=8 as a conservative cutoff value for peak calling.
peaks.sum <- peakSummary(peaks) # Summarizes the peak data in a RangedData object.

###############################################
## Strand-Specific Peak Calling and Plotting ##
###############################################

peak.depths <- viewMaxs(peaks)
cov.pos <- coverage(ext[strand(ext) == "+"]) # Returns coverage for sense matches.
cov.neg <- coverage(ext[strand(ext) == "-"]) # Returns coverage for antisense matches.
peaks.pos <- Views(cov.pos, peaks)
peaks.neg <- Views(cov.neg, peaks)
wpeaks <- tail(order(peak.depths$chr10), 4); wpeaks # Returns the four highest peaks.
coverageplot(peaks.pos$chr10[wpeaks[1]], peaks.neg$chr10[wpeaks[1]]) # Plots the first peak.

###############################################
## DPs: Differential Peaks Among Two Samples ##
###############################################
## Simplest strategy is to combine the data from two samples and then compare the contribution of each sample to the peaks.
cov.gfp <- coverage(resize(cstest$gfp, 200))
peaks.gfp <- slice(cov.gfp, lower = 8)
peakSummary <- diffPeakSummary(peaks.gfp, peaks)
xyplot(asinh(sums2) ~ asinh(sums1) | space,
    data = as.data.frame(peakSummary),
    panel = function(x, y, ...) {
        panel.xyplot(x, y, ...)
        panel.abline(median(y - x), 1)
    },
    type = c("p", "g"), alpha = 0.5, aspect = "iso")


BayesPeak

Documentation [ Publication ]

BayesPeak is a peak calling package for identifying DNA binding sites of proteins in ChIP-Seq experiments. Its algorithm uses hidden Markov models (HMM) and Bayesian statistical methods. The following sample code introduces the identification of peaks with the BayesPeak package as well as the incorporation of read coverage information obtained by the chipseq package. 

########################
## Import Sample Data ##
########################
library(BayesPeak); library(multicore)

sig <- read.table("H3K4me3-chr16.bed", skip=1, colClasses=c("character", "numeric")[c(1,2,2,1,1,1,1,1,1)]) # Import of a ChIP-Seq sample data set. The file needs to be downloaded from the BayesPeak site, uncompressed and stored in the current working directory of an R session.
bgr <- read.table("Input-chr16.bed", colClasses=c("character", "numeric")[c(1,2,2,1,1,1,1,1,1)]) # Import of the background sample data set.
sig <- RangedData(space=sig[,1], IRanges(start=sig[,2], end=sig[,3]), strand=sig[,6])
bgr <- RangedData(space=bgr[,1], IRanges(start=bgr[,2], end=bgr[,3]), strand=bgr[,6])

#################################
## Peak Calling with BayesPeak ##
#################################
raw.output <- bayespeak(treatment=sig, control=bgr, chr = "chr16", start = 9.2E7, end = 9.5E7, use.multicore = TRUE, mc.cores = 1)
# The bayespeak() function fits a Markov model to the aligned read data via Markov Chain Monte Carlo (MCMC) techniques. The incorporation of a control sample is beneficial but not required for this function. The aligned read data can be read directly from a BED file or provided as a data frame or a RangedData object as in this example. For speed reasons the analysis is restricted in this example to a small subrange on chromosome 16. For a complete analysis of an entire genome, one usually wants to omit the arguments 'chr', 'start' and 'end'. In addition, one can increase the number of CPU cores utilized for the computation with the 'mc.cores' argument.
unreliable.jobs <- log(raw.output$QC$lambda1) < 1.5 # Removal of potential false positive peaks due to overfitting. Consult the BayesPeak manual for more details.
bpeaks <- as.data.frame(summarise.peaks(raw.output, method = "lowerbound", exclude.jobs = unreliable.jobs)) # Returns the final set of peaks in a data frame.

##########################################################
## Adding Coverage Information with the chipseq Package ##
##########################################################
bpeaksIR <- IRanges(start=bpeaks[, 2], end=bpeaks[, 3])
source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/chipseqFct.R")
# Imports the rangeCoverage function for computing coverage information of peak calls. The function expects the peak ranges in an IRanges object for single chromosome data or as a list of IRanges for multiple chromosome data. In addition, the corresponding aligned read data need to be provided as a RangedData object.
sigcovDF <- rangeCoverage(summaryFct=viewMaxs, myname="sig_", peaksIR=bpeaksIR, sig=sig, readextend=200, normfactor=10^6/length(start(sig))) # Returns a data frame with the maximum coverage information for each peak normalized per million reads (see 'normfactor' argument). The first column contains the total coverage and the other two the coverage for the positive and negative strands, respectively. To obtain mean coverage values, assign 'viewMeans' to the summaryFct argument.
bgrcovDF <- rangeCoverage(summaryFct=viewMaxs, myname="bgr_", peaksIR=bpeaksIR, sig=bgr, readextend=200, normfactor=10^6/length(start(bgr))) # Returns the corresponding coverage information for the background sample.
bpeaksDF <- cbind(bpeaks, sigcovDF[,-1], bgrcovDF[,-1]); bpeaksDF[1:4,] # Constructs a data frame containing the peaks called by BayesPeaks as well as the corresponding maximum coverage information for each peak.


PICS

Documentation [ Publication ]

The PICS package applies probabilistic inference to aligned-read ChIP-Seq data in order to identify regions bound by transcription factors. PICS identifies enriched regions by modeling local concentrations of directional reads, and uses DNA fragment length prior information to discriminate closely adjacent binding events via a Bayesian hierarchical t-mixture model. The following sample code uses the test data set from the above BayesPeak package in order to compare the results from both methods by identifying their consensus peak set.

########################
## Import Sample Data ##
########################

library(PICS); library(snowfall)
sig <- read.table("H3K4me3-chr16.bed", skip=1, colClasses=c("character", "numeric")[c(1,2,2,1,1,1,1,1,1)])[,c(1:3,6)]; colnames(sig) <- c("space", "start", "end", "strand"); siggd <- as(sig, "GenomeData")
bgr <- read.table("Input-chr16.bed", colClasses=c("character", "numeric")[c(1,2,2,1,1,1,1,1,1)])[,c(1:3,6)]; colnames(bgr) <- c("space", "start", "end", "strand"); bgrgd <- as(bgr, "GenomeData")
# Import of the sample data set. The files need to be downloaded from the BayesPeak site, uncompressed and stored in the current working directory of an R session.

############################
## Peak Calling with PICS ##
############################

sfInit(parallel = TRUE, cpus = 2); sfLibrary(PICS) # Specifies the number of CPU cores to use for the following computations.
seg <- segmentReads(siggd, dataC = bgrgd, minReads = 1) # Detection of candidate peak regions with a minimum read coverage specified under the 'minReads' argument.
pics <- PICS(seg, dataType = "TF") # Identification and scoring of DNA binding sites.
segC <- segmentReads(bgrgd, dataC = siggd, minReads = 1); picsC <- PICS(segC, dataType = "TF"); fdr <- picsFDR(pics, picsC, filter = list(delta = c(50, Inf), se = c(0, 50), sigmaSqF = c(0, 22500), sigmaSqR = c(0, 22500))) # Estimation of FDRs for peak calls.

myFilter = list(delta = c(50, 300), se = c(0, 50), sigmaSqF = c(0, 2500), sigmaSqR = c(0, 22500)) # Predefine filter parameters.
fdr2score <- fdr[fdr[,1]<=0.05,][1,2] # Defines the score cutoff for an FDR <= 0.05.
rdata <- makeRangedDataOutput(pics, type = "bed", filter = c(myFilter, list(score = c(fdr2score, Inf)))) # Returns quality filtered peaks with scores as RangedData object.
ppeaks <- as.data.frame(rdata)


##########################################################
## Adding Coverage Information with the chipseq Package ##
##########################################################
ppeaksIR <- IRanges(start=ppeaks[, 2], end=ppeaks[, 3])
sig <- RangedData(sig); bgr <- RangedData(bgr)
source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/chipseqFct.R")
# Imports the rangeCoverage function for computing coverage information of peak calls. The function expects the peak ranges in an IRanges object for single chromosome data or as a list of IRanges for multiple chromosome data. In addition, the corresponding aligned read data need to be provided as a RangedData object.
sigcovDF <- rangeCoverage(summaryFct=viewMaxs, myname="sig_", peaksIR=ppeaksIR, sig=sig, readextend=200, normfactor=10^6/length(start(sig))) # Returns a data frame with the maximum coverage information for each peak normalized per million reads (see 'normfactor' argument). The first column contains the total coverage and the other two the coverage for the positive and negative strands, respectively. To obtain mean coverage values, assign 'viewMeans' to the summaryFct argument.
bgrcovDF <- rangeCoverage(summaryFct=viewMaxs, myname="bgr_", peaksIR=ppeaksIR, sig=bgr, readextend=200, normfactor=10^6/length(start(bgr))) # Returns the corresponding coverage information for the background sample.
ppeaksDF <- cbind(ppeaks, sigcovDF[,-1], bgrcovDF[,-1]); ppeaksDF[1:4,] # Constructs a data frame containing the peaks called by PICS as well as the corresponding maximum coverage information for each peak.

##########################################################
## Peaks Identified by both Methods: PICS and BayesPeak ##
##########################################################
commonpeaks <- subsetByOverlaps(as(ppeaks, "RangedData"), as(bpeaks, "RangedData"), minoverlap=100) # Returns the peaks called by PICS that overlap by at least 100bp with those from BayesPeak for the range 9.2E7-9.5E7. Note: to compare the overall results from both methods, the BayesPeak analysis needs to be performed on the entire data set by removing the 'start' and 'end' restrictions in the above bayespeak function call.
ppeaksDF[ppeaksDF$start %in% start(commonpeaks),] # Returns the corresponding coverage information for the common peaks. 



ChIPpeakAnno

Documentation

The ChIPpeakAnno package provides batch annotation of the peaks identified from either ChIP-Seq or ChIP-chip experiments. It includes functions to retrieve the sequences around peaks, obtain enriched Gene Ontology (GO) terms, and find the nearest gene, exon, miRNA or custom features, such as the most conserved elements and other transcription factor binding sites supplied by the user. The package leverages the biomaRt, IRanges, Biostrings, BSgenome, GO.db, multtest and stats packages.

##################################################
## Obtain Genomic Context Information for Peaks ##
##################################################
## Use annotations from BioC annotation packages
library(ChIPpeakAnno)
data(myPeakList) # Loads a sample peak list stored as RangeData object.
data(TSS.human.NCBI36) # Loads gene locations for human genome.
annotatedPeak <- annotatePeakInBatch(myPeakList[1:6, ], AnnotationData = TSS.human.NCBI36) # Combines the peak information with the genomic context.
as.data.frame(annotatedPeak)

## Use annotations from BioMart
mart <- useMart(biomart="ensembl",dataset="hsapiens_gene_ensembl")
myAnnotation <- getAnnotation(mart, featureType="miRNA")
annotatedPeak <- annotatePeakInBatch(myPeakList[1:6, ], AnnotationData = myAnnotation)

## Use your own annotation data (e.g. BED or GFF)
myPeak1 <- RangedData(IRanges(start = c(967654, 2010897, 2496704, 3075869, 3123260, 3857501, 201089),
end = c(967754, 2010997, 2496804, 3075969, 3123360, 3857601, 201089),
names = c("Site1", "Site2", "Site3", "Site4", "Site5", "Site6", "site7")),
space = c("1", "2", "3", "4", "5", "6", "2"))
TFbindingSites <- RangedData(IRanges(start = c(967659, 2010898, 2496700, 3075866, 3123260, 3857500,
                                              96765, 201089, 249670, 307586, 312326, 385750),
end = c(967869, 2011108, 2496920, 3076166, 3123470, 3857780, 96985, 201299, 249890, 307796, 312586, 385960),
names = c("t1", "t2", "t3", "t4", "t5", "t6", "t7", "t8", "t9", "t10", "t11", "t12")),
space = c("1", "2", "3", "4", "5", "6", "1", "2", "3", "4", "5", "6"),
strand = c(1, 1, 1, 1, 1, 1, -1, -1, -1, -1, -1, -1))
annotatedPeak2 <- annotatePeakInBatch(myPeak1, AnnotationData = TFbindingSites)
as.data.frame(annotatedPeak2)

########################################
## Obtain Sequences Surrounding Peaks ##
########################################
peaks <- RangedData(IRanges(start = c(100, 500), end = c(300, 600), names = c("peak1", "peak2")), space = c("NC_008253", "NC_010468")) # Simple peak examples.
library(BSgenome.Ecoli.NCBI.20080805)
peaksWithSequences <- getAllPeakSequence(peaks, upstream = 20, downstream = 20, genome = Ecoli) # Extracts corresponding sequences including 20bp extensions on each end from the E. coli genome.
write2FASTA(peaksWithSequences) # Prints out results in FASTA format.

#################################
## GO Term Enrichment Analysis ##
#################################
library(org.Hs.eg.db)
enrichedGO <- getEnrichedGO(annotatedPeak, orgAnn = "org.Hs.eg.db", maxP = 0.01, multiAdj = TRUE, minGOterm = 10, multiAdjMethod = "BH")
enrichedGO$bp[1:6, ]


Additional ChIP-Seq Packages


RNA-Seq Analysis

Counting Reads that Overlap with Annotation Ranges 

Documentation

The GenomicRanges package provides support for importing short read alignment data in BAM format into R (via Rsamtools) and for associating the alignments with genomic feature ranges, such as exons or genes. This way one can quantify the number of reads aligning to annotated genomic regions. The package defines general purpose containers for storing genomic intervals as well as more specialized containers for storing alignments against a reference genome. The two main functions for read counting provided by this infrastructure are countOverlaps and summarizeOverlaps. For their proper usage, it is important to read the corresponding PDF manual.

##########################
## Import Aligned Reads ##
##########################
library(GenomicRanges); library(Rsamtools); library(leeBamViews) # Load required libraries.
testFile <- system.file("bam", "isowt5_13e.bam", package = "leeBamViews")
aligns <- readBamGappedAlignments(testFile) # Imports a BAM alignment file (here yeast example) and stores it as a GappedAlignments object.
rname(aligns) <- sub("^Sc", "", rname(aligns)); rname(aligns) <- sub("13", "XIII", rname(aligns)) # Required for data consistency.
alignscan <- scanBam(testFile); names(alignscan[[1]]) # Imports BAM data into a nested list object.

######################################################
## Organize Annotation Data in a GRangesList Object ##
######################################################
library(GenomicFeatures)
txdb <- makeTranscriptDbFromUCSC(genome="sacCer2", tablename="sgdGene") # Creates a TranscriptDb object from transcript annotations available at the UCSC Genome Browser.
exonRanges <- exonsBy(txdb, "tx") # Stores exon data as a GRangesList object.

#######################################################
## Create Annotation Objects from Custom Data Frames ##
#######################################################

transcripts <- data.frame(tx_id=1:3, tx_chrom="chr1", tx_strand=c("-", "+", "+"), tx_start=c(1, 2001, 2001), tx_end=c(999, 2199, 2199))
splicings <- data.frame(tx_id=c(1L, 2L, 2L, 2L, 3L, 3L), exon_rank=c(1, 1, 2, 3, 1, 2), exon_start=c(1, 2001, 2101, 2131, 2001, 2131), exon_end=c(999, 2085, 2144, 2199, 2085, 2199), cds_start=c(1, 2022, 2101, 2131, NA, NA), cds_end=c(999, 2085, 2144, 2193, NA, NA))
myTxdb <- makeTranscriptDb(transcripts, splicings) # Example for creating TranscriptDb object from two data frames.
exonsBy(myTxdb, "tx")
# Returns exon data as GRangesList object.

###################################################
## Counting Reads that Overlap Annotation Ranges ##
###################################################

counts <- countOverlaps(exonRanges, aligns) # Counts matching reads per transcript.
numBases <- sum(width(reduce(exonRanges))) # Length of exon union per gene.
geneLengthsInKB <- (numBases/1000) # Conversion to kbp.
millionsMapped <- sum(counts)/1e+06 # Factor for converting to million of mapped reads.
rpm <- counts/millionsMapped # RPM: reads per million mapped reads.
rpkm <- rpm/geneLengthsInKB # RPKM: reads per kilobase of exon model per million mapped reads. Note: the results are stored in a named vector that matches the index of the initial GRangesList object.

sortedRPKM <- sort(rpkm); highScoreGenes <- tail(sortedRPKM)
txs <- transcripts(txdb, vals = list(tx_id = names(highScoreGenes)), columns = c("tx_id", "gene_id"))

systNames <- as.vector(unlist(elementMetadata(txs)["gene_id"])) # Example for returning the six most highly expressed transcripts.

###################################################################
## Counting Reads for Non-Annotation Ranges (Intergenic Regions) ##
###################################################################

filtData <- aligns[countOverlaps(aligns, exonRanges) == 0] # Returns all reads that align outside of the annotated exon ranges. This includes introns, which are rare in this example.
filtData2 <- aligns[countOverlaps(aligns, transcriptsBy(txdb, "gene")) == 0] # Returns all reads that align in intergenic regions (excludes introns).
cov <- coverage(filtData) # Calculates coverage for non-exon ranges.
cov <- cov[13] # Subsetting of chromosome 13 because data for the other chromosomes is missing in this example.
islands <- slice(cov, lower = 1) # Filters for areas with a continuous read coverage of >=1.
transcribedRegions <- islands[width(islands) > 1000] # Returns only those transcribed regions that are at least 1000 bp long.

#########################################################################
## Avoid Double Counting and Ambiguous Mappings with summarizeOverlaps ##
#########################################################################
## Note: when counting single range features, a GRanges object is passed
## on to summarizeOverlaps and with multiple range features a GRangesList.
## It also accepts a ScanBamParam() function under the param argument to filter
## the reads stored in a bam file, e.g. by "NH" tag to exclude reads with multiple mappings. 
library(GenomicRanges); library(Rsamtools)
## Create some sample data
rd <- GappedAlignments(letters[1:3], seqnames = Rle(rep("chr1",3)),
      pos = as.integer(c(100, 500, 900)),
      cigar = rep("300M", 3), strand = strand(rep("+", 3)))
gr1 <- GRanges("chr1", IRanges(start=100, width=101), strand="+", ID="gene1_exon1")
gr2 <- GRanges("chr1", IRanges(start=100, width=151), strand="+", ID="gene1_exon2")
gr3 <- GRanges("chr1", IRanges(start=900, width=101), strand="+", ID="gene1_exon3")
gr4 <- GRanges("chr1", IRanges(start=500, width=101), strand="+", ID="gene2_exon1")
gr <- c(gr1, gr2, gr3, gr4)
grl <- split(gr, gsub("_.*", "", elementMetadata(gr)$ID))
grl
GRangesList of length 2:
$gene1
GRanges with 3 ranges and 1 metadata column:
      seqnames      ranges strand |          ID
         <Rle>   <IRanges>  <Rle> | <character>
  [1]     chr1 [100,  200]      + | gene1_exon1
  [2]     chr1 [100,  250]      + | gene1_exon2
  [3]     chr1 [900, 1000]      + | gene1_exon3
...


## Count reads overlapping with exonic regions of genes
assays(summarizeOverlaps(grl, rd, mode="Union", ignore.strand=TRUE))$counts
      [,1]
gene1    2
gene2    1

countOverlaps(grl, rd, ignore.strand=TRUE)
gene1 gene2
    2     1


## Compare results with corresponding gene range counts
gr1 <- GRanges("chr1", IRanges(start=100, width=901), strand="+", ID="gene1")
gr2 <- GRanges("chr1", IRanges(start=500, width=101), strand="+", ID="gene2")
grgene <- c(gr1, gr2)
grgenel <- split(grgene, elementMetadata(grgene)$ID)
assays(summarizeOverlaps(grgenel, rd, mode="Union", ignore.strand=TRUE))$counts
      [,1]
gene1    2
gene2    0

countOverlaps(grgenel, rd, ignore.strand=TRUE)
gene1 gene2
    3     1



Differential Gene Expression Analysis with DESeq

Documentation

The DESeq package contains functions to call differentially expressed genes (DEGs) in count tables based on a model using the negative binomial distribution. It expects as input a data frame with the raw read counts per region/gene of interest (rows) for each test sample (columns).  Such a count table can be imported into R or generated from BAM alignment files using the countOverlaps function as introduced above.
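
If a count table was generated outside of R, it can be imported with read.delim() and used in place of the sample data loaded below. A minimal sketch (the file name and sample labels are placeholders):

countsTable <- read.delim("my_counts.txt", row.names=1) # Rows are genes/regions, columns are samples, values are raw (untransformed) read counts.
conds <- factor(c("treated", "treated", "untreated", "untreated")) # One condition label per count column defining the biological replicates.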

##################################
## Import Sample RNA-Seq Counts ##
##################################
## Uses sample data from pasilla data package containing RNA-Seq experiments of
## Drosophila melanogaster cell cultures with RNAi knock-down of the splicing
## factor pasilla (Brooks et al, 2011, Genome Research).

library(DESeq); library("pasilla"); data("pasillaGenes")
counts(pasillaGenes)[1:4,]; pData(pasillaGenes) # Usually, users want to import here their own count data with the read.delim() or read.table() functions.
pairedSamples <- pData(pasillaGenes)$type == "paired-end"
countsTable <- counts(pasillaGenes)[, pairedSamples] # Select only paired-end data
countsTable[1:4,]

            treated2fb treated3fb untreated3fb untreated4fb
FBgn0000003          1          1            0            0
FBgn0000008        139         77           84           76
FBgn0000014         10          0            0            0
FBgn0000015          0          0            1            2

conds <- pData(pasillaGenes)$condition[pairedSamples] # Define biological replicates with factor
conds

[1] treated   treated   untreated untreated
Levels: treated untreated

##################
## Calling DEGs ##
##################
cds <- newCountDataSet(countsTable, conds) # Creates object of class CountDataSet derived from eSet class.
cds[, -1]; counts(cds)[1:4, ] # CountDataSet has similar accessor methods as eSet class.
cds <- estimateSizeFactors(cds) # Estimates library size factors from count data. Alternatively, one can provide here the true library sizes with: sizeFactors(cds) <- c(..., ...)
pData(cds); sizeFactors(cds)
counts(cds, normalized=TRUE)[1:4,] # Returns count values divided by their size factor which brings each sample to a common scale.
cds <- estimateDispersions(cds) # Estimates dispersion of each gene across replicates.
res <- nbinomTest(cds, "treated", "untreated") # Calls DEGs with nbinomTest.
res[order(res$padj), ][1:4,] # Returns for each gene: mean expression levels; fold change; log2 fold change; p-value for statistical significance of change; and the p-values adjusted for multiple testing with the Benjamini-Hochberg procedure which controls the false discovery rate (FDR).

               id  baseMean baseMeanA  baseMeanB foldChange log2FoldChange          pval          padj
9696  FBgn0039155  697.2227 1335.0258   59.41957 0.04450818      -4.489786 5.001208e-110 5.746388e-106
10162 FBgn0039827  296.2687  562.7488   29.78866 0.05293420      -4.239656  2.643077e-70  1.518448e-66
3163  FBgn0029167 3434.7912 5679.5372 1190.04511 0.20953206      -2.254757  8.164834e-50  3.127131e-46
6408  FBgn0034434  139.2476  263.8843   14.61102 0.05536905      -4.174776  1.130289e-42  3.246754e-39

sigDESeq <- res[res$padj <= 0.01, ]; sigDESeq <- na.omit(sigDESeq); sigDESeq <- as.character(sigDESeq[,1]) # Store genes with FDR <= 0.01 for comparison with edgeR

plotDE <- function(res) { plot(res$baseMean, res$log2FoldChange,  log="x", pch=20, cex=.3, col=ifelse( res$padj<.1, "red", "black")) } # Defines plotting function
plotDE(res) # Plots the log2 fold changes against the base means, coloring in red those genes that are significant at 10% FDR.

Differential Gene Expression Analysis with edgeR

Documentation

The edgeR package uses empirical Bayes estimation and exact tests based on the negative binomial distribution to call differentially expressed genes (DEGs) in count data. 

##################################
## Import Sample RNA-Seq Counts ##
##################################
## Uses the same sample data as in the DESeq section.

library("edgeR"); library(DESeq); library("pasilla"); data("pasillaGenes")
counts(pasillaGenes)[1:4,]; pData(pasillaGenes)
pairedSamples <- pData(pasillaGenes)$type == "paired-end"
x <- counts(pasillaGenes)[, pairedSamples] # Select only paired-end data
group <- pData(pasillaGenes)$condition[pairedSamples] # Define biological replicates with factor
x[1:4,]; group

            treated2fb treated3fb untreated3fb untreated4fb
FBgn0000003          1          1            0            0
FBgn0000008        139         77           84           76
FBgn0000014         10          0            0            0
FBgn0000015          0          0            1            2
[1] treated   treated   untreated untreated
Levels: treated untreated

#######################################################################
## Pairwise Comparisons Among Two or More Groups with Classic edgeR ##
#######################################################################
## Before fitting a negative binomial model, the dispersion(s) need to be estimated.
## For single factor experiments, edgeR uses the quantile-adjusted conditional
## maximum likelihood (qCML) method. Note: classic edgeR is limited to single factor designs.
y <- DGEList(counts=x, group=group) # Constructs DGEList object
y <- estimateCommonDisp(y) # Estimates common dispersion
y <- estimateTagwiseDisp(y) # Estimates tagwise dispersion
et <- exactTest(y, pair=c("treated", "untreated")) # Computes exact test for the negative binomial distribution, which has strong parallels with Fisher’s exact test. The 'pair' argument allows the user to determine which samples to compare. Note: the first group listed in the pair is the baseline for the comparison ('untreated-treated'), meaning genes with positive log-fold change are up-regulated in group 'untreated' compared with group 'treated', and vice versa for genes with negative log-fold change.
topTags(et, n=4) # Returns DEG table ranked by p-values adjusted for multiple testing (last column). It uses by default the Benjamini and Hochberg’s approach for controlling the false discovery rate (FDR). For more detailed examples, see sections 10-12 in edgeR manual.

Comparison of groups:  untreated-treated
              logConc    logFC       P.Value           FDR
FBgn0039155 -15.43537 4.444217 2.866244e-108 4.147455e-104
FBgn0039827 -16.57627 4.233360  3.752429e-80  2.714882e-76
FBgn0034434 -17.62894 4.153032  1.025818e-57  4.947862e-54
FBgn0034736 -16.88889 3.365485  1.119077e-51  4.048260e-48

## Compare DEGs from edgeR with those from DESeq in Venn diagram
all <- as.data.frame(topTags(et, n=100000))
sigedgeR <- all[all$adj.P.Val <= 0.01, ]; sigedgeR <- na.omit(sigedgeR); sigedgeR <- row.names(sigedgeR) # Store genes with FDR <= 0.01 for comparison with DESeq
source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/overLapper.R")
OLlist <- overLapper(setlist=list(DESeq=sigDESeq, edgeR=sigedgeR), sep="_", type="vennsets"); counts <- sapply(OLlist$Venn_List, length)
vennPlot(counts=counts) # Plots Venn diagram
counts # Returns Venn counts

      DESeq       edgeR DESeq_edgeR
          9        1091         340



#########################################
## Multifactor Analysis with glm edgeR ##
#########################################
## For multiple factor design experiments, edgeR uses the Cox-Reid profile-adjusted likelihood (CR)
## method for estimating dispersions. It handles multiple factors by fitting generalized linear models
## (GLM) with a design matrix. The CR method can be used to calculate a common dispersion for all the
## tags, a trended dispersion depending on the tag abundance, or separate dispersions for individual
## tags using the different estimateGLM*Disp functions. Note: one needs to estimate either the common
## dispersion or the trended dispersions prior to the estimation of tagwise dispersions.

y <- DGEList(counts=x, group=group) # Constructs DGEList object
design <- model.matrix(~group)
y <- estimateGLMTrendedDisp(y, design) # Estimates trended dispersions
y <- estimateGLMTagwiseDisp(y, design) # Estimates tagwise dispersions (requires pre-processing with estimateGLMCommonDisp or estimateGLMTrendedDisp). To estimate tagwise dispersions, the empirical Bayes method is applied to squeeze tagwise dispersions towards the common dispersion or trended dispersions, whichever exists.
fit <- glmFit(y, design) # Fits the negative binomial GLM for each tag and produces an object of class DGEGLM with some new components.
lrt <- glmLRT(fit, coef=2) # Takes DGEGLM object and carries out the likelihood ratio test.
topTags(lrt, n=4)

Coefficient:  groupuntreated
               logConc     logFC       LR       P.Value           FDR
FBgn0039155  -9.822991  4.445678 584.9491 3.144586e-129 4.550215e-125
FBgn0039827 -10.684400  4.228985 301.2053  1.799604e-67  1.302014e-63
FBgn0035085 -10.059529  2.437774 235.3055  4.152965e-53  2.003114e-49
FBgn0033764 -11.856412 -3.575744 233.7005  9.297436e-53  3.363347e-49

lrt <- glmLRT(y, fit, contrast=c(-1, 1)) # A contrast/comparison among specific samples can be specified with a vector containing values 0 , 1 and -1, and length matching the number of columns in the design matrix.

###########################
## Normalization Factors ##
###########################
## The use of normalization factors, such as the weighted trimmed mean of M-values (TMM),
## can be considered for RNA samples with extreme composition bias, where a small number
## of genes accounts for most of the reads in a library.

y <- calcNormFactors(y) # Given a count table or DGEList object, the calcNormFactors() function computes by default the normalization factors with the TMM method.
y$samples

                 group lib.size norm.factors
treated2fb     treated 15620018    1.0218827
treated3fb     treated 12733865    1.0126983
untreated3fb untreated 10283129    0.9982300
untreated4fb untreated 11653031    0.9680287

## Smear plot for the first sample in y$counts (left panel) and the third one (right panel)
par(mfrow = c(1, 2))
maPlot(y$counts[, 1], y$counts[, 3], normalize = TRUE, pch = 19, cex = 0.4, ylim = c(-8, 8))
grid(col = "blue")
abline(h = log2(y$samples$norm.factors[3]/y$samples$norm.factors[1]), col = "red", lwd = 4)
eff.libsize <- y$samples$lib.size * y$samples$norm.factors
maPlot(y$counts[, 1]/eff.libsize[1], y$counts[, 3]/eff.libsize[3], normalize = FALSE, pch = 19, cex = 0.4, ylim = c(-8, 8))
grid(col = "blue")


A variety of additional R packages are available for normalizing RNA-Seq read count data and identifying differentially expressed genes (DEGs).


Detection of Alternative Splice Junctions

RNA-Seq data can also be used to analyze splice junctions, and several external software tools provide this functionality.

DNA-Methylation Data Analysis

HT-Seq Data Visualization

  • ggbio: ggplot2 extension for genomics data (online manual)

  • Gviz: Plotting data and annotation information along genomic coordinates

  • HilbertVis: Hilbert genome plots

  • GenomeGraphs: Plotting genomic information from Ensembl

  • TileQC: Flow Cell Quality Visualization

  • rtracklayer: R interface to genome browsers

  • genoPlotR: Plotting maps of genes and genomes

  • Genominator: Tools for storing, accessing, analyzing and visualizing genomic data

Command Line Alignment Utilities

SOAP

SOAP stands for "Short Oligonucleotide Analysis Package." This is a particularly powerful and fast alignment tool which supports mismatches and indels. It also contains a new tool for SNP calling.

SOAP Web Page

Running SOAP

SOAP is run from the unix command line, and requires a reference genome in .fasta format, and a list of query sequences in .fasta or .fastq format. This example was tested with SOAP version 2.18.

wget http://biocluster.ucr.edu/~tbackman/genome.fasta # download a test reference genome (TAIR9 Chromosome 1)
wget http://biocluster.ucr.edu/~tbackman/query.fastq # download some test Illumina reads from Arabidopsis
soap # inspect command line options
2bwt-builder genome.fasta
   # create binary of reference genome
soap -a query.fastq -D genome.fasta.index -o output.soap
   # align query to genome and store output

Parsing Output with R

library(ShortRead)
alignedReads <- readAligned("./", pattern="output.soap", type="SOAP")
   # Reads in alignment coordinates as a ShortRead AlignedRead object

MAQ

MAQ stands for "Mapping and Assembly with Quality." It is a suite of tools which allows you to build a consensus assembly by mapping short reads to a reference. This is particularly useful for identifying SNPs. MAQ can also be used as a general purpose short read alignment tool, and contains some useful file format converters.

MAQ Web Page

Running MAQ

wget http://biocluster.ucr.edu/~tbackman/genome.fasta # download a test reference genome (TAIR9 Chromosome 1)
wget http://biocluster.ucr.edu/~tbackman/query.fastq # download some test Illumina reads from Arabidopsis
maq # inspect command line options
maq fasta2bfa genome.fasta genome.bfa
   # create binary of reference genome
maq fastq2bfq query.fastq readBinary.bfq
   # create a binary of dataset
maq match out.map genome.bfa readBinary.bfq
   # align query to genome and store output

Parsing Output with R

library(ShortRead)
alignedReads <- readAligned("./", pattern="out.map", type="MAQMap")
   # Reads in alignment coordinates as a ShortRead AlignedRead object

Bowtie


Bowtie is an ultra-fast alignment tool. It has somewhat less flexibility with regard to mismatches than SOAP or MAQ, but much greater performance. It does not currently support gapped alignment. Bowtie is generally an excellent choice if you don't need any of the extended features provided by lower performance tools.

Bowtie web page

Running Bowtie

wget http://biocluster.ucr.edu/~tbackman/genome.fasta # download a test reference genome (TAIR9 Chromosome 1)
wget http://biocluster.ucr.edu/~tbackman/query.fastq # download some test Illumina reads from Arabidopsis
cp /home/tbackman/.html/genome.fasta . # Alternative- copy from biocluster folder
cp /home/tbackman/.html/query.fastq . # Alternative- copy from biocluster folder
bowtie # inspect command line options
bowtie-build genome.fasta genome.binary
   # create binary genome (replace genome.fasta with your genome)
   # You can also download pre-built genomes from the Bowtie web page
bowtie -q genome.binary query.fastq output.bowtie
   # align query to genome and store output (replace query.fastq with your query sequences
   # and genome.binary with the index you built in the previous step)

Parsing Output with R

library(ShortRead)
alignedReads <- readAligned("./", pattern="output.bowtie", type="Bowtie")
   # Reads in alignment coordinates as a ShortRead AlignedRead object

BWA

See the example in the Rsamtools section.
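
For convenience, a basic single-end BWA workflow is sketched below; the commands assume BWA is installed on the system, and the complete example, including import into R, is given in the Rsamtools section:

bwa index genome.fasta
   # index the reference genome
bwa aln genome.fasta query.fastq > output.sai
   # align single-end reads against the indexed genome
bwa samse genome.fasta output.sai query.fastq > output.sam
   # generate a SAM file from the alignment for downstream processing (e.g. with Rsamtools)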

Exercises

  • Basic Sequence and Range Operations
  • NGS Application Domains
  • SNP/INDEL Discovery and Annotation

Session Info

If you receive errors or unexpected behavior when performing exercises from this manual, please double check that you are using the same software version(s) for which it was written:

> sessionInfo()
R version 2.14.0 (2011-10-31)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] grid      stats     graphics  grDevices utils     datasets  methods 
[8] base    

other attached packages:
 [1] ChIPpeakAnno_2.2.0                  gplots_2.10.1                     
 [3] KernSmooth_2.23-7                   caTools_1.12                      
 [5] bitops_1.0-4.1                      gdata_2.8.2                       
 [7] gtools_2.6.2                        limma_3.10.0                      
 [9] org.Hs.eg.db_2.6.4                  GO.db_2.6.1                       
[11] RSQLite_0.10.0                      DBI_0.2-5                         
[13] BSgenome.Ecoli.NCBI.20080805_1.3.17 multtest_2.10.0                   
[15] DESeq_1.6.1                         locfit_1.5-6                      
[17] lattice_0.20-0                      akima_0.5-4                       
[19] BayesPeak_1.6.0                     BiocInstaller_1.2.1               
[21] GenomicFeatures_1.6.4               AnnotationDbi_1.16.4              
[23] leeBamViews_0.99.13                 Biobase_2.14.0                    
[25] edgeR_2.4.0                         Rsamtools_1.6.2                   
[27] biomaRt_2.10.0                      BSgenome_1.22.0                   
[29] GenomicRanges_1.6.3                 Biostrings_2.22.0                 
[31] IRanges_1.12.2                    

loaded via a namespace (and not attached):
 [1] annotate_1.32.0    genefilter_1.36.0  geneplotter_1.32.1 MASS_7.3-16      
 [5] RColorBrewer_1.0-5 RCurl_1.7-0        rtracklayer_1.14.3 splines_2.14.0   
 [9] survival_2.36-10   tools_2.14.0       XML_3.4-3          xtable_1.6-0     
[13] zlibbioc_1.0.0   




Installing Packages


source("http://bioconductor.org/biocLite.R")
biocLite()
biocLite(c("ShortRead", "Biostrings", "IRanges", "BSgenome", "rtracklayer", "biomaRt", "chipseq", "ChIPpeakAnno", "Rsamtools", "BayesPeak", "PICS", "GenomicRanges", "DESeq", "edgeR", "leeBamViews", "GenomicFeatures", "BSgenome.Celegans.UCSC.ce2"))

