HT Sequence Analysis with R and Bioconductor Authors: Tyler Backman, Rebecca Sun and Thomas Girke, UC Riverside Introduction[ Sequencing Technologies ] [ Latest Slides from NGS Analysis Workshop ] High throughput sequencing (HT-Seq or HTS), also known as next generation sequencing (NGS), presents a wide spectrum of opportunities for genome research. Unfortunately, many existing bioinformatic tools do not scale well to large datasets consisting of tens of millions of sequences generated by technologies like Illumina/Solexa, Roche/454, ABI/SOLiD and Helicos. The Bioconductor project fills this gap by providing a rapidly growing suite of well designed R packages for analyzing traditional and HT-Seq datasets. These 'BioC-Seq' packages allow to analyze these sequences with impressive speed performance. Their accelerations are achieved by using memory efficient string containers and performing the time consuming computations with calls to external programs that are implemented in compiled languages (e.g. C/C++). Together these packages form a novel framework that allows researchers to develop efficient pipelines by performing complex data analysis in a high level data analysis and programming environment.
Scope of this ManualThis manual is intended for users who have a basic knowledge of the R environment, and would like to use R/Bioconductor to perform general or HT sequencing analysis. For new users, it is recommended to first review the material covered in the "R Basics" section (see below). To obtain a more comprehensive overview of R's sequence bioinformatics utilities, Robert Gentleman's book "R Programming for Bioinformatics" is an excellent choice. The packages introduced here are a 'personal selection' of the authors of this manual that does not reflect the full utility spectrum of the Bioconductor project for this field of application. The introduced packages were chosen, because the authors use them often for their own teaching and research. To obtain a broad overview of available Bioconductor packages, it is strongly recommended to consult its official project site (http://bioconductor.org). Due to the rapid development of many packages, it is also important to be aware that this manual will often not be fully up-to-date. Because of this and many other reasons, it is absolutely critical to use the original documentation of each package (PDF manual or vignette) as primary source of documentation. Users are welcome to send suggestions for improving this manual directly to the authors.
NGS WorkflowsReaders of this manual might also be interested in the recently released systemPipeR package which provides convenience utilities for building end-to-end analysis pipelines with automated report generation for a variety next generation sequence (NGS) applications such as RNA-Seq, ChIP-Seq, VAR-Seq and Ribo-Seq. WorkshopsWorkshops on HT sequence data analysis using BioC-Seq resources are announced on the Bioconductor workshop site. The Institute for Integrative Genome Biology, IIGB at UCR also offers workshops in this field. In addition, there are many more institutions that organize similar workshops. The best way to find out about upcoming events in a certain location is to run a Google search on this topic.
R BasicsThe R & BioConductor manual provides an introduction on the usage of the R environment and its basic command syntax, and the Programming in R manual introduces a more advanced overview of R's programming syntax. String Handling Utilities in R's Base DistributionThe R base distribution contains various functions for handling and manipulating strings. These utilities are very useful for basic sequence analysis tasks. However, often it is much more effective to use for these tasks dedicated Bioconductor sequence analysis packages, such as Biostrings. This manual primarily introduces functions and data objects from the Bioconductor project. Nevertheless, the following commands provide a brief summary of some very useful string handling functions available in R's base distribution.##################### ## String Matching ## ##################### myseq <- c("ATGCAGACATAGTG", "ATGAACATAGATCC", "GTACAGATCAC") # Creates a sample sequence data set. myseq[grep("ATG", myseq)] # String searching with regular expression support. pos1 <- regexpr("AT", myseq) # Searches 'myseq' for first match of pattern "AT". as.numeric(pos1); attributes(pos1)$match.length # Returns position information of matches. pos2 <- gregexpr("AT", myseq) # Searches 'myseq' for all matches of pattern "AT". as.numeric(pos2[[1]]); attributes(pos2[[1]])$match.length # Returns position information of matches in first sequence. gsub("^ATG", "atg", myseq) # String substitution with regular expression support. ######################## ## Positional Parsing ## ######################## nchar(myseq) # Computes length of strings. substring(myseq[1], c(1,3), c(2,5)) # Positional parsing of several fragments from one string. substring(myseq, c(1,4,7), c(2,6,10)) # Positional parsing of many strings. ############################ ## Reverse and Complement ## ############################ myseq_comp <- chartr("ATGC", "TACG", myseq) # Returns complement for given DNA sequences. substring(myseq[1], 1:nchar(myseq[1]), 1:nchar(myseq[1])) # Vectorizes single sequence. x <- strsplit(myseq_comp, "", fixed=TRUE) # Vectorizes many sequences. x <- lapply(x, rev) # Reverses vectors. myseq_revcomp <- sapply(x, paste, collapse="") # Collapses vectors to strings. Final result: reverse and complement of myseq. ######################################### ## Translate DNA Sequence into Protein ## ######################################### y <- c("ATGCATTGGACGTTAG") # Creates sample DNA sequence. AAdf <- read.table(file="http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/AA.txt", header=T,"\t") # Imports genetic code. AAv <- AAdf[,2]; names(AAv) <- AAdf[,1] # Creates named vector with genetic code y <- gsub("(...)", "\\1_", y) # Inserts "_" after each triplet. y <- unlist(strsplit(y, "_")) # Splits on "_" and returns vectorized triplets. y <- y[grep("^...$", y)] # Removes incomplete triplets. AAv[y] # Translation into protein by name-based subsetting. ############################# ## Create Random Sequences ## ############################# sapply(1:100, function(x) paste(sample(c("A","T","G","C"), 20, replace=T), collapse="")) # Creates 100 random DNA sequences with 20 residues. sapply(1:100, function(x) paste(sample(1:40, 20, replace=T), collapse=" ")) # Creates 100 random score sets with 20 elements. ################################################### ## Sequence Printing and Writing in FASTA Format ## ################################################### myseq <- sapply(1:12, function(x) paste(sample(c("A","T","G","C"), 40, replace=T), collapse="")) writeLines(myseq) # Prints the sequences to the screen, one per line. writeLines(strtrim(myseq, 10)) # The strtrim() function allows to print only the first section of the sequences. writeLines(strwrap(myseq, indent = 20)) # The strwrap() function provides wrapping, indenting and prefixing utilities. myname <- paste(">", month.name, sep="") # Creates sample names for the sequences. writeLines(as.vector(t(cbind(myname, myseq))), "myseq.fasta") # Writes the sequences in FASTA format to a file. Bioconductor Packages
BiostringsThe Biostrings package from Bioconductor provides an advanced environment for efficient sequence management and analysis in R. It contains many speed and memory effective string containers, string matching algorithms, and other utilities, for fast manipulation of large sets of biological sequences. The objects and functions provided by Biostrings form the basis for many other sequence analysis packages.
Handling Single Sequences with XStringThe XString class allows to represent the different types of biosequence strings in the subclasses: BString, RNAString, DNAString and AAString.library(Biostrings) d <- DNAString("GCATAT-TAC") # Creates DNAString object. r <- RNAString("GCAUAU-UAC") # Creates RNAString object. r <- RNAString(d) # Converts d into RNAString object. p <- AAString("HCWYHH") Comparing and Subsetting XString ObjectsSeveral useful utilities are available to access, subset and manipulate XString objects.replaceLetterAt(d, c(1), "N") # Replaces residues at specified positions. length(d) # Returns the length of the sequence that is stored in d. toString(d) # Converts DNAString to character vector of length 1. d[2:6] # Prints d from postions 2 to 6. This makes positional parsing of sequences very easy! d[length(d):1] # Prints the reverse string of d. subseq(d, start=3, end=6) # Note: the subseq() function from the IRanges package is more efficient for subsetting large sequences. d[1:4]==d[1:4] # Checks whether two sequences are identical. In this case it returns TRUE. d[1:4]==d[1:5] # Returns FALSE. d == r # Interprets DNA and RNA sequences the same way. As a result, the given example returns TRUE. d[7] <- DNAString("N") # Replaces base 7 with a "N". subseq(d, start=7, width=1) <- DNAString("-") # Replaces base 7 with a "-". Subfeature Retrieval with StringViewsThe XStringViews class provides utilities to manage and view subfeatures from sequences stored in XString objects.dlong <- DNAString("GCATATTACAATCGATCCGCATATTAC") dsub <- Views(dlong, start = c(1,6,11,20), end = c(5,8,18,27)) # Generates XStringViews object with four views (subfeatures). dsub[1:2] # Returns first two views. dsub[[1]] # Returns first view as XString object. subject(dsub) # Returns views as a single sequence. paste(dsub, collapse="_") # Collapses views into a single sequence. attributes(dsub) # Returns attribute list of XStringViews object. start(dsub); end(dsub); width(dsub) # Functions to returns the different XStringView components as vectors. sapply(as.list(dsub), toString) # Returns sequences in XStringView object as vector. dsub[dsub == DNAString("ATCGATCC")] # Subsetting by a sequence string. Masking SequencesBiostrings allows to mask XString objects in order to exclude undesired regions from consideration by various analysis tools, e.g. repetitive sequences in the mapping of short reads. The advantage of this masking approach is that it does not result in any information loss, because masks can be turned on and off any time. In addition, the process is very memory efficient.d <- DNAString("CGCTATCTGACTGCCGCGCC") # Creates sample sequence. maskMotif(d, "GC") # Function for masking a sequence by a pattern. m1 <- Mask(mask.width=nchar(d), start=c(1,17), end=c(3,20)) # Defines mask m1. m2 <- Mask(mask.width=nchar(d), start=c(4,12), end=c(7,14)) # Defines mask m2 m12 <- append(m1, m2) # Joins two masks in one container. names(m12) <- c("A", "B") # Assigns names to mask components. active(m12)["B"] <- FALSE # Turns one mask off. masks(d) <- m12 # Applies masks to DNAString object. No information is lost by this action! d; nchar(d); length(d) # Returns some information about masked object. active(m12)["B"] <- TRUE # Turns mask B back on. masks(d) <- m12 # Reapplies modified mask object to d. gaps(d) # Reverses all the masks. reverseComplement(d) # Returns the reverse and complement for a masked sequence object. dfrag <- as(d, "XStringViews") # Turns MaskedXString object into an XStringViews object. sapply(as.list(dfrag), toString) # Retrieves all unmasked sections in d as strings in vector. injectHardMask(d, "N") # Injects/fills the masked regions of a sequence with a user-specified letter. masks(d) <- NULL # Removes all masks from DNAString object. Alternative command: unmasked(). Handling Many Sequences with XStringSetThe XStringSet class allows to represent many sequences in a single container. The subclasses for the different types of biosequences are: BStringSet, RNAStringSet, DNAStringSet and AAStringSet.dset <- DNAStringSet(c("GCATATTAC", "AATCGATCC", "GCATATTAC")) # The corresponding XStringSet containers allow to store many biosequences in one object. names(dset) <- c("seq1", "seq2", "seq3") # Assigns names to the elements in a XStringSet object. rset <- RNAStringSet(c("GCAUAUUAC", "UUUAAAA", "GCAUAUUAC")) rset <- RNAStringSet(dset) # Converts dset into RNAStringSet object. pset <- AAStringSet(c("HCWYHH", "SPPRHA")) Comparing and Subsetting of XStringSet ObjectsSeveral useful utilities are available to access, subset and manipulate XStringSet objects.dset[1:2] # Returns the first two sequence entries from dset. append(dset, dset); c(dset, dset) # Appends/concatenates two XStringSet objects. unlist(dset)] # Collapses many sequences to a single one stored in a DNAString container. DNAStringSet(dset, start=c(1,2,3), end=c(4,8,5)) # Sequence subsetting by positions provided under start and end. d==dset[[1]] # The [[ subsetting operator returns a single entry as XString object. lapply(as.list(dset), toString) # Converts XStringSet to list. If the name field is populated then the list elements will be named accordingly. names(dset); width(dset) # Returns the name and length of each sequences, respectively. dsetv <- sapply(as.list(dset), toString); table(dsetv) # Returns the sequences in XStringSet as named vector and then counts the number of occurrences of each sequence. dset[!duplicated(dset)] # Removes the duplicated sequence in dset. Table of Contents
Sequence Import and ExportThe set of read.XStringSet import functions allow to import one or many FASTA-formatted biosequences from an external file directly into XStingSet objects.################################################ ## Import and Export for XStringSet Container ## ################################################ myseq1 <- read.DNAStringSet("ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Halobacterium_sp_uid217/AE004437.ffn","fasta") #
Imports all open reading frames of the Halobacterium genome from NCBI
into DNAStringSet object. If this returns an error message then first
download the sequences to your desktop and import them from there. myseq1[grep("phosphoesterase", names(myseq1))] # Retrieves all sequences containing the term "phosphodiesterase" in their name field. myseq2 <- myseq1[grep("^ATGCTT", sapply(as.list(myseq1), toString))] # Retrieves all sequences starting with "ATGCTT". write.XStringSet(myseq2, file="myseq2.txt", format="fasta", width=80) # Writes sequences to file in FASTA format. write.XStringSet(myqual, file="myqual.txt", format="fasta", width=80) # Writes quality scores from above example to file.
Handling of Sequence and Quality Data with QualityScaleXStringSetThe QualityScaleXStringSet class allows to represent many sequences plus their quality data in a single container. The subclasses for the different types of biosequences are: QualityScaledBStringSet, QualityScaledDNAStringSet, QualityScaledRNAStringSet and QualityScaledAAStringSet.########################## ## About Quality Scores ## ########################## phred <- 1:9 phreda <- paste(sapply(as.raw((phred)+33), rawToChar), collapse=""); phreda as.integer(charToRaw(phreda))-33 # Phred quality scores are integers from 0-50 that are stored as ASCII characters after adding 33. The basic R functions rawToChar and charToRaw can be used to interconvert among their representations. Solexa scores are similar, but use an offset of 64. ########################################## ## Quality Score Handling in Biostrings ## ########################################## dset <- DNAStringSet(sapply(1:100, function(x) paste(sample(c("A","T","G","C"), 20, replace=T), collapse=""))) # Creates random sample sequence. myqlist <- lapply(1:100, function(x) sample(1:40, 20, replace=T)) # Creates random Phred score list. myqual <- sapply(myqlist, function(x) toString(PhredQuality(x))) # Converts integer scores into ASCII characters. myqual <- PhredQuality(myqual) # Converts to a PhredQuality object. lapply(seq(along=myqual), function(x) as.integer( charToRaw( as.character(myqual[x])))) # Converts Phred scores from ASCII decimal back to integer representation. library(ShortRead); setSR <- ShortReadQ(sread=dset, quality=FastqQuality(BStringSet(myqual)), BStringSet(myqual)) myMA <- as(quality(setSR), "matrix") # Fast conversion of all quality ASCII values into a numeric matrix by using the ShortReadQ function from the ShortRead library. boxplot(as.data.frame((myMA))) # Creates box plot of quality values. dsetq1 <- QualityScaledDNAStringSet(dset, myqual) # Combines DNAStringSet and quality data in QualityScaledDNAStringSet object. dsetq1[rowSums(myMA < 20) <= 4] # Filters for sequences that contain not more than 4 bases with a quality score below 20. dset <- DNAStringSet(dsetq1) # Extract sequence component from QualityScaledDNAStringSet object. qset <- quality(dsetq1) # Extract quality component from QualityScaledDNAStringSet object. DNAStringSet(dset[1:2], start=c(1,5), end=c(4,9)) # Positional subsetting for sequence data. BStringSet(qset[1:2], start=c(1,5), end=c(4,9)) # Positional subsetting for corresponding quality data. ########################################################### ## Example for Injecting Ns at Positions with Low Scores ## ########################################################### XStringSetInject <- function(myseq=dset[[1]], myqual=qset[1], cutoff=20, mychar="N") { # Defines character inject function XStringSetInject. mypos <- which(as.integer(myqual)<cutoff) return(toString(replaceLetterAt(myseq, which(as.integer(myqual)<cutoff), rep(mychar, length(mypos))))) } dvec <- sapply(seq(along=dset), function(x) XStringSetInject(myseq=dset[[x]], myqual=qset[x], cutoff=20)) # Apply XStringSetInject function to all sequences. dset2 <- DNAStringSet(dvec) names(dset2) <- paste("d", seq(along=dvec), sep="") names(myqual) <- paste("d", seq(along=dvec), sep="") dsetq2 <- QualityScaledDNAStringSet(dset2, myqual) # Reassemble results in QualityScaledDNAStringSet object. #################################### ## Filter Sequences by Ns (Score) ## #################################### Ncount <- alphabetFrequency(DNAStringSet(dvec))[,15] # Generates N counts for each sequence in dvec. dsetq2[Ncount <= 2] # Returns only sequences with 0-2 Ns. length(dsetq2[Ncount <= 2]) # Returns the number of sequences that have 0-2 Ns. Basic Sequence Manipulation UtilitiesThe Biostrings package contains a wide spectrum of functions for performing many of the basic sequence transformation and manipulation routines, such as computing frequency tables, reverse & complements, translating DNA into protein sequences, etc. The following examples introduce a subset of these basic sequence analysis functions.##################### ## Getting Started ## ##################### library(help=Biostrings) # Lists all functions provided by Biostrings. myseq1 <- read.DNAStringSet(" ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Halobacterium_sp_uid217/AE004437.ffn ", "fasta") # Imports a sample sequence data set: all open reading frames of the Halobacterium genome from NCBI. If this returns an error message then first download the sequences to your desktop and import them from there. ###################### ## Useful Data Sets ## ###################### GENETIC_CODE # Prints genetic code information. IUPAC_CODE_MAP # Returns IUPAC nucleotide ambiguity codes. data(BLOSUM80) # Loads BLOSUM80 substitution matrix. More on this: ?substitution.matrices ?substitution.matrices # Detailed information on provided substitution matrices. ##################################################### ## Sequence Variants (SNPs) and Consensus Analysis ## ##################################################### data(phiX174Phage) # Imports six versions of the complete genome for bacteriophage phi X174 as DNAStringSet object. mymm <- which(rowSums(t(consensusMatrix(phiX174Phage))[,1:4] == 6)!=1) # Creates a consensus matrix and returns the mismatch (SNP) positions. DNAStringSet(phiX174Phage, start=mymm[1], end=mymm[1]+10) # Prints the sequence sections for the first SNP region in the phiX174Phage sequences. consensusString(DNAStringSet(phiX174Phage, start=mymm[1], end=mymm[1]+10)) # Prints the consensus sequence for the corresponding SNP region. mysnp <- t(consensusMatrix(phiX174Phage))[mymm,1:4]; rownames(mysnp) <- paste("SNP", mymm, sep="_"); mysnp # Returns the consensus matrix data only for the SNP positions. mysnp2 <- lapply(seq(along=mymm), function(x) sapply(as.list(DNAStringSet(phiX174Phage, start=mymm[x], end=mymm[x])), toString)) mysnp2 <- as.data.frame(mysnp2) colnames(mysnp2) <- paste("SNP", mymm, sep="_"); mysnp2 # Returns all SNPs in a data frame and names them according to their positions. ############################################# ## Reverse and Complement of DNA Sequences ## ############################################# reverseComplement(myseq1) # Returns reverse and complement for may BioC object types including XString, XStringView and XStringSet. reverseComplement(dsub) # Does the same for XStringView objects. Adjust all start and end coordinates accordingly. ######################################### ## Residue Frequency (e.g. GC Content) ## ######################################### alphabetFrequency(myseq1)[1:4,] # Computes the residue frequency for all sequences in myseq1. data.frame(ID=names(myseq1), GC=rowSums(alphabetFrequency(myseq1)[,c(2,3)]/width(myseq1))*100)[1:4,] # Returns GC content as data frame. trinucleotideFrequency(myseq1[1:4,]) # Returns triplet frequency. ########################################## ## Translate DNA into Protein Sequences ## ########################################## translate(myseq1[1]) # Translates a single ORF and returns result as AAStringSet object. translate(myseq1 ) # Translates all ORFs in myseq1.
Trimming of Flanking Sequences (e.g. Adaptors)Many cloning and barcoding approaches append flanking sequences (e.g. adaptors) to the actual sequences of a biosample. These artificial sequence fragments need to be removed to allow a reliable matching of the sequences against their corresponding genomes or transcriptomes. The trimLRPatterns() function is a very efficient and flexible utility for removing flanking patterns from both ends of sequences.myseq <-DNAStringSet(c("CCCATGCAGACATAGTG", "CCCATGAACATAGATCC", "CCCGTACAGATCACGTG")); names(myseq) <- letters[1:3] # Creates sample sequences.
Pattern Search and Alignment Tools for Genome MappingBiostrings contains a wide spectrum of sequence search and pattern matching tools. This section provides only a brief introduction into a subset of these tools.###################### ## Pattern Matching ## ###################### ## The matchPattern() and vmatchPattern() functions allow to search a query (usually short) against one or many subject sequences, respectively. The number of mismatches and gaps can be specified by the user. mypos <- matchPattern("ATGGTG", myseq1[[1]], max.mismatch=1) # Finds all matches of a pattern in a reference sequence. mypos # The results are stored as XStringViews object.
myvpos <- vmatchPattern("ATGGCT", myseq1, max.mismatch=1) # Finds all matches of a pattern in a set of reference sequences. myvpos # The results are stored as MIndex object. myvcount <- countPattern("ATGGCT", myseq1[[1]], max.mismatch=1) # Counts only the corresponding matches. Views(myseq1[[1]], start(myvpos[[1]]), end(myvpos[[1]])) # Retrieves the result for single entry as XStringViews object. lapply(seq(along=myseq1), function(x) as.character(Views(myseq1[[x]], start(myvpos[[x]]), end(myvpos[[x]])))) # Retrieves the sequences of all matches. ###################################################### ## Pattern Matching with Regular Expression Support ## ###################################################### myseq <- DNAStringSet(c("ATGCAGACATAGTG",
"ATGAACATAGATCC", "GTACAGATCAC")) #
Creates a sample sequence data set. myseq[grep("^ATG", myseq, perl=TRUE)] # String searching with regular
expression support. pos1 <- regexpr("AT", myseq) # Searches 'myseq' for first match of
pattern "AT". as.numeric(pos1);
attributes(pos1)$match.length #
Returns position information of matches. pos2 <- gregexpr("AT", myseq) # Searches 'myseq' for all matches of
pattern "AT". as.numeric(pos2[[1]]);
attributes(pos2[[1]])$match.length #
Returns position information of matches in first sequence. DNAStringSet(gsub("^ATG", "NNN", myseq)) # String substitution with regular
expression support. ################################### ## Fast Matching with matchPDict ## ################################### ## The matchPDict function allows fast matching of many short sequences stored in a dictionary against a long reference sequence (e.g. chromosome). ## (1) Matching of Sequences to Reference mydict <- DNAStringSet(sapply(1:1000, function(x) paste(sample(c("A","T","G","C"), 8, replace=T), collapse=""))) # Creates random sample sequences. names(mydict)<- paste("d", 1:1000, sep="") # Names the sequences. mypdict <- PDict(mydict) # Creates a PDict dictionary. Allows only sequences of same length. mychr <- DNAString(sapply(1, function(x) paste(sample(c("A","T","G","C"), 20000, replace=T), collapse=""))) # Creates sample chromosome. mysearch <- matchPDict(mypdict, mychr, max.mismatch=0) # Searches all dictionary entries against chromosome. ## (2) Accessing the Results unlist(mysearch)[1:10] # Returns first 10 matches. start_index <- startIndex(mysearch) # Returns start positions of matches. end_index <- endIndex(mysearch) # Returns end positions of matches. count_index <- countIndex(mysearch) # Counts the number of matches for each sequence. with_match <- which(count_index >= 1) # Retrieves index of sequences with at least one match. Views(mychr, start=unlist(start_index[with_match]), end=unlist(end_index[with_match])) # Returns matches as XStringViews object. mydict[with_match] # Returns corresponding dictionary entries. ## (3) Coverage Plots mycov <- as.integer(coverage(mysearch, 1, length(mychr))) max(mycov); min(mycov) sum(mycov != 0) / length(mycov) plotCoverage <- function(coverage, start, end, mycol="red") { # Defines coverage plotting function. plot.new() plot.window(c(start, end), c(0, 10)) axis(1) axis(2) axis(4) lines(start:end, coverage[start:end], type="l", col=mycol) } plotCoverage(mycov, start=1, end=20000, mycol="red") # Plots coverage result. ## (4) Plot Fragments onto Chromosome source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/featureMap.txt") # Imports plotting function. mapDF <- data.frame(as.data.frame(unlist(mysearch))[,1:2], strand=1) # Creates coordinate data frame. colnames(mapDF) <- c("startx", "endx", "strand") genelength <- length(mychr) # Defines chromosome length. mapDF[,2] <- mapDF$endx + 200 # To see the matches better in the plot, their length is here extended by an arbitrary value (200). mapDF[mapDF[,2]>genelength,2] <- genelength Mysub <- paste("Chromosome length", genelength, "bp") # Specifies subtitle of plot. Mymain <- "Chromosome Plot for Sense Alignments" # Specifies title of plot. plotDF <- transformDF(plotDF=mapDF, genelength, method="arrange") # Transforms coordinates for plotting. featurePlot(plotDF=plotDF, geneline=6, genecol="green", featureline=2, featurecol=sample(c(2,4,8), length(plotDF[,1]), replace=T), showxy=F) # Plots the feature map for the match coordinates that are stored in 'mysearch'. The feature color argument allows to color each feature in a different color. ## (5) Allowing Mismatches with matchPDict mychr <- DNAString("CTAGGCTGTATCCTAAACACCGACAAGGTCTAGCCTGTATGCTAGCTAGTGTCCGCTCGGAAGCCGCT") mydict <- DNAStringSet(c("GTAGGCTGTATCCTA", "CGCTCGGAAGCCGCT")) names(mydict) <- paste("d", seq(along=mydict), sep="") mypdict <- PDict(mydict, tb.end=1) # Uses a trusted band of nucleotides that need to match exactly (here only first nucleotide). tb(mypdict); tail(mypdict); width(mypdict) # Methods for accessing the different trusted band components; '?PDict' provides more detailed information. mysearch <- matchPDict(mypdict, mychr, max.mismatch=1) # Allows one mismatch outside of trusted band. table(cut(countIndex(mysearch), c(0, 1:10), right=FALSE)) # Summary of matches. Computing Pairwise Sequence AlignmentsThe pairwise sequence alignment function from Biostrings provides a flexible environment for computing optimal pairwise alignments using different dynamic programming algorithms.s1 <- DNAString("GGGCCC"); s2 <- DNAString("GGGTTCCC") # Creates sample sequences. myalign <- pairwiseAlignment(s1, s2, type="local") # Computes pairwise local alignment with Smith-Waterman dynamic programming algorithm. myalign <- pairwiseAlignment(s1, s2, type="global") # Computes global alignment with Needleman-Wunsch algorithm. pairwiseAlignment(c("GGGCCC", "GAGGCC", "CGAGA"), s2, type="local", scoreOnly=T) # The function can also be used for sequence searching by providing the 'sequence database' as vector in the first argument. attributes(myalign)[-5] # Returns the components of an alignment object. subject(myalign); pattern(myalign) # Returns aligned sequences with gaps. Many more helper functions for alignments can be found in Biostrings' Alignment Manual. sq1 <- SolexaQuality(as.integer(c(22,22,22,12,12,12))) pairwiseAlignment(s1, s2, patternQuality = sq1, subjectQuality=sq2) # Uses a quality-based method for generating the substitution matrix. Multiple Sequence Alignments (MSAs)Multiple sequence alignments (MSAs) can be imported as FASTA formatted alignments into R using the read.XStringSet function. Almost all MSA programs support this format (e.g. ClustalW, T-Coffee, MUSCLE, Multalin). This import method will store the MSAs as XStringSet objects. The following examples introduce a variety of useful analysis routines on MSAs using this object class.########################### ## Import of MSAs into R ## ########################### library(Biostrings) p450 <- read.AAStringSet("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/Samples/p450.mul", "fasta") # Imports MSA of P450 sequences from a FASTA formatted file and stores the result as AAStringSet object. as.matrix(p450) # Converts MSA into a character matrix. consensusMatrix(p450) # Computes the residue frequencies for each MSA column and returns the results as matrix. #################################### ## Viewing of MSAs in HTML Format ## #################################### ## Viewing large alignments in a web browser can be very efficient. The following example outputs an HTML formatted alignment with consensus line. StringSet2html <- function(msa=p450, file="p450.html", start=1, end=length(p450[[1]]), counter=20, browser=TRUE, ...) { if( class(msa)=="AAStringSet") msa <-
AAStringSet(msa, start=start, end=end) if( class(msa)=="DNAStringSet") msa <- DNAStringSet(msa, start=start, end=end) msavec <- sapply(msa, toString) offset <- (counter-1)-nchar(nchar(msavec[1])) legend <- paste(paste(paste(paste(rep(" ", offset), collapse=""), format(seq(0, consensus <- consensusString(msavec, ambiguityMap=".", ...) msavec <- paste(msavec, rowSums(as.matrix(msa) != "-"), sep=" ") msavec <- paste(format(c("", names(msa), "Consensus"), justify="left"), c(legend, msavec, msavec <- c("<html><pre>", msavec, "</pre></html>") writeLines(msavec, file) if(browser==TRUE) { browseURL(file) } } StringSet2html(msa=p450, file="p450.html", start=1, end=length(p450[[1]]), counter=20, browser=T, threshold=1.0) # Writes MSA to file "p450.html", which can be viewed in a browser. StringSet2html(msa=p450, file="p450.html", start=450, end=470, counter=20, browser=T, threshold=1.0) # Same as above, but for alignment positions 450-470. ########################################################## ## Computation of Similarity/Distance Matrices for MSAs ## ########################################################## masim <- sapply(names(p450), function(x) sapply(names(p450), function(y) pid(PairwiseAlignedXStringSet(p450[[x]], p450[[y]]), type="PID2"))) masim # Returns a percent identity matrix. The identity method can be specified under the "type" argument, see help for pid() function. madist <- 1-(masim/100) madist # Returns corresponding distance matrix. disttree <- hclust(as.dist(madist)) plot(as.dendrogram(disttree), edgePar=list(col=3, lwd=4), horiz=T) # Computes and plots distance tree for MSA. #################################################### ## Mapping Sequence Postions to Gapped Alignments ## #################################################### ## Remove gaps from sequences, e.g. for pattern matching removeGaps <- function(align=p450[[1]], gap="-") { myseq <- maskMotif(align, gap) myseq <- paste(as(myseq, "XStringViews"), collapse="") return(myseq) } p450seq <- AAStringSet(sapply(p450, function(x) removeGaps(align=x))) mypos1 <- matchPattern("FSAGKRICAG", p450seq[[1]]); mypos1 # Removes gaps from sequences and provides a pattern matching example for P450 heme binding site. p450gapmask <- maskMotif(p450[[1]], "-") ## Mapping sequence postions to gapped alignments seqAlignmap <- function(seq=p450seq[1], align=p450[1], gap="-") { seqindex <- 1:length(seq[[1]]) gapindex <- which(as.matrix(align)==gap) alignindex <- (1:length(align[[1]]))[-gapindex] names(alignindex) <- seqindex return(alignindex) } alignindex <- seqAlignmap(seq=p450seq[1], align=p450[1], gap="-") # Returns a named vector with the sequence coordinates in the name field and the corresponding alignment coordinates in the data field. alignindex <- alignindex[start(mypos1)]:alignindex[end(mypos1)]; alignindex # Returns only the alignment positions for the above pattern match result. AAStringSet(p450, start=min(alignindex), end=max(alignindex)) # Returns the alignment segment for the above pattern match result. ############################################################# ## Simple Example for Sliding Window Conservation Analysis ## ############################################################# ## (A) Compute conservation vector using a simple conservation model. consCol <- function(myalign=p450) { myalign <- as.matrix(myalign) consV <- sapply(seq(along=myalign[1,]), function(x) max(table(myalign[,x][!myalign[,x] return(consV) } consV <- consCol(myalign=p450) ## (B) Perform sliding window analysis on conservation vector. slidingWindow <- function(mydata=consV, win=c(1, 100), start=1, end=length(consV)) { mydata <- mydata[start:end] windex <- t(sapply(0:length(mydata), function(x) win+x)) mywind <- sapply(seq(along=mydata), function(x) mean(mydata[windex[x,1]:windex[x,2]], mywind <- mywind[1:(length(mywind)-win[2])] return(mywind) } slideV <- slidingWindow(mydata=consV, win=c(1, 20)) # Returns conservation vector for window size 20. ## (C) Plot sliding window conservation data plot(slideV, type="l", lwd=2, col="red", main="Sliding Window Conservation Analysis", sub="Window Size: 20", xlab="Alignment Position", ylab="Conservation: max=1.0") heme <- AAStringSet(p450, start=454, end=463); heme text(454, 0.71, toString(heme[[1]]), cex=0.8) # The highly conserved cytochrome P450 heme binding signature maps to positions 454-463, which is located in the last big peak of the plot. Table of Contents
Position Weight Matrices (PWM)
This section introduces functions
for creating and matching position weight matrices (PWMs), and related utilities. |
Functions |
Description |
SolexaPath("path") |
Parses a Solexa output folder, and identifies important files. |
analysisPath(SolexaPathObject) |
Returns a list of GERALD output folders |
compose(filter1, filter2, filter3, ...) |
Returns a sequence filter based on a combination of various filter objects |
readAligned(SolexaPathObject, "filename", filter= myfilter) |
Reads in an Eland export file which contains sequences, alignment coordinates, and alignment-normalized quality scores based on a given filter object |
sread(AlignedReadObject) |
Returns a list of sequences contained in an AlignedReadObject (as returned by readAligned) |
coverage(AlignedReadObject, start=startnt, stop=stopnt) |
Calculates coverage of an alignment over each base in the alignment genome, and compresses the data to save memory |
Illumina Pipeline
- phred score=-10*log(error probability)/log(10)
Error Probability |
Phred Score |
1 | 0 |
0.1 |
10 |
0.01 | 20 |
0.001 | 30 |
0.0001 | 40 |
When quality scores are used to represent a long sequence (such as in a fastq file), they are often represented using the ASCII alphabet, adding the number 33 to Phred scores, and 64 to Illumina scores (The Illumina pipeline produces phred scores, but uses a different ASCII offset). For example, a Phred score of 40 can be represented as the ASCII char "I" (40+33=Ascii #73), and an Illumina score of 40 as "h" (40+64=Ascii #104).
Here is a link to a reference on the ASCII alphabet: http://www.asciitable.com/
ShortRead Objects
SolexaPath
This object stores output file locations from the Illumina Pipeline. The SolexaPath function which creates this object automatically finds the subfolders for each output file type.system("wget http://biocluster.ucr.edu/~tbackman/sample_flowcell.tgz") # download sample flowcell
system("cp /home/tbackman/.html/ .") # Alternative- copy from biocluster folder
system("tar xvfz sample_flowcell.tgz") # uncompress sample flowcell
library(ShortRead)
sp <- SolexaPath("sample_flowcell") # Creates a SolexaPath for a pipeline output folder
# general format: sp <- SolexaPath("/path/to/folder")
baseCallPath(sp) # returns base call path subfolder
analysisPath(sp) # returns GERALD/Eland alignment subfolder
imageAnalysisPath(sp) # Returns firecrest image analysis subfolder
#####################################
# Import Reads from Illumina Folder #
#####################################
lane <- 7
# read in all tiles of data
qualityInputFiles <- paste("s_", lane, "_[0-9]+_prb.txt", sep="")
sequenceInputFiles <- paste("s_", lane, "_[0-9]+_seq.txt", sep="")
reads <- readBaseQuality(baseCallPath(sp), seqPattern=sequenceInputFiles, prbPattern=qualityInputFiles,type="Solexa")
ShortReadQ
This object stores sequences, quality scores, and ids for a series of reads.library(ShortRead)
system("wget http://biocluster.ucr.edu/~tbackman/query.fastq") # download a sample illumina read set in fastq format
system("cp /home/tbackman/.html/query.fastq .") # Alternative- copy from biocluster folder
reads <- readFastq("query.fastq") # Reads in a FASTQ file from the current folder
id(reads) # outputs read ids as a list as BStringSet
sread(reads) # outputs read sequences as a list as DNAStringSet
quality(reads) # outputs list of quality scores as BStringSet
writeFastq(reads, "output.fastq") # writes ShortReadQ object to output file
AlignedRead
This objects stores the genome alignment coordinates (chromsome, position, strand) for each of a series of reads. system("wget http://biocluster.ucr.edu/~tbackman/output.soap") # download a sample alignment result produced by soap
system("cp /home/tbackman/.html/output.soap .") # Alternative- copy from biocluster folder
## Import AlignedRead Object
alignedReads <- readAligned("./", pattern="output.soap", type="SOAP")
# Reads in a external alignment (change type depending on program)
# general format: readAligned("/path/to/folder", pattern="filename", type="AlignmentTool")
alignedReads[1] # choose a read and inspect it's alignment coordinates
## AlignedRead Inspection and Alignment QC
length(alignedReads) # total number of hits
length(unique(id(alignedReads))) # count number of query sequences which aligned
length(unique(id(alignedReads[srduplicated(id(alignedReads))]))) # count reads producing multiple hits
table(table(as.character(id(alignedReads))) == 1)["TRUE"] # count reads with unique (single hit) alignment (assumes unique ids)
length(unique(position(alignedReads))) # count unique positions hit
length(unique(position(alignedReads)[duplicated(position(alignedReads))])) # count positions with multiple aligning reads
ShortRead Sequence Filters
ShortRead provides functions that return sequence filter objects, which allow you to filter your input sequences by various criteria. It is also possible to create new filter objects.Filter Function |
Description |
srFilter |
Creates a custom filter (see Vignettes) |
chromosomeFilter("chromosome name regex") |
Selects reads which align to specific chromosome(s) matching the given regular expression |
strandFilter("+/-") |
Selects reads which align to the + or - strand on the genome |
nFilter(# of Ns) |
Selects reads which contain fewer than a given number of N characters |
polynFilter(# of given nt) |
Selects only reads which contain less than a given number of the same nucleotide |
srdistanceFilter(DNAStringSet, DNAString) |
Performs Pairwise alignment of each DNAStringSet string against DNAString, and calculates edit distance |
uniqueFilter() |
Removes duplicate occurances of a given nucleotide sequence from a set of reads |
# Create several filters:
filter1 <- nFilter(threshold=1) # keep only reads without Ns
filter2 <- polynFilter(threshold=20, nuc=c("A")) # remove reads with 20 or more As
filter3 <- strandFilter("+") # choose reads which align to positive strand
filter <- compose(filter1, filter2, filter3) # Combine filters into one
alignedReads <- alignedReads[filter(alignedReads)] # filter alignedReads object with all 3 filters at once
Analysis Routines with ShortRead
The following examples introduce a variety of useful analysis routines that can be accomplished with functions provided by the ShortRead package.Import Reads, Filter, and Export
##################
## Import Reads ##
##################
# Hint: download with wget and import with readFastq
# Reads: http://biocluster.ucr.edu/~tbackman/query.fastq
##################
## Filter Reads ##
##################
# Hint: construct a filter object and apply to reads
(any type of filter)
##################
## Export Reads ##
##################
# Hint: store filtered reads in new object and export with writeFastq
fastq Quality Inspection
## Get sample fastq file
system("wget
http://biocluster.ucr.edu/~tbackman/query.fastq")
# download some test Illumina reads from Arabidopsis
system("cp /home/tbackman/.html/
query.fastq
.") # Alternative- copy from biocluster folderlibrary(ShortRead)
## plot per-cycle quality boxplot
reads <- readFastq("query.fastq")
# Read in datafile (replace query.fastq with your filename)
if (length(reads) > 1000000) {
reads <- sample(reads,1000000) # if more than a million reads, sample randomly
}
qual <- FastqQuality(quality(quality(reads))) # get quality scores
readM <- as(qual, "matrix") # convert scores to matrix
pdf(file="boxplot.pdf") # Save box plot as boxplot.pdf in current folder
boxplot(as.data.frame((readM)), outline = FALSE, main="Per Cycle Read Quality", xlab="Cycle", ylab="Phred Quality")
# build per cycle boxplot
dev.off()
system("cp boxplot.pdf ~/.html") # If you're on our biocluster server, this copies the output to your public web space where you can access it such as: http://biocluster.ucr.edu/~tbackman/boxplot.pdf (replace tbackman with your username)
## qa object quality control
qa <- qa(dirPath=".", pattern="query.fastq", type=c("fastq")) # read in fastq file
qa[['readCounts']] # inspect read yield
qa[['baseCalls']] # inspect base composition
qa[['frequentSequences']] # inspect most common sequences
## generate HTML quality report from qa object in qcReport folder (must have X server connected)
report(qa, dest="qcReport", type="html") # sample: http://biocluster.ucr.edu/~tbackman/qcReport/
SmallRNA Profiling
##################################
## Import Reads From Fastq file ##
##################################
system("wget http://biocluster.ucr.edu/~tbackman/srnareads.fastq") # download some test Illumina reads from Arabidopsis
system("cp /home/tbackman/.html/srnareads.fastq
.") # Alternative- copy from biocluster folderlibrary(ShortRead)
reads <- readFastq("srnareads.fastq")
id(reads) # inspect read ids
sread(reads) # inspect read sequences
quality(reads) # inspect quality scores
######################
## Filter Sequences ##
######################
# Construct several filters
filter1 <- nFilter(threshold=3) # keep only reads with fewer than 3 Ns
filter2 <- polynFilter(threshold=20, nuc=c("A", "C", "T", "G")) # remove reads with 20 or more of the same letter
filter <- compose(filter1, filter2) # Combine filters into one
reads <- reads[filter(sread(reads))] # filter sequences with all filters at once
# since these filters work on the sequences themselves, it is necessary to pass them the sequences
# with the sread(reads) function, rather than the entire ShortReadQ object
reads # inspect filtered reads
###################
## Trim Adapters ##
###################
seqs <- sread(reads) # get sequence list
qual <- quality(reads) # get quality score list
qual <- quality(qual) # strip quality score type
seqs <- DNAStringSet(seqs, start=4) # trim first 3 bases from sequence
qual <- BStringSet(qual, start=4) # trim first 3 bases from Qscore
# This is only necessary if you need to shorten the read by a fixed
# distance at the beginning. This is done here because the samples
# are multiplexed, and we wish to omit the multiplexing tag.
adapter <- "TGTAGGCACCATCACTCGGGCACCAAGGA"
# This is the adapter sequence to be trimmed from the end of your smallRNA reads
mismatchVector <- c(rep(2,length(DNAString(adapter))))
# This is a vector of equal length to the adapter with "2" for each value.
# This will allow the adapter to match the end of the sequence with any offset
# and up to 2 mismatches.
trimCoords <- trimLRPatterns(Rpattern=adapter, subject=seqs, max.Rmismatch=mismatchVector, ranges=T)
# Trim sequences looking for a right end pattern
# Gets IRanges object with trimmed coordinates
seqs <- DNAStringSet(seqs, start=start(trimCoords), end=end(trimCoords))
qual <- BStringSet(qual, start=start(trimCoords), end=end(trimCoords))
# Use IRanges coordinates to trim sequences and quality scores
qual <- SFastqQuality(qual) # reapply quality score type
trimmed <- ShortReadQ(sread=seqs, quality=qual, id=id(reads))
# Rebuild reads object with trimmed sequences and quality scores
#######################################
## Choose Size Range and Make Unique ##
#######################################
maxSize <- 30
minSize <- 19
trimmed <- trimmed[width(sread(trimmed)) <= maxSize] # remove all reads longer than maxSize
trimmed <- trimmed[minSize <= width(sread(trimmed))] # remove all reads shorter than minSize
uniquef <- uniqueFilter() # create filter
unique <- trimmed[uniquef(trimmed)] # run unique filter (eliminate duplicates)
# keep working with the non-unique reads if analyzing expression level data
######################
## Export Sequences ##
######################
writeFastq(trimmed, "filename.fastq") # write reads to filename.fastq
###########################################################
## Perform Alignment With Bowtie Using Unix Command Line ##
###########################################################
system("wget http://biocluster.ucr.edu/~tbackman/a_thaliana.ebwt.zip") # Download your reference genome in fasta format with wget (this is a prebuilt Bowtie Tair9 reference)
system("cp /home/tbackman/.html/a_thaliana.ebwt.zip
.") # Alternative- copy from biocluster foldersystem("unzip
a_thaliana.ebwt.zip") # uncompress reference genome
system("bowtie -q
a_thaliana
filename.fastq output.bowtie") # perform bowtie alignment with default options
######################
## Import Alignment ##
######################
alignedReads <- readAligned("./", pattern="output.bowtie", type="Bowtie") # read in alignment
table(width(sread(alignedReads))) # view table of SmallRNA Size distribution
#########################
## Find Coverage Peaks ##
#########################
coverage <- coverage(alignedReads) # calculate genomic coverage and returns an IRanges Rle object
islands <- slice(coverage[2], lower = 2) # look for regions on chrom 1 of 2x coverage or higher
islands # inspect regions
###########################################
## Visualize Alignment In Genome Browser ##
###########################################
library(rtracklayer)
chr1ranges <- as(coverage[2], "RangedData") # convert coverage for chromosome 1 into rtracklayer object
names(chr1ranges) <- "chr1"
# use chromosome names compatible with UCSC genome browser
export(
chr1ranges
, con="chromsome1.bed") # create output track for viewing in genome browser
system("cp
chromsome1.bed
~/.html") # If you're on our biocluster server, this copies the output to your public web space where you can access it such as:
http://biocluster.ucr.edu/~tbackman/chromsome1.wig (replace tbackman with your username)
If you are working on a remote server, you will need to copy this file locally before you can upload it to a genome browser. If you have trouble with the copy, you can download a sample here: chromsome1.wig
Go to an genome browser, upload your .wig file, and view it. You can try the Arabidopsis UCSC browser (http://epigenomics.mcdb.ucla.edu/cgi-bin/hgGateway?org=A.+thaliana&db=araTha1), by uploading your .wig file via the "add your own custom tracks" button. After uploading this, browse to the region you would like to see. A read pileup in the test dataset for the workshop can be found at chr1:17,825,707-17,825,775. Also, try inspecting regions of high coverage you found in the previous step.
Note: In this case, the UCSC genome browser is not using the same
version of the genome (TAIR9) so your read loci may not correlate with
the correct annotation features.
Trimming and Sorting Reads by 5' Muliplexing Adapters
Next generation sequencing technologies frequently produce a much larger volume of reads than necessary for a given library. Libraries can be combined on a single lane, and identified by unique 5' sequencing adapters, and are easily trimmed and sorted in R.##################################
## Import Reads From Fastq file ##
##################################
system("wget http://biocluster.ucr.edu/~tbackman/srnareads.fastq") # download some test Illumina reads from Arabidopsis
system("cp /home/tbackman/.html/srnareads.fastq .") # Alternative- copy from biocluster folder
library(ShortRead)
reads <- readFastq("srnareads.fastq")
##################################
## Inspect Adapter Distribution ##
##################################
seqs <- sread(reads) # get sequence list
adapterLength <- 4 # this is the length of your adapters
seqs <- DNAStringSet(seqs, end=adapterLength)
sort(table(as.character(seqs)), decreasing=TRUE)[1:10] # show top 10 adapter sequences
topAdapters <- rownames(sort(table(as.character(seqs)), decreasing=TRUE)[1:10])
# iterate through each of these top 10 adapters and write output to fastq files
for(adapter in topAdapters) {
###################
## Trim Adapters ##
###################
seqs <- sread(reads) # get sequence list
qual <- quality(reads) # get quality score list
qual <- quality(qual) # strip quality score type
mismatchVector <- 0 # allow no mismatches
trimCoords <- trimLRPatterns(Lpattern=adapter, subject=seqs, max.Lmismatch=mismatchVector, ranges=T)
# Trim sequences looking for a 5' pattern
# Gets IRanges object with trimmed coordinates
########################################
## Generate trimmed ShortReadQ object ##
########################################
seqs <- DNAStringSet(seqs, start=start(trimCoords), end=end(trimCoords))
qual <- BStringSet(qual, start=start(trimCoords), end=end(trimCoords))
# Use IRanges coordinates to trim sequences and quality scores
qual <- SFastqQuality(qual) # reapply quality score type
trimmed <- ShortReadQ(sread=seqs, quality=qual, id=id(reads))
# Rebuild reads object with trimmed sequences and quality scores
trimmed <- trimmed[start(trimCoords) == adapterLength + 1]
# keep only reads which trimmed the full adapter
###############################
## Write reads to Fastq file ##
###############################
outputFileName <- paste(adapter, ".fastq", sep="")
writeFastq(trimmed, outputFileName)
# export filtered and trimmed reads to fastq file
print(paste("wrote", length(trimmed), "reads to file", outputFileName))
}
Trimming Low Quality 3' Tails from Reads
Many Illumina reads have very low quality in the last few cycles, which can pose a problem for later analysis steps which are not quality score aware. In these cases, removing low quality 3' tails from a read can be beneficial.##################################
## Import Reads From Fastq file ##
##################################
system("wget http://biocluster.ucr.edu/~tbackman/query.fastq") # download some test Illumina reads
system("cp /home/tbackman/.html/query.fastq .") # Alternative- copy from biocluster folder
library(ShortRead)
reads <- readFastq("query.fastq")
########################################
## Inject Ns at low quality positions ##
########################################
qualityCutoff <- 10 # remove read tails with quality lower than this
seqs <- sread(reads) # get sequence list
qual <- PhredQuality(quality(quality(reads))) # get quality score list as PhredQuality
myqual_mat <- matrix(charToRaw(as.character(unlist(qual))), nrow=length(qual), byrow=TRUE) # convert quality score to matrix
at <- myqual_mat < charToRaw(as.character(PhredQuality(as.integer(qualityCutoff)))) # find positions of low quality
letter_subject <- DNAString(paste(rep.int("N", width(seqs)[1]), collapse="")) # create a matrix of Ns
letter <- as(Views(letter_subject, start=1, end=rowSums(at)), "DNAStringSet") # trim to length needed for each read
injectedseqs <- replaceLetterAt(seqs, at, letter) # inject Ns at low quality positions
####################################
## Get coordinates of polyN tails ##
####################################
adapter <- paste(rep("N", max(width(injectedseqs))), sep="", collapse="")
mismatchVector <- c(rep(0,width(adapter))) # allow no mismatches at each adapter offset
trimCoords <- trimLRPatterns(Rpattern=adapter, subject=injectedseqs, max.Rmismatch=mismatchVector, ranges=T)
# Trim sequences looking for a right end pattern (polyN in this case)
# Gets IRanges object with trimmed coordinates
##########################################################################
## Apply trimming coordinates from injected reads to non-injected reads ##
##########################################################################
seqs <- DNAStringSet(seqs, start=start(trimCoords), end=end(trimCoords))
qual <- BStringSet(qual, start=start(trimCoords), end=end(trimCoords))
# Use IRanges coordinates to trim sequences and quality scores
qual <- SFastqQuality(qual) # reapply quality score type
trimmed <- ShortReadQ(sread=seqs, quality=qual, id=id(reads))
# Rebuild reads object with trimmed sequences and quality scores
Quality Reports of FASTQ Files
The following seeFastq/seeFastqPlot functions generate and plot a series of useful quality statistics for a set of FASTQ files including per cycle quality box plots, base proportions, base-level quality trends, relative k-mer diversity, length and occurrence distribution of reads, number of reads above quality cutoffs and mean quality distribution. The functions allow processing of reads with variable length, but most plots are only meaningful if the read positions in the FASTQ file are aligned with the sequencing cycles. For instance, constant length clipping of the reads on either end or variable length clipping on the 3' end maintains this relationship, while variable length clipping on the 5' end without reversing the reads erases it. Related resources on this topic can be found here: qrqc, FastQC and FASTX-Toolkit.##################################
## Generate FASTQ quality plots ##
##################################
library(ShortRead); library(ggplot2)
## Import seeFastq/seeFastqPlot functions
source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/fastqQuality.R")
## Optional download of sample fastq files
system("wget http://biocluster.ucr.edu/~tgirke/HTML_Presentations/Manuals/Rngsapps/chipseqBioc2012/data.zip"); system("unzip data.zip")
## Specify fastq files names in named vector (here for all *.fastq files in 'data' directory of downloaded sample data set)
fastq <- list.files("data", "*.fastq$"); fastq <- paste("data/", fastq, sep="")
names(fastq) <- paste("flowcell6_lane", 1:length(fastq), sep="_") # Values in name slot will be used for sample labeling in plot!
## Compute quality stats
fqlist <- seeFastq(fastq=fastq, batchsize=100000, klength=8) # Returns summary stats in a relatively small list object that can be saved to disk with 'save()' and reloaded with 'load()' for later plotting. The argument 'klength' specifies the k-mer length and 'batchsize' the number of reads to random sample from each fastq file.
## Plot quality stats
seeFastqPlot(fqlist)
seeFastqPlot(fqlist[4:1], arrange=c(1,2,3,4,6,7)) # Example for plotting specific rows and columns.
## Output plot to PDF
pdf("fastqReport.pdf", height=18, width=4*length(fastq))
seeFastqPlot(fqlist)
dev.off()
## To run the functions from the Unix command-line, instead of the R console, download this run script seeFastq.R to your current working directory, specify your FASTQ files in a file of this format fastqFiles.txt
and then execute:
Rscript seeFastq.R
-i fastqFiles.txt -o fastqReport.pdf -a all -c all -s 100000 -k 8 # The results will be written to a PDF file called
fastqReport.pdf. The summary statistics will be stored in a separate *.list file, here 'fastqReport.pdf.list', that can be used to plot the results later without revisiting the fastq files. This is supported by running the script with the '-l' argument and assigning to it the *.list file, but dropping the '-i' option.
To plot only specific rows and columns, one can assign them to the '-a' and '-c' arguments using this syntax: '-a 1_2_3_4_5_8_6_7 -c 1_2_3_4'. The arguments '-s' and '-k' specify the random sampling size and the k-mer length, respectively. Rsamtools
DocumentationSamtools Website
BWA (Burrows-Wheeler Alignment) Website
Rsamtools provides functions for parsing and inspecting samtools BAM formatted binary alignment data. SAM/BAM is quickly becoming a universal standard alignment format, and is now supported by a wide variety of alignment tools.
BWA Alignment
The Burrows-Wheeler Alignment (BWA) Tool is a powerful shortread alignment tool which can generate gapped short read output in the SAM format, perfect for use with Rsamtools importer functions.These commands should be run on the unix command line, or within R wrapped in a system("put command in here"):
## Option 1 of 2: Download test data from internet
wget http://biocluster.ucr.edu/~tbackman/genome.fasta
# download a test reference genome (TAIR9 Chromosome 1)
wget
http://biocluster.ucr.edu/~tbackman/query.fastq
# download some test Illumina reads from Arabidopsis
## Option 2 of 2: If logged into biocluster.ucr.edu copy files directly
cp /home/tbackman/.html/genome.fasta
. # copy from biocluster foldercp /home/tbackman/.html/
query.fastq
. # copy from biocluster folder## perform alignment
bwa index -a is genome.fasta # index reference genome
bwa aln
genome
.fasta
query.fastq > output.sai # perform alignment
bwa samse -n -1
genome
.fasta
output
.sai query.fastq > output.sam # generate SAM formatted alignment output
samtools view -bST
genome
.fasta
output
.sam -o
output
.bam # use samtools to generate bam file
Rsamtools Bam Parsing
Bam file data elementsName | Description |
R Data Object Type |
qname | read id |
character vector |
flag | a numeric value which encodes extra alignment details for use in sorting and filtering (see samtools man page) |
integer vector |
rname |
reference chromosome/sequence to which read aligned |
factor |
strand | alignment strand |
factor |
pos | alignment start coordinate (left most at 3' end of - strand) |
integer vector |
qwidth | read width |
integer vector |
mapq | mapping quality |
integer vector |
cigar | extended samtools CIGAR string |
character vector |
mrnm | reference chromosome/sequence to which opposite read pair aligned | factor |
mpos | alignment start coordinate to which opposite read pair aligned | integer vector |
isize | paired end insert size |
integer vector |
seq | query sequence in 5' to 3' direction |
DNAStringSet |
qual | read quality score |
PhredQuality |
library(Rsamtools) # load Rsamtools library
sortBam("output.bam", "output.sorted") # sort entries in bam file
indexBam("output.sorted.bam") # index reads in bam file
## Option 1 of 2: Import all elements from bam file
param <- ScanBamParam()
## Option 2 of 2: Select specific elements from bam file to import
which <- RangesList(Chr1 = IRanges(c(1, 10000), c(20000,30000)))
# select only elements you need:
what <- c("pos", "strand", "rname", "seq", "qual", "qname")
param <- ScanBamParam(which = which, what = what)
## Read in sorted indexed bam file
bam <- scanBam("output.sorted.bam", index="output.sorted.bam", param=param)
## Inspect and extract elements from bam object
names(bam) # shows ranges you selected with "which"
names(bam[[1]]) # selects first range and shows elements selected with "what"
bam[[1]]$seq # selects a specific element from a specific range
## Read in sorted indexed bam file as GappedAlignments object
bamAlignment <- readBamGappedAlignments("output.sorted.bam", index="output.sorted.bam")
Rsamtools SNP Calling
This step requires the sorted and indexed alignment and genome from previous steps
## unix command to extract pileup (check samtools options first)
system("samtools pileup -vcf genome.fasta output.sorted.bam > raw.pileup")
## read in snps and indels
snps <- readPileup("raw.pileup", variant="SNP")
indels <- readPileup("raw.pileup", variant="indel")
## inspect and filter snps
snps[1:5] # see first 5 snps
table(seqnames(snps)) # table snps per chromosome
snps <- snps[elementMetadata(snps)$coverage > 3] # keep only snps with coverage > 3
## write SNP spreadsheet
write.table(as.data.frame(snps), "snps.xls")
Analysis Routines with Rsamtools
SNP/INDEL Analysis and Annotation
[ Slide Show ] [ R Code ]################################################### ### chunk number 1: install-packages ################################################### ## #line 92 "SNPAnno.Rnw" ## source("http://bioconductor.org/biocLite.R") ## biocLite(c("GenomicFeatures","rtracklayer","GenomicRanges","Rsamtools")) ################################################### ### chunk number 2: install-mac ################################################### ## #line 102 "SNPAnno.Rnw" ## install.packages("SNPAnno_1.0.tar.gz", repos=NULL, type="source") ################################################### ### chunk number 3: install-windows ################################################### ## #line 110 "SNPAnno.Rnw" ## install.packages("SNPAnno.zip", repos=NULL) ################################################### ### chunk number 4: load-library ################################################### #line 115 "SNPAnno.Rnw" library(SNPAnno) ################################################### ### chunk number 5: example1-snp ################################################### #line 119 "SNPAnno.Rnw" snpfile <-system.file("examples","s.vcf",package="SNPAnno") snpfile snpgr <-vcf2GR(snpfile) chrid="dmel_mitochondrion_genome" snpgr <-snpgr[seqnames(snpgr)==chrid] snpgr[1:3] ################################################### ### chunk number 6: example1-txdb ################################################### ## #line 128 "SNPAnno.Rnw" ## txdb <- makeTranscriptDbFromBiomart("ensembl", "dmelanogaster_gene_ensembl") ## txdb ################################################### ### chunk number 7: example1-txdb ################################################### ## #line 133 "SNPAnno.Rnw" ## library(biomaRt) ## listMarts() ## ensembl = useMart("ensembl")#Choose to query the Ensembl BioMart database ## listDatasets(ensembl)#list dataset in database ################################################### ### chunk number 8: example1-txdb ################################################### #line 140 "SNPAnno.Rnw" txdbfile <-system.file("examples", "dm5.25.sqlite",package="SNPAnno") txdb <-loadFeatures(txdbfile) txdb ################################################### ### chunk number 9: example1-txdb ################################################### #line 146 "SNPAnno.Rnw" genomefile <-system.file("examples", "dm.fasta",package="SNPAnno") genomesequence <-read.DNAStringSet(genomefile) ################################################### ### chunk number 10: example1-cds ################################################### #line 151 "SNPAnno.Rnw" cds <-GetCDSfromTXDb(txdb,chrid) cds[1:2] ################################################### ### chunk number 11: example1-nonsynonymous ################################################### #line 156 "SNPAnno.Rnw" nonsynonymous<- NonSynDetection(snpgr,chr=genomesequence, features=cds, chrid=chrid, revcomp="-") nonsynonymous[1:10] ################################################### ### chunk number 12: example1-coding ################################################### #line 161 "SNPAnno.Rnw" codingGR <- GetCDSfromTXDb(txdb,chrid) coding <-CodingMutation(codingGR, snpgr) coding[1:10] ################################################### ### chunk number 13: example1_intergenic ################################################### #line 167 "SNPAnno.Rnw" intergenicGR <-IntergenicfromtxDB(txdb,chrid) intergenicGR[1:3] ################################################### ### chunk number 14: example1_intergenic ################################################### #line 172 "SNPAnno.Rnw" intergenic <-IntergenicMutation(intergenicGR,snpgr) intergenic[34:38] ################################################### ### chunk number 15: example1_intron ################################################### ## #line 177 "SNPAnno.Rnw" ## intronGR <-intronfromtxDB(txdb,chrid) ## intronGR ################################################### ### chunk number 16: example1_splicesite ################################################### ## #line 182 "SNPAnno.Rnw" ## splicesite <-SpliceSiteMutation(intronGR, snpgr) ################################################### ### chunk number 17: example1_all ################################################### #line 186 "SNPAnno.Rnw" SNPAnnotation(snpfile=snpfile, type="vcf", source="transcriptDB", annosource=txdb, detection=c("nosynonymous","intergenic","coding"), genomesequence=genomesequence, chrid=chrid) ################################################### ### chunk number 18: example2 ################################################### ## #line 197 "SNPAnno.Rnw" ## library(SNPAnno) ################################################### ### chunk number 19: example2_snp ################################################### #line 201 "SNPAnno.Rnw" snpfile <-system.file("examples","snp.pileup",package="SNPAnno") snpgr <-pileupSNP2GR(snpfile) highqualityreads <-supportreads(snpgr,10) snpgr <-GRanges(seqnames=seqnames(snpgr),ranges=ranges(snpgr),strand="*",values(snpgr)["ref"], values(snpgr)["alt"],values(snpgr)["consensusQuality"],values(snpgr)["snpQuality"], values(snpgr)["maxMappingQuality"],values(snpgr)["coverage"], highqualityreads =highqualityreads) snpgr <-snpgr[seqnames(snpgr)=="Chr1"] snpgr[1:3] ################################################### ### chunk number 20: example2_gff ################################################### #line 210 "SNPAnno.Rnw" gfffile <-system.file("examples","TAIR10_part.gff3",package="SNPAnno") gff <- import.gff(gfffile, asRangedData=FALSE) elementMetadata(gff)["type"] <-as.character( as.data.frame(elementMetadata(gff)[,"type"])[,1]) elementMetadata(gff)["group"] <- gsub("^.*?=(.*?)($|,|;|-).*", "\\1", elementMetadata(gff)[,"group"]) ################################################### ### chunk number 21: example2_genome ################################################### #line 218 "SNPAnno.Rnw" genomefile <-system.file("examples", "genome.fasta",package="SNPAnno") genomesequence <-read.DNAStringSet(genomefile) ################################################### ### chunk number 22: example2_cds ################################################### #line 223 "SNPAnno.Rnw" chrid="Chr1" cds <-GetCDSfromGFF(gff,chrid) cds[1:3] ################################################### ### chunk number 23: example2_Nonsynonymous ################################################### #line 229 "SNPAnno.Rnw" nonsynonymous<- NonSynDetection(snpgr, chr=genomesequence, features=cds, chrid=chrid, revcomp="-") nonsynonymous[1:10] ################################################### ### chunk number 24: example2_cds ################################################### #line 235 "SNPAnno.Rnw" coding <-CodingMutation(cds, snpgr) coding[1:10] ################################################### ### chunk number 25: example2_cds ################################################### #line 240 "SNPAnno.Rnw" intergenicGR <-Intergenicfromgff(gff,chrid) intergenicGR[1:2] ################################################### ### chunk number 26: example2_inter ################################################### #line 245 "SNPAnno.Rnw" intergenic <-IntergenicMutation(intergenicGR,snpgr) intergenic[1:10] ################################################### ### chunk number 27: example2_intron ################################################### #line 250 "SNPAnno.Rnw" intronGR <-intronfromgff(gff,chrid) intronGR[1:2] ################################################### ### chunk number 28: example2_splicesite ################################################### #line 255 "SNPAnno.Rnw" splicesite <-SpliceSiteMutation(intronGR, snpgr) splicesite[10:20] ################################################### ### chunk number 29: example2_all ################################################### ## #line 260 "SNPAnno.Rnw"
SNPAnnotation(snpfile=snpfile, type="pileup", source="gff", annosource=gff, detection=c("nosynonymous","splicesite","intergenic","coding"), genomesequence=genomesequence, chrid="Chr1")
Additional tools for SNP analysis:
BSgenome
BSgenome provides an object oriented infrastructure for interacting with a Biostring based genome sequence. BSgenome packages exist for many common genomes, and can be created to represent custom genomes. See the "How to forge a BSgenome data package" Vignette for instructions to create a new BSgenome package if a prebuilt package does not exist for your organism.
BSgenome Use
library(BSgenome)
available.genomes()
# Get list of all available pre-built genomes to choose from
# source("http://bioconductor.org/biocLite.R")
# biocLite("
BSgenome.Celegans.UCSC.ce2
")
# Run these commands to install a genome you haven't used before, or to update the version
library(
BSgenome.Celegans.UCSC.ce2
)
# load the house mouse
genome
Celegans
# Inspect overview of data provided with this package
masknames(
Celegans
)
# Returns a list of built in masks defined for sequences in the genome
# AGAPS = assembly gaps
# AMB = intra-contig ambiguities
# RM = RepeatMasker
# TRF = Repeat Regions found via Tandem Repeats Finder
chr1 <-
Celegans
$chrI
# load chromosome 1 into memory with default masks
length(chr1)
# returns length of chromosome 1
rtracklayer
rtracklayer provides an interface for exporting annotation feature data to various genome browsers and file formats (such as GFF). See the Small RNA Profiling exercise for an example of using rtracklayer to visualize alignment coverage.
biomaRt
The biomaRt package, provides an interface to a growing collection of databases implementing the BioMart software suite (http:// www.biomart.org). The package enables online retrieval of large amounts of data in a uniform way without the need to know the underlying database schemas. This data is retrieved automatically via the Internet, so it's recommended that you cache the data locally, or check versions if your code will be adversely affected by updates to these data.
biomaRt Use
library("biomaRt")
listMarts() # get list of all available data sources
ensembl <- useMart("ensembl") # load the snp source
listDatasets(ensembl) # list available datasets
cfamiliaris
_gene_ensembl <- useDataset("cfamiliaris_gene_ensembl", mart= ensembl)
# use domestic dog gene data
listAttributes(
cfamiliaris
_gene_ensembl
)
# show attributes in this dataset
listFilters(
cfamiliaris
_gene_ensembl
)
# show available filters
cfamiliaris
_gene_ensembl
<- useMart("ensembl","
cfamiliaris
_gene_ensembl
")
# select specific dataset
getBM(attributes=c("chromosome_name"), mart=
cfamiliaris
_gene_ensembl
)
# retrieve data regarding specific attributes in table/dataframe
# format
# optional filters= option allows for specific filters to be applied
ChIP-Seq Analysis Packages
Bioconductor provides various packages for analyzing and visualizing ChIP-Seq data. Only a small selection of these packages is introduced here. Additional useful introductions to this topic are: BioC ChIP-seq Case Study and BioC ChIP-Seq.
chipseq
The chipseq package combines a variety of HT-Seq packages to a pipeline for ChIP-Seq data analysis.
#################
## Data Import ##
#################
library(chipseq); library(lattice)
data(cstest); cstest # Loads a reduced test data set containing the start positions and orientation of the aligned reads for three mouse chromosomes.
cstest; cstest$ctcf # cstest is a list-like object of class "GenomeDataList". Each of its components contains the data from one Illumina lane.
#####################
## Extending Reads ##
#####################
## HT sequencing data contain often only the partial sequence of ChIP-Seq pull-down fragments. Therefore, it can be useful in certain cases to extend the reads to cover the entire binding site of DNA binging proteins.
estimate.mean.fraglen(cstest$ctcf, method="SISSR") # The function 'e
stimate.mean.fraglen
' implements several methods to estimate the mean fragment length. See help file for details.
ext <- resize(cstest$ctcf, width = 200) # Returns an IRanges object of read intervals extended to 200 bases.
#################################
## Coverage, Islands and Depth ##
#################################
cov <- coverage(ext); cov # Counts the read coverage for each base. The results are stored in a run-length encoded form as Rle object.
islands <- slice(cov, lower = 1) # Returns read islands where the read coverage is
equal or above a specified bound.
viewSums(islands) # Returns the sum of coverage in the islands.
table(viewSums(islands) / 200) # Returns the number of estimated reads of the corresponding islands.
viewMaxs(head(islands)) # Returns the maximum coverage depth within each island.
table(viewMaxs(islands))
# Counts the corresponding islands.
## Function to perform above analysis on many data sets with seqapply()
islandReadSummary <- function(x) {
g <- resize(x, width = 200)
s <- slice(coverage(g), lower = 1)
tab <- table(viewSums(s) / 200)
df <- DataFrame(tab)
colnames(df) <- c("chromosome", "nread", "count")
df$nread <- as.integer(df$nread)
df
}
head(islandReadSummary(cstest$ctcf)) # Returns same result as before.
nread.islands <- seqapply(cstest, islandReadSummary) # Performs analysis on all chromosomes.
nread.islands <- stack(nread.islands, "sample")
# Returns results as data frame.
## Plots the distribution of read numbers against the number of islands.
xyplot(log(count) ~ nread | sample, as.data.frame(nread.islands),
subset = (chromosome == "chr10" & nread <= 40),
layout = c(1, 2), pch = 16, type = c("p", "g"))
##############################
## Chromosome Coverage Plot ##
##############################
plotCov <- function(mycov=cov, mychr=1, mypos=c(1,1000), mymain="Coverage", ...) {
op <- par(mar=c(8,3,6,1))
plot(as.numeric(mycov[[mychr]][mypos[1]:mypos[2]]), type="l",
lwd=1, col="blue", ylab="", main=mymain, xlab="", xaxt="n", ...)
axis(1, las = 2, at=seq(1,mypos[2]-mypos[1], length.out= 10),
labels=as.integer(seq(mypos[1], mypos[2], length.out= 10)))
par(op)
}
plotCov(mycov=cov, mychr="chr10", mypos=c(3000000,4000000))
##############################
## Peak Calling and Summary ##
##############################
peakCutoff(cov, fdr = 0.0001) #
Using Poisson-based approach for estimating the noise distribution, the
'peakCutoff' function returns a cutoff value for a specific FDR.
peaks <- slice(cov, lower = 8); peaks #
Based on the above calculated coverage value ~7 at an FDR of 0.0001,
one can choose a coverage >=8 as a conservative cutoff value for peak
calling.
peaks.sum <- peakSummary(peaks) # Summarizes the peak data in a RangedData object.
###############################################
## Strand-Specific Peak Calling and Plotting ##
###############################################
peak.depths <- viewMaxs(peaks)
cov.pos <- coverage(ext[strand(ext) == "+"])
# Returns coverage for sense matches.
cov.neg <- coverage(ext[strand(ext) == "-"])
# Returns coverage for antisense matches.
peaks.pos <- Views(cov.pos, peaks)
peaks.neg <- Views(cov.neg, peaks)
wpeaks <- tail(order(peak.depths$chr10), 4); wpeaks
# Returns four highest peaks.
coverageplot(peaks.pos$chr10[wpeaks[1]], peaks.neg$chr10[wpeaks[1]])
# Plots the first peak.
###############################################
## DPs: Differential Peaks Among Two Samples ##
###############################################
## Simplest strategy is to combine the data from two samples and then compare the contribution of each sample to the peaks.
cov.gfp <- coverage(resize(cstest$gfp, 200))
peaks.gfp <- slice(cov.gfp, lower = 8)
peakSummary <- diffPeakSummary(peaks.gfp, peaks)
xyplot(asinh(sums2) ~ asinh(sums1) | space,
data = as.data.frame(peakSummary),
panel = function(x, y, ...) {
panel.xyplot(x, y, ...)
panel.abline(median(y - x), 1)
},
type = c("p", "g"), alpha = 0.5, aspect = "iso")
BayesPeak
BayesPeak is a peak calling package for identifying DNA binding sites of proteins in ChIP-Seq experiments. Its algorithm
uses hidden Markov models (HMM) and Bayesian statistical methods. The following sample code introduces the identification of peaks with the BayesPeak package as well as the incorporation of read coverage information obtained by the chipseq package.
#############################
## Import Data Sample Data ##
#############################
library(BayesPeak); library(multicore)
sig <-
read.table("H3K4me3-chr16.bed", skip=1, colClasses=c("character",
"numeric")[c(1,2,2,1,1,1,1,1,1)]) # Import of a ChIP-Seq sample data set. The file needs to be downloaded
from the BayesPeak
site
, uncompressed and stored in the current working directory of an R session.
bgr <-
read.table("Input-chr16.bed", colClasses=c("character",
"numeric")[c(1,2,2,1,1,1,1,1,1)]) #
Import of the background sample data set.
sig <- RangedData(space=sig[,1],
IRanges(start=sig[,2], end=sig[,3]), strand=sig[,6])
bgr <- RangedData(space=bgr[,1],
IRanges(start=bgr[,2], end=bgr[,3]), strand=bgr[,6])
#################################
## Peak Calling with BayesPeak ##
#################################
raw.output <- bayespeak(treatment=sig, control=bgr, chr = "chr16",
start = 9.2E7, end = 9.5E7, use.multicore = TRUE, mc.cores = 1) # The bayespeak() function fits a Markov model to the aligned read data via Markov Chain Monte Carlo (MCMC) techniques. The incorporation of a control sample is beneficial but not required for this function. The aligned read data can be read directly from a BED file or provided as a data frame or a RangedData object as in this example. For speed reasons the analysis is restricted in this example to a small subrange on chromosome 16. For a complete analysis of an entire genome, one usually wants to omit the arguments: 'chr', 'start and 'end'. In addition, one can increase the number of CPU cores utilized for the computation under the 'mc.cores' argument.
unreliable.jobs <-
log(raw.output$QC$lambda1) < 1.5 # Removal of potential false
positive peaks due to overfitting. Consult the BayesPeak manual
for more details.
bpeaks <-
as.data.frame(summarise.peaks(raw.output, method = "lowerbound",
exclude.jobs = unreliable.jobs)) # Returns the final set of peaks in a data frame.
##########################################################
## Adding Coverage Information with the chipseq Package ##
##########################################################
bpeaksIR
<- IRanges(start=bpeaks[,
2], end=bpeaks[, 3])
source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/chipseqFct.R")
# Imports the rangeCoverage function for computing coverage information of peak calls. The function expects the peak ranges in an IRanges object for single chromosome data or as a list of IRanges for multiple chromosome data. In addition, the corresponding aligned read data need to be provided as a RangedData object.
sigcovDF <- rangeCoverage(summaryFct=viewMaxs, myname="sig_", peaksIR=bpeaksIR, sig=sig, readextend=200, normfactor=10^6/length(start(sig))) # Returns a data frame with the maximum coverage information for each peak normalized per million reads (see 'normfactor' argument). The first column contains the total coverage and the other two the coverage for the positive and negative strands, respectively.
To
obtain mean coverage values, assign
'viewMeans' to the summaryFct argument.
bgrcovDF <- rangeCoverage(summaryFct=viewMaxs, myname="bgr_", peaksIR=bpeaksIR, sig=bgr, readextend=200, normfactor=10^6/length(start(bgr))) # Returns the corresponding coverage information for the background sample.
bpeaksDF
<- cbind(bpeaks,
sigcovDF[,-1], bgrcovDF[,-1])
;
bpeaksDF[1:4,]
# Constructs a data frame containing the peaks called by BayesPeaks as well as the corresponding maximum coverage information for each peak.
PICS
The PICS package applies probabilistic inference to aligned-read ChIP-Seq data in order to identify regions bound by transcription factors. PICS identifies enriched regions by modeling local concentrations of directional reads, and uses DNA fragment length prior information to discriminate closely adjacent binding events via a Bayesian hierarchical t-mixture model. The following sample code uses the test data set from the above BayesPeak package in order to compare the results from both methods by identifying their consensus peak set.
#############################
## Import Data Sample Data ##
#############################
library(PICS); library(snowfall)
sig <- read.table("H3K4me3-chr16.bed", skip=1, colClasses=c("character", "numeric")[c(1,2,2,1,1,1,1,1,1)])[,c(1:3,6)]; colnames(sig) <- c("space", "start", "end", "strand"); siggd <- as(sig, "GenomeData")
bgr <- read.table("Input-chr16.bed", colClasses=c("character", "numeric")[c(1,2,2,1,1,1,1,1,1)])[,c(1:3,6)]; colnames(bgr) <- c("space", "start", "end", "strand"); bgrgd <- as(bgr, "GenomeData")
# Import of the sample data
set. The files need to be downloaded
from the BayesPeak
site
, uncompressed and
stored in the current working directory of an R session.
############################
## Peak Calling with PICS ##
############################
sfInit(parallel = TRUE, cpus = 2); sfLibrary(PICS) # Specifies the number of CPU cores to use for the following computations.
seg <- segmentReads(siggd, dataC = bgrgd, minReads = 1) # Detection of candidate peak regions with a minimum read coverage specified under the 'minReads' argument.
pics <- PICS(seg, dataType = "TF") # Identification and scoring of DNA binding sites.
segC <- segmentReads(bgrgd, dataC = siggd, minReads = 1); picsC <- PICS(segC, dataType = "TF"); fdr <- picsFDR(pics, picsC, filter = list(delta = c(50, Inf), se = c(0, 50), sigmaSqF = c(0, 22500), sigmaSqR = c(0, 22500))) # Estimation of FDRs for peak calls.
myFilter = list(delta = c(50, 300), se = c(0, 50), sigmaSqF = c(0, 2500), sigmaSqR = c(0, 22500)) # Predefine filter parameters.
fdr2score <- fdr[fdr[,1]<=0.05,][1,2] # Defines sore cutoff for a FDR <= 0.05.
rdata <- makeRangedDataOutput(pics, type = "bed", filter = c(myFilter, list(score = c(fdr2score, Inf)))) # Returns quality filtered peaks with scores as RangedData object.
ppeaks <- as.data.frame(rdata)
##########################################################
## Adding Coverage Information with the chipseq Package ##
##########################################################
ppeaksIR
<- IRanges(start=ppeaks[,
2], end=ppeaks[, 3])
sig <- RangedData(sig); bgr <- RangedData(bgr)
source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/chipseqFct.R")
# Imports the rangeCoverage function for
computing coverage information of peak calls. The function expects the
peak ranges in an IRanges object for single chromosome data or as a list
of IRanges for multiple chromosome data. In addition, the corresponding
aligned read data need to be provided as a RangedData object.
sigcovDF
<- rangeCoverage(summaryFct=viewMaxs, myname="sig_",
peaksIR=ppeaksIR, sig=sig, readextend=200,
normfactor=10^6/length(start(sig))) # Returns a data frame with
the maximum coverage information for each peak normalized per million
reads (see 'normfactor' argument). The first column contains the total
coverage and the other two the coverage for the positive and negative
strands, respectively.
To
obtain mean coverage values, assign
'viewMeans' to the summaryFct argument.
bgrcovDF <-
rangeCoverage(summaryFct=viewMaxs, myname="bgr_", peaksIR=ppeaksIR,
sig=bgr, readextend=200, normfactor=10^6/length(start(bgr))) #
Returns the corresponding coverage information for the background
sample.
ppeaksDF
<- cbind(ppeaks,
sigcovDF[,-1], bgrcovDF[,-1])
;
ppeaksDF[1:4,]
# Constructs a data frame
containing the peaks called by PICS as well as the corresponding
maximum coverage information for each peak.
##########################################################
## Peaks Identified by both Methods: PICS and BayesPeak ##
##########################################################
commonpeaks <- subsetByOverlaps(as(ppeaks, "RangedData"), as(bpeaks, "RangedData"), minoverlap=100) # Returns the peaks called by PICS that overlap by at least 100bp with those from BayesPeak for the range
9.2E7-9.5E7
. Note: to compare the overall results from both methods the BayesPeak analysis needs to be performed on the entire data set by removing the 'start' and 'end' restrictions in the above bayespeak function call.
ppeaksDF[ppeaksDF$start %in%
start(commonpeaks),] # Returns the corresponding coverage information for the common peaks.
ChIPpeakAnno
The ChIPpeakAnno package provides. batch annotation of the peaks identified from either ChIP-seq or ChIP-chip experiments. It includes functions to retrieve the sequences around peaks, obtain enriched Gene Ontology (GO) terms, find the nearest gene, exon, miRNA or custom features such as most conserved elements and other transcription factor binding sites supplied by users. The package leverages the biomaRt, IRanges, Biostrings, BSgenome, GO.db, multtest and stat packages.
##################################################
## Obtain Genomic Context Information for Peaks ##
##################################################
## Use annotations from BioC annotation packages
library(ChIPpeakAnno)
data(myPeakList) # Loads a sample peak list stored as RangeData object.
data(TSS.human.NCBI36) # Loads gene locations for human genome.
annotatedPeak <- annotatePeakInBatch(myPeakList[1:6, ], AnnotationData = TSS.human.NCBI36) # Combines the peak information with the genomic context.
as.data.frame(annotatedPeak)
## Use annotations from BioMart
mart <- useMart(biomart="ensembl",dataset="hsapiens_gene_ensembl")
myAnnotation <- getAnnotation(mart, featureType="miRNA")
annotatedPeak <- annotatePeakInBatch(myPeakList[1:6, ], AnnotationData = myAnnotation)
## Use your own annotation data (e.g. BED or GFF)
myPeak1 <- RangedData(IRanges(start = c(967654, 2010897, 2496704, 3075869, 3123260, 3857501, 201089),
end = c(967754, 2010997, 2496804, 3075969, 3123360, 3857601, 201089),
names = c("Site1", "Site2", "Site3", "Site4", "Site5", "Site6", "site7")),
space = c("1", "2", "3", "4", "5", "6", "2"))
TFbindingSites <- RangedData(IRanges(start = c(967659, 2010898, 2496700, 3075866, 3123260, 3857500,
96765, 201089, 249670, 307586, 312326, 385750),
end = c(967869, 2011108, 2496920, 3076166, 3123470, 3857780, 96985, 201299, 249890, 307796, 312586, 385960),
names = c("t1", "t2", "t3", "t4", "t5", "t6", "t7", "t8", "t9", "t10", "t11", "t12")),
space = c("1", "2", "3", "4", "5", "6", "1", "2", "3", "4", "5", "6"),
strand = c(1, + 1, 1, 1, 1, 1, -1, -1, -1, -1, -1, -1))
annotatedPeak2 <- annotatePeakInBatch(myPeak1, AnnotationData = TFbindingSites)
as.data.frame(annotatedPeak2)
########################################
## Obtain Seqeunces Surrounding Peaks ##
########################################
peaks <- RangedData(IRanges(start = c(100, 500), end = c(300, 600), names = c("peak1", "peak2")), space = c("NC_008253", "NC_010468")) # Simple peak examples.
library(BSgenome.Ecoli.NCBI.20080805)
peaksWithSequences <- getAllPeakSequence(peaks, upstream = 20, downstream = 20, genome = Ecoli) # Extracts corresponding sequences including 20bp extensions on each end from the E. coli genome.
write2FASTA(peaksWithSequences) # Prints out results in FASTA format.
#################################
## GO Term Enrichment Analysis ##
#################################
library(org.Hs.eg.db)
enrichedGO <- getEnrichedGO(annotatedPeak, orgAnn = "org.Hs.eg.db", maxP = 0.01, multiAdj = TRUE, minGOterm = 10, multiAdjMethod = "BH")
enrichedGO$bp[1:6, ]
Additional ChIP-Seq Packages
- DiffBind: Documentation
- MOSAICS: Documentation
- iSeq: Documentation
- ChIPseqR: Documentation
- ChiPsim: Documentation
- CSAR: Documentation
- ChIP-Seq Pipeline: PICS, rGADEM and MotIV (developer web site)
- SPP: ChIP-seq processing pipeline
- SPP Tutorial
- MACS
- SIPeS
RNA-Seq Analysis
Counting Reads that Overlap with Annotation Ranges
DocumentationThe GenomicRanges package provides support for importing into R short read alignment data in BAM format (via Rsamtools) and associating them with genomic feature ranges, such as exons or genes. This way one can quantify the number of reads aligning to annotated genomic regions. The package defines general purpose containers for storing genomic intervals as well as more specialized containers for storing alignments against a reference genome. The two main functions for read counting provided by this infrastructure are countOverlaps and summarizeOverlaps. For their proper usage, it is important to read the corresponding PDF manual.
##########################
## Import Aligned Reads ##
##########################
library(GenomicRanges); library(Rsamtools); library(leeBamViews) # Load required libraries.
testFile <- system.file("bam", "isowt5_13e.bam", package = "leeBamViews")
aligns <- readBamGappedAlignments(testFile) # Imports a BAM alignment file (here yeast example) and stores it as a GappedAlignments object.
rname(aligns)
<- sub("^Sc", "", rname(aligns));
rname(aligns)
<-
sub("13", "XIII", rname(aligns))
# Required for data consistency.
alignscan <- scanBam(testFile); names(alignscan[[1]]) # Imports BAM data into a nested list object.
######################################################
## Organize Annotation Data in a GRangesList Object ##
######################################################
library(GenomicFeatures)
txdb <- makeTranscriptDbFromUCSC(genome="sacCer2", tablename="sgdGene") # Creates a TranscriptDb object from transcript annotations available at the UCSC Genome Browser.
exonRanges <- exonsBy(txdb, "tx") # Stores exon data as a GRangesList object.
#######################################################
## Create Annotation Objects from Custom Data Frames ##
#######################################################
transcripts <- data.frame(tx_id=1:3, tx_chrom="chr1", tx_strand=c("-", "+", "+"), tx_start=c(1, 2001, 2001), tx_end=c(999, 2199, 2199))
splicings <- data.frame(tx_id=c(1L, 2L, 2L, 2L, 3L, 3L), exon_rank=c(1, 1, 2, 3, 1, 2), exon_start=c(1, 2001, 2101, 2131, 2001, 2131), exon_end=c(999, 2085, 2144, 2199, 2085, 2199), cds_start=c(1, 2022, 2101, 2131, NA, NA), cds_end=c(999, 2085, 2144, 2193, NA, NA))
myTxdb <- makeTranscriptDb(transcripts, splicings) # Example for creating TranscriptDb object from two data frames.
exonsBy(myTxdb, "tx")
# Returns exon data as GRangesList
object.
###################################################
## Counting Reads that Overlap Annotation Ranges ##
###################################################
counts <- countOverlaps(exonRanges, aligns) # Counts matching reads per transcript.
numBases <- sum(width(reduce(exonRanges))) # Length of exon union per gene.
geneLengthsInKB <- (numBases/1000) # Conversion to kbp.
millionsMapped <- sum(counts)/1e+06 # Factor for converting to million of mapped reads.
rpm <- counts/millionsMapped #
RPK: reads per kilobase of exon model.
rpkm <- rpm/geneLengthsInKB # RPKM: reads per kilobase of exon model per million mapped reads. Note: the results are stored in a named vector that matches the index of the initial GRangesList object!!!
sortedRPKM <- sort(rpkm); highScoreGenes <- tail(sortedRPKM)
txs <- transcripts(txdb, vals = list(tx_id = names(highScoreGenes)), columns = c("tx_id", "gene_id"))
systNames <- as.vector(unlist(elementMetadata(txs)["gene_id"])) # Example for returning the six most highly expressed transcripts.
###################################################################
##
Counting Reads for Non-Annotation Ranges (Intergenic Regions) ##
###################################################################
filtData <- subsetByOverlaps(aligns, exonRanges) # Returns all reads that align in non-exon ranges. This includes introns which are rare in this example.
filtData2 <- subsetByOverlaps(aligns, transcriptsBy(txdb, "gene"))
# Returns all reads that align in
intergenic regions (excludes introns).
cov <- coverage(filtData) # Calculates coverage for non-exon ranges.
cov <- cov[13] # Subsetting of chromosome 13 because data for the other chromosomes is missing in this example.
islands <- slice(cov, lower = 1) # Filters for areas with a continuous read coverage of >=1.
transcribedRegions <- islands[width(islands) > 1000] # Returns only those transcribed regions that are at least 1000 bp long.
#########################################################################
## Avoid Double Counting and Ambiguous Mappings with summarizeOverlaps ##
#########################################################################
## Note: when counting single range features, a GRanges object is passed
## on to summarizeOverlaps and with multiple range features a GRangesList.
## It also accepts a ScanBamParam() function under the param argument to filter
## the reads stored in a bam file, e.g. by "NH" tag to exclude reads with multiple mappings.
library(GenomicRanges); library(Rsamtools)
## Create some sample data
rd <- GappedAlignments(letters[1:3], seqnames = Rle(rep("chr1",3)),
pos = as.integer(c(100, 500, 900)),
cigar = rep("300M", 3), strand = strand(rep("+", 3)))
gr1 <- GRanges("chr1", IRanges(start=100, width=101), strand="+", ID="gene1_exon1")
gr2 <- GRanges("chr1", IRanges(start=100, width=151), strand="+", ID="gene1_exon2")
gr3 <- GRanges("chr1", IRanges(start=900, width=101), strand="+", ID="gene1_exon3")
gr4 <- GRanges("chr1", IRanges(start=500, width=101), strand="+", ID="gene2_exon1")
gr <- c(gr1, gr2, gr3, gr4)
grl <- split(gr, gsub("_.*", "", elementMetadata(gr)$ID))
grl
GRangesList of length 2:
$gene1
GRanges with 3 ranges and 1 metadata column:
seqnames ranges strand | ID
<Rle> <IRanges> <Rle> | <character>
[1] chr1 [100, 200] + | gene1_exon1
[2] chr1 [100, 250] + | gene1_exon2
[3] chr1 [900, 1000] + | gene1_exon3
...
## Count reads overlapping with exonic regions of genes
assays(summarizeOverlaps(grl, rd, mode="Union", ignore.strand=TRUE))$counts
[,1]
gene1 2
gene2 1
countOverlaps(grl, rd, ignore.strand=TRUE)
gene1 gene2
2 1
## Compare results with corresponding gene range counts
gr1 <- GRanges("chr1", IRanges(start=100, width=901), strand="+", ID="gene1")
gr2 <- GRanges("chr1", IRanges(start=500, width=101), strand="+", ID="gene2")
grgene <- c(gr1, gr2)
grgenel <- split(grgene, elementMetadata(grgene)$ID)
assays(summarizeOverlaps(grgenel, rd, mode="Union", ignore.strand=TRUE))$counts
[,1]
gene1 2
gene2 0
countOverlaps(grgenel, rd, ignore.strand=TRUE)
gene1 gene2
3 1
Differential Gene Expression Analysis with DESeq
DocumentationThe DESeq package contains functions to call differentially expressed genes (DEGs) in count tables based on a model using the negative binomial distribution. It expects as input a data frame with the raw read counts per region/gene of interest (rows) for each test sample (columns). Such a count table can be imported into R or generated from BAM alignment files using the countOverlaps function as introduced above.
##################################
## Import Sample RNA-Seq Counts ##
##################################
## Uses sample data from pasilla data package containing RNA-Seq experiments of
## Drosophila melanogaster cell cultures with RNAi knock-down of the splicing
## factor pasilla (Brooks et al, 2011, Genome Research).
library(DESeq); library("pasilla"); data("pasillaGenes")
counts(pasillaGenes)[1:4,]; pData(pasillaGenes) # Usually, users want to import here their own count data with the read.delim() or read.table() functions.
pairedSamples <- pData(pasillaGenes)$type == "paired-end"
countsTable <- counts(pasillaGenes)[, pairedSamples] # Select only paired-end data
countsTable[1:4,]
treated2fb treated3fb untreated3fb untreated4fb
FBgn0000003 1 1 0 0
FBgn0000008 139 77 84 76
FBgn0000014 10 0 0 0
FBgn0000015 0 0 1 2
conds <- pData(pasillaGenes)$condition[pairedSamples] # Define biological replicates with factor
conds
[1] treated treated untreated untreated
Levels: treated untreated
##################
## Calling DEGs ##
##################
cds <- newCountDataSet(countsTable, conds) # Creates object of class CountDataSet derived from eSet class.
cds[, -1]; counts(cds)[1:4, ] # CountDataSet has similar accessor methods as eSet class.
cds <- estimateSizeFactors(cds) # Estimates library size factors from count data. Alternatively, one can provide here the true library sizes with: sizeFactors(cds) <- c(..., ...)
pData(cds); sizeFactors(cds)
counts(cds, normalized=TRUE)[1:4,] # Returns count values divided by their size factor which brings each sample to a common scale.
cds <- estimateDispersions(cds) # Estimates dispersion of each gene across replicates.
res <- nbinomTest(cds, "treated", "untreated") # Calls DEGs with nbinomTest.
res[order(res$padj), ][1:4,] # Returns for each gene: mean expression levels; fold change; log2 fold change; p-value for statistical significance of change; and the p-values adjusted for multiple testing with the Benjamini-Hochberg procedure which controls the false discovery rate (FDR).
id baseMean baseMeanA baseMeanB foldChange log2FoldChange pval padj
9696 FBgn0039155 697.2227 1335.0258 59.41957 0.04450818 -4.489786 5.001208e-110 5.746388e-106
10162 FBgn0039827 296.2687 562.7488 29.78866 0.05293420 -4.239656 2.643077e-70 1.518448e-66
3163 FBgn0029167 3434.7912 5679.5372 1190.04511 0.20953206 -2.254757 8.164834e-50 3.127131e-46
6408 FBgn0034434 139.2476 263.8843 14.61102 0.05536905 -4.174776 1.130289e-42 3.246754e-39
sigDESeq <- res[res$padj <= 0.01, ]; sigDESeq <- na.omit(sigDESeq); sigDESeq <- as.character(sigDESeq[,1]) # Store genes with FDR <= 0.01 for comparison with edgeR
plotDE <- function(res) { plot(res$baseMean, res$log2FoldChange, log="x", pch=20, cex=.3, col=ifelse( res$padj<.1, "red", "black")) } # Defines plotting function
plotDE(res) # Plots the log2 fold changes against the base means, coloring in red those genes that are significant at 10% FDR.
Differential Gene Expression Analysis with edgeR
DocumentationThe edgeR package uses empirical Bayes estimation and exact tests based on the negative binomial distribution to call differentially expressed genes (DEGs) in count data.
##################################
## Import Sample RNA-Seq Counts ##
##################################
## Uses the same sample data as in the DESeq section.
library("edgeR"); library(DESeq); library("pasilla"); data("pasillaGenes")
counts(pasillaGenes)[1:4,]; pData(pasillaGenes)
pairedSamples <- pData(pasillaGenes)$type == "paired-end"
x <- counts(pasillaGenes)[, pairedSamples] # Select only paired-end data
group <- pData(pasillaGenes)$condition[pairedSamples] # Define biological replicates with factor
x[1:4,]; group
treated2fb treated3fb untreated3fb untreated4fb
FBgn0000003 1 1 0 0
FBgn0000008 139 77 84 76
FBgn0000014 10 0 0 0
FBgn0000015 0 0 1 2
[1] treated treated untreated untreated
Levels: treated untreated
#######################################################################
## Pairwise Comparisons Among Two or Mores Groups with Classic edgeR ##
#######################################################################
## Before fitting a negative binomial model, the dispersion(s) need to be estimated.
## For single factor experiments, edgeR uses the quantile-adjusted conditional
## maximum likelihood (qCML) method. Note: classic edgeR is limited to single factor designs.
y <- DGEList(counts=x, group=group) # Constructs DGEList object
y <- estimateCommonDisp(y) # Estimates common dispersion
y <- estimateTagwiseDisp(y) # Estimates tagwise dispersion
et <- exactTest(y, pair=c("treated", "untreated")) # Computes exact test for the negative binomial distribution, which has strong parallels with Fisher’s exact test. The 'pair' argument allows the user to determine which samples to compare. Note: the first group listed in the pair is the baseline for the comparison (
'untreated-treated'
) meaning genes with positive log-fold change are up-regulated in group "untreated" compared with group 'treated' and vice versa for genes with negative log-fold change.
topTags(et, n=4) # Returns DEG table ranked by p-values adjusted for multiple testing (last column). It uses by default the Benjamini and Hochberg’s approach for controlling the false discovery rate (FDR). For more detailed examples, see sections 10-12 in edgeR manual.
Comparison of groups: untreated-treated
logConc logFC P.Value FDR
FBgn0039155 -15.43537 4.444217 2.866244e-108 4.147455e-104
FBgn0039827 -16.57627 4.233360 3.752429e-80 2.714882e-76
FBgn0034434 -17.62894 4.153032 1.025818e-57 4.947862e-54
FBgn0034736 -16.88889 3.365485 1.119077e-51 4.048260e-48
## Compare DEGs from edgeR with those from DESeq in Venn diagram
all <- as.data.frame(topTags(et, n=100000))
sigedgeR <- all[all$adj.P.Val <= 0.01, ]; sigedgeR <- na.omit(sigedgeR); sigedgeR <- row.names(sigedgeR)
# Store genes with FDR <= 0.01 for comparison with DESeq
source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/overLapper.R")
OLlist <- overLapper(setlist=list(DESeq=sigDESeq, edgeR=sigedgeR), sep="_", type="vennsets"); counts <- sapply(OLlist$Venn_List, length)
vennPlot(counts=counts) # Plots Venn diagram
counts # Returns Venn counts
DESeq edgeR DESeq_edgeR
9 1091 340
#########################################
## Multifactor Analysis with glm edgeR ##
#########################################
## For multiple factor design experiments, edgeR uses the Cox-Reid profile-adjusted likelihood (CR)
## method in estimating dispersions. It handles multiple factors by fitting generalized linear models
## (GLM) with a design matrix.
The CR method can be used to calculate a common dispersion for all the
## tags, trended dispersion depending on the tag abundance, or separate dispersions for individual
## tags using the different
estimateGLMX
function group.
Note: one needs to estimate either common
## dispersion or trended dispersions prior to the estimation of tagwise dispersions.
y <- DGEList(counts=x, group=group)
# Constructs DGEList object
design <- model.matrix(~group)
y <- estimateGLMTrendedDisp(y, design) # Estimates trended dispersions
y <- estimateGLMTagwiseDisp(y, design) # Estimates tagwise dispersions (requires pre-processing with
estimateGLMCommonDisp
or
estimateGLMTrendedDisp
). To estimate tagwise dispersions, the empirical Bayes method is applied to squeeze tagwise dispersions towards common dispersion or trended dispersions whichever exists.
fit <- glmFit(y, design) # Fits the negative binomial GLM for each tag and produces an object of class DGEGLM with some new components.
lrt <- glmLRT(fit, coef=2) # Takes DGEGLM object and carries out the likelihood ratio test.
topTags(lrt, n=4)
Coefficient: groupuntreated
logConc logFC LR P.Value FDR
FBgn0039155 -9.822991 4.445678 584.9491 3.144586e-129 4.550215e-125
FBgn0039827 -10.684400 4.228985 301.2053 1.799604e-67 1.302014e-63
FBgn0035085 -10.059529 2.437774 235.3055 4.152965e-53 2.003114e-49
FBgn0033764 -11.856412 -3.575744 233.7005 9.297436e-53 3.363347e-49
lrt <- glmLRT(y, fit, contrast=c(-1, 1)) # A contrast/comparison among specific samples can be specified with a vector containing values 0 , 1 and -1, and length matching the number of columns in the design matrix.
###########################
## Normalization Factors ##
###########################
## The use of normalization factors, such as the
weighted trimmed mean of M-values (TMM),
##
can be considered for RNA samples with extreme composition bias, where a small number
## of genes accounts for most of the reads in a library.
y <- calcNormFactors(y) #
Given a count table or DGEList object, the
calcNormFactors() function c
omputes by default the
normalization factors with the
TMM method.
y$samples
group lib.size norm.factors
treated2fb treated 15620018 1.0218827
treated3fb treated 12733865 1.0126983
untreated3fb untreated 10283129 0.9982300
untreated4fb untreated 11653031 0.9680287
## Smear plot for the first sample in y$counts (left panel) and the third one (right panel)
par(mfrow = c(1, 2))
maPlot(y$counts[, 1], y$counts[, 3], normalize = TRUE, pch = 19, cex = 0.4, ylim = c(-8, 8))
grid(col = "blue")
abline(h = log2(y$samples$norm.factors[3]/y$samples$norm.factors[1]), col = "red", lwd = 4)
eff.libsize <- y$samples$lib.size * y$samples$norm.factors
maPlot(y$counts[, 1]/eff.libsize[1], y$counts[, 3]/eff.libsize[3], normalize = FALSE, pch = 19, cex = 0.4, ylim = c(-8, 8))
grid(col = "blue")
A variety of additional R packages are available for normalizing RNA-Seq read count data and identifying differentially expressed genes (DEG):
- easyRNASeq (simplifies read counting per genome feature)
- DEXSeq (Inference of differential exon usage); parathyroidSE explains how to generate exon read counts in R
- DEGseq
- baySeq (also see: segmentSeq)
- Genominator (Bullard et al. 2010)
Detection of Alternative Splice Junctions
Another utility of RNA-Seq experiments is the analysis of splice junctions. The following software suggestions provide this utility:DNA-Methylation Data Analysis
- methylPipe
- bsseq
- BiSeq
- Much more under BiocViews
HT-Seq Data Visualization
-
ggbio: ggplot2 extension for genomics data (online manual)
Gviz: Plotting data and annotation information along genomic coordinates
HilbertVis: Hilbert genome plots
-
GenomeGraphs: Plotting genomic information from Ensembl
-
TileQC: Flow Cell Quality Visualization
-
rtracklayer: R interface to genome browsers
- genoPlotR: Plotting maps of genes and genomes
- Genominator: Tools for storing, accessing, analyzing and visualizing genomic data.
Command Line Alignment Utilities
SOAP
SOAP stands for "Short Oligonucleotide Analysis Package." This is a particularly powerful and fast alignment tool which supports mismatches and indels. It also contains a new tool for SNP calling.SOAP Web Page
Running SOAP
SOAP is run from the unix command line, and requires a reference genome in .fasta format, and a list of query sequences in .fasta or .fastq format. This example was tested with SOAP version 2.18.wget http://biocluster.ucr.edu/~tbackman/genome.fasta # download a test reference genome (TAIR9 Chromosome 1)
wget
http://biocluster.ucr.edu/~tbackman/query.fastq
# download some test Illumina reads from Arabidopsis
soap # inspect command line options
2bwt-builder genome.fasta
# create binary of reference genome
soap -a query.fastq -D genome.fasta.index -o output.soap
# align query to genome and store output
Parsing Output with R
library(ShortRead)
alignedReads <- readAligned("./", pattern="output.soap", type="SOAP")
# Reads in alignment coordinates as a ShortRead AlignedRead object
MAQ
Running MAQ
wget http://biocluster.ucr.edu/~tbackman/genome.fasta # download a test reference genome (TAIR9 Chromosome 1)
wget
http://biocluster.ucr.edu/~tbackman/query.fastq
# download some test Illumina reads from Arabidopsis
maq # inspect command line options
maq fasta2bfa genome.fasta genome.bfa
# create binary of reference genome
maq fastq2bfq query.fastq readBinary.bfq
# create a binary of dataset
maq match out.map genome.bfa readBinary.bfq
# align query to genome and store output
Parsing Output with R
library(ShortRead)
alignedReads <- readAligned("./", pattern="out.map", type="MAQMap")
# Reads in alignment coordinates as a ShortRead AlignedRead object
Bowtie
Running Bowtie
wget http://biocluster.ucr.edu/~tbackman/genome.fasta # download a test reference genome (TAIR9 Chromosome 1)
wget
http://biocluster.ucr.edu/~tbackman/query.fastq
# download some test Illumina reads from Arabidopsis
cp /home/tbackman/.html/
genome.fasta
. # Alternative- copy from biocluster foldercp /home/tbackman/.html/
query.fastq
. # Alternative- copy from biocluster folderbowtie # inspect command line options
bowtie-build
genome.fasta
genome.binary
# create binary genome (replace
genome.fasta
with your genome)
# You can also download pre-built genomes from the Bowtie web page
bowtie -q genome.binary query.fastq output.bowtie
# align query to genome and store output (replace ~/query.fastq with your query sequences
# and ~/genome.binary with the genome you built in the previous step )
Parsing Output with R
library(ShortRead)
alignedReads <- readAligned("./", pattern="output.bowtie", type="Bowtie")
# Reads in alignment coordinates as a ShortRead AlignedRead object
BWA
See the example in the Rsamtools section.Exercises
-
Basic Sequence and Range Operations
- Workshop Dec 14, 2013: [ Slide Show ] [ R Code ]
- NGS Application Domains
- RNA-Seq Profiling
- Workshop Dec 14, 2013: [ Slide Show ] [ R Code ]
- ChIP-Seq Analysis
- Workshop Dec 15, 2013: [ Slide Show ] [ R Code ]
- SNP/INDEL Discovery and Annotation
- Workshop Dec 15, 2013: [ Slide Show ] [ R Code ]
Session Info
If you receive errors or unexpected behavior when performing exercises from this manual, please double check that you are using the same software version(s) for which it was written:> sessionInfo()
R version 2.14.0 (2011-10-31)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] grid stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] ChIPpeakAnno_2.2.0 gplots_2.10.1
[3] KernSmooth_2.23-7 caTools_1.12
[5] bitops_1.0-4.1 gdata_2.8.2
[7] gtools_2.6.2 limma_3.10.0
[9] org.Hs.eg.db_2.6.4 GO.db_2.6.1
[11] RSQLite_0.10.0 DBI_0.2-5
[13] BSgenome.Ecoli.NCBI.20080805_1.3.17 multtest_2.10.0
[15] DESeq_1.6.1 locfit_1.5-6
[17] lattice_0.20-0 akima_0.5-4
[19] BayesPeak_1.6.0 BiocInstaller_1.2.1
[21] GenomicFeatures_1.6.4 AnnotationDbi_1.16.4
[23] leeBamViews_0.99.13 Biobase_2.14.0
[25] edgeR_2.4.0 Rsamtools_1.6.2
[27] biomaRt_2.10.0 BSgenome_1.22.0
[29] GenomicRanges_1.6.3 Biostrings_2.22.0
[31] IRanges_1.12.2
loaded via a namespace (and not attached):
[1] annotate_1.32.0 genefilter_1.36.0 geneplotter_1.32.1 MASS_7.3-16
[5] RColorBrewer_1.0-5 RCurl_1.7-0 rtracklayer_1.14.3 splines_2.14.0
[9] survival_2.36-10 tools_2.14.0 XML_3.4-3 xtable_1.6-0
[13] zlibbioc_1.0.0
Installing Packages
source("http://bioconductor.org/biocLite.R")
biocLite()
biocLite(c("ShortRead", "Biostrings", "IRanges", "BSgenome", "rtracklayer", "biomaRt", "chipseq", "ChIPpeakAnno",
"Rsamtools"
, "BayesPeak", "PICS", "GenomicRanges", "DESeq", "edgeR", "leeBamViews", "GenomicFeatures", "
BSgenome.Celegans.UCSC.ce2
"
))