Introduction
Sequencing Technology Slide Show
This manual introduces the basics of aligning next generation sequence (NGS) data to reference genomes/transcriptomes using the tools available at Galaxy, which is a powerful web service for sequence analysis. To explore and visualize the resulting read pileups along with genome annotation features, this tutorial also introduces the very easy-to-use IGV genome browser from the Broad Institute. No special computer knowledge is required to work through this manual. Its content should be useful for both complete beginners in the NGS analysis area and intermediate users who are mainly interested in visualizing their NGS results.
File Formats for Sequences, Alignments and Annotations
A basic knowledge of the most common sequence, alignment and annotation formats is often useful to understand the inputs/outputs and contents of many analysis tools used in the NGS field. The table below lists some of the most commonly used data formats. More detailed information can be found at the UCSC Genome Browser and the IGV sites.
Source Data | Recommended File Formats |
Sequence and base call quality data | FASTQ and FASTA formats |
Sequence alignment data | SAM and BAM formats |
Genome annotations | GTF, GFF and BED formats |
Dense continuous data (e.g. %GC) | WIG and bigWig |
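To make the FASTQ format from the table more concrete, the following minimal Python sketch reads records from a FASTQ stream. It assumes the common layout of exactly four lines per record (header, sequence, separator, quality) without line wrapping; the function name read_fastq and the example read are made up for illustration.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality) tuples from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line carries no information we need
        quality = handle.readline().rstrip("\n")
        # Each base must have exactly one quality character.
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        yield header[1:], sequence, quality


# A made-up single-record example.
example = io.StringIO("@read1\nACGTN\n+\nIIII#\n")
records = list(read_fastq(example))
print(records)
```

Each quality character encodes the base call quality of the nucleotide at the same position, which is why the two strings must have equal length.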
Galaxy for NGS Analysis
Galaxy Slide Show
Galaxy is a framework for integrating computational tools. It allows nearly any tool that can be run from the command line to be wrapped in a well-defined interface. On top of these tools, Galaxy provides an accessible environment for interactive analysis that transparently tracks the details of analysis steps, a workflow system for convenient reuse, data management, sharing, publishing, and more.
The web-based Galaxy site is freely available to the public. User registration at the site is optional but makes working with it more efficient. Registration is free and allows the user to create custom workflows and save analysis results.
For developing tools, or working with large amounts of data, or data with privacy concerns, there are two options for running your custom instances of Galaxy:
Web Interface
Galaxy's web interface is straightforward and has three sections. The left column contains links to the downloading, preparation and analysis tools. The center column is where the menus and data will appear. The right-hand column shows the history of analysis steps and allows the user to view data and results, and more.
Getting Data
The researcher can obtain data from many different sources, including the UCSC Table Browser, BioMart, WormBase, and many more databases. Most importantly, one can also upload custom data sets.
Getting Data from Databases
The user will find links to large amounts of data from the UCSC genome database, the BioMart database, and several other server sources.
Uploading Custom Data
There are two ways to upload custom data. For small data files, one can specify a local directory for direct upload. For NGS and other large files (>~2GB), uploading via HTTP/FTP is the most reliable method.
NGS Data Quality Control
The FASTQ format comes in a number of variants. Thus, converting it into a consistent format for downstream quality checking can be particularly troublesome. Galaxy contains a set of tools that is able to handle all known FASTQ variants and is intended to simplify the first steps following data acquisition. These steps typically include (1) parsing sequencer output, (2) calculating and (3) visualizing summary statistics on quality scores and nucleotide distributions, (4) trimming reads if necessary, and (5) filtering reads by quality scores, along with other useful manipulations.
Go to "NGS QC and manipulation". This group of tools contains a variety of utilities for dealing with all flavors of FASTQ formats as well as outputs from SOLiD and 454 instruments.
FASTQ Groomer
The FASTQ Groomer tool is used to verify and convert between the known FASTQ variants. The output formats created by this tool are guaranteed to conform with the target variant specified by the user, including the enforcement of quality score minimums and maximums.
After grooming, the user is presented with a valid FASTQ file that is accepted by all downstream analysis tools. The user should select the correct input variant; otherwise Groomer will create outputs that do not reflect the values intended by the sequencing technology. The summary information provided by Groomer can be used as a sanity check. For example, if a user incorrectly specifies the input variant "Solexa" for a Sanger-encoded FASTQ file, the summary information after grooming will state that 'the input data is valid for: Sanger', which directly contradicts the user's selection (Solexa).
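The sanity check described above can be illustrated with a short Python sketch that reports which FASTQ variants a set of quality strings is compatible with. This is not Groomer's implementation; it is a simplified check based on the commonly documented ASCII ranges of the quality characters in each variant, and the names VARIANT_RANGES and valid_variants are hypothetical.

```python
# Allowed ASCII codes of quality characters per FASTQ variant
# (Sanger: Phred+33, Solexa: Solexa scores +64, Illumina 1.3+: Phred+64).
VARIANT_RANGES = {
    "Sanger": (33, 126),
    "Solexa": (59, 126),
    "Illumina 1.3+": (64, 126),
}


def valid_variants(quality_strings):
    """Return the variants whose ASCII range covers every observed quality character."""
    codes = [ord(ch) for quality in quality_strings for ch in quality]
    if not codes:
        raise ValueError("no quality characters given")
    lowest, highest = min(codes), max(codes)
    return [
        name
        for name, (low, high) in VARIANT_RANGES.items()
        if low <= lowest and highest <= high
    ]


# '+' (ASCII 43) lies below the Solexa and Illumina 1.3+ ranges.
print(valid_variants(["II5+"]))
```

A file containing characters below ASCII 59 can only be Sanger-encoded, so a report of "valid for: Sanger" after selecting "Solexa" signals a wrong input choice. Files that use only high-scoring characters remain ambiguous, which is why the check is a sanity check rather than a reliable detector.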
FASTQ Summary Statistics
As quality scores can vary along the length of sequencing reads, a common pre-processing routine is to trim and filter reads based on their per cycle (base position) quality. The FASTQ Summary Statistics by column tool provides the per-position statistics needed to make these decisions. The output of this tool contains read counts, minimums, maximums, sums, means, quartiles with ranges, outliers and nucleotide counts for each base position in a FASTQ file. This statistical summary can be graphed by using a boxplot tool.
Quality Boxplots
Creates a boxplot for the quality scores of a sequence set. The FASTQ Statistics tool is used to generate the report file needed for this tool.
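The kind of per-position report that feeds the boxplot can be sketched in a few lines of Python. This is only an illustration of the idea and not the Galaxy tool itself: it assumes Sanger-encoded qualities (ASCII offset 33), computes a reduced set of statistics, and uses the hypothetical function name per_position_stats.

```python
import statistics


def per_position_stats(quality_strings, offset=33):
    """Summarize quality scores per base position (column) across all reads."""
    columns = {}
    for quality in quality_strings:
        for position, ch in enumerate(quality):
            # Convert the quality character back to its numeric score.
            columns.setdefault(position, []).append(ord(ch) - offset)
    summary = []
    for position in sorted(columns):
        scores = columns[position]
        summary.append({
            "column": position + 1,  # report 1-based cycle numbers
            "count": len(scores),
            "min": min(scores),
            "max": max(scores),
            "mean": statistics.mean(scores),
            "median": statistics.median(scores),
        })
    return summary


# Three made-up reads of different lengths.
stats = per_position_stats(["II", "5I", "+"])
for row in stats:
    print(row)
```

Reads of different lengths are handled by counting only the reads that reach a given position, so the count column can drop towards the 3' end.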
Draw nucleotides distribution chart
Creates a stacked-histogram graph for the nucleotide distribution. The FASTQ Statistics tool is used to generate the report file needed for this tool.
Filter FASTQ reads
The option Filter FASTQ reads by quality score and length allows filtering by minimum and maximum read lengths and by minimum and maximum quality score values over the entire read while allowing a configurable number of deviant bases. Complex filters can also be constructed that allow the user to set offsets, just like with the trimmer tool, to use as bounds for performing a selected aggregation action that is compared to a user specified value. Any number of complex filters can be designed and applied to a set of sequencing reads.
FASTQ Trimmer
To prevent otherwise high-quality reads from being rejected during quality filtering or from influencing mapping or assembly processes, it can be beneficial to trim bases from poor-quality ends of reads. The FASTQ Trimmer by column tool allows trimming from either end of a set of reads by using absolute offsets or by specifying percentage values. Offsets begin at 0 for each end and increase towards the opposing end of the read.
For example, to trim the outer 3 bases from each end of a 36 length sequencing read, a user can specify absolute 5' and 3' offsets of 3 or percentage-based offsets of 8.33 (0.0833 × 36 = 2.9988, which rounds to the nearest integer, 3).
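The offset arithmetic from this example can be written out as a small Python sketch. It is an illustration under the assumptions stated in the text (percentages are converted to a base count by rounding to the nearest integer); the function names to_absolute and trim_read are hypothetical and do not correspond to the Galaxy tool's internals.

```python
def to_absolute(offset, read_length, percent=False):
    """Convert a trimming offset to a number of bases."""
    if percent:
        # e.g. 8.33 % of 36 bases = 2.9988, which rounds to 3
        return round(offset / 100 * read_length)
    return int(offset)


def trim_read(sequence, quality, left=0, right=0, percent=False):
    """Remove `left` bases from the 5' end and `right` bases from the 3' end."""
    length = len(sequence)
    start = to_absolute(left, length, percent)
    end = length - to_absolute(right, length, percent)
    # Sequence and quality strings must always be trimmed identically.
    return sequence[start:end], quality[start:end]


read = "ACGT" * 9   # a made-up 36 base read
qual = "I" * 36
by_bases = trim_read(read, qual, left=3, right=3)
by_percent = trim_read(read, qual, left=8.33, right=8.33, percent=True)
print(len(by_bases[0]), by_bases == by_percent)
```

Both calls remove 3 bases from each end and leave 30 bases. Note that Python's round() resolves exact .5 ties to the nearest even integer, which may differ from other rounding conventions.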
Mapping Data
At present, the NGS Mapping tool group includes the short read aligners Bowtie, BWA, Lastz and Megablast.
Alignment with Bowtie
Bowtie is a short read aligner designed to be ultrafast and memory-efficient. It was developed by Ben Langmead and Cole Trapnell (Genome Biology 10:R25). To use Bowtie, go to "NGS Toolbox Beta", click "NGS: Mapping", and click "Map with Bowtie for Illumina".
Alignment with BWA
BWA is a fast, lightweight tool that aligns relatively short sequences (queries) to a sequence database, such as the human reference genome. It was developed by Heng Li at the Sanger Institute (Li H. and Durbin R., 2009, Bioinformatics, 25, 1754-60). It is an accurate short read aligner that allows mismatches and indels. To use BWA, go to "NGS Toolbox Beta", click "NGS: Mapping", and click "Map with BWA".
Post-processing Data
SAM Tools
The NGS SAM Tools group includes a variety of utilities for SAM/BAM manipulation.
SAM-to-BAM
This tool converts SAM format to BAM format. Go to "NGS Toolbox Beta", click "NGS: SAM Tools", then click "SAM-to-BAM".
Histories & Workflows
Galaxy's history feature can be very helpful. It lists all of a user's analysis histories and creates a new one when a new analysis is started. It can also convert a history into a workflow and supports sharing histories with other users. You can show and delete data from the current history. In the option menu, click "Extract Workflow" to create a workflow. When a series of analysis steps might be handy to reuse in the future, one can turn it into a workflow. A workflow stores the chosen analysis steps, so that one can simply upload a new data set at any time and run the same analysis steps. This way one can easily rerun complex analysis routines on new data sets and share the workflows with others. During a new analysis it is recommended to save a workflow regularly. One can rename a workflow and add analysis steps to it at any time. In addition, one can select which steps to keep in a workflow. Note: workflows are not fully supported in IE6.
Galaxy Exercises
NGS Alignments
- Data source: AP1 ChIP-Seq data set from Kaufmann et al, 2010: http://www.ncbi.nlm.nih.gov/sra/SRP002174.
- Download the following Fastq files and reference genome:
- AP1-GR_uninduced_control: http://biocluster.ucr.edu/~rsun/workshop/SRR038849_1M.fastq.
- AP1-GR_2h_induced_sample: http://biocluster.ucr.edu/~rsun/workshop/SRR038847_1M.fastq.
- Arabidopsis Genome TAIR10: http://biocluster.ucr.edu/~rsun/workshop/TAIR9.fas
- Convert Fastq file to sanger sequences
- Select NGS: QC and manipulation and Fastq groomer
- Alignment with BWA
- Select NGS: Mapping and BWA
- Filter SAM
- Usually, one cares mainly about the reads that map uniquely (exactly once) to a reference genome. One of the problems with NGS mappers is that they all have different ways of reporting multiple hits. In the case of BWA, if a read maps at multiple locations, a single location is randomly chosen and reported. BWA reports this information through optional tags. The tag to use here is XT:A:U, where XT is the name of a user-defined tag, A means that the tag holds a character, and U is the value of the tag, standing for unique (R stands for repeat). So first one selects all lines from the BWA output where XT:A:U is present. This is done with the Filter and Sort -> Select tool.
- Convert SAM to BAM.
- For the import into IGV one needs to convert the SAM format into its binary BAM representation by using the function NGS SAM Tool -> SAM to BAM. The resulting BAM files are stored under the following names: http://biocluster.ucr.edu/~rsun/workshop/AP1UIND.bam and http://biocluster.ucr.edu/~rsun/workshop/AP1IND.bam. One can then import the two alignment pileups as BAM files into IGV for visualization. (Note: an index file is required for importing a SAM or BAM file into IGV. One can use the "samtools index" command or the "igvtools index" command to generate the index file.)
- In this workshop, we also generated two small .bam files for the region Chr1:15166146-15242215 for the following IGV exercise. These two BAM files can be imported into IGV via the option "import from URL". For this the following URLs should be used: http://biocluster.ucr.edu/~rsun/workshop/APIUIND.sorted.region.bam and http://biocluster.ucr.edu/~rsun/workshop/APIIND.sorted.region.bam.
- Download data from Galaxy's published data site.
- Go to Shared Data -> Data Library -> Sample NGS Datasets -> Import human illumina dataset to your current history.
- Convert Fastq file to sanger sequences
- Select NGS: QC and manipulation and Fastq groomer
- FASTQ summary statistics
- To understand the quality properties of the reads, one can run the FASTQ Summary Statistics tool from NGS: QC and manipulation.
- Quality boxplots
- A graphical representation of these data facilitates their quality assessment. The plot can be readily produced with the tool Graph/Display Data -> Boxplot.
- Trim low-quality tails off reads and map with Bowtie
- Go to NGS: Mapping and choose Bowtie. Once Bowtie's interface is displayed in the center, select hg19 from Select a reference genome, then select Full parameter list, and set 10 for Trim n bases from low-quality (right) end of each read before alignment (-3) to trim off the low quality tails from the reads.
- Convert SAM to BAM.
- For the import into IGV one needs to convert the SAM format into its binary BAM representation by using the NGS SAM Tool -> SAM to BAM.
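The unique-read selection from the Filter SAM step of the first exercise can also be reproduced outside of Galaxy. The following Python sketch is one possible way to do it, assuming a tab-delimited SAM file in which BWA has written the XT tag among the optional fields (column 12 onwards); the function name select_unique and the example records are made up for illustration.

```python
def select_unique(sam_lines, tag="XT:A:U"):
    """Yield SAM header lines plus the alignments that carry the given optional tag."""
    for line in sam_lines:
        if line.startswith("@"):
            yield line  # keep the header so the result can still be converted to BAM
            continue
        fields = line.rstrip("\n").split("\t")
        # The first 11 SAM columns are mandatory; optional tags follow.
        if tag in fields[11:]:
            yield line


# Made-up example: one header line, one unique read and one repeat read.
lines = [
    "@SQ\tSN:Chr1\tLN:1000\n",
    "r1\t0\tChr1\t100\t37\t5M\t*\t0\t0\tACGTA\tIIIII\tXT:A:U\tNM:i:0\n",
    "r2\t0\tChr1\t200\t0\t5M\t*\t0\t0\tACGTA\tIIIII\tXT:A:R\tNM:i:0\n",
]
kept = list(select_unique(lines))
print("".join(kept), end="")
```

Comparing whole optional fields instead of searching for the text anywhere in the line avoids accidental matches, for example in a read name.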
NGS Data Visualization and Exploration Using IGV
Introduction
The Integrative Genomics Viewer (IGV) is an efficient visualization tool for interactive exploration of large genome datasets. It supports a wide variety of data types including NGS alignments, genomic annotations, expression data, genetic variations, etc. Alternative browsers can be found on these pages: NGS Alignment Viewers and NGS Viewers Reviewed.
Installation
IGV is a local Java application with web start capabilities. It can be downloaded from this page: http://www.broadinstitute.org/igv/download. The first option, with the smallest memory footprint (750MB), is sufficient for most applications.
User Manuals
IGV Exercises
- Save the following genome sequence and GFF annotation files to a directory called 'myigv':
- Import the new genome into IGV
- File -> Import Genome
- Follow instructions and name genome: Arab_test
- Explore IGV's genome viewing and panel functions. For instance:
- Click chromosome ID and zoom in and out.
- Click icons on top: home (whole genome view), refresh screen and define region of interest.
- Expand Genome track to view gene regions (triangle in top left corner) then right click on a feature and copy its annotation or sequence to clipboard.
- Paste gene ID AT1G34575 into search field and hit Go.
- Save current session in myigv directory, close IGV and then restart this session.
- File -> Save Session
Viewing Expression Data
- Prebuilt genomes
- Select Human hg18 in the genome drop-down
- Import expression data and display as heatmap and line plots
- File -> Load from File
Viewing NGS Data
- Import NGS data from human
- File -> Load from File
- NA12878.SLX.chr1_sample.bam (requires *.bam file from this site)
- Import custom NGS aligned read data generated in above Galaxy exercise
- Select under prebuilt genomes A. thaliana (TAIR9)
- Import aligned reads from Galaxy exercise
- File -> Load from URL
- http://biocluster.ucr.edu/~rsun/workshop/APIUIND.sorted.region.bam
- http://biocluster.ucr.edu/~rsun/workshop/APIIND.sorted.region.bam
- http://biocluster.ucr.edu/~rsun/workshop/APIUIND.sorted.region.bam.bai
- http://biocluster.ucr.edu/~rsun/workshop/APIIND.sorted.region.bam.bai
- Go to region
- Chr1:15166146-15242215
- Create various custom tracks from Excel (or R, Perl, Python, etc.), and import them into IGV. The Recommended File Formats page contains detailed information about the formats for handling different data types, such as mutation tracks, ChIP-Seq and RNA-Seq data.
- Mutation track sample (details on this format). The colors for displaying different mutation types can be changed in IGV under: View -> Color Legends -> Mutation
- A BED formatted file can be used to define ranges (details on this format).
- A similar result can be achieved with the GFF3 format (details on this format).
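As an illustration of creating a custom track with a script, the following Python sketch builds the text of a small BED file. It assumes the six standard BED columns (chromosome, start, end, name, score, strand) with a 0-based start and an exclusive end; the function names bed_line and bed_track, the feature name and the track name are hypothetical, and the region is the one used in the exercise above.

```python
def bed_line(chrom, start_1based, end_1based, name=".", score=0, strand="."):
    """Format one BED record from 1-based, inclusive genome coordinates."""
    # BED uses a 0-based start and an end that is not included in the feature.
    fields = [chrom, start_1based - 1, end_1based, name, score, strand]
    return "\t".join(str(field) for field in fields)


def bed_track(regions, track_name="custom regions"):
    """Return the full text of a BED file with an optional track line on top."""
    lines = [f'track name="{track_name}"']
    lines.extend(bed_line(*region) for region in regions)
    return "\n".join(lines) + "\n"


# The region from the IGV exercise, given in 1-based coordinates.
text = bed_track([("Chr1", 15166146, 15242215, "exercise_region")])
print(text, end="")
```

The returned text can be saved to a file ending in .bed and loaded with File -> Load from File. The chromosome names in the file have to match those of the loaded genome for the features to be displayed.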