This manual introduces the basics of aligning next-generation sequencing (NGS) data to reference genomes and transcriptomes using the tools available at Galaxy, a powerful web service for sequence analysis. To explore and visualize the resulting read pileups along with genome annotation features, this tutorial also introduces the easy-to-use IGV genome browser from the Broad Institute. No special computer knowledge is required to work through this manual. Its content should be useful both for complete beginners in NGS analysis and for intermediate users who are mainly interested in visualizing their NGS results.
File Formats for Sequences, Alignments and Annotations
A basic knowledge of the most common sequence, alignment and annotation formats helps in understanding the inputs, outputs and contents of many analysis tools used in the NGS field. The table below lists some of the most commonly used data formats. More detailed information can be found at the UCSC Genome Browser and IGV sites.
| Data type | Recommended File Formats |
| Sequence and base call quality data | FASTQ and FASTA formats |
| Sequence alignment data | SAM and BAM formats |
| Annotation data | GTF, GFF and BED formats |
| Dense continuous data (e.g. %GC) | WIG and bigWig formats |
Galaxy for NGS Analysis
Galaxy is a framework for
integrating computational tools. It allows nearly any tool that can be
run from the command line to be wrapped in a well-defined
interface. On top of these tools, Galaxy provides an accessible environment for
interactive analysis that transparently tracks the details of analysis steps, a
workflow system for convenient reuse, data management, sharing,
publishing, and more.
The web-based Galaxy site is freely available to the public. User registration is optional and free, and it is worthwhile because it allows the user to create custom workflows and save analysis results.
For developing tools, or working with large amounts of data, or data with privacy concerns, there are two options for running your custom instances of Galaxy:
Galaxy's web interface is straightforward and has three sections. The left column contains links to the data download, preparation and analysis tools. The center column is where the menus and data appear. The right-hand column shows the history of analysis steps and lets the user view data and results, among other things.
The researcher can obtain data from many different sources, including the UCSC Table Browser, BioMart, WormBase and many other databases. Most importantly, one can also upload custom data sets.
Getting Data from Databases
The user will find links to large amounts of data from the UCSC genome database, the BioMart database, and several other server sources.
Uploading Custom Data
There are two ways to upload custom data. For small data files, one can specify a local directory for direct upload. For NGS and other large files (>~2GB), uploading via HTTP/FTP is the most reliable method.
NGS Data Quality Control
The FASTQ format comes in a number of variants, so converting it into a consistent format for downstream quality checking can be particularly troublesome. Galaxy contains a set of tools that is able to handle all known FASTQ variants and is intended to simplify the first steps following data acquisition. These steps typically include (1) parsing sequencer output, (2) calculating and (3) visualizing summary statistics on quality scores and nucleotide distributions, (4) trimming reads if necessary, and (5) filtering reads by quality scores, along with other useful manipulations.
Go to "NGS QC and manipulation". This group of tools contains a variety of utilities for dealing with all flavors of FASTQ formats as well as outputs from SOLiD and 454 instruments.
The FASTQ Groomer tool is used to verify and convert between the known FASTQ variants. The output formats created by this tool are guaranteed to conform with the target variant specified by the user, including the enforcement of quality score minimums and maximums.
After grooming, the user is presented with a valid FASTQ format that is accepted by all downstream analysis tools. The user should select the correct input variant; otherwise Groomer will create outputs that do not reflect the values intended by the sequencing technology. The summary information provided by Groomer can be used as a sanity check. For example, if a user incorrectly specifies the input variant "Solexa" for a Sanger-encoded FASTQ file, the summary information after grooming will state that "the input data is valid for: Sanger", which directly contradicts the user's selection (Solexa).
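The sanity check described above can be approximated outside Galaxy. The following is a minimal sketch, not Groomer's actual implementation: it assumes the commonly cited ASCII ranges for each encoding, and the names `FASTQ_VARIANTS` and `valid_variants` are hypothetical.

```python
# Commonly cited ASCII ranges for FASTQ quality encodings.
# These bounds are assumptions for illustration, not Groomer's own tables.
FASTQ_VARIANTS = {
    "Sanger": (33, 126),         # Phred scores, offset 33
    "Solexa": (59, 126),         # Solexa scores (from -5), offset 64
    "Illumina 1.3+": (64, 126),  # Phred scores, offset 64
}


def valid_variants(quality_strings):
    """Return the variants whose ASCII range covers every quality character."""
    codes = [ord(ch) for q in quality_strings for ch in q]
    lowest, highest = min(codes), max(codes)
    return [
        name
        for name, (vmin, vmax) in FASTQ_VARIANTS.items()
        if vmin <= lowest and highest <= vmax
    ]


# '!' (ASCII 33) can only occur in Sanger-encoded data.
print(valid_variants(["IIII5+!#", "HHHGG@@;"]))  # ['Sanger']
```

If the returned list does not contain the variant selected as input, the selection is probably wrong. Note that high-quality data may be valid for several variants at once, so this check can rule out a variant but cannot always confirm one.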
FASTQ Summary Statistics
As quality scores can vary along the length of sequencing reads, a common pre-processing routine is to trim and filter reads based on their per-cycle (base position) quality. The FASTQ Summary Statistics by column tool provides the per-position statistics needed to make these decisions.
The output of this tool contains read counts,
minimums, maximums, sums, means, quartiles with ranges, outliers and
counts for each base position in a FASTQ file.
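To illustrate what a per-column summary contains, here is a small sketch that computes a subset of these statistics from quality strings. It is not the Galaxy tool itself: the function name `column_stats` is hypothetical, Sanger encoding (offset 33) is assumed, and quartiles and outliers are omitted for brevity.

```python
from statistics import mean, median


def column_stats(quality_strings, offset=33):
    """Per-position count, min, max, sum, mean and median of quality scores."""
    columns = {}
    for q in quality_strings:
        for pos, ch in enumerate(q):
            # Decode the ASCII character into a numeric quality score.
            columns.setdefault(pos, []).append(ord(ch) - offset)
    return [
        {
            "column": pos + 1,  # 1-based base position
            "count": len(scores),
            "min": min(scores),
            "max": max(scores),
            "sum": sum(scores),
            "mean": mean(scores),
            "median": median(scores),
        }
        for pos, scores in sorted(columns.items())
    ]


for row in column_stats(["II5", "I5"]):
    print(row)
```

Reads of different lengths are handled naturally: later columns simply have a smaller count.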
This statistical summary can be graphed by using a boxplot tool.
Creates a boxplot for the quality scores of a sequence set. The FASTQ Statistics
tool is used to generate the report file needed for this tool.
Draw nucleotides distribution chart
Creates a stacked-histogram graph for the nucleotide distribution. The FASTQ Statistics
tool is used to generate the report file needed for this tool.
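The data behind such a chart is a count of each nucleotide at each base position. A minimal sketch of that tabulation, assuming plain read sequences as input and using the hypothetical name `nucleotide_distribution`, could look like this:

```python
from collections import Counter


def nucleotide_distribution(reads):
    """Count each nucleotide at each base position across all reads."""
    counts = []
    for read in reads:
        for pos, base in enumerate(read.upper()):
            if pos == len(counts):
                counts.append(Counter())  # first read that reaches this position
            counts[pos][base] += 1
    return counts


for pos, counter in enumerate(nucleotide_distribution(["ACGT", "AAG"]), start=1):
    print(pos, dict(counter))
```

Each `Counter` corresponds to one stacked bar in the histogram.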
Filter FASTQ reads
The option Filter FASTQ reads by quality score and length allows filtering by minimum and maximum read lengths and by minimum and maximum quality score values over the entire read, while allowing a configurable number of deviant bases. Complex filters can also be constructed that allow the user to set offsets, just like with the trimmer tool, to use as bounds for a selected aggregation action whose result is compared to a user-specified value. Any number of complex filters can be designed and applied to a set of sequencing reads.
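The basic (non-complex) filter can be sketched as follows. This is an illustration under stated assumptions rather than the Galaxy tool's code: the function name `passes_filter` and its parameter names are hypothetical, and Sanger encoding (offset 33) is assumed.

```python
def passes_filter(seq, qual, min_len=0, max_len=None,
                  min_q=0, max_q=None, max_deviants=0, offset=33):
    """Return True if a read satisfies the length and quality bounds.

    Up to `max_deviants` bases may fall outside the quality bounds.
    """
    if len(seq) < min_len:
        return False
    if max_len is not None and len(seq) > max_len:
        return False
    scores = [ord(ch) - offset for ch in qual]
    deviants = sum(
        1 for s in scores
        if s < min_q or (max_q is not None and s > max_q)
    )
    return deviants <= max_deviants


# One base has quality 20, below the minimum of 30.
print(passes_filter("ACGT", "II5I", min_q=30))                  # False
print(passes_filter("ACGT", "II5I", min_q=30, max_deviants=1))  # True
```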
To prevent otherwise high-quality reads from being rejected during quality filtering or from influencing mapping or assembly processes, it can be beneficial to trim bases from poor-quality ends of reads. The FASTQ Trimmer by column tool allows trimming from either end of a set of reads by using absolute offsets or by specifying percentage values. Offsets start at 0 for each end and increase towards the opposing end of the read.
For example, to trim the outer 3 bases from each end of a 36-base sequencing read, a user can specify absolute 5' and 3' offsets of 3, or percentage-based offsets of 8.33 (0.0833 × 36 = 2.9988, which rounds to the nearest integer, 3).
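The offset arithmetic in this example can be checked with a short sketch. The function `trim_read` and its parameters are hypothetical and only mirror the behavior described in the text; the Galaxy tool may handle edge cases differently.

```python
def trim_read(seq, qual, left=0, right=0, percent=False):
    """Trim `left` bases from the 5' end and `right` bases from the 3' end.

    With percent=True the offsets are percentages of the read length,
    rounded to the nearest integer number of bases.
    """
    n = len(seq)
    if percent:
        left = round(n * left / 100)
        right = round(n * right / 100)
    end = n - right
    return seq[left:end], qual[left:end]


read = "AAA" + "C" * 30 + "GGG"   # 36 bases
qual = "I" * 36

# Absolute offsets of 3 and percentage offsets of 8.33 give the same result.
print(trim_read(read, qual, 3, 3)[0])
print(trim_read(read, qual, 8.33, 8.33, percent=True)[0])
```

Both calls return the 30 central bases together with the matching 30 quality characters.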
At present, the NGS Mapping tool group includes the short read aligners Bowtie, BWA, Lastz and Megablast.
Alignment with Bowtie
Bowtie is a short read aligner designed to be ultrafast and memory-efficient. It is developed by Ben Langmead and Cole Trapnell (Genome Biology).
To use Bowtie, go to "NGS Toolbox Beta", click "NGS: Mapping", and click "Map with Bowtie for Illumina".
Alignment with BWA
BWA is a fast, lightweight tool that aligns relatively short sequences (queries) to a sequence database, such as the human reference genome. It is an accurate short read aligner that allows mismatches and indels. It is developed by Heng Li at the Sanger Institute (Li H. and Durbin R., 2009, Bioinformatics, 25, 1754-60).
To use BWA, go to "NGS Toolbox Beta", click "NGS: Mapping", and click "Map with BWA".
The NGS SAMTools group of tools includes a variety of utilities for SAM/BAM manipulation.
The SAM-to-BAM tool converts SAM format to BAM format. Go to "NGS Toolbox Beta", click "NGS: SAM Tools", then click "SAM-to-BAM".
Histories & Workflows
Galaxy's history feature can be very helpful. It lists all the analysis histories of a user and creates new ones when a new analysis is started. It can also convert a history into a workflow and supports sharing histories with other users. You can show and delete data from the current history.
When a series of analysis steps might be handy to re-use in the future, one can turn it into a workflow: in the option menu, click "Extract Workflow". A workflow stores the chosen analysis steps, so that one can simply upload a new data set at any time and run the same analysis steps. This way one can easily rerun complex analysis routines on new data sets and share the workflows with others. During a new analysis it is recommended to save a workflow regularly. One can rename a workflow and add analysis steps to it at any time. In addition, one can select which steps to keep in a workflow. Note: workflows are not fully supported in IE6.
- Data source: AP1 ChIP-Seq data set from Kaufmann et al., 2010: http://www.ncbi.nlm.nih.gov/sra/SRP002174
- Download the following FASTQ files and reference genome:
- AP1-GR_uninduced_control: http://biocluster.ucr.edu/~rsun/workshop/SRR038849_1M.fastq
- AP1-GR_2h_induced_sample: http://biocluster.ucr.edu/~rsun/workshop/SRR038847_1M.fastq
- Arabidopsis Genome TAIR10: http://biocluster.ucr.edu/~rsun/workshop/TAIR9.fas
- Convert the FASTQ files to Sanger format
- Select "NGS: QC and manipulation" and then "FASTQ Groomer"
Alignment with BWA
- Select "NGS: Mapping" and then "Map with BWA"
- Usually, one cares mainly about the reads that map uniquely (exactly once) to a reference genome. One of the problems with NGS mappers is that they all have different ways of reporting multiple hits. In the case of BWA, if a read maps to multiple locations, a single location is chosen at random and reported. BWA reports this information through optional tags. The tag to use here is XT:A:U, where XT is a user-defined tag, A means the tag holds a single character, and U is the value of the tag, which stands for unique (R stands for repeat). So first one selects all lines from the BWA output where XT:A:U is present. This is done with the Filter and Sort -> Select tool.
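The same selection can be sketched outside Galaxy in a few lines. This is an illustration only: the function name `uniquely_mapped` is hypothetical, and unlike a plain line-matching selection it deliberately keeps SAM header lines (those starting with `@`), which is a design choice of this sketch.

```python
def uniquely_mapped(sam_lines):
    """Yield header lines and alignments that carry the XT:A:U tag."""
    for line in sam_lines:
        if line.startswith("@"):
            yield line  # keep SAM header lines (a choice made in this sketch)
            continue
        fields = line.rstrip("\n").split("\t")
        # The first 11 SAM fields are mandatory; optional tags follow.
        if "XT:A:U" in fields[11:]:
            yield line


sam = [
    "@SQ\tSN:Chr1\tLN:1000",
    "r1\t0\tChr1\t100\t37\t4M\t*\t0\t0\tACGT\tIIII\tXT:A:U\tNM:i:0",
    "r2\t0\tChr1\t200\t0\t4M\t*\t0\t0\tACGT\tIIII\tXT:A:R\tNM:i:0",
]
for line in uniquely_mapped(sam):
    print(line)
```

Checking only the optional fields avoids false matches in the read name or sequence columns.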
Convert SAM to BAM.