NGS Analysis with Galaxy and IGV

Authors: Rebecca Sun and Thomas Girke, UC Riverside


Introduction

Sequencing Technology Slide Show

This manual introduces the basics of aligning next-generation sequencing (NGS) data to reference genomes or transcriptomes using the tools available at Galaxy, a powerful web service for sequence analysis. To explore and visualize the resulting read pileups along with genome annotation features, the tutorial also introduces the easy-to-use IGV genome browser from the Broad Institute. No special computer knowledge is required to work through this manual. Its content should be useful both for complete beginners in NGS analysis and for intermediate users who are mainly interested in visualizing their NGS results.


File Formats for Sequences, Alignments and Annotations

A basic knowledge of the most common sequence, alignment and annotation formats is often useful to understand the inputs/outputs and contents of many analysis tools used in the NGS field. The table below lists some of the most commonly used data formats. More detailed information can be found at the UCSC Genome Browser and the IGV sites.

Source Data                            Recommended File Formats
Sequence and base call quality data    FASTQ and FASTA formats
Sequence alignment data                SAM and BAM formats
Genome annotations                     GTF, GFF and BED formats
Dense continuous data (e.g. %GC)       WIG and bigWig formats

Galaxy for NGS Analysis

Galaxy Slide Show

Galaxy is a framework for integrating computational tools. It allows nearly any tool that can be run from the command line to be wrapped in a well-defined interface. On top of these tools, Galaxy provides an accessible environment for interactive analysis that transparently tracks the details of analysis steps, a workflow system for convenient reuse, data management, sharing, publishing, and more.

The web-based Galaxy site is freely available to the public. User registration is optional but worthwhile: it is free and allows the user to create custom workflows and save analysis results.

For developing tools, or working with large amounts of data, or data with privacy concerns, there are two options for running your custom instances of Galaxy:

Web Interface

Galaxy's web interface is straightforward and has three sections. The left column contains links to the tools for downloading, preparing and analyzing data. The center column is where the menus and data appear. The right column shows the history of analysis steps and lets the user view data and results, among other things.

Getting Data

The researcher can obtain data from many different sources, including the UCSC Table Browser, BioMart, WormBase and many other databases. Most importantly, one can also upload custom data sets.

Getting Data from Databases

The user will find links to large amounts of data from the UCSC genome database, the BioMart database, and several other server sources.

Uploading Custom Data

There are two ways to upload custom data. For small data files, one can specify a local directory for direct upload. For NGS and other large files (>~2GB), uploading via HTTP/FTP is the most reliable method.

NGS Data Quality Control

The FASTQ format comes in a number of variants. Thus, converting it into a consistent format for downstream quality checking can be particularly troublesome. Galaxy contains a set of tools that is able to handle all known FASTQ variants and is intended to simplify the first steps following data acquisition. These steps typically include (1) parsing sequencer output, (2) calculating and (3) visualizing summary statistics on quality scores and nucleotide distributions, (4) trimming reads if necessary, (5) filtering reads by quality scores and other useful manipulations.

Go to "NGS QC and manipulation". This group of tools contains a variety of utilities for dealing with all flavors of FASTQ formats as well as outputs from SOLiD and 454 instruments.

FASTQ Groomer

The FASTQ Groomer tool is used to verify and convert between the known FASTQ variants. The output formats created by this tool are guaranteed to conform with the target variant specified by the user, including the enforcement of quality score minimums and maximums. After grooming, the user is presented with a valid FASTQ format that is accepted by all downstream analysis tools.

The user should select the correct input variant; otherwise Groomer will create output that does not reflect the values intended by the sequencing technology. The summary information provided by Groomer can be used as a sanity check. For example, if a user incorrectly specifies the input variant "Solexa" for a Sanger-encoded FASTQ file, the summary shown after grooming will report 'the input data is valid for: Sanger', which directly contradicts the user's selection (Solexa).
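
The same kind of sanity check can be sketched outside Galaxy. The following Python example is only a rough heuristic, not the method used by Groomer: it assumes four-line (unwrapped) FASTQ records and guesses the variant from the lowest ASCII code found on the quality lines. The function names quality_range and guess_variant are made up for this illustration.

```python
import io

def quality_range(handle):
    """Return the (min, max) ASCII codes found on the quality lines.

    Assumes four-line FASTQ records; returns None for empty input.
    """
    lo, hi = None, None
    for i, line in enumerate(handle):
        if i % 4 != 3:  # only every fourth line holds quality characters
            continue
        codes = [ord(c) for c in line.rstrip("\n")]
        if not codes:
            continue
        lo = min(codes) if lo is None else min(lo, min(codes))
        hi = max(codes) if hi is None else max(hi, max(codes))
    return None if lo is None else (lo, hi)

def guess_variant(lowest_code):
    """Heuristic guess of the FASTQ variant from the lowest quality code."""
    if lowest_code < 59:
        return "Sanger (Phred+33)"    # codes below ';' only occur with offset 33
    if lowest_code < 64:
        return "Solexa (Solexa+64)"   # Solexa scores can go down to -5 (';')
    return "Illumina 1.3+ (Phred+64)"  # could also be high-quality Sanger data

# Made-up record for illustration
example = io.StringIO("@read1\nACGT\n+\n!5II\n")
lo, hi = quality_range(example)
print(guess_variant(lo))
```

A file that contains only high quality scores cannot be classified reliably this way, which is one reason why the user still has to state the input variant explicitly.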

FASTQ Summary Statistics

As quality scores can vary along the length of sequencing reads, a common pre-processing routine is to trim and filter reads based on their per-cycle (base position) quality. The FASTQ Summary Statistics by column tool provides the information needed for this task. Its output contains read counts, minimums, maximums, sums, means, quartiles with ranges, outliers and nucleotide counts for each base position in a FASTQ file. This statistical summary can be graphed with a boxplot tool.
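
To illustrate what a per-column summary looks like, here is a minimal Python sketch. It is not the Galaxy tool itself and reports only a subset of its statistics. It assumes four-line FASTQ records in Sanger encoding (offset 33); the function names are hypothetical.

```python
import io
import statistics

def per_column_quality(handle, offset=33):
    """Collect the quality scores of each base position (column)."""
    columns = []
    for i, line in enumerate(handle):
        if i % 4 != 3:  # skip header, sequence and '+' lines
            continue
        for pos, char in enumerate(line.rstrip("\n")):
            if pos == len(columns):
                columns.append([])  # first read that reaches this position
            columns[pos].append(ord(char) - offset)
    return columns

def summarize(columns):
    """Return one summary row per base position."""
    rows = []
    for pos, scores in enumerate(columns, start=1):
        rows.append({
            "column": pos,
            "count": len(scores),
            "min": min(scores),
            "max": max(scores),
            "mean": statistics.mean(scores),
            "median": statistics.median(scores),
        })
    return rows

# Two made-up reads of different length
example = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACG\n+\nI5!\n")
for row in summarize(per_column_quality(example)):
    print(row)
```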

Quality Boxplots

This tool creates a boxplot for the quality scores of a sequence set. The FASTQ Summary Statistics tool is used to generate the report file needed for this tool.

Draw nucleotides distribution chart

This tool creates a stacked-histogram graph for the nucleotide distribution. The FASTQ Summary Statistics tool is used to generate the report file needed for this tool.

Filter FASTQ reads

The option Filter FASTQ reads by quality score and length allows filtering by minimum and maximum read lengths and by minimum and maximum quality score values over the entire read while allowing a configurable number of deviant bases. Complex filters can also be constructed that allow the user to set offsets, just like with the trimmer tool, to use as bounds for performing a selected aggregation action that is compared to a user specified value. Any number of complex filters can be designed and applied to a set of sequencing reads.
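
The simple (non-complex) filter can be sketched in Python as follows. This is a simplified illustration rather than Galaxy's implementation: it checks only the read length and a minimum quality score, assumes Sanger encoding (offset 33), and the function and parameter names are invented for this example.

```python
def passes_filter(quality, min_len=0, max_len=None, min_score=20,
                  max_deviants=0, offset=33):
    """Return True if a read passes the length and quality criteria.

    quality      -- quality string of one read
    max_deviants -- number of bases allowed to fall below min_score
    """
    length = len(quality)
    if length < min_len or (max_len is not None and length > max_len):
        return False
    # Count the bases whose score is below the required minimum
    deviants = sum(1 for char in quality if ord(char) - offset < min_score)
    return deviants <= max_deviants

# 'I' encodes a score of 40 and '5' a score of 20 in Sanger encoding
print(passes_filter("IIII5", min_score=30))                  # False
print(passes_filter("IIII5", min_score=30, max_deviants=1))  # True
```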

FASTQ Trimmer

To prevent otherwise high-quality reads from being rejected during quality filtering or from influencing mapping or assembly processes, it can be beneficial to trim bases from poor-quality ends of reads. The FASTQ Trimmer by column tool allows trimming from either end of a set of reads by using absolute offsets or by specifying percentage values. Offsets begin at 0 for each end and increase towards the opposing end of the read.

For example, to trim the outer 3 bases from each end of a 36 length sequencing read, a user can specify absolute 5' and 3' offsets of 3 or percentage-based offsets of 8.33 (0.0833 × 36 = 2.9988, rounded to the nearest integer = 3).
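
The offset arithmetic of this example can be reproduced with a short Python sketch. It assumes that percentage offsets are rounded to the nearest integer, as described above; the function name trim_read is hypothetical, and Python's round() resolves exact .5 ties to the even neighbor, which may differ from the Galaxy tool.

```python
def trim_read(sequence, quality, left=0, right=0, percent=False):
    """Trim `left` bases from the 5' end and `right` bases from the 3' end.

    With percent=True the offsets are read as percentages of the read
    length and rounded to the nearest integer.
    """
    length = len(sequence)
    if percent:
        left = round(length * left / 100)
        right = round(length * right / 100)
    end = length - right
    # Sequence and quality string must always be trimmed identically
    return sequence[left:end], quality[left:end]

read = "ACGT" * 9   # made-up read of length 36
qual = "I" * 36

absolute, _ = trim_read(read, qual, left=3, right=3)
relative, _ = trim_read(read, qual, left=8.33, right=8.33, percent=True)
print(len(absolute), len(relative))
```

Both calls remove three bases from each end, leaving 30 of the 36 bases.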

Mapping Data

At present, the NGS: Mapping tool group includes the short read aligners Bowtie, BWA, Lastz and Megablast.

Alignment with Bowtie

Bowtie is a short read aligner designed to be ultrafast and memory-efficient. It was developed by Ben Langmead and Cole Trapnell (Genome Biology 10:R25).

To use Bowtie, go to "NGS Toolbox Beta", click "NGS: Mapping", and click "Map with Bowtie for Illumina".

Alignment with BWA

BWA is a fast, lightweight tool that aligns relatively short sequences (queries) to a sequence database, such as the human reference genome. It was developed by Heng Li at the Sanger Institute and is an accurate short read aligner that allows mismatches and indels (Li H. and Durbin R., 2009, Bioinformatics, 25, 1754-60).

To use BWA, go to "NGS Toolbox Beta", click "NGS: Mapping", and click "Map with BWA".

Post-processing Data

SAM Tools

The NGS: SAM Tools group includes a variety of utilities for SAM/BAM manipulation.

SAM-to-BAM

This tool converts SAM format to BAM format. Go to "NGS Toolbox Beta", click "NGS: SAM Tools", then click "SAM-to-BAM".

Histories & Workflows

Galaxy's history feature can be very helpful. It lists all the analysis histories of a user and creates a new one when a new analysis is started. It can also convert a history into a workflow and supports sharing histories with other users. You can show and delete data in the current history.

In the options menu, click "Extract Workflow" to create a workflow. When a series of analysis steps might be handy to reuse in the future, one can turn it into a workflow. A workflow stores the chosen analysis steps, so that one can simply upload a new data set at any time and run the same analysis steps. This way one can easily rerun complex analysis routines on new data sets and share the workflows with others. During a new analysis it is recommended to save a workflow regularly. One can rename a workflow and add analysis steps to it at any time. In addition, one can select which steps to keep in a workflow. Note: workflows are not fully supported in IE6.


Galaxy Exercises

NGS Alignments

  1. Data source: AP1 ChIP-Seq data set from Kaufmann et al., 2010: http://www.ncbi.nlm.nih.gov/sra/SRP002174.
  2. Download the following FASTQ files and reference genome:
    1. AP1-GR_uninduced_control: http://biocluster.ucr.edu/~rsun/workshop/SRR038849_1M.fastq.
    2. AP1-GR_2h_induced_sample: http://biocluster.ucr.edu/~rsun/workshop/SRR038847_1M.fastq.
    3. Arabidopsis Genome TAIR10: http://biocluster.ucr.edu/~rsun/workshop/TAIR9.fas   

  3. Convert the FASTQ files to Sanger format
    • Select NGS: QC and manipulation and then FASTQ Groomer

  4. Alignment with BWA
    • Select NGS: Mapping and then Map with BWA

  5. Filter SAM 
    • Usually, one cares mainly about the reads that map uniquely (exactly once) to a reference genome. One of the problems with NGS mappers is that they all report multiple hits in different ways. If a read maps to multiple locations, BWA randomly chooses a single location and reports it. BWA conveys this information through optional tags. The tag to use here is XT:A:U: XT is the name of a user-defined tag, A means that the tag holds a character, and U is the value of the tag, which stands for unique (R stands for repeat). So one first selects all lines from the BWA output where XT:A:U is present. This is done with the Filter and Sort -> Select tool.

  6. Convert SAM to BAM. 
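
The selection of uniquely mapped reads in step 5 can also be sketched in Python. This is an illustration only and not the Galaxy Select tool: the sketch additionally keeps the SAM header lines (a choice made here so that the output stays a valid SAM file), and the alignment records in the example are made up.

```python
import io

def unique_hits(sam_lines):
    """Yield header lines and alignments that carry the tag XT:A:U."""
    for line in sam_lines:
        if line.startswith("@"):
            yield line  # header line, kept unchanged
            continue
        fields = line.rstrip("\n").split("\t")
        # SAM has 11 mandatory columns; optional tags start at column 12
        if "XT:A:U" in fields[11:]:
            yield line

# Made-up SAM content: r1 maps uniquely, r2 maps to a repeat
sam = io.StringIO(
    "@SQ\tSN:Chr1\tLN:1000\n"
    "r1\t0\tChr1\t100\t37\t4M\t*\t0\t0\tACGT\tIIII\tXT:A:U\tNM:i:0\n"
    "r2\t0\tChr1\t200\t0\t4M\t*\t0\t0\tACGT\tIIII\tXT:A:R\tNM:i:0\n"
)
for line in unique_hits(sam):
    print(line, end="")
```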

Quality Control

  1. Download data from Galaxy's published data site.
    • Go to Shared Data -> Data Library -> Sample NGS Datasets and import the human Illumina dataset into your current history.

  2. Convert the FASTQ file to Sanger format
    • Select NGS: QC and manipulation and then FASTQ Groomer
  3. FASTQ summary statistics
    • To understand the quality properties of the reads, one can run the FASTQ Summary Statistics tool from NGS: QC and manipulation.

  4. Quality boxplots
    • A graphical representation of these data facilitates quality assessment. The plot can be readily produced with the tool Graph/Display Data -> Boxplot.

  5. Trim low-quality tails off reads and map with Bowtie
    • Go to NGS: Mapping and choose Bowtie. Once Bowtie's interface is displayed in the center, select hg19 from Select a reference genome, then select Full parameter list, and set Trim n bases from low-quality (right) end of each read before alignment (-3) to 10 to trim the low-quality tails off the reads.

  6. Convert SAM to BAM. 
    • To import the alignments into IGV, one needs to convert the SAM format into its binary BAM representation using NGS: SAM Tools -> SAM-to-BAM.


NGS Data Visualization and Exploration Using IGV

Introduction

The Integrative Genomics Viewer (IGV) is an efficient visualization tool for interactive exploration of large genome datasets. It supports a wide variety of data types including NGS alignments, genomic annotations, expression data, genetic variations, etc. Alternative browsers can be found on these pages: NGS Alignment Viewers and NGS Viewers Reviewed.


Installation

IGV is a local Java application with web start capabilities. It can be downloaded from this page: http://www.broadinstitute.org/igv/download. The first option, with the smallest memory footprint (750MB), is sufficient for most applications.

User Manuals

IGV Exercises

Import of Custom Genomes and Annotations

    • Save the following genome sequence and GFF annotation files to a directory called 'myigv'
    • Import the new genome into IGV
      • File -> Import Genome 
      • Follow instructions and name genome: Arab_test
    • Explore IGV's genome viewing and panel functions. For instance: 
      • Click chromosome ID and zoom in and out.
      • Click icons on top: home (whole genome view), refresh screen and define region of interest.
      • Expand Genome track to view gene regions (triangle in top left corner) then right click on a feature and copy its annotation or sequence to clipboard.
      • Paste gene ID AT1G34575 into search field and hit Go.
    • Save current session in myigv directory, close IGV and then restart this session.
      • File -> Save Session

Viewing Expression Data

    • Prebuilt genomes

Viewing NGS Data

    • Import NGS data from human
      • File -> Load from File
        • NA12878.SLX.chr1_sample.bam (requires *.bam file from this site)

    • Import custom NGS aligned read data generated in above Galaxy exercise
      • Select under prebuilt genomes A. thaliana (TAIR9)
      • Import aligned reads from Galaxy exercise
        • File -> Load from URL
          • http://biocluster.ucr.edu/~rsun/workshop/APIUIND.sorted.region.bam
          • http://biocluster.ucr.edu/~rsun/workshop/APIIND.sorted.region.bam
          • http://biocluster.ucr.edu/~rsun/workshop/APIUIND.sorted.region.bam.bai
          • http://biocluster.ucr.edu/~rsun/workshop/APIIND.sorted.region.bam.bai
        • Go to region
          • Chr1:15166146-15242215

    • Create various custom tracks from Excel (or R, Perl, Python, etc.), and import them into IGV. The Recommended File Formats page contains detailed information about the formats for handling different data types, such as mutation tracks, ChIP-Seq and RNA-Seq data.
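
As a minimal sketch of the last step, the following Python example writes a custom annotation track in BED format that can then be loaded with File -> Load from File. The feature list is made up for illustration, and the function name write_bed is hypothetical. The sketch assumes that the input coordinates are 1-based and inclusive, so the start is shifted by one because BED uses 0-based start positions. The chromosome names must match those of the genome loaded in IGV.

```python
import io

def write_bed(features, handle):
    """Write (chrom, start, end, name, score, strand) tuples as BED lines.

    Input coordinates are assumed to be 1-based and inclusive; BED start
    positions are 0-based, so 1 is subtracted from each start.
    """
    for chrom, start, end, name, score, strand in features:
        fields = [chrom, start - 1, end, name, score, strand]
        handle.write("\t".join(str(f) for f in fields) + "\n")

# Made-up features inside the region viewed in the exercise above
features = [
    ("Chr1", 15170001, 15171000, "peak_1", 500, "+"),
    ("Chr1", 15200001, 15200500, "peak_2", 250, "-"),
]

buffer = io.StringIO()  # replace with open("my_track.bed", "w") to save a file
write_bed(features, buffer)
print(buffer.getvalue(), end="")
```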