Workshop Exercises
Introduction to Bioinformatics - NM-AIST
Mon, July 23 - Wed, July 25, 2012
Instructor: Thomas Girke

[ Back to Schedule and Slides ]

Overview: This workshop provides an introductory overview of important bioinformatics data analysis concepts related to genome sequencing, database techniques, structural biology, comparative genomics, next generation sequencing, such as RNA-Seq profiling, and small molecule/drug discovery.

Day 1: Sequence Alignments and Searching

Sequence Downloads

    Go to, select "Protein" database in the "Search" box and run the following query:

  • P450 & hydroxylase & human [organism]

  • Select under "Limits" SwissProt as database.

  • Report final query syntax from "Search details".

  • Report the number of sequences matching the final query.

  • Save sequences in "FASTA format" to local file via "Send to" drop down menu.

What are P450 Sequences?

  • If you do not know what P450 sequences are, the corresponding PROSITE entry provides an excellent summary of their function.

  • The sequence consensus signature for the P450 family is given in the corresponding field.

  • The most important residue is the invariant cysteine in this consensus pattern, because it binds the heme group that is essential for the function of P450 enzymes.

BLAST Search

  • Open protein BLAST (BLASTP) search page.

  • Copy and paste the 3rd sequence from FASTA file (O15528) you just saved into query field.

  • Select the SwissProt database under the "Database" pull-down menu.

  • Execute the BLAST search using for the remaining parameters the default settings.

  • Wait until the BLAST result page opens.

Inspecting and Interpreting BLAST Results

  • Continue with BLAST result page from previous step.

  • Expand the Search Summary field and review the Search Parameter table.

  • Record the number of sequences in the SwissProt database.

  • Inspect the BLAST result list page and record the following values:

  • Explain why the first sequence in the hit list is identical with the query sequence (O15528).

  • How many sequences in the blast result have an E value of less than 1e-100.

  • Inspect the second hit (O35132) in the alignment section of the BLAST result and record the following values:

  • Does the alignment between your query sequence and O35132 show both sequences in full-length?

  • What is the E value for this alignment?

  • How many gaps are there?

Distance Tree of BLAST Results

  • Click the button "Distance tree of results"

  • Locate your query sequence in the tree (yellow)

  • Identify its closest (most similar) neighbor in the tree.

  • Where does the closest neighbor sequence appear in the BLAST result list and what is its E value?

  • Plot the tree as a radial tree by clicking on the corresponding tab on the top of the page.

Structure Viewing

  • Repeat previous BLASTP search with query sequence O15528 as follows.

  • Open protein BLAST (BLASTP) search page.

  • On next page click on red P450 domain (CDD, this runs RPS-BLAST), and then click red CypX domain.

  • View 3D structure in Cn3D, identify the heme binding cysteine in the structure and the multiple alignment.

Local Pairwise Alignment with BLAST2

  • Open the NCBI BLAST2 page

  • Copy and paste the following P450 sequences into separate sequence fields:

>O15528 Cytochrome P450 subfamily (Calcidiol 1-monooxygenase)









>P98187.1 Cytochrome P450 4F8 (CYPIVF8)









  • Select "BLASTP" tab as alignment program on the top of the page

  • Compute the alignment.

  • Which features in the alignment indicate that it is a local and not a global alignment?

  • Click "Edit and Resubmit" button, expand "Algorithm parameters". Record the E values for different scoring matrices at the "Matrix" in "Scoring Parameters":

  • BLOSUM45

  • BLOSUM62

  • BLOSUM80

  • Why do the E values increase with BLOSOM matrices of higher values?

Global Pairwise Alignment with Needleman-Wunsch Method

  • Open the Needleman-Wunsch Alignment tool from EMBOSS

  • Copy and paste the P450 sequences from previous section into separate sequence fields.

  • Compute the alignment.

  • Which features in the alignment indicate that it is a global and not a local alignment?

Multiple Alignment

  • Open the web page for the multiple alignment program MultAlin

  • Click on the Browse button and upload the sequences P450 sequences that were saved to a file in the first step.

  • Compute the multiple alignment by clicking the Start button.

  • Identify the invariant cysteine residue at the end of the multiple alignment.

  • Locate this invariant cysteine in the Needleman-Wunsch and BLAST2 alignments.

  • Explain why multiple alignment are more appropriate for identifying consensus sequences than pairwise alignments?

Day 2: Phylogenetics, Gene Annotations & Introduction to R

Download Cytochrome P450 Sequences from NCBI

Go to, in protein database, run the following query:
  • P450 & hydroxylase & human [organism]

  • Select under Limits SwissProt as database.

  • Save the 24 sequences in FASTA format to your desktop via the "Send to" drop down menu.

Phylogenetic Analysis Pipeline

Open the  web page. The Documentation of this phylogenetic analysis page provides a very colorful overview of its rich utilities.

To start the analysis, select "A La Carte" in the Phylogeny Analysis section. Under the workflow settings select ClustalW as multiple alignment program and "ProtDist/FasstDist + Neighbor". The latter option defines the Neighbor-Joining algorithm as the tree building method. The default settings should be used for the remaining options. Start the analysis by clicking the "Create workflow" button. 

Upload your sequences on the next page by selecting the "Browse" function. Clicking the "Submit" button will run the entire phylogenetic analysis pipeline from the start to the end. This includes the following steps:

  • Computation of a multiple alignment.

  • An optional alignment curation step that allows the user to edit the multiple alignment (e.g. removal of long unaligned sections).

  • Computation of a distance matrix.

  • Calculation of a neighbor-joining tree based on the rate corrected distance matrix.

  • Upload of the tree to an interactive tree viewing page. 

It is also possible to run the Workflow in a step-by-set mode by selecting on its setup page "Run Workflow step by step".

After the pipeline run is finished, the following exercises will evaluate the intermediate data sets individually that were generated by each analysis step. 

Multiple Alignment

After the pipeline is finished click the "3. Alignment" tab on the top of the final result page. This alignment page allows to view the alignment with background shading in a separate window. In addition, one can download the alignment in different formats for imports into other programs.

  • Save the alignment in the Phylip format to your desktop and inspect it in a text editor. The phylip format is a widely used alignment format that is used by other software tools like the Phylip program.

Distance Matrix

Next, click the "5. Phylogeny" tab on the top of the site. The window in the middle of the resulting page shows the tree in pure text format. Below the tree there are links to download the distance matrix and the corresponding raw tree in the Newick format.

  • Download the distance matrix to your desktop and inspect it in a text editor. This distance matrix is useful to continue the analysis with other phylogenetic software tools like Phylip.

Phylogenetic Tree: Rooting and Editing

Click the "6. Tree Rendering" tab on the top of the site. This page allows the user to root the tree, to edit the tree labels, to color its branches and to save the tree in different formats.

  • Root the tree with the midpoint method by click the corresponding button on the page. Describe in one sentence how the rooting steps changes the tree. 

  • Change the text of some of the leaf labels of the tree by clicking on the "Change leaf name" button and then clicking on the labels. For instance, remove from some leaf labels the description part until only their ID numbers remain.

  • Download the tree in Newick format to the desktop and inspect it in a text editor. Newick is one of the most common tree formats that can be imported into many other programs including the TreeDyn program that is used for the pipeline.

  • Colorize the selected tree branches with the "Colorize" function. Finally save the tree image in PNG format (similar to JPEG format). The available SVG format allows to import the tree into vector graphics programs like InkScape for further post-processing.

Pipeline Report

A very nice feature of this page is that it provides an overview report for each analysis pipeline. To open the analysis report for this analysis click the "1. Overview" tab on top of the site.

  • Save the pipeline page in HTML format to your desktop. This way one can track all steps and program settings of a phylogenetic analysis.

Table of Contents

Introduction to R and Bioconductor


[ Slides ] [ R Code ]

  • Load R Code into RStudio. Then follow instructions. 
  • To install Bioconductor libraries, please follow these instructions.

Table of Contents

Handling Genome Data in R and Bioconductor


Slides ]   [ R Code ]

  • Load R Code into RStudio. Then follow instructions. 
  • To install the libraries required for this exercise, please run the following two commands from within R:
biocLite(c("Biostrings", "seqLogo", "GenomicRanges", "rtracklayer", "Rsamtools"))

Table of Contents

Day 3: RNA-Seq Expression Profiling & Small Molecule Informatics 

Analysis of RNA-Seq Experiments


Slides ]   [ R Code ]   Data ]

  • Load R Code into RStudio. Then follow instructions.
  • To install the libraries required for this exercise, please run the following two commands from within R:
biocLite(c("GenomicRanges", "rtracklayer", "Rsamtools", "DESeq", "edgeR", "GOstats", "GO.db", "ath1121501.db"))

Table of Contents

Viewing NGS Data in IGV Genome Browser

The Integrative Genomics Viewer (IGV) is an efficient visualization tool for interactive exploration of large genome datasets. It supports a wide variety of data types including NGS alignments, genomic annotations, expression data, genetic variations, etc. 

View RNA-Seq results in IGV

    • Download IGV
    • Select in menu in top left corner A. thaliana (TAIR10)
    • Upload the following indexed/sorted Bam files with 
      • File -> Load from URL 
            • To view area of interest, enter its coordinates Chr1:16656-16956 in position menu on top.

            • A more detailed introduction to IGV can be found here.

          Searching and Analyzing Small Molecules

          PubChem and DrugBank searches

          • Perform a keyword search with the query term 'aminobenzylpenicillin' (Ampicillin) on PubChem.
          • Inspect the PubChem annotation page for the first query hit.
          • Identify the corresponding entry in DrugBank.
          • Retrieve the bioassay data available for Ampicillin on PubChem.
          • Retrieve the SMILES string for Ampicillin and perform a structure similarity search on PubChem's Structure Search site.
          • Return to the first keyword result page and download the SD file for the 10 hits.

          Custom Analysis with Chemmine Tools and ChemmineR
          • Upload the SD file retrieved from PubChem to Chemmine Tools and perform hierarchical clustering.
          • Analyze the same SD file with ChemmineR by loading this R Code into RStudio. Then follow the instructions there.

          First four Ampicillin hits plotted in ChemmineR

          Table of Contents