EMBOSS Manual

Index

  1. INTRODUCTION
  2. EMBOSS HELP & LINKS
  3. ADMINISTRATION
  4. BASIC SYNTAX
  5. LEARNING THE BASIC SYNTAX WITH SEQRET
  6. EMBOSS PROGRAMS
    1. Sequence retrieval and feature parsing
    2. Sequence searching
    3. Pairwise alignments and comparisons
    4. Multiple alignments
    5. Repeat finding
    6. Molecular biology
    7. DNA predictions
    8. Protein predictions
    9. Proteomics
    10. Structural bioinformatics
  7. EXERCISES

  1. INTRODUCTION
    EMBOSS is a free and comprehensive sequence analysis package. It contains over 150 command-line tools for analyzing DNA and protein sequences, covering pattern searching, phylogenetic analysis, data management, feature prediction, proteomics and more.

    Why EMBOSS?
    • Free
    • Always available
    • Command-line based
    • Wide variety of programs
    • Easy reformatting and parsing
    • Remote database access

    The basic UNIX commands for running EMBOSS applications can be found in this LINUX ESSENTIALS manual.

    Several web interfaces are available for EMBOSS: JEMBOSS, Pise, wEMBOSS, EMBOSS-Explorer, etc. Example implementations are available at: EBI, NGFN, MRC and UMDNJ.

    The Kaptain GUI for EMBOSS can be started with these commands:
    • $ embosslauncher.kaptn # opens window to select EMBOSS programs
    • $ emboss_progr.kaptn # opens specific EMBOSS program directly

    EMBOSS vs. GCG
    • The main advantage of GCG is its graphical sequence editor, available through the SeqLab interface. EMBOSS has no comparable feature.
    • Many EMBOSS programs are more up-to-date than their GCG equivalents.
    • A table to look up equivalent programs between the two packages is available at Helix Systems (NIH).

  2. EMBOSS HELP & LINKS
    General

    Administration

    Sequence formats

    Uniform Sequence Addresses (USA)

    Applications
    • EMBOSS Programs
    • Command line search by keyword:
      $ wossname my_keyword
    • Finding related programs:
      $ seealso # finds programs sharing group names
    • Help on a program:
      $ program_name -help # provides some help on the options of a program
      $ program_name -opt # starts program in interactive mode to prompt for common options
      $ tfm program_name # prints full help document on a program
      $ embossversion # writes the current EMBOSS version number

  3. ADMINISTRATION
    Searching remote databases with EMBOSS (database docs, admin docs)
    • The configurations for searching remote databases need to be specified in the system-wide file '/usr/local/share/EMBOSS/emboss.default' or in the user-specific file '/home/user/.embossrc'.
    • Users can save this sample file to their home account under the name '.embossrc'. Most of the common databases are already specified in this file and new ones can be appended.
    • Valid databases in the local installation can be listed with the command 'showdb'.

    Searching local databases (database docs, admin docs)
    • Sequence flat files can be searched directly (e.g. with seqret) without indexing. This can be slow with very large databases.
    • Searching indexed databases is much faster. The indexing is described in the admin docs.

  4. BASIC SYNTAX
    Syntax conventions for this manual
      "$" start of a command
      "#" end of command and start of a comment
      $ command # The text between "$" and "#" is the actual command; the "$" and "#" signs themselves are not part of it.
      "<...>" or "my_..." refers to a file name that you supply; do not type the angle brackets.

    Basic EMBOSS syntax for Uniform Sequence Addresses (USA)
      <format::file:entry>, <file:entry>, <database:entry>, <file/database>, @list_file
      ":" separates the file/database name from the entry (ID#). If the entry is omitted, the entire file/database is addressed.
      "::" separates the format from the file. A format specification is usually not required, since EMBOSS recognizes the format automatically.
      "stdout" is a special USA that prints the output of any program to 'standard output' (the screen). "stdin" is used for 'piping' the results from a previous program into the current program.
      "-rformat my_format" is used by many EMBOSS programs to change list outputs into various formats. The available format names are: embl, genbank, gff, pir, swiss, trace, listfile, dbmotif, diffseq, excel, feattable, motif, regions, seqtable, simple, srs, table, tagseq
      "-auto" turns off prompts and suppresses the one-line program description.

    Short command-line syntax for running programs (preferred form for this manual)
      $ <application> <my_dataset:my_entry> <format::my_output>
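
The USA forms described above (format::file:entry, file:entry, database:entry) can be pulled apart with plain shell parameter expansion. The following is a minimal sketch for illustration only: the function name parse_usa is hypothetical, it is not part of EMBOSS, and it does not handle '@list_file' or file paths that themselves contain colons.

```shell
#!/bin/sh
# parse_usa (hypothetical helper, not an EMBOSS tool): split a USA of the form
# [format::]name[:entry] into its three parts using POSIX parameter expansion.
parse_usa() {
    usa=$1
    format=""
    entry=""
    case $usa in
        *::*)
            format=${usa%%::*}   # text before the first "::"
            usa=${usa#*::}       # remainder after the first "::"
            ;;
    esac
    case $usa in
        *:*)
            entry=${usa#*:}      # text after the first ":"
            usa=${usa%%:*}       # file or database name
            ;;
    esac
    printf 'format=%s name=%s entry=%s\n' "$format" "$usa" "$entry"
}

parse_usa 'fasta::my_input.fasta:x52524'
parse_usa 'embl:x52524'
parse_usa 'my_input.fasta'
```

An empty 'format' or 'entry' field in the output corresponds to the cases where EMBOSS detects the format itself or addresses the whole file/database.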

    Long command-line syntax for running programs
      $ seqret -sequence embl:x52524 -osformat swissprot -sbegin 1 -send 100 -feature -srev stdout # To write to a file, replace 'stdout' with '-outseq my_output'. The arguments '-sbegin' and '-send' excise the specified region, and '-srev' creates the reverse complement of a DNA sequence.

    Visit the EMBOSS tutorial page for more general information.

    Wildcards
      Use "?" for any single character and "*" for any string of characters.
      Expressions containing wildcards need to be placed in quotes, e.g.: 'uniprot:p0172*'.
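
The quoting rule follows from how the shell works: an unquoted "*" or "?" is expanded by the shell against file names in the current directory before the EMBOSS program ever sees it. The sketch below demonstrates this with 'echo' in a scratch directory. No EMBOSS program is run, and the file name 'uniprot:p0172_demo' is made up purely to trigger a match.

```shell
#!/bin/sh
# Demonstrates shell globbing only; no EMBOSS program is run.
# The file name below is invented just so that the pattern matches something.
demo_dir=$(mktemp -d)
: > "$demo_dir/uniprot:p0172_demo"

# Unquoted: the shell replaces the pattern with the matching file name.
unquoted=$(cd "$demo_dir" && echo uniprot:p0172*)
# Quoted: the pattern is passed through literally, as EMBOSS expects.
quoted=$(cd "$demo_dir" && echo 'uniprot:p0172*')

echo "unquoted: $unquoted"
echo "quoted:   $quoted"

rm -rf "$demo_dir"
```

When no file happens to match, most shells pass the unquoted pattern through unchanged, which is why a missing quote can go unnoticed until a matching file appears.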

  5. LEARNING THE BASIC SYNTAX WITH SEQRET
    SEQRET is an extremely versatile application for sequence retrieval from databases, feature parsing, and reformatting of sequence and alignment formats.

    Reformatting sequences and alignments
      $ seqret fasta::my_input.fasta gcg::my_output.gcg # example for reformatting a fasta file into GCG format using the seqret program
      $ seqret clustal::my_align.aln phylip::my_align.phylip # reformat ClustalW alignment into PHYLIP format

    Sequence retrieval from files and databases
      $ seqret embl:x52524 embl::test.embl # retrieves sequence 'x52524' from embl database via the web and writes it to file 'test.embl' in embl format.
      $ seqret embl:x52524 embl::test.embl -feature # same as above, but includes features.
      $ seqret 'uniprot:p0172*' swiss::stdout # retrieves all sequences starting with p0172* from uniprot database and prints them to STDOUT.
      $ seqret 'uniprot:p0172*' swiss::test -ossingle # same as above, but writes all sequences into individual files using their IDs for naming.
      $ seqret @my_list gcg::my_output.gcg # reads in a list file for sequence retrieval. The list file has one ID reference per line in the format /path/to/my_fasta_file:ID. List files are called in EMBOSS with '@my_list'.
      $ seqret @csl_list swiss::stdout # reads in a list file for remote sequence retrieval from UniProt. The list file has one ID reference per line in the format uniprot:ID.
      $ seqret embl:x52524 fasta::stdout -sbegin 1 -send 50 # prints only first 50 nucleotides of x52524 in fasta format.
      $ seqret embl:x52524 fasta::stdout -sbegin -50 -send -1 -srev # prints only last 50 nucleotides and generates 'reverse and complement'.
      $ entret embl:x52524 stdout # seqret does not retrieve the entire annotation for a sequence. To retrieve the entire entry, use entret.
      $ infoseq embl:x52524 # infoseq is a small utility to list the sequences' USA, name, accession number, type (nucleic or protein), length, percentage G+C (for nucleic), and/or description.
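
The list files used with '@my_list' above are plain text with one 'source:ID' reference per line, so they are easy to generate. A minimal sketch, assuming a hypothetical helper named make_list and placeholder IDs (my_id_1, my_id_2) that you would replace with real entries:

```shell
#!/bin/sh
# make_list (hypothetical helper): print one USA per line in the form source:ID,
# which is the list-file layout described above.
# $1 = file path or database name; remaining arguments = entry IDs.
make_list() {
    source_name=$1
    shift
    for id in "$@"; do
        printf '%s:%s\n' "$source_name" "$id"
    done
}

# Example with placeholder IDs: write a list file for later use as @my_list.
make_list uniprot my_id_1 my_id_2 > my_list
# seqret @my_list swiss::stdout    # run this where EMBOSS is installed
```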

  6. EMBOSS PROGRAMS
    1. Sequence retrieval and feature parsing
      Sequence retrieval
      seqret
        Performs sequence retrieval from databases, feature parsing, sequence and alignment reformatting.
        $ seqret embl:x52524 fasta::stdout -sbegin 1 -send 50

      seqretsplit
        Reads sequences and writes them to individual files.
        $ seqretsplit 'uniprot:p0172*'

      infoseq
        Utility to list the sequences' USA, name, accession number, type (nucleic or protein), length, percentage G+C (for nucleic), and/or description.
        $ infoseq embl:x52524

      entret
        Retrieves full annotation record of a sequence.
        $ entret embl:x52524 stdout

      textsearch
        Search description line of sequence entry (very slow!).
        $ textsearch 'swissprot:TF2B*_HALSA' 'TFIIB' stdout

      Feature parsing and display
      coderet
        Extracts features like CDS, mRNA and translations specified in feature table of database entry
        $ coderet embl:X03487 stdout
        $ coderet 'embl:hsfa*' -nocds -nomrna swiss::stdout # retrieves only translated protein sequences from several entries.

      extractfeat
        Extracts each individual feature in table of database entry
        $ extractfeat embl:hsfau1 -type exon stdout

      extractseq
        Reads in a sequence with a set of map positions and writes out the specified regions of that sequence
        $ extractseq embl:hsfau1 -reg "782..856,951..1095,1557..1612,1787..1912" stdout -separate

      showfeat
        Displays features of a sequence.
        $ showfeat embl:hsfau1 stdout

    2. Sequence searching
      Motifs
      patmatmotifs
        Searches a protein sequence against the PROSITE motif database (setup: admin has to download the PROSITE database and configure it with prosextract)
        $ patmatmotifs -full uniprot:opsd_human stdout

      pscan
        Finds conserved motifs or fingerprints in proteins using the PRINTS database (setup: admin has to download the PRINTS database, '*.dat', and configure it with printsextract)
        $ pscan uniprot:opsd_human stdout

      tfscan
        Scans DNA sequences for transcription factor binding sites using the TRANSFAC database. Currently, we have only the very old public version of this database. (setup: admin has to download the TRANSFAC database and configure it with tfextract)
        $ tfscan embl:paamir stdout

      patmatdb
        Takes a motif in PROSITE format and compares it to set of search sequences
        $ patmatdb uniprot:Q6UDF0 'qxxrw' stdout
        $ patmatdb 'my_proteome:*' 'qxxrw' stdout | grep '# HitCount: 1' | wc # counts all sequences with a single occurrence of motif 'qxxrw'
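
One caveat with the grep in the pipeline above: the pattern '# HitCount: 1' also matches lines such as '# HitCount: 10'. The sketch below anchors the match to the end of the line and uses 'grep -c' to count directly. It assumes that each sequence report contains a line of the form '# HitCount: N'; the helper name count_single_hits is hypothetical.

```shell
#!/bin/sh
# count_single_hits (hypothetical helper): read a patmatdb-style report on stdin
# and count the lines that report exactly one hit.
# Assumption: each sequence's report contains a line of the form "# HitCount: N".
count_single_hits() {
    # The "$" anchor stops "1" from also matching 10, 11, 12, ...;
    # trailing whitespace after the number is tolerated.
    grep -c '^# HitCount: 1[[:space:]]*$'
}

# Usage where EMBOSS is installed:
# patmatdb 'my_proteome:*' 'qxxrw' stdout | count_single_hits
```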

      preg
        Regular expression search in protein sequences
        $ preg 'swissprot:*_rat' 'IA[QWF]A' stdout

      dreg
        Regular expression search in nucleotide sequences
        $ dreg embl:paamir 'ggtacc' stdout

      fuzzpro
        Protein pattern search using PROSITE style patterns. Number of mismatches can be specified.
        $ fuzzpro 'swissprot:*_rat' '[FY]-[LIV]-G-[DE]-E-A-Q-x-[RKQ](2)-G'

      fuzznuc
        Nucleotide sequence pattern search using PROSITE style patterns. Number of mismatches can be specified.
        $ fuzznuc embl:hhtetra

      fuzztran
        Protein pattern search in DNA after translation using PROSITE style patterns. Number of mismatches can be specified.
        $ fuzztran embl:rnops

    3. Pairwise alignments and comparisons
      Dot plots
      dottup
        Creates dot plot of two sequences
        $ dottup embl:xlrhodop embl:xl23808 -graph cps

      dotpath
        Displays a non-overlapping wordmatch dotplot of two sequences
        $ dotpath ...

      Global alignment
      needle
        Creates optimum full-length alignment between two sequences using Needleman-Wunsch algorithm
        $ needle embl:xlrhodop embl:xl23808 stdout

        needle can also be run in all-against-all mode:
        $ for i in *.fasta; do for j in *.fasta; do needle $i $j stdout -gapopen 10.0 -gapextend 0.5 >> my_needle_file; done; done;
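
The nested loop above aligns every file against every other file in both orders, and also against itself. If one alignment per unordered pair is enough (an assumption that depends on your analysis), the number of runs can be roughly halved. The sketch below only prints the needle commands as a dry run. The helper name list_pairs is hypothetical, and file names are assumed to contain no spaces.

```shell
#!/bin/sh
# list_pairs (hypothetical helper): print each unordered pair of arguments once,
# skipping self-comparisons, as "first second" on one line.
list_pairs() {
    while [ "$#" -gt 1 ]; do
        first=$1
        shift
        for second in "$@"; do
            printf '%s %s\n' "$first" "$second"
        done
    done
}

# Dry run: print the needle command for every pair instead of executing it.
# Remove the leading "echo" to run the alignments where EMBOSS is installed.
list_pairs *.fasta | while read -r a b; do
    echo needle "$a" "$b" stdout -gapopen 10.0 -gapextend 0.5
done
```

For n input files this produces n*(n-1)/2 commands instead of the n*n runs of the nested loop.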

      Local alignment
      water
        Calculates a rigorous local alignment between two sequences using the Smith-Waterman algorithm
        $ water embl:xlrhodop embl:xl23808 stdout

      matcher
        Like water, but uses less memory.
        $ matcher embl:xlrhodop embl:xl23808 stdout

      supermatcher
        Finds a match of a large sequence against one or more sequences.
        $ supermatcher 'embl:ec*' embl:eclac -word 50

      Common words
      wordmatch
        Finds all exact matches of a given size between 2 sequences
        $ wordmatch

    4. Multiple alignments
      ClustalW interface
      emma
        Multiple alignment program - interface to ClustalW program
        $ emma my_fasta my_alignment my_tree
        $ emma 'swissprot:TF2B*_HALSA' phylip::tfb.aln tfb.dnd; cp tfb.dnd intree; phylip retree # retrieves Halobacterium TFIIB proteins from UniProt, creates multiple alignment with emma, reformats alignment into PHYLIP format and displays tree file from emma with 'phylip retree' (not part of EMBOSS)

      Find identical regions (words)
      seqmatchall
        Reads a set of sequences and finds identical regions shared by at least two sequences.
        $ seqmatchall

      Align EST and genomic DNA sequences
      est2genome
        Aligns EST and genomic DNA sequences.
        $ est2genome ...

      Aligning distantly related DNA sequences
      tranalign
        Creates multiple alignment of CDSs guided by multiple protein alignment
        $ tranalign my_DNA_fasta my_pep_alignment my_result

      Alignment information
      infoalign
        Lists simple properties of sequences in alignments (percent identity compared to reference sequence).
        $ infoalign my_alignment stdout

      polydot
        Displays all-against-all dotplots of a set of sequences
        $ polydot ...

      Analyze alignments in text mode
      showalign
        Alignment mining tool to identify differences and common features between sequences
        $ showalign my_alignment stdout

      Box shading
      prettyplot
        Displays multiple alignment in colourful box-shading mode
        $ prettyplot my.aln -goutfile my.ps -graph cps

    5. Repeat finding
      Inverted repeats
      einverted
        Finds inverted repeats in nucleotide sequences
        $ einverted ...

      palindrome
        Looks for inverted repeats in a nucleotide sequence
        $ palindrome ...

      Tandem repeats
      equicktandem
        Finds tandem repeats
        $ equicktandem ...

      etandem
        Looks for tandem repeats in a nucleotide sequence
        $ etandem ...

    6. Molecular biology
      Trace files
      abiview
        Simple tool to read ABI files and display their traces.
        $ abiview

      Restriction sites and vector maps
      restrict
        Finds restriction enzyme cleavage sites (setup: admin has to download the REBASE database, 'withrefm.40x.Z' & 'proto.40x.Z' and configure it with rebaseextract)
        $ restrict embl:hsfau stdout

      remap
        Display a sequence with restriction cut sites, translation, etc. (setup: admin has to download the REBASE database 'withrefm.40x.Z' & 'proto.40x.Z' and configure it with rebaseextract)
        $ remap embl:hsfau stdout

      restover
        Finds restriction enzymes that produce a specific overhang.
        $ restover embl:hsfau stdout

      redata
        Searches REBASE for enzyme name, references, suppliers, etc.
        $ redata -enzyme BamHI stdout

      recoder
        Removes specified restriction sites, but maintains the same translation.
        $ recoder embl:hsfau -enzyme EcoRII stdout

      cirdna
        Draws circular maps of DNA constructs
        $ cirdna -graph cps

      lindna
        Draws linear maps of DNA constructs
        $ lindna -graph cps

      Mutation
      silent
        Performs scan for silent mutation in restriction enzyme sites.
        $ silent

      Translation
      transeq
        Translates nucleic acid sequences.
        $ transeq embl:paamir stdout

      sixpack
        Translates nucleic acid sequences into all six open reading frames.
        $ sixpack embl:paamir stdout

      showorf
        Displays a nucleic acid sequence with its protein translation in a style suitable for publication.
        $ showorf embl:paamir stdout

      plotorf
        Graphical representation of all 6 reading frames.
        $ plotorf embl:paamir -graph cps

      backtranseq
        Back translates a protein into DNA sequences.
        $ backtranseq swissprot:ach2_drome -cfile Edrosophila.cut stdout # '-cfile' uses drosophila sequence and codon table!

      siRNA design
      sirna
        Looks for siRNA duplexes in mRNA.
        $ sirna embl:hsfau stdout

      Primer design
      eprimer3
        Picks PCR primers and hybridization oligos.
        $ eprimer3 EMBL:HSFAU1 stdout -explain # If the '-explain' flag is used, then the statistics are reported describing the number of primers that were considered and rejected for various reasons.
        $ eprimer3 'EMBL:HSFA*' -numreturn 1 -productsizerange 500-700 stdout # The '-numreturn 1' restricts to one primer per sequence and '-productsizerange' defines the length of a PCR product.

      primersearch
        Reads primer pairs from input files and searches them against specified sequence(s).
        $ primersearch embl:z52466

      stssearch
        Searches a DNA database for matches with a set of STS (sequence tagged site) primers.
        $ stssearch ...

    7. DNA predictions
      DNA binding regions
      marscan
        Finds MAR/SAR sites in nucleic sequences.
        $ marscan embl:u01317 stdout

      DNA structure
      banana
        Creates bending and curvature plot in B-DNA.
        $ banana

      btwisted
        Calculates the twisting in a B-DNA sequence.
        $ btwisted

    8. Protein predictions
      Protein targeting
      sigcleave
        Reports positions of signal cleavage sites.
        $ sigcleave my_proteinspep sigcleave.out -send 35 -rformat excel # '-send 35' delimits to first 35 AAs; '-rformat excel' output in tab-delimited Excel format

      Antibody design
      antigenic
        Finds antigenic sites in proteins.
        $ antigenic swissprot:act1_fugru stdout

      Amino acid composition and hydrophobicity plots
      pepstats
        Outputs protein statistics report including MW, IP, AA bias, etc.
        $ pepstats 'uniprot:p0172*' -sbegin1 1 -send1 30 -stdout -auto # '-auto' turns off prompt behavior; '-sbegin1 1 -send1 30' limits the analysis to the first 30 positions
        $ for i in *.fasta; do pepstats -sequence $i -sbegin1 1 -send1 30 -stdout -auto >> 1_30_pepstats; done # To run pepstats on many sequences, one can use this 'shell loop' (input peptides need to be in separate files)

      pepwindow
        Creates classic Kyte & Doolittle hydropathy plot of protein.
        $ pepwindow uniprot:TF2B1_HALSA -graph cps # '-graph cps' creates a ps image that can be displayed with 'gv' or transformed/rotated with ImageMagick (or 'pstoimg') into other formats: 'convert my_image.ps -rotate -90 my_image.jpg'
        $ for i in *.pep; do pepwindow $i -graph cps; cp pepwindow.ps $i.ps; convert $i.ps -rotate -90 $i.jpg; done # shell loop for many input files

      pepwindowall
        Produces a set of superimposed Kyte & Doolittle hydropathy plots from an aligned set of protein sequences.
        $ pepwindowall 'uniprot:TF2B*_HALSA' -graph cps

      pepinfo
        Creates color plots of physico-chemical properties of proteins. A second image file contains the hydrophobicity plot (the images from pepwindow have a higher resolution).
        $ pepinfo uniprot:TF2B1_HALSA -graph cps -auto # '-graph cps' creates a ps image that can be displayed with 'gv' or transformed/rotated with ImageMagick (or 'pstoimg') into other formats: 'convert my_image.ps -rotate -90 my_image.png'
        $ for i in *.fasta; do pepinfo $i -graph cps -auto; cp pepinfo.ps $i.ps; convert $i.ps -rotate -90 $i.jpg; done # shell loop for many input files

      octanol
        Calculates and plots free energy difference between water/interface and water/octanol.
        $ octanol ...

      hmoment
        Plots or writes out the hydrophobic moment. The hydrophobic moment is the hydrophobicity of a peptide measured for a specified angle of rotation per residue.
        $ hmoment swissprot:hbb_human

      Transmembrane domains
      tmap
        Reads in one or more aligned protein sequences and predicts transmembrane segments.
        $ tmap swissprot:opsd_human -out tmap.res -graph cps

      Helix analysis
      pepnet
        Displays proteins as a helical net. Useful for identifying patterns of amphipathicity for more detailed analysis with pepwheel.
        $ pepnet ...

      pepwheel
        Displays proteins as a helical wheel for highlighting amphipathicity and other properties of residues around a helix.
        $ pepwheel swissprot:hbb_human -send 30

      Protein secondary structure prediction
      garnier
        Performs secondary structure predictions of protein sequences.
        $ garnier 'uniprot:p0172*' stdout

      helixturnhelix
        Finds helix-turn-helix nucleic acid binding motifs in proteins.
        $ helixturnhelix swissprot:laci_ecoli stdout

      pepcoil
        Predicts coiled coil regions in protein sequences.
        $ pepcoil swissprot:gcn4_yeast stdout

    9. Proteomics
      Protein identification by mass fingerprints
      emowse
        Protein identification by mass spectrometry. Emowse is EMBOSS' implementation of the MOWSE software.
        $ emowse ...

      mwfilter
        Filter noisy molwts from mass spec output.
        $ mwfilter ...

      mwcontam
        Shows molwts that match across a set of files.
        $ mwcontam ...

      Protein digest
      digest
        Finds the positions where a specified proteolytic enzyme or reagent might cut a peptide sequence.
        $ digest ...

    10. Structural bioinformatics
      PDB files
      pdbparse
        Parses PDB files and writes CCF files (clean coordinate files) for proteins.
        $ pdbparse ...

      SCOP files
      scopparse
        Reads raw SCOP classification files and writes a DCF file (domain classification file).
        $ scopparse ...

  7. EXERCISES
      1. Sequence retrieval from local database
        1. Download proteome of Halobacterium spec. from ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Halobacterium_sp/AE004437.faa (use wget or web browser for download)
        2. Create list file 'my_TFB_NCBI_list' with the following TFB (transcription initiation factor IIB) protein identifiers:
          AE004437.faa:AAG20319.1
          AE004437.faa:AAG19212.1
          AE004437.faa:AAG18850.1
          AE004437.faa:AAG19313.1
          AE004437.faa:AAG18894.1
        3. Retrieve the corresponding sequences from this local fasta 'database'
          $ seqret @my_TFB_NCBI_list TFB_NCBI.pep
        4. Reformat sequences into other formats, e.g. genbank, swissprot, etc.
          $ seqret fasta::TFB_NCBI.pep genbank::stdout
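
Before running seqret on a list file such as 'my_TFB_NCBI_list', it can help to confirm that every listed identifier actually occurs in the FASTA file. The following is a rough sketch, not an EMBOSS feature: the helper name check_list is hypothetical, and it assumes that each identifier appears verbatim somewhere in a header line (a line starting with ">").

```shell
#!/bin/sh
# check_list (hypothetical helper): for each "file:ID" line in a list file,
# report whether ID occurs in a FASTA header line (one starting with ">") of that file.
# Assumption: the identifier appears verbatim somewhere in the header.
check_list() {
    while IFS=: read -r fasta id; do
        [ -n "$id" ] || continue          # skip blank or malformed lines
        if grep '^>' "$fasta" | grep -qF -- "$id"; then
            echo "found   $id"
        else
            echo "missing $id"
        fi
    done < "$1"
}

# Usage: check_list my_TFB_NCBI_list
```

This is a plain substring check on header lines, so it only gives a quick sanity test; seqret itself decides how identifiers are matched.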

      2. Sequence retrieval and feature parsing from remote database
        1. Find TFB sequences with textsearch and retrieve them with seqret:
          $ textsearch 'swissprot:TF2B*_HALSA' 'TFIIB' stdout
          $ seqret 'swissprot:TF2B*_HALSA' TFB.pep # retrieves basic annotation information of records
          $ entret 'swissprot:TF2B*_HALSA' TFB.pep # retrieves full annotation information of records
        2. Parse protein domains (features) with extractfeat and view them with showfeat:
          $ extractfeat 'swissprot:TF2B*_HALSA' stdout
          $ showfeat 'swissprot:TF2B*_HALSA' stdout

      3. Multiple & pairwise alignments and distance trees
        1. Align the TFB sequences with emma and view the alignment with showalign
          $ emma 'swissprot:TF2B*_HALSA' TFB.aln TFB.dnd; showalign TFB.aln stdout -show=a
        2. Align only last 60 AA of TFB.pep with emma and view the alignment with showalign
          $ emma 'swissprot:TF2B*_HALSA' TFB.aln TFB.dnd -sbegin -60 -send -1; showalign TFB.aln stdout -show=a
        3. Reformat alignment into other formats, e.g. PHYLIP, MSF, CLUSTAL, etc.
          $ seqret fasta::TFB.aln phylip::stdout
        4. View created tree file TFB.dnd with 'phylip retree' (not part of EMBOSS)
          $ phylip retree TFB.dnd
        5. Do everything from download, alignment and tree viewing in one command
          $ emma 'swissprot:TF2B*_HALSA' phylip::TFB.phylip TFB.dnd; cp TFB.dnd intree; phylip retree
        6. Create multiple alignment of TFB CDSs guided by multiple protein alignment
          $ extractseq embl:AE005017 -reg "7723..8700" stdout -separate > TFB.cds; extractseq embl:AE004992 -reg "851..1810" stdout -separate >> TFB.cds; extractseq embl:AE005026 -reg "12106..13092" stdout -separate >> TFB.cds; extractseq embl:AE005164 -reg "8571..9524" stdout -separate >> TFB.cds; extractseq embl:AE004988 -reg "11458..12429" stdout -separate >> TFB.cds; extractseq embl:AE005105 -reg "12389..13252" stdout -separate >> TFB.cds; extractseq embl:AE005167 -reg "1255..2178" stdout -separate >> TFB.cds; seqret TFB.cds -srev TFBrev.cds
          $ emma 'swissprot:TF2B*_HALSA' TFB.mul TFB.dnd
          $ tranalign TFBrev.cds TFB.mul clustal::TBF_CDS.aln
        7. Create DNA alignment directly from nucleotide sequences (TFBrev.cds) and compare with alignment guided by protein alignment
          $ emma TFBrev.cds clustal::TFB.aln TFB2.dnd
        8. Generate pairwise full-length alignments of first sequence in file TFBrev.cds against all the others.
          $ needle TFBrev.cds:AE005017_7723_8700 TFBrev.cds -gapopen 10.0 -gapextend 0.5 stdout
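
The long chain of extractseq calls in the CDS alignment step of this exercise repeats the same command with different accession/region pairs, so it can also be written as a loop over a small table. The sketch below prints the commands as a dry run. The accessions and regions are copied from that step; the variable and function names (regions, extract_commands) are made up for illustration.

```shell
#!/bin/sh
# Accession/region pairs copied from the extractseq step of this exercise.
regions='AE005017 7723..8700
AE004992 851..1810
AE005026 12106..13092
AE005164 8571..9524
AE004988 11458..12429
AE005105 12389..13252
AE005167 1255..2178'

# extract_commands (hypothetical helper): print one extractseq command per pair.
# This is a dry run; pipe the output to "sh" to execute it where EMBOSS is installed.
extract_commands() {
    printf '%s\n' "$regions" | while read -r acc reg; do
        printf 'extractseq embl:%s -reg "%s" stdout -separate\n' "$acc" "$reg"
    done
}

extract_commands
# extract_commands | sh > TFB.cds      # collects all regions into one file
# seqret TFB.cds -srev TFBrev.cds      # then reverse-complement as in the exercise
```

Keeping the pairs in one table makes it easier to spot a mistyped coordinate than in a single long command line.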

      4. Primer design
        1. Design primers for all sequences in the alignment from the tranalign program
          $ eprimer3 clustal::TBF_CDS.aln -numreturn 1 -productsizerange 800-900 stdout

      5. Motif finding
        1. Search the PROSITE motif database with one of the TFB sequences and compare the result with the corresponding SwissProt entry (TF2B2_HALSA)
          $ patmatmotifs -full swissprot:TF2B2_HALSA stdout
          $ entret swissprot:TF2B2_HALSA stdout
        2. Search the PRINTS motif database with one of the TFB sequences and compare the result with the corresponding SwissProt entry (TF2B2_HALSA)
          $ pscan swissprot:TF2B2_HALSA stdout
        3. Search the other TFB sequences for the presence of one of the identified motifs
          $ patmatdb 'swissprot:TF2B*_HALSA' 'G-[KR]-x(3)-[STAGN]-x-[LIVMYA]-[GSTA](2)-[CSAV]-[LIVM]-[LIVMFY]-[LIVMA]-[GSA]-[STAC]' stdout
        4. Search the entire Halobacterium proteome (AE004437.faa) for the presence of this motif
          $ patmatdb AE004437.faa 'G-[KR]-x(3)-[STAGN]-x-[LIVMYA]-[GSTA](2)-[CSAV]-[LIVM]-[LIVMFY]-[LIVMA]-[GSA]-[STAC]' stdout
        5. Search the Halobacterium proteome (AE004437.faa) with the same motif allowing 1 or 2 mismatches
          $ fuzzpro AE004437.faa -pattern 'G-[KR]-x(3)-[STAGN]-x-[LIVMYA]-[GSTA](2)-[CSAV]-[LIVM]-[LIVMFY]-[LIVMA]-[GSA]-[STAC]' -mismatch 1 -outfile stdout
        6. Search the Halobacterium TFB sequences from swissprot with the same motif allowing 1 or 2 mismatches
          $ fuzzpro 'swissprot:TF2B*_HALSA' -pattern 'G-[KR]-x(3)-[STAGN]-x-[LIVMYA]-[GSTA](2)-[CSAV]-[LIVM]-[LIVMFY]-[LIVMA]-[GSA]-[STAC]' -mismatch 1 -outfile stdout | grep ' Sequence:'
        7. Highlight the identified motif in above alignment (TFB.aln) using the HTML format function of showalign. Afterwards you can view the resulting TFB.html file in your local web browser.
          $ showalign TFB.aln TFB.html -html -high '179-194 red'
        8. Create color shaded alignment with the mview Perl program, which is not part of EMBOSS. To get help on this tool, type 'mview -help'. The argument '-width 100' turns on alignment wrapping, here every 100 positions. HTML alignments can easily be imported into MS Word or other text editors.
          $ mview -in pearson -css on -html header -ruler on -coloring consensus -threshold 80 -consensus on -con_coloring identity TFB.mul > TFB2.html

      6. Sequence similarity searching with BLAST and domain searching with HMMER
        1. See exercises in Linux manual.