Index
- INTRODUCTION
- EMBOSS HELP & LINKS
- ADMINISTRATION
- BASIC SYNTAX
- LEARNING THE BASIC SYNTAX WITH SEQRET
- EMBOSS PROGRAMS
- Sequence retrieval and feature parsing
- Sequence searching
- Pairwise alignments and comparisons
- Multiple alignments
- Repeat finding
- Molecular biology
- DNA predictions
- Protein predictions
- Proteomics
- Structural bioinformatics
- EXERCISES
- INTRODUCTION
- EMBOSS is a free, comprehensive sequence analysis package. It contains over 150 command-line tools for analyzing DNA and protein sequences, covering pattern searching, phylogenetic analysis, data management, feature prediction, proteomics and more.
- Why EMBOSS?
- Free
- Always available
- Command-line based
- Wide variety of programs
- Easy reformatting and parsing
- Remote database access
- The basic UNIX commands for running EMBOSS applications can be found in this LINUX ESSENTIALS manual.
- Several web interfaces are available for EMBOSS: JEMBOSS, Pise, wEMBOSS, EMBOSS-Explorer, etc. Example implementations are available at: EBI, NGFN, MRC and UMDNJ.
- The Kaptain GUI for EMBOSS can be started with the following commands:
- $ embosslauncher.kaptn # opens window to select EMBOSS programs
- $ emboss_progr.kaptn # opens specific EMBOSS program directly
- EMBOSS vs. GCG
- The main advantage of GCG is its graphical sequence editor, available through the SeqLab interface. EMBOSS has no comparable feature.
- Many EMBOSS programs are more up-to-date than their GCG equivalents.
- A table to look up equivalent programs between the two packages is available at Helix Systems (NIH).
- EMBOSS HELP & LINKS
- General
- Administration
- Sequence formats
- Uniform Sequence Addresses (USA)
- Applications
- EMBOSS Programs
- Command line search by keyword:
$ wossname my_keyword
- Finding related programs:
$ seealso program_name # finds programs that share group names with the given program
- Help on program:
$ program_name -help # provides some help on the options of a program
$ program_name -opt # starts program in interactive mode to prompt for common options
$ tfm program_name # prints full help document on a program
$ embossversion # writes the current EMBOSS version number
- ADMINISTRATION
- Searching remote databases with EMBOSS (database docs, admin docs)
- The configurations for searching remote databases need to be specified in the system-wide file '/usr/local/share/EMBOSS/emboss.default' or in the user-specific file '/home/user/.embossrc'.
- Users can save this sample file to their home account under the name '.embossrc'. Most of the common databases are already specified in this file and new ones can be appended.
- The databases available in the local installation can be listed with the command 'showdb'.
- Searching local databases (database docs, admin docs)
- Sequence flat files can be searched directly (e.g. with seqret) without indexing. This can be slow with very large databases.
- Searching indexed databases is much faster. The indexing is described in the admin docs.
- BASIC SYNTAX
- Syntax conventions for this manual
"$" start of a command
"#" end of command and start of a comment
$ command # The text in bold font represents the actual command. The "$" and "#" signs are not part of it.
"<...>" or "my_..." refers to a file name you supply; do not type the angle brackets!
- Basic EMBOSS syntax for Uniform Sequence Addresses (USA)
<format::file:entry>, <file:entry>, <database:entry>, <file/database>, @list_file
":" separates file/database name from entry (ID#). If entry is omitted then entire file/database will be addressed.
"::" separates format from file. Format specification is usually not required, since EMBOSS recognizes it automatically.
"stdout" is a special USA to print output of any program to 'standard output' (screen).
"stdin" is used for 'piping' the results from a previous program into the current program
"-rformat my_format" is used by many EMBOSS programs to change list outputs into various formats. The available format names are: embl, genbank, gff, pir, swiss, trace, listfile, dbmotif, diffseq, excel, feattable, motif, regions, seqtable, simple, srs, table, tagseq
"-auto" turns off all prompts and suppresses the one-line program description.
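The USA forms listed above can be taken apart with plain shell parameter expansion. The sketch below only illustrates the syntax; it is not how EMBOSS parses addresses internally. The function name 'parse_usa' is hypothetical, and the sketch assumes no list files ('@...') and no ':' inside file paths.

```shell
#!/bin/sh
# parse_usa: hypothetical helper that splits a USA of the form
# [format::]name[:entry] into its three parts (illustration only).
parse_usa() {
    usa=$1
    format=""
    entry=""
    # "::" separates an optional format from the file/database name.
    case $usa in
        *::*) format=${usa%%::*}; usa=${usa#*::} ;;
    esac
    # ":" separates the file/database name from an optional entry ID.
    case $usa in
        *:*) entry=${usa#*:}; usa=${usa%%:*} ;;
    esac
    printf 'format=%s name=%s entry=%s\n' "$format" "$usa" "$entry"
}

parse_usa 'fasta::my_input.fasta:ID1'   # format, file and entry
parse_usa 'embl:x52524'                 # database and entry only
parse_usa 'my_input.fasta'              # whole file, format auto-detected
```

The first call prints 'format=fasta name=my_input.fasta entry=ID1'. An empty 'format' or 'entry' field corresponds to the cases where EMBOSS detects the format itself or addresses the entire file/database.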
- Short command-line syntax for running programs (preferred form for this manual)
$ <application> <my_dataset:my_entry> <format::my_output>
- Long command-line syntax for running programs
$ seqret -sequence embl:x52524 -osformat swissprot -sbegin 1 -send 100 -feature -srev stdout # To write to a file, replace 'stdout' with -outseq my_output. The arguments '-sbegin' and '-send' excise the specified region and '-srev' creates the reverse complement of a DNA sequence.
- Visit the EMBOSS tutorial page for more general information.
- Wildcards
Use "?" for any single character and "*" for any string of characters.
Expressions containing wildcards need to be placed in quotes, e.g.: 'uniprot:p0172*'.
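The reason for the quotes is that the shell expands unquoted wildcards itself before the program ever sees them. The following sketch demonstrates this without EMBOSS: 'show_args' is a hypothetical stand-in for an EMBOSS program, and the file name is made up so that it collides with the pattern.

```shell
#!/bin/sh
# show_args: hypothetical stand-in for an EMBOSS program; prints each argument.
show_args() { printf '[%s]\n' "$@"; }

# Throwaway directory containing a file whose name matches the pattern.
tmpdir=$(mktemp -d)
touch "$tmpdir/uniprot:p0172_demo"

# Unquoted: the shell replaces the pattern with the matching file name.
unquoted=$(cd "$tmpdir" && show_args uniprot:p0172*)
# Quoted: the pattern reaches the program unchanged.
quoted=$(cd "$tmpdir" && show_args 'uniprot:p0172*')

rm -rf "$tmpdir"
echo "unquoted: $unquoted"
echo "quoted:   $quoted"
```

When no file happens to match, most shells pass the unquoted pattern through unchanged, so a command may appear to work without quotes and then fail in another directory. Quoting every USA that contains a wildcard avoids this.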
- LEARNING THE BASIC SYNTAX WITH SEQRET
- SEQRET is an extremely versatile application for sequence retrieval from databases, feature parsing and reformatting sequence as well as alignment formats.
- Reformatting sequences and alignments
$ seqret fasta::my_input.fasta gcg::my_output.gcg # example for reformatting a fasta file into GCG format using the seqret program
$ seqret clustal::my_align.aln phylip::my_align.phylip # reformat ClustalW alignment into PHYLIP format
- Sequence retrieval from files and databases
$ seqret embl:x52524 embl::test.embl # retrieves sequence 'x52524' from embl database via the web and writes it to file 'test.embl' in embl format.
$ seqret embl:x52524 embl::test.embl -feature # same as above, but includes features.
$ seqret 'uniprot:p0172*' swiss::stdout # retrieves all sequences whose IDs start with 'p0172' from the uniprot database and prints them to STDOUT.
$ seqret 'uniprot:p0172*' swiss::test -ossingle # same as above, but writes all sequences into individual files using their IDs for naming.
$ seqret @my_list gcg::my_output.gcg # reads in list file for sequence retrieval. List file format (example): have one ID reference per line using this format: /path/to/my_fasta_file:ID. List files can be called in EMBOSS with '@my_list'.
$ seqret @csl_list swiss::stdout # reads in list file for remote sequence retrieval from UniProt. List file format: have one ID reference per line using this format: uniprot:ID.
$ seqret embl:x52524 fasta::stdout -sbegin 1 -send 50 # prints only first 50 nucleotides of x52524 in fasta format.
$ seqret embl:x52524 fasta::stdout -sbegin -50 -send -1 -srev # prints only last 50 nucleotides and generates 'reverse and complement'.
$ entret embl:x52524 stdout # seqret does not retrieve the entire annotation for a sequence. To retrieve the entire entry, use entret.
$ infoseq embl:x52524 # infoseq is a small utility to list the sequences' USA, name, accession number, type (nucleic or protein), length, percentage G+C (for nucleic), and/or description.
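To make the position arguments used above more concrete, the following sketch imitates them on a short made-up DNA string in plain shell. It assumes, as the examples above describe, that positions are 1-based and inclusive, that negative values count from the end, and that the region is excised before it is reverse-complemented. The helper names 'subseq' and 'revcomp' are hypothetical and are not EMBOSS commands.

```shell
#!/bin/sh
# subseq SEQ BEGIN END: hypothetical helper imitating '-sbegin'/'-send'.
# Positions are 1-based and inclusive; negative values count from the end.
subseq() {
    s=$1; b=$2; e=$3
    len=${#s}
    [ "$b" -lt 0 ] && b=$((len + b + 1))
    [ "$e" -lt 0 ] && e=$((len + e + 1))
    printf '%s\n' "$s" | cut -c "${b}-${e}"
}

# revcomp SEQ: hypothetical helper imitating '-srev' for A/C/G/T strings.
# awk reverses the string, tr swaps each base for its complement.
revcomp() {
    printf '%s\n' "$1" |
        awk '{ out = ""; for (i = length($0); i > 0; i--) out = out substr($0, i, 1); print out }' |
        tr 'ACGTacgt' 'TGCAtgca'
}

dna=ATGCCGTAAG                      # made-up 10-nt example sequence
subseq "$dna" 1 4                   # first 4 nucleotides
subseq "$dna" -4 -1                 # last 4 nucleotides
revcomp "$(subseq "$dna" -4 -1)"    # last 4 nucleotides, reverse complement
```

The three calls print ATGC, TAAG and CTTA. In the same way, '-sbegin -50 -send -1' addresses positions len-49 to len of the retrieved sequence.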
- EMBOSS PROGRAMS
- Sequence retrieval and feature parsing
- Sequence retrieval
- seqret
Performs sequence retrieval from databases, feature parsing, sequence and alignment reformatting.
$ seqret embl:x52524 fasta::stdout -sbegin 1 -send 50
- seqretsplit
Reads sequences and writes them to individual files.
$ seqretsplit 'uniprot:p0172*'
- infoseq
Utility to list the sequences' USA, name, accession number, type (nucleic or protein), length, percentage G+C (for nucleic), and/or description.
$ infoseq embl:x52524
- entret
Retrieves full annotation record of a sequence.
$ entret embl:x52524 stdout
- textsearch
Search description line of sequence entry (very slow!).
$ textsearch 'swissprot:TF2B*_HALSA' 'TFIIB' stdout
- Feature parsing and display
- coderet
Extracts features like CDS, mRNA and translations specified in feature table of database entry
$ coderet embl:X03487 stdout
$ coderet 'embl:hsfa*' -nocds -nomrna swiss::stdout # retrieves only translated protein sequences from several entries.
- extractfeat
Extracts each individual feature in table of database entry
$ extractfeat embl:hsfau1 -type exon stdout
- extractseq
Reads in a sequence with a set of map positions and writes out the specified regions of that sequence
$ extractseq embl:hsfau1 -reg "782..856,951..1095,1557..1612,1787..1912" stdout -separate
- showfeat
Displays features of a sequence.
$ showfeat embl:hsfau1 stdout
- Sequence searching
- Motifs
- patmatmotifs
Searches a protein sequence against the PROSITE motif database (setup: admin has to download the PROSITE database and configure it with prosextract)
$ patmatmotifs -full uniprot:opsd_human stdout
- pscan
Finds conserved motifs or fingerprints in proteins using the PRINTS database (setup: admin has to download the PRINTS database, '*.dat', and configure it with printsextract)
$ pscan uniprot:opsd_human stdout
- tfscan
Scans DNA sequences for transcription factor binding sites using the TRANSFAC database. Currently, we have only the very old public version of this database. (setup: admin has to download the TRANSFAC database and configure it with tfextract)
$ tfscan embl:paamir stdout
- patmatdb
Takes a motif in PROSITE format and compares it to set of search sequences
$ patmatdb uniprot:Q6UDF0 'qxxrw' stdout
$ patmatdb 'my_proteome:*' 'qxxrw' stdout | grep -cw '# HitCount: 1' # counts all sequences with a single occurrence of motif 'qxxrw'; '-w' keeps hit counts such as 10 or 12 from being counted as well
- preg
Regular expression search in protein sequences
$ preg 'swissprot:*_rat' 'IA[QWF]A' stdout
- dreg
Regular expression search in nucleotide sequences
$ dreg embl:paamir 'ggtacc' stdout
- fuzzpro
Protein pattern search using PROSITE style patterns. Number of mismatches can be specified.
$ fuzzpro 'swissprot:*_rat' '[FY]-[LIV]-G-[DE]-E-A-Q-x-[RKQ](2)-G'
- fuzznuc
Nucleotide sequence pattern search using PROSITE style patterns. Number of mismatches can be specified.
$ fuzznuc embl:hhtetra
- fuzztran
Protein pattern search in DNA after translation using PROSITE style patterns. Number of mismatches can be specified.
$ fuzztran embl:rnops
- Pairwise alignments and comparisons
- Dot plots
- dottup
Creates dot plot of two sequences
$ dottup embl:xlrhodop embl:xl23808 -graph cps
- dotpath
Displays a non-overlapping wordmatch dotplot of two sequences
$ dotpath ...
- Global alignment
- needle
Creates optimum full-length alignment between two sequences using Needleman-Wunsch algorithm
$ needle embl:xlrhodop embl:xl23808 stdout
Possibility to run needle in all-against-all mode
$ for i in *.fasta; do for j in *.fasta; do needle "$i" "$j" stdout -gapopen 10.0 -gapextend 0.5 >> my_needle_file; done; done
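The all-against-all loop above aligns every file with itself and runs each pair in both orders, so three files give nine needle runs. If each unordered pair is needed only once, the pairs can be listed first, as in the following sketch. The helper name 'list_pairs' is hypothetical, and the demo only prints the needle commands (a dry run) on empty placeholder files. It assumes file names without spaces.

```shell
#!/bin/sh
# list_pairs DIR: hypothetical helper that prints every unordered pair of
# *.fasta files in DIR exactly once and skips self-comparisons.
list_pairs() {
    set -- "$1"/*.fasta          # positional parameters = sorted file list
    while [ $# -gt 1 ]; do
        first=$1
        shift
        for other in "$@"; do    # pair 'first' with every later file
            printf '%s %s\n' "$first" "$other"
        done
    done
}

# Dry run on three empty placeholder files: print the commands only.
demo=$(mktemp -d)
touch "$demo/a.fasta" "$demo/b.fasta" "$demo/c.fasta"
list_pairs "$demo" | while read -r a b; do
    echo "needle $a $b stdout -gapopen 10.0 -gapextend 0.5 >> my_needle_file"
done
rm -rf "$demo"
```

For the three placeholder files this prints three commands (a-b, a-c, b-c) instead of nine. Removing the 'echo' and the surrounding quotes would run needle for real, which requires actual sequence files.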
- Local alignment
- water
Calculates rigorous local alignment between two sequences using Smith Waterman algorithm
$ water embl:xlrhodop embl:xl23808 stdout
- matcher
Like water, but uses less memory.
$ matcher embl:xlrhodop embl:xl23808 stdout
- supermatcher
Finds a match of a large sequence against one or more sequences.
$ supermatcher 'embl:ec*' embl:eclac -word 50
- Common words
- wordmatch
Finds all exact matches of a given size between 2 sequences
$ wordmatch
- Multiple alignments
- ClustalW interface
- emma
Multiple alignment program - interface to ClustalW program
$ emma my_fasta my_alignment my_tree
$ emma 'swissprot:TF2B*_HALSA' phylip::tfb.aln tfb.dnd; cp tfb.dnd intree; phylip retree # retrieves Halobacterium TFIIB proteins from UniProt, creates multiple alignment with emma, reformats alignment into PHYLIP format and displays tree file from emma with 'phylip retree' (not part of EMBOSS)
- Find identical regions (words)
- seqmatchall
Reads a set of sequences and finds identical regions shared by at least two sequences.
$ seqmatchall
- Align EST and genomic DNA sequences
- est2genome
Aligns EST and genomic DNA sequences.
$ est2genome ...
- Aligning distantly related DNA sequences
- tranalign
Creates multiple alignment of CDSs guided by multiple protein alignment
$ tranalign my_DNA_fasta my_pep_alignment my_result
- Alignment information
- infoalign
Lists simple properties of sequences in alignments (percent identity compared to reference sequence).
$ infoalign my_alignment stdout
- polydot
Displays all-against-all dotplots of a set of sequences
$ polydot ...
- Analyze alignments in text mode
- showalign
Alignment mining tool to identify differences and common features between sequences
$ showalign my_alignment stdout
- Box shading
- prettyplot
Displays multiple alignment in colourful box-shading mode
$ prettyplot my.aln -goutfile my.ps -graph cps
- Repeat finding
- Inverted repeats
- einverted
Finds inverted repeats in nucleotide sequences
$ einverted ...
- palindrome
Looks for inverted repeats in a nucleotide sequence
$ palindrome ...
- Tandem repeats
- equicktandem
Finds tandem repeats
$ equicktandem ...
- etandem
Looks for tandem repeats in a nucleotide sequence
$ etandem ...
- Molecular biology
- Trace files
- abiview
Simple tool to read ABI files and display their traces.
$ abiview
- Restriction sites and vector maps
- restrict
Finds restriction enzyme cleavage sites (setup: admin has to download the REBASE database, 'withrefm.40x.Z' & 'proto.40x.Z' and configure it with rebaseextract)
$ restrict embl:hsfau stdout
- remap
Display a sequence with restriction cut sites, translation, etc. (setup: admin has to download the REBASE database 'withrefm.40x.Z' & 'proto.40x.Z' and configure it with rebaseextract)
$ remap embl:hsfau stdout
- restover
Finds restriction enzymes that produce a specific overhang.
$ restover embl:hsfau stdout
- redata
Searches REBASE for enzyme name, references, suppliers, etc.
$ redata -enzyme BamHI stdout
- recoder
Removes specified restriction sites, but maintains the same translation.
$ recoder embl:hsfau -enzyme EcoRII stdout
- cirdna
Draws circular maps of DNA constructs
$ cirdna -graph cps
- lindna
Draws linear maps of DNA constructs
$ lindna -graph cps
- Mutation
- silent
Performs scan for silent mutation in restriction enzyme sites.
$ silent
- Translation
- transeq
Translates nucleic acid sequences.
$ transeq embl:paamir stdout
- sixpack
Translates nucleic acid sequences into all six open reading frames.
$ sixpack embl:paamir stdout
- showorf
Displays a nucleic acid sequence with its protein translation in a style suitable for publication.
$ showorf embl:paamir stdout
- plotorf
Graphical representation of all 6 reading frames.
$ plotorf embl:paamir -graph cps
- backtranseq
Back translates a protein into DNA sequences.
$ backtranseq swissprot:ach2_drome -cfile Edrosophila.cut stdout # '-cfile' uses drosophila sequence and codon table!
- siRNA design
- sirna
Looks for siRNA duplexes in mRNA.
$ sirna embl:hsfau stdout
- Primer design
- eprimer3
Picks PCR primers and hybridization oligos.
$ eprimer3 EMBL:HSFAU1 stdout -explain # If the '-explain' flag is used, then the statistics are reported describing the number of primers that were considered and rejected for various reasons.
$ eprimer3 'EMBL:HSFA*' -numreturn 1 -productsizerange 500-700 stdout # The '-numreturn 1' restricts the output to one primer pair per sequence and '-productsizerange' defines the length range of the PCR product.
- primersearch
Reads primer pairs from input files and searches them against specified sequence(s).
$ primersearch embl:z52466
- stssearch
Searches a DNA database for matches with a set of STS (sequence tagged site) primers.
$ stssearch ...
- DNA predictions
- DNA binding regions
- marscan
Finds MAR/SAR sites in nucleic sequences.
$ marscan embl:u01317 stdout
- DNA structure
- banana
Creates bending and curvature plot in B-DNA.
$ banana
- btwisted
Calculates the twisting in a B-DNA sequence.
$ btwisted
- Protein predictions
- Protein targeting
- sigcleave
Reports positions of signal cleavage sites.
$ sigcleave my_proteinspep sigcleave.out -send 35 -rformat excel # '-send 35' delimits to first 35 AAs; '-rformat excel' output in tab-delimited Excel format
- Antibody design
- antigenic
Finds antigenic sites in proteins.
$ antigenic swissprot:act1_fugru stdout
- Amino acid composition and hydrophobicity plots
- pepstats
Outputs protein statistics report including MW, IP, AA bias, etc.
$ pepstats 'uniprot:p0172*' -sbegin1 1 -send1 30 -stdout -auto # '-auto' turns off prompt behavior; '-sbegin1 1 -send1 30' limit the analysis to the first 30 positions
$ for i in *.fasta; do pepstats -sequence "$i" -sbegin1 1 -send1 30 -stdout -auto >> 1_30_pepstats; done # To run pepstats on many sequences, one can use this 'shell loop' (input peptides need to be in separate files)
- pepwindow
Creates classic Kyte & Doolittle hydropathy plot of protein.
$ pepwindow uniprot:TF2B1_HALSA -graph cps # '-graph cps' creates a ps image that can be displayed with 'gv' or transformed/rotated with ImageMagick (or 'pstoimg') into other formats: 'convert my_image.ps -rotate -90 my_image.jpg'
$ for i in *.pep; do pepwindow "$i" -graph cps; cp pepwindow.ps "$i.ps"; convert "$i.ps" -rotate -90 "$i.jpg"; done # shell loop for many input files
- pepwindowall
Produces a set of superimposed Kyte & Doolittle hydropathy plots from an aligned set of protein sequences.
$ pepwindowall 'uniprot:TF2B*_HALSA' -graph cps
- pepinfo
Creates color plots of physico-chemical properties of proteins. A second image file contains the hydrophobicity plot (pepwindow images have here higher resolution)
$ pepinfo uniprot:TF2B1_HALSA -graph cps -auto # '-graph cps' creates a ps image that can be displayed with 'gv' or transformed/rotated with ImageMagick (or 'pstoimg') into other formats: 'convert my_image.ps -rotate -90 my_image.png'
$ for i in *.fasta; do pepinfo "$i" -graph cps -auto; cp pepinfo.ps "$i.ps"; convert "$i.ps" -rotate -90 "$i.jpg"; done # shell loop for many input files
- octanol
Calculates and plots free energy difference between water/interface and water/octanol.
$ octanol ...
- hmoment
Plots or writes out the hydrophobic moment. The hydrophobic moment is the hydrophobicity of a peptide measured for a specified angle of rotation per residue.
$ hmoment swissprot:hbb_human
- Transmembrane domains
- tmap
Reads in one or more aligned protein sequences and predicts transmembrane segments.
$ tmap swissprot:opsd_human -out tmap.res -graph cps
- Helix analysis
- pepnet
Displays proteins as a helical net. Useful for identifying patterns of amphipathicity for more detailed analysis with pepwheel.
$ pepnet ...
- pepwheel
Displays proteins as a helical wheel for highlighting amphipathicity and other properties of residues around a helix.
$ pepwheel swissprot:hbb_human -send 30
- Protein secondary structure prediction
- garnier
Performs secondary structure predictions of protein sequences.
$ garnier 'uniprot:p0172*' stdout
- helixturnhelix
Finds helix-turn-helix nucleic acid binding motifs in proteins.
$ helixturnhelix swissprot:laci_ecoli stdout
- pepcoil
Predicts coiled coil regions in protein sequences.
$ pepcoil swissprot:gcn4_yeast stdout
- Proteomics
- Protein identification by mass fingerprints
- emowse
Protein identification by mass spectrometry. Emowse is EMBOSS' implementation of the MOWSE software.
$ emowse ...
- mwfilter
Filter noisy molwts from mass spec output.
$ mwfilter ...
- mwcontam
Shows molwts that match across a set of files.
$ mwcontam ...
- Protein digest
- digest
Finds the positions where a specified proteolytic enzyme or reagent might cut a peptide sequence.
$ digest ...
- Structural bioinformatics
- PDB files
- pdbparse
Parses PDB files and writes CCF files (clean coordinate files) for proteins.
$ pdbparse ...
- SCOP files
- scopparse
Reads raw SCOP classification files and writes a DCF file (domain classification file).
$ scopparse ...
- EXERCISES
- Sequence retrieval from local database
- Download proteome of Halobacterium spec. from ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Halobacterium_sp/AE004437.faa (use wget or web browser for download)
- Create list file 'my_TFB_NCBI_list' with the following TFB (transcription initiation factor IIB) protein identifiers
- AE004437.faa:AAG20319.1
AE004437.faa:AAG19212.1
AE004437.faa:AAG18850.1
AE004437.faa:AAG19313.1
AE004437.faa:AAG18894.1
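Instead of typing the list file by hand, it can be generated from the identifiers. The following sketch writes the five entries shown above into 'my_TFB_NCBI_list' in the current directory; the helper name 'make_list' is hypothetical.

```shell
#!/bin/sh
# make_list FILE ID...: hypothetical helper that prints one 'file:ID'
# reference per line, the format expected in an EMBOSS list file.
make_list() {
    db=$1
    shift
    for id in "$@"; do
        printf '%s:%s\n' "$db" "$id"
    done
}

make_list AE004437.faa \
    AAG20319.1 AAG19212.1 AAG18850.1 AAG19313.1 AAG18894.1 > my_TFB_NCBI_list

cat my_TFB_NCBI_list    # one reference per line, e.g. AE004437.faa:AAG20319.1
```

The resulting file is then passed to an EMBOSS program as '@my_TFB_NCBI_list'. The references use a relative file name, so the command has to be run from the directory that contains AE004437.faa.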
- Retrieve the corresponding sequences from this local fasta 'database'
$ seqret @my_TFB_NCBI_list TFB_NCBI.pep
- Reformat sequences into other formats, e.g. genbank, swissprot, etc.
$ seqret fasta::TFB_NCBI.pep genbank::stdout
- Sequence retrieval and feature parsing from remote database
- Find TFB sequences with textsearch and retrieve them with seqret:
$ textsearch 'swissprot:TF2B*_HALSA' 'TFIIB' stdout
$ seqret 'swissprot:TF2B*_HALSA' TFB.pep # retrieves basic annotation information of records
$ entret 'swissprot:TF2B*_HALSA' TFB.pep # retrieves full annotation information of records
- Parse protein domains (features) with extractfeat and view them with showfeat:
$ extractfeat 'swissprot:TF2B*_HALSA' stdout
$ showfeat 'swissprot:TF2B*_HALSA' stdout
- Multiple & pairwise alignments and distance trees
- Align the TFB sequences with emma and view the alignment with showalign
$ emma 'swissprot:TF2B*_HALSA' TFB.aln TFB.dnd; showalign TFB.aln stdout -show=a
- Align only last 60 AA of TFB.pep with emma and view the alignment with showalign
$ emma 'swissprot:TF2B*_HALSA' TFB.aln TFB.dnd -sbegin -60 -send -1; showalign TFB.aln stdout -show=a
- Reformat alignment into other formats, e.g. PHYLIP, MSF, CLUSTAL, etc.
$ seqret fasta::TFB.aln phylip::stdout
- View created tree file TFB.dnd with 'phylip retree' (not part of EMBOSS)
$ phylip retree TFB.dnd
- Do everything from download, alignment and tree viewing in one command
$ emma 'swissprot:TF2B*_HALSA' phylip::TFB.phylip TFB.dnd; cp TFB.dnd intree; phylip retree
- Create multiple alignment of TFB CDSs guided by multiple protein alignment
$ extractseq embl:AE005017 -reg "7723..8700" stdout -separate > TFB.cds; extractseq embl:AE004992 -reg "851..1810" stdout -separate >> TFB.cds; extractseq embl:AE005026 -reg "12106..13092" stdout -separate >> TFB.cds; extractseq embl:AE005164 -reg "8571..9524" stdout -separate >> TFB.cds; extractseq embl:AE004988 -reg "11458..12429" stdout -separate >> TFB.cds; extractseq embl:AE005105 -reg "12389..13252" stdout -separate >> TFB.cds; extractseq embl:AE005167 -reg "1255..2178" stdout -separate >> TFB.cds; seqret TFB.cds -srev TFBrev.cds
$ emma 'swissprot:TF2B*_HALSA' TFB.mul TFB.dnd
$ tranalign TFBrev.cds TFB.mul clustal::TBF_CDS.aln
- Create DNA alignment directly from nucleotide sequences (TFBrev.cds) and compare with alignment guided by protein alignment
$ emma TFBrev.cds clustal::TFB.aln TFB2.dnd
- Generate pairwise full-length alignments of first sequence in file TFBrev.cds against all the others.
$ needle TFBrev.cds:AE005017_7723_8700 TFBrev.cds -gapopen 10.0 -gapextend 0.5 stdout
- Primer design
- Design primers for all sequences in the alignment from the tranalign program
$ eprimer3 clustal::TBF_CDS.aln -numreturn 1 -productsizerange 800-900 stdout
- Motif finding
- Search the PROSITE motif database with one of the TFB sequences and compare the result with the corresponding SwissProt entry (TF2B2_HALSA)
$ patmatmotifs -full swissprot:TF2B2_HALSA stdout
$ entret swissprot:TF2B2_HALSA stdout
- Search the PRINTS motif database with one of the TFB sequences and compare the result with the corresponding SwissProt entry (TF2B2_HALSA)
$ pscan swissprot:TF2B2_HALSA stdout
- Search the other TFB sequences for the presence of one of the identified motifs
$ patmatdb 'swissprot:TF2B*_HALSA' 'G-[KR]-x(3)-[STAGN]-x-[LIVMYA]-[GSTA](2)-[CSAV]-[LIVM]-[LIVMFY]-[LIVMA]-[GSA]-[STAC]' stdout
- Search the entire Halobacterium proteome (AE004437.faa) for the presence of this motif
$ patmatdb AE004437.faa 'G-[KR]-x(3)-[STAGN]-x-[LIVMYA]-[GSTA](2)-[CSAV]-[LIVM]-[LIVMFY]-[LIVMA]-[GSA]-[STAC]' stdout
- Search the Halobacterium proteome (AE004437.faa) with the same motif allowing 1 or 2 mismatches
$ fuzzpro AE004437.faa -pattern 'G-[KR]-x(3)-[STAGN]-x-[LIVMYA]-[GSTA](2)-[CSAV]-[LIVM]-[LIVMFY]-[LIVMA]-[GSA]-[STAC]' -mismatch 1 -outfile stdout
- Search the Halobacterium TFB sequences from swissprot with the same motif allowing 1 or 2 mismatches
$ fuzzpro 'swissprot:TF2B*_HALSA' -pattern 'G-[KR]-x(3)-[STAGN]-x-[LIVMYA]-[GSTA](2)-[CSAV]-[LIVM]-[LIVMFY]-[LIVMA]-[GSA]-[STAC]' -mismatch 1 -outfile stdout | grep ' Sequence:'
- Highlight the identified motif in above alignment (TFB.aln) using the HTML format function of showalign. Afterwards you can view the resulting TFB.html file in your local web browser.
$ showalign TFB.aln TFB.html -html -high '179-194 red'
- Create color shaded alignment with the mview Perl program, which is not part of EMBOSS. To get help on this tool, type 'mview -help'. The argument '-width 100' turns on alignment wrapping, here every 100 positions. HTML alignments can easily be imported into MS Word or other text editors.
$ mview -in pearson -css on -html header -ruler on -coloring consensus -threshold 80 -consensus on -con_coloring identity TFB.mul > TFB2.html
- Sequence similarity searching with BLAST and domain searching with HMMER
- See exercises in Linux manual.