
SEED: Clustering Next Generation Sequences


Authors: Ergude Bao and Thomas Girke

 

Overview

SEED is software for clustering large sets of Next Generation Sequencing (NGS) reads, up to hundreds of millions of them, in a time- and memory-efficient manner. Its algorithm joins highly similar sequences into clusters whose members can differ by up to three mismatches and three overhanging residues.
 

How to Cite SEED

If you use SEED, please cite the following paper:
Bao E, Jiang T, Kaloshian I, Girke T (2011) SEED: Efficient Clustering of Next Generation Sequences. Bioinformatics: epub.

Downloads and Version History


Version 1.5.1

Changes

  • Fixed a bug that miscounted the total number of sequences
  • Fixed a bug in counting sequences with many N bases
Downloads

Version 1.4.1

Changes

  • Added pre-sorting to accelerate clustering, resulting in a ~7-fold speed improvement compared to previous versions
  • Fixed a bug related to data sets with variable read lengths
  • Updated referenced libs to avoid compilation errors
  • Extended maximum read length to 200bp and added read length check
  • Added security check for memory allocation
Downloads

Version 1.3.1

Changes

  • Added option to co-cluster sequences in sense and antisense orientation (reverse complement)
  • The main output format is now FASTQ instead of FASTA as in the previous version
  • Fixed a bug that occurred when the nearest sequence contains more than 3 mismatches relative to the virtual center sequence
Downloads
Version 1.2.1

Changes

  • Fast and short modes added
  • New file format
  • QV incorporated
Downloads

Short Manual

1. System requirements
 
SEED is suitable for 32-bit or 64-bit machines with Windows, OS X or Linux operating systems. At least 4GB of system memory is recommended for clustering larger data sets.
 
2. Installation
 
The downloaded .cpp file can be compiled as follows:
  • On Mac/UNIX/Linux systems, execute on the command line: g++ -o SEED SEED.cpp
  • On Windows systems, the code can be compiled under the Visual C++ environment.
3. Input
 
Only the FASTQ format is supported in the current version. Sequence lengths should be between 21 bp and 100 bp, with a maximum length variation of 5 bp within a data set.
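The length limits above can be checked before running SEED. The following is a minimal Python sketch, not part of SEED itself. The function names are hypothetical, the input is assumed to use plain four-line FASTQ records, and "variation" is assumed to mean the difference between the longest and the shortest read.

```python
from typing import Iterable, Iterator, List

MIN_LEN = 21        # shortest read length stated in the Input section
MAX_LEN = 100       # longest read length stated in the Input section
MAX_VARIATION = 5   # assumption: longest read length minus shortest read length


def fastq_sequences(lines: Iterable[str]) -> Iterator[str]:
    """Yield the sequence line (2nd line) of each four-line FASTQ record."""
    for index, line in enumerate(lines):
        if index % 4 == 1:
            yield line.strip()


def check_read_lengths(lines: Iterable[str]) -> List[str]:
    """Return a list of problems; an empty list means the lengths look usable."""
    lengths = [len(seq) for seq in fastq_sequences(lines)]
    if not lengths:
        return ["no reads found"]
    shortest, longest = min(lengths), max(lengths)
    problems = []
    if shortest < MIN_LEN:
        problems.append(f"shortest read is {shortest} bp (< {MIN_LEN})")
    if longest > MAX_LEN:
        problems.append(f"longest read is {longest} bp (> {MAX_LEN})")
    if longest - shortest > MAX_VARIATION:
        problems.append(f"lengths vary by {longest - shortest} bp (> {MAX_VARIATION})")
    return problems
```

Any iterable of lines works as input, for example an open file handle for input.fastq.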
 
4. Using SEED


SEED --input input.fastq --output output.txt [--mismatch M] [--shift S] [--QV1 L] [--QV2 U] [--fast/short] [--reverse]
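When SEED is called from a script, the option ranges documented in this section can be validated before the program is launched. The Python helper below is only an illustrative sketch: `build_seed_command` is a hypothetical name, and the executable is assumed to be called SEED, as produced by the compile command in the Installation section.

```python
from typing import List, Optional


def build_seed_command(
    input_path: str,
    output_path: str,
    mismatch: int = 3,
    shift: int = 3,
    qv1: int = 0,
    qv2: int = 6 * 93,
    mode: Optional[str] = None,  # None, "fast" or "short" (mutually exclusive)
    reverse: bool = False,
) -> List[str]:
    """Build the SEED argument list, rejecting values outside the documented ranges."""
    if not 0 <= mismatch <= 3:
        raise ValueError("--mismatch must be between 0 and 3")
    if not 0 <= shift <= 6:
        raise ValueError("--shift must be between 0 and 6")
    if not 0 <= qv1 <= 2 * 93:
        raise ValueError("--QV1 must be between 0 and 2 * 93")
    if not 0 <= qv2 <= 6 * 93:
        raise ValueError("--QV2 must be between 0 and 6 * 93")
    if mode not in (None, "fast", "short"):
        raise ValueError('mode must be None, "fast" or "short"')

    # Assumption: the compiled binary is named SEED and can be found by the shell.
    command = [
        "SEED", "--input", input_path, "--output", output_path,
        "--mismatch", str(mismatch), "--shift", str(shift),
        "--QV1", str(qv1), "--QV2", str(qv2),
    ]
    if mode is not None:
        command.append("--" + mode)  # --fast is only applicable to reads longer than 58 bp
    if reverse:
        command.append("--reverse")
    return command
```

The returned list can be passed to `subprocess.run(command, check=True)` from the Python standard library.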


--mismatch is the maximum number of mismatches allowed from the center sequence in each cluster (0 - 3, default 3).
--shift is the maximum number of shifts allowed from the center sequence in each cluster (0 - 6, default 3).
--QV1 is the threshold for the base call quality values (QV) that are provided in the FASTQ files as Phred scores. SEED ignores those mismatches where the sum of the Phred scores of the mismatching bases is lower than the specified QV1 threshold value (0 - 2 * 93). The default value for QV1 is 0.
--QV2 is a second QV threshold. It prevents co-clustering of sequences where the sum of the Phred scores over all mismatched positions is higher than the threshold value (0 - 6 * 93). The default value for QV2 is 6 * 93.
--fast uses a larger spaced seed weight to reduce running time. It is only applicable to sequences longer than 58 bp and may need more memory.
--short uses a smaller spaced seed weight to support sequences as short as 21 bp. This setting often results in longer compute times.
--reverse co-clusters sequences in sense and antisense orientation (reverse complement).
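The interplay of --mismatch, --QV1 and --QV2 can be illustrated with a small Python sketch. It is one possible reading of the option descriptions above, not SEED's actual implementation. It assumes Phred+33 quality encoding, compares two reads of equal length without any shift, and assumes that mismatches ignored because of QV1 do not contribute to the QV2 sum. All function names are hypothetical.

```python
from typing import List

PHRED_OFFSET = 33  # assumption: Sanger-style Phred+33 quality encoding


def phred_scores(quality: str) -> List[int]:
    """Convert a FASTQ quality string into a list of Phred scores."""
    return [ord(ch) - PHRED_OFFSET for ch in quality]


def may_co_cluster(
    seq_a: str, qual_a: str,
    seq_b: str, qual_b: str,
    max_mismatch: int = 3,
    qv1: int = 0,
    qv2: int = 6 * 93,
) -> bool:
    """Return True if two aligned reads pass the mismatch, QV1 and QV2 checks."""
    if len(seq_a) != len(seq_b):
        raise ValueError("this sketch only compares reads of equal length")
    counted = 0       # mismatches that are not ignored by QV1
    quality_sum = 0   # summed Phred scores over the counted mismatches
    for base_a, base_b, q_a, q_b in zip(
        seq_a, seq_b, phred_scores(qual_a), phred_scores(qual_b)
    ):
        if base_a == base_b:
            continue
        pair_quality = q_a + q_b  # at most 2 * 93 per mismatching position
        if pair_quality < qv1:
            continue  # low-confidence mismatch: ignored
        counted += 1
        quality_sum += pair_quality
    return counted <= max_mismatch and quality_sum <= qv2
```

For example, a single mismatch between two bases of Phred score 40 has a pair quality of 80. It is ignored when QV1 is 81 or higher, and it blocks co-clustering when QV2 is below 80.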

5. Output
 
SEED outputs two files: a SEED file and a FASTQ file. The output FASTQ file has the same format as the input FASTQ file, but it contains only the center sequence and its quality scores for each cluster with one or more members. In other words, it is a filtered version of the input FASTQ file from which the redundant sequences have been removed. The SEED file has a tabular format that is explained in the following table. The third column in this table is only available if the --reverse argument has been specified.

 Cluster ID   Sequence ID                    Is Reversed
 Center sequence for cluster 0
 0            Sequence ID from input file    1
 0            Sequence ID from input file    0
 Center sequence for cluster 1
 1            Sequence ID from input file    1
 1            Sequence ID from input file    0
 ...          ...                            ...
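The tabular SEED file can be read back into per-cluster records for downstream analysis. The Python sketch below relies on several assumptions that the table above does not confirm: columns are separated by tabs, each center sequence appears on a line of its own before the members of its cluster, and a header row, if present, starts with a non-numeric field. The function name `parse_seed_table` is hypothetical.

```python
from typing import Dict, Iterable


def parse_seed_table(lines: Iterable[str]) -> Dict[int, dict]:
    """Group SEED table rows into {cluster_id: {"center": ..., "members": [...]}}."""
    clusters: Dict[int, dict] = {}
    pending_center = None  # center sequence waiting for its first member row
    for raw in lines:
        # Assumption: tab-separated columns; trailing empty cells are dropped.
        fields = [field.strip() for field in raw.rstrip("\r\n").split("\t")]
        while fields and fields[-1] == "":
            fields.pop()
        if not fields:
            continue  # blank line
        if len(fields) == 1:
            pending_center = fields[0]  # a center sequence line
            continue
        if not fields[0].isdigit():
            continue  # assumed to be a header row
        cluster_id = int(fields[0])
        if cluster_id not in clusters:
            clusters[cluster_id] = {"center": pending_center, "members": []}
            pending_center = None
        # The third column only exists when --reverse was specified.
        is_reversed = len(fields) > 2 and fields[2] == "1"
        clusters[cluster_id]["members"].append((fields[1], is_reversed))
    return clusters
```

Each member is stored as a pair of sequence ID and a flag that is True when the read was clustered in reversed orientation.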