SEED: Clustering Next Generation Sequences
Authors: Ergude Bao and Thomas Girke
SEED is a software tool for clustering large sets of Next Generation Sequences (NGS), with hundreds of millions of reads, in a time- and memory-efficient manner. Its algorithm joins highly similar sequences into clusters in which each member can differ from the center sequence by up to three mismatches and three overhanging residues.
Bao E, Jiang T, Kaloshian I, Girke T (2011) SEED: Efficient Clustering of Next Generation Sequences. Bioinformatics: epub.
1. System requirements
SEED runs on 32-bit or 64-bit machines with Windows, OS X or Linux operating systems. At least 4 GB of system memory is recommended for clustering larger data sets.
The downloaded .cpp file can be compiled as follows:
Only the FASTQ format is supported in the current version. Sequence lengths should be between 21 bp and 100 bp, with a maximum length variation of 5 bp.
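The length constraints above can be checked before running SEED. The following Python sketch is not part of SEED; the function names are hypothetical, and it assumes plain 4-line FASTQ records and reads "max variation of 5 bp" as the difference between the longest and the shortest read.

```python
from typing import Iterable, Iterator, List, Tuple

MIN_LEN = 21     # shortest supported read length stated above
MAX_LEN = 100    # longest supported read length stated above
MAX_SPREAD = 5   # assumption: "variation" means longest minus shortest read


def read_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (header, sequence, quality) from 4-line FASTQ records."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        if not header:
            continue  # skip blank lines between records
        seq = next(it, "").rstrip("\n")
        next(it, "")  # '+' separator line, ignored
        qual = next(it, "").rstrip("\n")
        yield header, seq, qual


def check_read_lengths(lines: Iterable[str]) -> List[str]:
    """Return a list of human-readable problems; empty if the input looks fine."""
    lengths = [len(seq) for _, seq, _ in read_fastq(lines)]
    if not lengths:
        return ["no reads found"]
    problems = []
    if min(lengths) < MIN_LEN:
        problems.append(f"shortest read is {min(lengths)} bp (< {MIN_LEN})")
    if max(lengths) > MAX_LEN:
        problems.append(f"longest read is {max(lengths)} bp (> {MAX_LEN})")
    if max(lengths) - min(lengths) > MAX_SPREAD:
        problems.append(
            f"read lengths vary by {max(lengths) - min(lengths)} bp (> {MAX_SPREAD})"
        )
    return problems
```

An open file handle can be passed directly as `lines`, since the parser only iterates over it line by line.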
4. Using SEED
--mismatch is the maximum number of mismatches allowed from the center sequence in each cluster (0 - 3, default 3).
--shift is the maximum number of shifts allowed from the center sequence in each cluster (0 - 6, default 3).
--QV1 is the threshold for the base call quality values (QV) that are provided in the FASTQ files as Phred scores. SEED ignores those mismatches where the sum of the Phred scores of the mismatching bases is lower than the specified QV1 threshold value (0 - 2 * 93). The default value for QV1 is 0.
--QV2 is a second QV threshold. It prevents co-clustering of sequences where the sum of the Phred scores over all mismatched positions is higher than the threshold value (0 - 6 * 93). The default value for QV2 is 6 * 93.
--fast uses a bigger spaced seed weight to save running time. It is only applicable for sequences longer than 58 bp and may need more memory.
--short uses a smaller spaced seed weight to support sequences as short as 21 bp. This setting often results in longer compute times.
--reverse co-clusters sequences in sense and anti-sense orientation (reverse complement).
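To make the interaction of --mismatch, --QV1 and --QV2 more concrete, here is a minimal Python sketch of the rules as described above. It is an illustration only, not SEED's implementation: the function names are hypothetical, it assumes Phred+33 encoded qualities (consistent with the 0 - 93 score range), it compares two equal-length sequences without shifts, and it assumes that mismatches ignored under QV1 still count towards the QV2 sum.

```python
from typing import List

PHRED_OFFSET = 33  # assumption: Sanger-style FASTQ, scores 0-93


def phred(qual: str) -> List[int]:
    """Convert a FASTQ quality string to a list of Phred scores."""
    return [ord(c) - PHRED_OFFSET for c in qual]


def may_cluster(center_seq: str, center_qual: str,
                read_seq: str, read_qual: str,
                max_mismatch: int = 3, qv1: int = 0, qv2: int = 6 * 93) -> bool:
    """Apply the mismatch, QV1 and QV2 rules to one center/read pair."""
    cq, rq = phred(center_qual), phred(read_qual)
    counted = 0    # mismatches that count against max_mismatch
    total_qv = 0   # summed Phred scores over all mismatched positions
    for i, (a, b) in enumerate(zip(center_seq, read_seq)):
        if a == b:
            continue
        pair_qv = cq[i] + rq[i]  # both mismatching bases, hence the 2 * 93 range
        total_qv += pair_qv
        if pair_qv >= qv1:       # QV1: low-quality mismatches are ignored
            counted += 1
    # QV2: reject pairs whose mismatches are supported by too much quality
    return counted <= max_mismatch and total_qv <= qv2
```

With the default values (QV1 = 0, QV2 = 6 * 93) neither quality filter has any effect, so only the mismatch count decides; raising QV1 makes clustering more permissive, while lowering QV2 makes it stricter.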
SEED outputs two files: a SEED file and a FASTQ file. The output FASTQ file has the same format as the input FASTQ file, but it contains only the center sequence and its quality scores for each cluster with one or more members. In other words, it is a filtered version of the input FASTQ file from which the redundant sequences have been removed. The SEED file has a tabular format that is explained in the following table. The third column in this table is only available if the --reverse argument has been specified.