BRANCH: boosting RNA-Seq assemblies with partial or related genomic sequences
Authors: Ergude Bao and Thomas Girke
BRANCH is software that extends de novo transfrags and identifies novel transfrags using DNA contigs or genes of closely related species. BRANCH first discovers novel exons and then extends or joins fragmented de novo transfrags, so that the resulting transfrags are more complete.
Bao E, Jiang T, Girke T (2013). BRANCH: boosting RNA-Seq assemblies with partial or related genomic sequences. Bioinformatics: epub.
1. System requirements
BRANCH is suitable for 32-bit or 64-bit machines with Linux operating systems. At least 4GB of system memory is recommended for assembling larger data sets.
The LEMON graph library is required to compile and run BRANCH.
The BLAT aligner is required to run BRANCH; the modified version distributed with BRANCH is highly recommended.
2. Installation
(1) Download the .cpp file.
(2) If LEMON is already installed on your system, run: g++ -o BRANCH BRANCH.cpp -lemon -lpthread;
otherwise, download LEMON, compile it, and run: g++ -o BRANCH -I PATH2LEMON/include BRANCH.cpp -L PATH2LEMON/bin -lpthread.
(3) To use the modified BLAT, add its directory to your $PATH: export PATH=PATH2BLAT:$PATH.
3. Inputs
(1) Single- or paired-end RNA reads in FASTA format.
(2) De novo transfrags assembled by any de novo RNA assembler (Velvet/Oases, Trinity, etc.).
(3) DNA contigs assembled by any de novo DNA assembler (Velvet, ABySS, etc.) or genome/gene sequences from a closely related species.
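Since BRANCH expects reads in FASTA format, and single-end reads should all share one length, it can help to check a read file before a long run. The following is a minimal sketch of such a check and is not part of BRANCH; the function names read_fasta and read_length_counts are hypothetical, and the parser assumes plain FASTA with ">" header lines.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines may be wrapped; collect them all.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def read_length_counts(lines: Iterable[str]) -> Counter:
    """Count how many reads have each sequence length."""
    return Counter(len(seq) for _, seq in read_fasta(lines))


# Tiny in-memory example; for a real file, pass an open file handle instead.
example = [">r1", "ACGTAC", ">r2", "ACG", "TAC", ">r3", "ACGT"]
counts = read_length_counts(example)
if len(counts) > 1:
    print("reads have mixed lengths:", dict(counts))
```

More than one key in the result means the reads do not all have the same length.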
4. Using BRANCH
--read1 is the first mate file of paired-end (PE) RNA reads, or the single-end RNA reads, in FASTA format
--read2 is the second mate file of PE RNA reads in FASTA format
--transfrag is the set of de novo RNA transfrags to be extended
--contig is the set of reference DNA contigs
--transcript is the set of extended de novo transfrags
--insertLow is the lower bound of insert length (highly recommended; default: 0)
--insertHigh is the upper bound of insert length (highly recommended; default: 99999)
--threshSize is the minimum size of a genome region that could be identified as an exon (default: 2 bp)
--threshCov is the minimum coverage of a genome region that could be identified as an exon (default: 2)
--threshSplit is the minimum upstream and downstream junction coverage required to split a genome region into more than one exon (default: 2)
--threshConn is the minimum connectivity of two exons that could be identified as a splice junction (default: 2)
--closeGap closes sequencing gaps using PE read information (default: none)
--noAlignment skips the initial time-consuming alignment step, if all the alignment files have already been provided in the tmp directory (default: none)
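As an illustration of how the options above might be combined, the following sketch assembles a BRANCH command line as an argument list. It is only a sketch under assumptions: the helper build_branch_command, the executable path ./BRANCH, and all file names are hypothetical, and it assumes each option is followed by its value as a separate argument.

```python
import shlex
from typing import List, Optional


def build_branch_command(
    read1: str,
    transfrag: str,
    contig: str,
    transcript: str,
    read2: Optional[str] = None,
    insert_low: Optional[int] = None,
    insert_high: Optional[int] = None,
    close_gap: bool = False,
    executable: str = "./BRANCH",  # assumed location of the compiled binary
) -> List[str]:
    """Build an argument list from the options documented above."""
    cmd = [executable, "--read1", read1]
    if read2 is not None:
        cmd += ["--read2", read2]  # only for paired-end reads
    cmd += ["--transfrag", transfrag, "--contig", contig,
            "--transcript", transcript]
    if insert_low is not None:
        cmd += ["--insertLow", str(insert_low)]
    if insert_high is not None:
        cmd += ["--insertHigh", str(insert_high)]
    if close_gap:
        cmd.append("--closeGap")  # flag without a value
    return cmd


# Hypothetical file names, paired-end reads with insert length near 200.
cmd = build_branch_command(
    read1="reads_1.fa", read2="reads_2.fa", transfrag="transfrags.fa",
    contig="contigs.fa", transcript="extended.fa",
    insert_low=180, insert_high=220,
)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) to launch the program, or the printed line can be pasted into a shell.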
6. Important things to note
(1) Single-end reads should all have the same length. They are not recommended, because good alignment quality is hard to maintain with single-end reads.
(2) It is better to use related gene sequences rather than related genome sequences, as this greatly reduces run time and memory usage.
(3) Although --insertLow and --insertHigh are optional, they should always be specified to obtain meaningful results. If the insert length is I, then insertLow = I - 20 and insertHigh = I + 20 would be fine.
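The rule of thumb in note (3) can be written as a small helper. This is a sketch, not part of BRANCH: the function name insert_bounds is hypothetical, the margin of 20 comes from the note above, and clamping the lower bound at 0 is an added assumption (0 is the documented default of --insertLow).

```python
from typing import Tuple


def insert_bounds(insert_length: int, margin: int = 20) -> Tuple[int, int]:
    """Return (insertLow, insertHigh) as I - margin and I + margin."""
    if insert_length < 0 or margin < 0:
        raise ValueError("insert_length and margin must be non-negative")
    # Assumption: never go below 0, the documented default lower bound.
    low = max(0, insert_length - margin)
    high = insert_length + margin
    return low, high


low, high = insert_bounds(200)
print(f"--insertLow {low} --insertHigh {high}")
```

For an insert length of 200 this gives --insertLow 180 --insertHigh 220.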