Inchworm assembles RNA-Seq reads from FASTA-formatted files. Assembly can proceed using strand-specific or non-strand-specific RNA-Seq data. The process is described below.
After installation, the simplest way to run Inchworm is on strand-specific, sense-oriented FASTA-formatted sequences, like so:
inchworm --reads $fasta_file --run_inchworm
If your data are not strand-specific, then you would run it like so:
inchworm --reads $fasta_file --run_inchworm --DS
Sample data and an example Inchworm run are provided in the included test_inchworm/ directory.
By default, a k-mer length of 25 is used and only assemblies at least 100 bases long are reported. These values can be changed with the -K and -L options, respectively. Many other options exist, but they are mostly experimental and can be ignored for now.
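To make the k-mer length setting concrete, the following Python sketch shows how a read decomposes into overlapping k-mers and how those k-mers could be tallied. It is a generic illustration rather than Inchworm's own code: the function name count_kmers is hypothetical, and strand handling is deliberately left out.

```python
from collections import Counter


def count_kmers(reads, k=25):
    """Tally every overlapping k-mer across a collection of read sequences.

    Hypothetical helper for illustration only; it ignores strandedness.
    """
    counts = Counter()
    for read in reads:
        # A read of length n yields n - k + 1 k-mers; reads shorter
        # than k yield none because the range below is empty.
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


if __name__ == "__main__":
    # Tiny k so the toy read produces repeated k-mers.
    for kmer, n in sorted(count_kmers(["ACGTACGT"], k=4).items()):
        print(kmer, n)
```

A longer k-mer is more specific but requires longer error-free stretches in each read, which is the trade-off to keep in mind when changing the default.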
The assembly sequences output by inchworm are formatted like so:
>a1;123
GATTACCAGATGATTGCCC......
Here, a1 corresponds to assembly 1, and 123 is the average k-mer coverage (i.e., read coverage) of the assembly.
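For downstream scripts, the assembly name and coverage can be pulled out of a header in this format. The Python sketch below assumes the header has exactly the name;coverage form shown above, optionally followed by whitespace-separated extra fields; the function name parse_inchworm_header is hypothetical.

```python
def parse_inchworm_header(header):
    """Split a header such as '>a1;123' into ('a1', 123.0).

    Assumes the first whitespace-delimited token is 'name;coverage'.
    """
    token = header.lstrip(">").split()[0]
    name, sep, coverage = token.partition(";")
    if not sep or not coverage:
        raise ValueError(f"no coverage field in header: {header!r}")
    return name, float(coverage)


if __name__ == "__main__":
    name, coverage = parse_inchworm_header(">a1;123")
    print(name, coverage)
```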
RNA-Seq analysis often starts from FASTQ files rather than FASTA files. Since Inchworm takes FASTA files as input, tools are provided to convert FASTQ to FASTA and to reorient sequences according to their transcribed orientation (if strand-specific).
The included script util/fastQ_to_fastA.pl extracts the sequences from FASTQ files. If you have paired strand-specific reads, be sure to reverse-complement the appropriate fragment end before running Inchworm. For example, given two FASTQ files for paired strand-specific fragment reads that look like so:
========>                                  /2 (right of sequenced fragment)
=======================================>   (sense)
                           <============   /1 (left of sequenced fragment)
which is the expected product of paired-end strand-specific RNA-Seq performed at the Broad (where, yes, the left fragment is on the right side and reverse-complemented), we would do the following to prep for inchworm:
util/fastQ_to_fastA.pl -I left.fq -a 1 --rev > left.fq.fa
util/fastQ_to_fastA.pl -I right.fq -a 2 > right.fq.fa
cat left.fq.fa right.fq.fa > both.senseOriented.fa
Then, use the both.senseOriented.fa file as the input for Inchworm de novo assembly.
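The effect of the preparation steps above can be sketched in Python: convert each four-line FASTQ record to a FASTA record, reverse-complementing the reads that need reorienting. This is only an illustration of the idea and not the implementation of util/fastQ_to_fastA.pl; the function names are hypothetical, and read names are passed through unchanged.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def fastq_to_fasta(lines, reverse=False):
    """Yield (name, sequence) pairs from an iterable of FASTQ lines.

    Assumes plain four-line records: '@name', sequence, '+', qualities.
    """
    it = iter(lines)
    for header in it:
        header = header.strip()
        if not header:
            continue  # tolerate blank lines between records
        if not header.startswith("@"):
            raise ValueError(f"expected '@' header, got: {header!r}")
        seq, plus, qual = next(it, None), next(it, None), next(it, None)
        if seq is None or plus is None or qual is None:
            raise ValueError("truncated FASTQ record")
        seq = seq.strip()
        if reverse:
            seq = reverse_complement(seq)
        yield header[1:], seq


if __name__ == "__main__":
    left = ["@frag1/1\n", "AACG\n", "+\n", "IIII\n"]
    right = ["@frag1/2\n", "GGTA\n", "+\n", "IIII\n"]
    # Left reads are reverse-complemented; right reads are kept as-is.
    records = list(fastq_to_fasta(left, reverse=True)) + list(fastq_to_fasta(right))
    for name, seq in records:
        print(f">{name}\n{seq}")
```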
Inchworm will assemble many long, high-coverage, full-length transcripts but, because of sequencing errors, will also report many relatively short, low-coverage artifacts (echoes of the dominant assembled transcript, with ~95% identity to it). To remove many of these artifacts, you can cluster the assemblies with CD-HIT or UCLUST.