De novo Assembly Using Inchworm

Running Inchworm

After installation, the simplest way to run Inchworm is on strand-specific, sense-oriented, FASTA-formatted sequences:

inchworm --reads $fasta_file  --run_inchworm

If your data are not strand-specific, add the --DS option:

inchworm --reads $fasta_file  --run_inchworm --DS

Sample data and an example Inchworm run are provided in the included test_inchworm/ directory.

By default, a k-mer length of 25 is used and only assemblies at least 100 bases in length are reported. These values can be changed with the options -K and -L, respectively. Many other options exist, but most are experimental and can be ignored for now.
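To make the k-mer length and the strand-specific versus double-stranded distinction concrete, here is a minimal Python sketch of k-mer counting. It is not Inchworm's implementation. The function names are hypothetical, and the sketch assumes that a double-stranded mode treats a k-mer and its reverse complement as the same k-mer, which is a common convention but is not confirmed by this document.

```python
from collections import Counter

# Translation table for complementing DNA bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def count_kmers(reads, k=25, double_stranded=False):
    """Count k-mers across reads (hypothetical helper, not part of Inchworm).

    With double_stranded=True, each k-mer is collapsed with its reverse
    complement by keeping the lexicographically smaller of the two.
    This is an assumption about what a --DS style mode implies.
    """
    counts = Counter()
    for read in reads:
        read = read.upper()
        # Reads shorter than k contribute no k-mers.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if double_stranded:
                kmer = min(kmer, reverse_complement(kmer))
            counts[kmer] += 1
    return counts


# Tiny example with k=3 so the result is easy to inspect by hand.
strand_specific = count_kmers(["ACGTT"], k=3)
both_strands = count_kmers(["ACGTT"], k=3, double_stranded=True)
```

In the strand-specific case the read yields three distinct k-mers (ACG, CGT, GTT). In the double-stranded case ACG and CGT collapse into one entry because they are reverse complements of each other.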

Extracting FastA sequences from FastQ files

RNA-Seq analysis often starts from FASTQ files rather than FASTA files. Since Inchworm takes FASTA files as input, tools are provided to convert FASTQ to FASTA and, for strand-specific data, to reorient sequences according to the transcribed orientation.

The included script util/fastQ_to_fastA.pl extracts the sequences from FASTQ files. If you have paired strand-specific reads, be sure to reverse-complement the proper fragment end before running Inchworm. For example, given two FASTQ files for paired strand-specific fragment reads that look like so:

========> /2 (right of sequenced fragment)
=======================================> (sense)
                           <============  /1  (left of sequenced fragment)

which is the expected product of paired-end strand-specific RNA-Seq performed at the Broad (where, yes, the left fragment is on the right side and reverse-complemented), we would do the following to prepare the reads for Inchworm:

util/fastQ_to_fastA.pl -I left.fq -a 1 --rev > left.fq.fa

util/fastQ_to_fastA.pl -I right.fq -a 2 > right.fq.fa

cat left.fq.fa right.fq.fa > both.senseOriented.fa

Then, use the both.senseOriented.fa file as the input for Inchworm de novo assembly.
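For readers who want to see what the conversion and reorientation steps amount to, the following Python sketch mimics them on in-memory data. It is not the logic of util/fastQ_to_fastA.pl. The function names are hypothetical, and the sketch assumes standard four-line FASTQ records and that the read-end number is appended to the read name as a /1 or /2 suffix.

```python
import io

# Translation table for complementing DNA bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def fastq_to_fasta(handle, end, revcomp=False):
    """Yield FASTA records from four-line FASTQ records.

    `end` is "1" or "2" and is appended to the read name as "/1" or "/2"
    unless already present (an assumption made for this sketch).
    With revcomp=True, each sequence is reverse-complemented.
    """
    lines = iter(handle)
    for header in lines:
        header = header.strip()
        if not header:
            continue  # tolerate blank lines between records
        if not header.startswith("@"):
            raise ValueError("expected FASTQ header, got: " + header)
        seq = next(lines, "").strip()
        next(lines, "")  # '+' separator line, ignored
        next(lines, "")  # quality line, ignored
        name = header[1:].split()[0]
        if not name.endswith("/" + end):
            name = name + "/" + end
        if revcomp:
            seq = reverse_complement(seq)
        yield ">" + name + "\n" + seq + "\n"


# Toy paired reads: the /1 read is antisense, so it is reverse-complemented.
left_fq = io.StringIO("@frag1\nAACG\n+\nIIII\n")
right_fq = io.StringIO("@frag1\nTTGA\n+\nIIII\n")
sense_oriented = (
    "".join(fastq_to_fasta(left_fq, "1", revcomp=True))
    + "".join(fastq_to_fasta(right_fq, "2"))
)
```

After this runs, sense_oriented holds both reads in FASTA format with the /1 read flipped to the sense orientation, which corresponds to the concatenated both.senseOriented.fa file described above.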

Post-processing to remove assembly artifacts

Inchworm will assemble many long, high-coverage, full-length transcripts but, because of sequencing errors, it will also report many relatively short, low-coverage artifacts. These appear as echoes of the dominant assembled transcript, with ~95% identity to it. Many of these artifacts can be removed with a sequence clustering tool such as CD-HIT or UCLUST.
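The idea behind this kind of filtering can be illustrated with a toy Python sketch: process contigs from longest to shortest and drop any contig that matches an already-kept longer contig at or above an identity threshold. This is not the algorithm used by CD-HIT or UCLUST, which rely on faster heuristics and gapped alignment. The function names and the ungapped identity measure are assumptions chosen to keep the example short.

```python
def best_ungapped_identity(short, long_seq):
    """Best fraction of matching bases of `short` over every ungapped
    placement inside `long_seq` (0.0 if `short` is empty or longer)."""
    n = len(short)
    if n == 0 or n > len(long_seq):
        return 0.0
    best = 0
    for offset in range(len(long_seq) - n + 1):
        window = long_seq[offset:offset + n]
        matches = sum(a == b for a, b in zip(short, window))
        best = max(best, matches)
    return best / n


def remove_redundant(contigs, min_identity=0.95):
    """Greedy filter over a dict of name -> sequence.

    Contigs are visited longest first. A contig is kept only if its best
    ungapped identity to every already-kept contig is below min_identity.
    """
    kept = []
    by_length = sorted(contigs.items(), key=lambda kv: len(kv[1]), reverse=True)
    for name, seq in by_length:
        if all(best_ungapped_identity(seq, k) < min_identity for _, k in kept):
            kept.append((name, seq))
    return dict(kept)


# A dominant transcript, a short "echo" with one mismatch, and an unrelated contig.
dominant = "ACGTACGTAGCTAGCTAGGATCCATGCAAGTCGATCGTAC"
echo = dominant[5:15] + ("A" if dominant[15] != "A" else "C") + dominant[16:25]
filtered = remove_redundant(
    {"dominant": dominant, "echo": echo, "other": "TTTTTTTTTTGGGGGGGGGG"}
)
```

Here the echo matches the dominant transcript at 19 of 20 positions, so it is dropped at the 0.95 threshold, while the unrelated contig is kept. For real assemblies, a dedicated tool is the practical choice because this all-against-all scan scales poorly.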
