Genome-guided Inchworm Assembly

Aligning RNA-Seq Reads to the Genome

Reads in FASTQ or FASTA format can be aligned to the genome using a short-read spliced aligner such as TopHat, GSNAP, or even BLAT. With longer short (>60 base) Illumina reads, we've had great success leveraging BLAT, and we include a BLAT alignment pipeline for generating such alignments. Ultimately, since transcript reconstruction will be performed using de novo assembly rather than alignment-assembly, the accuracy of the alignments such as inferred introns is not important. What is critical is that reads map to their proper location in the genome so that they can be grouped and leveraged for de novo sequence assembly. Regardless of what alignment approach you use, the end product should be a genome-coordinate sorted SAM file.

For long intron-containing genomes (intron lengths measured in kilobases or tens of kilobases), we recommend TopHat.
For short intron-containing genomes (intron lengths measured in tens or hundreds of bases), we recommend running our BLAT-based short-read alignment pipeline.
GSNAP may work fine for both types of genomes, but we haven't had as much experience in running it. We'll update our recommendations after having explored it more.

Genome-guided Inchworm RNA-Seq Assembly

Aligned reads are grouped into covered regions, and the reads in each group are targeted for de novo assembly. In the case of strand-specific RNA-Seq data, reads are grouped into strand-specific covered regions of the genome.

Sample data and pipeline execution are provided in the directory genome_guided_assembly/. These sample data correspond to Illumina 76 base strand-specific paired reads, generated using the dUTP method, with fragments sequenced in the FR library type as shown above.

Step 1: Partitioning aligned reads according to covered regions

To partition the aligned reads, run:

% $INCHWORM_HOME/bin/prep_rnaseq_alignments_for_genome_assisted_assembly.pl

########################################################################################################
#
#  Required:
#
#  --coordSorted_SAM <string>      coordinate-sorted SAM file.
#
#  -J  <int>                       value corresponding to the 75th percentile of intron lengths (coverage partitions within this length are joined)
#  -I  <int>                       maximum intron length  (reads with longer intron lengths are ignored)
#
#  *If Strand-specific, specify:
#  --SS_lib_type <string>          library type:  if single: F or R,  if paired:  FR or RF
#
#
#  Optional:
#
#  --jaccard_clip                 mitigates false fusions of genes.  ** requires paired-reads. **
#
#
########################################################################################################

The command run on the example data provided is:

%  $INCHWORM_HOME/bin/prep_rnaseq_alignments_for_genome_assisted_assembly.pl --coordSorted_SAM first100kreads.coordSorted.sam \
                                                                             --SS_lib_type RF -J 500 -I 500 --jaccard_clip

Running the above will distribute the reads into multiple FASTA-formatted files, with each file corresponding to the reads within a given partition. Files are distributed into Dir* directories. In the case strand-specific RNA-Seq is leveraged (indicated by specifying the —SS_lib_type parameter), the reads will first be separated by strand followed by partitioning.

Step 2: Inchworm assembly of partitioned reads

First, create a file listing the names of the files containing the partitioned reads:

%  find Dir* -regex ".*reads" > read_files.list

Generate a list of Inchworm assembly commands. In the case of strand-specific RNA-Seq data, run:

(strand-specific)  %  $INCHWORM_HOME/util/write_inchworm_cmds.pl  read_files.list > inchworm.cmds
 or
(non-strand-specific) % $INCHWORM_HOME/util/write_inchworm_cmds.pl  read_files.list DS > inchworm.cmds

The commands written to the inchworm.cmds file can be executed in parallel on a computing grid or executed serially.

For serial computing, you can run:

% $INCHWORM_HOME/util/cmd_process_forker.pl -c inchworm.cmds --CPU 1

If you have multiple CPUs, you may consider increasing the value of the —CPU parameter accordingly.

Once the assemblies are complete, collect the results of the assembled partitions into a single file like so:

% find Dir* -regex ".*iworm" -exec cat {} \; | $INCHWORM_HOME/util/inchworm_accession_incrementer.pl > iworm.genomeGuided.fasta