The Inchworm software distribution includes a pipeline that aligns RNA-Seq reads to the genome using BLAT, ultimately producing SAM (and BAM) files that can then be used for genome-guided Inchworm assembly. This alignment pipeline works best for genomes whose genes have short introns (plants, fungi, protozoa, etc.). For genomes with long introns, TopHat is recommended.

In addition to the Inchworm software, to use this pipeline you must install:

  • BLAT : Jim Kent's BLAT alignment software.

  • Samtools : utilities for operating on SAM and BAM formatted files.

Both blat and samtools (in particular, the psl2sam.pl script included with samtools) must be available on your PATH.
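A quick way to confirm the PATH requirement before running anything is sketched below. The check_tools helper is a hypothetical name introduced for illustration and is not part of the Inchworm distribution; it relies only on the standard `command -v` shell builtin.

```shell
# check_tools is a hypothetical helper, not part of Inchworm.
# It prints one "missing: NAME" line for each tool not found on PATH
# and returns non-zero if at least one tool is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Tools named in this section; psl2sam.pl ships with samtools.
check_tools blat samtools psl2sam.pl ||
    echo "Add the missing tools to PATH before running the pipeline." >&2
```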

Once the above tools are installed, the BLAT short-read alignment pipeline can be run as follows, starting from FASTA or FASTQ files containing either single or paired reads:

% $INCHWORM_HOME/bin/run_BLAT_shortReadPipeline.pl
################################################################################################################
#
#  --left and --right    (if paired reads)
#     or
#  --single              (if unpaired reads)
#
#  Required inputs:
#
#  --genome            multi-fasta file containing the genome sequences (should be named {refName}.fa )
#
#  --seqType          fa | fq    (fastA or fastQ format)
#
# Optional:
#
#  --SS_lib_type      strand-specific library type:  single: F or R  paired: FR or RF
#                                examples:  single RNA-Ligation method:  F
#                                           single dUTP method: R
#                                           paired dUTP method: RF
#
#  -I    maximum intron length  (default: 10000);
#
#  -o    output directory
#
#  --trim_short_terminal_segments     (trim off short terminal alignment segments that are mostly noise. Default: 10)
#
#  -P   min percent identity based on full sequence length  (default: 95)
#
#  --blat_top_hits  (default: 20 in paired mode, 1 in single mode)
#
#  -C  final top hits reported  (default: 1)  (only applies to paired mode)
#
#  If paired mode:
#
#     --max_dist_between_pairs             default (2000)
#
####################################################################################################################
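As an illustration of how these options combine, the sketch below prints (without running) one possible invocation for paired, strand-specific FASTQ reads from a dUTP library. The file names, the output directory, and the print_paired_cmd helper are assumptions made for this example; only options listed in the usage message above are used.

```shell
# Hypothetical input names; replace them with your own files.
GENOME="genome.fa"        # multi-FASTA genome, named {refName}.fa
LEFT="reads.left.fq"      # /1 reads
RIGHT="reads.right.fq"    # /2 reads
OUT_DIR="blat_out"        # passed to -o

# Print a paired-end, strand-specific (RF) command line for review.
# Nothing is executed; copy the printed line into your shell to run it.
print_paired_cmd() {
    printf '%s' '$INCHWORM_HOME/bin/run_BLAT_shortReadPipeline.pl'
    printf ' %s' --genome "$GENOME" --seqType fq \
                 --left "$LEFT" --right "$RIGHT" \
                 --SS_lib_type RF -I 10000 -o "$OUT_DIR"
    printf '\n'
}

print_paired_cmd
```

For unpaired reads, --left and --right would be replaced by --single, and --SS_lib_type would take F or R instead of FR or RF.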

Example data sets described below can be downloaded here as BLAT_short_read_alignment_pipeline-(datestamp).tgz.

Example data and pipeline execution are provided for:

  • paired reads (strand-specific, SS_lib_type: RF): example_BLAT_shortReadAlignmentPipeline/pairedSS. The strand-specific library type (SS_lib_type) of RF corresponds to the following, which results from the dUTP-based strand-specific sequencing method:

    ========> /2 (right of sequenced fragment)
    =======================================> (transcript fragment, sense orientation)
                               <============  /1  (left of sequenced fragment)
  • single (unpaired) reads (strand-specific, SS_lib_type: F): example_BLAT_shortReadAlignmentPipeline/singleSS, as generated by the RNA-ligation strand-specific sequencing method:

    ========> fragment end sequenced
    =======================================> (transcript fragment, sense orientation)

Visit those directories and execute runAlignments.sh in each to see the pipeline run.

The final output is a coordSorted.sam file and an equivalent binary .bam file. If strand-specific sequencing is specified via the SS_lib_type parameter, these files are additionally partitioned according to transcribed strand.
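The relationship between library type, read, and transcribed strand shown in the diagrams above can be sketched as a small lookup. This is an illustration of the orientation rules only, not the pipeline's actual partitioning code; transcribed_strand is a hypothetical helper, and the read-end value 0 for unpaired reads is a convention chosen for this example.

```shell
# transcribed_strand LIB_TYPE READ_END ALIGNED_STRAND
#   LIB_TYPE:        F or R (single), FR or RF (paired)
#   READ_END:        1 or 2 for paired reads, 0 for unpaired (assumed convention)
#   ALIGNED_STRAND:  + or -, the genome strand the read aligned to
# Prints the inferred transcribed strand (+ or -).
transcribed_strand() {
    lib_type=$1; read_end=$2; aligned=$3
    case "$aligned" in
        +|-) ;;
        *) echo "aligned strand must be + or -" >&2; return 1 ;;
    esac
    case "$lib_type:$read_end" in
        F:0|FR:1|RF:2) same=1 ;;   # read is in the sense orientation
        R:0|FR:2|RF:1) same=0 ;;   # read is antisense to the transcript
        *) echo "unsupported combination: $lib_type read $read_end" >&2
           return 1 ;;
    esac
    if [ "$same" -eq 1 ]; then
        echo "$aligned"
    elif [ "$aligned" = "+" ]; then
        echo "-"
    else
        echo "+"
    fi
}
```

For example, with an RF library, `transcribed_strand RF 1 +` prints `-`, because the /1 read is antisense to the transcript, while `transcribed_strand RF 2 +` prints `+`.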

Note
The coordinate-sorted SAM file is compatible with the Cufflinks software for alignment-based transcript reconstruction, in addition to being used with the Inchworm Genome-Guided De novo Transcript Assembly pipeline.
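Before handing a SAM file to a downstream tool, it can be useful to confirm that it really is coordinate-sorted. The sketch below is a minimal check using awk; is_coord_sorted is a hypothetical helper, not part of Inchworm or samtools, and it only verifies that each reference's records are contiguous and in non-decreasing position order.

```shell
# is_coord_sorted FILE.sam
# Returns 0 if alignments are grouped by reference (column 3) and ordered
# by non-decreasing position (column 4) within each reference, 1 otherwise.
is_coord_sorted() {
    awk -F '\t' '
        /^@/ { next }                                # skip header lines
        {
            ref = $3; pos = $4 + 0
            if (ref != prev_ref) {
                if (ref in seen) { bad = 1; exit }   # reference reappears
                seen[ref] = 1
                prev_ref = ref
            } else if (pos < prev_pos) {
                bad = 1; exit                        # position went backwards
            }
            prev_pos = pos
        }
        END { exit bad + 0 }
    ' "$1"
}

# Usage (the path is a placeholder):
# is_coord_sorted path/to/your.coordSorted.sam && echo "coordinate-sorted"
```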