Trinity is run via the script Trinity.pl, found in the base installation directory.
Usage info is as follows:
###############################################################################
#
# ______ ____ ____ ____ ____ ______ __ __
# | || \ | || \ | || || | |
# | || D ) | | | _ | | | | || | |
# |_| |_|| / | | | | | | | |_| |_|| ~ |
# | | | \ | | | | | | | | | |___, |
# | | | . \ | | | | | | | | | | |
# |__| |__|\_||____||__|__||____| |__| |____/
#
###############################################################################
#
# Required:
#
# --seqType <string> :type of reads: ( fa, or fq )
#
# --JM <string> :(Jellyfish Memory) number of GB of system memory to use for
# k-mer counting by jellyfish (eg. 10G) *include the 'G' char
#
# If paired reads:
# --left <string> :left reads, one or more (separated by space)
# --right <string> :right reads, one or more (separated by space)
#
# Or, if unpaired reads:
# --single <string> :single reads, one or more (note, if single file contains pairs, can use flag: --run_as_paired )
#
####################################
## Misc: #########################
#
# --SS_lib_type <string> :Strand-specific RNA-Seq read orientation.
# if paired: RF or FR,
# if single: F or R. (dUTP method = RF)
# See web documentation.
#
# --output <string> :name of directory for output (will be
# created if it doesn't already exist)
# default( "/Users/bhaas/SVN/trinityrnaseq/trunk/trinity_out_dir" )
# --CPU <int> :number of CPUs to use, default: 2
# --min_contig_length <int> :minimum assembled contig length to report
# (def=200)
# --genome_guided :set to genome guided mode, only retains assembly fasta file.
# --jaccard_clip :option, set if you have paired reads and
# you expect high gene density with UTR
# overlap (use FASTQ input file format
# for reads).
# (note: jaccard_clip is an expensive
# operation, so avoid using it unless
# necessary due to finding excessive fusion
# transcripts w/o it.)
#
# --prep :Only prepare files (high I/O usage) and stop before kmer counting.
#
# --no_cleanup :retain all intermediate input files.
# --full_cleanup :only retain the Trinity fasta file, rename as ${output_dir}.Trinity.fasta
#
# --cite :show the Trinity literature citation
#
# --version :reports Trinity version (BLEEDING_EDGE) and exits.
#
####################################################
# Inchworm and K-mer counting-related options: #####
#
# --min_kmer_cov <int> :min count for K-mers to be assembled by
# Inchworm (default: 1)
# --inchworm_cpu <int> :number of CPUs to use for Inchworm, default is min(6, --CPU option)
#
# --no_run_inchworm :stop after running jellyfish, before inchworm.
#
###################################
# Chrysalis-related options: ######
#
# --max_reads_per_graph <int> :maximum number of reads to anchor within
# a single graph (default: 200000)
# --no_run_chrysalis :stop Trinity after Inchworm and before
# running Chrysalis
# --no_run_quantifygraph :stop Trinity just before running the
# parallel QuantifyGraph computes, to
# leverage a compute farm and massively
# parallel execution..
#
# --chrysalis_output <string> :name of directory for chrysalis output (will be
# created if it doesn't already exist)
# default( "chrysalis" )
#
# --no_bowtie :don't run bowtie to use pair info in chrysalis clustering.
#
#####################################
### Butterfly-related options: ####
#
# --bfly_opts <string> :additional parameters to pass through to butterfly
# (see butterfly options: java -jar Butterfly.jar ).
# (note: only for expert or experimental use. Commonly used parameters are exposed through this Trinity menu here).
#
# //
# Alternative reconstruction modes:
# Default mode is the 'regular' Butterfly transcript reconstruction by graph node extension.
#
# --PasaFly PASA-like algorithm for maximally-supported isoforms (conservative reconstructions, fewer isoforms)
# or
# --CuffFly Cufflinks-like algorithm to report minimum transcripts (fewest isoforms)
#
#
# Butterfly read-pair grouping settings (used for all reconstruction modes to define 'pair paths'):
#
# --group_pairs_distance <int> :maximum length expected between fragment pairs (default: 500)
# (reads outside this distance are treated as single-end)
#
# ///
# Butterfly default reconstruction mode settings. (no CuffFly or PasaFly custom settings are currently available).
#
# --path_reinforcement_distance <int> :minimum overlap of reads with growing transcript
# path (default: PE: 75, SE: 25)
# Set to 1 for the most lenient path extension requirements.
#
# --triplet_lock : (increase stringency of regular butterfly reconstruction)
# lock triplet-supported nodes: node 'c' having read path 'A-B-C' disables 'Z-B-C' if no such read support exists.
#
# --extended_lock : (further increase the stringency of regular butterfly reconstruction)
# extend the triplet lock to include longer range read path information.
# ex. in extending path 'A-B-Z' to 'A-B-Z-D', we only find read support for 'A-B-C-D', that 'A-B-Z' extension to 'D' will be blocked.
# (assumes --triplet_lock)
#
# /
# Butterfly transcript reduction settings:
#
# --no_path_merging : all transcript candidates are output (including SNP variations, however, some SNPs may be unphased)
#
# By default, alternative transcript candidates are merged (in reality, discarded) if they are found to be too similar, according to the following logic:
#
# (identity=(numberOfMatches/shorterLen) > 95.0% or if we have <= 2 mismatches) and if we have internal gap lengths <= 10
#
# with parameters as:
#
# --min_per_id_same_path <int> default: 95 min percent identity for two paths to be merged into single paths
# --max_diffs_same_path <int> default: 2 max allowed differences encountered between path sequences to combine them
# --max_internal_gap_same_path <int> default: 10 maximum number of internal consecutive gap characters allowed for paths to be merged into single paths.
#
# If, in a comparison between two alternative transcripts, they are found too similar, the transcript with the greatest cumulative
# compatible read (pair-path) support is retained, and the other is discarded.
#
#
# //
# Butterfly Java and parallel execution settings.
#
# --bflyHeapSpaceMax <string> :java max heap space setting for butterfly
# (default: 10G) => yields command
# 'java -Xmx10G -jar Butterfly.jar ... $bfly_opts'
# --bflyHeapSpaceInit <string> :java initial heap space settings for
# butterfly (default: 1G) => yields command
# 'java -Xms1G -jar Butterfly.jar ... $bfly_opts'
# --bflyGCThreads <int> :threads for garbage collection
# (default, not specified, so java decides)
# --bflyCPU <int> :CPUs to use (default will be normal
# number of CPUs; e.g., 2)
# --bflyCalculateCPU :Calculate CPUs based on 80% of max_memory
# divided by maxbflyHeapSpaceMax
# --no_run_butterfly :stops after the Chrysalis stage. You'll
# need to run the Butterfly computes
# separately, such as on a computing grid.
# Then, concatenate all the Butterfly assemblies by running:
# 'find trinity_out_dir/ -name "*allProbPaths.fasta"
# -exec cat {} + > trinity_out_dir/Trinity.fasta'
#
#################################
# Grid-computing options: #######
#
# --grid_computing_module <string> : Perl module in /Users/bhaas/SVN/trinityrnaseq/trunk/PerlLibAdaptors/
# that implements 'run_on_grid()'
# for naively parallel cmds. (eg. 'BroadInstGridRunner')
#
#
###############################################################################
#
# *Note, a typical Trinity command might be:
# Trinity.pl --seqType fq --JM 100G --left reads_1.fq --right reads_2.fq --CPU 6
#
# see: /Users/bhaas/SVN/trinityrnaseq/trunk/sample_data/test_Trinity_Assembly/
# for sample data and 'runMe.sh' for example Trinity execution
# For more details, visit: http://trinityrnaseq.sf.net
#
###############################################################################
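The transcript-merging rule quoted under "Butterfly transcript reduction settings" in the usage text above can be read as a simple predicate. The following is a minimal sketch of that reading, not Butterfly's actual implementation: the function name paths_too_similar and its four positional arguments are hypothetical, and only the default thresholds (95, 2, 10) come from the usage text.

```shell
# Sketch of the merge rule as stated in the usage text:
#   (identity > 95% OR mismatches <= 2) AND longest internal gap <= 10
# paths_too_similar is a hypothetical helper; it assumes shorter_len > 0.
# Arguments: matches shorter_len mismatches longest_internal_gap
# Exit status 0 means "too similar, would be merged".
paths_too_similar() {
    matches=$1 shorter_len=$2 mismatches=$3 max_gap=$4
    awk -v m="$matches" -v s="$shorter_len" -v d="$mismatches" -v g="$max_gap" \
        -v min_id=95 -v max_diffs=2 -v max_gap_allowed=10 '
        BEGIN {
            identity = 100 * m / s    # percent identity over the shorter path
            similar = (identity > min_id || d <= max_diffs) && g <= max_gap_allowed
            exit (similar ? 0 : 1)
        }'
}

# Two candidate paths with 96 matches over a shorter length of 100,
# 4 mismatches and no internal gap.
if paths_too_similar 96 100 4 0; then
    echo "too similar: keep only the better-supported path"
else
    echo "distinct: report both paths"
fi
```

Under this reading, a long internal gap blocks merging even when identity is high, which is why the gap test sits outside the parenthesized "or".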
Note: Trinity performs best with strand-specific data, in which case sense and antisense transcripts can be resolved. For protocols on strand-specific RNA-Seq, see: Borodina T, Adjaye J, Sultan M. A strand-specific library preparation protocol for RNA sequencing. Methods Enzymol. 2011;500:79-98. PubMed PMID: 21943893.
If you have strand-specific data, specify the library type. There are four library types:
Paired reads:
  RF: the first read (/1) of the fragment pair is sequenced as antisense (reverse, R), and the second read (/2) is in the sense strand (forward, F); typical of the dUTP/UDG sequencing method.
  FR: the first read (/1) of the fragment pair is sequenced as sense (forward, F), and the second read (/2) is in the antisense strand (reverse, R).
Unpaired (single) reads:
  F: the single read is in the sense (forward) orientation.
  R: the single read is in the antisense (reverse) orientation.
By setting the --SS_lib_type parameter to one of the above, you are indicating that the reads are strand-specific. By default, reads are treated as not strand-specific.
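As an illustration of passing the library type on the command line, the sketch below assembles (but does not run) a Trinity.pl command for strand-specific paired reads. The helper name build_ss_cmd and the read file names are assumptions made for the example; the flags themselves are taken from the usage text above.

```shell
# Sketch: print a Trinity.pl command for strand-specific paired reads.
# build_ss_cmd is a hypothetical helper; file names are placeholders.
build_ss_cmd() {
    lib_type=$1   # RF or FR for paired reads
    left=$2
    right=$3
    case "$lib_type" in
        RF|FR) ;;
        *) echo "paired reads require --SS_lib_type RF or FR" >&2; return 1 ;;
    esac
    printf 'Trinity.pl --seqType fq --JM 10G --left %s --right %s --SS_lib_type %s --CPU 6\n' \
        "$left" "$right" "$lib_type"
}

# dUTP-style library: first read antisense, second read sense.
build_ss_cmd RF reads_1.fq reads_2.fq
```

For single-end data the same flag would take F or R instead, with --single replacing --left and --right.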
Other important considerations:
Whether you use FASTQ- or FASTA-formatted input files, keep the reads oriented as they are reported by Illumina if the data are strand-specific, because Trinity will orient the sequences according to the specified library type. If the data are not strand-specific, no worries: the reads will be parsed in both orientations.
If you have both paired and unpaired data, and the data are NOT strand-specific, you can combine the unpaired data with the left reads of the paired fragments. Be sure that the unpaired reads have /1 as a suffix on the accession value, like the left fragment reads. The right fragment reads should all have /2 as the accession suffix. Then, run Trinity using the --left and --right parameters as if all the data were paired.
If you have paired-end libraries with several fragment sizes, set --group_pairs_distance according to the library with the largest insert size. Pairings that exceed that distance will be treated as if they were unpaired by the Butterfly process.
By setting the --CPU option, you indicate the maximum number of threads to be used by processes within Trinity. Note that Inchworm alone is capped at 6 threads, since performance does not improve for this step beyond that setting.
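The consideration above about combining unpaired reads with the left reads can be sketched as follows. This is one possible approach, assuming 4-line FASTQ records; the helper name add_left_suffix and all file names are hypothetical, and the demo data are made up for illustration.

```shell
# Sketch: give unpaired FASTQ reads a /1 accession suffix, then append
# them to the left reads. Assumes 4-line FASTQ records.
# add_left_suffix is a hypothetical helper, not part of Trinity.
add_left_suffix() {
    awk 'NR % 4 == 1 {
             split($0, parts, " ")
             acc = parts[1]                        # accession is the first token
             rest = substr($0, length(acc) + 1)    # any description after it
             if (acc !~ /\/1$/) acc = acc "/1"     # leave existing /1 untouched
             print acc rest
             next
         }
         { print }' "$1"
}

# Demo on tiny made-up files in a temporary directory.
tmp=$(mktemp -d)
printf '@u1\nACGT\n+\nIIII\n' > "$tmp/unpaired.fq"
printf '@p1/1\nTTGA\n+\nIIII\n' > "$tmp/reads_1.fq"
add_left_suffix "$tmp/unpaired.fq" | cat "$tmp/reads_1.fq" - > "$tmp/left_combined.fq"
cat "$tmp/left_combined.fq"
rm -rf "$tmp"
```

The combined file would then be passed to --left, with the original right reads passed to --right. Only header lines are rewritten (every fourth line starting from the first), so quality strings that happen to begin with '@' are left alone.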
A typical Trinity command for assembling non-strand-specific RNA-seq data, running the entire process on a single high-memory server, is shown below (aim for roughly 1G of RAM per 1M ~76-base Illumina paired reads, although often much less memory is required):
Trinity.pl --seqType fq --JM 10G --left reads_1.fq --right reads_2.fq --CPU 6
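The memory rule of thumb above is simple arithmetic, and a small helper can apply it to a read count. The sketch below assumes that "1M paired reads" means one million read pairs, i.e. one million records in the left FASTQ file; that interpretation and both helper names are assumptions, and the result is a rough sizing guide for the server rather than a value for any particular Trinity option.

```shell
# Sketch: rough RAM estimate from the rule of thumb
# (about 1G of RAM per 1M read pairs). Helper names are hypothetical.

# count_fastq_records FILE: number of 4-line FASTQ records in FILE
count_fastq_records() {
    awk 'END { print int(NR / 4) }' "$1"
}

# ram_gb_for_pairs N: N read pairs -> whole gigabytes, rounded up, minimum 1
ram_gb_for_pairs() {
    gb=$(( ($1 + 999999) / 1000000 ))
    if [ "$gb" -lt 1 ]; then gb=1; fi
    echo "$gb"
}

# Example: 30 million read pairs.
echo "suggested RAM: $(ram_gb_for_pairs 30000000)G"
```

With real data, the two helpers could be chained as ram_gb_for_pairs "$(count_fastq_records reads_1.fq)". As the text notes, the actual requirement is often much lower.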
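Example data and a sample pipeline are provided with the distribution; the usage information above points to sample_data/test_Trinity_Assembly/ and its 'runMe.sh' script.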
When Trinity completes, it will create a Trinity.fasta output file in the trinity_out_dir/ output directory (or the output directory you specify).
Obtain basic stats for the number of transcripts, components, and contig N50 value by running:
% $TRINITY_HOME/util/TrinityStats.pl trinity_out_dir/Trinity.fasta
Total trinity transcripts: 9351
Total trinity components: 8695
Contig N50: 1585
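For readers unfamiliar with the N50 statistic reported above, the sketch below computes it from a FASTA file using the standard definition: the length of the contig at which the running total of lengths, taken from longest to shortest, first reaches half of the total assembled bases. This is an independent illustration, not the code of TrinityStats.pl, whose details may differ; the function name contig_n50 and the demo sequences are hypothetical.

```shell
# Sketch: contig N50 of a FASTA file under the standard definition.
# contig_n50 is a hypothetical helper, not part of Trinity.
contig_n50() {
    # 1) one sequence length per line (sequences may span several lines)
    awk '/^>/ { if (len) print len; len = 0; next }
         { len += length($0) }
         END { if (len) print len }' "$1" |
    # 2) longest first
    sort -rn |
    # 3) walk down until the running sum covers half of all bases
    awk '{ lens[NR] = $1; total += $1 }
         END {
             run = 0
             for (i = 1; i <= NR; i++) {
                 run += lens[i]
                 if (run * 2 >= total) { print lens[i]; exit }
             }
         }'
}

# Demo on a tiny made-up FASTA with contig lengths 2, 3, 4 and 5.
tmp=$(mktemp -d)
printf '>a\nAC\n>b\nACG\n>c\nACGT\n>d\nACG\nTT\n' > "$tmp/demo.fasta"
echo "Contig N50: $(contig_n50 "$tmp/demo.fasta")"
rm -rf "$tmp"
```

In the demo the total is 14 bases; the two longest contigs (5 and 4) together cover 9, which passes the halfway mark of 7, so the N50 is the length of the second of them.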
After obtaining Trinity transcripts, there are downstream processes available to further explore these data.