http://atgc.lirmm.fr/lordec/README.html
Program for correcting sequencing errors in PacBio reads using highly accurate short reads (e.g. Illumina).
L. Salmela and E. Rivals. LoRDEC: accurate and efficient long read error correction. Bioinformatics 30(24):3506-3514, 2014.
Access: http://bioinformatics.oxfordjournals.org/content/30/24/3506
LoRDEC has been tested on Linux. Compiling the program requires gcc version 4.5 or newer, the Boost C++ libraries (e.g. libboost1.48-dev package or newer), and the GATB Core library.
lordec-correct [parameters]
Required parameters:
-2, --shortreads=<short read FASTA/Q files or prebuilt DBG file without .h5 extension>
-i, --longreads=<long read FASTA/Q file>
-k, --kmerlen=<k-mer size>
-o, --correctedreadfile=<output file for corrected long reads>
-s, --solidthreshold=<solidity abundance threshold for k-mers>
Optional parameters:
-t, --trials=<number of target k-mers> Default: 5
-b, --branch=<maximum number of branches to explore> Default: 200
-e, --errorrate=<maximum error rate> Default: 0.4
-T, --threads=<number of threads> Default: use all cores
-S, --statfile=<path statistics file> Default: not generated
The input FASTA/Q files can be compressed. Several Illumina files can be specified as a comma-separated list (e.g. reads1.fa,reads2.fq,reads3.fq.gz).
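When the list of short read files is assembled by a script, the comma-separated value for -2 can be built programmatically. The following is a minimal sketch and not part of LoRDEC: the helper name build_correct_command is hypothetical, and it uses only the required options documented above.

```python
def build_correct_command(short_reads, long_reads, k, solid, out):
    """Return an argument list for lordec-correct (hypothetical helper)."""
    if not short_reads:
        raise ValueError("at least one short read file is required")
    for path in short_reads:
        # A comma inside a file name would be read as a list separator.
        if "," in path:
            raise ValueError(f"file name contains a comma: {path!r}")
    return [
        "lordec-correct",
        "-2", ",".join(short_reads),  # comma-separated list of short read files
        "-k", str(k),
        "-s", str(solid),
        "-i", long_reads,
        "-o", out,
    ]


cmd = build_correct_command(
    ["reads1.fa", "reads2.fq", "reads3.fq.gz"],
    "pacbio.fasta", 19, 3, "pacbio-corrected.fasta",
)
print(" ".join(cmd))
```

If lordec-correct is on the PATH, the resulting list can be handed to subprocess.run(cmd, check=True).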
LoRDEC writes the corrected reads to the given file in FASTA format. Regions that remain weak after the correction are written in lower case characters and solid regions are written in upper case characters.
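Because the case of each base encodes whether it lies in a solid or a weak region, a corrected file can be summarised with a few lines of scripting. The sketch below is an illustration only: the function names parse_fasta and solid_fraction are hypothetical, and the example records are made up to show the calculation.

```python
def parse_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def solid_fraction(seq):
    """Fraction of bases written in upper case (solid regions)."""
    if not seq:
        return 0.0
    return sum(1 for base in seq if base.isupper()) / len(seq)


# Made-up records standing in for a corrected reads file.
example = [">read1", "acgtACGTAC", "GTac", ">read2", "ACGT"]
fractions = {name: solid_fraction(seq) for name, seq in parse_fasta(example)}
for name, fraction in fractions.items():
    print(f"{name}\t{fraction:.2f}")
```

To apply it to a real file, pass an open file handle to parse_fasta instead of the example list.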
To trim the weak regions from the beginning and end of the corrected reads:
lordec-trim -i <corrected reads file> -o <trimmed reads file>
To trim all weak regions and split the reads on inner weak regions:
lordec-trim-split -i <corrected reads file> -o <trimmed reads file>
The read names of the trimmed and split reads consist of two parts separated by an underscore. The first part is the name of the original read and the second part is a running index of the extracted solid regions from that read.
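The difference between trimming and trim-splitting, and the naming of the split reads, can be illustrated on a single sequence. This is a sketch of the behaviour described above, not LoRDEC's implementation: the function names are hypothetical, and the starting value of the running index (first_index) is an assumption, since it is not stated here.

```python
import re


def trim_weak_ends(seq):
    """Remove lower-case bases from both ends; inner weak regions are kept."""
    return re.sub(r"^[a-z]+|[a-z]+$", "", seq)


def split_solid_regions(name, seq, first_index=1):
    """Return (name_index, region) for every upper-case run in seq.

    first_index is an assumption; check the output of lordec-trim-split
    for the numbering it actually uses.
    """
    regions = re.findall(r"[A-Z]+", seq)
    return [(f"{name}_{i}", region)
            for i, region in enumerate(regions, start=first_index)]


def parse_split_name(split_name):
    """Split on the last underscore, so original names may contain underscores."""
    original, _, index = split_name.rpartition("_")
    return original, int(index)


seq = "acgtACGTacACGTTac"
trimmed = trim_weak_ends(seq)
pieces = split_solid_regions("read_7", seq)
print(trimmed)
print(pieces)
print(parse_split_name(pieces[-1][0]))
```

Splitting on the last underscore matters because the original read name may itself contain underscores.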
To generate statistics on solid and weak k-mers:
lordec-stat -2 <Short read FASTA/Q file> -k <k-mer size> -s <solid k-mer threshold> -i <PacBio FASTA/Q file> -S <output stat file> [-T <number of threads>]
The format of the output statistics file is as follows. There is one line for each read with the following information:
LoRDEC can generate statistics on the explored paths while correcting reads. To turn on the path statistics, run LoRDEC with the additional parameter -S, --statfile=<path statistics file>.
Be warned that the path statistics file can be huge when running LoRDEC on large data sets. The format of the file is as follows. Lines of the form solid[i]=<position> give the position of the source solid k-mer. When LoRDEC runs with a single thread, the lines that follow describe paths with that k-mer as source. With more threads, the lines are interleaved in no particular order. For each path, a line with 5 fields is written:
To correct long reads or to generate k-mer statistics, LoRDEC builds a de Bruijn graph from the short reads file. The lordec-build-SR-graph program builds the graph and saves it to a file beforehand, so that later analyses can load the graph file instead of recomputing the graph from the short read file. This saves time if you reuse the graph several times. The graph is saved in Hierarchical Data Format version 5 (HDF5).
lordec-build-SR-graph [-T <number of threads>] -2 <FASTA file> -k <k-mer size> -s <solid k-mer threshold> -g <out graph file>
This command reads the <FASTA file> of short reads, then builds and saves their de Bruijn graph for k-mers of length <k-mer size> that occur at least <solid k-mer threshold> times.
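The role of the k-mer size and the solid k-mer threshold can be illustrated by counting k-mers directly. The sketch below is not how LoRDEC or GATB store the graph; it only shows which k-mers would count as solid for a given -k and -s. It assumes that a k-mer and its reverse complement are counted together, which is common in de Bruijn graph tools but is an assumption here, and the function names are hypothetical.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def solid_kmers(reads, k, threshold):
    """Return the canonical k-mers that occur at least `threshold` times."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Skip k-mers containing ambiguous bases such as N.
            if set(kmer) <= set("ACGT"):
                counts[canonical(kmer)] += 1
    return {kmer for kmer, n in counts.items() if n >= threshold}


# Toy short reads; real data sets would use values such as k=19, threshold=3.
reads = ["ACGTA", "ACGTT", "CCGTA"]
solid = solid_kmers(reads, k=3, threshold=2)
print(sorted(solid))
```

Raising the threshold discards more low-abundance k-mers, which are the ones most likely to come from sequencing errors in the short reads.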
Below, we provide simple examples of command lines for running the programs of this package.
lordec-correct -2 illumina.fasta -k 19 -s 3 -i pacbio.fasta -o pacbio-corrected.fasta
ill-test-5K-1.fa
ill-test-5K-2.fa
lordec-correct -2 meta-file.txt -k 19 -s 3 -i pacbio-mini.fa -o my-corrected-pacbio-reads.fa &> mylog.log
lordec-trim -i pacbio-corrected.fasta -o pacbio-corrected-trim.fasta
lordec-trim-split -i pacbio-corrected.fasta -o pacbio-corrected-trim-split.fasta
lordec-stat -2 illumina.fasta -k 19 -s 3 -i pacbio-corrected.fasta -S pacbio-corrected-stats.txt
lordec-build-SR-graph -2 illumina.fasta -k 19 -s 3 -g illumina-19-3.h5
Fixed a bug which can cause overflow of stack allocated memory.
Allowing multiple Illumina files: Multiple short read files can now be given as a comma-separated list.
By default GATB 1.0.5 is used. If you wish to link against older GATB use the compiler flag -DOLDGATB.
Options have changed and they are now parsed with getopt.
Generating path statistics no longer requires recompiling.
Maximum read length increased to 500000.
Clarified usage for prebuilding DBG.
The code is compatible with GATB Core 1.0.4.