NCBI SRA Format Conversion

都建树

2023-12-01

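NCBI recently converted its sequence data to the *.sra format to save storage space and no longer offers *.fastq.gz, so a dedicated conversion tool is needed to convert downloaded *.sra data files.

Conversion tool download page: http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software

Format conversion on Windows:

I used the MS Windows 32 bit architecture build.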

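1. cd into your sratoolkit.2.0.1rc1-win32 folder. Alternatively, create a separate temporary folder, copy the fastq-dump program into it, cd into that folder, and put the data files there as well.

2. Then run this command: fastq-dump ERR022480.lite.sra The generated FASTQ files appear in the same folder (usually three fastq files), which makes converting sra to fastq straightforward.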
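
To confirm that the conversion produced the files you expect, a small check like the one below may help. It is only a sketch: it assumes the outputs follow the `<name>.fastq`, `<name>_1.fastq`, `<name>_2.fastq` naming pattern listed under the -A option, and the helper names `expected_fastq_names` and `find_outputs`, as well as the base name `ERR022480`, are illustrative assumptions rather than part of the SRA Toolkit.

```python
from pathlib import Path


def expected_fastq_names(base):
    """Return the three file names fastq-dump may write for one run.

    Assumption: names follow the <base>.fastq, <base>_1.fastq,
    <base>_2.fastq pattern; the real base name can depend on the
    toolkit version and on the input file name.
    """
    return [f"{base}.fastq", f"{base}_1.fastq", f"{base}_2.fastq"]


def find_outputs(folder, base):
    """Return the expected FASTQ file names that actually exist in `folder`."""
    folder = Path(folder)
    return [name for name in expected_fastq_names(base)
            if (folder / name).is_file()]


if __name__ == "__main__":
    # Hypothetical base name derived from the example ERR022480.lite.sra.
    found = find_outputs(".", "ERR022480")
    print("FASTQ files found:", found or "none")
```

Run it from the folder that holds the converted data; an empty result suggests either the conversion failed or the files were written under a different base name.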

Ubuntu Linux 32 bit architecture:

Conversion command
$ fastq-dump -A <SRR_accession> -D <Path_to_SRR_Directory> -O <Output_Path>
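
When several runs need converting, the command above can be assembled programmatically. The following is a minimal sketch, assuming the -A, -D and -O flags behave as in the toolkit version described here (newer releases may differ). The function names, the accession and the paths are hypothetical placeholders.

```python
import shutil
import subprocess


def build_fastq_dump_cmd(accession, sra_dir, out_dir):
    """Assemble the fastq-dump argument list used in this article.

    Assumption: -A names the accession, -D points at the run directory
    and -O sets the output directory, as in the command shown above.
    """
    return ["fastq-dump", "-A", accession,
            "-D", str(sra_dir), "-O", str(out_dir)]


def convert(accession, sra_dir, out_dir):
    """Run fastq-dump if it is on PATH; return True on success."""
    cmd = build_fastq_dump_cmd(accession, sra_dir, out_dir)
    if shutil.which(cmd[0]) is None:
        # Tool not installed: show the command instead of failing.
        print("fastq-dump not found on PATH; would run:", " ".join(cmd))
        return False
    return subprocess.run(cmd).returncode == 0


if __name__ == "__main__":
    # Hypothetical accession and paths, for illustration only.
    convert("SRR000001", "/data/sra/SRR000001", "/data/fastq")
```

Passing the arguments as a list avoids shell quoting problems when a path contains spaces.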
Basic command options

Command	Description
‘-A’ or ‘--accession’	Enables modification of the output name used for the fastq files. For example: fastq-dump -A foo SRR000001 will produce files named ‘foo.fastq’, ‘foo_1.fastq’, and ‘foo_2.fastq’.
‘-D’ or ‘--table-path’	Makes the archive path more explicitly specified, thus preventing confusion when more than one option is specified. These two commands produce the same files: fastq-dump ~/SRR000001 and fastq-dump -D ~/SRR000001. However, the first command below will fail while the second will succeed: fastq-dump -C ~/SRR000001 and fastq-dump -C -D ~/SRR000001 (the ‘-C’ option is explained further below).
‘-N’ or ‘--minSpotId’	Minimum spot number at which to start the dump process
‘-X’ or ‘--maxSpotId’	Maximum spot number at which to stop the dump process. For example: fastq-dump -N 5 -X 10 SRR000001 This command will dump six spots starting from spot ‘SRR000001.5’ and ending in spot ‘SRR000001.10’. Filtered spots can result in fewer than (maxSpotId - minSpotId + 1) total spots output.
‘-G’ or ‘--spot-group’	Boolean option that results in fastq files divided into spot groups as defined in the Experiment (or eventually Run) xml. This command: fastq-dump -G SRR051894 Produces these five fragment files: `SRR051894.fastq` `SRR051894_GDSX2KN04_PSORIASISMDA-POOL-738_CB028-01WG.fastq` `SRR051894_GDSX2KN04_PSORIASISMDA-POOL-738_CB036-01WG.fastq` `SRR051894_GDSX2KN04_PSORIASISMDA-POOL-738_CD021-01WG.fastq` `SRR051894_GDSX2KN04_PSORIASISMDA-POOL-738_CD036-01WG.fastq`
‘-T’ or ‘--group-in-dirs’	Boolean option directing the utility to produce fastq files in sub-directories rather than producing files within the same directory
‘-O’ or ‘--outdir’	Indicates the directory where the fastq result should be placed. For example: fastq-dump -O /tmp -T SRR000001 will create a directory, SRR000001, in /tmp containing these entries: `/tmp/SRR000001/1/fastq` `/tmp/SRR000001/2/fastq` `/tmp/SRR000001/fastq`
‘-K’ or ‘--keep-empty-files’	Has no effect - at one time this option would represent all three possible files even if one or two were empty
‘-M’ or ‘--minReadLen’	Allows specification of the desired minimum read length to output (default is 25). The command ‘fastq-dump -M 0 SRR000001’ prevents any filtering based on read length.
‘-W’ or ‘--noclip’	Prevents clipping of a spot sequence based on the right clip information. Toggling ‘show-clipped’ in the ‘customize’ area for reads in the SRA Run Browser enables observing the effect of this option (e.g. see SRR000001).
‘-F’ or ‘--origfmt’	Results in fastq containing only the original identifier on the defline (i.e. neither the length nor the SRR identifier is present)
‘-C’ or ‘--dumpcs’	Forces color space sequence to be dumped instead of base space. If the optional ‘cskey’ is provided (i.e. A, C, T, or G), then all fastq files produced will use that key at the start of each color space sequence.
‘-B’ or ‘--dumpbase’	Forces base space sequence to be dumped instead of color space.
‘-Q’ or ‘--offset’	Allows using a different offset value to represent a different offset character in the fastq output. For example, using an offset of 64 represents using ‘@’ as the offset character.
‘-I’ or ‘--readids’	Appends a read index to the run identifier starting with ‘1’ as the first index. Note that this differs from the spot descriptor in the Experiment xml where the read indices start with ‘0’. In the case of SRR000001, the first spot in each file would have the identifiers ‘SRR000001.5.4’, ‘SRR000001.1.2’, and ‘SRR000001.1.4’. Note that the first spot sequence in SRR000001.fastq, the fragment file, comes from the second biological/application read which has an index of ‘4’.
‘-E’ or ‘--no_qual_filter’	This option turns off quality filtering based on leading/trailing low quality values. As reads have become longer this option has become a more viable alternative.
‘-SF’ or ‘--complete’	Outputs the separated reads into a single file. For example, the command: fastq-dump -SF SRR029338 Results in the first eight lines of the file, SRR029338.fastq, containing: `@SRR029338.1 080115_EAS112_0034:8:1:615:780 length=36` `GGTTGAGTAAAGTGTCTAAAGGCATAGCCTGATTAT` `+SRR029338.1 080115_EAS112_0034:8:1:615:780 length=36` `IIIIIIIIIIIIIIIIIIIAIIA<I8I+7I9+II2I` `@SRR029338.1 080115_EAS112_0034:8:1:615:780 length=36` `AAAGTCAAATTTGAATTGTTGTCAGCTTGTCAAAAT` `+SRR029338.1 080115_EAS112_0034:8:1:615:780 length=36` `IIIIIIIIDIIIIIIIIIIIII.1F2II=8*2+//I` In the case of 454 pair submissions, the second technical read (i.e. linker) is included in this single output file.
‘-DB’ or ‘--defline-seq’	Allows specification of the sequence defline format. For example: -DB "@$ac.$si $sn length=$rl" This specification produces the same output as the default output. See Appendix D for a more in-depth explanation. Note that submission of a ‘fastq-dump’ command to a compute farm (e.g. Sun Grid Engine) can require preceding a number of the characters with backslash characters when using this option. The above example might require this version: -DB "@\\\$ac.\\\$si \\\$sn length=\\\$rl"
‘-DQ’ or ‘--defline-qual’	Allows specification of the quality defline format. For example: -DQ "+$ac.$si $sn length=$rl"
‘-alt [n]’	Provides alternative output formats without having to indicate the individual options. Alternate ‘1’, the only option, results in this format for SRR029338_1.fastq: `@SRR029338.1 080115_EAS112_0034:8:1:615:780/1` `GGTTGAGTAAAGTGTCTAAAGGCATAGCCTGATTAT` `+` `IIIIIIIIIIIIIIIIIIIAIIA<I8I+7I9+II2I` And this format for SRR029338_2.fastq: `@SRR029338.1 080115_EAS112_0034:8:1:615:780/2` `AAAGTCAAATTTGAATTGTTGTCAGCTTGTCAAAAT` `+` `IIIIIIIIDIIIIIIIIIIIII.1F2II=8*2+//I`
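
The -DB, -DQ and -Q rows in the table are easier to follow with a worked example. The sketch below rebuilds one FASTQ record from defline templates and a quality offset. It is an illustration only: the meanings of `$ac` (accession), `$si` (spot id), `$sn` (spot name) and `$rl` (read length) are inferred from the sample output in the table and are not confirmed here, and the function names are hypothetical. It uses Python's `string.Template`, which happens to share the `$name` placeholder syntax; it is not how fastq-dump itself is implemented.

```python
from string import Template

# Default-style defline templates, copied from the -DB / -DQ rows.
SEQ_DEFLINE = "@$ac.$si $sn length=$rl"
QUAL_DEFLINE = "+$ac.$si $sn length=$rl"


def decode_quals(qual_string, offset=33):
    """Turn a FASTQ quality string into integer scores for the given offset."""
    return [ord(ch) - offset for ch in qual_string]


def encode_quals(scores, offset=33):
    """Turn integer scores into a quality string; offset 64 maps score 0 to '@'."""
    return "".join(chr(score + offset) for score in scores)


def render_record(ac, si, sn, seq, scores, offset=33,
                  seq_fmt=SEQ_DEFLINE, qual_fmt=QUAL_DEFLINE):
    """Build the four lines of one FASTQ record from defline templates."""
    # Assumed field meanings: accession, spot id, spot name, read length.
    fields = {"ac": ac, "si": si, "sn": sn, "rl": len(seq)}
    return [
        Template(seq_fmt).substitute(fields),
        seq,
        Template(qual_fmt).substitute(fields),
        encode_quals(scores, offset),
    ]


if __name__ == "__main__":
    seq = "GGTTGAGTAAAGTGTCTAAAGGCATAGCCTGATTAT"
    scores = decode_quals("IIIIIIIIIIIIIIIIIIIAIIA<I8I+7I9+II2I")
    for line in render_record("SRR029338", 1,
                              "080115_EAS112_0034:8:1:615:780", seq, scores):
        print(line)
```

Passing `offset=64` to `render_record` re-encodes the same scores with ‘@’ as the base character, which is the effect the -Q row describes.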

2. Converting *.sra files to SFF format

$ sff-dump -A <SRR_accession> -D <Path_to_SRR_Directory> -O <Output_Path>
  
Options:

Command	Description
-O	Allows user to specify an output directory. If not used, output will default to the current directory.
-N	Minimum spot ID to output. The first spot in the output will be the number given for this option.
-X	Maximum spot ID to output. The last spot in the output will be the number given. Min and Max spot options can be combined to output subsections of an SRR.
-G	spotgroup-file Split into files by SPOT_GROUP
-T	spotgroup-dir Split into subdirectories (of -O ) by SPOT_GROUP
-L	Log level: 0-13 or fatal|sys|int|err|warn|info|debug[1-10]. (default: info) Set to ‘4’ to mimic the unix standard of no messages for a successful operation.
-H	Prints this help message and version information.
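
The -N and -X options take inclusive spot IDs, so the arithmetic for dumping a run in subsections is worth spelling out. The sketch below is illustrative only: it assumes spot IDs start at 1 and are contiguous, and the function names are hypothetical.

```python
def spot_count(min_spot, max_spot):
    """Upper bound on spots dumped for -N min_spot -X max_spot (inclusive)."""
    if max_spot < min_spot:
        raise ValueError("max_spot must be >= min_spot")
    return max_spot - min_spot + 1


def spot_chunks(total_spots, chunk_size):
    """Split spots 1..total_spots into (min, max) pairs for -N / -X.

    Assumption: spot IDs start at 1 and have no gaps.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [(start, min(start + chunk_size - 1, total_spots))
            for start in range(1, total_spots + 1, chunk_size)]


if __name__ == "__main__":
    # -N 5 -X 10 covers spots 5, 6, 7, 8, 9 and 10.
    print(spot_count(5, 10))
    # Option pairs for dumping a 10-spot run in pieces of 4.
    for low, high in spot_chunks(10, 4):
        print(f"-N {low} -X {high}")
```

Because both bounds are inclusive, consecutive chunks must start at the previous maximum plus one, otherwise a spot is dumped twice.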

3. Converting *.sra files to Illumina native format

$ illumina-dump [options] -path <directory_containing_the_accession> <acces

Command	Description
-D, --table-path	Path to accession data.
-O, --outdir	Output directory. Default: '.'
-N, --minSpotId	Minimum spot id to output.
-X, --maxSpotId	Maximum spot id to output.
-G, --spot-group	Split into files by SPOT_GROUP (member).
-T, --group-in-dirs	Split into subdirectories instead of files.
-K, --keep-empty-files	Do not delete empty files.
-L, --log-level	Logging level: 0-13 or fatal|sys|int|err|warn|info|debug[1-10]. Default: info
-H, --help	Prints this message

Format options:

Command	Description
-r, --read	Output READ: "seq". Default: on
-q, --qual1	Output QUALITY, into single (1) or multiple (2) files: "qcal". Default: 1
-p, --qual4	Output full QUALITY: "prb". Default: off
-i, --intensity	Output INTENSITY, if present: "int". Default: off
-n, --noise	Output NOISE, if present: "nse". Default: off
-s, --signal	Output SIGNAL, if present: "sig2". Default: off
-qseq	Output QSEQ format: "qseq". Default: off
