ERROR record:
After downloading all bacterial fna files, I merged them into one single fna file, 99 GB in size.
samtools faidx library.fna
error:[E::fai_build_core] Different line length in sequence 'kraken:taxid|436|NZ_CP062147.1'
Did you take a look at that sequence in question? It may be just a
case of a broken fasta record.
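To look at the sequence in question: `samtools faidx` expects every line of a record, except the last one, to have the same length. The following is a minimal awk sketch of that rule, not samtools' own implementation; the function name `check_fasta_line_lengths` is made up for illustration. It prints the record name and the row number of the first offending line, and returns a nonzero status if one is found.

```shell
# Report the first sequence line that breaks the "equal width except the
# last line" rule. Output: record name, TAB, row number.
check_fasta_line_lengths() {
  awk '
    /^>/ { name = substr($0, 2); seen = 0; short = 0; next }
    {
      len = length($0)
      if (!seen) { width = len; seen = 1; next }   # first line sets the width
      # A line after a short line, or a line longer than the width, is an error.
      if (short || len > width) { printf "%s\t%d\n", name, NR; bad = 1; exit }
      if (len < width) short = 1                   # allowed only as the last line
    }
    END { exit bad }
  ' "$1"
}
```

Usage: `check_fasta_line_lengths library.fna`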
The error looks pretty clear: the lines of your sequence may be of unequal length. Why an indexer does not auto-normalize (or at least provide an option for it),
picard NormalizeFasta --INPUT 1.fa --OUTPUT normalized.fa
The resulting file still could not be indexed.
seqkit seq -w 70 s.fa > s2.fa
This only changes the number of bases per sequence line; it does nothing to repair the broken sequence.
The error sequence starts at row 250946642
Total rows: 1497262490
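The notes do not record how the two row numbers above were obtained. One possible way, sketched with `grep` and `wc`; the helper names `header_row` and `total_rows` are hypothetical.

```shell
# Print the row number of the first header line containing the given ID.
# -F treats the ID as a fixed string, so '|' and '.' are not regex operators;
# -m 1 stops at the first match instead of scanning the whole file.
header_row() {
  grep -n -m 1 -F ">$1" "$2" | cut -d: -f1
}

# Print the total number of rows in the file.
total_rows() {
  wc -l < "$1" | tr -d ' '
}
```

Usage: `header_row 'kraken:taxid|436|NZ_CP062147.1' normalized_library.fa` and `total_rows normalized_library.fa`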
Extract the 50000 rows after row 250946642 to find the broken line
sed -n '250946642,250996642'p normalized_library.fa > index.error.50000
A new seq shows up at the end of 'kraken:taxid|436|NZ_CP062147.1':
44099 CTCCGCCCCATCCGGCCCCGCCACACGGAGCTGCCCCGCCGCGTCCCAGCCCAGCCAGCGATGCC>krak
44100 en:taxid|1513|NZ_CP035785.1 Clostridium tetani strain Harvard 49205 ch
44101 romosomeNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
44102 NNNNNNNNNNNNNNNNNNNNNCAACAACGTATTTCATTTTAACACATTTAAATTTACCTATTGAGTATTA
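The numbers in the excerpt count rows inside index.error.50000. Assuming that numbering starts at 1, the absolute row in normalized_library.fa is 250946642 + 44099 - 1 = 250990740. A small sketch that prints a window with absolute row numbers directly, so no conversion is needed; `print_rows` is a made-up helper name.

```shell
# Print rows first..last of a file, each prefixed with its absolute row number.
# awk exits right after the last requested row, so the rest of a large file
# is not read.
print_rows() {
  # $1 = first row, $2 = last row, $3 = file
  awk -v first="$1" -v last="$2" '
    NR > last   { exit }
    NR >= first { printf "%d\t%s\n", NR, $0 }
  ' "$3"
}
```

Usage: `print_rows 250990738 250990745 normalized_library.fa`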
grep '[A-Z]>' normalized_library.fa
This shows how many lines in the fa are broken:
CTCCGCCCCATCCGGCCCCGCCACACGGAGCTGCCCCGCCGCGTCCCAGCCCAGCCAGCGATGCC>krak
TTATGTGGGATTAAACTTGAAATTTCATT>kraken:taxid|290847|NC_017382.1 Helicoba
Find the actual row numbers of the broken lines: grep -n '[A-Z]>' normalized_library.fa
250990740:CTCCGCCCCATCCGGCCCCGCCACACGGAGCTGCCCCGCCGCGTCCCAGCCCAGCCAGCGATGCC>kra
659005136:TTATGTGGGATTAAACTTGAAATTTCATT>kraken:taxid|290847|NC_017382.1 Helicoba
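The pattern `[A-Z]>` misses lowercase bases and can also match a header whose description contains `>`. A slightly stricter sketch is to list only non-header rows that contain a `>`; the function name `fused_header_rows` is hypothetical. In a file that has already been re-wrapped, this finds only the first row of each fused header, not the wrapped continuation rows that follow it.

```shell
# Print the row numbers of lines that do not start with '>' but contain one,
# i.e. sequence lines with a header fused onto their end.
fused_header_rows() {
  grep -n '^[^>].*>' "$1" | cut -d: -f1
}
```

Usage: `fused_header_rows normalized_library.fa`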
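Try deleting those lines first (row1 and row2 stand for the broken row numbers; the output must be a different file from the input, otherwise the shell truncates the input before sed reads it):
sed 'row1d;row2d' .fa > .fa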
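An alternative to deleting rows, sketched under an assumption: the fused headers look like what happens when one of the merged files lacks a trailing newline, so the next file's header lands on the end of the previous sequence line. If that is the cause, the line can be split back in two instead of dropped. This would need to run on the original library.fna rather than the normalized file, because there the header text has already been wrapped across several rows as if it were sequence. `split_fused_headers` is a made-up name.

```shell
# Split any sequence line that has a header fused onto its end into two lines.
# Real header lines (starting with '>') are passed through untouched.
split_fused_headers() {
  awk '
    /^>/ { print; next }
    {
      pos = index($0, ">")
      if (pos > 1) {
        print substr($0, 1, pos - 1)   # the sequence part
        print substr($0, pos)          # the header, starting at ">"
      } else {
        print
      }
    }
  ' "$1"
}
```

Usage: `split_fused_headers library.fna > library.split.fna` (the output name is only an example), then retry `samtools faidx` on the new file.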