https://github.com/OpenNMT/OpenNMT-py
https://github.com/OpenNMT/CTranslate2
Install the packages:
pip install OpenNMT-py
pip install ctranslate2
Comparing the performance of CTranslate2 and OpenNMT-py (without using the official Docker images)
Download the pretrained model linked from the OpenNMT-py GitHub homepage:
English-German Transformer trained on WMT
The archive contains two files, averaged-10-epoch.pt and sentencepiece.model: the former is the saved model checkpoint, the latter is the SentencePiece tokenizer model.
For CTranslate2 model conversion and quantization, see:
https://github.com/OpenNMT/CTranslate2/blob/master/docs/quantization.md
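The pip package also installs converter scripts such as ct2-opennmt-py-converter, and the same conversion can be done from Python. A minimal sketch, assuming the checkpoint path from above and an output directory named to match the inference script later in this post:

import ctranslate2.converters

# Convert the OpenNMT-py checkpoint to a CTranslate2 model with int8 quantization.
# (A sketch based on the CTranslate2 converter API; adjust paths as needed.)
converter = ctranslate2.converters.OpenNMTPyConverter(
    "transformer-ende-wmt-pyOnmt/averaged-10-epoch.pt"
)
converter.convert("transformer-ende-wmt-pyOnmt/ende_ctranslate2/", quantization="int8")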
Download the test data (the following is taken from CTranslate2/tools/benchmark/benchmark_all.py):
import sacrebleu
# Benchmark configuration
test_set = "wmt14"
langpair = "en-de"
print("Downloading the test files...")
source_file = sacrebleu.get_source_file(test_set, langpair=langpair)
target_file = sacrebleu.get_reference_files(test_set, langpair=langpair)[0]
print("source_file:", source_file)
print("target_file:", target_file)
This downloads the raw source and reference texts, en-de.en and en-de.de respectively.
Benchmarking: see the ctranslate2 and opennmt_py directories under CTranslate2/tools/benchmark/opennmt_ende_wmt14/; each contains a tokenize.sh and a translate.sh.
Starting with the opennmt_py directory, create custom_tokenize.py based on tokenize.sh:
import pyonmttok

# Tokenize the raw source text with the pretrained SentencePiece model.
sp_model_path = "transformer-ende-wmt-pyOnmt/sentencepiece.model"
src_file = "sacrebleu/wmt14/en-de.en"
tgt_file = src_file + ".tok"
pyonmttok.Tokenizer("none", sp_model_path=sp_model_path).tokenize_file(src_file, tgt_file)
Running this produces the tokenized version of the input text.
Next, run the translation script to translate the tokenized input into tokenized output.
The modified translate.sh:
#!/bin/bash
# EXTRA_ARGS=""
# if [ $DEVICE = "GPU" ]; then
#     EXTRA_ARGS+=" -gpu 0"
# fi
# if [ ${INT8:-0} = "1" ]; then
#     EXTRA_ARGS+=" -int8"
# fi
model_path=transformer-ende-wmt-pyOnmt/averaged-10-epoch.pt
src_file=sacrebleu/wmt14/en-de.en.tok
out_file=${src_file}.onmt.out
# onmt_translate \
python -m onmt.bin.translate \
    -model ${model_path} \
    -src ${src_file} \
    -out ${out_file} \
    -batch_size 32 \
    -beam_size 4 -gpu 0
Finally, detokenize. Based on detokenize.sh, write custom_detokenize.py:
import pyonmttok

# Detokenize the translation output back to plain text with the same SentencePiece model.
sp_model_path = "transformer-ende-wmt-pyOnmt/sentencepiece.model"
src_file = "sacrebleu/wmt14/en-de.en.tok.onmt.out"
tgt_file = src_file + ".detok"
pyonmttok.Tokenizer("none", sp_model_path=sp_model_path).detokenize_file(src_file, tgt_file)
Running this turns the tokenized translation into a detokenized text file; the result should be close to en-de.de.
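To check this quantitatively rather than by eye, the detokenized output can be scored against the reference with sacrebleu (a sketch; the file names follow from the steps above):

import sacrebleu

with open("sacrebleu/wmt14/en-de.en.tok.onmt.out.detok") as f:
    hypotheses = [line.rstrip("\n") for line in f]
with open("sacrebleu/wmt14/en-de.de") as f:
    references = [line.rstrip("\n") for line in f]

# corpus_bleu takes the hypotheses and a list of reference streams.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print("BLEU:", bleu.score)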
The ctranslate2 run follows the same workflow except for the translate step.
benchmark/ctranslate2/translate.sh invokes a translate executable, but the pip-installed ctranslate2 package does not ship that binary.
If you are not using the official image (or building your own) and run inference directly from the pip package, you have to read the input file yourself and call the Python API. For a fair comparison, the OpenNMT-py inference can be rewritten in the same way, with a warmup pass added, as in the sketch below.
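One possible rewrite (a sketch only: _get_parser and build_translator are OpenNMT-py internals whose exact signatures vary across versions, so treat this as an assumption to adapt, not a drop-in script):

import time
from onmt.bin.translate import _get_parser
from onmt.translate.translator import build_translator

# Build the same options translate.sh passes on the command line.
parser = _get_parser()
opt = parser.parse_args([
    "-model", "transformer-ende-wmt-pyOnmt/averaged-10-epoch.pt",
    "-src", "sacrebleu/wmt14/en-de.en.tok",
    "-output", "sacrebleu/wmt14/en-de.en.tok.onmt.out",
    "-batch_size", "32",
    "-beam_size", "4",
    "-gpu", "0",
])
translator = build_translator(opt, report_score=False)

with open(opt.src, "rb") as f:
    src_lines = f.readlines()

# Warmup pass over a few batches, mirroring the CTranslate2 script below.
translator.translate(src=src_lines[:128], batch_size=opt.batch_size)

start = time.time()
translator.translate(src=src_lines, batch_size=opt.batch_size)
print("opennmt-py translate time:", time.time() - start)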
For CTranslate2 Python API usage, see:
https://github.com/OpenNMT/CTranslate2/blob/master/docs/python.md
Here we write our own ctrans_custom_translate.py script that reads the file and runs inference:
import ctranslate2
import time

src_file = "sacrebleu/wmt14/en-de.en.tok"
tgt_file = src_file + ".ctrans.out"
model_path = "transformer-ende-wmt-pyOnmt/ende_ctranslate2/"
device = "cuda"  # "cpu" or "cuda"
max_batch_size = 32
beam_size = 4

# Read the tokenized source and split each line into tokens.
with open(src_file, "r") as f:
    lines = [line.strip("\n").split(" ") for line in f]

translator = ctranslate2.Translator(model_path, device=device)

# Warmup pass so one-time initialization is excluded from the timing.
translator.translate_batch(lines[:max_batch_size * 4], max_batch_size=max_batch_size, beam_size=beam_size)

time1 = time.time()
trans_results = translator.translate_batch(lines, max_batch_size=max_batch_size, beam_size=beam_size)
time2 = time.time()
print("ctranslate2 translate time:", time2 - time1)

# Write the best hypothesis of each sentence, one per line.
result_lines = [" ".join(result.hypotheses[0]) + "\n" for result in trans_results]
with open(tgt_file, "w") as f:
    f.writelines(result_lines)
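The ctranslate2 output can then be detokenized with the same custom_detokenize.py, pointing src_file at en-de.en.tok.ctrans.out. One more note: if the model was converted with int8 quantization, the compute type can also be set explicitly when creating the translator (compute_type is part of the CTranslate2 Python API; whether you need it depends on your conversion settings):

# Assumes the model directory was produced with --quantization int8.
translator = ctranslate2.Translator(model_path, device=device, compute_type="int8")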