2017 MIT & Google. Google's Multilingual Neural Machine Translation System:
Enabling Zero-Shot Translation
Benefits: a single model serves many language pairs and enables zero-shot translation between language pairs never seen together during training.
Training data construction: introduce an artificial token at the beginning of the input sentence to indicate the target language; the source language is not specified.
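A minimal sketch of this preprocessing step (the `<2xx>` token format is an illustrative choice, not necessarily the exact string used by Google):

```python
# Minimal sketch (not the paper's actual preprocessing code): prepend a
# target-language token to each source sentence; the source language is
# deliberately left unmarked.
def add_target_token(src_sentence: str, tgt_lang: str) -> str:
    """Prepend an artificial token such as <2fr> that tells the model
    which language to translate into."""
    return f"<2{tgt_lang}> {src_sentence}"

print(add_target_token("How are you?", "es"))  # "<2es> How are you?"
```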
2021 Facebook
many-to-English and English-to-many
Significance: this is the first time a multilingual model outperformed bilingual models on WMT 2021, making multilingual models attractive options for the development and maintenance of commercial translation technologies.
Data:
(1) Bitext:
high-resource: use data from the shared task
mid-resource: add data from online sources: ccMatrix, ccAligned, and OPUS
low-resource: use LaBSE to embed bitext into a shared embedding space, then use KNN to score and rank sentence pairs (a scoring sketch follows this data list)
(2) Monolingual data: in-domain data from Newscrawl (English and German); for the other languages, select from Commoncrawl the general-domain data most similar to the available in-domain data.
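A minimal sketch of the LaBSE scoring idea, assuming the sentence-transformers package; Facebook's actual mining pipeline is far larger and may use margin-based scoring and FAISS indexing rather than the plain cosine 1-NN shown here:

```python
# Sketch of LaBSE-based bitext scoring: embed both sides into the same
# multilingual space and rank candidate pairs by cosine similarity.
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("sentence-transformers/LaBSE")

src = ["Das ist ein Test.", "Ich mag Katzen."]
tgt = ["This is a test.", "I like cats."]

# L2-normalised embeddings, so the dot product equals cosine similarity.
src_emb = model.encode(src, normalize_embeddings=True)
tgt_emb = model.encode(tgt, normalize_embeddings=True)

# Score every candidate pair and keep the nearest target for each source
# sentence (a simple 1-NN search; FAISS would be used at scale).
scores = src_emb @ tgt_emb.T            # (len(src), len(tgt)) cosine matrix
best = np.argmax(scores, axis=1)
for i, j in enumerate(best):
    print(f"{src[i]!r} <-> {tgt[j]!r}  score={scores[i, j]:.3f}")
```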
Vocabulary: a shared multilingual vocabulary
train a SentencePiece (SPM) model
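A minimal sketch of training a shared multilingual SentencePiece model; the file name and hyperparameters below are placeholders, not the paper's values:

```python
# Train a single SPM model on the concatenated multilingual training text.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="all_languages_sampled.txt",   # placeholder: concatenated, language-sampled corpus
    model_prefix="multilingual_spm",
    vocab_size=128000,                   # illustrative; the paper's size may differ
    character_coverage=0.99995,          # high coverage to handle many scripts
    model_type="unigram",
)

sp = spm.SentencePieceProcessor(model_file="multilingual_spm.model")
print(sp.encode("Guten Morgen!", out_type=str))
```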
模型:
Sparsely Gated Mixture-of-Experts (MoE) Multilingual Models:
In each Sparsely Gated MoE layer, each token is routed to the top-k expert FFN blocks based on a learned gating function.
Transformer architecture with the Feed-Forward block in every alternate Transformer layer replaced by a Sparsely Gated Mixture-of-Experts layer with top-2 gating, in both the encoder and the decoder.
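A simplified PyTorch sketch of a Sparsely Gated MoE feed-forward layer with top-2 gating; real implementations add expert capacity limits, load-balancing losses, and efficient all-to-all dispatch, all omitted here:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Each token is routed to its top-k expert FFNs by a learned gate."""
    def __init__(self, d_model=512, d_ff=2048, num_experts=8, k=2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, num_experts)   # learned gating function
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                             # x: (tokens, d_model)
        logits = self.gate(x)                         # (tokens, num_experts)
        topk_val, topk_idx = logits.topk(self.k, dim=-1)
        weights = F.softmax(topk_val, dim=-1)         # normalise over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e         # tokens whose slot-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(1) * expert(x[mask])
        return out

tokens = torch.randn(10, 512)
print(MoELayer()(tokens).shape)                       # torch.Size([10, 512])
```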
Alibaba paper
CAS (Chinese Academy of Sciences): strategies for one-to-many translation
Three strategies:
Special Label Initialization: [universal method] a special token (e.g. en2fr) is added at the end of the source sentence to indicate the translation direction;
[add new token] another special language-dependent label is used at the beginning of the decoder and is treated as the first generated token of the target language (e.g. 2fr).
Language-dependent Positional Embedding: introduce trigonometric functions with different orders or offsets on the decoder to distinguish different target languages (see the positional-embedding sketch after this list).
Shared and Language-dependent Hidden Units per Layer: divide the hidden units of each decoder layer into shared units (to learn the commonality) and language-dependent ones (to capture the characteristics of each specific language). [Train the shared units together with the current language's own units and mask out the other languages' units; see the masking sketch after this list.]
Experimental finding: similar languages (De/Fr) can share more hidden units.
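A rough sketch of the language-dependent positional embedding idea: the standard sinusoidal embedding is shifted by a per-language offset, so each target language sees a different positional signal. The offset values below are made up, and the paper also considers varying the order of the trigonometric functions:

```python
import numpy as np

def positional_embedding(num_positions, d_model, lang_offset=0):
    """Standard Transformer sinusoidal embedding with a language-specific position offset."""
    pos = np.arange(num_positions)[:, None] + lang_offset        # shifted positions
    dim = np.arange(d_model)[None, :]
    angle = pos / np.power(10000.0, (2 * (dim // 2)) / d_model)  # standard Transformer angles
    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(angle[:, 0::2])
    pe[:, 1::2] = np.cos(angle[:, 1::2])
    return pe

offsets = {"fr": 0, "de": 16, "es": 32}     # hypothetical per-language offsets
pe_fr = positional_embedding(10, 64, offsets["fr"])
pe_de = positional_embedding(10, 64, offsets["de"])
print(np.abs(pe_fr - pe_de).mean())         # the two languages receive different signals
```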
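A rough sketch of the shared / language-dependent hidden-unit split: each decoder feed-forward layer keeps a shared block of units plus one private block per target language, and the other languages' blocks are masked out in the forward pass. All sizes and the language list are illustrative:

```python
import torch
import torch.nn as nn

class SharedLangFFN(nn.Module):
    """Feed-forward layer whose hidden units = shared block + one block per language."""
    def __init__(self, d_model=512, shared=1024, per_lang=256, langs=("fr", "de", "es")):
        super().__init__()
        self.langs = list(langs)
        d_hidden = shared + per_lang * len(self.langs)
        self.fc1 = nn.Linear(d_model, d_hidden)
        self.fc2 = nn.Linear(d_hidden, d_model)
        # Precompute a 0/1 mask per language: shared units are always active,
        # and only that language's private block is kept.
        self.masks = {}
        for i, lang in enumerate(self.langs):
            m = torch.zeros(d_hidden)
            m[:shared] = 1.0
            start = shared + i * per_lang
            m[start:start + per_lang] = 1.0
            self.masks[lang] = m

    def forward(self, x, tgt_lang):
        h = torch.relu(self.fc1(x))
        h = h * self.masks[tgt_lang]          # mask out the other languages' units
        return self.fc2(h)

layer = SharedLangFFN()
x = torch.randn(4, 512)
print(layer(x, "de").shape)                   # torch.Size([4, 512])
```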