
mt_hhh

巫马磊
2023-12-01

2017 MIT & Google: Google's Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation

Benefits:

  1. Simplicity
  2. Improvements for low-resource languages
  3. Zero-shot translation

Training data construction: introduce an artificial token at the beginning of the input sentence to indicate the target language; the source language is not specified.

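A minimal preprocessing sketch of this scheme, assuming a `<2xx>` token format and made-up example sentences (the exact token spelling used in the training data may differ):

```python
def add_target_token(source_sentence: str, target_lang: str) -> str:
    """Prepend an artificial target-language token to the source sentence.

    The source language is deliberately not encoded anywhere; only the
    desired target language is signalled to the model.
    """
    return f"<2{target_lang}> {source_sentence}"

# Example training pairs for a single shared multilingual model
# (sentences are illustrative placeholders).
pairs = [
    ("Hello, how are you?", "Hola, ¿cómo estás?", "es"),
    ("Hello, how are you?", "Bonjour, comment ça va ?", "fr"),
]

for src, tgt, lang in pairs:
    model_input = add_target_token(src, lang)
    print(model_input, "->", tgt)
    # <2es> Hello, how are you? -> Hola, ¿cómo estás?
    # <2fr> Hello, how are you? -> Bonjour, comment ça va ?
```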

2021 Facebook

Facebook AI’s WMT21 News Translation Task Submission

Covers both many-to-English and English-to-many directions.

Significance: the first time that multilingual models outperformed bilingual models at WMT (2021), making them attractive options for the development and maintenance of commercial translation technologies.

Data:

(1) Bitext:

High-resource: use data provided by the shared task.

Mid-resource: add data from online sources: CCMatrix, CCAligned, and OPUS.

Low-resource: use LaBSE to embed candidate bitext into a shared embedding space, then use KNN to score and rank sentence pairs (see the sketch after this data section).

(2) Monolingual data: in-domain data from Newscrawl (English and German); for the other languages, select from CommonCrawl the general-domain data most similar to the available in-domain data.
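
A rough sketch of the low-resource bitext filtering step described above, using the publicly available LaBSE checkpoint from sentence-transformers and cosine similarity as the ranking score. The model name, threshold, and example sentences are assumptions for illustration, not the submission's actual pipeline.

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer
import numpy as np

# LaBSE maps sentences from many languages into one shared embedding space.
model = SentenceTransformer("sentence-transformers/LaBSE")

def score_bitext(src_sentences, tgt_sentences):
    """Score candidate sentence pairs by cosine similarity of LaBSE embeddings."""
    src_emb = model.encode(src_sentences, normalize_embeddings=True)
    tgt_emb = model.encode(tgt_sentences, normalize_embeddings=True)
    # With normalized embeddings, the dot product equals cosine similarity.
    return np.sum(src_emb * tgt_emb, axis=1)

src = ["Das Wetter ist heute schön.", "Ich mag Katzen."]
tgt = ["The weather is nice today.", "The stock market fell sharply."]

scores = score_bitext(src, tgt)
# Rank pairs by similarity and keep those above a (hypothetical) threshold.
ranked = sorted(zip(scores, src, tgt), reverse=True)
filtered = [(s, a, b) for s, a, b in ranked if s > 0.7]
print(filtered)
```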

Vocabulary: a shared multilingual vocabulary.

Train a SentencePiece (SPM) model on text from all languages (see the sketch below).
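
A minimal sketch of training one shared SentencePiece model over text from all languages. The input file name, vocabulary size, and character coverage are illustrative choices, not the values used in the submission.

```python
# pip install sentencepiece
import sentencepiece as spm

# Train one subword model on concatenated text from all languages so that
# every language shares the same multilingual vocabulary.
spm.SentencePieceTrainer.train(
    input="all_languages_combined.txt",   # hypothetical concatenated corpus
    model_prefix="multilingual_spm",
    vocab_size=64000,                     # illustrative size
    character_coverage=0.9995,            # keep rare characters of non-Latin scripts
    model_type="unigram",
)

sp = spm.SentencePieceProcessor(model_file="multilingual_spm.model")
print(sp.encode("Hallo Welt", out_type=str))
```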

Model:

Sparsely Gated Mixture-of-Experts (MoE) Multilingual Models:

In each Sparsely Gated MoE layer, each token is routed to the top-k expert FFN blocks based on a learned gating function.

The architecture is a Transformer in which the feed-forward block of every alternate layer, in both the encoder and the decoder, is replaced by a Sparsely Gated Mixture-of-Experts layer with top-2 gating (see the sketch below).
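
A simplified PyTorch sketch of such a top-2 gated MoE block. The expert count, dimensions, and the per-expert routing loop are simplifications for clarity; a production implementation would also need load-balancing and capacity handling, which are omitted here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoELayer(nn.Module):
    """Replaces a Transformer FFN block: each token is routed to its top-2 experts."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int = 8):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)  # learned gating function
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model) -- flatten batch/sequence dims before calling.
        gate_logits = self.gate(x)
        top2_vals, top2_idx = gate_logits.topk(2, dim=-1)   # top-2 gating
        top2_weights = F.softmax(top2_vals, dim=-1)         # renormalize over the 2 experts

        out = torch.zeros_like(x)
        for k in range(2):
            idx = top2_idx[:, k]
            w = top2_weights[:, k].unsqueeze(-1)
            for e, expert in enumerate(self.experts):
                mask = idx == e
                if mask.any():
                    out[mask] += w[mask] * expert(x[mask])
        return out

# Usage: route 10 token vectors of width 512 through the MoE block.
tokens = torch.randn(10, 512)
moe = Top2MoELayer(d_model=512, d_ff=2048, num_experts=8)
print(moe(tokens).shape)  # torch.Size([10, 512])
```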

Alibaba paper

Chinese Academy of Sciences: three strategies for one-to-many translation

Three Strategies to Improve One-to-Many Multilingual Translation (EMNLP 2018, short paper)

Three strategies:

  1. Special Label Initialization: [universal method] a special token (e.g. en2fr) is added at the end of the source sentence to indicate the translation direction.

    [added token] another special language-dependent label is placed at the beginning of the decoder and regarded as the first generated token of the target language (e.g. 2fr).

  2. Language-dependent Positional Embedding: introduce trigonometric functions with different orders or offsets in the decoder to distinguish the different target languages (see the sketch after this list).

  3. Shared and Language-dependent Hidden Units per Layer: divide the hidden units of each decoder layer into shared units (to learn the commonality) and language-dependent ones (to capture the characteristics of each specific language). [During training, only the shared units and the target language's own units are tuned; the units of other languages are masked out.]

    Experimental finding: similar languages (De/Fr) can share more hidden units.
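
Below is a toy sketch of the second strategy: a sinusoidal positional encoding with a per-language phase offset. The offset values and the exact formulation are illustrative assumptions; the paper also explores trigonometric functions of different orders.

```python
import math
import torch

def language_positional_encoding(seq_len: int, d_model: int, lang_offset: float) -> torch.Tensor:
    """Sinusoidal positional encoding shifted by a language-dependent offset.

    lang_offset shifts the position index so that the decoder sees a different
    trigonometric pattern for each target language (illustrative formulation).
    """
    pe = torch.zeros(seq_len, d_model)
    position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1) + lang_offset
    div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                         * (-math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

# Hypothetical per-language offsets: each target language gets its own shift.
lang_offsets = {"fr": 0.0, "de": 100.0, "es": 200.0}
pe_fr = language_positional_encoding(seq_len=50, d_model=512, lang_offset=lang_offsets["fr"])
pe_de = language_positional_encoding(seq_len=50, d_model=512, lang_offset=lang_offsets["de"])
print(torch.allclose(pe_fr, pe_de))  # False: the decoder can tell the target languages apart
```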
