
How to use GloVe pre-trained embeddings in OpenNMT-py

卢骏俊
2023-12-01

http://forum.opennmt.net/t/how-to-use-glove-pre-trained-embeddings-in-opennmt-py/1011

Using the vocabularies produced by OpenNMT-py preprocessing, embeddings_to_torch.py generates encoder and decoder embedding tensors initialized with GloVe's values.

The script is a slightly modified version of ylhsieh's.

Usage:

usage: embeddings_to_torch.py [-h] -emb_file EMB_FILE -output_file OUTPUT_FILE
                              -dict_file DICT_FILE [-verbose]
  • emb_file: GloVe-style embedding file, one word per line: [word] [dim1] ... [dim_d]
  • output_file: filename prefix under which to save the output as PyTorch serialized tensors
  • dict_file: the vocabulary file produced by OpenNMT-py preprocessing
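To make the conversion concrete, here is a minimal sketch (not the actual tools/embeddings_to_torch.py source) of what the script does internally: parse GloVe-style lines "word d1 ... dd", keep the vectors for words that appear in the model vocabulary, and count how many vocabulary words have no pre-trained vector.

```python
def load_embeddings(lines):
    """Parse GloVe-style lines into a {word: [float, ...]} dict."""
    embs = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        embs[parts[0]] = [float(x) for x in parts[1:]]
    return embs

def match_vocab(embs, vocab):
    """Return vectors for vocab words found in embs, plus the miss count."""
    found = {w: embs[w] for w in vocab if w in embs}
    return found, len(vocab) - len(found)

# Toy 3-dimensional example:
glove_lines = ["the 0.1 0.2 0.3", "cat 0.4 0.5 0.6"]
found, missing = match_vocab(load_embeddings(glove_lines), ["the", "cat", "xyzzy"])
# "xyzzy" has no pre-trained vector; the real script leaves such words with
# their random initialization.
```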

Example

0) set some variables:

export data="../onmt_merge/sorted_tokens/"
export root="./glove_experiment"
export glove_dir="./glove"

1) get GloVe files:

mkdir "$glove_dir"
wget http://nlp.stanford.edu/data/glove.6B.zip
unzip glove.6B.zip -d "$glove_dir"
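Before moving on, it can be worth confirming the vector dimensionality of the file you downloaded. Each line of a GloVe file is "word d1 ... dd", so the dimensionality is the token count minus one; for glove.6B.100d.txt this should be 100, matching the -word_vec_size used later at training time. A hypothetical helper (not part of OpenNMT-py):

```python
def embedding_dim(line):
    """Return the vector dimensionality of one GloVe-format line."""
    return len(line.rstrip().split(" ")) - 1

# Stand-in for the first line of glove.6B.100d.txt:
sample = "the " + " ".join(["0.0"] * 100)
dim = embedding_dim(sample)  # 100
```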

2) prepare data:

  mkdir -p $root
  python preprocess.py \
      -train_src $data/train.src.txt \
      -train_tgt $data/train.tgt.txt \
      -valid_src $data/valid.src.txt \
      -valid_tgt $data/valid.tgt.txt \
      -save_data $root/data

3) prepare embeddings:

  ./tools/embeddings_to_torch.py -emb_file "$glove_dir/glove.6B.100d.txt" \
                                 -dict_file "$root/data.vocab.pt" \
                                 -output_file "$root/embeddings" 

4) train using pre-trained embeddings:

  python train.py -save_model $root/model \
        -batch_size 64 \
        -layers 2 \
        -rnn_size 200 \
        -word_vec_size 100 \
        -pre_word_vecs_enc "$root/embeddings.enc.pt" \
        -pre_word_vecs_dec "$root/embeddings.dec.pt" \
        -data $root/data
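Conceptually, -pre_word_vecs_enc and -pre_word_vecs_dec overwrite rows of the (vocab_size x word_vec_size) embedding matrix with pre-trained vectors where available, while out-of-GloVe words keep a random initialization. A pure-Python sketch of that idea, with plain lists standing in for the torch tensors OpenNMT-py actually loads:

```python
import random

def init_embedding_matrix(vocab, pretrained, dim, seed=0):
    """Build an embedding matrix, preferring pre-trained vectors."""
    rng = random.Random(seed)
    matrix = []
    for word in vocab:
        if word in pretrained:
            matrix.append(list(pretrained[word]))   # keep the GloVe vector
        else:
            matrix.append([rng.uniform(-0.1, 0.1) for _ in range(dim)])
    return matrix

vocab = ["<unk>", "the", "cat"]
pretrained = {"the": [0.1, 0.2], "cat": [0.3, 0.4]}
M = init_embedding_matrix(vocab, pretrained, dim=2)
# M[1] is the GloVe vector for "the"; M[0] ("<unk>") stays randomly initialized.
```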