http://forum.opennmt.net/t/how-to-use-glove-pre-trained-embeddings-in-opennmt-py/1011
Using the vocabularies from OpenNMT-py's preprocessing output, embeddings_to_torch.py generates encoder and decoder embeddings initialized with GloVe's values.
The script is a slightly modified version of ylhsieh's.
Usage:
usage: embeddings_to_torch.py [-h] -emb_file EMB_FILE -output_file OUTPUT_FILE
-dict_file DICT_FILE [-verbose]
emb_file: a GloVe-like embedding file, i.e. one word per line followed by its values: [word] [dim1] ... [dim_d]
output_file: a filename to save the output as PyTorch serialized tensors
dict_file: the dict output from OpenNMT-py preprocessing
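For intuition, the heart of the script is just a lookup: read the GloVe file into a dict, then overwrite the rows of a randomly initialized matrix for every vocabulary word GloVe covers. Below is a minimal sketch of that idea (illustrative only, not the actual script, which also splits encoder/decoder sides and reports coverage):

import torch

def load_glove(path):
    # one line per word: word dim1 ... dim_d (space-separated)
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

def build_embedding(vocab_words, vectors, dim=100):
    # random init, as train.py would do; rows found in GloVe get its values
    emb = torch.randn(len(vocab_words), dim)
    matched = 0
    for i, word in enumerate(vocab_words):
        if word in vectors:
            emb[i] = torch.tensor(vectors[word])
            matched += 1
    print("matched %d / %d vocabulary words" % (matched, len(vocab_words)))
    return emb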
0) set some variables:
export data="../onmt_merge/sorted_tokens/"
export root="./glove_experiment"
export glove_dir="./glove"
1) get GloVe files:
mkdir "$glove_dir"
wget http://nlp.stanford.edu/data/glove.6B.zip
unzip glove.6B.zip -d "$glove_dir"
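The archive contains 50-, 100-, 200- and 300-dimensional vectors (glove.6B.50d.txt through glove.6B.300d.txt); the steps below use the 100d file.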
2) prepare data:
mkdir -p $root
python preprocess.py \
-train_src $data/train.src.txt \
-train_tgt $data/train.tgt.txt \
-valid_src $data/valid.src.txt \
-valid_tgt $data/valid.tgt.txt \
-save_data $root/data
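preprocess.py serializes the vocabulary to $root/data.vocab.pt, which is what -dict_file expects in the next step. You can sanity-check it before converting the embeddings; note that the layout is version-dependent, and the list-of-pairs form assumed below matches the older OpenNMT-py releases this post targets:

import torch

vocabs = torch.load("glove_experiment/data.vocab.pt")
# older releases store a list of (side, torchtext Vocab) pairs,
# e.g. [('src', Vocab), ('tgt', Vocab)]
for side, vocab in vocabs:
    print(side, len(vocab.itos), "words")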
3) prepare embeddings:
./tools/embeddings_to_torch.py -emb_file "$glove_dir/glove.6B.100d.txt" \
-dict_file "$root/data.vocab.pt" \
-output_file "$root/embeddings"
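The script writes one tensor per side using the -output_file prefix, i.e. embeddings.enc.pt and embeddings.dec.pt here. Before training, it is worth confirming that each tensor is vocab_size x 100, since train.py only accepts pre-trained vectors whose width equals -word_vec_size:

import torch

for side in ("enc", "dec"):
    emb = torch.load("glove_experiment/embeddings.%s.pt" % side)
    print(side, tuple(emb.size()))  # expect (vocab_size, 100)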
4) train using pre-trained embeddings:
python train.py -save_model $root/model \
-batch_size 64 \
-layers 2 \
-rnn_size 200 \
-word_vec_size 100 \
-pre_word_vecs_enc "$root/embeddings.enc.pt" \
-pre_word_vecs_dec "$root/embeddings.dec.pt" \
-data $root/data
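Note that -word_vec_size 100 has to match the dimensionality of the GloVe file chosen in step 3; if you switch to the 200d or 300d vectors, change both together. Depending on your OpenNMT-py version, you can also pass -fix_word_vecs_enc and -fix_word_vecs_dec to keep the pre-trained vectors frozen during training.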