NLP project pipelines tend to be tedious, so it is nice that there is now AllenNLP, a PyTorch-based toolkit that standardizes data processing, model building, training, and testing. I recently read a paper whose accompanying NLP project was written in TensorFlow 1.13, and it felt truly convoluted.

First, a look at the overview of AllenNLP (https://allennlp.org/) on its GitHub page; the key is the following table:
| Module | Description |
| --- | --- |
| allennlp | an open-source NLP research library, built on PyTorch |
| allennlp.commands | functionality for a CLI and web service |
| allennlp.data | a data processing module for loading datasets and encoding strings as integers for representation in matrices |
| allennlp.models | a collection of state-of-the-art models |
| allennlp.modules | a collection of PyTorch modules for use with text |
| allennlp.nn | tensor utility functions, such as initializers and activation functions |
| allennlp.training | functionality for training models |
So AllenNLP is mainly divided into six modules under the top-level `allennlp` package.
Let's look at the official sample code, a simplified version of the tutorial. Here the code lives in a single Python file; a real project would split it across files and directories.
```python
from typing import Iterator, List, Dict
import torch
import torch.optim as optim
import numpy as np
from allennlp.data import Instance
from allennlp.data.fields import TextField, SequenceLabelField
from allennlp.data.dataset_readers import DatasetReader
from allennlp.common.file_utils import cached_path
from allennlp.data.token_indexers import TokenIndexer, SingleIdTokenIndexer
from allennlp.data.tokenizers import Token
from allennlp.data.vocabulary import Vocabulary
from allennlp.models import Model
from allennlp.modules.text_field_embedders import TextFieldEmbedder, BasicTextFieldEmbedder
from allennlp.modules.token_embedders import Embedding
from allennlp.modules.seq2seq_encoders import Seq2SeqEncoder, PytorchSeq2SeqWrapper
from allennlp.nn.util import get_text_field_mask, sequence_cross_entropy_with_logits
from allennlp.training.metrics import CategoricalAccuracy
from allennlp.data.iterators import BucketIterator
from allennlp.training.trainer import Trainer
from allennlp.predictors import SentenceTaggerPredictor

torch.manual_seed(1)
```
The `typing` library is the Python standard library's support for type annotations, and `torch` is PyTorch. `allennlp.data` organizes the data the model consumes: each training sample is packaged as an `Instance`, and each `Instance` holds several `Field`s, here a `TextField` and a `SequenceLabelField`. `allennlp.data.dataset_readers.DatasetReader` is the class for reading data, one of the two classes that most AllenNLP code must implement; the other is `allennlp.models.Model`. `allennlp.common.file_utils.cached_path` downloads files and can also be pointed at local ones. The token-related classes in `allennlp.data` map the tokens of a text to indices (think of each token being mapped to an integer, token -> id), and `allennlp.data.vocabulary` combines those mappings into a vocabulary. The embedding-related classes in `allennlp.modules` and `allennlp.nn.util` add a word-embedding layer to the model. The classes in `allennlp.training.metrics` evaluate the model. The `DataIterator` classes in `allennlp.data.iterators` iterate over the data and support operations such as batching. Finally, `allennlp.training.trainer.Trainer` and `allennlp.predictors.SentenceTaggerPredictor` are used to train the model and to make predictions.
The first of the two core classes to implement is the dataset reader:

```python
class PosDatasetReader(DatasetReader):
    """
    DatasetReader for PoS tagging data, one sentence per line, like
        The###DET dog###NN ate###V the###DET apple###NN
    """
    def __init__(self, token_indexers: Dict[str, TokenIndexer] = None) -> None:
        super().__init__(lazy=False)
        self.token_indexers = token_indexers or {"tokens": SingleIdTokenIndexer()}

    def text_to_instance(self, tokens: List[Token], tags: List[str] = None) -> Instance:
        sentence_field = TextField(tokens, self.token_indexers)
        fields = {"sentence": sentence_field}
        if tags:
            label_field = SequenceLabelField(labels=tags, sequence_field=sentence_field)
            fields["labels"] = label_field
        return Instance(fields)

    def _read(self, file_path: str) -> Iterator[Instance]:
        with open(file_path) as f:
            for line in f:
                pairs = line.strip().split()
                sentence, tags = zip(*(pair.split("###") for pair in pairs))
                yield self.text_to_instance([Token(word) for word in sentence], tags)
```
`PosDatasetReader` implements the `DatasetReader` class and reads part-of-speech (PoS) tagging data, one sentence per line, with `###` separating each word from its tag (e.g. the tag of The is DET). The constructor takes a single argument: a dict mapping a name (a string) to a `TokenIndexer`, which determines how tokens are converted to indices. The `text_to_instance` method turns one sentence's tokens and tags (the model's input and output, respectively) into an `Instance` (really just a dict of fields). The `_read` method (the leading underscore conventionally marks it as internal) reads the whole file and returns an iterator of instances (note the `yield`: it is a generator); most of the time the per-example construction is delegated to `text_to_instance`, as here.
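As a quick sanity check of the reader, a minimal sketch (the sentence and tags are made up):

```python
reader = PosDatasetReader()
instance = reader.text_to_instance([Token("The"), Token("dog")], ["DET", "NN"])
print(instance.fields["sentence"].tokens)  # [The, dog]
print(instance.fields["labels"].labels)    # ['DET', 'NN']
```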
The second core class is the model:

```python
class LstmTagger(Model):
    def __init__(self,
                 word_embeddings: TextFieldEmbedder,
                 encoder: Seq2SeqEncoder,
                 vocab: Vocabulary) -> None:
        super().__init__(vocab)
        self.word_embeddings = word_embeddings
        self.encoder = encoder
        # Project the encoder output to the label space.
        self.hidden2tag = torch.nn.Linear(in_features=encoder.get_output_dim(),
                                          out_features=vocab.get_vocab_size('labels'))
        self.accuracy = CategoricalAccuracy()

    def forward(self,
                sentence: Dict[str, torch.Tensor],
                labels: torch.Tensor = None) -> Dict[str, torch.Tensor]:
        mask = get_text_field_mask(sentence)
        embeddings = self.word_embeddings(sentence)
        encoder_out = self.encoder(embeddings, mask)
        tag_logits = self.hidden2tag(encoder_out)
        output = {"tag_logits": tag_logits}
        if labels is not None:
            self.accuracy(tag_logits, labels, mask)
            output["loss"] = sequence_cross_entropy_with_logits(tag_logits, labels, mask)
        return output

    def get_metrics(self, reset: bool = False) -> Dict[str, float]:
        return {"accuracy": self.accuracy.get_metric(reset)}
```
`Model` is a subclass of `torch.nn.Module`, so how to implement it is largely up to the user; the key piece is the forward-propagation method `forward`. There are also a constructor and a `get_metrics` method that reports the evaluation metrics. The model consists of an embedding layer and a Seq2Seq encoder layer, followed by a single `torch.nn.Linear` layer that produces the output (at each position, a vector of logits the size of the label set); the vocabulary is also passed in at construction.
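To make the mask in `forward` concrete, a minimal sketch (the token ids are made up; with `SingleIdTokenIndexer`, id 0 is the padding index):

```python
import torch
from allennlp.nn.util import get_text_field_mask

# A batch of two sentences, lengths 3 and 2, padded to length 5.
sentence = {"tokens": torch.tensor([[2, 3, 4, 0, 0],
                                    [2, 5, 0, 0, 0]])}
print(get_text_field_mask(sentence))
# tensor([[1, 1, 1, 0, 0],
#         [1, 1, 0, 0, 0]])
```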
Then comes the script that wires everything together:

```python
reader = PosDatasetReader()
train_dataset = reader.read(cached_path(
    'https://raw.githubusercontent.com/allenai/allennlp'
    '/master/tutorials/tagger/training.txt'))
validation_dataset = reader.read(cached_path(
    'https://raw.githubusercontent.com/allenai/allennlp'
    '/master/tutorials/tagger/validation.txt'))
vocab = Vocabulary.from_instances(train_dataset + validation_dataset)

EMBEDDING_DIM = 6
HIDDEN_DIM = 6
token_embedding = Embedding(num_embeddings=vocab.get_vocab_size('tokens'),
                            embedding_dim=EMBEDDING_DIM)
word_embeddings = BasicTextFieldEmbedder({"tokens": token_embedding})
lstm = PytorchSeq2SeqWrapper(torch.nn.LSTM(EMBEDDING_DIM, HIDDEN_DIM, batch_first=True))
model = LstmTagger(word_embeddings, lstm, vocab)

if torch.cuda.is_available():
    cuda_device = 0
    model = model.cuda(cuda_device)
else:
    cuda_device = -1

optimizer = optim.SGD(model.parameters(), lr=0.1)
# Bucket batches by sentence length to reduce padding.
iterator = BucketIterator(batch_size=2, sorting_keys=[("sentence", "num_tokens")])
iterator.index_with(vocab)
trainer = Trainer(model=model,
                  optimizer=optimizer,
                  iterator=iterator,
                  train_dataset=train_dataset,
                  validation_dataset=validation_dataset,
                  patience=10,
                  num_epochs=1000,
                  cuda_device=cuda_device)
trainer.train()

predictor = SentenceTaggerPredictor(model, dataset_reader=reader)
tag_logits = predictor.predict("The dog ate the apple")['tag_logits']
tag_ids = np.argmax(tag_logits, axis=-1)
print([model.vocab.get_token_from_index(i, 'labels') for i in tag_ids])

# Here's how to save the model.
with open("/tmp/model.th", 'wb') as f:
    torch.save(model.state_dict(), f)
vocab.save_to_files("/tmp/vocabulary")

# And here's how to reload the model.
vocab2 = Vocabulary.from_files("/tmp/vocabulary")
model2 = LstmTagger(word_embeddings, lstm, vocab2)
with open("/tmp/model.th", 'rb') as f:
    model2.load_state_dict(torch.load(f))
if cuda_device > -1:
    model2.cuda(cuda_device)

predictor2 = SentenceTaggerPredictor(model2, dataset_reader=reader)
tag_logits2 = predictor2.predict("The dog ate the apple")['tag_logits']
np.testing.assert_array_almost_equal(tag_logits2, tag_logits)
```
To recap the script: the `DatasetReader` reads the training and validation sets (the GitHub URLs may be unreachable without a proxy, so you can simply download the files and change the paths passed to `cached_path` to local ones) and the vocabulary is built from the instances; the `model` is assembled, including the word embeddings, and wrapped in a class; the `trainer` trains it; and a `predictor` is instantiated for prediction. A more detailed tutorial is in the official documentation: https://allenai.github.io/allennlp-docs/
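If the URLs are blocked, local copies work just as well; a minimal sketch (the local paths here are hypothetical):

```python
# Read local copies instead of the GitHub URLs (paths are hypothetical).
train_dataset = reader.read('data/training.txt')
validation_dataset = reader.read('data/validation.txt')
```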
So far this is just an overview of AllenNLP. As you can see, the two keys to the whole pipeline are reading the data and defining the model, i.e. implementing the two base classes `allennlp.data.dataset_readers.DatasetReader` and `allennlp.models.Model`.
I will dig into the detailed tutorial in a follow-up.
BTW, AllenNLP also supports defining a model in a Jsonnet config file and running it directly from the command line; something to look at later.
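For a taste, such a config roughly looks like the sketch below. This is a hedged sketch based on the AllenNLP 0.x config format, not from this tutorial; the `type` names are hypothetical and would require registering the classes above with `@DatasetReader.register("pos-tutorial")` and `@Model.register("lstm-tagger")`.

```jsonnet
// Hypothetical experiment.jsonnet; keys mirror the Python script above.
{
  "dataset_reader": {"type": "pos-tutorial"},
  "train_data_path": "data/training.txt",
  "validation_data_path": "data/validation.txt",
  "model": {
    "type": "lstm-tagger"
    // constructor arguments (word_embeddings, encoder) are configured here
  },
  "iterator": {
    "type": "bucket",
    "batch_size": 2,
    "sorting_keys": [["sentence", "num_tokens"]]
  },
  "trainer": {
    "optimizer": {"type": "sgd", "lr": 0.1},
    "patience": 10,
    "num_epochs": 1000
  }
}
```

It would then be run with something like `allennlp train experiment.jsonnet -s /tmp/output --include-package <your package>`.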