Haven't finished reading this yet, but posting it anyway; leaving this messy draft of notes here will remind me to get back to it soon....
GPT-2 features
large transformer-based language model
Training objective: predict the next word, given all of the previous words within some text.
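A minimal sketch (my own, not the paper's code) of that objective written as a loss function, assuming a model that returns per-position logits over the BPE vocabulary; the function name and tensor shapes are illustrative:

```python
# A minimal sketch, not the paper's code: the objective is next-token cross-entropy,
# i.e. maximize p(x_t | x_1, ..., x_{t-1}) at every position t.
import torch
import torch.nn.functional as F

def next_token_loss(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """logits: (batch, seq, vocab) from the language model; tokens: (batch, seq) input ids."""
    # The logits at position t predict token t+1, so drop the last position's logits
    # and the first token before computing cross-entropy.
    shifted_logits = logits[:, :-1, :]
    targets = tokens[:, 1:]
    return F.cross_entropy(shifted_logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))

# Toy usage with random logits; 50257 is GPT-2's BPE vocabulary size.
logits = torch.randn(2, 8, 50257)
tokens = torch.randint(0, 50257, (2, 8))
print(next_token_loss(logits, tokens))
```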
On question answering, reading comprehension, summarization, and translation, GPT-2's performance is still poor, but with enough data and compute these tasks can be learned directly in an unsupervised way.
GPT-2 begins to learn these tasks from the raw text, using no task-specific training data. While scores on these downstream tasks are far from state-of-the-art, they suggest that the tasks can benefit from unsupervised techniques, given sufficient (unlabeled) data and compute.
GPT-2 outperforms models trained on domain-specific datasets (e.g. Wikipedia, news, books) when evaluated on those same datasets.
In other words, GPT-2 beats those domain-specific models without ever being trained on their domain-specific datasets.
Data source: all outbound links from Reddit (a social media platform) that received at least 3 karma, roughly 45 million links in total; the text is extracted from the raw HTML with the Dragnet (Peters & Lecocq, 2013) and Newspaper content extractors. (we scraped all outbound links from Reddit, a social media platform, which received at least 3 karma.)
After cleaning, this yields about 8 million documents and 40GB of text. All Wikipedia pages were removed, because Wikipedia shows up in so many other datasets that keeping it would create overlap between the training and evaluation sets.
Hyperparameters
[The released 117M model available for download] HParams([('n_ctx', 1024), ('n_embd', 768), ('n_head', 12), ('n_layer', 12), ('n_vocab', 0)])
The smallest model (117M) is the same size as the original GPT, and the second smallest (345M) is roughly the size of BERT-Large.
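As a sanity check on these sizes, here is a small sketch (mine, not the released openai/gpt-2 code) that rebuilds the hyperparameters above as a plain dataclass and roughly estimates the parameter count. The vocabulary size 50257 is an assumption on my part, since n_vocab=0 above looks like a placeholder that the released checkpoint fills in:

```python
# A small sketch, not the official code: the hyperparameters above as a dataclass,
# plus a very rough parameter-count estimate for this configuration.
from dataclasses import dataclass

@dataclass
class HParams:
    n_vocab: int = 50257  # BPE vocabulary size of the released model (assumed here)
    n_ctx: int = 1024     # context length in tokens
    n_embd: int = 768     # hidden / embedding size
    n_head: int = 12      # attention heads per layer
    n_layer: int = 12     # transformer blocks

def approx_params(h: HParams) -> int:
    """Very rough count: embeddings + (attention + 4x feed-forward) per block, ignoring biases."""
    embeddings = h.n_vocab * h.n_embd + h.n_ctx * h.n_embd   # token + position tables
    attention = 4 * h.n_embd * h.n_embd                      # q, k, v and output projections
    mlp = 2 * h.n_embd * (4 * h.n_embd)                      # expand to 4*n_embd and back
    return embeddings + h.n_layer * (attention + mlp)

print(approx_params(HParams()))  # ~1.2e8, the same order of magnitude as the quoted 117M
```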
Previously the model learned a conditional distribution p(output | input); now it also conditions on the task, p(output | input, task).
For example, a translation training example can be written as the sequence (translate to french, english text, french text). Likewise, a reading comprehension training example can be written as (answer the question, document, question, answer). McCann et al. (2018) show that a single model, MQAN, can be trained to perform many different tasks on examples in this kind of format (see the decaNLP notes below).
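A minimal sketch of how this looks in practice, not taken from the paper's code: the task is expressed as ordinary text in the same flat sequence as the input and output, so a plain next-token language model can be conditioned on it. The prompt wording and the toy examples are my own.

```python
# A minimal sketch, not the paper's code: realize p(output | input, task) by flattening
# the task description, the input, and the output into one text sequence.
def to_training_sequence(*fields: str) -> str:
    # The "task" is just more text prepended to the input; no task-specific head is needed.
    return " ".join(fields)

translation = to_training_sequence(
    "translate to french:", "the cat sat on the mat", "=", "le chat est assis sur le tapis")
reading_comprehension = to_training_sequence(
    "answer the question:", "GPT-2 was released by OpenAI in 2019.",
    "Who released GPT-2?", "OpenAI")
print(translation)
print(reading_comprehension)
```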
How Google's One Model To Learn Them All paper describes the MultiModel:
The MultiModel consists of a few small modality-nets, an encoder, I/O mixer, and an autoregressive decoder, as depicted in Figure 2. As already said above, the encoder and decoder are constructed using 3 key computational blocks to get good performance across different problems:
(1) Convolutions allow the model to detect local patterns and generalize across space.
(2) Attention layers allow the model to focus on specific elements to improve its performance.
(3) Sparsely-gated mixture-of-experts gives the model capacity without excessive computation cost.
We start by describing the architecture of each of these 3 blocks and then introduce the encoder, decoder and the architecture of our modality-nets.
A paper that explores the unsupervised vs. supervised learning question in NLP:
The Natural Language Decathlon: Multitask Learning as Question Answering
This paper proposes MQAN (multitask question answering network), a single network for the NLP "decathlon" of tasks.
Byte Pair Encoding (BPE) (Sennrich et al., 2015)
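For reference, a toy version of the BPE merge-learning loop from Sennrich et al. (2015). Note that GPT-2 itself uses a byte-level BPE with a pre-trained merge table, which this character-level toy does not reproduce; the sample vocabulary is illustrative.

```python
# Toy BPE merge learning, adapted from the algorithm in Sennrich et al. (2015):
# repeatedly merge the most frequent adjacent symbol pair into a new symbol.
import re
import collections

def get_stats(vocab):
    """Count how often each adjacent symbol pair occurs across the vocabulary."""
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i + 1]] += freq
    return pairs

def merge_vocab(pair, vocab):
    """Replace every occurrence of `pair` with the merged symbol."""
    bigram = re.escape(' '.join(pair))
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

# Words pre-split into characters, with an end-of-word marker and their corpus frequency.
vocab = {'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
for _ in range(8):
    pairs = get_stats(vocab)
    best = max(pairs, key=pairs.get)
    vocab = merge_vocab(best, vocab)
    print(best)  # the learned merge rules, most frequent first
```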
Changes in GPT-2
1. Layer normalization is moved to the input of each sub-block, similar to a pre-activation residual network (see the sketch after this list).
2. An additional layer normalization is added after the final self-attention block.
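A minimal PyTorch sketch (mine, not the official TensorFlow code) of point 1: layer normalization is applied to the input of each sub-block, and the residual path is left untouched. The module names, the causal-mask construction, and the use of nn.MultiheadAttention are my own choices for illustration.

```python
# A minimal sketch of a pre-activation (pre-LN) transformer block in the GPT-2 style:
# normalize first, run the sub-block, then add the result back to the residual stream.
import torch
import torch.nn as nn

class PreLNBlock(nn.Module):
    def __init__(self, n_embd: int = 768, n_head: int = 12):
        super().__init__()
        self.ln_1 = nn.LayerNorm(n_embd)
        self.attn = nn.MultiheadAttention(n_embd, n_head, batch_first=True)
        self.ln_2 = nn.LayerNorm(n_embd)
        self.mlp = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd), nn.GELU(), nn.Linear(4 * n_embd, n_embd))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Causal mask so each position only attends to earlier positions.
        T = x.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.ln_1(x)                                   # layer norm on the sub-block *input*
        x = x + self.attn(h, h, h, attn_mask=causal, need_weights=False)[0]
        x = x + self.mlp(self.ln_2(x))                     # same pattern for the feed-forward sub-block
        return x

x = torch.randn(2, 16, 768)
print(PreLNBlock()(x).shape)  # torch.Size([2, 16, 768])
```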