Haven't finished reading this yet, but posting it anyway; leaving this messy draft of notes here will remind me to get back to it soon....
GPT-2 features
large transformer-based language model
Training objective: predict the next word, given all of the previous words within some text.
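A minimal sketch (my own, not the paper's code) of that objective written as a loss function, assuming a model that returns per-position logits over the BPE vocabulary; the function name and tensor shapes are illustrative:

```python
# A minimal sketch, not the paper's code: the objective is next-token cross-entropy,
# i.e. maximize p(x_t | x_1, ..., x_{t-1}) at every position t.
import torch
import torch.nn.functional as F

def next_token_loss(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """logits: (batch, seq, vocab) from the language model; tokens: (batch, seq) input ids."""
    # The logits at position t predict token t+1, so drop the last position's logits
    # and the first token before computing cross-entropy.
    shifted_logits = logits[:, :-1, :]
    targets = tokens[:, 1:]
    return F.cross_entropy(shifted_logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))

# Toy usage with random logits; 50257 is GPT-2's BPE vocabulary size.
logits = torch.randn(2, 8, 50257)
tokens = torch.randint(0, 50257, (2, 8))
print(next_token_loss(logits, tokens))
```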
On question answering, reading comprehension, summarization, and translation, GPT-2's performance is still poor, but with enough data and compute these tasks can be learned directly in an unsupervised way.
GPT-2 begins to learn these tasks from the raw text, using no task-specific training data. While scores on these downstream tasks are far from state-of-the-art, they suggest that the tasks can benefit from unsupervised techniques, given sufficient (unlabeled) data and compute.
GPT-2 outperforms models trained on domain-specific datasets (e.g. Wikipedia, news, books) when evaluated on those same datasets.
In other words, GPT-2 beats those domain-specific models without ever being trained on their domain-specific datasets.
Data source: all outbound links from Reddit (a social media platform) that received at least 3 karma, roughly 45 million links in total; the text is extracted from the raw HTML with the Dragnet (Peters & Lecocq, 2013) and Newspaper content extractors. (we scraped all outbound links from Reddit, a social media platform, which received at least 3 karma.)
After cleaning, this yields about 8 million documents and 40GB of text. All Wikipedia pages were removed, because Wikipedia shows up in so many other datasets that keeping it would create overlap between the training and evaluation sets.
Hyperparameters
[The released 117M model available for download] HParams([('n_ctx', 1024), ('n_embd', 768), ('n_head', 12), ('n_layer', 12), ('n_vocab', 0)])
The smallest model (117M) is the same size as the original GPT, and the second smallest (345M) is roughly the size of BERT-Large.
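As a sanity check on these sizes, here is a small sketch (mine, not the released openai/gpt-2 code) that rebuilds the hyperparameters above as a plain dataclass and roughly estimates the parameter count. The vocabulary size 50257 is an assumption on my part, since n_vocab=0 above looks like a placeholder that the released checkpoint fills in:

```python
# A small sketch, not the official code: the hyperparameters above as a dataclass,
# plus a very rough parameter-count estimate for this configuration.
from dataclasses import dataclass

@dataclass
class HParams:
    n_vocab: int = 50257  # BPE vocabulary size of the released model (assumed here)
    n_ctx: int = 1024     # context length in tokens
    n_embd: int = 768     # hidden / embedding size
    n_head: int = 12      # attention heads per layer
    n_layer: int = 12     # transformer blocks

def approx_params(h: HParams) -> int:
    """Very rough count: embeddings + (attention + 4x feed-forward) per block, ignoring biases."""
    embeddings = h.n_vocab * h.n_embd + h.n_ctx * h.n_embd   # token + position tables
    attention = 4 * h.n_embd * h.n_embd                      # q, k, v and output projections
    mlp = 2 * h.n_embd * (4 * h.n_embd)                      # expand to 4*n_embd and back
    return embeddings + h.n_layer * (attention + mlp)

print(approx_params(HParams()))  # ~1.2e8, the same order of magnitude as the quoted 117M
```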
Previously the model learned a conditional distribution p(output | input); now it also conditions on the task, p(output | input, task).
For example, a translation training example can be written as the sequence (translate to french, english text, french text). Likewise, a reading comprehension training example can be written as (answer the question, document, question, answer). McCann et al. (2018) show that a single model, MQAN, can be trained to perform many different tasks on examples in this kind of format (see the decaNLP notes below).
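A minimal sketch of how this looks in practice, not taken from the paper's code: the task is expressed as ordinary text in the same flat sequence as the input and output, so a plain next-token language model can be conditioned on it. The prompt wording and the toy examples are my own.

```python
# A minimal sketch, not the paper's code: realize p(output | input, task) by flattening
# the task description, the input, and the output into one text sequence.
def to_training_sequence(*fields: str) -> str:
    # The "task" is just more text prepended to the input; no task-specific head is needed.
    return " ".join(fields)

translation = to_training_sequence(
    "translate to french:", "the cat sat on the mat", "=", "le chat est assis sur le tapis")
reading_comprehension = to_training_sequence(
    "answer the question:", "GPT-2 was released by OpenAI in 2019.",
    "Who released GPT-2?", "OpenAI")
print(translation)
print(reading_comprehension)
```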
How Google's One Model To Learn Them All paper describes the MultiModel:
The MultiModel consists of a few small modality-nets, an encoder, I/O mixer, and an autoregressive decoder, as depicted in Figure 2. As already said above, the encoder and decoder are constructed using 3 key computational blocks to get good performance across different problems:
(1) Convolutions allow the model to detect local patterns and generalize across space.
(2) Attention layers allow the model to focus on specific elements to improve its performance.
(3) Sparsely-gated mixture-of-experts gives the model capacity without excessive computation cost.
We start by describing the architecture of each of these 3 blocks and then introduce the encoder, decoder and the architecture of our modality-nets.
A paper that explores the unsupervised vs. supervised learning question in NLP:
The Natural Language Decathlon: Multitask Learning as Question Answering
This paper proposes MQAN (multitask question answering network), a single network for the NLP "decathlon" of tasks.
Byte Pair Encoding (BPE) (Sennrich et al., 2015)
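For reference, a toy version of the BPE merge-learning loop from Sennrich et al. (2015). Note that GPT-2 itself uses a byte-level BPE with a pre-trained merge table, which this character-level toy does not reproduce; the sample vocabulary is illustrative.

```python
# Toy BPE merge learning, adapted from the algorithm in Sennrich et al. (2015):
# repeatedly merge the most frequent adjacent symbol pair into a new symbol.
import re
import collections

def get_stats(vocab):
    """Count how often each adjacent symbol pair occurs across the vocabulary."""
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i + 1]] += freq
    return pairs

def merge_vocab(pair, vocab):
    """Replace every occurrence of `pair` with the merged symbol."""
    bigram = re.escape(' '.join(pair))
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

# Words pre-split into characters, with an end-of-word marker and their corpus frequency.
vocab = {'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
for _ in range(8):
    pairs = get_stats(vocab)
    best = max(pairs, key=pairs.get)
    vocab = merge_vocab(best, vocab)
    print(best)  # the learned merge rules, most frequent first
```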
Changes in GPT-2
1. Layer normalization is moved to the input of each sub-block, similar to a pre-activation residual network (see the sketch after this list).
2. An additional layer normalization is added after the final self-attention block.
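A minimal PyTorch sketch (mine, not the official TensorFlow code) of point 1: layer normalization is applied to the input of each sub-block, and the residual path is left untouched. The module names, the causal-mask construction, and the use of nn.MultiheadAttention are my own choices for illustration.

```python
# A minimal sketch of a pre-activation (pre-LN) transformer block in the GPT-2 style:
# normalize first, run the sub-block, then add the result back to the residual stream.
import torch
import torch.nn as nn

class PreLNBlock(nn.Module):
    def __init__(self, n_embd: int = 768, n_head: int = 12):
        super().__init__()
        self.ln_1 = nn.LayerNorm(n_embd)
        self.attn = nn.MultiheadAttention(n_embd, n_head, batch_first=True)
        self.ln_2 = nn.LayerNorm(n_embd)
        self.mlp = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd), nn.GELU(), nn.Linear(4 * n_embd, n_embd))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Causal mask so each position only attends to earlier positions.
        T = x.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.ln_1(x)                                   # layer norm on the sub-block *input*
        x = x + self.attn(h, h, h, attn_mask=causal, need_weights=False)[0]
        x = x + self.mlp(self.ln_2(x))                     # same pattern for the feed-forward sub-block
        return x

x = torch.randn(2, 16, 768)
print(PreLNBlock()(x).shape)  # torch.Size([2, 16, 768])
```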