
OpenGPT 2.0 Notes

袁泓
2023-12-01

Haven't finished reading yet; putting this up anyway so this messy draft of notes sits here and reminds me to hurry up and finish....

GPT-2 Features

large transformer-based language model 

Training objective: predict the next word, given all of the previous words within some text.
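To make the objective concrete, here is a toy numpy sketch (my own illustration, not code from GPT-2): the loss is the summed negative log-likelihood of each next token given everything before it.

    import numpy as np

    # Toy sketch of the training objective: -sum_t log p(x_t | x_<t).
    # `logits[t]` is assumed to score the token at position t+1, so logits
    # has shape (T-1, vocab) for a T-token text.
    def next_token_nll(token_ids, logits):
        logits = logits - logits.max(axis=-1, keepdims=True)                 # numerical stability
        log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
        targets = np.asarray(token_ids)[1:]                                  # the "next words"
        return -log_probs[np.arange(len(targets)), targets].sum()

    # e.g. next_token_nll([2, 7, 1], np.zeros((2, 10))) == 2 * log(10)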

GPT-2's scores on question answering, reading comprehension, summarization, and translation are still poor, but they show that with enough data and compute these tasks can be learned directly through unsupervised learning.

GPT-2 begins to learn these tasks from the raw text, using no task-specific training data. While scores on these downstream tasks are far from state-of-the-art, they suggest that the tasks can benefit from unsupervised techniques, given sufficient (unlabeled) data and compute.

 

 GPT-2 outperforms models trained on domain-specific datasets (e.g. Wikipedia, news, books) when evaluated on those same datasets. 

In other words, even without being trained on those domain-specific datasets, GPT-2 still beats models that were trained on them (zero-shot evaluation).

Data source: all outbound links from Reddit posts with at least 3 karma, roughly 45 million links; the text was extracted from the downloaded HTML with the Dragnet (Peters & Lecocq, 2013) and Newspaper content extractors. (we scraped all outbound links from Reddit, a social media platform, which received at least 3 karma.)

After processing, this comes to roughly 8 million documents and 40 GB of text. All Wikipedia content was removed, because Wikipedia appears in so many other datasets that keeping it would create overlap between the training and evaluation sets.
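A hedged sketch of what such a pipeline could look like, using the Newspaper extractor named above (my own illustration, not OpenAI's scraping code; the (url, karma) input format and the karma threshold are assumptions taken from the description):

    from newspaper import Article   # one of the content extractors named above

    def build_webtext_like_corpus(links):
        """links: iterable of (url, karma) pairs scraped from Reddit submissions."""
        docs = []
        for url, karma in links:
            if karma < 3 or "wikipedia.org" in url:   # keep >=3 karma, drop Wikipedia
                continue
            article = Article(url)
            article.download()                        # fetch the HTML
            article.parse()                           # extract the main text body
            if article.text:
                docs.append(article.text)
        return docs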

 

Hyperparameters

[Released 117M model, available for download] HParams([('n_ctx', 1024), ('n_embd', 768), ('n_head', 12), ('n_layer', 12), ('n_vocab', 0)])
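The n_vocab=0 above is only a placeholder: in the released code the actual vocabulary size (50,257) comes from the hparams.json shipped with each checkpoint. A rough plain-dict sketch of that merge step (my own simplification, not the repo's exact TensorFlow HParams code; the models/117M path follows the released checkpoint layout):

    import json

    # Defaults mirroring the HParams above; n_vocab stays 0 until the
    # checkpoint's hparams.json overrides it (50257 for the released models).
    DEFAULT_HPARAMS = {
        "n_vocab": 0,     # placeholder, filled from the checkpoint
        "n_ctx": 1024,    # context window in tokens
        "n_embd": 768,    # embedding / hidden size
        "n_head": 12,     # attention heads per layer
        "n_layer": 12,    # transformer blocks
    }

    def load_hparams(path="models/117M/hparams.json"):
        hparams = dict(DEFAULT_HPARAMS)
        with open(path) as f:
            hparams.update(json.load(f))   # e.g. sets n_vocab to 50257
        return hparams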

The smallest model (117M) is the same size as GPT-1.0, and the second-smallest (345M) is roughly the size of BERT-Large.

 

Previously the model estimated the distribution p(output|input); now it is p(output|input, task).

For example, a translation training example can be written as the sequence (translate to french, english text, french text). Likewise, a reading comprehension training example can be written as (answer the question, document, question, answer). McCann et al. (2018) demonstrate it was possible to train a single model, the MQAN, to infer and perform many different tasks on examples with this type of format.

  • Learning a single task can be expressed as estimating a conditional probability distribution p(output|input). But to apply one general model across many tasks, the conditioning has to include not just the input but also a representation of the task itself, so the learning objective becomes p(output|input, task). For a translation task, for instance, a concrete example can be serialized into the training sequence (translate to french, english text, french text); see the sketch after this list. (GPT-1 already did something like this for reading comprehension / QA, where every problem was turned into a classification problem..)
  • The paper mentions task conditioning at the architectural level: task-specific encoders and decoders, which Google's One Model To Learn Them All describes as CNN [extract features] → attention [attend to specific elements] → sparsely-gated mixture-of-experts.
  • And at the algorithmic level, such as the inner and outer loop optimization framework of MAML.
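A minimal sketch of the "task as part of the sequence" idea (the helper function and the separator are my own illustration; GPT-2 itself just trains on raw text and relies on such patterns appearing naturally):

    # Serialize (task, inputs, output) into one flat text sequence, so a single
    # language model can be trained or prompted as p(output | input, task).
    def to_sequence(task, inputs, output=""):
        parts = [task, *inputs, output]
        return " ".join(p for p in parts if p)

    # Translation example from the paper's description:
    train_seq = to_sequence("translate to french", ["english text"], "french text")

    # Reading comprehension: (answer the question, document, question, answer)
    qa_seq = to_sequence("answer the question", ["document", "question"], "answer")

    # At inference time the output slot is left empty and the model continues the text:
    prompt = to_sequence("translate to french", ["How are you?"])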

 

 

 

How Google's One Model To Learn Them All explains the MultiModel

The MultiModel consists of a few small modality-nets, an encoder, I/O mixer, and an autoregressive decoder, as depicted in Figure 2. As already said above, the encoder and decoder are constructed using 3 key computational blocks to get good performance across different problems:

(1) Convolutions allow the model to detect local patterns and generalize across space.

(2) Attention layers allow to focus on specific elements to improve performance of the model.

(3) Sparsely-gated mixture-of-experts gives the model capacity without excessive computation cost.

We start by describing the architecture of each of these 3 blocks and then introduce the encoder, decoder and the architecture of our modality-nets.
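The sparsely-gated mixture-of-experts block is the least familiar of the three to me, so here is a toy numpy sketch of the gating idea (my own simplification of the noisy top-k gating in Shazeer et al.: per token, only the k highest-scoring experts run, which is what gives capacity without a matching growth in compute):

    import numpy as np

    def moe_layer(x, gate_w, expert_ws, k=2):
        """x: (d,) token vector; gate_w: (d, n_experts); expert_ws: list of (d, d) matrices."""
        scores = x @ gate_w                         # gating network scores every expert
        top_k = np.argsort(scores)[-k:]             # ...but only the best k are used
        weights = np.exp(scores[top_k] - scores[top_k].max())
        weights /= weights.sum()                    # softmax over the selected experts only
        # Only the selected experts compute anything; this sparsity is what gives
        # "capacity without excessive computation cost".
        return sum(w * (x @ expert_ws[i]) for w, i in zip(weights, top_k))

    # Tiny usage example with 4 experts of width 8:
    rng = np.random.default_rng(0)
    x = rng.standard_normal(8)
    gate_w = rng.standard_normal((8, 4))
    experts = [rng.standard_normal((8, 8)) for _ in range(4)]
    y = moe_layer(x, gate_w, experts, k=2)          # y has shape (8,)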

 

On the question of unsupervised vs. supervised learning in NLP:

The Natural Language Decathlon: Multitask Learning as Question Answering

This paper proposes the NLP "decathlon" network MQAN (multitask question answering network).

Byte Pair Encoding (BPE) (Sennrich et al., 2015)
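As a reminder of how BPE works: start from individual symbols and repeatedly merge the most frequent adjacent pair; GPT-2 applies the same idea at the byte level so it never needs an unknown token. A toy sketch of the merge-learning loop, with made-up word counts for illustration:

    from collections import Counter

    def merge_pair(symbols, pair):
        """Replace every adjacent occurrence of `pair` in a symbol tuple with its merge."""
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        return tuple(out)

    def learn_bpe(word_counts, num_merges):
        """word_counts: {tuple of symbols: frequency}. Returns the learned merge list."""
        merges = []
        for _ in range(num_merges):
            pairs = Counter()
            for word, freq in word_counts.items():
                for a, b in zip(word, word[1:]):
                    pairs[(a, b)] += freq
            if not pairs:
                break
            best = max(pairs, key=pairs.get)                 # most frequent adjacent pair
            merges.append(best)
            word_counts = {merge_pair(w, best): c for w, c in word_counts.items()}
        return merges

    # Toy corpus: "low" x5, "lower" x2, "newest" x6
    corpus = {tuple("low"): 5, tuple("lower"): 2, tuple("newest"): 6}
    print(learn_bpe(corpus, num_merges=3))   # first merges: ('w', 'e'), then ('l', 'o'), ...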

Improvements in GPT-2.0

1. Layer normalization is moved to the input of each sub-block, similar to a pre-activation residual network (see the sketch below this list).

2. An additional layer normalization is added after the final self-attention block, and the weights of residual layers are scaled at initialization by 1/√N, where N is the number of residual layers.
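A toy numpy sketch of the pre-LN change (the attention and MLP sub-blocks are stand-in matrices of my own, just to show where the layer norm sits):

    import numpy as np

    rng = np.random.default_rng(0)
    D = 8                                                   # toy hidden size
    W_ATTN = rng.standard_normal((D, D)) * 0.02             # stand-in for self-attention
    W_MLP = rng.standard_normal((D, D)) * 0.02              # stand-in for the feed-forward net

    def ln(x, eps=1e-5):
        """Layer norm over the feature axis (gain/bias omitted for brevity)."""
        return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

    def attn(x):
        return x @ W_ATTN

    def mlp(x):
        return x @ W_MLP

    def block_gpt1(x):        # GPT-1 / original Transformer: LN applied after the residual add
        x = ln(x + attn(x))
        return ln(x + mlp(x))

    def block_gpt2(x):        # GPT-2: LN moved to the input of each sub-block (pre-activation style)
        x = x + attn(ln(x))
        return x + mlp(ln(x))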
