1 BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova
Google AI Language

2 Outline
Motivation
Contribution
Pre-training Procedure
  Model Architecture
  Input Representation
  Pre-training Tasks
Fine-tuning Procedure
Experiments
  GLUE Results
  SQuAD v1.1 Results
  NER Results
  SWAG Results
Ablation Studies

3 Motivation
Existing representation models severely limit the expressive power of pre-trained representations, e.g., I made a bank deposit.
OpenAI GPT: a unidirectional pre-trained model
  Each word is contextualized using only its left (or right) context
  The "bank" representation is based only on "I made a", not on "deposit"
ELMo: a shallowly bidirectional pre-trained model (separately trained left-to-right and right-to-left LMs)
Bidirectional Encoder Representations from Transformers (BERT): a deeply bidirectional pre-trained model
  Each word is contextualized using both its left and right context
  The "bank" representation is based on "I made a ____ deposit"

4 Contribution
Demonstrates the importance of bidirectional pre-training for language representations
Eliminates the need for heavily engineered task-specific architectures
Advances the state of the art for eleven NLP tasks

5 Pre-training Procedure
(1) First, two sentences are sampled from the corpus, and the model is asked to predict whether the second sentence is the actual next sentence of the first (Next Sentence). At the same time, some words in the two sentences are randomly masked out, and the model is asked to predict what those words are (Masked Words). (2) The processed sentence pair is then fed into a deep bidirectional encoder, and the two objectives are learned jointly through two loss functions to complete pre-training.
Pre-training procedure covers: Input Representation, Model Architecture, Task #1: Masked LM, Task #2: Next Sentence Prediction
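The two pre-training losses are optimized jointly. Below is a minimal PyTorch sketch of how the masked-LM and next-sentence losses might be combined into one objective; the tensors and shapes are hypothetical stand-ins for the encoder's outputs, not the released implementation.

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes: batch of 2 sequences, 8 tokens each, toy vocabulary of 100 ids.
vocab_size = 100
mlm_logits = torch.randn(2, 8, vocab_size)            # per-token vocabulary scores from the encoder
nsp_logits = torch.randn(2, 2)                        # IsNext / NotNext scores from the [CLS] vector

mlm_labels = torch.full((2, 8), -100, dtype=torch.long)  # -100 = position not masked, ignored by the loss
mlm_labels[0, 3] = 42                                    # pretend token id 42 was masked at position 3
nsp_labels = torch.tensor([0, 1])                        # 0 = IsNext, 1 = NotNext

# Masked-LM loss is computed only over masked positions; NSP is a 2-way classification.
mlm_loss = F.cross_entropy(mlm_logits.view(-1, vocab_size), mlm_labels.view(-1), ignore_index=-100)
nsp_loss = F.cross_entropy(nsp_logits, nsp_labels)

# Pre-training minimizes the sum of the two losses.
total_loss = mlm_loss + nsp_loss
```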

6 Model Architecture A multi-layer bidirectional Transformer encoder
BERTBASE: L=12, H=768, A=12, Total Parameters = 110M
BERTLARGE: L=24, H=1024, A=16, Total Parameters = 340M
(L = number of layers, H = hidden size, A = number of self-attention heads)
Based on the Transformer encoder of "Attention Is All You Need", NIPS 2017
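As a sanity check on the quoted sizes, here is a rough Python estimate of the parameter counts implied by L and H under the standard Transformer sizing (feed-forward width 4H, ~30k WordPiece vocabulary); it ignores biases and LayerNorm, so it slightly undercounts.

```python
def approx_bert_params(L, H, vocab=30522, max_pos=512, segments=2):
    """Rough parameter count for a BERT-style Transformer encoder (ignores biases/LayerNorm)."""
    embeddings = (vocab + max_pos + segments) * H     # token + position + segment embeddings
    attention = 4 * H * H                             # Q, K, V and output projections
    feed_forward = 2 * H * (4 * H)                    # H -> 4H -> H
    return embeddings + L * (attention + feed_forward)

print(f"BERT-base  ~{approx_bert_params(12, 768) / 1e6:.0f}M")   # roughly 109M, quoted as 110M
print(f"BERT-large ~{approx_bert_params(24, 1024) / 1e6:.0f}M")  # roughly 334M, quoted as 340M
```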

7 Input Representation
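The input-representation figure itself is not reproduced in this transcript. As a sketch of the standard BERT scheme, assuming PyTorch and illustrative token ids: the encoder input is the element-wise sum of WordPiece token embeddings, segment (sentence A/B) embeddings, and learned position embeddings.

```python
import torch
import torch.nn as nn

H, vocab_size, max_len = 768, 30522, 512   # BERT-base sizes

token_emb = nn.Embedding(vocab_size, H)    # WordPiece token embeddings
segment_emb = nn.Embedding(2, H)           # sentence A vs. sentence B
position_emb = nn.Embedding(max_len, H)    # learned absolute positions

# Example pair: [CLS] my dog is cute [SEP] he likes play ##ing [SEP] (ids are illustrative only)
token_ids = torch.tensor([[101, 1139, 3676, 1110, 10509, 102, 1119, 7407, 1505, 1158, 102]])
segment_ids = torch.tensor([[0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]])
position_ids = torch.arange(token_ids.size(1)).unsqueeze(0)

# The encoder input is the sum of the three embeddings (the real model also applies LayerNorm and dropout).
input_repr = token_emb(token_ids) + segment_emb(segment_ids) + position_emb(position_ids)
print(input_repr.shape)  # torch.Size([1, 11, 768])
```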

8 Task 1: Masked LM
To train a deep bidirectional representation:
  First, mask 15% of the input tokens at random
  Second, predict those masked tokens
  e.g., I made a bank deposit → I made a [MASK] deposit
Downsides:
  A mismatch between pre-training and fine-tuning, since [MASK] never appears during fine-tuning
  More pre-training steps are needed to converge
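A minimal plain-Python sketch of this naive masking step; the tokenization and function name are illustrative, not the actual preprocessing code.

```python
import random

MASK_TOKEN, MASK_RATE = "[MASK]", 0.15

def naive_mask(tokens):
    """Replace roughly 15% of tokens with [MASK]; return masked tokens and prediction targets."""
    masked, targets = list(tokens), []
    for i, tok in enumerate(tokens):
        if random.random() < MASK_RATE:
            targets.append((i, tok))      # the model is trained to predict the original token here
            masked[i] = MASK_TOKEN
    return masked, targets

print(naive_mask("I made a bank deposit".split()))
```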

9 Task 1: Masked LM
To mitigate the mismatch between pre-training and fine-tuning, a selected token is replaced:
  80% of the time with [MASK], e.g., I made a bank deposit → I made a [MASK] deposit
  10% of the time with a random token, e.g., I made a bank deposit → I made a apple deposit
  10% of the time left unchanged, e.g., I made a bank deposit → I made a bank deposit
More pre-training steps are still needed to converge, but the empirical improvements outweigh the increased training cost
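A sketch of the 80/10/10 replacement rule in plain Python; the toy vocabulary and function name are hypothetical, and the real implementation operates on WordPiece ids.

```python
import random

def mask_with_mixing(tokens, vocab, mask_rate=0.15):
    """BERT-style masking: of the selected ~15%, 80% -> [MASK], 10% -> random token, 10% unchanged."""
    masked, targets = list(tokens), []
    for i, tok in enumerate(tokens):
        if random.random() >= mask_rate:
            continue
        targets.append((i, tok))                 # the prediction target is always the original token
        roll = random.random()
        if roll < 0.8:
            masked[i] = "[MASK]"                 # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[i] = random.choice(vocab)     # 10%: replace with a random token
        # else: 10% of the time the token is left unchanged
    return masked, targets

toy_vocab = ["apple", "river", "deposit", "made"]    # hypothetical toy vocabulary
print(mask_with_mixing("I made a bank deposit".split(), toy_vocab))
```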

10 Task 2: Next Sentence Prediction
To learn sentence relationships, pre-train on a binarized next sentence prediction task.
Input = [CLS] the man went to [MASK] store [SEP] he bought a gallon [MASK] milk [SEP]
Label = IsNext
Input = [CLS] the man [MASK] to the store [SEP] penguin [MASK] are flight ##less birds [SEP]
Label = NotNext
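A plain-Python sketch of how IsNext/NotNext pairs might be drawn: half of the time sentence B is the true next sentence, otherwise a random sentence from the corpus. The corpus variable and helper below are illustrative only.

```python
import random

def make_nsp_example(doc, all_docs):
    """Pick sentence A and either its true successor (IsNext) or a random sentence (NotNext)."""
    i = random.randrange(len(doc) - 1)
    sent_a = doc[i]
    if random.random() < 0.5:
        sent_b, label = doc[i + 1], "IsNext"
    else:
        other = random.choice(all_docs)            # in practice, sampled from a different document
        sent_b, label = random.choice(other), "NotNext"
    return ["[CLS]"] + sent_a + ["[SEP]"] + sent_b + ["[SEP]"], label

corpus = [[["the", "man", "went", "to", "the", "store"],
           ["he", "bought", "a", "gallon", "of", "milk"]],
          [["penguin", "##s", "are", "flight", "##less", "birds"]]]
print(make_nsp_example(corpus[0], corpus))
```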

11 Fine-tuning Procedure
Parameters are fine-tuned jointly to maximize the log-probability of the correct label.
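A minimal PyTorch sketch of this setup for a sentence-level task, assuming a hypothetical pretrained `encoder` module that returns per-token hidden states: a single classification layer is added over the final [CLS] vector and trained jointly with the encoder by cross-entropy (i.e., the negative log-probability of the correct label).

```python
import torch.nn as nn
import torch.nn.functional as F

H, num_labels = 768, 2
classifier = nn.Linear(H, num_labels)       # the only new parameters added for the downstream task

def fine_tune_step(encoder, optimizer, token_ids, labels):
    """One fine-tuning step: maximize the log-probability of the correct label."""
    hidden = encoder(token_ids)             # hypothetical encoder output: (batch, seq_len, H)
    cls_vec = hidden[:, 0]                  # final hidden state of the [CLS] token
    logits = classifier(cls_vec)
    loss = F.cross_entropy(logits, labels)  # = negative log-probability of the correct label
    loss.backward()                         # gradients flow into both classifier and encoder
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```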

12 General Language Understanding Evaluation (GLUE) Results
BERTLARGE and BERTBASE outperform the baselines across all tasks. BERTLARGE outperforms BERTBASE across all tasks, even those with little training data.

13 NER Results: token tagging task
SQuAD v1.1 Results: span prediction task
SWAG Results: classification task
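For the span prediction task (SQuAD), the only task-specific parameters are start and end scoring vectors applied to each token's final hidden state. A hedged PyTorch sketch with hypothetical encoder outputs:

```python
import torch
import torch.nn as nn

H = 768
span_head = nn.Linear(H, 2)               # produces a start score and an end score per token

hidden = torch.randn(1, 384, H)           # hypothetical encoder output for a question+passage pair
start_logits, end_logits = span_head(hidden).split(1, dim=-1)

# Greedy prediction: take the highest-scoring start and end independently
# (the paper searches for the best valid span with end >= start).
start = start_logits.squeeze(-1).argmax(dim=-1)
end = end_logits.squeeze(-1).argmax(dim=-1)
print(start.item(), end.item())
```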

14 Ablation Studies
Line 1 vs. Line 2: impact of next sentence prediction
Line 2 vs. Line 3: impact of bidirectionality
Line 2 vs. Line 4: impact of bidirectionality
Larger models lead to accuracy improvements
BERT is effective for both the fine-tuning and feature-based approaches

15 https://github.com/google-research/bert

