1 BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova
Google AI Language

2 Outline
Motivation
Contribution
Pre-training Procedure
  Model Architecture
  Input Representation
  Pre-training Tasks
Fine-tuning Procedure
Experiments
  GLUE Results
  SQuAD v1.1 Results
  NER Results
  SWAG Results
Ablation Studies

3 Motivation
Existing representation models severely limit the expressive power of pre-trained representations, e.g., I made a bank deposit.
OpenAI GPT: a unidirectional pre-trained model
  Each word is contextualized using only its left (or right) context
  The "bank" representation is based only on "I made a", not on "deposit"
ELMo: a shallowly bidirectional pre-trained model (separately trained left-to-right and right-to-left LMs)
Bidirectional Encoder Representations from Transformers (BERT): a deeply bidirectional pre-trained model
  Each word is contextualized using both its left and right context
  The "bank" representation is based on "I made a ____ deposit"

4 Contribution
Demonstrates the importance of bidirectional pre-training for language representations
Eliminates the need for heavily engineered task-specific architectures
Advances the state of the art for eleven NLP tasks

5 Pre-training Procedure
(1) First, two sentences are sampled from the corpus, and the model is asked to predict whether the second sentence is the actual next sentence of the first (Next Sentence). At the same time, some words in the two sentences are randomly masked out, and the model is asked to predict what those words are (Masked Words). (2) The processed sentence pair is then fed into a deep bidirectional encoder, and the two objectives are learned jointly through two loss functions to complete pre-training.
Pre-training procedure covers: Input Representation, Model Architecture, Task #1: Masked LM, Task #2: Next Sentence Prediction
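The two pre-training losses are optimized jointly. Below is a minimal PyTorch sketch of how the masked-LM and next-sentence losses might be combined into one objective; the tensors and shapes are hypothetical stand-ins for the encoder's outputs, not the released implementation.

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes: batch of 2 sequences, 8 tokens each, toy vocabulary of 100 ids.
vocab_size = 100
mlm_logits = torch.randn(2, 8, vocab_size)            # per-token vocabulary scores from the encoder
nsp_logits = torch.randn(2, 2)                        # IsNext / NotNext scores from the [CLS] vector

mlm_labels = torch.full((2, 8), -100, dtype=torch.long)  # -100 = position not masked, ignored by the loss
mlm_labels[0, 3] = 42                                    # pretend token id 42 was masked at position 3
nsp_labels = torch.tensor([0, 1])                        # 0 = IsNext, 1 = NotNext

# Masked-LM loss is computed only over masked positions; NSP is a 2-way classification.
mlm_loss = F.cross_entropy(mlm_logits.view(-1, vocab_size), mlm_labels.view(-1), ignore_index=-100)
nsp_loss = F.cross_entropy(nsp_logits, nsp_labels)

# Pre-training minimizes the sum of the two losses.
total_loss = mlm_loss + nsp_loss
```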

6 Model Architecture A multi-layer bidirectional Transformer encoder
BERTBASE: L=12, H=768, A=12, Total Parameters = 110M
BERTLARGE: L=24, H=1024, A=16, Total Parameters = 340M
(L = number of layers, H = hidden size, A = number of self-attention heads)
Based on the Transformer encoder of "Attention Is All You Need", NIPS 2017
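As a sanity check on the quoted sizes, here is a rough Python estimate of the parameter counts implied by L and H under the standard Transformer sizing (feed-forward width 4H, ~30k WordPiece vocabulary); it ignores biases and LayerNorm, so it slightly undercounts.

```python
def approx_bert_params(L, H, vocab=30522, max_pos=512, segments=2):
    """Rough parameter count for a BERT-style Transformer encoder (ignores biases/LayerNorm)."""
    embeddings = (vocab + max_pos + segments) * H     # token + position + segment embeddings
    attention = 4 * H * H                             # Q, K, V and output projections
    feed_forward = 2 * H * (4 * H)                    # H -> 4H -> H
    return embeddings + L * (attention + feed_forward)

print(f"BERT-base  ~{approx_bert_params(12, 768) / 1e6:.0f}M")   # roughly 109M, quoted as 110M
print(f"BERT-large ~{approx_bert_params(24, 1024) / 1e6:.0f}M")  # roughly 334M, quoted as 340M
```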

7 Input Representation
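The input-representation figure itself is not reproduced in this transcript. As a sketch of the standard BERT scheme, assuming PyTorch and illustrative token ids: the encoder input is the element-wise sum of WordPiece token embeddings, segment (sentence A/B) embeddings, and learned position embeddings.

```python
import torch
import torch.nn as nn

H, vocab_size, max_len = 768, 30522, 512   # BERT-base sizes

token_emb = nn.Embedding(vocab_size, H)    # WordPiece token embeddings
segment_emb = nn.Embedding(2, H)           # sentence A vs. sentence B
position_emb = nn.Embedding(max_len, H)    # learned absolute positions

# Example pair: [CLS] my dog is cute [SEP] he likes play ##ing [SEP] (ids are illustrative only)
token_ids = torch.tensor([[101, 1139, 3676, 1110, 10509, 102, 1119, 7407, 1505, 1158, 102]])
segment_ids = torch.tensor([[0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]])
position_ids = torch.arange(token_ids.size(1)).unsqueeze(0)

# The encoder input is the sum of the three embeddings (the real model also applies LayerNorm and dropout).
input_repr = token_emb(token_ids) + segment_emb(segment_ids) + position_emb(position_ids)
print(input_repr.shape)  # torch.Size([1, 11, 768])
```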

8 Task 1: Masked LM
To train a deep bidirectional representation:
  First, mask 15% of the input tokens at random
  Second, predict those masked tokens
  e.g., I made a bank deposit → I made a [MASK] deposit
Downsides:
  A mismatch between pre-training and fine-tuning, since [MASK] never appears during fine-tuning
  More pre-training steps are needed to converge
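A minimal plain-Python sketch of this naive masking step; the tokenization and function name are illustrative, not the actual preprocessing code.

```python
import random

MASK_TOKEN, MASK_RATE = "[MASK]", 0.15

def naive_mask(tokens):
    """Replace roughly 15% of tokens with [MASK]; return masked tokens and prediction targets."""
    masked, targets = list(tokens), []
    for i, tok in enumerate(tokens):
        if random.random() < MASK_RATE:
            targets.append((i, tok))      # the model is trained to predict the original token here
            masked[i] = MASK_TOKEN
    return masked, targets

print(naive_mask("I made a bank deposit".split()))
```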

9 Task 1: Masked LM
To mitigate the mismatch between pre-training and fine-tuning, a selected token is replaced:
  80% of the time with [MASK], e.g., I made a bank deposit → I made a [MASK] deposit
  10% of the time with a random token, e.g., I made a bank deposit → I made a apple deposit
  10% of the time left unchanged, e.g., I made a bank deposit → I made a bank deposit
More pre-training steps are still needed to converge, but the empirical improvements outweigh the increased training cost
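A sketch of the 80/10/10 replacement rule in plain Python; the toy vocabulary and function name are hypothetical, and the real implementation operates on WordPiece ids.

```python
import random

def mask_with_mixing(tokens, vocab, mask_rate=0.15):
    """BERT-style masking: of the selected ~15%, 80% -> [MASK], 10% -> random token, 10% unchanged."""
    masked, targets = list(tokens), []
    for i, tok in enumerate(tokens):
        if random.random() >= mask_rate:
            continue
        targets.append((i, tok))                 # the prediction target is always the original token
        roll = random.random()
        if roll < 0.8:
            masked[i] = "[MASK]"                 # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[i] = random.choice(vocab)     # 10%: replace with a random token
        # else: 10% of the time the token is left unchanged
    return masked, targets

toy_vocab = ["apple", "river", "deposit", "made"]    # hypothetical toy vocabulary
print(mask_with_mixing("I made a bank deposit".split(), toy_vocab))
```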

10 Task 2: Next Sentence Prediction
To learn sentence relationships, pre-train on a binarized next sentence prediction task.
Input = [CLS] the man went to [MASK] store [SEP] he bought a gallon [MASK] milk [SEP]
Label = IsNext
Input = [CLS] the man [MASK] to the store [SEP] penguin [MASK] are flight ##less birds [SEP]
Label = NotNext
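A plain-Python sketch of how IsNext/NotNext pairs might be drawn: half of the time sentence B is the true next sentence, otherwise a random sentence from the corpus. The corpus variable and helper below are illustrative only.

```python
import random

def make_nsp_example(doc, all_docs):
    """Pick sentence A and either its true successor (IsNext) or a random sentence (NotNext)."""
    i = random.randrange(len(doc) - 1)
    sent_a = doc[i]
    if random.random() < 0.5:
        sent_b, label = doc[i + 1], "IsNext"
    else:
        other = random.choice(all_docs)            # in practice, sampled from a different document
        sent_b, label = random.choice(other), "NotNext"
    return ["[CLS]"] + sent_a + ["[SEP]"] + sent_b + ["[SEP]"], label

corpus = [[["the", "man", "went", "to", "the", "store"],
           ["he", "bought", "a", "gallon", "of", "milk"]],
          [["penguin", "##s", "are", "flight", "##less", "birds"]]]
print(make_nsp_example(corpus[0], corpus))
```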

11 Fine-tuning Procedure
Parameters are fine-tuned jointly to maximize the log-probability of the correct label.
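A minimal PyTorch sketch of this setup for a sentence-level task, assuming a hypothetical pretrained `encoder` module that returns per-token hidden states: a single classification layer is added over the final [CLS] vector and trained jointly with the encoder by cross-entropy (i.e., the negative log-probability of the correct label).

```python
import torch.nn as nn
import torch.nn.functional as F

H, num_labels = 768, 2
classifier = nn.Linear(H, num_labels)       # the only new parameters added for the downstream task

def fine_tune_step(encoder, optimizer, token_ids, labels):
    """One fine-tuning step: maximize the log-probability of the correct label."""
    hidden = encoder(token_ids)             # hypothetical encoder output: (batch, seq_len, H)
    cls_vec = hidden[:, 0]                  # final hidden state of the [CLS] token
    logits = classifier(cls_vec)
    loss = F.cross_entropy(logits, labels)  # = negative log-probability of the correct label
    loss.backward()                         # gradients flow into both classifier and encoder
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```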

12 General Language Understanding Evaluation (GLUE) Results
BERTLARGE and BERTBASE outperform the baselines across all tasks. BERTLARGE outperforms BERTBASE across all tasks, even those with little training data.

13 NER Results: token tagging task
SQuAD v1.1 Results: span prediction task
SWAG Results: classification task
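For the span prediction task (SQuAD), the only task-specific parameters are start and end scoring vectors applied to each token's final hidden state. A hedged PyTorch sketch with hypothetical encoder outputs:

```python
import torch
import torch.nn as nn

H = 768
span_head = nn.Linear(H, 2)               # produces a start score and an end score per token

hidden = torch.randn(1, 384, H)           # hypothetical encoder output for a question+passage pair
start_logits, end_logits = span_head(hidden).split(1, dim=-1)

# Greedy prediction: take the highest-scoring start and end independently
# (the paper searches for the best valid span with end >= start).
start = start_logits.squeeze(-1).argmax(dim=-1)
end = end_logits.squeeze(-1).argmax(dim=-1)
print(start.item(), end.item())
```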

14 Ablation Studies
Line 1 vs. Line 2: impact of next sentence prediction
Line 2 vs. Line 3: impact of bidirectionality
Line 2 vs. Line 4: impact of bidirectionality
Larger models lead to accuracy improvements
BERT is effective for both the fine-tuning and feature-based approaches

15 https://github.com/google-research/bert

