Jacob Devlin Ming-Wei Chang Kenton Lee Kristina Toutanova


BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova (Google AI Language)

Outline Motivation Contribution Pre-training Procedure Model Architecture Input Representation Pre-training Tasks Fine-tuning Procedure Experiments GLUE Results SQuAD v1.1 Results NER Results SWAG Results Ablation Studies

Motivation Existing representation models severely limit the expressive power of pre-trained representations, e.g., “I made a bank deposit.” OpenAI GPT is a unidirectional pre-trained model: each word is contextualized using only its left (or right) context, so the “bank” representation is based only on “I made a”, but not on “deposit”. ELMo is a shallow bidirectional pre-trained model. Bidirectional Encoder Representations from Transformers (BERT) is a deeply bidirectional pre-trained model: each word is contextualized using both its left and right context, so the “bank” representation is based on “I made a ____ deposit”.

Contribution The importance of bidirectional pre-training for language representations Eliminate the needs of heavily engineered task-specific architectures The state-of-the-art for eleven NLP tasks

Pre-training Procedure (1) Two sentences are first drawn from the corpus, and the model is asked to predict whether the second sentence is the actual next sentence of the first (Next Sentence). At the same time, some words in the two sentences are randomly removed, and the model is asked to predict what those words are (Masked Words). (2) The processed sentence pair is fed into a deep bidirectional encoder, and the two objectives are learned simultaneously through two loss functions, completing pre-training. The procedure covers: Input Representation, Model Architecture, Task #1: Masked LM, Task #2: Next Sentence Prediction.
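As an illustration of step (2), here is a minimal PyTorch-style sketch (not the authors' code; the tensor shapes and the -100 ignore-index convention are assumptions) of how the masked-word loss and the next-sentence loss can be summed into a single pre-training objective:

```python
import torch.nn.functional as F

def pretraining_loss(mlm_logits, mlm_labels, nsp_logits, nsp_labels):
    """Sum of the two pre-training objectives (assumed shapes below).

    mlm_logits: (batch, seq_len, vocab_size) predictions for every position
    mlm_labels: (batch, seq_len) original ids at masked positions, -100 elsewhere
    nsp_logits: (batch, 2) IsNext / NotNext scores from the [CLS] vector
    nsp_labels: (batch,) 0 = IsNext, 1 = NotNext
    """
    # Masked LM loss: ignore_index skips the positions that were not masked.
    mlm_loss = F.cross_entropy(
        mlm_logits.reshape(-1, mlm_logits.size(-1)),
        mlm_labels.reshape(-1),
        ignore_index=-100,
    )
    # Next sentence prediction loss on the [CLS] classifier output.
    nsp_loss = F.cross_entropy(nsp_logits, nsp_labels)
    return mlm_loss + nsp_loss
```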

Model Architecture A multi-layer bidirectional Transformer encoder (L = number of layers, H = hidden size, A = number of self-attention heads). BERTBASE: L=12, H=768, A=12, Total Parameters=110M. BERTLARGE: L=24, H=1024, A=16, Total Parameters=340M. Attention Is All You Need, NIPS 2017.
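For reference, the two published configurations can be written down as follows; this is a sketch assuming the Hugging Face transformers library is available, not the original training code:

```python
from transformers import BertConfig, BertModel

# BERT-Base: L=12 layers, H=768 hidden size, A=12 attention heads (~110M parameters)
base_config = BertConfig(num_hidden_layers=12, hidden_size=768,
                         num_attention_heads=12, intermediate_size=3072)

# BERT-Large: L=24 layers, H=1024 hidden size, A=16 attention heads (~340M parameters)
large_config = BertConfig(num_hidden_layers=24, hidden_size=1024,
                          num_attention_heads=16, intermediate_size=4096)

model = BertModel(base_config)  # randomly initialized multi-layer Transformer encoder
print(sum(p.numel() for p in model.parameters()))  # roughly 110M
```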

Input Representation For each token, the input representation is the sum of its token (WordPiece) embedding, segment embedding, and position embedding. A [CLS] token is prepended to every sequence and a [SEP] token separates the two sentences.
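A minimal PyTorch sketch of this input representation, with illustrative vocabulary and length values; the class and argument names are hypothetical:

```python
import torch
import torch.nn as nn

class BertInputEmbeddings(nn.Module):
    """Sketch: BERT input = sum of token, segment, and position embeddings."""
    def __init__(self, vocab_size=30522, hidden=768, max_len=512):
        super().__init__()
        self.token = nn.Embedding(vocab_size, hidden)   # WordPiece token ids
        self.segment = nn.Embedding(2, hidden)          # sentence A = 0, sentence B = 1
        self.position = nn.Embedding(max_len, hidden)   # learned position embeddings

    def forward(self, token_ids, segment_ids):
        # token_ids, segment_ids: (batch, seq_len) integer tensors
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return (self.token(token_ids)
                + self.segment(segment_ids)
                + self.position(positions))
```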

Task 1: Masked LM In order to train a deep bidirectional representation: first, mask 15% of the input tokens at random; second, predict those masked tokens. e.g., I made a bank deposit → I made a [MASK] deposit. Downsides: a mismatch between pre-training and fine-tuning, since [MASK] never appears during fine-tuning, and more pre-training steps are needed to converge.

Task 1: Masked LM To mitigate the mismatch between pre-training and fine-tuning, each chosen token is replaced as follows. 80% of the time: [MASK], e.g., I made a bank deposit → I made a [MASK] deposit. 10% of the time: a random token, e.g., I made a bank deposit → I made a apple deposit. 10% of the time: unchanged, e.g., I made a bank deposit → I made a bank deposit. Although more pre-training steps are needed to converge, the empirical improvements outweigh the increased training cost.
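A short Python sketch of this 80/10/10 corruption strategy, using whitespace tokenization and a toy vocabulary for illustration (the helper name and vocabulary are hypothetical, not from the paper):

```python
import random

MASK = "[MASK]"
TOY_VOCAB = ["apple", "river", "car", "deposit"]  # stand-in for the WordPiece vocabulary

def mask_tokens(tokens, mask_prob=0.15):
    """Choose ~15% of tokens; replace each chosen token with [MASK] 80% of the
    time, a random token 10% of the time, and leave it unchanged 10% of the time.
    Returns (corrupted tokens, labels), where labels hold the original tokens
    to predict and None elsewhere."""
    corrupted, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() >= mask_prob:
            continue
        labels[i] = tok                       # the model must recover this token
        r = random.random()
        if r < 0.8:
            corrupted[i] = MASK               # 80%: [MASK]
        elif r < 0.9:
            corrupted[i] = random.choice(TOY_VOCAB)  # 10%: random token
        # else: 10% keep the original token unchanged
    return corrupted, labels

print(mask_tokens("I made a bank deposit".split()))
```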

Task 2: Next Sentence Prediction In order to understand sentence relationships, pre-train on a next sentence prediction task. Input = [CLS] the man went to [MASK] store [SEP] he bought a gallon [MASK] milk [SEP] Label = IsNext Input = [CLS] the man [MASK] to the store [SEP] penguin [MASK] are flight ##less birds [SEP] Label = NotNext
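A sketch of how such sentence-pair examples could be constructed, with a hypothetical helper and whitespace tokenization standing in for WordPiece:

```python
import random

def make_nsp_example(corpus_sentences, i):
    """Build one next-sentence-prediction example: 50% of the time pair sentence
    i with its actual next sentence (IsNext), otherwise with a random sentence
    from the corpus (NotNext)."""
    sent_a = corpus_sentences[i]
    if random.random() < 0.5 and i + 1 < len(corpus_sentences):
        sent_b, label = corpus_sentences[i + 1], "IsNext"
    else:
        sent_b, label = random.choice(corpus_sentences), "NotNext"
    tokens = ["[CLS]"] + sent_a.split() + ["[SEP]"] + sent_b.split() + ["[SEP]"]
    return tokens, label
```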

Fine-tuning Procedure A task-specific output layer is added on top of the pre-trained model, and all parameters are fine-tuned jointly to maximize the log-probability of the correct label.
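A minimal PyTorch-style sketch of this setup for a sentence-level classification task; the encoder interface and names are assumptions, not the released implementation:

```python
import torch.nn as nn

class BertForClassification(nn.Module):
    """Sketch: a single classification layer on top of the final [CLS]
    representation; the pre-trained encoder and the new layer are trained
    jointly with cross-entropy (i.e., maximizing log p(correct label))."""
    def __init__(self, encoder, hidden=768, num_labels=2):
        super().__init__()
        self.encoder = encoder                        # pre-trained BERT encoder
        self.classifier = nn.Linear(hidden, num_labels)

    def forward(self, token_ids, segment_ids, labels=None):
        # Assumed interface: encoder returns (batch, seq_len, hidden) states.
        hidden_states = self.encoder(token_ids, segment_ids)
        cls_vec = hidden_states[:, 0]                 # representation of [CLS]
        logits = self.classifier(cls_vec)
        if labels is not None:
            return nn.functional.cross_entropy(logits, labels)
        return logits
```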

General Language Understanding Evaluation (GLUE) Results BERTLARGE and BERTBASE outperform the baselines across all tasks. BERTLARGE outperforms BERTBASE across all tasks, even those with little training data.

SQuAD v1.1 Results (span prediction task) NER Results (token tagging task) SWAG Results (classification task)

Ablation Studies Line_1 vs. Line_2: NSP impact. Line_2 vs. Line_3: bidirectionality impact. Line_2 vs. Line_4: bidirectionality impact. Larger models lead to an accuracy improvement. BERT is effective for both the fine-tuning and feature-based approaches.

https://github.com/google-research/bert