1
Introduction to Natural Language Processing
Mamoru Komachi
Faculty of System Design, Tokyo Metropolitan University
16 Feb 2018, 1:1 English presentation rehearsal
2
Short Bio
March 2005: The University of Tokyo, majored in History and Philosophy of Science (B.A.)
March 2010: Nara Institute of Science and Technology, majored in Natural Language Processing (Ph.D.)
September 2011: Apple Japan, developed a Japanese input method (Software Engineer)
March 2013: Nara Institute of Science and Technology, started an academic career (Assistant Professor)
April 2013 to present: Tokyo Metropolitan University, opened a lab (Associate Professor)
3
Deep learning in a nutshell
Paradigm for learning complex mathematical models using multi-layered neural networks
Achieves dramatic improvements in various kinds of pattern recognition tasks
Has now become one of the standard approaches in vision and speech (continuous and dense feature representations)
Not widely used in natural language processing until recently, due to the nature of language (discrete and sparse feature representations)
4
DL learns implicit features without supervision
Lee et al., Convolutional Deep Belief Networks for Scalable Unsupervised Learning of Hierarchical Representations. ICML 2009.
5
Deep learning for natural language processing
Representation learning
- Traditional feature engineering requires huge annotation cost and/or hand-crafted features (feature templates)
- Need to learn implicit feature representations (without supervision)
Architecture for deep neural networks
- Conventional statistical methods cannot produce fluent sentences
- Need to generate fluent sentences (possibly with heavy supervision)
6
Example of implicit features in NLP
What is the meaning of "king"?
- "King Arthur is a legendary British leader of the late 5th and early 6th centuries."
- "Edward VIII was King of the United Kingdom and the Dominions of the British Empire, ..."
The meaning of a word can be characterized by its contextual words (distributional hypothesis):
"You shall know a word by the company it keeps." (Firth, 1957)
7
Vector space model: Representing the meaning as a vector
king = (0.1, 0.7, -0.3, 0, -1)^T
- How often does it co-occur with "leader"?
- How often does it co-occur with "empire"?
- How often does it co-occur with "dog"?
- ...
Similarity of words (vectors):
king · farmer = ||king|| ||farmer|| cos θ
similarity = cos θ = (king · farmer) / (||king|| ||farmer||)
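A minimal numpy sketch of this cosine similarity (not from the slides): the "king" vector is the toy vector above, and the "farmer" vector is made up purely for illustration.

```python
import numpy as np

# Toy word vectors; "king" is the illustrative vector from the slide,
# "farmer" is an arbitrary made-up vector for comparison.
king = np.array([0.1, 0.7, -0.3, 0.0, -1.0])
farmer = np.array([0.2, 0.1, 0.5, 0.3, -0.2])

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """similarity = cos(theta) = (u . v) / (||u|| ||v||)"""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine_similarity(king, farmer))
```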
8
Semantic composition (additive operation) by a word vector (word2vec)
king = (0.1, 0.7, -0.3, 0, -1)^T
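A hedged sketch of additive composition with word2vec vectors, assuming gensim and some pretrained word2vec-format embedding file (the file name below is a placeholder); it runs the classic "king - man + woman ≈ queen" style of query.

```python
from gensim.models import KeyedVectors

# Load pretrained word2vec vectors (placeholder path; any word2vec-format
# embedding file will do).
kv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

# Additive semantic composition: king - man + woman ~= queen
print(kv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```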
9
Quantitative evaluation of semantic similarity of words and phrases
Visualization of word embeddings
- Red: word embeddings learned by word2vec
- Blue: word embeddings optimized using the grammatical errors of English learners
Word embeddings can be learned so that they reflect how words are actually used.
10
Other representation of semantic meanings
Matrix: Baroni and Zamparelli. Nouns are vectors, adjectives are matrices: Representing adjective-noun constructions in semantic space.
Tensor: Socher et al. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank.
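To make the matrix idea concrete, here is a toy sketch (my own illustration, with random untrained parameters and arbitrary dimensions) of Baroni and Zamparelli-style composition, where an adjective is a matrix applied to a noun vector.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions and random (untrained) parameters, just to show the shapes:
# a noun is a d-dimensional vector, an adjective is a d x d matrix, and the
# adjective-noun phrase is their matrix-vector product.
d = 5
noun_moon = rng.normal(size=d)         # vector for the noun "moon"
adj_red = rng.normal(size=(d, d))      # matrix for the adjective "red"

phrase_red_moon = adj_red @ noun_moon  # composed meaning of "red moon"
print(phrase_red_moon.shape)           # (5,)
```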
11
Deep learning for natural language processing
Representation learning
- Traditional feature engineering requires huge annotation cost and/or hand-crafted features (feature templates)
- Need to learn implicit feature representations (without supervision)
Architecture for deep neural networks
- Conventional statistical methods cannot produce fluent sentences
- Need to generate fluent sentences (possibly with heavy supervision)
12
Natural language generation: Machine translation
Translate a sentence in one language into another.
Research questions:
- How to address the multi-class classification problem?
- How to model alignment between words?
13
Traditional approaches in machine translation
Bernard Vauquois' pyramid: generate an expression in the target language by understanding the source language.
Mainstream approach until the 1990s.
14
Recent advances in statistical and empirical methods in machine translation
1. Translation models: from the IBM models (1993) to phrase-based methods (2003)
2. Open-source software: GIZA++, SRILM, Pharaoh, Moses (2003-)
3. Automatic evaluation: BLEU (2002)
4. Optimization: minimum error rate training (2003)
5. Massive data: Europarl, patent corpus (2008)
15
Statistical machine translation: Learning mathematical models from parallel corpus
Noisy channel model: argmax_e P(f | e) P(e)
① Alignment: learned from a parallel corpus (source and target)
② Rule extraction: translation model P(f | e) from the parallel corpus; language model P(e) from a raw corpus in the target language
③ Decode: search for the argmax
④ Optimization: tune toward BLEU on a development corpus (reference)
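Written out as a formula (the standard noisy channel decomposition, not slide text), the decoding objective behind steps ① to ④ is:

```latex
\hat{e} = \operatorname*{argmax}_{e} P(e \mid f)
        = \operatorname*{argmax}_{e} P(f \mid e)\, P(e)
```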
16
The larger the data, the better the translation
Brants et al. Large Language Models in Machine Translation. EMNLP 2007.
(Plot: translation quality improves as the training data grows; x-axis: data size, y-axis: translation quality.)
17
From statistical models to a neural model
Why factorize? → If the parallel corpus is small, the translation model P(e | f) cannot be estimated robustly.
What if we have a large-scale parallel corpus? → No need to factorize the translation model:
P(e | f) = ∏_z P(e_z | e_<z, f)
Each generated word depends on the source words and all the target words generated so far.
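Side by side (my own restatement of the two modelling choices on this slide and the noisy channel slide):

```latex
% Statistical MT: factorize with Bayes' rule into translation and language models
P(e \mid f) \propto P(f \mid e)\, P(e)

% Neural MT: model P(e \mid f) directly, one target word at a time (chain rule)
P(e \mid f) = \prod_{z=1}^{|e|} P(e_z \mid e_{<z}, f)
```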
18
Recurrent neural network (RNN) encodes sequence of words
(Figure: input, hidden, and output layers of an RNN unrolled over time; the input words 深層学習 / マジ / やばい are read one at a time, and at each step the hidden state is updated and the next output word is predicted, ending with </s>.)
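As a concrete sketch (my own, with arbitrary placeholder sizes and vocabulary) of such an encoder in PyTorch:

```python
import torch
import torch.nn as nn

# Minimal RNN encoder: reads a sequence of token ids and returns the final
# hidden state as a fixed-size sentence vector.
class RNNEncoder(nn.Module):
    def __init__(self, vocab_size: int = 1000, emb_dim: int = 64, hid_dim: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)

    def forward(self, token_ids: torch.Tensor):
        # token_ids: (batch, seq_len) -> outputs: (batch, seq_len, hid_dim)
        outputs, hidden = self.rnn(self.embed(token_ids))
        return outputs, hidden  # hidden: (1, batch, hid_dim), the sentence vector

encoder = RNNEncoder()
outputs, hidden = encoder(torch.randint(0, 1000, (1, 3)))  # e.g. ids for 深層学習 / マジ / やばい
print(hidden.shape)  # torch.Size([1, 1, 128])
```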
19
Generating a sentence by combining two RNN models (sequence to sequence)
Encoder-decoder approach
- Encoder: encodes the source sentence (深層学習 マジ やばい) into a sentence vector built from its word vectors
- Decoder: generates the target sentence word by word ("DL is really cool </s>")
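A minimal sketch of the decoder half (my own illustration, untrained random weights, placeholder token ids): starting from the encoder's sentence vector, generate target words greedily until </s>.

```python
import torch
import torch.nn as nn

# Decoder sketch for an encoder-decoder model; sizes and ids are placeholders.
vocab_size, emb_dim, hid_dim = 1000, 64, 128
embed = nn.Embedding(vocab_size, emb_dim)
cell = nn.GRUCell(emb_dim, hid_dim)
project = nn.Linear(hid_dim, vocab_size)

hidden = torch.zeros(1, hid_dim)   # in practice: the encoder's final hidden state
token = torch.tensor([1])          # id of the start symbol <s> (placeholder)
EOS = 2                            # id of </s> (placeholder)

generated = []
for _ in range(20):                # cap the output length
    hidden = cell(embed(token), hidden)
    token = project(hidden).argmax(dim=-1)  # greedy choice of the next word
    if token.item() == EOS:
        break
    generated.append(token.item())
print(generated)
```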
20
Seq2seq models can learn complex models only from simple features
Zhang and LeCun. Text Understanding from Scratch. arXiv. → Learns a text classification model from characters alone.
Zaremba and Sutskever. Learning to Execute. arXiv. → Learns a Python interpreter using only an RNN.
21
How to model alignments? → Attend to the source side during decoding
Attention = weighted sum of the encoder's hidden states
Do not form a single sentence vector; instead, use all the source word vectors through an attention mechanism.
(Figure: decoding "DL is really cool </s>" while attending to 深層学習 マジ やばい </s>.)
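A small sketch of this weighted sum (my own illustration, random tensors in place of real encoder and decoder states): dot-product scores are turned into weights by a softmax, and the context vector is the weighted sum of the encoder states.

```python
import torch
import torch.nn.functional as F

# Dot-product attention over encoder hidden states; shapes are placeholders.
seq_len, hid_dim = 3, 128
encoder_states = torch.randn(seq_len, hid_dim)  # one vector per source word
decoder_state = torch.randn(hid_dim)            # current decoder hidden state

scores = encoder_states @ decoder_state         # (seq_len,) similarity scores
weights = F.softmax(scores, dim=0)              # attention weights over source words
context = weights @ encoder_states              # (hid_dim,) weighted sum
print(weights, context.shape)
```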
22
Given a large-scale parallel corpus, neural machine translation outperforms statistical methods. Luong et al. Effective Approaches to Attention-based Neural Machine Translation. EMNLP 2015.
23
Attention can be done not only for sequence but also for tree structure
Eriguchi et al. Tree-to-Sequence Attentional Neural Machine Translation. ACL 2016. Attends to phrase structures in the encoder to take syntactic structure on the source side into account.
24
Deep learning enables language generation from multi-modal input
Generates a fluent caption from a single image alone.
25
Summary: Deep learning for natural language processing
Representation learning
- Can find implicit features
- Can compute the meaning of a sentence by semantic composition through mathematical modeling
Architecture for deep neural networks
- Can generate fluent sentences
- Opens up broad possibilities for natural language generation