Introduction to Natural Language Processing Mamoru Komachi <komachi@tmu.ac.jp> Faculty of System Design Tokyo Metropolitan University 16 Feb 2018, 1:1 English presentation rehearsal
Short Bio
March 2005: The University of Tokyo. Majored in History and Philosophy of Science (B.A.)
March 2010: Nara Institute of Science and Technology. Majored in Natural Language Processing (Ph.D.)
March 2013: Nara Institute of Science and Technology. Started an academic career (Assistant Professor)
September 2011: Apple Japan. Developed Japanese input method (Software Engineer)
April 2013-present: Tokyo Metropolitan University. Opened a lab (Associate Professor)
Deep learning in a nutshell A paradigm for learning complex mathematical models with multi-layered neural networks. Achieves dramatic improvements in various kinds of pattern recognition tasks. Has become one of the standard approaches in vision and speech (continuous and dense feature representations). Not widely used in natural language processing until recently, due to the nature of language (discrete and sparse feature representations).
DL learns implicit features without supervision Lee et al., Convolutional Deep Belief Networks for Scalable Unsupervised Learning of Hierarchical Representations. ICML 2009.
Deep learning for natural language processing Representation learning: traditional feature engineering requires huge annotation costs and/or hand-crafted features (feature templates); we need to learn implicit feature representations (without supervision). Architecture for deep neural networks: conventional statistical methods cannot produce fluent sentences; we need to generate fluent sentences (possibly with heavy supervision).
Example of implicit features in NLP What is the meaning of king? King Arthur is a legendary British leader of the late 5th and early 6th centuries Edward VIII was King of the United Kingdom and the Dominions of the British Empire, … The meaning of a word can be characterized by its contextual words (distributional hypothesis) “you shall know a word by the company it keeps.” (Firth, 1957)
Vector space model: Representing the meaning as a vector. king = (0.1, 0.7, -0.3, 0, -1)^T: how often does it co-occur with "leader"? How often does it co-occur with "empire"? How often does it co-occur with "dog"? ... Similarity of words (vectors): king · farmer = ||king|| ||farmer|| cos θ, so similarity = cos θ = (king · farmer) / (||king|| ||farmer||)
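As a minimal sketch of the similarity computation above (the vector values here are made-up toy numbers, not real co-occurrence counts):

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two word vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy co-occurrence vectors (illustrative values only).
king   = np.array([0.1, 0.7, -0.3, 0.0, -1.0])
farmer = np.array([0.2, 0.1, -0.4, 0.3, -0.2])

print(cosine_similarity(king, farmer))  # similarity in [-1, 1]
```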
Semantic composition (additive operation) with word vectors (word2vec). king = (0.1, 0.7, -0.3, 0, -1)^T
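A small sketch of additive composition with word2vec vectors, assuming a pre-trained model loaded with gensim; the file name "vectors.bin" is a placeholder, and the words used must exist in the model's vocabulary:

```python
from gensim.models import KeyedVectors

# Load pre-trained word2vec vectors (file name is a placeholder).
model = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

# Additive composition: king - man + woman is close to queen (the classic word2vec example).
print(model.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# Phrase meaning approximated as the sum of its word vectors.
phrase = model["british"] + model["leader"]
print(model.similar_by_vector(phrase, topn=3))
```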
Quantitative evaluation of semantic similarity of words and phrases. Visualization of word embeddings: red shows word embeddings learned by word2vec; blue shows word embeddings optimized using the grammatical errors of English learners. Word embeddings can be learned so that they reflect how the words are actually used.
Other representations of semantic meaning Matrix: Baroni and Zamparelli. 2010. Nouns are vectors, adjectives are matrices: Representing adjective-noun constructions in semantic space. Tensor: Socher et al. 2013. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank.
Deep learning for natural language processing Representation learning: traditional feature engineering requires huge annotation costs and/or hand-crafted features (feature templates); we need to learn implicit feature representations (without supervision). Architecture for deep neural networks: conventional statistical methods cannot produce fluent sentences; we need to generate fluent sentences (possibly with heavy supervision).
Natural language generation: Machine translation. Translate a sentence in one language into another. Research questions: How to address the multi-class classification problem? How to model the alignment between words? https://twitter.com/haldaume3/status/732333418038562816
Traditional approaches in machine translation Bernard Vauquois' pyramid: generate an expression in the target language by understanding the source language. The mainstream approach until the 1990s. CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=683855
Recent advances in statistical and empirical methods in machine translation
1. Translation models: from the IBM models (1993) to phrase-based methods (2003)
2. Open-source software: GIZA++, SRILM, Pharaoh, Moses (2003-)
3. Automatic evaluation: BLEU (2002)
4. Optimization: minimum error rate training (2003)
5. Massive data: Europarl, patent corpus (2008)
Statistical machine translation: Learning mathematical models from a parallel corpus. Noisy channel model: ê = argmax_e P(f|e) P(e), where P(f|e) is the translation model and P(e) is the language model over the target language. [Diagram: ① alignment and ② rule extraction on a source-target parallel corpus give the translation model; a raw target-language corpus gives the language model; ③ decoding searches for the argmax; ④ optimization tunes toward BLEU on a development corpus (reference)]
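A toy sketch of the noisy-channel decision rule above: score each candidate target sentence e by P(f|e) · P(e) and take the argmax. The candidate sentences and the log-probability tables below are invented purely for illustration:

```python
# Hypothetical model scores for one source sentence f (log probabilities).
translation_model = {"deep learning is really cool": -2.1,   # log P(f | e)
                     "deep learning very dangerous": -1.8}
language_model    = {"deep learning is really cool": -3.0,   # log P(e)
                     "deep learning very dangerous": -6.5}

# Decode: argmax_e [ log P(f|e) + log P(e) ]
best = max(translation_model,
           key=lambda e: translation_model[e] + language_model[e])
print(best)
```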
The larger the data, the better the translation Brants et al. "Large Language Models in Machine Translation". EMNLP 2007. [Graph: translation quality (better vs. worse) plotted against data size (small vs. large): more data gives better translations]
From statistical models to a neural model Why factorize? → If the parallel corpus is small, we cannot estimate the translation model P(e|f) robustly. What if we have a large-scale parallel corpus? → No need to factorize the translation model: P(e|f) = ∏_i P(e_i | e_<i, f). Generating each word depends on the source words and on all the target words generated so far.
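A sketch of the direct model above: the log probability of the target sentence is the sum over positions i of log P(e_i | e_<i, f). The `next_word_distribution` function is a hypothetical stand-in for the neural network; the uniform toy distribution is only there to make the example runnable:

```python
import math

def sentence_log_prob(target_words, source_words, next_word_distribution):
    """log P(e|f) = sum_i log P(e_i | e_<i, f)."""
    log_prob = 0.0
    for i, word in enumerate(target_words):
        # Distribution over target words given the source f and the prefix e_<i.
        probs = next_word_distribution(source_words, target_words[:i])
        log_prob += math.log(probs[word])
    return log_prob

# Toy stand-in for the neural network: uniform over a tiny vocabulary.
vocab = ["DL", "is", "really", "cool", "</s>"]
uniform = lambda source, prefix: {w: 1.0 / len(vocab) for w in vocab}
print(sentence_log_prob(["DL", "is", "really", "cool", "</s>"],
                        ["深層学習", "マジ", "やばい"], uniform))
```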
Recurrent neural network (RNN) encodes a sequence of words. [Diagram: input layer → hidden layer → output layer; the RNN reads 深層学習 マジ やばい ("deep learning is really cool") one word at a time, updating its hidden state and predicting the next word at each step, ending with </s>]
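A minimal numpy sketch of the RNN encoding described above: each input word vector updates a hidden state that summarizes the sequence read so far. The weights and word vectors are random placeholders; in practice they are learned:

```python
import numpy as np

rng = np.random.default_rng(0)
embed_dim, hidden_dim = 4, 8

# Placeholder parameters; in practice these are learned from data.
W_xh = rng.normal(size=(hidden_dim, embed_dim))
W_hh = rng.normal(size=(hidden_dim, hidden_dim))
b_h  = np.zeros(hidden_dim)

def rnn_encode(word_vectors):
    """Read the word vectors one by one and return the final hidden state."""
    h = np.zeros(hidden_dim)
    for x in word_vectors:
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)
    return h

# Toy input: three word embeddings (e.g., for 深層学習 / マジ / やばい).
sentence = [rng.normal(size=embed_dim) for _ in range(3)]
print(rnn_encode(sentence))
```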
Generating a sentence by combining two RNN models (sequence to sequence): the encoder-decoder approach. [Diagram: the encoder builds a sentence vector from the word vectors of the source sentence 深層学習 マジ やばい; the decoder then generates the target sentence "DL is really cool </s>" word by word]
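A rough sketch of the encoder-decoder idea in the same style: the encoder's final hidden state serves as the sentence vector and initializes the decoder, which greedily emits target words until it produces </s>. All weights, embeddings, and the vocabulary are toy placeholders, so the output is meaningless; the point is only to show the data flow:

```python
import numpy as np

rng = np.random.default_rng(0)
embed_dim, hidden_dim = 4, 8
vocab = ["DL", "is", "really", "cool", "</s>"]          # toy target vocabulary
target_embed = {w: rng.normal(size=embed_dim) for w in vocab}
W_xh = rng.normal(size=(hidden_dim, embed_dim))
W_hh = rng.normal(size=(hidden_dim, hidden_dim))
W_hy = rng.normal(size=(len(vocab), hidden_dim))        # hidden -> vocabulary scores

def step(x, h):
    return np.tanh(W_xh @ x + W_hh @ h)

def encode(source_vectors):
    h = np.zeros(hidden_dim)
    for x in source_vectors:
        h = step(x, h)
    return h                                            # sentence vector

def decode(h, max_len=10):
    words, x = [], target_embed["</s>"]                 # stand-in for <s>
    for _ in range(max_len):
        h = step(x, h)
        word = vocab[int(np.argmax(W_hy @ h))]          # greedy word choice
        words.append(word)
        if word == "</s>":
            break
        x = target_embed[word]
    return words

source = [rng.normal(size=embed_dim) for _ in range(3)]  # e.g. 深層学習 マジ やばい
print(decode(encode(source)))
```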
Seq2seq models can learn complex models from simple features alone. Zhang and LeCun, Text Understanding from Scratch, arXiv 2015. → Learns a text classification model from characters only. Zaremba and Sutskever, Learning to Execute, arXiv 2015. → Learns a Python interpreter using only an RNN.
How to make alignments? → Attend to the source side during decoding. Attention = a weighted sum of the hidden states of the encoder. Instead of forming a single sentence vector, the decoder uses all the source word vectors through the attention mechanism. [Diagram: while generating "DL is really cool </s>", each decoder step attends to the encoder states for 深層学習 マジ やばい </s>]
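A minimal sketch of the attention computation named above: compare the current decoder state with every encoder hidden state (dot-product scores here), turn the scores into weights with a softmax, and take the weighted sum as the context vector. The random vectors are placeholders for real encoder and decoder states:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention(decoder_state, encoder_states):
    """Context vector = weighted sum of encoder hidden states."""
    scores = np.array([decoder_state @ h for h in encoder_states])
    weights = softmax(scores)               # soft alignment over source positions
    context = sum(w * h for w, h in zip(weights, encoder_states))
    return context, weights

# Toy example: 4 encoder states and one decoder state of dimension 8.
rng = np.random.default_rng(0)
encoder_states = [rng.normal(size=8) for _ in range(4)]
decoder_state = rng.normal(size=8)
context, weights = attention(decoder_state, encoder_states)
print(weights)   # which source positions the decoder attends to
```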
If there is a large-scale parallel corpus, neural machine translation outperforms statistical methods Luong et al. Effective Approaches to Attention-based Neural Machine Translation. EMNLP 2015.
Attention can be applied not only to sequences but also to tree structures. Eriguchi et al. Tree-to-Sequence Attentional Neural Machine Translation. ACL 2016. Attends to phrase structures in the encoder to take syntactic structure on the source side into account.
Deep learning enables language generation from multi-modal input. Generates a fluent caption from a single image alone. http://deeplearning.cs.toronto.edu/i2t http://googleresearch.blogspot.jp/2014/11/a-picture-is-worth-thousand-coherent.html
Summary: Deep learning for natural language processing Representation learning: can find implicit features; can compute the meaning of a sentence by semantic composition through mathematical modeling. Architectures for deep neural networks: can generate fluent sentences; open up broad possibilities for natural language generation.