How much do word embeddings encode about syntax? Jacob Andreas and Dan Klein UC Berkeley.

Presentation transcript:

How much do word embeddings encode about syntax? Jacob Andreas and Dan Klein UC Berkeley

Everybody loves word embeddings [figure: 2D projection of word embeddings in which determiners such as few, most, that, the, a, each, this, every cluster together] [Collobert 2011, Mikolov 2013, Freitag 2004, Schuetze 1995, Turian 2010]

What might embeddings bring? Example sentence: Cathleen complained about the magazine's shoddy editorial quality. (Embedding neighbors shown on the slide: Cathleen ~ Mary; editorial ~ executive, average)

Today's question: Can word embeddings trained with surface context improve a state-of-the-art constituency parser? (No.)

Embeddings and parsing: Pre-trained word embeddings are useful for a variety of NLP tasks. Can they improve a constituency parser? (Not very much.) [Cite XX, Cite XX, Cite XX]

Three hypotheses:
– Vocabulary expansion (good for OOV words; e.g. Cathleen ~ Mary)
– Statistic pooling (good for medium-frequency words; e.g. editorial ~ executive, average)
– Embedding structure (good for features; e.g. transitivity, tense)

Vocabulary expansion hypothesis: Embeddings help handling of out-of-vocabulary words (e.g. the unseen word Cathleen and the known word Mary).

Vocabulary expansion [figure: embedding space containing the known words John, Mary, Pierre, yellow, enormous, hungry; the unseen word Cathleen falls near John, Mary, and Pierre]. Cathleen complained about the magazine's shoddy editorial quality. The OOV word Cathleen can then be treated like its embedding neighbor Mary.
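
The slide illustrates the idea; below is a minimal sketch of one way to implement it (an illustration, not the authors' code, and the `embeddings` and `parser_vocab` structures are assumptions): each out-of-vocabulary test word is replaced by its nearest in-vocabulary neighbor in embedding space before the parser sees it.

```python
import numpy as np

def nearest_in_vocab(word, embeddings, parser_vocab):
    """Map an OOV word to its closest in-vocabulary neighbor by cosine similarity.

    embeddings:   dict of word -> 1-D numpy vector (pre-trained on surface context)
    parser_vocab: set of words observed in the parser's treebank training data
    Returns the word unchanged if it is already known or has no embedding.
    """
    if word in parser_vocab or word not in embeddings:
        return word
    v = embeddings[word]
    v = v / np.linalg.norm(v)
    best_word, best_sim = word, -1.0
    for cand in parser_vocab:
        if cand not in embeddings:
            continue
        u = embeddings[cand]
        sim = float(np.dot(v, u / np.linalg.norm(u)))
        if sim > best_sim:
            best_word, best_sim = cand, sim
    return best_word

# Hypothetical usage: "Cathleen" is unseen in the treebank but has an embedding,
# so the parser scores it with the lexical statistics of a neighbor such as "Mary".
# tokens = [nearest_in_vocab(w, embeddings, parser_vocab) for w in sentence]
```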

Vocab. expansion results [chart: Baseline vs. +OOV, on the full training set and on a 300-sentence training set]

Statistic pooling hypothesis: Embeddings help handling of medium-frequency words (e.g. average, editorial, executive).

Statistic pooling [figure: words that are neighbors in embedding space with their observed tag sets, e.g. executive {NN, JJ}, kind {NN}, giant {NN, JJ}, editorial {JJ}, average {NN}; sharing statistics across neighbors lets editorial, observed only as JJ, also be scored as NN]
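
A rough sketch of the pooling idea follows (an illustration under stated assumptions, not the authors' implementation; the `tag_counts` and `neighbors` tables are hypothetical): tag counts from a word's nearest embedding neighbors are added, with a small weight, to the word's own lexicon counts.

```python
from collections import Counter

def pooled_tag_counts(word, tag_counts, neighbors, k=5, weight=0.1):
    """Smooth a word's tag counts with counts from its embedding neighbors.

    tag_counts: dict of word -> Counter of POS tags observed in the treebank
    neighbors:  dict of word -> list of nearest in-vocabulary words in embedding
                space, closest first (precomputed; an assumption of this sketch)
    weight:     contribution of each neighbor's counts relative to the word's own
    """
    pooled = Counter(tag_counts.get(word, Counter()))
    for neighbor in neighbors.get(word, [])[:k]:
        for tag, count in tag_counts.get(neighbor, Counter()).items():
            pooled[tag] += weight * count
    return pooled

# Hypothetical usage: "editorial", observed only as JJ, gains NN mass from
# neighbors such as "executive" and "average", so the lexicon no longer rules
# out the noun reading for it.
```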

Statistic pooling results [chart: Baseline vs. +Pooling, on the full training set and on a 300-sentence training set]

Embedding structure hypothesis: The organization of the embedding space directly encodes useful features (e.g. transitivity, tense).

Embedding structure [figure: verb embeddings for vanished, dined, vanishing, dining, devoured, assassinated, devouring, assassinating laid out along a "transitivity" direction and a "tense" direction; e.g. dined is tagged VBD] [Huang 2011]
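
One simple way to exploit such structure, sketched below purely for illustration (the binning scheme and dimension count are assumptions, not the paper's feature templates), is to turn embedding coordinates into discrete indicator features for the parser's lexicon model.

```python
import numpy as np

def embedding_features(word, embeddings, n_dims=10, n_bins=4):
    """Turn the first few embedding coordinates into discrete indicator features.

    Each coordinate of the length-normalized vector lies in [-1, 1] and is mapped
    to one of n_bins buckets (a simplification; real templates may differ).
    Returns a list of feature strings usable in a log-linear lexicon model.
    """
    if word not in embeddings:
        return ["emb_missing"]
    v = embeddings[word]
    v = v / (np.linalg.norm(v) + 1e-8)          # normalize so bins are comparable
    features = []
    for i in range(min(n_dims, len(v))):
        bucket = min(int((v[i] + 1.0) / 2.0 * n_bins), n_bins - 1)
        features.append(f"emb_dim{i}_bin{bucket}")
    return features

# Hypothetical usage: if "dined" and "vanished" share coordinate regions along a
# tense-like direction, features such as "emb_dim3_bin2" fire for both, letting
# the lexicon generalize across rare verbs.
```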

Embedding structure results [chart: Baseline vs. +Features, on the full training set and on a 300-sentence training set]

To summarize [chart: all conditions compared on the 300-sentence training set]

Combined results [chart: Baseline vs. +OOV +Pooling, on the full training set and on a 300-sentence training set]

What about…
– Domain adaptation? (no significant gain)
– French? (no significant gain)
– Other kinds of embeddings? (no significant gain)

Why didn't it work?
– Context clues often provide enough information to reason around words with incomplete or incorrect statistics.
– The parser already has robust OOV and small-count models.
– Sometimes "help" from embeddings is worse than nothing (examples from the slide: bifurcate, Soap, homered, Paschi, tuning, unrecognized).

What about other parsers?
– Dependency parsers (continuous representations as syntactic abstraction)
– Neural networks (continuous representations as structural requirement)
[Henderson 2004, Socher 2013, Koo 2008, Bansal 2014]

What didn't we try?
– Hard clustering (some evidence that this is useful for morphologically rich languages) [Candito 2009]
– A nonlinear feature-based model
– Embeddings in higher constituents (e.g. in a CRF parser)

Conclusion: Embeddings provide no apparent benefit to a state-of-the-art parser for:
– OOV handling
– Parameter pooling
– Lexicon features
Code online at