Hidden-Variable Models for Discriminative Reranking
Jiawen Liu, Spoken Language Processing Lab, CSIE, National Taiwan Normal University

References: Hidden-Variable Models for Discriminative Reranking. Terry Koo and Michael Collins, EMNLP. Parse Reranking with WordNet Using a Hidden-Variable Model. Terry Koo and Michael Collins, M.Eng Thesis, Massachusetts Institute of Technology, Cambridge, MA, USA.

Outline
Introduction
The hidden-variable model
Local feature vectors
Training the model
Hidden-value domains and local features
Experimental Results
Applying the Model to Other NLP Tasks
Empirical Analysis of the Hidden Values
Conclusions and Future Research

Introduction-1
A number of recent approaches in statistical NLP have focused on reranking algorithms. The success of a reranking approach depends critically on the choice of representation used by the reranking model; typically, each candidate structure is mapped to a feature-vector representation. This paper describes a new method for representing NLP structures within reranking approaches. The method involves a hidden-variable model, where the hidden variables correspond to an assignment of words to either clusters or word senses. Lexical items are automatically assigned their hidden values using unsupervised learning within a discriminative reranking approach.

Introduction-2
They make use of a conditional log-linear model for the task. Formally, the hidden variables within the log-linear model consist of global assignments, where a global assignment attaches every word in the sentence to some hidden cluster or sense value. The number of such global assignments grows exponentially with the length of the sentence being processed, and training and decoding with the model require summing over this exponential number of possible global assignments. They show that the required summations can be computed efficiently and exactly using dynamic-programming methods, under certain restrictions on the features in the model.

Introduction-3
The model can alleviate data sparsity by learning to assign words to word clusters, and it can mitigate problems with word-sense polysemy by learning to assign lexical items to underlying word senses based on contextual information.

The hidden-variable model-1
Each sentence s_i for i = 1, ..., n in the training data has a set of n_i candidate parse trees t_{i,1}, ..., t_{i,n_i}, which are the output of an N-best baseline parser. They define t_{i,1} to be the parse with the highest F-measure for sentence s_i. Given a candidate parse tree t_{i,j}, the hidden-variable model assigns a domain of hidden values to each word in the tree. Formally, if t_{i,j} spans m words, the hidden-value domains for the words are the sets H_1(t_{i,j}), ..., H_m(t_{i,j}). A global hidden-value assignment, which attaches a hidden value to every word in t_{i,j}, is written h = (h_1, ..., h_m) ∈ H(t_{i,j}), where H(t_{i,j}) = H_1(t_{i,j}) × ... × H_m(t_{i,j}) is the set of all possible global assignments for t_{i,j}.
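The size of this global-assignment space is the product of the per-word domain sizes, so it grows exponentially in the sentence length. The concrete numbers below are illustrative (five hidden values per word, a 20-word sentence), not figures from the paper:

```latex
% Size of the global-assignment space; the 5-values-per-word, 20-word
% example is illustrative only.
|H(t_{i,j})| = \prod_{k=1}^{m} |H_k(t_{i,j})|,
\qquad \text{e.g.}\; 5^{20} \approx 9.5 \times 10^{13}.
```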

The hidden-variable model-3
They define a feature-based representation Φ(t_{i,j}, h), where each component of the feature vector is the count of some substructure within (t_{i,j}, h). They use a parameter vector Θ ∈ R^d to define a log-linear distribution over candidate trees together with global hidden-value assignments.
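Written out from these definitions, the joint distribution takes the following form (a reconstruction from the surrounding text rather than a verbatim copy of the paper's equation):

```latex
% Log-linear distribution over candidate parses and global hidden-value
% assignments (reconstructed from the definitions above).
p(t_{i,j}, h \mid s_i; \Theta)
  = \frac{\exp\{\Theta \cdot \Phi(t_{i,j}, h)\}}
         {\sum_{j'=1}^{n_i} \sum_{h' \in H(t_{i,j'})} \exp\{\Theta \cdot \Phi(t_{i,j'}, h')\}}
```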

The hidden-variable model-4
By marginalizing out the global assignments, they obtain a distribution over the candidate parses alone. The loss function is the negative log-likelihood of the training data with respect to Θ.
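In symbols (again a reconstruction consistent with the definitions above; a regularization term may be added to the loss in practice):

```latex
% Marginal distribution over candidate parses, and the negative
% log-likelihood loss (reconstructed forms).
p(t_{i,j} \mid s_i; \Theta)
  = \sum_{h \in H(t_{i,j})} p(t_{i,j}, h \mid s_i; \Theta)
\qquad
L(\Theta) = -\sum_{i=1}^{n} \log p(t_{i,1} \mid s_i; \Theta)
```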

Local feature vectors-1
Note that the number of possible global assignments grows exponentially with the number of words spanned by t_{i,j}. This poses a problem when training the model, or when calculating the probability of a parse tree through Eq. 2. This section describes how to address the difficulty by restricting features to sufficiently local scope. The restriction to local feature vectors makes use of the dependency structure underlying the parse tree t_{i,j}. Formally, for tree t_{i,j}, they define the corresponding dependency tree D(t_{i,j}) to be a set of edges between words in t_{i,j}, where (u, v) ∈ D(t_{i,j}) if and only if there is a head-modifier dependency between words u and v.

Local feature vectors-2
If w, u, and v are word indices, they introduce single-variable local feature vectors, defined over a single word and its hidden value, and pairwise local feature vectors, defined over a dependency edge and the hidden values of its two endpoints. The global feature vector Φ(t_{i,j}, h) is then decomposed into a sum over the local feature vectors.
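Concretely, the decomposition (Eq. 4, reconstructed here from the surrounding definitions) sums a single-variable term over every word and a pairwise term over every dependency edge:

```latex
% Decomposition of the global feature vector into local feature vectors
% (reconstructed form of Eq. 4).
\Phi(t_{i,j}, h)
  = \sum_{w=1}^{m} \phi(t_{i,j}, w, h_w)
  + \sum_{(u,v) \in D(t_{i,j})} \phi(t_{i,j}, u, v, h_u, h_v)
```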

Local feature vectors-3
In their implementation, each dimension of the local feature vectors is an indicator function signaling the presence of a feature, so that a sum over the local feature vectors in a tree gives the occurrence counts of features in that tree.
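For instance, a single-variable feature might fire when a particular word carries a particular hidden value, and a pairwise feature might fire for a particular pair of hidden values on a head-modifier edge. The sketch below shows indicator-feature templates in this spirit; the concrete templates and accessor arguments are illustrative assumptions, not the exact features used in the paper.

```python
# Illustrative indicator-feature templates (assumptions, not the paper's
# exact feature set). Features are returned as {name: 1} dictionaries so
# that summing them over a tree yields occurrence counts.
def single_features(word, tag, h):
    """Features over one word, its POS tag, and its hidden value h."""
    return {
        f"word={word}&hid={h}": 1,
        f"tag={tag}&hid={h}": 1,
    }

def pairwise_features(head_word, mod_word, h_head, h_mod):
    """Features over a head-modifier edge and its two hidden values."""
    return {
        f"head_hid={h_head}&mod_hid={h_mod}": 1,
        f"head={head_word}&mod_hid={h_mod}": 1,
        f"head_hid={h_head}&mod={mod_word}": 1,
    }
```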

Training the model-1
The gradient of the loss function can be written in terms of expected feature counts under the model: the expected counts when the correct parse t_{i,1} is held fixed, minus the expected counts over all candidate parses. Using the feature-vector decomposition in Eq. 4, these expectations can be rewritten as sums that involve only single-variable and pairwise marginal probabilities of the hidden values.
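Written out, the gradient has the standard latent-variable log-linear form below (a reconstruction from the loss defined earlier, not copied from the paper):

```latex
% Gradient of the negative log-likelihood (reconstructed): a difference of
% feature expectations with the correct parse fixed versus over all candidates.
\frac{\partial L}{\partial \Theta}
  = -\sum_{i=1}^{n}\Bigg(
      \sum_{h \in H(t_{i,1})} p(h \mid t_{i,1}, s_i; \Theta)\,\Phi(t_{i,1}, h)
    \;-\; \sum_{j=1}^{n_i}\sum_{h \in H(t_{i,j})} p(t_{i,j}, h \mid s_i; \Theta)\,\Phi(t_{i,j}, h)
    \Bigg)
```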

Training the model-2
The quantities needed are the single-variable and pairwise marginalized probabilities of the hidden values, together with the associated normalization constant Z_{i,j}. These three quantities can be computed with belief propagation (Yedidia et al., 2003), a dynamic-programming technique that is efficient and exact when the graph D(t_{i,j}) is a tree.
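The sketch below shows sum-product belief propagation on a tree-structured graph, returning the normalization constant and the per-node marginals; pairwise edge marginals can be read off the same messages. It assumes the node and edge potentials (exponentiated local feature scores) have already been computed from Θ, and all data structures and names are illustrative rather than the authors' implementation.

```python
from collections import defaultdict

def belief_propagation(domains, node_pot, edge_pot, edges, root=0):
    """Sum-product belief propagation on a tree-structured graph.

    domains[w]            : list of hidden values available to word w
    node_pot[w][a]        : exp(theta . phi(w, a))
    edge_pot[(u, v)][a, b]: exp(theta . phi(u, v, a, b)) for (u, v) in `edges`
    edges                 : undirected (u, v) pairs forming a tree

    Returns (Z, marginals) with marginals[w][a] = p(h_w = a).
    """
    nbrs = defaultdict(list)
    for u, v in edges:
        nbrs[u].append(v)
        nbrs[v].append(u)

    def pot(u, v, a, b):
        # Edge potentials are stored under the orientation given in `edges`.
        return edge_pot[(u, v)][a, b] if (u, v) in edge_pot else edge_pot[(v, u)][b, a]

    messages = {}  # messages[(u, v)][b] = message u -> v evaluated at h_v = b

    def send(u, v):
        msg = {}
        for b in domains[v]:
            total = 0.0
            for a in domains[u]:
                prod = node_pot[u][a] * pot(u, v, a, b)
                for w in nbrs[u]:
                    if w != v:
                        prod *= messages[(w, u)][a]
                total += prod
            msg[b] = total
        messages[(u, v)] = msg

    # Collect edges in post-order so messages flow leaves -> root, then back.
    upward = []
    def order_edges(u, parent):
        for w in nbrs[u]:
            if w != parent:
                order_edges(w, u)
                upward.append((w, u))
    order_edges(root, None)

    for u, v in upward:            # upward pass: leaves to root
        send(u, v)
    for u, v in reversed(upward):  # downward pass: root to leaves
        send(v, u)

    # Node beliefs; their common normalizer is the partition function Z.
    Z, marginals = None, []
    for w in range(len(domains)):
        belief = {a: node_pot[w][a] for a in domains[w]}
        for u in nbrs[w]:
            for a in domains[w]:
                belief[a] *= messages[(u, w)][a]
        z_w = sum(belief.values())
        Z = z_w if Z is None else Z
        marginals.append({a: belief[a] / z_w for a in belief})
    return Z, marginals
```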

Hidden-value domains and local features-1
Each word in a parse tree is given a domain of possible hidden values by the hidden-variable model. Different definitions of these domains give rise to the three main model types:
–Clustering
–Refinement
–Mapping into a pre-built ontology such as WordNet
Splitting each word into a domain of three word-sense hidden values:
–Each word receives a domain of hidden values that is not shared with any other word.
–The model can then distinguish several different usages of each word, emulating a refinement operation.

Hidden-value domains and local features-2
Splitting each word's part-of-speech tag into several sub-tags:
–This approach assigns the same domain to many words.
–The behavior of the model then emulates a clustering operation.
In their experiments, they used features such as those in Figure 2 in combination with the following four definitions of the hidden-value domains.
–Lexical (Refinement): each word is split into three sub-values.
–Part-of-Speech (Clustering): the part-of-speech tag of each word is split into five sub-values. The word "shares" would be assigned the domain {NNS_1, ..., NNS_5}.
–Highest Nonterminal (Clustering): the highest nonterminal to which each word propagates as a headword is split into five sub-values. The word "bought" yields the domain {S_1, ..., S_5}.

Hidden-value domains and local features-3
–Supersense (Pre-Built Ontology): they borrow the idea of using WordNet lexicographer filenames as broad "supersenses". For each word, they split each of its supersenses into three sub-supersenses; if no supersenses are available, they fall back to splitting the part-of-speech tag into five sub-values. The word "shares" has the supersenses noun.possession, noun.act, and noun.artifact, so its domain is {noun.possession_1, noun.act_1, noun.artifact_1, ..., noun.possession_3, noun.act_3, noun.artifact_3}. The word "in" has no WordNet supersenses, so it is assigned the domain {IN_1, ..., IN_5}.
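The four domain definitions can be summarized in a few lines. The sketch below assumes the caller supplies each word's POS tag, its highest headed nonterminal, and its WordNet supersenses; the function names, argument names, and the "X_1 ... X_k" naming scheme are illustrative.

```python
# Sketch of the four hidden-value domain definitions described above.
def lexical_domain(word):
    # Refinement: every word gets its own three sub-values.
    return [f"{word}_{k}" for k in range(1, 4)]

def pos_domain(pos_tag):
    # Clustering: split the POS tag into five sub-tags, e.g. NNS -> NNS_1 ... NNS_5.
    return [f"{pos_tag}_{k}" for k in range(1, 6)]

def highest_nonterminal_domain(nonterminal):
    # Clustering: split the highest nonterminal the word heads, e.g. S -> S_1 ... S_5.
    return [f"{nonterminal}_{k}" for k in range(1, 6)]

def supersense_domain(supersenses, pos_tag):
    # Pre-built ontology: three sub-values per WordNet supersense,
    # falling back to the POS domain when no supersense exists.
    if not supersenses:
        return pos_domain(pos_tag)
    return [f"{ss}_{k}" for k in range(1, 4) for ss in supersenses]

# supersense_domain(["noun.possession", "noun.act", "noun.artifact"], "NNS")
#   -> ["noun.possession_1", "noun.act_1", "noun.artifact_1", ...,
#       "noun.possession_3", "noun.act_3", "noun.artifact_3"]
# supersense_domain([], "IN") -> ["IN_1", ..., "IN_5"]
```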

The final feature sets
They created eight feature sets by combining the four hidden-value domains above with two alternative definitions of the dependency structure:
–Standard head-modifier dependencies
–Sibling dependencies
For instance, for the tree fragment in Figure 2:
–The head-modifier dependencies are (bought, shares), (bought, in), and (bought, yesterday).
–The sibling dependencies are (bought, shares), (shares, in), and (in, yesterday).
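The sibling structure can be read off a head's ordered modifier list by chaining each modifier to the next. The tiny sketch below reproduces the example above; the left-to-right ordering convention is an assumption for illustration.

```python
# Sketch: build sibling dependencies from a head and its ordered modifiers.
def sibling_edges(head, modifiers):
    edges, prev = [], head
    for mod in modifiers:
        edges.append((prev, mod))  # head -> first modifier, then each modifier -> next
        prev = mod
    return edges

# sibling_edges("bought", ["shares", "in", "yesterday"])
#   -> [("bought", "shares"), ("shares", "in"), ("in", "yesterday")]
```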

Mixed Models
The different hidden-variable models display varying strengths and weaknesses, so they created mixtures of different models using a weighted average, where Z(s_i) is a normalization constant that can be ignored when ranking the candidates of a given sentence.
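One reading consistent with this description is a per-sentence weighted average of the component models' probabilities (a plausible reconstruction, not the paper's exact formula); since Z(s_i) does not depend on the candidate index j, it has no effect on the ranking:

```latex
% Plausible form of the mixture model (reconstruction; the weights alpha_k
% and the exact combination rule are assumptions).
p_{\mathrm{MIX}}(t_{i,j} \mid s_i)
  = \frac{1}{Z(s_i)} \sum_{k} \alpha_k\, p_k(t_{i,j} \mid s_i)
```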

Experimental Results-1
They trained and tested the model on data from the Penn Treebank. For each of the eight feature sets, they used stochastic gradient descent to optimize the parameters of the model. They then created various mixtures of the eight models, testing the accuracy of each mixture on a secondary development set. Their final model was a mixture of three of the eight possible models:
–supersense hidden values with sibling trees
–lexical hidden values with sibling trees
–highest nonterminal hidden values with standard head-modifier trees
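A minimal sketch of the stochastic gradient descent loop is given below. It assumes a per-sentence gradient function built from the belief-propagation quantities above; the decay schedule, the sparse {feature: value} parameter representation, and the helper name grad_for_sentence are illustrative assumptions rather than the setup used in the paper.

```python
import random

def sgd(theta, training_sentences, grad_for_sentence, epochs=10, eta0=0.1):
    """Plain SGD over sentences; grad_for_sentence returns the gradient of
    the per-sentence negative log-likelihood as a {feature: value} dict."""
    for epoch in range(epochs):
        random.shuffle(training_sentences)
        eta = eta0 / (1.0 + epoch)  # simple decaying step size (assumption)
        for sentence in training_sentences:
            grad = grad_for_sentence(theta, sentence)
            for feature, value in grad.items():
                theta[feature] = theta.get(feature, 0.0) - eta * value
    return theta
```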

Experimental Results-2
Their final tests evaluated four models:
–the Collins (1999) base parser, C99
–the Collins (2000) reranker, C2K
–a combination of the C99 base model with the three models above, MIX
–MIX augmented with features from the C2K method, MIX+

Applying the Model to Other NLP Tasks-1
To summarize the model, the major components of the approach are as follows:
–They assume some set of candidate structures t_{i,j}, which are to be reranked by the model. Each structure t_{i,j} has n_{i,j} words w_1, ..., w_{n_{i,j}}, and each word w_k has a set H_k(t_{i,j}) of possible hidden values.
–They assume a graph D(t_{i,j}) for each t_{i,j} that defines the possible interactions between hidden variables in the model.
–They assume some definition of local feature vectors, which consider either single hidden variables or pairs of hidden variables connected by an edge in D(t_{i,j}).
There is no requirement that the hidden variables be associated only with words in the structure:
–In speech recognition, hidden variables could be associated with phonemes rather than words.

Applying the Model to Other NLP Tasks-2
NLP tasks other than parsing involve structures t_{i,j} that are not necessarily parse trees:
–In speech recognition, candidates are simply strings (utterances).
–In tagging tasks, candidates are labeled sequences.
As a final note, there is some flexibility in the choice of D(t_{i,j}):
–In the more general case where D(t_{i,j}) contains cycles, belief propagation is no longer exact, and approximate inference methods would be needed.

Empirical Analysis of the Hidden Values-1
The model makes no assumptions about the interpretation of the hidden values assigned to words. During training, it simply learns a distribution over global hidden-value assignments that is useful for improving the log-likelihood of the training data. However, they expect the model to learn hidden-value assignments that are reasonable from a linguistic standpoint. They established a corpus of parse trees with hidden-value annotations as follows:
–They find the optimal parameters Θ* on the training set.
–For every sentence s_i in the training set, they then use Θ* to find the most probable candidate parse under the model.
–For each such parse tree, they decode the most probable global assignment of hidden values.
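In symbols, with t*_i and h*_i introduced here purely as notation for these two decoding steps (the second argmax can be computed with max-product message passing on D(t*_i)):

```latex
% Decoding steps used to build the annotated corpus (the symbols t*_i and
% h*_i are introduced here for convenience).
t^*_i = \arg\max_{j}\; p(t_{i,j} \mid s_i; \Theta^*),
\qquad
h^*_i = \arg\max_{h \in H(t^*_i)}\; p(h \mid t^*_i, s_i; \Theta^*)
```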

Empirical Analysis of the Hidden Values-2
They created a corpus of such (parse, hidden-value assignment) pairs for the feature set defined by part-of-speech hidden-value domains and standard dependency structures.

Conclusions and Future Research
The hidden-variable model is a novel method for representing NLP structures in the reranking framework. Versatile behavior can be obtained simply by manipulating the definition of the hidden-value domains, and they have experimented with models that emulate word clustering, word refinement, and mappings from words into an existing ontology. Future work may consider hidden-value domains with mixed contents, investigate the use of unlabeled data within the approach, and apply the models to NLP tasks other than parsing.