1 The Hidden Vector State Language Model Vidura Seneviratne, Steve Young Cambridge University Engineering Department

2 References
Young, S. J., “The Hidden Vector State Language Model”, Tech. Report CUED/F-INFENG/TR.467, Cambridge University Engineering Department.
He, Y. and Young, S. J., “Hidden Vector State Model for Hierarchical Semantic Parsing”, in Proc. ICASSP, Hong Kong, 2003.
Fine, S., Singer, Y., and Tishby, N., “The Hierarchical Hidden Markov Model: Analysis and Applications”, Machine Learning 32(1): 41-62, 1998.

3 Outline
Introduction
HVS Model
Experiments
Conclusion

4 Introduction
Language model issues: data sparseness, inability to capture long-distance dependencies and to model nested structural information
Class-based language models – use POS tag information
Structured language models – use syntactic information

5 Hierarchical Hidden Markov Model
The HHMM is a structured, multi-level stochastic process.
– Each state is itself an HHMM
– Internal state: a hidden state that does not emit observable symbols directly
– Production state: a leaf state that emits observable symbols
The states of an ordinary HMM correspond to the production states of an HHMM.

6 HHMM (cont.) Parameters of HHMM:

7 HHMM (cont.)
Transition probability: horizontal (between states at the same level)
Initial probability: vertical (activation of a child state)
Observation probability: emission at production states
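The parameter definitions on the original slide were images and did not survive the transcript. A hedged reconstruction in the spirit of Fine, Singer and Tishby (1998), where the superscript q ranges over internal states and q_p over production states (notation chosen here, not taken from the slides):

    \lambda = \{ A^{q},\ \Pi^{q},\ B^{q_p} \}
    A^{q}(q_i, q_j) = P(q_j \mid q_i)   % horizontal transition among the children of q
    \Pi^{q}(q_i)    = P(q_i \mid q)     % vertical (initial) activation of child q_i
    B^{q_p}(o)      = P(o \mid q_p)     % observation emitted by production state q_p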

8 HHMM (cont.)
If the current node is the root:
– Choose a child according to the initial (vertical) probability
If the chosen child is a production state:
– Produce an observation
– Transit within the same level
– On reaching the end-state, return control to the parent of the end-state
If the chosen child is an internal state:
– Choose one of its own children
– Wait until control returns from its children
– Transit within the same level
– On reaching the end-state, return control to the parent of the end-state
A code sketch of this generative procedure is given below.
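Purely as an illustration (not taken from the slides), here is a minimal Python sketch of the activation scheme above; the parameter container, the dictionary layout and the 'END' marker are assumptions made for this example:

    import random
    from dataclasses import dataclass

    @dataclass
    class HHMMParams:
        # Illustrative layout: the slides only state that an HHMM has vertical
        # (initial), horizontal (transition) and observation probabilities; the
        # dictionary structure below is an assumption of this sketch.
        initial: dict      # initial[parent]            -> {child: prob}
        transition: dict   # transition[parent][child]  -> {next_child: prob}
        emission: dict     # emission[production_state] -> {symbol: prob}

    def _draw(dist):
        """Sample one outcome from a {outcome: probability} dictionary."""
        outcomes, probs = zip(*dist.items())
        return random.choices(outcomes, weights=probs)[0]

    def generate(node, params, out):
        """Generate observations from the sub-HHMM rooted at `node`: pick a
        child vertically, emit (production state) or recurse (internal state),
        then transit horizontally until the end-state returns control."""
        state = _draw(params.initial[node])
        while state != 'END':
            if state in params.emission:        # production (leaf) state
                out.append(_draw(params.emission[state]))
            else:                               # internal state: recurse
                generate(state, params, out)
            state = _draw(params.transition[node][state])
        return out

Calling generate('root', params, []) with a parameter set whose horizontal transitions eventually reach 'END' at every level yields one observation sequence.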

9 HHMM (cont.)

10 HHMM (cont.) Another application: modelling the trend of stocks (IDEAL 2004)

11 Hidden Vector State Model

12 Hidden Vector State Model (cont.) The semantic information relating to any single word can be stored as a vector of semantic tag names.

13 Hidden Vector State Model (cont.)
If state transitions were unconstrained, the model would be a full HHMM.
Transitions between states can be factored into a stack shift with two stages: pop and push.
The stack size is limited and the number of new concepts pushed per transition is limited to one, making the model more efficient.

14 Hidden Vector State Model (cont.) The joint probability is defined:
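The equation on this slide was an image; the chain-rule form of the HVS joint probability, reconstructed from the cited He and Young (2003) ICASSP paper and to be read as a sketch, is:

    P(N, C, W) = \prod_{t=1}^{T}
        P(n_t \mid W_1^{t-1}, C_1^{t-1}) \,
        P(c_t[1] \mid W_1^{t-1}, C_1^{t-1}, n_t) \,
        P(w_t \mid W_1^{t-1}, C_1^{t})

where W is the word sequence, C the sequence of vector states, N the sequence of pop operations, and W_1^{t-1}, C_1^{t-1} denote the full histories up to position t-1.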

15 Hidden Vector State Model (cont.) Approximations (assumptions) and the resulting factorisation:
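Again reconstructed as a sketch (the slide's own equations are not in the transcript), the approximations replace the full histories with local context:

    P(n_t \mid W_1^{t-1}, C_1^{t-1})          \approx P(n_t \mid c_{t-1})
    P(c_t[1] \mid W_1^{t-1}, C_1^{t-1}, n_t)  \approx P(c_t[1] \mid c_t[2..D_t])
    P(w_t \mid W_1^{t-1}, C_1^{t})            \approx P(w_t \mid c_t)

so that

    P(N, C, W) \approx \prod_{t=1}^{T} P(n_t \mid c_{t-1}) \, P(c_t[1] \mid c_t[2..D_t]) \, P(w_t \mid c_t)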

16 Hidden Vector State Model (cont.)
The generative process associated with this constrained version of the HVS model consists of three steps at each word position t:
1. Choose a value for n_t, the number of labels to pop off the stack
2. Select the preterminal concept tag c_t[1] to push
3. Select a word w_t

17 Hidden Vector State Model (cont.)
It is reasonable to ask an application designer to provide example utterances for each type of semantic schema; it is not reasonable to require utterances with manually transcribed parse trees.
The model therefore assumes abstract semantic annotations and the availability of a set of domain-specific lexical classes.

18 Hidden Vector State Model (cont.)
Abstract semantic annotations: the utterances
“show me flights arriving in X at T”
“List flights arriving around T in X”
“Which flight reaches X before T”
all map to the same annotation: FLIGHT(TOLOC(CITY(X), TIME_RELATIVE(TIME(T))))
Class set: CITY: Boston, New York, Denver, …

19 Experiments
Experimental setup
Training set: ATIS-2, ATIS-3
Test sets: ATIS-3 NOV93, DEC94
Baseline: FST (finite semantic tagger)
Smoothing: Good-Turing (GT) for the FST, Witten-Bell for the HVS model
Example: “Show me flights from Boston to New York”
Goal: FLIGHT
Slots: FROMLOC.CITY = Boston, TOLOC.CITY = New York

20 Experiments

21 Experiments Dashed line: goal detection accuracy; solid line: F-measure.

22 Conclusion
The key features of the HVS model are:
– its ability to represent hierarchical information in a constrained way
– its ability to be trained directly from target semantics without explicit word-level annotation

23 HVS Language Model
The basic HVS model is a regular HMM in which each state encodes the history in a fixed-dimension, stack-like structure.
Each state consists of a stack whose elements are labels chosen from a finite set of cardinality M+1: C = {c_1, …, c_M, c_#}.
A depth-D HVS model state can therefore be characterised by a vector of dimension D, with the most recently pushed element at index 1 and the oldest at index D.

24 HVS Language Model (cont.)

25 HVS Language Model (cont.)
Each HVS model state transition is restricted so that:
(i) exactly n_t class labels are popped off the stack
(ii) exactly one new class label c_t is pushed onto the stack
The number of elements to pop, n_t, and the choice of new class label to push, c_t, are each determined by a probability distribution, as described on the following slides; a small code sketch of the pop/push mechanics follows.
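Purely as an illustration (not from the slides), a minimal Python sketch of the fixed-depth stack mechanics described above; the example labels, the padding symbol '#' and the class layout are assumptions of this sketch:

    from dataclasses import dataclass
    from typing import Tuple

    @dataclass(frozen=True)
    class HVSState:
        """A depth-D HVS state: a fixed-dimension stack of class labels,
        most recently pushed label first (index 1 on the slides)."""
        stack: Tuple[str, ...]   # e.g. ('CITY', 'TOLOC', 'FLIGHT', '#')

        def transit(self, n_pop: int, new_label: str) -> 'HVSState':
            """One constrained HVS transition: pop exactly n_pop labels, then
            push exactly one new label, padding with the dummy label '#' so
            the vector keeps its fixed dimension D."""
            depth = len(self.stack)
            remaining = self.stack[n_pop:]
            padded = ((new_label,) + remaining + ('#',) * depth)[:depth]
            return HVSState(padded)

For example, HVSState(('CITY', 'TOLOC', 'FLIGHT', '#')).transit(2, 'FROMLOC') pops CITY and TOLOC and pushes FROMLOC, giving the state ('FROMLOC', 'FLIGHT', '#', '#').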

26 HVS Language Model (cont.)

27 HVS Language Model (cont.)
n_t is conditioned on all the class labels that are on the stack at time t-1, whereas c_t is conditioned only on the class labels that remain on the stack after the pop operation.
The former distribution can encode embedding, whereas the latter focuses on modelling long-range dependencies.

28 HVS Language Model (cont.) Joint probability and modelling assumption:
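The equations on this slide were images; a hedged reconstruction of the language-model joint probability, consistent with the conditioning described on slide 27 (a sketch in the spirit of the cited TR.467 report, not a verbatim copy):

    P(W, C, N) \approx \prod_{t=1}^{T}
        P(n_t \mid c_{t-1}[1..D]) \,
        P(c_t[1] \mid c_t[2..D]) \,
        P(w_t \mid c_t[1..D])

i.e. the pop count n_t is conditioned on the whole previous stack, the pushed label c_t[1] on the labels remaining after the pop, and the word w_t on the whole current vector state.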

29 HVS Language Model (cont.)
Training: EM algorithm
– C, N: latent data; W: observed data
E-step: compute the posterior over the latent data, P(C, N | W, λ)

30 HVS Language Model (cont.)
M-step:
– Maximise the auxiliary Q function (shown below)
– Substitute P(W, C, N | λ) into Q
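The Q function itself is not in the transcript; the standard EM auxiliary function for this setting (a sketch, with λ the current parameters and λ' the updated ones) is:

    Q(\lambda', \lambda) = \sum_{C, N} P(C, N \mid W, \lambda) \, \log P(W, C, N \mid \lambda')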

31 HVS Language Model (cont.) Because log P(W, C, N | λ') separates into pop, push and word terms, the three probability distributions can be calculated (re-estimated) separately.
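The re-estimation formulae on the original slide were images; a hedged sketch of their generic EM form, i.e. ratios of expected counts normalised over the conditioning context, is:

    \hat{P}(n \mid c) =
        \frac{\sum_t P(n_t = n,\ c_{t-1} = c \mid W, \lambda)}
             {\sum_{n'} \sum_t P(n_t = n',\ c_{t-1} = c \mid W, \lambda)}

with analogous ratios for P(c_t[1] \mid c_t[2..D]) and P(w_t \mid c_t).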

32 HVS Language Model (cont.)
If the state space S were fully populated, |S| = M^D states; for M = 100+ and D = 3 to 4, this is already on the order of 10^6 or more states.
Due to data sparseness, backoff is therefore needed.

33 HVS Language Model (cont.) Backoff weights are computed using a modified version of absolute discounting.
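The exact modified-discounting formula from the slide is not in the transcript; for reference only, standard absolute discounting with backoff (a sketch, not necessarily the authors' modification) estimates, for a context c and word w:

    P(w \mid c) =
    \begin{cases}
      \dfrac{\max(N(c, w) - d,\ 0)}{N(c)} & \text{if } N(c, w) > 0 \\[1ex]
      \alpha(c)\, P(w \mid c')            & \text{otherwise}
    \end{cases}
    \qquad
    \alpha(c) = \dfrac{d \cdot |\{w : N(c, w) > 0\}| \, / \, N(c)}
                      {\sum_{w : N(c, w) = 0} P(w \mid c')}

where d is the discount, N(·) are training counts, c' is the backed-off (shorter) context and α(c) is the backoff weight that restores normalisation.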

34 Experiments
Training set: ATIS-3, 276K words, 23K sentences
Development set: ATIS-3 Nov93
Test set: ATIS-3 Dec94, 10K words, 1K sentences
OOV words were removed; k = 850

35 Experiments (cont.)

36 Experiments (cont.)

37 Conclusion
The HVS language model is able to make better use of context than standard class-based n-gram models.
The HVS model is trainable using EM.

38 Class tree for implementation

39 Iteration number vs. perplexity