1 The Hidden Vector State Language Model
Vidura Seneviratne, Steve Young
Cambridge University Engineering Department
2 Reference
Young, S. J., “The Hidden Vector State language model”, Tech. Report CUED/F-INFENG/TR.467, Cambridge University Engineering Department.
He, Y. and Young, S. J., “Hidden Vector State Model for hierarchical semantic parsing”, in Proc. of ICASSP, Hong Kong.
Fine, S., Singer, Y., and Tishby, N., “The Hierarchical Hidden Markov Model: Analysis and applications”, Machine Learning 32(1): 41-62, 1998.
3 Outline Introduction HVS Model Experiments Conclusion
4 Introduction
Language models: problems of data sparseness, and an inability to capture long-distance dependencies or to model nested structural information
Class-based language model
–POS tag information
Structured language model
–Syntactic information
5 Hierarchical Hidden Markov Model
An HHMM is a structured multi-level stochastic process.
–Each state is itself an HHMM
–Internal state: a hidden state that does not emit observable symbols directly
–Production state: a leaf state
The states of an HMM correspond to the production states of an HHMM.
6 HHMM (cont.) Parameters of HHMM:
7 HHMM (cont.) Transition probability: horizontal Initial probability: vertical Observation probability:
8 HHMM (cont.)
Current node is the root:
–Choose a child according to the initial probability
Child is a production state:
–Produce an observation
–Transit within the same level
–When it reaches the end-state, return to the parent of the end-state
Child is an internal state:
–Choose a child
–Wait until control returns from the children
–Transit within the same level
–When it reaches the end-state, return to the parent of the end-state
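The control flow above can be sketched as a short recursive sampler. This is a minimal illustration only: the toy model `TOY`, the `END` marker, and the helper names `draw`, `generate`, and `sample` are all hypothetical, and the probabilities are made deterministic so the behaviour is easy to follow.

```python
import random

END = "end"  # hypothetical marker for the end-state of each level

# Toy two-level HHMM (made-up states and probabilities).
# Internal states carry "init" (vertical) and "trans" (horizontal) tables
# over their children; production states carry an "emit" table.
TOY = {
    "root": {"init": {"A": 1.0},
             "trans": {"A": {"B": 1.0}, "B": {END: 1.0}}},
    "A": {"init": {"a1": 1.0},
          "trans": {"a1": {"a2": 1.0}, "a2": {END: 1.0}}},
    "B": {"emit": {"z": 1.0}},
    "a1": {"emit": {"x": 1.0}},
    "a2": {"emit": {"y": 1.0}},
}


def draw(dist, rng):
    """Sample one key from a {outcome: probability} table."""
    keys = list(dist)
    return rng.choices(keys, weights=[dist[k] for k in keys])[0]


def generate(model, node, rng, out):
    """Activate `node`; append every emitted symbol to `out`."""
    spec = model[node]
    if "emit" in spec:                    # production state: emit and return
        out.append(draw(spec["emit"], rng))
        return
    child = draw(spec["init"], rng)       # vertical transition into a child
    while child != END:
        generate(model, child, rng, out)  # wait until control comes back
        child = draw(spec["trans"][child], rng)  # horizontal transition
    # Reaching END hands control back to the parent of this level.


def sample(model, seed=0):
    rng = random.Random(seed)
    out = []
    generate(model, "root", rng, out)
    return out


print(sample(TOY))
```

The recursion mirrors the "wait until control returns" step: an internal state does not move horizontally until its activated child has run to its own end-state.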
9 HHMM (cont.)
10 HHMM (cont.) Other application: trend of stocks (IDEAL 2004)
11 Hidden Vector State Model
12 Hidden Vector State Model (cont.) The semantic information relating to any single word can be stored as a vector of semantic tag names
13 Hidden Vector State Model (cont.)
If state transitions were unconstrained:
–The model would be equivalent to a full HHMM
Transitions between states can be factored into a stack shift in two stages: pop, then push.
The stack size is limited, and the number of new concepts pushed per transition is limited to one:
–More efficient
14 Hidden Vector State Model (cont.) The joint probability is defined:
15 Hidden Vector State Model (cont.) Approximation (assumption): So,
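For reference, a sketch of the factorisation as it appears in He and Young's published formulation of the HVS model. The symbols are taken from that formulation and are an assumption about what the slides displayed: $W$ is the word sequence, $C$ the sequence of vector states $\mathbf{c}_t$, $N$ the sequence of pop counts $n_t$, and $D_t$ the stack depth at position $t$.

```latex
% Joint probability, factored by the chain rule over positions t = 1..T
P(W, C, N) = \prod_{t=1}^{T}
    P(n_t \mid W_1^{t-1}, C_1^{t-1}) \,
    P(c_t[1] \mid W_1^{t-1}, C_1^{t-1}, n_t) \,
    P(w_t \mid W_1^{t-1}, C_1^{t})

% Approximations (conditional independence assumptions)
P(n_t \mid W_1^{t-1}, C_1^{t-1}) \approx P(n_t \mid \mathbf{c}_{t-1})
P(c_t[1] \mid W_1^{t-1}, C_1^{t-1}, n_t) \approx P(c_t[1] \mid c_t[2 \ldots D_t])
P(w_t \mid W_1^{t-1}, C_1^{t}) \approx P(w_t \mid \mathbf{c}_t)

% So
P(W, C, N) \approx \prod_{t=1}^{T}
    P(n_t \mid \mathbf{c}_{t-1}) \,
    P(c_t[1] \mid c_t[2 \ldots D_t]) \,
    P(w_t \mid \mathbf{c}_t)
```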
16 Hidden Vector State Model (cont.)
The generative process associated with this constrained version of the HVS model consists of three steps for each position t:
1. Choose a value for n_t
2. Select a preterminal concept tag c_t[1]
3. Select a word w_t
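The three steps can be sketched as a small sampler. Everything concrete here is assumed for illustration: the tags, words, and probabilities in the toy tables are made up, and the conditioning used (n_t on the previous vector state, c_t[1] on the tags left after popping, w_t on the new vector state) follows He and Young's formulation.

```python
import random

# A vector state is a tuple with the preterminal tag c_t[1] first.
S0 = ("SS",)
S1 = ("FLIGHT", "SS")
S2 = ("TOLOC", "FLIGHT", "SS")
S3A = ("CITY",) + S2
S3B = ("TIME",) + S2

# Toy distributions (hypothetical values, not trained parameters).
P_POP = {S0: {0: 1.0}, S1: {0: 1.0}, S2: {0: 1.0}, S3A: {1: 1.0}, S3B: {1: 1.0}}
P_PUSH = {S0: {"FLIGHT": 1.0}, S1: {"TOLOC": 1.0}, S2: {"CITY": 0.5, "TIME": 0.5}}
P_WORD = {S1: {"flights": 1.0}, S2: {"to": 0.6, "arriving": 0.4},
          S3A: {"Boston": 0.5, "Denver": 0.5}, S3B: {"noon": 1.0}}


def draw(dist, rng):
    """Sample one key from a {outcome: probability} table."""
    keys = list(dist)
    return rng.choices(keys, weights=[dist[k] for k in keys])[0]


def hvs_step(prev, rng):
    """One position t: returns (n_t, new vector state, w_t)."""
    n = draw(P_POP[prev], rng)           # 1. choose n_t
    remaining = prev[n:]                 #    pop n_t tags off the top
    tag = draw(P_PUSH[remaining], rng)   # 2. select preterminal tag c_t[1]
    state = (tag,) + remaining           #    push it
    word = draw(P_WORD[state], rng)      # 3. select word w_t
    return n, state, word


def generate(steps, seed=0):
    rng = random.Random(seed)
    state, trace = S0, []
    for _ in range(steps):
        n, state, word = hvs_step(state, rng)
        trace.append((n, state, word))
    return trace


for n, state, word in generate(5):
    print(n, state, word)
```

Because exactly one tag is pushed per step, the stack depth changes by 1 - n_t at each position, which is what keeps the state space far smaller than that of an unconstrained HHMM.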
17 Hidden Vector State Model (cont.) It is reasonable to ask an application designer to provide examples of utterances which would yield each type of semantic schema. It is not reasonable to require utterances with manually transcribed parse trees. Assume abstract semantic annotations and availability of a set of domain specific lexical classes.
18 Hidden Vector State Model (cont.)
Abstract semantic annotations:
–Show me flights arriving in X at T.
–List flights arriving around T in X.
–Which flight reaches X before T?
= FLIGHT(TOLOC(CITY(X),TIME_RELATIVE(TIME(T))))
Class set:
–CITY: Boston, New York, Denver…
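To make the link between an abstract annotation and vector states concrete, the sketch below expands a bracketed annotation into the stack that would sit behind each concept tag. The function name `vector_states` and the root tag `"SS"` are assumptions for illustration; the slide does not say how the annotations are actually processed.

```python
def vector_states(annotation, root="SS"):
    """List the stack (top tag first) at every concept tag in `annotation`.

    Tokens followed by "(" are treated as concept tags; bare arguments
    such as X or T are placeholders for lexical classes and are skipped.
    """
    stack, states, token = [root], [], ""
    for ch in annotation:
        if ch == "(":
            stack.append(token.strip())          # token was a concept tag
            states.append(tuple(reversed(stack)))
            token = ""
        elif ch == ")":
            stack.pop()                          # leave the current concept
            token = ""
        elif ch == ",":
            token = ""                           # sibling separator
        else:
            token += ch
    return states


annotation = "FLIGHT(TOLOC(CITY(X),TIME_RELATIVE(TIME(T))))"
for state in vector_states(annotation):
    print(state)
```

Each printed tuple is one admissible vector state for the schema, for example `("CITY", "TOLOC", "FLIGHT", "SS")` for the city placeholder X.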
19 Experiments
Experimental setup
Training set: ATIS-2, ATIS-3
Test set: ATIS-3 NOV93, DEC94
Baseline: FST (Finite State Tagger)
Smoothing: Good-Turing (GT) for FST, Witten-Bell for HVS
Example: Show me flights from Boston to New York
–Goal: FLIGHT
–Slots: FROMLOC.CITY = Boston, TOLOC.CITY = New York
20 Experiments
21 Experiments
Dashed line: goal detection accuracy
Solid line: F-measure
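As a reminder of what the two plotted metrics measure, here is a minimal sketch that scores goal/slot output of the kind shown in the setup slide. The exact scoring protocol used in the experiments is not given, so pooling (slot, value) pairs across utterances is an assumption, and the goal name `AIRFARE` in the usage lines is hypothetical.

```python
def slot_f_measure(reference, hypothesis):
    """F-measure over (slot, value) pairs pooled across utterances.

    `reference` and `hypothesis` are parallel lists of {slot: value} dicts.
    """
    correct = n_ref = n_hyp = 0
    for ref, hyp in zip(reference, hypothesis):
        ref_pairs, hyp_pairs = set(ref.items()), set(hyp.items())
        correct += len(ref_pairs & hyp_pairs)
        n_ref += len(ref_pairs)
        n_hyp += len(hyp_pairs)
    if correct == 0:
        return 0.0
    precision, recall = correct / n_hyp, correct / n_ref
    return 2 * precision * recall / (precision + recall)


def goal_accuracy(ref_goals, hyp_goals):
    """Fraction of utterances whose predicted goal matches the reference."""
    matches = sum(r == h for r, h in zip(ref_goals, hyp_goals))
    return matches / len(ref_goals)


ref = [{"FROMLOC.CITY": "Boston", "TOLOC.CITY": "New York"}]
hyp = [{"FROMLOC.CITY": "Boston", "TOLOC.CITY": "Denver"}]
print(slot_f_measure(ref, hyp))                                   # 0.5
print(goal_accuracy(["FLIGHT", "AIRFARE"], ["FLIGHT", "FLIGHT"]))  # 0.5
```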
22 Conclusion
The key features of the HVS model:
–Its ability to represent hierarchical information in a constrained way
–Its capability to be trained directly from target semantics without explicit word-level annotation
23 HVS Language Model
The basic HVS model is a regular HMM in which each state encodes history in a fixed-dimension stack-like structure.
Each state consists of a stack in which each element is a label chosen from a finite set of cardinality M+1: C = {c_1, …, c_M, c_#}
A depth-D HVS model state can be characterized by a vector of dimension D, with the most recently pushed element at index 1 and the oldest at index D.
24 HVS Language Model (cont.)
25 HVS Language Model (cont.)
Each HVS model state transition is restricted:
(i) exactly n_t class labels are popped off the stack
(ii) exactly one new class label c_t is pushed onto the stack
The number of elements to pop, n_t, and the choice of the new class label to push, c_t, are determined:
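The restricted transition can be sketched as a pure function on a fixed-depth tuple. Two details are assumptions made here to keep the depth fixed, since the slide does not spell them out: empty positions are filled with the extra label c_# (written `"c#"` below), and the oldest label is dropped when a push would exceed depth D.

```python
PAD = "c#"  # stands in for c_#; treating it as a padding label is an assumption


def shift(state, n_pop, new_label):
    """Pop `n_pop` labels, push `new_label`, and keep the depth fixed.

    `state` is a tuple of length D with the most recent label at position 0
    (index 1 on the slide) and the oldest at position D-1 (index D).
    """
    depth = len(state)
    if not 0 <= n_pop <= depth:
        raise ValueError("n_pop must be between 0 and the stack depth")
    kept = state[n_pop:]                              # (i) pop n_t labels
    grown = (new_label,) + kept                       # (ii) push one label
    padded = grown + (PAD,) * (depth - len(grown))    # refill emptied slots
    return padded[:depth]                             # drop the oldest if overfull


state = (PAD, PAD, PAD)
state = shift(state, 0, "A")   # ("A", "c#", "c#")
state = shift(state, 0, "B")   # ("B", "A", "c#")
state = shift(state, 1, "C")   # ("C", "A", "c#")
print(state)
```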
26 HVS Language Model (cont.)
27 HVS Language Model (cont.)
n_t is conditioned on all the class labels that are on the stack at t-1, but c_t is conditioned only on the class labels that remain on the stack after the pop operation.
The former distribution can encode embedding, whereas the latter focuses on modeling long-range dependencies.
28 HVS Language Model (cont.) Joint probability: Assumption:
29 HVS Language Model (cont.)
Training: EM algorithm
–C, N: latent data; W: observed data
E-step:
30 HVS Language Model (cont.)
M-step:
–Q function (auxiliary):
–Substituting P(W,C,N|λ)
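In the standard EM formulation, with W observed and C, N latent, the auxiliary function has the generic form below. The symbol λ' for the re-estimated parameters is notation assumed here, and the model-specific expansion obtained by substituting P(W,C,N|λ) is not reproduced.

```latex
% E-step: posterior over the latent sequences under the current parameters
P(C, N \mid W, \lambda) = \frac{P(W, C, N \mid \lambda)}{\sum_{C', N'} P(W, C', N' \mid \lambda)}

% M-step: choose \lambda' to maximise the auxiliary function
Q(\lambda, \lambda') = \sum_{C, N} P(C, N \mid W, \lambda) \, \log P(W, C, N \mid \lambda')
```

Because the log of a product of per-position factors is a sum, Q splits into one term per distribution, so each distribution can be re-estimated on its own.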
31 HVS Language Model (cont.) Calculate probability distributions separately.
32 HVS Language Model (cont.)
State space S, if fully populated:
–|S| = M^D states, for M = 100+, D = 3 to 4
Due to data sparseness, backoff is needed.
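As a worked instance of the state count, taking M = 100 as the lower end of the quoted range:

```latex
|S| = M^{D}: \qquad 100^{3} = 10^{6} \quad (D = 3), \qquad 100^{4} = 10^{8} \quad (D = 4)
```

With a training set on the order of a few hundred thousand words, most of these states can never be observed, which is why the estimates have to back off to shorter stack contexts.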
33 HVS Language Model (cont.) Backoff weight: Modified version of absolute discounting
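The slide does not give the details of the modified scheme, so the sketch below shows plain interpolated absolute discounting instead, purely to illustrate how a backoff weight arises from the discounted mass. The function name, the use of a unigram as the lower-order distribution, and the discount value 0.5 are all assumptions; in the HVS setting the "history" would be a stack context.

```python
from collections import Counter, defaultdict


def absolute_discounting(pairs, discount=0.5):
    """Return prob(w, h) estimated from (history, word) pairs."""
    joint = Counter(pairs)                      # c(h, w)
    hist = Counter(h for h, _ in pairs)         # c(h)
    unigram = Counter(w for _, w in pairs)      # lower-order counts
    total = sum(unigram.values())
    seen = defaultdict(set)                     # distinct words seen after h
    for h, w in joint:
        seen[h].add(w)

    def prob(w, h):
        lower = unigram[w] / total
        if hist[h] == 0:                        # unseen history: back off fully
            return lower
        # Backoff weight = probability mass removed by discounting.
        weight = discount * len(seen[h]) / hist[h]
        return max(joint[(h, w)] - discount, 0.0) / hist[h] + weight * lower

    return prob


data = [("A", "x"), ("A", "x"), ("A", "y"), ("B", "z")]
prob = absolute_discounting(data)
print(prob("z", "A"))   # nonzero although "z" never followed "A"
```

The weight is chosen so that the discounted and backed-off parts sum to one over the vocabulary for every seen history.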
34 Experiments
Training set:
–ATIS-3, 276K words, 23K sentences
Development set:
–ATIS-3 Nov93
Test set:
–ATIS-3 Dec94, 10K words, 1K sentences
OOV words were removed
k=850
35 Experiments (cont.)
36 Experiments (cont.)
37 Conclusion
The HVS language model is able to make better use of context than standard class n-gram models.
The HVS model is trainable using EM.
38 Class tree for implementation
39 Iteration number vs. perplexity