7. Sequential Models
Tutorial on Machine Learning in Natural Language Processing and Information Extraction
Dan Roth, University of Illinois, Urbana-Champaign
danr@cs.uiuc.edu, http://L2R.cs.uiuc.edu/~danr

1. Sequential Inference Problems
2. HMM
3. Advanced Models
Outline
- Shallow Parsing: what it is, why we need it
- Shallow Parsing (Learning) Models
  - Learning Sequences: Hidden Markov Model (HMM)
  - Discriminative Approaches: HMM with Classifiers, PMM
  - Learning and Inference: CSCL
  - Related Approaches
Parsing
- Parsing is the task of analyzing the syntactic structure of an input sentence
- The output is a "parse tree"
Parsing
[Figure: full parse tree for the sentence "Mr. Brown blamed Mr. Bob for making things dirty", with constituents labeled S, NP, VP, PP, and ADJP]
Parsing
- Parsing is an important task toward understanding natural languages
- It is not an easy task
- Many NLP tasks do not require all the information in a parse tree
Shallow Parsing
- Shallow parsing is a generic term for tasks that identify only a subset of the parse tree structure
Shallow Parsing
[Figure: the same parse tree for "Mr. Brown blamed Mr. Bob for making things dirty", with only a subset of the constituents identified]
Reasons for Studying Shallow Parsing
- Not all the parse tree information is needed
- It should do better than full parsing on those specific parts
- It can be done more robustly
- It is more adaptive to data from new domains (Li & Roth, CoNLL'01)
- Shallow parsing can serve as an intermediate step toward full parsing
- It is a generic chunking task, applicable to many other segmentation tasks such as named entity recognition [LBJ generic segmentation implementation]
Shallow Parsing [Today]
- By shallow parsing we mean identifying non-overlapping, non-embedding phrases
- Shallow Parsing = Text Chunking
[Figure: the example sentence "Mr. Brown blamed Mr. Bob for making things dirty" with an NP chunk marked]
Text Chunking
- Text chunking, without any further specification, usually follows the definition from the CoNLL-2000 shared task [Demo]
Text Chunking
[Figure: the parse tree for the example sentence with the chunks (NP, VP, PP, ADJP) marked at the base of the tree]
Modeling Shallow Parsing
- Model shallow parsing as a sequence prediction problem
- For each word, predict one of:
  - B: beginning of a chunk
  - I: inside a chunk, but not the beginning
  - O: outside any chunk
Modeling Shallow Parsing
[Figure: the example sentence labeled with BIO tags, e.g. Mr./B-NP Brown/I-NP blamed/B-VP Mr./B-NP Bob/I-NP for/B-PP making/B-VP things/B-NP dirty/B-ADJP]
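To make the BIO encoding concrete, here is a minimal sketch (my illustration, not part of the original slides) that converts labeled chunk spans into per-word BIO tags; the span format and the helper name are assumptions.

```python
def chunks_to_bio(words, chunks):
    """Convert labeled chunk spans into per-word BIO tags.

    words:  list of tokens, e.g. ["Mr.", "Brown", "blamed", ...]
    chunks: list of (start, end, label) spans with `end` exclusive
            (this format is an assumption made for the example)
    """
    tags = ["O"] * len(words)
    for start, end, label in chunks:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags

words = ["Mr.", "Brown", "blamed", "Mr.", "Bob", "for"]
chunks = [(0, 2, "NP"), (2, 3, "VP"), (3, 5, "NP"), (5, 6, "PP")]
print(list(zip(words, chunks_to_bio(words, chunks))))
# [('Mr.', 'B-NP'), ('Brown', 'I-NP'), ('blamed', 'B-VP'), ...]
```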
Similar Problems
This model applies to many other problems:
- NLP: named entity recognition, verb-argument identification
- Other domains: identifying genes (splice sites; Chuang & Roth '01)
Outline
- Shallow Parsing: what it is, why we need it
- Shallow Parsing (Learning) Models
  - Learning Sequences: Hidden Markov Model (HMM)
  - Discriminative Approaches: HMM with Classifiers, PMM
  - Learning and Inference: CSCL
  - Related Approaches
Hidden Markov Model (HMM)
- An HMM is a probabilistic generative model: it models how an observed sequence is generated
- Call each position in a sequence a time step
- At each time step there are two variables:
  - the current state (hidden)
  - the observation
HMM Elements
- Initial state probability P(s_1)
- Transition probability P(s_t | s_{t-1})
- Observation probability P(o_t | s_t)
[Figure: HMM as a chain of hidden states s_1 ... s_6, each emitting an observation o_1 ... o_6]
HMM for Shallow Parsing
- States: {B, I, O}
- Observations: actual words and/or part-of-speech tags
[Figure: the state/observation chain for the start of the example sentence: s_1=B/o_1=Mr., s_2=I/o_2=Brown, s_3=O/o_3=blamed, s_4=B/o_4=Mr., s_5=I/o_5=Bob, s_6=O/o_6=for]
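As a concrete illustration (not from the slides), the three parameter tables of a {B, I, O} HMM can be stored as plain dictionaries; every number below is a made-up placeholder.

```python
# Toy HMM parameters for states {B, I, O}; all numbers are illustrative placeholders.
initial = {"B": 0.6, "I": 0.0, "O": 0.4}                       # P(s_1)

transition = {                                                  # P(s_t | s_{t-1})
    "B": {"B": 0.1, "I": 0.5, "O": 0.4},
    "I": {"B": 0.2, "I": 0.4, "O": 0.4},
    "O": {"B": 0.4, "I": 0.0, "O": 0.6},   # zero probability: 'I' never follows 'O'
}

emission = {                                                    # P(o_t | s_t)
    "B": {"Mr.": 0.3, "Brown": 0.05, "blamed": 0.01},
    "I": {"Mr.": 0.05, "Brown": 0.3, "Bob": 0.3},
    "O": {"blamed": 0.2, "for": 0.2},
}
```

Note how zero entries (for the initial 'I' state and for the O-to-I transition) encode the chunking constraints discussed later in the deck.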
HMM for Shallow Parsing
- Given a sentence, we can ask: what is the most likely state sequence?
- Initial state probability: P(s_1=B), P(s_1=I), P(s_1=O)
- Transition probability: P(s_t=B | s_{t-1}=B), P(s_t=I | s_{t-1}=B), P(s_t=O | s_{t-1}=B), P(s_t=B | s_{t-1}=I), P(s_t=I | s_{t-1}=I), P(s_t=O | s_{t-1}=I), ...
- Observation probability: P(o_t='Mr.' | s_t=B), P(o_t='Brown' | s_t=B), ..., P(o_t='Mr.' | s_t=I), P(o_t='Brown' | s_t=I), ...
[Figure: the same state/observation chain as on the previous slide]
Finding the most likely state sequence in an HMM (1)
The joint probability of a state sequence and an observation sequence:
P(s_k, s_{k-1}, ..., s_1, o_k, o_{k-1}, ..., o_1)
  = P(o_k | s_k) · [∏_{t=1}^{k-1} P(s_{t+1} | s_t) · P(o_t | s_t)] · P(s_1)
Finding the most likely state sequence in an HMM (2)
We want the state sequence that is most likely given the observations:
argmax_{s_k, s_{k-1}, ..., s_1} P(s_k, s_{k-1}, ..., s_1 | o_k, o_{k-1}, ..., o_1)
Since the observations are fixed, this is the same as maximizing the joint probability from (1).
Finding the most likely state sequence in an HMM (3)
Push the max operators inward, one state at a time:

max_{s_k,...,s_1} P(o_k|s_k) · [∏_{t=1}^{k-1} P(s_{t+1}|s_t) · P(o_t|s_t)] · P(s_1)
  = max_{s_k} P(o_k|s_k) · max_{s_{k-1},...,s_1} [∏_{t=1}^{k-1} P(s_{t+1}|s_t) · P(o_t|s_t)] · P(s_1)
  = max_{s_k} P(o_k|s_k) · max_{s_{k-1}} [P(s_k|s_{k-1}) · P(o_{k-1}|s_{k-1})] · max_{s_{k-2},...,s_1} [∏_{t=1}^{k-2} P(s_{t+1}|s_t) · P(o_t|s_t)] · P(s_1)
  = max_{s_k} P(o_k|s_k) · max_{s_{k-1}} [P(s_k|s_{k-1}) · P(o_{k-1}|s_{k-1})] · max_{s_{k-2}} [P(s_{k-1}|s_{k-2}) · P(o_{k-2}|s_{k-2})] · ... · max_{s_1} [P(s_2|s_1) · P(o_1|s_1)] · P(s_1)
Finding the most likely state sequence in an HMM (4)
- This nested maximization is computed efficiently by Viterbi's algorithm (dynamic programming)
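A compact Viterbi sketch over the dictionary-style parameters shown earlier; this is an illustrative implementation, not the tutorial's own code, and it assumes every needed emission probability is present (a real system would smooth unseen words).

```python
def viterbi(observations, states, initial, transition, emission):
    """Return the most likely state sequence for the observations."""
    # best[t][s] = (score of the best path ending in state s at time t, backpointer)
    best = [{s: (initial[s] * emission[s].get(observations[0], 0.0), None)
             for s in states}]
    for t in range(1, len(observations)):
        column = {}
        for s in states:
            emit = emission[s].get(observations[t], 0.0)
            score, prev = max(
                (best[t - 1][p][0] * transition[p][s] * emit, p) for p in states
            )
            column[s] = (score, prev)
        best.append(column)

    # Trace back from the best final state.
    last = max(states, key=lambda s: best[-1][s][0])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        last = best[t][last][1]
        path.append(last)
    return list(reversed(path))

# Usage with the toy `initial`, `transition`, `emission` dictionaries sketched above:
# viterbi(["Mr.", "Brown", "blamed"], ["B", "I", "O"], initial, transition, emission)
```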
Learning the Model
- Estimate the initial state probability P(s_1), the transition probability P(s_t | s_{t-1}), and the observation probability P(o_t | s_t)
- Unsupervised learning: EM algorithm
- Supervised learning: estimate each element directly from labeled data, as sketched below
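A minimal sketch of the supervised case (my illustration, not the tutorial's code): the maximum-likelihood estimates are just normalized counts over tagged sentences; smoothing is omitted.

```python
from collections import Counter, defaultdict

def estimate_hmm(tagged_sentences):
    """tagged_sentences: list of [(word, state), ...]; returns MLE parameter dicts."""
    init, trans, emit = Counter(), defaultdict(Counter), defaultdict(Counter)
    for sentence in tagged_sentences:
        states = [s for _, s in sentence]
        init[states[0]] += 1                        # counts for P(s_1)
        for prev, cur in zip(states, states[1:]):
            trans[prev][cur] += 1                   # counts for P(s_t | s_{t-1})
        for word, state in sentence:
            emit[state][word] += 1                  # counts for P(o_t | s_t)

    def normalize(counter):
        total = sum(counter.values())
        return {k: v / total for k, v in counter.items()}

    initial = normalize(init)
    transition = {s: normalize(c) for s, c in trans.items()}
    emission = {s: normalize(c) for s, c in emit.items()}
    return initial, transition, emission
```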
Experimental Results
NP prediction (POS tags only):

  Recall   Precision   F1
  89.08    86.62       87.83

R (Recall)    = (number of chunks predicted correctly) / (number of correct chunks)
P (Precision) = (number of chunks predicted correctly) / (number of predicted chunks)
F1 = 2RP / (R + P)
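Here is one way (an illustrative sketch, not the tutorial's evaluation code) to compute these chunk-level scores by comparing the sets of predicted and gold chunk spans.

```python
def chunk_f1(gold_chunks, predicted_chunks):
    """Each argument is a set of (start, end, label) spans; returns (recall, precision, f1)."""
    correct = len(gold_chunks & predicted_chunks)   # chunks predicted exactly right
    recall = correct / len(gold_chunks) if gold_chunks else 0.0
    precision = correct / len(predicted_chunks) if predicted_chunks else 0.0
    f1 = 2 * recall * precision / (recall + precision) if recall + precision else 0.0
    return recall, precision, f1

gold = {(0, 2, "NP"), (3, 5, "NP")}
pred = {(0, 2, "NP"), (3, 4, "NP")}
print(chunk_f1(gold, pred))   # (0.5, 0.5, 0.5)
```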
Note
- The experiments use a different chunk representation
[Figure: the example sentence shown once with BIO tags and once with open/close brackets "[" and "]" marking the same chunks]
Outline
- Shallow Parsing: what it is, why we need it
- Shallow Parsing (Learning) Models
  - Learning Sequences: Hidden Markov Model (HMM)
  - Discriminative Approaches: HMM with Classifiers, PMM
  - Learning and Inference: CSCL
  - Related Approaches
Problems with HMM
- Long-term dependencies are hard to incorporate
- An HMM is trained to maximize the likelihood of the data, not the quality of its predictions: it maximizes P(S, O), not P(S | O)
- What metric do we actually care about?
Discriminative Approaches
- Model the predictions directly
- Shallow parsing goal: predict the BIO sequence given a sentence, S* = argmax_S P(S | O)
- Generative approaches model P(S, O)
- Discriminative approaches model P(S | O) directly
Discriminative Approaches
- Use classifiers to predict each label from the context of the input
- Good: a larger context can be incorporated, as in the feature sketch below
[Figure: a classifier reading the sentence and emitting the BIO tag for each word]
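A minimal illustration (my own, with made-up feature names) of how a per-word classifier can use a window of surrounding words as features.

```python
def window_features(words, i, size=2):
    """Features for predicting the tag of words[i] from a +/- `size` word window."""
    feats = {"word=" + words[i].lower()}
    for offset in range(-size, size + 1):
        if offset == 0:
            continue
        j = i + offset
        token = words[j].lower() if 0 <= j < len(words) else "<pad>"
        feats.add(f"word[{offset:+d}]={token}")
    return feats

words = ["Mr.", "Brown", "blamed", "Mr.", "Bob", "for"]
print(sorted(window_features(words, 2)))
# ['word=blamed', 'word[+1]=mr.', 'word[+2]=bob', 'word[-1]=brown', 'word[-2]=mr.']
```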
Problems with Using Classifiers
- Outputs may be inconsistent
[Figure: the per-word classifier produces a tag sequence such as I O I I O B O B O, where an 'I' follows an 'O'; this is not a legal chunking]
Constraints
- An 'I' cannot follow an 'O'
- The sequence has to begin with a 'B' or an 'O'
- How do we maintain these constraints? (A simple validity check is sketched below.)
[Figure: a small state diagram over B, I, O showing the legal transitions]
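As a concrete restatement of the two constraints (an illustrative sketch, not from the slides), a legal BIO sequence can be checked with a tiny function.

```python
def is_legal_bio(tags):
    """True iff the tag sequence satisfies the chunking constraints."""
    if tags and tags[0] == "I":          # must start with 'B' or 'O'
        return False
    for prev, cur in zip(tags, tags[1:]):
        if prev == "O" and cur == "I":   # 'I' cannot follow 'O'
            return False
    return True

print(is_legal_bio(["B", "I", "O", "B", "I", "O"]))  # True
print(is_legal_bio(["I", "O", "I", "I", "O", "B"]))  # False
```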
HMM Revisited
- An HMM does not have this problem: the constraints are taken care of by the transition probabilities
- Specifically, zero transition probabilities ensure that the output always satisfies the constraints
HMM with Classifiers
- An HMM requires the observation probability P(o_t | s_t)
- Classifiers give us P(s_t | o_t)
- We can compute P(o_t | s_t) by Bayes' rule: P(o_t | s_t) = P(s_t | o_t) P(o_t) / P(s_t)
- See details [here]
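A small sketch of this conversion (illustrative only): given a classifier's conditional distribution over states for a word and the empirical state priors, the emission score used for decoding can drop the factor P(o_t), since it is the same for every state at that position.

```python
def emission_scores(classifier_probs, state_prior):
    """classifier_probs: P(state | observation) from a trained classifier.
    state_prior: P(state) estimated from the training data.
    Returns scores proportional to P(observation | state); the constant
    factor P(observation) is dropped because it does not affect the argmax."""
    return {s: p / state_prior[s] for s, p in classifier_probs.items()}

# Example with made-up numbers: classifier output for the word "Brown".
print(emission_scores({"B": 0.2, "I": 0.7, "O": 0.1},
                      {"B": 0.3, "I": 0.3, "O": 0.4}))
```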
Experimental Results
NP prediction:

  Model                                          Recall   Precision   F1
  HMM                                            89.08    86.62       87.83
  HMM with Classifiers                           91.49    91.53       91.51
  HMM with Classifiers (with lexical features)   93.85    93.46       93.65
Projection-based Markov Model (PMM)
- The classifiers use the previous prediction as part of their input
[Figure: a classifier that reads the sentence plus the previously predicted tag and emits the tag for the current word]
PMM
- Each classifier outputs P(s_t | s_{t-1}, O)
- How do you train and test in this case?
argmax_{s_k, s_{k-1}, ..., s_1} P(s_k, s_{k-1}, ..., s_1 | O)
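One simple way to use such classifiers at test time (an illustrative sketch, not necessarily what the PMM paper does) is greedy left-to-right decoding, feeding each prediction back in as the previous state; a Viterbi-style search over the classifier scores is the more exact alternative.

```python
def greedy_decode(words, score):
    """score(prev_state, words, i) -> dict of P(s_i = state | prev_state, words).
    Decode left to right, feeding each prediction back in as the previous state."""
    tags, prev = [], "<start>"
    for i in range(len(words)):
        probs = score(prev, words, i)
        prev = max(probs, key=probs.get)
        tags.append(prev)
    return tags

# `score` would wrap a trained classifier; this stub always prefers 'O'.
def dummy(prev, words, i):
    return {"B": 0.2, "I": 0.1, "O": 0.7}

print(greedy_decode(["Mr.", "Brown", "blamed"], dummy))  # ['O', 'O', 'O']
```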
Experimental Results
NP prediction:

  Model                           Recall   Precision   F1
  HMM (POS)                       89.08    86.62       87.83
  HMM with Classifiers (POS)      91.49    91.53       91.51
  HMM with Classifiers            93.85    93.46       93.65
  PMM (POS)                       92.04    91.77       91.90
  PMM                             94.28    94.02       94.15
Outline
- Shallow Parsing: what it is, why we need it
- Shallow Parsing (Learning) Models
  - Learning Sequences: Hidden Markov Model (HMM)
  - Discriminative Approaches: HMM with Classifiers, PMM
  - Learning and Inference: CSCL
  - Related Approaches
Learning and Inference
- Learning: estimate parameters from data and make predictions for new data
- Inference: ensure that the outputs satisfy the constraints
Inference
- Assign values to the variables of interest (the output states) in a way that maximizes your objective function and satisfies the constraints
Inference
Boolean constraint satisfaction problem:
- V: a set of variables
- Cost function c: V -> [0, 1]
- Clauses model the constraints
- A satisfying assignment τ: V -> {0, 1}
- Find the satisfying assignment that minimizes the cost C(τ) = Σ_{i=1..n} τ(v_i) c(v_i)
Inference for Shallow Parsing
- Define a variable v_i for each possible chunk
- Cost of each chunk: c(v_i) = -P(v_i is a chunk)
- For any two overlapping chunks, add the clause (¬v_i ∨ ¬v_j)
- Find the assignment that minimizes the cost C(τ) = Σ_{i=1..n} τ(v_i) c(v_i)
- This is a good objective function: it maximizes the expected number of correct chunks
- CSP in general is hard, but the structure of these constraints yields a problem that can be solved by a shortest path algorithm (a dynamic programming sketch follows below)
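To make the inference step concrete, here is an equivalent dynamic program of my own (weighted interval scheduling, not the authors' shortest-path code): select a non-overlapping subset of candidate chunks that maximizes the total probability, which is the same as minimizing the sum of the -P costs.

```python
from bisect import bisect_right

def best_chunking(chunks):
    """chunks: list of (start, end, prob) candidate chunks, `end` exclusive.
    Select a non-overlapping subset maximizing the total probability."""
    chunks = sorted(chunks, key=lambda c: c[1])          # sort by end position
    ends = [c[1] for c in chunks]
    # best[i] = (best total prob using the first i chunks, chosen spans)
    best = [(0.0, [])]
    for i, (start, end, prob) in enumerate(chunks):
        skip = best[i]                                   # do not take this chunk
        j = bisect_right(ends, start, 0, i)              # chunks ending at or before `start`
        take = (best[j][0] + prob, best[j][1] + [(start, end)])
        best.append(max(skip, take, key=lambda x: x[0]))
    return best[-1]

candidates = [(0, 2, 0.66), (1, 3, 0.48), (2, 4, 0.56), (4, 6, 0.77)]
print(best_chunking(candidates))   # approximately (1.99, [(0, 2), (2, 4), (4, 6)])
```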
Inference for Shallow Parsing
[Figure: a lattice over the word positions o_1 ... o_7 with open-bracket (O) and close-bracket (C) edges weighted by negated chunk probabilities (e.g., -0.48, -0.19, -0.66, -0.56, -0.77, -0.36, -0.44, -0.22); the best chunking corresponds to a shortest path through this graph]
Experimental Results
NP prediction:

  Model                           Recall   Precision   F1
  HMM (POS)                       89.08    86.62       87.83
  HMM with Classifiers (POS)      91.49    91.53       91.51
  HMM with Classifiers            93.85    93.46       93.65
  PMM                             94.28    94.02       94.15
  CSCL                            94.12    93.45       93.78
Other Approaches to Shallow Parsing
- MEMM
- CRF
- Global Perceptron (Collins)
- Others (e.g., global SVM)
Maximum Entropy Markov Model (MEMM)
- Similar to PMM, but each term is learned with a maximum entropy model
argmax_{s_k, ..., s_1} P(s_k, s_{k-1}, ..., s_1 | O)
  = argmax_{s_k, ..., s_1} [∏_{t=2}^{k} P(s_t | s_{t-1}, O)] · P(s_1 | O)
[Figure: the same per-word classifier picture as in the PMM slide]
Conditional Random Field (CRF)
- Similar to PMM, but based on Markov random field theory (an undirected graphical model)
argmax_{s_k, ..., s_1} P(s_k, s_{k-1}, ..., s_1 | O)
  = argmax_{s_k, ..., s_1} (1 / Z(O)) exp(Σ_i w_i f_i(s_k, s_{k-1}, O))
Collins' Perceptron Training for HMM
- Similar to an HMM, but the weights are learned with the Perceptron algorithm

argmax_{s_k, ..., s_1} P(s_k, s_{k-1}, ..., s_1 | O)
  = argmax_{s_k, ..., s_1} P(o_k|s_k) · [∏_{t=1}^{k-1} P(s_{t+1}|s_t) · P(o_t|s_t)] · P(s_1)
  = argmax_{s_k, ..., s_1} log( P(o_k|s_k) · [∏_{t=1}^{k-1} P(s_{t+1}|s_t) · P(o_t|s_t)] · P(s_1) )
  = argmax_{s_k, ..., s_1} log P(o_k|s_k) + [Σ_{t=1}^{k-1} log P(s_{t+1}|s_t) + log P(o_t|s_t)] + log P(s_1)
  = argmax_{s_k, ..., s_1} w_k f_k(o_k, s_k) + [Σ_{t=1}^{k-1} u_t g_t(s_{t+1}, s_t) + w_t f_t(o_t, s_t)] + u_1 g_1(s_1)
Maximum Entropy Markov Model (MEMM)
Each local term is a maximum entropy model:
P(s_k | s_{k-1}, O) = (1 / Z(s_{k-1}, O)) exp(Σ_i w_i f_i(s_k, s_{k-1}, O))
and the decoding problem is still
argmax_{s_k, s_{k-1}, ..., s_1} P(s_k, s_{k-1}, ..., s_1 | O)
Conclusions
- All of these approaches use a linear representation
- The differences are in the features and in how the weights are learned
- Training paradigms:
  - Global training (HMM, CRF, Global Perceptron)
  - Modular training (PMM, MEMM, CSCL): easier to train, but may require additional mechanisms to enforce global constraints