
1 Dan Roth, University of Illinois, Urbana-Champaign. danr@cs.uiuc.edu, http://L2R.cs.uiuc.edu/~danr. 7. Sequential Models. Tutorial on Machine Learning in Natural Language Processing and Information Extraction. 1. Sequential Inference Problems 2. HMM 3. Advanced Models

2 Outline: Shallow Parsing (What it is, Why we need it); Shallow Parsing (Learning) Models (Learning Sequences); Hidden Markov Model (HMM); Discriminative Approaches (HMM with Classifiers, PMM); Learning and Inference (CSCL); Related Approaches

3 Parsing  Parsing is the task of analyzing the syntactic structure of an input  The output is a "parse tree"

4 Parsing  [Figure: parse tree for "Mr. Brown blamed Mr. Bob for making things dirty", with S, NP, VP, PP, and ADJP nodes]

5 Parsing  Parsing is an important task toward understanding natural language  It is not an easy task  Many NLP tasks do not require all the information in a parse tree

6 Shallow Parsing  Shallow parsing is a generic term for tasks that identify only a subset of the parse tree

7 Shallow Parsing  [Figure: the same parse tree for "Mr. Brown blamed Mr. Bob for making things dirty"]

8 Reasons for Studying Shallow Parsing  Not all the parse tree information is needed It should do better than full parsing on those specific parts  It can be done more robustly More adaptive to data from new domains (Li & Roth, CoNLL'01)  Shallow parsing is an intermediate step toward full parsing  It is a generic chunking task, applicable to many other segmentation tasks such as named entity recognition [LBJ generic segmentation implementation]

9 Shallow Parsing [Today]  By shallow parsing we mean: identifying non-overlapping, non-embedding phrases  Shallow Parsing = Text Chunking  [Figure: "Mr. Brown blamed Mr. Bob for making things dirty" with its NP chunks bracketed]

10 Text Chunking  Text chunking without any further specification usually follows the definition from the CoNLL-2000 shared task [Demo]

11 Text Chunking  [Figure: the parse tree for "Mr. Brown blamed Mr. Bob for making things dirty" with the corresponding NP, VP, PP, and ADJP chunks marked below the words]

12 Modeling Shallow Parsing  Model shallow parsing as a sequence prediction problem  For each word, predict one of: B – Beginning of a chunk; I – Inside a chunk but not a beginning; O – Outside a chunk

13 Modeling Shallow Parsing  [Figure: "Mr. Brown blamed Mr. Bob for making things dirty" tagged as Mr./B-NP Brown/I-NP blamed/B-VP Mr./B-NP Bob/I-NP for/B-PP making/B-VP things/B-NP dirty/B-ADJP; with NP chunks only, the tags are B I O B I O O B O]
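
To make the encoding concrete, a minimal Python sketch (illustrative only; the chunk spans are read off the running example above, and the helper name is made up):

```python
# Illustrative sketch of the BIO encoding from slides 12-13 (not from the tutorial's code).
def spans_to_bio(num_words, chunks):
    """chunks: list of (start, end, type) spans, end exclusive, non-overlapping."""
    tags = ["O"] * num_words
    for start, end, chunk_type in chunks:
        tags[start] = "B-" + chunk_type
        for i in range(start + 1, end):
            tags[i] = "I-" + chunk_type
    return tags

words = ["Mr.", "Brown", "blamed", "Mr.", "Bob", "for", "making", "things", "dirty"]
chunks = [(0, 2, "NP"), (2, 3, "VP"), (3, 5, "NP"), (5, 6, "PP"),
          (6, 7, "VP"), (7, 8, "NP"), (8, 9, "ADJP")]
print(list(zip(words, spans_to_bio(len(words), chunks))))
# [('Mr.', 'B-NP'), ('Brown', 'I-NP'), ('blamed', 'B-VP'), ...]
```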

14 Similar Problems  This model applies to many other problems. NLP: named entity recognition; verb-argument identification. Other domains: identifying genes (splice sites; Chuang & Roth'01)

15 Outline: Shallow Parsing (What it is, Why we need it); Shallow Parsing (Learning) Models (Learning Sequences); Hidden Markov Model (HMM); Discriminative Approaches (HMM with Classifiers, PMM); Learning and Inference (CSCL); Related Approaches

16 Hidden Markov Model (HMM)  An HMM is a probabilistic generative model: it models how an observed sequence is generated  Call each position in a sequence a time step  At each time step there are two variables: the current state (hidden) and the observation

17 HMM  Elements: initial state probability P(s_1); transition probability P(s_t | s_{t-1}); observation probability P(o_t | s_t)  [Figure: HMM as a chain s_1 → s_2 → ... → s_6, each state s_t emitting an observation o_t]
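
As a concrete picture of these three tables, a minimal sketch with made-up toy probabilities (not parameters from the tutorial) that evaluates the joint probability an HMM assigns to one state/observation sequence:

```python
# Toy HMM parameters for the {B, I, O} states; the numbers are invented for illustration.
initial = {"B": 0.6, "I": 0.0, "O": 0.4}                                  # P(s_1)
transition = {("B", "B"): 0.1, ("B", "I"): 0.6, ("B", "O"): 0.3,          # transition[(prev, cur)]
              ("I", "B"): 0.1, ("I", "I"): 0.4, ("I", "O"): 0.5,          #   = P(s_t = cur | s_{t-1} = prev)
              ("O", "B"): 0.4, ("O", "I"): 0.0, ("O", "O"): 0.6}
emission = {("B", "Mr."): 0.2, ("I", "Brown"): 0.1, ("O", "blamed"): 0.05}  # P(o_t | s_t)

def joint_probability(states, observations):
    """P(s_1) * prod_t P(s_t | s_{t-1}) * prod_t P(o_t | s_t), with a tiny floor for unseen words."""
    p = initial[states[0]] * emission.get((states[0], observations[0]), 1e-6)
    for t in range(1, len(states)):
        p *= transition[(states[t - 1], states[t])]
        p *= emission.get((states[t], observations[t]), 1e-6)
    return p

print(joint_probability(["B", "I", "O"], ["Mr.", "Brown", "blamed"]))
```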

18 HMM for Shallow Parsing  States: {B, I, O}  Observations: actual words and/or part-of-speech tags  Example: s_1=B (o_1=Mr.), s_2=I (o_2=Brown), s_3=O (o_3=blamed), s_4=B (o_4=Mr.), s_5=I (o_5=Bob), s_6=O (o_6=for)

19 HMM for Shallow Parsing  Given a sentence, we can ask what the most likely state sequence is  Initial state probability: P(s_1=B), P(s_1=I), P(s_1=O)  Transition probability: P(s_t=B|s_{t-1}=B), P(s_t=I|s_{t-1}=B), P(s_t=O|s_{t-1}=B), P(s_t=B|s_{t-1}=I), P(s_t=I|s_{t-1}=I), P(s_t=O|s_{t-1}=I), ...  Observation probability: P(o_t='Mr.'|s_t=B), P(o_t='Brown'|s_t=B), ..., P(o_t='Mr.'|s_t=I), P(o_t='Brown'|s_t=I), ...  [Figure: the chain s_1=B, s_2=I, s_3=O, s_4=B, s_5=I, s_6=O over "Mr. Brown blamed Mr. Bob for"]

20 Finding the most likely state sequence in HMM (1)  $P(s_k, s_{k-1}, \ldots, s_1, o_k, o_{k-1}, \ldots, o_1)$

21 Finding the most likely state sequence in HMM (2)  $\arg\max_{s_k, s_{k-1}, \ldots, s_1} P(s_k, s_{k-1}, \ldots, s_1 \mid o_k, o_{k-1}, \ldots, o_1)$

22 Finding the most likely state sequence in HMM (3)
$\max_{s_k, \ldots, s_1} P(o_k \mid s_k) \cdot \Big[\prod_{t=1}^{k-1} P(s_{t+1} \mid s_t) \cdot P(o_t \mid s_t)\Big] \cdot P(s_1)$
$= \max_{s_k} P(o_k \mid s_k) \cdot \max_{s_{k-1}, \ldots, s_1} \Big[\prod_{t=1}^{k-1} P(s_{t+1} \mid s_t) \cdot P(o_t \mid s_t)\Big] \cdot P(s_1)$
$= \max_{s_k} P(o_k \mid s_k) \cdot \max_{s_{k-1}} \big[P(s_k \mid s_{k-1}) \cdot P(o_{k-1} \mid s_{k-1})\big] \cdot \max_{s_{k-2}, \ldots, s_1} \Big[\prod_{t=1}^{k-2} P(s_{t+1} \mid s_t) \cdot P(o_t \mid s_t)\Big] \cdot P(s_1)$
$= \max_{s_k} P(o_k \mid s_k) \cdot \max_{s_{k-1}} \big[P(s_k \mid s_{k-1}) \cdot P(o_{k-1} \mid s_{k-1})\big] \cdot \max_{s_{k-2}} \big[P(s_{k-1} \mid s_{k-2}) \cdot P(o_{k-2} \mid s_{k-2})\big] \cdots \max_{s_1} \big[P(s_2 \mid s_1) \cdot P(o_1 \mid s_1)\big] \cdot P(s_1)$

23 Finding the most likely state sequence in HMM (4)  Viterbi's algorithm (dynamic programming)
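
A compact sketch of the dynamic program, assuming the same toy dictionary-based parameter tables as in the earlier sketch; this illustrates the recursion on the previous slide and is not the tutorial's implementation:

```python
import math

def viterbi(observations, states, initial, transition, emission, floor=1e-6):
    """Return the most likely state sequence under the HMM (computed in log space)."""
    logp = lambda x: math.log(max(x, floor))
    # best[t][s]: best log-probability of any state sequence ending in state s at time t
    best = [{s: logp(initial[s]) + logp(emission.get((s, observations[0]), floor))
             for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda r: best[t - 1][r] + logp(transition[(r, s)]))
            back[t][s] = prev
            best[t][s] = (best[t - 1][prev] + logp(transition[(prev, s)])
                          + logp(emission.get((s, observations[t]), floor)))
    last = max(states, key=lambda s: best[-1][s])   # best final state
    path = [last]
    for t in range(len(observations) - 1, 0, -1):   # follow the back-pointers
        path.append(back[t][path[-1]])
    return list(reversed(path))
```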

24 Learning the Model  Estimate: initial state probability P(s_1); transition probability P(s_t | s_{t-1}); observation probability P(o_t | s_t)  Unsupervised learning: EM algorithm  Supervised learning: estimate each element directly from data
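
For the supervised case, the estimates are just normalized counts; a small sketch over labeled (word, tag) sequences (illustrative, and using the total tag count as a simplified denominator):

```python
from collections import Counter

# Sketch of supervised maximum-likelihood estimation for the three HMM tables (slide 24).
def estimate_hmm(labeled_sentences):
    """labeled_sentences: list of [(word, tag), ...] lists; returns three count-based tables."""
    init, trans, emit, tag_total = Counter(), Counter(), Counter(), Counter()
    for sentence in labeled_sentences:
        tags = [t for _, t in sentence]
        init[tags[0]] += 1
        for prev, cur in zip(tags, tags[1:]):
            trans[(prev, cur)] += 1
        for word, tag in sentence:
            emit[(tag, word)] += 1
            tag_total[tag] += 1
    n = len(labeled_sentences)
    initial = {t: c / n for t, c in init.items()}                     # P(s_1)
    transition = {k: c / tag_total[k[0]] for k, c in trans.items()}   # ~P(s_t | s_{t-1})
    emission = {k: c / tag_total[k[0]] for k, c in emit.items()}      # P(o_t | s_t)
    return initial, transition, emission
```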

25 Experimental Results  NP prediction (POS tags only): Recall 89.08, Precision 86.62, F1 87.83
$R$ (Recall) = $\frac{\text{number of chunks predicted correctly}}{\text{number of correct chunks}}$
$P$ (Precision) = $\frac{\text{number of chunks predicted correctly}}{\text{number of predicted chunks}}$
$F_1 = \frac{2RP}{R+P}$
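
A small helper that mirrors these definitions at the chunk level (an illustrative sketch, not the official CoNLL evaluation script):

```python
# Chunk-level evaluation: a predicted chunk counts as correct only if its boundaries
# and type exactly match a gold chunk.
def chunk_f1(gold_chunks, predicted_chunks):
    """Each argument is a set of (start, end, type) spans for one corpus."""
    correct = len(gold_chunks & predicted_chunks)
    recall = correct / len(gold_chunks) if gold_chunks else 0.0
    precision = correct / len(predicted_chunks) if predicted_chunks else 0.0
    f1 = 2 * recall * precision / (recall + precision) if recall + precision else 0.0
    return recall, precision, f1
```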

26 Note  The experiments use a different representation  [Figure: the sentence "Mr. Brown blamed Mr. Bob for making things dirty" shown with its NP BIO tags and with an equivalent open/close bracket representation: [Mr. Brown] blamed [Mr. Bob] for making [things] dirty]

27 Outline: Shallow Parsing (What it is, Why we need it); Shallow Parsing (Learning) Models (Learning Sequences); Hidden Markov Model (HMM); Discriminative Approaches (HMM with Classifiers, PMM); Learning and Inference (CSCL); Related Approaches

28 Problems with HMM  Long-term dependencies are hard to incorporate  An HMM is trained to maximize the likelihood of the data, not to maximize the quality of its predictions: it maximizes P(S,O), not P(S|O)  What metric do we care about?

29 Discriminative Approaches  Model the predictions directly  Shallow parsing goal: predict the BIO sequence given a sentence, S* = argmax P(S|O)  Generative approaches model P(S,O); discriminative approaches model P(S|O) directly

30 Discriminative Approaches  Use classifiers to predict the labels based on the context of the inputs  Good: a larger context can be incorporated  [Figure: a classifier assigning a B/I/O label to each word of "Mr. Brown blamed Mr. Bob for making things dirty"]
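
One way to picture the classifier's input is a window of surrounding words (in practice also POS tags and other features); the feature names below are made up for illustration:

```python
# Sketch of window-based features for the word at position i (illustrative feature names).
def window_features(words, i, size=2):
    feats = {}
    for offset in range(-size, size + 1):
        j = i + offset
        token = words[j] if 0 <= j < len(words) else "<PAD>"
        feats["word[%+d]=%s" % (offset, token)] = 1.0
    return feats

words = ["Mr.", "Brown", "blamed", "Mr.", "Bob", "for", "making", "things", "dirty"]
print(sorted(window_features(words, 2)))   # features for "blamed"
```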

31 Problems with Using Classifiers  Outputs may be inconsistent  [Figure: per-word classifier predictions for the same sentence in which an 'I' follows an 'O', violating the BIO convention]

32 Constraints  An 'I' cannot follow an 'O'  The sequence has to begin with a 'B' or an 'O'  How do we maintain the constraints?  [Figure: transition diagram over the states B, I, O]
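
The two constraints can be written down directly as a check over a plain B/I/O sequence (a minimal sketch):

```python
# Validity check for the constraints on slide 32 (plain B/I/O tags, no chunk types).
def is_valid_bio(tags):
    if tags and tags[0] == "I":          # must start with B or O
        return False
    for prev, cur in zip(tags, tags[1:]):
        if prev == "O" and cur == "I":   # an I cannot follow an O
            return False
    return True

print(is_valid_bio(["B", "I", "O", "B", "I", "O", "O", "B", "O"]))   # True
print(is_valid_bio(["B", "I", "O", "I", "I", "O", "O", "B", "O"]))   # False: I follows O
```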

33 HMM Revisited  An HMM does not have this problem  These constraints are taken care of by the transition probabilities  Specifically, zero transition probabilities ensure that outputs always satisfy the constraints

34 HMM with Classifiers  An HMM requires the observation probability P(o_t | s_t)  Classifiers give us P(s_t | o_t)  We can compute P(o_t | s_t) by Bayes' rule: P(o_t | s_t) = P(s_t | o_t) P(o_t) / P(s_t)
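
Since P(o_t) is the same for every state at a fixed position, only the ratio P(s_t | o_t) / P(s_t) matters when comparing states; a minimal sketch of that conversion with toy numbers:

```python
# Convert classifier outputs P(s|o) into scores proportional to the HMM observation term P(o|s).
def observation_score(p_state_given_obs, p_state):
    """Return P(s|o) / P(s), proportional to P(o|s); the constant P(o) is dropped
    because it does not affect the comparison between states at one position."""
    return {state: p_state_given_obs[state] / p_state[state] for state in p_state_given_obs}

# e.g. classifier output for one word, and a state prior estimated from training data (toy values)
print(observation_score({"B": 0.7, "I": 0.2, "O": 0.1}, {"B": 0.25, "I": 0.15, "O": 0.6}))
```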

35 Experimental Results  NP prediction:

                                               Recall   Precision   F1
HMM                                            89.08    86.62       87.83
HMM with Classifiers                           91.49    91.53       91.51
HMM with Classifiers (with lexical features)   93.85    93.46       93.65

36 Projection-based Markov Model (PMM)  Classifiers use the previous prediction as part of their input  [Figure: a classifier assigning a B/I/O label to each word of the sentence, now also conditioned on the previous label]

37 PMM  Each classifier outputs P(s_t | s_{t-1}, O)  How do you train/test in this case?  $\arg\max_{s_k, s_{k-1}, \ldots, s_1} P(s_k, s_{k-1}, \ldots, s_1 \mid O)$
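
At test time the same Viterbi-style dynamic program applies, now over the classifier's conditional scores; a sketch assuming a stand-in classifier_prob(prev, state, t, obs) function (not the tutorial's code):

```python
import math

# Decoding sketch for a PMM-style model: each step is scored by a classifier that
# conditions on the previous state and the observations.
def pmm_decode(obs, states, classifier_prob, floor=1e-6):
    logp = lambda x: math.log(max(x, floor))
    best = [{s: logp(classifier_prob(None, s, 0, obs)) for s in states}]  # prev=None for P(s_1 | O)
    back = [{}]
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda r: best[t - 1][r] + logp(classifier_prob(r, s, t, obs)))
            back[t][s] = prev
            best[t][s] = best[t - 1][prev] + logp(classifier_prob(prev, s, t, obs))
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))
```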

38 Experimental Results  NP prediction:

                               Recall   Precision   F1
HMM (POS)                      89.08    86.62       87.83
HMM with Classifiers (POS)     91.49    91.53       91.51
HMM with Classifiers           93.85    93.46       93.65
PMM (POS)                      92.04    91.77       91.90
PMM                            94.28    94.02       94.15

39 Outline: Shallow Parsing (What it is, Why we need it); Shallow Parsing (Learning) Models (Learning Sequences); Hidden Markov Model (HMM); Discriminative Approaches (HMM with Classifiers, PMM); Learning and Inference (CSCL); Related Approaches

40 Learning and Inference  Learning: estimate parameters from data and make predictions for new data  Inference: ensure that the outputs satisfy the constraints

41 Inference  Assign values to the variables of interest (the outputs; the states) in a way that maximizes your objective function and satisfies the constraints

42 Inference  Boolean constraint satisfaction problem: V – a set of variables; Cost – a function c: V → [0,1]; Clauses – model the constraints; Satisfying assignment – an assignment τ: V → {0,1} that satisfies the clauses  Find the satisfying assignment τ that minimizes the cost $C(\tau) = \sum_{i=1}^{n} \tau(v_i)\, c(v_i)$

43 Inference for Shallow Parsing  Define a variable v_i for each possible chunk  Cost of each chunk: c(v_i) = -P(v_i is a chunk)  For any two overlapping chunks, add the clause (¬v_i ∨ ¬v_j), i.e. they cannot both be selected  Find the solution τ that minimizes the cost $C(\tau) = \sum_{i=1}^{n} \tau(v_i)\, c(v_i)$  This is a good objective function: it maximizes the expected number of correct chunks  CSP in general is hard  The structure of the constraints yields a problem that can be solved by a shortest-path algorithm
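
Because candidate chunks are intervals over word positions, the minimum-cost consistent selection can be computed left to right; a simplified sketch of that idea (made-up costs, not the tutorial's shortest-path implementation):

```python
# Choose a set of non-overlapping candidate chunks that minimizes total cost
# (cost = -probability), via a simple left-to-right dynamic program over word positions.
def select_chunks(num_words, candidates):
    """candidates: list of (start, end, cost) with end exclusive and cost = -probability."""
    # best[i]: (lowest achievable cost using words 0..i-1, chosen chunks)
    best = [(0.0, [])] * (num_words + 1)
    for i in range(1, num_words + 1):
        best[i] = (best[i - 1][0], best[i - 1][1])          # word i-1 left outside any chunk
        for start, end, cost in candidates:
            if end == i and best[start][0] + cost < best[i][0]:
                best[i] = (best[start][0] + cost, best[start][1] + [(start, end)])
    return best[num_words][1]

# toy example with made-up costs over a 7-word sentence
print(select_chunks(7, [(0, 2, -0.66), (2, 3, -0.19), (3, 5, -0.77), (5, 7, -0.44), (1, 4, -0.48)]))
```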

44 Inference for Shallow Parsing  [Figure: shortest-path graph over the words o_1 ... o_7, with open/close bracket decisions as nodes and candidate chunks weighted by costs such as -0.48, -0.19, -0.66, -0.56, -0.77, -0.36, -0.44, -0.22]

45 Experimental Results  NP prediction:

                               Recall   Precision   F1
HMM (POS)                      89.08    86.62       87.83
HMM with Classifiers (POS)     91.49    91.53       91.51
HMM with Classifiers           93.85    93.46       93.65
PMM                            94.28    94.02       94.15
CSCL                           94.12    93.45       93.78

46 Other Approaches to Shallow Parsing  MEMM  CRF  Global Perceptron (Collins')  Others (e.g., global SVM)

47 Maximum Entropy Markov Model (MEMM)  Similar to PMM  Each term is learned with a maximum entropy model  [Figure: the per-word classifier over "Mr. Brown blamed Mr. Bob for making things dirty"]
$\arg\max_{s_k, \ldots, s_1} P(s_k, s_{k-1}, \ldots, s_1 \mid O) = \arg\max_{s_k, \ldots, s_1} \Big[\prod_{t=2}^{k} P(s_t \mid s_{t-1}, O)\Big] \cdot P(s_1 \mid O)$

48 Conditional Random Field (CRF)  Similar to PMM, but based on Markov random field theory (undirected graphical model)
$\arg\max_{s_k, \ldots, s_1} P(s_k, s_{k-1}, \ldots, s_1 \mid O) = \arg\max_{s_k, \ldots, s_1} \frac{1}{Z(O)} \exp\Big(\sum_i w_i f_i(s_k, s_{k-1}, O)\Big)$

49 49 Collins’ Perceptron Training for HMM  Similar to HMM, but use Perceptron algorithm to learn argmax s k ;s k¡ 1 ;:::;s 1 P ( s k ;s k¡ 1 ;:::;s 1 j O ) =argmax s k ;s k¡ 1 ;:::;s 1 P ( o k j s k ) ¢ [ k¡ 1 Y t =1 P ( s t +1 j s t ) ¢ P ( o t j s t )] ¢ P ( s 1 ) =argmax s k ;s k¡ 1 ;:::;s 1 log ( P ( o k j s k ) ¢ [ k¡ 1 Y t =1 P ( s t +1 j s t ) ¢ P ( o t j s t )] ¢ P ( s 1 )) =argmax s k ;s k¡ 1 ;:::;s 1 log ( P ( o k j s k ))+[ k¡ 1 X t =1 log ( P ( s t +1 j s t ))+ log ( P ( o t j s t ))]+ log ( P ( s 1 )) =argmax s k ;s k¡ 1 ;:::;s 1 w i f i ( o k ;s k )+[ k¡ 1 X t =1 u t g t ( s t +1 ;s t )+ w t f t ( o t ;s t )]+ u 1 g 1 ( s 1 )

50 Maximum Entropy Markov Model (MEMM)
$P(s_k \mid s_{k-1}, O) = \frac{1}{Z(s_{k-1}, O)} \exp\Big(\sum_i w_i f_i(s_k, s_{k-1}, O)\Big)$
$\arg\max_{s_k, s_{k-1}, \ldots, s_1} P(s_k, s_{k-1}, \ldots, s_1 \mid O)$
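
A minimal sketch of this locally normalized log-linear step, with an assumed feature_vector helper standing in for a real feature map:

```python
import math

# One MEMM step: a maximum-entropy distribution over the next state, normalized by Z(s_{t-1}, O).
def memm_step(weights, feature_vector, prev_state, obs, t, states):
    """Return P(s_t | s_{t-1}, O) as a dict over states."""
    scores = {s: sum(weights.get(f, 0.0) * v
                     for f, v in feature_vector(s, prev_state, obs, t).items())
              for s in states}
    z = sum(math.exp(score) for score in scores.values())
    return {s: math.exp(score) / z for s, score in scores.items()}
```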

51 Maximum Entropy Markov Model (MEMM)  $\arg\max_{s_k, s_{k-1}, \ldots, s_1} P(s_k, s_{k-1}, \ldots, s_1 \mid O)$

52 Conditional Random Field (CRF)  $\arg\max_{s_k, s_{k-1}, \ldots, s_1} P(s_k, s_{k-1}, \ldots, s_1 \mid O)$

53 Conclusions  All approaches use a linear representation  The differences are: the features; how the weights are learned  Training paradigms: Global training (HMM, CRF, Global Perceptron); Modular training (PMM, MEMM, CSCL)  The modular approaches are easier to train, but may require additional mechanisms to enforce global constraints

