1 1 Sequence Learning Sudeshna Sarkar 14 Aug 2008

2 2 Alternative graphical models for part of speech tagging

3 3 Different Models for POS Tagging: HMM, Maximum Entropy Markov Models (MEMMs), Conditional Random Fields (CRFs)

4 4 Hidden Markov Model (HMM): Generative Modeling. Source model P(Y); noisy channel P(X|Y). The label sequence y is drawn from the source model and generates the observation sequence x through the channel.
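
As a concrete reading of this factorization, here is a minimal sketch that scores a tag/word sequence pair as P(x, y) = ∏_k P(y_k | y_k-1) · P(x_k | y_k). The toy tags, words, and probability tables are assumptions for illustration, not values from the slides.

```python
# Minimal HMM sketch: joint probability p(x, y) = prod_k p(y_k | y_{k-1}) * p(x_k | y_k),
# with hypothetical toy tables for a two-tag POS example.

transition = {                     # p(tag | previous tag); "<s>" is a start symbol
    ("<s>", "DT"): 0.6, ("<s>", "NN"): 0.4,
    ("DT", "NN"): 0.9,  ("DT", "DT"): 0.1,
    ("NN", "NN"): 0.3,  ("NN", "DT"): 0.7,
}
emission = {                       # p(word | tag)
    ("DT", "the"): 0.7, ("DT", "a"): 0.3,
    ("NN", "dog"): 0.5, ("NN", "cat"): 0.5,
}

def hmm_joint(words, tags):
    """Joint probability p(x, y) of a word sequence and a tag sequence."""
    p, prev = 1.0, "<s>"
    for w, t in zip(words, tags):
        p *= transition.get((prev, t), 0.0) * emission.get((t, w), 0.0)
        prev = t
    return p

print(hmm_joint(["the", "dog"], ["DT", "NN"]))   # 0.6 * 0.7 * 0.9 * 0.5 = 0.189
```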

5 5 Dependency (1st order)

6 6 Disadvantages of HMMs (1): no rich feature information. Rich information is required when x_k is complex or when data for x_k is sparse. Example: POS tagging. How do we evaluate P(w_k|t_k) for unknown words w_k? Useful features: suffix (e.g., -ed, -tion, -ing) and capitalization. Generative model: parameter estimation maximizes the joint likelihood of training examples.

7 7 Generative Models Hidden Markov models (HMMs) and stochastic grammars assign a joint probability to paired observation and label sequences. The parameters are typically trained to maximize the joint likelihood of training examples.

8 8 Generative Models (cont’d) Difficulties and disadvantages: need to enumerate all possible observation sequences; not practical to represent multiple interacting features or long-range dependencies of the observations; very strict independence assumptions on the observations.

9 9 Making use of rich domain features A learning algorithm is as good as its features. There are many useful features to include in a model, and most of them aren't independent of each other: identity of word; ends in "-shire"; is capitalized; is head of noun phrase; is in a list of city names; is under node X in WordNet; word to left is verb; word to left is lowercase; is in bold font; is in hyperlink anchor; other occurrences in doc; …

10 10 Problems with Richer Representation and a Generative Model These arbitrary features are not independent: overlapping and long-distance dependencies; multiple levels of granularity (words, characters); multiple modalities (words, formatting, layout); observations from past and future. HMMs are generative models of the text, and generative models do not easily handle these non-independent features. Two choices: (1) model the dependencies, so each state would have its own Bayes net, but we are already starved for training data; or (2) ignore the dependencies, which causes "over-counting" of evidence (as in naïve Bayes), a big problem when combining evidence, as in Viterbi!

11 11 Discriminative Models We would prefer a conditional model: P(y|x) instead of P(y,x): Can examine features, but not responsible for generating them. Don’t have to explicitly model their dependencies. Don’t “waste modeling effort” trying to generate what we are given at test time anyway. Provide the ability to handle many arbitrary features.

12 12 Locally Normalized Conditional Sequence Model [Figure: graphical models contrasting the generative (traditional HMM) structure with the conditional structure, each showing transitions among states S_t-1, S_t, S_t+1 and their observations O_t-1, O_t, O_t+1.] Standard belief propagation: forward-backward procedure; Viterbi and Baum-Welch follow naturally. Examples: Maximum Entropy Markov Models [McCallum, Freitag & Pereira, 2000]; MaxEnt POS Tagger [Ratnaparkhi, 1996]; SNoW-based Markov Model [Punyakanok & Roth, 2000].

13 13 Locally Normalized Conditional Sequence Model [Figure: as on the previous slide, but in the conditional model each state S_t conditions on the entire observation sequence, or, more generally, on arbitrary functions of it.] Standard belief propagation: forward-backward procedure; Viterbi and Baum-Welch follow naturally. Examples: Maximum Entropy Markov Models [McCallum, Freitag & Pereira, 2000]; MaxEnt POS Tagger [Ratnaparkhi, 1996]; SNoW-based Markov Model [Punyakanok & Roth, 2000].

14 14 Exponential Form for the "Next State" Function Overall recipe: labeled data is assigned to transitions; train each state's exponential model by maximum likelihood (iterative scaling or conjugate gradient). [Figure: the next-state function for previous state s_t-1 as a black-box classifier, an exponential (weight × feature) model.]
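
A minimal sketch of the per-state exponential model this recipe trains; the states, features, and weights below are assumptions for illustration, not the slide's model. Each source state gets its own locally normalized ("softmax") distribution over next states.

```python
import math

# Hypothetical per-state exponential ("next state") model: for a fixed previous
# state s_prev, P_{s_prev}(s | obs) = exp(sum_k w_k f_k(obs, s_prev, s)) / Z(obs, s_prev).
# States, features, and weights are illustrative assumptions.

STATES = ["Per", "Loc", "O"]

def features(obs, s_prev, s):
    """Binary features of (observation, previous state, candidate next state)."""
    return {
        ("is_capitalized", s): float(obs[:1].isupper()),
        ("ends_in_shire", s): float(obs.endswith("shire")),
        ("prev", s_prev, s): 1.0,               # transition indicator
    }

def next_state_dist(weights, obs, s_prev):
    scores = {s: sum(weights.get(k, 0.0) * v
                     for k, v in features(obs, s_prev, s).items())
              for s in STATES}
    z = sum(math.exp(v) for v in scores.values())    # per-state normalizer
    return {s: math.exp(v) / z for s, v in scores.items()}

weights = {("is_capitalized", "Loc"): 1.5, ("ends_in_shire", "Loc"): 2.0}
print(next_state_dist(weights, "Yorkshire", "O"))
```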

15 15 Principle of Maximum Entropy The correct distribution P(s,o) is the one that maximizes entropy, or "uncertainty", subject to constraints. Constraints represent evidence. Given k features, the constraints state that the model's expectation for each feature must match the observed (empirical) expectation. Philosophy: in making inferences on the basis of partial information, we should use the distribution with maximum entropy subject to what is known; anything else would amount to an arbitrary assumption of information that we do not have.
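
Written out in standard notation (a restatement of this slide, with \tilde{p} denoting the empirical distribution over the training data), the constrained problem is:

```latex
\max_{p}\; H(p) = -\sum_{x,y} \tilde{p}(x)\, p(y \mid x)\, \log p(y \mid x)
\quad \text{subject to} \quad
\sum_{x,y} \tilde{p}(x)\, p(y \mid x)\, f_i(x,y) \;=\; \sum_{x,y} \tilde{p}(x,y)\, f_i(x,y),
\;\; i = 1,\dots,k,
\qquad \text{and} \qquad \sum_{y} p(y \mid x) = 1 \;\; \text{for all } x.
```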

16 16 Maximum Entropy Classifier Conditional model p(y|x); does not try to model p(x). Can work with complicated input features since we do not need to model dependencies between them. Principle of maximum entropy: we want a classifier that matches the feature constraints from the training data and, among those, makes predictions that maximize entropy. There is a unique exponential-family distribution that meets these criteria. Maximum entropy classifier p(y|x; θ): inference and learning.
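
The unique exponential-family distribution referred to here has the standard log-linear form, with one weight λ_i per feature and an input-dependent normalizer Z_λ(x):

```latex
p_\lambda(y \mid x) = \frac{1}{Z_\lambda(x)} \exp\Big(\sum_i \lambda_i f_i(x, y)\Big),
\qquad
Z_\lambda(x) = \sum_{y'} \exp\Big(\sum_i \lambda_i f_i(x, y')\Big).
```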

17 17 Indicator Features Feature functions f(x,y): f1(w,y) = {word is "Sarani" and y = Location}; f2(w,y) = {previous tag = Per-begin, current word suffix = "an", and y = Per-end}.
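
The two indicator features above can be written directly as 0/1-valued functions. The dict-based context representation below is an assumption for illustration:

```python
# The slide's two indicator features as Python predicates (0/1 valued),
# assuming a simple dict-based context for each token.

def f1(context, y):
    # Fires when the current word is "Sarani" and the label is Location.
    return int(context["word"] == "Sarani" and y == "Location")

def f2(context, y):
    # Fires when the previous tag is Per-begin, the current word ends in "an",
    # and the label is Per-end.
    return int(context["prev_tag"] == "Per-begin"
               and context["word"].endswith("an")
               and y == "Per-end")

print(f1({"word": "Sarani", "prev_tag": "O"}, "Location"))         # 1
print(f2({"word": "Srijan", "prev_tag": "Per-begin"}, "Per-end"))  # 1 (hypothetical example)
```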

18 18 Problems with MaxEnt classifier It makes decisions at each point independently

19 19 MEMM Use a series of maximum entropy classifiers that know the previous label. Define a Viterbi-style model of inference: P(y|x) = ∏_t P_{y_t-1}(y_t | x). Find the most likely label sequence given an input sequence, and learn the classifiers from labeled data. Combines the advantages of HMMs and maximum entropy. But there is a problem.
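
A minimal Viterbi sketch for this product of locally normalized classifiers. It assumes a hypothetical `next_state_dist(prev_label, t, x)` function (roughly the earlier MEMM sketch with a position argument added) that returns P_{prev}(y_t | x) as a dict with nonzero probabilities:

```python
import math

def viterbi(x, labels, next_state_dist, start="<s>"):
    """Most likely label sequence under P(y|x) = prod_t P_{y_{t-1}}(y_t | x)."""
    n = len(x)
    best = [{} for _ in range(n)]          # best[t][y] = (log prob, back pointer)
    for y in labels:
        best[0][y] = (math.log(next_state_dist(start, 0, x)[y]), None)
    for t in range(1, n):
        for y in labels:
            best[t][y] = max(
                (best[t - 1][yp][0] + math.log(next_state_dist(yp, t, x)[y]), yp)
                for yp in labels)
    # Trace back the highest-scoring path.
    y, _ = max(best[n - 1].items(), key=lambda kv: kv[1][0])
    path = [y]
    for t in range(n - 1, 0, -1):
        y = best[t][y][1]
        path.append(y)
    return list(reversed(path))
```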

20 20 Maximum Entropy Markov Model Label bias problem: the transition probabilities leaving any given state must sum to one.

21 21 Example of the label bias problem: in some state-space configurations, MEMMs essentially ignore the inputs. This is not a problem for HMMs, because there the input is generated by the model.

22 22 Label Bias Example [Figure: the rib/rob state machine with edge probabilities P=0.75, P=0.25, and P=1.] Given: "rib" 3 times, "rob" 1 time. Training: p(1 | 0, "r") = 0.75, p(4 | 0, "r") = 0.25. Inference:
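
A short worked reading of these numbers, assuming the standard rib/rob automaton (0→1→2→3 spells r-i-b, 0→4→5→3 spells r-o-b): states 1 and 4 each have a single outgoing transition, so per-state normalization forces p(2 | 1, i) = p(2 | 1, o) = 1 and p(5 | 4, i) = p(5 | 4, o) = 1. Decoding "rob" therefore gives the rib path a score of 0.75 · 1 · 1 = 0.75 and the rob path a score of 0.25 · 1 · 1 = 0.25, so the model labels "rob" as if it were "rib" even though it has observed the "o".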

23 23 Conditional Markov Models (CMMs), a.k.a. MEMMs, a.k.a. MaxEnt taggers, vs. HMMs [Figure: the two graphical models over states S_t-1, S_t, S_t+1 and observations O_t-1, O_t, O_t+1.]

24 24 Random Field

25 25 CRF CRFs have all the advantages of MEMMs without the label bias problem. An MEMM uses a per-state exponential model for the conditional probabilities of next states given the current state; a CRF has a single exponential model for the joint probability of the entire label sequence given the observation sequence. Undirected acyclic graph. Allows some transitions to "vote" more strongly than others, depending on the corresponding observations. [Figure: a chain of labels y1, y2, y3 all connected to the observation sequence X.]

26 26 Graphical comparison among HMMs, MEMMs and CRFs [Figure: the three graphical structures, HMM, MEMM, and CRF.]

27 27 Machine Learning a Panacea? A machine learning method is as good as the feature set it uses Shift focus from linguistic processing to feature set design

28 28 Features to use in IE Features are task dependent Good feature identification needs a good knowledge of the domain combined with automatic methods of feature selection.

29 29 Feature Examples Extraction of proteins and their interactions from biomedical literature (Mooney). For each token, they take the following as features: current token; last 2 tokens and next 2 tokens; output of a dictionary-based tagger for these 5 tokens; suffixes for each of the 5 tokens (last 1, 2, and 3 characters); class labels of the last 2 tokens. Example sentence: "Two potentially oncogenic cyclins, cyclin A and cyclin D1, share common properties of subunit configuration, tyrosine phosphorylation and physical association with the Rb protein." The sketch below illustrates part of this extraction.
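
A simplified sketch of this kind of per-token feature extraction, covering only the window tokens and the character suffixes (the dictionary-tagger outputs and label-history features are omitted, and all names are assumptions):

```python
# Hypothetical per-token feature extractor: current token, a +/-2 token
# window, and suffixes of length 1-3 for the current token.

def token_features(tokens, i):
    feats = {"w0": tokens[i]}
    for off in (-2, -1, 1, 2):                 # last 2 and next 2 tokens
        j = i + off
        feats[f"w{off:+d}"] = tokens[j] if 0 <= j < len(tokens) else "<pad>"
    for n in (1, 2, 3):                        # character suffixes
        feats[f"suf{n}"] = tokens[i][-n:]
    return feats

sent = "cyclin A and cyclin D1 share common properties".split()
print(token_features(sent, 3))                 # features for the second "cyclin"
```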

30 30 More Feature Examples Example word features: identity of word; is in all caps; ends in "-ski"; is part of a noun phrase; is in a list of city names; is under node X in WordNet or Cyc; is in bold font; is in hyperlink anchor. Features of past and future: last person name was female; next two words are "and Associates". Line, sentence, or paragraph features: length; is centered in page; percent of non-alphabetics; white-space aligns with next line; containing sentence has two verbs; grammatically contains a question; contains links to "authoritative" pages; emissions that are uncountable; features at multiple levels of granularity.

31 31 Indicator Features They're a little different from the typical supervised ML approach. Limited to binary values: think of a feature as being on or off rather than as a feature with a value. Feature values are relative to an object/class pair rather than being a function of the object alone. Typically there are lots and lots of features (hundreds of thousands of features is quite common).

32 32 Feature Templates "Next word" is a feature template: it gives rise to |V| x |T| binary features (one per vocabulary word and tag), which raises the curse of dimensionality and the risk of overfitting. A small sketch of the expansion follows.
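
A sketch of that expansion, using an assumed toy vocabulary V and tag set T; each (word, tag) pair from the template becomes one binary indicator feature:

```python
# One "next word" feature template instantiated over V x T gives
# |V| * |T| binary indicator features.

V = ["the", "dog", "runs"]           # toy vocabulary (assumed)
T = ["DT", "NN", "VBZ"]              # toy tag set (assumed)

def next_word_template(v, t):
    """One indicator feature: fires if the next word is v and the tag is t."""
    return lambda context, y: int(context.get("next_word") == v and y == t)

features = {(v, t): next_word_template(v, t) for v in V for t in T}
print(len(features))                                        # |V| * |T| = 9
print(features[("dog", "NN")]({"next_word": "dog"}, "NN"))  # 1
```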

33 33 Feature Selection vs Extraction Feature selection: choosing k < d important features, ignoring the remaining d - k (subset selection algorithms). Feature extraction: project the original dimensions x_i, i = 1,...,d, to new k < d dimensions z_j, j = 1,...,k (principal components analysis (PCA), linear discriminant analysis (LDA), factor analysis (FA)). A PCA sketch follows.
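
A minimal sketch of the feature-extraction option via PCA, on toy random data (not tied to any task above): center the data, take the top k right singular vectors, and project.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))      # 100 examples, d = 5 original features
k = 2                              # target dimensionality, k < d

Xc = X - X.mean(axis=0)                          # center the data
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt[:k].T                                # new k-dimensional features z
print(Z.shape)                                   # (100, 2)
```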

34 34 Feature Reduction Example domain: NER in Hindi (Sujan Saha). Feature value selection; feature value clustering. Sujan Kumar Saha, Pabitra Mitra, Sudeshna Sarkar, "Word Clustering and Word Selection Based Feature Reduction for MaxEnt Based Hindi NER", ACL 2008.

35 35

36 36 Better Approach A discriminative model estimates P(y|x) directly; maximize the conditional likelihood of training examples.

37 37 Maximum Entropy modeling N-gram model: probabilities depend on the previous few tokens. We may identify a more heterogeneous set of features that contribute in some way to the choice of the current word (whether it is the first word in a story, whether the next word is "to", whether one of the last 5 words is a preposition, etc.). Maxent combines these features in a probabilistic model. The given features provide constraints on the model. We would like a probability distribution which, outside of these constraints, is as uniform as possible, i.e., has the maximum entropy among all models that satisfy the constraints.

38 38 Maximum Entropy Markov Model Discriminative sub-models: unify the two parameter sets of the generative model (the source-model parameters and the noisy-channel parameters) into one conditional model, and employ the maximum entropy principle.

39 39 General Maximum Entropy Principle Model: model the distribution P(Y|X) with a set of features {f_1, f_2, ..., f_l} defined on X and Y. Idea: collect information about the features from the training data. Principle: model what is known and assume nothing else, i.e., choose the flattest distribution, the one with maximum entropy.

40 40 Example (Berger et al., 1996): model the translation of the word "in" from English to French. Need to model P(French word). Constraints: (1) the possible translations are dans, en, à, au cours de, pendant; (2) "dans" or "en" is used 30% of the time; (3) "dans" or "à" is used 50% of the time.

41 41 Features: 0-1 indicator functions, equal to 1 if (x, y) satisfies a predefined condition and 0 if not. Example: POS tagging.

42 42 Constraints Empirical information: feature statistics computed from the training data T. Constraints: the expected value of each feature under the distribution P(Y|X) we want to model must match its empirical value.

43 43 Maximum Entropy: Objective The entropy maximization problem: maximize the conditional entropy H(p) subject to the feature constraints above.

44 44 Dual Problem The dual is a conditional exponential model fit by maximum likelihood on the conditional data. Solution methods: improved iterative scaling (IIS) (Berger et al., 1996); generalized iterative scaling (GIS) (Darroch & Ratcliff, 1972; used for MEMMs by McCallum et al., 2000).
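
For reference, the quantity that both iterative scaling and gradient-based training drive to zero is the gradient of this conditional log-likelihood, the gap between empirical and model feature expectations:

```latex
\frac{\partial L(\lambda)}{\partial \lambda_i}
  = \sum_{x,y} \tilde{p}(x,y)\, f_i(x,y)
  \;-\; \sum_{x} \tilde{p}(x) \sum_{y} p_\lambda(y \mid x)\, f_i(x,y)
  \;=\; E_{\tilde{p}}[f_i] - E_{p_\lambda}[f_i].
```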

45 45 Maximum Entropy Markov Model Use the maximum entropy approach to model the 1st-order dependency. Features: basic features (analogous to the parameters of an HMM), i.e., bigram (1st-order) or trigram (2nd-order) features from the source model, and state-output pair features (X_k = x_k, Y_k = y_k). Advantage: can incorporate other advanced features of (x_k, y_k).

46 46 HMM vs MEMM (1st order) [Figure: the HMM and the Maximum Entropy Markov Model (MEMM) dependency structures.]

47 47 Performance in POS Tagging Data set: WSJ. Features: HMM features plus spelling features (e.g., -ed, -tion, -s, -ing). Results (Lafferty et al., 2001): 1st-order HMM, 94.31% accuracy, 54.01% OOV accuracy; 1st-order MEMM, 95.19% accuracy, 73.01% OOV accuracy.

48 48 ME applications Part of Speech (POS) Tagging (Ratnaparkhi, 1996) P(POS tag | context) Information sources –Word window (4) –Word features (prefix, suffix, capitalization) –Previous POS tags

49 49 ME applications Abbreviation expansion (Pakhomov, 2002) Information sources –Word window (4) –Document title Word Sense Disambiguation (WSD) (Chao & Dyer, 2002) Information sources –Word window (4) –Structurally related words (4) Sentence Boundary Detection (Reynar & Ratnaparkhi, 1997) Information sources –Token features (prefix, suffix, capitalization, abbreviation) –Word window (2)

50 50 Solution Global Optimization Optimize parameters in a global model simultaneously, not in sub models separately Alternatives Conditional random fields Application of perceptron algorithm

51 51 Why ME? Advantages Combine multiple knowledge sources –Local  Word prefix, suffix, capitalization (POS - (Ratnaparkhi, 1996) )  Word POS, POS class, suffix (WSD - (Chao & Dyer, 2002) )  Token prefix, suffix, capitalization, abbreviation (Sentence Boundary - (Reynar & Ratnaparkhi, 1997) ) –Global  N-grams (Rosenfeld, 1997)  Word window  Document title (Pakhomov, 2002)  Structurally related words (Chao & Dyer, 2002)  Sentence length, conventional lexicon (Och & Ney, 2002) Combine dependent knowledge sources

52 52 Why ME? Advantages Add additional knowledge sources Implicit smoothing Disadvantages Computational –Expected value at each iteration –Normalizing constant Overfitting –Feature selection  Cutoffs  Basic Feature Selection (Berger et al., 1996)

53 53 Conditional Models Conditional probability P(label sequence y | observation sequence x) rather than joint probability P(y, x) Specify the probability of possible label sequences given an observation sequence Allow arbitrary, non-independent features on the observation sequence X The probability of a transition between labels may depend on past and future observations Relax strong independence assumptions in generative models

54 54 Discriminative Models Maximum Entropy Markov Models (MEMMs) Exponential model Given training set X with label sequence Y: Train a model θ that maximizes P(Y|X, θ) For a new data sequence x, the predicted label y maximizes P(y|x, θ) Notice the per-state normalization

55 55 MEMMs (cont’d) MEMMs have all the advantages of Conditional Models Per-state normalization: all the mass that arrives at a state must be distributed among the possible successor states (“conservation of score mass”) Subject to Label Bias Problem Bias toward states with fewer outgoing transitions

56 56 Label Bias Problem Consider the rib/rob MEMM from the earlier example. P(1 and 2 | ro) = P(2 | 1 and ro) P(1 | ro) = P(2 | 1 and o) P(1 | r). P(1 and 2 | ri) = P(2 | 1 and ri) P(1 | ri) = P(2 | 1 and i) P(1 | r). In the training data, label value 2 is the only label value observed after label value 1, so P(2 | 1) = 1 and hence P(2 | 1 and x) = 1 for all x. It follows that P(1 and 2 | ro) = P(1 and 2 | ri). However, we expect P(1 and 2 | ri) to be greater than P(1 and 2 | ro); per-state normalization does not allow the required expectation.

57 57 Solving the Label Bias Problem Change the state-transition structure of the model (not always practical to change the set of states). Start with a fully connected model and let the training procedure figure out a good structure; but this precludes the use of prior structural knowledge, which is very valuable (e.g., in information extraction).

58 58 Random Field

59 59 Conditional Random Fields (CRFs) CRFs have all the advantages of MEMMs without the label bias problem. An MEMM uses a per-state exponential model for the conditional probabilities of next states given the current state; a CRF has a single exponential model for the joint probability of the entire label sequence given the observation sequence. Undirected acyclic graph. Allows some transitions to "vote" more strongly than others, depending on the corresponding observations.

60 60 Definition of CRFs X is a random variable over data sequences to be labeled Y is a random variable over corresponding label sequences

61 61 Example of CRFs

62 62 Graphical comparison among HMMs, MEMMs and CRFs [Figure: the three graphical structures, HMM, MEMM, and CRF.]

63 63 Conditional Distribution x is a data sequence; y is a label sequence; v is a vertex from the vertex set V (the set of label random variables); e is an edge from the edge set E over V. The f_k and g_k are given and fixed: g_k is a Boolean vertex feature and f_k is a Boolean edge feature, with k the number of features. The λ_k and μ_k are the parameters to be estimated. y|_e is the set of components of y defined by edge e; y|_v is the set of components of y defined by vertex v. If the graph G = (V, E) of Y is a tree, then by the fundamental theorem of random fields the conditional distribution over the label sequence Y = y, given X = x, is:
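
The formula this slide refers to, in its standard form from Lafferty, McCallum & Pereira (2001) and in the slide's own f_k/g_k notation, is:

```latex
p_\theta(y \mid x) \;\propto\;
\exp\!\Big(
  \sum_{e \in E,\; k} \lambda_k\, f_k\big(e,\; y|_e,\; x\big)
  \;+\;
  \sum_{v \in V,\; k} \mu_k\, g_k\big(v,\; y|_v,\; x\big)
\Big).
```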

64 64 Conditional Distribution (cont’d) CRFs use the observation-dependent normalization Z(x) for the conditional distributions: Z(x) is a normalization over the data sequence x

65 65 Parameter Estimation for CRFs The paper provided iterative scaling algorithms, which turn out to be very inefficient. Prof. Dietterich's group applied a gradient descent algorithm, which is quite efficient.

66 66 Training of CRFs (From Prof. Dietterich) First, we take the log of the equation; then we take the derivative. For training, the first two terms are easy to get: for example, for each k, f_k evaluates to a sequence of Boolean values over the training positions, such as 00101110100111, and its empirical count is just the total number of 1's in the sequence. The hardest part is how to calculate Z(x).
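
As a sketch of that last step for a linear-chain CRF: Z(x) can be computed with the forward algorithm in O(T·L²) time. This is a generic log-space implementation under an assumed potential layout (log_psi[t][i, j] scores label j at position t following label i, with a dummy start label at index 0), not Prof. Dietterich's code.

```python
import numpy as np

def log_partition(log_psi):
    """log Z(x) for log-potentials log_psi of shape (T, L, L)."""
    T, L, _ = log_psi.shape
    alpha = log_psi[0, 0, :]          # start scores: dummy start label has index 0
    for t in range(1, T):
        # logsumexp over the previous label i:
        #   alpha[j] = log sum_i exp(alpha[i] + log_psi[t, i, j])
        alpha = np.logaddexp.reduce(alpha[:, None] + log_psi[t], axis=0)
    return np.logaddexp.reduce(alpha)          # sum over the final label

# Toy example with T = 4 positions and L = 3 labels.
log_psi = np.log(np.random.default_rng(0).uniform(0.5, 1.5, size=(4, 3, 3)))
print(log_partition(log_psi))
```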

67 67 Training of CRFs (From Prof. Dietterich) (cont’d) [Figure: the maximal cliques c1, c2, c3 of a label chain y1, y2, y3, y4.]

68 68 POS tagging Experiments

69 69 POS tagging Experiments (cont’d) Compared HMMs, MEMMs, and CRFs on Penn Treebank POS tagging. Each word in a given input sentence must be labeled with one of 45 syntactic tags. Added a small set of orthographic features: whether a spelling begins with a number or uppercase letter, whether it contains a hyphen, and whether it contains one of the following suffixes: -ing, -ogy, -ed, -s, -ly, -ion, -tion, -ity, -ies. OOV = out-of-vocabulary (not observed in the training set).

70 70 Summary Locally normalized discriminative models such as MEMMs are prone to the label bias problem. CRFs provide the benefits of discriminative models. CRFs solve the label bias problem well and demonstrate good performance.

