Syllabus Text Books Classes Reading Material Assignments Grades Links Forum Text Books
Natural Language Processing - Lecture 5
POS Tagging Algorithms
Ido Dagan
Department of Computer Science, Bar-Ilan University

Supervised Learning Scheme
[Diagram: “Labeled” Examples → Training Algorithm → Classification Model; New Examples → Classification Algorithm → Classifications]

Transformation-Based Learning (TBL) for Tagging
Introduced by Brill (1995)
Can exploit a wider range of lexical and syntactic regularities via transformation rules, each consisting of a triggering environment and a rewrite rule
Tagger:
–Construct an initial tag sequence for the input: the most frequent tag for each word
–Iteratively refine the tag sequence by applying “transformation rules” in rank order
Learner:
–Construct an initial tag sequence for the training corpus
–Loop until done: try all possible rules and compare the results to the known tags; apply the best rule r* to the sequence and add it to the rule ranking

Some examples
1. Change NN to VB if the previous tag is TO
–to/TO conflict/NN with → conflict/VB
2. Change VBP to VB if MD is among the previous three tags
–might/MD vanish/VBP → vanish/VB
3. Change NN to VB if MD is among the previous two tags
–might/MD reply/NN → reply/VB
4. Change VB to NN if DT is among the previous two tags
–the/DT reply/VB → reply/NN

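All four example rules share one shape: change tag A to tag B if tag Z occurs within the previous k tags. A minimal sketch of applying such rules follows. The names `WindowRule`, `apply_rule`, and `tag_with_rules` are hypothetical, and the sketch assumes the triggering environment is checked against the tags as they were before the rule was applied (one of several possible conventions).

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class WindowRule:
    """Hypothetical rule: change from_tag to to_tag if trigger_tag
    occurs among the previous `window` tags."""
    from_tag: str
    to_tag: str
    trigger_tag: str
    window: int


def apply_rule(tags: List[str], rule: WindowRule) -> List[str]:
    """Apply one rule to a tag sequence and return the new sequence.

    Assumption: the context is read from the input tags, so a change
    made at one position does not trigger the same rule further right.
    """
    out = list(tags)
    for i, tag in enumerate(tags):
        context = tags[max(0, i - rule.window):i]
        if tag == rule.from_tag and rule.trigger_tag in context:
            out[i] = rule.to_tag
    return out


def tag_with_rules(initial_tags: List[str], rules: List[WindowRule]) -> List[str]:
    """Refine an initial tagging by applying the rules in rank order."""
    tags = list(initial_tags)
    for rule in rules:
        tags = apply_rule(tags, rule)
    return tags


# Example 1 from the slide: to/TO conflict/NN
print(apply_rule(["TO", "NN"], WindowRule("NN", "VB", "TO", 1)))
```

Setting `window` to 1, 2, or 3 covers the "previous", "previous two", and "previous three" variants in the examples.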
Transformation Templates
Specify which transformations are possible
For example: change tag A to tag B when:
1. The preceding (following) tag is Z
2. The tag two before (after) is Z
3. One of the two previous (following) tags is Z
4. One of the three previous (following) tags is Z
5. The preceding tag is Z and the following is W
6. The preceding (following) tag is Z and the tag two before (after) is W

Lexicalization
New templates that include a dependency on surrounding words (not just tags):
Change tag A to tag B when:
1. The preceding (following) word is w
2. The word two before (after) is w
3. One of the two preceding (following) words is w
4. The current word is w
5. The current word is w and the preceding (following) word is v
6. The current word is w and the preceding (following) tag is X (notice: a word-tag combination)
7. etc…

Initializing Unseen Words
How to choose the most likely tag for unseen words?
Transformation-based approach:
–Start with NP for capitalized words, NN for others
–Learn “morphological” transformations from templates of the form: change tag from X to Y if:
1. Deleting prefix (suffix) x results in a known word
2. The first (last) characters of the word are x
3. Adding x as a prefix (suffix) results in a known word
4. Word W ever appears immediately before (after) the word
5. Character Z appears in the word

TBL Learning Scheme
[Diagram: Unannotated Input Text → Setting Initial State → Annotated Text; Annotated Text and Ground Truth for Input Text → Learning Algorithm → Rules]

Greedy Learning Algorithm
Initial tagging of the training corpus: most frequent tag per word
At each iteration:
–Identify rules that fix errors and compute the “error reduction” for each transformation rule: #errors fixed - #errors introduced
–Find the best rule; if its error reduction is greater than a threshold (to avoid overfitting):
  Apply the best rule to the training corpus
  Append the best rule to the ordered list of transformations

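The greedy loop can be sketched as below. This is an illustration under simplifying assumptions, not Brill's implementation: it uses a single template ("the preceding tag is Z"), treats the training corpus as one flat tag sequence, and only proposes rules that would fix at least one current error. `Rule`, `apply_rule`, `error_reduction`, and `learn` are hypothetical names.

```python
from typing import List, NamedTuple, Set


class Rule(NamedTuple):
    """Hypothetical rule: change from_tag to to_tag if the preceding tag is prev_tag."""
    from_tag: str
    to_tag: str
    prev_tag: str


def apply_rule(tags: List[str], rule: Rule) -> List[str]:
    """Apply the rule everywhere its triggering environment matches."""
    out = list(tags)
    for i in range(1, len(tags)):
        if tags[i] == rule.from_tag and tags[i - 1] == rule.prev_tag:
            out[i] = rule.to_tag
    return out


def error_reduction(current: List[str], gold: List[str], rule: Rule) -> int:
    """#errors fixed - #errors introduced if the rule were applied."""
    new = apply_rule(current, rule)
    fixed = sum(1 for c, n, g in zip(current, new, gold) if c != g and n == g)
    introduced = sum(1 for c, n, g in zip(current, new, gold) if c == g and n != g)
    return fixed - introduced


def candidate_rules(current: List[str], gold: List[str]) -> Set[Rule]:
    """Instantiate the template at every position that is currently wrong."""
    return {
        Rule(current[i], gold[i], current[i - 1])
        for i in range(1, len(current))
        if current[i] != gold[i]
    }


def learn(initial: List[str], gold: List[str], threshold: int = 0) -> List[Rule]:
    """Greedily learn an ordered rule list; stop when no rule beats the threshold."""
    if threshold < 0:
        raise ValueError("threshold must be non-negative so the loop terminates")
    current, learned = list(initial), []
    while True:
        candidates = sorted(candidate_rules(current, gold))
        if not candidates:
            break
        best = max(candidates, key=lambda r: error_reduction(current, gold, r))
        if error_reduction(current, gold, best) <= threshold:
            break
        current = apply_rule(current, best)
        learned.append(best)
    return learned


initial = ["TO", "NN", "DT", "NN", "TO", "NN"]
gold = ["TO", "VB", "DT", "NN", "TO", "VB"]
print(learn(initial, gold))
```

Because every accepted rule lowers the error count by more than a non-negative threshold, the loop is guaranteed to stop.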
Stochastic POS Tagging
POS tagging: for a given sentence $W = w_1 \dots w_n$, find the matching POS tags $T = t_1 \dots t_n$
In a statistical framework: $T' = \arg\max_T P(T \mid W)$

Bayes’ Rule
Steps and assumptions used in the derivation:
–Words are independent of each other
–A word’s identity depends only on its own tag
–Markovian assumptions
–Denominator doesn’t depend on tags
–Chaining rule
Notation: $P(t_1) = P(t_1 \mid t_0)$

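The derivation that these labels annotate can be written out as follows. This is the standard derivation for HMM tagging, reconstructed from the listed assumptions rather than copied from the slide; the steps marked $\approx$ are the ones that rely on a modelling assumption.

```latex
\begin{align*}
T' &= \arg\max_T P(T \mid W) \\
   &= \arg\max_T \frac{P(W \mid T)\,P(T)}{P(W)}
      && \text{(Bayes' rule)} \\
   &= \arg\max_T P(W \mid T)\,P(T)
      && \text{(denominator does not depend on tags)} \\
   &= \arg\max_T P(W \mid T) \prod_{i=1}^{n} P(t_i \mid t_1, \dots, t_{i-1})
      && \text{(chaining rule)} \\
   &\approx \arg\max_T \prod_{i=1}^{n} P(w_i \mid t_i) \prod_{i=1}^{n} P(t_i \mid t_1, \dots, t_{i-1})
      && \text{(words independent; each depends only on its own tag)} \\
   &\approx \arg\max_T \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})
      && \text{(Markovian assumptions)}
\end{align*}
```

For $i = 1$ the transition term is $P(t_1 \mid t_0)$, which by the notation above is just $P(t_1)$.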
The Markovian assumptions
Limited horizon
–$P(X_{i+1} = t_k \mid X_1, \dots, X_i) = P(X_{i+1} = t_k \mid X_i)$
Time invariant
–$P(X_{i+1} = t_k \mid X_i) = P(X_{j+1} = t_k \mid X_j)$

Maximum Likelihood Estimations
To estimate $P(w_i \mid t_i)$ and $P(t_i \mid t_{i-1})$ we can use maximum likelihood estimation:
–$P(w_i \mid t_i) = c(w_i, t_i) / c(t_i)$
–$P(t_i \mid t_{i-1}) = c(t_{i-1} t_i) / c(t_{i-1})$
Notice the estimation for $i = 1$, where the conditioning tag is $t_0$

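As a rough sketch, both estimates reduce to counting over a tagged corpus. The function name `estimate`, the `START` marker standing in for $t_0$, and the corpus layout (a list of sentences, each a list of (word, tag) pairs) are assumptions made for this illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple

START = "<s>"  # hypothetical marker for t_0, the tag "before" the first word

Sentence = List[Tuple[str, str]]  # (word, tag) pairs


def estimate(corpus: List[Sentence]) -> Tuple[Dict[Tuple[str, str], float],
                                               Dict[Tuple[str, str], float]]:
    """Return (emission, transition) maximum likelihood estimates.

    emission[(t, w)]   = c(w, t) / c(t)
    transition[(p, t)] = c(p t)  / c(p)
    """
    tag_count: Counter = Counter()
    emit_count: Counter = Counter()
    trans_count: Counter = Counter()
    for sentence in corpus:
        prev = START
        tag_count[START] += 1  # one t_0 per sentence, so i = 1 is covered
        for word, tag in sentence:
            tag_count[tag] += 1
            emit_count[(tag, word)] += 1
            trans_count[(prev, tag)] += 1
            prev = tag
    emission = {(t, w): c / tag_count[t] for (t, w), c in emit_count.items()}
    transition = {(p, t): c / tag_count[p] for (p, t), c in trans_count.items()}
    return emission, transition


corpus = [
    [("the", "DT"), ("reply", "NN")],
    [("to", "TO"), ("reply", "VB")],
]
emission, transition = estimate(corpus)
print(transition[(START, "DT")])
```

Pairs that never occur in the corpus are simply absent from the dictionaries, which corresponds to a probability of zero and is the motivation for smoothing.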
Unknown Words
Many words will not appear in the training corpus.
Unknown words are a major problem for taggers (!)
Solutions:
–Incorporate morphological analysis
–Treat words appearing once in the training data as UNKNOWNs

“Add-1/Add-Constant” Smoothing

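Add-constant smoothing adds a constant k to every count and renormalizes, so that unseen events receive a small non-zero probability; k = 1 gives add-1 smoothing. The sketch below states the usual formula in code. The function name `add_constant` and its parameter names are hypothetical.

```python
def add_constant(count_xy: int, count_y: int, num_outcomes: int, k: float = 1.0) -> float:
    """Smoothed estimate of P(x | y):

        (c(x, y) + k) / (c(y) + k * num_outcomes)

    num_outcomes is the number of distinct values x can take
    (e.g. vocabulary size for emissions, tagset size for transitions).
    """
    if k <= 0 or num_outcomes <= 0:
        raise ValueError("k and num_outcomes must be positive")
    return (count_xy + k) / (count_y + k * num_outcomes)


# A tag seen 8 times with a 2-word vocabulary: counts 3 and 5 for the two words.
p_first = add_constant(3, 8, 2)
p_second = add_constant(5, 8, 2)
print(p_first, p_second, p_first + p_second)
```

The smoothed estimates still sum to one over all outcomes, because k is added once per outcome in the denominator.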
Smoothing for Tagging
For $P(t_i \mid t_{i-1})$
Optionally, for $P(t_i \mid t_{i-1})$

Viterbi
The most probable tag sequence can be found with the Viterbi algorithm.
There is no need to evaluate every possible tag sequence (!)

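A minimal sketch of Viterbi decoding for a bigram tagger follows. It assumes the probabilities are stored in dictionaries keyed by (previous tag, tag) and (tag, word), that missing entries mean probability zero, and that `"<s>"` stands for the start state; all of these names and layouts are choices made for the illustration. Log probabilities are used to avoid underflow on long sentences.

```python
import math
from typing import Dict, List, Tuple

Table = Dict[Tuple[str, str], float]


def viterbi(words: List[str], tags: List[str], trans: Table, emit: Table,
            start: str = "<s>") -> List[str]:
    """Return the most probable tag sequence for `words`."""
    if not words:
        return []

    def logp(table: Table, key: Tuple[str, str]) -> float:
        p = table.get(key, 0.0)
        return math.log(p) if p > 0 else float("-inf")

    # best[i][t]: log-probability of the best tagging of words[:i+1] ending in t
    best = [{t: logp(trans, (start, t)) + logp(emit, (t, words[0])) for t in tags}]
    back: List[Dict[str, str]] = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            # Only the best predecessor of t needs to be remembered.
            prev = max(tags, key=lambda p: best[i - 1][p] + logp(trans, (p, t)))
            best[i][t] = (best[i - 1][prev] + logp(trans, (prev, t))
                          + logp(emit, (t, words[i])))
            back[i][t] = prev

    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    path.reverse()
    return path


tags = ["DT", "NN", "VB"]
trans = {("<s>", "DT"): 0.8, ("<s>", "NN"): 0.1, ("<s>", "VB"): 0.1,
         ("DT", "NN"): 0.9, ("DT", "VB"): 0.1}
emit = {("DT", "the"): 1.0, ("NN", "reply"): 0.4, ("VB", "reply"): 0.6}
print(viterbi(["the", "reply"], tags, trans, emit))
```

The table has one cell per (position, tag) and each cell looks at every previous tag, so the work grows as n times the square of the tagset size, rather than exponentially in n as it would when enumerating all sequences.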
HMMs
Assume a state machine with:
–Nodes that correspond to tags
–A start and an end state
–Arcs corresponding to transition probabilities: $P(t_i \mid t_{i-1})$
–A set of observation likelihoods for each state: $P(w_i \mid t_i)$

[Figure: HMM state diagram with states NN, VBZ, NNS, AT, VB, RB and per-state observation likelihoods, including: P(like)=0.2, P(fly)=0.3, …, P(eat)= ; P(likes)=0.3, P(flies)=0.1, …, P(eats)=0.5; P(the)=0.4, P(a)=0.3, P(an)=0.2, …]

HMMs
An HMM is similar to an automaton augmented with probabilities.
Note that the states in an HMM do not correspond to the input symbols.
The input symbols don’t uniquely determine the next state.

HMM definition
HMM = (S, K, A, B)
–Set of states $S = \{s_1, \dots, s_n\}$
–Output alphabet $K = \{k_1, \dots, k_n\}$
–State transition probabilities $A = \{a_{ij}\},\ i, j \in S$
–Symbol emission probabilities $B = \{b(i,k)\},\ i \in S,\ k \in K$
–Start and end states (non-emitting); alternatively, initial state probabilities
Note: for a given $i$, $\sum_j a_{ij} = 1$ and $\sum_k b(i,k) = 1$

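One possible way to hold the tuple (S, K, A, B) in code, together with a check of the two normalization constraints, is sketched below. The class name `HMM`, its field layout (nested dictionaries), and the method `is_stochastic` are assumptions for illustration; the non-emitting start and end states are left out to keep the sketch short.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class HMM:
    """Hypothetical container for HMM = (S, K, A, B)."""
    states: List[str]                # S
    alphabet: List[str]              # K
    A: Dict[str, Dict[str, float]]   # A[i][j] = a_ij, transition i -> j
    B: Dict[str, Dict[str, float]]   # B[i][k] = b(i, k), state i emits k

    def is_stochastic(self, tol: float = 1e-9) -> bool:
        """Check that, for every state i, sum_j a_ij = 1 and sum_k b(i, k) = 1."""
        for i in self.states:
            row_a = sum(self.A.get(i, {}).get(j, 0.0) for j in self.states)
            row_b = sum(self.B.get(i, {}).get(k, 0.0) for k in self.alphabet)
            if abs(row_a - 1.0) > tol or abs(row_b - 1.0) > tol:
                return False
        return True


hmm = HMM(
    states=["NN", "VB"],
    alphabet=["fly", "reply"],
    A={"NN": {"NN": 0.3, "VB": 0.7}, "VB": {"NN": 0.6, "VB": 0.4}},
    B={"NN": {"fly": 0.5, "reply": 0.5}, "VB": {"fly": 0.2, "reply": 0.8}},
)
print(hmm.is_stochastic())
```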
Why Hidden?
Because we only observe the input; the underlying states are hidden.
Decoding: part-of-speech tagging can be viewed as a decoding problem: given an observation sequence $W = w_1, \dots, w_n$, find a state sequence $T = t_1, \dots, t_n$ that best explains the observation.

Homework