Unsupervised Models for Named Entity Classification
Michael Collins and Yoram Singer
Yimeng Zhang
March 1st, 2007

Overview
Unlabeled data can be used to reduce the need for supervision
Basic idea: make use of the redundancy in the unlabeled data
–CoTraining [Blum and Mitchell 98]
Two unsupervised models
–DL-CoTrain, based on decision list learning
–CoBoost, based on the AdaBoost algorithm

Problem setting
Given
iid labeled examples
–7 seed rules
iid unlabeled examples
–90,000 unlabeled examples
$x_i$ is a feature vector drawn from a set of possible values $X$
Wish to learn a classification function $f: X \to Y$, with $Y$ = {Location, Person, Organization}

Redundantly Sufficient Features
[Figure: a spelling feature and a context feature, both associated with the label "person"]
Redundantly sufficient features:
–features can be separated into two types, $X = X_1 \times X_2$
–either $X_1$ or $X_2$ is sufficient for classification: there exist functions $f_1$ and $f_2$ such that $f_1(x_1) = f_2(x_2) = f(x)$ for any example $x = (x_1, x_2)$
–$X_1$ and $X_2$ are conditionally independent given $Y$

CoTraining
[Figure: spelling feature "Mr. Cooper" connected to context feature "… a president of"]
Each unlabeled pair is represented as an edge
An edge indicates that the two features must have the same label

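CoTraining
Given
–Labeled examples $(x_i, y_i)$, $i = 1, \dots, m$
–Unlabeled examples $x_i$, $i = m+1, \dots, n$
–$x_i = (x_{1,i}, x_{2,i})$
Induce functions $f_1$ and $f_2$ such that
–$f_1(x_{1,i}) = f_2(x_{2,i}) = y_i$ for $i = 1, \dots, m$
–$f_1(x_{1,i}) = f_2(x_{2,i})$ for $i = m+1, \dots, n$
Loosen the constraint
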
Supervised Algorithm based on Decision Lists
Input
–$x_i$ is a set of features
Output
–A function $h: X \times Y \to [0, 1]$
–$h(x, y)$ is an estimate of the probability $p(y|x)$ of seeing label $y$ given that feature $x$ is present
–$h$ can be thought of as defining a decision list of rules $x \to y$ ranked by their strength $h(x, y)$
[Figure: decision list that tests features $x_1$, $x_2$, $x_3$ in order, with true/false branches and rule strengths $h(x_1, 0)$ and $h(x_2, 1)$]

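Supervised Algorithm based on Decision Lists (2)
The label for a test example $\mathbf{x}$ is given by the strongest rule whose feature is present:
$$f(\mathbf{x}) = \arg\max_{x \in \mathbf{x},\ y \in Y} h(x, y)$$
$h(x, y)$ is defined as follows:
$$h(x, y) = \frac{\mathrm{Count}(x, y) + \alpha}{\mathrm{Count}(x) + k\alpha}$$
–$\mathrm{Count}(x, y)$ is the number of times feature $x$ is seen with label $y$ in training data
–$\mathrm{Count}(x) = \sum_{y} \mathrm{Count}(x, y)$
–$\alpha$ is a smoothing parameter and $k$ is the number of possible labels
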
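The supervised decision-list learner can be sketched in a few lines. This is a minimal illustration, assuming $h(x, y)$ is a smoothed relative frequency of the form (Count(x, y) + α) / (Count(x) + kα). The smoothing value 0.1, the feature strings, and the names `train_decision_list` and `classify` are assumptions made for this sketch, not taken from the slides.

```python
from collections import Counter

LABELS = ["location", "person", "organization"]
ALPHA = 0.1  # smoothing value; an assumption for this sketch


def train_decision_list(examples, labels=LABELS, alpha=ALPHA):
    """examples: iterable of (feature_set, label) pairs.

    Returns a dict mapping (feature, label) to the smoothed strength h(x, y).
    """
    pair_counts = Counter()     # Count(x, y)
    feature_counts = Counter()  # Count(x)
    for features, label in examples:
        for feat in features:
            pair_counts[(feat, label)] += 1
            feature_counts[feat] += 1
    k = len(labels)
    return {
        (feat, y): (pair_counts[(feat, y)] + alpha) / (feature_counts[feat] + k * alpha)
        for feat in feature_counts
        for y in labels
    }


def classify(h, features):
    """Return the label of the strongest rule that fires, or None if none fires."""
    fired = [(score, y) for (feat, y), score in h.items() if feat in features]
    if not fired:
        return None
    return max(fired)[1]


# Hypothetical toy training data.
train = [
    ({"contains(Mr.)", "context=president"}, "person"),
    ({"contains(Mr.)"}, "person"),
    ({"contains(Inc.)"}, "organization"),
]
h = train_decision_list(train)
print(classify(h, {"contains(Mr.)", "contains(Inc.)"}))  # person
```

The rule for `contains(Mr.)` wins here because it was seen twice with the same label, so its smoothed strength (2.1 / 2.3) is higher than that of the `contains(Inc.)` rule (1.1 / 1.3).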
DL-CoTrain (unsupervised decision list)
[Figure: the DL-CoTrain iteration. Initialize with seed spelling rules ($x_{1,s} \to y$, …), label the data, induce context rules ($x_{1,c} \to y$, …), label the data again, induce a larger set of spelling rules, and repeat]
Induce rules: choose the rules whose features appeared most often with some known label

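The alternation between the two views can be sketched as follows. This is a simplified illustration under stated assumptions, not the schedule used in the original work: the strength threshold `min_strength`, the smoothing value, the fixed number of rounds, and all function names are hypothetical, and seed rules are simply kept at strength 1.0.

```python
from collections import Counter


def label_with(rules, feature_sets):
    """Label each example with its strongest matching rule, or None."""
    labels = []
    for feats in feature_sets:
        fired = [(strength, y) for f, (y, strength) in rules.items() if f in feats]
        labels.append(max(fired)[1] if fired else None)
    return labels


def induce_rules(feature_sets, labels, label_set, alpha=0.1, min_strength=0.7):
    """Induce feature -> (label, strength) rules from the currently labeled examples."""
    pair, total = Counter(), Counter()
    for feats, y in zip(feature_sets, labels):
        if y is None:
            continue  # skip examples no rule could label
        for f in feats:
            pair[(f, y)] += 1
            total[f] += 1
    k = len(label_set)
    rules = {}
    for (f, y), count in pair.items():
        strength = (count + alpha) / (total[f] + k * alpha)
        if strength >= min_strength and strength > rules.get(f, (None, 0.0))[1]:
            rules[f] = (y, strength)
    return rules


def dl_cotrain(pairs, seed_rules, label_set, rounds=3):
    """pairs: list of (spelling_features, context_features); seed_rules: feature -> label."""
    spelling = [s for s, _ in pairs]
    context = [c for _, c in pairs]
    seeds = {f: (y, 1.0) for f, y in seed_rules.items()}
    spelling_rules, context_rules = dict(seeds), {}
    for _ in range(rounds):
        # Spelling rules label the data, then context rules are induced from it.
        context_rules = induce_rules(context, label_with(spelling_rules, spelling), label_set)
        # Context rules label the data, then spelling rules are induced from it.
        induced = induce_rules(spelling, label_with(context_rules, context), label_set)
        spelling_rules = {**induced, **seeds}  # seed rules are always kept
    return spelling_rules, context_rules


# Hypothetical toy data: two seed rules, four unlabeled (spelling, context) pairs.
pairs = [
    ({"contains(Mr.)", "full=Mr. Cooper"}, {"context=president"}),
    ({"full=Ms. Smith"}, {"context=president"}),
    ({"contains(Inc.)", "full=Acme Inc."}, {"context=shares_of"}),
    ({"full=IBM"}, {"context=shares_of"}),
]
seed = {"contains(Mr.)": "person", "contains(Inc.)": "organization"}
labels = ["location", "person", "organization"]
spelling_rules, context_rules = dl_cotrain(pairs, seed, labels)
```

In this toy run the seed rule for `contains(Mr.)` labels the first pair, which yields a context rule for `context=president`, which in turn labels the second pair and yields a spelling rule for `full=Ms. Smith`.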
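Boosting-based algorithm - AdaBoost
$D_t$ is a distribution over instances; it specifies the relative weight of each example
$$D_{t+1}(i) = \frac{D_t(i) \exp(-\alpha_t y_i h_t(x_i))}{Z_t}$$
$\alpha_t$ is the weight for the weak learner $h_t$
Choose $h_t$ and the weight $\alpha_t$ to minimize $Z_t$
$$Z_t = \sum_i D_t(i) \exp(-\alpha_t y_i h_t(x_i))$$
The training error is bounded above by
$$\prod_{t=1}^{T} Z_t$$
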
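A small sketch of the AdaBoost loop makes the role of the distribution and of $Z_t$ concrete. It assumes binary labels in {-1, +1}, a fixed pool of candidate weak hypotheses given as prediction lists, and the standard weight $\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$ for ±1-valued hypotheses. The candidate pool, the clamping constant, and the function names are assumptions of this sketch.

```python
import math


def adaboost(candidates, y, rounds):
    """candidates: list of prediction lists with entries in {-1, +1}.
    y: true labels in {-1, +1}. Returns the ensemble and the Z_t values."""
    n = len(y)
    D = [1.0 / n] * n  # start from the uniform distribution
    ensemble, z_values = [], []
    for _ in range(rounds):
        # Weighted error of every candidate under the current distribution.
        errors = [sum(d for d, p, t in zip(D, preds, y) if p != t) for preds in candidates]
        best = min(range(len(candidates)), key=errors.__getitem__)
        eps = min(max(errors[best], 1e-10), 1 - 1e-10)  # clamp to avoid log(0)
        alpha = 0.5 * math.log((1 - eps) / eps)
        # Reweight: mistakes gain weight, correct predictions lose weight.
        unnorm = [d * math.exp(-alpha * t * p) for d, p, t in zip(D, candidates[best], y)]
        Z = sum(unnorm)
        D = [u / Z for u in unnorm]
        ensemble.append((alpha, best))
        z_values.append(Z)
    return ensemble, z_values


def training_error(ensemble, candidates, y):
    """Fraction of examples whose weighted vote has the wrong sign (ties count as errors)."""
    wrong = 0
    for i, t in enumerate(y):
        score = sum(alpha * candidates[j][i] for alpha, j in ensemble)
        wrong += score * t <= 0
    return wrong / len(y)


# Hypothetical toy problem with three candidate weak hypotheses.
y = [1, 1, -1, -1, 1]
candidates = [
    [1, 1, 1, -1, 1],
    [1, -1, -1, -1, 1],
    [-1, 1, -1, -1, -1],
]
ensemble, z_values = adaboost(candidates, y, rounds=3)
bound = math.prod(z_values)
```

Comparing `training_error(ensemble, candidates, y)` with `bound` illustrates the guarantee on the slide: the training error never exceeds the product of the $Z_t$ values.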
AdaBoost for named entity recognition
Weak hypothesis $h_t$
Choose the weight $\alpha_t$
Choose $h_t$ so that it minimizes $Z_t$

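One common way to build weak hypotheses from features is to let each hypothesis test a single feature and abstain when it is absent. The sketch below assumes that setup, in the binary case, with the confidence-rated choices $\alpha = \frac{1}{2}\ln\frac{W_+}{W_-}$ and $Z = W_0 + 2\sqrt{W_+ W_-}$ from Schapire and Singer's analysis of abstaining predictors. Whether the slide showed exactly these formulas is an assumption; the smoothing constant `eps` and the function name are hypothetical.

```python
import math


def best_abstaining_rule(feature_sets, y, D, eps=1e-3):
    """Pick the single-feature rule minimizing Z = W0 + 2*sqrt(W+ * W-).

    feature_sets: one set of features per example
    y: labels in {-1, +1}
    D: current instance weights (summing to 1)
    Returns (Z, feature, alpha); the rule predicts alpha when the feature
    is present and abstains (predicts 0) otherwise.
    """
    all_features = set().union(*feature_sets)
    best = None
    for feat in sorted(all_features):
        # Weight of examples containing the feature, split by label.
        w_plus = sum(d for fs, t, d in zip(feature_sets, y, D) if feat in fs and t == +1)
        w_minus = sum(d for fs, t, d in zip(feature_sets, y, D) if feat in fs and t == -1)
        w_zero = sum(D) - w_plus - w_minus  # weight where the rule abstains
        z = w_zero + 2.0 * math.sqrt(w_plus * w_minus)
        if best is None or z < best[0]:
            # eps keeps alpha finite when one of the weights is zero.
            alpha = 0.5 * math.log((w_plus + eps) / (w_minus + eps))
            best = (z, feat, alpha)
    return best


# Hypothetical toy data: +1 = person, -1 = not a person.
feature_sets = [{"Mr."}, {"Mr.", "Inc."}, {"Inc."}, {"Inc."}]
y = [+1, +1, -1, -1]
D = [0.25, 0.25, 0.25, 0.25]
z, feat, alpha = best_abstaining_rule(feature_sets, y, D)
```

A feature that only ever occurs with one label gets $W_+ W_- = 0$, so its $Z$ reduces to the weight of the examples it abstains on; this favours rules that are both pure and frequent.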
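CoBoost (unsupervised AdaBoost)
Recall the criteria for CoTraining
Given
–Labeled examples $(x_i, y_i)$, $i = 1, \dots, m$
–Unlabeled examples $x_i$, $i = m+1, \dots, n$
–$x_i$ = (spelling, context) = $(x_{1,i}, x_{2,i})$
Induce functions $f_1$ and $f_2$ such that
–$f_1(x_{1,i}) = f_2(x_{2,i}) = y_i$ for $i = 1, \dots, m$
–$f_1(x_{1,i}) = f_2(x_{2,i})$ for $i = m+1, \dots, n$
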
CoBoost (2)
Optimization function: an extension of $Z_t$, used to learn $f_1$ and $f_2$
Choose $h_t$ and $\alpha_t$ to minimize this function
$g_j$ is the unthresholded hypothesis for $f_j$
Its terms correspond to
–the error on the labeled data
–the number of disagreements on the unlabeled data

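The objective can be illustrated numerically. The sketch below assumes the binary case and an objective of the form: for labeled examples, $\exp(-y_i g_1) + \exp(-y_i g_2)$; for unlabeled examples, $\exp(-f_2 g_1) + \exp(-f_1 g_2)$, where $f_j = \mathrm{sign}(g_j)$. This form is recalled from the Collins and Singer paper and should be treated as an assumption here, since the formula itself did not survive on the slide. The function names are hypothetical.

```python
import math


def sign(v):
    """Threshold an unthresholded score; zero is mapped to +1 by convention."""
    return 1 if v >= 0 else -1


def coboost_objective(g1, g2, y, m):
    """g1, g2: unthresholded scores of the two classifiers, one per example.
    y: labels in {-1, +1} for the first m examples; the rest are unlabeled."""
    n = len(g1)
    # Labeled part: exponential loss of each classifier against the true label.
    labeled = sum(math.exp(-y[i] * g1[i]) + math.exp(-y[i] * g2[i]) for i in range(m))
    # Unlabeled part: each classifier is scored against the other's thresholded output.
    unlabeled = sum(
        math.exp(-sign(g2[i]) * g1[i]) + math.exp(-sign(g1[i]) * g2[i])
        for i in range(m, n)
    )
    return labeled + unlabeled


# One labeled example followed by one unlabeled example (hypothetical scores).
agree = coboost_objective([2.0, 1.0], [2.0, 1.0], y=[1], m=1)
disagree = coboost_objective([2.0, 1.0], [2.0, -1.0], y=[1], m=1)
```

Because $\exp(-z) \ge 1$ whenever $z \le 0$, every labeled error and every disagreement on an unlabeled example contributes at least 1 to the value, so `disagree` is larger than `agree` in this example.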
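CoBoost (3)
At each iteration
–step 1: fix the second classifier; choose $h_t^1$ and $\alpha_t^1$ to minimize the objective with respect to the first classifier
–step 2: fix the first classifier; choose $h_t^2$ and $\alpha_t^2$ to minimize the objective with respect to the second classifier
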
CoBoost (4)
$t$ is the iteration, $j$ is the classifier
For unlabeled data, take the current output of the other classifier as the label
The instance weight is based on this classifier's own current output
The resulting function has the same form as $Z_t$ in AdaBoost
Use the same algorithm as AdaBoost to choose $h_t^j$ and $\alpha_t^j$

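The two ingredients described on this slide, pseudo-labels taken from the other classifier and instance weights computed from the classifier being updated, can be sketched as below. This assumes the binary case and weights proportional to $\exp(-\tilde{y}_i\, g_j(x_{j,i}))$, mirroring the AdaBoost weighting. The exact weighting and the names `pseudo_labels` and `instance_weights` are assumptions of this sketch.

```python
import math


def pseudo_labels(y, m, g_other):
    """Labeled examples keep their true label; each unlabeled example takes
    the sign of the other classifier's current unthresholded score."""
    return list(y[:m]) + [1 if g >= 0 else -1 for g in g_other[m:]]


def instance_weights(pseudo, g_own):
    """Normalized weights exp(-label * own score): examples where this
    classifier disagrees with its target receive the most weight."""
    raw = [math.exp(-t * g) for t, g in zip(pseudo, g_own)]
    total = sum(raw)
    return [r / total for r in raw]


# Two labeled examples followed by two unlabeled ones (hypothetical scores).
y = [1, -1]
g1 = [0.5, -0.5, 1.0, -2.0]   # current scores of classifier 1 (spelling)
g2 = [0.3, -0.2, -0.4, -1.0]  # current scores of classifier 2 (context)

targets_for_1 = pseudo_labels(y, 2, g2)        # classifier 1 learns from classifier 2
weights_for_1 = instance_weights(targets_for_1, g1)
```

With targets and weights in hand, choosing the next weak hypothesis for classifier 1 is an ordinary weighted AdaBoost step. In this example the third instance gets the largest weight because classifier 1 scores it positive while classifier 2 labels it negative.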
Evaluation
88,962 examples, each a (spelling, context) pair
7 seed rules are used
1,000 examples are chosen as test data (85 of them are noise)
The examples are labeled as one of (location, person, organization, noise)