Guiding Semi-Supervision with Constraint-Driven Learning
Ming-Wei Chang, Lev Ratinov, Dan Roth
Semi-supervised learning? Scarcity of training data. What are constraints? How/why do they help?
Supervised learning: labelled data (X1, Y1), (X2, Y2), (X3, Y3), …, (Xn, Yn). What if n is small? Obtaining training data is costly and can be inefficient. Example: fraud detection / anomaly detection. Domain expertise helps.
Definitions: X = (X1, X2, X3, X4, …, Xn), Y = (Y1, Y2, Y3, Y4, …, Yn). H : X → Y is a classifier. f : X × Y → R (the set of real numbers). The output of the classifier is the y that maximizes the value of the function f.
Classification function: it is a linear sum of feature functions.
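A minimal sketch of such a scoring function, written as f(x, y) = Σ_i w_i · φ_i(x, y). The notation w_i and φ_i, the two feature functions, and the weight values are all assumptions made for illustration; none of them come from the slides.

```python
# Hypothetical feature functions phi_i(x, y): each maps a
# (tokens, labels) pair to a number. Illustrative only.
def count_capitalized_names(x, y):
    # Number of capitalized tokens that are labelled "NAME".
    return sum(1 for tok, lab in zip(x, y)
               if tok[:1].isupper() and lab == "NAME")

def count_label_changes(x, y):
    # Number of positions where the label differs from the previous one.
    return sum(1 for a, b in zip(y, y[1:]) if a != b)

FEATURES = [count_capitalized_names, count_label_changes]
WEIGHTS = [2.0, -0.5]  # assumed values, not learned from data

def score(x, y, features=FEATURES, weights=WEIGHTS):
    """Linear score f(x, y): weighted sum of the feature functions."""
    return sum(w * phi(x, y) for w, phi in zip(weights, features))
```

The classifier would then return the label sequence y with the highest score for a given x.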
Motivational Interviewing labels: Support, Reflection, Confrontation, Facilitate, Question
Can we exploit knowledge of constraints in the inference phase? Let's assume n items (observations) in a sequence and p labels, i.e., n tokens and p parts of speech, or n tokens and p tags in an NER task. Brute force: O(p^n), since there are p^n possible label sequences. Viterbi (first-order): O(n · p^2). Can we go down further? Can we reduce our search space even more?
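The gap between the two costs can be seen in a small sketch. It assumes a first-order model whose score is a sum of per-position emission scores and label-to-label transition scores; the names `emit`, `trans`, `brute_force`, and `viterbi` are illustrative and not taken from the slides.

```python
from itertools import product

def sequence_score(emit, trans, seq):
    # Sum of emission scores plus transition scores along the sequence.
    s = sum(emit[t][seq[t]] for t in range(len(seq)))
    s += sum(trans[seq[t - 1]][seq[t]] for t in range(1, len(seq)))
    return s

def brute_force(emit, trans, labels):
    """Score all p**n label sequences and keep the best one."""
    best_score, best_seq = float("-inf"), None
    for seq in product(labels, repeat=len(emit)):
        s = sequence_score(emit, trans, seq)
        if s > best_score:
            best_score, best_seq = s, list(seq)
    return best_score, best_seq

def viterbi(emit, trans, labels):
    """First-order Viterbi: n positions, p**2 work per position."""
    best = [{y: emit[0][y] for y in labels}]  # best score ending in y
    back = []                                 # back-pointers per position
    for t in range(1, len(emit)):
        col, ptr = {}, {}
        for y in labels:
            prev = max(labels, key=lambda yp: best[-1][yp] + trans[yp][y])
            col[y] = best[-1][prev] + trans[prev][y] + emit[t][y]
            ptr[y] = prev
        best.append(col)
        back.append(ptr)
    last = max(labels, key=lambda y: best[-1][y])
    seq = [last]
    for ptr in reversed(back):  # follow back-pointers to recover the path
        seq.append(ptr[seq[-1]])
    seq.reverse()
    return best[-1][last], seq
```

Both functions return the same best sequence when the optimum is unique; only the amount of work differs.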
Introducing constraints into the model. Let C1, C2, …, CK be the constraints, each C : X × Y → {0, 1}. Constraints are of two types: hard (MUST be satisfied) and soft (can be relaxed). 1_C(x) is the set of label sequences that DON'T violate the constraints.
Constraints come to the rescue. Let's say x out of the X possible tag sequences violate the constraints. The search space shrinks from X to X - x. How do we infer? Does Viterbi help us?
Example

      A    B    C    D    E    F    G
S1    X1   X1   X1   X1   X1   X1   X1
S2    X10  X10  X10  X10  X10  X10  X10
S3    X11  X11  X11  X11  X11  X11  X11

Motivational Interviewing: at least ONE reflection
Soft constraints. How do we calculate the distance here? How do we learn the parameters?
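One possible answer to the distance question, sketched under stated assumptions: take the distance of a labelling y from a constraint to be the smallest Hamming distance between y and any label sequence that satisfies it, and subtract a weighted penalty from the model score. The exhaustive search, the penalty weights, and all function names here are hypothetical; the slides do not fix any of these choices.

```python
from itertools import product

LABELS = ["Support", "Reflection", "Confrontation", "Facilitate", "Question"]

def at_least_one_reflection(x, y):
    # Example constraint: some utterance must be labelled "Reflection".
    return "Reflection" in y

def hamming(a, b):
    # Number of positions where two equal-length sequences differ.
    return sum(1 for u, v in zip(a, b) if u != v)

def distance_to_constraint(x, y, constraint, labels=LABELS):
    """Smallest Hamming distance from y to a sequence satisfying the
    constraint. Exhaustive over all labellings, so tiny inputs only."""
    dists = [hamming(y, cand)
             for cand in product(labels, repeat=len(y))
             if constraint(x, cand)]
    return min(dists) if dists else float("inf")

def penalized_score(x, y, model_score, constraints, penalties):
    """Model score minus a weighted penalty per violated soft constraint."""
    penalty = sum(rho * distance_to_constraint(x, y, c)
                  for c, rho in zip(constraints, penalties))
    return model_score(x, y) - penalty
```

A hard constraint would correspond to an infinite penalty weight; how the weights themselves are learned is left open here, as it is on the slide.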
Lars Ole Andersen. Program Analysis and Specialization for the C Programming Language. PhD Thesis, DIKU, University of Copenhagen, May 1994. This is the ground truth. But the HMM gives this: Lars Ole Andersen. Program Analysis and Specialization for the C Programming Language. PhD Thesis, DIKU, University of Copenhagen, May 1994.
Top-k inference. We choose only the top few possible sequences and add ALL of them to the training data. The authors used beam search decoding, but this can be done with any inference procedure. We label the unlabeled samples and include them in the training data. Choice: we may include only the high-confidence samples. Pitfall: then we don't really learn properly and miss out on some characteristics.
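A rough sketch of top-k labelling with beam search, assuming the same kind of first-order emission/transition scores as a standard sequence model. The function names, the `emit_fn` helper, and the choice of beam width equal to k are illustrative assumptions, not details given on the slide.

```python
def beam_search_top_k(emit, trans, labels, k):
    """Return up to k (score, sequence) pairs, best first, keeping only
    the k best partial sequences at every position."""
    beam = sorted(((emit[0][y], [y]) for y in labels),
                  key=lambda item: item[0], reverse=True)[:k]
    for t in range(1, len(emit)):
        expanded = [
            (score + trans[seq[-1]][y] + emit[t][y], seq + [y])
            for score, seq in beam
            for y in labels
        ]
        beam = sorted(expanded, key=lambda item: item[0], reverse=True)[:k]
    return beam

def pseudo_label(unlabeled, emit_fn, trans, labels, k):
    """Label each unlabeled input and keep ALL of its top-k sequences
    as new training examples."""
    new_examples = []
    for x in unlabeled:
        for _, seq in beam_search_top_k(emit_fn(x), trans, labels, k):
            new_examples.append((x, seq))
    return new_examples
```

Beam search is approximate: pruning can drop a prefix that would have led to one of the true k best sequences, so a wider beam trades speed for accuracy.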
Algorithm: