Guiding Semi-Supervision with Constraint-Driven Learning. Ming-Wei Chang, Lev Ratinov, Dan Roth.

Presentation transcript:

Guiding Semi-Supervision with Constraint-Driven Learning. Ming-Wei Chang, Lev Ratinov, Dan Roth

Semi-supervised Learning? Scarcity of training data. What are constraints? How/why do they help?

Supervised learning: labelled data (X1 → Y1), (X2 → Y2), (X3 → Y3), …, (Xn → Yn). What if n is small? Obtaining training data is costly and can be inefficient. Example: fraud detection / anomaly detection, where labelled cases are rare. Domain expertise helps.

Definitions. X = (X1, X2, X3, …, Xn), Y = (Y1, Y2, Y3, …, Yn). H : X → Y is a classifier. f : X × Y → R (the set of real numbers) is a scoring function. The output of the classifier is the y that maximises the value of f, i.e., y* = argmax over y of f(x, y).

Classification function: it is a linear (weighted) sum of feature functions, f(x, y) = Σ_i w_i φ_i(x, y) (see the sketch below).
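A minimal sketch of this set-up (my own illustration, not code from the paper): toy feature functions φ_i(x, y), a weight dictionary w, the linear score f(x, y) = Σ_i w_i φ_i(x, y), and a classifier that returns the argmax label. The feature names are made up, and the label set is borrowed from the Motivational Interviewing slide that appears later.

```python
# Toy illustration: linear scoring f(x, y) = sum_i w_i * phi_i(x, y)
# and classification by argmax over a small label set.

LABELS = ["Support", "Reflection", "Confrontation", "Facilitate", "Question"]

def features(x, y):
    """Toy feature functions phi_i(x, y); a real system would use many more."""
    return {
        "contains_question_mark&label=" + y: 1.0 if "?" in x else 0.0,
        "first_word=" + x.split()[0].lower() + "&label=" + y: 1.0,
    }

def score(w, x, y):
    """f(x, y) = sum_i w_i * phi_i(x, y); missing weights default to 0."""
    return sum(w.get(name, 0.0) * value for name, value in features(x, y).items())

def classify(w, x):
    """h(x) = argmax_y f(x, y)."""
    return max(LABELS, key=lambda y: score(w, x, y))

# Example usage with one hand-set weight:
w = {"contains_question_mark&label=Question": 2.0}
print(classify(w, "How did that make you feel?"))   # -> "Question"
```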

Motivational Interviewing. Labels: Support, Reflection, Confrontation, Facilitate, Question.

Can we exploit knowledge of constraints in the inference phase? Assume a sequence of n items (observations) and p labels, e.g., n tokens and p part-of-speech tags, or n tokens and p tags in an NER task. Brute force over all label sequences: O(p^n). Viterbi: O(n·p²). Can we go down further? Can we reduce the search space even more?
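For concreteness, a minimal Viterbi sketch (my own, not the paper's implementation) for a first-order model: it does O(p²) work per position, hence O(n·p²) overall, versus O(p^n) for brute-force enumeration. The callables `emission_score` and `transition_score` stand in for the model's local scores.

```python
def viterbi(tokens, labels, emission_score, transition_score):
    """Return the highest-scoring label sequence under a first-order model."""
    best = [{y: emission_score(tokens[0], y) for y in labels}]  # best[i][y]: best score ending in y at i
    back = [{}]                                                 # back[i][y]: best previous label
    for i in range(1, len(tokens)):
        best.append({})
        back.append({})
        for y in labels:                                        # p choices for the current label
            candidates = {
                y_prev: best[i - 1][y_prev]
                + transition_score(y_prev, y)
                + emission_score(tokens[i], y)
                for y_prev in labels                            # p choices for the previous label
            }
            y_best = max(candidates, key=candidates.get)
            best[i][y] = candidates[y_best]
            back[i][y] = y_best
    # Backtrack from the best final label.
    y = max(best[-1], key=best[-1].get)
    path = [y]
    for i in range(len(tokens) - 1, 0, -1):
        y = back[i][y]
        path.append(y)
    return list(reversed(path))
```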

Introducing constraints into the model. Let C1, C2, …, CK be the constraints, where each Ck : X × Y → {0, 1}. Constraints are of two types: hard (MUST be satisfied) and soft (can be relaxed). 1_C(x) denotes the set of label sequences that do NOT violate the constraints.
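To make the C : X × Y → {0, 1} definition concrete, here is a small sketch (my own illustration) of two constraints written as Boolean functions over a label sequence, plus a helper that treats them as hard constraints by filtering out violating sequences. The "at least one Reflection" constraint echoes the Motivational Interviewing example used later; the other is purely illustrative.

```python
def at_least_one_reflection(x, y):
    """C(x, y) = 1 iff the label sequence y contains at least one Reflection."""
    return int("Reflection" in y)

def no_label_repeated_consecutively(x, y):
    """Illustrative constraint: 1 iff no label appears twice in a row."""
    return int(all(a != b for a, b in zip(y, y[1:])))

def satisfies_all(x, y, constraints):
    """Hard constraints: keep only label sequences with C_k(x, y) = 1 for every k."""
    return all(c(x, y) == 1 for c in constraints)

def prune(x, candidates, constraints):
    """Reduce a candidate list to the label sequences allowed by the constraints."""
    return [y for y in candidates if satisfies_all(x, y, constraints)]
```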

Constraints come to the rescue. Say x out of the X possible tag sequences violate the constraints; the search space shrinks from X to X − x. How do we do inference then? Does Viterbi still help us?

Example: candidate label sequences S1–S3 over items A–G.

      A    B    C    D    E    F    G
S1    X1   X1   X1   X1   X1   X1   X1
S2    X10  X10  X10  X10  X10  X10  X10
S3    X11  X11  X11  X11  X11  X11  X11

Motivational Interviewing constraint: at least ONE Reflection.

Soft constraints: how do we calculate the distance to constraint satisfaction here? How do we learn the parameters?
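One way to make this precise (my reading of the constraint-driven formulation, so treat the exact notation as approximate): score a candidate labelling by the model score minus a penalty per soft constraint, where each penalty ρ_k is scaled by how far y is from satisfying constraint C_k, typically the minimal Hamming distance to any labelling in 1_{C_k}(x).

```latex
% Constraint-penalised inference objective for soft constraints (sketch):
% d(y, 1_{C_k}(x)) is the minimal (e.g. Hamming) distance from y to any
% labelling that satisfies C_k, and rho_k is the penalty weight for C_k.
y^{*} = \arg\max_{y}\; \sum_i w_i\,\phi_i(x, y)
        \;-\; \sum_{k=1}^{K} \rho_k\, d\!\left(y,\, 1_{C_k}(x)\right)
```

The feature weights are then re-estimated in the semi-supervised loop sketched at the end of these slides, while the constraint penalties are typically set by hand or tuned.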

Example: citation field segmentation. The ground truth segments the citation "Lars Ole Andersen. Program Analysis and Specialization for the C Programming Language. PhD Thesis, DIKU, University of Copenhagen, May 1994." into its author, title, and publication fields, but an HMM trained on scarce labelled data assigns incorrect field boundaries to the same string.

Top-K inference. We choose only the top K highest-scoring label sequences for each unlabeled example and add ALL of them to the training data. The authors used beam search decoding, but this can be done with any inference procedure. So from the unlabeled sample we produce labels and include the self-labelled examples in the training data. Choice: we may include only the high-confidence samples. Pitfall: then we do not really learn properly and miss out on some characteristics of the data.

Algorithm:
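The algorithm slide itself is not reproduced in this transcript. As a stand-in, here is a hedged sketch of the constraint-driven training loop as described on the preceding slides: initialise on the small labelled set, then repeatedly run constrained top-K inference on the unlabeled data, add the resulting self-labelled examples, and re-estimate the model, interpolating with the previous weights. The names `train_model`, `constrained_top_k`, and the mixing weight `gamma` are placeholders of mine, not the paper's notation.

```python
def codl_train(labelled, unlabelled, constraints,
               train_model, constrained_top_k,
               K=5, iterations=10, gamma=0.9):
    """Sketch of a constraint-driven semi-supervised training loop.

    labelled          : list of (x, y) pairs (the scarce gold data)
    unlabelled        : list of x (plentiful raw data)
    constraints       : list of functions C_k(x, y) -> {0, 1}
    train_model       : callable mapping a list of (x, y) pairs to a weight dict
    constrained_top_k : callable returning the K best label sequences for x
                        under the current weights after applying the constraints
    """
    weights = train_model(labelled)                    # supervised initialisation
    for _ in range(iterations):
        self_labelled = []
        for x in unlabelled:
            for y in constrained_top_k(weights, x, constraints, K):
                self_labelled.append((x, y))           # add ALL top-K labellings
        new_weights = train_model(labelled + self_labelled)
        # Interpolate with the previous weights so the model is not swamped
        # by its own noisy predictions.
        weights = {f: gamma * weights.get(f, 0.0) + (1 - gamma) * new_weights.get(f, 0.0)
                   for f in set(weights) | set(new_weights)}
    return weights
```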