Inference and Learning via Integer Linear Programming
Vasin Punyakanok, Dan Roth, Scott Yih, Dav Zimak
Outline
Problem Definition
Integer Linear Programming (ILP) and its generality
Learning and Inference via ILP
Experiments
Extension to hierarchical learning
Future Direction: Hidden Variables

Notes: Pose the inference task (finding an assignment to the variable set) as an ILP:
- Cost function: defined by a set of learned classifiers
- Constraints: maintain the structure of the solution
Doing this (1) allows MANY constraints in inference and (2) sets up a framework where different learning methods can be used; we compare two natural algorithms, independent vs. global training.
Experiments: independent training is sometimes better on easy problems; global training is better for difficult problems.
Extension: when tasks are dependent (one on the next) and classification is done in levels.
Future direction: learning with hidden variables.
Problem Definition
X = (X1, ..., Xk) ∈ X^k = X
Y = (Y1, ..., Yl) ∈ Y^l = Y
Given X = x, find Y = y
Notation: capital letters denote variables; lowercase letters denote values; bold denotes vectors or matrices; X and Y denote the domain sets.
Example (Text Chunking)
y = NP ADJP VP ADVP VP
x = The guy presenting now is so tired
Classifiers
A classifier h: X × Y^(l-1) × Y × {1,..,l} → R
Example:
score(x, y-3, NP, 3) = 0.3
score(x, y-3, VP, 3) = 0.5
score(x, y-3, ADVP, 3) = 0.2
score(x, y-3, ADJP, 3) = 1.2
score(x, y-3, NULL, 3) = 0.1
Inference
Goal: x → y
Given: an input x; score(x, y-t, y, t) for all (y-t, y) ∈ Y^l and t ∈ {1,..,l}; C, a set of constraints over Y
Find: y that maximizes the global function score(x,y) = Σ_t score(x, y-t, y_t, t) and satisfies the constraints C
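Before turning to the ILP formulation, a minimal brute-force sketch of this inference goal, assuming generic stand-ins: `score` is the learned scoring function and `constraints` is a list of predicates over the full assignment (both hypothetical names, not the authors' code).

```python
from itertools import product

def infer_bruteforce(x, labels, l, score, constraints):
    """Enumerate all y in Y^l, keep those satisfying the constraints,
    and return the one maximizing sum_t score(x, y_-t, y_t, t)."""
    best_y, best_val = None, float("-inf")
    for y in product(labels, repeat=l):
        if not all(c(y) for c in constraints):
            continue
        # y[:t] + y[t+1:] plays the role of y_-t (the assignment minus position t)
        val = sum(score(x, y[:t] + y[t+1:], y[t], t) for t in range(l))
        if val > best_val:
            best_y, best_val = y, val
    return best_y
```

This is exponential in l; the point of the ILP formulation on the next slides is to hand exactly this maximization to a solver instead.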
Integer Linear Programming
Boolean variables: U = (U1,...,Ud) ∈ {0,1}^d
Cost vector: p = (p1,…,pd) ∈ R^d
Cost function: p·U
Constraint matrix: c ∈ R^(e×d)
Maximize p·U subject to cU ≥ 0 (constraints such as cU = 0 or cU ≥ 3 are also possible)
ILP (Example)
U = (U1, U2, U3)
p = (0.3, 0.5, 0.8)
c =
  1   2   3
 -1  -2   2
 -1  -2   2
  0  -3   2
Maximize p·U subject to cU ≥ 0
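A sketch of this toy ILP using the PuLP modeling library (an assumption on my part; any ILP solver would do). The matrix rows are as reconstructed above.

```python
import pulp

p = [0.3, 0.5, 0.8]
c = [[1, 2, 3], [-1, -2, 2], [-1, -2, 2], [0, -3, 2]]

prob = pulp.LpProblem("toy_ilp", pulp.LpMaximize)
U = [pulp.LpVariable(f"U{i+1}", cat="Binary") for i in range(3)]
prob += pulp.lpSum(p[i] * U[i] for i in range(3))            # cost function p.U
for row in c:                                                 # constraints cU >= 0
    prob += pulp.lpSum(row[i] * U[i] for i in range(3)) >= 0
prob.solve()
print([int(u.value()) for u in U])
```

With these numbers the solver returns (1, 0, 1): setting U2 = 1 would violate the second and third constraint rows.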
Boolean Functions as Linear Constraints
Conjunction: U1 ∧ U2 ∧ U3 ⇔ U1 = 1, U2 = 1, U3 = 1
Disjunction: U1 ∨ U2 ∨ U3 ⇔ U1 + U2 + U3 ≥ 1
CNF: (U1 ∨ U2) ∧ (U3 ∨ U4) ⇔ U1 + U2 ≥ 1, U3 + U4 ≥ 1
Text Chunking
Indicator variables: U_{1,NP}, U_{1,NULL}, U_{2,VP}, ... correspond to y1 = NP, y1 = NULL, y2 = VP, ...
U_{1,NP} indicates that phrase 1 is labeled NP
Cost vector: p_{1,NP} = score(x, NP, 1), p_{1,NULL} = score(x, NULL, 1), p_{2,VP} = score(x, VP, 2), ...
Then p·U = score(x,y) = Σ_t score(x, y_t, t), subject to constraints
Structural Constraints
Coherency: y_t can take only one value: Σ_{y ∈ {NP,...,NULL}} U_{t,y} = 1
Non-overlapping: if phrases 1 and 2 overlap, at least one must be NULL: U_{1,NULL} + U_{2,NULL} ≥ 1
Linguistic Constraints
Every sentence must have at least one VP: Σ_t U_{t,VP} ≥ 1
Every sentence must have at least one NP: Σ_t U_{t,NP} ≥ 1
...
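Putting the last three slides together, a sketch (again with PuLP, assumed available) of the chunking ILP: indicator variables U[t, y], the coherency constraint, a non-overlap constraint for each overlapping pair of candidate phrases, and the two linguistic constraints. `scores[t][y]` is a hypothetical stand-in for score(x, y, t), and `overlapping_pairs` is assumed to be precomputed from the candidate spans.

```python
import pulp

LABELS = ["NP", "VP", "ADVP", "ADJP", "NULL"]

def chunking_ilp(scores, overlapping_pairs):
    T = range(len(scores))
    prob = pulp.LpProblem("chunking", pulp.LpMaximize)
    U = {(t, y): pulp.LpVariable(f"U_{t}_{y}", cat="Binary") for t in T for y in LABELS}
    prob += pulp.lpSum(scores[t][y] * U[t, y] for t in T for y in LABELS)  # p.U
    for t in T:                                     # coherency: exactly one label per phrase
        prob += pulp.lpSum(U[t, y] for y in LABELS) == 1
    for t1, t2 in overlapping_pairs:                # overlapping phrases: at least one is NULL
        prob += U[t1, "NULL"] + U[t2, "NULL"] >= 1
    prob += pulp.lpSum(U[t, "VP"] for t in T) >= 1  # at least one VP
    prob += pulp.lpSum(U[t, "NP"] for t in T) >= 1  # at least one NP
    prob.solve()
    return {t: next(y for y in LABELS if U[t, y].value() > 0.5) for t in T}
```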
Interacting Classifiers
The classifier for an output y_t uses the other outputs y-t as inputs: score(x, y-t, y, t)
We need to ensure that the final output from the ILP is computed from a consistent y
Introduce additional variables and additional coherency constraints
Interacting Classifiers
Additional variables: U_{Y=y} for every possible assignment y (i.e., each (y-t, y))
Additional coherency constraints: U_{Y=y} = 1 iff U_{t,y_t} = 1 for all y_t in y
Σ_{y_t in y} U_{t,y_t} − U_{Y=y} ≤ l − 1
Σ_{y_t in y} U_{t,y_t} − l·U_{Y=y} ≥ 0
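A tiny check of these two coherency constraints: over all 0/1 settings, every feasible setting has the "whole assignment" indicator equal to 1 exactly when all l component indicators are 1. Pure Python, no solver needed; l = 3 is an arbitrary illustrative choice.

```python
from itertools import product

l = 3
for bits in product([0, 1], repeat=l + 1):
    *u, u_Y = bits                              # u = component indicators, u_Y = U_{Y=y}
    feasible = (sum(u) - u_Y <= l - 1) and (sum(u) - l * u_Y >= 0)
    if feasible:
        assert u_Y == int(all(u)), (u, u_Y)     # the "iff" holds on every feasible setting
```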
Learning Classifiers
score(x, y-t, y, t) = f_y(x, y-t, t): a separate scoring function f_y for each y ∈ Y
Learn f_y for all y ∈ Y (multi-class learning)
Each example (x,y) yields the training instances {((x, y-t, t), y_t)} for t = 1..l
Learn each classifier independently
Learn with Inference Feedback
Learn by observing global behavior. For each example (x,y):
Make a prediction with the current classifiers and ILP: y’ = argmax_y Σ_t score(x, y-t, y_t, t)
For each t, update: if y’_t ≠ y_t, promote score(x, y-t, y_t, t) and demote score(x, y’-t, y’_t, t)
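A minimal sketch of this training loop, assuming the scores are linear: score(x, y-t, y, t) = w_y·phi(x, y-t, t). The feature function `phi`, the inference routine `infer_ilp`, and the sparse weight layout are hypothetical stand-ins, not the authors' exact implementation.

```python
def train_with_inference_feedback(examples, labels, phi, infer_ilp, epochs=10):
    w = {y: {} for y in labels}                       # one sparse weight vector per label

    def score(x, y_minus_t, y, t):
        feats = phi(x, y_minus_t, t)
        return sum(w[y].get(f, 0.0) * v for f, v in feats.items())

    def update(vec, feats, scale):
        for f, v in feats.items():
            vec[f] = vec.get(f, 0.0) + scale * v

    for _ in range(epochs):
        for x, y in examples:                         # y is the gold label sequence
            y_pred = infer_ilp(x, score)              # global prediction with current weights
            for t, (yt, yt_pred) in enumerate(zip(y, y_pred)):
                if yt_pred != yt:
                    update(w[yt], phi(x, y[:t] + y[t+1:], t), +1.0)                  # promote
                    update(w[yt_pred], phi(x, y_pred[:t] + y_pred[t+1:], t), -1.0)   # demote
    return w
```

Learning each classifier independently (previous slide) corresponds to replacing `infer_ilp` with per-position predictions that ignore the constraints.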
Experiments
Semantic Role Labeling
Assume correct argument boundaries are given
Only sentences with more than 5 arguments are included
Experimental Results (Winnow and Perceptron)
For the difficult task: inference feedback during training improves performance
For the easy task: learning without inference feedback is better
Conservative Updating
Update only if necessary
Example: constraint U1 + U2 = 1; predicted (U1, U2) = (1, 0); correct (U1, U2) = (0, 1)
Naive feedback: demote class 1, promote class 2
But U1 = 0 already forces U2 = 1 through the constraint, so it suffices to demote class 1 only
Conservative Updating
S = minset(constraints): the set of functions that, if changed, would make the global prediction correct
Promote (demote) only those functions in the minset S
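One way to read "minset" is: the smallest set of outputs that, once corrected, lets the constraints force the rest of the gold assignment. A brute-force illustration of that reading (my interpretation, exponential in the number of outputs, shown only to make the (U1, U2) example concrete):

```python
from itertools import combinations, product

def minset(pred, gold, satisfies):
    """Smallest index set S such that fixing the outputs in S to their gold values,
    together with the constraints, admits exactly one completion: the gold assignment."""
    n = len(pred)
    for k in range(n + 1):
        for S in combinations(range(n), k):
            completions = [y for y in product([0, 1], repeat=n)
                           if satisfies(y) and all(y[i] == gold[i] for i in S)]
            if completions == [gold]:
                return S
    return tuple(range(n))

# Slide example: constraint U1 + U2 = 1, predicted (1, 0), correct (0, 1).
print(minset((1, 0), (0, 1), lambda y: y[0] + y[1] == 1))   # -> (0,): only class 1 is updated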
Hierarchical Learning
Given x, compute hierarchically:
z1 = h1(x), z2 = h2(x, z1), …, y = h_{s+1}(x, z1, …, zs)
Assume all z are known in training
Hierarchical Learning
Assume each h_j can be computed via ILP with p_j, U_j, c_j
y = argmax_y max_{z1,…,zs} Σ_j λ_j p_j·U_j
subject to c1U1 ≥ 0, c2U2 ≥ 0, …, c_{s+1}U_{s+1} ≥ 0
where each λ_j is a constant large enough to preserve the hierarchy
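A sketch of combining the level-wise ILPs into one weighted objective, again with PuLP (assumed available). Each level j contributes its own variables U_j, cost vector p_j, and constraint rows c_j; the weights `lam` play the role of the λ_j above. The coupling constraints that tie one level's output to the next level's input are omitted here; all names are illustrative, not the authors' code.

```python
import pulp

def hierarchical_ilp(levels, lam):
    # levels: list of (p, c) pairs; p is a cost vector, c a list of constraint rows
    prob = pulp.LpProblem("hierarchical", pulp.LpMaximize)
    all_U, objective = [], 0
    for j, (p, c) in enumerate(levels):
        U = [pulp.LpVariable(f"U_{j}_{i}", cat="Binary") for i in range(len(p))]
        all_U.append(U)
        objective += lam[j] * pulp.lpSum(p[i] * U[i] for i in range(len(p)))  # lambda_j * p_j.U_j
        for row in c:                                                         # c_j U_j >= 0
            prob += pulp.lpSum(row[i] * U[i] for i in range(len(p))) >= 0
    prob += objective
    prob.solve()
    return [[int(u.value() > 0.5) for u in U] for U in all_U]
```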
Hidden Variables
Given x, y = h(x, z), where z is not known in training
y = argmax_y max_z Σ_t score(x, z, y-t, y_t, t), subject to some constraints
Learning with Hidden Variables
Truncated-EM-style learning. For each example (x,y):
Compute z with the current classifiers and ILP: z = argmax_z Σ_t score(x, z, y-t, y_t, t)
Make a prediction with the current classifiers and ILP: (y’, z’) = argmax_{y,z} Σ_t score(x, z, y-t, y_t, t)
For each t, update: if y’_t ≠ y_t, promote score(x, z, y-t, y_t, t) and demote score(x, z’, y’-t, y’_t, t)
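A rough sketch of this truncated-EM-style loop. `infer_z` fixes the gold y and fills in z with the current classifiers and ILP; `infer_yz` jointly predicts (y’, z’); `promote`/`demote` are the same kind of hypothetical per-classifier updates as in the inference-feedback sketch earlier.

```python
def train_with_hidden_variables(examples, infer_z, infer_yz, promote, demote, epochs=10):
    for _ in range(epochs):
        for x, y in examples:
            z = infer_z(x, y)               # z = argmax_z sum_t score(x, z, y_-t, y_t, t)
            y_pred, z_pred = infer_yz(x)    # (y', z') = argmax_{y,z} sum_t score(x, z, y_-t, y_t, t)
            for t, (yt, yt_pred) in enumerate(zip(y, y_pred)):
                if yt_pred != yt:
                    promote(x, z, y, t)           # promote score(x, z, y_-t, y_t, t)
                    demote(x, z_pred, y_pred, t)  # demote score(x, z', y'_-t, y'_t, t)
```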
Conclusion
ILP is: powerful, general, learnable, useful, fast (or at least not too slow), extendable
Boolean Functions as Linear Constraints
Conjunction a ∧ b ∧ c: Ua + Ub + Uc ≥ 3
Disjunction a ∨ b ∨ c: Ua + Ub + Uc ≥ 1
DNF (a ∧ b) ∨ (c ∧ d): I_ab + I_cd ≥ 1, introducing new variables I_ab, I_cd
Helper Variables
We must link Ia, Ib, and I_ab so that I_ab ⇔ a ∧ b, i.e., I_ab ⇔ Ia ∧ Ib:
2·I_ab ≤ Ia + Ib (I_ab = 1 only if both Ia and Ib are 1)
Ia + Ib − I_ab ≤ 1 (if both Ia and Ib are 1, then I_ab = 1)
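A sketch of encoding the DNF constraint (a ∧ b) ∨ (c ∧ d) with these helper variables in PuLP (assumed available; variable names are illustrative, and the placeholder objective stands in for the real cost vector from the earlier slides).

```python
import pulp

prob = pulp.LpProblem("dnf_example", pulp.LpMaximize)
Ia, Ib, Ic, Id, Iab, Icd = (pulp.LpVariable(n, cat="Binary")
                            for n in ["Ia", "Ib", "Ic", "Id", "Iab", "Icd"])
prob += 0 * Ia                      # placeholder objective; the real cost function goes here
prob += 2 * Iab <= Ia + Ib          # I_ab = 1 only if both Ia and Ib are 1
prob += Ia + Ib - Iab <= 1          # if both Ia and Ib are 1, then I_ab = 1
prob += 2 * Icd <= Ic + Id
prob += Ic + Id - Icd <= 1
prob += Iab + Icd >= 1              # the DNF itself: ab OR cd must hold
```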
Semantic Role Labeling
Indicator variables: Ia, Ib, Ic, ... correspond to ph1 = A0, ph1 = A1, ph2 = A0, ...
Ia indicates that phrase 1 is labeled A0
Cost vector: pa = score(ph1 = A0), pb = score(ph1 = A1), ...
So pa·Ia = 0.3 if Ia = 1 and 0 otherwise
Learning
X = (X1, ..., Xk) ∈ X1 × … × Xk = X
Y-t = (Y1, ..., Y_{t-1}, Y_{t+1}, ..., Yl) ∈ Y1 × … × Y_{t-1} × Y_{t+1} × … × Yl = Y-t
Yt ∈ Yt
Given X = x and Y-t = y-t, find Yt = yt, or a score for each possible yt:
X × Y-t → Yt, or X × Y-t × Yt → R
SRL via Generalized Inference
Outline
Find potential argument candidates
Classify arguments into types
Inference for the argument structure: integer linear programming (ILP), cost function, constraints
Features

Notes: We follow a now seemingly standard approach to SRL. Given a sentence, we first find a set of potential argument candidates by identifying which words are at the border of an argument. Then, once we have a set of potential arguments, we use a suite of classifiers to tell us how likely each argument is to be of each type. Finally, we use all of the information we have so far to find the assignment of types to arguments that gives us the “optimal” global assignment. Similar approaches (with similar results) use inference procedures tied to their representation. Instead, we use a general inference procedure by setting up the problem as a linear programming problem. This is really where our technique allows us to apply powerful information that similar approaches cannot.
Find Potential Arguments
Example: I left my nice pearls to her
Every chunk can be an argument, so restrict the set of potential arguments:
BEGIN(word) = 1: “word begins an argument”
END(word) = 1: “word ends an argument”
(wi, ..., wj) is a potential argument iff BEGIN(wi) = 1 and END(wj) = 1
This reduces the set of potential arguments
(Slide figure: the example sentence with candidate span brackets [ [ [ [ [ ] ] ] ] ])
Details...
Learn a function BEGIN(word): B(word, context, structure) → {0,1}
Learn a function END(word): E(word, context, structure) → {0,1}
POTARG = {arg | BEGIN(first(arg)) and END(last(arg))}
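A small sketch of candidate generation from the BEGIN/END classifiers. `begin(i)` and `end(j)` are hypothetical stand-ins for the learned B(word, context, structure) and E(word, context, structure) functions, returning 0/1 for the word at each position.

```python
def potential_arguments(words, begin, end):
    """POTARG: all spans (i, j) whose first word begins and last word ends an argument."""
    candidates = []
    for i in range(len(words)):
        if not begin(i):
            continue
        for j in range(i, len(words)):
            if end(j):
                candidates.append((i, j))   # argument spans words[i..j]
    return candidates
```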
Argument Type Likelihood
Assign a type likelihood: how likely is it that argument a is of type t?
For all a ∈ POTARG, t ∈ T, estimate P(argument a = type t)
Example (I left my nice pearls to her), likelihoods for two of the candidates:
            A0    CA1   A1    Ø
candidate 1: 0.3   0.2   0.2   0.3
candidate 2: 0.6   0.0   0.0   0.4
Details...
Learn a classifier ARGTYPE(arg): P(arg) over {A0, A1, ..., CA0, ..., LOC, ...}
Predict argmax_{t ∈ {A0, A1, ..., CA0, ..., LOC, ...}} w_t·P(arg)
Estimate probabilities: P(a = t) = w_t·P(a) / Z
What is a Good Assignment?
Likelihood of being correct: P(arg a = type t), if t is the correct type for argument a
For a set of arguments a1, a2, ..., an, the expected number of correct arguments is Σ_i P(ai = ti)
We search for the assignment with the maximum expected number correct
Inference
Maximize the expected number correct: T* = argmax_T Σ_i P(ai = ti), subject to structural and linguistic constraints
Example (I left my nice pearls to her), type likelihoods for the four candidates (columns A0, CA1, A1, Ø):
candidate 1: 0.3  0.2  0.2  0.3
candidate 2: 0.6  0.0  0.0  0.4
candidate 3: 0.1  0.3  0.5  0.1
candidate 4: 0.1  0.2  0.3  0.4
Independent max: cost = 0.3 + 0.6 + 0.5 + 0.4 = 1.8
Best non-overlapping assignment: cost = 0.3 + 0.4 + 0.5 + 0.4 = 1.6
Best assignment satisfying both the non-overlapping and linguistic constraints: cost = 0.3 + 0.4 + 0.3 + 0.4 = 1.4
Everything is Linear
Cost function: Σ_{a ∈ POTARG} P(a = t_a) = Σ_{a ∈ POTARG, t ∈ T} P(a = t)·I_{a,t}
Constraints:
Non-overlapping: if a and a’ overlap, at least one must be Ø: I_{a,Ø} + I_{a’,Ø} ≥ 1
Linguistic: CA0 implies A0: Σ_a I_{a,CA0} ≤ Σ_a I_{a,A0}
This is an integer linear program
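A sketch of the resulting SRL inference ILP in PuLP (assumed available): indicators I[a, t], an objective equal to the expected number of correct arguments, coherency (each candidate gets exactly one type, possibly the null type Ø), non-overlap, and the CA0 ⇒ A0 linguistic constraint. `probs[a][t]` stands in for P(a = t), and `types` is assumed to include "A0", "CA0", and the null type "Ø".

```python
import pulp

NULL = "Ø"

def srl_ilp(probs, types, overlapping_pairs):
    args = range(len(probs))
    prob = pulp.LpProblem("srl", pulp.LpMaximize)
    I = {(a, t): pulp.LpVariable(f"I_{a}_{k}", cat="Binary")
         for a in args for k, t in enumerate(types)}
    prob += pulp.lpSum(probs[a][t] * I[a, t] for a in args for t in types)   # expected correct
    for a in args:                                    # coherency: exactly one type per candidate
        prob += pulp.lpSum(I[a, t] for t in types) == 1
    for a, b in overlapping_pairs:                    # overlapping candidates: at least one is null
        prob += I[a, NULL] + I[b, NULL] >= 1
    # CA0 implies A0 (analogous constraints would cover the other continuation types)
    prob += pulp.lpSum(I[a, "CA0"] for a in args) <= pulp.lpSum(I[a, "A0"] for a in args)
    prob.solve()
    return {a: next(t for t in types if I[a, t].value() > 0.5) for a in args}
```

Run on the four-candidate example from the inference slide, this returns the constrained assignment with cost 1.4 rather than the unconstrained independent max of 1.8.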
Features are Important Here, a discussion of the features should go. Which are most important? Comparison to other people.