Lecture 7: Constrained Conditional Models
Kai-Wei Chang, CS @ University of Virginia, kw@kwchang.net
Some slides are adapted from Vivek Srikumar's course on Structured Prediction
Where are we?
- Graphical models: Bayesian Networks, Markov Random Fields
- Formulations of structured output:
  - Joint models: Markov Logic Networks
  - Conditional models: Conditional Random Fields (again), Constrained Conditional Models
Outline
- Consistency of outputs and the value of inference
- Constrained conditional models via an example
- Hard constraints and Integer Programs
- Soft constraints
Consistency of outputs
Or: how to introduce knowledge into prediction
- Suppose we have a sequence labeling problem over a three-node chain (y1, y2, y3), where each output can be one of A or B.
- We want to add a condition: there should be no more than one B in the output.
- One option is a potential function that ensures this condition: a factor over (y1, y2, y3) whose table assigns a negative score (e.g., -1) to any assignment with more than one B.
- But should we learn what we can write down easily? Especially for such large, computationally cumbersome factors.
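As a concrete illustration, here is a minimal sketch of such a hand-written potential in Python (the function name and the exact penalty value are illustrative assumptions, not from the lecture):

```python
def at_most_one_B_potential(y1, y2, y3):
    """A hand-written factor over (y1, y2, y3): no learning needed.

    Returns 0 for assignments that satisfy the condition and a large
    negative score for assignments with more than one B, so any
    argmax-based inference will avoid them.
    """
    num_b = [y1, y2, y3].count("B")
    return 0.0 if num_b <= 1 else -1e9
```

Writing the factor down directly sidesteps learning a table over all 2^3 joint assignments.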
Another look at learning and inference
- Inference: a global decision comprising multiple local decisions and their inter-dependencies.
  (Note: if there are no inter-dependencies between decisions, we could just as well treat these as independent prediction problems.)
- Does learning need to be global too? Recall local vs. global learning:
  - Global learning: learn with inference.
  - Local learning: learn the local decisions independently and piece them together.
- Maybe we can learn sub-structures independently and piece them together.
Typical updates in global learning
- Stochastic gradient descent update for a CRF, for a training example (x_i, y_i):
      w ← w + η ( φ(x_i, y_i) − E_{y ~ P(y | x_i; w)}[ φ(x_i, y) ] )
- Structured perceptron, for a training example (x_i, y_i):
      ŷ = argmax_y w^T φ(x_i, y)
      w ← w + φ(x_i, y_i) − φ(x_i, ŷ)
- In both cases, the required inference step can be computed by a dynamic programming algorithm similar to Viterbi.
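As a sketch of what the perceptron update looks like in code (a minimal Python illustration; the names phi and argmax_inference are assumptions standing in for the feature function and the Viterbi-style inference routine):

```python
def structured_perceptron_update(w, x, y_gold, phi, argmax_inference):
    """One structured perceptron update for training example (x, y_gold).

    phi(x, y): joint feature vector (e.g., a numpy array) for input x
               and structure y.
    argmax_inference(w, x): highest-scoring structure under the current
               weights, computed e.g. by Viterbi for chain models.
    """
    y_hat = argmax_inference(w, x)   # inference inside the training loop
    if y_hat != y_gold:              # mistake-driven: update only on errors
        w = w + phi(x, y_gold) - phi(x, y_hat)
    return w
```

Note how inference sits inside the training loop, which is exactly the practical concern raised on the next slide.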
Why local learning?
Global learning may not always be feasible:
- Practical concerns: global learning is computationally prohibitive, since it requires inference within the training loop.
- Data issues: we may not have a single dataset that is fully annotated with the complete structure, but rather multiple datasets, each annotated with a subset of the structure.
But inference at deployment can still be global. (Recall the discussion from one-vs-all.)
Questions?
Combining local classifiers
What does inference do? Inference can "glue" together local decisions and enforce global coherence.
Setting: a graph with 3 possible node labels and 3 possible edge labels.
- Output: nodes and edges are labeled, and the blue and orange edges form a tree.
- Goal: find the highest scoring tree.
We need to ensure that the colored edges form a valid output (i.e., a tree).
Constraints at prediction time
- Introduce additional information that might not have been available at training time.
- Add domain knowledge. E.g., all part-of-speech tag sequences should contain a verb.
- Enforce coherence in the set of local decisions. E.g., the collection of decisions should form a tree.
Outline
- Consistency of outputs and the value of inference
- Constrained conditional models via an example
- Hard constraints and Integer Programs
- Soft constraints
Constrained Conditional Models
Inference consists of two components:
- Local classifiers: these are trained models, and their outputs may be collections of structures themselves.
- A set of constraints that restrict the space of joint assignments of the local classifiers: this is where background knowledge enters.
Prediction in a CCM
An example of how we can formalize prediction. Suppose the outputs form a three-node chain (y1, y2, y3) where each output can be one of A or B, with emission factors scored by w^T φ_E(x, y_i) and transition factors scored by w^T φ_T(y_i, y_{i+1}).
Typically, we have
    ŷ = argmax_y  Σ_i w^T φ_E(x, y_i) + Σ_i w^T φ_T(y_i, y_{i+1})
Or equivalently, using indicator functions (I_z = 1 if z is true, 0 otherwise), with one group of indicators per factor and one score per indicator:
    ŷ = argmax  Σ_{i,a} I_{y_i = a} score(y_i = a)  +  Σ_{i,a,b} I_{y_i = a ∧ y_{i+1} = b} score(y_i = a, y_{i+1} = b)
The scores for the decisions come from the trained classifiers: emissions score(y_i = a) = w^T φ_E(x, a), transitions score(y_i = a, y_{i+1} = b) = w^T φ_T(a, b).
This expression explicitly enumerates every decision that we need to make to build the final output.
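To make the indicator formulation concrete, here is a minimal Python sketch that evaluates this objective for a given joint assignment (the dictionary layout and names are assumptions for illustration):

```python
def ccm_objective(z_e, z_t, s_e, s_t):
    """Value of the indicator formulation: one 0/1 variable per decision,
    one score per indicator.

    z_e[i, a]    = 1 iff y_i = a                    (emission decisions)
    z_t[i, a, b] = 1 iff y_i = a and y_{i+1} = b    (transition decisions)
    s_e, s_t hold the corresponding scores from the trained classifiers.
    """
    emissions = sum(z_e[key] * s_e[key] for key in z_e)
    transitions = sum(z_t[key] * s_t[key] for key in z_t)
    return emissions + transitions
```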
Prediction in a CCM
Let us enumerate every decision that we need to make to build the final output:
      I_{y1=A} score(y1=A) + I_{y1=B} score(y1=B)
    + I_{y2=A} score(y2=A) + I_{y2=B} score(y2=B)
    + I_{y3=A} score(y3=A) + I_{y3=B} score(y3=B)
    + I_{y1=A ∧ y2=A} score(y1=A, y2=A) + I_{y1=A ∧ y2=B} score(y1=A, y2=B)
    + I_{y1=B ∧ y2=A} score(y1=B, y2=A) + I_{y1=B ∧ y2=B} score(y1=B, y2=B)
    + I_{y2=A ∧ y3=A} score(y2=A, y3=A) + I_{y2=A ∧ y3=B} score(y2=A, y3=B)
    + I_{y2=B ∧ y3=A} score(y2=B, y3=A) + I_{y2=B ∧ y3=B} score(y2=B, y3=B)
But not every combination of decisions is allowed:
- Only one of the two emission decisions at each position can exist.
- Only one of the four transition decisions at each factor can exist.
- Some decisions cannot exist together: the emission decisions and the transition decisions that mention the same position must agree.
Prediction in a CCM
But we should only consider valid label sequences. Predict with constraints:
- Each y_i can either be an A or a B.
- The emission decisions and the transition decisions should agree.
- There should be no more than one B in the output.
We can write all of these using linear constraints over the indicator variables, as sketched below. Note that the last constraint adds extra knowledge that was not present at training time.
Questions?
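For instance, the three conditions above can be written as the following linear (in)equalities over the indicators (a sketch in the notation of these slides):

```latex
\begin{align*}
&\text{Exactly one label per position:} && \sum_{a \in \{A,B\}} I_{y_i = a} = 1 \quad \forall i \\
&\text{Emissions and transitions agree:} && I_{y_i = a} = \sum_{b} I_{y_i = a \wedge y_{i+1} = b} \quad \forall i, a \\
& && I_{y_{i+1} = b} = \sum_{a} I_{y_i = a \wedge y_{i+1} = b} \quad \forall i, b \\
&\text{No more than one } B\text{:} && \sum_{i} I_{y_i = B} \le 1
\end{align*}
```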
Constrained Conditional Models with hard constraints
Global joint inference:
- Treat the output as a collection of decisions (one per factor/part), each associated with a score.
- In addition, allow arbitrary constraints involving the decisions.
Constraints:
- Inject domain knowledge into the prediction.
- Can be stated as logical statements.
- Can be transformed into linear inequalities over the decision variables.
Learn the scoring functions locally or globally, with or without constraints.
Outline
- Consistency of outputs and the value of inference
- Constrained conditional models via an example
- Hard constraints and Integer Programs
- Soft constraints
Inference with hard constraints
    ŷ = argmax_y w^T φ(x, y)   such that y is feasible
Same as: maximize the sum of (indicator × score) terms over the indicator variables I, such that the I's encode a feasible output.
- The feasibility conditions can be written as linear (in)equalities in the I's.
- The I's can only be 0 or 1.
- This is an Integer Linear Program: MAP inference can be written as an integer linear program (ILP).
- But solving ILPs is NP-complete in the worst case. So:
  - Use an off-the-shelf solver (e.g., Gurobi) and hope for the best, or
  - Write a specialized search algorithm (exact or approximate) if we know more about the problem. We will see examples of these.
A worked ILP example follows.
Questions?
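To make this concrete, here is a minimal sketch of the three-node chain from earlier as an ILP, written with the open-source PuLP modeling library (the scores are made-up numbers; in practice they would be the w^T φ values from the trained classifiers):

```python
from pulp import LpProblem, LpMaximize, LpVariable, LpBinary, lpSum

labels = ["A", "B"]
n = 3

# Made-up emission and transition scores for illustration.
s_e = {(0, "A"): 1.0, (0, "B"): 2.0,
       (1, "A"): 1.5, (1, "B"): 1.8,
       (2, "A"): 0.5, (2, "B"): 2.5}
s_t = {(a, b): 0.3 if a == b else 0.1 for a in labels for b in labels}

prob = LpProblem("ccm_chain", LpMaximize)

# One binary indicator per decision: emissions z and transitions u.
z = {(i, a): LpVariable(f"z_{i}_{a}", cat=LpBinary)
     for i in range(n) for a in labels}
u = {(i, a, b): LpVariable(f"u_{i}_{a}_{b}", cat=LpBinary)
     for i in range(n - 1) for a in labels for b in labels}

# Objective: one score per indicator.
prob += (lpSum(s_e[i, a] * z[i, a] for i in range(n) for a in labels)
         + lpSum(s_t[a, b] * u[i, a, b]
                 for i in range(n - 1) for a in labels for b in labels))

# Each position takes exactly one label.
for i in range(n):
    prob += lpSum(z[i, a] for a in labels) == 1

# Emission and transition decisions must agree.
for i in range(n - 1):
    for a in labels:
        prob += lpSum(u[i, a, b] for b in labels) == z[i, a]
        prob += lpSum(u[i, b, a] for b in labels) == z[i + 1, a]

# Domain knowledge: no more than one B in the output.
prob += lpSum(z[i, "B"] for i in range(n)) <= 1

prob.solve()
print([a for i in range(n) for a in labels if z[i, a].value() > 0.5])
```

With these made-up scores, the unconstrained maximizer would be B B B; the "no more than one B" constraint pushes the solution to A A B instead.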
ILP inference
- Can introduce domain knowledge in the form of constraints.
- Any Boolean expression over the inference variables can be written as linear inequalities (see the examples below).
- A uniform language for formulating and reasoning about inference.
- But this has not made the problem any easier to solve: by allowing easy addition of constraints, it may be simple to write down provably intractable inference formulations. (Off-the-shelf solvers seem to work admirably in practice!)
Questions?
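A few standard translations (a sketch; each I is the 0/1 indicator of the corresponding decision):

```latex
\begin{align*}
\text{Implication } p \Rightarrow q:\qquad & I_p \le I_q \\
\text{Disjunction } p \vee q:\qquad & I_p + I_q \ge 1 \\
\text{Conjunction } p \wedge q:\qquad & I_p + I_q \ge 2 \\
\text{``Every tag sequence contains a verb'':}\qquad & \sum_i I_{y_i = \mathrm{verb}} \ge 1 \\
\text{``No more than one } B\text{'':}\qquad & \sum_i I_{y_i = B} \le 1
\end{align*}
```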
Outline
- Consistency of outputs and the value of inference
- Constrained conditional models via an example
- Hard constraints and Integer Programs
- Soft constraints
Constraints may not always hold
"Every car should have a wheel" [figure: one car for which this holds (yes), and one for which it does not (no)]
Constraints don't always hold, so allow the model to consider structures that violate them.
Constrained Conditional Models: the general case, with soft constraints
Suppose we have a collection of "soft constraints" C_1, C_2, ..., each associated with a penalty for violation ρ_1, ρ_2, .... That is, a constraint C_k is a Boolean expression over the output, and an assignment that violates this constraint has to pay a penalty of ρ_k.
Prediction becomes:
    ŷ = argmax_y  w^T φ(x, y) − Σ_k ρ_k d(y, C_k)
where
- w^T φ(x, y) is the model score for the structure,
- ρ_k is the penalty for violating the constraint (if the constraint is hard, the penalty is ∞), and
- d(y, C_k) measures how far the structure y is from satisfying the constraint C_k. The simplest case: 0 if satisfied, 1 if not.
A brute-force illustration of this prediction rule follows.
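The following minimal Python sketch spells out the soft-constrained prediction rule by brute force over a tiny chain (illustrative only; it reuses the (position, label) and (label, label) score-dictionary layout from the ILP sketch above, and real systems would use an ILP solver or specialized search instead of enumeration):

```python
from itertools import product

def predict_soft(s_e, s_t, constraints, labels=("A", "B"), n=3):
    """Soft-constrained prediction by exhaustive enumeration.

    s_e[(i, a)]: emission score for label a at position i.
    s_t[(a, b)]: transition score for the label pair (a, b).
    constraints: list of (violated, rho) pairs, where violated(y) -> bool
                 and rho is the penalty paid when the constraint is violated.
    """
    best, best_score = None, float("-inf")
    for y in product(labels, repeat=n):
        score = sum(s_e[i, y[i]] for i in range(n))
        score += sum(s_t[y[i], y[i + 1]] for i in range(n - 1))
        # Subtract rho_k for every violated soft constraint C_k.
        score -= sum(rho for violated, rho in constraints if violated(y))
        if score > best_score:
            best, best_score = y, score
    return best

# A soft version of "no more than one B", with a made-up penalty rho = 2.0.
constraints = [(lambda y: sum(label == "B" for label in y) > 1, 2.0)]
```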
Constrained Conditional Models: review
- Write down conditions that the output needs to satisfy. Constraints are effectively factors in a factor graph whose potential functions are fixed.
- Different learning regimes: train with the constraints or without. Remember: constraint penalties are fixed in either case.
- Prediction: write the inference formulation as an integer linear program, and solve it with an off-the-shelf solver (or not: use a specialized algorithm).
- Extension: soft constraints, i.e., constraints that don't always need to hold.
Questions?
Structured prediction: general formulations
All the discussion so far has been about representation. Forthcoming lectures: algorithms for training and inference.
- Graphical models: Bayesian Networks, Markov Random Fields
- Formulations of structured output:
  - Joint models: Markov Logic Networks
  - Conditional models: Conditional Random Fields (again), Constrained Conditional Models