
Advanced ML: Inference
Lecture 11: Summary
Kai-Wei Chang, CS @ University of Virginia (kw@kwchang.net)
Some slides are adapted from Vivek Srikumar's course on Structured Prediction.

This lecture
- What is a structure?
- A survey of the terrain we have covered
- ML for inter-dependent variables

Recall: What is structure?
A structure is a concept that can be applied to any complex thing, whether it be a bicycle, a commercial company, or a carbon molecule. By complex, we mean:
- It is divisible into parts,
- There are different kinds of parts,
- The parts are arranged in a specifiable way, and
- Each part has a specifiable function in the structure of the thing as a whole.
From the book Analysing Sentences: An Introduction to English Syntax by Noel Burton-Roberts, 1986.

An example task: Semantic Parsing
"Find the largest state in the US"
[The slide shows the inventory of building blocks for queries: templates such as SELECT expression FROM table WHERE condition, SELECT expression FROM table, DELETE FROM table WHERE condition, MAX(numeric list), ORDERBY predicate, and Expression1 = Expression2, plus lexicon entries name, size, population, capital, state and the tables US_STATES, US_CITIES.]

A plausible strategy to build the query
"Find the largest state in the US"
[A sequence of slides incrementally assembles the query from the building blocks: start with the template SELECT expression FROM table WHERE condition; fill the table slot with US_STATES and the expression slot with name; expand the condition to Expression1 = Expression2 with size on one side and a nested SELECT with MAX over a numeric list on the other. Or perhaps population?]

At each step there are many, many decisions to make.
- Some decisions are simply not allowed: a query has to be well formed!
- Even so, many possible options remain: Why does "Find" map to SELECT? Largest by size, population, or population of the capital?

Standard classification tools can't predict structures
X: "Find the largest state in the US."
Y: SELECT name FROM us_states WHERE size = (SELECT MAX(size) FROM us_states)
Classification is about making one decision: spam or not spam, predict one label, etc.

Standard classification tools can't predict structures
X: "Find the largest state in the US."
Y: SELECT name FROM us_states WHERE size = (SELECT MAX(size) FROM us_states)
We need to make multiple decisions:
- Each part needs a label: e.g., should "US" be mapped to us_states or us_cities?
- The decisions interact with each other: if the outer FROM clause talks about the table us_states, then the inner FROM clause should not talk about utah_counties.
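To make the interaction concrete, here is a minimal sketch (with made-up scores, not from the lecture) showing that picking each decision independently can produce an inconsistent output, while a joint argmax over compatible assignments cannot:

```python
# Hypothetical scores for two interacting decisions: which table the
# outer and the inner FROM clauses should refer to.
outer_scores = {"us_states": 0.9, "us_cities": 0.6}
inner_scores = {"us_states": 0.5, "us_cities": 0.7}

def compatible(outer, inner):
    # Constraint: both FROM clauses must talk about the same table.
    return outer == inner

# Making each decision independently violates the constraint:
independent = (max(outer_scores, key=outer_scores.get),
               max(inner_scores, key=inner_scores.get))
print(independent)  # ('us_states', 'us_cities') -- inconsistent

# Jointly maximizing over compatible assignments gives a coherent output:
joint = max(((o, i) for o in outer_scores for i in inner_scores
             if compatible(o, i)),
            key=lambda oi: outer_scores[oi[0]] + inner_scores[oi[1]])
print(joint)  # ('us_states', 'us_states')
```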

How did we get here?
Binary classification: learning algorithms; prediction is easy (threshold); features (???).
Multiclass classification: different strategies (one-vs-all, all-vs-all, global learning algorithms); one feature vector per outcome; each outcome scored; prediction = highest-scoring outcome.
Structured classification: global models or local models; each outcome scored; prediction = highest-scoring outcome. Inference is no longer easy! That makes all the difference.

Structured output is…
- Representation: a graph, possibly labeled and/or directed, possibly from a restricted family such as chains, trees, etc.; a discrete representation of the input (e.g., a table, the SRL frame output, a sequence of labels)
- Procedural: a collection of inter-dependent decisions (e.g., the sequence of decisions used to construct the output)
- Formally: the result of a combinatorial optimization problem, argmax_{y ∈ Y} score(x, y)

Challenges with structured output
Two challenges:
- We cannot train a separate weight vector for each possible inference outcome (for multiclass, we could train one weight vector for each label).
- We cannot enumerate all possible structures for inference (inference for binary/multiclass is easy).

Challenges with structured output
Solution: decompose the output into parts that are labeled, and define (as sketched below):
- how the parts interact with each other,
- how labels are scored for each part, and
- an inference algorithm to assign labels to all the parts.
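As a sketch of this decomposition (the part representation and names are illustrative, not from the lecture), the score of a full structure becomes a sum of local part scores, so we never need parameters for every complete structure:

```python
# Score of a full structure = sum of scores of its labeled parts.
# `y` is represented as a list of (part, label) pairs, and `score_part`
# is a user-supplied local scoring function.
def score_structure(x, y, score_part):
    return sum(score_part(x, part, label) for part, label in y)
```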

Multiclass as a structured output
A structure is…
- a graph (in general, a hypergraph), possibly labeled and/or directed
- a collection of inter-dependent decisions
- the output of a combinatorial optimization problem, argmax_{y ∈ outputs} score(x, y)
Multiclass:
- a graph with one node and no edges; the node label is the output
- can be composed via multiple decisions
- winner-take-all: argmax_i w^T φ(x, i) (sketched below)
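Winner-take-all multiclass prediction can be written exactly in this argmax form. A minimal sketch, where the block-structured joint feature map phi is one common choice assumed here for illustration:

```python
import numpy as np

def phi(x, i, num_classes):
    # Joint feature map: place x in the block belonging to class i.
    f = np.zeros(len(x) * num_classes)
    f[i * len(x):(i + 1) * len(x)] = x
    return f

def predict(w, x, num_classes):
    # Winner-take-all: argmax_i  w^T phi(x, i)
    return max(range(num_classes), key=lambda i: w @ phi(x, i, num_classes))

w = np.random.randn(2 * 3)   # weights for 3 classes over 2 input features
print(predict(w, np.array([1.0, -0.5]), num_classes=3))
```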

Multiclass is a structure: Implications
- A lot of the ideas from multiclass may be generalized to structures; not always trivial, but useful to keep in mind.
- Broad statements about structured learning must apply to multiclass classification; useful as a sanity check, and also for understanding.
- Binary classification is the most "trivial" form of structured classification: multiclass with two classes.

The machine learning of interdependent variables

Computational issues
- Model definition: What are the parts of the output? What are the inter-dependencies?
- How to train the model?
- How to do inference?
- Data annotation difficulty
- Background knowledge about the domain
- Semi-supervised/indirectly supervised?

We take these in turn, starting with model definition.

What does it mean to define the model?
Say we want to predict four output variables y1, y2, y3, y4 from some input x.
Recall: each factor is a local expert about all the random variables connected to it, i.e., a factor can assign a score to assignments of the variables connected to it.

Option 1: Score each decision separately.
- Pro: Prediction is easy; each y_i is independent.
- Con: No consideration of interactions.

Option 2: Add pairwise factors.
- Pro: Accounts for pairwise dependencies.
- Con: Makes prediction harder; ignores third and higher order dependencies.

Option 3: Use only order-3 factors.
- Pro: Accounts for order-3 dependencies.
- Con: Prediction is even harder; inference must now consider all triples of labels.

Option 4: Use order-4 factors.
- Pro: Accounts for order-4 dependencies.
- Con: Basically no decomposition over the labels!

How do we decide what to do? (A brute-force sketch follows below.)
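One way to see the tradeoff is a brute-force sketch over the four binary variables above (the factor scores are random stand-ins): with only unary factors each y_i can be chosen independently, but pairwise factors couple the choices, so this sketch must enumerate all 2^4 joint assignments.

```python
from itertools import product
import random

random.seed(0)
L = [0, 1]                                       # binary labels
unary = [[random.random() for _ in L] for _ in range(4)]
pairwise = {(i, i + 1): [[random.random() for _ in L] for _ in L]
            for i in range(3)}

def score(y):
    s = sum(unary[i][y[i]] for i in range(4))                       # Option 1
    s += sum(pairwise[i, i + 1][y[i]][y[i + 1]] for i in range(3))  # Option 2
    return s

# With unary factors only, the argmax decomposes into 4 easy local
# argmaxes; with pairwise factors we search over joint assignments.
best = max(product(L, repeat=4), key=score)
print(best, score(best))
```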

Some aspects to consider
- Availability of supervision: supervised algorithms are well studied; supervision is hard (or expensive) to obtain.
- Complexity of the model: more complex models encode complex dependencies between parts, but complex models make learning and inference harder.
- Features: most of the time we will assume that we have a good feature set to model our problem. But do we?
- Domain knowledge: incorporating background knowledge into learning and inference in a mathematically sound way.

Computational issues, continued: how to train the model?

Training structured models
- Empirical risk minimization principle: minimize the loss over the training data, and regularize the parameters to prevent overfitting.
- We have seen different training strategies falling under this umbrella: Conditional Random Fields, Structural Support Vector Machines, and the Structured Perceptron (which doesn't have regularization; a minimal sketch follows below).
- Different algorithms exist; we saw stochastic gradient descent in some detail.
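As a reminder of the simplest of these strategies, here is a minimal structured perceptron sketch; `features` (a joint feature map Phi(x, y)) and `inference` (an argmax solver under the current weights) are assumed to be supplied for the problem at hand:

```python
import numpy as np

def structured_perceptron(data, features, inference, dim, epochs=10):
    w = np.zeros(dim)
    for _ in range(epochs):
        for x, y_gold in data:
            y_hat = inference(w, x)          # predict with current weights
            if y_hat != y_gold:              # mistake-driven update
                w += features(x, y_gold) - features(x, y_hat)
    return w
```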

Training considerations: train globally vs. train locally
Global: train according to your final model.
[Figure: the full factor graph over y1…y4 and x.]
- Pro: Learning uses all the available information.
- Con: Computationally expensive.

Training considerations: train globally vs. train locally
Local: decompose your model into smaller ones and train each one separately; the full model is still used at prediction time.
[Figure: the full graph over y1…y4 split into its pairwise pieces (y1y2, y2y3, y3y4, y1y3, y1y4, y2y4), each trained on its own.]
- Pro: Easier to train.
- Con: May not capture global dependencies.

Training considerations: local vs. global
- Local learning: learn parameters for individual components independently; the learning algorithm is not aware of the full structure (see the sketch below).
- Global learning: learn parameters for the full structure; the learning algorithm "knows" about the full structure.
How do we choose? It depends on inference complexity, and on the size of the available data too.
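A sketch of the local option (the helper `train_classifier` is hypothetical; any off-the-shelf classifier would fill the role): each part gets its own model, trained without seeing the rest of the structure, and the full factor graph is reassembled only at prediction time.

```python
def train_locally(data, parts, train_classifier):
    # data: list of (x, y) pairs, where y maps each part to its label.
    models = {}
    for p in parts:
        part_examples = [(x, y[p]) for x, y in data]
        models[p] = train_classifier(part_examples)  # hypothetical helper
    return models
```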

Computational issues, continued: how to do inference?

Inference
What is inference? The prediction step. More broadly, an aggregation operation on the space of outputs for an example: max, expectation, sample, sum.
Different flavors: MAP, marginal, loss-augmented.
- Maximizer (MAP): find the highest-scoring structure, argmax_y score(x, y)
- Marginals: find the marginal probability of each part's label, e.g., P(y_i | x)
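The two flavors share the same dynamic program on a chain model, which a small sketch makes concrete (the potentials are random stand-ins): replacing max with sum turns Viterbi-style MAP scoring into the partition-function computation behind marginals.

```python
import numpy as np

np.random.seed(0)
T, K = 5, 3                        # chain length, number of labels
emit = np.random.rand(T, K)        # local potentials psi_t(y_t)
trans = np.random.rand(K, K)       # transition potentials psi(y_{t-1}, y_t)

def chain_inference(op):
    # Forward pass along the chain; `op` aggregates over the previous label.
    alpha = emit[0].copy()
    for t in range(1, T):
        alpha = emit[t] * op(alpha[:, None] * trans, axis=0)
    return op(alpha, axis=0)

print("MAP score         :", chain_inference(np.max))  # max-product (Viterbi)
print("partition function:", chain_inference(np.sum))  # sum-product
```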

Inference
Many algorithms and solution strategies; this is combinatorial optimization, and one size doesn't fit all: graph algorithms, belief propagation, integer linear programming, (beam) search, Monte Carlo methods…
Some tradeoffs:
- Programming effort
- Exact vs. inexact: Is the problem solvable with a known algorithm? Do we care about the exact answer?
How do we choose? (A beam-search sketch follows below.)
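For instance, here is a generic beam-search sketch for inexact inference; `extend` (generate extensions of a partial structure) and `score` are problem-specific assumptions supplied by the user:

```python
def beam_search(start, steps, extend, score, beam=5):
    # Keep only the `beam` highest-scoring partial structures per step.
    frontier = [start]
    for _ in range(steps):
        candidates = [y for partial in frontier for y in extend(partial)]
        frontier = sorted(candidates, key=score, reverse=True)[:beam]
    return max(frontier, key=score)
```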

Computational issues, continued: background knowledge about the domain.

How does background knowledge affect your choices?
Background knowledge biases your predictor in several ways:
- What is the model? Maybe third-order factors are not needed… etc.
- Your choices of learning and inference algorithms
- Feature functions
- Constraints that prohibit certain inference outcomes (see the sketch below)
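A minimal sketch of the last point (candidate generation and the constraint are placeholders): constraints can be enforced at inference time simply by discarding infeasible structures before taking the argmax.

```python
def constrained_argmax(candidates, score, constraints):
    # Drop any candidate structure that violates a hard constraint.
    feasible = (y for y in candidates if all(c(y) for c in constraints))
    return max(feasible, key=score)

# Example constraint, echoing the semantic parsing task: the inner and
# outer FROM clauses must name the same table.
same_table = lambda y: y["outer_from"] == y["inner_from"]
```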

Computational issues, continued: data annotation.

Data and how it influences your model
Annotated data is a precious resource: it takes specialized expertise to generate, or very clever tricks (like online games that produce data as a side effect).
Important directions: learning with latent representations, indirect supervision, partial supervision.
In all these cases:
- Learning is rarely a convex problem.
- Modeling choices become very important! A bad model will hurt.

Looking ahead
Big questions (a very limited and biased set):
- Representations: Can we learn the factorization? Can we learn feature functions?
- Dealing with the data problem for new applications: clever tricks to get data; taming latent variable learning
- Applications: How does structured prediction help you? Structured prediction is gathering importance as computer programs have to deal with uncertain, noisy inputs and make complex decisions.