Grafting-Light: Fast, Incremental Feature Selection and Structure Learning of Markov Random Fields (Zhu, Lao and Xing, 2010). Pete Stein, COMP 790-90, 10/28/10.

Grafting: Sneak Peek. What is "grafting"? What are we doing? Gradient Feature Testing: using gradient descent and related methods to approximate the optimal set of features for a dataset. (…not this kind of grafting.)

Motivation: Feature Selection. Past approaches to feature selection have been mysterious, ad hoc, and detached from the learning process. Choosing the best features versus learning: two sides of the same coin?

Motivation: Feature Selection. The challenge: a robust framework for intelligent, efficient feature selection that supports a large number of features and remains computationally efficient.

Grafting-Light Application. Seeks features for modeling with Markov random fields, which entails a large number of high-dimensional inputs and features; selecting the optimal features is NP-hard. Options? Previous work includes "batch" methods that optimize over all candidate features, and incremental methods (e.g., "Grafting"). (Photo caption: Andrey Markov, 1856-1922.)

Regularized Risk Minimization. Generally, how do we select features? We want the features that minimize the expected risk, where L(·) is a penalty for the deviation of f(x) from y and P(·) is the joint probability density over (x, y). Example L(·): for a binary classifier, the 0-1 loss L = ½ |y − sign(f(x))|, which is 1 for a misclassification and 0 otherwise.
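
A reconstruction of the risk functional these symbols describe (the slide's own equation is not preserved in the transcript, so this is an assumed standard form):

R(f) = \int L\bigl(y, f(x)\bigr)\, dP(x, y),
\qquad
L\bigl(y, f(x)\bigr) = \tfrac{1}{2}\,\bigl|\, y - \operatorname{sign} f(x) \,\bigr|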

Empirical Risk. In practice, we don't know P. Instead, minimize a combination of the empirical risk (summed over the data points) plus a regularization term.
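
A plausible form of the objective being described, assuming n training pairs (x_i, y_i) (the exact equation on the slide is not preserved):

C(w) = \frac{1}{n} \sum_{i=1}^{n} L\bigl(y_i, f(x_i; w)\bigr) + \Omega(w)

where \Omega(w) is the regularization term discussed on the following slides.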

Loss Functions. It is desirable to choose classifiers that separate the data as much as possible, i.e., that maximize the margin ρ = y · f(x). So better loss functions L(·) are possible, e.g., the Binomial Negative Log-Likelihood: L_bnll = ln(1 + e^(−ρ)).
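
For concreteness, a minimal Python sketch of the two losses mentioned here (the function names are mine, not from the paper):

import numpy as np

def loss_01(y, f_x):
    """0-1 loss: 1/2 * |y - sign(f(x))| for labels y in {-1, +1}."""
    return 0.5 * np.abs(y - np.sign(f_x))

def loss_bnll(y, f_x):
    """Binomial negative log-likelihood: ln(1 + exp(-rho)), with rho = y * f(x)."""
    rho = y * f_x
    return np.log1p(np.exp(-rho))

# A correct but low-margin prediction incurs no 0-1 loss but is still penalized
# by BNLL, which is what pushes the classifier toward larger margins.
print(loss_01(1.0, 0.1), loss_bnll(1.0, 0.1))   # 0.0 and roughly 0.644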

Loss Functions (figure).

Regularization. Why not just minimize the empirical risk and leave it at that? The optimization problem may be unbounded w.r.t. the parameters, and there is overfitting: the solution may fit one set of data very well and then do poorly on the next set.

Regularization. What does this regularization term look like? q = regularization norm; w = weight vector; λ = regularization parameters; α = inclusion/exclusion vector, e.g., α_i ∈ {0, 1}.
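
Based on the symbols listed here, the regularizer presumably takes the form below (the slide's own equation is not preserved, so treat this as an assumed reconstruction):

\Omega_q(w) = \lambda \sum_{i} \alpha_i\, \lvert w_i \rvert^{\,q}

with α_i switching individual weights in or out of the penalty.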

Which Regularization? Depends on "pragmatic" motivations (performance, scalability). Of particular interest: q ∈ {0, 1, 2}.

Unified Regularization. Combining the three criteria: now, how do we minimize C(w)? Grafting.
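
A hedged reconstruction of the unified criterion, following the standard Grafting formulation that combines the mean loss with the q = 0, 1, 2 regularizers (the coefficients and exact form on the original slide are not preserved):

C(w) = \frac{1}{n} \sum_{i=1}^{n} L\bigl(y_i, f(x_i; w)\bigr)
     + \lambda_1 \sum_j \lvert w_j \rvert
     + \lambda_2 \sum_j w_j^2
     + \lambda_0 \sum_j \mathbf{1}[\, w_j \neq 0 \,]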

Grafting. General strategy: use gradient descent over the model parameters w. But obstacles exist: it is slow (quadratic w.r.t. data dimensionality), and there are numerical issues for some regularizers. Unfortunately there are a number of problems with this approach. Firstly, the gradient descent can be quite slow: algorithms such as conjugate gradient descent typically scale quadratically with the number of dimensions, and we are particularly interested in the domain where we have many features and so p is large. This quadratic dependence on the number of model weights seems particularly wasteful if we are using the q = 0 and/or q = 1 regularizer, since we know that only some subset of those weights will be non-zero. The second problem is that the q = 0 and q = 1 regularizers do not have continuous first derivatives with respect to the model weights. This can cause numerical problems for general-purpose gradient descent algorithms that expect these derivatives to be continuous.

Gradient Descent. First, what is gradient descent? A first-order optimization algorithm that finds local minima (local maxima, for gradient ascent) by taking steps proportional to the negative of the gradient. It is relatively slow in some situations: for poorly conditioned convex problems, gradient descent increasingly 'zigzags' as the gradients point nearly orthogonally to the shortest direction to the minimum, and for non-differentiable functions, gradient methods are ill-defined.
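
A minimal Python sketch of plain gradient descent (not the paper's implementation), with a poorly conditioned quadratic of the kind the slide alludes to:

import numpy as np

def gradient_descent(grad, w0, step=0.019, tol=1e-6, max_iter=10_000):
    """Plain first-order gradient descent: w <- w - step * grad(w)."""
    w = np.asarray(w0, dtype=float)
    for _ in range(max_iter):
        w_new = w - step * grad(w)
        if np.linalg.norm(w_new - w) < tol:   # stop when the update is tiny
            return w_new
        w = w_new
    return w

# Poorly conditioned quadratic f(w) = 0.5 * (w[0]**2 + 100 * w[1]**2):
# the stiff coordinate oscillates around 0 while the flat one shrinks slowly.
grad = lambda w: np.array([w[0], 100.0 * w[1]])
print(gradient_descent(grad, [5.0, 1.0]))   # close to the minimum at [0, 0]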

Grafting: Stagewise Optimization. Basic steps: split the parameter weights w into free (F) and fixed (Z) sets; during each stage, transfer one weight (the one with the largest contribution to C(w)) from Z to F; use gradient descent to minimize C(w) w.r.t. F. Gradient descent is faster with the stage-wise approach, but still slow. At each grafting step, we calculate the magnitude of ∂C/∂w_i for each w_i ∈ Z, and determine the maximum magnitude. We then add the weight to the set of free weights F, and call a general-purpose gradient descent routine to optimize (7) with respect to all the free weights. Since we know how to calculate the gradient ∂C/∂w_i, we can use an efficient quasi-Newton method. We use the Broyden-Fletcher-Goldfarb-Shanno (BFGS) variant (see Press et al. 1992, chap. 10 for a description). We start the optimization at the weight values found in the (k−1)'th step, so in general only the most recently added weight changes significantly. Note that choosing the next weight based on the magnitude of ∂C/∂w_i does not guarantee that it is the best weight to add at that stage. However, it is much faster than the alternative of trying an optimization with each member of Z in turn and picking the best. We shall see below that this procedure will take us to a solution that is at least locally optimal.
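
A hedged Python sketch of the stagewise loop described above. It follows the slide's description rather than the authors' code, uses SciPy's BFGS in place of the Press et al. routine, and leaves the choice of objective C (loss plus regularizers) to the caller:

import numpy as np
from scipy.optimize import minimize

def grafting(C, grad_C, dim, n_stages):
    """Stagewise ('grafting') optimization: all weights start fixed at zero,
    and one weight is moved to the free set and optimized per stage."""
    w = np.zeros(dim)
    free = []                                    # indices in F; the rest form Z
    for _ in range(n_stages):
        fixed = [i for i in range(dim) if i not in free]
        if not fixed:
            break
        g = grad_C(w)
        # Move the fixed weight with the largest |dC/dw_i| into the free set.
        free.append(max(fixed, key=lambda i: abs(g[i])))

        # Minimize C over the free weights only, warm-started at current values.
        def c_free(wf, free=tuple(free)):
            w_full = w.copy(); w_full[list(free)] = wf
            return C(w_full)

        def grad_free(wf, free=tuple(free)):
            w_full = w.copy(); w_full[list(free)] = wf
            return grad_C(w_full)[list(free)]

        w[free] = minimize(c_free, w[free], jac=grad_free, method="BFGS").x
    return w

# Toy usage: least-squares loss where only features 3, 7, 11 matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
w_true = np.zeros(20); w_true[[3, 7, 11]] = [2.0, -1.5, 1.0]
y = X @ w_true + 0.1 * rng.normal(size=200)
C = lambda w: 0.5 * np.mean((X @ w - y) ** 2)
grad_C = lambda w: X.T @ (X @ w - y) / len(y)
print(np.nonzero(grafting(C, grad_C, dim=20, n_stages=3))[0])  # likely [3 7 11]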

Grafting-Light. Main difference: at each grafting step, use a fast gradient-based heuristic to choose the features that will reduce C(w) the most. Isn't this the same as regular grafting?

Grafting-Light. Grafting fully optimizes over all free parameters at each stage; Grafting-Light performs only one gradient descent step per stage.
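
A rough sketch of how a single Grafting-Light stage could look, to contrast with the full BFGS optimization per stage in regular Grafting. This is my reconstruction from the slides, not the authors' algorithm, and the exact ordering of the descent step and the feature selection may differ:

import numpy as np

def grafting_light_stage(grad_C, w, free, step=0.1):
    """One Grafting-Light stage (sketch): take a single gradient step over the
    currently free weights, then admit the fixed weight whose gradient
    magnitude |dC/dw_i| is largest. w is a NumPy array of all weights."""
    w = w.copy()
    if free:
        w[free] -= step * grad_C(w)[free]        # one descent step, not full BFGS
    g = grad_C(w)
    fixed = [i for i in range(len(w)) if i not in free]
    if fixed:
        free = free + [max(fixed, key=lambda i: abs(g[i]))]   # select new feature
    return w, free

# Tiny usage with a quadratic objective C(w) = 0.5 * ||w - t||^2:
t = np.array([0.0, 2.0, -1.0])
grad_C = lambda w: w - t
w, free = np.zeros(3), []
for _ in range(3):
    w, free = grafting_light_stage(grad_C, w, free)
print(free, w)   # features admitted in order of gradient magnitude, e.g. [1, 2, 0]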

Grafting/Grafting-Light Comparison. Step-by-step comparison of Grafting and Grafting-Light (figure). Grafting-Light: one step of gradient descent; select new features (e.g., v); 2D gradient descent over μ and v. Grafting: optimize from μ to the local optimum μ*; select new features (e.g., v); optimize over μ and v.

Results / Comparison (figures compare Full-Opt.-L1, Grafting, and Grafting-Light).

The End. Thank you; questions?