1 Grafting-Light: Fast, Incremental Feature Selection and Structure Learning of Markov Random Fields (Zhu, Lao and Xing, 2010)
Pete Stein, COMP | 10/28/10

2 Grafting: Sneak Peek
What is "grafting"? Here, Gradient Feature Testing (not the horticultural kind of grafting).
What are we doing? Using gradient descent and related methods to approximate the optimal set of features for a dataset.

3 Motivation: Feature Selection
Past approaches to features and feature selection have often been:
- Mysterious
- Ad hoc
- Detached from the learning process
Choosing the best features versus learning the model: two sides of the same coin?

4 Motivation: Feature Selection
Challenge: a robust framework for intelligent, efficient feature selection that can:
- Support a large number of features
- Maintain computational efficiency

5 Grafting-Light Application
- Seeks features for modeling with Markov random fields
- Entails a large number of high-dimensional inputs and features
- Selecting optimal features is NP-hard
- Options? Previous work:
  - "Batch" methods that optimize over all candidate features
  - Incremental methods (e.g., "Grafting")
(Pictured: Andrey Markov)

6 Regularized Risk Minimization
Generally, how do we select features? We want features that minimize risk, i.e., the expected loss of f over the joint distribution P, where:
- L() = penalty for the deviation of f() from y
- P() = joint probability density function (JPDF) over inputs and labels
Example L(): a penalty for a binary function that returns either inclusion or exclusion,
L = ½ |y − sign(f(x))|

7 Empirical Risk
In practice, we don't know P. Instead, we minimize a combination of:
- the empirical risk (the loss averaged over the observed data points), plus
- a regularization term.
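As a rough illustration (not from the slides or the paper), here is a minimal Python sketch of a regularized empirical risk, assuming a simple linear model f(x) = w · x, the 0/1-style loss from the previous slide, and an arbitrary l1 penalty with a placeholder λ:

import numpy as np

def zero_one_loss(y, f_x):
    # 0/1-style loss from slide 6: 0.5 * |y - sign(f(x))|, with labels y in {-1, +1}
    return 0.5 * np.abs(y - np.sign(f_x))

def regularized_empirical_risk(w, X, y, lam=0.1):
    f_x = X @ w                         # simple linear model: f(x) = w . x
    empirical_risk = np.mean(zero_one_loss(y, f_x))
    penalty = lam * np.sum(np.abs(w))   # an (assumed) l1 regularization term
    return empirical_risk + penalty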

8 Loss Functions
It is desirable to choose classifiers that separate the data as much as possible, i.e., maximize the margin ρ = y · f(x).
So better loss functions L() are possible, e.g., the Binomial Negative Log-Likelihood: L_bnll = ln(1 + e^(−ρ))
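A small Python sketch of this loss (assuming labels y ∈ {−1, +1} and a real-valued score f(x); np.logaddexp is used only as a numerically stable way to compute ln(1 + e^(−ρ))):

import numpy as np

def bnll_loss(y, f_x):
    # Binomial Negative Log-Likelihood: L_bnll = ln(1 + e^(-rho)), with rho = y * f(x)
    rho = y * f_x
    return np.logaddexp(0.0, -rho)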

9 Loss Functions

10 Regularization
Why not just minimize the empirical risk and leave it at that?
- The optimization problem may be unbounded w.r.t. the parameters.
- Overfitting: the solution may fit one set of data very well, and then do poorly on the next set.

11 Regularization
What does this regularization term look like? (E.g., a penalty of the form λ Σ_i α_i |w_i|^q.) Its ingredients:
- q = regularization norm
- w = weight vector
- λ = regularization parameter(s)
- α = inclusion/exclusion vector, e.g., α_i ∈ {0, 1}

12 Which Regularization?
- Depends on "pragmatic" motivations (performance, scalability)
- Of particular interest: q ∈ {0, 1, 2}

13 Unified Regularization
Combining the three criteria (the q = 0, 1, 2 penalties) gives a single objective C(w).
Now, how do we minimize C(w)? Grafting.
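To make the combined objective concrete, here is a hedged Python sketch of one possible C(w): empirical BNLL risk plus l0, l1 and l2 penalties. The linear model and the λ values are placeholder assumptions, not the paper's exact formulation:

import numpy as np

def C(w, X, y, lam0=0.0, lam1=0.1, lam2=0.01):
    rho = y * (X @ w)                                  # margins for a linear model
    empirical_risk = np.mean(np.logaddexp(0.0, -rho))  # BNLL loss averaged over data
    omega0 = np.count_nonzero(w)                       # l0: number of active weights
    omega1 = np.sum(np.abs(w))                         # l1: sum of |w_i|
    omega2 = np.sum(w ** 2)                            # l2: sum of w_i^2
    return empirical_risk + lam0 * omega0 + lam1 * omega1 + lam2 * omega2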

14 Grafting
General strategy: use gradient descent over the model parameters w.
But obstacles exist:
- Slow (quadratic w.r.t. data dimensionality)
- Numerical issues for some regularizers
Unfortunately there are a number of problems with this approach. First, the gradient descent can be quite slow: algorithms such as conjugate gradient descent typically scale quadratically with the number of dimensions, and we are particularly interested in the domain where we have many features, so p is large. This quadratic dependence on the number of model weights seems particularly wasteful if we are using the l0 and/or l1 regularizer, since we know that only some subset of those weights will be non-zero. The second problem is that the l0 and l1 regularizers do not have continuous first derivatives with respect to the model weights, which can cause numerical problems for general-purpose gradient descent algorithms that expect these derivatives to be continuous.

15 Gradient Descent
First, what is gradient descent?
- A first-order optimization algorithm
- Finds local minima (for local maxima, use gradient ascent)
- Takes steps proportional to the negative of the gradient
- Relatively slow in some situations: for poorly conditioned convex problems, gradient descent increasingly "zigzags" as the gradients point nearly orthogonally to the shortest direction to a minimum point; for non-differentiable functions, gradient methods are ill-defined.
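For reference, a minimal Python sketch of plain fixed-step gradient descent (the step size, iteration count, and the toy quadratic example are arbitrary choices, not anything from the slides):

import numpy as np

def gradient_descent(grad, w0, step=0.01, n_iters=1000):
    w = np.asarray(w0, dtype=float)
    for _ in range(n_iters):
        w = w - step * grad(w)   # move against the gradient
    return w

# Toy example: minimize f(w) = ||w - 1||^2, whose gradient is 2 * (w - 1).
w_star = gradient_descent(lambda w: 2 * (w - np.ones_like(w)), np.zeros(3))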

16 Grafting: Stagewise Optimization
Basic steps:
- Split the parameter weights w into a free set (F) and a fixed set (Z).
- During each stage, transfer one weight (the one with the largest contribution to C(w)) from Z to F.
- Use gradient descent to minimize C(w) w.r.t. F.
Gradient descent is faster with the stage-wise approach, but still slow.
At each grafting step, we calculate the magnitude of ∂C/∂w_i for each w_i ∈ Z and determine the maximum magnitude. We then add that weight to the set of free weights F and call a general-purpose gradient descent routine to optimize the objective with respect to all the free weights. Since we know how to calculate ∂C/∂w_i, we can use an efficient quasi-Newton method, such as the Broyden–Fletcher–Goldfarb–Shanno (BFGS) variant (see Press et al. 1992, chap. 10 for a description). We start the optimization at the weight values found in the (k−1)'th step, so in general only the most recently added weight changes significantly. Note that choosing the next weight based on the magnitude of ∂C/∂w_i does not guarantee that it is the best weight to add at that stage; however, it is much faster than the alternative of trying an optimization with each member of Z in turn and picking the best. This procedure reaches a solution that is at least locally optimal.
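The loop below is a simplified Python sketch of this stage-wise scheme, not the authors' implementation: the quasi-Newton (BFGS) inner optimization described above is replaced by a plain gradient loop over the free weights, and the gradient function grad_C is assumed to be supplied by the caller:

import numpy as np

def grafting_sketch(grad_C, dim, n_stages=5, step=0.05, inner_iters=200):
    w = np.zeros(dim)               # all weights start in the fixed set Z (held at zero)
    free = []                       # indices currently in the free set F
    for _ in range(n_stages):
        g = grad_C(w)
        fixed = [i for i in range(dim) if i not in free]
        if not fixed:
            break
        best = max(fixed, key=lambda i: abs(g[i]))   # largest |dC/dw_i| over Z
        free.append(best)                            # transfer that weight from Z to F
        for _ in range(inner_iters):                 # (re)minimize C w.r.t. all free weights
            g = grad_C(w)
            for i in free:
                w[i] -= step * g[i]
    return w, free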

17 Grafting-Light
Main difference: at each grafting step, use a fast gradient-based heuristic to choose the features that will reduce C(w) the most.
Isn't this the same as regular grafting?

18 Grafting-Light
- Grafting: fully optimize over all free parameters at each stage.
- Grafting-Light: only one gradient descent step per stage.
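To highlight the contrast, a schematic Python sketch (reusing the hypothetical names from the grafting sketch above): Grafting runs the inner optimization to convergence at every stage, whereas Grafting-Light performs a single descent step over the free weights before moving on to select the next feature(s):

def grafting_light_stage(w, free, grad_C, step=0.05):
    # One Grafting-Light stage (sketch): a single gradient step over the free
    # weights, instead of a full inner optimization as in ordinary Grafting.
    g = grad_C(w)
    for i in free:
        w[i] -= step * g[i]
    return w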

19 Grafting/Grafting-Light Comparison
Step-by-step comparison of Grafting and Grafting-Light:
Grafting-Light:
- One step of gradient descent
- Select new features (e.g., v)
- 2D gradient descent
Grafting:
- Optimize from μ to a local optimum μ*
- Select new features (e.g., v)
- Optimize over μ and v

20 Results / Comparison
(figures comparing Full-Opt.-L1, Grafting, and Grafting-Light)

21 The End
Thank you; questions?

