First-Order Methods.


Gradient Descent
Search in the direction of steepest descent, i.e., along the negative gradient: x(k+1) = x(k) − α(k) ∇f(x(k)).
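
As a minimal sketch (not from the slides), the update above might look as follows in Python/NumPy; the function name, the fixed step size alpha, and the quadratic test function are illustrative assumptions.

```python
import numpy as np

def gradient_descent(grad_f, x0, alpha=0.01, n_steps=1000):
    """Minimal gradient descent with a fixed step size (illustrative sketch)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        x = x - alpha * grad_f(x)   # step along the direction of steepest descent
    return x

# Example: minimize f(x, y) = x^2 + 10*y^2, whose minimizer is the origin
grad = lambda x: np.array([2 * x[0], 20 * x[1]])
print(gradient_descent(grad, [3.0, -2.0]))
```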

Gradient Descent
Applying gradient descent with a line search to the Rosenbrock function.
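
A hedged sketch of pairing gradient descent with a backtracking (Armijo) line search on the Rosenbrock function; the Armijo constants, starting point, and iteration count are assumptions rather than values from the slides.

```python
import numpy as np

def rosenbrock(x, a=1.0, b=100.0):
    return (a - x[0])**2 + b * (x[1] - x[0]**2)**2

def rosenbrock_grad(x, a=1.0, b=100.0):
    return np.array([
        -2 * (a - x[0]) - 4 * b * x[0] * (x[1] - x[0]**2),
        2 * b * (x[1] - x[0]**2),
    ])

def backtracking_line_search(f, grad, x, d, alpha=1.0, beta=0.5, c=1e-4):
    """Shrink alpha until the Armijo sufficient-decrease condition holds."""
    while f(x + alpha * d) > f(x) + c * alpha * grad(x) @ d:
        alpha *= beta
    return alpha

x = np.array([-1.2, 1.0])
for _ in range(2000):
    d = -rosenbrock_grad(x)   # steepest-descent direction
    alpha = backtracking_line_search(rosenbrock, rosenbrock_grad, x, d)
    x = x + alpha * d
print(x)  # slowly approaches the minimizer [1, 1]
```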

Conjugate Gradient
- Inspired by methods for optimizing quadratic functions.
- Assumes the problem takes the quadratic form f(x) = ½ xᵀA x + bᵀx + c, where A is a positive-definite matrix.
- All search directions are mutually conjugate with respect to A.
- For an n-dimensional problem, it converges in at most n steps.
- Can be used when the function is locally approximated as quadratic.

Conjugate Gradient
Using the Fletcher-Reeves formula β(k) = g(k)ᵀ g(k) / (g(k−1)ᵀ g(k−1)) or the Polak-Ribière formula β(k) = g(k)ᵀ (g(k) − g(k−1)) / (g(k−1)ᵀ g(k−1)), a general function can be treated as locally quadratic and minimized with the conjugate gradient method.
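
A sketch of nonlinear conjugate gradient using the Polak-Ribière formula (with the common reset to zero) and a simple backtracking line search; the parameter values and test function are illustrative assumptions.

```python
import numpy as np

def conjugate_gradient(f, grad, x0, n_steps=100):
    """Nonlinear CG with the Polak-Ribiere beta (illustrative sketch)."""
    x = np.asarray(x0, dtype=float)
    g = grad(x)
    d = -g                                    # first direction: steepest descent
    for _ in range(n_steps):
        # backtracking line search along d (Armijo condition)
        alpha = 1.0
        while f(x + alpha * d) > f(x) + 1e-4 * alpha * g @ d:
            alpha *= 0.5
        x = x + alpha * d
        g_new = grad(x)
        beta = max(0.0, g_new @ (g_new - g) / (g @ g))   # Polak-Ribiere, reset if negative
        d = -g_new + beta * d                 # new mutually conjugate direction
        g = g_new
    return x

# Example on a convex quadratic f(x) = x1^2 + 10*x2^2
f = lambda x: x[0]**2 + 10 * x[1]**2
grad = lambda x: np.array([2 * x[0], 20 * x[1]])
print(conjugate_gradient(f, grad, [3.0, -2.0]))
```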

Momentum
Addresses common convergence issues: some functions have nearly flat regions where the gradient is very small and plain gradient descent makes little progress.

Momentum
Some functions cause gradient descent to get stuck or oscillate. Momentum overcomes these issues by replicating the effect of physical momentum, accumulating velocity along consistent descent directions.

Momentum
Momentum update equations:
v(k+1) = β v(k) − α ∇f(x(k))
x(k+1) = x(k) + v(k+1)
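
A minimal sketch of the momentum update above; the decay β, step size α, and test function are assumed values.

```python
import numpy as np

def gradient_descent_momentum(grad_f, x0, alpha=0.01, beta=0.9, n_steps=1000):
    """Gradient descent with momentum (illustrative sketch)."""
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)                  # accumulated velocity
    for _ in range(n_steps):
        v = beta * v - alpha * grad_f(x)  # decay old velocity, add the new step
        x = x + v
    return x

grad = lambda x: np.array([2 * x[0], 20 * x[1]])
print(gradient_descent_momentum(grad, [3.0, -2.0]))
```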

Nesterov Momentum
Nesterov momentum evaluates the gradient at the look-ahead point x(k) + β v(k) rather than at x(k), which reduces overshooting:
v(k+1) = β v(k) − α ∇f(x(k) + β v(k))
x(k+1) = x(k) + v(k+1)
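
A corresponding sketch of Nesterov momentum, with the gradient evaluated at the look-ahead point; the parameter values and test function are again assumptions.

```python
import numpy as np

def nesterov_momentum(grad_f, x0, alpha=0.01, beta=0.9, n_steps=1000):
    """Nesterov momentum: gradient evaluated at the look-ahead point x + beta*v."""
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)
    for _ in range(n_steps):
        v = beta * v - alpha * grad_f(x + beta * v)  # look-ahead gradient
        x = x + v
    return x

grad = lambda x: np.array([2 * x[0], 20 * x[1]])
print(nesterov_momentum(grad, [3.0, -2.0]))
```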

Adagrad
Instead of using the same learning rate for all components of x, the adaptive subgradient method (Adagrad) adapts the learning rate for each component of x. Each component is updated as
x_i(k+1) = x_i(k) − α g_i(k) / (ϵ + sqrt(s_i(k))),
where s_i(k) is the accumulated sum of squared gradients for component i and ϵ ≈ 1×10⁻⁸ prevents division by zero.
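
A sketch of the per-component Adagrad update; the learning rate α and the test function are assumptions.

```python
import numpy as np

def adagrad(grad_f, x0, alpha=0.1, eps=1e-8, n_steps=1000):
    """Adagrad: per-component learning rates scaled by accumulated squared gradients."""
    x = np.asarray(x0, dtype=float)
    s = np.zeros_like(x)                         # running sum of squared gradients
    for _ in range(n_steps):
        g = grad_f(x)
        s += g * g
        x = x - alpha * g / (eps + np.sqrt(s))   # larger history -> smaller step
    return x

grad = lambda x: np.array([2 * x[0], 20 * x[1]])
print(adagrad(grad, [3.0, -2.0]))
```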

RMSProp
Extends Adagrad to avoid a monotonically decreasing learning rate by maintaining a decaying average of squared gradients. Update equations:
s_i(k+1) = γ s_i(k) + (1 − γ) (g_i(k))²
x_i(k+1) = x_i(k) − α g_i(k) / (ϵ + sqrt(s_i(k+1)))
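
A sketch of the RMSProp update with a decaying average of squared gradients; γ, α, and the test function are assumed values.

```python
import numpy as np

def rmsprop(grad_f, x0, alpha=0.01, gamma=0.9, eps=1e-8, n_steps=1000):
    """RMSProp: Adagrad's running sum replaced by a decaying average of squared gradients."""
    x = np.asarray(x0, dtype=float)
    s = np.zeros_like(x)
    for _ in range(n_steps):
        g = grad_f(x)
        s = gamma * s + (1 - gamma) * g * g     # decaying average keeps steps from vanishing
        x = x - alpha * g / (eps + np.sqrt(s))
    return x

grad = lambda x: np.array([2 * x[0], 20 * x[1]])
print(rmsprop(grad, [3.0, -2.0]))
```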

Adadelta
Modifies RMSProp to eliminate the learning rate parameter entirely: the step size is formed from the ratio of a decaying RMS of past parameter updates to the decaying RMS of gradients.
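
A sketch of an Adadelta-style update in which the ratio of two decaying RMS quantities replaces the learning rate; the decay γ, the ϵ value, and the test function are assumptions.

```python
import numpy as np

def adadelta(grad_f, x0, gamma=0.95, eps=1e-6, n_steps=5000):
    """Adadelta: the ratio of two decaying RMS values replaces the learning rate."""
    x = np.asarray(x0, dtype=float)
    s = np.zeros_like(x)   # decaying average of squared gradients
    u = np.zeros_like(x)   # decaying average of squared parameter updates
    for _ in range(n_steps):
        g = grad_f(x)
        s = gamma * s + (1 - gamma) * g * g
        dx = -np.sqrt(u + eps) / np.sqrt(s + eps) * g   # no explicit learning rate
        u = gamma * u + (1 - gamma) * dx * dx
        x = x + dx
    return x

grad = lambda x: np.array([2 * x[0], 20 * x[1]])
print(adadelta(grad, [3.0, -2.0]))
```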

Adam
The adaptive moment estimation method (Adam) incorporates ideas from both momentum and RMSProp, adapting the learning rate to each parameter. At each iteration, a sequence of values is computed:
- biased decaying momentum (first moment)
- biased decaying squared gradient (second moment)
- corrected decaying momentum
- corrected decaying squared gradient
- the next iterate x(k+1)
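
A sketch of the Adam iteration described above, with bias-corrected first and second moments; the decay rates, step size, and test function are assumptions.

```python
import numpy as np

def adam(grad_f, x0, alpha=0.01, gamma_v=0.9, gamma_s=0.999, eps=1e-8, n_steps=2000):
    """Adam: bias-corrected decaying momentum and decaying squared gradient."""
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)   # biased first moment (momentum)
    s = np.zeros_like(x)   # biased second moment (squared gradient)
    for k in range(1, n_steps + 1):
        g = grad_f(x)
        v = gamma_v * v + (1 - gamma_v) * g
        s = gamma_s * s + (1 - gamma_s) * g * g
        v_hat = v / (1 - gamma_v**k)          # corrected decaying momentum
        s_hat = s / (1 - gamma_s**k)          # corrected decaying squared gradient
        x = x - alpha * v_hat / (eps + np.sqrt(s_hat))
    return x

grad = lambda x: np.array([2 * x[0], 20 * x[1]])
print(adam(grad, [3.0, -2.0]))
```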

Hypergradient Descent
Many accelerated descent methods are highly sensitive to hyperparameters such as the learning rate. Applying gradient descent to a hyperparameter of an underlying descent method is called hypergradient descent. It requires computing the partial derivative of the objective function with respect to that hyperparameter.
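
A sketch of hypergradient descent applied to plain gradient descent, where the learning rate α is itself updated by a gradient step using ∂f/∂α = −∇f(x(k))ᵀ∇f(x(k−1)); the hyper-learning-rate μ and the other values are assumptions.

```python
import numpy as np

def hypergradient_descent(grad_f, x0, alpha=0.01, mu=1e-6, n_steps=1000):
    """Hypergradient descent on plain gradient descent (illustrative sketch).

    The partial derivative of the objective with respect to the learning rate
    is -grad(x_k) . grad(x_{k-1}), so alpha is itself adjusted by a gradient step.
    """
    x = np.asarray(x0, dtype=float)
    g_prev = np.zeros_like(x)
    for _ in range(n_steps):
        g = grad_f(x)
        alpha = alpha + mu * (g @ g_prev)   # hypergradient step on the learning rate
        x = x - alpha * g
        g_prev = g
    return x

grad = lambda x: np.array([2 * x[0], 20 * x[1]])
print(hypergradient_descent(grad, [3.0, -2.0]))
```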


Summary
- Gradient descent follows the direction of steepest descent.
- The conjugate gradient method can automatically adjust to local valleys.
- Descent methods with momentum build up progress in favorable directions.
- A wide variety of accelerated descent methods use special techniques to speed up descent.
- Hypergradient descent applies gradient descent to the learning rate of an underlying descent method.