First-Order Methods
Gradient Descent
Searches in the direction of steepest descent, the negative of the gradient. A line search can be used to choose the step size at each iteration. Example: applying gradient descent with line search to the Rosenbrock function, as in the sketch below.
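A minimal sketch of this in Python (assuming NumPy), with an illustrative backtracking line search, starting point, and iteration count rather than prescribed choices:

    import numpy as np

    def rosenbrock(x):
        return (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2

    def rosenbrock_grad(x):
        return np.array([
            -2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0]**2),
            200 * (x[1] - x[0]**2),
        ])

    def backtracking_line_search(f, x, g, d, alpha=1.0, p=0.5, beta=1e-4):
        # Shrink the step until the sufficient-decrease (Armijo) condition holds.
        while f(x + alpha * d) > f(x) + beta * alpha * np.dot(g, d):
            alpha *= p
        return alpha

    x = np.array([-1.2, 1.0])
    for _ in range(1000):
        g = rosenbrock_grad(x)
        d = -g / np.linalg.norm(g)   # steepest-descent direction
        alpha = backtracking_line_search(rosenbrock, x, g, d)
        x = x + alpha * d
    print(x)   # slowly approaches the minimizer [1, 1]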
Conjugate Gradient
Inspired by methods for optimizing quadratic functions. Assumes the problem takes the quadratic form
    minimize f(x) = (1/2) xᵀ A x + bᵀ x
where A is a positive-definite matrix. All search directions are mutually conjugate with respect to A, so for an n-dimensional problem the method converges in at most n steps. Using the Fletcher-Reeves or Polak-Ribière equations for β(k), a function that is only locally approximated as quadratic can also be minimized with conjugate gradient.
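A nonlinear conjugate gradient sketch using the Polak-Ribière formula with the common reset-to-steepest-descent safeguard; it reuses rosenbrock, rosenbrock_grad, and backtracking_line_search from the gradient descent sketch above, and the iteration count is illustrative:

    import numpy as np

    def conjugate_gradient(f, grad, x, iterations=100):
        g = grad(x)
        d = -g                                   # first direction: steepest descent
        for _ in range(iterations):
            alpha = backtracking_line_search(f, x, g, d)
            x = x + alpha * d
            g_new = grad(x)
            # Polak-Ribiere beta, clipped at zero so the method can restart
            beta = max(0.0, g_new @ (g_new - g) / (g @ g))
            d = -g_new + beta * d
            if g_new @ d >= 0:                   # safeguard: keep a descent direction
                d = -g_new
            g = g_new
        return x

    x = conjugate_gradient(rosenbrock, rosenbrock_grad, np.array([-1.2, 1.0]))
    print(x)   # approaches the minimizer [1, 1]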
Momentum
Addresses common convergence issues: some functions have regions with very small gradients, and some cause gradient descent to get stuck. Momentum overcomes these issues by replicating the effect of physical momentum, accumulating past gradients so that progress builds up in consistently downhill directions. The momentum update equations are
    v(k+1) = β v(k) − α ∇f(x(k))
    x(k+1) = x(k) + v(k+1)
where α is the learning rate and β ∈ [0, 1) controls how quickly the momentum decays.
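A sketch of the momentum update applied to the simple quadratic f(x) = ‖x‖²; the learning rate and decay values are illustrative, not tuned:

    import numpy as np

    def momentum_step(grad, x, v, alpha=0.01, beta=0.9):
        # Accumulate past gradients so progress builds up in directions
        # that are consistently downhill.
        v = beta * v - alpha * grad(x)
        return x + v, v

    grad = lambda x: 2 * x                # gradient of f(x) = ||x||^2
    x, v = np.array([5.0, -3.0]), np.zeros(2)
    for _ in range(200):
        x, v = momentum_step(grad, x, v)
    print(x)                              # converges toward the origin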
Nesterov Momentum
A variant of momentum that evaluates the gradient at the look-ahead point x(k) + β v(k) rather than at x(k):
    v(k+1) = β v(k) − α ∇f(x(k) + β v(k))
    x(k+1) = x(k) + v(k+1)
The look-ahead gradient lets the method correct its velocity before it overshoots.
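The same sketch with the Nesterov look-ahead gradient; only the point at which the gradient is evaluated changes:

    import numpy as np

    def nesterov_step(grad, x, v, alpha=0.01, beta=0.9):
        v = beta * v - alpha * grad(x + beta * v)   # gradient at the look-ahead point
        return x + v, v

    grad = lambda x: 2 * x                          # gradient of f(x) = ||x||^2
    x, v = np.array([5.0, -3.0]), np.zeros(2)
    for _ in range(200):
        x, v = nesterov_step(grad, x, v)
    print(x)                                        # converges toward the origin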
Adagrad
Instead of using the same learning rate for all components of x, the adaptive subgradient method (Adagrad) adapts the learning rate to each component of x. With g(k) = ∇f(x(k)), each component is updated as
    x_i(k+1) = x_i(k) − α g_i(k) / (ϵ + √(s_i(k)))
where s_i(k) = Σ_{j=1..k} (g_i(j))² is the accumulated sum of squared gradients and ϵ ≈ 1×10⁻⁸ prevents division by zero. Because s_i(k) only grows, the effective learning rate decreases monotonically.
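An Adagrad sketch on the same quadratic; the learning rate α = 0.5 is illustrative:

    import numpy as np

    def adagrad_step(grad, x, s, alpha=0.5, eps=1e-8):
        g = grad(x)
        s = s + g * g                         # accumulated squared gradients, per component
        return x - alpha * g / (eps + np.sqrt(s)), s

    grad = lambda x: 2 * x                    # gradient of f(x) = ||x||^2
    x, s = np.array([5.0, -3.0]), np.zeros(2)
    for _ in range(500):
        x, s = adagrad_step(grad, x, s)
    print(x)   # slowly approaches the origin as the per-component step size decays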
RMSProp
Extends Adagrad to avoid a monotonically decreasing learning rate by maintaining a decaying average of squared gradients. The update equations are
    s_i(k+1) = γ s_i(k) + (1 − γ) (g_i(k))²
    x_i(k+1) = x_i(k) − α g_i(k) / (ϵ + √(s_i(k+1)))
where γ ∈ [0, 1) is the decay rate, typically around 0.9.
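An RMSProp sketch; γ = 0.9 follows common practice and the learning rate is illustrative:

    import numpy as np

    def rmsprop_step(grad, x, s, alpha=0.05, gamma=0.9, eps=1e-8):
        g = grad(x)
        s = gamma * s + (1 - gamma) * g * g   # decaying average of squared gradients
        return x - alpha * g / (eps + np.sqrt(s)), s

    grad = lambda x: 2 * x                    # gradient of f(x) = ||x||^2
    x, s = np.array([5.0, -3.0]), np.zeros(2)
    for _ in range(300):
        x, s = rmsprop_step(grad, x, s)
    print(x)                                  # settles within roughly alpha of the origin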
Adadelta
Modifies RMSProp to eliminate the learning rate parameter entirely: a decaying average of the squared updates replaces the learning rate, so the step size adapts automatically.
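An Adadelta sketch, assuming the standard formulation with decaying averages of squared gradients (s) and squared updates (u); note that no learning rate appears:

    import numpy as np

    def adadelta_step(grad, x, s, u, gamma=0.95, eps=1e-6):
        g = grad(x)
        s = gamma * s + (1 - gamma) * g * g            # decaying squared gradients
        dx = -np.sqrt(u + eps) / np.sqrt(s + eps) * g  # no learning rate parameter
        u = gamma * u + (1 - gamma) * dx * dx          # decaying squared updates
        return x + dx, s, u

    grad = lambda x: 2 * x                             # gradient of f(x) = ||x||^2
    x = np.array([5.0, -3.0])
    s, u = np.zeros(2), np.zeros(2)
    for _ in range(1000):
        x, s, u = adadelta_step(grad, x, s, u)
    print(x)   # progress is slow at first because u starts at zero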
Adam
The adaptive moment estimation method (Adam) incorporates ideas from both momentum and RMSProp, adapting the learning rate to each parameter. At each iteration k = 1, 2, … (with v(0) = s(0) = 0), a sequence of values is computed:
    biased decaying momentum: v_i(k) = γ_v v_i(k−1) + (1 − γ_v) g_i(k)
    biased decaying squared gradient: s_i(k) = γ_s s_i(k−1) + (1 − γ_s) (g_i(k))²
    corrected decaying momentum: v̂_i(k) = v_i(k) / (1 − γ_v^k)
    corrected decaying squared gradient: ŝ_i(k) = s_i(k) / (1 − γ_s^k)
    next iterate: x_i(k+1) = x_i(k) − α v̂_i(k) / (ϵ + √(ŝ_i(k)))
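An Adam sketch following the sequence above; γ_v = 0.9, γ_s = 0.999, and ϵ = 10⁻⁸ are the commonly used defaults, and the learning rate is illustrative:

    import numpy as np

    def adam_step(grad, x, v, s, k, alpha=0.1, gamma_v=0.9, gamma_s=0.999, eps=1e-8):
        g = grad(x)
        v = gamma_v * v + (1 - gamma_v) * g         # biased decaying momentum
        s = gamma_s * s + (1 - gamma_s) * g * g     # biased decaying squared gradient
        v_hat = v / (1 - gamma_v**k)                # corrected decaying momentum
        s_hat = s / (1 - gamma_s**k)                # corrected decaying squared gradient
        return x - alpha * v_hat / (eps + np.sqrt(s_hat)), v, s

    grad = lambda x: 2 * x                          # gradient of f(x) = ||x||^2
    x = np.array([5.0, -3.0])
    v, s = np.zeros(2), np.zeros(2)
    for k in range(1, 301):                         # k starts at 1 for bias correction
        x, v, s = adam_step(grad, x, v, s, k)
    print(x)                                        # approaches the origin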
Hypergradient Descent
Many accelerated descent methods are highly sensitive to hyperparameters such as the learning rate. Applying gradient descent to a hyperparameter of an underlying descent method is called hypergradient descent. It requires computing the partial derivative of the objective function with respect to the hyperparameter. For plain gradient descent with learning rate α, since x(k) = x(k−1) − α ∇f(x(k−1)), that partial derivative is −∇f(x(k)) · ∇f(x(k−1)), giving the learning rate update
    α(k+1) = α(k) + μ ∇f(x(k)) · ∇f(x(k−1))
where μ is the hypergradient learning rate.
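A hypergradient descent sketch for plain gradient descent, in which the learning rate alpha is itself updated by gradient descent; the hypergradient learning rate mu and the other values are illustrative:

    import numpy as np

    def hypergradient_descent(grad, x, alpha=0.01, mu=1e-4, iterations=200):
        g_prev = np.zeros_like(x)
        for _ in range(iterations):
            g = grad(x)
            # d f(x(k)) / d alpha = -grad f(x(k)) . grad f(x(k-1)), so descending
            # that hypergradient raises alpha while successive gradients agree.
            alpha = alpha + mu * np.dot(g, g_prev)
            x = x - alpha * g
            g_prev = g
        return x, alpha

    grad = lambda x: 2 * x                # gradient of f(x) = ||x||^2
    x, alpha = hypergradient_descent(grad, np.array([5.0, -3.0]))
    print(x, alpha)                       # alpha grows to a value suited to the problem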
Summary
Gradient descent follows the direction of steepest descent. The conjugate gradient method can automatically adjust to local valleys. Descent methods with momentum build up progress in favorable directions. A wide variety of accelerated descent methods use special techniques to speed up descent. Hypergradient descent applies gradient descent to the learning rate of an underlying descent method.