CMS 165, Lecture 2: Optimization
Non-convex optimization
- "All loss functions that are not convex" is not a very informative characterization.
- Global optimality: too strong a requirement.
- What are weaker notions of optimality?
- What is a saddle point?
Different kinds of critical/stationary points
Higher order saddle points
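A minimal numerical illustration of the distinction (my own toy example, not from the slides): for f(x, y) = x^2 - y^2 the origin is an ordinary strict saddle, with Hessian eigenvalues of both signs, while for the monkey saddle f(x, y) = x^3 - 3xy^2 the Hessian at the origin vanishes entirely, so second-order information alone cannot classify it; that is a higher-order (degenerate) saddle.

```python
import numpy as np

def hessian_eigs(hess_fn, point):
    """Eigenvalues of the Hessian at a critical point."""
    return np.linalg.eigvalsh(hess_fn(*point))

# f1(x, y) = x^2 - y^2: gradient (2x, -2y) vanishes at the origin.
hess_f1 = lambda x, y: np.array([[2.0, 0.0], [0.0, -2.0]])

# f2(x, y) = x^3 - 3xy^2 (monkey saddle): gradient (3x^2 - 3y^2, -6xy)
# also vanishes at the origin.
hess_f2 = lambda x, y: np.array([[6.0 * x, -6.0 * y], [-6.0 * y, -6.0 * x]])

print(hessian_eigs(hess_f1, (0.0, 0.0)))  # [-2.  2.] -> eigenvalues of both signs: strict saddle
print(hessian_eigs(hess_f2, (0.0, 0.0)))  # [ 0.  0.] -> Hessian gives no information: higher-order saddle
```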
When can saddles arise: Symmetries
Higher order saddles: Over-parameterization
- Over-specification: more capacity (neurons) than required.
- Each critical point of the smaller network (which realizes the same function) generates a set of critical points in the larger network (degeneracy); see the toy check below.
- Even global optima of the smaller network can generate local optima and saddle points in the larger network.
- A long time can be spent in the manifold of singularities.
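A toy check of the degeneracy claim (my own sketch; the data, activation, and parameter values are illustrative assumptions): split a single hidden tanh neuron into two identical copies whose output weights sum to the original one. Every split gives exactly the same function, so one global optimum of the small network becomes a whole line of zero-loss parameter settings in the larger network, i.e. a degenerate manifold of critical points.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)

# Small network: one hidden tanh neuron with parameters (w, v).
w, v = 1.3, 0.7
y = v * np.tanh(w * x)            # use its output as the target, so (w, v) attains loss 0

def loss_big(w1, w2, v1, v2):
    """Squared loss of a two-neuron network fit to the same targets."""
    pred = v1 * np.tanh(w1 * x) + v2 * np.tanh(w2 * x)
    return np.mean((pred - y) ** 2)

# Split the single neuron: w1 = w2 = w, output weights (a*v, (1-a)*v).
for a in [0.0, 0.3, 0.5, 0.9]:
    print(a, loss_big(w, w, a * v, (1 - a) * v))   # loss stays exactly 0 for every a
```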
How to escape saddle points?
- What happens to gradient descent near saddles?
- What happens to Newton's method near saddles?
- One solution: cubic regularization, a second-order method.
- What about first-order methods? (See the perturbed gradient descent sketch below.)
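A sketch in the spirit of the perturbed gradient descent idea from the Jin et al. reference below (the toy objective, step size, and noise scale are my assumptions): plain gradient descent started on the stable manifold of a saddle converges to the saddle, while injecting a small random kick whenever the gradient is tiny lets the iterate escape to a genuine minimum.

```python
import numpy as np

# f(x, y) = x^2 - y^2 + 0.5 * y^4 : saddle at (0, 0), minima at (0, +/-1).
grad = lambda p: np.array([2 * p[0], -2 * p[1] + 2 * p[1] ** 3])

def gd(p, eta=0.1, steps=300, noise=0.0, seed=0):
    rng = np.random.default_rng(seed)
    p = np.array(p, dtype=float)
    for _ in range(steps):
        g = grad(p)
        if noise > 0 and np.linalg.norm(g) < 1e-3:
            p = p + noise * rng.normal(size=2)   # random kick off the saddle
            g = grad(p)                          # recompute the gradient after the kick
        p = p - eta * g
    return p

start = [1.0, 0.0]                 # on the stable manifold of the saddle
print(gd(start))                   # plain GD converges to the saddle (0, 0)
print(gd(start, noise=1e-2))       # perturbed GD escapes to a minimum (0, +/-1)
```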
Stochastic Gradient Descent
- Why is it so popular?
- Why is it so special?
- What guarantees does it have?
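For concreteness, a minimal mini-batch SGD loop on a least-squares problem (the problem setup, batch size, and step size are illustrative assumptions, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 5
X = rng.normal(size=(n, d))
theta_true = rng.normal(size=d)
y = X @ theta_true + 0.1 * rng.normal(size=n)      # noisy linear observations

def sgd(eta=0.05, batch=32, epochs=20):
    theta = np.zeros(d)
    for _ in range(epochs):
        for idx in np.array_split(rng.permutation(n), n // batch):
            Xb, yb = X[idx], y[idx]
            g_hat = Xb.T @ (Xb @ theta - yb) / len(idx)   # mini-batch gradient estimate
            theta -= eta * g_hat                          # noisy descent step
    return theta

print(np.linalg.norm(sgd() - theta_true))   # small: SGD recovers theta up to the noise level
```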
Scaling up optimization
- What are the challenges in scaling up?
- How do we take communication costs into account?
- What is the impact of gradient compression?
Distributed training involves computation & communication
[Diagram: GPU 1 and GPU 2, each holding 1/2 of the data, exchange (possibly compressed) gradients and updates with a parameter server.]
Distributed training by majority vote
[Diagram: GPU 1, GPU 2, GPU 3 each send sign(g) to the parameter server; the server broadcasts back sign[sum(sign(g))].]
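A minimal simulation of the majority-vote scheme in the diagram (the noise model and dimensions are my assumptions): each worker transmits only sign(g), one bit per coordinate; the server aggregates with the sign of the sum of signs; every worker then applies the same 1-bit update.

```python
import numpy as np

rng = np.random.default_rng(0)
d, M = 10, 7                       # parameter dimension and number of workers
g_true = rng.normal(size=d)        # "true" gradient at the current iterate

# Each worker sees the true gradient plus its own mini-batch noise
# and sends only the sign vector (1 bit per coordinate).
worker_signs = [np.sign(g_true + rng.normal(scale=2.0, size=d)) for _ in range(M)]

# Server: elementwise majority vote = sign of the sum of signs, broadcast back.
vote = np.sign(np.sum(worker_signs, axis=0))

# Every worker applies the same 1-bit update: theta <- theta - eta * vote.
eta = 0.01
theta = np.zeros(d)
theta -= eta * vote

print((vote == np.sign(g_true)).mean())   # fraction of coordinates the vote gets right
```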
Variants of SGD distort the gradient in different ways, e.g. Signum and Adam.
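To make the "distortion" concrete, a side-by-side sketch of the two update rules (hyper-parameter defaults are illustrative): Signum applies the sign of a momentum buffer, while Adam rescales each coordinate of the momentum by a running second-moment estimate; the signSGD papers cited below discuss how closely the two are related.

```python
import numpy as np

def signum_step(theta, g, m, eta=1e-3, beta=0.9):
    """Signum: take the sign of the momentum buffer."""
    m = beta * m + (1 - beta) * g
    return theta - eta * np.sign(m), m

def adam_step(theta, g, m, v, t, eta=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """Adam: momentum rescaled coordinate-wise by a second-moment estimate."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g ** 2
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    return theta - eta * m_hat / (np.sqrt(v_hat) + eps), m, v

theta, g = np.zeros(3), np.array([0.5, -0.02, 3.0])
# With zero-initialized buffers, both first steps reduce to -eta * sign(g):
print(signum_step(theta, g, m=np.zeros(3))[0])
print(adam_step(theta, g, m=np.zeros(3), v=np.zeros(3), t=1)[0])
```

Both rules discard most of the gradient's magnitude information, which is the sense in which they "distort" the gradient.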
Large-Batch Analysis
Vector density & its relevance in deep learning
- A sparse vector vs. a dense vector.
- Natural measure of density (from the signSGD analysis): φ(v) = ||v||_1^2 / (d ||v||_2^2).
- φ(v) = 1 for a fully dense vector (e.g. a sign vector); φ(v) ≈ 0 (= 1/d) for a fully sparse vector.
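A quick numerical check of the density measure, using the definition from the signSGD paper cited in the references:

```python
import numpy as np

def density(v):
    """phi(v) = ||v||_1^2 / (d * ||v||_2^2): 1 for fully dense, 1/d for 1-sparse."""
    v = np.asarray(v, dtype=float)
    return np.sum(np.abs(v)) ** 2 / (v.size * np.sum(v ** 2))

d = 1000
sign_vec = np.sign(np.random.default_rng(0).normal(size=d))   # fully dense sign vector
sparse_vec = np.zeros(d)
sparse_vec[0] = 3.0                                           # fully sparse (one nonzero)

print(density(sign_vec))    # 1.0
print(density(sparse_vec))  # 0.001 = 1/d
```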
Distributed signSGD: majority vote theory
- If the gradient noise is unimodal and symmetric (reasonable by the central limit theorem)...
- ...then majority vote with M workers converges at a rate whose variance term shrinks like 1/sqrt(M), i.e. the same variance reduction as distributed SGD.
- Symmetry ensures the sign is not badly corrupted by the noise.
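A small Monte Carlo illustration of the symmetry point (the gradient value, noise scale, and worker counts are arbitrary choices of mine): when symmetric noise dwarfs the signal, a single worker's sign is right only slightly more than half the time, but the majority vote over M workers is right much more often, and the advantage grows with M.

```python
import numpy as np

rng = np.random.default_rng(0)
g, sigma, trials = 0.1, 1.0, 20_000      # true gradient coordinate and noise scale

def vote_accuracy(M):
    # M workers each observe g + symmetric Gaussian noise and vote with their sign.
    noise = rng.normal(scale=sigma, size=(trials, M))
    votes = np.sign(np.sign(g + noise).sum(axis=1))
    return np.mean(votes == np.sign(g))

for M in [1, 9, 31, 101]:                # odd M avoids ties
    print(M, vote_accuracy(M))           # accuracy of the voted sign rises with M
```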
Optimization in ML
- The optimizer is used as a black box in ML; we expect it to work perfectly.
- In many settings it does work, e.g. over-parameterized neural networks. But there is extensive hyper-parameter magic (which no one talks about).
- SGD is the workhorse. Noise helps (for regularization).
- Second-order methods are not yet scalable and don't seem to help.
- In many cases we don't actually want the optimizer to work too well: we want it to avoid certain solutions, even globally optimal ones, e.g. when learning multi-lingual embeddings.
- We initialize in the hope of nudging towards desirable solutions; but with a perfect optimizer, initialization should not matter.
- Early stopping: we don't want to fit the loss function too perfectly when there is not enough regularization. (Generalization: wait for a few more lectures. A generic sketch follows below.)
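A generic sketch of the early-stopping heuristic mentioned above (the step_fn / val_loss_fn hooks and the patience value are placeholders of mine, not from the lecture): stop when held-out loss stops improving rather than driving training loss to zero.

```python
def train_with_early_stopping(step_fn, val_loss_fn, max_steps=10_000, patience=20):
    """step_fn() performs one optimizer step; val_loss_fn() returns the held-out loss.

    Both hooks are hypothetical placeholders for whatever model is being trained.
    """
    best_loss, best_step, since_best = float("inf"), 0, 0
    for t in range(max_steps):
        step_fn()
        loss = val_loss_fn()
        if loss < best_loss:
            best_loss, best_step, since_best = loss, t, 0
        else:
            since_best += 1
        if since_best >= patience:      # held-out loss stopped improving: stop early
            break
    return best_step, best_loss
```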
References
- ICML 2016 tutorial on non-convex optimization: http://tensorlab.cms.caltech.edu/users/anima/slides/icml2016-tutorial.pdf
- "Cubic regularization of Newton method and its global performance", Nesterov and Polyak.
- Rong Ge's blog: https://www.offconvex.org/2017/07/19/saddle-efficiency/
- "How to Escape Saddle Points Efficiently", Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M. Kakade, Michael Jordan: https://arxiv.org/pdf/1703.00887.pdf
- SignSGD papers: https://arxiv.org/pdf/1802.04434.pdf and https://arxiv.org/pdf/1810.05291.pdf
- Hal Daume's blog: https://nlpers.blogspot.com/2016/05/a-bad-optimizer-is-not-good-thing.html

Optional follow-up reading:
- Early stopping in kernel methods: https://papers.nips.cc/paper/7187-early-stopping-for-kernel-boosting-algorithms-a-general-analysis-with-localized-complexities.pdf
- Escaping saddles in constrained optimization: https://arxiv.org/pdf/1809.02162.pdf
- Escaping higher order saddles: http://www.offconvex.org/2016/03/22/saddlepoints/
- K. Fukumizu, S. Amari. "Local minima and plateaus in hierarchical structures of multilayer perceptrons." 2000.