Lecture 2 CMS 165 Optimization.

Non-convex optimization. The class of all loss functions that are not convex is too broad to be informative on its own, and requiring global optimality is too strong. What weaker notions of optimality can we aim for? What is a saddle point?

Different kinds of critical/stationary points
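As a reference point (my own summary, not from the slides), critical points of a twice-differentiable loss f are commonly classified by the spectrum of the Hessian at the point:

```latex
% Classification of a critical point x^* of a twice-differentiable f,
% i.e. a point with \nabla f(x^*) = 0, via H := \nabla^2 f(x^*):
\begin{cases}
H \succ 0 & \text{local minimum}\\
H \prec 0 & \text{local maximum}\\
\lambda_{\min}(H) < 0 < \lambda_{\max}(H) & \text{(strict) saddle point}\\
\exists\, i:\ \lambda_i(H) = 0 & \text{degenerate: higher-order terms decide}
\end{cases}
```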

Higher order saddle points
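A standard illustrative example (my addition) is the monkey saddle, where the gradient and the entire Hessian vanish at the origin, so second-order information alone can neither certify nor escape the point:

```latex
% Monkey saddle: a higher-order (degenerate) saddle at the origin.
f(x, y) = x^3 - 3xy^2, \qquad \nabla f(0,0) = 0, \qquad \nabla^2 f(0,0) = 0,
% yet (0,0) is not a local minimum, since f(t, 0) = t^3 changes sign with t.
```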

When can saddles arise: Symmetries
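One concrete mechanism (a sketch in my own notation, not from the slides) is permutation symmetry of the hidden units in a one-hidden-layer network:

```latex
% The loss
L(w_1, a_1, w_2, a_2) = \mathbb{E}\,\ell\big(a_1\,\sigma(w_1^\top x) + a_2\,\sigma(w_2^\top x),\; y\big)
% is invariant under swapping (w_1, a_1) \leftrightarrow (w_2, a_2).
% On the symmetric subspace \{w_1 = w_2,\ a_1 = a_2\} the gradient is itself
% symmetric, so gradient descent started there never breaks the tie, and
% critical points restricted to this subspace are often saddle points of the
% full problem.
```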

Higher-order saddles: over-parameterization. Over-specification means giving the network more capacity (neurons) than is needed to realize the target function. Each critical point of the smaller network (which already realizes the function) generates a whole set of critical points of the larger network (a degeneracy); see the sketch below. Even global optima of the smaller network can generate local optima and saddle points of the larger network, and training can spend a long time in this manifold of singularities.
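A sketch of the mechanism, following the Fukumizu-Amari construction cited in the references (the details here are my paraphrase): replicating a hidden unit and splitting its output weight embeds a single critical point of the small network as a one-parameter family of critical points of the larger one.

```latex
% Take a critical point of a k-unit network containing the unit a\,\sigma(w^\top x),
% and replace that unit by two copies with split output weights:
a\,\sigma(w^\top x) \;\longmapsto\; \alpha a\,\sigma(w^\top x) + (1-\alpha)\,a\,\sigma(w^\top x), \qquad \alpha \in \mathbb{R}.
% The realized function, and hence the loss, is unchanged for every \alpha,
% giving a line of critical points of the (k+1)-unit network; depending on \alpha
% these can be saddles or local minima even if the original point was a global
% minimum of the smaller network.
```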

How to escape saddle points? What happens to gradient descent near saddles? What happens to Newton's method near saddles? One solution is cubic regularization, a second-order method. What about first-order methods? (A sketch follows below.)
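To make the first-order option concrete, here is a minimal sketch (my own, loosely in the spirit of the perturbed gradient descent of Jin et al., not their exact algorithm): inject a small random perturbation whenever the gradient is small, so plain gradient descent does not stall at a strict saddle.

```python
import numpy as np

def perturbed_gradient_descent(grad, x0, lr=0.01, eps=1e-3, radius=0.1,
                               n_steps=10_000, seed=0):
    """Gradient descent that injects noise from a small ball whenever the
    gradient norm falls below `eps` (a rough sketch of the idea behind
    perturbed GD for escaping strict saddle points)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        g = grad(x)
        if np.linalg.norm(g) < eps:
            # Near a critical point: perturb to fall off any strict saddle.
            direction = rng.normal(size=x.shape)
            direction /= np.linalg.norm(direction)
            x = x + radius * rng.uniform() * direction
        else:
            x = x - lr * g
    return x

# Toy example: f(x, y) = x^2 + y^4/4 - y^2/2 has a strict saddle at the origin
# and minima at (0, +1) and (0, -1).
grad_f = lambda v: np.array([2 * v[0], v[1] ** 3 - v[1]])
print(perturbed_gradient_descent(grad_f, x0=[0.0, 0.0]))
# ends up near one of the minima (0, ±1), not at the saddle (0, 0)
```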

Stochastic Gradient Descent Why is it so popular? Why is it so special? What guarantees?

Stochastic Gradient Descent
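For reference, a minimal mini-batch SGD loop (a generic sketch, not tied to any particular model from the lecture): at each step, estimate the gradient on a random mini-batch and step against it.

```python
import numpy as np

def sgd(grad_batch, x0, data, lr=0.1, batch_size=32, n_epochs=10, seed=0):
    """Plain mini-batch SGD over a dataset stored as an array of examples."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    n = len(data)
    for _ in range(n_epochs):
        perm = rng.permutation(n)           # reshuffle once per epoch
        for start in range(0, n, batch_size):
            batch = data[perm[start:start + batch_size]]
            x = x - lr * grad_batch(x, batch)
    return x

# Toy example: mean estimation, gradient of 0.5*||x - d||^2 averaged over the batch.
data = np.random.default_rng(1).normal(loc=3.0, scale=1.0, size=(1000, 2))
grad_batch = lambda x, batch: np.mean(x - batch, axis=0)
print(sgd(grad_batch, x0=[0.0, 0.0], data=data))  # ≈ [3.0, 3.0], the data mean
```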

Scaling up optimization. What are the challenges in scaling up? How do we take communication costs into account? What is the impact of gradient compression?

[Diagram] Distributed training involves both computation and communication: two GPUs, each holding half of the data, exchange (optionally compressed) gradients and updates with a parameter server.

[Diagram] Distributed training by majority vote: each GPU sends sign(g) to the parameter server, which broadcasts sign[sum(sign(g))] back to all GPUs.
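A minimal simulation of the scheme in the diagram above (my sketch; the toy objective, noise level, and worker setup are invented for illustration): each worker transmits only the sign of its stochastic gradient, and the server aggregates by majority vote.

```python
import numpy as np

def signsgd_majority_vote(worker_grads, x0, lr=0.01, n_steps=500, seed=0):
    """signSGD with majority vote: each worker transmits sign(g_m); the
    parameter server broadcasts sign(sum_m sign(g_m)) as the update direction."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        # Each worker computes a noisy local gradient and compresses it to its sign.
        signs = np.stack([np.sign(g(x, rng)) for g in worker_grads])
        # Server: coordinate-wise majority vote, broadcast a single sign vector.
        vote = np.sign(signs.sum(axis=0))
        x = x - lr * vote
    return x

# Toy setup: 3 workers, each seeing a noisy gradient of f(x) = 0.5*||x - 3||^2.
noisy_grad = lambda x, rng: (x - 3.0) + rng.normal(scale=0.5, size=x.shape)
workers = [noisy_grad] * 3
print(signsgd_majority_vote(workers, x0=np.zeros(2)))  # ≈ [3.0, 3.0]
```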

Variants of SGD distort the gradient in different ways, e.g. Signum and Adam (see the updates written out below).
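For concreteness, the two updates the slide compares, written in a standard form (notation mine; Adam's bias-correction terms omitted):

```latex
% Signum (signSGD with momentum): keep a momentum buffer and step along its sign.
m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t, \qquad x_{t+1} = x_t - \eta\, \mathrm{sign}(m_t)
% Adam: additionally track a second-moment estimate and rescale coordinate-wise.
v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2, \qquad x_{t+1} = x_t - \eta\, \frac{m_t}{\sqrt{v_t} + \epsilon}
% Both distort the raw gradient g_t: Signum keeps only the sign of each coordinate,
% while Adam rescales each coordinate by an estimate of its root-mean-square magnitude.
```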

Large-Batch Analysis

Vector density and its relevance in deep learning: contrast a sparse vector with a dense vector. A natural measure of density equals 1 for a fully dense v and is approximately 0 for a fully sparse v; a sign vector is fully dense.
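The density measure referred to here is, I believe, the one used in the signSGD papers cited in the references; a small sketch of how it behaves at the two extremes:

```python
import numpy as np

def density(v):
    """phi(v) = ||v||_1^2 / (d * ||v||_2^2): equals 1 for a fully dense vector
    (e.g. a sign vector) and ~1/d for a vector with a single nonzero entry."""
    v = np.asarray(v, dtype=float)
    return np.linalg.norm(v, 1) ** 2 / (v.size * np.linalg.norm(v, 2) ** 2)

d = 1000
print(density(np.sign(np.random.default_rng(0).normal(size=d))))  # 1.0
print(density(np.eye(d)[0]))                                      # 0.001 = 1/d
```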

Distributed signSGD: majority vote theory. If the stochastic gradients are unimodal and symmetric, which is reasonable by the central limit theorem, then majority vote with M workers converges at a rate (given in the cited paper) with the same variance reduction as SGD. The symmetry assumption ensures that the sign is not badly corrupted.

Optimization in ML. The optimizer is used as a black box in ML, and we expect it to work perfectly. In many settings it does, e.g. for over-parameterized neural networks, but only after extensive hyper-parameter magic (which no one talks about). SGD is the workhorse, and its noise helps (as a form of regularization); second-order methods are not yet scalable and do not seem to help. In many cases we do not actually want the optimizer to work too well: we want it to avoid certain solutions, even globally optimal ones, e.g. when learning multi-lingual embeddings. We initialize in the hope of nudging it toward desirable solutions, but with a perfect optimizer the initialization should not matter. Early stopping: we do not want to fit the loss function too perfectly when there is not enough regularization. (Generalization: wait for a few more lectures.)

References
ICML 2016 tutorial on non-convex optimization: http://tensorlab.cms.caltech.edu/users/anima/slides/icml2016-tutorial.pdf
Nesterov, Polyak. "Cubic regularization of Newton method and its global performance."
Rong Ge's blog post: https://www.offconvex.org/2017/07/19/saddle-efficiency/
Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M. Kakade, Michael Jordan. "How to Escape Saddle Points Efficiently." https://arxiv.org/pdf/1703.00887.pdf
signSGD papers: https://arxiv.org/pdf/1802.04434.pdf and https://arxiv.org/pdf/1810.05291.pdf
Hal Daumé's blog: https://nlpers.blogspot.com/2016/05/a-bad-optimizer-is-not-good-thing.html

Optional follow-up reading:
Early stopping in kernel methods: https://papers.nips.cc/paper/7187-early-stopping-for-kernel-boosting-algorithms-a-general-analysis-with-localized-complexities.pdf
Escaping saddles in constrained optimization: https://arxiv.org/pdf/1809.02162.pdf
Escaping higher-order saddles: http://www.offconvex.org/2016/03/22/saddlepoints/
K. Fukumizu, S. Amari. "Local minima and plateaus in hierarchical structures of multilayer perceptrons." 2000.