LARS Background

Layer-wise Adaptive Rate Scaling (LARS) aims to resolve the convergence issue of the SGD/Momentum optimizer under large batch sizes by adjusting a per-layer local learning rate.

Reference papers (UC Berkeley):
- Large Batch Training of Convolutional Networks
- ImageNet Training in Minutes

Reference patch in Intel Caffe: https://github.com/intel/caffe/commit/9a565d68d1c274cf82dad794f2febaa4b195e71f
Add LARS in Fluid

Add LARS to the SGD/Momentum optimizer to adjust the local LR. Three attributes are added to the SGD/Momentum optimizer:
- use_local_lr: bool (default false)
- local_gw_ratio: float (default 0.001)
- weight_decay: float (default 0.0005)

If use_local_lr is enabled, the layer-local learning rate is computed as (see the sketch after this section):

local_lr = learning_rate * local_gw_ratio * sqrt(sumsq(param)) / (sqrt(sumsq(grad)) + weight_decay * sqrt(sumsq(param)))

Status:
- Function code ready: LARS added to the SGD and Momentum (without Nesterov) optimizers for dense parameter updates.
- ResNet50 convergence tested with 8K batch size on a single E5-2699 v4 with the cifar10 dataset; see the testing results below.
- Unit test code to be added once the solution review passes.

Dependency and To Do
- A global learning rate scheduler (step, poly, warmup, etc.) is not yet implemented in Fluid, which affects convergence with a large batch size and a large initial learning rate: https://github.com/PaddlePaddle/Paddle/issues/6413
- To be tested in a distributed environment.
- To check the performance impact introduced by the LARS computation after the Fluid CPU optimization is done (the impact is minor in the non-optimized CPU version of Fluid).
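To make the formula above concrete, below is a minimal NumPy sketch of the local LR computation and a Momentum step that uses it. This is an illustration only; the function names (`lars_local_lr`, `momentum_update_with_lars`) and the NumPy-based interface are assumptions, not the actual Fluid operator API.

```python
import numpy as np

def lars_local_lr(param, grad, learning_rate,
                  local_gw_ratio=0.001, weight_decay=0.0005):
    """LARS layer-local learning rate for one parameter tensor."""
    p_norm = np.sqrt(np.sum(np.square(param)))   # sqrt(sumsq(param))
    g_norm = np.sqrt(np.sum(np.square(grad)))    # sqrt(sumsq(grad))
    return (learning_rate * local_gw_ratio * p_norm
            / (g_norm + weight_decay * p_norm))

def momentum_update_with_lars(param, grad, velocity, learning_rate,
                              mu=0.9, use_local_lr=True,
                              local_gw_ratio=0.001, weight_decay=0.0005):
    """Momentum update (without Nesterov); the step size is replaced by
    the LARS local LR when use_local_lr is enabled."""
    lr = learning_rate
    if use_local_lr:
        lr = lars_local_lr(param, grad, learning_rate,
                           local_gw_ratio, weight_decay)
    velocity = mu * velocity + lr * grad
    return param - velocity, velocity
```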
ResNet50 Convergence Testing

- Benchmark: ResNet-50 - https://github.com/dzhwinter/benchmark/blob/master/fluid/resnet50.py (with test accuracy added)
- Dataset: cifar10
- Num passes: 50
- CPU: 2-socket E5-2699 v4
- Memory: DDR4 128 GB (16 GB x 8)
- Test method: testing is performed on a single Broadwell machine with large batch sizes; no global LR schedule is available.

| Optimizer | Batch Size | Learning Rate | Momentum | LARS | local_gw_ratio | weight_decay | Train Accuracy | Pass No. (Max Train Accuracy) | Test Accuracy | Pass No. (Max Test Accuracy) |
|---|---|---|---|---|---|---|---|---|---|---|
| Momentum | 32 | 0.01 | 0.9 | Off | NA | | 99.41% | 48 | 81.86% | 42 |
| | 1024 | 0.32 | | | | | 99.28% | | 78.09% | 45 |
| | 8192 | 2.56 | | | | | 15.21% | 49 | 15.76% | |
| | | 1 | | | | | 25.51% | | 24.43% | |
| | | | | On | 0.001 | | 90.63% | | 65.17% | 36 |
| | | | | | 0.0005 | | 90.54% | 47 | 64.09% | 26 |
| | | | | | 0.00025 | | 86.04% | | 60.28% | 28 |
| | | | | | | | 87.33% | | 59.84% | 21 |
| | | | | | 0.002 | | 52.54% | 2 | 10.00% | 0-49 |

(Empty cells: value carried over from the row above or not recorded in the source.)

Test Summary:
- Using the benchmark defaults (batch size 32, LR 0.01) as the baseline, we get 99.41% train accuracy and 81.86% test accuracy.
- Scaling the batch size to 1024 and 8192, we scale the LR linearly from 0.01 to 0.32 and 2.56 respectively.
- For batch size 8192 with LR 2.56, we only reach 15.76% test accuracy within 50 passes; reducing the LR from 2.56 to 1 raises the test accuracy to 24.43%.
- With LARS turned on, for batch size 8192 we reach 90.63% train accuracy and 65.17% test accuracy within 50 passes.
- Because no global LR scheduler is available in Fluid yet, the large initial LR of 2.56 cannot be decayed over passes, which blocks ResNet50 from reaching its theoretical convergence under large batch sizes.
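The learning rates in the table follow the linear scaling rule LR = base_LR * (batch_size / base_batch_size). A small sketch of that arithmetic; the helper name `scaled_lr` is just for illustration:

```python
def scaled_lr(base_lr, base_batch_size, batch_size):
    """Linear learning-rate scaling: scale the LR proportionally to the batch size."""
    return base_lr * batch_size / base_batch_size

# Reproduces the learning rates used in the table above.
print(scaled_lr(0.01, 32, 1024))  # 0.32
print(scaled_lr(0.01, 32, 8192))  # 2.56
```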
ResNet50 Convergence Testing – Accuracy Curve
ResNet50 Theoretical Convergence (ImageNet)

| model | top-1 validation accuracy | top-5 validation accuracy |
|---|---|---|
| ResNet-50 | 75.3% | 92.2% |

Source: https://github.com/KaimingHe/deep-residual-networks