LARS Background: Reference Paper and Reference Patch in Intel Caffe

LARS Background: Reference Paper and Reference Patch in Intel Caffe

Layer-wise Adaptive Rate Scaling (LARS) aims to resolve the convergence problems that appear at large batch sizes with the SGD/Momentum optimizer by adjusting a per-layer local learning rate.

Reference papers (UC Berkeley):
- Large Batch Training of Convolutional Networks
- ImageNet Training in Minutes

Reference patch in Intel Caffe:
https://github.com/intel/caffe/commit/9a565d68d1c274cf82dad794f2febaa4b195e71f

Add LARS in Fluid

Add LARS to the SGD/Momentum optimizer to adjust the local learning rate. Three attributes are added to the SGD/Momentum optimizer:
- use_local_lr: bool (default false)
- local_gw_ratio: float (default 0.001)
- weight_decay: float (default 0.0005)

Local LR rule (a sketch of this computation follows this slide):

if (use_local_lr)
    local_lr = learning_rate * local_gw_ratio * sqrt(sumsq(param))
               / (sqrt(sumsq(grad)) + weight_decay * sqrt(sumsq(param)))

Status:
- Function code ready: LARS is added to the SGD and Momentum (without Nesterov) optimizers for dense parameter updates.
- ResNet50 convergence was tested with an 8K batch size on a single E5-2699 v4 with the CIFAR-10 dataset; results are on the next two slides.
- Unit test code will be added once the solution review is passed.

Dependency and to do:
- A global learning rate scheduler (step, poly, warmup, etc.) is not yet implemented in Fluid, which affects convergence with a large batch size and a large initial learning rate: https://github.com/PaddlePaddle/Paddle/issues/6413
- To be tested in a distributed environment.
- To check the performance impact introduced by the LARS computation after the Fluid CPU optimization is done (the impact is minor in the non-optimized CPU version of Fluid).
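The local LR rule above translates directly into code. Below is a minimal NumPy sketch assuming dense parameter and gradient tensors; the function names are illustrative, not the actual Fluid operators:

import numpy as np

def lars_local_lr(param, grad, learning_rate,
                  local_gw_ratio=0.001, weight_decay=0.0005):
    # local_lr = lr * local_gw_ratio * ||param|| / (||grad|| + weight_decay * ||param||)
    p_norm = np.sqrt(np.sum(np.square(param)))  # sqrt(sumsq(param))
    g_norm = np.sqrt(np.sum(np.square(grad)))   # sqrt(sumsq(grad))
    return learning_rate * local_gw_ratio * p_norm / (g_norm + weight_decay * p_norm)

def sgd_update(param, grad, learning_rate, use_local_lr=True):
    # Dense SGD step; with use_local_lr, the global LR is replaced per layer.
    lr = lars_local_lr(param, grad, learning_rate) if use_local_lr else learning_rate
    return param - lr * grad

def momentum_update(param, grad, velocity, learning_rate, mu=0.9, use_local_lr=True):
    # Dense Momentum step (without Nesterov): v = mu * v + lr * grad; param -= v
    lr = lars_local_lr(param, grad, learning_rate) if use_local_lr else learning_rate
    velocity = mu * velocity + lr * grad
    return param - velocity, velocity

Each layer (parameter tensor) gets its own local_lr, which is what keeps the effective step size bounded when the global LR is scaled up with the batch size.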

ResNet50 Convergence Testing

Benchmark: ResNet-50, https://github.com/dzhwinter/benchmark/blob/master/fluid/resnet50.py (with test accuracy added)
Dataset: CIFAR-10
Num passes: 50
CPU: 2-socket E5-2699 v4
Memory: DDR4 128 GB (16 GB x 8)
Test method: testing is performed on a single Broadwell machine with a large batch size; no global LR schedule is available.

Batch Size | Learning Rate | Momentum | LARS | local_gw_ratio | weight_decay | Train Acc. (pass of max) | Test Acc. (pass of max)
32         | 0.01          | 0.9      | Off  | -              | -            | 99.41% (48)              | 81.86% (42)
1024       | 0.32          | 0.9      | Off  | -              | -            | 99.28%                   | 78.09% (45)
8192       | 2.56          | 0.9      | Off  | -              | -            | 15.21% (49)              | 15.76%
8192       | 1             | 0.9      | Off  | -              | -            | 25.51%                   | 24.43%
8192       | 2.56          | 0.9      | On   | 0.001          |              | 90.63%                   | 65.17% (36)
8192       | 2.56          | 0.9      | On   | 0.0005         |              | 90.54% (47)              | 64.09% (26)
8192       | 2.56          | 0.9      | On   | 0.00025        |              | 86.04%                   | 60.28% (28)
8192       | 2.56          | 0.9      | On   |                |              | 87.33%                   | 59.84% (21)
8192       | 2.56          | 0.9      | On   | 0.002          |              | 52.54% (2)               | 10.00% (0-49)

Test summary:
- Using the benchmark defaults of batch size 32 and LR 0.01 as the baseline, we get 99.41% train accuracy and 81.86% test accuracy.
- We then scale the batch size to 1024 and 8192 and scale the LR linearly from 0.01 to 0.32 and 2.56 (see the sketch below).
- For batch size 8192 with LR 2.56, we reach only 15.76% test accuracy within 50 passes; reducing the LR from 2.56 to 1 raises the test accuracy to 24.43%.
- With LARS turned on, batch size 8192 reaches 90.63% train accuracy and 65.17% test accuracy within 50 passes.
- Because no global LR scheduler is available in Fluid yet, the large initial LR of 2.56 cannot be decayed over the passes, which prevents ResNet50 from reaching its theoretical convergence at a large batch size.
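The learning rates in the table follow the linear scaling rule (scale the LR in proportion to the batch size). A small illustrative sketch, not part of the benchmark script, reproducing the values used above:

def scale_lr(base_lr, base_batch_size, batch_size):
    # Linear scaling rule: the LR grows in proportion to the batch size.
    return base_lr * batch_size / base_batch_size

# Baseline: batch size 32, LR 0.01.
assert abs(scale_lr(0.01, 32, 1024) - 0.32) < 1e-9
assert abs(scale_lr(0.01, 32, 8192) - 2.56) < 1e-9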

ResNet50 Convergence Testing – Accuracy Curve

ResNet50 Theoretical Convergence (ImageNet)

Model     | Top-1 validation accuracy | Top-5 validation accuracy
ResNet-50 | 75.3%                     | 92.2%

https://github.com/KaimingHe/deep-residual-networks