LARS Background

Layer-wise Adaptive Rate Scaling (LARS) aims to resolve the convergence issue of the SGD/Momentum optimizer under large batch sizes by adjusting a per-layer local learning rate.

Reference papers (UC Berkeley):
- Large Batch Training of Convolutional Networks
- ImageNet Training in Minutes

Reference patch in Intel Caffe: https://github.com/intel/caffe/commit/9a565d68d1c274cf82dad794f2febaa4b195e71f
Add LARS in Fluid

Add LARS to the SGD/Momentum optimizer to adjust the local LR. Three attributes are added to the SGD/Momentum optimizer:
- use_local_lr: bool (default false)
- local_gw_ratio: float (default 0.001)
- weight_decay: float (default 0.0005)

If use_local_lr is enabled, the layer-local learning rate is computed as (see the sketch after this section):

local_lr = learning_rate * local_gw_ratio * sqrt(sumsq(param)) / (sqrt(sumsq(grad)) + weight_decay * sqrt(sumsq(param)))

Status:
- Function code ready: LARS added to the SGD and Momentum (without Nesterov) optimizers for dense parameter updates.
- ResNet50 convergence tested with 8K batch size on a single E5-2699 v4 with the cifar10 dataset; see the testing results below.
- Unit test code to be added once the solution review passes.

Dependency and To Do
- A global learning rate scheduler (step, poly, warmup, etc.) is not yet implemented in Fluid, which affects convergence with a large batch size and a large initial learning rate: https://github.com/PaddlePaddle/Paddle/issues/6413
- To be tested in a distributed environment.
- To check the performance impact introduced by the LARS computation after the Fluid CPU optimization is done (the impact is minor in the non-optimized CPU version of Fluid).
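To make the formula above concrete, below is a minimal NumPy sketch of the local LR computation and a Momentum step that uses it. This is an illustration only; the function names (`lars_local_lr`, `momentum_update_with_lars`) and the NumPy-based interface are assumptions, not the actual Fluid operator API.

```python
import numpy as np

def lars_local_lr(param, grad, learning_rate,
                  local_gw_ratio=0.001, weight_decay=0.0005):
    """LARS layer-local learning rate for one parameter tensor."""
    p_norm = np.sqrt(np.sum(np.square(param)))   # sqrt(sumsq(param))
    g_norm = np.sqrt(np.sum(np.square(grad)))    # sqrt(sumsq(grad))
    return (learning_rate * local_gw_ratio * p_norm
            / (g_norm + weight_decay * p_norm))

def momentum_update_with_lars(param, grad, velocity, learning_rate,
                              mu=0.9, use_local_lr=True,
                              local_gw_ratio=0.001, weight_decay=0.0005):
    """Momentum update (without Nesterov); the step size is replaced by
    the LARS local LR when use_local_lr is enabled."""
    lr = learning_rate
    if use_local_lr:
        lr = lars_local_lr(param, grad, learning_rate,
                           local_gw_ratio, weight_decay)
    velocity = mu * velocity + lr * grad
    return param - velocity, velocity
```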
ResNet50 Convergence Testing

- Benchmark: ResNet-50 - https://github.com/dzhwinter/benchmark/blob/master/fluid/resnet50.py (with test accuracy added)
- Dataset: cifar10
- Num passes: 50
- CPU: 2-socket E5-2699 v4
- Memory: DDR4 128 GB (16 GB x 8)
- Test method: testing is performed on a single Broadwell machine with large batch sizes; no global LR schedule is available.

| Optimizer | Batch Size | Learning Rate | Momentum | LARS | local_gw_ratio | weight_decay | Train Accuracy | Pass No. (Max Train Accuracy) | Test Accuracy | Pass No. (Max Test Accuracy) |
|---|---|---|---|---|---|---|---|---|---|---|
| Momentum | 32 | 0.01 | 0.9 | Off | NA | | 99.41% | 48 | 81.86% | 42 |
| | 1024 | 0.32 | | | | | 99.28% | | 78.09% | 45 |
| | 8192 | 2.56 | | | | | 15.21% | 49 | 15.76% | |
| | | 1 | | | | | 25.51% | | 24.43% | |
| | | | | On | 0.001 | | 90.63% | | 65.17% | 36 |
| | | | | | 0.0005 | | 90.54% | 47 | 64.09% | 26 |
| | | | | | 0.00025 | | 86.04% | | 60.28% | 28 |
| | | | | | | | 87.33% | | 59.84% | 21 |
| | | | | | 0.002 | | 52.54% | 2 | 10.00% | 0-49 |

(Empty cells: value carried over from the row above or not recorded in the source.)

Test Summary:
- Using the benchmark defaults (batch size 32, LR 0.01) as the baseline, we get 99.41% train accuracy and 81.86% test accuracy.
- Scaling the batch size to 1024 and 8192, we scale the LR linearly from 0.01 to 0.32 and 2.56 respectively.
- For batch size 8192 with LR 2.56, we only reach 15.76% test accuracy within 50 passes; reducing the LR from 2.56 to 1 raises the test accuracy to 24.43%.
- With LARS turned on, for batch size 8192 we reach 90.63% train accuracy and 65.17% test accuracy within 50 passes.
- Because no global LR scheduler is available in Fluid yet, the large initial LR of 2.56 cannot be decayed over passes, which blocks ResNet50 from reaching its theoretical convergence under large batch sizes.
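The learning rates in the table follow the linear scaling rule LR = base_LR * (batch_size / base_batch_size). A small sketch of that arithmetic; the helper name `scaled_lr` is just for illustration:

```python
def scaled_lr(base_lr, base_batch_size, batch_size):
    """Linear learning-rate scaling: scale the LR proportionally to the batch size."""
    return base_lr * batch_size / base_batch_size

# Reproduces the learning rates used in the table above.
print(scaled_lr(0.01, 32, 1024))  # 0.32
print(scaled_lr(0.01, 32, 8192))  # 2.56
```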
ResNet50 Convergence Testing – Accuracy Curve
ResNet50 Theoretical Convergence (ImageNet)

| model | top-1 validation accuracy | top-5 validation accuracy |
|---|---|---|
| ResNet-50 | 75.3% | 92.2% |

Source: https://github.com/KaimingHe/deep-residual-networks