
1 LARS Background
Layer-wise Adaptive Rate Scaling (LARS) aims to resolve the convergence issue of big-batch training with the SGD/Momentum optimizer by adjusting a per-layer local learning rate. A minimal sketch of the idea follows.
Reference papers (UC Berkeley): "Large Batch Training of Convolutional Networks"; "ImageNet Training in Minutes"
Reference patch in Intel Caffe
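To make the idea concrete, here is a minimal NumPy sketch of per-layer learning-rate scaling in the spirit of the reference paper: each layer gets a local LR proportional to the ratio of its weight norm to its gradient norm. The function names, the default coefficients, and the small epsilon guard are illustrative assumptions, not the Intel Caffe or Fluid code.

import numpy as np

def lars_local_lr(param, grad, local_gw_ratio=0.001, weight_decay=0.0005):
    # Per-layer trust ratio: local_gw_ratio * ||param|| /
    # (||grad|| + weight_decay * ||param||); the defaults and the small
    # epsilon guarding against division by zero are illustrative.
    w_norm = np.sqrt(np.sum(param ** 2))
    g_norm = np.sqrt(np.sum(grad ** 2))
    return local_gw_ratio * w_norm / (g_norm + weight_decay * w_norm + 1e-12)

def sgd_step_with_lars(params, grads, global_lr):
    # Plain SGD where each layer's step is additionally scaled by its own
    # local LR, so one big global LR neither starves layers with small
    # gradients nor blows up layers with large ones.
    for p, g in zip(params, grads):
        p -= global_lr * lars_local_lr(p, g) * g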

2 Add LARS in Fluid
Add LARS in the SGD/Momentum optimizer to adjust the local LR. Three attributes are added to the SGD/Momentum optimizer:
- use_local_lr: bool (default false)
- local_gw_ratio: float (default 0.001)
- weight_decay: float (default )

if (use_local_lr)
    local_lr = learning_rate * local_gw_ratio * sqrt(sumsq(param))
               / (sqrt(sumsq(grad)) + weight_decay * sqrt(sumsq(param)))
(a runnable sketch of this update follows this slide)

Status:
- Function code ready: LARS added to the SGD and Momentum (without Nesterov) optimizers for dense parameter updates.
- ResNet50 convergence tested with an 8K batch size on a single E V4 machine with the cifar10 dataset; test results are on slides 3 and 4.
- Unit test code to be added once the solution review is passed.

Dependency and to do:
- Global learning rate schedulers such as step, poly, and warmup are not implemented in Fluid yet, which will affect convergence with a big batch size and a big initial learning rate.
- To be tested in a distributed environment.
- Check the performance impact introduced by the LARS computation after the Fluid CPU optimization is done (the perf impact is minor in the non-optimized Fluid CPU version).
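The sketch below is one literal reading of the formula above, folded into a standard (non-Nesterov) Momentum update on a dense parameter. It illustrates how the three proposed attributes would behave; it is not the actual Fluid operator code, and the velocity bookkeeping and weight_decay default are assumptions.

import numpy as np

def momentum_update_with_lars(param, grad, velocity, learning_rate,
                              momentum=0.9, use_local_lr=False,
                              local_gw_ratio=0.001, weight_decay=0.0005):
    # One dense-parameter Momentum (non-Nesterov) step with the proposed
    # attributes; weight_decay default and velocity handling are assumed.
    lr = learning_rate
    if use_local_lr:
        p_norm = np.sqrt(np.sum(param ** 2))  # sqrt(sumsq(param))
        g_norm = np.sqrt(np.sum(grad ** 2))   # sqrt(sumsq(grad))
        # local_lr = learning_rate * local_gw_ratio * ||param|| /
        #            (||grad|| + weight_decay * ||param||)
        lr = learning_rate * local_gw_ratio * p_norm / (g_norm + weight_decay * p_norm)
    velocity = momentum * velocity + lr * grad
    param = param - velocity
    return param, velocity

With use_local_lr left at false the step reduces to the existing Momentum update, which is why LARS can ship as optional attributes rather than as a new optimizer.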

3 ResNet50 Convergence Testing
Benchmark: ResNet50 (with test accuracy added)
Dataset: cifar10
Number of passes: 50
CPU: 2-socket E V4
Memory: DDR4 128 GB (16 GB x 8)
Test method: testing is performed on a single Broadwell node with big batch sizes; no global LR schedule is available.

Results (- marks values not given on the slide):
Batch Size | LR   | Momentum | LARS | local_gw_ratio | weight_decay | Max Train Acc (pass) | Max Test Acc (pass)
32         | 0.01 | 0.9      | Off  | NA             | NA           | 99.41% (48)          | 81.86% (42)
1024       | 0.32 | 0.9      | Off  | NA             | NA           | 99.28%               | 78.09% (45)
8192       | 2.56 | 0.9      | Off  | NA             | NA           | 15.21% (49)          | 15.76%
8192       | 1    | 0.9      | Off  | NA             | NA           | 25.51%               | 24.43%
8192       | -    | 0.9      | On   | 0.001          | -            | 90.63%               | 65.17% (36)
8192       | -    | 0.9      | On   | 0.001          | 0.0005       | 90.54% (47)          | 64.09% (26)
8192       | -    | 0.9      | On   | -              | -            | 86.04%               | 60.28% (28)
8192       | -    | 0.9      | On   | -              | -            | 87.33%               | 59.84% (21)
8192       | -    | 0.9      | On   | 0.002          | -            | 52.54% (2)           | 10.00% (0-49)

Test summary:
- We use the default batch size 32 with LR 0.01 in the benchmark as the baseline and get 99.41% train accuracy and 81.86% test accuracy.
- We then scale the batch size to 1024 and 8192 and scale the LR linearly from 0.01 to 0.32 and 2.56 (see the sketch after this slide).
- For batch size 8192 with LR 2.56 we only reach 15.76% test accuracy within 50 passes; after reducing the LR from 2.56 to 1, the test accuracy is 24.43%.
- With LARS turned on, for the 8192 batch size we get 90.63% train accuracy and 65.17% test accuracy within 50 passes.
- Because no global LR scheduler is available in Fluid currently, the big initial LR 2.56 cannot drop in later passes, which blocks ResNet50 from reaching theoretical convergence under a big batch size.
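For reference, the learning rates in the table come from linearly scaling the batch-32 baseline LR of 0.01; the small helper below just makes that arithmetic explicit (the function name is only for illustration).

def scale_lr(base_lr, base_batch, batch):
    # Linear scaling rule: LR grows in proportion to the batch size.
    return base_lr * batch / base_batch

print(scale_lr(0.01, 32, 1024))  # ~0.32
print(scale_lr(0.01, 32, 8192))  # ~2.56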

4 ResNet50 Convergence Testing – Accuracy Curve

5 ResNet50 Theoretical Convergence (ImageNet)
Model     | Top-1 validation accuracy | Top-5 validation accuracy
ResNet-50 | 75.3%                     | 92.2%

