Holistic Approach to DNN Training Efficiency: Analysis and Optimizations
Gennady Pekhimenko, Assistant Professor
Computer Systems and Networking Group (CSNG), EcoSystem Group
1. Machine Learning Benchmarking and Analysis
Let's start with the typical situation an ML researcher faces. Try a new framework (TF, MXNet, PyTorch, ...)? Change hyper-parameters? Try a new library? Buy a new GPU (V100, P100, 1080 Ti, Titan Xp, ...)? Add or remove a layer? Every one of these choices means waiting for hours or days to see its effect, and all too often that waiting time is simply accepted as unavoidable.
Our methodology for understanding performance bottlenecks in DNN training: first define a proper set of key performance metrics, then build pin-pointing tools where necessary. These tools are hardware- and model-independent. We also collected a diverse benchmark suite of state-of-the-art DNN training models on which to apply the tools.
Roadmap: a diverse benchmark suite with state-of-the-art models, key performance metrics, and pin-pointing tools for understanding performance bottlenecks in DNN training. Let's start with the benchmark suite.
Why Do We Need a ML Benchmark Suite?
The systems/hardware community currently lacks a standard, diverse benchmark set with state-of-the-art models for DNN training. A DNN has two major tasks: training and inference. During inference, the network takes data samples as input and outputs some property of those samples; before it can produce reasonable output, it must first be trained, which is the purpose of the training task. Training and inference differ in algorithm, performance metric, memory consumption (different magnitudes, because the major memory consumer is different), and time to finish, and these differences place different requirements on the underlying hardware:

|                    | Training                                          | Inference     |
| Algorithm          | Iterations * (forward + backward + weight update) | Forward only  |
| Performance metric | Time-to-accuracy                                  | Latency       |
| Memory consumption | GBs                                               | MBs           |
| Time to finish     | Full training in days or weeks                    | Milliseconds  |

Training and inference are equally important; in this project our focus is on training.
Need for Benchmark Diversity (early 2017)
- Lack of a standard, diverse benchmark set with state-of-the-art models for DNN training
- DNNs have been widely successful well beyond image classification: self-driving cars, recommendation systems, object detection, machine translation, speech recognition, reinforcement learning, and more
- Many special-purpose systems and architectures have been proposed in recent years to accelerate DNN computation, but most of this research uses only image classification and CNN models
- Performance characteristics differ across DNNs, so there are great opportunities in systems and hardware targeting the other application domains; first, though, we need proper benchmark models for those domains
State-of-the-art Models
State-of-the-art models are constantly evolving, and old models quickly become outdated. In image classification, AlexNet won the ImageNet competition in 2012, followed by VGG and GoogleNet in 2014, ResNet in 2015, and a ResNet v2 that surpassed human-level accuracy on the task. Object detection tells the same story: RCNN (2014), Fast RCNN (2015), Faster RCNN (late 2015), YOLO (2016), YOLO v2 (2017). The same pattern repeats in every application domain, which poses a challenge for benchmarking: we need to keep track of the state-of-the-art models in each domain.
Training Benchmarks for DNNs (TBD)

| Application           | Model(s)                                  | Dataset              | # of layers       | Dominant layer  | Maintainer                      |
| Image Classification  | ResNet-50 (T,M,C), Inception-v3 (T,M,C)   | ImageNet             | 50 (152 max) / 42 | CONV            | Hongyu Zhu                      |
| Machine Translation   | Seq2Seq (T,M), Transformer (T,M)          | IWSLT15              | 5 / 12            | LSTM / Attention| Bojian Zheng, Andrew Pelegris   |
| Object Detection      | Faster RCNN (T,M), Mask RCNN (P)          | Pascal VOC           | 101               |                 | Zilun Zhang                     |
| Speech Recognition    | Deep Speech 2 (P,M)                       | LibriSpeech          | 7 (9 max)         | RNN             | Kuei-Fang Hsueh, Jiahuang Lin   |
| Recommendation System | NCF (P)                                   | MovieLens            | 4                 | GMF, MLP        | Izaak Niksan                    |
| Adversarial Network   | WGAN (T)                                  | Downsampled ImageNet | 14+14             |                 |                                 |
| Reinforcement Learning| A3C (T,M)                                 | Atari 2600           |                   |                 | Mohamed Akrout                  |

(Letters indicate available implementations: T = TensorFlow, M = MXNet, C = CNTK, P = PyTorch.)
The table lists the models in our benchmark suite, covering seven major application domains. We did not implement any of them ourselves: all models come from open-source GitHub repositories, most of them the official framework examples, and different models are maintained by different members of the group.
tbd-suite.ai  https://github.com/tbd-ai/tbd-suite
Our Focus: Benchmarking and Analysis
- TBD (http://tbd-suite.ai): building tools to analyze ML performance/efficiency
- MLPerf (https://mlperf.org/): industry/academia de-facto standard
- Our group is responsible for the reference model implementation for speech recognition (inference): DeepSpeech2 from UofT
Roadmap: a diverse benchmark suite with state-of-the-art models, key performance metrics, and pin-pointing tools for understanding performance bottlenecks in DNN training. Next: the key performance metrics.
Performance Metrics
- Throughput: number of data samples processed per second
- Compute Utilization: GPU busy time over elapsed time
- FP32 Utilization: total FP32 instructions executed over maximum possible FP32 instructions
- Memory Breakdown: which data structures occupy how much memory
Throughput vs. Time-to-accuracy
Time-to-accuracy is the metric people truly care about, but it is too expensive to measure directly: collecting the wall-clock time to reach the target accuracy can take forever, and hyper-parameter tuning plays a big role in the result. We therefore assume that a hyper-parameter configuration exists that guarantees training quality; recent work showed this assumption holds for training ResNet models at extremely large scale. That lets us focus on performance alone, measured as throughput: the number of data samples processed per second. Throughput is easy to measure, although samples of varying sizes still need to be handled carefully.
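A minimal sketch of how throughput could be measured over a short, sampled training window, with warm-up excluded. The names `train_step` and `BATCH_SIZE` are hypothetical placeholders, not part of the TBD toolchain:

```python
import time

# Assumptions (illustrative only): train_step() runs one training iteration on one
# mini-batch, and BATCH_SIZE is that mini-batch size.
BATCH_SIZE = 64
WARMUP_ITERS = 50     # warm-up (initialization, allocation, auto-tuning) is excluded
MEASURE_ITERS = 200   # a short sampled window instead of a full training run

def measure_throughput(train_step):
    for _ in range(WARMUP_ITERS):
        train_step()                      # excluded from data collection
    start = time.time()
    for _ in range(MEASURE_ITERS):
        train_step()
    elapsed = time.time() - start
    return MEASURE_ITERS * BATCH_SIZE / elapsed   # data samples processed per second
```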
Compute Utilization: GPU busy time over elapsed time. If the GPU is busy during intervals t1 and t2 within an elapsed window t3, then Compute Utilization = (t1 + t2) / t3. This indicates how well the non-GPU work overlaps with GPU computation: data loading, communication (PCIe, networking), and so on.
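A sketch of the (t1 + t2) / t3 calculation, assuming a list of (start, end) times for GPU kernels and memcpys has already been extracted from a profiler timeline (e.g., an .nvvp file); overlapping intervals are merged so concurrent kernels are not double-counted:

```python
def compute_utilization(kernel_intervals, elapsed):
    """Compute GPU busy time over elapsed time.

    kernel_intervals: list of (start, end) times of GPU kernels/memcpys,
    e.g. extracted from a profiler timeline.  elapsed: wall-clock window t3.
    """
    busy, cur_start, cur_end = 0.0, None, None
    for start, end in sorted(kernel_intervals):
        if cur_end is None or start > cur_end:    # gap: close the current busy span
            busy += (cur_end - cur_start) if cur_end is not None else 0.0
            cur_start, cur_end = start, end
        else:                                     # overlap: extend the busy span
            cur_end = max(cur_end, end)
    if cur_end is not None:
        busy += cur_end - cur_start
    return busy / elapsed

# Example: three kernels, two of which overlap, in a 10 ms window -> 0.5
print(compute_utilization([(0, 2), (1, 3), (6, 8)], 10))
```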
FP32/FP16/TensorCore Utilization: the average number of instructions executed per cycle over the maximum instructions per cycle, i.e., when the GPU is busy, how well are the GPU cores utilized? Most models are trained with single-precision floats. The underlying counters are provided by nvprof; the metric indicates the kernel-level speed-up potential and helps identify the "straggler" kernels (usually not the MatMul or CNN kernels).
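A sketch of the post-processing step, assuming per-kernel counters (such as nvprof's flop_count_sp) have already been exported to a CSV with hypothetical columns kernel, duration_s, flop_count_sp; the peak FP32 rate is an illustrative constant, not a measured value:

```python
import csv

# Assumed peak FP32 throughput of the GPU under test (illustrative value only).
PEAK_FP32_OPS_PER_SEC = 10.6e12

def fp32_utilization(metrics_csv):
    """Overall FP32 utilization: total FP32 operations executed over the maximum
    FP32 operations possible in the same kernel time.  Also returns per-kernel
    utilization, sorted so the lowest-utilization "straggler" kernels come first."""
    executed = maximum = 0.0
    per_kernel = []
    with open(metrics_csv) as f:
        for row in csv.DictReader(f):
            ops, dur = float(row["flop_count_sp"]), float(row["duration_s"])
            executed += ops
            maximum += dur * PEAK_FP32_OPS_PER_SEC
            per_kernel.append((ops / (dur * PEAK_FP32_OPS_PER_SEC), row["kernel"]))
    per_kernel.sort()
    return executed / maximum, per_kernel
```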
Memory Breakdown
Goal: understand which data structures contribute how much to the total memory consumption. We categorize data structures by their functionality into five types:
- Weights: the model weights
- Gradients: outputs of the backward pass, used to update the weights
- Activations: intermediate results generated in the forward pass and reused in the backward pass
- Workspace: temporary space used by CUDA kernels
- Dynamic: anything else that is allocated and released during the training iterations
Weights, gradients, and activations are generally allocated before the training iterations start, whereas workspace and dynamic data are allocated (and released) during training.
Roadmap: a diverse benchmark suite with state-of-the-art models, key performance metrics, and pin-pointing tools for understanding performance bottlenecks in DNN training. Next: the pin-pointing tools.
Toolchain: How to Get the Required Metrics?

| Metric              | How                        | Challenge                                  | Solution                                    |
| Throughput          | Straightforward            |                                            |                                             |
| Compute Utilization | NVIDIA Visual Profiler     | Profiling file is huge for visual profiler | Sampling                                    |
| FP32 Utilization    | nvprof & post-processing   | Profiling takes an extremely long time     | Sampling                                    |
| Memory Breakdown    | Dump and parse training logs | Currently not supported by frameworks    | Augment frameworks with memory profiling    |
| Network Utilization | Network profiler           | Currently not supported by frameworks      | Augment frameworks with network profiling   |

We rely on nvprof for the GPU and vTune for the CPU, but these tools have no domain knowledge about the application. We therefore have to define which metrics matter, extract them from the information nvprof and vTune provide, and build the memory profiler ourselves (covered later).
Toolchain: Sampling, Setup, Warm-up
- Sampling: fully training a DNN takes days or weeks, but the training algorithm is iterative and every iteration follows the same logic, so profiling a short training period is representative
- Setup: we need to verify the training accuracy, and different frameworks may use different hyper-parameters for the same model, so implementations must be made comparable
- Skipping warm-up: before training runs stably, a framework needs to initialize the dataflow, allocate memory, and perform auto-tuning; this warm-up phase is excluded from data collection
Toolchain: Overview
The DNN model implementation is first set up so that implementations are comparable across frameworks, then run for a short training period, with warm-up and auto-tuning excluded from data collection. From this sampled run: training throughput comes from the training logs; the memory profiler parses the training logs to produce the memory consumption breakdown; nvprof produces .nvvp files from which compute utilization and FP32 utilization are extracted; and vTune provides CPU utilization.
Memory Profiler Flow
A modified framework runs the DNN model implementation and emits training logs in which each allocation is tagged with its functionality. A parser program then aggregates these logs into the per-category memory breakdown: Weights, Gradients, Activations, Workspace, and Dynamic.
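A minimal sketch of the parser stage. The log format is hypothetical (one `ALLOC <bytes> <tag>` line per allocation); the actual format emitted by the modified frameworks may differ:

```python
from collections import defaultdict

def memory_breakdown(log_path):
    """Aggregate allocation records into per-category totals.
    Assumes one (hypothetical) line per allocation of the form
    'ALLOC <bytes> <tag>', where <tag> is Weights, Gradients, Activations,
    Workspace, or Dynamic."""
    totals = defaultdict(int)
    with open(log_path) as f:
        for line in f:
            parts = line.split()
            if len(parts) == 3 and parts[0] == "ALLOC":
                size_bytes, tag = int(parts[1]), parts[2]
                totals[tag] += size_bytes
    return dict(totals)
```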
Roadmap: a diverse benchmark suite with state-of-the-art models, key performance metrics, and pin-pointing tools for understanding performance bottlenecks in DNN training. Next: results.
Experimental Setup
All results are collected in a single-machine, single-GPU environment (unless stated otherwise for the distributed experiments).
- OS: Ubuntu 16.04
- Libraries: CUDA 9, cuDNN 7
- Frameworks: TensorFlow v1.8, MXNet v1.1.0, PyTorch v0.4.0, CNTK v2.0
- GPUs: Quadro P4000, 1080 Ti, Titan Xp, P100, 2080 Ti, Titan V, V100
- CPU: 28-core Intel Xeon E5-2680
- Networking: 1 Gb/s Ethernet, 100 Gb/s InfiniBand, 16 GB/s PCIe
Results: Training Quality
Training time vs. training accuracy curves (shown for ResNet-50 and Seq2Seq): the expected training accuracy is reached.
Results: Throughput
Mini-batch size matters for training throughput: performance improves with larger mini-batches (shown for Inception and Transformer). However, the benefit of increasing the mini-batch size further diminishes; we call this throughput saturation. Once the mini-batch size reaches a certain level, further increases provide very limited benefit.
Results: Throughput Diversity
For the two RNN-based models in our benchmark set, performance does not saturate within the GPU memory budget: the mini-batch size cannot be increased further because it is limited by GPU memory.
Results Analysis: GPU Compute Utilization Mini-batch size should be large enough to keep GPU busy GPU compute utilization is low for LSTM-based models
Results Analysis: GPU FP32 Utilization Mini-batch size should be large enough to have high FP utilization
Hardware Sensitivity Better GPU does NOT always mean better performance and utilization We need better system designs and libraries
GPU Memory Profiling Feature maps are the dominant GPU memory consumers
Results: Distributed Training
In distributed training, network bandwidth becomes a potential performance bottleneck. We trained ResNet-50 on MXNet in both multi-GPU (single machine) and multi-machine settings; the reported throughput is the total, not per GPU/machine. Multi-GPU scaling is very good, but in the multi-machine case the network must be fast: with InfiniBand (100 Gb/s) scaling holds, whereas 1 Gb/s Ethernet actually decreases overall training throughput (PCIe bandwidth is 16 GB/s for reference). Networking bandwidth must be large enough for the weight/gradient updates.
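A back-of-envelope check on why 1 Gb/s Ethernet can become the bottleneck for ResNet-50. The parameter count is the commonly cited ~25.5M; the per-machine throughput is a purely illustrative assumption, not a measured number from these experiments:

```python
# Illustrative assumptions: ~25.5M FP32 parameters, and a hypothetical per-machine
# throughput of ~200 images/s with a mini-batch size of 32.
params = 25.5e6
grad_bytes = params * 4                        # ~102 MB of gradients per iteration
iters_per_sec = 200 / 32                       # ~6.25 iterations per second
needed_bw = grad_bytes * iters_per_sec * 8     # bits/s, one direction only
print(needed_bw / 1e9)                         # ~5.1 Gb/s, well above 1 Gb/s Ethernet
```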
Project Status TBD project website is live: tbd-suite.ai Github repo: github.com/tbd-ai
TBD Summary A new benchmark suite for DNN training Currently, 7 application domains, 9 state-of-the-art models Comes with tools to analyze: performance, efficiency, memory, and network consumption Part of the community effort (MLPerf) to standardize benchmarking for machine learning
2. Gist: Efficient Data Encoding for Deep Neural Network Training
DNN Training vs. Inference
Step 1, the forward pass, makes a prediction; Step 2, the backward pass, calculates the error gradients. The intermediate layer outputs (feature maps) generated in the forward pass are used again in the backward pass. DNN training therefore requires stashing feature maps for the backward pass, which is not required in inference.
Training Deeper Networks
GPU DRAM has limited size, restricting the depth of DNNs that can be trained; the alternative, a smaller mini-batch size, leads to an underutilized GPU, while a larger mini-batch size risks an out-of-memory crash. Feature maps are a major consumer of GPU memory, so reducing the memory footprint lets us train larger networks on a single GPU.
Limitations of Prior Work
Prior compression work focuses on DNN inference, i.e., on the weights, applying pruning, quantization, and Huffman encoding. However, weights are only a small fraction of the training memory footprint, and these techniques are not well suited for training: training requires frequent weight updates, and the techniques map poorly onto GPU hardware.
Our Insight
A feature map is generated in the forward pass (its 1st use) and reused only much later in the backward pass (its 2nd use), leaving a large temporal gap between the two uses. In the baseline, the feature map is stored in FP32 format for that whole gap. Our approach: Encode() it into a smaller format right after the 1st use and Decode() it back just before the 2nd use.
Layer-Specific Encodings
Key idea: use layer-specific compression, which can be both fast and efficient, and can even be lossless, something that is usually difficult for generic FP32 data.
Relu Importance
CNTK profiling shows that a significant fraction of the memory footprint is due to Relu layer outputs.
Relu → Pool: Relu Backward Propagation
The Relu backward pass only needs to know which outputs were positive, so the stashed feature map can be binarized to a 1-bit-per-element representation (lossless).
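A hedged numpy sketch of this idea, not the actual Gist/CNTK implementation: stash only the sign pattern of the Relu output and reconstruct the mask when the backward pass needs it.

```python
import numpy as np

def encode_relu_output(relu_out):
    """Stash only the sign pattern of the Relu output: 1 bit per element."""
    mask = relu_out > 0
    return np.packbits(mask.ravel()), relu_out.shape

def relu_backward(grad_out, packed_mask, shape):
    """Decode the 1-bit mask and propagate gradients only where Relu was active."""
    n = int(np.prod(shape))
    mask = np.unpackbits(packed_mask)[:n].reshape(shape).astype(bool)
    return grad_out * mask
```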
Relu/Pool → Conv: Sparse Storage, Dense Compute
Feature maps that feed a convolution after Relu/Pool are largely zero, so they can be stashed in a sparse format between their two uses while the actual computation stays dense (lossless).
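A sketch of sparse storage with dense compute, using scipy's CSR format purely as a stand-in for the storage format used in the real system:

```python
import numpy as np
from scipy.sparse import csr_matrix

def encode_sparse(feature_map):
    """Between its two uses, store the (mostly zero, post-Relu) feature map in a
    sparse format; keep the original shape so it can be densified later."""
    shape = feature_map.shape
    return csr_matrix(feature_map.reshape(shape[0], -1)), shape

def decode_dense(sparse_fm, shape):
    """Decode back to a dense array just before the computation that consumes it,
    so the compute itself stays dense (and GPU-friendly)."""
    return sparse_fm.toarray().reshape(shape)
```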
Opportunity for Lossy Encoding
Reducing precision in the forward pass quickly degrades accuracy due to precision-reduction error (e.g., AlexNet does not train at 16 bits). Restricting the precision reduction to the stashed copy consumed only at the 2nd use, in the backward pass, results in aggressive bit savings with no effect on accuracy.
Delayed Precision Reduction (Lossy)
Train with full precision in the forward pass, and reduce the precision only of the stashed copy that is used in the backward pass.
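A sketch of delayed precision reduction on a single Relu-after-matmul layer, assuming numpy and using FP16 as the reduced format (the real system explores more aggressive bit widths); this is an illustration, not Gist's implementation:

```python
import numpy as np

def forward_with_delayed_precision_reduction(x, w):
    """Compute the forward pass in full FP32 (so accuracy is unaffected), but stash
    the feature map for the backward pass in a reduced-precision format."""
    y = np.maximum(x @ w, 0.0)          # forward result stays FP32
    stashed = y.astype(np.float16)      # reduced precision only for the 2nd use
    return y, stashed

def backward(grad_y, stashed, w):
    """The backward pass decodes the stashed copy back to FP32 before using it."""
    y = stashed.astype(np.float32)
    grad_x = (grad_y * (y > 0)) @ w.T   # Relu + matmul backward on the decoded copy
    return grad_x
```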
Proposed System Architecture: Gist
Gist takes the DNN execution graph, identifies encoding opportunities, and produces a modified execution graph with the encode/decode operations inserted, handling memory allocation for the new data structures and enabling efficient memory sharing.
Result: up to 2X compression ratio with minimal performance overhead.
Gist Summary
- Systematic memory breakdown analysis for image classification
- Layer-specific lossless encodings: binarization and sparse storage/dense compute
- Aggressive lossy encodings with delayed precision reduction
- Footprint reduction measured on real systems: up to 2X reduction with only 4% performance overhead; further optimizations yield more than 4X reduction
3. Priority-based Parameter Propagation (P3) for Distributed DNN Training
Networking Profiler for DNN Training
There is significant communication, especially from the workers to the parameter server.
Network Utilization: ResNet-50 on MXNet (4 Gbps Ethernet). Huge network underutilization, with significant spikes.
Network Utilization: ResNet-50 on TensorFlow (4 Gbps Ethernet). Similar behavior for the same model on a different framework.
Network Utilization: Sockeye on MXNet (4 Gbps Ethernet). The same behavior appears even beyond image classification.
Communication Dependency
Current Compute-Communication Pattern
Priority-based Parameter Propagation
Network Utilization: Baseline vs. P3 (VGG-19 and Sockeye)
Training Accuracy vs. Deep Gradient Compression
Deep Gradient Compression shows a 0.4% difference in validation accuracy on ResNet-50 (and worse for other models), and it took considerable effort to reproduce this result. P3 is a safer way to optimize networking bandwidth.
P3 Summary
- Current distribution mechanisms fail to fully utilize the available network bandwidth
- We propose Priority-based Parameter Propagation (P3), based on two key ideas: parameter slicing and priority-based updates (see the sketch below)
- 1.26X, 1.38X, and 1.66X speedups in training throughput (and time-to-accuracy) for ResNet-50, Sockeye, and VGG-19, respectively
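A minimal sketch of the two ideas, assuming layer index 0 is the first layer of the network; `send_to_parameter_server` is a hypothetical transport call, not part of P3's actual API:

```python
import heapq
import numpy as np

SLICE_SIZE = 50_000   # parameter slicing: break large layers into small pieces

def p3_schedule(layer_gradients, send_to_parameter_server):
    """layer_gradients: per-layer gradient arrays, index 0 = first layer of the
    network.  send_to_parameter_server(layer_idx, offset, chunk) is a hypothetical
    transport call."""
    heap = []
    for layer_idx, grad in enumerate(layer_gradients):
        flat = np.ravel(grad)
        for offset in range(0, flat.size, SLICE_SIZE):
            # Priority = layer index: the first layers' updated weights are needed
            # earliest by the next iteration's forward pass, so send them first.
            heapq.heappush(heap, (layer_idx, offset, flat[offset:offset + SLICE_SIZE]))
    while heap:
        layer_idx, offset, chunk = heapq.heappop(heap)
        send_to_parameter_server(layer_idx, offset, chunk)
```

In the real system this scheduling is interleaved with back-propagation (gradients become available last-layer first), rather than done after all gradients are ready; the sketch only illustrates the slicing and the priority order.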
Other Architecture+ML Related Projects
- EcoRNN: efficient RNN/LSTM implementations on GPUs/FPGAs
- Scaling the backward pass with Jacobians instead of gradients (log(N) steps instead of N using the Blelloch scan algorithm)
- GPU virtualization: efficient resource reuse in the cloud (e.g., mixing large training jobs with many inference jobs)
- ML compilers beyond Halide and TVM: global optimizations and auto-scheduling
My Students: EcoSystem Research Group Hongyu Zhu (PhD) Bojian Zheng (PhD) Pavel Klishin (PhD) Alexandra Tsvetkova (PhD) James Gleeson (PhD, co-advised) Andrew Pelegris (MSc) Shang (Sam) Wang (MSc) Geoffrey Yu (MSc) Xiaodan (Serina) Tan (MASc) Qiongsi Wu (BSc) Ming (Michael) Yang (BASc) Izaak Niksan (BASc) Yifan Bai (BSc) Jiahuang (Jacob) Lin (BSc) Kuei-Fang (Albert) Hsueh (BASc) Zilun Zhang (BSc)
Holistic Approach to DNN Training: Analysis and Optimizations Gennady Pekhimenko Assistant Professor Computer Systems and Networking Group (CSNG) EcoSystem Group