1 Training Models in TensorFlow with Oversubscribed GPU Memory and NVLink 2.0
Jason Furmanek (furmanek@us.ibm.com), Kris Murphy (kriskend@us.ibm.com), IBM
Imagine you have curated a great data set and built a deep TensorFlow model to learn from it. When you go to train it on your GPUs, it blows up with an out-of-memory error. What can you do? Hi, my name is Sam Matzek and I work on deep learning with TensorFlow at IBM. In the next 45 minutes I'm going to talk about how to overcome GPU memory limits to train and run inference with large models and data.

2 Overview
GPU memory limits
TensorFlow graph modifications for tensor swapping
What's possible with NVIDIA NVLink 2.0 connections between CPU and GPU

3 Deep learning is memory constrained
GPUs have limited memory
Neural networks are growing deeper and wider
The amount and size of data to process is always growing

4 GPU memory usage
Diagram labels: GPU memory, input data, kernels, tensors (layer outputs), loss
The GPU memory limit is a problem. Discuss everything that needs to fit in GPU memory for model training. To make things fit, you have to trade off batch size, reduce the data size (which changes the tensor / layer output sizes), or make the model smaller.
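The speaker notes give no figures here, but as a rough, hypothetical back-of-envelope illustration, the memory held by a single layer's output (its feature maps) scales with batch size times spatial dimensions times channel count times element size. The shape and batch size below are made-up example values:

    # Rough estimate of the memory held by one layer's output (activations).
    batch_size = 16
    height, width, channels = 256, 256, 64
    bytes_per_element = 4                    # float32
    activation_bytes = batch_size * height * width * channels * bytes_per_element
    print(activation_bytes / 2**30, "GiB")   # 0.25 GiB for this single tensor

Multiply that by the many layers whose outputs must be kept for the backward pass, plus the weights, gradients, and kernel workspace, and a 16 GB GPU fills up quickly.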

5 Model training in GPU memory
Diagram labels: Tensor 1, Tensor 2, Tensor 3, loss
Discuss the diagram: the model in memory, the forward and backward phases, and how feature maps get stored in GPU memory. After the animation, discuss the solution of storing tensors in system memory between their uses in the forward and backward phases.

6 Model training with tensor swapping
Diagram labels: GPU memory, system memory, Tensor 1, Tensor 2, Tensor 3, Tensor 4, loss
Tensors are swapped out to system memory in the forward phase and swapped back in during the backward phase. Point out how this lets you train very deep, long models by continuing to swap out feature maps.

7 TensorFlow Large Model Support graph modifications
Diagram labels: GPU, CPU, a, b, swap out, swap in
We produced a Python module that can analyze your model's graph and automatically insert swapping nodes. Here's a quick look at how the graph modifications work. Talk about tuning with the control dependency.
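The LMS module rewrites the graph automatically; the snippet below is only a hand-written sketch of the same idea in TensorFlow 1.x graph mode, not the LMS implementation. The helper name and the trigger op are hypothetical; the point is the swap-out copy to host memory and the control dependency that delays the swap-in until shortly before the backward pass needs the tensor.

    import tensorflow as tf  # TensorFlow 1.x graph-mode APIs

    def swap_via_host(tensor, trigger_op):
        # Hypothetical helper, for illustration only.
        # Swap out: copy the tensor to system memory right after it is produced.
        with tf.device('/cpu:0'):
            swapped_out = tf.identity(tensor, name=tensor.op.name + '_swap_out')
        # Swap in: copy it back to the GPU, but only after trigger_op has run,
        # so the copy happens shortly before the backward op consumes it.
        with tf.control_dependencies([trigger_op]):
            with tf.device('/gpu:0'):
                swapped_in = tf.identity(swapped_out,
                                         name=tensor.op.name + '_swap_in')
        return swapped_in

Choosing which op acts as the trigger controls how early the swap-in starts, which is the control-dependency tuning mentioned above.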

8 Enabling TensorFlow Large Model Support
Keras API:

    from tensorflow.contrib.lms import LMSKerasCallback
    lms_callback = LMSKerasCallback()
    model.fit_generator(generator=training_gen, callbacks=[lms_callback])

Estimator API:

    with tf.name_scope('gradientscope'):
        optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001)
        train_op = optimizer.minimize(loss=loss, global_step=tf.train.get_global_step())

    from tensorflow.contrib.lms import LMSHook
    lms_hook = LMSHook({'gradientscope'})
    mnist_classifier.train(input_fn=train_input_fn, steps=20000, hooks=[logging_hook, lms_hook])

9 What's possible with Large Model Support?
GoogleNet: 2.5x increase in the max image resolution
ResNet-50: 5x increase in batch size
ResNet-152: 4.6x increase in batch size
3D-UNet brain MRIs: 2.4x increase in the max image resolution
Let me give you some examples of what's possible with tensor swapping.

10 3D-UNet image segmentation
3D-UNet generally has high memory usage requirements
International Multimodal Brain Tumor Segmentation Challenge (BraTS)
Existing Keras model with TensorFlow backend
What is image segmentation? Classifying each pixel rather than the whole image, for example finding and selecting all the water in an image; detail is important for water and other natural resource management. 3D-UNet image segmentation is a good candidate to benefit from tensor swapping because of its high memory requirements (first bullet). BraTS is an image segmentation challenge on 3D MRIs that produces segmentations of the whole tumor, the tumor core, and the enhancing tumor structures. We used an existing model to investigate a real-world use case and the benefit of tensor swapping.

11 Effect of 2x resolution on Dice coefficients (higher is better)
As I previously mentioned, we were able to increase the maximum resolution that could be processed by 2.4x. Dice coefficients are a statistic commonly used in the medical field to compare predicted image segmentations against the reference. The graphs show taking a prediction map from a lower resolution, scaling it up 2x, and scoring it against a truth file at a 2x greater native resolution. Box plots read like sideways bell curves; higher is better, with quartiles marked. So we've shown that tensor swapping allows you to train models for learning and labeling finer details and structure. It allows training of deeper models, and it allows higher batch sizes, which can lead to faster convergence. Summarize what you just talked about before moving on: problem, solution, benefit.
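For reference, the Dice coefficient measures the overlap between a predicted segmentation and the reference segmentation. A minimal NumPy version for binary masks (illustrative only, not the evaluation code used in this study) looks like this:

    import numpy as np

    def dice_coefficient(pred, truth):
        # Dice = 2 * |pred AND truth| / (|pred| + |truth|), in [0, 1].
        pred = pred.astype(bool)
        truth = truth.astype(bool)
        intersection = np.logical_and(pred, truth).sum()
        return 2.0 * intersection / (pred.sum() + truth.sum())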

12 “Swapping makes everything slow”
Let’s look at why swapping can hurt performance and what solutions we can apply there.

13 Typical GPU connectivity
Diagram labels: system memory, CPU, GPU, GPU, NVLink between GPUs; system memory bus: 76.8 GB/s; PCIe: 32 GB/s
It all comes down to bus speeds: the speed between the CPU and GPU, and the speed between system memory and the CPU. Discuss traditional connectivity and NVLink 2.0 connectivity (POWER9, AC922). Transition to next slide: So what does tensor swapping look like when you have PCIe versus NVLink connections between the CPU and GPU? To answer that question, we ran training of the 3D-UNet model on two systems. Each system had NVIDIA Volta V100 GPUs. One system had GPUs connected with PCIe, and the other system, the IBM AC922, had NVLink 2.0 connections between the CPU and GPU. This is what we saw:

14 NVLink CPU to GPU connectivity
Diagram labels: system memory, CPU, GPU, GPU, NVLink 2.0 between GPUs; system memory bus: 170 GB/s; NVLink 2.0 CPU to GPU: 150 GB/s
On a POWER9 system such as the IBM AC922, NVLink 2.0 connects the CPU to each GPU at 150 GB/s, and the system memory bus runs at 170 GB/s.
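To make the bandwidth difference concrete, here is a back-of-envelope calculation using the peak bus speeds quoted on these two slides; the 1 GB tensor size is an arbitrary example, and latency and copy overlap are ignored:

    # Time to move a single 1 GB tensor between system memory and the GPU.
    tensor_gb = 1.0
    pcie_gb_per_s = 32.0      # PCIe, from the previous slide
    nvlink_gb_per_s = 150.0   # NVLink 2.0 CPU to GPU, from this slide

    print("PCIe:   %.1f ms" % (tensor_gb / pcie_gb_per_s * 1000))    # ~31 ms
    print("NVLink: %.1f ms" % (tensor_gb / nvlink_gb_per_s * 1000))  # ~7 ms

Every tensor that is swapped out and back in pays a copy cost like this twice per training step, which is why the faster CPU to GPU link matters so much for large model support.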

15 Effects of NVLink 2.0 on large model support
Figure captions: PCIe-connected GPU training one high-res 3D MRI with large model support (top); NVLink 2.0-connected GPU training one high-res 3D MRI with large model support (bottom)
What we see here is the NVIDIA profiler view of training one MRI through the model. The top timeline is the PCIe-connected GPU, the bottom the NVLink-connected GPU. It took the PCIe-connected GPU 2.5 times longer to process the image. Again, the GPUs are the same model, and a few of the corresponding kernels are linked in red; their execution times are the same. So why the difference? The AC922 with NVLink 2.0 is 2.5x faster. The white gaps on the GPU utilization line are due to memory copy time differences, which come down to bus speeds. Note the memory copies in the timeline and how much longer they are on PCIe. Transition to next slide: "The amount of time the GPU goes idle due to memory copy times is increasing the training time by 2.5x. Let's dig into that a bit more."

16 Effects of NVLink 2.0 on epoch times
Discuss the server with 8 GPUs, where each GPU shares its PCIe bus with another GPU. PCIe bus contention when multiple GPUs are used further degrades performance. Epoch times are 2.5x faster than the PCIe connection, and 3.5x faster when using multiple GPUs on the same bus. The epoch time is shorter because GPU utilization is higher, which directly corresponds to the higher memory copy throughput.

17 Effects of NVLink 2.0 on GPU utilization
GPU utilization is higher with NVLink 2.0, which directly corresponds to the higher memory copy throughput and the shorter epoch times. Transition to next slide: So we've seen that tensor swapping allows you to use higher resolution data and batch sizes, and that NVLink 2.0 connections between the CPU and GPU give you drastically faster epoch times. You may still be thinking: what's the catch? What's the real overhead of tensor swapping with NVLink 2.0 connections to the CPU? To separate the overhead of swapping from the added time of using larger resolutions, we used V100s with both 16 GB and 32 GB of memory. We measured the epoch times of MRI resolutions that could only be trained on the 16 GB GPU using swapping, and re-measured the times on the 32 GB GPU without swapping. Here's what we saw:

18 Overhead of large model support with NVLink 2.0
Discuss the resolution multiplier above the 16 GB limit. Even at 1.8x the maximum resolution possible without swapping, swapping only adds 13% overhead with NVLink 2.0, and 25% at 2.4x. The takeaway is that for minimal overhead you can gain a lot of resolution, detail, and accuracy.

19 Future direction of large model support
TensorFlow 2.0: eager mode on by default, tensorflow.contrib sunset, potential to integrate with TensorFlow's Grappler, open to contributions
Continuing research in TensorFlow graph modifications: early experiments show 6.9x the 3D MRI resolution (up from 2.4x)
Not just for TensorFlow! Caffe Large Model Support and PyTorch Large Model Support

20 Large model support with NVLink 2.0
Tensor swapping can be used to overcome GPU memory limits. It allows training of deeper models, higher resolution data, and larger batch sizes. NVLink 2.0 between the CPU and GPU allows tensor swapping with minimal overhead.
So to conclude: as you leave here today, I'd like you to ask yourself what you could accomplish with deeper models, higher resolution data, and larger batch sizes. Please contact me if you would like to talk about tensor swapping or about NVLink connections between your CPUs and GPUs.

21 More information
TensorFlow Large Model Support Code / Pull Request:
TensorFlow Large Model Support Research Paper:
TensorFlow Large Model Support Case Study:
IBM AC922 with NVLink 2.0 connections between CPU and GPU:
More information slide to leave up during Q&A.

