Presentation transcript:

DAWNBench: An End-to-End Deep Learning Benchmark and Competition that focuses on time and cost to achieve state-of-the-art accuracy. Cody Coleman, Deepak Narayanan, Daniel Kang, Tian Zhao, Jian Zhang, Luigi Nardi, Peter Bailis, Kunle Olukotun, Chris Ré, Matei Zaharia. Stanford University. dawn.cs.stanford.edu/benchmark

To address the growing computational demands: in recent years, we have seen an explosion of interest in deep learning, and as a result, massive growth in computational demands. Fortunately, there have been a number of novel innovations to address these demands, including:

New software systems, training decisions, communication methods, and hardware. New software systems such as TensorFlow, PyTorch, CNTK, and MXNet aim to make designing and training deep learning models faster and easier.

Training decisions, such as the choice of optimizer (Adam, RMSprop) to make more efficient use of data, and architecture choices (stochastic depth, batch normalization) to provide regularization.

Communication methods such as HogWild, synthetic gradients, and DimmWitted, for asynchronous and synchronous training as well as for reduced communication and shared state.

Hardware: Google TPU, Nvidia GPUs, Microsoft Brainwave, and Intel Xeon Phi. There have been many advances in hardware, from existing technologies like CPUs and GPUs to new architectures like Google's TPU. All of this represents a tremendous effort from the community to reduce the cost, in both time and money, of creating state-of-the-art deep learning systems.

Yet despite this effort, there are no standard evaluation criteria for end-to-end training and inference.

There are many existing deep learning benchmarks.

On one side, there are benchmarks that focus on accuracy: ImageNet, CIFAR10, MS COCO, SQuAD, and WMT machine translation.

On the other side, there are benchmarks that focus on throughput (examples/second): Baidu DeepBench, the TensorFlow Benchmarks, "Benchmarking State-of-the-Art Deep Learning Software Tools", jcjohnson/cnn-benchmarks, and soumith/convnet-benchmarks. Throughput is normally defined as examples per second when processing a single mini-batch of data. These benchmarks have had a huge impact on deep learning so far.

But throughput is not time to accuracy.
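To make that throughput notion concrete, here is a minimal sketch in PyTorch of a per-mini-batch throughput measurement in the DeepBench style; the toy model, batch size, and CIFAR10-shaped synthetic input are illustrative assumptions, not any benchmark's actual configuration.

```python
# Minimal sketch (illustrative assumptions, not benchmark code):
# throughput = examples/second for one forward/backward pass.
import time
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 512),
                      nn.ReLU(), nn.Linear(512, 10))
criterion = nn.CrossEntropyLoss()

batch_size = 256
x = torch.randn(batch_size, 3, 32, 32)   # synthetic CIFAR10-shaped batch
y = torch.randint(0, 10, (batch_size,))

start = time.perf_counter()
loss = criterion(model(x), y)             # forward pass
loss.backward()                           # backward pass
elapsed = time.perf_counter() - start
print(f"throughput: {batch_size / elapsed:.0f} examples/second")
```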

Example: batch size affects accuracy. A batch size of 32 achieves the highest accuracy. (Setup: end-to-end training of a ResNet56 CIFAR10 model on an Nvidia P100 machine with 512 GB of memory and 28 CPU cores, using TensorFlow 1.2 compiled from source with CUDA 8.0 and cuDNN 5.1.)

Example: batch size also affects throughput. A batch size of 2048 achieves the highest throughput, but a batch size of 256 represents a reasonable trade-off between convergence rate and throughput. (Same setup as above.)
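As a hedged sketch of the throughput side of this trade-off, the loop below sweeps batch sizes and measures training-step throughput in PyTorch; the toy model and synthetic data are assumptions standing in for the ResNet56/CIFAR10 setup, and the accuracy effects above only appear in real end-to-end training.

```python
# Sketch (toy model, synthetic data): larger batches usually raise raw
# throughput, but, as the slides note, not necessarily final accuracy.
import time
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 512),
                      nn.ReLU(), nn.Linear(512, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()

for batch_size in (32, 256, 2048):
    x = torch.randn(batch_size, 3, 32, 32)
    y = torch.randint(0, 10, (batch_size,))
    steps = 5                              # a few steps to amortize overhead
    start = time.perf_counter()
    for _ in range(steps):
        opt.zero_grad()
        criterion(model(x), y).backward()
        opt.step()
    elapsed = time.perf_counter() - start
    print(f"batch={batch_size:5d}: "
          f"{steps * batch_size / elapsed:10.0f} examples/s")
```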

What if we combine optimizations?
- 1.25x: stochastic depth
- 3.1x: minimal effort backpropagation
- 3x: reduced precision
- 29x: accurate, large minibatch SGD
- 3x: Nvidia V100 vs Nvidia P100
Does that give us a combined speed-up of 1011x? (See the arithmetic check below.)
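A one-line sanity check of that naive arithmetic, assuming (wrongly, as the next slide shows) that the quoted speed-ups compose independently:

```python
# Naive composition of the quoted speed-ups, assuming independence:
# stochastic depth, minimal effort backprop, reduced precision,
# large minibatch SGD, V100 vs P100.
speedups = [1.25, 3.1, 3.0, 29.0, 3.0]
combined = 1.0
for s in speedups:
    combined *= s
print(f"naive combined speed-up: {combined:.0f}x")   # ~1011x
```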

In practice, no: optimizations interact in non-trivial ways. (Setup: end-to-end training of ResNet110 on CIFAR10 in PyTorch, where the baseline is a machine with a single K80 GPU and a batch size of 128.)

DAWNBench is the first benchmark to measure the time and cost to reach a state-of-the-art accuracy. Our goal: measure end-to-end throughput subject to an accuracy constraint.

As an initial release, the tasks are image classification (ImageNet, CIFAR10) and question answering (SQuAD).

For each task, we set an accuracy threshold close to the state-of-the-art and measure four metrics: training time, training cost (USD), inference latency, and inference cost (USD). (A sketch of measuring the training-side metrics follows.)
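Here is a minimal sketch of what a time-to-accuracy measurement could look like; this is not the DAWNBench harness. A tiny model is trained on synthetic data until a hypothetical validation-accuracy threshold is reached, and training cost is derived from an assumed hourly machine price.

```python
# Sketch (not the DAWNBench harness): train until a validation-accuracy
# threshold, then report wall-clock time and cost. The model, data,
# threshold, and hourly price are all illustrative assumptions.
import time
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(2000, 20)
y = (x.sum(dim=1) > 0).long()              # synthetic, learnable labels
x_train, y_train = x[:1600], y[:1600]
x_val, y_val = x[1600:], y[1600:]

model = nn.Linear(20, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.5)
criterion = nn.CrossEntropyLoss()

THRESHOLD = 0.94          # assumed accuracy threshold for the "task"
PRICE_PER_HOUR = 3.06     # assumed cloud machine price in USD/hour

start = time.perf_counter()
for epoch in range(1000):
    opt.zero_grad()
    criterion(model(x_train), y_train).backward()
    opt.step()
    with torch.no_grad():
        acc = (model(x_val).argmax(dim=1) == y_val).float().mean().item()
    if acc >= THRESHOLD:
        break
elapsed = time.perf_counter() - start
print(f"time to {THRESHOLD:.0%} accuracy: {elapsed:.2f}s "
      f"(cost ~ ${elapsed / 3600 * PRICE_PER_HOUR:.4f})")
```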

The Competition. Deadline: April 20th, 2018 at 11:59 PM PST. After the deadline, we will decide the winners for each metric on each task, and define the next set of tasks, thresholds, and metrics.


This is a first step, with more to follow: more tasks (e.g. machine translation, video classification) and more metrics (e.g. sample complexity, energy). Join the discussion: bit.ly/dawnbench-community

Conclusion: Deep learning methods are effective but computationally expensive, leading to a great deal of work to optimize their computational performance. Yet there are no standard evaluation criteria for end-to-end training and inference. DAWNBench measures end-to-end training and inference, is open to community submissions, and has evolving tasks, thresholds, and metrics. Join the competition: dawn.cs.stanford.edu/benchmark