Early Results of Deep Learning on the Stampede2 Supercomputer

Presentation transcript:

Early Results of Deep Learning on the Stampede2 Supercomputer
Zhao Zhang, Weijia Xu, Niall Gaffney, Daniel Stanzione
Texas Advanced Computing Center
zzhang@tacc.utexas.edu
Sep 26th, 2017

Outline
- Overview
- Technical Approach
- Performance
- Summary and Future Work

Deep Learning as a Methodology
Many scientists are exploring and adopting deep learning as the data science methodology to tackle their domain research challenges:
- Astronomy
- Drug discovery
- Disease diagnosis
- Molecular dynamics
- Neurology
- Particle physics
- Social science

Deep Learning on HPC
- Neural networks are a good fit for HPC architectures
- HPC can reduce the training turnaround time
Training turnaround time and its effect on research (Jonathan Hseu, Google Brain Team):
  Minutes, hours    Interactive research! Instant gratification!
  1-4 days          Tolerable. Interactivity replaced by running many experiments in parallel
  1-4 weeks         High value experiments only. Progress stalls
  >1 month          Don't even try

TACC's Mission
- To enable discoveries that advance science and society through the application of advanced computing technologies
- TACC is embracing this new type of application and the programming paradigm changes it may bring

Technical Approach
- Support a number of popular deep learning frameworks: Caffe, MXNet, and TensorFlow at the moment
- Support multiple modern architectures: Intel KNL, Intel Xeon CPU, Nvidia GPU
- Profile and present the performance of a suite of well-known deep learning applications: ConvNet on the Cifar10 dataset; AlexNet, CaffeNet, and GoogleNet on the ImageNet dataset

Status
- We measured Caffe, MXNet, and TensorFlow performance on the Stampede2 supercomputer and our GPU-based Maverick cluster
- Intel Caffe is available as a module on the Stampede2 supercomputer, with the user guide in preparation

Outline
- Overview
- Approach
- Performance
- Summary and Future Work

Stampede2
- Phase 1 of Stampede2 has 4,200 KNL nodes
- Each node has 96 GB DDR4 and 16 GB MCDRAM
- 100 Gb/s Intel Omni-Path interconnect
- A Lustre deployment with ~40 PB storage

Software Stack
- Intel Caffe (self_contained_MKLGOLD_u2)
- Machine Learning Scaling Library, MLSL (v2017-Preview)
- MKLML (2017.0.2.20170110)
- Intel MPI 17.0.4
- MXNet (v0.10.0-79-g790328f)
- TensorFlow (v1.3.0-rc2)

Single-node Performance - Caffe
Goal:
- Find the OMP_NUM_THREADS configuration that gives the best performance on a KNL
- Compare the MKL2017 kernel and the default kernel
Methodology:
- Fix the workload: ConvNet with the Cifar10 dataset and CaffeNet with the ImageNet-100 dataset
- Run the workloads with {32, 64, 128} OpenMP threads
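The thread-count sweep in this methodology can be sketched as a small driver. This is a minimal sketch, not the authors' benchmark harness: the function name `sweep_omp_threads` and its `runner` parameter are hypothetical, and the training command is left to the caller. Only the OMP_NUM_THREADS variable and the thread counts {32, 64, 128} come from the slide.

```python
import os
import subprocess
import time


def sweep_omp_threads(cmd, thread_counts=(32, 64, 128), runner=subprocess.run):
    """Run `cmd` once per thread count; return wall-clock seconds per run.

    `cmd` is the training command as an argument list (supplied by the caller).
    `runner` is injectable so the sweep can be dry-run without launching jobs.
    """
    timings = {}
    for n in thread_counts:
        # Copy the current environment and override only OMP_NUM_THREADS.
        env = dict(os.environ, OMP_NUM_THREADS=str(n))
        start = time.perf_counter()
        runner(cmd, env=env, check=True)
        timings[n] = time.perf_counter() - start
    return timings
```

Calling `sweep_omp_threads([...])` with the real training command returns a dictionary keyed by thread count, from which the fastest configuration can be picked with `min(timings, key=timings.get)`.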

Single-node Performance - Caffe
- MKL2017 has a 1.4-3.7x speedup compared to the default kernel
- 64 OpenMP threads deliver the best performance

Single-node Performance - Architecture Comparison
Goal:
- Understand the performance difference between KNL and Nvidia's K40 and P100
Methodology:
- Fix the workloads
- Run ConvNet, CaffeNet, AlexNet, and GoogleNet on one KNL, one K40, and one P100

Single-node Performance - Architecture Comparison
- One KNL is 2x faster than one K40
- One KNL is 40-80% slower than one P100

Single-node Performance - MXNet and TensorFlow
Slowdown summary of MXNet and TensorFlow on KNL, compared with each GPU:
          MXNet        TensorFlow
  K40     1.2-3.7x     5.0-22.3x
  P100    5.3-14.1x    18.9-59.0x

Multi-node Performance - Strong Scaling Caffe
Goal:
- Understand the strong scaling pattern of Caffe
Methodology:
- Fix the mini-batch size of each model
- Fix the number of epochs
- Run ConvNet, CaffeNet, AlexNet, and GoogleNet on {1, 2, 4, 8} KNLs
[Figure: mini-batch size and number of epochs; annotation: "Guaranteed Generalization!"]

Multi-node Performance - Strong Scaling Caffe
- ~50% strong scaling efficiency
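For reference, strong scaling efficiency is conventionally computed as T(1) / (N * T(N)) for a fixed total workload. The sketch below applies that standard definition; the slides do not state the exact formula used, and the timings are hypothetical numbers chosen only to show the arithmetic, not measurements from Stampede2.

```python
def strong_scaling_efficiency(t_one_node, t_n_nodes, n_nodes):
    """Parallel efficiency for a fixed total workload: T(1) / (N * T(N))."""
    if n_nodes < 1 or t_one_node <= 0 or t_n_nodes <= 0:
        raise ValueError("node count must be >= 1 and timings must be positive")
    return t_one_node / (n_nodes * t_n_nodes)


# Hypothetical timings in seconds (not measured values):
# 1000 s on 1 node and 500 s on 4 nodes is a 2x speedup on 4x the hardware.
print(strong_scaling_efficiency(1000.0, 500.0, 4))  # 0.5
```

An efficiency of 0.5 on 4 nodes therefore means the run is only twice as fast as on a single node.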

Multi-node Performance - Strong Scaling Caffe
- 2 KNLs deliver similar performance to one P100 for CaffeNet, AlexNet, and GoogleNet

Multi-node Performance - Weak Scaling Caffe
Goal:
- Understand the weak scaling pattern of Caffe
Methodology:
- Increase the mini-batch size proportionally to node count
- Fix the number of epochs
- Run ConvNet, CaffeNet, AlexNet, and GoogleNet on {128, 256, 512, 1024} KNLs
[Figure: mini-batch size and number of epochs; annotation: "Guaranteed Generalization?"]
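The effect of this methodology on the training schedule can be worked out with simple arithmetic: when the mini-batch grows with node count and the number of epochs is fixed, the number of weight updates shrinks, which is one plausible reading of the question mark on generalization. The sketch below is illustrative only; the per-node batch size (256) and dataset size (1,280,000 samples) are hypothetical values, not figures from the slides.

```python
import math


def weak_scaling_plan(per_node_batch, n_nodes, dataset_size, epochs):
    """Return (global mini-batch size, total iterations) for a fixed epoch count."""
    global_batch = per_node_batch * n_nodes
    # One epoch is one pass over the dataset; the last batch may be partial.
    iters_per_epoch = math.ceil(dataset_size / global_batch)
    return global_batch, iters_per_epoch * epochs


# Hypothetical per-node batch and dataset size; node counts are from the slide.
for nodes in (128, 256, 512, 1024):
    batch, iters = weak_scaling_plan(256, nodes, 1_280_000, epochs=1)
    print(f"{nodes:5d} KNLs: global mini-batch {batch:7d}, {iters:3d} iterations per epoch")
```

Under these assumed numbers, going from 128 to 1024 nodes cuts the updates per epoch from 40 to 5.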

Multi-node Performance - Weak Scaling Caffe
- MLSL stops working
- Pure MPI mode
- Known largest working scale: 768

Summary
- TACC has three popular deep learning frameworks (Caffe, MXNet, and TensorFlow) running on the Stampede2 supercomputer
- A single KNL running Caffe is ~2x faster than a K40 and ~40%-80% slower than a P100
- Strong scaling Caffe with the same mini-batch size has ~50% efficiency on 4 KNLs
- Weak scaling Caffe with a large mini-batch size has ~80% efficiency on 512 KNLs

Ongoing and Future Work
- Improving the Caffe scalability with Intel
- Deploying the Layer-wise Adaptive Rate Scaling (LARS) algorithm to enable large mini-batch size training in less time
- Possibly optimizing MXNet and TensorFlow performance with Intel
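As background on LARS: the algorithm, as published by You, Gitman, and Ginsburg, gives each layer a local learning rate proportional to the ratio of its weight norm to its gradient norm. The sketch below shows only that ratio in pure Python. It is not the Intel Caffe patch; the function names, default coefficients, and the zero-norm fallback are assumptions made for illustration.

```python
import math


def l2_norm(values):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(v * v for v in values))


def lars_local_lr(weights, grads, trust_coef=0.001, weight_decay=0.0):
    """Layer-wise rate: trust_coef * ||w|| / (||g|| + weight_decay * ||w||).

    The result multiplies the global learning rate for this layer.
    """
    w_norm = l2_norm(weights)
    denom = l2_norm(grads) + weight_decay * w_norm
    if w_norm == 0.0 or denom == 0.0:
        # Assumed fallback: apply no layer-wise scaling when a norm is zero.
        return 1.0
    return trust_coef * w_norm / denom


# Toy layer: ||w|| = 5 and ||g|| = 1, so the local rate is 0.001 * 5 / 1.
print(lars_local_lr([3.0, 4.0], [0.6, 0.8]))
```

Layers whose gradients are small relative to their weights get a larger step, and layers with large gradients get a smaller one, which is what keeps very large mini-batch training stable.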

THANKS Q&A