Deep Learning and HPC. Adam Coates, Visiting Scholar at IU Informatics; Post-doc at Stanford CS.

Presentation transcript:

Adam Coates Deep Learning and HPC. Visiting Scholar at IU Informatics; Post-doc at Stanford CS

Adam Coates What do we want computers to do with our data? Images/video: label ("Motorcycle"), suggest tags, image search, … Audio: speech recognition, music classification, speaker identification, … Text: web search, anti-spam, machine translation, …

Adam Coates Computer vision is hard! Motorcycle

Adam Coates What do we want computers to do with our data? Images/video: label ("Motorcycle"), suggest tags, image search, … Audio: speech recognition, music classification, speaker identification, … Text: web search, anti-spam, machine translation, … Machine learning performs well on many of these problems, but is a lot of work. What is it about machine learning that makes it so hard to use?

Adam Coates Machine learning for image classification “Motorcycle”

Adam Coates Why is this hard? You see this: But the camera sees this:

Adam Coates Machine learning and feature representations. Input: the raw image (pixel values) is fed to the learning algorithm, which must separate motorbikes from "non"-motorbikes plotted directly in raw pixel space (pixel 1 vs. pixel 2).

Adam Coates What we want. Input: the raw image is first mapped to a feature representation (e.g., does it have handlebars? wheels?), and the learning algorithm then separates motorbikes from "non"-motorbikes in that feature space (handlebars vs. wheels) rather than in raw pixel space.

Adam Coates How is computer perception done? Images/video: image → vision features → detection. Audio: audio → audio features → speaker ID. Text: text → text features → text classification, machine translation, information retrieval, … Coming up with features is difficult, time-consuming, and requires expert knowledge. When working on applications of learning, we spend a lot of time tuning the features.

Adam Coates Deep Learning Find algorithms that can learn representations/features from data. – Deep neural networks. – “Unsupervised feature learning” Learn representations without knowing task.

Adam Coates Deep Learning Build multi-stage pipelines from simple pieces. – Classic system: deep neural net. – Generally: compositions of differentiable functions. “Motorcycle” Optimize weights inside network to give correct answers on training data.

Adam Coates Deep Learning Build multi-stage pipelines from simple pieces. – Learns internal representation as needed. “Motorcycle”

Adam Coates Basic algorithmic components In a loop over the entire training set: 1. Evaluate the deep network. Usually process a batch of training examples (e.g., 100) at once. 2. Compute the gradient of the loss function w.r.t. the parameters. Sum up gradients over the batch of examples. 3. Update the trainable parameters using the gradient.
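Concretely, that loop might look like the minimal sketch below (plain NumPy, one hidden ReLU layer, squared-error loss; all names are illustrative rather than taken from the actual Stanford code):

    # Minimal mini-batch gradient descent sketch (illustrative, NumPy only).
    import numpy as np

    def forward(params, X):
        h = np.maximum(0.0, X @ params["W1"])     # 1. evaluate the network (one hidden ReLU layer)
        return h, h @ params["W2"]

    def gradients(params, X, Y):
        h, out = forward(params, X)
        d_out = (out - Y) / len(X)                # 2. gradient of the squared-error loss,
        d_W2 = h.T @ d_out                        #    averaged over the batch
        d_h = d_out @ params["W2"].T
        d_h[h <= 0.0] = 0.0                       # ReLU backward
        return {"W1": X.T @ d_h, "W2": d_W2}

    def train(params, data, labels, lr=0.1, batch=100, epochs=10):
        for _ in range(epochs):                   # loop over the entire training set
            for i in range(0, len(data), batch):  # process a batch (e.g., 100 examples) at once
                g = gradients(params, data[i:i+batch], labels[i:i+batch])
                for k in params:                  # 3. update trainable parameters with the gradient
                    params[k] -= lr * g[k]
        return params

    # Example setup (illustrative sizes):
    # params = {"W1": 0.01 * np.random.randn(784, 256),
    #           "W2": 0.01 * np.random.randn(256, 10)}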

Adam Coates Scaling Up Deep Learning at Stanford Most DL networks built on a few primitives. – Mostly large dense matrix/vector operations. – A few “block” matrices for widely-used cases. – Communication hidden in distributed arrays. Most operations are hardware-friendly. – Not far from sgemm throughput. – Relatively low communication / IO needs. But hard to avoid doing many iterations. – Have to focus on making each loop very fast.
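As a rough illustration of why these operations track sgemm throughput, here is a FLOP count for one batched fully connected layer (the layer sizes and the sustained GEMM rate below are assumed figures, not from the talk):

    # Why a dense layer is "sgemm-bound": FLOPs in one batched forward pass.
    batch, n_in, n_out = 100, 4096, 4096          # assumed sizes
    flops = 2.0 * batch * n_in * n_out            # multiply-adds in the matrix product

    sgemm_flops_per_sec = 1.0e12                  # hypothetical sustained GEMM rate
    print(flops / 1e9, "GFLOP per layer per batch")                   # ~3.4
    print(1e3 * flops / sgemm_flops_per_sec, "ms at that GEMM rate")  # ~3.4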

Adam Coates Scaling Up Deep Learning at Stanford In-house MPI+CUDA infrastructure. – Up to 11.2B parameter networks. – Typical experiment: ~14M images (Image-Net). [Coates et al., ICML 2013]

Adam Coates Scaling Up Deep Learning at Stanford Duplicated "Google Brain" with 3 machines. – Compared to the ~1,000 machines used for the original system. – Unsupervised learning from 10M YouTube frames. Largest artificial neural nets ever trained. – 6.5x larger than previous system. … but what should we do with it!? Surprisingly hard to find a problem big enough that such models matter! [Coates et al., ICML 2013]

Adam Coates Applications Building universal representations – "One neural net to rule them all." Object recognition, localization, tagging, depth estimation, … Shared representation for many tasks. [E.g., Collobert et al., 2011]

Adam Coates Applications Autonomous Driving 1 year * 1 Hz = ~30M frames [Actually have to drive for 1 year!] Can we train from a few hundred 1080p frames per second?
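A quick back-of-the-envelope check of the numbers above (the one-day-per-pass target is an assumption added for illustration, not from the slide):

    # Rough arithmetic for the driving example (assumptions marked below).
    seconds_per_year = 365 * 24 * 3600        # ~31.5M seconds
    frames = seconds_per_year * 1             # camera sampled at 1 Hz -> ~30M frames

    # Assumption for illustration: one pass over the data in about a day.
    seconds_per_pass = 24 * 3600
    print(round(frames / 1e6, 1), "million frames")                           # ~31.5
    print(round(frames / seconds_per_pass), "frames/s for one pass per day")  # ~365, i.e. a few hundred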

Adam Coates Applications: why these? High impact. – Universal representations: many applications with diffused value. – Driving: single application with high value. Train once, deploy everywhere. – Training is hard, expensive. – Deploying is easy, cheap. – A supercomputer can generate an artifact that gets re-used by others.

Adam Coates Things that work Find common cases; tightly optimize – Surprisingly few core pieces. E.g., 10. Distributed arrays – Massive time-saver; easy to think about. – Easy to save and restore from Lustre. – Load shards and sanity-check them in Matlab. High-level language bindings – Low-level code in C++/CUDA (JIT)
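The save/restore and shard sanity-checking pattern might look roughly like the following sketch (plain NumPy; the file layout, names, and Lustre path are hypothetical, and in the slide the spot-checking was done in Matlab rather than Python):

    # Sketch: persist a distributed array as one shard per rank so it can be
    # restored later and spot-checked offline. File layout/names are illustrative.
    import numpy as np

    def save_shard(dirname, rank, local_block, global_offset):
        # Each rank writes only its own block plus enough metadata to reassemble.
        np.savez(f"{dirname}/shard_{rank:04d}.npz",
                 block=local_block, offset=np.array(global_offset))

    def load_shard(dirname, rank):
        with np.load(f"{dirname}/shard_{rank:04d}.npz") as f:
            return f["block"], int(f["offset"])

    # Spot-check one shard without launching the full MPI job (path is hypothetical):
    # block, offset = load_shard("/lustre/scratch/run42/params", 3)
    # print(block.shape, block.mean(), block.std())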

Adam Coates Challenges Experiment turn-around time is still long. – Maybe 3-5 experiments running at once. – Weeks for big models / big datasets. Productivity is still much lower than, e.g., Matlab. – Lack of strong tools at every level except lowest. Many DL hackers are not systems hackers. Lots of hard-won lessons that are trapped in our group.

Adam Coates Laundry list from Stanford infrastructure
Job control and scripting is painful
– Zombies
– PBS/Torque mostly works
JIT compilation
– JIT compile C/C++ code. Flexible enough to do many things. Easier to use CUDA runtime, templatizing, etc.
– Avoids Driver API, which is much less convenient. Easier to link with high-level languages.
– Needs to be thread-savvy: caching of compiled modules; avoiding deadlocks or locking problems in cache(s).
– Ideally invisible to users. But first use of kernels is really slow.
Debugging
– Unclear what to do here. Support for common tools? NVTX, VampirTrace…?
Distributed arrays
– Stanford implementation is rough. Should have pursued a more standard approach.
– MATLAB's co-distributed arrays; ScaLapack-style arrays: a multi-dimensional array with a "distributor" that maps indices to ranks (see the sketch below). Support to re-distribute the array. Support to save/load arrays even when the process grid changes. Distribution-aware implementations of most functionality.
Execution structure
– Imperative programming is just easier (esp. with students + scientists). DAGs, etc. are static and difficult to alter. Works OK for us, but many headaches.
– CUDA streams+events semantics is really nice. Solves the same problem: hide massive parallelism from the caller, but allows arbitrary scheduling on the fly. Easy to understand behavior as viewed by the host.
– If you want custom functionality, you just have to write the parallel code. In CUDA, you have to write the kernel. For ScaLapack, you had to write code on top of BLACS.
– Single-rank case should look like the 100-rank case. Students can prototype single-rank. Easier to think about.
IO tools
– We spend a lot of time writing file loaders. Application-specific, but lots of boiler-plate.
– Many common cases in ML. E.g., a list of samples, where each sample = video, image, string, vector.
– Currently difficult to handle distributed saving/loading of large arrays of data.
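The "distributor" idea mentioned under Distributed arrays might look roughly like this sketch (a 1-D block distribution in plain NumPy; class and function names are illustrative, and a real implementation would exchange only the shards that actually move between MPI ranks instead of gathering everything on one process):

    # Sketch of a "distributor": maps global indices to (rank, local index) for a
    # 1-D block distribution, plus a toy redistribution helper. Illustrative only.
    import numpy as np

    class BlockDistributor:
        def __init__(self, global_len, nranks):
            self.global_len = global_len
            self.nranks = nranks
            self.block = (global_len + nranks - 1) // nranks   # ceiling division

        def owner(self, i):
            """Rank that owns global index i, and the local index on that rank."""
            return i // self.block, i % self.block

        def local_range(self, rank):
            lo = rank * self.block
            return lo, min(lo + self.block, self.global_len)

    def redistribute(blocks, old, new):
        """Toy gather-then-rescatter; blocks[r] is rank r's local block under 'old'."""
        flat = np.concatenate(blocks)
        assert len(flat) == old.global_len == new.global_len
        return [flat[slice(*new.local_range(r))] for r in range(new.nranks)]

    # Example: move a length-10 array from a 2-rank grid to a 3-rank grid.
    # old, new = BlockDistributor(10, 2), BlockDistributor(10, 3)
    # blocks = [np.arange(5), np.arange(5, 10)]
    # print([b.tolist() for b in redistribute(blocks, old, new)])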