EE 193: Parallel Computing, Fall 2017
Tufts University
Instructor: Joel Grodstein (joel.grodstein@tufts.edu)
Lecture 9: Google TPU
Resources
- "In-Datacenter Performance Analysis of a Tensor Processing Unit", Norman Jouppi et al., Google, ISCA 2017
- https://eecs.berkeley.edu/research/colloquium/170315: a talk by Dave Patterson; he mostly just presented the paper above
- "Efficient Processing of Deep Neural Networks: A Tutorial and Survey", Vivienne Sze et al., MIT, 2017; available at Cornell Libraries
- "Design and Analysis of a Hardware CNN Accelerator", Kevin Kiningham et al., Stanford Tech Report, 2017; describes a systolic array. Is it similar to the TPU? Don't know...
Summary of this course
- Building a single-core computer that's really fast is impossible
- Instead, we build lots of cores and hope to fill them with lots of threads
  - We've succeeded, but we had to do lots of work (especially for a GPU)
- Is there a better way? Domain-specific architectures
  - Build custom hardware to match your problem
  - Only cost-effective if your problem is really important
The problem
Three problems that Google faces:
- Speech recognition and translation on phones
- Search ranking
- Google fixes your searches for you
Common solution to all three: deep neural networks
Neural net
- Each node computes a weighted sum: y = w1*in1 + w2*in2 + ...
- Followed by an excitation function, e.g., out = (y < 0 ? 0 : 1)
- Each vertical slice is a layer
- A deep neural net has lots of layers
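A minimal sketch of one node in Python, using the weighted sum and step excitation from the slide (the weights and inputs are made-up example numbers, not from the lecture):

```python
import numpy as np

def neuron(inputs, weights):
    """One node: weighted sum y = w1*in1 + w2*in2 + ..., then a step excitation."""
    y = np.dot(weights, inputs)      # weighted sum
    return 0.0 if y < 0 else 1.0     # excitation: out = (y < 0 ? 0 : 1)

# Example with made-up numbers
out = neuron(np.array([0.5, -1.0, 2.0]), np.array([0.1, 0.4, 0.3]))
print(out)   # 1.0, since y = 0.05 - 0.4 + 0.6 = 0.25 >= 0
```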
Types of neural nets
- Multi-layer perceptrons are fully connected (60%)
- Convolutional neural nets have connections to nearby inputs only (30%)
- Recurrent NNs have internal state, like an IIR filter (5%)
What does this remind you of?
- Matrix multiplication, yet again!
- y = w1*in1 + w2*in2 + ... is a vector dot product
- One layer: output vector = weight matrix * input vector
- So, yes, deep neural networks are mostly just lots of matrix multiplies (plus the excitation function)
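To make the layer-as-matrix-multiply point concrete, here is a small sketch (layer sizes and random weights are arbitrary choices, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4)        # input vector: one layer's inputs
W = rng.standard_normal((3, 4))   # weight matrix: 3 output nodes, 4 inputs each

y = W @ x                         # one layer = matrix * vector (3 dot products)
out = np.where(y < 0, 0.0, 1.0)   # excitation function applied elementwise
print(out.shape)                  # (3,): one value per output node
```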
How do we make it run fast?
- We've spent an entire course learning that. For Google, it wasn't good enough
  - One network can have 5-100M weights
  - Lots and lots of customers
- Goal: improve on GPU performance by 10x
- Strategy: figure out which parts of a CPU/GPU we don't need, remove them, and use the area for more compute
What did they add?
- A matrix-multiply unit with 65K multiply-accumulate units! (Pascal has 3800 cores.)
- Lots of on-chip memory to hold the weights and temporary storage
  - 3x more on-chip memory than a GPU
  - Still can't hold all the weights, but a lot of them
- What did they remove to make room for this stuff?
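A back-of-the-envelope check on that 65K number. The 256x256 array size and the ~700 MHz clock come from the ISCA 2017 paper rather than this slide, so treat the exact figures as assumptions:

```python
# The matrix-multiply unit is a 256 x 256 array of 8-bit multiply-accumulate cells.
macs = 256 * 256                      # 65,536 MACs
ops_per_mac = 2                       # one multiply plus one add per cycle
clock_hz = 700e6                      # ~700 MHz, as reported in the ISCA 2017 paper

peak_ops = macs * ops_per_mac * clock_hz
print(f"{peak_ops / 1e12:.0f} TOPS")  # ~92 TOPS peak 8-bit throughput
```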
What did they remove?
- No OOO (like a GPU)
- No speculative execution or branch prediction (like a GPU)
  - Actually, no branches at all! Just the arithmetic
- Far fewer registers
  - Pascal has 64K registers/SM; the TPU has very few
- How can they get away with that?
Very few registers
- Outputs of one stage directly feed the next stage
- GPU or CPU approach:
  - stuff outputs into registers
  - read them again as the next stage's inputs
  - this costs silicon area, and also power
- TPU approach: just get rid of those registers and feed results directly forward (see the sketch below)
  - part of what's called a systolic array
  - also removes the intermediate registers for the partial sums
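A minimal functional sketch of that idea, assuming a weight-stationary dataflow (weights held in place, activations streaming across rows, partial sums flowing down columns). It shows what each cell computes and how partial sums feed the next cell directly, not the TPU's exact cycle-by-cycle pipelining:

```python
import numpy as np

def systolic_matvec(W, x):
    """Weight-stationary systolic sketch: cell (r, c) holds weight W[c, r].
    The activation x[r] streams along row r; each cell adds its product onto
    the partial sum arriving from the cell above and passes it straight down,
    with no register file in between."""
    n_out, n_in = W.shape
    psum = np.zeros(n_out)             # partial sums entering the top row
    for r in range(n_in):              # one row of cells per input element
        a = x[r]                       # activation streaming along row r
        for c in range(n_out):         # each cell in that row
            psum[c] += W[c, r] * a     # MAC; result feeds the cell below
    return psum                        # values emerging from the bottom row

W = np.arange(12, dtype=float).reshape(3, 4)
x = np.array([1.0, 2.0, 0.5, -1.0])
print(systolic_matvec(W, x))
print(W @ x)                           # same answer as an ordinary matmul
```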
What did they remove?
- Far fewer threads, and no SMT
  - Haswell = 2 threads/core; Pascal = 1000 threads/SM; TPU = ?? Just 1 thread!
- But isn't lots of threads and SMT what makes things go fast?
  - SMT: when one thread is stalled, swap in another
  - Stalls are hard to predict (they depend on memory patterns, branch patterns), so keep lots of threads waiting
- Why can the TPU get away with one thread? It has no branches, and it runs only one application
Downsides of SMT
- Lots of threads require lots of registers
  - The TPU doesn't need nearly as many registers
  - Moving data back and forth between registers and ALUs costs power
- SMT is a way to keep the machine busy during unpredictable stalls
  - It doesn't change the fact that a thread stalled unpredictably
  - It doesn't make the stalled thread faster
  - It improves average latency, not worst-case latency
- Google's requirement: customers want the worst-case response time met 99% of the time
  - They don't want unpredictable waits more than 1% of the time
  - SMT does not fit this customer requirement
What did they remove?
- No OOO
- No branches (or branch prediction, or speculative execution)
- Far fewer registers
- 1 thread, no SMT
- 8-bit integer arithmetic (see the sketch below)
  - Training vs. inference: training is done in floating point; it happens infrequently and can be slow
  - Inference can be done with 8-bit integers
  - The TPU uses 8-bit MACs. This saves substantial area, but the TPU cannot do training
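A rough sketch of how floating-point weights can be mapped to 8-bit integers for inference. The single-scale-factor scheme here is one common quantization approach, assumed for illustration rather than taken from the TPU's actual software stack:

```python
import numpy as np

def quantize(w, n_bits=8):
    """Map float values onto signed n-bit integers with a single scale factor."""
    scale = np.max(np.abs(w)) / (2**(n_bits - 1) - 1)   # 127 for 8 bits
    q = np.round(w / scale).astype(np.int8)
    return q, scale

rng = np.random.default_rng(1)
W = rng.standard_normal((3, 4)).astype(np.float32)   # trained float weights
x = rng.standard_normal(4).astype(np.float32)        # input activations

Wq, w_scale = quantize(W)
xq, x_scale = quantize(x)

# Do the MACs in integer arithmetic (accumulating in 32 bits), then rescale once.
y_int = Wq.astype(np.int32) @ xq.astype(np.int32)
y_approx = y_int * (w_scale * x_scale)

print(W @ x)        # full-precision result
print(y_approx)     # close, but not identical: the inference-only trade-off
```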
Results
- 15-30X faster at inference than a GPU or CPU
- 30X-80X better performance/watt
- "Performance" here is measured at the 99th percentile
Block diagram (figure)
Chip floorplan (figure)