Artificial Intelligence: Driving the Next Generation of Chips and Systems Manish Pandey May 13, 2019.

Revolution in Artificial Intelligence

Intelligence Infused from the Edge to the Cloud

Fueled by advances in AI/ML. [Figure: error rate in image classification falling from 2010 to 2017; growth in AI papers since 1996, relative to 1996.]

AI Revolution Enabled by Hardware. AI progress is driven by advances in hardware. [Slide image: the "chihuahua or muffin" image-classification example.]

Deep Learning Advances Gated by Hardware [Bill Dally, SysML 2018]

Deep Learning Advances Gated by Hardware. Results improve with larger models and larger datasets => more computation.

Data Set and Model Size Scaling. 256X data, 64X model size => ~32,000X compute. [Hestness et al., arXiv:1712.00409]

How Much Compute? Inference (ResNet-50): 25 million weights, 300 Gops per HD image, 9.4 Tops at 30 fps; 12 HD cameras x 3 nets = 338 Tops. Training: 30 Tops x 10^8 (training set) x 10^2 (epochs) ~= 10^23 ops.
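
The slide's arithmetic can be checked with a quick back-of-the-envelope script (the per-image and per-step figures are taken directly from the slide; small rounding differences against the slide's 9.4/338 Tops are expected):

```python
# Back-of-the-envelope compute estimates for the slide's ResNet-50 scenario.
gops_per_image = 300e9            # ops to process one HD image
fps = 30
ops_per_camera = gops_per_image * fps          # ~9e12 ops/s (slide: 9.4 Tops)
cameras, nets = 12, 3
inference_ops = ops_per_camera * cameras * nets  # ~324e12 ops/s (slide: 338 Tops)

# Training: ~30 Tops per pass x 10^8 training images x 10^2 epochs
training_ops = 30e12 * 1e8 * 1e2               # ~3e23 ops total

print(f"inference: {inference_ops / 1e12:.0f} Tops")
print(f"training:  {training_ops:.1e} ops")
```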

Challenge: End of the Line for CMOS Scaling. Device scaling is slowing, and power density stopped scaling in 2005.

From Dennard Scaling to Dark Silicon. Can we specialize designs for DNNs? [Taylor, IEEE Micro 2013: http://cseweb.ucsd.edu/~mbtaylor/papers/taylor_landscape_ds_ieee_micro_2013.pdf]

Energy Cost for Operations (45 nm): instruction fetch/decode alone costs ~70 pJ, dwarfing the arithmetic it controls. [Horowitz, ISSCC 2014]

Energy Cost for DNN Ops (Y = AB + C):

  Instruction              Operand source  Data type  Energy    Energy/Op
  1 Op/Instr (*, +)        Memory          fp32       89 nJ     693 pJ
  128 Ops/Instr (AX+B)     Memory          fp32       72 nJ     562 pJ
  128 Ops/Instr (AX+B)     Cache           fp32       0.87 nJ   6.82 pJ
  128 Ops/Instr (AX+B)     Cache           fp16       0.47 nJ   3.68 pJ

Build a processor tuned to the application - specialize for DNNs:
- More effective parallelism for AI operations (not ILP): SIMD vs. MIMD, VLIW vs. speculative out-of-order execution
- Eliminate unneeded accuracy: IEEE floating point replaced by lower-precision FP; 32-64 bit integers reduced to 8-16 bit integers
- More effective use of memory bandwidth: user-controlled memory versus caches
The figure of merit: how many Teraops per Watt?

Systolic Arrays: a 2-D grid of processing elements. They balance compute and I/O, use only local communication, and expose massive concurrency: M1*M2, an n x n matrix multiply, completes in ~2n cycles.
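
A toy cycle-level sketch (not from the slides) makes the dataflow concrete: with skewed inputs, PE (i, j) sees a[i, k] and b[k, j] at cycle i + j + k, communicating only with its neighbors, and the whole product drains in ~3n cycles (the slide's "2n" refers to throughput once the pipeline is full):

```python
import numpy as np

def systolic_matmul(A, B):
    """Toy simulation of an output-stationary n x n systolic array.

    A's rows stream in from the left, B's columns from the top; each PE
    multiplies the pair of values passing through it and accumulates locally.
    """
    n = A.shape[0]
    C = np.zeros((n, n))
    for cycle in range(3 * n - 2):          # last PE finishes at cycle 3n - 3
        for i in range(n):
            for j in range(n):
                k = cycle - i - j           # which operands reach PE (i, j) now
                if 0 <= k < n:
                    C[i, j] += A[i, k] * B[k, j]
    return C

A = np.arange(9.0).reshape(3, 3)
B = np.eye(3)
assert np.allclose(systolic_matmul(A, B), A @ B)
```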

TPUv2: 8 GB + 8 GB HBM, 600 GB/s memory bandwidth. The MXU uses reduced-precision multiplication with fp32 accumulation. [Dean, NIPS'17]

Datatypes - Reduced Precision. Stochastic rounding: 2.2 is rounded to 2 with probability 0.8 and to 3 with probability 0.2, so rounding is unbiased in expectation. With stochastic rounding, 16-bit training accuracy matches 32-bit!
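
A minimal sketch of stochastic rounding to the nearest integer (rounding to a lower-precision grid works the same way after scaling):

```python
import numpy as np

def stochastic_round(x, rng):
    """Round each element down or up with probability equal to its distance
    from the other integer: 2.2 -> 2 with p=0.8, -> 3 with p=0.2."""
    floor = np.floor(x)
    frac = x - floor
    return floor + (rng.random(np.shape(x)) < frac)  # True adds 1.0

rng = np.random.default_rng(0)
samples = stochastic_round(np.full(100_000, 2.2), rng)
print(samples.mean())  # ~2.2: unbiased on average, unlike round-to-nearest
```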

Integer Weight Representations

Extreme Datatypes. Q: What can we do with single-bit or ternary (+a, -b, 0) weight representations? A: Surprisingly, a lot - just retrain the network and/or increase the number of activation layers. [Zhu (Chenzhuo) et al., arXiv:1612.01064]
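
A simplified sketch of ternary quantization in the spirit of that paper (the fixed threshold t is a hypothetical choice here; the paper learns the scales a and b during retraining):

```python
import numpy as np

def ternarize(w, t=0.05):
    """Map weights to {+a, 0, -b}: small weights become 0, and a, b are the
    mean magnitudes of the surviving positive and negative weights."""
    mask_pos = w > t
    mask_neg = w < -t
    a = w[mask_pos].mean() if mask_pos.any() else 0.0
    b = -w[mask_neg].mean() if mask_neg.any() else 0.0
    return a * mask_pos - b * mask_neg

w = np.array([0.3, -0.2, 0.01, 0.5, -0.4])
print(ternarize(w))  # [0.4, -0.3, 0.0, 0.4, -0.3]
```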

Pruning - how many values do we actually need? Up to 90% of weight values can be thrown away with little accuracy loss.
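
The standard recipe (magnitude pruning) keeps only the largest-magnitude weights; a minimal sketch, without the retraining step that recovers accuracy:

```python
import numpy as np

def magnitude_prune(w, sparsity=0.9):
    """Zero out the smallest-magnitude `sparsity` fraction of weights."""
    k = int(w.size * sparsity)
    thresh = np.sort(np.abs(w).ravel())[k]   # cutoff magnitude
    return np.where(np.abs(w) >= thresh, w, 0.0)

w = np.random.default_rng(0).normal(size=(100, 100))
pruned = magnitude_prune(w)
print((pruned == 0).mean())  # ~0.9 of the weights are now zero
```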

Quantization - how many distinct values? 16 values => a 4-bit representation: instead of 16 fp32 numbers, store 16 4-bit indices into a shared codebook. [Han et al., Deep Compression, 2015]
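
A codebook-quantization sketch of that idea (a uniform 16-entry codebook for simplicity, where the paper clusters weights with k-means):

```python
import numpy as np

# Quantize fp32 weights to 4-bit indices into a shared 16-entry codebook.
w = np.random.default_rng(1).normal(size=1024).astype(np.float32)
codebook = np.linspace(w.min(), w.max(), 16, dtype=np.float32)
indices = np.abs(w[:, None] - codebook[None, :]).argmin(axis=1).astype(np.uint8)
w_hat = codebook[indices]                     # dequantized weights

orig_bits = w.size * 32                       # raw fp32 storage
quant_bits = w.size * 4 + codebook.size * 32  # 4-bit indices + fp32 codebook
print(orig_bits / quant_bits)                 # ~7.1x compression
```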

Memory Locality - Bring Compute Closer to Memory. Energy per access: 32-bit DRAM access 640 pJ; 256-bit access to an 8 KB cache 50 pJ; 256-bit bus across 20 mm 256 pJ; short 256-bit bus 25 pJ; 32-bit op 4 pJ.

Memory and inter-chip communication advances [Graphcore 2018]:
- 64 pJ/B: 16 GB at 900 GB/s @ 60 W
- 10 pJ/B: 256 MB at 6 TB/s @ 60 W
- 1 pJ/B: 1000 x 256 KB at 60 TB/s @ 60 W
Memory power density is ~25% of logic power density.

In-Memory Compute In-memory compute with low-bits/value, low cost ops [Ando et al., JSSC 04/18]

AI Application-System Co-design

AI Application-System Co-design: co-design across the algorithm, architecture, and circuit levels; optimize DNN hardware accelerators across multiple datasets. [Reagen et al., Minerva, ISCA 2016]

Where Next?
- The rise of non-von Neumann architectures to efficiently solve domain-specific problems: a new golden age of computer architecture
- Advances in semiconductor technology and computer architecture will continue to drive advances in AI
- 10 Tops/W is the current state of the art, still several orders of magnitude short of the human brain