Artificial Intelligence: Driving the Next Generation of Chips and Systems
Manish Pandey, May 13, 2019
Revolution in Artificial Intelligence
Intelligence Infused from the Edge to the Cloud
Fueled by advances in AI/ML
[Figures: increasing accuracy / falling error rate in image classification, 2010-2017; growth in AI/ML papers relative to 1996, 2000-2015]
AI Revolution Enabled by Hardware
AI driven by advances in hardware
[Image: classification example ("Chihuahua")]
Deep Learning Advances Gated by Hardware [Bill Dally, SysML 2018]
Results improve with larger models and larger datasets => more computation
Data Set and Model Size Scaling
256X data, 64X model size => 32,000X compute [Hestness et al., arXiv:1712.00409]
How Much Compute?
Inference (ResNet-50, 25 million weights): ~300 GOPs per HD image => 9.4 TOPS at 30 fps
12 HD cameras x 3 networks = 338 TOPS
Training: 30 TOPs x 10^8 (training set) x 10^2 (epochs) ≈ 10^23 ops
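A back-of-the-envelope check of these figures (the numbers are from the slide; the arithmetic and script are mine):

```python
# Rough compute estimates for ResNet-50 inference and training,
# using the figures quoted on this slide (a sketch, not a benchmark).

gops_per_hd_frame = 300                            # ~300 GOPs per HD image (ResNet-50)
per_stream_tops = gops_per_hd_frame * 30 / 1000    # 30 fps -> ~9 TOPS (slide quotes 9.4)
fleet_tops = 9.4 * 12 * 3                          # 12 cameras, 3 networks each
print(f"per stream: ~{per_stream_tops:.1f} TOPS; 12 cameras x 3 nets: ~{fleet_tops:.0f} TOPS")

# Training: ~30 TOPs per example (forward + backward) x 10^8 examples x 10^2 epochs
training_ops = 30e12 * 1e8 * 1e2
print(f"training: ~{training_ops:.0e} total ops")  # ~3e23, i.e. on the order of 10^23
```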
Challenge: End of the Line for CMOS Scaling
Device scaling is slowing; power density stopped scaling in 2005
From Dennard Scaling to Dark Silicon: can we specialize designs for DNNs?
http://cseweb.ucsd.edu/~mbtaylor/papers/taylor_landscape_ds_ieee_micro_2013.pdf
Energy Cost for Operations
[Figure: per-operation energy breakdown at 45 nm; instruction fetch/decode ~70 pJ. Horowitz, ISSCC 2014]
Energy Cost for DNN Ops (Y = AB + C)

Instruction            Operands   Data Type   Energy    Energy/Op
1 Op/Instr (*, +)      Memory     fp32        89 nJ     693 pJ
128 Ops/Instr (AX+B)   Memory     fp32        72 nJ     562 pJ
128 Ops/Instr (AX+B)   Cache      fp32        0.87 nJ   6.82 pJ
128 Ops/Instr (AX+B)   Cache      fp16        0.47 nJ   3.68 pJ
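Reading the table end to end (my arithmetic on the values above):

```python
# Energy-per-op comparison using the values in the table above.
fp32_from_memory_pj = 693.0   # 1 op/instr, fp32 operands from memory
fp16_from_cache_pj  = 3.68    # 128 ops/instr, fp16 operands from cache
print(f"reduction: ~{fp32_from_memory_pj / fp16_from_cache_pj:.0f}x energy per op")
# ~188x: amortizing instruction overhead and keeping operands local dominates the cost.
```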
Build a processor tuned to the application: specialize for DNNs
More effective parallelism for AI operations (not ILP): SIMD vs. MIMD, VLIW vs. speculative out-of-order
Eliminate unneeded accuracy: IEEE FP replaced by lower-precision FP; 32-64 bit integers reduced to 8-16 bit integers (sketched below)
More effective use of memory bandwidth: user-controlled memory versus hardware caches
Metric: how many teraops per watt?
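As an illustration of trading unneeded accuracy for cheaper arithmetic, a minimal per-tensor int8 quantization sketch (the scheme and function names are illustrative assumptions, not taken from the talk):

```python
import numpy as np

def quantize_int8(x):
    """Map a float32 tensor to int8 with a single per-tensor scale (a common, simple scheme)."""
    max_abs = np.max(np.abs(x))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximate float32 tensor from the int8 values and scale."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, s = quantize_int8(w)
print("max abs error:", np.max(np.abs(w - dequantize(q, s))))
```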
Systolic Arrays: 2-D Grid of Processing Elements
Balance compute and I/O; local communication; concurrency
M1*M2: n x n multiply in ~2n cycles (see the sketch below)
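To make the dataflow concrete, a toy cycle-level simulation of an output-stationary systolic array (a sketch of the textbook scheme, not a model of any particular chip; a weight-stationary design that preloads one operand gets closer to the ~2n-cycle figure above):

```python
import numpy as np

def systolic_matmul(A, B):
    """Toy simulation of an output-stationary n x n systolic array.
    A streams in from the left (skewed by row), B from the top (skewed by column);
    each PE multiplies its two inputs, accumulates locally, and forwards the inputs."""
    n = A.shape[0]
    C = np.zeros((n, n))
    a_reg = np.zeros((n, n))   # value held in each PE's horizontal register
    b_reg = np.zeros((n, n))   # value held in each PE's vertical register
    for t in range(3 * n - 2):                 # last partial product lands at cycle 3n-3
        a_in = np.zeros((n, n))
        b_in = np.zeros((n, n))
        for i in range(n):
            for j in range(n):
                # horizontal input: from the left neighbour, or the skewed A stream at the edge
                if j == 0:
                    a_in[i, j] = A[i, t - i] if 0 <= t - i < n else 0.0
                else:
                    a_in[i, j] = a_reg[i, j - 1]
                # vertical input: from the neighbour above, or the skewed B stream at the edge
                if i == 0:
                    b_in[i, j] = B[t - j, j] if 0 <= t - j < n else 0.0
                else:
                    b_in[i, j] = b_reg[i - 1, j]
                C[i, j] += a_in[i, j] * b_in[i, j]   # local multiply-accumulate
        a_reg, b_reg = a_in, b_in                    # values advance one PE per cycle
    return C

A = np.random.randn(4, 4)
B = np.random.randn(4, 4)
assert np.allclose(systolic_matmul(A, B), A @ B)
```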
TPUv2
8 GB + 8 GB HBM, 600 GB/s memory bandwidth
32-bit float MXU: reduced-precision multiplies, fp32 accumulation [Dean, NIPS '17]
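To illustrate the reduced-precision-multiply / fp32-accumulate pattern, a NumPy sketch that emulates bfloat16 inputs by truncating float32 mantissas (a rough emulation under my own assumptions; the MXU's actual rounding and accumulation details are not modeled):

```python
import numpy as np

def to_bfloat16_like(x):
    """Emulate bfloat16 by keeping only the top 16 bits of each float32 bit pattern
    (truncation rather than rounding; close enough for illustration)."""
    bits = x.astype(np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

def mixed_precision_matmul(A, B):
    """Reduced-precision inputs, float32 accumulation: the pattern used by
    matrix units such as the MXU (this NumPy version is only a sketch)."""
    return to_bfloat16_like(A) @ to_bfloat16_like(B)

A = np.random.randn(128, 128).astype(np.float32)
B = np.random.randn(128, 128).astype(np.float32)
err = np.max(np.abs(mixed_precision_matmul(A, B) - A @ B))
print(f"max abs deviation from full fp32: {err:.3f}")
```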
Datatypes: Reduced Precision
Stochastic rounding: 2.2 is rounded to 2 with probability 0.8 and to 3 with probability 0.2
With stochastic rounding, 16-bit training accuracy matches 32-bit
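A minimal stochastic-rounding sketch matching the example above (illustrative; in real low-precision training this happens inside the accumulation path):

```python
import numpy as np

def stochastic_round(x, rng=np.random.default_rng(0)):
    """Round down or up with probability proportional to proximity,
    e.g. 2.2 -> 2 with p = 0.8 and -> 3 with p = 0.2."""
    floor = np.floor(x)
    frac = x - floor
    return floor + (rng.random(np.shape(x)) < frac)

# The rounding error is zero in expectation, which is what lets long sums of
# small updates survive a low-precision accumulator.
samples = stochastic_round(np.full(100000, 2.2))
print(samples.mean())   # ~2.2
```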
Integer Weight Representations
Extreme Datatypes
Q: What can we do with single-bit or ternary (+a, -b, 0) weight representations?
A: Surprisingly, a lot. Just retrain and/or increase the number of activation layers in the network. [Zhu et al., arXiv:1612.01064]
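A simplified ternary-quantization sketch in the spirit of the cited work (the threshold heuristic and per-sign scales here are my assumptions; the paper learns the scales during retraining):

```python
import numpy as np

def ternarize(w, threshold_ratio=0.05):
    """Map weights to {+a, 0, -b}: near-zero weights become 0, the rest collapse
    onto one scale per sign. Simplified; trained ternary quantization retrains
    the network and learns a and b."""
    t = threshold_ratio * np.max(np.abs(w))
    pos, neg = w > t, w < -t
    a = w[pos].mean() if pos.any() else 0.0    # scale for positive weights
    b = -w[neg].mean() if neg.any() else 0.0   # scale for negative weights
    return a * pos - b * neg

w = np.random.randn(6, 6)
print(ternarize(w))
```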
Pruning: How Many Values Do We Need?
90% of the values can be thrown away
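A magnitude-pruning sketch showing what "throw away 90% of the values" looks like mechanically (illustrative; in practice pruning is followed by fine-tuning to recover accuracy):

```python
import numpy as np

def magnitude_prune(w, sparsity=0.9):
    """Zero out the smallest-magnitude weights (here 90%); returns the pruned tensor."""
    k = int(sparsity * w.size)
    threshold = np.partition(np.abs(w).ravel(), k)[k]
    return np.where(np.abs(w) >= threshold, w, 0.0)

w = np.random.randn(64, 64)
pruned = magnitude_prune(w)
print(f"nonzero fraction: {np.count_nonzero(pruned) / w.size:.2f}")   # ~0.10
```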
Quantization: How Many Distinct Values?
16 values => 4-bit representation. Instead of 16 fp32 numbers, store 16 4-bit indices into a small shared codebook. [Han et al., 2015]
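A small weight-sharing sketch along these lines: cluster the weights into 16 centroids and store 4-bit indices plus a tiny codebook (plain k-means here for illustration; the cited work also fine-tunes the centroids):

```python
import numpy as np

def codebook_quantize(w, bits=4, iters=20):
    """Cluster weights into 2^bits centroids; store per-weight indices plus a codebook."""
    k = 2 ** bits
    centroids = np.linspace(w.min(), w.max(), k)          # linear initialization
    flat = w.ravel()
    for _ in range(iters):
        idx = np.argmin(np.abs(flat[:, None] - centroids[None, :]), axis=1)
        for c in range(k):
            if np.any(idx == c):
                centroids[c] = flat[idx == c].mean()
    return idx.reshape(w.shape).astype(np.uint8), centroids.astype(np.float32)

w = np.random.randn(32, 32).astype(np.float32)
idx, codebook = codebook_quantize(w)
# Storage: 4 bits per weight plus 16 fp32 codebook entries, vs. 32 bits per weight.
print("reconstruction error:", np.max(np.abs(codebook[idx] - w)))
```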
Memory Locality: Bring Compute Closer to Memory
[Diagram: Compute <-> Interconnect <-> Weights <-> Interconnect <-> DRAM]
32-bit op: 4 pJ
256-bit on-chip bus: 25 pJ (local), 256 pJ across 20 mm
256-bit access from an 8 KB cache: 50 pJ
32-bit DRAM access: 640 pJ
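A quick ratio check using the energies above (my arithmetic):

```python
# Relative cost of moving data vs. computing on it, from the figures above.
op_32b_pj, cache_256b_pj, dram_32b_pj = 4.0, 50.0, 640.0
print(f"DRAM fetch vs. 32-bit op:        {dram_32b_pj / op_32b_pj:.0f}x")        # 160x
print(f"cache fetch (per 32b) vs. op:    {cache_256b_pj / 8 / op_32b_pj:.1f}x")  # ~1.6x
```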
Memory and Inter-Chip Communication Advances
64 pJ/B: 16 GB at 900 GB/s @ 60 W
10 pJ/B: 256 MB at 6 TB/s @ 60 W
1 pJ/B: 1000 x 256 KB at 60 TB/s @ 60 W
Memory power density is ~25% of logic power density [Graphcore 2018]
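A quick arithmetic check of the figures above (my calculation, not from the slide): energy per byte times bandwidth gives the quoted ~60 W envelope in each case.

```python
# Energy/byte x bandwidth = power, for each memory tier quoted above.
for pj_per_byte, bw_gbps in [(64, 900), (10, 6000), (1, 60000)]:
    watts = pj_per_byte * 1e-12 * bw_gbps * 1e9
    print(f"{pj_per_byte:>3} pJ/B at {bw_gbps} GB/s -> {watts:.0f} W")
```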
In-Memory Compute
In-memory compute with few bits per value and low-cost ops [Ando et al., JSSC 04/2018]
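As an example of the low-bit arithmetic such designs map onto memory arrays, a binary (XNOR + popcount) dot-product sketch (illustrative only; not a model of the cited chip):

```python
import numpy as np

def xnor_popcount_dot(x_bits, w_bits):
    """Dot product of +/-1 vectors encoded as bits (1 -> +1, 0 -> -1),
    computed as XNOR followed by popcount: the kind of cheap primitive
    that in-memory and near-memory designs evaluate along bitlines."""
    n = x_bits.size
    matches = np.count_nonzero(~np.logical_xor(x_bits, w_bits))  # XNOR then popcount
    return 2 * matches - n                                       # back to +/-1 arithmetic

x = np.random.randint(0, 2, 256).astype(bool)
w = np.random.randint(0, 2, 256).astype(bool)
ref = np.dot(np.where(x, 1, -1), np.where(w, 1, -1))
assert xnor_popcount_dot(x, w) == ref
```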
AI Application-System Co-design
Co-design across the algorithm, architecture, and circuit levels
Optimize DNN hardware accelerators across multiple datasets [Reagen et al., Minerva, ISCA 2016]
Where Next?
Rise of non-von Neumann architectures to efficiently solve domain-specific problems: a new golden age of computer architecture
Advances in semiconductor technology and computer architecture will continue to drive advances in AI
10 TOPS/W is the current state of the art, still several orders of magnitude short of the human brain