1
Artificial Intelligence: Driving the Next Generation of Chips and Systems
Manish Pandey May 13, 2019
2
Revolution in Artificial Intelligence
3
Intelligence Infused from the Edge to the Cloud
4
Fueled by advances in AI/ML
[Charts: increasing accuracy (falling error rate) in image classification; growth in AI/ML papers relative to 1996]
5
AI Revolution Enabled by Hardware
AI driven by advances in hardware [image example: classifier output "Chihuahua"]
6
Deep Learning Advances Gated by Hardware
[Bill Dally, SysML 2018]
7
Deep Learning Advances Gated by Hardware
Results improve with larger models and larger datasets => more computation
8
Data Set and Model Size Scaling
256X data, 64X model size => 32,000X compute [Hestness et al., arXiv]
9
How Much Compute?
Inferencing** (25 million weights): 300 GOPs per HD image; 9.4 TOPS for 30 fps; 12 cameras, 3 nets = 338 TOPS
Training: 30 TOPs x 10^8 (training set) x 10^2 (epochs) ≈ 10^23 ops
**ResNet-50, 12 HD cameras
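As a quick sanity check of this arithmetic, here is a minimal Python sketch using only the numbers on the slide (small rounding differences from the slide's 9.4 and 338 TOPS figures are expected):

```python
# Back-of-the-envelope compute estimate from the slide's ResNet-50-style numbers.
GOPS_PER_HD_IMAGE = 300            # inference cost for one HD image
FPS, CAMERAS, NETS = 30, 12, 3

per_stream_tops = GOPS_PER_HD_IMAGE * FPS / 1000         # ~9 TOPS (slide rounds to 9.4)
total_tops = per_stream_tops * CAMERAS * NETS
print(f"inference: ~{total_tops:.0f} TOPS")               # ~324 TOPS (slide: 338)

# Training: 30 TOPs x 10^8 (training set) x 10^2 (epochs), per the slide
train_ops = 30e12 * 1e8 * 1e2
print(f"training: ~{train_ops:.0e} ops")                   # ~3e23 ops (slide: ~10^23)
```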
10
Challenge: End of Line for CMOS Scaling
Device scaling is slowing down; power density stopped scaling in 2005
11
Dennard Scaling to Dark Silicon
Can we specialize designs for DNNs?
12
Energy Cost for Operations
[Table: energy per operation at 45 nm; instruction fetch/decode overhead ~70 pJ. Horowitz, ISSCC 2014]
13
Y = AB + C: Energy Cost for DNN Ops
Instruction | Memory | Data type | Energy | Energy/Op
1 Op/Instr (*, +) | Memory | fp32 | 89 nJ | 693 pJ
128 Ops/Instr (AX+B) | Memory | fp32 | 72 nJ | 562 pJ
128 Ops/Instr (AX+B) | Cache | fp32 | 0.87 nJ | 6.82 pJ
128 Ops/Instr (AX+B) | Cache | fp16 | 0.47 nJ | 3.68 pJ
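A small sketch of the amortization the table implies, assuming the Energy column is the total for a 128-element AX+B computation (so Energy/Op is just Energy divided by 128; figures are illustrative):

```python
# Per-op energy implied by the table above (illustrative, assuming the
# Energy column is the total for 128 multiply-add operations).
def energy_per_op_pj(total_nj, ops=128):
    return total_nj * 1000.0 / ops           # nJ -> pJ, spread across ops

configs = {
    "fp32, operands from memory (scalar instrs)":   89.0,
    "fp32, operands from memory (128-wide instr)":  72.0,
    "fp32, operands from cache  (128-wide instr)":   0.87,
    "fp16, operands from cache  (128-wide instr)":   0.47,
}
for name, total_nj in configs.items():
    print(f"{name}: {energy_per_op_pj(total_nj):7.2f} pJ/op")
```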
14
Build a processor tuned to application - specialize for DNNs
More effective parallelism for AI operations (not ILP): SIMD vs. MIMD; VLIW vs. speculative, out-of-order
Eliminate unneeded accuracy: IEEE floating point replaced by lower-precision FP; 32-64 bit integers replaced by 8-16 bit integers
More effective use of memory bandwidth: user-controlled memory versus caches
How many teraops per watt?
15
2-D Grid of Processing Elements
Systolic arrays: balance compute and I/O; local communication; concurrency. M1*M2: n×n multiply in 2n cycles
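As a concrete illustration, here is a minimal Python/NumPy simulation of the dataflow in an output-stationary systolic array (a sketch of the general idea on the slide, not a model of any particular chip; this toy variant takes ~3n wavefront steps rather than the slide's 2n):

```python
import numpy as np

# Toy output-stationary systolic array: A streams in from the left,
# B from the top, and PE (i, j) accumulates sum_k A[i, k] * B[k, j].
# With skewed injection, PE (i, j) sees the k-th operand pair at step t = i + j + k.
def systolic_matmul(A, B):
    n = A.shape[0]
    C = np.zeros((n, n))
    for t in range(3 * n - 2):               # wavefront steps
        for i in range(n):
            for j in range(n):
                k = t - i - j
                if 0 <= k < n:
                    C[i, j] += A[i, k] * B[k, j]
    return C

A, B = np.random.rand(4, 4), np.random.rand(4, 4)
assert np.allclose(systolic_matmul(A, B), A @ B)
```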
16
TPUv2: 8 GB + 8 GB HBM; 600 GB/s memory bandwidth; 32-bit float; MXU with fp32 accumulation
Reduced precision mult. [Dean NIPS’17]
17
Datatypes – Reduced Precision
Stochastic rounding: 2.2 rounds to 2 with probability 0.8 and to 3 with probability 0.2. With stochastic rounding, 16-bit accuracy matches 32-bit!
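A minimal sketch of stochastic rounding in Python/NumPy, using the slide's 2.2 example (the function name is just for illustration):

```python
import numpy as np

# Stochastic rounding: round up with probability equal to the fractional
# part, so the result is unbiased in expectation (E[round(x)] = x).
def stochastic_round(x, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    floor = np.floor(x)
    frac = x - floor                          # P(round up)
    return floor + (rng.random(np.shape(x)) < frac)

samples = stochastic_round(np.full(100_000, 2.2))
print(samples.mean())    # ~2.2: rounds to 2 about 80% of the time, to 3 about 20%
```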
18
Integer Weight Representations
19
Extreme Datatypes. Q: What can we do with single-bit or ternary (+a, -b, 0) representations? A: Surprisingly, a lot; just retrain and/or increase the number of activation layers. [Chenzhuo et al., arXiv]
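A rough sketch of ternary weight quantization in Python/NumPy (the threshold and the per-sign scales a and b below are illustrative choices, not the exact procedure from the cited work):

```python
import numpy as np

# Map each weight to one of {+a, 0, -b}: small weights become 0,
# larger ones take a per-sign scale (here simply the mean magnitude).
def ternarize(w, threshold_ratio=0.05):
    t = threshold_ratio * np.abs(w).max()
    pos, neg = w > t, w < -t
    a = w[pos].mean() if pos.any() else 0.0       # scale for +1 weights
    b = -w[neg].mean() if neg.any() else 0.0      # scale for -1 weights
    q = np.zeros_like(w)
    q[pos], q[neg] = a, -b
    return q

w = np.random.randn(8, 8).astype(np.float32)
print(ternarize(w))
```

In practice the network is then retrained with the quantized weights, as the slide notes, which is what recovers the accuracy.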
20
Pruning – How many values?
~90% of values can be thrown away
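A minimal magnitude-pruning sketch in Python/NumPy illustrating the 90% figure (the helper name and the exact threshold rule are illustrative):

```python
import numpy as np

# Keep only the largest-magnitude 10% of values; zero out the rest.
def prune(w, sparsity=0.9):
    k = int(sparsity * w.size)
    threshold = np.sort(np.abs(w).ravel())[k]     # k-th smallest magnitude
    return np.where(np.abs(w) >= threshold, w, 0.0)

w = np.random.randn(16, 16)
pruned = prune(w)
print(f"nonzero fraction: {np.count_nonzero(pruned) / w.size:.2f}")   # ~0.10
```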
21
Quantization – How many distinct values?
16 values => 4-bit representation. Instead of 16 fp32 numbers, store 16 4-bit indices. [Song Han et al., 2015]
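A sketch of the codebook idea in Python/NumPy: cluster the weights into 16 centroids and store a 4-bit index per weight plus the tiny fp32 codebook (plain k-means here; the cited work also fine-tunes the centroids during retraining):

```python
import numpy as np

# Quantize weights to a 16-entry shared codebook (4-bit indices).
def quantize_to_codebook(w, n_codes=16, iters=20):
    flat = w.ravel()
    codebook = np.linspace(flat.min(), flat.max(), n_codes)    # initial centroids
    for _ in range(iters):                                     # simple k-means
        idx = np.abs(flat[:, None] - codebook[None, :]).argmin(axis=1)
        for c in range(n_codes):
            if np.any(idx == c):
                codebook[c] = flat[idx == c].mean()
    idx = np.abs(flat[:, None] - codebook[None, :]).argmin(axis=1)
    return idx.astype(np.uint8).reshape(w.shape), codebook

w = np.random.randn(32, 32).astype(np.float32)
indices, codebook = quantize_to_codebook(w)
w_hat = codebook[indices]                  # dequantized (approximate) weights
print(indices.max() < 16, codebook.shape)  # indices fit in 4 bits; 16 codes
```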
22
Memory Locality: Bring Compute Closer to Memory
DRAM, 32-bit access: 640 pJ
8 KB cache, 256-bit access: 50 pJ
256-bit bus, 20 mm: 256 pJ
256-bit bus: 25 pJ
32-bit op: 4 pJ
Bring compute closer to memory
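As a rough illustration of why this matters, a quick calculation combining the per-access energies above with slide 9's 25-million-weight network (purely illustrative: it assumes every fp32 weight is fetched exactly once per inference, with no reuse):

```python
# Data-movement energy for one full read of 25M fp32 weights,
# from DRAM vs. from an 8 KB cache (per-access energies from the slide).
DRAM_PJ_PER_32BIT   = 640        # pJ per 32-bit DRAM access
CACHE_PJ_PER_256BIT = 50         # pJ per 256-bit (8 x fp32) cache access

weights = 25_000_000
dram_mj  = weights * DRAM_PJ_PER_32BIT * 1e-9            # pJ -> mJ
cache_mj = (weights / 8) * CACHE_PJ_PER_256BIT * 1e-9

print(f"DRAM:  {dram_mj:.1f} mJ per full weight read")    # ~16 mJ
print(f"Cache: {cache_mj:.2f} mJ per full weight read")   # ~0.16 mJ
```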
23
Memory and inter-chip communication advances
64 pJ/B for 16 GB at 60 W; 10 pJ/B for 256 MB at 60 W; 1 pJ/B for 1000 × 256 KB at 60 W. Memory power density is ~25% of logic power density. [Graphcore 2018]
24
In-Memory Compute: in-memory compute with low bits per value and low-cost ops
[Ando et al., JSSC 04/18]
25
AI Application-System Co-design
26
AI Application-System Co-design
Co-design across the algorithm, architecture, and circuit levels; optimize DNN hardware accelerators across multiple datasets [Reagen et al., Minerva, ISCA 2016]
27
Where Next?
Rise of non-von Neumann architectures to efficiently solve domain-specific problems
A new golden age of computer architecture
Advances in semiconductor technology and computer architecture will continue to drive advances in AI
10 TOPS/W is the current state of the art; several orders of magnitude gap with the human brain