1
Pushing TinyML Systems toward Minimum Energy: Perspectives from a Mixed-Signal Designer
Boris Murmann September 26, 2019
2
The TinyML Vision: < 1 milliwatt
P. Warden, “AI and Unreliable Electronics (*batteries not included),” Pete Warden’s Blog, Dec. 2016
3
Workhorse of ML: Deep Convolutional Neural Network
Memory (~100 kB…100 MB) and compute (mostly multiply and add). Typically more than 1 billion arithmetic operations per inference, even for moderate-size networks. 10⁹ operations / 10 ms / 0.1 mW = 1000 TOps/W. Sze, Proc. IEEE, 2017
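A quick sanity check of the slide's arithmetic in Python (illustrative only; all numbers are the slide's):

```python
ops = 1e9          # arithmetic operations per inference
latency = 10e-3    # s, one inference every 10 ms
power = 0.1e-3     # W, 0.1 mW budget

energy_per_op = power * latency / ops   # J per operation
print(energy_per_op)                    # 1e-15 J = 1 fJ per operation
print(1 / energy_per_op / 1e12)         # 1000.0 -> 1000 TOps/W required
```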
4
CNN Processor Landscape
[Chart: CNN processor landscape, energy efficiency vs. accuracy/programmability (source: IMEC). Designs with better accuracy and programmability are too inefficient for TinyML; the TinyML corner is exemplified by Bankman, ISSCC 2018.]
5
Questions for This Evening
What stands in the way of building ~1000 TOps/W TinyML processors without sacrificing classification accuracy and programmability? What sets the relevant efficiency asymptotes? What can we do to overcome these limits? Analog/mixed-signal circuits? In-memory computing?
6
Note on Performance Metrics
Power = Rate × (Energy/Inference) = Rate × (Operations/Inference) × (Energy/Operation) — what we ultimately care about (at a given accuracy). 1 pJ/Operation ⇔ 1 TOps/W. Must minimize Operations/Inference and Energy/Operation (maximize TOps/W). Reporting only TOps/W as a standalone metric can be very misleading: a fabric with high TOps/W may be underutilized (requiring more operations), and an "operation" can have vastly different definitions (1b vs. 16b, multiply vs. add, etc.)
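A minimal sketch of this decomposition with hypothetical numbers (both fabrics and their figures are invented for illustration) shows how a standalone TOps/W figure misleads: an underutilized "5 TOps/W" fabric that inflates the operation count loses to a well-utilized 2 TOps/W one.

```python
def inference_power(rate_hz, ops_per_inference, joules_per_op):
    # Power = Rate x (Operations/Inference) x (Energy/Operation)
    return rate_hz * ops_per_inference * joules_per_op

p_a = inference_power(10, 3e9, 1 / 5e12)  # high TOps/W, but 3x the operations
p_b = inference_power(10, 1e9, 1 / 2e12)  # lower TOps/W, minimal workload
print(p_a, p_b)                           # 0.006 W vs 0.005 W
```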
7
NAND Gate (~28 nm LP CMOS): Energy/Operation ≈ 0.2 fJ ⇒ 5,000 TOps/W
8
Building a Hardwired CNN Using Digital Logic?
Example: one billion connections, with one multiplier and one adder per connection (8b). An 8b multiplier requires 56 full adders; an 8b adder requires 8 full adders. A full adder occupies 15 × 11 wire tracks at a pitch of 8λ. In 28 nm CMOS, one connection therefore occupies >133 μm², and one billion connections occupy >133,000 mm²! Source: Danny Bankman's PhD thesis
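The estimate can be re-derived as below; the one assumption added here is λ ≈ 14 nm (half the nominal feature size) in 28 nm CMOS.

```python
lam = 14e-3                               # um, assumed lambda for 28 nm CMOS
pitch = 8 * lam                           # um, wire-track pitch
fa_area = (15 * pitch) * (11 * pitch)     # um^2, full-adder footprint
fa_per_connection = 56 + 8                # 8b multiplier + 8b adder
conn_area = fa_per_connection * fa_area   # um^2 per connection
print(conn_area)                          # ~132 um^2
print(1e9 * conn_area / 1e6)              # ~132,000 mm^2 for 1e9 connections
```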
9
Typical CNN Accelerator
[Diagram: array of processing elements (multiply/add units with local register files, RF) fed by a memory hierarchy: DRAM → SRAM → RF.] Array of processing elements plus hierarchical memory architecture. Large models require off-chip DRAM; small models (< 1 MB) may not. Optimizing memory access and data movement is key.
10
Illustration of Scale – On-Chip SRAM Cell
11
Illustration of Scale – Pulling SRAM Bit Across Chip
4 mm Eyeriss
12
Illustration of Scale – Pulling SRAM Bit from a Register File
~100 mm Eyeriss
13
Typical Numbers (~28 nm CMOS)
8b multiply: ~200 fJ; 8b add: ~30 fJ; 1 mm wire: ~200 fJ
Memory access (by capacity): ~100 MB → ~10 pJ/b; ~100 kB → ~1 pJ/b; ~1 kB → ~0.1 pJ/b; ~100 B → ~10 fJ/b
14
Data Re-Use in CNNs Sze, Proc. IEEE, 2017
15
DRAM Energy
Energy/Operation = (Weights × Access Energy per Weight) / Operations = (10 million × 80 pJ) / 1 billion = 800 fJ ⇒ 1.25 TOps/W
Reality check: Eyeriss fetches 15.4 MB per 2.6 billion operations (batch size = 4)
Difficult to do better than single-digit TOps/W with external DRAM
TinyML processors for small models may not need DRAM
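The reality check works out as follows, under the simplifying assumption that every fetched byte costs one 80 pJ (8b, i.e., 10 pJ/b) DRAM access:

```python
bytes_fetched = 15.4e6       # Eyeriss DRAM traffic per inference (batch = 4)
e_access = 80e-12            # J per 8b DRAM access, from the slide
ops = 2.6e9                  # operations per inference

e_per_op = bytes_fetched * e_access / ops
print(e_per_op)              # ~4.7e-13 J: ~474 fJ/op from DRAM traffic alone
print(1 / e_per_op / 1e12)   # ~2.1 TOps/W ceiling before any compute energy
```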
16
Per-Layer Energy Distribution Example
Yang et al.
17
Processing Element Energy (~100 B Register File, 8b)
Energy/Operation = (E_RF + E_mult + E_add + E_comm) / 2
Energy/Operation = (3 × 80 fJ + 200 fJ + 30 fJ + 80 fJ) / 2 ≈ 275 fJ ⇒ 3.6 TOps/W
Difficult to do better than single-digit TOps/W with the standard RF-based PE topology
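In Python, using the numbers from the earlier ~28 nm table and reading the 3 × 80 fJ term as three register-file accesses per MAC:

```python
e_rf, e_mult, e_add, e_comm = 80e-15, 200e-15, 30e-15, 80e-15  # J, from table

# One MAC = two arithmetic operations (multiply + add), hence the /2.
e_op = (3 * e_rf + e_mult + e_add + e_comm) / 2
print(e_op)                  # ~2.75e-13 J = ~275 fJ per operation
print(1 / e_op / 1e12)       # ~3.6 TOps/W
```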
18
Can We Do Better Using Analog/Mixed-Signal Tricks?
Analog Goodness Digital Base Fabric
19
Analog Neural Network Processor (1990)
Efficiency limited by current-steering operation (Class-A). B. E. Boser et al., "An Analog Neural Network Processor With Programmable Topology," JSSC, Dec. 1991
20
Re-Evaluating the Potential for Mixed-Signal (2015)
[Chart: energy breakdown for a 16 × 8b MAC.] Opportunity: embrace what CMOS is good at (switches, capacitors) and avoid active circuits. Murmann et al., Asilomar 2015
21
Charge Domain Dot Product Circuit
Externally digital, internally analog compute block. ADC energy amortized over several multipliers. Multiply via charge redistribution; add via passive charge sharing. Small unit caps < 1 fF. Bankman, A-SSCC 2016
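A behavioral sketch of the idea (an idealized model for illustration, not the chip's actual circuit): per-lane capacitor DACs form w_i·x_i by charge redistribution, passive charge sharing across equal lane capacitances averages the results, and one SAR ADC digitizes the shared voltage for all N lanes.

```python
import numpy as np

def sc_dot_product(w, x, bits=8, adc_bits=8):
    w = np.asarray(w) / 2**bits              # 8b weights, normalized to [0, 1)
    x = np.asarray(x) / 2**bits              # 8b activations, normalized
    v_lane = w * x                           # charge-redistribution multiply
    v_shared = v_lane.mean()                 # passive charge sharing = average
    return int(np.round(v_shared * 2**adc_bits))  # one amortized SAR ADC

rng = np.random.default_rng(0)
w = rng.integers(0, 256, 16)
x = rng.integers(0, 256, 16)
print(sc_dot_product(w, x))                  # digitized (1/N)*sum(w*x)
```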
22
Performance Summary (MNIST)
Classification accuracy: floating point 98.4%; SC dot product 98.0%
Power: multiplier array 2.38 μW; SAR ADC 5.36 μW
Energy: 3.2 pJ per 16-element dot product; 104 fJ per arithmetic operation (9.61 TOps/W); 41.3 nJ per classification
Output rate: 2.4 MHz
Core area: 0.113 mm × mm
23
Energy Breakdown For a 16 x 8b dot product
3.2 pJ drawn from digital supplies in measurement (control signals; high activity factor; data-independent). 0.2 pJ drawn from VREF,DAC in post-layout simulation. Dominated by clocks & digital!
24
Processing Energy Limits
Energy per N-element dot product: z = Σ_{i=0..N−1} w_i·x_i
Switched-capacitor: E ≥ 6kT·N·SNR
Digital: E ~ log₂(SNR)
25
Processing Energy of Actual Implementation
Energy per N-element dot product: z = Σ_{i=0..N−1} w_i·x_i
Switched-capacitor: E ≥ 6kT·N·SNR + N·E_sw (capacitor-DAC switching energy per element, set by the bit width B)
Digital: E ~ log₂(SNR)
26
Processing Energy of Actual Implementation
Energy per N-element dot product: z = Σ_{i=0..N−1} w_i·x_i
Switched-capacitor: E ≥ 6kT·N·SNR + N·E_sw + E_ADC
Digital: E ~ log₂(SNR)
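Plugging in numbers for the 16-element, 8b case, assuming SNR ≈ 2^(2B) for a B-bit signal, shows how far the implementation sits above the thermal floor:

```python
k, T = 1.38e-23, 300          # Boltzmann constant, room temperature
N, B = 16, 8                  # dot-product length, bit width
snr = 2 ** (2 * B)            # assumed power SNR of an ideal 8b signal

e_floor = 6 * k * T * N * snr
print(e_floor)                # ~2.6e-14 J: a ~26 fJ floor
# The measured 3.2 pJ is ~100x above this floor; per the energy breakdown,
# clocks, switching overhead, and the ADC dominate.
```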
27
Mixed-Signal BinaryNet
Based on results from Courbariaux et al., NIPS 2016. With weights and activations constrained to +1 and −1, multiplication becomes XNOR (digital multiplication, analog summation). Minimizes D/A and A/D overhead. At the time, a nice option for small/medium-size problems and further mixed-signal exploration.
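The XNOR trick in a few lines of illustrative Python: encoding ±1 as bits {0, 1}, the dot product reduces to a popcount of the XNOR (equality) pattern.

```python
import numpy as np

def xnor_dot(w_bits, x_bits):
    n = len(w_bits)
    matches = int(np.sum(w_bits == x_bits))  # popcount of XNOR(w, x)
    return 2 * matches - n                   # equals the +/-1 dot product

rng = np.random.default_rng(1)
w_bits = rng.integers(0, 2, 32)
x_bits = rng.integers(0, 2, 32)
w_pm, x_pm = 2 * w_bits - 1, 2 * x_bits - 1  # decode back to -1/+1
assert xnor_dot(w_bits, x_bits) == int(np.dot(w_pm, x_pm))
```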
28
Mixed-Signal Binary CNN Processor
Binary CNN with "CMOS-inspired" topology, engineered for minimal circuit-level path loading. Hardware architecture amortizes memory access across many computations, with all memory on chip (328 KB). Energy-efficient switched-capacitor neuron for wide vector summation, replacing the digital adder tree. CIFAR-10. Bankman et al., ISSCC 2018, JSSC 2019
29
CIFAR-10 Sample Images Human accuracy ~94%
30
Original BinaryNet Topology
88.54% accuracy on CIFAR-10; 1.67 MB weight memory (68% in FC layers); 27.9 mJ per classification on an FPGA. Zhao et al., FPGA 2017
31
Mixed-Signal BinaryNet Topology
Sacrificed accuracy for regularity and energy efficiency: 86.05% accuracy on CIFAR-10; 328 KB weight memory; 3.8 μJ per classification
32
Neuron
33
Naïve Sequential Computation
34
Weight-Stationary 256 x2 (north/south) x4 (neuron mux)
35
Weight-Stationary and Data-Parallel
Parallel broadcast
36
Complete Architecture
“some” programmability
37
Switched-Capacitor Neuron Implementation
Weights × inputs; bias & offset calibration.
v_diff / V_DD = (C_u / C_tot) · (Σ_i w_i·x_i + b), with b = (−1)^s · Σ_i m_i
Batch normalization folded into weight signs and bias
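A sketch of the batch-norm folding, assuming the standard form y = sign(γ·(a − μ)/σ + β) with σ > 0; function and variable names here are illustrative.

```python
import numpy as np

def fold_batchnorm(gamma, beta, mu, sigma):
    b = beta * sigma / gamma - mu    # folded bias (comparison threshold)
    s = np.sign(gamma)               # gamma < 0 flips the output sign
    return s, b

# Folded and unfolded neurons agree on random pre-activations a.
rng = np.random.default_rng(0)
a = rng.normal(size=1000)
gamma, beta, mu, sigma = -1.3, 0.4, 0.1, 2.0
s, b = fold_batchnorm(gamma, beta, mu, sigma)
assert np.array_equal(np.sign(gamma * (a - mu) / sigma + beta),
                      s * np.sign(a + b))
```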
38
“Memory-Cell-Like” Processing Element
1 fF metal-oxide-metal fringe capacitor; standard-cell-based; 42 transistors
39
Performance Summary
Technology: 28 nm; algorithm: CNN; dataset: CIFAR-10; (weight, activation) precision: (1, 1) bits
Nominal point (VNEU = 0.6 V, VCOMP = 0.8 V, VDD = 0.8 V, VMEM = 0.8 V): 86.05% classification accuracy; 3.79 μJ per classification; 0.899 mW; 237 FPS; 532 1b-TOps/W
Low-voltage point (VDD = 0.6 V, VMEM = 0.53 V): 85.69% classification accuracy; 2.61 μJ per classification; 0.094 mW; 36 FPS; 772 1b-TOps/W
40
Comparison to Synthesized Digital
Mixed-signal (Bankman et al., ISSCC 2018) vs. synthesized digital BinarEye (Moons et al., CICC 2018)
41
Digital vs. Mixed-Signal Binary CNN Processor
[Chart: CIFAR-10 accuracy (%) vs. energy for synthesized digital (Moons et al., CICC 2018), hand-designed digital (projected), and mixed-signal (Bankman et al., ISSCC 2018).]
42
Limitations of Mixed-Signal BinaryNet
Limited programmability (CIFAR-10, keyword spotting). Relatively limited accuracy (86% on CIFAR-10) due to 1b arithmetic. Large chip area. Energy advantage over customized digital is not revolutionary: same SRAM, same baseline data-movement energy. A denser array and even less data movement are needed to unleash larger gains → in-memory computing
43
The Case for In-Memory Computing
[Diagram: in a conventional read, the bitline is loaded by many cells but accesses only one per cycle; in-memory computing keeps the same loading but accesses many cells at once.]
44
Logical Progression: SRAM
"Memory-cell-like" PE (28 nm): 0.93 fJ per 1b-MAC; 24107 F²; single-bit. In-SRAM computing [Valavi, VLSI 2018] (65 nm): 2.3 fJ per 1b-MAC; 290 F²; single-bit.
45
Multi-Bit Extension: serialized input data; one column per weight bit; binary-weighted column summation after the ADC. Jia et al., arXiv 2018
46
Complete Processor ~150…300 TOps/W (excluding DRAM)
Jia et al., arXiv 2018
47
Embracing Emerging Memory Technology: RRAM
SRAM: 2.3 fJ per 1b-MAC in 65 nm; 290 F²; volatile, leaky. RRAM: energy TBD; approaching 12 F²; non-volatile; multiple bits per cell (?)
48
RRAM Density – BinaryNet Example
256 columns × 1024 rows; 25 F² cell at F = 90 nm → side length s = 0.45 μm
2 × (1024 × 0.45 μm) × (256 × 0.45 μm) ≈ 0.11 mm² per layer (the 2× accounts for the pos/neg weight-cell pair)
0.84 mm² for the complete 8-layer network
Can hold all weights in the compute array for TinyML
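The same footprint arithmetic in Python (all numbers from the slide; the 2× is the pos/neg weight-cell pair):

```python
f = 0.09                         # um, RRAM feature size F = 90 nm
s = 5 * f                        # um, side of a 25 F^2 cell (5F x 5F)
rows, cols, layers = 1024, 256, 8

area_layer = 2 * (rows * s) * (cols * s) / 1e6   # mm^2, 2x for differential
print(area_layer)                # ~0.106 mm^2 per layer
print(layers * area_layer)       # ~0.85 mm^2 for the 8-layer network
```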
49
Matrix-Vector Multiplication with Resistive Cells
Bitlines sum currents that are proportional to cell conductance. Typically two cells are used to achieve pos/neg weights (other schemes possible). Important: efficient D/A and A/D interfaces. Analog transimpedance amplifiers in the column readout will likely be too inefficient. Tsai, 2018
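An idealized numerical sketch of the crossbar operation (a generic model, not any specific chip): conductances encode weight magnitudes, signed weights use a cell pair, and each bitline current is the conductance-weighted sum of the input voltages.

```python
import numpy as np

def crossbar_mvm(w, v, g_max=1e-6):
    """Bitline currents for weights w (rows x cols, in [-1, 1]) and inputs v."""
    w = np.asarray(w, dtype=float)
    g_pos = np.where(w > 0, w, 0.0) * g_max   # S, positive-weight cells
    g_neg = np.where(w < 0, -w, 0.0) * g_max  # S, negative-weight cells
    return g_pos.T @ v - g_neg.T @ v          # per-column current sums (KCL)

v = np.array([0.1, 0.2, 0.0, 0.1])            # word-line input voltages, V
w = np.array([[0.5, -1.0], [-0.25, 1.0], [1.0, 0.0], [0.0, 0.5]])
print(crossbar_mvm(w, v))                     # proportional to w.T @ v
```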
50
Dynamic Voltage-Mode Readout
Precharge-based readout. The operation is weakly nonlinear, but we can train for that… Time-multiplexed ADC (~1 pJ). PWM input; clipped ReLU.
51
Processing Pipeline: small (~kB) input buffers feed a large (~MB) compute array, followed by small (~kB) output buffering
52
Example: Mapping of ResNet-32 on a 4x10 Array
Dazzi et al., arXiv 2019
53
VGG-7 Experiment (4.8 Million Parameters)
2-bit weights, 2-bit activations. Accuracy on CIFAR-10: 93% with 2b quantization only; 91% with 2b quantization + RRAM/ADC model. Work in progress!
54
Energy Model for Column in Conv3 Layer
Efficiency set by ADC!
55
ADC Energy Chart: ~nJ for a 16b ADC vs. ~pJ for an 8b ADC
B. Murmann, "ADC Performance Survey," [Online].
56
State-of-the-Art Example
S. Yin, X. Sun, S. Yu, and J.-S. Seo, 2019
57
Summary CNN processors are limited by memory access and data movement
Performing dense computations using "memory-like" or in-memory compute fabrics reduces (or eliminates) weight movement. Achieving the ultimate energy efficiency stands at odds with having a high degree of programmability and with running operations at wide bit widths (> 4b). Circuit/architecture/software communities must work together to find the right compromise between fabric efficiency and generality for TinyML