Pushing TinyML Systems toward Minimum Energy: Perspectives from a Mixed-Signal Designer. Boris Murmann, September 26, 2019

The TinyML Vision: sub-milliwatt machine learning. P. Warden, “AI and Unreliable Electronics (*batteries not included),” Pete Warden’s Blog, Dec. 2016

Workhorse of ML: Deep Convolutional Neural Network. Memory (~100 kB…100 MB) and compute (mostly multiply and add). Typically more than 1 billion arithmetic operations per inference, even for moderate-size models: $10^9$ operations / 10 ms / 0.1 mW $\Rightarrow$ 1000 TOps/W. Sze, Proc. IEEE, 2017
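
Spelling out the unit conversion behind that target (a restatement of the slide's arithmetic, nothing more):

```latex
\frac{10^{9}\ \text{ops}}{10\ \text{ms}\times 0.1\ \text{mW}}
= \frac{10^{9}\ \text{ops}}{10^{-6}\ \text{J}}
= 10^{15}\ \text{ops/J}
= 1000\ \text{TOps/W}
```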

CNN Processor Landscape. [Chart: survey of published CNN processors, with the TinyML target region and Bankman, ISSCC 2018 marked; processors with better accuracy and programmability are too inefficient for TinyML.] Source: IMEC, based on: https://nicsefc.ee.tsinghua.edu.cn/projects/neural-network-accelerator/

Questions for This Evening
- What stands in the way of building ~1000 TOps/W TinyML processors without sacrificing classification accuracy and programmability?
- What sets the relevant efficiency asymptotes?
- What can we do to overcome these limits? Analog/mixed-signal circuits? In-memory computing?

Note on Performance Metrics
$\mathrm{Power} = \mathrm{Rate} \times \frac{\mathrm{Energy}}{\mathrm{Inference}} = \mathrm{Rate} \times \frac{\mathrm{Operations}}{\mathrm{Inference}} \times \frac{\mathrm{Energy}}{\mathrm{Operation}}$ (what we ultimately care about, at a given accuracy)
$\frac{1~\mathrm{pJ}}{\mathrm{Operation}} = \frac{1}{1~\mathrm{TOps/W}}$
Must minimize Operations/Inference and Energy/Operation (i.e., maximize TOps/W). Reporting only TOps/W as a standalone metric can be very misleading:
- A fabric with high TOps/W may be underutilized (requiring more operations)
- An "operation" can have vastly different definitions (1b vs. 16b, multiply vs. add, etc.)

NAND Gate (~28 nm LP CMOS): $\frac{\mathrm{Energy}}{\mathrm{Operation}} \approx 0.2~\mathrm{fJ} \Rightarrow 5{,}000~\mathrm{TOps/W}$

Building a Hardwired CNN Using Digital Logic? Example: one billion connections, with one multiplier and one adder per connection (8b).
- An 8b multiplier requires 56 full adders; an 8b adder requires 8 full adders
- A full adder occupies 15 x 11 wire tracks with pitch 8λ
- In 28 nm CMOS, one connection occupies >133 μm²
- One billion connections occupy >133,000 mm²!
Source: Danny Bankman’s PhD thesis
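
As a sanity check, here is the slide's area arithmetic as a short script (a sketch; the value λ = 14 nm, i.e., half the 28 nm feature size, is my assumption, not stated on the slide):

```python
# Back-of-the-envelope area of a fully hardwired 8b CNN in 28 nm CMOS.
lam = 14e-9                                  # lambda ~ half of 28 nm (assumed)
fa_area = (15 * 8 * lam) * (11 * 8 * lam)    # full adder: 15 x 11 tracks, 8-lambda pitch
fa_per_connection = 56 + 8                   # 8b multiplier (56 FAs) + 8b adder (8 FAs)
conn_area = fa_per_connection * fa_area      # area of one hardwired connection
total = 1e9 * conn_area                      # one billion connections

print(f"one connection: {conn_area * 1e12:.0f} um^2")   # ~130 um^2 (slide: >133 um^2)
print(f"one billion:    {total * 1e6:.0f} mm^2")        # ~132,000 mm^2 (slide: >133,000 mm^2)
```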

Typical CNN Accelerator. [Figure: 2D array of processing elements (multiply, add, register file) fed by a memory hierarchy: off-chip DRAM → on-chip SRAM → per-PE register files.]
- Array of processing elements plus hierarchical memory architecture
- Large models require off-chip DRAM; small models (< 1 MB) may not
- Optimizing memory access and data movement is key

Illustration of Scale – On-Chip SRAM Cell

Illustration of Scale – Pulling an SRAM Bit Across the Chip. [Eyeriss die photo; distance ~4 mm.]

Illustration of Scale – Pulling an SRAM Bit from a Register File. [Eyeriss die photo; distance ~100 μm.]

Typical Numbers (~28 nm CMOS)
- 8b multiply: ~200 fJ
- 8b add: ~30 fJ
- 1 mm of wire: ~200 fJ
Memory access energy vs. capacity:
- ~100 MB: ~10 pJ/b
- ~100 kB: ~1 pJ/b
- ~1 kB: ~0.1 pJ/b
- ~100 B: ~10 fJ/b

Data Re-Use in CNNs Sze, Proc. IEEE, 2017

DRAM Energy
$\frac{\mathrm{Energy}}{\mathrm{Operation}} = \frac{\mathrm{Weights} \times \mathrm{Access~Energy~per~Weight}}{\mathrm{Operations}} = \frac{10~\mathrm{million} \times 80~\mathrm{pJ}}{1~\mathrm{billion}} = 800~\mathrm{fJ} \Rightarrow 1.25~\mathrm{TOps/W}$
Reality check: Eyeriss fetches 15.4 MB per 2.6 billion operations (batch size = 4). Difficult to do better than single-digit TOps/W with external DRAM; TinyML processors for small models may not need DRAM.
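
A quick script reproduces the slide's estimate and applies the same logic to the Eyeriss reality check (the ~10 pJ/b figure comes from the "Typical Numbers" slide; applying it to Eyeriss is my extrapolation, not the talk's):

```python
# DRAM-limited efficiency estimate from the slide.
e_op = 10e6 * 80e-12 / 1e9        # 10M weights x 80 pJ per access / 1B ops
print(f"{e_op*1e15:.0f} fJ/op -> {1/e_op/1e12:.2f} TOps/W")   # 800 fJ -> 1.25 TOps/W

# Reality check: Eyeriss fetches 15.4 MB of DRAM data per 2.6B operations.
e_op_eyeriss = 15.4e6 * 8 * 10e-12 / 2.6e9    # bytes -> bits, ~10 pJ/b (assumed)
print(f"{e_op_eyeriss*1e15:.0f} fJ/op -> {1/e_op_eyeriss/1e12:.1f} TOps/W ceiling")
```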

Per-Layer Energy Distribution Example Yang et al., https://arxiv.org/abs/1809.04070

Processing Element Energy (~100 B Register File, 8b)
$\frac{\mathrm{Energy}}{\mathrm{Operation}} = \frac{E_{RF} + E_{mult} + E_{add} + E_{comm}}{2} = \frac{3 \times 80~\mathrm{fJ} + 200~\mathrm{fJ} + 30~\mathrm{fJ} + 80~\mathrm{fJ}}{2} \approx 275~\mathrm{fJ} \Rightarrow 3.6~\mathrm{TOps/W}$
(Three register-file accesses per MAC at 8 b x ~10 fJ/b ≈ 80 fJ each; the division by two counts each multiply-accumulate as two operations.)
Difficult to do better than single-digit TOps/W with the standard RF-based PE topology.

Can We Do Better Using Analog/Mixed-Signal Tricks? [Figure: “analog goodness” layered on top of a digital base fabric.]

Analog Neural Network Processor (1990) Efficiency limited by current steering operation (Class-A) B. E. Boser et al., “An Analog Neural Network Processor With Programmable Topology,” JSSC, Dec. 1991

Re-Evaluating the Potential for Mixed-Signal (2015). [Chart: energy breakdown for a 16 x 8b MAC.] Opportunity: embrace what CMOS is good at (switches, capacitors); avoid active circuits. Murmann et al., Asilomar 2015

Charge-Domain Dot Product Circuit
- Externally digital, internally analog compute block
- ADC energy amortized over several multipliers
- Multiply via charge redistribution; add via passive charge sharing
- Small unit caps < 1 fF
Bankman, A-SSCC 2016
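
A behavioral sketch of the idea (my illustrative model, not the actual circuit): each product scales the charge on a unit capacitor, passive charge sharing averages the element voltages, and a single amortized ADC digitizes the result.

```python
import numpy as np

def sc_dot_product(w, x, adc_bits=8, c_u=1e-15):
    """Behavioral model of a charge-domain dot product (illustrative only).
    w, x: arrays scaled to [-1, 1]; c_u: unit capacitance."""
    q = w * x * c_u                      # charge redistribution: per-element products
    v = np.sum(q) / (len(w) * c_u)       # passive charge sharing ~ averaging
    lsb = 2.0 / 2**adc_bits              # ADC quantization over [-1, 1]
    return np.round(v / lsb) * lsb       # single A/D conversion for the whole vector

w = np.random.uniform(-1, 1, 16)
x = np.random.uniform(-1, 1, 16)
print(sc_dot_product(w, x) * 16, np.dot(w, x))   # agree up to ADC quantization
```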

Performance Summary
- MNIST classification accuracy: floating point 98.4%; SC dot product 98.0%
- Power: multiplier array 2.38 μW; SAR ADC 5.36 μW
- Energy: 3.2 pJ per 16-element dot product; 104 fJ per arithmetic operation (9.61 TOps/W); 41.3 nJ per classification
- Output rate: 2.4 MHz
- Core area: 0.113 mm x 0.102 mm

Energy Breakdown. For a 16 x 8b dot product:
- 3.2 pJ drawn from digital supplies in measurement (control signals; high activity factor; data-independent)
- 0.2 pJ drawn from VREF,DAC in post-layout simulation
Dominated by clocks & digital!

Processing Energy Limits. Energy per N-element dot product $z = \sum_{i=0}^{N-1} w_i x_i$:
- Switched-capacitor: $E \ge 6kT \cdot N \cdot \mathrm{SNR}$
- Digital: $E \sim \log_2 \mathrm{SNR}$

Processing Energy of Actual Implementation. Energy per N-element dot product $z = \sum_{i=0}^{N-1} w_i x_i$:
- Switched-capacitor: $E \ge 6kT \cdot N \cdot \mathrm{SNR} + N\left(11(B-1)+14\right)E_{sw} + E_{ADC}$ (thermal-noise bound, plus switching energy of the control logic, plus amortized ADC energy)
- Digital: $E \sim \log_2 \mathrm{SNR}$
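
Plugging rough numbers into the switched-capacitor expression suggests the kT term is small and the overhead terms dominate, consistent with the earlier "dominated by clocks & digital" breakdown (a sketch; the B-bit-to-SNR mapping and the ~1 fJ value for E_sw are my assumptions, not the talk's):

```python
k, T = 1.38e-23, 300
N, B = 16, 8
snr = 10**((6.02*B + 1.76)/10)       # ideal B-bit SNR as a linear power ratio (assumed)
e_bound = 6*k*T*N*snr                # thermal-noise bound term
e_sw = 1e-15                         # energy per switching event, assumed ~1 fJ
e_ctrl = N*(11*(B-1) + 14)*e_sw      # control/switching overhead term
print(f"kT bound: {e_bound*1e15:.0f} fJ, switching term: {e_ctrl*1e15:.0f} fJ")
# -> roughly 39 fJ vs. 1456 fJ: overhead, not noise, sets the energy
```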

Mixed-Signal BinaryNet. Based on results from Courbariaux et al., NIPS 2016.
- Weights and activations constrained to +1 and -1, so multiplication becomes XNOR (digital multiplication, analog summation)
- Minimizes D/A and A/D overhead
- At the time, a nice option for small/medium-size problems and further mixed-signal exploration
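
The ±1 arithmetic maps onto bits via a standard identity, sketched here: encode +1 as bit 1 and -1 as bit 0; each product is then an XNOR, and the signed sum is 2·popcount − N.

```python
def binary_dot(w_bits, x_bits, n):
    """Dot product of {-1,+1} vectors encoded as n-bit integers (+1 -> bit 1)."""
    xnor = ~(w_bits ^ x_bits) & ((1 << n) - 1)   # per-element XNOR of the encodings
    return 2 * bin(xnor).count("1") - n          # sum of the +/-1 products

# Example: w = [+1,-1,+1,-1], x = [+1,+1,-1,-1] -> products [+1,-1,-1,+1], sum 0
print(binary_dot(0b1010, 0b1100, 4))             # 0
```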

Mixed-Signal Binary CNN Processor (CIFAR-10)
- Binary CNN with “CMOS-inspired” topology, engineered for minimal circuit-level path loading
- Hardware architecture amortizes memory access across many computations, with all memory on chip (328 kB)
- Energy-efficient switched-capacitor neuron for wide vector summation, replacing the digital adder tree
Bankman et al., ISSCC 2018, JSSC 2019

CIFAR-10 Sample Images Human accuracy ~94%

Original BinaryNet Topology (Zhao et al., FPGA 2017)
- 88.54% accuracy on CIFAR-10
- 1.67 MB weight memory (68% in FC layers)
- 27.9 mJ per classification on an FPGA

Mixed-Signal BinaryNet Topology. Sacrificed accuracy for regularity and energy efficiency:
- 86.05% accuracy on CIFAR-10
- 328 kB weight memory
- 3.8 μJ per classification

Neuron

Naïve Sequential Computation

Weight-Stationary. [Figure: 256 weight-stationary columns; x2 (north/south), x4 (neuron mux).]

Weight-Stationary and Data-Parallel. [Figure: parallel broadcast of inputs across the array.]

Complete Architecture. [Block diagram; “some” programmability.]

Switched-Capacitor Neuron Implementation. [Circuit: weight-times-input capacitors plus bias & offset-calibration capacitors.]
$\frac{v_{\mathrm{diff}}}{V_{DD}} = \frac{C_u}{C_{tot}} \left( \sum_{i=0}^{1023} w_i x_i + b \right), \qquad b = (-1)^s \sum_{i=0}^{7} 2^i m_i$
Batch normalization is folded into the weight signs and bias.
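
A numeric sketch of that transfer function (illustrative only; the composition of C_tot and the cap values are my placeholders, not the chip's):

```python
import numpy as np

def sc_neuron(w, x, s, m, v_dd=0.6, c_u=1e-15, c_par=0.0):
    """Behavioral model of the switched-capacitor neuron's differential output.
    w, x: {-1,+1} arrays of length 1024; s: bias sign bit; m: 8 bias magnitude bits."""
    b = (-1)**s * sum(2**i * m[i] for i in range(8))   # signed 8b bias/offset term
    # Assumed: 1024 signal caps + 255 C_u of binary-weighted bias caps share charge.
    c_tot = 1024 * c_u + 255 * c_u + c_par
    return v_dd * (c_u / c_tot) * (np.dot(w, x) + b)

w = np.random.choice([-1, 1], 1024)
x = np.random.choice([-1, 1], 1024)
print(sc_neuron(w, x, s=0, m=[1, 0, 0, 0, 0, 0, 0, 0]))   # sign of v_diff gives the binary activation
```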

“Memory-Cell-Like” Processing Element
- 1 fF metal-oxide-metal fringe capacitor
- Standard-cell-based, 42 transistors

Performance Summary
Technology: 28 nm. Algorithm: CNN. Dataset: CIFAR-10. (Weight, activation) precision: (1, 1) bits.

                                  Nominal supplies                Scaled supplies
Supply [V]                        VNEU 0.6, VCOMP 0.8,            VDD 0.6, VMEM 0.53
                                  VDD 0.8, VMEM 0.8
Classification accuracy [%]       86.05                           85.69
Energy per classification [μJ]    3.79                            2.61
Power [mW]                        0.899                           0.094
Frame rate [FPS]                  237                             36
Arithmetic energy efficiency      532 1b-TOps/W                   772 1b-TOps/W

Comparison to Synthesized Digital. [Chart: Mixed-Signal vs. BinarEye (Moons et al., CICC 2018).]

Digital vs. Mixed-Signal Binary CNN Processor. [Bar chart: energy @ 86.05% CIFAR-10 for synthesized digital (Moons et al., CICC 2018), hand-designed digital (projected), and mixed-signal (Bankman et al., ISSCC 2018).]

Limitations of Mixed-Signal BinaryNet
- Limited programmability (CIFAR-10, keyword spotting)
- Relatively limited accuracy (86% on CIFAR-10) due to 1b arithmetic
- Large chip area
- Energy advantage over customized digital is not revolutionary: same SRAM, same baseline data-movement energy
We need a denser array and even less data movement to unleash larger gains: in-memory computing.

The Case for In-Memory Computing. [Figure: in a conventional memory read, the bitline is loaded by many cells but only one is accessed per cycle; in-memory computing pays the same loading but accesses many cells at once.]

Logical Progression: SRAM
- “Memory-like” PE: 0.93 fJ per 1b-MAC in 28 nm, 24,107 F² cell, single-bit
- SRAM in-memory array [Valavi, VLSI 2018]: 2.3 fJ per 1b-MAC in 65 nm, 290 F² cell, single-bit

Multi-Bit Extension
- Serialized input data
- One column per weight bit
- Binary-weighted column summation after the ADC
Jia et al., arXiv 2018, https://arxiv.org/abs/1811.04047
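
The decomposition behind this scheme, sketched in Python (my illustration of the general bit-slicing identity, not the paper's exact dataflow): an unsigned multi-bit dot product splits into binary dot products that are recombined with powers of two.

```python
import numpy as np

def multibit_dot(x, w, xb=4, wb=4):
    """Multi-bit dot product built from binary (1b x 1b) dot products.
    x, w: unsigned integer arrays with xb / wb bits per element."""
    total = 0
    for i in range(xb):                      # serialized input bits (time)
        for j in range(wb):                  # one column per weight bit (space)
            x_i = (x >> i) & 1               # i-th bit-plane of the activations
            w_j = (w >> j) & 1               # j-th bit-plane of the weights
            total += (1 << (i + j)) * int(np.dot(x_i, w_j))  # binary-weighted sum
    return total

x = np.random.randint(0, 16, 64)
w = np.random.randint(0, 16, 64)
assert multibit_dot(x, w) == int(np.dot(x, w))   # matches the full-precision result
```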

Complete Processor ~150…300 TOps/W (excluding DRAM) Jia et al., arXiv 2018, https://arxiv.org/abs/1811.04047

Embracing Emerging Memory Technology: RRAM
- SRAM: 2.3 fJ per 1b-MAC in 65 nm, 290 F² cell; volatile, leaky
- RRAM: energy TBD; approaching 12 F²; non-volatile; multiple bits per cell (?)

RRAM Density – BinaryNet Example
- Array: 1024 rows x 256 columns; 25 F² cell @ F = 90 nm, so side length s = 0.45 μm
- 2 x (1024 x 0.45 μm) x (256 x 0.45 μm) = 0.106 mm² per layer (x2 for the two cells per signed weight)
- 0.84 mm² for the complete 8-layer network
→ Can hold all weights in the compute array for TinyML
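
The same arithmetic as a short script (a direct restatement of the slide's numbers):

```python
F = 90e-9                        # feature size
s = (25 ** 0.5) * F              # 25 F^2 cell -> side length 0.45 um
rows, cols, layers = 1024, 256, 8
area_layer = 2 * (rows * s) * (cols * s)        # x2: two cells per signed weight
print(f"side:      {s*1e6:.2f} um")             # 0.45 um
print(f"per layer: {area_layer*1e6:.3f} mm^2")  # ~0.106 mm^2
print(f"network:   {layers*area_layer*1e6:.2f} mm^2")   # ~0.85 mm^2
```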

Matrix-Vector Multiplication with Resistive Cells
- Bitlines sum currents that are proportional to cell conductance
- Typically two cells per weight to achieve pos/neg weights (other schemes possible)
- Important: efficient D/A and A/D interfaces; analog transimpedance amplifiers in the column readout will likely be too inefficient
Tsai, 2018
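
An idealized model of this operation (my sketch; real arrays add wire resistance, cell nonlinearity, and read noise): each column current is a conductance-weighted sum of the row voltages, with differential cell pairs encoding signed weights.

```python
import numpy as np

def rram_mvm(v_in, w, g_unit=1e-6):
    """Ideal resistive MVM: column currents I_j = sum_i G_ij * V_i (Ohm's law + KCL).
    Signed weights use a positive and a negative conductance cell per weight."""
    g_pos = np.maximum(w, 0) * g_unit        # cells holding positive weight parts
    g_neg = np.maximum(-w, 0) * g_unit       # cells holding negative weight parts
    i_out = v_in @ g_pos - v_in @ g_neg      # differential column currents
    return i_out / g_unit                    # normalize back to weight units

w = np.random.randint(-2, 3, size=(1024, 256)).astype(float)   # rows x columns
v = np.random.rand(1024)
print(np.allclose(rram_mvm(v, w), v @ w))    # True: the ideal model matches the math
```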

Dynamic Voltage-Mode Readout
- Precharge-based operation; weakly nonlinear, but we can train for that…
- PWM input; clipped ReLU
- Time-multiplexed ADC (~1 pJ)

Processing Pipeline. [Figure: small (~kB) memories surrounding a large (~MB) in-memory compute array.]

Example: Mapping of ResNet-32 on a 4x10 Array Dazzi et al., arXiv 2019, https://arxiv.org/abs/1906.03474

VGG-7 Experiment (4.8 Million Parameters)
- 2-bit weights, 2-bit activations
- Accuracy on CIFAR-10: 93% with 2b quantization only; 91% with 2b quantization + RRAM/ADC model
- Work in progress!

Energy Model for Column in Conv3 Layer. [Energy breakdown chart] → Efficiency is set by the ADC!

ADC Energy Chart: ~mJ for a 16b ADC, ~pJ for an 8b ADC. B. Murmann, “ADC Performance Survey 1997-2019,” [Online]. Available: http://web.stanford.edu/~murmann/adcsurvey.html

State-of-the-Art Example S. Yin, X. Sun, S. Yu, Jae-sun Seo, 2019, https://arxiv.org/abs/1909.07514

Summary
- CNN processors are limited by memory access and data movement
- Performing dense computations using “memory-like” or in-memory compute fabrics reduces (or eliminates) weight movement
- Achieving the ultimate energy efficiency stands at odds with having a high degree of programmability and running operations with wide bit widths (> 4b)
- The circuit, architecture, and software communities must work together to find the right compromise between fabric efficiency and generality for TinyML