Bit-Pragmatic Deep Neural Network Computing

Presentation transcript:

Bit-Pragmatic Deep Neural Network Computing. J. Albericio (1), A. Delmás (2), P. Judd (2), S. Sharify (2), G. O’Leary (2), R. Genov (2), A. Moshovos (2). This work was done while J. Albericio was a postdoc at the UofT.

Context. Goal: accelerate deep learning inference with custom hardware and maximize energy efficiency. Focus on Convolutional Neural Networks, which are dominated by inner products of activations and weights.

We will show: (1) there are lots of ineffectual bits, which lead to ineffectual computation, and (2) an energy-efficient accelerator design that exploits this.

Motivation – ineffectual computation. What do we mean by ineffectual bits? Consider this textbook example of binary multiplication of one weight and one activation: weight 101111 × activation 001010. Long multiplication produces one partial-product row per activation bit: 000000, 101111, 000000, 101111, 000000, 000000.

Motivation – ineffectual computation. Zero bits lead to zero terms, i.e., ineffectual computation: each zero bit in the multiplier 001010 results in a row of zeros being added. We consider this to be ineffectual computation, making these zero bits ineffectual.

Motivation – ineffectual computation. Conversely, only the 1 bits of the activation are effectual and lead to effectual computation.
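
To make the argument concrete, here is a small Python sketch (purely illustrative, not part of the design) that decomposes the multiplication above into its partial-product rows and counts how many of them are all-zero, i.e., ineffectual:

# Decompose weight * activation into one partial product per activation bit.
weight, activation = 0b101111, 0b001010

rows = []
for i in range(6):                          # the slide uses a 6-bit activation
    bit = (activation >> i) & 1
    rows.append((weight << i) if bit else 0)

product = sum(rows)
assert product == weight * activation       # 47 * 10 = 470 = 0b111010110

zero_rows = sum(1 for r in rows if r == 0)
print(f"{zero_rows} of {len(rows)} partial products are all-zero (ineffectual)")
print(f"only {len(rows) - zero_rows} rows, one per 1 bit, contribute to the product")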

Current Approaches. The current approaches to dealing with ineffectual computation in neural networks are to exploit reduced precision and sparsity. Consider 4 16-bit activations A0–A3: reduced precision drops bit positions outside the needed range, while sparsity skips activations that are entirely zero. We focus on activations, but the same applies to weights. In a way, these techniques skip ineffectual bits at a coarse granularity. (*) 1s can be ineffectual too.

Goal. Our goal is to go further than reduced precision and sparsity and process only the effectual bits of the activations.

Motivation – Zero bits in activations. Do we really need to go after every bit? Aren't sparsity and reduced precision enough? Even with current approaches, 59% of activation bits are ineffectual.

Process only the effectual bits. How do we compute only the effectual bits? Cycle 1: the first one-bit of activation 001010 is at offset 1, so shift the weight 101111 left by 1 and add it to the running sum: 101111 << 1 = 1011110.

Process only the effectual bits. Cycle 2: the next one-bit is at offset 3, so add 101111 << 3 to the running sum, giving 111010110, the complete product after only two cycles, one per effectual bit.
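
The two cycles above amount to a short loop: encode the activation as the offsets of its one-bits (its "oneffsets") and shift-and-accumulate the weight once per offset. A minimal Python sketch of this serial scheme (behavioural only, not the hardware):

def one_offsets(x):
    """Positions of the 1 bits in x, lowest first."""
    return [i for i in range(x.bit_length()) if (x >> i) & 1]

weight, activation = 0b101111, 0b001010
acc = 0
for cycle, off in enumerate(one_offsets(activation), start=1):
    acc += weight << off                    # one shift-and-add per effectual bit
    print(f"cycle {cycle}: offset {off}, running sum {acc:b}")

assert acc == weight * activation           # 111010110 after just 2 cycles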

Baseline Inner Product Unit. So that's how we do one multiplication; how about computing an inner product? In our baseline accelerator, DaDianNao, each inner product unit multiplies 16-bit activations (A0, A1, ...) by their 16-bit weights (W0, W1, ...) and sums the 32-bit products with an adder tree.

Shift-and-Add Inner Product Unit. Pragmatic replaces each multiplier with a shifter: activation A0 is converted into a stream of one-bit offsets K0, each offset shifts the corresponding weight W0, and the shifted weights are summed by the adder tree (likewise for A1/K1/W1 and the other lanes). Time is proportional to the number of 1 bits.
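
Behaviourally, the shift-and-add inner product unit can be sketched as below: each lane holds its own offset queue and retires at most one offset per cycle, so the unit finishes in as many cycles as its busiest lane has 1 bits. This is a simplified model with made-up inputs, not the actual pipeline:

def one_offsets(x):
    return [i for i in range(x.bit_length()) if (x >> i) & 1]

def shift_add_inner_product(activations, weights):
    """Cycle-level sketch of a Pragmatic-style inner product unit."""
    queues = [one_offsets(a) for a in activations]   # per-lane offset streams
    acc, cycles = 0, 0
    while any(queues):
        for lane, offs in enumerate(queues):
            if offs:                                 # at most one offset per
                acc += weights[lane] << offs.pop(0)  # lane per cycle
        cycles += 1
    return acc, cycles

acts, wts = [0b001010, 0b000110], [0b101111, 0b000011]
result, cycles = shift_add_inner_product(acts, wts)
assert result == sum(a * w for a, w in zip(acts, wts))
print(f"{cycles} cycles: the max one-bit count over the lanes")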

Throughput – Baseline. A "brick" is a group of 16 elements. Each cycle, a tile processes one brick of activations against the weight bricks of 16 filters: 16 units produce 16 outputs per cycle.

Throughput – Pragmatic. A "pallet" is 16 bricks drawn from the same position in 16 different windows. Pragmatic uses 256 units across the 16 filters and produces 256 outputs every N cycles. If all bits are processed, this gives the same throughput as the baseline; skipping ineffectual bits gives a speedup.
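
A back-of-the-envelope throughput model for this slide (a simplification with hypothetical inputs; it ignores run-ahead and the 2-stage shifting effects discussed later): the baseline delivers 256 outputs every 16 cycles, Pragmatic delivers 256 outputs every N cycles, so the speedup is 16/N where N is set by the slowest lane's one-bit count.

def pallet_speedup(pallet, precision=16):
    """Speedup over the bit-parallel baseline for one pallet of activation
    bricks, assuming all columns advance in lock-step."""
    n = max(bin(a).count("1") for brick in pallet for a in brick)
    return precision / max(n, 1)

# A pallet = 16 bricks of 16 activations; here every activation has 2 one-bits.
pallet = [[0b001010, 0b000110] * 8 for _ in range(16)]
print(f"speedup over baseline: {pallet_speedup(pallet):.1f}x")   # -> 8.0x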

Optimization: 2-stage shifting. Complete shifters in front of the adder tree would make it very big. The positions of different one-bits processed together are not very far apart, so we break the shifting into two stages: narrow per-weight shifters before the adder tree and one wide shifter after it. The width of the first stage trades latency against power savings.

2-stage shifting – example. Two activations A0 and A1 are processed with 2-bit first-stage offsets K0 and K1 that shift weights W0 and W1, while the shared second-stage shift advances across cycles 1 through 4. Trade-off: potential slowdown. Each activation has only 3 ones, but the example takes 4 cycles, because one-bits that fall in different second-stage windows cannot be processed in the same cycle.
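
A hedged sketch of why 2-stage shifting can cost cycles: assume the wide second-stage shift is shared by all lanes in a cycle, so only one-bit offsets that fall in the same window of 2^N positions can be retired together, and each lane retires at most one offset per cycle. The window-by-window scheduling below is one plausible policy, not necessarily the exact hardware behaviour:

def two_stage_cycles(activations, offset_bits=2):
    """Cycles to drain all one-bit offsets when the coarse shift is shared:
    windows of width 2**offset_bits are visited low to high, and each window
    costs as many cycles as the busiest lane has offsets inside it."""
    win = 1 << offset_bits
    lanes = [[i for i in range(a.bit_length()) if (a >> i) & 1] for a in activations]
    top = max((a.bit_length() for a in activations), default=0)
    cycles = 0
    for base in range(0, top, win):
        cycles += max(sum(base <= o < base + win for o in lane) for lane in lanes)
    return cycles

# Two activations with 3 one-bits each can still need 4 cycles:
a0 = (1 << 2) | (1 << 6) | (1 << 10)        # one-bits at offsets 2, 6, 10
a1 = (1 << 1) | (1 << 4) | (1 << 6)         # one-bits at offsets 1, 4, 6
print(two_stage_cycles([a0, a1], offset_bits=2))   # -> 4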

Optimization – Run-ahead. Allow columns to start the next brick without waiting for slower columns. Different columns operate on different windows, so run-ahead requires more scheduling to deliver new weight bricks to individual columns.
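
Run-ahead changes where the synchronization happens. Given per-brick cycle counts for each column (hypothetical numbers below), lock-step columns pay the per-brick maximum while run-ahead columns only synchronize at the end; this sketch ignores any limit on how far ahead a column may run and the extra weight-scheduling cost:

def cycles_lockstep(per_column_brick_cycles):
    """All columns wait for the slowest one before starting the next brick."""
    return sum(max(step) for step in zip(*per_column_brick_cycles))

def cycles_runahead(per_column_brick_cycles):
    """Each column moves on to its next brick as soon as it finishes."""
    return max(sum(col) for col in per_column_brick_cycles)

cols = [[4, 1, 2],      # cycles column 0 needs for its bricks 0, 1, 2
        [1, 5, 1],
        [2, 2, 4]]
print(cycles_lockstep(cols))   # 4 + 5 + 4 = 13 cycles
print(cycles_runahead(cols))   # max(7, 7, 8) = 8 cycles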

Optimization – Improved Offset Encoding. Use a Booth-like encoding to remove strings of 1s: an activation A0 with 6 ones is encoded as only 4 signed terms K0 = { +11, -7, -5, +1 }, i.e., 2^11 - 2^7 - 2^5 + 2^0. Negative terms are already supported by the pipeline.
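
One concrete Booth-like recoding that reproduces the numbers on this slide is the canonical signed-digit (non-adjacent form) encoding; the encoder in the actual design may differ in detail. A Python sketch:

def signed_terms(x):
    """Non-adjacent-form recoding: (sign, position) pairs whose signed powers
    of two sum to x. A run of 1s collapses into a +2^hi / -2^lo pair."""
    terms, pos = [], 0
    while x:
        if x & 1:
            digit = 2 - (x & 3)             # +1 if bits end in 01, -1 if in 11
            terms.append((digit, pos))
            x -= digit
        x >>= 1
        pos += 1
    return terms

a0 = 0b0000011101100001                     # 6 one-bits
terms = signed_terms(a0)
print(terms)                                # [(1, 0), (-1, 5), (-1, 7), (1, 11)]
assert sum(s << p for s, p in terms) == a0  # 4 signed terms instead of 6 adds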

Methodology. Performance: in-house cycle-level simulator. Datapath: synthesis with Synopsys Design Compiler (TSMC 65nm), layout with Cadence Encounter, activity with Mentor Graphics ModelSim. Memory models: Destiny for eDRAM, CACTI for SRAM. Networks: out-of-the-box models from the Caffe model zoo.

Results. Outline: 2-stage shifting and scaling the offset width; all optimizations; performance; energy efficiency.

Scaling Offset Width. With an N-bit first-stage offset, each weight can be shifted by up to 2^N - 1 positions before the adder tree. Evaluated with 6 CNNs: reducing the offset width to 2 bits causes negligible slowdown.
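
As a usage note, the two_stage_cycles sketch from the 2-stage shifting example above can be swept over the first-stage offset width N (shift range 0 to 2^N - 1) to see how quickly the extra cycles disappear; the activations are the same illustrative ones as before:

a0 = (1 << 2) | (1 << 6) | (1 << 10)
a1 = (1 << 1) | (1 << 4) | (1 << 6)
for n in (1, 2, 3, 4):
    print(f"N = {n} bits -> {two_stage_cycles([a0, a1], offset_bits=n)} cycles")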

Performance. With run-ahead and improved offset encoding, performance goes up to 4.31x vs. DaDianNao.

Energy Efficiency. The naïve design with 4-bit offsets is less efficient than the baseline. 2-bit offsets with run-ahead improve efficiency, but are still worse than Stripes (STR). Finally, with improved offset encoding we get significantly better energy efficiency: 1.70x vs. DaDianNao.

Remaining ineffectual bits. Only 7.6% of bits are effectual, which corresponds to an ideal speedup of about 13x (1/0.076). The remaining performance is lost to load imbalance within bricks and across columns. There is more potential to exploit.

Conclusion. Pragmatic offers performance that improves with both reduced numerical precision and fewer effectual bits. Three optimizations increase performance and reduce power: 2-stage shifting, run-ahead, and improved offset encoding. The result is a 4.3x speedup and 1.7x better energy efficiency over DaDianNao.

Bit-Pragmatic Deep Neural Network Computing. J. Albericio (1), A. Delmás (2), P. Judd (2), S. Sharify (2), G. O’Leary (2), R. Genov (2), A. Moshovos (2).

Backup slides (figures and tables from the paper): Figure 1; Table 1; Figure 2; Figure 3; Figure 4; Figure 5a) Baseline Tile; Figure 5b) Pragmatic Tile; Figure 6: PIP; Figure 7a) 2-Stage PIP; Figure 7b) 2-Stage Example; Figure 8: Run-ahead Example; Table 2: Precision Profiles; Figure 9: Perf 2-Stage; Table 3: 2-Stage Area/Power; Figure 10: Perf Run-ahead; Table 4: Run-ahead Area/Power; Figure 11: Perf IOE; Figure 12: Efficiency; Table 5: Area/Power Breakdown; Table 6: Tile Configurations; Table 7: 8-bit Designs.