Deep Neural Network with Stochastic Computing


Deep Neural Network with Stochastic Computing Jan. 19, 2016 Kyounghoon Kim and Kiyoung Choi

Precision vs. Efficiency

Conventional binary computing:
- Accurate, with full-precision binary computing
- High cost in area and energy consumption

Compromising precision for efficiency:

Human brain:
- Consumes ~20 W of power
- Does not perform precise computing
- Recognizes objects very well

Approaches (approximate computing):
- Limited-precision binary computing
- Near-threshold computing
- Neural processing
- Stochastic computing
- ...

Stochastic Number

[Figure: spike-train responses of a cortical neuron, shown alongside the stochastic bit streams 110001110001001011 and 111001110000000011]

Stochastic Number

Coin flipping:
- Encoding: head --> 1, tail --> 0
- Toss two coins (X and Y) eight times to obtain
  X = 10010110
  Y = 01010011
- x = P(Xi=1) = 4/8 = 0.5 = value of stochastic number X
- y = P(Yi=1) = 4/8 = 0.5 = value of stochastic number Y
- x*y = 0.5*0.5 = 0.25 = P(Xi=1 and Yi=1), which can be calculated by bitwise AND of X and Y:
  (10010110) & (01010011) = (00010010)
- --> Multiplication can be done with a single AND gate (for independent streams)
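A minimal Python sketch of this unipolar encoding and AND-gate multiplication; the function names and the 1024-bit stream length are illustrative, not from the slides:

```python
import random

random.seed(0)  # reproducible illustration

def to_stream(value, length):
    """Encode a probability in [0, 1] as a random unipolar bit stream."""
    return [1 if random.random() < value else 0 for _ in range(length)]

def stream_value(stream):
    """Decode a unipolar stream: the fraction of 1s estimates the value."""
    return sum(stream) / len(stream)

def sc_multiply(x_stream, y_stream):
    """Multiply two independent unipolar streams with a bitwise AND."""
    return [a & b for a, b in zip(x_stream, y_stream)]

x = to_stream(0.5, 1024)
y = to_stream(0.5, 1024)
print(stream_value(sc_multiply(x, y)))  # ~0.25, up to stochastic error
```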

Stochastic Computing

Computing with stochastic numbers

Stochastic Computing

Logic gates are used for SC (stochastic computing)

Stochastic Computing

Example: trilinear interpolation, used in volume rendering:

q = xyzv1 + xyzv2 + xyzv4 + xyzv7 + xyv0 + xyv3 + xzv0 + xzv5 + xv1 + yzv0 + yzv6 + yv2 + zv4 + v0
  - xyzv0 - xyzv3 - xyzv5 - xyzv6 - xyv1 - xyv2 - xzv1 - xzv4 - xv0 - yzv2 - yzv4 - yv0 - zv0,

where x, y, and z are the fractional coordinates of the current sample point and v0~v7 are the voxel values at the corners of the surrounding cell.

[Figure: unit cube with corner voxels v0-v7 and the sample point (x, y, z) inside]
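This expanded polynomial is exactly the multilinear expansion of the standard factored form of trilinear interpolation. A short Python check; the corner indexing (v0 at (0,0,0), ..., v7 at (1,1,1)) is inferred from the expansion, not stated on the slide:

```python
import random

def trilerp(x, y, z, v):
    """Trilinear interpolation, factored form. Assumed corner indexing:
    v[i] sits at corner (i & 1, (i >> 1) & 1, (i >> 2) & 1)."""
    return sum(
        v[i]
        * (x if i & 1 else 1.0 - x)
        * (y if (i >> 1) & 1 else 1.0 - y)
        * (z if (i >> 2) & 1 else 1.0 - z)
        for i in range(8)
    )

def trilerp_expanded(x, y, z, v):
    """The expanded multilinear polynomial exactly as given on the slide."""
    return (x*y*z*v[1] + x*y*z*v[2] + x*y*z*v[4] + x*y*z*v[7]
            + x*y*v[0] + x*y*v[3] + x*z*v[0] + x*z*v[5] + x*v[1]
            + y*z*v[0] + y*z*v[6] + y*v[2] + z*v[4] + v[0]
            - x*y*z*v[0] - x*y*z*v[3] - x*y*z*v[5] - x*y*z*v[6]
            - x*y*v[1] - x*y*v[2] - x*z*v[1] - x*z*v[4] - x*v[0]
            - y*z*v[2] - y*z*v[4] - y*v[0] - z*v[0])

random.seed(1)
v = [random.random() for _ in range(8)]
x, y, z = 0.3, 0.6, 0.9
assert abs(trilerp(x, y, z, v) - trilerp_expanded(x, y, z, v)) < 1e-9
```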

Stochastic Computing

Example: trilinear interpolation --> huge gain in area, latency, and power

Stochastic Computing

Advantages (at low precision):
- Low cost
- Low latency
- Low power
- Error tolerance: all bits carry uniform weight --> a single bit-flip causes only a small change in the value

Stochastic Computing

Challenges

Addition with a MUX:
- y = (1 - c)a + cb; with C = 0.5, y = 0.5(a + b)
- Scaled --> precision loss
- A random number stream for the select input C must be generated --> area overhead

Stochastic numbers must be independent of each other, or accuracy suffers:
- Independent streams: A = 1,1,0,1,1,1,1,0 (6/8) AND B = 1,0,1,1,0,0,1,0 (4/8) --> Y = 1,0,0,1,0,0,1,0 (3/8), as expected (6/8 x 4/8 = 3/8)
- Correlated streams: A = 1,1,0,1,1,1,1,0 (6/8) AND B = 1,1,0,1,0,0,1,0 (4/8) --> Y = 1,1,0,1,0,0,1,0 (4/8), not the correct 3/8
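A small Python sketch of MUX-based scaled addition, reusing the stream helpers from the earlier sketch; the 0.5 select probability follows the slide, the operand values are illustrative:

```python
import random

random.seed(0)

def to_stream(value, length):
    return [1 if random.random() < value else 0 for _ in range(length)]

def sc_scaled_add(a_stream, b_stream, c_stream):
    """MUX: the output bit comes from A when the select bit is 0 and from B
    when it is 1, so P(Y=1) = (1-c)*a + c*b; with c = 0.5 this is 0.5*(a+b)."""
    return [b if c else a for a, b, c in zip(a_stream, b_stream, c_stream)]

a = to_stream(0.75, 1024)
b = to_stream(0.25, 1024)
c = to_stream(0.5, 1024)   # the extra random select stream is the area overhead
y = sc_scaled_add(a, b, c)
print(sum(y) / len(y))     # ~0.5 = 0.5 * (0.75 + 0.25)
```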

Stochastic Computing

Challenges

Exponential length of bit-stream:
- 3-bit binary --> 8-bit stream
- Can be parallelized --> performance-cost tradeoff

[Figure: A = 1,1,0,1,1,1,1,0 (6/8) AND B = 1,0,1,1,0,0,1,0 (4/8) --> Y = 1,0,0,1,0,0,1,0 (3/8), shown both as one serial 8-bit stream and split into two parallel 4-bit streams]

Difficult to synthesize:
- How do we generate a logic network that implements a given expression?
- Example: a network with inputs A-E and output Y realizing
  y = (1 - ab)cd + ab(d + e - de) = abd + abe + cd - abcd - abde, with P(Y=1) = y
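One way to read the slide's expression as a gate network: y = (1-ab)cd + ab(d+e-de) is a MUX selected by AND(a,b), choosing between AND(c,d) and OR(d,e), since d + e - de is the probability of an OR of independent streams. This gate mapping is inferred from the formula, not stated on the slide; a Python check that the network's Boolean function matches the polynomial on 0/1 inputs:

```python
from itertools import product

def network(a, b, c, d, e):
    """Inferred gate network: sel = a AND b, in0 = c AND d, in1 = d OR e."""
    sel = a & b
    return (d | e) if sel else (c & d)

def polynomial(a, b, c, d, e):
    """y = abd + abe + cd - abcd - abde, as on the slide."""
    return a*b*d + a*b*e + c*d - a*b*c*d - a*b*d*e

# On 0/1 inputs the gate network and the multilinear polynomial agree, so
# for independent input streams P(Y=1) equals the polynomial of the values.
for bits in product((0, 1), repeat=5):
    assert network(*bits) == polynomial(*bits)
```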

Problems in Applying SC to DNN

Multiplication error:
- Many weights are near zero
- The error is bigger near zero

[Figures: <200x100 weights>, <multiplied by zero>, <XNOR gate>]
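A sketch of bipolar encoding and XNOR multiplication illustrating why near-zero weights are a problem; the encoding and the XNOR multiplier are standard SC facts, while the specific values below are illustrative:

```python
import random

random.seed(0)

def to_bipolar_stream(value, length):
    """Bipolar SC: a value v in [-1, 1] is encoded as P(bit=1) = (v + 1) / 2."""
    p = (value + 1) / 2
    return [1 if random.random() < p else 0 for _ in range(length)]

def bipolar_value(stream):
    return 2 * sum(stream) / len(stream) - 1

def xnor_multiply(x_stream, y_stream):
    """XNOR multiplies bipolar values: z = x * y for independent streams."""
    return [1 - (a ^ b) for a, b in zip(x_stream, y_stream)]

# The absolute error of the stream estimate stays roughly constant, so the
# relative error of a near-zero product explodes -- hence near-zero weights
# dominate the multiplication error in an SC DNN.
for w in (0.8, 0.1, 0.01):
    x = to_bipolar_stream(0.5, 1024)
    ws = to_bipolar_stream(w, 1024)
    est = bipolar_value(xnor_multiply(x, ws))
    print(f"w={w:5.2f}  exact={0.5 * w:7.4f}  estimate={est:7.4f}")
```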

Problems in Applying SC to DNN

Accumulation:
- Scaled addition --> low precision
- Saturated addition --> sensitive to input correlation
- Limited range [-1, 1]

Proposed Solutions

Multiplication error:
- Remove near-zero weights and re-train
- Weight scaling

Accumulation:
- Merge the accumulator and the activation function using a counter-based FSM (a sketch follows below)

Limited range [-1, 1]:
- Adjust weights and re-train
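A heavily hedged sketch of the counter-based idea: the slides do not give the exact FSM, so this uses a saturating up/down counter whose upper half emits 1, a common FSM-based tanh-like activation in the SC literature; the state count, thresholding, and XNOR product stage are all illustrative assumptions:

```python
def fsm_accumulate_activate(input_streams, weight_streams, n_states=64):
    """Sketch of merging accumulation and activation: bipolar products
    (XNOR) feed a saturating up/down counter each cycle, and the output
    bit is 1 while the counter sits in the upper half of its range,
    giving a tanh-like squashing without a separate adder."""
    product_streams = [
        [1 - (x ^ w) for x, w in zip(xs, ws)]        # XNOR products
        for xs, ws in zip(input_streams, weight_streams)
    ]
    state = n_states // 2
    out = []
    for bits in zip(*product_streams):               # one tuple per cycle
        # +1 for each product bit of 1, -1 for each 0, saturating at the ends
        state += sum(1 if b else -1 for b in bits)
        state = max(0, min(n_states - 1, state))
        out.append(1 if state >= n_states // 2 else 0)
    return out
```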

Early Decision Termination

Progressive precision:
- The bit-length sets the precision: a longer prefix of the stream gives a more precise value
- No hardware modification required

Early decision termination (see the sketch below):
- Most data are far from the decision boundary
- Energy efficiency, faster decisions
- e.g., deciding after 256 bits of a 1024-bit stream
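A sketch of early decision termination, assuming the decision is checked every 32 bits as on the slides; the confidence-margin stopping rule is an illustrative assumption, not the paper's exact criterion:

```python
def classify_with_edt(class_streams, step=32, max_bits=1024, margin=8):
    """Every `step` bits, compare the running 1-counts of the output class
    streams; once the leader is ahead of the runner-up by `margin` counts,
    stop early. Inputs far from the decision boundary terminate quickly."""
    counts = [0] * len(class_streams)
    bits_used = max_bits
    for t in range(max_bits):
        for c, stream in enumerate(class_streams):
            counts[c] += stream[t]
        if (t + 1) % step == 0:
            ranked = sorted(counts, reverse=True)
            if ranked[0] - ranked[1] >= margin:
                bits_used = t + 1    # confident enough: save the remaining energy
                break
    return counts.index(max(counts)), bits_used  # (predicted class, bits used)
```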

Experimental Environment

Dataset: MNIST handwritten digits
- 60,000 training images, 10,000 test images

Network: fully connected, identical to the previous work [Sanni, 2015]
- 784 x 100 x 200 x 10

Implementation: Verilog HDL, synthesized with Synopsys Design Compiler on TSMC 45nm

[Figure: network diagram, 784-100-200-10]

Accuracy Comparison

- Accuracy of the DNN using SC, compared with a 32-bit floating-point implementation and the previous work [Sanni, CISS, 2015]
- Accuracy with progressive precision over a 1024-bit stream (32 bits/step)

Early Decision Termination

Number of EDT steps (1 step = 32 bits):
- 1024-bit stream --> 32 steps
- 512-bit stream --> 16 steps

Trade-off between normalized energy and error rate

[Plot: normalized energy vs. error rate as the number of EDT steps varies]

Comparison of Synthesis Results

- Area, power, critical-path delay, and energy for one neuron with 200 inputs (512-bit streams)
- Overhead: the stochastic number generator (SNG)
- State-of-the-art SNG: MTJ-based SNG [Rangharajan, DATE, 2015]

[Chart: synthesis comparison; labeled reductions of 80.2% and 53.8%]

ISO-Area Comparison

- Iso-area baseline: 9-bit fixed-point, 3-stage pipeline, 72,104 um2
- Parallelism within that area: SC (120x), SC-SNG (70x), SC-MTJ-SNG (109x)

Previous Work Summary

(Each entry: work | deep neural network? | classification error comparison | contribution)

- B. D. Brown and H. C. Card, Trans. Comput., 2001 | No (soft competitive learning network) | N/A | Basic idea for a neural network using SC; state-machine-based activation function
- N. Nedjah and L. de Macedo Mourelle, Proc. DSD, 2003 | No (normal neural network) | N/A | FPGA implementation
- H. Li, D. Zhang, and S. Y. Foo, Trans. Power Electronics, 2006 | Application (neural network controller for small wind turbine systems) | N/A | -
- D. Zhang and H. Li, Trans. Industrial Electronics, 2008 | Application (controller for an induction motor) | N/A | -
- Y. Ji, F. Ran, C. Ma, and D. J. Lilja, Proc. DATE, 2015 | No (radial basis function) | 2.7% (FP) vs. 55% (SC, 1024 bits) on the Iris flower dataset | Radial basis function neural network using SC
- K. Sanni, et al., Proc. CISS, 2015 | Yes (deep belief network) | 5.8% (FP) vs. 18.2% (SC, 1024 bits) on MNIST | DBN FPGA implementation using SC
- Proposed | Yes (fully-connected network) | 2.23% (FP) vs. 2.41% (SC, 1024 bits) on MNIST | Accuracy enhancement (removing near-zero weights, weight scaling), early decision termination, merging of accumulation and activation

Conclusion

Deep neural network using stochastic computing:
- Removing near-zero weights
- Weight scaling
- Improved FSM-based activation function
- Early decision termination

Experimental results:
- Accuracy close to that of a floating-point implementation
- Reduction of area, power, delay, and energy, depending on the stochastic number generator

Thank you!