RFAcc: A 3D ReRAM Associative Array based Random Forest Accelerator

Presentation transcript:

RFAcc: A 3D ReRAM Associative Array based Random Forest Accelerator
ICS 2019
Lei Zhao+, Quan Deng*, Youtao Zhang+, Jun Yang+
+University of Pittsburgh, *National University of Defense Technology

Outline: Background, Design Details, Optimizations, Experiment, Conclusion

Outline: Background (Random Forest; 3D ReRAM based TCAM), Design Details, Optimizations, Experiment, Conclusion

Random Forest Training
Requires massive relational comparisons.
[Figure: eight training samples S1 to S8 with labels a, d, b, a, a, a, c, b; two candidate splits, F1 ≤ 1 and F1 ≤ 2, each with Yes/No branches and a per-branch table of label counts (a, b, c, d, Total); a per-branch score of the form (sum of squared label counts) x 1/Total; and a table of Gini values for candidate split points of features F1 and F4.]
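
To make concrete why training needs so many relational comparisons, here is a minimal sketch of scoring candidate splits with Gini. It uses the textbook weighted Gini impurity, which may differ in form from the score drawn on the slide, and the function names and sample data are hypothetical.

```python
from collections import Counter

def gini_impurity(labels):
    """Textbook Gini impurity: 1 - sum(p_k^2) over the label classes."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def split_score(values, labels, threshold):
    """Weighted impurity of the two sub-trees produced by `value <= threshold`."""
    # One relational comparison per sample, per candidate threshold.
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    n = len(labels)
    return (len(left) * gini_impurity(left) + len(right) * gini_impurity(right)) / n

def best_threshold(values, labels):
    """Try every distinct value except the largest as a split point."""
    candidates = sorted(set(values))[:-1]
    if not candidates:
        return None  # all values identical: no valid split
    return min(candidates, key=lambda t: split_score(values, labels, t))

# Hypothetical feature values with four samples and two classes.
print(best_threshold([1, 2, 3, 4], ["a", "a", "b", "b"]))
```

Every candidate threshold of every feature triggers a full pass of `<=` comparisons over the samples at a node, which is the work the accelerator moves into memory.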

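3D ReRAM based TCAM
Only supports match comparisons.
[Figure: a 3D ReRAM cell (metal electrode, metal oxide, word line WL, bit line BL) and a TCAM array with search lines SL1/SL2, match lines ML1/ML2, and sense amplifiers SA1/SA2; cells are programmed to RH or RL and searched with 0V and 1/2V; comparing A against B and C reports A != B and A = C.]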
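
As a software analogy for what "match only" means, the sketch below models a TCAM lookup that reports which stored words match a search key. The `X` wildcard is the usual ternary "don't care" state of a TCAM rather than something shown on the slide, and `tcam_search` is a hypothetical name.

```python
def tcam_search(entries, key):
    """Return indices of stored words that match `key`.

    entries: strings over '0', '1', 'X' ('X' = don't care, an assumption here)
    key:     string over '0', '1' with the same length as each entry
    """
    hits = []
    for i, entry in enumerate(entries):
        if len(entry) != len(key):
            raise ValueError("entry and key widths differ")
        # A word matches only if every non-wildcard cell equals the key bit.
        if all(c == "X" or c == k for c, k in zip(entry, key)):
            hits.append(i)
    return hits

print(tcam_search(["1010", "10X0", "0110"], "1000"))
```

The result says only "equal" or "not equal" per word. It carries no information about which operand is larger, which is the gap a relational comparator has to fill.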

Outline: Background, Design Details (3D-VRComp; whole accelerator design), Optimizations, Experiment, Conclusion

3D-VRComp
Compare bit by bit. If the bits are equal, compare the next bit; otherwise, stop comparing.
[Figure: an Original Array and a Complementary Array storing bits A3 A2 A1 A0 in RH/RL cells, driven by search lines SL and SL' (levels 0V, 1/2V, and Z) with match lines ML and ML' feeding a Matcher (MWL, GND, VDD); the input is B3 B2 B1 B0; the ML/ML' pair distinguishes Ai = Bi, Ai > Bi, and Ai < Bi.]
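
The bit-by-bit rule can be sketched in software as a most-significant-bit-first comparison that stops at the first differing bit. This is a functional model only, not the circuit; the function name and the returned step count are illustrative assumptions.

```python
def bit_serial_compare(a, b, width):
    """Compare two unsigned `width`-bit integers MSB first.

    Returns (relation, steps): relation is '>', '<' or '=', and steps is
    the number of bit positions examined before the decision was made.
    """
    for i in range(width - 1, -1, -1):
        a_bit = (a >> i) & 1
        b_bit = (b >> i) & 1
        if a_bit != b_bit:
            # First differing bit decides the result; stop comparing.
            return (">" if a_bit > b_bit else "<"), width - i
    return "=", width

print(bit_serial_compare(0b1010, 0b1001, 4))
```

In the worst case (equal operands) all `width` positions are examined, which is why the number of comparison steps grows with the bit width.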

Accelerator Overview
[Figure: block diagram showing the Host Interface, Control, Task Buffer, Tiles, Accumulator, ALU, and RCUs; an RCU is drawn with an Input Buffer, a 3D-VRComp (Original Array and Complementary Array), a MAC, and an Output Buffer. Arrows are labeled Training data, Labels, Label counts, and Gini.]
Load data into 3D-VRComps and labels into MACs.
Split data into left/right sub-trees.
Count labels in each sub-tree.
Accumulate label counts across RCUs.
Calculate the Gini value based on the current split.
Record into the Task Buffer.
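
The steps above can be mirrored in a small software model in which each RCU holds a slice of the training data, counts labels per sub-tree locally, and an accumulator sums the partial counts before the Gini value is computed. This is only a sketch: the function names, the data layout, and the use of the textbook weighted Gini impurity are assumptions, not the hardware's actual arithmetic.

```python
from collections import Counter

def rcu_split_and_count(values, labels, threshold):
    """One RCU: split its samples by `value <= threshold`, count labels per side."""
    left, right = Counter(), Counter()
    for v, lab in zip(values, labels):
        (left if v <= threshold else right)[lab] += 1
    return left, right

def accumulate(partials):
    """Accumulator: sum the per-RCU label counts for each sub-tree."""
    left_total, right_total = Counter(), Counter()
    for left, right in partials:
        left_total += left
        right_total += right
    return left_total, right_total

def gini_from_counts(counts):
    n = sum(counts.values())
    return 0.0 if n == 0 else 1.0 - sum((c / n) ** 2 for c in counts.values())

def evaluate_split(rcus, threshold):
    """ALU: weighted Gini of the split, from the accumulated counts."""
    left, right = accumulate(rcu_split_and_count(v, l, threshold) for v, l in rcus)
    n_left, n_right = sum(left.values()), sum(right.values())
    total = n_left + n_right
    if total == 0:
        return left, right, 0.0
    gini = (n_left * gini_from_counts(left) + n_right * gini_from_counts(right)) / total
    return left, right, gini

# Two hypothetical RCUs, each holding (feature values, labels) for its samples.
rcus = [([1, 2], ["a", "a"]), ([3, 4], ["b", "b"])]
print(evaluate_split(rcus, 2))
```

Only the small per-label counts cross RCU boundaries in this model; the raw feature values stay where they are stored.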

Outline: Background, Design Details, Optimizations (Bit Encoding; Pipeline; Node Level Parallelism), Experiment, Conclusion

Bit Encoding
The original 3D-VRComp has to compare bit by bit.
It cannot compare 0 with 1 and 1 with 0 at the same time.
Comparing 0 with 1 makes ML drop.
Comparing 1 with 0 makes ML' drop.
[Figure: stored bits A3 A2 A1 A0 and input bits B3 B2 B1 B0 in RH/RL cells, search lines SL and SL' at 0V and 1/2V, and match lines ML and ML' feeding the Matcher, whose outputs distinguish Ai = Bi, Ai > Bi, and Ai < Bi.]

Bit Encoding
Encode n bits to 2^n bits.
Compare n bits in parallel.
Tradeoff between storage and time.
4-bit unary encoding: 00 -> 0000, 01 -> 0001, 10 -> 0011, 11 -> 0111
[Figure: unary-encoded A and B compared in a single step, with search lines SL and SL' at 0V and 1/2V, RH/RL cells, and match lines ML and ML' feeding the Matcher (Ai = Bi, Ai > Bi, Ai < Bi).]
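
A short sketch of why unary (thermometer) codes allow a parallel comparison: between two such codes, every mismatching position points the same way, so the "1 over 0" and "0 over 1" cases never occur together. The encoding follows the 4-bit table on the slide; the function names are hypothetical and the mapping of the two mismatch directions onto ML and ML' is left out as it is not modeled here.

```python
def unary_encode(value, n_bits):
    """Encode an n-bit value as a 2^n-bit unary code (e.g. 2 -> '0011' for n=2)."""
    if not 0 <= value < (1 << n_bits):
        raise ValueError("value does not fit in n_bits")
    width = 1 << n_bits
    return format((1 << value) - 1, f"0{width}b")

def unary_compare(code_a, code_b):
    """Compare two equal-width unary codes by inspecting all positions at once."""
    a_over_b = any(x == "1" and y == "0" for x, y in zip(code_a, code_b))
    b_over_a = any(x == "0" and y == "1" for x, y in zip(code_a, code_b))
    # For valid unary codes both mismatch directions can never appear together.
    assert not (a_over_b and b_over_a)
    if a_over_b:
        return ">"
    if b_over_a:
        return "<"
    return "="

print([unary_encode(v, 2) for v in range(4)])
print(unary_compare(unary_encode(3, 2), unary_encode(1, 2)))
```

The cost is visible in `unary_encode`: storage grows from n to 2^n bits per group, which is the storage-versus-time tradeoff the slide names.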

Pipeline
Stage 1: Split training samples into right and left sub-trees in the RCU
Stage 2: Count labels for each sub-tree in the MAC
Stage 3: Calculate the Gini value in the Acc and ALU
Stage 4: Record into the Task Buffer
[Figure: pipeline timing diagrams with RCU, MAC, Acc+ALU, and Register rows against a Cycle axis, for three cases. No encode: ticks at cycles 31, 32, 35, 36, 39, 40, 43, 63, 64, 67. 16 unary encode: ticks at 7, 8, 11, 12, 15, 16, 19, 20, 23, 24, 27. 32 unary encode: ticks at 3, 4, 7, 8, 11, 12, 15, 16, 19.]
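
A rough cycle model of the four-stage pipeline is sketched below. The stage latencies (32, 8, or 4 cycles for the RCU stage and 4 cycles for each later stage) are read off the figure's cycle ticks and are an assumption, as is the simple rule that a stage starts once both its input and the stage itself are free.

```python
def pipeline_cycles(latencies, n_items):
    """Cycles needed to push `n_items` through a linear pipeline.

    latencies: per-stage latency in cycles, in stage order
    Each stage handles one item at a time; an item enters a stage when the
    previous stage has finished it and the stage has finished the prior item.
    """
    finish = [0] * len(latencies)  # finish time of the latest item per stage
    for _ in range(n_items):
        prev_done = 0
        for s, lat in enumerate(latencies):
            start = max(prev_done, finish[s])
            finish[s] = start + lat
            prev_done = finish[s]
    return finish[-1] if latencies else 0

# Assumed latencies for (RCU, MAC, Acc+ALU, Register).
for name, rcu in [("no encode", 32), ("16 unary", 8), ("32 unary", 4)]:
    print(name, pipeline_cycles([rcu, 4, 4, 4], 2))
```

With these assumed numbers the RCU stage is the bottleneck without encoding, and the stages become balanced once the RCU stage shrinks to 4 cycles.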

Node Level Parallelism
The Task Buffer records the RCUs required by each node.
Nodes can run simultaneously if they do not require the same RCUs.
Need to increase the number of Accumulators and ALUs.
[Figure: Host Interface, Control, and Task Buffer connected to four Tiles, with Accumulator and ALU.]
Task Buffer example (columns: Tree, Node ID, RCUs, …, Done), tree 1:
Node 2: RCUs 0,1,4,6
Node 3: RCUs 2,3
Node 4: RCUs 0,2,5,6,7
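
The rule that nodes may run together only when their RCU sets do not overlap can be sketched as a greedy selection over the Task Buffer entries. The greedy, in-order policy and the name `schedule_round` are assumptions for illustration; the slides do not state how the controller picks nodes.

```python
def schedule_round(pending):
    """Pick nodes that can train simultaneously.

    pending: dict mapping node ID -> iterable of RCU indices it requires,
             visited in insertion order (greedy policy, an assumption)
    Returns the list of node IDs started in this round.
    """
    busy = set()
    started = []
    for node_id, rcus in pending.items():
        rcus = set(rcus)
        if busy.isdisjoint(rcus):  # no RCU shared with an already started node
            started.append(node_id)
            busy |= rcus
    return started

# Node IDs and RCU lists taken from the Task Buffer example on the slide.
task_buffer = {2: [0, 1, 4, 6], 3: [2, 3], 4: [0, 2, 5, 6, 7]}
print(schedule_round(task_buffer))
```

With the slide's example, nodes 2 and 3 use disjoint RCUs and can start together, while node 4 shares RCUs 0 and 6 with node 2 and RCU 2 with node 3, so it waits.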

Experiment
Schemes:
RFAcc: Basic implementation with no optimizations
RFAcc-X: Adopts X-unary encoding to enable parallel bit comparison
RFAcc-P: Enables training of multiple nodes in parallel
RFAcc-X-P: All optimizations activated

Benchmarks

Performance
The GPU achieves less than 10X speedup over the CPU and is even slower than the CPU on small benchmarks.
Our four schemes achieve 482X, 2558X, 1615X, and 8564X speedup over the CPU.

Energy
The GPU consumes 2X more energy than the CPU.
RFAcc and RFAcc-P achieve 10^5 energy saving on average.
Enabling encoding doubles the energy saving.

Encoding
Speedup increases with larger unary encodings.
Energy also increases because more RCUs are used for storage and are activated to handle the increased data size.

Conclusion
3D Vertical ReRAM based relational comparator
Full-fledged Random Forest training accelerator
Optimizations:
Bit encoding to enable parallel bit comparison
Pipeline to improve throughput
Node level parallelism to train multiple tree nodes
Performance and energy improvement due to parallel comparison and PIM

Thank you