RFAcc: A 3D ReRAM Associative Array based Random Forest Accelerator
  ICS 2019
  Lei Zhao+, Quan Deng*, Youtao Zhang+, Jun Yang+
  +University of Pittsburgh, *National University of Defense Technology
Outline
  Background, Design Details, Optimizations, Experiment, Conclusion
Random Forest Training
  Requires massive relational comparisons.
  Example: eight training samples S1 to S8 with labels a, d, b, a, a, a, c, b.
  Candidate splits such as F1 ≤ 1 and F1 ≤ 2 send each sample to a Yes or a No sub-tree. Each sub-tree is scored as the sum of its squared label counts (a, b, c, d) divided by its total:
    Sub-trees of size 1 and 7: (0^2 + 0^2 + 1^2 + 0^2) x 1/1 = 1 and (4^2 + 2^2 + 0^2 + 1^2) x 1/7 = 3
    Sub-trees of size 2 and 6: (0^2 + 1^2 + 1^2 + 0^2) x 1/2 = 1 and (4^2 + 1^2 + 0^2 + 1^2) x 1/6 = 3
  [Figure: table of Gini values per feature (F1, F4) and candidate threshold (1 to 9)]
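The per-sub-tree score on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and adding the two sub-tree scores into a single value for the split is an assumption.

```python
def subtree_score(label_counts):
    """Sum of squared label counts divided by the sub-tree size."""
    total = sum(label_counts)
    if total == 0:
        return 0.0  # empty sub-tree contributes nothing
    return sum(c * c for c in label_counts) / total


def split_score(yes_counts, no_counts):
    """Assumed combination: add the scores of both sub-trees."""
    return subtree_score(yes_counts) + subtree_score(no_counts)


# Label counts in the order (a, b, c, d), taken from the slide's example
# where one sub-tree holds 1 sample and the other holds 7.
print(subtree_score([0, 0, 1, 0]))   # 1.0
print(subtree_score([4, 2, 0, 1]))   # 3.0
print(split_score([0, 0, 1, 0], [4, 2, 0, 1]))  # 4.0
```

Every candidate threshold needs one such evaluation, and each evaluation compares every sample against the threshold, which is where the relational comparisons come from.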
3D ReRAM based TCAM
  Only supports match comparisons.
  [Figure: ReRAM cell (metal electrode, metal oxide, WL, BL) and a TCAM array with search lines SL1/SL2, match lines ML1/ML2, and sense amplifiers SA1/SA2. Cells are in the RH or RL state and lines are driven at 0V or 1/2V. Example searches: A != B, A = C.]
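As a purely functional illustration of what a match comparison means, the sketch below models a TCAM search in software. It says nothing about the ReRAM circuit itself; the function name and the use of the character X for a don't-care cell are assumptions made for the example.

```python
def tcam_search(stored_rows, key):
    """Return one match flag per stored row; 'X' in a row is a don't-care."""
    flags = []
    for row in stored_rows:
        if len(row) != len(key):
            raise ValueError("row and key must have the same width")
        # A row matches only if every cell equals the key bit or is 'X'.
        flags.append(all(s == "X" or s == k for s, k in zip(row, key)))
    return flags


rows = ["1010", "10X0", "0110"]
print(tcam_search(rows, "1000"))  # [False, True, False]
```

The result is only match or mismatch per row. It does not say whether a mismatching row is larger or smaller than the key, which is the limitation this slide points out.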
3D-VRComp
  Compare bit by bit.
  If the bits are equal, compare the next bit.
  Otherwise, stop comparing.
  [Figure: an Original Array and a Complementary Array storing bits A3 to A0, searched with bits B3 to B0 over SL and SL’ (driven at 0V, 1/2V, or Z). A Matcher reads ML and ML’ to decide Ai = Bi, Ai > Bi, or Ai < Bi. Other signals shown: MWL, GND, VDD.]
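The bit-by-bit procedure can be sketched in software as follows. This is a behavioral model only: the function name is hypothetical, and starting from the most significant bit is an assumption suggested by the A3 to A0 ordering in the figure.

```python
def compare_bit_serial(a, b, width):
    """Compare two unsigned values one bit per step, most significant bit first.

    Returns (result, steps): result is 1 if a > b, -1 if a < b, 0 if equal;
    steps is the number of bit positions examined before stopping.
    """
    for i in reversed(range(width)):
        a_bit = (a >> i) & 1
        b_bit = (b >> i) & 1
        if a_bit != b_bit:
            # First differing bit decides the outcome; stop comparing.
            return (1 if a_bit > b_bit else -1), width - i
    return 0, width


print(compare_bit_serial(0b1010, 0b1001, 4))  # (1, 3)
print(compare_bit_serial(0b0101, 0b0101, 4))  # (0, 4)
```

In the worst case the loop visits every bit, so a w-bit comparison costs up to w steps.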
Accelerator Overview
  Blocks shown: Host Interface, Control, Task Buffer, Accumulator, ALU, Tile, RCU, Input Buffer, Output Buffer, MAC, and 3D-VRComp (Original Array and Complementary Array).
  Data shown on the links: Training data, Labels, Label counts, Gini.
  Steps:
    1. Load data into 3D-VRComps and labels into MACs.
    2. Split data into left/right sub-trees.
    3. Count labels in each sub-tree.
    4. Accumulate label counts across RCUs.
    5. Calculate the Gini value based on the current split.
    6. Record into the Task Buffer.
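The steps on this slide can be mirrored by a small software model. It is a sketch under stated assumptions, not the hardware design: the sample data is invented for illustration, each inner list stands in for the samples held by one RCU, the function names are hypothetical, and the Gini value is assumed to be the sum over both sub-trees of squared label counts divided by sub-tree size.

```python
from collections import Counter


def count_labels_in_rcu(chunk, threshold):
    """Steps 2 and 3: split one RCU's samples and count labels per sub-tree."""
    left, right = Counter(), Counter()
    for value, label in chunk:
        if value <= threshold:
            left[label] += 1
        else:
            right[label] += 1
    return left, right


def score(counts):
    """Sum of squared label counts divided by the sub-tree size."""
    total = sum(counts.values())
    return sum(c * c for c in counts.values()) / total if total else 0.0


def evaluate_split(rcu_chunks, threshold):
    """Steps 4 and 5: accumulate counts across RCUs, then compute the value."""
    left_total, right_total = Counter(), Counter()
    for chunk in rcu_chunks:
        left, right = count_labels_in_rcu(chunk, threshold)
        left_total.update(left)
        right_total.update(right)
    return score(left_total) + score(right_total)


# Step 1: hypothetical (feature value, label) pairs spread over two RCUs.
rcu_chunks = [
    [(1, "c"), (3, "a"), (4, "a"), (5, "a")],
    [(6, "a"), (2, "b"), (7, "b"), (8, "d")],
]
# Step 6: record one value per candidate threshold.
task_buffer = {t: evaluate_split(rcu_chunks, t) for t in (1, 2)}
print(task_buffer)  # {1: 4.0, 2: 4.0}
```

In this model the per-RCU counting is independent for every chunk, which is the part the accelerator runs in parallel; only the accumulation and the final calculation are shared.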
Bit Encoding
  The original 3D-VRComp has to compare bit by bit.
  It cannot compare 0 with 1 and 1 with 0 at the same time:
    Comparing 0 with 1 makes ML drop.
    Comparing 1 with 0 makes ML’ drop.
  [Figure: bits A3 to A0 stored as RH/RL cell pairs and searched with bits B3 to B0 over SL and SL’ (0V or 1/2V); the ML Matcher maps the ML and ML’ outcome to Ai = Bi, Ai > Bi, or Ai < Bi.]
Bit Encoding
  Encode n bits to 2^n bits.
  Compare n bits in parallel.
  Tradeoff between storage and time.
  4-bit unary encoding: 00 -> 0000, 01 -> 0001, 10 -> 0011, 11 -> 0111
  [Figure: unary-encoded bits stored as RH/RL cell pairs and searched over SL and SL’ (0V or 1/2V); the ML Matcher maps the ML and ML’ outcome to Ai = Bi, Ai > Bi, or Ai < Bi.]
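The unary code in the table can be reproduced with the sketch below. It assumes the code word is 2^n bits wide with as many trailing ones as the encoded value, which matches the 2-bit to 4-bit example on the slide; the function names are hypothetical. The comparison function illustrates why such codes can be compared in one step: for any two code words, mismatching positions can only go in one direction.

```python
def unary_encode(value, n_bits):
    """Encode an n-bit value as a 2**n-bit string with `value` trailing ones."""
    if not 0 <= value < (1 << n_bits):
        raise ValueError("value does not fit in n_bits")
    width = 1 << n_bits
    return format((1 << value) - 1, "0{}b".format(width))


def compare_unary(a_code, b_code):
    """Compare two unary code words by looking at all positions at once."""
    a, b = int(a_code, 2), int(b_code, 2)
    if a & ~b:      # some position holds 1 in a and 0 in b
        return 1
    if b & ~a:      # some position holds 0 in a and 1 in b
        return -1
    return 0


print([unary_encode(v, 2) for v in range(4)])
# ['0000', '0001', '0011', '0111']
print(compare_unary(unary_encode(2, 2), unary_encode(1, 2)))  # 1
```

The storage cost grows from n to 2^n bits per group, which is the storage-versus-time tradeoff named on the slide.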
Pipeline
  Stage 1: Split training samples into right and left sub-trees in the RCU.
  Stage 2: Count labels for each sub-tree in the MAC.
  Stage 3: Calculate the Gini value in the Acc and ALU.
  Stage 4: Record into the Task Buffer.
  [Figure: pipeline timing diagrams (rows RCU, MAC, Acc+ALU, Register; x-axis Cycle) for no encoding, 16-unary encoding, and 32-unary encoding. The first RCU stage ends at cycle 31, 7, and 3 respectively, and each later stage boundary follows 4 cycles after the previous one.]
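The timing diagrams can be approximated with a simple pipeline model. The stage latencies used here (32, 8, or 4 cycles for the RCU stage and 4 cycles for each remaining stage) are read off the tick marks of the figure and should be treated as assumptions; the function name is hypothetical, and the model assumes each stage handles one task at a time.

```python
def pipeline_finish_times(stage_cycles, n_tasks):
    """Return the cycle at which each task leaves the last pipeline stage."""
    free_at = [0] * len(stage_cycles)   # when each stage becomes idle
    finishes = []
    for _ in range(n_tasks):
        t = 0
        for s, cycles in enumerate(stage_cycles):
            start = max(t, free_at[s])  # wait for the task and for the stage
            t = start + cycles
            free_at[s] = t
        finishes.append(t)
    return finishes


# Stage order: RCU, MAC, Acc+ALU, Register.
print(pipeline_finish_times([32, 4, 4, 4], 2))  # [44, 76]  no encoding
print(pipeline_finish_times([8, 4, 4, 4], 2))   # [20, 28]  16-unary encoding
print(pipeline_finish_times([4, 4, 4, 4], 2))   # [16, 20]  32-unary encoding
```

Under these assumptions the RCU stage sets the interval between results, so shortening it through encoding directly raises throughput until all stages take the same time.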
Node Level Parallelism
  The Task Buffer records the RCUs required by each node.
  Nodes can run simultaneously if they do not require the same RCUs.
  This requires increasing the number of Accumulators and ALUs.
  Task Buffer example (columns: Tree, Node ID, RCUs, …, Done): node 2 uses RCUs 0,1,4,6; node 3 uses RCUs 2,3; node 4 uses RCUs 0,2,5,6,7.
  [Figure: Host Interface, Control, and Task Buffer connected to multiple Tiles, each with its Accumulator and ALU.]
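The rule that nodes may run together only when their RCU sets do not overlap can be sketched as a greedy selection. The in-order greedy policy and the function name are assumptions for illustration; the slide does not state how the controller picks among pending nodes. The RCU sets are the ones listed in the Task Buffer example.

```python
def pick_concurrent_nodes(pending):
    """Pick nodes, in order, whose RCU sets do not overlap with earlier picks."""
    busy = set()
    chosen = []
    for node_id, rcus in pending:
        if busy.isdisjoint(rcus):
            chosen.append(node_id)
            busy.update(rcus)
    return chosen


pending = [
    (2, {0, 1, 4, 6}),
    (3, {2, 3}),
    (4, {0, 2, 5, 6, 7}),
]
print(pick_concurrent_nodes(pending))  # [2, 3]
```

Node 4 has to wait in this example because it shares RCUs 0 and 6 with node 2 and RCU 2 with node 3.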
Experiment
  Schemes:
    RFAcc: Basic implementation with no optimizations
    RFAcc-X: Adopts X-unary encoding to enable parallel bit comparison
    RFAcc-P: Enables training of multiple nodes at once
    RFAcc-X-P: All optimizations activated
Benchmarks
Performance
  GPU has less than 10X speedup over CPU, and is even slower than CPU on small benchmarks.
  Our four schemes achieve 482X, 2558X, 1615X, and 8564X speedup over CPU.
Energy
  GPU consumes 2X more energy than CPU.
  RFAcc and RFAcc-P achieve a 10^5X energy saving on average.
  Enabling encoding doubles the energy saving.
Encoding
  Speedup increases with larger unary encodings.
  Energy also increases, because more RCUs are needed to store the enlarged data and more RCUs are activated.
Conclusion
  3D vertical ReRAM based relational comparator
  Full-fledged Random Forest training accelerator
  Optimizations:
    Bit encoding to enable parallel bit comparison
    Pipeline to improve throughput
    Node level parallelism to train multiple tree nodes
  Performance and energy improvements due to parallel comparison and PIM
Thank you