RFAcc: A 3D ReRAM Associative Array based Random Forest Accelerator

1 RFAcc: A 3D ReRAM Associative Array based Random Forest Accelerator
ICS 2019
Lei Zhao+, Quan Deng*, Youtao Zhang+, Jun Yang+
+University of Pittsburgh, *National University of Defense Technology

2 Outline
Background
Design Details
Optimizations
Experiment
Conclusion

3 Outline
Background: Random Forest, 3D ReRAM based TCAM
Design Details
Optimizations
Experiment
Conclusion

4 Random Forest Training
Requires massive relational comparisons
[Figure: eight training samples S1 to S8 with labels a, d, b, a, a, a, c, b. Candidate splits such as F1 ≤ 1 and F1 ≤ 2 send each sample down a Yes or No branch; the labels in each branch are tallied per class (a, b, c, d, Total), and a Gini value is tabulated for each candidate threshold of features F1 and F4.]
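To make "massive relational comparisons" concrete, here is a minimal Python sketch of exhaustive split search with Gini impurity. The labels follow the eight samples on the slide; the feature values in `f1` and the helper names `gini` and `best_split` are made up for illustration, and the weighted Gini impurity is the standard textbook form, which may be scaled differently from the values in the slide's table.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(values, labels):
    """Try every distinct value as a threshold 'x <= t' and return
    (best_threshold, best_weighted_gini, number_of_comparisons)."""
    n = len(values)
    best_t, best_g, comparisons = None, float("inf"), 0
    for t in sorted(set(values)):
        left, right = [], []
        for x, y in zip(values, labels):
            comparisons += 1  # one relational comparison per sample
            (left if x <= t else right).append(y)
        g = (len(left) * gini(left) + len(right) * gini(right)) / n
        if g < best_g:
            best_t, best_g = t, g
    return best_t, best_g, comparisons

# Labels of S1..S8 as shown on the slide; feature values are hypothetical.
labels = ["a", "d", "b", "a", "a", "a", "c", "b"]
f1 = [1, 4, 3, 1, 2, 2, 4, 3]

t, g, comparisons = best_split(f1, labels)
print(t, round(g, 3), comparisons)
```

Even in this toy case, four candidate thresholds over eight samples cost 32 comparisons for a single feature at a single node; the count grows with the number of samples, features, thresholds and nodes.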

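6 3D ReRAM based TCAM
Only supports match comparisons
[Figure: a 3D ReRAM cell stack (metal electrode, metal oxide, WL, BL) and a TCAM array with search lines SL1 and SL2, match lines ML1 and ML2, and sense amplifiers SA1 and SA2. Cells are programmed to RH or RL and the lines are driven at 0V or 1/2V. In the example, comparing A with B and with C reports A != B and A = C.]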
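The match-only behaviour can be mimicked in software. The following is a rough sketch, not a model of the ReRAM circuit: `tcam_search` is a hypothetical helper, the stored words are invented, and row D with a don't-care bit is added only to show the ternary part of a TCAM.

```python
def tcam_search(stored_words, key):
    """Return one match flag per stored word, as a TCAM match line would.

    Each stored word is a string over '0', '1' and 'X' (don't care).
    A word matches when every non-'X' bit equals the corresponding key bit.
    """
    flags = []
    for word in stored_words:
        if len(word) != len(key):
            raise ValueError("word and key must have the same width")
        flags.append(all(w == "X" or w == k for w, k in zip(word, key)))
    return flags

# Hypothetical contents: the key plays the role of A, the rows of B and C.
key_a = "1011"
rows = {"B": "1001", "C": "1011", "D": "10X1"}

result = dict(zip(rows, tcam_search(rows.values(), key_a)))
print(result)
```

The result only says whether a row matches the key. It carries no information about which of two mismatching values is larger, which is the limitation the slide points out.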

7 Outline
Background
Design Details: 3D-VRComp, Whole accelerator design
Optimizations
Experiment
Conclusion

8 3D-VRComp
Compare bit by bit
If equal, compare next bit
Otherwise, stop comparing
[Figure: operands A3 A2 A1 A0 and B3 B2 B1 B0 in the Original Array (lines SL, ML) and the Complementary Array (lines SL', ML'), with cells at RH or RL and lines driven at 1/2V, 0V or Z. A Matcher (MWL, VDD, GND) combines ML and ML' to distinguish Ai = Bi, Ai > Bi and Ai < Bi.]
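The bit-by-bit rule can be written as a short software model. This is a sketch of the comparison rule only, not of the array circuit; the function name `compare_bit_serial` and the choice to start from the most significant bit (A3, B3) are assumptions.

```python
def compare_bit_serial(a, b, width=4):
    """Compare two unsigned integers from the most significant bit down.

    Returns (relation, bits_examined), where relation is '=', '>' or '<'.
    """
    for step, i in enumerate(range(width - 1, -1, -1), start=1):
        a_i = (a >> i) & 1
        b_i = (b >> i) & 1
        if a_i != b_i:
            # The first differing bit decides the result; stop comparing.
            return (">" if a_i > b_i else "<"), step
    # All bits were equal.
    return "=", width

print(compare_bit_serial(0b1010, 0b1001))
print(compare_bit_serial(0b0110, 0b1000))
print(compare_bit_serial(0b0101, 0b0101))
```

In the worst case (equal operands, or operands that differ only in the last bit) every bit position has to be visited, so the number of steps grows with the operand width.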

10 Accelerator Overview
Load data into 3D-VRComps and labels into MACs
Split data into left/right sub-tree
Count labels in each sub-tree
Accumulate label counts across RCUs
Calculate Gini value based on current split
Record into Task Buffer
[Figure: block diagram with Host Interface, Control, Task Buffer, Tile, Accumulator, ALU, RCU, Input Buffer, Output Buffer, MAC and 3D-VRComp (Original Array, Complementary Array); arrows labelled Training data, Labels, Label counts and Gini.]
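The steps above can be sketched in software as follows. This is only an illustration of the data flow under stated assumptions: the samples are spread over two hypothetical RCUs, all function names are invented, the Gini formula is the standard weighted impurity, and a plain dictionary stands in for the Task Buffer.

```python
from collections import Counter

def rcu_split_and_count(values, labels, threshold):
    """Model one RCU and its label counting: split samples by
    'value <= threshold' and count the labels on each side."""
    left, right = Counter(), Counter()
    for x, y in zip(values, labels):
        (left if x <= threshold else right)[y] += 1
    return left, right

def gini_from_counts(counts):
    """Gini impurity computed from a label -> count mapping."""
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def evaluate_split(rcus, threshold):
    """Accumulate label counts across RCUs, then compute the weighted Gini."""
    left_total, right_total = Counter(), Counter()
    for values, labels in rcus:
        left, right = rcu_split_and_count(values, labels, threshold)
        left_total += left
        right_total += right
    n_left, n_right = sum(left_total.values()), sum(right_total.values())
    n = n_left + n_right
    if n == 0:
        return 0.0
    return (n_left * gini_from_counts(left_total)
            + n_right * gini_from_counts(right_total)) / n

# Hypothetical training data spread over two RCUs: (feature values, labels).
rcus = [
    ([1, 4, 3, 1], ["a", "d", "b", "a"]),
    ([2, 2, 4, 3], ["a", "a", "c", "b"]),
]
# Record the Gini value of each candidate split.
task_buffer = {t: evaluate_split(rcus, t) for t in (1, 2, 3)}
print(task_buffer)
```

Only the small per-class counts cross RCU boundaries in this model; the samples themselves stay where they were loaded.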

11 Outline
Background
Design Details
Optimizations: Bit Encoding, Pipeline, Node Level Parallelism
Experiment
Conclusion

12 Bit Encoding
Original 3D-VRComp has to compare bit by bit
Cannot compare 0 with 1 and 1 with 0 at the same time
Comparing 0 with 1 makes ML drop
Comparing 1 with 0 makes ML' drop
[Figure: operands A3 A2 A1 A0 and B3 B2 B1 B0 stored as RH/RL cells, lines SL and SL' driven at 1/2V or 0V, and the Matcher combining ML and ML' into Ai = Bi, Ai > Bi or Ai < Bi.]

13 Bit Encoding
Encode n bits to 2^n bits
Compare n bits in parallel
Tradeoff between storage and time
4-bit unary encoding: 00 → 0000, 01 → 0001, 10 → 0011, 11 → 0111
[Figure: unary-encoded operands A and B stored as RH/RL cells, lines SL and SL' driven at 1/2V or 0V, and the Matcher combining ML and ML' into Ai = Bi, Ai > Bi or Ai < Bi.]
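A small sketch shows why the unary code allows all positions to be compared at once. For two unary (thermometer) codes, every mismatching position points in the same direction, so "1 against 0" and "0 against 1" never occur together. The function names are hypothetical, and the code models only the encoding and the comparison logic, not the ML/ML' circuit.

```python
def unary_encode(value, n_bits):
    """Encode an n-bit unsigned value as a 2**n-bit unary (thermometer) code:
    'value' ones in the low-order positions, zeros above them."""
    width = 2 ** n_bits
    if not 0 <= value < width:
        raise ValueError("value does not fit in n_bits")
    return "0" * (width - value) + "1" * value

def compare_unary(code_a, code_b):
    """Compare two unary codes by looking at all bit positions at once."""
    a_one_b_zero = any(x == "1" and y == "0" for x, y in zip(code_a, code_b))
    a_zero_b_one = any(x == "0" and y == "1" for x, y in zip(code_a, code_b))
    # For thermometer codes the two kinds of mismatch are mutually exclusive.
    assert not (a_one_b_zero and a_zero_b_one)
    if a_one_b_zero:
        return ">"
    if a_zero_b_one:
        return "<"
    return "="

# Reproduces the 2-bit to 4-bit table on the slide.
table = {format(v, "02b"): unary_encode(v, 2) for v in range(4)}
print(table)
print(compare_unary(unary_encode(2, 2), unary_encode(1, 2)))
```

The storage cost is visible in `unary_encode`: the code width doubles with every extra input bit, which is the storage versus time tradeoff named on the slide.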

15 Pipeline
Stage 1: Split training samples into right and left sub-trees in RCU
Stage 2: Count labels for each sub-tree in MAC
Stage 3: Calculate Gini value in Acc and ALU
Stage 4: Record into Task Buffer
[Figure: timing diagrams of the RCU, MAC, Acc+ALU and Register stages for three configurations. No encode: cycle markers 31, 32, 35, 36, 39, 40, 43, 63, 64, 67. 16 unary encode: cycle markers 7, 8, 11, 12, 15, 16, 19, 20, 23, 24, 27. 32 unary encode: cycle markers 3, 4, 7, 8, 11, 12, 15, 16, 19.]
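The cycle markers on the slide are consistent with a simple four-stage pipeline model, sketched below. The stage latencies (32, 8 or 4 cycles for the RCU stage, 4 cycles for each other stage) are read off those markers and are an interpretation rather than figures stated in the talk; `pipeline_schedule` is a hypothetical helper.

```python
def pipeline_schedule(stage_latencies, n_items):
    """Return, for each item, a list of (start, end) cycles per stage.

    'end' is exclusive. A stage starts an item once the item has left the
    previous stage and the stage has finished the previous item.
    """
    schedule = []
    stage_free = [0] * len(stage_latencies)  # cycle at which each stage is idle
    for _ in range(n_items):
        ready = 0  # cycle at which the item is available to the next stage
        spans = []
        for s, latency in enumerate(stage_latencies):
            start = max(ready, stage_free[s])
            end = start + latency
            spans.append((start, end))
            stage_free[s] = end
            ready = end
        schedule.append(spans)
    return schedule

# Stage latencies (RCU, MAC, Acc+ALU, Register), inferred from the slide.
configs = {
    "no encode": [32, 4, 4, 4],
    "16 unary": [8, 4, 4, 4],
    "32 unary": [4, 4, 4, 4],
}
for name, latencies in configs.items():
    last = pipeline_schedule(latencies, 2)[-1][-1][1] - 1
    print(name, "second item leaves the Register stage in cycle", last)
```

In this model the interval between consecutive results equals the longest stage latency, so shortening the RCU stage through encoding is what raises throughput until all stages are balanced.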

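17 Node Level Parallelism
Task Buffer records the RCUs required by each node
Nodes can run simultaneously if they do not require the same RCUs
Need to increase the number of Accumulators and ALUs
Task Buffer example (columns: Tree Node ID, RCUs, Done): node 2 uses RCUs 0,1,4,6; node 3 uses RCUs 2,3; node 4 uses RCUs 0,2,5,6,7; node 1 is listed without a recoverable RCU entry
[Figure: Host Interface, Control and Task Buffer connected to four Tiles, with Accumulator and ALU blocks.]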
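The rule "nodes can run simultaneously if they do not require the same RCUs" can be illustrated with a greedy grouping over the Task Buffer entries. The RCU sets for nodes 2, 3 and 4 are taken from the slide; the function `schedule_nodes` and the greedy, node-ID-ordered policy are assumptions, since the talk does not say how the controller picks nodes.

```python
def schedule_nodes(task_buffer):
    """Group pending tree nodes into rounds whose RCU sets do not overlap.

    task_buffer maps node id -> set of RCU ids. Nodes are considered
    greedily in node-id order.
    """
    pending = dict(sorted(task_buffer.items()))
    rounds = []
    while pending:
        busy, batch = set(), []
        for node, rcus in pending.items():
            if busy.isdisjoint(rcus):
                batch.append(node)
                busy |= rcus
        for node in batch:
            del pending[node]
        rounds.append(batch)
    return rounds

# RCU lists for nodes 2, 3 and 4 as shown on the slide.
task_buffer = {2: {0, 1, 4, 6}, 3: {2, 3}, 4: {0, 2, 5, 6, 7}}
print(schedule_nodes(task_buffer))
```

Nodes 2 and 3 share no RCU and can be trained in the same round, while node 4 overlaps with both (RCUs 0 and 6 with node 2, RCU 2 with node 3) and has to wait.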

19 Experiment
Schemes:
RFAcc: Basic implementation with no optimizations
RFAcc-X: Adopting X-unary encoding to enable parallel bit comparison
RFAcc-P: Enable multiple node training
RFAcc-X-P: All optimizations activated

20 Benchmarks

21 Performance
GPU has less than 10X speedup over CPU, and is even slower than CPU for small benchmarks
Our four schemes achieve 482X, 2558X, 1615X and 8564X speedup over CPU

22 Energy
GPU consumes 2X more energy than CPU
RFAcc and RFAcc-P achieve 105 energy saving on average
Enabling encoding doubles the energy saving

23 Encoding
Speedup increases when using larger unary encoding
Energy also increases because more RCUs are used to store the larger data and are activated

25 Conclusion
3D Vertical ReRAM based relational comparator
Full-fledged Random Forest training accelerator
Optimizations:
Bit encoding to enable parallel bit comparison
Pipeline to improve throughput
Node level parallelism to train multiple tree nodes
Performance and energy improvement due to parallel comparison and PIM

26 Thank you

