Accelerating a random forest classifier: multi-core, GP-GPU, or FPGA?
Introduction
Purpose of the paper: compare and contrast the effectiveness of FPGAs, GP-GPUs, and multi-core CPUs for accelerating classification using models generated by compact random forest (CRF) machine learning classifiers.
Topics in the paper:
Random forest classification
Implementation of CRFs on the selected devices
Results from the implementations
Random Forest Classifier
Definition: a random forest is a classifier consisting of a collection of tree-structured classifiers {h(x, Θk), k = 1, ...}, where the {Θk} are independent, identically distributed random vectors and each tree casts a unit vote for the most popular class at input x [1].
Key determining features:
Number of decision trees
Depth of decision trees
A minimal sketch of the traversal and voting follows below.
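The following is a minimal C++ sketch of the definition above, not code from the paper: it assumes binary-split trees over float features, with each tree cast as one unit vote and the majority class returned.

```cpp
#include <vector>

// One node of a binary decision tree: compare one feature against a
// threshold; leaves carry the class label they vote for.
struct Node {
    int feature;      // index of the feature tested at this node
    float threshold;  // go left if sample[feature] <= threshold
    int left, right;  // child indices into the tree; -1 marks a leaf
    int label;        // class voted by this node if it is a leaf
};

using Tree = std::vector<Node>;

// Walk one tree h(x, Theta_k) down to a leaf and return its unit vote.
int predict(const Tree& tree, const std::vector<float>& sample) {
    int i = 0;
    while (tree[i].left != -1)
        i = (sample[tree[i].feature] <= tree[i].threshold)
                ? tree[i].left : tree[i].right;
    return tree[i].label;
}

// The forest's answer is the most popular class among the trees' votes.
int classify(const std::vector<Tree>& forest,
             const std::vector<float>& sample, int num_classes) {
    std::vector<int> votes(num_classes, 0);
    for (const Tree& tree : forest)
        ++votes[predict(tree, sample)];
    int best = 0;
    for (int c = 1; c < num_classes; ++c)
        if (votes[c] > votes[best]) best = c;
    return best;
}
```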
Challenges and Solutions
Challenges:
Decision trees vary significantly in shape and depth, which makes them hard to map onto hardware.
Traversal is data dependent, so it is difficult to provide deterministic memory access into the trees.
Making the processing time identical for every sample is expensive, because each tree must be fully populated.
Classification is not compute intensive; the computation-to-communication ratio is poor.
Solution: compact random forests.
Researchers at LLNL developed an efficient training algorithm that minimizes tree depth to produce a compact random forest.
This allows all decision trees to be fully populated.
It also makes the forest small enough to fit in the memory of one or more accelerators and to tap that internal memory bandwidth.
Key point: hardware implementations of ordinary random forests do not help, because the trees are too big and performance is dominated by memory lookups. A CRF fits entirely in on-chip memory, so lookups stop being the bottleneck; what remains is computation, which is exactly what the hardware accelerates. A back-of-the-envelope sizing sketch follows below.
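To make the memory argument concrete, here is a hypothetical sizing sketch, not from the paper. Only the depth of 6 comes from the paper's parameters; the tree count of 32 and 8 bytes per node are illustrative assumptions. A fully populated binary tree of depth d fits a flat array of 2^(d+1) - 1 nodes, so the whole forest's footprint is easy to bound.

```cpp
#include <cstdio>

// A fully populated binary tree of depth d has 2^(d+1) - 1 nodes and
// fits in a flat array: node i's children sit at 2i+1 and 2i+2, so
// traversal is pure index arithmetic with no pointer chasing.
constexpr unsigned num_nodes(unsigned depth) {
    return (1u << (depth + 1)) - 1;
}

int main() {
    const unsigned depth = 6;       // the paper's maximum tree depth
    const unsigned trees = 32;      // illustrative forest size (assumed)
    const unsigned node_bytes = 8;  // assumed: feature id + threshold
    std::printf("nodes per tree: %u\n", num_nodes(depth));   // 127
    std::printf("forest size:    %u bytes\n",                // ~32 KB
                trees * num_nodes(depth) * node_bytes);
    return 0;
}
```

At a few tens of kilobytes, such a forest sits comfortably in the on-chip memory of any of the three platforms, which is the core of the CRF argument.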
Training Compact Random Forest Classifiers
The CRF training algorithm accepts a parameter for maximum tree depth and generates trees no deeper than that limit. It is derived using LogitBoost.
A "URL Reputation" data set is used for training; it labels each sample as either malicious or benign.
Data from 121 days is split: days 0-59 for training, days 60-120 for testing.
Algorithm used for OpenMP on Shared-Memory Multiprocessor
A doubly nested loop iterates over the samples and over the trees in the forest; a sketch follows below.
For the multi-core CPU performance test, a data set was run on sparse, irregular trees that terminated traversal as early as possible.
OpenMP exploits the data parallelism between samples, processing each sample independently for the best performance; a single OpenMP pragma on the outer loop is enough.
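This is the assumed shape of that doubly nested loop, not the paper's exact code, reusing the Tree type and predict() from the earlier sketch. The outer loop runs over samples, the inner loop over trees, and one pragma on the outer loop spreads the independent samples across cores.

```cpp
#include <omp.h>
#include <vector>

void classify_all(const std::vector<Tree>& forest,
                  const std::vector<std::vector<float>>& samples,
                  std::vector<int>& results, int num_classes) {
    // Samples are independent, so OpenMP parallelizes the outer loop.
    #pragma omp parallel for schedule(static)
    for (long s = 0; s < (long)samples.size(); ++s) {
        std::vector<int> votes(num_classes, 0);
        for (const Tree& tree : forest)          // inner loop: the forest
            ++votes[predict(tree, samples[s])];
        int best = 0;
        for (int c = 1; c < num_classes; ++c)    // majority vote
            if (votes[c] > votes[best]) best = c;
        results[s] = best;
    }
}
```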
FPGA Algorithm Implementation
Targeted hardware: HiTech Global HTG-V6-PCIE-L240-1 board with a Xilinx Virtex-6 XC6VLX240T-1FFG1759.
Key parameters:
Depth: directly affects flip-flop usage, since deeper trees mean more pipeline stages holding data.
Data width: very wide samples are hard to multiplex, taxing FPGA routing resources.
Parameter sizes used:
Max depth: 6
Data width: 2048 bits
FPGA Implementation Continued
Basic compact forest implementation:
There are n trees in the system.
Each tree has s stages; each stage represents all the nodes at that level of the tree.
An in-house gigabit Ethernet core is used for communication.
At start-up, data is loaded into the trees' pipelines.
Once configured, sample data is streamed to the FPGA, where it enters the CRF's data pipeline, which aligns the data with the stages in each tree. A software model of this pipeline is sketched below.
A small problem with this implementation: distributing a wide data pipeline to multiple destinations creates routing issues.
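The staged pipeline can be pictured with a small software model. This is a hypothetical C++ sketch, not the paper's RTL: LevelNode, InFlight, stages, and tick are invented names, and one call to tick stands in for one clock cycle. Each cycle, every in-flight sample advances one tree level, so once the pipeline fills, one result emerges per cycle.

```cpp
#include <cstdint>
#include <vector>

// Node parameters for one level: nodes[level][i] holds node i's test,
// matching a per-level flat layout of a fully populated tree.
struct LevelNode { int feature; float threshold; };

// One pipeline register: which sample is in this stage, and which node
// of the current level it has reached.
struct InFlight {
    int sample_id = -1;        // -1 marks an empty stage (a bubble)
    std::uint32_t node = 0;    // node index within the current level
};

// One "clock cycle": every occupied stage evaluates its level's node
// test and latches the child index into the next stage. After s ticks
// the sample reaches the leaf level, whose stored label is the vote.
void tick(std::vector<InFlight>& stages,
          const std::vector<std::vector<LevelNode>>& nodes,
          const std::vector<std::vector<float>>& samples) {
    // Advance from the deepest stage backwards so nothing is overwritten.
    for (int level = (int)stages.size() - 1; level > 0; --level) {
        InFlight prev = stages[level - 1];
        if (prev.sample_id >= 0) {
            const LevelNode& n = nodes[level - 1][prev.node];
            bool right = samples[prev.sample_id][n.feature] > n.threshold;
            prev.node = 2 * prev.node + (right ? 1u : 0u);  // child index
        }
        stages[level] = prev;
    }
    stages[0] = InFlight{};  // a new sample is injected here each cycle
}
```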
FPGA Implementation Continued
Introducing clumps:
Clumps have the same architecture as the CRF; each clump can contain anywhere from one tree to all of the trees in the forest.
Increasing the number of clumps in the design reduces the amount of routing but increases the number of flip-flops.
Even after applying clumps it was apparent that the CRF would not fit on one FPGA: 8 trees were placed on an LX240 FPGA, and 16 on the LX550T-2.
FPGA Implementation Continued - Tree Implementation
The most direct implementation would be a specialized block of logic for each node within the tree; this would decrease memory requirements but increase routing logic.
Instead, the design uses a single block of logic at each level to implement the functionality of any node at that level.
Memory distribution is now more complicated:
BRAMs are fast but limited in number, so they are used only on levels with 32 or more nodes.
Flip-flops are slower but plentiful, so they are used for the levels with fewer nodes.
The level sizes work out as shown below.
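A quick arithmetic check (assuming fully populated trees of depth 6, per the paper's parameters) shows where the 32-node threshold falls:

```cpp
#include <cstdio>

// Level L of a fully populated binary tree holds 2^L nodes. Under the
// slide's rule, levels with 32 or more nodes are stored in BRAM; the
// smaller levels stay in flip-flops.
int main() {
    for (int level = 0; level <= 6; ++level) {
        int nodes = 1 << level;
        std::printf("level %d: %2d nodes -> %s\n", level, nodes,
                    nodes >= 32 ? "BRAM" : "flip-flops");
    }
    return 0;
}
```

Only the two deepest levels (32 and 64 nodes) qualify for BRAM; the 31 nodes spread over levels 0 through 4 are held in flip-flops.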
GP-GPU Algorithm Implementation
Each processor uses independent threads to process a small number of samples in parallel on a portion of the CRF.
Memory is broken into two portions: sample data and forest data.
Forest data is small, so it is loaded once and re-used for every sample.
Sample data is constantly changing, so replicating it on every processor would consume too many resources.
The design is therefore divided into blocks, where each processor runs certain samples on certain trees within the CRF, which keeps resource use in check. A sketch of this tiling follows below.
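Here is a hypothetical sketch of that block decomposition, again reusing Tree and predict() from the earlier sketch. The tile sizes and names are assumptions, not the paper's CUDA code; on the GPU, each (sample tile, tree tile) pair would map to one thread block, with partial votes accumulated per tile and reduced afterwards.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

void classify_tiled(const std::vector<Tree>& forest,
                    const std::vector<std::vector<float>>& samples,
                    std::vector<int>& results, int num_classes,
                    std::size_t sample_tile, std::size_t tree_tile) {
    // Partial vote tallies, filled in tile by tile and reduced at the end.
    std::vector<std::vector<int>> votes(
        samples.size(), std::vector<int>(num_classes, 0));

    for (std::size_t s0 = 0; s0 < samples.size(); s0 += sample_tile)
        for (std::size_t t0 = 0; t0 < forest.size(); t0 += tree_tile) {
            // One (s0, t0) tile = one thread block in the GPU mapping:
            // a few samples run against a subset of the trees.
            std::size_t s_end = std::min(s0 + sample_tile, samples.size());
            std::size_t t_end = std::min(t0 + tree_tile, forest.size());
            for (std::size_t s = s0; s < s_end; ++s)
                for (std::size_t t = t0; t < t_end; ++t)
                    ++votes[s][predict(forest[t], samples[s])];
        }

    for (std::size_t s = 0; s < samples.size(); ++s) {  // final reduction
        int best = 0;
        for (int c = 1; c < num_classes; ++c)
            if (votes[s][c] > votes[s][best]) best = c;
        results[s] = best;
    }
}
```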
Results
Recap of the hardware used for each design:
Multi-core CPU and GP-GPU: a 2-socket Intel X5660 Westmere system with 12 cores running at 2.8 GHz and 96 GB of DRAM, with an attached NVIDIA Tesla M2050 carrying 3 GB of GDDR5.
FPGA: HiTech Global HTG-V6-PCIE-L240-1.
Testing parameters:
Maximum tree depth of 6
Data width of 2048 bits
Criteria evaluated for comparison: performance, power and cost, and scalability.
Problems encountered:
The authors were unable to acquire the four FPGA boards required to run the full implementation, so they improvised with the one board plus one smaller board, ran a partial implementation, and extrapolated from it.
Data sheets were relied on for power figures: the FPGA configuration could not be measured because of the first problem, and the working environment prevented measuring the CPU and GPU setup.
Results - Performance
For a fair performance test, the trees were fully implemented and populated before measuring. Results are in kilosamples per second (KSps):
CPU: 9,291 KSps (12 threads) and 884 KSps (1 thread)
GPU: 20,398 KSps (14 processors with 1,536 threads per processor)
FPGA: 31,250 KSps (with 4 LX240s)
Relative to the 12-thread CPU, that is roughly a 2.2x speedup for the GPU and 3.4x for the FPGA.
Results - Power and Cost
As mentioned above, power could not be measured directly, so data sheets were used:
Power for the NVIDIA Tesla M2050 is listed as <= 225 W.
Power for the Intel Westmere-EP X5660 is listed as 95 W.
The Xilinx XPower Estimator provided the power-consumption estimates for the FPGAs (presented as a chart in the original slides, not reproduced here).
Results - Scalability
Because this is a machine learning algorithm, the size of the CRF depends on the complexity of the input and the required classification accuracy.
The FPGA-based system can be scaled out to support moderately large forests by adding hardware.
The following results, in kilosamples per second, were measured after increasing the forest to almost seven times its original size, 234 trees:
CPU: 1,044 KSps (12 threads) and 93 KSps (1 thread)
GPU: 5,381 KSps (14 processors with 1,536 threads per processor)
FPGA: 31,250 KSps (with 8 LX240s)
Discussion and Conclusion
FPGAs offer the highest performance and performance per watt, but they are built to support a maximum CRF size and require additional hardware to scale to larger classifiers.
GP-GPUs offer good performance that degrades slowly with larger classifiers, but they still have hard resource bounds that are sensitive to classifier or sample size.
Multi-core CPUs with OpenMP are extremely simple to program and deliver scalable, near-linear performance.
Questions?
References
[1] Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5-32. (Source of the random forest definition.)
[2] Van Essen, B., et al. (2012). Accelerating a random forest classifier: multi-core, GP-GPU, or FPGA? Livermore, CA: Lawrence Livermore National Laboratory.