Published by Jadyn Haden. Modified over 10 years ago.
1
Scalable Regression Tree Learning on Hadoop using OpenPlanet Wei Yin
2
Contributions
We implement OpenPlanet, an open-source implementation of the PLANET regression tree algorithm on the Hadoop MapReduce framework.
We tune and analyze the impact of two parameters, the HDFS block size and the threshold value that separates ExpandNode tasks from InMemoryWeka tasks, to improve on OpenPlanet's default performance.
3
Motivation for large-scale Machine Learning
Models operate on large data sets.
There is a large number of forecasting models.
New data arrives constantly, which creates a real-time training requirement.
4
Regression Tree
A learning algorithm that maps features → target variable (prediction).
The model uses a binary tree structure.
Each non-leaf node is a binary test with a decision condition on one numeric or categorical feature, which sends a data point left or right in the tree.
Leaf nodes contain a regression function or a single prediction value.
Intuitive for domain users to understand: the effect of each feature is visible.
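The traversal described on this slide can be sketched in a few lines. This is a minimal illustration, not OpenPlanet's code: the `Node` class, the `predict` function, and every feature name, threshold, and leaf value in the sample tree are hypothetical, and only numeric features with single-value leaves are handled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """A tree node: either a split (feature, threshold) or a leaf (value)."""
    feature: Optional[str] = None
    threshold: Optional[float] = None
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    value: Optional[float] = None  # set only on leaf nodes


def predict(node: Node, x: dict) -> float:
    """Walk from the root to a leaf, going left when feature < threshold."""
    while node.value is None:
        node = node.left if x[node.feature] < node.threshold else node.right
    return node.value


# Hypothetical tree: split on F1 at the root, then on F2 in the right branch.
tree = Node("F1", 27,
            left=Node(value=95.0),
            right=Node("F2", 43, left=Node(value=10.0), right=Node(value=20.0)))
```

Calling `predict(tree, {"F1": 5, "F2": 0})` follows the left branch at the root and returns the leaf value stored there.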
5
Google’s PLANET Algorithm
Uses distributed worker nodes, coordinated by a master node, to build a regression tree.
[Diagram: master and worker nodes]
21-Sep-11, USC DR Technical Forum
6
OpenPlanet
Give an introduction to OpenPlanet.
Introduce the differences between OpenPlanet and PLANET.
Give specific re-implementation details.
[Diagram: Controller; InitHistogram; ExpandNode; InMemoryWeka; Model File; Threshold Value (60000)]
7
Controller
Controller {
    /* Read user-defined parameters, such as the input file path, test data file, and model output file. */
    ReadParameters(arguments[]);
    /* Initialize three job sets: ExpandSet, InMemWekaSet, and CompletionSet. Each holds the nodes that need the corresponding processing. */
    JobSetsInit(ExpandSet, InMemWekaSet, CompletionSet);
    /* Initialize a Model File instance containing a regression tree with a root node only. */
    InitModelFile(modelfile);
    Do {
        /* Populate each set using modelfile. */
        populateSets(modelfile, ExpandSet, InMemWekaSet, CompletionSet);
        if (ExpandSet is not empty) {
            processing_nodes <- all nodes in ExpandSet;
            TaskRunner(InitHistogram, processing_nodes);
            CandidatePoints <- collect the reducers' results;
            TaskRunner(ExpandNodes, processing_nodes);
            globalOptimalSplitPoint <- collect the reducers' results;
        }
        if (InMemWekaSet is not empty) {
            processing_nodes <- all nodes in InMemWekaSet;
            TaskRunner(InMemWeka, processing_nodes);
        }
        UpdatesTreeModel(Result);
    } While (ExpandSet, InMemWekaSet, or CompletionSet is not empty)  /* If any set is not empty, continue the loop. */
    Output(modelfile);
}
[Flowchart: Start → Initialization → while the queues are not empty: populate queues; if MRExpandQueue is not empty, issue the MRInitial task and then the MRExpandNode task; if MRInMemQueue is not empty, issue the MR-InMemGrow task; update the model and populate the queues → End]
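The slides do not spell out how `populateSets` decides which set a node belongs to. The sketch below assumes the rule used in the PLANET design: a node whose data exceeds the threshold is expanded with MapReduce, and a node at or below it is handed to the in-memory learner. The function name `populate_sets` and the `leaf_sizes` mapping are hypothetical; the default of 60000 is the threshold value shown on the OpenPlanet slide.

```python
def populate_sets(leaf_sizes, threshold=60000):
    """Split unfinished leaf nodes into an expand set and an in-memory set.

    leaf_sizes maps a node ID to the number of training records that reach it.
    Assumption: nodes with more records than `threshold` go to ExpandNode,
    and the rest go to the in-memory Weka task.
    """
    expand_set, in_mem_weka_set = set(), set()
    for node_id, num_records in leaf_sizes.items():
        if num_records > threshold:
            expand_set.add(node_id)
        else:
            in_mem_weka_set.add(node_id)
    return expand_set, in_mem_weka_set


# Hypothetical state after one iteration: node 2 is still large, node 3 is small.
expand_set, in_mem_weka_set = populate_sets({2: 1_000_000, 3: 50_000})
```

Under this assumption, raising the threshold moves more nodes into the in-memory set, which is the tuning knob discussed later in the deck.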
8
ModelFile
An object that holds the regression model and supports related functions, such as adding a node and checking node status.
Advantages:
Updating the model and predicting target values is more convenient than parsing an XML file.
Loading and writing the model file amounts to deserializing and serializing a Java object.
[Diagram: example tree with Root; split nodes F1 < 27, F2 < 43, F1 < 90, F4 < 16; a Weka Model leaf; and a leaf with Predict Value = 95]
Model File Instance:
* Regression Tree Model
* Update Function( )
* CurrentLeaveNode( )
* ...
9
InitHistogram
A pre-processing step that finds candidate split points for ExpandNodes.
Numerical features (e.g. feat1, feat2): select a small set of candidate points from the full data, at the expense of a small loss in accuracy.
Categorical features (e.g. feat3): use all category values.
The ExpandNodes task then only needs to evaluate the points in the candidate set, without consulting other resources.
[Diagram: input node (or subset) → block → Map → Reduce]
Map:
Filtering: only data points that belong to the input node (e.g. node 3) are of interest.
Routing: key-value pairs of (featureID, value).
Reduce:
Sampling: boundaries of an equal-depth histogram, computed with Colt, a high-performance Java library.
Example: Feature 1: {10,2,1,8,3,6,9,4,6,5,7} → {1,3,5,7,9}
Resulting candidate points:
f1 (numerical): 1,3,5,7,9
f2 (numerical): 30,40,50,60,70
f3 (categorical): 1,2,3,4,5,6,7 {Monday -> Friday}
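The map and reduce roles on this slide can be sketched as follows. This is a simplified stand-in, not the Colt-based code: the reducer here keeps every k-th distinct value of a numerical feature, which is an assumption chosen because it reproduces the slide's Feature 1 example. A true equal-depth histogram would also account for duplicate values. The function names and the `leaf_of` callback are hypothetical.

```python
from collections import defaultdict


def init_histogram_map(records, node_id, leaf_of):
    """Mapper: keep records that reach `node_id`, emit (featureID, value)."""
    for record in records:
        if leaf_of(record) == node_id:          # filtering
            for feature_id, value in record.items():
                yield feature_id, value         # routing by feature ID


def init_histogram_reduce(pairs, categorical, num_bins):
    """Reducer: build the candidate split points for each feature."""
    by_feature = defaultdict(set)
    for feature_id, value in pairs:
        by_feature[feature_id].add(value)
    candidates = {}
    for feature_id, values in by_feature.items():
        distinct = sorted(values)
        if feature_id in categorical:
            candidates[feature_id] = distinct   # keep all categories
        else:
            step = max(1, len(distinct) // num_bins)
            candidates[feature_id] = distinct[::step]
    return candidates


# The Feature 1 values from the slide, all assumed to belong to node 3.
records = [{"f1": v} for v in [10, 2, 1, 8, 3, 6, 9, 4, 6, 5, 7]]
pairs = list(init_histogram_map(records, 3, lambda record: 3))
candidates = init_histogram_reduce(pairs, categorical=set(), num_bins=5)
```

With five bins over the ten distinct values, `candidates["f1"]` holds the same five points the slide lists for Feature 1.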
10
ExpandNode
[Diagram: input node (or subset) → block → Map (filtering, routing; uses the candidate points) → Reduce → Controller]
Reducers report local optimal split points, e.g. sp1 (value = 23) and sp2 (value = 26).
The Controller selects the global optimal split point, here sp1 (value = 23).
The expanding node is then updated, e.g. sp1 = 23 in feature 2, so node 3 splits on f2 < 23.
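The slide does not name the criterion used to rank split points. The sketch below assumes the variance-style criterion common to regression trees: pick the split that minimizes the summed squared error of the two children. It also assumes each reducer owns the candidates of some features and sees all rows of the node, so the costs it reports are comparable at the controller. All function names and the sample data are hypothetical.

```python
def sse(targets):
    """Sum of squared errors of the targets around their mean."""
    if not targets:
        return 0.0
    mean = sum(targets) / len(targets)
    return sum((t - mean) ** 2 for t in targets)


def local_best_split(rows, candidates):
    """Reducer side: best (cost, feature_id, threshold) among its candidates."""
    best = None
    for feature_id, thresholds in candidates.items():
        for threshold in thresholds:
            left = [t for x, t in rows if x[feature_id] < threshold]
            right = [t for x, t in rows if x[feature_id] >= threshold]
            if not left or not right:
                continue  # a split must send data to both children
            cost = sse(left) + sse(right)
            if best is None or cost < best[0]:
                best = (cost, feature_id, threshold)
    return best


def global_best_split(local_bests):
    """Controller side: the lowest-cost split among the reducers' results."""
    return min((b for b in local_bests if b is not None), key=lambda b: b[0])


# Hypothetical rows for one node: (features, target).
rows = [({"f1": 1, "f2": 20}, 1.0), ({"f1": 2, "f2": 21}, 1.0),
        ({"f1": 1, "f2": 22}, 1.0), ({"f1": 2, "f2": 24}, 10.0),
        ({"f1": 1, "f2": 25}, 10.0), ({"f1": 2, "f2": 27}, 10.0)]
sp1 = local_best_split(rows, {"f2": [23, 26]})  # one reducer's candidates
sp2 = local_best_split(rows, {"f1": [2]})       # another reducer's candidates
winner = global_best_split([sp1, sp2])
```

In this toy data the targets separate cleanly at f2 < 23, so that split has zero cost and wins at the controller.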
11
MRInMemWeka
[Diagram: input node (or subset) → block → Map → Reduce → Controller]
Map: filtering, then routing with key-value pairs of (NodeID, data point), e.g. to node 4 and node 5.
Reduce, per node (e.g. node 4):
1. Collect the data points for node 4.
2. Call Weka REPTree (or another model, such as M5P) to build the model.
The Controller receives the location of each Weka model (node 4 Weka model, node 5 Weka model) and updates the tree nodes.
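The routing and per-node training on this slide can be sketched as below. Weka is a Java library, so this Python sketch does not call it: `train_mean` is a hypothetical stand-in for the REPTree call and simply predicts the mean target. The function names and the `leaf_of` callback are also hypothetical.

```python
from collections import defaultdict
from statistics import mean


def in_mem_map(records, leaf_of, in_memory_nodes):
    """Mapper: keep records that reach an in-memory node, keyed by node ID."""
    for features, target in records:
        node_id = leaf_of(features)
        if node_id in in_memory_nodes:          # filtering
            yield node_id, (features, target)   # routing: (NodeID, data point)


def in_mem_reduce(pairs, train):
    """Reducer: collect each node's data points and train one model per node."""
    points_by_node = defaultdict(list)
    for node_id, point in pairs:
        points_by_node[node_id].append(point)
    return {node_id: train(points) for node_id, points in points_by_node.items()}


def train_mean(points):
    """Stand-in for the Weka call: a model that predicts the mean target."""
    return mean(target for _, target in points)


# Hypothetical routing: f1 < 27 reaches node 4, f1 < 90 node 5, otherwise node 6.
def leaf_of(features):
    return 4 if features["f1"] < 27 else (5 if features["f1"] < 90 else 6)


records = [({"f1": 10}, 1.0), ({"f1": 12}, 3.0), ({"f1": 50}, 10.0), ({"f1": 99}, 7.0)]
models = in_mem_reduce(in_mem_map(records, leaf_of, {4, 5}), train_mean)
```

Node 6 is not in the in-memory set, so its record is filtered out in the map step and no model is built for it.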
12
Distinctions between OpenPlanet and PLANET:
(1) Sampling MapReduce method: InitHistogram
(2) Broadcast (BC) function
(3) Hybrid model (Weka Model)
Broadcast, sender side:
    BC_Key.Set(BC);
    for (i : num_reducers) {
        BC_Key.partitionID = i;
        send(BC_Key, BC_info);
    }
Partitioning:
    ID of Partition = key.hashCode() % numReduceTasks
    Key.hashCode() {
        if (key_type == BC) return partitionID;
        else return key.hashCode;
    }
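The broadcast trick in the pseudocode above can be illustrated outside Hadoop. In this sketch a key is a `(kind, payload)` tuple, a representation chosen here for illustration. A broadcast key carries its target partition ID as the payload, and any other key is hashed. `zlib.crc32` stands in for Java's `hashCode` so that the result is stable across runs.

```python
import zlib


def partition(key, num_reducers):
    """Return the reducer index for a key, honoring broadcast keys."""
    kind, payload = key
    if kind == "BC":
        hash_code = payload  # broadcast keys hash to their partition ID
    else:
        hash_code = zlib.crc32(str(payload).encode("utf-8"))
    return hash_code % num_reducers


def broadcast(info, num_reducers):
    """Emit one copy of `info` per reducer, each with its own partition ID."""
    return [(("BC", i), info) for i in range(num_reducers)]


# Every reducer receives exactly one copy of the broadcast message.
targets = [partition(key, 4) for key, _ in broadcast("candidate points", 4)]
```

Because the partition ID is always smaller than the number of reducers, the modulo step leaves it unchanged and each copy lands on a distinct reducer.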
13
Performance Analysis and Tuning Method
Baseline for Weka, Matlab and OpenPlanet (single machine)
Parallel performance of OpenPlanet with default settings
Questions:
1. For the 17 million tuple data set, there is very little difference between the 2x8 case and the 8x8 case.
2. There is not much performance improvement, especially compared to the Weka baseline:
* 1.58 speed-up for the 17M data set
* no speed-up for small data sets (no memory overhead)
14
Question 1: Why is performance similar between the 2x8 case and the 8x8 case?
16
Improvement:
Block sizes by node configuration and data size:
Nodes   3.5M tuples / 170MB   17M tuples / 840MB   35M tuples / 1.7GB
2x8     20MB                  80MB                 128MB
4x8     8MB                   32MB                 64MB
8x8     4MB                   16MB                 32MB
1. We tune the block size to 16 MB.
2. This gives 842/16 ≈ 52 blocks.
3. Total running time = 4,300,457 sec, versus 5,712,154 sec for the original version; speed-up: 1.33.
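The arithmetic on this slide can be checked with two small helpers, `num_blocks` and `speed_up` (names chosen here for illustration). HDFS stores a file's remainder as a final, smaller block, so the block count rounds up; the slide's figure of 52 is the truncated quotient.

```python
import math


def num_blocks(file_mb, block_mb):
    """Number of HDFS blocks for a file, counting the final partial block."""
    return math.ceil(file_mb / block_mb)


def speed_up(time_before, time_after):
    """Ratio of the original running time to the tuned running time."""
    return time_before / time_after


blocks = num_blocks(842, 16)             # 52 full blocks plus one partial block
ratio = speed_up(5_712_154, 4_300_457)   # running times from the slide
```

Rounded to two decimals, `ratio` matches the 1.33 speed-up quoted on the slide.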
17
Question 2:
1. Weka works better when there is no memory overhead.
2. In the figure, one area is small and the other is large.
Can we balance these two areas while still avoiding memory overhead for Weka?
Solution: increase the threshold value between the ExpandNodes task and the InMemWeka task.
By experiment, when the JVM for a reducer instance has 1 GB, the maximum threshold value is 2,000,000.
18
Performance Improvement
1. Total running time = 1,835,430 sec vs 4,300,457 sec
2. Areas balanced
3. Number of iterations decreased
4. Speed-up = 4,300,457 / 1,835,430 = 2.34
Average total speed-up on the 17M data set using 8x8 cores:
Weka: 4.93 times
Matlab: 14.3 times
Average accuracy (CV-RMSE):
Weka: 10.09 %
Matlab: 10.46 %
OpenPlanet: 10.35 %
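The slides report accuracy as CV-RMSE without defining it. A common definition, assumed here, is the root-mean-square error divided by the mean of the observed values. The function name `cv_rmse` and the sample numbers are illustrative and are not taken from the experiments.

```python
import math


def cv_rmse(actual, predicted):
    """Coefficient of variation of the RMSE: RMSE / mean(actual)."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("actual and predicted must be non-empty and equal in length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    rmse = math.sqrt(sum(squared_errors) / len(actual))
    return rmse / (sum(actual) / len(actual))


# Hypothetical observations and predictions.
score = cv_rmse([10.0, 20.0, 30.0], [12.0, 18.0, 33.0])
```

Multiplying `score` by 100 gives a percentage comparable in form to the figures on the slide.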
19
Summary:
OpenPlanet is an open-source implementation of the PLANET regression tree algorithm on the Hadoop MapReduce framework.
We tune and analyze the impact of parameters, such as the HDFS block size and the threshold for the in-memory handoff, to improve on OpenPlanet's default performance.
Future work:
(1) Parallel execution of MRExpand and MRInMemWeka within each iteration
(2) Running multiple OpenPlanet instances for different uses, which increases slot utilization
(3) Finding the optimal block size
(4) A real-time model training method
(5) Moving to a cloud platform and analyzing its performance