Scalable Regression Tree Learning on Hadoop using OpenPlanet Wei Yin.


Contributions We implement OpenPlanet, an open-source implementation of the PLANET regression tree algorithm on the Hadoop MapReduce framework. We tune and analyze the impact of two parameters, the HDFS block size and the threshold value for handing off nodes from ExpandNode to InMemoryWeka tasks, to improve on OpenPlanet's default performance.

Motivation for large-scale Machine Learning
- Models operate on large data sets
- Large numbers of forecasting models must be trained
- New data arrives constantly, creating a real-time training requirement

Regression Tree
- A supervised learning algorithm that maps features → target variable (prediction)
- The classifier uses a binary tree structure
- Each non-leaf node is a binary classifier with a decision condition: one numeric or categorical feature decides whether to go left or right in the tree
- Leaf nodes contain a regression function or a single prediction value
- Intuitive for domain users to understand; shows the effect of each feature
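The traversal just described can be sketched in a few lines. This is a hypothetical, minimal representation (dict-based nodes; the feature names and the root condition "F1 < 27" echo the slide's example, but the leaf values are made up), not OpenPlanet's actual Java classes:

```python
def predict(node, x):
    """Walk from the root: at each non-leaf, test the decision condition on
    one feature and go left or right; a leaf returns its prediction value."""
    while "value" not in node:
        node = node["left"] if x[node["feature"]] < node["threshold"] else node["right"]
    return node["value"]

# Hand-built example tree: "F1 < 27" at the root, constant predictions at leaves.
tree = {"feature": "F1", "threshold": 27,
        "left": {"value": 10.0},
        "right": {"feature": "F2", "threshold": 43,
                  "left": {"value": 20.0},
                  "right": {"value": 30.0}}}

print(predict(tree, {"F1": 5, "F2": 0}))     # → 10.0
print(predict(tree, {"F1": 40, "F2": 50}))   # → 30.0
```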

Google’s PLANET Algorithm
Uses distributed worker nodes, coordinated by a master node, to build the regression tree. (21-Sep-11, USC DR Technical Forum)

OpenPlanet
- Give an introduction about OpenPlanet
- Introduce the differences between OpenPlanet and PLANET
- Give specific re-implementation details
Components: Controller, InitHistogram, ExpandNode, InMemoryWeka, Model File. Threshold value between ExpandNode and InMemoryWeka: 60,000.

Controller

Controller {
    /* Read user-defined parameters: input file path, test data file,
       model output file, etc. */
    ReadParameters(arguments[]);

    /* Initialize 3 job sets: ExpandSet, InMemWekaSet, CompletionSet,
       each containing the nodes that need the corresponding processing */
    JobSetsInit(ExpandSet, InMemWekaSet, CompletionSet);

    /* Initialize a Model File instance holding a regression tree
       with a root node only */
    InitModelFile(modelfile);

    do {  /* continue while any set is non-empty */
        /* Populate each set from the model file */
        populateSets(modelfile, ExpandSet, InMemWekaSet, CompletionSet);

        if (ExpandSet != 0) {
            processing_nodes <- all nodes in ExpandSet;
            TaskRunner(InitHistogram, processing_nodes);
            CandidatePoints <- collect reducers' results();
            TaskRunner(ExpandNodes, processing_nodes);
            globalOptimalSplitPoint <- collect reducers' results();
        }
        if (InMemWekaSet != 0) {
            processing_nodes <- all nodes in InMemWekaSet;
            TaskRunner(InMemWeka, processing_nodes);
        }
        UpdatesTreeModel(Result);
    } while (ExpandSet & InMemWekaSet & CompletionSet != 0);

    Output(modelfile);
}

(Flowchart: initialize; while the queues are not empty, populate the queues, issue the MRInitial task, issue MRExpandNode tasks if MRExpandQueue is not empty and MR-InMemGrow tasks if MRInMemQueue is not empty, then update the model and repopulate the queues.)
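The controller's main loop can be sketched under simplified assumptions: each node is represented only by its record count, large nodes go down the MapReduce expand path, and small ones finish in memory. The function names, the halving split, and the sizes are illustrative; the 60,000 threshold is the one quoted on the OpenPlanet slide:

```python
IN_MEMORY_THRESHOLD = 60_000

def grow_tree(root_count, split_counts):
    """Breadth-first growth: each expand of a node yields two children whose
    sizes come from split_counts; nodes at or below the threshold are
    finished by the in-memory (Weka-style) path and become leaves."""
    expand_queue, in_mem_queue, leaves = [root_count], [], []
    while expand_queue or in_mem_queue:
        next_expand = []
        for count in expand_queue:            # MRExpandNode tasks
            for child in split_counts(count):
                if child > IN_MEMORY_THRESHOLD:
                    next_expand.append(child)
                else:
                    in_mem_queue.append(child)
        for count in in_mem_queue:            # InMemWeka tasks
            leaves.append(count)
        expand_queue, in_mem_queue = next_expand, []
    return leaves

# Halving splits: 200k → two 100k nodes → four 50k nodes finished in memory.
print(grow_tree(200_000, lambda n: (n // 2, n // 2)))   # → [50000, 50000, 50000, 50000]
```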

ModelFile
An object that holds the regression model and supports the relevant functions: adding a node, checking node status, etc.
Advantages: more convenient for updating the model and predicting the target value than parsing an XML file. Loading and writing the model file == serializing and de-serializing a Java object.
(Diagram: a Model File instance holds the regression tree model plus functions such as Update( ) and CurrentLeaveNode( ); the example tree has nodes Root, F1 < 27, F2 < 43, F1 < 90, F4 < 16, a Weka model leaf, and a leaf with Predict Value = 95.)
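The load/write-as-serialization idea can be illustrated in a few lines. Python's `pickle` stands in for Java object serialization here, and the node layout is made up for illustration:

```python
import pickle

# The regression model lives as an in-memory object; writing and loading the
# model file is plain serialization, with no XML parsing involved.
model = {"feature": "F1", "threshold": 27,
         "left": {"value": 95.0},
         "right": {"feature": "F2", "threshold": 43,
                   "left": {"value": 12.0}, "right": {"value": 30.0}}}

blob = pickle.dumps(model)        # "write model file" == serialize
restored = pickle.loads(blob)     # "load model file" == de-serialize
print(restored == model)          # → True
```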

InitHistogram
A pre-processing step that finds potential candidate split points for ExpandNode.
- Numerical features (e.g., feat1, feat2): find a few candidate points from the huge data set at the expense of a little lost accuracy. Sampling takes the boundaries of an equal-depth histogram (built with Colt, a high-performance Java library). Example: feature 1 values {10,2,1,8,3,6,9,4,6,5,7} yield candidates {1,3,5,7,9}.
- Categorical features (e.g., feat3): all the distinct values, e.g. {Monday -> Friday}.
- Map phase: filtering (keep only data points belonging to the input node, e.g., node 3) and routing with key-value pairs (featureID, value).
Example candidate sets: f1: 1,3,5,7,9; f2: 30,40,50,60,70; f3: 1,2,3,4,5,6,7.
With the candidate set in hand, the ExpandNodes task just needs to evaluate all the points in it, without consulting other resources.
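One reasonable equal-depth construction is sketched below: sort the sample and take k evenly spaced order statistics, so boundaries land where the data is dense. The exact boundaries depend on the quantile convention chosen, so this sketch's output differs slightly from the slide's {1,3,5,7,9}; the function name is illustrative and this is not Colt's actual API:

```python
def candidate_splits(values, k):
    """Boundaries of an equal-depth (equi-frequency) histogram over a sample:
    every bucket holds roughly the same number of points, so a handful of
    candidates stands in for the feature's full value range."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i + 1) * n // (k + 1)] for i in range(k)]

# The slide's example sample for feature 1.
feature1 = [10, 2, 1, 8, 3, 6, 9, 4, 6, 5, 7]
print(candidate_splits(feature1, 5))   # → [2, 4, 6, 7, 9]
```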

ExpandNode
- Input: a node (or subset), e.g., node 3. The map phase filters and routes data as in InitHistogram.
- Each reducer evaluates the candidate points and emits a local optimal split point (e.g., sp1 with value 23, sp2 with value 26).
- The Controller collects these and selects the global optimal split point (e.g., sp1 = 23 on feature 2), then updates the expanding node: node 3 becomes "f2 < 23".
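The split-point evaluation each reducer performs can be sketched as choosing, among the candidates, the cut that minimises the summed squared error of the two sides (i.e., the best variance reduction). The function name, the data, and the feature key `"f2"` are illustrative:

```python
def best_split(points, candidates, feature):
    """Return the candidate split value minimising SSE(left) + SSE(right)."""
    def sse(ys):
        if not ys:
            return 0.0
        mean = sum(ys) / len(ys)
        return sum((y - mean) ** 2 for y in ys)

    best_point, best_cost = None, None
    for c in candidates:
        left = [y for x, y in points if x[feature] < c]
        right = [y for x, y in points if x[feature] >= c]
        cost = sse(left) + sse(right)
        if best_cost is None or cost < best_cost:
            best_point, best_cost = c, cost
    return best_point

# Targets jump from 0 to 10 at f2 == 4, so 4 is the obvious best cut.
points = [({"f2": v}, 0.0 if v < 4 else 10.0) for v in range(1, 7)]
print(best_split(points, [2, 3, 4, 5], "f2"))   # → 4
```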

MRInMemWeka
- Input: nodes (or subsets), e.g., node 4 and node 5. The map phase filters, then routes data with key-value pairs (nodeID, data point).
- Each reducer: 1. collects the data points for its node (e.g., node 4); 2. calls Weka's REPTree (or any other model, e.g., M5P) to build a model.
- The Controller receives the locations of the Weka models and updates the tree nodes (node 4 -> Weka model, node 5 -> Weka model).
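The route-then-fit shape of this step can be sketched as follows. A per-node mean of the targets stands in for Weka's REPTree, and the node-routing rule (odd feature → node 4, even → node 5) is a made-up stand-in for following the real tree:

```python
from collections import defaultdict

def map_phase(records, node_of):
    """Map side: route each (feature, target) record to the ID of the
    in-memory node it belongs to, emitting (nodeID, record) groups."""
    grouped = defaultdict(list)
    for rec in records:
        grouped[node_of(rec)].append(rec)
    return grouped

def reduce_phase(grouped):
    """Reduce side: one reducer per node collects that node's data and
    builds a full in-memory model on it."""
    return {node: sum(y for _, y in recs) / len(recs)
            for node, recs in grouped.items()}

records = [(1, 2.0), (2, 4.0), (3, 2.0), (4, 8.0)]
models = reduce_phase(map_phase(records, lambda r: 4 if r[0] % 2 else 5))
print(models)   # → {4: 2.0, 5: 6.0}
```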

Distinctions between OpenPlanet and PLANET:
(1) Sampling MapReduce method: InitHistogram
(2) Broadcast (BC) function
(3) Hybrid model: Weka models at the leaves

Broadcast is implemented through the partitioner. Hadoop assigns a pair to a partition by: ID of partition = key.hashCode() % numReduceTasks. A broadcast key overrides hashCode() so that one copy of the value reaches every reducer:

BC_Key.Set(BC);
for (i : num_reducers) {
    BC_Key.partitionID = i;
    send(BC_Key, BC_info);
}

Key.hashCode() {
    if (key_type == BC) return partitionID;
    else return key.hashCode;
}
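The broadcast-via-partitioner trick can be sketched like this. `partition` mimics Hadoop's default hash partitioning, and a tuple key `("BC", i)` stands in for the slide's BC_Key with its overridden hashCode(); the value string is purely illustrative:

```python
NUM_REDUCERS = 4

def partition(key, num_partitions):
    """Default behaviour: hash(key) mod numReduceTasks. A broadcast key
    instead carries an explicit partition ID, mirroring the overridden
    hashCode() that returns partitionID for BC keys."""
    if isinstance(key, tuple) and key[0] == "BC":
        return key[1]                    # broadcast key: forced partition
    return hash(key) % num_partitions

def broadcast(value, num_partitions):
    """Emit one (key, value) pair per reducer, like the slide's BC_Key loop."""
    return [(("BC", i), value) for i in range(num_partitions)]

pairs = broadcast("weka-model-location", NUM_REDUCERS)
# Every reducer partition receives exactly one copy of the broadcast value.
print(sorted(partition(k, NUM_REDUCERS) for k, _ in pairs))   # → [0, 1, 2, 3]
```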

Performance Analysis and Tuning Method
- Baseline for Weka, MATLAB, and OpenPlanet (single machine)
- Parallel performance for OpenPlanet with default settings
Questions:
- For the 17 million-tuple data set, there is very little difference between the 2x8 and 8x8 cases.
- Not much performance improvement overall, especially compared to the Weka baseline: only a 1.58x speed-up on the 17M data set, and no speed-up on small data sets (where Weka has no memory overhead).

Question 1: Why is performance similar between the 2x8 and 8x8 cases?

Improvement: tune the HDFS block size per data size and cluster configuration:

Nodes | 3.5M tuples / 170MB | 17M tuples / 840MB | 35M tuples / 1.7GB
2x8   | 20MB                | 80MB               | 128MB
4x8   | 8MB                 | 32MB               | 64MB
8x8   | 4MB                 | 16MB               | 32MB

1. We tune the block size to 16 MB.
2. This gives 842/16 ≈ 52 blocks.
3. Total running time = 4,300,457 sec, versus 5,712,154 sec for the original version: a 1.33x speed-up.

Question 2:
1. Weka works better when there is no memory overhead.
2. As observed from the chart, one phase's area (time share) is small while the other's is large.
What about balancing those two areas while avoiding memory overhead for Weka?
Solution: increase the threshold value between the ExpandNodes task and InMemWeka. By experiment, when the JVM for a reducer instance has 1 GB, the maximum workable threshold value is 2,000,000.

Performance Improvement
1. Total running time = 1,835,430 sec vs. 4,300,457 sec
2. The two areas are balanced
3. The number of iterations decreased
4. Speed-up = 4,300,457 / 1,835,430 = 2.34

Average total speed-up on the 17M data set using 8x8 cores:
- over Weka: 4.93x
- over MATLAB: 14.3x

Average accuracy (CV-RMSE):
- Weka: 10.09%
- MATLAB: 10.46%
- OpenPlanet: 10.35%

Summary:
OpenPlanet is an open-source implementation of the PLANET regression tree algorithm using the Hadoop MapReduce framework. We tune and analyze the impact of parameters such as the HDFS block size and the threshold for the in-memory handoff to improve OpenPlanet's default performance.

Future work:
(1) Parallel execution of MRExpand and MRInMemWeka within each iteration
(2) Issuing multiple OpenPlanet instances for different usages, which increases slot utilization
(3) Optimal block size selection
(4) A real-time model training method
(5) Moving to a Cloud platform and analyzing performance there