
Haifa Research Lab © 2008 IBM Corporation Parallel streaming decision trees Yael Ben-Haim & Elad Yom-Tov Presented by: Yossi Richter.


1 Parallel streaming decision trees
Yael Ben-Haim & Elad Yom-Tov
Presented by: Yossi Richter

2 Why decision trees?
• Simple classification model, short testing time
• Understandable by humans
• BUT:
  – Difficult to train on large data (need to sort each feature)

3 Previous work
• Presorting (SLIQ, 1996)
• Approximations (BOAT, 1999) (CLOUDS, 1997)
• Parallel (e.g. SPRINT, 1996)
  – Vertical parallelism
  – Task parallelism
  – Hybrid parallelism
• Streaming
  – Minibatch (SPIES, 2003)
  – Statistic (pCLOUDS, 1999)

4 Streaming parallel decision tree
[Figure: diagram; surviving label: Data]

5 Iterative parallel decision tree
[Figure: master/worker timeline. Surviving labels: Data, Time, Master, Workers, Initialize root, Build histogram (one per worker), Merge, Compute node splits, Until convergence]

6 Building an on-line histogram
• A histogram is a list of pairs (p_1, m_1), …, (p_n, m_n)
• Initialize: c = 0, p = [ ], m = [ ]
• For each data point p:
  – If p == p_j for some j <= c: m_j = m_j + 1
  – Otherwise:
    • Add a bin to the histogram with the value (p, 1)
    • c = c + 1
    • If c > max_bins:
      – Merge the two closest bins in the histogram
      – c = max_bins
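The update rule on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names `update` and `merge_closest` are made up here, and the slide does not say how two merged bins are combined, so the sketch assumes the merged bin sits at the count-weighted mean of the two centers and carries the sum of their counts. Bins are kept sorted by center so that "closest" only needs to look at neighbours.

```python
import bisect

def merge_closest(bins):
    """Merge the two adjacent bins whose centers are closest.

    `bins` is a list of (center, count) pairs sorted by center.
    Assumption (not on the slide): the merged bin is placed at the
    count-weighted mean of the two centers.
    """
    i = min(range(len(bins) - 1), key=lambda k: bins[k + 1][0] - bins[k][0])
    (p1, m1), (p2, m2) = bins[i], bins[i + 1]
    bins[i:i + 2] = [((p1 * m1 + p2 * m2) / (m1 + m2), m1 + m2)]

def update(bins, x, max_bins):
    """Add one data point x to the histogram, in place."""
    centers = [p for p, _ in bins]
    j = bisect.bisect_left(centers, x)
    if j < len(bins) and bins[j][0] == x:
        # x coincides with an existing bin center: increment its count.
        bins[j] = (x, bins[j][1] + 1)
    else:
        # Otherwise open a new bin (x, 1), then shrink back if needed.
        bins.insert(j, (x, 1))
        if len(bins) > max_bins:
            merge_closest(bins)

hist = []
for x in [23, 19, 10, 16, 36, 2, 9]:
    update(hist, x, max_bins=5)
print(hist)  # [(2, 1), (9.5, 2), (17.5, 2), (23, 1), (36, 1)]
```

Rebuilding the list of centers on every update keeps the sketch short; a real implementation would presumably keep the centers in a structure that avoids the linear scan.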

7 Merging two histograms
• Concatenate the two histogram lists, creating a list of length c
• Repeat until c <= max_bins:
  – Merge the two closest bins
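A possible Python sketch of this merge step follows. The name `merge_histograms` is hypothetical, and as with the update rule the slide leaves the combination of two bins unspecified; the sketch assumes the merged bin takes the count-weighted mean of the two centers and the sum of the counts. Histograms are lists of (center, count) pairs.

```python
def merge_histograms(h1, h2, max_bins):
    """Concatenate two histograms, then merge closest bins until the
    result has at most max_bins bins. Inputs are left unmodified."""
    bins = sorted(h1 + h2)  # concatenate and order by bin center
    while len(bins) > max(max_bins, 1):
        # Find the adjacent pair with the smallest gap between centers.
        i = min(range(len(bins) - 1), key=lambda k: bins[k + 1][0] - bins[k][0])
        (p1, m1), (p2, m2) = bins[i], bins[i + 1]
        # Assumption: merged bin = count-weighted mean, summed counts.
        bins[i:i + 2] = [((p1 * m1 + p2 * m2) / (m1 + m2), m1 + m2)]
    return bins

h1 = [(2, 1), (9.5, 2), (17.5, 2), (23, 1), (36, 1)]
h2 = [(30, 1), (32, 1), (45, 1)]
merged = merge_histograms(h1, h2, max_bins=5)
print([(round(p, 2), m) for p, m in merged])
# [(2, 1), (9.5, 2), (19.33, 3), (32.67, 3), (45, 1)]
```

Because the merge only needs the two bin lists, each worker can summarize its own share of the data independently and ship a fixed-size histogram instead of the raw samples.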

8 Example of the histogram
[Figure: example histogram, 50 bins, 1000 data points]

9 Pruning
• Taken from the MDL-based SLIQ algorithm
• Consists of two phases:
  – Tree construction
  – Bottom-up pass on the complete tree
• During tree construction, for each tree node, set cleaf = 1 + number of samples that reached the node and do not belong to the majority class
• The bottom-up pass:
  – For each leaf, set cboth = cleaf
  – For each internal node for which cboth(left) and cboth(right) have been assigned, set cboth = 2 + cboth(left) + cboth(right)
  – The subtree rooted at a node is to be pruned if cleaf is small, namely:
    • Only a few samples reach it
    • A substantial portion of the samples that reach it belongs to the majority class
  – If cleaf < cboth (i.e., the subtree does not contribute much information) then:
    • Prune the subtree
    • Set cboth = cleaf
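The bottom-up pass can be written as a short recursion. The sketch below is illustrative only: the `Node` class, its `errors` field (the number of samples at the node outside the majority class) and the function name `prune` are assumptions introduced here, while the cost formulas are the ones on the slide.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    errors: int                      # samples here not in the majority class
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    @property
    def cleaf(self) -> int:
        # Cost of treating this node as a leaf: 1 + misclassified samples.
        return 1 + self.errors

def prune(node: Node) -> int:
    """Bottom-up pass. Returns cboth for `node` and collapses, in place,
    every subtree for which cleaf < cboth."""
    if node.left is None or node.right is None:
        return node.cleaf            # leaf: cboth = cleaf
    cboth = 2 + prune(node.left) + prune(node.right)
    if node.cleaf < cboth:
        node.left = node.right = None  # prune the subtree
        return node.cleaf              # and set cboth = cleaf
    return cboth

# Hypothetical tree: the right subtree barely reduces the error count.
tree = Node(errors=10,
            left=Node(errors=0),
            right=Node(errors=3, left=Node(errors=1), right=Node(errors=2)))
print(prune(tree))  # 7
```

In this example the right child has cleaf = 4 against cboth = 2 + 2 + 3 = 7, so it is collapsed to a leaf; the root then has cboth = 2 + 1 + 4 = 7 against cleaf = 11, so the root split is kept.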

10 IBM Parallel Machine Learning toolbox
• A toolbox for conducting large-scale machine learning
  – Supports architectures ranging from single machines with multiple cores to large distributed clusters
• Works by distributing the computations across multiple nodes
  – Allows for rapid learning of very large datasets
• Includes state-of-the-art machine learning algorithms for:
  – Classification: Support-vector machines (SVM), decision tree
  – Regression: Linear and SVM
  – Clustering: k-means, fuzzy k-means, kernel k-means, Iclust
  – Feature reduction: Principal component analysis (PCA) and kernel PCA
• Includes an API for adding algorithms
• Freely available from alphaWorks
• Joint project of the Haifa Machine Learning group and the Watson Data Analytics group
[Figure label: K-means, Blue Gene]
Shameless PR slide

11 Results: Comparing single node solvers

Dataset      Number of examples  Number of features  Standard tree  SPDT
Adult        32561 (16281)       105                 17.7           15.7
Isolet       6238 (1559)         617                 18.7           14.6
Letter       20000               16                  7.5            8.6
Nursery      12960               25                  1.0            2.6
Page blocks  5473                10                  3.1
Pen digits   7494 (3498)         16                  4.6            5.4
Spambase     4601                57                  8.4            10.5

No statistically significant difference
Ten-fold cross-validation, unless test/train partition exists

12 Results: Pruning

Dataset      Standard tree  SPDT before pruning  SPDT after pruning  Tree size before pruning  Tree size after pruning
Adult        17.7           15.7                 14.3                1645                      409
Isolet       18.7           14.6                 17.8                211                       141
Letter       7.5            8.6                  9.3                 135                       67
Nursery      1.0            2.6                  3.2                 178                       167
Page blocks  3.1                                 3.4                 55                        36
Pen digits   4.6            5.4                  5.8                 89                        81
Spambase     8.4            10.5                 11.4                572                       445

80% reduction in size

13 Speedup (Strong scalability)
[Figure panels: Alpha, Beta]
Speedup improves with data size!

14 Weak scalability
[Figure panels: Alpha, Beta]
Scalability improves with the number of processors!

15 Algorithm complexity

16 Summary
• An efficient new algorithm for parallel streaming decision trees
• Results as good as single-node trees, but with scalability that improves with the data size and the number of processors
• Ongoing work: Proof that the algorithm is only epsilon different from previous decision tree algorithms

17 Thank You
תודה (Toda), Merci, Grazie, Gracias, Obrigado, Danke, Kiitos

