Parallel Streaming Decision Trees
Yael Ben-Haim & Elad Yom-Tov, IBM Haifa Research Lab (© 2008 IBM Corporation)
Presented by: Yossi Richter


Slide 2: Why decision trees?
- Simple classification model, short testing time
- Understandable by humans
- BUT: difficult to train on large data (each feature needs to be sorted)

Slide 3: Previous work
- Presorting (SLIQ, 1996)
- Approximations (BOAT, 1999; CLOUDS, 1997)
- Parallel (e.g. SPRINT, 1996):
  - Vertical parallelism
  - Task parallelism
  - Hybrid parallelism
- Streaming:
  - Minibatch (SPIES, 2003)
  - Statistics (pCLOUDS, 1999)

Slide 4: Streaming parallel decision tree
[diagram of the data flow]

Slide 5: Iterative parallel decision tree
[diagram] The master initializes the root. Then, in each iteration, every worker builds histograms over its portion of the data; the master merges the histograms and computes the node splits. The process repeats until convergence.
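The communication pattern of one such iteration can be sketched as follows. This is a minimal illustration only: a plain `Counter` stands in for the fixed-size on-line histogram described on the next slides, and the median split is an arbitrary stand-in for the real split-selection criterion. The names `worker_histogram` and `master_split` are hypothetical.

```python
from collections import Counter
from statistics import median

def worker_histogram(shard):
    # Each worker summarizes its share of the data locally.
    # (A Counter stands in for the fixed-size streaming histogram.)
    return Counter(shard)

def master_split(histograms):
    # The master merges the workers' summaries and computes a split point.
    # (The median split here is purely illustrative.)
    merged = Counter()
    for h in histograms:
        merged += h
    return median(sorted(merged.elements()))

shards = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
split = master_split([worker_histogram(s) for s in shards])
```

The point of the pattern is that only small histograms, never raw data, travel from workers to master.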

Slide 6: Building an on-line histogram
- A histogram is a list of pairs (p1, m1) … (pc, mc)
- Initialize: c = 0, p = [], m = []
- For each data point p:
  - If p == pj for some j <= c: mj = mj + 1
  - Otherwise:
    - Add a bin with the value (p, 1) to the histogram
    - c = c + 1
    - If c > max_bins: merge the two closest bins in the histogram and set c = max_bins
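The update step above can be sketched in a few lines of Python. One assumption is made where the slide is silent: when the two closest bins are merged, the merged bin is placed at the count-weighted mean of the two bin centers (a natural choice for this kind of histogram).

```python
import bisect

def update(bins, p, max_bins):
    """Insert data point p into a sorted list of (value, count) bins."""
    values = [b[0] for b in bins]
    i = bisect.bisect_left(values, p)
    if i < len(bins) and bins[i][0] == p:
        bins[i] = (p, bins[i][1] + 1)        # p matches an existing bin: m_j += 1
        return bins
    bins.insert(i, (p, 1))                   # otherwise add a new bin (p, 1)
    if len(bins) > max_bins:
        # Merge the two closest bins; the merged bin sits at the
        # count-weighted mean of the two centers (assumption, see above).
        j = min(range(len(bins) - 1), key=lambda k: bins[k + 1][0] - bins[k][0])
        (q1, m1), (q2, m2) = bins[j], bins[j + 1]
        bins[j:j + 2] = [((q1 * m1 + q2 * m2) / (m1 + m2), m1 + m2)]
    return bins
```

Because each update touches at most one extra bin before merging back down, the histogram size stays bounded by max_bins while the total count of all bins always equals the number of points seen.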

Slide 7: Merging two histograms
- Concatenate the two histogram lists, creating a list of length c
- Repeat until c <= max_bins: merge the two closest bins
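A minimal sketch of the merge, under the same assumption as before (a merged bin is placed at the count-weighted mean of the two bin centers; the slide itself only says "merge the two closest bins"):

```python
def merge_histograms(h1, h2, max_bins):
    """Concatenate two (value, count) histograms and shrink to max_bins bins."""
    bins = sorted(h1 + h2)                   # list of length c = len(h1) + len(h2)
    while len(bins) > max_bins:              # repeat until c <= max_bins
        # Merge the two closest bins into their count-weighted mean.
        j = min(range(len(bins) - 1), key=lambda k: bins[k + 1][0] - bins[k][0])
        (q1, m1), (q2, m2) = bins[j], bins[j + 1]
        bins[j:j + 2] = [((q1 * m1 + q2 * m2) / (m1 + m2), m1 + m2)]
    return bins
```

This is exactly the operation the master performs on the workers' histograms in the iterative loop of slide 5.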

Slide 8: Example of the histogram
[plot: a histogram of 50 bins built from 1000 data points]

Slide 9: Pruning
- Taken from the MDL-based SLIQ algorithm
- Consists of two phases: tree construction, then a bottom-up pass on the complete tree
- During tree construction, for each tree node, set c_leaf = 1 + the number of samples that reached the node and do not belong to the majority class
- The bottom-up pass:
  - For each leaf, set c_both = c_leaf
  - For each internal node whose c_both(left) and c_both(right) have already been assigned, set c_both = 2 + c_both(left) + c_both(right)
  - The subtree rooted at a node should be pruned if c_leaf is small, i.e. when either only a few samples reach it or a substantial portion of the samples that reach it belong to the majority class
  - If c_leaf < c_both (i.e., the subtree does not contribute much information): prune the subtree and set c_both = c_leaf
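The bottom-up pass can be sketched as a short recursion. This is an illustrative sketch, not the toolbox's implementation: the `Node` class is hypothetical, and `c_leaf` is assumed to have been filled in for every node during tree construction as the slide describes.

```python
class Node:
    """A binary tree node; c_leaf = 1 + samples at the node outside the majority class."""
    def __init__(self, c_leaf, left=None, right=None):
        self.c_leaf = c_leaf
        self.left = left
        self.right = right

def prune(node):
    """Bottom-up pass: returns c_both for this node, pruning subtrees in place."""
    if node.left is None:                    # leaf: c_both = c_leaf
        return node.c_leaf
    c_both = 2 + prune(node.left) + prune(node.right)
    if node.c_leaf < c_both:                 # subtree contributes little information
        node.left = node.right = None        # prune it and report c_both = c_leaf
        return node.c_leaf
    return c_both
```

For example, a node with c_leaf = 3 and two pure leaf children (c_leaf = 1 each) gets c_both = 2 + 1 + 1 = 4 > 3, so the subtree is cut and the node becomes a leaf.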

Slide 10: IBM Parallel Machine Learning toolbox ("shameless PR slide")
- A toolbox for conducting large-scale machine learning, supporting architectures ranging from single machines with multiple cores to large distributed clusters
- Works by distributing the computations across multiple nodes, allowing rapid learning on very large datasets
- Includes state-of-the-art machine learning algorithms for:
  - Classification: support vector machines (SVM), decision trees
  - Regression: linear and SVM
  - Clustering: k-means, fuzzy k-means, kernel k-means, Iclust
  - Feature reduction: principal component analysis (PCA) and kernel PCA
- Includes an API for adding algorithms
- Freely available from alphaWorks
- Joint project of the Haifa Machine Learning group and the Watson Data Analytics group
[figure: k-means on Blue Gene]

Slide 11: Results, comparing single-node solvers
[table: number of examples, number of features, and accuracy of a standard tree vs. SPDT on Adult (32,561 examples; 16,281 test), Isolet (6,238; 1,559 test), Letter, Nursery, Page blocks, Pen digits (7,494; 3,498 test), and Spambase]
- No statistically significant difference between the standard tree and SPDT
- Ten-fold cross-validation, unless a test/train partition exists

Slide 12: Results, pruning
[table: accuracy of the standard tree, SPDT before pruning, and SPDT after pruning, together with tree size before and after pruning, for Adult, Isolet, Letter, Nursery, Page blocks, Pen digits, and Spambase]
- Takeaway: a …% reduction in tree size

Slide 13: Speedup (strong scalability)
[plots: Alpha, Beta]
- Speedup improves with data size!

Slide 14: Weak scalability
[plots: Alpha, Beta]
- Scalability improves with the number of processors!

Slide 15: Algorithm complexity

Slide 16: Summary
- An efficient new algorithm for parallel streaming decision trees
- Results as good as single-node trees, but with scalability that improves with both the data size and the number of processors
- Ongoing work: a proof that the algorithm's output differs by only epsilon from that of a standard decision tree algorithm

Slide 17: "Thank you" in many languages (Hebrew, English, French, Italian, Spanish, Portuguese, German, Russian, Japanese, Chinese, Thai, Korean, Finnish, Danish, and others)