Decision Tree Induction in Hierarchic Distributed Systems With: Amir Bar-Or, Ran Wolff, Daniel Keren.

Motivation
Large distributed computation is costly
– Especially data-intensive and synchronization-intensive computations, e.g., data mining
Decision tree induction:
– Collects global statistics (thousands) for every attribute (thousands) in every tree node (hundreds)
– Global statistics entail global synchronization

Motivation – Hierarchy Helps
Simplifies synchronization
– Synchronize on each level
Simplifies communication
An "industrial strength" architecture
– The way real systems (including grids) are often organized

Motivation
Mining high-dimensional data
Thousands of sources
Central control
Examples:
– Genomically enriched healthcare data
– Text repositories

Objectives of the Algorithm
Exact results
– Common approaches would instead either collect a sample of the data, or build independent models at each site and then apply centralized meta-learning on top of them
Communication efficiency
– A naive approach that collects exact statistics for each tree node would result in gigabytes of communication

Decision Tree in a Teaspoon
A tree where, at each level, the learning samples are split according to one attribute's value
A hill-climbing heuristic is used to induce the tree
– The attribute that maximizes a gain function is chosen
– Gain functions: Gini index or information gain
No real need to compute the gain exactly – it is enough to rank the attributes
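For concreteness, here is a minimal sketch of the information-gain criterion mentioned above, computed from per-class sample counts. The function names are illustrative, not from the paper:

```python
import math

def entropy(counts):
    """Shannon entropy (in bits) of a class-count histogram."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, split_counts):
    """Gain of splitting a node with class histogram `parent_counts`
    into children with histograms `split_counts` (one list per child)."""
    total = sum(parent_counts)
    children = sum((sum(child) / total) * entropy(child)
                   for child in split_counts)
    return entropy(parent_counts) - children

# A 10/10 class mix split perfectly by a binary attribute gains 1 bit:
print(information_gain([10, 10], [[10, 0], [0, 10]]))  # 1.0
```

Only these class counts per attribute value are needed to rank attributes, which is exactly the statistic the algorithm bounds instead of shipping around.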

Main Idea
Infer deterministic bounds on the gain of each attribute
Improve the bounds until the best attribute is provably better than the rest
Communication efficiency is achieved because bounds require only limited data
– Partial statistics for promising attributes
– Rough bounds for irrelevant attributes
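The stopping rule behind this idea can be sketched as an interval comparison: an attribute is provably best once its gain's lower bound exceeds every other attribute's upper bound. This is a simplified illustration, not the paper's exact procedure:

```python
def provably_best(bounds):
    """Given {attribute: (lower, upper)} gain bounds, return the attribute
    whose lower bound exceeds every other attribute's upper bound,
    or None if no attribute is yet provably best."""
    best = max(bounds, key=lambda a: bounds[a][0])
    lower = bounds[best][0]
    if all(bounds[a][1] < lower for a in bounds if a != best):
        return best
    return None

# 'age' wins: its lower bound 0.4 beats every other upper bound
print(provably_best({'age': (0.4, 0.6), 'zip': (0.1, 0.3), 'sex': (0.0, 0.2)}))
# Overlapping intervals -> None; tighter bounds must be requested
print(provably_best({'age': (0.2, 0.6), 'zip': (0.1, 0.5)}))
```

When the result is `None`, the algorithm asks descendants for more data to shrink the intervals rather than collecting everything up front.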

Hierarchical Algorithm
At each level of the hierarchy:
– Wait for reports from all descendants; a report contains upper and lower bounds on the gain of each attribute and the number of samples from each class
– Use the descendants' reports to compute cumulative bounds
– If there is no clear separation, request that descendants tighten their bounds by sending more data
– In the worst case, all the data is gathered
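The aggregation step above can be sketched as follows. For illustration only, the per-attribute interval endpoints are simply summed, assuming the bounded statistic is additive over disjoint data partitions; the paper's actual bound combination is more refined:

```python
def aggregate_level(reports):
    """Combine descendants' per-attribute (lower, upper) gain bounds
    into cumulative bounds at this node of the hierarchy."""
    cumulative = {}
    for report in reports:
        for attr, (lo, hi) in report.items():
            acc_lo, acc_hi = cumulative.get(attr, (0.0, 0.0))
            cumulative[attr] = (acc_lo + lo, acc_hi + hi)
    return cumulative

# Two descendants report bounds; the parent accumulates them per attribute
print(aggregate_level([{'a': (0.1, 0.2)},
                       {'a': (0.3, 0.4), 'b': (0.0, 0.1)}]))
```

The parent then applies the separation test to the cumulative intervals and, if they still overlap, sends tighten-bounds requests back down the hierarchy.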

Deterministic Bounds
(Slide shows the upper-bound and lower-bound formulas for the gain function as figures.)

Performance Figures
99% reduction in communication bandwidth
– Out of 1000 SNPs, only ~12 were reported to higher levels of the hierarchy
– The percentage declines with hierarchy level

More Performance Figures
Larger datasets require lower bandwidth
Outlier noise is not a big issue
– With white noise, results are even better

Future Work
Text mining
Incremental algorithm
Accommodation of failures
Testing on a real grid system
Is this a general framework?