
Scaling Decision Tree Induction

Outline Why do we need scaling? Cover state of the art methods Details on my research (which is one of the state of the art methods)

Problems Scaling Decision Trees Data doesn’t fit in RAM Numeric attributes require repeated sorting Noisy datasets lead to very large trees Large datasets fundamentally different from smaller ones –Can’t store the entire dataset –Underlying phenomenon changes over time

Current State-Of-The-Art Disk-based methods –SPRINT –SLIQ Sampling methods –BOAT –VFDT & CVFDT Data stream methods –VFDT & CVFDT

SPRINT/SLIQ Shafer, Agrawal, Mehta In the IBM Intelligent Miner for Data Learns the same tree as traditional methods, but works with data on disk One scan over the data per level of the induced tree

SPRINT/SLIQ Details Split the dataset into one file per attribute –(value, record ID) pairs Pre-sort each numeric attribute's file Do one scan over each file to find the best split point Use hash tables to split the files while maintaining sort order Recurse

SPRINT/SLIQ Splitting Example [Diagram: the test attribute's sorted (val | rec) list — 3|3, 5|2, 6|5, 9|1, 10|4, 12|6 — is scanned to choose the split point; a hash table maps each record ID to the < or > branch (1:>, 2:<, 3:<, 4:>, 5:<, 6:>); the other attribute's sorted list — 10|1, 14|6, 20|2, 25|4, 30|3, 40|5 — is then partitioned in one pass through the hash table, preserving its sort order.]
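A minimal in-memory sketch of this split step, assuming the attribute lists fit in Python lists (the real algorithms stream them from disk); the function name and the "income" attribute are illustrative, not from the papers:

# SPRINT/SLIQ-style split: the winning attribute's sorted list decides which
# branch each record goes to; a hash table keyed by record ID then routes
# every other attribute list in one pass, without re-sorting.

def split_attribute_lists(test_list, other_lists, split_value):
    # test_list: sorted [(value, record_id)] for the chosen split attribute.
    # other_lists: {attr_name: sorted [(value, record_id)]} for the rest.
    side = {rec: ('<' if val <= split_value else '>') for val, rec in test_list}
    left, right = {}, {}
    for attr, pairs in other_lists.items():
        # A single sequential pass keeps each partition in sorted order.
        left[attr] = [(v, r) for v, r in pairs if side[r] == '<']
        right[attr] = [(v, r) for v, r in pairs if side[r] == '>']
    return left, right

# Example mirroring the slide, with the test attribute split at value <= 6.
test = [(3, 3), (5, 2), (6, 5), (9, 1), (10, 4), (12, 6)]
other = {"income": [(10, 1), (14, 6), (20, 2), (25, 4), (30, 3), (40, 5)]}
left, right = split_attribute_lists(test, other, split_value=6)
print(left["income"])   # [(20, 2), (30, 3), (40, 5)] -- records 2, 3, 5
print(right["income"])  # [(10, 1), (14, 6), (25, 4)] -- records 1, 6, 4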

BOAT Gehrke, Ganti, Ramakrishnan, Loh Learns the same tree as traditional methods, but can be as much as 3x faster than SPRINT/SLIQ When things work out, it learns more than one level of the tree in a single scan over the database

BOAT Details Read a sample of the data into memory Learn N trees via traditional methods on bootstrap samples from this sample Keep the subtree on which all N trees agree exactly Verify that subtree with a scan over all the data When verification fails, revert to SPRINT/SLIQ
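As a rough illustration of the bootstrap step (not the authors' code), one can grow scikit-learn trees on bootstrap samples of the in-memory sample and check how far they agree; here only the root split is checked, whereas BOAT extracts the whole agreed subtree and then verifies it against the full data:

# Sketch: learn N trees on bootstrap samples and check whether they agree
# on the root split; the threshold spread is what a later scan would confirm.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bootstrap_root_splits(X, y, n_trees=10, seed=0):
    rng = np.random.default_rng(seed)
    roots = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(X), size=len(X))        # bootstrap sample
        clf = DecisionTreeClassifier(max_depth=3).fit(X[idx], y[idx])
        # Root split = feature index and threshold of node 0.
        roots.append((clf.tree_.feature[0], clf.tree_.threshold[0]))
    features = {f for f, _ in roots}
    if len(features) == 1:
        thresholds = [t for _, t in roots]
        # All trees agree on the split attribute; return the threshold range.
        return features.pop(), (min(thresholds), max(thresholds))
    return None  # no agreement: fall back to SPRINT/SLIQ for this node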

BOAT Example [Diagram: bootstrap trees that all split on x1 (male/female) and then on x2, but with different thresholds (<= 65, <= 67, <= 61); one tree also adds an x3 split. The combined tree keeps the agreed x1 and x2 structure, with the x2 threshold narrowed to the range 61–67 and a '?' where the bootstrap trees disagree.]

VFDT/CVFDT Hulten, Spencer, Domingos With high probability learns what traditional methods would learn, but much faster Learns from a data stream instead of a database CVFDT is an extension to time-changing concepts

Motivation Why use a data stream model? –High data rate –Essentially infinite data –Data collected in varied circumstances Need algorithms that are: –Constant time per example & use each example once –Incremental –Anytime –Produce results 'equivalent' to traditional methods

Hoeffding Trees To pick the split attribute for a node, looking at a few examples may be sufficient Given a stream of examples: –Use the first ones to pick the split at the root –Sort succeeding ones to the leaves –Pick the best attribute there –Continue… Leaves predict the most common class

How Much Data? Make sure the best attribute is really better than the second –That is: ΔG = G(best) – G(2nd best) > 0 Using a statistical result, the Hoeffding bound: with probability 1 – δ, the true mean of a variable with range R lies within ε = sqrt(R² ln(1/δ) / 2n) of its average over n observations –Collect data till the observed ΔG > ε
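As a concrete illustration (not from the slides), the bound is easy to evaluate; for information gain with two classes the range R is 1, so with the VFDT setting δ = 10^-7:

# Hoeffding bound: with probability 1 - delta, the true mean of a variable
# with range R lies within eps of its average over n observations.
import math

def hoeffding_epsilon(R, delta, n):
    return math.sqrt(R * R * math.log(1.0 / delta) / (2.0 * n))

for n in (100, 1000, 10000):
    print(n, round(hoeffding_epsilon(1.0, 1e-7, n), 4))
# 100 -> 0.2839, 1000 -> 0.0898, 10000 -> 0.0284; split once the observed
# G(best) - G(2nd best) exceeds this eps.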

Hoeffding Tree Algorithm
Procedure HoeffdingTree(Stream, δ)
  Let HT = tree with a single leaf (the root)
  Initialize sufficient statistics at the root
  For each example (X, y) in Stream:
    Sort (X, y) to a leaf using HT
    Update sufficient statistics at the leaf
    Compute G for each attribute
    If G(best) – G(2nd best) > ε, then
      Split the leaf on the best attribute
      For each branch: start a new leaf and initialize its sufficient statistics
  Return HT
[Diagram: a small example tree that splits on x1 (male/female), then on x2 (<= 65 / > 65), with leaves predicting y = 0 or y = 1.]
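A minimal Python sketch of the per-leaf bookkeeping, assuming discrete attributes and using the tie threshold τ from the VFDT slide below; the class and method names are illustrative choices, not the authors' implementation:

# Sketch of one Hoeffding-tree leaf: count-based sufficient statistics,
# information gain G, and the split test G(best) - G(2nd best) > eps.
import math
from collections import defaultdict

def entropy(counts):
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts if c)

class Leaf:
    def __init__(self, n_attrs, delta=1e-7, tau=0.05):
        self.n_attrs, self.delta, self.tau = n_attrs, delta, tau
        self.n = 0
        self.class_counts = defaultdict(int)
        # counts[attr][attr_value][class] -- the sufficient statistics
        self.counts = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))

    def update(self, x, y):
        # x is a sequence of attribute values, y the class label.
        self.n += 1
        self.class_counts[y] += 1
        for a in range(self.n_attrs):
            self.counts[a][x[a]][y] += 1

    def gain(self, a):
        # Information gain of splitting this leaf on attribute a.
        base = entropy(list(self.class_counts.values()))
        cond = 0.0
        for cls_counts in self.counts[a].values():
            branch = list(cls_counts.values())
            cond += sum(branch) / self.n * entropy(branch)
        return base - cond

    def try_split(self):
        # Return the attribute to split on, or None if not yet confident.
        if self.n < 2:
            return None
        eps = math.sqrt(math.log(1.0 / self.delta) / (2.0 * self.n))  # R = 1
        gains = sorted(((self.gain(a), a) for a in range(self.n_attrs)),
                       reverse=True)
        (g_best, best), (g_second, _) = gains[0], gains[1]
        # Split when the leader is clearly better, or break a tie once eps
        # has shrunk below tau (waiting longer would be wasteful).
        if g_best - g_second > eps or eps < self.tau:
            return best
        return None

Feeding a stream of (x, y) pairs into update() and calling try_split() only every few hundred examples mirrors the "check for splits periodically" optimization on the VFDT slide below.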

Properties of Hoeffding Trees The model may contain incorrect splits – is it still useful? Bound the difference with the infinite-data tree –The chance that an arbitrary example takes a different path Intuition: an example at level i of the tree has i chances to go through a mistaken node
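A rough way to formalize that intuition (my paraphrase, not the paper's exact result): if each split chosen by the Hoeffding test is mistaken with probability at most δ, then by a union bound

\Pr[\text{an example at depth } i \text{ takes a different path than in the infinite-data tree}] \;\le\; i\,\delta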

VFDT (Very Fast Decision Tree) Memory management –Memory dominated by sufficient statistics –Deactivate less promising leaves when needed Ties: –Wasteful to decide between identical attributes Check for splits periodically Pre-pruning (optional) –Only make splits that improve the value of G(.) Early stop on bad attributes Bootstrap with traditional learner Rescan old data when time available

Experiments Compared VFDT and C4.5 (Quinlan, 1993) Same memory limit for both (40 MB) –100k examples for C4.5 VFDT settings: δ = 10^-7, τ = 5% Domains: 2 classes, 100 binary attributes Fifteen synthetic trees 2.2k – 500k leaves Noise from 0% to 30%

Running Times Pentium III at 500 MHz running Linux C4.5 takes 35 seconds to read and process 100k examples; VFDT takes 47 seconds VFDT takes 6377 seconds for 20 million examples: 5752s to read, 625s to process VFDT processes 32k examples per second (excluding I/O)

Time-Changing Data Streams The underlying concept often changes over time –Seasonal effects –Economic cycles –Etc. Many KDD systems assume the data is a sample from a stationary distribution CVFDT extends VFDT to time-changing data streams

Dealing with Time-Changing Concepts Out-of-date data misleads the learner and results in larger or less accurate models Maintain a window of the most recent examples –When new data arrives, update the window and reapply the learner –Effective when the window size matches the rate of concept drift Extremely inefficient!

Concept-Adapting VFDT (CVFDT) Keep up to date with a window of size w –Incrementally incorporate and forget examples Smoothly change the induced tree –Grow speculative structure –Change structure when it becomes more accurate Incorporates new examples in constant time, instead of relearning on the window in O(w) time

Window (Forgetting Examples) Keep sufficient statistics at every node Update with new & old examples –Keep an ID and only forget where needed –Quickly update leaf predictions Periodically check for any invalid splits –Some portion due to incorrect initial splits –The rest due to changes in the data stream
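A minimal sketch of that bookkeeping, assuming each windowed example remembers the IDs of the nodes it was counted in, so it can be forgotten exactly where it was added; the class and method names are illustrative:

# CVFDT-style windowing: new examples increment the sufficient statistics
# along their path; once the window is full, the oldest example decrements
# the statistics of the nodes it was originally counted in.
from collections import defaultdict, deque

class NodeStats:
    """Per-node sufficient statistics: just class counts in this sketch."""
    def __init__(self):
        self.class_counts = defaultdict(int)

    def increment(self, y):
        self.class_counts[y] += 1

    def decrement(self, y):
        self.class_counts[y] -= 1

class SlidingWindowStats:
    def __init__(self, window_size):
        self.window = deque()          # (class label, node IDs on its path)
        self.window_size = window_size
        self.stats = defaultdict(NodeStats)

    def add(self, y, path_nodes):
        # New example: update every node it passes through.
        for node in path_nodes:
            self.stats[node].increment(y)
        self.window.append((y, list(path_nodes)))
        if len(self.window) > self.window_size:
            old_y, old_path = self.window.popleft()
            # Forget the oldest example only where it was actually counted;
            # nodes grown after it arrived are left alone.
            for node in old_path:
                self.stats[node].decrement(old_y)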

Alternate Sub-Trees When a new test looks better, grow an alternate sub-tree Replace the old one when the new one is more accurate This smoothly adjusts to changing concepts [Diagram: an example tree with nodes Gender?, College?, Pets?, and Hair?, showing an alternate sub-tree grown alongside the original before being swapped in]
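A sketch of the replacement test, assuming the current and alternate sub-trees both classify the same recent examples; the window length of 1000 is an illustrative choice:

# Score the current sub-tree and its alternate on recent examples, and
# swap them when the alternate has become more accurate.
from collections import deque

class AlternateMonitor:
    def __init__(self, recent=1000):
        self.current_errors = deque(maxlen=recent)
        self.alternate_errors = deque(maxlen=recent)

    def record(self, y_true, y_current, y_alternate):
        self.current_errors.append(y_current != y_true)
        self.alternate_errors.append(y_alternate != y_true)

    def should_replace(self):
        # Replace only after seeing enough examples to compare fairly.
        if len(self.current_errors) < self.current_errors.maxlen:
            return False
        cur = sum(self.current_errors) / len(self.current_errors)
        alt = sum(self.alternate_errors) / len(self.alternate_errors)
        return alt < cur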

CVFDT Details Memory requirements –When drift is present, CVFDT uses fewer nodes than VFDT –Observed good results with relatively few alternate trees Update time –O(# attribs * # values * # classes * path length) Independent of the training set and window size!

Other things Dynamic window size –Drastic changes in the data stream –Drastic changes in the induced model –No apparent changes (learn more detail)

Synthetic Experiments Concept based on parallel hyperplanes The axis-aligned attribute is the better split attribute; rotate the hyperplanes to change the structure of the 'true' tree [Figure: concept drift]

Synthetic Experiments (cont.) Compare CVFDT with VFDT 5 million training examples Drift inserted by periodically rotating the hyperplanes –About 8% of test points change label with each drift 100,000 examples in the window 5% noise Results sampled every 10k examples throughout the run and averaged

Error Rate vs. # Attributes

Tree Size vs. # Attributes

Detailed View of Single Run

Varying Levels of Drift

Details of Adaptation

Comparison With VFDT-Window CVFDT achieves most of the accuracy gain Running times: VFDT 10 min, CVFDT 46 min, VFDT-Window est. 548 days! [Chart comparing VFDT-Window, CVFDT, and VFDT]

Application: Web Data Trace of all web requests from the UW campus 82.8 million requests over a one-week period Goal: predict which pages to cache CVFDT does better for the first 70% of the run VFDT's performance improves near the end Data seems to contain drift, but more study is needed

Open Issues Continuous Attributes Batch version of VFDT Very Fast Post Pruning Extending general method to other algorithms

Summary Decision trees are important, but need more work to scale to today's problems Disk-based methods –About one scan per level of the tree Sampling can produce equivalent trees much faster