SLIQ and SPRINT for disk-resident data

SLIQ
SLIQ is a decision tree classifier that can handle both numeric and categorical attributes. It builds compact and accurate trees, uses a pre-sorting technique in the tree-growing phase, and is suitable for classifying large, disk-resident datasets.

Issues
There are two major performance-critical issues in the tree-growth phase: how to find split points, and how to partition the data. The well-known decision tree classifiers grow trees depth-first and repeatedly sort the data at every node. SLIQ replaces this repeated sorting with a one-time pre-sort and uses a new data structure called the class list. The class list must remain memory-resident at all times.

Some Data

rid  age  salary  marital  car
  1   30      60  single   sports
  2   25      20  single   mini
  3   40      80  married  van
  4   45     100  married  luxury
  5   60     150  married  luxury
  6   35     120  single
  7   50      70  married  van
  8   55      90  single
  9   65      30  married  mini
 10   70     200  single

SLIQ - Attribute Lists

rid  age        rid  salary      rid  marital
  1   30          1     60          1  single
  2   25          2     20          2  single
  3   40          3     80          3  married
  4   45          4    100          4  single
  5   60          5    150          5  married
  6   35          6    120          6  single
  7   50          7     70          7  married
  8   55          8     90          8  single
  9   65          9     30          9  married
 10   70         10    200         10  single

These are projections on (rid, attribute).

SLIQ - Sort Numeric, Group Categorical

rid  age        rid  salary      rid  marital
  2   25          2     20          3  married
  1   30          9     30          5  married
  6   35          1     60          7  married
  3   40          7     70          9  married
  4   45          3     80          1  single
  7   50          8     90          2  single
  8   55          4    100          4  single
  5   60          6    120          6  single
  9   65          5    150          8  single
 10   70         10    200         10  single

Numeric attribute lists are sorted once by value; the categorical list is grouped by value.

SLIQ - Class List

rid  car      LEAF
  1  sports   N1
  2  mini     N1
  3  van      N1
  4  luxury   N1
  5  luxury   N1
  6           N1
  7  van      N1
  8           N1
  9  mini     N1
 10           N1

The class list maps each rid to its class (car) and to the leaf the record currently belongs to; initially every record is at the root N1.
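To make the pre-sorting step concrete, here is a minimal Python sketch of how the attribute lists and the class list could be built. It is illustrative only, not the SLIQ implementation; the function and variable names are invented, and only the four records whose class is fully spelled out above are included.

# Minimal sketch of SLIQ's data structures (illustrative, not the original code).
# Numeric attribute lists are (value, rid) pairs sorted once up front;
# the class list maps rid -> [class label, current leaf] and stays in memory.

records = {
    # rid: (age, salary, marital, car)  -- first four records from the example data
    1: (30, 60, "single", "sports"),
    2: (25, 20, "single", "mini"),
    3: (40, 80, "married", "van"),
    4: (45, 100, "married", "luxury"),
}

def build_attribute_lists(records):
    """Project the data onto (value, rid) pairs, one list per attribute."""
    age_list = sorted((age, rid) for rid, (age, _, _, _) in records.items())
    salary_list = sorted((sal, rid) for rid, (_, sal, _, _) in records.items())
    # Sorting the categorical attribute by value groups identical values together.
    marital_list = sorted((mar, rid) for rid, (_, _, mar, _) in records.items())
    return age_list, salary_list, marital_list

def build_class_list(records, root="N1"):
    """Class list: rid -> [class label, leaf pointer]; all records start at the root."""
    return {rid: [cls, root] for rid, (_, _, _, cls) in records.items()}

age_list, salary_list, marital_list = build_attribute_lists(records)
class_list = build_class_list(records)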

SLIQ - Histograms
[Slide: the sorted age attribute list is scanned in order. For each record, the class list gives its class and its leaf (here N1); the record is moved from the right (R) to the left (L) class histogram (one count per class: sports, mini, van, luxury), and the candidate tests age ≤ 25 and age ≤ 30 are scored.] Evaluate each split using Gini or entropy.

SLIQ - Histograms
[Slide: the same scan over the sorted salary attribute list, scoring the candidate tests salary ≤ 20 and salary ≤ 30.] Evaluate each split using Gini or entropy.

SLIQ - Histograms
[Slide: for the categorical attribute marital, one class histogram is kept per value (married, single); candidate subsets of values are then scored.] Evaluate each split using Gini or entropy.
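The split score the slides refer to can be either the Gini index or entropy. Below is a small Python sketch of the Gini-based score for a binary split, computed from the left/right class histograms; the histograms in the example call are illustrative, built only from the four records whose class is given above.

def gini(histogram):
    """Gini impurity of one partition: 1 - sum of squared class proportions."""
    n = sum(histogram.values())
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in histogram.values())

def split_score(left_hist, right_hist):
    """Size-weighted Gini of the two partitions; lower is better."""
    n_left = sum(left_hist.values())
    n_right = sum(right_hist.values())
    total = n_left + n_right
    return (n_left / total) * gini(left_hist) + (n_right / total) * gini(right_hist)

# Example: the candidate test salary <= 20 puts only rid 2 (mini) on the left.
left = {"sports": 0, "mini": 1, "van": 0, "luxury": 0}
right = {"sports": 1, "mini": 0, "van": 1, "luxury": 1}
print(split_score(left, right))  # prints the weighted Gini of this split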

SLIQ - Perform Best Split and Update Class List
The best split found is salary ≤ 60, so N1 is split into children N2 (salary ≤ 60) and N3 (salary > 60). Scanning the salary attribute list and probing the class list by rid, each record's leaf pointer is updated:

rid  car      LEAF
  1  sports   N2
  2  mini     N2
  3  van      N3
  4  luxury   N3
  5  luxury   N3
  6           N3
  7  van      N3
  8           N3
  9  mini     N2
 10           N3

SLIQ - Histograms
[Slide: after the split on salary ≤ 60, split evaluation continues at the new leaves N2 and N3. The sorted age list is scanned again; each record updates the L/R class histograms of whichever leaf the class list now points it to, and candidate tests such as age ≤ 25 are scored per leaf.] Evaluate each split using Gini or entropy.

SLIQ - Pseudocode: Split Evaluation

EvaluateSplits()
  for each numeric attribute A do
    for each value v in the attribute list of A do
      find the corresponding class-list entry (via the rid), and hence
        the record's class and its leaf node Ni
      update the class histogram in leaf Ni
      compute the splitting score for the test (A ≤ v) at Ni
  for each categorical attribute A do
    for each leaf of the tree do
      find the subset of A's values with the best split
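A rough Python rendering of EvaluateSplits() for one numeric attribute is sketched below; the names and the Gini score are assumptions, and the categorical case and tie handling are omitted for brevity. It follows the pseudocode: scan the pre-sorted attribute list once, look each rid up in the class list, update that leaf's histograms, and score the test (A ≤ v).

from collections import Counter, defaultdict

def gini(hist):
    n = sum(hist.values())
    return 0.0 if n == 0 else 1.0 - sum((c / n) ** 2 for c in hist.values())

def weighted_gini(left, right):
    n = sum(left.values()) + sum(right.values())
    return (sum(left.values()) * gini(left) + sum(right.values()) * gini(right)) / n

def evaluate_numeric_splits(attr_list, class_list):
    """attr_list: (value, rid) pairs, pre-sorted by value.
    class_list: rid -> [class label, leaf id].
    Returns the best (score, threshold) per leaf for the test (A <= v)."""
    right = defaultdict(Counter)        # per-leaf histogram of records still on the right
    for value, rid in attr_list:
        cls, leaf = class_list[rid]
        right[leaf][cls] += 1
    left = defaultdict(Counter)         # per-leaf histogram of records moved to the left
    best = {}
    for value, rid in attr_list:
        cls, leaf = class_list[rid]
        left[leaf][cls] += 1            # the record crosses from R to L
        right[leaf][cls] -= 1
        score = weighted_gini(left[leaf], right[leaf])
        if leaf not in best or score < best[leaf][0]:
            best[leaf] = (score, value)
    return best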

SLIQ - Pseudocode: Updating the Class List

UpdateLabels()
  for each leaf Ni that was split do
    let A be the split attribute for Ni
    for each (rid, v) in the attribute list of A do
      find the corresponding class-list entry e (using the rid)
      if the leaf referenced by e is Ni then
        find the new child Nj to which (rid, v) belongs (by applying the splitting test)
        update the leaf pointer of e to Nj
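And a matching sketch of UpdateLabels(), again with illustrative names: the attribute list of the split attribute is scanned once, each rid is looked up in the class list, and the leaf pointer is rewritten according to the splitting test. The example at the end assumes the salary ≤ 60 split from the slides, applied to the four fully spelled-out records.

def update_labels(attr_list, class_list, split_leaf, threshold, left_child, right_child):
    """attr_list: (value, rid) pairs for the split attribute.
    class_list: rid -> [class label, leaf id]; leaf pointers are updated in place."""
    for value, rid in attr_list:
        entry = class_list[rid]
        if entry[1] != split_leaf:      # record belongs to some other leaf
            continue
        entry[1] = left_child if value <= threshold else right_child

# Example: split N1 on salary <= 60 into children N2 / N3.
salary_list = [(20, 2), (60, 1), (80, 3), (100, 4)]
class_list = {1: ["sports", "N1"], 2: ["mini", "N1"], 3: ["van", "N1"], 4: ["luxury", "N1"]}
update_labels(salary_list, class_list, "N1", 60, "N2", "N3")
# class_list now points rids 1 and 2 at N2 (salary <= 60) and rids 3 and 4 at N3.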

SLIQ - Bottleneck
The class list must remain memory-resident at all times! Although this is not a big problem with today's memory sizes, there may still be cases where it becomes a bottleneck. So, what can we do when the class list does not fit in main memory? SPRINT is one solution...

SPRINT

rid  age  car        rid  salary  car        rid  marital  car
  2   25  mini         2     20   mini         3  married  van
  1   30  sports       9     30   mini         5  married  luxury
  6   35               1     60   sports       7  married  van
  3   40  van          7     70   van          9  married  mini
  4   45  luxury       3     80   van          1  single   sports
  7   50  van          8     90                2  single   mini
  8   55               4    100   luxury       4  single   luxury
  5   60  luxury       6    120                6  single
  9   65  mini         5    150   luxury       8  single
 10   70              10    200               10  single

The main data structures used in SPRINT are attribute lists and class histograms. Each attribute-list record carries the class label with it, so SPRINT needs no separate, memory-resident class list.

SPRINT - Histograms
[Slide: the sorted age attribute list, which now carries the class with each record, is scanned and the L/R class histograms are updated for the candidate tests age ≤ 25 and age ≤ 30.] Evaluate each split using Gini or entropy.

SPRINT - Histograms
[Slide: the same scan over the sorted salary attribute list, scoring the candidate tests salary ≤ 20 and salary ≤ 30.] Evaluate each split using Gini or entropy.

SPRINT - Histograms
[Slide: for the categorical attribute marital, class counts are kept per value (married, single) and candidate subsets of values are scored.] Evaluate each split using Gini or entropy.

SPRINT - Performing Best Split
Once the best split point has been found for a node, we execute the split by creating child nodes. This requires splitting each of the node's attribute lists into two. Partitioning the attribute list of the winning attribute (salary) is easy: we scan the list, apply the split test, and move each record to one of two new attribute lists, one per child.

SPRINT - Performing Best Split Unfortunately, for the remaining attribute lists of the node (age and marital), we have no test that we can apply to the attribute values to decide how to divide the records. Solution: use the rids. As we partition the list of the splitting attribute (i.e. salary), we insert the rids of each record into a probe structure (hash table), noting to which child the record was moved. Once we have collected all the rids, we scan the lists of the remaining attributes and probe the hash table with the rid of each record. The retrieved information tells us with which child to place the record.
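A small Python sketch of this rid-probing step, under the same salary ≤ 60 split and with invented names (the real SPRINT hash table is built for disk-resident data; a plain dict stands in for it here). The winning attribute's list is partitioned directly by the test, recording each rid's destination; the other lists are then partitioned by probing that structure.

def split_winning_list(attr_list, threshold):
    """Partition the splitting attribute's list; record each rid's destination child."""
    left, right, rid_to_child = [], [], {}
    for rid, value, cls in attr_list:
        side = "left" if value <= threshold else "right"
        (left if side == "left" else right).append((rid, value, cls))
        rid_to_child[rid] = side
    return left, right, rid_to_child

def split_other_list(attr_list, rid_to_child):
    """Partition a non-splitting attribute's list by probing with each record's rid."""
    left = [rec for rec in attr_list if rid_to_child[rec[0]] == "left"]
    right = [rec for rec in attr_list if rid_to_child[rec[0]] == "right"]
    return left, right

# Example with the four fully spelled-out records, as (rid, value, class):
salary_list = [(2, 20, "mini"), (1, 60, "sports"), (3, 80, "van"), (4, 100, "luxury")]
age_list = [(2, 25, "mini"), (1, 30, "sports"), (3, 40, "van"), (4, 45, "luxury")]
sal_left, sal_right, probe = split_winning_list(salary_list, 60)
age_left, age_right = split_other_list(age_list, probe)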

SPRINT - Performing Best Split
If the hash table is too large to fit in memory, the split is executed in several steps: the attribute list of the splitting attribute is partitioned up to the record at which the hash table fills memory; the corresponding portions of the non-splitting attributes' lists are partitioned; and the process is repeated for the remainder of the splitting attribute's list.
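One way such a staged split might look in code is sketched below; the chunking scheme and names are assumptions for illustration, not taken from the SPRINT paper.

def staged_split(splitting_list, other_lists, threshold, chunk_size):
    """Process the splitting attribute's list in memory-sized chunks.
    splitting_list and the lists in other_lists hold (rid, value, class) records;
    after each chunk, the matching portions of the other lists are partitioned."""
    split_left, split_right = [], []
    results = {name: ([], []) for name in other_lists}
    for start in range(0, len(splitting_list), chunk_size):
        probe = {}                       # rid -> 0 (left child) or 1 (right child)
        for rid, value, cls in splitting_list[start:start + chunk_size]:
            side = 0 if value <= threshold else 1
            (split_left if side == 0 else split_right).append((rid, value, cls))
            probe[rid] = side
        # Partition the records of the other lists whose rids fell into this chunk.
        for name, attr_list in other_lists.items():
            for rec in attr_list:
                if rec[0] in probe:
                    results[name][probe[rec[0]]].append(rec)
    return (split_left, split_right), results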