SLIQ: A Fast Scalable Classifier for Data Mining Manish Mehta, Rakesh Agrawal, Jorma Rissanen 1996. Presentation by: Vladan Radosavljevic.

Presentation transcript:

SLIQ: A Fast Scalable Classifier for Data Mining Manish Mehta, Rakesh Agrawal, Jorma Rissanen Presentation by: Vladan Radosavljevic

Outline
- Introduction
- Motivation
- SLIQ Algorithm
- Building tree
- Pruning
- Example
- Results
- Conclusion

Introduction
- Most classification algorithms are designed for memory-resident data, which limits their suitability for mining large datasets.
- Solution: build a scalable classifier, SLIQ.
- SLIQ stands for Supervised Learning In Quest; Quest was the data mining project at IBM.

Motivation Recall (ID3, C4.5, CART):

Motivation
Non-scalable decision trees:
- The complexity lies in determining the best split for each attribute.
- For numerical attributes, the cost of evaluating splits is dominated by the cost of sorting the values at each node.
- For categorical attributes, the cost of evaluating splits is dominated by the cost of searching for the best subset.
- Pruning: cross-validation is impractical for large datasets, and dividing the data into a training set and a test set raises the question of how to choose their sizes and distributions.

Motivation
- Goal: improve the scalability of tree classifiers.
- Previous proposals:
  - Sampling the data at each node
  - Discretizing numerical attributes
  - Partitioning the input data and building a tree for each partition
- All of these methods lose accuracy.
- SLIQ: improve learning time without loss in accuracy.

SLIQ
Key features:
- Tree classifier that handles both numerical and categorical attributes
- Presorts numerical attributes before the tree is built
- Breadth-first tree-growing strategy
- Goodness test: Gini index
- Inexpensive tree-pruning algorithm based on the Minimum Description Length (MDL) principle
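
The Gini index goodness test listed among the key features can be sketched as follows. This is a minimal illustration, assuming the standard definitions gini(S) = 1 - sum_j p_j^2 and a size-weighted average over the two sides of a binary split. The function names are illustrative and do not come from the paper.

```python
from collections import Counter

def gini(labels):
    """Gini index of a collection of class labels: 1 - sum_j p_j^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def gini_split(left, right):
    """Size-weighted Gini index of a binary split; lower is better."""
    n = len(left) + len(right)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)

# A pure node scores 0; an evenly mixed two-class node scores 0.5.
pure = gini(["G", "G"])
mixed = gini(["G", "B"])
```

A split that separates the classes perfectly has a weighted index of 0, which is why the lowest value is taken as the best split.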

SLIQ - Algorithm
- Eliminate the need to sort the data at each node.
- Create a sorted list for each numerical attribute.
- Create a class list.
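
The two data structures named on this slide can be sketched as below. This is an illustration under assumptions rather than the authors' implementation: each attribute-list entry is taken to be a (value, record index) pair, each class-list entry a [class label, leaf id] pair, and the toy records and the name `build_lists` are made up for the example.

```python
def build_lists(records, numeric_attrs, class_attr):
    """Build one presorted attribute list per numerical attribute, plus the class list.

    Attribute-list entry: (value, record_index).
    Class-list entry: [class_label, leaf_id]; every record starts at the root (leaf 0).
    """
    class_list = [[rec[class_attr], 0] for rec in records]
    attr_lists = {}
    for attr in numeric_attrs:
        entries = [(rec[attr], i) for i, rec in enumerate(records)]
        entries.sort(key=lambda e: e[0])  # sorted once, before the tree is grown
        attr_lists[attr] = entries
    return attr_lists, class_list

# Hypothetical toy data, not taken from the slides.
records = [
    {"age": 30, "salary": 65, "cls": "G"},
    {"age": 23, "salary": 15, "cls": "B"},
    {"age": 40, "salary": 75, "cls": "G"},
]
attr_lists, class_list = build_lists(records, ["age", "salary"], "cls")
```

Because the record index travels with each value, the sorted lists never need to be re-sorted when a node is split; only the leaf ids in the class list change.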

SLIQ - Algorithm Example:

SLIQ - Algorithm Split evaluation:
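
With presorted lists, split evaluation can be sketched as a single scan over one attribute list that maintains class histograms for every current leaf. The sketch below is an assumption-laden illustration: the data layout ((value, record index) pairs and [label, leaf id] class-list entries), the use of midpoints as thresholds, and all names are choices made for the example.

```python
from collections import Counter, defaultdict

def gini_from_counts(counts):
    """Gini index computed from a class histogram."""
    n = sum(counts.values())
    return 0.0 if n == 0 else 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_splits(attr_list, class_list):
    """Scan a presorted attribute list once.

    Returns, per leaf, the best (weighted_gini, threshold) for a test
    of the form value <= threshold.
    """
    above = defaultdict(Counter)      # records not yet scanned, per leaf
    for label, leaf in class_list:
        above[leaf][label] += 1
    below = defaultdict(Counter)      # records already scanned, per leaf
    last_value = {}                   # last attribute value seen in each leaf
    best = {}
    for value, idx in attr_list:
        label, leaf = class_list[idx]
        if leaf in last_value and last_value[leaf] != value:
            n_below = sum(below[leaf].values())
            n_above = sum(above[leaf].values())
            n = n_below + n_above
            score = ((n_below / n) * gini_from_counts(below[leaf])
                     + (n_above / n) * gini_from_counts(above[leaf]))
            threshold = (last_value[leaf] + value) / 2  # midpoint (assumed)
            if leaf not in best or score < best[leaf][0]:
                best[leaf] = (score, threshold)
        # Move the current record from the "above" to the "below" histogram.
        below[leaf][label] += 1
        above[leaf][label] -= 1
        last_value[leaf] = value
    return best

# Hypothetical toy input: ages sorted ascending, all records in the root leaf 0.
age_list = [(23, 1), (30, 0), (40, 2)]
classes = [["G", 0], ["B", 0], ["G", 0]]
result = best_splits(age_list, classes)
```

One pass handles all leaves of the current tree level at once, which fits the breadth-first growing strategy.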

SLIQ - Algorithm Example:

SLIQ - Algorithm Update class list:
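
Updating the class list after a split can be sketched as one pass over the attribute list of the splitting attribute, moving each record's leaf reference to the matching child. This is a sketch under assumptions: the entry layouts, the `splits` and `children` dictionaries, and the function name are illustrative and not taken from the slides.

```python
def update_class_list(attr_list, class_list, splits, children):
    """Reassign records to child leaves after splitting on one numerical attribute.

    attr_list : list of (value, record_index)
    class_list: list of [class_label, leaf_id], modified in place
    splits    : leaf_id -> threshold for the test value <= threshold
    children  : leaf_id -> (left_child_id, right_child_id)
    """
    for value, idx in attr_list:
        leaf = class_list[idx][1]
        if leaf in splits:  # only leaves that were split on this attribute
            left, right = children[leaf]
            class_list[idx][1] = left if value <= splits[leaf] else right

# Hypothetical toy input: split the root (leaf 0) on age <= 26.5 into leaves 1 and 2.
age_list = [(23, 1), (30, 0), (40, 2)]
classes = [["G", 0], ["B", 0], ["G", 0]]
update_class_list(age_list, classes, splits={0: 26.5}, children={0: (1, 2)})
```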

SLIQ - Algorithm Example:

SLIQ - Algorithm
- For categorical attributes with large cardinality (above a threshold), the best split is computed greedily; otherwise all possible subsets are evaluated.
- When a node becomes pure, stop splitting it and condense the attribute lists by discarding the examples that belong to the pure node.
- SLIQ scales to large datasets with no loss in accuracy, because the splits evaluated with or without presorting are identical.
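
The slide does not spell out the greedy procedure for large-cardinality categorical attributes. One plausible scheme, shown here purely as an assumption, grows the subset one value at a time and stops when no further value lowers the weighted Gini index. All names are illustrative.

```python
from collections import Counter

def gini_from_counts(counts):
    """Gini index computed from a class histogram."""
    n = sum(counts.values())
    return 0.0 if n == 0 else 1.0 - sum((c / n) ** 2 for c in counts.values())

def weighted_gini(in_counts, out_counts):
    """Size-weighted Gini index of the split (in subset / not in subset)."""
    n_in, n_out = sum(in_counts.values()), sum(out_counts.values())
    n = n_in + n_out
    return (n_in / n) * gini_from_counts(in_counts) + (n_out / n) * gini_from_counts(out_counts)

def greedy_subset(values, labels):
    """Greedily grow a subset S for the test 'value in S' (assumed hill-climbing scheme)."""
    per_value = {}
    for v, y in zip(values, labels):
        per_value.setdefault(v, Counter())[y] += 1
    total = Counter(labels)
    subset, in_counts, best = set(), Counter(), float("inf")
    while True:
        choice = None
        for v in per_value:
            if v in subset:
                continue
            trial_in = in_counts + per_value[v]
            trial_out = total - trial_in
            if not trial_out:   # everything on one side: not a split
                continue
            score = weighted_gini(trial_in, trial_out)
            if score < best:
                best, choice = score, v
        if choice is None:      # no remaining value improves the split
            return subset, best
        subset.add(choice)
        in_counts = in_counts + per_value[choice]

# Hypothetical toy input: value "b" alone separates class "y" from class "x".
subset, score = greedy_subset(["a", "b", "c", "a", "b", "c"],
                              ["x", "y", "x", "x", "y", "x"])
```

Evaluating every subset needs a number of candidate splits that is exponential in the cardinality, while this greedy loop evaluates at most a quadratic number, which is the motivation for switching above a threshold.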

SLIQ - Pruning
- Post-pruning algorithm based on the Minimum Description Length (MDL) principle.
- Find a model that minimizes Cost(M, D) = Cost(D|M) + Cost(M), where:
  - Cost(M) is the cost of the model
  - Cost(D|M) is the cost of encoding the data D if model M is given

SLIQ - Pruning
- Cost of the data: classification error.
- Cost of the model:
  - Encoding the tree: number of bits
  - Encoding the splits: a constant for a numerical attribute (empirically 1); for a categorical attribute the cost depends on cardinality
- MDL pruning evaluates the code length at each node to determine whether to prune one or both children or to leave the node intact.

SLIQ - Pruning
Three pruning strategies:
- Full: prune both children and convert the node into a leaf.
- Partial: convert the node into a leaf, prune the left child, prune the right child, or leave the node intact.
- Hybrid: apply the Full method first and then the Partial options (prune left, prune right, or leave intact).
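
The Full strategy can be sketched as a bottom-up comparison of two code lengths at each node. This is a simplified sketch under stated assumptions: one bit is charged per node for encoding the tree, the split cost defaults to 1 (the numerical case mentioned earlier), and the `Node` class and `prune_full` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    errors: int                      # training records misclassified if this node is a leaf
    split_cost: float = 1.0          # cost of encoding the split test (assumed 1 for numerical)
    left: Optional["Node"] = None
    right: Optional["Node"] = None

NODE_BITS = 1.0  # assumption: one bit says whether a node is a leaf or internal

def prune_full(node):
    """Return the MDL cost of the subtree after pruning it bottom-up.

    The node is converted into a leaf when that is no more costly than
    keeping its split and both (already pruned) children.
    """
    cost_leaf = NODE_BITS + node.errors
    if node.left is None or node.right is None:
        return cost_leaf
    cost_both = (NODE_BITS + node.split_cost
                 + prune_full(node.left) + prune_full(node.right))
    if cost_leaf <= cost_both:
        node.left = node.right = None   # prune both children
        return cost_leaf
    return cost_both

# The split removes 4 of 5 errors, so keeping it is cheaper (5 bits vs 6).
useful = Node(errors=5, left=Node(errors=0), right=Node(errors=1))
useful_cost = prune_full(useful)

# The split removes only 1 error, so the node is turned into a leaf (2 bits vs 4).
useless = Node(errors=1, left=Node(errors=0), right=Node(errors=0))
useless_cost = prune_full(useless)
```

The Partial and Hybrid strategies would add two more candidates at each node (keep only the left child, keep only the right child) and pick the cheapest option.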

Results SLIQ was tested on the datasets:

Results Pruning strategy comparison:

Results Accuracy:

Results Scalability:

Conclusion
- SLIQ proves to be a fast, low-cost and scalable classifier that builds accurate trees.
- In empirical tests comparing SLIQ with other tree-based classifiers, SLIQ achieves comparable accuracy while producing smaller decision trees.
- Open question on scalability: memory becomes a problem as the number of attributes or the number of classes increases.

References [1] M. Mehta, R. Agrawal and J. Rissanen, "SLIQ: A Fast Scalable Classifier for Data Mining", in Proceedings of the 5th International Conference on Extending Database Technology, Avignon, France, Mar

THANK YOU!