An Interval Classifier for Database Mining Applications. Rakesh Agrawal, Sakti Ghosh, Tomasz Imielinski, Bala Iyer, Arun Swami. Proceedings of the 18th VLDB Conference.


An Interval Classifier for Database Mining Applications. Rakesh Agrawal, Sakti Ghosh, Tomasz Imielinski, Bala Iyer, Arun Swami. Proceedings of the 18th VLDB Conference, Vancouver, Canada, 1992. Presentation by: Vladan Radosavljevic

Outline Introduction Motivation Interval Classifier Example Results Conclusion

Introduction Given a small set of labeled examples, find a classifier that will efficiently classify a large unlabeled population in a database, or retrieve all the examples from the database that belong to a desired class. Assumptions: the labeled examples are representative of the entire population, and the number of classes (m) is known in advance.

Motivation Why an Interval Classifier? Neural networks are not database-oriented: tuples have to be retrieved into memory one at a time before classification. Decision trees (ID3, CART): binary splits increase computation time, and pruning the tree after it is built makes tree generation more expensive.

Interval Classifier (IC) Key features: a tree classifier. Categorical attributes: one branch for each value. Numerical attributes: the range is decomposed into k intervals, with k determined algorithmically for each node. IC generates SQL queries as the final classification functions!
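Because the final classifier is a set of range conditions per class, turning a tree's leaves into a retrieval query is mechanical. A minimal Python sketch; the table name, column names, and the leaf representation are assumptions for illustration, not the paper's encoding:

```python
def leaves_to_sql(table, leaves):
    """Build a retrieval query for one class from an IC tree's strong leaves.

    Each leaf is a list of (column, lo, hi) range conditions; a tuple
    belongs to the class if it satisfies at least one leaf's conjunction.
    """
    disjuncts = []
    for conds in leaves:
        clause = " AND ".join(f"{col} >= {lo} AND {col} < {hi}"
                              for col, lo, hi in conds)
        disjuncts.append(f"({clause})")
    return f"SELECT * FROM {table} WHERE " + " OR ".join(disjuncts)

query = leaves_to_sql("people", [[("age", 40, 60), ("elevel", 0, 4)]])
# → SELECT * FROM people WHERE (age >= 40 AND age < 60 AND elevel >= 0 AND elevel < 4)
```

Shipping the classifier back to the database as a query is what makes IC "database-oriented": the classification runs inside the DBMS instead of pulling tuples into memory one at a time.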

Interval Classifier - Algorithm Algorithm: Partition the domain of each numerical attribute into a predefined number of intervals, and for each interval determine the winning class (the class that has the largest frequency in that interval). For each attribute, compute the value of a goodness function, the information gain ratio (or the resubstitution error rate), and find the winning attribute A. Then, for each partition of attribute A, set the strength of the winning class, weak or strong, based on its frequency and a predefined threshold. (Slide figure: a row of intervals colored by winning class, each marked weak (W) or strong (S).)
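The first step, finding per-interval winning classes for a numerical attribute, can be sketched as follows. Equal-width intervals and the function signature are assumptions for the sketch; the paper builds these counts from histograms:

```python
from collections import Counter

def winning_classes(values, labels, n_intervals):
    """Split the attribute's range into equal-width intervals and return
    the majority (winning) class of each interval, or None if empty."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_intervals or 1  # guard against a degenerate range
    counts = [Counter() for _ in range(n_intervals)]
    for v, c in zip(values, labels):
        idx = min(int((v - lo) / width), n_intervals - 1)  # clamp the max value
        counts[idx][c] += 1
    return [cnt.most_common(1)[0][0] if cnt else None for cnt in counts]

winning_classes([1, 2, 3, 10, 11, 12], list("AAABBB"), 2)  # → ['A', 'B']
```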

Interval Classifier - Algorithm … Merge adjacent intervals that have the same winning class with equal strengths. Divide the training examples among the resulting intervals. Strong intervals become leaves labeled with their winning class. Recursively proceed with the weak intervals; stop when all intervals are strong or the specified maximum tree depth is reached. (Slide figure: weak (W) and strong (S) intervals, with the strong ones becoming leaves.)
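The merge step can be sketched as a single pass over adjacent intervals; the `(lo, hi, winner, strength)` tuple representation is an assumption, not the paper's data structure:

```python
def merge_intervals(intervals):
    """Merge adjacent intervals that share both winning class and strength.

    Each interval is a (lo, hi, winner, strength) tuple with strength
    'S' (strong) or 'W' (weak).
    """
    merged = []
    for lo, hi, winner, strength in intervals:
        if merged and merged[-1][2:] == (winner, strength):
            # extend the previous interval instead of starting a new one
            merged[-1] = (merged[-1][0], hi, winner, strength)
        else:
            merged.append((lo, hi, winner, strength))
    return merged

merge_intervals([(0, 10, 'A', 'S'), (10, 20, 'A', 'S'), (20, 30, 'B', 'W')])
# → [(0, 20, 'A', 'S'), (20, 30, 'B', 'W')]
```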

Interval Classifier - Pruning Pruning is dynamic: it happens while the tree is generated. Find the accuracy of each node using the training set, and expand the node only if its classification error is below a threshold that depends on the number of leaves and the overall accuracy. The aim is to check whether the expansion will bring an error reduction or not. To avoid pruning too aggressively, each node inherits a certain number of credits from its parent.

Example Age: numerical, uniformly distributed. Zip code: categorical, uniformly distributed. Level of education (elevel): categorical, uniformly distributed. Two classes: A: (age < 40 and elevel 0 to 1) OR (40 <= age < 60 and elevel 0 to 3) OR (age >= 60 and elevel 0); B: otherwise.
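The ground-truth labeling of this synthetic example can be written down directly; `label` is a hypothetical helper name for the sketch:

```python
def label(age, elevel):
    """Assign class A per the slide's disjunction of conditions, else B."""
    if age < 40 and 0 <= elevel <= 1:
        return "A"
    if 40 <= age < 60 and 0 <= elevel <= 3:
        return "A"
    if age >= 60 and elevel == 0:
        return "A"
    return "B"

[label(30, 1), label(30, 2), label(50, 3), label(65, 0), label(65, 1)]
# → ['A', 'B', 'A', 'A', 'B']
```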

Example 1000 training tuples. Calculate the class histogram for the numerical attribute age by choosing 100 equi-distant intervals and determining the winning class for each partition. Find the best attribute based on the resubstitution error rate: 1 - sum(win_freq(part)) / total_freq.
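The resubstitution error rate from the slide can be computed directly from per-interval label counts; representing the partitions as a list of label lists is an assumption for the sketch:

```python
from collections import Counter

def resubstitution_error(partition_labels):
    """1 - sum(win_freq(part)) / total_freq, where win_freq(part) is the
    frequency of the winning (majority) class in each partition."""
    total = sum(len(labels) for labels in partition_labels)
    wins = sum(Counter(labels).most_common(1)[0][1]
               for labels in partition_labels if labels)
    return 1 - wins / total

resubstitution_error([['A', 'A', 'B'], ['B', 'B']])  # ≈ 0.2 (1 misclassified of 5)
```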

Example Choose age (it has the smallest error rate) and partition its domain by merging adjacent intervals that have the same winning class with equal strengths. (Slide figure: the merged age intervals labeled with their winning classes.)

Example Proceed with the weak nodes and repeat the same procedure. Finally, the resulting classifier recovers the classes defined in the beginning: A: (age < 40 and elevel 0 to 1) OR (40 <= age < 60 and elevel 0 to 3) OR (age >= 60 and elevel 0); B: otherwise.

Results Generate examples with smooth boundaries between the groups. Training set of 2500 tuples, plus a test set. Fixed precision: threshold 0.9. Adaptive precision: adaptive threshold. Error pruning: credits. Function 5: nonlinear.

Results Comparing to ID3: (slide figure with the comparison charts)

Conclusion IC interfaces efficiently with database systems. Good treatment of numerical attributes. Dynamic pruning. Open questions: too many user-defined parameters? Scalability? Are k-ary trees less accurate in practice than binary ones?

References [1] R. Agrawal, S. Ghosh, T. Imielinski, B. Iyer, A. Swami: “An Interval Classifier for Database Mining Applications”, in Proceedings of the 18th VLDB Conference, Vancouver, BC, Canada, 1992.

THANK YOU!