CIS671-Knowledge Discovery and Data Mining Vasileios Megalooikonomou Dept. of Computer and Information Sciences Temple University AI reminders (based on notes by Christos Faloutsos)

Agenda
Statistics
AI - decision trees
 – Problem
 – Approach
 – Conclusions
DB

Decision Trees
Problem: classification, i.e., given a training set (N tuples, with M attributes, plus a label attribute), find rules to predict the label for newcomers. Pictorially:

Decision trees ??

Decision trees
Issues:
 – missing values
 – noise
 – ‘rare’ events

Decision trees
Types of attributes:
 – numerical (= continuous), e.g., ‘salary’
 – ordinal (= integer), e.g., ‘# of children’
 – nominal (= categorical), e.g., ‘car-type’

Decision trees
Pictorially, we have a scatter plot of the training tuples over two numerical attributes (x-axis: num. attr#1, e.g., ‘age’; y-axis: num. attr#2, e.g., chol-level).

Decision trees
…and we want to label a new point ‘?’ in the same (age, chol-level) plane.

Decision trees
So we build a decision tree: the plane is cut at age = 50 and chol-level = 40, and the region containing ‘?’ determines its label.

Decision trees
So we build a decision tree:

age < 50?
 Y: ‘+’
 N: chol. < 40?
     N: ‘-’
     Y: …
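One plausible reading of this tree as code - a minimal Python sketch (the function name and the elided subtree are assumptions, not from the slides):

```python
def classify(age, chol):
    """Walk the example decision tree: test age first, then cholesterol level."""
    if age < 50:
        return "+"      # left branch is the pure '+' leaf
    if chol < 40:
        return "..."    # this subtree is elided on the slide
    return "-"
```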

Agenda
Statistics
AI - decision trees
 – Problem
 – Approach
 – Conclusions
DB

Decision trees
Typically, two steps:
 – tree building
 – tree pruning (to guard against over-training / over-fitting)

Tree building
How? (figure: the scatter plot over ‘age’ vs. chol-level again)

Tree building
How? A: partition, recursively - pseudocode:

Partition(dataset S):
    if all points in S have the same label, then return
    evaluate splits along each attribute A
    pick the best split, to divide S into S1 and S2
    Partition(S1); Partition(S2)

Q1: how to introduce splits along attribute Ai?
Q2: how to evaluate a split?
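The pseudocode above can be sketched in Python for numerical attributes, using entropy (introduced a few slides below) as the split measure; the function names and the dict-based tree representation are illustrative choices, not from the slides:

```python
import math

def entropy(labels):
    """H over the class relative frequencies, in bits; 0 for a pure subset."""
    n = len(labels)
    h = 0.0
    for c in set(labels):
        p = labels.count(c) / n
        h -= p * math.log2(p)
    return h

def partition(rows, labels):
    """Recursive splitting, as in the Partition pseudocode.
    rows: list of numeric feature vectors. Returns a nested dict tree."""
    if len(set(labels)) == 1:              # all points in S have the same label
        return {"label": labels[0]}
    best = None                            # (weighted entropy, attribute, threshold)
    for a in range(len(rows[0])):          # evaluate splits along each attribute
        for t in sorted({r[a] for r in rows}):
            left = [l for r, l in zip(rows, labels) if r[a] < t]
            right = [l for r, l in zip(rows, labels) if r[a] >= t]
            if not left or not right:
                continue
            score = (len(left) * entropy(left)
                     + len(right) * entropy(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, a, t)       # pick the best split
    if best is None:                       # identical rows with mixed labels
        return {"label": max(set(labels), key=labels.count)}
    _, a, t = best
    li = [i for i, r in enumerate(rows) if r[a] < t]
    ri = [i for i, r in enumerate(rows) if r[a] >= t]
    return {"attr": a, "thr": t,           # Partition(S1); Partition(S2)
            "left": partition([rows[i] for i in li], [labels[i] for i in li]),
            "right": partition([rows[i] for i in ri], [labels[i] for i in ri])}
```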

Tree building
Q1: how to introduce splits along attribute Ai?
A1:
 – for numerical attributes: binary split, or multiple split
 – for categorical attributes: compute all subsets (expensive!), or use a greedy approach

Tree building
Q1: how to introduce splits along attribute Ai?
Q2: how to evaluate a split?

Tree building
Q1: how to introduce splits along attribute Ai?
Q2: how to evaluate a split?
A: by how close to uniform each resulting subset is - i.e., we need a measure of uniformity:

Tree building
entropy: H(p+, p-) = -p+ log2 p+ - p- log2 p-, where p+ is the relative frequency of class ‘+’ in S (figure: H plotted against p+; zero at p+ = 0 or 1, maximal at p+ = 1/2)
Any other measure?

Tree building
entropy: H(p+, p-)
‘gini’ index: 1 - p+^2 - p-^2, where p+ is the relative frequency of class ‘+’ in S

Tree building
entropy: H(p+, p-)
‘gini’ index: 1 - p+^2 - p-^2
(How about multiple labels?)

Tree building
Intuition:
 – entropy: # bits needed to encode the class labels
 – gini: classification error, if we randomly guess ‘+’ with prob. p+
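The two measures side by side, for the two-class case - a small Python sketch:

```python
import math

def entropy(p_plus):
    """H(p+, p-): bits per class label; 0 on a pure subset, maximal (1) at p+ = 0.5."""
    h = 0.0
    for p in (p_plus, 1.0 - p_plus):
        if p > 0:
            h -= p * math.log2(p)
    return h

def gini(p_plus):
    """1 - p+^2 - p-^2: error rate of randomly guessing '+' with probability p+."""
    return 1.0 - p_plus ** 2 - (1.0 - p_plus) ** 2
```

Both are zero on a pure subset and peak at a 50/50 mix, which is what makes them usable as uniformity measures.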

Tree building
Thus, we choose the split that reduces entropy / classification error the most. E.g.: (figure: candidate splits on the ‘age’ vs. chol-level scatter plot)

Tree building
Before the split, we need (n+ + n-) * H(p+, p-) = (7+6) * H(7/13, 6/13) bits in total, to encode all the class labels.
After the split, we need 0 bits for the first half (it is pure) and (2+6) * H(2/8, 6/8) bits for the second half.
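Checking the slide’s arithmetic (13 tuples: 7 ‘+’ and 6 ‘-’; the split isolates a pure group of 5 and leaves a mixed group of 8 with 2 ‘+’ and 6 ‘-’):

```python
import math

def H(p):
    """Binary entropy in bits."""
    return sum(-x * math.log2(x) for x in (p, 1 - p) if x > 0)

bits_before = (7 + 6) * H(7 / 13)    # ~12.94 bits to encode all 13 labels
bits_after = 0 + (2 + 6) * H(2 / 8)  # pure half costs 0; ~6.49 bits for the mixed half
```

So this split roughly halves the number of bits needed for the class labels, which is exactly the reduction the split-selection rule rewards.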

Agenda
Statistics
AI - decision trees
 – Problem
 – Approach
    tree building
    tree pruning
 – Conclusions
DB

Tree pruning
What for? (figure: an over-grown tree fits the noise in the ‘age’ vs. chol-level training scatter plot - over-fitting)

Tree pruning
Q: How to do it? (figure: the same ‘age’ vs. chol-level scatter plot)

Tree pruning
Q: How to do it?
A1: use a ‘training’ and a ‘testing’ set - prune the nodes whose removal improves classification on the ‘testing’ set. (Drawbacks?)

Tree pruning
Q: How to do it?
A1: use a ‘training’ and a ‘testing’ set - prune the nodes whose removal improves classification on the ‘testing’ set. (Drawbacks?)
A2: or, rely on MDL (= Minimum Description Length) - in detail:

Tree pruning envision the problem as compression (of what?)

Tree pruning envision the problem as compression (of what?) and try to minimize the # bits to compress (a) the class labels AND (b) the representation of the decision tree

MDL: a brilliant idea - e.g., the best degree-n polynomial to compress these points is the one that minimizes (sum of errors + n).
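A toy rendering of that idea, assuming sum of *squared* errors and a penalty of 1 per degree (both are stand-in choices; real MDL derives the costs from an actual coding scheme):

```python
def polyfit(xs, ys, deg):
    """Least-squares degree-deg fit via normal equations + Gaussian elimination."""
    m = deg + 1
    A = [[sum(x ** (i + j) for x in xs) for j in range(m)] for i in range(m)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(m)]
    for col in range(m):                       # elimination with partial pivoting
        piv = max(range(col, m), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, m):
            f = A[r][col] / A[col][col]
            for c in range(col, m):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    coeffs = [0.0] * m                         # back substitution
    for i in reversed(range(m)):
        coeffs[i] = (b[i] - sum(A[i][j] * coeffs[j]
                                for j in range(i + 1, m))) / A[i][i]
    return coeffs

def sse(xs, ys, coeffs):
    """Sum of squared errors of the fitted polynomial."""
    pred = lambda x: sum(c * x ** i for i, c in enumerate(coeffs))
    return sum((y - pred(x)) ** 2 for x, y in zip(xs, ys))

xs = [0, 1, 2, 3, 4, 5]
ys = [0.1, 1.1, 1.9, 3.2, 3.9, 5.1]            # roughly y = x, plus noise
# MDL-style choice: minimize (sum of errors + n) over candidate degrees n
best_deg = min(range(5), key=lambda n: sse(xs, ys, polyfit(xs, ys, n)) + n)
```

Degree 1 wins here: higher degrees shave only a little error but pay a full extra unit of model-description cost, which is the over-fitting trade-off pruning exploits.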

Conclusions Classification through trees Building phase - splitting policies Pruning phase (to avoid over-fitting) Observation: classification is subtly related to compression

Reference
M. Mehta, R. Agrawal and J. Rissanen, ‘SLIQ: A Fast Scalable Classifier for Data Mining’, Proc. of the Fifth Int’l Conference on Extending Database Technology (EDBT), Avignon, France, March 1996.