EECS 647: Introduction to Database Systems
Instructor: Luke Huan, Spring 2007
Review
- Nature of the data
- Data types and measurement scales: SSN (nominal), Grade (ordinal), Temperature in degrees (interval), Length (ratio)
- Data quality: noise, outliers, missing/duplicated data
Summary
- Common tools for exploratory data analysis: histogram, box plot, scatter plot, correlation
- Association: each rule L => R has two parts: L, the left-hand item set, and R, the right-hand item set
- Each rule is measured by two parameters: support and confidence
An Exercise

Transaction database TDB:

TID   Items bought
100   f, a, c, d, g, i, m, p
200   a, b, c, f, l, m, o
300   b, f, h, j, o
400   b, c, k, s, p
500   a, f, c, e, l, p, m, n

- The support of pattern {a, c, m} is sup(acm) = 3
- The support of pattern {a, c} is sup(ac) = 3
- Given min_sup = 3, {a, c, m} is frequent
- The confidence of the rule {a, c} => {m} is sup(acm) / sup(ac) = 3/3 = 100%
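To make the counting concrete, here is a minimal Python sketch (ours, not the lecture's) that reproduces these numbers; the names tdb, support, and confidence are illustrative:

```python
# Support = number of transactions containing the itemset;
# confidence(L => R) = sup(L u R) / sup(L).
tdb = {
    100: {"f", "a", "c", "d", "g", "i", "m", "p"},
    200: {"a", "b", "c", "f", "l", "m", "o"},
    300: {"b", "f", "h", "j", "o"},
    400: {"b", "c", "k", "s", "p"},
    500: {"a", "f", "c", "e", "l", "p", "m", "n"},
}

def support(itemset):
    # Count transactions that contain every item in `itemset`.
    return sum(1 for items in tdb.values() if itemset <= items)

def confidence(lhs, rhs):
    return support(lhs | rhs) / support(lhs)

print(support({"a", "c", "m"}))       # 3
print(support({"a", "c"}))            # 3
print(confidence({"a", "c"}, {"m"}))  # 1.0
```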
Today's Topic
- Association rule mining: the Apriori property
- Classification: decision tree construction
Reducing Number of Candidates
- Apriori principle: if an itemset is frequent, then all of its subsets must also be frequent.
- The contrapositive gives the pruning rule: if an itemset is infrequent, none of its supersets can be frequent, so they need not be counted (illustrated in the example and code sketch below).
- The principle holds because of the following property of the support measure: the support of an itemset never exceeds the support of any of its subsets.
- This is known as the anti-monotone property of support.
Illustrating Apriori Principle
[Figure: itemset lattice; once an itemset is found to be infrequent, all of its supersets are pruned.]
An Example of Apriori Principle
Min_sup = 2. Transaction database:

TID  Items
10   a, c, d
20   b, c, e
30   a, b, c, e
40   b, e

[Figure: the itemset lattice over {A, B, C, D, E}, from single items up to ABCDE. Item D has support 1 < min_sup, so D and every superset of D are pruned without being counted.]
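The following is a small, illustrative Apriori sketch (our code, with assumed names db and sup) that runs the generate-and-prune loop on this database:

```python
# Level-wise Apriori with min_sup = 2 on the database above.
from itertools import combinations

db = [{"a", "c", "d"}, {"b", "c", "e"}, {"a", "b", "c", "e"}, {"b", "e"}]
min_sup = 2

def sup(itemset):
    return sum(1 for t in db if itemset <= t)

# Level 1: frequent single items.
items = sorted({i for t in db for i in t})
frequent = [frozenset([i]) for i in items if sup(frozenset([i])) >= min_sup]

level, k = frequent, 2
while level:
    # Candidate generation: unions of frequent (k-1)-itemsets of size k ...
    candidates = {a | b for a in level for b in level if len(a | b) == k}
    # ... keeping only candidates whose (k-1)-subsets are all frequent
    # (this is the Apriori pruning step).
    candidates = {c for c in candidates
                  if all(frozenset(s) in set(level)
                         for s in combinations(c, k - 1))}
    level = [c for c in candidates if sup(c) >= min_sup]
    frequent += level
    k += 1

for f in frequent:
    print(sorted(f), sup(f))  # e.g. ['a', 'c'] 2, ['b', 'c', 'e'] 2, ...
```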
Classification
- Goal
- Decision trees
- Model evaluation
Classification: Definition
- Given a collection of records (the training set), where each record contains a set of attributes and one of the attributes is the class.
- Find a model that predicts the class attribute as a function of the values of the other attributes.
- Goal: previously unseen records should be assigned a class as accurately as possible.
- A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it (see the sketch below).
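As an illustration of the train/test protocol, here is a minimal random split in plain Python; the record layout and field names are assumptions, not the course's data set:

```python
# A minimal sketch of a random train/test split; the (attributes, class)
# record layout and the 75/25 ratio are illustrative assumptions.
import random

records = [({"rank": "professor", "years": 7}, "yes"),
           ({"rank": "assistant", "years": 3}, "no"),
           ({"rank": "professor", "years": 2}, "yes"),
           ({"rank": "associate", "years": 8}, "yes")]

random.shuffle(records)                 # shuffle before splitting
cut = int(0.75 * len(records))          # 75% train / 25% test
train_set, test_set = records[:cut], records[cut:]

# Build the model on train_set, then measure its accuracy on test_set.
```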
Illustrating Classification Task
[Figure: training data is fed to a learning algorithm, which induces a classifier (model); here the learned model is the rule IF rank = 'professor' OR years > 6 THEN tenured = 'yes'.]
Apply Model to Data

[Figure: the learned classifier is applied to labeled test data to estimate accuracy, and then to unseen records; for the unseen record (Jeff, Professor, 4), the rule above answers the question Tenured? with 'yes', since rank = 'professor'.]
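For illustration, the learned rule can be written directly as code; predict_tenured is our name for it, not the lecture's:

```python
# The rule learned above, written as a function.
def predict_tenured(rank, years):
    # IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
    return "yes" if rank == "professor" or years > 6 else "no"

print(predict_tenured("professor", 4))  # 'yes' for (Jeff, Professor, 4)
```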
Examples of Classification Task
- Predicting tumor cells as benign or malignant
- Classifying credit card transactions as legitimate or fraudulent
- Classifying secondary structures of proteins as alpha-helix, beta-sheet, or random coil
- Categorizing news stories as finance, weather, entertainment, sports, etc.
An Example Decision Tree
[Figure: an example decision tree. The root tests age: for age <= 30 the tree tests student (no => no, yes => yes); for age 30..40 it predicts yes; for age > 40 it tests credit rating (excellent => no, fair => yes).]
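Read as code, the tree is just nested if/else tests; this sketch assumes the branch outcomes described in the figure caption above:

```python
# The example tree as nested if/else (an interpretation of the figure,
# not verbatim lecture code).
def classify(age, student, credit_rating):
    if age <= 30:
        return "yes" if student == "yes" else "no"
    elif age <= 40:                       # the 30..40 branch is a pure leaf
        return "yes"
    else:                                 # age > 40
        return "yes" if credit_rating == "fair" else "no"

print(classify(25, "yes", "fair"))        # 'yes'
print(classify(45, "no", "excellent"))    # 'no'
```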
Tree Induction
- Greedy strategy: split the records based on the attribute test that optimizes a chosen criterion.
- How do we determine the best split?
How to determine the Best Split
Before splitting: 10 records of class C0 and 10 records of class C1.

[Figure: three candidate test conditions - Student? (yes/no), Income? (high/medium/low), and Age? - each splitting the 20 records into child nodes with different C0/C1 class counts.]

Which test condition is the best?
How to determine the Best Split
- Greedy approach: nodes with a homogeneous class distribution are preferred.
- We need a measure of node impurity:
  - Non-homogeneous: high degree of impurity
  - Homogeneous: low degree of impurity
How to Find the Best Split
Before splitting, the node's impurity is M0.

[Figure: splitting on attribute A yields nodes N1 and N2 with impurities M1 and M2, combined (size-weighted) into M12; splitting on attribute B yields nodes N3 and N4 with impurities M3 and M4, combined into M34.]

Compare Gain = M0 - M12 vs. Gain = M0 - M34, and choose the split with the larger gain (see the sketch below).
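A minimal sketch of this comparison, assuming each child is given as a list of class labels and `impurity` is some impurity function (e.g., the Gini index introduced next):

```python
# Size-weighted impurity of a split's children (M12 or M34 above).
def weighted_impurity(children, impurity):
    n = sum(len(c) for c in children)
    return sum(len(c) / n * impurity(c) for c in children)

# Gain of a split relative to the unsplit parent (M0 - M12, M0 - M34).
def gain(parent, children, impurity):
    return impurity(parent) - weighted_impurity(children, impurity)

# Choose the candidate split (a list of children) with the largest gain:
# best = max(candidate_splits, key=lambda ch: gain(parent, ch, impurity))
```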
Measure of Impurity: GINI
Gini index for a given node t:

  GINI(t) = 1 - Σ_j [p(j | t)]^2

where p(j | t) is the relative frequency of class j at node t.

- Maximum value 1 - 1/n_c (n_c = number of classes) when records are equally distributed among all classes, implying the least interesting information.
- Minimum value 0.0 when all records belong to one class, implying the most interesting information.
Examples for computing GINI
- Node with class counts C1 = 0, C2 = 6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1; Gini = 1 - P(C1)^2 - P(C2)^2 = 1 - 0 - 1 = 0
- Node with class counts C1 = 1, C2 = 5: P(C1) = 1/6, P(C2) = 5/6; Gini = 1 - (1/6)^2 - (5/6)^2 = 0.278
- Node with class counts C1 = 2, C2 = 4: P(C1) = 2/6, P(C2) = 4/6; Gini = 1 - (2/6)^2 - (4/6)^2 = 0.444
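A tiny sketch checking these numbers from raw class counts (the function name gini is ours):

```python
# Gini index from raw class counts; reproduces the examples above.
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

print(gini([0, 6]))  # 0.0
print(gini([1, 5]))  # 0.2777...
print(gini([2, 4]))  # 0.4444...
```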
Splitting Based on GINI
- Used in CART, SLIQ, and SPRINT.
- When a node p is split into k partitions (children), the quality of the split is computed as

    GINI_split = Σ_{i=1..k} (n_i / n) GINI(i)

  where n_i is the number of records at child i and n is the number of records at node p.
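Continuing the sketch, the split quality is the size-weighted Gini of the children (reusing gini from the previous sketch):

```python
# GINI_split from the formula above; each child is a list of class counts.
def gini_split(children_counts):
    n = sum(sum(c) for c in children_counts)
    return sum(sum(c) / n * gini(c) for c in children_counts)

# Splitting the 20-record parent (10 C0, 10 C1; Gini 0.5) into two children:
print(gini_split([[6, 4], [4, 6]]))  # 0.48 -- a small improvement over 0.5
```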
Decision Tree Based Classification
Advantages:
- Inexpensive to construct
- Extremely fast at classifying unknown records
- Easy to interpret for small trees
- Accuracy comparable to other classification techniques on many simple data sets
Example: C4.5
- Simple depth-first construction
- Uses information gain
- Sorts continuous attributes at each node
- Needs the entire data set to fit in memory, so it is unsuitable for large data sets, which require out-of-core sorting
- You can download the software from:
Practical Issues of Classification
- Underfitting and overfitting
- Missing values
- Costs of classification
Underfitting and Overfitting (Example)
- 500 circular and 500 triangular data points
- Circular points: 0.5 <= sqrt(x1^2 + x2^2) <= 1
- Triangular points: sqrt(x1^2 + x2^2) < 0.5 or sqrt(x1^2 + x2^2) > 1
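As an illustration only, one way to generate such a data set is rejection sampling; the sampling square [-1, 1] x [-1, 1] is our assumption, not stated on the slide:

```python
# Rejection-sampling sketch for the two-class data set above.
import math
import random

def sample(n, keep):
    pts = []
    while len(pts) < n:
        x1, x2 = random.uniform(-1, 1), random.uniform(-1, 1)
        if keep(math.hypot(x1, x2)):      # hypot = sqrt(x1^2 + x2^2)
            pts.append((x1, x2))
    return pts

circular   = sample(500, lambda r: 0.5 <= r <= 1)     # the annulus
triangular = sample(500, lambda r: r < 0.5 or r > 1)  # inside / outside it
```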
Underfitting and Overfitting
- Underfitting: the model is too simple, so both training and test errors are large.
- Overfitting: the model is too complex, so the test error increases even as the training error keeps decreasing.
Overfitting due to Noise
[Figure: the decision boundary is distorted by a noise point.]
Overfitting due to Insufficient Examples
- A lack of data points in the lower half of the diagram makes it difficult to predict the class labels of that region correctly.
- The insufficient number of training records in the region causes the decision tree to predict the test examples using other training records that are irrelevant to the classification task.
Notes on Overfitting
- Overfitting results in decision trees that are more complex than necessary.
- The training error no longer provides a good estimate of how well the tree will perform on previously unseen records.
- We need new ways of estimating errors.
Decision Boundary
- The border line between two neighboring regions of different classes is known as the decision boundary.
- Here the decision boundary is parallel to the axes because each test condition involves a single attribute at a time.
Oblique Decision Trees
[Figure: an oblique split, x + y < 1, separating class + from class -.]

- The test condition may involve multiple attributes
- More expressive representation
- Finding the optimal test condition is computationally expensive
Model Evaluation
- How do you know your algorithm is good?
Metrics for Performance Evaluation
- Focus on the predictive capability of a model, rather than on how long it takes to classify or build models, scalability, etc.
- Confusion matrix:

                         PREDICTED CLASS
                         Class=Yes   Class=No
  ACTUAL    Class=Yes    a (TP)      b (FN)
  CLASS     Class=No     c (FP)      d (TN)

  a: TP (true positive), b: FN (false negative), c: FP (false positive), d: TN (true negative)
Metrics for Performance Evaluation…
The most widely used metric, in terms of the confusion-matrix counts above, is accuracy:

  Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
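A one-line sketch of the formula, on made-up counts:

```python
# Accuracy from the confusion-matrix counts a, b, c, d defined above.
def accuracy(a, b, c, d):
    return (a + d) / (a + b + c + d)

print(accuracy(40, 10, 5, 45))  # 0.85 on illustrative counts
```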
Cost-Sensitive Measures
- Precision (p) = a / (a + c)
- Recall (r) = a / (a + b)
- F-measure (F) = 2rp / (r + p) = 2a / (2a + b + c)
- These weight false positives and false negatives differently, which matters when the two error types have different costs.
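And the corresponding sketch for these measures, on the same illustrative counts as above:

```python
# Precision, recall, and F-measure from confusion-matrix counts.
def precision(a, b, c, d):
    return a / (a + c)

def recall(a, b, c, d):
    return a / (a + b)

def f_measure(a, b, c, d):
    p, r = precision(a, b, c, d), recall(a, b, c, d)
    return 2 * r * p / (r + p)

print(precision(40, 10, 5, 45))  # 0.888...
print(recall(40, 10, 5, 45))     # 0.8
print(f_measure(40, 10, 5, 45))  # 0.842...
```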
Summary
- The Apriori property and association rule mining
- Classification: decision trees