EECS 647: Introduction to Database Systems
Instructor: Luke Huan, Spring 2007
Review
- Nature of the data
- Data types and measurement scales: SSN (nominal), Grade (ordinal), Temperature in degrees (interval), Length (ratio)
- Data quality: noise, outliers, missing/duplicated data
Summary
- Common tools for exploratory data analysis: histogram, box plot, scatter plot, correlation
- Association: each rule L => R has two parts: L, the left-hand item set, and R, the right-hand item set
- Each rule is measured by two parameters: support and confidence
An Exercise

Transaction database TDB:

TID   Items bought
100   f, a, c, d, g, i, m, p
200   a, b, c, f, l, m, o
300   b, f, h, j, o
400   b, c, k, s, p
500   a, f, c, e, l, p, m, n

- The support of pattern {a, c, m} is sup(acm) = 3
- The support of pattern {a, c} is sup(ac) = 3
- Given min_sup = 3, {a, c, m} is frequent
- The confidence of the rule {a, c} => {m} is sup(acm) / sup(ac) = 3/3 = 100%
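To make the counting concrete, here is a minimal Python sketch (ours, not the lecture's) that reproduces these numbers; the names tdb, support, and confidence are illustrative:

```python
# Support = number of transactions containing the itemset;
# confidence(L => R) = sup(L u R) / sup(L).
tdb = {
    100: {"f", "a", "c", "d", "g", "i", "m", "p"},
    200: {"a", "b", "c", "f", "l", "m", "o"},
    300: {"b", "f", "h", "j", "o"},
    400: {"b", "c", "k", "s", "p"},
    500: {"a", "f", "c", "e", "l", "p", "m", "n"},
}

def support(itemset):
    # Count transactions that contain every item in `itemset`.
    return sum(1 for items in tdb.values() if itemset <= items)

def confidence(lhs, rhs):
    return support(lhs | rhs) / support(lhs)

print(support({"a", "c", "m"}))       # 3
print(support({"a", "c"}))            # 3
print(confidence({"a", "c"}, {"m"}))  # 1.0
```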
Today's Topic
- Association rule mining: the Apriori property
- Classification: decision tree construction
Reducing Number of Candidates
- Apriori principle: if an itemset is frequent, then all of its subsets must also be frequent.
- The contrapositive gives the pruning rule: if an itemset is infrequent, none of its supersets can be frequent, so they need not be counted (illustrated in the example and code sketch below).
- The principle holds because of the following property of the support measure: the support of an itemset never exceeds the support of any of its subsets.
- This is known as the anti-monotone property of support.
Illustrating Apriori Principle
[Figure: itemset lattice; once an itemset is found to be infrequent, all of its supersets are pruned.]
An Example of Apriori Principle
Min_sup = 2. Transaction database:

TID  Items
10   a, c, d
20   b, c, e
30   a, b, c, e
40   b, e

[Figure: the itemset lattice over {A, B, C, D, E}, from single items up to ABCDE. Item D has support 1 < min_sup, so D and every superset of D are pruned without being counted.]
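The following is a small, illustrative Apriori sketch (our code, with assumed names db and sup) that runs the generate-and-prune loop on this database:

```python
# Level-wise Apriori with min_sup = 2 on the database above.
from itertools import combinations

db = [{"a", "c", "d"}, {"b", "c", "e"}, {"a", "b", "c", "e"}, {"b", "e"}]
min_sup = 2

def sup(itemset):
    return sum(1 for t in db if itemset <= t)

# Level 1: frequent single items.
items = sorted({i for t in db for i in t})
frequent = [frozenset([i]) for i in items if sup(frozenset([i])) >= min_sup]

level, k = frequent, 2
while level:
    # Candidate generation: unions of frequent (k-1)-itemsets of size k ...
    candidates = {a | b for a in level for b in level if len(a | b) == k}
    # ... keeping only candidates whose (k-1)-subsets are all frequent
    # (this is the Apriori pruning step).
    candidates = {c for c in candidates
                  if all(frozenset(s) in set(level)
                         for s in combinations(c, k - 1))}
    level = [c for c in candidates if sup(c) >= min_sup]
    frequent += level
    k += 1

for f in frequent:
    print(sorted(f), sup(f))  # e.g. ['a', 'c'] 2, ['b', 'c', 'e'] 2, ...
```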
Classification
- Goal
- Decision trees
- Model evaluation
Classification: Definition
- Given a collection of records (the training set), where each record contains a set of attributes and one of the attributes is the class.
- Find a model that predicts the class attribute as a function of the values of the other attributes.
- Goal: previously unseen records should be assigned a class as accurately as possible.
- A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it (see the sketch below).
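As an illustration of the train/test protocol, here is a minimal random split in plain Python; the record layout and field names are assumptions, not the course's data set:

```python
# A minimal sketch of a random train/test split; the (attributes, class)
# record layout and the 75/25 ratio are illustrative assumptions.
import random

records = [({"rank": "professor", "years": 7}, "yes"),
           ({"rank": "assistant", "years": 3}, "no"),
           ({"rank": "professor", "years": 2}, "yes"),
           ({"rank": "associate", "years": 8}, "yes")]

random.shuffle(records)                 # shuffle before splitting
cut = int(0.75 * len(records))          # 75% train / 25% test
train_set, test_set = records[:cut], records[cut:]

# Build the model on train_set, then measure its accuracy on test_set.
```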
Illustrating Classification Task
[Figure: training data is fed to a learning algorithm, which induces a classifier (model); here the learned model is the rule IF rank = 'professor' OR years > 6 THEN tenured = 'yes'.]
Apply Model to Data

[Figure: the learned classifier is applied to labeled test data to estimate accuracy, and then to unseen records; for the unseen record (Jeff, Professor, 4), the rule above answers the question Tenured? with 'yes', since rank = 'professor'.]
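For illustration, the learned rule can be written directly as code; predict_tenured is our name for it, not the lecture's:

```python
# The rule learned above, written as a function.
def predict_tenured(rank, years):
    # IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
    return "yes" if rank == "professor" or years > 6 else "no"

print(predict_tenured("professor", 4))  # 'yes' for (Jeff, Professor, 4)
```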
Examples of Classification Task
- Predicting tumor cells as benign or malignant
- Classifying credit card transactions as legitimate or fraudulent
- Classifying secondary structures of proteins as alpha-helix, beta-sheet, or random coil
- Categorizing news stories as finance, weather, entertainment, sports, etc.
An Example Decision Tree
[Figure: an example decision tree. The root tests age: for age <= 30 the tree tests student (no => no, yes => yes); for age 30..40 it predicts yes; for age > 40 it tests credit rating (excellent => no, fair => yes).]
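Read as code, the tree is just nested if/else tests; this sketch assumes the branch outcomes described in the figure caption above:

```python
# The example tree as nested if/else (an interpretation of the figure,
# not verbatim lecture code).
def classify(age, student, credit_rating):
    if age <= 30:
        return "yes" if student == "yes" else "no"
    elif age <= 40:                       # the 30..40 branch is a pure leaf
        return "yes"
    else:                                 # age > 40
        return "yes" if credit_rating == "fair" else "no"

print(classify(25, "yes", "fair"))        # 'yes'
print(classify(45, "no", "excellent"))    # 'no'
```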
Tree Induction
- Greedy strategy: split the records based on the attribute test that optimizes a chosen criterion.
- How do we determine the best split?
How to determine the Best Split
Before splitting: 10 records of class C0 and 10 records of class C1.

[Figure: three candidate test conditions - Student? (yes/no), Income? (high/medium/low), and Age? - each splitting the 20 records into child nodes with different C0/C1 class counts.]

Which test condition is the best?
How to determine the Best Split
- Greedy approach: nodes with a homogeneous class distribution are preferred.
- We need a measure of node impurity:
  - Non-homogeneous: high degree of impurity
  - Homogeneous: low degree of impurity
How to Find the Best Split
Before splitting, the node's impurity is M0.

[Figure: splitting on attribute A yields nodes N1 and N2 with impurities M1 and M2, combined (size-weighted) into M12; splitting on attribute B yields nodes N3 and N4 with impurities M3 and M4, combined into M34.]

Compare Gain = M0 - M12 vs. Gain = M0 - M34, and choose the split with the larger gain (see the sketch below).
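A minimal sketch of this comparison, assuming each child is given as a list of class labels and `impurity` is some impurity function (e.g., the Gini index introduced next):

```python
# Size-weighted impurity of a split's children (M12 or M34 above).
def weighted_impurity(children, impurity):
    n = sum(len(c) for c in children)
    return sum(len(c) / n * impurity(c) for c in children)

# Gain of a split relative to the unsplit parent (M0 - M12, M0 - M34).
def gain(parent, children, impurity):
    return impurity(parent) - weighted_impurity(children, impurity)

# Choose the candidate split (a list of children) with the largest gain:
# best = max(candidate_splits, key=lambda ch: gain(parent, ch, impurity))
```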
Measure of Impurity: GINI
Gini index for a given node t:

  GINI(t) = 1 - Σ_j [p(j | t)]^2

where p(j | t) is the relative frequency of class j at node t.

- Maximum value 1 - 1/n_c (n_c = number of classes) when records are equally distributed among all classes, implying the least interesting information.
- Minimum value 0.0 when all records belong to one class, implying the most interesting information.
Examples for computing GINI
- Node with class counts C1 = 0, C2 = 6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1; Gini = 1 - P(C1)^2 - P(C2)^2 = 1 - 0 - 1 = 0
- Node with class counts C1 = 1, C2 = 5: P(C1) = 1/6, P(C2) = 5/6; Gini = 1 - (1/6)^2 - (5/6)^2 = 0.278
- Node with class counts C1 = 2, C2 = 4: P(C1) = 2/6, P(C2) = 4/6; Gini = 1 - (2/6)^2 - (4/6)^2 = 0.444
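A tiny sketch checking these numbers from raw class counts (the function name gini is ours):

```python
# Gini index from raw class counts; reproduces the examples above.
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

print(gini([0, 6]))  # 0.0
print(gini([1, 5]))  # 0.2777...
print(gini([2, 4]))  # 0.4444...
```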
Splitting Based on GINI
- Used in CART, SLIQ, and SPRINT.
- When a node p is split into k partitions (children), the quality of the split is computed as

    GINI_split = Σ_{i=1..k} (n_i / n) GINI(i)

  where n_i is the number of records at child i and n is the number of records at node p.
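Continuing the sketch, the split quality is the size-weighted Gini of the children (reusing gini from the previous sketch):

```python
# GINI_split from the formula above; each child is a list of class counts.
def gini_split(children_counts):
    n = sum(sum(c) for c in children_counts)
    return sum(sum(c) / n * gini(c) for c in children_counts)

# Splitting the 20-record parent (10 C0, 10 C1; Gini 0.5) into two children:
print(gini_split([[6, 4], [4, 6]]))  # 0.48 -- a small improvement over 0.5
```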
Decision Tree Based Classification
Advantages:
- Inexpensive to construct
- Extremely fast at classifying unknown records
- Easy to interpret for small trees
- Accuracy comparable to other classification techniques on many simple data sets
Example: C4.5
- Simple depth-first construction
- Uses information gain
- Sorts continuous attributes at each node
- Needs the entire data set to fit in memory, so it is unsuitable for large data sets, which require out-of-core sorting
- You can download the software from:
Practical Issues of Classification
- Underfitting and overfitting
- Missing values
- Costs of classification
Underfitting and Overfitting (Example)
- 500 circular and 500 triangular data points
- Circular points: 0.5 <= sqrt(x1^2 + x2^2) <= 1
- Triangular points: sqrt(x1^2 + x2^2) < 0.5 or sqrt(x1^2 + x2^2) > 1
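As an illustration only, one way to generate such a data set is rejection sampling; the sampling square [-1, 1] x [-1, 1] is our assumption, not stated on the slide:

```python
# Rejection-sampling sketch for the two-class data set above.
import math
import random

def sample(n, keep):
    pts = []
    while len(pts) < n:
        x1, x2 = random.uniform(-1, 1), random.uniform(-1, 1)
        if keep(math.hypot(x1, x2)):      # hypot = sqrt(x1^2 + x2^2)
            pts.append((x1, x2))
    return pts

circular   = sample(500, lambda r: 0.5 <= r <= 1)     # the annulus
triangular = sample(500, lambda r: r < 0.5 or r > 1)  # inside / outside it
```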
Underfitting and Overfitting
- Underfitting: the model is too simple, so both training and test errors are large.
- Overfitting: the model is too complex, so the test error increases even as the training error keeps decreasing.
Overfitting due to Noise
[Figure: the decision boundary is distorted by a noise point.]
Overfitting due to Insufficient Examples
- A lack of data points in the lower half of the diagram makes it difficult to predict the class labels of that region correctly.
- The insufficient number of training records in the region causes the decision tree to predict the test examples using other training records that are irrelevant to the classification task.
Notes on Overfitting
- Overfitting results in decision trees that are more complex than necessary.
- The training error no longer provides a good estimate of how well the tree will perform on previously unseen records.
- We need new ways of estimating errors.
Decision Boundary
- The border line between two neighboring regions of different classes is known as the decision boundary.
- Here the decision boundary is parallel to the axes because each test condition involves a single attribute at a time.
Oblique Decision Trees
[Figure: an oblique split, x + y < 1, separating class + from class -.]

- The test condition may involve multiple attributes
- More expressive representation
- Finding the optimal test condition is computationally expensive
Model Evaluation
- How do you know your algorithm is good?
Metrics for Performance Evaluation
- Focus on the predictive capability of a model, rather than on how long it takes to classify or build models, scalability, etc.
- Confusion matrix:

                         PREDICTED CLASS
                         Class=Yes   Class=No
  ACTUAL    Class=Yes    a (TP)      b (FN)
  CLASS     Class=No     c (FP)      d (TN)

  a: TP (true positive), b: FN (false negative), c: FP (false positive), d: TN (true negative)
Metrics for Performance Evaluation…
The most widely used metric, in terms of the confusion-matrix counts above, is accuracy:

  Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
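A one-line sketch of the formula, on made-up counts:

```python
# Accuracy from the confusion-matrix counts a, b, c, d defined above.
def accuracy(a, b, c, d):
    return (a + d) / (a + b + c + d)

print(accuracy(40, 10, 5, 45))  # 0.85 on illustrative counts
```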
Cost-Sensitive Measures
- Precision (p) = a / (a + c)
- Recall (r) = a / (a + b)
- F-measure (F) = 2rp / (r + p) = 2a / (2a + b + c)
- These weight false positives and false negatives differently, which matters when the two error types have different costs.
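And the corresponding sketch for these measures, on the same illustrative counts as above:

```python
# Precision, recall, and F-measure from confusion-matrix counts.
def precision(a, b, c, d):
    return a / (a + c)

def recall(a, b, c, d):
    return a / (a + b)

def f_measure(a, b, c, d):
    p, r = precision(a, b, c, d), recall(a, b, c, d)
    return 2 * r * p / (r + p)

print(precision(40, 10, 5, 45))  # 0.888...
print(recall(40, 10, 5, 45))     # 0.8
print(f_measure(40, 10, 5, 45))  # 0.842...
```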
Summary
- The Apriori property and association rule mining
- Classification: decision trees