CSE 711: DATA MINING
Sargur N. Srihari
E-mail: srihari@cedar.buffalo.edu
Phone: 645-6164, ext. 113
CSE 711 Texts
Required Text
1. Witten, I. H., and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, 2000.
Recommended Texts
1. Adriaans, P., and D. Zantinge, Data Mining, Addison-Wesley, 1998.
Input for Data Mining/Machine Learning
Concepts: the result of the learning process; should be intelligible and operational
Instances
Attributes
Concept Learning
Four styles of learning in data mining:
classification learning (supervised)
association learning (associations between features)
clustering
numeric prediction
Iris Data–Clustering Problem
Weather Data–Numeric Class
Instances
Input to a machine learning scheme is a set of instances
The matrix of examples versus attributes is a flat file
Input data as instances is common but also restrictive in representing relationships between objects
Family Tree Example
(figure: two-generation family tree over Peter, Peggy, Grace, Ray, Steven, Graham, Pam, Ian, Pippa, Brian, Nikki, and Anna, with gender marked M/F and = denoting marriage)
Two ways of expressing the sister-of relation (figure: panels (a) and (b))
Family Tree As Table
Sister-of As Table (combines 2 tables)
Rule for the sister-of relation
If second person’s gender = female
and first person’s parent1 = second person’s parent1
then sister-of = yes
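As a sketch only, the rule can be read as a predicate over two rows of the flattened table; the Person record and its field names below are hypothetical, not taken from the course materials:

    class SisterOf {
        // Hypothetical flattened-table row: name, gender, parent1, parent2
        record Person(String name, String gender, String parent1, String parent2) {}

        // Direct translation of the rule: second person's gender = female
        // and first person's parent1 = second person's parent1
        static boolean sisterOf(Person first, Person second) {
            return "female".equals(second.gender())
                && first.parent1() != null
                && first.parent1().equals(second.parent1());
        }
    }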
Denormalization
Relationships between different nodes of a tree are recast into a set of independent instances
Two records are joined and made into one by a process of flattening
Relationships among more than two records would produce a combinatorially large number of joined instances
Denormalization can produce spurious discoveries
Supermarket database:
  customers and products bought relation
  products and suppliers relation
  suppliers and their addresses relation
Denormalizing produces a flat file: each instance has customer, product, supplier, supplier address
Database mining tool discovers:
  customers that buy beer also buy chips
  supplier address can be “discovered” from supplier!
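A minimal sketch of the flattening step, assuming made-up record and relation names (c1, beer, s1, etc. are placeholders): joining the three relations yields one flat instance per purchase, and every instance repeats the supplier’s address.

    import java.util.*;

    class Denormalize {
        record Purchase(String customer, String product) {}

        public static void main(String[] args) {
            List<Purchase> purchases = List.of(
                new Purchase("c1", "beer"), new Purchase("c1", "chips"));
            Map<String, String> supplierOf = Map.of("beer", "s1", "chips", "s2");
            Map<String, String> addressOf  = Map.of("s1", "addr1", "s2", "addr2");

            // Join the three relations: one flat instance per purchase row
            for (Purchase p : purchases) {
                String supplier = supplierOf.get(p.product());
                System.out.println(String.join(", ",
                    p.customer(), p.product(), supplier, addressOf.get(supplier)));
            }
            // Because the address is copied into every instance, a mining tool can
            // "discover" that the supplier address is determined by the supplier.
        }
    }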
Relations need not be finite
The relation ancestor-of involves arbitrarily long paths through the tree
Inductive logic programming learns rules such as:
If person-1 is a parent of person-2
then person-1 is an ancestor of person-2
If person-1 is a parent of person-2
and person-2 is an ancestor of person-3
then person-1 is an ancestor of person-3
Inductive logic programming can learn recursive rules from a set of relation instances
Drawbacks of such techniques: they do not cope with noisy data, are so slow as to be unusable, and are not covered in the book
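A sketch of the ancestor-of rules, hand-coded as a recursive check over a parent relation; the Map representation and sample rows (drawn from the family tree example) are assumptions, and an ILP system would induce such rules from instances rather than being given them:

    import java.util.*;

    class AncestorOf {
        // child -> parents (a few rows from the family tree example)
        static final Map<String, List<String>> PARENTS = Map.of(
            "Anna", List.of("Pam", "Ian"),
            "Pam",  List.of("Peter", "Peggy"));

        // person-1 is an ancestor of person-2 if person-1 is a parent of person-2,
        // or person-1 is an ancestor of one of person-2's parents
        static boolean ancestorOf(String person1, String person2) {
            for (String parent : PARENTS.getOrDefault(person2, List.of())) {
                if (parent.equals(person1) || ancestorOf(person1, parent)) return true;
            }
            return false;
        }

        public static void main(String[] args) {
            System.out.println(ancestorOf("Peter", "Anna")); // true: Peter -> Pam -> Anna
        }
    }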
Summary of Data-mining Input
Input is a table of independent instances of the concept to be learned (file-mining!)
Relational data is more complex than a flat file
A finite set of relations can be recast into a single table
Denormalization can result in spurious data
Attributes
Each instance is characterized by a set of predefined features, e.g., the iris data
Different instances may have different features
  e.g., if instances are transportation vehicles:
  no. of wheels is useful for land vehicles but not for ships
  no. of masts is applicable to ships but not to land vehicles
One feature may depend on the value of another
  e.g., spouse’s name depends on married/unmarried
  use an “irrelevant value” flag
Attribute Values
Nominal: outlook = sunny, overcast, rainy
Ordinal: temperature = hot, mild, cool; hot > mild > cool
Interval: ordered and measured in fixed units, e.g., temperature in °F; differences are meaningful, not sums
Ratio: inherently defines a zero point, e.g., distance between points; real numbers, all mathematical operations
Preparing the Input
Denormalization
Integrate data from different sources
  e.g., a marketing study: sales dept., billing dept., service dept.
Each source may have varying conventions, errors, etc.
Enterprise-wide database integration is data warehousing
ARFF File for Weather Data
% ARFF file for the weather data with some numeric features
%
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {true, false}
@attribute play? {yes, no}
@data
% 14 instances
sunny, 85, 85, false, no
sunny, 80, 90, true, no
overcast, 83, 86, false, yes
rainy, 70, 96, false, yes
rainy, 68, 80, false, yes
rainy, 65, 70, true, no
overcast, 64, 65, true, yes
sunny, 72, 95, false, no
sunny, 69, 70, false, yes
rainy, 75, 80, false, yes
sunny, 75, 70, true, yes
overcast, 72, 90, true, yes
overcast, 81, 75, false, yes
rainy, 71, 91, true, no
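As a hedged sketch, a file such as the one above can be read with the Weka library that accompanies the required text; the file name weather.arff and the wrapper class are assumptions:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import weka.core.Instances;

    public class LoadWeather {
        public static void main(String[] args) throws Exception {
            try (BufferedReader reader = new BufferedReader(new FileReader("weather.arff"))) {
                Instances data = new Instances(reader);        // parses @relation, @attribute, @data
                data.setClassIndex(data.numAttributes() - 1);  // play? is the last (class) attribute
                System.out.println(data.numInstances() + " instances, "
                        + data.numAttributes() + " attributes");
            }
        }
    }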
Simple Disjunction
(figure: decision tree over tests a, b, c, and d, with yes/no branches leading to class x)
Exclusive-Or Problem
If x = 1 and y = 0 then class = a
If x = 1 and y = 1 then class = b
(figure: decision tree with tests x = 1? and y = 1? leading to classes a and b)
Replicated Subtree
If x = 1 and y = 1 then class = a
If z = 0 and w = 1 then class = a
Otherwise class = b
(figure: decision tree with tests on x, y, z, and w in which the subtree testing z and w is replicated)
New Iris Flower
Rules for Iris Data
Default: Iris-setosa
except if petal-length ≥ 2.45 and petal-length < 5.355
          and petal-width < 1.75
       then Iris-versicolor
            except if petal-length ≥ 4.95 and petal-width < 1.55
                   then Iris-virginica
                   else if sepal-length < 4.95 and sepal-width ≥ 2.45
                   then Iris-virginica
       else if petal-length ≥ 3.35
       then Iris-virginica
            except if petal-length < 4.85 and sepal-length < 5.95
                   then Iris-versicolor
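A sketch of the rules-with-exceptions above written as nested conditionals; the class and method names are illustrative, not from the text:

    class IrisRules {
        static String classify(double sepalLength, double sepalWidth,
                               double petalLength, double petalWidth) {
            String cls = "Iris-setosa";                       // default
            if (petalLength >= 2.45 && petalLength < 5.355 && petalWidth < 1.75) {
                cls = "Iris-versicolor";                      // first exception
                if (petalLength >= 4.95 && petalWidth < 1.55) cls = "Iris-virginica";
                else if (sepalLength < 4.95 && sepalWidth >= 2.45) cls = "Iris-virginica";
            } else if (petalLength >= 3.35) {
                cls = "Iris-virginica";                       // second exception
                if (petalLength < 4.85 && sepalLength < 5.95) cls = "Iris-versicolor";
            }
            return cls;
        }
    }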
The Shapes Problem Shaded: Standing Unshaded: Lying
Training Data for Shapes Problem
CPU Performance Data
(a) linear regression:
PRP = -56.1 + 0.049 MYCT + 0.015 MMIN + 0.006 MMAX + 0.630 CACH - 0.270 CHMIN + 1.46 CHMAX
(b) regression tree:
(figure: regression tree splitting on CHMIN, CACH, MMAX, MMIN, and MYCT, with a predicted PRP value and instance count/error at each leaf)
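A sketch that simply evaluates the linear-regression formula (a) for one machine; the attribute values passed in main are made up for illustration, not rows from the CPU data set:

    class CpuPrediction {
        // Evaluate the linear-regression formula (a) above
        static double predictPRP(double myct, double mmin, double mmax,
                                 double cach, double chmin, double chmax) {
            return -56.1 + 0.049 * myct + 0.015 * mmin + 0.006 * mmax
                   + 0.630 * cach - 0.270 * chmin + 1.46 * chmax;
        }

        public static void main(String[] args) {
            System.out.println(predictPRP(125, 256, 6000, 16, 4, 16)); // illustrative values
        }
    }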
CPU Performance Data
(c) model tree:
(figure: model tree splitting on CHMIN, CACH, and MMAX, with linear models LM1-LM6 at the leaves)
LM1: PRP = 8.29 + 0.004 MMAX + 2.77 CHMIN
LM2: PRP = 20.3 + 0.004 MMIN - 3.99 CHMIN + 0.946 CHMAX
LM3: PRP = 38.1 + 0.012 MMIN
LM4: PRP = 10.5 + 0.002 MMAX + 0.698 CACH + 0.969 CHMAX
LM5: PRP = 285 - 1.46 MYCT + 1.02 CACH - 9.39 CHMIN
LM6: PRP = -65.8 + 0.03 MMIN - 2.94 CHMIN + 4.98 CHMAX
Partitioning Instance Space
Ways to Represent Clusters