CSE 711: DATA MINING. Sargur N. Srihari. E-mail: srihari@cedar.buffalo.edu. Phone: 645-6164, ext. 113.



1 CSE 711: DATA MINING
Sargur N. Srihari
E-mail: srihari@cedar.buffalo.edu
Phone: 645-6164, ext. 113

2 CSE 711 Texts
Required Text
1. Witten, I. H., and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, 2000.
Recommended Texts
1. Adriaans, P., and D. Zantinge, Data Mining, Addison-Wesley, 1998.

3 Input for Data Mining/Machine Learning
Concepts: the result of the learning process; should be intelligible and operational
Instances: the individual examples to be learned from
Attributes: the features that characterize each instance

4 Concept Learning
Four styles of learning in data mining:
classification learning (supervised)
association learning (associations between features)
clustering
numeric prediction

5 Iris Data–Clustering Problem

6 Weather Data–Numeric Class

7 Instances
Input to a machine learning scheme is a set of instances
The matrix of examples versus attributes is a flat file
Input data as instances is common but restrictive: it cannot directly represent relationships between objects

8 Family Tree Example
(figure: family tree with couples Peter = Peggy and Grace = Ray; children Steven, Graham, and Pam; Ian, Pippa, and Brian; Pam = Brian with daughters Nikki and Anna)

9 Two ways of expressing the sister-of relation (figure panels (a) and (b))

10 Family Tree As Table

11 Sister-of As Table (combines 2 tables)

12 Rule for sister-of relation
If second person’s gender = female and first person’s parent1 = second person’s parent1 then sister-of = yes
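The rule above can be checked directly over a flat-file encoding of the family tree. The sketch below is illustrative, not Weka code: each person maps to a hypothetical (gender, parent1, parent2) tuple using names from the family tree slide.

```python
# Hypothetical flat-file encoding of the family tree from the slides:
# name -> (gender, parent1, parent2). Structure assumed for illustration.
people = {
    "Steven": ("M", "Peter", "Peggy"),
    "Graham": ("M", "Peter", "Peggy"),
    "Pam":    ("F", "Peter", "Peggy"),
    "Ian":    ("M", "Grace", "Ray"),
    "Pippa":  ("F", "Grace", "Ray"),
}

def sister_of(first, second):
    """The slide's rule: second person is female and both share parent1."""
    gender2 = people[second][0]
    return gender2 == "F" and people[first][1] == people[second][1]
```

For example, `sister_of("Steven", "Pam")` holds (Pam is female and shares Steven's parent1), while `sister_of("Steven", "Pippa")` does not, since they have different parents.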

13 Denormalization
A relationship between different nodes of a tree is recast into a set of independent instances
Two records are joined and made into one by a process of flattening
Relationships among more than two records would yield combinatorially many joined instances

14 Denormalization can produce spurious discoveries
Supermarket database:
customers and products-bought relation
products and supplier relation
suppliers and their address relation
Denormalizing produces a flat file in which each instance has: customer, product, supplier, supplier address
A database mining tool then discovers:
customers that buy beer also buy chips
the supplier's address can be "discovered" from the supplier!
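The flattening step described above can be sketched as a join over the three relations. The table contents here are made up for illustration; only the shape of the result matters.

```python
# Illustrative denormalization of the three supermarket relations into
# one flat file. All data values are hypothetical.
purchases = [("alice", "beer"), ("alice", "chips"), ("bob", "beer")]
suppliers = {"beer": "BrewCo", "chips": "SnackCo"}
addresses = {"BrewCo": "1 Hop St", "SnackCo": "2 Salt Ave"}

# Each flat instance: (customer, product, supplier, supplier address).
flat = [
    (customer, product, suppliers[product], addresses[suppliers[product]])
    for customer, product in purchases
]
```

Because every instance now repeats the supplier's address, a mining tool can "discover" the supplier-to-address dependency, which was a fact of the database schema, not a finding about customers.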

15 Relations need not be finite
The relation ancestor-of involves arbitrarily long paths through the tree
Inductive logic programming learns rules such as:
If person-1 is a parent of person-2 then person-1 is an ancestor of person-2
If person-1 is an ancestor of person-2 and person-2 is an ancestor of person-3 then person-1 is an ancestor of person-3
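What the two rules compute is the transitive closure of parent-of. A minimal sketch (assuming a small hypothetical set of parent pairs drawn from the family-tree slide) applies the base rule once and the recursive rule to a fixpoint:

```python
# Hypothetical parent-of pairs (person-1 is a parent of person-2).
parent_of = {("Peter", "Pam"), ("Peter", "Steven"), ("Pam", "Nikki")}

def ancestors():
    """Transitive closure of parent_of, mirroring the two ILP rules."""
    closure = set(parent_of)          # base rule: a parent is an ancestor
    changed = True
    while changed:                    # recursive rule, applied to fixpoint
        changed = False
        for a, b in list(closure):
            for c, d in parent_of:
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure
```

The loop terminates because the tree is finite even though the paths the relation describes may be arbitrarily long; here it infers, for example, that Peter is an ancestor of Nikki.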

16 Inductive Logic Programming
Can learn recursive rules from a set of relation instances
Drawbacks of such techniques: they do not cope with noisy data and are so slow as to be unusable; not covered in the book

17 Summary of Data-mining Input
Input is a table of independent instances of the concept to be learned ("file mining"!)
Relational data is more complex than a flat file
A finite set of relations can be recast into a single table
Denormalization can result in spurious data

18 Attributes
Each instance is characterized by a set of predefined features, e.g., the iris data
Different instances may have different features, e.g., if instances are transportation vehicles:
number of wheels is useful for land vehicles but not for ships
number of masts is applicable to ships but not to land vehicles
One feature may depend on the value of another, e.g., spouse's name depends on married/unmarried:
use an "irrelevant value" flag

19 Attribute Values
Nominal: outlook = sunny, overcast, rainy
Ordinal: temperature = hot, mild, cool, with hot > mild > cool
Interval: ordered and measured in fixed units, e.g., temperature in degrees F; differences are meaningful, but not sums
Ratio: inherently defines a zero point, e.g., distance between points; real numbers, all mathematical operations apply
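A small illustrative sketch (not Weka code) of what each level of measurement licenses, using the slide's examples; the explicit rank numbers for the ordinal values are an assumption made to encode the ordering:

```python
# Ordinal: ranking is meaningful, so comparisons are allowed.
# The numeric ranks below are assumed purely to encode hot > mild > cool.
RANK = {"cool": 0, "mild": 1, "hot": 2}

def hotter(a, b):
    return RANK[a] > RANK[b]

# Interval: differences are meaningful (83F - 64F = 19 degrees),
# but sums of temperatures are not.
diff = 83 - 64

# Ratio: a true zero exists, so ratios make sense (10 km is twice 5 km).
ratio = 10 / 5
```

Nominal values such as outlook support only equality tests; none of the operations above apply to them.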

20 Preparing the Input
Denormalization
Integrate data from different sources, e.g., a marketing study: sales dept., billing dept., service dept.
Each source may have varying conventions, errors, etc.
Enterprise-wide database integration is data warehousing

21 ARFF File for Weather Data
% ARFF file for the weather data with some numeric features
%
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {true, false}
@attribute play? {yes, no}
@data
% 14 instances
sunny, 85, 85, false, no
sunny, 80, 90, true, no
overcast, 83, 86, false, yes
rainy, 70, 96, false, yes
rainy, 68, 80, false, yes
rainy, 65, 70, true, no
overcast, 64, 65, true, yes
sunny, 72, 95, false, no
sunny, 69, 70, false, yes
rainy, 75, 80, false, yes
sunny, 75, 70, true, yes
overcast, 72, 90, true, yes
overcast, 81, 75, false, yes
rainy, 71, 91, true, no
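A minimal hand-rolled reader for a file in the layout above can make the format concrete. This is a sketch, not Weka's loader (Weka reads ARFF natively, and Python users would more likely reach for an existing ARFF library); it handles only comments, @attribute lines, and comma-separated data rows.

```python
# Illustrative ARFF reader: extracts attribute names and data rows.
def parse_arff(text):
    attributes, data, in_data = [], [], False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):
            continue                        # skip blanks and % comments
        if line.lower().startswith("@attribute"):
            attributes.append(line.split()[1])   # second token is the name
        elif line.lower() == "@data":
            in_data = True                  # everything after @data is rows
        elif in_data:
            data.append([v.strip() for v in line.split(",")])
    return attributes, data

sample = """@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@data
sunny, 85, 85
overcast, 83, 86"""
attrs, rows = parse_arff(sample)
```

Note that a full parser would also record each attribute's type (nominal value set versus numeric), which this sketch discards.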

22 Simple Disjunction
(figure: decision tree for a simple disjunction, with yes/no tests on attributes such as x and y leading to class leaves a, b, c, and d)
23 Exclusive-Or Problem
If x = 1 and y = 0 then class = a
If x = 1 and y = 1 then class = b
(figure: plot of the XOR problem and a decision tree testing x = 1?, then y = 1? on each branch, with leaves a and b)
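The slide's two rules can be written out as a function, together with the symmetric cases implied by the XOR structure (class a whenever x and y differ). The point of the example is that no single test on x alone or y alone separates the classes; both attributes must be examined.

```python
# XOR class assignment: the slide gives the rules for x = 1 explicitly;
# the x = 0 cases below follow from the XOR pattern (class a iff x != y).
def xor_class(x, y):
    if x == 1 and y == 0:
        return "a"
    if x == 0 and y == 1:
        return "a"
    return "b"          # x == y: class b
```

Checking any single attribute's value gives a 50/50 class split, which is why a decision tree must test x and then y (or vice versa) on every branch.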

24 Replicated Subtree
If x = 1 and y = 1 then class = a
If z = 0 and w = 1 then class = a
Otherwise class = b
(figure: decision tree in which the subtree testing z and w is replicated under several branches of the tests on x and y)

25 New Iris Flower

26 Rules for Iris Data
Default: Iris-setosa
except if petal-length ≥ 2.45 and petal-length < and petal-width < then Iris-versicolor
except if petal-length ≥ 4.95 and petal-width < then Iris-virginica
else if sepal-length < 4.95 and sepal-width ≥ then Iris-virginica
else if petal-length ≥ then Iris-virginica
except if petal-length < 4.85 and sepal-length < then Iris-versicolor

27 The Shapes Problem Shaded: Standing Unshaded: Lying

28 Training Data for Shapes Problem

29 CPU Performance Data
(a) Linear regression:
PRP = -56.1 + 0.049 MYCT + 0.015 MMIN + 0.006 MMAX + 0.630 CACH - 0.270 CHMIN + 1.46 CHMAX
(b) Regression tree:
(figure: regression tree splitting on CHMIN, CACH, MMAX, MMIN, CHMAX, and MYCT, with a numeric PRP prediction at each leaf, e.g., 64.6, 157, 19.3)
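Evaluating the linear-regression formula for a single machine is a direct substitution into the equation above. The attribute values in the example call are made up for illustration; only the coefficients come from the slide.

```python
# PRP prediction from the slide's linear-regression formula.
def predict_prp(MYCT, MMIN, MMAX, CACH, CHMIN, CHMAX):
    return (-56.1
            + 0.049 * MYCT
            + 0.015 * MMIN
            + 0.006 * MMAX
            + 0.630 * CACH
            - 0.270 * CHMIN
            + 1.46 * CHMAX)

# Hypothetical machine: cycle time 125 ns, 256/6000 KB min/max memory,
# 16 KB cache, 4 to 8 channels.
est = predict_prp(MYCT=125, MMIN=256, MMAX=6000, CACH=16, CHMIN=4, CHMAX=8)
```

The regression tree and model tree in panels (b) and (c) refine this single global formula by fitting different predictions, or different linear models, in different regions of the attribute space.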

30 CPU Performance Data CHMIN 7.5 >7.5 CACH MMAX 8.5 >8.5 28000
>28000 MMAX LM4 (50/22.17%) LM5 (21/45.5%) LM6 (23/63.5%) 4250 >4250 LM1 PRP = MMAX CHMIN LM2 PRP = MMIN CHMIN CHMAX LM3 PRP = MMIN LM4 PRP = MMAX CACH CHMAX LM5 PRP = MYCT CACH CHMIN LM6 PRP = MMIN CHMIN = 4.98 CHMAX LM1 (65/7.32%) CACH 0.5 (0.5,8.5] LM2 (26/6.37%) LM3 (24/14.5%) (c) model tree

31 Partitioning Instance Space

32 Ways to Represent Clusters
