CSE 711: DATA MINING
Sargur N. Srihari
E-mail: srihari@cedar.buffalo.edu
Phone: , ext. 113
CSE 711 Texts
Required Text:
1. Witten, I. H., and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, 2000.
Recommended Texts:
1. Adriaans, P., and D. Zantinge, Data Mining, Addison-Wesley, 1998.
Input for Data Mining/Machine Learning
- Concepts: the result of the learning process; must be intelligible and operational
- Instances: the individual examples from which concepts are learned
- Attributes: the features that characterize each instance
Concept Learning
Four styles of learning in data mining:
- Classification learning: supervised learning from classified examples
- Association learning: finding associations between features
- Clustering: grouping instances that belong together
- Numeric prediction: predicting a numeric quantity rather than a class
Iris Data–Clustering Problem
Weather Data–Numeric Class
Instances
- Input to a machine learning scheme is a set of instances
- A matrix of examples versus attributes is a flat file
- Representing input data as instances is common but also restrictive: it cannot easily express relationships between objects
Family Tree Example
[Figure: family tree. Peter (M) = Peggy (F), with children Steven (M), Graham (M), and Pam (F); Grace (F) = Ray (M), with children Ian (M), Pippa (F), and Brian (M); Pam = Ian, with children Nikki (F) and Anna (F)]
Two ways of expressing the sister-of relation
[Figure: panels (a) and (b)]
Family Tree As Table
Sister-of As Table (combines 2 tables)
Rule for the sister-of relation:
If second person's gender = female
and first person's parent1 = second person's parent1
then sister-of = yes
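A minimal sketch of how this learned rule could be applied in code; the people dictionary and its field names are hypothetical, mirroring the flattened person table:

people = {
    "Steven": {"gender": "male",   "parent1": "Peter", "parent2": "Peggy"},
    "Graham": {"gender": "male",   "parent1": "Peter", "parent2": "Peggy"},
    "Pam":    {"gender": "female", "parent1": "Peter", "parent2": "Peggy"},
    "Ian":    {"gender": "male",   "parent1": "Grace", "parent2": "Ray"},
    "Pippa":  {"gender": "female", "parent1": "Grace", "parent2": "Ray"},
}

def sister_of(first, second):
    # The learned rule: the second person is female and the two share
    # parent1 (first != second is added so nobody is her own sister).
    return (first != second
            and people[second]["gender"] == "female"
            and people[first]["parent1"] == people[second]["parent1"])

print([(a, b) for a in people for b in people if sister_of(a, b)])
# [('Steven', 'Pam'), ('Graham', 'Pam'), ('Ian', 'Pippa')]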
Denormalization
- A relationship between different nodes of a tree is recast into a set of independent instances
- Two records are joined and made into one by a process of flattening (sketched below)
- A relationship among more than two records would produce a combinatorially large number of instances
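A minimal sketch of the flattening step, assuming a hypothetical person table: every candidate pair of records is joined into one flat instance.

person = {
    # name: (gender, parent1, parent2) -- illustrative layout
    "Steven": ("male",   "Peter", "Peggy"),
    "Pam":    ("female", "Peter", "Peggy"),
    "Ian":    ("male",   "Grace", "Ray"),
}

flat = []
for first in person:
    for second in person:
        if first == second:
            continue
        # join the two records into a single flat instance
        flat.append((first, *person[first], second, *person[second]))

print(len(flat))  # n * (n - 1) pairs: pairwise flattening already grows quadratically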
Denormalization can produce spurious discoveries
Supermarket database:
- customers and products-bought relation
- products and supplier relation
- suppliers and their address relation
Denormalizing produces a flat file in which each instance has: customer, product, supplier, supplier address (see the sketch below)
A database mining tool then "discovers" that:
- customers who buy beer also buy chips
- the supplier address can be "discovered" from the supplier, which is spurious: it merely restates the functional dependency built into the join
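A minimal sketch of the supermarket example with made-up table contents; after denormalization, supplier address is functionally determined by supplier, so any "rule" linking them is an artifact of the join, not a discovery.

purchases = [("alice", "beer"), ("alice", "chips"), ("bob", "beer")]
supplier_of = {"beer": "Acme Brewing", "chips": "Snacko"}
address_of = {"Acme Brewing": "12 Hop St", "Snacko": "9 Crunch Ave"}

# Denormalized flat file: each instance has customer, product,
# supplier, and supplier address.
flat = [
    (customer, product, supplier_of[product], address_of[supplier_of[product]])
    for customer, product in purchases
]

# supplier -> address holds with 100% confidence by construction.
for row in flat:
    print(row)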
Relations need not be finite
The relation ancestor-of involves arbitrarily long paths through the tree. Inductive logic programming learns rules such as:
If person-1 is a parent of person-2
then person-1 is an ancestor of person-2
If person-1 is an ancestor of person-2
and person-2 is an ancestor of person-3
then person-1 is an ancestor of person-3
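A minimal sketch of the two rules in Python, over a hypothetical parent relation drawn from the family tree; rule 2 is transcribed as "parent of an ancestor" so the recursion is guaranteed to terminate on a finite tree.

parent = {("Peter", "Pam"), ("Peggy", "Pam"),
          ("Pam", "Nikki"), ("Pam", "Anna")}
people = {p for pair in parent for p in pair}

def is_ancestor(a, b):
    # Rule 1: a parent is an ancestor.
    if (a, b) in parent:
        return True
    # Rule 2: a is a parent of some c, and c is an ancestor of b.
    return any((a, c) in parent and is_ancestor(c, b) for c in people)

print(is_ancestor("Peter", "Anna"))  # True: Peter -> Pam -> Anna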
Inductive logic programming can learn recursive rules from a set of relation instances
Drawbacks of such techniques:
- they do not cope with noisy data
- they are so slow as to be unusable
- they are not covered in the book
Summary of Data-Mining Input
- Input is a table of independent instances of the concept to be learned (file mining!)
- Relational data is more complex than a flat file
- A finite set of relations can be recast into a single table
- Denormalization can result in spurious data
Attributes
- Each instance is characterized by a set of predefined features, e.g., the iris data
- Different instances may have different features: if instances are transportation vehicles, number of wheels is useful for land vehicles but not for ships, while number of masts applies to ships but not to land vehicles
- One feature may depend on the value of another, e.g., spouse's name depends on married/unmarried; use an "irrelevant value" flag (see the sketch below)
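A minimal sketch of the "irrelevant value" flag, using None for attributes that do not apply to an instance (the attribute names are illustrative):

instances = [
    {"type": "car",  "wheels": 4,    "masts": None},
    {"type": "ship", "wheels": None, "masts": 3},
]

# A learner must treat None as "not applicable", which is different
# from "missing" and from any legitimate value.
for inst in instances:
    print({k: v for k, v in inst.items() if v is not None})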
Attribute Values
- Nominal: e.g., outlook = sunny, overcast, rainy
- Ordinal: e.g., temperature = hot, mild, cool, with hot > mild > cool
- Interval: ordered and measured in fixed units, e.g., temperature in °F; differences are meaningful, sums are not
- Ratio: inherently defines a zero point, e.g., distance between points; real numbers, all mathematical operations apply (see the sketch below)
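A minimal sketch of which operations each level of measurement supports (the explicit rank mapping is an illustrative device):

# Nominal: only equality tests are meaningful.
outlook = "sunny"
print(outlook == "overcast")       # False

# Ordinal: ordering is meaningful; encode hot > mild > cool explicitly.
rank = {"cool": 0, "mild": 1, "hot": 2}
print(rank["hot"] > rank["mild"])  # True

# Interval (temperature in F): differences are meaningful, sums are not.
print(85 - 64)                     # 21, a meaningful difference

# Ratio (distance): a true zero point, so ratios are meaningful too.
print(3.0 / 1.5)                   # 2.0, "twice as far"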
Preparing the Input
- Denormalization
- Integrate data from different sources; e.g., a marketing study may draw on the sales, billing, and service departments
- Each source may have varying conventions, errors, etc.
- Enterprise-wide database integration is data warehousing
ARFF File for Weather Data

% ARFF file for the weather data with some numeric features
%
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {true, false}
@attribute play? {yes, no}
@data
% 14 instances
sunny, 85, 85, false, no
sunny, 80, 90, true, no
overcast, 83, 86, false, yes
rainy, 70, 96, false, yes
rainy, 68, 80, false, yes
rainy, 65, 70, true, no
overcast, 64, 65, true, yes
sunny, 72, 95, false, no
sunny, 69, 70, false, yes
rainy, 75, 80, false, yes
sunny, 75, 70, true, yes
overcast, 72, 90, true, yes
overcast, 81, 75, false, yes
rainy, 71, 91, true, no
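A minimal sketch of reading the @data section with plain Python; no ARFF library is assumed, and the filename weather.arff is hypothetical.

rows = []
with open("weather.arff") as f:
    in_data = False
    for line in f:
        line = line.strip()
        if not line or line.startswith("%"):
            continue                      # skip comments and blank lines
        if line.lower() == "@data":
            in_data = True                # attribute values follow
            continue
        if in_data:
            rows.append([v.strip() for v in line.split(",")])

print(len(rows))  # 14 instances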
Simple Disjunction
[Figure: decision tree for a simple disjunction, testing attributes a, b, c, and d (yes/no branches) with class x at the leaves]
Exclusive-Or Problem
If x = 1 and y = 0 then class = a
If x = 0 and y = 1 then class = a
If x = 1 and y = 1 then class = b
If x = 0 and y = 0 then class = b
[Figure: decision tree testing x = 1?, then y = 1? on each branch, with leaves b, a, a, b]
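A minimal sketch of the exclusive-or concept itself: class a exactly when x and y differ, which is why no single attribute test separates the classes.

def xor_class(x, y):
    # class = a when exactly one of x, y is 1, otherwise b
    return "a" if x != y else "b"

for x in (0, 1):
    for y in (0, 1):
        print(x, y, xor_class(x, y))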
Replicated Subtree
If x = 1 and y = 1 then class = a
If z = 1 and w = 1 then class = a
Otherwise class = b
[Figure: decision tree with root x (values 1, 2, 3); under x = 1 a test on y, and under every other branch the same subtree testing z then w is replicated, with leaves a and b]
New Iris Flower
Rules for Iris Data

Default: Iris-setosa                                                1
except if petal-length ≥ 2.45 and petal-length < 5.355             2
          and petal-width < 1.75                                    3
       then Iris-versicolor                                         4
            except if petal-length ≥ 4.95 and petal-width < 1.55    5
                   then Iris-virginica                              6
            else if sepal-length < 4.95 and sepal-width ≥ 2.45      7
                   then Iris-virginica                              8
else if petal-length ≥ 3.35                                         9
       then Iris-virginica                                         10
            except if petal-length < 4.85 and sepal-length < 5.95  11
                   then Iris-versicolor                            12
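A minimal sketch of these rules-with-exceptions as nested conditionals; the thresholds are as reconstructed above and should be treated as illustrative.

def classify_iris(sepal_len, sepal_wid, petal_len, petal_wid):
    if 2.45 <= petal_len < 5.355 and petal_wid < 1.75:
        # exceptions to the Iris-versicolor rule
        if petal_len >= 4.95 and petal_wid < 1.55:
            return "Iris-virginica"
        if sepal_len < 4.95 and sepal_wid >= 2.45:
            return "Iris-virginica"
        return "Iris-versicolor"
    if petal_len >= 3.35:
        # exception to the Iris-virginica rule
        if petal_len < 4.85 and sepal_len < 5.95:
            return "Iris-versicolor"
        return "Iris-virginica"
    return "Iris-setosa"  # the default

print(classify_iris(5.1, 3.5, 1.4, 0.2))  # Iris-setosa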
The Shapes Problem
Shaded: standing. Unshaded: lying.
Training Data for Shapes Problem
CPU Performance Data

(a) Linear regression:
PRP = -56.1 + 0.049 MYCT + 0.015 MMIN + 0.006 MMAX + 0.630 CACH - 0.270 CHMIN + 1.46 CHMAX

(b) Regression tree:
[Figure: regression tree splitting on CHMIN, CACH, MMAX, MMIN, and MYCT, with a predicted PRP value and (instance count/error) at each leaf, e.g. 64.6 (24/19.2%)]
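A minimal sketch evaluating the linear regression formula in (a); the sample input values are illustrative.

def predict_prp(myct, mmin, mmax, cach, chmin, chmax):
    # PRP = -56.1 + 0.049 MYCT + 0.015 MMIN + 0.006 MMAX
    #       + 0.630 CACH - 0.270 CHMIN + 1.46 CHMAX
    return (-56.1 + 0.049 * myct + 0.015 * mmin + 0.006 * mmax
            + 0.630 * cach - 0.270 * chmin + 1.46 * chmax)

print(predict_prp(myct=125, mmin=256, mmax=6000, cach=256, chmin=16, chmax=128))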
CPU Performance Data (continued)

(c) Model tree:
[Figure: model tree splitting on CHMIN, CACH, and MMAX; the leaves hold linear models LM1–LM6, each a linear formula for PRP in terms of MYCT, MMIN, MMAX, CACH, CHMIN, and CHMAX, with (instance count/error) at each leaf, e.g. LM1 (65/7.32%)]
Partitioning Instance Space
Ways to Represent Clusters