CSE 711: DATA MINING
Sargur N. Srihari
E-mail: srihari@cedar.buffalo.edu
Phone: 645-6164, ext. 113
CSE 711 Texts
Required Text
1. Witten, I. H., and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, 2000.
Recommended Texts
1. Adriaans, P., and D. Zantinge, Data Mining, Addison-Wesley, 1998.
2. Groth, R., Data Mining: A Hands-on Approach for Business Professionals, Prentice-Hall PTR, 1997.
3. Kennedy, R., Y. Lee, et al., Solving Data Mining Problems through Pattern Recognition, Prentice-Hall PTR, 1998.
4. Weiss, S., and N. Indurkhya, Predictive Data Mining: A Practical Guide, Morgan Kaufmann, 1998.
Introduction
Challenge: how to manage ever-increasing amounts of information
Solution: data mining and knowledge discovery in databases (KDD)
Information as a Production Factor
Most international organizations produce more information in a week than many people could read in a lifetime.
Data Mining Motivation
Mechanical production of data creates a need for mechanical consumption of data.
Large databases hold vast amounts of information; the difficulty lies in accessing it.
KDD and Data Mining
KDD: extraction of knowledge from data.
Official definition: "the non-trivial extraction of implicit, previously unknown, and potentially useful knowledge from data".
Data mining: the discovery stage of the KDD process.
Data Mining
The process of discovering patterns, automatically or semi-automatically, in large quantities of data.
The patterns discovered must be useful: meaningful in that they lead to some advantage, usually economic.
KDD and Data Mining
Figure 1.1: Data mining is a multi-disciplinary field, drawing on expert systems, statistics, machine learning, databases, and visualization.
Data Mining vs. Query Tools
SQL: when you know exactly what you are looking for.
Data mining: when you only vaguely know what you are looking for.
Practical Applications
KDD is more complicated than initially thought:
80% of the effort goes into preparing the data
20% goes into mining the data
Data Mining Techniques
Not so much a single technique as the idea that there is more knowledge hidden in the data than shows itself on the surface.
Data Mining Techniques
Any technique that helps extract more out of data is useful:
Query tools
Statistical techniques
Visualization
On-line analytical processing (OLAP)
Case-based learning (k-nearest neighbor)
Decision trees
Association rules
Neural networks
Genetic algorithms
Machine Learning and the Methodology of Science
The empirical cycle of scientific research: observation → analysis → theory → prediction → (new) observation.
Machine Learning...
Theory formation: analysis of a limited number of observations yields a theory ("All swans are white"); in reality there is an infinite number of swans.
Theory falsification: a single observation can falsify the theory "All swans are white"; in reality there is an infinite number of swans.
A Kangaroo in Mist
Figure (panels a–f): the complexity of search spaces.
Association Rules
Definition: given a set of transactions, where each transaction is a set of items, an association rule is an expression X → Y, where X and Y are sets of items.
Association Rules
Intuitive meaning of such a rule: transactions in the database that contain the items in X tend also to contain the items in Y.
Association Rules
Example: 98% of customers that purchase tires and automotive accessories also buy some automotive services. Here, 98% is called the confidence of the rule.
The support of the rule X → Y is the percentage of transactions that contain both X and Y.
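The two measures above can be computed directly by counting transactions. A minimal sketch, using an invented tires/accessories/services basket list (the data and names are illustrative, not from the slides):

```python
# Hypothetical transaction database: each transaction is a set of items.
transactions = [
    {"tires", "accessories", "services"},
    {"tires", "accessories", "services"},
    {"tires", "accessories"},
    {"bulbs"},
]

def support(X, Y, transactions):
    # Fraction of all transactions containing every item in X and in Y.
    both = sum(1 for t in transactions if X <= t and Y <= t)
    return both / len(transactions)

def confidence(X, Y, transactions):
    # Of the transactions containing X, the fraction that also contain Y.
    with_x = [t for t in transactions if X <= t]
    return sum(1 for t in with_x if Y <= t) / len(with_x)

X = {"tires", "accessories"}
Y = {"services"}
print(support(X, Y, transactions))     # 0.5  (2 of 4 transactions)
print(confidence(X, Y, transactions))  # 0.666...  (2 of the 3 with X)
```

Note that support is taken over all transactions, while confidence conditions on the transactions that contain X.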
Association Rules
Problem: the problem of mining association rules is to find all rules that satisfy a user-specified minimum support and minimum confidence.
Applications include cross-marketing, attached mailing, catalog design, loss-leader analysis, add-on sales, store layout, and customer segmentation based on buying patterns.
Example Data Sets
Contact Lens (symbolic)
Weather (symbolic)
Weather (numeric + symbolic)
Iris (numeric; outcome: symbolic)
CPU Performance (numeric; outcome: numeric)
Labor Negotiations (missing values)
Soybean
Contact Lens Data
Structural Patterns
Part of a structural description:
If tear production rate = reduced then recommendation = none
Otherwise, if age = young and astigmatic = no then recommendation = soft
The example is simplistic because all combinations of possible values are represented in the table.
Structural Patterns
In most learning situations, the set of examples given as input is far from complete.
Part of the job is to generalize to other, new examples.
Weather Data
Weather Problem
The attributes allow 36 possible combinations (3 × 3 × 2 × 2 = 36), of which 14 are present in the set of examples.
If outlook = sunny and humidity = high then play = no
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity = normal then play = yes
If none of the above then play = yes
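Read as an ordered list where the first matching rule fires, the rule set above can be sketched as a plain function (attribute and value names follow the slide; this is an illustration, not code from the course):

```python
# Ordered rule list for the symbolic weather data: the first matching
# condition determines the class; the last rule is the default.
def play(outlook, humidity, windy):
    if outlook == "sunny" and humidity == "high":
        return "no"
    if outlook == "rainy" and windy:
        return "no"
    if outlook == "overcast":
        return "yes"
    if humidity == "normal":
        return "yes"
    return "yes"  # if none of the above then play = yes

print(play("sunny", "high", False))    # no
print(play("overcast", "high", True))  # yes
print(play("rainy", "normal", True))   # no
```

Because the rules are ordered, "if humidity = normal then play = yes" is only reached when the earlier "no" rules did not match.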
Weather Data with Some Numeric Attributes
Classification and Association Rules
Classification rules: rules that predict the classification of the example in terms of whether to play or not.
If outlook = sunny and humidity > 83 then play = no
Classification and Association Rules
Association rules: rules that strongly associate different attribute values.
Association rules derived from the weather table:
If temperature = cool then humidity = normal
If humidity = normal and windy = false then play = yes
If outlook = sunny and play = no then humidity = high
If windy = false and play = no then outlook = sunny and humidity = high
Rules for Contact Lens Data
If tear production rate = reduced then recommendation = none
If age = young and astigmatic = no and tear production rate = normal then recommendation = soft
If age = pre-presbyopic and astigmatic = no and tear production rate = normal then recommendation = soft
If age = presbyopic and spectacle prescription = myope and astigmatic = no then recommendation = none
If spectacle prescription = hypermetrope and astigmatic = no and tear production rate = normal then recommendation = soft
If spectacle prescription = myope and astigmatic = yes and tear production rate = normal then recommendation = hard
If age = young and astigmatic = yes and tear production rate = normal then recommendation = hard
If age = pre-presbyopic and spectacle prescription = hypermetrope and astigmatic = yes then recommendation = none
If age = presbyopic and spectacle prescription = hypermetrope and astigmatic = yes then recommendation = none
Decision Tree for Contact Lens Data
Figure: a decision tree whose internal nodes test tear production rate, astigmatism, and spectacle prescription, with leaves labeled none, soft, hard, and none.
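The tree in the figure can be written out as nested conditionals, one test per internal node. A sketch, following the branch order named in the figure (tear production rate first, then astigmatism, then spectacle prescription); it simplifies away the age attribute used in the fuller rule set:

```python
# Decision tree as nested conditionals: each "if" is one internal node,
# each "return" is one leaf of the tree.
def recommend(tear_production_rate, astigmatism, spectacle_prescription):
    if tear_production_rate == "reduced":
        return "none"
    if astigmatism == "no":
        return "soft"
    if spectacle_prescription == "myope":
        return "hard"
    return "none"  # hypermetrope with astigmatism

print(recommend("reduced", "no", "myope"))   # none
print(recommend("normal", "yes", "myope"))   # hard
```

The tree and the rule list express the same kind of structural pattern; the tree simply fixes the order in which attributes are tested.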
Iris Data
Iris Rules Learned
If petal-length < 2.45 then Iris-setosa
If sepal-width < 2.10 then Iris-versicolor
If sepal-width < 2.45 and petal-length < 4.55 then Iris-versicolor
...
CPU Performance Data
CPU Performance
Numerical prediction: the outcome is expressed as a linear sum of weighted attributes.
Regression equation: PRP = -55.9 + 0.049 MYCT + ... + 1.48 CHMAX
Regression can discover linear relationships, but not non-linear ones.
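An equation of this form is found by least squares: choose the weights that minimize the squared error between predicted and actual outcomes. A minimal sketch with invented data (the MYCT/CHMAX values and PRP targets below are made up for illustration, not taken from the CPU data set):

```python
import numpy as np

# Invented rows of (MYCT, CHMAX) and PRP targets, purely for illustration.
X = np.array([[125.0, 16.0], [29.0, 32.0], [400.0, 8.0], [50.0, 64.0]])
y = np.array([100.0, 250.0, 40.0, 380.0])

# Append a column of ones for the intercept, then solve min ||A w - y||^2.
A = np.hstack([X, np.ones((len(X), 1))])
w, residuals, rank, _ = np.linalg.lstsq(A, y, rcond=None)

# w[0] and w[1] are the attribute weights; w[2] plays the role of the
# constant term (the -55.9 in the slide's equation).
pred = A @ w
```

Because the model is a weighted sum, it can only fit relationships that are linear in the attributes, which is the limitation the slide notes.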
Linear Regression
Figure: a simple linear regression for the loan data set, with income on the x-axis, debt on the y-axis, and the fitted regression line.
Labor Negotiations Data
Decision Trees for the Labor Negotiations Data
Figure: a simple tree testing wage increase first year and statutory holidays, with leaves labeled good and bad.
Figure: a more complex tree that also tests working hours per week and health plan contribution (none/half/full), again with leaves labeled good and bad.
Soybean Data
Two Example Rules
If [leaf condition is normal and stem condition is abnormal and stem cankers is below soil line and canker lesion color is brown] then diagnosis is rhizoctonia root rot
If [leaf malformation is absent and stem condition is abnormal and stem cankers is below soil line and canker lesion color is brown] then diagnosis is rhizoctonia root rot
Classification
Figure: a simple linear classification boundary for the loan data set (income vs. debt); the shaded region denotes class "no loan".
Clustering
Figure: a simple clustering of the loan data set into 3 clusters (income vs. debt); note that the original labels are replaced by +'s.
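A clustering like the one in the figure can be produced by k-means: alternately assign each point to its nearest center and move each center to the mean of its assigned points. A bare-bones sketch on invented two-dimensional (income, debt) points (the data and k = 2 are illustrative):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k distinct data points
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: each point joins its nearest center's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: (p[0] - centers[c][0]) ** 2
                                + (p[1] - centers[c][1]) ** 2)
            clusters[i].append(p)
        # Update step: move each center to the mean of its cluster.
        for i, cl in enumerate(clusters):
            if cl:
                centers[i] = (sum(p[0] for p in cl) / len(cl),
                              sum(p[1] for p in cl) / len(cl))
    return centers, clusters

# Two well-separated groups of (income, debt) points.
points = [(20, 5), (22, 6), (21, 4), (80, 40), (82, 42), (78, 41)]
centers, clusters = kmeans(points, 2)
```

Unlike classification, no labels are given: the algorithm discovers the grouping from the geometry of the data alone, which is why the figure shows the original labels replaced by +'s.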
Non-Linear Classification
Figure: classification boundaries learned by a non-linear classifier (such as a neural network) for the loan data set (income vs. debt).
Nearest Neighbor Classifier
Figure: classification boundaries for a nearest neighbor classifier on the loan data set (income vs. debt).
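The boundaries in this last figure come from the case-based learning idea mentioned earlier: classify a new case by the label of its closest stored example. A 1-nearest-neighbor sketch with invented (income, debt) examples for the loan setting:

```python
# 1-nearest-neighbor: the query takes the label of the closest
# labeled point under squared Euclidean distance.
def nearest_neighbor(query, labeled_points):
    # labeled_points: list of ((income, debt), label) pairs.
    def dist2(p, q):
        return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
    _, label = min(labeled_points, key=lambda pl: dist2(pl[0], query))
    return label

data = [((30, 20), "no loan"), ((70, 10), "loan"), ((25, 25), "no loan")]
print(nearest_neighbor((65, 12), data))  # loan
```

There is no training phase: the stored examples are the model, and the piecewise boundaries in the figure are exactly the regions of the plane closest to each stored point.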