CSE 711: DATA MINING
Sargur N. Srihari
E-mail: srihari@cedar.buffalo.edu
Phone: 645-6164, ext. 113
CSE 711 Texts
Required Text
1. Witten, I. H., and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, 2000.
Recommended Texts
1. Adriaans, P., and D. Zantinge, Data Mining, Addison-Wesley, 1998.
2. Groth, R., Data Mining: A Hands-on Approach for Business Professionals, Prentice-Hall PTR, 1997.
3. Kennedy, R., Y. Lee, et al., Solving Data Mining Problems through Pattern Recognition, Prentice-Hall PTR, 1998.
4. Weiss, S., and N. Indurkhya, Predictive Data Mining: A Practical Guide, Morgan Kaufmann, 1998.
Introduction
Challenge: how to manage ever-increasing amounts of information
Solution: data mining and knowledge discovery in databases (KDD)
Information as a Production Factor
Most international organizations produce more information in a week than many people could read in a lifetime.
Data Mining Motivation
Mechanical production of data creates a need for mechanical consumption of data.
Large databases hold vast amounts of information; the difficulty lies in accessing it.
KDD and Data Mining
KDD: extraction of knowledge from data.
Official definition: "the non-trivial extraction of implicit, previously unknown, and potentially useful knowledge from data".
Data mining: the discovery stage of the KDD process.
Data Mining
The process of discovering patterns, automatically or semi-automatically, in large quantities of data.
The patterns discovered must be useful: meaningful in that they lead to some advantage, usually economic.
KDD and Data Mining
Figure 1.1: Data mining is a multi-disciplinary field, drawing on expert systems, statistics, machine learning, databases, and visualization.
Data Mining vs. Query Tools
SQL: when you know exactly what you are looking for.
Data mining: when you only vaguely know what you are looking for.
Practical Applications
KDD is more complicated than initially thought:
80% of the effort goes into preparing the data
20% goes into mining the data
Data Mining Techniques
Not so much a single technique as the idea that there is more knowledge hidden in the data than shows itself on the surface.
Data Mining Techniques
Any technique that helps extract more out of data is useful:
Query tools
Statistical techniques
Visualization
On-line analytical processing (OLAP)
Case-based learning (k-nearest neighbor)
Decision trees
Association rules
Neural networks
Genetic algorithms
Machine Learning and the Methodology of Science
The empirical cycle of scientific research: observation → analysis → theory → prediction → (new) observation.
Machine Learning...
Theory formation: analysis of a limited number of observations yields a theory ("All swans are white"); in reality there is an infinite number of swans.
Theory falsification: a single observation can falsify the theory "All swans are white"; in reality there is an infinite number of swans.
A Kangaroo in Mist
Figure (panels a–f): the complexity of search spaces.
Association Rules
Definition: given a set of transactions, where each transaction is a set of items, an association rule is an expression X → Y, where X and Y are sets of items.
Association Rules
Intuitive meaning of such a rule: transactions in the database that contain the items in X tend also to contain the items in Y.
Association Rules
Example: 98% of customers that purchase tires and automotive accessories also buy some automotive services. Here, 98% is called the confidence of the rule.
The support of the rule X → Y is the percentage of transactions that contain both X and Y.
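The two measures above can be computed directly by counting transactions. A minimal sketch, using an invented tires/accessories/services basket list (the data and names are illustrative, not from the slides):

```python
# Hypothetical transaction database: each transaction is a set of items.
transactions = [
    {"tires", "accessories", "services"},
    {"tires", "accessories", "services"},
    {"tires", "accessories"},
    {"bulbs"},
]

def support(X, Y, transactions):
    # Fraction of all transactions containing every item in X and in Y.
    both = sum(1 for t in transactions if X <= t and Y <= t)
    return both / len(transactions)

def confidence(X, Y, transactions):
    # Of the transactions containing X, the fraction that also contain Y.
    with_x = [t for t in transactions if X <= t]
    return sum(1 for t in with_x if Y <= t) / len(with_x)

X = {"tires", "accessories"}
Y = {"services"}
print(support(X, Y, transactions))     # 0.5  (2 of 4 transactions)
print(confidence(X, Y, transactions))  # 0.666...  (2 of the 3 with X)
```

Note that support is taken over all transactions, while confidence conditions on the transactions that contain X.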
Association Rules
Problem: the problem of mining association rules is to find all rules that satisfy a user-specified minimum support and minimum confidence.
Applications include cross-marketing, attached mailing, catalog design, loss-leader analysis, add-on sales, store layout, and customer segmentation based on buying patterns.
Example Data Sets
Contact Lens (symbolic)
Weather (symbolic)
Weather (numeric + symbolic)
Iris (numeric; outcome: symbolic)
CPU Performance (numeric; outcome: numeric)
Labor Negotiations (missing values)
Soybean
Contact Lens Data
Structural Patterns
Part of a structural description:
If tear production rate = reduced then recommendation = none
Otherwise, if age = young and astigmatic = no then recommendation = soft
The example is simplistic because all combinations of possible values are represented in the table.
Structural Patterns
In most learning situations, the set of examples given as input is far from complete.
Part of the job is to generalize to other, new examples.
Weather Data
Weather Problem
The attributes allow 36 possible combinations (3 × 3 × 2 × 2 = 36), of which 14 are present in the set of examples.
If outlook = sunny and humidity = high then play = no
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity = normal then play = yes
If none of the above then play = yes
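Read as an ordered list where the first matching rule fires, the rule set above can be sketched as a plain function (attribute and value names follow the slide; this is an illustration, not code from the course):

```python
# Ordered rule list for the symbolic weather data: the first matching
# condition determines the class; the last rule is the default.
def play(outlook, humidity, windy):
    if outlook == "sunny" and humidity == "high":
        return "no"
    if outlook == "rainy" and windy:
        return "no"
    if outlook == "overcast":
        return "yes"
    if humidity == "normal":
        return "yes"
    return "yes"  # if none of the above then play = yes

print(play("sunny", "high", False))    # no
print(play("overcast", "high", True))  # yes
print(play("rainy", "normal", True))   # no
```

Because the rules are ordered, "if humidity = normal then play = yes" is only reached when the earlier "no" rules did not match.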
Weather Data with Some Numeric Attributes
Classification and Association Rules
Classification rules: rules that predict the classification of the example in terms of whether to play or not.
If outlook = sunny and humidity > 83 then play = no
Classification and Association Rules
Association rules: rules that strongly associate different attribute values.
Association rules derived from the weather table:
If temperature = cool then humidity = normal
If humidity = normal and windy = false then play = yes
If outlook = sunny and play = no then humidity = high
If windy = false and play = no then outlook = sunny and humidity = high
Rules for Contact Lens Data
If tear production rate = reduced then recommendation = none
If age = young and astigmatic = no and tear production rate = normal then recommendation = soft
If age = pre-presbyopic and astigmatic = no and tear production rate = normal then recommendation = soft
If age = presbyopic and spectacle prescription = myope and astigmatic = no then recommendation = none
If spectacle prescription = hypermetrope and astigmatic = no and tear production rate = normal then recommendation = soft
If spectacle prescription = myope and astigmatic = yes and tear production rate = normal then recommendation = hard
If age = young and astigmatic = yes and tear production rate = normal then recommendation = hard
If age = pre-presbyopic and spectacle prescription = hypermetrope and astigmatic = yes then recommendation = none
If age = presbyopic and spectacle prescription = hypermetrope and astigmatic = yes then recommendation = none
Decision Tree for Contact Lens Data
Figure: a decision tree whose internal nodes test tear production rate, astigmatism, and spectacle prescription, with leaves labeled none, soft, hard, and none.
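The tree in the figure can be written out as nested conditionals, one test per internal node. A sketch, following the branch order named in the figure (tear production rate first, then astigmatism, then spectacle prescription); it simplifies away the age attribute used in the fuller rule set:

```python
# Decision tree as nested conditionals: each "if" is one internal node,
# each "return" is one leaf of the tree.
def recommend(tear_production_rate, astigmatism, spectacle_prescription):
    if tear_production_rate == "reduced":
        return "none"
    if astigmatism == "no":
        return "soft"
    if spectacle_prescription == "myope":
        return "hard"
    return "none"  # hypermetrope with astigmatism

print(recommend("reduced", "no", "myope"))   # none
print(recommend("normal", "yes", "myope"))   # hard
```

The tree and the rule list express the same kind of structural pattern; the tree simply fixes the order in which attributes are tested.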
Iris Data
Iris Rules Learned
If petal-length < 2.45 then Iris-setosa
If sepal-width < 2.10 then Iris-versicolor
If sepal-width < 2.45 and petal-length < 4.55 then Iris-versicolor
...
CPU Performance Data
CPU Performance
Numerical prediction: the outcome is expressed as a linear sum of weighted attributes.
Regression equation: PRP = -55.9 + 0.049 MYCT + ... + 1.48 CHMAX
Regression can discover linear relationships, but not non-linear ones.
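An equation of this form is found by least squares: choose the weights that minimize the squared error between predicted and actual outcomes. A minimal sketch with invented data (the MYCT/CHMAX values and PRP targets below are made up for illustration, not taken from the CPU data set):

```python
import numpy as np

# Invented rows of (MYCT, CHMAX) and PRP targets, purely for illustration.
X = np.array([[125.0, 16.0], [29.0, 32.0], [400.0, 8.0], [50.0, 64.0]])
y = np.array([100.0, 250.0, 40.0, 380.0])

# Append a column of ones for the intercept, then solve min ||A w - y||^2.
A = np.hstack([X, np.ones((len(X), 1))])
w, residuals, rank, _ = np.linalg.lstsq(A, y, rcond=None)

# w[0] and w[1] are the attribute weights; w[2] plays the role of the
# constant term (the -55.9 in the slide's equation).
pred = A @ w
```

Because the model is a weighted sum, it can only fit relationships that are linear in the attributes, which is the limitation the slide notes.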
Linear Regression
Figure: a simple linear regression for the loan data set, with income on the x-axis, debt on the y-axis, and the fitted regression line.
Labor Negotiations Data
Decision Trees for the Labor Negotiations Data
Figure: a simple tree testing wage increase first year and statutory holidays, with leaves labeled good and bad.
Figure: a more complex tree that also tests working hours per week and health plan contribution (none/half/full), again with leaves labeled good and bad.
Soybean Data
Two Example Rules
If [leaf condition is normal and stem condition is abnormal and stem cankers is below soil line and canker lesion color is brown] then diagnosis is rhizoctonia root rot
If [leaf malformation is absent and stem condition is abnormal and stem cankers is below soil line and canker lesion color is brown] then diagnosis is rhizoctonia root rot
Classification
Figure: a simple linear classification boundary for the loan data set (income vs. debt); the shaded region denotes class "no loan".
Clustering
Figure: a simple clustering of the loan data set into 3 clusters (income vs. debt); note that the original labels are replaced by +'s.
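A clustering like the one in the figure can be produced by k-means: alternately assign each point to its nearest center and move each center to the mean of its assigned points. A bare-bones sketch on invented two-dimensional (income, debt) points (the data and k = 2 are illustrative):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k distinct data points
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: each point joins its nearest center's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: (p[0] - centers[c][0]) ** 2
                                + (p[1] - centers[c][1]) ** 2)
            clusters[i].append(p)
        # Update step: move each center to the mean of its cluster.
        for i, cl in enumerate(clusters):
            if cl:
                centers[i] = (sum(p[0] for p in cl) / len(cl),
                              sum(p[1] for p in cl) / len(cl))
    return centers, clusters

# Two well-separated groups of (income, debt) points.
points = [(20, 5), (22, 6), (21, 4), (80, 40), (82, 42), (78, 41)]
centers, clusters = kmeans(points, 2)
```

Unlike classification, no labels are given: the algorithm discovers the grouping from the geometry of the data alone, which is why the figure shows the original labels replaced by +'s.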
Non-Linear Classification
Figure: classification boundaries learned by a non-linear classifier (such as a neural network) for the loan data set (income vs. debt).
Nearest Neighbor Classifier
Figure: classification boundaries for a nearest neighbor classifier on the loan data set (income vs. debt).
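The boundaries in this last figure come from the case-based learning idea mentioned earlier: classify a new case by the label of its closest stored example. A 1-nearest-neighbor sketch with invented (income, debt) examples for the loan setting:

```python
# 1-nearest-neighbor: the query takes the label of the closest
# labeled point under squared Euclidean distance.
def nearest_neighbor(query, labeled_points):
    # labeled_points: list of ((income, debt), label) pairs.
    def dist2(p, q):
        return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
    _, label = min(labeled_points, key=lambda pl: dist2(pl[0], query))
    return label

data = [((30, 20), "no loan"), ((70, 10), "loan"), ((25, 25), "no loan")]
print(nearest_neighbor((65, 12), data))  # loan
```

There is no training phase: the stored examples are the model, and the piecewise boundaries in the figure are exactly the regions of the plane closest to each stored point.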