Fall 2004 Data Mining: IE 483/583 Knowledge Discovery and Data Mining, Dr. Siggi Olafsson, Fall 2003.


1 IE 483/583 Knowledge Discovery and Data Mining, Dr. Siggi Olafsson, Fall 2003

2 What is Data Mining? (… and should I be here?)

3 Dilbert Replies...

4 Some Definitions
“Data mining is the extraction of implicit, previously unknown, and potentially useful information from data.”
“Data mining is the process of exploration and analysis, by automatic or semiautomatic means, of large quantities of data in order to discover meaningful patterns and rules.”

5 What can Data Mining Do?
Supervised: classification, prediction
Unsupervised: association discovery, clustering

6 Applications of Data Mining
Manufacturing Process Improvement
Sales and Marketing
Mapping the Human Genome
Diagnosing Breast Cancer
Financial Crime Identification
Portfolio Management

7 Technical Background
Machine Learning
–Data mining: business-oriented use of AI
Statistics
–Regression, sampling, DOE, etc.
Decision Support
–Data warehousing, data marts, OLAP, etc.
Interdisciplinary tools put together to form the process of knowledge discovery in databases …

8 Historical Perspective
Decade  Field  Developments
< 40    Stat   Bayes theorem, regression, etc.
40s     AI     Neural networks
50s     AI     Nearest neighbor, single link, perceptron
50s     Stat   Resampling, bias reduction, jackknife
60s     Stat   Linear models for classification, exploratory data analysis (EDA)
60s     IR     Similarity measures, clustering
60s     DB     Relational data model
70s     IR     Smart IR systems
70s     AI     Genetic algorithms
70s     Stat   EM algorithm, k-means clustering
80s     AI     Kohonen maps, decision trees
90s     DB     Association rule algorithms, web & search engines, data warehousing, OLAP

9 What Changed?
Very large databases
Increased computational power as enabler
Business perspective

10 Knowledge Discovery in Databases
[Figure: process diagram with stages Databases, Data warehouse, Prepared Data, Model/Structures, Knowledge; labels Data Warehouse Systems Engineering and Knowledge Discovery and Data Mining]

11 Course Information
We assume data is ready for mining
Thus, we focus on:
–models and structures, and
–algorithms
More information on course homepage: http://www.public.iastate.edu/~olafsson/mining.html

12

13 Course Outline
Introduction
Exploratory Data Mining
Supervised Learning
Unsupervised Learning
Optimization Methods in Learning
Selected Advanced Topics
–Mining the Web
–Customer Relationship Management (CRM)
Course Review

14 Questions?

15 Data Mining
Discover patterns in data
–automatic or semi-automatic process
–meaningful or useful pattern
–large amounts of data
What does such a pattern look like?
Black box
Transparent box

16 Describing Structural Patterns
Some ways of representing knowledge:
–Decision tables
–Decision trees
–Classification rules
–Association rules
–Regression trees
–Clusters

17 The Weather Problem

18 A Decision List
If outlook = sunny and humidity = high then play = no
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity = normal then play = yes
If none of the above then play = yes
These are classification rules
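
A decision list is evaluated top to bottom, and the first rule whose conditions all hold decides the class. The following is a minimal Python sketch of the list above; the dict-based instance encoding and the names DECISION_LIST and classify are assumptions made for illustration, not part of the course material.

```python
# Decision list for the weather problem: rules are tried in order and the
# first rule whose conditions all match decides the class.
# Attribute names and values follow the slide; the dict encoding of an
# instance is an assumption of this sketch.
DECISION_LIST = [
    ({"outlook": "sunny", "humidity": "high"}, "no"),
    ({"outlook": "rainy", "windy": "true"}, "no"),
    ({"outlook": "overcast"}, "yes"),
    ({"humidity": "normal"}, "yes"),
    ({}, "yes"),  # "none of the above": an empty condition matches everything
]


def classify(instance, rules=DECISION_LIST):
    """Return the play decision of the first rule that matches the instance."""
    for conditions, decision in rules:
        if all(instance.get(attr) == value for attr, value in conditions.items()):
            return decision
    raise ValueError("no rule matched")  # unreachable while a default rule exists


print(classify({"outlook": "sunny", "humidity": "high", "windy": "false"}))   # no
print(classify({"outlook": "rainy", "humidity": "normal", "windy": "true"}))  # no
print(classify({"outlook": "sunny", "humidity": "normal", "windy": "true"}))  # yes
```

Because order matters, the fourth rule (humidity = normal) is only reached by instances that the first three rules did not already classify.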

19 Association Rules
Many association rules can be inferred:
if temperature = cool then humidity = normal
if humidity = normal and windy = false then play = yes
if outlook = sunny and play = no then humidity = high

20 Three Layers of the Process
Inputs
Outputs
Algorithms

21 Inputs
Three forms
–Concepts (concept description): what you want to learn
–Instances (examples): what you learn from
–Attributes (features of instances): variables you have values for

22 Concepts: Styles of Learning
Classification (supervised) learning
Association learning
Clustering
Numeric prediction

23 Instances: Learn from Examples
Set of instances to be classified, or associated, or clustered
Example of concept to be learned
Data set: flat file (single relation)
–denormalization
Family tree example
–concept: sister
–example: family tree

24 Family Tree

25 Denormalizing Relational Data

26 Denormalization Problems
Computational and storage costs
Trivial regularities (relations shown: customers, products; product, supplier; supplier, supplier address)
Infinite relations

27 Content of Instances: Attributes
Instance characterized by values of its (predefined) set of attributes
–Numeric (“continuous”)
–Nominal (categorical)
–Ordinal (rank)
–Interval
–Ratio
Focus in this class

28 Data Preparation
Data …
–assembly: set of instances/denormalizing relational data
–integration: enterprise-wide database/data warehouse
–cleaning: missing data
–aggregation: good information

29 ARFF Format
Used by the Java package Weka
Independent, unordered instances
No relationship between instances

30 Weather Data

31 Features
% = comments
@relation
@attribute
–Attribute types: nominal and numeric
@data
–List of instances
–Missing values represented by ?
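
To make the ARFF features concrete, here is a hand-rolled Python reader for a tiny file. It is a sketch, not Weka's parser: it handles only the features listed above (% comments, @relation, @attribute with nominal or numeric types, @data rows, and ? for missing values). The two data rows and the function name parse_arff are hypothetical and chosen for illustration.

```python
# Minimal ARFF reader (a sketch, not Weka's parser).
ARFF_TEXT = """\
% Hypothetical two-row excerpt in the style of the weather data
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute play {yes, no}
@data
sunny, 85, no
overcast, ?, yes
"""


def parse_arff(text):
    """Return (relation name, [(attribute, type)], [row dicts])."""
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and % comments
        lower = line.lower()
        if in_data:
            values = [v.strip() for v in line.split(",")]
            row = {}
            for (name, kind), value in zip(attributes, values):
                if value == "?":
                    row[name] = None          # missing value
                elif kind == "numeric":
                    row[name] = float(value)  # numeric attribute
                else:
                    row[name] = value         # nominal value kept as a string
            rows.append(row)
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, spec = line.split(None, 2)
            if spec.startswith("{"):
                # nominal attribute: type is the list of allowed values
                kind = [v.strip() for v in spec.strip("{}").split(",")]
            else:
                kind = spec.lower()
            attributes.append((name, kind))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows


relation, attributes, rows = parse_arff(ARFF_TEXT)
print(relation)  # weather
print(rows[1])   # {'outlook': 'overcast', 'temperature': None, 'play': 'yes'}
```

Each row is parsed independently of the others, which mirrors the ARFF assumption that instances are unordered and unrelated.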

32 Other Issues
Missing data
Inaccurate values
Look at the data!!!

33 Recall the Three Layers of the Data Mining Process
Inputs (done)
Outputs (structural patterns) (next)
Algorithms

34 Describing Structural Patterns
Ways of representing knowledge:
–Decision tables
–Decision trees
–Classification rules
–Association rules
–Regression trees
–Clusters

35 The Weather Problem

36 A Decision List
If outlook = sunny and humidity = high then play = no
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity = normal then play = yes
If none of the above then play = yes

37 A Decision Tree
[Figure: decision tree with root Outlook (branches Sunny, Overcast, Rainy), internal nodes Humidity and Windy, branch labels High and TRUE, and leaves Play=Yes and Play=No]
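
The weather decision tree can be sketched in Python as a nested structure that is walked from the root to a leaf. The transcript preserves only some branch labels (High, TRUE); the Normal and FALSE branches below are assumed to lead to Play=Yes, which agrees with the decision list for the same problem. The tuple/dict encoding and the names TREE and classify are illustrative assumptions.

```python
# Weather decision tree: an internal node is (attribute, {value: subtree}),
# a leaf is the class label as a string.
# ASSUMPTION: the "normal" and "false" branches are not visible in the
# transcript; they are filled in to agree with the decision list.
TREE = ("outlook", {
    "sunny": ("humidity", {"high": "no", "normal": "yes"}),
    "overcast": "yes",
    "rainy": ("windy", {"true": "no", "false": "yes"}),
})


def classify(instance, node=TREE):
    """Follow the branch matching each tested attribute until a leaf is reached."""
    while not isinstance(node, str):
        attribute, branches = node
        node = branches[instance[attribute]]
    return node


print(classify({"outlook": "sunny", "humidity": "high", "windy": "false"}))    # no
print(classify({"outlook": "overcast", "humidity": "high", "windy": "true"}))  # yes
print(classify({"outlook": "rainy", "humidity": "high", "windy": "false"}))    # yes
```

Unlike the decision list, the tree tests only the attributes on one root-to-leaf path, so an overcast instance is classified without looking at humidity or wind.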

38 Concepts: Styles of Learning
Classification (supervised) learning
Association learning
Clustering
Numeric prediction

39 Classification Rules
Classification rules are easily read off decision trees
How?
The other direction is possible, but not as straightforward
If a and b then x
If c and d then x
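
One answer to "How?": each root-to-leaf path becomes one rule, with the tests along the path as the conditions and the leaf as the conclusion. The Python sketch below does this for a weather tree; the tree encoding, the assumed Normal and FALSE branches, and the name tree_to_rules are illustrative assumptions rather than material from the slides.

```python
# Reading classification rules off a decision tree: one rule per leaf.
# An internal node is (attribute, {value: subtree}); a leaf is a class label.
# ASSUMPTION: the "normal" and "false" branches are filled in for illustration.
TREE = ("outlook", {
    "sunny": ("humidity", {"high": "no", "normal": "yes"}),
    "overcast": "yes",
    "rainy": ("windy", {"true": "no", "false": "yes"}),
})


def tree_to_rules(node, path=()):
    """Return a list of (conditions, label), one entry per leaf."""
    if isinstance(node, str):
        return [(path, node)]
    attribute, branches = node
    rules = []
    for value, subtree in branches.items():
        rules.extend(tree_to_rules(subtree, path + ((attribute, value),)))
    return rules


for conditions, label in tree_to_rules(TREE):
    tests = " and ".join(f"{attr} = {value}" for attr, value in conditions)
    print(f"if {tests} then play = {label}")
```

The rules produced this way are unambiguous and can be applied in any order, but they are often more complex than necessary, since every rule repeats the tests from the root.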

40 Corresponding Decision Tree
[Figure: decision tree testing attributes a, b, c, d, with branches labeled y/n and leaves labeled x]

41 Replicated Subtree Problem
[Figure: decision tree with tests x = 1 and y = 1, branches labeled y/n, and leaves a and b]
If x=1 and y=0 then a
If x=0 and y=1 then a
If x=0 and y=0 then b
If x=1 and y=1 then b

42 Replicated Subtree Problem
If x=1 and y=1 then a
If z=1 and w=1 then a
Otherwise b
x, y, z, w take values 1, 2, 3

43 Rules with Exceptions
If x and y then a EXCEPT if z then b
Account for new instances
Exceptions from exceptions, etc.

44 Association Rules
Coverage (support): number of instances it predicts correctly
Accuracy (confidence): coverage divided by number of instances it applies to
If temperature = cool then humidity = normal
Coverage = 4
Accuracy = 100%
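
The two measures can be checked by counting. The Python sketch below assumes the standard 14-instance nominal weather data set distributed with Weka (the transcript does not reproduce the table); the function name coverage_and_accuracy and the dict encoding are likewise assumptions made for illustration.

```python
# Coverage (support) and accuracy (confidence) of an association rule.
# ASSUMPTION: the rows below are the standard nominal weather data shipped
# with Weka; the slide's own table is not part of the transcript.
FIELDS = ("outlook", "temperature", "humidity", "windy", "play")
ROWS = """\
sunny hot high false no
sunny hot high true no
overcast hot high false yes
rainy mild high false yes
rainy cool normal false yes
rainy cool normal true no
overcast cool normal true yes
sunny mild high false no
sunny cool normal false yes
rainy mild normal false yes
sunny mild normal true yes
overcast mild high true yes
overcast hot normal false yes
rainy mild high true no
"""
DATA = [dict(zip(FIELDS, line.split())) for line in ROWS.splitlines()]


def coverage_and_accuracy(data, antecedent, consequent):
    """Coverage = instances predicted correctly; accuracy = coverage / instances the rule applies to."""
    applies = [x for x in data if all(x[a] == v for a, v in antecedent.items())]
    correct = [x for x in applies if all(x[a] == v for a, v in consequent.items())]
    accuracy = len(correct) / len(applies) if applies else 0.0
    return len(correct), accuracy


# If temperature = cool then humidity = normal
print(coverage_and_accuracy(DATA, {"temperature": "cool"}, {"humidity": "normal"}))  # (4, 1.0)
```

Under that assumption the count reproduces the slide's figures: the rule applies to four instances and is correct on all four.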

45 Interpretation
If windy = false and play = no then outlook = sunny and humidity = high
If windy = false and play = no then outlook = sunny
If windy = false and play = no then humidity = high
If humidity = high and windy = false and play = no then outlook = sunny

46 The Shapes Problem
Shaded = standing
Unshaded = lying

47 Instances

48 Classification Rules
If width ≥ 3.5 and height < 7.0 then lying
If height ≥ 3.5 then standing
Work well to classify these instances
Problems?

49 Relational Rules
Rules comparing attributes to constants are called propositional rules
Structural patterns?
If width > height then lying
If height > width then standing
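
The difference shows up on instances outside the range the thresholds were tuned for. The Python sketch below contrasts the relational rules above with a propositional rule that compares height to a constant; the 3.5 threshold, the three blocks, and the function names are hypothetical values chosen for illustration.

```python
# Relational rule (compares two attributes) versus a propositional rule
# (compares one attribute to a constant).
def relational(width, height):
    """Rules from the slide: they compare width and height to each other."""
    if width > height:
        return "lying"
    if height > width:
        return "standing"
    return None  # the slide's rules do not cover width == height


def propositional(width, height, threshold=3.5):
    """Hypothetical fixed-threshold rule; the constant is an assumption."""
    return "standing" if height >= threshold else "lying"


# Hypothetical blocks given as (width, height).
blocks = [(2, 6), (7, 3), (10, 5)]
for width, height in blocks:
    print((width, height), relational(width, height), propositional(width, height))
# (2, 6) standing standing
# (7, 3) lying lying
# (10, 5) lying standing
```

The last block is wider than it is tall, so the relational rule calls it lying, while the fixed threshold misclassifies it because the block is simply larger than the ones the constant was fitted to.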

50 CPU Performance Example

51 Numerical Prediction: regression equation

52 Regression Tree
[Figure: regression tree with split attributes CHMIN, CACH, and MMAX; split points ≤ 7.5 / > 7.5 and ≤ 8.5 / (8.5, 28] / > 28; one leaf value shown as 64.6]
–Accuracy?
–Large and possibly awkward

53 Model Trees
[Figure: model tree with split attributes CHMIN, CACH, and MMAX; split points ≤ 7.5 / > 7.5, ≤ 8.5 / > 8.5, and ≤ 28000 / > 28000; leaves include linear models LM4, LM5, LM6]

54 Instance-Based Representation
Store actual instances
New instance: algorithm finds “most similar” stored instance
Features
–What is a similar instance?
–Need to store (all?) instances
–Really a black box method
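
A minimal Python sketch of the idea: keep the training instances and label a new instance with the class of its closest stored neighbor. Euclidean distance is used here as one possible answer to "what is a similar instance?"; that choice, the stored (width, height) examples, and the name nearest_neighbor are assumptions for illustration.

```python
import math


def nearest_neighbor(stored, query):
    """Return the label of the stored instance closest to the query.

    stored: list of (feature_vector, label) pairs
    query:  feature vector of the new instance
    ASSUMPTION: similarity is measured by Euclidean distance.
    """
    _, label = min(stored, key=lambda item: math.dist(item[0], query))
    return label


# Hypothetical stored instances: (width, height) with a class label.
STORED = [
    ((2.0, 6.0), "standing"),
    ((3.0, 8.0), "standing"),
    ((7.0, 3.0), "lying"),
    ((9.0, 4.0), "lying"),
]

print(nearest_neighbor(STORED, (8.0, 2.0)))  # lying
print(nearest_neighbor(STORED, (2.5, 7.0)))  # standing
```

Nothing is learned ahead of time: every prediction scans all stored instances, which is why storage cost and the choice of distance measure are the main concerns listed above.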

55 Clusters
[Figure: two cluster diagrams over the instances a, b, c, d, e, f, g, h, i, j, k]

56 Next: Algorithms

