Published by Reynard French. Modified over 9 years ago.
1
Fall 2004, Data Mining
IE 483/583 Knowledge Discovery and Data Mining
Dr. Siggi Olafsson
Fall 2003
2
What is Data Mining? (… and should I be here?)
3
Dilbert Replies...
4
Some Definitions
“Data mining is the extraction of implicit, previously unknown, and potentially useful information from data.”
“Data mining is the process of exploration and analysis, by automatic or semiautomatic means, of large quantities of data in order to discover meaningful patterns and rules.”
5
What can Data Mining Do?
Supervised
  – Classification
  – Prediction
Unsupervised
  – Association discovery
  – Clustering
6
Applications of Data Mining
Manufacturing Process Improvement
Sales and Marketing
Mapping the Human Genome
Diagnosing Breast Cancer
Financial Crime Identification
Portfolio Management
7
Technical Background
Machine Learning
  – Data mining: business-oriented use of AI
Statistics
  – Regression, sampling, DOE, etc.
Decision Support
  – Data warehousing, data marts, OLAP, etc.
Interdisciplinary tools put together to form the process of knowledge discovery in databases …
8
Historical Perspective
< 40  Stat  Bayes theorem, regression, etc.
40s   AI    Neural networks
50s   AI    Nearest neighbor, single link, perceptron
      Stat  Resampling, bias reduction, jackknife
60s   Stat  Linear models for classification, exploratory data analysis (EDA)
      IR    Similarity measures, clustering
      DB    Relational data model
70s   IR    Smart IR systems
      AI    Genetic algorithms
      Stat  EM algorithm, k-means clustering
80s   AI    Kohonen maps, decision trees
90s   DB    Association rule algorithms, web & search engines, data warehousing, OLAP
9
What Changed?
Very large databases
Increased computational power as enabler
Business perspective
10
Knowledge Discovery in Databases
[Figure: Databases → Data warehouse → Prepared Data → Model/Structures → Knowledge. Stage labels: Data Warehouse Systems Engineering; Knowledge Discovery and Data Mining]
11
Course Information
We assume data is ready for mining
Thus, we focus on:
  – models and structures, and
  – algorithms
More information on course homepage
http://www.public.iastate.edu/~olafsson/mining.html
13
Course Outline
Introduction
Exploratory Data Mining
Supervised Learning
Unsupervised Learning
Optimization Methods in Learning
Selected Advanced Topics
  – Mining the Web
  – Customer Relationship Management (CRM)
Course Review
14
Questions?
15
Data Mining
Discover patterns in data
  – automatic or semi-automatic process
  – meaningful or useful pattern
  – large amounts of data
What does such a pattern look like?
Black box / Transparent box
16
Describing Structural Patterns
Some ways of representing knowledge:
  – Decision tables
  – Decision trees
  – Classification rules
  – Association rules
  – Regression trees
  – Clusters
17
The Weather Problem
18
A Decision List
If outlook = sunny and humidity = high then play = no
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity = normal then play = yes
If none of the above then play = yes
These are classification rules
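A decision list is evaluated top to bottom, and the first rule that fires decides the class. The following is a minimal sketch of that reading of the rules above; the names `DECISION_LIST`, `DEFAULT`, and `classify` are illustrative and not part of the course material.

```python
# Each rule is (conditions, prediction); conditions maps attribute -> required value.
# The rules are copied from the decision list on the slide, in the same order.
DECISION_LIST = [
    ({"outlook": "sunny", "humidity": "high"}, "no"),
    ({"outlook": "rainy", "windy": "true"}, "no"),
    ({"outlook": "overcast"}, "yes"),
    ({"humidity": "normal"}, "yes"),
]
DEFAULT = "yes"  # the "none of the above" rule


def classify(instance, rules=DECISION_LIST, default=DEFAULT):
    """Return the prediction of the first rule whose conditions all hold."""
    for conditions, prediction in rules:
        if all(instance.get(attr) == value for attr, value in conditions.items()):
            return prediction
    return default


# Rule order matters: this instance also satisfies "humidity = normal",
# but the earlier "rainy and windy" rule fires first.
print(classify({"outlook": "rainy", "humidity": "normal", "windy": "true"}))  # no
```

Because the rules are ordered, they cannot be read in isolation: reordering the list can change the prediction for the same instance.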
19
Association Rules
Many association rules can be inferred:
if temperature = cool then humidity = normal
if humidity = normal and windy = false then play = yes
if outlook = sunny and play = no then humidity = high
20
Three Layers of the Process
Inputs
Outputs
Algorithms
21
Inputs
Three forms
  – Concepts: concept description (what you want to learn)
  – Instances: examples (what you learn from)
  – Attributes: features of instances (variables you have values for)
22
Concepts: Styles of Learning
Classification (supervised) learning
Association learning
Clustering
Numeric prediction
23
Instances: Learn from Examples
Set of instances to be classified, or associated, or clustered
Example of concept to be learned
Data set: flat file (single relation)
  – denormalization
Family tree example
  – concept: sister
  – example: family tree
24
Family Tree
25
Denormalizing Relational Data
26
Denormalization Problems
Computational and storage costs
Trivial regularities
  – customers / products
  – product / supplier
  – supplier / supplier address
Infinite relations
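To make the "trivial regularities" problem concrete, the sketch below joins three small relations into one flat file and then checks which attribute pairs are fully determined by the join itself. The relations, their contents, and the helper names (`denormalize`, `is_functional_dependency`) are hypothetical and chosen only for illustration.

```python
# Hypothetical normalized relations (not taken from the slides).
PURCHASES = [("alice", "bolt"), ("bob", "bolt"), ("carol", "nut")]
PRODUCT_SUPPLIER = {"bolt": "acme", "nut": "zenith"}
SUPPLIER_ADDRESS = {"acme": "Ames", "zenith": "Des Moines"}


def denormalize(purchases, product_supplier, supplier_address):
    """Join the three relations into a single flat table, one row per purchase."""
    flat = []
    for customer, product in purchases:
        supplier = product_supplier[product]
        flat.append({
            "customer": customer,
            "product": product,
            "supplier": supplier,
            "supplier_address": supplier_address[supplier],
        })
    return flat


def is_functional_dependency(rows, source, target):
    """True if every value of `source` always appears with the same `target`."""
    seen = {}
    for row in rows:
        if seen.setdefault(row[source], row[target]) != row[target]:
            return False
    return True


FLAT = denormalize(PURCHASES, PRODUCT_SUPPLIER, SUPPLIER_ADDRESS)

# The supplier address is repeated on every row (storage cost), and a learner
# will "discover" that supplier predicts supplier_address with 100% accuracy,
# which is an artifact of the join rather than new knowledge.
print(is_functional_dependency(FLAT, "supplier", "supplier_address"))  # True
print(is_functional_dependency(FLAT, "product", "customer"))           # False
```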
27
Content of Instances: Attributes
Instance characterized by values of its (predefined) set of attributes
  – Numeric (“continuous”)
  – Nominal (categorical)
  – Ordinal (rank)
  – Interval
  – Ratio
Focus in this class
28
Data Preparation
Data …
  – assembly: set of instances / denormalizing relational data
  – integration: enterprise-wide database / data warehouse
  – cleaning: missing data
  – aggregation: good information
29
ARFF Format
Used by the Java package Weka
Independent, unordered instances
No relationship between instances
30
Weather Data
31
Features
% = comments
@relation
@attribute
  – Attribute types: Nominal and numeric
@data
  – List of instances
  – Missing values represented by ?
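The ARFF features listed above can be illustrated with a small reader. This is a simplified sketch, not Weka's parser: it handles only `%` comments, `@relation`, nominal and numeric `@attribute` declarations, and comma-separated `@data` rows with `?` for missing values. The two-instance `SAMPLE` file is a made-up excerpt in the style of the weather data, and `parse_arff` is a hypothetical helper name.

```python
SAMPLE = """\
% Hypothetical excerpt in the style of the weather data
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute play {yes, no}
@data
sunny, 85, no
overcast, ?, yes
"""


def parse_arff(text):
    """Parse a small ARFF document into (relation, attributes, instances).

    Values are kept as strings; a missing value ("?") becomes None.
    Quoted names, sparse rows, and other ARFF features are not handled.
    """
    relation, attributes, instances = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        lower = line.lower()
        if in_data:
            values = [v.strip() for v in line.split(",")]
            instances.append([None if v == "?" else v for v in values])
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, spec = line.split(None, 2)
            if spec.startswith("{"):
                # Nominal attribute: keep the list of allowed values.
                kind = [v.strip() for v in spec.strip("{}").split(",")]
            else:
                kind = spec.lower()  # e.g. "numeric"
            attributes.append((name, kind))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, instances


relation, attributes, instances = parse_arff(SAMPLE)
print(relation)      # weather
print(instances[1])  # ['overcast', None, 'yes']
```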
32
Other Issues
Missing data
Inaccurate values
Look at the data!!!
33
Recall the Three Layers of the Data Mining Process
Inputs (done)
Outputs (structural patterns) (next)
Algorithms
34
Describing Structural Patterns
Ways of representing knowledge:
  – Decision tables
  – Decision trees
  – Classification rules
  – Association rules
  – Regression trees
  – Clusters
35
The Weather Problem
36
A Decision List
If outlook = sunny and humidity = high then play = no
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity = normal then play = yes
If none of the above then play = yes
37
A Decision Tree
[Figure: decision tree with root Outlook (branches Sunny, Overcast, Rainy), internal nodes Humidity and Windy, and leaves Play=Yes and Play=No]
38
Concepts: Styles of Learning
Classification (supervised) learning
Association learning
Clustering
Numeric prediction
39
Classification Rules
Classification rules are easily read off decision trees
How?
Other direction possible, but not as straightforward
If a and b then x
If c and d then x
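One way to answer the "How?" above: every path from the root to a leaf becomes one rule, with the tests along the path as its conditions. The sketch below assumes the usual tree for the weather problem (Outlook at the root, Humidity under sunny, Windy under rainy), since the tree figure did not survive extraction; the tuple representation and `tree_to_rules` are illustrative choices.

```python
# A tree is either a leaf (a class label string) or a pair
# (attribute, {attribute value: subtree}).
# Assumed structure of the weather decision tree.
WEATHER_TREE = (
    "outlook",
    {
        "sunny": ("humidity", {"high": "no", "normal": "yes"}),
        "overcast": "yes",
        "rainy": ("windy", {"true": "no", "false": "yes"}),
    },
)


def tree_to_rules(tree, path=()):
    """Return one (conditions, label) rule per root-to-leaf path."""
    if isinstance(tree, str):  # leaf: emit the accumulated path as a rule
        return [(list(path), tree)]
    attribute, branches = tree
    rules = []
    for value, subtree in branches.items():
        rules.extend(tree_to_rules(subtree, path + ((attribute, value),)))
    return rules


for conditions, label in tree_to_rules(WEATHER_TREE):
    tests = " and ".join(f"{a} = {v}" for a, v in conditions)
    print(f"If {tests} then play = {label}")
```

Rules produced this way are unambiguous and can be applied in any order, but they are often more complex than necessary. Going from rules back to a tree is harder because a tree must pick a single attribute to test at the root.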
40
Corresponding Decision Tree
[Figure: decision tree testing a, b, c, and d with y/n branches and leaves x]
41
Replicated Subtree Problem
[Figure: decision tree testing x = 1 and y = 1 with y/n branches and leaves a and b]
If x=1 and y=0 then a
If x=0 and y=1 then a
If x=0 and y=0 then b
If x=1 and y=1 then b
42
Replicated Subtree Problem
If x=1 and y=1 then a
If z=1 and w=1 then a
Otherwise b
x, y, z, w take values 1, 2, 3
43
If x and y then a EXCEPT if z then b
Rules with exceptions
Account for new instances
Exceptions from exceptions, etc.
44
Association Rules
Coverage (support): number of instances it predicts correctly
Accuracy (confidence): coverage divided by number of instances it applies to
If temperature = cool then humidity = normal
Coverage = 4
Accuracy = 100%
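The two measures can be computed directly from their definitions. The slide's weather table did not survive extraction, so the sketch below assumes the standard 14-instance nominal weather data distributed with Weka; the function and variable names are illustrative.

```python
ATTRIBUTES = ("outlook", "temperature", "humidity", "windy", "play")

# Assumed data: the standard nominal weather data set shipped with Weka.
ROWS = [
    ("sunny", "hot", "high", "false", "no"),
    ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"),
    ("rainy", "mild", "high", "false", "yes"),
    ("rainy", "cool", "normal", "false", "yes"),
    ("rainy", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"),
    ("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"),
    ("rainy", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"),
    ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"),
    ("rainy", "mild", "high", "true", "no"),
]
WEATHER = [dict(zip(ATTRIBUTES, row)) for row in ROWS]


def matches(instance, conditions):
    """True if the instance has the required value for every listed attribute."""
    return all(instance[attr] == value for attr, value in conditions.items())


def coverage_and_accuracy(data, antecedent, consequent):
    """Coverage = instances predicted correctly; accuracy = coverage / instances the rule applies to."""
    applies = [x for x in data if matches(x, antecedent)]
    correct = [x for x in applies if matches(x, consequent)]
    accuracy = len(correct) / len(applies) if applies else 0.0
    return len(correct), accuracy


# If temperature = cool then humidity = normal
print(coverage_and_accuracy(WEATHER, {"temperature": "cool"}, {"humidity": "normal"}))  # (4, 1.0)
```

On this data the result agrees with the slide: the rule applies to four instances and is correct on all of them, giving coverage 4 and accuracy 100%.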
45
Interpretation
If windy = false and play = no then outlook = sunny and humidity = high
If windy = false and play = no then outlook = sunny
If windy = false and play = no then humidity = high
If humidity = high and windy = false and play = no then outlook = sunny
46
The Shapes Problem
Shaded = standing
Unshaded = lying
47
Instances
48
Classification Rules
If width ≥ 3.5 and height < 7.0 then lying
If height ≥ 3.5 then standing
Work well to classify these instances
Problems?
49
Relational Rules
Rules comparing attributes to constants are called propositional rules
Structural patterns?
If width > height then lying
If height > width then standing
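The difference between the two kinds of rules shows up on instances outside the range of the training data. In the sketch below, the propositional rule uses the thresholds 3.5 and 7.0 from the shapes problem, with the comparisons assumed to be ≥ where the extracted slide text lost the operator; the test block (width 10, height 8) is hypothetical.

```python
def propositional(width, height):
    """Rules that compare each attribute to a constant (operators assumed)."""
    if width >= 3.5 and height < 7.0:
        return "lying"
    if height >= 3.5:
        return "standing"
    return None  # no rule applies


def relational(width, height):
    """Rules that compare attributes to each other."""
    if width > height:
        return "lying"
    if height > width:
        return "standing"
    return None  # a square block is covered by neither rule


# A hypothetical large block that is wider than it is tall.
print(propositional(10, 8))  # standing
print(relational(10, 8))     # lying
```

The propositional thresholds are tied to the sizes seen in the training instances, while the relational rule captures the structural pattern and keeps working when the blocks are scaled.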
50
CPU Performance Example
51
Numerical Prediction: regression equation
52
Regression Tree
[Figure: regression tree splitting on CHMIN, CACH, and MMAX, with numeric predictions at the leaves]
  – Accuracy?
  – Large and possibly awkward
53
Model Trees
[Figure: model tree splitting on CHMIN, CACH, and MMAX, with linear models (LM4, LM5, LM6, …) at the leaves]
54
Instance-Based Representation
Store actual instances
New instance: algorithm finds “most similar” stored instance
Features
  – What is a similar instance?
  – Need to store (all?) instances
  – Really a black box method
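A minimal sketch of the idea, assuming numeric attributes and Euclidean distance as the answer to "what is a similar instance?" (other distance measures are equally possible). The stored instances are hypothetical (width, height) pairs, and `nearest_neighbor` is an illustrative name.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_neighbor(stored, query):
    """Return the label of the stored instance closest to the query."""
    _, label = min(stored, key=lambda item: distance(item[0], query))
    return label


# Hypothetical stored instances: ((width, height), class).
STORED = [
    ((2.0, 6.0), "standing"),
    ((7.0, 3.0), "lying"),
    ((3.0, 8.0), "standing"),
]

print(nearest_neighbor(STORED, (6.0, 2.0)))  # lying
```

No explicit model is built: all the work happens at prediction time, every stored instance is scanned, and the result gives no description of the pattern, which is why the slide calls it a black box method.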
55
Clusters
[Figure: cluster diagrams over the instances a through k]
56
Next: Algorithms