Fall 2004Data Mining1 IE 483/583 Knowledge Discovery and Data Mining Dr. Siggi Olafsson Fall 2003
What is Data Mining? (… and should I be here?)

Dilbert Replies...

Some Definitions
“Data mining is the extraction of implicit, previously unknown, and potentially useful information from data.”
“Data mining is the process of exploration and analysis, by automatic or semiautomatic means, of large quantities of data in order to discover meaningful patterns and rules.”

What can Data Mining Do?
Supervised
  –Classification
  –Prediction
Unsupervised
  –Association discovery
  –Clustering

Applications of Data Mining
Manufacturing Process Improvement
Sales and Marketing
Mapping the Human Genome
Diagnosing Breast Cancer
Financial Crime Identification
Portfolio Management

Technical Background
Machine Learning
  –Data mining: business-oriented use of AI
Statistics
  –Regression, sampling, DOE, etc.
Decision Support
  –Data warehousing, data marts, OLAP, etc.
Interdisciplinary tools put together to form the process of knowledge discovery in databases …

Historical Perspective
Decade  Field  Developments
< 40    Stat   Bayes theorem, regression, etc.
40s     AI     Neural networks
50s     AI     Nearest neighbor, single link, perceptron
        Stat   Resampling, bias reduction, jackknife
60s     Stat   Linear models for classification, exploratory data analysis (EDA)
        IR     Similarity measures, clustering
        DB     Relational data model
70s     IR     Smart IR systems
        AI     Genetic algorithms
        Stat   EM algorithm, k-means clustering
80s     AI     Kohonen maps, decision trees
90s     DB     Association rule algorithms, web & search engines, data warehousing, OLAP

What Changed?
Very large databases
Increased computational power as enabler
Business perspective

Knowledge Discovery in Databases
Databases → Data warehouse → Prepared Data → Model/Structures → Knowledge
[Diagram labels: Data Warehouse Systems Engineering; Knowledge Discovery and Data Mining]

Course Information
We assume data is ready for mining
Thus, we focus on:
  –models and structures, and
  –algorithms
More information on course homepage

Course Outline
Introduction
Exploratory Data Mining
Supervised Learning
Unsupervised Learning
Optimization Methods in Learning
Selected Advanced Topics
  –Mining the Web
  –Customer Relationship Management (CRM)
Course Review

Questions?

Data Mining
Discover patterns in data
  –automatic or semi-automatic process
  –meaningful or useful pattern
  –large amounts of data
What does such a pattern look like?
Black box / Transparent box

Describing Structural Patterns
Some ways of representing knowledge:
  –Decision tables
  –Decision trees
  –Classification rules
  –Association rules
  –Regression trees
  –Clusters

The Weather Problem

A Decision List
If outlook = sunny and humidity = high then play = no
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity = normal then play = yes
If none of the above then play = yes
These are classification rules

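A decision list is applied top to bottom: the first rule whose conditions hold decides the class. A minimal Python sketch of that behavior follows. The names `DECISION_LIST` and `classify` and the dictionary encoding are illustrative assumptions, not part of the course material.

```python
# Hypothetical encoding of the slide's decision list as (conditions, prediction)
# pairs. Rules are checked in order; the first matching rule decides the class.
DECISION_LIST = [
    ({"outlook": "sunny", "humidity": "high"}, "no"),
    ({"outlook": "rainy", "windy": True}, "no"),
    ({"outlook": "overcast"}, "yes"),
    ({"humidity": "normal"}, "yes"),
    ({}, "yes"),  # "none of the above": an empty condition always matches
]


def classify(instance, rules=DECISION_LIST):
    """Return the play prediction of the first rule that matches the instance."""
    for conditions, prediction in rules:
        if all(instance.get(attr) == value for attr, value in conditions.items()):
            return prediction
    raise ValueError("no rule matched")  # unreachable while a default rule exists


print(classify({"outlook": "sunny", "humidity": "high", "windy": False}))
print(classify({"outlook": "rainy", "humidity": "normal", "windy": False}))
```

Because order matters, moving the default rule to the top would make every instance classify as yes; this is the difference between a decision list and an unordered rule set.
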
Association Rules
Many association rules can be inferred:
  –if temperature = cool then humidity = normal
  –if humidity = normal and windy = false then play = yes
  –if outlook = sunny and play = no then humidity = high

Three Layers of the Process
Inputs
Outputs
Algorithms

Inputs
Three forms
  –Concepts: concept description, what you want to learn
  –Instances: examples, what you learn from
  –Attributes: features of instances, variables you have values for

Concepts: Styles of Learning
Classification (supervised) learning
Association learning
Clustering
Numeric prediction

Instances: Learn from Examples
Set of instances to be classified, or associated, or clustered
Example of concept to be learned
Data set: flat file (single relation)
  –denormalization
Family tree example
  –concept: sister
  –example: family tree

Family Tree

Denormalizing Relational Data

Denormalization Problems
Computational and storage costs
Trivial regularities
  –customers, products
  –product, supplier
  –supplier, supplier address
Infinite relations

Content of Instances: Attributes
Instance characterized by values of its (predefined) set of attributes
  –Numeric (“continuous”)
  –Nominal (categorical)
  –Ordinal (rank)
  –Interval
  –Ratio
Focus in this class

Data Preparation
Data …
  –assembly: set of instances/denormalizing relational data
  –integration: enterprise-wide database/data warehouse
  –cleaning: missing data
  –aggregation: good information

ARFF Format
Used by the Java package Weka
Independent, unordered instances
No relationship between instances

Weather Data

Features
  –Comment lines start with %
  –Attribute types: Nominal and
  –List of instances
  –Missing values represented by ?

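To make these features concrete, here is a rough Python sketch that reads a small subset of ARFF: `%` comments, `@attribute` declarations, a `@data` section, and `?` for missing values. The function name `parse_arff` and the sample file are illustrative assumptions (the slide's own listing is not preserved). It ignores quoting, sparse data, and other parts of the real format, and it keeps all values as strings.

```python
def parse_arff(text):
    """Parse a tiny subset of ARFF: unquoted attribute names and values only."""
    attributes, rows, in_data = [], [], False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blank lines and comments
            continue
        lower = line.lower()
        if in_data:
            values = [v.strip() for v in line.split(",")]
            # "?" marks a missing value; store it as None
            rows.append([None if v == "?" else v for v in values])
        elif lower.startswith("@attribute"):
            _, name, spec = line.split(None, 2)
            attributes.append((name, spec))
        elif lower.startswith("@data"):
            in_data = True
    return attributes, rows


# Illustrative file, not the slide's original listing.
SAMPLE = """\
% A comment line
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute play {yes, no}
@data
sunny,85,no
overcast,?,yes
"""

attributes, rows = parse_arff(SAMPLE)
print(attributes)
print(rows)
```

Each data row is an independent instance; nothing in the file relates one row to another, which is why relational data has to be denormalized first.
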
Other Issues
Missing data
Inaccurate values
Look at the data!!!

Recall the Three Layers of the Data Mining Process
Inputs (done)
Outputs (structural patterns) (next)
Algorithms

Describing Structural Patterns
Ways of representing knowledge:
  –Decision tables
  –Decision trees
  –Classification rules
  –Association rules
  –Regression trees
  –Clusters

The Weather Problem

A Decision List
If outlook = sunny and humidity = high then play = no
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity = normal then play = yes
If none of the above then play = yes

A Decision Tree
Outlook
  Sunny: Humidity
    High: Play=No
  Overcast: Play=Yes
  Rainy: Windy
    TRUE: Play=No

Concepts: Styles of Learning
Classification (supervised) learning
Association learning
Clustering
Numeric prediction

Classification Rules
Classification rules are easily read off decision trees
  –How?
The other direction is possible, but not as straightforward
  –If a and b then x
  –If c and d then x

Corresponding Decision Tree
[Tree diagram: tests on a, b, c and d with y/n branches; the tests on c and d appear twice; leaves labelled x]

Replicated Subtree Problem
[Tree diagram: tests on x=1 and y=1 with y/n branches; leaves labelled a and b]
If x=1 and y=0 then a
If x=0 and y=1 then a
If x=0 and y=0 then b
If x=1 and y=1 then b

Replicated Subtree Problem
If x=1 and y=1 then a
If z=1 and w=1 then a
Otherwise b
x, y, z, w take values 1, 2, 3

Rules with exceptions
If x and y then a EXCEPT if z then b
Account for new instances
Exceptions from exceptions, etc.

Association Rules
Coverage (support): number of instances it predicts correctly
Accuracy (confidence): coverage divided by number of instances it applies to
Example: If temperature = cool then humidity = normal
  –Coverage = 4
  –Accuracy = 100%

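The two measures can be checked by counting. The sketch below assumes the slide's weather table is the standard 14-instance nominal weather data distributed with Weka (the table itself is not preserved in this transcript); the names `WEATHER` and `coverage_and_accuracy` are illustrative.

```python
COLUMNS = ("outlook", "temperature", "humidity", "windy", "play")
# Assumption: the standard 14-instance nominal weather data set from Weka.
ROWS = [
    ("sunny", "hot", "high", "false", "no"),
    ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"),
    ("rainy", "mild", "high", "false", "yes"),
    ("rainy", "cool", "normal", "false", "yes"),
    ("rainy", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"),
    ("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"),
    ("rainy", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"),
    ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"),
    ("rainy", "mild", "high", "true", "no"),
]
WEATHER = [dict(zip(COLUMNS, row)) for row in ROWS]


def matches(instance, conditions):
    """True if the instance has every attribute value listed in conditions."""
    return all(instance[attr] == value for attr, value in conditions.items())


def coverage_and_accuracy(data, antecedent, consequent):
    """Coverage = instances predicted correctly; accuracy = coverage / instances the rule applies to."""
    applies = [row for row in data if matches(row, antecedent)]
    correct = [row for row in applies if matches(row, consequent)]
    accuracy = len(correct) / len(applies) if applies else 0.0
    return len(correct), accuracy


coverage, accuracy = coverage_and_accuracy(
    WEATHER, {"temperature": "cool"}, {"humidity": "normal"}
)
print(f"Coverage = {coverage}, Accuracy = {accuracy:.0%}")
```

Under that assumption the count reproduces the slide's figures: four instances have temperature = cool, and all four have humidity = normal.
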
Interpretation
If windy = false and play = no then outlook = sunny and humidity = high
If windy = false and play = no then outlook = sunny
If windy = false and play = no then humidity = high
If humidity = high and windy = false and play = no then outlook = sunny

The Shapes Problem
Shaded = standing
Unshaded = lying

Instances

Classification Rules
If width ≥ 3.5 and height < 7.0 then lying
If height ≥ 3.5 then standing
Work well to classify these instances
Problems?

Relational Rules
Rules comparing attributes to constants are called propositional rules
Structural patterns?
If width > height then lying
If height > width then standing

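A relational rule compares attributes with each other instead of with constants, so it keeps working when the blocks are scaled. A small Python sketch, with a hypothetical function name and made-up block sizes:

```python
def relational_rule(width, height):
    """Classify a block by comparing its two attributes with each other."""
    if width > height:
        return "lying"
    if height > width:
        return "standing"
    return None  # the slide's two rules leave width == height undecided


# Made-up blocks: the second is the first scaled by 10, yet the verdict is the same.
print(relational_rule(2.0, 4.0))
print(relational_rule(20.0, 40.0))
print(relational_rule(9.0, 3.0))
```

A propositional rule with fixed thresholds fitted to the small block would have no reason to classify the scaled block correctly, which is the weakness the "Problems?" question on the previous slide points at.
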
CPU Performance Example

Numerical Prediction: regression equation

Regression Tree
[Tree diagram: root splits on CHMIN at 7.5; lower nodes split on CACH (at 8.5 and 28) and MMAX; leaves hold numeric predictions such as 64.6]
  –Accuracy?
  –Large and possibly awkward

Model Trees
[Tree diagram: root splits on CHMIN at 7.5; lower nodes split on CACH (at 8.5) and MMAX (at 28000); leaves are linear models such as LM4, LM5, LM6]

Instance-Based Representation
Store actual instances
New instance: algorithm finds “most similar” stored instance
Features
  –What is a similar instance?
  –Need to store (all?) instances
  –Really a black box method

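The idea can be sketched in a few lines of Python. This sketch answers "What is a similar instance?" with Euclidean distance, which is only one possible choice and an assumption here; the stored instances and the name `nearest_neighbor` are also made up for illustration.

```python
import math


def nearest_neighbor(stored, query):
    """Return the label of the stored instance closest to query (Euclidean distance)."""
    _features, label = min(stored, key=lambda item: math.dist(item[0], query))
    return label


# Made-up stored instances: ((width, height), label)
STORED = [
    ((2.0, 6.0), "standing"),
    ((7.0, 3.0), "lying"),
    ((3.0, 8.0), "standing"),
]

print(nearest_neighbor(STORED, (6.0, 2.0)))
```

Nothing is learned in advance: all the work happens at classification time, every stored instance is scanned, and the result comes with no explicit pattern to inspect, which is why the slide calls it a black box method.
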
Clusters
[Figure: two cluster diagrams over the instances a to k]

Next: Algorithms