Published by Bryan Woods. Modified over 9 years ago.
1
Page 1/64 Data Mining Versus Semantic Web
Veljko Milutinovic, vm@etf.rs, http://home.etf.rs/~vm
This material was developed with financial help of the WUSA fund of Austria.
2
Page 2/64 DataMining versus SemanticWeb
- Two different avenues leading to the same goal!
- The goal: Efficient retrieval of knowledge from large compact or distributed databases, or the Internet
- What is the knowledge: Synergistic interaction of information (data) and its relationships (correlations)
- The major difference: Placement of complexity!
3
Page 3/64 Essence of DataMining
- Data and knowledge represented with simple mechanisms (typically, HTML) and without metadata (data about data)
- Consequently, relatively complex algorithms have to be used (complexity migrated into the retrieval request time)
- In return, low complexity at system design time!
4
Page 4/64 Essence of SemanticWeb
- Data and knowledge represented with complex mechanisms (typically XML) and with plenty of metadata (a byte of data may be accompanied by a megabyte of metadata)
- Consequently, relatively simple algorithms can be used (low complexity at the retrieval request time)
- However, large metadata design and maintenance complexity at system design time
5
Page 5/64 Major Knowledge Retrieval Algorithms (for DataMining)
- Neural Networks
- Decision Trees
- Rule Induction
- Memory Based Reasoning, etc…
Consequently, the stress is on algorithms!
6
Page 6/64 Major Metadata Handling Tools (for SemanticWeb)
- XML
- RDF
- Ontology Languages
- Verification (Logic + Trust): Efforts in Progress
Consequently, the stress is on tools!
7
Page 7/64 Issues in Data Mining Infrastructure
Authors:
- Nemanja Jovanovic, nemko@acm.org
- Valentina Milenkovic, tina@eunet.rs
- Veljko Milutinovic, vm@etf.rs
http://home.etf.rs/~vm
8
Page 8/64 Ivana Vujovic (ile@eunet.rs) Erich Neuhold (neuhold@ipsi.fhg.de) Peter Fankhauser (fankhaus@ipsi.fhg.de) Claudia Niederée (niederee@ipsi.fhg.de) Veljko Milutinovic (vm@etf.rs) http://home.etf.rs/~vm
9
Page 9/64 Data Mining in a Nutshell
- Uncovering the hidden knowledge
- Huge NP-complete search space
- Multidimensional interface
10
Page 10/64 A Problem …
You are a marketing manager for a cellular phone company.
- Problem: Churn is too high
- Bringing back a customer after quitting is both difficult and expensive
- Giving a new telephone to everyone whose contract is expiring is very expensive (as well as wasteful)
- You pay a sales commission of $250 per contract
- Customers receive a free phone (cost $125) with the contract
- Turnover (after the contract expires) is 40%
11
Page 11/64 … A Solution
- Three months before a contract expires, predict which customers will leave
- If you want to keep a customer that is predicted to churn, offer them a new phone
- The ones that are not predicted to churn need no attention
- If you don’t want to keep the customer, do nothing
- How can you predict future behavior? Tarot Cards? Magic Ball? Data Mining?
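The figures from the problem slide make the argument easy to check with a little arithmetic. The sketch below is only illustrative: it assumes that every lost customer is replaced by a newly acquired one, that the churn predictor is perfect, and that a new phone always retains a customer who would otherwise leave. None of these assumptions comes from the slides, and the function name `renewal_costs` is hypothetical.

```python
# Figures taken from the "A Problem" slide (USD per contract).
SALES_COMMISSION = 250
PHONE_COST = 125
CHURN_PERCENT = 40


def renewal_costs(customers):
    """Compare three policies for one contract-renewal cycle.

    Assumptions (not from the slides): each churner is replaced by a new
    customer, prediction is perfect, and a new phone always retains a
    customer who was about to leave.
    """
    churners = customers * CHURN_PERCENT / 100
    acquisition = SALES_COMMISSION + PHONE_COST  # cost of one new contract
    return {
        "do_nothing": churners * acquisition,
        "phone_for_everyone": customers * PHONE_COST,
        "phone_for_predicted_churners": churners * PHONE_COST,
    }


costs = renewal_costs(1000)
for policy, cost in costs.items():
    print(f"{policy}: ${cost:,.0f}")
```

Under these idealized assumptions, targeting only the predicted churners is the cheapest of the three policies; a real predictor makes mistakes, so the actual saving would be smaller.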
12
Page 12/64 Still Skeptical?
13
Page 13/64 The Definition
The automated extraction of predictive information from (large) databases
Key terms: Automated, Extraction, Predictive, Databases
14
Page 14/64 History of Data Mining
15
Page 15/64 Repetition in Solar Activity
- 1613 – Galileo Galilei
- 1859 – Heinrich Schwabe
16
Page 16/64 The Return of the Halley Comet
Edmund Halley (1656 - 1742)
Years shown: 239 BC, 1531, 1607, 1682, 1910, 1986, 2061 (???)
17
Page 17/64 Data Mining is Not
- Data warehousing
- Ad-hoc query/reporting
- Online Analytical Processing (OLAP)
- Data visualization
18
Page 18/64 Data Mining is
- Automated extraction of predictive information from various data sources
- Powerful technology with great potential to help users focus on the most important information stored in data warehouses or streamed through communication lines
19
Page 19/64 Focus of this Presentation
- Data Mining problem types
- Data Mining models and algorithms
- Efficient Data Mining
- Available software
20
Page 20/64 Data Mining Problem Types
21
Page 21/64 Data Mining Problem Types
- 6 types
- Often a combination solves the problem
22
Page 22/64 Data Description and Summarization
- Aims at a concise description of data characteristics
- Lower end of the scale of problem types
- Provides the user with an overview of the data structure
- Typically a subgoal
23
Page 23/64 Segmentation
- Separates the data into interesting and meaningful subgroups or classes
- Manual or (semi)automatic
- A problem in itself or just a step in solving a problem
24
Page 24/64 Classification
- Assumption: existence of objects with characteristics that belong to different classes
- Building classification models which assign correct labels in advance
- Found in a wide range of applications
- Segmentation can provide labels or restrict data sets
25
Page 25/64 Concept Description
- Understandable description of concepts or classes
- Close connection to both segmentation and classification
- Similarity and differences to classification
26
Page 26/64 Prediction (Regression)
- Similar to classification; the difference: the discrete target becomes continuous
- Finds the numerical value of the target attribute for unseen objects
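To make "finds the numerical value of the target attribute for unseen objects" concrete, here is a minimal sketch of one possible regression technique, ordinary least squares on a single attribute. The slides do not name a specific method, and the attribute names and data values below are hypothetical.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Return the continuous prediction for an unseen attribute value."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical training data: customer age versus account balance.
ages = [20, 30, 40, 50]
balances = [1500.0, 2500.0, 3500.0, 4500.0]

model = fit_line(ages, balances)
predicted = predict(model, 35)  # age 35 does not occur in the training data
print(model, predicted)
```

A classifier would return a label such as HIGH or LOW for the same input; the regression model returns a number on a continuous scale instead.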
27
Page 27/64 Dependency Analysis
- Finding the model that describes significant dependencies between data items or events
- Prediction of the value of a data item
- Special case: associations
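For the special case of associations, dependencies are commonly quantified with the support and confidence of a rule "antecedent implies consequent". The slides do not define these measures, so the sketch below uses the standard textbook definitions as an assumption; the basket contents are hypothetical.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for t in transactions if items <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) from the transactions."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0  # the antecedent never occurs, so the rule has no evidence
    both = support(transactions, set(antecedent) | set(consequent))
    return both / base


# Hypothetical purchase baskets.
baskets = [
    {"phone", "case"},
    {"phone", "case", "charger"},
    {"phone", "charger"},
    {"case"},
]

rule_support = support(baskets, {"phone", "case"})
rule_confidence = confidence(baskets, {"phone"}, {"case"})
print(rule_support, rule_confidence)
```

A rule is usually reported only when both values exceed thresholds chosen by the analyst.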
28
Page 28/64 Data Mining Models
7A: AntiPasti, Analytics, Animation, Anecdotic, Advantages, Applications, AfterTea
29
Page 29/64 Neural Networks
- Characterizes processed data with a single numeric value
- Efficient modeling of large and complex problems
- Based on biological structures: neurons
- Network consists of neurons grouped into layers
30
Page 30/64 Neuron Functionality
[Figure: inputs I1, I2, I3, …, In with weights W1, W2, W3, …, Wn feeding a function f that produces the output]
Output = f (W1*I1, W2*I2, …, Wn*In)
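The slide leaves f unspecified. A common reading, assumed here and not stated in the slides, is that f is an activation function applied to the sum of the weighted inputs. The sketch below uses a sigmoid as the default activation; both the sigmoid and the function names are illustrative choices.

```python
import math


def sigmoid(x):
    """Logistic activation; squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def neuron_output(inputs, weights, activation=sigmoid):
    """Apply the activation to the sum of W_k * I_k (assumed form of f)."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    weighted_sum = sum(w * i for w, i in zip(weights, inputs))
    return activation(weighted_sum)


# Hypothetical inputs I1..I3 and weights W1..W3.
output = neuron_output([1.0, 0.5, -1.0], [0.4, 0.2, 0.5])
print(output)
```

Training a network then amounts to adjusting the weights so that the outputs match known examples.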
31
Page 31/64 Training Neural Networks
- Advantages?
- Applications?
- Anecdotes (Dark+Dark)?
32
Page 32/64 Decision Trees
- A way of representing a series of rules that lead to a class or value
- Iterative splitting of data into discrete groups, maximizing the distance between them at each split
- CHAID, CART, Quest, C5.0
- Classification trees and regression trees
- Unlimited growth and stopping rules
- Univariate splits and multivariate splits
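One step of the iterative splitting can be sketched as follows. The slide does not say how the "distance" between groups is measured; this sketch substitutes weighted Gini impurity, the criterion usually associated with CART, and searches for the best univariate split on one numeric attribute. The data values are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure group)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_univariate_split(values, labels):
    """Return (threshold, weighted impurity) of the best single split.

    Returns (None, impurity of the unsplit data) if no split improves on it.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_threshold, best_score = None, gini(labels)
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between two equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score


# Hypothetical balances and outcomes.
balances = [2, 5, 8, 12, 15, 20]
outcomes = ["stay", "stay", "stay", "churn", "churn", "churn"]
split = best_univariate_split(balances, outcomes)
print(split)
```

A full tree builder would apply the same search recursively to each resulting group until a stopping rule fires.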
33
Page 33/64 Decision Trees
[Figure: example decision tree with the splits Balance>10 / Balance<=10, Age<=32 / Age>32, and Married=NO / Married=YES]
34
Page 34/64 Decision Trees
- Advantages?
- Applications?
- Anecdotes (CRO+Past)?
35
Page 35/64 Rule Induction
- Method of deriving a set of rules to classify cases
- Creates independent rules that are unlikely to form a tree
- Rules may not cover all possible situations
- Rules may sometimes conflict in a prediction
36
Page 36/64 Rule Induction
- If balance>100.000 then confidence=HIGH & weight=1.7
- If balance>25.000 and status=married then confidence=HIGH & weight=2.3
- If balance<40.000 then confidence=LOW & weight=1.9
- Advantages?
- Applications?
- Anecdotes (Teeth+Devor)
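The three example rules can be applied to a customer record as sketched below. Two things are assumptions rather than content of the slides: the dots in 100.000 are read as thousands separators, and conflicts between rules are resolved by summing the weights per label and picking the heaviest label. The function name `classify` is hypothetical.

```python
# Each rule: (condition, predicted confidence label, weight), from the slide.
RULES = [
    (lambda c: c["balance"] > 100_000, "HIGH", 1.7),
    (lambda c: c["balance"] > 25_000 and c["status"] == "married", "HIGH", 2.3),
    (lambda c: c["balance"] < 40_000, "LOW", 1.9),
]


def classify(customer, rules=RULES):
    """Return the label with the largest total weight, or None if no rule fires.

    Summing weights per label is an assumed conflict-resolution scheme.
    """
    votes = {}
    for condition, label, weight in rules:
        if condition(customer):
            votes[label] = votes.get(label, 0.0) + weight
    if not votes:
        return None  # the rule set does not cover this case
    return max(votes, key=votes.get)


# Rules 2 and 3 both fire and disagree; the heavier rule wins.
print(classify({"balance": 30_000, "status": "married"}))
# No rule fires for this customer.
print(classify({"balance": 50_000, "status": "single"}))
```

The two calls illustrate the caveats from the previous slide: rules may conflict in a prediction, and they may not cover all possible situations.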
37
Page 37/64 K-nearest Neighbor and Memory-Based Reasoning (MBR)
- Usage of knowledge of previously solved similar problems in solving the new problem
- Assigning the class to the group where most of the k "neighbors" belong
- First step: finding a suitable measure for the distance between attributes in the data
- How far is black from green?
- Advantage (+): Easy handling of non-standard data types
- Disadvantage (-): Huge models
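A minimal k-nearest-neighbor sketch follows. The distance measure is an assumption chosen for illustration: squared differences for numeric attributes and a simple 0/1 mismatch for categorical ones, which is one possible answer to "how far is black from green?". The example records and labels are hypothetical.

```python
import math
from collections import Counter


def distance(a, b):
    """Mixed distance: Euclidean on numbers, 0/1 mismatch on categories."""
    total = 0.0
    for x, y in zip(a, b):
        if isinstance(x, (int, float)) and isinstance(y, (int, float)):
            total += (x - y) ** 2
        else:
            total += 0.0 if x == y else 1.0
    return math.sqrt(total)


def knn_classify(examples, query, k=3):
    """Assign the class held by most of the k nearest stored examples."""
    nearest = sorted(examples, key=lambda ex: distance(ex[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical memory of solved cases: ((normalized usage, phone color), outcome).
examples = [
    ((0.1, "black"), "stay"),
    ((0.2, "black"), "stay"),
    ((0.3, "green"), "stay"),
    ((0.8, "green"), "churn"),
    ((0.9, "green"), "churn"),
    ((1.0, "black"), "churn"),
]

print(knn_classify(examples, (0.85, "green")))
print(knn_classify(examples, (0.15, "black")))
```

The "huge models" drawback is visible here: the model is the full list of stored examples, and every query scans all of them.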
38
Page 38/64 K-nearest Neighbor and Memory-Based Reasoning (MBR)
- Advantages?
- Applications?
- Anecdotes (Predictions Based on Similarities)?
E: Nd+1s, so(loMOTH), prof(badA>off), prof(abuA>fam)
39
Page 39/64 Data Mining Models and Algorithms
- Logistic regression
- Discriminant analysis
- Generalized Additive Models (GAM)
- Genetic algorithms
- Etc…
- Many other available models and algorithms
- Many application-specific variations of known models
- Final implementation usually involves several techniques
- Adding selected metadata (e.g., time for prediction)
E:27+sports(5+6+7+11+…+2+1)
40
Page 40/64 Efficient Data Mining
41
Page 41/64 [Figure: humorous troubleshooting flowchart with YES/NO branches. Decision nodes: "Is It Working?", "Did You Mess With It?", "Anyone Else Knows?", "Will it Explode In Your Hands?", "Can You Blame Someone Else?". Outcomes: "Don’t Mess With It!", "You Shouldn’t Have!", "Look The Other Way", "Hide It", "You’re in TROUBLE!", "NO PROBLEM!"]
42
Page 42/64 DM Process Model
- CRISP-DM: tends to become a standard
- 5A: used by SPSS Clementine (Assess, Access, Analyze, Act and Automate)
- SEMMA: used by SAS Enterprise Miner (Sample, Explore, Modify, Model and Assess)
43
Page 43/64 CRISP-DM
- CRoss-Industry Standard Process for DM
- Conceived in 1996 by three companies:
44
Page 44/64 CRISP-DM methodology
Four-level breakdown of the CRISP-DM methodology:
- Phases
- Generic Tasks
- Specialized Tasks
- Process Instances
45
Page 45/64 Mapping generic models to specialized models
- Analyze the specific context
- Remove any details not applicable to the context
- Add any details specific to the context
- Specialize generic contents according to the concrete characteristics of the context
- Possibly rename generic contents to provide more explicit meanings
46
Page 46/64 Generalized and Specialized Cooking
Generalized: preparing food on your own
- Find out what you want to eat
- Find the recipe for that meal
- Gather the ingredients
- Prepare the meal
- Enjoy your food
- Clean up everything (or leave it for later)
Specialized:
- Raw steak with vegetables?
- Check the cookbook or call mom
- Defrost the meat (if you had it in the fridge)
- Buy missing ingredients or borrow them from the neighbors
- Cook the vegetables and fry the meat
- Enjoy your food or even more
- You were cooking, so convince someone else to do the dishes
47
Page 47/64 CRISP-DM model
- Business understanding
- Data understanding
- Data preparation
- Modeling
- Evaluation
- Deployment
48
Page 48/64 Business Understanding
- Determine business objectives
- Assess situation
- Determine data mining goals
- Produce project plan
49
Page 49/64 Data Understanding
- Collect initial data
- Describe data
- Explore data
- Verify data quality
50
Page 50/64 Data Preparation
- Select data
- Clean data
- Construct data
- Integrate data
- Format data
51
Page 51/64 Modeling
- Select modeling technique
- Generate test design
- Build model
- Assess model
52
Page 52/64 Evaluation
- Evaluate results
- Review process
- Determine next steps
results = models + findings
53
Page 53/64 Deployment
- Plan deployment
- Plan monitoring and maintenance
- Produce final report
- Review project
54
Page 54/64 At Last…
55
Page 55/64 Available Software
14
56
Page 56/64 Comparison of fourteen DM tools
The Decision Tree products were
- CART
- Scenario
- See5
- S-Plus
The Rule Induction tools were
- WizWhy
- DataMind
- DMSK
Neural Networks were built from three programs
- NeuroShell2
- PcOLPARS
- PRW
The Polynomial Network tools were
- ModelQuest Expert
- Gnosis
- a module of NeuroShell 2
- KnowledgeMiner
57
Page 57/64 Criteria for evaluating DM tools
A list of 20 criteria for evaluating DM tools, put into 4 categories:
Capability measures what a desktop tool can do, and how well it does it
- Handles missing data
- Considers misclassification costs
- Allows data transformations
- Quality of testing options
- Has programming language
- Provides useful output reports
- Visualisation
58
Page 58/64 Criteria for evaluating DM tools
Learnability/Usability shows how easy a tool is to learn and use:
- Tutorials
- Wizards
- Easy to learn
- User’s manual
- Online help
- Interface
59
Page 59/64 Criteria for evaluating DM tools
Interoperability shows a tool’s ability to interface with other computer applications
- Importing data
- Exporting data
- Links to other applications
Flexibility
- Model adjustment flexibility
- Customizable work environment
- Ability to write or change code
60
Page 60/64 A classification of data sets
Pima Indians Diabetes data set
- 768 cases of Native American women from the Pima tribe, some of whom are diabetic, most of whom are not
- 8 attributes plus the binary class variable for diabetes per instance
Wisconsin Breast Cancer data set
- 699 instances of breast tumors, some of which are malignant, most of which are benign
- 10 attributes plus the binary malignancy variable per case
The Forensic Glass Identification data set
- 214 instances of glass collected during crime investigations
- 10 attributes plus the multi-class output variable per instance
Moon Cannon data set
- 300 solutions to the equation: x = 2v^2 sin(θ)cos(θ)/g
- the data were generated without adding noise
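A noise-free data set of this kind could be generated roughly as sketched below, reading the Moon Cannon formula as the standard projectile range equation with launch speed v, elevation angle θ and gravitational acceleration g. The value of g, the sampling ranges, and the random seed are all assumptions; the slides state only the equation and the number of solutions.

```python
import math
import random

MOON_GRAVITY = 1.62  # m/s^2; assumed value, the slide gives no constant


def cannon_range(v, theta, g=MOON_GRAVITY):
    """Horizontal distance x = 2 v^2 sin(theta) cos(theta) / g."""
    return 2 * v ** 2 * math.sin(theta) * math.cos(theta) / g


def make_moon_cannon(n=300, seed=0):
    """Generate n noise-free (v, theta, x) rows with assumed input ranges."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        v = rng.uniform(10.0, 100.0)           # assumed speed range, m/s
        theta = rng.uniform(0.0, math.pi / 2)  # assumed elevation, radians
        rows.append((v, theta, cannon_range(v, theta)))
    return rows


dataset = make_moon_cannon()
print(len(dataset), dataset[0])
```

Because no noise is added, a tool that recovers the functional form exactly can in principle fit this data set with zero error, which makes it a useful contrast to the three real-world data sets.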
61
Page 61/64 Evaluation of fourteen DM tools
62
Page 62/64 Conclusions
63
Page 63/64 WWW.NBA.COM
64
Page 64/64 Se7en