Page 1/71 Issues in Data Mining Infrastructure Issues in Data Mining Infrastructure Authors:Nemanja Jovanovic, Valentina Milenkovic,

Slides:



Advertisements
Similar presentations
Data Mining Methodology 1. Why have a Methodology  Don’t want to learn things that aren’t true May not represent any underlying reality ○ Spurious correlation.
Advertisements

Data Mining Glen Shih CS157B Section 1 Dr. Sin-Min Lee April 4, 2006.
Chapter 9 DATA WAREHOUSING Transparencies © Pearson Education Limited 1995, 2005.
Week 9 Data Mining System (Knowledge Data Discovery)
© Prentice Hall1 DATA MINING TECHNIQUES Introductory and Advanced Topics Eamonn Keogh (some slides adapted from) Margaret Dunham Dr. M.H.Dunham, Data Mining,
Data Mining.
DATA WAREHOUSING.
SLIDE 1IS 257 – Fall 2008 Data Mining and the Weka Toolkit University of California, Berkeley School of Information IS 257: Database Management.
Data Mining By Archana Ketkar.
Data Mining – Intro.
CS157A Spring 05 Data Mining Professor Sin-Min Lee.
Text Mining: Finding Nuggets in Mountains of Textual Data Jochen Dijrre, Peter Gerstl, Roland Seiffert Presented by Huimin Ye.
Text Mining: Finding Nuggets in Mountains of Textual Data Jochen Dijrre, Peter Gerstl, Roland Seiffert Presented by Drew DeHaas.
DASHBOARDS Dashboard provides the managers with exactly the information they need in the correct format at the correct time. BI systems are the foundation.
Data Mining: A Closer Look
Data Mining & Data Warehousing PresentedBy: Group 4 Kirk Bishop Joe Draskovich Amber Hottenroth Brandon Lee Stephen Pesavento.
Computer Science Universiteit Maastricht Institute for Knowledge and Agent Technology Data mining and the knowledge discovery process Summer Course 2005.
TURKISH STATISTICAL INSTITUTE INFORMATION TECHNOLOGIES DEPARTMENT (Muscat, Oman) DATA MINING.
Microsoft Enterprise Consortium Data Mining Concepts Introduction: The essential background Prepared by David Douglas, University of ArkansasHosted by.
Enterprise systems infrastructure and architecture DT211 4
Data Mining By Andrie Suherman. Agenda Introduction Major Elements Steps/ Processes Tools used for data mining Advantages and Disadvantages.
Page 1/64 Data Mining Versus Semantic Web Data Mining Versus Semantic Web Veljko Milutinovic, This material was.
What is Business Intelligence? Business intelligence (BI) –Range of applications, practices, and technologies for the extraction, translation, integration,
Dr. Awad Khalil Computer Science Department AUC
Data Mining Techniques
More on Data Mining KDnuggets Datanami ACM SIGKDD
Kansas State University Department of Computing and Information Sciences CIS 830: Advanced Topics in Artificial Intelligence From Data Mining To Knowledge.
Data Mining Techniques As Tools for Analysis of Customer Behavior
DATA MINING Team #1 Kristen Durst Mark Gillespie Banan Mandura University of DaytonMBA APR 09.
Data Mining Chun-Hung Chou
Understanding Data Analytics and Data Mining Introduction.
Introduction: The essential background
Page 1/71 Issues in Data Mining Infrastructure Issues in Data Mining Infrastructure Authors:Nemanja Jovanovic, Valentina Milenkovic,
The CRISP-DM Process Model
Data Mining CS157B Fall 04 Professor Lee By Yanhua Xue.
Introduction to Data Mining Group Members: Karim C. El-Khazen Pascal Suria Lin Gui Philsou Lee Xiaoting Niu.
INTRODUCTION TO DATA MINING MIS2502 Data Analytics.
Knowledge Discovery and Data Mining Evgueni Smirnov.
Data Mining Process A manifestation of best practices A systematic way to conduct DM projects Different groups has different versions Most common standard.
Data Mining Chapter 1 Introduction -- Basic Data Mining Tasks -- Related Concepts -- Data Mining Techniques.
Knowledge Discovery and Data Mining Evgueni Smirnov.
Database Design Part of the design process is deciding how data will be stored in the system –Conventional files (sequential, indexed,..) –Databases (database.
Some working definitions…. ‘Data Mining’ and ‘Knowledge Discovery in Databases’ (KDD) are used interchangeably Data mining = –the discovery of interesting,
Fox MIS Spring 2011 Data Mining Week 9 Introduction to Data Mining.
The CRISP Data Mining Process. August 28, 2004Data Mining2 The Data Mining Process Business understanding Data evaluation Data preparation Modeling Evaluation.
Data Mining – Intro. Course Overview Spatial Databases Temporal and Spatio-Temporal Databases Multimedia Databases Data Mining.
CS157B Fall 04 Introduction to Data Mining Chapter 22.3 Professor Lee Yu, Jianji (Joseph)
Advanced Database Course (ESED5204) Eng. Hanan Alyazji University of Palestine Software Engineering Department.
Data Mining In contrast to the traditional (reactive) DSS tools, the data mining premise is proactive. Data mining tools automatically search the data.
3-1 Data Mining Kelby Lee. 3-2 Overview ¨ Transaction Database ¨ What is Data Mining ¨ Data Mining Primitives ¨ Data Mining Objectives ¨ Predictive Modeling.
Chapter 5: Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization DECISION SUPPORT SYSTEMS AND BUSINESS.
Data Mining BY JEMINI ISLAM. Data Mining Outline: What is data mining? Why use data mining? How does data mining work The process of data mining Tools.
MIS2502: Data Analytics Advanced Analytics - Introduction.
Does GridGIS require more intelligence than GIS? Claire Jarvis Department of Geography GEOGRAPHY.
An Introduction Student Name: Riaz Ahmad Program: MSIT( ) Subject: Data warehouse & Data Mining.
DATA MINING PREPARED BY RAJNIKANT MODI REFERENCE:DOUG ALEXANDER.
Data Mining and Decision Support
Data Mining Copyright KEYSOFT Solutions.
Knowledge Discovery and Data Mining 19 th Meeting Course Name: Business Intelligence Year: 2009.
DATA MINING It is a process of extracting interesting(non trivial, implicit, previously, unknown and useful ) information from any data repository. The.
Data Resource Management – MGMT An overview of where we are right now SQL Developer OLAP CUBE 1 Sales Cube Data Warehouse Denormalized Historical.
MIS2502: Data Analytics Advanced Analytics - Introduction
Data Mining Versus Semantic Web Veljko Milutinovic,
DATA MINING © Prentice Hall.
Week 11 Knowledge Discovery Systems & Data Mining :
Supporting End-User Access
CRISP Process Stephen Wyrick.
Welcome! Knowledge Discovery and Data Mining
Presentation transcript:

Page 1/71 Issues in Data Mining Infrastructure Issues in Data Mining Infrastructure Authors:Nemanja Jovanovic, Valentina Milenkovic, Prof. Dr. Veljko Milutinovic,

Page 2/71 Data Mining in the Nutshell Data Mining in the Nutshell  Uncovering the hidden knowledge  Huge n-p complete search space  Multidimensional interface

Page 3/71 A Problem … A Problem … You are a marketing manager for a cellular phone company  Problem: Churn is too high  Bringing back a customer after quitting is both difficult and expensive  Giving a new telephone to everyone whose contract is expiring is very expensive (as well as wasteful)  You pay a sales commission of 250$ per contract  Customers receive free phone (cost 125$) with contract  Turnover (after contract expires) is 40%

Page 4/71 … A Solution … A Solution  Three months before a contract expires, predict which customers will leave  If you want to keep a customer that is predicted to churn, offer them a new phone  The ones that are not predicted to churn need no attention  If you don’t want to keep the customer, do nothing  How can you predict future behavior?  Tarot Cards?  Magic Ball?  Data Mining?

Page 5/71 Still Skeptical? Still Skeptical?

Page 6/71 The Definition The Definition  Automated The automated extraction of predictive information from (large) databases  Extraction  Predictive  Databases

Page 7/71 History of Data Mining History of Data Mining

Page 8/71 Repetition in Solar Activity Repetition in Solar Activity  1613 – Galileo Galilei  1859 – Heinrich Schwabe

Page 9/71 The Return of the Halley Comet The Return of the Halley Comet ??? BC Edmund Halley ( )

Page 10/71 Data Mining is Not Data Mining is Not  Data warehousing  Ad-hoc query/reporting  Online Analytical Processing (OLAP)  Data visualization

Page 11/71 Data Mining is Data Mining is  Automated extraction of predictive information from various data sources  Powerful technology with great potential to help users focus on the most important information stored in data warehouses or streamed through communication lines

Page 12/71 Data Mining can Data Mining can  Answer question that were too time consuming to resolve in the past  Predict future trends and behaviors, allowing us to make proactive, knowledge driven decision

Page 13/71 Focus of this Presentation Focus of this Presentation  Data Mining problem types  Data Mining models and algorithms  Efficient Data Mining  Available software

Page 14/71 Data Mining Problem Types Data Mining Problem Types

Page 15/71 Data Mining Problem Types Data Mining Problem Types  6 types  Often a combination solves the problem

Page 16/71 Data Description and Summarization Data Description and Summarization  Aims at concise description of data characteristics  Lower end of scale of problem types  Provides the user an overview of the data structure  Typically a sub goal

Page 17/71 Segmentation Segmentation  Separates the data into interesting and meaningful subgroups or classes  Manual or (semi)automatic  A problem for itself or just a step in solving a problem

Page 18/71 Classification Classification  Assumption: existence of objects with characteristics that belong to different classes  Building classification models which assign correct labels in advance  Exists in wide range of various application  Segmentation can provide labels or restrict data sets

Page 19/71 Concept Description Concept Description  Understandable description of concepts or classes  Close connection to both segmentation and classification  Similarity and differences to classification

Page 20/71 Prediction (Regression) Prediction (Regression)  Similar to classification- difference: discrete becomes continuous  Finds the numerical value of the target attribute for unseen objects

Page 21/71 Dependency Analysis Dependency Analysis  Finding the model that describes significant dependences between data items or events  Prediction of value of a data item  Special case: associations

Page 22/71 Data Mining Models Data Mining Models

Page 23/71 Neural Networks Neural Networks  Characterizes processed data with single numeric value  Efficient modeling of large and complex problems  Based on biological structures Neurons  Network consists of neurons grouped into layers

Page 24/71 Neuron Functionality Neuron Functionality I1 I2 I3 In Output W1 W2 W3 Wn f Output = f (W1*I1, W2*I1, …, Wn*In)

Page 25/71 Training Neural Networks Training Neural Networks

Page 26/71 Neural Networks - Conclusion Neural Networks - Conclusion  Once trained, Neural Networks can efficiently estimate value of output variable for given input  Neurons and network topology are essentials  Usually used for prediction or regression problem types  Difficult to understand  Data pre-processing often required

Page 27/71 Decision Trees Decision Trees  A way of representing a series of rules that lead to a class or value  Iterative splitting of data into discrete groups maximizing distance between them at each split  CHAID, CHART, Quest, C5.0  Classification trees and regression trees  Unlimited growth and stopping rules  Univariate splits and multivariate splits

Page 28/71 Decision Trees Balance>10Balance<=10 Age<=32Age>32 Married=NOMarried=YES

Page 29/71 Decision Trees Decision Trees

Page 30/71 Rule Induction Rule Induction  Method of deriving a set of rules to classify cases  Creates independent rules that are unlikely to form a tree  Rules may not cover all possible situations  Rules may sometimes conflict in a prediction

Page 31/71 Rule Induction Rule Induction If balance> then confidence=HIGH & weight=1.7 If balance> and status=married then confidence=HIGH & weight=2.3 If balance< then confidence=LOW & weight=1.9

Page 32/71 K-nearest Neighbor and Memory-Based Reasoning (MBR) K-nearest Neighbor and Memory-Based Reasoning (MBR)  Usage of knowledge of previously solved similar problems in solving the new problem  Assigning the class to the group where most of the k-”neighbors” belong  First step – finding the suitable measure for distance between attributes in the data  How far is black from green?  + Easy handling of non-standard data types  - Huge models

Page 33/71 K-nearest Neighbor and Memory-Based Reasoning (MBR) K-nearest Neighbor and Memory-Based Reasoning (MBR)

Page 34/71 Data Mining Models and Algorithms Data Mining Models and Algorithms  Logistic regression  Discriminant analysis  Generalized Adaptive Models (GAM)  Genetic algorithms  Etc…  Many other available models and algorithms  Many application specific variations of known models  Final implementation usually involves several techniques  Selection of solution that match best results

Page 35/71 Efficient Data Mining Efficient Data Mining

Page 36/71 Don’t Mess With It! YES NO YES You Shouldn’t Have! NO Will it Explode In Your Hands? NO Look The Other Way Anyone Else Knows? You’re in TROUBLE! YES NO Hide It Can You Blame Someone Else? NO NO PROBLEM! YES Is It Working? Did You Mess With It?

Page 37/71 DM Process Model DM Process Model  CRISP–DM – tends to become a standard  5A – used by SPSS Clementine (Assess, Access, Analyze, Act and Automate )  SEMMA – used by SAS Enterprise Miner (Sample, Explore, Modify, Model and Assess)

Page 38/71 CRISP - DM CRISP - DM  CRoss-Industry Standard for DM  Conceived in 1996 by three companies:

Page 39/71 CRISP – DM methodology CRISP – DM methodology Four level breakdown of the CRISP-DM methodology: Phases Generic Tasks Process Instances Specialized Tasks

Page 40/71 Mapping generic models to specialized models Mapping generic models to specialized models  Analyze the specific context  Remove any details not applicable to the context  Add any details specific to the context  Specialize generic context according to concrete characteristic of the context  Possibly rename generic contents to provide more explicit meanings

Page 41/71 Generalized and Specialized Cooking Generalized and Specialized Cooking  Preparing food on your own  Find out what you want to eat  Find the recipe for that meal  Gather the ingredients  Prepare the meal  Enjoy your food  Clean up everything (or leave it for later)  Raw stake with vegetables?  Check the Cookbook or call mom  Defrost the meat (if you had it in the fridge)  Buy missing ingredients or borrow the from the neighbors  Cook the vegetables and fry the meat  Enjoy your food or even more  You were cooking so convince someone else to do the dishes

Page 42/71 CRISP – DM model CRISP – DM model  Business understanding  Data understanding  Data preparation  Modeling  Evaluation  Deployment Business understanding Data understanding Data preparation ModelingEvaluation Deployment

Page 43/71 Customizing a Web Page Customizing a Web Page  User-friendly design  Prediction of the users interests  Reduction of server workload  Reduction of Web traffic

Page 44/71 Customizing a Web Page Customizing a Web Page

Page 45/71 Business Understanding Business Understanding  Determine business objectives  Assess situation  Determine data mining goals  Produce project plan

Page 46/71 Business Understanding - Outputs Business Understanding - Outputs  Background  Business objectives and success criteria  Inventory of resources  Requirements, assumptions, and constrains  Risks and contingencies  Terminology  Costs and benefits  Data mining goals and success criteria  Project plan  Initial assessment of tools and techniques

Page 47/71 Customizing a Web Page – Business Understanding Example Customizing a Web Page – Business Understanding Example  Business objectives  Assess situation  Data mining goals  Project plan  Make the users surfing more comfortable  Decrease of overhead for users  Reduction of workload and Web traffic  Make the users surfing more comfortable  Decrease of overhead for users  Reduction of workload and Web traffic  Find the patterns in the user behavior

Page 48/71 Data Understanding Data Understanding  Collect initial data  Describe data  Explore data  Verify data quality

Page 49/71 Data Understanding - Outputs Data Understanding - Outputs  Data collection report  Data description report  Data exploration report  Data quality report  Background of data  List of data sources  For each data source, method of acquisition  Problems encountered in data acquisition  Detailed description of each data source  List of tables or other database objects  Description of each field including units, codes, etc.  Expected regularities or patterns and methods of detection  Regularities or patterns found (expected and unexpected)  Any other surprises  Conclusions for data transformation, data cleaning and any other pre-processing  Conclusions related to data mining goals or business objectives  Approach taken to assess data quality  Results of data quality assessment

Page 50/71 Customizing a Web Page – Data Understanding Example Customizing a Web Page – Data Understanding Example  Collecting the data  Data description  Results of data exploring  Verification of the quality of the data  Update the server to monitor user behavior  Record the users activities into a storage  Analyze recorded data  Decide which data is usable for mining

Page 51/71 Data Preparation Data Preparation  Select data  Clean data  Construct data  Integrate data  Format data

Page 52/71 Data Preparation - Outputs Data Preparation - Outputs  Dataset description report  Background including broad goals and plan for pre-processing  Description of pre-processing  Detailed description of resultant datasets  Rational for inclusion/exclusion of attributes  Discoveries made during pre-processing and implications for further work  Dataset

Page 53/71 Customizing a Web Page – Data Preparation Example Customizing a Web Page – Data Preparation Example  Decide from what period will the users monitored actions be considered  Make assumptions about unnecessary monitored data and discard them  Classify user actions into categories, group interesting links, etc…  If more information about user is available from other sources, use them  Transform data into suitable forms so several modeling techniques could be applied

Page 54/71 Modeling Modeling  Select modeling technique  Generate test design  Build model  Assess model

Page 55/71 Modeling - Outputs Modeling - Outputs  Assessment of DM results with respect to business success criteria  Test design  Model description  Model assessment  Broad description of the type of model and the training data to be used  Explanation of how the model will be tested or assessed  Description of any data required for testing  Description of any planned examination of models by domain or data experts  Type of model and relation to data mining goals  Parameter settings used to produce model  Detailed description of the model and any special features  Conclusions regarding patterns in the data  Overview of assessment process including deviations from the test plan  Detailed assessment of the model  Comments on models by domain or data experts  Insights into why a certain modeling technique and certain parameter setting lead to good/bad results

Page 56/71 Customizing a Web Page – Modeling Example Customizing a Web Page – Modeling Example  The problem is prediction of behavior  Regression could be a good solution due to distinct nature of the data  Create the software according to the project plan  Observe the behavior of the software  Tune the model after each evaluation phase if needed

Page 57/71 Evaluation Evaluation  Evaluate results  Review process  Determine next steps results = models + findings

Page 58/71 Evaluation - Outputs Evaluation - Outputs  Assessment of DM results with respect to business success criteria  Review of process  List of possible actions  Review of Business Objectives and Business Success Criteria  Comparison between success criterion and DM results  Conclusion about achievability of success criterion and suitability of data mining process  Review of “Project Success”  Are there new business objectives?

Page 59/71 Customizing a Web Page – Evaluation Example Customizing a Web Page – Evaluation Example  Observe the model behavior at work  Collect response from Beta testers  Check user satisfaction  Check server and network engagement  Classify results  Determine which parameter of the model should be changed  Present new ideas and modifications  Step back into previous phases as needed

Page 60/71 Deployment Deployment  Plan deployment  Plan monitoring and maintenance  Produce final report  Review project

Page 61/71 Deployment - Outputs Deployment - Outputs  Monitoring and maintenance plan  Final report  Overview of deployment results and indication which of results may require updating  Description of how updating will be triggered  Description of how updating will be performed  Summary of Business Understanding (background, objectives and success criteria)  Summary of data mining process  Summary of data mining results  Summary of results evaluation  Summary of deployment and maintenance plan  Cost/benefit analysis  Conclusions for the business  Conclusions for future data mining

Page 62/71 Customizing a Web Page – Deployment Example Customizing a Web Page – Deployment Example  Make the feature available to all users  Make plan for maintenance and user feedback  Analyze costs and benefits  Summarize the whole documentation  Summarize network and server additional activity  Collect the new ideas  Award according to results  Leave space for upgrade

Page 63/71 At Last… At Last…

Page 64/71 Available Software Available Software

Page 65/71 Available Software Available Software Discussion of data mining vendors and software is not included into this slide set

Page 66/71 Conclusions Conclusions

Page 67/71 COM

Page 68/71 Se7en Se7en

Page 69/71 CD – ROM CD – ROM CD – ROM

Page 70/71 Credits Credits Anne Stern, SPSS, Inc. Djuro Gluvajic, ITE, Denmark Obrad Milivojevic, PC PRO, Yugoslavia

Page 71/71 References References  Bruha, I., ‘Data Mining, KDD and Knowledge Integration: Methodology and A case Study”, SSGRR 2000  Fayyad, U., Shapiro, P., Smyth, P., Uthurusamy, R., “Advances in Knowledge Discovery and Data Mining”, MIT Press, 1996  Glumour, C., Maddigan, D., Pregibon, D., Smyth, P., “Statistical Themes nad Lessons for Data Mining”, Data Mining And Knowledge Discovery 1, 11-28, 1997  Hecht-Nilsen, R., “Neurocomputing”, Addison-Wesley, 1990  Pyle, D., “Data Preparation for Data Mining”, Morgan Kaufman, 1999  galeb.etf.bg.ac.yu/~vm     

Page 72/71 The END The END