The role of Domain Knowledge in a large scale Data Mining Project
Kopanas I., Avouris N., Daskalaki S.
University of Patras



University of Patras, HCI Group - SETN02

Outline of the talk
  Knowledge in a data mining (DM) process
  Case study in a large DM project: prediction of customer insolvency in the telecommunications business
  The role of domain expertise (and domain experts) in the process
  Summary and conclusions

Data Mining
  Evolution of knowledge-based systems
  Key partners in Data Mining
    – Data analyst / statistician
    – Knowledge engineer
    – Domain expert
  Role of domain knowledge in Data Mining

DM phases
  (a) Problem definition
  (b) Creating target data set
  (c) Data pre-processing and transformation
  (d) Feature and algorithm selection
  (e) Data mining
  (f) Evaluation of learned knowledge
  (g) Fielding the knowledge base

Case study: Prediction of Customer Insolvency in Telecommunications business
  Predict which customers will become insolvent, that is, which customers will refuse to pay their telephone bills by the next payment due date, while there is still time for preventive (and possibly averting) measures
  Problem objectives
    – Detect as many insolvent customers as possible
    – Minimize false alarms (solvent customers classified as insolvent)

Case study: problem characteristics
  Significant loss of revenue for the company
  Human behavior is (generally) unpredictable
  Insolvency cases are rare compared to non-insolvencies
  Information can be retrieved only after processing huge amounts of data from several sources

The billing process (domain knowledge)
  [Figure: timeline of the billing process over the months of the year, marking the billing period, issue of bill, due date, service interruption and nullification]

Target data set definition (semantic value of data)
  Data from 3 different cities (combination of rural, urban and touristic areas)
  Types of data
    – Customer data (coded)
    – Data from billing and payments
    – Call detail records (from switching centers)
  Time span of data studied
    – Cases of collected and uncollected bills (10/99-2/01)
    – Call records (8/99-12/00)

Data pre-processing (knowledge-based reduction of search space)
  Eliminating inexpensive calls (< 0.3 €)
  Synchronizing data
  Removing noise
  Handling missing values
  Data aggregation by period
  DATA WAREHOUSE
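Two of the pre-processing steps listed above, dropping inexpensive calls and aggregating by period, can be illustrated with a minimal Python sketch. The record layout (customer_id, call_date, charge_eur), the function name, the two-week period length and the policy of skipping records with a missing charge are assumptions made for illustration; only the 0.3 € threshold comes from the slide.

```python
from collections import defaultdict
from datetime import date

MIN_CHARGE_EUR = 0.3   # threshold stated on the slide
PERIOD_DAYS = 14       # assumption: two-week aggregation periods


def aggregate_calls(calls, start):
    """Drop inexpensive calls, then total call count and charge per customer and period.

    `calls` is an iterable of (customer_id, call_date, charge_eur) tuples.
    This layout is hypothetical, not the project's actual schema.
    """
    totals = defaultdict(lambda: {"count": 0, "charge": 0.0})
    for customer_id, call_date, charge_eur in calls:
        if charge_eur is None:
            continue  # missing value: skipped here (one possible policy)
        if charge_eur < MIN_CHARGE_EUR:
            continue  # inexpensive call: eliminated
        period = (call_date - start).days // PERIOD_DAYS
        totals[(customer_id, period)]["count"] += 1
        totals[(customer_id, period)]["charge"] += charge_eur
    return dict(totals)


calls = [
    ("c1", date(1999, 8, 2), 0.10),   # below threshold, dropped
    ("c1", date(1999, 8, 3), 0.50),
    ("c1", date(1999, 8, 20), 1.25),
    ("c2", date(1999, 8, 5), None),   # missing charge, skipped
    ("c2", date(1999, 8, 6), 0.75),
]
summary = aggregate_calls(calls, start=date(1999, 8, 1))
```

Here `summary` is keyed by (customer, period index), so each customer ends up with one small record per period instead of one record per call.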

Dataset for model fitting
  Stratified sample of solvent customers
    – Class distribution: 90% solvent customers and 10% insolvent customers
  2066 cases in total and 46 variables
    – 2 variables describing the phone account
    – 4 variables describing customer attitude towards previous phone bills
    – 40 variables summarizing customer call habits over fifteen 2-week periods
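A rough sketch of how a 90%/10% class distribution could be obtained by sampling the solvent class is given below. The slide does not say which variables the sample was stratified on, so this sketch uses a plain random draw instead. It also assumes that every insolvent case is kept. The function name and the (case_id, is_insolvent) layout are hypothetical.

```python
import random


def build_training_sample(cases, minority_share=0.10, seed=0):
    """Keep all insolvent cases and randomly draw solvent cases so that
    insolvent cases make up `minority_share` of the result.

    `cases` is a list of (case_id, is_insolvent) pairs (hypothetical layout).
    """
    insolvent = [c for c in cases if c[1]]
    solvent = [c for c in cases if not c[1]]
    # Number of solvent cases needed for the requested class ratio.
    n_solvent = round(len(insolvent) * (1 - minority_share) / minority_share)
    n_solvent = min(n_solvent, len(solvent))
    rng = random.Random(seed)
    sample = insolvent + rng.sample(solvent, n_solvent)
    rng.shuffle(sample)
    return sample


# Toy population with 1% insolvent cases.
cases = [(i, i < 20) for i in range(2000)]
sample = build_training_sample(cases)
n_insolvent = sum(1 for _, is_insolvent in sample if is_insolvent)
```

With 20 insolvent cases in the toy population, the sample contains those 20 plus 180 solvent cases.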

Data mining
  Classification problem
    – 2 classes: solvent and insolvent customers
    – Distribution among classes in original dataset: 99% solvent customers and 1% insolvent customers
    – Very small number of insolvencies
    – Very different costs of misclassification between the two classes of customers

Criteria for evaluation of prediction
  The precision of the classifier, defined as the percentage of actually insolvent customers among those predicted as insolvent by the classifier.
  The accuracy of the classifier, defined as the percentage of correctly predicted insolvent customers out of all insolvent customers in the data set (this measure is commonly called recall).
  Target: precision > 30% and accuracy > 70%
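The two criteria above can be written out as a short Python sketch. The function names and the toy labels are illustrative assumptions; "accuracy" follows the slide's definition (correctly predicted insolvent customers over all insolvent customers), with 1 standing for insolvent and 0 for solvent.

```python
def precision_and_accuracy(y_true, y_pred):
    """Return (precision, accuracy) for the insolvent class as defined on the slide."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)          # insolvent, predicted insolvent
    fp = sum(1 for t, p in pairs if not t and p)      # solvent, predicted insolvent (false alarm)
    fn = sum(1 for t, p in pairs if t and not p)      # insolvent, missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    accuracy = tp / (tp + fn) if tp + fn else 0.0
    return precision, accuracy


def meets_targets(precision, accuracy, min_precision=0.30, min_accuracy=0.70):
    """Check the targets stated on the slide: precision > 30% and accuracy > 70%."""
    return precision > min_precision and accuracy > min_accuracy


y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]
precision, accuracy = precision_and_accuracy(y_true, y_pred)
ok = meets_targets(precision, accuracy)
```

In this toy example 3 of the 5 flagged customers are insolvent and 3 of the 4 insolvent customers are found, so both targets are met.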

Features selected (most popular in 50 classifiers)
  NewCust, Latency, Count_X_charges, CountResiduals, StdDif, TrendDif11, TrendDif10, TrendDif7, TrendDif6, TrendDif3,
  TrendUnitsMax, TrendDif5, TrendDif8, Average_Dif, Type, MaxSec, TrendUnits5, AverageUnits, TrendCount5, CountInstallments
  TrendDifxx, StdDif: dispersion of the called telephone numbers in a given time interval xx

Deployment of the knowledge-based system
  The classifiers are combined (voting algorithms have been used)
  Heuristics are used as applicability criteria
  Visualization plays an important role in the design of the system
  The roles of the user and the knowledge-based system have to be carefully defined
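The slide does not state which voting algorithm was used to combine the classifiers. As one possible illustration, the sketch below flags a customer as insolvent when at least a given number of classifiers vote that way, defaulting to a simple majority. The function name and the `min_votes` parameter are hypothetical.

```python
def vote_insolvent(votes, min_votes=None):
    """Combine per-classifier votes (True = predicted insolvent).

    By default a simple majority is required. A lower `min_votes` trades
    more false alarms for fewer missed insolvencies.
    """
    if min_votes is None:
        min_votes = len(votes) // 2 + 1
    return sum(votes) >= min_votes


# Votes from three hypothetical classifiers for two customers.
customer_a = vote_insolvent([True, True, False])
customer_b = vote_insolvent([True, False, False])
customer_b_lenient = vote_insolvent([True, False, False], min_votes=1)
```

Making the vote threshold explicit matters in a setting like this one, where the two kinds of misclassification have very different costs.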

Stepwise Discriminant Analysis

Decision Tree

Neural Network

Evaluation of classifiers (example)
  Performance over 90% in the majority class and over 83% in the minority class.
  precision = 113/2844 ≈ 4.0%
  accuracy = 113/136 ≈ 83%
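The arithmetic of this example can be checked with a few lines of Python. Following the definitions given earlier in the talk, 2844 is read here as the number of customers predicted insolvent and 136 as the number of actually insolvent customers. That reading, and the variable names, are assumptions.

```python
true_positives = 113        # insolvent customers correctly flagged
predicted_insolvent = 2844  # all customers flagged by the classifier
actual_insolvent = 136      # all insolvent customers in the test data

false_alarms = predicted_insolvent - true_positives  # solvent customers flagged
missed = actual_insolvent - true_positives           # insolvent customers not flagged

precision = true_positives / predicted_insolvent
accuracy = true_positives / actual_insolvent

print(f"precision = {precision:.1%}, accuracy = {accuracy:.1%}")
# prints "precision = 4.0%, accuracy = 83.1%"
print(f"false alarms = {false_alarms}, missed insolvencies = {missed}")
# prints "false alarms = 2731, missed insolvencies = 23"
```

Under this reading, the classifier finds most insolvent customers, but only about one flagged customer in 25 is actually insolvent.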

Stage                                        | DK     | Type of DK
(a) Problem definition                       | HIGH   | Business and domain knowledge, requirements; implicit, tacit knowledge
(b) Creating target data set                 | MEDIUM | Attribute relations, semantics of corporate DB
(c) Data pre-processing                      | HIGH   | Tacit and implicit knowledge for inferences
(d) Feature and algorithm selection          | MEDIUM | Interpretation of the selected features
(e) Data mining                              | LOW    | Inspection of discovered knowledge
(f) Evaluation of learned knowledge          | MEDIUM | Definition of criteria related to business objectives
(g) Fielding the knowledge base              | HIGH   | Supplementary domain knowledge necessary for implementing the system

Selection of DM tool (Elder 98)

Conclusion
  Data mining is a knowledge-driven process
  All stages contribute to the success of the process
  Domain experts play a significant role in most phases of the process
  Need for selection of algorithms and techniques that support interpretation of mined knowledge
  Need for integrated tools and adequate techniques to support involvement of domain experts in the process