Data Mining: A Closer Look


Data Mining: A Closer Look Typical Problems

Data Mining: Typical Problems Classification, Estimation, Prediction

Classification & Estimation Classification deals with discrete outcomes: yes or no; big or small; strange or not strange; sick or healthy; yellow, green or red; etc. It determines the class membership of a given object. Estimation deals with numeric outcomes and is often used to perform a classification task: estimating the number of children in a family; estimating a family’s total household income; etc. Neural networks and regression models are commonly used tools for classification and estimation.

Prediction Prediction is the same as classification or estimation, except that the records are classified according to some predicted future behavior or estimated future value. Any of the techniques used for classification and estimation can be used in prediction.

Classification and Prediction: Implementation To implement either classification or prediction, we use training examples in which the value of the variable to be predicted, or the class membership of the instance to be classified, is already known.
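A minimal sketch of learning from training examples with known class membership. The slides do not name an algorithm here, so a 1-nearest-neighbour classifier is used purely for illustration; the attributes (age, resting heart rate), the data values, and the function name `classify` are all hypothetical.

```python
import math

# Hypothetical training examples: (age, resting heart rate) -> known class.
TRAINING = [
    ((25, 60), "healthy"),
    ((30, 65), "healthy"),
    ((62, 88), "sick"),
    ((70, 92), "sick"),
]

def classify(instance, training=TRAINING):
    """Return the class of the training example closest to `instance`."""
    _, label = min(training, key=lambda example: math.dist(example[0], instance))
    return label

print(classify((28, 62)))  # nearest training examples are labelled "healthy"
print(classify((66, 90)))  # nearest training examples are labelled "sick"
```

The point is only that the known labels in the training set drive the decision for a new, unlabelled instance.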

Is Data Mining Appropriate for My Problem?

Will Data Mining help me? Can we clearly define the problem? Do potentially meaningful data exist? Do the data contain hidden knowledge, or are they useful for reporting purposes only? Will the cost of processing the data be less than the likely increase in profit from applying the knowledge gained through data mining?

Data Mining vs. Data Query Shallow Knowledge, Multidimensional Knowledge, Hidden Knowledge, Deep Knowledge

Shallow Knowledge Shallow knowledge is factual. It can be easily stored and manipulated in a database.

Multidimensional Knowledge Multidimensional knowledge is also factual. On-line Analytical Processing (OLAP) tools are used to manipulate multidimensional knowledge.

Hidden Knowledge Hidden knowledge represents patterns or regularities in data that cannot be easily found using a database query. Data mining algorithms, however, are designed to find such patterns.

Deep Knowledge Deep knowledge is knowledge stored in a database that can only be found if we are given some direction about what we are looking for.

Data Mining vs. Data Query Shallow knowledge can be extracted with a database query language such as SQL. Multidimensional knowledge can be extracted with On-line Analytical Processing (OLAP) tools. Hidden knowledge represents patterns and regularities in data that cannot be easily found by querying; data mining tools can be used. Deep knowledge can be found only if we are given some direction about what we are looking for; data mining tools can be used.

Data Mining vs. Data Query: Use a data query if you already know, or almost know, what you are looking for. Use data mining to find regularities in data that are not obvious or that are hidden.
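To make the "data query" side concrete, here is a small sketch using Python's standard `sqlite3` module. The `purchases` table and its rows are hypothetical. The first query retrieves shallow, factual knowledge; the second uses a plain SQL `GROUP BY` as a rough stand-in for the kind of aggregation an OLAP tool performs (it is not an OLAP tool itself). In both cases we already know what we are asking for.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (customer TEXT, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?)",
    [
        ("Ann", "East", 120.0),
        ("Bob", "East", 80.0),
        ("Cy", "West", 200.0),
        ("Ann", "West", 50.0),
    ],
)

# Shallow knowledge: a factual lookup with a known criterion.
ann_total = conn.execute(
    "SELECT SUM(amount) FROM purchases WHERE customer = ?", ("Ann",)
).fetchone()[0]

# Multidimensional-style summary: totals along the "region" dimension.
by_region = dict(
    conn.execute("SELECT region, SUM(amount) FROM purchases GROUP BY region")
)

print(ann_total)
print(by_region)
```

Neither query discovers anything that was not explicitly asked for, which is the distinction the slide draws between querying and mining.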

A Simple Data Mining Process Model

Data Mining: A KDD Process Data mining is the core of the knowledge discovery process. [Figure: KDD process, from source to result: Databases, Data Cleaning and Data Integration, Data Warehouse, Selection, Task-relevant Data, Data Mining, Pattern Evaluation, Knowledge]

The Data Warehouse The data warehouse is a historical database designed for decision support.

Data Mining Strategies

A hierarchy of data mining strategies

Supervised Data Mining Algorithms: A single output attribute or multiple output attributes. Output attributes are also called dependent variables because they depend on the values of the input attributes. Input attributes are also known as independent variables.

Data Mining Strategies: Classification Learning is supervised. The dependent variable(s) (output) is categorical or numeric. Well-defined classes. Current rather than future behavior. Examples: classify a loan applicant as a good or poor credit risk; develop a customer profile; classify a patient as sick or healthy.

Data Mining Strategies: Estimation Learning is supervised. The dependent variable(s) (output) is numeric. Well-defined classes. Current rather than future behavior. Examples: estimate the number of minutes before a thunderstorm will reach a given location; estimate the amount of credit card purchases; estimate the salary of an individual.
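As an illustration of estimation with a numeric output, the sketch below fits a straight line by ordinary least squares and uses it to estimate a salary. Simple linear regression is chosen here only as an example of a regression model; the data (years of experience, salary in thousands) and the function name `fit_line` are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical training data: the output (salary) is known for each example.
years = [1, 3, 5, 7]
salary = [40.0, 50.0, 60.0, 70.0]

slope, intercept = fit_line(years, salary)
estimate = slope * 4 + intercept  # estimated salary at 4 years of experience
print(slope, intercept, estimate)
```

Unlike classification, the model returns a number on a continuous scale rather than a class label.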

Data Mining Strategies: Prediction The emphasis is on predicting future rather than current outcomes. The output attribute may be categorical or numeric. Examples: predict next week’s (year’s) currency exchange rate; predict next week’s (year’s) Dow Jones Industrial closing value; predict the level of power consumption for some period of time.

Classification, Estimation or Prediction? The nature of the data determines whether a model is suitable for classification, estimation, or prediction.

The Cardiology Patient Dataset This dataset contains 303 instances. Each instance holds information about a patient who either has or does not have a heart condition.

The Cardiology Patient Dataset 138 instances represent patients with heart disease. 165 instances contain information about patients free of heart disease.

Classification, Estimation or Prediction? The next two slides each contain a rule generated from this dataset. Is either of these rules predictive?

A Healthy Class Rule for the Cardiology Patient Dataset IF 169 <= Maximum Heart Rate <= 202 THEN Concept Class = Healthy. Rule accuracy: 85.07%. Rule coverage: 34.55%.

A Sick Class Rule for the Cardiology Patient Dataset IF Thal = Rev & Chest Pain Type = Asymptomatic THEN Concept Class = Sick. Rule accuracy: 91.14%. Rule coverage: 52.17%.
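The percentages on the two rule slides can be cross-checked against the class sizes given for the dataset (165 healthy, 138 sick). The sketch below assumes the usual definitions: coverage is the number of correctly covered instances divided by the class size, and accuracy is that same number divided by all instances matching the rule's condition. The instance counts it produces are inferred from the rounded percentages; they are not stated in the slides, and the helper names are hypothetical.

```python
def implied_counts(class_size, accuracy_pct, coverage_pct):
    """Infer (correctly covered, all matched) counts from rounded percentages."""
    correct = round(class_size * coverage_pct / 100)
    matched = round(correct * 100 / accuracy_pct)
    return correct, matched

def as_pct(numerator, denominator):
    """Ratio expressed as a percentage rounded to two decimals."""
    return round(100 * numerator / denominator, 2)

healthy = implied_counts(165, accuracy_pct=85.07, coverage_pct=34.55)
sick = implied_counts(138, accuracy_pct=91.14, coverage_pct=52.17)

for name, (correct, matched), size in [("healthy", healthy, 165), ("sick", sick, 138)]:
    # Recompute the percentages from the inferred counts as a consistency check.
    print(name, correct, matched, as_pct(correct, matched), as_pct(correct, size))
```

Under these assumptions the healthy rule matches 67 patients of whom 57 are healthy, and the sick rule matches 79 patients of whom 72 are sick; recomputing from those counts reproduces the percentages on the slides.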

Is the rule appropriate for classification or prediction? Prediction: If a person’s maximum heart rate, checked on a regular basis, is low, he or she may be at risk of having a heart attack. Classification: If a person has a heart attack, expect the maximum heart rate to decrease.

Data Mining Strategies: Unsupervised Clustering

Unsupervised Clustering can be used to: determine if relationships can be found in the data; evaluate the likely performance of a supervised model; find a best set of input attributes for supervised learning; detect outliers.
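A minimal clustering sketch, with no output attribute involved. The slides do not name a clustering algorithm, so k-means is used here as one common choice; the points, the starting centroids, and the function name `kmeans` are hypothetical.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then re-average."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical unlabelled data with two visually obvious groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, centroids=[(1, 1), (8, 8)])
print(centroids)
print(clusters)
```

If the groups found this way line up with known classes, that is some evidence a supervised model built on the same attributes could perform well; points far from every centroid are candidate outliers.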

Data Mining Strategies: Market Basket Analysis Find interesting relationships among retail products. Uses association rule algorithms.
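Association rules are usually scored with support and confidence, two standard measures that the slide does not spell out. The sketch below computes both for a single candidate rule over a handful of hypothetical baskets; it does not implement a full association rule algorithm, and the function names are illustrative.

```python
# Hypothetical transactions: each basket is the set of products bought together.
baskets = [
    {"milk", "bread"},
    {"milk", "bread", "eggs"},
    {"bread", "eggs"},
    {"milk", "eggs"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Of the baskets containing the antecedent, the fraction that also contain the consequent."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)

print(support({"milk", "bread"}, baskets))        # milk and bread together
print(confidence({"milk"}, {"bread"}, baskets))   # rule: milk -> bread
```

A real market basket analysis would search over many candidate itemsets and keep only the rules whose support and confidence exceed chosen thresholds.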

Supervised Data Mining Techniques

Generation of Production Rules

A Hypothesis for the Credit Card Promotion Database A combination of one or more of the dataset attributes differentiates Acme Credit Card Company card holders who have taken advantage of the life insurance promotion from those card holders who have chosen not to participate in the promotional offer.

Rule Accuracy and Rule Coverage Rule accuracy is the percentage of the instances satisfying the rule's condition that actually belong to the class the rule concludes. For example, if the rule holds for 9 of the 10 instances to which it is applicable, the accuracy is 90%. Rule coverage is the percentage of the instances of that class that the rule covers. For example, if the rule covers 10 of the 20 instances in the class, the rule coverage is 50%.
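The two definitions can be written as a short function. This is a sketch under the definitions just given; the instance layout (dictionaries with a `class` key), the `flag` attribute, and the name `rule_metrics` are hypothetical, and the function assumes the rule matches at least one instance and the class is non-empty.

```python
def rule_metrics(instances, condition, target_class):
    """Return (accuracy %, coverage %) of the rule IF condition THEN target_class."""
    matched = [inst for inst in instances if condition(inst)]
    correct = [inst for inst in matched if inst["class"] == target_class]
    in_class = [inst for inst in instances if inst["class"] == target_class]
    accuracy = 100 * len(correct) / len(matched)    # share of matched instances that are right
    coverage = 100 * len(correct) / len(in_class)   # share of the class the rule reaches
    return accuracy, coverage

# Hypothetical data: the rule "flag is True -> Yes" matches 10 instances,
# 9 of them in class Yes; class Yes has 18 instances in total.
instances = (
    [{"flag": True, "class": "Yes"}] * 9
    + [{"flag": True, "class": "No"}] * 1
    + [{"flag": False, "class": "Yes"}] * 9
    + [{"flag": False, "class": "No"}] * 5
)

accuracy, coverage = rule_metrics(instances, lambda inst: inst["flag"], "Yes")
print(accuracy, coverage)
```

Here the rule is right for 9 of the 10 instances it applies to (90% accuracy) and reaches 9 of the 18 instances in the class (50% coverage).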

Rule Accuracy and Rule Coverage Rule accuracy is a between-class measure. Rule coverage is a within-class measure.

Production Rules for the Credit Card Promotion Database
1. IF Sex = Female & 19 <= Age <= 43 THEN Life Insurance Promotion = Yes. Rule Accuracy: 100.00%. Rule Coverage: 66.67%.
2. IF Sex = Male & 40K <= Income Range <= 50K THEN Life Insurance Promotion = No. Rule Accuracy: 100.00%. Rule Coverage: 50%.
3. IF Credit Card Insurance = Yes. Rule Accuracy: 100.00%. Rule Coverage: 33.33%.
4. IF 30K <= Income Range <= 40K & Watch Promotion = Yes

Production Rules for the Credit Card Promotion Database Rules 1-3 are predictive for new card holders. Rule 4 might be used for the classification of existing card holders.