CS-470: Data Mining, Fall 2009

Organizational Details
Class Meeting: 4:00-6:45pm, Tuesday, Room SCIT215
Instructor: Dr. Igor Aizenberg
Office: Science and Technology Building, 104C
Phone ( )
Office hours: Monday, Wednesday 10am-6pm; Tuesday 11am-3pm
Class Web Page:

Textbook
R. J. Roiger, M. W. Geatz, Data Mining: A Tutorial-Based Primer, Addison Wesley, 2003, ISBN

Control
Exams (open book, open notes):
- Exam 1: October 6, 2009
- Exam 2: November 10, 2009
- Exam 3: December 8, 2009
Homework

Grading
Grading method:
- Homework and preparation: 10%
- Exam 1: 30%
- Exam 2: 30%
- Exam 3: 30%
Grading scale:
- 90%+: A
- 80%+: B
- 70%+: C
- 60%+: D
- less than 60%: F
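As a quick illustration of how the weights and cut-offs above combine, here is a minimal sketch. The function names and the sample scores are made up for this example, and it assumes every component is scored on a 0-100 scale with no rounding before the letter is assigned.

```python
# Weights and cut-offs come from the grading slide; everything else is illustrative.
WEIGHTS = {"homework": 0.10, "exam1": 0.30, "exam2": 0.30, "exam3": 0.30}
SCALE = [(90, "A"), (80, "B"), (70, "C"), (60, "D")]


def course_percentage(scores):
    """Weighted average of component scores, each given on a 0-100 scale."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def letter_grade(percentage):
    """Map a percentage to a letter using the 90/80/70/60 cut-offs."""
    for cutoff, letter in SCALE:
        if percentage >= cutoff:
            return letter
    return "F"


# Hypothetical student record.
example = {"homework": 100, "exam1": 80, "exam2": 90, "exam3": 70}
total = course_percentage(example)
grade = letter_grade(total)
```

With these sample scores the weighted total is 82, which falls in the B band.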

Data Mining: A First View

Data Mining: A Definition
- The process of employing one or more machine learning techniques to automatically analyze and extract knowledge from data.
- The exploration and analysis of large quantities of data in order to discover meaningful patterns and rules.

What Is Data Mining?
Data mining (knowledge discovery in databases) is the process of discovering interesting knowledge from large amounts of data stored either in databases, data warehouses, or other information repositories. Machine learning and data mining are interested in the process of discovering knowledge that may be structurally or semantically more complex: models, graphs, new theorems or theories … in particular to assist scientific discovery.

Why Data Mining? Potential Applications
Database analysis and decision support
- Market analysis and management: target marketing, customer relation management, market basket analysis, cross selling, market segmentation
- Risk analysis and management: forecasting, customer retention, improved underwriting, quality control, competitive analysis
- Fraud detection and management
Other applications
- Text mining (news group, email, documents) and Web analysis
- Intelligent query answering
- Medical decision support

Market Analysis and Management (1)
Where are the data sources for analysis?
- Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies
Target marketing
- Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc.
Determine customer purchasing patterns over time
- Conversion of a single to a joint bank account: marriage, etc.
Cross-market analysis
- Associations/correlations between product sales
- Prediction based on the association information
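The cross-market analysis item can be made concrete with a small co-occurrence count. The sketch below is only an illustration: the baskets are invented, and "support" here simply means the fraction of transactions that contain a given pair of products.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions; each set is one customer's basket.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "eggs"},
]


def pair_support(transactions):
    """Return the fraction of transactions that contain each item pair."""
    counts = Counter()
    for basket in transactions:
        # Sorting gives every pair a single canonical ordering.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n / len(transactions) for pair, n in counts.items()}


support = pair_support(baskets)
best_pair = max(support, key=support.get)
```

In this toy data the pair bought together most often is bread and milk, which appears in three of the five baskets.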

Market Analysis and Financial Time Series Prediction

Market Analysis and Management (2)
Customer profiling
- Data mining can tell you what types of customers buy what products (clustering or classification)
Identifying customer requirements
- Identifying the best products for different customers
- Use prediction to find what factors will attract new customers
Provides summary information
- Various multidimensional summary reports
- Statistical summary information (data central tendency and variation)
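For the statistical summary item, central tendency and variation can be computed with Python's standard `statistics` module. The spending figures below are hypothetical and chosen so that one large value pulls the mean well above the median.

```python
import statistics

# Hypothetical monthly spending (in dollars) for one customer segment.
spending = [120.0, 135.0, 150.0, 110.0, 400.0, 125.0]

summary = {
    "mean": statistics.mean(spending),      # central tendency, sensitive to outliers
    "median": statistics.median(spending),  # central tendency, robust to outliers
    "stdev": statistics.stdev(spending),    # variation (sample standard deviation)
}
```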

Corporate Analysis and Risk Management
Finance planning and asset evaluation
- Cash flow analysis and prediction
- Contingent claim analysis to evaluate assets
- Cross-sectional and time series analysis (financial-ratio, trend analysis, etc.)
Resource planning
- Summarize and compare the resources and spending
Competition
- Monitor competitors and market directions
- Group customers into classes and apply a class-based pricing procedure
- Set pricing strategy in a highly competitive market

Fraud Detection and Management (1)
Applications
- Widely used in health care, retail, credit card services, telecommunications (phone card fraud), etc.
Approach
- Use historical data to build models of fraudulent behavior and use data mining to help identify similar instances
Examples
- Auto insurance: detect a group of people who stage accidents to collect on insurance
- Money laundering: detect suspicious money transactions (US Treasury's Financial Crimes Enforcement Network)
- Medical insurance: detect professional patients, rings of doctors, and rings of references

Fraud Detection and Management (2)
Detecting inappropriate medical treatment
- The Australian Health Insurance Commission identified that in many cases blanket screening tests were requested (saving Australian $1m/yr).
Detecting telephone fraud
- Telephone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm.
- British Telecom identified discrete groups of callers with frequent intra-group calls, especially mobile phones, and broke a multimillion dollar fraud.
Retail
- Analysts estimate that 38% of retail shrink is due to dishonest employees.
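One simple way to read "patterns that deviate from an expected norm" is a z-score check on a single call attribute. This is a sketch under stated assumptions, not the method used by any of the organizations named above: the call durations, the `flag_unusual` helper, and the threshold of 2 standard deviations are all illustrative choices.

```python
import statistics


def flag_unusual(durations, threshold=2.0):
    """Return the values whose z-score magnitude exceeds the threshold."""
    mean = statistics.mean(durations)
    spread = statistics.pstdev(durations)  # population standard deviation
    if spread == 0:
        return []  # all values identical, nothing deviates
    return [d for d in durations if abs(d - mean) / spread > threshold]


# Hypothetical call durations in minutes for one account.
calls = [3, 4, 5, 4, 3, 5, 4, 4, 3, 90]
suspicious = flag_unusual(calls)
```

A real call model would combine several attributes (destination, duration, time of day or week), but the idea of measuring distance from the norm is the same.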

Other Applications
Sports
- IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain competitive advantage for the New York Knicks and Miami Heat
Astronomy
- JPL and the Palomar Observatory discovered 22 quasars with the help of data mining
Internet Web Surf-Aid
- IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preferences and behavior, analyze the effectiveness of Web marketing, improve Web site organization, etc.

Induction-based Learning
The process of forming general concept definitions by observing specific examples of concepts to be learned.

Four Levels of Learning
- Facts
- Concepts
- Procedures
- Principles

Facts
A fact is a simple statement of truth.

Concepts
A concept is a set of objects, symbols, or events grouped together because they share certain characteristics.

Procedures
A procedure is a step-by-step course of action to achieve a goal.

Principles
Principles are general truths or laws that are basic to other truths.

What Can Computers Learn?

Computers & Learning
Computers are good at learning concepts. Concepts are the output of a data mining session.

Three Concept Views
- Classical View
- Probabilistic View
- Exemplar View

Classical View
All concepts have definite defining properties.

Probabilistic View
People store and recall concepts as generalizations created by observations.

Exemplar View
People store and recall likely concept exemplars that are used to classify unknown instances.

Methods of Learning

Supervised Learning
- Build a learner model using data instances of known origin.
- Use the model to determine the outcome of new instances of unknown origin.
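The two steps of supervised learning can be sketched with a very small learner. The example below uses a one-rule approach (pick the single attribute whose value-to-majority-outcome rule makes the fewest errors on the training instances); this is a stand-in technique chosen for brevity, not the learner used in the course, and the customer records are hypothetical.

```python
from collections import Counter, defaultdict


def train_one_rule(instances, target):
    """Step 1: build a model from instances whose outcome is known."""
    best = None
    attributes = [a for a in instances[0] if a != target]
    for attr in attributes:
        # Count the outcomes seen for each value of this attribute.
        by_value = defaultdict(Counter)
        for row in instances:
            by_value[row[attr]][row[target]] += 1
        rule = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
        errors = sum(row[target] != rule[row[attr]] for row in instances)
        if best is None or errors < best[2]:
            best = (attr, rule, errors)
    return best[0], best[1]


def predict(model, instance, default=None):
    """Step 2: apply the model to an instance whose outcome is unknown."""
    attr, rule = model
    return rule.get(instance[attr], default)


# Hypothetical training data: does the customer hold a margin account?
training = [
    {"transaction_method": "online", "income": "high", "margin_account": "yes"},
    {"transaction_method": "online", "income": "low", "margin_account": "yes"},
    {"transaction_method": "broker", "income": "high", "margin_account": "no"},
    {"transaction_method": "broker", "income": "low", "margin_account": "no"},
    {"transaction_method": "online", "income": "high", "margin_account": "yes"},
    {"transaction_method": "broker", "income": "high", "margin_account": "yes"},
]

model = train_one_rule(training, "margin_account")
new_customer = {"transaction_method": "online", "income": "low"}
outcome = predict(model, new_customer)
```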

Supervised Learning: A Decision Tree Example

Decision Tree
A tree structure where non-terminal nodes represent tests on one or more attributes and terminal nodes reflect decision outcomes.
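A decision tree of this kind can be represented as a nested data structure and walked from the root. The sketch below assumes a tree shape inferred from the production rules in this deck (swollen glands tested first, then fever); the tuple-and-dict encoding and the `classify` helper are illustrative choices.

```python
# Non-terminal nodes are (attribute, {value: subtree}); terminal nodes are strings.
tree = (
    "swollen_glands",
    {
        "Yes": "Strep Throat",
        "No": ("fever", {"Yes": "Cold", "No": "Allergy"}),
    },
)


def classify(node, instance):
    """Walk from the root, applying one attribute test per non-terminal node."""
    while not isinstance(node, str):
        attribute, branches = node
        node = branches[instance[attribute]]
    return node


patient = {"swollen_glands": "No", "fever": "Yes"}
diagnosis = classify(tree, patient)
```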

Production Rules
IF Swollen Glands = Yes THEN Diagnosis = Strep Throat
IF Swollen Glands = No & Fever = Yes THEN Diagnosis = Cold
IF Swollen Glands = No & Fever = No THEN Diagnosis = Allergy
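The three rules can also be written as an ordered rule list that is checked top to bottom. This is a minimal sketch; the lowercase attribute names and the `diagnose` helper are naming choices made for the example.

```python
# Each rule is (conditions that must all hold, resulting diagnosis).
RULES = [
    ({"swollen_glands": "Yes"}, "Strep Throat"),
    ({"swollen_glands": "No", "fever": "Yes"}, "Cold"),
    ({"swollen_glands": "No", "fever": "No"}, "Allergy"),
]


def diagnose(patient):
    """Return the diagnosis of the first rule whose conditions all match."""
    for conditions, diagnosis in RULES:
        if all(patient.get(attr) == value for attr, value in conditions.items()):
            return diagnosis
    return None  # no rule fired


result = diagnose({"swollen_glands": "No", "fever": "No"})
```

Because the first rule does not mention fever, a patient with swollen glands is diagnosed with strep throat regardless of the fever value.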

Unsupervised Clustering
A data mining method that builds models from data without predefined classes.

The “Acme Investors” Dataset of customers maintaining a brokerage account

The “Acme Investors” Dataset & Supervised Learning
1. Can I develop a general profile of an online investor? Output attribute: transaction method
2. Can I determine if a new customer is likely to open a margin account? Output attribute: margin account
3. Can I build a model to predict the average number of trades per month for a new investor? Output attribute: trades/month
4. What characteristics differentiate female and male investors? Output attribute: sex

Alternative: The “Acme Investors” Dataset & Unsupervised Clustering

The “Acme Investors” Dataset & Unsupervised Clustering
1. What attribute similarities group customers of Acme Investors together?
2. What differences in attribute values segment the customer database?

Clustering
Clustering is the task of segmenting a heterogeneous population into a number of more homogeneous subgroups (clusters).

Clustering: Two Approaches
- A clustering algorithm requires us to provide an initial best estimate about the total number of clusters in the data (supervised).
- A clustering algorithm uses some method in an attempt to determine a best number of clusters (unsupervised).
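Both approaches can be sketched on one-dimensional data. The code below uses k-means as a stand-in clustering algorithm: `kmeans_1d` takes the number of clusters from the caller (first approach), while `choose_k` tries several values and returns the smallest one whose within-cluster error drops below a fraction of the one-cluster error (second approach). The data, the deterministic initialization, and the 5% tolerance are all assumptions made for illustration.

```python
def kmeans_1d(values, k, iterations=20):
    """k-means on 1-D data with the number of clusters k supplied by the caller."""
    data = sorted(values)
    # Deterministic start: spread the initial centers across the sorted data.
    centers = [data[i * (len(data) - 1) // max(k - 1, 1)] for i in range(k)]
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for x in data:
            nearest = min(range(k), key=lambda i: abs(x - centers[i]))
            clusters[nearest].append(x)
        # Recompute each center as its cluster mean; keep it if the cluster is empty.
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters


def within_cluster_error(centers, clusters):
    """Sum of squared distances from each point to its cluster center."""
    return sum((x - c) ** 2 for c, members in zip(centers, clusters) for x in members)


def choose_k(values, max_k=4, tolerance=0.05):
    """Smallest k whose error is at most tolerance times the one-cluster error."""
    baseline = within_cluster_error(*kmeans_1d(values, 1))
    if baseline == 0:
        return 1
    for k in range(2, max_k + 1):
        if within_cluster_error(*kmeans_1d(values, k)) <= tolerance * baseline:
            return k
    return max_k


# Hypothetical customer ages with two obvious groups.
ages = [22, 25, 27, 24, 61, 64, 66, 63]
centers, clusters = kmeans_1d(ages, 2)
best_k = choose_k(ages)
```

The error threshold is only one of many possible heuristics for picking the number of clusters; the point is the contrast between supplying k and searching for it.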

Classification
Classification deals with discrete outcomes: yes or no; big or small; strange or not strange; yellow, green or red; etc. Estimation is often used to perform a classification task: estimating the number of children in a family; estimating a family’s total household income; etc. Neural networks and regression models are commonly used tools for classification/estimation.

Prediction
Prediction is the same as classification or estimation, except that the records are classified according to some predicted future behavior or estimated future value. Any of the techniques used for classification and estimation can be adapted for use in prediction.

Classification and Prediction: Implementation
To implement both classification and prediction, we should use training examples in which the value of the variable to be predicted, or the class membership to be assigned, is already known.

Is Data Mining Appropriate for My Problem?

Will Data Mining Help Me?
- Can we clearly define the problem?
- Do potentially meaningful data exist?
- Do the data contain hidden knowledge, or are the data useful for reporting purposes only?
- Will the cost of processing the data be less than the likely increase in profit seen by applying any potential knowledge gained from the data mining?

Data Mining or Data Query?
- Shallow Knowledge
- Multidimensional Knowledge
- Hidden Knowledge
- Deep Knowledge

Shallow Knowledge
Shallow knowledge is factual. It can be easily stored and manipulated in a database.

Multidimensional Knowledge
Multidimensional knowledge is also factual. On-line Analytical Processing (OLAP) tools are used to manipulate multidimensional knowledge.
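The kind of manipulation an OLAP tool performs can be illustrated with a roll-up: summing a measure along one dimension of the data. This is a toy sketch in plain Python rather than an actual OLAP product, and the sales facts are hypothetical.

```python
from collections import defaultdict

# Hypothetical sales facts: two dimensions (region, quarter) and one measure.
sales = [
    ("East", "Q1", 100),
    ("East", "Q2", 150),
    ("West", "Q1", 80),
    ("West", "Q2", 120),
]


def roll_up(facts, dimension):
    """Sum the measure over one dimension (0 = region, 1 = quarter)."""
    totals = defaultdict(int)
    for fact in facts:
        totals[fact[dimension]] += fact[2]
    return dict(totals)


by_region = roll_up(sales, 0)
by_quarter = roll_up(sales, 1)
```

The same facts answer different questions depending on which dimension is kept, which is what makes the knowledge multidimensional yet still factual.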

Hidden Knowledge
Hidden knowledge represents patterns or regularities in data that cannot be easily found using database query. However, data mining algorithms can find such patterns with ease.

Deep Knowledge
Deep knowledge is knowledge stored in a database that can only be found if we are given some direction about what we are looking for.

Data Mining or Data Query?
- Shallow knowledge can be extracted by a database query language such as SQL.
- Multidimensional knowledge can be extracted by On-line Analytical Processing (OLAP) tools.
- Hidden knowledge represents patterns and regularities in data that cannot be easily found.
- Deep knowledge can be found if we are given some direction about what we are looking for.

Data Mining vs. Data Query
Use data query if you already almost know what you are looking for. Use data mining to find regularities in data that are not obvious.
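A data query looks like the following sketch, where the analyst already knows the question ("which customers trade more than five times a month?") and simply asks the database for the answer. It uses Python's built-in `sqlite3` module with an in-memory database; the table, column names, and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, age INTEGER, trades_per_month REAL)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [("Ann", 34, 12.0), ("Bob", 58, 2.5), ("Cara", 41, 7.0)],
)

# The selection criterion is stated up front, so nothing has to be discovered.
rows = conn.execute(
    "SELECT name FROM customers WHERE trades_per_month > ? ORDER BY name", (5,)
).fetchall()
active_traders = [name for (name,) in rows]
conn.close()
```

Data mining differs in that the criterion itself (which attributes matter, and at what values) is what the algorithm is asked to find.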

A Simple Data Mining Process Model

Knowledge Discovery in Databases (KDD)
The application of the scientific method to data mining. Data mining is one step of the KDD process.

Data Mining: A KDD Process
Data mining: the core of the knowledge discovery process.
Databases -> Data Cleaning and Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation
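The KDD stages named on this slide can be sketched as a chain of small functions. Everything here is a simplification made for illustration: the two source lists stand in for databases, "cleaning" just drops records with missing values, and the "mining" step only computes a purchase rate per channel.

```python
from collections import Counter


def integrate(*sources):
    """Data integration: combine several sources into one collection."""
    return [record for source in sources for record in source]


def clean(records):
    """Data cleaning: drop records that have missing values."""
    return [r for r in records if all(v is not None for v in r.values())]


def select(records, fields):
    """Selection: keep only the task-relevant attributes."""
    return [{f: r[f] for f in fields} for r in records]


def mine(records, attribute, target):
    """Data mining (toy version): rate of a true target per attribute value."""
    totals, hits = Counter(), Counter()
    for r in records:
        totals[r[attribute]] += 1
        hits[r[attribute]] += bool(r[target])
    return {value: hits[value] / totals[value] for value in totals}


def evaluate(patterns, min_rate):
    """Pattern evaluation: keep only patterns above an interest threshold."""
    return {value: rate for value, rate in patterns.items() if rate >= min_rate}


# Hypothetical source databases.
store_a = [
    {"id": 1, "channel": "online", "age": 30, "bought": True},
    {"id": 2, "channel": "store", "age": None, "bought": False},
]
store_b = [
    {"id": 3, "channel": "online", "age": 45, "bought": True},
    {"id": 4, "channel": "store", "age": 52, "bought": False},
    {"id": 5, "channel": "online", "age": 28, "bought": False},
]

warehouse = clean(integrate(store_a, store_b))
relevant = select(warehouse, ["channel", "bought"])
patterns = mine(relevant, "channel", "bought")
interesting = evaluate(patterns, min_rate=0.5)
```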

The Data Warehouse
The data warehouse is a historical database designed for decision support.

A Simple Data Mining Process Model
1. Assemble a collection of data to analyze
2. Present these data to a data mining tool
3. Interpret the results
4. Apply the results to a new problem or situation