Data Mining Lecture 1: Introduction to Data Mining Manuel Penaloza, PhD.

Presentation transcript:


2 Introduction to Data Mining
Society produces huge amounts of data daily:
- Retail stores: point-of-sale (POS) data on customer purchases
- Banks: collections of customer service calls
- Telecommunications: phone call records (mobile and landline calls)
- Medicine: genomic data collected on the structure of genes
- Government: law enforcement data, income tax data
- Others: (transactional) data from sports, schools, research, search engines, etc.

3 What is Data Mining (DM)?
It is the process of discovering hidden relationships and patterns in large data sets
- It can also predict the outcome of a future observation
Data mining is an interdisciplinary field
- It is an extension to statistical analysis
- It uses techniques from:
  - Statistics
  - Machine learning
  - Pattern recognition
  - Database technology
  - Visualization
  - High-performance computing

4 Questions answered by DM
Extracting useful information from a dataset that answers questions such as:
- Which credit card (CC) customers are most profitable?
- Which loan applicants are high-risk?
- Which customers will respond to a planned promotion?
- How do we detect phone card fraud?
- How do customer profiles change over time?
- Which customers prefer product A over product B?
- What is the revenue prediction for next year?
- Which students are more likely to transfer than others?
- Which taxpayers may be cheating the system?
- Who is most likely to violate a probation sentence?
- What is the predicted outcome for some treatment?

5 Data sources
Relational databases
- Transactional data with many tables
Data warehouses
- Historical data, aggregated and updated periodically
Files
- In a special format (e.g., CSV) or proprietary binary
Internet or electronic mail
- HTML, XML, web search results, e-mails
Scientific, research
- Seismology, remote sensing, etc.

6 Example: Health System
Characteristics of the Health System:
- Personal medical records (GP, specialists, etc.)
- Billing records
- Hospital data (surgery, admission, etc.)
Questions:
- Are MDs following the procedures?
- Which patients may have an adverse drug reaction?
- Are people committing fraud?
- Which patients are most likely to get cancer?

7 Case study: E-commerce
A person buys a book from Amazon.com
Objective: Recommend other books this person is likely to buy
Amazon may do clustering or sequential pattern analysis based on books bought by other people
Data analyzed:
- Customers who bought "Data Mining: Practical Machine Learning Tools and Techniques" also bought "Introduction to Data Mining"
Recommendations have been successful for Amazon
- Increasing buyers' satisfaction and purchases
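
The "customers who bought X also bought Y" idea on this slide can be sketched with simple co-purchase counting. This is only an illustration, not a description of how Amazon actually builds recommendations: the purchase histories, the `orders` list, and the `also_bought` function are all hypothetical, and the slide itself mentions clustering or sequential pattern analysis as the techniques Amazon may use.

```python
from collections import Counter

# Hypothetical purchase histories (one set of titles per customer); not real Amazon data.
WITTEN = "Data Mining: Practical Machine Learning Tools and Techniques"
TAN = "Introduction to Data Mining"
HAN = "Data Mining: Concepts and Techniques"

orders = [
    {WITTEN, TAN},
    {WITTEN, TAN, HAN},
    {WITTEN, TAN},
    {HAN},
]

def also_bought(title, orders, top_n=2):
    """Rank other titles by how often they share a basket with `title`."""
    counts = Counter()
    for basket in orders:
        if title in basket:
            # Count every other title bought together with `title`.
            counts.update(basket - {title})
    return counts.most_common(top_n)

recommendations = also_bought(WITTEN, orders)
print(recommendations)
```

With this toy data, "Introduction to Data Mining" ranks first because it appears in all three baskets that contain the Witten and Frank book.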

8 What motivated data mining?
- Growth in data collection
- Presence of data warehouses with reliable data
- Competitive pressure to increase sales
- The development of commercial off-the-shelf (COTS) data mining software
  - Examples: XLMiner, Insightful Miner, SAS, SPSS
- Growth of computing power and storage capacity
- High dimensionality of the data
- Heterogeneous and complex data
- Limitations of humans

9 Insightful Miner™ 7: GUI
* Figures taken from the Insightful Miner 7 Guide

10 Creating Models
Create a network of pipelined components
- By dragging and dropping components

11 Choosing a data mining system
Data mining systems differ in functionality and methodology
Selection is determined by:
- Type of operating system used in your organization
- The data sources handled by the tool:
  - ASCII text files, relational databases, XML data
- The data mining functions and methods offered
- Scalability of the system
  - Row and column scalability
- Visualization tools available
- Graphical user interface that guides the execution of the methods
- Integration with other information systems
- Cost and performance

12 Data Mining in Databases
Current applications include data mining modules
Examples:
- Database management systems such as Oracle and MS SQL Server
- CRM (Customer Relationship Management)
Advantages for database systems:
- One-stop shopping
- Minimizes data movement and conversion
Disadvantages for database systems:
- Limited to the DM methods available in the system
- Data extractions and transformations may not be powerful enough

13 Standard data mining life cycle
CRISP-DM (Cross-Industry Standard Process for Data Mining)
- It is an iterative process with phase dependencies
- It consists of six (6) phases

14 CRISP-DM
Cross-industry standard developed in 1996
- Analysts from SPSS/ISL, NCR, Daimler-Benz, OHRA
Funding from European Commission
Important characteristics:
- Non-proprietary
- Application/industry neutral
- Tool neutral
- General problem-solving process
- Process with six phases but missing:
  - Saving results and updating the model

15 CRISP-DM Phases (1)
Business Understanding
- Understand project objectives and requirements
- Formulation of a data mining problem definition
Data Understanding
- Data collection
- Evaluate the quality of the data
- Perform exploratory data analysis
Data Preparation
- Clean, prepare, integrate, and transform the data
- Select appropriate attributes and variables

16 CRISP-DM Phases (2)
Modeling
- Select and apply appropriate modeling techniques
- Calibrate model parameters to optimize results
- If necessary, return to the data preparation phase to satisfy the model's data format
Evaluation
- Determine if the model satisfies the objectives set in phase 1
- Identify business issues that have not been addressed
Deployment
- Organize and present the model to the "user"
- Put the model into practice
- Set up for continuous mining of the data

17 Data mining tasks (1)
Classification
- Predict the categorical value of a target (dependent) variable based on the values of other attributes
- Target variable is partitioned into classes
- It predicts class membership of a new observation
- Example: Which drug should be prescribed for older patients with low sodium/potassium ratios?
Estimation
- Similar to classification except that the target variable is numeric
- That is, predicting a numeric value
- Example: Estimate the blood pressure of a person based on his/her age, gender, body mass index, etc.
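
As a rough illustration of the estimation task, the sketch below fits a straight line by ordinary least squares and uses it to estimate a numeric target. The ages and blood pressure values are made up (and deliberately perfectly linear so the arithmetic is easy to follow), only one predictor is used instead of the several the slide lists, and `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); the 1/n factors cancel.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical training data: age in years, systolic blood pressure.
ages = [30, 40, 50, 60]
pressures = [115, 120, 125, 130]

slope, intercept = fit_line(ages, pressures)

# Estimate the numeric target for a new observation (a 45-year-old).
estimate = intercept + slope * 45
print(slope, intercept, estimate)
```

Real data would not fall exactly on a line, and a practical model would include the other attributes mentioned on the slide (gender, body mass index, and so on).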

18 Data mining tasks (2)
Prediction
- Similar to estimation except that results lie in the future
- Example: Predict the price of a stock 3 months into the future
Clustering
- Grouping similar records together
- Example: Find patients with similar profiles
Associations
- Uncover rules that indicate the association between two or more attributes
- Example: Find out which items are purchased together
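
The association task can be illustrated with two measures commonly used to score a rule such as "diapers => beer": support and confidence. These measures are not named on the slide and are introduced here as an assumption about how such rules are typically evaluated; the transactions and function names are hypothetical.

```python
# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Among transactions containing `antecedent`, the fraction that also contain `consequent`."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

rule_support = support({"diapers", "beer"}, transactions)          # 2 of 5 baskets
rule_confidence = confidence({"diapers"}, {"beer"}, transactions)  # 2 of the 3 diaper baskets
print(rule_support, rule_confidence)
```

A rule-mining method would search over many candidate item sets and keep only the rules whose support and confidence exceed chosen thresholds.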

19 Task: Classification
Build a model that learns to predict the class from pre-labeled instances or observations
- Many approaches: Regression, Decision Trees, Neural Networks
Given a set of points from known classes, what is the class of a new point?
* Diagram taken from www.kdnuggets.com/data_mining_course/index.html
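
The question "what is the class of a new point?" can be sketched with a k-nearest-neighbor classifier, one of the methods listed later in this lecture. This is a minimal sketch on made-up 2-D points; the labels "A" and "B", the training set, and the function name `knn_predict` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical pre-labeled observations: ((x, y), class label).
train = [
    ((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
    ((4.0, 4.2), "B"), ((4.1, 3.9), "B"), ((3.8, 4.0), "B"),
]

def knn_predict(train, point, k=3):
    """Predict the class of `point` by majority vote among its k nearest labeled neighbors."""
    # Sort the labeled points by Euclidean distance to the new point and keep the k closest.
    nearest = sorted(train, key=lambda item: math.dist(item[0], point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(knn_predict(train, (1.1, 1.0)))
print(knn_predict(train, (4.0, 4.0)))
```

Decision trees, regression, and neural networks answer the same question with a model fitted in advance, whereas k-nearest neighbor defers all the work to prediction time.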

20 Task: Clustering
Find groupings of instances given unlabeled data
* Diagram taken from www.kdnuggets.com/data_mining_course/index.html
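
One common way to find groupings in unlabeled data is k-means, sketched below. The slide does not name a specific clustering algorithm, so k-means is an assumption chosen for illustration; the points, the starting centroids, and the fixed iteration count are also hypothetical.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Basic k-means: alternate between assigning points and recomputing centroids."""
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            idx = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[idx].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical unlabeled observations forming two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]

# Start from two of the points as initial centroids so the run is deterministic.
centroids, clusters = kmeans(points, centroids=[points[0], points[3]])
print(centroids)
print(clusters)
```

In practice the number of clusters and the starting centroids have to be chosen by the analyst, and different choices can produce different groupings.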

21 DM looks easy
Data → Data Mining Method (Regression, Decision Tree, Neural Network, …, Association Rules) → Model
- But it is not easy
- The real world is complicated

22 Methods and Techniques
- Cluster analysis (tasks: clustering)
- Association rules (tasks: association)
- Decision trees (tasks: prediction, classification)
- Neural networks (tasks: prediction, classification)
- K-nearest neighbor (tasks: prediction, classification, clustering)
- Regression analysis (tasks: estimation, prediction)
- Confidence interval estimation (tasks: estimation)

23 Fallacies of Data Mining (1)
Fallacy 1: There are data mining tools that automatically find the answers to our problems
- Reality: There are no automatic tools that will solve your problems "while you wait"
Fallacy 2: The DM process requires little human intervention
- Reality: The DM process requires human intervention in all its phases, including updating and evaluating the model by human experts
Fallacy 3: Data mining has a quick ROI
- Reality: It depends on the startup costs, personnel costs, data source costs, and so on

24 Fallacies of Data Mining (2)
Fallacy 4: DM tools are easy to use
- Reality: Analysts must be familiar with the model
Fallacy 5: DM will identify the causes of the business problem
- Reality: DM tools only identify patterns in your data; analysts must identify the cause
Fallacy 6: Data mining will clean up a data repository automatically
- Reality: A sequence of transformation tasks must be defined by an analyst during the early DM phases
* Fallacies described by Jen Que Louie, President of Nautilus Systems, Inc.

25 In summary
Problems suitable for data mining:
- Require discovering knowledge to make the right decisions
- Current solutions are not adequate
- Have an expected high payoff for the right decisions
- Have accessible, sufficient, and relevant data
- Have a changing environment
IMPORTANT:
- ENSURE privacy if personal data is used!
- Not every data mining application is successful!

26 Main References
- Ian Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques, 2nd edition, Morgan Kaufmann Publishers
- Daniel Larose. Discovering Knowledge in Data: An Introduction to Data Mining, Wiley
- Pang-Ning Tan et al. Introduction to Data Mining, Addison Wesley
- Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers
- Online data mining course offered by KDnuggets™
- Engineering Statistics Handbook, available online

27 Exercise #1
CRISP-DM is not the only DM process. Do a quick search on the Internet for another process, and describe any similarities and differences with CRISP-DM.
Determine how data mining could help a web search engine company like Google in its operation.
- Identify one or more objectives.
- Which data mining task(s) could help this company?