Introduction To Data Mining

What Is Data Mining?
A tool
Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data
Core of KDD
Integration of multiple technologies

Part of KDD (Knowledge Discovery in Databases)
Data -> Selection -> Target Data -> Preprocessing -> Preprocessed Data -> Data Mining -> Patterns -> Interpretation/Evaluation -> Knowledge
Adapted from: U. Fayyad, et al. (1995), “From Knowledge Discovery to Data Mining: An Overview,” Advances in Knowledge Discovery and Data Mining, U. Fayyad et al. (Eds.), AAAI/MIT Press

Integration of Multiple Technologies
Data mining draws on: machine learning, database management, artificial intelligence, statistics, visualization, algorithms, and other knowledge

Why Data Mining?
We are drowning in data (the data explosion problem) but starving for knowledge!
Solution: data warehousing and data mining
  – Data warehousing and on-line analytical processing
  – Mining interesting knowledge (rules, regularities, patterns, constraints) from data in large databases
Many potential applications
  – Market analysis and management: target marketing, customer relationship management (CRM), market basket analysis, cross-selling, market segmentation
  – Risk analysis and management: forecasting, customer retention, improved underwriting, quality control, competitive analysis
  – Health care
  – …

Data mining process
[Figure: process diagram. Recoverable labels: “Knowledge-base”, “State the problem + hypothesis”]

Knowledge from Data Mining
Association rules
Sequential association
Classification rules
Clustering
Deviation detection
…

Association Rules
Identify associations in the data: correlation [A, B] and implication [A -> B] (an association by itself does not establish causality)
Indicate the significance of each association (a rule is interesting only if its confidence exceeds a chosen threshold)
Not every association is interesting (some are too trivial, others are negative associations)
E.g. market-basket analysis: “Find groups of items commonly purchased together”
  – People who purchase fish are likely to purchase wine
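To make "confidence exceeds a threshold" concrete, here is a minimal sketch that scores single-item rules A -> B by support and confidence. The transactions, the function names, and both thresholds are assumptions chosen for illustration; they are not taken from the slides.

```python
from itertools import combinations

# Hypothetical toy baskets; the item names are illustrative only.
transactions = [
    {"fish", "wine", "bread"},
    {"fish", "wine"},
    {"fish", "milk"},
    {"bread", "milk"},
    {"fish", "wine", "milk"},
]

def count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions)

def support(itemset, transactions):
    """Fraction of transactions that contain the whole itemset."""
    return count(itemset, transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Share of transactions containing the antecedent that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return count(both, transactions) / count(antecedent, transactions)

def find_rules(transactions, min_support=0.4, min_confidence=0.7):
    """Return (A, B, support, confidence) for single-item rules A -> B above both thresholds."""
    items = sorted(set().union(*transactions))
    found = []
    for a, b in combinations(items, 2):
        for lhs, rhs in ((a, b), (b, a)):
            s = support({lhs, rhs}, transactions)
            if s < min_support:
                continue
            c = confidence({lhs}, {rhs}, transactions)
            if c >= min_confidence:
                found.append((lhs, rhs, s, c))
    return found

rules = find_rules(transactions)
```

On this toy data only the fish/wine pair survives both thresholds, in both directions. Production algorithms such as Apriori prune the search over larger itemsets; this sketch enumerates pairs only.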

Sequential Associations
Find event sequences that are unusually likely
Requires a “training” event list and known “interesting” events
Must be robust in the face of additional “noise” events
Uses:
  – Failure analysis and prediction
Technologies:
  – Dynamic programming (dynamic time warping)
  – “Custom” algorithms
Example: “Find common sequences of warnings/faults within 10 minute periods”
  – Warn 2 on Switch C preceded by Fault 21 on Switch B
  – Fault 17 on any switch preceded by Warn 2 on any switch
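The "within 10 minute periods" example can be approximated by counting ordered event pairs that fall inside a time window. This is a simple counting sketch, not dynamic time warping and not a test of whether a sequence is "unusually likely"; the log entries, the function name, and the thresholds are assumptions made for illustration.

```python
from collections import Counter

def frequent_pairs(events, window_minutes=10, min_count=2):
    """Count ordered pairs (earlier, later) whose timestamps are at most window_minutes apart.

    events is a list of (minute, label) tuples; pairs seen fewer than
    min_count times are dropped as noise.
    """
    events = sorted(events)  # order by time
    counts = Counter()
    for i, (t1, first) in enumerate(events):
        for t2, second in events[i + 1:]:
            if t2 - t1 > window_minutes:
                break  # later events are even further away
            counts[(first, second)] += 1
    return {pair: n for pair, n in counts.items() if n >= min_count}

# Hypothetical switch log: (minute, event label).
log = [
    (0, "Fault 21 on Switch B"),
    (3, "Warn 2 on Switch C"),
    (40, "Fault 21 on Switch B"),
    (45, "Warn 2 on Switch C"),
    (90, "Unrelated noise event"),
    (200, "Fault 21 on Switch B"),
    (230, "Warn 2 on Switch C"),
]

common = frequent_pairs(log)
```

In this log the fault precedes the warning three times, but only two of those occurrences fall inside the 10 minute window, so the pair is reported with a count of 2.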

Classification Rules
Classify a set of data based on their values in certain attributes
Requires “training data” whose class attribute is already known
Uses:
  – Profiling
Technologies:
  – Decision trees (results are human-understandable)
  – Neural nets
Example: “Route documents to most likely interested parties”
  – English or non-English?
  – Domestic or foreign?

Clustering
Group a set of data based on the conceptual clustering principle (i.e. maximizing the intraclass similarity and minimizing the interclass similarity)
No “training data”: the classes are not predefined
Uses:
  – Demographic analysis
Technologies:
  – Self-organizing maps
  – Probability densities
  – Conceptual clustering
Example: “Group people with similar travel profiles”
  – George, Patricia
  – Jeff, Evelyn, Chris
  – Rob

Deviation Detection
Find unexpected values, outliers
Uses:
  – Failure analysis
  – Anomaly discovery for analysis
Technologies:
  – Clustering/classification methods
  – Statistical techniques
  – Visualization
Example: “Find unusual occurrences in IBM stock prices”
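One of the simplest statistical techniques for this task is a z-score test: flag any value that lies more than a chosen number of standard deviations from the mean. The sketch below assumes roughly bell-shaped data; the price series is made up (it is not real stock data), and the threshold of 2.5 is an arbitrary illustrative choice.

```python
from statistics import mean, pstdev

def find_outliers(values, threshold=2.5):
    """Return (index, value) pairs whose z-score magnitude exceeds threshold."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical, nothing deviates
    return [(i, v) for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]

# Hypothetical daily closing prices with one unusual jump at the end.
prices = [100, 101, 99, 102, 98, 100, 101, 99, 100, 130]

outliers = find_outliers(prices)
```

Because the outlier itself inflates the mean and standard deviation, a z-score test can miss deviations in very small samples; median-based measures are a common, more robust alternative.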

Popular Data Mining Techniques
Supervised
  – Decision trees
  – Rule induction
  – Regression models
  – Neural networks
  – …
Unsupervised
  – K-means clustering
  – Self-organizing maps
  – …

Supervised vs. Unsupervised
Supervised algorithms
  » Learning by example:
    – Use training data in which the value of the response variable is already known
    – Create a model by running the algorithm on the training data
    – Identify a class label for incoming new data
  » Driven by a real business problem and historical data
Unsupervised algorithms
  » Do not use labeled training data
  » Patterns may not be known in advance

Supervised Algorithms

Decision Trees
A tree structure where non-terminal nodes represent tests on one or more attributes and terminal nodes reflect decision outcomes
Advantages of decision trees
  – Understandable
  – Relatively fast
  – Easy to translate into SQL queries
Disadvantages of decision trees
  – Limited to one output attribute
  – Decision tree algorithms can be unstable (a small change in the training data can produce a different tree)
Types of decision trees
  – CHAID: Chi-Square Automatic Interaction Detection
  – CART: Classification and Regression Trees
  – …

Figure 1.1 A decision tree for the data in Table 1.1
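The claim that a decision tree is easy to translate into SQL can be illustrated with a small sketch: each root-to-leaf path becomes one WHERE clause. The tree below is hypothetical (it is not the tree from Figure 1.1), and the attribute names, the `customers` table, and the function names are assumptions.

```python
# Hypothetical tree: internal nodes test one attribute, leaves hold the decision.
tree = {
    "attribute": "credit_card_insurance",
    "branches": {
        "Yes": "responds",
        "No": {
            "attribute": "sex",
            "branches": {"Female": "responds", "Male": "does not respond"},
        },
    },
}

def classify(node, instance):
    """Follow the test at each non-terminal node until a terminal node is reached."""
    while isinstance(node, dict):
        node = node["branches"][instance[node["attribute"]]]
    return node

def to_sql(node, table, conditions=()):
    """Return one (query, decision) pair per root-to-leaf path."""
    if not isinstance(node, dict):
        where = " AND ".join(conditions) if conditions else "TRUE"
        return [(f"SELECT * FROM {table} WHERE {where};", node)]
    queries = []
    for value, child in node["branches"].items():
        condition = f"{node['attribute']} = '{value}'"
        queries.extend(to_sql(child, table, conditions + (condition,)))
    return queries

queries = to_sql(tree, "customers")
```

Each generated query selects exactly the rows that would reach the corresponding leaf, which is why tree models are straightforward to deploy inside a database.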

Rule Induction
The extraction of useful, independent if-then rules from data, based on statistical significance
If rules produce conflicting predictions, resolve the conflict according to confidence
Advantage and disadvantage
  – Understandable
  – Does not cover all possible situations
E.g.
  IF Swollen Glands = Yes THEN Diagnosis = Strep Throat
  IF Swollen Glands = No & Fever = Yes THEN Diagnosis = Cold
  IF Swollen Glands = No & Fever = No THEN Diagnosis = Allergy
IF = antecedent, THEN = consequent
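A minimal sketch of applying such rules and resolving a conflict by confidence follows. The first three rules are the ones from the slide; their confidence values, the fourth (conflicting) rule, and all class and function names are assumptions added for illustration.

```python
from dataclasses import dataclass

@dataclass
class Rule:
    antecedent: dict   # attribute -> required value (the IF part)
    consequent: str    # predicted diagnosis (the THEN part)
    confidence: float  # hypothetical value, not taken from the slides

    def matches(self, instance):
        return all(instance.get(attr) == value for attr, value in self.antecedent.items())

RULES = [
    Rule({"Swollen Glands": "Yes"}, "Strep Throat", 0.90),
    Rule({"Swollen Glands": "No", "Fever": "Yes"}, "Cold", 0.80),
    Rule({"Swollen Glands": "No", "Fever": "No"}, "Allergy", 0.75),
    # Hypothetical extra rule that conflicts with the first one.
    Rule({"Fever": "Yes"}, "Cold", 0.60),
]

def diagnose(instance, rules=RULES):
    """Return the consequent of the most confident matching rule, or None if no rule fires."""
    matching = [rule for rule in rules if rule.matches(instance)]
    if not matching:
        return None  # the rule set does not cover this situation
    return max(matching, key=lambda rule: rule.confidence).consequent
```

A patient with swollen glands and a fever matches both the first and the fourth rule; the higher-confidence rule wins, so the prediction is Strep Throat. An instance that no rule covers returns `None`, which reflects the disadvantage listed above.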

Neural Networks
Non-linear predictive models that learn through training and resemble biological neural networks in structure
A means of efficiently modeling large and complex problems in which there may be hundreds of predictor variables that have many interactions
Disadvantages
  – Difficult to understand
  – Can require significant amounts of time to train and to prepare data
  – …

Figure 2.2 A multilayer fully connected neural network
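As a rough sketch of what a fully connected network computes, the code below runs one forward pass: every node takes a weighted sum of all outputs from the previous layer and passes it through a non-linear function. The sigmoid activation, the layer sizes, and the weights are assumptions for illustration; training (adjusting the weights from examples) is not shown.

```python
import math

def sigmoid(x):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def layer_output(inputs, weights, biases):
    """One fully connected layer: each node sees every input."""
    return [
        sigmoid(sum(w * x for w, x in zip(node_weights, inputs)) + bias)
        for node_weights, bias in zip(weights, biases)
    ]

def forward(inputs, layers):
    """Feed the inputs through each (weights, biases) layer in turn."""
    activations = inputs
    for weights, biases in layers:
        activations = layer_output(activations, weights, biases)
    return activations

# Hypothetical network: 3 inputs -> 2 hidden nodes -> 1 output node.
network = [
    ([[0.2, -0.4, 0.1], [0.7, 0.3, -0.5]], [0.0, 0.1]),  # hidden layer
    ([[0.6, -0.9]], [0.05]),                              # output layer
]

prediction = forward([1.0, 0.5, 0.0], network)
```

The output is a single number between 0 and 1. The difficulty of explaining why the network produced that number, given only the weights, is the "difficult to understand" disadvantage in practice.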

Regression Models
Statistical techniques
Use existing values to forecast what other values will be
Y = a + b1(X1) + b2(X2) + b3(X3) + b4(X4) + b5(X5) …
Many types of regression (linear regression, logistic regression, …)
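For the single-predictor case Y = a + b1(X1), the least-squares coefficients have a closed form, sketched below. The data points and function names are made up for illustration; with several predictors the same idea is usually solved with a linear algebra library rather than by hand.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def predict(a, b, x):
    """Forecast y for a new x using the fitted line."""
    return a + b * x

# Hypothetical existing values that lie exactly on y = 1 + 2x.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

a, b = fit_line(xs, ys)
forecast = predict(a, b, 6)
```

Here the fit recovers a = 1 and b = 2, so the forecast for x = 6 is 13.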

K-Means Clustering
Unsupervised algorithm
Steps of the algorithm
  1. Choose a value for K, the total number of clusters.
  2. Randomly choose K points as cluster centers.
  3. Assign the remaining instances to their closest cluster center.
  4. Calculate a new cluster center for each cluster.
  5. Repeat steps 3-4 until the cluster centers do not change.
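The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a tuned implementation: the function name, the optional `initial_centers` and `seed` parameters, the `max_iter` cap, and the sample points are all assumptions, and every instance (including the ones chosen as starting centers) is assigned in step 3.

```python
import math
import random

def kmeans(points, k, initial_centers=None, seed=None, max_iter=100):
    """Cluster points (tuples of numbers) into k groups; returns (centers, clusters)."""
    rng = random.Random(seed)
    # Steps 1-2: K is given; pick K instances at random as the starting centers.
    centers = list(initial_centers) if initial_centers is not None else rng.sample(points, k)
    clusters = []
    for _ in range(max_iter):
        # Step 3: assign each instance to its closest cluster center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Step 4: the new center of a cluster is the mean of its members.
        new_centers = [
            tuple(sum(coord) / len(members) for coord in zip(*members)) if members else centers[i]
            for i, members in enumerate(clusters)
        ]
        # Step 5: stop once the centers no longer change.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

# Hypothetical 2-D data with two visibly separate groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centers, clusters = kmeans(points, k=2, initial_centers=[(1, 1), (8, 8)])
```

Passing `initial_centers` makes the run reproducible; leaving it out uses the random choice described in step 2, in which case different runs can converge to different clusterings.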

A Hypothesis for the Credit Card Promotion Database
A combination of one or more of the dataset attributes differentiates Acme Credit Card Company card holders who have taken advantage of the life insurance promotion from those card holders who have chosen not to participate in the promotional offer.

Figure 2.3 An unsupervised cluster of the credit card database

Choosing a Data Mining Technique
Know which kind of knowledge you want to get
Know your data
  – What is the interaction between input and output attributes?
  – What is the distribution of the data?
  – Which attributes best define the data?
Know the differences among data mining techniques

Questions to Determine Data Mining Applicability
1. Can the problem be clearly defined?
2. Does potentially meaningful data exist?
3. Does the data contain hidden knowledge, or is it just filled with facts?
4. Is the “juice worth the squeeze?”

Data Mining vs. OLAP
Data mining
  – Discovery-based (inductive process)
  – Mines the data warehouse and other sources
  – Can provide information you didn’t expect
OLAP
  – Verification-based (deductive process)
  – DSS tool for the data warehouse
  – Pre-defined queries

Data Mining vs. Data Query
Data mining
  – For hidden knowledge
  – Tries to get an answer that is as accurate as possible
  – Results are an analysis of the data
  – Data needs to be prepared before producing results
Data query
  – For a specific question
  – The answer to a query is 100% accurate if the data is correct
  – Results are a subset of the data
  – Data need not be prepared