Data Mining An Introduction.

Slides:



Advertisements
Similar presentations
1 CPS : Information Management and Mining Association Rules and Frequent Itemsets.
Advertisements

Data Mining of Very Large Data
IDS561 Big Data Analytics Week 6.
1 Association Rules Market Baskets Frequent Itemsets A-Priori Algorithm.
Data Mining Sangeeta Devadiga CS 157B, Spring 2007.
Association Analysis. Association Rule Mining: Definition Given a set of records each of which contain some number of items from a given collection; –Produce.
Data Mining, Frequent-Itemset Mining
Chapter 9 Business Intelligence Systems
1 Association Rules Market Baskets Frequent Itemsets A-priori Algorithm.
Asssociation Rules Prof. Sin-Min Lee Department of Computer Science.
Data Mining, Frequent-Itemset Mining. Data Mining Some mining problems Find frequent itemsets in "market-basket" data – "50% of the people who buy hot.
Data Mining By Archana Ketkar.
Data Mining – Intro.
CS157A Spring 05 Data Mining Professor Sin-Min Lee.
Data mining By Aung Oo.
1 “Association Rules” Market Baskets Frequent Itemsets A-priori Algorithm.
Enterprise systems infrastructure and architecture DT211 4
Data Mining : Introduction Chapter 1. 2 Index 1. What is Data Mining? 2. Data Mining Functionalities 1. Characterization and Discrimination 2. MIning.
Knowledge Discovery & Data Mining process of extracting previously unknown, valid, and actionable (understandable) information from large databases Data.
Chapter 5: Data Mining for Business Intelligence
Business Intelligence. Topics Chart Online Analytical Process, OLAP – Excel’s Pivot table – Data visualization with dashboard Data warehousing Data Mining.
 BA_EM Electronic Marketing – Pavel
Shilpa Seth.  What is Data Mining What is Data Mining  Applications of Data Mining Applications of Data Mining  KDD Process KDD Process  Architecture.
Data Mining Techniques As Tools for Analysis of Customer Behavior
Spatial Statistics and Spatial Knowledge Discovery First law of geography [Tobler]: Everything is related to everything, but nearby things are more related.
© 2008 Pearson Prentice Hall, Experiencing MIS, David Kroenke Slide 1 Chapter 9 Competitive Advantage with Information Systems for Decision Making.
Data Mining CS157B Fall 04 Professor Lee By Yanhua Xue.
INTRODUCTION TO DATA MINING MIS2502 Data Analytics.
1 1 Slide Introduction to Data Mining and Business Intelligence.
Knowledge Discovery and Data Mining Evgueni Smirnov.
Frequent Itemsets and Association Rules 1 Wu-Jun Li Department of Computer Science and Engineering Shanghai Jiao Tong University Lecture 3: Frequent Itemsets.
ASSOCIATION RULE DISCOVERY (MARKET BASKET-ANALYSIS) MIS2502 Data Analytics Adapted from Tan, Steinbach, and Kumar (2004). Introduction to Data Mining.
Knowledge Discovery and Data Mining Evgueni Smirnov.
Data Mining By : Tung, Sze Ming ( Leo ) CS 157B. Definition A class of database application that analyze data in a database using tools which look for.
Note to other teachers and users of these slides: We would be delighted if you found this our material useful in giving your own lectures. Feel free to.
Data Mining – Intro. Course Overview Spatial Databases Temporal and Spatio-Temporal Databases Multimedia Databases Data Mining.
CS157B Fall 04 Introduction to Data Mining Chapter 22.3 Professor Lee Yu, Jianji (Joseph)
Introduction of Data Mining and Association Rules cs157 Spring 2009 Instructor: Dr. Sin-Min Lee Student: Dongyi Jia.
EXAM REVIEW MIS2502 Data Analytics. Exam What Tool to Use? Evaluating Decision Trees Association Rules Clustering.
CRM - Data mining Perspective. Predicting Who will Buy Here are five primary issues that organizations need to address to satisfy demanding consumers:
Data Mining BY JEMINI ISLAM. Data Mining Outline: What is data mining? Why use data mining? How does data mining work The process of data mining Tools.
Frequent-Itemset Mining. Market-Basket Model A large set of items, e.g., things sold in a supermarket. A large set of baskets, each of which is a small.
Association Rule Mining
ASSOCIATION RULES (MARKET BASKET-ANALYSIS) MIS2502 Data Analytics Adapted from Tan, Steinbach, and Kumar (2004). Introduction to Data Mining.
Business Intelligence - 2 BUS 782. Topics Data warehousing Data Mining.
Business Intelligence. Topics Chart Online Analytical Process, OLAP – Excel’s Pivot table – Data visualization with dashboard Scenario Management Data.
1 CPS216: Advanced Database Systems Data Mining Slides created by Jeffrey Ullman, Stanford.
MIS2502: Data Analytics Advanced Analytics - Introduction.
Jeffrey D. Ullman Stanford University.  2% of your grade will be for answering other students’ questions on Piazza.  18% for Gradiance.  Piazza code.
DATA MINING PREPARED BY RAJNIKANT MODI REFERENCE:DOUG ALEXANDER.
Elsayed Hemayed Data Mining Course
Academic Year 2014 Spring Academic Year 2014 Spring.
Data Mining Copyright KEYSOFT Solutions.
Chap 6: Association Rules. Rule Rules!  Motivation ~ recent progress in data mining + warehousing have made it possible to collect HUGE amount of data.
DATA MINING It is a process of extracting interesting(non trivial, implicit, previously, unknown and useful ) information from any data repository. The.
Copyright © 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 28 Data Mining Concepts.
1 Data Warehousing Data Warehousing. 2 Objectives Definition of terms Definition of terms Reasons for information gap between information needs and availability.
Data Mining – Introduction (contd…) Compiled By: Umair Yaqub Lecturer Govt. Murray College Sialkot.
By Arijit Chatterjee Dr
MIS2502: Data Analytics Advanced Analytics - Introduction
DATA MINING © Prentice Hall.
Data Mining 101 with Scikit-Learn
CPS216: Advanced Database Systems Data Mining
MIS5101: Data Analytics Advanced Analytics - Introduction
Sangeeta Devadiga CS 157B, Spring 2007
Data Mining Association Rules Assoc.Prof.Songül Varlı Albayrak
Market Basket Analysis and Association Rules
MIS2502: Data Analytics Introduction to Advanced Analytics
Kenneth C. Laudon & Jane P. Laudon
MIS2502: Data Analytics Introduction to Advanced Analytics and R
Presentation transcript:

Data Mining An Introduction

Definitions Data Mining is … knowledge discovery using sophisticated blend of techniques from traditional statistics, artificial intelligence and computer graphics. Data mining is … discovery of useful, possibly unexpected, patterns in data.

Definitions Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner Data mining is the extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases.

Goals of Data Mining Explanatory Confirmatory Exploratory To explain some observed event or condition, e.g., why sales of trucks has increased in region A? Confirmatory To confirm a hypothesis, e.g., whether two-income families are more likely to buy family medical coverage as compared to single-income families Exploratory To analyze data for new or unexpected relationships, e.g., what spending patterns are likely to accompany credit card fraud?

Data Mining Techniques Data mining can be performed on a data mart or EDW. Depending on the nature of the data to be analyzed and size of data set, any of the following techniques can be used: Regression Test or discover relationships from historical data Decision tree induction Test or discover ‘if…Then’ rules for decision tendency Clustering and Signal Processing Discover subgroups or segments

Data Mining Techniques(Contd.) Affinity Discover things with strong mutual relationships Sequence Association Discover cycles of events & behaviours Case-based reasoning Derives rules from real-world case examples Rule discover Searches for patterns & correlations in large datasets Fractals Compresses large databases without losing information Neural nets Develops predictive models based on principles modeled after human brain

Analytical Processing vs. Models To a database person, data-mining is a powerful form of analytic processing --- queries that examine large amounts of data. Result is the data that answers the query. To a statistician, data-mining is the inference of models. Result is the parameters of the model

Data Mining & Knowledge Discovery in Databases

CRISP-DM methodology Cross Industry Standard Process for Data Mining Six phases Business Understanding Data Understanding Data Preparation Modeling Evaluation Deployment

Data Mining Applications Profiling Populations Developing profiles of a high-value customers, credit risks, and credit-card frauds Analysis of business trends Identifying markets with above/below average growth Target marketing Identifying customers for promotional activity Usage analysis Identifying usage patterns for products and services Campaign effectiveness Comparing campaign strategies for effectivesness

Data Mining Applications Product Affinity Identifying products that are purchased concurrently, or the shoppers characteristics for certain product groups Customer retention and churn Examining the behaviour of customers who left for competitors to prevent remaining customers from leaving Profitability analysis Determining which customers are profitable Customer Value Analysis Determining where valuable customers are at different stages in their life

Data Mining Applications Up-selling Identifying new products and services to sell to a customer based upon critical events and life-style changes Intelligence gathering total information awareness Web analysis Ranking web pages Financial predict price movements to make better investments

Frequent ItemSets and “Association Rules” Some related concepts to Data Mining Slides by Jeffrey Ullman, Stanford

The Market-Basket Model A large set of items, e.g., things sold in a supermarket. A large set of baskets, each of which is a small set of the items, e.g., the things one customer buys on one day.

Association Rule Mining transaction id customer id products bought sales records: market-basket data Trend: Products p5, p8 often bought together

Support Simplest question: find sets of items that appear “frequently” in the baskets. Support for itemset I = the number of baskets containing all items in I. Given a support threshold s, sets of items that appear in > s baskets are called frequent itemsets.

Applications --- (1) Real market baskets: chain stores keep terabytes of information about what customers buy together. Tells how typical customers navigate stores, lets them position tempting items. Suggests tie-in “tricks,” e.g., run sale on diapers and raise the price of beer. High support needed, or no $$’s .

Applications --- (2) “Baskets” = documents; “items” = words in those documents. Lets us find words that appear together unusually frequently, i.e., linked concepts. “Baskets” = sentences, “items” = documents containing those sentences. Items that appear together too often could represent plagiarism.

Applications --- (3) “Baskets” = Web pages; “items” = linked pages. Pairs of pages with many common references may be about the same topic. “Baskets” = Web pages p ; “items” = pages that link to p . Pages with many of the same links may be mirrors or about the same topic.

Important Point “Market Baskets” is an abstraction that models any many-many relationship between two concepts: “items” and “baskets.” Items need not be “contained” in baskets. The only difference is that we count co-occurrences of items related to a basket, not vice-versa.

Scale of Problem WalMart sells 100,000 items and can store billions of baskets. The Web has over 100,000,000 words and billions of pages.

Association Rules If-then rules about the contents of baskets. {i1, i2,…,ik} → j means: “if a basket contains all of i1,…,ik then it is likely to contain j. Confidence of this association rule is the probability of j given i1,…,ik.

Example An association rule: {m, b} → c. B1 = {m, c, b} B2 = {m, p, j} B3 = {m, b} B4 = {c, j} B5 = {m, p, b} B6 = {m, c, b, j} B7 = {c, b, j} B8 = {b, c} An association rule: {m, b} → c. Confidence = 2/4 = 50%. + _ _ +

Interest The interest of an association rule is the absolute value of the amount by which the confidence differs from what you would expect, where items are selected independently of one another.

Example B1 = {m, c, b} B2 = {m, p, j} B3 = {m, b} B4 = {c, j} B5 = {m, p, b} B6 = {m, c, b, j} B7 = {c, b, j} B8 = {b, c} For association rule {m, b} → c, item c appears in 5/8 of the baskets. Interest = | 2/4 - 5/8 | = 1/8 --- not very interesting.

Relationships Among Measures Rules with high support and confidence may be useful even if they are not “interesting.” We don’t care if buying bread causes people to buy milk, or whether simply a lot of people buy both bread and milk. But high interest suggests a cause that might be worth investigating.

Finding Association Rules A typical question: “find all association rules with support ≥ s and confidence ≥ c.” Note: “support” of an association rule is the support of the set of items it mentions. Hard part: finding the high-support (frequent ) itemsets. Checking the confidence of association rules involving those sets is relatively easy.

Computation Model Typically, data is kept in a “flat file” rather than a database system. Stored on disk. Stored basket-by-basket. Expand baskets into pairs, triples, etc. as you read baskets.

Computation Model --- (2) The true cost of mining disk-resident data is usually the number of disk I/O’s. In practice, association-rule algorithms read the data in passes --- all baskets read in turn. Thus, we measure the cost by the number of passes an algorithm takes.

Main-Memory Bottleneck In many algorithms to find frequent itemsets we need to worry about how main memory is used. As we read baskets, we need to count something, e.g., occurrences of pairs. The number of different things we can count is limited by main memory. Swapping counts in/out is a disaster.

Finding Frequent Pairs The hardest problem often turns out to be finding the frequent pairs.

Other Effective Approaches Naïve Algorithm A-Priori Algorithm