Predictive Analytics in SQL and Datalog

Predictive Analytics in SQL and Datalog UCLA CS240A Course Notes*

Rollups, Data Cubes: Descriptive Analytics
In the 1990s, DBMSs were extended with descriptive analytics, a major technical and commercial success.
Moving from the GROUP BY aggregates of SQL-2 to GROUP BY ROLLUP/CUBE (data cubes) was an obvious step at the language level.
The sort-based support for traditional aggregates was extended to supergroups.
There were some technical challenges, but the general DBMS framework of big data residing in secondary storage was not challenged.
The need to go beyond that and support discovery analytics was realized immediately, but no satisfactory solution was found.

Memorable Attempts
Apriori in DB2, Sarawagi et al. [SIGMOD 1998]: only cache-mining works at the required speed!
Imielinski and Mannila [CACM 1996]: define declarative language constructs for data mining. The rest will follow, as future research will invent query optimization techniques that make these constructs very efficient.
Inductive databases: a research field that explored this idea for a few years and then gave up.
But recently things have started changing.

Market Basket Analysis (MBA): applicable to many other contexts
Telecommunication: each customer is a transaction containing the set of that customer's phone calls.
Atmospheric phenomena: each time interval (e.g. a day) is a transaction containing the set of observed events (rain, wind, etc.).
Etc.

Association Rules
Express how products and services relate to each other and tend to group together.
Example: "if a customer purchases three-way calling, then that customer will also purchase call-waiting".
Simple to understand.
Actionable information: bundle three-way calling and call-waiting in a single package.

Useful, trivial, inexplicable
Useful: "On Thursdays, grocery store consumers often purchase diapers and beer together."
Trivial: "Customers who purchase maintenance agreements are very likely to purchase large appliances."
Inexplicable: "When a new hardware store opens, one of the most sold items is toilet rings."

Basic Concepts
Transaction: two representations of the same data.
  Relational format <Tid, item>    Compact format <Tid, itemset>
  <1, item1>                       <1, {item1, item2}>
  <1, item2>                       <2, {item3}>
  <2, item3>
Item: a single element. Itemset: a set of items.
Support of an itemset I: number of transactions containing I.
Minimum support σ: threshold for support.
Frequent itemset: an itemset with support ≥ σ.
Frequent itemsets represent sets of items that are positively correlated.
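The two transaction formats and the support count can be illustrated with a minimal Python sketch. The function names `to_compact` and `support` are hypothetical helpers introduced here for illustration; the rows are the ones shown on the slide.

```python
from collections import defaultdict

# Relational format: one (tid, item) row per purchased item (rows from the slide).
rows = [(1, "item1"), (1, "item2"), (2, "item3")]

def to_compact(rows):
    """Group (tid, item) rows into the compact format {tid: itemset}."""
    baskets = defaultdict(set)
    for tid, item in rows:
        baskets[tid].add(item)
    return {tid: frozenset(items) for tid, items in baskets.items()}

def support(itemset, compact):
    """Number of transactions that contain every item of `itemset`."""
    target = frozenset(itemset)
    return sum(1 for items in compact.values() if target <= items)

compact = to_compact(rows)
print(compact)
print(support({"item1", "item2"}, compact))
```

Storing each itemset as a `frozenset` makes the containment test a single subset comparison, which is all the support definition requires.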

Frequent Itemsets
Support({dairy}) = 3 (75%)
Support({fruit}) = 3 (75%)
Support({dairy, fruit}) = 2 (50%)
If σ = 60%, then {dairy} and {fruit} are frequent while {dairy, fruit} is not.
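The transaction table behind these counts did not survive in the transcript. The sketch below uses four hypothetical transactions, chosen only because they reproduce the stated supports, and checks each itemset against the 60% threshold. The names `relative_support` and `is_frequent` are assumptions.

```python
# Hypothetical transactions (not from the slides) that reproduce the stated supports.
transactions = [
    {"dairy", "fruit"},
    {"dairy", "fruit"},
    {"dairy"},
    {"fruit"},
]
MIN_SUPPORT = 0.60  # the 60% threshold from the slide

def relative_support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def is_frequent(itemset, transactions, min_support=MIN_SUPPORT):
    """An itemset is frequent when its support reaches the threshold."""
    return relative_support(itemset, transactions) >= min_support

for candidate in ({"dairy"}, {"fruit"}, {"dairy", "fruit"}):
    print(sorted(candidate), relative_support(candidate, transactions),
          is_frequent(candidate, transactions))
```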

Association Rules: Measures
Let A and B be a partition of I: A → B [s, c]
A and B are itemsets.
s = support of A → B = support(A ∪ B)
c = confidence of A → B = support(A ∪ B) / support(A)
Measures for rules: minimum support σ, minimum confidence γ
The rule holds if s ≥ σ and c ≥ γ.

Association Rules: Meaning
A → B [s, c]
Support: denotes the frequency of the rule within the transactions. A high value means that the rule involves a large part of the database.
support(A → B [s, c]) = p(A ∪ B)
Confidence: denotes the percentage of transactions containing A that also contain B. It is an estimate of the conditional probability.
confidence(A → B [s, c]) = p(B | A) = p(A & B) / p(A)
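As a rough sketch of these two measures, the Python below computes support and confidence as exact fractions and checks a rule against a minimum support and a minimum confidence (called `sigma` and `gamma` here). The customer data is hypothetical, loosely following the three-way calling example, and all function names are assumptions.

```python
from fractions import Fraction

# Hypothetical customers; each set lists the services one customer has purchased.
customers = [
    {"three-way calling", "call-waiting"},
    {"three-way calling", "call-waiting", "voicemail"},
    {"three-way calling"},
    {"voicemail"},
]

def support(itemset, transactions):
    """Fraction of transactions containing the whole itemset."""
    itemset = frozenset(itemset)
    return Fraction(sum(1 for t in transactions if itemset <= t), len(transactions))

def confidence(a, b, transactions):
    """support(A ∪ B) / support(A), an estimate of p(B | A).

    Undefined when A never occurs; this sketch then raises ZeroDivisionError.
    """
    a, b = frozenset(a), frozenset(b)
    return support(a | b, transactions) / support(a, transactions)

def rule_holds(a, b, transactions, sigma, gamma):
    """The rule A -> B holds if its support >= sigma and its confidence >= gamma."""
    s = support(frozenset(a) | frozenset(b), transactions)
    c = confidence(a, b, transactions)
    return s >= sigma and c >= gamma

A, B = {"three-way calling"}, {"call-waiting"}
print(support(A | B, customers), confidence(A, B, customers))
```

Using `Fraction` avoids floating-point rounding when a measure sits exactly on a threshold.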

Association Rules - Example
Min. support 50%, min. confidence 50%
For rule A → C:
support = support({A, C}) = 50%
confidence = support({A, C}) / support({A}) = 66.6%
The Apriori principle: any subset of a frequent itemset must be frequent.

Problem Statement
The database consists of a set of transactions. Each transaction consists of a transaction ID and a set of items bought in that transaction (as in a market basket).
An association rule is an implication of the form X → Y, which says that customers who buy item X are also likely to buy item Y.
In practice we are only interested in relationships between high-volume items (aka frequent items).
Confidence: X → Y holds with confidence C% if C% of the transactions that contain X also contain Y.
Support: X → Y has support S% if S% of the transactions contain X ∪ Y.
Observe that the support level for X is ≥ that for X ∪ Y, and that their inverse ratio is the confidence of X → Y:
confidence(X → Y) = support(X ∪ Y) / support(X)

Algorithms: Apriori
A level-wise, candidate-generation-and-test approach (Agrawal & Srikant 1994). Min_sup = 2.
Database D:
  TID  Items
  10   a, c, d
  20   b, c, e
  30   a, b, c, e
  40   b, e
Scan D, counting 1-candidates:
  Itemset  Sup
  a        2
  b        3
  c        3
  d        1
  e        3
Freq 1-itemsets:
  Itemset  Sup
  a        2
  b        3
  c        3
  e        3
2-candidates: ab, ac, ae, bc, be, ce
Scan D, counting 2-candidates:
  Itemset  Sup
  ab       1
  ac       2
  ae       1
  bc       2
  be       3
  ce       2
Freq 2-itemsets:
  Itemset  Sup
  ac       2
  bc       2
  be       3
  ce       2
3-candidates: bce
Scan D, freq 3-itemsets:
  Itemset  Sup
  bce      2
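The level-wise loop on this slide can be sketched in a few lines of Python. This is a minimal in-memory illustration, not the implementation used in the course or in any DBMS; the function name `apriori` and the dictionary `D` are assumptions, with `D` holding the four transactions from the slide.

```python
from itertools import combinations

# Database D from the slide: TID -> items.
D = {10: {"a", "c", "d"}, 20: {"b", "c", "e"}, 30: {"a", "b", "c", "e"}, 40: {"b", "e"}}

def apriori(transactions, min_sup):
    """Return {itemset: support count} for every itemset with count >= min_sup."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1: every single item is a candidate.
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while candidates:
        # Scan D: count the support of each candidate of size k.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_sup}
        frequent.update(level)
        # Join: unions of two frequent k-itemsets that have exactly k + 1 items.
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune: keep a candidate only if all of its (k - 1)-subsets are frequent.
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent

result = apriori(D.values(), min_sup=2)
for itemset, count in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print("".join(sorted(itemset)), count)
```

With Min_sup = 2 the prune step leaves bce as the only 3-candidate, matching the slide, and the loop stops when no 4-candidate can be formed.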

Performance Challenges of Frequent Sets (aka Frequent Pattern) Mining
Data structures: hash tables and prefix trees.
Challenges:
  Multiple scans of the transaction database
  Huge number of candidates
  Tedious workload of support counting for candidates
Improving Apriori: many algorithms have been proposed. General ideas:
  Reduce the number of transaction database scans
  Shrink the number of candidates
  Facilitate support counting of candidates
Frequent patterns without candidate generation [Han, Pei, Yin 2000].

Apriori Summary
Scanning the database and counting occurrences.
Pruning the itemsets below the minimum support level. [Particularly after the first step, we might want to prune the database D as well.]
Combining frequent sets of size n into larger candidate sets of size n + 1 [or even larger].
Monotonicity condition: the support level of a set is never larger than that of any of its subsets.
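Two points of this summary lend themselves to a short sketch: pruning the database after the first pass, and the monotonicity condition. The Python below is one possible reading, assuming that "pruning D" means dropping items that are infrequent on their own. The sample database and the names `prune_database`, `count` and `is_monotone` are assumptions.

```python
from collections import Counter
from itertools import combinations

# Hypothetical sample database: four transactions, each a set of items.
D = [{"a", "c", "d"}, {"b", "c", "e"}, {"a", "b", "c", "e"}, {"b", "e"}]

def prune_database(transactions, min_sup):
    """Drop items that are infrequent on their own, then drop emptied transactions."""
    counts = Counter(item for t in transactions for item in t)
    keep = {item for item, n in counts.items() if n >= min_sup}
    pruned = (t & keep for t in transactions)
    return [t for t in pruned if t]

def count(itemset, transactions):
    """Support count: number of transactions containing the whole itemset."""
    return sum(1 for t in transactions if set(itemset) <= t)

def is_monotone(itemset, transactions):
    """Check support(itemset) <= support(subset) for every non-empty proper subset."""
    whole = count(itemset, transactions)
    return all(
        whole <= count(sub, transactions)
        for r in range(1, len(itemset))
        for sub in combinations(itemset, r)
    )

pruned = prune_database(D, min_sup=2)
print(pruned)
print(is_monotone({"b", "c", "e"}, D))
```

Pruning of this kind cannot change the support of any frequent itemset, because a frequent itemset contains only frequent items.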

Apriori in DB2
S. Sarawagi, S. Thomas, R. Agrawal: "Integrating Association Rule Mining with Databases: Alternatives and Implications", Data Mining and Knowledge Discovery Journal, 4(2/3), July 2000.
Very difficult to integrate Apriori into a DBMS: the only approach that works is cache-mining. Similar conclusions apply to other KDD algorithms.
Lackluster commercial success of DBMS vendors with predictive analytics, in stark contrast with their success in descriptive analytics, i.e. rollups and data cubes.

Extracting the Rules
For rule A → C:
Support for the rule: support for its set of items = support({A, C}) = 50%
Confidence: support for the rule over support for its left side = support({A, C}) / support({A}) = 66.6%
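The transaction table behind the 50% and 66.6% figures is not in the transcript. The sketch below uses four hypothetical transactions chosen to reproduce those numbers, and enumerates every rule that can be extracted from one frequent itemset. The function name `rules_from_itemset` is an assumption.

```python
from itertools import combinations

# Hypothetical transactions (not in the transcript) that reproduce the slide's numbers.
transactions = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]

def rules_from_itemset(itemset, transactions, min_conf):
    """List (lhs, rhs, support, confidence) for each rule lhs -> rhs over `itemset`.

    Assumes `itemset` occurs in at least one transaction, so no count is zero.
    """
    itemset = frozenset(itemset)
    n = len(transactions)

    def count(s):
        return sum(1 for t in transactions if s <= t)

    sup_all = count(itemset)
    rules = []
    for r in range(1, len(itemset)):
        for combo in combinations(sorted(itemset), r):
            lhs = frozenset(combo)
            # Confidence = support of the whole itemset over support of the left side.
            conf = sup_all / count(lhs)
            if conf >= min_conf:
                rules.append((lhs, itemset - lhs, sup_all / n, conf))
    return rules

rules = rules_from_itemset({"A", "C"}, transactions, min_conf=0.5)
for lhs, rhs, sup, conf in rules:
    print(sorted(lhs), "->", sorted(rhs), f"support={sup:.1%} confidence={conf:.1%}")
```

On this data the reverse rule C → A has the same support but a confidence of 100%, which shows that confidence depends on which side of the rule an item is placed.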

Rule Implications
Lemma: If X → YZ, then XY → Z and XZ → Y.
These properties can be used to limit the number of rules tested.
Example: for frequent itemset ABC, if AB → C does not hold, then neither do A → BC or B → AC. Then we can test BC → A, and if this holds we need to test C → AB.
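One way to exploit the lemma is to test rules with the largest left sides first and to skip any left side that has a failed superset. The sketch below is an illustration under that assumption rather than the procedure from the course notes. The support counts and the names `confident_rules` and `support_count` are hypothetical.

```python
from itertools import combinations

def confident_rules(itemset, support_count, min_conf):
    """Return (rules, tested) for one frequent itemset.

    support_count maps each non-empty subset of `itemset` (as a frozenset)
    to its support count. A left side is tested only if every left side
    one item larger passed, which is the contrapositive of the lemma.
    """
    itemset = frozenset(itemset)
    total = support_count[itemset]
    rules, tested = [], 0
    passing = {itemset}  # lets every left side of size k - 1 be tested
    for size in range(len(itemset) - 1, 0, -1):
        next_passing = set()
        for combo in combinations(sorted(itemset), size):
            lhs = frozenset(combo)
            # Skip lhs if adding one item gives a left side that already failed.
            if any(lhs | {x} not in passing for x in itemset - lhs):
                continue
            tested += 1
            conf = total / support_count[lhs]
            if conf >= min_conf:
                rules.append((lhs, itemset - lhs, conf))
                next_passing.add(lhs)
        passing = next_passing
    return rules, tested

# Hypothetical support counts for the itemset {b, c, e} and its subsets.
support_count = {
    frozenset("bce"): 2,
    frozenset("bc"): 2, frozenset("be"): 3, frozenset("ce"): 2,
    frozenset("b"): 3, frozenset("c"): 3, frozenset("e"): 3,
}
rules, tested = confident_rules("bce", support_count, min_conf=0.8)
for lhs, rhs, conf in rules:
    print("".join(sorted(lhs)), "->", "".join(sorted(rhs)), round(conf, 2))
print("rules tested:", tested)
```

Here the rule be → c fails, so b → ce and e → bc are never tested, and only four of the six possible rules are evaluated.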

Apriori in Datalog