1
Predictive Analytics in SQL and Datalog
UCLA CS240A Course Notes*
2
Rollups, Data Cubes: Descriptive Analytics
In the 90s, DBMSs were extended with Descriptive Analytics: a major technical and commercial success.
From the GROUP BY of SQL-2 aggregates to GROUP BY ROLLUP/DATA CUBE: an obvious step at the language level.
The sort-based support for traditional aggregates extended to supergroups.
There were some technical challenges, but the general DBMS framework of big data residing in secondary storage was not challenged.
While the need to go beyond that and support discovery analytics was immediately realized, no satisfactory solution was found.
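To recall the language step, a minimal sketch in standard SQL (the sales table and its columns are illustrative assumptions, not part of the notes):

    CREATE TABLE sales (region VARCHAR(20), product VARCHAR(20), amount INT);

    -- SQL-2 style aggregate: one result row per (region, product) group.
    SELECT region, product, SUM(amount) AS total
    FROM sales
    GROUP BY region, product;

    -- ROLLUP computes the supergroups in the same query: per-region
    -- subtotals and a grand total, with NULLs marking rolled-up columns.
    SELECT region, product, SUM(amount) AS total
    FROM sales
    GROUP BY ROLLUP (region, product);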
3
Memorable Attempts
Apriori in DB2, Sarawagi et al. [SIGMOD 1998]: only cache-mining works at the required speed!
Imielinski and Mannila [CACM 1996]: define declarative language constructs for data mining; the rest will follow, as future research will invent query-optimization techniques that make these constructs very efficient.
Inductive Databases: a research field that explored this idea for a few years and ... then gave up.
But recently things have started to change.
4
MBA (Market Basket Analysis): applicable to many other contexts
Telecommunications: each customer is a transaction containing the set of the customer's phone calls.
Atmospheric phenomena: each time interval (e.g., a day) is a transaction containing the set of observed events (rain, wind, etc.).
Etc.
5
Association Rules
Express how products/services relate to each other and tend to group together.
"If a customer purchases three-way calling, then they will also purchase call waiting."
Simple to understand.
Actionable information: bundle three-way calling and call waiting in a single package.
6
Useful, trivial, inexplicable
Useful: "On Thursdays, grocery store consumers often purchase diapers and beer together."
Trivial: "Customers who purchase maintenance agreements are very likely to purchase large appliances."
Inexplicable: "When a new hardware store opens, one of the most sold items is toilet rings."
7
Basic Concepts
Transaction: relational format vs. compact format

  Relational format <Tid, item>:    Compact format <Tid, itemset>:
  <1, item1>                        <1, {item1, item2}>
  <1, item2>                        <2, {item3}>
  <2, item3>

Item: a single element. Itemset: a set of items.
Support of an itemset I: # of transactions containing I.
Minimum support σ: threshold for support.
Frequent itemset: an itemset with support ≥ σ.
Frequent itemsets represent sets of items which are positively correlated.
8
Frequent Itemsets
Support({dairy}) = 3 (75%)
Support({fruit}) = 3 (75%)
Support({dairy, fruit}) = 2 (50%)
If σ = 60%, then {dairy} and {fruit} are frequent while {dairy, fruit} is not.
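In the relational format of the previous slide, these supports are simple counting queries. A minimal sketch in SQL: the four baskets below are assumed for illustration (the slide only gives the resulting supports, which these baskets reproduce):

    CREATE TABLE T (tid INT, item VARCHAR(10));
    INSERT INTO T VALUES
      (1, 'dairy'), (1, 'fruit'),
      (2, 'dairy'), (2, 'fruit'),
      (3, 'dairy'),
      (4, 'fruit');

    -- Support of each single item: dairy 3 (75%), fruit 3 (75%).
    SELECT item, COUNT(*) AS support FROM T GROUP BY item;

    -- Support of the itemset {dairy, fruit}: the transactions containing
    -- both items, i.e. 2 (50%), below the sigma = 60% threshold.
    SELECT COUNT(*) AS support
    FROM T t1 JOIN T t2 ON t1.tid = t2.tid
    WHERE t1.item = 'dairy' AND t2.item = 'fruit';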
9
Association Rules: Measures
Let A and B be a partition of an itemset I: A → B [s, c]
A and B are itemsets.
s = support of A → B = support(A ∪ B)
c = confidence of A → B = support(A ∪ B) / support(A)
Measures for rules: minimum support and minimum confidence.
The rule holds if s ≥ minimum support and c ≥ minimum confidence.
10
Association Rules: Meaning
A → B [s, c]
Support: denotes the frequency of the rule within the transactions. A high value means that the rule involves a large part of the database.
support(A → B [s, c]) = p(A ∪ B)
Confidence: denotes the percentage of transactions containing A which also contain B. It is an estimate of the conditional probability:
confidence(A → B [s, c]) = p(B|A) = p(A ∪ B) / p(A).
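Both measures reduce to ratios of transaction counts over the <Tid, item> format. A hedged sketch in SQL, assuming a transaction table T in that format; the single items 'A' and 'B' are placeholders:

    -- s = p(A U B): transactions with both A and B over all transactions.
    -- c = p(B|A):   the same count over the transactions containing A.
    WITH has_a  AS (SELECT DISTINCT tid FROM T WHERE item = 'A'),
         has_ab AS (SELECT DISTINCT t1.tid
                    FROM T t1 JOIN T t2 ON t1.tid = t2.tid
                    WHERE t1.item = 'A' AND t2.item = 'B')
    SELECT (SELECT COUNT(*) FROM has_ab) * 1.0
             / (SELECT COUNT(DISTINCT tid) FROM T) AS s,
           (SELECT COUNT(*) FROM has_ab) * 1.0
             / (SELECT COUNT(*) FROM has_a)        AS c;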
11
Association Rules - Example
Min. support 50%, min. confidence 50%.
For rule A → C:
support = support({A, C}) = 50%
confidence = support({A, C}) / support({A}) = 66.6%
The Apriori principle: any subset of a frequent itemset must be frequent.
12
Problem Statement
The database consists of a set of transactions.
Each transaction consists of a transaction ID and a set of items bought in that transaction (as in a market basket).
An association rule is an implication of the form X → Y, which says that customers who buy item X are also likely to buy item Y.
In practice we are only interested in relationships between high-volume items (aka frequent items).
Confidence: X → Y holds with confidence C% if C% of the transactions that contain X also contain Y.
Support: X → Y has support S% if S% of the transactions contain X ∪ Y.
Observe that the support level for X is ≥ that for X ∪ Y, and that their inverse ratio is the confidence of X → Y:
confidence(X → Y) = support(X ∪ Y) / support(X)
13
Algorithms: Apriori
A level-wise, candidate-generation-and-test approach (Agrawal & Srikant 1994). With min_sup = 2:

Database D:
  TID 10: a, c, d
  TID 20: b, c, e
  TID 30: a, b, c, e
  TID 40: b, e

Scan D to count the 1-candidates: a:2, b:3, c:3, d:1, e:3.
Frequent 1-itemsets: a:2, b:3, c:3, e:3 (d is pruned).
Generate the 2-candidates: ab, ac, ae, bc, be, ce.
Scan D to count them: ab:1, ac:2, ae:1, bc:2, be:3, ce:2.
Frequent 2-itemsets: ac:2, bc:2, be:3, ce:2.
Generate the 3-candidates: bce.
Scan D to count it: bce:2.
Frequent 3-itemsets: bce:2.
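The level-wise passes above can be mimicked in SQL with self-joins over the toy database, one scan per level. A sketch of the counting logic only (real implementations use the specialized data structures discussed next):

    CREATE TABLE D (tid INT, item CHAR(1));
    INSERT INTO D VALUES
      (10,'a'), (10,'c'), (10,'d'),
      (20,'b'), (20,'c'), (20,'e'),
      (30,'a'), (30,'b'), (30,'c'), (30,'e'),
      (40,'b'), (40,'e');

    -- Pass 1 (min_sup = 2): frequent 1-itemsets a:2, b:3, c:3, e:3; d is pruned.
    CREATE TABLE F1 AS
      SELECT item, COUNT(*) AS sup
      FROM D GROUP BY item HAVING COUNT(*) >= 2;

    -- Pass 2: count the 2-candidates built from F1's items and keep the
    -- frequent ones: ac:2, bc:2, be:3, ce:2 (ab:1 and ae:1 fall below min_sup).
    CREATE TABLE F2 AS
      SELECT d1.item AS i1, d2.item AS i2, COUNT(*) AS sup
      FROM D d1 JOIN D d2 ON d1.tid = d2.tid AND d1.item < d2.item
      WHERE d1.item IN (SELECT item FROM F1)
        AND d2.item IN (SELECT item FROM F1)
      GROUP BY d1.item, d2.item HAVING COUNT(*) >= 2;

    -- Pass 3: the only candidate whose 2-subsets are all in F2 is bce,
    -- and it is frequent with support 2.
    SELECT d1.item AS i1, d2.item AS i2, d3.item AS i3, COUNT(*) AS sup
    FROM D d1
      JOIN D d2 ON d1.tid = d2.tid AND d1.item < d2.item
      JOIN D d3 ON d2.tid = d3.tid AND d2.item < d3.item
    WHERE EXISTS (SELECT 1 FROM F2 f WHERE f.i1 = d1.item AND f.i2 = d2.item)
      AND EXISTS (SELECT 1 FROM F2 f WHERE f.i1 = d1.item AND f.i2 = d3.item)
      AND EXISTS (SELECT 1 FROM F2 f WHERE f.i1 = d2.item AND f.i2 = d3.item)
    GROUP BY d1.item, d2.item, d3.item
    HAVING COUNT(*) >= 2;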
14
Performance Challenges of Frequent Sets (aka Frequent Pattern) Mining
Data structures: hash tables and prefix trees.
Multiple scans of the transaction database.
Huge number of candidates.
Tedious workload of support counting for candidates.
Improving Apriori: many algorithms have been proposed. General ideas:
reduce the number of transaction-database scans;
shrink the number of candidates;
facilitate the support counting of candidates.
FP-growth: frequent-pattern mining without candidate generation [Han, Pei, Yin 2000].
15
Apriori Summary
Scanning the database and counting occurrences.
Pruning the itemsets below the minimum support level. [Particularly after the first step, we might want to prune the database D as well.]
Combining frequent sets of size n into candidate larger sets of size n + 1 [or even larger], as sketched below.
Monotonicity condition: the support level of a set is never larger than that of any of its subsets.
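The combine step is itself a join: two frequent n-itemsets sharing their first n - 1 items yield an (n + 1)-candidate, which survives only if every n-subset is frequent (the monotonicity condition). A sketch for n = 2, reusing the F2 table of the earlier SQL trace:

    -- Join F2 with itself on the shared first item; i2 < i2' keeps each
    -- candidate in lexicographic order so it is generated exactly once.
    SELECT f.i1, f.i2, g.i2 AS i3
    FROM F2 f JOIN F2 g ON f.i1 = g.i1 AND f.i2 < g.i2
    -- Monotonicity prune: the remaining 2-subset {i2, i3} must be frequent.
    WHERE EXISTS (SELECT 1 FROM F2 h WHERE h.i1 = f.i2 AND h.i2 = g.i2);
    -- On the toy data this yields the single 3-candidate (b, c, e).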
16
Apriori in DB2
S. Sarawagi, S. Thomas, R. Agrawal: "Integrating Association Rule Mining with Relational Database Systems: Alternatives and Implications", Data Mining and Knowledge Discovery Journal, 4(2/3), July 2000.
It is very difficult to integrate Apriori into a DBMS: the only approach that works is cache-mining. Similar conclusions apply to other KDD algorithms.
Hence the lackluster commercial success of DBMS vendors with predictive analytics, in stark contrast with their success in descriptive analytics, i.e., rollups and data cubes.
17
Extracting the Rules
For rule A → C:
Support for the rule: support for the set of items = support({A, C}) = 50%
Confidence: support for the rule over the support of its left side = support({A, C}) / support({A}) = 66.6%
18
Rule Implications
Lemma: If X → YZ holds, then XY → Z and XZ → Y also hold.
This property can be used to limit the number of rules tested.
Example: for the frequent itemset ABC,
if AB → C does not hold, then neither does A → BC nor B → AC.
Then we can test BC → A and, if it holds, we then need to test C → AB.
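With the supports already computed, each confidence test is a single division, and the lemma prunes whole branches of the search. A sketch over a hypothetical sup table (its string encoding of itemsets is an assumption) holding the supports Apriori found for the itemset {b, c, e} of the earlier trace:

    CREATE TABLE sup (itemset VARCHAR(10), s INT);
    INSERT INTO sup VALUES
      ('b', 3), ('c', 3), ('e', 3),
      ('bc', 2), ('be', 3), ('ce', 2),
      ('bce', 2);

    -- Rules with a single item in the consequent:
    -- bc -> e: 2/2 = 100%, be -> c: 2/3 = 66.6%, ce -> b: 2/2 = 100%.
    SELECT a.itemset AS antecedent, w.s * 1.0 / a.s AS confidence
    FROM sup w, sup a
    WHERE w.itemset = 'bce' AND a.itemset IN ('bc', 'be', 'ce');

    -- By the lemma, if be -> c misses the confidence threshold, then
    -- b -> ce and e -> bc are pruned without ever being tested.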
19
Apriori in Datalog