Still More Stream-Mining: Frequent Itemsets, Elephants and Troops, Exponentially Decaying Windows


1 Still More Stream-Mining: Frequent Itemsets, Elephants and Troops, Exponentially Decaying Windows

2 Counting Items

- Problem: given a stream of baskets, which items appear more than s times in the window?
- Possible solution: think of the stream of baskets as one binary stream per item.
  - 1 = item present; 0 = not present.
  - Use DGIM to estimate the count of 1's for each item.
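The reduction above can be sketched as follows (a minimal illustration with made-up item names; the DGIM counting step itself is elided, since each 0/1 stream would simply be fed to its own DGIM instance):

```python
def to_binary_streams(baskets, items):
    """Reduce a stream of baskets to one 0/1 stream per item:
    1 = item present in that basket, 0 = not present."""
    streams = {x: [] for x in items}
    for basket in baskets:
        for x in items:
            streams[x].append(1 if x in basket else 0)
    return streams

baskets = [{"milk", "bread"}, {"bread"}, {"milk", "beer"}]
streams = to_binary_streams(baskets, ["milk", "bread", "beer"])
# streams["milk"] is [1, 0, 1]; each such stream could then be handed to a
# DGIM counter to estimate its number of 1's in the sliding window
```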

3 Extensions

- In principle, you could count frequent pairs or even larger sets the same way.
  - One stream per itemset.
- Drawbacks:
  1. Only approximate.
  2. The number of itemsets is far too large.

4 Approaches

1. "Elephants and troops": a heuristic way to converge on unusually strongly connected itemsets.
2. Exponentially decaying windows: a heuristic for selecting likely frequent itemsets.

5 Elephants and Troops

- When Sergey Brin wasn't worrying about Google, he tried the following experiment.
- Goal: find unusually correlated sets of words.
  - "High correlation" = frequency of occurrence of the set >> product of the frequencies of its members.

6 Experimental Setup

- The data was an early Google crawl of the Stanford Web.
- Each night, the data would be streamed to a process that counted a preselected collection of itemsets.
  - If {a, b, c} is selected, count {a, b, c}, {a}, {b}, and {c}.
  - "Correlation" = n^2 * #abc / (#a * #b * #c), where n = number of pages.
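The correlation score for a triple is simple arithmetic; a sketch (the counts here are invented for illustration):

```python
def correlation(n, count_abc, count_a, count_b, count_c):
    """Score for a triple {a, b, c} as on the slide:
    corr = n^2 * #abc / (#a * #b * #c), where n is the number of pages.
    A set that co-occurs far more often than independence would predict
    scores much greater than 1."""
    return (n * n * count_abc) / (count_a * count_b * count_c)

score = correlation(n=100, count_abc=10, count_a=20, count_b=20, count_c=25)
# score is 10.0: the triple appears 10x more often than independence predicts
```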

7 After Each Night's Processing...

1. Find the most correlated sets counted.
2. Construct a new collection of itemsets to count the next night:
   - All the most correlated sets ("winners").
   - Pairs of a word in some winner and a random word.
   - Winners combined in various ways.
   - Some random pairs.

8 After a Week...

- The pair {"elephants", "troops"} came up as the big winner.
- Why? It turned out that Stanford students were playing a Punic-War simulation game internationally, in which moves were sent via Web pages.

9 Mining Streams vs. Mining DB's (New Topic)

- Unlike mining databases, mining streams doesn't have a fixed answer.
- We're really mining from the statistical point of view, e.g., "Which itemsets are frequent in the underlying model that generates the stream?"

10 Stationarity

- Two different assumptions make a big difference:
  1. Is the model stationary? I.e., does the same statistical process generate the stream at all times?
  2. Or does the frequency of generating given items or itemsets change over time?

11 Some Options for Frequent Itemsets

We could:
1. Run periodic experiments, like E&T.
   - As in SON, an itemset is a candidate if it is found frequent on any "day."
   - Good for stationary statistics.
2. Frame the problem as finding all frequent itemsets in an "exponentially decaying window."
   - Good for nonstationary statistics.

12 Exponentially Decaying Windows

- If the stream is a_1, a_2, ... and we are taking the sum of the stream, take the answer at time t to be: Sum_{i=1..t} a_i * e^(-c(t-i)).
- c is a constant, presumably tiny.
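A convenient property of this sum, which the later slides rely on, is that it can be maintained incrementally: each new element only requires decaying the old sum once and adding the new value. A minimal sketch:

```python
import math

def edw_sums(stream, c):
    """Running exponentially decaying sum: after element a_t the value is
    sum over i <= t of a_i * e^(-c * (t - i)).  Each arrival needs only
    one multiply (decay the old sum by e^-c) and one add."""
    s = 0.0
    decay = math.exp(-c)
    sums = []
    for a in stream:
        s = s * decay + a
        sums.append(s)
    return sums
```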

13 Example: Counting Items

- If each a_i is an "item," we can compute the characteristic function of each possible item x as an E.D.W.
- That is: Sum_{i=1..t} delta_i * e^(-c(t-i)), where delta_i = 1 if a_i = x, and 0 otherwise.
  - Call this sum the "count" of item x.
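In code, this means one decaying count per distinct item ever seen; on each arrival every count decays and the arriving item's count gains 1. A sketch, using the (1 - c) approximation of e^-c that the later slides adopt:

```python
def decayed_item_counts(stream, c):
    """Maintain the decaying "count" of every item seen: on each arrival,
    every existing count is multiplied by (1 - c), approximating e^(-c)
    for tiny c, and the arriving item's count gains 1."""
    counts = {}
    for x in stream:
        for k in counts:
            counts[k] *= (1.0 - c)
        counts[x] = counts.get(x, 0.0) + 1.0
    return counts
```

With c = 0 this degenerates to exact counting; with c > 0 recent occurrences dominate.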

14 Counting Items (2)

- Suppose we want to find those items of weight at least 1/2.
- Important property: the sum over all weights is 1/(1 - e^(-c)), which is very close to 1/[1 - (1 - c)] = 1/c.
- Thus: at most 2/c items can have weight at least 1/2.
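The bound follows because the total weight is fixed, so at most (total weight)/(threshold) items can reach the threshold. A quick numeric check of that geometric-series total (function name is mine):

```python
import math

def max_heavy_items(c, threshold=0.5):
    """The decayed weights of all items sum to 1/(1 - e^(-c)), which for
    tiny c is close to 1/(1 - (1 - c)) = 1/c; at most total/threshold
    items can each reach the threshold."""
    total_weight = 1.0 / (1.0 - math.exp(-c))
    return total_weight / threshold

# for c = 0.01, this is just over 2/c = 200 items of weight >= 1/2
```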

15 Extension to Larger Itemsets*

- Count (some) itemsets in an E.D.W.
- When a basket B comes in:
  1. Multiply all counts by (1 - c); drop any count that falls below 1/2.
  2. If an item in B is uncounted, create a new count for it.
  3. Add 1 to the count of each item in B and of each counted itemset contained in B.
  4. Initiate new counts (next slide).

* Informal proposal of Art Owen

16 Initiation of New Counts

- Start a count for an itemset S ⊆ B if every proper subset of S had a count prior to the arrival of basket B.
- Example: start counting {i, j} iff both i and j were counted prior to seeing B.
- Example: start counting {i, j, k} iff {i, j}, {i, k}, and {j, k} were all counted prior to seeing B.
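The per-basket update on the last two slides can be sketched as one function (my own rendering of Art Owen's informal proposal, not a definitive implementation; the dict maps frozensets to decayed counts, and a newly initiated count starts at 1):

```python
from itertools import combinations

def process_basket(counts, basket, c):
    """One update of the decayed-itemset-count scheme for an arriving basket."""
    counted_before = set(counts)
    # 1. decay all counts, dropping any that fall below 1/2
    for s in list(counts):
        counts[s] *= (1.0 - c)
        if counts[s] < 0.5:
            del counts[s]
    # 2-3. each item of the basket gets a count (created if absent) plus 1,
    # and every already-counted larger itemset contained in the basket gets 1
    for item in basket:
        s = frozenset([item])
        counts[s] = counts.get(s, 0.0) + 1.0
    for s in list(counts):
        if len(s) > 1 and s <= frozenset(basket):
            counts[s] += 1.0
    # 4. start a count for S subset-of B if every proper subset of S was
    # counted before this basket arrived
    for size in range(2, len(basket) + 1):
        for combo in combinations(sorted(basket), size):
            s = frozenset(combo)
            if s in counts:
                continue
            proper = (frozenset(p)
                      for r in range(1, size)
                      for p in combinations(combo, r))
            if all(p in counted_before for p in proper):
                counts[s] = 1.0
    return counts
```

For example, the pair {i, j} only starts being counted on the second basket containing both i and j, once each singleton already has a count.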

17 How Many Counts?

- Counts for single items: fewer than (2/c) times the average number of items in a basket.
- Counts for larger itemsets: hard to bound. But we are conservative about starting counts of large sets.
  - If we counted every set we saw, one basket of 20 items would initiate 1M counts.
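The 1M figure is just the number of nonempty subsets of a 20-item basket; a quick arithmetic check:

```python
from math import comb

# every nonempty subset of a 20-item basket is a potential itemset
n_items = 20
nonempty_subsets = 2 ** n_items - 1   # 1,048,575: roughly the slide's "1M"
assert nonempty_subsets == sum(comb(n_items, k) for k in range(1, n_items + 1))
```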