Dynamically Maintaining Frequent Items Over A Data Stream

Similar presentations
Answering Approximate Queries over Autonomous Web Databases Xiangfu Meng, Z. M. Ma, and Li Yan College of Information Science and Engineering, Northeastern.

Hashing.
Frequent Itemset Mining Methods. The Apriori algorithm Finding frequent itemsets using candidate generation Seminal algorithm proposed by R. Agrawal and.
Probabilistic Skyline Operator over Sliding Windows Wenjie Zhang University of New South Wales & NICTA, Australia Joint work: Xuemin Lin, Ying Zhang, Wei.
An Improved Data Stream Summary: The Count-Min Sketch and its Applications Graham Cormode, S. Muthukrishnan 2003.
Finding Frequent Items in Data Streams Moses Charikar, Princeton Un., Google Inc.; Kevin Chen, UC Berkeley, Google Inc.; Martin Farach-Colton, Rutgers Un., Google.
Paper by: Christopher Ré, Nilesh Dalvi, Dan Suciu. International Conference on Data Engineering (ICDE), 2007. Presented by: Jitendra Gupta.
Fast Algorithms For Hierarchical Range Histogram Constructions
ABSTRACT We consider the problem of computing information theoretic functions such as entropy on a data stream, using sublinear space. Our first result.
Sampling: Final and Initial Sample Size Determination
A Fast High Utility Itemsets Mining Algorithm Ying Liu,Wei-keng Liao,and Alok Choudhary KDD’05 Advisor : Jia-Ling Koh Speaker : Tsui-Feng Yen.
1 CS 361 Lecture 5 Approximate Quantiles and Histograms 9 Oct 2002 Gurmeet Singh Manku
1 Finding Recent Frequent Itemsets Adaptively over Online Data Streams J. H, Chang and W.S. Lee, in Proc. Of the 9th ACM International Conference on Knowledge.
COMP53311 Data Stream Prepared by Raymond Wong Presented by Raymond Wong
Tracking most frequent items dynamically. Article by G.Cormode and S.Muthukrishnan. Presented by Simon Kamenkovich.
Heavy hitter computation over data stream
Tirgul 10 Rehearsal about Universal Hashing Solving two problems from theoretical exercises: –T2 q. 1 –T3 q. 2.
11.Hash Tables Hsu, Lih-Hsing. Computer Theory Lab. Chapter 11P Directed-address tables Direct addressing is a simple technique that works well.
Tirgul 8 Universal Hashing Remarks on Programming Exercise 1 Solution to question 2 in theoretical homework 2.
What’s Hot and What’s Not: Tracking Most Frequent Items Dynamically G. Cormode and S. Muthukrishnan Rutgers University ACM Principles of Database Systems.
Hash Tables1 Part E Hash Tables  
What’s Hot and What’s Not: Tracking Most Frequent Items Dynamically By Graham Cormode & S. Muthukrishnan Rutgers University, Piscataway NY Presented by.
Mining Long Sequential Patterns in a Noisy Environment Jiong Yang, Wei Wang, Philip S. Yu, Jiawei Han SIGMOD 2002.
ETM 607 – Random Number and Random Variates
1 Machine Learning: Lecture 5 Experimental Evaluation of Learning Algorithms (Based on Chapter 5 of Mitchell T.., Machine Learning, 1997)
Chapter 7 Estimation: Single Population
Go to Index Analysis of Means Farrokh Alemi, Ph.D. Kashif Haqqi M.D.
Approximate Frequency Counts over Data Streams Loo Kin Kong 4 th Oct., 2002.
Approximate Frequency Counts over Data Streams Gurmeet Singh Manku, Rajeev Motwani Standford University VLDB2002.
1 Efficient Computation of Frequent and Top-k Elements in Data Streams.
MINING FREQUENT ITEMSETS IN A STREAM TOON CALDERS, NELE DEXTERS, BART GOETHALS ICDM2007 Date: 5 June 2008 Speaker: Li, Huei-Jyun Advisor: Dr. Koh, Jia-Ling.
False Positive or False Negative: Mining Frequent Itemsets from High Speed Transactional Data Streams Jeffrey Xu Yu, Zhihong Chong, Hongjun Lu, Aoying.
David Luebke 1 10/25/2015 CS 332: Algorithms Skip Lists Hash Tables.
Data Stream Algorithms Ke Yi Hong Kong University of Science and Technology.
MINING COLOSSAL FREQUENT PATTERNS BY CORE PATTERN FUSION FEIDA ZHU, XIFENG YAN, JIAWEI HAN, PHILIP S. YU, HONG CHENG ICDE07 Advisor: Koh JiaLing Speaker:
1 Hashing - Introduction Dictionary = a dynamic set that supports the operations INSERT, DELETE, SEARCH Dictionary = a dynamic set that supports the operations.
CpSc 881: Machine Learning Evaluating Hypotheses.
Consistency An estimator is a consistent estimator of θ, if , i.e., if
Chapter 7 Point Estimation of Parameters. Learning Objectives Explain the general concepts of estimating Explain important properties of point estimators.
1 AC-Close: Efficiently Mining Approximate Closed Itemsets by Core Pattern Recovery Advisor : Dr. Koh Jia-Ling Speaker : Tu Yi-Lang Date : Hong.
1 Efficient Discovery of Frequent Approximate Sequential Patterns Feida Zhu, Xifeng Yan, Jiawei Han, Philip S. Yu ICDM 2007.
Midterm Midterm is Wednesday next week ! The quiz contains 5 problems = 50 min + 0 min more –Master Theorem/ Examples –Quicksort/ Mergesort –Binary Heaps.
Describing Samples Based on Chapter 3 of Gotelli & Ellison (2004) and Chapter 4 of D. Heath (1995). An Introduction to Experimental Design and Statistics.
Week 31 The Likelihood Function - Introduction Recall: a statistical model for some data is a set of distributions, one of which corresponds to the true.
Chapter 8 Estimation ©. Estimator and Estimate estimator estimate An estimator of a population parameter is a random variable that depends on the sample.
Discrete Methods in Mathematical Informatics Kunihiko Sadakane The University of Tokyo
Watermarking Scheme Capable of Resisting Sensitivity Attack
Chapter 4: Basic Estimation Techniques
Frequency Counts over Data Streams
Hash table CSC317 We have elements with key and satellite data
Basic Estimation Techniques
Streaming & sampling.
When to Update the Sequential Patterns of Stream Data?
Context-based Data Compression
False Positive or False Negative: Mining Frequent Itemsets from High Speed Transactional Data Streams Jeffrey Xu Yu , Zhihong Chong(崇志宏) , Hongjun Lu.
Targeted Association Mining in Time-Varying Domains
Lecture 4: CountSketch High Frequencies
2.7 Two-variable inequalities (linear) 3.3 Systems of Inequalities
Yun Chi, Haixun Wang, Philip S. Yu, Richard R. Muntz, ICDM 2004.
Approximate Frequency Counts over Data Streams
CSCI B609: “Foundations of Data Science”
Hashing Sections 10.2 – 10.3 Lecture 26 CS302 Data Structures
Summarizing Itemset Patterns: A Profile-Based Approach
Chapter 8 Estimation.
Presentation transcript:

Dynamically Maintaining Frequent Items Over A Data Stream C. Jin, W. Qian, C. Sha, J. X. Yu, and A. Zhou, in Proc. of the 12th ACM International Conference on Information and Knowledge Management, 2003. Adviser: Jia-Ling Koh Speaker: Shu-Ning Shin Date: 2004.7.22

Introduction This paper proposes a new approach that maintains, within a small memory space, a list of the most frequent items above a user-specified threshold in a dynamic environment where items can be both inserted and deleted.

Problem Definition (1) A transaction at time point i either inserts or deletes an item k, denoted ti = insert(k) or ti = delete(k). Items: integers in the range [1..M]. nk: the net occurrence of item k (insertions minus deletions). N = Σk nk: the sum of the net occurrences of all items. fk = nk / N: the frequency of item k.
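As a sketch of these definitions (names such as `net_occurrences` are mine, not the paper's), the net occurrence nk, the total N, and the frequency fk can be computed from a transaction log as:

```python
from collections import Counter

def net_occurrences(transactions):
    """nk: number of insertions minus number of deletions of each item k."""
    n = Counter()
    for op, k in transactions:
        n[k] += 1 if op == "insert" else -1
    return n

stream = [("insert", 2), ("insert", 2), ("insert", 5), ("delete", 2)]
n = net_occurrences(stream)                 # n[2] = 1, n[5] = 1
N = sum(n.values())                         # N = 2, sum of net occurrences
f = {k: nk / N for k, nk in n.items()}      # frequencies: f[2] = f[5] = 0.5
```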

Problem Definition (2) Three user-specified parameters: a support parameter s, an error parameter ε (ε < s), and a probability parameter ρ such that ρ is near 1. Guarantees: all items whose true frequency exceeds s are output; no item whose true frequency is less than s − ε is output; estimated frequencies exceed the true frequencies by at most ε, with high probability ρ.

Algorithm (1) S[m][h]: a hash table of m × h counters, with h hash functions, each mapping an integer from [0..M−1] to [0..m−1]. Hash function: hi(k) = ((ai · k + bi) mod P) mod m, where ai, bi are random numbers and P is a large prime. An item k has a set of h associated counters: S[hi(k)][i], for i = 1..h.

Algorithm (2) Algorithm 1, hCount, maintains the hash table. Algorithm 2, eFreq, checks and outputs the items whose frequency is above a user-specified threshold s, along with their estimated frequencies.

Algorithm (3) hCount handles both “insert” and “delete” transactions: for each of the h hash functions, an insert increments and a delete decrements the selected counter. The minimum value among k’s associated counters is its estimated occurrence.
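The update and query logic above can be sketched as follows; this is a minimal illustration, with hypothetical names (`HCount`, `estimate`, `efreq`), not the authors' implementation:

```python
import random

class HCount:
    """m x h counters; h hash functions hi(k) = ((ai*k + bi) mod P) mod m."""
    def __init__(self, m, h, P=2147483647, seed=1):
        rng = random.Random(seed)
        self.m, self.h, self.P = m, h, P
        self.coeffs = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(h)]
        self.S = [[0] * h for _ in range(m)]   # S[bucket][function index]

    def _cells(self, k):
        # the h associated counters of item k
        return [(((a * k + b) % self.P) % self.m, i)
                for i, (a, b) in enumerate(self.coeffs)]

    def insert(self, k):                       # hCount(k, insert)
        for r, i in self._cells(k):
            self.S[r][i] += 1

    def delete(self, k):                       # hCount(k, delete)
        for r, i in self._cells(k):
            self.S[r][i] -= 1

    def estimate(self, k):
        # minimum associated counter = estimated net occurrence of k
        return min(self.S[r][i] for r, i in self._cells(k))

    def efreq(self, M, N, s):
        # eFreq: items of [1..M] whose estimated frequency is above s
        return [k for k in range(1, M + 1) if self.estimate(k) > s * N]
```

Because collisions only ever add another item's net occurrence to a counter (nonnegative, assuming deletions never drive an item's net occurrence below zero), `estimate` never underestimates, matching the over-estimation guarantee in the problem definition.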

Example (1) Items: [1..16]. Hash table: S[m][h] with m = 5, h = 4, P = 31. H1: (a1, b1) = (7, 13) H2: (a2, b2) = (22, 6) H3: (a3, b3) = (24, 11) H4: (a4, b4) = (14, 27) Initially, all counters of S[m][h] are initialized to zero.

Example (2) A data stream of 38 transactions. ti indicates the i-th transaction; k indicates the item handled by the corresponding transaction.

Example (3) t = 1, k = 2: call hCount(2, insert). H1(2) = ((7·2 + 13) mod 31) mod 5 = 2, so S[2][1] = S[2][1] + 1 H2(2) = ((22·2 + 6) mod 31) mod 5 = 4, so S[4][2] = S[4][2] + 1 H3(2) = ((24·2 + 11) mod 31) mod 5 = 3, so S[3][3] = S[3][3] + 1 H4(2) = ((14·2 + 27) mod 31) mod 5 = 4, so S[4][4] = S[4][4] + 1 The state at t1 is shown as a 5 × 4 counter table in the slide.
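The arithmetic on this slide is easy to check mechanically:

```python
def h_fn(a, b, k, P=31, m=5):
    # hi(k) = ((ai*k + bi) mod P) mod m
    return ((a * k + b) % P) % m

coeffs = [(7, 13), (22, 6), (24, 11), (14, 27)]  # (ai, bi) from Example (1)
print([h_fn(a, b, 2) for a, b in coeffs])  # → [2, 4, 3, 4], buckets for k = 2
print([h_fn(a, b, 6) for a, b in coeffs])  # → [4, 4, 0, 3], buckets for k = 6
```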

Example (4) t = 6, k = −6: call hCount(6, delete). H1(6) = ((7·6 + 13) mod 31) mod 5 = 4, so S[4][1] = S[4][1] − 1 H2(6) = ((22·6 + 6) mod 31) mod 5 = 4, so S[4][2] = S[4][2] − 1 H3(6) = ((24·6 + 11) mod 31) mod 5 = 0, so S[0][3] = S[0][3] − 1 H4(6) = ((14·6 + 27) mod 31) mod 5 = 3, so S[3][4] = S[3][4] − 1 The resulting state is shown in the slide.

Example (5) Final state at t38: the associated counters of item 6 hold 2, 8, 2, and 2, so the minimum, 2, is its estimated occurrence, which equals its true net occurrence.

Example (6) Estimated values vs. true values: the estimated values are all greater than or equal to the true values, and the gap between the true and estimated values is very small. With N = 30, an item with net occurrence 7 has true frequency 7/30 ≈ 23%.

Proposition 1 - Choose m and h (1) For any item k, its associated counters are S[h1(k)][1], …, S[hh(k)][h]. <e1, e2, …, eh>: the collision errors in each of the h counters for k. <e1 + nk, e2 + nk, …, eh + nk>: the values held by k’s associated counters. Approximately M/m items are mapped to each counter.

Proposition 1 - Choose m and h (2) The expected value of each associated counter is N/m, so the expected value of each error is no more than N/m. Let the random variable Y ≥ 0 denote this error: E[Y] ≤ N/m. By Markov’s inequality, P[Y ≥ λE[Y]] ≤ 1/λ for any λ > 0, so P[Y ≥ λN/m] ≤ 1/λ. Since the estimate takes the minimum over the h counters, P[Ymin ≥ λN/m] ≤ 1/λ^h, i.e. P[Ymin < λN/m] ≥ 1 − 1/λ^h. (1)

Proposition 1 - Choose m and h (3) ρ: the probability that all M items satisfy Eq. (1): ρ = (1 − 1/λ^h)^M ≈ exp(−M/λ^h) (2) Ymin is the error part; requiring Ymin < εN = λN/m gives ε = λ/m (3) V = m·h: the size of the hash table, which follows from (2) and (3). Example: with 47K counters, hCount can support a data stream over the range [1..2^20] with ε = 0.01 and ρ = 0.95.
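As a sketch of how (2) and (3) translate into table sizes: fixing λ = e (a convenient choice; both it and the name `size_hcount` are mine, not the paper's), one can solve (2) for h and (3) for m:

```python
import math

def size_hcount(M, eps, rho, lam=math.e):
    # (2): rho ≈ exp(-M / lam**h)  =>  lam**h >= M / ln(1/rho)
    h = math.ceil(math.log(M / math.log(1 / rho), lam))
    # (3): eps = lam / m  =>  m = ceil(lam / eps)
    m = math.ceil(lam / eps)
    return m, h

m, h = size_hcount(M=2**20, eps=0.01, rho=0.95)   # m = 272, h = 17
```

With these values, (1 − 1/e^17)^(2^20) ≈ 0.957 ≥ ρ and V = m·h = 4,624 counters; the slide's 47K figure corresponds to the paper's own configuration, which need not use this λ.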

Error Reduction – hCount* Every counter contains some error. Given a range [1..M], extend it to [1..M+Δ]. Each dummy item k' in [M+1..M+Δ] never occurs, so its estimated occurrence is pure error. The error factor τ is the average of the estimated occurrences of these Δ dummy items. hCount*: estimated occurrence = (minimum associated counter) − τ.
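A sketch of the hCount* correction (function names are mine): the Δ dummy items never occur, so whatever their counters report is pure collision error, and its average τ is subtracted from every estimate.

```python
def error_factor(estimate, M, delta):
    """tau: average estimated occurrence of the dummy items M+1 .. M+delta."""
    return sum(estimate(k) for k in range(M + 1, M + delta + 1)) / delta

def hcount_star(estimate, k, tau):
    # corrected estimate: minimum associated counter minus the error factor
    return estimate(k) - tau

def est(k):
    # toy stand-in: item 1 truly occurs 3 times; a uniform collision
    # error of 2 in every bucket is assumed for illustration
    return 5 if k == 1 else 2

tau = error_factor(est, M=16, delta=4)      # dummy items 17..20 all read 2
corrected = hcount_star(est, 1, tau)        # 5 - 2 = 3, closer to the truth
```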

Experiment – hCount vs. hCount* (1) Synthetic database: a Zipf distribution over [1..1000000]. 2,740 4-byte counters are sufficient.

Experiment – hCount vs. hCount* (2)

Experiment – hCount vs. groupTest (1)

Experiment – hCount vs. groupTest (2)

Experiment – real data (1)

Experiment – real data (2)