Dynamically Maintaining Frequent Items Over A Data Stream

1 Dynamically Maintaining Frequent Items Over A Data Stream
C. Jin, W. Qian, C. Sha, J. X. Yu, and A. Zhou, in Proc. of the 12th ACM International Conference on Information and Knowledge Management, 2003.
Adviser: Jia-Ling Koh
Speaker: Shu-Ning Shin
Date:

2 Introduction
This paper proposes a new approach that uses a small amount of memory to maintain a list of the most frequent items, those whose frequency exceeds a user-specified threshold, in a dynamic environment where items can be both inserted and deleted.

3 Problem Definition (1)
A transaction either inserts or deletes an item k at time point i, denoted ti = insert(k) or ti = delete(k).
Items: integers in the range [1..M].
nk: the net occurrence of item k (insertions minus deletions).
N: the sum of the net occurrences of all items.
fk = nk / N: the frequency of item k.

4 Problem Definition (2)
Three user-specified parameters:
a support parameter s
an error parameter ε
a probability parameter ρ, such that ρ is near 1
Guarantees:
all items whose true frequency exceeds s are output
no item whose true frequency is less than s – ε is output
estimated frequencies exceed the true frequencies by at most ε, with high probability ρ

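5 Algorithm (1)
S[m][h]: a hash table of m × h counters, with h hash functions that each map an integer from [0..M-1] to [0..m-1].
Hash function: Hi(k) = ((ai*k + bi) mod P) mod m, for i = 1..h
ai, bi: random numbers
P: a large prime
An item k has a set of associated counters: S[H1(k)][1], S[H2(k)][2], …, S[Hh(k)][h]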

6 Algorithm (2)
Algorithm 1: hCount maintains the hash table.
Algorithm 2: eFreq checks the items and outputs those whose frequency is above a user-specified threshold s, along with their estimated frequencies.
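One way to read Algorithm 2 is as a filter over estimated frequencies. The sketch below is illustrative only: the function name `e_freq`, the `estimate` callable (which returns an item's estimated occurrence), and the rule of comparing the estimated frequency directly against s are assumptions, not details taken from the slides.

```python
def e_freq(estimate, items, total, s):
    """Return {item: estimated frequency} for items estimated above s.

    estimate: callable giving the estimated occurrence of an item (assumed)
    items:    iterable of candidate items, e.g. range(1, M + 1)
    total:    N, the sum of net occurrences of all items
    s:        the support threshold
    """
    result = {}
    if total <= 0:
        # No items are present, so no frequency is defined.
        return result
    for k in items:
        freq = estimate(k) / total
        if freq > s:
            result[k] = freq
    return result


# Usage with made-up estimated occurrences for three items.
estimates = {1: 7, 2: 3, 3: 0}
frequent = e_freq(lambda k: estimates.get(k, 0), [1, 2, 3], total=30, s=0.2)
print(frequent)
```

Because the estimates never fall below the true values, every item whose true frequency exceeds s passes this filter.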

7 Algorithm (3)
hCount handles "insert" and "delete" transactions separately.
It uses h hash functions.
The minimum value among the associated counters of k is its estimated occurrence.
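The update and estimation steps can be sketched as follows. This is a minimal illustration, not the authors' code: the class name `HCount` and its methods are hypothetical, rows are 0-based (row i corresponds to hash function H(i+1) on the slides), and the sketch assumes that an insert adds 1 to each associated counter while a delete subtracts 1.

```python
import random


class HCount:
    """Sketch of the hCount table: h rows of m counters each."""

    def __init__(self, m, h, prime, params=None, seed=None):
        self.m, self.h, self.prime = m, h, prime
        rng = random.Random(seed)
        # One (a_i, b_i) pair per hash function; random unless supplied.
        self.params = params or [
            (rng.randrange(1, prime), rng.randrange(prime)) for _ in range(h)
        ]
        # counters[i][j] plays the role of S[j][i + 1] on the slides.
        self.counters = [[0] * m for _ in range(h)]

    def bucket(self, i, k):
        """Position of item k in row i: ((a*k + b) mod P) mod m."""
        a, b = self.params[i]
        return ((a * k + b) % self.prime) % self.m

    def update(self, k, op):
        """Apply one transaction; op is "insert" or "delete"."""
        if op not in ("insert", "delete"):
            raise ValueError("op must be 'insert' or 'delete'")
        delta = 1 if op == "insert" else -1  # assumed: delete decrements
        for i in range(self.h):
            self.counters[i][self.bucket(i, k)] += delta

    def estimate(self, k):
        """Estimated occurrence: minimum of the associated counters."""
        return min(self.counters[i][self.bucket(i, k)] for i in range(self.h))


# Usage with the parameters from the example slides (m = 5, h = 4, P = 31).
sketch = HCount(5, 4, 31, params=[(7, 13), (22, 6), (24, 11), (14, 27)])
sketch.update(2, "insert")
print(sketch.estimate(2))  # 1
```

Taking the minimum is a natural choice here because collisions with other present items only add to a counter, so the smallest associated counter carries the least error.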

8 Example (1)
Items: [1..16]
Hash table: S[m][h], m = 5, h = 4, P = 31
H1: (a1, b1) = (7, 13)
H2: (a2, b2) = (22, 6)
H3: (a3, b3) = (24, 11)
H4: (a4, b4) = (14, 27)
Initially, all counters of S[m][h] are zero.

9 Example (2)
A data stream of 38 transactions.
ti indicates the i-th transaction; k indicates the item handled by the corresponding transaction.

10 Example (3)
t = 1, k = 2, call hCount(2, insert)
H1(2) = ((7*2+13) mod 31) mod 5 = 2, S[2][1] = S[2][1] + 1
H2(2) = ((22*2+6) mod 31) mod 5 = 4, S[4][2] = S[4][2] + 1
H3(2) = ((24*2+11) mod 31) mod 5 = 3, S[3][3] = S[3][3] + 1
H4(2) = ((14*2+27) mod 31) mod 5 = 4, S[4][4] = S[4][4] + 1
The state at t1: S[2][1], S[4][2], S[3][3], and S[4][4] are 1; all other counters are 0.
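The hash arithmetic above can be checked mechanically. The helper name `associated_buckets` is hypothetical; the (a, b) pairs, P = 31, and m = 5 are the values given on the example slides.

```python
def associated_buckets(k, params, prime, m):
    """Return [H1(k), ..., Hh(k)] with Hi(k) = ((ai*k + bi) mod P) mod m."""
    return [((a * k + b) % prime) % m for a, b in params]


# (a_i, b_i) pairs for H1..H4 from the example.
PARAMS = [(7, 13), (22, 6), (24, 11), (14, 27)]

print(associated_buckets(2, PARAMS, prime=31, m=5))  # [2, 4, 3, 4]
```

The i-th entry of the result is the row index of the counter that hash function Hi touches for item k.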

11 Example (4)
t = 6, k = -6 (the negative sign marks a deletion), call hCount(6, delete)
H1(6) = ((7*6+13) mod 31) mod 5 = 4, S[4][1] = S[4][1] - 1
H2(6) = ((22*6+6) mod 31) mod 5 = 4, S[4][2] = S[4][2] - 1
H3(6) = ((24*6+11) mod 31) mod 5 = 0, S[0][3] = S[0][3] - 1
H4(6) = ((14*6+27) mod 31) mod 5 = 3, S[3][4] = S[3][4] - 1
[Table: state of S at t6]

12 Example (5)
Final state at t38: the estimated occurrence of item 6 is 2.
This is the minimum value of its associated counters: 2, 8, 2, and 2.

13 Example (6)
Estimated values and true values. We observe that:
the estimated values are all greater than or equal to the true values.
the gap between the true value and the estimated value is very small.
N = 30
True frequency = 7/30 ≈ 23%

14 Proposition 1 - Choose m and h (1)
For any item k:
<e1, e2, …, eh>: the errors of the h associated counters of k.
<e1 + nk, e2 + nk, …, eh + nk>: the values of the associated counters of k.
Approximately [M/m] items are mapped to each counter.

15 Proposition 1 - Choose m and h (2)
The expected value of each associated counter is N/m.
The expected value of each error is therefore no more than N/m.
Let the random variable Y denote this error: E[Y] ≤ N/m, Y > 0.
From Markov's inequality: P[ |Y| – λE[|Y|] > 0 ] ≤ 1/λ, for Y > 0, λ > 0
Since E[Y] ≤ N/m: P[ Y – λN/m > 0 ] ≤ 1/λ
Ymin, the smallest of the h errors, exceeds λN/m only if all h errors do, so with h independent hash functions:
P[ Ymin – λN/m > 0 ] ≤ 1/λ^h
P[ Ymin – λN/m < 0 ] ≥ 1 – 1/λ^h   (1)

16 Proposition 1 - Choose m and h (3)
ρ: the probability that all M items satisfy Eq. (1).
ρ = (1 – 1/λ^h)^M ≈ exp(-M/λ^h)   (2)
Ymin is the error part: Ymin < εN = λN/m, so ε = λ/m   (3)
V = m·h: the size of the hash table, which follows from (2) and (3).
Example: With 47K counters, hCount can support a data stream in the range [1..2^20], with ε = 0.01 and ρ = 0.95.
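Solving the approximation in (2) for h and the relation in (3) for m gives a rough sizing rule. The sketch below is only an illustration under stated assumptions: the function name `size_hcount` is hypothetical, the slides do not say which λ is used (the default λ = e here is an assumption), and both values are rounded up to integers. It is not the paper's exact sizing rule.

```python
import math


def size_hcount(M, eps, rho, lam=math.e):
    """Return (m, h, V) derived from the slide's relations (2) and (3).

    (2): rho ~ exp(-M / lam**h)  =>  lam**h = M / ln(1 / rho)
    (3): eps = lam / m           =>  m = lam / eps
    """
    if not (0 < rho < 1) or eps <= 0 or lam <= 1 or M < 1:
        raise ValueError("need 0 < rho < 1, eps > 0, lam > 1, M >= 1")
    h = math.ceil(math.log(M / math.log(1 / rho)) / math.log(lam))
    m = math.ceil(lam / eps)
    return m, h, m * h


# Usage with illustrative (made-up) parameters.
m, h, V = size_hcount(M=1000, eps=0.1, rho=0.9, lam=2)
print(m, h, V)
```

A larger λ shrinks h but grows m, so V = m·h depends on the choice of λ as well as on M, ε, and ρ.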

17 Error Reduction – hCount*
Every counter contains an error.
Given a range [1..M], extend the range to [1..M+Δ].
No item k' in [M+1..M+Δ] appears in the stream, so its estimated occurrence is pure error.
The error factor τ is the average of the estimated occurrences of all such k'.
hCount* uses τ to reduce the error in the estimated occurrence.
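The error factor can be sketched as a plain average over the Δ extra items. The function name `error_factor` and the `estimate` callable (returning an item's estimated occurrence) are hypothetical; how hCount* then applies τ to an estimate is not reproduced here.

```python
def error_factor(estimate, M, delta):
    """Average estimated occurrence of the items in [M+1 .. M+delta].

    These items never occur in the stream, so each estimate is pure error.
    """
    if delta <= 0:
        raise ValueError("delta must be positive")
    extra_items = range(M + 1, M + delta + 1)
    return sum(estimate(k) for k in extra_items) / delta


# Usage with made-up estimates for three extra items beyond M = 16.
fake_estimates = {17: 1, 18: 3, 19: 2}
tau = error_factor(lambda k: fake_estimates.get(k, 0), M=16, delta=3)
print(tau)  # 2.0
```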

18 Experiment – hCount vs. hCount* (1)
The synthetic data set follows a Zipf distribution with [ ]. 2,740 4-byte counters are sufficient.

19 Experiment – hCount vs. hCount* (2)

20 Experiment – hCount vs. groupTest (1)

21 Experiment – hCount vs. groupTest (2)

22 Experiment – real data (1)

23 Experiment – real data (2)

