1
Tracking most frequent items dynamically. Article by G. Cormode and S. Muthukrishnan. Presented by Simon Kamenkovich.
2
Motivation Most DB management systems maintain “hot items” statistics. Hot items are used as simple outliers in data mining, for anomaly detection in network applications, and in financial market data. The proposed algorithm handles both deletions and insertions, with better guarantees than other existing methods.
3
Introduction Given: n integers in the range [1 … m]. An item is hot if it occurs more than n/(k+1) times, so there may be up to k hot items and there may be none.
4
Preliminaries If we’re allowed O(m) space, then a simple heap will process each insert and delete in O(log m) and find all the hot items in O(k log k). Lemma 1: Any algorithm which guarantees to find all and only items with relative frequency greater than 1/(k+1) must use Ω(m) bits of space.
5
Small Tail Property
6
Prior works

Algorithm                   | Type          | Time Per Item      | Space
Misra-Gries                 | Deterministic | O(log k) amortized | O(k)
Frequent                    | Randomized    | O(1) expected      | O(k)
Lossy Counting              | Deterministic | O(log(n/k))        | Ω(k log(n/k))
Charikar et al.             | Randomized    |                    | Ω(k/ε² log n)
Gilbert et al. (quantiles)  | Randomized    |                    | Ω(k² log² n)
7
Base method Theorem 1: Calling DivideAndConquer(1, m, n/(k+1)) will output all and only the hot items, where oracle(l, r) estimates the total frequency of items in the range [l, r]. A total of O(k log(m/k)) calls will be made to the oracle.

DivideAndConquer(l, r, thresh)
  if oracle(l, r) > thresh
    if (l = r) then output(l)
    else
      DivideAndConquer(l, (l+r)/2, thresh)
      DivideAndConquer((l+r)/2 + 1, r, thresh)
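The base method above can be sketched in a few lines of Python. This is a toy version in which an exact Counter stands in for the oracle (the paper replaces it with a small-space approximate oracle); the function name find_hot is my own:

```python
from collections import Counter

def find_hot(items, k, m):
    """Find items occurring more than n/(k+1) times by binary divide and conquer.

    An exact Counter stands in for the frequency oracle here, purely for
    illustration of the recursion structure."""
    counts = Counter(items)
    thresh = len(items) / (k + 1)

    def oracle(l, r):
        # Total frequency of all items in the range [l, r].
        return sum(counts[i] for i in range(l, r + 1))

    hot = []

    def dc(l, r):
        if oracle(l, r) > thresh:
            if l == r:
                hot.append(l)        # a single item above threshold is hot
            else:
                mid = (l + r) // 2
                dc(l, mid)
                dc(mid + 1, r)

    dc(1, m)
    return hot
```

With an exact oracle the recursion only descends into ranges that can still contain a hot item, which is where the O(k log(m/k)) bound on oracle calls comes from.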
8
Oracle design
N. Alon et al. – requires O(km) space
Gilbert et al. (Random Subset Sums) – requires O(k² log m log(k/δ)) space
Charikar et al. – requires O(k log m log(k/δ)) space
Cormode (this paper) – requires O(k log m log(k/δ)) space
9
Group Testing Idea: design a number of tests, each of which groups together a subset of the m items, in order to find up to k items which test positive. A majority item can be found, in the insertions-only case, in O(1) time and space by the algorithm of Boyer and Moore. General procedure: for each transaction on item i we determine every subset S(i) it is included in, and increment or decrement the counters associated with those subsets.
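The Boyer–Moore majority algorithm mentioned above fits in a few lines; this is the standard pairing argument, not code from the paper:

```python
def boyer_moore_majority(stream):
    """Return a candidate that must equal the majority item, if one exists.

    Keeps one candidate and one counter: a non-matching item cancels one
    occurrence of the candidate, so a true majority can never be fully
    cancelled. If no majority exists, the result is arbitrary."""
    candidate, count = None, 0
    for x in stream:
        if count == 0:
            candidate, count = x, 1
        elif x == candidate:
            count += 1
        else:
            count -= 1
    return candidate
```

Note that this only works for insertions; supporting deletions is exactly what the bit-counter scheme on the next slides adds.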
10
Deterministic algorithm Each test includes half of the range [1..m], corresponding to the binary representations of the values.

int c[0 … log m]

UpdateCounters(i, transtype, c[0 … log m])
  if (transtype = ins) then diff = 1 else diff = -1
  c[0] = c[0] + diff
  for j = 1 to log m do
    c[j] = c[j] + diff · bit(j, i)
11
Deterministic algorithm (cont.) Theorem 2: The above algorithm finds a majority item, if there is one, in time O(log m) per operation.

FindMajority(c[0 … log m])
  position = 0, t = 1
  for j = 1 to log m do
    if (c[j] > c[0]/2) then position = position + t
    t = 2 · t
  return position
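UpdateCounters and FindMajority together can be sketched as a small Python class; the class name BitMajority is my own, but the logic follows the two pseudocode routines above (one counter per bit position plus a total count):

```python
class BitMajority:
    """Maintain log m + 1 counters so that a majority item, if one exists,
    can be recovered bit by bit. Supports both insertions and deletions."""

    def __init__(self, m):
        self.bits = m.bit_length()        # number of bit positions for [1..m]
        self.c = [0] * (self.bits + 1)    # c[0] = total count

    def update(self, i, insert=True):
        diff = 1 if insert else -1
        self.c[0] += diff
        for j in range(1, self.bits + 1):
            if (i >> (j - 1)) & 1:        # bit(j, i)
                self.c[j] += diff

    def find_majority(self):
        position, t = 0, 1
        for j in range(1, self.bits + 1):
            if self.c[j] > self.c[0] / 2:  # majority must have this bit set
                position += t
            t *= 2
        return position
```

If more than half the current items have bit j set, the majority item (if any) must too, which is why reading each counter against c[0]/2 reconstructs it.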
12
Randomized Algorithm
13
Coupon Collector Problem X – the number of trials required to collect at least one coupon of each type. Epoch i begins after the i-th success and ends with the (i+1)-th success. X_i – the number of trials in the i-th epoch. X_i is geometrically distributed with p_i = p(k−i)/k, where p is the probability that a coupon is good.
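Since each X_i is geometric with success probability p_i = p(k−i)/k, the expectation is E[X] = Σ_{i=0}^{k−1} 1/p_i = (k/p)·H_k. A quick exact check of this formula (the function name is my own):

```python
from fractions import Fraction

def expected_trials(k, p=1):
    """E[X] for the coupon collector: sum over epochs i of 1/p_i,
    where p_i = p*(k - i)/k and p is the chance a coupon is good."""
    return sum(Fraction(k) / (Fraction(p) * (k - i)) for i in range(k))
```

For example, with k = 4 and p = 1 this gives 4·(1 + 1/2 + 1/3 + 1/4) = 25/3 expected trials.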
14
Using hash functions We don’t store the sets explicitly – that would take O(m log k) space. Instead we choose the sets in a pseudo-random fashion using hash functions: fix a prime P > 2k, take a and b uniformly from [0 … P−1], and choose T = log(k/δ) pairs (a, b); each pair defines 2k sets via the hash ((a·i + b) mod P) mod 2k.
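The hash family above can be sketched as follows. This is a toy version with my own helper names; note I draw a from [1 … P−1] (a = 0 would map everything to one bucket), a small deviation from the slide's [0 … P−1]:

```python
import random

def is_prime(n):
    """Trial-division primality test, adequate for the small P needed here."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

def draw_hashes(k, T, rng):
    """Draw T (a, b, P) triples defining the hash ((a*i + b) mod P) mod 2k."""
    P = 2 * k + 1                      # smallest candidate prime > 2k
    while not is_prime(P):
        P += 1
    return [(rng.randrange(1, P), rng.randrange(P), P) for _ in range(T)]

def bucket(i, a, b, P, k):
    """Which of the 2k subsets item i falls into for this (a, b) pair."""
    return ((a * i + b) % P) % (2 * k)
```

Each pair thus partitions [1..m] into 2k pseudo-random subsets, replacing the O(m log k) explicit set storage with two integers per repetition.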
15
Data Structure log(k/δ) groups × 2k subsets per group × log m counters per subset
16
ProcessItem
Initialize c[1 … 2Tk][0 … log m] to 0
Draw a[1 … T], b[1 … T]; n = 0   // n is the current total count

ProcessItem(i, transtype, T, k)
  if (transtype = ins) then n = n + 1 else n = n − 1
  for x = 1 to T do
    index = 2k(x−1) + (((i·a[x] + b[x]) mod P) mod 2k)
    UpdateCounters(i, transtype, c[index])

Space used by the algorithm is O(k log(k/δ) log m).
17
Lemma Lemma 2: The probability of each hot item being in at least one good set is at least 1 − δ. Proof sketch: In each of the T repetitions we put a hot item in one of 2k buckets. If the total frequency of the other items in its bucket is less than n/(k+1), then the hot item is a majority there and we can find it; otherwise we won’t be able to find a majority. The expected frequency of the other items in the bucket is at most n/(2k), so the probability of failure is < 1/2 by Markov’s inequality. Over T = log(k/δ) independent repetitions, the probability of failing in all of them is at most δ/k, so the probability of any of the (at most k) hot items failing is at most δ.
18
GroupTest GroupTest(T, k, b)   // b is the frequency threshold; n is the total count
  for i = 1 to 2Tk do
    if c[i][0] > n·b then
      position = 0; t = 1
      for j = 1 to log m do
        if (c[i][j] > n·b and c[i][0] − c[i][j] > n·b)
           or (c[i][j] ≤ n·b and c[i][0] − c[i][j] ≤ n·b) then
          skip to next i       // no clear majority bit in this group
        if c[i][j] > n·b then position = position + t
        t = 2 · t
      output position
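Putting ProcessItem and GroupTest together gives the whole randomized structure. The sketch below is an assumption-laden toy version (class and method names are mine, hash parameters drawn with Python's random module), not the authors' tuned implementation:

```python
import random

class HotItemsSketch:
    """T groups of 2k buckets, each bucket keeping a total count plus
    log m bit counters, as in the randomized group-testing algorithm."""

    def __init__(self, m, k, T, seed=0):
        self.k, self.T = k, T
        self.bits = m.bit_length()
        self.P = self._prime_above(2 * k)
        rng = random.Random(seed)
        self.ab = [(rng.randrange(1, self.P), rng.randrange(self.P))
                   for _ in range(T)]
        # c[g][0] = bucket total, c[g][1..bits] = bit counters
        self.c = [[0] * (self.bits + 1) for _ in range(2 * T * k)]
        self.n = 0                     # current total count

    @staticmethod
    def _prime_above(x):
        p = x + 1
        while any(p % d == 0 for d in range(2, int(p ** 0.5) + 1)):
            p += 1
        return p

    def process(self, i, insert=True):
        diff = 1 if insert else -1
        self.n += diff
        for x, (a, b) in enumerate(self.ab):
            g = 2 * self.k * x + ((a * i + b) % self.P) % (2 * self.k)
            row = self.c[g]
            row[0] += diff
            for j in range(1, self.bits + 1):
                if (i >> (j - 1)) & 1:
                    row[j] += diff

    def group_test(self, frac):
        thresh = self.n * frac
        hot = set()
        for row in self.c:
            if row[0] <= thresh:       # bucket cannot contain a hot item
                continue
            position, t, ok = 0, 1, True
            for j in range(1, self.bits + 1):
                hi = row[j] > thresh
                lo = row[0] - row[j] > thresh
                if hi == lo:           # both halves heavy, or both light: reject
                    ok = False
                    break
                if hi:
                    position += t
                t *= 2
            if ok:
                hot.add(position)
        return hot
```

Each update touches T buckets and log m counters per bucket, matching the O(log(k/δ) log m) update time; querying scans all 2Tk buckets.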
19
Algorithm properties Theorem 4: With probability at least 1 − δ, calling GroupTest(log(k/δ), k, 1/(k+1)) finds all the hot items using O(k log(k/δ) log m) space. The time for an update is O(log(k/δ) log m), and the time to list all hot items is O(k log(k/δ) log m). Corollary 1: If we have the small tail property, then we will output no items which are not hot. Proof: Under the small tail property, a group exceeds the threshold only if it contains a hot item, and then only that item is output.
20
Algorithm properties (cont) Lemma 4: The output of the algorithm is the same for any reordering of the input data. Corollary 2: The set of counters created with T = log(k/δ) can be used to find hot items with parameter k’ for any k’ < k, with probability of success 1 − δ, by calling GroupTest(log(k/δ), k, 1/(k’+1)). Proof: By Lemma 2, since an item with relative frequency above 1/(k’+1) also has frequency above 1/(k+1), it falls in at least one good set with the required probability.
21
Experiments Definitions: The recall is the proportion of the hot items found by the method out of the total number of hot items. The precision is the proportion of items identified by the algorithm which are truly hot, out of all output items. The GroupTesting algorithm was compared to the Lossy Counting and Frequent algorithms. The authors implemented them so that when an item is deleted, the corresponding counter is decremented if one exists.
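The two metrics defined above are straightforward to compute; a minimal helper (the function name and the empty-set conventions are mine):

```python
def recall_precision(found, truth):
    """Recall and precision of a reported item set against the true hot items.

    found: items reported by the algorithm; truth: the actual hot items.
    By convention both metrics are 1.0 when the corresponding set is empty."""
    found, truth = set(found), set(truth)
    correct = len(found & truth)
    recall = correct / len(truth) if truth else 1.0
    precision = correct / len(found) if found else 1.0
    return recall, precision
```

For example, reporting {1, 2, 3} when the true hot items are {2, 3, 4} gives recall 2/3 and precision 2/3.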
22
Synthetic data (Recall) Zipf parameter for the hot items: 0 – distributed uniformly, 3 – highly skewed
23
Synthetic data (Precision) GroupTesting required more memory and took longer to process each item.
24
Real Data (Recall) Real data was obtained from an AT&T network for part of a day.
25
Real Data (Precision) Real data has no guarantee of having the small tail property, so precision can suffer.
26
Varying frequency at query time Other algorithms are designed around a fixed frequency threshold supplied in advance. The data structure was built for queries at the 0.5% level, but was then tested with queries ranging from 10% down to 0.02%.
27
Conclusions and extensions A new method which can cope with dynamic datasets is proposed. It would be interesting to use the algorithm to compare differences in frequencies between different datasets. Can we find a combinatorial design that achieves the same properties with a deterministic construction for maintaining hot items?
28
FIN