1
How to find frequent items continuously in data streams
Speaker: 陳弘軒
Adviser: 王家祥
2
A naïve approach to finding frequent items
Method: Maintain an array of counters, one per distinct item, and increment the corresponding counter by one whenever a new item arrives.
Problem: The available array size M is much smaller than n, the number of distinct items (M << n), and the approach is not suited to continuous queries.
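The naïve method can be sketched in a few lines of Python. This is only an illustration of the slide's idea, not the speaker's code; the function name `naive_frequent_items` and the `threshold` parameter are assumptions introduced here.

```python
from collections import Counter

def naive_frequent_items(stream, threshold):
    """Keep one exact counter per distinct item, then report the frequent ones.

    `threshold` is a hypothetical parameter: an item is reported when its
    relative frequency is strictly larger than it.
    """
    counts = Counter()
    total = 0
    for item in stream:
        counts[item] += 1   # increment the counter that belongs to this item
        total += 1
    return {item for item, c in counts.items() if c > threshold * total}

print(naive_frequent_items([1, 2, 2, 2, 3, 2, 1], 0.5))  # {2}
```

The memory use grows with the number of distinct items, which is exactly the problem the slide points out.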
3
Applications
Statistical properties of sensor monitoring data
Statistical properties of Internet packets passing through a router
Statistical properties of the search keywords submitted to a search engine
4
Basic idea: MJRTY (majority voting) (*)
Use one counter to find the majority item of a group.
Number of comparisons: n-1
Example: 1 2 2 2 3 2 1
Counter (element_name, value) after each item, starting from (ψ, 0): (1, 1), (ψ, 0), (2, 1), (2, 2), (2, 1), (2, 2), (2, 1)
*: Tech. Report ICSCA-CMP-32, Robert S. Boyer and J Strother Moore, 1982
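A minimal Python sketch of the one-counter vote, run on the slide's example. The function name `mjrty` and the use of `None` for the empty counter ψ are choices made here for illustration. The returned candidate is only guaranteed to be the majority item when a majority actually exists, so a second counting pass would be needed to confirm it.

```python
def mjrty(items):
    """One-pass majority vote with a single (element_name, value) counter."""
    name, value = None, 0              # None stands for the empty counter
    for item in items:
        if value == 0:
            name, value = item, 1      # empty counter: adopt the new item
        elif item == name:
            value += 1                 # same item: one more vote
        else:
            value -= 1                 # different item: cancel one vote
            if value == 0:
                name = None            # counter is empty again
    return name, value

print(mjrty([1, 2, 2, 2, 3, 2, 1]))  # (2, 1)
```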
5
Why does MJRTY work?
Assume a majority item α exists in group G, and we randomly delete 2 different items from G.
If neither of the two items is α, α is still the majority after they are deleted.
If one of the two items is α, α is still the majority, since the count of α and the count of its adversaries are both decremented by one.
6
Apply MJRTY to a distributed environment
Merge two nodes with the same element_name: add the values directly. Example: (d, 9) and (d, 3) merge into (d, 12).
Merge two nodes with different element_names: set value to the absolute value of the difference between the two values, and set element_name to the one with the larger value. Example: (d, 9) and (c, 3) merge into (d, 6).
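The two merge rules can be written as one small function. This is a sketch under the assumption that each node is a plain `(element_name, value)` tuple; the name `merge_mjrty` and the choice to return an empty counter `(None, 0)` on a tie are not from the slides.

```python
def merge_mjrty(a, b):
    """Merge two MJRTY counters given as (element_name, value) tuples."""
    (name_a, val_a), (name_b, val_b) = a, b
    if name_a == name_b:
        return name_a, val_a + val_b       # same item: add the values
    if val_a == val_b:
        return None, 0                     # assumption: a tie leaves an empty counter
    if val_a > val_b:
        return name_a, val_a - val_b       # keep the larger one, value = |difference|
    return name_b, val_b - val_a

print(merge_mjrty(("d", 9), ("d", 3)))  # ('d', 12)
print(merge_mjrty(("d", 9), ("c", 3)))  # ('d', 6)
```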
7
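Apply MJRTY to data stream (basic)
Each counter holds (element_name, value). When a new time unit begins, the oldest counter is recycled and used for it.
Ex: number of available counters = 9
time = t: (d, 15) (a, 8) (b, 22) (d, 2) (b, 10) (c, 3) (d, 7) (d, 9) (c, 3), where the last counter (c, 3) is going to be recycled
time = t+1: (d, 15) (a, 8) (b, 22) (d, 2) (b, 10) (c, 3) (d, 7) (d, 9) (b, 5), where (b, 5) uses the recycled counter
Time axis in the diagram: 0-2-3-4-5-6-7-8, with "now" at the newest counter
Required space: window size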
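One way to picture the basic scheme is a fixed-size buffer with one MJRTY counter per time unit, where adding a new counter drops (recycles) the oldest one. The sketch below assumes that reading of the slide; the class `BasicWindow`, its methods, and the tie handling in `merge` are hypothetical names and choices introduced for illustration.

```python
from collections import deque

def mjrty(items):
    """One-counter majority vote over the items of a single time unit."""
    name, value = None, 0
    for item in items:
        if value == 0:
            name, value = item, 1
        elif item == name:
            value += 1
        else:
            value -= 1
            if value == 0:
                name = None
    return name, value

def merge(a, b):
    """Merge two (element_name, value) counters as in the distributed case."""
    (name_a, val_a), (name_b, val_b) = a, b
    if name_a == name_b:
        return name_a, val_a + val_b
    if val_a == val_b:
        return None, 0
    return (name_a, val_a - val_b) if val_a > val_b else (name_b, val_b - val_a)

class BasicWindow:
    """One counter per time unit; space grows linearly with the window size."""

    def __init__(self, num_counters):
        # a bounded deque drops its oldest entry when a new one is appended
        self.counters = deque(maxlen=num_counters)

    def add_time_unit(self, items):
        self.counters.append(mjrty(items))

    def query(self, last_units):
        """Majority candidate over the most recent `last_units` time units."""
        start = max(0, len(self.counters) - last_units)
        result = (None, 0)
        for counter in list(self.counters)[start:]:
            result = merge(result, counter)
        return result

w = BasicWindow(3)
for unit in (["d"] * 3, ["a", "a", "b"], ["b"] * 4, ["c", "d", "d"]):
    w.add_time_unit(unit)
print(w.query(2))  # ('b', 3)
```

Because every time unit keeps its own counter, a query can cover any suffix of the window, at the cost of space proportional to the window size.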
8
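Apply MJRTY to data stream (improved)
Each counter holds (manage time unit, element_name, value), where the manage time unit is the length of the period the counter is responsible for.
Counters: (1, d, 15) (1, a, 8) (2, b, 22) (2, d, 2) (4, b, 10) (8, c, 3) (16, d, 7) (32, d, 9), plus a counter (c, 3) that is going to be recycled
Time axis in the diagram: 0-2-4-6-10-18-34-66
Use the recycled counter for the new time unit: (1, b, 5)
"Three" counters are responsible for time units with length 1, so merge: (1, d, 15) and (1, a, 8) become (2, d, 7)
Counters: (2, d, 7) (1, b, 5) (2, b, 22) (2, d, 2) (4, b, 10) (8, c, 3) (16, d, 7) (32, d, 9)
"Three" counters are responsible for time units with length 2, so merge: (2, b, 22) and (2, d, 2) become (4, b, 20)
Counters: (2, d, 7) (1, b, 5) (4, b, 20) (4, b, 10) (8, c, 3) (16, d, 7) (32, d, 9)
Required space: log(window size)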
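The merging cascade on this slide can be sketched as follows. This is an interpretation, not the speaker's implementation: it assumes counters are kept ordered from newest to oldest (the slide lists them by physical slot instead), that the two oldest counters of a given length are the ones merged, and that a counter is recycled once it lies entirely outside the window. The class `LogWindow` and its `window` parameter are hypothetical.

```python
def mjrty(items):
    """One-counter majority vote over the items of a single time unit."""
    name, value = None, 0
    for item in items:
        if value == 0:
            name, value = item, 1
        elif item == name:
            value += 1
        else:
            value -= 1
            if value == 0:
                name = None
    return name, value

def merge(a, b):
    """Merge two (element_name, value) counters."""
    (name_a, val_a), (name_b, val_b) = a, b
    if name_a == name_b:
        return name_a, val_a + val_b
    if val_a == val_b:
        return None, 0
    return (name_a, val_a - val_b) if val_a > val_b else (name_b, val_b - val_a)

class LogWindow:
    """Counters are (unit, element_name, value) tuples, newest first."""

    def __init__(self, window):
        self.window = window
        self.counters = []

    def add_time_unit(self, items):
        self.counters.insert(0, (1,) + mjrty(items))
        unit = 1
        while True:
            same = [i for i, c in enumerate(self.counters) if c[0] == unit]
            if len(same) < 3:
                break
            i, j = same[-2], same[-1]      # the two oldest counters of this length
            merged = (2 * unit,) + merge(self.counters[i][1:], self.counters[j][1:])
            self.counters[i:j + 1] = [merged]
            unit *= 2                      # the merge may create a third counter one level up
        # recycle counters that start beyond the window
        kept, covered = [], 0
        for c in self.counters:
            if covered >= self.window:
                break
            kept.append(c)
            covered += c[0]
        self.counters = kept

lw = LogWindow(window=66)
lw.counters = [(1, "d", 15), (1, "a", 8), (2, "b", 22), (2, "d", 2),
               (4, "b", 10), (8, "c", 3), (16, "d", 7), (32, "d", 9)]
lw.add_time_unit(["b"] * 5)
print(lw.counters)
```

With at most two counters left per length after each cascade, the number of counters grows with the logarithm of the window size, which matches the space bound stated on the slide.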
9
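Extend MJRTY to HI_FRQCY (high-frequency)
Frequent item: frequency > 1/(n+1)
Use n counters to get the frequent items.
Ex: when n=2, stream 1 1 2 3 3
Counters (element_name, value) after each item:
start: (Φ, 0) (Φ, 0)
1: (1, 1) (Φ, 0)
1: (1, 2) (Φ, 0)
2: (1, 2) (2, 1)
3: (1, 1) (Φ, 0)
3: (1, 1) (3, 1)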
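A minimal sketch of the n-counter generalization, checked against the slide's n=2 example. The function name `hi_frqcy` and the dictionary representation are assumptions made here; an empty counter is simply an absent key. As with MJRTY, the surviving counters are candidates, and a second pass would be needed to confirm their true frequencies.

```python
def hi_frqcy(items, n):
    """Keep at most n counters; return the candidate frequent items."""
    counters = {}                          # element_name -> value
    for item in items:
        if item in counters:
            counters[item] += 1            # item already has a counter
        elif len(counters) < n:
            counters[item] = 1             # take an empty counter
        else:
            # all n counters hold other items: cancel one vote from each
            for name in list(counters):
                counters[name] -= 1
                if counters[name] == 0:
                    del counters[name]     # counter becomes empty again
    return counters

print(hi_frqcy([1, 1, 2, 3, 3], 2))  # {1: 1, 3: 1}
```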
10
Why does HI_FRQCY work?
At most n items can have a frequency larger than 1/(n+1), so n counters are enough to record all frequent items.
If frequent items exist in group G, randomly deleting n+1 different items from G does not affect the status of the frequent items.
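The argument can be written out as a short calculation. It assumes that one deletion step removes n+1 pairwise different items, which for n=1 is the 2-item deletion used for MJRTY; the symbols N, c and the primed quantities are notation introduced here.

```latex
% Counting bound: if n+1 items each occurred more than N/(n+1) times,
% their counts would sum to more than N, which is impossible.
% Hence at most n items are frequent.

% Deletion step: let \alpha occur c times among N items, with
c > \frac{N}{n+1}.
% Delete n+1 pairwise different items. Then N' = N - (n+1),
% and \alpha loses at most one occurrence, so c' \ge c - 1 and
c' \;\ge\; c - 1 \;>\; \frac{N}{n+1} - 1 \;=\; \frac{N - (n+1)}{n+1} \;=\; \frac{N'}{n+1}.
% So \alpha is still frequent in the remaining group.
```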
11
Apply HI_FRQCY to distributed and continuous environments
Merge two nodes:
If counters in the two nodes record the same item, merge them by adding their values.
Sort the counters.
Choose the n largest counters as the result.
Can be applied to distributed systems.
Can be applied to continuous query environments.
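The merge steps listed on this slide map directly onto Python's `collections.Counter`. The sketch follows the slide's description literally; the function name `merge_hi_frqcy` and the dictionary representation of a node are assumptions, and the slide does not say how ties between equal values should be broken.

```python
from collections import Counter

def merge_hi_frqcy(a, b, n):
    """Merge two nodes, each a dict of element_name -> value, into n counters."""
    combined = Counter(a)
    combined.update(b)                       # counters for the same item are added
    return dict(combined.most_common(n))     # sort by value and keep the n largest

print(merge_hi_frqcy({"d": 9, "c": 3}, {"d": 3, "b": 5}, 2))  # {'d': 12, 'b': 5}
```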
12
Continuous query
Characteristics:
Data updates continuously.
Queries tend to target "recent" data.
A query may ask for some statistic over any period of time within the window size.
A diagram of continuous query (the diagram marks "now" and a window of size 7):
Elements:      1  0  1  1  0  1  0  0  0  1  0  1  0  1  0  1  1  0  1  0  1  1  1  1  1  0  1
Arrival time: 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46