How to find frequent items continuously in data streams Speaker: 陳弘軒 Adviser: 王家祥.


1 How to find frequent items continuously in data streams Speaker: 陳弘軒 Adviser: 王家祥

2 A naïve approach to finding frequent items
• Method: maintain an array of counters, and increment the corresponding counter by one whenever a new item arrives.
• Problem: the available array size M << n (the number of distinct items), and the approach is unsuitable for continuous queries.
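The naïve method can be sketched as follows. This is only an illustration, assuming items are small integers in the range [0, num_items); the function name `naive_counts` and the parameter `num_items` are hypothetical and not from the slides.

```python
def naive_counts(stream, num_items):
    """Keep one counter per possible item (assumes items are ints in [0, num_items))."""
    counters = [0] * num_items
    for item in stream:
        counters[item] += 1  # increment the counter that belongs to this item
    return counters


counts = naive_counts([1, 2, 2, 2, 3, 2, 1], num_items=4)
print(counts)  # [0, 2, 4, 1]
```

The array needs one slot per distinct item, which is exactly the space problem the slide points out.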

3 Applications
• The statistical properties of sensor monitoring data
• The statistical properties of Internet packets passing through a router
• The statistical properties of keywords submitted to a search engine

4 Basic idea: MJRTY (majority voting) (*)
• Use one counter to find the majority of a group
• Number of comparisons: n-1
• Example: 1222321
• Counter: (element_name, value)
[Figure: counter states while scanning the example; ψ appears as the element_name when value is 0]
*: Tech. Report ICSCA-CMP-32, Robert S. Boyer and J Strother Moore, 1982
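A minimal sketch of the single-counter vote, run on the slide's example 1222321. The function name `mjrty` and the use of `None` for an empty counter are assumptions made for illustration. The returned item is only a candidate: if the group may have no majority, a second counting pass is needed to confirm it.

```python
def mjrty(items):
    """Single-counter majority vote; returns (candidate, value, list of counter states)."""
    name, value = None, 0
    states = [(name, value)]
    for item in items:
        if value == 0:
            name, value = item, 1      # empty counter: adopt the new item
        elif item == name:
            value += 1                 # same item: one more vote
        else:
            value -= 1                 # different item: cancel one vote
            if value == 0:
                name = None            # counter is empty again
        states.append((name, value))
    return name, value, states


candidate, value, states = mjrty([1, 2, 2, 2, 3, 2, 1])
print(candidate, value)  # 2 1
```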

5 Why does MJRTY work?
• Assume a majority item α exists in group G, and delete any two different items from G:
  - If neither of the two items is α, α is naturally still the majority after the deletion.
  - If one of the two items is α, α is still the majority, since both α and its adversary are decremented by one.

6 Apply MJRTY to distributed environment
• Merge two nodes with the same element_name: add the values directly.
  Example: (d, 9) and (d, 3) merge into (d, 12).
• Merge two nodes with different element_names: set value to the absolute value of the difference between the two values, and set element_name to the one with the larger value.
  Example: (d, 9) and (c, 3) merge into (d, 6).
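The two merge rules can be sketched like this, using the node values shown on the slide. The function name `merge_mjrty` and the tuple layout `(element_name, value)` are illustrative assumptions; how a tie between different names should be resolved is not stated on the slide, so the sketch simply keeps the first name with value 0.

```python
def merge_mjrty(a, b):
    """Merge two MJRTY counters, each given as (element_name, value)."""
    (name_a, val_a), (name_b, val_b) = a, b
    if name_a == name_b:
        return name_a, val_a + val_b   # same item: add the values
    if val_a >= val_b:
        return name_a, val_a - val_b   # different items: keep the larger one
    return name_b, val_b - val_a


print(merge_mjrty(("d", 9), ("d", 3)))  # ('d', 12)
print(merge_mjrty(("d", 9), ("c", 3)))  # ('d', 6)
```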

7 Apply MJRTY to data stream (basic)
• Example: number of available counters = 9
• Counters (element_name value) at time = t: d 15, a 8, b 22, d 2, b 10, c 3, d 7, d 9, c 3. The last counter (c 3) is going to be recycled.
• Counters at time = t+1: d 15, a 8, b 22, d 2, b 10, c 3, d 7, d 9, b 5. The recycled counter is used for the element arriving now (b 5).
• Required space ∝ window size
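One way to picture the basic scheme is a fixed-size queue of counters in which the oldest counter is dropped whenever a new one arrives. The sketch below is an assumption-laden illustration: the helper names `push` and `query` are hypothetical, the newest counter is kept at the left end, and a query simply folds all counters together with the MJRTY merge rule.

```python
from collections import deque
from functools import reduce


def merge_mjrty(a, b):
    """MJRTY merge of two (element_name, value) counters."""
    (name_a, val_a), (name_b, val_b) = a, b
    if name_a == name_b:
        return name_a, val_a + val_b
    return (name_a, val_a - val_b) if val_a >= val_b else (name_b, val_b - val_a)


def push(window, name, value):
    """Add the newest counter on the left; a full deque drops its rightmost (oldest) entry."""
    window.appendleft((name, value))


def query(window):
    """Fold every counter in the window into one majority candidate."""
    return reduce(merge_mjrty, window, (None, 0))


# The nine counters from the slide at time t (assumed newest first).
window = deque([("d", 15), ("a", 8), ("b", 22), ("d", 2), ("b", 10),
                ("c", 3), ("d", 7), ("d", 9), ("c", 3)], maxlen=9)
push(window, "b", 5)  # time t+1: the oldest counter (c 3) is recycled
print(len(window), window[0], window[-1])  # 9 ('b', 5) ('d', 9)
```

Because one counter is kept per time unit, the number of counters grows linearly with the window size.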

8 Apply MJRTY to data stream (improved)
• Each counter also manages a time unit: (time unit, element_name, value)
• Counters at time = t: (1, d, 15), (1, a, 8), (2, b, 22), (2, d, 2), (4, b, 10), (8, c, 3), (16, d, 7), (32, d, 9). The remaining counter (c, 3) is going to be recycled. Time axis labels: 0-2-4-6-10-18-34-66.
• The recycled counter is used for the new element: (1, b, 5).
• “Three” counters are now responsible for time units of length 1, so merge: (1, d, 15) and (1, a, 8) become (2, d, 7).
• “Three” counters are now responsible for time units of length 2, so merge: (2, b, 22) and (2, d, 2) become (4, b, 20).
• Counters after merging: (1, b, 5), (2, d, 7), (4, b, 20), (4, b, 10), (8, c, 3), (16, d, 7), (32, d, 9)
• Required space ∝ log(window size)
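The cascade of merges can be sketched as below and checked against the numbers on the slide. Several details are assumptions rather than statements from the slide: the counters are kept newest first, the two oldest counters of a time unit are the ones merged (this reproduces the slide's values), and expiry of counters that fall outside the window is left out. The names `add_counter` and `merge_mjrty` are hypothetical.

```python
def merge_mjrty(a, b):
    """MJRTY merge of two (element_name, value) pairs."""
    (name_a, val_a), (name_b, val_b) = a, b
    if name_a == name_b:
        return name_a, val_a + val_b
    return (name_a, val_a - val_b) if val_a >= val_b else (name_b, val_b - val_a)


def add_counter(counters, name, value):
    """Insert a new length-1 counter, then merge while a time unit has three counters.

    counters: list of (time_unit, element_name, value), newest first.
    """
    result = [(1, name, value)] + list(counters)
    unit = 1
    while True:
        same = [i for i, c in enumerate(result) if c[0] == unit]
        if len(same) < 3:
            break
        i, j = same[-2], same[-1]  # the two oldest counters of this time unit
        merged = merge_mjrty(result[i][1:], result[j][1:])
        result[i:j + 1] = [(2 * unit,) + merged]  # replace both by one counter
        unit *= 2  # the merge may overflow the next time unit
    return result


before = [(1, "d", 15), (1, "a", 8), (2, "b", 22), (2, "d", 2),
          (4, "b", 10), (8, "c", 3), (16, "d", 7), (32, "d", 9)]
after = add_counter(before, "b", 5)
print(after)
```

Since time units double at each level and at most two counters survive per level, the number of counters grows with the logarithm of the window size.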

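9 Extend MJRTY to HI_FRQCY (high-frequency)
• Frequent item: frequency > 1/(n+1)
• Use n counters to get the frequent items
• Example: when n=2, input 11233
• Counters: (element_name, value)
[Figure: states of the two counters while scanning the example; Φ appears as the element_name when value is 0]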
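A minimal sketch of the n-counter extension, run on the slide's example 11233 with n=2. The function name `hi_frqcy` and the dictionary representation (an absent key stands for an empty counter) are assumptions for illustration. As with MJRTY, the surviving counters are candidates; confirming that each one really exceeds the 1/(n+1) threshold takes a second pass over the data.

```python
def hi_frqcy(items, n):
    """Track candidates for items with frequency > 1/(n+1) using at most n counters."""
    counters = {}
    for item in items:
        if item in counters:
            counters[item] += 1            # item already has a counter
        elif len(counters) < n:
            counters[item] = 1             # use a free counter
        else:
            for key in list(counters):     # no free counter: decrement all of them
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]      # counter becomes free again
    return counters


print(hi_frqcy([1, 1, 2, 3, 3], n=2))  # {1: 1, 3: 1}
```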

10 Why does HI_FRQCY work?
• At most n items can have a frequency larger than 1/(n+1), so n counters are enough to record all frequent items.
• If frequent items exist in group G, deleting n+1 different items from G (one for each of the n counters plus the arriving item) does not affect the status of the frequent items.
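Both claims can be written out in a few lines. The notation below (N for the size of G, c_x for the number of occurrences of item x) is introduced here for illustration, and the second line assumes that one deletion step removes n+1 different items: the n items held in the counters plus the arriving item.

```latex
\begin{align*}
&\text{Let } N = |G| \text{ and let } c_x \text{ be the number of occurrences of item } x \text{ in } G.\\
&\text{If } n+1 \text{ items all had } c_x > \tfrac{N}{n+1}, \text{ then } \sum_x c_x > (n+1)\cdot\tfrac{N}{n+1} = N, \text{ a contradiction.}\\
&\text{After deleting } n+1 \text{ different items: } N' = N-(n+1), \quad
  c'_\alpha \ge c_\alpha - 1 > \tfrac{N}{n+1} - 1 = \tfrac{N'}{n+1}.
\end{align*}
```

So a frequent item α loses at most one occurrence per deletion step and stays above the threshold in the reduced group.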

11 Apply HI_FRQCY to distributed and continuous environment
• Merge two nodes:
  - If any counters in the two nodes record the same item, merge them
  - Sort the counters
  - Choose the n largest counters as the result
• Can be applied to distributed systems
• Can be applied to a continuous query environment
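The three merge steps can be sketched as follows. This follows the slide's description literally; the function name `merge_hi_frqcy`, the dictionary representation, and the sample node contents are all hypothetical.

```python
def merge_hi_frqcy(node_a, node_b, n):
    """Merge two nodes' counters ({element_name: value}) and keep the n largest."""
    combined = dict(node_a)
    for name, value in node_b.items():
        # Counters that record the same item are merged by adding their values.
        combined[name] = combined.get(name, 0) + value
    # Sort by value, largest first, and keep n counters.
    largest = sorted(combined.items(), key=lambda kv: kv[1], reverse=True)[:n]
    return dict(largest)


# Hypothetical node contents, chosen only to exercise both steps.
print(merge_hi_frqcy({"d": 9, "c": 3}, {"d": 3, "b": 5}, n=2))  # {'d': 12, 'b': 5}
```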

12 Continuous query
• Characteristics:
  - Data updates continuously
  - Queries tend to target “recent” data
  - A query may ask for some statistic over any period of time within the window size
• A diagram of continuous query (figure labels: now, window size (7)):
  Elements: 1 0 1 1 0 1 0 0 0 1 0 1 0 1 0 1 1 0 1 0 1 1 1 1 1 0 1
  Arrival time: 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46
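As a point of reference, an exact sliding window over the diagram's data can be sketched as below. It stores every element in the window, which is the cost the counter-based methods aim to avoid. The function name `latest_window` is hypothetical, and the sketch assumes that “now” is the latest arrival time (46).

```python
from collections import deque

elements = [1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1,
            0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1]
arrival_times = list(range(20, 47))  # 20 .. 46, one per element


def latest_window(elements, arrival_times, window_size):
    """Return the (arrival_time, element) pairs that are still inside the window."""
    window = deque(maxlen=window_size)  # old entries fall out automatically
    for t, e in zip(arrival_times, elements):
        window.append((t, e))
    return list(window)


window = latest_window(elements, arrival_times, window_size=7)
ones = sum(e for _, e in window)
print(window[0][0], window[-1][0], ones)  # 40 46 6
```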

