1
Continuous Data Stream Processing MAKE Lab Date: 2006/03/07 Post-Excellence Project Subproject 6
2
Music Virtual Channel
[Figure: components are the clustering engine, music metadata, music collections, Internet, V.C. player, filtering engine, music channel simulator, interface, profile monitor, channel monitor, favorite channel, cluster monitor, cluster coordinator, peer search engine, profile database, MusicXML database, and XML filtering engine; virtual channels are numbered 1, 2, …, N.]
3
Research Directions
Streaming Data: Management, Mining, Filtering
Temporal Query Processing, Spatial Query Processing, Aggregate Query Processing
Frequent Tree Pattern Mining, Frequent Itemset Mining (sliding window)
Sequence Query Matching, Episode Query Matching
Range Search, KNN Search, Top-K Search
Closed Tree Pattern Mining, Frequent Itemset Mining (landmark model)
4
Hash-based synopsis with memory consideration for mining frequent itemsets over data streams
5
Landmark model
6
Lossy Counting
Step 1: Divide the stream into "buckets" (bucket 1, bucket 2, bucket 3, ...)
bucket size = 1/ε
ε = 10% of support s
7
Lossy Counting in Action
The counters start out empty.
At each bucket boundary, decrement all counters by 1.
8
Lossy Counting continued...
At each bucket boundary, decrement all counters by 1, then add the next bucket.
Output: elements with counter values exceeding sN - εN
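The bucket-and-decrement procedure on the Lossy Counting slides can be sketched in a few lines of Python. This is a minimal illustration of the simplified variant described on the slides (one counter per item, decremented at each bucket boundary), not the authors' implementation; the function name `lossy_counting` and the sample stream are assumptions made for the example.

```python
from math import ceil


def lossy_counting(stream, s, eps):
    """Simplified Lossy Counting as described on the slides (illustrative).

    s   -- support threshold as a fraction of the stream length N
    eps -- error parameter; the bucket size is 1/eps
    """
    width = ceil(1 / eps)          # bucket size = 1/eps
    counters = {}                  # starts empty
    n = 0                          # number of elements seen so far (N)
    for item in stream:
        n += 1
        counters[item] = counters.get(item, 0) + 1
        if n % width == 0:         # bucket boundary
            for key in list(counters):
                counters[key] -= 1             # decrement all counters by 1
                if counters[key] == 0:
                    del counters[key]          # drop counters that reach zero
    threshold = (s - eps) * n      # output rule: counter > sN - eps*N
    return {k: c for k, c in counters.items() if c > threshold}


# Hypothetical stream of 10 elements, two buckets of size 5.
sample = ["a"] * 6 + ["b", "b", "c", "d"]
result = lossy_counting(sample, s=0.5, eps=0.2)
print(result)
```

Each counter is decremented at most once per bucket, so a reported count undershoots the true count by at most εN, which is why the output threshold is sN - εN rather than sN.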
9
Drawbacks of Lossy Counting
[Figure: frequency axis from 0 to 1 with ε and s marked.]
Lossy Counting keeps all items with frequency > ε.
Applied to mining frequent itemsets, the space may increase exponentially.
10
hCount
[Figure: an incoming item (9) is hashed by four hash functions, h1(9) mod m, h2(9) mod m, h3(9) mod m, h4(9) mod m, into a table of counters with m columns, one row per hash function.]
For each item, hash the item into buckets, choose the minimum count, and return the item if its minimum count ≥ sN.
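The hashing step on the hCount slide can be illustrated with a small counter table. The sketch below is only an approximation of the idea under stated assumptions: the class name `HashSketch`, the linear hash family `((a*x + b) mod P) mod m`, and the seed are choices made for this example and are not taken from the slides, which only show four functions h1..h4 applied mod m.

```python
import random


class HashSketch:
    """Illustrative hash-based counter table (rows = hash functions)."""

    PRIME = 2_147_483_647  # 2^31 - 1, used by the assumed hash family

    def __init__(self, m, num_hashes=4, seed=0):
        rng = random.Random(seed)
        self.m = m                      # number of counters per row
        self.n = 0                      # items seen so far (N)
        self.coeffs = [
            (rng.randrange(1, self.PRIME), rng.randrange(self.PRIME))
            for _ in range(num_hashes)
        ]
        self.table = [[0] * m for _ in range(num_hashes)]

    def _buckets(self, item):
        # One bucket index per hash function: h_i(item) mod m.
        return [((a * item + b) % self.PRIME) % self.m for a, b in self.coeffs]

    def add(self, item):
        self.n += 1
        for row, col in enumerate(self._buckets(item)):
            self.table[row][col] += 1

    def estimate(self, item):
        # The minimum over the rows never undercounts the true frequency.
        return min(self.table[row][col]
                   for row, col in enumerate(self._buckets(item)))

    def is_frequent(self, item, s):
        # Return rule from the slide: minimum count >= sN.
        return self.estimate(item) >= s * self.n


sketch = HashSketch(m=64)
for x in [9, 9, 9, 3, 9, 7, 9]:   # hypothetical integer item stream
    sketch.add(x)
print(sketch.estimate(9), sketch.is_frequent(9, s=0.5))
```

Because collisions can only inflate a counter, taking the minimum across rows gives the tightest available upper bound on an item's count, and the memory footprint stays fixed at `num_hashes * m` counters regardless of how many distinct items appear.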
11
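Hash-based synopsis
Transaction {1, 2, 3}
Subsets of {1, 2, 3}: {1}, {2}, {3}, {1, 2}, {1, 3}, {2, 3}, {1, 2, 3}
[Figure: each subset is hashed into a bucket that stores Total_Access and N_last_access plus a list of entries with columns Itemset, Surplus_Estimate, and True_Count. The subsets are marked ×, ○, ○, ○, ×, ×, × in the order listed above, and the figure shows N becoming N + 1.]
How to compute the Surplus_Estimate?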
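As a small illustration of why subset enumeration drives the space cost, the following sketch lists the non-empty subsets of a transaction in the same order as the slide. The helper name `nonempty_subsets` is hypothetical; the slides do not show how the subsets are generated.

```python
from itertools import combinations


def nonempty_subsets(transaction):
    """Return all non-empty subsets of a transaction, smallest first."""
    items = sorted(transaction)
    return [
        subset
        for size in range(1, len(items) + 1)
        for subset in combinations(items, size)
    ]


subsets = nonempty_subsets({1, 2, 3})
print(subsets)
```

A transaction with k items has 2^k - 1 non-empty subsets, so hashing every subset into a fixed number of buckets bounds memory where keeping one counter per itemset would not.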
12
Compute the Surplus_Estimate for an Itemset
Two variables:
n: number of different itemsets in the bucket but not in the list
c: sensible counts to be divided between itemsets which are not in the list
If c = [3, 5], n = [3, ?] → Surplus_Estimate = 3, (3, 1, 1)
Decrement Surplus_Estimate until Surplus_Estimate / N_last_access < minSup
13
Determine c and n
[Table residue: bucket fields Total_Access and N_last_access; list columns Itemset, Surplus_Estimate, True_Count; entries for {1} and {2}. Values as extracted: 43 {1}20 5 11 {2} 4.]
Transaction {2, 3, 5}, N = 4, minSup = 0.4
{2} is hashed into the bucket
Boundary of c: 4 - (2 + SE) ≤ c ≤ 4 - 2
Boundary of n: c = 2, n = 2 → (1, 1) → Surplus_Estimate = 1
14
Monitoring Constrained k-Nearest Neighbor over Moving Objects with Different Values
15
Motivation (Cont.)
Example: A user wants to find the k places to buy new shoes where the costs are the lowest.
Cost = Price($) + Traffic Cost($)
Prices: $90, $100, $200, $400; distances: 1, 2, 3, 5
400 + 100*1 = 500
200 + 100*2 = 400
100 + 100*3 = 400
90 + 100*5 = 590
2-NN Query
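The shoe-shopping example can be reproduced with a short script. This is a brute-force sketch for illustration only: it assumes a traffic cost of $100 per unit of distance, as implied by the arithmetic on the slide, and the names `PLACES`, `total_cost`, and `k_lowest_cost` are made up for this example. It does not reflect the pruning framework the presentation goes on to describe.

```python
import heapq

# (price in $, distance) pairs taken from the slide's arithmetic.
PLACES = [(400, 1), (200, 2), (100, 3), (90, 5)]
TRAFFIC_COST_PER_UNIT = 100  # assumption implied by "100*distance"


def total_cost(place):
    """Cost = Price($) + Traffic Cost($)."""
    price, distance = place
    return price + TRAFFIC_COST_PER_UNIT * distance


def k_lowest_cost(places, k):
    """Return the k places with the lowest total cost (ties keep input order)."""
    return heapq.nsmallest(k, places, key=total_cost)


answer = k_lowest_cost(PLACES, k=2)
print([(place, total_cost(place)) for place in answer])
```

The nearest shop (distance 1) and the cheapest shop (price $90) both miss the 2-NN answer, which is the point of the example: neither distance alone nor value alone determines the ranking.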
16
Motivation
Objects with different values in a spatial database.
Find the k places to buy something where the costs are the lowest. Cost = Price($) + Traffic Cost($)
A taxi driver wants to find the k places that yield the most profit. Profit = Gain($) - Traffic Cost($)
A taxi driver wants to find the k places that yield the most profit. Profit = Gain($) / Time
Virtual Channel: age * profile distance; listen hours / profile distance
Market Survey: $consumption (or income, age, …) / profile distance
17
Challenges
Efficiency: search space reduction, query processing enhancement
Effectiveness: previous result reuse
18
Framework
Initialization
Step 1: Find k candidates to restrict the search region.
Step 2: Run Pruning Ring on the remaining candidates to determine the actual answer.
Handling updates (positions, values)
- Incrementally update positions or values for objects and queries
- Computation is necessary only for an affected query q
19
Querying Episodes over Event Stream
20
Motivation
Knowledge Discovery from Telecommunication Network Alarm Databases [ICDE96]
If an alarm of type A occurs, then an alarm of type B occurs within 30 seconds with probability 0.8.
If alarms of types A and B occur within 5 seconds, then an alarm of type C occurs within 60 seconds with probability 0.7.
If an alarm of type A precedes an alarm of type B, and C precedes D, all within 15 seconds, then E will follow within 4 minutes with probability 0.6.
[Figure: episode diagrams for the rules above, with A and B within 5 seconds, and A before B and C before D within 15 seconds.]
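To make the first rule concrete, the sketch below estimates how often an alarm of one type is followed by another within a time window. It is a naive scan written for illustration, not the indexing scheme of the presentation; the function name `rule_confidence` and the event log are invented for this example.

```python
def rule_confidence(events, antecedent, consequent, window):
    """Fraction of `antecedent` alarms followed by `consequent` within `window` seconds.

    events -- list of (timestamp_in_seconds, alarm_type), sorted by timestamp
    """
    total = 0
    hits = 0
    for i, (start, kind) in enumerate(events):
        if kind != antecedent:
            continue
        total += 1
        for later, later_kind in events[i + 1:]:
            if later - start > window:
                break                 # past the time window, stop scanning
            if later_kind == consequent:
                hits += 1
                break                 # count each antecedent at most once
    return hits / total if total else 0.0


# Hypothetical alarm log: (seconds, type).
log = [(0, "A"), (10, "B"), (50, "A"), (100, "B"),
       (120, "A"), (140, "B"), (200, "A"), (205, "C"), (220, "B")]
confidence = rule_confidence(log, "A", "B", window=30)
print(confidence)
```

Rescanning the log for every rule is quadratic in the worst case, which is the kind of cost that motivates shared indexes and partial-result reuse when many episode queries run over the same stream.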
21
Challenges
Efficiency: index impaction, partial result sharing, load shedding
22
Framework
[Figure: episode queries Q1, Q2, Q3 over events A, B, C, D with time bounds 5, 7, and 3, decomposed into partial results p1, p2, p3, p4, p5; each event links to the partial results that contain it.]
Joining events B and C: (B, C) → p1, p5
Q3 is composed of p5 and p4
23
[Figure: data structures for the framework, over the same queries (time bounds 5, 7, 3) and partial results p1 to p5.]
P1: PQueue, M.Q.: Q1, E.I.: -1
P2: PQueue, M.Q.: Q2, E.I.: 2
P3: PQueue, M.Q.: Q2, E.I.: 2
P4: PQueue, M.Q.: Q3, E.I.: 6
P5: PQueue, M.Q.: Q3, E.I.: 6
A: EQueue, TLink
B: E.I.: 5, EQueue, TLink
C: E.I.: 2, EQueue, TLink
D: E.I.: 4, EQueue, TLink
Queue entries: (time), (St, Et)