1
Efficient Processing of Massive Data Streams for Mining and Monitoring
Mirek Riedewald, Department of Computer Science, Cornell University
2
Acknowledgements
- Al Demers
- Abhinandan Das
- Alin Dobra
- Sasha Evfimievski
- Johannes Gehrke
- KD-D initiative (Art Becker et al.)
3
Introduction
- Data streams versus databases
  - Infinite streams, continuous queries
  - Limited resources
- Network monitoring
  - High arrival rates, approximation [CGJSS02]
- Stock trading
  - Complex computation [ZS02]
- Retail, e-business, intelligence, medical surveillance
  - Identify relevant information on-the-fly, archive for data mining
  - Exact results, error guarantees
4
Information Spheres
- Local information sphere
  - Within each organization
  - Continuous processing of distributed data streams
  - Online evaluation of thousands of triggers
  - Storage/archival of important data
- Global information sphere
  - Between organizations
  - Share data in a privacy-preserving way
5
Local Information Sphere
Distributed data stream event processing and online data mining.
- Technical challenges
  - Blocking operators, unbounded state
  - Graceful degradation under increasing load
  - Integration with the archive
  - Processing of physically distributed streams
6
Event Matching, Correlation
- Join of data streams

  Stream:    Brand | Mpix | Price
             Canon | 3.0  | 200

  Triggers:  Mpix  | Price
             > 2.0 | < 250
7
Event Matching, Correlation
- Join of data streams

  Stream:    Brand | Mpix | Price
             Canon | 3.0  | 200
             Fuji  | 3.0  | 100

  Triggers:  Mpix  | Price
             > 2.0 | < 250
             > 4.0 | < 400
8
Event Matching, Correlation
- Join of data streams
- Equi-join, text similarity, geographical proximity, ...
- Problem: unbounded state, unbounded computation

  Stream:    Brand | Mpix | Price
             Canon | 3.0  | 180
             Fuji  | 3.0  | 220
             Kodak | 4.0  | 340

  Triggers:  Mpix  | Price
             > 2.0 | < 250
             > 4.0 | < 400
             = 3.0 | < 200
9
Window Joins
- Restrict the join to a window of the most recent records (tuples)
  - Landmark window
  - Sliding window based on time or number of records
- Problem definition
  - Window based on time: size w
  - Synchronous record arrival
  - Equi-join
10
Abstract Model
- Data streams R(A, ...), S(A, ...)
- Compute the equi-join on A: match all r and s of streams R, S such that r.A = s.A
- Sliding window of size w

  R = 1, 1, 1
  S = 2, 3, 1
  Output: (r0,s2), (r1,s2), (r2,s2)
11
Abstract Model (cont.)
- Data streams R(A, ...), S(A, ...)
- Compute the equi-join on A: match all r and s of streams R, S such that r.A = s.A
- Sliding window of size w

  R = 1, 1, 1, 3
  S = 2, 3, 1, 1
  Output: (r0,s2), (r1,s2), (r2,s2), then (r3,s1), (r1,s3), (r2,s3)
12
Abstract Model (cont.)
- Data streams R(A, ...), S(A, ...)
- Compute the equi-join on A: match all r and s of streams R, S such that r.A = s.A
- Sliding window of size w

  R = 1, 1, 1, 3, 2
  S = 2, 3, 1, 1, 4
  Output: (r0,s2), (r1,s2), (r2,s2), then (r3,s1), (r1,s3), (r2,s3); no new output
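The sliding-window equi-join traced on these slides can be sketched in a few lines of Python. This is a minimal illustration under the slides' assumptions (synchronous arrival, one tuple per stream per time step, time-based window of size w), not the paper's implementation:

```python
from collections import deque

def window_equi_join(r_stream, s_stream, w):
    """Symmetric sliding-window equi-join on two synchronous streams.

    r_stream, s_stream: equal-length lists of join-attribute values.
    w: window size in time steps; a tuple arriving at time t expires
    once the current time exceeds t + w - 1.
    """
    r_win, s_win = deque(), deque()   # entries are (arrival_time, value)
    output = []
    for t, (r, s) in enumerate(zip(r_stream, s_stream)):
        # Expire tuples that have fallen out of the window.
        while r_win and r_win[0][0] <= t - w:
            r_win.popleft()
        while s_win and s_win[0][0] <= t - w:
            s_win.popleft()
        # Probe the opposite window, then insert the new arrivals.
        for ts, sv in s_win:
            if sv == r:
                output.append((f"r{t}", f"s{ts}"))
        for tr, rv in r_win:
            if rv == s:
                output.append((f"r{tr}", f"s{t}"))
        if r == s:
            output.append((f"r{t}", f"s{t}"))
        r_win.append((t, r))
        s_win.append((t, s))
    return output
```

Running it on the slides' example, `window_equi_join([1, 1, 1, 3, 2], [2, 3, 1, 1, 4], 3)`, reproduces exactly the six result pairs above: (r0,s3) is missing because r0 has expired by the time s3 arrives, and the arrivals at t = 4 produce no new output.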
13
Limited Resources
- Focus on limited memory M < 2w
- State of the art: random load shedding [KNV03]
  - Keep a random sample of the streams
- Desired approach: semantic load shedding
- Goal: graceful degradation
- Approximation
  - Set-valued result: what is the right error measure?
14
Set-Approximation Error
- What is a good error measure? Candidates from information retrieval, statistics, and data mining:
  - Matching coefficient
  - Dice coefficient
  - Jaccard coefficient
  - Cosine coefficient
  - Overlap coefficient
  - Earth Mover's Distance (EMD) [RTG98]
  - Match And Compare (MAC) [IP99]
- Join: the approximate result is a subset of the exact output
  - EMD and the overlap coefficient are trivially 0 or 1
  - The others (except MAC) reduce to the MAX-subset error measure
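The collapse of these measures can be made concrete with a small Python sketch (the exact/approximate result pair below is a hypothetical example, not from the slides' traces): for a subset the overlap coefficient is always 1, while Jaccard, Dice, and cosine all grow monotonically with the number of retained result tuples, i.e. they reduce to maximizing output size (MAX-subset).

```python
import math

# Standard set-similarity coefficients.
def jaccard(a, b):  return len(a & b) / len(a | b)
def dice(a, b):     return 2 * len(a & b) / (len(a) + len(b))
def cosine(a, b):   return len(a & b) / math.sqrt(len(a) * len(b))
def overlap(a, b):  return len(a & b) / min(len(a), len(b))

exact  = {("r0", "s2"), ("r1", "s2"), ("r2", "s2"), ("r3", "s1")}
approx = {("r0", "s2"), ("r3", "s1")}   # subset of the exact join result

print(overlap(exact, approx))   # 1.0 -- trivially maximal for any subset
print(jaccard(exact, approx))   # 0.5 -- |approx| / |exact| for subsets
print(dice(exact, approx), cosine(exact, approx))
```

Since `approx` is a subset, its intersection with `exact` is just `approx` itself, so all the non-trivial coefficients depend only on |approx|.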
15
Optimization Problem
Select the records to be kept in memory such that the result size is maximized, subject to the memory constraint.
- Lightweight online technique
- Adaptivity in the presence of memory fluctuations
16
Optimal Offline Algorithm
- What is the best that can possibly be achieved?
  - Optimal sampling strategy for MAX-subset
  - Baseline for evaluating any online algorithm
- Solves the same optimization problem, but knows the future
- Works on finite subsets of the input streams
- Formulated as a linear flow problem
17
Generation of Flow Model
[Figure: flow network for R = 1,1,1,3 and S = 2,3,1,1 with M = 2, w = 3, fixed memory allocation; arcs have capacity 0..1 and linear cost (e.g. 3 / -3); keep-in-memory and replace decisions are modeled as arcs]
18
Correspondence to Windows
[Figure, shown in several animation steps: how paths in the flow network correspond to sliding-window states over R = 1,1,1,3 and S = 2,3,1,1]
22
Complexity
- An integer solution exists
- Optimal solution found in O(n^2 m log n) time
  - N: input size of a single stream
  - Number of nodes: n < 2wN + N + 2
  - Number of arcs: m < 2n + M + 1
- Reasonable cost for benchmarking
  - Approx. 1 GB of memory (w = 800, M = 800)
  - Approx. 1 hour of computation time
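Plugging concrete numbers into these bounds shows how quickly the network grows. The stream length N below is a hypothetical value chosen for illustration; the slide only fixes w = M = 800:

```python
def flow_model_size(N, w, M):
    """Evaluate the slide's upper bounds on the flow network:
    number of nodes n < 2wN + N + 2, number of arcs m < 2n + M + 1."""
    n = 2 * w * N + N + 2
    m = 2 * n + M + 1
    return n, m

# With the slide's w = M = 800 and an assumed stream length N = 10_000:
n, m = flow_model_size(10_000, 800, 800)
print(n, m)   # ~16 million nodes, ~32 million arcs (upper bounds)
```

Even modest stream lengths yield networks with tens of millions of nodes and arcs, which is why the offline optimum is a benchmarking yardstick rather than an online method.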
23
Optimal Flow
[Figure: optimal flow on the network for R = 1,1,1,3 and S = 2,3,1,1 with M = 2, w = 3, fixed memory allocation; arcs with capacity 0..1 and linear cost (e.g. 3 / -3), keep-in-memory and replace arcs highlighted]
24
Easy to Extend
[Figure: the same flow network (R = 1,1,1,3, S = 2,3,1,1, M = 2, w = 3) extended to variable memory allocation between the two windows; arc capacities 0..1, linear cost]
25
Online Heuristics
- Maximize expected output
  - PROB: sort tuples by join-partner arrival probability
  - LIFE: sort tuples by the product of partner arrival probability and remaining lifetime
- Maintain stream statistics
  - Histograms [DGIM02, TGIK02], wavelets [GKMS01], quantiles [GKMS02, GK01]
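The PROB idea can be sketched as a simple eviction rule. This is an illustrative toy, not the paper's algorithm: `prob_evict` is a hypothetical helper, and the per-value partner counts in `freq` are assumed to be maintained separately (e.g. by one of the histogram techniques above):

```python
def prob_evict(window, freq, new_value, M):
    """PROB-style semantic load shedding: when the buffer exceeds M
    tuples, evict the tuple whose join-attribute value is least
    frequent on the partner stream, instead of a random victim.

    window:    list of buffered join-attribute values.
    freq:      dict mapping value -> observed count on the partner stream.
    new_value: join-attribute value of the newly arrived tuple.
    M:         memory budget (number of tuples kept).
    """
    window.append(new_value)
    if len(window) > M:
        # Lowest estimated partner-arrival probability goes first.
        victim = min(window, key=lambda v: freq.get(v, 0))
        window.remove(victim)
    return window
```

For example, with partner counts `{1: 5, 3: 1, 7: 0}` and budget M = 2, inserting value 7 into the buffer `[1, 3]` immediately evicts 7 itself, since it has never had a join partner.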
26
Approximation Quality
27
Effect of Skew
28
Summary
- Information sphere architecture
- Optimal offline algorithm and fast, efficient online heuristics for sliding-window joins
- Open problems
  - Other set error measures and resource models
  - Other joins: compressing records
  - Complex queries
  - Distributed processing
  - Integration with other techniques into the local information sphere
29
Related Work
- Systems: Aurora (Brown, MIT), STREAM (Stanford), Telegraph (Berkeley), NiagaraCQ (Wisconsin, OGI)
- Memory requirements [ABBMW02, TM02]
- Aggregation: Alon, Bar-Yossef, Datar, Dobra, Garofalakis, Gehrke, Gibbons, Gilbert, Indyk, Korn, Kotidis, Koudas, Matias, Motwani, Muthukrishnan, Rastogi, Srivastava, Strauss, Szegedy
30
Other Results [DGR03]
- Integration with the archive
  - Load smoothing, not shedding
  - Novel "error" measure: archive access cost
- Static join for sensor networks
  - Maximize result size subject to constraints on energy consumption
  - Polynomial dynamic-programming solution
  - Fast 2-approximation algorithms
  - NP-hardness proof for joins of 3 or more streams
31
Other Results (cont.) [DGGR02]
- Computation of aggregates over streams with multiple joins
- Small pseudo-random sketch synopses (randomized linear projections)
- Explicit, tunable error guarantees
- Sketch partitioning to boost accuracy (intelligently partition the join-attribute space)
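The "randomized linear projections" behind such sketch synopses can be illustrated with a toy AMS-style join-size estimator. This is a simplified sketch: for brevity it uses fully random +/-1 signs per value instead of 4-wise independent hashing, and the number of sketch copies is an illustrative parameter:

```python
import random
import statistics

def ams_join_size(r_vals, s_vals, copies=400, seed=0):
    """Estimate the equi-join size sum_v f_R(v) * f_S(v) from
    independent random +/-1 projections of the two frequency vectors."""
    rng = random.Random(seed)
    domain = set(r_vals) | set(s_vals)
    estimates = []
    for _ in range(copies):
        xi = {v: rng.choice((-1, 1)) for v in domain}
        x = sum(xi[v] for v in r_vals)   # linear projection of f_R
        y = sum(xi[v] for v in s_vals)   # linear projection of f_S
        estimates.append(x * y)          # E[x * y] = exact join size
    return statistics.mean(estimates)    # averaging reduces variance

# Exact join size for R = [1,1,1,3], S = [2,3,1,1]:
# f_R(1)*f_S(1) + f_R(3)*f_S(3) = 3*2 + 1*1 = 7.
est = ams_join_size([1, 1, 1, 3], [2, 3, 1, 1])
```

Each sketch copy is a single counter that is updated incrementally as tuples arrive, which is what makes this family of synopses attractive for streams; averaging (or median-of-means over) many copies yields the tunable error guarantees mentioned above.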
32
Thanks! Questions?