Mirek Riedewald
Department of Computer Science, Cornell University
Efficient Processing of Massive Data Streams for Mining and Monitoring
Acknowledgements
• Al Demers
• Abhinandan Das
• Alin Dobra
• Sasha Evfimievski
• Johannes Gehrke
• KD-D initiative (Art Becker et al.)
Introduction
• Data streams versus databases
• Infinite stream, continuous queries
• Limited resources
• Network monitoring
• High arrival rates, approximation [CGJSS02]
• Stock trading
• Complex computation [ZS02]
• Retail, E-business, Intelligence, Medical Surveillance
• Identify relevant information on-the-fly, archive for data mining
• Exact results, error guarantees
Information Spheres
• Local Information Sphere
• Within each organization
• Continuous processing of distributed data streams
• Online evaluation of thousands of triggers
• Storage/archival of important data
• Global Information Sphere
• Between organizations
• Share data in a privacy-preserving way
Local Information Sphere
Distributed data stream event processing and online data mining
• Technical challenges
• Blocking operators, unbounded state
• Graceful degradation under increasing load
• Integration with archive
• Processing of physically distributed streams
Event Matching, Correlation
• Join of data streams
• Equi-join, text similarity, geographical proximity, …
• Problem: unbounded state, computation
Stream with schema (Brand, Mpix, Price): Canon, Fuji, Kodak
Stream with schema (Mpix, Price): (> 2.0, < 250), (> 4.0, < 400), (= 3.0, < 200)
Window Joins
• Restrict join to window of most recent records (tuples)
• Landmark window
• Sliding window based on time or number of records
• Problem definition
• Window based on time: size w
• Synchronous record arrival
• Equi-join
Abstract Model
• Data streams R(A,…), S(A,…)
• Compute equi-join on A
• Match all r and s of streams R, S such that r.A = s.A
• Sliding window of size w
• New output as the window slides over R and S: (r0,s2), (r1,s2), (r2,s2); then (r3,s1), (r1,s3), (r2,s3); then no new output
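The abstract model can be sketched in a few lines of Python. This is a minimal illustration, not the implementation behind the talk: it assumes synchronous arrival (one record per stream per time step) and that a record arriving at time i stays in the window for times i, …, i + w - 1. The function name `window_join` is hypothetical. The sample input uses the streams R = 1,1,1,3 and S = 2,3,1,1 with w = 3 from the flow-model slides.

```python
from collections import deque

def window_join(R, S, w):
    """Sliding-window equi-join of two synchronous streams.

    R[t] and S[t] are the join-attribute values arriving at time t.
    Returns pairs (i, j), meaning record r_i joins record s_j.
    """
    win_r, win_s = deque(), deque()  # entries are (arrival_time, value)
    out = []
    for t, (r, s) in enumerate(zip(R, S)):
        # Expire records that have been in the window for w time steps.
        for win in (win_r, win_s):
            while win and win[0][0] <= t - w:
                win.popleft()
        # The new r probes the S window, the new s probes the R window.
        out.extend((t, j) for j, v in win_s if v == r)
        out.extend((i, t) for i, v in win_r if v == s)
        # Records arriving at the same time can also match each other.
        if r == s:
            out.append((t, t))
        win_r.append((t, r))
        win_s.append((t, s))
    return out

print(window_join([1, 1, 1, 3], [2, 3, 1, 1], w=3))
# [(0, 2), (1, 2), (2, 2), (3, 1), (1, 3), (2, 3)]
```

Under these assumptions the pairs come out in the order listed on the slide: three results when s2 arrives, three more at the next step.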
Limited Resources
• Focus on limited memory M < 2w
• State of the art: random load shedding [KNV03]
• Random sample of streams
• Desired approach: semantic load shedding
• Goal: graceful degradation
• Approximation
• Set-valued result: Error measure?
Set-Approximation Error
• What is a good error measure?
• Information Retrieval, Statistics, Data Mining
• Matching coefficient
• Dice coefficient
• Jaccard coefficient
• Cosine coefficient
• Overlap coefficient
• Earth Mover’s Distance (EMD) [RTG98]
• Match And Compare (MAC) [IP99]
• Join: subset of output result
• EMD, Overlap coefficient trivially 0 or 1
• Others (except MAC) reduce to MAX-subset error measure
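The reduction to MAX-subset can be checked numerically. The sketch below uses the textbook set-based definitions of the matching, Dice, Jaccard, cosine, and overlap coefficients. These definitions are assumed here (the slides do not spell them out), and the function name `coefficients` is hypothetical. When the approximate result is a subset of the exact one, every coefficient except overlap grows with the subset size, while overlap is stuck at 1.

```python
import math

def coefficients(exact, approx):
    """Set-similarity coefficients between an exact, non-empty join result
    and an approximate one (textbook definitions, assumed here)."""
    inter = len(exact & approx)
    union = len(exact | approx)
    smaller = min(len(exact), len(approx))
    return {
        "matching": inter,
        "dice": 2 * inter / (len(exact) + len(approx)),
        "jaccard": inter / union,
        "cosine": inter / math.sqrt(len(exact) * len(approx)) if approx else 0.0,
        "overlap": inter / smaller if smaller else 0.0,
    }

exact = set(range(10))
for k in (2, 5, 8):
    subset = set(range(k))  # approximate result: a subset of the exact result
    print(k, coefficients(exact, subset))
```

For a subset Y of the exact result X, the Jaccard coefficient is |Y|/|X| and the cosine coefficient is sqrt(|Y|/|X|), so maximizing any of them amounts to maximizing |Y|.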
Optimization Problem
Select records to be kept in memory such that the result size is maximized subject to memory constraints
• Lightweight online technique
• Adaptivity in presence of memory fluctuations
Optimal Offline Algorithm
• What is the best result that can be achieved?
• Optimal sampling strategy for MAX-subset
• Benchmark for evaluating any online algorithm
• Same optimization problem, but knows the future
• Finite subsets of input streams
• Formulate as linear flow problem
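Generation of Flow Model
• Example: R = 1,1,1,3; S = 2,3,1,1; M = 2, w = 3
• Fixed memory allocation
• Arcs: capacity 0..1, linear cost
• Arc types: keep in memory, replace
[Figure: flow network for the example; legend shows costs 3 and -3]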
Correspondence to Windows R=1,1,1,3 S=2,3,1,1
Complexity
• Integer solution exists
• Optimal solution found in O(n^2 m log n)
• N: input size of a single stream
• #nodes: n < 2wN + N + 2
• #arcs: m < 2n + M + 1
• Reasonable costs for benchmarking
• Approx. 1 GB memory (w = 800, M = 800)
• Approx. 1 h computation time
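The node and arc bounds are easy to evaluate for a given configuration. The helper below simply plugs numbers into the two inequalities from the slide. The function name `flow_graph_bounds` is hypothetical, and the stream length N in the usage line is an assumed value, since the slide does not state the N behind the 1 GB figure.

```python
def flow_graph_bounds(w, N, M):
    """Upper bounds on the flow-graph size, from the slide's inequalities:
    n < 2wN + N + 2 (nodes) and m < 2n + M + 1 (arcs)."""
    n_bound = 2 * w * N + N + 2
    m_bound = 2 * n_bound + M + 1
    return n_bound, m_bound

# Small example from the flow-model slides: w = 3, M = 2, N = 4 records per stream.
print(flow_graph_bounds(w=3, N=4, M=2))

# Benchmark setting w = 800, M = 800; N = 1000 is an assumed stream length.
print(flow_graph_bounds(w=800, N=1000, M=800))
```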
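Optimal Flow
• Example: R = 1,1,1,3; S = 2,3,1,1; M = 2, w = 3
• Fixed memory allocation
• Arcs: capacity 0..1, linear cost
• Arc types: keep in memory, replace
[Figure: optimal flow for the example; legend shows costs 3 and -3]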
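Easy to Extend
• Example: R = 1,1,1,3; S = 2,3,1,1; M = 2, w = 3
• Variable memory allocation
• Arcs: capacity 0..1, linear cost
• Arc types: keep in memory, replace
[Figure: flow network with variable memory allocation; legend shows costs 3 and -3]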
Online Heuristics
• Maximize expected output
• PROB: sort tuples by join partner arrival probability
• LIFE: sort tuples by product of partner arrival probability and remaining lifetime
• Maintain stream statistics
• Histograms [DGIM02, TGIK02], wavelets [GKMS01], quantiles [GKMS02, GK01]
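The two priorities can be illustrated with a small eviction routine. This is a sketch under stated assumptions, not the talk's implementation: the partner arrival probabilities are supplied as a plain dictionary (in practice they would come from the stream statistics listed above), remaining lifetime is taken to be w minus the record's age, and the names `prob_priority`, `life_priority`, and `shed` are hypothetical.

```python
def prob_priority(record, now, w, partner_prob):
    """PROB: probability that a join partner for this value arrives."""
    arrival, value = record
    return partner_prob.get(value, 0.0)

def life_priority(record, now, w, partner_prob):
    """LIFE: partner arrival probability times remaining lifetime."""
    arrival, value = record
    remaining = w - (now - arrival)
    return partner_prob.get(value, 0.0) * remaining

def shed(memory, capacity, now, w, partner_prob, priority):
    """Drop expired records, then keep the `capacity` records with the
    highest priority. Records are (arrival_time, value) pairs."""
    live = [rec for rec in memory if now - rec[0] < w]
    live.sort(key=lambda rec: priority(rec, now, w, partner_prob), reverse=True)
    return live[:capacity]

memory = [(0, "a"), (4, "b"), (5, "c")]
partner_prob = {"a": 0.5, "b": 0.4, "c": 0.1}  # assumed statistics
print(shed(memory, 1, now=5, w=10, partner_prob=partner_prob, priority=prob_priority))
print(shed(memory, 1, now=5, w=10, partner_prob=partner_prob, priority=life_priority))
```

In this example PROB keeps the record with value "a" (highest partner probability), while LIFE prefers "b", whose slightly lower probability is outweighed by its longer remaining lifetime.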
Approximation Quality
Effect of Skew
Summary
• Information sphere architecture
• Optimal offline algorithm and fast, efficient online heuristics for sliding window joins
• Open problems
• Other set error measures, resource models
• Other joins: compress records
• Complex queries
• Distributed processing
• Integration with other techniques into local information sphere
Related Work
• Aurora (Brown, MIT), STREAM (Stanford), Telegraph (Berkeley), NiagaraCQ (Wisconsin, OGI)
• Memory requirements [ABBMW02, TM02]
• Aggregation
• Alon, Bar-Yossef, Datar, Dobra, Garofalakis, Gehrke, Gibbons, Gilbert, Indyk, Korn, Kotidis, Koudas, Matias, Motwani, Muthukrishnan, Rastogi, Srivastava, Strauss, Szegedy
Other Results [DGR03]
• Integration with archive
• Load smoothing, not shedding
• Novel “error” measure: archive access cost
• Static join for sensor networks
• Maximize result size subject to constraints on energy consumption
• Polynomial dynamic programming solution
• Fast 2-approximation algorithms
• NP-hardness proof for join of 3 or more streams
Other Results (cont.) [DGGR02]
• Computation of aggregates over streams for multiple joins
• Small pseudo-random sketch synopses (randomized linear projections)
• Explicit, tunable error guarantees
• Sketch partitioning to boost accuracy (intelligently partition join attribute space)
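A randomized linear projection of this kind can be sketched as follows. The code keeps k counters per stream, each a sum of frequencies weighted by pseudo-random ±1 signs, and estimates the size of a two-way equi-join from the products of matching counters. It is an illustration under assumptions, not the method of [DGGR02]: the signs come from Python's seeded `random.Random` rather than a four-wise independent family, the estimate is a plain mean rather than a median of means, sketch partitioning is omitted, and the names `JoinSketch` and `estimate_join_size` are hypothetical.

```python
import random

class JoinSketch:
    """k atomic sketches X_i = sum over values v of f(v) * sign_i(v)."""

    def __init__(self, k, seed):
        self.k = k
        self.seed = seed
        self.x = [0] * k

    def _sign(self, i, value):
        # Pseudo-random +1/-1 fixed by (seed, i, value); an illustrative
        # stand-in for a four-wise independent sign family.
        rng = random.Random(f"{self.seed}:{i}:{value}")
        return 1 if rng.random() < 0.5 else -1

    def add(self, value, count=1):
        # Linear update: a negative count models a deletion.
        for i in range(self.k):
            self.x[i] += count * self._sign(i, value)

def estimate_join_size(a, b):
    """Mean of X_i(a) * X_i(b); both sketches must share k and seed."""
    assert a.k == b.k and a.seed == b.seed
    return sum(xa * xb for xa, xb in zip(a.x, b.x)) / a.k

r, s = JoinSketch(k=64, seed=1), JoinSketch(k=64, seed=1)
for v in [7, 7, 7]:
    r.add(v)
for v in [7, 7, 7, 7, 7]:
    s.add(v)
print(estimate_join_size(r, s))  # 15.0: one shared value, 3 * 5 matches
```

Because each counter is linear in the frequency vector, the synopsis is maintained with one pass over the stream and supports deletions. With several distinct values the estimate is approximate, and its variance shrinks as k grows.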
Thanks! Questions?