1
Models and Issues in Data Stream Systems
Rajeev Motwani, Stanford University
(with Brian Babcock, Shivnath Babu, Mayur Datar, and Jennifer Widom)
STREAM Project Members: Arvind Arasu, Gurmeet Manku, Liadan O’Callaghan, Justin Rosenstein, Qi Sun, Rohit Varma
PODS 2002
2
Data Streams
Traditional DBMS: data stored in finite, persistent data sets
New applications: data input as continuous, ordered data streams
  Network monitoring and traffic engineering
  Telecom call records
  Network security
  Financial applications
  Sensor networks
  Manufacturing processes
  Web logs and clickstreams
  Massive data sets
3
Using Traditional Database
[Diagram: the user/application sends queries to the database and receives results; a loader feeds data into the database.]
5
New Approach for Data Streams
[Diagram: the user/application registers a query with the Data Stream Management System (DSMS); the stream query processor, using scratch space (memory and/or disk), returns results.]
6
Sample Applications Network security (e.g., iPolicy, NetForensics/Cisco, Niksun) Network packet streams, user session information Queries: URL filtering, detecting intrusions & DOS attacks & viruses Financial applications (e.g., Traderbot) Streams of trading data, stock tickers, news feeds Queries: arbitrage opportunities, analytics, patterns SEC requirement on closing trades PODS 2002
7
Sample Applications Network management and traffic engineering (e.g., Sprint) Streams of measurements and packet traces Queries: detect anomalies, adjust routing Telecom call data (e.g., AT&T) Streams of call records Queries: fraud, customer call patterns, billing PODS 2002
8
Sample Applications Sensor Networks (e.g. Cornell’s Cougar, UCB’s Telegraph) Large number of cheap, wireless sensors Noisy streams of real-world measurements Abstraction: Massive distributed database Queries: aggregate, correlate, localize, alert Novelty: control rate for battery power Manufacturing processes Sensors in plants and warehouses PODS 2002
9
Sample Applications Web tracking and personalization (e.g., Yahoo, Google, Akamai) Clickstreams, user query streams, log records Queries: monitoring, analysis, personalization Truly massive databases (e.g., Astronomy Archives) Stream the data by once (or over and over) Queries: do the best they can PODS 2002
10
Challenges
Multiple, continuous, rapid, time-varying, ordered streams
Main memory computations
Queries may be continuous (not just one-time)
  Evaluated continuously as stream data arrives
  Answer updated over time
Queries may be complex
  Beyond element-at-a-time processing
  Beyond stream-at-a-time processing
  Beyond relational queries (scientific, data mining)
11
Executive Summary
Data Stream Management Systems (DSMS)
  Highlight issues and motivate research
  Not a tutorial or comprehensive survey
Caveats
  Personal view of emerging field
  Stanford STREAM Project bias
  Cannot cover all projects in detail
12
Meta-Questions
Killer-apps
  Application stream rates exceed DBMS capacity?
  Can DSMS handle high rates anyway?
Motivation
  Need for general-purpose DSMS?
  Not ad-hoc, application-specific systems?
Non-Trivial
  DSMS = merely DBMS with enhanced support for triggers, temporal constructs, data rate management?
22
DBMS versus DSMS
DBMS                                   DSMS
Persistent relations                   Transient streams
One-time queries                       Continuous queries
Random access                          Sequential access
“Unbounded” disk store                 Bounded main memory
Only current state matters             History/arrival-order is critical
Passive repository                     Active stores
Relatively low update rate             Possibly multi-GB arrival rate
No real-time services                  Real-time requirements
Assume precise data                    Data stale/imprecise
Access plan determined by query        Unpredictable/variable data arrival
  processor, physical DB design          and characteristics
23
Making Things Concrete
[Diagram: ALICE and BOB are each connected to a Central Office; the DSMS monitors the call streams.]
Streams:
  Outgoing (call_ID, caller, time, event)
  Incoming (call_ID, callee, time, event)
  event = start or end
24
Query 1 (self-join)
Find all outgoing calls longer than 2 minutes
  SELECT O1.call_ID, O1.caller
  FROM Outgoing O1, Outgoing O2
  WHERE (O2.time - O1.time > 2
    AND O1.call_ID = O2.call_ID
    AND O1.event = start
    AND O2.event = end)
Result requires unbounded storage
Can provide result as data stream
Can output after 2 min, without seeing end
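The last point of Query 1 (output after 2 minutes, without seeing the end event) can be illustrated with a minimal sketch. It assumes events arrive in non-decreasing time order as (call_ID, caller, time, event) tuples with time in minutes; the function name `long_calls` is hypothetical and this is not the STREAM implementation.

```python
def long_calls(events, threshold=2):
    """Yield (call_ID, caller) as soon as a call is known to exceed `threshold`.

    Assumption: events are (call_ID, caller, time, event) tuples in
    non-decreasing time order, with event in {"start", "end"}.
    """
    open_calls = {}  # call_ID -> (caller, start_time), kept in start order
    for call_id, caller, time, event in events:
        # Calls still open whose start is more than `threshold` ago qualify now.
        for cid in list(open_calls):
            who, start = open_calls[cid]
            if time - start > threshold:
                yield cid, who
                del open_calls[cid]
            else:
                break  # every later entry started even more recently
        if event == "start":
            open_calls[call_id] = (caller, time)
        elif event == "end":
            open_calls.pop(call_id, None)  # short call: forget it


events = [
    (1, "alice", 0, "start"),
    (2, "bob", 1, "start"),
    (2, "bob", 2, "end"),
    (3, "carol", 3, "start"),
    (1, "alice", 10, "end"),
    (3, "carol", 11, "end"),
]
print(list(long_calls(events)))
```

In this sketch the state holds only calls that are open and not yet reported, and a result is emitted when the next event shows that the threshold has passed; a real system would also need a timer to emit during quiet periods.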
25
Query 2 (join)
Pair up callers and callees
  SELECT O.caller, I.callee
  FROM Outgoing O, Incoming I
  WHERE O.call_ID = I.call_ID
Can still provide result as data stream
Requires unbounded temporary storage …
… unless streams are near-synchronized
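Why Query 2 needs temporary storage can be seen in a small symmetric hash join sketch. It assumes, for simplicity, one tuple per call_ID on each stream and a single interleaved input of (stream, call_ID, party) tuples; `pair_calls` is a hypothetical name.

```python
def pair_calls(events):
    """Yield (caller, callee) pairs from interleaved Outgoing/Incoming tuples.

    Assumption: events are (stream, call_ID, party) with stream "O" or "I",
    and each call_ID appears at most once per stream.
    """
    pending = {"O": {}, "I": {}}  # unmatched tuples, per stream
    for stream, call_id, party in events:
        other = "I" if stream == "O" else "O"
        if call_id in pending[other]:
            match = pending[other].pop(call_id)
            yield (party, match) if stream == "O" else (match, party)
        else:
            # No partner yet: this tuple must be remembered until one arrives.
            pending[stream][call_id] = party


events = [("O", 1, "alice"), ("O", 2, "carol"), ("I", 2, "dave"), ("I", 1, "bob")]
print(list(pair_calls(events)))
```

The `pending` tables grow with the lag between the two streams, which is why near-synchronized streams keep the state small.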
26
Query 3 (group-by aggregation)
Total connection time for each caller
  SELECT O1.caller, sum(O2.time - O1.time)
  FROM Outgoing O1, Outgoing O2
  WHERE (O1.call_ID = O2.call_ID
    AND O1.event = start
    AND O2.event = end)
  GROUP BY O1.caller
Cannot provide result in (append-only) stream
  Output updates?
  Provide current value on demand?
  Memory?
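The "output updates" option for Query 3 might look like the following sketch, which emits a new (caller, total) pair whenever a call ends. The event layout and the name `connection_time_updates` are assumptions made for illustration.

```python
from collections import defaultdict


def connection_time_updates(events):
    """Yield (caller, running_total) each time one of the caller's calls ends.

    Assumption: events are (call_ID, caller, time, event) tuples with
    event in {"start", "end"}.
    """
    started = {}               # call_ID -> (caller, start_time)
    totals = defaultdict(int)  # caller -> total connection time so far
    for call_id, caller, time, event in events:
        if event == "start":
            started[call_id] = (caller, time)
        elif event == "end" and call_id in started:
            who, t0 = started.pop(call_id)
            totals[who] += time - t0
            yield who, totals[who]  # an update that supersedes earlier values


events = [
    (1, "alice", 0, "start"),
    (1, "alice", 5, "end"),
    (2, "alice", 6, "start"),
    (2, "alice", 9, "end"),
]
print(list(connection_time_updates(events)))
```

Each emitted pair replaces the previous value for that caller, so the output is an update stream rather than an append-only one; memory still grows with the number of distinct callers.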
27
Remainder of Talk Reconsider all aspects of data management and processing in presence of data streams PODS 2002
28
Outline of Remaining Talk
Stream Models and DSMS Architectures Query Processing Runtime and Systems Issues Algorithms Conclusion PODS 2002
29
Data Characteristics Database: stored relations + data streams
Stream characteristics Type of data (schema) Data distribution Flow rate Stability of distribution and flow Ordering and other constraints Timestamps Synchronization of multiple streams Distributed streams PODS 2002
30
Data Model
Append-only: call records
Updates: stock tickers
Deletes: transactional data
Meta-data: control signals, punctuations
System internals: probably need all of the above
31
Query Model User/Application Query Processor DSMS PODS 2002
32
Related Database Technology
DSMS must use ideas, but none is substitute Triggers, Materialized Views in Conventional DBMS Main-Memory Databases Distributed Databases Pub/Sub Systems Active Databases Sequence/Temporal/Timeseries Databases Realtime Databases Adaptive, Online, Partial Results Novelty in DSMS Semantics: input ordering, streaming output, … State: cannot store unending streams, yet need history Performance: rate, variability, imprecision, … PODS 2002
33
Related Database Technology
Triggers on conventional databases: handling stream ordering/rate, scaling/generality for triggers
Main-memory databases: handling stream ordering/rate, better for read-only/query-intensive
Publish/subscribe systems: handling stream ordering, event-filtering only, dissemination focus
Materialized views: handling stream ordering, no streaming output
Active databases: event-condition-action rules, similar to triggers
Sequence/temporal/timeseries databases: represent time/ordering in stored relations
Realtime databases: transactions with deadlines
34
Stream Projects
Amazon/Cougar (Cornell): sensors
Aurora (Brown/MIT): sensor monitoring, dataflow
Hancock (AT&T): telecom streams
Niagara (OGI/Wisconsin): Internet XML databases
OpenCQ (Georgia): triggers, incremental view maintenance
Stream (Stanford): general-purpose DSMS
Tapestry (Xerox): pub/sub content-based filtering
Telegraph (Berkeley): adaptive engine for sensors
Tribeca (Bellcore): network monitoring
35
Aurora Architecture (Brown/MIT)
⋈ Output Streams Applications process output streams σ Historical Storage UI for designing operator network and querying trigger-base ⋈ ⋈ ⋈ σ π σ π Input Data Streams Application Administrator specifies processing strategy and QoS parameters PODS 2002
36
STREAM Architecture (Stanford)
Output streams Synopses Query Plans Running Op Ready Op Applications register continuous queries x p Waiting Op s s x Users issue continuous and ad-hoc queries Historical Storage Administrator monitors query execution and adjusts run-time parameters Input streams PODS 2002
37
Aurora versus STREAM
Aurora
  Focus on large number of sensor streams
  Exposes query plan
  QoS/Realtime emphasis
  Load-shedding
STREAM
  Attempt at a general-purpose DSMS
  Declarative SQL
  Semantic precision
  Approximations
Despite different emphasis, the two approaches are very similar
Differences in runtime environment (scheduler, memory manager, QoS manager)
38
Eddies – Continuous Adaptivity [Avnur-Hellerstein]
[Diagram: an EDDY routing tuples among the operators of a query plan]
Continuously adaptive query plan
Tuples flow in different orders
Visit each operator once before output
Routing policy adaptively chooses the “optimal” plan
(Slide courtesy: Joe Hellerstein)
39
Adaptivity (Telegraph)
Output Queues STeMs for join R grouped filter (R.A) EDDY S grouped filter (S.B) R x S x T T Input Streams R S T Runtime Adaptivity Multi-query Optimization Framework – implements arbitrary schemes PODS 2002
40
CACQ versus STREAM
CACQ
  Continuously adaptive plans
  Multi-query optimization via operator sharing and tuple routing
  Memory overhead in maintaining tuple lineage
  Memory allocation and scheduling policy hard-coded in system
  Web-based databases and sensor networks
STREAM
  Relatively static query plans
  Multi-query optimization via operator sharing and resource allocation
  No such extra memory overhead
  Flexible allocation/scheduling to optimize memory usage and response time
  General-purpose DSMS
41
Niagara Architecture (Wisconsin)
GUI Niagara Query Engine Continuous Query Processor Query Parser CQ Manager Query Optimizer Niagara Search Engine Group Optimizer Execution Engine Event Detector Data Manager Internet Data Sources PODS 2002
42
Query-Split Scheme (Niagara)
… IBM MSFT file i file j trig.Act.i trig.Act.j scan scan file i file j split Symbol = Const.Value Quotes.XML join constant table scan scan Aggregate subscription for efficiency Split – evaluate trigger only when file updated Triggers – multi-query optimization PODS 2002
43
Shared Predicates [Niagara, Telegraph]
Predicates for R.A: R.A > 1, R.A > 7, R.A > 11, R.A < 3, R.A < 5, R.A = 6, R.A = 8, R.A ≠ 9
[Diagram: the predicates indexed by operator (>, <, =, ≠) and constant, probed by a tuple with A = 8]
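One way to picture the shared index for the "greater than" group is a sorted list of constants probed once per tuple. This is only a sketch under that assumption; `GreaterThanIndex` is a hypothetical name and not the Niagara or Telegraph data structure.

```python
import bisect


class GreaterThanIndex:
    """Shared index over predicates of the form R.A > c (illustrative only)."""

    def __init__(self, constants):
        self.constants = sorted(constants)

    def matches(self, value):
        """Return the constants c whose predicate value > c is satisfied."""
        # bisect_left finds the first constant >= value; everything before it
        # is strictly smaller, so those predicates hold.
        cut = bisect.bisect_left(self.constants, value)
        return self.constants[:cut]


index = GreaterThanIndex([1, 7, 11])
print(index.matches(8))  # the slide's probe tuple, A = 8
```

A single binary search answers all the ">" predicates at once, instead of evaluating each query's predicate separately.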
44
Niagara versus STREAM
Niagara
  Files (possibly on disk) store intermediate results
  Multi-query optimization via operator sharing, incremental group optimization
  Explicit timer support for query output
  No support for approximation and load management
  Designed for online XML data sources
STREAM
  Queues (in main memory) store intermediate results
  Multi-query optimization via operator sharing and resource management
  More difficult to express in query language
  Graceful degradation with load via approximations
  General-purpose DSMS
45
Outline of Remaining Talk
Stream Models and DSMS Architectures Query Processing Runtime and Systems Issues Algorithms Conclusion PODS 2002
46
Query Processing Outline
Query Language Blocking Operators, Punctuations, Constraints Impact of Limited Memory Approximations Sliding Windows and Timestamps Load-shedding Synopses Query Evaluation Multiple Queries Adaptive Processing Distributed Processing PODS 2002
47
Blocking Operators
Blocking
  No output until entire input seen
  Streams: input never ends
Simple aggregates: output “update” stream
Set output (sort, group-by)
  Root: could maintain output data structure
  Intermediate nodes: try non-blocking analogs
  Example: juggle for sort [Raman, R, Hellerstein]
  Punctuations and constraints
Join
  Non-blocking, but intermediate state?
  Sliding-window restrictions
48
Punctuations [Tucker, Maier, Sheard, Fegaras]
Assertion about future stream contents Unblocks operators, reduces state Future Work Inserted at source or internal (operator signaling)? Does P unblock Q? Exists P? Rewrite Q? Relation between P and memory for Q? group-by R.A<10 R.A≥10 State/Index X R S P: S.A≥10 PODS 2002
49
Constraints
Schema-level: ordering, referential integrity, many-one joins
Instance-level: punctuations
Query-level: windowed join (nearby tuples only)
[Babu-Widom]
  Input: multi-stream SPJ query, schema-level constraints
  Output: plan with low intermediate state for joins
Future Work
  Query-level constraints? Combining constraints?
  Relaxed constraints (near-sorted, near-clustered)
  Exploiting constraints in intra-operator signaling
50
Relaxed Constraints
DBMS: typically strict constraints
Streams are dynamic ⇒ relax constraints
Example
  Stream S, attribute A, tuples s, t
  Strict ordering: time(t) - time(s) ≥ 0 ⇒ t.A ≥ s.A
  Relaxed ordering: time(t) - time(s) ≥ k ⇒ t.A ≥ s.A
  Allow limited out-of-order arrival
Open: relaxation-benefit tradeoff
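A relaxed ordering constraint with slack k means a small buffer is enough to restore order. The sketch below assumes time is the arrival position (1, 2, 3, ...) and k ≥ 1; `reorder` is a hypothetical helper, not part of any system named in the talk.

```python
import heapq


def reorder(values, k):
    """Yield values of attribute A in sorted order using a buffer of k items.

    Assumption: tuples arriving at least k positions apart are already in
    order of A (the relaxed ordering constraint above), and k >= 1.
    """
    heap = []
    for a in values:
        heapq.heappush(heap, a)
        if len(heap) >= k:
            # The oldest buffered tuple arrived at least k-1 positions ago, so
            # every future tuple is >= it, and hence >= the buffer minimum.
            yield heapq.heappop(heap)
    while heap:  # end of input: flush what is left
        yield heapq.heappop(heap)


print(list(reorder([2, 1, 3, 5, 4, 6], k=3)))
```

With k = 1 the buffer never holds more than the current tuple, which corresponds to the strict ordering case; larger k trades memory and latency for tolerance of disorder.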
51
Impact of Limited Memory
Continuous streams grow unboundedly Queries may require unbounded memory [ABBMW 02] a priori memory bounds for query Conjunctive queries with arithmetic comparisons Queries with join need domain restrictions Impact of duplication elimination Open – general queries PODS 2002
52
Approximate Query Evaluation
Why? Handling load – streams coming too fast Avoid unbounded storage and computation Ad hoc queries need approximate history How? Sliding windows, synopsis, samples, load-shed Major Issues? Metric for set-valued queries Composition of approximate operators How is it understood/controlled by user? Integrate into query language Query planning and interaction with resource allocation Accuracy-efficiency-storage tradeoff and global metric PODS 2002
53
Sliding Window Approximation
Why? Approximation technique for bounded memory Natural in applications (emphasizes recent data) Well-specified and deterministic semantics Issues Extend relational algebra, SQL, query optimization Algorithmic work Timestamps? PODS 2002
54
Timestamps
Explicit
  Injected by data source
  Models real-world event represented by tuple
  Tuples may be out-of-order, but if near-ordered can reorder with small buffers
Implicit
  Introduced as special field by DSMS
  Arrival time in system
Enables order-based querying and sliding windows
Issues
  Distributed streams?
  Composite tuples created by DSMS?
55
Timestamps in JOIN Output
[Diagram: join over streams R, S, T]
Approach 1
  User-specified, with defaults
  Compute output timestamp
  Must output in order of timestamps
  Better for explicit timestamps
  Need more buffering
  Get precise semantics and user-understanding
Approach 2
  Best-effort, no guarantee
  Output timestamp is exit-time
  Tuples arriving earlier more likely to exit earlier
  Better for implicit timestamps
  Maximum flexibility to system
  Difficult to impose precise semantics
56
Approximate via Load-Shedding
Handles scan and processing rate mismatch
Input load-shedding
  Sample incoming tuples
  Use when scan rate is bottleneck
  Positive: online aggregation [Hellerstein, Haas, Wang]
  Negative: join sampling [Chaudhuri, Motwani, Narasayya]
Output load-shedding
  Buffer input, infrequent output
  Use when query processing is bottleneck
  Example: XJoin [Urhan, Franklin]
  Exploit synopses
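Input load-shedding for a simple aggregate can be sketched as Bernoulli sampling followed by rescaling. This is an illustration under assumed names (`shed_and_sum`, `keep_prob`), not the mechanism of any particular system; it covers the aggregation case only, since sampling before a join does not behave this simply.

```python
import random


def shed_and_sum(stream, keep_prob, rng=None):
    """Estimate sum(stream) while processing only a random fraction of tuples.

    Each tuple is kept independently with probability `keep_prob`
    (0 < keep_prob <= 1); the kept total is scaled by 1/keep_prob so the
    estimate is unbiased.
    """
    rng = rng or random.Random()
    kept_total = 0.0
    for x in stream:
        if rng.random() < keep_prob:  # drop the tuple otherwise
            kept_total += x
    return kept_total / keep_prob


print(shed_and_sum(range(100), keep_prob=1.0))  # no shedding: exact sum
print(shed_and_sum(range(100), keep_prob=0.5, rng=random.Random(0)))
```

Lowering `keep_prob` reduces the work per arriving tuple at the cost of a noisier estimate.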
57
Processing Multiple Queries
Large number of continuous queries Long-running Shared resources Multi-query optimization Operator sharing Adaptivity (Eddies) Shared predicate indexes (CACQ, Niagara) PODS 2002
58
Adaptive Query Evaluation
Why adaptivity? Queries are long-running Fluctuating stream arrival & data characteristics Evolving query loads Issues in Adaptivity Resource allocation (memory, computation) Dynamic query execution plans (Eddies) PODS 2002
59
Distributed Query Evaluation
Logical stream = many physical streams maintain top 100 Yahoo pages Correlate streams at distributed servers network monitoring Many streams controlled by few servers sensor networks Issues Move processing to streams, not streams to processors Approximation-bandwidth tradeoff PODS 2002
60
Example: Distributed Streams
Maintain top 100 Yahoo pages Pages served by geographically distributed servers Must aggregate server logs Minimize communication Pushing processing to streams Most pages not in top 100 Avoid communicating about such pages Send updates about relevant pages only Requires server coordination PODS 2002
61
Distributed Streams: Example
Problem – streams are dynamic Popular set can change rapidly (e.g. 9/11) Must detect popularity change Communication for unpopular pages is unavoidable Approach – assign server “quotas” Servers report: count of unpopular page exceeds quota Popularity rise detected quickly Low communication for pages remaining unpopular PODS 2002
62
Stream Query Language?
SQL extension
  Sliding windows as first-class construct
    Awkward in SQL, needs reference to timestamps
    SQL-99 allows aggregations over sliding windows
  Sampling/approximation/load-shedding/QoS support?
Stream relational algebra and rewrite rules
  Aurora and STREAM
  Sequence/Temporal Databases
63
Outline of Remaining Talk
Stream Models and DSMS Architectures Query Processing Runtime and Systems Issues Algorithms Conclusion PODS 2002
64
STREAM Implementation Goals
Comprehensive DSMS query processor Broad suite of operators/synopses Sophisticated Developers-Workbench interface Submit queries in extended SQL or algebra Submit/edit query plans in XML or GUI Visualizing query execution On-the-fly modification of memory allocation, scheduling policies, queue management, etc. PODS 2002
65
Aurora Run-time Architecture
Inputs Outputs Router π Q1 Q2 Scheduler σ Q3 x Buffer Manager Box Processors Catalogs Persistent Store Q4 Load Shedder QoS Monitor Q5 PODS 2002
66
DSMS Internals Query plans: operators, synopses, queues
Memory management Dynamic Allocation – queries, operators, queues, synopses Graceful adaptation to reallocation Impact on throughput and precision Operator scheduling Variable-rate streams, varying operator/query requirements Response time and QoS Load-shedding Interaction with queue/memory management PODS 2002
67
Synopses Memory Management
Current work: Optimize synopsis for accuracy, respecting memory constraint Global Optimization Memory allocation to synopses Similar to [Jagadish, Jin, Ooi, Tan] Synopsis sharing? Across multiple operators and queries Similar to optimal Index Selection PODS 2002
68
Queue Memory and Scheduling [Babcock, Babu, Datar, Motwani]
Goal Given – query plan and selectivity estimates Schedule – tuples through operator chains Minimize total queue memory Best-slope scheduling is near-optimal Danger of starvation for some tuples Minimize tuple response time Schedule tuple completely through operator chain Danger of exceeding memory bound Open – graceful combination and adaptivity PODS 2002
69
Queue Memory and Scheduling [Babcock, Babu, Datar, Motwani]
Output σ1 σ3 best slope selectivity = 0.0 σ2 σ2 Net Selectivity selectivity = 0.6 starvation point σ3 σ1 selectivity = 0.2 Time Input PODS 2002
70
Memory vs. Response time
Open – Parametrized algorithm for trade-off PODS 2002
71
Precision-Resource Tradeoff
Resources – memory, computation, I/O Global Optimization Problem Input: queries with alternate plans, importance weights Precision: function of resource allocation to queries/operators Goal: select plans, allocate resources, maximize precision Memory Allocation Algorithm [Varma, Widom] Model – single query plan, simple precision model Rules for precision of composed operators Non-linear numerical optimization formulation Open – Combinatorial algorithm? General case? PODS 2002
72
Query Optimization
STREAM
  Combine resource allocation and query optimizer
Aurora
  Data flow ⇒ less scope for plan optimization
  Inserts inferred projects
  Reorders selects
Rate-based optimization [Viglas, Naughton]
  Model for output-rates as function of input-rates
  Enables optimizer to increase throughput
73
Load Shedding
Aurora: QoS approach
  Static: drop-based
  Runtime: delay-based
  Semantic: value-based
[Figure: three QoS graphs. Drop-based: QoS versus % tuples delivered. Delay-based: QoS versus delay. Value-based: QoS versus output value.]
74
Query Processing: Other Issues
Handling blocking operators Ad-hoc and Inactive-registered queries Query importance dial Query Optimization Caching Disk spilling for full accuracy-memory-latency tradeoff Rate management Systems Issues User Interfaces Gathering and using statistics Crash recovery and transaction management PODS 2002
75
Outline of Remaining Talk
Stream Models and DSMS Architectures Query Processing Runtime and Systems Issues Algorithms Conclusion PODS 2002
76
Synopses
Queries may access or aggregate past data
Need bounded-memory history-approximation
Synopsis?
  Succinct summary of old stream tuples
  Like indexes/materialized-views, but base data is unavailable
Examples
  Sliding windows
  Samples
  Sketches
  Histograms
  Wavelet representation
77
Model of Computation
[Diagram: a data stream arriving over increasing time, summarized into synopses/data structures]
Memory: poly(1/ε, log N)
Query/Update Time: poly(1/ε, log N)
N: # tuples so far, or window size
ε: error parameter
78
Reservoir Sampling [Vitter]
Maintain R samples from stream
Can generalize to weighted samples
Efficient implementation
N = number of data items seen
Example with R = 5, starting from the sample 1 2 3 4 5:
  T = 6, heads prob. = 5/6. Result = H. Replace one at random: 1 6 3 4 5
  T = 7, heads prob. = 5/7. Result = T. Sample unchanged: 1 6 3 4 5
  T = 8, heads prob. = 5/8. Result = H. Replace one at random: 1 6 3 4 8
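The coin-flip procedure in the example corresponds to the basic form of reservoir sampling, sketched here. The function name `reservoir_sample` and the optional `rng` argument are illustrative choices; this is the simple one-flip-per-item version, without the skipping optimizations that make the method efficient.

```python
import random


def reservoir_sample(stream, r, rng=None):
    """Return a uniform random sample of up to r items from a stream."""
    rng = rng or random.Random()
    sample = []
    for t, item in enumerate(stream, start=1):
        if t <= r:
            sample.append(item)              # the first r items fill the reservoir
        elif rng.random() < r / t:           # "heads" with probability r/t
            sample[rng.randrange(r)] = item  # replace one at random
    return sample


print(reservoir_sample(range(1, 9), r=5, rng=random.Random(1)))
```

Only the reservoir and the running count t are stored, so memory stays at r items regardless of stream length.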
79
Sketching Techniques
[Alon, Matias, Szegedy] frequency moments
[Feigenbaum et al., Indyk] extended to $L_p$ norm
[Dobra et al.] complex aggregates over joins
Key Subproblem: Self-Join Size Estimation
  Stream of values from D = {1, 2, …, N}
  Let $f_i$ = frequency of value i
  Self-join size $S = \sum_i f_i^2$
  Question: estimating S in small space?
80
Self-Join Size Estimation
AMS Technique (randomized sketches)
  Given $(f_1, f_2, \dots, f_N)$
  $Z_i$ = random $\{-1, +1\}$
  $X = \sum_i f_i Z_i$ (X incrementally computable)
Theorem: $\mathrm{Exp}[X^2] = \sum_i f_i^2$
  Cross-terms $f_i Z_i f_j Z_j$ have 0 expectation
  Square-terms $f_i Z_i f_i Z_i = f_i^2$
Space = $\log (N + \sum_i f_i)$
Independent samples $X_k$ reduce variance
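The AMS estimator can be sketched in a few lines. The class name `AMSSketch`, the number of copies, and the way signs are generated (the low bit of a random cubic polynomial over a prime field, used here as an approximately four-wise independent hash) are all assumptions made for illustration; the estimate is the plain average of the squared counters.

```python
import random

P = (1 << 61) - 1  # a Mersenne prime, used as the field for the sign hash


class AMSSketch:
    """Estimate the self-join size sum(f_i^2) of a stream of integer values."""

    def __init__(self, copies=64, seed=0):
        rng = random.Random(seed)
        # One random cubic polynomial per independent copy X_k.
        self.coeffs = [[rng.randrange(P) for _ in range(4)] for _ in range(copies)]
        self.x = [0] * copies

    @staticmethod
    def _sign(coeffs, value):
        a, b, c, d = coeffs
        h = (((a * value + b) * value + c) * value + d) % P
        return 1 if h & 1 else -1  # Z_value for this copy

    def update(self, value, count=1):
        # X = sum_i f_i Z_i is maintained incrementally, one arrival at a time.
        for k, coeffs in enumerate(self.coeffs):
            self.x[k] += count * self._sign(coeffs, value)

    def estimate(self):
        # Averaging independent copies of X^2 reduces the variance.
        return sum(x * x for x in self.x) / len(self.x)


sketch = AMSSketch()
for value, freq in enumerate([3, 6, 2, 5, 7], start=1):
    sketch.update(value, freq)
print(sketch.estimate())  # true self-join size is 123
```

Each copy stores a single counter plus its hash coefficients, so the space does not depend on the number of distinct values.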
81
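Sample Run of AMS
[$Z_1$ and $Z_2$ are the ±1 sign vectors shown on the slide.]
V = (3, 6, 2, 5, 7), $\sum_i v_i^2 = 123$
  $X_1 = 5$, $X_1^2 = 25$
  $X_2 = 14$, $X_2^2 = 196$
  Est = (25 + 196)/2 = 110.5
V = (4, 6, 2, 5, 7), $\sum_i v_i^2 = 130$
  $X_1 = 6$, $X_1^2 = 36$
  $X_2 = 12$, $X_2^2 = 144$
  Est = (36 + 144)/2 = 90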
82
Tug-of-War
Each stream item is {x, i}: “x” = value, “i” = dimension
Each dimension is assigned +1 or -1; based on “i”, add a pull of strength “x” to the right or to the left
  In the example, even dimensions go right and odd dimensions go left
Example items: {1, 2}, {2, 4}, {1, 6}, {4, 8}, {1, 5}, {4, 3}, {1, 1}
$E[(\mathrm{strength}(\text{right}) - \mathrm{strength}(\text{left}))^2] = F_2$ ($\ell_2$)
83
Quantiles (Greenwald-Khanna)
Maintain triplets $(v_i, g_i, \Delta_i)$, where the $v_i$ are increasing
$g_i$ represents the number of values merged into this representative tuple (can be viewed as added error)
$\Delta_i$ represents the error when the tuple was created
Merge tuples periodically while maintaining $g_i + \Delta_i \le 2\varepsilon N$
Example with $2\varepsilon N = 10$:
  $v_i$:  , 7, 8, 10, 34, 78, 85, 100
  $g_i$: 1, 2, 3, 6, 2, 3, 4, 5
  $\Delta_i$: 0, 2, 5, 3, 2, 3, 4, 4
  5 tuples get added; they can be merged into “34”
84
Sliding Window Computations [Datar, Gionis, Indyk, Motwani]
Goal: statistics/queries Memory: o(N), preferably poly(1/ε, log N) Problem: count/sum/variance, histogram, clustering,… Sample Results: (1+ε)-approximation Counting: Space O(1/ε (log N)) bits, Time O(1) amortized Sum over [0,R]: Space O(1/ε log N (log N + log R)) bits, Time O(log R/log N) amortized Lp sketches: maintain with poly(1/ε, log N) space overhead Matching space lower bounds PODS 2002
85
Sliding Window Histograms
Key Subproblem – Counting 1’s in bit-stream Goal – Space O(log N) for window size N Problem – Accounting for expiring bits Idea Partition/track buckets of known count Error in oldest bucket only Future 0’s? … PODS 2002
87
Exponential Histograms
Buckets of exponentially increasing size
Between K/2 and K/2+1 buckets of each size
K = 1/ε and ε = relative error
Invariant: $C_{i-1} + C_{i-2} + \dots + C_2 + C_1 + 1 \ge (K/2)\, C_i$
Example with K = 2 (bucket sizes listed oldest first):
  Bucket sizes = 4,2,2,1
  Bucket sizes = 4,2,2,1,1 (element arrived this step)
  Bucket sizes = 4,2,2,1,1,1 (element arrived this step)
  Bucket sizes = 4,2,2,2,1 (two oldest buckets of size 1 merged)
  Bucket sizes = 4,4,2,1 (two oldest buckets of size 2 merged)
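The bucket maintenance in the K = 2 example can be sketched as follows. The class and method names are hypothetical, and the estimator used here (total minus half of the uncertain part of the oldest bucket) is one reasonable choice rather than a quotation from the paper.

```python
from collections import deque


class ExponentialHistogram:
    """Approximate count of 1s among the last `window` bits of a stream."""

    def __init__(self, window, k=2):
        self.window = window
        self.max_per_size = k // 2 + 1  # merge once a size exceeds this count
        self.time = 0
        self.buckets = deque()  # (timestamp of newest 1, size), newest first

    def add(self, bit):
        self.time += 1
        # Expire the oldest bucket once its newest 1 has left the window.
        while self.buckets and self.buckets[-1][0] <= self.time - self.window:
            self.buckets.pop()
        if not bit:
            return
        self.buckets.appendleft((self.time, 1))
        i = 0
        while i < len(self.buckets):
            size = self.buckets[i][1]
            j = i
            while j < len(self.buckets) and self.buckets[j][1] == size:
                j += 1
            if j - i <= self.max_per_size:
                break
            # Merge the two oldest buckets of this size; keep the newer stamp.
            stamp = self.buckets[j - 2][0]
            del self.buckets[j - 1]
            self.buckets[j - 2] = (stamp, 2 * size)
            i = j - 2  # the merged bucket may overflow the next size

    def sizes(self):
        return [size for _, size in reversed(self.buckets)]  # oldest first

    def estimate(self):
        if not self.buckets:
            return 0
        oldest = self.buckets[-1][1]
        # Only the oldest bucket can straddle the window boundary.
        return sum(s for _, s in self.buckets) - (oldest - 1) / 2


h = ExponentialHistogram(window=1000, k=2)
for _ in range(10):
    h.add(1)
print(h.sizes())
h.add(1)
print(h.sizes())
```

Because bucket sizes double, the number of buckets grows only logarithmically with the window size, and only the oldest bucket contributes error.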
88
Many other results …
Histograms
  V-Opt Histograms: [Gilbert, Guha, Indyk, Kotidis, Muthukrishnan, Strauss], [Indyk]
  End-Biased Histograms (Iceberg Queries): [Manku, Motwani], [Fang, Shiva, Garcia-Molina, Motwani, Ullman]
  Equi-Width Histograms (Quantiles): [Manku, Rajagopalan, Lindsay], [Khanna, Greenwald]
Wavelets
  Seminal work [Vitter, Wang, Iyer] + many others!
Data Mining
  Stream Clustering: [Guha, Mishra, Motwani, O’Callaghan], [O’Callaghan, Meyerson, Mishra, Guha, Motwani]
  Decision Trees: [Domingos, Hulten], [Domingos, Hulten, Spencer]
89
Algorithms – Open Issues
Global space allocation for synopsis Global error metric Dynamic optimization Synopses for sliding windows Correlated aggregates [Gehrke, Korn, Srivastava] Provable guarantees? Distributed Algorithms (e.g., top-k counting) PODS 2002
90
Conclusion: Future Work
Query Processing Stream Algebra and Query Languages Approximations Blocking, Constraints, Punctuations Runtime Management Scheduling, Memory Management, Rate Management Query Optimization (Adaptive, Multi-Query, Ad-hoc) Distributed processing Synopses and Algorithmic Problems Systems UI, statistics, crash recovery and transaction management System development and deployment PODS 2002
91
Thank You! PODS 2002