Adaptive Dataflow: A Database/Networking Cosmic Convergence
Joe Hellerstein, UC Berkeley

Road Map
- How I got started on this
  - CONTROL project
  - Eddies
- Tie-ins to Networking Research
- Telegraph & ongoing adaptive dataflow research
- New arenas:
  - Sensor networks
  - P2P networks

Background: CONTROL project
- Online/Interactive query processing
  - Online aggregation
  - Scalable spreadsheets & refining visualizations
  - Online data cleaning (Potter’s Wheel)
- Pipelining operators (ripple joins, online reordering) over streaming samples

Example: Online Aggregation
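
The original slide was a screenshot. As a rough sketch of the idea, the code below refines a running AVG and a confidence interval as samples stream in. It assumes the samples arrive in random order and uses a normal-approximation interval with Welford's variance update; the name `online_avg` and the parameter `z` are made up for this illustration and are not the CONTROL estimators.

```python
import math


def online_avg(samples, z=1.96):
    """Yield (n, running_mean, half_width) after every sample.

    half_width is z * sqrt(sample_variance / n), i.e. a normal-approximation
    confidence interval (an assumption of this sketch).
    """
    n, mean, m2 = 0, 0.0, 0.0
    for x in samples:
        n += 1
        delta = x - mean
        mean += delta / n              # Welford update of the mean
        m2 += delta * (x - mean)       # running sum of squared deviations
        if n > 1:
            half = z * math.sqrt((m2 / (n - 1)) / n)
        else:
            half = float("inf")        # no variance estimate from one sample
        yield n, mean, half


if __name__ == "__main__":
    for n, mean, half in online_avg([1, 2, 3, 4]):
        print(n, round(mean, 3), half)
```

The user can stop the query as soon as the interval is tight enough, which is the point of the online regime.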

Online Data Visualization: CLOUDS

Potter’s Wheel

Goals for Online Processing
- Performance metric:
  - Statistical (e.g. conf. intervals)
  - User-driven (e.g. weighted by widgets)
- New “greedy” performance regime
  - Maximize 1st derivative of the “mirth index”
  - Mirth defined on-the-fly
  - Therefore need FEEDBACK and CONTROL
[Figure: plot over Time up to 100%, with curves labeled Online and Traditional]

CONTROL  Volatility Goals and data may change over time User feedback, sample variance Goals and data may be different in different “regions” Group-by, scrollbar position [An aside: dependencies in selectivity estimation] Q: Query optimization in this world? Or in any pipelining, volatile environment?? Where else do we see volatility?

Continuous Adaptivity: Eddies
- A little more state per tuple
  - Ready/done bits (extensible a la Volcano/Starburst)
- Query processing = dataflow routing!!
  - We'll come back to this!
[Figure: Eddy]

Eddies: Two Key Observations
- Break the set-oriented boundary
  - Usual DB model: algebra expressions: (R ⋈ S) ⋈ T
  - Usual DB implementation: pipelining operators!
    - Subexpressions never materialized
  - Typical implementation is more flexible than algebra
    - We can reorder in-flight operators
  - Other gains possible by breaking the set-oriented boundary…
- Don’t rewrite graph. Impose a router
  - Graph edge = absence of routing constraint
  - Observe operator consumption/production rates
    - Consumption: cost
    - Production: cost*selectivity
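
To make the routing idea concrete, here is a minimal sketch of an eddy over selection operators. Each tuple carries a bitmap of done bits, and the router sends it next to the not-yet-done operator with the lowest observed production/consumption ratio. This greedy rule is a simplified stand-in, not the routing policy of the eddies work; it also assumes all operators cost the same. `Op`, `eddy` and the 0.5 prior are names and choices invented for the sketch.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Op:
    name: str
    pred: Callable[[int], bool]
    seen: int = 0      # tuples consumed so far
    passed: int = 0    # tuples produced so far

    def selectivity(self) -> float:
        # Neutral prior of 0.5 until the operator has seen any tuple.
        return self.passed / self.seen if self.seen else 0.5


def eddy(tuples, ops: List[Op]):
    """Route every tuple through all ops, choosing the order per tuple."""
    all_done = (1 << len(ops)) - 1
    out = []
    for t in tuples:
        done = 0                       # per-tuple done bits, one per operator
        alive = True
        while alive and done != all_done:
            ready = [i for i in range(len(ops)) if not done & (1 << i)]
            # Greedy choice: the operator that currently drops the most tuples.
            i = min(ready, key=lambda j: ops[j].selectivity())
            ops[i].seen += 1
            if ops[i].pred(t):
                ops[i].passed += 1
                done |= 1 << i
            else:
                alive = False          # tuple filtered out, leaves the flow
        if alive:
            out.append(t)
    return out


if __name__ == "__main__":
    ops = [Op("even", lambda x: x % 2 == 0), Op("small", lambda x: x < 10)]
    print(eddy(range(100), ops))       # [0, 2, 4, 6, 8]
```

No plan is ever rewritten here: the operator order is just a per-tuple routing decision that shifts as the observed rates change.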

Coincidence: Eddie Comes to Berkeley
- CLICK: a NW router is a query plan!
- “The Click Modular Router”, Robert Morris, Eddie Kohler, John Jannotti, and M. Frans Kaashoek, SOSP ’99

[Figure 3: Example Router Graph]
- Also Scout
  - Paths the key to comm-centric OS
  - “Making Paths Explicit in the Scout Operating System”, David Mosberger and Larry L. Peterson, OSDI ’96

More Interaction: CS262 Experiment w/ Eric Brewer
- Merge OS & DBMS grad class, over a year
  - Eric/Joe, point/counterpoint
- Some tie-ins were obvious:
  - memory mgmt, storage, scheduling, concurrency
- Surprising: QP and networks go well side by side
  - E.g. eddies and TCP Congestion Control
    - Both use back-pressure and simple Control Theory to “learn” in an unpredictable dataflow environment
    - Eddies close to the n-armed bandit problem
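
For DB readers, the control loop behind TCP congestion control can be sketched as additive increase, multiplicative decrease (AIMD): grow the window a little after each success and cut it sharply on a loss signal. The function name `aimd` and its default parameters are illustrative assumptions, not values from any TCP specification.

```python
def aimd(events, cwnd=1.0, incr=1.0, decr=0.5, floor=1.0):
    """Return the window size after each event.

    events: iterable of booleans, True = round acknowledged, False = loss.
    """
    trace = []
    for ok in events:
        if ok:
            cwnd += incr                     # additive increase
        else:
            cwnd = max(floor, cwnd * decr)   # multiplicative decrease
        trace.append(cwnd)
    return trace


if __name__ == "__main__":
    print(aimd([True, True, True, False, True]))   # [2.0, 3.0, 4.0, 2.0, 3.0]
```

The sender knows nothing about the network in advance and learns a rate purely from feedback, which is the parallel to an eddy learning operator rates.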

Networking Overview for DB People Like Me
- Core function of protocols: data xfer
  - Data Manipulation (buffer, checksum, encryption, xfer to/fr app space, presentation)
  - Transfer Control (flow/congestion ctl, detecting xmission probs, acks, muxing, timestamps, framing)
  - -- Clark & Tennenhouse, “Architectural Considerations for a New Generation of Protocols”, SIGCOMM ’90
- Basic Internet assumption: “a network of unknown topology and with an unknown, unknowable and constantly changing population of competing conversations” (Van Jacobson)

C & T’s Wacky Ideas
- Thesis: nets are good at xfer control, not so good at data manipulation
- Some C&T wacky ideas for better data manipulation
  - Xfer semantic units, not packets (ALF)
  - Auto-rewrite layers to flatten them (ILP)
  - Minimize cross-layer ordering constraints
  - Control delivery in parallel via packet content
[Callouts on slide: Exchange! Data Modeling! Query Opt!]

Wacky New Ideas in QP
- What if…
  - We had unbounded data producers and consumers (“streams” … “continuous queries”)
  - We couldn’t know our producers’ behavior or contents?? (“federation” … “mediators”)
  - We couldn’t predict user behavior? (“control”)
  - We couldn’t predict behavior of components in the dataflow? (“networked services”)
  - We had partial failure as a given? (oops, have we ignored this?)
- Yes … networking people have been here!
  - Remember Van Jacobson’s quote?

The Cosmic Convergence
- NETWORKING RESEARCH
  - Content-Based Routing
  - Router Toolkits
  - Content Addressable Networks
  - Directed Diffusion
  - Adaptivity, Federated Control, GeoScalability
- DATABASE RESEARCH
  - Adaptive Query Processing
  - Continuous Queries
  - Approximate/Interactive QP
  - Sensor Databases
  - Data Models, Query Opt, DataScalability

The Cosmic Convergence
- NETWORKING RESEARCH
  - Content-Based Routing
  - Router Toolkits
  - Content Addressable Networks
  - Directed Diffusion
  - Adaptivity, Federated Control, GeoScalability
- DATABASE RESEARCH
  - Adaptive Query Processing
  - Continuous Queries
  - Approximate/Interactive QP
  - Sensor Databases
  - Data Models, Query Opt, DataScalability
- Telegraph

What’s in the Sweet Spot?
- Scenarios with:
  - Structured Content
  - Volatility
  - Rich Queries
- Clearly:
  - Long-running data analysis a la CONTROL
  - Continuous queries
  - Queries over Internet sources and services
- Two emerging scenarios:
  - Sensor networks
  - P2P query processing

Telegraph: Engineering the Sweet Spot
- An adaptive dataflow system
- Dataflow programming model
  - A la Volcano, CLICK: push and pull. “Fjords”, ICDE02
  - Extensible set of pipelining operators, including relational ops, grouped filters (e.g. XFilter)
  - SQL parser for convenience (looking at XQuery)
- Adaptivity operators
  - Eddies
    - + Extensible rules for routing constraints, Competition
  - SteMs (state modules)
  - FLuX (Fault-tolerant Load-balancing eXchange)
- Bounded and continuous:
  - Data sources
  - Queries

State Modules (SteMs)
- Goal: Further adaptivity through competition
  - Multiple mirrored sources
    - Handle rate changes, failures, parallelism
  - Multiple alternate operators
- Join = Routing + State
  - SteM operator manages tradeoffs
  - State Module, unifies caches, rendezvous buffers, join state
  - Competitive sources/operators share building/probing SteMs
  - Join algorithm hybridization!
- Vijayshankar Raman
[Figure labels: static dataflow; eddy + stems]
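
"Join = Routing + State" can be illustrated with a small sketch: each source gets a state module that supports build and probe, and the join falls out of routing every arriving tuple to "build into my own SteM, then probe the other one". The class and function names below (`SteM`, `join_via_stems`) are invented for this sketch. A hash index on one key is assumed, and the fixed build-then-probe routing stands in for the choice an eddy would make.

```python
from collections import defaultdict


class SteM:
    """Per-source state: a hash index on the join key."""

    def __init__(self, key):
        self.key = key
        self.index = defaultdict(list)

    def build(self, row):
        self.index[row[self.key]].append(row)

    def probe(self, value):
        return list(self.index.get(value, []))


def join_via_stems(stream, key):
    """stream: iterable of (source, row) in arrival order, source is 'R' or 'S'.

    Yields (r_row, s_row) pairs as soon as both sides have arrived.
    """
    stems = {"R": SteM(key), "S": SteM(key)}
    for src, row in stream:
        other = "S" if src == "R" else "R"
        stems[src].build(row)                        # state
        for match in stems[other].probe(row[key]):   # routing to the other SteM
            yield (row, match) if src == "R" else (match, row)


if __name__ == "__main__":
    stream = [("S", {"k": 1, "b": "s1"}), ("R", {"k": 1, "a": "r1"}),
              ("S", {"k": 1, "b": "s2"})]
    for r, s in join_via_stems(stream, "k"):
        print(r["a"], s["b"])
```

With this routing the result is a symmetric hash join. Other routings over the same state would give other join algorithms, which is one way to read "hybridization".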

FLuX: Routing Across Cluster
- Fault Tolerance, Load Balancing
  - Continuous/long-running flows need high availability
  - Big flows need parallelism
    - Adaptive Load-Balancing req’d
- FLuX operator: Exchange plus…
  - Adaptive flow partitioning (River)
  - Transient state replication & migration
    - RAID for SteMs
    - Needs to be extensible to different ops:
      - Content-sensitivity
      - History-sensitivity
- Dataflow semantics
  - Optimize based on edge semantics
    - Networking tie-in again: At-least-once delivery? Exactly-once delivery? In/Out of order?
- Migration policy: the ski rental analogy
- Mehul Shah
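
The slide only names the ski rental analogy. One way to read it, sketched below under that assumption, is the classic break-even rule: keep paying the per-step cost of imbalance ("renting") and migrate state ("buy") once the accumulated cost reaches the one-time migration cost. `migrate_at` and its cost inputs are hypothetical, and this is not a description of the actual FLuX policy.

```python
def migrate_at(imbalance_costs, migration_cost):
    """Return the step index at which to migrate, or None if never.

    imbalance_costs: per-step cost of staying with the current partitioning.
    migration_cost: one-time cost of moving the state.
    """
    paid = 0.0
    for step, cost in enumerate(imbalance_costs):
        paid += cost
        if paid >= migration_cost:   # break-even point: rent paid equals price
            return step
    return None


if __name__ == "__main__":
    print(migrate_at([1, 1, 1, 1, 1], 3))   # 2
    print(migrate_at([1, 1], 5))            # None
```

In the classic ski rental problem, the break-even rule never pays more than about twice what an oracle that knew the future would pay.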

Continuously Adaptive Continuous Queries (CACQ)
- Continuous Queries clearly need all this stuff! Address adaptivity 1st.
- 4 Ideas in CACQ:
  - Use eddies to allow reordering of ops.
    - But one eddy will serve for all queries
  - Explicit tuple lineage
    - Mark each tuple with per-op ready/done bits
    - Mark each tuple with per-query completed bits
  - Queries are data: join with Grouped Filter
    - Much like XFilter, but for relational queries
  - Joins via SteMs, shared across all queries
    - Note: mixed-lineage tuples in a SteM. I.e. shared state is not shared algebraic expressions!
    - Delete a tuple from flow only if it matches no query
- Next: F.T. CACQ via FLuXen
- Sam Madden, Mehul Shah, Vijayshankar Raman
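
"Queries are data" can be sketched with a toy grouped filter: the predicates of many queries on one attribute are stored in a sorted index, so one lookup per tuple finds every query it satisfies. The sketch handles only predicates of the form `attr > c`. `GroupedFilter`, `add` and `matching` are names made up here and do not come from CACQ.

```python
import bisect


class GroupedFilter:
    """Index of 'attr > c' predicates from many queries on one attribute."""

    def __init__(self):
        self.consts = []   # thresholds, kept sorted
        self.qids = []     # query ids, parallel to consts

    def add(self, qid, c):
        i = bisect.bisect_left(self.consts, c)
        self.consts.insert(i, c)
        self.qids.insert(i, qid)

    def matching(self, value):
        """Return the set of query ids whose predicate value > c holds."""
        k = bisect.bisect_left(self.consts, value)   # thresholds strictly below
        return set(self.qids[:k])


if __name__ == "__main__":
    gf = GroupedFilter()
    gf.add("q1", 10)
    gf.add("q2", 20)
    gf.add("q3", 5)
    print(sorted(gf.matching(15)))   # ['q1', 'q3']
    print(gf.matching(5))            # set(): matches no query, tuple can be dropped
```

The returned set plays the role of the per-query lineage on the tuple: an empty set means the tuple matches no query and can leave the flow.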

Sensor Nets
- “Smart Dust” + TinyOS
  - Thousands of “motes”
  - Expensive communication
  - Power constraints
- Query workload:
  - Aggregation & approximation
  - Queries and Continuous Queries
- Challenges:
  - Push the processing into the network
  - Deal with volatility & failure
  - CONTROL issues: data variance, user desires
- Joint work with Ramesh Govindan, Sam Madden, Wei Hong and David Culler (Intel Berkeley Lab)
- Simple example: Aggregation query
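
A sketch of what "push the processing into the network" could mean for an AVG query: each mote merges its children's partial state (sum, count) with its own reading and sends a single record to its parent, instead of relaying every raw reading to the root. The routing tree, the readings and both function names are invented for this illustration.

```python
def partial_avg(tree, readings, node):
    """Return (sum, count) for the subtree rooted at node."""
    s, c = readings[node], 1
    for child in tree.get(node, ()):
        cs, cc = partial_avg(tree, readings, child)
        s, c = s + cs, c + cc            # merge partial state records
    return s, c


def hops_without_aggregation(tree, node, depth=0):
    """Total radio hops if every reading is relayed unmerged to the root."""
    return depth + sum(
        hops_without_aggregation(tree, child, depth + 1)
        for child in tree.get(node, ())
    )


if __name__ == "__main__":
    tree = {0: [1, 2], 1: [3]}           # node 0 is the root
    readings = {0: 10, 1: 20, 2: 30, 3: 40}
    s, c = partial_avg(tree, readings, 0)
    print(s / c)                                  # 25.0
    print(hops_without_aggregation(tree, 0))      # 4
```

With in-network merging each non-root mote transmits once, so this tree needs 3 messages instead of 4 hops, and the gap widens as the tree gets deeper.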

P2P QP
- Starting point: P2P as grassroots phenomenon
  - Outrageous filesharing volume (1.8G files in October 2001)
  - No business case to date
- Challenge: scale DDBMS QP ideas to P2P
  - Motivate why
  - Pick the right parts of DBMS research to focus on
    - Storage: no! QP: yes.
  - Make it work:
    - Scalability well beyond our usual target
    - Admin constraints
    - Unknown data distributions, load
    - Heterogeneous comm/processing
    - Partial failure
- Joint work with Scott Shenker, Ion Stoica, Matt Harren, Ryan Huebsch, Nick Lanham, Boon Thau Loo

A Grassroots Example: TeleNap

Themes Throughout
- Adaptivity
  - Requires clever system design
    - The Exchange model: encapsulate in ops?
  - Interesting adaptive policy problems
    - E.g. eddy routing, flux migration
    - Control Theory, Machine Learning
  - Encompasses another CS goal?
    - “No-knobs”, “Autonomic”, etc.
- New performance regimes
  - Decent performance in the common case
    - Mean/Variance more important than MAX
  - Interactive Metrics
    - Time to completion often unimportant/irrelevant

More Themes
- Set-valued thinking as albatross?
  - E.g. eddies vs. Kabra/DeWitt or Tukwila
  - E.g. SteMs vs. Materialized Views
  - E.g. CACQ vs. NiagaraCQ
  - Some clean theory here would be nice
    - Current routing correctness proofs are inelegant
- Extensibility
  - Model/language of choice is not clear
    - SEQ? Relational? XQuery?
  - Extensible operators, edge semantics
  - [A whine about VLDB’s absurd “Specificity Factor”]

Conclusions?
- Too early for technical conclusions
- Of this I’m sure:
  - The CS262 experiment is a success
    - Our students are getting a bigger picture than before
    - I’m learning, finding new connections
    - May morph to OS/Nets, Nets/DB
    - Eventually rethink the systems software curriculum at the undergraduate level too
  - Nets folks are coming our way
    - Doing relevant work, eager to collaborate
  - DB community needs to branch out
    - Outbound: Better proselytizing in CS
    - Inbound: Need new ideas

Conclusions, cont.
- Sabbatical is a good invention
  - Hasn’t even started, I’m already grateful!
