Adaptive Dataflow Joe Hellerstein UC Berkeley. Overview Trends Driving Adaptive Dataflow Lessons –networking flow control, event programming, app-level.

Presentation transcript:

Adaptive Dataflow Joe Hellerstein UC Berkeley

Overview Trends Driving Adaptive Dataflow Lessons –networking flow control, event programming, app-level routing –query processing distributed & “shared nothing” parallel query systems Adaptive Dataflow –Rivers, Eddies Telegraph –Facts and Figures on the Web (FFW) –sensor-based information systems –software traces and adaptive distributed systems FFW Questions

Recent Trends 1990s: shift in the focus of academic CS –driving example applications changed –everybody on the information bandwagon 1990s: tightening of bottlenecks –data growth double Moore’s Law 90’s systems R&D: Parallel & Distributed Information Services –infocentric, multi-user, highly available –“shared-nothing” clusters, not “parallelism” a la 1989 –distributed data, not “distributed OS” a la 1989

Systems Research: Up or Down UP: Global Federations –internet services as procedure calls with fees and lawyers! –B2B e-commerce has this problem today Cohera examples DOWN: Sensor/Actuator Networks –UPC codes? Clickstream? Smart dust! –HUGE, noisy data volumes –new architectures, major challenges

Core Technology Not There Yet Key component: dataflow –The plumbing is coming XML/http, WML/WAP, etc. give LCD communication glue at the boundaries ho hum –Systems challenge: move the data efficiently robustly intelligently –Language challenge too programming and debugging tools interfaces and economic models

What’s So Hard Here? Volatile regime –Data flows unpredictably from sources –Code performs unpredictably along flows –Continuous volatility due to many decentralized systems Lots of choices –Choice of services –Choice of machines –Choice of info: sensor fusion, data reduction, etc. –Order of operation Maintenance –Federated world –Partial failure is the common case Adaptivity required!

A Networking Problem!? Networks do dataflow! Significant history of adaptive techniques –E.g. TCP congestion control –E.g. routing But traditionally much lower function –Ship bitstreams –Minimal, fixed code Lately, moving up the foodchain? –app-level routing –active networks –politics of growth assumption of complexity = assumption of liability

Networking Code as Dataflow? States & Events, Not Threads –Asynchronous events natural to networks –State machines in protocol specification and system code –Low-overhead, spreading to big systems –Totally different programming style remaining area of hacker machismo Eventflow optimization –Can’t eventflow be adaptively optimized like dataflow? –Why didn’t that happen years ago? –Hold this thought

Query Plans are Dataflow Too Programming model: iterators –old idea, widely used in DB query processing –object with three methods: Init(), GetNext(), Close() –input/output types –query plan: graph of iterators pipelining: iterators that return results before children Close()
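
The iterator model on this slide can be sketched in a few lines. This is a minimal illustration, not code from Telegraph or any DBMS: the class names `Scan` and `Select`, the helper `run`, and the use of `None` as the end-of-stream marker are assumptions made for the example (so rows themselves must not be `None`).

```python
class Scan:
    """Leaf iterator over an in-memory list of rows (hypothetical example)."""

    def __init__(self, rows):
        self.rows = rows
        self.pos = None

    def init(self):
        self.pos = 0

    def get_next(self):
        if self.pos >= len(self.rows):
            return None  # assumed end-of-stream marker
        row = self.rows[self.pos]
        self.pos += 1
        return row

    def close(self):
        self.pos = None


class Select:
    """Pipelining filter: returns each qualifying row as soon as it is found."""

    def __init__(self, child, predicate):
        self.child = child
        self.predicate = predicate

    def init(self):
        self.child.init()

    def get_next(self):
        while True:
            row = self.child.get_next()
            if row is None or self.predicate(row):
                return row

    def close(self):
        self.child.close()


def run(plan):
    """Drive a plan (a tree of iterators) from the root and collect its output."""
    plan.init()
    out = []
    row = plan.get_next()
    while row is not None:
        out.append(row)
        row = plan.get_next()
    plan.close()
    return out


plan = Select(Scan([1, 2, 3, 4, 5, 6]), lambda r: r % 2 == 0)
print(run(plan))  # → [2, 4, 6]
```

Because `Select` hands back a row before its child is closed, it pipelines; a sort, by contrast, would have to drain its child before returning anything.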

Distributed/Parallel Databases? Query plans across machines Bloom filters, query optimization minimize bandwidth –Send lossily compressed signatures, not just data –Model network, disk, CPU costs in dataflow optimization –A “Distributed DB” contribution App-level optimization … code to data or data to code –DB research highest up the food chain Data partitioning, query opt. parallelizes dataflow –Pipeline & partition parallelism: natural! –Model resource consumption and response time –A “Parallel DB” contribution Challenge: move down the foodchain to serve all –Biggest problem: limited adaptivity
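
The idea of shipping a lossy signature instead of data can be illustrated with a Bloom-filter join between two sites. The sketch below is an assumption-laden toy: `BloomFilter`, `bloom_join`, the filter size, and the SHA-256-based hashing are all choices made for the example, and both "sites" live in one process.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter; a Python int serves as the bit array."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key):
        # False positives are possible, false negatives are not.
        return all((self.bits >> p) & 1 for p in self._positions(key))


def bloom_join(site_a_rows, site_b_rows):
    """Join (key, value) rows held at two sites, shipping as little of B as possible."""
    # Site A: summarize its join keys and ship only the filter to site B.
    bf = BloomFilter()
    for key, _ in site_a_rows:
        bf.add(key)
    # Site B: ship back only the rows that might find a partner at A.
    shipped = [(k, v) for k, v in site_b_rows if bf.might_contain(k)]
    # Site A: the exact join discards any false positives.
    index = {}
    for k, v in site_a_rows:
        index.setdefault(k, []).append(v)
    joined = [(k, a, b) for k, b in shipped for a in index.get(k, [])]
    return joined, len(shipped)


a = [(1, "a"), (2, "b")]
b = [(2, "x"), (3, "y"), (2, "z")]
joined, rows_shipped = bloom_join(a, b)
print(joined, rows_shipped)
```

The bandwidth saving depends on how many of B's rows fail the filter; a cost model would weigh the filter's size against the rows it avoids shipping.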

Adaptive Systems: General Flavor Repeat: 1. Observe (model) environment 2. Use observation to choose behavior 3. Take action

Adaptive Dataflow in DBs: History Rich But Unacknowledged History –Codd's data independence predicated on adaptivity! adapt opaquely to changing schema and storage –Query optimization does it! statistics-driven optimization key differentiator between DBMSs and other systems

Adaptivity in Current DBs Limited & coarse grain Repeat: 1. Observe (model) environment –runstats (once per week!!): model changes in data 2. Use observation to choose behavior –query optimization: fixes a single static query plan 3. Take action –query execution: blindly follow plan

Adaptive Query Processing Work –Late Binding: Dynamic, Parametric [HP88,GW89,IN+92,GC94,AC+96,LP97] –Per Query: Mariposa [SA+96], ASE [CR94] –Competition: RDB [AZ96] –Inter-Op: [KD98], Tukwila [IF+99] –Query Scrambling: [AF+96,UFA98] Survey: Hellerstein, Franklin, et al., DE Bulletin 2000 [Figure: axis “Frequency of Adaptivity”; labels: System R, Late Binding, Per Query, Competition & Sampling, Inter-Operator, Query Scrambling, Eddies, Ingres DECOMP, Future Work]

Some Solutions We’re Focusing On Rivers –Adaptive partitioning of work across machines Eddies –Adaptive ordering of pipelined operations Quality of Service –Online aggregation & data reduction: CONTROL –MUST have app-semantics –Often may want user interaction UI models of temporal interest Data Dissemination –Adaptively choosing what to send, what to cache

Dataflow Parallelism in DBs Volcano: “exchange” iterator [Graefe] –encapsulate exchange logic in an iterator –not in the dataflow system –Box-and-arrow programming can ignore parallelism
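
A rough sketch of the encapsulation idea: the partitioning logic lives inside one operator, and each consumer just sees an ordinary stream. This is not Volcano's implementation; the function `exchange`, its parameters, and the single-process simulation with Python generators are assumptions made for illustration, and `key` is assumed to return an integer.

```python
from collections import deque


def exchange(child, num_consumers, key):
    """Hash-partition the rows of `child` into one stream per consumer."""
    queues = [deque() for _ in range(num_consumers)]
    source = iter(child)

    def port(i):
        while True:
            # Pull from the shared child until something lands in our queue.
            while not queues[i]:
                try:
                    row = next(source)
                except StopIteration:
                    return  # child exhausted and nothing left for consumer i
                queues[key(row) % num_consumers].append(row)
            yield queues[i].popleft()

    return [port(i) for i in range(num_consumers)]


# Each consumer iterates over its port with no knowledge of the partitioning.
ports = exchange(range(10), 2, key=lambda r: r)
print(list(ports[0]))  # → [0, 2, 4, 6, 8]
print(list(ports[1]))  # → [1, 3, 5, 7, 9]
```

Code downstream of a port is written exactly as it would be against a single input, which is the sense in which box-and-arrow programming can ignore parallelism.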

River We built the world’s fastest sorting machine –On the “NOW”: 100 Sun workstations + SAN –Only beat the record under ideal conditions No such thing in practice! River: adaptive dataflow on clusters –One main idea: Distributed Queues adaptive exchange operator –Simplifies management and programming
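
Why a record set under ideal conditions is fragile, and what a distributed queue buys, can be shown with a toy simulation. The slides do not describe River's queueing mechanism in detail, so this sketch assumes the simplest reading: consumers pull from a shared queue at their own rate instead of being handed a fixed share up front. The function names and the per-step speed model are hypothetical.

```python
from collections import deque


def run_static(items, speeds):
    """Round-robin the work up front; speeds[i] = items consumer i handles per step."""
    queues = [deque() for _ in speeds]
    for i, item in enumerate(items):
        queues[i % len(speeds)].append(item)
    steps = 0
    while any(queues):
        for q, speed in zip(queues, speeds):
            for _ in range(min(speed, len(q))):
                q.popleft()
        steps += 1
    return steps


def run_distributed_queue(items, speeds):
    """Consumers pull from one shared queue, so fast machines take more work."""
    shared = deque(items)
    steps = 0
    while shared:
        for speed in speeds:
            for _ in range(min(speed, len(shared))):
                shared.popleft()
        steps += 1
    return steps


speeds = [4, 4, 1]  # one consumer is perturbed and runs slowly
print(run_static(range(120), speeds))             # → 40
print(run_distributed_queue(range(120), speeds))  # → 14
```

With static partitioning the slow consumer dictates the finish time; with the shared queue the total rate is the sum of the individual rates.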

River

Multi-Operator Query Plans Deal with pipelines of commutative operators Adapt at finer granularity than current DBMSs

Continuous Adaptivity: Eddies A pipelining tuple-routing iterator –just like join or sort or exchange Works great with other pipelining operators –like Ripple Joins, online reordering, etc. [Figure: Eddy] Avnur & Hellerstein, SIGMOD 2000

Continuous Adaptivity: Eddies How to order and reorder operators over time –based on performance, economic/admin feedback Vs. River: –River optimizes each operator “horizontally” –Eddies optimize a pipeline “vertically” [Figure: Eddy]

Continuous Adaptivity: Eddies Adjusts flow adaptively –Tuples routed through ops in different orders –Visit each op once before output Naïve routing policy: –All ops fetch from eddy as fast as possible A la River –Turns out, doesn’t quite work Only measures rate of work, not benefit

An Aside: n-Arm Bandits A little machine learning problem: –Each arm pays off differently –Explore? Or Exploit? Sometimes want to randomly choose an arm Usually want to go with the best If probabilities are stationary, dampen exploration over time
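
One common way to balance exploring and exploiting is an epsilon-greedy policy whose exploration rate shrinks over time. The slide does not name a specific policy, so the sketch below is only one possibility; the function name, the Bernoulli payoff model, and the 1/sqrt(t) decay schedule are assumptions.

```python
import random


def epsilon_greedy(payoff_probs, pulls, seed=0):
    """Pull arms with Bernoulli payoffs; return how often each arm was pulled."""
    rng = random.Random(seed)
    n = len(payoff_probs)
    counts = [0] * n
    totals = [0.0] * n
    for t in range(1, pulls + 1):
        epsilon = 1.0 / t ** 0.5  # dampen exploration over time
        if 0 in counts or rng.random() < epsilon:
            arm = rng.randrange(n)  # explore: pick an arm at random
        else:
            # exploit: pick the arm with the best observed average payoff
            arm = max(range(n), key=lambda a: totals[a] / counts[a])
        reward = 1.0 if rng.random() < payoff_probs[arm] else 0.0
        counts[arm] += 1
        totals[arm] += reward
    return counts


print(epsilon_greedy([0.2, 0.8], pulls=2000))
```

Dampening only makes sense when the payoff probabilities are stationary; if they drift, as operator costs and selectivities do in a volatile dataflow, some exploration has to continue indefinitely.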

Eddies with Lottery Scheduling Operator gets 1 ticket when it takes a tuple –Favor operators that run fast (low cost) Operator loses a ticket when it returns a tuple –Favor operators with high rejection rate (low selectivity) Lottery Scheduling: –When two ops vie for the same tuple, hold a lottery –Never let any operator go to zero tickets Support occasional random “exploration” Set up “inflation” (forgetting) to adapt over time –E.g. tix’ = α · oldtix + newtix
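
The ticket scheme can be sketched as a single-threaded simulation over selection operators. This is a simplification, not the eddy from the paper: tuples are processed one at a time, so the "runs fast" effect (which arises from operators competing for tuples concurrently) is not modeled, and the forgetting step is left out. The names `Op` and `eddy` and the example predicates are hypothetical.

```python
import random


class Op:
    """A selection operator with a ticket balance (hypothetical example)."""

    def __init__(self, name, predicate):
        self.name = name
        self.predicate = predicate
        self.tickets = 1  # never allowed to drop to zero
        self.calls = 0


def eddy(tuples, ops, seed=0):
    """Route each tuple through every operator once, in lottery-chosen order."""
    rng = random.Random(seed)
    output = []
    for tup in tuples:
        pending = list(ops)  # operators this tuple has not visited yet
        alive = True
        while pending and alive:
            # Lottery among eligible operators, weighted by ticket count.
            op = rng.choices(pending, weights=[o.tickets for o in pending])[0]
            pending.remove(op)
            op.calls += 1
            op.tickets += 1  # credit: the operator took a tuple
            if op.predicate(tup):
                op.tickets = max(1, op.tickets - 1)  # debit: it returned the tuple
            else:
                alive = False  # tuple rejected, nothing returned to the eddy
        if alive:
            output.append(tup)
    return output


ops = [
    Op("passes_90_percent", lambda t: t % 10 != 0),
    Op("passes_20_percent", lambda t: t % 5 == 0),
]
result = eddy(range(1000), ops)
print(len(result), [(o.name, o.tickets, o.calls) for o in ops])
```

In this toy, the operator that rejects more tuples accumulates tickets and therefore tends to be visited first, which reduces the total number of operator calls compared with always running the less selective operator first.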

Promising! Initial performance results Ongoing work on proofs of convergence –have analysis for constrained case

To Be Continued Tune & formalize policy Competitive eddies –Source & Join selection –Requires duplicate management Parallelism –Eddies + Rivers? Reliability –Long-running flows –Rivers + RAID-style computation [Figure: Eddy with sources R1, R2, R3 and S1, S2, S3; operators labeled hash, block, index1, index2]

To Be Continued, cont. What about wide area? –data reduction –sensor fusion –asynchronous communication Continuous queries –events –disconnected operation Lower-level eventflow? –can eddies, rivers, etc. be brought to bear on programming?

Telegraph: An Adaptive Dataflow System An adaptive dataflow system Currently cluster + http –Rivers and Eddies –Web wrappers Sensor nets next Target applications –Facts and Figures on the Web (FFW) –Distributed Introspection Services –Sensor Stream Services w/Mike Franklin, Sirish Chandrasekaran, Amol Deshpande, Nick Lanham, Sam Madden, VijayShankar Raman, Mehul Shah

Facts & Figures on the Web “Deep” Web –“Hidden” Web, “Dark Matter” More interesting: Facts & figures, not text –“search” is not the main problem “search” was always easy ranking often not apropos to facts –combine, transform, summarize, analyze

FFW Election 2000 Campaign Finance Drill-down –Bush/Gore donations Personal and industrial –Industry data –Neighborhood data –Personal data –Historical voting data Live demo, online aggregation

Web Research Revisited Crawling, Caching, Relationship Graphs –Transitive Closure –The Berkeley Bindings & graph analysis? –Form identification & APIs –“Semantic” caching Socio-Techno-Legal Issues –Privacy: Statistical DBs + Federation WhitePages |x| WhoIs |x| DoubleClick |x| CDC Wonder –Stats in the wrong hands –Accuracy of derived results –Intellectual property Etc!

Summary Adaptive software systems must happen –federation & scaling require it –systems and stats must marry Dataflow programming natural –for many applications –best hope for large-scale apps Terrific nexus of research –DB, Networking, Learning/Stat –Lots of work to be done! Drop by the Telegraph FFW demo! –

Backup slides The rest of the slides are backup to answer questions…

Prior Progress in DB Adaptivity Per-query adaptivity –E.g. Mariposa [Sto95] 1st distributed DBMS to consider scalability economic APIs for federation, give limited adaptivity too One-time intra-query –DEC Rdb competition [AZ96] –Sampling [lots] Intra-query, inter-operator –“Query Scrambling”: reoptimize in face of delays [UFA98] –Kabra/DeWitt ‘98: dam the flow, reoptimize downstream