Data Streams, Message Brokers, Sensor Nets, and Other Strange Places to Run Database Queries Michael Franklin UC Berkeley July 2003.

Data Everywhere  Increasingly ubiquitous networking at all scales.  ad hoc sensor nets, wireless, global Internet  Explosion in number, types, and locations of data sources and sinks.  mobile devices, P2P networks, data centers  Emerging software infrastructure to put it all together.  pub/sub, XML, web services, …

Data Management in a Networked World  Data is the crucial resource for emerging networked applications.  Database techniques are all about data organization and access.  They can be adapted for network-centric environments.  In particular, query processing can play a central role in a number of non-traditional settings. “When processing, storage, and transmission cost micro-dollars, the only real value is the data and its organization.” (Jim Gray’s 1998 Turing Award Paper)

Networked Data Management Projects @UCB-DB Group  GridDB - Relational interaction model for Scientific Grid Computing. [SIGMOD 03 Demo]  MobiScope - Distributed processing for Location-based Services [MDM 03]  PIER - P2P Data Management [VLDB 03]  TelegraphCQ - Adaptive Dataflow Processing for Data Streams. [CIDR 03; SIGMOD 03 Demo]  TinyDB - Sensor Networks for environmental monitoring [OSDI 02; SIGMOD 03]  YFilter - XML Message Brokering [ICDE 02 Demo; VLDB 03]

Why Database Queries?  Declarative approach.  Programmer productivity.  Robustness to change.  Let the system manage efficiency.  Semantics and High-level operators.  Framework for correctness criteria.  Pushing semantics down enables smarter implementations, code re-use.  Natural mapping of dataflow processing.  Query plans are networks of operators.  Query/Data duality enables intelligent routing. [Callouts: “These are the traditional arguments”; “Here’s why the techniques carry over”]

Query Plans and Operators  System handles query plan generation & optimization; ensures correct execution.
SELECT eid, ename, title FROM Emp E WHERE E.sal > $50K
SELECT E.loc, AVG(E.sal) FROM Emp E GROUP BY E.loc HAVING COUNT(*) > 5
SELECT COUNT(DISTINCT E.eid) FROM Emp E, Proj P, Asgn A WHERE E.eid = A.eid AND P.pid = A.pid AND E.loc <> P.loc
[Figure: query plans over the Employees, Projects, and Assignments tables: Select over Emp; Group(agg) and Having over Emp; Count distinct over joins of Emp, Asgn, and Proj]
 Issues: Operator ordering, physical operator choice, caching, access path (index) use, …

“Traditional” Distributed Queries  Transparency  Query writers can be oblivious to distribution.  System does plan generation and optimization; ensures correct execution. ©1998 Ozsu and Valduriez   Issues: operator placement, data placement, physical operators, caching, replication, synchronization,…

Beyond Emps and Depts  In emerging networked data environments, queries can also be used for:  Monitoring  Real-time Analysis  Actuation  Routing  Transformation  Service Composition  Definition, Naming, and Access Rights

New QP Scenarios  Sensor Networks  Message Brokers  Data Streams  Information/Application Integration

Monitoring (1) - Sensor Nets  Tiny devices monitor the physical environment.  Berkeley “motes”, Smart Dust, RFID, …  Apps: Transportation, Environmental, Energy, NBC, …  Form ad hoc networks that aggregate and communicate streams of values.  E.g., Mica Mote: 4 MHz, 8-bit Atmel RISC uProc, 40 kbit Radio, 4 K RAM, 128 K Program Flash, 512 K Data Flash, AA battery pack. See, e.g., TinyOS http://webs.cs.berkeley.edu/tos/ and TinyDB http://telegraph.cs.berkeley.edu/tinydb

Sensor Net Sample Apps Traditional monitoring apparatus. Earthquake shake-tests. Vehicle detection: sensors along a road collect data about passing vehicles. Habitat monitoring: storm petrels on Great Duck Island, microclimates on James Reserve.

Declarative Queries in Sensor Nets  Many sensor network applications can be described using query language primitives.  Potential for tremendous reductions in development and debugging effort.
“Report the light intensities of the bright nests.”
SELECT nestNo, light FROM sensors WHERE light > 400 EPOCH DURATION 1s
Sensors:
Epoch  nestNo  Light  Temp  Accel  Sound
0      1       455    x     x      x
0      2       389    x     x      x
1      1       422    x     x      x
1      2       405    x     x      x
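
A rough Python sketch of what this query asks for, using the sample readings on this slide and treating each epoch as one evaluation round. The `Reading` type and the `bright_nests` function are illustrative names chosen here; they are not part of TinyDB.

```python
from collections import namedtuple

# One row of the slide's "sensors" table (temp/accel/sound columns omitted).
Reading = namedtuple("Reading", ["epoch", "nest_no", "light"])

SENSORS = [
    Reading(0, 1, 455),
    Reading(0, 2, 389),
    Reading(1, 1, 422),
    Reading(1, 2, 405),
]

def bright_nests(readings, threshold=400):
    """SELECT nestNo, light FROM sensors WHERE light > threshold, reported per epoch."""
    result = {}
    for r in readings:
        if r.light > threshold:
            result.setdefault(r.epoch, []).append((r.nest_no, r.light))
    return result

print(bright_nests(SENSORS))
# {0: [(1, 455)], 1: [(1, 422), (2, 405)]}
```

Nest 2 drops out of epoch 0 because its light reading (389) is below the threshold.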

Aggregation Query Example  “Count the number of occupied nests in each loud region of the island.”
SELECT region, CNT(occupied), AVG(sound) FROM sensors GROUP BY region HAVING AVG(sound) > 200 EPOCH DURATION 10s
Result (regions w/ AVG(sound) > 200):
Epoch  region  CNT(…)  AVG(…)
0      North   3       360
0      South   3       520
1      North   3       370
1      South   3       520

Sensor Queries @ 10000 Ft  Written in SQL, with extensions for:  Sample rate  Offline delivery  Temporal aggregation  (Almost) all queries are continuous and periodic [Figure: network of nodes A-F with a Query label and the node sets {D,E,F}, {B,D,E,F}, {A,B,C,D,E,F}]

TAG: Tiny AGgregation (Sam Madden)  In-network processing  Reduces costs depending on type of aggregates  Supports “spatial aggregation”  Exploitation of operator, functional semantics  Part of “TinyDB” system available at http://telegraph.cs.berkeley.edu/tinydb Tiny AGgregation (TAG), Madden, Franklin, Hellerstein, Hong. OSDI 2002.

Aggregation Framework  As in extensible databases, we support any aggregation function conforming to: Agg_n = {f_merge, f_init, f_evaluate}
f_merge(<a1>, <a2>) → <a12>
f_init(a0) → <a0>
f_evaluate(<a1>) → aggregate value
(Merge must be associative and commutative!) Angle brackets denote partial state records, which is what allows partial aggregation.
Example: Average
AVG_merge(<S1, C1>, <S2, C2>) → <S1 + S2, C1 + C2>
AVG_init(v) → <v, 1>
AVG_evaluate(<S1, C1>) → S1 / C1
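
The init/merge/evaluate triple for AVG can be sketched in Python as below. This is a minimal illustration rather than TinyDB code: the `Avg` class, the `subtree_psr` helper, the routing tree, and the sensor readings are all made up for the example. A partial state record is represented as a `(sum, count)` pair.

```python
class Avg:
    """AVG expressed as init / merge / evaluate over (sum, count) partial state."""

    @staticmethod
    def init(v):
        return (v, 1)

    @staticmethod
    def merge(a, b):
        # Associative and commutative, so children can be merged in any order.
        return (a[0] + b[0], a[1] + b[1])

    @staticmethod
    def evaluate(a):
        return a[0] / a[1]


def subtree_psr(agg, node, children, readings):
    """Partial state record for the subtree rooted at `node`."""
    psr = agg.init(readings[node])
    for child in children.get(node, ()):
        psr = agg.merge(psr, subtree_psr(agg, child, children, readings))
    return psr


# Hypothetical routing tree (node 1 is the root) and made-up readings.
CHILDREN = {1: [2, 3], 3: [4], 4: [5]}
READINGS = {1: 10, 2: 20, 3: 30, 4: 40, 5: 50}

root_psr = subtree_psr(Avg, 1, CHILDREN, READINGS)
print(root_psr, Avg.evaluate(root_psr))  # (150, 5) 30.0
```

Only the fixed-size `(sum, count)` pair travels up the tree; the individual readings never leave the nodes that sampled them.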

TAG: Pipelined Aggregates  After query propagates, during each epoch:  Each sensor samples local sensors once  Combines them with Partial State Records (PSRs) from children  Outputs PSR representing aggregate state in the previous epoch.  After (d-1) epochs, PSR for the whole tree output at root  d = Depth of the routing tree  If desired, partial state from top k levels could be output in k-th epoch  To avoid combining PSRs from different epochs, sensors must cache values from children [Figure: routing tree over sensors 1-5. A value from 5 produced at time t arrives at 1 at time (t+3); a value from 2 produced at time t arrives at 1 at time (t+1).]

Illustration: Pipelined Aggregation  SELECT COUNT(*) FROM sensors, over a routing tree of depth d with sensors 1-5. Each row shows the count that every sensor reports in that epoch:
Epoch  Sensor 1  Sensor 2  Sensor 3  Sensor 4  Sensor 5
1      1         1         1         1         1
2      3         1         2         2         1
3      4         1         3         2         1
4      5         1         3         2         1
5      5         1         3         2         1
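
The per-epoch counts in this illustration can be reproduced with a short Python simulation. The tree shape used here (1 is the root with children 2 and 3, 4 is a child of 3, 5 is a child of 4) is an assumption inferred from the reported counts and arrival times, and `pipelined_count` is a name chosen for this sketch. In every epoch each sensor reports 1 for itself plus whatever its children reported in the previous epoch.

```python
def pipelined_count(children, nodes, epochs):
    """Simulate pipelined COUNT(*): one row of per-sensor reports per epoch."""
    prev = {n: 0 for n in nodes}  # nothing heard before the first epoch
    history = []
    for _ in range(epochs):
        # Own sample (1) combined with the children's PSRs from the previous epoch.
        cur = {n: 1 + sum(prev[c] for c in children.get(n, ())) for n in nodes}
        history.append([cur[n] for n in nodes])
        prev = cur
    return history


# Assumed routing tree: parent -> list of children.
TREE = {1: [2, 3], 3: [4], 4: [5]}

for epoch, row in enumerate(pipelined_count(TREE, [1, 2, 3, 4, 5], 5), start=1):
    print(epoch, row)
# 1 [1, 1, 1, 1, 1]
# 2 [3, 1, 2, 2, 1]
# 3 [4, 1, 3, 2, 1]
# 4 [5, 1, 3, 2, 1]
# 5 [5, 1, 3, 2, 1]
```

The root first reports the full count of 5 in epoch 4, once the value from sensor 5 has had three epochs to travel up the tree.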

Simulation Results  2500 nodes, 50x50 grid, depth = ~10, neighbors = ~20 [Figure: bytes transmitted]

Optimization: “Snooping”  Insight: Shared channel enables optimizations  Suppress messages that won’t affect aggregate  E.g., in a MAX query, sensor with value v hears a neighbor with value ≥ v, so it doesn’t report  Applies to all exemplary, monotonic aggregates  Learn about query advertisements it missed  If a sensor shows up in a new environment, it can learn about queries by looking at neighbors messages.  Root doesn’t have to explicitly rebroadcast query!
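
A toy Python sketch of the MAX suppression rule, assuming all nodes share one radio neighbourhood and report one after another so that each node has overheard every earlier report. Both function names are illustrative and not taken from TinyDB.

```python
def should_report_max(own_value, overheard):
    """Stay silent if any overheard value is >= our own; it cannot change the MAX."""
    return all(v < own_value for v in overheard)


def snooping_round(values):
    """Return the values actually transmitted when nodes report in the given order."""
    sent = []
    for v in values:
        if should_report_max(v, sent):
            sent.append(v)
    return sent


values = [5, 3, 9, 9, 2]
sent = snooping_round(values)
print(sent)  # [5, 9]
```

Three of the five messages are suppressed, yet the maximum of the transmitted values is still the true maximum.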

Optimization: Hypothesis Testing  Insight: Root can provide information that will suppress readings that cannot affect the final aggregate value.  E.g. Tell all the nodes that the MIN is definitely < 50; nodes with value ≥ 50 need not participate.  Depends on monotonicity  How is hypothesis computed?  Blind guess  Statistically informed guess  Observation over first few levels of tree / rounds of aggregate
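
The MIN example can be sketched in Python as a filter applied at each node. The readings and node names below are made up, and `participants` is an illustrative helper rather than a TinyDB function.

```python
def participants(readings, hypothesis):
    """Nodes that still need to report when the root announces 'MIN < hypothesis'."""
    return {node: v for node, v in readings.items() if v < hypothesis}


# Hypothetical readings keyed by node name.
READINGS = {"n1": 72, "n2": 41, "n3": 50, "n4": 63}

active = participants(READINGS, 50)
print(active)  # {'n2': 41}
```

As long as the hypothesis is correct, the minimum over the participating nodes equals the minimum over all nodes. If the guess is too tight, no node responds; the slide does not say how the root recovers, so that case is left out of this sketch.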

Experiment: Hypothesis Testing Uniform Value Distribution, Dense Packing, Ideal Communication

Taxonomy of Aggregates  TAG insight: classify aggregates according to various functional properties  Yields a general set of optimizations that can automatically be applied
Property               Examples                                    Affects
Partial State          MEDIAN: unbounded, MAX: 1 record            Effectiveness of TAG
Duplicate Sensitivity  MIN: dup. insensitive, AVG: dup. sensitive  Routing Redundancy
Exemplary vs. Summary  MAX: exemplary, COUNT: summary              Applicability of Sampling, Effect of Loss
Monotonic              COUNT: monotonic, AVG: non-monotonic        Hypothesis Testing, Snooping

ACQP: data-collection-aware query processing (“acquisitional query processing”)  Issues addressed:  How does the user control acquisition?  Rates or lifetimes  Event-based triggers  How should the query be processed?  Sampling as a first class operation  Events – join duality  Which nodes have relevant data?  Which samples should be transmitted? Madden, Franklin, Hellerstein, and Hong. The Design of An Acquisitional Query Processor. SIGMOD 2003.

Sensor Query Processing Summary  Higher-level programming abstractions for sensor networks are necessary.  Aggregation is a fundamental operation  Semantically aware optimizations  Close integration with network  ACQP: Languages, indices, approximations that give user control over which data enters the system.  Wealth of open research problems:  Error tolerance, topologies, heterogeneity, spatial processing, routing strategies, operators, actuation, …  Combines database, network, and device issues

Web Services/Message Brokers A platform for dynamic, loosely-coupled integration of enterprise applications and data. Interaction accomplished through exchange of messages in the wide area. (e.g., Adam Bosworth’s VLDB 02 keynote: http://www.cs.ust.hk/vldb2002/VLDB2002-proceedings/slides/S01P01slides.pdf)

Underlying Technology: Filtering The challenge is to efficiently and quickly match incoming XML documents against the potentially huge set of user profiles. [Figure: data sources, XML conversion, XML documents, filter engine, user profiles, filtered data, users]

Message Brokers  Message Brokers perform three main tasks:  Filtering - matching of interests.  Transformation - format conversion for app integration and preferences.  Delivery - moving bits through the overlay network  Must be lightweight and scalable.  Effectively they are high-function routers.  Large-scale deployments may entail handling tens or hundreds of thousands of queries (subscriptions)  XML is a natural substrate.

YFilter Message Broker (Yanlei Diao [VLDB 03])

XQuery-based Subscriptions  A query consists of a constant tag and an FLWR expression  A for clause: a variable and a path expression  An optional where clause: conjunctive predicates  A return clause: interleaved constant tags and path expressions  where and return clause paths are relative
{ for $s in document(“doc.xml”)//section
  where $s//figure/title = “XML processing”
  return { $s/title } { $s//figure } }

YFilter: Shared Path Matching  For large-scale systems, shared processing is essential.  YFilter uses an NFA-based approach to share path matching work among queries. [Figure: NFA fragments for the location steps /a, //a, /*, and //*]

Constructing a Query NFA  Concatenate NFA fragments for location steps in a path expression. [Figure: the fragments for /a and //b concatenated into the NFA for the query “/a//b”]

Constructing the Combined NFA  Queries: Q1=/a/b, Q2=/a/c, Q3=/a/b/c, Q4=/a//b/c, Q5=/a/*/b, Q6=/a//c, Q7=/a/*/*/c, Q8=/a/b/c [Figure: the combined NFA; accepting states are labeled {Q1}, {Q2}, {Q3, Q8}, {Q4}, {Q5}, {Q6}, {Q7}]

NFA Execution [Figure: the combined NFA (states 1-13), an XML fragment, and the runtime stack of active state sets, starting from the initial state 1, as each element is read; matches reported include Q1 and {Q3, Q8}]
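
To make the shared-NFA idea concrete, here is a small Python sketch under several simplifying assumptions: only the `/`, `//`, and `*` location steps are supported, queries that share a prefix share the states for that prefix, and the recursion over the parsed tree stands in for the explicit runtime stack of an event-driven filter. The `State` class and all function names are invented for this example and are not YFilter's own code; the queries are a subset of those listed on the combined-NFA slide.

```python
import re
import xml.etree.ElementTree as ET


class State:
    """One NFA state; a prefix shared by several queries maps to the same states."""

    def __init__(self, self_loop=False):
        self.edges = {}             # element name or "*" -> next State
        self.descendant = None      # epsilon edge used by a following "//" step
        self.self_loop = self_loop  # True for "//" states, which stay active at any depth
        self.accepts = set()        # ids of queries that end in this state


def add_query(root, qid, path):
    """Walk or extend the shared NFA along the location steps of `path`."""
    state = root
    for axis, name in re.findall(r"(//|/)([^/]+)", path):
        if axis == "//":
            if state.descendant is None:
                state.descendant = State(self_loop=True)
            state = state.descendant
        state = state.edges.setdefault(name, State())
    state.accepts.add(qid)


def closure(states):
    """Follow epsilon edges so "//" states become active together with their source."""
    todo, seen = list(states), set(states)
    while todo:
        d = todo.pop().descendant
        if d is not None and d not in seen:
            seen.add(d)
            todo.append(d)
    return seen


def step(active, tag):
    """Active state set after reading the start tag of element `tag`."""
    nxt = set()
    for s in active:
        if s.self_loop:
            nxt.add(s)
        for label in (tag, "*"):
            if label in s.edges:
                nxt.add(s.edges[label])
    return closure(nxt)


def match(root, xml_text):
    """Return the ids of all queries matched anywhere in the document."""
    matched = set()

    def visit(elem, active):
        nxt = step(active, elem.tag)
        for s in nxt:
            matched.update(s.accepts)
        for child in elem:  # returning from the recursion plays the role of a stack pop
            visit(child, nxt)

    visit(ET.fromstring(xml_text), closure({root}))
    return matched


QUERIES = {"Q1": "/a/b", "Q2": "/a/c", "Q3": "/a/b/c",
           "Q4": "/a//b/c", "Q6": "/a//c", "Q7": "/a/*/*/c"}
nfa = State()
for qid, path in QUERIES.items():
    add_query(nfa, qid, path)

print(sorted(match(nfa, "<a><b><c/></b></a>")))  # ['Q1', 'Q3', 'Q4', 'Q6']
```

All six queries share the single state reached by `/a`, so reading the `a` element costs one transition no matter how many subscriptions begin with that step.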

Performance Evaluation Varying number of distinct queries (NITF, D=6, W=0.2, //=0.2) With YFilter, path matching is no longer the dominant cost! YFilter: prefix sharing XFilter (list balance): no sharing Hybrid approach: share substrings containing ‘/’ only YFilter is significantly faster (around 30 ms for 150K queries) Parsing not included: Xerces (168 ms) Java XML Pack (141 ms) Saxon (86 ms).

Message Transformation  Change YFilter to output streams of “path tuples”.  Each path tuple contains a sequence of node ids representing the elements that matched the path.  This output is post-processed using relational-style operators to produce customized messages.  Three approaches (differ in the extent to which they push work to the engine)  PathSharing-F: For clause paths only  PathSharing-FW: For & Where clause paths  PathSharing-FWR: For, Where & Return  Inherent tension between path sharing and result customization!

Message Broker – Wrap Up Sharing is the key to performance  NFA provides excellent scalability/performance  PathSharing-FWR performs best, when combined with optimizations based on the queries and DTD.  When the post-processing is shared, even more scalability can be achieved.  This sharing is facilitated by using relational-like query plans. On-going work - How to deploy in the wide area?  Distributed Filtering and Content Delivery Network  Combining distributed query processing and state-of-the-art application-level multicast protocols.  What semantics can/should be provided? For more information see: www.cs.berkeley.edu/~daioyl/yfilter

Monitoring (2) : Data Streams  Streaming Data  Network monitors  news feeds  stock tickers  B2B and Enterprise apps  Supply-Chain, CRM  Trade Reconciliation, Order Processing etc.  (Quasi) real-time flow of events and data  Must manage these flows to drive business (and other) processes.  Mine flows to create and adjust business rules.  Can also “tap into” flows for on-line analysis.

TelegraphCQ Overview  An adaptive system for large-scale shared dataflow processing.  Based on an extensible set of operators: 1) Ingress (data access) operators  Screen Scraper, Napster/Gnutella readers, File readers, Sensor Proxies 2) Non-Blocking Data processing operators  Selections (filters), XJoins, … 3) Adaptive Routing Operators  Eddies, STeMs, FLuX, etc.  Operators connected through “Fjords” [MF02]  queue-based framework unifying push & pull.

SteMs: “State Modules” [Raman & Hellerstein ICDE 03] A generalization of the symmetric hash join (n-way). SteMs maintain intermediate state for multiple joins. Use Eddy to route tuples through the necessary modules. SteMs + Eddy reduce need for optimizer, increasing adaptivity in volatile streaming environments. [Figure: inputs A, B, C, D, each with its own hash-based module (Hash A, Hash B, Hash C, Hash D)]
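
For reference, the two-way symmetric hash join that SteMs generalize can be sketched in Python as follows. This is only the binary base case, with no Eddy and no n-way routing, and the class and method names are invented for the example. Each arriving tuple is first inserted into the hash table for its own side (build) and then looked up in the other side's table (probe), so results are produced incrementally without blocking on either input.

```python
from collections import defaultdict


class SymmetricHashJoin:
    """Non-blocking equi-join of two streams, 'left' and 'right'."""

    def __init__(self):
        self.tables = {"left": defaultdict(list), "right": defaultdict(list)}

    def insert(self, side, key, tup):
        """Process one arriving tuple and return the join results it produces."""
        other = "right" if side == "left" else "left"
        self.tables[side][key].append(tup)         # build into own table
        matches = self.tables[other].get(key, [])  # probe the other table
        if side == "left":
            return [(tup, m) for m in matches]
        return [(m, tup) for m in matches]


join = SymmetricHashJoin()
print(join.insert("left", 1, "a1"))   # []
print(join.insert("right", 1, "b1"))  # [('a1', 'b1')]
print(join.insert("left", 1, "a2"))   # [('a2', 'b1')]
print(join.insert("right", 2, "b2"))  # []
```

Because each side keeps its own table, the order in which the two streams interleave does not affect the final set of results, only when each result appears.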

TelegraphCQ Architecture [Figure: components shown are the TelegraphCQ Front End (Planner, Parser, Listener, Mini-Executor, Catalog), Split, the TelegraphCQ Back End (Modules, Scans, CQEddy), the TelegraphCQ Wrapper ClearingHouse, Wrappers, Proxy, Shared Memory, Buffer Pool, Disk, Query Plan Queue, Eddy Control Queue, and Query Result Queues, with numbered steps 1-9. Legend: Data Tuples; Query + Control; Data + Query.]

Semantics of data streams  Different notions of data streams  Ordered sequence of tuples  Bag of tuple/timestamp pairs [STREAM]  Mapping from time to sets of tuples  Data streams are unbounded  Windows: restrict data for a query  A stream can be transformed by:  Moving a window across it  A window can be moved by  Shifting its extremities  Changing its size [Figure: tuple sets over time: 1 → {t1,t2,t3}, 2 → {t2,t3,t4}, 3 → {t3,t4,t5}, 4 → {t4,t5,t6}, 5 → {t5,t6,t7}]
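
The mapping from time to tuple sets shown on this slide corresponds to a window of three tuples that shifts by one tuple per step. A minimal Python sketch, with `sliding_windows` as an illustrative name rather than anything from TelegraphCQ:

```python
from collections import deque


def sliding_windows(stream, size):
    """Yield each full window of `size` consecutive tuples, shifting by one tuple."""
    window = deque(maxlen=size)  # the oldest tuple falls out automatically
    for tup in stream:
        window.append(tup)
        if len(window) == size:
            yield tuple(window)


stream = ["t1", "t2", "t3", "t4", "t5", "t6", "t7"]
for time, tuples in enumerate(sliding_windows(stream, 3), start=1):
    print(time, tuples)
# 1 ('t1', 't2', 't3')
# 2 ('t2', 't3', 't4')
# 3 ('t3', 't4', 't5')
# 4 ('t4', 't5', 't6')
# 5 ('t5', 't6', 't7')
```

Because the function is a generator, it works on unbounded streams: only the current window is ever held in memory.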

The StreaQuel Language  An extension of SQL  Operates exclusively on streams  Is closed under streams  Supports different ways to “create” streams  Infinite time-stamped tuple sequence  Traditional stable relations  Flexible windows: sliding, landmark, and more  Supports logical and physical time  When used with a cursor mechanism, allows clients to do their own window-based processing.  Target language for TelegraphCQ

Example – Landmark query

Current Status - TelegraphCQ  System has been developed by modifying PostgreSQL:  Re-used a lot of code:  Expression evaluator, semaphores, parser, planner  Successfully demonstrated at SIGMOD 2003.  Performance studies underway.  Beta version to be released Aug 03  Open Source (PostgreSQL license)  Shared joins with windows and aggregates  Archived/unarchived streams  A “hot” area: Several major streaming systems under development in the database community

Beyond Emps and Depts  Monitoring  TinyDB, TelegraphCQ, YFilter  Real-time Analysis  TinyDB and TelegraphCQ  Actuation  TinyDB, GridDB  Routing (queries and/or data), Transformation, Service Composition  all of the projects  Definition, Naming, and Access Rights  TelegraphCQ, but all should

Conclusions  Data is the crucial resource in emerging networked environments.  Database query processing techniques and insights can provide tremendous leverage.  Huge research opportunities for database, networking, and distributed systems researchers.  Breakthroughs will come from projects that span these areas.
