1
Data Streams, Message Brokers, Sensor Nets, and Other Strange Places to Run Database Queries. Michael Franklin, UC Berkeley, July 2003.
2
Data Everywhere. Increasingly ubiquitous networking at all scales: ad hoc sensor nets, wireless, global Internet. Explosion in number, types, and locations of data sources and sinks: mobile devices, P2P networks, data centers. Emerging software infrastructure to put it all together: pub/sub, XML, web services, …
3
Data Management in a Networked World. Data is the crucial resource for emerging networked applications. Database techniques are all about data organization and access. They can be adapted for network-centric environments. In particular, query processing can play a central role in a number of non-traditional settings. “When processing, storage, and transmission cost micro-dollars, the only real value is the data and its organization.” (Jim Gray’s 1998 Turing Award Paper)
4
Networked Data Management Projects @UCB-DB Group
GridDB - Relational interaction model for Scientific Grid Computing. [SIGMOD 03 Demo]
MobiScope - Distributed processing for Location-based Services. [MDM 03]
PIER - P2P Data Management. [VLDB 03]
TelegraphCQ - Adaptive Dataflow Processing for Data Streams. [CIDR 03; SIGMOD 03 Demo]
TinyDB - Sensor Networks for environmental monitoring. [OSDI 02; SIGMOD 03]
YFilter - XML Message Brokering. [ICDE 02 Demo; VLDB 03]
5
Why Database Queries? Declarative approach: programmer productivity; robustness to change; let the system manage efficiency. Semantics and high-level operators: framework for correctness criteria; pushing semantics down enables smarter implementations and code re-use. Natural mapping of dataflow processing: query plans are networks of operators; query/data duality enables intelligent routing. (Slide annotations: “These are the traditional arguments.” “Here’s why the techniques carry over.”)
6
Query Plans and Operators. System handles query plan generation & optimization; ensures correct execution.
Query 1:
  SELECT eid, ename, title
  FROM Emp E
  WHERE E.sal > $50K
Query 2:
  SELECT E.loc, AVG(E.sal)
  FROM Emp E
  GROUP BY E.loc
  HAVING COUNT(*) > 5
Query 3:
  SELECT COUNT(DISTINCT E.eid)
  FROM Emp E, Proj P, Asgn A
  WHERE E.eid = A.eid AND P.pid = A.pid AND E.loc <> P.loc
Issues: operator ordering, physical operator choice, caching, access path (index) use, …
[Figure: tables Employees, Projects, Assignments; plan for query 1: Select over Emp; plan for query 2: Group (agg) and Having over Emp; plan for query 3: Count distinct over two Joins of Emp, Asgn, and Proj]
7
“Traditional” Distributed Queries. Transparency: query writers can be oblivious to distribution. System does plan generation and optimization; ensures correct execution. Issues: operator placement, data placement, physical operators, caching, replication, synchronization, … (©1998 Ozsu and Valduriez)
8
Beyond Emps and Depts. In emerging networked data environments, queries can also be used for: Monitoring; Real-time Analysis; Actuation; Routing; Transformation; Service Composition; Definition, Naming, and Access Rights.
9
New QP Scenarios: Sensor Networks; Message Brokers; Data Streams; Information/Application Integration.
11
Monitoring (1) - Sensor Nets. Tiny devices monitor the physical environment: Berkeley “motes”, Smart Dust, RFID, … Apps: Transportation, Environmental, Energy, NBC, … They form ad hoc networks that aggregate and communicate streams of values. Software, e.g., TinyOS http://webs.cs.berkeley.edu/tos/ and TinyDB http://telegraph.cs.berkeley.edu/tinydb Hardware, e.g., Mica Mote: 4 MHz, 8-bit Atmel RISC uProc, 40 kbit radio, 4 K RAM, 128 K program flash, 512 K data flash, AA battery pack.
12
Sensor Net Sample Apps. Traditional monitoring apparatus. Earthquake shake-tests. Vehicle detection: sensors along a road collect data about passing vehicles. Habitat monitoring: storm petrels on Great Duck Island, microclimates on James Reserve.
13
Declarative Queries in Sensor Nets. “Report the light intensities of the bright nests.”
  SELECT nestNo, light
  FROM sensors
  WHERE light > 400
  EPOCH DURATION 1s
Sensors table:
  Epoch | nestNo | Light | Temp | Accel | Sound
  0     | 1      | 455   | x    | x     | x
  0     | 2      | 389   | x    | x     | x
  1     | 1      | 422   | x    | x     | x
  1     | 2      | 405   | x    | x     | x
Many sensor network applications can be described using query language primitives. Potential for tremendous reductions in development and debugging effort.
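To make the query semantics concrete, here is a minimal Python sketch of what the selection computes per epoch. It is only an illustration, not TinyDB code: the readings are the four rows of the slide's Sensors table, and the function name `bright_nests` and its `threshold` parameter are made up for this example.

```python
# Sample readings copied from the slide's Sensors table: (epoch, nestNo, light).
readings = [
    (0, 1, 455),
    (0, 2, 389),
    (1, 1, 422),
    (1, 2, 405),
]

def bright_nests(rows, threshold=400):
    """Mimic SELECT nestNo, light FROM sensors WHERE light > threshold,
    grouping the surviving rows by the epoch in which they were sampled."""
    result = {}
    for epoch, nest_no, light in rows:
        if light > threshold:
            result.setdefault(epoch, []).append((nest_no, light))
    return result

per_epoch = bright_nests(readings)
for epoch, rows in sorted(per_epoch.items()):
    print(epoch, rows)
```

Nest 2 is filtered out in epoch 0 (light 389) and reported in epoch 1 (light 405), which is the behavior a continuous, periodic query is meant to give.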
14
Aggregation Query Example. “Count the number of occupied nests in each loud region of the island.”
  SELECT region, CNT(occupied), AVG(sound)
  FROM sensors
  GROUP BY region
  HAVING AVG(sound) > 200
  EPOCH DURATION 10s
Result (regions w/ AVG(sound) > 200):
  Epoch | region | CNT(…) | AVG(…)
  0     | North  | 3      | 360
  0     | South  | 3      | 520
  1     | North  | 3      | 370
  1     | South  | 3      | 520
15
Sensor Queries @ 10000 Ft. Written in SQL with extensions for: sample rate, offline delivery, temporal aggregation. (Almost) all queries are continuous and periodic. [Figure: sensor nodes A, B, C, D, E, F receiving a query; node sets {D,E,F}, {B,D,E,F}, {A,B,C,D,E,F}]
16
TAG: Tiny AGgregation (Sam Madden). In-network processing: reduces costs depending on type of aggregates; supports “spatial aggregation”. Exploitation of operator and functional semantics. Part of the “TinyDB” system available at http://telegraph.cs.berkeley.edu/tinydb Reference: Tiny AGgregation (TAG), Madden, Franklin, Hellerstein, Hong. OSDI 2002.
17
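Aggregation Framework. As in extensible databases, we support any aggregation function conforming to:
  $Agg_n = \{f_{merge}, f_{init}, f_{evaluate}\}$
  $f_{merge}\{\langle a_1 \rangle, \langle a_2 \rangle\} \rightarrow \langle a_{12} \rangle$ (merge: associative, commutative!)
  $f_{init}\{a_0\} \rightarrow \langle a_0 \rangle$
  $f_{evaluate}\{\langle a_1 \rangle\} \rightarrow$ aggregate value
Here $\langle \cdot \rangle$ denotes a partial state record (partial aggregation).
Example: Average
  $AVG_{merge}\{\langle S_1, C_1 \rangle, \langle S_2, C_2 \rangle\} \rightarrow \langle S_1 + S_2, C_1 + C_2 \rangle$
  $AVG_{init}\{v\} \rightarrow \langle v, 1 \rangle$
  $AVG_{evaluate}\{\langle S_1, C_1 \rangle\} \rightarrow S_1 / C_1$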
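The three-function decomposition can be sketched in a few lines of Python. This is an illustrative sketch rather than TinyDB's implementation: partial state for AVG is assumed to be a (sum, count) pair, and the function names and sample readings are invented for the example.

```python
from functools import reduce

def avg_init(v):
    """Turn one sensor reading into a partial state record (sum, count)."""
    return (v, 1)

def avg_merge(a, b):
    """Combine two partial state records; associative and commutative."""
    return (a[0] + b[0], a[1] + b[1])

def avg_evaluate(a):
    """Turn the final partial state record into the aggregate value."""
    s, c = a
    return s / c

# Hypothetical readings, split across two subtrees of the routing tree.
readings = [10, 20, 30, 40]
left = reduce(avg_merge, map(avg_init, readings[:2]))
right = reduce(avg_merge, map(avg_init, readings[2:]))
root = avg_merge(left, right)
print(root, avg_evaluate(root))
```

Because merge is associative and commutative, each subtree can forward one small record instead of all of its raw readings, and the root still computes the exact average.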
18
TAG: Pipelined Aggregates. After the query propagates, during each epoch each sensor: samples its local sensors once; combines the samples with Partial State Records (PSRs) from its children; outputs a PSR representing aggregate state in the previous epoch. After (d-1) epochs, the PSR for the whole tree is output at the root (d = depth of the routing tree). If desired, partial state from the top k levels could be output in the k-th epoch. To avoid combining PSRs from different epochs, sensors must cache values from children. Example on a routing tree of nodes 1 to 5: a value from 5 produced at time t arrives at 1 at time (t+3); a value from 2 produced at time t arrives at 1 at time (t+1).
19
Illustration: Pipelined Aggregation. SELECT COUNT(*) FROM sensors. [Figure: routing tree of sensors 1 to 5, depth = d]
20
Illustration: Pipelined Aggregation, Epoch 1. SELECT COUNT(*) FROM sensors.
Partial counts output so far (rows: Epoch #, columns: Sensor #):
  Epoch # | 1 | 2 | 3 | 4 | 5
  1       | 1 | 1 | 1 | 1 | 1
21
Illustration: Pipelined Aggregation, Epoch 2. SELECT COUNT(*) FROM sensors.
Partial counts output so far (rows: Epoch #, columns: Sensor #):
  Epoch # | 1 | 2 | 3 | 4 | 5
  1       | 1 | 1 | 1 | 1 | 1
  2       | 3 | 1 | 2 | 2 | 1
22
Illustration: Pipelined Aggregation, Epoch 3. SELECT COUNT(*) FROM sensors.
Partial counts output so far (rows: Epoch #, columns: Sensor #):
  Epoch # | 1 | 2 | 3 | 4 | 5
  1       | 1 | 1 | 1 | 1 | 1
  2       | 3 | 1 | 2 | 2 | 1
  3       | 4 | 1 | 3 | 2 | 1
23
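Illustration: Pipelined Aggregation, Epoch 4. SELECT COUNT(*) FROM sensors.
Partial counts output so far (rows: Epoch #, columns: Sensor #):
  Epoch # | 1 | 2 | 3 | 4 | 5
  1       | 1 | 1 | 1 | 1 | 1
  2       | 3 | 1 | 2 | 2 | 1
  3       | 4 | 1 | 3 | 2 | 1
  4       | 5 | 1 | 3 | 2 | 1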
24
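Illustration: Pipelined Aggregation, Epoch 5. SELECT COUNT(*) FROM sensors.
Partial counts output so far (rows: Epoch #, columns: Sensor #):
  Epoch # | 1 | 2 | 3 | 4 | 5
  1       | 1 | 1 | 1 | 1 | 1
  2       | 3 | 1 | 2 | 2 | 1
  3       | 4 | 1 | 3 | 2 | 1
  4       | 5 | 1 | 3 | 2 | 1
  5       | 5 | 1 | 3 | 2 | 1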
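The epoch-by-epoch counts in this illustration can be reproduced with a short simulation. The sketch below is not TAG code. It assumes a routing tree in which sensors 2 and 3 are children of sensor 1, sensor 4 is a child of 3, and sensor 5 is a child of 4; that topology is inferred from the stated delays (a value from 5 reaches 1 after three epochs, a value from 2 after one) and is an assumption of the example, as are all names used.

```python
# Assumed routing tree (child -> parent), inferred from the delays on the slides.
PARENT = {2: 1, 3: 1, 4: 3, 5: 4}
SENSORS = [1, 2, 3, 4, 5]

def simulate_count(epochs):
    """Return, for each epoch, the partial COUNT output by sensors 1..5."""
    children = {s: [] for s in SENSORS}
    for child, parent in PARENT.items():
        children[parent].append(child)

    prev = {s: 0 for s in SENSORS}  # nothing has been received before epoch 1
    history = []
    for _ in range(epochs):
        # Each sensor counts itself plus the partial state records that its
        # children output in the previous epoch (pipelining).
        cur = {s: 1 + sum(prev[c] for c in children[s]) for s in SENSORS}
        history.append([cur[s] for s in SENSORS])
        prev = cur
    return history

history = simulate_count(5)
for epoch, row in enumerate(history, start=1):
    print(epoch, row)
```

Under this assumed tree the root's count reaches 5 in epoch 4 and stays there, because the deepest sensor is three hops away.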
25
Simulation Results. 2500 nodes, 50x50 grid, depth = ~10, neighbors = ~20. [Chart: bytes transmitted]
26
Optimization: “Snooping”. Insight: a shared channel enables optimizations. (1) Suppress messages that won’t affect the aggregate. E.g., in a MAX query, a sensor with value v hears a neighbor with value ≥ v, so it doesn’t report. Applies to all exemplary, monotonic aggregates. (2) Learn about query advertisements a sensor missed. If a sensor shows up in a new environment, it can learn about queries by looking at its neighbors’ messages. The root doesn’t have to explicitly rebroadcast the query!
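The message-suppression idea for MAX can be sketched as follows. This is a simplified illustration, not the TAG protocol: it assumes a single broadcast domain in which every sensor overhears every earlier report, and the function name and sample values are hypothetical.

```python
def snooping_max(values):
    """Sensors report in list order on a shared channel.

    A sensor stays silent if it has already overheard a value >= its own.
    Returns (maximum, list of values actually transmitted).
    """
    best_heard = None
    sent = []
    for v in values:
        if best_heard is not None and best_heard >= v:
            continue  # an overheard value already dominates; suppress the report
        sent.append(v)
        best_heard = v
    return best_heard, sent

maximum, transmitted = snooping_max([3, 7, 5, 7, 9, 2])
print(maximum, transmitted)
```

Only three of the six sensors transmit here, and the answer is unchanged, because MAX is exemplary and monotonic.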
27
Optimization: Hypothesis Testing. Insight: the root can provide information that will suppress readings that cannot affect the final aggregate value. E.g., tell all the nodes that the MIN is definitely < 50; nodes with value ≥ 50 need not participate. Depends on monotonicity. How is the hypothesis computed? Blind guess; statistically informed guess; observation over the first few levels of the tree / rounds of the aggregate.
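A minimal sketch of the MIN example, assuming the root's hypothesis is a simple upper bound broadcast to every node. The function name, the error handling for a wrong guess, and the sample values are assumptions made for illustration.

```python
def min_with_hypothesis(values, bound):
    """Root announces 'MIN is definitely < bound'.

    Only nodes whose value is below the bound participate.
    Returns (minimum, number of participating nodes).
    """
    participants = [v for v in values if v < bound]
    if not participants:
        # The hypothesis was wrong; a real system would need to retry
        # with a weaker bound or fall back to full aggregation.
        raise ValueError("no value is below the hypothesized bound")
    return min(participants), len(participants)

values = [72, 48, 91, 50, 13, 66]
minimum, reporting = min_with_hypothesis(values, bound=50)
print(minimum, reporting)
```

With a bound of 50, only two of the six nodes report, and the node holding exactly 50 stays silent as well.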
28
Experiment: Hypothesis Testing Uniform Value Distribution, Dense Packing, Ideal Communication
29
Taxonomy of Aggregates. TAG insight: classify aggregates according to various functional properties. Yields a general set of optimizations that can automatically be applied.
  Property              | Examples                                     | Affects
  Partial State         | MEDIAN: unbounded, MAX: 1 record             | Effectiveness of TAG
  Duplicate Sensitivity | MIN: dup. insensitive, AVG: dup. sensitive   | Routing Redundancy
  Exemplary vs. Summary | MAX: exemplary, COUNT: summary               | Applicability of Sampling, Effect of Loss
  Monotonic             | COUNT: monotonic, AVG: non-monotonic         | Hypothesis Testing, Snooping
30
ACQP: data-collection-aware query processing (“acquisitional query processing”). Issues addressed: How does the user control acquisition? Rates or lifetimes; event-based triggers. How should the query be processed? Sampling as a first-class operation; events – join duality. Which nodes have relevant data? Which samples should be transmitted? Reference: Madden, Franklin, Hellerstein, and Hong. The Design of an Acquisitional Query Processor. SIGMOD 2003.
31
Sensor Query Processing Summary. Higher-level programming abstractions for sensor networks are necessary. Aggregation is a fundamental operation: semantically aware optimizations; close integration with the network. ACQP: languages, indices, and approximations that give the user control over which data enters the system. Wealth of open research problems: error tolerance, topologies, heterogeneity, spatial processing, routing strategies, operators, actuation, … Combines database, network, and device issues.
32
New QP Scenarios: Sensor Networks; Message Brokers; Data Streams; Information/Application Integration.
34
Web Services/Message Brokers. A platform for dynamic, loosely-coupled integration of enterprise applications and data. Interaction is accomplished through the exchange of messages in the wide area. (e.g., Adam Bosworth’s VLDB 02 keynote: http://www.cs.ust.hk/vldb2002/VLDB2002-proceedings/slides/S01P01slides.pdf)
35
Underlying Technology: Filtering. The challenge is to efficiently and quickly match incoming XML documents against the potentially huge set of user profiles. [Figure components: Data Sources, XML Conversion, XML Documents, Filter Engine, User Profiles, Filtered Data, Users]
36
Message Brokers. Message brokers perform three main tasks: Filtering - matching of interests. Transformation - format conversion for app integration and preferences. Delivery - moving bits through the overlay network. Must be lightweight and scalable. Effectively they are high-function routers. Large-scale deployments may entail handling 10’s or 100’s of thousands of queries (subscriptions). XML is a natural substrate.
37
YFilter Message Broker (Yanlei Diao [VLDB 03])
38
XQuery-based Subscriptions. A query consists of a constant tag and an FLWR expression: a for clause (a variable and a path expression); an optional where clause (conjunctive predicates); a return clause (interleaved constant tags and path expressions). Where and return clause paths are relative.
  { for $s in document("doc.xml")//section
    where $s//figure/title = "XML processing"
    return { $s/title } { $s//figure } }
39
YFilter: Shared Path Matching. For large-scale systems, shared processing is essential. YFilter uses an NFA-based approach to share path matching work among queries. [Figure: NFA fragments for the four location steps /a, //a, /*, //*]
40
Constructing a Query NFA. Concatenate the NFA fragments for the location steps in a path expression. [Figure: fragments for /a and //b concatenated into the NFA for query “/a//b”]
41
Constructing the Combined NFA. Queries: Q1=/a/b, Q2=/a/c, Q3=/a/b/c, Q4=/a//b/c, Q5=/a/*/b, Q6=/a//c, Q7=/a/*/*/c, Q8=/a/b/c. [Figure: combined NFA with shared prefixes; accepting states labeled {Q1}, {Q2}, {Q3, Q8}, {Q4}, {Q5}, {Q6}, {Q7}]
42
NFA Execution. [Figure: the combined NFA (states 1 to 13, accepting states {Q1}, {Q2}, {Q3, Q8}, {Q4}, {Q5}, {Q6}, {Q7}) run over an XML fragment; a runtime stack of active state sets starts at {1}, is pushed on each element read and popped on each element end; matches shown: Q1; Q3 and Q8; Q5, Q6, and Q4]
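The shared-prefix idea and the stack-driven execution can be sketched in Python. This is a simplified illustration, not YFilter's implementation: instead of ε-transitions and self-looping `*` states, each active state carries a flag saying whether it was just reached (child and descendant steps apply) or is being kept alive for deeper elements (only descendant steps apply). The class and function names, and the sample document, are assumptions of the example; the eight queries are the ones listed on the combined-NFA slide.

```python
import io
import re
import xml.etree.ElementTree as ET

class State:
    def __init__(self):
        self.edges = {}    # (axis, name) -> State, shared by queries with the same prefix
        self.queries = []  # ids of the queries accepted at this state

def build_nfa(queries):
    """Insert every path into one automaton so common prefixes share states."""
    root = State()
    for qid, path in queries.items():
        state = root
        for step in re.findall(r"(//|/)([^/]+)", path):  # e.g. ("//", "b")
            state = state.edges.setdefault(step, State())
        state.queries.append(qid)
    return root

def run(root, xml_text):
    """Drive the automaton with start/end events; return ids of matched queries."""
    matched = set()
    stack = [{(root, True)}]  # sets of (state, just_reached)
    source = io.BytesIO(xml_text.encode())
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "end":
            stack.pop()  # backtrack to the parent element's active states
            continue
        active = set()
        for state, just_reached in stack[-1]:
            for (axis, name), target in state.edges.items():
                if axis == "/" and not just_reached:
                    continue  # a child step only applies directly below the state
                if name in ("*", elem.tag):
                    active.add((target, True))
                    matched.update(target.queries)
            if any(axis == "//" for axis, _ in state.edges):
                active.add((state, False))  # stay alive for deeper descendants
        stack.append(active)
    return matched

QUERIES = {"Q1": "/a/b", "Q2": "/a/c", "Q3": "/a/b/c", "Q4": "/a//b/c",
           "Q5": "/a/*/b", "Q6": "/a//c", "Q7": "/a/*/*/c", "Q8": "/a/b/c"}
nfa = build_nfa(QUERIES)
matches = run(nfa, "<a><b><c/></b></a>")
print(sorted(matches))
```

Q3 and Q8 are identical paths and end in the same accepting state, so matching one costs nothing extra for the other; that is the sharing the slides emphasize.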
43
Performance Evaluation. Varying number of distinct queries (NITF, D=6, W=0.2, //=0.2). Systems compared: YFilter (prefix sharing); XFilter, list balance (no sharing); Hybrid approach (shares substrings containing ‘/’ only). YFilter is significantly faster (around 30 ms for 150K queries). Parsing not included: Xerces (168 ms), Java XML Pack (141 ms), Saxon (86 ms). With YFilter, path matching is no longer the dominant cost!
44
Message Transformation. Change YFilter to output streams of “path tuples”. Each path tuple contains a sequence of node ids representing the elements that matched the path. This output is post-processed using relational-style operators to produce customized messages. Three approaches (they differ in the extent to which they push work to the engine): PathSharing-F: For clause paths only. PathSharing-FW: For & Where clause paths. PathSharing-FWR: For, Where & Return clause paths. Inherent tension between path sharing and result customization!
45
Message Broker – Wrap Up. Sharing is the key to performance. The NFA provides excellent scalability/performance. PathSharing-FWR performs best when combined with optimizations based on the queries and DTD. When the post-processing is shared, even more scalability can be achieved. This sharing is facilitated by using relational-like query plans. On-going work: how to deploy in the wide area? Distributed filtering and content delivery network; combining distributed query processing and state-of-the-art application-level multicast protocols. What semantics can/should be provided? For more information see: www.cs.berkeley.edu/~daioyl/yfilter
46
New QP Scenarios: Sensor Networks; Message Brokers; Data Streams; Information/Application Integration.
48
Monitoring (2): Data Streams. Streaming data: network monitors, news feeds, stock tickers. B2B and enterprise apps: supply-chain, CRM, trade reconciliation, order processing, etc. (Quasi) real-time flow of events and data. Must manage these flows to drive business (and other) processes. Mine flows to create and adjust business rules. Can also “tap into” flows for on-line analysis.
49
TelegraphCQ Overview. An adaptive system for large-scale shared dataflow processing. Based on an extensible set of operators: 1) Ingress (data access) operators: Screen Scraper, Napster/Gnutella readers, File readers, Sensor Proxies. 2) Non-blocking data processing operators: Selections (filters), XJoins, … 3) Adaptive routing operators: Eddies, STeMs, FLuX, etc. Operators are connected through “Fjords” [MF02], a queue-based framework unifying push & pull.
50
SteMs: “State Modules” [Raman & Hellerstein ICDE 03]. A generalization of the symmetric hash join (n-way). SteMs maintain intermediate state for multiple joins. Use an Eddy to route tuples through the necessary modules. SteMs + Eddy reduce the need for an optimizer, increasing adaptivity in volatile streaming environments. [Figure: inputs A, B, C, D, each with its own hash module: Hash A, Hash B, Hash C, Hash D]
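For readers unfamiliar with the symmetric hash join that SteMs generalize, here is a minimal two-input sketch in Python. It shows only the binary case; the n-way generalization and Eddy-based routing described on the slide are not modeled. The function name, the tuple layout (side, key, payload), and the sample arrivals are assumptions of this example.

```python
from collections import defaultdict

def symmetric_hash_join(stream):
    """Join two interleaved inputs without blocking on either one.

    `stream` yields (side, key, payload) with side in {"A", "B"}.
    Each arriving tuple is first built into its own side's hash table and
    then probed against the other side's table, so results are produced as
    soon as both matching tuples have arrived, in either order.
    """
    tables = {"A": defaultdict(list), "B": defaultdict(list)}
    for side, key, payload in stream:
        other = "B" if side == "A" else "A"
        tables[side][key].append(payload)          # build
        for match in tables[other].get(key, []):   # probe
            yield (payload, match) if side == "A" else (match, payload)

arrivals = [
    ("A", 1, "a1"),
    ("B", 1, "b1"),
    ("B", 2, "b2"),
    ("A", 2, "a2"),
    ("A", 1, "a1x"),
]
joined = list(symmetric_hash_join(arrivals))
print(joined)
```

Keeping one hash table per input, rather than one per join, is what lets the per-input state be treated as a separately routable module.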
51
TelegraphCQ Architecture. [Figure components: TelegraphCQ Front End (Listener, Parser, Planner, Mini-Executor, Catalog); TelegraphCQ Back End (Modules, Scans, Split, CQEddy); TelegraphCQ Wrapper ClearingHouse (Wrappers); Proxy; Shared Memory; Buffer Pool; Disk; Query Plan Queue, Eddy Control Queue, Query Result Queues; numbered steps 1 to 9. Legend: Data Tuples; Query + Control; Data + Query]
52
Semantics of data streams. Different notions of data streams: ordered sequence of tuples; bag of tuple/timestamp pairs [STREAM]; mapping from time to sets of tuples. Data streams are unbounded. Windows restrict the data for a query. A stream can be transformed by moving a window across it. A window can be moved by shifting its extremities or changing its size. [Figure: time → tuple sets: 1 → {t1,t2,t3}; 2 → {t2,t3,t4}; 3 → {t3,t4,t5}; 4 → {t4,t5,t6}; 5 → {t5,t6,t7}]
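The “mapping from time to sets of tuples” view can be sketched directly. The Python below is an illustration only and does not reflect StreaQuel syntax or TelegraphCQ internals: `sliding_windows` shifts both extremities of a fixed-size window, while `landmark_windows` is assumed here to keep the start fixed and let the size grow; both function names are made up for the example.

```python
def sliding_windows(tuples, size, steps):
    """Map time 1..steps to a fixed-size window whose two ends both shift."""
    return {t: tuples[t - 1 : t - 1 + size] for t in range(1, steps + 1)}

def landmark_windows(tuples, steps):
    """Map time 1..steps to a window with a fixed start and a growing end."""
    return {t: tuples[:t] for t in range(1, steps + 1)}

stream = [f"t{i}" for i in range(1, 8)]  # t1 .. t7, as in the figure
windows = sliding_windows(stream, size=3, steps=5)
for time, contents in windows.items():
    print(time, contents)
```

The sliding case reproduces the tuple sets in the figure; the landmark case shows the other way a window can move, by changing its size.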
53
The StreaQuel Language. An extension of SQL. Operates exclusively on streams; is closed under streams. Supports different ways to “create” streams: infinite time-stamped tuple sequences; traditional stable relations. Flexible windows: sliding, landmark, and more. Supports logical and physical time. When used with a cursor mechanism, allows clients to do their own window-based processing. Target language for TelegraphCQ.
54
Example – Landmark query
55
Current Status - TelegraphCQ. System has been developed by modifying PostgreSQL; re-used a lot of code: expression evaluator, semaphores, parser, planner. Successfully demonstrated at SIGMOD 2003. Performance studies underway. Beta version to be released Aug 03: open source (PostgreSQL license); shared joins with windows and aggregates; archived/unarchived streams. A “hot” area: several major streaming systems are under development in the database community.
56
Beyond Emps and Depts. Monitoring: TinyDB, TelegraphCQ, YFilter. Real-time Analysis: TinyDB and TelegraphCQ. Actuation: TinyDB, GridDB. Routing (queries and/or data), Transformation, Service Composition: all of the projects. Definition, Naming, and Access Rights: TelegraphCQ, but all should.
57
Conclusions. Data is the crucial resource in emerging networked environments. Database query processing techniques and insights can provide tremendous leverage. Huge research opportunities for database, networking, and distributed systems researchers. Breakthroughs will come from projects that span these areas.