1
XML + Query Processing: A Foundation for Intelligent Networks Michael Franklin UC Berkeley September 2003
2
Michael Franklin, UC Berkeley Outline Earlier (non-XML) Projects – Client-Server EXODUS -> SHORE – DIMSUM - Distributed Query Architecture – DBIS - Dissemination-Based Information Systems – Telegraph and TelegraphCQ – Lessons Learned The XML-enabled Computing Landscape Some Research Suggestions
3
Michael Franklin, UC Berkeley Client-Server Exodus Issue: How to split the functionality of an OODB across Clients and Servers? [Diagram: a server stack (Buffer Manager, Disk Manager, Transaction Mgr) connected to two client stacks (Applications, Object Access/QP, Access Methods, Buffer Manager).]
4
Michael Franklin, UC Berkeley Distribution of OODB Functions Server is the owner of data. – Shared resources: data and log disks, server memory. Clients cache second-class (i.e., soft state) copies to reduce latency. – Can share client caches too… Query vs. Data Shipping. For Data Shipping: – Object or Page granularity. Ref: [Sigmod 91,92,94; VLDB 92,93] [Diagram: the client and server stacks from the previous slide, connected by a client-server protocol.]
5
Michael Franklin, UC Berkeley SHORE - A Peer Server (P2P?) Model Follow-on to Exodus [Sigmod 94] Among other things, took caching to its logical conclusion: All can be clients and servers. – You manage the data you own (server) – You cache data owned by others (client) Wide-area is a reasonable next step – But massive scale changes everything (more on this later).
6
Michael Franklin, UC Berkeley So, What Happened? Well, all the OODB/ORDB stuff – But isn’t XML DB just OODB redux? More to the point: – Models were tightly-coupled: synchronous, and they need intimate knowledge of the schema – Limited (and late) standardization for query languages, data model, and schema interchange This is bad for: Scalability, Interoperability, Incremental Deployment, Resilience to Change. Also, some people really did want queries (vs. navigation).
7
Michael Franklin, UC Berkeley DIMSUM - Adding Queries to the Mix Goal - mix declarative specification & caching. – raises mapping problems similar to materialized view maintenance, but more dynamic. “Hybrid-Shipping” - Sometimes neither pure strategy is best. Semantic Caching - remainder queries, semantic replacement functions, … Query Scrambling - query re-optimization for wide-area delays (vague “deep web” theme) XJoin - Adaptive, pipelined join operator. Cache Investment - Multiple query cache optimization. Ref: [Sigmod 96,98; VLDB 96,01; TODS 00]
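To make the pipelined-join idea concrete, here is a minimal sketch (assumed code, not the DIMSUM implementation) of a symmetric hash join in Java, the core idea behind XJoin: every arriving tuple is inserted into its own side's hash table and immediately probed against the other side's table, so join results stream out without waiting for either input to complete. The tuple representation and key positions are hypothetical; the real XJoin additionally spools overflow partitions to disk and re-probes them during source delays, which is omitted here.

    import java.util.*;

    // Minimal sketch of a symmetric (pipelined) hash join in the spirit of XJoin.
    // Both inputs are treated as streams; there is no blocking build phase.
    class SymmetricHashJoin {
        private final Map<Object, List<Object[]>> leftTable = new HashMap<>();
        private final Map<Object, List<Object[]>> rightTable = new HashMap<>();
        private final int leftKey, rightKey;   // join-column positions (assumed schema)

        SymmetricHashJoin(int leftKey, int rightKey) {
            this.leftKey = leftKey;
            this.rightKey = rightKey;
        }

        // Called whenever a tuple arrives on either input; returns any new join results.
        List<Object[]> onLeft(Object[] t)  { return insertAndProbe(t, leftKey, leftTable, rightTable); }
        List<Object[]> onRight(Object[] t) { return insertAndProbe(t, rightKey, rightTable, leftTable); }

        private List<Object[]> insertAndProbe(Object[] t, int keyPos,
                                              Map<Object, List<Object[]>> mine,
                                              Map<Object, List<Object[]>> other) {
            Object key = t[keyPos];
            mine.computeIfAbsent(key, k -> new ArrayList<>()).add(t);   // insert into own table
            List<Object[]> results = new ArrayList<>();
            for (Object[] match : other.getOrDefault(key, Collections.emptyList())) {
                Object[] joined = new Object[t.length + match.length];  // probe the other table
                System.arraycopy(t, 0, joined, 0, t.length);
                System.arraycopy(match, 0, joined, t.length, match.length);
                results.add(joined);
            }
            return results;
        }
    }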
8
Michael Franklin, UC Berkeley So, What Happened? Still Tightly Coupled: – Synchronous (modulo Query Scrambling delay tolerance). – Need to know (and exchange) schemata Basically, a federated database approach with caching added. – But, federated databases still haven’t caught on. – Q: Why is data warehousing so popular? Still, some interesting issues raised: – adaptivity for networked query processing. – semantic cache content descriptors raise duality of queries and data. – pipelined operators for incoming data.
9
Michael Franklin, UC Berkeley DBIS Framework Dissemination-Based Information Systems Outgrowth of “Broadcast Disks” project. [SIGMOD 95] Framework in OOPSLA 97, SIGMOD 98 (Franklin & Zdonik) Toolkit Developed and Demonstrated at SIGMOD 99 The DBIS Framework is based on three fundamental principles: 1) No one data delivery mechanism is best for all situations (e.g., apps, workloads, topologies). 2) Network Transparency: Must allow different mechanisms for data delivery to be applied at different points in the system. 3) Topology, routing, and delivery mechanism should vary adaptively in response to system changes.
10
Michael Franklin, UC Berkeley Dissemination Network Components [Diagram: three tiers – Data Sources, Information Brokers, and Client Proxies – with profiles and queries flowing toward the sources and responses flowing back to the clients.]
11
Michael Franklin, UC Berkeley Data Delivery Mechanisms Classified along three dimensions: Push vs. Pull, Aperiodic vs. Periodic, Unicast vs. 1-to-n.
Pull / Aperiodic / Unicast: request/response
Pull / Aperiodic / 1-to-n: on-demand broadcast
Pull / Periodic / Unicast: polling
Pull / Periodic / 1-to-n: polling w/ snoop
Push / Aperiodic / Unicast: email lists
Push / Aperiodic / 1-to-n: publish/subscribe
Push / Periodic / Unicast: personalized news
Push / Periodic / 1-to-n: broadcast disks
Dimensions are largely orthogonal – all combinations are potentially useful.
12
Michael Franklin, UC Berkeley Network Transparency [Diagram: Clients – Brokers – Sources connected by links of varying types.] A fundamental principle for systems design: the type of a link matters only to the nodes on each end.
13
Michael Franklin, UC Berkeley More on Brokers Brokers are middleware components that can act as both clients and servers. Must support data caching – Needed to convert pushed-data to pulled-data – Also allows implementation of hierarchical caching Profile Management – Allow informed data management: push, prefetch, staging, etc. Profile Matching – Our assumptions were: No profile language sufficient for all applications. Need an API for adding app-specific profiling
14
Michael Franklin, UC Berkeley So, What Happened? Focus on combo of Push and Pull. Big deal: Integration of Database and Networking – If I had a Euro for every review that said “why is this a db problem?” – Published in DB and Comms venues. But, we were missing 2 big pieces of the puzzle: – How to deploy this stuff (in the routers?)? – What should the language for profiles and queries be? These have since been answered
15
Michael Franklin, UC Berkeley Telegraph: Querying the Networked World Increasingly ubiquitous networking at all scales. – ad hoc sensor nets, wireless, global Internet Explosion in the number, types, and locations of data sources and sinks. – mobile devices, P2P networks, data centers Emerging software infrastructure to put it all together. “When processing, storage, and transmission cost micro-dollars, the only real value is the data and its organization.” (Jim Gray’s 1998 Turing Award Paper)
16
Michael Franklin, UC Berkeley Telegraph Overview An adaptive system for large-scale shared dataflow processing. – Sharing and adaptivity go hand-in-hand Based on an extensible set of operators: 1) Ingress (data access) operators: File readers, Sensor Proxies, Screen-Scrapers 2) Non-Blocking data processing operators: Selections (filters), XJoins, … 3) Adaptive Routing operators: Eddies, STeMs, FLuX, etc. Operators connected through “Fjords” [MF02] – a queue-based framework unifying push & pull.
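As a rough illustration of the Fjord idea (my assumptions, not the Telegraph API): operators are connected by queues that can be driven either by the producer (push, e.g., a sensor proxy) or by the consumer (pull, e.g., a file reader), so both kinds of source can coexist in one dataflow. The interface names below are hypothetical.

    import java.util.ArrayDeque;
    import java.util.Queue;

    // Illustrative Fjord-style connector: a bounded queue between two operators that
    // supports both push (producer-driven) and pull (consumer-driven) transfer.
    class FjordQueue<T> {
        interface Source<T> { T produce(); }      // a pull-able upstream operator

        private final Queue<T> buffer = new ArrayDeque<>();
        private final int capacity;
        private final Source<T> source;           // only used in pull mode (may be null)

        FjordQueue(int capacity, Source<T> source) {
            this.capacity = capacity;
            this.source = source;
        }

        // Push mode: an upstream operator (e.g., a sensor proxy) deposits tuples as they arrive.
        boolean push(T tuple) {
            if (buffer.size() >= capacity) return false;   // back-pressure: caller must retry or drop
            buffer.add(tuple);
            return true;
        }

        // Pull mode: the downstream operator asks for the next tuple; if the buffer is empty
        // and a pull-able source exists, demand is propagated upstream.
        T pull() {
            if (!buffer.isEmpty()) return buffer.poll();
            return (source != null) ? source.produce() : null;   // null signals "nothing yet"
        }
    }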
17
Michael Franklin, UC Berkeley The Telegraph Project We’ve explored sharing and adaptivity in … – Eddies: Continuously adaptive queries – Fjords: Inter-module communication – CACQ: Sharing, tuple lineage – PSoup: Query=Data duality – STeMs: Half-a-symmetric-join, tuple store – FLuX: Fault tolerance, load balancing … and built a first-generation prototype [SIGMODRec01] – Built from scratch in Java Rewrote as “TelegraphCQ” [CIDR 03] – In “C”, based on open-source PostgreSQL – Focus on continuous queries over streams – Released in July 2003
18
Michael Franklin, UC Berkeley The TelegraphCQ Architecture [Diagram: the TelegraphCQ Front End (Listener, Parser, Planner, Mini-Executor, Catalog, Proxy) communicates with TelegraphCQ Back End modules (Scans, CQEddy, Split) and a Wrapper ClearingHouse (Wrappers) through shared memory (Query Plan Queue, Eddy Control Queue, Query Result Queues, Buffer Pool), with the Buffer Pool backed by disk.]
19
Michael Franklin, UC Berkeley Queries Need Windows: Landmark query
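For illustration, a hedged Java sketch of the landmark-window semantics the slide refers to: the window's lower bound is pinned at a landmark timestamp while its upper bound advances with the stream, so an aggregate keeps accumulating until the landmark is reset. The class and its fields are illustrative, not TelegraphCQ code.

    // Running average over a landmark window: fixed starting point, growing end point.
    class LandmarkAverage {
        private long landmark;     // timestamp at which the current window began
        private double sum = 0;
        private long count = 0;

        LandmarkAverage(long landmark) { this.landmark = landmark; }

        // Called for each stream element; elements before the landmark are ignored.
        void observe(long timestamp, double value) {
            if (timestamp < landmark) return;
            sum += value;
            count++;
        }

        double currentAverage() { return count == 0 ? Double.NaN : sum / count; }

        // Resetting the landmark starts a new window (e.g., at midnight).
        void resetLandmark(long newLandmark) {
            landmark = newLandmark;
            sum = 0;
            count = 0;
        }
    }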
20
Michael Franklin, UC Berkeley So, What Happened? Decision was made to do relational first. – Enough hard problems w/o XML – Our early apps weren’t XML Q: Will they eventually be? – Note: Streams and Aurora made same choice Developed lots of stream-related technology Project still going strong – Storage manager, archives, and historical queries – Adaptive Adaptivity – Performance Tuning – Query Language and Window semantics – Distribution
21
Michael Franklin, UC Berkeley Summary So Far 4 projects over 14 or so years. All exploring aspects of networked data management. Exodus/SHORE - centrality of caching, work sharing and work splitting paradigms. DIMSUM - Benefits and challenges of declarative specifications via queries. DBIS - Push, Profiles, broader notion of integrating networking and data management. Telegraph - Adaptivity, Sharing, CQs, Stream processing. But, they all suffer to some extent from the problem of tight coupling in terms of both timing and semantics.
22
Michael Franklin, UC Berkeley Meta Lessons Learned 1. You don’t have to predict the technology correctly to get a bunch of papers published. 2. Sometimes you actually get it right, but the timing is a bit off. A lot of pieces have to fall into place before a new technology or architecture clicks. XML is one such piece, and it’s a BIG one.
23
Michael Franklin, UC Berkeley How to Make Systems More Network-Friendly Messaging enables distributed communication that is loosely coupled. A component sends a message to a destination, and the recipient can retrieve the message from the destination. However, the sender and the receiver do not have to be available at the same time in order to communicate. In fact, the sender does not need to know anything about the receiver; nor does the receiver need to know anything about the sender. The sender and the receiver need to know only what message format and what destination to use. Java Message Service (JMS) API Tutorial Sun Microsystems
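For concreteness, a minimal JMS-style sketch of the loose coupling the tutorial describes: sender and receiver share only a destination name and a message format, and need not run at the same time. The JNDI lookup names and the destination are hypothetical and provider-specific.

    import javax.jms.*;
    import javax.naming.InitialContext;

    // Minimal JMS point-to-point example: sender and receiver share only a
    // destination name and an agreed message format, not each other's identity.
    public class JmsSketch {
        public static void main(String[] args) throws Exception {
            InitialContext ctx = new InitialContext();                       // provider-specific setup
            ConnectionFactory factory = (ConnectionFactory) ctx.lookup("ConnectionFactory");
            Queue queue = (Queue) ctx.lookup("jms/Orders");                  // hypothetical destination

            Connection conn = factory.createConnection();
            Session session = conn.createSession(false, Session.AUTO_ACKNOWLEDGE);

            // Sender: fire-and-forget; the receiver may not even be running yet.
            MessageProducer producer = session.createProducer(queue);
            TextMessage msg = session.createTextMessage("<order><sku>42</sku></order>");
            producer.send(msg);

            // Receiver: retrieves the message from the destination whenever it is ready.
            MessageConsumer consumer = session.createConsumer(queue);
            conn.start();                                                    // start delivery
            TextMessage received = (TextMessage) consumer.receive(1000);
            System.out.println("got: " + (received == null ? "nothing" : received.getText()));

            conn.close();
        }
    }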
25
Michael Franklin, UC Berkeley Preaching to the Choir XML (not JMS!) solves both these issues. – Senders and Receivers can agree on message format (or at least figure most of it out). – Destinations should be encoded by value not by address. (Didn’t we learn anything during the OODB battles?). Database people live and breathe both of these. So who better to fix the networked application infrastructure problem? (Ahem, but, better keep that slow DBMS out of the message flow! e.g., FedEx tracking involves 100,000,000 transactions a day, and RFID will be even more fun.)
26
XML Message Brokers A platform for dynamic, loosely-coupled integration of enterprise applications and data. Interaction accomplished through exchange of messages in the wide area. (e.g., Adam Bosworth’s VLDB 02 keynote: http://www.cs.ust.hk/vldb2002/VLDB2002-proceedings/slides/S01P01slides.pdf)
27
Underlying Technology: Filtering The challenge is to efficiently and quickly match incoming XML documents against the potentially huge set of user profiles. [Diagram: Data Sources feed an XML Conversion step; the resulting XML Documents enter a Filter Engine loaded with User Profiles, which delivers Filtered Data to Users.]
28
Our View on Message Brokers (YFilter) Message Brokers perform three main tasks: – Filtering - matching of interests. – Transformation - format conversion for app integration and preferences. – Routing - moving bits through the overlay network Must be lightweight and scalable. – Effectively they are high-function routers. – Large-scale deployments may entail handling 10’s or 100’s of thousands of queries (subscriptions) XML is a natural substrate.
29
YFilter: Shared Path Matching Yanlei Diao et al., ACM TODS, Dec. 2003 For large-scale systems, shared processing is essential. YFilter uses an NFA-based approach to share path matching work among queries. [Table: the location steps /a, //a, /*, and //* and the small NFA fragments that implement them.]
30
Constructing a Query NFA Concatenate NFA fragments for location steps in a path expression. [Diagram: the fragments for /a and //b are concatenated to form the NFA for the query “/a//b”.]
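A small sketch of this construction step under simplifying assumptions (illustrative code, not YFilter's data structures): each location step of a path such as /a//b becomes one NFA state, and a step reached via "//" is marked so that it stays active on any element.

    import java.util.*;

    // Illustrative construction of a per-query NFA from an XPath-like path such as "/a//b".
    // Each location step becomes one state; a descendant step ("//") gets a wildcard self-loop.
    class QueryNfaBuilder {
        static class State {
            final String symbol;        // element name to match, or "*" for any element
            final boolean selfLoop;     // true if reached via "//" (stay here on any element)
            State(String symbol, boolean selfLoop) { this.symbol = symbol; this.selfLoop = selfLoop; }
        }

        // Splits the path into location steps and emits one NFA state per step.
        static List<State> build(String path) {
            List<State> states = new ArrayList<>();
            int i = 0;
            while (i < path.length()) {
                boolean descendant = path.startsWith("//", i);
                i += descendant ? 2 : 1;                      // skip "/" or "//"
                int end = path.indexOf('/', i);
                if (end < 0) end = path.length();
                states.add(new State(path.substring(i, end), descendant));
                i = end;
            }
            return states;
        }

        public static void main(String[] args) {
            for (State s : build("/a//b")) {
                System.out.println((s.selfLoop ? "//" : "/") + s.symbol);   // prints /a then //b
            }
        }
    }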
31
Constructing the Combined NFA Q1=/a/b Q2=/a/c Q3=/a/b/c Q4=/a//b/c Q5=/a/*/b Q6=/a//c Q7=/a/*/*/c Q8=/a/b/c [Diagram: a single combined NFA for Q1–Q8; queries with common prefixes share states, and each accepting state is annotated with the query IDs it satisfies (e.g., Q3 and Q8 share one accepting state).]
32
NFA Execution [Diagram: the combined NFA executed over an XML fragment; a runtime stack of active state sets grows on each start-element event and shrinks on each end-element event, reporting matches (e.g., Q1, then Q3 and Q8) as accepting states become active.]
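A hedged sketch of this stack-driven execution over a simplified shared NFA (again illustrative, not YFilter's implementation): the set of active states is recomputed and pushed on every start-element event, popped on every end-element event, and any accepting state reached reports its query IDs. The toy machine built in main shares the /a prefix between two queries.

    import java.util.*;

    // Illustrative stack-driven execution of a shared (combined) NFA over SAX-style events.
    class SharedNfa {
        static class State {
            Map<String, Integer> moves = new HashMap<>(); // element name or "*" -> target state
            boolean descendantLoop = false;               // stay active on any element ("//")
            Set<String> accepts = new HashSet<>();        // query IDs accepted at this state
        }

        final List<State> states = new ArrayList<>();
        private final Deque<Set<Integer>> stack = new ArrayDeque<>();

        SharedNfa() { states.add(new State()); stack.push(Collections.singleton(0)); } // state 0 = start

        int addState() { states.add(new State()); return states.size() - 1; }

        // On a start-element event: compute the next active set, push it, report matches.
        Set<String> startElement(String name) {
            Set<Integer> next = new HashSet<>();
            Set<String> matched = new HashSet<>();
            for (int id : stack.peek()) {
                State s = states.get(id);
                if (s.descendantLoop) next.add(id);                   // "//": remain active
                Integer t = s.moves.get(name);
                if (t == null) t = s.moves.get("*");
                if (t != null) next.add(t);
            }
            for (int id : next) matched.addAll(states.get(id).accepts);
            stack.push(next);
            return matched;
        }

        // On an end-element event: simply pop back to the previous active set.
        void endElement() { stack.pop(); }

        public static void main(String[] args) {
            SharedNfa nfa = new SharedNfa();
            int a = nfa.addState(), ab = nfa.addState(), ac = nfa.addState();
            nfa.states.get(0).moves.put("a", a);          // shared prefix /a
            nfa.states.get(a).moves.put("b", ab);         // Q1 = /a/b
            nfa.states.get(a).moves.put("c", ac);         // Q2 = /a/c
            nfa.states.get(ab).accepts.add("Q1");
            nfa.states.get(ac).accepts.add("Q2");

            System.out.println(nfa.startElement("a"));    // []
            System.out.println(nfa.startElement("b"));    // [Q1]
            nfa.endElement();
            System.out.println(nfa.startElement("c"));    // [Q2]
        }
    }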
33
Michael Franklin, UC Berkeley Performance Overview Sharing provides order-of-magnitude improvements. In our experiments, even with 100,000 concurrent queries, filtering was faster than the parser. No exponential blow-up of active states in NFA execution. Little sensitivity to the occurrence of ‘*’ and ‘//’, because effective prefix sharing keeps the machine size small. Efficient for query updates: tens of milliseconds for inserting 1000 queries, stabilizing at 5 msec once 50,000 queries exist in the system.
34
Message Transformation Shred FLWR expressions into paths that can be pushed down into the path matching engine. Post-process the output using relational-style operators to produce customized messages. – Can apply MQO techniques to these post-plans Three approaches (differ in the extent to which they push work to the engine): – PathSharing-F: For clause paths only – PathSharing-FW: For & Where clause paths – PathSharing-FWR: For, Where & Return Inherent tension between path sharing and result customization! See Yanlei Diao’s VLDB 03 paper (Thursday afternoon)
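To illustrate the shredding step (a hypothetical example of mine, not the actual transformation code): a FLWR query contributes paths from its for, where, and return clauses, and the three PathSharing strategies differ only in which of those path sets are handed to the shared matching engine, leaving the rest to a relational-style post-processing plan.

    import java.util.*;

    // Illustrative shredding of one FLWR query into paths for the shared matching engine.
    // Hypothetical example query:
    //   FOR   $o in /orders/order
    //   WHERE $o/customer/region = "EU"
    //   RETURN $o/total
    class ShreddedQuery {
        final String queryId;
        final List<String> forPaths;     // bindings: pushed down under all three strategies
        final List<String> wherePaths;   // predicates: pushed down under PathSharing-FW and -FWR
        final List<String> returnPaths;  // result construction: pushed down only under -FWR

        ShreddedQuery(String queryId, List<String> forPaths,
                      List<String> wherePaths, List<String> returnPaths) {
            this.queryId = queryId;
            this.forPaths = forPaths;
            this.wherePaths = wherePaths;
            this.returnPaths = returnPaths;
        }

        // Paths handed to the path-matching engine under a given strategy ("F", "FW", or "FWR").
        List<String> pathsFor(String strategy) {
            List<String> paths = new ArrayList<>(forPaths);
            if (strategy.equals("FW") || strategy.equals("FWR")) paths.addAll(wherePaths);
            if (strategy.equals("FWR")) paths.addAll(returnPaths);
            return paths;
        }

        public static void main(String[] args) {
            ShreddedQuery q = new ShreddedQuery("Q1",
                Arrays.asList("/orders/order"),
                Arrays.asList("/orders/order/customer/region"),
                Arrays.asList("/orders/order/total"));
            System.out.println(q.pathsFor("FWR"));   // all three paths registered with the engine
        }
    }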
35
Message Broker – Wrap Up Sharing is the key to performance – NFA provides excellent scalability/performance – PathSharing-FWR performs best, when combined with optimizations based on the queries and DTD. – When the post-processing is shared, even more scalability can be achieved. This sharing is facilitated by using relational-like query plans. On-going work - How to deploy in the wide area?: – Distributed Filtering and Content Delivery Network Combining distributed query processing and state-of-the-art application-level multicast protocols. What semantics can/should be provided? For more information see: www.cs.berkeley.edu/~daioyl/yfilter
36
Michael Franklin, UC Berkeley Beyond Message-Based Systems Distributed systems need traceability – Particularly highly dynamic (loosely-coupled) ones – Need to carry provenance information with data Workflow description – XML-based workflow languages with appropriate versioning models can provide the platform for the above. Data needs to be long-lived - Archiving – Marked up data provides an opportunity for future interpretation? – Schema versioning needed for this. Semantic Web? – Try it if you like…
37
Michael Franklin, UC Berkeley Deep/Hidden Web Querying XML is a great way to describe sources. Routing queries to sources is the inverse of the data dissemination problem. Yet another instance of the query and data duality. Stream query processing can help here too.
38
Michael Franklin, UC Berkeley Self-Publishing/Crawling Following the query routing idea further… Queries can be continuously crawling through the network acquiring new data. This can be random or focused (e.g., navigating your Friendster chains). Even more fun: Mutant Queries (Papadimos et al. OGI) – Queries are partially evaluated and bound as they traverse the network. – “Hybrid Shipping” on steroids
39
Michael Franklin, UC Berkeley Topics in Need of Work Query Languages and semantics in streaming, loosely-coupled, semi-structured environments. Update consistency models, transactions, exactly-once delivery - How 80’s! Dynamism and on-the-fly modifications User interaction Platform questions: In or out of the DBMS? Making XML appropriate for other environments (e.g., sensor networks). …
40
Michael Franklin, UC Berkeley Conclusions Two technologies are combining to make distributed/decentralized computing a reality: overlay networks and XML. Query processing is a way to route data through a network by value. – This is the right way to build an overlay network. – We are the right people to do it. – XML is the common substrate that enables it. My plan: revisit many earlier distributed data management ideas in light of this new reality. – And do some new stuff too!