Dynamic XML documents with Distribution and Replication


Authors: Serge Abiteboul, Angela Bonifati, Grégory Cobéna, Ioana Manolescu, Tova Milo
As summarized by: Preethi Vishwanath, San Jose State University, Computer Science

Dynamic XML documents
XML documents where some data is given explicitly while other parts are given only intensionally, by means of embedded calls to web services that can be called to generate the required information.
– SOAP and WSDL normalize the way programs can be invoked over the Web, and have become the standard means of publishing and accessing dynamic, up-to-date sources of information.
– Such documents may be distributed and/or replicated.
Whether dynamic or static, an XML document may be
– Distributed in several parts located at different peers, while maintaining the general unity of the separated pieces.
– Partially or entirely replicated on different peers.
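
The idea of intensional data can be illustrated with a minimal Python sketch. It assumes a made-up `<fun name="..."><param>...</param></fun>` syntax for the embedded call and replaces the remote SOAP invocation with a local stand-in function; neither is taken from the paper.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a remote Web service (a real peer would issue a SOAP call).
def snow_conditions(resort):
    return "good" if resort == "Aspen" else "unknown"

# Assumed registry mapping service names to callables.
SERVICES = {"SnowConditions": snow_conditions}

# Assumed call syntax: <fun name="..."> with <param> children.
DOC = """
<resort>
  <name>Aspen</name>
  <snow_cond><fun name="SnowConditions"><param>Aspen</param></fun></snow_cond>
</resort>
"""

def materialize(root):
    """Replace every <fun> element by the text its service returns."""
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == "fun":
                args = [p.text for p in child.findall("param")]
                result = SERVICES[child.get("name")](*args)
                parent.remove(child)
                parent.text = (parent.text or "").strip() + result
    return root

doc = materialize(ET.fromstring(DOC))
print(ET.tostring(doc, encoding="unicode"))
```

After `materialize`, the document contains only explicit data; before it, the snow condition existed only as a call description.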

Aspects of distribution due to embedding calls to a Web Service
(1) Accessing remote services: Such a document provides the means to access remote services. This feature is already provided by platforms supporting embedded scripts in HTML/XML documents, e.g., JSP, ASP.NET.
(2) Replicating data fragments with embedded service calls: A call included in a replicated fragment may be activated from the replica's site, following a rather different communication path.
(3) Replicating service definitions: A special form of replication may be achieved by replicating not only data, but also service definitions. This is in the spirit of "code-shipping".

Context of paper and Contributions
Context: Dynamic XML documents (XML documents including calls to Web services) that are possibly distributed over several sites, with portions of them possibly replicated.
Contributions
(1) Model. Introduce a simple model for replicating and distributing XML documents over several sites. The model may be used for standard or dynamic documents. In general, users querying distributed/replicated data prefer to ignore data location and expect the system to locate data for them. But it is sometimes desirable to specify which replicas of a given fragment to use (e.g., the one in the local cache, or the most recent one).
(2) Query evaluation and optimization. In the presence of replicas and distribution, many evaluation strategies are possible for a given query, depending on the choice of the replica to use and of the sites performing each elementary computation. Typically, several peers will collaborate to evaluate a query; each involved peer will have to make choices in order to improve its observable performance, based on a cost metric specific to this peer.
(3) Tailored replication. To improve its observable performance, a peer may be willing to replicate some data, possibly including service calls, and even service definitions, as explained above. Such replication is subject to natural constraints (e.g., storage space).

Data Model & Query Language
Dynamic XML Documents
– May be viewed as labeled trees.
– Tree nodes represent the XML elements/attributes; edges represent the relationships between them.
– Function elements represent calls to Web services.
Web Services
– Opaque: SOAP-based Web services, treated as black boxes.
– Declarative: Web services whose implementation is known and described in terms of XQuery.
Peers
– Each peer offers some Web services and contains some dynamic XML documents, which may include calls to services provided by the same or other peers.
Distribution
– Documents may include calls to services provided by the same or other peers.
– A higher level of data distribution can be achieved by allowing a document to be distributed over several peers.
– In the tree data model, this means that document nodes may now have external children edges pointing to children nodes on other peers and, analogously, an external parent edge if the parent of the node is on another peer.
Replication of data and services
– The same document fragment exists in several peers.
– All children of the same node with the same ID are considered replicas of a single node.
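
The tree model with external edges and replicas could be represented as follows. This is only a sketch: the `Node` class, its field names, and the sample peers are illustrative assumptions, not structures defined in the paper.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class Node:
    label: str
    peer: str                       # peer that stores this node
    node_id: Optional[str] = None   # children sharing an ID are replicas
    children: List["Node"] = field(default_factory=list)

    def external_children(self) -> List["Node"]:
        """Children stored on another peer, i.e. reached via external edges."""
        return [c for c in self.children if c.peer != self.peer]

    def replica_groups(self) -> Dict[str, List["Node"]]:
        """Group children by ID; a group with several members is a replica set."""
        groups = defaultdict(list)
        for c in self.children:
            if c.node_id is not None:
                groups[c.node_id].append(c)
        return {i: g for i, g in groups.items() if len(g) > 1}

# Hypothetical example: the hotels fragment exists on two peers.
resort = Node("resort", "SkiPortal", children=[
    Node("name", "SkiPortal"),
    Node("hotels", "SkiPortal", node_id="h1"),
    Node("hotels", "ColoradoSkiCenter", node_id="h1"),
])
```

Here `resort.external_children()` returns the copy held by the other peer, and `resort.replica_groups()` reports that the two `hotels` children with ID `h1` are replicas of a single node.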

A dynamic XML Document of the SKI Portal
[XML listing; element markup lost in extraction. Recoverable content: state Colorado; resort Aspen with snow condition good and an embedded <fun> call mentioning Aspen; enclosing <resorts> and <document> elements.]

Web Services of Ski Portal

function OperativeSkiResorts($state)
implementation: XQuery
  for $x in document("SkiPortal")/state[state_name=$state]/resorts/resort[snow_cond/value()="good"]
  return $x

function HotelsInfo($state, $resort)
implementation: XQuery
  for $x in document("SkiPortal")/state[state_name=$state]/resorts/resort[name=$resort]/hotels/hotel
  return $x
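
To make the two declarative services concrete, here is a rough Python equivalent over a small in-memory document. The element names follow the XQuery paths above, but the sample document itself (the second resort and the hotel names) is made-up data, and the snow conditions are already materialized rather than given by a call.

```python
import xml.etree.ElementTree as ET

# Made-up sample data shaped after the paths used by the two services.
SKI_PORTAL = ET.fromstring("""
<document name="SkiPortal">
  <state>
    <state_name>Colorado</state_name>
    <resorts>
      <resort><name>Aspen</name><snow_cond>good</snow_cond>
        <hotels><hotel>Hotel A</hotel></hotels></resort>
      <resort><name>Vail</name><snow_cond>poor</snow_cond>
        <hotels><hotel>Hotel B</hotel></hotels></resort>
    </resorts>
  </state>
</document>
""")

def _resorts(state):
    """Yield the resort elements of the given state."""
    for st in SKI_PORTAL.findall("state"):
        if st.findtext("state_name") == state:
            yield from st.findall("resorts/resort")

def operative_ski_resorts(state):
    """Resorts of the state whose snow condition is 'good'."""
    return [r for r in _resorts(state)
            if (r.findtext("snow_cond") or "").strip() == "good"]

def hotels_info(state, resort):
    """Hotel elements of one resort in the given state."""
    return [h for r in _resorts(state) if r.findtext("name") == resort
            for h in r.findall("hotels/hotel")]
```

Because the implementation is visible, a peer holding a copy of the relevant data could evaluate the same functions locally instead of calling the portal.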

If the two functions were opaque and the resort knew nothing about their internal implementation, there would be essentially two possibilities:
(1) Call the ski portal each time a service is needed and have the portal compute the answer and return it, or
(2) Cache the returned result and use it for some time, saving communication cost at the expense of data accuracy.
Query Frequency
– By analyzing the OperativeSkiResorts query, we can see that its answer may change only every hour, when the SnowConditions function is invoked.
– Hence, to give fully accurate answers to its visitors, the ski center needs to invoke the function every hour, and cache data in between.
Replicating relevant data and services
– Assume that the Colorado ski center computer is capable of (1) storing dynamic XML documents, (2) invoking the web service calls embedded in them, and (3) processing XQuery queries.
– Rather than just caching the current query result, one could then decide to replicate (and maintain) in the ski center computer all the relevant data, and provide a local version of the service queries.
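
The caching option for an opaque service can be sketched as a time-to-live wrapper. The class name, the `remote_calls` counter, and the injectable clock are illustrative assumptions; the one-hour refresh interval comes from the slide.

```python
import time

class CachedService:
    """Wrap an opaque service call and reuse its result for ttl_seconds."""

    def __init__(self, call, ttl_seconds, clock=time.monotonic):
        self._call = call
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so the sketch can be tested
        self._cache = {}             # args -> (timestamp, result)
        self.remote_calls = 0        # how often the remote service was hit

    def __call__(self, *args):
        now = self._clock()
        hit = self._cache.get(args)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]            # fresh enough: no communication needed
        self.remote_calls += 1
        result = self._call(*args)
        self._cache[args] = (now, result)
        return result

# Hypothetical usage: refresh OperativeSkiResorts at most once per hour.
fake_now = [0.0]
service = CachedService(lambda state: [state + ": Aspen"], 3600,
                        clock=lambda: fake_now[0])
service("Colorado")
service("Colorado")   # served from the cache
```

With a TTL equal to the period at which the underlying data can change, the cached answers remain fully accurate while the portal is contacted only once per period.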

The Colorado dynamic document and services
[XML listing; element markup lost in extraction. Recoverable content: resort Aspen with snow condition good and an embedded <fun> call mentioning Aspen inside <snow_cond>; a <hotels> element; enclosing <document> element.]

function OperativeSkiResorts("Colorado")
implementation: XQuery
  for $x in document("ColoradoSkiCenter")/resort[snow_cond/value()="good"]
  return $x

function HotelsInfo("Colorado", $resort)
implementation: XQuery
  for $x in document("ColoradoSkiCenter")/resort[name=$resort]/hotels/hotel
  return $x

Partial Replication
Replicate just the resort names and their ski conditions, without the hotels data, and provide access to the hotels data through the ski portal when needed.
The externalURL sub-element of the hotels element, together with the ID, indicates where the data of this element may be found.
The external edge is simply viewed as an intensional description of the missing data and gives the means to obtain it if needed.

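The Colorado document with external edges
[XML listing; element markup lost in extraction. Recoverable content: resort Aspen with snow condition good and an embedded <fun> call mentioning Aspen inside <snow_cond>; a <hotels> element containing an <externalURL> sub-element.]
Inverse External Edges
[XML listing; only the closing </document> tag survives.]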

Master-Slave Policy
Maintaining consistency over replicated objects is difficult.
Typical solution
– Have each object owned by a single master who is in charge of keeping the various copies in sync.
– If the various copies are the children of a single element, then this element is the candidate for being in charge of synchronization.
Example: <document name="SkiPortal"> <state> Colorado ...

Queries
Each element encountered in the evaluation of a path expression on a given peer p may contain some data (residing on that peer) and may also point, via external edges, to some replicas on different peers.
Which of the element versions should be used?
– Ignore all the external edges and consider only the data residing within the given peer p.
– Use the element's local data and also follow all the given external edges to its replicas, in order to get the maximal available information.
– Intermediate choice: choose some arbitrary copy, i.e., consider the element's local data when available, and follow an external edge otherwise.
– Follow a particular edge.
– Give a preference list.
Example: a replicated query
  for $x in document("SkiPortal")/state[state_name="Colorado"]/resorts/resort
  replicate $x with resort name//* snow_cond//* hotels as external link
  at peer "…
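
The five choices listed above can be written down as a small selection function. The policy names, the `"local"` marker, and the function signature are assumptions made for this sketch; the paper expresses these choices in its query language rather than through such an API.

```python
def choose_sources(policy, has_local, replicas, peer=None, preference=()):
    """Return the copies of an element to read: 'local' and/or replica peer names.

    has_local: whether the current peer stores data for the element.
    replicas:  list of peers reachable through the element's external edges.
    """
    if policy == "local_only":          # ignore all external edges
        return ["local"] if has_local else []
    if policy == "all":                 # local data plus every replica
        return (["local"] if has_local else []) + list(replicas)
    if policy == "any":                 # local if available, else one edge
        return ["local"] if has_local else list(replicas[:1])
    if policy == "edge":                # follow one particular edge
        return [peer] if peer in replicas else []
    if policy == "preference":          # first available entry of a list
        available = (["local"] if has_local else []) + list(replicas)
        for candidate in preference:
            if candidate in available:
                return [candidate]
        return []
    raise ValueError(f"unknown policy: {policy}")
```

For example, `choose_sources("preference", False, ["P2", "P3"], preference=["local", "P3"])` returns `["P3"]`, since no local copy exists and P3 is the first available preferred source.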

COST MODEL
Configuration
– A set of peers, each containing some data and providing some web services (opaque or XQuery-based ones).
Workload (for a configuration)
– The system workload consists of the service calls invoked by the dynamic documents in the configuration, as well as the queries/web service requests posed by users at the various peers.
Unifying user queries and services
– The workload consists of the invocations of web services entailed by the dynamic documents, and the queries and web services requested by the user.

Decomposing Queries on Peers
The processing of Q can thus be viewed as decomposed into several "intra-peer" sub-queries: each such sub-query is evaluated on a particular peer, consulting only the peer's local data, and communicating with other peers in order to forward some finer sub-queries or to send/receive data or computation results.

Cost Formulas
Formulas for calculating the data used by a given workload on a set of peers:
$M_{i,j} = \delta_{i,j} \cdot O_j \cdot \min(F_i, F_j)$
$D = {}^{T}L \cdot M \cdot L$
Computation, communication and storage costs incurred by the workload:
$C^{j}_{GlobComp} = \{Comp \cdot L\}_j \cdot cp_j$
$C_{GlobReceiv}[s] = D \cdot BW_{IN}$
$C_{GlobSend}[s] = {}^{T}BW_{OUT} \cdot D$
$C^{j}_{GlobSpace} = \{Space \cdot L\}_j \cdot sp_j$
Where
– $M_{i,j}$ is the volume of data transferred from one query $W_i$ to another query $W_j$
– $D$ represents the volume of data transferred from peer $P_i$ to peer $P_j$ due to all queries in $W$
– $C^{j}_{GlobComp}$ is the observable cost of computation of peer $j$
– $C_{GlobReceiv}[s]$ is the observable cost of received data
– $C_{GlobSend}[s]$ is the observable cost of sent data
– $C^{j}_{GlobSpace}$ is the observable cost of space of peer $j$
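
The first two formulas can be tried out numerically with a short sketch. The slide does not define δ, O, F, or L, so the readings used here are assumptions: δ[i][j] is 1 when W_j consumes the output of W_i, O[j] is an output size, F[i] is a query frequency, and L[i][p] is 1 when query W_i runs on peer p.

```python
def transfer_matrix(delta, O, F):
    """M[i][j] = delta[i][j] * O[j] * min(F[i], F[j]) (query-to-query volume)."""
    n = len(F)
    return [[delta[i][j] * O[j] * min(F[i], F[j]) for j in range(n)]
            for i in range(n)]

def transpose(A):
    return [list(row) for row in zip(*A)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def peer_traffic(L, M):
    """D = transpose(L) * M * L (peer-to-peer volume)."""
    return matmul(matmul(transpose(L), M), L)

# Hypothetical workload: W0 feeds W1.
delta = [[0, 1], [0, 0]]
O = [10, 5]
F = [3, 2]
M = transfer_matrix(delta, O, F)

# Placement A: W0 on peer 0, W1 on peer 1. Placement B: both on peer 0.
D_split = peer_traffic([[1, 0], [0, 1]], M)
D_same = peer_traffic([[1, 0], [1, 0]], M)
```

Under these assumptions, the split placement puts the whole volume on the entry from peer 0 to peer 1, while co-locating both queries moves it to the diagonal, i.e. no inter-peer transfer.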

Outline of Query Evaluation
Data Shipping vs Query Shipping
– Wrappers decide how much of the query sent by the mediator they solve.
– The mediator has global information about data location, and all wrappers report directly to it.
– Control over execution is distributed.
Communication Pattern
– At each step the sub-query Q_next includes the address of the peer P on which Q was originally asked, so that the result is returned directly to P, since this requires less communication.
– Drawback: all peers get to know who initiated the query.
Evaluation at a peer
– Peer P has to execute a simple path expression Q that involves some data in P1 and some in P2.
– P adopts the heuristic of executing as much of Q as possible, say Q_local, obtaining an intermediate result, and delegates one or several further sub-queries Q_next to one or several other peers P_next.
– Each P_next will receive the intermediate results and continue processing by applying the same method: attempt to evaluate all of Q_next and, if not all data is available, delegate further.
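
The delegation heuristic can be illustrated with a toy simulation. The `Peer` class, the `("external", peer, fragment)` marker, and the trace format are assumptions of this sketch; it handles only a single child path and one external edge per step, which is much simpler than the paper's setting.

```python
class Peer:
    def __init__(self, name, fragments):
        self.name = name
        self.fragments = fragments   # fragment id -> nested dict of labels

    def evaluate(self, steps, fragment_id, network, origin, trace):
        """Evaluate as many steps as possible locally, then delegate the rest."""
        node = self.fragments[fragment_id]
        for i, step in enumerate(steps):
            node = node[step]
            if isinstance(node, tuple) and node[0] == "external":
                _, next_peer, remote_id = node
                q_next = steps[i + 1:]            # remaining sub-query
                trace.append(("delegate", self.name, next_peer, tuple(q_next)))
                return network[next_peer].evaluate(
                    q_next, remote_id, network, origin, trace)
        # The last peer answers the originator directly.
        trace.append(("answer", self.name, origin))
        return node

# Hypothetical configuration: the resorts of the state live on P2.
p1 = Peer("P1", {"root": {"state": {"resorts": ("external", "P2", "co_resorts")}}})
p2 = Peer("P2", {"co_resorts": {"resort": {"name": "Aspen"}}})
network = {"P1": p1, "P2": p2}

trace = []
result = p1.evaluate(["state", "resorts", "resort", "name"],
                     "root", network, "P1", trace)
```

The trace shows one delegation from P1 to P2 carrying the residual steps, followed by P2 answering the originator, which matches the communication pattern described above.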

Replicating data and services
For a given configuration and workload, every peer measures its observable performance. In order to improve its observable performance, the peer may want to change the configuration; due to peer autonomy, the peer can only modify its own set of data and services.
Possible replication scenarios that peer P may consider:
– Accessing remote information (do not replicate)
  When not all the data needed for the query evaluation resides on P, it may need to consult remote data, for instance via external links.
  If the query frequency is high and the storage cost at the given peer is low, P may prefer to replicate the relevant data and use a local version rather than the remote one.
– Replicating data fragments with or without service calls
  Scenario 1: P may take the replicated fragment including the service calls embedded in it; thus P will call the service itself.
  Scenario 2: Alternatively, P may leave (some of) the calls to be executed at the remote peer, and just refer to the data they return via external links. This can be cost effective. For example, if the service provider charges some fee from the caller, leaving the call on the remote peer spares P this fee; or, if the call is invoked more frequently than the query that uses its data, its output is transmitted to P at the frequency of the query rather than that of the call invocation, thus entailing less communication.
– Replicating service definitions
  When the data is replicated together with its embedded calls, we may want to also replicate, for declarative services, the code of the called services as well as the data that they use.
  Things become more complex when service definitions are replicated. One has to decide
  – if and how to modify the service code to best fit the needs of P,
  – which data the code uses, and how much of it to replicate, and
  – recursively, for which service calls appearing in this replicated data the code (and the data that it uses) should also be replicated.
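
The trade-off between remote access and local replication can be made concrete with a deliberately simple cost comparison. The cost model below (transfer cost proportional to frequency times size, plus a flat storage cost) and all parameter names are assumptions of this sketch, not the paper's cost formulas.

```python
def should_replicate(query_freq, update_freq, data_size,
                     transfer_cost, storage_cost):
    """Compare per-period costs of remote access and of keeping a local replica.

    Remote access: the data is shipped once per query.
    Local replica: the data is shipped once per update and stored locally.
    """
    remote = query_freq * data_size * transfer_cost
    local = update_freq * data_size * transfer_cost + data_size * storage_cost
    return local < remote

# Frequently queried, rarely updated data: replication pays off.
hot = should_replicate(query_freq=100, update_freq=1, data_size=10,
                       transfer_cost=1.0, storage_cost=0.5)
# Rarely queried, frequently updated data: keep accessing it remotely.
cold = should_replicate(query_freq=1, update_freq=24, data_size=10,
                        transfer_cost=1.0, storage_cost=0.5)
```

The second case mirrors the slide's observation: when the source changes more often than it is queried, transmitting data at the query frequency entails less communication than maintaining a replica.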

Replication Algorithm

Algorithm repDecision
Input: configuration conf, service implementation Q
Output: configuration conf1

conf1 ← conf, repData ← ∅
foreach path expression pe over doc in Q
    // pe is of the form l1[c1]/l2[c2]/…/lk
    // evaluate pe by top-down navigation in doc
    foreach step j in the evaluation of pe, j = 1, 2, …, k
        Q1 ← ../lj+1/lj+2/../lk
        if exists {sc | sc child of a node in the current node list,
                        sc is a call to a service sv, whose output type may contain a path lj+1/…/lk}
        then
            repData ← the set of subtrees rooted at the current node list
            conf1 ← conf ∪ repData ∪ Q1
            if cost(conf1) < cost(conf) then
                foreach sv1 call of service in repData
                    conf1 ← repDecision(conf1, def(sv1))
                endfor
                break // stop here for evaluation of pe
            else nop
        else nop
    endfor // the evaluation of pe is over
    if empty(repData) // repData has not yet been assigned
        repData ← the result of pe on doc
        conf1 ← conf ∪ repData
        foreach sv1 call of service in repData
            conf1 ← repDecision(conf1, def(sv1))
        endfor
endfor
return conf1
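
A simplified, runnable reading of repDecision is sketched below. Several points are assumptions rather than the paper's definitions: documents are nested dicts, a configuration is a set of `("data", path)` and `("query", path)` markers, path conditions `[ci]` are ignored, the output type of a service is a set of path strings, candidates accumulate on `conf1` instead of `conf`, and a `seen` set guards the recursion.

```python
def calls_in(nodes):
    """Names of all services called anywhere inside the given subtrees."""
    found, stack = [], list(nodes)
    while stack:
        n = stack.pop()
        if "service" in n:
            found.append(n["service"])
        stack.extend(n["children"])
    return found

def rep_decision(conf, paths, doc, cost, output_paths, defs, seen=None):
    seen = set() if seen is None else seen
    conf1 = set(conf)
    for pe in paths:                        # pe = [l1, ..., lk]
        rep_nodes = None
        current = [doc]                     # current node list
        for j, label in enumerate(pe):      # top-down navigation
            current = [c for n in current for c in n["children"]
                       if c.get("label") == label]
            rest = "/".join(pe[j + 1:])     # residual query Q1
            calls = [c["service"] for n in current
                     for c in n["children"] if "service" in c]
            if rest and any(rest in output_paths.get(s, ()) for s in calls):
                candidate = conf1 | {("data", "/".join(pe[:j + 1])),
                                     ("query", rest)}
                if cost(candidate) < cost(conf1):
                    conf1, rep_nodes = candidate, current
                    break                   # stop the evaluation of pe here
        if rep_nodes is None:               # nothing chosen: replicate pe's result
            rep_nodes = current
            conf1 = conf1 | {("data", "/".join(pe))}
        for sv in calls_in(rep_nodes):      # recurse into service definitions
            if sv not in seen:
                seen.add(sv)
                conf1 = rep_decision(conf1, defs.get(sv, []), doc, cost,
                                     output_paths, defs, seen)
    return conf1

# Hypothetical input: a resort whose snow condition is given by a call.
doc = {"label": "root", "children": [
    {"label": "resort", "children": [
        {"label": "name", "children": []},
        {"label": "fun", "service": "SnowConditions", "children": []},
    ]},
]}
output_paths = {"SnowConditions": {"snow_cond"}}
defs = {"SnowConditions": []}               # service with no further paths
pe = ["resort", "snow_cond"]

cheaper = rep_decision(set(), [pe], doc, lambda c: 100 - 10 * len(c),
                       output_paths, defs)
dearer = rep_decision(set(), [pe], doc, lambda c: len(c), output_paths, defs)
```

With a toy cost that rewards replication, the sketch stops at `resort` and keeps the residual query `snow_cond` to run over the replicated fragment and its call; with a cost that penalizes it, the sketch falls back to replicating the result of the full path.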