Efficiently Publishing Relational Data as XML Documents

Data Warehousing Lab, master's program, 2nd semester, 박유림

Outline
- Introduction
- Language Specification
- Implementation Alternatives
- Conclusion
- Future Work

Introduction: Why XML? Why Relational Data?
Why XML? XML is rapidly emerging as a standard for exchanging business data on the World Wide Web. Its nested, self-describing structure provides a simple and flexible means for applications to exchange data.
Why relational data? Most business data is stored in relational databases, which offer scalability and flexibility.

Issues in publishing XML
Relational tables are flat, while XML documents are tagged, nested, and graph-structured. Two questions follow:
- How to convert a relational table into an XML document (a graph model): this needs a language to specify the conversion.
- How to transform the flat representation into a tagged, nested representation: this needs an efficient implementation strategy.

Transformation Languages
The language specification describes how to structure and tag the data from one or more tables as a hierarchical XML document. There are two choices:
- XML query languages.
- SQL-based languages with minor scalar and aggregate function extensions for XML construction.

SQL-based Language Specification
Nested SQL statements are used to specify nesting. SQL functions are used to specify element construction:
- XMLAGG: an aggregate function that concatenates XML fragments, ordering the inputs.
- XML constructor functions: scalar functions returning XML fragments, one for each table in the relational database.

XMLAGG Example
    SELECT XMLELEMENT("Department",
             XMLAGG(XMLELEMENT("Employee", e.job_id||' '||e.last_name)))
           AS "Dept_list"
    FROM employees e
    WHERE e.department_id = 30;
Result:
    Dept_list
    <Department>
      <Employee>CSC Raphaely</Employee>
      <Employee>CSC Khoo</Employee>
      <Employee>CSC Baida</Employee>
      <Employee>CSC Tobias</Employee>
      <Employee>CSC Himuro</Employee>
      <Employee>CSC Colmenares</Employee>
    </Department>
The query produces a Department element containing Employee elements, with the employee job ID and last name as the contents of each element.
Notes: XMLAgg is an aggregate function like MIN, MAX, and AVG. The key to using it is grouping the data by some common value. In Listing 3, the query groups the data by county name. The XMLAgg function then takes all of the individual <Attraction> elements for that county, concatenates them, and returns a single XMLType value, which becomes the input to the enclosing XMLElement call that creates the <County> element. The result (one XML document per county) can be seen in Listing 3. Listing 3 also shows XMLAttributes being used to generate several attribute values. The outermost call to XMLAttributes generates three attributes: one for the county name, and two that point to the XML schema the document conforms to. Because the query uses GROUP BY, the outermost XMLElement call and its associated XMLAttributes call can reference only the grouped columns. Using a.county_name instead of c.county_name raises an error, because a.county_name is not a GROUP BY expression.

XMLAGG Example (cont)
Use the GROUP BY clause to group the returned set of rows into multiple groups:
    SELECT XMLELEMENT("Department",
             XMLAGG(XMLELEMENT("Employee", e.job_id||' '||e.last_name)))
           AS "Dept_list"
    FROM employees e
    GROUP BY e.department_id;
Result:
    Dept_list
    <Department>
      <Employee>ME Whalen</Employee>
    </Department>
    <Department>
      <Employee>ECE Hartstein</Employee>
      <Employee>ECE Fay</Employee>
    </Department>
    <Department>
      <Employee>CSC Raphaely</Employee>
      <Employee>CSC Khoo</Employee>
      <Employee>CSC Tobias</Employee>
      <Employee>CSC Baida</Employee>
    </Department>
    ...

Figure 1. An XML document describing a Customer
    <customer id="C1">
      <name> John Doe </name>
      <accounts>
        <account id="A1"> </account>
        <account id="A2"> </account>
      </accounts>
      <porders>
        <porder id="PO1" acct="A2"> <!-- first purchase order -->
          <date> 1 Jan 2000 </date>
          <items>
            <item id="I1"> Shoes </item>
            <item id="I2"> Cloth </item>
          </items>
          <payments>
            <payment id="P1"> due January 15 </payment>
            <payment id="P2"> due January 28 </payment>
            <payment id="P3"> due February 9 </payment>
          </payments>
        </porder>
        <porder id="PO2" acct="A1"> <!-- second purchase order -->
          …
        </porder>
      </porders>
    </customer>
This XML document represents a customer. A customer has a set of accounts and a set of purchase orders, and each purchase order in turn has a set of items and a set of payments. The customer has an id attribute, a special kind of attribute that uniquely identifies an element in an XML document. Each customer has a name, represented by the <name> sub-element nested under customer. A customer element also has nested sub-elements representing the accounts and purchase orders, each of which has its own attributes and sub-elements. An interesting feature is that the purchase order elements have an attribute called "acct". This attribute is of type IDREF: it logically points to the element whose ID has the same value. The first purchase order points to the second account, while the second purchase order points to the first account.

Figure 2: Customer Relational Schema
The customer relational schema models the customer information from the previous slide. As shown, there are customer, account, purchase order, item, and payment tables. Each table has an id and other attributes associated with it, and the tables are connected by foreign key relationships (shown by arrows). To convert data in this relational schema to the XML document in Figure 1, we can write a SQL query that follows the nested structure of the document, as in Figure 3.

Figure 3: SQL Query to Construct XML Documents from Relational Data
The query in Figure 3 produces both SQL and XML data. The overall query consists of several correlated sub-queries, and the easiest way to understand it is to read it from top to bottom. One correlated sub-query retrieves the customer's accounts (lines 2-4) and another retrieves the customer's purchase orders (lines 5-13). Each correlated sub-query returns an XML fragment. The CUST XML constructor then creates the customer XML element: it takes the customer name, the account information, and the purchase order information as input and produces a customer XML element as output.

Figure 4: Definition of an XML Constructor
An XML constructor is a scalar function returning XML. The correlated sub-queries can be interpreted similarly, with the ACCT, PORDER, ITEM, and PAYMENT constructors defined like CUST. Each nested query ultimately has to return one XML fragment.

Implementation Alternatives
The main difference between relational tables and XML documents is that XML documents have tags and nested structure, while relational tables do not. Tags and structure therefore have to be added somewhere along the conversion from relational tables to XML documents, and the choice of where to add them defines a space of alternatives.

Definitions
- Late (early) tagging: tagging is the final (or an early) step in the construction of an XML document.
- Late (early) structuring: structuring is the final (or an early) step in the construction of an XML document.
- Inside the engine: tagging and structuring are done completely inside the relational engine.
- Outside the engine: part of the tagging and structuring, though not necessarily all of it, is done outside the relational engine.

Figure 5. Space of Alternatives for Publishing XML
The figure has two dimensions, tagging (early or late) and structuring (early or late), and each cell is split into inside-engine and outside-engine approaches:
- Early tagging, early structuring: inside the engine, Correlated CLOB and De-Correlated CLOB; outside the engine, Stored Procedure.
- Late tagging, early structuring: inside the engine, Sorted Outer Union (tagging inside); outside the engine, Sorted Outer Union (tagging outside).
- Late tagging, late structuring: inside the engine, Unsorted Outer Union (tagging inside); outside the engine, Unsorted Outer Union (tagging outside).
- Early tagging, late structuring: not a viable alternative, because physically tagging an XML document without having its structure makes no sense.

Early Tagging, Early Structuring
- The Stored Procedure approach (outside the engine)
- The Correlated CLOB approach (inside the engine)
- The De-Correlated CLOB approach (inside the engine)

Stored Procedure Approach (outside the engine)
The Stored Procedure approach essentially performs a nested-loop join outside the engine by issuing queries for each nested structure within the desired XML document:
Customer (name, ID) -> Customer's account -> Customer's purchase order -> Customer's payment
Consider the example shown in Figure 1. First, a query is issued to retrieve the root-level elements (customers). Information about a customer, such as the customer ID and customer name, is retrieved, tagged, and output. Then, using the customer's ID, a query is issued to retrieve the customer's account information, which is tagged and output. Next, while still on the same customer, a query is issued to retrieve the customer's purchase orders, which are tagged and output. For each purchase order retrieved, a separate query is issued for the purchase order's items and for the purchase order's payment information. Once this is done, the processing for one customer is complete. The same procedure is repeated for the next customer.
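The nested-loop behaviour described above can be sketched in Python with the standard sqlite3 module. This is only an illustration under stated assumptions: the table and column names (`odate`, `info`), the sample rows, and the helper names `build_demo_db` and `publish_customers` are hypothetical, and payments are omitted for brevity. The query counter shows how the number of queries grows with the number of parent rows.

```python
import sqlite3
from xml.sax.saxutils import escape, quoteattr


def build_demo_db():
    """Create a tiny in-memory database shaped like Figure 2 (sample rows are made up)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Customer   (id TEXT, name TEXT);
        CREATE TABLE Account    (id TEXT, custId TEXT);
        CREATE TABLE PurchOrder (id TEXT, custId TEXT, acct TEXT, odate TEXT);
        CREATE TABLE Item       (id TEXT, poId TEXT, info TEXT);
        INSERT INTO Customer   VALUES ('C1', 'John Doe');
        INSERT INTO Account    VALUES ('A1', 'C1'), ('A2', 'C1');
        INSERT INTO PurchOrder VALUES ('PO1', 'C1', 'A2', '1 Jan 2000');
        INSERT INTO Item       VALUES ('I1', 'PO1', 'Shoes'), ('I2', 'PO1', 'Cloth');
    """)
    return conn


def publish_customers(conn):
    """Tag while looping: one query per nested structure per parent row.

    Returns the XML text and the number of queries issued.
    """
    out = []
    queries = 1
    customers = conn.execute("SELECT id, name FROM Customer ORDER BY id").fetchall()
    for cust_id, name in customers:
        out.append(f"<customer id={quoteattr(cust_id)}><name>{escape(name)}</name>")

        # One query per customer for its accounts.
        queries += 1
        out.append("<accounts>")
        for (acct_id,) in conn.execute(
                "SELECT id FROM Account WHERE custId = ? ORDER BY id", (cust_id,)):
            out.append(f"<account id={quoteattr(acct_id)}/>")
        out.append("</accounts>")

        # One query per customer for its purchase orders.
        queries += 1
        orders = conn.execute(
            "SELECT id, acct, odate FROM PurchOrder WHERE custId = ? ORDER BY id",
            (cust_id,)).fetchall()
        out.append("<porders>")
        for po_id, acct, odate in orders:
            out.append(f"<porder id={quoteattr(po_id)} acct={quoteattr(acct)}>")
            out.append(f"<date>{escape(odate)}</date>")
            # One query per purchase order for its items.
            queries += 1
            out.append("<items>")
            for item_id, info in conn.execute(
                    "SELECT id, info FROM Item WHERE poId = ? ORDER BY id", (po_id,)):
                out.append(f"<item id={quoteattr(item_id)}>{escape(info)}</item>")
            out.append("</items></porder>")
        out.append("</porders></customer>")
    return "".join(out), queries
```

With one customer holding one purchase order, the sketch already issues four queries; every additional customer or purchase order adds more, which is the inefficiency this approach is known for.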

Stored Procedure Approach (cont)
It is a common approach today.
Disadvantages:
- Needs more than one query for tables that have nested structure, which is very inefficient.
- Needs a particular join order and the nested-loop join method.

The Correlated CLOB Approach (inside the engine)
The method is to move processing inside the relational engine, so that one large query with sub-queries is executed rather than many top-level queries. XML document fragments are represented as CLOBs. This can be accomplished by adding engine support for the XML constructors and the XMLAGG function. The query to produce the XML result can then be executed as a nested SQL query.

The Correlated CLOB Approach (cont)
Advantage: eliminates the overhead of issuing many SQL queries to the relational engine.
Disadvantage: the sub-queries remain, so a nested-loop join strategy is still needed.

The De-Correlated CLOB Approach (inside the engine)
- Compute the account lists associated with all customers.
- Compute the purchase order lists associated with all customers.
- Join the results above on the customer's id.
Correlations are replaced by joining the tables using outer joins.

Figure 3. SQL query to Construct XML documents from Relation Data
    Select cust.name, CUST(cust.id, cust.name,
       (Select XMLAGG(ACCT(acct.id, acct.acctnum))
        From Account acct
        Where acct.custId = cust.id),
       (Select XMLAGG(PORDER(porder.id, porder.acct, porder.date,
          (Select XMLAGG(ITEM(item.id, item.desc))
           From Item item
           Where item.poId = porder.id),
          (Select XMLAGG(PAYMENT(pay.id, pay.desc))
           From Payment pay
           Where pay.poId = porder.id)))
        From PurchOrder porder
        Where porder.custId = cust.id))
    From Customer cust

Figure 6. De-Correlated SQL Query with Aggregations
First, each path from the root-level table to a leaf-level table is computed by joining the tables along the path (Customer joined with Account, and Customer joined with PurchaseOrder). Outer joins are used because the information about a parent has to be preserved even if it has no children. The set of leaf-level XML elements corresponding to each leaf-level table is then built up by grouping on the id columns of the parent tables on the path from the root-level table to the leaf-level table. Higher-level structures are built up by joining on these id fields and using an XML constructor. Although the Item and Payment tables are ignored for clarity, it is easy to see how this approach generalizes to arbitrary nesting depths.
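The steps just described can be approximated with a runnable sketch in Python and sqlite3. This is not the query from Figure 6: SQLite has no XMLAGG or XML constructors, so `group_concat` and string concatenation are used here as stand-ins, and the table contents and the names `DECORRELATED_QUERY` and `run_decorrelated` are hypothetical.

```python
import sqlite3

# Each common table expression computes one root-to-leaf path with an outer join,
# groups on the parent id, and aggregates the children into one fragment.
# group_concat stands in for XMLAGG; '||' concatenation stands in for ACCT/PORDER.
DECORRELATED_QUERY = """
WITH acct_lists AS (
    SELECT c.id AS custId,
           group_concat('<account id="' || a.id || '"/>', '') AS accts
    FROM Customer c LEFT OUTER JOIN Account a ON a.custId = c.id
    GROUP BY c.id
),
po_lists AS (
    SELECT c.id AS custId,
           group_concat('<porder id="' || p.id || '"/>', '') AS porders
    FROM Customer c LEFT OUTER JOIN PurchOrder p ON p.custId = c.id
    GROUP BY c.id
)
SELECT c.id, c.name, al.accts, pl.porders
FROM Customer c
JOIN acct_lists al ON al.custId = c.id
JOIN po_lists   pl ON pl.custId = c.id
ORDER BY c.id
"""


def run_decorrelated():
    """Load made-up sample rows and run the de-correlated query."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Customer   (id TEXT, name TEXT);
        CREATE TABLE Account    (id TEXT, custId TEXT);
        CREATE TABLE PurchOrder (id TEXT, custId TEXT);
        INSERT INTO Customer   VALUES ('C1', 'John Doe'), ('C2', 'No Children');
        INSERT INTO Account    VALUES ('A1', 'C1'), ('A2', 'C1');
        INSERT INTO PurchOrder VALUES ('PO1', 'C1');
    """)
    return conn.execute(DECORRELATED_QUERY).fetchall()
```

Because the paths are computed with outer joins, the childless customer C2 still appears in the result, with NULL fragments. The order of fragments inside `group_concat` is not guaranteed by SQLite, so a real implementation would need an explicit ordering.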

The De-Correlated CLOB Approach (cont)
Advantage: this approach is more flexible, because it allows the engine to explore different join strategies.
Disadvantage: it is as inefficient as the CLOB-Corr approach in other respects, because it still needs a lot of copying, parallelism, and materialization of CLOBs.

Summary of Approaches of Early Tagging and Early Structuring
Classification: early tagging, early structuring
- Stored Procedure (outside the engine). Advantage: common method today. Disadvantage: inefficient; needs a particular join order and the nested-loop join method.
- Correlated CLOB, CLOB-Corr (inside the engine). Advantage: eliminates some overhead. Disadvantage: still needs a nested-loop join strategy.
- De-Correlated CLOB, CLOB-decor (inside the engine). Advantage: flexible. Disadvantage: same as CLOB-Corr.

Late Tagging, Late Structuring
Late tagging and late structuring means that tagging and structuring are done as the final step of constructing an XML document. Construction of an XML document is logically split into two phases:
(1) Content creation (relational query processing), where the relational data is produced as unstructured content.
(2) Tagging and structuring, where the relational data is structured and tagged to produce the result XML document.

Late Tagging, Late Structuring (cont.)
For content creation, we consider only "inside the engine" approaches, so that techniques such as joins can be exploited. There are two approaches:
- Redundant Relation approach: join all source tables, using join predicates to relate parents to their children.
- (Unsorted) Outer Union approach: separate the representation of a child from the representation of its siblings, and then union them with outer joins.

Redundant Relation Approach
The Redundant Relation approach joins all of the source tables. It is a simple way to produce the needed content.
Advantage: uses regular, set-oriented relational processing.
Pitfall: both content redundancy and processing redundancy.

Redundant Relation Approach(cont)
Figure 7. Query for Redundant Relation Content

The Outer Union Plan
The basic problem with the Redundant Relation approach is that the number of tuples in the relational result grows as the product of the number of children per parent. If the result's size could be limited to the sum of the number of children per parent, redundancy could be reduced. In one execution plan, each path from the root-level table to a leaf-level table is computed by means of joins. In this example there are three paths: Customer-Account, Customer-PurchaseOrder-Item, and Customer-PurchaseOrder-Payment.
Advantage: eliminates much of the data redundancy of the Redundant Relation approach, because children of the same parent are represented in separate tuples.
Disadvantage: some data redundancy is still present.

Figure 8. The Outer Union Plan
Each path computation produces one tuple per data item at the leaf level of the XML tree, and each such tuple includes the information about all of its ancestors. The final step in creating the relational content is to glue together all the tuples representing leaf-level elements of the XML tree into a single relation. A type column is added to distinguish, for example, an account tuple from an item tuple (which carries itemId and itemInfo).
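To make the product-versus-sum difference concrete, here is a small sketch in Python and sqlite3 that compares a Redundant Relation query with an outer union over two paths. The queries and sample rows are assumptions for illustration, not the queries from Figures 7 and 8, and only the Customer-Account and Customer-PurchOrder paths are included. Plain joins are used for the paths, so a customer with no children would need outer joins to be preserved.

```python
import sqlite3

# Redundant Relation: join everything, so accounts and orders multiply.
REDUNDANT_RELATION = """
SELECT c.id, a.id, p.id
FROM Customer c
LEFT OUTER JOIN Account a    ON a.custId = c.id
LEFT OUTER JOIN PurchOrder p ON p.custId = c.id
"""

# Outer union: one branch per root-to-leaf path, padded with NULLs,
# plus a type column that says which kind of tuple each row is.
OUTER_UNION = """
SELECT 'account' AS type, c.id AS custId, a.id AS acctId, NULL AS poId
FROM Customer c JOIN Account a ON a.custId = c.id
UNION ALL
SELECT 'porder', c.id, NULL, p.id
FROM Customer c JOIN PurchOrder p ON p.custId = c.id
"""


def load_sample():
    """One customer with 2 accounts and 3 purchase orders (made-up rows)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Customer   (id TEXT, name TEXT);
        CREATE TABLE Account    (id TEXT, custId TEXT);
        CREATE TABLE PurchOrder (id TEXT, custId TEXT);
        INSERT INTO Customer   VALUES ('C1', 'John Doe');
        INSERT INTO Account    VALUES ('A1', 'C1'), ('A2', 'C1');
        INSERT INTO PurchOrder VALUES ('PO1', 'C1'), ('PO2', 'C1'), ('PO3', 'C1');
    """)
    return conn


conn = load_sample()
redundant = conn.execute(REDUNDANT_RELATION).fetchall()
outer_union = conn.execute(OUTER_UNION).fetchall()
print(len(redundant), len(outer_union))  # prints "6 5": 2 * 3 rows versus 2 + 3 rows
```

The gap widens quickly: with 10 accounts and 100 purchase orders for one customer, the first query would return 1000 rows and the second 110.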

Structuring and Tagging ----Hash-based tagger
Structuring and tagging can be done inside or outside the engine. Either way, there are two steps:
- Group all siblings under the same parent (which also eliminates duplicates in the case of the Redundant Relation approach).
- Extract the information from each tuple and tag it to produce the XML result.
An efficient way to group siblings is to use a main-memory hash table to look up the parent of a node, given the parent's type and id information. If this step is performed inside the engine, it can be implemented as an aggregate function (including the XMLAGG function and XML constructors).
Whenever a tuple containing information about an XML element is seen, it is hashed on the element's type and the ids of its ancestors to determine whether its parent is present in the hash table. If the parent is present, a new XML element is created and added as a child of the parent. If the parent is not present, a hash is performed on the type and ids of all ancestors except that of the parent, to determine whether the grandparent exists. If the grandparent is present, the parent is created and then the child is created. If the grandparent is also not present, the procedure is repeated until an ancestor is found in the hash table or the root of the document is reached. After all the input tuples have been hashed, the entire tagged structure can be written out as an XML file.
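The lookup-and-create procedure above can be sketched in Python using a dict as the hash table and the standard xml.etree.ElementTree module for the elements. The tuple layout (type, list of ids from the root down, text), the `PATHS` table, and the function name `hash_tag` are assumptions made for this illustration. One small deviation from the description: the sketch first probes for the element itself, so that duplicate tuples are absorbed rather than creating a second element.

```python
import xml.etree.ElementTree as ET

# Element types from the root down to each tuple type (hypothetical schema).
PATHS = {
    "account": ["customer", "account"],
    "item": ["customer", "porder", "item"],
}


def hash_tag(tuples):
    """Group unordered tuples into XML trees using a main-memory hash table.

    Each tuple is (type, ids, text), where ids holds the ids of all
    ancestors followed by the element's own id.
    """
    table = {}   # (element type, ids from root to element) -> Element
    roots = []
    for typ, ids, text in tuples:
        path = PATHS[typ]
        # Walk upward until a level that is already in the hash table is found.
        depth = len(path)
        while depth > 0 and (path[depth - 1], tuple(ids[:depth])) not in table:
            depth -= 1
        node = table[(path[depth - 1], tuple(ids[:depth]))] if depth else None
        # Create every missing level, top-down, below the ancestor that was found.
        for level in range(depth, len(path)):
            elem = ET.Element(path[level], id=ids[level])
            if node is None:
                roots.append(elem)
            else:
                node.append(elem)
            table[(path[level], tuple(ids[:level + 1]))] = elem
            node = elem
        node.text = text
    return roots
```

The hash table holds every element until the input is exhausted, which is why memory use grows with the size of the document rather than with its nesting depth.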

Structuring and Tagging ----Hash-based tagger (cont)
Problem: The main limitation of using a hash-based tagger is that performance can degrade rapidly when there is insufficient memory to hold the hash table and the intermediate result.

Summary of Approaches of Late Tagging and Late Structuring
Classification: late tagging, late structuring (inside or outside the engine)
- Redundant Relation. Advantage: regular, set-oriented relational processing. Disadvantage: content and processing redundancy.
- Unsorted Path Outer Union. Advantage: eliminates much of the data redundancy. Disadvantage: still has some data redundancy.

Late Tagging, Early Structuring
The main problem with late tagging, late structuring is the complex memory management required in the hash-based tagger when memory is scarce. To eliminate this problem, the relational engine can be used to produce "structured content", which can then be tagged by a "constant space tagger".
Structured content creation: the key to structuring relational content is to order it the same way that it needs to appear in the result XML document. The order must satisfy three conditions:
(1) All of the information about a node X in the XML tree occurs either before or along with the information about the children of X. In other words, parent information occurs before, or with, child information.
(2) All tuples representing information about a node X and its descendants occur together and are not mixed in with information about non-descendant nodes.
(3) The relative order of the tuples matches any user-specified order.
Sorted Outer Union approach: sort the result of the Node Outer Union on its id fields, with the ids of parent nodes occurring higher in the sort order than the ids of child nodes. Tuples having null values in the sort fields occur before tuples having non-null values in those fields.
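The sort order described above can be illustrated with a short Python sketch. The four-column row layout (custId, acctId, poId, itemId) and the function names are hypothetical; in the approach itself the sort is performed by the relational engine, not by client code. The key places parent ids first and makes NULL (None) sort before any real id.

```python
def sort_key(row):
    """Sort key for a Node Outer Union row: (custId, acctId, poId, itemId).

    Parent ids come first in the key, and None sorts before any real id,
    so a node's own tuple precedes the tuples of its descendants.
    """
    return tuple((0, "") if value is None else (1, value) for value in row)


def sorted_outer_union(rows):
    """Return the rows in an order a constant space tagger can consume."""
    return sorted(rows, key=sort_key)


rows = [
    ("C2", None, None, None),    # customer C2
    ("C1", None, "PO1", "I1"),   # item I1 of purchase order PO1
    ("C1", "A1", None, None),    # account A1
    ("C1", None, None, None),    # customer C1
    ("C1", None, "PO1", None),   # purchase order PO1
]
ordered = sorted_outer_union(rows)
```

After sorting, customer C1 comes first, followed by its purchase order, that order's item, and then its account, before any tuple of customer C2. This satisfies conditions (1) and (2); a user-specified order, as in condition (3), would need extra sort columns.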

Late Tagging, Early Structuring (cont)
Tagging sorted documents: the constant space tagger. Once the structured content is created as described before, the next step is to tag it and construct the result XML document. Since tuples arrive in document order, they can be tagged immediately and written out as they are seen: a tag is appended as soon as its data is seen. The tagger only needs memory to remember the parent ids of the last tuple seen, so that it knows when to close a tag. The storage required by the constant space tagger is therefore proportional only to the level of nesting and is independent of the size of the XML document.
Advantages:
- Scales to large data volumes, because relational database sorting is disk-friendly.
- Supports user-specified ordering (at additional cost).
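A minimal sketch of such a tagger in Python follows. The input layout is an assumption: each tuple is (path, text), where path lists the (tag, id) pairs from the root down to the node, and the tuples are already in document order with no duplicates. The function name `constant_space_tag` and the `write` callback are hypothetical. The only state kept is the list of currently open elements, whose length is bounded by the nesting depth.

```python
from xml.sax.saxutils import escape, quoteattr


def constant_space_tag(sorted_tuples, write):
    """Stream sorted tuples to XML, remembering only the open ancestors."""
    open_path = []  # (tag, id) of each currently open element, root first
    for path, text in sorted_tuples:
        # Count how many open elements are also ancestors of this tuple.
        common = 0
        while (common < len(open_path) and common < len(path)
               and open_path[common] == path[common]):
            common += 1
        # Close everything that is not an ancestor of the new tuple.
        while len(open_path) > common:
            tag, _ = open_path.pop()
            write(f"</{tag}>")
        # Open the new levels and emit the tuple's content right away.
        for tag, ident in path[common:]:
            write(f"<{tag} id={quoteattr(ident)}>")
            open_path.append((tag, ident))
        if text is not None:
            write(escape(text))
    # End of input: close whatever is still open.
    while open_path:
        tag, _ = open_path.pop()
        write(f"</{tag}>")
```

A usage example: pass `out.append` for a list `out` (or a file's `write` method) as the callback, so that each piece of output leaves the tagger as soon as it is produced.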

Summary of Experimental Results
1) Constructing an XML document inside the relational engine is far more efficient than doing so outside the engine.
2) When there is sufficient memory for processing to be done in main memory, the Unsorted Outer Union approach with the hash-based tagger is the best.
3) When processing cannot be done in main memory, the Sorted Outer Union approach is the approach of choice (both inside and outside the engine).

Conclusion
Publishing XML from relational sources is important on the Internet.
Two language alternatives:
- XML query language
- SQL-based language
Implementation alternatives:
- Inside engine / outside engine
- Unsorted Outer Union
- Sorted Outer Union

Future Work
- Investigating the impact of parallelism.
- Studying new runtime operators inside the relational engine to enhance the performance of outer union plans.
- Advancing techniques for efficient memory management to extend the useful range of the Unsorted Outer Union approach.
