Storing and Querying XML Documents Without Using Schema Information
  Kanda Runapongsa, Department of Computer Engineering, Khon Kaen University, Thailand
  Jignesh M. Patel, Department of EECS, University of Michigan, USA
Sample XML Data
  <bib>
    <book>
      <author>John</author>
      <publisher>ABC</publisher>
    </book>
    <article>
      <author>Brown</author>
      <author>Smith</author>
    </article>
  </bib>
Motivation
  The amount of XML data is increasing rapidly
    Enterprise application integration
    B2B interchange
    Web services
  Efficient tools for managing large XML data sets are urgently needed
  Solutions range from native XML DBMSs to relational DBMSs
Pros & Cons of Using an RDBMS
  Advantages of using an RDBMS
    Well-developed query optimization techniques
    Effective storage and indexing mechanisms
    Scalability, parallelism, and distributed processing
    Concurrency and recovery
  Disadvantages of using an RDBMS
    Need to transform between XML data and relational data
    Need to perform multiple joins between tables
  This work: try to improve the performance of an RDBMS
Queries on XML
  Typical queries on XML are containment queries
    Direct containment query: tests whether element c is a child of element p
      Example: p/c
    Indirect containment query: tests whether element d is a descendant of element a
      Example: a//d
  How should we process these queries?
XML Query Processing (Using Schema)
  Use a DTD graph, which represents the structure of the DTD
  Nodes correspond to
    Elements (example: book)
    Attributes
    Operators (example: *)
  [Figure: DTD graph with nodes bib, book, article, author, publisher and operator nodes * and +]
XML Query Processing (Using Schema)
  Rules for mapping nodes to tables/attributes in an RDBMS
    1. Create a table for each node that satisfies either condition:
       (C1) it has no incoming link
       (C2) it is below a '*' node or a '+' node
    2. Inline all remaining nodes as table attributes
  [Figure: the DTD graph (bib, book, article, author, publisher, * and + nodes) with a legend marking each node as a table or an attribute]
XML Query Processing (Not Using Schema)
  Each node is labeled (begin, end, level)
  Containment query: d is contained in a iff a.begin < d.begin && d.end < a.end
  d is directly contained in a if, in addition, a.level = d.level - 1
  Q: book/author  A: (3,5,3)
    2 < 3 && 5 < 9 && 2 = 3 - 1
  Numbered tree for the sample data
    bib (1,18,1)
      book (2,9,2)
        author (3,5,3)
          John (4,4,4)
        publisher (6,8,3)
          ABC (7,7,4)
      article (10,17,2)
        author (11,13,3)
          Brown (12,12,4)
        author (14,16,3)
          Smith (15,15,4)
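The numbering on this slide can be reproduced with a short script. This is a minimal sketch, not the authors' loader: it assumes each start tag, each word, and each end tag consumes one position, which happens to reproduce the numbers shown for the sample data. The names number_nodes, contains, and directly_contains are hypothetical.

```python
import xml.etree.ElementTree as ET

SAMPLE = (
    "<bib><book><author>John</author><publisher>ABC</publisher></book>"
    "<article><author>Brown</author><author>Smith</author></article></bib>"
)

def number_nodes(xml_text):
    """Return [(tag, begin, end, level)] for every element, ordered by begin."""
    elements = []
    counter = 0

    def visit(node, level):
        nonlocal counter
        counter += 1                                  # start tag takes one position
        begin = counter
        counter += len((node.text or "").split())     # each word takes one position
        for child in node:
            visit(child, level + 1)
            counter += len((child.tail or "").split())
        counter += 1                                  # end tag takes one position
        elements.append((node.tag, begin, counter, level))

    visit(ET.fromstring(xml_text), 1)
    return sorted(elements, key=lambda e: e[1])

def contains(a, d):
    """Indirect containment: d lies strictly inside a's (begin, end) interval."""
    return a[1] < d[1] and d[2] < a[2]

def directly_contains(a, d):
    """Direct containment: d is inside a and exactly one level deeper."""
    return contains(a, d) and a[3] == d[3] - 1

nodes = number_nodes(SAMPLE)
book, author = nodes[1], nodes[2]
print(book, author, directly_contains(book, author))
```

The check for book/author evaluates 2 < 3, 5 < 9, and 2 == 3 - 1, matching the worked example on the slide.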
The Proposed Solution (PAID)
  Use a numbering scheme (begin, end, level) to answer direct and indirect containment queries, because it is applicable even when there is no schema information
  Store the path of each node to answer long path queries, such as a/b/c/d/e
  Store the position of the parent node with each node to quickly establish the parent-child relationship (direct containment) between any two given nodes
Storing Node Information in Tables
  element table: rows for <author>, numbered (3,5,3), (11,13,3), (14,16,3)
    docID  term    pathID  begin  end  level  parentID
    1      author  3       3      5    3      2
    1      author  6       11     13   3      10
    1      author  6       14     16   3      10
  text table: row for John, numbered (4,4,4)
    term  docID  wordno  level  parentID
    John  1      4       4      3
  path table
    pathExp              pathID
    /bib/book/author     3
    /bib/article/author  6
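The three tables can be populated in one pass over the document. The following is a rough sketch under stated assumptions, not the authors' Xerces-based loader: it uses SQLite instead of DB2, assigns path IDs in order of first appearance (an assumption that happens to yield IDs 3 and 6 for the two author paths shown), and ignores text that follows a child element. The helper names path_id and load_document are hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET

SAMPLE = (
    "<bib><book><author>John</author><publisher>ABC</publisher></book>"
    "<article><author>Brown</author><author>Smith</author></article></bib>"
)

# "begin", "end", and "text" are quoted because they collide with SQL keywords/type names.
SCHEMA = """
CREATE TABLE element (docID INTEGER, term TEXT, pathID INTEGER,
                      "begin" INTEGER, "end" INTEGER, level INTEGER, parentID INTEGER);
CREATE TABLE "text" (term TEXT, docID INTEGER, wordno INTEGER, level INTEGER, parentID INTEGER);
CREATE TABLE path (pathExp TEXT, pathID INTEGER);
"""

def path_id(conn, path_exp):
    """Return the ID of path_exp, assigning the next free ID on first sight."""
    row = conn.execute("SELECT pathID FROM path WHERE pathExp = ?", (path_exp,)).fetchone()
    if row:
        return row[0]
    new_id = conn.execute("SELECT COUNT(*) FROM path").fetchone()[0] + 1
    conn.execute("INSERT INTO path VALUES (?, ?)", (path_exp, new_id))
    return new_id

def load_document(conn, doc_id, xml_text):
    counter = 0

    def visit(node, level, parent_begin, parent_path):
        nonlocal counter
        counter += 1
        begin = counter
        path_exp = parent_path + "/" + node.tag
        pid = path_id(conn, path_exp)
        for word in (node.text or "").split():
            counter += 1
            # parentID of a word is the begin position of its enclosing element
            conn.execute('INSERT INTO "text" VALUES (?, ?, ?, ?, ?)',
                         (word, doc_id, counter, level + 1, begin))
        for child in node:
            visit(child, level + 1, begin, path_exp)
        counter += 1
        conn.execute("INSERT INTO element VALUES (?, ?, ?, ?, ?, ?, ?)",
                     (doc_id, node.tag, pid, begin, counter, level, parent_begin))

    visit(ET.fromstring(xml_text), 1, None, "")

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
load_document(conn, 1, SAMPLE)
for row in conn.execute('SELECT * FROM element WHERE term = ? ORDER BY "begin"', ("author",)):
    print(row)
```

Storing the parent's begin position as parentID is what later lets a direct containment test become a single equality comparison.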
Other Mapping Approaches
  The BEL approach [ZND+01]
    Has begin, end, level information
    But no path and parent ID information
    Stores a single word in each tuple
  The BELP approach [SYU99]
    Has begin, end, level information (but stored as float)
    Has path information, but no parentID information
    Stores multiple words in each tuple
SQL Queries using PAID
  book/author
    select *
    from element b, element a
    where b.term = 'book' and a.term = 'author'
      and b.docID = a.docID
      and a.parentID = b.begin
  bib//author
    select *
    from element b, element a
    where b.term = 'bib' and a.term = 'author'
      and b.docID = a.docID
      and b.begin < a.begin and a.end < b.end
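Both PAID queries can be tried against the sample data. This is a minimal sketch that assumes SQLite rather than DB2 and keeps only the columns the two queries touch; the parentID values for nodes other than author are derived from the rule that parentID holds the parent's begin position.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "begin" and "end" are quoted because they are SQL keywords.
conn.execute(
    'CREATE TABLE element (docID INTEGER, term TEXT, '
    '"begin" INTEGER, "end" INTEGER, level INTEGER, parentID INTEGER)'
)
conn.executemany(
    "INSERT INTO element VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, "bib", 1, 18, 1, None),
        (1, "book", 2, 9, 2, 1),
        (1, "author", 3, 5, 3, 2),
        (1, "publisher", 6, 8, 3, 2),
        (1, "article", 10, 17, 2, 1),
        (1, "author", 11, 13, 3, 10),
        (1, "author", 14, 16, 3, 10),
    ],
)

# book/author: direct containment is a single equality on parentID.
DIRECT = """
SELECT a."begin", a."end"
FROM element b, element a
WHERE b.term = 'book' AND a.term = 'author'
  AND b.docID = a.docID
  AND a.parentID = b."begin"
ORDER BY a."begin"
"""

# bib//author: indirect containment uses the interval test.
INDIRECT = """
SELECT a."begin", a."end"
FROM element b, element a
WHERE b.term = 'bib' AND a.term = 'author'
  AND b.docID = a.docID
  AND b."begin" < a."begin" AND a."end" < b."end"
ORDER BY a."begin"
"""

direct = conn.execute(DIRECT).fetchall()
indirect = conn.execute(INDIRECT).fetchall()
print(direct)
print(indirect)
```

The direct query returns only the author under book, while the indirect query returns all three authors under bib.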
SQL Queries using Other Approaches
  BEL: book/author
    select *
    from element b, element a
    where b.term = 'book' and a.term = 'author'
      and b.docID = a.docID
      and b.begin < a.begin and a.end < b.end
      and b.level = a.level - 1
  BELP: book/author
    select *
    from element e, path p
    where e.pathID = p.pathID
      and p.pathExp like '%/book/author'
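The BEL and BELP formulations can be compared on the sample data. This is a sketch under several assumptions: SQLite stands in for DB2, the slide's wildcard path match is written as SQL LIKE with '%', integer positions are used even though BELP stores floats, and path IDs other than 3 and 6 are invented here by numbering paths in order of first appearance.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE element (docID INTEGER, term TEXT, pathID INTEGER,
                      "begin" INTEGER, "end" INTEGER, level INTEGER);
CREATE TABLE path (pathExp TEXT, pathID INTEGER);
""")
conn.executemany(
    "INSERT INTO element VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, "bib", 1, 1, 18, 1),          # pathID 1 is an assumption
        (1, "book", 2, 2, 9, 2),          # pathID 2 is an assumption
        (1, "author", 3, 3, 5, 3),
        (1, "publisher", 4, 6, 8, 3),     # pathID 4 is an assumption
        (1, "article", 5, 10, 17, 2),     # pathID 5 is an assumption
        (1, "author", 6, 11, 13, 3),
        (1, "author", 6, 14, 16, 3),
    ],
)
conn.executemany(
    "INSERT INTO path VALUES (?, ?)",
    [
        ("/bib", 1), ("/bib/book", 2), ("/bib/book/author", 3),
        ("/bib/book/publisher", 4), ("/bib/article", 5), ("/bib/article/author", 6),
    ],
)

# BEL: structural self-join using interval containment plus a level check.
BEL = """
SELECT a."begin", a."end"
FROM element b, element a
WHERE b.term = 'book' AND a.term = 'author'
  AND b.docID = a.docID
  AND b."begin" < a."begin" AND a."end" < b."end"
  AND b.level = a.level - 1
"""

# BELP: one join against the path table, matching the path suffix.
BELP = """
SELECT e."begin", e."end"
FROM element e, path p
WHERE e.pathID = p.pathID
  AND p.pathExp LIKE '%/book/author'
"""

bel_rows = conn.execute(BEL).fetchall()
belp_rows = conn.execute(BELP).fetchall()
print(bel_rows, belp_rows)
```

Both queries select the same author node; they differ in whether the work is done by an interval join on element or by a string match on path.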
Experimental Setup: Platforms
  Software
    Apache Xerces C++ version 2.0 to parse the documents and generate the contents of the relations for the different mapping approaches
    Commercial RDBMS: IBM DB2 UDB 7.0
    32 MB buffer pool
  Hardware
    1.2 GHz Pentium Celeron, 256 MB memory
    Windows XP
Experimental Setup: Data Set
  Data set: the Shakespeare Plays XML documents
  One copy of the data set is 8 MB: 37 files of about 0.2 MB each
  To obtain a larger experimental data set, we use eight copies of the original Shakespeare data set, for a total input size of 64 MB
Experimental Setup: Workload
  6 micro-benchmark queries
    'element' contains 'text'
    Direct containment
    Indirect containment
  Examples
    ACT/SCENE/SPEECH/LINE[contains(STAGEDIR, 'Rising')]
    /PLAY[contains(TITLE, 'Juliet')]//ACT/SCENE/SPEECH[contains(LINE, 'love')][contains(SPEAKER, 'ROMEO')]
Experimental Results
  Query execution times (seconds)
    Query  BEL     BELP    PAID
    QS1    24.92   30.70   0.03
    QS2    81.39   18.46   10.29
    QS3    367.39  836.00  30.99
    QS4    10.67   23.35   1.42
    QS5    580.40  952.09  56.61
    QS6    3.54    0.11    0.01
  PAID is the fastest approach on every query; its speedup ranges from under 2x (QS2 versus BELP) to about three orders of magnitude (QS1)
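The size of PAID's advantage per query follows directly from the reported times. The short calculation below is only an illustration using the numbers on this slide; the names TIMES and speedups are hypothetical.

```python
# Execution times in seconds, copied from the slide: (BEL, BELP, PAID).
TIMES = {
    "QS1": (24.92, 30.70, 0.03),
    "QS2": (81.39, 18.46, 10.29),
    "QS3": (367.39, 836.00, 30.99),
    "QS4": (10.67, 23.35, 1.42),
    "QS5": (580.40, 952.09, 56.61),
    "QS6": (3.54, 0.11, 0.01),
}

def speedups(times):
    """Map each query to (BEL time / PAID time, BELP time / PAID time)."""
    return {q: (bel / paid, belp / paid) for q, (bel, belp, paid) in times.items()}

ratios = speedups(TIMES)
for query, (vs_bel, vs_belp) in ratios.items():
    print(f"{query}: {vs_bel:8.1f}x vs BEL, {vs_belp:8.1f}x vs BELP")
```

The ratios vary widely across queries, so the per-query numbers are more informative than a single summary figure.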
Why Does PAID Perform Better?
  1) PAID uses the parentID attribute to quickly find the parent nodes
  2) PAID uses the path information to reduce the number of join operations in long path queries
     When only 'begin, end, level' is used, the number of joins is proportional to the number of elements on the path
  3) PAID uses the index on the value attribute to quickly retrieve the nodes that satisfy the value-based predicates
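Point 2 can be made concrete by generating the SQL for a long path such as a/b/c/d/e in both styles. This is an illustrative sketch only: the generator functions bel_path_query and path_table_query are hypothetical, the path-table form uses SQL LIKE as an assumed way to match a path suffix, and the slides do not show the exact SQL that PAID issues for long paths.

```python
def bel_path_query(steps):
    """Build a begin/end/level query: one element alias per step on the path."""
    aliases = [f"e{i}" for i in range(len(steps))]
    tables = ", ".join(f"element {alias}" for alias in aliases)
    conds = [f"{alias}.term = '{step}'" for alias, step in zip(aliases, steps)]
    for parent, child in zip(aliases, aliases[1:]):
        conds += [
            f"{parent}.docID = {child}.docID",
            f'{parent}."begin" < {child}."begin"',
            f'{child}."end" < {parent}."end"',
            f"{parent}.level = {child}.level - 1",
        ]
    return f"SELECT {aliases[-1]}.* FROM {tables} WHERE " + " AND ".join(conds)

def path_table_query(steps):
    """Build a stored-path query: a single join between element and path."""
    suffix = "/" + "/".join(steps)
    return (
        "SELECT e.* FROM element e, path p "
        f"WHERE e.pathID = p.pathID AND p.pathExp LIKE '%{suffix}'"
    )

steps = ["a", "b", "c", "d", "e"]
bel_sql = bel_path_query(steps)
path_sql = path_table_query(steps)
print(bel_sql.count("element e"), "element scans with begin/end/level only")
print(path_sql.count("element e"), "element scan with stored paths")
```

With only begin, end, and level, every extra step on the path adds another self-join on element; with stored paths, the number of joins stays fixed regardless of path length.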
Conclusions
  We can use an RDBMS for storing and querying XML data
    Pros: builds on technology that has been developed for several decades (optimization, concurrency control, and recovery)
    Cons: not well-tuned for containment queries
  Performance on an RDBMS could be better if we encode more information
    Parent ID
    Path information
References
  [ZND+01] C. Zhang, J. Naughton, D. DeWitt, Q. Luo, and G. Lohman, "On Supporting Containment Queries in Relational Database Management Systems", In SIGMOD 2001
  [SYU99] T. Shimura, M. Yoshikawa, and S. Uemura, "Storage and Retrieval of XML Documents Using Object-Relational Databases", In International Conference on Database and Expert Systems Applications 1999