10/14/2001 Management of XML Documents without Schema in Relational Database Systems Workshop Objects, and Databases OOPSLA 2001, Tampa Thomas Kudrass Leipzig University of Applied Sciences Department of Computer Science and Mathematics
T. KudrassOOPSLA Workshop “Objects, and Databases“, Overview Introduction –Motivation –Main Issues Structure-Oriented Approach –Storage / Data Model –Queries –Evaluation Opaque Approach –Storage –Queries (vs. XPath) –Evaluation Prototype Implementation –Interface –Experience Outlook
T. KudrassOOPSLA Workshop “Objects, and Databases“, Motivation XML is used in: –data publishing (document-centric documents) –data exchange (data-centric documents) Why XML Documents without Schema? –generated by programs –mostly data-centric documents, e.g., account statements –high update frequency of the document structure evolving schemas Problems –How to deal with XML documents without DTD / XML Schema in databases? –Evaluate approaches –Use relational database systems Introduction Structure-Oriented Approach Opaque Approach Prototype Implementation Outlook
T. KudrassOOPSLA Workshop “Objects, and Databases“, Main Issues Evaluate Storage and Retrieval Methods –no defined document schema –platform: Oracle 8i XML-to-Relational Mapping Approaches –structure-oriented decomposition –opaque approach Identify Parameters of a Unified XML-DB Interface Implement a Testbed –qualitiative assessment –performance of both approaches Introduction Structure-Oriented Approach Opaque Approach Prototype Implementation Outlook
T. KudrassOOPSLA Workshop “Objects, and Databases“, Structure-Oriented Approach Characteristics –decomposition of an XML document into smaller units (elements) –depends on the document structure only –target system: relational DBMs, generic schema Variety of Mapping Methods –model XML document als directed ordered labeled graphs and map them to tables –proposed algorithms: edge tables [Florescu, Kossmann] universal table inlining techniques [Shanmugasundaram et.al.] model-based fragmentation Monet XML-model Introduction Structure-Oriented Approach Opaque Approach Prototype Implementation Outlook
T. KudrassOOPSLA Workshop “Objects, and Databases“, Storing XML Data in Relations XML-QL Data Model Peter Mary Fruitdale Ave. Introduction Structure-Oriented Approach Opaque Approach Prototype Implementation Outlook
T. KudrassOOPSLA Workshop “Objects, and Databases“, Data Model tblDocs DocIdurl tblEdge SourceIdTargetIdLeafIdAttrIdDocIdEdgeNameTypeDepth tblLeafs LeafIdValue tblAttrs AttrIdValue 1 n 1 0/1 1 Introduction Structure-Oriented Approach Opaque Approach Prototype Implementation Outlook tblDocs DocIdurl tblEdge SourceIdTargetIdLeafIdAttrIdDocIdEdgeNameTypeDepth tblLeafs LeafIdValue tblAttrs AttrIdValue 1 n 1 0/1 1
T. KudrassOOPSLA Workshop “Objects, and Databases“, Import Algorithm Peter Main Road Leipzig Source Id Target Id Leaf Id Attr Id Doc Id EdgeNameTypeDepth 01NULL 1treeref3 12NULL 1personref2 23NULL11ageattr0 241NULL1nameleaf0 25NULL 1addressref1 562NULL1streetleaf0 573NULL1zipleaf0 584NULL1cityleaf0 DocIdurl 1Sample.xml LeafIdValue 1Peter 2Main Road Leipzig AttrIdValue 136 Introduction Structure-Oriented Approach Opaque Approach Prototype Implementation Outlook
T. KudrassOOPSLA Workshop “Objects, and Databases“, Query Processing Query Language –XML query language most appropriate –data model of our solution is based on XML-QL XML-QL preferred choice XML-Relational Mismatch –relational DBMS “understands“ SQL only requires translation from XML-QL to SQL –generate result document from the tuples retrieved Introduction Structure-Oriented Approach Opaque Approach Prototype Implementation Outlook
T. KudrassOOPSLA Workshop “Objects, and Databases“, Query Processing Introduction Structure-Oriented Approach Opaque Approach Prototype Implementation Outlook XML-QL Query Parser Generate SQL Statement Object Structure SQL Statement Row Set Execute SQL Statement Construct Result Document XML Document DB
T. KudrassOOPSLA Workshop “Objects, and Databases“, Generate an SQL-Statement XML-QL Query CONSTRUCT { WHERE $n $a CONSTRUCT $n $a } SQL Statement SELECT DISTINCT B.Type AS n_Type, B.TargetId AS n_TargetId, B.Depth AS n_Depth, C.Value AS n_Value, D.Type AS a_Type, D.TargetId AS a_TargetId, D.Depth AS a_Depth, E.Value AS a_Value FROM tblEdge A,tblEdge B,tblLeafs C, tblEdge D,tblLeafs E WHERE (A.EdgeName = ‘person’) AND (A.TargetId = B.SourceId) AND (B.EdgeName = ‘name’) AND (B.LeafId = C.LeafId(+)) AND (A.TargetId = D.SourceId) AND (D.EdgeName = ‘address’) AND (D.LeafId = E.LeafId(+)) Introduction Structure-Oriented Approach Opaque Approach Prototype Implementation Outlook
T. KudrassOOPSLA Workshop “Objects, and Databases“, Construct Result Document Result Tuple Subtree Reconstruction SELECT A.EdgeName, A.Type, Al.Value AS A_LeafVal, Aa.Value AS A_AttrVal FROM tblEdge A, tblLeafs Al, tblAttrs Aa WHERE A.SourceId=5 AND A.leafId=Al.leafId(+) AND A.attrId=Aa.attrId(+) n_Typen_Tar getId n_Depthn_Valuea_Typea_Tar getId a_Deptha_Value leaf40Peterref51 EdgeNameTypeA_LeafValA_Attr Val streetleafMain Road 4 zipleaf04236 cityleafLeipzig Introduction Structure-Oriented Approach Opaque Approach Prototype Implementation Outlook
T. KudrassOOPSLA Workshop “Objects, and Databases“, Query Result XML Result Document Peter Main Road Leipzig Introduction Structure-Oriented Approach Opaque Approach Prototype Implementation Outlook
T. KudrassOOPSLA Workshop “Objects, and Databases“, Advantages Vendor Independence –no specific DBMS features needed Stability High Flexibility of Queries –retrieve and update single values –full SQL functionality can be used Well-Suited for Structure-Oriented Queries –structures are represented in tables Introduction Structure-Oriented Approach Opaque Approach Prototype Implementation Outlook
T. KudrassOOPSLA Workshop “Objects, and Databases“, Drawbacks Information Loss –Comments –Processing Instructions –Prolog –CDATA Sections –Entities Restrictions –only one text (content) per element Text1 Text2 lost –element text as VARCHAR(n); n <= 4000 Increased Load Time –sample document: 3.3. MB, tuples, 13 minutes Introduction Structure-Oriented Approach Opaque Approach Prototype Implementation Outlook
T. KudrassOOPSLA Workshop “Objects, and Databases“, Opaque Approach Characteristics –XML document stored as Large Object (LOB) –document completely preserved Storage Insert into tblXMLClob values (1,‘person.xml‘,‘ Mary ‘ ); DocIdurlcontent 1person.xml Mary Introduction Structure-Oriented Approach Opaque Approach Prototype Implementation Outlook
T. KudrassOOPSLA Workshop “Objects, and Databases“, Oracle interMedia Text Query Facilities of interMedia Text –full-text retrieval (word matching only) –path expression only together with content search –no range queries Example in interMedia Text: SELECT DocId FROM tblXMLClob WHERE CONTAINS(content,‘(Mary WITHIN name) WITHIN person‘)>0 XML Full-Text Index –Autosectioner Index –XML Sectioner Index –WITHIN operator text_subquery WITHIN elementname searches the entire text content of the named tag Introduction Structure-Oriented Approach Opaque Approach Prototype Implementation Outlook
T. KudrassOOPSLA Workshop “Objects, and Databases“, Comparison of Queries return document IDs word matching (default) no existence test for elements or attributes restricted set of path expressions using WITHIN e.g.: (xml WITHIN title) WITHIN book provides limited attribute value searches, no nesting of attribute searches numeric and data values not type-converted no range searches on attribute values return document fragments substring matching search for existing elements or attributes path expressions structure-oriented queries //Book/Title/[contains(..‘xml‘)] searches for attribute values and element text can be combined considers also decimal values range searches possible using filters interMedia Text XPath Introduction Structure-Oriented Approach Opaque Approach Prototype Implementation Outlook
T. KudrassOOPSLA Workshop “Objects, and Databases“, XPath Queries with PL/SQL Prerequisite –XDK for PL/SQL installed on the server Parse CLOB into DOM representation XPath Query Search Document IDs of all CLOBs of the XML Table Execute XPath Query on the DOM tree for each CLOB Objects of a Document ID DocIDs with Result XML Documents server-side DB Doc IDs Introduction Structure-Oriented Approach Opaque Approach Prototype Implementation Outlook
T. KudrassOOPSLA Workshop “Objects, and Databases“, Advantages Information Preservation Handling of Large Documents –appropriate for document-centric documents with little structure and prose-rich elements Different XML Document APIs –interMedia Text: restricted set of XPath functionality –generate a DOM of the document before using XPath queries Introduction Structure-Oriented Approach Opaque Approach Prototype Implementation Outlook
T. KudrassOOPSLA Workshop “Objects, and Databases“, Drawbacks Restricted Expressiveness of Text Queries Performance vs. Accuracy of Query Results –interMedia Text queries on CLOBs faster than the DOM-API –sample document: 12.5 MB, parse time 3 min, load time 5 min Restrictions of Indexes –maximum tag names for indexing (incl. namespace) 64 bytes Problems with Markup –character entities Vendor Dependence –text engines are proprietary, e.g., Oracle interMedia Stability –maximum document size 50 MB –memory errors may occur with smaller documents Introduction Structure-Oriented Approach Opaque Approach Prototype Implementation Outlook
T. KudrassOOPSLA Workshop “Objects, and Databases“, User Interface Introduction Structure-Oriented Approach Opaque Approach Prototype Implementation Outlook
T. KudrassOOPSLA Workshop “Objects, and Databases“, XML Database Interface Introduction Structure-Oriented Approach Opaque Approach Prototype Implementation Outlook Client XML Document XML Query Doc List DocID / Doc Name
T. KudrassOOPSLA Workshop “Objects, and Databases“, Comparison Import –loss of information –time-consuming –produces lots of tuples Queries –XML-QL fast new document as result Import –no loss of information –faster than structure- oriented decomposition Queries –interMedia Text fast only document IDs as result –XPath high response time flexible granularity of results Structure-Oriented Approach Opaque Approach Introduction Structure-Oriented Approach Opaque Approach Prototype Implementation Outlook
T. KudrassOOPSLA Workshop “Objects, and Databases“, Implementation Experience Import Problems –structure-oriented decomposition SAX parser produces FillBuf Error for XML documents > 3 MB limitations of VARCHAR columns (max bytes) –opaque Approach OutOfMemory error during import of XML documents > 4MB –reason: to little heap size in Java –use start option Xmx Queries –Opaque Approach OutOfMemory error when parsing CLOB into DB –increase java_pool_size ( MB) –increase shared_pool_size ( MB) Export and Delete without Problems Introduction Structure-Oriented Approach Opaque Approach Prototype Implementation Outlook
T. KudrassOOPSLA Workshop “Objects, and Databases“, Outlook Experience –problems with larger documents in both approaches –no universal solution for all requirements Co-Existence of Multiple Storage Approaches –integrate different storage engines –combine structure-oriented decomposition and opaque approach –need for a generic XML data type New Data Model for Structure-Oriented Approach –reduce loss of information XML Database Interface / Middleware –combines different approaches –parameterize the XML database interface Introduction Structure-Oriented Approach Opaque Approach Prototype Implementation Outlook