Using Semantics in XML Data Management
Tok Wang Ling, Department of Computer Science, National University of Singapore
Gillian Dobbie, Department of Computer Science, University of Auckland
April 9, 2007, SWIIS, Bangkok
Roadmap
1. XML documents and current XML schema languages
2. ORA-SS (Object-Relationship-Attribute model for Semi-Structured data) [6]
3. The applications of ORA-SS
   – Semantic query optimization in XML
4. Conclusion
[6] T. W. Ling, M. L. Lee, G. Dobbie. Semistructured Database Design. Springer Science+Business Media, Inc., 2005.
1. XML – Brief introduction
XML (eXtensible Markup Language) is
– Released by W3C
– An application of SGML
– A promising standard for publishing, integrating and exchanging data on the web
XML schemas
– DTD (Document Type Definition) [4]
– XSD (XML Schema Definition), W3C recommended standard [8, 9, 10]
[4] Extensible Markup Language (XML) 1.0 (3rd Edition). W3C Recommendation 04 February
[8] XML Schema Part 0: Primer Second Edition. W3C Recommendation 28 October
[9] XML Schema Part 1: Structures Second Edition. W3C Recommendation 28 October
[10] XML Schema Part 2: Datatypes Second Edition. W3C Recommendation 28 October
1. XML – A motivating example
Suppose we have an XML document “psj.xml” about parts, suppliers and projects, where
– The document has a root element psj;
– Under psj, there is a sequence of part elements;
– Under part, there is a sequence of supplier elements;
– Under supplier, there is a sequence of project elements.
Example 1. psj.xml
P001 Nut Silver
  S001 Alfa Atlanta 5
    J001 Rocket boots
    J003 Firework launcher
  S002 Beta Atlanta New York 5.5
    J002 Diving helm
    J003 Firework launcher
…
P002 Nut Copper
  S001 Alfa Atlanta 4.6
    J002 Diving helm
  S003 Beta New York 5
    J001 Rocket boots
    J004 Blue fireworks
Figure 1. Example XML document
1. XML – The DTD of “psj.xml”
(a) “psj.dtd”, the DTD of “psj.xml”
(b) psj.dtd as a DataGuide:
psj
  part
    pno
    pname
    color
    supplier
      sno
      sname
      city
      price
      project
        jno
        jname
        budget
        qty
Figure 2. DTD and DataGuide of the example XML document
1. XML – What the DTD says
A DTD is a simple definition of an XML document, in which users can define
– Element and attribute types
– Occurrence constraints (e.g. ?, +, *)
– Containment among different element types (the structure)
A DTD cannot express
– Numeric occurrence constraints (e.g. 2 to 8)
– Uniqueness/key constraints on a combination of attributes or elements (in a DTD, ID can be declared on only one attribute at a time)
– Relationship types among elements and their degrees
– The difference between an attribute (or simple element) of an element type and an attribute (or simple element) of a relationship type. Simple elements are element types that contain PCDATA only and have no attributes.
1. XML – XSD
“psj.xsd”, the XSD schema of the motivating example data.
Annotations in the figure:
– XSD definition of the element occurrence constraint
– XSD definition of the key constraint, which requires that every part element has a non-nil pno element and that the values of all pno elements in the document are unique
Figure 3. XML Schema of the example XML document
1. XML – What XSD can tell
XSD is the standard for XML schema definition, recommended by W3C and supported by most vendors. It
– has an extensible XML syntax,
– supports more data types (user-defined types and 37 built-in types),
– can represent uniqueness and keys for both attribute types and element types,
– and has many other improvements over DTD.
1. XML – XSD still has flaws
XSD is not sufficient to express the relational semantics in XML data. For example:
1. A key constraint is specified by a key element. The key constraint in XSD is an extension of ID in DTD and is quite different from the key constraint in relational databases.
– E.g. in the previous XSD, the values of the key attribute pno of part must be unique within the set of part elements in the whole document.
– Therefore, when an element type is located at a lower level, such as supplier and project, XSD cannot declare sno and jno as their key attributes (OIDs) respectively: the same supplier or project may occur under several parents, so its sno or jno value legitimately repeats in the document.
1. XML – XSD still has flaws (cont.)
– The key element must contain the following, in order:
  a) Exactly one selector element, which contains an XPath expression that specifies the set of elements across which the values specified by the fields must be unique.
  b) One or more field elements, each of which contains an XPath expression that specifies the values that must be unique within the set of elements chosen by the selector.
– The key constraint is similar to the unique constraint, except that the column on which a unique constraint is defined can have null values.
1. XML – XSD still has flaws (cont.)
2. XSD does not support relationship types and other relational semantic constraints.
– E.g. the ternary relationship type psj among part, supplier and project in the original data is lost in the XSD.
3. XSD cannot distinguish attributes (or simple elements) of relationship types from attributes (or simple elements) of element types.
– E.g. price is an attribute of the binary relationship type ps between part and supplier. However, it looks the same as sname, an attribute (simple element) of the element supplier.
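To make the first limitation concrete, the sketch below emulates an XSD-style key check in Python: a selector picks a set of elements, and a field must be present and unique across that set. The sample document, the element paths and the helper `key_violations` are illustrative assumptions, not taken from psj.xsd.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment shaped like psj.xml: supplier S001 supplies two parts.
DOC = """
<psj>
  <part><pno>P001</pno>
    <supplier><sno>S001</sno><price>5</price></supplier>
    <supplier><sno>S002</sno><price>5.5</price></supplier>
  </part>
  <part><pno>P002</pno>
    <supplier><sno>S001</sno><price>4.6</price></supplier>
  </part>
</psj>
"""

def key_violations(scope, selector, field):
    """Emulate an XSD key: every element chosen by `selector` inside `scope`
    must have a `field` child, and the field values must be unique."""
    seen, problems = set(), []
    for elem in scope.findall(selector):
        value = elem.findtext(field)
        if value is None:
            problems.append("missing " + field)
        elif value in seen:
            problems.append("duplicate " + value)
        else:
            seen.add(value)
    return problems

root = ET.fromstring(DOC)

# Key on pno over all part elements: holds.
part_key = key_violations(root, "part", "pno")

# Document-wide key on sno over all supplier elements: fails, because the
# same supplier legitimately repeats under different parts.
supplier_key_global = key_violations(root, "part/supplier", "sno")

# Scoping the check to each part only gives uniqueness within one part,
# which is weaker than saying that sno identifies the supplier object.
supplier_key_per_part = [key_violations(p, "supplier", "sno")
                         for p in root.findall("part")]
```

In this sketch the document-wide check on sno reports a duplicate for S001, while the per-part check passes without saying anything about supplier identity across parts.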
2. ORA-SS in a nutshell
– ORA-SS is a semantically rich data model for semi-structured data. It can easily represent the relational semantics and constraints in XML data.
– The ORA-SS model is also a bridge that connects the tree structure of XML with the semantics in relational and object-relational databases.
– In comparison with a traditional ER diagram, an ORA-SS schema diagram also represents the hierarchical structure of XML data.
2. ORA-SS in a nutshell (cont.)
A complete ORA-SS model has 4 diagrams
– Schema diagram: represents the structure of and constraints (business rules) on XML documents
– Instance diagram: visually represents the graphical structure of XML data
– Functional dependency diagram: represents FDs in relationship types
– Inheritance diagram: represents the specialization/generalization relationships among different object classes in ORA-SS
2. ORA-SS data models
Object class
– attributes of object class
– ordering on object class
Relationship type
– degree of relationship type
– participating object classes in relationship type
– attributes of relationship type
– disjunctive relationship type
– recursive relationship type
– ID dependent relationship type
2. ORA-SS data models (cont.)
Attribute
– attributes of object class or relationship type
– key attribute (OID)
– foreign key / referential constraint (IDREF/IDREFS)
– composite attribute
– disjunctive attribute
– attribute with unknown structure
– ordering on attributes
– fixed or default value of attribute
– derived attribute
The ORA-SS schema diagram of Example 1
– Part, supplier and project are modeled as object classes.
– Pno, sno and jno are declared as the object IDs of part, supplier and project respectively.
– PS is a binary relationship type between part and supplier; PSJ is a ternary relationship type defined among part, supplier and project.
– Price is an attribute of the relationship type PS, and qty is an attribute of PSJ.
Figure 4. ORA-SS schema diagram of the example XML document
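One way to see what the schema diagram records beyond a DTD is to write it down as plain data. The sketch below is an illustrative Python encoding, not an official ORA-SS notation: the class names `ObjectClass` and `RelationshipType` and the helper `owner_of` are assumptions, and the attribute placement follows the description above.

```python
from dataclasses import dataclass, field

@dataclass
class ObjectClass:
    name: str
    oid: str                                   # key attribute (object ID)
    attributes: list = field(default_factory=list)

@dataclass
class RelationshipType:
    name: str
    participants: list                         # names of participating object classes
    attributes: list = field(default_factory=list)

    @property
    def degree(self):
        # Degree = number of participating object classes.
        return len(self.participants)

part = ObjectClass("part", "pno", ["pname", "color"])
supplier = ObjectClass("supplier", "sno", ["sname", "city"])
project = ObjectClass("project", "jno", ["jname", "budget"])

ps = RelationshipType("PS", ["part", "supplier"], ["price"])
psj = RelationshipType("PSJ", ["part", "supplier", "project"], ["qty"])

def owner_of(attribute, object_classes, relationship_types):
    """Return the name of the object class or relationship type that owns
    the attribute, or None if it is not declared anywhere."""
    for item in list(object_classes) + list(relationship_types):
        if attribute == getattr(item, "oid", None) or attribute in item.attributes:
            return item.name
    return None
```

With this encoding, `owner_of("price", ...)` answers PS rather than supplier, which is exactly the distinction a DTD or XSD cannot record.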
ORA-SS – Semantic advantages
ORA-SS can represent the following semantics that DTD and XML Schema cannot:
– Attribute vs. object class
– Multi-valued attribute vs. object class
– Identifier (ID)
– IDREF or foreign key
– n-ary relationship type
– Attribute of object class vs. attribute of relationship type
– View of XML document
3. ORA-SS applications
Because of its rich semantics, the ORA-SS model can be used widely in
– Normal form XML schema
– Relational/object-relational storage of XML data
– XML schema/data integration
– XML query optimization [12]
– XML aggregates evaluation
– XML view creation and validation [2]
– XML graphical query language and output [7]
– XML keyword search [13]
– etc.
We will illustrate several of these in detail.
[2] Y. B. Chen, T. W. Ling, M. L. Lee. Designing Valid XML Views. ER2002, Tampere, Finland, Oct 7-11, 2002.
[7] W. Ni, T. W. Ling. GLASS: A Graphical Query Language for Semi-Structured Data. DASFAA
[12] H. Wu, T. W. Ling, B. Chen. VERT: a semantic approach for content search and content extraction in XML query processing. Submitted to ER’07.
[13] B. Chen, J. Lu, T. W. Ling. ICRA: effective semantics for ranked XML keyword search. Submitted to VLDB’07.
3. ORA-SS applications: Semantic query optimization
The semantic information represented in ORA-SS is helpful in optimizing XML queries.
– Many algorithms have been proposed for XML query optimization, e.g. TwigStack [1] and its variants.
– When the ORA-SS semantics of the data are known, they can be taken into account for query optimization.
[1] Nicolas Bruno, Nick Koudas, and Divesh Srivastava. Holistic Twig Joins: Optimal XML Pattern Matching. SIGMOD Conference
3. ORA-SS applications: Semantic query optimization
Example: consider the following simple query.
(Query 1) Display the budget of project “J001”.
//project[jno = "J001"]/budget
– Traditional processing must scan the whole XML document, checking every project with jno = “J001” and finding all the corresponding budget values.
– In ORA-SS, however, jno is the object ID of project and we have the functional dependency jno → budget, so the optimized processing only needs to find the first project instance with jno = “J001” and return its budget value.
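The difference between the two strategies can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the document fragment and the budget values are made up, and the functions `budgets_full_scan` and `budget_first_match` are hypothetical names, not part of any ORA-SS tool.

```python
import xml.etree.ElementTree as ET

# Hypothetical data: project J001 occurs under two different suppliers.
DOC = """
<psj>
  <part><pno>P001</pno>
    <supplier><sno>S001</sno>
      <project><jno>J001</jno><budget>100</budget></project>
      <project><jno>J003</jno><budget>300</budget></project>
    </supplier>
  </part>
  <part><pno>P002</pno>
    <supplier><sno>S003</sno>
      <project><jno>J001</jno><budget>100</budget></project>
    </supplier>
  </part>
</psj>
"""

def budgets_full_scan(root, jno):
    """Without semantics: visit every project and collect each matching budget."""
    visited, found = 0, []
    for proj in root.iter("project"):
        visited += 1
        if proj.findtext("jno") == jno:
            found.append(proj.findtext("budget"))
    return found, visited

def budget_first_match(root, jno):
    """With jno -> budget known: stop at the first matching project."""
    visited = 0
    for proj in root.iter("project"):
        visited += 1
        if proj.findtext("jno") == jno:
            return proj.findtext("budget"), visited
    return None, visited

root = ET.fromstring(DOC)
all_budgets, scanned = budgets_full_scan(root, "J001")
one_budget, scanned_early = budget_first_match(root, "J001")
```

On this toy document the full scan visits all three project elements and returns the same budget twice, while the early-exit version stops after the first one.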
3. ORA-SS applications: Semantic query optimization – Content Search
– Most existing algorithms focus on the structural search of twig pattern queries.
– Few of them pay much attention to content search, i.e. search on the values of elements.
– They treat content nodes (values) in the same way as element nodes.
Disadvantages:
– Too many label streams for contents
– Difficult to find the actual values behind the labels when producing output solutions
We propose VERT (Value Extraction with Relational Table).
3. ORA-SS applications: Semantic query optimization – Content Search
Idea of VERT:
1. Introduce relational tables to store document values, instead of treating them as nodes and labeling them.
2. Rewrite and optimize XML twig queries based on the underlying relational tables.
3. Further optimize the relational tables for query processing if more semantic information is available (i.e. the more semantics, the better the optimization).
3. ORA-SS applications: Semantic query optimization – Content Search
1. Introduce relational tables to store document values, instead of treating them as nodes and labeling them.
E.g. the values of price (title, etc.) in the XML tree in Figure 5 can be stored together with the labels of the price (title, etc.) elements, as in Figure 6.
Figure 5. Example XML document 2
Figure 6. Example VERT tables
3. ORA-SS applications: Semantic query optimization – Content Search
2. Rewrite and optimize XML twig queries based on the underlying relational tables, e.g.
– Rewrite the twig query in Figure 7(a) to the twig in Figure 7(b).
– Execute SQL on table R_price of Figure 6 to get the labels of all price elements with a value greater than 15; these form the stream T_price>15.
– Perform structural joins of these price labels (i.e. T_price>15) with the book and ISBN elements.
Benefits:
– Saves the stream merging of all price elements with values > 15
– Saves the structural join between price elements and their values
Figure 7. Example twig query: (a) twig query, (b) rewritten query
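The rewriting step can be sketched with an in-memory SQLite table. Everything concrete here is an assumption made for illustration: the (start, end) region labels, the price values and the column names do not come from Figures 5 to 7, and the nested-loop containment test merely stands in for a holistic twig join such as TwigStack.

```python
import sqlite3

# Hypothetical region labels (start, end) for a tiny bookstore document.
books = [(2, 11), (12, 21), (22, 31)]              # labels of book elements
r_price = [(7, 8, 12.0), (17, 18, 20.0), (27, 28, 35.5)]  # (start, end, value)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE R_price (start INTEGER, end_ INTEGER, value REAL)")
conn.executemany("INSERT INTO R_price VALUES (?, ?, ?)", r_price)

# Step 1: one SQL selection yields the stream T_price>15, ordered by start label.
t_price_gt_15 = conn.execute(
    "SELECT start, end_ FROM R_price WHERE value > 15 ORDER BY start"
).fetchall()

def contains(ancestor, descendant):
    """Region-label containment: the ancestor interval encloses the descendant."""
    return ancestor[0] < descendant[0] and descendant[1] < ancestor[1]

# Step 2: structural join between book labels and the filtered price labels.
matching_books = [b for b in books
                  if any(contains(b, p) for p in t_price_gt_15)]
conn.close()
```

The value comparison happens once, inside the relational table, so the join only ever sees price labels that already satisfy the predicate.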
3. ORA-SS applications: Semantic query optimization – Content Search
3. Further optimize the relational tables for query processing if more semantic information is available (i.e. the more semantics, the better the optimization).
Optimization 1 (VERT-1): store the values of price (title, etc.) with the labels of the book objects, since price (title) is a property of the book object class according to the semantics captured in ORA-SS (shown in Figure 8).
Benefit: further saves the structural joins between price and book, and between ISBN and book, for the query in Figure 7.
Figure 8. VERT tables with optimization 1
3. ORA-SS applications: Semantic query optimization – Content Search
3. Further optimize the relational tables for query processing if more semantic information is available (cont.).
Optimization 2 (VERT-2): pre-merge the tables of title, price, etc. in Figure 8 if we further know that they are single-valued attributes of the book object class according to the semantics in ORA-SS (shown in Figure 9). (Note: a multi-valued attribute such as author should not be merged.)
Benefit: replaces expensive structural joins with an efficient selection on the merged table for the query in Figure 7.
Figure 9. VERT tables with optimization 2
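A minimal sketch of the VERT-2 layout, again using SQLite from Python. The table names `R_book` and `R_author`, the labels and all values are invented for illustration, and the query is assumed to ask for the ISBN of books with price greater than 15, in line with the description of Figure 7.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# VERT-2 style table: one row per book object, keyed by the book's label,
# holding its single-valued attributes. Labels and values are made up.
conn.execute(
    "CREATE TABLE R_book (book_label INTEGER PRIMARY KEY, "
    "isbn TEXT, title TEXT, price REAL)"
)
conn.executemany("INSERT INTO R_book VALUES (?, ?, ?, ?)", [
    (2, "isbn-1", "XML Basics", 12.0),
    (12, "isbn-2", "Query Processing", 20.0),
    (22, "isbn-3", "Semistructured Data", 35.5),
])

# The multi-valued attribute author stays in its own table.
conn.execute("CREATE TABLE R_author (book_label INTEGER, author TEXT)")
conn.executemany("INSERT INTO R_author VALUES (?, ?)", [
    (2, "A. Author"), (12, "B. Author"), (12, "C. Author"), (22, "A. Author"),
])

# The twig query becomes a plain selection, with no structural join.
isbns = [row[0] for row in conn.execute(
    "SELECT isbn FROM R_book WHERE price > 15 ORDER BY book_label")]

# Authors still need a join, but an ordinary relational one on book_label.
authors = [row[0] for row in conn.execute(
    "SELECT a.author FROM R_book b JOIN R_author a "
    "ON a.book_label = b.book_label "
    "WHERE b.price > 15 ORDER BY b.book_label, a.author")]
conn.close()
```

Keeping author in a separate table avoids repeating the single-valued columns once per author, which is why only single-valued attributes are merged.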
3. ORA-SS applications: Semantic query optimization – Content Search
Experimental results on three datasets, i.e. NASA, DBLP and XMark (Figure 10)
– VERT outperforms TwigStack in query processing time.
– VERT-2 is superior to VERT-1, which in turn is better than the original VERT.
Figure 10. Experimental results of VERT
3. ORA-SS applications: XML query with aggregates
The XML semantics captured in ORA-SS are crucial for writing queries with aggregates correctly.
Example. Consider the query:
(Query 3) Find the average budget of all the projects.
Two potential XQuery expressions are:
XQ.3a
  for $pid in distinct-values(//project/jno)
  let $bgts := //project[jno = $pid]/budget
  return {avg($bgts)}
XQ.3b
  let $bgts := //project/budget
  return {avg($bgts)}
3. ORA-SS applications: XML query with aggregates
Example (cont.)
– If we know from ORA-SS that jno is the OID or key of the project object class, i.e. jno → budget, then we can easily judge that XQ.3a is a correct XQuery expression while XQ.3b is incorrect, because some projects may appear more often than others in the XML document.
– If we do not know these semantics, it is difficult to say which XQuery expression is correct.
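The effect of repeated project elements on the average can be checked with a small Python sketch. The document and budgets are hypothetical, and the de-duplicated computation is one reading of the intent behind grouping by distinct jno, not a translation of the XQuery expressions themselves.

```python
import xml.etree.ElementTree as ET
from statistics import mean

# Hypothetical data: project J001 appears twice, J003 once.
DOC = """
<psj>
  <part><pno>P001</pno><supplier><sno>S001</sno>
    <project><jno>J001</jno><budget>100</budget></project>
    <project><jno>J003</jno><budget>400</budget></project>
  </supplier></part>
  <part><pno>P002</pno><supplier><sno>S003</sno>
    <project><jno>J001</jno><budget>100</budget></project>
  </supplier></part>
</psj>
"""

root = ET.fromstring(DOC)
projects = [(p.findtext("jno"), float(p.findtext("budget")))
            for p in root.iter("project")]

# Averaging over every project element counts J001 twice.
avg_over_elements = mean(b for _, b in projects)

# Using jno -> budget: keep one budget per distinct jno, then average.
budget_by_jno = dict(projects)
avg_over_objects = mean(budget_by_jno.values())
```

Building `budget_by_jno` with `dict` is only safe because the functional dependency guarantees that every occurrence of a jno carries the same budget.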
3. ORA-SS applications: Define and validate XML views
Valid XML views in ORA-SS
– View definition operators: select, project/drop, swap, join
For example, consider the following swap operation, which exchanges the positions of supplier and part in the hierarchy:
Figure 11. Example view definition 1 (panels: valid view, invalid view)
Because price is an attribute of a relationship type, it cannot be moved up together with the supplier elements; doing so would be semantically meaningless in the resulting view.
3. ORA-SS applications: Define and validate XML views
As another example, consider the following projection operation, which drops supplier from the structure:
Figure 12. Example view definition 2 (panels: valid view, invalid view)
Dropping supplier makes price and qty multi-valued attributes, so we should apply aggregation functions to obtain a meaningful view.
3. ORA-SS applications: Graphical XML query based on ORA-SS
A graphical XML query language is designed on the basis of ORA-SS.
– The schema panel loads the ORA-SS schema diagram.
– A graphical query can be posed either by dragging components from the diagram in the schema panel or by using the construction buttons at the top of the window.
– Complex query logic such as quantification, negation and IF-THEN constructions can be specified in the Condition Logic Window.
Query 1: Select and display the projects that do not have any suppliers located in Atlanta.
Figure 13. Screenshot of the user interface of our graphical query language
3. ORA-SS applications: XML keyword search with semantics
– Keyword search is a user-friendly way to query XML documents.
– Most existing algorithms are based on either the tree data model or the graph (digraph) data model of XML, without the semantics.
3. ORA-SS applications: XML keyword search with semantics
Tree data model (LCA [11])
– Lowest Common Ancestor (LCA): a node that contains all the keywords and has no descendant node containing all the keywords
Graph (digraph) data model (BANKS [5])
– Reduced sub-tree: a tree T in the graph (digraph) that contains all the keywords, such that no proper sub-tree of T contains all the keywords
Limitations of keyword search without semantics
– May have difficulty in representing results
– May return many irrelevant results
[5] V. Kacholia, S. Pandit, S. Chakrabarti, S. Sudarshan, R. Desai, and H. Karambelkar. Bidirectional expansion for keyword search on graph databases. In Proc. of VLDB Conference.
[11] Y. Xu and Y. Papakonstantinou. Efficient keyword search for smallest LCAs in XML databases. In Proc. of SIGMOD Conference, 2005.
3. ORA-SS applications: XML keyword search with semantics
Example:
Q1 = {Widom}
– LCA and reduced sub-tree both return only the node that contains the keyword: not enough information.
Q2 = {semistructured query processing}
– LCA(Q2) = dblp (i.e. the whole XML database): overwhelming information.
– The reduced sub-tree results include all papers with either “semistructured” or “query processing”. However, not all “query processing” papers are about “semistructured”.
Figure 14. Example XML document 3
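The two failure modes of LCA-style results can be reproduced with a short recursive sketch. The tiny dblp-like document is an assumption (it is not the document of Figure 14, and the second author name is a placeholder), and `lowest_containing` is a hypothetical helper that follows the definition given earlier: a node whose subtree contains all keywords while no descendant's subtree does.

```python
import xml.etree.ElementTree as ET

# Hypothetical dblp-like fragment with two papers.
DOC = """
<dblp>
  <paper><title>semistructured data models</title><author>Widom</author></paper>
  <paper><title>query processing</title><author>Other</author></paper>
</dblp>
"""

def words(elem):
    """Lower-cased set of words found anywhere in the element's subtree."""
    return set(" ".join(elem.itertext()).lower().split())

def lowest_containing(elem, keywords):
    """Return the elements whose subtree contains all keywords while
    no descendant's subtree does."""
    if not keywords <= words(elem):
        return []
    below = [hit for child in elem
             for hit in lowest_containing(child, keywords)]
    return below or [elem]

root = ET.fromstring(DOC)
q1 = [e.tag for e in lowest_containing(root, {"widom"})]
q2 = [e.tag for e in lowest_containing(
    root, {"semistructured", "query", "processing"})]
```

For the single keyword the result collapses to the author node alone, and for the three-keyword query it expands to the document root, matching the two complaints in the example.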
3. ORA-SS applications: XML keyword search with semantics
Therefore, we propose ICA (Interested Common Ancestor) and IRA (Interested Related Ancestors) to exploit the semantics for ranked keyword search.
Ideas:
1. The DBA defines the set of interested object classes and the conceptual connections between objects.
– E.g. in DBLP, publications and authors can be the interested object classes; references/citations can be one type of conceptual connection between publications.
– Note: we can group all publications for each author object.
3. ORA-SS applications: XML keyword search with semantics
Ideas:
2. The results of a keyword query include interested objects based on the ICA and IRA semantics.
– The results of ICA (Interested Common Ancestor) include all objects that each contain all the query keywords.
– The results of IRA (Interested Related Ancestors) include all object pairs (o, o’) such that the pair together contains all the keywords AND o and o’ are conceptually connected.
– Note: we output a list of IRA objects instead of IRA pairs.
Intuitive meaning of IRA: for the query “semistructured query processing”, if a paper P with the title “query processing” cites or is cited by a paper with the title “semistructured”, then P is considered related to the query; at least it is a better result than “query processing” papers that neither cite nor are cited by “semistructured” papers.
3. ORA-SS applications: XML keyword search with semantics
Ideas:
3. The system automatically ranks the result objects for output, based on the following metrics.
– RelevanceRank. Intuitive meaning: for the query “semistructured query processing”, given two papers P1 and P2 containing “query processing”, if P1 cites or is cited by many “semistructured” papers whereas P2 cites or is cited by few “semistructured” papers, then P1 is considered more relevant to the query.
– Keyword proximity rank (ProxRank). Intuition: the smaller the number of elements in an object that directly contain all the keywords, the better a result the object is.
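The ICA and IRA result sets from idea 2 can be sketched over a toy collection of papers. All data here is hypothetical, citations are treated as undirected connections, and the requirement that each member of an IRA pair contributes at least one keyword is an assumption of this sketch; the ranking metrics are not implemented because their formulas are not given here.

```python
# Hypothetical interested objects: paper id -> words of its title.
papers = {
    "p1": {"semistructured", "query", "processing"},
    "p2": {"query", "processing"},
    "p3": {"semistructured", "data"},
    "p4": {"query", "processing", "streams"},
}

# Hypothetical citation links, treated as undirected conceptual connections.
cites = [("p2", "p3")]

def ica(objects, keywords):
    """Objects that contain every query keyword on their own."""
    return sorted(o for o, w in objects.items() if keywords <= w)

def ira(objects, links, keywords):
    """Objects belonging to a connected pair that together covers all keywords.
    Each member must contribute at least one keyword (sketch assumption)."""
    result = set()
    for a, b in links:
        wa, wb = objects[a] & keywords, objects[b] & keywords
        if wa and wb and (wa | wb) == keywords:
            result.update((a, b))
    return sorted(result)

query = {"semistructured", "query", "processing"}
ica_result = ica(papers, query)
ira_result = ira(papers, cites, query)
```

Paper p4 mentions “query processing” but has no connection to a “semistructured” paper, so it appears in neither result set, which mirrors the intuition stated for IRA.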
3. ORA-SS applications: XML keyword search with semantics
Experimental evaluation based on DBLP
– Our approach outperforms most existing academic demos in both execution time and result quality.
Figure 15. Execution time
Figure 16. Comparison of relevant results in the top-10, 20 and 30 answers among academic demos
3. ORA-SS applications: XML keyword search with semantics
Experimental evaluation based on DBLP (cont.)
– Our approach is comparable or superior to the commercial systems Google Scholar and Microsoft Libra in terms of result quality, even though they can search much more web data.
Figure 17. Comparison of relevant results in the top-10, 20 and 30 answers with commercial systems
3. ORA-SS applications: XML keyword search with semantics
A demo prototype of our keyword search system on DBLP data is available online.
Figure 18. User interface of the demo system
4. Conclusion
1. We presented a data-centric XML document and showed the limitations of the current XML schema standards in representing relational semantics and constraints.
2. We have shown that semantics in XML data are crucial in many applications, such as
– XML query optimization
– XML query optimization for content search
– XML aggregate computation
– XML view creation and validation
– XML graphical query language and output
– XML keyword search
– etc.
3. Much of the semantic information in XML data can be expressed in ORA-SS, which is a semantically rich data model, but not in DTD or XML Schema.
References:
[1] Nicolas Bruno, Nick Koudas, and Divesh Srivastava. Holistic Twig Joins: Optimal XML Pattern Matching. SIGMOD Conference
[2] Y. B. Chen, T. W. Ling, M. L. Lee. Designing Valid XML Views. ER2002, Tampere, Finland, Oct 7-11, 2002.
[3] C. J. Date. An Introduction to Database Systems. 3rd edition, Addison-Wesley Publishing Company, 1981.
[4] Extensible Markup Language (XML) 1.0 (3rd Edition). W3C Recommendation 04 February
[5] V. Kacholia, S. Pandit, S. Chakrabarti, S. Sudarshan, R. Desai, and H. Karambelkar. Bidirectional expansion for keyword search on graph databases. In Proc. of VLDB Conference.
[6] T. W. Ling, M. L. Lee, G. Dobbie. Semistructured Database Design. Springer Science+Business Media, Inc., 2005.
[7] W. Ni, T. W. Ling. GLASS: A Graphical Query Language for Semi-Structured Data. DASFAA
[8] XML Schema Part 0: Primer Second Edition. W3C Recommendation 28 October
[9] XML Schema Part 1: Structures Second Edition. W3C Recommendation 28 October
[10] XML Schema Part 2: Datatypes Second Edition. W3C Recommendation 28 October
[11] Y. Xu and Y. Papakonstantinou. Efficient keyword search for smallest LCAs in XML databases. In Proc. of SIGMOD Conference, 2005.
[12] H. Wu, T. W. Ling, B. Chen. VERT: a semantic approach for content search and content extraction in XML query processing. Submitted to ER’07.
[13] B. Chen, J. Lu, T. W. Ling. ICRA: effective semantics for ranked XML keyword search. Submitted to VLDB’07.
Q & A