Capturing Semantics in XML Documents
Tok Wang Ling, Department of Computer Science, National University of Singapore
KDXD 2006, Singapore, April 9, 2006


Roadmap
1. XML documents and current XML schema languages
2. ORA-SS (Object-Relationship-Attribute model for Semi-Structured data) [4]
3. The applications of ORA-SS
4. Discovering semantics in XML documents
5. Conclusion
[4] T. W. Ling, M. L. Lee, G. Dobbie. Semistructured Database Design. Springer Science+Business Media, Inc., 2005.

1. XML – Brief introduction
XML (eXtensible Markup Language) is
– Released by W3C
– An application of SGML
– A promising standard for publishing, integrating and exchanging data on the web
XML schema languages
– DTD (Document Type Definition) [3]
– XSD (XML Schema Definition), the W3C recommended standard [6, 7, 8]
[3] Extensible Markup Language (XML) 1.0 (3rd Edition). W3C Recommendation, 04 February.
[6] XML Schema Part 0: Primer, Second Edition. W3C Recommendation, 28 October.
[7] XML Schema Part 1: Structures, Second Edition. W3C Recommendation, 28 October.
[8] XML Schema Part 2: Datatypes, Second Edition. W3C Recommendation, 28 October.

1. XML – A motivating example
Suppose we have an XML document “psj.xml” about parts, suppliers and projects, where
– The document has a root element psj;
– Under psj, there is a sequence of part elements;
– Under part, there is a sequence of supplier elements;
– Under supplier, there is a sequence of project elements.

Example 1. psj.xml
part: P001, Nut, Silver
  supplier: S001, Alfa, Atlanta, price 5
    project: J001, Rocket boots
    project: J003, Firework launcher
  supplier: S002, Beta, {Atlanta, New York}, price 5.5
    project: J002, Diving helm
    project: J003, Firework launcher
…
part: P002, Nut, Copper
  supplier: S001, Alfa, Atlanta, price 4.6
    project: J002, Diving helm
  supplier: S003, Beta, New York, price 5
    project: J001, Rocket boots
    project: J004, Blue fireworks

1. XML – The DTD of “psj.xml”
Figure: (a) “psj.dtd”, the DTD of “psj.xml”; (b) psj.dtd as a Data Guide.
Data Guide (b):
psj
  part
    pno
    pname
    color
    supplier
      sno
      sname
      city
      price
      project
        jno
        jname
        budget
        qty

1. XML – What the DTD says
A DTD is a simple definition of an XML document, in which users can define
– Element/attribute types
– Occurrence constraints (e.g. ?, +, *)
– Containment among different element types (the structure)
A DTD cannot express
– Occurrence constraints in numbers (e.g. 2 to 8)
– Uniqueness/key constraints on a combination of attributes/elements (in a DTD, ID can only be assigned to one attribute at a time)
– Relationship types among elements and their degrees
– The difference between an attribute (or simple element) of an element type and an attribute (or simple element) of a relationship type
Note: simple elements are element types that contain PCDATA only and have no attribute types.

1. XML – XSD
“psj.xsd” is the XSD schema of the motivating example data. It contains
– an XSD definition of element occurrence constraints;
– an XSD definition of a key constraint, which requires that every part element has a non-nil pno element and that the values of all pno elements in the document are unique.

1. XML – What XSD can tell
XSD is the standard for XML schema definition, recommended by W3C and supported by most vendors. It
– has an extensible XML syntax;
– supports more data types (user-defined types and 37 built-in types);
– can represent uniqueness/key constraints for both attribute types and element types;
– has many other improvements over DTD.

1. XML – XSD still has flaws
XSD is not sufficient for expressing the relational semantics in XML data, for example:
1. A key constraint is specified by a key element. The key constraint in XSD is an extension of ID in DTD and is quite different from the key constraint in relational databases.
– E.g. in the previous XSD, the values of the key attribute, pno of part, must be unique within the set of part elements in the whole document.
– Therefore, when an element type is located at a lower level, such as supplier and project, XSD cannot declare sno and jno as their key attributes (OIDs) respectively.

1. XML – XSD still has flaws (cont.)
– The key element must contain the following, in order:
a) One and only one selector element, which contains an XPath expression specifying the set of elements across which the values specified by the fields must be unique;
b) One or more field elements, each of which contains an XPath expression specifying the values that must be unique for the set of elements specified by the selector element.
– The key constraint is similar to the unique constraint, except that the column on which a unique constraint is defined can have null values.
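The document-wide scope of such a key can be illustrated with a minimal Python sketch. It is not an XSD validator: check_key is a hypothetical helper that only mimics the selector/field idea with ElementTree paths, and the markup of the sample document is assumed from the element names in the psj DTD.

```python
import xml.etree.ElementTree as ET

# Assumed markup: element names follow the psj DTD, values follow Example 1.
DOC = """<psj>
  <part><pno>P001</pno>
    <supplier><sno>S001</sno><price>5</price></supplier>
    <supplier><sno>S002</sno><price>5.5</price></supplier>
  </part>
  <part><pno>P002</pno>
    <supplier><sno>S001</sno><price>4.6</price></supplier>
  </part>
</psj>"""

def check_key(root, selector, field):
    """Hypothetical helper: the `field` child of every element matched by
    `selector` must be present and unique across the whole document."""
    seen = set()
    problems = []
    for elem in root.findall(selector):
        value = elem.findtext(field)
        if value is None:
            problems.append("missing " + field)
        elif value in seen:
            problems.append("duplicate " + value)
        else:
            seen.add(value)
    return (not problems, problems)

root = ET.fromstring(DOC)
part_key = check_key(root, ".//part", "pno")
supplier_key = check_key(root, ".//supplier", "sno")
print(part_key)
print(supplier_key)
```

With this document-wide reading, pno passes, while sno is reported as a duplicate because supplier S001 appears under two different parts.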

1. XML – XSD still has flaws (cont.)
2. XSD does not support relationship types and other relational semantic constraints.
– E.g. the ternary relationship type psj among part, supplier and project in the original data is lost in the XSD.
3. XSD cannot distinguish attributes (or simple elements) of relationship types from attributes (or simple elements) of element types.
– E.g. price is an attribute of the binary relationship type ps between part and supplier. However, it looks the same as sname, an attribute (simple element) of the element supplier.

Reconsider the semantics in Example 1
The XML data in Example 1 (psj.xml) is a typical data-centric XML document, derived from structured data usually stored in relational or object-relational databases. The semantics of the data in Example 1 can be described by an ER diagram.

Figure: the ER diagram of the data in Example 1.

One of the object-relational database representations of psj.xml

part
pno   pname  color
P001  Nut    Silver
P002  Nut    Copper

supplier
sno   sname  city+
S001  Alfa   Atlanta
S002  Beta   {Atlanta, New York}
S003  Gama   New York

project
jno   jname              budget
J001  Rocket boots       20000
J002  Diving helm        18000
J003  Firework launcher
J004  Blue fireworks     20000

PS
pno   sno   price
P001  S001  5
P001  S002  5.5
P002  S001  4.6
P002  S003  5

PSJ
pno   sno   jno   qty
P001  S001  J001  60
P001  S001  J003
P001  S002  J002  70
P001  S002  J003  50
P002  S001  J002  60
P002  S003  J001  20
P002  S003  J004  50

There are 5 tables in the relational schema:
part (pno, pname, color)
supplier (sno, sname, (city)+)
project (jno, jname, budget)
PS (pno, sno, price)
PSJ (pno, sno, jno, qty)

2. ORA-SS in a nutshell
ORA-SS is a semantically rich data model for semi-structured data. It can easily represent the relational semantics and constraints in XML data.
The ORA-SS model is also a bridge that connects the tree structure of XML with the semantics in relational and object-relational databases.
In comparison with the traditional ER diagram, the ORA-SS schema diagram also represents the hierarchical structure of XML data.

2. ORA-SS in a nutshell (cont.)
A complete ORA-SS model has 4 diagrams:
– Schema diagram: represents the structure of, and the constraints (business rules) on, XML documents
– Instance diagram: visually represents the graphical structure of XML data
– Functional dependency diagram: represents FDs in relationship types
– Inheritance diagram: represents the specialization/generalization relationships among different object classes in ORA-SS

2. ORA-SS data models
Object class
– attributes of object class
– ordering on object class
Relationship type
– degree of relationship type
– participating object classes in relationship type
– attributes of relationship type
– disjunctive relationship type
– recursive relationship type
– ID dependent relationship type

2. ORA-SS data models (cont.)
Attribute
– attributes of object class or relationship type
– key attribute (OID)
– foreign key / referential constraint (IDREF/IDREFS)
– composite attribute
– disjunctive attribute
– attribute with unknown structure
– ordering on attributes
– fixed or default value of attribute
– derived attribute

The ORA-SS schema diagram of Example 1
– Part, supplier and project are modeled as object classes.
– pno, sno and jno are declared as the object IDs of part, supplier and project respectively.
– PS is a binary relationship type between part and supplier; PSJ is a ternary relationship type defined among part, supplier and project.
– price is an attribute of the relationship type PS, and qty is an attribute of PSJ.

2. ORA-SS – Features
ORA-SS can represent the following semantics:
– Object ID attributes play the role of key constraints in object-relational databases, i.e. the object ID attributes functionally determine (or multi-valued determine) the object attributes of the same object class.
– Various relationship types, including ID dependent relationship types, with their degrees and participating object classes.
– The distinction between relationship attributes and object attributes.

3. ORA-SS applications
Due to its rich semantics, ORA-SS can be widely used in
– Normal form XML schema
– Relational/object-relational storage of XML data
– XML view creation and validation [1]
– XML schema/data integration
– XML data query, especially with graphical user interfaces [5]
– XML query optimization
– etc.
[1] Y. B. Chen, T. W. Ling, M. L. Lee. Designing Valid XML Views. ER 2002, Tampere, Finland, Oct 7-11, 2002.
[5] W. Ni, T. W. Ling. GLASS: A Graphical Query Language for Semi-Structured Data. DASFAA 2003.

3. ORA-SS applications – Store ORA-SS in object-relational databases
– Existing storage approaches store XML in flat files (NF relations), which are long and difficult to query and update.
– In a pure relational DBMS, joins take much time.
– ORA-SS reflects the nested structure of semi-structured data, so nested relations need fewer joins.

3. ORA-SS applications – Store ORA-SS in object-relational databases (cont.)
Given an ORA-SS schema diagram:
– Each object class is stored as an object relation with its object ID and its object attributes (e.g. part, supplier, project).
– Each relationship type is stored as a relationship relation with the object IDs of the participating object classes and its relationship attributes (e.g. PS and PSJ).
– Multi-valued attributes and composite attributes are stored as nested relations (e.g. city).

3. ORA-SS applications – Store ORA-SS in object-relational databases (cont.)
Storage schema for the ORA-SS/XML database of the data in Example 1 (figure: ORA-SS schema diagram and the storage schema derived from it).
Object relations
part (pno, pname, color)
supplier (sno, sname, (city)+)
project (jno, jname, budget)
Relationship relations
PS (pno, sno, price)
PSJ (pno, sno, jno, qty)
Constraint: PSJ[pno, sno] ⊆ PS[pno, sno]
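The constraint between PSJ and PS can be checked mechanically. The following is a minimal Python sketch, assuming the constraint is read as an inclusion dependency (every pno, sno pair in PSJ must also occur in PS); inclusion_holds is a hypothetical helper, and the rows follow Example 1 with the non-key columns of PSJ left out.

```python
# Rows follow Example 1; PSJ rows carry only (pno, sno, jno) here.
PS = [
    ("P001", "S001", 5.0),
    ("P001", "S002", 5.5),
    ("P002", "S001", 4.6),
    ("P002", "S003", 5.0),
]
PSJ = [
    ("P001", "S001", "J001"),
    ("P001", "S001", "J003"),
    ("P001", "S002", "J002"),
    ("P001", "S002", "J003"),
    ("P002", "S001", "J002"),
    ("P002", "S003", "J001"),
    ("P002", "S003", "J004"),
]

def inclusion_holds(child_rows, parent_rows, prefix_len):
    """Hypothetical helper: True if the first `prefix_len` columns of every
    child row also occur as the first columns of some parent row."""
    child_keys = {row[:prefix_len] for row in child_rows}
    parent_keys = {row[:prefix_len] for row in parent_rows}
    return child_keys <= parent_keys

ok = inclusion_holds(PSJ, PS, 2)
# A PSJ row for a (part, supplier) pair that is not in PS violates the constraint.
broken = inclusion_holds(PSJ + [("P009", "S001", "J001")], PS, 2)
print(ok, broken)
```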

3. ORA-SS applications – Store ORA-SS in object-relational databases (cont.)
An example showing the advantage of using an object-relational database instead of a relational database (figure: ORA-SS schema diagram).
Storage schema in ORDB:
Employee (eno, ename, (hobby)*, quantification(year, degree, Univ)*, job_history(year, job_title, company)*)
Storage schema in traditional RDB:
Employee (eno, ename)
E_hobby (eno, hobby)
E_quantification (eno, year, degree, Univ)
E_job_history (eno, year, job_title, company)

3. ORA-SS applications – Define and validate XML views
Valid XML views in ORA-SS
View definition operators: select, project/drop, swap, join
For example, consider a swap operation that exchanges the positions of supplier and part in the hierarchy (figure: valid view vs. invalid view). Because price is a relationship attribute, it cannot be moved up with the supplier elements; doing so would be semantically meaningless in the result view.

3. ORA-SS applications – Define and validate XML views (cont.)
As another example, consider a projection operation that drops supplier from the structure (figure: valid view vs. invalid view). Dropping supplier makes price and qty multi-valued attributes, so aggregation functions should be applied to obtain a meaningful view.

3. ORA-SS applications – Graphical XML query based on ORA-SS
A graphical XML query language is designed on the basis of ORA-SS (figure: screenshot of the user interface of our graphical query language).
– The schema panel loads the ORA-SS schema diagram.
– A graphical query can be posed either by dragging components from the diagram in the schema panel or by using the construction buttons at the top of the window.
– Complex query logic such as quantification, negation and IF-THEN constructions can be specified in the Condition Logic Window.
Query 1: Select and display the projects that do not have any suppliers located in Atlanta.

3. ORA-SS applications – XML query optimization
The semantic information represented in ORA-SS is also helpful in optimizing XML queries. Consider the following simple query:
Query 2: Display the budget of project “J001”.

3. ORA-SS applications – XML query optimization (cont.)
Traditional processing scans the whole XML document, checking every project with jno=“J001” and collecting all corresponding budget values.
In ORA-SS, however, jno is the object ID and we have the functional dependency jno → budget, so the optimized processing only needs to find the first project instance with jno=“J001” and return the corresponding budget value.
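The difference between the two strategies can be sketched in Python. This is an illustration only, not the query processor described in the talk: the markup of the sample document is assumed from the psj DTD, and budgets_full_scan and budget_first_match are hypothetical names.

```python
import xml.etree.ElementTree as ET

# Assumed markup; project J001 occurs under two different suppliers.
DOC = """<psj>
  <part><pno>P001</pno>
    <supplier><sno>S001</sno>
      <project><jno>J001</jno><budget>20000</budget></project>
    </supplier>
  </part>
  <part><pno>P002</pno>
    <supplier><sno>S003</sno>
      <project><jno>J001</jno><budget>20000</budget></project>
      <project><jno>J004</jno><budget>20000</budget></project>
    </supplier>
  </part>
</psj>"""

def budgets_full_scan(root, jno):
    """Without semantics: visit every project and collect all matching budgets."""
    return [p.findtext("budget") for p in root.iter("project")
            if p.findtext("jno") == jno]

def budget_first_match(root, jno):
    """With jno -> budget known: stop at the first matching project."""
    for project in root.iter("project"):
        if project.findtext("jno") == jno:
            return project.findtext("budget")
    return None

root = ET.fromstring(DOC)
all_budgets = budgets_full_scan(root, "J001")
one_budget = budget_first_match(root, "J001")
print(all_budgets, one_budget)
```

The early return is only safe because the functional dependency guarantees that every occurrence of J001 carries the same budget.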

4. Discover semantics in XML documents
Problem definition
– Input: a well-formed XML document, possibly with a DTD or XSD schema
– Output: the semantics that are necessary for an ORA-SS schema
It is a process of enriching an XML schema into an ORA-SS schema by using mining techniques.

4. Discover semantics in XML documents – Related issues in mining semantics
– Object classes
  Identify object classes
  Identify object IDs
  Identify object attributes and their cardinalities
  Identify IDREF(S) attributes
– Relationship types
  Find relationship types with their degrees and participating object classes
  Find the attributes of relationship types and their cardinalities

4. Discover semantics in XML documents
Figure: the whole vision of the process (legend: the main flow of the process, the input flow, the output flow).

4. Discover semantics in XML documents
Assumption
– To simplify the discussion, we do not consider the order of attributes and elements.
User verification
– The findings of each step of the process should be verified by the user.
– The verified findings of previous steps are used in later steps.

4. Discover semantics in XML documents – Find object classes
Identify object classes from element types:
– Scan the XML document or, if available, the DTD/XSD of the XML document to select all internal nodes in the document tree.
– An internal node is a node that has some child nodes, such as XML attribute types and/or subelement types.
– An internal node may not be an object class, but an object class must correspond to an internal node. Therefore, internal nodes are the candidate object classes.

4. Discover semantics in XML documents – Find object classes (cont.)
Detecting composite attributes among the candidate object classes
– Although composite attributes are also internal nodes, there are some special patterns that indicate they are not object classes.
The first pattern (figure: an XML element whose children are XML elements or XML attributes holding values) is that all subelement types or attributes are
1) Single-valued;
2) Always in the same order;
3) Such that no functional dependency can be found among the component attributes of the composite attribute.

4. Discover semantics in XML documents – Find object classes (cont.)
The second pattern (figure: an XML element whose children are XML elements or XML attributes holding values) is that all subelement types or attributes are
1) Of the same type (repeated);
2) Such that the set of subelement/attribute values is often determined by other element/attribute values (e.g. under student, studNo determines the values of the hobby elements under the “hobbies” element).

4. Discover semantics in XML documents – Find object classes (cont.)
The DTD of Example 1 as a Data Guide:
psj
  part
    pno
    pname
    color
    supplier
      sno
      sname
      city
      price
      project
        jno
        jname
        budget
        qty
From the DTD of Example 1, the element types psj, part, supplier and project are internal nodes (this can be seen directly in the Data Guide). The list { psj, part, supplier, project } therefore contains the candidate object classes.
Because a well-formed XML document usually has a document root that is not concerned with the data, we can drop the root node psj from the list and get the final result { part, supplier, project }.
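The candidate selection above can be sketched in a few lines of Python. This is a minimal sketch that scans a document instance rather than the DTD; candidate_object_classes is a hypothetical name and the sample markup is assumed from the psj DTD.

```python
import xml.etree.ElementTree as ET

# Assumed markup for a fragment of psj.xml.
DOC = """<psj>
  <part><pno>P001</pno><pname>Nut</pname><color>Silver</color>
    <supplier><sno>S001</sno><sname>Alfa</sname><city>Atlanta</city><price>5</price>
      <project><jno>J001</jno><jname>Rocket boots</jname></project>
    </supplier>
  </part>
</psj>"""

def candidate_object_classes(root):
    """Collect element types that are internal nodes (they have subelements
    or XML attributes), then drop the document root."""
    internal = []
    for elem in root.iter():
        is_internal = len(elem) > 0 or bool(elem.attrib)
        if is_internal and elem.tag not in internal:
            internal.append(elem.tag)
    return [tag for tag in internal if tag != root.tag]

root = ET.fromstring(DOC)
candidates = candidate_object_classes(root)
print(candidates)  # ['part', 'supplier', 'project']
```

Composite attributes would still be in this list; filtering them out needs the two patterns described earlier, followed by user verification.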

4. Discover semantics in XML documents – Identify multi-valued attributes
After object classes and composite attributes are identified, we pick out all multi-valued attributes for later use.
– Multi-valued attributes can be detected by checking the occurrence constraints in the DTD/XSD, or by counting directly in the document.
– A multi-valued attribute can belong either to an object class (e.g. city of supplier) or to a relationship type. To determine where a multi-valued attribute belongs, we need to find the object IDs first.
– Setting the multi-valued attributes aside makes the search for object IDs easier.

4. Discover semantics in XML documents – Find object IDs
For each identified object class (after user verification):
– If it is located at the first level below the document root, and the DTD/XSD has specified an ID attribute or key constraint, then the corresponding attribute/element should be an object ID.
– Otherwise:
  A temporary table is built, which contains all XML attributes and single-valued simple subelement types of the object class.
  Full functional dependencies are searched for in the temporary table.
  – If all attributes/elements are fully functionally dependent on an attribute/element k, then k is most likely the object ID;
  – Else,
    » find an attribute/element k’ that functionally determines the largest number of attributes/elements; k’ is suggested as the object ID,
    » and the attributes/elements that are not determined by k’ are classified as single-valued attributes of some relationship types to be determined later.
The result should be verified by the user.

4. Discover semantics in XML documents – Find object IDs (cont.)
Candidate object classes list: {part, supplier, project}
Three temporary tables:
part_temp (pno, pname, color)
supplier_temp (sno, sname, price)
project_temp (jno, jname, budget, qty)
Notice that, at this stage, all simple subelement types and attributes are treated the same. Multi-valued attributes such as city are not included in the temporary tables.

4. Discover semantics in XML documents – Find object IDs (cont.)
Three temporary tables:
part_temp (pno, pname, color)
supplier_temp (sno, sname, price)
project_temp (jno, jname, budget, qty)
1. In part_temp, we find that pno → pname, color; thus, pno is the object ID of part.
2. In supplier_temp, we only have sno → sname; thus, sno is the object ID of supplier, and price is picked out as a relationship attribute.
3. In project_temp, we only have jno → jname, budget; thus, jno is the object ID of project, and qty is picked out as a relationship attribute.
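The search for an object ID in a temporary table can be sketched as follows. This is a simplified illustration, assuming single-attribute determinants only; fd_holds and suggest_object_id are hypothetical names, and the supplier_temp rows are taken from the supplier occurrences visible in Example 1.

```python
# One row per supplier element occurrence visible in Example 1.
supplier_temp = [
    {"sno": "S001", "sname": "Alfa", "price": "5"},
    {"sno": "S002", "sname": "Beta", "price": "5.5"},
    {"sno": "S001", "sname": "Alfa", "price": "4.6"},
    {"sno": "S003", "sname": "Beta", "price": "5"},
]

def fd_holds(rows, lhs, rhs):
    """True if equal `lhs` values always come with equal `rhs` values."""
    seen = {}
    for row in rows:
        if seen.setdefault(row[lhs], row[rhs]) != row[rhs]:
            return False
    return True

def suggest_object_id(rows):
    """Suggest the attribute that determines the most other attributes;
    the attributes it does not determine are returned as leftovers."""
    attrs = list(rows[0])
    best, best_determined = None, []
    for k in attrs:
        determined = [a for a in attrs if a != k and fd_holds(rows, k, a)]
        if len(determined) > len(best_determined):
            best, best_determined = k, determined
    leftover = [a for a in attrs if a != best and a not in best_determined]
    return best, leftover

suggestion = suggest_object_id(supplier_temp)
print(suggestion)  # ('sno', ['price'])
```

On a small sample, dependencies can hold by accident (in the two-row part_temp, color also determines pno and pname), which is one reason the result still needs user verification.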

4. Discover semantics in XML documents – Find object IDs (cont.)
After identifying the object IDs, we know:
– the object ID of each object class;
– the single-valued object attributes and their corresponding object classes;
– the single-valued relationship attributes, without yet knowing which relationship types they belong to.

4. Discover semantics in XML documents – Multi-valued attributes of object classes
Recall that all multi-valued attributes were identified before the search for object IDs. Given a multi-valued attribute under an object class, we check
– for each object ID value of the object class, whether there is a unique set of values of the attribute.
  If so, it is a multi-valued attribute of the object class;
  Else, it is classified as a multi-valued attribute of some relationship type not yet known.

4. Discover semantics in XML documents – Multi-valued attributes of object classes (cont.)
For example, city is a multi-valued attribute under supplier.
– We check sno and city: since each sno value is associated with the same set of city values, city is a multi-valued attribute of supplier.
The temporary table of sno and city:
sno   city+
S001  Atlanta
S002  {Atlanta, New York}
S001  Atlanta
S003  New York
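This check amounts to grouping the value sets by object ID. A minimal Python sketch, using the sno/city table above; is_object_attribute is a hypothetical name.

```python
# The temporary table of sno and city, one entry per supplier occurrence.
sno_city = [
    ("S001", {"Atlanta"}),
    ("S002", {"Atlanta", "New York"}),
    ("S001", {"Atlanta"}),
    ("S003", {"New York"}),
]

def is_object_attribute(pairs):
    """True if every object ID value is always paired with the same value set."""
    seen = {}
    for oid, values in pairs:
        values = frozenset(values)
        if seen.setdefault(oid, values) != values:
            return False
    return True

city_belongs_to_supplier = is_object_attribute(sno_city)
# If S001 showed a different city set in another occurrence, the check fails
# and the attribute would be assigned to some relationship type instead.
counter_example = is_object_attribute(sno_city + [("S001", {"Boston"})])
print(city_belongs_to_supplier, counter_example)
```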

4. Discover semantics in XML documents – Find cardinality of object class attributes
For multi-valued object attributes, we should know their cardinality.
– If the DTD/XSD specifies it, reuse it.
– Without a schema, count the minimum and maximum occurrences of the multi-valued attribute.
– Notice that both single-valued and multi-valued attributes can be null (e.g. ? and *). Thus, the result should be verified by the user.
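Counting occurrences directly in the document can be sketched as follows. This is a minimal Python illustration; occurrence_range is a hypothetical name, the sample markup is assumed from the psj DTD, and the function expects at least one parent element to exist.

```python
import xml.etree.ElementTree as ET

# Assumed markup; supplier S002 has two city elements.
DOC = """<psj>
  <part><pno>P001</pno>
    <supplier><sno>S001</sno><city>Atlanta</city></supplier>
    <supplier><sno>S002</sno><city>Atlanta</city><city>New York</city></supplier>
  </part>
  <part><pno>P002</pno>
    <supplier><sno>S003</sno><city>New York</city></supplier>
  </part>
</psj>"""

def occurrence_range(root, parent_tag, child_tag):
    """Return (min, max) number of `child_tag` children per `parent_tag` element."""
    counts = [len(parent.findall(child_tag)) for parent in root.iter(parent_tag)]
    return min(counts), max(counts)

root = ET.fromstring(DOC)
city_range = occurrence_range(root, "supplier", "city")
print(city_range)  # (1, 2)
```

The observed range only describes this instance; a minimum of 1 here does not prove that city is mandatory, which is why the slide asks for user verification.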

4. Discover semantics in XML documents – Find IDREF/IDREFS
Identify IDREFs
– If the DTD/XSD has specified IDREF/IDREFS or keyref constraints, reuse them.
– Without a schema, we compare the object attribute values with the values of other object IDs:
  If all values of a single-valued attribute of objects of the same class appear as object ID values of some particular object class, then it is an IDREF;
  If all values of a multi-valued attribute of objects of the same class appear as object ID values of some particular object class, then it is an IDREFS. (Note that, if it is an XML attribute, multiple values of IDREFS are separated by a blank character.)
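The value comparison can be sketched in Python as a containment test. Everything in this example is hypothetical: Example 1 contains no reference attributes, so the attribute values below are invented purely for illustration, and classify_reference is not a name from the talk.

```python
def classify_reference(values_per_object, target_ids):
    """Hypothetical helper. `values_per_object` holds one list of values per
    object; return 'IDREF', 'IDREFS' or None depending on whether all values
    occur among the object IDs of the target class."""
    all_values = [v for values in values_per_object for v in values]
    if not all_values or any(v not in target_ids for v in all_values):
        return None
    multi_valued = any(len(values) > 1 for values in values_per_object)
    return "IDREFS" if multi_valued else "IDREF"

# Object IDs of project as found earlier.
project_ids = {"J001", "J002", "J003", "J004"}

# Invented attribute values of some other object class (illustration only).
single = classify_reference([["J001"], ["J003"]], project_ids)
multiple = classify_reference([["J001", "J002"], ["J004"]], project_ids)
unrelated = classify_reference([["J001"], ["X9"]], project_ids)
print(single, multiple, unrelated)
```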

4. Discover semantics in XML documents – Find relationship types
Identify relationship types (basic idea)
– The search for relationship types is based on the object IDs and the relationship attributes (single-valued or multi-valued).
– Along a path from the root to a leaf node in the document tree, we may pass through several object classes. The object IDs of these object classes can form a temporary table. We build such a temporary table for each single-valued relationship attribute and use it to find relationship types.

4. Discover semantics in XML documents – Find relationship types (cont.)
For each single-valued relationship attribute, there is a path from the root to the attribute. The object IDs of the object classes along the path are put into a temporary table together with the relationship attribute.
– Find the FDs that determine the single-valued relationship attribute in the temporary table.
For a multi-valued relationship attribute, we should find a combination of object IDs of different object classes such that each unique combination of object ID values corresponds to a unique set of attribute values.

4. Discover semantics in XML documents – Find relationship types (cont.)
From the data in Example 1, we can build a temporary table for price along the path “part/supplier/price” as follows:
pno   sno   price
P001  S001  5
P001  S002  5.5
P002  S001  4.6
P002  S003  5
We find that {pno, sno} → price; thus, there is a binary relationship type between part and supplier, and price is an attribute of this binary relationship type.

4. Discover semantics in XML documents – Find relationship types (cont.)
Similarly, we can build a temporary table for qty along the path “part/supplier/project/qty” as follows:
pno   sno   jno   qty
P001  S001  J001  60
P001  S001  J003
P001  S002  J002  70
P001  S002  J003  50
P002  S001  J002  60
P002  S003  J001  20
P002  S003  J004  50
We find that {pno, sno, jno} → qty; thus, there is a ternary relationship type among part, supplier and project, and qty is an attribute of this ternary relationship type.
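The FD search over object IDs can be sketched for the price table. This is a minimal Python illustration, assuming the goal is the smallest combination of object IDs that determines the attribute; determines and minimal_determinants are hypothetical names.

```python
from itertools import combinations

# Temporary table for price along the path part/supplier/price (Example 1).
price_rows = [
    {"pno": "P001", "sno": "S001", "price": "5"},
    {"pno": "P001", "sno": "S002", "price": "5.5"},
    {"pno": "P002", "sno": "S001", "price": "4.6"},
    {"pno": "P002", "sno": "S003", "price": "5"},
]

def determines(rows, lhs, attr):
    """True if equal values on all `lhs` columns imply equal `attr` values."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        if seen.setdefault(key, row[attr]) != row[attr]:
            return False
    return True

def minimal_determinants(rows, oids, attr):
    """Return the smallest combinations of object IDs that determine `attr`."""
    for size in range(1, len(oids) + 1):
        found = [c for c in combinations(oids, size) if determines(rows, c, attr)]
        if found:
            return found
    return []

result = minimal_determinants(price_rows, ["pno", "sno"], "price")
print(result)  # [('pno', 'sno')]
```

Neither pno nor sno alone determines price, so the pair is returned, and the degree of the relationship type is the size of that combination. On a small instance, a smaller combination may hold only by coincidence, so the outcome should again be confirmed by the user.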

4. Discover semantics in XML documents: Find relationship types (cont.)
Relationship types can exist without having relationship attributes. To find such relationship types, we need to build a temporary table of the object IDs of the different object classes, based on the existing paths in the document tree. We then search the temporary table for MVDs (see the following example).

4. Discover semantics in XML documents: Find relationship types (cont.)
We have already identified the
- hierarchical structure;
- object classes and their object IDs;
- attributes of object classes;
- but no attribute is likely to belong to any relationship type.
Suppose we have another document about project, staff, and paper. After finding their object ID attributes, namely J_no, St_no, and Pa_no respectively, we can create a temporary table as follows. …

4. Discover semantics in XML documents: Find relationship types (cont.)
We build a temporary table that consists of J_no, St_no, and Pa_no:
J_no  St_no  Pa_no
J001  S001   P001
J001  S002   P003
J002  S001   P001
J002  S003   P001
…     …      …
CASE 1. If we find that each St_no value is associated with a unique set of Pa_no values regardless of J_no, i.e. St_no multi-determines Pa_no, then there are two binary relationship types: one between project and staff, and the other between staff and paper.
CASE 2. If there is no FD or MVD in the table, then there is a ternary relationship type among project, staff and paper.
[Figure panels: CASE 1, CASE 2]
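
The CASE 1 test can be sketched as an MVD check. The function name mvd_holds is hypothetical, and the check uses the standard definition: X multi-determines Y when, for each X value, the Y values combine freely with the values of the remaining columns. Only the four visible rows are used, so the outcome does not cover the elided rows.

```python
from itertools import product

COLS = ("J_no", "St_no", "Pa_no")
# The four rows visible on the slide.
ROWS = [
    ("J001", "S001", "P001"),
    ("J001", "S002", "P003"),
    ("J002", "S001", "P001"),
    ("J002", "S003", "P001"),
]

def mvd_holds(rows, cols, x, y):
    """Check x ->> y: in each x group, y values pair with every value of the other columns."""
    xi, yi = cols.index(x), cols.index(y)
    rest = [i for i in range(len(cols)) if i not in (xi, yi)]
    groups = {}
    for row in rows:
        ys, zs, pairs = groups.setdefault(row[xi], (set(), set(), set()))
        z = tuple(row[i] for i in rest)
        ys.add(row[yi])
        zs.add(z)
        pairs.add((row[yi], z))
    # The MVD holds when each group equals the cross product of its y and z values.
    return all(pairs == set(product(ys, zs)) for ys, zs, pairs in groups.values())

print(mvd_holds(ROWS, COLS, "St_no", "Pa_no"))

# Hypothetical extra row (not on the slide): S001 writes P002 only within J001.
print(mvd_holds(ROWS + [("J001", "S001", "P002")], COLS, "St_no", "Pa_no"))
```

On the four visible rows the MVD holds, which corresponds to CASE 1. With the hypothetical extra row it fails, because the papers of S001 would then depend on the project.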

4. Discover semantics in XML documents: Find participation constraints
The participation constraints of each relationship type can be obtained by counting the unique object ID values in the corresponding temporary table.
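
One possible reading of this counting step is sketched below; the slide does not spell out the procedure, so the function observed_participation and its (min, max) output are assumptions. The input is the set of (pno, sno) pairs that appear in the qty temporary table shown earlier.

```python
from collections import defaultdict

# Distinct (pno, sno) pairs that appear in the qty temporary table.
PAIRS = [
    ("P001", "S001"),
    ("P001", "S002"),
    ("P002", "S001"),
    ("P002", "S003"),
]

def observed_participation(pairs, side):
    """Return (min, max) number of distinct partners per object on the given side (0 or 1)."""
    partners = defaultdict(set)
    for pair in pairs:
        partners[pair[side]].add(pair[1 - side])
    counts = [len(p) for p in partners.values()]
    return min(counts), max(counts)

print("part:", observed_participation(PAIRS, 0))
print("supplier:", observed_participation(PAIRS, 1))
```

The counts describe only the data at hand. An object that takes part in no relationship instance never appears in the temporary table, so a minimum of 0 cannot be observed this way and the bounds remain candidates for user verification.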

4. Discover semantics in XML documents: User verification
All outputs, including intermediate results, should be verified by users.
With input from users and their verification, a semi-automatic mining process can be applied to discover the semantics in XML documents that are important for designing XML databases, storing XML data, validating XML views, and processing and optimizing XML queries.
All the discovered semantics can be represented in ORA-SS, but some of them cannot be represented in DTD/XSD.

Roadmap
1. XML documents and current XML schema languages
2. ORA-SS (Object-Relationship-Attribute model for Semi-Structured data)
3. The applications of ORA-SS
4. Discovering semantics in XML documents
5. Conclusion

5. Conclusion
1) We demonstrated a data-centric XML document and showed the limitations of the current XML schema standards in representing relational semantics and constraints.
2) We introduced ORA-SS, a semantically rich data model that can intuitively express the semantics in XML data.
3) We discussed a naïve method of mining semantics from XML data and schemas to generate an ORA-SS schema. More efficient methods should be investigated further.

5. Conclusion (cont.)
4) The semantics in ORA-SS are crucial for designing XML databases, writing and interpreting XML queries, validating XML views, etc.
5) The method proposed in this presentation for discovering semantics provides only candidate answers. In other words, not all the results are necessarily true, because the contents of the data may change. Therefore, user feedback is indispensable in the process of enriching an XML schema into an ORA-SS schema.

Q & A

The End