Designing Functional Dependencies For XML
Mong Li LEE, Tok Wang LING, Wai Lup LOW
EDBT 2002


2 Contents
1. Introduction
2. FDs for XML: FD XML
3. Replication cost model using FD XML
4. Verification of FD XML
5. Performance Studies
6. Conclusion
7. Q & A

Introduction

4 XML - Extensible Markup Language
Simplified descendant of Standard Generalized Markup Language (SGML).
Used for information interchange over the Web:
– Presentation-Oriented Publishing (POP)
– Message-Oriented Middleware (MOM)
New view of XML: as a data model.
Why is XML suitable as a data model?
– Data semantics
– Data independence

5 Motivation
Projects have suppliers who supply them with a quantity of parts at a certain price.
Each project is identified by a JName. Each supplier is identified by an SName. Each part is identified by a PartNo.
Constraint: a supplier must supply a part at the same price regardless of project.
JName, SName, PartNo → Qty
SName, PartNo → Price

6 Motivation
Use XML to model the Project-Supplier-Part database.
Additional requirements:
– Preserve the natural, inherent hierarchical structure.
– Order of nesting: Project, Supplier, Part.
Possible solutions...

7 Solution 1
Normalized. No (little) redundancy.
Extensive use of references (pointing relationships).
Model not natural. Difficult to understand.
Less efficient from a query processing point of view.
[Figure: instance tree for Solution 1, rooted at JSP, in which Project entries hold references to separately stored Supplier and Part elements. Legend: S is a reference to a Supplier element; P is a reference to a Part element.]

8 Solution 2
A good solution with clear semantics.
But it requires re-ordering of elements (i.e. from Project, Supplier, Part to Supplier, Part, Project), and this is not what the user wants.
[Figure: instance tree for Solution 2, nested in the order Supplier, Part, Project.]

9 Solution 3
Ordering (Project, Supplier, Part) is maintained.
De-normalized. Controlled redundancy.
Containment (parent-child) relationships.
Natural model. Easy to understand.
More efficient from a processing point of view (compared to Solution 1).
[Figure: instance tree for Solution 3, nested in the order JSP, Project, Supplier, Part, with Price and Qty under each Part.]
BUT:
– Data redundancy. Possible data inconsistency.
– How do we know that SName, PartNo → Price?

FD XML

11 Functional Dependency in Relational Databases
Let r be a relation on scheme R, and let X and Y be subsets of the attributes in R.
Relation r satisfies the FD X → Y if, for every X-value x, $\pi_Y(\sigma_{X=x}(r))$ has at most one tuple.
E.g. SName, PartNo → Price
This definition is given for flat tables. How can we extend it to the hierarchical structure of XML databases?
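A minimal Python sketch of the relational definition above: the FD holds if each X-value is associated with at most one Y-value. The function name `satisfies_fd` and the sample rows are illustrative assumptions, loosely modelled on the Project-Supplier-Part example.

```python
def satisfies_fd(rows, lhs, rhs):
    """Return True if every combination of LHS values maps to one RHS value."""
    seen = {}  # LHS values -> first RHS values observed for them
    for row in rows:
        key = tuple(row[a] for a in lhs)
        val = tuple(row[a] for a in rhs)
        if seen.setdefault(key, val) != val:
            return False
    return True

# Hypothetical sample relation (values chosen for illustration only).
rows = [
    {"JName": "Garden", "SName": "ABC Trading", "PartNo": "P789", "Price": 80, "Qty": 500},
    {"JName": "Garden", "SName": "ABC Trading", "PartNo": "P123", "Price": 10, "Qty": 200},
    {"JName": "Road Works", "SName": "ABC Trading", "PartNo": "P789", "Price": 80, "Qty": 50000},
    {"JName": "Road Works", "SName": "DEF Pte Ltd", "PartNo": "P123", "Price": 12, "Qty": 1000},
]

price_fd_holds = satisfies_fd(rows, ["SName", "PartNo"], ["Price"])
qty_fd_holds = satisfies_fd(rows, ["SName", "PartNo"], ["Qty"])
```

Here SName, PartNo → Price holds on the sample rows, while SName, PartNo → Qty does not, because the same supplier ships the same part to two projects in different quantities.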

12 Functional Dependency for XML
An XML functional dependency, FD XML: (Q, [P_x1, ..., P_xn → P_y]), where
– Q is the FD XML header path, a fully qualified path expression (i.e. the expression starts from the root).
– Each P_xi is a LHS entity type (which consists of an element name in the XML document and the optional key attribute(s)).
– P_y is a RHS entity type (which consists of an element name in the XML document and an optional attribute name).
– For any two instance subtrees identified by Q, if all LHS entities agree on their values, they must also agree on the value of the RHS entity, if it exists.
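The definition above can be sketched in Python with the standard `xml.etree.ElementTree` module. This is only an illustration under stated assumptions: the LHS entity types are nested below the header path in the order given, each is identified by a key attribute (`SName`, `PartNo`), and the RHS is a child element. The document layout and the name `check_fd_xml` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical instance document; the attribute/element layout is an assumption.
DOC = """<JSP>
  <Project JName="Garden">
    <Supplier SName="ABC Trading">
      <Part PartNo="P789"><Price>80</Price><Qty>500</Qty></Part>
      <Part PartNo="P123"><Price>10</Price><Qty>200</Qty></Part>
    </Supplier>
  </Project>
  <Project JName="Road Works">
    <Supplier SName="ABC Trading">
      <Part PartNo="P789"><Price>80</Price><Qty>50000</Qty></Part>
    </Supplier>
    <Supplier SName="DEF Pte Ltd">
      <Part PartNo="P123"><Price>12</Price><Qty>1000</Qty></Part>
    </Supplier>
  </Project>
</JSP>"""

def check_fd_xml(root, header, lhs, rhs):
    """Return violations of (header, [lhs -> rhs]) as (key, first_value, other_value).

    header: fully qualified path such as "/JSP/Project"
    lhs:    list of (element name, key attribute), nested in that order
    rhs:    name of the child element holding the dependent value
    """
    parts = header.strip("/").split("/")
    if root.tag != parts[0]:
        return []
    subtrees = root.findall("/".join(parts[1:])) if len(parts) > 1 else [root]

    seen, violations = {}, []

    def walk(node, depth, key):
        if depth == len(lhs):
            rhs_node = node.find(rhs)
            if rhs_node is None:  # the RHS entity is optional
                return
            value = (rhs_node.text or "").strip()
            if seen.setdefault(key, value) != value:
                violations.append((key, seen[key], value))
            return
        name, attr = lhs[depth]
        for child in node.findall(name):
            walk(child, depth + 1, key + (child.get(attr),))

    for subtree in subtrees:
        walk(subtree, 0, ())
    return violations

LHS = [("Supplier", "SName"), ("Part", "PartNo")]
violations = check_fd_xml(ET.fromstring(DOC), "/JSP/Project", LHS, "Price")
```

Because the lookup table is shared across all subtrees identified by the header path, a price that differs between two projects for the same supplier and part is reported as a violation.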

13 FD XML Example
[Figure: instance tree nested in the order JSP, Project, Supplier, Part, with @PartNo, Price and Qty under each Part.]
( /JSP/Project, [ Supplier, Part → Price ] )

14 Different Notations for FD XML
Basic notation: ( /JSP/Project, [ Supplier, Part → Price ] )
Showing the identifiers of elements: ( /JSP/Project, [ Supplier {SName}, Part {PartNo} → Price ] )
Header path implied: ( [ Supplier, Part → Price ] )

15 Distributing FD XML
Existing XML tools can be used if FD XML is itself expressed in XML.
A DTD is needed to facilitate the distribution of FD XMLs.
It can easily be translated to its XML Schema equivalent.

16 Distributing FD XML
DTD for the running Project-Supplier-Part database.
[Figure: DTD listing.]

17 Distributing FD XML
FD XML for the Project-Supplier-Part XML database.
Conceptual notation: ( /JSP/Project, [ Supplier, Part → Price ] )
[Figure: DTD for FD XML.]
[Figure: FD XML instance with header path /JSP/Project, LHS entity types Supplier (SName) and Part (PartNo), and RHS entity type Price.]

Replication Cost Model for FD XML

19 Replication Cost Model for FD XML
Data replication is sometimes unavoidable (or even desirable!)
– Provided it does not get out of hand.
Measure the degree of replication
– Gauge whether it is worth the increased effort for checking consistency and the increased risk of data inconsistency.
We need a replication cost model.

20 Definitions
Full FD XML
A full FD XML is one in which the LHS entity types are minimal, that is, there are no redundant LHS entity types.
Lineage
A set of nodes, L, in a tree is a lineage if:
1. There is a node N in L such that all the other nodes in the set are ancestors of N, and
2. For every node M in L, if L contains an ancestor of M, it also contains the parent of M.
Informal definition: "a straight and unbroken line of elements".
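The two lineage conditions can be checked directly. The following Python sketch assumes the tree is given as a child-to-parent mapping (the root has no entry); the function names and the sample tree are illustrative, not taken from the slides.

```python
def ancestors(node, parent):
    """List the ancestors of node, nearest first, using a child -> parent map."""
    out = []
    while node in parent:
        node = parent[node]
        out.append(node)
    return out

def is_lineage(nodes, parent):
    """Check the two lineage conditions for a set of nodes."""
    nodes = set(nodes)
    if not nodes:
        return False  # condition 1 requires some node N in the set
    # Condition 1: some node N has every other node of the set as an ancestor.
    cond1 = any(nodes - {n} <= set(ancestors(n, parent)) for n in nodes)
    # Condition 2: if the set holds an ancestor of M, it also holds M's parent.
    cond2 = all(
        parent[m] in nodes
        for m in nodes
        if any(a in nodes for a in ancestors(m, parent))
    )
    return cond1 and cond2

# Hypothetical element tree for the Project-Supplier-Part example.
PARENT = {
    "Project": "JSP",
    "Supplier": "Project",
    "Part": "Supplier",
    "Price": "Part",
    "Qty": "Part",
}

unbroken = is_lineage({"Project", "Supplier", "Part", "Price"}, PARENT)
broken = is_lineage({"Project", "Part"}, PARENT)   # Supplier is skipped
siblings = is_lineage({"Price", "Qty"}, PARENT)    # not on one line
```

The second call fails condition 2 (the line is broken at Supplier), and the third fails condition 1 (Price and Qty are siblings, so neither is an ancestor of the other).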

21 Definitions
Well-structured FD XML
Consider the DTD: …
The FD XML, F = (Q, [P_1, ..., P_k → P_{k+1}]), where Q = /H_1/.../H_m, holds on this DTD.
F is well-structured if:
1. There is a single RHS entity type (i.e. P_{k+1}).
2. The ordered XML elements in Q (i.e. H_1, ..., H_m), the LHS entity types (i.e. P_1, ..., P_k) and the RHS entity type (i.e. P_{k+1}), in that order, form a lineage.
3. The LHS entity types are minimal (i.e. no redundant LHS entity types).

22 Definitions (last one!)
Context Cardinality
The context cardinality of XML element X to XML element Y is the number of times Y can participate in a relationship with X in the context of X's entire ancestry in the XML document.
Denoted as: [notation], where D is the schema on which this context cardinality is defined, and Q is the header path of X.
[Figure: traditional cardinality in an ERD (Supplier-Part 1:M, Supplier-Project 1:N) compared with context cardinality (participation constraint) along the path JSP (document root), Project, Supplier (X), Part (Y): "the number of parts a supplier can supply to a project".]

23 Replication Cost Model
Suppose we have the following well-structured FD XML, and it holds on DTD D.
[Figure: the lineage H_1, H_2, ..., H_{m-1}, H_m, P_1, ..., P_k, P_{k+1}.]
The model for the replication factor is [Formula: RF(F).]

24 Using the Cost Model
F = ( /JSP/Project, [ Supplier, Part → Price ] )
[Figure: the path JSP, Project, Supplier, Part, Price, with RF(F) annotated by "Max. no. of projects under /JSP" and "Max. no. of projects a supplier can supply to, in the context of /JSP".]
What if each supplier is now constrained to supply to at most 20 projects?

25 Design insights from Cost Model
The length of the FD XML header path, Q, should be as short as possible.
Minimize the value of the 2nd parameter of RF(F).
– If there are several acceptable designs, choose the one with the smallest value for the 2nd parameter of RF(F).
Use the model to gauge extra storage requirements due to replication.

Verification of FD XML

27 Scenario
[Figure: flow diagram with the boxes XML Database, FD XML Specifications, Distribution, Verification Process and Verification Results.]

28 Verification Process
[Figure: the XML database and the FD XML specifications feed an XML parser, which maintains state variables, context information and a hash structure (with LHS values as hash keys).]
Set-up uses information from the FD XML.
Only a single pass through the database is required.
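A single-pass, hash-based check of this kind might look like the following Python sketch built on the standard `xml.sax` module. It is not the authors' implementation: the class `FDVerifier`, the document layout (key attributes `SName` and `PartNo`, a `Price` child element) and the assumption that the LHS entity types are nested below the header path are all illustrative.

```python
import xml.sax

# Hypothetical instance document; the attribute/element layout is an assumption.
DOC = """<JSP>
  <Project JName="Garden">
    <Supplier SName="ABC Trading">
      <Part PartNo="P789"><Price>80</Price><Qty>500</Qty></Part>
      <Part PartNo="P123"><Price>10</Price><Qty>200</Qty></Part>
    </Supplier>
  </Project>
  <Project JName="Road Works">
    <Supplier SName="ABC Trading">
      <Part PartNo="P789"><Price>80</Price><Qty>50000</Qty></Part>
    </Supplier>
    <Supplier SName="DEF Pte Ltd">
      <Part PartNo="P123"><Price>12</Price><Qty>1000</Qty></Part>
    </Supplier>
  </Project>
</JSP>"""

class FDVerifier(xml.sax.ContentHandler):
    """Checks (header, [lhs -> rhs]) in one pass over the SAX event stream."""

    def __init__(self, header, lhs, rhs):
        super().__init__()
        # Set-up from the FD: full path to the RHS element and key depths.
        self.rhs_path = header + [name for name, _ in lhs] + [rhs]
        self.key_attr = {len(header) + i + 1: attr
                         for i, (_, attr) in enumerate(lhs)}
        self.path = []    # context: names of the currently open elements
        self.keys = {}    # context: key values of the open LHS elements
        self.text = None  # buffer, active only inside the RHS element
        self.table = {}   # hash structure: LHS values -> first RHS value
        self.errors = []  # (LHS values, first RHS value, conflicting value)

    def startElement(self, name, attrs):
        self.path.append(name)
        depth = len(self.path)
        if self.path == self.rhs_path[:depth]:
            if depth in self.key_attr:
                self.keys[depth] = attrs.get(self.key_attr[depth])
            if depth == len(self.rhs_path):
                self.text = []

    def characters(self, content):
        if self.text is not None:
            self.text.append(content)

    def endElement(self, name):
        if self.text is not None and self.path == self.rhs_path:
            key = tuple(self.keys[d] for d in sorted(self.key_attr))
            value = "".join(self.text).strip()
            first = self.table.setdefault(key, value)
            if first != value:
                self.errors.append((key, first, value))
            self.text = None
        self.path.pop()

def verify(xml_text, header, lhs, rhs):
    handler = FDVerifier(header, lhs, rhs)
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler

LHS = [("Supplier", "SName"), ("Part", "PartNo")]
result = verify(DOC, ["JSP", "Project"], LHS, "Price")
```

Only the open-element path, the current key values and the hash table are held in memory, so the footprint grows with the number of distinct LHS value combinations rather than with the size of the document.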

29 Running the verification process

Performance Studies

31 Dataset
DBLP: a widely used, large XML bibliographical database.
80,000 journal records.
Check dependency Journal, Volume → Year.
A sample DBLP journal record (field values only): A. H. M. ter Hofstede; T. F. Verhoef; On the Feasibility of Situational Method Engineering; IS; 6/7; db/journals/is/is22.html#HofstedeV97

32 DOM vs. SAX
Document Object Model (DOM)
– Builds an in-memory tree of nodes.
Simple API for XML (SAX)
– Event-driven parsing.
DOM requires too much memory for large datasets.
By maintaining simple context information, we do not need the whole database to be in memory.
SAX parsing is more suitable for our verification technique.
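To illustrate why event-driven parsing suits this check, here is a hedged Python sketch that tests Journal, Volume → Year over DBLP-style records with `xml.sax`, keeping only the fields of the record currently open plus the hash table. The element names (`dblp`, `article`, `journal`, `volume`, `year`) follow DBLP's public XML format as an assumption, and the sample records are made up for illustration.

```python
import xml.sax

# Hypothetical DBLP-style records; titles and values are illustrative only.
SAMPLE = """<dblp>
<article>
  <author>A. Author</author><title>First Paper</title>
  <journal>IS</journal><volume>22</volume><year>1997</year>
</article>
<article>
  <author>B. Author</author><title>Second Paper</title>
  <journal>IS</journal><volume>22</volume><year>1997</year>
</article>
<article>
  <author>C. Author</author><title>Third Paper</title>
  <journal>IS</journal><volume>23</volume><year>1998</year>
</article>
</dblp>"""

class JournalVolumeYear(xml.sax.ContentHandler):
    """Checks journal, volume -> year without building a document tree."""

    FIELDS = ("journal", "volume", "year")

    def __init__(self):
        super().__init__()
        self.record = None  # fields of the article currently open, if any
        self.field = None   # name of the field being read, if any
        self.buf = []
        self.table = {}     # (journal, volume) -> first year seen
        self.errors = []    # ((journal, volume), first year, other year)

    def startElement(self, name, attrs):
        if name == "article":
            self.record = {}
        elif self.record is not None and name in self.FIELDS:
            self.field, self.buf = name, []

    def characters(self, content):
        if self.field is not None:
            self.buf.append(content)

    def endElement(self, name):
        if name == self.field:
            self.record[name] = "".join(self.buf).strip()
            self.field = None
        elif name == "article" and self.record is not None:
            rec, self.record = self.record, None  # discard the finished record
            if all(f in rec for f in self.FIELDS):
                key = (rec["journal"], rec["volume"])
                first = self.table.setdefault(key, rec["year"])
                if first != rec["year"]:
                    self.errors.append((key, first, rec["year"]))

def check(xml_text):
    handler = JournalVolumeYear()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler

checked = check(SAMPLE)
```

A DOM-based version would first materialize every record as tree nodes; here each record is dropped as soon as its closing tag is seen, which is the property the slides rely on for large datasets.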

33 DOM vs. SAX
[Figure: DOM vs. SAX comparison chart; the DOM run ends with an out-of-memory error.]
Experiments done on a P3 700 MHz machine (128 MB RAM) running WinNT 4.0.

34 Memory requirements
Hash structure for efficient access.
How much memory does the hash structure (with LHS values as hash keys) take?
This affects the feasibility of incremental checking.

35 Memory requirements
Experiments done on a P3 700 MHz machine (128 MB RAM) running WinNT 4.0. A SAX-based parser is used to parse the XML data.
[Figure: chart with the axes "No. of entries in the hash table" and "No. of 'errors'".]
FD XML verification does not take up much memory and scales up well.

Conclusion

37 Contributions
Representation for FDs in XML databases.
Replication cost model based on FD XML.
FD XML verification.
A framework for FD XML use and deployment.

38 Future work
Inference rules for FD XML.
Incremental FD XML checking for XML updates.
Integration of FD XML with next-generation XML DBMSs.
Mining FD XML from XML databases.
MVD XML.

39 Everything in ONE slide
To make XML a data model: FD XML.
To distribute/disseminate the known FD constraints: schema for FD XML.
Is redundancy in the XML database controlled? Replication cost model.
To verify FD XML efficiently: a single-pass hash-based technique.


Q & A