

Towards the Preservation of Keys in XML Data Transformation for Integration Md. Sumon Shahriar and Jixue Liu Data and Web Engineering Lab Computer and Information Science University of South Australia

Outline of the Presentation
- Motivation for XML data transformation with XML keys
- How to define XML keys
- How to transform XML keys
- Whether transformed XML keys are valid and preserved [Key Preservation]
- If an XML key is not preserved, how to capture it as an XML functional dependency (XFD) [Key Transition]

Data Transformations for Integration
- Relational → Relational
- Relational → XML
- XML → Relational
- XML → XML

Data Transformations for Integration with Constraints
- Relational → Relational
- Relational → XML
- XML → Relational
- XML → XML
Constraint (keys, functional dependencies, etc.) preservation (a.k.a. propagation) is well studied in some of these settings and little investigated in others. Existing work is mostly about structural transformation of schema and data, ignoring constraints. Reason: XML has been treated with a document-centric rather than a data-centric approach.

Motivating Example 1
The Unnest(sid) operation transforms the source DTD D_a (nested) into the target DTD D_b (flat-like).
[Figure: source DTD D_a and target DTD D_b]

[Figure: XML tree T_a and XML tree T_b, where T_b is obtained from T_a by Unnest(sid). Both trees have root enroll with dept children, each holding dname, cid and sid nodes.]

XML key consideration
Key K(enroll/dept,{cid}) under Unnest(sid) from D_a to D_b:
- K is valid on D_a
- K is satisfied by T_a
- Is K transformed? No
- Is K valid on D_b? Yes
- Is K satisfied by T_b? No

[Figure: XML tree T_a and XML tree T_b. The cid values are distinct in T_a; T_b contains duplicates.]

Observation 1: An XML key may not be preserved after transformation.

Motivating Example 2
The expand operation replaces (cid,sid+) with a new element course, transforming the source DTD D_a into the target DTD D_b.
- On the source, K(enroll/dept,{cid}) is valid and satisfied.
- Is K still valid on the target? No. Reason: the path is transformed.
- Suggestion: the key itself needs to be transformed, giving K(enroll/dept/course,{cid}).
- Is the transformed key satisfied? Maybe or maybe not; this needs to be checked.

Expanding (cid,sid+) with new element course

Observation 2: When a DTD is transformed, how XML keys should be transformed needs to be defined.

Contributions
- Defining XML keys on DTDs and their satisfaction
- Rules for transforming XML keys using important operations
- Key preservation [key to key]
- Defining XML functional dependencies (XFDs) and their satisfaction
- Key transition [key to XFD]

Contribution: defining XML keys on DTDs and their satisfaction
- Keys are defined on the schema definition (DTD)
- A novel technique produces semantically correct values for key satisfaction
- Captures some properties of relational keys, in the sense of value completeness and disallowing redundant values
- Captures the ID properties of a DTD definition
- Improves on the key notion in XML Schema

XML Key
Given a DTD D = (EN, β, ρ), an XML key on D is defined as K(Q, {P_1, …, P_l}), where l ≥ 0, Q is a complete path on D called the selector, and {P_1, …, P_i, …, P_l} (often denoted by P) is a set of fields. Each P_i is defined as P_i = p_i1 ∪ … ∪ p_in_i, where "∪" means disjunction, p_ij (j ∈ [1, …, n_i]) is a simple path on D, β(last(p_ij)) = Str, and p_ij has the following syntax:
  p_ij = seq
  seq = e | e/seq, where e ∈ EN
Q/p_ij is a complete path.

Example of XML keys
Source DTD D_a:
- K(enroll/dept,{cid}): selector = enroll/dept, field = {cid}. β(cid) = #PCDATA means Str, so β(last(cid)) = Str.
- K(enroll/dept,{cid,sid}): selector = enroll/dept, fields = {cid,sid}. β(last(cid)) = β(last(sid)) = Str.
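The validity conditions from the key definition can be sketched in Python. The DTD figure for D_a is not in this transcript, so the dictionary below is an assumed shape (enroll containing dept, dept containing dname, cid and sid), and `ROOT`, `CHILDREN`, `STR_ELEMENTS`, `is_path`, `is_complete_path` and `is_valid_key` are illustrative names rather than anything from the slides.

```python
# Assumed shape of D_a; the original DTD figure is not in the transcript.
ROOT = "enroll"
CHILDREN = {
    "enroll": ["dept"],
    "dept": ["dname", "cid", "sid"],
    "dname": [],
    "cid": [],
    "sid": [],
}
STR_ELEMENTS = {"dname", "cid", "sid"}  # elements e with beta(e) = Str


def is_path(start, path):
    """True if `path` (e1/e2/...) is a chain of child steps below `start`."""
    current = start
    for step in path.split("/"):
        if step not in CHILDREN.get(current, []):
            return False
        current = step
    return True


def is_complete_path(path):
    """A complete path starts at the root element."""
    head, _, rest = path.partition("/")
    return head == ROOT and (rest == "" or is_path(ROOT, rest))


def is_valid_key(selector, fields):
    """Check K(selector, fields): the selector is a complete path, and every
    field is a simple path below the selector that ends in a Str element."""
    if not is_complete_path(selector):
        return False
    last_of_selector = selector.split("/")[-1]
    return all(
        is_path(last_of_selector, field)
        and field.split("/")[-1] in STR_ELEMENTS
        for field in fields
    )
```

Under these assumptions, `is_valid_key("enroll/dept", ["cid", "sid"])` holds, while a key whose selector mentions an element the DTD does not define, such as `enroll/dept/course`, is rejected.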

Some definitions for XML key satisfaction
[P-tuple] Given a key K(Q, {P_1, …, P_l}) and a tree T, let T_Q be a tree in T. A P-tuple in T_Q is a tuple of pair-wise close sub-trees. By pair-wise close, we mean that the sub-trees of the tuple lie in the same minimal hedge.
A P-tuple is complete if [condition missing from the transcript].
We call T_P = T_last(P) the prefixed format tree. For example, if P = enroll/dname, then T_P = T_dname.

Proposed techniques
[Hedge] A hedge is a consecutive sequence of primary sub-trees of the same node.
[Minimal structure] Given a DTD definition β(e) and two elements e_1 and e_2 in β(e), the minimal structure g of e_1 and e_2 in β(e) is the pair of brackets that encloses e_1 and e_2 such that no other structure inside g encloses both.
[Minimal hedge] Given a hedge H of β(e), a minimal hedge of e_1 and e_2 is one of the H^g's in H.

Example of minimal structure, minimal hedge and P-tuple
[Figure: XML tree T_a with the minimal hedges H_1^g, H_2^g and H_3^g marked]
On D_a, take K(enroll/dept,{cid,sid}) with P_1 = cid and P_2 = sid.
- The minimal structure is g = (cid,sid+).
- The minimal hedges are H_1 = v_4 v_5 v_6 and H_2 = v_7 v_8 under node v_1, and H_3 = v_10 v_11 v_12 under node v_2.
- The P-tuples are F_1 = v_4 v_5 and F_2 = v_4 v_6 for hedge H_1, and F_3 = v_7 v_8 for hedge H_2, under node v_1; and F_4 = v_10 v_11 and F_5 = v_10 v_12 for hedge H_3 under node v_2.
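One way to reproduce the hedges and P-tuples listed on this slide is sketched below. It assumes that, for g = (cid,sid+), a minimal hedge starts at each cid node and runs over the sid nodes that follow it. The function names are hypothetical, and the dname nodes v3 and v9 are an assumption about the figure.

```python
def minimal_hedges(children):
    """Split a dept node's (label, node_id) children into minimal hedges for
    g = (cid, sid+): each hedge starts at a cid and extends over the sid
    nodes that follow it. Other children (e.g. dname) are skipped."""
    hedges = []
    for label, node in children:
        if label == "cid":
            hedges.append([(label, node)])
        elif label == "sid" and hedges:
            hedges[-1].append((label, node))
    return hedges


def p_tuples(hedge):
    """Pair the hedge's cid node with each of its sid nodes (P1=cid, P2=sid)."""
    cid_node = hedge[0][1]
    return [(cid_node, node) for label, node in hedge[1:] if label == "sid"]


# Children of v1 and v2; placing dname at v3 and v9 is an assumption.
v1_children = [("dname", "v3"), ("cid", "v4"), ("sid", "v5"), ("sid", "v6"),
               ("cid", "v7"), ("sid", "v8")]
v2_children = [("dname", "v9"), ("cid", "v10"), ("sid", "v11"), ("sid", "v12")]

hedges_v1 = minimal_hedges(v1_children)
hedges_v2 = minimal_hedges(v2_children)
tuples_v1 = [t for hedge in hedges_v1 for t in p_tuples(hedge)]
tuples_v2 = [t for hedge in hedges_v2 for t in p_tuples(hedge)]
```

Pairing values only inside one hedge is what keeps a cid from being combined with a sid that belongs to a different course.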

Produced P-tuples

XML Key Satisfaction
An XML tree T satisfies a key K(Q, {P_1, …, P_l}) if the following hold:
- If {P_1, …, P_l} = ∅, then T satisfies K iff there exists one and only one T_Q in T.
- Otherwise, all of these conditions hold:
  (a) there exists at least one P-tuple in every T_Q;
  (b) every P-tuple in T_Q is complete;
  (c) every P-tuple in T_Q is value distinct;
  (d) no two P-tuples taken from different T_Q are value equal. This requires that P-tuples in different T_Q must be value distinct.
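The satisfaction conditions can be checked mechanically once the P-tuples of each T_Q are known. The sketch below assumes the four conditions are combined by conjunction, represents a P-tuple as a tuple of field values with `None` for a missing value, and uses made-up cid values that only loosely follow the trees T_a and T_b. The function names are illustrative.

```python
def satisfies_key(trees):
    """`trees` maps each selector tree T_Q to its list of P-tuples, where a
    P-tuple is a tuple of field values and None marks a missing value."""
    seen = set()
    for tuples in trees.values():
        if not tuples:                      # at least one P-tuple per T_Q
            return False
        for p_tuple in tuples:
            if any(value is None for value in p_tuple):
                return False                # every P-tuple must be complete
            if p_tuple in seen:
                return False                # distinct within and across T_Q
            seen.add(p_tuple)
    return True


def satisfies_empty_key(selector_trees):
    """Empty field set: exactly one T_Q must exist."""
    return len(selector_trees) == 1


# Illustrative values for K(enroll/dept,{cid}); not the exact slide data.
source = {"v1": [("Phys01",), ("Phys02",)], "v2": [("Chem02",)]}
target = {"v1": [("Phys01",), ("Phys01",), ("Phys02",)],
          "v2": [("Chem02",), ("Chem02",)]}

source_ok = satisfies_key(source)
target_ok = satisfies_key(target)
```

With these values the source tree satisfies the key and the unnested target does not, because unnesting repeats a cid once per sid.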

Checking satisfaction of key: T_Q = T_v1 and T_Q = T_v2

Contribution: rules for transforming key definitions
- A key is transformed if any path in the key is transformed.
- After the transformation, the key must be checked for validity on the target schema.
- If a key is not transformed, it is valid on the target DTD.

Transformation on key: Unnest operation
g = (g_1 x g_2+)+ → g = (g_1 x g_2)+
Example: (cid,sid+)+ → (cid,sid)+
- Turns the nested structure into a flat-like structure
- No path transformation
- No change in the key definition
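At the data level, the unnest step can be sketched as follows. The representation of a hedge as a `(cid, [sid, ...])` pair and the sample values are assumptions made for illustration.

```python
def unnest(hedges):
    """Turn each nested hedge (cid, [sid, ...]) into flat (cid, sid) pairs."""
    return [(cid, sid) for cid, sids in hedges for sid in sids]


# Illustrative data for one dept node.
nested = [("Phys01", ["001", "002"]), ("Phys02", ["003"])]
flat = unnest(nested)
cids = [cid for cid, _ in flat]
```

Here `flat` repeats `Phys01` once per sid, so the cid values alone are no longer distinct, while the (cid, sid) pairs still are.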

Transformation on key: Nest operation
g = (g_1 x g_2)+ → g = (g_1 x g_2+)+
Example: (cid,sid)+ → (cid,sid+)+
- Turns the flat-like structure into a nested structure
- No path transformation
- No change in the key definition
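A data-level sketch of nest is given below. It assumes that nest merges all pairs sharing the same cid value, which the slide does not state explicitly, and the function name and sample values are illustrative.

```python
def nest(pairs):
    """Group flat (cid, sid) pairs into nested (cid, [sid, ...]) hedges,
    merging pairs that share a cid and keeping first-seen order."""
    grouped = {}
    for cid, sid in pairs:
        grouped.setdefault(cid, []).append(sid)
    return list(grouped.items())


flat_pairs = [("Phys01", "001"), ("Phys01", "002"), ("Phys02", "003")]
nested_hedges = nest(flat_pairs)
```

After grouping, each cid value appears once per dept, which is consistent with nest being listed among the key preserving operations.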

Transformation on key: Expand operation
g = (g_1 x g_2+)+ → g = (g_new)+, where g_new = g_1 x g_2+
Example: g = (cid,sid+)+ → g = (course)+, where g_new = (cid,sid+)
- Pushes the structure one level down
- The path is transformed in the DTD and therefore in the key
- Rules are needed to transform the key correctly

Transformation on key: rules for expand
The result depends on where the new element is added in the key paths (either the selector or a field). On D_a:
- expand((cid,sid+), course): K(enroll/dept,{cid,sid}) → K(enroll/dept/course,{cid,sid}) or K(enroll/dept,{course/cid,course/sid})
- expand(sid+, stIDs): K(enroll/dept,{cid,sid}) → K(enroll/dept,{cid,stIDs/sid})
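The expand rules amount to path rewriting, sketched below. It assumes the expanded structure sits directly below the last element of the selector. The function name `expand_key` and its `into` parameter are hypothetical.

```python
def expand_key(selector, fields, enclosed, new_element, into="fields"):
    """Rewrite K(selector, fields) after expand(enclosed, new_element).

    `enclosed` is the set of element names wrapped by `new_element`.
    into="fields":   prefix each affected field with the new element.
    into="selector": append the new element to the selector instead,
                     which only applies when every field is enclosed.
    """
    affected = [f for f in fields if f.split("/")[0] in enclosed]
    if not affected:
        return selector, list(fields)   # no path in the key is transformed
    if into == "selector":
        if len(affected) != len(fields):
            raise ValueError("selector rule needs every field to be enclosed")
        return f"{selector}/{new_element}", list(fields)
    return selector, [f"{new_element}/{f}" if f in affected else f
                      for f in fields]
```

For example, `expand_key("enroll/dept", ["cid", "sid"], {"sid"}, "stIDs")` returns the selector unchanged and the fields `["cid", "stIDs/sid"]`, matching the second rule on the slide.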

Transformation on key: Collapse operation
g = (g_coll)+, where g_coll = g_1 x g_2+ → g = (g_1 x g_2+)+
Example: g = (dept)+, where g_dept = (cid,sid+) → g = (cid,sid+)+
- Moves the structure one level up
- The path is transformed in the DTD and therefore in the key
- Rules are needed to transform the key correctly

Transformation on key: rules for collapse
The result depends on which element is deleted from the key paths (either the selector or a field). On D_a:
- collapse(dept): K(enroll/dept,{cid,sid}) → K(enroll,{cid,sid})
- collapse(dept): K(enroll,{dept/cid,dept/sid}) → K(enroll,{cid,sid})
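The collapse rules can be sketched as the inverse path rewrite: the collapsed element is dropped wherever it appears in the key. The function name `collapse_key` is hypothetical.

```python
def collapse_key(selector, fields, removed):
    """Rewrite K(selector, fields) after collapse(removed): drop the removed
    element from the selector path and from every field path."""
    def strip(path):
        return "/".join(step for step in path.split("/") if step != removed)

    return strip(selector), [strip(field) for field in fields]
```

Both examples on the slide end in the same key, whether `dept` was part of the selector or a prefix of the fields.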

Contribution: key preservation
Given a source DTD, a document conforming to it, and a valid key that the document satisfies, the key is said to be preserved by the transformation if the transformed key is valid on the target DTD and is satisfied by the target document.

Key preserving properties of operations
- Preserving: Nest and Collapse
- Preserving under necessary and sufficient conditions: Unnest and Expand

Theorem: The Unnest operator is key preserving if some key fields do not cross g_1.

Example for Unnest(sid) on (cid,sid+)+, where g_1 = cid and g_2 = sid: the key K(enroll/dept,{cid}) is not preserved. However, if the key is K(enroll,{cid,sid}), then the key is preserved.

Theorem: The Expand operator is key preserving if, whenever the selector is transformed, every tree for the selector has a P-tuple.

Example for Expand: K(enroll/dept,{cid}) is transformed to K(enroll/dept/course,{cid}) or K(enroll/dept,{course/cid}). No duplicate cid values are produced, so the cid values stay distinct.

Contribution: key transition
Given a source DTD, a document conforming to it, and a valid key that the document satisfies, suppose the transformed key is valid on the target DTD but is not satisfied by the target document. If the key is transformed to an XFD that the target document satisfies, we say the XML key is transited as an XFD.

XML functional dependency (XFD)
Given a DTD D = (EN, β, ρ), an XML functional dependency (XFD) on D is defined as Φ(S, P → Q), where S is a complete path on D called the scope, P = {p_1, …, p_i, …, p_l} is a set of simple paths called the determinant or LHS, Q is a simple path or the empty path ε called the dependent or RHS, and S/P and S/Q are complete paths. If Q = ε, then the XFD Φ(S, P → ε) implies that P → last(S), meaning that P determines S.

Tuple for XFD
[Tuple] Given an XFD Φ(S, P → Q) and a tree T, let T_S be a tree in T. A tuple in T_S is a tuple of pair-wise close sub-trees. By pair-wise close, we mean that the sub-trees of the tuple lie in the same minimal hedge.
- By P-tuple, we mean the tuple for the paths P.
- By Q-tuple, we mean the tuple for the path Q.
A P-tuple is complete if [condition missing from the transcript].

XFD satisfaction
An XML tree satisfies an XFD Φ(S, P → Q) if the following hold:
- If Q = ε, then the P-tuple is complete; otherwise, both the P-tuple and the Q-tuple are complete.
- For every pair of tuples F_1 and F_2 in T_S, if F_1[P] =_v F_2[P], then F_1[Q] =_v F_2[Q].
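A sketch of the XFD check follows. It reads the second condition as the usual functional-dependency test (equal P-values imply equal Q-values within one T_S), and it reads the Q = ε case as "equal P-values must come from the same scope tree", which is an interpretation of "P determines S" rather than something the slide spells out. The function names and sample values are illustrative.

```python
def satisfies_xfd(scope_trees):
    """`scope_trees` maps each scope tree T_S to a list of (p_values, q_value)
    tuples. Checks completeness and, within each T_S, that equal P-values
    imply equal Q-values."""
    for tuples in scope_trees.values():
        determined = {}
        for p_values, q_value in tuples:
            if any(v is None for v in p_values) or q_value is None:
                return False            # incomplete P-tuple or Q-tuple
            if determined.setdefault(p_values, q_value) != q_value:
                return False            # same P-values, different Q-value
    return True


def satisfies_xfd_empty_rhs(scope_trees):
    """Q is the empty path: equal P-values must belong to the same scope
    tree. `scope_trees` maps each T_S id to a list of P-value tuples."""
    owner = {}
    for tree_id, tuples in scope_trees.items():
        for p_values in tuples:
            if any(v is None for v in p_values):
                return False            # incomplete P-tuple
            if owner.setdefault(p_values, tree_id) != tree_id:
                return False            # same P-values under two scope trees
    return True


# Illustrative cid values per dept after Unnest(sid); not the slide data.
unnested = {"v1": [("Phys01",), ("Phys01",), ("Phys02",)],
            "v2": [("Chem02",), ("Chem02",)]}
unnested_ok = satisfies_xfd_empty_rhs(unnested)
```

Under this reading, repeated cid values inside one dept are acceptable for Φ(enroll/dept,{cid} → ε), whereas the same cid under two different dept nodes is not.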

Key transition algorithm
  check := CheckKeyTransformation(k, UnNest);
  if check = TRUE then
      Φ := TransformKeyToXFD(k);
      if target T satisfies the XFD Φ then
          return Φ and "KeyTransited";
      end if
  end if

Function CheckKeyTransformation(k, UnNest)
  if g_1 crosses any P_i in [P_1, …, P_n] at an element e, where e is in g_1 and e is in P_i, then
      return TRUE;
  else
      return FALSE;
  end if

Function TransformKeyToXFD(k)
  Φ[S] := k[Q];
  for all i such that 1 ≤ i ≤ n do
      Φ[P_i] := k[P_i];
  end for
  Φ[Q] := ε;
  return Φ(S, {P} → Q);
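The algorithm and its two helper functions can be sketched in Python as below. The `Key` and `XFD` classes, the use of `None` for the empty path, and the caller-supplied `target_satisfies` predicate are assumptions introduced for the sketch; the slides do not say how satisfaction on the target tree is computed at this step.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Key:
    selector: str           # Q
    fields: tuple           # P_1, ..., P_n


@dataclass(frozen=True)
class XFD:
    scope: str              # S
    lhs: tuple              # P
    rhs: Optional[str]      # Q; None stands for the empty path


def check_key_transformation(key, g1_elements):
    """TRUE if g1 crosses some field P_i, i.e. an element of g1 occurs in P_i."""
    return any(step in g1_elements
               for field in key.fields
               for step in field.split("/"))


def transform_key_to_xfd(key):
    """Copy the selector to the scope and the fields to the LHS; RHS is empty."""
    return XFD(scope=key.selector, lhs=key.fields, rhs=None)


def key_transition(key, g1_elements, target_satisfies):
    """Return (xfd, "KeyTransited") when g1 crosses a key field and the
    target document satisfies the resulting XFD; otherwise return None.
    `target_satisfies` is a caller-supplied predicate on an XFD."""
    if not check_key_transformation(key, g1_elements):
        return None
    xfd = transform_key_to_xfd(key)
    if target_satisfies(xfd):
        return xfd, "KeyTransited"
    return None


# Unnest(sid) on (cid,sid+)+ with g1 = cid; the predicate is a stand-in.
k = Key("enroll/dept", ("cid",))
result = key_transition(k, {"cid"}, lambda xfd: True)
```

Passing the satisfaction check in as a function keeps the sketch independent of any particular tree representation.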

[Figure: XML tree T_a and XML tree T_b. The cid values are distinct in T_a and duplicated in T_b; the key K(enroll/dept,{cid}) is captured on the target as the XFD Φ(enroll/dept,{cid} → ε).]

Theorem: An XML key on source DTD can only be transited to an XFD on the target DTD if the key is satisfied by the conforming source document.

Talked about XML data transformation with keys:
- A new definition of XML keys
- Transformation rules for keys
- Key preservation
- Key transition
- Also a new definition of XML functional dependency (XFD)

Our papers
- “On Defining Keys for XML”, IEEE CIT 2008, Database and Data Mining Workshop, Sydney.
- “Key Preserving P2P Data Transformation for XML”, LNCS, DBISP2P 2008 (VLDB Workshop), Auckland, New Zealand.
- “Transition of keys in XML Data Transformation”, IEEE CSA 2008, Hobart.
- “On Defining Functional Dependency for XML”, IEEE IWSCA 2008, Korea.

Other research issues
Already done:
- “Preserving functional dependency in XML data transformation”, LNCS, ADBIS 2008, Finland.
- Preserving inclusion dependency in XML data transformation
Future work:
- Adaptation of constraints in XML data integration
- Detecting conflicts between source constraints and target constraints in XML settings
- Checking validity and satisfaction of the constraints: XML keys, XFDs and XML inclusion dependencies (XIDs)
- Performance of XML data transformation and integration with constraints

Thank You Questions