Efficient Discovery of XML Data Redundancies. Cong Yu and H. V. Jagadish, University of Michigan, Ann Arbor. VLDB 2006, Seoul, Korea, September 12th, 2006.



2 / 42 Talk Outline
- Motivating Example
- A Comprehensive Notion of XML FD
- XML Redundancy Discovery Algorithms
- Experimental Evaluation
- Conclusion

3 / 42 An Example XML Document
[Figure: XML data tree rooted at warehouse, with state, store (name), and book (ISBN, title, au, price) elements. Sample values: stores “Borders” and “Amazon”; ISBN “… 269”; title “DB”; authors “R.R.” and “J.G.”; prices “$59.9” and “$51.1”.]

4 / 42 Constraints on XML Data
- An example constraint: for any two book elements, if they have the same ISBN, then they have the same title.
  - Target: book; condition element(s): ISBN; implication element(s): title
- Similar to Equality Generating Dependencies (EGDs) [BV84] and Nested EGDs [YP04]

5 / 42 Data Redundancies
- E.g., title is redundantly stored
- Result of “non-optimal” design of the database schema in the presence of constraints
- Lead to:
  - Update anomalies
  - Increased cost for data transfer and manipulation
- Constraints are properties of the data
  - May not be known at the design phase

6 / 42 Goal
Efficiently discover redundancies from the XML database by discovering satisfied constraints

7 / 42 Main Contributions
- A comprehensive notion of XML FD
  - Capturing a semantically richer set of XML constraints
  - Definition of XML data redundancy in terms of XML FDs and XML Keys
- Efficient algorithms for discovering FDs and data redundancies from an XML database
- Experimental evaluation

8 / 42 Talk Outline
- Motivating Example
- A Comprehensive Notion of XML FD
- XML Redundancy Discovery Algorithms
- Experimental Evaluation
- Conclusion

9 / 42 Backup slide: Example XML Constraints
- Regular: condition and implication elements are children of the target
[Figure: the example warehouse data tree (state, store, book elements with name, ISBN, title, au, and price values).]

10 / 42 Example XML Constraints
- Hierarchical: condition and/or implication elements can come from multiple hierarchies
[Figure: the example warehouse data tree (state, store, book elements with name, ISBN, title, au, and price values).]

11 / 42 Example XML Constraints, Cont’d
- Set elements: condition and/or implication elements can involve set elements
[Figure: the example warehouse data tree (state, store, book elements with name, ISBN, title, au, and price values).]

12 / 42 Functional Dependencies (FDs)
- FDs are used to describe constraints in relational databases
- A similar notion of FD is needed for XML
- Challenges:
  - Target is difficult to specify due to the hierarchical structure
  - Set elements introduce new semantics
- XML FD needs richer semantics!

13 / 42 Previous Notions
- Path Based Notion [LLL02,VLL04]
  - Example: {/warehouse/state/store/book/ISBN} → /warehouse/state/store/book/title
  - Format: LHS → RHS
  - Semantics: for any two RHS nodes, same (associated) LHS indicates same RHS
- Tree Tuple Based Notion [AL04]
  - A tree tuple is a data tree, with exactly one data node for each schema element
  - Format: LHS → RHS
  - Semantics: for any two tree tuples, same LHS indicates same RHS

14 / 42 Previous Notions, cont’d
- Both capture hierarchical constraints
- Neither can capture set constraints
  - {/store/book/ISBN} → /store/book/au
    - Violated in previous notions
    - Satisfied if the two au nodes are a single set
  - {/store/book/title, /store/book/au} → /store/book/ISBN
    - Undefined in previous notions
    - Intuitive if au nodes are a single set
[Figure: a store named “Borders” with one book: ISBN “… 269”, title “DB”, au “R.R.”, au “J.G.”, price “$59.9”.]

15 / 42 A New Comprehensive Notion
- Generalized Tree Tuple
  - A data tree constructed around a pivot data node (n_p)
  - Entire subtree rooted at n_p is kept
  - All ancestors of n_p and their “attributes” are kept
- Tuple Class C_P
  - The set of all generalized tree tuples whose pivot nodes share the same path P (called the pivot path)
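The construction above can be sketched in Python. This is only an illustration under stated assumptions: the `Node` class and the function `generalized_tree_tuple` are hypothetical names, and the “attributes” of an ancestor are read here as its leaf children, which the slide does not spell out. The sample tree is toy data loosely modeled on the running example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    label: str
    value: Optional[str] = None
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = field(default=None, repr=False, compare=False)

    def add(self, child: "Node") -> "Node":
        child.parent = self
        self.children.append(child)
        return child


def copy_subtree(n: Node) -> Node:
    """Deep-copy a node together with all of its descendants."""
    c = Node(n.label, n.value)
    for ch in n.children:
        c.add(copy_subtree(ch))
    return c


def generalized_tree_tuple(pivot: Node) -> Node:
    """Keep the pivot's subtree, its ancestors, and the ancestors' leaf children."""
    current = copy_subtree(pivot)
    node = pivot
    while node.parent is not None:
        anc = node.parent
        anc_copy = Node(anc.label, anc.value)
        for ch in anc.children:
            if ch is node:
                anc_copy.add(current)            # the path down to the pivot
            elif not ch.children:                # leaf child, treated as an "attribute" (assumption)
                anc_copy.add(copy_subtree(ch))
        current = anc_copy
        node = anc
    return current


# Toy data: one store with two books; the pivot is the first book.
warehouse = Node("warehouse")
state = warehouse.add(Node("state"))
store = state.add(Node("store"))
store.add(Node("name", "Borders"))
book1 = store.add(Node("book"))
book1.add(Node("ISBN", "...269"))
book2 = store.add(Node("book"))
book2.add(Node("title", "DB"))

t = generalized_tree_tuple(book1)
```

In this sketch the sibling book is dropped from the tuple because it is neither an ancestor of the pivot nor a leaf child of one, while the store's name is kept.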

16 / 42 Example Generalized Tree Tuple
[Figure: the example warehouse data tree with a pivot node marked.]

17 / 42 Example Generalized Tree Tuple
[Figure: the example warehouse data tree with a pivot node marked.]

18 / 42 XML FD: LHS → RHS w.r.t. C_P
- Semantics: for any two generalized tree tuples t_1, t_2 in C_P, if they share the same LHS, they have the same RHS.
- E.g., {./title, ./au} → ./ISBN, w.r.t. C_{/warehouse/state/store/book}
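A minimal sketch of these semantics in Python, assuming each generalized tree tuple has already been flattened into a dictionary from relative paths to values. The function name `fd_holds`, the dictionary layout, and the choice to model the repeatable au element as a `frozenset` (empty when a book has no au) are assumptions for illustration; the sample values follow the running example.

```python
from typing import Dict, Hashable, List

TreeTuple = Dict[str, Hashable]


def fd_holds(tuple_class: List[TreeTuple], lhs: List[str], rhs: str) -> bool:
    """True if any two tuples that agree on every LHS path also agree on RHS."""
    seen = {}
    for t in tuple_class:
        key = tuple(t.get(p) for p in lhs)
        val = t.get(rhs)
        if key in seen and seen[key] != val:
            return False
        seen.setdefault(key, val)
    return True


# Tuple class for the book pivot path; au is a set-valued element.
books = [
    {"./ISBN": "...269", "./title": "DB",
     "./au": frozenset({"R.R.", "J.G."}), "./price": "$59.9"},
    {"./ISBN": "...269", "./title": "DB",
     "./au": frozenset(), "./price": "$51.1"},
    {"./ISBN": "...269", "./title": "DB",
     "./au": frozenset({"R.R.", "J.G."}), "./price": "$59.9"},
]

print(fd_holds(books, ["./title", "./au"], "./ISBN"))
print(fd_holds(books, ["./ISBN"], "./title"))
print(fd_holds(books, ["./ISBN"], "./price"))
```

Comparing au values as whole sets is what lets an FD with au on the left-hand side be well defined in this sketch.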

19 / 42 Repeatable Elements Are Special
[Figure: the example warehouse data tree (state, store, book elements with name, ISBN, title, au, and price values).]

20 / 42 Essential Tuple Classes
- Definition: tuple classes with pivot paths that correspond to repeatable schema elements
  - C_{/warehouse/state/store/book} is essential
  - C_{/warehouse/state/store/name} is not
- Essential tuple classes can express the XML FDs that are expressible with non-essential tuple classes
  - See paper for detailed proof

21 / 42 Backup slide: Structurally Redundant XML FDs
- Definition: FDs where none of the paths in LHS and RHS is a descendant of the pivot path
- Their satisfaction on a data tree is mirrored by other FDs
  - I.e., they are satisfied if and only if some other FD is satisfied
- See paper for detailed explanation

22 / 42 Backup slide: Interesting XML FD
- RHS is not contained in LHS
- C_P is an essential tuple class
- RHS is a descendant of the pivot node
- See paper for details

23 / 42 XML Key and Data Redundancy
- Let […] uniquely identify each node in the entire data tree
- […] is an XML Key when the database satisfies the XML FD: LHS → […] w.r.t. C_P
  - Similar to the relative key notion proposed in [BDF+01]
- Data redundancy exists if the database:
  - Satisfies the XML FD […],
  - But […] is not an XML key
  - In that case, RHS is redundantly stored.
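One way to read the redundancy definition is sketched below in Python. It assumes that a key means the LHS values are unique across the tuple class (so LHS determines node identity), and that redundancy means an FD LHS → RHS holds while LHS is not a key. The function names and the flattened dictionary representation are hypothetical, and the second data set is made-up toy data.

```python
from typing import Dict, Hashable, List

TreeTuple = Dict[str, Hashable]


def fd_holds(tuple_class: List[TreeTuple], lhs: List[str], rhs: str) -> bool:
    """True if tuples that agree on LHS also agree on RHS."""
    seen = {}
    for t in tuple_class:
        key = tuple(t.get(p) for p in lhs)
        if seen.setdefault(key, t.get(rhs)) != t.get(rhs):
            return False
    return True


def is_key(tuple_class: List[TreeTuple], lhs: List[str]) -> bool:
    """Assumed reading of an XML key: no two tuples share the same LHS values."""
    keys = [tuple(t.get(p) for p in lhs) for t in tuple_class]
    return len(set(keys)) == len(keys)


def rhs_redundantly_stored(tuple_class: List[TreeTuple], lhs: List[str], rhs: str) -> bool:
    """The FD is satisfied, yet LHS does not identify a single tuple."""
    return fd_holds(tuple_class, lhs, rhs) and not is_key(tuple_class, lhs)


# Running example: the same ISBN appears in three book tuples.
books = [
    {"./ISBN": "...269", "./title": "DB"},
    {"./ISBN": "...269", "./title": "DB"},
    {"./ISBN": "...269", "./title": "DB"},
]
# Made-up contrast case: every ISBN is distinct, so ISBN acts as a key.
distinct_books = [
    {"./ISBN": "A", "./title": "DB"},
    {"./ISBN": "B", "./title": "OS"},
]

print(rhs_redundantly_stored(books, ["./ISBN"], "./title"))
print(rhs_redundantly_stored(distinct_books, ["./ISBN"], "./title"))
```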

24 / 42 Talk Outline
- Motivating Example
- A Comprehensive Notion of XML FD
- XML Redundancy Discovery Algorithms
- Experimental Evaluation
- Conclusion

25 / 42 Strategy
- Discover satisfied XML FDs and Keys
- Data redundancies can then be discovered based on the definition
- First, we need an efficient representation of the XML data

26 / 42 Hierarchical Representation of XML Data
- Each essential tuple class → a relation
  - Similar to nested relations [OY87,MNE96]
  - All relations together form a hierarchy
  - Tree tuples can be reconstructed by joining with the parent relation
[Tables: one relation per essential tuple class, each with a parent column. The store relation has columns parent and name (values “Borders”, “Amazon”, “Borders”); the book relation has columns parent, ISBN, title, and price; the au relation holds the values R.R. and J.G.]
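A small Python sketch of this representation, assuming each relation is a dictionary from tuple id to a row that carries a `parent` id. The tuple ids and values loosely follow the running example but should be treated as assumptions, since the tables did not survive intact in the transcript; `join_with_parent` is a hypothetical helper name.

```python
from typing import Dict

Relation = Dict[int, Dict[str, object]]

# One relation per essential tuple class; "parent" points into the parent relation.
R_store: Relation = {
    4:  {"parent": 3,  "name": "Borders"},
    12: {"parent": 3,  "name": "Amazon"},
    19: {"parent": 18, "name": "Borders"},
}
R_book: Relation = {
    6:  {"parent": 4,  "ISBN": "...269", "title": "DB", "price": "$59.9"},
    13: {"parent": 12, "ISBN": "...269", "title": "DB", "price": "$51.1"},
    20: {"parent": 19, "ISBN": "...269", "title": "DB", "price": "$59.9"},
}


def join_with_parent(child: Relation, parent: Relation) -> Relation:
    """Attach each child's parent attributes, using ./ and ../ path prefixes."""
    joined: Relation = {}
    for tid, row in child.items():
        parent_row = parent[row["parent"]]
        out = {"./" + k: v for k, v in row.items() if k != "parent"}
        out.update({"../" + k: v for k, v in parent_row.items() if k != "parent"})
        joined[tid] = out
    return joined


book_tuples = join_with_parent(R_book, R_store)
print(book_tuples[13])
```

Repeating the join up the hierarchy would recover the ancestor attributes of each pivot node, which is the part of a generalized tree tuple that lies above the pivot.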

27 / 42 Intra-Relation FDs
- {./ISBN} → ./title, w.r.t. C_{/warehouse/state/store/book}
[Figure: the example warehouse data tree (state, store, book elements with name, ISBN, title, au, and price values).]

28 / 42 Inter-Relation FDs
- {../name, ./ISBN} → ./price, w.r.t. C_{/warehouse/state/store/book}
  - ./ISBN and ./price: present in R_book
  - ../name: present in R_store
[Figure: the example warehouse data tree (state, store, book elements with name, ISBN, title, au, and price values).]

29 / 42 Overview of the Discovery Process
- Only interested in minimal FDs
- Bottom-up
- At each relation:
  - Discover intra-relation FDs and Keys
  - Discover inter-relation FDs and Keys involving descendant relations
  - Generate candidate inter-relation FDs and Keys for examination at the parent level
- Attribute Partition as the basic data structure

30 / 42 Attribute Partition
- Groups tuples according to the attribute value
  - ∏_{price} for C_book = { {t_6, t_20}, {t_13} }
  - ∏ for C_book = { {t_6}, {t_20}, {t_13} }
  - ∏ for C_book = { {t_6}, {t_20}, {t_13} }
- FD: LHS → RHS w.r.t. C_P is satisfied iff ∏_{LHS ∪ RHS} = ∏_{LHS}
[Table: the book relation with columns parent, ISBN, title, and price; three tuples, each with ISBN “…269” and title “DB”.]
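The partition test can be sketched in Python as follows. The relation layout (tuple id mapped to a row dictionary), the helper names `partition` and `fd_holds`, and the concrete row values are assumptions chosen to agree with the partitions shown on the slides.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Set

Relation = Dict[int, Dict[str, object]]

R_book: Relation = {
    6:  {"parent": 4,  "ISBN": "...269", "title": "DB", "price": "$59.9"},
    13: {"parent": 12, "ISBN": "...269", "title": "DB", "price": "$51.1"},
    20: {"parent": 19, "ISBN": "...269", "title": "DB", "price": "$59.9"},
}


def partition(relation: Relation, attrs: List[str]) -> Set[FrozenSet[int]]:
    """Group tuple ids by their values on the given attributes."""
    groups = defaultdict(set)
    for tid, row in relation.items():
        groups[tuple(row[a] for a in attrs)].add(tid)
    return {frozenset(g) for g in groups.values()}


def fd_holds(relation: Relation, lhs: List[str], rhs: str) -> bool:
    """LHS -> RHS holds iff adding RHS does not split any LHS group."""
    return partition(relation, lhs + [rhs]) == partition(relation, lhs)


print(partition(R_book, ["price"]))
print(fd_holds(R_book, ["ISBN"], "title"))
print(fd_holds(R_book, ["ISBN"], "price"))
```

Because the partition on LHS ∪ RHS always refines the partition on LHS, comparing the two for equality is enough; an implementation could compare only the number of groups.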

31 / 42 Set Attribute Partition
- Generated through refinement
  - Initialize ∏_{au} for R_book to be { {t_6, t_13, t_20} }
  - Convert to parent: ∏ for R_au = { {t_10, t_24}, {t_11, t_25} } → { {t_6, t_20}, {t_6, t_20} }
  - Refine ∏_{au} using the converted partitions: ∏_{au} for R_book = { {t_6, t_20}, {t_13} }
- ∏_{au} can then be used as a normal partition
[Tables: the au relation (values R.R. and J.G.) and the book relation with columns parent, ISBN, title, and price.]
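The refinement steps above can be sketched in Python. The function name `set_attribute_partition`, the relation layout, and the au tuple ids with their parent assignments are assumptions chosen to match the partitions on the slide.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, Set

Relation = Dict[int, Dict[str, object]]

# au tuples with the id of the book tuple they belong to (assumed layout).
R_au: Relation = {
    10: {"parent": 6,  "au": "R.R."},
    11: {"parent": 6,  "au": "J.G."},
    24: {"parent": 20, "au": "R.R."},
    25: {"parent": 20, "au": "J.G."},
}


def set_attribute_partition(parent_ids: Iterable[int], child: Relation,
                            attr: str) -> Set[FrozenSet[int]]:
    """Partition parent tuples by the *set* of child values they contain."""
    blocks = [set(parent_ids)]                      # start with a single group
    parents_by_value = defaultdict(set)
    for row in child.values():                      # child partition, converted to parent ids
        parents_by_value[row[attr]].add(row["parent"])
    for having_value in parents_by_value.values():  # refine once per distinct value
        refined = []
        for block in blocks:
            inside, outside = block & having_value, block - having_value
            refined.extend(b for b in (inside, outside) if b)
        blocks = refined
    return {frozenset(b) for b in blocks}


pi_au = set_attribute_partition([6, 13, 20], R_au, "au")
print(pi_au)
```

Two parent tuples end up in the same group exactly when, for every distinct value, either both contain it or neither does, which is equality of their value sets.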

32 / 42 Discovery Algorithms
- DiscoverFD:
  - Discover intra-relation FDs and Keys
  - Similar to existing relational algorithms
- DiscoverXFD:
  - Discover inter-relation FDs and Keys
  - Key component: candidate inter-relation XML FD generation

33 / 42 Generating Candidate Inter-Relation FDs
- Let P' be a parent relation of P
- Parent satisfaction property
  - For LHS ∪ X → RHS w.r.t. C_P to hold for any attribute set X in relation P', LHS ∪ {./parent} → RHS w.r.t. C_P must hold
- Child implication property
  - For LHS ∪ X → RHS w.r.t. C_P to be a non-trivial FD for any attribute set X in relation P', LHS → RHS w.r.t. C_P must not hold
- An FD is a candidate inter-relation FD if it satisfies both properties
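A minimal Python sketch of the two-property test, reusing a partition-based FD check. The helper names (`partition`, `fd_holds`, `is_candidate`) and the sample rows are assumptions for illustration, with `parent` standing in for the ./parent attribute.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Set

Relation = Dict[int, Dict[str, object]]

R_book: Relation = {
    6:  {"parent": 4,  "ISBN": "...269", "title": "DB", "price": "$59.9"},
    13: {"parent": 12, "ISBN": "...269", "title": "DB", "price": "$51.1"},
    20: {"parent": 19, "ISBN": "...269", "title": "DB", "price": "$59.9"},
}


def partition(relation: Relation, attrs: List[str]) -> Set[FrozenSet[int]]:
    """Group tuple ids by their values on the given attributes."""
    groups = defaultdict(set)
    for tid, row in relation.items():
        groups[tuple(row[a] for a in attrs)].add(tid)
    return {frozenset(g) for g in groups.values()}


def fd_holds(relation: Relation, lhs: List[str], rhs: str) -> bool:
    return partition(relation, lhs + [rhs]) == partition(relation, lhs)


def is_candidate(relation: Relation, lhs: List[str], rhs: str) -> bool:
    """Candidate inter-relation FD: both properties from the slide must hold."""
    parent_satisfaction = fd_holds(relation, lhs + ["parent"], rhs)
    child_implication = not fd_holds(relation, lhs, rhs)
    return parent_satisfaction and child_implication


print(is_candidate(R_book, ["ISBN"], "price"))
print(is_candidate(R_book, ["ISBN"], "title"))
```

In this data, {./ISBN} → ./price is a candidate because it fails within the book relation but could be repaired by parent attributes, whereas {./ISBN} → ./title already holds and so has nothing to gain from the parent level.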

34 / 42 Backup slide: Generating Partition Target
- Example candidate FD: {./ISBN} → ./price w.r.t. C_book
- We associate each FD with a Partition Target (PT):
  - Specifying inequalities that parent attribute partitions must satisfy
- ∏_{ISBN} = { {t_6, t_13, t_20} }
- ∏_{price} = { {t_6, t_20}, {t_13} }
- PT = { t_4 ≠ t_12, t_19 ≠ t_12 }
[Table: the book relation with columns parent, ISBN, title, and price.]

35 / 42 Backup slide: Checking Partition Target
- Candidate FD: {./ISBN} → ./price w.r.t. C_book
- We check each parent attribute partition against the PT to discover inter-relation FDs
- We use various techniques to compactly represent PT
  - See analysis in paper
- PT = { t_4 ≠ t_12, t_19 ≠ t_12 }
- ∏_{name} = { {t_4, t_19}, {t_12} }
- {../name} → ./price w.r.t. C_book
[Table: the store relation with columns parent and name (values “Borders”, “Amazon”, “Borders”).]
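Generating and checking a partition target might look like the following Python sketch. It represents each inequality as an unordered pair of parent tuple ids and uses a naive pairwise scan rather than the compact representation mentioned above; the names `partition_target` and `satisfies`, and the sample rows, are assumptions for illustration.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Set

Relation = Dict[int, Dict[str, object]]

R_book: Relation = {
    6:  {"parent": 4,  "ISBN": "...269", "title": "DB", "price": "$59.9"},
    13: {"parent": 12, "ISBN": "...269", "title": "DB", "price": "$51.1"},
    20: {"parent": 19, "ISBN": "...269", "title": "DB", "price": "$59.9"},
}


def partition_target(relation: Relation, lhs: List[str], rhs: str) -> Set[FrozenSet[int]]:
    """Pairs of parent tuples that a parent attribute partition must separate."""
    pt = set()
    for a, b in combinations(relation.values(), 2):
        same_lhs = all(a[x] == b[x] for x in lhs)
        if same_lhs and a[rhs] != b[rhs]:
            pt.add(frozenset({a["parent"], b["parent"]}))
    return pt


def satisfies(parent_partition: Set[FrozenSet[int]], pt: Set[FrozenSet[int]]) -> bool:
    """True if no group of the parent partition contains both ends of a pair."""
    return not any(pair <= block for pair in pt for block in parent_partition)


pt = partition_target(R_book, ["ISBN"], "price")
pi_name = {frozenset({4, 19}), frozenset({12})}   # partition of store tuples by name
print(pt)
print(satisfies(pi_name, pt))
```

A pair whose two ends are the same parent tuple collapses to a single id and can never be separated, so such a target is rejected by every parent partition in this sketch.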

36 / 42 Talk Outline
- Motivating Example
- A Comprehensive Notion of XML FD
- XML Redundancy Discovery Algorithms
- Experimental Evaluation
- Conclusion

37 / 42 Real Datasets
- DBLP contains a fair amount of redundancy, as also noted earlier in [AL04]
- ~10% redundancies in PIR (measured as # of redundant elements over total # of elements); schema modification reported to PIR

38 / 42 Scalability on XMark
- Linear in terms of scale factor (# of elements), even though exponential in theory
- Orders of magnitude faster than direct application of a state-of-the-art relational discovery algorithm
  - The latter takes over 3 hours to run on XMark scale factor 1

39 / 42 Related Work
- XML Integrity Constraints (FDs and Keys)
  - [BDF+01], [LLL02], [FS03]
- XML Normal Form
  - [AL04], [VLL04]
- Nested Relation Normal Form
  - [OY87], [MNE96]
- Relational FD discovery
  - FUN, Dep-Miner, TANE, fdep, FastFDs

40 / 42 Backup slide: GORDIAN
- Both use extensive pruning strategies based on the properties of FDs
  - E.g., singleton pruning is adopted in both
- GORDIAN is more aggressive since it only looks for keys
- Our algorithm is more comprehensive: it discovers satisfied FDs in addition to keys

41 / 42 Conclusion
- A comprehensive notion of XML FDs and Keys, capturing set semantics
- A system for detecting XML data redundancies through the discovery of FDs and Keys
- The system is practical for real datasets and outperforms direct application of the best available relational algorithm by orders of magnitude.

42 / 42 Questions?