Information Preserving XML Schema Embedding. Philip Bohannon (Bell Laboratories), Wenfei Fan (Univ of Edinburgh & Bell Labs), Michael Flaster (Bell Laboratories).

Presentation transcript:

1 Information Preserving XML Schema Embedding
Philip Bohannon (Bell Laboratories), Wenfei Fan (Univ of Edinburgh & Bell Labs), Michael Flaster (Bell Laboratories), PPS Narayan (Bell Laboratories)

2 XML mapping
XML mapping σ_d : I(S1) → I(S2)
– Instance-level: from XML instances of a given source DTD schema S1 to XML trees of a predefined target DTD schema S2
– Information preserving (lossless)
Applications: XML data exchange, migration, integration, P2P, …
[Figure: an XML tree T of S1 is taken by the XML mapping to an XML tree of S2]

3 Example: XML mapping – source DTD
Source schema S1:
  db → class*
  class → cno, title, type
  type → (regular + project)
  regular → prereq
  prereq → class*
DTD: (E, P, r). E: element types; r: root; P: element type definitions A → α, where
  α ::= PCDATA | ε | B1, …, Bk | B1 + … + Bk | B*
Graph representation:
– concatenation B1, …, Bk: AND edge (solid)
– disjunction B1 + … + Bk: OR edge (dashed)
– Kleene star B*: STAR edge (with edge label *)
[Figure: graph of S1 over the nodes db, class, cno, title, type, regular, project, prereq; the edges db → class and prereq → class are STAR edges]
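The graph view of a DTD can be written down directly as a data structure. The following is a minimal sketch, not taken from the slides: the `Production` class, the kind labels, and the `edges` helper are illustrative names, and the leaf types cno, title and project are assumed to be PCDATA because the slide does not give their productions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Production:
    """One element type definition A -> alpha (hypothetical helper type)."""
    kind: str             # "AND" (concatenation), "OR" (disjunction), "STAR", or "STR" (PCDATA)
    children: tuple = ()  # child element types, in order


# Source DTD S1 = (E, P, r) from the slide. E is the key set, P the dict, r the root.
S1_ROOT = "db"
S1 = {
    "db": Production("STAR", ("class",)),
    "class": Production("AND", ("cno", "title", "type")),
    "type": Production("OR", ("regular", "project")),
    "regular": Production("AND", ("prereq",)),
    "prereq": Production("STAR", ("class",)),
    # Assumption: leaves are PCDATA; the slide leaves them unspecified.
    "cno": Production("STR"),
    "title": Production("STR"),
    "project": Production("STR"),
}


def edges(dtd):
    """Return the labelled edges (parent, child, kind) of the DTD graph."""
    return [(a, b, p.kind) for a, p in dtd.items() for b in p.children]
```

Each production contributes one edge per child, labelled with the production's kind, which is what the AND/OR/STAR edge styles in the figure encode.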

4 Example: XML mapping – target DTD
Target schema S2:
[Figure: graph of S2 rooted at school. Node labels: school, courses, current, history, course, basic, cno, credit, semester, term, year, title, category, mandatory, advanced, regular, lab, seminar, project, required, prereq, gpa, students, student, ssn, name, taking, gpa; several edges are STAR edges]

5 Information preserving XML mapping
Objective: find an XML mapping σ_d : I(S1) → I(S2) such that
Type safety: for any XML tree T of S1, σ_d(T) is an XML document that conforms to the predefined target schema S2
Information preserving:
– Invertibility: there exists an inverse σ_d^{-1} : I(S2) → I(S1) such that for any XML tree T of S1, T = σ_d^{-1}(σ_d(T)). The source T can be recovered from the target σ_d(T)
– Query preservation w.r.t. a query language L: there is a query-rewriting function F : L → L such that for any Q in L and any T of S1, Q(T) = F(Q)(σ_d(T)). All queries in L on the source can be answered on the target

6 Challenge: different structures
S1 and S2 have vastly different structures: graph similarity (simulation) does not work here!
[Figure: the graphs of S1 and S2 side by side]

7 Challenge: data integration
Multiple sources are to be mapped to a single target: the target schema must have a larger information capacity, so it cannot be similar to the sources
[Figure: source schemas S1 (db, class, cno, title, type, regular, project, prereq) and S1’ (db, student, ssn, name, taking, cno) mapped into the target S2 (school, courses, current, history, students, student, ...)]

8 About query preservation: XML query languages
Regular XPath:
  Q ::= ε | A | Q/text() | Q/Q | Q ∪ Q | Q* | Q[q]
  q ::= Q | Q/text() = ‘c’ | position() = k | q ∧ q | q ∨ q | not q
An XPath fragment: Q//Q instead of Q*
Example: a regular XPath query over S1: find all prerequisites of CS331
  Q1: class[cno/text() = ‘CS331’] / (type/regular/prereq/class)*
Query rewriting yields the corresponding query over S2:
  Q2: courses/current/course[basic/cno/text() = ‘CS331’] / (category/mandatory/regular/required/prereq/course)*
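The step from Q1 to Q2 can be illustrated for the simplest case, a plain sequence of child steps. The sketch below is an assumption-laden illustration rather than the rewriting function F itself: it handles only label paths (no qualifiers, union, or star), and `PATH` and `rewrite_steps` are hypothetical names. The path entries are the ones listed on the schema embedding example slides.

```python
# path(A, B) for the S1 -> S2 embedding, as given on the example slides.
PATH = {
    ("db", "class"): "courses/current/course",
    ("class", "cno"): "basic/cno",
    ("class", "title"): "basic/semester/title",
    ("class", "type"): "category",
    ("type", "regular"): "mandatory/regular",
    ("type", "project"): "advanced/project",
    ("regular", "prereq"): "required/prereq",
    ("prereq", "class"): "course",
}


def rewrite_steps(context, steps):
    """Rewrite a sequence of child steps over S1, starting at element type
    `context`, into the corresponding path over S2.

    Each source edge (A, B) is replaced by path(A, B); the context then
    moves to B. Restricted to label paths only.
    """
    out = []
    for step in steps:
        out.append(PATH[(context, step)])
        context = step
    return "/".join(out)


# Body of the Kleene star in Q1, evaluated at a class element.
star_body = rewrite_steps("class", ["type", "regular", "prereq", "class"])
```

Here `star_body` is the string `category/mandatory/regular/required/prereq/course`, which matches the starred sub-expression of Q2. Rewriting full regular XPath also has to translate qualifiers and unions, which this sketch leaves out.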

9 Challenge: information preservation for XML
For relational data w.r.t. relational calculus (L), invertibility (calculus dominance) and query preservation (dominance) coincide [Hull 84]
Separation:
(a) There is an invertible XML mapping that is NOT query preserving w.r.t. XPath.
(b) There is an XML mapping that is query preserving w.r.t. XPath without position() but is NOT invertible.
Complexity: It is undecidable to determine, for an XML mapping defined in any language subsuming FO, whether it is (a) invertible, or (b) query preserving w.r.t. any query language with projection. Both checks are therefore beyond reach for XML mappings defined in XQuery/XSLT.
Other results:
– query preservation w.r.t. regular XPath is stronger than invertibility
– sufficient conditions under which the two coincide

10 Previous work
XML mappings defined in XQuery/XSLT: no guidance on
– type safety: for any XML tree T of S1, is σ_d(T) guaranteed to conform to the predefined (recursive) target schema S2?
– how to ensure information preservation
Schema mapping: to derive instance-level mapping
– similarity flooding, Cupid, Clio, TransSCM…
– cannot guarantee information preservation
Information preservation in traditional data models: not directly applicable to XML mappings
No prior work has considered information-preserving XML mapping

11 Our approach
A systematic way to find XML mappings commonly used in practice:
– find a schema mapping (embedding) σ : S1 → S2 with certain properties, if there is any
– derive an instance-level mapping σ_d : I(S1) → I(S2) from σ
– automatically guarantee information preservation
– accommodate integration (multiple sources)
Input: source DTD S1 = (E1, P1, r1), target DTD S2 = (E2, P2, r2); similarity matrix att( ) on element type names: att(A, B) in [0, 1] indicates how close A ∈ E1 is to B ∈ E2
Output: schema embedding σ = (λ( ), path( ))

12 Schema embedding σ = (λ( ), path( ))
λ : E1 → E2, type mapping: λ(r1) = r2 and att(A, λ(A)) > 0
path(A, B) maps an edge (A, B) in S1 to a unique path from λ(A) to λ(B) in S2: A1[position() = k1] / … / An[position() = kn]
– path type: AND (OR, STAR) edge to AND (OR, STAR) path (solid/star edges; solid + at least 1 dashed; solid edges + *)
Information capacity
– prefix-free: if P1(A) = A1, …, An, then path(A, Ai) is NOT a prefix of any path(A, Aj) for j ≠ i; similarly for P1(A) = A1 + … + An
Type safety – valid mapping
Is there a schema embedding for the following?
[Figure: two pairs of schemas S1 and S2 over the element types A, B, C]
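The prefix-free condition is easy to check mechanically once the paths for one source type are known. The following is a minimal sketch under simplifying assumptions: paths are plain tuples of element names, the position() qualifiers are ignored, and `is_prefix` and `prefix_free` are illustrative names. The first example uses the paths for class given in the later example slide; the second is the A, B, C case where one path extends another.

```python
def is_prefix(p, q):
    """True if step sequence p is a prefix of step sequence q (equal sequences count)."""
    return len(p) <= len(q) and q[:len(p)] == p


def prefix_free(paths):
    """Check the prefix-free condition for one source type A.

    `paths` maps each child type Ai to path(A, Ai), written as a tuple of
    steps. Assumption: position() qualifiers are dropped, so two steps
    are equal when their element names are equal.
    """
    items = list(paths.values())
    return not any(
        is_prefix(p, q)
        for i, p in enumerate(items)
        for j, q in enumerate(items)
        if i != j
    )


# Paths for the children of class, as listed on the example slide.
CLASS_PATHS = {
    "cno": ("basic", "cno"),
    "title": ("basic", "semester", "title"),
    "type": ("category",),
}

# A violating assignment: the path to B is a prefix of the path to C.
BAD_PATHS = {"B": ("B",), "C": ("B", "C")}
```

With these inputs `prefix_free(CLASS_PATHS)` holds, while `prefix_free(BAD_PATHS)` does not, because an instance could not tell where the B subtree ends and the route to C begins.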

13 Example: Schema embedding
[Figure: schemas S1 and S2 over the element types A, B, C]
λ(A) = A, λ(B) = B, λ(C) = C
path(A, B) = A/B
path(A, C) = B/C
Unfolding: the prefix-free condition
Query translation: B/C
[Figure: schemas S1 and S2 over A and B; the S1 edges are labelled 1 and 2]
Schema embedding: NO. Graph simulation: YES.
Schema embedding is not a mild generalization of graph simulation

14 Schema embedding: example
λ(db) = school, λ(class) = course
path(db, class) = courses/current/course
– mapping an edge to a path
– STAR edge to STAR path
– Graph similarity? NO
[Figure: S1 (db, class) and S2 (school, courses, current, history, course, students, student, gpa, name, ssn, taking)]

15 Schema embedding: example
λ(type) = category, λ(A) = A
path(class, cno) = basic/cno
path(class, title) = basic/semester/title
path(class, type) = category
AND (STAR) edges to AND (STAR) paths
Relative path: relative to course
[Figure: S1 (class, cno, title, type) and S2 (course, basic, cno, credit, semester, term, year, title, category)]

16 Schema embedding: example
λ(X) = X
path(type, regular) = mandatory/regular
path(type, project) = advanced/project
OR edges to OR paths
[Figure: S1 (type, regular, project) and S2 (category, mandatory, advanced, regular, lab, seminar, project)]
λ(X) = X
path(regular, prereq) = required/prereq
path(prereq, class) = course
[Figure: S1 (regular, prereq, class, ...) and S2 (regular, gpa, required, prereq, course, ...)]

17 Deriving instance-level mapping
Each schema embedding σ : S1 → S2 determines an XML mapping σ_d : I(S1) → I(S2)
Key ingredients: path types and the prefix-free condition
Given an XML tree T1 of S1, σ_d(T1) constructs an instance T2 of S2 top-down, mapping A-elements of T1 to λ(A)-nodes in T2:
– the root of T2 is mapped from the root of T1;
– for each λ(A)-element in T2 mapped from an A-element of T1, generate path(A, B) in T2 for each B-child of the A-element;
– when all the elements in T2 mapped from nodes in T1 are fully expanded, add the necessary “default” elements to T2 so that T2 satisfies S2.
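The top-down construction can be sketched in a few lines. This is an illustration under stated assumptions, not the derivation from the talk: trees are plain dicts, position() qualifiers are ignored, every step of a path except the last is shared among siblings while the last step always creates a fresh node (so a STAR edge is assumed to be the final step of its path), and the last bullet above (padding with "default" elements) is omitted. The names `node`, `child_named`, `map_tree` and the title text "Databases" are hypothetical.

```python
def node(tag, text=None, children=None):
    """Build a tree node as a plain dict (hypothetical representation)."""
    return {"tag": tag, "text": text, "children": children or []}


def child_named(parent, tag):
    """Return the first child of `parent` with the given tag, or None."""
    for c in parent["children"]:
        if c["tag"] == tag:
            return c
    return None


def map_tree(src, lam, path):
    """Map an A-element of T1 to a lam[A]-element of T2, top-down.

    For each B-child, walk path[(A, B)]: intermediate steps are created
    once and reused, and the final step is the node mapped from the child.
    Default elements required by S2 are NOT added in this sketch.
    """
    tgt = node(lam[src["tag"]], src["text"])
    for child in src["children"]:
        steps = path[(src["tag"], child["tag"])]
        cur = tgt
        for step in steps[:-1]:
            nxt = child_named(cur, step)
            if nxt is None:
                nxt = node(step)
                cur["children"].append(nxt)
            cur = nxt
        cur["children"].append(map_tree(child, lam, path))
    return tgt


LAM = {"db": "school", "class": "course", "cno": "cno", "title": "title",
       "type": "category", "project": "project"}
PATH = {
    ("db", "class"): ["courses", "current", "course"],
    ("class", "cno"): ["basic", "cno"],
    ("class", "title"): ["basic", "semester", "title"],
    ("class", "type"): ["category"],
    ("type", "project"): ["advanced", "project"],
}

# A one-class instance of S1; the title text is made up for illustration.
source = node("db", children=[
    node("class", children=[
        node("cno", "CS331"),
        node("title", "Databases"),
        node("type", children=[node("project")]),
    ]),
])
target = map_tree(source, LAM, PATH)
```

In the resulting tree the `basic` node is created once and holds both `cno` and `semester/title`, which is why the paths for one source type must be prefix-free: the inverse mapping follows the same paths back down to recover each child.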

18 Properties of schema embedding
Theorem: The XML mapping σ_d : I(S1) → I(S2) derived from a schema embedding σ : S1 → S2 is
– well defined (type safety),
– invertible (with a quadratic-time inverse), and
– query preserving w.r.t. regular XPath (query rewriting: linear-time data complexity, quadratic-time combined complexity)

19 Integration: multiple sources
λ(db) = school, λ(X) = X
path(db, student) = students/student
path(taking, cno) = cno
Pairwise disjoint path mappings from S1, S1’ to S2
[Figure: source schemas S1 (db, class, cno, title, type, regular, project, prereq) and S1’ (db, student, ssn, name, taking, cno) embedded into S2 (school, courses, current, history, students, student, gpa, name, ssn, taking, cno, ...)]

20 Schema embedding vs. graph simulation
Definition:
– embedding: mapping edges to paths
– simulation: mapping edges to edges
Restructuring:
– embedding: various DTD constructs, different structures
– simulation: source and target schemas with similar structures
Information preservation for XML mappings:
– embedding: automatically guarantees both invertibility and query preservation w.r.t. regular XPath
– simulation: no
Data integration:
– embedding: multiple source DTDs to a single target schema
– simulation: no
A systematic method to define information-preserving XML mappings

21 Complexity: finding schema embedding
Input: two DTD schemas S1 and S2, and a similarity matrix att( )
Output: a schema embedding σ : S1 → S2 such that qual(σ, att) is maximal, if there is any
qual(σ, att) is the sum of att(A, λ(A)) for all A in S1
Theorem: It is NP-complete to determine whether or not there is a schema embedding from S1 to S2, even when S1 and S2 are nonrecursive and consist of concatenation types only.
Efficient algorithms are necessarily heuristic:
– find a local embedding for each DTD production of S1
– assemble the local embeddings into a schema embedding
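The quality measure and the two constraints on λ stated earlier (λ(r1) = r2 and att(A, λ(A)) > 0) translate directly into code. This is a small sketch with made-up similarity values; `ATT`, `LAM`, `qual` and `is_valid_type_mapping` are illustrative names, and a missing att entry is treated as similarity 0.

```python
def qual(lam, att):
    """qual(sigma, att): the sum of att(A, lam(A)) over all source types A."""
    return sum(att.get((a, b), 0.0) for a, b in lam.items())


def is_valid_type_mapping(lam, att, r1, r2):
    """Check lam(r1) = r2 and att(A, lam(A)) > 0 for every mapped type A."""
    return lam.get(r1) == r2 and all(
        att.get((a, b), 0.0) > 0 for a, b in lam.items()
    )


# Hypothetical similarity values; the slides do not give a concrete matrix.
ATT = {
    ("db", "school"): 0.5,
    ("class", "course"): 0.75,
    ("class", "student"): 0.25,
    ("type", "category"): 0.5,
}
LAM = {"db": "school", "class": "course", "type": "category"}
```

For these values `qual(LAM, ATT)` is 0.5 + 0.75 + 0.5 = 1.75. Swapping class to student would still be a valid type mapping here but would lower the score to 1.25, which is the kind of trade-off the search has to weigh.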

22 Computing local embedding – fixed type mapping
Input: a production A → P(A) in source DTD S1, target schema S2
Output: σ0 = (λ0, path0), a partial embedding from P(A) to S2
Example: find λ0( ) from types in P(A) to types of S2, and path0( )
[Figure: S1 (type, regular, project) and S2 (category, mandatory, advanced, regular, lab, seminar, project)]
If λ0 is given: an O(|P(A)| |S2|) algorithm findPath finds a local embedding (depth-first search, checking each S2 subtree only once)
When λ0 is not fixed, the local embedding problem is NP-hard
Heuristic: randomized findPath to find both λ0 and path0 (randomly pick a possible type-node match during the search)...
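To make the fixed-λ0 case concrete, here is a plain depth-first enumeration of typed paths in a target schema. It is a simplified stand-in, not the findPath algorithm from the talk: it does not achieve the O(|P(A)| |S2|) bound, it ignores position() qualifiers, and it classifies a path as AND (only AND edges), OR (AND edges plus at least one OR edge) or STAR (AND edges plus exactly one STAR edge). The `S2_FRAGMENT` adjacency list is an assumed reading of part of the S2 figure, and `find_paths` is a hypothetical name.

```python
# Assumed fragment of the target schema S2: node -> [(child, edge kind)].
S2_FRAGMENT = {
    "school": [("courses", "AND"), ("students", "AND")],
    "courses": [("current", "AND"), ("history", "AND")],
    "current": [("course", "STAR")],
    "history": [("course", "STAR")],
    "course": [("basic", "AND"), ("category", "AND")],
    "basic": [("cno", "AND"), ("credit", "AND"), ("semester", "AND")],
    "semester": [("title", "AND")],
    "category": [("mandatory", "OR"), ("advanced", "OR")],
    "mandatory": [("regular", "OR")],
    "advanced": [("project", "OR")],
}


def path_kind(kinds):
    """Classify a list of edge kinds (simplified; qualifiers ignored)."""
    if all(k == "AND" for k in kinds):
        return "AND"
    if "OR" in kinds and "STAR" not in kinds:
        return "OR"
    if kinds.count("STAR") == 1 and "OR" not in kinds:
        return "STAR"
    return None


def find_paths(target, start, goal, kind, max_len=6):
    """Depth-first search for paths from `start` to `goal` of type `kind`.

    Returns each path as a list of steps; `max_len` bounds the search so
    that recursive schemas terminate.
    """
    results = []

    def dfs(current, steps, kinds):
        if steps and current == goal and path_kind(kinds) == kind:
            results.append(list(steps))
        if len(steps) == max_len:
            return
        for child, k in target.get(current, []):
            dfs(child, steps + [child], kinds + [k])

    dfs(start, [], [])
    return results
```

For the STAR edge (db, class) with λ0(db) = school and λ0(class) = course, `find_paths(S2_FRAGMENT, "school", "course", "STAR")` offers both `courses/current/course` and `courses/history/course`; the embedding on the example slides picks the first.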

23 Assembling local embeddings
Input: C(A), a set of local embeddings for each A in the source DTD (initialized via randomized findPath); a target schema S2
Output: σ = (λ, path), a schema embedding from S1 to S2, if any
Theorem: The assemble-embedding problem is NP-complete even when S1 and S2 are nonrecursive.
Conflict: type mapping, prefix free
Three heuristic algorithms:
1. Fix an order O on S1 types via qual( ), pick a local embedding σ_A from C(A) in O, and increment σ with σ_A if no conflict
2. Assume a random order O on S1 types, then do the same as (1)
3. Reduction to the MAX-Weight-Independence-Set problem, leveraging an existing tool for that problem.

24 Experimental evaluation
Benchmark
– XMark (99 type nodes in its original form)
– Real-life DTDs: SIGMOD (13), PSD (121), mondial (70), etc.
– Generating target schemas by adding noise: changing edges to paths, mutating names, inserting new subtrees
Selectivity/accuracy of att( ): [0, 1] (1.0: exact match)
Target schemas with 75% noise: XMark ( ), SIGMOD (54-96), PSD ( ), mondial ( )
System
– 933 MHz / 1.0 GHz Pentium III, 256 MB memory
– QUALEX: a tool for MAX-Weight-Independence-Set
– Algorithms implemented in Java

25 Experimental result – target size
XMark (acc 0.75). RandomOrder and MAXSet-Reduction perform well

26 Experimental result – running time required
XMark (acc 0.75). Running time is in seconds for schemas of hundreds of nodes

27 Experimental result – different source schemas
Various source schemas (acc 0.75). RandomOrder finds solutions more than 90% of the time, in seconds

28 Summary
Information preservation: the first study for XML mappings
– more intriguing than its relational counterparts: separation, equivalence, complexity of invertibility and query preservation
– important for data exchange, migration, integration, P2P, …
Schema embedding:
– mapping edges to paths
– captures various DTD constructs, supports restructuring
– automatically guarantees information preservation
– accommodates multiple sources to a single target
– NP-complete, but with efficient and effective heuristics
A practical solution for finding information-preserving XML mappings