Nested Mappings: Schema Mapping Reloaded

Clio
P. Papotti (Università Roma Tre)
M.A. Hernandez, H. Ho, L. Popa (IBM Almaden Research Center)
A. Fuxman, R.J. Miller (University of Toronto)

The Problem of Mapping Generation
Schemas can be arbitrarily different
  E.g., different normalization and naming, missing/extra elements
Input: correspondences between atomic schema elements (automatic discovery)
Schema mappings: logical and declarative expressions of relationships between schemas
  Abstraction for data interoperability tasks
  Simpler than actual implementations of data exchange (SQL/XQuery/XSLT)
Must generate a transformation that:
  Preserves data relationships: pname-dname, pname-ename, etc.
  Creates new target values (pid)
  Produces “correct” groupings
11/30/2018 Nested Mappings: Schema Mapping Reloaded - VLDB'06 - Paolo Papotti

Outline

Schema mapping generation [VLDB’02]
  Fagin, Hernandez, Popa (IBM Almaden), Miller, Velegrakis (Univ. of Toronto)
From basic to nested:
  Issues with basic mappings
  Nested mappings and their advantages
  Generation algorithm
  Performance impact
Conclusion
  Related work
  Future directions

Schema Mapping Generation
[Diagram: source schema S and target schema T linked by schema correspondences; source concepts (relational views) and target concepts linked by mappings]
Step 1. Extraction of “concepts” (in each schema)
  Concept = one category of data that can exist in the schema
Step 2. Mapping generation
  Enumerate all non-redundant maps between pairs of concepts

Example

Source schema:
  proj: Set [ dname, pname, emps: Set [ ename, salary ] ]
Target schema:
  dept: Set [ dname, budget,
              emps: Set [ ename, salary, worksOn: Set [ pid ] ],
              projects: Set [ pid, pname ] ]

Two ‘basic’ mappings (or source-to-target tgds, or GLAV formulas):

m1 maps proj to dept-projects (the concept of “project of a department”):
  m1: ∀(p0 in proj) → ∃(d in dept)(p in d.projects)
        p0.dname = d.dname ∧ p0.pname = p.pname

m2 maps proj-emps to dept-emps-worksOn-projects (the concept of “project of an employee of a department”):
  m2: ∀(p0 in proj)(e0 in p0.emps) → ∃(d in dept)(p in d.projects)(e in d.emps)(w in e.worksOn)
        w.pid = p.pid ∧ p0.dname = d.dname ∧ p0.pname = p.pname ∧ e0.ename = e.ename ∧ e0.salary = e.salary


Issue 1: Many Small Uncorrelated Formulas
[Diagram: source schema proj and target schema dept, connected by the basic mappings m1 and m2]
m1: “for every proj tuple there must be dept and project tuples such that …”
m2: “for every emp of a proj tuple there must be: dept, emp, worksOn, project …”
If we also had dependents under employees, then: “for every dependent of an emp of a proj …”, and so on
There is a lot of common mapping behavior that is repeated
  E.g., m2 repeats the mapping behavior of m1 (although for a “subconcept”)

Issue 2: Redundancy in the Generated Data
Input (proj: Set [ dname, pname, emps: Set [ ename, salary ] ]):
  ( CS, uSearch, { (Alice, 120K), (John, 90K) } )

Possible output (dept: Set [ dname, budget, emps: Set [ ename, salary, worksOn: Set [ pid ] ], projects: Set [ pid, pname ] ]):
  ( CS, B1, { }, { (X1, uSearch) } )                        required to exist based on m1
  ( CS, B2, { (Alice, 120K, {X2}) }, { (X2, uSearch) } )    required to exist based on m2
  ( CS, B3, { (John, 90K, {X3}) }, { (X3, uSearch) } )      required to exist based on m2

m2 repeats the mapping behavior of m1:
  “duplicate” dept and project tuples
  “duplicate” nulls (pid values X2 and X3, and budget values)
Moreover, this duplication happens for each joining emp tuple in the source
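The redundancy can be reproduced with a small Python sketch. It is not Clio's implementation: it simply fires m1 and m2 independently over the slide's input, inventing a fresh labeled null for every firing. The function name chase_basic and the dictionary encoding of nested sets are assumptions made for illustration.

```python
from itertools import count

# Source instance from the slide: one proj tuple with two nested employees.
proj = [{
    "dname": "CS", "pname": "uSearch",
    "emps": [{"ename": "Alice", "salary": "120K"},
             {"ename": "John", "salary": "90K"}],
}]

def chase_basic(proj):
    """Fire m1 and m2 independently; every firing invents its own nulls."""
    ids = count(1)  # numbering for labeled nulls X1/B1, X2/B2, ... (illustrative)
    dept = []
    # m1: one dept tuple (with one project) per proj tuple.
    for p0 in proj:
        n = next(ids)
        dept.append({"dname": p0["dname"], "budget": f"B{n}", "emps": [],
                     "projects": [{"pid": f"X{n}", "pname": p0["pname"]}]})
    # m2: one more dept tuple per (proj, emp) pair, repeating what m1 asserted.
    for p0 in proj:
        for e0 in p0["emps"]:
            n = next(ids)
            pid = f"X{n}"
            dept.append({"dname": p0["dname"], "budget": f"B{n}",
                         "emps": [{"ename": e0["ename"], "salary": e0["salary"],
                                   "worksOn": [pid]}],
                         "projects": [{"pid": pid, "pname": p0["pname"]}]})
    return dept

dept = chase_basic(proj)
for d in dept:
    print(d["dname"], d["budget"], [e["ename"] for e in d["emps"]], d["projects"])
```

One source tuple yields three dept tuples for CS, each with its own budget null and its own copy of the uSearch project.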

Issue 3: No Grouping in the Target
Input (proj: Set [ dname, pname, emps: Set [ ename, salary ] ]):
  ( CS, uSearch, { (Alice, 120K), (John, 90K) } )

Possible output (dept: Set [ dname, budget, emps: Set [ ename, salary, worksOn: Set [ pid ] ], projects: Set [ pid, pname ] ]):
  ( CS, B1, { }, { (X1, uSearch) } )                        required to exist based on m1
  ( CS, B2, { (Alice, 120K, {X2}) }, { (X2, uSearch) } )    required to exist based on m2
  ( CS, B3, { (John, 90K, {X3}) }, { (X3, uSearch) } )      required to exist based on m2

Alice and John are in different singleton sets (E and E’)
There can be as many singleton sets as emp tuples in the source nested set
It is desirable to enforce the grouping on the target data: a single dept tuple for CS whose emps set contains both Alice and John

Summary of issues

Fragmentation of the specification
  (Too) many small tgds
Fragmentation of the data
  Generates redundant data (which later needs to be removed or fused)
  No grouping enforced on the target data (an additional phase is needed to enforce any grouping)

Idea

We would like to reuse (in m2) the “dept” and “project” tuples that the simpler mapping m1 asserts.
  Make m2 assert only the “extra” information
  Also accumulate the corresponding employees into one set
Idea: correlate the mapping formulas based on their common part. This is the main idea.
[Diagram: source schema proj and target schema dept, connected by m1 and m2]

Correlating Mapping Formulas
m1: ∀(p0 in proj) → ∃(d in dept)(p in d.projects)
      p0.dname = d.dname ∧ p0.pname = p.pname

m2: ∀(p0 in proj)(e0 in p0.emps) → ∃(d in dept)(p in d.projects)(e in d.emps)(w in e.worksOn)
      w.pid = p.pid ∧ p0.dname = d.dname ∧ p0.pname = p.pname ∧ e0.ename = e.ename ∧ e0.salary = e.salary

Replace with:

n: ∀(p0 in proj) → ∃(d in dept)(p in d.projects)
     p0.dname = d.dname ∧ p0.pname = p.pname ∧
     [ ∀(e0 in p0.emps) → ∃(e in d.emps)(w in e.worksOn)
         w.pid = p.pid ∧ e0.ename = e.ename ∧ e0.salary = e.salary ]

proj tuples are mapped only once
The bracketed part is a submapping, correlated to the parent mapping
For every proj tuple, we map all employees, as a group (source grouping is preserved)
This is a nested mapping
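A minimal Python sketch of how the nested mapping n could be evaluated on the running example, under the same illustrative encoding of nested sets as dictionaries (the function name apply_nested is an assumption, not part of the original work). The inner loop plays the role of the submapping: it reuses the dept tuple d and the project null pid bound by the outer level instead of inventing new ones.

```python
from itertools import count

# Source instance from the slides: one proj tuple with two nested employees.
proj = [{
    "dname": "CS", "pname": "uSearch",
    "emps": [{"ename": "Alice", "salary": "120K"},
             {"ename": "John", "salary": "90K"}],
}]

def apply_nested(proj):
    """Evaluate n: map each proj tuple once, then its employees as a group."""
    ids = count(1)  # numbering for labeled nulls (illustrative)
    dept = []
    for p0 in proj:
        n = next(ids)
        pid = f"X{n}"
        # Outer level: assert the dept and project tuples exactly once.
        d = {"dname": p0["dname"], "budget": f"B{n}", "emps": [],
             "projects": [{"pid": pid, "pname": p0["pname"]}]}
        # Submapping: correlated to d and pid, so employees accumulate in one set.
        for e0 in p0["emps"]:
            d["emps"].append({"ename": e0["ename"], "salary": e0["salary"],
                              "worksOn": [pid]})
        dept.append(d)
    return dept

dept = apply_nested(proj)
print(dept)
```

The result is a single dept tuple for CS in which Alice and John share one emps set and both point to the same project null.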

Advantages of Nested Mappings
Nested tgds can exploit the natural hierarchy that exists on the concepts of a schema
  E.g., proj-emps is a “subconcept” of proj, in the source schema
  Map the higher concept only once; use submappings for subconcepts
Nested mappings are strictly more expressive:
  There is no set of source-to-target tgds that is equivalent to n.

Nesting Algorithm: Sketch
Step 1. Discovery: construct a DAG of basic mappings based on the concept hierarchy
Step 2. Correlation: construct nested mappings by traversing the DAG, starting from each root, and repeatedly applying the nesting step hinted at before
We get a forest of nested mappings
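The two steps can be sketched in Python under a strong simplifying assumption: a concept is modeled as the set of generator paths it ranges over, and a basic mapping is taken to be nestable inside another when both its source and its target concept extend the other's. The actual nestability test in the paper may involve further conditions; the names BasicMapping, nestable, build_dag, and correlate are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BasicMapping:
    name: str
    source: frozenset  # generator paths of the source concept
    target: frozenset  # generator paths of the target concept

def nestable(child, parent):
    """Assumed test: the child extends the parent's source and target concepts."""
    same = (child.source, child.target) == (parent.source, parent.target)
    return (not same and parent.source <= child.source
            and parent.target <= child.target)

def build_dag(mappings):
    """Step 1 (discovery): edges parent -> child, without transitive edges."""
    edges = {m.name: [] for m in mappings}
    for parent in mappings:
        for child in mappings:
            if not nestable(child, parent):
                continue
            # Drop the edge when another mapping sits between parent and child.
            if any(nestable(mid, parent) and nestable(child, mid) for mid in mappings):
                continue
            edges[parent.name].append(child.name)
    return edges

def correlate(mappings):
    """Step 2 (correlation): nest children under each root, giving a forest."""
    edges = build_dag(mappings)
    def nest(name):
        return (name, [nest(child) for child in edges[name]])
    nested = {c for children in edges.values() for c in children}
    return [nest(m.name) for m in mappings if m.name not in nested]

# The two basic mappings of the running example (m1 and m2).
pdp = BasicMapping("PDP", frozenset({"proj"}),
                   frozenset({"dept", "dept.projects"}))
pedepw = BasicMapping("PEDEPW", frozenset({"proj", "proj.emps"}),
                      frozenset({"dept", "dept.projects", "dept.emps",
                                 "dept.emps.worksOn"}))
forest = correlate([pedepw, pdp])
print(forest)
```

Each tree in the forest is a (mapping, submappings) pair; here PEDEPW ends up as the only submapping of PDP.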

Nesting Algorithm: Example
Source schema:
  proj: Set of [ dname, pname, emps: Set of [ ename, salary ] ]
Target schema:
  dept: Set of [ dname, budget,
                 emps: Set of [ ename, salary, worksOn: Set of [ pid ] ],
                 projects: Set of [ pid, pname ] ]

[Diagram: a DAG of basic mappings over the source concepts P, PE and the target concepts D, DE, DP, DEPW; the basic mappings PDP and PEDEPW are combined below]

Nested mapping:
  for p in proj
  exists d’ in dept, p’ in d’.projects
  where d’.dname=p.dname and p’.pname=p.pname and                            (PDP)
    ( for e in p.emps
      exists e’ in d’.emps, w in e’.worksOn
      where w.pid=p’.pid and e’.ename=e.ename and e’.salary=e.salary )       (PEDEPW)

Experimental evaluation
Goal: show empirically that nested mappings can dramatically:
  reduce the cost of producing a target instance
  improve the quality of the generated data
DBLP-like schema, on both source and target, with four levels of nesting/grouping:
  authors – level 1
  conferences – level 2
  years – level 3
  publications – level 4
Mappings are implemented by generating queries (in XQuery)
  Qbasic: based on basic mappings
  Qnested: based on nested mappings

Example Queries – 2 Levels Only
Qbasic
  Multiple query terms (one per basic mapping)
  Needs re-grouping (over the entire data)
  Generates duplicates

  let $doc0 := fn:doc("instance.xml") return
  <authorDB>
  { for $x0 in $doc0/authorDB/author, $x1 in $x0/conf
    return
    <author>
      <name> { $x0/name/text() } </name>
      { for $x0L1 in $doc0/authorDB/author, $x1L1 in $x0L1/conf
        where $x0/name/text()=$x0L1/name/text()
        return
        <conf> <name> { $x1L1/name/text() } </name> </conf> }
    </author> }
  { for $x0 in $doc0/authorDB/author
    (: remainder of this query term is cut off on the slide :) }
  </authorDB>

Qnested
  Single pass over the data
  No duplicates

  let $doc0 := fn:doc("instance.xml") return
  <authorDB>
  { for $x0 in $doc0/authorDB/author
    return
    <author>
      <name> { $x0/name/text() } </name>
      { for $x1 in $x0/conf
        return
        <conf> <name>{ $x1/name/text() }</name> </conf> }
    </author> }
  </authorDB>
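To see where the duplicates come from, the loop structure of the two queries can be mirrored in Python. This is only an analogy under stated assumptions: the sample authors and conferences are made up, the XML is replaced by dictionaries, and only the first query term of Qbasic is modeled because the second is cut off on the slide.

```python
# Made-up two-level instance: authors (level 1) with conferences (level 2).
author_db = [
    {"name": "A", "confs": ["VLDB", "SIGMOD"]},
    {"name": "B", "confs": ["PODS"]},
]

def q_basic(db):
    """Mirror of the first Qbasic term: one author element per (author, conf) pair."""
    out = []
    for x0 in db:
        for _x1 in x0["confs"]:
            # Re-grouping: rescan the entire data to collect this author's confs.
            confs = [c for y in db if y["name"] == x0["name"] for c in y["confs"]]
            out.append({"name": x0["name"], "confs": confs})
    return out

def q_nested(db):
    """Mirror of Qnested: a single pass, one author element per author."""
    return [{"name": x0["name"], "confs": list(x0["confs"])} for x0 in db]

basic, nested = q_basic(author_db), q_nested(author_db)
print(len(basic), len(nested))
```

Author A, who has two conferences, is emitted twice by q_basic with identical content, while q_nested emits each author once; removing the duplicates from the basic result leaves the nested result.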

Execution time comparison
[Chart: Qbasic execution time / Qnested execution time, logarithmic scale]
Execution time for basic: 22 minutes
Execution time for nested: 1.1 seconds

Output file size comparison

[Chart: Qbasic output file size / Qnested output file size, logarithmic scale]
Size of generated data for basic (including duplicates): 45MB
Size of generated data for nested: 552KB
The nested mapping results in much more efficient execution with less redundant data

Related work

Both embedded mappings [Melnik et al. SIGMOD’05] and HePTox [Bonifati et al. VLDB’05] support nested data, but do not support nesting of mappings.
Nested mappings are less general than the languages used for composition [Fagin et al. PODS’04, Nash et al. PODS’05], but are more compact and easier to understand/program.
The generation algorithm identifies common expressions within mappings, in the same spirit as work in query optimization [e.g., Roy et al. SIGMOD’00]. But query optimization preserves query equivalence, while our techniques lead to mappings with better semantics (they do not preserve equivalence).
There are already commercial tools that use similar paradigms (e.g., IBM Ascential DataStage TX), but most of the mapping generation work is manual.

Conclusion

Nested tgds: a better specification language for transformation
  Use correlation (hierarchy) between concepts
  Less redundancy in the output, more efficient
  Naturally preserve source grouping
  For more complex mappings we expose Skolem functions to let users alter the default grouping behavior
Nested tgds are more compact and easier to understand/program
  Humans think top-down: map top concepts, then submappings, etc.
  Can be generated too!

Future Directions

Extend existing solutions to use nested mappings
  Data integration, mapping analysis and reasoning, schema evolution, etc.
  Nested tgds are more complex as a logic formalism!
Study the formal foundation of nested mappings
More generally, develop methods for deciding when and why one schema mapping specification is “better” than another
  Need to look at issues such as:
    preservation of the source data (associations, correlations, etc.)
    minimization of incompleteness
