Generic Schema Matching using Cupid
Jayant Madhavan (University of Washington), Philip A. Bernstein (Microsoft Research), Erhard Rahm (University of Leipzig)

Presentation transcript:

September 11th 2001, VLDB 2001, Roma, Italy

Schema Matching
[Figure: two purchase-order schema trees. The first, PO, contains POLines, Item, Line, Qty, UoM, POShipTo, City and Street. The second, PurchaseOrder, contains Items, Item, ItemNumber, Quantity, UnitofMeasure, DeliverTo, Address, City, Street and Name. Matched pairs: POShipTo and DeliverTo, Line and ItemNumber, Qty and Quantity, UoM and UnitofMeasure.]

The Problem
Given two schemas, obtain a mapping between them that identifies corresponding elements
A hard problem
– Naming and structural differences in schemas
– Similar, but non-identical concepts modeled
– Multiple data models: SQL DDL, XML, ODMG…
– Minimize user involvement (semi-automatic)
– Data model independent matching (generic)

Motivation
Important component in many applications
– Data Integration
– Data Migration
– E-Commerce
Model Management [Bernstein, Halevy, Pottinger ’00]
– Algebra for manipulating models and mappings
– Match, Merge, Compose …

Schema Matching Approaches
Taxonomy based on the survey [Rahm, Bernstein ’00]
Individual matchers
– Schema-based, per-element: linguistic (names, descriptions); constraint-based (types, keys)
– Schema-based, structural: constraint-based (graph matching)
– Content-based: linguistic (IR: word frequencies, key terms); constraint-based (value pattern and ranges)
Combined matchers
– Hybrid
– Composite: manual composition, automatic composition

Related Work
Hybrid approaches for schema integration
– DIKE [Palopoli, Sacca, Ursino, Terracina]
– MOMIS [Bergamaschi, Castano, Vincini]
Linguistic and Instance based
– SEMINT, DELTA [Clifton, Hausman, Rosenthal, Li]
Instance based Multi-strategy learning
– LSD [Doan, Domingos, Halevy]
Others
– Hybrid rule based: TranScm [Milo, Zohar]
– Query Discovery: CLIO [Haas, Hernandez, Miller]

Contributions
Taxonomy of schema matching approaches
Cupid system that exploits linguistic, data-type, structure and referential integrity information
– New algorithm that exploits schema structure
Experimental validation and comparison with other systems

Cupid architecture
[Figure: Schema 1 and Schema 2 enter Linguistic Matching, which uses a Thesaurus and outputs LSIM; Structure Matching outputs SSIM and WSIM; Generate Mapping produces the Output Mapping.]

Linguistic Matching
Heuristic name matching
– Tokenization of names: POOrderNum → PO, Order, Num
– Expansion of short-forms, acronyms: PO → Purchase, Order; Num → Number
– Clustering of schema elements based on keywords and data-types: Street, City, POAddress → Address
– Thesaurus of synonyms, hypernyms, acronyms
– Linguistic similarity coefficient (lsim) ∈ [0,1]
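
The tokenization and expansion steps above can be sketched in Python. This is only an illustration under stated assumptions: the abbreviation table, the synonym pairs, the 0.8 synonym score and the token-overlap formula are all hypothetical stand-ins, not Cupid's actual thesaurus or its lsim definition.

```python
import re

# Hypothetical abbreviation table; a real system would read this from a thesaurus.
ABBREVIATIONS = {
    "po": ["purchase", "order"],
    "num": ["number"],
    "qty": ["quantity"],
    "uom": ["unit", "of", "measure"],
}

# Hypothetical synonym pairs.
SYNONYMS = {frozenset({"bill", "invoice"}), frozenset({"ship", "deliver"})}


def tokenize(name):
    """Split an element name into lower-case tokens on case changes and digits."""
    pattern = r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+"
    return [tok.lower() for tok in re.findall(pattern, name)]


def expand(name):
    """Tokenize a name and replace known short forms by their expansions."""
    whole = name.lower()
    if whole in ABBREVIATIONS:  # the whole name is itself a short form, e.g. UoM
        return list(ABBREVIATIONS[whole])
    tokens = []
    for tok in tokenize(name):
        tokens.extend(ABBREVIATIONS.get(tok, [tok]))
    return tokens


def token_sim(a, b):
    """Similarity of two tokens: identical, thesaurus synonym, or unrelated."""
    if a == b:
        return 1.0
    if frozenset({a, b}) in SYNONYMS:
        return 0.8  # assumed score for a synonym entry
    return 0.0


def lsim(name1, name2):
    """Average best-match token similarity in both directions, in [0, 1]."""
    t1, t2 = expand(name1), expand(name2)
    if not t1 or not t2:
        return 0.0
    best1 = sum(max(token_sim(a, b) for b in t2) for a in t1)
    best2 = sum(max(token_sim(a, b) for a in t1) for b in t2)
    return (best1 + best2) / (len(t1) + len(t2))


print(tokenize("POOrderNum"))           # tokens of the slide's example name
print(lsim("POBillTo", "InvoiceTo"))    # partial match through the bill/invoice synonym
```

Averaging in both directions keeps the score symmetric, so a long name that merely contains a short one does not score 1.0.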

Structure Matching
[Figure: the PO and PurchaseOrder schema trees, with candidate pairs Qty and Quantity, UoM and UnitofMeasure, Line and ItemNumber, POShipTo and DeliverTo, and the City, Street and Name leaves.]

Structure Match: Mutually Reinforcing Similarity
[Figure: the PO and PurchaseOrder schema trees. Where Wsim > th_high, Ssim is increased (Ssim++). Pairs shown: Qty and Quantity, UoM and UnitofMeasure, Line and ItemNum, POLines and Items, PO and PurchaseOrder.]

Structure Match: Context dependent disambiguation
[Figure: PO with POShipTo and POBillTo (Street, City) against PurchaseOrder with DeliverTo and InvoiceTo (Address with Street, City). The City leaves receive Ssim++ or Ssim-- depending on their context. Pairs shown: POShipTo and DeliverTo, POBillTo and InvoiceTo, POShipTo and InvoiceTo.]

Intuition
Atomic elements are similar if
– Linguistically and data-type similar
– Their ancestors are similar
Compound elements (non-leaf) are similar if
– Linguistically similar
– Subtrees rooted at the elements are similar
Mutually recursive
– Leaves determine internal node similarity
– Similarity of internal nodes leads to increase in leaf similarity

Structure Match details
Subtrees are similar if
– Immediate children are similar
– Leaf sets are similar
Subtree Similarity (nodes s and t)
– Fraction of leaves in subtree s that can be mapped to a leaf in the other subtree t and vice-versa
– Less sensitive to variation in intermediate structure
Pruning the number of comparisons
– Elements must have comparable number of leaves
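
A minimal Python sketch of the leaf-based subtree similarity described above. It makes several assumptions: trees are plain dictionaries from a node to its children, a leaf "can be mapped" when some pair score reaches a hypothetical `threshold`, and "comparable number of leaves" is read as a ratio bound (`max_ratio`). The leaf scores in the example are invented for illustration.

```python
def leaves(tree, node):
    """Return the leaf names below `node`; `tree` maps a node to its children."""
    children = tree.get(node, [])
    if not children:
        return [node]
    found = []
    for child in children:
        found.extend(leaves(tree, child))
    return found


def comparable(n_s, n_t, max_ratio=2.0):
    """Pruning test: skip pairs whose leaf counts differ by more than max_ratio."""
    return max(n_s, n_t) <= max_ratio * min(n_s, n_t)


def subtree_similarity(leaves_s, leaves_t, leaf_sim, threshold=0.5):
    """Fraction of leaves on both sides that map to some leaf on the other side."""
    if not leaves_s or not leaves_t:
        return 0.0
    matched_s = sum(
        any(leaf_sim.get((x, y), 0.0) >= threshold for y in leaves_t)
        for x in leaves_s
    )
    matched_t = sum(
        any(leaf_sim.get((x, y), 0.0) >= threshold for x in leaves_s)
        for y in leaves_t
    )
    return (matched_s + matched_t) / (len(leaves_s) + len(leaves_t))


# Fragments of the two purchase-order schemas; the scores are made up.
source = {"POLines": ["Item"], "Item": ["Line", "Qty", "UoM"]}
target = {"Items": ["Item"], "Item": ["ItemNumber", "Quantity", "UnitofMeasure"]}
leaf_sim = {
    ("Qty", "Quantity"): 0.9,
    ("UoM", "UnitofMeasure"): 0.9,
    ("Line", "ItemNumber"): 0.2,
}

source_leaves = leaves(source, "POLines")
target_leaves = leaves(target, "Items")
score = subtree_similarity(source_leaves, target_leaves, leaf_sim)
print(source_leaves, target_leaves, score)
```

Because only leaf sets are compared, inserting or removing an intermediate node such as Item changes nothing in the score, which is the insensitivity to intermediate structure that the slide mentions.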

Referential Integrity
Join nodes added to the schema tree for each referential integrity constraint
Views can be similarly used
[Figure: Schema A contains Purchase Order (Product Name, Order ID, Customer ID) and Customer (Customer ID, Name, Address), plus the join node Order-Customer-fk. Schema B contains Customer-Purchase-Order.]
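
One way to picture the join nodes is the Python sketch below. It assumes a schema is a dictionary from table name to column list, names each join node after the slide's Order-Customer-fk example, and gives the join node the columns of both tables as children. The function name `add_join_nodes` and the shortened table name Order are hypothetical.

```python
def add_join_nodes(tables, foreign_keys):
    """Return a schema tree with one extra join node per foreign key.

    tables: dict mapping a table name to its list of columns.
    foreign_keys: list of (referencing_table, referenced_table) pairs.
    """
    tree = {name: list(columns) for name, columns in tables.items()}
    for source, target in foreign_keys:
        join_name = f"{source}-{target}-fk"
        merged = list(tables[source])
        # Shared columns (the join key) appear only once under the join node.
        merged += [col for col in tables[target] if col not in merged]
        tree[join_name] = merged
    return tree


schema_a = {
    "Order": ["Order ID", "Customer ID", "Product Name"],
    "Customer": ["Customer ID", "Name", "Address"],
}
augmented = add_join_nodes(schema_a, [("Order", "Customer")])
print(augmented["Order-Customer-fk"])
```

With the join node in place, a denormalized element such as Customer-Purchase-Order in Schema B has a structurally similar counterpart to match against.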

Cupid architecture
[Figure: Schema 1 and Schema 2 enter Linguistic Matching, which uses a Thesaurus and outputs linguistic similarity (Lsim); Structure Matching outputs structural (Ssim) and weighted (Wsim) similarity; Generate Mapping produces the Output Mapping.]
Linguistic Similarity (Lsim): InvoiceTo and BillTo 0.7; UoM and UnitMeasure 0.9; City 1.0
Ssim, Wsim: InvoiceTo and BillTo; UoM and UnitMeasure; InvoiceTo/City and BillTo/City 0.8, 0.9

Mapping Generation
Individual mapping elements computed from Wsim values
– Consider only mapping pairs that have Wsim greater than threshold
– For each element of target find most similar source element
– Mappings that are not accepted but have high similarity are returned to help the user modify the map
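
The selection rule above can be sketched in Python as follows. The function name `generate_mapping`, the default threshold of 0.5 and the Wsim values in the example are assumptions made for illustration.

```python
def generate_mapping(wsim, threshold=0.5):
    """Pick the best source element for each target element.

    wsim: dict mapping (source, target) pairs to weighted similarity.
    Returns (accepted, rejected): `accepted` maps each target to its best
    source; `rejected` lists the other above-threshold pairs, best first,
    so a user can inspect them and adjust the map.
    """
    best = {}
    for (source, target), score in wsim.items():
        if score <= threshold:
            continue  # only pairs above the threshold are considered
        if target not in best or score > best[target][1]:
            best[target] = (source, score)
    accepted = {target: source for target, (source, _) in best.items()}
    rejected = sorted(
        (
            (source, target, score)
            for (source, target), score in wsim.items()
            if score > threshold and accepted.get(target) != source
        ),
        key=lambda item: -item[2],
    )
    return accepted, rejected


# Illustrative Wsim values, not taken from the talk.
wsim = {
    ("POShipTo", "DeliverTo"): 0.8,
    ("POBillTo", "DeliverTo"): 0.6,
    ("POBillTo", "InvoiceTo"): 0.7,
    ("POShipTo", "InvoiceTo"): 0.3,
}
accepted, rejected = generate_mapping(wsim)
print(accepted)
print(rejected)
```

Selecting per target element means two targets may share a source, so the result is not necessarily one-to-one.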

Cupid Architecture
[Figure: Schema 1 and Schema 2 enter Linguistic Matching (with Thesaurus), which outputs Lsim; Structure Matching outputs Ssim and Wsim; Generate Mapping produces the Output Mapping. An Input hint is also shown.]

Experimental Validation
DIKE
– Graph Matching of ER models
– No Lsim component (LSPD entries)
MOMIS
– Class Level Matching of OO descriptions
– Word senses manually chosen from WordNet
Systems compared: MOMIS, DIKE, Cupid, on Canonical Examples and Real World Examples

Evaluation Insights: Linguistic Similarity
Cupid is less sensitive to name variations due to token level manipulations
MOMIS is able to infer linguistic relationships based on intra-schema properties using Description Logic techniques
MOMIS has an interface to WordNet
– Word senses need to be chosen manually
– Choosing a single sense is not always possible
Matching performance without thesaurus depends on similarity of terms used and on available structure (tokenization helps Cupid)

Evaluation Insights: Structural Similarity
DIKE and Cupid exploit structural similarity beyond the immediate neighborhood of schema elements
Leaf structure for sub-tree similarity relaxes requirements on intermediate structure match
Class-level structural similarity in MOMIS can be restrictive while matching schemas with different nesting
Context-dependent matching in Cupid resolves mapping ambiguity
Linguistic similarity with complete path names (and no structural similarity) is insufficient

Summary
Taxonomy of schema matching approaches
Cupid system that performs linguistic and structural matching
New algorithm for exploiting schema structure
Comparative evaluation

Future Work
Towards a more robust solution
– Auto-tuning parameters
– Thesaurus Generation and Evolution
– More scalability testing
Schema matching component architecture
– Easily extensible by adding multiple techniques
– Data Instances for matching
– Mapping, Expression and Query Discovery
Model Management

Model Management
Other recent publications
– A Model Theory for Generic Schema Management, DBPL 2001
– Generic Model Management: A Database Infrastructure for Schema Manipulation, CoopIS 2001
– A Vision for Management of Complex Models, Sigmod Record, Dec 2000
– Data Warehouse Scenarios for Model Management, ER 2000
More information
– MSR Technical Report
Talk to us for a demo

End of the talk

Schema Matching
For each Lines create Items
For each Item create Item
ItemNumber = concat(“Itm”, Line)
Price = “Unknown”
Quantity = Pounds2Kgs(Qty)
Count = Number of Item in Lines
[Figure: PO (Lines, Item, Line, Qty, Unit) mapped to PurchaseOrder (Items, Item, ItemNumber, Quantity, Price, Count), annotated with ItemNumber=concat(“Itm”,Line) and Quantity=Pounds2Kgs(Qty).]
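
The mapping expressions on this slide can be read as a small transformation. The Python sketch below applies them to one document, assuming a PO is a dictionary with a `Lines` list and that Count sits beside the items under `Items`; the nesting, the function names and the sample data are assumptions made for illustration.

```python
POUNDS_TO_KG = 0.45359237  # kilograms per international pound


def pounds_to_kgs(pounds):
    """Stand-in for the slide's Pounds2Kgs conversion function."""
    return pounds * POUNDS_TO_KG


def translate_po(po):
    """Build a PurchaseOrder document from a PO document."""
    items = [
        {
            "ItemNumber": "Itm" + str(line["Line"]),  # concat("Itm", Line)
            "Price": "Unknown",                       # no source element for Price
            "Quantity": pounds_to_kgs(line["Qty"]),   # Pounds2Kgs(Qty)
        }
        for line in po["Lines"]
    ]
    # Count = number of Item elements in Lines
    return {"Items": {"Item": items, "Count": len(items)}}


po = {"Lines": [{"Line": 1, "Qty": 10.0, "Unit": "lb"}]}
purchase_order = translate_po(po)
print(purchase_order)
```

The example shows why matching alone is not enough: ItemNumber and Quantity need expressions, Price needs a default, and Count needs an aggregate.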

Tree Match
Tree Match (Schema tree S, Schema tree T)
  For each pair of leaves, initialize ssim to be their data-type compatibility
  For each s in S (post order)
    For each t in T (post order)
      Compute ssim(s,t) = structural-similarity(s,t)
      wsim(s,t) = g(lsim(s,t), ssim(s,t))
      If (wsim(s,t) > th_high)
        Inc-struct-similarity(leaves(s), leaves(t))
      If (wsim(s,t) < th_low)
        Dec-struct-similarity(leaves(s), leaves(t))
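
A runnable Python sketch of the Tree Match pseudocode above, under several assumptions that the slide does not fix: g is taken to be a weighted average controlled by `w_struct`, structural similarity of non-leaf pairs is the fraction of leaves with a link above `th_accept`, and the increase and decrease steps multiply ssim by `c_inc` and `c_dec`. All parameter values and the example inputs are illustrative, and node names are assumed unique within each tree.

```python
def post_order(tree, node):
    """Yield the nodes below and including `node`, children first."""
    for child in tree.get(node, []):
        yield from post_order(tree, child)
    yield node


def leaf_set(tree, node):
    """Leaves of the subtree rooted at `node` (a leaf is its own leaf set)."""
    return [n for n in post_order(tree, node) if not tree.get(n)]


def tree_match(S, s_root, T, t_root, lsim, type_compat, w_struct=0.5,
               th_high=0.6, th_low=0.35, th_accept=0.5, c_inc=1.2, c_dec=0.9):
    s_leaves, t_leaves = leaf_set(S, s_root), leaf_set(T, t_root)
    # Leaf pairs start from their data-type compatibility.
    ssim = {(x, y): type_compat.get((x, y), 0.0)
            for x in s_leaves for y in t_leaves}
    wsim = {}

    def weighted(s, t):
        # Assumed form of g: weighted average of structural and linguistic parts.
        return w_struct * ssim[s, t] + (1 - w_struct) * lsim.get((s, t), 0.0)

    for s in post_order(S, s_root):
        ls = leaf_set(S, s)
        for t in post_order(T, t_root):
            lt = leaf_set(T, t)
            if S.get(s) or T.get(t):  # at least one of the two is not a leaf
                linked_s = sum(any(weighted(x, y) > th_accept for y in lt) for x in ls)
                linked_t = sum(any(weighted(x, y) > th_accept for x in ls) for y in lt)
                ssim[s, t] = (linked_s + linked_t) / (len(ls) + len(lt))
            wsim[s, t] = weighted(s, t)
            if wsim[s, t] > th_high:
                factor = c_inc   # similar contexts reinforce their leaves
            elif wsim[s, t] < th_low:
                factor = c_dec   # dissimilar contexts weaken their leaves
            else:
                continue
            for x in ls:
                for y in lt:
                    ssim[x, y] = min(1.0, ssim[x, y] * factor)

    # Leaf scores are refreshed so they reflect adjustments made by ancestors.
    for x in s_leaves:
        for y in t_leaves:
            wsim[x, y] = weighted(x, y)
    return wsim


S = {"PO": ["Qty", "UoM"]}
T = {"PurchaseOrder": ["Quantity", "UnitofMeasure"]}
lsim = {("Qty", "Quantity"): 1.0, ("UoM", "UnitofMeasure"): 1.0,
        ("PO", "PurchaseOrder"): 1.0}
type_compat = {("Qty", "Quantity"): 0.5, ("UoM", "UnitofMeasure"): 0.5}
result = tree_match(S, "PO", T, "PurchaseOrder", lsim, type_compat)
print(result["PO", "PurchaseOrder"], result["Qty", "Quantity"])
```

Non-leaf scores are left as computed during the single pass; a second traversal would be needed to bring them in line with the final leaf adjustments.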

Tree Match (example)
[Figure: PO schema tree with POLines, Item, Line, Qty, UoM, Count, POBillTo and POShipTo (each with City, Street), shown against PurchaseOrder schema tree with Items, Item, ItemNumber, Quantity, UnitofMeasure, ItemCount, InvoiceTo and DeliverTo (each with Address, City, Street).]

Canonical Examples
Identical schemas: MOMIS Y, DIKE Y, Cupid Y
Attributes with identical names, but different data-types: MOMIS Y, DIKE Y, Cupid Y
Attributes with same data-types, but slightly different names: MOMIS Y, DIKE N, Cupid Y
Different class names, but same attribute names: MOMIS N, DIKE Y, Cupid Y
Different nesting of schema elements: MOMIS N, DIKE Y, Cupid Y
Type substitution: MOMIS N, DIKE Y, Cupid Y

Real world example