Download presentation
Presentation is loading. Please wait.
Published byDenis Nelson Modified over 9 years ago
2
Generic Schema Matching using Cupid Jayant Madhavan University of Washington Philip A. Bernstein Erhard Rahm Microsoft Research University of Leipzig Microsoft Research University of Leipzig
3
September 11th 2001VLDB 2001 Roma Italy2 Schema Matching PO Item POLines Qty Line UoM POShipTo City Street Item PurchaseOrder Items Quantity ItemNumber UnitofMeasure DeliverTo CityStreet Address Name POShipToDeliverTo LineItemNumber Qty UoM Quantity UnitofMeasure POShipToDeliverTo Qty UoM Quantity UnitofMeasure
4
September 11th 2001VLDB 2001 Roma Italy3 Given two schemas obtain a mapping between them that identifies corresponding elementsGiven two schemas obtain a mapping between them that identifies corresponding elements The Problem A hard problemA hard problem –Naming and structural differences in schemas –Similar, but non-identical concepts modeled –Multiple data models – SQL DDL, XML, ODMG… –Minimize user involvement (semi-automatic) –Data model independent matching (generic)
5
September 11th 2001VLDB 2001 Roma Italy4 Motivation Important component in many applicationsImportant component in many applications –Data Integration –Data Migration –E-Commerce Model Management [Bernstein, Halevy, Pottinger ’00]Model Management [Bernstein, Halevy, Pottinger ’00] –Algebra for manipulating models and mappings –Match, Merge, Compose …
6
September 11th 2001VLDB 2001 Roma Italy5 Schema Matching Approaches Individual matchers Schema-basedContent-based Graph matching Linguistic Constraint- based Types Keys Value pattern and ranges Constraint -based Linguistic IR (word frequencies, key terms) Constraint- based Names Descriptions StructuralPer-Element Combined matchers automatic composition Composite manual composition Hybrid Taxonomy based survey [Rahm,Bernstein’00]
7
September 11th 2001VLDB 2001 Roma Italy6 Related Work Hybrid approaches for schema integrationHybrid approaches for schema integration –DIKE [Palopoli, Sacca, Ursino, Terracina] –MOMIS [Bergamaschi, Castano, Vincini] Linguistic and Instance basedLinguistic and Instance based –SEMINT, DELTA [Clifton, Hausman, Rosenthal, Li] Instance based Multi-strategy learningInstance based Multi-strategy learning –LSD [Doan, Domingos, Halevy] OthersOthers –Hybrid rule based - Transcm [Milo, Zohar] –Query Discovery - CLIO [Haas, Hernandez, Miller]
8
September 11th 2001VLDB 2001 Roma Italy7 Contributions Taxonomy of schema matching approachesTaxonomy of schema matching approaches Cupid system that exploits linguistic, data- type, structure and referential integrity informationCupid system that exploits linguistic, data- type, structure and referential integrity information –New algorithm that exploits schema structure Experimental validation and comparison with other systemsExperimental validation and comparison with other systems
9
September 11th 2001VLDB 2001 Roma Italy8 Cupid architecture Schema 1 Schema 2 Structure Matching Generate Mapping Output Mapping Linguistic Matching Thesaurus LSIM SSIM WSIM
10
September 11th 2001VLDB 2001 Roma Italy9 Linguistic Matching –Tokenization of names POOrderNum PO, Order, Num –Expansion of short-forms, acronyms PO Purchase, Order; Num Number –Clustering of schema elements based on keywords and data-types Street, City, POAddress Address –Thesaurus of synonyms, hypernyms, acronyms –Linguistic Similarity coefficient (lsim) [0,1] Heuristic name matchingHeuristic name matching
11
September 11th 2001VLDB 2001 Roma Italy10 Structure Matching PO Item POLines Qty Line UoM City Street Item PurchaseOrder Items Quantity ItemNumber UnitofMeasure POShipTo DeliverTo CityStreet Address Name Qty UoM Quantity UnitofMeasure Item LineItemNumber POShipTo DeliverTo Name City Street CityStreet Name
12
September 11th 2001VLDB 2001 Roma Italy11 Structure Match Mutually Reinforcing Similarity PO Item POLines Qty Line UoM Item PurchaseOrder Items Quantity ItemNum UnitofMeasure Wsim > th high Ssim ++ Qty UoM Quantity UnitofMeasure Qty UoM Quantity UnitofMeasure Item LineItemNum LineItemNum POLinesItemsPOLinesItems POPurchaseOrder
13
September 11th 2001VLDB 2001 Roma Italy12 Structure Match Context dependent disambiguation PO POShipTo PurchaseOrder InvoiceTo DeliverTo StreetCity Address Street City POBillTo StreetCity Address StreetCity Ssim++ Ssim-- City POShipTo Address POBillTo POShipTo Address POBillTo Address InvoiceToPOBillTo InvoiceToPOBillTo POShipTo InvoiceTo POShipTo InvoiceTo DeliverTo POShipTo
14
September 11th 2001VLDB 2001 Roma Italy13 Intuition Atomic elements are similarAtomic elements are similar –Linguistically and data-type similar –Their ancestors are similar Compound elements (non-leaf) are similar ifCompound elements (non-leaf) are similar if –Linguistically similar –Subtrees rooted at the elements are similar Mutually recursiveMutually recursive –Leaves determine internal node similarity –Similarity of internal nodes leads to increase in leaf similarity
15
September 11th 2001VLDB 2001 Roma Italy14 Structure Match details Subtrees are similar ifSubtrees are similar if –Immediate children are similar –Leaf sets are similar Subtree Similarity (nodes s and t)Subtree Similarity (nodes s and t) –Fraction of leaves in subtree s that can be mapped to a leaf in the other subtree t and vice-versa –Less sensitive to variation in intermediate structure Pruning the number of comparisonsPruning the number of comparisons –Elements must have comparable number of leaves
16
September 11th 2001VLDB 2001 Roma Italy15 Order-Customer-fk Referential Integrity Join nodes added to the schema tree for each referential integrity constraintJoin nodes added to the schema tree for each referential integrity constraint Views can be similarly usedViews can be similarly used Purchase Order Product Name Order ID Customer ID Customer Customer ID Name Address Order-Customer-fk Schema A Customer-Purchase-Order Schema B
17
September 11th 2001VLDB 2001 Roma Italy16 Cupid architecture Schema 1 Schema 2 Structure Matching Lsim Generate Mapping Output Mapping Linguistic Matching Thesaurus Structural(Ssim), Weighted(Wsim) similarity InvoiceToBillTo0.7 UoM UnitMeasur e 0.9 City 1.0 Linguistic Similarity (Lsim) Ssim,Wsim InvoiceToBillTo0.80.7 UoMUnitMeasur e 0.70.8 InvoiceTo/CityBillTo/City0.80.9
18
September 11th 2001VLDB 2001 Roma Italy17 Mapping Generation Individual mapping elements computed from Wsim valuesIndividual mapping elements computed from Wsim values –Consider only mapping pairs that have Wsim greater than threshold –For each element of target find most similar source element –Not accepted mappings with high similarity are returned in order to help user modify map
19
September 11th 2001VLDB 2001 Roma Italy18 Cupid Architecture Schema 1 Schema 2 Structure Matching Lsim Generate Mapping Output Mapping Linguistic Matching Thesaurus Ssim,Wsim Input hint
20
September 11th 2001VLDB 2001 Roma Italy19 Experimental Validation DIKEDIKE –Graph Matching of ER models –No Lsim component (LSPD entries) MOMISMOMIS –Class Level Matching of OO descriptions –Word senses manually chosen from WordNet MOMISDIKECupid Canonical Examples Real World Examples
21
September 11th 2001VLDB 2001 Roma Italy20 Evaluation Insights Linguistic Similarity Cupid is less sensitive to name variations due to token level manipulationsCupid is less sensitive to name variations due to token level manipulations MOMIS is able to infer linguistic relationships based on intra-schema properties using Description Logic techniquesMOMIS is able to infer linguistic relationships based on intra-schema properties using Description Logic techniques MOMIS has a interface to WordNetMOMIS has a interface to WordNet –Word senses need to be chosen manually –Choosing a single sense is not always possible Matching performance without thesaurus depends on similarity of terms used and on available structure (tokenization helps Cupid)Matching performance without thesaurus depends on similarity of terms used and on available structure (tokenization helps Cupid)
22
September 11th 2001VLDB 2001 Roma Italy21 Evaluation Insights Structural Similarity DIKE and Cupid exploit structural similarity beyond the immediate neighborhood of schema elementsDIKE and Cupid exploit structural similarity beyond the immediate neighborhood of schema elements Leaf structure for sub-tree similarity relaxes requirements on intermediate structure matchLeaf structure for sub-tree similarity relaxes requirements on intermediate structure match Class-level structural similarity in MOMIS can be restrictive while matching schemas with different nestingClass-level structural similarity in MOMIS can be restrictive while matching schemas with different nesting Context-dependent matching in Cupid resolves mapping ambiguityContext-dependent matching in Cupid resolves mapping ambiguity Linguistic similarity with complete path names (and no structural similarity) is insufficientLinguistic similarity with complete path names (and no structural similarity) is insufficient
23
September 11th 2001VLDB 2001 Roma Italy22 Summary Taxonomy of schema matching approachesTaxonomy of schema matching approaches Cupid system that performs linguistic and structural matchingCupid system that performs linguistic and structural matching New algorithm for exploiting schema structureNew algorithm for exploiting schema structure Comparative evaluationComparative evaluation
24
September 11th 2001VLDB 2001 Roma Italy23 Future Work Towards a more robust solutionTowards a more robust solution –Auto-tuning parameters –Thesaurus Generation and Evolution –More scalability testing Schema matching component architectureSchema matching component architecture –Easily extensible by adding multiple techniques –Data Instances for matching –Mapping, Expression and Query Discovery Model ManagementModel Management
25
September 11th 2001VLDB 2001 Roma Italy24 Model Management Other recent publicationsOther recent publications –A Model Theory for Generic Schema Management, DBPL 2001 –Generic Model Management – A Database Infrastructure for Schema Manipulation, CoopIS 2001 –A Vision for Management of Complex Models, Sigmod Record, Dec 2000 –Data Warehouse Scenarios for Model Management, ER 2000 More informationMore information –http://data.cs.washington.edu/model/ –http://www.cs.washington.edu/homes/jayant –MSR Technical Report Talk to us for a demoTalk to us for a demo
26
September 11th 2001VLDB 2001 Roma Italy25 End of the talk
27
September 11th 2001VLDB 2001 Roma Italy26 Schema Matching Fo r each Lines create Items For each Item create Item ItemNumber = concat(“Itm”, Line) Price = “Unknown” Quantity = Pounds2Kgs(Qty) Count = Number of Item in Lines PO Item Lines QtyLineUnit PurchaseOrder Item Items QuantityItemNumberPrice Count ItemNumber=concat(“Itm”,Line) Quantity=Pounds2Kgs(Qty)
28
September 11th 2001VLDB 2001 Roma Italy27 Tree Match For each pair of leaves initialize ssim to be their data- type compatibility For each s in S (post order) For each t in T(post order) Compute ssim(s,t) = structural-similarity(s,t) wsim(s,t) = g(lsim(s,t), ssim(s,t)) If (wsim(s,t) > th high ) Inc-struct-similarity(leaves(s), leaves(t)) If (wsim(s,t) < th low ) Dec-struct-similarity(leaves(s), leaves(t) Tree Match (Schema tree S, Schema tree T)
29
September 11th 2001VLDB 2001 Roma Italy28 Tree Match (example) POShipTo PO Item POLines Qty Line UoM POBillTo Count City Street City Street Item PurchaseOrder Items Quantity ItemNumber UnitofMeasure InvoiceToDeliverTo ItemCount CityStreet Address CityStreet Address
30
September 11th 2001VLDB 2001 Roma Italy29 Canonical Examples MOMISDIKECupid Identical schemas YYY Attributes with identical names, but different data-types YYY Attributes with same data- types, but slightly different names YNY Different class names, but same attribute names NYY Different nesting of schema elements NYY Type substitution NYY
31
September 11th 2001VLDB 2001 Roma Italy30 Real world example
Similar presentations
© 2024 SlidePlayer.com. Inc.
All rights reserved.