Published by Justina Joseph. Modified over 9 years ago.
1
CSE 636 Data Integration
Schema Matching
Cupid
Fall 2006
2
Virtual Integration Architecture
[Diagram labels: End User, Query, Result, Mediator, Global Schema, Wrapper, Local Schema, Data Source, XML, Web Services]
Design-Time
  – Mediation Language
  – Schema Matching
Run-Time
  – Query Reformulation
  – Optimization & Execution
3
Schema Heterogeneity
Independently created schemas might be modeling similar information in slightly different ways.
[Diagram: three schema trees; element labels as extracted, nesting not recoverable]
  – DB1: ugrad *, name, ugradID, enrollment *, courseID, ugradID, grade
  – DB2: student *, studentID, name, type, course *, courseID, title, evaluation
  – DB3: student *, studentID, name, type, course *, courseID, type, letter, title ?
4
Schema Heterogeneity
  – Similar entities represented
  – Dissimilar structures (inverted nesting)
  – Different element names for similar data values
  – Similar element names for different data values
[Diagram: the DB1, DB2, and DB3 schema trees from the previous slide]
5
Schema Matching vs. Schema Mapping
GAV and LAV are schema mapping languages
Mappings:
  – set of queries
  – associations + semantics
Match:
  – set of associations only
Schema Matching:
  – Identifying associations
  – First step towards constructing mappings
6
Schema Matching vs. Schema Mapping
[Diagram labels: Associations, Semantics; schema trees DB1 and DB3]
for $s1 in DB3/student
where $s1/type = 'UGRAD'
return {$s1/studentID} {$s1/name}
LAV Mapping: DB1 Q(DB3)
7
The Problem of Schema Matching
Input
  – Schemas S1 and S2
  – Possibly data instances for S1 and S2
  – Background knowledge: thesauri, validated matches, standard schemas, reference instances, ontologies, constraints (keys, data types, etc.)
Output
  – Associations between S1 and S2
Goal
  – Schema matching tools with significant automated support
8
Schema Matching
How is the match result expressed?
  – Pairs of paths
  – Lists of paths
  – Schema names
[Diagram: the DB3 and DB2 schema trees]
9
Schema Matching
What do we match? It depends on the queries we want to ask.
  1. Elements in isolation (leaves in particular)
  2. Substructures
  3. Whole schemas
10
Motivation
Important component in many applications
  – Data Integration
  – Data Migration
  – E-Commerce
Model Management [Bernstein, Halevy, Pottinger '00]
  – Algebra for manipulating models and mappings
  – Match, Merge, Compose …
11
Problems
Minimize user involvement (semi-automatic)
Data model independent matching (generic)
Schema matching is a hard problem
  – Naming and structural differences in schemas
  – Similar, but non-identical concepts modeled
  – Multiple data models: SQL DDL, XML, ODMG…
12
Schema Matching Approaches
How to match?
Individual matchers
  – Schema-based
      Per-Element
        Linguistic: Names, Descriptions
        Constraint-based: Types, Keys
      Structural
        Constraint-based: Graph matching
  – Content-based
      Per-Element
        Linguistic: IR (word frequencies, key terms)
        Constraint-based: Value pattern and ranges
Combined matchers
  – Hybrid
  – Composite: manual composition, automatic composition
Taxonomy-based survey: Rahm and Bernstein, VLDB J, 2001
13
Cupid
[Diagram: the matcher taxonomy from the previous slide (individual matchers: schema-based and content-based, per-element and structural; combined matchers: hybrid and composite), shown for Cupid]
Madhavan, Bernstein and Rahm, VLDB, 2001
14
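Cupid Example
[Diagram: two schema trees; element labels as extracted]
  – First schema: PO, POLines, Item, Line, Qty, UoM, POShipTo, City, Street
  – Second schema: PurchaseOrder, Items, Item, ItemNumber, Quantity, UnitofMeasure, DeliverTo, Address, City, Street, Name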
15
Cupid Architecture
[Diagram labels: Schema 1, Schema 2, Thesaurus, Linguistic Matching (LSIM), Structure Matching (SSIM, WSIM), Generate Mapping, Output Mapping]
16
Linguistic Matching
Heuristic name matching
  – Tokenization of names
      POOrderNum -> PO, Order, Num
  – Expansion of short-forms, acronyms
      PO -> Purchase, Order; Num -> Number
  – Clustering of schema elements based on keywords and data-types
      Street, City, POAddress -> Address
  – Thesaurus of synonyms, hypernyms, acronyms
  – Linguistic Similarity coefficient (LSIM) in [0,1]
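The tokenization and expansion steps above can be sketched in Python. This is a minimal illustration, not Cupid's implementation: the `ABBREVIATIONS` table and the helpers `tokenize`, `normalize`, and `name_similarity` are hypothetical names, and the Jaccard token overlap is only a stand-in for LSIM, which in Cupid also draws on a thesaurus and data types.

```python
import re

# Hypothetical short-form table (an assumption for this sketch);
# Cupid consults a thesaurus for this step.
ABBREVIATIONS = {
    "po": ["purchase", "order"],
    "num": ["number"],
    "qty": ["quantity"],
    "uom": ["unit", "of", "measure"],
}

def tokenize(name):
    """Split a schema name into lowercase tokens at case changes and digits."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    return [p.lower() for p in parts]

def normalize(name):
    """Tokenize a name and expand any known short forms."""
    whole = name.lower()
    if whole in ABBREVIATIONS:  # acronyms such as "UoM" are looked up whole
        return list(ABBREVIATIONS[whole])
    tokens = []
    for tok in tokenize(name):
        tokens.extend(ABBREVIATIONS.get(tok, [tok]))
    return tokens

def name_similarity(a, b):
    """Jaccard overlap of the normalized token sets, a value in [0, 1]."""
    ta, tb = set(normalize(a)), set(normalize(b))
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

print(tokenize("POOrderNum"))
print(normalize("POOrderNum"))
print(name_similarity("Qty", "Quantity"))
```

A set-overlap score is the simplest choice that stays in [0, 1]; it ignores synonyms and hypernyms, which is where a thesaurus would come in.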
17
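Structure Matching
[Diagram: the two schema trees from the Cupid Example slide; element labels as extracted]
  – First schema: PO, POLines, Item, Line, Qty, UoM, POShipTo, City, Street
  – Second schema: PurchaseOrder, Items, Item, ItemNumber, Quantity, UnitofMeasure, DeliverTo, Address, City, Street, Name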
18
Structure Matching
Mutually Reinforcing Similarity
WSIM > th_high: SSIM++
[Diagram labels: PO, POLines, Item, Line, Qty, UoM; PurchaseOrder, Items, Item, ItemNum, Quantity, UnitofMeasure]
19
Structure Matching
Context Dependent Disambiguation
[Diagram labels: PO, POShipTo, POBillTo, Street, City; PurchaseOrder, DeliverTo, InvoiceTo, Address, Street, City; links marked SSIM++ and SSIM--]
20
Intuition
Atomic elements (leaves) are similar if
  – They are linguistically and data-type similar
  – Their ancestors are similar
Compound elements (non-leaf) are similar if
  – They are linguistically similar
  – The subtrees rooted at the elements are similar
Mutually recursive
  – Leaves determine internal node similarity
  – Similarity of internal nodes leads to an increase in leaf similarity
21
Structure Match Details
Subtrees are similar if
  – Immediate children are similar
  – Leaf sets are similar
Subtree Similarity (nodes s and t)
  – Fraction of leaves in subtree s that can be mapped to a leaf in the other subtree t, and vice versa
  – Less sensitive to variation in intermediate structure
Pruning the number of comparisons
  – Elements must have a comparable number of leaves
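The leaf-based subtree similarity described above can be sketched as follows. This is an illustration under stated assumptions rather than Cupid's code: schema trees are nested dicts, `wsim` is a dict of leaf-pair scores supplied by the caller, and the names `leaves`, `comparable`, `structural_similarity`, the threshold `th_accept`, and the ratio `max_ratio` are all hypothetical.

```python
def leaves(tree, prefix=""):
    """Return the paths of all leaves in a nested-dict schema tree."""
    out = []
    for name, children in tree.items():
        path = f"{prefix}/{name}" if prefix else name
        if children:
            out.extend(leaves(children, path))
        else:
            out.append(path)
    return out

def comparable(s_leaves, t_leaves, max_ratio=2.0):
    """Pruning test: compare two subtrees only if their leaf counts are close."""
    small, large = sorted((len(s_leaves), len(t_leaves)))
    return small > 0 and large / small <= max_ratio

def structural_similarity(s_leaves, t_leaves, wsim, th_accept=0.5):
    """Fraction of leaves, over both subtrees, that have at least one
    sufficiently similar leaf in the other subtree."""
    s_hit = sum(
        any(wsim.get((x, y), 0.0) >= th_accept for y in t_leaves)
        for x in s_leaves
    )
    t_hit = sum(
        any(wsim.get((x, y), 0.0) >= th_accept for x in s_leaves)
        for y in t_leaves
    )
    total = len(s_leaves) + len(t_leaves)
    return (s_hit + t_hit) / total if total else 0.0

# Example data (scores are made up for illustration).
s_tree = {"Item": {"Line": {}, "Qty": {}, "UoM": {}}}
t_tree = {"Item": {"ItemNumber": {}, "Quantity": {}, "UnitofMeasure": {}}}
leaf_wsim = {
    ("Item/Qty", "Item/Quantity"): 0.9,
    ("Item/UoM", "Item/UnitofMeasure"): 0.8,
    ("Item/Line", "Item/ItemNumber"): 0.3,
}
s_leaves, t_leaves = leaves(s_tree), leaves(t_tree)
if comparable(s_leaves, t_leaves):
    print(structural_similarity(s_leaves, t_leaves, leaf_wsim))
```

Because only leaf sets are compared, inserting or removing an intermediate node (for example an Address wrapper around Street and City) leaves the score unchanged, which is the insensitivity to intermediate structure noted above.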
22
Referential Integrity
[Diagram: Schema A with Purchase Order (Product Name, Order ID, Customer ID), Customer (Customer ID, Name, Address), and the join node Order-Customer-fk; Schema B with Customer-Purchase-Order]
  – Join nodes are added to the schema tree for each referential integrity constraint
  – Views can be used similarly
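One way to picture the join-node idea is the sketch below. It assumes a join node's children are the columns of both tables involved in the constraint; the function `add_join_nodes`, the nested-dict tree, and the `Table.Column` naming are hypothetical choices for this example, not something the slide specifies.

```python
def add_join_nodes(tables, foreign_keys):
    """Build a schema tree and add one join node per foreign key.

    tables: {table_name: [column, ...]}
    foreign_keys: [(fk_name, referencing_table, referenced_table), ...]
    Assumption: a join node's children are the columns of both tables.
    """
    tree = {t: {c: {} for c in cols} for t, cols in tables.items()}
    for fk_name, referencing, referenced in foreign_keys:
        join = {}
        for table in (referencing, referenced):
            for col in tables[table]:
                join[f"{table}.{col}"] = {}
        tree[fk_name] = join
    return tree

# Schema A from the slide.
schema_a = {
    "Purchase Order": ["Product Name", "Order ID", "Customer ID"],
    "Customer": ["Customer ID", "Name", "Address"],
}
fks = [("Order-Customer-fk", "Purchase Order", "Customer")]
tree_a = add_join_nodes(schema_a, fks)
print(sorted(tree_a["Order-Customer-fk"]))
```

With the join node in place, a structural matcher can compare Order-Customer-fk in Schema A against the single Customer-Purchase-Order element in Schema B as two subtrees with similar leaf sets.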
23
Cupid Architecture
[Diagram labels: Schema 1, Schema 2, Thesaurus, Linguistic Matching (LSIM), Structure Matching (SSIM, WSIM), Generate Mapping, Output Mapping]

Linguistic Similarity (LSIM)
  Element 1        Element 2        LSIM
  InvoiceTo        BillTo           0.7
  UoM              UnitMeasure      0.9
  City             City             1.0

Structural (SSIM), Weighted (WSIM) Similarity
  Element 1        Element 2        SSIM   WSIM
  InvoiceTo        BillTo           0.8    0.7
  UoM              UnitMeasure      0.7    0.8
  InvoiceTo/City   BillTo/City      0.8    0.9
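How the weighted similarity relates to the other two coefficients can be sketched as a linear combination, which is how the Cupid paper describes it. The function name `weighted_similarity` and the default weight `w_struct=0.5` are assumptions made for this example; the slide does not state the weight.

```python
def weighted_similarity(lsim, ssim, w_struct=0.5):
    """Combine linguistic and structural similarity into one score.

    w_struct is the weight given to structure; 0.5 is a hypothetical default.
    """
    if not 0.0 <= w_struct <= 1.0:
        raise ValueError("w_struct must be in [0, 1]")
    return w_struct * ssim + (1.0 - w_struct) * lsim

# UoM / UnitMeasure, reading the slide's values as LSIM 0.9 and SSIM 0.7.
print(round(weighted_similarity(0.9, 0.7), 2))
```

With an equal weight, the UoM and City rows come out at 0.8 and 0.9, in line with the values on the slide if its columns are read as SSIM followed by WSIM.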
24
Mapping Generation
Individual mapping elements are computed from WSIM values:
  – Consider only mapping pairs whose WSIM is greater than a threshold
  – For each element of the target, find the most similar source element
  – Mappings that were not accepted but have high similarity are also returned, to help the user modify the map
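The three steps above can be sketched as a small selection routine. This is a minimal reading of the slide, not Cupid's code: `generate_mapping`, the threshold `th_accept`, and the dict-of-pairs input are hypothetical, and the scores in the example are made up.

```python
def generate_mapping(wsim, th_accept=0.5):
    """Pick, for each target element, the best source above the threshold.

    wsim maps (source, target) pairs to a weighted similarity in [0, 1].
    Returns (accepted, runners_up): accepted maps target -> source, and
    runners_up lists the remaining above-threshold pairs, best first.
    """
    candidates = [
        (s, t, score) for (s, t), score in wsim.items() if score > th_accept
    ]
    best = {}
    for s, t, score in candidates:
        if t not in best or score > best[t][1]:
            best[t] = (s, score)
    runners_up = sorted(
        (c for c in candidates if best[c[1]][0] != c[0]),
        key=lambda c: c[2],
        reverse=True,
    )
    accepted = {t: s for t, (s, _) in best.items()}
    return accepted, runners_up

example_wsim = {
    ("Qty", "Quantity"): 0.9,
    ("UoM", "Quantity"): 0.6,
    ("UoM", "UnitofMeasure"): 0.8,
    ("Line", "ItemNumber"): 0.3,
}
accepted, runners_up = generate_mapping(example_wsim)
print(accepted)
print(runners_up)
```

Returning the runners-up separately keeps the accepted map one-to-one per target while still showing the user the near misses they may want to promote.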
25
Cupid Architecture
[Diagram labels: Schema 1, Schema 2, Thesaurus, Linguistic Matching (LSIM), Structure Matching (SSIM, WSIM), Generate Mapping, Output Mapping, Input hint]
26
Work Needed
A more robust solution
  – Auto-tuning parameters
  – Thesaurus Generation and Evolution
Schema matching component architecture
  – Easily extensible by adding multiple techniques
  – Data Instances for matching
  – Look at COMA & ProtoPlasm systems
27
References
  1. J. Madhavan, P. A. Bernstein, E. Rahm: Generic Schema Matching with Cupid. VLDB, 2001
  2. H. H. Do, E. Rahm: COMA - A System for Flexible Combination of Schema Matching Approaches. VLDB, 2002
  3. P. A. Bernstein, S. Melnik, M. Petropoulos, C. Quix: Industrial-Strength Schema Matching. SIGMOD Record 33(4), 2004