1
Generic Schema Matching using Cupid
Cupid the match-maker
Jayant Madhavan (University of Washington), Philip A. Bernstein (Microsoft Research), Erhard Rahm (University of Leipzig)
2
Schema Matching
[Figure: two purchase-order schemas side by side. Left schema elements: PO, POLines, Item, Line, Qty, UoM, POShipTo, City, Street. Right schema elements: PurchaseOrder, Items, ItemNumber, Quantity, UnitofMeasure, DeliverTo, Address, Name. Highlighted correspondences: POShipTo and DeliverTo; Line and ItemNumber; Qty and Quantity; UoM and UnitofMeasure.]
Schema matching in e-commerce: two businesses willing to cooperate. The example shows naming differences (Qty versus Quantity, UoM versus UnitofMeasure) and structural differences.
CS652 Information Extraction and Information Integration
3
Schema Matching Approaches
[Figure: taxonomy of schema matching approaches. Individual matchers are schema-based or content-based; combined matchers are hybrid or composite, with manual or automatic composition. Matchers work per-element or structurally (graph matching) and use linguistic evidence (names, descriptions, IR techniques such as word frequencies and key terms) or constraint-based evidence (types, keys, value patterns and ranges).]
Schema matching approaches have been studied as a component of a number of applications. The taxonomy of schema matching algorithms distinguishes schema-based from instance-based matchers, covers inter-element relationships, and separates linguistic from structural matches. A single algorithm is not good enough, hence the hybrid and composite approaches. A number of people in the audience have contributed algorithms that find a place in this survey. Cupid is a hybrid schema-based approach that uses linguistic, constraint and structural information.
Taxonomy-based survey: [Rahm, Bernstein ’00]
4
Cupid architecture
[Figure: Schema 1 and Schema 2 feed Linguistic Matching, which uses a Thesaurus and produces LSIM. Structure Matching produces SSIM and WSIM, from which Generate Mapping produces the Output Mapping.]
Matching proceeds in 3 steps, common to many systems:
LM: linguistic similarity coefficient (LSIM) between pairs of element names, one from each schema.
SM: structural similarity coefficient (SSIM) between element pairs that accounts for the similarity of related elements.
WS: weighted similarity (WSIM), a linear combination of the two.
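The third step can be sketched in a few lines. This is a minimal illustration, assuming a single weight `w_struct` in [0, 1]; the parameter name and its default are hypothetical and not taken from the slides, which only say that the weighted similarity is a linear combination.

```python
def weighted_similarity(lsim: float, ssim: float, w_struct: float = 0.5) -> float:
    """Combine linguistic and structural similarity linearly.

    w_struct is an assumed tuning weight: 0 uses only lsim, 1 uses only ssim.
    """
    if not 0.0 <= w_struct <= 1.0:
        raise ValueError("w_struct must lie in [0, 1]")
    return w_struct * ssim + (1.0 - w_struct) * lsim
```

With equal weights, a pair with lsim 0.8 and ssim 0.6 gets a weighted similarity of about 0.7.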
5
Linguistic Matching
Heuristic name matching
Tokenization of names: POOrderNum is split into PO, Order, Num
Expansion of short forms and acronyms: PO expands to Purchase, Order; Num expands to Number
Clustering of schema elements based on keywords and data types: Street, City and POAddress fall under Address
Thesaurus of synonyms, hypernyms, acronyms
Linguistic similarity coefficient (lsim) in [0,1]
Linguistic matching is essentially done by name matching. We use a number of heuristics, a few of which are listed above. Note that the information for acronyms, hypernyms and synonyms is obtained from a thesaurus. The result is a coefficient in the range 0 to 1 between pairs of schema element names, one in each schema. The calculation of this coefficient accounts for common tokens, synonym and hypernym relationships, and suffix and prefix information. Most of the heuristics have been proposed elsewhere in the literature, and we combined the ones we thought most useful.
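The tokenization, expansion and thesaurus lookup described above could look roughly like the sketch below. The thesaurus entries, the synonym score of 0.8 and the averaging formula are assumptions made for illustration; the slides do not give Cupid's actual scoring, and hypernyms, prefixes and suffixes are left out.

```python
import re

# Hypothetical thesaurus entries, invented for this illustration.
EXPANSIONS = {"po": ["purchase", "order"], "num": ["number"], "qty": ["quantity"]}
SYNONYMS = {frozenset({"ship", "deliver"}), frozenset({"bill", "invoice"})}

def tokenize(name):
    """Split a camel-case name into lower-case tokens, keeping acronym runs together."""
    pattern = r"[A-Z]+(?![a-z])|[A-Z][a-z]+|[a-z]+|\d+"
    return [tok.lower() for tok in re.findall(pattern, name)]

def expand(tokens):
    """Replace short forms and acronyms with their thesaurus expansion."""
    out = []
    for tok in tokens:
        out.extend(EXPANSIONS.get(tok, [tok]))
    return out

def token_sim(a, b):
    """Score two tokens: 1.0 if equal, an assumed 0.8 for synonyms, else 0.0."""
    if a == b:
        return 1.0
    return 0.8 if frozenset({a, b}) in SYNONYMS else 0.0

def lsim(name1, name2):
    """Average best-match token score in both directions; result lies in [0, 1]."""
    t1, t2 = expand(tokenize(name1)), expand(tokenize(name2))
    if not t1 or not t2:
        return 0.0
    best1 = sum(max(token_sim(a, b) for b in t2) for a in t1)
    best2 = sum(max(token_sim(a, b) for a in t1) for b in t2)
    return (best1 + best2) / (len(t1) + len(t2))
```

Under these assumptions `lsim("Qty", "Quantity")` is 1.0 because the expansion maps both names to the same token, while `lsim("City", "Street")` is 0.0.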
6
Structure Matching
[Figure: the two purchase-order schemas. Element labels: PO, PurchaseOrder, POLines, Items, POShipTo, DeliverTo, Address, Name, City, Street, Line, ItemNumber, Qty, Quantity, UoM, UnitofMeasure.]
We first revisit our original purchase order example to motivate some of the intuition behind the Cupid algorithm. Qty on the left-hand side matches Quantity on the right-hand side because they are linguistically similar and their data types are similar. Item on the left-hand side matches Item on the right-hand side because they are linguistically similar and a significant fraction of their children match each other.
7
Structure Match: Mutually Reinforcing Similarity
[Figure: animation over the two schema trees (PO and PurchaseOrder, POLines and Items, Item and Item, and the leaves Line, ItemNum, Qty, Quantity, UoM, UnitofMeasure). When Wsim > th_high for a pair of compound elements, Ssim is incremented (Ssim++) for the leaf pairs beneath them.]
What we know is the linguistic similarity between pairs of schema elements. The goal is to compute a structural similarity between schema element pairs that captures the similarity of other elements in their neighborhood. We enumerate the nodes of the two schemas in post-order and compare nodes as we traverse the schemas bottom-up. We initialize the structural similarity of atomic element pairs (for example UoM and UnitofMeasure, Qty and Quantity) to their data-type similarity.
When we compare compound elements, e.g. Item in the left schema with Item on the right, we do two things. First, we compute the structural similarity of the two elements. One measure that we use is the number of atomic elements, or leaves, in either sub-tree that can be mapped to a leaf in the other sub-tree. We consider two atomic elements mappable to each other if they have a weighted similarity greater than a threshold. Second, we compute the weighted similarity as a linear combination of the structural and linguistic similarity of the two Item elements. If this is greater than a threshold th_high, we increment the structural similarity of the leaf pairs in the sub-trees. We do this in recognition of the fact that the leaf pairs have similar ancestors, that is, they occur in a similar context.
As you might notice, this may in turn make Line mappable to ItemNum: it had a lower linguistic similarity, but its structural similarity has now been increased. We continue this process as we go up the two trees. Thus we have a scheme for mutually reinforcing the similarity of nodes in the two trees.
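The bottom-up pass described in these notes can be sketched as follows. This is a simplified illustration and not the authors' implementation: the `Node` class, every threshold and increment value, the data-type score of 0.5, the decrement branch and the rule that skips pairs whose leaf counts differ by more than a factor of two are all assumptions chosen to keep the example small.

```python
from dataclasses import dataclass, field

@dataclass(eq=False)  # identity-based hashing so nodes can serve as dict keys
class Node:
    name: str
    dtype: str = ""
    children: list = field(default_factory=list)

def post_order(node):
    """Yield the nodes of a schema tree bottom-up."""
    for child in node.children:
        yield from post_order(child)
    yield node

def leaves(node):
    """Atomic elements under node (a leaf is its own leaf set)."""
    return [n for n in post_order(node) if not n.children]

def tree_match(root1, root2, lsim, w_struct=0.5, th_accept=0.5, th_high=0.6,
               th_low=0.35, c_inc=0.2, c_dec=0.1, max_leaf_ratio=2.0):
    """Return a function wsim(a, b) holding the final weighted similarities."""
    ssim = {}

    def wsim(a, b):
        return w_struct * ssim.get((a, b), 0.0) + (1 - w_struct) * lsim(a.name, b.name)

    # Leaf pairs start from an assumed data-type compatibility score.
    for a in leaves(root1):
        for b in leaves(root2):
            ssim[(a, b)] = 0.5 if a.dtype == b.dtype else 0.0

    for s in post_order(root1):
        for t in post_order(root2):
            if not (s.children or t.children):
                continue  # leaf pairs keep their current ssim
            ls, lt = leaves(s), leaves(t)
            if max(len(ls), len(lt)) > max_leaf_ratio * min(len(ls), len(lt)):
                continue  # assumed pruning: leaf counts are not comparable
            # Fraction of leaves on either side that are mappable to the other side.
            linked = (sum(any(wsim(a, b) > th_accept for b in lt) for a in ls)
                      + sum(any(wsim(a, b) > th_accept for a in ls) for b in lt))
            ssim[(s, t)] = linked / (len(ls) + len(lt))
            w = wsim(s, t)
            delta = c_inc if w > th_high else -c_dec if w < th_low else 0.0
            # Similar (or dissimilar) ancestors raise (or lower) leaf-pair ssim.
            for a in ls:
                for b in lt:
                    ssim[(a, b)] = min(1.0, max(0.0, ssim[(a, b)] + delta))
    return wsim
```

With a toy `lsim` that scores Line against ItemNumber at 0.4, the pair starts below the assumed acceptance threshold (weighted similarity 0.45) and rises above it (0.55) once the two Item elements have been compared, which is the reinforcement effect the notes describe.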
8
Structure Match: Context-dependent disambiguation
[Figure: PO with POShipTo and POBillTo (each with City and Street) on the left; PurchaseOrder with DeliverTo and InvoiceTo sharing a common Address (City, Street) on the right. Ssim is decremented (Ssim--) for mismatched contexts and incremented (Ssim++) for matching ones.]
We now briefly demonstrate a further aspect of structure matching. Both purchase order schemas describe shipping and billing addresses, but the schema on the right has a common Address element, which is not the case on the left. The atomic elements City and Street in the right-hand schema now have to be matched according to the contexts in which they occur.
We apply the same methodology and traverse the schemas bottom-up. The two City elements on the left-hand side have the same Ssim value to begin with. The values remain the same after both POShipTo and POBillTo are compared in turn with Address. As we traverse the right-hand schema beyond Address, we realize that it can occur in multiple contexts. At this point we split the sub-tree into two, copying the similarity values as well. Now when POBillTo is matched with InvoiceTo we find a strong match: good matches among the leaves and high linguistic similarity. We increase the Ssim among the leaves. When POShipTo is matched with InvoiceTo we have the option of decrementing the Ssim values because of a poor linguistic match. When POShipTo is compared with DeliverTo we again have a strong linguistic match and so increase the Ssim values between their leaves. We have thus successfully disambiguated the mappings.
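The splitting step can be illustrated with a small sketch that expands a schema with a shared element into one path per context. The adjacency-dict representation and the function name `expand_paths` are hypothetical, and copying the similarity values for each copy is left out.

```python
def expand_paths(children, root):
    """Return every root-to-leaf path, duplicating shared sub-trees per context.

    children maps an element name to the names of its child elements; an
    element listed under several parents is a shared sub-tree.
    """
    paths = []

    def walk(node, prefix):
        path = prefix + (node,)
        kids = children.get(node, [])
        if not kids:
            paths.append("/".join(path))
        for kid in kids:
            walk(kid, path)

    walk(root, ())
    return paths
```

For the right-hand schema, where DeliverTo and InvoiceTo both contain Address, this yields separate leaves such as `PurchaseOrder/DeliverTo/Address/City` and `PurchaseOrder/InvoiceTo/Address/City`, each of which can then carry its own similarity values.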
9
Intuition
Atomic elements are similar if: linguistically and data-type similar; their ancestors are similar.
Compound elements (non-leaf) are similar if: linguistically similar; subtrees rooted at the elements are similar.
Mutually recursive: leaves determine internal node similarity; similarity of internal nodes leads to an increase in leaf similarity.
We reiterate some of the basic intuitions behind our structure matching.
10
Structure Match details
Subtrees are similar if: immediate children are similar; leaf sets are similar.
Subtree similarity (nodes s and t): the fraction of leaves in subtree s that can be mapped to a leaf in the other subtree t, and vice versa. This is less sensitive to variation in intermediate structure.
Pruning the number of comparisons: elements must have a comparable number of leaves.
We use the concept of similarity of sub-trees; what exactly do we mean by it?
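One way to write the leaf-based subtree similarity as a formula is shown below. The threshold name th_accept is an assumption; the slide only says that a leaf must be mappable to a leaf in the other subtree, and an earlier slide defines mappable as having a weighted similarity above a threshold.

```latex
\[
\mathit{ssim}(s,t) =
\frac{\bigl|\{x \in \mathit{leaves}(s) : \exists\, y \in \mathit{leaves}(t),\ \mathit{wsim}(x,y) > th_{accept}\}\bigr|
    + \bigl|\{y \in \mathit{leaves}(t) : \exists\, x \in \mathit{leaves}(s),\ \mathit{wsim}(x,y) > th_{accept}\}\bigr|}
     {\bigl|\mathit{leaves}(s)\bigr| + \bigl|\mathit{leaves}(t)\bigr|}
\]
```

As a hypothetical example, if s and t each have three leaves and two leaves on each side are mappable, then ssim(s,t) = (2 + 2) / (3 + 3), roughly 0.67, regardless of how many intermediate levels separate the leaves from s and t.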
11
Referential Integrity
[Figure: Schema A contains Purchase Order and Customer with a join node Order-Customer-fk; Schema B contains a single Customer-Purchase-Order element. Column labels in the figure: Customer ID, Order ID, Address, Name, Product Name.]
Join nodes are added to the schema tree for each referential integrity constraint. Views can be used similarly.
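A rough sketch of adding join nodes follows. The representation (a dict from element name to column list), the function name `add_join_nodes`, and the choice to give each join node the union of both tables' columns are assumptions for illustration; the slide only states that a join node is added per referential integrity constraint.

```python
def add_join_nodes(tables, foreign_keys):
    """Return a schema tree with one extra node per foreign-key constraint.

    tables: {table_name: [column, ...]}
    foreign_keys: {constraint_name: (referencing_table, referenced_table)}
    Each join node gets the columns of both tables (an assumed convention).
    """
    tree = {name: list(cols) for name, cols in tables.items()}
    for fk_name, (src, dst) in foreign_keys.items():
        merged = list(tables[src])
        merged += [col for col in tables[dst] if col not in merged]
        tree[fk_name] = merged
    return tree
```

With such a node in place, a two-table schema gains an element whose leaf set resembles that of a single denormalized element like Customer-Purchase-Order, so the leaf-based structural comparison has something to match against.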
12
Cupid architecture
[Figure: Schema 1 and Schema 2 feed Linguistic Matching (with Thesaurus), which outputs linguistic similarity (Lsim). Structure Matching outputs structural (Ssim) and weighted (Wsim) similarity, and Generate Mapping produces the Output Mapping.]
Example Lsim values shown: InvoiceTo and BillTo 0.7; UoM and UnitMeasure 0.9; City 1.0.
Example Ssim, Wsim values shown: InvoiceTo and BillTo 0.8; UoM and UnitMeasure 0.7; InvoiceTo/City and BillTo/City 0.9.
13
Mapping Generation
Individual mapping elements are computed from Wsim values.
Consider only mapping pairs that have Wsim greater than a threshold.
For each element of the target, find the most similar source element.
Mappings that are not accepted but have high similarity are returned to help the user modify the map.
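The selection rule above can be sketched as follows. The dict-based input, the function name `generate_mapping` and the default threshold are assumptions made for illustration.

```python
def generate_mapping(wsim, th_accept=0.5):
    """Pick, for each target element, the most similar source above the threshold.

    wsim: {(source, target): score}
    Returns (accepted, rejected): accepted maps target -> source, and rejected
    lists the other above-threshold pairs, best first, as hints for the user.
    """
    best = {}
    for (src, tgt), score in wsim.items():
        if score > th_accept and (tgt not in best or score > best[tgt][1]):
            best[tgt] = (src, score)
    accepted = {tgt: src for tgt, (src, _) in best.items()}
    rejected = sorted(
        ((src, tgt, score) for (src, tgt), score in wsim.items()
         if score > th_accept and accepted.get(tgt) != src),
        key=lambda row: -row[2],
    )
    return accepted, rejected
```

Ties are resolved in favor of the pair seen first, which is an arbitrary choice in this sketch.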
14
Cupid Architecture
[Figure: Schema 1 and Schema 2 feed Linguistic Matching (with Thesaurus, producing Lsim), then Structure Matching (producing Ssim, Wsim), then Generate Mapping and the Output Mapping, which is fed back into Structure Matching as an input hint.]
The generated output mapping can be corrected or selectively chosen and fed back into the structure matcher as an input hint. Node pairs that are identified as similar can have their similarities set to 1 in order to bias the structural similarity.
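Feeding a corrected mapping back as a hint might be as simple as the sketch below. The function name `apply_hints` and the use of a plain similarity table keyed by name pairs are assumptions; the slides do not say which similarity table is seeded.

```python
def apply_hints(similarity, confirmed_pairs):
    """Return a copy of the similarity table with user-confirmed pairs set to 1.0."""
    seeded = dict(similarity)
    for pair in confirmed_pairs:
        seeded[pair] = 1.0
    return seeded
```

Re-running structure matching on the seeded table lets the confirmed pairs raise the similarity of their neighbors, in the same way a strong match does in the bottom-up pass.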
15
Experimental Validation
Systems compared: MOMIS, DIKE, Cupid
Test sets: canonical examples, real-world examples
DIKE: graph matching of ER models; no Lsim component (LSPD entries)
MOMIS: class-level matching of OO descriptions; word senses manually chosen from WordNet
16
Evaluation Insights: Linguistic Similarity
Cupid is less sensitive to name variations due to token-level manipulations.
MOMIS is able to infer linguistic relationships based on intra-schema properties using Description Logic techniques.
MOMIS has an interface to WordNet: word senses need to be chosen manually, and choosing a single sense is not always possible.
Matching performance without a thesaurus depends on the similarity of the terms used and on the available structure (tokenization helps Cupid).
17
Evaluation Insights: Structural Similarity
DIKE and Cupid exploit structural similarity beyond the immediate neighborhood of schema elements.
Leaf structure for sub-tree similarity relaxes requirements on intermediate structure match.
Class-level structural similarity in MOMIS can be restrictive when matching schemas with different nesting.
Context-dependent matching in Cupid resolves mapping ambiguity.
Linguistic similarity with complete path names (and no structural similarity) is insufficient.
18
Contributions
Taxonomy of schema matching approaches
Cupid system that exploits linguistic, data-type, structure and referential integrity information
New algorithm that exploits schema structure
Experimental validation and comparison with other systems
In addition to the mentioned taxonomy