HKU CSIS DB Seminar: Finding Set-Mappings in Schema Matching
Supervisor: Dr. David Cheung
Speaker: Eric Lo
What is Schema Matching?
Finding semantic correspondences between the elements of two schemas
Input: two schemas; output: a set of mappings
Usually done by human experts, which is time-consuming
Application domains
E-commerce and data translation:
– Each trading partner has its own messaging format (e.g. EDI and ebXML) to describe business transaction details
– To deal with trading partners' different message schemas, businesses often need to convert messages between the schemas
– For example, the ‘total quantity’ field used by one partner may match the ‘amount’ field used by another
Outline
Introduction
Related Work
Problem
Possible Solutions
Discussion and Conclusion
References
State of the Art
Goal: high match accuracy for a large variety of schemas
Different match criteria (e.g. name, data type, dictionary, thesaurus…) are combined in a single algorithm
– Linguistic: match at the name (semantic) level
– Structural: also match the structure
One related work: Cupid [VLDB01]
Supports a variety of schema formats
Cupid models the interconnected elements of a schema as a schema tree
The schema tree can capture different data models in a unified way
Matches at both the linguistic and the structural level
Schema Tree Example
[Figure: two schema trees. Schema PO contains the elements POShipTo, POBillTo, POLines, City, Street, Item, Count, Line, Qty and UoM; schema PurchaseOrder contains DeliverTo, InvoiceTo, Items, Address, City, Street, Item, ItemCount, Line, Qty and UoM.]
Match the schemas
Calculate the similarities between elements
Report the mappings with high similarity values (> threshold)
Similarity lies in [0,1]
Two phases:
– Linguistic matching (lsim): e.g. match element names by value (edit distance for strings); use a thesaurus to resolve synonyms (“Bill” = “Invoice”) and short forms (“Qty” = “Quantity”)
– Structural matching
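The linguistic phase can be illustrated with a minimal sketch. It assumes that lsim is a length-normalised edit distance computed after a thesaurus lookup; the slide names these ingredients but not how they are combined, so `THESAURUS`, `normalize` and this particular `lsim` are hypothetical and are not Cupid's actual implementation.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


# Hypothetical thesaurus: maps synonyms and short forms to a canonical word.
THESAURUS = {"bill": "invoice", "qty": "quantity"}


def normalize(name: str) -> str:
    """Lower-case a name and replace it by its canonical form if one is known."""
    key = name.lower()
    return THESAURUS.get(key, key)


def lsim(a: str, b: str) -> float:
    """Assumed linguistic similarity in [0, 1]: 1 minus the normalised edit distance."""
    x, y = normalize(a), normalize(b)
    if not x and not y:
        return 1.0
    return 1.0 - edit_distance(x, y) / max(len(x), len(y))
```

With this sketch, `lsim("Qty", "Quantity")` is 1.0 because the thesaurus resolves the short form before any distance is computed, while a pair such as "Count" and "ItemCount" receives a partial score.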
Structural Matching
Match two elements based on their context and vicinity
Structural information can help resolve many ambiguities that linguistic matching cannot
The resulting similarity is denoted ssim
Schema Tree Example Revisit
[Figure: two schema trees. Schema PO contains the elements POShipTo, POBillTo, POLines, City, Street, Item, Count, Line, Qty and UoM; schema PurchaseOrder contains DeliverTo, InvoiceTo, Items, Address, City, Street, Item, ItemCount, ItemNo, Qty and UoM.]
Schema Tree Example Revisit (2)
[Figure: the same two schema trees, PO and PurchaseOrder, as on the previous slide.]
Similarity in Cupid
Similarity = a × ssim + (1 - a) × lsim
“a” is the importance (weight) of the structural similarity
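A small sketch of the weighted combination, together with the thresholding step mentioned earlier in the talk. The function names, the default weight and threshold, and the similarity values in the usage note are illustrative assumptions, not figures from Cupid.

```python
def combined_similarity(ssim: float, lsim: float, a: float = 0.5) -> float:
    """Weighted similarity: a * ssim + (1 - a) * lsim, with a in [0, 1]."""
    if not 0.0 <= a <= 1.0:
        raise ValueError("a must lie in [0, 1]")
    return a * ssim + (1.0 - a) * lsim


def report_matches(candidates, a: float = 0.5, threshold: float = 0.5):
    """Keep the (source, target) pairs whose combined similarity exceeds the threshold.

    `candidates` is an iterable of (source, target, ssim, lsim) tuples.
    """
    return [(src, tgt)
            for src, tgt, ssim, lsim in candidates
            if combined_similarity(ssim, lsim, a) > threshold]
```

For example, with made-up scores, `report_matches([("POBillTo", "InvoiceTo", 0.9, 0.7), ("City", "Street", 0.2, 0.1)])` keeps only the first pair. Raising `a` makes the structural evidence dominate the decision.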
How to evaluate a schema matching system?
Option 1: compare with human experts
Option 2: comparative study with other systems
Precision and recall:
– P: the set of matches returned by the automatic matcher
– I: the set of real matches (identified by domain experts)
– c: the matches that are in both P and I, i.e. the true positives
– Precision = |c|/|P|: reliability of the match predictions
– Recall = |c|/|I|: percentage of real matches found
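The two measures can be computed directly from sets of mappings. The sketch below assumes each mapping is represented as a (source, target) pair; the example pairs in the usage note are made up for illustration and are not results from any system.

```python
def precision_recall(predicted: set, real: set) -> tuple:
    """Return (precision, recall) for predicted matches P against real matches I."""
    correct = predicted & real  # c: matches that are both predicted and real
    precision = len(correct) / len(predicted) if predicted else 0.0
    recall = len(correct) / len(real) if real else 0.0
    return precision, recall
```

For instance, if a matcher returns three pairs of which two are among four expert-confirmed pairs, precision is 2/3 and recall is 2/4.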
Limitations
The problem is not solved completely
Schema matching is just one step in data integration, data translation, e-commerce, etc.
Why?
– Given a set of matched schema elements …
– we still need to generate the query!
– E.g. insert into B.BillTo… select A.InvoiceTo...
The Real Picture Should be…
Input:
– A.Firstname concat A.Lastname → B.Name
– A.basesalary + A.workingHour × A.hourlyWages → B.salary
Output:
– XQuery, SQL
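To make the step from set-mappings to a query concrete, here is a minimal sketch that turns the two example mappings into a single SQL statement. The `SetMapping` class and `build_insert` function are hypothetical, and rendering "concat" as the SQL `||` operator is an assumption; the talk does not specify how queries are generated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SetMapping:
    """One n-to-1 mapping: a source-side expression feeding one target column."""
    expression: str  # SQL expression over source columns (assumed already valid SQL)
    target: str      # target column name


def build_insert(source_table: str, target_table: str, mappings) -> str:
    """Assemble an INSERT ... SELECT statement from a list of set-mappings."""
    columns = ", ".join(m.target for m in mappings)
    expressions = ", ".join(m.expression for m in mappings)
    return (f"INSERT INTO {target_table} ({columns}) "
            f"SELECT {expressions} FROM {source_table}")


mappings = [
    SetMapping("A.Firstname || A.Lastname", "Name"),
    SetMapping("A.basesalary + A.workingHour * A.hourlyWages", "salary"),
]
sql = build_insert("A", "B", mappings)
```

A list of 1-to-1 mappings could not produce the `Name` expression without the user combining them by hand, which is the motivation for set-mappings.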
Current Problems
All existing works (except one) handle only 1-to-1 matching, which is unrealistic
Given a set of 1-to-1 mappings, users still need to form the real “input” themselves, going from:
– Mapping x: A.Firstname → B.Name (1:1) (not useful!)
– Mapping y: A.Lastname → B.Name (1:1) (not useful!)
to:
– A.Firstname concat A.Lastname → B.Name (2:1) (useful!)
Really novel?
A DASFAA 2003 paper “solved” it [DASFAA03]
They augment each schema to be matched with a huge amount of ontological information
Application oriented
Assumptions:
– Such an ontology exists for each schema
– Such an ontology can be easily created
Set-Oriented Matching
Use the ontology to enhance the similarity functions and generate a set of n-to-m mappings
E.g. if one of the input schemas comes from the real estate sector, it is augmented with an ontology about real estate, so the system already knows which elements form a set (e.g. firstname concat lastname is given in the ontology a priori)
Extremely high accuracy?
Our directions
Previous work is not realistic
Dig out the set-mappings without the help of an ontology
Observation 1:
– The m elements and the n elements in an m-to-n mapping (the useful input) are inter-correlated
– They are inter-correlated in terms of both structure and linguistics
Intra-similarity
Within a schema, structural similarity is much more important than linguistic similarity
– A.Firstname concat A.Lastname → B.Name
  Same type? Intra-similar
  Identical meaning? … Intra-similar? Not necessarily
  Same hierarchical level? Intra-similar
– A.basesalary + A.workingHour × A.hourlyWages → B.salary
  Same type? Intra-similar
  Identical meaning? … Intra-similar? Not necessarily
  Same hierarchical level? Intra-similar
Intra-schema similarity
A new similarity function is defined
If the intra-similarity is > threshold, then output m-to-n mappings
The algorithm is similar to other structural matching approaches
Foreseeable evaluation results:
– Must be more accurate than all (except one) previous works, as we can find the mappings users expect
– May be poorer than the ontology approach, but no “magic” ontology is needed and it is more realistic
Observation 2
User effort is involved in any approach
That user effort is “thrown away” afterwards
The system is user-oriented: users make the final decision
Why not learn and store the users' patterns?
– Improve accuracy
– Suggest mappings to users in case they get lost
Discussion and Conclusion
Development is still ongoing …
A fact: very difficult to argue
How to define the similarity between a set of source elements and a set of target elements, given the intra-similarity?
Propose a novel matcher for discovering useful set-mappings
References
[VLDB02] COMA: A System for Flexible Combination of Schema Matching Approaches
– By Hong-Hai Do, Erhard Rahm
– University of Leipzig
[ICDE02] Similarity Flooding: A Versatile Graph Matching Algorithm and its Application to Schema Matching
– By Sergey Melnik, Hector Garcia-Molina, Erhard Rahm
– Stanford and University of Leipzig
[VLDB02] Translating Web Data
– By Lucian Popa, Yannis Velegrakis, Renee J. Miller, et al.
– IBM Almaden Research Center and University of Toronto
[VLDB01] Generic Schema Matching with Cupid
– By Jayant Madhavan, Philip A. Bernstein, Erhard Rahm
– U of Washington and Microsoft Research
[DASFAA03] Discovering Direct and Indirect Matches for Schema Elements
– By Li Xu and David W. Embley
– Brigham Young University
Thank You