A Survey of Approaches to Automatic Schema Matching. Name: Samer Samarah. Number: 3740535.

Presentation transcript:

1 A Survey of Approaches to Automatic Schema Matching
Name: Samer Samarah
Number:
This presentation is based on the following paper: E. Rahm, P. Bernstein, "A Survey of Approaches to Automatic Schema Matching", VLDB Journal, 10(4): (2001).

2 Presentation Outline
- Definition of schema matching.
- The schema matching problem.
- Applications of schema matching.
- Schema matching approaches.
- Personal contribution.
- Conclusion.

3 Schema Matching: Definition
- Schema matching is the process of finding semantic correspondences between the elements of two schemas.
- Schema matching is carried out by the Match operation.
- The Match operator is a function that takes two schemas as input and returns a mapping between them, called the match result: Match(S1, S2) → Match Result.

4 The Match Operator
Match(S1, S2) → Match Result
- A schema (either S1 or S2) is defined as a set of elements connected by some structure (ER model, OO model, ...).
- The match result is a set of mapping elements, each of which indicates that certain elements of S1 are mapped to certain elements of S2.
- A mapping expression, which specifies how the S1 and S2 elements are related, may be associated with each mapping element.
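The definitions above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: a schema is reduced to a set of element names (the structure connecting the elements is ignored), and the toy `match` function pairs names that are equal ignoring case. The names `MappingElement` and `match` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple


@dataclass(frozen=True)
class MappingElement:
    """One entry of the match result: some S1 elements mapped to some S2 elements."""
    s1_elements: Tuple[str, ...]
    s2_elements: Tuple[str, ...]
    expression: Optional[str] = None  # optional mapping expression


def match(s1: Set[str], s2: Set[str]) -> List[MappingElement]:
    """Toy Match(S1, S2): pair elements whose names are equal ignoring case."""
    s2_by_lowercase = {name.lower(): name for name in s2}
    result = []
    for name in sorted(s1):
        partner = s2_by_lowercase.get(name.lower())
        if partner is not None:
            result.append(MappingElement((name,), (partner,), f"{partner} = {name}"))
    return result


# Hypothetical input schemas, for illustration only.
match_result = match({"Name", "Salary", "Dept"}, {"name", "salary", "Phone"})
for element in match_result:
    print(element)
```

Keeping the mapping expression as an optional field mirrors the slide: a mapping element always says which elements correspond, and only sometimes says how.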

5 The Match Operator

6 Schema Matching Problem
- Schema matching is currently performed manually, which makes it:
  - Error-prone.
  - Tedious.
  - Time-consuming.
- The solution is to automate the match function, but:
  - There is no mathematical model that captures the matching process.
  - Matching is application dependent.

7 Schema Matching Applications
- Schema integration.
- Data warehouses.
- E-commerce.
- Semantic query processing.
- Data integration systems.

8 1- Schema Integration
- Schema integration is the process of constructing a global view from a set of independently developed schemas.
- It is achieved by identifying the interschema relationships (applying schema matching) and then unifying the matched elements.

9 Schema Matching Example
S1 {SName, …}, S2 {FName, LName, …}, ..., Sn → Match + Unify → Sg {Name, …}
Example: the correspondence between SName and (FName, LName) is determined through the match operation.

10 2- Data Warehouses
- A data warehouse (DWH) is a decision-support database whose data is extracted from a set of data sources.
- The DWH and the sources represent data in different formats (e.g. relations or XML versus a multidimensional view).
- Constructing a DWH requires transforming the data from its source format into the DWH format.
- The match operation can be used to identify the source elements that are represented in the DWH; the appropriate transformations can then be designed from this mapping.

11 3- E-commerce
- Trading partners exchange messages that describe business transactions.
- Each partner uses its own message format (EDI, XML, ...) or its own message schema.
- To exchange messages, they must be translated into the format required by each partner (a matching problem).

12 4- Semantic Query Processing
- A run-time scenario in which the user specifies the output of a query (the SELECT clause) in terms of concepts familiar to them, which may differ from the concepts used in the DB, and the system figures out how to produce that output (i.e. the FROM and WHERE clauses in SQL).
- The match operation is used to determine the mapping between the user's concepts and the DB concepts.

13 4- Semantic Query Processing (Example)
- User request: all employees who earn more than $2000.
- Database schema: Employee(FName, LName, Salary).
- Applying Match yields the mapping elements, from which a new query is produced in SQL format:
  SELECT FName, LName FROM Employee WHERE Salary > 2000

14 5- Data Integration System
- The major component of a data integration system is the source description.
- A source description maps the source's schema to the mediated schema.
- The match operation is applied to specify this mapping.

15 Generic Match Architecture
- The schemas to be matched are held in a uniform internal representation.
- The importer translates input schemas from their native representation into the internal representation.
- The exporter translates the match result produced by the matcher from the internal representation into the representation required by each tool.

16 Generic Match Architecture (figure)
- Tools: Tool 1 (portal schemas), Tool 2 (e-business schemas), Tool 3 (DWH schemas), Tool 4 (DB schemas).
- Components: schema import/export, internal schema representation, generic match implementation, global libraries (dictionaries, schemas, ...).

17 Classification of schema matching approaches
- Instance vs. schema: the matcher considers instance data or only schema information.
- Element vs. structure: matching is performed for individual schema elements (attributes) or for combinations of elements (structures).
- Language vs. constraint: the matcher uses linguistic information (names and textual descriptions) or constraint information (keys, relationships).
- Matching cardinality: the overall match result may relate one or more elements of one schema to one or more elements of the other (1:1, 1:n, m:n).
- Auxiliary information: the matcher may use auxiliary information (dictionaries, previous matching results, user input, ...).

18 Classification of schema matching approaches (figure)
Schema matching approaches
  Individual matchers
    Schema-based
      Element level, structure level: linguistic, constraint
    Instance-based
      Element level: linguistic, constraint
  Combined matchers
    Hybrid matcher
    Composite matcher: automatic, manual

19 Schema level matcher
- Considers only schema information, not instance data, such as:
  - Name
  - Description
  - Data type
  - Relationships (is-a, part-of)
  - Constraints
  - Schema structure
- Multiple match candidates may be found, each assigned a degree of similarity.

20 Granularity of match
- Element level matching: for each element of the first schema, determines the matching elements in the second schema.
- Structure level matching: matches combinations of elements that appear together in a structure. The match may be full or partial.

21 Granularity of match: Example
S1 Elements | S2 Elements | Match Granularity
Address | CustomerAddress | Address ↔ CustomerAddress (structure level, full match)
Street, City | Street, City | Street ↔ Street, City ↔ City (element level)
State | USState | State ↔ USState (element level)
ZIP | PostalCode | ZIP ↔ PostalCode (element level)
AccountOwner | Customer | AccountOwner ↔ Customer (structure level, partial match)
Name | CName | Name ↔ CName (element level)
Address | CAddress | Address ↔ CAddress (element level)
Birthdate, TaxExempt | CPhone | (no match listed)

22 Linguistic approaches
- Linguistic matchers use names and text (words or sentences) to find semantically similar schema elements:
  - Name matching
  - Description matching

23 Name Matching
- Name-based matching matches schema elements with equal or similar names:
  - Equality of names.
  - Equality of canonical name representations after stemming and other preprocessing.
  - Equality of synonyms.
  - Equality of hypernyms (X is a hypernym of Y if Y is a kind of X).
  - Similarity based on common substrings, edit distance, or Soundex.
  - User-provided name matches.
- Thesauri or dictionaries should be exploited.

24 Name Matching Example: two schemas, S1 and S2, represent two automobile suppliers
S1 Elements | S2 Elements | Matching Based on
CarID | TruckID | Car ↔ Truck (hypernyms): a car is an automobile and a truck is an automobile
Brand | Make | Brand ↔ Make (synonyms)
Price | Price | Price ↔ Price (equality of names)
SoldTo | Sold2 | SoldTo ↔ Sold2 (Soundex)
CAddress | CustomerAddress | CAddress ↔ CustomerAddress (preprocessing)
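A rough Python sketch of a name matcher along these lines follows. It covers only three of the listed criteria: canonical names after preprocessing, synonyms, and string similarity. The synonym set, the abbreviation table, and the score 0.9 are assumptions made for illustration, and the standard library's `difflib.SequenceMatcher` stands in for edit distance or Soundex, which it does not implement.

```python
from difflib import SequenceMatcher

# Hypothetical auxiliary information (assumptions, not from the paper).
SYNONYMS = {frozenset({"brand", "make"})}
ABBREVIATIONS = {"c": "customer"}


def canonical(name: str) -> str:
    """Lower-case a name, expanding a one-letter prefix such as the C in CAddress."""
    if len(name) > 1 and name[0].isupper() and name[1].isupper():
        prefix = ABBREVIATIONS.get(name[0].lower())
        if prefix:
            return prefix + name[1:].lower()
    return name.lower()


def name_similarity(a: str, b: str) -> float:
    """Return a similarity in [0, 1] for two element names."""
    ca, cb = canonical(a), canonical(b)
    if ca == cb:
        return 1.0  # equal names, or equal canonical names after preprocessing
    if frozenset({ca, cb}) in SYNONYMS:
        return 0.9  # synonym lookup; the score itself is an arbitrary choice
    return SequenceMatcher(None, ca, cb).ratio()  # generic string similarity


for s1_name, s2_name in [("Price", "Price"), ("CAddress", "CustomerAddress"),
                         ("Brand", "Make"), ("SoldTo", "Sold2")]:
    print(s1_name, s2_name, round(name_similarity(s1_name, s2_name), 2))
```

Hypernym matching (CarID and TruckID) is left out because it needs a real thesaurus rather than a two-entry table.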

25 Constraint-based approaches
- Exploit the constraint information associated with the input schemas to determine the similarity of schema elements:
  - Data types and domain constraints.
  - Key characteristics (primary, unique).
  - Relationship cardinality.
  - Structural constraints such as foreign keys (used by structural matching approaches).

26 Constraint-based approaches: Example
S1:
  Employee: EmpNo {int, PK}, EmpName {string}, DeptNo {int, ref dep}, Salary {single}, BirthDate {date}
  Department: DeptNo {int, PK}, DeptName {string}
S2:
  Personal: Pno {int, PK}, Pname {string}, Dept {string}, Born {date}
S1 Elements | S2 Elements | Matching
Employee | Personal |
EmpNo or DeptNo | Pno | Pno ↔ EmpNo | DeptNo {key}
EmpName | Pname | Pname ↔ EmpName {type}
DeptName | Pname | Pname ↔ DeptName {type}
EmpName | Dept | Dept ↔ EmpName {type}
DeptName | Dept | Dept ↔ DeptName {type}
BirthDate | Born | Born ↔ BirthDate {type}
Note (structural matching): S2.Personal {Pno, Pname, Dept, Born} corresponds to
  Select S1.Employee.EmpNo, S1.Employee.EmpName, S1.Department.DeptName, S1.Employee.BirthDate From S1.Employee, S1.Department Where (S1.Employee.DeptNo = S1.Department.DeptNo)
Several match candidates can result, so this approach is best used to limit the number of candidates.
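The candidate list on this slide can be reproduced with a very small filter. The following sketch assumes that two elements are candidates when they have the same data type and the same key status; the tuple layout and the function name `constraint_candidates` are illustrative choices, not part of the paper.

```python
# (element name, data type, is primary key), transcribed from the slide's example.
S1 = [
    ("Employee.EmpNo", "int", True),
    ("Employee.EmpName", "string", False),
    ("Employee.DeptNo", "int", False),  # foreign key to Department, not a key here
    ("Employee.Salary", "single", False),
    ("Employee.BirthDate", "date", False),
    ("Department.DeptNo", "int", True),
    ("Department.DeptName", "string", False),
]
S2 = [
    ("Pno", "int", True),
    ("Pname", "string", False),
    ("Dept", "string", False),
    ("Born", "date", False),
]


def constraint_candidates(s1, s2):
    """For each S2 element, list the S1 elements with equal type and key status."""
    candidates = {}
    for name2, type2, key2 in s2:
        candidates[name2] = [name1 for name1, type1, key1 in s1
                             if type1 == type2 and key1 == key2]
    return candidates


for s2_name, s1_names in constraint_candidates(S1, S2).items():
    print(s2_name, "<-", s1_names)
```

Pname and Dept each keep two candidates, which illustrates why constraints alone narrow the search but usually cannot decide the match.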

27 Description Matching
- Based on a linguistic evaluation of the comments associated with schema elements, for example using natural-language understanding technology.
- Example:
  S1: empn // employee name
  S2: name // name of employee
  Result: empn ↔ name

28 Reusing schema and mapping information
- This approach supports and exploits the reuse of common schema components and previously determined mappings.
- It is useful when several different but similar schemas are matched to the same destination schema.

29 Reusing schema and mapping information: Example
Schema S1 (Purchase order2): Product, BillTo, Name, Address, ShipTo, Name, Address, ContactPhone
Schema S2 (Purchase order): Product, BillTo, Name, Address, ShipTo, Name, Address, Contact, Name, Address
Schema S (POrder): Article, Payee, BillAddress, Recipient, ShipAddress
Goal: map S1 to S. The match result between S2 and S was determined previously and can be reused to map S1 to S.
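One way to picture the reuse is as a composition of two mappings: S1 to S2 (easy, because the schemas are similar) followed by the stored S2 to S result. The sketch below assumes simple 1:1 name-to-name mappings; the concrete correspondences in both dictionaries are hypothetical and are not the match result shown on the slide.

```python
def compose(first, second):
    """Follow each element of `first` through `second`; drop elements with no onward mapping."""
    return {source: second[middle] for source, middle in first.items() if middle in second}


# Hypothetical correspondences, for illustration only.
s1_to_s2 = {"Product": "Product", "BillTo": "BillTo", "ShipTo": "ShipTo"}
s2_to_s = {"Product": "Article", "BillTo": "Payee", "ShipTo": "Recipient"}  # stored earlier

s1_to_s = compose(s1_to_s2, s2_to_s)
print(s1_to_s)
```

Elements of S1 that have no counterpart in S2, such as ContactPhone, drop out of the composition and still need to be matched against S directly.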

30 Match cardinality
- Global cardinality: in how many mapping elements of the match result an S1 (or S2) element can participate.
- Local cardinality: how many S1 elements are related to how many S2 elements within an individual mapping element.
- Most approaches are restricted to 1:1 local cardinality and 1:1 or 1:n global cardinality.

31 Match cardinality: Example
No. | Local Match Cardinality | S1 elements | S2 elements | Matching Expression
1 | 1:1 element level | Price | Amount | Amount = Price
2 | n:1 element level | Price, Tax | Cost | Cost = Price * (1 + Tax/100)
3 | 1:n element level | Name | FirstName, LastName | FirstName, LastName = Extract(Name, ..)
4 | n:1 structure level (n:m element level) | B.Title, B.PuNo, P.PuNo, P.Name | A.Book, A.Publisher | A.Book, A.Publisher = Select B.Title, P.Name From B, P Where B.PuNo = P.PuNo
Example: Price has 1:n global cardinality, since it participates in mapping elements 1 and 2.
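The first three matching expressions in the table can be written as small Python functions to make the local cardinalities concrete. The function names are hypothetical, and because the slide leaves the arguments of Extract(Name, ..) unspecified, splitting at the first space is purely an assumption.

```python
def to_amount(price):
    """1:1 element level: Amount = Price."""
    return price


def to_cost(price, tax_percent):
    """n:1 element level: Cost = Price * (1 + Tax/100)."""
    return price * (1 + tax_percent / 100)


def to_first_last(name):
    """1:n element level: split Name into FirstName and LastName.

    Splitting at the first space is an assumption; the slide does not define Extract.
    """
    first, _, last = name.partition(" ")
    return first, last


print(to_amount(100.0))
print(to_cost(100.0, 25))
print(to_first_last("Erhard Rahm"))
```

Price is an input to both `to_amount` and `to_cost`, which is the 1:n global cardinality noted on the slide.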

32 Instance level approaches
- Consider the data contents.
- Useful when schema information is limited (or there is no schema at all).
- Enhance schema matching by preferring elements whose instances are more similar:
  - Linguistic approaches based on IR techniques for text elements.
  - Constraint-based approaches, such as value range and average, for numeric elements.

33 Instance level approach: Example
S1:
EmpNo | Dept
234 | Marketing
235 | Accounting
236 | Marketing
S2:
SSN | Works for
230 | Accounting
229 | Marketing
228 | Marketing
Result:
- {Dept ≈ Works for} (based on the frequency of "Marketing")
- {EmpNo ≈ SSN} (based on value range)
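The example can be approximated with two very simple instance-level measures. This sketch is an assumption-laden illustration rather than the paper's method: numeric columns are compared by how close their averages are, text columns by how much their relative value frequencies overlap, and columns of different kinds score zero.

```python
from collections import Counter

# Instance data transcribed from the slide.
S1_COLUMNS = {"EmpNo": [234, 235, 236], "Dept": ["Marketing", "Accounting", "Marketing"]}
S2_COLUMNS = {"SSN": [230, 229, 228], "Works for": ["Accounting", "Marketing", "Marketing"]}


def is_numeric(values):
    return all(isinstance(v, (int, float)) for v in values)


def instance_similarity(a, b):
    """Compare two columns by their contents only."""
    if is_numeric(a) and is_numeric(b):
        mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
        largest = max(abs(mean_a), abs(mean_b))
        return 1.0 if largest == 0 else 1.0 - abs(mean_a - mean_b) / largest
    if is_numeric(a) or is_numeric(b):
        return 0.0  # a numeric column never matches a text column
    freq_a, freq_b = Counter(a), Counter(b)
    # Overlap of the relative value frequencies of the two columns.
    return sum(min(freq_a[v] / len(a), freq_b[v] / len(b)) for v in freq_a)


def best_matches(s1, s2):
    """Map each S1 column to the S2 column with the most similar instances."""
    return {c1: max(s2, key=lambda c2: instance_similarity(v1, s2[c2]))
            for c1, v1 in s1.items()}


print(best_matches(S1_COLUMNS, S2_COLUMNS))
```

Note that the column names are never compared, which is the point of an instance-level matcher: Dept and Works for are paired purely because their values are distributed alike.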

34 Combining matchers
- Combine several approaches to obtain good match candidates.
- Hybrid matcher: combines several matching approaches to determine match candidates based on multiple criteria (name, type, constraints).
  - More effective (poor candidates are filtered out early).
  - Better performance (fewer passes over the schema).
- Composite matcher: combines the results of several independently executed matchers, including hybrid matchers.
  - More flexible than hybrid matchers (allows selecting from a set of matchers).
  - The results can be combined automatically or manually.
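A minimal sketch of an automatic composite matcher is given below: two independent matchers each score a pair of elements, and the scores are combined by a weighted average. The element layout, the equal weights, and the 0.6 threshold are all assumptions made for the example, and the schema fragments are hypothetical.

```python
from difflib import SequenceMatcher


def name_matcher(e1, e2):
    """Linguistic matcher: string similarity of the element names."""
    return SequenceMatcher(None, e1["name"].lower(), e2["name"].lower()).ratio()


def type_matcher(e1, e2):
    """Constraint matcher: 1.0 when the data types agree, else 0.0."""
    return 1.0 if e1["type"] == e2["type"] else 0.0


def composite_score(e1, e2, weighted_matchers):
    """Weighted average of the scores of independently executed matchers."""
    total_weight = sum(weight for _, weight in weighted_matchers)
    return sum(weight * matcher(e1, e2) for matcher, weight in weighted_matchers) / total_weight


def composite_match(s1, s2, weighted_matchers, threshold=0.6):
    """Keep, for each S1 element, its best S2 partner if the score reaches the threshold."""
    result = []
    for e1 in s1:
        best = max(s2, key=lambda e2: composite_score(e1, e2, weighted_matchers))
        if composite_score(e1, best, weighted_matchers) >= threshold:
            result.append((e1["name"], best["name"]))
    return result


# Hypothetical schema fragments and weights.
s1 = [{"name": "street", "type": "string"}, {"name": "eNo", "type": "integer"}]
s2 = [{"name": "street", "type": "string"}, {"name": "eID", "type": "integer"}]
matchers = [(name_matcher, 0.5), (type_matcher, 0.5)]
print(composite_match(s1, s2, matchers))
```

Because the matchers are passed in as a list, one can be added, removed, or reweighted without touching the others, which is the flexibility the slide attributes to composite matchers.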

35 Personal contribution
- A prototype Datalog program was designed to implement a composite matcher.
- The program was tested on the DES system (Datalog Educational System), available at: http//
- The program gives the linguistic-based approach priority over the constraint-based approach.

36 Database Description
- The database contains the following predicates:
  - Source(ElementName, DataType, constraints,…). // describes the sources
  - Dictionary(Name1,Name2) // a dictionary that provides synonyms

37 The Program Rules
% cand1: elements with equal names.
cand1(N,N) :- s1(N,D,C), s2(N,DataType,Constraint).
% cand2: elements whose names are synonyms in the dictionary d.
cand2(N,N1) :- s1(N,D,C), s2(N1,DataType,Constraint), d(N,N1).
% ok1, ok2: elements already matched by cand1 or cand2.
ok1(N) :- cand1(N,N1).
ok1(N1) :- cand1(N,N1).
ok2(N) :- cand2(N,N1).
ok2(N1) :- cand2(N,N1).
% cand3: not yet matched elements with the same data type.
cand3(N,N1) :- s1(N,D,C), s2(N1,D,Constraint),
    not(ok1(N)), not(ok2(N1)),
    not(ok1(N1)), not(ok2(N)).
% cand4: not yet matched elements with the same constraint.
cand4(N,N1) :- s1(N,DataType,C), s2(N1,D,C),
    not(ok1(N)), not(ok2(N1)),
    not(ok1(N1)), not(ok2(N)).
% match: the union of all candidates.
match(N,N1) :- cand1(N,N1).
match(N,N1) :- cand2(N,N1).
match(N,N1) :- cand3(N,N1), cand4(N,N1).
match(N,N1) :- cand3(N,N1), not(cand4(N,N1)).
match(N,N1) :- cand4(N,N1), not(cand3(N,N1)).

38 Example
The program was tested on the following schemas.
s1(eNo,integer,pk).
s1(city,string,20).
s1(street,string,30).
s1(state,string,12).
s2(eID,integer,pk).
s2(cName,string,15).
s2(street,string,10).
s2(province,string,25).
d(state,province).
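For readers without a Datalog system at hand, the rules can be re-expressed with Python sets and evaluated on the facts above. This is a hand translation made for illustration, not output from DES; the helper `unmatched` is a name introduced here for the four negated conditions shared by cand3 and cand4.

```python
# Facts transcribed from the example: (element name, data type, constraint).
S1 = [("eNo", "integer", "pk"), ("city", "string", 20),
      ("street", "string", 30), ("state", "string", 12)]
S2 = [("eID", "integer", "pk"), ("cName", "string", 15),
      ("street", "string", 10), ("province", "string", 25)]
D = {("state", "province")}  # dictionary of synonyms

# cand1: equal names; cand2: synonyms according to the dictionary.
cand1 = {(n, n1) for n, _, _ in S1 for n1, _, _ in S2 if n == n1}
cand2 = {(n, n1) for n, _, _ in S1 for n1, _, _ in S2 if (n, n1) in D}

# ok1, ok2: every element that takes part in a cand1 or cand2 pair.
ok1 = {name for pair in cand1 for name in pair}
ok2 = {name for pair in cand2 for name in pair}


def unmatched(n, n1):
    """True when neither element was already matched linguistically."""
    return not ({n, n1} & (ok1 | ok2))


# cand3: same data type; cand4: same constraint; both only for unmatched elements.
cand3 = {(n, n1) for n, t, _ in S1 for n1, t1, _ in S2 if t == t1 and unmatched(n, n1)}
cand4 = {(n, n1) for n, _, c in S1 for n1, _, c1 in S2 if c == c1 and unmatched(n, n1)}

# The five match rules together amount to the union of all four candidate sets.
matches = cand1 | cand2 | cand3 | cand4
print(sorted(matches))
```

In this translation the linguistic candidates (street, and state with province) are fixed first, and the type and constraint rules only pair up the elements that remain.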

39 Results

40 Conclusion
- A taxonomy of schema matching approaches was presented, so that different approaches can be compared.
- The generic implementation described in the paper could serve as a base for any new implementation.
- Different techniques can be used to automate schema matching, such as natural language processing, machine learning, and IR.
- Fully automated matching (without user interaction) has not been achieved yet.

41 Thank You

42 References
- E. Rahm, P. A. Bernstein, "A Survey of Approaches to Automatic Schema Matching", VLDB Journal, 10(4): (2001).
- DES system (Datalog Educational System), available at: http//
- Jayant Madhavan, Philip A. Bernstein, Erhard Rahm, "Generic Schema Matching with Cupid", Proceedings of the 27th VLDB Conference, Roma, Italy, 2001.
- Hong-Hai Do, Sergey Melnik, Erhard Rahm, "Comparison of Schema Matching Evaluations", University of Leipzig, Augustusplatz 10-11, 04109 Leipzig, Germany.
- AnHai Doan, Pedro Domingos, Alon Levy, "Learning Source Descriptions for Data Integration", University of Washington, Seattle, WA 98195.

43 Appendix
1- Cupid (Microsoft Research)
2- LSD (Learning Source Descriptions)

44 1- Cupid (Microsoft Research)
- A hybrid matcher based on element and structure matching.
- What is new in this approach is that the schemas are represented as graphs, which encode the referential constraints as structures that can be matched just like other structures.
- The algorithm has two phases:
  - Linguistic matching: matches individual schema elements based on their names, data types, domains, ...
  - Structural matching: matches schema elements based on the similarity of their contexts.

45 2- LSD (Learning Source Descriptions)
- A composite matcher with automatic combination of match results.
- An attempt to automate the mappings between source schemas and the mediated schema in a data integration system.
- Uses machine learning techniques to match a new data source against a previously determined global schema.
- After a set of data sources has been manually mapped to a mediated schema, the system should be able to glean significant information from these mappings for subsequent data sources.