Automatic Schema Matching Nicole Oldham CSCI 8350 (Semantic Web Univ of Georgia) Topic Presentation
Outline Introduction Application Domains Classification of Schema Matching Approaches Current Work MWSAF Matching Open Research Directions Conclusion
Schema Matching Match: Takes two schemas as input and produces a mapping between the elements that correspond to each other semantically. It is usually performed manually. -Tedious -Time Consuming -Error Prone -Expensive We must automate this process!
Example GTE telecommunications needed to integrate 40 databases with a total of 27,000 elements. Project planners estimated that matching the elements manually would take 12 person-years. Doan A, Halevy A. Semantic Integration Research in the Database Community: A Brief Survey.
Various Levels of Heterogeneity ftp://ftp.dagstuhl.de/pub/Proceedings/04/04391/04391.ChristophidesVassilis.Slides.pdf
How to deal with Semantic Heterogeneity 1. Standardize: agree on a common representation 2. Translate: create mappings between different schemas -requires human input and machine reasoning -mappings can be difficult and expensive 3. Annotate: create relationships between agreed-upon conceptualizations -requires human input and machine reasoning -annotation can be difficult and expensive ftp://ftp.dagstuhl.de/pub/Proceedings/04/04391/04391.ChristophidesVassilis.Slides.pdf
Challenges The actual semantics of the involved elements are typically known only to the creators or recorded in documentation, so we must use clues in the schema and data instead. These clues are often misleading. E.g., ‘Area’ can refer to different entities. E.g., the same entity can have very different names. Clues are often ambiguous. E.g., ‘Contact-agent’: agent name or phone number? The matching process can be very costly: each element of the schema must be examined to ensure discovery of the best match. Matching is often subjective, depending on the application. Doan A, Halevy A. Semantic Integration Research in the Database Community: A Brief Survey.
Where is Schema Matching used? Database Application Domains -Data Integration -Data Warehousing -E-Business -Query Processing Semantic Web -XML/HTML to an Ontology -Semantic Web Services Bernstein P, Rahm E. A survey of approaches to automatic schema matching
Schema Integration Problem: Construct a global view from a set of independently constructed schemas (e.g., ontologies). - Different structures and terminologies Solution: Schema Matching is performed to find relationships between concepts in each schema. Then the matching elements can be unified. Bernstein P, Rahm E. A survey of approaches to automatic schema matching
Data Warehouses Problem: Integrating data sources into a data warehouse. - Different formats between the source and warehouse. Solution: Use matching to find the elements of the source that are also present in the warehouse. Then the details of the semantics can be examined to integrate the two. Bernstein P, Rahm E. A survey of approaches to automatic schema matching
E-Commerce Problem: Message translation. -Each trading partner uses its own message format. Solution: A match operation would reduce the amount of manual work to specify how the formats are related. Bernstein P, Rahm E. A survey of approaches to automatic schema matching
Query Processing Problem: The terms used in the user’s query may be different from those in the database. Solution: Matching is used to map the user-specified concepts in the query to schema elements. Bernstein P, Rahm E. A survey of approaches to automatic schema matching
Need for Data Integration on the Semantic Web Problem: Web documents are not in RDF or any form suitable for the SW. We must annotate them with concepts from ontologies. Solution: Use schema matching to map between elements represented in OWL and the different schemas of web documents.
Semantic Web Services Problem: Web Services are currently searched for using keywords. We need to annotate the WSDLs with semantic metadata so that they can be discovered efficiently. WSDLs are in XML, Ontologies in OWL! Solution: Use schema matching approaches to map between the two different schemas.
Term Definitions Schema: a set of elements connected by some structure. Mapping: a set of mapping elements, each of which indicates that certain elements of schema s1 are mapped to certain elements in s2. Mapping Expression: Tells how s1 and s2 elements are related. Bernstein P, Rahm E. A survey of approaches to automatic schema matching
Example A mapping between s1 and s2 might contain these elements:
Cust.C# = Customer.CustID
Concatenate(Cust.FirstName, Cust.LastName) = Customer.Contact
Cust.CName = Customer.Company
S1 Elements | S2 Elements
Cust | Customer
C# | CustID
CName | Company
FirstName | Contact
LastName | Phone
Bernstein P, Rahm E. A survey of approaches to automatic schema matching
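The mapping on this slide can be sketched as a small data structure: a list of mapping elements, each pairing S1 elements with one S2 element through a mapping expression. This is only an illustration; the class name `MappingElement`, the helper `translate`, the sample record, and the space used as separator in `concatenate` are assumptions, not part of the survey.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class MappingElement:
    sources: List[str]    # S1 elements taking part in the match
    target: str           # S2 element they are mapped to
    expression: Callable  # mapping expression relating them


def identity(value):
    return value


def concatenate(first, last):
    # Assumption: the slide does not say which separator Concatenate uses.
    return f"{first} {last}"


MAPPING = [
    MappingElement(["Cust.C#"], "Customer.CustID", identity),
    MappingElement(["Cust.FirstName", "Cust.LastName"], "Customer.Contact", concatenate),
    MappingElement(["Cust.CName"], "Customer.Company", identity),
]


def translate(record: Dict[str, object], mapping: List[MappingElement]) -> Dict[str, object]:
    """Apply every mapping expression to one S1 record to build an S2 record."""
    return {m.target: m.expression(*(record[s] for s in m.sources)) for m in mapping}


# Hypothetical S1 record, used only to exercise the mapping.
s1_record = {
    "Cust.C#": 17,
    "Cust.FirstName": "Ada",
    "Cust.LastName": "Lovelace",
    "Cust.CName": "Analytical Engines",
}
s2_record = translate(s1_record, MAPPING)
print(s2_record)
```

Keeping the expression as a callable means a 1:1 match and an n:1 match are handled by the same code path.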
Example Doan A, Halevy A. Semantic Integration Research in the Database Community: A Brief Survey.
Classification of Schema Matching Approaches Instance vs Schema: matching approaches can consider instance data or schema-level information. Element vs Structure matching: match can be performed for individual schema elements or combinations of elements. Language vs Constraint: linguistic (names) or constraint-based (keys and relationships). Matching Cardinality: match result may relate one or more elements of one schema to one or more elements of another. Auxiliary Information: matcher relies on other information besides the input schemas, such as dictionaries, user input, global schemas. Bernstein P, Rahm E. A survey of approaches to automatic schema matching
Classification of Schema Matching Approaches
Schema Matching Approaches
- Individual Matchers
  - Schema-only
    - Element Level: Linguistic, Constraint
    - Structure Level: Constraint
  - Instance/Contents
    - Element Level: Linguistic, Constraint
- Combining Matchers
  - Hybrid Matchers
  - Composite Matchers: Manual Composition, Automatic Composition
Further Criteria: -Match Cardinality -Auxiliary information used…
Sample Approaches: Name Similarity, Description Similarity, Global Namespaces, Word Frequency, Group Matching, Type Similarity, Key Properties, Value Pattern and Ranges
Bernstein P, Rahm E. A survey of approaches to automatic schema matching
Schema Level Matchers Consider schema information instead of instance data: Name, Description, Data Type, Relationship Types, Constraints, Structure Often produces multiple candidates and estimates a degree of similarity for each 1.Granularity of match (element level vs structure level) 2.Match Cardinality 3.Linguistic Approaches: Name or Description Matching 4.Constraint-Based Approaches 5.Reusing Schema and Matching Information Bernstein P, Rahm E. A survey of approaches to automatic schema matching
Element-Level Element-Level: Identifies all elements of S1 that are the same or similar to elements of S2. The match comparison can be based on name, description, or data type of the element. Example of name-based element-level matching: Address = CustomerAddress Bernstein P, Rahm E. A survey of approaches to automatic schema matching
Structure-Level Structure-Level: Matches combinations of elements that appear together in S1 with combinations of elements that appear together in S2.
Full Structure Match:
S1 Elements | S2 Elements
Address | CustAddress
Street |
City |
State | USState
Zip | PostalCode
Partial Structure Match:
S1 Elements | S2 Elements
AccountOwner | Customer
Name | Cname
Address | CAddress
Birthdate | CPhone
TaxExempt |
Equivalence Patterns: Can enhance structure matching by considering known equivalence patterns stored in a library.
Bernstein P, Rahm E. A survey of approaches to automatic schema matching
Match Cardinality One or more S1 elements can match one or more S2 elements. Complex matches Examples of the four local cardinality cases for individual mapping elements.
Local Match Cardinalities | S1 Element(s) | S2 Element(s) | Matching Expression
1:1, element level | Price | Amount | Amount = Price
n:1, element level | Price, Tax | Cost | Cost = Price*(1+Tax/100)
1:n, element level | Name | FirstName, LastName | FirstName, LastName = Name
n:m, element level (also n:1, structure level) | B.Title, B.PuNo, P.PuNo, P.Name | A.Book, A.Publisher | A.Book, A.Publisher = Select B.Title, P.Name From B, P Where B.PuNo = P.PuNo
Bernstein P, Rahm E. A survey of approaches to automatic schema matching
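As a worked illustration of the cardinality cases, the sketch below labels a mapping element by counting the elements on each side and evaluates the n:1 expression Cost = Price*(1+Tax/100). The function names `local_cardinality` and `cost` and the sample numbers are assumptions made for the example.

```python
def local_cardinality(s1_elements, s2_elements):
    """Label a mapping element 1:1, n:1, 1:n or n:m from its element counts."""
    many_left = len(s1_elements) > 1
    many_right = len(s2_elements) > 1
    if many_left and many_right:
        return "n:m"
    if many_left:
        return "n:1"
    if many_right:
        return "1:n"
    return "1:1"


def cost(price, tax):
    """Matching expression of the n:1 case: Cost = Price*(1+Tax/100)."""
    return price * (1 + tax / 100)


print(local_cardinality(["Price"], ["Amount"]))
print(local_cardinality(["Price", "Tax"], ["Cost"]))
print(local_cardinality(["Name"], ["FirstName", "LastName"]))
print(local_cardinality(["B.Title", "B.PuNo", "P.PuNo", "P.Name"], ["A.Book", "A.Publisher"]))
print(cost(80, 25))
```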
Complex Matches 1:1 matches are bounded by the sizes of the schemas, but there is an unbounded number of functions for combining attributes in a schema. Little work has been done on complex matching. Some approaches hard-code complex matches into rules. Some rely on a domain-specific ontology. We need domain knowledge to accurately perform complex matching. The best match isn’t always the top match returned by the matcher, so human involvement is still needed. Doan A, Halevy A. Semantic Integration Research in the Database Community: A Brief Survey.
Linguistic Approaches Language based matchers use names and text (i.e. words or sentences) to find semantically similar schema elements. Name Matching: match elements with similar names Description Matching: match comments in the schemas Bernstein P, Rahm E. A survey of approaches to automatic schema matching
Linguistic Approaches: Name Matching Matches schema elements with equal or similar names. How similarity is defined: 1. Equality of names 2. Equality of names after stemming (handles prefixes and suffixes) 3. Equality of synonyms 4. Equality of hypernyms (an SUV is a type of car) 5. Similarity of names based on common substrings, soundex, pronunciation (ShipTo = Ship2) 6. User-provided name matches. Can be element or structure-level. Cardinality is not limited to 1:1. Bernstein P, Rahm E. A survey of approaches to automatic schema matching
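A minimal sketch of how several of these similarity definitions could be layered into one name matcher: exact equality, equality after stemming, synonym lookup, then common-substring similarity. The tiny stemmer, the `SYNONYMS` table, and the scores 1.0 / 0.9 / 0.8 / 0.7 are assumptions chosen for illustration; a real matcher would use a proper stemmer and a thesaurus such as WordNet.

```python
from difflib import SequenceMatcher

# Hypothetical synonym table standing in for a real thesaurus.
SYNONYMS = {frozenset({"car", "automobile"}), frozenset({"salary", "wage"})}


def normalize(name):
    """Lower-case a name and drop common separators."""
    return name.lower().replace("_", "").replace("-", "")


def stem(name):
    """Very crude suffix stripping; not a real stemming algorithm."""
    if name.endswith("ss"):
        return name
    for suffix in ("ing", "es", "s"):
        if name.endswith(suffix) and len(name) > len(suffix) + 2:
            return name[: -len(suffix)]
    return name


def name_similarity(a, b):
    """Return a score in [0, 1]; stricter kinds of equality score higher."""
    a, b = normalize(a), normalize(b)
    if a == b:
        return 1.0
    if stem(a) == stem(b):
        return 0.9
    if frozenset({a, b}) in SYNONYMS:
        return 0.8
    # Fall back to similarity based on common substrings.
    return 0.7 * SequenceMatcher(None, a, b).ratio()


for pair in [("Cust_ID", "CustID"), ("Addresses", "Address"),
             ("car", "automobile"), ("ShipTo", "Ship2"), ("Phone", "Birthdate")]:
    print(pair, round(name_similarity(*pair), 2))
```

Ordering the checks from strict to loose keeps an exact match from being outscored by a fuzzy one.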
Linguistic Approaches: Description Matching Schemas can contain comments in natural language that express the intended semantics of the schema elements. Example S1: empn // employee name S2: name // name of employee Can be as simple as keyword extraction and synonym matching, or as complex as using natural language understanding technology. Bernstein P, Rahm E. A survey of approaches to automatic schema matching
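The simple end of that range, keyword extraction followed by comparison, might look like the sketch below. The stop-word list and the use of Jaccard overlap as the score are assumptions; the survey does not prescribe either.

```python
import re

# Hypothetical stop-word list; a real system would use a fuller one.
STOPWORDS = {"of", "the", "a", "an", "for"}


def keywords(comment):
    """Extract lower-cased keywords from a natural-language schema comment."""
    words = re.findall(r"[a-z]+", comment.lower())
    return {w for w in words if w not in STOPWORDS}


def description_similarity(comment1, comment2):
    """Jaccard overlap of the keyword sets of two comments."""
    k1, k2 = keywords(comment1), keywords(comment2)
    if not k1 or not k2:
        return 0.0
    return len(k1 & k2) / len(k1 | k2)


# The two comments from the slide: S1 empn and S2 name.
print(description_similarity("employee name", "name of employee"))
```

Here the element names `empn` and `name` share little, yet their descriptions reduce to the same keyword set.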
Constraint Based Schemas often contain constraints to define data types and value ranges, optionality, relationship types, cardinalities, etc. Bernstein P, Rahm E. A survey of approaches to automatic schema matching
Reusing Schema and Mapping Information The effectiveness of matching can be improved with the reuse of common schema components and previously determined mappings. Many schemas are very similar to each other and to previously matched schemas. E.g., in E-Commerce, substructures often repeat within different message formats (address fields, name fields). A schema library should be created, and schema editors should access the library to use predefined terms and definitions. Bernstein P, Rahm E. A survey of approaches to automatic schema matching
Schema Mapping Reuse Example Problems: 1. Determining which part of a new schema is similar to some part of a previously matched one is a match problem itself. 2. Similarity values may depend on the domain. i.e. Salary and income may be identical in payroll application but not in a tax reporting application Schema S1Schema S2Schema S Purchase-order Product BillTo Name Address ShipTo Name Address ContactPhone Purchase-order Product BillTo Name Address ShipTo Name Address Contact Name Address POrder Article Payee BillAddress Recipient ShipAddress Bernstein P, Rahm E. A survey of approaches to automatic schema matching
Instance Level Approaches Why? 1. Little or no schema information available. 2. Enhancement of schema-level matchers. Instance data gives insight into the contents and meaning of schema elements. 3. To match instance-level data. How? 1. Preferred Method: Linguistic Characterization 2. Constraint-based Characterization, e.g., value ranges 3. Auxiliary Information 4. Also uses both rule-based and learner-based techniques. Main Problem: When comparing data at the instance level, there is likely to be a very large number of possible match combinations, many of which are irrelevant. Bernstein P, Rahm E. A survey of approaches to automatic schema matching
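One way to picture constraint-based characterization of instance data is to reduce each column to the value patterns it contains and compare the resulting profiles. This is a sketch under assumptions: the pattern alphabet (9 for digits, A for letters), the Jaccard score, and the sample columns are all invented for the example.

```python
import re


def value_pattern(value):
    """Abstract a value into a pattern: every digit becomes 9, every letter A."""
    text = re.sub(r"[0-9]", "9", str(value))
    return re.sub(r"[A-Za-z]", "A", text)


def profile(values):
    """Characterize a column by the set of value patterns it contains."""
    return {value_pattern(v) for v in values}


def instance_similarity(column1, column2):
    """Jaccard overlap of the two pattern profiles."""
    p1, p2 = profile(column1), profile(column2)
    if not (p1 | p2):
        return 0.0
    return len(p1 & p2) / len(p1 | p2)


# Hypothetical instance data for three unnamed columns.
phones_s1 = ["706-555-0101", "404-555-0199"]
phones_s2 = ["770-555-0123"]
zips_s2 = ["30602", "30303"]
print(instance_similarity(phones_s1, phones_s2))
print(instance_similarity(phones_s1, zips_s2))
```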
Rule Based Solutions Rule-Based: hand-crafted rules that exploit schema information such as element names, data types, structures, and subelements. E.g.: two elements match if they have the same name and the same number of subelements. Doan A, Halevy A. Semantic Integration Research in the Database Community: A Brief Survey.
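The example rule on this slide is small enough to write out directly. The `Element` class is a hypothetical stand-in for a schema element, and comparing names case-insensitively is an assumption.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Element:
    name: str
    subelements: List["Element"] = field(default_factory=list)


def rule_match(e1: Element, e2: Element) -> bool:
    """Rule: two elements match if they have the same name
    and the same number of subelements."""
    same_name = e1.name.lower() == e2.name.lower()  # assumption: ignore case
    return same_name and len(e1.subelements) == len(e2.subelements)


address1 = Element("Address", [Element("Street"), Element("City")])
address2 = Element("address", [Element("Road"), Element("Town")])
address3 = Element("Address", [Element("Street")])
print(rule_match(address1, address2), rule_match(address1, address3))
```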
Learner Based Solutions Learner-Based: exploit both schema information and instance data. Requires a lot of training data. Combined, rule-based and learner-based techniques provide an effective matching solution. Doan A, Halevy A. Semantic Integration Research in the Database Community: A Brief Survey.
Combining Different Matchers The ideal matching system must exploit many different types of information and techniques for maximum accuracy. More match candidates will be produced if the previous approaches are combined. Two Combination Methods: 1. Hybrid: integrates multiple matching criteria. Better performance. 2. Composite: combines the results of independently executed matchers. More flexible. Can be done automatically or manually. Bernstein P, Rahm E. A survey of approaches to automatic schema matching
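A composite matcher can be sketched as a function that runs independent matchers and merges their scores. The weighted average used here is only one possible combination strategy, and the two toy matchers, the `(name, type)` element tuples, and the weights are assumptions.

```python
def name_matcher(e1, e2):
    """Toy matcher: 1.0 when the names are equal ignoring case."""
    return 1.0 if e1[0].lower() == e2[0].lower() else 0.0


def type_matcher(e1, e2):
    """Toy matcher: 1.0 when the declared data types are equal."""
    return 1.0 if e1[1] == e2[1] else 0.0


def composite(matchers, weights, e1, e2):
    """Combine independently executed matchers by a weighted average."""
    total = sum(weights)
    return sum(w * m(e1, e2) for m, w in zip(matchers, weights)) / total


# Elements are (name, data type) tuples in this sketch.
s1_zip = ("Zip", "string")
s2_zip = ("zip", "int")
print(composite([name_matcher, type_matcher], [0.75, 0.25], s1_zip, s2_zip))
```

Because each matcher is a plain function, one can be added, dropped, or reweighted without touching the others, which is the flexibility the slide attributes to composite matchers.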
LSD (Univ. of Washington) Learning Source Descriptions Uses machine learning techniques to match a new data source against a previously determined global schema. Uses a name matcher and several instance-level matchers. System is trained with sample user inputs and it learns patterns and matching rules. Mostly instance-oriented but can use schema information too. Also supports user input domain constraints on the global schema. Bernstein P, Rahm E. A survey of approaches to automatic schema matching
SKAT (Stanford University) Semantic Knowledge Articulation Tool Follows a rule-based approach to semi-automatically determine matches between two ontologies. User input required: * The user must provide application specific match/mismatch relations. * The user must approve or reject matches. SKAT matching is used within the ONION architecture for ontology integration. In ONION, an “articulation ontology” is constructed from the rules. Matching is based on is-a relationships between the articulation ontology and the source ontology. Bernstein P, Rahm E. A survey of approaches to automatic schema matching
TransScm (Tel Aviv University) Uses schema matching to derive an automatic data translation between schema instances. Schemas are transformed into labeled graphs. Matching is performed node by node (element-level, 1:1) starting at the top. Requires user intervention if no match is found (i.e. to provide a new rule). Bernstein P, Rahm E. A survey of approaches to automatic schema matching
DIKE (Univ. of Reggio Calabria, Univ. of Calabria) Compares pairs of objects by their attributes and the is-a relationships that they are involved in. These pairs are given a match score between 0 and 1. User must specify synonyms, homonyms, and inclusion properties. Bernstein P, Rahm E. A survey of approaches to automatic schema matching
Cupid (Microsoft Research) Hybrid matcher Element and Structural-Level matches. Phase 1: Linguistic Element-Level. - categorizes elements based on name, data types, and domains. - calculates a linguistic similarity coefficient. Phase 2: - transform the original schema into a tree then perform a bottom-up structure matching. - calculates a similarity value. - calculates a weighted mean of linguistic and structural similarity of pairs of elements Phase 3: - uses the mean from phase 2 to decide on a mapping. Bernstein P, Rahm E. A survey of approaches to automatic schema matching
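The weighted mean of phase 2 and the decision of phase 3 can be illustrated as follows. The weight `w_struct`, the acceptance `threshold`, and the sample similarity values are assumptions for the sketch, not values taken from Cupid.

```python
def weighted_similarity(lsim, ssim, w_struct=0.5):
    """Weighted mean of linguistic (lsim) and structural (ssim) similarity."""
    return w_struct * ssim + (1 - w_struct) * lsim


def decide_mapping(candidates, w_struct=0.5, threshold=0.5):
    """Keep the element pairs whose weighted similarity reaches the threshold.

    candidates maps (s1_element, s2_element) to a (lsim, ssim) tuple.
    """
    accepted = {}
    for pair, (lsim, ssim) in candidates.items():
        wsim = weighted_similarity(lsim, ssim, w_struct)
        if wsim >= threshold:
            accepted[pair] = wsim
    return accepted


# Hypothetical similarity values for two candidate pairs.
candidates = {
    ("Zip", "PostalCode"): (0.4, 0.9),
    ("Zip", "CPhone"): (0.1, 0.2),
}
print(decide_mapping(candidates))
```

In this example the pair Zip / PostalCode is accepted on structural evidence even though its linguistic score alone is below the threshold.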
Clio (IBM Almaden and Univ. of Toronto) Aims at a semi-automatic creation of match mappings between a given target schema and a new data source schema. Three Components: Schema Readers: read schema and translate it into an internal representation. Correspondence Engine: is used to identify matching parts of the schemas or databases. Mapping Generator: generates view definitions to map data in the source schema to data in the target schema. Bernstein P, Rahm E. A survey of approaches to automatic schema matching
Similarity flooding (Stanford Univ. and Univ. of Leipzig) Graph Matching Algorithm. Converts schemas into directed labeled graphs and determines the matches between corresponding nodes of the graphs. Uses a name matcher to get an initial element-level match that is then given to the structural matcher. Bernstein P, Rahm E. A survey of approaches to automatic schema matching
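The idea of feeding initial name-based scores into a structural matcher can be sketched with a heavily simplified propagation loop. This is not the published algorithm: the uniform propagation weights, the fixed iteration count, and the toy graphs and initial scores are all assumptions. Node pairs are connected when both graphs have an edge with the same label, and each pair repeatedly passes its similarity on to its neighbours.

```python
def similarity_flooding(edges1, edges2, initial, iterations=10):
    """Simplified similarity propagation over two labeled graphs.

    edges1, edges2: lists of (source, label, target) triples.
    initial: {(node1, node2): score} from an element-level name matcher.
    """
    # Pairs (a, b) and (a2, b2) are neighbours when a -label-> a2 in graph 1
    # and b -label-> b2 in graph 2 carry the same label.
    neighbors = {}
    for a, label1, a2 in edges1:
        for b, label2, b2 in edges2:
            if label1 == label2:
                neighbors.setdefault((a, b), set()).add((a2, b2))
                neighbors.setdefault((a2, b2), set()).add((a, b))
    if not neighbors:
        return {}
    sigma = {pair: initial.get(pair, 0.0) for pair in neighbors}
    for _ in range(iterations):
        updated = {}
        for pair in sigma:
            # Each neighbour spreads its similarity evenly over its own neighbours.
            incoming = sum(sigma[n] / len(neighbors[n]) for n in neighbors[pair])
            updated[pair] = sigma[pair] + incoming
        top = max(updated.values()) or 1.0
        sigma = {pair: value / top for pair, value in updated.items()}  # normalize
    return sigma


edges1 = [("Cust", "attr", "C#"), ("Cust", "attr", "CName")]
edges2 = [("Customer", "attr", "CustID"), ("Customer", "attr", "Company")]
initial = {("Cust", "Customer"): 0.6, ("C#", "CustID"): 0.5}
sigma = similarity_flooding(edges1, edges2, initial)
for pair, score in sorted(sigma.items(), key=lambda item: -item[1]):
    print(pair, round(score, 3))
```

Pairs that started with no name similarity, such as CName / Company, end up with a positive score because their parent pair Cust / Customer matched.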
Delta (Mitre) Uses attribute descriptions to determine attribute matches. The method is to group the metadata about an attribute into a text string, which is treated as a document. The user is then presented with other ‘documents’ with matching attributes and can choose from those. Bernstein P, Rahm E. A survey of approaches to automatic schema matching
Tess (Univ. of Massachusetts, Amherst) System for helping to cope with schema evolution. Takes a definition of the old schema and produces a program that will transform data that conforms to the old schema into data that conforms to the new schema. Bernstein P, Rahm E. A survey of approaches to automatic schema matching
MWSAF: Meteor-S Web Service Annotation Framework LSDIS Lab, UGA What is it? A tool for semi-automatically marking up web service descriptions with ontologies. It helps in describing services semantically and aids in efficient web service discovery and composition.
MWSAF Annotation Tool Input: WSDL File 1. Individual elements of the WSDL are matched to concepts in the domain 2. The WSDL is classified into a domain. 3. The Matches are given to the user to accept or reject. 4. Upon the user’s acceptance, the annotations are written to the WSDL. Output: WSDL File with semantic annotations
MWSAF Architecture Main Components of the System: 1. Ontology Store: stores the DAML and RDF ontologies that will be used to annotate the WSDL files. Ontologies are categorized by domain. 2. Parser Library: consists of the parsers used to generate the SchemaGraphs. 3. Matcher Library: provides schema matching algorithm. Patil A, Oundhakar S, Sheth A, Verma K. METEOR-S Web service Annotation Framework
MWSAF Schema Graphs PROBLEM: The difference in expressiveness of XML Schema and ontology makes it very difficult to match these two models directly. MWSAF converts both models to a common representation format called SchemaGraph. A SchemaGraph is a set of nodes connected by edges that are created using conversion functions. Then it applies a matching algorithm to find the mappings between them. Patil A, Oundhakar S, Sheth A, Verma K. METEOR-S Web service Annotation Framework
MWSAF: Meteor-S Web Service Annotation Framework XML to SchemaGraph conversion rules <xsd:element maxOccurs="1" minOccurs="1" name="compass" nillable="true" type="xsd1:DirectionCompass" /> <xsd:element maxOccurs="1" minOccurs="1" name="degrees" type="xsd:int" /> [Figure: SchemaGraph representation of XML schema; nodes Direction, degrees, DirectionCompass, compass; edge label hasElement] Patil A, Oundhakar S, Sheth A, Verma K. METEOR-S Web service Annotation Framework.
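A rough sketch of one such conversion rule: every element declared inside a complex type becomes a node linked to the type's node by a hasElement edge. The enclosing `complexType` named Direction, the `urn:example` namespace, and the function `schema_graph_edges` are assumptions added to make the two slide elements parseable; the actual MWSAF rule set is richer than this.

```python
import xml.etree.ElementTree as ET

XSD = "http://www.w3.org/2001/XMLSchema"

# The two xsd:element declarations come from the slide; the wrapper around
# them is assumed for the sake of a well-formed document.
SNIPPET = f"""
<xsd:schema xmlns:xsd="{XSD}" xmlns:xsd1="urn:example">
  <xsd:complexType name="Direction">
    <xsd:sequence>
      <xsd:element maxOccurs="1" minOccurs="1" name="compass" nillable="true" type="xsd1:DirectionCompass"/>
      <xsd:element maxOccurs="1" minOccurs="1" name="degrees" type="xsd:int"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:schema>
"""


def schema_graph_edges(xsd_text):
    """Return (complex type, 'hasElement', element) edges for a schema."""
    root = ET.fromstring(xsd_text)
    edges = []
    for ctype in root.iter(f"{{{XSD}}}complexType"):
        for elem in ctype.iter(f"{{{XSD}}}element"):
            edges.append((ctype.get("name"), "hasElement", elem.get("name")))
    return edges


edges = schema_graph_edges(SNIPPET)
print(edges)
```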
MWSAF: Meteor-S Web Service Annotation Framework Ontology to SchemaGraph conversion rules Ontology excerpt: "Superclass for all events dealing with wind"; "Wind event"; "Wind direction"; "Wind speed". [Figure: SchemaGraph representation of part of ontology; nodes WindEvent, windDirection, windSpeed; edge label hasProperty] Patil A, Oundhakar S, Sheth A, Verma K. METEOR-S Web service Annotation Framework.
Mapping Measures of the Match Score: -Element Level Match: linguistic similarity of two concepts based on names. Uses WordNet to check for synonyms. Abbreviations are even checked. -Schema Match: structural similarity, sub-concept similarities. The getBestMapping function then looks at the Match Scores and determines a map set. Patil A, Oundhakar S, Sheth A, Verma K. METEOR-S Web service Annotation Framework
MWSAF Matching Techniques: ElemMatch Name and String Matching algorithms: -NGram: considers the number of q-grams that the names have in common. -CheckSynonym: uses WordNet to find synonyms. -CheckAbbreviations: uses an abbreviation dictionary. -TokenMatcher: uses Porter Stemmer tokenization and substring matching techniques. Each algorithm returns a value between 0 and 1. These values are used in an equation for the final match score. Patil A, Oundhakar S, Sheth A, Verma K. METEOR-S Web service Annotation Framework
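To make the q-gram idea concrete, the sketch below scores two names by the q-grams they share. The choice q = 3 and the Dice-style normalization to the range 0 to 1 are assumptions; the slide only says that the number of common q-grams is considered.

```python
def qgrams(name, q=3):
    """Set of q-character substrings of a lower-cased name."""
    text = name.lower()
    if len(text) < q:
        return {text}
    return {text[i:i + q] for i in range(len(text) - q + 1)}


def ngram_score(name1, name2, q=3):
    """Share of q-grams the two names have in common, between 0 and 1."""
    g1, g2 = qgrams(name1, q), qgrams(name2, q)
    return 2 * len(g1 & g2) / (len(g1) + len(g2))


print(ngram_score("windSpeed", "WindSpeed"))
print(round(ngram_score("windSpeed", "windDirection"), 3))
print(ngram_score("windSpeed", "degrees"))
```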
Matching Once each WSDL has been compared against all of the ontologies in the store and a mapping has been created for each ontology, two measures are derived from the mapping: -Average Concept Match: tells the user about the degree of similarity between matched concepts of the WSDL and ontology. -Average Service Match: helps to categorize the service. *We have a machine learning alternative for categorization! Patil A, Oundhakar S, Sheth A, Verma K. METEOR-S Web service Annotation Framework
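One plausible reading of the two measures is sketched below, under the assumption that Average Concept Match averages the match scores over the mapped concepts only, while Average Service Match divides the same sum by the total number of concepts in the WSDL. The function names, the sample scores, and the rule of picking the domain with the highest Average Service Match are illustrative assumptions.

```python
def average_concept_match(match_scores):
    """Mean match score over the WSDL concepts that were mapped."""
    if not match_scores:
        return 0.0
    return sum(match_scores.values()) / len(match_scores)


def average_service_match(match_scores, total_wsdl_concepts):
    """Sum of match scores divided by all concepts in the WSDL, mapped or not."""
    if total_wsdl_concepts == 0:
        return 0.0
    return sum(match_scores.values()) / total_wsdl_concepts


def best_domain(mappings, total_wsdl_concepts):
    """Pick the ontology whose mapping gives the highest average service match."""
    return max(
        mappings,
        key=lambda name: average_service_match(mappings[name], total_wsdl_concepts),
    )


# Hypothetical mappings of a 4-concept WSDL against two ontologies.
mappings = {
    "weather": {"windSpeed": 0.9, "windDirection": 0.7, "temperature": 0.8},
    "geography": {"location": 0.9},
}
for name, scores in mappings.items():
    print(name, average_concept_match(scores), average_service_match(scores, 4))
print(best_domain(mappings, 4))
```

With these numbers the geography ontology has the higher concept-level average, because its single match is a good one, but the weather ontology covers more of the WSDL and is therefore chosen as the domain.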
Current and Future Issues User Interaction: minimize user input but maximize impact of the feedback Real World Analysis: can the current matching techniques be used in real world situations? P2P data management Mapping Maintenance: what happens when you map between two schemas and then one changes? Developing global schemas (or ontologies) for domains. Dealing with inconsistent data values for a schema element. Doan A, Halevy A. Semantic Integration Research in the Database Community: A Brief Survey.
More Issues If we require user acceptance for our matches, then what happens if our matcher returns hundreds or thousands of matches? Is it unrealistic to think that we will eventually perfect our matchers? Doan A, Halevy A. Semantic Integration Research in the Database Community: A Brief Survey.
Conclusion It is necessary to automate the matching process. Schema matching is very difficult and expensive. We have looked at a taxonomy and the descriptions of the existing approaches for matching. -Schema vs Instance-level -Element vs Structure-level -Language and Constraint based matchers. We also discussed several implementations of the matching techniques.
References
Bernstein P, Rahm E. A survey of approaches to automatic schema matching.
Doan A, Halevy A. Semantic Integration Research in the Database Community: A Brief Survey.
Patil A, Oundhakar S, Sheth A, Verma K. METEOR-S Web service Annotation Framework. POSV-WWW2004.pdf
Christophides V. Integrating XML Data Sources using RDF/S Schemas: The ICS-FORTH Semantic Web Integration Middleware (SWIM). Dagstuhl Seminar. ftp://ftp.dagstuhl.de/pub/Proceedings/04/04391/04391.ChristophidesVassilis.Slides.pdf
Questions ?