Semantic Mappings for Data Mediation


1 Semantic Mappings for Data Mediation
Jayant Madhavan, University of Washington. Joint work with AnHai Doan, Pedro Domingos, and Alon Halevy. Good afternoon. I am Jayant Madhavan, and this talk is on Semantic Mappings for Data Mediation, joint work with AnHai Doan and professors Pedro Domingos and Alon Halevy. I shall start with some motivation for the problem, and then sketch a brief outline of two systems that we have built here at UW.

2 Find houses with 2 bedrooms priced under 300K Charlie comes to town
homes.com realestate.com homeseekers.com I shall start with a motivating example. Charlie comes to Seattle and decides that he needs to buy a house: a 2-bedroom one priced in the 300K range. Being a man of the information age, he decides to enlist the help of web-based agencies. Every week he looks up listings at sites such as realestate.com. Frustration soon creeps in: he has to go to each site and redo the search all over again. 26th February, 2002 Affiliates Meeting

3 Find houses with 2 bedrooms
Data Integration Find houses with 2 bedrooms priced under 300K mediated schema source schema 1 source schema 2 source schema 3 Instead, Charlie decides to use a data integration system, one that integrates sources for real estate listings. Here is how it works: each site publishes its listings in its own source schema. The system has a mediated schema, like a unified view. Charlie's queries are posed over the mediated schema, and the system translates them into queries over the different sources; the results are later combined. The green arrows enable this translation. They are what we shall call mappings: they identify corresponding data entities in the mediated schema and the source schemas. Our general approach to data integration is to manually provide mappings from a few (two or three) data sources to the mediated schema; the rest are learned automatically. So with 100 sources, we map 3 by hand and the other 97 are done automatically. wrapper wrapper wrapper realestate.com homeseekers.com homes.com

4 Semantic Mappings between Schemas
Mediated schema address agent-name agent-city agent-state 1-1 mapping complex mapping homes.com area contact-name contact-address Denver, CO Laura Smith Boulder, CO Oakland, CA Jean Brown Davis, CA Here is a closer look at an example mapping. There are 1-1 mappings and more complex mappings; our focus is on 1-1 mappings.
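The distinction on the slide can be sketched in code. This is a hypothetical illustration, not anything from LSD itself: the dictionary and the splitting rule are invented, using the homes.com element names shown on the slide.

```python
# Hypothetical sketch of 1-1 vs. complex mappings between the homes.com
# source schema and the mediated schema (element names from the slide).

# 1-1 mapping: one source element corresponds to one mediated element.
ONE_TO_ONE = {
    "area": "address",
    "contact-name": "agent-name",
}

def apply_mappings(row):
    # Apply the 1-1 mappings directly.
    out = {mediated: row[src] for src, mediated in ONE_TO_ONE.items()}
    # Complex mapping: agent-city and agent-state are both derived from
    # the single source element contact-address, by splitting on ",".
    city, state = [s.strip() for s in row["contact-address"].split(",")]
    out["agent-city"] = city
    out["agent-state"] = state
    return out

row = {"area": "Denver, CO", "contact-name": "Laura Smith",
       "contact-address": "Boulder, CO"}
print(apply_mappings(row))
```

The point of the example: a 1-1 mapping is a simple renaming, while a complex mapping requires a transformation over one or more source elements.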

5 Why Schema Matching is Important
Enterprise 1 Data integration Data translation Data warehousing E-commerce World-Wide Web Ontology Matching Mappings are relevant in many application areas. Other examples: data translation, such as migrating data from a legacy database to a new format. E-commerce: mediation between two services that use different XML formats. Ontology matching: ontologies are rich schemas that have recently been proposed as a means of tagging data, to enable automatic interpretation by software agents. Mapping between multiple ontologies is a necessity if the vision of automatic reasoning over the web is to become a reality. In short, any application that has more than one schema needs schema matching! Knowledge Base 1 Knowledge Base 2 Information agent Enterprise 2 Home users

6 Why Schema Matching is Difficult
No access to exact semantics of concepts Semantics not documented in sufficient detail Schemas not adequately expressive to capture semantics Must rely on clues in schema & data Using names, structures, types, data values, etc. Such clues can be unreliable Synonyms: different names => same entity: area & address => location Homonyms: same name => different entities: area => location or square-feet Done manually by domain experts Expensive and time-consuming Obtaining these mappings would be easy if we were aware of the exact semantics of each and every data entity or concept. This is not the case: schemas are not well documented, and a schema may not reveal the complete semantics. We must rely on heuristics that use the many different clues available in the schemas and the data, such as the names of entities, their data types, sample values, etc. Among other things, these clues are often unreliable: synonymy and homonymy. For the most part, matching is done manually by domain experts, and the process is usually expensive and time-consuming.

7 Previous work Mostly ad-hoc heuristics Graph matching
Name matchers Data types Sample domain values Graph matching Schemas as labeled graphs No single heuristic works across scenarios Systems are fragile and need a lot of tuning Most prior work has been in the form of ad-hoc heuristics: name matchers that look for synonyms, common suffixes/prefixes, etc.; some that use data types; and others that use sample domain values. There is also a body of work on graph-matching approaches that interpret schemas as labeled graphs. Some systems combine different approaches, but in an ad-hoc fashion. These heuristics are successful to varying extents in different scenarios; no single one works everywhere, and the resulting systems are fragile and need a lot of tuning.

8 How do we go about it? Make extensive use of data instances
Incorporate multiple heuristics Base learners that implement individual heuristics Machine learning Multi-strategy learning to combine base learners Extensible framework Easy to add new heuristics/learners Generic and domain-specific constraints Robust solution with high accuracy We have built matching solutions for two scenarios. The basic underlying characteristics: extensive use of data instances, as opposed to just schema information; multiple heuristics, each implemented in our systems as a learning algorithm; predictions of the base learners combined by multi-strategy learning; ease of adding new heuristics/learners; and ease of adding a large set of expressive constraints. The result is a robust solution with high accuracy.

9 If “office” occurs in the name => office-phone
Multiple hypotheses Mediated schema address price agent-name agent-phone office-phone description realestate.com location price contact-name contact-phone office comments Miami, FL $250K James Smith (305) (305) Fantastic house Boston, MA $320K Mike Doan (617) (617) Great location If “fantastic” & “great” occur frequently in data instances => description Here are examples of how multiple heuristics work. There are two sources of information: data contents and concept names. First, a content matcher example; then a name matcher. Note that the content matcher will not work here. The hypotheses are then combined. If “office” occurs in the name => office-phone Content matcher Name matcher
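The two hypotheses on the slide can be written as toy rules. This is not the LSD implementation; it is a minimal sketch of the idea, with the trigger words and the frequency threshold invented for illustration.

```python
# Toy versions of the slide's two heuristics: a name matcher keyed on
# words in the element name, and a content matcher keyed on words that
# occur frequently in the data instances.

def name_matcher(element_name):
    # Hypothesis from the slide: if "office" occurs in the name,
    # predict office-phone.
    if "office" in element_name.lower():
        return "office-phone"
    return None

def content_matcher(values):
    # Hypothesis from the slide: if "fantastic" and "great" occur
    # frequently in the data instances, predict description.
    # "Frequently" here means at least one hit per two instances
    # (an invented threshold, purely for illustration).
    text = " ".join(values).lower()
    hits = sum(text.count(w) for w in ("fantastic", "great"))
    return "description" if hits >= len(values) / 2 else None

print(name_matcher("office"))
print(content_matcher(["Fantastic house", "Great location"]))
```

Each heuristic alone is unreliable (the name matcher fires on "office" even when the element holds addresses); combining several such hypotheses is the point of the next slides.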

10 Base Learners Mediated schema realestate.com
address price agent-name agent-phone office-phone description realestate.com location price contact-name contact-phone office comments (“location”, address) (“price”, price) (“contact name”, agent-name) (“contact phone”, agent-phone) (“office”, office-phone) (“comments”, description) Name Learner Miami, FL $250K James Smith (305) (305) Fantastic house Boston, MA $320K Mike Doan (617) (617) Great location Content Learner (“Miami, FL”, address) (“$250K”, price) (“James Smith”, agent-name) (“(305) ”, agent-phone) (“(305) ”, office-phone) (“Fantastic house”, description) (“Boston,MA”, address) Here is how we train our base learners. As a reminder, we manually map our first few data sources to the mediated schema. We extract training data for the Content Learner by taking data instances and assigning to each its corresponding label in the mediated schema. The same is done for the Name Learner, but using just the element names. This results in learned hypotheses like the ones shown earlier.

11 Learning Source Descriptions (LSD) [SIGMOD’01]
Training Phase Mediated schema Source schemas Matching Phase Meta-Learner Base-Learner1 … Base-Learnerk Predictions for data instances Training data for base learners Hypothesis1 … Hypothesisk Prediction Combiner Predictions for elements Constraint Handler Mappings Domain constraints Weights for Base Learners Here is a look at the complete LSD solution for learning semantic mappings for data integration. In the training phase, the system is trained using manual mappings, and training data is extracted for the base learners; the meta-learner is trained using stacking to obtain weights that indicate the importance of each individual learner for making each prediction in the mediated schema. In the matching phase, given a new data source, we apply each individual base learner and then combine the predictions using the meta-learner, first at the data-instance level and then at the element (concept) level. A set of constraints, which can be specified by the user, is applied to prune away candidate mappings. Finally, the system produces mappings, which can be corrected by user feedback and fed back into the system as constraints.
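The prediction-combination step can be sketched as a weighted vote. This is a simplification of LSD's stacking-based meta-learner: the function, the per-learner weights, and the scores below are all invented for illustration (LSD learns per-label weights from the training sources).

```python
# Minimal sketch of combining base-learner predictions with learned
# per-learner weights, in the spirit of LSD's meta-learner.

def combine(base_predictions, weights):
    """base_predictions: {learner: {label: confidence}}.
    Returns the mediated-schema label with the highest weighted score."""
    totals = {}
    for learner, scores in base_predictions.items():
        w = weights.get(learner, 0.0)
        for label, score in scores.items():
            totals[label] = totals.get(label, 0.0) + w * score
    return max(totals, key=totals.get)

# Invented example: the name learner is confident and trusted more.
preds = {
    "name":    {"office-phone": 0.9, "agent-phone": 0.1},
    "content": {"agent-phone": 0.6, "office-phone": 0.4},
}
weights = {"name": 0.7, "content": 0.3}
print(combine(preds, weights))
```

With these numbers, office-phone scores 0.7*0.9 + 0.3*0.4 = 0.75 against 0.25 for agent-phone, so the trusted name learner wins the vote.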

12 LSD’s performance LSD’s accuracy: 71 - 92%
Avg. Matching Accuracy (%) LSD’s accuracy: % Best single base learner: % + Meta-learner: % + Constraint handler: % Complete LSD system: % 26th Febraury, 2002 Affiliates Meeting

13 Matching Ontologies of Concepts
CS Dept Australia CS Dept U.S. Undergrad Courses Grad Courses Courses People Staff Faculty Staff Academic Staff Technical Staff Senior Lecturer Assistant Professor Associate Professor Professor Lecturer Professor This is a different problem: ontology matching. An ontology is a conceptualization of a domain that identifies the concepts in the domain and their inter-relationships. Each ontology has an inheritance tree (taxonomy) that identifies generalizations/specializations, with data instances at the leaves. For each concept, we find the most similar concept in the other ontology.

14 The Glue System [WWW’2002] No manually performed mappings
Automatically collect training data for the base learners. Similarity measures are computed from the joint probability distribution of concepts: a random data instance can belong to both, either, or neither concept, giving P(A,B), P(A,B’), P(A’,B), P(A’,B’). General framework for incorporating constraints: an extension of relaxation labeling.
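One concrete similarity function over the joint distribution is the Jaccard coefficient, P(A and B) / P(A or B). The sketch below shows how it can be computed from the four joint probabilities; the instance counts used to estimate them are invented for illustration.

```python
# Sketch of a similarity measure computed from the joint probability
# distribution of two concepts A and B, in the spirit of Glue. The
# Jaccard coefficient is one choice of similarity function over
# P(A,B), P(A,B'), P(A',B), P(A',B').

def jaccard_sim(p_ab, p_a_notb, p_nota_b):
    """P(A and B) / P(A or B), from the joint distribution."""
    union = p_ab + p_a_notb + p_nota_b
    return p_ab / union if union else 0.0

# Estimate the joint distribution by classifying a shared set of data
# instances into each concept and counting (counts invented here):
n_total = 100   # instances examined
n_both = 30     # classified into both A and B
n_a_only = 10   # into A but not B
n_b_only = 20   # into B but not A

sim = jaccard_sim(n_both / n_total, n_a_only / n_total, n_b_only / n_total)
print(round(sim, 2))
```

Note that P(A',B') drops out of this particular measure: instances belonging to neither concept say nothing about how similar A and B are.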

15 The Glue System Relaxation Labeling Similarity Estimator Meta Learner
Mappings for O1, Mappings for O2 Relaxation Labeling Similarity Matrix Common Knowledge & Domain Constraints Similarity Estimator Joint Probability Distribution P(A,B), P(A’,B), … Similarity Function Meta-Learner Distribution Estimator Base Learner Base Learner Taxonomy O1 (tree structure + data instances) Taxonomy O2 (tree structure + data instances)
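The last stage of the pipeline, relaxation labeling, can be sketched in miniature. This is a drastically simplified version of the general technique, not Glue's extension of it: label probabilities at each node are iteratively re-weighted by how compatible each label is with the current labels of neighboring nodes. The graph, compatibility numbers, and starting probabilities below are all invented.

```python
# Minimal relaxation-labeling sketch: nodes hold probability
# distributions over labels; each iteration multiplies in support from
# neighbors (via a label-compatibility table) and renormalizes.

def relax(probs, neighbors, compat, iters=10):
    for _ in range(iters):
        new = {}
        for node, dist in probs.items():
            support = {}
            for label, p in dist.items():
                s = 1.0
                for nb in neighbors[node]:
                    s *= sum(compat[(label, nl)] * np_
                             for nl, np_ in probs[nb].items())
                support[label] = p * s
            z = sum(support.values())
            new[node] = {l: v / z for l, v in support.items()}
        probs = new
    return probs

# Two linked concepts; the constraint says neighbors prefer equal labels,
# so b's initially undecided distribution is pulled toward a's choice.
probs = {"a": {"x": 0.6, "y": 0.4}, "b": {"x": 0.5, "y": 0.5}}
neighbors = {"a": ["b"], "b": ["a"]}
compat = {("x", "x"): 0.9, ("y", "y"): 0.9,
          ("x", "y"): 0.1, ("y", "x"): 0.1}
result = relax(probs, neighbors, compat)
print(max(result["b"], key=result["b"].get))
```

The same mechanism lets Glue fold in domain constraints and common knowledge: they enter as compatibility terms that raise or lower the support a candidate mapping receives from its neighborhood.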

16 Glue’s performance 26th Febraury, 2002 Affiliates Meeting

17 Conclusion and Future Work
LSD and Glue perform well They combine the predictions of different base learners and incorporate constraints A robust solution that achieves good accuracy Future Work A representation mapping system that incorporates various heuristics, can perform complex mappings, and can learn with experience Reasoning about mappings Does a mapping enable answering queries posed on the other schema? Is one mapping implied by another? Is a mapping minimal? Can mappings be composed?

