eTuner: Tuning Schema Matching Software using Synthetic Scenarios

Presentation transcript:

eTuner: Tuning Schema Matching Software using Synthetic Scenarios
Mayssam Sayyadian, Yoonkyong Lee, AnHai Doan, University of Illinois, USA; Arnon Rosenthal, MITRE Corp., USA

Thank you, ..., for your introduction, and thank you all for coming to my talk. Today I'm going to talk about the problem of mapping between data representations. This is joint work with my advisors, Alon Halevy and Pedro Domingos, and my colleague, Jayant Madhavan, at the University of Washington. Now, I will show you shortly that the problem of mapping representations arises everywhere. But first, let me motivate it by grounding it in a very specific application context, which is DATA INTEGRATION.

Main Points
- Tuning matching systems: a long-standing problem that is becoming increasingly worse
- We propose a principled solution: it exploits synthetic input/output pairs; promising, though much work remains
- The idea is applicable to other contexts

Schema Matching

[Slide figure: Schema 1 (price, agent-name, address) with listings such as (120,000; George Bush; Crawford, TX) and (239,900; Hillary Clinton; New York City, NY); Schema 2 (listed-price, contact-name, city, state) with listings such as (320K; Jane Brown; Seattle; WA) and (240K; Mike Smith; Miami; FL). Arrows mark a 1-1 match (price = listed-price) and a complex match (address = concat(city, state)).]

However, before doing that, let's get a better feel for the problem by defining it in more detail. For simplicity, let's assume that both the mediated schema and the source schemas use the RELATIONAL representation. For example, here's a mediated schema in the real-estate domain, with elements price, agent-name, and address. And here's the source homes.com, which exports house listings in this relational table; each row in this table corresponds to a house listing. Throughout the talk, we color elements of the mediated schema in red, and elements of source schemas in blue. Now, given a mediated schema and a source schema, the schema-matching problem is to find semantic mappings between the elements of the two schemas. The simplest type of mapping is the 1-1 mapping, such as price to listed-price, and agent-name to contact-name. But 1-1 mappings make up only a portion of the semantic mappings found in practice. There are also many complex mappings, such as address being the concatenation of city and state, or number of bathrooms being the number of full baths plus the number of half baths. In this talk, we shall focus first on finding 1-1 mappings, then on finding complex mappings.

Schema Matching is Ubiquitous
- Databases: data integration, model management, data translation, collaborative data sharing, keyword querying, schema/view integration, data warehousing, peer data management, ...
- AI: knowledge bases, ontology merging, information-gathering agents, ...
- Web: e-commerce, Deep Web, Semantic Web
- eGovernment, bio-informatics, scientific data management

But first let's take a step back and ask: if you are not buying a house, why should you care about this problem? The answer is that you should, because it is a fundamental problem in many areas and in numerous applications. Given any domain, if you ask two people to describe it, they will almost certainly use different terminologies. Thus any application that involves more than one such description must establish semantic mappings between them, in order to achieve INTEROPERABILITY. As a consequence, variations of this problem arise everywhere. It has been a long-standing problem in databases and is becoming increasingly critical. It arises in AI, in the context of ontology merging and information gathering on the Internet. It arises in e-commerce, as the problem of matching catalogs. It is also a fundamental problem in the context of the Semantic Web, which tries to add more structure to the Web by marking up data using ontologies; there we have the problem of matching ontologies. Now, if this problem is so important, why has no one solved it? [REPLACE SLIDE]

Current State of Affairs
- Finding semantic mappings is now a key bottleneck! It is largely done by hand: labor intensive and error prone
- Numerous matching techniques have been developed
- Databases: IBM Almaden, Microsoft Research, BYU, George Mason, U Leipzig, U Wisconsin, NCSU, U Illinois, U Washington, Humboldt-Universität zu Berlin, ...
- AI: Stanford, Karlsruhe University, NEC Japan, ...
- Techniques are often synergistic, leading to multi-component matching architectures: each component employs a particular technique, and the final predictions combine those of the components

So, what do people do today with semantic mappings? Unfortunately, they still must create them by hand, in a very labor-intensive process. For example, Li & Clifton recently reported that at the phone company GTE, people tried to integrate 40 databases with a total of 27,000 elements, and they estimated that simply finding and documenting the semantic mappings would take 12 years, unless they had the owners of the databases around. Thus, finding semantic mappings has become a key bottleneck in building large-scale data management applications. And this problem is going to become even more critical as data sharing becomes more pervasive on the Web and at enterprises, and as the need for translating legacy data increases. Clearly, we need semi-automatic solutions to schema matching in order to scale up, and there has been a great deal of research on such solutions, in both databases and AI.

An Example: LSD [SIGMOD-01]

[Slide figure: the LSD architecture applied to two real-estate schemas. A Name Matcher and a Naive Bayes Matcher each score candidate matches, e.g., area => (address, 0.7), (description, 0.3); contact-agent => (agent-phone, 0.7), (agent-name, 0.3); comments => (address, 0.6), (desc, 0.4). A Combiner merges their scores, a Constraint Enforcer applies constraints such as "only one attribute of Schema 2 matches address", and a Match Selector outputs the final matches: area = address, contact-agent = agent-phone, ..., comments = desc.]

Here's an example to illustrate our approach. Consider a mediated schema with elements price, agent-name, agent-phone, and so on. To apply LSD, first the user selects a few sources to be the training sources. In this case, the user selects a single source, realestate.com [POINT]. Next, the user manually specifies the 1-1 mappings between the schema of this source and the mediated schema. These mappings are the five green arrows right here [POINT], which say that listed-price matches price, contact-name matches agent-name, and so on. Once the user has shown LSD these 1-1 mappings, there are many different types of information that LSD can learn from, in order to construct hypotheses on how to match schema elements. For example, LSD can learn from the names of schema elements. Knowing that office matches office-phone, it may construct the hypothesis [POINT] that if the word "office" occurs in the name of a schema element, then that element is likely to be office-phone. LSD can also learn from the data values. Because comments matches description, LSD knows that these data values here [POINT] are house descriptions. It can then examine them to learn that house descriptions frequently contain words such as "fantastic", "great", and "beautiful". Hence, it may construct the hypothesis [POINT] that if these words appear frequently in the data values of an element, then that element is likely to be a house description. LSD can also learn from the characteristics of value distributions. For example, it can look at the average value of this column [POINT], and learn that if the average value is in the thousands, then the element is more likely to be price than number-of-bathrooms. And so on. Now, consider the source homes.com, with these schema elements [POINT] and these data values [POINT]. LSD can apply the learned hypotheses to the schema and the data values, in order to predict semantic mappings. For example, because the words "beautiful" and "great" appear frequently in these data values, LSD can predict that extra-info matches description. ... and the solution to this is MULTI-STRATEGY LEARNING [PAUSE]
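The combination step described above can be sketched as a weighted sum over the base matchers' confidence scores. This is an illustrative sketch, not the actual LSD code; the matcher scores and weights below are hypothetical.

```python
# Illustrative sketch of a weighted-sum combiner: each base matcher
# returns confidence scores over mediated-schema elements, and the
# combiner merges them into one score per candidate match.

def combine(predictions, weights):
    """predictions: one dict {target_element: confidence} per matcher;
    weights: one weight per matcher."""
    combined = {}
    for pred, w in zip(predictions, weights):
        for target, conf in pred.items():
            combined[target] = combined.get(target, 0.0) + w * conf
    return combined

# Hypothetical scores for the source element "contact-agent":
name_matcher  = {"agent-phone": 0.1, "agent-name": 0.6}
bayes_matcher = {"agent-phone": 0.9, "agent-name": 0.2}
scores = combine([name_matcher, bayes_matcher], weights=[0.4, 0.6])
best = max(scores, key=scores.get)   # "agent-phone" wins here
```

The knobs a tuner must set are already visible in this tiny sketch: the weights, and which matchers to include at all.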

Multi-Component Matching Solutions
- Developed in many recent works, e.g., Doan et al., WebDB-00, SIGMOD-01; Do & Rahm, VLDB-02; Embley et al., 02; Bernstein et al., SIGMOD Record-04; Madhavan et al., 05
- Now commonly adopted, with industrial-strength systems, e.g., Protoplasm [MSR], COMA++ [Univ of Leipzig]

[Slide figure: the component pipelines of LSD, COMA, SF, and LSD-SF, each assembled from matchers (Matcher 1 ... Matcher n), a combiner, a constraint enforcer, and a match selector.]

Such systems are very powerful: they maximize accuracy and are highly customizable to individual domains... but they place a serious tuning burden on domain users.

Tuning Schema Matching Systems
Given a particular matching situation:
- how to select the right components?
- how to adjust the multitude of knobs?

[Slide figure: a library of matching components and their knobs, assembled into an execution graph. Matchers: q-gram name matcher, TF/IDF name matcher, decision tree matcher, Naive Bayes matcher, SVM matcher, ...; knobs of the decision tree matcher include the characteristics of attributes, the split measure, whether to post-prune, and the size of the validation set. Combiners: average, min, max, weighted sum. Constraint enforcers: A* search enforcer, relaxation labeler, ILP. Match selectors: threshold selector, bipartite graph selector.]

Untuned versions produce inferior accuracy, however ...

... Tuning is Extremely Difficult
- Large number of knobs: e.g., 8-29 in our experiments
- Wide variety of techniques: database, machine learning, IR, information theory, etc.
- Complex interaction among components
- Not clear how to compare the quality of knob configurations

Matching systems are still tuned manually, by trial and error, and multi-component systems make tuning even worse. Developing efficient tuning techniques is crucial to making matching systems attractive in practice.

The eTuner Solution
Given schema S and matching system M, eTuner:
- tunes M to maximize the average accuracy of matching S with future schemas
- incurs virtually no cost to the user

Key challenge 1: Evaluation. We must search for the "best" knob configuration, so how do we compute the quality of any knob configuration C? If we knew the ground-truth matches for a representative workload W = {(S,T1), ..., (S,Tn)}, then we could use W to evaluate C; but often we have no such W.

Key challenge 2: Search. How do we efficiently evaluate the huge space of knob configurations?
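The evaluation problem in Key challenge 1 can be made concrete with a small scoring sketch: given ground-truth matches for a workload, score a knob configuration by the average F1 of the matches the system produces under it. All names here are hypothetical; `run_matcher` stands in for executing the matching system M under configuration C.

```python
# Sketch: quality of a knob configuration = average F1 over a workload
# of (source schema, target schema, ground-truth matches) triples.

def f1(predicted, truth):
    """F1 of a predicted match set against the ground truth."""
    predicted, truth = set(predicted), set(truth)
    if not predicted or not truth:
        return 0.0
    p = len(predicted & truth) / len(predicted)   # precision
    r = len(predicted & truth) / len(truth)       # recall
    return 2 * p * r / (p + r) if p + r else 0.0

def config_quality(run_matcher, config, workload):
    """workload: list of (schema_S, schema_T, truth_matches)."""
    scores = [f1(run_matcher(S, T, config), truth)
              for S, T, truth in workload]
    return sum(scores) / len(scores)

# Toy demo with a stand-in matcher that finds one of two true matches:
toy_workload = [("S", "T", [("a", "x"), ("b", "y")])]
toy_matcher = lambda S, T, config: [("a", "x")]
quality = config_quality(toy_matcher, None, toy_workload)
```

With ground truth in hand this is mechanical; the whole point of the synthetic workload is to supply that ground truth for free.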

Key Idea: Generate Synthetic Input/Output Pairs
- Need workload W = {(S,T1), (S,T2), …, (S,Tn)}
- To generate W: start with S; perturb S to generate T1; perturb S to generate T2; etc.
- Knowing the perturbation => knowing the matches between S and Ti

Key Idea: Generate Synthetic Input/Output Pairs

[Slide figure: generating the synthetic workload from schema S, an EMPLOYEES table with columns id, first, last, salary ($). S is first split into V and U with disjoint data tuples. V is then perturbed in stages: perturb the number of tables, perturb the number of columns in each table (e.g., dropping first), perturb column and table names (EMPLOYEES becomes EMPS, last becomes emp-last, salary ($) becomes wage), and perturb the data tuples in each table. The result is schemas V1, ..., Vn, each paired with a known set of semantic matches Ω1, ..., Ωn; e.g., Ω1 = {EMPS.emp-last = EMPLOYEES.last, EMPS.id = EMPLOYEES.id, EMPS.wage = EMPLOYEES.salary($)}.]

Examples of Perturbation Rules
- Number of tables: merge two tables based on a join path; split a table into two
- Structure of a table: merge two columns (e.g., neighboring columns, or columns sharing a prefix/suffix, such as last-name and first-name); drop a column; swap the location of two columns
- Names of tables/columns: rules capture common name transformations, e.g., abbreviation to the first 3-4 characters, dropping all vowels, synonyms, dropping prefixes, adding the table name to a column name, etc.
- Data values: rules capture common format transformations (e.g., 12/4 => Dec 4); values are changed based on some distribution (e.g., Gaussian)

See paper for details.
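The name-perturbation rules above can be sketched as follows. This is a minimal illustration of the workload-generation idea, not eTuner's actual rule set; the two rules and the EMPLOYEES-style column list are assumptions for the example.

```python
# Minimal sketch: perturb a schema's column names with known
# transformations, so the ground-truth matches come for free.

import random

def abbreviate(name):
    """Abbreviation rule, e.g. 'salary' -> 'sal'."""
    return name[:3]

def drop_vowels(name):
    """Vowel-dropping rule, e.g. 'salary' -> 'slry'."""
    return "".join(c for c in name if c not in "aeiou") or name

RULES = [abbreviate, drop_vowels]

def perturb(columns, rng):
    """Return (perturbed_columns, ground_truth_matches)."""
    perturbed, matches = [], {}
    for col in columns:
        new = rng.choice(RULES)(col)
        perturbed.append(new)
        matches[new] = col   # we applied the rule, so we know the match
    return perturbed, matches

cols = ["id", "first", "last", "salary"]
T1, truth = perturb(cols, random.Random(0))
```

Each call with a fresh random seed yields another synthetic schema Ti together with its match set Ωi, exactly the (input, output) pairs the tuner needs.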

The eTuner Architecture

[Slide figure: the schema S feeds a Workload Generator, which applies the Perturbation Rules to produce a Synthetic Workload {(V1, Ω1), (V2, Ω2), ..., (Vn, Ωn)}. The Staged Tuner then uses the Tuning Procedures, this workload, and (optionally) user input to turn the Matching Tool M into a Tuned Matching Tool M.]

The Staged Tuner

[Slide figure: an execution graph with Matcher 1 ... Matcher n at Level 1, a Combiner at Level 2, a Constraint Enforcer at Level 3, and a Match Selector at Level 4; the tuning direction runs upward from Level 1.]

Tune sequentially, starting with the lowest-level components. Assume the execution graph has k levels with m nodes per level, each node can be assigned one of n components, and each component has p knobs, each taking one of q values. Staged tuning then examines on the order of npqkm out of the (npq)^(km) possible knob configurations.
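The staged search can be sketched as a greedy, level-by-level loop: tune the lowest level first, freeze the winner, and move up. This is a minimal sketch under simplifying assumptions (one node per level, a flat candidate list per level); the `score` function stands in for evaluation on the synthetic workload.

```python
# Sketch of staged (greedy, level-by-level) tuning versus exhaustive
# search over the joint configuration space.

from itertools import product

def staged_tune(levels, score):
    """levels: list (lowest level first) of candidate configs per level;
    score(partial_config) evaluates a tuple of per-level choices."""
    chosen = []
    for candidates in levels:                  # tune one level at a time
        best = max(candidates,
                   key=lambda c: score(tuple(chosen) + (c,)))
        chosen.append(best)                    # freeze it, move up a level
    return tuple(chosen)

def exhaustive_tune(levels, score):
    """For comparison: tries every joint configuration."""
    return max(product(*levels), key=score)

# Toy example where per-level quality is additive, so greedy is optimal:
levels = [[1, 2], [3, 4]]
score = lambda cfg: sum(cfg)
```

Staged tuning evaluates len(level) configs per level instead of the full cross product; when components interact strongly, the greedy choice can of course be suboptimal, which is the trade-off the slide's npqkm vs. (npq)^(km) counts capture.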

Empirical Evaluation

Domains (blank cells were not recoverable from the transcript):

Domain      | # schemas | # tables per schema | # attributes per schema | # tuples per table | reference paper
Real Estate | 5         | 2                   | 30                      | 1000               | LSD (SIGMOD'01)
Courses     | 3         | 13                  | 50                      |                    | LSD
Inventory   | 10        | 4                   | 20                      |                    | Corpus (ICDE'05)
Product     |           |                     | 120                     |                    | iMAP (SIGMOD'04)

Matching systems:
- LSD: 6 matchers, 6 combiners, 1 constraint enforcer, 2 match selectors, 21 knobs
- iCOMA: 10 matchers, 4 combiners, 2 match selectors, 20 knobs
- SF: 3 matchers, 1 constraint enforcer, 2 match selectors, 8 knobs
- LSD-SF: 7 matchers, 7 combiners, 1 constraint enforcer, 2 match selectors, 29 knobs

Matching Accuracy

[Slide figure: four bar charts, one per matching system (LSD, COMA, SF, LSD-SF), each plotting accuracy (0.1-0.9) over the Real Estate, Product, Inventory, and Course domains, for settings including off-the-shelf, domain-dependent, domain-independent, source-dependent, eTuner automatic, and eTuner human-assisted.]

eTuner achieves higher accuracy than current best methods, at virtually no cost to the user.

Cost of Using eTuner
You have a schema S and a matching system M. Two scenarios:
- The vendor supplies eTuner, which is then hooked up with the matching system M
- The vendor supplies a matching system M that bundles eTuner inside

Sensitivity Analysis

[Slide figure: two line charts of accuracy (F1) for tuned LSD. Left, adding perturbation rules: accuracy vs. the number of schemas in the synthetic workload, for the Inventory domain, the Real Estate domain, and the average. Right, exploiting prior match results (enriching the workload): accuracy vs. the percentage of previous matches in the collection.]

Summary: The eTuner Project @ Illinois
- Tuning matching systems is crucial: a long-standing problem that is getting worse; a next logical step in schema matching research
- eTuner provides an automatic and principled solution: it generates a synthetic workload and employs it to tune efficiently; incurs virtually no cost to human users; exploits user assistance whenever available
- Extensive experiments over 4 domains with 4 systems
- Future directions: find the optimal synthetic workload; apply to other matching scenarios; adapt the ideas to scenarios beyond schema matching (see 3rd speaker)

Backup: User Assistance
- Take S(phone1, phone2, …). Generate V by dropping phone2: V(phone1, …). Rename phone1 in V: V(x, …)
- Problem: x matches phone1, but x does not match phone2
- User: group phone1 and phone2, so that if x matches phone1, it will also match phone2
- Intuition: tell the system not to bother trying to distinguish phone1 from phone2
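The grouping trick above can be sketched as a normalization step applied before comparing a predicted match against the synthetic ground truth. This is a hypothetical sketch; `GROUPS` and the scoring helper are illustrative, not eTuner's actual API.

```python
# Sketch: the user declares interchangeable attributes as a group, and
# scoring treats a match to any member of a group as matching the group.

GROUPS = [{"phone1", "phone2"}]   # user-supplied, illustrative

def normalize(attr):
    """Map an attribute to a canonical representative of its group."""
    for group in GROUPS:
        if attr in group:
            return min(group)     # canonical member, e.g. "phone1"
    return attr

def match_correct(predicted, truth):
    """Credit a prediction if it lands anywhere in the truth's group."""
    return normalize(predicted) == normalize(truth)
```

Under this scoring, the synthetic attribute x derived from phone1 is also counted correct when the matcher proposes phone2, which is exactly the "do not bother distinguishing them" intuition.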