Mayssam Sayyadian, Yoonkyong Lee, AnHai Doan (University of Illinois, USA); Arnon Rosenthal (MITRE Corp., USA). eTuner: Tuning Schema Matching Software using Synthetic Scenarios

Presentation transcript:

eTuner: Tuning Schema Matching Software using Synthetic Scenarios
Mayssam Sayyadian, Yoonkyong Lee, AnHai Doan (University of Illinois, USA)
Arnon Rosenthal (MITRE Corp., USA)

2 Main Points
Tuning matching systems: long-standing problem
 – becomes increasingly worse
We propose a principled solution
 – exploits synthetic input/output pairs
 – promising, though much work remains
Idea applicable to other contexts

3 Schema Matching
Schema 1 (price, agent-name, address)
 – 120,000; George Bush; Crawford, TX
 – 239,900; Hillary Clinton; New York City, NY
Schema 2 (listed-price, contact-name, city, state)
 – 320K; Jane Brown; Seattle; WA
 – 240K; Mike Smith; Miami; FL
[Figure: arrows between the two schemas, labeled "1-1 match" and "complex match"]

4 Schema Matching is Ubiquitous
Databases
 – data integration, model management
 – data translation, collaborative data sharing
 – keyword querying, schema/view integration
 – data warehousing, peer data management, …
AI
 – knowledge bases, ontology merging, information gathering agents, ...
Web
 – e-commerce, Deep Web, Semantic Web
eGovernment, bio-informatics, scientific data management

5 Current State of Affairs
Finding semantic mappings is now a key bottleneck!
 – largely done by hand, labor intensive & error prone
Numerous matching techniques have been developed
 – Databases: IBM Almaden, Microsoft Research, BYU, George Mason, U Leipzig, U Wisconsin, NCSU, U Illinois, Washington, Humboldt-Universität zu Berlin, ...
 – AI: Stanford, Karlsruhe University, NEC Japan, ...
Techniques are often synergistic, leading to multi-component matching architectures
 – each component employs a particular technique
 – final predictions combine those of the components

6 An Example: LSD [SIGMOD-01]
Schema 1 (address, agent-name)
 – Urbana, IL; James Smith
 – Seattle, WA; Mike Doan
Schema 2 (area, contact-agent)
 – Peoria, IL; (206)
 – Kent, WA; (617)
[Figure: Name Matcher and Naive Bayes Matcher feed a Combiner, followed by a Constraint Enforcer and a Match Selector; figure labels include "0.3", "agent name", "contact agent"]
Combined predictions
 – area => (address, 0.7), (description, 0.3)
 – contact-agent => (agent-phone, 0.7), (agent-name, 0.3)
 – comments => (address, 0.6), (desc, 0.4)
Constraint: only one attribute of Schema 2 matches address
Final matches
 – area = address
 – contact-agent = agent-phone
 – ...
 – comments = desc
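The last two steps on this slide can be sketched in a few lines. The scores below are the combined predictions shown on the slide; the greedy selection strategy, the function name select_matches, and the treatment of "description" and "desc" as one label are assumptions made for illustration, not LSD's actual constraint enforcer.

```python
from typing import Dict, List, Set, Tuple

# Combined predictions from the slide: Schema 2 attribute -> [(label, score)].
# Assumption: "description" and "desc" on the slide denote the same label.
combined: Dict[str, List[Tuple[str, float]]] = {
    "area": [("address", 0.7), ("desc", 0.3)],
    "contact-agent": [("agent-phone", 0.7), ("agent-name", 0.3)],
    "comments": [("address", 0.6), ("desc", 0.4)],
}


def select_matches(
    predictions: Dict[str, List[Tuple[str, float]]],
    unique_labels: Set[str],
) -> Dict[str, str]:
    """Greedy stand-in for constraint enforcement plus match selection.

    Candidates are visited in descending score order. A label listed in
    unique_labels may be assigned to at most one attribute.
    """
    candidates = sorted(
        (
            (score, attr, label)
            for attr, ranked in predictions.items()
            for label, score in ranked
        ),
        key=lambda c: -c[0],
    )
    chosen: Dict[str, str] = {}
    used: Set[str] = set()
    for _score, attr, label in candidates:
        if attr in chosen:
            continue  # this attribute already has a match
        if label in unique_labels and label in used:
            continue  # constraint: label may be used only once
        chosen[attr] = label
        used.add(label)
    return chosen


matches = select_matches(combined, unique_labels={"address"})
print(matches)
```

With "address" restricted to a single use, comments falls back to its second-ranked label, which reproduces the final matches listed on the slide.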

7 Multi-Component Matching Solutions
[Figure: execution graphs of LSD, COMA, SF, and LSD-SF, each assembled from Matcher 1 … Matcher n, Combiner, Match selector, and Constraint enforcer]
Developed in many recent works
 – e.g., Doan et al., WebDB-00, SIGMOD-01; Do & Rahm, VLDB-02; Embley et al.-02; Bernstein et al., SIGMOD Record-04; Madhavan et al. 05
Now commonly adopted, with industrial-strength systems
 – e.g., Protoplasm [MSR], COMA++ [Univ of Leipzig]
Such systems are very powerful...
 – maximize accuracy; highly customizable to individual domains
... but place a serious tuning burden on domain users

8 Tuning Schema Matching Systems
Library of matching components
 – Matchers: q-gram name matcher, Decision tree matcher, Naïve Bayes matcher, TF/IDF name matcher, SVM matcher
 – Combiners: Average combiner, Min combiner, Max combiner, Weighted sum combiner
 – Constraint enforcers: A* search enforcer, Relax. labeler, ILP
 – Match selectors: Threshold selector, Bipartite graph selector
Execution graph: Matcher 1 … Matcher n, Combiner, Match selector, Constraint enforcer
Knobs of decision tree matcher
 – Characteristics of attr.
 – Post-prune?
 – Size of validation set
 – Split measure
Given a particular matching situation
 – how to select the right components?
 – how to adjust the multitude of knobs?
Untuned versions produce inferior accuracy, however...

9 Tuning is Extremely Difficult
Large number of knobs
 – e.g., 8-29 in our experiments
Wide variety of techniques
 – database, machine learning, IR, information theory, etc.
Complex interaction among components
Not clear how to compare the quality of knob configs
Matching systems are still tuned manually, by trial and error
Multiple component systems make tuning even worse...
Developing efficient tuning techniques is crucial to making matching systems attractive in practice

10 The eTuner Solution
Given schema S & matching system M
 – tunes M to maximize average accuracy of matching S with future schemas
 – incurs virtually no cost to user
Key challenge 1: Evaluation
 – must search for best knob config
 – how to compute the quality of any knob config C?
 – if ground-truth matches are known for a representative workload W = {(S,T1), ..., (S,Tn)}, then W can be used to evaluate C
 – but often no such W is available
Key challenge 2: Search
 – how to efficiently evaluate the huge space of knob configs?
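The evaluation step can be illustrated as follows: given a workload with known ground-truth matches, the quality of a knob config C is the average accuracy over the workload. This is a minimal sketch, assuming F1 as the accuracy measure (the deck reports F1 in its sensitivity plots). The toy matcher, its single "threshold" knob, and all names and scores are hypothetical.

```python
from typing import Callable, Dict, List, Set, Tuple

Match = Tuple[str, str]
Scores = Dict[Match, float]  # hypothetical candidate match -> similarity score


def f1(predicted: Set[Match], truth: Set[Match]) -> float:
    """F1 of predicted matches against ground-truth matches."""
    correct = len(predicted & truth)
    if correct == 0:
        return 0.0
    precision = correct / len(predicted)
    recall = correct / len(truth)
    return 2 * precision * recall / (precision + recall)


def config_quality(
    run_matcher: Callable[[dict, Scores], Set[Match]],
    config: dict,
    workload: List[Tuple[Scores, Set[Match]]],
) -> float:
    """Average F1 of one knob config over every scenario in the workload."""
    scores = [f1(run_matcher(config, target), truth) for target, truth in workload]
    return sum(scores) / len(scores)


def toy_matcher(config: dict, target: Scores) -> Set[Match]:
    """Hypothetical matcher with one knob: keep candidates above a threshold."""
    return {m for m, s in target.items() if s >= config["threshold"]}


# Hypothetical workload: (candidate scores, ground-truth matches) per scenario.
workload = [
    (
        {("id", "id"): 0.9, ("last", "emp-last"): 0.6, ("first", "wage"): 0.4},
        {("id", "id"), ("last", "emp-last")},
    ),
    (
        {("salary", "wage"): 0.7, ("first", "emp-last"): 0.5},
        {("salary", "wage")},
    ),
]

# Search the (tiny) knob space for the config with the best average F1.
candidates = [{"threshold": t} for t in (0.3, 0.55, 0.8)]
best = max(candidates, key=lambda c: config_quality(toy_matcher, c, workload))
print(best, config_quality(toy_matcher, best, workload))
```

The same config_quality function works unchanged whether the workload is hand-labeled or generated synthetically, which is the property the next slides rely on.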

11 Key Idea: Generate Synthetic Input/Output Pairs
Need workload W = {(S,T1), (S,T2), …, (S,Tn)}
To generate W
 – start with S
 – perturb S to generate T1
 – perturb S to generate T2
 – etc.
Know the perturbation => know matches between S & Ti

12 Key Idea: Generate Synthetic Input/Output Pairs
Schema S: EMPLOYEES(id, first, last, salary ($))
 – 1, Bill, Laup, 40,000 $
 – 2, Mike, Brown, 60,000 $
 – 3, Jean, Ann, 30,000 $
 – 4, Roy, Bond, 70,000 $
Split S into V and U with disjoint data tuples
 – V: EMPLOYEES with tuples 1 (Bill Laup) and 2 (Mike Brown)
 – U: EMPLOYEES with tuples 3 (Jean Ann) and 4 (Roy Bond)
Perturb V to obtain V1
 – perturb # of tables
 – perturb # of columns in each table: EMPLOYEES(last, id, salary($)) with tuples (Laup, 1, 40,000$), (Brown, 2, 60,000$)
 – perturb column and table names: EMPS(emp-last, id, wage)
 – perturb data tuples in each table
Ω1: a set of semantic matches
 – EMPS.emp-last = EMPLOYEES.last
 – EMPS.id = EMPLOYEES.id
 – EMPS.wage = EMPLOYEES.salary($)
[Figure: U paired with V1 ... Vn]
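The first step, splitting S into V and U with disjoint data tuples, might look like the sketch below. The rows are the EMPLOYEES tuples from the slide; the random half-and-half split and the function name split_disjoint are assumptions, since the slide does not say how tuples are assigned.

```python
import random
from typing import List, Tuple

Row = Tuple[int, str, str, str]

# EMPLOYEES(id, first, last, salary ($)) as shown on the slide.
employees: List[Row] = [
    (1, "Bill", "Laup", "40,000 $"),
    (2, "Mike", "Brown", "60,000 $"),
    (3, "Jean", "Ann", "30,000 $"),
    (4, "Roy", "Bond", "70,000 $"),
]


def split_disjoint(rows: List[Row], seed: int = 0) -> Tuple[List[Row], List[Row]]:
    """Split rows into two halves (V, U) that share no tuple."""
    rng = random.Random(seed)
    shuffled = list(rows)  # copy so the input table is left untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


v_rows, u_rows = split_disjoint(employees)
print("V:", v_rows)
print("U:", u_rows)
```

Keeping the two halves disjoint means a matcher cannot succeed merely by spotting identical tuples in U and the perturbed V.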

13 Examples of Perturbation Rules
Number of tables
 – merge two tables based on a join path
 – split a table into two
Structure of table
 – merge two columns, e.g., neighboring columns, or columns sharing a prefix/suffix (last-name, first-name)
 – drop a column
 – swap location of two columns
Names of tables/columns
 – rules capture common name transformations
 – abbreviation to the first 3-4 characters, dropping all vowels, synonyms, dropping prefixes, adding table name to column name, etc.
Data values
 – rules capture common format transformations: 12/4 => Dec 4
 – values are changed based on some distributions (e.g., Gaussian)
See paper for details
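A few of the name rules above can be sketched as plain functions. Because each perturbed name is produced from a known original, the ground-truth matches come for free. All function names are hypothetical, and details such as keeping the first character when dropping vowels and using "-" as the separator are assumptions, not the paper's rules.

```python
import random
from typing import List, Set, Tuple

VOWELS = set("aeiouAEIOU")


def abbreviate(name: str, length: int = 3) -> str:
    """Abbreviate a name to its first `length` characters."""
    return name[:length]


def drop_vowels(name: str) -> str:
    """Drop vowels, keeping the first character (an assumption)."""
    return name[:1] + "".join(ch for ch in name[1:] if ch not in VOWELS)


def add_table_name(table: str, column: str) -> str:
    """Prefix the column name with the table name."""
    return f"{table}-{column}"


def perturb_names(
    table: str, columns: List[str], rng: random.Random
) -> Tuple[List[str], Set[Tuple[str, str]]]:
    """Apply one randomly chosen name rule per column.

    Returns the perturbed column names and the set of
    (perturbed name, original name) ground-truth matches.
    """
    rules = [abbreviate, drop_vowels, lambda c: add_table_name(table, c)]
    new_columns: List[str] = []
    matches: Set[Tuple[str, str]] = set()
    for column in columns:
        new_name = rng.choice(rules)(column)
        new_columns.append(new_name)
        matches.add((new_name, column))
    return new_columns, matches


perturbed, truth = perturb_names("emp", ["id", "last", "salary"], random.Random(0))
print(perturbed)
print(truth)
```

Calling perturb_names repeatedly with different seeds yields different perturbed schemas, each paired with its own set of known matches.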

14 The eTuner Architecture
[Figure: components are Schema S, Perturbation Rules, Workload Generator, Synthetic Workload (U, Ω1, V1), (U, Ω2, V2), ..., (U, Ωn, Vn), Tuning Procedures, Staged Tuner, Matching Tool M, and Tuned Matching Tool M; one element is marked "(Optional)"]

15 The Staged Tuner
[Figure: execution graph arranged in Levels 1 to 4 (Matcher 1 … Matcher n, Combiner, Match selector, Constraint enforcer), with an arrow marking the tuning direction]
Tune sequentially, starting with lowest-level components
Assume
 – execution graph has k levels, m nodes per level
 – each node can be assigned one of n components
 – each component has p knobs, each of which has q values
Tuning examines (npqkm) out of (npq)^(km) knob configs
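To get a feel for the gap between the two counts on this slide, the formulas can be evaluated for small values. The numbers chosen for n, p, q, k, and m below are illustrative assumptions, not figures from the experiments.

```python
def staged_configs(n: int, p: int, q: int, k: int, m: int) -> int:
    """Configs examined by staged tuning, per the slide: n*p*q*k*m."""
    return n * p * q * k * m


def exhaustive_configs(n: int, p: int, q: int, k: int, m: int) -> int:
    """Size of the full knob-config space, per the slide: (n*p*q)^(k*m)."""
    return (n * p * q) ** (k * m)


# Illustrative values only: 3 components per node, 2 knobs each,
# 4 values per knob, 4 levels, 2 nodes per level.
n, p, q, k, m = 3, 2, 4, 4, 2
staged = staged_configs(n, p, q, k, m)
exhaustive = exhaustive_configs(n, p, q, k, m)
print(staged)      # 192
print(exhaustive)  # 110075314176 (that is, 24 ** 8)
```

Even at this toy size, staged tuning looks at a few hundred configs while the full space has over a hundred billion.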

16 Empirical Evaluation
Domains (columns: # schemas, # tables per schema, # attributes per schema, # tuples per table, reference paper)
 – Real Estate: LSD (SIGMOD01)
 – Courses: 531350, LSD
 – Inventory: 104 20, Corpus (ICDE05)
 – Product: iMAP (SIGMOD04)
Matching systems
 – LSD: 6 Matchers, 6 Combiners, 1 Constraint enforcer, 2 Match selectors, 21 Knobs
 – iCOMA: 10 Matchers, 4 Combiners, 2 Match selectors, 20 Knobs
 – SF: 3 Matchers, 1 Constraint enforcer, 2 Match selectors, 8 Knobs
 – LSD-SF: 7 Matchers, 7 Combiners, 1 Constraint enforcer, 2 Match selectors, 29 Knobs

17 Matching Accuracy
[Figure: four panels (LSD, COMA, SF, LSD-SF), each over the Course, Inventory, Product, and Real Estate domains; series: Off-the-shelf, Domain-independent, Domain-dependent, Source-dependent, eTuner: Automatic, eTuner: Human-assisted]
eTuner achieves higher accuracy than current best methods, at virtually no cost to the user

18 Cost of Using eTuner
You have a schema S and a matching system M
Vendor supplies eTuner
 – will hook it up with matching system M
Vendor supplies a matching system M
 – bundles eTuner inside

19 Sensitivity Analysis
Adding perturbation rules
Exploiting prior match results (enriching the workload)
[Figure: Accuracy (F1) plotted against Schemas in Synthetic Workload (#) and against Previous matches in collection (%); labels: Average, Inventory Domain, Real Estate Domain, Tuned LSD]

20 Summary: The eTuner Illinois
Tuning matching systems is crucial
 – long-standing problem, is getting worse
 – a next logical step in schema matching research
Provides an automatic & principled solution
 – generates a synthetic workload, employs it to tune efficiently
 – incurs virtually no cost to human users
 – exploits user assistance whenever available
Extensive experiments over 4 domains with 4 systems
Future directions
 – find optimal synthetic workload
 – apply to other matching scenarios
 – adapt ideas to scenarios beyond schema matching (see 3rd speaker)

21 Backup: User Assistance
S(phone1, phone2, …)
Generate V by dropping phone2: V(phone1, …)
Rename phone1 in V: V(x, …)
Problem:
 – x matches phone1, x does not match phone2
User:
 – group phone1 and phone2
 – so if x matches phone1, it will also match phone2
Intuition: tell the system not to bother trying to distinguish phone1 and phone2
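One way to picture the grouping hint is as an expansion of the ground-truth matches: any match to one member of a user-declared group also counts as a match to the other members. This is a sketch under that reading of the slide; the function name expand_truth and the data layout are hypothetical.

```python
from typing import FrozenSet, List, Set, Tuple

Match = Tuple[str, str]  # (attribute in V, attribute in S)


def expand_truth(truth: Set[Match], groups: List[FrozenSet[str]]) -> Set[Match]:
    """Extend each ground-truth match to every member of the target's group."""
    expanded = set(truth)
    for source, target in truth:
        for group in groups:
            if target in group:
                expanded.update((source, member) for member in group)
    return expanded


# From the slide: x is the renamed phone1; the user groups phone1 and phone2.
truth = {("x", "phone1")}
groups = [frozenset({"phone1", "phone2"})]
expanded = expand_truth(truth, groups)
print(sorted(expanded))
```

With the expanded set as ground truth, a config that maps x to phone2 is no longer penalized during tuning.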