Presentation transcript:

eTUNER: Tuning Schema Matching Software using Synthetic Scenarios
Mayssam Sayyadian, Yoonkyong Lee, AnHai Doan, and Arnon S. Rosenthal
University of Illinois at Urbana-Champaign & MITRE

Sections: Schema Matching Systems; Tuning Schema Matching Systems; Formalization of Tuning Problem; The eTUNER Architecture; Generating Synthetic Workload; Tuning using Synthetic Workload; Experimental Results; Summary & Future Work

Schema Matching Systems

Schema Matching
– Finding semantic matches between the schemas of disparate data sources
– Applications: data warehousing, scientific collaboration, e-commerce, bioinformatics, data integration on WWW, …

Current Trends
– Manually finding matches is labor intensive
– Numerous automatic matching techniques have been developed
– Each technique has its own strengths and weaknesses
– Hence, most current matching systems adopt a multi-component strategy
– Each component employs a particular matching technique
– Highly extensible and customizable
– Examples: LSD, COMA, GLUE, [Embley02], SimFlood, iMAP, ProtoPlasm, …

[Figure: a multi-component matching system. Labels: Source schema & Target schema; Base-Learner … Base-Learner k; Meta-Learner; Constraint Handler; Domain constraints]

Tuning Schema Matching Systems

Tuning is necessary to get high matching accuracy
– Crucial in many applications: automatic data exchange, data integration, peer-to-peer systems, …

Tuning is extremely difficult
– Huge space of knobs
– Wide variety of matching techniques
– Complex interactions among the components
– No reasonable guideline for tuning

Given a particular matching situation, how do we select the right matching components to execute, and how do we adjust the multiple knobs of those components?
Developing efficient techniques for tuning is now crucial!

Modeling Schema Matching Systems

Matching tool M = (L, G, k)
– L: Library of matching components (e.g. matchers, combiners, filters, etc.)
– G: Execution graph
– k: Collection of control variables (i.e. "knobs")
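
The triple M = (L, G, k) can be pictured as a small data structure. The following Python sketch is only an illustration under stated assumptions: the names `MatchingTool`, `qgram_similarity`, and `with_knobs` are hypothetical, the execution graph G is simplified to a linear list of component names, and the only knobs are a q-gram size and a selection threshold. It is not eTUNER's actual representation.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, List, Tuple


def qgram_similarity(a: str, b: str, q: int = 2) -> float:
    """Jaccard overlap of character q-grams (an illustrative name matcher)."""
    def grams(s: str) -> set:
        return {s[i:i + q] for i in range(len(s) - q + 1)}
    ga, gb = grams(a.lower()), grams(b.lower())
    union = ga | gb
    return len(ga & gb) / len(union) if union else 0.0


@dataclass(frozen=True)
class MatchingTool:
    library: Dict[str, Callable]  # L: available components, by name
    graph: List[str]              # G: simplified here to a linear pipeline
    knobs: Dict[str, float]       # k: control variables

    def with_knobs(self, **updates: float) -> "MatchingTool":
        """Return M' = (L, G, k') with a new knob configuration."""
        return replace(self, knobs={**self.knobs, **updates})

    def match(self, source: List[str], target: List[str]) -> List[Tuple[str, str]]:
        """Score every attribute pair and keep those at or above the threshold."""
        matcher = self.library[self.graph[0]]
        q = int(self.knobs["q"])
        return [(s, t) for s in source for t in target
                if matcher(s, t, q) >= self.knobs["threshold"]]


tool = MatchingTool({"qgram": qgram_similarity}, ["qgram"],
                    {"q": 2, "threshold": 0.5})
tuned = tool.with_knobs(threshold=0.4)  # same L and G, different k
```

Changing only the knob dictionary while leaving the library and graph untouched mirrors the tuning goal of going from M = (L, G, k) to M* = (L, G, k*).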
Example: LSD (L, G, k)
– Library of matching components (L)
   Matchers: q-gram name matcher, decision tree matcher, Naïve Bayes matcher, TF/IDF name matcher, SVM matcher
   Combiners: average combiner, min combiner, max combiner, weighted sum combiner
   Constraint enforcer: A* search enforcer
   Match selectors: threshold selector, bipartite graph selector
– Execution graph (G)
   Components: Constraint enforcer, Match selector, Combiner, Matcher 1 … Matcher n
– Collection of knobs (k)
   Example, decision tree matcher: characteristics of attributes, post-prune?, size of validation set, split measure

Formalization of Tuning Problem

General tuning problem
Given
– M: a schema matching tool
– Workload: a set of matching scenarios (S1, T1), (S2, T2), …, (Sk, Tk)
– U: a utility function defined over the process of matching two schemas
Find the knob configuration k* maximizing the utility over the workload

Our tuning problem
Given
– M: a schema matching tool
– S: a source schema
– Workload: a set of matching scenarios (S, T1), (S, T2), …, (S, Tk) (the Ti are future schemas)
– U: matching accuracy
Find the knob configuration k* maximizing the average accuracy

The eTUNER Architecture

– Generate synthetic workload
– Tune a matching system M using the synthetic workload and tuning procedures stored in the repository
– Exploit user assistance to generate an even higher quality synthetic workload, if possible

[Figure: eTUNER architecture. Labels: Source Schema S; User; Augmented Schema; Workload Generator; Transformation Rules; Synthetic Workload; Matching Tool M = (L, G, k); Staged Tuner; Tuning Procedures; Tuned Matching Tool M* = (L, G, k*)]

Generating Synthetic Workload

Schema S, table EMPLOYEES:

   id   first   last    salary ($)
   1    Bill    Laup    40,000 $
   2    Mike    Brown   60,000 $
   3    Jean    Ann     30,000 $
   4    Roy     Bond    70,000 $

Split S into V and U with disjoint data tuples. In this example the tuples with id 1 and 2 form the half that is perturbed into V1, and the tuples with id 3 and 4 form the other half:

   EMPLOYEES (tuples 1 and 2)
   id   first   last    salary ($)
   1    Bill    Laup    40,000 $
   2    Mike    Brown   60,000 $

   EMPLOYEES (tuples 3 and 4)
   id   first   last    salary ($)
   3    Jean    Ann     30,000 $
   4    Roy     Bond    70,000 $

Perturbation steps that produce V1:
– Perturb # of tables
– Perturb # of columns in each table

   EMPLOYEES
   last    id   salary($)
   Laup    1    40,000$
   Brown   2    60,000$

– Perturb column and table names
– Perturb data tuples in each table

   EMPS
   emp-last   id   wage
   Laup       1    40,000$
   Brown      2    60,000$

Ω1: a set of semantic matches between V1 and U
– EMPS.emp-last = EMPLOYEES.last
– EMPS.id = EMPLOYEES.id
– EMPS.wage = EMPLOYEES.salary($)

Repeating the perturbation yields V1, …, Vn, each paired with U and its own set of known matches.

Exploiting user assistance
– Grouping semantically equivalent attributes over S
– Adding domain specific perturbation rules

Tuning using Synthetic Workload

Staged Tuning

[Figure: execution graph annotated with Level 1 to Level 4 and a tuning direction. Components: Constraint enforcer, Match selector, Combiner, Matcher 1 … Matcher n]

– Tune sequentially starting from the lowest-level components
– Find the best knob configuration for a component based on matching accuracy over the synthetic workload

Summary & Future Work

Efficient tuning is extremely important

Our contributions
– Establish that tuning matching systems automatically is feasible
– Synthesize workload to estimate the quality of a matching system with given knob configurations
– Establish that staged tuning is a reasonable optimization technique
– Experiment extensively over 4 real-world domains with 4 matching systems

Future Work
– Explore better search methods and more extensive evaluation
– Deploy the idea of using synthetic input/output pairs to other applications (e.g. wrapper maintenance)
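
The workload generation and tuning loop described in the transcript can be sketched in a few lines of Python. This is a minimal illustration under loud assumptions, not the eTUNER implementation: all function names, the `renames` rule table, the q-gram name matcher, the F1 accuracy measure, and the plain exhaustive grid search over a single threshold knob are hypothetical stand-ins (in particular, the search shown is not the staged tuner).

```python
import random
from typing import Dict, List, Set, Tuple

Table = Dict[str, List[str]]   # column name -> column values
Match = Tuple[str, str]        # (perturbed column, original column)


def split(table: Table) -> Tuple[Table, Table]:
    """Split S into V and U with disjoint data tuples (first/second half)."""
    n = len(next(iter(table.values()))) // 2
    v = {c: vals[:n] for c, vals in table.items()}
    u = {c: vals[n:] for c, vals in table.items()}
    return v, u


def perturb(v: Table, renames: Dict[str, List[str]],
            rng: random.Random) -> Tuple[Table, Set[Match]]:
    """Drop at most one column and rename the rest; the matches are known."""
    cols = list(v)
    if len(cols) > 1 and rng.random() < 0.5:
        cols.remove(rng.choice(cols))
    vi: Table = {}
    omega: Set[Match] = set()
    for c in cols:
        new = rng.choice(renames.get(c, [c]))
        vi[new] = v[c]
        omega.add((new, c))
    return vi, omega


def qgram_sim(a: str, b: str, q: int = 2) -> float:
    """Jaccard overlap of character q-grams of two column names."""
    ga = {a[i:i + q] for i in range(len(a) - q + 1)}
    gb = {b[i:i + q] for i in range(len(b) - q + 1)}
    return len(ga & gb) / len(ga | gb) if ga | gb else 0.0


def match(vi_cols: List[str], u_cols: List[str], threshold: float) -> Set[Match]:
    """Toy matcher: keep every name pair whose similarity reaches the knob."""
    return {(a, b) for a in vi_cols for b in u_cols
            if qgram_sim(a, b) >= threshold}


def f1(pred: Set[Match], gold: Set[Match]) -> float:
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    p, r = tp / len(pred), tp / len(gold)
    return 2 * p * r / (p + r)


def tune(workload, u_cols: List[str], grid: List[float]) -> float:
    """Pick the threshold with the best average F1 over the workload."""
    def avg(th: float) -> float:
        return sum(f1(match(list(vi), u_cols, th), om)
                   for vi, om in workload) / len(workload)
    return max(grid, key=avg)


S: Table = {"id": ["1", "2", "3", "4"],
            "first": ["Bill", "Mike", "Jean", "Roy"],
            "last": ["Laup", "Brown", "Ann", "Bond"],
            "salary ($)": ["40,000", "60,000", "30,000", "70,000"]}
renames = {"last": ["emp-last"], "salary ($)": ["wage"], "first": ["first-name"]}
grid = [0.2, 0.4, 0.6, 0.8]

V, U = split(S)
rng = random.Random(0)
workload = [perturb(V, renames, rng) for _ in range(5)]
best_threshold = tune(workload, list(U), grid)
```

Because each perturbed schema is derived from S by known transformations, the correct matches come for free, which is what lets accuracy be estimated without hand-labelled scenarios. A pure name matcher cannot recover EMPS.wage = EMPLOYEES.salary($), which hints at why real systems combine several matchers.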