eTUNER: Tuning Schema Matching Software using Synthetic Scenarios
Mayssam Sayyadian, Yoonkyong Lee, AnHai Doan, and Arnon S. Rosenthal
University of Illinois at Urbana-Champaign & MITRE

Sections: Schema Matching Systems; Tuning Schema Matching Systems; Modeling Schema Matching Systems; Formalization of Tuning Problem; The eTUNER Architecture; Generating Synthetic Workload; Tuning using Synthetic Workload; Experimental Results; Summary & Future Work

Schema Matching Systems

Schema Matching
– Finding semantic matches between the schemas of disparate data sources
– Applications: data warehousing, scientific collaboration, e-commerce, bioinformatics, data integration on WWW, …

Current Trends
– Manually finding matches is labor intensive
– Numerous automatic matching techniques have been developed
– Each technique has its own strengths and weaknesses
– Hence, most current matching systems adopt a multi-component strategy
– Each component employs a particular matching technique
– Highly extensible and customizable
– Examples: LSD, COMA, GLUE, [Embley02], SimFlood, iMAP, ProtoPlasm, …

[Figure: a multi-component matching system. Labels: Source schema & Target schema, Base-Learner … Base-Learner k, Meta-Learner, Constraint Handler, Domain constraints]

Tuning Schema Matching Systems

Tuning is necessary to get high matching accuracy
– Crucial in many applications: automatic data exchange, data integration, peer-to-peer systems, …

Tuning is extremely difficult
– Huge space of knobs
– Wide variety of matching techniques
– Complex interactions among the components
– No reasonable guideline for tuning

Given a particular matching situation, how to select the right matching components to execute, and how to adjust the multiple knobs of the components?

Developing efficient techniques for tuning is now crucial!

Modeling Schema Matching Systems

Matching tool M = (L, G, k)
– L: Library of matching components (e.g. matchers, combiners, filters)
– G: Execution graph
– k: Collection of control variables (i.e. “knobs”)
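
The model M = (L, G, k) can be made concrete with a small sketch. Everything below is an assumption made for illustration, not the poster's implementation: the class name `MatchingTool`, the two toy name matchers, the dictionary encoding of the execution graph, and the single `threshold` knob are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


def qgram_sim(a: str, b: str, q: int = 2) -> float:
    """Jaccard similarity between the q-gram sets of two attribute names."""
    def grams(s: str) -> set:
        return {s[i:i + q] for i in range(max(len(s) - q + 1, 1))}
    ga, gb = grams(a.lower()), grams(b.lower())
    return len(ga & gb) / len(ga | gb)


def exact_sim(a: str, b: str) -> float:
    """1.0 if the names are equal ignoring case, else 0.0."""
    return 1.0 if a.lower() == b.lower() else 0.0


# L: library of components (hypothetical toy matchers and combiners).
LIBRARY: Dict[str, Callable] = {
    "qgram": qgram_sim,
    "exact": exact_sim,
    "average": lambda scores: sum(scores) / len(scores),
    "max": max,
}


@dataclass
class MatchingTool:
    library: Dict[str, Callable]   # L
    graph: Dict[str, object]       # G: which components run, and in what role
    knobs: Dict[str, float]        # k

    def match(self, source: List[str], target: List[str]) -> List[Tuple[str, str]]:
        matchers = [self.library[name] for name in self.graph["matchers"]]
        combine = self.library[self.graph["combiner"]]
        matches = []
        for s in source:
            # Combine all matcher scores per target attribute, keep the best.
            scored = [(combine([m(s, t) for m in matchers]), t) for t in target]
            score, best = max(scored)
            # Threshold selector: the only knob in this sketch.
            if score >= self.knobs["threshold"]:
                matches.append((s, best))
        return matches


tool = MatchingTool(
    library=LIBRARY,
    graph={"matchers": ["qgram", "exact"], "combiner": "max"},
    knobs={"threshold": 0.3},
)
print(tool.match(["last", "id", "salary"], ["emp-last", "id", "wage"]))
# prints [('last', 'emp-last'), ('id', 'id')]
```

Lowering `threshold` to 0.05 makes the tool also report the spurious pair `('salary', 'emp-last')`, which is the kind of knob sensitivity that motivates tuning.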
Example: LSD (L, G, k)

Library of matching components (L)
– Matchers: q-gram name matcher, Decision tree matcher, Naïve Bayes matcher, TF/IDF name matcher, SVM matcher
– Combiners: Average combiner, Min combiner, Max combiner, Weighted sum combiner
– Constraint enforcer: A* search enforcer
– Match selectors: Threshold selector, Bipartite graph selector

Execution graph (G)
[Figure: execution graph. Labels: Matcher 1 … Matcher n, Combiner, Constraint enforcer, Match selector]

Collection of knobs (k)
– Example, Decision tree matcher: Characteristics of attr., Split measure, Post-prune?, Size of validation set

Formalization of Tuning Problem

General tuning problem
Given
– M: a schema matching tool
– Workload: a set of matching scenarios (S_1, T_1), (S_2, T_2), …, (S_n, T_n)
– U: a utility function defined over the process of matching two schemas
Find the knob configuration k* maximizing the utility over the workload

Our tuning problem
Given
– M: a schema matching tool
– S: a source schema
– Workload: a set of matching scenarios (S, T_1), (S, T_2), …, (S, T_n), where the T_i are future schemas
– U: matching accuracy
Find the knob configuration k* maximizing the average accuracy over the workload

The eTUNER Architecture

– Generate synthetic workload
– Tune a matching system M using the synthetic workload and tuning procedures stored in the repository
– Exploit user assistance to generate an even higher quality synthetic workload, if possible

[Figure: eTUNER architecture. Labels: Source Schema S, User, Augmented Schema, Workload Generator, Transformation Rules, Synthetic Workload, Staged Tuner, Tuning Procedures, Matching Tool M = (L, G, k), Tuned Matching Tool M* = (L, G, k*)]

Generating Synthetic Workload

Split S into V and U with disjoint data tuples

Schema S
EMPLOYEES
id   first   last    salary ($)
1    Bill    Laup    40,000 $
2    Mike    Brown   60,000 $
3    Jean    Ann     30,000 $
4    Roy     Bond    70,000 $

In the example, one half keeps tuples 1 and 2 and the other keeps tuples 3 and 4:

EMPLOYEES
id   first   last    salary ($)
3    Jean    Ann     30,000 $
4    Roy     Bond    70,000 $

The half with tuples 1 and 2 is then perturbed step by step:

– Perturb # of tables
– Perturb # of columns in each table

EMPLOYEES
last    id   salary ($)
Laup    1    40,000 $
Brown   2    60,000 $

– Perturb column and table names

EMPS
emp-last   id   wage
Laup       1    40,000 $
Brown      2    60,000 $

– Perturb data tuples in each table, giving V_1 (table EMPS with columns emp-last, id, wage and tuples for Laup and Brown)

Ω_1: a set of semantic matches
– EMPS.emp-last = EMPLOYEES.last
– EMPS.id = EMPLOYEES.id
– EMPS.wage = EMPLOYEES.salary($)

[Figure: S is split into V and U; perturbation yields V_1, …, V_n, each paired with U and a set of semantic matches Ω_1, …]

Exploiting user assistance
– Grouping semantically equivalent attributes over S
– Adding domain-specific perturbation rules

Tuning using Synthetic Workload

Staged Tuning
[Figure: execution graph divided into Level 1 to Level 4 over Matcher 1 … Matcher n, Combiner, Constraint enforcer, Match selector, with an arrow labeled "Tuning direction"]
– Tune sequentially, starting from the lowest-level components
– Find the best knob configuration for each component based on matching accuracy over the synthetic workload

Efficient tuning is extremely important

Summary & Future Work

Our contributions
– Establish that tuning matching systems automatically is feasible
– Synthesize workload to estimate the quality of a matching system with given knob configurations
– Establish that staged tuning is a reasonable optimization technique
– Experiment extensively over 4 real-world domains with 4 matching systems

Future Work
– Explore better search methods and more extensive evaluation
– Apply the idea of using synthetic input/output pairs to other applications (e.g. wrapper maintenance)
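
The workload generation and staged tuning steps described in this poster can be illustrated with a short sketch. It is a toy under stated assumptions, not the eTUNER implementation: the rename rules, the single q-gram name matcher, the two knobs (`q`, `threshold`), the use of F1 as the accuracy measure, and all function names are hypothetical. Only the overall recipe (split S into V and U, perturb V while recording the true matches, then tune one knob at a time from the lowest level up) follows the text.

```python
import random
from typing import Dict, List, Set, Tuple

Table = Dict[str, List[str]]                 # column name -> column values
Match = Tuple[str, str]                      # (perturbed column, original column)
Scenario = Tuple[Table, Table, Set[Match]]   # (V_i, U, Omega_i)

# Hypothetical name-perturbation rules, invented for this sketch.
RENAME_RULES = [lambda n: "emp-" + n, lambda n: n[:3], lambda n: n.upper(), lambda n: n]


def make_scenario(s: Table, rng: random.Random) -> Scenario:
    """Split S into V and U with disjoint tuples, perturb V, record Omega."""
    n_rows = len(next(iter(s.values())))
    rows = list(range(n_rows))
    rng.shuffle(rows)
    v_rows, u_rows = rows[: n_rows // 2], rows[n_rows // 2:]
    u = {col: [vals[i] for i in u_rows] for col, vals in s.items()}
    kept = rng.sample(list(s), k=rng.randint(1, len(s)))   # perturb # of columns
    v_i: Table = {}
    omega: Set[Match] = set()
    for col in kept:
        new_name = rng.choice(RENAME_RULES)(col)            # perturb column name
        if new_name in v_i:                                 # fall back on a collision
            new_name = col
        v_i[new_name] = [s[col][i] for i in v_rows]
        omega.add((new_name, col))
    return v_i, u, omega


def qgram_sim(a: str, b: str, q: int) -> float:
    def grams(x: str) -> set:
        return {x[i:i + q] for i in range(max(len(x) - q + 1, 1))}
    ga, gb = grams(a.lower()), grams(b.lower())
    return len(ga & gb) / len(ga | gb)


def run_matcher(v_i: Table, u: Table, knobs: Dict[str, float]) -> Set[Match]:
    found = set()
    for vc in v_i:
        score, uc = max((qgram_sim(vc, uc, int(knobs["q"])), uc) for uc in u)
        if score >= knobs["threshold"]:
            found.add((vc, uc))
    return found


def f1(found: Set[Match], truth: Set[Match]) -> float:
    tp = len(found & truth)
    if tp == 0:
        return 0.0
    p, r = tp / len(found), tp / len(truth)
    return 2 * p * r / (p + r)


def avg_accuracy(workload: List[Scenario], knobs: Dict[str, float]) -> float:
    return sum(f1(run_matcher(v, u, knobs), om) for v, u, om in workload) / len(workload)


def staged_tune(workload, knobs, stages):
    """Tune one knob at a time, lowest-level component first, others held fixed."""
    knobs = dict(knobs)
    for name, candidates in stages:
        knobs[name] = max(candidates,
                          key=lambda val: avg_accuracy(workload, {**knobs, name: val}))
    return knobs


S = {"id": ["1", "2", "3", "4"], "first": ["Bill", "Mike", "Jean", "Roy"],
     "last": ["Laup", "Brown", "Ann", "Bond"],
     "salary": ["40,000", "60,000", "30,000", "70,000"]}
workload = [make_scenario(S, random.Random(seed)) for seed in range(20)]
default = {"q": 3, "threshold": 0.9}
tuned = staged_tune(workload, default,
                    [("q", [1, 2, 3]), ("threshold", [0.1, 0.3, 0.5, 0.7, 0.9])])
print(tuned, avg_accuracy(workload, default), avg_accuracy(workload, tuned))
```

Because each stage's candidate list contains the current value of its knob, the tuned configuration can never score below the starting one on the synthetic workload. The price of tuning knobs one at a time is that interactions between knobs at different levels are only partly explored.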