eTuner: Tuning Schema Matching Software using Synthetic Scenarios
Mayssam Sayyadian, Yoonkyong Lee, AnHai Doan (University of Illinois, USA); Arnon Rosenthal (MITRE Corp., USA)

Thank you, ..., for your introduction, and thank you all for coming to my talk. Today I'm going to talk about the problem of mapping between data representations. This is joint work with my advisors, Alon Halevy and P. Domingos, and my colleague, J. Madhavan, at the University of Washington. Now, I will show you shortly that the problem of mapping representations arises everywhere. But first, let me motivate it by grounding it in a very specific application context, which is DATA INTEGRATION.
Main Points
Tuning matching systems: a long-standing problem that is becoming increasingly worse
We propose a principled solution
exploits synthetic input/output pairs
promising, though much work remains
The idea is applicable to other contexts
Schema Matching

[Figure: mediated Schema 1 (price, agent-name, address) matched against source Schema 2 (listed-price, contact-name, city, state), with sample house listings for both. The figure illustrates 1-1 matches (e.g. price = listed-price) and complex matches (e.g. address = concat(city, state)).]

However, before doing that, let's get a better feel for the problem by defining it in more detail. For simplicity, let's assume that both the mediated schema and source schemas use the RELATIONAL REPRESENTATION. For example, here's a mediated schema in the real-estate domain, with elements price, agent-name, and address. And here's the source homes.com, which exports house listings in this relational table; each row in this table corresponds to a house listing. Throughout the talk, we color elements of the mediated schema in red, and elements of source schemas in blue. Now, given a mediated schema and a source schema, the schema-matching problem is to find semantic mappings between the elements of the two schemas. The simplest type of mapping is 1-1 mappings, such as price to listed-price, and agent-name to contact-name. BUT 1-1 mappings make up only a portion of semantic mappings in practice. There are also many complex mappings, such as address being the concatenation of city and state, or the number of bathrooms being the number of full baths plus the number of half baths. In this talk, we shall focus first on finding 1-1 mappings, then on finding complex mappings.
Schema Matching is Ubiquitous
Databases: data integration, model management, data translation, collaborative data sharing, keyword querying, schema/view integration, data warehousing, peer data management, ...
AI: knowledge bases, ontology merging, information-gathering agents, ...
Web: e-commerce, Deep Web, Semantic Web, eGovernment, bio-informatics, scientific data management

But first let's take a step back and ask: if you are not buying a house, why should you care about this problem? The answer is that you should, because it is a fundamental problem in many areas and in numerous applications. Given any domain, if you ask two people to describe it, they will almost certainly use different terminologies. Thus any application that involves more than one such description must establish semantic mappings between them, in order to have INTEROPERABILITY. As a consequence, variations of this problem arise everywhere. It has been a long-standing problem in databases and is becoming increasingly critical. It arises in AI, in the context of ontology merging and information gathering on the Internet. It arises in e-commerce, as the problem of matching catalogs. It is also a fundamental problem in the context of the Semantic Web, which tries to add more structure to the Web by marking up data using ontologies. There we have the problem of matching ontologies. Now, if this problem is so important, why has no one solved it?
Current State of Affairs
Finding semantic mappings is now a key bottleneck! It is largely done by hand, labor intensive & error prone.
Numerous matching techniques have been developed
Databases: IBM Almaden, Microsoft Research, BYU, George Mason, U Leipzig, U Wisconsin, NCSU, U Illinois, U Washington, Humboldt-Universität zu Berlin, ...
AI: Stanford, Karlsruhe University, NEC Japan, ...
Techniques are often synergistic, leading to multi-component matching architectures: each component employs a particular technique, and the final predictions combine those of the components.

So, what do people do today with semantic mappings? Unfortunately, they still must create them by hand, in a very labor-intensive process. For example, Li & Clifton recently reported that at the phone company GTE, people tried to integrate 40 databases, which have a total of elements, and they estimated that simply finding and documenting the semantic mappings would take them 12 years, unless they had the owners of the databases around. Thus, finding semantic mappings has now become a key bottleneck in building large-scale data management applications. And this problem is going to become even more critical, as data sharing becomes even more pervasive on the Web and at enterprises, and as the need for translating legacy data increases. Clearly, we need semi-automatic solutions to schema matching in order to scale up. And there has been a great deal of research work on such solutions, in both databases and AI.
An Example: LSD [SIGMOD-01]
[Figure: the LSD architecture applied to an example. Schema 2 attributes (area, contact-agent, comments) and their data values (e.g. "Urbana, IL", "(206) ...") feed into a Name Matcher and a Naive Bayes Matcher; a Combiner merges their scores into predictions such as area => (address, 0.7), (description, 0.3); contact-agent => (agent-phone, 0.7), (agent-name, 0.3); comments => (address, 0.6), (desc, 0.4). A Constraint Enforcer (e.g. "only one attribute of Schema 2 matches address") and a Match Selector then produce the final matches: area = address, contact-agent = agent-phone, comments = desc.]

Here's an example to illustrate our approach. Consider a mediated schema with elements price, agent-name, agent-phone, and so on. To apply LSD, first the user selects a few sources to be the training sources. In this case, the user selects a single source, realestate.com [POINT]. Next, the user manually specifies the 1-1 mappings between the schema of this source and the mediated schema. These mappings are the five green arrows right here [POINT], which say that listed-price matches price, contact-name matches agent-name, and so on. Once the user has shown LSD these 1-1 mappings, there are many different types of information that LSD could learn from in order to construct hypotheses on how to match schema elements. For example, LSD could learn from the names of schema elements. Knowing that office matches office-phone, it may construct the hypothesis [POINT] that if the word "office" occurs in the name of a schema element, then that element is likely to be office-phone. LSD could also learn from the data values. Because comments matches description, LSD knows that these data values here [POINT] are house descriptions. It could then examine them to learn that house descriptions frequently contain words such as fantastic, great, and beautiful. Hence, it may construct the hypothesis [POINT] that if these words appear frequently in the data values of an element, then that element is likely to be a house description. LSD could also learn from the characteristics of value distributions. For example, it can look at the average value of this column [POINT], and learn that if the average value is in the thousands, then the element is more likely to be price than the number of bathrooms. And so on. Now, consider the source homes.com, with these schema elements [POINT] and these data values [POINT]. LSD can apply the learned hypotheses to the schema and the data values in order to predict semantic mappings. For example, because the words "beautiful" and "great" appear frequently in these data values, LSD can predict that extra-info matches description. ... and the solution to this is MULTI-STRATEGY LEARNING. [PAUSE]
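The combination step described above can be sketched in a few lines. This is a minimal illustration, not LSD's actual implementation; it assumes each matcher returns a confidence score per candidate target attribute, and the combiner takes a weighted sum:

```python
# Illustrative sketch (not LSD's actual code): combine the confidence
# scores of several matchers with a weighted sum, in the spirit of
# multi-strategy learning. Each matcher's output for one source
# attribute is a dict {target_attribute: score}.

def weighted_sum_combiner(predictions, weights):
    """predictions: list of {target_attr: score} dicts, one per matcher.
    weights: one weight per matcher. Returns the combined scores."""
    combined = {}
    for pred, w in zip(predictions, weights):
        for target, score in pred.items():
            combined[target] = combined.get(target, 0.0) + w * score
    return combined

# Example: a name matcher and a Naive Bayes matcher score candidates
# for the source attribute "area" (numbers are made up).
name_scores = {"address": 0.5, "description": 0.2}
nb_scores = {"address": 0.9, "description": 0.4}
combined = weighted_sum_combiner([name_scores, nb_scores], [0.3, 0.7])
best = max(combined, key=combined.get)   # -> "address"
```

A real system would feed these combined scores into a constraint enforcer and match selector rather than simply taking the maximum.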
Multi-Component Matching Solutions
Developed in many recent works, e.g., Doan et al., WebDB-00, SIGMOD-01; Do & Rahm, VLDB-02; Embley et al., 02; Bernstein et al., SIGMOD Record-04; Madhavan et al., 05
Now commonly adopted, with industrial-strength systems, e.g., Protoplasm [MSR], COMA++ [Univ of Leipzig]

[Figure: execution graphs of multi-component systems LSD, COMA, SF, and LSD-SF, each composed of one or more matchers (Matcher 1 ... Matcher n) feeding a combiner, optionally a constraint enforcer, and a match selector.]

Such systems are very powerful ... they maximize accuracy and are highly customizable to individual domains ... but place a serious tuning burden on domain users.
Tuning Schema Matching Systems
Given a particular matching situation:
how to select the right components?
how to adjust the multitude of knobs?

[Figure: a library of matching components feeding an execution graph. Matchers: q-gram name matcher, TF/IDF name matcher, decision tree matcher, Naïve Bayes matcher, SVM matcher, ... Combiners: average, min, max, weighted-sum. Constraint enforcers: A* search, relaxation labeler, ILP. Match selectors: threshold selector, bipartite-graph selector. The decision tree matcher alone exposes knobs such as the characteristics of attributes, the split measure, whether to post-prune, and the size of the validation set.]

Untuned versions produce inferior accuracy, however ...
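To make the tuning problem concrete, a knob configuration can be viewed as a structured assignment of components and knob values to the execution graph. The sketch below is hypothetical; the component and knob names are illustrative, not eTuner's actual API:

```python
# Hypothetical knob configuration for a multi-component matching system.
# Component names and knob names are illustrative only.
knob_config = {
    "matchers": [
        {"name": "decision_tree",
         "knobs": {"split_measure": "gini",     # which split measure to use
                   "post_prune": True,          # prune the tree after training?
                   "validation_set_size": 0.2}},
        {"name": "qgram_name",
         "knobs": {"q": 3}},                    # length of the q-grams
    ],
    "combiner": {"name": "weighted_sum",
                 "knobs": {"weights": [0.6, 0.4]}},
    "match_selector": {"name": "threshold",
                       "knobs": {"threshold": 0.5}},
}
```

Tuning then means searching over all such assignments: which components appear at each node, and which value each of their knobs takes.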
... Tuning is Extremely Difficult
Large number of knobs: e.g., 8-29 in our experiments
Wide variety of techniques: databases, machine learning, IR, information theory, etc.
Complex interaction among components
Not clear how to compare the quality of knob configurations
Matching systems are still tuned manually, by trial and error. Multi-component systems make tuning even worse. Developing efficient tuning techniques is crucial to making matching systems attractive in practice.
The eTuner Solution Given schema S & matching system M
tunes M to maximize the average accuracy of matching S with future schemas
incurs virtually no cost to the user
Key challenge 1: evaluation. The search must find the "best" knob config, so how do we compute the quality of any knob config C? If we knew the "ground-truth" matches for a representative workload W = {(S,T1), ..., (S,Tn)}, then we could use W to evaluate C. But often we have no such W.
Key challenge 2: search. How can we efficiently evaluate the huge space of knob configs?
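Given such a workload W with ground-truth matches, scoring a knob config reduces to averaging a match-level accuracy measure over the workload. A minimal sketch (function names are assumptions), using F1 over sets of predicted matches:

```python
# Sketch of Key Challenge 1: score a knob configuration C against a
# workload of schema pairs with known ground-truth matches, using
# match-level F1 (harmonic mean of precision and recall).

def f1(predicted, ground_truth):
    """predicted, ground_truth: sets of (source_attr, target_attr) pairs."""
    if not predicted or not ground_truth:
        return 0.0
    correct = len(predicted & ground_truth)
    precision = correct / len(predicted)
    recall = correct / len(ground_truth)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def evaluate_config(match_fn, workload):
    """match_fn: runs the matching system, under the knob config being
    evaluated, on one schema pair and returns predicted matches.
    workload: list of (schema_pair, ground_truth_matches) tuples.
    Returns the mean F1 over the workload."""
    scores = [f1(match_fn(pair), truth) for pair, truth in workload]
    return sum(scores) / len(scores)
```

The search component of eTuner then only needs a scoring oracle of this shape; the hard part, supplying a workload with known ground truth, is what the synthetic workload solves.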
Key Idea: Generate Synthetic Input/Output Pairs
Need workload W = {(S,T1), (S,T2), …, (S,Tn)} To generate W start with S perturb S to generate T1 perturb S to generate T2 etc. Know the perturbation => know matches between S & Ti
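The loop above can be sketched directly: perturb S repeatedly, and because we apply the perturbations ourselves, the ground-truth matches come for free. This is a minimal illustration with assumed data structures, not eTuner's actual code:

```python
import copy
import random

# Key idea in code: generate a synthetic workload by perturbing schema S.
# Each perturbation rule transforms the schema AND updates the recorded
# matches, so the ground truth is known by construction.

def generate_workload(schema, rules, n, seed=0):
    """Returns n pairs (perturbed_schema, matches), where matches maps
    each perturbed attribute back to the original attribute of S."""
    rng = random.Random(seed)
    workload = []
    for _ in range(n):
        perturbed = copy.deepcopy(schema)
        matches = {attr: attr for attr in perturbed["attributes"]}
        for rule in rules:
            perturbed, matches = rule(perturbed, matches, rng)
        workload.append((perturbed, matches))
    return workload

# Example perturbation rule: rename a random attribute by abbreviating
# it, recording that the new name still maps to the original attribute.
def rename_rule(schema, matches, rng):
    old = rng.choice(schema["attributes"])
    new = old[:3]                     # e.g. "salary" -> "sal"
    schema["attributes"] = [new if a == old else a
                            for a in schema["attributes"]]
    matches[new] = matches.pop(old)
    return schema, matches
```

Each (Ti, matches) pair produced this way plays the role of one (S, Ti) workload entry with known ground truth.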
Key Idea: Generate Synthetic Input/Output Pairs
[Figure: an example of synthetic workload generation. Schema S (table EMPLOYEES with columns id, first, last, salary ($), and rows such as (1, Bill, Laup, 40,000$) and (2, Mike, Brown, 60,000$)) is first split into V and U with disjoint data tuples. V is then perturbed in stages: (1) the number of tables; (2) the number of columns in each table (e.g. dropping first); (3) column and table names (EMPLOYEES -> EMPS, last -> emp-last, salary ($) -> wage); (4) the data tuples in each table (e.g. 40,000$ -> 45200). The result is a perturbed schema V1 together with Ω1, a set of semantic matches known by construction: EMPS.emp-last = EMPLOYEES.last, EMPS.id = EMPLOYEES.id, EMPS.wage = EMPLOYEES.salary($).]
Examples of Perturbation Rules
Number of tables
merge two tables based on a join path
split a table into two
Structure of a table
merge two columns, e.g., neighboring columns or columns sharing a prefix/suffix (last-name, first-name)
drop a column
swap the locations of two columns
Names of tables/columns
rules capture common name transformations: abbreviation to the first 3-4 characters, dropping all vowels, synonyms, dropping prefixes, adding the table name to a column name, etc.
Data values
rules capture common format transformations, e.g., 12/4 => Dec 4
values are changed based on some distribution (e.g., Gaussian)
See paper for details
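The name and format transformations listed above are simple string functions. A hedged sketch of a few of them (function names are illustrative, not from the paper):

```python
# Illustrative name/format perturbation rules of the kinds the slide
# lists. Names of the helper functions are assumptions.

def abbreviate(name, k=3):
    """Abbreviate to the first k characters: 'salary' -> 'sal'."""
    return name[:k]

def drop_vowels(name):
    """'salary' -> 'slry' (keep the first character so names stay non-empty)."""
    return name[0] + "".join(c for c in name[1:] if c.lower() not in "aeiou")

def prefix_table(table, column):
    """Add the table name to a column name: ('emp', 'last') -> 'emp_last'."""
    return f"{table}_{column}"

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

def reformat_date(value):
    """Format transformation on data values: '12/4' -> 'Dec 4'."""
    month, day = value.split("/")
    return f"{MONTHS[int(month) - 1]} {day}"
```

Because each transformation is applied deliberately, the mapping from perturbed name back to original name is always recorded alongside the change.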
The eTuner Architecture
[Architecture diagram: given schema S and matching tool M, the Workload Generator applies the Perturbation Rules to produce a Synthetic Workload {(V1, Ω1), (V2, Ω2), ..., (Vn, Ωn)}; the Staged Tuner, optionally guided by Tuning Procedures, then uses this workload to produce the Tuned Matching Tool M.]
The Staged Tuner

[Figure: the execution graph (Matcher 1 ... Matcher n -> Combiner -> Constraint enforcer -> Match selector) is tuned level by level, from Level 1 up to Level 4.]

Tune sequentially, starting with the lowest-level components. Assume the execution graph has k levels with m nodes per level, each node can be assigned one of n components, and each component has p knobs, each of which takes one of q values. Staged tuning then examines O(npqkm) of the (npq)^(km) possible knob configs.
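The staged search itself is a greedy, level-by-level loop. A minimal sketch with assumed interfaces (not eTuner's code): each node of the current level is tuned against the synthetic-workload score while all previously tuned choices stay fixed.

```python
# Sketch of staged tuning: tune the execution graph level by level,
# lowest level first, greedily keeping the best-scoring choice per node.
# This examines on the order of n*p*q*k*m configs, not (n*p*q)^(k*m).

def staged_tune(levels, score):
    """levels: list of levels, lowest first; each level is a list of
    (node_id, choices) pairs, where a choice is one component+knob
    setting. score(config) evaluates a (partial) config dict on the
    synthetic workload. Returns the chosen config."""
    config = {}
    for level in levels:
        for node_id, choices in level:
            best_score, best_choice = None, None
            for choice in choices:
                trial = dict(config)
                trial[node_id] = choice
                s = score(trial)
                if best_score is None or s > best_score:
                    best_score, best_choice = s, choice
            config[node_id] = best_choice
    return config

# Toy example: two levels, one node each; the scorer prefers the q-gram
# matcher and the max combiner.
levels = [
    [("matcher", ["qgram", "tfidf"])],    # level 1
    [("combiner", ["average", "max"])],   # level 2
]
def toy_score(cfg):
    return (cfg.get("matcher") == "qgram") + 0.5 * (cfg.get("combiner") == "max")

tuned = staged_tune(levels, toy_score)
```

The greedy pass trades optimality for tractability: interactions between levels are only captured in the direction of tuning, which is why the lowest-level components are tuned first.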
Empirical Evaluation

Domains:

Domain       # schemas  # tables/schema  # attributes/schema  # tuples/table  Reference paper
Real Estate  5          2                30                   1000            LSD (SIGMOD'01)
Courses      3          13               50                                   LSD
Inventory    10         4                20                                   Corpus (ICDE'05)
Product      120                                                              iMAP (SIGMOD'04)

Matching systems:
LSD: matchers, 6 combiners, 1 constraint enforcer, 2 match selectors; 21 knobs
COMA: 10 matchers, 4 combiners, 2 match selectors; 20 knobs
SF: matchers, 1 constraint enforcer, 2 match selectors; 8 knobs
LSD-SF: 7 matchers, 7 combiners, 1 constraint enforcer, 2 match selectors; 29 knobs
Matching Accuracy

[Chart: four panels (LSD, COMA, SF, LSD-SF), each plotting accuracy (roughly 0.1-0.9) on four domains (Real Estate, Product, Inventory, Course) under five settings: off-the-shelf, domain-dependent, eTuner automatic (domain-independent), source-dependent, and eTuner human-assisted.]

eTuner achieves higher accuracy than the current best methods, at virtually no cost to the user.
Cost of Using eTuner

You have a schema S and a matching system M. Two deployment options:
the vendor supplies eTuner, and you hook it up with matching system M
the vendor supplies a matching system M that bundles eTuner inside
Sensitivity Analysis

[Charts: (a) adding perturbation rules: accuracy (F1) of tuned LSD vs. the number of schemas in the synthetic workload (1-50), shown for the Inventory and Real Estate domains and on average; (b) exploiting prior match results (enriching the workload): accuracy vs. the percentage of previous matches in the collection (22-88%).]
Summary: The eTuner Project @ Illinois
Tuning matching systems is crucial
a long-standing problem that is getting worse
a next logical step in schema matching research
eTuner provides an automatic & principled solution
generates a synthetic workload and employs it to tune efficiently
incurs virtually no cost to human users
exploits user assistance whenever available
Extensive experiments over 4 domains with 4 systems
Future directions
find the optimal synthetic workload
apply to other matching scenarios
adapt the ideas to scenarios beyond schema matching (see 3rd speaker)
Backup: User Assistance
S(phone1, phone2, …)
Generate V by dropping phone2: V(phone1, …)
Rename phone1 in V: V(x, …)
Problem: x matches phone1, but x does not match phone2
User: group phone1 and phone2, so that if x matches phone1, it will also match phone2
Intuition: tell the system not to bother trying to distinguish phone1 from phone2