Learning to Map Between Schemas and Ontologies


Learning to Map Between Schemas and Ontologies
Alon Halevy, University of Washington. Joint work with AnHai Doan and Pedro Domingos.

Agenda
Ontology mapping is a key problem in many applications: data integration, the Semantic Web, knowledge management, e-commerce.
LSD: a solution that uses multi-strategy learning.
We started with schema matching (i.e., very simple ontologies) and are currently extending to more expressive ontologies.
Experiments show the approach is very promising!

The Structure Mapping Problem
Types of structures: database schemas, XML DTDs, ontologies, …
Input: two (or more) structures S1 and S2; data instances for S1 and S2; background knowledge.
Output: a mapping between S1 and S2. It should enable translating between data instances. Semantics of the mapping?

Semantic Mappings between Schemas
Source schemas = XML DTDs. Example: a mediated-schema element house (address, contact-info with agent-name and agent-phone, num-baths) and a source-schema element house (location, contact with name and phone, full-baths, half-baths).
- 1-1 mappings: address = location, agent-name = name, agent-phone = phone.
- Non-1-1 mapping: num-baths = full-baths + half-baths.
Note that we are *given* the two schemas; the task is only to find the mappings. The focus here is on 1-1 mappings.

Motivation
- Database schema integration: a problem as old as databases themselves; database merging, data warehouses, data migration.
- Data integration / information-gathering agents: on the WWW, in enterprises, large science projects.
- Model management: model matching is a key operator in an algebra where models and mappings are first-class objects. See [Bernstein et al., 2000] for more.
- The Semantic Web: ontology mapping.
- System interoperability: e-services, application integration, B2B applications, …

Desiderata from Proposed Solutions
- Accuracy, efficiency, ease of use.
- Realistic expectations: unlikely to be fully automated; need the user in the loop.
- Some notion of semantics for mappings.
- Extensibility: the solution should exploit additional background knowledge.
- "Memory", knowledge reuse: the system should exploit previous manual or automatically generated matchings. This is the key idea behind LSD.

LSD Overview
L(earning) S(ource) D(escriptions).
Problem: generating semantic mappings between a mediated schema and a large set of data-source schemas.
Key idea: generate the first mappings manually, and learn from them to generate the rest.
Technique: multi-strategy learning (extensible!). Step 1 [SIGMOD 2001]: 1-1 mappings between XML DTDs. Current focus: complex mappings; ontology mapping.

Outline: Overview of structure mapping; Data integration and source mappings; LSD architecture and details; Experimental results; Current work.

Data Integration
Example query: find houses with four bathrooms priced under $500,000. The query is posed against the mediated schema, then reformulated and optimized into queries over source schemas 1-3, which reach the sources (realestate.com, homeseekers.com, homes.com) through wrappers.
Applications: WWW, enterprises, science projects. Techniques: virtual data integration, warehousing, custom code.


Semantics (preliminary)
The semantics of mappings has received no attention. Semantics of 1-1 mappings: given R(A1,…,An) and S(B1,…,Bm) with 1-1 mappings (Ai, Bj), we postulate the existence of a relation W such that
$\pi_{C_1,\dots,C_k}(W) = \pi_{A_1,\dots,A_k}(R)$ and $\pi_{C_1,\dots,C_k}(W) = \pi_{B_1,\dots,B_k}(S)$,
and W also includes the unmatched attributes of R and S. In English: R and S are projections of some universal relation W, and the mappings specify the projection variables and correspondences.
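A concrete instance (example mine, not on the slide): take R(address, price) and S(location, listed-price) with 1-1 mappings (address, location) and (price, listed-price). The postulate then reads

$$\pi_{C_1,C_2}(W) = \pi_{\mathrm{address},\,\mathrm{price}}(R), \qquad \pi_{C_1,C_2}(W) = \pi_{\mathrm{location},\,\mathrm{listed\text{-}price}}(S),$$

i.e., R and S list the same pairs of the universal relation W under different attribute names.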

Why Matching is Difficult
Matching aims to identify the same real-world entity using names, structures, types, data values, etc., but schemas represent the same entity differently:
- different names => same entity: area & address => location
- same name => different entities: area => location or square-feet
Schema & data never fully capture semantics: they are not adequately documented and not sufficiently expressive. Intended semantics is typically subjective (IBM Almaden Lab = IBM?). Matching cannot be fully automated, and is often hard even for humans: committees are required!

Current State of Affairs
Finding semantic mappings is now the bottleneck! It is largely done by hand, and is labor intensive & error prone; GTE reported 4 hours/element for 27,000 elements [Li & Clifton 00]. The problem will only be exacerbated as data sharing & XML become pervasive: proliferation of DTDs, translation of legacy data, reconciling ontologies on the Semantic Web. We need semi-automatic approaches to scale up!

Outline: Overview of structure mapping; Data integration and source mappings; LSD architecture and details; Experimental results; Current work.

The LSD Approach
The user manually maps a few data sources to the mediated schema. LSD learns from these mappings, and proposes mappings for the rest of the sources. Several types of knowledge are used in learning:
- schema elements, e.g., attribute names
- data elements: ranges, formats, word frequencies, value frequencies, length of texts
- proximity of attributes
- functional dependencies, number of attribute occurrences.
One learner does not fit all: use multiple learners and combine them with a meta-learner.

Example
Mediated schema: address, price, agent-phone, description. Schema of realestate.com: location, listed-price, phone, comments.
Sample realestate.com listings: location (Miami, FL; Boston, MA), listed-price ($250,000; $110,000), phone ((305) 729 0831; (617) 253 1429), comments (Fantastic house; Great location).
Learned hypotheses: if "phone" occurs in the element name => agent-phone; if "fantastic" & "great" occur frequently in data values => description.
These hypotheses then apply to a new source, homes.com: price ($550,000; $320,000), contact-phone ((278) 345 7215; (617) 335 2315), extra-info (Beautiful yard; Great beach).
We do not manually map every source schema to the mediated schema: the goal is to mark up only a few sources, learn from them, and propose mappings for subsequent sources.

Multi-Strategy Learning
Use a set of base learners (Name learner, Naive Bayes, Whirl, XML learner) and a set of recognizers (county name, zip code, phone numbers). Each base learner produces a prediction weighted by a confidence score. The base learners are combined with a meta-learner, using stacking.
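A minimal sketch of this combination step (mine, not LSD's actual code; the learner and weight structures are assumptions), in Python:

```python
from typing import Callable, Dict, List, Tuple

# A base learner maps (element name, sample values) to {label: confidence}.
BaseLearner = Callable[[str, List[str]], Dict[str, float]]

def combine_predictions(
    learners: Dict[str, BaseLearner],       # learner name -> learner
    weights: Dict[Tuple[str, str], float],  # (learner name, label) -> weight
    element: str,
    values: List[str],
) -> Dict[str, float]:
    """Stacking-style combination: weighted sum of base-learner confidences."""
    combined: Dict[str, float] = {}
    for lname, learner in learners.items():
        for label, conf in learner(element, values).items():
            w = weights.get((lname, label), 0.0)
            combined[label] = combined.get(label, 0.0) + w * conf
    return combined
```

With weight(Name-Learner, address) = 0.1, weight(Naive-Bayes, address) = 0.9 and confidences 0.6 and 0.8, this reproduces the 0.6*0.1 + 0.8*0.9 = 0.78 score on the stacking slide below.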

Base Learners
Name Learner: matches schema elements by their names. From training pairs such as (contact-info, office-address), (contact, agent-phone), (phone, agent-phone), (listed-price, price), it predicts contact-phone => (agent-phone, 0.7), (office-address, 0.3).
Naive Bayes Learner [Domingos & Pazzani 97]: matches elements by their data values, e.g., "Kent, WA" => (address, 0.8), (name, 0.2).
Whirl Learner [Cohen & Hirsh 98].
XML Learner: exploits the hierarchical structure of XML data.
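The deck does not show learner internals, so here is a plausible toy name learner (an assumption of mine, not the actual LSD implementation) that scores mediated-schema labels by character 3-gram overlap between element names:

```python
def ngrams(s: str, n: int = 3) -> set:
    s = s.lower().replace("-", " ")
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def name_learner(element: str, training: list) -> dict:
    """training: [(source element name, mediated label), ...].
    Returns a normalized score per label (Jaccard similarity on 3-grams)."""
    scores = {}
    for src_name, label in training:
        a, b = ngrams(element), ngrams(src_name)
        sim = len(a & b) / len(a | b) if (a | b) else 0.0
        scores[label] = max(scores.get(label, 0.0), sim)
    total = sum(scores.values()) or 1.0
    return {label: s / total for label, s in scores.items()}

# name_learner("contact-phone", [("contact", "agent-phone"),
#                                ("contact-info", "office-address")])
# -> scores agent-phone above office-address
```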

Training the Base Learners
Training = gleaning knowledge from the marked-up data. The user maps the realestate.com schema to the mediated schema, and each learner extracts training examples from listings such as <location>Miami, FL</>, <listed-price>$250,000</>, <phone>(305) 729 0831</>, <comments>Fantastic house</>.
Name Learner examples: (location, address), (listed-price, price), (phone, agent-phone), …
Naive Bayes Learner examples: ("Miami, FL", address), ("$250,000", price), ("(305) 729 0831", agent-phone), …

Entity Recognizers
Use pre-programmed knowledge to identify specific types of entities: date, time, city, zip code, name, etc.; e.g., a house-area recognizer (30 X 70, 500 sq. ft.) or a county-name recognizer. Recognizers often have nice characteristics: they are easy to construct, many off-the-shelf research & commercial products exist, they are applicable across many domains, and they help with special cases that are hard to learn.
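A recognizer can be as simple as a regular expression; a sketch (mine; the patterns match the phone/zip formats used in the deck's examples):

```python
import re

RECOGNIZERS = {
    "phone":    re.compile(r"\(\d{3}\) \d{3} \d{4}$"),
    "zip-code": re.compile(r"\d{5}(-\d{4})?$"),
}

def recognize(values: list) -> dict:
    """Fraction of the sample values each recognizer accepts."""
    n = max(len(values), 1)
    return {label: sum(bool(rx.match(v.strip())) for v in values) / n
            for label, rx in RECOGNIZERS.items()}

# recognize(["(305) 729 0831", "(617) 253 1429"])
# -> {"phone": 1.0, "zip-code": 0.0}
```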

Meta-Learner: Stacking
Training the meta-learner produces a weight for every (base-learner, mediated-schema element) pair, e.g., weight(Name-Learner, address) = 0.1 and weight(Naive-Bayes, address) = 0.9. To combine predictions, the meta-learner computes the weighted sum of the base-learner confidence scores. Example: for <area>Seattle, WA</>, the Name Learner predicts (address, 0.6) and Naive Bayes predicts (address, 0.8), so the meta-learner outputs (address, 0.6*0.1 + 0.8*0.9 = 0.78).

Training the Meta-Learner
For the mediated element address, each extracted XML instance yields the base learners' confidence scores and the true 0/1 label:

Extracted XML instance          Name Learner  Naive Bayes  True
<location>Miami, FL</>          0.5           0.8          1
<listed-price>$250,000</>       0.4           0.3          0
<area>Seattle, WA</>            0.3           0.9          1
<house-addr>Kent, WA</>         0.6           0.8          1
<num-baths>3</>                 0.3           0.3          0

Least-squares linear regression over these rows yields weight(Name-Learner, address) = 0.1 and weight(Naive-Bayes, address) = 0.9.
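A sketch of that regression step with NumPy (my reconstruction; note that plain least squares on this five-row toy table gives weights in the same spirit as, but not numerically equal to, the slide's illustrative 0.1 and 0.9):

```python
import numpy as np

# Rows = extracted instances; columns = Name Learner and Naive Bayes scores.
X = np.array([[0.5, 0.8],
              [0.4, 0.3],
              [0.3, 0.9],
              [0.6, 0.8],
              [0.3, 0.3]])
y = np.array([1, 0, 1, 1, 0])  # true 0/1 labels for "address"

# Least-squares fit: one weight per (base learner, "address") pair.
weights, *_ = np.linalg.lstsq(X, y, rcond=None)
print(dict(zip(["Name-Learner", "Naive-Bayes"], weights.round(2))))
```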

Applying the Learners
Schema of homes.com: area, day-phone, extra-info. Mediated schema: address, price, agent-phone, description.
- <area> values (Seattle, WA; Kent, WA; Austin, TX): the Name Learner and Naive Bayes each make a prediction, and the Meta-Learner combines them into (address, 0.7), (description, 0.3).
- <day-phone> values ((278) 345 7215; (617) 335 2315; (512) 427 1115): (agent-phone, 0.9), (description, 0.1).
- <extra-info> values (Beautiful yard; Great beach; Close to Seattle): (description, 0.8), (address, 0.2).
In general an arbitrary number of learners can be plugged into the matching process: a newly trained learner can simply be plugged in.

The Constraint Handler
Extends learning to incorporate constraints.
Hard constraints: if a = address and b = address then a = b; if a = house-id then a is a key; if a = agent-info and b = agent-name then b is nested in a.
Soft constraints: if a = agent-phone and b = agent-name then a and b are usually close to each other.
User feedback is expressed as hard or soft constraints. Details in [Doan et al., SIGMOD 2001].

The Current LSD System
Training phase: the mediated schema, source schemas, data listings, and domain constraints feed the base learners (Base-Learner 1 … Base-Learner k) and the meta-learner.
Matching phase: the learners' predictions pass through the constraint handler, which also incorporates domain constraints and user feedback, to produce the mappings.

Outline: Overview of structure mapping; Data integration and source mappings; LSD architecture and details; Experimental results; Current work.

Empirical Evaluation
Four domains: Real Estate I & II, Course Offerings, Faculty Listings. For each domain: create the mediated DTD & domain constraints, choose five sources, and extract & convert data listings into XML (faithful to the schema!). Mediated DTDs: 14-66 elements; source DTDs: 13-48.
Ten runs for each experiment. In each run: manually provide 1-1 mappings for 3 sources, then ask LSD to propose mappings for the remaining 2 sources. Accuracy = % of 1-1 mappings correctly identified.

Matching Accuracy
LSD's average matching accuracy: 71-92% across domains. Best single base learner: 42-72%. Adding the meta-learner: +5-22%. Adding the constraint handler: +7-13%. Adding the XML learner: +0.8-6%.

Sensitivity to Amount of Available Data
[Figure: average matching accuracy (%) vs. number of data listings per source (Real Estate I).]

Contribution of Schema vs. Data
[Figure: average matching accuracy (%) for LSD with only schema information, LSD with only data information, and the complete LSD system.] More experiments in the paper [Doan et al. 01].

Reasons for Incorrect Matching
- Unfamiliarity, e.g., suburb. Solution: add a suburb-name recognizer.
- Insufficient information: LSD correctly identified the general type but failed to pinpoint the exact type, e.g., <agent-name>Richard Smith</> <phone>(206) 234 5412</>. Solution: add a proximity learner.
- Subjectivity: house-style = description?

Outline: Overview of structure mapping; Data integration and source mappings; LSD architecture and details; Experimental results; Current work.

Moving Up the Expressiveness Ladder
Schemas are very simple ontologies. More expressive power = more domain constraints: mappings become more complex, but constraints provide more to learn from.
- Non-1-1 mappings: F1(A1,…,Am) = F2(B1,…,Bn).
- Ontologies (of various flavors): class hierarchy (i.e., containment on unary relations), relationships between objects, constraints on relationships.

Finding Non 1-1 Mappings (current work)
Given two schemas, find complex mappings:
- 1-many: address = concat(city, state)
- many-1: half-baths + full-baths = num-baths
- many-many: concat(addr-line1, addr-line2) = concat(street, city, state)
A 1-many mapping is expressed as a query: a value correspondence expression (e.g., room-rate = rate * (1 + tax-rate)) plus a relationship (e.g., the state of tax-rate = the state of the hotel that has rate).
Special case considered: 1-many mappings between two relational tables, with flat schemas so that the set of operators under consideration can be simplified. Example: mediated schema (address, description, num-baths) vs. source schema (city, state, comments, half-baths, full-baths). A sketch of applying such mappings follows.
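A sketch (mine) of how such mapping expressions might be represented and applied to source rows once discovered:

```python
# Hypothetical complex mappings from the source schema to the mediated schema.
MAPPINGS = {
    "address":   lambda r: f"{r['city']}, {r['state']}",       # 1-many: concat(city, state)
    "num-baths": lambda r: r["half-baths"] + r["full-baths"],  # many-1
}

row = {"city": "Seattle", "state": "WA", "half-baths": 1, "full-baths": 2}
print({target: f(row) for target, f in MAPPINGS.items()})
# {'address': 'Seattle, WA', 'num-baths': 3}
```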

Brute-Force Solution
Define a set of operators: concat, +, -, *, /, etc. For each set of mediated-schema columns, enumerate all possible mappings m1, m2, …, mk over the source-schema columns, evaluate each by computing similarity using all base learners, and return the best mapping.

Search-Based Solution
States = columns; the goal state is the mediated-schema column, and the initial states are all source-schema columns (1-1 matching is used to reduce the set of initial states). Operators: concat, +, -, *, /, etc. Column similarity: all base learners + recognizers.
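A minimal beam-search sketch (mine; the deck only names beam search as the default technique). States are derived columns, operators combine two columns row-wise, and a similarity function scores each state against the target mediated-schema column:

```python
import heapq
from itertools import combinations

def beam_search(columns, target_sim, operators, beam_width=5, depth=3):
    """columns: {name: list of row values}; target_sim(values) -> score
    against the mediated-schema column; operators: {name: fn(a, b) -> value}."""
    score = lambda state: state[0]
    beam = [(target_sim(vals), name, vals) for name, vals in columns.items()]
    beam = heapq.nlargest(beam_width, beam, key=score)
    best = max(beam, key=score)
    for _ in range(depth):
        candidates = []
        for (_, n1, v1), (_, n2, v2) in combinations(beam, 2):
            for op_name, op in operators.items():
                vals = [op(a, b) for a, b in zip(v1, v2)]
                candidates.append(
                    (target_sim(vals), f"{op_name}({n1},{n2})", vals))
        beam = heapq.nlargest(beam_width, beam + candidates, key=score)
        best = max([best] + beam, key=score)
    return best  # (similarity, mapping expression, derived column)
```

For example, with operators = {"concat": lambda a, b: f"{a} {b}"} and city/state columns, the search can discover concat(city,state) as a candidate match for address.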

Multi-Strategy Search
Use a set of expert modules L1, L2, …, Ln. Each module applies to only certain types of mediated-schema columns, searches a small subspace, and uses a cheap similarity measure to compare columns. Examples: L1: text; concat; TF/IDF. L2: numeric; +, -, *, /; [Ho et al. 2000]. L3: address; concat; Naive Bayes. Search technique: beam search as default; the modules are specialized and do not have to materialize columns.

Multi-Strategy Search (cont'd)
Apply all applicable expert modules: L1 produces m11, m12, m13, …, m1x; L2 produces m21, m22, m23, …, m2y; L3 produces m31, m32, m33, …, m3z. Then combine the modules' predictions (e.g., candidates m11, m12, m21, m22, m31, m32) by computing similarity using all base learners, and select the best one (m11).

Related Work
- Recognizers + schema, 1-1 matching: TRANSCM [Milo & Zohar 98], ARTEMIS [Castano & Antonellis 99], [Palopoli et al. 98], CUPID [Madhavan et al. 01].
- Single learner, 1-1 matching: SEMINT [Li & Clifton 94], ILA [Perkowitz & Etzioni 95].
- Hybrid, 1-1 matching: DELTA [Clifton et al. 97].
- Multi-strategy learning, learners + recognizers, schema + data, 1-1 + non-1-1 matching: LSD [Doan et al. 2000, 2001].
- Sophisticated data-driven user interaction, schema + data, 1-1 + non-1-1 matching: CLIO [Miller et al. 00], [Yan et al. 01].

Summary
LSD uses multi-strategy learning to semi-automatically generate semantic mappings. It is extensible, and incorporates domain knowledge, user knowledge, and previous techniques. Experimental results show the approach is very promising.
Future work and issues to ponder: accommodating more expressive languages (ontologies); reuse of learned concepts from related domains; semantics? Data management is a fertile area for Machine Learning research!

Backup Slides

Mapping Maintenance
Mappings m1, m2, m3 are established between mediated schema M and source schema S. Ten months later, the schemas have evolved into M' and S': are the mappings still correct?

Information Extraction from Text
Extract data fragments from text documents, e.g., the date, location, & victim's name from a news article. Research has focused intensively on free-text documents, but many documents do have substantial structure: XML pages, name cards, tables, lists. Each such document = a data source: the structure forms a schema, with only one data value per schema element, whereas a "real" data source has many data values per schema element. This is ongoing research in the IE community.

Contribution of Each Component
[Figure: average matching accuracy (%) of the complete LSD system vs. variants without the Name Learner, without Naive Bayes, without the Whirl Learner, and without the Constraint Handler.]

Exploiting Hierarchical Structure
Existing learners flatten out all structures. We developed an XML learner similar to the Naive Bayes learner (input instance = bag of tokens), but differing in one crucial aspect: it considers not only text tokens but also structure tokens. Compare:
<contact> <name> Gail Murphy </name> <firm> MAX Realtors </firm> </contact>
<description> Victorian house with a view. Name your price! To see it, contact Gail Murphy at MAX Realtors. </description>
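A sketch (mine) of the bag-of-tokens idea with structure tokens; the same words under <contact> and <description> yield different bags, so a Naive Bayes-style learner can tell the two elements apart:

```python
import xml.etree.ElementTree as ET

def bag_of_tokens(xml_string: str) -> list:
    """Text tokens plus structure tokens such as <name> and </name>."""
    def walk(elem):
        tokens = [f"<{elem.tag}>"]
        if elem.text:
            tokens += elem.text.split()
        for child in elem:
            tokens += walk(child)
            if child.tail:
                tokens += child.tail.split()
        tokens.append(f"</{elem.tag}>")
        return tokens
    return walk(ET.fromstring(xml_string))

print(bag_of_tokens(
    "<contact><name>Gail Murphy</name><firm>MAX Realtors</firm></contact>"))
# ['<contact>', '<name>', 'Gail', 'Murphy', '</name>',
#  '<firm>', 'MAX', 'Realtors', '</firm>', '</contact>']
```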

Domain Constraints
Impose semantic regularities on sources, verified using schema or data. Examples: if a = address and b = address then a = b; if a = house-id then a is a key; if a = agent-info and b = agent-name then b is nested in a. Constraints can be specified up front when creating the mediated schema, independently of any actual source schema.

The Constraint Handler
Predictions from the Meta-Learner: area: (address, 0.7), (description, 0.3); contact-phone: (agent-phone, 0.9), (description, 0.1); extra-info: (address, 0.6), (description, 0.4).
Domain constraint: if a = address and b = address then a = b.
Candidate assignments are scored by the product of their confidences:
- area: description, contact-phone: description, extra-info: description => 0.3 * 0.1 * 0.4 = 0.012
- area: address, contact-phone: agent-phone, extra-info: address => 0.7 * 0.9 * 0.6 = 0.378 (violates the constraint)
- area: address, contact-phone: agent-phone, extra-info: description => 0.7 * 0.9 * 0.4 = 0.252 (best valid assignment)
Arbitrary constraints can be specified, and user feedback is just another domain constraint (e.g., ad-id = house-id). The handler is extended to domain heuristics, e.g., if a = agent-phone and b = agent-name then a and b are usually close to each other.
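A sketch (mine) of this selection step: enumerate full assignments, score each by the product of its confidences, and keep the best one that satisfies every hard constraint:

```python
from itertools import product
from math import prod

predictions = {
    "area":          {"address": 0.7, "description": 0.3},
    "contact-phone": {"agent-phone": 0.9, "description": 0.1},
    "extra-info":    {"address": 0.6, "description": 0.4},
}

def address_unique(assignment: dict) -> bool:
    # Hard constraint: at most one source element maps to "address".
    return sum(label == "address" for label in assignment.values()) <= 1

def best_assignment(predictions, constraints):
    elements = list(predictions)
    best, best_score = None, -1.0
    for labels in product(*(predictions[e] for e in elements)):
        assignment = dict(zip(elements, labels))
        if not all(c(assignment) for c in constraints):
            continue  # prune assignments that violate a hard constraint
        score = prod(predictions[e][l] for e, l in assignment.items())
        if score > best_score:
            best, best_score = assignment, score
    return best, best_score

print(best_assignment(predictions, [address_unique]))
# ({'area': 'address', 'contact-phone': 'agent-phone',
#   'extra-info': 'description'}, 0.252)
```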