AnHai Doan, Pedro Domingos, Alon Halevy University of Washington Reconciling Schemas of Disparate Data Sources: A Machine Learning Approach The LSD Project.

Presentation transcript:

AnHai Doan, Pedro Domingos, Alon Halevy University of Washington Reconciling Schemas of Disparate Data Sources: A Machine Learning Approach The LSD Project

2 Data Integration
Query: Find houses with four bathrooms priced under $500,000
[Figure labels: mediated schema; wrapper; source schema 1, source schema 2, source schema 3; sources homes.com, realestate.com, homeseekers.com]

3 Semantic Mappings between Schemas
Mediated & source schemas = XML DTDs
[Figure labels: schema elements house, location, contact, house, address, name, phone, num-baths, full-baths, half-baths, contact-info, agent-name, agent-phone; arrows marked "1-1 mapping" and "non 1-1 mapping"]

4 Current State of Affairs
Finding semantic mappings is now the bottleneck!
– largely done by hand
– labor intensive & error prone
Will only be exacerbated
– data sharing & XML become pervasive
– proliferation of DTDs
– translation of legacy data
– reconciling ontologies on the semantic web
Need (semi-)automatic approaches to scale up!

5 The LSD (Learning Source Descriptions) Approach
Suppose user wants to integrate 100 data sources
1. User
– manually creates mappings for a few sources, say 3
– shows LSD these mappings
2. LSD learns from the mappings
3. LSD proposes mappings for remaining 97 sources

6 Example
Mediated schema: address, price, agent-phone, description
Schema of realestate.com: location, listed-price, phone, comments
realestate.com data
– location: Miami, FL Boston, MA ...
– listed-price: $250,000 $110,
– phone: (305) (617)
– comments: Fantastic house Great location ...
homes.com data
– price: $550,000 $320,
– contact-phone: (278) (617)
– extra-info: Beautiful yard Great beach ...
Learned hypotheses
– If “phone” occurs in the name => agent-phone
– If “fantastic” & “great” occur frequently in data values => description

7 Our Contributions
1. Use of multi-strategy learning
– well-suited to exploit multiple types of knowledge
– highly modular & extensible
2. Extend learning to incorporate constraints
– handle a wide range of domain & user-specified constraints
3. Develop XML learner
– exploit hierarchical nature of XML

8 Multi-Strategy Learning
Use a set of base learners
– each exploits well certain types of information
Match schema elements of a new source
– apply the base learners
– combine their predictions using a meta-learner
Meta-learner
– uses training sources to measure base learner accuracy
– weighs each learner based on its accuracy
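The meta-learner idea on this slide can be sketched in a few lines. This is a minimal illustration, not the LSD implementation: the function names `learner_accuracy` and `combine`, the training predictions, and the scores for the new element are all hypothetical, and plain accuracy is assumed as the weight of each learner.

```python
from collections import defaultdict


def learner_accuracy(predicted_labels, true_labels):
    """Fraction of training elements for which a learner's top label was right."""
    correct = sum(p == t for p, t in zip(predicted_labels, true_labels))
    return correct / len(true_labels)


def combine(predictions, weights):
    """Weighted sum of each learner's label scores, normalized to sum to 1."""
    combined = defaultdict(float)
    for learner, scores in predictions.items():
        for label, score in scores.items():
            combined[label] += weights[learner] * score
    total = sum(combined.values())
    return {label: score / total for label, score in combined.items()}


# Hypothetical top-label predictions on elements of the training sources.
truth = ["address", "price", "agent-phone", "description"]
weights = {
    "name_learner": learner_accuracy(
        ["address", "price", "description", "address"], truth),
    "naive_bayes": learner_accuracy(
        ["address", "price", "description", "description"], truth),
}

# Hypothetical base-learner scores for one element of a new source.
predictions = {
    "name_learner": {"address": 0.5, "description": 0.5},
    "naive_bayes": {"address": 0.9, "description": 0.1},
}
combined = combine(predictions, weights)
best_label = max(combined, key=combined.get)
```

Here the Naive Bayes learner was right more often on the training sources, so its confident vote for `address` outweighs the undecided Name learner.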

9 Base Learners
Input
– schema information: name, proximity, structure, ...
– data information: value, format, ...
Output
– prediction weighted by confidence score
Examples
– Name learner
– agent-name => (name,0.7), (phone,0.3)
– Naive Bayes learner
– “Kent, WA” => (address,0.8), (name,0.2)
– “Great location” => (description,0.9), (address,0.1)
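To make the base-learner interface concrete, here is a rough sketch of a name-based learner that takes an element name and returns confidence-weighted predictions. Jaccard token overlap is an assumption standing in for whatever similarity the actual Name learner uses, and `tokens` and `name_learner` are hypothetical names.

```python
import re


def tokens(name):
    """Split an element name such as 'agent-name' into lowercase word tokens."""
    return set(re.findall(r"[a-z]+", name.lower()))


def name_learner(source_name, mediated_names):
    """Score each mediated-schema name by Jaccard token overlap, normalized."""
    src = tokens(source_name)
    raw = {}
    for candidate in mediated_names:
        cand = tokens(candidate)
        union = src | cand
        raw[candidate] = len(src & cand) / len(union) if union else 0.0
    total = sum(raw.values())
    if total == 0:
        # No shared tokens at all: fall back to a uniform distribution.
        return {c: 1 / len(mediated_names) for c in mediated_names}
    return {c: s / total for c, s in raw.items()}


mediated = ["address", "price", "agent-phone", "description"]
scores = name_learner("contact-phone", mediated)
```

Any learner that returns a dictionary of label scores in this shape could be swapped in, which is what makes the multi-learner design modular.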

10 Training the Learners
Mediated schema: address, price, agent-phone, description
Schema of realestate.com: location, listed-price, phone, comments
realestate.com data
– Miami, FL $250,000 (305) Fantastic house
– Boston, MA $110,000 (617) Great location
Name Learner training examples
– (location, address) (listed-price, price) (phone, agent-phone) (comments, description) ...
Naive Bayes Learner training examples
– (“Miami, FL”, address) (“$ 250,000”, price) (“(305) ”, agent-phone) (“Fantastic house”, description) ...
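The training step can be illustrated as follows: the user-supplied mapping turns element names into examples for the Name learner and data values into examples for the Naive Bayes learner. This is a sketch under stated assumptions: the tokenizer, the add-one smoothing, and the `NaiveBayes` class are illustrative choices, and phone values are left out because the transcript truncates them.

```python
import math
import re
from collections import Counter, defaultdict

# Manually supplied 1-1 mapping for realestate.com (taken from the slide).
mapping = {
    "location": "address",
    "listed-price": "price",
    "phone": "agent-phone",
    "comments": "description",
}

# Two data listings from the slide, without the truncated phone values.
listings = [
    {"location": "Miami, FL", "listed-price": "$250,000", "comments": "Fantastic house"},
    {"location": "Boston, MA", "listed-price": "$110,000", "comments": "Great location"},
]

# Name learner examples: (source element name, mediated element name).
name_examples = list(mapping.items())

# Naive Bayes examples: (data value, mediated element name).
value_examples = [
    (value, mapping[tag]) for listing in listings for tag, value in listing.items()
]


def tokenize(text):
    """Lowercase words, digit runs, and the dollar sign (an assumed tokenizer)."""
    return re.findall(r"[a-z]+|\d+|\$", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes over tokens with add-one smoothing."""

    def fit(self, examples):
        self.label_counts = Counter(label for _, label in examples)
        self.token_counts = defaultdict(Counter)
        for text, label in examples:
            self.token_counts[label].update(tokenize(text))
        self.vocab = {t for counts in self.token_counts.values() for t in counts}
        return self

    def predict(self, text):
        """Return a normalized confidence score for every known label."""
        n = sum(self.label_counts.values())
        log_scores = {}
        for label, count in self.label_counts.items():
            total = sum(self.token_counts[label].values())
            score = math.log(count / n)
            for tok in tokenize(text):
                seen = self.token_counts[label][tok]
                score += math.log((seen + 1) / (total + len(self.vocab)))
            log_scores[label] = score
        top = max(log_scores.values())
        exp = {label: math.exp(s - top) for label, s in log_scores.items()}
        z = sum(exp.values())
        return {label: v / z for label, v in exp.items()}


model = NaiveBayes().fit(value_examples)
scores = model.predict("Great beach")
best_label = max(scores, key=scores.get)
```

With only two listings the model is tiny, but the token "great" already pulls "Great beach" toward `description`.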

11 Applying the Learners
Mediated schema: address, price, agent-phone, description
Schema of homes.com: area, day-phone, extra-info
homes.com data
– Seattle, WA Kent, WA Austin, TX
– (278) (617) (512)
– Beautiful yard Great beach Close to Seattle
Predictions shown for Name Learner, Naive Bayes, and Meta-Learner
– (address,0.8), (description,0.2)
– (address,0.6), (description,0.4)
– (address,0.7), (description,0.3)
– (address,0.6), (description,0.4)
– (address,0.7), (description,0.3)
– (agent-phone,0.9), (description,0.1)
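One plausible reading of this slide is that predictions made for individual data values are merged into a single prediction for the source element. The sketch below assumes simple averaging, and the pairing of scores with the three area values is also an assumption, since the transcript has lost the figure layout.

```python
from collections import defaultdict


def average_predictions(instance_predictions):
    """Average label scores over all data instances of one source element."""
    sums = defaultdict(float)
    for scores in instance_predictions:
        for label, score in scores.items():
            sums[label] += score
    n = len(instance_predictions)
    return {label: total / n for label, total in sums.items()}


# Assumed per-instance predictions for the "area" element.
area_instances = [
    {"address": 0.8, "description": 0.2},  # e.g. "Seattle, WA"
    {"address": 0.6, "description": 0.4},  # e.g. "Kent, WA"
    {"address": 0.7, "description": 0.3},  # e.g. "Austin, TX"
]
area_prediction = average_predictions(area_instances)
```

Under these assumptions the averaged result comes out close to (address,0.7), (description,0.3), which is the element-level prediction that appears for area elsewhere in the deck.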

12 Domain Constraints
Impose semantic regularities on sources
– verified using schema or data
Examples
– a = address & b = address => a = b
– a = house-id => a is a key
– a = agent-info & b = agent-name => b is nested in a
Can be specified up front
– when creating mediated schema
– independent of any actual source schema
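The three example constraints can be written as predicates over a candidate mapping, checked against either the source schema or the source data. This is an illustrative sketch: the `parent_of` dictionary (child element to parent element), the sample rows, and the function names are hypothetical, and nesting is simplified to direct parent-child nesting.

```python
def at_most_one_address(mapping, parent_of, rows):
    """a = address & b = address implies a = b."""
    return list(mapping.values()).count("address") <= 1


def house_id_is_key(mapping, parent_of, rows):
    """a = house-id implies the values of a are unique across listings."""
    for element, label in mapping.items():
        if label == "house-id":
            values = [row.get(element) for row in rows]
            if len(values) != len(set(values)):
                return False
    return True


def agent_name_nested_in_agent_info(mapping, parent_of, rows):
    """a = agent-info & b = agent-name implies b is nested (directly) in a."""
    infos = [e for e, label in mapping.items() if label == "agent-info"]
    names = [e for e, label in mapping.items() if label == "agent-name"]
    return all(parent_of.get(b) == a for a in infos for b in names)


CONSTRAINTS = [at_most_one_address, house_id_is_key, agent_name_nested_in_agent_info]

# Hypothetical source schema (child -> parent) and data rows.
parent_of = {"ad-id": "house", "area": "house", "contact": "house", "name": "contact"}
rows = [
    {"ad-id": "17", "area": "Kent, WA"},
    {"ad-id": "18", "area": "Seattle, WA"},
]

# Candidate mapping from source elements to mediated-schema elements.
mapping = {
    "ad-id": "house-id",
    "area": "address",
    "contact": "agent-info",
    "name": "agent-name",
}
satisfied = all(check(mapping, parent_of, rows) for check in CONSTRAINTS)
```

Because each predicate refers only to mediated-schema labels, the list can be written once per domain and reused for every new source.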

13 The Constraint Handler
Predictions from Meta-Learner
– area: (address,0.7), (description,0.3)
– contact-phone: (agent-phone,0.9), (description,0.1)
– extra-info: (address,0.6), (description,0.4)
Domain Constraints
– a = address & b = address => a = b
Mappings
– area: address, contact-phone: agent-phone, extra-info: address
– area: address, contact-phone: agent-phone, extra-info: description
Can specify arbitrary constraints
User feedback = domain constraint
– ad-id = house-id
Extended to handle domain heuristics
– a = agent-phone & b = agent-name => a & b are usually close to each other
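The effect of the constraint handler on this example can be reproduced with a small search. The sketch below uses brute-force enumeration and scores a mapping by the product of its confidence scores; both are simplifying assumptions for illustration, not a description of the search LSD actually performs.

```python
from itertools import product

# Element-level predictions from the meta-learner (values from the slide).
predictions = {
    "area": {"address": 0.7, "description": 0.3},
    "contact-phone": {"agent-phone": 0.9, "description": 0.1},
    "extra-info": {"address": 0.6, "description": 0.4},
}


def at_most_one_address(mapping):
    """a = address & b = address implies a = b."""
    return list(mapping.values()).count("address") <= 1


def best_mapping(predictions, constraints):
    """Return the highest-scoring assignment that satisfies every constraint."""
    elements = list(predictions)
    best, best_score = None, -1.0
    # Enumerate one candidate label per element.
    for labels in product(*(predictions[e] for e in elements)):
        mapping = dict(zip(elements, labels))
        if not all(check(mapping) for check in constraints):
            continue
        score = 1.0
        for element, label in mapping.items():
            score *= predictions[element][label]
        if score > best_score:
            best, best_score = mapping, score
    return best


unconstrained = best_mapping(predictions, [])
constrained = best_mapping(predictions, [at_most_one_address])
```

Without the constraint, both area and extra-info are mapped to address; with it, extra-info falls back to description, its next-best label. Enumeration grows exponentially with the number of elements, so a real system would need a smarter search.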

14 Putting It All Together: the LSD System
[Figure labels: Training Phase; Matching Phase; Mediated schema; Source schemas; Data listings; Training data for base learners; L1, L2, Lk; Mapping Combination; Constraint Handler; Domain Constraints; User Feedback]
Base learners: Name Learner, XML learner, Naive Bayes, Whirl learner
Meta-learner
– uses stacking [Ting&Witten99, Wolpert92]
– returns linear weighted combination of base learners’ predictions

15 Empirical Evaluation
Four domains
– Real Estate I & II, Course Offerings, Faculty Listings
For each domain
– create mediated DTD & domain constraints
– choose five sources
– extract & convert data listings into XML
– mediated DTDs: elements, source DTDs:
Ten runs for each experiment; in each run:
– manually provide 1-1 mappings for 3 sources
– ask LSD to propose mappings for remaining 2 sources
– accuracy = % of 1-1 mappings correctly identified
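The accuracy measure defined on this slide is straightforward to compute. In the sketch below, `matching_accuracy` is a hypothetical helper and the two mappings are made-up inputs, not results from the experiments.

```python
def matching_accuracy(proposed, gold):
    """Percentage of gold 1-1 mappings that the proposed mapping reproduces."""
    correct = sum(
        1 for element, label in gold.items() if proposed.get(element) == label
    )
    return 100.0 * correct / len(gold)


# Hypothetical gold-standard and proposed mappings for one test source.
gold = {
    "area": "address",
    "price": "price",
    "contact-phone": "agent-phone",
    "extra-info": "description",
}
proposed = {
    "area": "address",
    "price": "price",
    "contact-phone": "agent-phone",
    "extra-info": "address",
}
accuracy = matching_accuracy(proposed, gold)
```

Averaging this value over the test sources and over the ten runs would give one number per experiment.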

16 High Matching Accuracy
LSD’s accuracy: %
Best single base learner: %
+ Meta-learner: %
+ Constraint handler: %
+ XML learner: %
[Figure: bar chart of Average Matching Accuracy (%)]

17 Performance Sensitivity
[Figure: average matching accuracy (%) vs. number of data listings per source]

18 Contribution of Schema vs. Data
[Figure: average matching accuracy (%)]
More experiments in the paper!

19 Related Work
Rule-based approaches
– TRANSCM [Milo&Zohar98], ARTEMIS [Castano&Antonellis99], [Palopoli et al. 98], CUPID [Madhavan et al. 01]
– utilize only schema information
Learner-based approaches
– SEMINT [Li&Clifton94], ILA [Perkowitz&Etzioni95]
– employ a single learner, limited applicability
Others
– DELTA [Clifton et al. 97], CLIO [Miller et al. 00][Yan et al. 01]
Multi-strategy learning in other domains
– series of workshops [91,93,96,98,00]
– [Freitag98], Proverb [Keim et al. 99]

20 Summary
LSD project
– applies machine learning to schema matching
Main ideas & contributions
– use of multi-strategy learning
– extend learning to handle domain & user-specified constraints
– develop XML learner
System design: A contribution to generic schema-matching
– highly modular & extensible
– handle multiple types of knowledge
– continuously improve over time

21 Ongoing & Future Work
Improve accuracy
– address current system limitations
Extend LSD to more complex mappings
Apply LSD to other application contexts
– data translation
– data warehousing
– e-commerce
– information extraction
– semantic web

22 Contribution of Each Component
[Figure: Average Matching Accuracy (%) for five configurations]
– Without Name Learner
– Without Naive Bayes
– Without Whirl Learner
– Without Constraint Handler
– The complete LSD system

23 Exploiting Hierarchical Structure
Existing learners flatten out all structures
Developed XML learner
– similar to the Naive Bayes learner
– input instance = bag of tokens
– differs in one crucial aspect
– consider not only text tokens, but also structure tokens
Example
– Victorian house with a view. Name your price! To see it, contact Gail Murphy at MAX Realtors.
– Gail Murphy MAX Realtors
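The difference between text tokens and structure tokens can be shown with a small sketch. The tag names `contact-info`, `name`, and `firm` are hypothetical, since the transcript has lost the markup of the example, and emitting one token per nested tag is a simplification of whatever structure tokens the XML learner really uses.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter


def text_tokens(text):
    """Lowercase word tokens of a text fragment (empty for None)."""
    return re.findall(r"[a-z]+", (text or "").lower())


def bag_of_tokens(element):
    """Text tokens plus one structure token per nested element."""
    bag = Counter()
    for node in element.iter():
        if node is not element:
            # Structure token for a nested element, e.g. "<name>".
            bag[f"<{node.tag}>"] += 1
            # Text that follows the nested element inside its parent.
            bag.update(text_tokens(node.tail))
        bag.update(text_tokens(node.text))
    return bag


# Hypothetical markup for the contact part of the example listing.
listing = ET.fromstring(
    "<contact-info>To see it, contact "
    "<name>Gail Murphy</name> at <firm>MAX Realtors</firm>.</contact-info>"
)
bag = bag_of_tokens(listing)
```

A flat learner would see only the words; here the bag also records that the instance contains a `<name>` and a `<firm>` sub-element, which is evidence a Naive Bayes style classifier can use in the same way as any other token.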