AnHai Doan Pedro Domingos Alon Levy Department of Computer Science & Engineering University of Washington Learning Source Descriptions for Data Integration.

Presentation transcript:

AnHai Doan Pedro Domingos Alon Levy Department of Computer Science & Engineering University of Washington Learning Source Descriptions for Data Integration

2 Overview
Problem definition
  – schema matching
Solution
  – multi-strategy learning
Prototype system
  – LSD (Learning Source Descriptions)
Experiments
Related work
Summary & future work

3 Data Integration
Query: Find houses with four bathrooms and price under $500,000
[Figure: the query is posed against a mediated schema; realestate.com, homeseekers.com, and superhomes.com each have a source schema and are reached through a wrapper]

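4 Semantic Mappings between Schemas
Mediated & source schemas = XML DTDs
[Figure: two schema trees whose elements are to be matched. Element labels, in the order shown: house, location, contact-info; house, address, agent-name, agent-phone, num-baths, amenities, full-baths, half-baths, handicap-equipped; contact, name, phone]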

5 Map of the Problem
[Figure: problem map. Labels: source descriptions; schema matching, data translation; scope, completeness, reliability, query capability; leaf elements, higher-level elements; 1-1 mappings, complex mappings]

6 Current State of Affairs
Largely done by hand
  – labor intensive & error prone
  – key bottleneck in building applications
Will only be exacerbated
  – data sharing & XML become pervasive
  – proliferation of DTDs
  – translation of legacy data
Need automatic approaches to scale up!

7 Our Approach
Use machine learning to match schemas
Basic idea
1. create training data
  – manually map a set of sources to mediated schema
2. train system on training data
  – learns from
    – name of schema elements
    – format of values
    – frequency of words & symbols
    – characteristics of value distribution
    – proximity, position, structure, ...
3. system proposes mappings for subsequent sources

8 Example
mediated schema: address, phone, price, description
Sample realestate.com listing: Seattle, WA | (206) | $250,000 | Fantastic house
realestate.com source elements and values
  – location: Seattle, WA; Dallas, TX; ...
  – listed-price: $250,000; $162,000; $180,
  – agent-phone: (206); (206); (214)
  – comments: Fantastic house...; Great...; Hurry!...

9 Multi-Strategy Learning
Use a set of base learners
  – each exploits certain types of information
Match schema elements of a new source
  – apply the learners
  – combine their predictions using a meta-learner
Meta-learner
  – measures base learner accuracy on training data
  – weighs each learner based on its accuracy
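The accuracy-based weighting on this slide can be sketched roughly as follows. This is an illustrative sketch only, not LSD's code: the function names `learner_accuracy` and `combine`, the convention that a learner is a callable returning a label-to-confidence dict, and the two toy learners are all assumptions made for the example.

```python
from collections import defaultdict

def learner_accuracy(learner, examples):
    """Fraction of (item, true_label) examples whose top-scoring label is correct."""
    correct = 0
    for item, true_label in examples:
        scores = learner(item)  # dict: label -> confidence
        if scores and max(scores, key=scores.get) == true_label:
            correct += 1
    return correct / len(examples) if examples else 0.0

def combine(learners, weights, item):
    """Weighted sum of every learner's confidence scores for one item."""
    total = defaultdict(float)
    for learner, weight in zip(learners, weights):
        for label, conf in learner(item).items():
            total[label] += weight * conf
    return dict(total)

# Hypothetical toy learners, invented for this example.
def name_learner(item):
    # Always leans towards "address", whatever the input.
    return {"address": 0.6, "description": 0.4}

def value_learner(item):
    # Treats a comma as weak evidence of a "City, ST" style address.
    if "," in item:
        return {"address": 0.9, "description": 0.1}
    return {"address": 0.1, "description": 0.9}

train = [("Seattle, WA", "address"), ("Great house", "description")]
learners = [name_learner, value_learner]
weights = [learner_accuracy(l, train) for l in learners]
scores = combine(learners, weights, "Kent, WA")
best = max(scores, key=scores.get)
```

Here the less accurate learner still contributes, but with half the weight of the learner that got every training example right.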

10 Learners
Input
  – schema information: name, proximity, structure, ...
  – data information: value, format, ...
Output
  – prediction weighted by confidence score
Examples
  – Name matcher
    – agent-name => (name, 0.7), (phone, 0.3)
  – Frequency learner
    – “Seattle, WA” => (address, 0.8), (name, 0.2)
    – “Great location...” => (description, 0.9), (address, 0.1)
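To make the learner interface concrete, here is a minimal sketch of a name-based learner that returns confidence-weighted predictions. It assumes a simple token-overlap (Jaccard) similarity, which is a stand-in chosen for illustration and not necessarily how LSD's name matcher scores names; it will not reproduce the 0.7/0.3 figures on the slide. The names `tokens` and `name_matcher` are hypothetical.

```python
import re

def tokens(name):
    """Split an element name such as 'agent-phone' into lowercase tokens."""
    return set(re.findall(r"[a-z0-9]+", name.lower()))

def name_matcher(source_name, mediated_names):
    """Score each mediated element by token overlap, normalised to sum to 1."""
    src = tokens(source_name)
    raw = {}
    for mediated in mediated_names:
        med = tokens(mediated)
        union = src | med
        raw[mediated] = len(src & med) / len(union) if union else 0.0
    total = sum(raw.values())
    if total == 0:
        # No shared tokens at all: fall back to a uniform distribution.
        return {m: 1.0 / len(mediated_names) for m in mediated_names}
    return {m: score / total for m, score in raw.items()}

mediated = ["address", "phone", "price", "description"]
phone_scores = name_matcher("agent-phone", mediated)
location_scores = name_matcher("location", mediated)
```

The second call shows the limit of pure name matching: `location` shares no token with `address`, so the sketch has no evidence and returns a uniform distribution, which is where data-based learners have to take over.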

11 Training the Learners
mediated schema: address, phone, price, description
realestate.com source elements: location, listed-price, agent-phone, comments
Sample realestate.com listing: Seattle, WA | (206) | $ 250,000 | Fantastic house
Name Matcher training examples
  – (location, address)
  – (agent-phone, phone)
  – (listed-price, price)
  – (comments, description)
  – ...
Frequency Learner training examples
  – (“Seattle, WA”, address)
  – (“(206) ”, phone)
  – (“$ 250,000”, price)
  – (“Fantastic house...”, description)
  – ...
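Training a frequency-style learner on (value, label) pairs like those above might look like the sketch below. It assumes a Naive Bayes model over word and symbol counts with add-one smoothing; the class name `FrequencyLearner`, the tokenizer, and the smoothing choice are assumptions for illustration, and the phone value is kept in the truncated form in which it appears in the transcript.

```python
import math
import re
from collections import Counter, defaultdict

def words(text):
    """Tokenise into letter runs, digit runs, and single symbols."""
    return re.findall(r"[a-z]+|\d+|[^\sa-z\d]", text.lower())

class FrequencyLearner:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> token counts
        self.label_counts = Counter()            # label -> number of training values
        self.vocab = set()

    def train(self, pairs):
        for value, label in pairs:
            toks = words(value)
            self.word_counts[label].update(toks)
            self.label_counts[label] += 1
            self.vocab.update(toks)

    def predict(self, value):
        """Return a label -> confidence dict that sums to 1 (empty if untrained)."""
        if not self.label_counts:
            return {}
        n = sum(self.label_counts.values())
        log_scores = {}
        for label, count in self.label_counts.items():
            total = sum(self.word_counts[label].values())
            denom = total + len(self.vocab) + 1  # +1 leaves room for unseen tokens
            lp = math.log(count / n)
            for tok in words(value):
                lp += math.log((self.word_counts[label][tok] + 1) / denom)
            log_scores[label] = lp
        top = max(log_scores.values())
        exp = {label: math.exp(s - top) for label, s in log_scores.items()}
        z = sum(exp.values())
        return {label: e / z for label, e in exp.items()}

training_pairs = [
    ("Seattle, WA", "address"),
    ("(206)", "phone"),
    ("$ 250,000", "price"),
    ("Fantastic house", "description"),
]
learner = FrequencyLearner()
learner.train(training_pairs)
scores = learner.predict("Kent, WA")
```

With only one training value per label the confidences are not meaningful in absolute terms, but the unseen value "Kent, WA" still ranks `address` highest because it shares the comma and the "wa" token.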

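12 Applying the Learners
mediated schema: address, phone, price, description
homes.com (new source): element area with values Seattle, WA; Kent, WA; Austin, TX; Seattle, WA
[Figure: each value is passed to the Name Matcher and the Frequency Learner; the Meta-learner merges their outputs into per-value predictions (address, description, address); the Combiner turns these into a single prediction for area: address]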
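The final Combiner step, which turns per-value predictions into one prediction for the whole source element, could be sketched as below. Averaging the score dicts is an assumption made for this example (the slide does not say how the Combiner aggregates), and the per-value scores are invented numbers chosen to mirror the address / description / address pattern on the slide.

```python
from collections import defaultdict

def combine_column(instance_predictions):
    """Average per-value score dicts into one score dict for the source element."""
    n = len(instance_predictions)
    if n == 0:
        return {}
    totals = defaultdict(float)
    for scores in instance_predictions:
        for label, conf in scores.items():
            totals[label] += conf
    return {label: s / n for label, s in totals.items()}

# Hypothetical meta-learner output for three values of the "area" element.
per_value = [
    {"address": 0.7, "description": 0.3},  # "Seattle, WA"
    {"address": 0.4, "description": 0.6},  # "Kent, WA"
    {"address": 0.8, "description": 0.2},  # "Austin, TX"
]
column_scores = combine_column(per_value)
best = max(column_scores, key=column_scores.get)
```

One misclassified value does not flip the result: the element as a whole is still mapped to `address`.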

13 The LSD System
Base learners/modules
  – name matcher
  – Naive Bayesian learner
  – Whirl nearest-neighbor classifier [Cohen&Hirsh-KDD98]
  – county-name recognizer
Meta-learner
  – uses stacking [Ting&Witten99, Wolpert92]
  – uses training data to learn weights for base learners
  – combines predictions using confidence scores/weights
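One common way to learn stacking weights is least-squares regression of a 0/1 target on the base learners' confidence scores. The sketch below assumes that approach; the slide does not state which fitting method LSD uses. The helper names `solve` and `stacking_weights` are hypothetical, the score rows are invented, and in practice each row would come from base-learner predictions on held-out training data.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting.

    Raises ZeroDivisionError if the system is singular.
    """
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def stacking_weights(score_rows, targets):
    """Weights w minimising sum_i (targets[i] - w . score_rows[i])^2."""
    k = len(score_rows[0])
    ata = [[sum(r[i] * r[j] for r in score_rows) for j in range(k)]
           for i in range(k)]
    atb = [sum(r[i] * t for r, t in zip(score_rows, targets))
           for i in range(k)]
    return solve(ata, atb)

# Invented scores for the label "address": each row is
# [name matcher confidence, Naive Bayes confidence] for one training value.
score_rows = [[0.5, 1.0], [0.5, 0.0], [0.2, 1.0], [0.8, 0.0]]
targets = [1, 0, 1, 0]  # 1 if the value's true label is "address"
weights = stacking_weights(score_rows, targets)
```

In this toy data the second learner predicts the target perfectly, so the fit puts essentially all the weight on it and none on the first.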

14 Experiments

15 Related Work
Rule-based approaches
  – TRANSCM [Milo&Zohar98], ARTEMIS [Castano&Antonellis99], [Palopoli et al. 98]
  – utilize only schema information
Learner-based approaches
  – SEMINT [Li&Clifton94], ILA [Perkowitz&Etzioni95]
  – employ a single learner, limited applicability
Multi-strategy learning in other domains
  – series of workshops [91,93,96,98,00]
  – [Freitag98], Proverb [Keim et al. 99]

16 Summary
Schema matching
  – automated by learning
Multi-strategy learning is essential
  – handles different types of data
  – incorporates different types of domain knowledge
  – easy to incorporate new learners
  – alleviates effects of noise & dirty data
Implemented LSD
  – promising results with initial experiments

17 Future Work
[Figure: problem map. Labels: source descriptions; schema matching, data translation; scope, completeness, reliability, query capability; leaf elements, higher-level elements; 1-1 mappings, complex mappings]