Corpus-based Schema Matching
Jayant Madhavan, Philip Bernstein, AnHai Doan, Alon Halevy
Microsoft Research, UIUC, University of Washington


November 4th, 2004, Corpus-based Schema Matching
Schema Matching
Schema matching: discovering correspondences between similar elements.
Eventually: SQL expressions that can populate one database from the other.
Inventory Database A
  BooksAndMusic(Title, Author, Publisher, ItemID, ItemType, ListPrice, Categories, Keywords)
  Discounts(ItemID, DiscountPrice)
Inventory Database B
  Books(Title, ISBN, Price, OurPrice, Edition)
  BookGenres(ISBN, Genre)
  Authors(ISBN, FirstName, LastName)

Heterogeneity and Data Sharing
Mappings provide the glue between independent data sources.
[Figure: data integration. Labels: Query; Mediator; Books+Music Central; Mappings; Data Sources: CD World, Amazon, All Books; source schemas: (Books, Pubs, Authors, …), (Products, Discounts, …), (Book, Music, Store, …).]
Schema matching is important to any application with multiple data sources.

Typical Approaches
Multiple sources of evidence in the schemas:
  Schema element names: abbreviations, synonyms, … (e.g. BooksAndCDs/Categories ~ BookCategories/Category)
  Descriptions and documentation: incomplete, absent, … (e.g. ItemID: unique identifier for a book or a CD)
  Data types: inconsistent, absent, … (e.g. DateTime vs. Integer)
  Schema structure: overlapping schemas, … (e.g. all books have similar attributes)
  Data instances: different values, scales, … (e.g. all addresses have similar formats)
Combine multiple techniques to exploit all available evidence [Do, Rahm; VLDB 2002], [Doan, et al.; WWW 2002], …

[Figure: a typical schema matching pipeline over schemas S and T.
  1. Build models: an element model M_s for each s in S and M_t for each t in T (name, instances, type, …).
  2. Compare models using the matching techniques.
  3. Combine results into a similarity matrix over s_1 … s_m and t_1 … t_n.
  4. Generate matches (the mapping).]
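The four steps in the figure can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the matcher used in the talk: the element models hold just a name and a type, the two matchers (`name_similarity`, `type_similarity`), the weights, and the threshold are all hypothetical choices, and the sample columns are borrowed from the Product and MusicCD tables shown elsewhere in the deck with types invented for the example.

```python
from difflib import SequenceMatcher

def build_model(name, data_type):
    # Step 1: a toy element model holding only a name and a type (assumption).
    return {"name": name, "type": data_type}

def name_similarity(a, b):
    # Hypothetical name matcher: case-insensitive string similarity.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def type_similarity(a, b):
    # Hypothetical type matcher: exact agreement or nothing.
    return 1.0 if a == b else 0.0

def similarity_matrix(source, target, w_name=0.7, w_type=0.3):
    # Steps 2 and 3: compare every pair of models and combine the
    # matcher scores with fixed, illustrative weights.
    matrix = {}
    for s in source:
        for t in target:
            score = (w_name * name_similarity(s["name"], t["name"])
                     + w_type * type_similarity(s["type"], t["type"]))
            matrix[(s["name"], t["name"])] = score
    return matrix

def generate_matches(matrix, threshold=0.5):
    # Step 4: keep the best-scoring target for each source element,
    # provided it clears the (illustrative) threshold.
    best = {}
    for (s, t), score in matrix.items():
        if score >= threshold and score > best.get(s, (None, 0.0))[1]:
            best[s] = (t, score)
    return {s: t for s, (t, _) in best.items()}

product = [build_model("productID", "str"), build_model("name", "str"),
           build_model("price", "float"), build_model("salePrice", "float")]
music_cd = [build_model("prodID", "str"), build_model("albumName", "str"),
            build_model("artists", "str"), build_model("recordCompany", "str"),
            build_model("price", "float"), build_model("salePrice", "float")]

matches = generate_matches(similarity_matrix(product, music_cd))
print(matches)
```

A real matcher would combine many more sources of evidence (instances, documentation, structure) and would usually learn the combination weights rather than fix them by hand.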

Insufficient evidence
Product(productID, name, price, salePrice)
  (0X7630AB12, The Concept in Central Park, $13.99, $11.99)
Music(ASIN, title, artists, recordLabel, discountPrice)
  (no tuples)
MusicCD(prodID, albumName, artists, recordCompany, price, salePrice)
  (9R4374FG56, Saturday Night Fever, The Bee Gees, Columbia, $14.99, $9.99)
CD(ASIN, album, artistName, price, discountPrice)
  (4Y3026DF23, The Best of the Doors, The Doors, $16.99, $12.99)

Obtaining more evidence
Corpus:
  MusicCD(prodID, albumName, artists, recordCompany, price, salePrice)
    (9R4374FG56, Saturday Night Fever, The Bee Gees, Columbia, $14.99, $9.99)
  CD(ASIN, album, artistName, price, discountPrice)
    (4Y3026DF23, The Best of the Doors, The Doors, $16.99, $12.99)
Corpus-based Augment:
  Music, CD: (ASIN | title, album | artists, artistName | recordLabel | discountPrice)
    (4Y3026DF23 | The Best of the Doors | The Doors | | $12.99)
  Product, MusicCD: (productID, prodID | name, albumName | price | salePrice)
    (0X7630AB12, 9R4374FG56 | The Concept in Central Park, Saturday Night Fever | $13.99, $14.99 | $11.99, $9.99)
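The augmentation shown on this slide amounts to pooling the evidence of an element with that of similar elements from the corpus. The sketch below assumes the similar corpus elements have already been found and are passed in directly; the dictionary layout and the `augment` helper are hypothetical names chosen for the example.

```python
def augment(element, similar_elements):
    # Pool names and data instances from the element itself and from
    # corpus elements assumed (by the caller) to be similar to it.
    names = set(element["names"])
    instances = set(element["instances"])
    for other in similar_elements:
        names |= set(other["names"])
        instances |= set(other["instances"])
    return {"names": sorted(names), "instances": sorted(instances)}

# Music.title has no tuples, so its own model carries no instance evidence.
music_title = {"names": ["title"], "instances": []}
# CD.album from the corpus contributes an alternate name and an instance.
cd_album = {"names": ["album"], "instances": ["The Best of the Doors"]}

augmented = augment(music_title, [cd_album])
print(augmented)
```

After augmentation, the previously empty Music.title column can be compared against other schemas using the album name and the borrowed instance.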

Can we use known schemas and mappings to match as yet unseen schemas?
  Augment information about elements in the schemas being matched.
  Learn schema design patterns and constraints from known schemas to improve matches.

Multiple representations for concepts
  CDs, CD, Music
  ID, CDID, ProdCode, ISBN
  Album, AlbumName, Name, TrackName
  DiscountPrice, DiscountedPrice, SalePrice, OurPrice, Discounted, DiscPrice
  RecordLabel, Label, Company, RecordingCompany
  Artist, AuthorArtist, Name, LastName, Author
  Artists, CD2Artist, AuthorArtists, ArtistID
Learn alternate names, data instances, names of related elements, data types, …

Schema Design Patterns
Tables and likely columns:
  Table/column | Likely column/table | Other column/table
  Warehouses | warehouseID, telephone, fax, manager | streetAddress, city, state, zip, numEmployees, capacity
  title | Books |
  isbn | Books, Availability | Keywords, Authors
Relations between elements:
  Schema element dependency: CDs → price, fax → telephone, discountPrice → price, city → state, numEmployees → manager, zipcode → Warehouses
  Frequently co-occurring concepts: (Warehouse, warehouseID, manager, telephone, fax), (Availability, Books, CDs, Warehouses)
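One simple way to obtain "tables and likely columns" statistics is to count, over the corpus, how often each column appears in tables of a given name. The sketch below is a hedged illustration: the toy corpus is invented, and it assumes table and column names have already been normalized to shared concepts, which the real system would have to do by clustering first.

```python
from collections import Counter, defaultdict

def column_likelihoods(corpus):
    # Estimate, for each table name, the fraction of its occurrences
    # in the corpus that contain each column name.
    table_counts = Counter()
    column_counts = defaultdict(Counter)
    for schema in corpus:
        for table, columns in schema.items():
            table_counts[table] += 1
            for column in set(columns):
                column_counts[table][column] += 1
    return {
        table: {col: n / table_counts[table] for col, n in cols.items()}
        for table, cols in column_counts.items()
    }

# Hypothetical toy corpus: each schema maps table names to column lists.
corpus = [
    {"Warehouses": ["warehouseID", "telephone", "fax", "manager"]},
    {"Warehouses": ["warehouseID", "telephone", "city", "state"]},
    {"Books": ["title", "isbn"]},
]

likelihoods = column_likelihoods(corpus)
print(likelihoods["Warehouses"])
```

Columns with a high estimate (here warehouseID and telephone) play the role of the "likely" columns on the slide, while lower-frequency ones correspond to the "other" columns.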

[Figure: corpus-based matching architecture.
  Schemas S: build initial element models M_s (name, instances, type, …) for each element s.
  Search the corpus of known schemas and mappings for similar elements (e, f).
  Build augmented models M'_s.
  Run a typical schema matcher on the augmented models.
  Learn schema design patterns from the corpus (concepts/clusters) and use them as domain constraints.
  Generate matches to produce the mapping.]

Contents of the Corpus
In order to augment:
  Learn a model ensemble for each element (names, data instances, types, structure, …).
  Train using the schemas and mappings.
  An element and the elements it maps to are positive examples.
In order to learn domain constraints:
  Cluster elements in the corpus into concepts.
  Estimate schema statistics: likely tables-columns and element co-occurrence.
  Learn the importance of individual constraints.
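The rule for positive examples can be made concrete with a small sketch. The slide only states how positives are chosen; treating every other corpus element as a negative example is an assumption made here for illustration, as are the function name and the `Table.column` naming of elements.

```python
def training_examples(element, corpus_elements, mappings):
    # Positives: the element itself plus every element it is mapped to.
    positives = {element}
    for a, b in mappings:
        if a == element:
            positives.add(b)
        elif b == element:
            positives.add(a)
    # Assumption (not stated in the slides): all remaining corpus
    # elements serve as negative examples.
    negatives = set(corpus_elements) - positives
    return positives, negatives

corpus_elements = ["Product.productID", "MusicCD.prodID", "MusicCD.price"]
mappings = [("Product.productID", "MusicCD.prodID")]

positives, negatives = training_examples(
    "Product.productID", corpus_elements, mappings)
print(positives, negatives)
```

Each per-element learner in the ensemble (name, instance, type, and so on) could then be trained on these labeled sets.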

Experimental Results
Four domains:
  Automatically extracted web forms
  Manually created relational schemas
Techniques:
  Direct: Glue [WWW’2004]
  Corpus-based Augment
  Corpus-based Pivot [IIW’2004]

Improved Matching Performance
  schemas and 6 mappings in the corpus
  schema pairs being tested

Difficult Match Tasks
  Improvements are more significant for difficult tasks.
  Improvements are smaller for easy tasks.

Related Work
Using past matching experience [Doan, et al., SIGMOD’2001; Do & Rahm, VLDB’2002]
  We are trying to match unseen schemas.
Using web forms to construct a mediated schema [He & Chang, SIGMOD’2003]
  Clustering of elements is an intermediate step in our corpus.
Using a domain ontology [Xu & Embley, DASFAA’2003]
  Our corpus structures are automatically generated.

Conclusions
Schema matching is hard with insufficient evidence.
Corpus-based schema matching:
  Augments the evidence about elements in unseen schemas.
  Learns schema design patterns to select matches.
  Improves matching, especially for difficult tasks.
Future Work
  Large schemas and complex mappings
  User feedback to curate the corpus
  Corpus as a tool for other data management tasks [Halevy & Madhavan, IJCAI’2003]

Schema Matching
Schema matching: discovering correspondences between similar elements.
Eventually: BooksAndMusic(x:Title, …) = Books(x:Title, …) ∪ CDs(x:Album, …)
Inventory Database A
  BooksAndMusic(Title, Author, Publisher, ItemID, ItemType, SuggestedPrice, Categories, Keywords)
Inventory Database B
  Books(Title, ISBN, Price, DiscountPrice, Edition)
  CDs(Album, ASIN, Price, DiscountPrice, Studio)
  BookCategories(ISBN, Category)
  CDCategories(ASIN, Category)
  Artists(ASIN, ArtistName, GroupName)
  Authors(ISBN, FirstName, LastName)
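The eventual goal stated on this slide is an expression that populates BooksAndMusic from Books and CDs. Assuming the combining operator is a union, the idea can be sketched with Python's built-in `sqlite3` module. Only a few columns are modeled; the correspondences ISBN/ASIN to ItemID and Price to SuggestedPrice, the ItemType constants, and the book row are assumptions made for the example rather than mappings given in the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Books (Title TEXT, ISBN TEXT, Price REAL);
    CREATE TABLE CDs (Album TEXT, ASIN TEXT, Price REAL);
    CREATE TABLE BooksAndMusic (
        Title TEXT, ItemID TEXT, ItemType TEXT, SuggestedPrice REAL);
""")

# Hypothetical book row; the CD row reuses values from the deck's examples.
conn.execute("INSERT INTO Books VALUES ('Data Integration', 'ISBN-1', 30.0)")
conn.execute(
    "INSERT INTO CDs VALUES ('The Best of the Doors', '4Y3026DF23', 16.99)")

# Populate the target table as the union of the two mapped sources.
# Column correspondences here are illustrative assumptions.
conn.execute("""
    INSERT INTO BooksAndMusic (Title, ItemID, ItemType, SuggestedPrice)
    SELECT Title, ISBN, 'Book', Price FROM Books
    UNION
    SELECT Album, ASIN, 'CD', Price FROM CDs
""")

rows = conn.execute(
    "SELECT Title, ItemType FROM BooksAndMusic ORDER BY Title").fetchall()
print(rows)
```

Element-level matches such as Title ~ Album are the input to writing such an expression; deciding how the sources are combined is a separate step beyond matching.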

Difficult Match Tasks
Match tasks are separated by the ease of direct matching.
[Figure: two panels, "Difficult Match Tasks" and "Easier Match Tasks".]