Corpus-based Schema Matching. Jayant Madhavan, Philip Bernstein, AnHai Doan, Alon Halevy (Microsoft Research, UIUC, University of Washington)


1 Corpus-based Schema Matching
Jayant Madhavan, Philip Bernstein, AnHai Doan, Alon Halevy
Microsoft Research, UIUC, University of Washington

2 November 4th, 2004, Corpus-based Schema Matching
Schema Matching
Schema matching: discovering correspondences between similar elements.
Eventually… SQL expressions that can populate one database from the other.
Inventory Database A:
  BooksAndMusic(Title, Author, Publisher, ItemID, ItemType, ListPrice, Categories, Keywords)
  Discounts(ItemID, DiscountPrice)
Inventory Database B:
  Books(Title, ISBN, Price, OurPrice, Edition)
  BookGenres(ISBN, Genre)
  Authors(ISBN, FirstName, LastName)

3 Heterogeneity and Data Sharing
Mappings provide the glue between independent data sources.
[Figure: data integration. A query is posed to a mediator, which is connected by mappings to the data sources Books+Music Central, CD World, Amazon, and All Books. Example source schemas: (Books, Pubs, Authors, …), (Products, Discounts, …), (Book, Music, Store, …).]
Schema matching is important to any application with multiple data sources.

4 Typical Approaches
Multiple sources of evidence in the schemas (typical complications in parentheses):
  Schema element names (abbreviations, synonyms, …). Example: BooksAndCDs/Categories ~ BookCategories/Category
  Descriptions and documentation (incomplete, absent, …). Example: ItemID: unique identifier for a book or a CD
  Data types (inconsistent, absent, …). Example: DateTime vs. Integer
  Schema structure (overlapping schemas, …). Example: all books have similar attributes
  Data instances (different values, scales, …). Example: all addresses have similar formats
Combine multiple techniques to exploit all available evidence [Do, Rahm; VLDB 2002], [Doan, et al.; WWW 2002]…
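
As a rough illustration of combining several kinds of evidence, the sketch below scores a pair of elements on names, data types, and instances, then takes a weighted sum. The scoring functions, the weights, and the sample elements are all assumptions made for this example; the cited systems use their own, more elaborate matchers.

```python
from difflib import SequenceMatcher


def name_score(a, b):
    """String similarity of two element names, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def type_score(a, b):
    """1.0 when the declared data types agree, else 0.0."""
    return 1.0 if a.lower() == b.lower() else 0.0


def instance_score(xs, ys):
    """Jaccard overlap between the two sets of data instances."""
    sx, sy = set(xs), set(ys)
    union = sx | sy
    return len(sx & sy) / len(union) if union else 0.0


# Hypothetical weights; a real matcher would tune or learn them.
WEIGHTS = {"name": 0.5, "type": 0.2, "instances": 0.3}


def combined_score(s, t, weights=WEIGHTS):
    """Weighted sum of the individual evidence scores."""
    scores = {
        "name": name_score(s["name"], t["name"]),
        "type": type_score(s["type"], t["type"]),
        "instances": instance_score(s["instances"], t["instances"]),
    }
    return sum(weights[k] * scores[k] for k in weights)


# Made-up elements, loosely modeled on the slide's examples.
categories = {"name": "Categories", "type": "string", "instances": ["Rock", "Jazz"]}
category = {"name": "Category", "type": "string", "instances": ["Jazz", "Fiction"]}
item_id = {"name": "ItemID", "type": "integer", "instances": [101, 102]}

print(combined_score(categories, category))
print(combined_score(categories, item_id))
```

With these inputs, Categories scores higher against Category than against ItemID, because all three kinds of evidence point the same way.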

5 [Figure: a typical schema matcher. For elements s of schema S and t of schema T: 1. Build element models M_s and M_t (name, instances, type, …). 2. Compare models using multiple matching techniques. 3. Combine results into a similarity matrix over s_1 … s_m and t_1 … t_n. 4. Generate matches, which yields the mapping.]
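
The four steps on this slide (build models, compare models, combine results, generate matches) can be sketched as follows. This is a minimal sketch under stated assumptions: the two comparison functions, the averaging rule, the threshold of 0.3, and the best-target-per-source selection are illustrative choices, not the matcher used in the talk.

```python
from difflib import SequenceMatcher


def build_model(name, instances, dtype):
    """Step 1: an element model is a small bundle of evidence."""
    return {"name": name, "instances": set(instances), "type": dtype}


def compare_names(ms, mt):
    """One matching technique: string similarity of the names."""
    return SequenceMatcher(None, ms["name"].lower(), mt["name"].lower()).ratio()


def compare_instances(ms, mt):
    """Another technique: Jaccard overlap of the instance sets."""
    union = ms["instances"] | mt["instances"]
    return len(ms["instances"] & mt["instances"]) / len(union) if union else 0.0


MATCHERS = [compare_names, compare_instances]


def similarity_matrix(S, T):
    """Steps 2 and 3: run every technique and average the scores."""
    return [[sum(m(s, t) for m in MATCHERS) / len(MATCHERS) for t in T] for s in S]


def generate_matches(S, T, matrix, threshold=0.3):
    """Step 4: keep the best target per source if it clears the threshold."""
    matches = {}
    if not T:
        return matches
    for i, s in enumerate(S):
        j = max(range(len(T)), key=lambda k: matrix[i][k])
        if matrix[i][j] >= threshold:
            matches[s["name"]] = T[j]["name"]
    return matches


# Made-up schemas and instances for illustration.
S = [build_model("Title", ["Dune"], "string"),
     build_model("ListPrice", ["9.99"], "decimal")]
T = [build_model("Price", ["9.99", "12.50"], "decimal"),
     build_model("Title", ["Dune", "Emma"], "string")]

matches = generate_matches(S, T, similarity_matrix(S, T))
print(matches)
```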

6 Insufficient evidence
Product(ASIN, title, artists, recordLabel, discountPrice)
  (no tuples)
Music(productID, name, price, salePrice)
  0X7630AB12, The Concert in Central Park, $13.99, $11.99
MusicCD(prodID, albumName, artists, recordCompany, price, salePrice)
  9R4374FG56, Saturday Night Fever, The Bee Gees, Columbia, $14.99, $9.99
CD(ASIN, album, artistName, price, discountPrice)
  4Y3026DF23, The Best of the Doors, The Doors, $16.99, $12.99

7 Obtaining more evidence
Corpus:
  MusicCD(prodID, albumName, artists, recordCompany, price, salePrice)
    9R4374FG56, Saturday Night Fever, The Bee Gees, Columbia, $14.99, $9.99
  CD(ASIN, album, artistName, price, discountPrice)
    4Y3026DF23, The Best of the Doors, The Doors, $16.99, $12.99
Corpus-based Augment:
  Product, CD (ASIN | title, album | artists, artistName | recordLabel | discountPrice)
    4Y3026DF23 | The Best of the Doors | The Doors | | $12.99
  Music, MusicCD (productID, prodID | name, albumName | price | salePrice)
    0X7630AB12, 9R4374FG56 | The Concert in Central Park, Saturday Night Fever | $13.99, $14.99 | $11.99, $9.99

8 Corpus-based Schema Matching
Can we use known schemas and mappings to match as-yet-unseen schemas?
  Augment information about elements in the schemas being matched.
  Learn schema design patterns and constraints from known schemas to improve matches.
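
A minimal sketch of the augmentation idea: an element with little evidence of its own borrows names and data instances from similar elements in the corpus. Plain name similarity with a 0.7 cutoff stands in here for whatever similarity search the authors use, which is an assumption of this example; the element names and IDs are taken from the music tables earlier in the deck, and the remaining instance values are made up.

```python
from difflib import SequenceMatcher


def similar(a, b, threshold=0.7):
    """Stand-in similarity test on element names (assumed cutoff)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def augment(element, corpus):
    """Merge names and instances of similar corpus elements into one model."""
    names = {element["name"]}
    instances = set(element["instances"])
    for other in corpus:
        if similar(element["name"], other["name"]):
            names.add(other["name"])
            instances |= set(other["instances"])
    return {"names": names, "instances": instances}


# Elements of a corpus schema (MusicCD) with one instance each.
corpus = [
    {"name": "prodID", "instances": ["9R4374FG56"]},
    {"name": "albumName", "instances": ["Saturday Night Fever"]},
    {"name": "salePrice", "instances": ["$9.99"]},
]

product_id = {"name": "productID", "instances": ["0X7630AB12"]}
augmented = augment(product_id, corpus)
print(augmented)
```

The augmented model for productID now also carries the name prodID and a second instance, which gives a downstream matcher more to compare against.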

9 Multiple representations for concepts
[Figure: alternative names seen in the corpus for the same concepts, for example:]
  CDs, CD, Music
  ID, CDID, ProdCode, ISBN
  Album, AlbumName, Name, TrackName
  DiscountPrice, DiscountedPrice, SalePrice, OurPrice, Discounted, DiscPrice
  RecordLabel, Label, Company, RecordingCompany
  Artist, AuthorArtist, Name, LastName, Author
  Artists, CD2Artist, AuthorArtists, ArtistID
Learn alternate names, data instances, names of related elements, data types, …

10 Schema Design Patterns
Relations between elements:
  CDs -> price
  discountPrice -> price
  numEmployees -> manager
Schema element dependency:
  fax -> telephone
  city -> state
  zipcode -> Warehouses
Frequently co-occurring concepts:
  (Warehouse, warehouseID, manager, telephone, fax)
  (Availability, Books, CDs, Warehouses)
Tables and likely columns (slide table with columns: table/column; likely column/table; other column/table):
  Warehouses: warehouseID, telephone, fax, manager, streetAddress, city, state, zip, numEmployees, capacity
  title: Books
  isbn: Books, Availability; Keywords, Authors
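
Statistics of this kind can be estimated by simple counting over the corpus. The sketch below is one possible reading, not the authors' estimator: it computes how often a column appears in tables of a given name, and how often one column is accompanied by another (for pairs such as fax and telephone). The function names and the tiny corpus are hypothetical.

```python
from collections import Counter, defaultdict


def column_likelihoods(corpus):
    """For each table name, the fraction of its occurrences containing each column."""
    table_counts = Counter()
    column_counts = defaultdict(Counter)
    for table, columns in corpus:
        table_counts[table] += 1
        for col in set(columns):
            column_counts[table][col] += 1
    return {
        table: {col: n / table_counts[table] for col, n in cols.items()}
        for table, cols in column_counts.items()
    }


def implication_strength(corpus, a, b):
    """Fraction of tables containing column a that also contain column b."""
    with_a = [set(cols) for _, cols in corpus if a in cols]
    if not with_a:
        return 0.0
    return sum(b in cols for cols in with_a) / len(with_a)


# Made-up corpus of (table name, columns) pairs.
corpus = [
    ("Warehouses", ["warehouseID", "telephone", "fax", "manager"]),
    ("Warehouses", ["warehouseID", "telephone", "city", "state"]),
    ("Books", ["title", "isbn"]),
]

stats = column_likelihoods(corpus)
print(stats["Warehouses"])
print(implication_strength(corpus, "fax", "telephone"))
```

In this toy corpus every table with a fax column also has telephone, but only half of the tables with telephone have fax, so the relation is directional.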

11 [Figure: corpus-based matching architecture. For each element s of the schemas S, build an initial model M_s (name, instances, type, …). Search the corpus of known schemas and mappings for similar elements (e, f) and build an augmented model M'_s. The corpus is also used to form concepts/clusters and learn schema design patterns, which yield domain constraints. A typical schema matcher runs on the augmented models, and match generation uses the domain constraints to produce the mapping.]

12 Contents of the Corpus
In order to augment:
  Learn a model ensemble for each element (names, data instances, types, structure, …).
  Train using the schemas and mappings: an element and the elements it maps to are positive examples.
In order to learn domain constraints:
  Cluster elements in the corpus into concepts.
  Estimate schema statistics: likely tables-columns and element co-occurrence.
  Learn the importance of individual constraints.
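
The training rule on this slide (an element and the elements it maps to are positive examples) can be sketched as below. Treating every other corpus element as a negative example is an assumption of this sketch, since the slide does not say how negatives are chosen; the function name and the element names are hypothetical.

```python
def training_examples(elements, mappings):
    """Return element -> (positives, negatives) for training its model.

    elements: list of element names in the corpus.
    mappings: list of (a, b) pairs of elements known to correspond.
    """
    mapped = {e: {e} for e in elements}
    for a, b in mappings:
        mapped[a].add(b)
        mapped[b].add(a)
    everything = set(elements)
    return {
        e: (sorted(mapped[e]), sorted(everything - mapped[e]))
        for e in elements
    }


elements = ["CD.album", "MusicCD.albumName", "CD.price", "MusicCD.price"]
mappings = [("CD.album", "MusicCD.albumName"), ("CD.price", "MusicCD.price")]

examples = training_examples(elements, mappings)
print(examples["CD.album"])
```

Each element's positive and negative sets could then be fed to one learner per kind of evidence (names, instances, types, …) to form the ensemble.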

13 Experimental Results
Four domains:
  Automatically extracted web forms
  Manually created relational schemas
Techniques:
  Direct: Glue [WWW’2004]
  Corpus-based Augment
  Corpus-based Pivot [IIW’2004]

14 Improved Matching Performance
16-19 schemas and 6 mappings in the corpus
22-54 schema pairs being tested

15 Difficult Match Tasks
Improvements are more significant for difficult tasks.
Improvements are smaller for easy tasks.

16 Related Work
Using past matching experience [Doan, et al., SIGMOD’2001; Do & Rahm, VLDB’2002]
  We are trying to match unseen schemas.
Using web forms to construct a mediated schema [He & Chang, SIGMOD’2003]
  Clustering of elements is an intermediate step in our corpus.
Using a domain ontology [Xu & Embley, DASFAA’2003]
  Our corpus structures are automatically generated.

17 Conclusions
Schema matching is hard with insufficient evidence.
Corpus-based schema matching:
  Augment the evidence about elements in unseen schemas.
  Learn schema design patterns to select matches.
  Improves matching, especially for difficult tasks.
Future Work
  Large schemas and complex mappings
  User feedback to curate the corpus
  Corpus as a tool for other data management tasks [Halevy & Madhavan, IJCAI’2003]
http://www.cs.washington.edu/homes/jayant

18 Schema Matching
Schema matching: discovering correspondences between similar elements.
Eventually… BooksAndMusic(x:Title,…) = Books(x:Title,…) ∪ CDs(x:Album,…)
Inventory Database A:
  BooksAndMusic(Title, Author, Publisher, ItemID, ItemType, SuggestedPrice, Categories, Keywords)
Inventory Database B:
  Books(Title, ISBN, Price, DiscountPrice, Edition)
  CDs(Album, ASIN, Price, DiscountPrice, Studio)
  BookCategories(ISBN, Category)
  CDCategories(ASIN, Category)
  Artists(ASIN, ArtistName, GroupName)
  Authors(ISBN, FirstName, LastName)
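
Assuming the mapping expression on this slide denotes a union of Books and CDs, it could be turned into a SQL statement that populates BooksAndMusic, as sketched below with an in-memory SQLite database. Only the Title/Album correspondence comes from the slide; mapping ISBN and ASIN to ItemID, Price to SuggestedPrice, the 'book' and 'cd' labels, and all row values are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Reduced versions of the slide's tables, with made-up rows.
conn.executescript("""
CREATE TABLE Books (Title TEXT, ISBN TEXT, Price REAL);
CREATE TABLE CDs (Album TEXT, ASIN TEXT, Price REAL);
CREATE TABLE BooksAndMusic (
    Title TEXT, ItemID TEXT, ItemType TEXT, SuggestedPrice REAL
);
INSERT INTO Books VALUES ('Emma', 'isbn-1', 7.5);
INSERT INTO CDs VALUES ('Saturday Night Fever', 'asin-1', 14.99);
""")

# The mapping as a query: the union of the two sources, column by column.
conn.execute("""
INSERT INTO BooksAndMusic (Title, ItemID, ItemType, SuggestedPrice)
SELECT Title, ISBN, 'book', Price FROM Books
UNION
SELECT Album, ASIN, 'cd', Price FROM CDs
""")

rows = conn.execute(
    "SELECT Title, ItemType FROM BooksAndMusic ORDER BY Title"
).fetchall()
print(rows)
```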

19 Difficult Match Tasks
Match tasks are separated by the ease of direct matching.
[Figure panels: Difficult Match Tasks; Easier Match Tasks]

