Automatic Schema Matching Seminar on Databases and the Internet Yaron Naveh January 2006.

Automatic Schema Matching, SDBI, Articles
A survey of approaches to automatic schema matching. Rahm & Bernstein (2001)
Discovering Complex Matchings across Web Query Interfaces: A Correlation Mining Approach. He, Chen-Chuan Chang & Han (2004)

Contents: Problem Definition, Applications, Classic Approaches, Correlation Mining Approach

Match Definition
A match is a mapping between elements of two schemas that correspond semantically to each other.
Example, two Authors tables: (ID, Name, NumOfBooks) and (AID, AName, ANumOfBooks), where ID <-> AID, Name <-> AName, NumOfBooks <-> ANumOfBooks.

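Match Properties
Authors (ID, Name, NumOfBooks) versus Authors (ID, FName, LName, YearOfBirth):
ID <-> ID is a (1:1) match.
Name <-> {FName, LName} is a (1:n) match.
NumOfBooks and YearOfBirth have no evident counterpart (?).
(n:m) matching is also possible.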

Match Properties (cont'd)
Authors (ID, Name, Salary ($)) versus Authors (ID, Name, Salary (NIS)).
Salary (NIS) = Salary ($) * 4.55
We will not find the function, just the attributes.

Match Properties (cont'd)
One relation is mapped to two others: (EmpName, DeptName) corresponds to the join of Employees (EmpName, DeptID) and Departments (DeptID, DeptName).

Match Properties (cont'd)
Lessons (Teacher, StartTime, EndTime) versus (Teacher, Time): does Time map to StartTime or to EndTime (?)
Too hard for a PC! The PC should only suggest mappings to the user.

Match Properties (cont'd)
So maybe it can all be done manually?
Two schemas, each with Field1 to Field10: an automated tool can be helpful here ...

Match Generalization
We have defined a match for the relational model. There are other interesting models, for example an XML document: AuthorsBooks containing Authors entries with ID (1) and Name (Calvino) ...

Match Generalization (cont'd)
Define a schema to be a set of elements connected by some structure.
Use the natural correspondence: nodes and edges in graphs; elements, subelements, and IDREFs in XML; ...

Data Migration
Migrate data from an old DB to a new DB.
Example: Old Forum (Date, From, Message) to New Forum (Time, Writer, Message, IsVisible, ResponseTo).
Special case: data warehouse.

E-Commerce
Map between different message formats.
Example: a Book Store message (The Invisible Cities, 50) versus a General Store message (book, 50).

Global Query Interface
GOOGLE, MSN, Yahoo: you want to build a Meta-Querier. However ...

Global Query Interface (cont'd)
Solution: reduce the HTML form to its "schema".
Form fields shown for GOOGLE, MSN and Yahoo: Search, Type, q, Qry, Type.

Semantic Query Processing
Keyword search scenario. Find: Author + Ram + Oren
Authors (Id, Name)
SELECT * FROM Authors WHERE Id = 'Ram Oren' ?
SELECT * FROM Authors WHERE Name = 'Ram Oren' ?
How does this differ from previous examples?

Matchers
There are a few algorithms to map the attributes of two schemas.
Define such an algorithm as a matcher.
Define a hybrid matcher as a matcher that combines results from other matchers.

Schema-based vs. Instance-based
Two ways to perform a match:
Use schema data (field name, type, constraints ...)
Use data from the table

Instance-based
Two options for using data from the table:
Build a schema from instance data, then use schema matchers.
Use the data directly. Example: Books (BookID, TotPages, TotPrice) versus (BookID, TotalP) with the value 160. What is TotalP?

Instance-based (cont'd)
When will we use / not use instance-based matchers?
Useful when no schema data is available.
Not useful when no instance data is available ...

Schema-Based
What useful data is there in the schema?
Element's name, description, data type, relationships, constraints.

Schema-Based: Name Matching
Map elements with similar names:
String equality
Common substrings (Birthday --> DayOfBirth)
Canonical names (CName --> Customer Name)
Synonyms (Car --> Automobile)
Hypernyms (Book is-a Publication)
Soundex (ShipTo --> Ship2)
User provided (Issue --> Bug)
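
A few of the name-based checks listed above can be sketched in Python. This is only an illustration under stated assumptions: the function names, the tiny synonym table and the minimum substring length of 4 are hypothetical and do not come from the slides, and canonical names, hypernyms and sound-alike matching are left out.

```python
from difflib import SequenceMatcher

# Hypothetical synonym / user-provided table (assumption, not from the slides).
SYNONYMS = {frozenset({"car", "automobile"}), frozenset({"issue", "bug"})}


def longest_common_substring(a, b):
    """Return the longest substring shared by a and b (case-insensitive)."""
    a, b = a.lower(), b.lower()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    m = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[m.a:m.a + m.size]


def name_match(a, b, min_common=4):
    """Return the reason two element names match, or None if they do not."""
    x, y = a.lower(), b.lower()
    if x == y:
        return "equal"
    if frozenset({x, y}) in SYNONYMS:
        return "synonym"
    if len(longest_common_substring(x, y)) >= min_common:
        return "common substring"
    return None


print(name_match("Birthday", "DayOfBirth"))  # shares the substring "birth"
print(name_match("Car", "Automobile"))
print(name_match("Name", "Salary"))
```

The checks are ordered from the most to the least reliable evidence, so an exact match is never reported as a mere substring match.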

Schema-Based: Description
Map elements based on description:
Schema A: empn //employee name
Schema B: name //name of employee

Schema-Based: Constraint Based
Map elements based on constraints: data types; unique, primary and foreign keys.
Example tables: Employees, Permissions, Employees and Payments, with attributes ID, Name, PID, PLevel, Sum (?).

Reuse Previous Matching
Schema A (Name, Salary), Schema B (AName, Income), Schema C (Author, Money).
Get the mapping A -> C from the mappings A -> B and B -> C.
A partial reuse is also possible (e.g. on some of the attributes).
Be aware of the domain: salary and income are not always the same!
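
Reusing earlier matches amounts to composing two mappings. The sketch below assumes simple (1:1) mappings stored as dictionaries; the function name compose is hypothetical. Attributes of A whose B counterpart has no match in C are dropped, which corresponds to the partial reuse mentioned above.

```python
def compose(a_to_b, b_to_c):
    """Compose two 1:1 mappings; attributes with no path to C are dropped."""
    return {a: b_to_c[b] for a, b in a_to_b.items() if b in b_to_c}


a_to_b = {"Name": "AName", "Salary": "Income"}
b_to_c = {"AName": "Author", "Income": "Money"}

a_to_c = compose(a_to_b, b_to_c)
print(a_to_c)
```

The composed result is still only a suggestion: as the slide warns, Salary -> Income -> Money is valid only if all three attributes mean the same thing in their domains.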

Complexity
We must compare every subgroup of attributes in schema A to every subgroup in schema B.
This is exponential in the number of attributes.
However, we can assume the number of attributes is bounded ...
Also, check (n:m) matchings only for n, m < C for some constant C.
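
To get a feel for the numbers, the following sketch counts the candidate pairs of non-empty attribute groups, with and without the bound n, m < C. The function name and the choice to count only non-empty groups are assumptions made for illustration.

```python
from math import comb


def candidate_pairs(n_a, n_b, c=None):
    """Count pairs (group of A, group of B) of non-empty attribute groups.

    With c given, only groups of fewer than c attributes are counted.
    """
    def groups(n):
        max_size = n if c is None else min(n, c - 1)
        return sum(comb(n, k) for k in range(1, max_size + 1))

    return groups(n_a) * groups(n_b)


print(candidate_pairs(10, 10))       # every group against every group
print(candidate_pairs(10, 10, c=3))  # groups of size 1 or 2 only
```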

Data Mining
Data mining is the process of discovering patterns in data, usually stored in a database.
Sells (TransID, Item): (1, Book), (1, Pencil), (2, Book), (2, Soap), (3, Book), (3, Soap)
Which items are likely to co-appear?

Data Mining (cont'd)
Sells (TransID, Item): (1, Book), (1, Pencil), (2, Book), (2, Soap), (3, Book), (3, Soap)
Support of an itemset: the fraction of transactions that contain all items in the itemset.
What is the support for {Book}? 1
And for {Book, Soap}? 0.666
The A-Priori property: the support for any subset of an itemset is at least as large as the support for the itemset.
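
The support values above can be reproduced with a few lines of Python. The helper names transactions and support are hypothetical; the data is the Sells table from the slide.

```python
from collections import defaultdict

sells = [(1, "Book"), (1, "Pencil"), (2, "Book"),
         (2, "Soap"), (3, "Book"), (3, "Soap")]


def transactions(rows):
    """Group (TransID, Item) rows into one set of items per transaction."""
    baskets = defaultdict(set)
    for trans_id, item in rows:
        baskets[trans_id].add(item)
    return list(baskets.values())


def support(itemset, baskets):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)


baskets = transactions(sells)
print(support({"Book"}, baskets))
print(support({"Book", "Soap"}, baskets))
```

Adding an item to an itemset can only shrink the set of transactions that contain it, which is why the support of a subset is never smaller than the support of the full itemset.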

Data Mining (cont'd)
Sells (TransID, Item): (1, Book), (1, Pencil), (2, Book), (2, Soap), (3, Book), (3, Soap)
Algorithm to find frequent itemsets:
1. Define a threshold minSupport for "frequent" itemsets
2. Calculate support for all itemsets of size 1
3. Calculate support for itemsets of size 2, 3, 4 ...
4. For each size k, save the frequent itemsets
5. Stop when there are no frequent itemsets of size k
Why can we stop?
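
The steps above can be sketched as a level-wise search. This is a minimal illustration, not the presenter's code: the function name apriori, the candidate-generation details and the threshold 0.5 are assumptions. Candidates of size k+1 are kept only if all their size-k subsets are frequent, which is the A-Priori property at work and also the reason the loop may stop at the first empty level.

```python
from itertools import combinations


def apriori(baskets, min_support):
    """Return {itemset: support} for all itemsets with support >= min_support."""
    n = len(baskets)

    def support(itemset):
        return sum(itemset <= basket for basket in baskets) / n

    items = sorted(set().union(*baskets))
    current = [frozenset([item]) for item in items]  # candidates of size 1
    frequent = {}
    k = 1
    while current:
        # Keep only the candidates of size k that are frequent.
        level = {}
        for candidate in current:
            s = support(candidate)
            if s >= min_support:
                level[candidate] = s
        frequent.update(level)
        # Build candidates of size k+1 whose size-k subsets are all frequent.
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        current = [c for c in joined
                   if all(frozenset(sub) in level
                          for sub in combinations(c, k - 1))]
    return frequent


baskets = [{"Book", "Pencil"}, {"Book", "Soap"}, {"Book", "Soap"}]
frequent = apriori(baskets, min_support=0.5)  # threshold chosen for illustration
for itemset, s in frequent.items():
    print(sorted(itemset), round(s, 2))
```

With this threshold {Pencil} is not frequent, so no itemset containing Pencil is ever generated as a candidate.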

Data Mining (cont'd)
Sells (TransID, Item): (1, Book), (1, Pencil), (2, Book), (2, Soap), (3, Book), (3, Soap)
Example:
1. Set minSupport = …
2. S({Book}) = 1, S({Pencil}) = 0.33, S({Soap}) = 0.67
3. S({Book, Soap}) = 0.67
4. S({Book, Soap, Pencil}) = 0
Where is {Soap, Pencil}?

Back to Schema Matching ...
Authors schemas: (Id, First, Last); (Id, Salary, Name, Year); (Id, AuthorFirst, AuthorLast, YearBirth); (Id, Author); (Id, FirstName, LastName, Income)
Goal: map {Name} to {Author}, {Salary} to {Income} ...
Idea: {Name} and {Author} are unlikely to appear together.
Solution: go to the supermarket, but instead of food buy attributes!
What is the difference from the supermarket example?

The Algorithm
Input: set of m schemas: (Id, First, Last); (Id, Salary, Name, Year); (Id, AuthorFirst, AuthorLast, YearBirth); (Id, Author); (Id, FirstName, LastName, Income)
Output: set of n-ary mappings:
{Name}:{Author}:{AuthorFirst, AuthorLast}:{First, Last} ...
{Salary}:{Income}
{Year}:{YearBirth}

Algorithm
Naive Algorithm
1. Make a list L of all attributes from all schemas: L = {Name, Salary, FirstName, LastName, Author, First, Last ...}
2. For each pair of attributes, calculate their support (how often they appear together):
S(Name, Salary) = 0.4
S(First, Last) = 0.95
S(Last, Name) = 0.1

Algorithm (cont'd)
3. Choose groups with low support:
S(Name, Salary) = 0.4, S(First, Last) = 0.95, S(Last, Name) = 0.1
4. Using the A-Priori property, calculate support for groups of sizes 3, 4, 5 ...
S(Name, LastName, Salary) = 0
S(First, Last, Salary) = …
5. Return all groups with low support
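
The pairwise part of the naive algorithm can be sketched as follows, using the five example schemas from the earlier slides. The function names and the threshold are assumptions for illustration. The support values printed here come from this small example and therefore differ from the illustrative numbers on the slide.

```python
from itertools import combinations

schemas = [
    {"Id", "First", "Last"},
    {"Id", "Salary", "Name", "Year"},
    {"Id", "AuthorFirst", "AuthorLast", "YearBirth"},
    {"Id", "Author"},
    {"Id", "FirstName", "LastName", "Income"},
]


def pair_support(schemas):
    """Support of every attribute pair: fraction of schemas containing both."""
    attributes = sorted(set().union(*schemas))  # step 1: the list L
    return {
        frozenset(pair): sum(set(pair) <= s for s in schemas) / len(schemas)
        for pair in combinations(attributes, 2)  # step 2
    }


def low_support_pairs(schemas, threshold):
    """Step 3: pairs that (almost) never appear together."""
    return {p for p, s in pair_support(schemas).items() if s <= threshold}


support = pair_support(schemas)
print(support[frozenset({"First", "Last"})])
print(support[frozenset({"Name", "Author"})])
print(frozenset({"Name", "Author"}) in low_support_pairs(schemas, 0.0))
```

Low support alone is a weak signal: unrelated pairs such as {Name, Income} also never co-occur in this input, so they are returned alongside the intended {Name, Author}.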

Algorithm (cont'd)
The algorithm is naive.
Suppose we have negative correlation for this: {name, author}
Then we also have negative correlation for this: {name, author, salary}, {name, author, yearOfBirth}
Actually, for any attribute X we have: {name, author, X}

Improvement
Improvement: define the support (s) of an itemset {a, b, c ...} to be MAX {s(a,b), s(b,c), s(a,c) ...}
Example: s(name, author) = 0.1, s(name, salary) = 0.5, s(salary, author) = 0.6
s(name, author, salary) = MAX (0.1, 0.5, 0.6) = 0.6
Now the support can go up, so checking it is not trivial.
What is the logic behind this?
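
The MAX-based support can be written directly from the definition. The function name group_support is hypothetical; the pair values are the ones from the example above.

```python
from itertools import combinations


def group_support(group, pair_support):
    """Support of a group = the largest support among its attribute pairs."""
    return max(pair_support[frozenset(pair)]
               for pair in combinations(sorted(group), 2))


pair_support = {
    frozenset({"name", "author"}): 0.1,
    frozenset({"name", "salary"}): 0.5,
    frozenset({"salary", "author"}): 0.6,
}

print(group_support({"name", "author"}, pair_support))
print(group_support({"name", "author", "salary"}, pair_support))
```

A group now has low support only if every pair inside it has low support, so adding an unrelated attribute such as salary to {name, author} raises the value instead of leaving it low.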

Generalizing the algorithm
Now the algorithm finds all groups of attributes (a, b, c ...) such that none of the pairs appears together.
Hopefully these are attributes with the same semantics: {name, author}, {salary, payments} ...
But what about this? ({first, last}, {name})
Currently we find only (1:1) matchings. For (n:m) we need to preprocess ...

Preprocess
Pre-process for the algorithm:
1. Make a list L of all attributes from all schemas: L = {Name, Salary, FirstName, LastName, Author, First, Last ...}
2. Run the normal A-Priori algorithm (find all attributes that DO appear together):
S(first, last) = 0.9
S(firstName, lastName) = 0.85

Preprocess
3. For each schema S in the input: for each frequent attribute group A: if A intersects with S, then add a new attribute "A" to S.
Example: S = (Id, First, Last), A = {First, Last}, S' = (Id, First, Last, {First, Last})
4. Run the previous algorithm on S1', S2' ... to find negative correlation.
Now we can find groups like: ({first, last}, {name})
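
Step 3 can be sketched as below. Representing the new grouped attribute as a frozenset is an assumption made for illustration, as is the function name; the intersection test follows the wording of the slide.

```python
def add_group_attributes(schema, frequent_groups):
    """Return S': the schema plus one grouped attribute per intersecting group."""
    original = set(schema)
    augmented = set(schema)
    for group in frequent_groups:
        if original & set(group):  # "A intersects with S"
            augmented.add(frozenset(group))
    return augmented


frequent_groups = [{"First", "Last"}, {"FirstName", "LastName"}]

s_prime = add_group_attributes({"Id", "First", "Last"}, frequent_groups)
print(s_prime)
```

After this step the grouped attribute {First, Last} is treated like any other attribute, so the negative-correlation search can pair it with {Name}.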

Still Not Perfect ...
Suppose we found these mappings:
{first, last}:{name}:{author}
{first, yearOfBirth}:{birthDate}
{yearOfBirth, monthOfBirth}:{birthDate}
There is a contradiction!

Solution
Solution: rank the mappings according to the support of the lowest pair in each mapping:
1. {first, last}:{name}:{author}
2. {first, yearOfBirth}:{birthDate}
3. {yearOfBirth, monthOfBirth}:{birthDate}
Add the top rank to the results: 1. {first, last}:{name}:{author}
Delete contradictions to this rank: 2. {first, yearOfBirth}:{birthDate} X
Process the next mapping: 3. {yearOfBirth, monthOfBirth}:{birthDate}
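
The selection loop above can be sketched as a greedy pass over the ranked list. The conflict test used here is an assumption, since the slides do not define it precisely: two mappings are taken to contradict each other when some attribute occurs in both but inside different groups. The input is assumed to be already sorted by rank.

```python
def conflicts(m1, m2):
    """True if some attribute occurs in both mappings but in different groups."""
    return any(g1 != g2 and g1 & g2 for g1 in m1 for g2 in m2)


def select(ranked):
    """Greedy selection: take the top mapping, drop what contradicts it, repeat."""
    selected = []
    remaining = list(ranked)
    while remaining:
        top = remaining.pop(0)
        selected.append(top)
        remaining = [m for m in remaining if not conflicts(top, m)]
    return selected


m1 = (frozenset({"first", "last"}), frozenset({"name"}), frozenset({"author"}))
m2 = (frozenset({"first", "yearOfBirth"}), frozenset({"birthDate"}))
m3 = (frozenset({"yearOfBirth", "monthOfBirth"}), frozenset({"birthDate"}))

result = select([m1, m2, m3])
for mapping in result:
    print(":".join("{" + ", ".join(sorted(g)) + "}" for g in mapping))
```

Mapping 2 is dropped because it places first in a different group than mapping 1 does; mapping 3 shares no attribute with mapping 1 and survives.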

Attributes with the same name
Step 1 of the algorithm (reminder): make a list L of all attributes from all schemas: L = {Name, Salary, FirstName, LastName, Author, First, Last ...}
This means that two attributes with the same name are always considered the same: Payment (longint) versus Payment (datetime)?
Solution: add the type to the name: (Id, First, Last) becomes (Id_Int, First_String, Last_String)
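
A minimal sketch of the renaming step, assuming each schema is available as a name-to-type dictionary; the function name qualify is hypothetical.

```python
def qualify(name, data_type):
    """Append the data type so that same-named attributes of different types differ."""
    return f"{name}_{data_type}"


schema = {"Id": "Int", "First": "String", "Last": "String"}
qualified = [qualify(name, data_type) for name, data_type in schema.items()]
print(qualified)

# The two Payment attributes from the slide no longer collide.
print(qualify("Payment", "longint") == qualify("Payment", "datetime"))
```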

Correlation Measure
Schemas: (Id, First, Last); (Id, Salary, Name, Year); (Id, AuthorFirst, AuthorLast, YearBirth); (Id, Author); (Id, FirstName, LastName, Income)
The rare attribute problem: s(Income, Id) = 0.2. So Income = Id?

Correlation Measure (cont'd)
Schemas: (Id, First, Last); (Id, Salary, Name, Year); (Id, AuthorFirst, AuthorLast, YearBirth); (Id, Author); (Id, FirstName, LastName, Income)
The sparseness problem: s(Salary, Income) = 0. If Salary = Income, then what are their equivalents in the other tables?

Correlation Measure (cont'd)
Support of an itemset: the fraction of transactions that contain all items in the itemset.
There are other ways to calculate support. Let A, B be two attributes. Define:
f11: the number of schemas where both A and B appear
f10: the number of schemas where only A appears
...
f1+: f11 + f10
Contingency table:
        B      ^B
A       f11    f10    f1+
^A      f01    f00    f0+
        f+1    f+0    f++

Correlation Measure (cont'd)
We used: support = f11 / f++
Lift: f00 f11 / (f10 f01)
H-measure: f01 f10 / (f+1 f1+)
Here f11, f10, f01, f00 are the contingency counts for attributes A and B, with marginals f1+, f0+, f+1, f+0 and total f++.
Every measure fits a different situation. For example, in the matching problem we want to "punish" attributes that co-appear, as in (Id, Salary, Name, Year).
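
The counts and two of the measures can be computed on the five example schemas as a sanity check. This is a sketch under assumptions: the function names are hypothetical, f10 is read as "A without B" as defined above, and both attributes are assumed to occur in at least one schema so that the H-measure denominator is not zero.

```python
schemas = [
    {"Id", "First", "Last"},
    {"Id", "Salary", "Name", "Year"},
    {"Id", "AuthorFirst", "AuthorLast", "YearBirth"},
    {"Id", "Author"},
    {"Id", "FirstName", "LastName", "Income"},
]


def contingency(schemas, a, b):
    """Return (f11, f10, f01, f00) for attributes a and b."""
    f11 = sum(a in s and b in s for s in schemas)
    f10 = sum(a in s and b not in s for s in schemas)
    f01 = sum(a not in s and b in s for s in schemas)
    f00 = len(schemas) - f11 - f10 - f01
    return f11, f10, f01, f00


def support(schemas, a, b):
    f11, f10, f01, f00 = contingency(schemas, a, b)
    return f11 / (f11 + f10 + f01 + f00)


def h_measure(schemas, a, b):
    """f01 * f10 / (f+1 * f1+); assumes a and b each occur at least once."""
    f11, f10, f01, f00 = contingency(schemas, a, b)
    f1_plus = f11 + f10
    f_plus1 = f11 + f01
    return (f01 * f10) / (f_plus1 * f1_plus)


for a, b in [("Income", "Id"), ("Salary", "Income")]:
    print(a, b, support(schemas, a, b), h_measure(schemas, a, b))
```

On this input the H-measure is 0 for (Income, Id), which always co-appear when Income is present, and 1 for (Salary, Income), which never co-appear, whereas plain support rates both pairs as low.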

Applications
This approach can only be used when we have many schemas.
Web query interfaces. Example: El-Al.Com, Arkia.Com and American Airlines.Com, with form fields such as Adult, Child, Infant, Passengers, Destination, To.
Data migration? Is it possible to use the algorithm for migration by running it on many random schemas?

Complexity
The A-Priori algorithm is O(2^n).
Usually there are only a few correlations, so in step (k+1) we consider just a few of the groups of size k.
