Automatic Schema Matching Seminar on Databases and the Internet Yaron Naveh January 2006.

1 Automatic Schema Matching Seminar on Databases and the Internet Yaron Naveh January 2006

2 Automatic Schema Matching, SDBI, 2006. Articles: "A survey of approaches to automatic schema matching", Rahm & Bernstein (2001); "Discovering Complex Matchings across Web Query Interfaces: A Correlation Mining Approach", He, Chen-Chuan Chang & Han (2004).

3 Contents: Problem Definition; Applications; Classic Approaches; Correlation Mining Approach

4 Match Definition. A match is a mapping between elements of two schemas that correspond semantically to each other. Figure: an Authors schema shown with columns ID, Name, NumOfBooks beside columns AID, AName, ANumOfBooks.

5 Match Properties. Figure: columns ID, Name, NumOfBooks beside Authors columns ID, FName, LName, YearOfBirth, with arrows labelled (1:1), (1:n) and ?. (n:m) matching is also possible.

6 Match Properties (cont'd). Figure: Authors with columns ID, Name, Salary ($) beside columns ID, Name, Salary (NIS), where Salary(NIS) = Salary($) * 4.55. We will not find the function, just the attributes.

7 Match Properties (cont'd). One relation is mapped to two others. Figure: a relation (EmpName, DeptName) corresponds to the join of Employees (EmpName, DeptID) and Departments (DeptID, DeptName).

8 Match Properties (cont'd). Figure: Lessons with columns Teacher, StartTime, EndTime beside columns Teacher, Time, with the mappings marked ?. Too hard for a PC! The PC should only suggest mappings to the user.

9 Match Properties (cont'd). So maybe it can all be done manually? Figure: two schemas of ten fields each (Field1 to Field10). An automated tool can be helpful here …

10 Match Generalization. We have defined a match for the relational model. There are other interesting models: … Figure (only partly recoverable): labels Authors, Books, ID, Name, with sample values 1 and Calvino.

11 Match Generalization (cont'd). Define a schema to be a set of elements connected by some structure. Use the natural correspondence: nodes and edges in graphs; elements, subelements, and IDREFs in XML; …

12 Contents: Problem Definition; Applications; Classic Approaches; Correlation Mining Approach

13 Data Migration. Migrate data from the old DB to the new DB. Special case: data warehouse. Figure: Old Forum (Date, From, Message) and New Forum (Time, Writer, Message, IsVisible, ResponseTo).

14 E-Commerce. Map between different message formats. Figure: Book Store and General Store messages carrying the values The Invisible Cities, 50, book.

15 Global Query Interface. Figure: the GOOGLE, MSN and Yahoo search forms. You want to build a Meta-Querier. However …

16 Global Query Interface (cont'd). Solution: reduce the HTML form to its "schema". Figure: the GOOGLE, MSN and Yahoo forms reduced to field names such as Search, Type, q and Qry.

17 Semantic Query Processing. Keyword search scenario. Find: Author + Ram + Oren. Figure: Authors (Id, Name) with two candidate queries, each marked ?: SELECT * WHERE Id = 'Ram Oren' and SELECT * WHERE Name = 'Ram Oren'. How does this differ from previous examples?

18 Contents: Problem Definition; Applications; Classic Approaches; Correlation Mining Approach

19 Matchers. There are several algorithms for mapping the attributes of two schemas. Define such an algorithm as a matcher. Define a hybrid matcher as a matcher that combines results from other matchers.

20 Schema-based vs. Instance-based. Two ways to perform a match: use schema data (field name, type, constraints …); use data from the table.

21 Instance-based. Two options for using data from the table: build a schema from instance data, then use schema matchers; or use the data directly. Example: a Books table (BookID, TotPages, TotPrice) with rows (1, 500, 50), (2, 400, 40), (3, 450, 90), and a second table (BookID, TotalP) with row (1, 60). What is TotalP?
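One way to read the TotalP question is as a comparison of value distributions. The sketch below is only an illustration, not the method from the surveyed paper: it assumes the figures on the slide split into the rows listed in the code, and it uses a deliberately crude statistic (distance between column means). The helper name `closest_column` is hypothetical.

```python
from statistics import mean

# Instance data as read from the slide (the row split is an assumption).
books = {"TotPages": [500, 400, 450], "TotPrice": [50, 40, 90]}
other = {"TotalP": [60]}

def closest_column(values, candidates):
    """Return the candidate column whose mean is nearest to the mean of values."""
    target = mean(values)
    return min(candidates, key=lambda name: abs(mean(candidates[name]) - target))

# Compare TotalP against both columns of the Books table.
best = closest_column(other["TotalP"], books)
print(best)
```

A real instance-based matcher would look at more than one statistic (value ranges, patterns, data types), but the idea of matching on the data rather than on the names is the same.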

22 Instance-based (cont'd). When will we use or not use instance-based matchers? Useful when no schema data is available. Not useful when no instance data is available …

23 Schema-Based. What useful data is there in the schema? Element's name; description; data type; relationships; constraints.

24 Schema-Based: Name Matching. Map elements with similar names: string equality; common substrings (Birthday --> DayOfBirth); canonical names (CName --> Customer Name); synonyms (Car --> Automobile); hypernyms (Book is-a Publication); Soundex (ShipTo --> Ship2); user provided (Issue --> Bug).
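A few of these name-matching rules can be sketched in a handful of lines. This is a minimal illustration, assuming small hand-written lookup tables in place of a real thesaurus; `CANONICAL`, `SYNONYMS`, `normalize` and `name_similarity` are hypothetical names, and the common-substring rule is approximated with the standard library's `difflib.SequenceMatcher`. Hypernyms and Soundex are left out.

```python
from difflib import SequenceMatcher

# Hypothetical lookup tables; a real matcher would use a thesaurus or user input.
CANONICAL = {"cname": "customername"}
SYNONYMS = {frozenset({"car", "automobile"}), frozenset({"issue", "bug"})}

def normalize(name):
    """Lower-case, drop non-alphanumerics, then expand known abbreviations."""
    key = "".join(ch for ch in name.lower() if ch.isalnum())
    return CANONICAL.get(key, key)

def name_similarity(a, b):
    """Score in [0, 1]: 1.0 for equal, canonical or synonym names,
    otherwise a ratio based on shared substrings."""
    x, y = normalize(a), normalize(b)
    if x == y or frozenset({x, y}) in SYNONYMS:
        return 1.0
    return SequenceMatcher(None, x, y).ratio()

print(name_similarity("CName", "Customer Name"))
print(name_similarity("Birthday", "DayOfBirth"))
```

Returning a score rather than a yes/no answer makes it easy for a hybrid matcher to combine this result with the output of other matchers.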

25 Schema-Based: Description. Map elements based on their descriptions. Schema A: empn //employee name. Schema B: name //name of employee.

26 Schema-Based: Constraint Based. Map elements based on constraints: data types; unique, primary and foreign keys. Figure labels: Employees (Name, PID), Permissions (ID, PLevel), Employees (Name, PID), Payments (ID, Sum), with the match marked ?.

27 Reuse Previous Matching. Schema A: Name, Salary. Schema B: AName, Income. Schema C: Author, Money. Get the mapping A --> C from the mappings A --> B and B --> C. A partial reuse is also possible (e.g. on some of the attributes). Be aware of the domain: salary and income are not always the same!
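Reusing earlier matches amounts to composing two mappings. The sketch below assumes 1:1 mappings stored as plain dictionaries; the function name `compose` is hypothetical. Attributes of A whose B counterpart has no entry in the second mapping are skipped, which gives the partial reuse mentioned on the slide.

```python
def compose(a_to_b, b_to_c):
    """Derive A --> C from A --> B and B --> C, keeping only attributes
    whose intermediate B attribute is itself mapped."""
    return {a: b_to_c[b] for a, b in a_to_b.items() if b in b_to_c}

# Mappings taken from the slide's three schemas.
a_to_b = {"Name": "AName", "Salary": "Income"}
b_to_c = {"AName": "Author", "Income": "Money"}

a_to_c = compose(a_to_b, b_to_c)
print(a_to_c)
```

The composition is purely mechanical, so the warning about the domain still applies: the result is only as good as the two mappings it was built from.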

28 Complexity. We must compare every subgroup of attributes in schema A to every subgroup in schema B, which is exponential in the number of attributes. However, we can assume the number of attributes is bounded … Also, check (n:m) matchings only for n, m < C for some constant C.
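The effect of the bound C can be made concrete by counting attribute subgroups. This is a small counting sketch only; the function names are hypothetical, and "subgroup" is taken to mean a non-empty subset of a schema's attributes.

```python
from math import comb

def all_subgroups(n):
    """Number of non-empty attribute subsets of a schema with n attributes."""
    return 2 ** n - 1

def bounded_subgroups(n, c):
    """Number of non-empty subsets with fewer than c attributes (sizes 1 .. c-1)."""
    return sum(comb(n, k) for k in range(1, c))

n = 10
print(all_subgroups(n) ** 2)         # candidate (n:m) pairs for two 10-attribute schemas
print(bounded_subgroups(n, 3) ** 2)  # the same count when n, m < 3
```

With the bound in place the number of candidates grows polynomially in the number of attributes instead of exponentially.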

29 Contents: Problem Definition; Applications; Classic Approaches; Correlation Mining Approach

30 Data Mining. Data mining is the process of discovering patterns in data, usually stored in a database. Example table Sells (TransID, Item): (1, Book), (1, Pencil), (2, Book), (2, Soap), (3, Book), (3, Soap). Which items are likely to co-appear?

31 Data Mining (cont'd). Table Sells (TransID, Item): (1, Book), (1, Pencil), (2, Book), (2, Soap), (3, Book), (3, Soap). Support of an itemset: the fraction of transactions that contain all items in the itemset. What is the support for {Book}? 1. And for {Book, Soap}? 0.666. The A-Priori property: the support of any subset of an itemset is at least as large as the support of the itemset itself.
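The support definition can be checked directly on the Sells table. A minimal sketch, with each transaction stored as a set of items; the function name `support` follows the slide's terminology but the code itself is only an illustration.

```python
# The three transactions of the Sells table, one set of items per TransID.
transactions = [{"Book", "Pencil"}, {"Book", "Soap"}, {"Book", "Soap"}]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    items = set(itemset)
    return sum(items <= t for t in transactions) / len(transactions)

print(support({"Book"}, transactions))
print(support({"Book", "Soap"}, transactions))
```

Because every transaction containing {Book, Soap} also contains {Book}, the support of the subset can never be smaller, which is the A-Priori property.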

32 Data Mining (cont'd). Table Sells (TransID, Item): (1, Book), (1, Pencil), (2, Book), (2, Soap), (3, Book), (3, Soap). Algorithm to find frequent itemsets:
1. Define a threshold minSupport for "frequent" itemsets
2. Calculate support for all itemsets of size 1
3. Calculate support for itemsets of size 2, 3, 4 …
4. For each size k save the frequent itemsets
5. Stop when there are no frequent itemsets of size k
Why can we stop?

33 Data Mining (cont'd). Example on the table Sells (TransID, Item): (1, Book), (1, Pencil), (2, Book), (2, Soap), (3, Book), (3, Soap):
1. Set minSupport = 0.5
2. S({Book}) = 1, S({Pencil}) = 0.33, S({Soap}) = 0.666
3. S({Book, Soap}) = 0.666
4. S({Book, Soap, Pencil}) = 0
Where is {Soap, Pencil}?
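The level-wise search in this example can be sketched as follows. This is a textbook-style illustration under the slide's definitions, not code from the presentation; `frequent_itemsets` is a hypothetical name. Candidates of size k+1 are built only from frequent itemsets of size k, which is why {Soap, Pencil} is never counted: {Pencil} already fell below minSupport.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Level-wise search: size-(k+1) candidates are built only from
    frequent size-k itemsets (the A-Priori property)."""
    n = len(transactions)

    def support(items):
        return sum(items <= t for t in transactions) / n

    level = [frozenset([i]) for i in sorted({i for t in transactions for i in t})]
    result = {}
    k = 1
    while level:
        frequent = {c: support(c) for c in level if support(c) >= min_support}
        result.update(frequent)
        k += 1
        # Join frequent itemsets, then keep candidates whose (k-1)-subsets are all frequent.
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        level = [c for c in candidates
                 if all(frozenset(s) in frequent for s in combinations(c, k - 1))]
    return result

transactions = [{"Book", "Pencil"}, {"Book", "Soap"}, {"Book", "Soap"}]
found = frequent_itemsets(transactions, 0.5)
print(sorted(sorted(itemset) for itemset in found))
```

The loop stops as soon as a level produces no frequent itemsets, since no larger itemset could then be frequent either.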

34 Back to Schema Matching … Authors schemas: (Id, First, Last); (Id, Salary, Name, Year); (Id, AuthorFirst, AuthorLast, YearBirth); (Id, Author); (Id, FirstName, LastName, Income). Goal: map {Name} to {Author}, {Salary} to {Income} … Idea: {Name} and {Author} are unlikely to appear together. Solution: go to the supermarket, but instead of food buy attributes! What is the difference from the supermarket example?

35 The Algorithm. Input: set of m schemas, e.g. (Id, First, Last); (Id, Salary, Name, Year); (Id, AuthorFirst, AuthorLast, YearBirth); (Id, Author); (Id, FirstName, LastName, Income). Output: set of n-ary mappings, e.g. {Name}:{Author}:{AuthorFirst, AuthorLast}:{First, Last} …; {Salary}:{Income}; {Year}:{YearBirth}.

36 Algorithm (naive algorithm)
1. Make a list L of all attributes from all schemas: L = {Name, Salary, FirstName, LastName, Author, First, Last …}
2. For each pair of attributes, calculate their support (how often they appear together): S(Name, Salary) = 0.4, S(First, Last) = 0.95, S(Last, Name) = 0.1

37 Algorithm (cont'd)
3. Choose groups with low support, given S(Name, Salary) = 0.4, S(First, Last) = 0.95, S(Last, Name) = 0.1
4. Using the A-Priori property, calculate support for groups of sizes 3, 4, 5 …: S(Name, LastName, Salary) = 0, S(First, Last, Salary) = 0.1
5. Return all groups with low support
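Steps 1 to 3 of the naive algorithm can be sketched on the five example schemas from the earlier slides. This is an illustration only: the support values printed here come from those five schemas and therefore differ from the illustrative numbers on the slide, and the names `co_support`, `pair_support` and `never_together` are hypothetical.

```python
from itertools import combinations

# The five example schemas, each treated as a "transaction" of attribute names.
schemas = [
    {"Id", "First", "Last"},
    {"Id", "Salary", "Name", "Year"},
    {"Id", "AuthorFirst", "AuthorLast", "YearBirth"},
    {"Id", "Author"},
    {"Id", "FirstName", "LastName", "Income"},
]

def co_support(attrs, schemas):
    """Fraction of schemas in which all the given attributes appear together."""
    attrs = set(attrs)
    return sum(attrs <= s for s in schemas) / len(schemas)

# Step 1: the list of all attributes. Step 2: support of every pair.
attributes = sorted(set().union(*schemas))
pair_support = {frozenset(p): co_support(p, schemas)
                for p in combinations(attributes, 2)}

# Step 3: keep the pairs with low support (here: pairs that never co-occur).
never_together = [p for p, s in pair_support.items() if s == 0]
print(pair_support[frozenset({"Name", "Author"})])
print(pair_support[frozenset({"First", "Last"})])
```

Pairs such as {Name, Author} come out with support 0 and become candidate matches, but so do many unrelated pairs, which is one reason the slides call this version naive.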

38 Algorithm (cont'd). The algorithm is naive. Suppose we have negative correlation for {name, author}. Then we also have negative correlation for {name, author, salary} and {name, author, yearOfBirth}. Actually, for any attribute X we have negative correlation for {name, author, X}.

39 Improvement. Define the support s of an itemset {a, b, c, …} to be MAX{s(a,b), s(b,c), s(a,c), …}. Example: s(name, author) = 0.1, s(name, salary) = 0.5, s(salary, author) = 0.6, so s(name, author, salary) = MAX(0.1, 0.5, 0.6) = 0.6. What is the logic behind this? Now the support can go up, so checking it is not trivial.
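The redefined support is a maximum over pairwise supports, so a group scores low only if every pair inside it rarely co-occurs. A minimal sketch using the slide's numbers; `group_support` is a hypothetical name.

```python
from itertools import combinations

def group_support(group, pair_support):
    """Support of a group = the largest pairwise support among its members."""
    return max(pair_support[frozenset(p)] for p in combinations(group, 2))

# Pairwise supports from the slide's example.
pair_support = {
    frozenset({"name", "author"}): 0.1,
    frozenset({"name", "salary"}): 0.5,
    frozenset({"salary", "author"}): 0.6,
}

print(group_support(["name", "author"], pair_support))
print(group_support(["name", "author", "salary"], pair_support))
```

Adding salary to {name, author} raises the group's support from 0.1 to 0.6, so the larger group is no longer reported as negatively correlated.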

40 Generalizing the algorithm. Now the algorithm finds all groups of attributes (a, b, c, …) such that no pair of them appears together. Hopefully these are attributes with the same semantics: {name, author}, {salary, payments} … But what about ({first, last}, {name})? Currently we find only (1:1) matchings. For (n:m) we need to preprocess …

41 Preprocess. Pre-processing for the algorithm:
1. Make a list L of all attributes from all schemas: L = {Name, Salary, FirstName, LastName, Author, First, Last …}
2. Run the normal A-Priori algorithm (find all attributes that DO appear together): S(first, last) = 0.9, S(firstName, lastName) = 0.85

42 Preprocess
3. For each schema S in the input: for each frequent attribute group A: if A intersects S, then add a new attribute "A" to S. Example: S = (Id, First, Last) and A = {First, Last} give S' = (Id, First, Last, {First, Last}).
4. Run the previous algorithm on S1', S2', … to find negative correlations.
Now we can find groups like ({first, last}, {name}).
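Step 3 can be sketched by treating each frequent group as one extra, composite attribute. This follows the rule exactly as stated on the slide (add the group whenever it intersects the schema); the function name `add_group_attributes` and the use of a frozenset to represent the composite attribute are assumptions made for illustration.

```python
def add_group_attributes(schema, frequent_groups):
    """Return S': the schema plus one composite attribute for every
    frequent group that shares at least one attribute with it."""
    extended = set(schema)
    for group in frequent_groups:
        if set(schema) & set(group):
            extended.add(frozenset(group))
    return extended

frequent_groups = [{"First", "Last"}]
s_prime = add_group_attributes({"Id", "First", "Last"}, frequent_groups)
print(len(s_prime))
```

After this step the negative-correlation search treats {First, Last} like any other attribute, which is what allows a grouping such as ({first, last}, {name}) to be found.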

43 Still Not Perfect … Suppose we found these mappings: {first, last}:{name}:{author}; {first, yearOfBirth}:{birthDate}; {yearOfBirth, monthOfBirth}:{birthDate}. There is a contradiction!

44 Solution. Rank the mappings according to the support of the lowest pair in each mapping:
1. {first, last}:{name}:{author}
2. {first, yearOfBirth}:{birthDate}
3. {yearOfBirth, monthOfBirth}:{birthDate}
Add the top-ranked mapping to the results: 1. {first, last}:{name}:{author}. Delete the mappings that contradict it: 2. {first, yearOfBirth}:{birthDate} X. Process the next mapping: 3. {yearOfBirth, monthOfBirth}:{birthDate}.
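The greedy selection can be sketched as below. The slide does not define a contradiction formally, so the code assumes one: two mappings conflict when a group in one overlaps a group in the other without being identical to it. The input is taken to be already ranked; `conflicts` and `select` are hypothetical names.

```python
def conflicts(m1, m2):
    """Assumed rule: some group of m1 overlaps a group of m2 without being equal to it."""
    return any(g1 & g2 and g1 != g2 for g1 in m1 for g2 in m2)

def select(ranked):
    """Repeatedly take the top-ranked mapping and drop everything that contradicts it."""
    chosen, remaining = [], list(ranked)
    while remaining:
        top = remaining.pop(0)
        chosen.append(top)
        remaining = [m for m in remaining if not conflicts(top, m)]
    return chosen

m1 = [frozenset({"first", "last"}), frozenset({"name"}), frozenset({"author"})]
m2 = [frozenset({"first", "yearOfBirth"}), frozenset({"birthDate"})]
m3 = [frozenset({"yearOfBirth", "monthOfBirth"}), frozenset({"birthDate"})]

result = select([m1, m2, m3])
print(len(result))
```

On the slide's three mappings the second one is dropped because {first, yearOfBirth} overlaps {first, last}, leaving the first and third.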

45 Attributes with the same name. Step 1 of the algorithm (reminder): make a list S of all attributes from all schemas, S = {Name, Salary, FirstName, LastName, Author, First, Last …}. This means that two attributes with the same name are always considered the same. Payment (longint) vs. Payment (datetime)? Solution: add the type to the name, so (Id, First, Last) becomes (Id_Int, First_String, Last_String).

46 Correlation Measure. The rare attribute problem: for the schemas (Id, First, Last); (Id, Salary, Name, Year); (Id, AuthorFirst, AuthorLast, YearBirth); (Id, Author); (Id, FirstName, LastName, Income), s(Income, Id) = 0.2. So Income = Id?

47 Correlation Measure (cont'd). The sparseness problem: for the schemas (Id, First, Last); (Id, Salary, Name, Year); (Id, AuthorFirst, AuthorLast, YearBirth); (Id, Author); (Id, FirstName, LastName, Income), s(Salary, Income) = 0. If Salary = Income, then what are their equivalents in the other tables?

48 Correlation Measure (cont'd). Support of an itemset: the fraction of transactions that contain all items in the itemset. There are other ways to calculate support. Let A, B be two attributes. Define f11: the number of schemas where both A and B appear; f10: the number of schemas where only A appears; …; f1+ = f11 + f10. Contingency table:
          B      ¬B
A         f11    f10    f1+
¬A        f01    f00    f0+
          f+1    f+0    f++

49 Correlation Measure (cont'd). We used: support = f11 / f++. Lift: f00 f11 / (f10 f01). H-measure: f01 f10 / (f+1 f1+). Contingency table:
          B      ¬B
A         f11    f10    f1+
¬A        f01    f00    f0+
          f+1    f+0    f++
Every measure fits a different situation. For example, in the matching problem we want to "punish" attributes that co-appear. Figure: a schema (Id, Salary, Name, Year).
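The difference between plain support and the H-measure shows up on the five example schemas from the earlier slides. This sketch computes the contingency counts by scanning the schemas and applies the H-measure as written on the slide, f01 f10 / (f+1 f1+). Returning 0.0 when an attribute never appears is an assumption made here to avoid dividing by zero; the function names are hypothetical.

```python
schemas = [
    {"Id", "First", "Last"},
    {"Id", "Salary", "Name", "Year"},
    {"Id", "AuthorFirst", "AuthorLast", "YearBirth"},
    {"Id", "Author"},
    {"Id", "FirstName", "LastName", "Income"},
]

def counts(a, b, schemas):
    """Contingency counts (f11, f10, f01, f00) for attributes a and b."""
    f11 = sum(a in s and b in s for s in schemas)
    f10 = sum(a in s and b not in s for s in schemas)
    f01 = sum(a not in s and b in s for s in schemas)
    return f11, f10, f01, len(schemas) - f11 - f10 - f01

def support(a, b, schemas):
    """f11 / f++: how often the two attributes co-occur."""
    return counts(a, b, schemas)[0] / len(schemas)

def h_measure(a, b, schemas):
    """f01 * f10 / (f+1 * f1+); 0.0 if either attribute never appears (assumption)."""
    f11, f10, f01, _ = counts(a, b, schemas)
    f1p, fp1 = f11 + f10, f11 + f01
    if f1p == 0 or fp1 == 0:
        return 0.0
    return (f01 * f10) / (fp1 * f1p)

for pair in [("Income", "Id"), ("Salary", "Income")]:
    print(pair, support(*pair, schemas), h_measure(*pair, schemas))
```

Both pairs have low support, but only Salary and Income get a high H-measure: a single co-occurrence of Income with Id is enough to drive that pair's score to zero, which is the "punishment" for co-appearing.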

50 Applications. This approach can only be used when we have many schemas. Web query interfaces. Example: the El-Al.Com, Arkia.Com and American Airlines.Com forms, with fields such as Adult, Child, Infant, Destination, Passengers and To. Data migration? Is it possible to use the algorithm for migration by running it on many random schemas?

51 Complexity. The A-Priori algorithm is O(2^n). Usually there are only a few correlations, so in step (k+1) we consider just a few of the groups of size k.
