iMAP: Discovering Complex Semantic Matches Between Database Schemas. Ohad Edry, January 2009, Seminar in Databases.

Presentation transcript:

iMAP: Discovering Complex Semantic Matches Between Database Schemas Ohad Edry January 2009 Seminar in Databases

Motivation
- Consider a union of the databases of two banks.
- We need to generate a mapping between the schemas.
- Bank A tables: (House Number, Street, City, Name, Id); (Account status, Account number, Id)
- Bank B tables: (Account status, Account, Address, Last name, First name, Id)

Introduction
- Semantic mappings specify the relationships between data stored in disparate sources.
- A mapping relates an attribute of the target schema to attributes of the source schema, according to their semantics.

Motivation – Example (continued)
[Figure: the Bank A and Bank B tables from the Motivation slide, connected by a semantic mapping.]

Introduction
- Most of the work in this field has focused on the matching process.
- Matches can be split into 2 types:
  - 1-1 matching.
  - Complex matching: a combination of attributes in one schema corresponds to a combination of attributes in the other schema.
- Match candidate: any proposed matching between attributes of the source and target schemas.

Motivation – Example (continued)
[Figure: the same Bank A and Bank B tables, with a 1-1 match candidate and a complex match candidate marked on the semantic mapping.]

Introduction - examples:
Example 1:

  Phone   Address    Name
  12345   Haifa      Ohad
  13579   Tel-Aviv   David

  Cellular   Location   Student
             Haifa      Eyal
             Tel-Aviv   Miri

  1-1 matching: Name = Student, Address = Location, Phone = Cellular.

Example 2:

  Company A: Price, Product name, Product ID, Discount
  Company B: Product Price, Name, Product ID

  Product Price = Price*(1-Discount)
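
To make Example 2 concrete, the following minimal Python sketch applies the matches to translate Company A rows into Company B's schema. The row values and the snake_case field names are invented for illustration; only the formula Product Price = Price*(1-Discount) comes from the slide.

```python
# Hypothetical Company A rows (values invented for illustration).
company_a = [
    {"product_id": 1, "product_name": "Lamp", "price": 100.0, "discount": 0.25},
    {"product_id": 2, "product_name": "Desk", "price": 250.0, "discount": 0.0},
]

def to_company_b(row):
    """Translate one Company A row into Company B's schema."""
    return {
        "product_id": row["product_id"],                         # 1-1 match
        "name": row["product_name"],                             # 1-1 match
        "product_price": row["price"] * (1 - row["discount"]),   # complex match
    }

company_b = [to_company_b(row) for row in company_a]
```

The two 1-1 matches are plain renames, while the complex match needs an arithmetic expression over two source attributes. Finding that expression automatically is the hard part.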

Difficulties in Generating Matchings
- Matches are difficult to find because:
  - Finding complex matches is not trivial at all. How should the system know that Product Price = Price*(1-Discount)?
  - The number of candidates for complex matches is large.
  - Sometimes tables must be joined before the match can be expressed.
- Example attributes: Price, Product name, Product ID, Discount on one side; Product Price, Product Name, Product ID on the other, with Product Price = Price*(1-Discount).

Main Parts of the iMAP System
- Generating match candidates
- Pruning match candidates
  - By exploiting domain knowledge
- Explaining match predictions
  - Provides an explanation for a selected predicted match
  - Makes the system semi-automatic

iMAP System Architecture
- Consists of three main modules:
  - Match Generator: generates the match candidates, using special searchers over the target schema and the source schema.
  - Similarity Estimator: generates a matrix that stores the similarity score of each pair (target attribute, match candidate).
  - Match Selector: examines the score matrix and outputs the best matches under certain conditions.

iMAP System Architecture – cont.
- For each attribute t of T, iMAP generates match candidates from S.
- Similarity Estimator: receives the match candidates and outputs a similarity matrix.
- Match Selector: receives the similarity matrix and outputs the final match candidates.
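
The three-module flow can be sketched in a few lines of Python. This is only an illustration of how data moves between the modules, not iMAP's implementation: the function names, the candidate space (single attributes and ordered pairs) and the name-similarity score from `difflib` are all assumptions made for the sketch.

```python
from difflib import SequenceMatcher

def generate_candidates(source_attrs):
    """Toy match generator: single attributes and ordered pairs of attributes."""
    singles = [(a,) for a in source_attrs]
    pairs = [(a, b) for a in source_attrs for b in source_attrs if a != b]
    return singles + pairs

def estimate_similarity(target_attrs, candidates):
    """Toy similarity estimator: score matrix keyed by (target attribute, candidate)."""
    return {
        (t, c): SequenceMatcher(None, t, " ".join(c)).ratio()
        for t in target_attrs
        for c in candidates
    }

def select_matches(target_attrs, matrix):
    """Toy match selector: keep the highest-scoring candidate per target attribute."""
    return {
        t: max((score, c) for (tt, c), score in matrix.items() if tt == t)[1]
        for t in target_attrs
    }

source_attrs = ["id", "first name", "last name"]   # hypothetical source schema S
target_attrs = ["id", "name"]                      # hypothetical target schema T
candidates = generate_candidates(source_attrs)
matrix = estimate_similarity(target_attrs, candidates)
matches = select_matches(target_attrs, matrix)
```

Here `matrix` plays the role of the similarity matrix, and `matches` holds one selected candidate per target attribute.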

Part 1: Match Generation - searchers
- The key to match generation is to search through the space of possible match candidates.
- Search space: all attributes and data in the source schemas.
- Searchers rely on knowledge of operators and attribute types (such as numeric and textual), and on some heuristic methods.

The Internals of Searchers
- Search strategy
  - Copes with the large search space by using standard beam search.
- Match evaluation
  - Assigns each candidate a score that approximates the distance between the candidate and the target.
- Termination condition
  - Because the search space is large, the search has to be stopped at some point.

The Internals of Searchers – Example
- The search runs in iterations; each iteration i is limited to k results.
- Attributes: Price, Product name, Product ID, Discount on one side; Product Price, Name, Product ID on the other.
- Candidates in an iteration:
  1. Product Price = Price*(1-Discount)
  2. Product Price = Product ID
  ...
  k. ...
- MAX_i and MAX_i+1 are the highest scores in iterations i and i+1.
- Stop when MAX_i - MAX_i+1 < delta.
- Return the first k candidates.
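
The beam search with its stopping rule can be sketched as follows. This is a generic illustration under stated assumptions, not iMAP's code: the stopping test is read as "the best scores of two consecutive iterations differ by less than delta", candidates are tuples of text attributes to concatenate, and the score is a string similarity from `difflib`. The column names and values are invented.

```python
from difflib import SequenceMatcher

def beam_search(initial, expand, score, k=2, delta=0.01, max_iters=10):
    """Keep the k best candidates per iteration; stop when the best score stalls."""
    beam = sorted(initial, key=score, reverse=True)[:k]
    best = score(beam[0])
    for _ in range(max_iters):
        # Pool = current beam plus all one-step expansions, without duplicates.
        pool = list(dict.fromkeys(beam + [c for cand in beam for c in expand(cand)]))
        beam = sorted(pool, key=score, reverse=True)[:k]
        new_best = score(beam[0])
        if abs(new_best - best) < delta:  # MAX_i and MAX_i+1 differ by less than delta
            break
        best = new_best
    return beam

# Hypothetical source columns and target values.
source = {
    "first_name": ["Ohad", "Miri"],
    "last_name": ["Edry", "Levi"],
    "city": ["Haifa", "Tel-Aviv"],
}
target_values = ["Ohad Edry", "Miri Levi"]

def score(candidate):
    """Average string similarity between the concatenated columns and the target."""
    rows = zip(*(source[attr] for attr in candidate))
    sims = [SequenceMatcher(None, " ".join(r), t).ratio()
            for r, t in zip(rows, target_values)]
    return sum(sims) / len(sims)

def expand(candidate):
    """Extend a concatenation by one more attribute."""
    return [candidate + (a,) for a in source if a not in candidate]

result = beam_search([(a,) for a in source], expand, score, k=2)
```

With these toy values the best candidate in `result` is the concatenation of `first_name` and `last_name`, and the search stops one iteration after finding it because the best score no longer improves.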

The Internals of Searchers – Join Paths
- Matches over join paths are found in two steps:
  - First step: find join paths between tables, Join(T1,T2).
  - Second step: the search process uses the join paths.
- Example: Company A (Price, Product name, Product ID, Discount) and Company B (Product Price, Name, Product ID), with Product Price = Price*(1-Discount).

Implemented searchers in iMAP
- iMAP contains the following searchers:
  - Text
  - Numeric
  - Category
  - Schema Mismatch
  - Unit Conversion
  - Date
  - Overlap versions of Text, Numeric, Category, Schema Mismatch, and Unit Conversion

Implemented searchers – Text Searcher example
- Text searcher:
  - Purpose: finds match candidates that are concatenations of text attributes.
  - Method:
    - Target attribute: Name.
    - Search space: attributes in the source schemas that have textual properties.
    - The searcher looks through the search space for attributes or concatenations of attributes.
- Example attributes: (Name, Id) and (Last name, First name, Id).

Implemented searchers – Numeric Searcher example
- Numeric searcher:
  - Purpose: finds the best matches for numeric attributes.
  - Issues:
    - Computing the similarity score of a complex match
      - Value distribution
    - Types of matches
      - +, -, *, /
      - 2 columns
  - Example: attributes dim1, dim2 and size, with dim1*dim2 = size.
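
A rough sketch of the idea: enumerate the four arithmetic operators over pairs of numeric columns and rank each candidate by how close its value distribution is to the target's. The distance used here (difference of means plus difference of standard deviations) is a crude stand-in chosen for brevity, not the measure iMAP uses, and the column values are invented.

```python
import itertools
import operator
import statistics

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}

def distribution_distance(xs, ys):
    """Crude stand-in for a distribution comparison: mean and spread differences."""
    return (abs(statistics.mean(xs) - statistics.mean(ys))
            + abs(statistics.pstdev(xs) - statistics.pstdev(ys)))

def numeric_candidates(source, target_values):
    """Score every 'a op b' over ordered pairs of source columns; best first."""
    scored = []
    for a, b in itertools.permutations(source, 2):
        for symbol, fn in OPS.items():
            try:
                values = [fn(x, y) for x, y in zip(source[a], source[b])]
            except ZeroDivisionError:
                continue  # skip candidates that divide by zero
            scored.append((distribution_distance(values, target_values),
                           f"{a}{symbol}{b}"))
    return sorted(scored)

# Hypothetical data: the target rows are in a different order than the source
# rows, so only the value distribution (not row-by-row equality) can be compared.
source = {"dim1": [2.0, 3.0, 4.0], "dim2": [5.0, 6.0, 7.0]}
target_size = [28.0, 10.0, 18.0]
ranked = numeric_candidates(source, target_size)
```

The top-ranked candidate is `dim1*dim2`, whose values have exactly the target's distribution. The commutative twin `dim2*dim1` ties with it, which hints at why a real searcher needs to prune equivalent expressions.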

Implemented searchers in iMAP – cont.
- Category searcher:
  - Purpose: finds matches between categorical attributes of the source and target schemas.
- Schema mismatch searcher:
  - Purpose: relates the data of one schema to the schema of the other. This situation occurs very often.
- Unit conversion searcher:
  - Purpose: finds matches between attributes that use different units.
- Date searcher:
  - Purpose: finds complex matches for date attributes.

Part 2: Similarity Estimator
- Receives the candidate matches from the Match Generator, each with the score that its searcher assigned.
- Problem: each searcher can give a different score.
- Solution: compute a final, more accurate score for each match by using additional types of information.
- iMAP uses evaluator modules:
  - Name-based evaluator: computes a score based on the similarity of names.
  - Naive Bayes evaluator.
- Why not perform this phase during the search phase? It would be very expensive.

Module example - Naive Bayes evaluator
- Consider the match agent-address = location.
- Building the model: data instances of the target attribute are positive examples; all other data is negative.
- A Naive Bayes classifier learns the model.
- The trained classifier is applied to the data of the source attribute.
- Each data instance receives a score.
- The average of all scores is returned as the result.

  Location (source)   Agent Address (target)
  T.A.                Haifa
  Eilat               T.A.
  Nahariya            Jerusalem
  Nesher              Eilat
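
The steps above can be sketched with a very small word-level Naive Bayes classifier. This is an illustrative toy under stated assumptions, not iMAP's evaluator: tokens are whitespace-separated words, smoothing is add-one, and the negative examples (agent names) are invented because the slide does not list them.

```python
import math
from collections import Counter

def train(pos_values, neg_values):
    """Count tokens per class; values of the target attribute are the positives."""
    counts = {"pos": Counter(), "neg": Counter()}
    for v in pos_values:
        counts["pos"].update(v.lower().split())
    for v in neg_values:
        counts["neg"].update(v.lower().split())
    vocab = set(counts["pos"]) | set(counts["neg"])
    priors = {"pos": len(pos_values), "neg": len(neg_values)}
    return counts, vocab, priors

def prob_positive(value, model):
    """P(positive | value) under Naive Bayes with add-one smoothing."""
    counts, vocab, priors = model
    total = priors["pos"] + priors["neg"]
    log_p = {}
    for label in ("pos", "neg"):
        n = sum(counts[label].values())
        lp = math.log(priors[label] / total)
        for tok in value.lower().split():
            lp += math.log((counts[label][tok] + 1) / (n + len(vocab) + 1))
        log_p[label] = lp
    m = max(log_p.values())
    e = {k: math.exp(v - m) for k, v in log_p.items()}
    return e["pos"] / (e["pos"] + e["neg"])

def naive_bayes_score(source_values, model):
    """Average positive probability over all instances of the source attribute."""
    return sum(prob_positive(v, model) for v in source_values) / len(source_values)

agent_address = ["Haifa", "T.A.", "Jerusalem", "Eilat"]   # target attribute (from the slide)
other_target_data = ["Ohad", "David", "Eyal", "Miri"]     # hypothetical negatives
model = train(agent_address, other_target_data)

location = ["T.A.", "Eilat", "Nahariya", "Nesher"]        # source attribute (from the slide)
location_score = naive_bayes_score(location, model)
names_score = naive_bayes_score(["Ohad", "Miri", "Dana", "Eyal"], model)
```

With this data `location_score` comes out higher than `names_score`, so the evaluator prefers agent-address = location over a match with a name-like column.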

Part 3: Match Selector
- Receives the scored match candidates from the Similarity Estimator.
- Problem: these matches may violate certain domain integrity constraints.
  - For example: mapping 2 source attributes to the same target attribute.
- Solution: a set of domain constraints.
  - Defined by domain experts or users.

Constraint Example
- Tables: (Price, Product name, Product ID), (Club members Price, Product ID) and (Product Price, Product Name, Product ID).
- The Match Selector receives a list of candidates, including candidate k: Product Price = Price + Club members price.
- Constraint: Price and Club members price are unrelated.
- The Match Selector deletes this match candidate.
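
A minimal sketch of this pruning step, assuming candidates are stored as expression strings with snake_case attribute names and that an "unrelated" constraint is a pair of attributes that must not appear in the same expression. The scores and the helper names are hypothetical.

```python
import re

def attributes_in(expression):
    """Extract attribute names (letters and underscores) from an expression."""
    return set(re.findall(r"[A-Za-z_]+", expression))

def violates(expression, unrelated_pairs):
    """True if the expression combines two attributes declared unrelated."""
    attrs = attributes_in(expression)
    return any(a in attrs and b in attrs for a, b in unrelated_pairs)

# Hypothetical scored candidates for the target attribute product_price.
candidates = [
    ("price + club_members_price", 0.9),
    ("price", 0.7),
]
unrelated = [("price", "club_members_price")]

selected = [c for c in candidates if not violates(c[0], unrelated)]
```

Even though the first candidate has the higher score, it is removed because it combines two attributes that the constraint declares unrelated.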

Exploiting Domain Knowledge
- iMAP uses 4 different types of knowledge:
  - Domain constraints
  - Past matches
  - Overlap data
  - External data
- iMAP uses this knowledge at all levels of the system, and as early as it can in match generation.

Types of knowledge
- Domain constraints, three cases:
  - Attributes from the source schema are unrelated (example: Name and ID are unrelated). Used by the searchers.
  - A constraint on a single attribute t (example: Account <). Used by the Similarity Estimator and the searchers.
  - Attributes from the target schema are unrelated (example: Account and ID are unrelated). Used by the Match Selector.
- Source: (Name, Id); (Account status, Account number, Id)
- Target: (Account status, Account, Last name, First name, Id)

Types of knowledge – cont.
- Past complex matches
  - The numeric searcher can reuse expression templates from past matches: Price = Price*(1-Discount) generates the template VARIABLE*(1-VARIABLE).
- External data: using external sources to learn about attributes and their data.
  - Given a target attribute and a useful feature of that attribute, iMAP learns about the value distribution.
  - Example: the number of cities in a state.
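
The template idea can be sketched with a regular expression that replaces every attribute name by the placeholder VARIABLE and a helper that fills the placeholders with attributes of a new schema. The function names and the new attribute names are assumptions for illustration; how iMAP stores and instantiates templates is not described on the slide.

```python
import itertools
import re

def to_template(expression):
    """Replace every attribute name with the placeholder VARIABLE."""
    return re.sub(r"[A-Za-z_][A-Za-z_0-9]*", "VARIABLE", expression)

def instantiate(template, attributes):
    """Fill the placeholders with every ordered choice of distinct attributes."""
    slots = template.count("VARIABLE")
    results = []
    for combo in itertools.permutations(attributes, slots):
        expr = template
        for attr in combo:
            expr = expr.replace("VARIABLE", attr, 1)  # fill the next open slot
        results.append(expr)
    return results

template = to_template("Price*(1-Discount)")
new_candidates = instantiate(template, ["price", "monthly_fee_rate"])
```

Each instantiated expression becomes a candidate for the numeric searcher, which narrows the search to shapes that were correct in the past.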

Types of knowledge – cont.
- Overlap data: provides information for the mapping process.
- iMAP contains searchers that can exploit overlap data.
- Overlap Text, Category and Schema Mismatch searchers:
  - S and T share a state listing.
  - Matches: city = state, country = state.
  - Re-evaluating results: city = state scores 0 and country = state scores 1.
- Overlap Numeric searcher: using the overlap data and an equation discovery system (LAGRAMGE), it finds the best arithmetic expression for t.
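
One simple way to picture the re-evaluation: on the tuples that both databases share, count how often the candidate's value equals the target's value. The sketch below assumes the shared tuples can be identified by a common key, which the slide does not state; the key name `listing` and all values are invented.

```python
def overlap_score(source_rows, target_rows, source_attr, target_attr, key):
    """Fraction of shared tuples on which the source and target values agree."""
    target_by_key = {row[key]: row for row in target_rows}
    shared = [row for row in source_rows if row[key] in target_by_key]
    if not shared:
        return None  # no overlap data, so nothing to re-evaluate
    hits = sum(row[source_attr] == target_by_key[row[key]][target_attr]
               for row in shared)
    return hits / len(shared)

# Hypothetical overlapping data, using the attribute names from the slide.
source_rows = [
    {"listing": 1, "city": "Urbana", "country": "IL"},
    {"listing": 2, "city": "Seattle", "country": "WA"},
    {"listing": 3, "city": "Austin", "country": "TX"},  # not present in the target
]
target_rows = [
    {"listing": 1, "state": "IL"},
    {"listing": 2, "state": "WA"},
]

city_score = overlap_score(source_rows, target_rows, "city", "state", "listing")
country_score = overlap_score(source_rows, target_rows, "country", "state", "listing")
```

On this data the candidate city = state scores 0 and country = state scores 1, mirroring the re-evaluated results on the slide.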

Generating Explanations
- One goal is to provide a design environment in which the user inspects the matches predicted by the system and modifies them manually, giving the system feedback.
- The system uses complex algorithms, so it needs to explain its matches to the user.
- Explanations are good for the user as well:
  - Matches can be corrected quickly.
  - The user can tell the system where it made a mistake.

Generating Explanations – so, what do you want to know about the matches?
- iMAP defines 3 main questions:
  - Explain an existing match: why is a certain match X present in the output of iMAP? Why did the match survive the whole process?
  - Explain an absent match: why is a certain match Y not present in the output of iMAP?
  - Explain match ranking: why is match X ranked higher than match Y?
- Each of these questions can be asked of each module of iMAP.
- A question can be reformulated recursively for the underlying components.

Generating Explanations - Example
- Suppose we have 2 real-estate schemas: (..., Month-posted, List-price) and (..., Monthly-fee-rate, Price).
- iMAP produces the ranked matches:
  (1) List-price = price*(1+monthly-fee-rate)
  (2) List-price = price
- iMAP explanation: both matches were generated by the numeric searcher, and the similarity estimator agreed with the ranking.

Generating Explanations - Example
- Suppose we have 2 real-estate schemas: (..., Month-posted, List-price) and (..., Monthly-fee-rate, Price).
- The current order:
  (1) List-price = price*(1+monthly-fee-rate)
  (2) List-price = price
- The match selector has 2 constraints: (1) month-posted = monthly-fee-rate, (2) month-posted and price don't share common attributes.
- The List-price = price match is selected by the match selector.

Generating Explanations - Example
- Suppose we have 2 real-estate schemas: (..., Month-posted, List-price) and (..., Monthly-fee-rate, Price).
- The current order:
  (1) List-price = price
  (2) List-price = price*(1+monthly-fee-rate)
- iMAP explains that the source of month-posted = monthly-fee-rate is the date searcher.
- The user corrects iMAP: monthly-fee-rate is not a date.

Generating Explanations - Example
- Suppose we have 2 real-estate schemas: (..., Month-posted, List-price) and (..., Monthly-fee-rate, Price).
- List-price = price*(1+monthly-fee-rate) is again the chosen match.
- The final order:
  (1) List-price = price*(1+monthly-fee-rate)
  (2) List-price = price

Example cont. – generated dependency graph
- The dependency graph is small:
  - Searchers produce only the k best matches.
  - iMAP goes through only three stages.

What do you want to know about the matches?
- Why is a certain match X present in the output of iMAP?
  - iMAP returns the part of the graph that describes the match.
- Why is match X ranked higher than match Y?
  - iMAP returns the part of the graph that compares the 2 matches.
- Why is a certain match Y not present in the output of iMAP?
  - If the match was eliminated during the process, the part responsible for the elimination explains why.
  - Otherwise, iMAP asks the searchers to check whether they can generate the match and to explain why it was not generated.
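
Answering "why is match X present?" amounts to returning the part of the dependency graph reachable from X. The sketch below assumes the graph is stored as a dictionary from each node to the nodes it depends on; the node names follow the real-estate example, but the exact edges are a guess made for illustration.

```python
def explain(graph, node):
    """Return the nodes that a match depends on, in depth-first order."""
    seen, order, stack = set(), [], [node]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        order.append(current)
        # Push dependencies in reverse so they are visited in listed order.
        stack.extend(reversed(graph.get(current, [])))
    return order

# Hypothetical dependency graph for the real-estate example.
graph = {
    "List-price=price": ["numeric searcher", "match selector"],
    "match selector": ["month-posted=monthly-fee-rate"],
    "month-posted=monthly-fee-rate": ["date searcher"],
}

explanation = explain(graph, "List-price=price")
```

The returned list traces the match back to the date searcher, which is the component the user corrects in the example.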

Evaluating iMAP on real-world domains
- iMAP was evaluated on 4 real-world domains:
  - For the Cricket domain, 2 independently developed databases were used.
  - For the other 3, one real-world source database was used together with a target schema created by volunteers.
- The experiments cover both databases with overlapping domains and databases with disjoint domains.

Evaluating iMAP on real-world domains – cont.
- Data processing: removing data such as "unknown" and adding the most obvious constraints.
- Experiments: there are actually 8 experimental domains.
  - 2 for each domain: an overlap domain and a disjoint domain.
- Performance measures:
  - Top-1 matching accuracy
  - Top-3 matching accuracy
  - Complex match
  - Partial complex match
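
Top-1 and top-3 accuracy can be computed as the fraction of target attributes whose correct match appears among the first k ranked candidates. The sketch assumes one correct match per target attribute; the rankings and gold matches below are invented, not taken from the experiments.

```python
def top_k_accuracy(ranked_by_target, gold, k):
    """Fraction of target attributes whose gold match is in the top k candidates."""
    hits = sum(gold[target] in ranked[:k]
               for target, ranked in ranked_by_target.items())
    return hits / len(ranked_by_target)

# Hypothetical ranked output and gold standard.
ranked_by_target = {
    "list-price": ["price", "price*(1+monthly-fee-rate)"],
    "name": ["concat(first-name,last-name)", "last-name"],
}
gold = {
    "list-price": "price*(1+monthly-fee-rate)",
    "name": "concat(first-name,last-name)",
}

top1 = top_k_accuracy(ranked_by_target, gold, 1)
top3 = top_k_accuracy(ranked_by_target, gold, 3)
```

Here `top1` is 0.5 and `top3` is 1.0: the correct match for list-price is ranked second, so it counts only for top-3.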

Results (1)
- Overall and 1-1 matching accuracy:
  - (a) Exploiting domain constraints and overlap data improves accuracy.
  - (b) Disjoint domains achieve lower accuracy than overlap domains.
- Not shown in the figure, but according to the article the top-3 accuracy is even higher, and iMAP also achieves top-1 and top-3 accuracy of 77%-100% for 1-1 matching.

Results (2) Complex matching accuracy – Top 1 and Top 3:

Results (2) – Cont.
- Complex matching accuracy, top 1:
  - Low results for default iMAP (for example: inventory = 9%) in both overlap domains and disjoint domains.
  - (a) Exploiting domain constraints and overlap data improves accuracy.
  - (b) In disjoint domains iMAP achieves lower accuracy than in overlap domains.
  - Having no overlap data decreases the accuracy of the Numeric Searcher and the Text Searcher.

Results (2) – low results for complex matches
- Smaller components (example: apt-number)
  - Suggested solution: adding format learning techniques.
- Small noise components (example: agent-id)
  - Suggested solution: more aggressive match cleaning and more constraints.
- Disjoint databases are difficult for the numeric searcher
  - Suggested solution: using past numeric matches.
- Top-k: many correct matches are not ranked first
  - Increasing k to 10 will increase accuracy.

Results (2) – Cont.
- Complex matching accuracy, top 3:
  - Low results for default iMAP (for example: inventory = 9%) in both overlap domains and disjoint domains, for the same reasons as in top 1.
  - (c) Accuracy improves compared to (a) when overlap data and constraints are used.
  - This is an outcome of correct complex matches appearing among the top 3 matches.

Results (3) Partial Complex matching accuracy – Top 1 and Top 3:

Results (3) – cont.
- Partial complex matching accuracy, top 1 and top 3:
  - Accuracy is measured only on finding the right attributes.
  - For example: a wrong numeric template but the right attributes.
  - Accuracy is much higher than for full complex matching.
  - Partial complex matches can be very useful when the user wants to fix wrong matches.
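
The difference between a full and a partial complex match can be sketched by comparing only the sets of attributes that two expressions use. The attribute extraction by regular expression and the snake_case names are assumptions for illustration.

```python
import re

def attributes_in(expression):
    """Set of attribute names (letters and underscores) used in an expression."""
    return set(re.findall(r"[A-Za-z_]+", expression))

def is_exact_match(predicted, gold):
    """Full complex match: same expression, ignoring spaces."""
    return predicted.replace(" ", "") == gold.replace(" ", "")

def is_partial_match(predicted, gold):
    """Partial complex match: same attributes, whatever the formula."""
    return attributes_in(predicted) == attributes_in(gold)

gold = "price*(1+monthly_fee_rate)"
predicted = "price+monthly_fee_rate"   # wrong numeric template, right attributes

exact = is_exact_match(predicted, gold)
partial = is_partial_match(predicted, gold)
```

The prediction fails as a full match but counts as a partial one, which still tells the user which attributes to combine.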

Performance & Efficiency
- Performance:
  - iMAP's accuracy is stable after 100 data tuples.
  - Running it on fewer examples first can reduce iMAP's running time.
[Figure: accuracy as a function of the number of data tuples.]

Performance & Efficiency – Cont.
- Efficiency:
  - The unoptimized iMAP version ran for 5-20 minutes on the experimental domains.
  - The article suggests several techniques to improve this time, for example breaking the schemas into independent chunks.

Explaining match predictions
- Example of explaining a match prediction:
  - Searcher level: Concat(first-name, last-name) was ranked higher than last-name.
  - Similarity Estimator: the name-based evaluator was wrong; Naive Bayes was right.
  - Match Selector: did not influence the result.
- Conclusion: the name-based evaluator has more influence (last line).
- The user can use this information to reduce the influence of the name-based evaluator.

Related work
- L. Xu and D. Embley. Using domain ontologies to discover direct and indirect matches for schema elements:
  - Maps the schema to a domain ontology and searches in this domain.
  - Can be added to iMAP as an additional searcher.
- Clio system:
  - A sophisticated set of user-interface techniques to improve matches.

Conclusions
- Until now, most of the work in this field was about 1-1 matching.
- This article focuses on complex matching.
- The keys to iMAP are:
  - Searchers
  - Domain knowledge
  - Giving the user the ability to affect the matches

Any Questions?

Thank you!

Bibliography
- Robin Dhamankar, Yoonkyong Lee, AnHai Doan, Alon Halevy, Pedro Domingos. iMAP: Discovering Complex Semantic Matches between Database Schemas.