
Slide 1: Advanced databases – Defining and combining heterogeneous databases: Schema matching
Bettina Berendt, Katholieke Universiteit Leuven, Department of Computer Science
Advanced databases, winter term 2007/08. Last update: 25 October 2007

Slide 2: Agenda
- The match problem & what info to use for matching
- (Semi-)automated matching: Example CUPID
- (Semi-)automated matching: Example iMAP
- Evaluating matching
- Involving the user: Explanations; mass collaboration

Slide 3: The match problem
Given two schemas S1 and S2, find a mapping between elements of S1 and S2 that correspond semantically to each other.

Slide 4: Motivation: application areas
- Schema integration in multi-database systems
- Data integration systems on the Web
- Translating data (e.g., for data warehousing)
- E-commerce message translation
- P2P data management
- Model management (tools for easily manipulating models of data)

Slide 5: The match operator
- Match operator: f(S1, S2) = mapping between S1 and S2, for schemas S1, S2
- Mapping: a set of mapping elements
- Mapping element: elements of S1, elements of S2, and a mapping expression
- Mapping expression: different functions and relationships

Slide 6: Matching expressions: examples
- Scalar relations (=, ≥, ...)
  - S.HOUSES.location = T.LISTINGS.area
- Functions
  - T.LISTINGS.list-price = S.HOUSES.price * (1 + S.AGENTS.fee-rate)
  - T.LISTINGS.agent-address = concat(S.AGENTS.city, S.AGENTS.state)
- ER-style relationships (is-a, part-of, ...)
- Set-oriented relationships (overlaps, contains, ...)
- Any other terms that are defined in the expression language used

Slide 7: Matching and mapping
1. Find the schema match („declarative“)
2. Create a procedure (e.g., a query expression) to enable automated data translation or exchange (mapping, „procedural“)
Example result of step 2 – to create T.LISTINGS from S (simplified notation):
area = SELECT location FROM HOUSES
agent-name = SELECT name FROM AGENTS
agent-address = SELECT concat(city, state) FROM AGENTS
list-price = SELECT price * (1 + fee-rate) FROM HOUSES, AGENTS WHERE agent-id = id

Slide 8: Based on what information can the mappings be found? (based on the homework)

Slide 9: Based on what information can the mappings be found? Rahm & Bernstein's classification of schema matching approaches

Slide 10: Challenges
- Semantics of the involved elements often need to be inferred
- Often need to base (heuristic) solutions on cues in schema and data, which are unreliable
  - e.g., homonyms (area), synonyms (area, location)
- Schema and data clues are often incomplete
  - e.g., date: date of what?
- Global nature of matching: to choose one matching possibility, one must typically exclude all others as worse
- Matching is often subjective and/or context-dependent
  - e.g., does house-style match house-description or not?
- Extremely laborious and error-prone process
  - e.g., Li & Clifton 2000: project at GTE telecommunications: 40 databases, 27K elements, no access to the original developers of the DBs; estimated time for just finding and documenting the matches: 12 person-years

Slide 11: Semi-automated schema matching (1): Rule-based solutions
- Hand-crafted rules
- Exploit schema information
+ relatively inexpensive
+ do not require training
+ fast (operate only on the schema, not the data)
+ can work very well in certain types of applications & domains
+ rules provide a quick & concise method of capturing user knowledge about the domain
– cannot exploit data instances effectively
– cannot exploit previous matching efforts (other than by re-use)

Slide 12: Semi-automated schema matching (2): Learning-based solutions
- Rules/mappings learned from attribute specifications and statistics of data content (Rahm & Bernstein: „instance-level matching“)
- Exploit schema information and data
- Some approaches use external evidence:
  - Past matches
  - A corpus of schemas and matches („matchings in real-estate applications will tend to be alike“)
  - A corpus of users (more details later in this slide set)
+ can exploit data instances effectively
+ can exploit previous matching efforts
– relatively expensive
– require training
– slower (operate on data)
– results may be opaque (e.g., neural network output) → explanation components! (more details later)

Slide 13: Agenda
- The match problem & what info to use for matching
- (Semi-)automated matching: Example CUPID
- (Semi-)automated matching: Example iMAP
- Evaluating matching
- Involving the user: Explanations; mass collaboration

Slide 14: Overview (1)
- Rule-based approach
- Schema types: relational, XML
- Metadata representation: extended ER
- Match granularity: element, structure
- Match cardinality: 1:1, n:1

Slide 15: Overview (2)
- Schema-level match:
  - Name-based: name equality, synonyms, hypernyms, homonyms, abbreviations
  - Constraint-based: data type and domain compatibility, referential constraints
  - Structure matching: matching subtrees, weighted by leaves
- Re-use / auxiliary information used: thesauri, glossaries
- Combination of matchers: hybrid
- Manual work / user input: the user can adjust threshold weights

Slide 16: Basic representation: Schema trees
Computation overview:
1. Compute similarity coefficients between elements of these graphs
2. Deduce a mapping from these coefficients

Slide 17: Computing similarity coefficients (1): Linguistic matching
Operates on schema element names (= nodes in the schema tree)
1. Normalization
  - Tokenization (parse names into tokens based on punctuation, case, etc.), e.g., Product_ID → {Product, ID}
  - Expansion (of abbreviations and acronyms)
  - Elimination (of prepositions, articles, etc.)
2. Categorization / clustering
  - Based on data types, schema hierarchy, and the linguistic content of names, e.g., „real-valued elements“, „money-related elements“
3. Comparison (within the categories)
  - Compute linguistic similarity coefficients (lsim) based on a thesaurus (synonymy, hypernymy)
Output: a table of lsim coefficients (in [0,1]) between schema elements
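The normalization and comparison steps above can be sketched as follows. This is an illustrative toy version, not Cupid's implementation; the synonym table is a made-up stand-in for a real thesaurus such as WordNet, and the Jaccard-style combination is an assumption.

```python
import re

# Tiny hand-made synonym table (hypothetical stand-in for a thesaurus).
SYNONYMS = {frozenset({"area", "location"}), frozenset({"price", "cost"})}

def tokenize(name):
    """Normalization step: split a schema element name on punctuation and CamelCase."""
    tokens = []
    for part in re.split(r"[_\-\s]+", name):
        tokens += re.findall(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+", part)
    return [t.lower() for t in tokens if t]

def lsim(name1, name2):
    """Comparison step: similarity in [0,1] as token overlap, counting synonyms as matches."""
    t1, t2 = set(tokenize(name1)), set(tokenize(name2))
    if not t1 or not t2:
        return 0.0
    matched = len(t1 & t2)
    for a in t1 - t2:
        if any(frozenset({a, b}) in SYNONYMS for b in t2 - t1):
            matched += 1
    return matched / len(t1 | t2)
```

For example, `tokenize("Product_ID")` yields `["product", "id"]`, and `lsim("area", "location")` is nonzero only because of the synonym entry.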

Slide 18: How to identify synonyms and homonyms: Example WordNet

Slide 19: How to identify hypernyms: Example WordNet

Slide 20: Computing similarity coefficients (2): Structure matching
- Intuitions:
  - Leaves are similar if they are linguistically & data-type similar, and if they have similar neighbourhoods
  - Non-leaf elements are similar if they are linguistically similar & have similar subtrees (where the leaf sets are most important)
- Procedure:
  1. Initialize the structural similarity of leaves based on data types: identical data types: compat. = 0.5; otherwise in [0, 0.5]
  2. Process the tree in post-order
  3. Stronglink(leaf1, leaf2) iff their weighted similarity ≥ threshold
  4. ...
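A minimal sketch of the leaf-level part of this procedure. The weight, threshold, data-type compatibility values, and the way subtree similarity is normalized are all illustrative assumptions, not Cupid's actual constants or formula:

```python
W_STRUCT = 0.5   # weight of structural vs. linguistic similarity (assumed)
TH_ACCEPT = 0.6  # strong-link threshold (assumed)

class Node:
    def __init__(self, name, dtype=None, children=()):
        self.name, self.dtype, self.children = name, dtype, list(children)
    def leaves(self):
        if not self.children:
            return [self]
        return [leaf for c in self.children for leaf in c.leaves()]

def datatype_compat(t1, t2):
    # Identical data types get the maximum 0.5; anything else a lower value in [0, 0.5].
    return 0.5 if t1 == t2 else 0.2

def wsim(lsim_val, ssim_val):
    """Weighted similarity combining linguistic and structural components."""
    return (1 - W_STRUCT) * lsim_val + W_STRUCT * ssim_val

def struct_sim(n1, n2, lsim_table):
    """Similarity of two subtrees: fraction of strongly linked leaf pairs."""
    leaves1, leaves2 = n1.leaves(), n2.leaves()
    links = 0
    for a in leaves1:
        for b in leaves2:
            leaf_ssim = datatype_compat(a.dtype, b.dtype)
            if wsim(lsim_table.get((a.name, b.name), 0.0), leaf_ssim) >= TH_ACCEPT:
                links += 1
    return links / max(len(leaves1), len(leaves2))
```

With leaves `location`/`price` vs. `area`/`list-price` and high linguistic similarities in `lsim_table`, both leaf pairs pass the threshold and the subtree similarity is 1.0.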

Slide 21: The structure matching algorithm
- Output: a 1:n mapping for leaves
- To generate non-leaf mappings: a second post-order traversal

Slide 22: Matching shared types
- Solution: expand the schema into a schema tree, then proceed as before
- Can help to generate context-dependent mappings
- Fails if a cycle of containment and IsDerivedFrom relationships is present (e.g., recursive type definitions)

Slide 23: Agenda
- The match problem & what info to use for matching
- (Semi-)automated matching: Example CUPID
- (Semi-)automated matching: Example iMAP
- Evaluating matching
- Involving the user: Explanations; mass collaboration

Slide 24: Main ideas
- A learning-based approach
- Main goal: discover complex matches, in particular functions such as
  - T.LISTINGS.list-price = S.HOUSES.price * (1 + S.AGENTS.fee-rate)
  - T.LISTINGS.agent-address = concat(S.AGENTS.city, S.AGENTS.state)
- Works on relational schemas
- Basic idea: reformulate schema matching as search

Slide 25: Architecture
Specialized searchers each focus on discovering certain types of complex matches, which makes the search more efficient.

Slide 26: Overview of implemented searchers

Slide 27: Example: The textual searcher
For target attribute T.LISTINGS.agent-address:
- Examine attributes and concatenations of attributes from S
- Restrict the examined set by analyzing textual properties
  - Data type information in the schema, heuristics (proportion of non-numeric characters, etc.)
  - Evaluate match candidates based on data correspondences, prune inferior candidates

Slide 28: Example: The numerical searcher
For target attribute T.LISTINGS.list-price:
- Examine attributes and arithmetic expressions over them from S
- Restrict the examined set by analyzing numeric properties
  - Data type information in the schema, heuristics
  - Evaluate match candidates based on data correspondences, prune inferior candidates

Slide 29: Search strategy (1): Example textual searcher
1. Learn a (Naive Bayes) classifier text → class („agent-address“ or „other“) from the data instances in T.LISTINGS.agent-address
2. Apply this classifier to each match candidate (e.g., location, concat(city, state))
3. Score of the candidate = average over instance probabilities
4. For expansion: beam search – keep only the k top-scoring candidates
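The scoring idea can be sketched as follows. The tiny bag-of-words Naive Bayes model and the sample data are hypothetical stand-ins for iMAP's actual learner; only the overall scheme (classify instances, average the probabilities, keep the top k) follows the slide:

```python
import math
from collections import Counter

def train_nb(target_vals, other_vals):
    """Word counts per class ('t' = target, 'o' = other); returns a smoothed log-prob lookup."""
    counts = {"t": Counter(w for v in target_vals for w in v.lower().split()),
              "o": Counter(w for v in other_vals for w in v.lower().split())}
    vocab = set(counts["t"]) | set(counts["o"])
    totals = {c: sum(counts[c].values()) for c in counts}
    def logp(word, cls):
        # Add-one smoothing so unseen words do not zero out the product.
        return math.log((counts[cls][word] + 1) / (totals[cls] + len(vocab)))
    return logp

def score_candidate(vals, logp):
    """Score = average over per-instance probabilities of the 'target' class."""
    total = 0.0
    for v in vals:
        lt = sum(logp(w, "t") for w in v.lower().split())
        lo = sum(logp(w, "o") for w in v.lower().split())
        total += math.exp(lt) / (math.exp(lt) + math.exp(lo))
    return total / len(vals)

def beam_search(candidates, logp, k=2):
    """Keep only the k top-scoring candidates for further expansion."""
    ranked = sorted(candidates, key=lambda name: score_candidate(candidates[name], logp),
                    reverse=True)
    return ranked[:k]
```

On made-up data, a `concat(city, state)` candidate whose values resemble the target's address instances outranks a `beds` candidate.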

Slide 30: Search strategy (2): Example numeric searcher
1. Get the value distributions of the target attribute and of each candidate
2. Compare the value distributions (Kullback-Leibler divergence measure)
3. Score of the candidate = Kullback-Leibler measure
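A sketch of this comparison, assuming equal-width histograms with add-one smoothing (iMAP's actual binning may differ). A divergence of 0 means the distributions look identical; larger values mean a worse candidate:

```python
import math
from collections import Counter

def kl_divergence(target_vals, cand_vals, bins=5):
    """D_KL(target || candidate) over shared equal-width histogram bins."""
    lo, hi = min(target_vals + cand_vals), max(target_vals + cand_vals)
    width = (hi - lo) / bins or 1.0  # guard against all-equal values
    def hist(vals):
        c = Counter(min(int((v - lo) / width), bins - 1) for v in vals)
        # Add-one smoothing keeps every bin probability positive.
        return [(c[i] + 1) / (len(vals) + bins) for i in range(bins)]
    p, q = hist(target_vals), hist(cand_vals)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

Identical value sets yield divergence 0, while clearly shifted ones (e.g., prices vs. room counts) yield a positive score that the searcher can use for pruning.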

Slide 31: Evaluation strategies of implemented searchers

Slide 32: Pruning by domain constraints
- Constraints over multiple attributes of S: „attributes name and beds are unrelated“ → do not generate match candidates containing both attributes
- Properties of a single attribute of T: „the average value of num-rooms does not exceed 10“ → use in the evaluation of candidates
- Properties of multiple attributes of T: „lot-area and num-baths are unrelated“ → „clean up“ at the match selector level
  - Example:
    - T.num_baths = S.baths
    - ? T.lot-area = (S.lot-sq-feet/43560) + 1.3e-15 * S.baths
    - Based on the domain constraint, drop the term involving S.baths
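The first kind of constraint can be enforced with a simple filter during candidate generation; the attribute names and the constraint list below are illustrative, not part of iMAP:

```python
def violates_unrelated(candidate_attrs, unrelated_pairs):
    """True if the candidate combines two attributes a domain constraint declares unrelated."""
    attrs = set(candidate_attrs)
    return any(a in attrs and b in attrs for a, b in unrelated_pairs)

def prune_candidates(candidates, unrelated_pairs):
    """Keep only candidates (attribute lists) that satisfy all unrelatedness constraints."""
    return [c for c in candidates if not violates_unrelated(c, unrelated_pairs)]
```

For example, with the constraint that `name` and `beds` are unrelated, a candidate combining both is never generated, while `concat(city, state)` survives.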

Slide 33: Pruning by using knowledge from overlap data
- Applicable when S and T share some of the same data
- Consider the fraction of data for which a mapping is correct
  - e.g., house locations: S.HOUSES.location overlaps more with T.LISTINGS.area than with T.LISTINGS.agent-address
  - → Discard the candidate T.LISTINGS.agent-address = S.HOUSES.location; keep only T.LISTINGS.agent-address = concat(S.AGENTS.city, S.AGENTS.state)
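A sketch of the overlap test, under the simplifying assumption that rows of S and T can be joined on a shared key (represented here as dicts from key to value):

```python
def overlap_score(candidate_col, target_col):
    """Fraction of shared keys on which the candidate mapping reproduces the target value."""
    shared = set(candidate_col) & set(target_col)
    if not shared:
        return 0.0
    return sum(candidate_col[k] == target_col[k] for k in shared) / len(shared)
```

A candidate like `S.HOUSES.location` would get a low score against `T.LISTINGS.agent-address` but a high one against `T.LISTINGS.area`, so the former match is discarded.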

Slide 34: Agenda
- The match problem & what info to use for matching
- (Semi-)automated matching: Example CUPID
- (Semi-)automated matching: Example iMAP
- Evaluating matching
- Involving the user: Explanations; mass collaboration

Slide 35: How to compare?
- Input: What kind of input data? (What languages? Only toy examples? What external information?)
- Output: A mapping between attributes or tables, nodes or paths? How much information does the system report?
- Quality measures: Metrics for accuracy and completeness?
- Effort: How much manual effort is saved, and how is this quantified?
  - Pre-match effort (training of learners, dictionary preparation, ...)
  - Post-match effort (correction and improvement of the match output)
  - How are these measured?

Slide 36: Match quality measures
- Need a „gold standard“ (the „true“ match)
- Measures from information retrieval: precision, recall, and the F-measure F_α (standard choice: F1, i.e., α = 0.5)
- This quantifies the post-match effort
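These measures can be computed directly from a proposed match set and the gold standard. The F-measure here uses the weighted harmonic-mean form, where α = 0.5 gives the usual F1:

```python
def match_quality(found, gold, alpha=0.5):
    """Precision, recall, and F_alpha of proposed matches vs. a gold standard."""
    found, gold = set(found), set(gold)
    tp = len(found & gold)                       # correctly proposed matches
    precision = tp / len(found) if found else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision == 0.0 or recall == 0.0:
        return precision, recall, 0.0
    f = 1 / (alpha / precision + (1 - alpha) / recall)
    return precision, recall, f
```

For example, proposing {(a,x), (b,y)} against a gold standard {(a,x), (c,z)} gives precision 0.5, recall 0.5, and F1 = 0.5.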

Slide 37: Benchmarking
- Do, Melnik, and Rahm (2003) found that evaluation studies were not comparable → need more standardized conditions (benchmarks)
- Now a tradition of competitions in ontology matching (more later in the course): test cases and contests at

Slide 38: Agenda
- The match problem & what info to use for matching
- (Semi-)automated matching: Example CUPID
- (Semi-)automated matching: Example iMAP
- Evaluating matching
- Involving the user: Explanations; mass collaboration

Slide 39: Example in iMAP
The user sees ranked candidates:
1. list-price = price
2. list-price = price * (1 + fee-rate)
Explanation:
a) Both were generated by the numeric searcher, and 2 was ranked higher than 1
b) But:
c) Match month-posted = fee-rate
d) Domain constraint: matches for month-posted and price do not share attributes
e) → list-price cannot be matched to anything involving fee-rate
f) Why c)?
g) Data instances of fee-rate were classified as of type date
→ The user corrects this wrong step, and the rest is repaired accordingly

Slide 40: Background knowledge structure for explanation: the dependency graph

Slide 41: MOBS: Using mass collaboration to automate data integration
1. Initialization: a correct but partial match (e.g., title = a1, title = b2, etc.)
2. Soliciting user feedback: user query → the user must answer a simple question → the user gets the answer to the initial query
3. Computing user weights (e.g., trustworthiness = fraction of correct answers to known mappings)
4. Combining user feedback (e.g., majority count)
- Important: „instant gratification“ (e.g., include the new field in the results page after a user has given helpful input)
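Steps 3 and 4 can be sketched as follows. The neutral default weight of 0.5 for users with no graded answers is an assumption for illustration, not necessarily MOBS's choice:

```python
def user_weight(answers, known_truth):
    """Trustworthiness = fraction of correct answers on mappings whose truth is known."""
    graded = [(q, a) for q, a in answers.items() if q in known_truth]
    if not graded:
        return 0.5  # neutral prior for new users (assumed default)
    return sum(known_truth[q] == a for q, a in graded) / len(graded)

def combine_feedback(votes, weights):
    """Weighted vote over users' answers for one candidate mapping."""
    tally = {}
    for user, answer in votes.items():
        tally[answer] = tally.get(answer, 0.0) + weights.get(user, 0.5)
    return max(tally, key=tally.get)
```

With these definitions, one highly trusted user can outvote several untrusted ones, so a single careless majority does not corrupt the evolving match.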

Slide 42: Next lecture
- The match problem & what info to use for matching
- (Semi-)automated matching: Example CUPID
- (Semi-)automated matching: Example iMAP
- Evaluating matching
- Involving the user: Explanations; mass collaboration
- Semantic Web

Slide 43: References / background reading; acknowledgements
- Rahm, E. & Bernstein, P.A. (2001). A survey of approaches to automatic schema matching. The VLDB Journal, 10.
- Doan, A. & Halevy, A.Y. (2004). Semantic integration research in the database community: A brief survey. AI Magazine.
- Madhavan, J., Bernstein, P.A., & Rahm, E. (2001). Generic schema matching with Cupid. In Proc. of the 27th VLDB Conference.
- Dhamankar, R., Lee, Y., Doan, A., Halevy, A., & Domingos, P. (2004). iMAP: Discovering complex semantic matches between database schemas. In Proc. of SIGMOD.
- Do, H.-H., Melnik, S., & Rahm, E. (2003). Comparison of schema matching evaluations. In Web, Web-Services, and Database Systems: NODe 2002, Web- and Database-Related Workshops, Erfurt, Germany, October 7-10, 2002, Revised Papers. Springer.
- McCann, R., Doan, A., Varadarajan, V., & Kramnik, A. (2003). Building data integration systems via mass collaboration. In Proc. International Workshop on the Web and Databases (WebDB).
Thanks to Wikipedia for the Kullback-Leibler formulae.