1 Berendt: Advanced databases, winter term 2007/08, 1 Advanced databases – Defining and combining heterogeneous databases: Schema matching Bettina Berendt Katholieke Universiteit Leuven, Department of Computer Science Last update: 25 October 2007
2 Berendt: Advanced databases, winter term 2007/08, 2 Agenda The match problem & what info to use for matching (Semi-)automated matching: Example CUPID (Semi-)automated matching: Example iMAP Evaluating matching Involving the user: Explanations; mass collaboration
3 Berendt: Advanced databases, winter term 2007/08, 3 The match problem Given two schemas S1 and S2, find a mapping between elements of S1 and S2 that correspond semantically to each other
4 Berendt: Advanced databases, winter term 2007/08, 4 Motivation: application areas n Schema integration in multi-database systems n Data integration systems on the Web n Translating data (e.g., for data warehousing) n E-commerce message translation n P2P data management n Model management (tools for easily manipulating models of data)
5 Berendt: Advanced databases, winter term 2007/08, 5 The match operator n Match operator: f(S1,S2) = mapping between S1 and S2 l for schemas S1, S2 n Mapping l a set of mapping elements n Mapping elements l elements of S1, elements of S2, mapping expression n Mapping expression l different functions and relationships
6 Berendt: Advanced databases, winter term 2007/08, 6 Matching expressions: examples n Scalar relations (=, ≥,...) l S.HOUSES.location = T.LISTINGS.area n Functions l T.LISTINGS.list-price = S.HOUSES.price * (1+S.AGENTS.fee-rate) l T.LISTINGS.agent-address = concat(S.AGENTS.city,S.AGENTS.state) n ER-style relationships (is-a, part-of,...) n Set-oriented relationships (overlaps, contains,...) n Any other terms that are defined in the expression language used
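Purely as an illustration of these notions (names and representation are assumptions, not taken from any of the systems discussed later), a mapping could be represented in Python as a set of mapping elements, each pairing elements of S1 and S2 with a mapping expression:

from dataclasses import dataclass, field

@dataclass
class MappingElement:
    s1_elements: tuple   # elements of S1, e.g. ("AGENTS.city", "AGENTS.state")
    s2_elements: tuple   # elements of S2, e.g. ("LISTINGS.agent-address",)
    expression: str      # mapping expression in some expression language

@dataclass
class Mapping:
    elements: list = field(default_factory=list)

mapping = Mapping([
    MappingElement(("HOUSES.location",), ("LISTINGS.area",), "="),
    MappingElement(("AGENTS.city", "AGENTS.state"), ("LISTINGS.agent-address",),
                   "concat(city, state)"),
])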
7 Berendt: Advanced databases, winter term 2007/08, 7 Matching and mapping 1. Find the schema match („declarative“) 2. Create a procedure (e.g., a query expression) to enable automated data translation or exchange (mapping, „procedural“) Example of result of step 2: n To create T.LISTINGS from S (simplified notation): area = SELECT location FROM HOUSES agent-name = SELECT name FROM AGENTS agent-address = SELECT concat(city,state) FROM AGENTS list-price = SELECT price * (1+fee-rate) FROM HOUSES, AGENTS WHERE agent-id = id
8 Berendt: Advanced databases, winter term 2007/08, 8 Based on what information can the mappings be found? (based on the homework)
9 Berendt: Advanced databases, winter term 2007/08, 9 Based on what information can the mappings be found? Rahm & Bernstein‘s classification of schema matching approaches
10 Berendt: Advanced databases, winter term 2007/08, 10 Challenges n Semantics of the involved elements often need to be inferred l (heuristic) solutions must often be based on cues in schema and data, which are unreliable, e.g., homonyms (area), synonyms (area, location) n Schema and data clues are often incomplete l e.g., date: date of what? n Global nature of matching: to choose one matching possibility, all others must typically be excluded as worse n Matching is often subjective and/or context-dependent l e.g., does house-style match house-description or not? n Extremely laborious and error-prone process l e.g., Li & Clifton 2000: project at GTE telecommunications: 40 databases, 27K elements, no access to the original developers of the DBs; estimated time for just finding and documenting the matches: 12 person years
11 Berendt: Advanced databases, winter term 2007/08, 11 Semi-automated schema matching (1) Rule-based solutions n Hand-crafted rules n Exploit schema information + relatively inexpensive + do not require training + fast (operate only on schema, not data) + can work very well in certain types of applications & domains + rules can provide a quick & concise method of capturing user knowledge about the domain – cannot exploit data instances effectively – cannot exploit previous matching efforts (other than by re-use)
12 Berendt: Advanced databases, winter term 2007/08, 12 Semi-automated schema matching (2) Learning-based solutions n Rules/mappings learned from attribute specifications and statistics of data content (Rahm & Bernstein: „instance-level matching“) n Exploit schema information and data n Some approaches use external evidence: l Past matches l Corpus of schemas and matches („matchings in real-estate applications will tend to be alike“) l Corpus of users (more details later in this slide set) + can exploit data instances effectively + can exploit previous matching efforts – relatively expensive – require training – slower (operate on data) – results may be opaque (e.g., neural network output) → explanation components! (more details later)
13 Berendt: Advanced databases, winter term 2007/08, 13 Agenda The match problem & what info to use for matching (Semi-)automated matching: Example CUPID (Semi-)automated matching: Example iMAP Evaluating matching Involving the user: Explanations; mass collaboration
14 Berendt: Advanced databases, winter term 2007/08, 14 Overview (1) n Rule-based approach n Schema types: l Relational, XML n Metadata representation: l Extended ER n Match granularity: l Element, structure n Match cardinality: l 1:1, n:1
15 Berendt: Advanced databases, winter term 2007/08, 15 Overview (2) n Schema-level match: l Name-based: name equality, synonyms, hypernyms, homonyms, abbreviations l Constraint-based: data type and domain compatibility, referential constraints l Structure matching: matching subtrees, weighted by leaves n Re-use, auxiliary information used: l Thesauri, glossaries n Combination of matchers: l Hybrid n Manual work / user input: l User can adjust threshold weights
16 Berendt: Advanced databases, winter term 2007/08, 16 Basic representation: Schema trees Computation overview: 1. Compute similarity coefficients between elements of these graphs 2. Deduce a mapping from these coefficients
17 Berendt: Advanced databases, winter term 2007/08, 17 Computing similarity coefficients (1): Linguistic matching n Operates on schema element names (= nodes in schema tree) 1. Normalization n Tokenization (parse names into tokens based on punctuation, case, etc.) n e.g., Product_ID → {Product, ID} n Expansion (of abbreviations and acronyms) n Elimination (of prepositions, articles, etc.) 2. Categorization / clustering n Based on data types, schema hierarchy, linguistic content of names n e.g., „real-valued elements“, „money-related elements“ 3. Comparison (within the categories) n Compute linguistic similarity coefficients (lsim) based on a thesaurus (synonymy, hypernymy) n Output: Table of lsim coefficients (in [0,1]) between schema elements
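A minimal, purely illustrative Python sketch of the normalization and comparison steps (the abbreviation table, stop-word list, and lsim formula are assumptions, not CUPID's actual implementation):

import re

ABBREVIATIONS = {"ID": "identifier", "Qty": "quantity"}   # assumed expansion table
STOP_WORDS = {"of", "the", "a"}

def normalize(name):
    # split on punctuation/underscores and on lower-to-upper case boundaries
    tokens = re.split(r"[_\-\s]+|(?<=[a-z])(?=[A-Z])", name)
    tokens = [ABBREVIATIONS.get(t, t).lower() for t in tokens if t]
    return [t for t in tokens if t not in STOP_WORDS]

def lsim(name1, name2, synonyms=None):
    # toy linguistic similarity in [0,1]: token overlap plus thesaurus synonyms
    t1, t2 = set(normalize(name1)), set(normalize(name2))
    synonyms = synonyms or set()          # e.g. {("area", "location")}
    hits = sum(1 for a in t1 for b in t2
               if a == b or (a, b) in synonyms or (b, a) in synonyms)
    return hits / max(len(t1), len(t2), 1)

print(normalize("Product_ID"))                              # ['product', 'identifier']
print(lsim("area", "location", {("area", "location")}))     # 1.0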
18 Berendt: Advanced databases, winter term 2007/08, 18 How to identify synonyms and homonyms: Example WordNet
19 Berendt: Advanced databases, winter term 2007/08, 19 How to identify hypernyms: Example WordNet
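As an illustration of how such thesaurus lookups can be automated, the following sketch queries WordNet through the NLTK interface (requires nltk and the wordnet corpus; this is not part of CUPID itself):

from nltk.corpus import wordnet as wn   # run nltk.download('wordnet') once beforehand

def synonyms(word):
    # lemmas of all synsets the word belongs to
    return {lemma for s in wn.synsets(word) for lemma in s.lemma_names()}

def hypernyms(word):
    # names of the direct hypernym synsets
    return {h.name() for s in wn.synsets(word) for h in s.hypernyms()}

print(synonyms("area") & synonyms("location"))   # shared lemmas, if any
print(hypernyms("house"))                        # direct hypernyms, e.g. a building synset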
20 Berendt: Advanced databases, winter term 2007/08, 20 Computing similarity coefficients (2): Structure matching n Intuitions: l Leaves are similar if they are linguistically and data-type similar, and if they have similar neighbourhoods l Non-leaf elements are similar if they are linguistically similar & have similar subtrees (where the leaf sets are most important) n Procedure: 1. Initialize structural similarity of leaves based on data types n identical data types: compat. = 0.5; otherwise a value in [0,0.5] 2. Process the tree in post-order 3. Stronglink(leaf1, leaf2) iff their weighted similarity ≥ threshold 4. Set the structural similarity of non-leaf pairs according to the fraction of strongly linked leaves in their subtrees
21 Berendt: Advanced databases, winter term 2007/08, 21 The structure matching algorithm n Output: a 1:n mapping for leaves n To generate non-leaf mappings: 2nd post-order traversal
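A much-simplified Python sketch of this structure-matching idea (data structures, weights, and thresholds are assumptions; this is not the original CUPID TreeMatch code). The dictionaries lsim and ssim must already contain entries for all leaf pairs (ssim initialized from data-type compatibility, step 1):

from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    name: str
    children: tuple = ()

W_STRUCT = 0.5        # weight of structural vs. linguistic similarity (assumed)
TH_ACCEPT = 0.6       # strong-link threshold (assumed)

def postorder(node):
    for c in node.children:
        yield from postorder(c)
    yield node

def leaves(node):
    return [node] if not node.children else [l for c in node.children for l in leaves(c)]

def wsim(n1, n2, lsim, ssim):
    return W_STRUCT * ssim[(n1, n2)] + (1 - W_STRUCT) * lsim[(n1, n2)]

def tree_match(root1, root2, lsim, ssim):
    """Fill ssim for non-leaf pairs; return the set of strong links between leaves."""
    strong = set()
    for n1 in postorder(root1):
        for n2 in postorder(root2):
            if not n1.children and not n2.children:            # leaf pair
                if wsim(n1, n2, lsim, ssim) >= TH_ACCEPT:
                    strong.add((n1, n2))
            else:                                              # at least one non-leaf
                l1, l2 = leaves(n1), leaves(n2)
                linked = sum(1 for a in l1 for b in l2 if (a, b) in strong)
                ssim[(n1, n2)] = linked / max(len(l1), len(l2))
    return strong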
22 Berendt: Advanced databases, winter term 2007/08, 22 Matching shared types n Solution: expand the schema into a schema tree, then proceed as before n Can help to generate context-dependent mappings n Fails if a cycle of containment and IsDerivedFrom relationships is present (e.g., recursive type definitions)
23 Berendt: Advanced databases, winter term 2007/08, 23 Agenda The match problem & what info to use for matching (Semi-)automated matching: Example CUPID (Semi-)automated matching: Example iMAP Evaluating matching Involving the user: Explanations; mass collaboration
24 Berendt: Advanced databases, winter term 2007/08, 24 Main ideas n A learning-based approach n Main goal: discover complex matches l In particular: functions such as T.LISTINGS.list-price = S.HOUSES.price * (1+S.AGENTS.fee-rate) T.LISTINGS.agent-address = concat(S.AGENTS.city,S.AGENTS.state) n Works on relational schemas n Basic idea: reformulate schema matching as search
25 Berendt: Advanced databases, winter term 2007/08, 25 Architecture n Specialized searchers concentrate on discovering certain types of complex matches → this makes the search more efficient
26 Berendt: Advanced databases, winter term 2007/08, 26 Overview of implemented searchers
27 Berendt: Advanced databases, winter term 2007/08, 27 Example: The textual searcher For target attribute T.LISTINGS.agent-address: n Examine attributes and concatenations of attributes from S n Restrict examined set by analyzing textual properties l Data type information in schema, heuristics (proportion of non-numeric characters etc.) l Evaluate match candidates based on data correspondences, prune inferior candidates
28 Berendt: Advanced databases, winter term 2007/08, 28 Example: The numerical searcher For target attribute T.LISTINGS.list-price: n Examine attributes and arithmetic expressions over them from S n Restrict examined set by analyzing numeric properties l Data type information in schema, heuristics l Evaluate match candidates based on data correspondences, prune inferior candidates
29 Berendt: Advanced databases, winter term 2007/08, 29 Search strategy (1): Example textual searcher 1. Learn a (Naive Bayes) classifier for the text classes („agent-address“ vs. „other“) from the data instances in T.LISTINGS.agent-address 2. Apply this classifier to each match candidate (e.g., location, concat(city,state)) 3. Score of the candidate = average over the instance probabilities 4. For expansion: beam search – keep only the k top-scoring candidates
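A hedged Python sketch of this scoring step (helper names and the use of scikit-learn are assumptions; the real iMAP searcher differs in detail):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
import numpy as np

def train_target_classifier(target_values, other_values):
    # class 1 = "looks like the target attribute", class 0 = "other"
    texts = list(target_values) + list(other_values)
    labels = [1] * len(target_values) + [0] * len(other_values)
    vec = CountVectorizer()
    clf = MultinomialNB().fit(vec.fit_transform(texts), labels)
    return vec, clf

def score_candidate(candidate_values, vec, clf):
    # average probability that a candidate's instances look like the target
    probs = clf.predict_proba(vec.transform(candidate_values))[:, 1]
    return float(np.mean(probs))

def beam_search_step(candidates, vec, clf, k=5):
    # candidates: dict candidate-name -> list of generated string instances
    scored = {name: score_candidate(vals, vec, clf) for name, vals in candidates.items()}
    return sorted(scored.items(), key=lambda kv: -kv[1])[:k]   # keep the k top-scoring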
30 Berendt: Advanced databases, winter term 2007/08, 30 Search strategy (2): Example numeric searcher 1. Get value distributions of target attribute and each candidate 2. Compare the value distributions (Kullback-Leibler divergence measure) 3. Score of the candidate = Kullback-Leibler measure
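A small Python sketch of the idea (the binning scheme is an assumption; lower divergence means more similar distributions):

import numpy as np

def kl_divergence(p, q, eps=1e-9):
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def numeric_score(target_values, candidate_values, bins=20):
    lo = min(np.min(target_values), np.min(candidate_values))
    hi = max(np.max(target_values), np.max(candidate_values))
    p, _ = np.histogram(target_values, bins=bins, range=(lo, hi))
    q, _ = np.histogram(candidate_values, bins=bins, range=(lo, hi))
    return kl_divergence(p, q)

# e.g. compare list-price values with price and with price * (1 + fee-rate)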
31 Berendt: Advanced databases, winter term 2007/08, 31 Evaluation strategies of implemented searchers
32 Berendt: Advanced databases, winter term 2007/08, 32 Pruning by domain constraints n Multiple attributes of S: „attributes name and beds are unrelated“ → do not generate match candidates that combine these 2 attributes n Properties of a single attribute of T: „the average value of num-rooms does not exceed 10“ → use in the evaluation of candidates n Properties of multiple attributes of T: „lot-area and num-baths are unrelated“ → „clean up“ at the match-selector level l Example: – T.num-baths = S.baths – T.lot-area = S.lot-sq-feet/43560 + 1.3e-15 * S.baths → based on the domain constraint, drop the term involving S.baths
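A minimal sketch of the generation-level pruning (the representation of constraints is an assumption); the target-level „clean up“ of terms works analogously:

UNRELATED = {frozenset({"name", "beds"})}   # assumed encoding of the domain constraint

def violates_constraint(attrs_used):
    # attrs_used: set of attributes appearing in a candidate expression
    return any(pair <= set(attrs_used) for pair in UNRELATED)

print(violates_constraint({"name", "beds"}))   # True  -> do not generate this candidate
print(violates_constraint({"lot-sq-feet"}))    # False -> keep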
33 Berendt: Advanced databases, winter term 2007/08, 33 Pruning by using knowledge from overlap data n When S and T (partially) describe the same data n Consider the fraction of data for which a mapping is correct l e.g., house locations: l S.HOUSES.location overlaps more with T.LISTINGS.area than with T.LISTINGS.agent-address l Discard the candidate T.LISTINGS.agent-address = S.HOUSES.location, keep only T.LISTINGS.agent-address = concat(S.AGENTS.city,S.AGENTS.state)
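A toy Python sketch of the overlap heuristic (attribute names as in the running example; the exact measure used by iMAP may differ):

def overlap_fraction(candidate_values, target_values):
    # fraction of a candidate's instances that also occur among the target's instances
    target = set(target_values)
    return sum(v in target for v in candidate_values) / max(len(candidate_values), 1)

# e.g. overlap_fraction(S_HOUSES_location, T_LISTINGS_area) should exceed
# overlap_fraction(S_HOUSES_location, T_LISTINGS_agent_address), so the
# candidate agent-address = location is discarded.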
34 Berendt: Advanced databases, winter term 2007/08, 34 Agenda The match problem & what info to use for matching (Semi-)automated matching: Example CUPID (Semi-)automated matching: Example iMAP Evaluating matching Involving the user: Explanations; mass collaboration
35 Berendt: Advanced databases, winter term 2007/08, 35 How to compare? n Input: What kind of input data? (What languages? Only toy examples? What external information?) n Output: mapping between attributes or tables, nodes or paths? How much information does the system report? n Quality measures: metrics for accuracy and completeness? n Effort: how much savings of manual effort, how quantified? l Pre-match effort (training of learners, dictionary preparation,...) l Post-match effort (correction and improvement of the match output) l How are these measured?
36 Berendt: Advanced databases, winter term 2007/08, 36 Match quality measures n Need a „gold standard“ (the „true“ match) n Measures from information retrieval: Precision, Recall, F-measure (standard choice: F1, i.e. α = 0.5) n The „Overall“ measure = Recall · (2 – 1/Precision) quantifies post-match effort
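A minimal Python sketch of these measures computed against a gold standard (matches represented as attribute pairs; the Overall formula follows Do & Rahm and can become negative when precision is low):

def match_quality(found, gold):
    found, gold = set(found), set(gold)
    correct = found & gold
    precision = len(correct) / len(found) if found else 0.0
    recall = len(correct) / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    overall = recall * (2 - 1 / precision) if precision else 0.0
    return precision, recall, f1, overall

print(match_quality({("location", "area"), ("price", "list-price")},
                    {("location", "area"), ("name", "agent-name")}))
# (0.5, 0.5, 0.5, 0.0)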
37 Berendt: Advanced databases, winter term 2007/08, 37 Benchmarking n Do, Melnik, and Rahm (2003) found that evaluation studies were not comparable → need for more standardized conditions (benchmarks) n Now a tradition of competitions in ontology matching (more later in the course): l Test cases and contests at
38 Berendt: Advanced databases, winter term 2007/08, 38 Agenda The match problem & what info to use for matching (Semi-)automated matching: Example CUPID (Semi-)automated matching: Example iMAP Evaluating matching Involving the user: Explanations; mass collaboration
39 Berendt: Advanced databases, winter term 2007/08, 39 Example in iMAP User sees ranked candidates: 1. List-price = price 2. List-price = price * (1 + fee-rate) Explanation: a) Both were generated by the numeric searcher, with 2 ranked higher than 1 b) But: c) the match month-posted = fee-rate was also found d) domain constraint: matches for month-posted and price do not share attributes e) → list-price cannot be matched to anything involving fee-rate f) Why c)? g) Data instances of fee-rate were classified as being of type date → The user corrects this wrong step g), and the rest is repaired accordingly
40 Berendt: Advanced databases, winter term 2007/08, 40 Background knowledge structure for explanation: dependency graph
41 Berendt: Advanced databases, winter term 2007/08, 41 MOBS: Using mass collaboration to automate data integration 1. Initialization: a correct but partial match (e.g. title = a1, title = b2, etc.) 2. Soliciting user feedback: user poses a query → must answer a simple question → gets the answer to the initial query 3. Computing user weights (e.g., trustworthiness = fraction of correct answers on known mappings) 4. Combining user feedback (e.g., majority count) n Important: „instant gratification“ (e.g., include the new field in the results page after a user has given helpful input)
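A minimal Python sketch of steps 3 and 4 (data structures and the neutral prior are assumptions, not the MOBS implementation):

from collections import defaultdict

def trustworthiness(answers_on_known):
    # answers_on_known: list of (given_answer, true_answer) pairs for one user
    if not answers_on_known:
        return 0.5                     # neutral prior for new users (assumed)
    return sum(a == t for a, t in answers_on_known) / len(answers_on_known)

def combine(user_answers, user_weights):
    # user_answers: dict user -> answer; returns the weighted majority answer
    votes = defaultdict(float)
    for user, answer in user_answers.items():
        votes[answer] += user_weights.get(user, 0.5)
    return max(votes, key=votes.get)

weights = {"u1": trustworthiness([("yes", "yes"), ("no", "yes")]), "u2": 1.0}
print(combine({"u1": "title = a1", "u2": "title = b2"}, weights))   # 'title = b2'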
42 Berendt: Advanced databases, winter term 2007/08, 42 Next lecture The match problem & what info to use for matching (Semi-)automated matching: Example CUPID (Semi-)automated matching: Example iMAP Evaluating matching Involving the user: Explanations; mass collaboration Semantic Web
43 Berendt: Advanced databases, winter term 2007/08, 43 References / background reading; acknowledgements
Rahm, E. & Bernstein, P.A. (2001). A survey of approaches to automatic schema matching. The VLDB Journal, 10.
Doan, A. & Halevy, A.Y. (2004). Semantic integration research in the database community: A brief survey. AI Magazine.
Madhavan, J., Bernstein, P.A., & Rahm, E. (2001). Generic schema matching with Cupid. In Proc. of the 27th VLDB Conference.
Dhamankar, R., Lee, Y., Doan, A., Halevy, A., & Domingos, P. (2004). iMAP: Discovering complex semantic matches between database schemas. In Proc. of SIGMOD.
Do, H.-H., Melnik, S., & Rahm, E. (2003). Comparison of schema matching evaluations. In Web, Web-Services, and Database Systems: NODe 2002, Web- and Database-Related Workshops, Erfurt, Germany, October 7-10, Revised Papers. Springer.
McCann, R., Doan, A., Varadarajan, V., & Kramnik, A. (2003). Building data integration systems via mass collaboration. In Proc. of the International Workshop on the Web and Databases (WebDB).
Thanks to Wikipedia for the Kullback-Leibler formulae.