
Slide 1: Advanced databases – Defining and combining heterogeneous databases: Schema matching
Bettina Berendt, Katholieke Universiteit Leuven, Department of Computer Science
http://www.cs.kuleuven.be/~berendt/teaching/2007w/adb/
Last update: 25 October 2007

Slide 2: Agenda
- The match problem & what info to use for matching
- (Semi-)automated matching: Example CUPID
- (Semi-)automated matching: Example iMAP
- Evaluating matching
- Involving the user: Explanations; mass collaboration

Slide 3: The match problem
Given two schemas S1 and S2, find a mapping between elements of S1 and S2 that correspond semantically to each other.

Slide 4: Motivation: application areas
- Schema integration in multi-database systems
- Data integration systems on the Web
- Translating data (e.g., for data warehousing)
- E-commerce message translation
- P2P data management
- Model management (tools for easily manipulating models of data)

Slide 5: The match operator
- Match operator: f(S1, S2) = mapping between S1 and S2, for schemas S1, S2
- Mapping: a set of mapping elements
- Mapping element: elements of S1, elements of S2, and a mapping expression
- Mapping expression: different functions and relationships
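
These structures translate directly into code. A minimal sketch in Python; the types and names are hypothetical, not taken from any particular matching tool:

```python
from dataclasses import dataclass

@dataclass
class MappingElement:
    """One correspondence: elements of S1, elements of S2, and a mapping expression."""
    s1_elements: list   # e.g. ["HOUSES.location"]
    s2_elements: list   # e.g. ["LISTINGS.area"]
    expression: str     # e.g. "LISTINGS.area = HOUSES.location"

def match(s1, s2):
    """The match operator f(S1, S2): returns a mapping, i.e. a set of mapping elements."""
    mapping = []        # filled in by a concrete matcher (rule-based, learning-based, ...)
    return mapping
```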

Slide 6: Matching expressions: examples
- Scalar relations (=, ≥, ...)
  - S.HOUSES.location = T.LISTINGS.area
- Functions
  - T.LISTINGS.list-price = S.HOUSES.price * (1 + S.AGENTS.fee-rate)
  - T.LISTINGS.agent-address = concat(S.AGENTS.city, S.AGENTS.state)
- ER-style relationships (is-a, part-of, ...)
- Set-oriented relationships (overlaps, contains, ...)
- Any other terms that are defined in the expression language used

Slide 7: Matching and mapping
1. Find the schema match ("declarative")
2. Create a procedure (e.g., a query expression) to enable automated data translation or exchange (mapping, "procedural")
Example of a result of step 2, to create T.LISTINGS from S (simplified notation):
  area          = SELECT location FROM HOUSES
  agent-name    = SELECT name FROM AGENTS
  agent-address = SELECT concat(city, state) FROM AGENTS
  list-price    = SELECT price * (1 + fee-rate) FROM HOUSES, AGENTS WHERE agent-id = id
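
The same translation can also be packaged as one executable procedure. A hypothetical Python rendering, assuming the source tables are available as lists of dicts and the column names follow the slide's simplified notation:

```python
def create_listings(houses, agents):
    """Build T.LISTINGS rows from S.HOUSES and S.AGENTS, joining on agent-id = id."""
    agents_by_id = {a["id"]: a for a in agents}
    listings = []
    for h in houses:
        a = agents_by_id[h["agent_id"]]
        listings.append({
            "area": h["location"],                           # area = location
            "agent_name": a["name"],                         # agent-name = name
            "agent_address": a["city"] + ", " + a["state"],  # agent-address = concat(city, state)
            "list_price": h["price"] * (1 + a["fee_rate"]),  # list-price = price * (1 + fee-rate)
        })
    return listings
```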

Slide 8: Based on what information can the mappings be found? (based on the homework)

Slide 9: Based on what information can the mappings be found?
Rahm & Bernstein's classification of schema matching approaches

Slide 10: Challenges
- The semantics of the involved elements often need to be inferred
- Heuristic solutions must often rely on cues in schema and data, which are unreliable
  - e.g., homonyms (area), synonyms (area, location)
- Schema and data clues are often incomplete
  - e.g., date: date of what?
- Global nature of matching: to choose one matching possibility, one must typically exclude all others as worse
- Matching is often subjective and/or context-dependent
  - e.g., does house-style match house-description or not?
- An extremely laborious and error-prone process
  - e.g., Li & Clifton (2000): a project at GTE telecommunications with 40 databases and 27,000 elements, and no access to the original developers of the DBs; the estimated time for just finding and documenting the matches was 12 person-years

Slide 11: Semi-automated schema matching (1): Rule-based solutions
- Hand-crafted rules
- Exploit schema information
+ relatively inexpensive
+ do not require training
+ fast (operate only on the schema, not the data)
+ can work very well in certain types of applications & domains
+ rules provide a quick & concise way of capturing user knowledge about the domain
– cannot exploit data instances effectively
– cannot exploit previous matching efforts (other than by re-use)

Slide 12: Semi-automated schema matching (2): Learning-based solutions
- Rules/mappings are learned from attribute specifications and statistics of data content (Rahm & Bernstein: "instance-level matching")
- Exploit schema information and data
- Some approaches use external evidence:
  - Past matches
  - A corpus of schemas and matches ("matchings in real-estate applications will tend to be alike")
  - A corpus of users (more details later in this slide set)
+ can exploit data instances effectively
+ can exploit previous matching efforts
– relatively expensive
– require training
– slower (operate on the data)
– results may be opaque (e.g., neural network output) → hence explanation components (more details later)

Slide 13: Agenda
- The match problem & what info to use for matching
- (Semi-)automated matching: Example CUPID
- (Semi-)automated matching: Example iMAP
- Evaluating matching
- Involving the user: Explanations; mass collaboration

Slide 14: CUPID: Overview (1)
- Rule-based approach
- Schema types: relational, XML
- Metadata representation: extended ER
- Match granularity: element, structure
- Match cardinality: 1:1, n:1

Slide 15: CUPID: Overview (2)
- Schema-level match:
  - Name-based: name equality, synonyms, hypernyms, homonyms, abbreviations
  - Constraint-based: data type and domain compatibility, referential constraints
  - Structure matching: matching subtrees, weighted by leaves
- Re-use / auxiliary information used: thesauri, glossaries
- Combination of matchers: hybrid
- Manual work / user input: the user can adjust threshold weights

Slide 16: Basic representation: schema trees
Computation overview:
1. Compute similarity coefficients between elements of these graphs
2. Deduce a mapping from these coefficients

Slide 17: Computing similarity coefficients (1): Linguistic matching
Operates on schema element names (= nodes in the schema tree).
1. Normalization
   - Tokenization (parse names into tokens based on punctuation, case, etc.), e.g., Product_ID → {Product, ID}
   - Expansion (of abbreviations and acronyms)
   - Elimination (of prepositions, articles, etc.)
2. Categorization / clustering
   - Based on data types, schema hierarchy, and the linguistic content of names, e.g., "real-valued elements", "money-related elements"
3. Comparison (within the categories)
   - Compute linguistic similarity coefficients (lsim) based on a thesaurus (synonymy, hypernymy)
Output: a table of lsim coefficients (in [0,1]) between schema elements. A toy sketch of these steps follows below.
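
In the sketch, a hand-made abbreviation table and thesaurus stand in for CUPID's real auxiliary resources; all tables and values here are hypothetical:

```python
import re

ABBREVIATIONS = {"id": "identifier", "addr": "address"}   # hypothetical expansion table
STOPWORDS = {"of", "the", "a"}                            # prepositions, articles, ...
SYNONYMS = {frozenset({"area", "location"})}              # hypothetical thesaurus entry

def normalize(name):
    """Step 1: tokenize on punctuation/case, expand abbreviations, drop stopwords."""
    spaced = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", name)    # split camelCase boundaries
    tokens = [t.lower() for t in re.findall(r"[A-Za-z]+", spaced)]
    tokens = [ABBREVIATIONS.get(t, t) for t in tokens]
    return [t for t in tokens if t not in STOPWORDS]

def lsim(name1, name2):
    """Step 3: fraction of token pairs that are equal or thesaurus synonyms (in [0,1])."""
    t1, t2 = normalize(name1), normalize(name2)
    hits = sum(1 for a in t1 for b in t2
               if a == b or frozenset({a, b}) in SYNONYMS)
    return hits / max(len(t1) * len(t2), 1)

print(lsim("Product_ID", "ProductIdentifier"))  # tokens align after expansion
```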

Slide 18: How to identify synonyms and homonyms: Example WordNet

Slide 19: How to identify hypernyms: Example WordNet
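
Both lookups can be reproduced programmatically, for instance via NLTK's WordNet interface; this assumes nltk is installed and the wordnet corpus has been downloaded, and is one possible tool rather than the one shown on the slides:

```python
from nltk.corpus import wordnet as wn   # requires: nltk.download("wordnet")

# Synonyms and homonyms: each synset of "area" is one sense (homonymy shows up
# as multiple synsets); the lemmas within one synset are synonyms of each other.
for synset in wn.synsets("area", pos=wn.NOUN):
    print(synset.name(), "->", synset.lemma_names())

# Hypernyms: more general terms for the first noun sense of "house".
house = wn.synsets("house", pos=wn.NOUN)[0]
print([h.lemma_names() for h in house.hypernyms()])
```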

Slide 20: Computing similarity coefficients (2): Structure matching
Intuitions:
- Leaves are similar if they are linguistically and data-type similar, and if they have similar neighbourhoods
- Non-leaf elements are similar if they are linguistically similar & have similar subtrees (where the leaf sets are most important)
Procedure (sketched in code below):
1. Initialize the structural similarity of leaves based on data types: identical data types → compatibility = 0.5; otherwise a value in [0, 0.5]
2. Process the tree in post-order
3. Stronglink(leaf1, leaf2) iff their weighted similarity ≥ threshold
4. ...
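
A condensed sketch of the leaf phase under stated assumptions: a hypothetical node type, a stand-in data-type compatibility table, and made-up weighting constants in place of CUPID's tuned parameters:

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    datatype: str

# Hypothetical stand-ins for CUPID's compatibility table and parameters.
DATATYPE_COMPAT = {("string", "string"): 0.5, ("string", "varchar"): 0.4}
W_STRUCT, THRESHOLD = 0.5, 0.6   # structural-vs-linguistic weight; strong-link threshold

def wsim(ssim, lsim):
    """Weighted similarity of two schema-tree nodes."""
    return W_STRUCT * ssim + (1 - W_STRUCT) * lsim

def strong_links(t1_leaves, t2_leaves, lsim_table):
    """Steps 1 and 3: init leaf ssim from data types, then strong-link
    every pair whose weighted similarity clears the threshold."""
    links = set()
    for l1 in t1_leaves:
        for l2 in t2_leaves:
            ssim = DATATYPE_COMPAT.get((l1.datatype, l2.datatype), 0.0)  # in [0, 0.5]
            if wsim(ssim, lsim_table[(l1.name, l2.name)]) >= THRESHOLD:
                links.add((l1.name, l2.name))
    return links
```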

Slide 21: The structure matching algorithm
- Output: a 1:n mapping for leaves
- To generate non-leaf mappings: a second post-order traversal

Slide 22: Matching shared types
- Solution: expand the schema into a schema tree, then proceed as before
- Can help to generate context-dependent mappings
- Fails if a cycle of containment and IsDerivedFrom relationships is present (e.g., recursive type definitions)

Slide 23: Agenda
- The match problem & what info to use for matching
- (Semi-)automated matching: Example CUPID
- (Semi-)automated matching: Example iMAP
- Evaluating matching
- Involving the user: Explanations; mass collaboration

Slide 24: iMAP: Main ideas
- A learning-based approach
- Main goal: discover complex matches, in particular functions such as
  - T.LISTINGS.list-price = S.HOUSES.price * (1 + S.AGENTS.fee-rate)
  - T.LISTINGS.agent-address = concat(S.AGENTS.city, S.AGENTS.state)
- Works on relational schemas
- Basic idea: reformulate schema matching as search

Slide 25: Architecture
Specialized searchers each focus on discovering certain types of complex matches, which makes the search more efficient.

Slide 26: Overview of implemented searchers

Slide 27: Example: The textual searcher
For the target attribute T.LISTINGS.agent-address:
- Examine attributes, and concatenations of attributes, from S
- Restrict the examined set by analyzing textual properties
  - Data type information in the schema; heuristics (proportion of non-numeric characters, etc.)
  - Evaluate match candidates based on data correspondences, prune inferior candidates

Slide 28: Example: The numerical searcher
For the target attribute T.LISTINGS.list-price:
- Examine attributes, and arithmetic expressions over them, from S
- Restrict the examined set by analyzing numeric properties
  - Data type information in the schema; heuristics
  - Evaluate match candidates based on data correspondences, prune inferior candidates

Slide 29: Search strategy (1): Example textual searcher
1. Learn a (Naive Bayes) classifier text → class ("agent-address" or "other") from the data instances in T.LISTINGS.agent-address
2. Apply this classifier to each match candidate (e.g., location, concat(city, state))
3. Score of a candidate = the average over its instance probabilities
4. For expansion: beam search, keeping only the top-k scoring candidates
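
A sketch of steps 1-3, with scikit-learn standing in for iMAP's own Naive Bayes classifier (an assumption: the paper does not prescribe a library, and the attribute names are those of the running example):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def score_candidate(target_values, other_values, candidate_values):
    """Train text -> {target, other} on the target attribute's instances,
    then score a candidate as the mean probability that its instances
    belong to the target attribute."""
    clf = make_pipeline(CountVectorizer(), MultinomialNB())
    clf.fit(target_values + other_values,
            ["target"] * len(target_values) + ["other"] * len(other_values))
    target_idx = list(clf.classes_).index("target")
    probs = clf.predict_proba(candidate_values)[:, target_idx]
    return probs.mean()

# e.g., candidates for agent-address: S.location vs. concat(S.city, S.state);
# beam search keeps only the k top-scoring candidates at each expansion.
```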

Slide 30: Search strategy (2): Example numeric searcher
1. Get the value distributions of the target attribute and of each candidate
2. Compare the value distributions (Kullback-Leibler divergence measure)
3. Score of a candidate = its Kullback-Leibler divergence from the target
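
A sketch of the comparison: bin both value sets into histograms over a common range and compute the Kullback-Leibler divergence D_KL(P‖Q) = Σᵢ pᵢ log(pᵢ/qᵢ), here via scipy. Lower scores mean more similar distributions; the smoothing constant is a practical assumption, not from the paper:

```python
import numpy as np
from scipy.stats import entropy  # entropy(p, q) computes D_KL(p || q)

def kl_score(target_values, candidate_values, bins=20):
    """Histogram both value sets over a common range, then compare with KL divergence."""
    lo = min(min(target_values), min(candidate_values))
    hi = max(max(target_values), max(candidate_values))
    p, _ = np.histogram(target_values, bins=bins, range=(lo, hi))
    q, _ = np.histogram(candidate_values, bins=bins, range=(lo, hi))
    eps = 1e-9                         # smoothing so empty bins don't make KL infinite
    return entropy(p + eps, q + eps)   # normalizes internally; lower = better match
```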

Slide 31: Evaluation strategies of implemented searchers

Slide 32: Pruning by domain constraints
- On multiple attributes of S: "attributes name and beds are unrelated" → do not generate match candidates containing both attributes
- On properties of a single attribute of T: "the average value of num-rooms does not exceed 10" → use when evaluating candidates
- On properties of multiple attributes of T: "lot-area and num-baths are unrelated" → "clean up" at the match-selector level
  - Example:
    T.num_baths = S.baths
    T.lot-area = (S.lot-sq-feet/43560) + 1.3e-15 * S.baths
    → based on the domain constraint, drop the term involving S.baths from the lot-area match

Slide 33: Pruning by using knowledge from overlap data
- Applicable when S and T share some of the same data
- Consider the fraction of the shared data for which a mapping is correct (a sketch follows below)
  - e.g., house locations: S.HOUSES.location overlaps more with T.LISTINGS.area than with T.LISTINGS.agent-address
  - → discard the candidate T.LISTINGS.agent-address = S.HOUSES.location; keep only T.LISTINGS.agent-address = concat(S.AGENTS.city, S.AGENTS.state)
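
Once the overlapping rows have been aligned, the check reduces to an agreement fraction. A toy sketch with a hypothetical helper, assuming the shared rows are already paired up:

```python
def overlap_fraction(candidate_values, target_values):
    """Fraction of overlapping rows for which the candidate mapping
    reproduces the observed target value."""
    matches = sum(1 for c, t in zip(candidate_values, target_values) if c == t)
    return matches / len(target_values)

# Keep the candidate with the higher fraction, e.g.:
#   overlap_fraction(location_vals, area_vals)           -> high -> keep
#   overlap_fraction(location_vals, agent_address_vals)  -> low  -> discard
```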

Slide 34: Agenda
- The match problem & what info to use for matching
- (Semi-)automated matching: Example CUPID
- (Semi-)automated matching: Example iMAP
- Evaluating matching
- Involving the user: Explanations; mass collaboration

Slide 35: How to compare?
- Input: what kind of input data? (What languages? Only toy examples? What external information?)
- Output: a mapping between attributes or tables, nodes or paths? How much information does the system report?
- Quality measures: metrics for accuracy and completeness?
- Effort: how much manual effort is saved, and how is it quantified?
  - Pre-match effort (training of learners, dictionary preparation, ...)
  - Post-match effort (correction and improvement of the match output)
  - How are these measured?

Slide 36: Match quality measures
- Need a "gold standard" (the "true" match)
- Measures from information retrieval: precision, recall, and the F-measure (standard choice: F1, i.e., α = 0.5); reconstructed below
- → These quantify the post-match effort
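
The formulas behind the slide's (missing) figure are the standard IR definitions; a reconstruction in LaTeX, with P = precision and R = recall:

```latex
P = \frac{|\text{correct matches found}|}{|\text{matches found}|}, \qquad
R = \frac{|\text{correct matches found}|}{|\text{true matches}|}

F_\alpha = \frac{1}{\frac{\alpha}{P} + \frac{1 - \alpha}{R}}, \qquad
F_{\alpha = 0.5} = F_1 = \frac{2\,P\,R}{P + R}
```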

Slide 37: Benchmarking
- Do, Melnik, and Rahm (2003) found that evaluation studies were not comparable → more standardized conditions (benchmarks) are needed
- There is now a tradition of competitions in ontology matching (more later in the course): test cases and contests at http://www.ontologymatching.org/evaluation.html

Slide 38: Agenda
- The match problem & what info to use for matching
- (Semi-)automated matching: Example CUPID
- (Semi-)automated matching: Example iMAP
- Evaluating matching
- Involving the user: Explanations; mass collaboration

Slide 39: Example in iMAP
The user sees ranked candidates:
1. list-price = price
2. list-price = price * (1 + fee-rate)
Explanation:
a) Both were generated by the numeric searcher, and 2 was initially ranked higher than 1
b) But: month-posted was matched to fee-rate
c) A domain constraint states that the matches for month-posted and price do not share attributes
d) → list-price cannot be matched to anything involving fee-rate
e) Why b)? Because the data instances of fee-rate were classified as being of type date
→ The user corrects this one wrong step e), and the rest is repaired accordingly

Slide 40: Background knowledge structure for explanations: the dependency graph

Slide 41: MOBS: Using mass collaboration to automate data integration
1. Initialization: a correct but partial match (e.g., title = a1, title = b2, etc.)
2. Soliciting user feedback: on a user query, the user must answer a simple question and then gets the answer to the initial query
3. Computing user weights (e.g., trustworthiness = fraction of correct answers on known mappings)
4. Combining user feedback (e.g., by majority count); a toy sketch of steps 3-4 follows below
Important: "instant gratification" (e.g., include the new field in the results page after a user has given helpful input)
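
In the sketch (all names hypothetical), user weights are learned from answers on known mappings and then applied as a trust-weighted majority vote:

```python
from collections import defaultdict

def user_weight(answers_to_known):
    """Step 3: trustworthiness = fraction of correct answers on known mappings."""
    return sum(answers_to_known) / len(answers_to_known)  # booleans: True = correct

def combine_feedback(votes):
    """Step 4: weighted majority over (user_weight, proposed_match) pairs."""
    scores = defaultdict(float)
    for weight, match in votes:
        scores[match] += weight
    return max(scores, key=scores.get)

best = combine_feedback([(0.9, "title = a1"), (0.4, "title = b2"), (0.8, "title = a1")])
print(best)  # "title = a1" wins with total weight 1.7 vs. 0.4
```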

Slide 42: Next lecture
- The match problem & what info to use for matching
- (Semi-)automated matching: Example CUPID
- (Semi-)automated matching: Example iMAP
- Evaluating matching
- Involving the user: Explanations; mass collaboration
- Semantic Web

Slide 43: References / background reading; acknowledgements
Rahm, E. & Bernstein, P.A. (2001). A survey of approaches to automatic schema matching. The VLDB Journal, 10, 334-350. http://research.microsoft.com/~philbe/VLDBJ-Dec2001.pdf
Doan, A. & Halevy, A.Y. (2004). Semantic integration research in the database community: A brief survey. AI Magazine. http://dit.unitn.it/~p2p/RelatedWork/Matching/si-survey-db-community.pdf
Madhavan, J., Bernstein, P.A., & Rahm, E. (2001). Generic schema matching with Cupid. In Proc. of the 27th VLDB Conference. http://dbs.uni-leipzig.de/de/publication/title/generic_schema_matching_with_cupid
Dhamankar, R., Lee, Y., Doan, A., Halevy, A., & Domingos, P. (2004). iMAP: Discovering complex semantic matches between database schemas. In Proc. of SIGMOD 2004. http://citeseer.ist.psu.edu/680053.html
Do, H.-H., Melnik, S., & Rahm, E. (2003). Comparison of schema matching evaluations. In Web, Web-Services, and Database Systems: NODe 2002, Web- and Database-Related Workshops, Erfurt, Germany, October 7-10, 2002. Revised Papers (pp. 221-237). Springer. http://dit.unitn.it/~p2p/RelatedWork/Comparison%20of%20Schema%20Matching%20Evaluations.pdf
McCann, R., Doan, A., Varadarajan, V., & Kramnik, A. (2003). Building data integration systems via mass collaboration. In Proc. International Workshop on the Web and Databases (WebDB). http://citeseer.ist.psu.edu/675796.html
Thanks to Wikipedia for the Kullback-Leibler formulae.

