
1 Creating semantic mappings (based on the slides of the course: CIS 550 – Database & Information Systems, Univ. Pennsylvania, Zachary Ives)

2 What have we studied before?  Formalisms for specifying source descriptions, and how to use these descriptions to reformulate queries What is the goal now?  A set of techniques that helps a designer create semantic mappings and source descriptions for a particular data integration application  This is a heuristic task; the idea is to reduce the time it takes to create semantic mappings

3 Motivating example DVD vendor schema Product(title, productionYear, releaseDate, basePrice, listPrice, rating, customerReviews, saleLocation) Movie(title, year, director, directorOrigin, mainActors, genre, awards) Locations(name, taxRate, shippingCharge) Online aggregator schema Item(title, year, classification, genre, director, price, starring, avgReviews)

4 Recap: semantic heterogeneities Table and attribute names can differ  Attributes rating and classification, as well as mainActors and starring Multiple attributes in one schema may correspond to a single attribute in the other  basePrice and taxRate from the vendor are used to compute the value of price in the aggregator schema The tabular organization may be different  The DVD vendor requires 3 tables; the aggregator needs only one Coverage and level of detail may differ  The DVD vendor models releaseDate and awards, which the aggregator does not

5 Process of creating schema mappings 1) Schema matching: creating correspondences between elements of the 2 schemas  Ex: title and rating in the vendor schema correspond to title and classification in the aggregator schema 2) Creating schema mappings from the correspondences (and filling in missing details): specifies the transformations that have to be applied to the source data in order to produce the target data  Ex: to compute the value of price in the aggregator schema, we have to join the Product table with the Locations table, using saleLocation = name, and add the appropriate local tax given by taxRate

6 Schema matching Goal: to automatically create a set of correspondences between the 2 schemas Given two schemas S1 and S2, whose schema elements are the table and attribute names, a correspondence A -> B states that a set of elements A in S1 maps to a set of elements B in S2  The most common correspondences are 1-1, where A and B are singletons Ex: for the target relation Item, there are the following correspondences: Product.title -> title Movie.year -> year {Product.basePrice, Locations.taxRate} -> price

7 Output of the schema matcher Associates a confidence measure (in [0, 1]) with every correspondence  Because of the heuristic nature of the schema matching process  Possible heuristics: examine the names of the schema elements; examine the data values; etc. It may also associate a filter with a correspondence  Ex: {Product.basePrice, Locations.taxRate} -> price may apply only to locations in the US

8 Components of a schema matcher Base matchers: predict correspondences based on cues available in the schema and data Combiner: combines the predictions of the base matchers into a single similarity matrix Constraint enforcer: applies domain knowledge and constraints to prune the possible matches Match selector: chooses the best match or matches from the similarity matrix

9 Process of creating schema mappings (recap) 1) Schema matching: creating correspondences between elements of the 2 schemas  Ex: title and rating in the vendor schema correspond to title and classification in the aggregator schema 2) Creating schema mappings from the correspondences (and filling in missing details): specifies the transformations that have to be applied to the source data in order to produce the target data  Ex: to compute the value of price in the aggregator schema, we have to join the Product table with the Locations table, using saleLocation = name, and add the appropriate local tax given by taxRate

10 Creating mappings from correspondences Create the actual mappings from the matches (correspondences)  Find how tuples from one source can be translated into tuples in the other Challenge: there may be more than one possible way of joining the data  Ex: to compute the value of price in the aggregator schema, we may join the Product table with the Locations table, using saleLocation = name, and add the appropriate local tax given by taxRate; or join Product with Movie to obtain the origin of the director, and compute the price based on the taxes in the director’s country of birth

11 Outline Base matchers Combining match predictions Applying constraints and domain knowledge to candidate schema matches Match selector Applying machine learning techniques to enable the schema matcher to learn Discovering m-m matches From matches to mappings

12 Base matchers Input:  A pair of schemas S1 and S2, with elements A1,..., An and B1,..., Bm, respectively  Additional available information, such as data instances or text descriptions Output:  A correspondence matrix that assigns to every pair of elements (Ai, Bj) a number between 0 and 1 predicting whether Ai corresponds to Bj

13 Classes of base matchers 1) Name-based matchers: based on comparing names of schema elements 2) Instance-based matchers: based on inspecting data instances For specific domains, it is possible to develop more specialized and effective matchers  Must look at large amounts of data  Slower, but efficiency can be improved  More precise

14 Name-based matchers Compare the names of the elements, hoping that the names convey the true semantics of the elements Challenge: to find effective distance measures reflecting the distance between element names  The same concept is rarely written in exactly the same way in two schemas

15 Edit distance Edit distance between the strings that represent the names of the elements Levenshtein distance:  Minimum number of operations (insertions, deletions or replacements) needed to transform one string into another Given two schema elements represented by the strings s1 and s2 and their edit distance, denoted editDistance(s1, s2): editSimilarity(s1, s2) = 1 – editDistance(s1, s2) / max(length(s1), length(s2)) Ex: the Levenshtein distance between the strings FullName and FName is 3; between TotalPrice and PriceSum it is 8 The editSimilarity between FullName and FName is 0.625; between TotalPrice and PriceSum it is 0.2
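
To make the measure concrete, here is a minimal Python sketch of the Levenshtein distance and the editSimilarity formula above (the function names are our own, not from the slides):

    def levenshtein(s1, s2):
        # Classic dynamic-programming edit distance.
        prev = list(range(len(s2) + 1))
        for i, c1 in enumerate(s1, 1):
            cur = [i]
            for j, c2 in enumerate(s2, 1):
                cur.append(min(prev[j] + 1,                  # deletion
                               cur[j - 1] + 1,               # insertion
                               prev[j - 1] + (c1 != c2)))    # replacement
            prev = cur
        return prev[-1]

    def edit_similarity(s1, s2):
        return 1 - levenshtein(s1, s2) / max(len(s1), len(s2))

    print(edit_similarity("FullName", "FName"))       # 0.625
    print(edit_similarity("TotalPrice", "PriceSum"))  # 0.2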

16 Qgram distance Qgrams: substrings within the names Given a positive integer q, the set of q-grams of s, qgrams(s), consists of all the substrings of s of size q Ex: the 3-grams of price are: {pri, ric, ice} Given a number q, qgramSimilarity(s1, s2) = |qgrams(s1) ∩ qgrams(s2)| / |qgrams(s1) ∪ qgrams(s2)| Ex: the 3-gram similarity between pricemarked and markedprice is 7/11 ≈ 0.64 (the two strings share 7 of their 11 distinct 3-grams) Advantages over edit distance:  Faster  More resilient to word-order interchanging
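
A small Python sketch of the q-gram measure, treating the q-grams as a set and applying the intersection-over-union formula above (function names are our own):

    def qgrams(s, q=3):
        # All substrings of s of length q.
        return {s[i:i + q] for i in range(len(s) - q + 1)}

    def qgram_similarity(s1, s2, q=3):
        g1, g2 = qgrams(s1, q), qgrams(s2, q)
        return len(g1 & g2) / len(g1 | g2)  # Jaccard: intersection over union

    print(qgrams("price"))                                 # {'pri', 'ric', 'ice'}
    print(qgram_similarity("pricemarked", "markedprice"))  # 7/11 = 0.636...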

17 Sound distance Based on the way strings sound Soundex algorithm: encodes names by their sound; can detect when two names sound the same even if spelling varies Ex: ship2 is similar to shipto

18 Normalization Element names can be composed of acronyms or short phrases to express their meanings Normalization: replaces a single token by several tokens that can be compared  Element names should be normalized before applying distance measures Some normalization techniques:  Expand known abbreviations  Expand a string with its synonyms  Remove articles, prepositions and conjunctions

19 Instance-based matchers Data instances, if available, often convey the meaning of a schema element better than its name  Use them for predicting correspondences between schema elements Techniques:  Develop a set of rules for inferring common types from the format of the data values  Ex: phone numbers, prices, zip codes, etc.  Value overlap  Text-field analysis

20 Value overlap Measures the overlap of values in the two elements Applies to categorical elements, whose values range over some finite domain (e.g., movie ratings, country names) Jaccard coefficient: the fraction of the values for the two elements that can be an instance of both of them  Also defined as the conditional probability of a value being an instance of both elements, given that it is an instance of one of them: JaccardSim(e1, e2) = Pr(v ∈ D(e1) ∩ D(e2) | v ∈ D(e1) ∪ D(e2)) = |D(e1) ∩ D(e2)| / |D(e1) ∪ D(e2)|, where D(e) is the set of values for element e

21 Text-field analysis Applies to elements whose values are longer texts (e.g., house descriptions)  Their values can vary drastically  The probability of finding the exact string for both elements is very low Idea: to compare the general topics these text fields are about  Use text classifiers

22 Text classifiers Classifier for a concept C: algorithm that identifies instances of C from those that are not  Creates an internal model based on training examples: positive examples that are known to be instances of C and negative examples that are known not to be instances of C  Given an example c, the classifier applies its model to decide whether c is an instance of C

23 Combining match predictions (1) The results of the base matchers are summarized in a similarity cube:  Suppose the schema matcher uses l base matchers to predict correspondences between the elements A1,..., An of S1 and the elements B1,..., Bm of S2  The similarity cube assigns to each triple (b, i, j) a number between 0 and 1 describing the prediction of base matcher b about the correspondence between Ai and Bj

24 Combining match predictions (2) Output: a similarity matrix that combines the predictions of the base matchers  For every pair (i, j), we want a value between 0 and 1, Combined(i, j), that gives a single prediction about the correspondence between Ai and Bj Two possible combinations:  Combined(i, j) = max_{b=1..l} Base(b, i, j)  Combined(i, j) = (1/l) * sum_{b=1..l} Base(b, i, j) Max is used when we trust a matcher that outputs a high value; avg is used otherwise Multi-step combination functions, and weights assigned to individual matchers, are also possible
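
As an illustration, here is a sketch of both combination functions applied to a similarity cube with numpy; the array shapes and weights are made up for the example:

    import numpy as np

    # base[b, i, j]: prediction of matcher b for the pair (Ai, Bj).
    base = np.random.rand(3, 4, 5)  # l=3 matchers, n=4 elements in S1, m=5 in S2

    combined_max = base.max(axis=0)   # trust the most confident matcher
    combined_avg = base.mean(axis=0)  # average out unreliable matchers

    # Weighted variant: one weight per matcher.
    weights = np.array([0.5, 0.3, 0.2])
    combined_weighted = np.tensordot(weights, base, axes=1)  # shape (4, 5)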

25 Applying constraints and domain knowledge to candidate matches There may exist domain-specific knowledge that helps in the process of schema matching It is expressed as a set of constraints that enable pruning candidate matches  Hard constraints: must be applied; the schema matcher will not output any match that violates them  Soft constraints: of a more heuristic nature; may be violated in some schemas; the number of violations should be minimized  A cost is associated with each constraint: infinite for hard constraints; any positive number for soft constraints

26 Example Schema: Book(ISBN, publisher, pubCountry, title, review) Item(code, name, brand, origin, desc) Inventory(ISBN, quantity, location) InStore(code, availQuant) Constraints: T1: if A -> Item.name, then A is a key. Cost = ∞ T2: if sim(A1, B1) ≈ sim(A2, B1), A1 is next to A3, B1 is next to B2, and sim(A3, B2) > 0.8, then match A1 to B1. Cost = 2 T3: Average(length(desc)) ≥ 20. Cost = 1.5 T4: if |{Ai ∈ attributes(R1) | ∃ Bj ∈ attributes(R2) s.t. sim(Ai, Bj) > 0.8}| ≥ |attributes(R1)| / 2, then match table R1 to table R2. Cost = 1
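
One way to operationalize such constraints is to represent each as a predicate plus a cost, and score a candidate match by the total cost of the constraints it violates; the sketch below does this for T1, with all names (match_cost, key_attrs) invented for illustration:

    def match_cost(match, constraints):
        # match: list of (source_element, target_element) correspondences.
        # constraints: list of (violated, cost), where violated(match) is True
        # when the constraint is broken.
        total = 0.0
        for violated, cost in constraints:
            if violated(match):
                total += cost
        return total  # infinity => pruned (hard constraint); else minimize

    # T1 as a hard constraint: only key attributes may map to Item.name
    # (the set of keys is assumed to be known from the schema).
    key_attrs = {"Book.ISBN", "Item.code"}
    t1 = (lambda m: any(t == "Item.name" and s not in key_attrs for s, t in m),
          float("inf"))

    print(match_cost([("Book.title", "Item.name")], [t1]))  # inf -> pruned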

27 Algorithms for applying constraints to the similarity matrix Applying constraints with A* search  Guaranteed to find the optimal solution  Computationally more expensive Applying constraints with local propagation  Faster  May get stuck in a local minimum

28 Components of a schema matcher

29 Match selector Input: the similarity matrix Output: a schema match, or the top few matches If the matching system is interactive, the user can choose among the top k correspondences; the system computes several possible matches Ex: schema1(shipAddr, shipPhone, billAddr, billPhone) schema2(addr, phone) The correspondences shipPhone -> phone and billPhone -> phone cannot be distinguished until the user indicates shipAddr -> addr; then shipPhone -> phone becomes more likely than billPhone -> phone

30 The algorithm behind The match selection problem can be formulated as an instance of finding a stable marriage  Elements of S1: men; elements of S2: women  sim(i, j): the degree to which Ai and Bj desire each other Goal: find a stable match between men and women  A match is unstable if it contains Ai -> Bj and Ak -> Bl such that sim(i, l) > sim(i, j) and sim(i, l) > sim(k, l)  If such couples existed, then Ai and Bl would want to be matched together To produce a schema match without unhappy couples:  match = {}  Repeat: let (i, j) be the highest value in sim such that Ai and Bj are not in match; add Ai -> Bj to match
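
A compact Python sketch of this greedy loop; with 1-1 correspondences it produces a match with no unhappy couples (names are our own):

    def select_matches(sim):
        # sim: dict mapping (ai, bj) -> similarity in [0, 1].
        match, used_a, used_b = [], set(), set()
        # Visit pairs from highest to lowest similarity.
        for (ai, bj), s in sorted(sim.items(), key=lambda kv: -kv[1]):
            if ai not in used_a and bj not in used_b:
                match.append((ai, bj, s))
                used_a.add(ai)
                used_b.add(bj)
        return match

    sim = {("shipAddr", "addr"): 0.9, ("billAddr", "addr"): 0.7,
           ("shipPhone", "phone"): 0.8, ("billPhone", "phone"): 0.75}
    print(select_matches(sim))  # shipAddr -> addr, shipPhone -> phone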

31 Applying machine learning techniques to enable the schema matcher to learn Schema matching tasks are often repetitive When working in the same domain, one starts to identify how common domain concepts are expressed in schemas  So the designer can create schema matches more quickly over time This raises the questions:  Can the schema matching itself also improve over time?  Or: can a schema matcher learn from previous experience?  Machine learning techniques can be applied to schema matching, thus enabling the matcher to improve over time

32 Learning to match Suppose there are n data sources s1,..., sn whose schemas must be mapped into the mediated schema G Goal:  Train the system by manually providing it with schema matches for a small number of data sources (e.g., s1,..., sm, where m is much smaller than n)  The system generalizes from the training examples so that it is able to predict matches for the sources sm+1,..., sn

33 Components of the system

34 Training phase (1) 1. Manually specify mappings for several sources 2. Extract source data 3. Create training data for each base learner 4. Train the base learners 5. Train the meta-learner

35 Training phase (2) Learning classifiers for elements in the mediated schema  The classifier for an element e in the mediated schema examines an element in a source schema and predicts whether it matches e To create the classifiers, a machine learning algorithm is employed Each machine learning algorithm typically considers only one aspect of the schema and has its own advantages and drawbacks  So, use a multi-strategy learning technique

36 Multi-strategy learning Training phase:  Employ a set of learners l1,..., lk  Each base learner creates a classifier for each element e of the mediated schema from its training examples  Use a meta-learner to learn weights for the different base learners  For each element e of the mediated schema and base learner l, the meta-learner computes a weight w_{e,l}  It can do this because the correct matches are known for the training examples Matching phase:  When presented with a schema S whose elements are e1',..., et'  Apply the base learners to e1',..., et'; let p_{e,l}(e') be the prediction of learner l on whether e' matches e  Combine the learners: p_e(e') = sum_{j=1..k} w_{e,lj} * p_{e,lj}(e')
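
The matching-phase combination is simply a weighted sum of the base learners' predictions; a tiny Python sketch with illustrative weights and predictions:

    def combine(predictions, weights):
        # predictions[l] = p_{e,l}(e'); weights[l] = w_{e,l}, for one element e.
        return sum(weights[l] * predictions[l] for l in predictions)

    preds = {"name_learner": 0.7, "naive_bayes": 0.9}  # made-up values
    w = {"name_learner": 0.4, "naive_bayes": 0.6}
    print(combine(preds, w))  # 0.4*0.7 + 0.6*0.9 = 0.82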

37 Example: Charlie comes to town and wants to find houses with 2 bedrooms priced under 300K, using the listing sites homes.com, realestate.com and homeseekers.com

38 Data integration (architecture diagram): the query “find houses with 2 bedrooms priced under 300K” is posed over a mediated schema, which is connected through wrappers to source schemas 1–3 (homes.com, realestate.com, homeseekers.com)

39 Example (diagram) Mediated schema: address, price, agent-phone, description Schema of realestate.com: location (Miami, FL; Boston, MA;...), listed-price ($250,000; $110,000;...), phone ((305) 729 0831; (617) 253 1429;...), comments (Fantastic house; Great location;...) homes.com: price ($550,000; $320,000;...), contact-phone ((278) 345 7215; (617) 335 2315;...), extra-info (Beautiful yard; Great beach;...) Learned hypotheses:  Rule-based learner: if “phone” occurs in the name => agent-phone  Naive Bayes learner: if “fantastic” & “great” occur frequently in data values => description

40 Training the Learners (diagram, realestate.com against the mediated schema) Name Learner training pairs: (location, address), (listed-price, price), (phone, agent-phone), (comments, description),... Naive Bayes Learner training pairs: (“Miami, FL”, address), (“$250,000”, price), (“(305) 729 0831”, agent-phone), (“Fantastic house”, description),... drawn from source rows such as (Miami, FL; $250,000; (305) 729 0831; Fantastic house) and (Boston, MA; $110,000; (617) 253 1429; Great location)

41 Applying the Learners (diagram, schema of homes.com: area, day-phone, extra-info) For the extra-info values (Beautiful yard; Great beach; Close to Seattle), the Name Learner and Naive Bayes learner return predictions such as (address, 0.8), (description, 0.2) and (address, 0.6), (description, 0.4), which the Meta-Learner combines into (address, 0.7), (description, 0.3) For the day-phone values ((278) 345 7215; (617) 335 2315; (512) 427 1115), the combined prediction is (agent-phone, 0.9), (description, 0.1) The area values are Seattle, WA; Kent, WA; Austin, TX

42 Base Learners Input  schema information: name, proximity, structure,...  data information: value, format,... Output  prediction weighted by confidence score Examples  Name learner agent-name => (name,0.7), (phone,0.3)  Naive Bayes learner “Kent, WA” => (address,0.8), (name,0.2) “Great location” => (description,0.9), (address,0.1)

43 Rule-based learner Examines a set of training examples and computes a set of rules that can be applied to test instances  Rules can be represented as logical formulae or as decision trees  Works well in domains where the set of rules can accurately characterize instances of the class (e.g., identifying elements that adhere to certain formats)

44 Example: rule-based learner for identifying phone numbers (1) Positive and negative examples of phone numbers:
Example           instance?  # of digits  position of (  position of )  position of -
(608)435-2322     yes        10           1              5               9
(60)445-284       no         9            1              4               8
849-7394          yes        7            -              -               4
(1343) 429-441    no         10           1              6               10
43 43 (12 1285)   no         10           5              12              -
5549902           no         7            -              -               -
(212) 433 8842    yes        10           1              5               -

45 Example: rule-based learner for identifying phone numbers (2) A common method to learn rules is to create a decision tree It encodes rules such as:  If i has 10 digits, a ‘(’ in position 1 and a ‘)’ in position 5, then yes  If i has 7 digits but no ‘-’ in position 4, then no ...
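
For illustration, here is a hand-written Python sketch of rules of this shape (a real learner would induce them from the examples above; note that Python indexes positions from 0, while the slide counts from 1):

    import re

    def looks_like_phone(s):
        digits = sum(ch.isdigit() for ch in s)
        # 10 digits with '(' in position 1 and ')' in position 5 (0-based: 0 and 4).
        if digits == 10 and s.find("(") == 0 and s.find(")") == 4:
            return True   # e.g. (608)435-2322, (212) 433 8842
        # 7 digits with '-' in position 4 (0-based: 3), e.g. 849-7394.
        if digits == 7 and re.match(r"^\d{3}-\d{4}$", s):
            return True
        return False

    for s in ["(608)435-2322", "849-7394", "5549902", "43 43 (12 1285)"]:
        print(s, looks_like_phone(s))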

46 Naive Bayes Learner Examines the tokens of a testing instance and assigns to the instance the most likely class given the occurrences of tokens in the training set Effective for recognizing text fields Given a test instance, the learner converts it into a bag of tokens

47 Naive Bayes learner at work Given that c1,..., cn are the elements of the mediated schema, the learner is given a test instance d = {w1,..., wk} to classify Goal: assign d to the element c_d with the highest posterior probability given d: c_d = argmax_{ci} P(ci | d) Since P(ci | d) = P(d | ci) P(ci) / P(d), we get c_d = argmax_{ci} [P(d | ci) P(ci) / P(d)] = argmax_{ci} [P(d | ci) P(ci)] P(d | ci) and P(ci) must be estimated from the training data

48 Estimation of P(d|ci) and P(ci) P(ci) is approximated by the fraction of the training instances with label ci To compute P(d|ci), assume that the tokens wj appear in d independently of each other given ci: P(d|ci) = P(w1|ci) * P(w2|ci) *... * P(wk|ci), with P(wj|ci) = n(wj, ci) / n(ci), where n(ci) is the total number of tokens in the training instances with label ci, and n(wj, ci) is the number of times token wj appears in all training instances with label ci
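
A minimal Python sketch of these estimates, implementing exactly the formulas above (with no smoothing; real implementations typically add Laplace smoothing so that an unseen token does not zero out the whole product):

    from collections import Counter, defaultdict

    def train(examples):
        # examples: list of (tokens, label) pairs.
        label_counts = Counter(label for _, label in examples)
        token_counts = defaultdict(Counter)   # token_counts[c][w] = n(w, c)
        for tokens, label in examples:
            token_counts[label].update(tokens)
        priors = {c: label_counts[c] / len(examples) for c in label_counts}
        return priors, token_counts

    def classify(d, priors, token_counts):
        def score(c):
            n_c = sum(token_counts[c].values())      # n(c_i)
            p = priors[c]                            # P(c_i)
            for w in d:
                p *= token_counts[c][w] / n_c        # P(w_j | c_i)
            return p
        return max(priors, key=score)                # argmax over elements

    examples = [(["fantastic", "house"], "description"),
                (["great", "location"], "description"),
                (["miami", "fl"], "address")]
    priors, counts = train(examples)
    print(classify(["great", "house"], priors, counts))  # description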

49 Conclusion on the Naive Bayes learner Naive Bayes performs well in many domains in spite of the fact that the independence assumption is not always valid It works best when:  There are tokens strongly indicative of the correct label, because they appear in one element and not in the others  Ex: “beautiful”, “fantastic” to describe houses  There are only weakly suggestive tokens, but many of them It does not work well for short or numeric fields

50 Recap: multi-strategy learning Training phase:  Employ a set of learners l1,..., lk  Each base learner creates a classifier for each element e of the mediated schema from its training examples  Use a meta-learner to learn weights for the different base learners  For each element e of the mediated schema and base learner l, the meta-learner computes a weight w_{e,l} Matching phase:  When presented with a schema S whose elements are e1',..., et'  Apply the base learners to e1',..., et'; let p_{e,l}(e') be the prediction of learner l on whether e' matches e  Combine the learners: p_e(e') = sum_{j=1..k} w_{e,lj} * p_{e,lj}(e')

51 Training the meta-learner The meta-learner learns the weights to attach to each of the base learners from the training examples  The weights can be different for every mediated-schema element How does it work?  It asks the base learners for predictions on the training examples  It judges how well each learner performed in providing the prediction for each mediated-schema element  It assigns to each combination (mediated-schema element ci, base learner lj) a weight indicating how much it trusts that learner's predictions regarding ci Any classification algorithm can be used to compute the weights
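
Since any classification or regression method will do, one simple illustrative choice is least squares over the base learners' predictions on the training examples (the numbers below are made up):

    import numpy as np

    # P[t, l]: prediction of base learner l on training example t for element ci;
    # y[t] = 1 if example t truly matches ci, else 0.
    P = np.array([[0.9, 0.6], [0.8, 0.7], [0.2, 0.4], [0.1, 0.3]])
    y = np.array([1.0, 1.0, 0.0, 0.0])

    w, *_ = np.linalg.lstsq(P, y, rcond=None)  # the weights w_{ci,lj}
    print(w)  # how much to trust each learner for element ci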

52 Outline Base matchers Combining match predictions Applying constraints and domain knowledge to candidate schema matches Match selector Applying machine learning techniques to enable the schema matcher to learn Discovering m-m matches From matches to mappings

53 From matches to mappings Schema matches: correspondences between the source and the target schemas Now: specifying the operations to be performed on the source data so that it can be transformed into the target data  Use the DBMS as the transformation engine  Creating mappings then becomes a process of query discovery: find the queries, using joins, unions, filters and aggregates, that correctly transform the data into the desired schema We present an algorithm that explores the space of possible schema mappings  It is used in the CLIO system

54 User interaction Creating mappings is a complex process The system generates the mapping expressions automatically  The possible mappings are produced automatically, using the semantics conveyed by constraints such as foreign keys The system shows the designer example data instances so that she can verify which are the right mappings

55 Motivating example (diagram showing two correspondences, f1 and f2, into the target salary attribute) Question:  Union professor salaries with employee salaries, or  Join salaries computed from the two correspondences?

56 Possible mappings If attribute ProjRank is a foreign key referencing the relation PayRate, then the mapping would be: SELECT P.HrRate * W.Hrs FROM PayRate P, WorksOn W WHERE P.Rank = W.ProjRank Suppose instead that ProjRank is not a foreign key referencing PayRate, but the Name attribute of WorksOn is a foreign key referencing Student, and the Yr attribute of Student is a foreign key referencing PayRate (the salary depends on the year of the student). Then the following query should be chosen: SELECT P.HrRate * W.Hrs FROM PayRate P, WorksOn W, Student S WHERE W.Name = S.Name AND S.Yr = P.Rank It is not clear which join path to choose for mapping f1!

57 Possible mappings (2) One interpretation of f2 is that the values produced from f1 should be joined with those produced by f2  Then most of the values in the source DB would not be mapped to the target Another interpretation: there are two ways of computing the salary of employees, one applying to professors and another to the other employees. The corresponding mapping is: SELECT P.HrRate * W.Hrs FROM PayRate P, WorksOn W, Student S WHERE W.Name = S.Name AND S.Yr = P.Rank UNION ALL SELECT Salary FROM Professor

58 Principles to guide the mapping construction If possible, all values in the source should appear in the target  Choose a union rather than a join If possible, a value from the source should contribute only once to the target  Associations between values that exist in the source should not be lost  Use a join rather than a cartesian product to compute the salary value using f1

59 Possible mappings (3) Consider a few more correspondences: f3: Professor(Id) -> Personnel(Id) f4: Professor(Name) -> Personnel(Name) f5: Address(Addr) -> Personnel(Addr) f6: Student(Name) -> Personnel(Name) They fall into two candidate sets of correspondences:  f2, f3, f4 and f5: map professors to Personnel  f1, f6: map the other (student) employees to Personnel The algorithm explores the possible joins within every candidate set and considers how to union the transformations corresponding to each candidate set.

60 Possible mappings (4) f3: Professor(Id) -> Personnel(Id) f4: Professor(Name) -> Personnel(Name) f5: Address(Addr) -> Personnel(Addr) f6: Student(Name) -> Personnel(Name) The most reasonable mapping is: SELECT P.Id, P.Name, P.Sal, A.Addr FROM Professor P, Address A WHERE A.Id = P.Id UNION ALL SELECT NULL as Id, S.Name, P.HrRate * W.Hrs, NULL as Addr FROM Student S, PayRate P, WorksOn W WHERE S.Name = W.Name AND S.Yr = P.Rank

61 Possible mappings (5) f3: Professor(Id) -> Personnel(Id) f4: Professor(Name) -> Personnel(Name) f5: Address(Addr) -> Personnel(Addr) f6: Student(Name) -> Personnel(Name) But this one is also possible: SELECT NULL as ID, NULL as Name, NULL as Sal, Addr FROM Address A UNION ALL SELECT P.Id, P.Name, P.Sal, NULL as Addr FROM Professor P UNION ALL SELECT NULL as ID, NULL as Name, NULL as Sal, NULL as Addr FROM Student S...

62 Query discovery algorithm – Goal Eliminates unlikely mappings from the large search space of candidate mappings, and identifies correct mappings that a user might not otherwise have considered

63 Query discovery algorithm – Characteristics It is interactive:  It explores the space of possible mappings and proposes the most likely ones to the user It accepts user feedback  To guide it in the right direction It uses heuristics  These can be replaced by better ones if available

64 Query discovery algorithm – Input A set of correspondences M = {fi : (Ai -> Bi)}, where  Ai: a set of attributes in the source S1  Bi: one attribute of the target S2 Possibly, filters on source attributes  Range restriction on an attribute, aggregate of an attribute, etc.

65 Query discovery algorithm – 1st phase Create the collection V of all possible candidate sets (subsets of M) that contain at most one correspondence per attribute of S2  Each candidate set represents one way of computing (some of) the attributes of S2  If a candidate set covers all attributes of S2, it is called a complete set  The candidate sets in V do not need to be disjoint

66 Example Given the correspondences:  f1 : S1.A -> T.C  f2 : S2.A -> T.D  f3 : S2.B -> T.C Then the complete candidate sets are:  {{f1, f2}, {f2, f3}} The singleton sets {f1}, {f2} and {f3} are also candidate sets.
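
A Python sketch of this first phase for the example above, enumerating the subsets of M with at most one correspondence per target attribute (exponential in |M|, which is fine for illustration):

    from itertools import combinations

    M = {"f1": ("S1.A", "T.C"), "f2": ("S2.A", "T.D"), "f3": ("S2.B", "T.C")}
    target_attrs = {"T.C", "T.D"}

    def candidate_sets(M):
        names = list(M)
        for r in range(1, len(names) + 1):
            for subset in combinations(names, r):
                targets = [M[f][1] for f in subset]
                if len(targets) == len(set(targets)):  # <= 1 per target attribute
                    yield set(subset)

    for v in candidate_sets(M):
        complete = {M[f][1] for f in v} == target_attrs
        print(sorted(v), "(complete)" if complete else "")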

67 Query discovery algorithm – 2nd phase Consider the candidate sets in V and search for the best set of joins within each candidate set  Consider a candidate set v in V and suppose (Ai -> Bi) ∈ v, where Ai includes attributes from multiple relations in S1  Then search for a join path connecting the relations mentioned in Ai, using the following heuristic:  A join path can be either: a path through foreign keys; a path proposed by inspecting previous queries on S1; or a path discovered by mining the data for joinable columns in S1

68 Query discovery algorithm – 2nd phase (cont.) The set of candidate sets in V for which we find join paths is denoted by Φ. When there are multiple join paths, use the following heuristic for selecting among them:  Prefer paths through foreign keys  If there are multiple such paths, choose one that involves an attribute on which there is a filter in a correspondence, if such a path exists  To further rank paths, favor the join path where the estimated difference between the outer join and the inner join is the smallest  This favors joins with the smallest number of dangling tuples

69 Query discovery algorithm – 3rd phase Examine the candidate sets in Φ and try to combine them by union so that they cover all the correspondences in M. Search for covers of the correspondences  A subset T of Φ is a cover if it includes all the correspondences in M and is minimal (one cannot remove a candidate set from T and still obtain a cover) Example:  Φ = {{f1, f2}, {f2, f3}, {f1}, {f2}, {f3}}  Possible covers include T1 = {{f1}, {f2, f3}} and T2 = {{f1, f2}, {f2, f3}}
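
A brute-force Python sketch of the cover search for this example (a real system would prune the search; here frozensets stand in for the candidate sets in Φ):

    from itertools import combinations

    def is_cover(T, M):
        covered = set().union(*T) if T else set()
        return covered == M

    def minimal_covers(phi, M):
        covers = []
        for r in range(1, len(phi) + 1):
            for T in combinations(phi, r):
                # A cover is minimal if no candidate set can be dropped.
                if is_cover(T, M) and not any(
                        is_cover(tuple(set(T) - {v}), M) for v in T):
                    covers.append(T)
        return covers

    phi = [frozenset({"f1", "f2"}), frozenset({"f2", "f3"}),
           frozenset({"f1"}), frozenset({"f2"}), frozenset({"f3"})]
    M = {"f1", "f2", "f3"}
    for T in minimal_covers(phi, M):
        print([sorted(v) for v in T])  # T1 and T2 appear among these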

70 Query discovery algorithm – 3rd phase (cont.) If there are multiple possible covers, use the following heuristic:  Choose the cover with the smallest number of candidate sets (a simpler mapping is more likely to be appropriate)  If more than one cover has the same number of candidate sets, choose the one that includes more attributes of S2 (to cover more of that schema)

71 Query discovery algorithm – 4th phase Creates the schema mapping expression as an SQL query  First create an SQL query for each candidate set in the selected cover, and then union them Example: suppose v is a candidate set:  The attributes of S2 in v are put in the SELECT clause  Each of the relations in the join paths found for v is put in the FROM clause  The corresponding join predicates are put in the WHERE clause  Any filters associated with the correspondences in v are also added to the WHERE clause Finally, take the union of the queries for each candidate set in the cover  Computing the SQL mapping expressions for T1 and T2 is left as an exercise
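
As a sketch of this assembly step, plain string building suffices; the inputs below correspond to the student-salary candidate set from the earlier example (this is an illustration, not Clio's actual implementation):

    def sql_for_candidate_set(select_exprs, relations, join_preds, filters=()):
        # SELECT: target attributes; FROM: join-path relations;
        # WHERE: join predicates plus any correspondence filters.
        where = list(join_preds) + list(filters)
        sql = "SELECT " + ", ".join(select_exprs)
        sql += " FROM " + ", ".join(relations)
        if where:
            sql += " WHERE " + " AND ".join(where)
        return sql

    q = sql_for_candidate_set(
        ["S.Name", "P.HrRate * W.Hrs"],
        ["Student S", "PayRate P", "WorksOn W"],
        ["S.Name = W.Name", "S.Yr = P.Rank"])
    print(q)
    # The per-candidate-set queries are then combined with UNION ALL.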

72 The CLIO tool http://www.almaden.ibm.com/cs/projects/criollo/ http://birte08.stanford.edu/ppts/11-ho.pd

73 References Chapter 4 of the draft of the book “Principles of Data Integration” by AnHai Doan, Alon Halevy and Zachary Ives (in preparation). Erhard Rahm and Philip A. Bernstein. “A survey of approaches to automatic schema matching”. VLDB Journal, 10(4):334–350, 2001. AnHai Doan, Pedro Domingos and Alon Halevy. “Reconciling Schemas of Disparate Data Sources: A Machine Learning Approach”. SIGMOD 2001. Renée J. Miller, Laura M. Haas and Mauricio A. Hernandez. “Schema Mapping as Query Discovery”. VLDB 2000. Renée J. Miller, Mauricio A. Hernandez, Laura M. Haas, Ling-Ling Yan, C. T. Howard Ho, Ronald Fagin and Lucian Popa. “The Clio Project: Managing Heterogeneity”. SIGMOD Record 30(1), March 2001, pp. 78–83.

74 Virtual data integration in industry Known as EII: Enterprise Information Integration Different from EAI: Enterprise Application Integration Different from ETL in DW: Extraction, Transformation and Loading in Data Warehousing A good reference: A. Halevy et al. “Enterprise Information Integration: Successes, Challenges and Controversies”. SIGMOD 2005.

