Learning to Map between Structured Representations of Data AnHai Doan Database & Data Mining Group University of Washington, Seattle Spring 2002 Thank you, ..., for your introduction, and thank you all for coming to my talk. Today I’m going to talk about the problem of mapping between data representations. This is joint work with my advisors, Alon Halevy and P. Domingos, and my colleague, J. Madhavan, at the University of Washington. Now, I will show you shortly that the problem of mapping representations arises everywhere. But first, let me motivate it by grounding it in a very specific application context, which is DATA INTEGRATION.
Data Integration Challenge Find houses with 2 bedrooms priced under 200K New faculty member Here is a simple example of data integration. Consider Professor Brown. He was recently a faculty candidate. He has jumped through all hoops, and now has been offered a job at this university. Now he’s busy teaching and doing research. But because he just moved here, he’s also looking to buy a house. Now, he wants to find houses with 2 bedrooms and priced under 200K. Naturally, he turns to the Web to find this information. But he discovers that there are many, many sources on the Web which provide house listings. He realizes that he has to go to EACH of these sources, query the sources individually, then combine the answers to obtain the desired information. He realizes to his horror that this would take hours and hours of his valuable time. So, the problem of finding houses for new professors, together with some other less important problems such as integrating data at enterprises, has generated a lot of research in databases on developing data integration systems. Such systems would allow Professor Brown to obtain the desired information in only seconds, instead of hours. realestate.com homeseekers.com homes.com
Architecture of Data Integration System Find houses with 2 bedrooms priced under 200K mediated schema source schema 1 source schema 2 source schema 3 Here is the architecture of a data integration system over three real-estate sources on the Web. The system has a mediated schema, which describes the real-estate domain, and the source schemas, which describe the content of the sources. Now, instead of interacting with EACH individual source, Professor Brown can simply pose his query in the mediated schema. The system will take the query, translate it into queries in the source schemas, then execute those queries at the sources. It then combines the answers returned from the sources to produce the desired answer to Professor Brown’s query. So, such data integration systems are very compelling, because they will significantly boost the productivity of new professors. BUT, building such systems is still very difficult today. In particular, to build such a system, we must specify the SEMANTIC MAPPINGS between the mediated schema and the source schemas, so that the system can translate queries at the mediated-schema level to the source-schema level. Today, such semantic mappings are still created manually, in a very costly process. So the focus of this talk is on AUTOMATICALLY LEARNING THESE MAPPINGS. realestate.com homeseekers.com homes.com
Semantic Mappings between Schemas Mediated-schema price agent-name address 1-1 mapping complex mapping homes.com listed-price contact-name city state 320K Jane Brown Seattle WA 240K Mike Smith Miami FL However, before doing that, let’s get a better feel for the problem by defining it in more detail. For simplicity, let’s assume that both the mediated schema and source schemas use the RELATIONAL REPRESENTATION. For example, here’s a mediated schema in the real-estate domain, with elements price, agent-name, and address. And here’s the source homes.com, which exports house listings in this relational table; each row in this table corresponds to a house listing. Throughout the talk, we color elements of the mediated schema in red, and elements of source schemas in blue. Now, given a mediated schema and a source schema, the schema-matching problem is to find semantic mappings between the elements of the two schemas. The simplest type of mapping is the 1-1 mapping, such as price to listed-price, and agent-name to contact-name. BUT 1-1 mappings make up only a portion of semantic mappings in practice. There are also a lot of complex mappings, such as address is the concatenation of city and state, or number of bathrooms is the number of full baths plus the number of half baths. In this talk, we shall focus first on finding 1-1 mappings, then on finding complex mappings.
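To make the two kinds of mapping concrete, here is a minimal Python sketch that applies the mappings from this slide to the homes.com rows. The element names and data values are the ones on the slide; the dictionary layout, the helper name to_mediated, and the ", " separator used in the concatenation are assumptions made only for illustration.

```python
# Rows of the homes.com table shown on the slide.
homes_rows = [
    {"listed-price": "320K", "contact-name": "Jane Brown", "city": "Seattle", "state": "WA"},
    {"listed-price": "240K", "contact-name": "Mike Smith", "city": "Miami", "state": "FL"},
]

# Each mediated-schema element is paired with a function over a source row.
mappings = {
    "price": lambda row: row["listed-price"],       # 1-1 mapping
    "agent-name": lambda row: row["contact-name"],  # 1-1 mapping
    # Complex mapping: address is the concatenation of city and state.
    # The ", " separator is an assumption of this sketch.
    "address": lambda row: f'{row["city"]}, {row["state"]}',
}


def to_mediated(row):
    """Translate one source row into the mediated schema (hypothetical helper)."""
    return {element: fn(row) for element, fn in mappings.items()}


mediated_rows = [to_mediated(row) for row in homes_rows]
print(mediated_rows[0])
```

A 1-1 mapping only renames a column, while a complex mapping has to combine several source columns, which is why the two cases are treated separately in the talk.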
Schema Matching is Ubiquitous! Fundamental problem in numerous applications Databases data integration data translation schema/view integration data warehousing semantic query processing model management peer data management AI knowledge bases, ontology merging, information gathering agents, ... Web e-commerce marking up data using ontologies (Semantic Web) But first let’s take a step back and ask: if you are not buying a house, why should you care about this problem? The answer is that you should, because it is a fundamental problem in many areas and in numerous applications. Given any domain, if you ask two persons to describe it, they will almost certainly use different terminologies. Thus any application that involves more than one such description must establish semantic mappings between them, in order to have INTEROPERABILITY. As a consequence, variations of this problem arise everywhere. It has been a long-standing problem in databases and is becoming increasingly critical. It arises in AI, in the context of ontology merging and information gathering on the Internet. It arises in e-commerce, as the problem of matching catalogs. It is also a fundamental problem in the context of the Semantic Web, which tries to add more structure to the Web by marking up data using ontologies. There we have the problem of matching ontologies. Now, if this problem is so important, why has no one solved it? [REPLACE SLIDE]
Why Schema Matching is Difficult Schema & data never fully capture semantics! not adequately documented Must rely on clues in schema & data using names, structures, types, data values, etc. Such clues can be unreliable same names => different entities: area => location or square-feet different names => same entity: area & address => location Intended semantics can be subjective house-style = house-description? Cannot be fully automated, needs user feedback! The answer is that a lot of progress on this problem has been made, but it still remains a very difficult problem. The fundamental reason for this is because the schema and data never fully capture the intended meaning of the schema elements. You have a relational table, and the table was created 17 years ago, by someone who has retired. Now you have no idea what a table attribute, that is, a schema element, means, let alone trying to match it to elements in another table. So, what do you do? You must rely on clues in schema and data to match elements. You compare for example the names, structures, types, and data values of the elements. The problem is that such clues can be unreliable. You have elements that have the same name, but refer to different real-world entities. For example, an element with name “area” can refer to either the house location or the square feet area of the house. You also have the reverse problem, where elements that have different names refer to the same real-world entity. For example, both “area” and “address” can refer to the house location. To make the problem worse, the intended semantics is also frequently subjective, depending on the application. For example, one application may decide that house-style matches house-description, whereas another application may not. The result is that this problem cannot be fully automated. We do need the user in the loop to provide feedback. [LONG PAUSE]
Current State of Affairs Finding semantic mappings is now a key bottleneck! largely done by hand labor intensive & error prone data integration at GTE [Li&Clifton, 2000] 40 databases, 27000 elements, estimated time: 12 years Will only be exacerbated data sharing becomes pervasive translation of legacy data Need semi-automatic approaches to scale up! Many current research projects Databases: IBM Almaden, Microsoft Research, BYU, George Mason, U of Leipzig, ... AI: Stanford, Karlsruhe University, NEC Japan, ... So, what do people do today with semantic mappings? Unfortunately, they still must create them by hand, in a very labor intensive process. For example, Li&Clifton recently reported that at the phone company GTE people tried to integrate 40 databases, which have a total of 27000 elements, and they estimated that simply finding and documenting the semantic mappings would take them 12 years, unless they have the owners of the databases around. Thus, finding semantic mappings has now become a key bottleneck in building large-scale data management applications. And this problem is going to be even more critical, as data sharing becomes even more pervasive on the Web and at enterprises, and as the need for translating legacy data increases. Clearly, we need semi-automatic solutions to schema matching, in order to scale up. And there have been a lot of research works on such solutions, in both databases and AI.
Goals and Contributions Vision for schema-matching tools learn from previous matching activities exploit multiple types of information incorporate domain integrity constraints handle user feedback My contributions: solution for semi-automatic schema matching can match relational schemas, DTDs, ontologies, ... discovers both 1-1 & complex mappings highly modular & extensible achieves high matching accuracy (66 -- 97%) on real-world data In this context, here are my vision and contributions to solving this problem. First, I believe that a semi-automatic schema matching tool must be able to learn from previous matching activities. It must be able to LOOK OVER the user’s shoulder, to see how the user matches schemas, to learn from those matchings, and then to propose mappings itself. Second, I have shown you before that each information type in schema and data can be unreliable, thus it is desirable that such tools exploit not just a single type of information, but multiple types, to increase matching accuracy. Third, there is typically a wealth of domain constraints, such as “a house cannot have two addresses”, that such tools should exploit to maximize accuracy. Finally, as I have mentioned, the subjectivity of semantic mappings means user feedback is essential, and so such tools must efficiently incorporate such feedback. So here are my contributions. First, I developed a solution to semi-automatic schema matching that has all the desirable properties mentioned above. The solution can leverage past mappings to predict new mappings. It employs a technique called multi-strategy learning to exploit multiple types of information. It can also efficiently incorporate a wide range of domain constraints and user feedback. The solution focuses on 1-1 mappings for data integration. I then extended it further to handle complex mappings, and to match complex representations such as ontologies.
The result is a set of techniques that can match relational schemas, XML DTDs, and ontologies, that can discover both 1-1 and complex mappings, that is highly extensible and easily adapted to new application domains, and that is shown empirically to achieve high accuracy on real-world data. In the rest of the talk, I shall elaborate on these contributions.
Road Map Introduction Schema matching [SIGMOD-01] 1-1 mappings for data integration LSD (Learning Source Descriptions) system learns from previous matching activities employs multi-strategy learning exploits domain constraints & user feedback Creating complex mappings [Tech. Report-02] Ontology matching [WWW-02] Conclusions So here’s a roadmap for the rest of the talk. I have motivated the problem and described my contributions. Next I will describe the LSD system that performs schema matching for data integration. The name LSD stands for Learning Source Descriptions. This part is the meat of the talk, and I will show how LSD learns from previous matchings, how it employs multi-strategy learning, and how it exploits domain constraints and user feedback. After this, I briefly describe how I have extended LSD to handle more complex mappings and more complex data representations, specifically in the context of ontology matching. Then I discuss future work and conclude.
Schema Matching for Data Integration: the LSD Approach Suppose user wants to integrate 100 data sources 1. User manually creates mappings for a few sources, say 3 shows LSD these mappings 2. LSD learns from the mappings 3. LSD predicts mappings for remaining 97 sources The key idea underlying the LSD approach is as follows: Suppose the user wants to build a data integration system that integrates 100 data sources. And in the rest of the talk, by “user” I mean the “system builder”, not ordinary users such as Professor Brown. Then, first, the user manually creates all 1-1 mappings for a few sources, let’s say 3, and shows the LSD system these mappings. Second, LSD uses these mappings to learn how to match source schemas. Finally, LSD uses what it has learned to propose semantic mappings for the remaining 97 sources. This way, the relatively little work that the user performs on the first 3 sources will be amortized nicely over the remaining 97 sources. And the more sources the system has, the larger the pay-off. [QUESTION ON WHAT HAPPENS IF SOURCE SCHEMAS DO NOT ENTIRELY COVER THE MEDIATED SCHEMA.]
Learning from the Manual Mappings Mediated schema price agent-name agent-phone office-phone description If “office” occurs in name => office-phone listed-price contact-name contact-phone office comments Schema of realestate.com realestate.com listed-price contact-name contact-phone office comments $250K James Smith (305) 729 0831 (305) 616 1822 Fantastic house $320K Mike Doan (617) 253 1429 (617) 112 2315 Great location Here’s an example to illustrate our approach. Consider a mediated schema with elements price, agent-name, agent-phone, and so on. To apply LSD, first the user selects a few sources to be the training sources. In this case, the user selects a single source, realestate.com [POINT]. Next, the user manually specifies the 1-1 mappings between the schema of this source and the mediated schema. These mappings are the five green arrows right here [POINT], which say that listed-price matches price, contact-name matches agent-name, and so on. Once the user has shown LSD these 1-1 mappings, there are many different types of information that LSD could learn from, in order to construct hypotheses on how to match schema elements. For example, LSD could learn from the names of schema elements. Knowing that office matches office-phone, it may construct the hypothesis [POINT] that if the word “office” occurs in the name of a schema element, then that element is likely to be office-phone. LSD could also learn from the data values. Because comments matches description, LSD knows that these data values here [POINT] are house descriptions. It could then examine them to learn that house descriptions frequently contain words such as fantastic, great, and beautiful. Hence, it may construct the hypothesis [POINT] that if these words appear frequently in the data values of an element, then that element is likely to be description. LSD could also learn from the characteristics of value distributions.
For example, it can look at the average value of this column [POINT], and learn that if the average value is in the thousands, then the element is more likely to be price than the number of bathrooms. And so on. Now, consider the source homes.com, with these schema elements [POINT] and these data values [POINT]. LSD can apply the learned hypotheses to the schema and the data values, in order to predict semantic mappings. For example, because the words “beautiful” and “great” appear frequently in these data values, LSD can predict that “extra-info” matches “description”. ... and the solution to this is MULTI-STRATEGY LEARNING [PAUSE] If “fantastic” & “great” occur frequently in data instances => description homes.com sold-at contact-agent extra-info $350K (206) 634 9435 Beautiful yard $230K (617) 335 4243 Close to Seattle
Must Exploit Multiple Types of Information!
Multi-Strategy Learning Use a set of base learners each exploits well certain types of information To match a schema element of a new source apply base learners combine their predictions using a meta-learner Meta-learner uses training sources to measure base learner accuracy weighs each learner based on its accuracy In multi-strategy learning, we employ a set of base learners, each exploits well certain types of information. Then, to match a schema element of a new source, we apply the learners, then combine their predictions using a meta-learner. The basic idea behind the meta-learner is that it should be smart enough to figure out which learner is best at matching which type of schema elements, and trust that learner more than other learners. To do this, it uses the training sources to measure the accuracy of each learner, then in combining learner predictions, it weighs each learner based on its accuracy on the training sources. In the next several slides, I will describe the multi-strategy learning approach in more detail.
Base Learners Training Matching Name Learner Naive Bayes Learner training: (“location”, address) (“contact name”, name) matching: agent-name => (name,0.7),(phone,0.3) Naive Bayes Learner training: (“Seattle, WA”,address) (“250K”,price) matching: “Kent, WA” => (address,0.8),(name,0.2) (X1,C1) (X2,C2) ... (Xm,Cm) Observed label Training examples Object Classification model (hypothesis) X labels weighted by confidence score Let’s start with the base learners. As a learning program, a base learner operates in two phases. In the training phase, the base learner examines a set of training examples, where each example is a pair of an object and its observed label, then constructs an internal classification model on how to assign labels to objects. These models are the hypotheses that I mentioned earlier. Then in the matching phase, when given an object, the base learner will apply the classification model to make a prediction for the object. The prediction will be labels weighted by confidence scores. For example, consider the Name Learner and the Naive Bayes Learner. The Name Learner matches schema elements based on their names. Thus, its training examples look like this. This example says that if a schema element has the name “location”, then it matches address, and so on. Once the Name Learner has been trained on these examples, in the matching phase, given the name “agent name”, the Name Learner may predict that the schema element with this name matches name with confidence 0.7, and matches phone with confidence 0.3. The Naive Bayes Learner matches schema elements based on the frequencies of words and symbols in their data values. Thus, its training examples look like this. This example says that the data value “Seattle, WA” is an address, and so on. Once the NB learner has been trained on these examples, in the matching phase, given the data value “Kent, WA”, the learner may predict that it is an address with confidence 0.8, and a name with confidence 0.2.
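The two-phase behavior of a base learner can be sketched in a few lines of Python. This is not LSD's implementation; it is a small multinomial Naive Bayes classifier with add-one smoothing over word and symbol tokens, written under my own assumptions about tokenization. The class name, method names, and the tokenize helper are hypothetical, and the training values are borrowed from the slides.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(value):
    """Split a data value into lowercase words, digit runs, and single symbols."""
    return re.findall(r"[a-z]+|\d+|[^\sa-z\d]", value.lower())


class NaiveBayesLearner:
    """Illustrative base learner: train on (value, label) pairs, then predict."""

    def train(self, examples):
        self.token_counts = defaultdict(Counter)  # label -> token frequencies
        self.label_counts = Counter()             # label -> number of examples
        for value, label in examples:
            self.label_counts[label] += 1
            self.token_counts[label].update(tokenize(value))
        self.vocab = {t for counts in self.token_counts.values() for t in counts}

    def predict(self, value):
        """Return labels weighted by confidence scores that sum to 1."""
        tokens = tokenize(value)
        total = sum(self.label_counts.values())
        log_scores = {}
        for label, n in self.label_counts.items():
            counts = self.token_counts[label]
            # Add-one smoothing, with one extra slot for unseen tokens.
            denom = sum(counts.values()) + len(self.vocab) + 1
            score = math.log(n / total)
            for token in tokens:
                score += math.log((counts[token] + 1) / denom)
            log_scores[label] = score
        top = max(log_scores.values())
        unnormalized = {l: math.exp(s - top) for l, s in log_scores.items()}
        z = sum(unnormalized.values())
        return {l: v / z for l, v in unnormalized.items()}


training_examples = [
    ("Miami, FL", "address"), ("Boston, MA", "address"), ("Seattle, WA", "address"),
    ("$250K", "price"), ("$320K", "price"),
    ("James Smith", "agent-name"), ("Mike Doan", "agent-name"),
    ("Fantastic house", "description"), ("Great location", "description"),
]

learner = NaiveBayesLearner()
learner.train(training_examples)
prediction = learner.predict("Kent, WA")
print(max(prediction, key=prediction.get))
```

With this tiny training set, the comma and the state token "WA" are enough to make address the most confident label for "Kent, WA". The exact confidence values depend on the smoothing and tokenization choices, so they will not match the numbers on the slide.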
The LSD Architecture Training Phase Matching Phase Mediated schema Source schemas Training data for base learners Base-Learner1 .... Base-Learnerk Meta-Learner Base-Learner1 Base-Learnerk Predictions for instances Hypothesis1 Hypothesisk Prediction Combiner Domain constraints Predictions for elements I am now in a position to show you the working of the entire LSD system. LSD operates in 2 phases. In the training phase, the user manually provides the mappings for a few sources. LSD then uses these sources to create training data for the base learners. Next, LSD trains the base learners on the training data. The output of training the base learners will be a set of hypotheses. LSD also uses the training data to train the Meta-Learner, and the output of this training process is a set of weights for the base learners. In the matching phase, given a source, LSD uses the source to construct data columns, one for each source-schema element. Next, for each data column [POINT] LSD applies the base learners, then combines their predictions using the meta-learner. The output of the meta-learner is predictions for the data instances. The Prediction Combiner (PC) then combines these predictions to produce predictions for source-schema elements. The constraint handler (CH) then takes the predictions output by the PC, together with the domain constraints, and outputs a mapping combination, which has one mapping for each schema element. If not happy with this mapping combination, the user can specify additional constraints as the input to the constraint handler [POINT]. The constraint handler outputs a new mapping combination, and this goes on until the user is happy. Now I will describe the working of LSD in more detail. First, I will describe training the base learners and then describe the Meta-Learner. [POINT] Constraint Handler Weights for Base Learners Meta-Learner Mappings
Training the Base Learners Mediated schema address price agent-name agent-phone office-phone description realestate.com location price contact-name contact-phone office comments Miami, FL $250K James Smith (305) 729 0831 (305) 616 1822 Fantastic house Boston, MA $320K Mike Doan (617) 253 1429 (617) 112 2315 Great location (“location”, address) (“price”, price) (“contact name”, agent-name) (“contact phone”, agent-phone) (“office”, office-phone) (“comments”, description) Name Learner Naive Bayes Learner (“Miami, FL”, address) (“$250K”, price) (“James Smith”, agent-name) (“(305) 729 0831”, agent-phone) (“(305) 616 1822”, office-phone) (“Fantastic house”, description) (“Boston,MA”, address) To train the base learners, first LSD must create training data for them from several sources. Here’s how it creates training data for the Name Learner and the NB Learner from the source realestate.com. Again, suppose that the user has specified all the 1-1 mappings between this source and the mediated schema [POINT]. Then since the name learner matches schema elements based on their names, its training data are the tuples consisting of the names of source-schema elements and their corresponding mediated-schema elements [POINTING AT THE TWO COLUMNS], for example, location and address, price and price, and so on. Since the Naive Bayes learner matches schema elements by looking at word frequencies in their data values, its training data are the tuples consisting of the data values of source-schema elements and their corresponding mediated-schema element, for example, Miami, FL is an address, 250K is a price, and so on. If LSD has one more source for training, it repeats the same process, and adds the training data obtained from that source to the training data of the name learner and the Naive Bayes learner. Once LSD has created training data for the learners, it trains them on the training data [QUESTION: WHAT IT MEANS TO TRAIN THEM?] 
to construct internal hypotheses on matching, as I explained before.
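The construction of training data from one manually mapped source can be sketched as follows. The rows and mappings are the realestate.com example from the slide; the function names and the dictionary layout are assumptions for illustration, not LSD's actual data structures.

```python
# realestate.com rows as shown on the slide.
realestate_rows = [
    {"location": "Miami, FL", "price": "$250K", "contact-name": "James Smith",
     "contact-phone": "(305) 729 0831", "office": "(305) 616 1822",
     "comments": "Fantastic house"},
    {"location": "Boston, MA", "price": "$320K", "contact-name": "Mike Doan",
     "contact-phone": "(617) 253 1429", "office": "(617) 112 2315",
     "comments": "Great location"},
]

# The user's manual 1-1 mappings: source-schema element -> mediated-schema element.
manual_mappings = {
    "location": "address",
    "price": "price",
    "contact-name": "agent-name",
    "contact-phone": "agent-phone",
    "office": "office-phone",
    "comments": "description",
}


def name_learner_examples(mappings):
    """(element name, label) pairs; hyphens become spaces, as on the slide."""
    return [(name.replace("-", " "), label) for name, label in mappings.items()]


def naive_bayes_examples(rows, mappings):
    """(data value, label) pairs, one per cell of every mapped column."""
    return [(row[column], mappings[column]) for row in rows for column in mappings]


name_examples = name_learner_examples(manual_mappings)
nb_examples = naive_bayes_examples(realestate_rows, manual_mappings)
print(name_examples[:2])
print(nb_examples[:2])
```

A second training source would go through the same two functions, and its examples would simply be appended to these two lists.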
Meta-Learner: Stacking [Wolpert 92,Ting&Witten99] Training uses training data to learn weights one for each (base-learner,mediated-schema element) pair weight (Name-Learner,address) = 0.2 weight (Naive-Bayes,address) = 0.8 Matching: combine predictions of base learners computes weighted average of base-learner confidence scores STATE-OF-THE-ART MACHINE LEARNING METHOD area Name Learner Naive Bayes (address,0.4) (address,0.9) Seattle, WA Kent, WA Bend, OR Meta-Learner (address, 0.4*0.2 + 0.9*0.8 = 0.8)
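The matching step of the meta-learner on this slide is a weighted combination of base-learner confidence scores. Here is a minimal Python sketch of just that step, using the weights and predictions from the slide. How the weights are learned from the training data is not shown, and the function name and data layout are assumptions.

```python
# One weight per (base learner, mediated-schema element) pair, from the slide.
weights = {
    ("Name Learner", "address"): 0.2,
    ("Naive Bayes", "address"): 0.8,
}

# Base-learner predictions for one instance of the source element "area".
base_predictions = {
    "Name Learner": {"address": 0.4},
    "Naive Bayes": {"address": 0.9},
}


def combine_base_predictions(predictions, weights):
    """Sum weight * confidence over the base learners, separately per label."""
    combined = {}
    for learner, scores in predictions.items():
        for label, confidence in scores.items():
            weight = weights.get((learner, label), 0.0)
            combined[label] = combined.get(label, 0.0) + weight * confidence
    return combined


meta_prediction = combine_base_predictions(base_predictions, weights)
print(round(meta_prediction["address"], 2))  # 0.4*0.2 + 0.9*0.8 = 0.8
```

Because the two weights for address sum to 1 in this example, the weighted sum is also a weighted average, which matches the wording on the slide.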
The LSD Architecture Training Phase Matching Phase Mediated schema Source schemas Training data for base learners Base-Learner1 .... Base-Learnerk Meta-Learner Base-Learner1 Base-Learnerk Predictions for instances Hypothesis1 Hypothesisk Prediction Combiner Domain constraints Predictions for elements Again, here’s the slide on the working of LSD. I have just explained the training of the base learners, and the training and matching of the meta-learner. Now I will explain the use of these learners in the matching phase, and the working of the PC and the CH. Constraint Handler Weights for Base Learners Meta-Learner Mappings
Applying the Learners homes.com schema area sold-at contact-agent extra-info area Name Learner Naive Bayes Seattle, WA Kent, WA Bend, OR Meta-Learner (address,0.8), (description,0.2) (address,0.6), (description,0.4) (address,0.7), (description,0.3) Name Learner Naive Bayes Meta-Learner Prediction-Combiner homes.com (address,0.7), (description,0.3) sold-at (price,0.9), (agent-phone,0.1) In the matching phase, suppose we have a source homes.com, and we would like to match the schema element area of this source. LSD starts by applying the base learners to each data instance of area. For simplicity, consider only two base learners: Name Learner and NB Learner. Consider this first instance [POINT]. The name learner will take its name, which is area, and issue a prediction. The Naive Bayes learner will take its data value, which is Seattle, WA, and issue another prediction. The meta-learner then combines the two predictions into this single prediction [POINT]. LSD does the same thing for the remaining two instances. Note that in all cases the input to the name learner is the same, which is area. We arrive at these three predictions [POINT], one for each instance. Now the PC combines the three predictions using a simple heuristic, which is just averaging the confidence scores in this case, to arrive at this single prediction. This prediction says that source-schema element “area” matches “address” with confidence 0.7, and matches “description” with confidence 0.3. We proceed similarly with the other 3 source-schema elements: sold-at, contact-agent, and extra-info, to arrive at these 3 predictions. Now, if we stop here, then the mapping combination that we return would be [POINT] area maps to address, sold-at maps to price, contact-agent maps to agent-phone, and extra-info maps to address. Clearly, this mapping assignment is good, but not perfect because it maps both area and extra-info to address. Intuitively, a house listing does not have two addresses.
If we can exploit this domain constraint then we should be able to improve the above mapping assignment. That's the idea behind handling domain constraints. [PUT THE NEXT SLIDE ON AND PAUSE BRIEFLY.] contact-agent (agent-phone,0.9), (description,0.1) extra-info (address,0.6), (description,0.4)
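The averaging heuristic of the Prediction Combiner can be sketched in Python as follows. The three instance-level predictions for "area" are the ones on the slide; the function name is hypothetical.

```python
# Meta-learner output for the three data instances of "area" (from the slide).
instance_predictions = [
    {"address": 0.8, "description": 0.2},  # Seattle, WA
    {"address": 0.6, "description": 0.4},  # Kent, WA
    {"address": 0.7, "description": 0.3},  # Bend, OR
]


def combine_instance_predictions(predictions):
    """Average each label's confidence over all instances of one element."""
    labels = {label for prediction in predictions for label in prediction}
    return {
        label: sum(p.get(label, 0.0) for p in predictions) / len(predictions)
        for label in labels
    }


element_prediction = combine_instance_predictions(instance_predictions)
print({label: round(score, 2) for label, score in sorted(element_prediction.items())})
```

Averaging the three address scores gives (0.8 + 0.6 + 0.7) / 3 = 0.7, and the description scores average to 0.3, which is the element-level prediction shown for area.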
Domain Constraints Encode user knowledge about domain Specified by examining mediated schema Examples at most one source-schema element can match address if a source-schema element matches house-id then it is a key avg-value(price) > avg-value(num-baths) Given a mapping combination can verify if it satisfies a given constraint Intuitively, domain constraints encode user knowledge about the domain. They are specified only once by the user, when examining the mediated schema. For example, in the real-estate domain, a user can specify constraints such as “at most one source-schema element matches address”, or “if a source-schema element matches house-id, then it is a key”, or “if a source-schema element matches price, and another element matches number of bathrooms, then the average value of the former must be greater than that of the latter”. The domain constraints are such that LSD can verify if a given mapping combination satisfies a given constraint. For example, ... area: address sold-at: price contact-agent: agent-phone extra-info: address
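One way to picture "verifiable" constraints is as Boolean functions over a mapping combination and the source data. The Python sketch below encodes two of the constraints from the slide this way. The function signatures and the column layout are assumptions of this sketch, not LSD's constraint language.

```python
# homes.com data columns, one list of values per source-schema element.
homes_columns = {
    "area": ["Seattle, WA", "Kent, WA", "Bend, OR"],
    "sold-at": ["$350K", "$230K"],
    "contact-agent": ["(206) 634 9435", "(617) 335 4243"],
    "extra-info": ["Beautiful yard", "Close to Seattle"],
}


def at_most_one_address(mapping, columns):
    """At most one source-schema element can match address."""
    return sum(1 for label in mapping.values() if label == "address") <= 1


def house_id_is_key(mapping, columns):
    """If a source-schema element matches house-id, its values must be unique."""
    for element, label in mapping.items():
        if label == "house-id":
            values = columns[element]
            if len(set(values)) != len(values):
                return False
    return True


def satisfies_all(mapping, columns, constraints):
    """Check one mapping combination against every constraint."""
    return all(constraint(mapping, columns) for constraint in constraints)


constraints = [at_most_one_address, house_id_is_key]

# The mapping combination on the slide maps two elements to address.
slide_mapping = {"area": "address", "sold-at": "price",
                 "contact-agent": "agent-phone", "extra-info": "address"}
repaired_mapping = dict(slide_mapping, **{"extra-info": "description"})

print(satisfies_all(slide_mapping, homes_columns, constraints))
print(satisfies_all(repaired_mapping, homes_columns, constraints))
```

The first check fails because both area and extra-info map to address, while the second passes. The house-id constraint holds trivially here because no element is mapped to house-id.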
The Constraint Handler Predictions from Prediction Combiner Domain Constraints At most one element matches address area: (address,0.7), (description,0.3) sold-at: (price,0.9), (agent-phone,0.1) contact-agent: (agent-phone,0.9), (description,0.1) extra-info: (address,0.6), (description,0.4) area: address sold-at: price contact-agent: agent-phone extra-info: address 0.7 0.9 0.6 0.3402 area: address sold-at: price contact-agent: agent-phone extra-info: description 0.7 0.9 0.4 0.2268 0.3 0.1 0.4 0.0012 Now I'll describe the constraint handler, which exploits domain constraints. Conceptually, the constraint handler takes the predictions output by the PC, creates all possible mapping combinations, then assigns to each combination a confidence score. For example, this combination says that area maps to address, sold-at maps to price, contact-agent maps to agent-phone, and extra-info maps to address. Its confidence score [POINT] is the product of these individual predictions [POINT], which come from here [POINT]. The combinations shown here are sorted in decreasing order of their confidence score. The constraint handler then searches the space of mapping combinations to find the one with the highest confidence score that satisfies the domain constraints. Here, given the only domain constraint [POINT] that no two source-schema elements can map to address, the constraint handler will prune this combination [POINT], and return this one [POINT]. Typically, the space of mapping combinations is huge, so brute-force search is not possible. Our LSD implementation uses a specialized search technique to search this space efficiently. Our constraint handler has several desirable characteristics. First, it can handle arbitrary domain constraints, as long as they can be verified on any mapping combination. Second, it can easily handle user feedback by treating it as additional domain constraints.
Searches space of mapping combinations efficiently Can handle arbitrary constraints Also used to incorporate user feedback sold-at does not match price
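The constraint handler's conceptual behavior on the slide's numbers can be sketched as follows. This is a brute-force enumeration for illustration only; as noted, LSD itself uses a specialized search technique, and the function names here are hypothetical.

```python
from itertools import product

# Predictions from the Prediction Combiner, copied from the slide.
predictions = {
    "area": {"address": 0.7, "description": 0.3},
    "sold-at": {"price": 0.9, "agent-phone": 0.1},
    "contact-agent": {"agent-phone": 0.9, "description": 0.1},
    "extra-info": {"address": 0.6, "description": 0.4},
}

def at_most_one_address(mapping):
    """Domain constraint: at most one element matches address."""
    return list(mapping.values()).count("address") <= 1

def best_mapping(predictions, constraints):
    """Enumerate every mapping combination, score it as the product of the
    individual predictions, and keep the best one that passes all constraints."""
    elements = list(predictions)
    best, best_score = None, 0.0
    for labels in product(*(predictions[e] for e in elements)):
        mapping = dict(zip(elements, labels))
        score = 1.0
        for element, label in mapping.items():
            score *= predictions[element][label]
        if score > best_score and all(ok(mapping) for ok in constraints):
            best, best_score = mapping, score
    return best, best_score

mapping, score = best_mapping(predictions, [at_most_one_address])

# User feedback is treated as one more constraint.
def sold_at_is_not_price(mapping):
    return mapping["sold-at"] != "price"

mapping_fb, score_fb = best_mapping(
    predictions, [at_most_one_address, sold_at_is_not_price])
```

With only the address constraint, the top-scoring combination (0.3402) is pruned and the 0.2268 combination is returned. Adding the feedback "sold-at does not match price" forces sold-at onto its next candidate.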
The Current LSD System Can also handle data in XML format matches XML DTDs Base learners Naive Bayes [Duda&Hart-93, Domingos&Pazzani-97] exploits frequencies of words & symbols WHIRL Nearest-Neighbor Classifier [Cohen&Hirsh KDD-98] employs information-retrieval similarity metric Name Learner [SIGMOD-01] matches elements based on their names County-Name Recognizer [SIGMOD-01] stores all U.S. county names XML Learner [SIGMOD-01] exploits hierarchical structure of XML data
Empirical Evaluation Four domains: Real Estate I & II, Course Offerings, Faculty Listings. For each domain: created mediated schema & domain constraints; chose five sources; extracted & converted data into XML; mediated schemas: 14 - 66 elements, source schemas: 13 - 48. Ten runs for each domain, in each run: manually provided 1-1 mappings for 3 sources; asked LSD to propose mappings for remaining 2 sources; accuracy = % of 1-1 mappings correctly identified; accuracy is on test sources
High Matching Accuracy Average Matching Accuracy (%) LSD’s accuracy: 71 - 92% Best single base learner: 42 - 72% + Meta-learner: + 5 - 22% + Constraint handler: + 7 - 13% + XML learner: + 0.8 - 6%
Contribution of Schema vs. Data Average matching accuracy (%): LSD with only schema info, LSD with only data info, complete LSD. More experiments in [Doan et al., SIGMOD-01] show that LSD is robust with respect to the amount of data available and can handle user feedback efficiently.
LSD Summary LSD: learns from previous matching activities; exploits multiple types of information by employing multi-strategy learning; incorporates domain constraints & user feedback; achieves high matching accuracy. LSD focuses on 1-1 mappings. Next challenge: discover more complex mappings! COMAP (Complex Mapping) system
The COMAP Approach Mediated schema: price, num-baths, address. Source homes.com: listed-price, agent-id, full-baths, half-baths, city, zipcode; sample rows: (320K, 53211, 2, 1, Seattle, 98105), (240K, 11578, 1, 1, Miami, 23591). For each mediated-schema element: searches space of all mappings; finds a small set of likely mapping candidates; uses LSD to evaluate them. To search efficiently: employs a specialized searcher for each element type (Text Searcher, Numeric Searcher, Category Searcher, ...)
The COMAP Architecture [Doan et al., 02] Mediated schema Source schema + data Searcher1 Searcher2 Searcherk Mapping candidates Base-Learner1 .... Base-Learnerk Meta-Learner Prediction Combiner Domain constraints Constraint Handler LSD Mappings
An Example: Text Searcher Beam search in space of all concatenation mappings. Example: find mapping candidates for address. Mediated schema: price, num-baths, address. Source homes.com: listed-price, agent-id, full-baths, half-baths, city, zipcode; sample rows: (320K, 532a, 2, 1, Seattle, 98105), (240K, 115c, 1, 1, Miami, 23591). Candidate values: concat(agent-id,city): 532a Seattle, 115c Miami; concat(agent-id,zipcode): 532a 98105, 115c 23591; concat(city,zipcode): Seattle 98105, Miami 23591. Best mapping candidates for address: (agent-id,0.7), (concat(agent-id,city),0.75), (concat(city,zipcode),0.9)
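A beam search over concatenation mappings might look roughly like the sketch below. The scoring function is a hypothetical stand-in (a crude "place word plus 5-digit number" heuristic), not the learned score COMAP obtains from LSD, so the scores it produces do not match the ones on the slide; `beam_width` and `max_columns` are likewise assumed parameters.

```python
import re

# Source rows from the homes.com example on the slide.
rows = [
    {"listed-price": "320K", "agent-id": "532a", "full-baths": "2",
     "half-baths": "1", "city": "Seattle", "zipcode": "98105"},
    {"listed-price": "240K", "agent-id": "115c", "full-baths": "1",
     "half-baths": "1", "city": "Miami", "zipcode": "23591"},
]

def concat_values(columns, rows):
    """Values of concat(columns...) for every row."""
    return [" ".join(row[c] for c in columns) for row in rows]

def toy_address_score(values):
    """Hypothetical scorer: half credit for a word of 3+ letters,
    half credit for a 5-digit number, averaged over the rows."""
    def one(value):
        has_word = re.search(r"\b[A-Za-z]{3,}\b", value) is not None
        has_zip = re.search(r"\b\d{5}\b", value) is not None
        return 0.5 * has_word + 0.5 * has_zip
    return sum(one(v) for v in values) / len(values)

def beam_search(rows, score, beam_width=2, max_columns=2):
    """Score candidates level by level, expanding only the best
    `beam_width` candidates of each level with one more column."""
    columns = list(rows[0])
    frontier = [(c,) for c in columns]
    scored = {}
    for _ in range(max_columns):
        for candidate in frontier:
            scored[candidate] = score(concat_values(candidate, rows))
        beam = sorted(frontier, key=lambda c: (-scored[c], c))[:beam_width]
        frontier = [cand + (c,) for cand in beam
                    for c in columns if c not in cand]
    return sorted(scored.items(), key=lambda kv: (-kv[1], kv[0]))

candidates = beam_search(rows, toy_address_score)
best_columns, best_score = candidates[0]
```

Because only the top few candidates at each level are extended, the searcher never enumerates every possible concatenation, which is the point of using beam search here.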
Empirical Evaluation Current COMAP system: eight searchers. Three real-world domains in real estate & product inventory; mediated schema: 6 -- 26 elements, source schema: 16 -- 31. Accuracy: 62 -- 97%. Sample discovered mappings: agent-name = concat(first-name,last-name); area = building-area / 43560; discount-cost = (unit-price * quantity) * (1 - discount)
Road Map Introduction; Schema matching: LSD system; Creating complex mappings: COMAP system; Ontology matching: GLUE system; Conclusions
Ontology Matching Increasingly critical for knowledge bases, Semantic Web. An ontology: concepts organized into a taxonomy tree; each concept has a set of attributes and a set of instances; relations among concepts. Matching: concepts, attributes, relations. Example taxonomy (CS Dept. US): Entity; Undergrad Courses, Grad Courses, People; Faculty, Staff; Assistant Professor, Associate Professor, Professor; instance: name: Mike Burns, degree: Ph.D. Schemas can be viewed as ontologies with restricted relationship types
Matching Taxonomies of Concepts CS Dept. US: Entity, Undergrad Courses, Grad Courses, People, Staff, Faculty, Assistant Professor, Associate Professor. CS Dept. Australia: Entity, Courses, Staff, Academic Staff, Technical Staff, Senior Lecturer, Lecturer, Professor
Constraints in Taxonomy Matching Ontology structure provides many constraints. Domain-dependent: at most one node matches department-chair; a node that matches professor cannot be a child of a node that matches assistant-professor. Domain-independent: two nodes match if their parents & children match; if all children of X match Y, then X also matches Y. Variations have been exploited in many restricted settings [Melnik&Garcia-Molina, ICDE-02], [Milo&Zohar, VLDB-98], [Noy et al., IJCAI-01], [Madhavan et al., VLDB-01]. Challenge: find a general & efficient approach
Solution: Relaxation Labeling Relaxation labeling [Hummel&Zucker, 83]: applied to graph labeling in vision, NLP, hypertext classification; finds best label assignment, given a set of constraints; starts with initial label assignment; iteratively improves labels, using constraints. Standard relaxation labeling not applicable: extended it in many ways [Doan et al., WWW-02]. Experiments: three real-world domains in course catalog & company listings; 30 -- 300 nodes / taxonomy; accuracy 66 -- 97% vs. 52 -- 83% of best base learner; relaxation labeling very fast (under a few seconds)
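For readers unfamiliar with the technique, here is a minimal sketch of the classic relaxation labeling update: each node's label probabilities are repeatedly re-weighted by how well each label agrees with the current labels of its neighbors. This is the generic textbook scheme, not the extended version used in GLUE; the two-node graph, the initial probabilities, and the compatibility function are made up for illustration.

```python
def relaxation_labeling(init, neighbors, compat, iterations=10):
    """init: {node: {label: prob}}; neighbors: {node: [node, ...]};
    compat(label, neighbor_label) returns a value in [-1, 1]."""
    probs = {node: dict(dist) for node, dist in init.items()}
    for _ in range(iterations):
        updated = {}
        for node, dist in probs.items():
            nbrs = neighbors.get(node, [])
            scores = {}
            for label, p in dist.items():
                # Average support for `label` from the neighbors' current labels.
                support = 0.0
                for nb in nbrs:
                    support += sum(compat(label, l2) * p2
                                   for l2, p2 in probs[nb].items())
                if nbrs:
                    support /= len(nbrs)
                scores[label] = p * (1.0 + support)
            total = sum(scores.values())
            updated[node] = ({l: s / total for l, s in scores.items()}
                             if total > 0 else dict(dist))
        probs = updated  # synchronous update of all nodes
    return probs

# Hypothetical compatibility: a People node next to a Faculty node is plausible.
RELATED = {frozenset(("People", "Faculty"))}

def compat(a, b):
    return 1.0 if frozenset((a, b)) in RELATED else 0.0

# Two nodes of one taxonomy, labeled with concepts of the other (toy numbers).
init = {
    "Staff": {"People": 0.6, "Courses": 0.4},
    "Acad. Staff": {"Faculty": 0.45, "Courses": 0.55},
}
neighbors = {"Staff": ["Acad. Staff"], "Acad. Staff": ["Staff"]}

final = relaxation_labeling(init, neighbors, compat)
```

In this toy run the initial best label for Acad. Staff is Courses, but support from its neighbor shifts the assignment toward Faculty as the iterations proceed.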
Related Work Hand-crafted rules, exploit schema, 1-1 mapping: TRANSCM [Milo&Zohar98], ARTEMIS [Castano&Antonellis99], [Palopoli et al. 98], CUPID [Madhavan et al. 01], PROMPT [Noy et al. 00]. Single learner, exploit data, 1-1 mapping: SEMINT [Li&Clifton94], ILA [Perkowitz&Etzioni95], DELTA [Clifton et al. 97]. Rules, exploit data, 1-1 + complex mapping: CLIO [Miller et al., 00], [Yan et al. 01]. Learners + rules, use multi-strategy learning, exploit schema + data, 1-1 + complex mapping, exploit domain constraints: LSD [Doan et al., SIGMOD-01], COMAP [Doan et al. 2002, submitted], GLUE [Doan et al., WWW-02]
Future Work Learning source descriptions formal semantics for mapping query capabilities, source schema, scope, reliability of data, ... Dealing with changes in source description Matching objects across sources More sophisticated user feedback Focus on distributed information management systems data integration, web-service integration, peer data management goal: significantly reduce complexity of construction & maintenance There are many important problems ...
Conclusions Efficiently creating semantic mappings is critical Developed solution for semi-automatic schema matching learns from previous matching activities can match relational schemas, DTDs, ontologies, ... discovers both 1-1 & complex mappings highly modular & extensible achieves high matching accuracy Made contributions to machine learning developed novel method to classify XML data extended relaxation labeling
Backup Slides
Training the Meta-Learner For address. Extracted XML instances, each with (Name Learner, Naive Bayes, True Prediction): <location> Miami, FL</>: (0.5, 0.8, 1); <listed-price> $250,000</>: (0.4, 0.3, 0); <area> Seattle, WA </>: (0.3, 0.9, 1); <house-addr>Kent, WA</>: (0.6, 0.8, 1); <num-baths>3</>: (0.3, 0.3, 0); ... Least-Squares Linear Regression: Weight(Name-Learner,address) = 0.1, Weight(Naive-Bayes,address) = 0.9
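The regression step can be sketched in a few lines. The sketch assumes the fifteen numbers on the slide read row by row as (Name Learner score, Naive Bayes score, true label) for the five instances shown, and it fits a two-weight linear model without an intercept by solving the normal equations directly. The function name is hypothetical. Since the slide elides further instances, a fit on these five rows alone need not reproduce the weights 0.1 and 0.9 shown there.

```python
# (name_learner_score, naive_bayes_score, true_label) for label "address",
# read off the slide under the row-by-row assumption stated above.
training = [
    (0.5, 0.8, 1),  # <location> Miami, FL
    (0.4, 0.3, 0),  # <listed-price> $250,000
    (0.3, 0.9, 1),  # <area> Seattle, WA
    (0.6, 0.8, 1),  # <house-addr> Kent, WA
    (0.3, 0.3, 0),  # <num-baths> 3
]

def least_squares_weights(rows):
    """Minimize sum (y - w1*x1 - w2*x2)^2 via the 2x2 normal equations."""
    s11 = sum(x1 * x1 for x1, _, _ in rows)
    s12 = sum(x1 * x2 for x1, x2, _ in rows)
    s22 = sum(x2 * x2 for _, x2, _ in rows)
    b1 = sum(x1 * y for x1, _, y in rows)
    b2 = sum(x2 * y for _, x2, y in rows)
    det = s11 * s22 - s12 * s12
    if det == 0:
        raise ValueError("learner scores are collinear; weights are not unique")
    w1 = (s22 * b1 - s12 * b2) / det
    w2 = (s11 * b2 - s12 * b1) / det
    return w1, w2

w_name_learner, w_naive_bayes = least_squares_weights(training)
print(w_name_learner, w_naive_bayes)
```

Even on this small sample the Naive Bayes learner receives the larger weight, in line with the slide's message that the meta-learner trusts it more for address.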
Sensitivity to Amount of Available Data Average matching accuracy (%) Number of data listings per source (Real Estate I)
Contribution of Each Component Average Matching Accuracy (%) Without Name Learner Without Naive Bayes Without Whirl Learner Without Constraint Handler The complete LSD system
Exploiting Hierarchical Structure Existing learners flatten out all structures. Developed XML learner: similar to the Naive Bayes learner (input instance = bag of tokens); differs in one crucial aspect: considers not only text tokens, but also structure tokens. Example: <contact> <name> Gail Murphy </name> <firm> MAX Realtors </firm> </contact> <description> Victorian house with a view. Name your price! To see it, contact Gail Murphy at MAX Realtors. </description>
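The difference between a flat bag of tokens and one that includes structure tokens can be illustrated with the slide's two instances. In this sketch a structure token is simply the tag name wrapped in angle brackets, which is an assumption made for illustration; the XML learner's actual token definitions may differ.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

def bag_of_tokens(xml_text, with_structure=True):
    """Bag of lower-cased text tokens, optionally plus one structure
    token per element (hypothetical encoding: '<tag>')."""
    root = ET.fromstring(xml_text)
    bag = Counter()
    for elem in root.iter():
        if with_structure:
            bag["<" + elem.tag + ">"] += 1
        for piece in (elem.text, elem.tail):
            if piece:
                bag.update(re.findall(r"[a-z0-9]+", piece.lower()))
    return bag

contact = ("<contact> <name> Gail Murphy </name> "
           "<firm> MAX Realtors </firm> </contact>")
description = ("<description> Victorian house with a view. Name your price! "
               "To see it, contact Gail Murphy at MAX Realtors. </description>")

flat_contact = bag_of_tokens(contact, with_structure=False)
flat_description = bag_of_tokens(description, with_structure=False)
structured_contact = bag_of_tokens(contact)
structured_description = bag_of_tokens(description)

# Every text token of the contact instance also occurs in the description,
# so a flat learner sees little to tell the two apart.
print(set(flat_contact) <= set(flat_description))
# The structure tokens <name> and <firm> occur only in the contact instance.
print("<name>" in structured_contact, "<name>" in structured_description)
```

With structure tokens in the bag, a Naive Bayes style learner has evidence that separates contact from description even though their words overlap.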
Reasons for Incorrect Matchings Unfamiliarity: suburb; solution: add a suburb-name recognizer. Insufficient information: correctly identified general type, failed to pinpoint exact type; example: agent-name = Richard Smith, phone = (206) 234 5412; solution: add a proximity learner. Subjectivity: house-style = description? house-style values: Victorian, Mexican; description values: Beautiful neo-gothic house, Great location
Evaluate Mapping Candidates For address, Text Searcher returns (agent-id,0.7) (concat(agent-id,city),0.8) (concat(city,zipcode),0.75) Employ multi-strategy learning to evaluate mappings Example: (concat(agent-id,city),0.8) Naive Bayes Learner: 0.8 Name Learner: “address” vs. “agent id city” 0.3 Meta-Learner: 0.8 * 0.7 + 0.3 * 0.3 = 0.65 Meta-Learner returns (agent-id,0.59) (concat(agent-id,city),0.65) (concat(city,zipcode),0.70)
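The meta-learner's arithmetic for concat(agent-id,city) can be checked with a short sketch. It assumes that, in the expression 0.8 * 0.7 + 0.3 * 0.3, the values 0.7 and 0.3 are the meta-learner's weights for Naive Bayes and the Name Learner respectively; the dictionary keys and function name are hypothetical.

```python
def combine(learner_scores, weights):
    """Weighted sum of base-learner scores for one mapping candidate."""
    return sum(weights[name] * score for name, score in learner_scores.items())

# Assumed meta-learner weights for the label address.
weights = {"naive-bayes": 0.7, "name-learner": 0.3}

# Base-learner scores for the candidate concat(agent-id, city).
scores = {"naive-bayes": 0.8, "name-learner": 0.3}

combined = combine(scores, weights)
print(round(combined, 2))
```

Applying the same weighted sum to each candidate returned by the Text Searcher yields the re-ranked list the meta-learner reports.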
Relaxation Labeling Applied to similar problems in vision, NLP, hypertext classification People Dept U.S. Dept Australia Courses Courses Courses Courses People Staff Faculty Staff Acad. Staff Tech. Staff Faculty Staff
Relaxation Labeling for Taxonomy Matching Must define neighborhood of a node k features of neighborhood how to combine influence of features Algorithm init: for each pair <N,L>, compute loop: for each pair <N,L>, re-compute Acad. Staff: Faculty Tech. Staff: Staff Staff = People Neighborhood configuration
Relaxation Labeling for Taxonomy Matching Huge number of neighborhood configurations! Typically neighborhood = immediate nodes; here neighborhood can be entire graph; 100 nodes, 10 labels => configurations. Solution: label abstraction + dynamic programming; guarantee quadratic time for a broad range of domain constraints. Empirical evaluation: GLUE system [Doan et al., WWW-02]; three real-world domains; 30 -- 300 nodes / taxonomy; high accuracy 66 -- 97% vs. 52 -- 83% of best base learner; relaxation labeling very fast, finished in several seconds