1 Learning Source Mappings
Zachary G. Ives, University of Pennsylvania
CIS 650 – Database & Information Systems
October 27, 2008
LSD slides courtesy of AnHai Doan
2 Administrivia
- Midterm due Thursday
- 5-10 pages (single-spaced, 10-12 pt)
3 Semantic Mappings between Schemas
- Mediated & source schemas = XML DTDs
- [Figure: two schema trees — house(address, num-baths, contact-info(agent-name, agent-phone)) and house(location, full-baths, half-baths, contact(name, phone)). location ↔ address is a 1-1 mapping; num-baths vs. full-baths/half-baths is a non 1-1 mapping.]
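To make the distinction concrete, here is a minimal sketch (in Python, with element names taken from the figure; the representation itself is purely illustrative, not how LSD stores mappings) of how a 1-1 correspondence differs from a non 1-1 one:

```python
# Illustrative only: one way to represent the correspondences in the figure.
# A 1-1 mapping pairs one source element with one mediated element;
# a non 1-1 mapping derives a mediated element from several source elements.

source_record = {"location": "Miami, FL", "full-baths": 2, "half-baths": 1}

# 1-1 mapping: source element -> mediated element
one_to_one = {"location": "address"}

# non 1-1 mapping: mediated element computed from several source elements
def num_baths(rec):
    return rec["full-baths"] + rec["half-baths"]

mediated_record = {
    "address": source_record["location"],   # via the 1-1 mapping
    "num-baths": num_baths(source_record),  # via the non 1-1 mapping
}
print(mediated_record)   # {'address': 'Miami, FL', 'num-baths': 3}
```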
4 The LSD (Learning Source Descriptions) Approach
Suppose a user wants to integrate 100 data sources:
1. User manually creates mappings for a few sources, say 3, and shows LSD these mappings
2. LSD learns from the mappings
   - "Multi-strategy" learning incorporates many types of info in a general way
   - Knowledge of constraints further helps
3. LSD proposes mappings for the remaining 97 sources
5 Example
Mediated schema: address, price, agent-phone, description

realestate.com (schema: location, listed-price, phone, comments):
- location: Miami, FL | Boston, MA | ...
- listed-price: $250,000 | $110,000 | ...
- phone: (305) 729 0831 | (617) 253 1429 | ...
- comments: Fantastic house | Great location | ...

homes.com:
- price: $550,000 | $320,000 | ...
- contact-phone: (278) 345 7215 | (617) 335 2315 | ...
- extra-info: Beautiful yard | Great beach | ...

Learned hypotheses:
- If "fantastic" & "great" occur frequently in data values => description
- If "phone" occurs in the name => agent-phone
6 LSD's Multi-Strategy Learning
- Use a set of base learners
  - each exploits well certain types of information
- Match schema elements of a new source
  - apply the base learners
  - combine their predictions using a meta-learner
- Meta-learner
  - uses training sources to measure base learner accuracy
  - weighs each learner based on its accuracy
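A minimal sketch of this architecture, assuming a simple label→confidence dictionary as the prediction format and a hand-supplied weight per learner (LSD learns these weights from the training sources; the class and method names here are invented for illustration):

```python
from abc import ABC, abstractmethod

class BaseLearner(ABC):
    """Exploits one kind of information (names, values, formats, ...)."""
    @abstractmethod
    def train(self, examples):            # [(instance, mediated_label), ...]
        ...
    @abstractmethod
    def predict(self, instance):          # -> {mediated_label: confidence}
        ...

class MetaLearner:
    """Combines base-learner predictions, weighting each learner by the
    accuracy measured on the manually mapped training sources."""
    def __init__(self, weights):          # {learner_name: weight}
        self.weights = weights

    def combine(self, predictions):       # {learner_name: {label: confidence}}
        scores = {}
        for name, preds in predictions.items():
            for label, conf in preds.items():
                scores[label] = scores.get(label, 0.0) + self.weights[name] * conf
        return scores
```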
7 Base Learners
- Input
  - schema information: name, proximity, structure, ...
  - data information: value, format, ...
- Output
  - prediction weighted by confidence score
- Examples
  - Name learner: agent-name => (name, 0.7), (phone, 0.3)
  - Naive Bayes learner: "Kent, WA" => (address, 0.8), (name, 0.2); "Great location" => (description, 0.9), (address, 0.1)
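The toy learners below illustrate the kind of weighted output shown above; the keyword lists, scoring scheme, and label set are made up for illustration and are not LSD's actual learners:

```python
import re

MEDIATED = ["address", "price", "agent-phone", "description", "name"]

def name_learner(element_name):
    """Toy name matcher: score mediated labels by shared word fragments."""
    tokens = set(re.split(r"[-_\s]", element_name.lower()))
    scores = {}
    for label in MEDIATED:
        overlap = tokens & set(re.split(r"[-_\s]", label))
        scores[label] = 1.0 + len(overlap)            # crude affinity
    total = sum(scores.values())
    return {lab: round(s / total, 2) for lab, s in scores.items()}

def value_learner(value):
    """Toy stand-in for the Naive Bayes learner: keyword cues in data values."""
    cues = {
        "address":     [", wa", ", fl", ", ma", ", tx"],
        "description": ["great", "fantastic", "beautiful", "location"],
        "agent-phone": ["(", ")"],
    }
    scores = {lab: 0.1 for lab in cues}               # small prior
    for lab, words in cues.items():
        scores[lab] += sum(w in value.lower() for w in words)
    total = sum(scores.values())
    return {lab: round(s / total, 2) for lab, s in scores.items()}

print(name_learner("agent-name"))       # 'name' and 'agent-phone' score higher
print(value_learner("Kent, WA"))        # 'address' scores highest
print(value_learner("Great location"))  # 'description' scores highest
```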
8 Training the Learners
Source realestate.com (schema: location, listed-price, phone, comments), mapped to the mediated schema (address, price, agent-phone, description), with listings such as "Miami, FL | $250,000 | (305) 729 0831 | Fantastic house" and "Boston, MA | $110,000 | (617) 253 1429 | Great location", yields training examples:
- Name Learner: (location, address), (listed-price, price), (phone, agent-phone), (comments, description), ...
- Naive Bayes Learner: ("Miami, FL", address), ("$250,000", price), ("(305) 729 0831", agent-phone), ("Fantastic house", description), ...
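A sketch of how these two training sets might be assembled from one manually mapped source; the mapping and data values are the ones on the slide, while the code itself is only illustrative:

```python
# Manually supplied 1-1 mappings for realestate.com (source element -> mediated element)
manual_mapping = {
    "location": "address",
    "listed-price": "price",
    "phone": "agent-phone",
    "comments": "description",
}

# Extracted data listings from realestate.com, keyed by source element
listings = {
    "location":     ["Miami, FL", "Boston, MA"],
    "listed-price": ["$250,000", "$110,000"],
    "phone":        ["(305) 729 0831", "(617) 253 1429"],
    "comments":     ["Fantastic house", "Great location"],
}

# Training examples for the Name Learner: (source element name, mediated label)
name_examples = [(src, label) for src, label in manual_mapping.items()]

# Training examples for the Naive Bayes learner: (data value, mediated label)
value_examples = [(value, manual_mapping[src])
                  for src, values in listings.items()
                  for value in values]

print(name_examples[:2])   # [('location', 'address'), ('listed-price', 'price')]
print(value_examples[:2])  # [('Miami, FL', 'address'), ('Boston, MA', 'address')]
```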
9 Applying the Learners
Schema of homes.com: area, day-phone, extra-info, with values such as "Seattle, WA", "Kent, WA", "Austin, TX" (area), "(278) 345 7215", "(617) 335 2315", "(512) 427 1115" (day-phone), and "Beautiful yard", "Great beach", "Close to Seattle" (extra-info).
- The Name Learner and Naive Bayes learner each predict a label for every element, e.g. (address, 0.8), (description, 0.2) and (address, 0.6), (description, 0.4) for area
- The Meta-Learner combines these into a single prediction per element, e.g.
  - area: (address, 0.7), (description, 0.3)
  - day-phone: (agent-phone, 0.9), (description, 0.1)
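A sketch of the combination step, assuming the two prediction pairs shown belong to the Name Learner and the Naive Bayes learner and that the meta-learner weights them equally — both assumptions are made only to reproduce the slide's numbers; LSD learns the weights from the training sources:

```python
def combine(predictions, weights):
    """Weighted combination of base-learner predictions for one schema element."""
    scores = {}
    for learner, preds in predictions.items():
        for label, conf in preds.items():
            scores[label] = scores.get(label, 0.0) + weights[learner] * conf
    return {label: round(s, 2) for label, s in scores.items()}

# Base-learner predictions for the 'area' element of homes.com (from the slide)
area_predictions = {
    "name":        {"address": 0.8, "description": 0.2},
    "naive_bayes": {"address": 0.6, "description": 0.4},
}

# Assumed equal weights; in LSD these come from measured training accuracy
weights = {"name": 0.5, "naive_bayes": 0.5}

print(combine(area_predictions, weights))
# {'address': 0.7, 'description': 0.3}  -- matches the meta-learner output on the slide
```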
10 Domain Constraints
- Impose semantic regularities on sources
  - verified using schema or data
- Examples
  - a = address & b = address => a = b
  - a = house-id => a is a key
  - a = agent-info & b = agent-name => b is nested in a
- Can be specified up front
  - when creating mediated schema
  - independent of any actual source schema
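One possible encoding of the first two example constraints as simple checks over a candidate mapping and over source data; the representation is illustrative, not LSD's internal one:

```python
# A candidate mapping: source element -> mediated element
candidate = {"area": "address", "contact-phone": "agent-phone", "extra-info": "address"}

def at_most_one_maps_to(mapping, target):
    """a = address & b = address => a = b:
    at most one source element may map to `target`."""
    return sum(1 for label in mapping.values() if label == target) <= 1

def is_key(rows, element):
    """a = house-id => a is a key: values of `element` must be unique."""
    values = [row[element] for row in rows]
    return len(values) == len(set(values))

print(at_most_one_maps_to(candidate, "address"))   # False: two elements map to address

rows = [{"ad-id": 1}, {"ad-id": 2}, {"ad-id": 2}]
print(is_key(rows, "ad-id"))                       # False: duplicate value, not a key
```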
11 The Constraint Handler
Predictions from the Meta-Learner:
- area: (address, 0.7), (description, 0.3)
- contact-phone: (agent-phone, 0.9), (description, 0.1)
- extra-info: (address, 0.6), (description, 0.4)
Domain constraint: a = address & b = address => a = b
Candidate mapping combinations and their scores:
- area: address, contact-phone: agent-phone, extra-info: description -> 0.7 x 0.9 x 0.4 = 0.252
- area: address, contact-phone: agent-phone, extra-info: address -> 0.7 x 0.9 x 0.6 = 0.378, but violates the constraint
- area: description, contact-phone: description, extra-info: description -> 0.3 x 0.1 x 0.4 = 0.012
The constraint rules out the highest-scoring combination, so the handler selects the 0.252 combination.
- Can specify arbitrary constraints
- User feedback = domain constraint, e.g. ad-id = house-id
- Extended to handle domain heuristics, e.g. a = agent-phone & b = agent-name => a & b are usually close to each other
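A sketch of the handler's job under the simplifying assumption that it can exhaustively enumerate mapping combinations, score each as the product of the meta-learner confidences, and discard any that violate a constraint; LSD's actual search is more refined, but the predictions and numbers below are the ones on the slide:

```python
from itertools import product

# Meta-learner predictions for homes.com (from the slide)
predictions = {
    "area":          {"address": 0.7, "description": 0.3},
    "contact-phone": {"agent-phone": 0.9, "description": 0.1},
    "extra-info":    {"address": 0.6, "description": 0.4},
}

def violates_constraints(mapping):
    # a = address & b = address => a = b: at most one element may map to address
    return sum(1 for label in mapping.values() if label == "address") > 1

elements = list(predictions)
best_mapping, best_score = None, 0.0
for labels in product(*(predictions[e] for e in elements)):
    mapping = dict(zip(elements, labels))
    score = 1.0
    for e in elements:
        score *= predictions[e][mapping[e]]
    # e.g. area->address, contact-phone->agent-phone, extra-info->address scores
    # 0.7 * 0.9 * 0.6 = 0.378 but is ruled out by the address constraint
    if violates_constraints(mapping):
        continue
    if score > best_score:
        best_mapping, best_score = mapping, score

print(best_mapping, round(best_score, 3))
# {'area': 'address', 'contact-phone': 'agent-phone', 'extra-info': 'description'} 0.252
```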
12 Putting It All Together: LSD System
[Figure: system architecture with a training phase and a matching phase — mediated schema, source schemas, and data listings supply training data to base learners L1 ... Lk; their combined mappings pass through mapping combination and a constraint handler, with domain constraints and user feedback as inputs.]
- Base learners: Name Learner, XML learner, Naive Bayes, Whirl learner
- Meta-learner
  - uses stacking [Ting & Witten 99; Wolpert 92]
  - returns a linear weighted combination of the base learners' predictions
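As a rough illustration of the stacking idea, the toy code below sets each base learner's weight from its top-1 accuracy on held-out training examples; real stacking [Ting & Witten 99; Wolpert 92] fits a regression over the learners' confidence scores, and the training data here is invented:

```python
# Held-out training predictions: for each correctly labeled element, what each
# base learner predicted. The data is made up for illustration.
training = [
    # (correct label, {learner: {label: confidence}})
    ("address",     {"name": {"address": 0.8, "description": 0.2},
                     "naive_bayes": {"address": 0.6, "description": 0.4}}),
    ("description", {"name": {"address": 0.6, "description": 0.4},
                     "naive_bayes": {"description": 0.9, "address": 0.1}}),
    ("agent-phone", {"name": {"agent-phone": 0.7, "description": 0.3},
                     "naive_bayes": {"agent-phone": 0.9, "description": 0.1}}),
]

def learn_weights(training):
    """Weight each learner in proportion to its top-1 hits on the training examples."""
    hits = {}
    for correct, preds in training:
        for learner, scores in preds.items():
            top = max(scores, key=scores.get)
            hits[learner] = hits.get(learner, 0) + (top == correct)
    total = sum(hits.values())
    return {learner: h / total for learner, h in hits.items()}

print(learn_weights(training))
# name learner got 2 right, naive_bayes 3 -> {'name': 0.4, 'naive_bayes': 0.6}
```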
13 Empirical Evaluation
- Four domains
  - Real Estate I & II, Course Offerings, Faculty Listings
- For each domain
  - create mediated DTD & domain constraints
  - choose five sources
  - extract & convert data listings into XML
  - mediated DTDs: 14-66 elements, source DTDs: 13-48
- Ten runs for each experiment; in each run:
  - manually provide 1-1 mappings for 3 sources
  - ask LSD to propose mappings for the remaining 2 sources
  - accuracy = % of 1-1 mappings correctly identified
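The accuracy measure is simply the fraction of 1-1 mappings proposed correctly; a small helper, with made-up mappings, makes that concrete:

```python
def matching_accuracy(proposed, correct):
    """Fraction of source elements whose proposed 1-1 mapping is correct."""
    right = sum(1 for elem, label in correct.items() if proposed.get(elem) == label)
    return right / len(correct)

correct  = {"area": "address", "day-phone": "agent-phone", "extra-info": "description"}
proposed = {"area": "address", "day-phone": "agent-phone", "extra-info": "address"}
print(matching_accuracy(proposed, correct))   # 0.666... -> 2 of 3 mappings correct
```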
14 LSD Matching Accuracy
[Chart: average matching accuracy (%) across the four domains]
- LSD's accuracy: 71-92%
- Best single base learner: 42-72%
- + Meta-learner: +5-22%
- + Constraint handler: +7-13%
- + XML learner: +0.8-6%
15 LSD Summary
- Applies machine learning to schema matching
  - use of multi-strategy learning
  - domain & user-specified constraints
- Probably the most flexible means of doing schema matching today in a semi-automated way
- Complementary project: CLIO (IBM Almaden)
  - uses key and foreign-key constraints to help the user build mappings
16 Since LSD…
A lot more work on the following:
- Alternative schemes for putting together info from base learners
- Hierarchical learners
  - Compare two trees: parent nodes are likely to be the same if child nodes are similar; child nodes are likely to be the same if parent nodes are similar
- Using mass collaboration – humans do the work
And a lot of work on entity resolution or record matching
- Uses similar ideas to try to determine when two records are referring to the same entity
17 Jumping Up a Level
We've now seen how heterogeneous data makes a huge difference…
- … in the need for relating different kinds of attributes
  - Mapping languages
  - Mapping tools
  - Query reformulation
- … and in query processing
  - Adaptive query processing
Next time we'll go even further, and start to consider search – focusing on Google