Learning Source Descriptions for Data Integration
AnHai Doan, Pedro Domingos, Alon Levy
Department of Computer Science & Engineering
University of Washington
2 Overview
Problem definition
–schema matching
Solution
–multi-strategy learning
Prototype system
–LSD (Learning Source Descriptions)
Experiments
Related work
Summary & future work
3 Data Integration
Example query: Find houses with four bathrooms and price under $500,000
Figure labels: mediated schema; source schema (superhomes.com); source schema (realestate.com); source schema (homeseekers.com); wrapper
4 Semantic Mappings between Schemas
Mediated & source schemas = XML DTDs
Figure labels (schema tree elements): house, location, contact-info, house, address, agent-name, agent-phone, num-baths, amenities, full-baths, half-baths, handicap-equipped, contact, name, phone
5 Map of the Problem
Figure labels: source descriptions; schema matching; data translation; scope; completeness; reliability; query capability; leaf elements; higher-level elements; 1-1 mappings; complex mappings
6 Current State of Affairs
Largely done by hand
–labor intensive & error prone
–key bottleneck in building applications
Will only be exacerbated
–data sharing & XML become pervasive
–proliferation of DTDs
–translation of legacy data
Need automatic approaches to scale up!
7 Our Approach
Use machine learning to match schemas
Basic idea
1. create training data
–manually map a set of sources to mediated schema
2. train system on training data
–learns from
  –name of schema elements
  –format of values
  –frequency of words & symbols
  –characteristics of value distribution
  –proximity, position, structure,...
3. system proposes mappings for subsequent sources
8 Example
realestate.com listing: Seattle, WA; (206); $250,000; Fantastic house
mediated schema: address, phone, price, description
realestate.com elements and sample values
–location: Seattle, WA; Dallas, TX; ...
–listed-price: $250,000; $162,000; $180,
–agent-phone: (206); (206); (214)
–comments: Fantastic house...; Great...; Hurry!...
9 Multi-Strategy Learning
Use a set of base learners
–each exploits certain types of information
Match schema elements of a new source
–apply the learners
–combine their predictions using a meta-learner
Meta-learner
–measures base learner accuracy on training data
–weighs each learner based on its accuracy
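The weighting idea on this slide can be sketched roughly as follows. This is a minimal illustration, not the LSD implementation: the function names `learner_accuracy` and `combine`, the learner names, and all numbers in the usage are hypothetical, and using raw accuracy as the weight is an assumption.

```python
from collections import defaultdict

def learner_accuracy(predict, labeled_examples):
    """Fraction of labeled examples whose top-scoring label is correct.

    predict: callable mapping an input to {label: confidence}.
    """
    correct = 0
    for x, label in labeled_examples:
        scores = predict(x)
        if max(scores, key=scores.get) == label:
            correct += 1
    return correct / len(labeled_examples)

def combine(predictions, weights):
    """Weighted sum of per-learner confidence scores, normalized to sum to 1.

    predictions: {learner_name: {label: confidence}}
    weights:     {learner_name: weight}, e.g. accuracy on training data
    """
    combined = defaultdict(float)
    for learner, scores in predictions.items():
        for label, conf in scores.items():
            combined[label] += weights[learner] * conf
    total = sum(combined.values())
    if total == 0:
        return dict(combined)
    return {label: s / total for label, s in combined.items()}

# Hypothetical predictions and weights, for illustration only.
predictions = {
    "name_matcher": {"address": 0.6, "description": 0.4},
    "frequency_learner": {"address": 0.8, "name": 0.2},
}
weights = {"name_matcher": 0.5, "frequency_learner": 0.9}
print(combine(predictions, weights))
```

A learner that was more accurate on the training sources pulls the combined prediction further toward its own answer.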
10 Learners
Input
–schema information: name, proximity, structure,...
–data information: value, format,...
Output
–prediction weighted by confidence score
Examples
–Name matcher
  –agent-name => (name,0.7), (phone,0.3)
–Frequency learner
  –“Seattle, WA” => (address,0.8), (name,0.2)
  –“Great location...” => (description,0.9), (address,0.1)
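To make the learner interface concrete, here is a minimal sketch of a name matcher that returns a confidence score per mediated-schema element. The token-overlap (Jaccard) scoring and the `smoothing` parameter are assumptions chosen for brevity; the slides do not say how the name matcher scores names, and the scores this sketch produces will not match the 0.7/0.3 figures on the slide.

```python
import re

def tokens(name):
    """Split an element name such as 'agent-name' into lowercase tokens."""
    return set(re.split(r"[^a-z0-9]+", name.lower())) - {""}

def name_matcher(source_element, mediated_elements, smoothing=0.1):
    """Score each mediated element by token overlap with the source name.

    Returns {mediated_element: confidence}, with confidences summing to 1.
    The smoothing term (an assumption) keeps non-matching labels above zero.
    """
    src = tokens(source_element)
    raw = {}
    for element in mediated_elements:
        cand = tokens(element)
        union = src | cand
        jaccard = len(src & cand) / len(union) if union else 0.0
        raw[element] = jaccard + smoothing
    total = sum(raw.values())
    return {element: score / total for element, score in raw.items()}

print(name_matcher("agent-name", ["name", "phone"]))
```

Any learner that returns a label-to-confidence mapping of this shape can be plugged into the same combination step.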
11 Training the Learners
realestate.com listing: Seattle, WA; (206); $ 250,000; Fantastic house
mediated schema: address, phone, price, description
realestate.com elements: location, listed-price, agent-phone, comments
Name Matcher training examples
–(location, address)
–(agent-phone, phone)
–(listed-price, price)
–(comments, description)
–...
Frequency Learner training examples
–(“Seattle, WA”, address)
–(“(206) ”, phone)
–(“$ 250,000”, price)
–(“Fantastic house...”, description)
–...
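The training step can be sketched as follows: a hand-made mapping is turned into (name, label) and (value, label) examples, and a value-based learner is trained on the latter. This is a sketch under assumptions: `build_training_examples` and `FrequencyLearner` are hypothetical names, and a small token-level Naive Bayes classifier with add-one smoothing stands in for the frequency learner.

```python
import math
import re
from collections import Counter, defaultdict

def build_training_examples(mapping, listings):
    """mapping: source element -> mediated element (created by hand).
    listings: list of {source element: data value} records."""
    name_examples = list(mapping.items())
    value_examples = [
        (value, mapping[element])
        for record in listings
        for element, value in record.items()
        if element in mapping
    ]
    return name_examples, value_examples

def tokenize(text):
    """Words, digit runs, and individual symbols, lowercased."""
    return re.findall(r"[a-z]+|\d+|[^\sa-z\d]", text.lower())

class FrequencyLearner:
    """Token-level Naive Bayes with add-one smoothing (an assumption)."""

    def __init__(self):
        self.token_counts = defaultdict(Counter)
        self.label_counts = Counter()
        self.vocab = set()

    def train(self, examples):
        for value, label in examples:
            self.label_counts[label] += 1
            for tok in tokenize(value):
                self.token_counts[label][tok] += 1
                self.vocab.add(tok)

    def predict(self, value):
        n_examples = sum(self.label_counts.values())
        log_scores = {}
        for label, n in self.label_counts.items():
            denom = sum(self.token_counts[label].values()) + len(self.vocab)
            score = math.log(n / n_examples)
            for tok in tokenize(value):
                score += math.log((self.token_counts[label][tok] + 1) / denom)
            log_scores[label] = score
        # Convert log scores to confidences that sum to 1.
        top = max(log_scores.values())
        exp = {label: math.exp(s - top) for label, s in log_scores.items()}
        z = sum(exp.values())
        return {label: e / z for label, e in exp.items()}

# Values taken from the slides; the mapping mirrors the pairs listed above.
mapping = {"location": "address", "listed-price": "price", "comments": "description"}
listings = [
    {"location": "Seattle, WA", "listed-price": "$250,000", "comments": "Fantastic house"},
    {"location": "Dallas, TX", "listed-price": "$162,000", "comments": "Great"},
]
name_examples, value_examples = build_training_examples(mapping, listings)
learner = FrequencyLearner()
learner.train(value_examples)
print(learner.predict("Kent, WA"))
```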
12 Applying the Learners
mediated schema: address, phone, price, description
homes.com element: area (values: Seattle, WA; Kent, WA; Austin, TX; Seattle, WA)
Figure labels: Name Matcher and Frequency Learner feed the Meta-learner, which predicts address, description, address; the Combiner outputs address
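The final combiner step, which turns several per-value predictions into one label for the whole source element, might look like the sketch below. Averaging the per-value confidence scores is an assumption; the slide only shows that predictions of address, description, address yield address. The function name `combine_column` and the numbers are hypothetical.

```python
from collections import defaultdict

def combine_column(instance_predictions):
    """Average per-value label confidences for one source element.

    instance_predictions: list of {label: confidence}, one per data value.
    Returns (best_label, averaged_scores).
    """
    totals = defaultdict(float)
    for scores in instance_predictions:
        for label, conf in scores.items():
            totals[label] += conf
    n = len(instance_predictions)
    averaged = {label: s / n for label, s in totals.items()}
    best = max(averaged, key=averaged.get)
    return best, averaged

# Illustrative meta-learner outputs for three values of the 'area' element.
per_value = [
    {"address": 0.8, "description": 0.2},
    {"address": 0.3, "description": 0.7},
    {"address": 0.9, "description": 0.1},
]
best, averaged = combine_column(per_value)
print(best, averaged)
```

Averaging lets one misclassified value be outvoted by the rest of the column, which matches the behavior shown on the slide.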
13 The LSD System
Base learners/modules
–name matcher
–Naive Bayesian learner
–Whirl nearest-neighbor classifier [Cohen&Hirsh-KDD98]
–county-name recognizer
Meta-learner
–uses stacking [Ting&Witten99, Wolpert92]
–uses training data to learn weights for base learners
–combines predictions using confidence scores/weights
14 Experiments
15 Related Work
Rule-based approaches
–TRANSCM [Milo&Zohar98], ARTEMIS [Castano&Antonellis99], [Palopoli et al. 98]
–utilize only schema information
Learner-based approaches
–SEMINT [Li&Clifton94], ILA [Perkowitz&Etzioni95]
–employ a single learner, limited applicability
Multi-strategy learning in other domains
–series of workshops [91,93,96,98,00]
–[Freitag98], Proverb [Keim et al. 99]
16 Summary
Schema matching
–automated by learning
Multi-strategy learning is essential
–handles different types of data
–incorporates different types of domain knowledge
–easy to incorporate new learners
–alleviates effects of noise & dirty data
Implemented LSD
–promising results with initial experiments
17 Future Work
Figure labels: source descriptions; schema matching; data translation; scope; completeness; reliability; query capability; leaf elements; higher-level elements; 1-1 mappings; complex mappings