Download presentation
Presentation is loading. Please wait.
1
Reconciling Schemas of Disparate Data Sources: A Machine-Learning Approach AnHai Doan Pedro Domingos Alon Halevy
2
CS 652 Information Extraction and Information Integration Problem & Solution Problem Large-scale Data Integration Systems Bottleneck: Semantic Mappings Solution Multi-strategy Learning Integrity Constraints XML Structure Learner 1-1 Mappings
3
CS 652 Information Extraction and Information Integration Learning Source Descriptions (LSD) Components Base learners Meta-learner Prediction converter Constraint handler Operating Phases Training phase Matching phase
4
CS 652 Information Extraction and Information Integration Learners Basic Learners Name Matcher (Whirl) Content Matcher (Whirl) Naïve Bayes Learner County-Name Recognizer XML Learner Meta-Learner (Stacking)
5
CS 652 Information Extraction and Information Integration Naïve Bayes Learner Input instance= bags of tokens
6
CS 652 Information Extraction and Information Integration XML Learner Input instance= bags of tokens including text tokens and structure tokens
7
CS 652 Information Extraction and Information Integration Domain Constraint Handler Domain Constraints Impose semantic regularities on schemas and source data in the domain Can be specified at the beginning When creating a mediated schema Independent of any actual source schema Constraint Handler Domain constraints + Prediction Converter + Users’ feedback + Output mappings
8
CS 652 Information Extraction and Information Integration Training Phase Manually Specify Mappings for Several Sources Extract Source Data Create Training Data for each Base Learner Train the Base-Learner Train the Meta-Learner
9
CS 652 Information Extraction and Information Integration Example1 (Training Phase)
10
CS 652 Information Extraction and Information Integration Example1 (Cont.) Source Data Training Data
11
CS 652 Information Extraction and Information Integration Example1 (Cont.) (“location” , ADDRESS) (“Miami, FL”, ADDRESS) Source Data: (location: Miami, FL)
12
CS 652 Information Extraction and Information Integration Matching Phase Extract and Collect Data Match each Source-DTD Tag Apply the Constraint Handler
13
CS 652 Information Extraction and Information Integration Example2 (Matching Phase)
14
CS 652 Information Extraction and Information Integration Example2 (Cont.)
15
CS 652 Information Extraction and Information Integration Example2 (Cont.)
16
CS 652 Information Extraction and Information Integration Experimental Evaluation Measures Matching accuracy of a source Average matching accuracy of a source Average matching accuracy of a domain Experiment Results Average matching accuracy for different domains Contributions of base learners and domain constraint handler Contributions of schema information and instance information Performance sensitivity to the amount data instances
17
CS 652 Information Extraction and Information Integration Limitations Enough Training Data Domain Dependent Learners Ambiguities in Sources Efficiency Overlapping of Schemas
18
CS 652 Information Extraction and Information Integration Conclusion and Future Work Improve over time Extensible framework Multiple types of knowledge Non 1-1 mapping ?
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.