Learning Object Identification Rules for Information Integration Sheila Tejada Craig A. Knoblock Steven University of Southern California
Introduction: When integrating information, data objects can exist in inconsistent text formats across several sources. Previous methods rely on manually constructed mapping rules for object identification. Active Atlas learns to tailor mapping rules to a specific application domain through limited user input. Active Atlas achieves higher accuracy and requires less user involvement than previous methods.
Object Identification Example
Ariadne Information Mediator
Ariadne Information Mediator (cont’d)
Active Atlas Approach to Map Objects: First, determine the text formatting transformations and propose candidate mappings. Then, learn domain-specific mapping rules.
Active Atlas Architecture
Mapping Objects (Transformation Functions): General transformation functions. Type I: Stemming, Soundex, Abbreviation. Type II: Equality, Initial, Prefix, Suffix, Substring, Abbreviation, Acronym.
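A few of the Type II transformations listed above can be sketched as simple predicates over lowercase tokens. This is only an illustration under stated assumptions: the function names, the requirement that the shorter token be strictly shorter, and the `related_by` helper are hypothetical and are not taken from Active Atlas itself.

```python
from typing import List, Sequence

def equality(a: str, b: str) -> bool:
    """The two tokens are identical."""
    return a == b

def initial(a: str, b: str) -> bool:
    """`a` is the single first letter of `b` (e.g. 'c' for 'craig')."""
    return len(a) == 1 and len(b) > 1 and b.startswith(a)

def prefix(a: str, b: str) -> bool:
    """`a` is a proper, non-empty prefix of `b`."""
    return 0 < len(a) < len(b) and b.startswith(a)

def suffix(a: str, b: str) -> bool:
    """`a` is a proper, non-empty suffix of `b`."""
    return 0 < len(a) < len(b) and b.endswith(a)

def substring(a: str, b: str) -> bool:
    """`a` is non-empty, shorter than `b`, and occurs somewhere inside `b`."""
    return 0 < len(a) < len(b) and a in b

def acronym(a: str, words: Sequence[str]) -> bool:
    """`a` is formed from the first letters of several words."""
    return len(words) > 1 and a == "".join(w[:1] for w in words)

def related_by(a: str, b: str) -> List[str]:
    """Hypothetical helper: names of the token-to-token checks that hold for (a, b)."""
    checks = [equality, initial, prefix, suffix, substring]
    return [check.__name__ for check in checks if check(a, b)]
```

A candidate generator could record which transformations relate the tokens of two attribute values and feed that record into the similarity scores; how Active Atlas actually combines them is not specified on this slide.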
Mapping Objects (Transformation Functions Example)
Mapping Objects (Compute Attribute Similarity Scores)
Mapping Objects (Compute Total Similarity Scores): The total object similarity score is computed as a weighted sum of the attribute similarity scores. Each attribute has a uniqueness weight, a heuristic measure of the importance of that attribute.
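The weighted sum described above can be sketched in a few lines. The attribute names, scores, and weights below are made-up illustrative values, and the slide does not say whether the uniqueness weights are normalized, so no normalization is assumed here.

```python
from typing import Dict

def total_similarity(attr_scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of per-attribute similarity scores.

    `weights` holds one uniqueness weight per attribute; a missing
    weight raises KeyError rather than silently defaulting.
    """
    return sum(weights[name] * score for name, score in attr_scores.items())

# Hypothetical scores for one candidate mapping between two restaurant records.
scores = {"name": 0.8, "street": 0.5, "phone": 1.0}
# Hypothetical uniqueness weights (a more distinctive attribute gets a larger weight).
uniqueness = {"name": 0.5, "street": 0.2, "phone": 0.3}

print(total_similarity(scores, uniqueness))
```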
Mapping Objects (Output of Candidate Generator)
Mapping Objects (Mapping-Rule Learning): Decision tree learning. Passive learning: requires a large set of training examples. Active learning: uses the query-by-bagging technique; selects a small set of initial training examples; includes a variety of training examples; creates a diverse set of decision tree learners; actively chooses the examples for the user to label.
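The active-learning loop above can be illustrated with a minimal query-by-bagging sketch. To keep it self-contained, one-level decision trees (stumps) stand in for full decision tree learners, and disagreement is measured as the size of the minority vote; both choices, along with all function names, are assumptions made for illustration rather than details of Active Atlas.

```python
import random
from typing import List, Sequence, Tuple

Example = Sequence[float]   # attribute similarity scores for one candidate mapping
Stump = Tuple[int, float]   # (attribute index, threshold): "mapped" if score >= threshold

def train_stump(X: List[Example], y: List[bool]) -> Stump:
    """Pick the (attribute, threshold) pair with the fewest training errors."""
    best, best_errors = (0, 0.0), len(X) + 1
    for attr in range(len(X[0])):
        for threshold in sorted({x[attr] for x in X}):
            errors = sum((x[attr] >= threshold) != label for x, label in zip(X, y))
            if errors < best_errors:
                best, best_errors = (attr, threshold), errors
    return best

def predict(stump: Stump, x: Example) -> bool:
    attr, threshold = stump
    return x[attr] >= threshold

def build_committee(X: List[Example], y: List[bool], size: int,
                    rng: random.Random) -> List[Stump]:
    """Bagging: train each learner on a bootstrap resample of the labeled set."""
    committee = []
    for _ in range(size):
        idx = [rng.randrange(len(X)) for _ in X]
        committee.append(train_stump([X[i] for i in idx], [y[i] for i in idx]))
    return committee

def disagreement(committee: List[Stump], x: Example) -> int:
    """Size of the minority vote; 0 means the committee is unanimous."""
    votes = sum(predict(stump, x) for stump in committee)
    return min(votes, len(committee) - votes)

def choose_query(committee: List[Stump], unlabeled: List[Example]) -> int:
    """Index of the unlabeled example the committee disagrees on most."""
    return max(range(len(unlabeled)),
               key=lambda i: disagreement(committee, unlabeled[i]))
```

In a full loop, the example returned by `choose_query` would be shown to the user, its label added to the training set, and the committee rebuilt, repeating until the committee agrees or a labeling budget runs out.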
Mapping Objects (Active Learning)
Experimental Results: Three different domains: Restaurants, Companies, and Airports. Experiments: two baseline experiments (compare the shared attributes separately; compare the object as a whole), both of which require choosing an optimal threshold; passive learning; active learning.
Experimental Results (Restaurants): Source A: 331 objects; Source B: 533 objects; 112 correct mappings; 3259 candidate mappings over 10 runs.
Measurement of Accuracy: Accuracy is the total number of correct classifications divided by the sum of the total number of mappings and the number of correct mappings not proposed.
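The accuracy measure above can be written as a small function. This sketch reads the denominator as the number of mappings classified plus the number of correct mappings the candidate generator never proposed; the parameter names are hypothetical, and the numbers in the usage lines are invented and are not experimental results.

```python
def accuracy(correct_classifications: int,
             total_mappings: int,
             missed_correct_mappings: int) -> float:
    """Correct classifications over (classified mappings + correct mappings never proposed)."""
    denominator = total_mappings + missed_correct_mappings
    if denominator == 0:
        raise ValueError("accuracy is undefined when there are no mappings")
    return correct_classifications / denominator

# Invented counts: missed correct mappings lower accuracy even when
# every proposed mapping is classified correctly.
print(accuracy(95, 95, 0))
print(accuracy(95, 95, 5))
```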
Experimental Results
Related Work
Conclusion: The research addresses the problem of mapping objects between structured web sources. The experimental results show that Active Atlas can achieve high accuracy while limiting user involvement.
Future Work