Multifaceted Exploitation of Metadata for Attribute Match Discovery in Information Integration
David W. Embley, David Jackman, Li Xu
Background
  Problem: Attribute Matching
  Matching Possibilities (Facets)
    Attribute Names
    Data-Value Characteristics
    Expected Data Values
    Data-Dictionary Information
    Structural Properties
Approach
  Target Schema T
  Source Schema S
  Framework
    Individual Facet Matching
    Combining Facets
    Best-First Match Iteration
Example
  [Figure: Source Schema S and Target Schema T, each drawn around Car with attributes attached by "has" relationships carrying participation constraints 0:1 or 0:*. Attributes appearing in the diagrams: Year, Make, Model, Cost, Style, Feature, Mileage, Miles, Phone. Matches shown: Year and Year, Make and Make, Model and Model, Mileage and Miles.]
Individual Facet Matching
  Attribute Names
  Data-Value Characteristics
  Expected Data Values
Attribute Names
  Target and Source Attributes: T : A, S : B
  WordNet
  C4.5 Decision Tree: feature selection
    f0: same word
    f1: synonym
    f2: sum of distances to a common hypernym root
    f3: number of different common hypernym roots
    f4: sum of the number of senses of A and B
WordNet Rule
  The number of different common hypernym roots of A and B
  The sum of distances of A and B to a common hypernym
  The sum of the number of senses of A and B
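The features f0 to f4 can be illustrated with a minimal sketch. It does not call WordNet: TOY_LEXICON and SYNONYMS are hypothetical stand-ins in which each sense is written as its hypernym path up to a root, and the function name name_features is likewise an assumption. Taking f2 as the smallest summed distance to a shared root is only one possible reading of the slide.

```python
# Toy stand-in for WordNet (all entries are hypothetical, for illustration).
# Each word maps to its senses; each sense is the hypernym path from the
# sense itself up to its root.
TOY_LEXICON = {
    "mileage": [["mileage", "distance", "measure", "abstraction"]],
    "miles": [["mile", "linear_unit", "measure", "abstraction"]],
    "make": [
        ["make", "kind", "category", "abstraction"],
        ["make", "act", "event"],
    ],
    "brand": [["brand", "kind", "category", "abstraction"]],
}
SYNONYMS = {frozenset(("make", "brand"))}


def name_features(a, b):
    """Compute f0..f4 for attribute names a (target) and b (source)."""
    a, b = a.lower(), b.lower()
    senses_a = TOY_LEXICON.get(a, [])
    senses_b = TOY_LEXICON.get(b, [])

    f0 = int(a == b)                           # same word
    f1 = int(frozenset((a, b)) in SYNONYMS)    # synonym

    # f3: number of different roots reachable from both names.
    roots_a = {path[-1] for path in senses_a}
    roots_b = {path[-1] for path in senses_b}
    f3 = len(roots_a & roots_b)

    # f2: smallest sum of distances to a shared root (assumed reading);
    # None when the two names share no root.
    f2 = None
    for path_a in senses_a:
        for path_b in senses_b:
            if path_a[-1] == path_b[-1]:
                dist = (len(path_a) - 1) + (len(path_b) - 1)
                f2 = dist if f2 is None else min(f2, dist)

    f4 = len(senses_a) + len(senses_b)         # total number of senses
    return {"f0": f0, "f1": f1, "f2": f2, "f3": f3, "f4": f4}
```

A feature vector of this shape, computed for each (A, B) pair, is the kind of input a decision-tree learner such as C4.5 would be trained on.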
Confidence Measures
Data-Value Characteristics
  C4.5 Decision Tree
  Features
    Numeric data (Mean, variation, standard deviation, …)
    Alphanumeric data (String length, numeric ratio, space ratio)
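A minimal sketch of how the listed data-value features might be computed for one attribute's instances. The function names are hypothetical, "variation" is read here as population variance (an assumption), and the ratios are taken over all characters in the column, which is also an assumption.

```python
import statistics


def numeric_features(values):
    """Summary statistics for a column of numeric data instances."""
    nums = [float(v) for v in values]
    if not nums:
        raise ValueError("need at least one value")
    return {
        "mean": statistics.fmean(nums),
        "variance": statistics.pvariance(nums),  # assumed meaning of "variation"
        "std_dev": statistics.pstdev(nums),
    }


def alphanumeric_features(values):
    """Character-level statistics for a column of string data instances."""
    strings = [str(v) for v in values]
    total_chars = sum(len(s) for s in strings)
    if total_chars == 0:
        raise ValueError("need at least one non-empty value")
    digits = sum(ch.isdigit() for s in strings for ch in s)
    spaces = sum(ch == " " for s in strings for ch in s)
    return {
        "mean_length": total_chars / len(strings),
        "numeric_ratio": digits / total_chars,  # share of digit characters
        "space_ratio": spaces / total_chars,    # share of space characters
    }
```

Computing these for a target attribute and a source attribute yields the paired feature values a decision tree can compare.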
Confidence Measures
Expected Data Values
  Target Schema T and Source Schema S
    Regular expression recognizer for attribute A in T
    Data instances for attribute B in S
  Hit Ratio = N'/N for (A, B) match
    N': number of B data instances recognized by the regular expressions of A
    N: number of B data instances
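The hit ratio N'/N can be sketched as follows. The function name hit_ratio, the sample year pattern, and the choice to require a full match of each trimmed instance are assumptions made for illustration, not details taken from the slides.

```python
import re


def hit_ratio(patterns, instances):
    """Fraction of source instances recognized by the target's patterns.

    patterns  -- regular expressions describing expected values of A in T
    instances -- data instances of attribute B in S
    """
    if not instances:
        return 0.0
    compiled = [re.compile(p) for p in patterns]
    # N': instances fully matched by at least one pattern (assumed criterion).
    hits = sum(
        1
        for value in instances
        if any(rx.fullmatch(value.strip()) for rx in compiled)
    )
    return hits / len(instances)  # N' / N


# Hypothetical recognizer for a Year attribute and some source instances.
YEAR_PATTERNS = [r"(19|20)\d{2}"]
ratio = hit_ratio(YEAR_PATTERNS, ["1998", "2001", "Ford", "1995"])
```

Here three of the four instances look like years, so the ratio for this (A, B) pair is 0.75.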
Confidence Measures
Combined Measures Threshold:
Final Confidence Measures
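Combining facets and best-first match iteration can be sketched under stated assumptions. The slides do not preserve the combination formula, so a weighted average is substituted here purely as a placeholder; the names combine and best_first_matches, the sample scores, and the threshold value are all hypothetical.

```python
def combine(facet_scores, weights):
    """Weighted average of per-facet confidences (assumed combination rule)."""
    return sum(w * s for w, s in zip(weights, facet_scores)) / sum(weights)


def best_first_matches(confidences, threshold):
    """Greedy best-first selection of (target, source) attribute matches.

    confidences -- dict mapping (target_attr, source_attr) to a confidence
    threshold   -- pairs below this confidence are never matched
    """
    matches = []
    used_targets, used_sources = set(), set()
    ranked = sorted(confidences.items(), key=lambda kv: kv[1], reverse=True)
    for (target, source), conf in ranked:
        if conf < threshold:
            break  # every remaining pair is weaker still
        if target in used_targets or source in used_sources:
            continue  # one of the attributes is already matched
        matches.append((target, source))
        used_targets.add(target)
        used_sources.add(source)
    return matches


# Hypothetical per-facet confidences: (names, value characteristics, expected values).
facet_scores = {
    ("Year", "Year"): (1.0, 0.9, 0.95),
    ("Miles", "Mileage"): (0.6, 0.9, 0.9),
    ("Miles", "Cost"): (0.0, 0.7, 0.2),
    ("Year", "Cost"): (0.0, 0.6, 0.1),
}
combined = {pair: combine(s, (1, 1, 1)) for pair, s in facet_scores.items()}
result = best_first_matches(combined, threshold=0.5)
```

With these sample numbers the strongest pair is settled first and both of its attributes are removed from further consideration, so the weaker candidates involving Year or Miles can never displace a match already made.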
Experimental Results
  Matched Attributes: 100% (32 of 32)
  Unmatched Attributes: 99.5% (374 of 376)
    "Feature" --- "Color"; "Feature" --- "Body Type"
  F %   F2 84%   F3 92%
  F1 98.9%   F2 97.9%   F3 98.4%
Conclusions
  Direct Attribute Matching: feasible
  Individual-Facet Matching: good
  Multifaceted Matching: better
Future Work
  Additional Facets
  More Sophisticated Combinations
  Additional Application Domains
  Automating Feature Selection
  Indirect Attribute Matching