Schema Mapping: Experiences and Lessons Learned
Yihong Ding
Data Extraction Group
Brigham Young University
Sponsored by NSF
2 Schema Mapping
Semantic correspondence between two schemas
Significance
–data integration
–data warehouses
–ontology merging
–message translation in e-commerce
–semantic query processing
–etc.
3 Schema Representation
[Figure: two example real-estate schemas. Node labels: House, Agent, Golf course, Water front, Phone_day, Phone_evening, Name, Address, Street, City, State, Basic_features, beds, SQFT, MLS, Location; agent, name, location, location_description, home phone, office phone, cell phone, MLS, Bedrooms]
4 1:1 Mapping Cardinality
[Figure: the two example real-estate schemas (House/Agent elements), illustrating 1:1 mappings]
5 n:1 Mapping Cardinality
[Figure: the two example real-estate schemas (House/Agent elements), illustrating n:1 mappings]
6 n:m Mapping Cardinality
[Figure: the two example real-estate schemas (House/Agent elements), illustrating n:m mappings]
7 Object-Set Matcher (schema-level)
Name-based matcher
–string and substring comparison
–linguistic methods: stemming, stop words, removing ignorable characters, etc.
–thesaurus: WordNet, etc.
1:1 mapping cardinality
[Figure: Agent and Name in one schema matched to agent and name in the other]
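A minimal sketch of a name-based matcher along these lines, assuming a toy stop-word list, a one-rule stemmer, and a hand-written synonym table in place of a real thesaurus such as WordNet. The names `normalize`, `name_match`, `STOP_WORDS`, `SYNONYMS`, and the 1.0 / 0.5 / 0.0 scores are hypothetical and not taken from the slides.

```python
import re

# Hypothetical stand-ins for a real stop-word list and thesaurus.
STOP_WORDS = {"of", "the", "a", "an"}
SYNONYMS = {"phone": "telephone"}


def normalize(label):
    """Lowercase, split on ignorable characters, drop stop words, stem crudely."""
    tokens = re.split(r"[^a-z0-9]+", label.lower())
    result = []
    for tok in tokens:
        if not tok or tok in STOP_WORDS:
            continue
        # Very crude stemming: strip a trailing plural "s" from longer words.
        if len(tok) > 3 and tok.endswith("s"):
            tok = tok[:-1]
        result.append(SYNONYMS.get(tok, tok))
    return result


def name_match(label_a, label_b):
    """Return 1.0 for equal normalized names, 0.5 for a substring hit, else 0.0."""
    a, b = normalize(label_a), normalize(label_b)
    if not a or not b:
        return 0.0
    if a == b:
        return 1.0
    joined_a, joined_b = "".join(a), "".join(b)
    if joined_a in joined_b or joined_b in joined_a:
        return 0.5
    return 0.0
```

With these assumptions, `name_match("Agent", "agent")` scores an exact match, while `name_match("Bedrooms", "beds")` only reaches the substring score.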
8 Object-Set Matcher (instance-level)
Data Frame
–multiple regular expressions in Perl style
–as simple as a list of data values
Data-frame matcher
–use: compare recognized data values
–benefit: able to recognize disjunctive data value sets
–bias: data frame may not correspond 100% with the semantics
–limitation: a needed data frame might not exist
1:1 mapping cardinality
[Figure: Object-set A (Ford, Honda) and Object-set B (Chevy, Toyota), with the Car Model data frame (Ford, Honda, Chevy, Toyota, …)]
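The data-frame idea can be sketched as follows, assuming a data frame is just a list of compiled regular expressions and that two object sets match when the same frame recognizes most values in both. The names `CAR_FRAME`, `recognized_fraction`, `data_frame_match`, and the 0.8 threshold are hypothetical illustrations, not the authors' implementation.

```python
import re

# Hypothetical data frame: "as simple as a list of data values", compiled to a regex.
CAR_FRAME = [re.compile(r"\b(?:Ford|Honda|Chevy|Toyota)\b", re.IGNORECASE)]


def recognized_fraction(values, frame):
    """Fraction of data values recognized by at least one pattern in the frame."""
    if not values:
        return 0.0
    hits = sum(1 for v in values if any(p.search(v) for p in frame))
    return hits / len(values)


def data_frame_match(values_a, values_b, frame, threshold=0.8):
    """Two object sets match if the same data frame recognizes both value sets.

    The threshold is an assumed parameter for this sketch.
    """
    return (recognized_fraction(values_a, frame) >= threshold
            and recognized_fraction(values_b, frame) >= threshold)
```

Note that `data_frame_match(["Ford", "Honda"], ["Chevy", "Toyota"], CAR_FRAME)` succeeds even though the two value sets are disjoint, which is the benefit the slide names.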
9 Extended Data-Frame Matcher (instance-level)
n:1 mapping cardinality
Add a STRICT_SUBSTRING operation
With the help of structural analysis
[Figure: Address (Street, City, State) in one schema matched to location in the other; example value: N. University Ave., Provo, UT]
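One possible reading of the STRICT_SUBSTRING operation is sketched below, assuming it means "occurs inside the target value without being equal to it". The exact semantics are not given on the slide, so the functions `strict_substring` and `n_to_1_candidates` and the sample MLS value are hypothetical.

```python
def strict_substring(part, whole):
    """True when part occurs inside whole but is not the whole value."""
    return part != whole and part in whole


def n_to_1_candidates(source_columns, target_values):
    """Return source elements whose every value is a strict substring of some target value.

    source_columns maps an element name to its list of data values.
    """
    matched = []
    for name, values in source_columns.items():
        if values and all(
            any(strict_substring(v, t) for t in target_values) for v in values
        ):
            matched.append(name)
    return matched


# Example built from the address on the slide; the MLS value is made up.
source = {
    "Street": ["N. University Ave."],
    "City": ["Provo"],
    "State": ["UT"],
    "MLS": ["12345"],
}
target = ["N. University Ave., Provo, UT"]
candidates = n_to_1_candidates(source, target)
```

Here Street, City, and State are all proposed as parts of one location element, giving an n:1 candidate; a structural check (for example, that the three share the parent Address) would still be needed to confirm it.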
10 Direct Structure Matcher
Comparing structure similarity between two candidate schemas
1:1 mapping cardinality
[Figure: Agent with Name, Fax, Address in one schema; agent with name, fax, phone, Location, phone_day in the other]
11 Reference Structure Matcher
If A and B match C, then A matches B.
Able to solve n:m mapping cardinality
1:1, n:1, and n:m mapping cardinalities
[Figure: reference structure Phone (Day Phone, Evening Phone, Cell Phone, Home Phone, Office Phone) linking the phone elements of Schema 1 and Schema 2]
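The rule "if A and B match C, then A matches B" can be sketched as grouping elements from both schemas by the reference concept they match. This assumes each schema has already been matched against the reference structure; the function `match_via_reference` and the assignment of phone elements to schemas in the example are hypothetical.

```python
from collections import defaultdict


def match_via_reference(matches_1, matches_2):
    """Derive schema-to-schema matches from matches against a shared reference.

    matches_1 and matches_2 map a schema element to the reference concept it matched.
    Returns {concept: (sorted elements of schema 1, sorted elements of schema 2)}.
    """
    groups = defaultdict(lambda: (set(), set()))
    for element, concept in matches_1.items():
        groups[concept][0].add(element)
    for element, concept in matches_2.items():
        groups[concept][1].add(element)
    # Keep only concepts that both schemas reach; those yield a mapping.
    return {
        concept: (sorted(a), sorted(b))
        for concept, (a, b) in groups.items()
        if a and b
    }


schema_1 = {"Phone_day": "Phone", "Phone_evening": "Phone", "Name": "Name"}
schema_2 = {"home phone": "Phone", "office phone": "Phone",
            "cell phone": "Phone", "location": "Location"}
mapping = match_via_reference(schema_1, schema_2)
```

In this example two elements on one side and three on the other land in the same group, which is an n:m mapping; a group with one element per side would be an ordinary 1:1 match.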
12 Experiments
Data borrowed from Univ. of Washington [DDH, SIGMOD01]
Table columns: Application (Number of Schemas), Precision (%), Recall (%), F (%), Number Matches, Number Correct, Number Incorrect
Table rows: Faculty Member (5), Course Schedule (5), Real Estate (5)
Indirect Matches: precision 87%, recall 94%, F-measure 90%
Rough Comparison with U of W Results
–Faculty Member: accuracy ~92%
–Course Schedule: accuracy ~71%
–Real Estate (2 tests): accuracy ~75%
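As a check on the indirect-match figures, the sketch below assumes F is the standard balanced F-measure (the harmonic mean of precision and recall); the slide does not state the formula, and the function name `f_measure` is hypothetical.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Indirect matches reported on the slide: precision 87%, recall 94%.
indirect_f = f_measure(0.87, 0.94)
```

Under that assumption, `indirect_f` rounds to 0.90, consistent with the reported F-measure of 90%.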
13 Lessons Learned
n:1 and n:m matches occur frequently.
–22% = 97/437 [DMD+03] (Course Catalog, Company Profile)
–45% = 287/638 (Car Ads, Cell Phones, Real Estate)
Reference structures provide a way to solve the long-standing, hard cluster-mapping (n:m cardinality) problem.
Data frames improve the instance-level matchers.
The combination of schema-level and instance-level matchers improves the results.