Direct and Indirect Matching of Schema Elements for Data Integration on the Web. Li Xu, Data Extraction Group, Brigham Young University. Sponsored by NSF.
Car Schema Matching [Figure: source and target car schemas. Element labels: Car, Year, Cost, Style, Feature, Phone, Miles, Mileage, Model, Make & Model, Color, Body Type.]
Mapping: Direct Matches; Indirect Matches (Union, Selection, Composition, Decomposition)
Union and Selection [Figure: the source and target car schemas, with the same element labels as on the Car Schema Matching slide.]
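A minimal sketch of the two indirect matches named on this slide, under stated assumptions: union pools the values of several source elements into one target element, and selection keeps only the source values that belong to the target concept. The element names `day_phone` and `evening_phone`, the sample rows, and the `BODY_TYPES` list are hypothetical and do not come from the slides.

```python
# Hypothetical source rows; element names and values are assumptions.
source_rows = [
    {"car": 1, "day_phone": "555-0101", "evening_phone": "555-0102",
     "feature": "red"},
    {"car": 1, "day_phone": None, "evening_phone": None, "feature": "sedan"},
    {"car": 2, "day_phone": "555-0199", "evening_phone": None,
     "feature": "coupe"},
]

# Assumed expected values for the target concept Body Type.
BODY_TYPES = {"sedan", "coupe", "wagon"}


def union_match(rows, source_elements):
    """Union: pool the non-null values of several source elements per car."""
    merged = {}
    for row in rows:
        for element in source_elements:
            value = row.get(element)
            if value is not None:
                merged.setdefault(row["car"], set()).add(value)
    return merged


def selection_match(rows, source_element, accepted_values):
    """Selection: keep only source values that belong to the target concept."""
    return {
        (row["car"], row[source_element])
        for row in rows
        if row[source_element] in accepted_values
    }


phones = union_match(source_rows, ["day_phone", "evening_phone"])
body_types = selection_match(source_rows, "feature", BODY_TYPES)
```

Here `phones` plays the role of a target Phone element fed by two source elements, and `body_types` plays the role of a target Body Type element fed by a subset of the source Feature values.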
Composition and Decomposition [Figure: the source and target car schemas, with the same element labels as on the Car Schema Matching slide.]
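A minimal sketch of composition and decomposition for the Make, Model, and Make & Model elements, assuming values are plain strings separated by a space. The helper names `compose` and `decompose` and the list of known makes are hypothetical.

```python
def compose(make, model):
    """Composition: build one target value from several source values."""
    return f"{make} {model}"


def decompose(make_and_model, known_makes):
    """Decomposition: split one source value into several target values.

    Tries each known make as a prefix, longest first, so that a
    multi-word make is not cut short.
    """
    text = make_and_model.strip()
    for make in sorted(known_makes, key=len, reverse=True):
        if text.lower().startswith(make.lower() + " "):
            return make, text[len(make):].strip()
    return None  # no known make found; leave the value unmatched
```

For example, `decompose("Ford Mustang", {"Ford", "Honda"})` yields the pair `("Ford", "Mustang")`, and `compose("Ford", "Mustang")` reverses it.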
Matching Techniques: Terminological Relationships; Value Characteristics; Expected Data Values; Structure
Terminological Relationships: WordNet; Machine-Learned Rules. Example: (Make, Brand). The number of different common hypernym roots of A and B; the sum of distances of A and B to a common hypernym; the sum of the number of senses of A and B.
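The three quantities listed on this slide can be sketched on a toy hypernym graph. This is an illustration under assumptions, not the authors' implementation: the `SENSES` and `HYPERNYMS` tables are invented stand-ins for WordNet, and "a common hypernym" is read here as the nearest shared one.

```python
from collections import deque

# Toy hypernym graph standing in for WordNet; every entry is an assumption.
SENSES = {
    "make": ["make.n.01", "make.v.01"],
    "brand": ["brand.n.01", "brand.n.02"],
}
HYPERNYMS = {
    "make.n.01": ["kind.n.01"],
    "brand.n.01": ["kind.n.01"],
    "brand.n.02": ["mark.n.01"],
    "kind.n.01": ["entity.n.01"],
    "mark.n.01": ["entity.n.01"],
    "make.v.01": ["act.v.01"],
}


def ancestors(sense):
    """Map each node reachable by hypernym links (sense included) to its distance."""
    dist = {sense: 0}
    queue = deque([sense])
    while queue:
        node = queue.popleft()
        for parent in HYPERNYMS.get(node, []):
            if parent not in dist:
                dist[parent] = dist[node] + 1
                queue.append(parent)
    return dist


def word_ancestors(word):
    """Minimum distance from any sense of word to each reachable hypernym."""
    best = {}
    for sense in SENSES[word]:
        for node, d in ancestors(sense).items():
            best[node] = min(d, best.get(node, d))
    return best


def features(a, b):
    """Compute the three WordNet-style features for the word pair (a, b)."""
    anc_a, anc_b = word_ancestors(a), word_ancestors(b)
    common = set(anc_a) & set(anc_b)
    # A root is a shared node with no hypernym of its own.
    roots = {n for n in common if not HYPERNYMS.get(n)}
    # Smallest combined distance to any shared hypernym (None if none shared).
    distance = min((anc_a[n] + anc_b[n] for n in common), default=None)
    senses = len(SENSES[a]) + len(SENSES[b])
    return {"common_roots": len(roots), "distance_sum": distance,
            "sense_count": senses}
```

Feature values like these would then be the inputs to the machine-learned rules; the learning step itself is not sketched here.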
Value Characteristics: Machine Learning; Features [LC94]: string length, numeric ratio, space ratio; mean, variance, coefficient of variation, standard deviation.
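A minimal sketch of how the listed value features might be computed for one column of string values. The function name `value_features`, the choice of population statistics, and taking the dispersion measures over string length are assumptions; the slide does not say which quantities the statistics are taken over. The input is assumed to be non-empty.

```python
import statistics


def value_features(values):
    """Summarize a column of string values with simple character statistics."""
    lengths = [len(v) for v in values]
    # Share of digit and whitespace characters in each value.
    numeric_ratios = [sum(c.isdigit() for c in v) / len(v) if v else 0.0
                      for v in values]
    space_ratios = [sum(c.isspace() for c in v) / len(v) if v else 0.0
                    for v in values]

    mean_len = statistics.mean(lengths)
    var_len = statistics.pvariance(lengths)
    std_len = statistics.pstdev(lengths)
    return {
        "mean_length": mean_len,
        "variance_length": var_len,
        "std_length": std_len,
        # Coefficient of variation: standard deviation relative to the mean.
        "cv_length": std_len / mean_len if mean_len else 0.0,
        "mean_numeric_ratio": statistics.mean(numeric_ratios),
        "mean_space_ratio": statistics.mean(space_ratios),
    }
```

Two columns whose feature dictionaries are close, for example a Miles column and a Mileage column, would be candidate matches for a learned classifier.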
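Expected Values: Application Concepts; Data Frames: CarMake (“ford”, “honda”, …), CarModel (“accord”, “mustang”, “taurus”, …). [Figure: Source and Target schema elements Make & Model, Brand, Model. Values “Ford Mustang”, “Ford Taurus”, “Ford F150”, … recognized as CarMake . CarModel; values “Legend”, “Mustang”, “A4”, … recognized as CarModel; values “Acura”, “Audi”, “BMW”, … recognized as CarMake.]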
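The idea of recognizing expected values can be sketched with small regular-expression recognizers. This is an illustration under assumptions: the lexicons below contain only the values visible on the slide, and the names `recognizer`, `hit_rates`, and `split_make_model` are hypothetical, not part of any data-frame library.

```python
import re

# Tiny value lexicons built from the slide's examples; real data frames
# would be far richer.
CAR_MAKES = ["ford", "honda", "acura", "audi", "bmw"]
CAR_MODELS = ["accord", "mustang", "taurus", "f150", "legend", "a4"]


def recognizer(words):
    """Build a case-insensitive regex that matches any whole word in words."""
    alternatives = "|".join(re.escape(w) for w in words)
    return re.compile(rf"\b(?:{alternatives})\b", re.IGNORECASE)


RECOGNIZERS = {
    "CarMake": recognizer(CAR_MAKES),
    "CarModel": recognizer(CAR_MODELS),
}


def hit_rates(values):
    """Fraction of values in which each concept's recognizer finds a match."""
    return {
        concept: sum(bool(rx.search(v)) for v in values) / len(values)
        for concept, rx in RECOGNIZERS.items()
    }


def split_make_model(value):
    """Pull the CarMake and CarModel substrings out of one combined value."""
    make = RECOGNIZERS["CarMake"].search(value)
    model = RECOGNIZERS["CarModel"].search(value)
    return (make.group(0) if make else None,
            model.group(0) if model else None)
```

A column whose values hit both recognizers, such as "Ford Mustang", suggests an indirect match to two concepts at once, while a column that hits only one suggests a direct match.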
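Structure [Figure: two purchase-order schema trees, labeled Source and Target. One tree: PO with children POShipTo (City, Street), POBillTo (City, Street), and POLines (Item, Count), where Item has Line, Qty, UoM. Other tree: PurchaseOrder with children DeliverTo, InvoiceTo, and Items (Item, ItemCount), where Item has ItemNumber, Quantity, UnitOfMeasure, and an Address node has City, Street.]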
Structure (Cont.) [Figure: the purchase-order trees from the Structure slide, with DeliverTo shown again beside Address (City, Street).]
Structure (Cont.) [Figure: the purchase-order trees from the Structure slide, with POShipTo and DeliverTo paired and City, Street shown under each.]
Structure (Cont.) [Figure: the purchase-order trees from the Structure slide, with POShipTo and DeliverTo paired; Line, Qty, UoM shown against ItemNumber, Quantity, UnitOfMeasure.]
Structure (Cont.) [Figure: the purchase-order trees from the Structure slide, with POShipTo and DeliverTo paired; Count, Line, Qty shown alongside Quantity, UnitOfMeasure.]
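One way structure can disambiguate matches in the purchase-order example is to compare the ancestors of candidate leaves. The sketch below is an assumption-laden illustration, not the method from the slides: the tree hierarchy is read from the figure labels, `NAME_MATCHES` is a hypothetical set of element-level matches, and the scoring rule is invented for this example.

```python
# Nested dicts stand in for the two schema trees; leaves map to empty dicts.
PO = {"PO": {
    "POShipTo": {"City": {}, "Street": {}},
    "POBillTo": {"City": {}, "Street": {}},
    "POLines": {"Item": {"Line": {}, "Qty": {}, "UoM": {}}, "Count": {}},
}}
PURCHASE_ORDER = {"PurchaseOrder": {
    "DeliverTo": {"Address": {"City": {}, "Street": {}}},
    "InvoiceTo": {"Address": {"City": {}, "Street": {}}},
    "Items": {"Item": {"ItemNumber": {}, "Quantity": {}, "UnitOfMeasure": {}},
              "ItemCount": {}},
}}

# Assumed element-level matches, e.g. produced by the other techniques.
NAME_MATCHES = {
    "PO": "PurchaseOrder", "POShipTo": "DeliverTo", "POBillTo": "InvoiceTo",
    "POLines": "Items", "Item": "Item", "Count": "ItemCount",
    "Line": "ItemNumber", "Qty": "Quantity", "UoM": "UnitOfMeasure",
    "City": "City", "Street": "Street",
}


def leaf_paths(tree, prefix=()):
    """Yield the root-to-leaf path of every leaf as a tuple of names."""
    for name, children in tree.items():
        path = prefix + (name,)
        if children:
            yield from leaf_paths(children, path)
        else:
            yield path


def context_score(path_a, path_b):
    """Fraction of path_a's ancestors whose assumed match is an ancestor in path_b."""
    ancestors_a, ancestors_b = path_a[:-1], set(path_b[:-1])
    if not ancestors_a:
        return 0.0
    hits = sum(NAME_MATCHES.get(a) in ancestors_b for a in ancestors_a)
    return hits / len(ancestors_a)


def match_leaves(tree_a, tree_b):
    """Pair each leaf of tree_a with the name-compatible leaf of tree_b
    whose ancestors agree best with its own."""
    paths_b = list(leaf_paths(tree_b))
    result = {}
    for pa in leaf_paths(tree_a):
        candidates = [pb for pb in paths_b
                      if NAME_MATCHES.get(pa[-1]) == pb[-1]]
        if candidates:
            result[pa] = max(candidates, key=lambda pb: context_score(pa, pb))
    return result
```

With these inputs, City under POShipTo pairs with City under DeliverTo rather than InvoiceTo, because the POShipTo and DeliverTo ancestors are themselves matched.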
Experiments: Methodology; Measures (Precision, Recall, F-Measure)
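The three measures can be computed from counts of correct, false-positive, and false-negative matches. This sketch assumes the balanced F-measure (harmonic mean of precision and recall), which the slide does not spell out; the function name is hypothetical.

```python
def precision_recall_f(correct, false_positive, false_negative):
    """Return (precision, recall, F-measure) from match counts."""
    proposed = correct + false_positive   # matches the system reported
    actual = correct + false_negative     # matches that truly exist
    precision = correct / proposed if proposed else 0.0
    recall = correct / actual if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure
```

Called with made-up counts such as `precision_recall_f(8, 2, 2)`, precision and recall are both 8/10, and so is their harmonic mean.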
Results. Table columns: Applications (Number of Schemes), Precision (%), Recall (%), F (%), Correct, False Positive, False Negative. Applications: Course Schedule (5), Faculty Member (5), Real Estate (5). Data borrowed from Univ. of Washington. Indirect Matches: 94% (precision, recall, F-measure).
Contributions: Direct Matches; Indirect Matches; Expected Values; Structure; High Precision and High Recall