Schema Matching and Data Extraction over HTML Tables
Cui Tao
Data Extraction Research Group
Department of Computer Science
Brigham Young University
Supported by NSF
2 Introduction
Many tables on the Web
Ontology-based extraction:
  Works well for unstructured or semi-structured data
  What about structured data – tables?
How to integrate data stored in different tables?
  Detect the table of interest
  Form attribute-value pairs (adjust if necessary)
  Do extraction
  Infer mappings from extraction patterns
3 Problem Detecting The Table of Interest ?
4 Problem Different schemas
Different source table schemas:
  {Run #, Yr, Make, Model, Tran, Color, Dr}
  {Make, Model, Year, Colour, Price, Auto, Air Cond., AM/FM, CD}
  {Vehicle, Distance, Price, Mileage}
  {Year, Make, Model, Trim, Invoice/Retail, Engine, Fuel Economy}
Target database schema:
  {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {Car, Feature}
5 Problem Attribute is Value
6 Problem Attribute-Value is Value ??
7 Problem Value is not Value
8 Problem Factored Values
9 Problem Split Values
10 Problem Merged Values
11 Problem Information Behind Links
List
Table extending over several pages
12 Solution
Detect the table of interest
Form attribute-value pairs (adjust if necessary)
Do extraction
Infer mappings from extraction patterns
13 Solution Detect The Table of Interest
Top-level tables
  Table size: at least 3 rows and columns
  Grid layout: same # of values
  Attributes
  Value density: (# of ontology extracted values) / (total # of values in the table)
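The top-level table checks can be sketched roughly as follows. This is a minimal illustration, not the author's implementation: the names `value_density`, `is_table_of_interest`, and `toy_recognizer` are hypothetical, the `min_density` threshold of 0.5 is an assumed value the slides do not state, and the table is assumed to carry its attributes in the first row.

```python
import re
from typing import Callable, List

Table = List[List[str]]  # assumption: first row holds attributes, the rest hold values


def value_density(table: Table, recognizes: Callable[[str], bool]) -> float:
    """(# of ontology extracted values) / (total # of values in the table)."""
    values = [cell for row in table[1:] for cell in row if cell.strip()]
    if not values:
        return 0.0
    return sum(1 for v in values if recognizes(v)) / len(values)


def is_table_of_interest(table: Table,
                         recognizes: Callable[[str], bool],
                         min_density: float = 0.5) -> bool:
    # Table size: at least 3 rows and 3 columns.
    if len(table) < 3 or len(table[0]) < 3:
        return False
    # Grid layout: every row has the same number of cells.
    if len({len(row) for row in table}) != 1:
        return False
    # Value density: enough cells are recognized by the ontology (threshold assumed).
    return value_density(table, recognizes) >= min_density


def toy_recognizer(value: str) -> bool:
    """Stand-in for an extraction ontology: years, dollar prices, two makes."""
    return bool(re.fullmatch(r"(19|20)\d\d|\$[\d,]+", value)) or value in {"Honda", "Ford"}
```

A real recognizer would be backed by the extraction ontology for the application domain; the toy one is only there to make the sketch runnable.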
14 Solution Detect The Table of Interest
Linked-page tables
  Table size: at least 2 rows and columns
  Attributes
  Attribute-value-pair pattern
Page-spanning tables
15 Solution Remove Factoring
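One way to picture removing factoring, as a rough sketch only: the slide shows the step by example, so the layout assumed here (a factored value appears as a single-cell row that heads a group of ordinary rows) and the names `remove_factoring` and `factored_attribute` are hypothetical.

```python
from typing import List, Optional, Tuple


def remove_factoring(header: List[str],
                     rows: List[List[str]],
                     factored_attribute: str = "Factored") -> Tuple[List[str], List[List[str]]]:
    """Copy each single-cell group row into every row that follows it."""
    new_header = [factored_attribute] + header
    result: List[List[str]] = []
    current: Optional[str] = None
    for row in rows:
        if len(row) == 1:
            # Assumed shape of a factored value: one cell heading a group of rows.
            current = row[0]
            continue
        result.append([current or ""] + row)
    return new_header, result
```

For example, rows `["Honda"], ["1999", "Civic"], ["2000", "Accord"]` become `["Honda", "1999", "Civic"]` and `["Honda", "2000", "Accord"]`, so each row carries its own copy of the factored value.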
16 Solution Replace Boolean Values
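A possible sketch of replacing Boolean values, assuming that columns such as Auto, Air Cond., AM/FM, and CD hold yes/no marks and that a positive mark should be replaced by the attribute name itself. The mark sets and the function name `replace_boolean_values` are assumptions made for illustration.

```python
from typing import List

TRUE_MARKS = {"yes", "y", "x", "true"}    # assumed spellings of "present"
FALSE_MARKS = {"no", "n", "", "false"}    # assumed spellings of "absent"


def replace_boolean_values(header: List[str], rows: List[List[str]]) -> List[List[str]]:
    marks = TRUE_MARKS | FALSE_MARKS
    # A column is treated as Boolean when every one of its cells is a yes/no mark.
    bool_cols = {
        i for i in range(len(header))
        if rows and all(r[i].strip().lower() in marks for r in rows)
    }
    out = []
    for r in rows:
        new_row = []
        for i, cell in enumerate(r):
            if i in bool_cols:
                # A positive mark becomes the attribute name; a negative one is dropped.
                new_row.append(header[i] if cell.strip().lower() in TRUE_MARKS else "")
            else:
                new_row.append(cell)
        out.append(new_row)
    return out
```

After this step a row such as `["Honda", "Yes", "No"]` under the header `["Make", "Auto", "CD"]` reads `["Honda", "Auto", ""]`, which gives the extractor a recognizable feature value instead of a bare "Yes".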
17 Solution Form Attribute-Value Pairs
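Forming attribute-value pairs can be illustrated with a short sketch. It assumes the attributes sit in a header row aligned with each data row; the function name `form_attribute_value_pairs` is hypothetical, and the exact pair notation used in the original system is not shown on the slide.

```python
from typing import List, Tuple


def form_attribute_value_pairs(header: List[str],
                               rows: List[List[str]]) -> List[List[Tuple[str, str]]]:
    """Pair each non-empty cell with the attribute of its column, row by row."""
    return [
        [(attr, value) for attr, value in zip(header, row) if value.strip()]
        for row in rows
    ]
```

With the header `["Year", "Make"]`, the row `["1999", "Honda"]` yields `[("Year", "1999"), ("Make", "Honda")]`; empty cells produce no pair.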
18 Solution Adjust Attribute-Value Pairs
19 Solution Add Information Hidden Behind Links
Unstructured and semi-structured: concatenate
Single attribute-value pairs: pair them together
List: mark the beginning and the end < >
20 Solution Inferred Mapping Creation {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {Car, Feature}
21 Solution Inferred Mapping Creation {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {Car, Feature} Each row is a car.
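Inferring mappings from extraction patterns can be sketched as a vote over rows. This is a simplification chosen for illustration, not necessarily the procedure used in the original work: it assumes the extractor reports, for each row, which source attribute supplied the value for each target attribute, and it keeps a mapping when enough rows agree. The names `infer_mappings` and `min_support` are hypothetical, and the 0.5 threshold is an assumed value.

```python
from collections import Counter, defaultdict
from typing import Dict, List


def infer_mappings(extractions: List[Dict[str, str]],
                   min_support: float = 0.5) -> Dict[str, str]:
    """extractions: one dict per row (each row is a car), mapping a target
    attribute to the source attribute whose cell supplied the extracted value."""
    votes = defaultdict(Counter)
    for row in extractions:
        for target, source in row.items():
            votes[target][source] += 1

    n_rows = len(extractions)
    mappings: Dict[str, str] = {}
    for target, counter in votes.items():
        source, count = counter.most_common(1)[0]
        # Keep the mapping only if enough rows support it (threshold assumed).
        if count / n_rows >= min_support:
            mappings[target] = source
    return mappings
```

For a source schema such as {Run #, Yr, Make, Model, Tran, Color, Dr}, rows that consistently yield Year values from the Yr column would produce the mapping Year ← Yr.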
28 Experimental Results − Table Location
Car advertisement application domain
Training set: 7, 100% (7) recognized
Testing set: 53, 87% (46) recognized
Top table location: Precision: 100%, Recall: 87%
Structured linked page location: Precision: 86%, Recall: 92%
29 Experimental Results − Mapping
Car advertisement application domain
46 recognized tables in the testing set
Total 319 mappings
Precision: 95.8%, Recall: 92.8%
Of the 296 correct mappings:
  Top-level tables: 77%
  Linked tables: 19.6%
  Both: 3.4%
30 Experimental Results − Table Location
Cell-phone sales application domain
Training set: 5, 100% (5) recognized
Testing set: 12, 92% (11) recognized
Top table location: Precision: 100%, Recall: 92%
31 Experimental Results − Mapping
Cell-phone sales application domain
11 recognized tables in the testing set
Total 97 mappings
Precision: 90.1%, Recall: 85.4%
Of the 88 correct mappings:
  Top-level tables: 85.4%
  Linked tables: 50.5%
  Both: 35.9%
32 Contribution
Provides an approach to extract information automatically from HTML tables
Suggests a different way to solve the problem of schema matching