1
ER 2002, BYU Data Extraction Group
Automatically Extracting Ontologically Specified Data from HTML Tables with Unknown Structure
David W. Embley, Cui Tao, Stephen W. Liddle
Brigham Young University
Funded by NSF
2
Information Exchange
Source → Target
Leverage this (Information Extraction) … to do this (Schema Matching)
3
Information Extraction
4
Extracting Pertinent Information from Documents
5
A Conceptual-Modeling Solution
[Diagram: conceptual model centered on Car. Car has Year, Make, Model, Mileage, Price, and Feature; PhoneNr is for Car; PhoneNr has Extension. Participation constraints shown: 0..1, 0..*, 1..*.]
6
Car-Ads Ontology
    Car [->object];
    Car [0..1] has Year [1..*];
    Car [0..1] has Make [1..*];
    Car [0..1] has Model [1..*];
    Car [0..1] has Mileage [1..*];
    Car [0..*] has Feature [1..*];
    Car [0..1] has Price [1..*];
    PhoneNr [1..*] is for Car [0..*];
    PhoneNr [0..1] has Extension [1..*];
    Year matches [4]
        constant { extract "\d{2}"; context "([^\$\d]|^)[4-9]\d[^\d]"; substitute "^" -> "19"; },
    …
    End;
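The Year rule above can be read as a three-step recognizer: find a context match, pull the two digits out of it, then prefix "19". The following is a minimal Python sketch under that reading of extract/context/substitute. The function name `extract_years` is hypothetical and is not part of the ontology language or the authors' tooling.

```python
import re

# Patterns copied from the Year rule in the Car-Ads ontology.
CONTEXT = re.compile(r"([^\$\d]|^)[4-9]\d[^\d]")
EXTRACT = re.compile(r"\d{2}")


def extract_years(text):
    """Return four-digit years for two-digit years found in context."""
    years = []
    for ctx in CONTEXT.finditer(text):
        # The context always contains exactly one two-digit run.
        two_digits = EXTRACT.search(ctx.group(0)).group(0)
        # substitute "^" -> "19": prepend the century (assumed reading).
        years.append("19" + two_digits)
    return years


print(extract_years("'89 Subaru SW $1900"))
```

The context pattern is what keeps the price "$1900" from being read as a year: its digits are preceded by "$" or by another digit.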
7
Recognition and Extraction

Car   Year  Make    Model      Mileage  Price  PhoneNr
0001  1989  Subaru  SW                  $1900  (336)835-8597
0002  1998          Elantra                    (336)526-5444
0003  1994  HONDA   ACCORD EX  100K            (336)526-1081

Car   Feature
0001  Auto
0001  AC
0002  Black
0002  4 door
0002  tinted windows
0002  Auto
0002  pb
0002  ps
0002  cruise
0002  am/fm
0002  cassette stereo
0002  a/c
0003  Auto
0003  jade green
0003  gold
8
Schema Matching for HTML Tables with Unknown Structure
9
Table-Schema Matching (Basic Idea)
Many Tables on the Web
Ontology-Based Extraction
–Works well for unstructured or semistructured data
–What about structured data (tables)?
Method
–Form attribute-value pairs
–Do extraction
–Infer mappings from extraction patterns
10
Problem: Different Schemas
Target Database Schema
–{Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature}
Different Source Table Schemas
–{Run #, Yr, Make, Model, Tran, Color, Dr}
–{Make, Model, Year, Colour, Price, Auto, Air Cond., AM/FM, CD}
–{Vehicle, Distance, Price, Mileage}
–{Year, Make, Model, Trim, Invoice/Retail, Engine, Fuel Economy}
11
Problem: Attribute is Value
12
Problem: Attribute-Value is Value
13
Problem: Value is not Value
14
Problem: Implied Values
15
Problem: Missing Attributes
16
Problem: Compound Attributes
17
Problem: Factored Values
18
Problem: Split Values
19
Problem: Merged Values
20
Problem: Values not of Interest
21
Problem: Information Behind Links
–Single-Column Table (formatted as list)
–Table extending over several pages
22
Solution
–Form attribute-value pairs (adjust if necessary)
–Do extraction
–Infer mappings from extraction patterns
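The three steps above can be sketched in a few lines of Python. This is only an illustration: the `RECOGNIZERS` table is a hypothetical stand-in for the ontology's recognizers, and "map each column to the target attribute recognized most often in it" is an assumed, simplified inference rule rather than the authors' actual procedure.

```python
import re
from collections import Counter

# Hypothetical recognizers standing in for the ontology's extraction rules.
RECOGNIZERS = {
    "Year": re.compile(r"(?<![\$\d])(19|20)\d{2}\b"),
    "Price": re.compile(r"\$\d[\d,]*"),
}


def form_pairs(header, row):
    """Step 1: pair each cell with its column attribute."""
    return list(zip(header, row))


def infer_mappings(header, rows):
    """Steps 2 and 3: run extraction over every attribute-value pair, then
    map each source column to the target attribute recognized most often."""
    hits = {col: Counter() for col in header}
    for row in rows:
        for col, value in form_pairs(header, row):
            for target, pattern in RECOGNIZERS.items():
                if pattern.search(f"{col}: {value}"):
                    hits[col][target] += 1
    # Columns in which nothing was recognized get no mapping.
    return {col: c.most_common(1)[0][0] for col, c in hits.items() if c}


header = ["Yr", "Model", "Cost"]
rows = [["1998", "Elantra", "$4,500"], ["1994", "ACCORD EX", "$1900"]]
print(infer_mappings(header, rows))
```

Here the source column names (Yr, Cost) differ from the target attributes, yet the mapping falls out of what the recognizers extract from the values.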
23
Solution: Remove Internal Factoring
Discover Nesting: Make, (Model, (Year, Colour, Price, Auto, Air Cond, AM/FM, CD)*)*
Unnest: μ_(Model, Year, Colour, Price, Auto, Air Cond, AM/FM, CD)* μ_(Year, Colour, Price, Auto, Air Cond, AM/FM, CD)* Table
[Figure: source table (ACURA example) with legend]
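As a rough illustration of what unnesting does to the discovered structure Make, (Model, (…)*)*, the sketch below flattens a nested table into one record per innermost row. The dictionary layout (keys `models` and `rows`), the function name `unnest`, and the sample values are assumptions made for the example, not data from the talk.

```python
def unnest(nested):
    """Flatten Make, (Model, (row)*)* into one flat record per row."""
    flat = []
    for make_group in nested:
        for model_group in make_group["models"]:
            for row in model_group["rows"]:
                # Copy the factored-out Make and Model onto every row.
                flat.append(
                    {"Make": make_group["Make"],
                     "Model": model_group["Model"],
                     **row}
                )
    return flat


# Illustrative data only.
nested = [
    {"Make": "ACURA",
     "models": [
         {"Model": "Integra",
          "rows": [{"Year": "1996", "Colour": "Red"},
                   {"Year": "1998", "Colour": "Black"}]},
     ]},
]
for record in unnest(nested):
    print(record)
```

After this step every row carries its own Make and Model, so each row can be treated as one car.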
24
Solution: Replace Boolean Values
β is applied to each Boolean column (Auto, Air Cond., AM/FM, CD) of Table, replacing "Yes" values, e.g. Yes → CD.
[Figure: source table (ACURA example) with legend]
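One way to picture the β step is as a column rewrite that turns a Boolean "Yes" into the column's own attribute name, so that the cell becomes an extractable value. The sketch below assumes that reading; blanking every non-"Yes" value is a further assumption, and the function name `beta` and the sample row are hypothetical.

```python
def beta(rows, attribute, true_value="Yes"):
    """Return a copy of rows in which `attribute` holds its own name where
    the cell equals true_value, and an empty string otherwise (assumed)."""
    result = []
    for row in rows:
        new_row = dict(row)
        new_row[attribute] = attribute if row.get(attribute) == true_value else ""
        result.append(new_row)
    return result


# Illustrative row only.
rows = [{"Model": "Integra", "Auto": "No", "CD": "Yes"}]
for column in ("Auto", "CD"):
    rows = beta(rows, column)
print(rows)
```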
25
Solution: Form Attribute-Value Pairs
[Figure: source table (ACURA example) with legend; columns shown include Auto, Air Cond., AM/FM, CD]
26
Solution: Adjust Attribute-Value Pairs
[Figure: source table (ACURA example) with legend; columns shown include Auto, Air Cond., AM/FM, CD]
27
Solution: Do Extraction
[Figure: source table (ACURA example) with legend; columns shown include Auto, Air Cond., AM/FM, CD]
28
Solution: Infer Mappings
Target schema: {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature}
Each row is a car.
π_Model μ_(Year, Colour, Price, Auto, Air Cond, AM/FM, CD)* Table
π_Make μ_(Model, Year, Colour, Price, Auto, Air Cond, AM/FM, CD)* μ_(Year, Colour, Price, Auto, Air Cond, AM/FM, CD)* Table
π_Year Table
Note: Mappings produce sets for attributes. Joining to form records is trivial because we have OIDs for table rows (e.g. for each Car).
[Figure: source table (ACURA example) with legend]
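The note about OIDs can be illustrated with a small join: if each mapping yields a set of (OID, value) pairs for one target attribute, records are assembled by grouping on the OID. This is a sketch under that assumption; `join_by_oid` and the sample values are hypothetical.

```python
def join_by_oid(attribute_sets):
    """Combine per-attribute sets of (oid, value) pairs into records."""
    records = {}
    for attribute, pairs in attribute_sets.items():
        for oid, value in pairs:
            records.setdefault(oid, {})[attribute] = value
    return records


# Illustrative mapping output only: one set per target attribute.
attribute_sets = {
    "Make": {(1, "ACURA"), (2, "ACURA")},
    "Model": {(1, "Integra"), (2, "Legend")},
    "Year": {(1, "1996")},
}
for oid, record in sorted(join_by_oid(attribute_sets).items()):
    print(oid, record)
```

A row with no extracted value for some attribute (Year for row 2 here) simply ends up with a record that lacks that attribute.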
29
Solution: Do Extraction
Target schema: {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature}
π_Model μ_(Year, Colour, Price, Auto, Air Cond, AM/FM, CD)* Table
[Figure: source table (ACURA example) with legend]
30
Solution: Do Extraction
Target schema: {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature}
π_Price Table
[Figure: source table (ACURA example) with legend]
31
Solution: Do Extraction
Target schema: {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature}
ρ_Colour←Feature π_Colour Table
∪ ρ_Auto←Feature π_Auto β_Auto Table
∪ ρ_Air Cond.←Feature π_Air Cond. β_Air Cond. Table
∪ ρ_AM/FM←Feature π_AM/FM β_AM/FM Table
∪ ρ_CD←Feature π_CD β_CD Table
[Figure: source table (ACURA example) with legend]
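The union expression above collects values from several source columns into the single target attribute Feature. A minimal sketch of that idea follows, assuming rows keyed by OID and Boolean columns already rewritten to attribute names (empty when absent). The function names and sample rows are hypothetical.

```python
def project(rows, column):
    """π: the set of (oid, value) pairs for one column, skipping blanks."""
    return {(oid, row[column]) for oid, row in rows.items() if row.get(column)}


def feature_union(rows, columns):
    """∪ over columns; the rename ρ is implicit because every projected
    value is stored under the single target attribute Feature."""
    features = set()
    for column in columns:
        features |= project(rows, column)
    return features


# Illustrative rows only, keyed by OID.
rows = {
    1: {"Colour": "Black", "Auto": "Auto", "CD": ""},
    2: {"Colour": "Red", "Auto": "", "CD": "CD"},
}
print(sorted(feature_union(rows, ["Colour", "Auto", "CD"])))
```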
32
Experiment
Tables from 60 sites
10 "training" tables
50 test tables
357 mappings (from all 60 sites)
–172 direct mappings (same attribute and meaning)
–185 indirect mappings (29 attribute synonyms, 5 "Yes/No" columns, 68 unions over columns for Feature, 19 factored values, and 89 columns of merged values that needed to be split)
33
Results
10 "training" tables
–100% of the 57 mappings (no false mappings)
–94.6% of the values in linked pages (5.4% false declarations)
50 test tables
–94.7% of the 300 mappings (no false mappings)
–On the basis of sampling 3,000 values in linked pages, we obtained 97% recall and 86% precision
16 missed mappings
–4 partial (not all unions included)
–6 non-U.S. car-ads (unrecognized makes and models)
–2 U.S. unrecognized makes and models
–3 prices (missing $ or found MSRP instead)
–1 mileage (mileages less than 1,000)
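The test-table mapping figures can be checked with a short calculation using only the counts stated on the slide (300 mappings, 16 missed, no false mappings). The variable names are illustrative.

```python
test_mappings = 300   # mappings in the 50 test tables
missed = 16           # missed mappings
false_mappings = 0    # "no false mappings"

found = test_mappings - missed
recall = found / test_mappings
precision = found / (found + false_mappings)

print(f"recall = {recall:.1%}, precision = {precision:.0%}")
```

The 284 mappings found out of 300 give the reported 94.7%, and the absence of false mappings gives 100% precision for mappings.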
34
Conclusions
Summary
–Transformed schema-matching problem to extraction
–Inferred semantic mappings
–Discovered source-to-target mapping rules
Evidence of Success
–Tables (mappings): 95% (Recall); 100% (Precision)
–Linked Text (value extraction): ~97% (Recall); ~86% (Precision)
Future Work
–Discover and exploit structure in linked text
–Broaden table understanding
–Integrate with current extraction tools
www.deg.byu.edu