ER 2002 BYU Data Extraction Group: Automatically Extracting Ontologically Specified Data from HTML Tables with Unknown Structure, David W. Embley, Cui Tao,


1 ER 2002 BYU Data Extraction Group
Automatically Extracting Ontologically Specified Data from HTML Tables with Unknown Structure
David W. Embley, Cui Tao, Stephen W. Liddle
Brigham Young University
Funded by NSF

2 Information Exchange
Source, Target
Information Extraction, Schema Matching
Leverage this … … to do this

3 Information Extraction

4 Extracting Pertinent Information from Documents

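5 A Conceptual-Modeling Solution
[Diagram: conceptual model in which Car is related to Year, Make, Model, Mileage, Price, Feature, and PhoneNr, and PhoneNr is related to Extension; relationships are labeled "has" and "is for", with participation constraints 0..1, 0..*, and 1..*]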

6 Car-Ads Ontology
Car [->object];
Car [0..1] has Year [1..*];
Car [0..1] has Make [1..*];
Car [0..1] has Model [1..*];
Car [0..1] has Mileage [1..*];
Car [0..*] has Feature [1..*];
Car [0..1] has Price [1..*];
PhoneNr [1..*] is for Car [0..*];
PhoneNr [0..1] has Extension [1..*];
Year matches [4]
  constant { extract "\d{2}";
    context "([^\$\d]|^)[4-9]\d[^\d]";
    substitute "^" -> "19"; },
  …
End;
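As a rough illustration of how the Year rule might be applied, the Python sketch below finds each match of the context pattern, pulls the two-digit extract pattern out of it, and prefixes "19". The function name extract_years and the way the three clauses are combined are assumptions made for illustration; the slides do not show the group's actual implementation.

```python
import re

# Patterns copied from the Year rule in the Car-Ads ontology.
CONTEXT = re.compile(r"([^\$\d]|^)[4-9]\d[^\d]")
EXTRACT = re.compile(r"\d{2}")


def extract_years(text):
    """Return four-digit years for two-digit years found in context.

    Hypothetical helper: assumes 'extract' is applied inside each 'context'
    match and that 'substitute "^" -> "19"' prefixes the extracted digits.
    """
    years = []
    # finditer yields non-overlapping matches, so two candidates that share
    # a single separating character are not both found by this sketch.
    for ctx in CONTEXT.finditer(text):
        hit = EXTRACT.search(ctx.group(0))
        if hit:
            years.append("19" + hit.group(0))
    return years


print(extract_years("'89 Subaru SW $1900"))  # ['1989']
```

The price "$1900" is skipped because the context pattern requires the two digits to start with 4 through 9 and not to follow a dollar sign or another digit.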

7 Recognition and Extraction

Car   Year  Make    Model      Mileage  Price  PhoneNr
0001  1989  Subaru  SW                  $1900  (336)835-8597
0002  1998          Elantra                    (336)526-5444
0003  1994  HONDA   ACCORD EX  100K            (336)526-1081

Car   Feature
0001  Auto
0001  AC
0002  Black
0002  4 door
0002  tinted windows
0002  Auto
0002  pb
0002  ps
0002  cruise
0002  am/fm
0002  cassette stereo
0002  a/c
0003  Auto
0003  jade green
0003  gold

8 Schema Matching for HTML Tables with Unknown Structure

9 Table-Schema Matching (Basic Idea)
Many Tables on the Web
Ontology-Based Extraction
–Works well for unstructured or semistructured data
–What about structured data – tables?
Method
–Form attribute-value pairs
–Do extraction
–Infer mappings from extraction patterns
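The first step of the method, forming attribute-value pairs, can be sketched in Python as follows. This is a minimal sketch under the assumption that the table has already been parsed into a header list and row lists; the function name and the sample header and row are made up for illustration.

```python
def form_attribute_value_pairs(header, rows):
    """Pair each cell with its column attribute, one list of pairs per row."""
    return [list(zip(header, row)) for row in rows]


header = ["Year", "Make", "Model", "Price"]       # made-up example header
rows = [["1994", "HONDA", "ACCORD EX", "$1900"]]  # made-up example row

for pairs in form_attribute_value_pairs(header, rows):
    # Render each row as text that a text-based extractor could consume.
    print("; ".join(f"{attr}: {value}" for attr, value in pairs))
# Year: 1994; Make: HONDA; Model: ACCORD EX; Price: $1900
```

Keeping the attribute next to each value gives the extractor the same kind of context cues it would see in unstructured text.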

10 Problem: Different Schemas
Target Database Schema
{Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature}
Different Source Table Schemas
–{Run #, Yr, Make, Model, Tran, Color, Dr}
–{Make, Model, Year, Colour, Price, Auto, Air Cond., AM/FM, CD}
–{Vehicle, Distance, Price, Mileage}
–{Year, Make, Model, Trim, Invoice/Retail, Engine, Fuel Economy}

11 Problem: Attribute is Value

12 Problem: Attribute-Value is Value ??

13 Problem: Value is not Value

14 Problem: Implied Values

15 Problem: Missing Attributes

16 Problem: Compound Attributes

17 Problem: Factored Values

18 Problem: Split Values

19 Problem: Merged Values

20 Problem: Values not of Interest

21 Problem: Information Behind Links
Single-Column Table (formatted as list)
Table extending over several pages

22 Solution
Form attribute-value pairs (adjust if necessary)
Do extraction
Infer mappings from extraction patterns

23 Solution: Remove Internal Factoring
Discover Nesting: Make, (Model, (Year, Colour, Price, Auto, Air Cond, AM/FM, CD)*)*
Unnest: μ (Model, Year, Colour, Price, Auto, Air Cond, AM/FM, CD)* μ (Year, Colour, Price, Auto, Air Cond, AM/FM, CD)* Table
[Table figure: ACURA Legend example]
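The unnesting step can be pictured with a small Python sketch. It assumes the discovered nesting Make, (Model, (Year, …)*)* has already been parsed into nested dictionaries; the keys "models" and "cars" and the Year and Colour values are hypothetical, chosen only to make the example runnable.

```python
def unnest(nested):
    """Flatten Make -> Model -> car groups into one flat row per car."""
    rows = []
    for make_group in nested:
        for model_group in make_group["models"]:
            for car in model_group["cars"]:
                # Copy the factored Make and Model values into every row.
                rows.append({"Make": make_group["Make"],
                             "Model": model_group["Model"],
                             **car})
    return rows


# Illustrative data only: the Year and Colour values are invented.
nested = [{
    "Make": "ACURA",
    "models": [{
        "Model": "Legend",
        "cars": [{"Year": "1992", "Colour": "grey"},
                 {"Year": "1993", "Colour": "white"}],
    }],
}]

for row in unnest(nested):
    print(row)
```

Each output row carries its own Make and Model, so later steps can treat every row as one car.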

24 Solution: Replace Boolean Values
β Auto β Air Cond β AM/FM β CD Table, with "Yes" as the Boolean value to replace
[Table figure: ACURA Legend example with columns Auto, Air Cond., AM/FM, CD]
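One way to read the β operator is sketched below in Python: a "Yes" cell in a Boolean column is replaced by the column's attribute name, so the cell itself becomes an extractable feature value. Treating every other cell as empty is an assumption, as are the function name and the sample rows.

```python
def replace_boolean(rows, attr, true_value="Yes"):
    """Replace true_value in column attr with the attribute name itself.

    Assumption: any cell that is not true_value becomes the empty string.
    """
    replaced = []
    for row in rows:
        new_row = dict(row)  # leave the input rows untouched
        cell = row.get(attr, "").strip()
        new_row[attr] = attr if cell == true_value else ""
        replaced.append(new_row)
    return replaced


rows = [{"Model": "Legend", "CD": "Yes"},  # made-up sample rows
        {"Model": "Legend", "CD": "No"}]
print(replace_boolean(rows, "CD"))
# [{'Model': 'Legend', 'CD': 'CD'}, {'Model': 'Legend', 'CD': ''}]
```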

25 Solution: Form Attribute-Value Pairs
[Table figure: ACURA Legend example with columns Auto, Air Cond., AM/FM, CD]

26 Solution: Adjust Attribute-Value Pairs
[Table figure: ACURA Legend example with columns Auto, Air Cond., AM/FM, CD]

27 Solution: Do Extraction
[Table figure: ACURA Legend example with columns Auto, Air Cond., AM/FM, CD]

28 Solution: Infer Mappings
{Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature}
Each row is a car.
π Model μ (Year, Colour, Price, Auto, Air Cond, AM/FM, CD)* Table
π Make μ (Model, Year, Colour, Price, Auto, Air Cond, AM/FM, CD)* μ (Year, Colour, Price, Auto, Air Cond, AM/FM, CD)* Table
π Year Table
Note: Mappings produce sets for attributes. Joining to form records is trivial because we have OIDs for table rows (e.g. for each Car).
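The note about joining on OIDs can be illustrated with a short Python sketch. It assumes each mapping yields a set of (OID, value) pairs for one target attribute; the function name and the sample values are hypothetical.

```python
from collections import defaultdict


def join_on_oid(attribute_sets):
    """Combine per-attribute sets of (oid, value) pairs into one record per OID."""
    records = defaultdict(dict)
    for attr, pairs in attribute_sets.items():
        for oid, value in pairs:
            records[oid][attr] = value
    return dict(records)


# Made-up mapping results: one set of (oid, value) pairs per attribute.
attribute_sets = {
    "Make": {(1, "ACURA"), (2, "ACURA")},
    "Model": {(1, "Legend"), (2, "Legend")},
    "Year": {(1, "1992")},
}
print(join_on_oid(attribute_sets)[1])
```

Because every pair carries the OID of its source row, no value matching is needed to reassemble a record; a missing attribute (Year for OID 2 here) is simply absent.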

29 Solution: Do Extraction
{Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature}
π Model μ (Year, Colour, Price, Auto, Air Cond, AM/FM, CD)* Table

30 Solution: Do Extraction
{Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature}
π Price Table

31 Solution: Do Extraction
{Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature}
ρ Colour←Feature π Colour Table
∪ ρ Auto←Feature π Auto β Auto Table
∪ ρ Air Cond.←Feature π Air Cond. β Air Cond. Table
∪ ρ AM/FM←Feature π AM/FM β AM/FM Table
∪ ρ CD←Feature π CD β CD Table
(β replaces the Boolean value "Yes")
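The Feature mapping is a union of renamed projections, which the Python sketch below mimics with sets of (OID, value) pairs. The helper names, the row layout (a dictionary keyed by OID), and the sample values are assumptions for illustration; the renaming ρ is implicit because the pairs do not carry a column name.

```python
def project(rows, attr):
    """π attr: set of (oid, value) pairs for the non-empty cells of one column."""
    return {(oid, row[attr]) for oid, row in rows.items() if row.get(attr)}


def beta(rows, attr, true_value="Yes"):
    """β attr: replace true_value with the attribute name, anything else with ''."""
    return {oid: {**row, attr: attr if row.get(attr) == true_value else ""}
            for oid, row in rows.items()}


def feature_set(rows, value_columns, boolean_columns):
    """Union of projections; every pair is treated as a Feature value."""
    features = set()
    for attr in value_columns:
        features |= project(rows, attr)
    for attr in boolean_columns:
        features |= project(beta(rows, attr), attr)
    return features


rows = {1: {"Colour": "red", "Auto": "Yes", "CD": "No"}}  # made-up row
print(sorted(feature_set(rows, ["Colour"], ["Auto", "CD"])))
# [(1, 'Auto'), (1, 'red')]
```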

32 Experiment
Tables from 60 sites
10 “training” tables
50 test tables
357 mappings (from all 60 sites)
–172 direct mappings (same attribute and meaning)
–185 indirect mappings (29 attribute synonyms, 5 “Yes/No” columns, 68 unions over columns for Feature, 19 factored values, and 89 columns of merged values that needed to be split)

33 Results
10 “training” tables
–100% of the 57 mappings (no false mappings)
–94.6% of the values in linked pages (5.4% false declarations)
50 test tables
–94.7% of the 300 mappings (no false mappings)
–On the basis of sampling 3,000 values in linked pages, we obtained 97% recall and 86% precision
16 missed mappings
–4 partial (not all unions included)
–6 non-U.S. car-ads (unrecognized makes and models)
–2 U.S. unrecognized makes and models
–3 prices (missing $ or found MSRP instead)
–1 mileage (mileages less than 1,000)
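The test-table figures can be checked with the usual recall and precision formulas. The short Python sketch below assumes that the 94.7% figure is recall over the 300 test mappings, with the 16 missed mappings as the only errors; the function names are illustrative.

```python
def recall(correct, missed):
    """Fraction of the expected mappings that were found."""
    return correct / (correct + missed)


def precision(correct, false_positives):
    """Fraction of the reported mappings that were right."""
    return correct / (correct + false_positives)


total_mappings = 300              # mappings in the 50 test tables
missed = 4 + 6 + 2 + 3 + 1        # the five categories of missed mappings
correct = total_mappings - missed

print(missed)                                        # 16
print(f"recall = {recall(correct, missed):.1%}")     # recall = 94.7%
print(f"precision = {precision(correct, 0):.1%}")    # precision = 100.0%
```

With no false mappings reported, precision is 100%, and 284 of 300 mappings found gives the 94.7% stated on the slide.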

34 Conclusions
Summary
–Transformed schema-matching problem to extraction
–Inferred semantic mappings
–Discovered source-to-target mapping rules
Evidence of Success
–Tables (mappings): 95% (Recall); 100% (Precision)
–Linked Text (value extraction): ~97% (Recall); ~86% (Precision)
Future Work
–Discover and exploit structure in linked text
–Broaden table understanding
–Integrate with current extraction tools
www.deg.byu.edu

