
1 Automating Schema Matching. David W. Embley, Cui Tao, Li Xu, Brigham Young University (BYU Data Extraction Group, 2003). Funded by NSF.

2 Information Exchange (Source → Target): leverage information extraction to do schema matching.

3 Presentation Outline: Information Extraction; Schema Matching for Tables; Direct Schema Matching; Indirect Schema Matching; Conclusions and Future Work.

4 Information Extraction

5 Extracting Pertinent Information from Documents

6 A Conceptual-Modeling Solution: an ontology diagram relating Car to Year, Make, Model, Mileage, Price, Feature, PhoneNr, and Extension through has / is-for relationship sets with participation constraints (0..1, 0..*, 1..*).

7 Car-Ads Ontology
Car [-> object];
Car [0..1] has Year [1..*];
Car [0..1] has Make [1..*];
Car [0..1] has Model [1..*];
Car [0..1] has Mileage [1..*];
Car [0..*] has Feature [1..*];
Car [0..1] has Price [1..*];
PhoneNr [1..*] is for Car [0..*];
PhoneNr [0..1] has Extension [1..*];
Year matches [4] constant { extract "\d{2}"; context "([^\$\d]|^)[4-9]\d[^\d]"; substitute "^" -> "19"; }, …
End;
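The Year data frame above can be sketched in Python. This is a minimal illustration of the extract/context/substitute idea under our own simplifications; the combined regular expression and the helper name extract_year are ours, not the group's code:

```python
import re

# Illustrative sketch of the Year data frame: a two-digit year 40-99 that is
# not preceded by '$' or a digit and not followed by a digit is extracted,
# and the substitution prepends "19" to expand it to a four-digit year.
CONTEXT = re.compile(r"(?:[^$\d]|^)([4-9]\d)(?:[^\d]|$)")

def extract_year(text):
    m = CONTEXT.search(text)
    if m is None:
        return None
    return "19" + m.group(1)  # substitute "^" -> "19"

print(extract_year("Subaru SW 89 $1900"))  # -> "1989"
print(extract_year("$1900"))               # -> None (price context rejected)
```

The context pattern is what rules out "19" inside the price "$1900", so only the bare "89" is recognized as a year.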

8 Recognition and Extraction
Car (Car, Year, Make, Model, Mileage, Price, PhoneNr):
0001 | 1989 | Subaru | SW | | $1900 | (336)835-8597
0002 | 1998 | | Elantra | | | (336)526-5444
0003 | 1994 | HONDA | ACCORD EX | 100K | | (336)526-1081
Feature (Car, Feature):
0001: Auto; AC
0002: Black; 4 door; tinted windows; Auto; pb; ps; cruise; am/fm; cassette stereo; a/c
0003: Auto; jade green; gold

9 Schema Matching for HTML Tables with Unknown Structure (Cui Tao)

10 Table-Schema Matching (Basic Idea)
Many tables are on the Web. Ontology-based extraction works well for unstructured or semistructured data, but what about structured data such as tables?
Method: form attribute-value pairs, do extraction, infer mappings from extraction patterns.

11 Problem: Different Schemas
Target database schema: {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature}
Different source table schemas:
–{Run #, Yr, Make, Model, Tran, Color, Dr}
–{Make, Model, Year, Colour, Price, Auto, Air Cond., AM/FM, CD}
–{Vehicle, Distance, Price, Mileage}
–{Year, Make, Model, Trim, Invoice/Retail, Engine, Fuel Economy}

12 Problem: Attribute is Value

13 Problem: Attribute-Value is Value

14 Problem: Value is not Value

15 Problem: Implied Values

16 Problem: Missing Attributes

17 Problem: Compound Attributes

18 Problem: Factored Values

19 Problem: Split Values

20 Problem: Merged Values

21 Problem: Values not of Interest

22 Problem: Information Behind Links. Single-column table (formatted as a list); table extending over several pages.

23 Solution: form attribute-value pairs (adjust if necessary), do extraction, infer mappings from extraction patterns.

24 Solution: Remove Internal Factoring
Discover nesting: Make, (Model, (Year, Colour, Price, Auto, Air Cond, AM/FM, CD)*)*
Unnest: μ (Model, Year, Colour, Price, Auto, Air Cond, AM/FM, CD)* μ (Year, Colour, Price, Auto, Air Cond, AM/FM, CD)* Table
(Example source rows: ACURA / Legend.)
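The two unnest (μ) steps can be sketched in Python on a toy factored table; the nested-list representation, the helper name unnest, and the sample values are our own illustration, not the system's data structures:

```python
# A factored source table nests Model rows under Make and Year-level rows
# under Model; unnesting twice yields one flat row per leaf combination,
# mirroring the two μ applications above.
def unnest(rows, key):
    """Flatten one nesting level: emit one row per element of rows[key]."""
    flat = []
    for row in rows:
        rest = {k: v for k, v in row.items() if k != key}
        for child in row[key]:
            flat.append({**rest, **child})
    return flat

factored = [{"Make": "ACURA",
             "models": [{"Model": "Legend",
                         "years": [{"Year": 1991, "Colour": "green"},
                                   {"Year": 1992, "Colour": "white"}]}]}]

flat = unnest(unnest(factored, "models"), "years")
print(flat[0])  # -> {'Make': 'ACURA', 'Model': 'Legend', 'Year': 1991, 'Colour': 'green'}
```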

25 Solution: Replace Boolean Values
Apply β Auto, β Air Cond., β AM/FM, β CD to the table: each "Yes" in a Boolean column is replaced by the column's attribute name (e.g., Yes under CD becomes CD, Yes under AM/FM becomes AM/FM), so the value can be extracted as a Feature.
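A sketch of the β step in Python, under our assumption that a Boolean column holds "Yes"/"No" strings; the function name and the sample row are ours:

```python
# β replaces "Yes" in a Boolean column with the column's attribute name and
# blanks out "No", so the value becomes an extractable Feature.
def replace_booleans(rows, bool_columns):
    out = []
    for row in rows:
        new = dict(row)
        for col in bool_columns:
            new[col] = col if new.get(col, "").strip().lower() == "yes" else ""
        out.append(new)
    return out

rows = [{"Model": "Legend", "Auto": "Yes", "CD": "No", "AM/FM": "Yes"}]
print(replace_booleans(rows, ["Auto", "CD", "AM/FM"])[0])
# -> {'Model': 'Legend', 'Auto': 'Auto', 'CD': '', 'AM/FM': 'AM/FM'}
```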

26 Solution: Form Attribute-Value Pairs for each cell, e.g. (Make, ACURA), (Model, Legend), (Auto, Auto), (Air Cond., Air Cond.), (AM/FM, AM/FM), (CD, CD)

27 Solution: Adjust Attribute-Value Pairs (if necessary) before extraction

28 Solution: Do Extraction

29 Solution: Infer Mappings
Target: {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature}. Each row is a car.
Model: π Model μ (Year, Colour, Price, Auto, Air Cond, AM/FM, CD)* Table
Make: π Make μ (Model, Year, Colour, Price, Auto, Air Cond, AM/FM, CD)* μ (Year, Colour, Price, Auto, Air Cond, AM/FM, CD)* Table
Year: π Year Table
Note: mappings produce sets for attributes. Joining to form records is trivial because we have OIDs for table rows (e.g. for each Car).

30 Solution: Do Extraction
Model: π Model μ (Year, Colour, Price, Auto, Air Cond, AM/FM, CD)* Table

31 Solution: Do Extraction
Price: π Price Table

32 Solution: Do Extraction
Feature: ρ Colour←Feature π Colour Table
  ∪ ρ Auto←Feature π Auto β Auto Table
  ∪ ρ Air Cond.←Feature π Air Cond. β Air Cond. Table
  ∪ ρ AM/FM←Feature π AM/FM β AM/FM Table
  ∪ ρ CD←Feature π CD β CD Table
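The union of renamed projections can be sketched in Python over lists of dicts; project_rename and the sample table are our own illustrative stand-ins for ρ, π, and ∪:

```python
# Each column that carries feature-like values is projected, renamed to
# Feature, and unioned, mirroring the ρ ... π ... Table ∪ ... expression.
def project_rename(rows, col, new_name="Feature"):
    return [{new_name: r[col]} for r in rows if r.get(col)]

table = [{"Colour": "green", "Auto": "Auto", "CD": ""},
         {"Colour": "white", "Auto": "", "CD": "CD"}]

features = (project_rename(table, "Colour")
            + project_rename(table, "Auto")
            + project_rename(table, "CD"))
print(features)
# -> [{'Feature': 'green'}, {'Feature': 'white'}, {'Feature': 'Auto'}, {'Feature': 'CD'}]
```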

33 Experiment
Tables from 60 sites: 10 "training" tables, 50 test tables
357 mappings (from all 60 sites):
–172 direct mappings (same attribute and meaning)
–185 indirect mappings (29 attribute synonyms, 5 "Yes/No" columns, 68 unions over columns for Feature, 19 factored values, and 89 columns of merged values that needed to be split)

34 Results
10 "training" tables: 100% of the 57 mappings (no false mappings); 94.6% of the values in linked pages (5.4% false declarations)
50 test tables: 94.7% of the 300 mappings (no false mappings); on the basis of sampling 3,000 values in linked pages, we obtained 97% recall and 86% precision
16 missed mappings:
–4 partial (not all unions included)
–6 non-U.S. car ads (unrecognized makes and models)
–2 U.S. unrecognized makes and models
–3 prices (missing $ or found MSRP instead)
–1 mileage (mileages less than 1,000)

35 Direct Schema Matching (Li Xu)

36 Attribute Matching for Populated Schemas
Central idea: exploit all data and metadata matching possibilities (facets):
–Attribute Names
–Data-Value Characteristics
–Expected Data Values
–Data-Dictionary Information
–Structural Properties

37 Approach
Given a target schema T and a source schema S, the framework has three parts: individual facet matching, combining facets, and best-first match iteration.

38 Example
Source schema S: Car has Year, Make, Model, Cost, Style (each 0:1), and Feature (0:*).
Target schema T: Car has Year, Make, Model, Miles/Mileage, Cost, and Phone.
Matches include Year ↔ Year, Make ↔ Make, Model ↔ Model, and Mileage ↔ Miles.

39 Individual Facet Matching: attribute names, data-value characteristics, expected data values.

40 Attribute Names
Target attribute A in T, source attribute B in S.
WordNet C4.5 decision tree (feature selection, trained on schemas in DB books):
–f0: same word
–f1: synonym
–f2: sum of distances to a common hypernym root
–f3: number of different common hypernym roots
–f4: sum of the number of senses of A and B

41 WordNet Rule
Decision-tree tests: the number of different common hypernym roots of A and B; the sum of distances of A and B to a common hypernym; the sum of the number of senses of A and B.

42 Confidence Measures

43 Data-Value Characteristics
C4.5 decision-tree features:
–Numeric data: mean, variation, standard deviation, …
–Alphanumeric data: string length, numeric ratio, space ratio
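The value-characteristic features named above can be sketched as follows; these minimal functions are our own reading of the feature list, not the actual C4.5 feature extractors:

```python
import statistics

# Numeric columns: summary statistics; alphanumeric columns: length and
# character-class ratios, as named on the slide.
def numeric_features(values):
    nums = [float(v) for v in values]
    return {"mean": statistics.mean(nums), "stdev": statistics.pstdev(nums)}

def alphanumeric_features(values):
    total = sum(len(v) for v in values)
    digits = sum(c.isdigit() for v in values for c in v)
    spaces = sum(c == " " for v in values for c in v)
    return {"avg_len": total / len(values),
            "numeric_ratio": digits / total,
            "space_ratio": spaces / total}

print(numeric_features(["1989", "1994", "1998"]))
print(alphanumeric_features(["HONDA ACCORD EX", "Elantra"]))
```

Columns of years and columns of model names produce very different profiles, which is what lets the decision tree separate them.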

44 Confidence Measures

45 Expected Data Values
For target schema T and source schema S: a regular-expression recognizer for attribute A in T, and data instances for attribute B in S.
Hit Ratio = N'/N for an (A, B) match, where N' is the number of B data instances recognized by the regular expressions of A, and N is the total number of B data instances.
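The hit ratio N'/N is straightforward to sketch; the Year recognizer below is our own illustrative regular expression, not the ontology's:

```python
import re

# N' = instances of source attribute B accepted by target attribute A's
# recognizer; N = all instances of B.
def hit_ratio(recognizer, instances):
    hits = sum(1 for v in instances if recognizer.fullmatch(v))
    return hits / len(instances)

year_recognizer = re.compile(r"(19|20)\d{2}")
print(hit_ratio(year_recognizer, ["1989", "1998", "100K", "1994"]))  # -> 0.75
```

A mileage value like "100K" is not recognized by the Year pattern, so a mileage column scores low against Year even if its lengths look similar.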

46 Confidence Measures

47 Combined Measures (threshold: 0.5): a 0/1 matrix of combined confidence measures over the attribute pairs; entries of 1 mark candidate matches.

48 Final Confidence Measures

49 Experimental Results
This schema, plus 6 other schemas: 32 matched attributes, 376 unmatched attributes.
Measures: recall 100%, precision 94%, F-measure 97%.
False positives: "Feature" ↔ "Color", "Feature" ↔ "Body Type".

50 Indirect Schema Matching

51 Schema Matching
Source: Car with Year, Cost, Style, Feature, Phone.
Target: Car with Year, Cost, Miles/Mileage, Make & Model, Color, Body Type.

52 Mapping Generation
Direct matches, as described earlier: attribute names based on WordNet; value characteristics based on value lengths, averages, …; expected values based on regular-expression recognizers.
Indirect matches: direct matches plus structure evaluation (Union, Selection, Decomposition, Composition).
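Two of the structural operations can be sketched in Python; compose and decompose are our own illustrative names, showing how source Make and Model might map indirectly to a target "Make & Model" attribute and back:

```python
# Composition joins several source attributes into one target attribute;
# decomposition splits one attribute into several (here on the first space,
# an assumption that only works when the leading token is the make).
def compose(rows, parts, target):
    return [{target: " ".join(r[p] for p in parts)} for r in rows]

def decompose(rows, source, targets, sep=" "):
    out = []
    for r in rows:
        pieces = r[source].split(sep, len(targets) - 1)
        out.append(dict(zip(targets, pieces)))
    return out

rows = [{"Make": "HONDA", "Model": "ACCORD EX"}]
composed = compose(rows, ["Make", "Model"], "Make & Model")
print(composed)  # -> [{'Make & Model': 'HONDA ACCORD EX'}]
print(decompose(composed, "Make & Model", ["Make", "Model"]))
# -> [{'Make': 'HONDA', 'Model': 'ACCORD EX'}]
```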

53 Union and Selection

54 Decomposition and Composition

55 Structure
Source: PO, with POShipTo (City, Street), POBillTo (City, Street), and POLines (Count; Line: Qty, UoM).
Target: PurchaseOrder, with DeliverTo and InvoiceTo (each an Address: City, Street) and Items (ItemCount; Item: ItemNumber, Quantity, UnitOfMeasure).
Example taken from [MBR, VLDB'01].

56 Structure (Nonlexical Matches), e.g. PO ↔ PurchaseOrder, POShipTo ↔ DeliverTo, POBillTo ↔ InvoiceTo, POLines ↔ Items.

57 Structure (Join over FD Relationship Sets, …)

58 Structure (Lexical Matches), e.g. City ↔ City, Street ↔ Street, Count ↔ ItemCount, LineQty ↔ Quantity, UoM ↔ UnitOfMeasure.

59 Experimental Results
Application (number of schemas) | Precision (%) | Recall (%) | F (%) | Correct | False Positive | False Negative
Course Schedule (5) | 98 | 93 | 96 | 119 | 2 | 9
Faculty Member (5) | 100 | 100 | 100 | 140 | 0 | 0
Real Estate (5) | 92 | 96 | 94 | 235 | 20 | 10
Data borrowed from Univ. of Washington [DDH, SIGMOD01].
Indirect matches: 94% (precision, recall, F-measure).
Rough comparison with U of W results (direct matches only): Course Schedule, accuracy ~71%; Faculty Members, accuracy ~92%; Real Estate (2 tests), accuracy ~75%.

60 Conclusions and Future Work

61 Conclusions
Table mappings: tables 94.7% recall, 100% precision; linked text ~97% recall, ~86% precision.
Direct attribute matching: matched 32 of 32 (100% recall); 2 false positives (94% precision).
Direct and indirect attribute matching: matched 494 of 513 (96% recall); 22 false positives (96% precision).
www.deg.byu.edu

62 Current & Future Work: Improve and Extend Indirect Matching
Improve object-set matching (e.g. lexical vs. nonlexical); add relationship-set matching computations.

63 Current & Future Work: Tables Behind Forms
Crawling the hidden Web; filling in forms from global queries.

64 Current & Future Work: Developing Extraction Ontologies
Creation from knowledge sources and sample application pages (μK Ontology + data frames, lexicons, …; RDF ontologies); user creation by example.

65 Current & Future Work: and Much More …
Table understanding; microfilm census records; generating ontologies by reading tables; …
www.deg.byu.edu

