DASFAA 2003BYU Data Extraction Group Discovering Direct and Indirect Matches for Schema Elements Li Xu and David W. Embley Brigham Young University Funded.

Slides:



Advertisements
Similar presentations
IMAP: Discovering Complex Semantic Matches Between Database Schemas Ohad Edry January 2009 Seminar in Databases.
Advertisements

Schema Matching and Data Extraction over HTML Tables Cui Tao Data Extraction Research Group Department of Computer Science Brigham Young University supported.
Semiautomatic Generation of Data-Extraction Ontologies Master’s Thesis Proposal Yihong Ding.
Konstanz, Jens Gerken ZuiScat An Overview of data quality problems and data cleaning solution approaches Data Cleaning Seminarvortrag: Digital.
1 What Do You Want— Semantic Understanding? (You’ve Got to be Kidding) David W. Embley Brigham Young University Funded in part by the National Science.
Extracting Information from Heterogeneous Information Sources Using Ontologically Specified Target Views Joachim Biskup Universität Dortmund and David.
Generic Schema Matching using Cupid
Schema Matching and Data Extraction over HTML Tables Cui Tao Data Extraction Research Group Department of Computer Science Brigham Young University supported.
Aki Hecht Seminar in Databases (236826) January 2009
1 Learning to Detect Objects in Images via a Sparse, Part-Based Representation S. Agarwal, A. Awan and D. Roth IEEE Transactions on Pattern Analysis and.
Multifaceted Exploitation of Metadata for Attribute Match Discovery in Information Integration David W. Embley David Jackman Li Xu.
Data Frames Version 3 Proposal. Data Frames Version 2 Year matches [2] constant { extract "\d{2}"; context "([^\$\d]|^)\d{2}[^,\dkK]"; } 0.5, { extract.
BYU 2003BYU Data Extraction Group Combining the Best of Global-as-View and Local-as-View for Data Integration Li Xu Brigham Young University Funded by.
Direct and Indirect Matching of Schema Elements for Data Integration on the Web Li Xu Data Extraction Group Brigham Young University Sponsored by NSF.
A Fully Automated Object Extraction System for the World Wide Web a paper by David Buttler, Ling Liu and Calton Pu, Georgia Tech.
Recognizing Ontology-Applicable Multiple-Record Web Documents David W. Embley Dennis Ng Li Xu Brigham Young University.
6/17/20151 Table Structure Understanding by Sibling Page Comparison Cui Tao Data Extraction Group Department of Computer Science Brigham Young University.
BYU 2003BYU Data Extraction Group Automating Schema Matching David W. Embley, Cui Tao, Li Xu Brigham Young University Funded by NSF.
Schema Mapping: Experiences and Lessons Learned Yihong Ding Data Extraction Group Brigham Young University Sponsored by NSF.
DLLS Ontologically-based Searching for Jobs in Linguistics Deryle Lonsdale Funded by:
ER 2002BYU Data Extraction Group Automatically Extracting Ontologically Specified Data from HTML Tables with Unknown Structure David W. Embley, Cui Tao,
Ontology-Based Information Extraction and Structuring Stephen W. Liddle † School of Accountancy and Information Systems Brigham Young University Douglas.
Query Rewriting for Extracting Data Behind HTML Forms Xueqi Chen, 1 David W. Embley 1 Stephen W. Liddle 2 1 Department of Computer Science 2 Rollins Center.
From OSM-L to JAVA Cui Tao Yihong Ding. Overview of OSM.
UFMG, June 2002BYU Data Extraction Group Automating Schema Matching for Data Integration David W. Embley Brigham Young University Funded by NSF.
Filtering Multiple-Record Web Documents Based on Application Ontologies Presenter: L. Xu Advisor: D.W.Embley.
Evaluation of MineSet 3.0 By Rajesh Rathinasabapathi S Peer Mohamed Raja Guided By Dr. Li Yang.
Scheme Matching and Data Extraction over HTML Tables from Heterogeneous Sources Cui Tao March, 2002 Founded by NSF.
Discovering Direct and Indirect Matches for Schema Elements Li Xu Data Extraction Group Brigham Young University Sponsored by NSF.
Multifaceted Exploitation of Metadata for Attribute Match Discovery in Information Integration Li Xu David W. Embley David Jackman.
BYU Data Extraction Group Automating Schema Matching David W. Embley, Cui Tao, Li Xu Brigham Young University Funded by NSF.
1 A Tool to Support Ontology Creation Based on Incremental Mini-ontology Merging Zonghui Lian.
Recognizing Records from the Extracted Cells of Microfilm Tables Kenneth M. Tubbs David W. Embley Brigham Young University Supported by NSF.
fleckvelter gonsity (ld/gg) hepth (gd) burlam falder multon repeat: 1.understand table 2.generate mini-ontology 3.match with growing.
Record-Boundary Discovery in Web Documents D.W. Embley, Y. Jiang, Y.-K. Ng Data-Extraction Group* Department of Computer Science Brigham Young University.
Table Interpretation by Sibling Page Comparison Cui Tao & David W. Embley Data Extraction Group Department of Computer Science Brigham Young University.
BYU Data Extraction Group Funded by NSF1 Brigham Young University Li Xu Source Discovery and Schema Mapping for Data Integration.
1 Cui Tao PhD Dissertation Defense Ontology Generation, Information Harvesting and Semantic Annotation For Machine-Generated Web Pages.
Automatic Creation and Simplified Querying of Semantic Web Content An Approach Based on Information-Extraction Ontologies Yihong Ding, David W. Embley,
Recognizing Records from the Extracted Cells of Genealogical Microfilm Tables Kenneth Martin Tubbs Jr. A Thesis Submitted to the Faculty of Brigham Young.
Query Rewriting for Extracting Data Behind HTML Forms Xueqi Chen Department of Computer Science Brigham Young University March 31, 2004 Funded by National.
Learning Table Extraction from Examples Ashwin Tengli, Yiming Yang and Nian Li Ma School of Computer Science Carnegie Mellon University Coling 04.
MIS2502: Data Analytics Relational Data Modeling
Engineering Probability and Statistics
Tomer Sagi and Avigdor Gal Technion - Israel Institute of Technology Non-binary Evaluation for Schema Matching ER 2012 October 2012, Florence.
1 Template-Based Classification Method for Chinese Character Recognition Presenter: Tienwei Tsai Department of Informaiton Management, Chihlee Institute.
Processing of large document collections Part 3 (Evaluation of text classifiers, applications of text categorization) Helena Ahonen-Myka Spring 2005.
Automatic Lexical Annotation Applied to the SCARLET Ontology Matcher Laura Po and Sonia Bergamaschi DII, University of Modena and Reggio Emilia, Italy.
“Here is my data. Where do I start?” Examples of Ad Hoc Databases Automatic Example Queries for Ad Hoc Databases Bill Howe 1, Garret Cole 2, Nodira Khoussainova.
Grouping search-engine returned citations for person-name queries Reema Al-Kamha, David W. Embley (Proceedings of the 6th annual ACM international workshop.
SQL: Data Manipulation Presented by Mary Choi For CS157B Dr. Sin Min Lee.
CSE314 Database Systems The Relational Algebra and Relational Calculus Doç. Dr. Mehmet Göktürk src: Elmasri & Navanthe 6E Pearson Ed Slide Set.
1 Knowledge Discovery Transparencies prepared by Ho Tu Bao [JAIST] ITCS 6162.
CSE 636 Data Integration Schema Matching Cupid Fall 2006.
HKU CSIS DB Seminar: HKU CSIS DB Seminar: Finding Set-Mappings in Schema Matching Supervisor: Dr. David Cheung Speaker: Eric Lo.
XML Schema Integration Ray Dos Santos July 19, 2009.
COP4020 Programming Languages Functional Programming Prof. Xin Yuan.
1 Chapter 4: Creating Simple Queries 4.1 Introduction to the Query Task 4.2 Selecting Columns and Filtering Rows 4.3 Creating New Columns with an Expression.
MIS2502: Data Analytics Relational Data Modeling
An Effective SPARQL Support over Relational Database Jing Lu, Feng Cao, Li Ma, Yong Yu, Yue Pan SWDB-ODBIS 2007 SNU IDB Lab. Hyewon Lim July 30 th, 2009.
Introduction to Geographic Information Systems Fall 2013 (INF 385T-28620) Dr. David Arctur Research Fellow, Adjunct Faculty University of Texas at Austin.
Instance Discovery and Schema Matching With Applications to Biological Deep Web Data Integration Tantan Liu, Fan Wang, Gagan Agrawal {liut, wangfa,
Semantic Interoperability in GIS N. L. Sarda Suman Somavarapu.
Databases Databases are collections of information; our study repeats a theme: Tell the computer the structure, and it can help you! © 2004, Lawrence Snyder.
Key Ideas In Content Math 412 January 14, The Content Standards Number and Operations Algebra Geometry Measurement Data Analysis and Probability.
Databases Chapter 16.
MIS2502: Data Analytics Relational Data Modeling
MIS2502: Data Analytics Relational Data Modeling
Announcements Project 2’s due date is moved to Tuesday 8/3/04
Automating Schema Matching for Data Integration
Presentation transcript:

DASFAA 2003BYU Data Extraction Group Discovering Direct and Indirect Matches for Schema Elements Li Xu and David W. Embley Brigham Young University Funded by NSF

DASFAA 2003BYU Data Extraction Group Information Exchange SourceTarget Information Extraction Schema Matching Leverage this … … to do this

DASFAA 2003BYU Data Extraction Group Outline Information Extraction Direct Schema Matching Indirect Schema Matching Schema Matching for HTML Tables Conclusions

DASFAA 2003BYU Data Extraction Group Outline Information Extraction Direct Schema Matching Indirect Schema Matching Schema Matching for HTML Tables Conclusions

DASFAA 2003BYU Data Extraction Group Extracting Pertinent Information from Documents

DASFAA 2003BYU Data Extraction Group A Conceptual-Modeling Solution YearPrice Make Mileage Model Feature PhoneNr Extension Car has is for has 1..* * * 1..*

DASFAA 2003BYU Data Extraction Group Car-Ads Ontology Car [->object]; Car [0..1] has Year [1..*]; Car [0..1] has Make [1..*]; Car [0...1] has Model [1..*]; Car [0..1] has Mileage [1..*]; Car [0..*] has Feature [1..*]; Car [0..1] has Price [1..*]; PhoneNr [1..*] is for Car [0..*]; PhoneNr [0..1] has Extension [1..*]; Year matches [4] constant {extract “\d{2}”; context "([^\$\d]|^)[4-9]\d[^\d]"; substitute "^" -> "19"; }, … End;

DASFAA 2003BYU Data Extraction Group Recognition and Extraction Car Year Make Model Mileage Price PhoneNr Subaru SW $1900 (336) Elantra (336) HONDA ACCORD EX 100K (336) Car Feature 0001 Auto 0001 AC 0002 Black door 0002 tinted windows 0002 Auto 0002 pb 0002 ps 0002 cruise 0002 am/fm 0002 cassette stereo 0002 a/c 0003 Auto 0003 jade green 0003 gold

DASFAA 2003BYU Data Extraction Group Outline Information Extraction Direct Schema Matching Indirect Schema Matching Schema Matching for HTML Tables Conclusions

DASFAA 2003BYU Data Extraction Group Attribute Matching for Populated Schemas Central Idea: Exploit All Data & Metadata Matching Possibilities (Facets) –Attribute Names –Data-Value Characteristics –Expected Data Values –Data-Dictionary Information –Structural Properties

DASFAA 2003BYU Data Extraction Group Approach Target Schema T Source Schema S Framework –Individual Facet Matching –Combining Facets –Best-First Match Iteration

DASFAA 2003BYU Data Extraction Group Example Source Schema S Car Year has 0:1 Make has 0:1 Model has 0:1 Cost Style has 0:1 0:* Year has 0:1 Feature has 0:* Cost has 0:1 Car Mileage has Phone has 0:1 Model has 0:1 Target Schema T Make has 0:1 Miles has 0:1 Year Model Make Year Make Model Car MileageMiles

DASFAA 2003BYU Data Extraction Group Individual Facet Matching Attribute Names Data-Value Characteristics Expected Data Values

DASFAA 2003BYU Data Extraction Group Attribute Names Target and Source Attributes –T : A –S : B WordNet C4.5 Decision Tree: feature selection, trained on schemas in DB books –f0: same word –f1: synonym –f2: sum of distances to a common hypernym root –f3: number of different common hypernym roots –f4: sum of the number of senses of A and B

DASFAA 2003BYU Data Extraction Group WordNet Rule The number of different common hypernym roots of A and B The sum of distances of A and B to a common hypernym The sum of the number of senses of A and B

DASFAA 2003BYU Data Extraction Group Confidence Measures

DASFAA 2003BYU Data Extraction Group Data-Value Characteristics C4.5 Decision Tree Features –Numeric data (Mean, variation, standard deviation, …) –Alphanumeric data (String length, numeric ratio, space ratio)

DASFAA 2003BYU Data Extraction Group Confidence Measures

DASFAA 2003BYU Data Extraction Group Expected Data Values Target Schema T and Source Schema S –Regular expression recognizer for attribute A in T –Data instances for attribute B in S Hit Ratio = N'/N for (A, B) match –N' : number of B data instances recognized by the regular expressions of A –N: number of B data instances

DASFAA 2003BYU Data Extraction Group Confidence Measures

DASFAA 2003BYU Data Extraction Group Combined Measures Threshold:

DASFAA 2003BYU Data Extraction Group Final Confidence Measures 0 0 0

DASFAA 2003BYU Data Extraction Group Experimental Results This schema, plus 6 other schemas –32 matched attributes –376 unmatched attributes Matched: 100% Unmatched: 99.5% –“Feature” ---”Color” –“Feature” ---”Body Type” F1 93.8% F2 84% F3 92% F1 98.9% F2 97.9% F3 98.4% F1: WordNet F2: Value Characteristics F3: Expected Values

DASFAA 2003BYU Data Extraction Group Outline Information Extraction Direct Schema Matching Indirect Schema Matching Schema Matching for HTML Tables Conclusions

DASFAA 2003BYU Data Extraction Group Schema Matching Source Car Year Cost Style Year Feature Cost Phone Target Car Miles Mileage Model Make & Model Color Body Type

DASFAA 2003BYU Data Extraction Group Mapping Generation Direct Matches as described earlier: –Attribute Names based on WordNet –Value Characteristics based on value lengths, averages, … –Expected Values based on regular-expression recognizers Indirect Matches: –Direct matches –Structure Evaluation Union Selection Decomposition Composition

DASFAA 2003BYU Data Extraction Group Union and Selection Car Source Year Cost Style Year Feature Cost Phone Target Car Miles Mileage Model Make & Model Color Body Type

DASFAA 2003BYU Data Extraction Group Decomposition and Composition Car Source Year Cost Style Year Feature Cost Phone Target Car Miles Mileage Model Make & Model Color Body Type

DASFAA 2003BYU Data Extraction Group Structure PO POShipToPOBillToPOLines CityStreetCityStreetItem Count LineQtyUoM PurchaseOrder DeliverToInvoiceTo Items ItemItemCount ItemNumber QuantityUnitOfMeasure CityStreet Address TargetSource Example Taken From [MBR, VLDB’01]

DASFAA 2003BYU Data Extraction Group Structure (Nonlexical Matches) PO POShipToPOBillToPOLines CityStreetCityStreetItem Count LineQtyUoM PurchaseOrder DeliverToInvoiceTo Items ItemCount ItemNumber QuantityUnitOfMeasure CityStreet Address DeliverTo TargetSource

DASFAA 2003BYU Data Extraction Group Structure (Join over FD Relationship Sets, …) PO POBillToPOLines CityStreetCityStreetItem Count LineQtyUoM PurchaseOrder InvoiceTo Items ItemCount ItemNumber QuantityUnitOfMeasure City Street City Street POShipToDeliverTo TargetSource

DASFAA 2003BYU Data Extraction Group Structure (Lexical Matches) PO POBillToPOLines CityStreetCityStreetItem Count LineQtyUoM PurchaseOrder InvoiceTo Items ItemCount ItemNumber Quantity City Street City Street City Street City Street Count LineQty QuantityUnitOfMeasure POShipToDeliverTo TargetSource

DASFAA 2003BYU Data Extraction Group Experimental Results Applications (Number of Schemes) Precision (%) Recall (%) F (%) CorrectFalse Positive False Negative Course Schedule (5) Faculty Member (5) Real Estate (5) Data borrowed from Univ. of Washington [DDH, SIGMOD01] Indirect Matches: 94% (precision, recall, F-measure) Rough Comparison with U of W Results (direct matches only) * Course Schedule – Accuracy: ~71% * Faculty Members – Accuracy, ~92% * Real Estate (2 tests) – Accuracy: ~75%

DASFAA 2003BYU Data Extraction Group Outline Information Extraction Direct Schema Matching Indirect Schema Matching Schema Matching for HTML Tables Conclusions

DASFAA 2003BYU Data Extraction Group Problem: Different Schemas Target Database Schema {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature} Different Source Table Schemas –{Run #, Yr, Make, Model, Tran, Color, Dr} –{Make, Model, Year, Colour, Price, Auto, Air Cond., AM/FM, CD} –{Vehicle, Distance, Price, Mileage} –{Year, Make, Model, Trim, Invoice/Retail, Engine, Fuel Economy}

DASFAA 2003BYU Data Extraction Group Solution Form attribute-value pairs Do extraction Infer mappings from extraction patterns

DASFAA 2003BYU Data Extraction Group Solution: Remove Internal Factoring Discover Nesting: Make, (Model, (Year, Colour, Price, Auto, Air Cond, AM/FM, CD)*)* Unnest: μ (Model, Year, Colour, Price, Auto, Air Cond, AM/FM, CD)* μ (Year, Colour, Price, Auto, Air Cond, AM/FM, CD)* Table Legend ACURA

DASFAA 2003BYU Data Extraction Group Solution: Replace Boolean Values Legend ACURA β CD Table Yes, CD Yes, β Auto β Air Cond β AM/FM Yes, AM/FM Air Cond. Auto

DASFAA 2003BYU Data Extraction Group Solution: Form Attribute-Value Pairs Legend ACURA CD AM/FM Air Cond. Auto,,,,,,,,

DASFAA 2003BYU Data Extraction Group Solution: Do Extraction Legend ACURA CD AM/FM Air Cond. Auto

DASFAA 2003BYU Data Extraction Group Solution: Infer Mappings Legend ACURA CD AM/FM Air Cond. Auto {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature} Each row is a car. π Model μ (Year, Colour, Price, Auto, Air Cond, AM/FM, CD)* Table π Make μ (Model, Year, Colour, Price, Auto, Air Cond, AM/FM, CD)* μ (Year, Colour, Price, Auto, Air Cond, AM/FM, CD)* Table π Year Table Note: Mappings produce sets for attributes. Joining to form records is trivial because we have OIDs for table rows (e.g. for each Car).

DASFAA 2003BYU Data Extraction Group Solution: Do Extraction Legend ACURA CD AM/FM Air Cond. Auto {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature} π Model μ (Year, Colour, Price, Auto, Air Cond, AM/FM, CD)* Table

DASFAA 2003BYU Data Extraction Group Solution: Do Extraction Legend ACURA CD AM/FM Air Cond. Auto {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature} π Price Table

DASFAA 2003BYU Data Extraction Group Solution: Do Extraction Legend ACURA CD AM/FM Air Cond. Auto {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature} Yes, ρ Colour←Feature π Colour Table U ρ Auto ← Feature π Auto β Auto Table U ρ Air Cond. ← Feature π Air Cond. β Air Cond. Table U ρ AM/FM ← Feature π AM/FM β AM/FM Table U ρ CD ← Feature π CD β CD Table Yes,

DASFAA 2003BYU Data Extraction Group Experiment Tables from 60 sites 10 “ training ” tables 50 test tables 357 mappings (from all 60 sites) –172 direct mappings (same attribute and meaning) –185 indirect mappings (29 attribute synonyms, 5 “ Yes/No ” columns, 68 unions over columns for Feature, 19 factored values, and 89 columns of merged values that needed to be split)

DASFAA 2003BYU Data Extraction Group Results 10 “training” tables –100% of the 57 mappings –No false mappings 50 test tables –94.7% of the 300 mappings –No false mappings 16 missed mappings –4 partial (not all unions included) –6 non-U.S. car-ads (unrecognized makes and models) –2 U.S. unrecognized makes and models –3 prices (missing $ or found MSRP instead) –1 mileage (mileages less than 1,000)

DASFAA 2003BYU Data Extraction Group Conclusions Direct Attribute Matching –Matched 32 of 32: 100% Recall –2 False Positives: 94% Precision Direct and Indirect Attribute Matching –Matched 494 of 513: 96% Recall –22 False Positives: 96% Precision Table Mappings –Matched 284 of 300: 94.7% Recall –No False Positives: 100% Precision