Source Discovery and Schema Mapping for Data Integration. Li Xu, Brigham Young University. BYU Data Extraction Group, funded by NSF.

Presentation transcript:

Slide 1: Source Discovery and Schema Mapping for Data Integration
Li Xu, Brigham Young University

Slide 2: Data Integration
Example query: Find houses with four bedrooms priced under $200,000
[Figure: a Mediator with a global schema queries three sources (homes.com, realestate.com, homeseekers.com), each with its own source schema (source schema 1 to 3), through wrappers]
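
The mediator architecture on this slide can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the two source record layouts, the field names, and the wrapper functions are hypothetical and are not taken from the presentation. Each wrapper translates one source's records into a shared global schema, and the mediator evaluates the example query over the wrapped results.

```python
# Hypothetical source records; the field names differ per source on purpose.
SOURCE_A = [{"beds": 4, "price": 185000}, {"beds": 3, "price": 150000}]
SOURCE_B = [{"Bedrooms": "4", "Cost": "$210,000"},
            {"Bedrooms": "4", "Cost": "$199,500"}]


def wrap_source_a(rows):
    """Wrapper: rename source A fields to the (assumed) global schema."""
    return [{"bedrooms": r["beds"], "price": r["price"]} for r in rows]


def wrap_source_b(rows):
    """Wrapper: rename source B fields and convert its string values."""
    return [
        {
            "bedrooms": int(r["Bedrooms"]),
            "price": int(r["Cost"].replace("$", "").replace(",", "")),
        }
        for r in rows
    ]


def mediator(predicate, wrapped_sources):
    """Evaluate one global-schema predicate over every wrapped source."""
    return [house for src in wrapped_sources for house in src if predicate(house)]


def four_bedrooms_under_200k(house):
    return house["bedrooms"] == 4 and house["price"] < 200000


answers = mediator(
    four_bedrooms_under_200k,
    [wrap_source_a(SOURCE_A), wrap_source_b(SOURCE_B)],
)
print(answers)
```

The user only sees the global schema; the per-source differences stay inside the wrappers.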

Slide 3: Problems
– How to Recognize Applicable Information Sources for an Application?
– How to Specify Mappings between the Source Schemas and the Global Schema?
– How to Reformulate User Queries?
– How to Merge Data from Heterogeneous Sources?
– …

Slide 4: Recognizing Ontology-Applicable HTML Documents

Slide 5: Application Ontology
How to specify an application?

Slide 6: Applicable HTML Documents
– Multiple-Record Documents
– Single-Record Documents
– HTML Forms
How to distinguish an applicable HTML document?

Slide 7: Multiple-Record Documents
[Figure: Document 1: Car Ads; Document 2: Items for Sale or Rent]

Slide 8: Single-Record Document

Slide 9: HTML Forms
Information hidden behind the HTML form

Slide 10: Recognition Heuristics
– h1+: Densities
– h2: Expected Values
– h3: Grouping
How to measure the applicability of an HTML document for an application?

Slide 11: h1+: Densities
[Figure: Document 1: Car Ads; Document 2: Items for Sale or Rent]

Slide 12: h2: Expected Values
Object set   Document 1: Car Ads   Document 2: Items for Sale or Rent
Year         3                     1
Make         2                     0
Model        3                     0
Mileage      1                     1
Price        1                     0
Feature      15                    0
PhoneNr      3                     4
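
One way to turn these counts into a single score is to compare each document's count vector with a vector of values the ontology expects. The sketch below is illustrative only: the slide does not say which similarity measure is used, so cosine similarity is an assumption, and the `expected` vector is hypothetical. The two document vectors are the counts shown on the slide.

```python
import math

OBJECT_SETS = ["Year", "Make", "Model", "Mileage", "Price", "Feature", "PhoneNr"]

# Counts taken from the slide.
doc1 = {"Year": 3, "Make": 2, "Model": 3, "Mileage": 1,
        "Price": 1, "Feature": 15, "PhoneNr": 3}
doc2 = {"Year": 1, "Make": 0, "Model": 0, "Mileage": 1,
        "Price": 0, "Feature": 0, "PhoneNr": 4}

# Hypothetical expected occurrences per car ad (NOT from the presentation).
expected = {"Year": 1.0, "Make": 1.0, "Model": 1.0, "Mileage": 0.5,
            "Price": 0.8, "Feature": 2.5, "PhoneNr": 1.0}


def cosine(a, b):
    """Cosine similarity between two count vectors keyed by object set."""
    dot = sum(a[k] * b[k] for k in OBJECT_SETS)
    norm_a = math.sqrt(sum(a[k] ** 2 for k in OBJECT_SETS))
    norm_b = math.sqrt(sum(b[k] ** 2 for k in OBJECT_SETS))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


print(round(cosine(doc1, expected), 3))  # car ads document
print(round(cosine(doc2, expected), 3))  # items for sale or rent document
```

Under these assumed expectations, the car-ads document scores much closer to the expected profile than the sale-or-rent document does.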

Slide 13: h3: Grouping (of 1-Max Object Sets)
Document 1: Car Ads
– { Year, Make, Model, Price }
– { Year, Model, Year, Make }
– { Model, Mileage, … }
Document 2: Items for Sale or Rent
– { Year, Mileage, … }
– { Mileage, Year, Price, … }
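
The grouping idea can be sketched as follows. The scoring rule is an assumption, since the slide shows only the groups: cut the sequence of recognized 1-max object sets into consecutive groups of size n, count the distinct object sets in each complete group, and normalize. The first sequence is the visible prefix of Document 1 from the slide (the elided "…" part is left out); the second sequence is hypothetical.

```python
def grouping_score(sequence, n):
    """Fraction of distinct object sets per complete group of size n."""
    # Only complete groups are scored; a trailing partial group is ignored.
    groups = [sequence[i:i + n] for i in range(0, len(sequence) - n + 1, n)]
    if not groups:
        return 0.0
    distinct = sum(len(set(group)) for group in groups)
    return distinct / (len(groups) * n)


# Visible prefix of Document 1 (Car Ads) on the slide.
doc1_prefix = ["Year", "Make", "Model", "Price",
               "Year", "Model", "Year", "Make",
               "Model", "Mileage"]

# Hypothetical sequence for a poorly grouped document (not from the slide).
poorly_grouped = ["Year", "Mileage", "Mileage", "Year",
                  "Price", "Price", "Year", "Year"]

print(grouping_score(doc1_prefix, 4))     # two complete groups: 4 and 3 distinct
print(grouping_score(poorly_grouped, 4))  # two complete groups: 2 and 2 distinct
```

A score near 1 means each group looks like one record with each object set appearing once; a low score means the same few object sets repeat without forming records.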

Slide 14: Classification Problem
Subtasks
– Multiple Records
– Singleton Record
– Application Form
Learning Algorithm: Decision Tree C4.5
– (h1+0, h1+1, …, h2, h3, Positive)
– (h1+0, h1+1, …, h2, h3, Negative)
How to construct recognition rules for an application?
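
To make the learning step concrete, here is a simplified sketch of how a decision-tree learner picks a threshold on one heuristic value from labeled training tuples. It is not C4.5 itself: C4.5 uses gain ratio, handles many attributes, and prunes the tree, whereas this sketch only finds the single best information-gain split for one attribute. The training values are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def best_threshold(values, labels):
    """Return (information gain, threshold) of the best binary split."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best = (0.0, None)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for val, lab in pairs if val <= threshold]
        right = [lab for val, lab in pairs if val > threshold]
        remainder = (len(left) * entropy(left)
                     + len(right) * entropy(right)) / len(pairs)
        gain = base - remainder
        if gain > best[0]:
            best = (gain, threshold)
    return best


# Hypothetical h3 (grouping) values with Positive/Negative labels.
h3_values = [0.90, 0.85, 0.80, 0.50, 0.40, 0.30]
h3_labels = ["Positive", "Positive", "Positive",
             "Negative", "Negative", "Negative"]

gain, threshold = best_threshold(h3_values, h3_labels)
print(gain, threshold)
```

The resulting threshold reads as a recognition rule of the form "if h3 > threshold then Positive".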

Slide 15: Experiments
Car Ads and Obituaries
Training Sets
– Car Ads (Yes | No): 143 | | |69
– Obituaries (Yes | No): 68 | | | 135
Test Sets
– Car Ads (40 | 40): Precision 95%, Recall 98%, F-measure 96%
– Obituaries (40 | 40): Precision 95%, Recall 95%, F-measure 95%
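
The reported F-measures agree with the usual harmonic mean of precision and recall, F = 2PR / (P + R). Assuming that is the definition used, a short check reproduces the figures on this slide:

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (balanced F-measure)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


car_ads_f = round(100 * f_measure(0.95, 0.98))
obituaries_f = round(100 * f_measure(0.95, 0.95))
print(car_ads_f, obituaries_f)
```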

Slide 16: Link Analysis

Slide 17: Form Filling

Slide 18: Form Filling (Cont.)

Slide 19: Incorrect Positive Response
[Figure: Motorcycle, with labels Year, Make, Price, Mileage, PhoneNr, Feature]

Slide 20: Historical Figure
[Figure: labels Deceased Name, Death Date, Birth Date, Age, Relationship, Relative Name]

Slide 21: Automating Schema Mapping for Data Integration

Slide 22: Schema Mapping
[Figure: Source and Target schemas, each rooted at Car. Labels as they appear: Year, Cost, Style, Year, Feature, Cost, Phone, Miles, Mileage, Model, Make & Model, Color, Body Type]

Slide 23: Schema Mapping for Populated Schemas
Central Idea: Exploit All Data & Metadata
Matching Possibilities (Facets)
– Attribute Names
– Data-Value Characteristics
– Expected Data Values
– Data-Dictionary Information
– Structural Properties

Slide 24: The Approach
Input:
– Two Graphs, S and T
– Data Instances for S and T
– Lightweight Domain Ontology
Output:
– A Source-to-Target Mapping between S and T. It should enable translating data instances from S to T.
– Direct and Many Indirect Matches: (t, s), (t, s’ <=  )
Framework
– Individual Facet Matching
– Combination of Individual Matchers
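
The two-stage framework (individual facet matchers, then a combination step) can be sketched as below. The slide does not state how the matchers are combined, so plain averaging with a cutoff is an assumption, and all confidence values and attribute pairings are hypothetical.

```python
def combine(facet_scores, threshold=0.5):
    """Average per-facet confidences and keep the best source per target."""
    pairs = set().union(*[scores.keys() for scores in facet_scores.values()])
    combined = {
        pair: sum(scores.get(pair, 0.0) for scores in facet_scores.values())
        / len(facet_scores)
        for pair in pairs
    }
    best = {}
    for (target, source), score in combined.items():
        # Keep a pair only if it clears the cutoff and beats the current best.
        if score >= threshold and score > best.get(target, (None, 0.0))[1]:
            best[target] = (source, score)
    return {target: source for target, (source, _) in best.items()}


# Hypothetical confidences keyed by (target attribute, source attribute).
facet_scores = {
    "attribute_names": {("Mileage", "Miles"): 0.9, ("Mileage", "Cost"): 0.1},
    "value_characteristics": {("Mileage", "Miles"): 0.8,
                              ("Mileage", "Cost"): 0.4,
                              ("Price", "Cost"): 0.9},
    "expected_values": {("Mileage", "Miles"): 0.7, ("Price", "Cost"): 0.9},
}

mapping = combine(facet_scores)
print(mapping)
```

A facet that has no opinion about a pair contributes 0 here; a real combiner might instead ignore missing facets or weight them differently.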

Slide 25: Attribute Names
Target and Source Attributes
– T : A
– S : B
WordNet
C4.5 Decision Tree: feature selection, trained on schemas in DB books
– f0: same word
– f1: synonym
– f2: sum of distances to a common hypernym root
– f3: number of different common hypernym roots
– f4: sum of the number of senses of A and B

Slide 26: WordNet Rule
[Figure: decision rule over three features]
– The number of different common hypernym roots of A and B
– The sum of distances of A and B to a common hypernym
– The sum of the number of senses of A and B
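
The name-based features can be illustrated without WordNet by using a tiny hand-made taxonomy. This is only a sketch: the hypernym links and the synonym set below are hypothetical stand-ins for WordNet, each word has a single sense and a single parent, and only f0, f1, and f2 are computed.

```python
# Hypothetical toy hypernym links (child -> parent); not real WordNet data.
HYPERNYM = {
    "car": "vehicle",
    "truck": "vehicle",
    "vehicle": "artifact",
    "phone": "device",
    "device": "artifact",
    "artifact": "entity",
}
# Hypothetical synonym pairs.
SYNONYMS = {frozenset({"car", "auto"})}


def path_to_root(word):
    """Return the word followed by its chain of hypernyms."""
    path = [word]
    while path[-1] in HYPERNYM:
        path.append(HYPERNYM[path[-1]])
    return path


def name_features(a, b):
    """Compute f0 (same word), f1 (synonym), f2 (distance via common hypernym)."""
    a, b = a.lower(), b.lower()
    path_a, path_b = path_to_root(a), path_to_root(b)
    common = [w for w in path_a if w in path_b]
    f2 = path_a.index(common[0]) + path_b.index(common[0]) if common else None
    return {"f0": a == b, "f1": frozenset({a, b}) in SYNONYMS, "f2": f2}


print(name_features("Car", "Truck"))
print(name_features("Car", "Phone"))
print(name_features("Car", "Auto"))
```

Feature vectors like these, labeled as match or non-match, are what a decision-tree learner would be trained on.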

Slide 27: Data-Value Characteristics
C4.5 Decision Tree
Features
– Numeric data (Mean, variation, standard deviation, …)
– Alphanumeric data (String length, numeric ratio, space ratio)
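
The value-based features named on this slide can be computed directly from a column of data instances. The sketch below is an illustration under assumptions: "variation" is read as population variance, the ratios are taken over all characters in the column, and the sample values are hypothetical.

```python
import statistics


def numeric_features(values):
    """Mean, variance, and standard deviation of a numeric column."""
    return {
        "mean": statistics.mean(values),
        "variance": statistics.pvariance(values),
        "std_dev": statistics.pstdev(values),
    }


def alphanumeric_features(values):
    """Average string length plus digit and space ratios of a text column."""
    total_chars = sum(len(v) for v in values)
    if total_chars == 0:
        return {"avg_length": 0.0, "numeric_ratio": 0.0, "space_ratio": 0.0}
    digits = sum(ch.isdigit() for v in values for ch in v)
    spaces = sum(ch == " " for v in values for ch in v)
    return {
        "avg_length": total_chars / len(values),
        "numeric_ratio": digits / total_chars,
        "space_ratio": spaces / total_chars,
    }


mileage = [10000, 20000, 30000]        # hypothetical numeric column
models = ["Ford Mustang", "Audi A4"]   # hypothetical text column

print(numeric_features(mileage))
print(alphanumeric_features(models))
```

Two attributes whose columns produce similar feature vectors are candidates for a match even when their names differ.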

Slide 28: Expected Data Values
Concepts and Relationships
Data Recognizers
– CarMake: “ford” “honda” …
– CarModel: “accord” “mustang” “taurus” …
[Figure: Target and Source schemas. The attribute Make & Model holds values such as Ford Mustang, Ford Taurus, Ford F150, … (CarMake . CarModel); the attributes Brand and Model hold values such as Acura, Audi, BMW, … (CarMake) and Legend, Mustang, A4, … (CarModel)]
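
A data recognizer can be approximated by a regular expression over expected values. The sketch below is an assumption-laden illustration: the slide lists only a few sample values per recognizer, so the patterns here are extended with the other values visible in the figure, and the scoring function is hypothetical. It shows how a combined Make & Model value can be recognized as a CarMake followed by a CarModel, which is the kind of evidence behind an indirect match.

```python
import re

# Patterns built from values visible on the slide; real recognizers are larger.
CAR_MAKE = re.compile(r"\b(ford|honda|acura|audi|bmw)\b", re.IGNORECASE)
CAR_MODEL = re.compile(r"\b(accord|mustang|taurus|f150|legend|a4)\b", re.IGNORECASE)


def recognized_fraction(values, recognizer):
    """Fraction of values in which the recognizer finds a match."""
    if not values:
        return 0.0
    return sum(1 for v in values if recognizer.search(v)) / len(values)


def split_make_model(value):
    """Split a combined value into (make, model) using both recognizers."""
    make = CAR_MAKE.search(value)
    model = CAR_MODEL.search(value)
    return (make.group(0) if make else None, model.group(0) if model else None)


make_and_model = ["Ford Mustang", "Ford Taurus", "Ford F150"]
brand = ["Acura", "Audi", "BMW"]
model = ["Legend", "Mustang", "A4"]

print(recognized_fraction(brand, CAR_MAKE), recognized_fraction(model, CAR_MAKE))
print([split_make_model(v) for v in make_and_model])
```

Here Brand is recognized as CarMake, Model as CarModel, and Make & Model as both, which suggests mapping Make & Model to the pair (Brand, Model) rather than to either attribute alone.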

Slide 29: Structure Matching
[Figure: Target and Source schema graphs. Labels as they appear: House, Agent, Golf course, Water front, Name, Fax, Address, Street, City, State, Basic_features, beds, SQFT, MLS, agent, location_description, name, fax, phone, location, Address, MLS, Bedrooms]

Slides 30 to 34: Structure Matching (Cont.)
[Figure: the Target and Source schema graphs from slide 29, with the same labels on each slide]

Slides 35 and 36: {House, MLS} vs. {MLS}
[Figure: Target and Source schema graphs. Labels as they appear: House, Golf course, Water front, Address, Street, City, State, Basic_features, beds, SQFT, MLS, location_description, location, MLS, Bedrooms]

Slide 37: {House, MLS} vs. {MLS}
[Figure: the same Target and Source schema graphs, with the added labels House’ and Address1’]

Slide 38: {House, MLS} vs. {MLS}
[Figure: the same Target and Source schema graphs, with the added labels House’, Golf course’, Water front’, Address1’, Street1’, City1’, State1’]

Slide 39: {Agent} vs. {agent}
[Figure: Target and Source schema graphs. Labels as they appear: Agent, Name, Fax, Address, Street, City, State, agent, name, fax, phone, address]

Slide 40: {Agent} vs. {agent}
[Figure: the same Target and Source schema graphs, with the added labels Address2’, Street2’, City2’, State2’]

Slide 41: Inter-Relationship Set
[Figure: Target and Source schema graphs. Labels as they appear: House, Agent, Golf course, Water front, Name, Fax, Address, Street, City, State, MLS, agent, MLS, Bedrooms, House’]

Slide 42: Example: Source-To-Target Mapping
[Figure: labels House’, Golf course’, Water front’, MLS, beds, agent, name, fax, Address1’, Address2’, Address’, Street’, City’, State’]

Slide 43: Target-based Integration and Query System (TIQS)
Definition: I = (T, {Si}, {Mi})
Phases
– Design (Source-to-Target Mappings {Mi})
– Query Processing (Rule Unfolding)

Slide 44: Query Reformulation
Query
– House-Bedrooms(x, 4) :- House-Bedrooms(x, 4), House-Golf_course(x, “Yes”), House-Water_front(x, “Yes”)
[Figure: labels House’, Golf course’, Water front’, MLS, beds, agent, name, fax, Address1’, Address2’, Address’, Street’, City’, State’]

Slide 45: Query Reformulation
Query
– House-Bedrooms(x, 4) :- House-Bedrooms(x, 4), House-Golf_Course(x, “Yes”), House-Water_Front(x, “Yes”)
[Figure: labels House’, Golf course’, Water front’, MLS, beds, agent, name, fax, Address1’, Address2’, Address’, Street’, City’, State’]
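
Rule unfolding replaces each target predicate in the query body with the source-side expression that the mapping defines for it. The sketch below is a minimal illustration under assumptions: the mapping table, the `src.` predicate names, and the one-atom rule bodies are hypothetical and are not the mappings from the presentation. The query body is the one shown on the slide.

```python
# Hypothetical source-to-target mapping rules:
# target predicate -> (formal parameters, body of source atoms).
MAPPINGS = {
    "House-Bedrooms": (("p0", "p1"), [("src.MLS-beds", ("p0", "p1"))]),
    "House-Golf_course": (("p0", "p1"), [("src.House'-Golf_course'", ("p0", "p1"))]),
    "House-Water_front": (("p0", "p1"), [("src.House'-Water_front'", ("p0", "p1"))]),
}


def unfold(query_body):
    """Rewrite target atoms into source atoms by substituting rule bodies."""
    unfolded = []
    for predicate, args in query_body:
        params, body = MAPPINGS[predicate]
        binding = dict(zip(params, args))  # formal parameter -> query argument
        for source_predicate, source_args in body:
            unfolded.append(
                (source_predicate, tuple(binding.get(a, a) for a in source_args))
            )
    return unfolded


# Body of the query on the slide, as (predicate, arguments) atoms.
query_body = [
    ("House-Bedrooms", ("x", 4)),
    ("House-Golf_course", ("x", "Yes")),
    ("House-Water_front", ("x", "Yes")),
]

for atom in unfold(query_body):
    print(atom)
```

Because unfolding is plain substitution, adding a new source only adds mapping rules; the user query itself stays phrased over the target schema.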

Slide 46: TIQS (Cont.)
User Queries
– Logic Rules
– Maximal and Sound Query Answers
Advantages
– Rule Unfolding
– Scalability

Slide 47: Experimental Results
[Table: columns Application (Number of Schemes), Precision (%), Recall (%), F (%), Number Matches, Number Correct, Number Incorrect; rows Faculty Member (5), Course Schedule (5), Real Estate (5); cell values missing]
Data borrowed from Univ. of Washington [DDH, SIGMOD01]
Indirect Matches: precision 87%, recall 94%, F-measure 90%
Rough Comparison with U of W Results
– Course Schedule: Accuracy ~71%
– Real Estate (2 tests): Accuracy ~75%
– Faculty Member: Accuracy ~92%

Slide 48: Conclusion
A Robust and Flexible Approach to Check Applicability of HTML Documents
A Composite Approach to Automate Schema Mapping
– Direct Matches
– Indirect Matches
An Approach that Combines Advantages of Basic Approaches to Data Integration

Slide 49: Future Work
– Test More Applications and Data to Evaluate the Approaches
– Extend Training Classifiers for Applicability Checking
– Further Automate Schema Mapping
– Automate Ontology Mapping on the Semantic Web
– Automate Mapping between XML Documents
– …

Slide 50: Thanks! Questions?