Automatic Creation and Simplified Querying of Semantic Web Content An Approach Based on Information-Extraction Ontologies Yihong Ding, David W. Embley,


Automatic Creation and Simplified Querying of Semantic Web Content An Approach Based on Information-Extraction Ontologies Yihong Ding, David W. Embley, and Stephen W. Liddle Brigham Young University

Fundamental Problems
- Lack of semantic web content
- Difficulty of content creation
- Inability to use semantic web content easily

Proposed Solutions
- Automatically annotate data-rich web pages (turning them into semantic web pages)
- Provide for free-form, textual queries of semantic web content

A Show-Case Vision
“Find me the price and mileage of red Nissans – I want a 1990 or newer.”

Demo I: Data Extraction

Demo II: Semantic Annotation

Demo III: Free-Form Query

Explanation: How it Works
- Extraction Ontologies
- Semantic Annotation
- Free-Form Query Interpretation

Extraction Ontologies
Diagram labels:
- Object sets
- Relationship sets
- Participation constraints
- Lexical
- Non-lexical
- Primary object set
- Aggregation
- Generalization/Specialization

Formalism & Extraction Ontologies (a quick side note)
- Fully formalized in predicate calculus
  - Object set ~ 1-place predicate
  - N-ary relationship set ~ n-place predicate
  - Constraint ~ closed predicate-calculus formula
- As a description logic ~ ALCN (Attributive Language with Complement and Numeric Restrictions)
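As a small worked illustration of the mapping above, the formulas below show how one object set, one binary relationship set, and two constraints might be written in predicate calculus. The predicate names (Car, Year, CarHasYear) are hypothetical and do not come from the slides; this is a sketch of the general idea, not the authors' formalization.

```latex
% Hypothetical predicate names, for illustration only.
\begin{align*}
&\text{Object sets (1-place predicates):} && \mathit{Car}(x),\ \mathit{Year}(y) \\
&\text{Relationship set (2-place predicate):} && \mathit{CarHasYear}(x, y) \\
&\text{Referential integrity:} && \forall x\,\forall y\,\bigl(\mathit{CarHasYear}(x,y) \Rightarrow \mathit{Car}(x) \land \mathit{Year}(y)\bigr) \\
&\text{Participation (at most one year per car):} && \forall x\,\forall y\,\forall z\,\bigl(\mathit{CarHasYear}(x,y) \land \mathit{CarHasYear}(x,z) \Rightarrow y = z\bigr)
\end{align*}
```

Both constraints are closed formulas (no free variables), which is what the slide means by "Constraint ~ closed predicate-calculus formula".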

Extraction Ontologies
Data Frame:
- Internal Representation: float
- Values
  - External Rep.: \s*[$]\s*(\d{1,3})*(\.\d{2})?
  - Left Context: $
  - Key Word Phrase
  - Key Words: ([Pp]rice)|([Cc]ost)| …
- Operators
  - Operator: >
  - Key Words: (more\s*than)|(more\s*costly)|…
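The data frame on this slide pairs a value pattern with context keywords and an internal type. A minimal Python sketch of that idea follows. The DataFrame class, the PRICE instance, and the simplified regular expressions are assumptions made for illustration; they are not the slide's exact patterns or the authors' implementation.

```python
import re
from dataclasses import dataclass


@dataclass
class DataFrame:
    """Hypothetical, simplified stand-in for an extraction-ontology data frame."""
    name: str
    value_pattern: re.Pattern    # external representation of values
    keyword_pattern: re.Pattern  # context keywords that signal the object set
    convert: type                # internal representation


# Illustrative patterns only, loosely modeled on the slide's price example.
PRICE = DataFrame(
    name="Price",
    value_pattern=re.compile(r"\$\s*(\d+(?:,\d{3})*(?:\.\d{2})?)"),
    keyword_pattern=re.compile(r"[Pp]rice|[Cc]ost"),
    convert=float,
)


def extract(frame, text):
    """Return every value in text that matches the frame, in internal form."""
    return [
        frame.convert(m.group(1).replace(",", ""))
        for m in frame.value_pattern.finditer(text)
    ]


def mentions(frame, text):
    """True if the text contains one of the frame's context keywords."""
    return frame.keyword_pattern.search(text) is not None


print(extract(PRICE, "1998 Nissan Altima, red, $4,500 obo"))
print(mentions(PRICE, "What does it cost?"))
```

Keeping the patterns in a declarative structure, rather than in page-specific parsing code, is what lets the same frame be reused for both extraction and query interpretation.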

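Data-Extraction Results: Car Ads
- Training set for tuning ontology: 100
- Test set: 116
- Salt Lake Tribune
Results table: columns Recall % and Precision %; rows Year, Make, Model, Mileage, Price, PhoneNr, Feature. The only values that survive in this transcript are 91 and 99, which follow the Feature row.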

Car Ads: Comments
- Dynamic sets
  - Missed: MERC, Town Car, 98 Royale
  - Could use lexicon of makes and models
- Unspecified variation in lexical patterns
  - Missed: 5 speed (instead of 5 spd), p.l (instead of p.l.)
  - Could adjust lexical patterns
- Misidentification of attributes
  - Classified AUTO in AUTO SALES as automatic transmission
  - Could adjust exceptions in lexical patterns
- Typographical errors
  - “Chrystler”, “DODG ENeon”, “I ”
  - Could look for spelling variations and common typos
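One way the "lexicon of makes" and "spelling variations" suggestions could be combined is approximate matching against a lexicon. The sketch below uses Python's standard difflib for this. The lexicon contents, the function name normalize_make, and the 0.8 cutoff are assumptions for illustration, not something the slides specify.

```python
import difflib

# Hypothetical lexicon of car makes; a real one would be far larger.
MAKES = ["Chrysler", "Dodge", "Ford", "Mercury", "Nissan"]


def normalize_make(token, lexicon=MAKES, cutoff=0.8):
    """Map a possibly misspelled token to a lexicon entry, or None if no close match."""
    by_lower = {entry.lower(): entry for entry in lexicon}
    hits = difflib.get_close_matches(token.lower(), list(by_lower), n=1, cutoff=cutoff)
    return by_lower[hits[0]] if hits else None


print(normalize_make("Chrystler"))
```

Similarity matching of this kind targets typos. Abbreviations such as MERC are a different problem and would likely need explicit lexicon entries instead.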

General Extraction Results
- ~ 20 Domains (cars, obituaries, cameras, jobs, games, prescription drugs, …)
- Simple, unified domains: nearly 100% recall and precision
- Complex, loosely defined domains (e.g. obituaries: 82% recall and 74% precision)
- Typical: 80%+ recall and precision
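For readers less familiar with the two measures, the sketch below shows the standard definitions: recall is the share of expected values that were extracted, and precision is the share of extracted values that were correct. The function name and the toy data are illustrative assumptions; the slides do not describe how the authors scored their experiments.

```python
def recall_precision(extracted, expected):
    """Return (recall, precision) for a set of extracted values against a gold set."""
    extracted, expected = set(extracted), set(expected)
    correct = len(extracted & expected)
    recall = correct / len(expected) if expected else 0.0
    precision = correct / len(extracted) if extracted else 0.0
    return recall, precision


# Toy example: 2 of 4 expected values found, 2 of 3 extracted values correct.
print(recall_precision({"a", "b", "c"}, {"a", "b", "d", "e"}))
```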

Generality & Resiliency of Extraction Ontologies (another quick side note)
- Assumptions about web pages (generality)
  - Data rich
  - Narrow domain
  - Document types
    - Simple multiple-record documents (easiest)
    - Single-record documents (harder)
    - Records with scattered components (even harder)
- Declarative (resiliency)
  - Still works when web pages change
  - Works for new, unseen pages in the same domain
  - Scalable, but takes work to declare the extraction ontology

Semantic Annotation

Free-Form Query Interpretation
- Parse Free-Form Query (with data extraction ontology)
- Select Ontology
- Formulate Query Expression
- Run Query Over Semantically Annotated Data

Parse Free-Form Query
“Find me the price and mileage of all red Nissans – I want a 1996 or newer”
Recognized in the query: price, mileage, red, Nissan, 1996, and “or newer” (the >= operator)
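The parsing step can be sketched as running an ontology's recognizers over the query text. In the Python sketch below, the three pattern tables and the function parse_query are hypothetical stand-ins for a car-ads extraction ontology; they cover only this example query and are not the authors' recognizers.

```python
import re

# Hypothetical recognizers, standing in for the data frames of a car-ads ontology.
OBJECT_SET_KEYWORDS = {
    "Price": r"\bprice\b|\bcost\b",
    "Mileage": r"\bmileage\b|\bmiles\b",
}
VALUE_PATTERNS = {
    "Color": r"\b(red|blue|white|black)\b",
    "Make": r"\b(Nissan|Ford|Toyota)s?\b",
    "Year": r"\b(19\d{2}|20\d{2})\b",
}
OPERATOR_KEYWORDS = {
    ">=": r"\bor\s+newer\b",
    "<=": r"\bor\s+older\b",
}


def parse_query(query):
    """Mark up a free-form query with the object sets, values, and operators it mentions."""
    object_sets = [
        name for name, pat in OBJECT_SET_KEYWORDS.items()
        if re.search(pat, query, re.IGNORECASE)
    ]
    values = {}
    for name, pat in VALUE_PATTERNS.items():
        match = re.search(pat, query, re.IGNORECASE)
        if match:
            values[name] = match.group(1)
    operators = [
        op for op, pat in OPERATOR_KEYWORDS.items()
        if re.search(pat, query, re.IGNORECASE)
    ]
    return {"object_sets": object_sets, "values": values, "operators": operators}


print(parse_query("Find me the price and mileage of all red Nissans - I want a 1996 or newer"))
```

This sketch does not resolve ambiguity: a bare number such as 2002 matches the Year pattern here, even though it could equally be a price or a mileage.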

Select Ontology
“Find me the price and mileage of all red Nissans – I want a 1996 or newer”
- Similarity value: 5
- Similarity value: 2
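The slide does not say how the similarity value is computed. One simple possibility, assumed here purely for illustration, is to count how many of an ontology's recognizers fire on the query and pick the ontology with the highest count. The two toy ontologies and their patterns below are hypothetical.

```python
import re

# Hypothetical ontologies, each reduced to a list of recognizer patterns.
ONTOLOGIES = {
    "CarAds": [r"\bprice\b", r"\bmileage\b", r"\bred\b", r"\bnissans?\b", r"\b(19|20)\d{2}\b"],
    "RealEstate": [r"\bprice\b", r"\bbedrooms?\b", r"\bacres?\b", r"\b(19|20)\d{2}\b"],
}


def similarity(patterns, query):
    """Assumed scoring rule: the number of recognizers that match the query."""
    return sum(1 for pat in patterns if re.search(pat, query, re.IGNORECASE))


def select_ontology(query, ontologies=ONTOLOGIES):
    """Return the name of the ontology whose recognizers match the query most often."""
    return max(ontologies, key=lambda name: similarity(ontologies[name], query))


query = "Find me the price and mileage of all red Nissans - I want a 1996 or newer"
print(select_ontology(query))
```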

Formulate Query Expression
- Conjunctive queries and aggregate queries
- Mentioned object sets are all of interest in the result.
- Values and operator keywords determine conditions.
  - Color = “red”
  - Make = “Nissan”
  - Year >= 1996 (>= operator)
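The rules on this slide can be sketched as follows: every mentioned object set becomes a result column, every recognized value becomes a condition, and the condition uses equality unless an operator keyword was attached to that value. The input shape, the function names formulate and run, and the sample records are assumptions for illustration. Aggregate queries are not covered.

```python
import operator

OPS = {"=": operator.eq, ">=": operator.ge, "<=": operator.le,
       ">": operator.gt, "<": operator.lt}


def formulate(parsed):
    """Build a conjunctive query from a parsed request.

    parsed is a hypothetical structure:
    {"object_sets": [...], "values": {attr: value}, "operators": {attr: op}}
    """
    where = [
        (attr, parsed["operators"].get(attr, "="), value)
        for attr, value in parsed["values"].items()
    ]
    return {"select": list(parsed["object_sets"]), "where": where}


def run(query, records):
    """Keep records that satisfy every condition, projected onto the selected columns."""
    rows = []
    for rec in records:
        if all(attr in rec and OPS[op](rec[attr], value)
               for attr, op, value in query["where"]):
            rows.append({col: rec.get(col) for col in query["select"]})
    return rows


parsed = {
    "object_sets": ["Price", "Mileage"],
    "values": {"Color": "red", "Make": "Nissan", "Year": 1996},
    "operators": {"Year": ">="},
}
records = [  # made-up annotated records
    {"Make": "Nissan", "Color": "red", "Year": 1998, "Price": 4500, "Mileage": 90000},
    {"Make": "Nissan", "Color": "red", "Year": 1994, "Price": 2500, "Mileage": 140000},
    {"Make": "Ford", "Color": "red", "Year": 2001, "Price": 6000, "Mileage": 70000},
]
print(run(formulate(parsed), records))
```

Because all conditions are joined with "and", the sketch shares the limitation named later in the deck: it has no way to express disjunction or negation.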

Formulate Query Expression
For, Let, Where, Return
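The four keywords on this slide read like the clauses of an XQuery-style FLWR expression. Assuming that is the target, the sketch below renders a conjunctive query as such an expression. The document source, the CarAd element name, and the function render_flwr are hypothetical; the slides do not show the actual generated query.

```python
def literal(value):
    """Quote strings and leave numbers bare."""
    return f'"{value}"' if isinstance(value, str) else str(value)


def render_flwr(select, where, source='doc("annotated.xml")//CarAd'):
    """Render selected attributes and (attr, op, value) conditions as FLWR text."""
    # Bind each attribute once, conditions first, preserving order.
    attrs = list(dict.fromkeys([a for a, _, _ in where] + list(select)))
    lines = [
        f"for $ad in {source}",
        "let " + ", ".join(f"${a.lower()} := $ad/{a}" for a in attrs),
        "where " + " and ".join(
            f"${a.lower()} {op} {literal(v)}" for a, op, v in where
        ),
        "return <result>" + "".join("{$" + a.lower() + "}" for a in select) + "</result>",
    ]
    return "\n".join(lines)


print(render_flwr(
    ["Price", "Mileage"],
    [("Color", "=", "red"), ("Make", "=", "Nissan"), ("Year", ">=", 1996)],
))
```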

Run Query Over Semantically Annotated Data

Query Interpretation Results: Pilot Experiment with Car Ads
- 15 car-ads free-form queries from 3 volunteer CS students
- Results
  - Recognizing object sets of interest
    - Recall: 85%
    - Precision: 90%
  - Recognizing constraints
    - Recall: 61%
    - Precision: 79%
- Problems
  - Regular expressions not tuned up and lexicons incomplete
  - Ambiguities: “Are there any Ford mustangs, 2002, that are red?” (Is 2002 a year, mileage, or price?)
- Caveats
  - No disjunction
  - No negation

General Query Interpretation Results
- AskOntos (pilot experiment on 5 domains: cars, real estate, countries, movies, diamonds)
- Object sets of interest recognized
  - Recall: 90%
  - Precision: 90%
- Conditions recognized
  - Recall: 71%
  - Precision: 88%

Pragmatics
- Technical problems
  - Extraction and query-interpretation accuracy
  - Execution speed
- Harvesting
  - Crawling?!
  - Information behind forms on the hidden web
- Social problems
  - Cooperation from web site developers
  - End-user concerns
    - Motivation
    - Trust
All is not rosy …

Conclusions
- Automatically create semantic-web content
  - Do data extraction over an ordinary web page
  - Create semantic-web page
    - Cache page
    - Store external semantic annotation wrt an ontology
- Query semantic web pages
  - Free-form queries
  - Return results
    - Table
    - Link to original web page (scrolled and highlighted)
- Pragmatic considerations