Automatic Creation and Simplified Querying of Semantic Web Content: An Approach Based on Information-Extraction Ontologies
Yihong Ding, David W. Embley, and Stephen W. Liddle
Brigham Young University
Fundamental Problems
- Lack of semantic web content
- Difficulty of content creation
- Inability to use semantic web content easily
Proposed Solutions
- Automatically annotate data-rich web pages (turning them into semantic web pages)
- Provide for free-form, textual queries of semantic web content
A Show-Case Vision
“Find me the price and mileage of red Nissans – I want a 1990 or newer.”
Demo I: Data Extraction
Demo II: Semantic Annotation
Demo III: Free-Form Query
Explanation: How It Works
- Extraction Ontologies
- Semantic Annotation
- Free-Form Query Interpretation
Extraction Ontologies
- Object sets
- Relationship sets
- Participation constraints
- Lexical
- Non-lexical
- Primary object set
- Aggregation
- Generalization/Specialization
Formalism & Extraction Ontologies (a quick side note)
- Fully formalized in predicate calculus
  - Object set ~ 1-place predicate
  - N-ary relationship set ~ n-place predicate
  - Constraint ~ closed predicate-calculus formula
- As a description logic ~ ALCN (Attributive Language with Complement and Numeric Restrictions)
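To make the correspondence above concrete, here is a small worked illustration. It is only a sketch: the predicate names Car, Year, and CarHasYear (and the role name hasYear) are assumed for the car-ads domain and do not come from the slides.

```latex
% Illustrative only: Car, Year, CarHasYear, and hasYear are assumed names.
\begin{align*}
&\text{Object sets (1-place predicates):}
  && \mathit{Car}(x),\ \mathit{Year}(y) \\
&\text{Relationship set (2-place predicate):}
  && \mathit{CarHasYear}(x, y) \\
&\text{Referential integrity (closed formula):}
  && \forall x\, \forall y\, \bigl(\mathit{CarHasYear}(x, y)
     \Rightarrow \mathit{Car}(x) \land \mathit{Year}(y)\bigr) \\
&\text{Participation constraint, at most one year per car:}
  && \forall x\, \bigl(\mathit{Car}(x)
     \Rightarrow \exists^{\le 1} y\; \mathit{CarHasYear}(x, y)\bigr) \\
&\text{Same constraint as a number restriction:}
  && \mathit{Car} \sqsubseteq\ (\le 1\ \mathit{hasYear})
\end{align*}
```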
Extraction Ontologies
Data Frame:
- Internal Representation: float
- Values
  - External Rep.: \s*[$]\s*(\d{1,3})*(\.\d{2})?
  - Left Context: $
  - Key Word Phrase
    - Key Words: ([Pp]rice)|([Cc]ost)| …
- Operators
  - Operator: >
  - Key Words: (more\s*than)|(more\s*costly)|…
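A data frame of this kind can be sketched in Python. This is a minimal illustration, not the authors' implementation: the class name `DataFrame`, its fields, and its methods are assumptions. The regular expressions are the ones shown on the slide, with the elided alternatives ("…") left out rather than guessed.

```python
import re
from dataclasses import dataclass


@dataclass
class DataFrame:
    """Hypothetical container for one object set's recognizers."""
    name: str
    value_pattern: re.Pattern      # external representation of values
    keyword_pattern: re.Pattern    # key words that signal the object set
    operator_patterns: dict        # operator symbol -> compiled key-word regex

    def extract_values(self, text):
        """Return every recognized value, converted to the internal float form."""
        values = []
        for match in self.value_pattern.finditer(text):
            # Drop surrounding whitespace and the "$" left context.
            raw = match.group(0).strip().lstrip("$").strip()
            if raw:  # a bare "$" matches the pattern but carries no value
                values.append(float(raw))
        return values

    def mentions(self, text):
        """True if a key word for this object set appears in the text."""
        return bool(self.keyword_pattern.search(text))

    def find_operator(self, text):
        """Return the first operator whose key words appear, else None."""
        for symbol, pattern in self.operator_patterns.items():
            if pattern.search(text):
                return symbol
        return None


price = DataFrame(
    name="Price",
    value_pattern=re.compile(r"\s*[$]\s*(\d{1,3})*(\.\d{2})?"),
    keyword_pattern=re.compile(r"([Pp]rice)|([Cc]ost)"),
    operator_patterns={">": re.compile(r"(more\s*than)|(more\s*costly)")},
)

print(price.extract_values("Asking price: $8900.00 or best offer"))
print(price.find_operator("anything more than $5000"))
```

Note that the slide's value pattern has no provision for thousands separators, so a string such as "$4,500" would be recognized only up to the comma; that limitation is kept here on purpose.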
Data-Extraction Results: Car Ads
- Training set for tuning ontology: 100
- Test set: 116
- Source: Salt Lake Tribune
- Recall % / Precision % by attribute: Year, Make, Model, Mileage, Price, PhoneNr, Feature (91 / 99)
Car Ads: Comments
- Dynamic sets
  - Missed: MERC, Town Car, 98 Royale
  - Could use lexicon of makes and models
- Unspecified variation in lexical patterns
  - Missed: 5 speed (instead of 5 spd), p.l (instead of p.l.)
  - Could adjust lexical patterns
- Misidentification of attributes
  - Classified AUTO in AUTO SALES as automatic transmission
  - Could adjust exceptions in lexical patterns
- Typographical errors
  - “Chrystler”, “DODG ENeon”, “I ”
  - Could look for spelling variations and common typos
General Extraction Results
- ~ 20 domains (cars, obituaries, cameras, jobs, games, prescription drugs, …)
- Simple, unified domains: nearly 100% recall and precision
- Complex, loosely defined domains (e.g. obituaries: 82% recall and 74% precision)
- Typical: 80%+ recall and precision
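For readers less familiar with the two measures reported above, the following sketch shows how recall and precision are usually computed for an extraction task. The function name and the sample values are made up for illustration; they are not data from the experiments.

```python
def recall_precision(extracted, expected):
    """Recall = correct / expected; precision = correct / extracted."""
    extracted, expected = set(extracted), set(expected)
    correct = extracted & expected
    recall = len(correct) / len(expected) if expected else 0.0
    precision = len(correct) / len(extracted) if extracted else 0.0
    return recall, precision


# Made-up example: five makes appear in the ads, four strings were extracted,
# and one of those ("AUTO") is a misidentification.
expected_makes = {"Nissan", "Ford", "MERC", "Chrysler", "Dodge"}
extracted_makes = {"Nissan", "Ford", "Chrysler", "AUTO"}

recall, precision = recall_precision(extracted_makes, expected_makes)
print(f"recall={recall:.0%} precision={precision:.0%}")
```

A missed value lowers recall only, while a wrong value lowers precision only, which is why the two figures are reported separately.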
Generality & Resiliency of Extraction Ontologies (another quick side note)
- Assumptions about web pages (generality)
  - Data rich
  - Narrow domain
  - Document types
    - Simple multiple-record documents (easiest)
    - Single-record documents (harder)
    - Records with scattered components (even harder)
- Declarative (resiliency)
  - Still works when web pages change
  - Works for new, unseen pages in the same domain
  - Scalable, but takes work to declare the extraction ontology
Semantic Annotation
Free-Form Query Interpretation
- Parse Free-Form Query (with data extraction ontology)
- Select Ontology
- Formulate Query Expression
- Run Query Over Semantically Annotated Data
Parse Free-Form Query
“Find me the price and mileage of all red Nissans – I want a 1996 or newer”
- Recognized: price, mileage, red, Nissan, 1996
- “or newer”: >= Operator
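The parsing step can be sketched as running an ontology's recognizers over the query text. Everything below is an assumption made for illustration: the recognizer tables, their tiny lexicons, and the function `parse_query` are hypothetical stand-ins for the much richer data frames of a real extraction ontology.

```python
import re

# Hypothetical key-word recognizers: object sets the user merely mentions.
OBJECT_SET_KEYWORDS = {
    "Price": re.compile(r"\b(price|cost)\b", re.IGNORECASE),
    "Mileage": re.compile(r"\b(mileage|miles)\b", re.IGNORECASE),
}

# Hypothetical value recognizers: object sets for which a value is given.
VALUE_PATTERNS = {
    "Color": re.compile(r"\b(red|blue|green|black|white)\b", re.IGNORECASE),
    "Make": re.compile(r"\b(Nissan|Ford|Toyota)s?\b", re.IGNORECASE),
    "Year": re.compile(r"\b(19\d{2}|20\d{2})\b"),
}

# Hypothetical operator key words.
OPERATOR_PATTERNS = {
    ">=": re.compile(r"\bor\s+newer\b", re.IGNORECASE),
}


def parse_query(query):
    """Return (mentioned object sets, recognized values, recognized operators)."""
    mentioned = [name for name, pat in OBJECT_SET_KEYWORDS.items()
                 if pat.search(query)]
    values = {}
    for name, pat in VALUE_PATTERNS.items():
        match = pat.search(query)
        if match:
            values[name] = match.group(1)
    operators = [op for op, pat in OPERATOR_PATTERNS.items()
                 if pat.search(query)]
    return mentioned, values, operators


query = ("Find me the price and mileage of all red Nissans "
         "– I want a 1996 or newer")
mentioned, values, operators = parse_query(query)
print(mentioned, values, operators)
```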
Select Ontology
“Find me the price and mileage of all red Nissans – I want a 1996 or newer”
- Similarity value: 5
- Similarity value: 2
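The slide does not say how the similarity value is computed. One plausible reading, sketched below purely as an assumption, is to count how many of an ontology's object sets are matched by the query and pick the ontology with the highest count. The two toy ontologies, their patterns, and the function names are hypothetical.

```python
import re

# Hypothetical ontologies, each a map from object-set name to a recognizer.
ONTOLOGIES = {
    "CarAds": {
        "Price": r"\bprice\b",
        "Mileage": r"\bmileage\b",
        "Color": r"\b(red|blue|black|white)\b",
        "Make": r"\b(Nissan|Ford|Toyota)s?\b",
        "Year": r"\b(19|20)\d{2}\b",
    },
    "RealEstate": {
        "Price": r"\bprice\b",
        "YearBuilt": r"\b(19|20)\d{2}\b",
        "Bedrooms": r"\b\d+\s*bedrooms?\b",
        "Address": r"\b\d+\s+\w+\s+(Street|Ave|Road)\b",
    },
}


def similarity(query, recognizers):
    """Assumed measure: number of object sets with at least one match."""
    return sum(1 for pattern in recognizers.values()
               if re.search(pattern, query, re.IGNORECASE))


def select_ontology(query):
    """Score every ontology and return (best name, all scores)."""
    scores = {name: similarity(query, recognizers)
              for name, recognizers in ONTOLOGIES.items()}
    best = max(scores, key=scores.get)
    return best, scores


query = ("Find me the price and mileage of all red Nissans "
         "– I want a 1996 or newer")
best, scores = select_ontology(query)
print(best, scores)
```

With these toy recognizers the car-ads ontology matches five object sets and the real-estate ontology two, so the car-ads ontology is selected. The agreement with the values on the slide comes from how the toy patterns were chosen, not from knowledge of the actual measure.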
Formulate Query Expression
- Conjunctive queries and aggregate queries
- Mentioned object sets are all of interest in the result.
- Values and operator keywords determine conditions.
  - Color = “red”
  - Make = “Nissan”
  - Year >= 1996 (>= Operator)
Formulate Query Expression
- For
- Let
- Where
- Return
Run Query Over Semantically Annotated Data
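The formulate-and-run steps can be sketched as building a conjunctive filter from the recognized conditions and applying it to annotated records. This sketch swaps in a plain Python filter for the For/Let/Where/Return form shown on the slides; the record layout, the sample data, and the function `formulate_query` are assumptions made for illustration.

```python
import operator

# Comparison operators a recognized operator key word may map to.
OPS = {
    "=": operator.eq,
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}


def formulate_query(object_sets, conditions):
    """Build a conjunctive query.

    object_sets: names to return for each matching record.
    conditions:  (object set, operator symbol, value) triples, all of which
                 must hold (no disjunction, no negation).
    """
    def run(records):
        results = []
        for record in records:
            if all(attr in record and OPS[op](record[attr], value)
                   for attr, op, value in conditions):
                results.append({name: record.get(name)
                                for name in object_sets})
        return results
    return run


# Made-up annotated data standing in for extracted car ads.
annotated_ads = [
    {"Make": "Nissan", "Color": "red", "Year": 1998,
     "Price": 4500.0, "Mileage": 90000},
    {"Make": "Nissan", "Color": "blue", "Year": 2001,
     "Price": 7900.0, "Mileage": 60000},
    {"Make": "Nissan", "Color": "red", "Year": 1993,
     "Price": 1800.0, "Mileage": 150000},
    {"Make": "Ford", "Color": "red", "Year": 2002,
     "Price": 9900.0, "Mileage": 40000},
]

query = formulate_query(
    object_sets=["Price", "Mileage", "Color", "Make", "Year"],
    conditions=[("Color", "=", "red"),
                ("Make", "=", "Nissan"),
                ("Year", ">=", 1996)],
)
results = query(annotated_ads)
print(results)
```

Only the 1998 red Nissan satisfies all three conditions, so the result is a single row containing the five mentioned object sets.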
Query Interpretation Results: Pilot Experiment with Car Ads
- 15 car-ads free-form queries from 3 volunteer CS students
- Results
  - Recognizing object sets of interest
    - Recall: 85%
    - Precision: 90%
  - Recognizing constraints
    - Recall: 61%
    - Precision: 79%
- Problems
  - Regular expressions not tuned up and lexicons incomplete
  - Ambiguities: “Are there any Ford mustangs, 2002, that are red?” (Is 2002 a year, mileage, or price?)
- Caveats
  - No disjunction
  - No negation
General Query Interpretation Results
- AskOntos (pilot experiment on 5 domains: cars, real estate, countries, movies, diamonds)
- Object sets of interest recognized
  - Recall: 90%
  - Precision: 90%
- Conditions recognized
  - Recall: 71%
  - Precision: 88%
Pragmatics: All is not rosy …
- Technical problems
  - Extraction and query-interpretation accuracy
  - Execution speed
- Harvesting
  - Crawling?!
  - Information behind forms on the hidden web
- Social problems
  - Cooperation from web site developers
  - End-user concerns
    - Motivation
    - Trust
Conclusions
- Automatically create semantic-web content
  - Do data extraction over an ordinary web page
  - Create semantic-web page
    - Cache page
    - Store external semantic annotation with respect to an ontology
- Query semantic web pages
  - Free-form queries
  - Return results
    - Table
    - Link to original web page (scrolled and highlighted)
- Pragmatic considerations