David W. Embley Brigham Young University Provo, Utah, USA
Birthdate of my great grandpa Orson
Price and mileage of red Nissans, 1990 or newer
Location and size of chromosome 17
US states with property crime rates above 1%
Fundamental questions: What is knowledge? What are facts? How does one know?
Philosophy: Ontology, Epistemology, Logic and reasoning
Ontology (existence) asks: “What exists?” For us: concepts, relationships, and constraints.
Epistemology (the nature of knowledge) asks: “What is knowledge?” and “How is knowledge acquired?” For us: a populated conceptual model.
Logic and reasoning (principles of valid inference) asks: “What is known?” and “What can be inferred?” For us, it answers: what can be inferred (in a formal sense) from conceptualized data. Example: find price and mileage of red Nissans, 1990 or newer.
Distill knowledge from the wealth of digital web data.
Annotate web pages.
Need a computational alembic to algorithmically turn raw symbols contained in web pages into knowledge.
Fact, Annotation, …
Symbols: $ 11,500 117K Nissan CD AC
Data: price(11,500) mileage(117K) make(Nissan)
Conceptualized data:
  Car(C123) has Price($11,500)
  Car(C123) has Mileage(117,000)
  Car(C123) has Make(Nissan)
  Car(C123) has Feature(AC)
Knowledge: “correct” facts, with provenance
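The step from data to conceptualized data can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Fact` class, the `conceptualize` function, and the source URL are hypothetical names introduced here, not part of the system described in the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    subject: str    # non-lexical object identifier, e.g. "C123"
    relation: str   # relationship-set name, e.g. "Car has Price"
    value: object   # lexical value in its internal representation
    source: str     # provenance: where the symbols were found


def conceptualize(car_id, data, source):
    """Attach each extracted attribute value to one Car object as a fact."""
    return [Fact(car_id, f"Car has {attribute}", value, source)
            for attribute, value in data.items()]


# Values taken from the example above; the URL is a made-up placeholder.
data = {"Price": 11500.0, "Mileage": 117000, "Make": "Nissan", "Feature": "AC"}
facts = conceptualize("C123", data, "http://example.com/car-ad")
```

Keeping the source with every fact is one simple way to carry provenance along with the conceptualized data.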
Find me the price and mileage of all red Nissans – I want a 1990 or newer.
Extraction Ontologies
Semantic Annotation
Free-Form Query Interpretation
Object sets: lexical, non-lexical, primary object set
Relationship sets
Participation constraints
Aggregation
Generalization/Specialization
Data Frame
Internal Representation: float
Values
  External Rep.: \s*[$]\s*(\d{1,3})*(\.\d{2})?
  Left Context: $
  Key Word Phrase
    Key Words: ([Pp]rice)|([Cc]ost)| …
Operators
  Operator: >
  Key Words: (more\s*than)|(more\s*costly)|…
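A data frame of this kind can be approximated with ordinary regular expressions. The sketch below is an assumption-laden illustration in Python: the value pattern is a simplified stand-in written for this example (it is not the external representation shown above), and the function names are hypothetical.

```python
import re

# Simplified stand-in for a price value pattern: a dollar sign (left context),
# then digits with optional thousands separators and optional cents.
PRICE_VALUE = re.compile(r"[$]\s*(\d{1,3}(?:,\d{3})*(?:\.\d{2})?)")
# Keywords that signal the Price object set, as in the data frame.
PRICE_KEYWORDS = re.compile(r"[Pp]rice|[Cc]ost")
# Keywords that signal the ">" operator, as in the data frame.
GREATER_THAN_KEYWORDS = re.compile(r"more\s*than|more\s*costly")


def extract_prices(text):
    """Return every recognized price as a float (the internal representation)."""
    return [float(m.group(1).replace(",", ""))
            for m in PRICE_VALUE.finditer(text)]


def mentions_price(text):
    """True if a Price keyword occurs in the text."""
    return PRICE_KEYWORDS.search(text) is not None


def mentions_greater_than(text):
    """True if a keyword for the '>' operator occurs in the text."""
    return GREATER_THAN_KEYWORDS.search(text) is not None
```

For instance, `extract_prices("Nissan, red, $11,500 or best offer")` yields a one-element list holding the float 11500.0.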
Generality: assumptions about web pages
  Data rich
  Narrow domain
  Document types
    Simple multiple-record documents (easiest)
    Single-record documents (harder)
    Records with scattered components (even harder)
Resiliency: declarative
  Still works when web pages change
  Works for new, unseen pages in the same domain
Scalable, but takes work to declare the extraction ontology
Parse Free-Form Query (with respect to the data extraction ontology)
Select Ontology
Formulate Query Expression
Run Query Over Semantically Annotated Data
“Find me the price and mileage of all red Nissans – I want a 1996 or newer”
Recognized terms: price, mileage, red, Nissan, 1996, or newer (>= Operator)
Formulate Query Expression
Conjunctive queries and aggregate queries
Mentioned object sets are all of interest.
Values and operator keywords determine conditions:
  Color = “red”
  Make = “Nissan”
  Year >= 1996 (>= Operator)
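The way values and operator keywords determine conditions can be sketched as follows. This is a minimal Python illustration, not the actual query interpreter: the tiny recognizer tables and the `interpret` function are hypothetical, and a real extraction ontology would supply far richer data frames.

```python
import re

# Hypothetical value recognizers, one per object set.
VALUE_PATTERNS = {
    "Color": re.compile(r"\b(red|blue|white|black)\b", re.IGNORECASE),
    "Make": re.compile(r"\b(Nissan|Toyota|Honda)s?\b", re.IGNORECASE),
    "Year": re.compile(r"\b(19\d{2}|20\d{2})\b"),
}
# Hypothetical keyword recognizers for object sets mentioned without a value.
KEYWORDS = {
    "Price": re.compile(r"\b(price|cost)\b", re.IGNORECASE),
    "Mileage": re.compile(r"\b(mileage|miles)\b", re.IGNORECASE),
}
# Operator keywords that may directly follow a value.
OPERATORS = [(">=", re.compile(r"\s*or\s+newer\b", re.IGNORECASE))]


def interpret(query):
    """Return (mentioned object sets, conjunctive conditions) for a query."""
    mentioned = [name for name, pat in KEYWORDS.items() if pat.search(query)]
    conditions = []
    for name, pat in VALUE_PATTERNS.items():
        m = pat.search(query)
        if m is None:
            continue
        value = int(m.group(1)) if name == "Year" else m.group(1)
        op = "="  # default comparison when no operator keyword follows
        for symbol, op_pat in OPERATORS:
            if op_pat.match(query, m.end()):
                op = symbol
        mentioned.append(name)
        conditions.append((name, op, value))
    return mentioned, conditions


query = "Find me the price and mileage of all red Nissans – I want a 1996 or newer"
mentioned, conditions = interpret(query)
```

All recognized object sets end up in `mentioned`, while only those with a recognized value contribute a condition to the conjunction.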
Formulate Query Expression: For, Let, Where, Return
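The For/Where/Return shape of the query expression can be mirrored in plain Python to show how a conjunctive query runs over annotated records. This is a sketch only: the `run_query` function and the sample records are invented for illustration (the first record reuses the price and mileage from the earlier example, with an assumed year and color), and the real system evaluates its query over semantically annotated data rather than Python dictionaries.

```python
import operator

# Comparison operators that a condition may use.
OPS = {"=": operator.eq, ">": operator.gt, ">=": operator.ge,
       "<": operator.lt, "<=": operator.le}


def run_query(records, conditions, returned):
    """Evaluate a conjunctive query over a list of attribute dictionaries."""
    results = []
    for car in records:                       # For: each annotated Car object
        satisfied = all(                      # Where: every condition must hold
            name in car and OPS[op](car[name], value)
            for name, op, value in conditions)
        if satisfied:                         # Return: only requested object sets
            results.append({name: car.get(name) for name in returned})
    return results


# Made-up sample data for illustration.
cars = [
    {"Make": "Nissan", "Color": "red", "Year": 1998, "Price": 11500.0, "Mileage": 117000},
    {"Make": "Nissan", "Color": "red", "Year": 1989, "Price": 2500.0, "Mileage": 190000},
    {"Make": "Toyota", "Color": "red", "Year": 2001, "Price": 9000.0, "Mileage": 80000},
]
conditions = [("Color", "=", "red"), ("Make", "=", "Nissan"), ("Year", ">=", 1996)]
answer = run_query(cars, conditions, ["Price", "Mileage"])
```

Only the first sample record satisfies all three conditions, so `answer` contains its price and mileage alone.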
Declaring one extraction ontology takes several dozen person-hours.
Oodles of extraction ontologies are needed.
How can we resolve this problem?
Automated extraction ontology creation?
Forms
 – General familiarity
 – Reasonable conceptual framework
 – Appropriate correspondence
     Transformable to ontological descriptions
     Capable of accepting source data
Instance recognizers
 – Some pre-existing instance recognizers
 – Lexicons
Basic form-construction facilities: single-entry field, multiple-entry field, nested form, …
Need reading path: DOM-tree structure
Need to resolve mapping problems: Split/Merge, Union/Selection
Name: Voltage-dependent anion-selective channel protein 3; VDAC-3; hVDAC3; Outer mitochondrial membrane protein porin 3
Name: T-complex protein 1 subunit theta; TCP-1-theta; CCT-theta; Renal carcinoma antigen NY-REN-15
Name: protein epsilon; Mitochondrial import stimulation factor L subunit; Protein kinase C inhibitor protein-1; KCIP E
Name: Tryptophanyl-tRNA synthetase, mitochondrial precursor; EC; Tryptophan--tRNA ligase; TrpRS; (Mt)TrpRS
Also helps adjust ontology constraints
Lexicons
Harvested Name entries:
  Name: protein epsilon; Mitochondrial import stimulation factor L subunit; Protein kinase C inhibitor protein-1; KCIP E
  Name: T-complex protein 1 subunit theta; TCP-1-theta; CCT-theta; Renal carcinoma antigen NY-REN-15
  Name: Tryptophanyl-tRNA synthetase, mitochondrial precursor; EC; Tryptophan--tRNA ligase; TrpRS; (Mt)TrpRS
  …
Resulting Name lexicon:
  …
  protein epsilon
  Mitochondrial import stimulation factor L subunit
  Protein kinase C inhibitor protein-1
  KCIP E
  …
  T-complex protein 1 subunit theta
  TCP-1-theta
  CCT-theta
  Renal carcinoma antigen NY-REN-15
  …
  Tryptophanyl-tRNA synthetase, mitochondrial precursor
  EC
  Tryptophan--tRNA ligase
  TrpRS
  (Mt)TrpRS
  …
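One simple way to turn harvested names into a lexicon-based recognizer is sketched below in Python. It is an illustration under assumptions, not the tool's actual mechanism: `build_lexicon` and `make_recognizer` are hypothetical helpers, and the sample sentence is made up.

```python
import re


def build_lexicon(harvested_names):
    """Collect distinct names, ignoring case and surrounding whitespace."""
    lexicon = {}
    for name in harvested_names:
        cleaned = name.strip()
        if cleaned:
            # Keep the first spelling seen for each case-insensitive key.
            lexicon.setdefault(cleaned.lower(), cleaned)
    return lexicon


def make_recognizer(lexicon):
    """Compile one alternation; longer names go first so they win over shorter ones."""
    alternatives = sorted(lexicon.values(), key=len, reverse=True)
    return re.compile("|".join(re.escape(a) for a in alternatives), re.IGNORECASE)


# A few names from the harvested entries, plus one case variant to show merging.
names = ["T-complex protein 1 subunit theta", "TCP-1-theta", "CCT-theta",
         "TrpRS", "(Mt)TrpRS", "tcp-1-theta"]
lexicon = build_lexicon(names)
recognizer = make_recognizer(lexicon)
found = recognizer.findall("Expression of CCT-theta and (Mt)TrpRS was measured.")
```

Sorting longer alternatives first matters because the regular-expression engine takes the first alternative that matches at a position, so `(Mt)TrpRS` must precede `TrpRS` to be recognized whole.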
Instance Recognizers: Number Patterns, Context Keywords and Phrases
Recognize and annotate with respect to an ontology
Automatic (or near automatic) creation of extraction ontologies
Automatic (or near automatic) annotation of web pages
Simple but accurate query specification without specialized training
“Effortlessly” generate WoK content
Extraction-ontology generation
  Auto-enhancement of extraction ontologies
  Form-based specification
  Auto-generation based on table interpretation
  Sophisticated conceptualization with TANGO
Automated annotation
  Extraction ontologies
  Form-based information harvesting
  Generated pattern-based annotation
Simple query specification
  Free-form queries
  Generated form-based queries