WoK: A Web of Knowledge
David W. Embley, Brigham Young University, Provo, Utah, USA
A Web of Pages / A Web of Facts
– Birthdate of my great grandpa Orson
– Price and mileage of red Nissans, 1990 or newer
– Location and size of chromosome 17
– US states with property crime rates above 1%
Toward a Web of Knowledge
Fundamental questions
– What is knowledge?
– What are facts?
– How does one know?
Philosophy
– Ontology
– Epistemology
– Logic and reasoning
Ontology
– Existence: asks “What exists?”
– Concepts, relationships, and constraints with formal foundation
Epistemology
– The nature of knowledge: asks “What is knowledge?” and “How is knowledge acquired?”
– Populated conceptual model
Logic and Reasoning
– Principles of valid inference: asks “What is known?” and “What can be inferred?”
– For us, it answers: what can be inferred (in a formal sense) from conceptualized data.
– Example: Find price and mileage of red Nissans, 1990 or newer
Making this Work: How?
– Distill knowledge from the wealth of digital web data
– Annotate web pages
– Need a computational alembic to algorithmically turn raw symbols contained in web pages into knowledge
Fact Annotation … …
Turning Raw Symbols into Knowledge
Symbols: $ 11, K Nissan CD AC
Data: price(11,500) mileage(117K) make(Nissan)
Conceptualized data:
– Car(C123) has Price($11,500)
– Car(C123) has Mileage(117,000)
– Car(C123) has Make(Nissan)
– Car(C123) has Feature(AC)
Knowledge
– “Correct” facts
– Provenance
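The step from conceptualized data to knowledge (facts plus provenance) can be pictured as a small data structure. This is only an illustrative sketch: the `Fact` and `Provenance` classes, their field names, and the example URL are assumptions made here, not part of the WoK system described in the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provenance:
    source_url: str  # page the fact came from (hypothetical URL below)
    snippet: str     # raw symbols on the page that support the fact


@dataclass(frozen=True)
class Fact:
    subject: str       # non-lexical object, e.g. "Car(C123)"
    relationship: str  # e.g. "has Price"
    value: object      # lexical value in its internal representation
    provenance: Provenance

    def __str__(self) -> str:
        return f"{self.subject} {self.relationship}({self.value})"


PAGE = "http://example.com/car-ad"  # hypothetical source page

facts = [
    Fact("Car(C123)", "has Price", 11500, Provenance(PAGE, "$11,500")),
    Fact("Car(C123)", "has Mileage", 117000, Provenance(PAGE, "117K")),
    Fact("Car(C123)", "has Make", "Nissan", Provenance(PAGE, "Nissan")),
    Fact("Car(C123)", "has Feature", "AC", Provenance(PAGE, "AC")),
]

for fact in facts:
    print(fact, "from", fact.provenance.source_url)
```

Keeping the supporting snippet with each fact is one simple way to let a user check a fact against its source page.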
Actualization (with Extraction Ontologies) Find me the price and mileage of all red Nissans – I want a 1990 or newer.
Data Extraction Demo
Semantic Annotation Demo
Free-Form Query Demo
Explanation: How it Works Extraction Ontologies Semantic Annotation Free-Form Query Interpretation
Extraction Ontologies Object sets Relationship sets Participation constraints Lexical Non-lexical Primary object set Aggregation Generalization/Specialization
Extraction Ontologies
Data Frame:
Internal Representation: float
Values
– External Rep.: \s*[$]\s*(\d{1,3})*(\.\d{2})?
– Left Context: $
– Key Word Phrase, Key Words: ([Pp]rice)|([Cc]ost)| …
Operators
– Operator: >
– Key Words: (more\s*than)|(more\s*costly)|…
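A data frame of this kind pairs regular-expression recognizers with an internal representation. The sketch below is a minimal illustration using Python's `re` module. The value pattern is an assumption: it adapts the slide's external representation to accept comma-separated thousands groups. The function names are hypothetical, and only the keyword alternatives actually shown on the slide are included.

```python
import re

# Assumed value recognizer: "$" as left context, then digits with optional
# comma-separated thousands groups and optional cents.
PRICE_VALUE = re.compile(r"[$]\s*((?:\d{1,3}(?:,\d{3})+|\d+)(?:\.\d{2})?)")

# Keyword recognizers, limited to the alternatives visible on the slide.
PRICE_KEYWORDS = re.compile(r"([Pp]rice)|([Cc]ost)")
GREATER_THAN_KEYWORDS = re.compile(r"(more\s*than)|(more\s*costly)")


def to_internal(external: str) -> float:
    """Convert the external representation to the internal float."""
    return float(external.replace(",", ""))


def recognize_prices(text: str) -> list:
    """Return every price found in the text as a float."""
    return [to_internal(m.group(1)) for m in PRICE_VALUE.finditer(text)]


def mentions_price(text: str) -> bool:
    """True if the text contains a Price keyword."""
    return PRICE_KEYWORDS.search(text) is not None


def recognize_operator(text: str):
    """Return ">" if a greater-than keyword phrase occurs, else None."""
    return ">" if GREATER_THAN_KEYWORDS.search(text) else None


print(recognize_prices("2002 Jeep Liberty $7,995 Toll free"))
print(recognize_operator("cars that cost more than $5,000"))
```

Because the recognizers are declared rather than tied to a page layout, the same data frame can be applied to any page in the domain.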
Generality & Resiliency of Extraction Ontologies
Generality: assumptions about web pages
– Data rich
– Narrow domain
– Document types
    Single-record documents (hard, but doable)
    Multiple-record documents (harder)
    Records with scattered components (even harder)
Resiliency: declarative
– Still works when web pages change
– Works for new, unseen pages in the same domain
– Scalable, but takes work to declare the extraction ontology
Semantic Annotation
Free-Form Query Interpretation Parse Free-Form Query (with respect to data extraction ontology) Select Ontology Formulate Query Expression Run Query Over Semantically Annotated Data
Parse Free-Form Query
“Find me the price and mileage of all red Nissans – I want a 1996 or newer”
Recognized terms: price, mileage, red, Nissan, 1996, or newer (>= Operator)
Select Ontology “Find me the price and mileage of all red Nissans – I want a 1996 or newer”
Formulate Query Expression
Conjunctive queries and aggregate queries
Mentioned object sets are all of interest.
Values and operator keywords determine conditions.
– Color = “red”
– Make = “Nissan”
– Year >= 1996 (>= Operator)
For Let Where Return Formulate Query Expression
Run Query Over Semantically Annotated Data
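The formulate-and-run steps above can be sketched as a conjunctive filter over annotated records. This is a hedged illustration rather than the system's query engine, which the slides suggest uses For/Let/Where/Return expressions. The `formulate_query` function, the record layout, and the sample car records are all hypothetical.

```python
import operator

# Operators that keyword phrases such as "or newer" might map to.
OPERATORS = {
    "=": operator.eq,
    ">=": operator.ge,
    "<=": operator.le,
    ">": operator.gt,
    "<": operator.lt,
}


def formulate_query(requested, conditions):
    """Build a conjunctive query: every condition must hold for a record."""

    def run(records):
        results = []
        for record in records:
            satisfied = all(
                attr in record and OPERATORS[op](record[attr], value)
                for attr, op, value in conditions
            )
            if satisfied:
                results.append({name: record.get(name) for name in requested})
        return results

    return run


# Hypothetical semantically annotated data.
annotated_cars = [
    {"Make": "Nissan", "Color": "red", "Year": 1998, "Price": 4500, "Mileage": 117000},
    {"Make": "Nissan", "Color": "red", "Year": 1993, "Price": 2100, "Mileage": 160000},
    {"Make": "Nissan", "Color": "blue", "Year": 2001, "Price": 6900, "Mileage": 80000},
    {"Make": "Jeep", "Color": "red", "Year": 2002, "Price": 7995, "Mileage": 60000},
]

query = formulate_query(
    requested=["Price", "Mileage"],
    conditions=[("Color", "=", "red"), ("Make", "=", "Nissan"), ("Year", ">=", 1996)],
)
results = query(annotated_cars)
print(results)
```

The mentioned object sets (Price, Mileage) become the returned fields, while recognized values and operator keywords become the conditions.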
Great! But Problems Still Need Resolution
Automating content annotation
– Extraction-ontology creation: a few dozen person hours
– Semi-automatic creation
    FOCIH (Form-based Ontology Creation and Information Harvesting)
    TISP (Table Interpretation by Sibling Pages)
    TANGO (Table ANalysis for Generating Ontologies)
Stepping up to the envisioned Web of Knowledge
– Current & future work
    More challenging annotation projects
    Semi-automatic annotation via synergistic bootstrapping
    Knowledge bundles for research studies
– Practicalities
Manual Creation
-Library of instance recognizers -Library of lexicons
Craig’s List Alerter Constructed as a “short” class project – Nine applications – A few dozen hours Demo
2002 Jeep Liberty $7,995 Toll free
Alert! Alert! I found your Jeep Liberty for under $8,000.
FOCIH: Form-based Ontology Creation and Information Harvesting Forms (general familiarity) Information Harvesting Semi-automatic extraction ontology creation – Form-based generation of conceptual model – Instance-recognizer creation Lexicons Some pre-existing instance recognizers
FOCIH Form Creation
FOCIH Ontology Generation
FOCIH Information Harvesting
FOCIH Information-Harvesting Demo
TISP: Table Interpretation with Sibling Pages
Interpretation Technique: Sibling Page Comparison Same
Interpretation Technique: Sibling Page Comparison Almost Same
Interpretation Technique: Sibling Page Comparison Different Same
Technique Details
Unnest tables
Match tables in sibling pages
– “Perfect” match (table for layout: discard)
– “Reasonable” match (sibling table)
Determine & use table-structure pattern
– Discover pattern
– Pattern usage
– Dynamic pattern adjustment
Table Unnesting
Simple Tree Matching Algorithm Labels Values [Yang91] Match Score Categorization: Exact/Near-Exact, Sibling-Table, False
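Simple tree matching is commonly formulated as a dynamic program that counts the largest set of matching nodes between two ordered, labeled trees. The sketch below follows that standard formulation. The `(label, children)` tuple encoding and the normalization by average tree size are assumptions made here. The slides give no thresholds for the Exact/Near-Exact, Sibling-Table, and False categories, so none are invented.

```python
def simple_tree_matching(a, b):
    """Number of matching node pairs between two ordered, labeled trees.

    Each tree is a (label, children) tuple; children is a list of trees.
    """
    label_a, kids_a = a
    label_b, kids_b = b
    if label_a != label_b:
        return 0
    m, n = len(kids_a), len(kids_b)
    # best[i][j]: best match between the first i children of a
    # and the first j children of b, preserving child order.
    best = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            best[i][j] = max(
                best[i][j - 1],
                best[i - 1][j],
                best[i - 1][j - 1] + simple_tree_matching(kids_a[i - 1], kids_b[j - 1]),
            )
    return best[m][n] + 1  # +1 for the matching roots


def tree_size(tree):
    """Total number of nodes in the tree."""
    return 1 + sum(tree_size(child) for child in tree[1])


def match_score(a, b):
    """Assumed normalization: matched nodes over the average tree size."""
    return simple_tree_matching(a, b) / ((tree_size(a) + tree_size(b)) / 2)


row = ("tr", [("td", []), ("td", [])])
two_rows = ("table", [row, row])
one_row = ("table", [row])

print(match_score(two_rows, two_rows))
print(match_score(two_rows, one_row))
```

A score near 1 indicates nearly identical structure, and categorizing a pair of tables would then amount to thresholding this score.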
Table Structure Patterns
Regularity Expectations:
– ({L} {V})^n
– ({L})^n (({V})^n)+
– …
Pattern combinations are also possible.
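The label/value patterns above can be checked mechanically once each cell is marked as a label (L) or a value (V). The sketch below assumes the sibling-page idea: text that stays the same across sibling pages is treated as a label and text that changes as a value. The function names, the pattern strings used as identifiers, and the sample tables with placeholder values are hypothetical.

```python
ROW_PAIRS = "({L} {V})^n"            # each row is a label followed by a value
HEADER_ROWS = "({L})^n (({V})^n)+"   # one label row, then one or more value rows


def mark_cells(table, sibling):
    """Mark a cell 'L' if the sibling table has the same text there, else 'V'."""
    return [
        ["L" if cell == sib_cell else "V" for cell, sib_cell in zip(row, sib_row)]
        for row, sib_row in zip(table, sibling)
    ]


def detect_pattern(marks):
    """Return the matching pattern identifier, or None if neither fits."""
    if marks and all(row == ["L", "V"] for row in marks):
        return ROW_PAIRS
    if len(marks) >= 2:
        header, body = marks[0], marks[1:]
        if set(header) == {"L"} and all(
            len(row) == len(header) and set(row) == {"V"} for row in body
        ):
            return HEADER_ROWS
    return None


def interpret(table, pattern):
    """Turn a table into label-to-value records according to the pattern."""
    if pattern == ROW_PAIRS:
        return [{label: value for label, value in table}]
    if pattern == HEADER_ROWS:
        labels = table[0]
        return [dict(zip(labels, row)) for row in table[1:]]
    return []


# Placeholder values; real pages would carry actual data.
page_a = [["Genetic Position", "value-a1"], ["Genomic Position", "value-a2"]]
page_b = [["Genetic Position", "value-b1"], ["Genomic Position", "value-b2"]]

pattern = detect_pattern(mark_cells(page_a, page_b))
records = interpret(page_a, pattern)
print(pattern, records)
```

Values that happen to coincide across sibling pages would be mislabeled by this naive marking, which is one reason a real system needs more than an exact-equality test.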
Pattern Usage (Location.Genetic Position) = X: / cM [mapping data] (Location.Genomic Position) = X: bp
Dynamic Pattern Adjustment
({L})^5 (({V})^5)+
({L})^5 (({V})^5)+ | ({L})^6 (({V})^6)+
TISP Demo
TISP/FOCIH Extraction Ontology Creation Reverse engineer with TISP Adjust with FOCIH Data frames – Initialize lexicons with harvested data – Library of data frames—select and specialize
TISP/FOCIH Extraction Ontology Creation
TANGO: Table Analysis for Generating Ontologies Recognize and normalize table information Construct mini-ontologies from tables Discover inter-ontology mappings Merge mini-ontologies into a growing ontology
Recognize Table Information
                               Religion
Country      Population        Albanian  Muslim  Roman     Shi’a   Sunni   other
             (July 2001 est.)  Orthodox          Catholic  Muslim  Muslim
Afghanistan  26,813,057                                    15%     84%     1%
Albania      3,510,484         20%       70%     10%
Construct Mini-Ontology (from the table above)
Discover Mappings
Merge
TANGO Demo
Some More Challenging Annotation Applications
Multimedia Annotation
– Art Images
– Music and Lyrics
– Closed-captioning Video
Historical Document Images
– Names, Dates, Places, Events
– Learned rules for OCR’d Named Entity Recognition
Open Question Answering
Find me an image that is red, dark, scary, and beautiful.
Find something soothing but energetic: Good for recovering hospital patients.
How about Mozart’s 40th Symphony? (Here it is.)
U.S. President Barack Obama visited Iraq Monday in a stop that was overshadowed by the question of when U.S. troops should go home. Obama made his opposition to the U.S.-led invasion of Iraq five years ago a centerpiece of his campaign and was in Baghdad to assess security in Iraq, where violence has fallen to its lowest level since early
– When has Barack Obama visited Iraq?
– Which U.S. Presidents have visited Iraq?
Find names, locations, events, and dates and associations among them for my great grandma Margaret Haines. I GERTRUDE SMITH (Mrs William E Haines deceased) Married shortly after graduation Died at age of 22 Was musician and taught piano lessons 1898 HOBART L BENEDICT Millburn Essex County N J Graduated from Rutgers 1902 and from New York Law School in 1904 with degrees of B Sc M Sc and LL B Married April to Martha C Bunnell One daughter Elizabeth Benedict Counsellor at law with offices in Elizabeth and Millburn MARTHA BUNNELL (Mrs Hobart L Benedict) Millburn Essex County N J Married to Hobart L Benedict on date above 1899 CORA SMITH (Mrs Louis Slingerland) 557 Third St South St Peters- burg Florida Married Louis Slingerland a former pupil of Connec- Farms High School Mr Slingerland is engaged in building business in St Petersburg JENNIE HAINES Elmwood Ave Union Union Co N J Graduated from State Normal School Trenton N J in Principal of Hurden Looker School in Hillside Township formerly a part of Union Town- ship STELLA ILLSLEY (Mrs Harry Engel) Hollis Long Island N Y WALTER BOSCHEN Morris Ave Union N J Completed fourth year at Battin High School in 1900 Attended Rutgers College taking up civil engineering course Has been successful in the business world President of the W G Boschen Sales Co Inc manufacturer general agents for mechanical line GEORGE McQUAIDE Springfield N J Was employed by Morris County Traction Company 1900 No graduates 1901 No graduates 1902 MARGARET HAINES Elmwood Ave Union N J Took up stenography and typewriting and is now employed as private secretary of the Correspondence Department of the Singer Manufacturing Com- pany of Elizabeth N J ABBY HEADLEY (Mrs Leslie Ward) 5 Rose St Newark N J CLARENCE GRIGGS Stuyvesant Ave Union N J Graduated from Trenton State Normal School in 1905 having specialized in manual training Taught one half year at Neshanic N J one year at Lin- coin School Roselle N J Teaching manual 'training and mechanical drawing in 
Newark N J Has taken special courses in Columbia University 34
Learn rules to recognize names, even in less-than-ideal OCR’d documents.
Seed models:
– Prefix: “Mrs”, “Miss”, “Mr”
– Initials: “A”, “B”, “C”, …
– Given Name: “Charles”, “Francis”, “Herbert”
– Surname: “Goodrich”, “Wells”, “White”
– Stopword: “Jewell”, “Graves”
Updates:
– Prefix: first token in line
– Given Name: between ‘Prefix’ and ‘Initial’
– Surname: between initial and
OCR’d sample:
M RS CHARLES A JEWELL MRS FRANCIS B COOI EN MRS P W ELILSWVORT MRs HERBERT C ADSWVORTH MRS HENRY E TAINTOR MR DANIEl H WELLS MRS ARTHUR L GOODRICH Miss JOSEPHINE WHITE Mss JULIA A GRAVES Ms H B LANGDON Miss MARY H ADAMS Miss ELIZA F Mix 'MRs MIARY C ST )NEC MIRS AI I ERT H PITKIN
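The two fully stated update rules (Prefix is the first token in a line; Given Name lies between a Prefix and an Initial) can be sketched as a lexicon update over OCR'd lines. This is a deliberately naive illustration: the `update_lexicons` function is hypothetical, the Surname rule is left out because it is incomplete on the slide, and the learning procedure actually used is not described there.

```python
def update_lexicons(lines, prefixes, given_names):
    """Grow the prefix and given-name lexicons from positional evidence.

    Lexicon entries are stored upper-cased so OCR case noise is ignored.
    """
    for line in lines:
        tokens = line.split()
        if len(tokens) < 2:
            continue
        # Update rule: the first token in a line is a prefix.
        prefixes.add(tokens[0].upper())
        # Update rule: a token between the prefix and an initial is a given name.
        third_is_initial = (
            len(tokens) >= 3 and len(tokens[2]) == 1 and tokens[2].isalpha()
        )
        if third_is_initial and len(tokens[1]) > 1:
            given_names.add(tokens[1].upper())


# Seed models from the slide, upper-cased.
prefixes = {"MRS", "MISS", "MR"}
given_names = {"CHARLES", "FRANCIS", "HERBERT"}

# A few lines taken from the OCR'd sample.
ocr_lines = ["Mss JULIA A GRAVES", "Ms H B LANGDON", "MRS HENRY E TAINTOR"]

update_lexicons(ocr_lines, prefixes, given_names)
print(sorted(prefixes))
print(sorted(given_names))
```

Here the misrecognized prefix "Mss" is learned from its position alone, which is the kind of robustness to OCR noise the slide points at.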
Who was the first person to land on the Moon?
Semi-Automatic Annotation via Synergistic Bootstrapping (Based on Nested Schemas with Regular Expressions)
– Build a page-layout, pattern-based annotator
– Automate layout recognition based on examples
– Auto-generate examples with extraction ontologies
– Synergistically run pattern-based annotator & extraction-ontology annotator
PatML Editor Browser-Rendered Page Page Source Text Information Structure Tree
Synergistic Execution Extraction Ontology Document Conceptual Annotator (ontology-based annotation) Partially Annotated Document Structural Annotator (layout-driven annotation) Annotated Document Layout Patterns Pattern Generation
Knowledge Bundles for Research Studies
To do a recent study about associations between lung cancer and tp53 polymorphism, researchers needed to:
(1) do a keyword-based search on the SNP data repository for “tp53” within organism “homo sapiens”;
(2) from the returned records, open each record page one by one and find those coding SNPs that have a minor allele frequency greater than 1%;
(3) for each qualifying SNP, record the SNP ID and many properties of the SNP;
(4) perform a keyword search in PubMed and skim the hundreds of manuscripts found to determine which manuscripts are related to the SNPs of interest and fit their search criteria;
(5) extract the information of interest (e.g., the statistical information, patient information, and treatment information); and
(6) organize it.
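Steps (2) and (3) above amount to a filter over harvested records. The sketch below shows only that filter. The `SnpRecord` class, its field names, and the sample records with placeholder identifiers are hypothetical and do not reflect the SNP repository's actual schema or data.

```python
from dataclasses import dataclass


@dataclass
class SnpRecord:
    snp_id: str                    # placeholder identifiers below
    is_coding: bool                # whether the SNP is a coding SNP
    minor_allele_frequency: float  # as a fraction, so 1% is 0.01


def qualifying_snps(records, min_maf=0.01):
    """Keep coding SNPs whose minor allele frequency exceeds min_maf."""
    return [
        record
        for record in records
        if record.is_coding and record.minor_allele_frequency > min_maf
    ]


# Hypothetical harvested records.
harvested = [
    SnpRecord("snp-A", is_coding=True, minor_allele_frequency=0.12),
    SnpRecord("snp-B", is_coding=True, minor_allele_frequency=0.004),
    SnpRecord("snp-C", is_coding=False, minor_allele_frequency=0.30),
]

selected = qualifying_snps(harvested)
print([record.snp_id for record in selected])
```

Once the record pages are annotated, a filter like this would replace opening each record page one by one.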
Knowledge Bundles for Research Studies (1) Search, (2) Filter, (3) Record information
Knowledge Bundles for Research Studies (4) High precision literature search
Knowledge Bundles for Research Studies (5) Extract by reverse engineering
Knowledge Bundles for Research Studies (6) Organize harvested information
Knowledge Bundles for Research Studies
Research Challenge: “I believe that a good biomedical scenario would be to select a topic which already [has a] large structured database (gene extraction, vitamins, blood), and then search for and find web pages that augment, support or refute specific aspects of that database.” – GN
Practicalities: Bootstrapping the WoK (Future Work)
Won’t just happen without sufficient content
Niche applications
– Historical Data (e.g. Genealogy)
– Bio-research studies
Local WoKs
– Intra-organizational effort
– Individual interests
Practicalities: Scalability (Future Work)
Potential rapid growth
– Thousands of ontologies
– Millions of simultaneous queries
– Billions of annotated pages
– Trillions of facts
Search-engine-like caching & query processing
Key to Success: Simplicity via Automation
– Automatic (or near automatic) creation of extraction ontologies
– Automatic (or near automatic) annotation of web pages
– Simple but accurate query specification without specialized training