1
David W. Embley Brigham Young University Provo, Utah, USA WoK: A Web of Knowledge
2
A Web of Pages
A Web of Facts
– Birthdate of my great grandpa Orson
– Price and mileage of red Nissans, 1990 or newer
– Location and size of chromosome 17
– US states with property crime rates above 1%
3
Find me an image that is red, dark, scary, and beautiful.
4
Learn rules to recognize names, even in less-than-ideal OCR’d documents.
Seed models:
– Prefix: “Mrs”, “Miss”, “Mr”
– Initials: “A”, “B”, “C”, …
– Given Name: “Charles”, “Francis”, “Herbert”
– Surname: “Goodrich”, “Wells”, “White”
– Stopword: “Jewell”, “Graves”
Updates:
– Prefix: first token in line
– Given Name: between ‘Prefix’ and ‘Initial’
– Surname: between initial and
M RS CHARLES A JEWELL MRS FRANCIS B COOI EN MRS P W ELILSWVORT MRs HERBERT C ADSWVORTH MRS HENRY E TAINTOR MR DANIEl H WELLS MRS ARTHUR L GOODRICH Miss JOSEPHINE WHITE Mss JULIA A GRAVES Ms H B LANGDON Miss MARY H ADAMS Miss ELIZA F Mix 'MRs MIARY C ST )NEC MIRS AI I ERT H PITKIN
5
Annotating Music and Lyrics
Find something soothing but energetic: Good for recovering patients.
How about Mozart’s 40th Symphony?
6
Build a knowledge bundle for checking the association between tp53 polymorphism and lung cancer.
7
U.S. President Barack Obama visited Iraq Monday in a stop that was overshadowed by the question of when U.S. troops should go home. Obama made his opposition to the U.S.-led invasion of Iraq five years ago a centerpiece of his campaign and was in Baghdad to assess security in Iraq, where violence has fallen to its lowest level since early 2004.
When has Barack Obama visited Iraq?
Which U.S. Presidents have visited Iraq?
8
Find names, locations, events, and dates and associations among them for my great grandma Margaret Haines. I GERTRUDE SMITH (Mrs William E Haines deceased) Married shortly after graduation Died at age of 22 Was musician and taught piano lessons 1898 HOBART L BENEDICT Millburn Essex County N J Graduated from Rutgers 1902 and from New York Law School in 1904 with degrees of B Sc M Sc and LL B Married April 9 1907 to Martha C Bunnell One daughter Elizabeth Benedict Counsellor at law with offices in Elizabeth and Millburn MARTHA BUNNELL (Mrs Hobart L Benedict) Millburn Essex County N J Married to Hobart L Benedict on date above 1899 CORA SMITH (Mrs Louis Slingerland) 557 Third St South St Peters- burg Florida Married Louis Slingerland a former pupil of Connec- Farms High School Mr Slingerland is engaged in building business in St Petersburg JENNIE HAINES Elmwood Ave Union Union Co N J Graduated from State Normal School Trenton N J in 190 5 Principal of Hurden Looker School in Hillside Township formerly a part of Union Town- ship STELLA ILLSLEY (Mrs Harry Engel) Hollis Long Island N Y WALTER BOSCHEN Morris Ave Union N J Completed fourth year at Battin High School in 1900 Attended Rutgers College taking up civil engineering course Has been successful in the business world President of the W G Boschen Sales Co Inc manufacturer general agents for mechanical line GEORGE McQUAIDE Springfield N J Was employed by Morris County Traction Company 1900 No graduates 1901 No graduates 1902 MARGARET HAINES Elmwood Ave Union N J Took up stenography and typewriting and is now employed as private secretary of the Correspondence Department of the Singer Manufacturing Com- pany of Elizabeth N J ABBY HEADLEY (Mrs Leslie Ward) 5 Rose St Newark N J CLARENCE GRIGGS Stuyvesant Ave Union N J Graduated from Trenton State Normal School in 1905 having specialized in manual training Taught one half year at Neshanic N J one year at Lin- coin School Roselle N J Teaching manual 'training and 
mechanical drawing in Newark N J Has taken special courses in Columbia University 34
9
Who was the first person to land on the Moon?
10
2002 Jeep Liberty $7,995 Toll free 1-800-423-0334
Alert! Alert! I found your Jeep Liberty for under $8,000.
11
Toward a Web of Knowledge
Fundamental questions
– What is knowledge?
– What are facts?
– How does one know?
Philosophy
– Ontology
– Epistemology
– Logic and reasoning
12
Ontology
Existence asks “What exists?”
Concepts, relationships, and constraints with formal foundation
13
Epistemology
The nature of knowledge asks: “What is knowledge?” and “How is knowledge acquired?”
Populated conceptual model
14
Logic and Reasoning
Principles of valid inference asks: “What is known?” and “What can be inferred?”
For us, it answers: what can be inferred (in a formal sense) from conceptualized data.
Find price and mileage of red Nissans, 1990 or newer
15
Making this Work How?
– Distill knowledge from the wealth of digital web data
– Annotate web pages
– Need a computational alembic to algorithmically turn raw symbols contained in web pages into knowledge
Fact Annotation … …
16
Turning Raw Symbols into Knowledge
Symbols: $ 11,500 117K Nissan CD AC
Data: price(11,500) mileage(117K) make(Nissan)
Conceptualized data:
– Car(C123) has Price($11,500)
– Car(C123) has Mileage(117,000)
– Car(C123) has Make(Nissan)
– Car(C123) has Feature(AC)
Knowledge
– “Correct” facts
– Provenance
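The step from raw symbols to conceptualized data can be sketched in a few lines of Python. This is only an illustration, not the extraction-ontology machinery itself: the `Fact` tuple, the `RECOGNIZERS` table, the regular expressions, and the `ad-1` provenance tag are all assumptions made for the example.

```python
import re
from typing import List, NamedTuple


class Fact(NamedTuple):
    obj: str       # non-lexical object, e.g. "Car(C123)"
    relation: str  # relationship to a lexical object set, e.g. "has Price"
    value: str     # the symbol exactly as it appeared in the text
    source: str    # provenance: where the symbol was found


# Hypothetical recognizers, one regular expression per lexical object set.
RECOGNIZERS = {
    "Price": re.compile(r"\$\s*\d{1,3}(?:,\d{3})*"),
    "Mileage": re.compile(r"\b\d{1,3}K\b"),
    "Make": re.compile(r"\b(?:Nissan|Toyota|Ford)\b"),
    "Feature": re.compile(r"\b(?:AC|CD)\b"),
}


def conceptualize(text: str, car_id: str, source: str) -> List[Fact]:
    """Attach every recognized symbol to one car object, keeping provenance."""
    facts = []
    for name, pattern in RECOGNIZERS.items():
        for match in pattern.finditer(text):
            facts.append(Fact(f"Car({car_id})", f"has {name}", match.group(), source))
    return facts


facts = conceptualize("$ 11,500 117K Nissan CD AC", "C123", "ad-1")
for fact in facts:
    print(fact.obj, fact.relation, fact.value, "from", fact.source)
```

Keeping the `source` field on every fact is what lets a later step report provenance alongside the extracted value.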
17
Actualization (with Extraction Ontologies) Find me the price and mileage of all red Nissans – I want a 1990 or newer.
18
Data Extraction Demo
19
Semantic Annotation Demo
20
Free-Form Query Demo
21
Explanation: How it Works
– Extraction Ontologies
– Semantic Annotation
– Free-Form Query Interpretation
22
Extraction Ontologies
– Object sets
– Relationship sets
– Participation constraints
– Lexical
– Non-lexical
– Primary object set
– Aggregation
– Generalization/Specialization
23
Extraction Ontologies
Data Frame:
Internal Representation: float
Values
– External Rep.: \s*[$]\s*(\d{1,3})*(\.\d{2})?
– Left Context: $
– Key Word Phrase
– Key Words: ([Pp]rice)|([Cc]ost)| …
Operators
– Operator: >
– Key Words: (more\s*than)|(more\s*costly)|…
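A data frame of this kind can be approximated as a small Python object that bundles the external representation, the context keywords, the conversion to the internal representation, and the operator keywords. This is a minimal sketch under stated assumptions: the class name `DataFrame`, its fields, and the methods are hypothetical, and the price pattern is a variant of the slide's expression written with an explicit thousands-separator group.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class DataFrame:
    name: str
    external_rep: re.Pattern             # how a value looks in running text
    keywords: re.Pattern                 # context words that signal the concept
    to_internal: Callable[[str], float]  # external text -> internal float
    operators: Dict[str, re.Pattern] = field(default_factory=dict)

    def recognize(self, text: str) -> List[float]:
        """Return every value in the text, converted to internal form."""
        return [self.to_internal(m.group(1)) for m in self.external_rep.finditer(text)]

    def find_operator(self, text: str) -> Optional[str]:
        """Return the symbol of the first operator whose keywords occur in the text."""
        for symbol, keyword_pattern in self.operators.items():
            if keyword_pattern.search(text):
                return symbol
        return None


# Assumed variant of the slide's price expression; "$" acts as the left context.
price = DataFrame(
    name="Price",
    external_rep=re.compile(r"\$\s*(\d{1,3}(?:,\d{3})*(?:\.\d{2})?)"),
    keywords=re.compile(r"[Pp]rice|[Cc]ost"),
    to_internal=lambda s: float(s.replace(",", "")),
    operators={">": re.compile(r"more\s*than|more\s*costly")},
)

values = price.recognize("2002 Jeep Liberty $7,995 Toll free 1-800-423-0334")
op = price.find_operator("anything more than $5,000")
print(values, op)
```

Because the recognizer is declared rather than tied to a page layout, the same object can be applied to any text in the domain.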
24
Generality & Resiliency of Extraction Ontologies
Generality: assumptions about web pages
– Data rich
– Narrow domain
– Document types
  Single-record documents (hard, but doable)
  Multiple-record documents (harder)
  Records with scattered components (even harder)
Resiliency: declarative
– Still works when web pages change
– Works for new, unseen pages in the same domain
– Scalable, but takes work to declare the extraction ontology
25
Semantic Annotation
26
Free-Form Query Interpretation
– Parse Free-Form Query (with respect to data extraction ontology)
– Select Ontology
– Formulate Query Expression
– Run Query Over Semantically Annotated Data
27
Parse Free-Form Query
“Find me the price and mileage of all red Nissans – I want a 1996 or newer”
Marked terms: price, mileage, red, Nissan, 1996, “or newer” (>= Operator)
28
Select Ontology “Find me the price and mileage of all red Nissans – I want a 1996 or newer”
29
Formulate Query Expression
Conjunctive queries and aggregate queries
Mentioned object sets are all of interest.
Values and operator keywords determine conditions.
– Color = “red”
– Make = “Nissan”
– Year >= 1996 (>= Operator)
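The conditions above can be read as a conjunctive filter over annotated records. The sketch below assumes each annotated car is a plain dictionary and that the mentioned object sets (Price, Mileage) become the returned fields; the `run_query` function and the sample records are made up for illustration and are not the system's actual query engine.

```python
import operator
from typing import Any, Callable, Dict, List, Tuple

Condition = Tuple[str, Callable[[Any, Any], bool], Any]

# Conditions from the slide: Color = "red", Make = "Nissan", Year >= 1996.
conditions: List[Condition] = [
    ("Color", operator.eq, "red"),
    ("Make", operator.eq, "Nissan"),
    ("Year", operator.ge, 1996),
]
requested = ["Price", "Mileage"]  # object sets mentioned without a constraint


def run_query(
    records: List[Dict[str, Any]],
    conditions: List[Condition],
    requested: List[str],
) -> List[Dict[str, Any]]:
    """Keep records that satisfy every condition; return only requested fields."""
    results = []
    for record in records:
        if all(attr in record and op(record[attr], value) for attr, op, value in conditions):
            results.append({attr: record.get(attr) for attr in requested})
    return results


# Made-up annotated records, for illustration only.
cars = [
    {"Make": "Nissan", "Color": "red", "Year": 1998, "Price": 4500, "Mileage": 117000},
    {"Make": "Nissan", "Color": "blue", "Year": 2001, "Price": 6200, "Mileage": 90000},
    {"Make": "Nissan", "Color": "red", "Year": 1993, "Price": 1900, "Mileage": 160000},
]
answers = run_query(cars, conditions, requested)
print(answers)
```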
30
For Let Where Return Formulate Query Expression
31
Run Query Over Semantically Annotated Data
32
Great! But Problems Still Need Resolution
Automating content annotation
– Extraction-ontology creation: a few dozen person hours
– Semi-automatic creation
  FOCIH (Form-based Ontology Creation and Information Harvesting)
  TISP (Table Interpretation by Sibling Pages)
  TANGO (Table ANalysis for Generating Ontologies)
Stepping up to the envisioned Web of Knowledge
– Current & future work
  Semi-automatic annotation via synergistic bootstrapping
  Knowledge bundles for research studies
– Practicalities
33
Manual Creation
35
– Library of instance recognizers
– Library of lexicons
36
Craig’s List Alerter
Constructed as a “short” class project
– 10 applications
– a few dozen hours
Demo
37
FOCIH: Form-based Ontology Creation and Information Harvesting
Forms (general familiarity)
Information Harvesting
Semi-automatic extraction ontology creation
– Form-based generation of conceptual model
– Instance-recognizer creation
  Lexicons
  Some pre-existing instance recognizers
38
FOCIH Form Creation
39
FOCIH Ontology Generation
40
FOCIH Information Harvesting
41
FOCIH Information-Harvesting Demo
42
TISP: Table Interpretation with Sibling Pages
43
Interpretation Technique: Sibling Page Comparison Same
44
Interpretation Technique: Sibling Page Comparison Almost Same
45
Interpretation Technique: Sibling Page Comparison Different Same
46
Technique Details
Unnest tables
Match tables in sibling pages
– “Perfect” match (table for layout: discard)
– “Reasonable” match (sibling table)
Determine & use table-structure pattern
– Discover pattern
– Pattern usage
– Dynamic pattern adjustment
47
Table Unnesting
48
Simple Tree Matching Algorithm Labels Values [Yang91] Match Score Categorization: Exact/Near-Exact, Sibling-Table, False
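For readers unfamiliar with simple tree matching, the sketch below shows the standard dynamic-programming formulation: two nodes match only if their labels agree, and the children are aligned in order so that the total number of matched nodes is maximal. The `Node` class, the `size` helper, and the normalization in `match_score` are assumptions for illustration; the slide does not state how the score is normalized or which thresholds separate the Exact/Near-Exact, Sibling-Table, and False categories.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)


def simple_tree_matching(a: Node, b: Node) -> int:
    """Return the number of node pairs in a maximum order-preserving matching."""
    if a.label != b.label:
        return 0
    m, n = len(a.children), len(b.children)
    # best[i][j] = best matching between the first i children of a
    # and the first j children of b.
    best = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            paired = best[i - 1][j - 1] + simple_tree_matching(
                a.children[i - 1], b.children[j - 1]
            )
            best[i][j] = max(best[i - 1][j], best[i][j - 1], paired)
    return best[m][n] + 1  # +1 for the matched roots


def size(tree: Node) -> int:
    return 1 + sum(size(child) for child in tree.children)


def match_score(a: Node, b: Node) -> float:
    """Assumed normalization: matched nodes over the size of the larger tree."""
    return simple_tree_matching(a, b) / max(size(a), size(b))


def row(cells: int) -> Node:
    return Node("tr", [Node("td") for _ in range(cells)])


t1 = Node("table", [row(2), row(2)])
t2 = Node("table", [row(2), row(1)])
print(simple_tree_matching(t1, t2), match_score(t1, t2))
```

A score near 1 would suggest nearly identical tables, a middling score a likely sibling table, and a low score a false match, with the actual cutoffs left to the implementation.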
49
Table Structure Patterns
Regularity Expectations:
– ({L} {V})^n
– ({L})^n (({V})^n)^+
– …
Pattern combinations are also possible.
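One way to read these regularity expectations is as patterns over a sequence of cells that have already been tagged as labels (L) or values (V). The sketch below assumes exactly that: the tagging is given, cells are read in document order, and the two function names are hypothetical. It checks the two patterns shown and reports n when the sequence fits.

```python
import re
from typing import Optional


def label_value_pairs(cells: str) -> Optional[int]:
    """Return n if the cell tags fit ({L} {V})^n, otherwise None."""
    if re.fullmatch(r"(?:LV)+", cells):
        return len(cells) // 2
    return None


def header_then_rows(cells: str) -> Optional[int]:
    """Return n if the cell tags fit ({L})^n (({V})^n)^+, otherwise None."""
    match = re.fullmatch(r"(L+)(V+)", cells)
    if match is None:
        return None
    n = len(match.group(1))
    # The values must fill one or more complete rows of n cells each.
    return n if len(match.group(2)) % n == 0 else None


print(label_value_pairs("LVLVLV"))    # three label-value pairs
print(header_then_rows("LLLVVVVVV"))  # three labels, then two rows of three values
```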
50
Pattern Usage
(Location.Genetic Position) = X:12.69 +/- 0.000 cM [mapping data]
(Location.Genomic Position) = X:13518823..13515773 bp
51
Dynamic Pattern Adjustment
({L})^5 (({V})^5)^+
({L})^5 (({V})^5)^+ | ({L})^6 (({V})^6)^+
52
TISP Demo
53
TISP/FOCIH Extraction Ontology Creation
Reverse engineer with TISP
Adjust with FOCIH
Data frames
– Initialize lexicons with harvested data
– Library of data frames: select and specialize
54
TISP/FOCIH Extraction Ontology Creation
60
TANGO: Table Analysis for Generating Ontologies
– Recognize and normalize table information
– Construct mini-ontologies from tables
– Discover inter-ontology mappings
– Merge mini-ontologies into a growing ontology
61
Recognize Table Information
Country     | Population (July 2001 est.) | Religion: Albanian Orthodox | Muslim | Roman Catholic | Shi’a Muslim | Sunni Muslim | other
Afghanistan | 26,813,057                  |                             |        |                | 15%          | 84%          | 1%
Albania     | 3,510,484                   | 20%                         | 70%    | 10%            |              |              |
62
Construct Mini-Ontology
Country     | Population (July 2001 est.) | Religion: Albanian Orthodox | Muslim | Roman Catholic | Shi’a Muslim | Sunni Muslim | other
Afghanistan | 26,813,057                  |                             |        |                | 15%          | 84%          | 1%
Albania     | 3,510,484                   | 20%                         | 70%    | 10%            |              |              |
63
Discover Mappings
64
Merge
65
TANGO Demo
66
Semi-Automatic Annotation via Synergistic Bootstrapping (Based on Nested Schemas with Regular Expressions)
– Build a page-layout, pattern-based annotator
– Automate layout recognition based on examples
– Auto-generate examples with extraction ontologies
– Synergistically run pattern-based annotator & extraction-ontology annotator
67
PatML Editor Browser-Rendered Page Page Source Text Information Structure Tree
69
Synergistic Execution Extraction Ontology Document Conceptual Annotator (ontology-based annotation) Partially Annotated Document Structural Annotator (layout-driven annotation) Annotated Document Layout Patterns Pattern Generation
70
Knowledge Bundles for Research Studies
To do a recent study about associations between lung cancer and tp53 polymorphism, researchers needed to:
(1) do a keyword-based search on the SNP data repository for “tp53” within organism “homo sapiens”;
(2) from the returned records, open each record page one by one and find those coding SNPs that have a minor allele frequency greater than 1%;
(3) for each qualifying SNP, record the SNP ID and many properties of the SNP;
(4) perform a keyword search in PubMed and skim the hundreds of manuscripts found to determine which manuscripts are related to the SNPs of interest and fit their search criteria; and
(5) extract the information of interest (e.g., the statistical information, patient information, and treatment information) and organize it.
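Steps (1) through (3) amount to a search, a filter, and a projection over SNP records, which is what makes them candidates for automation. The sketch below assumes the records have already been extracted into a simple structure; the `SnpRecord` fields, the `qualifying_snps` function, and the placeholder identifiers are hypothetical and do not reflect any real repository's schema or data.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SnpRecord:
    snp_id: str
    gene: str
    organism: str
    is_coding: bool
    minor_allele_frequency: float  # as a fraction: 0.01 means 1%


def qualifying_snps(
    records: List[SnpRecord],
    gene: str = "tp53",
    organism: str = "homo sapiens",
    min_maf: float = 0.01,
) -> List[SnpRecord]:
    """Steps (1)-(2): keep coding SNPs for the gene and organism with MAF above the cutoff."""
    return [
        r
        for r in records
        if r.gene.lower() == gene
        and r.organism.lower() == organism
        and r.is_coding
        and r.minor_allele_frequency > min_maf
    ]


# Placeholder records with invented values, for illustration only.
records = [
    SnpRecord("snp-A", "TP53", "Homo sapiens", True, 0.12),
    SnpRecord("snp-B", "TP53", "Homo sapiens", True, 0.004),
    SnpRecord("snp-C", "TP53", "Homo sapiens", False, 0.30),
    SnpRecord("snp-D", "TP53", "Mus musculus", True, 0.20),
]

# Step (3): record the ID (and, in practice, further properties) of each qualifying SNP.
selected_ids = [r.snp_id for r in qualifying_snps(records)]
print(selected_ids)
```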
71
Knowledge Bundles for Research Studies (1): Search, (2): Filter, (3): Record information
72
Knowledge Bundles for Research Studies (4): High precision literature search
73
Knowledge Bundles for Research Studies (5): Extract and organize
74
Knowledge Bundles for Research Studies
75
Research Challenge: “I believe that a good biomedical scenario would be to select a topic which already [has a] large structured database (gene extraction, vitamins, blood), and then search for and find web pages that augment, support or refute specific aspects of that database.” – GN
76
Practicalities: Bootstrapping the WoK (Future Work)
Won’t just happen without sufficient content
Niche applications
– Historical Data (e.g. Genealogy)
– Bio-research studies
Local WoKs
– Intra-organizational effort
– Individual interests
77
Practicalities: Scalability (Future Work)
Potential rapid growth
– Thousands of ontologies
– Millions of simultaneous queries
– Billions of annotated pages
– Trillions of facts
Search-engine-like caching & query processing
78
Key to Success: Simplicity via Automation
– Automatic (or near automatic) creation of extraction ontologies
– Automatic (or near automatic) annotation of web pages
– Simple but accurate query specification without specialized training
www.deg.byu.edu www.tango.byu.edu