
David W. Embley Brigham Young University Provo, Utah, USA WoK: A Web of Knowledge.


1 David W. Embley Brigham Young University Provo, Utah, USA WoK: A Web of Knowledge

2 A Web of Pages → A Web of Facts – Birthdate of my great grandpa Orson – Price and mileage of red Nissans, 1990 or newer – Location and size of chromosome 17 – US states with property crime rates above 1%

3 Fundamental questions – What is knowledge? – What are facts? – How does one know? Philosophy – Ontology – Epistemology – Logic and reasoning Toward a Web of Knowledge

4 Existence → asks “What exists?” Concepts, relationships, and constraints with a formal foundation Ontology

5 The nature of knowledge → asks: “What is knowledge?” and “How is knowledge acquired?” Populated conceptual model Epistemology

6 Principles of valid inference – asks: “What is known?” and “What can be inferred?” For us, it answers: what can be inferred (in a formal sense) from conceptualized data. Logic and Reasoning Find price and mileage of red Nissans, 1990 or newer

7 Distill knowledge from the wealth of digital web data Annotate web pages Need a computational alembic to algorithmically turn raw symbols contained in web pages into knowledge Making this Work → How? Fact Annotation … …

8 Turning Raw Symbols into Knowledge Symbols: $ 11,500 117K Nissan CD AC Data: price(11,500) mileage(117K) make(Nissan) Conceptualized data: – Car(C123) has Price($11,500) – Car(C123) has Mileage(117,000) – Car(C123) has Make(Nissan) – Car(C123) has Feature(AC) Knowledge – “Correct” facts – Provenance
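
The step from raw symbols to conceptualized data with provenance can be sketched in Python. This is only an illustration under stated assumptions: the `Fact` class, its `source` field, and the example URL are hypothetical and are not part of the WoK system described in the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    """One conceptualized fact: object, relationship, value, and provenance."""
    obj: str        # non-lexical object identifier, e.g. "Car(C123)"
    relation: str   # relationship-set name, e.g. "has Price"
    value: str      # lexical value as it appeared on the page
    source: str     # hypothetical provenance field: where the symbol was found


# Raw symbols as they might appear in a car ad (taken from the slide).
symbols = ["$11,500", "117K", "Nissan", "AC"]

# Data: each symbol is classified. The classification is hard-coded here;
# a real system would use recognizers such as extraction-ontology data frames.
labels = ["Price", "Mileage", "Make", "Feature"]

page = "http://example.com/ad/123"  # placeholder URL (an assumption)

# Conceptualized data: every classified symbol is attached to one car object.
facts = [Fact("Car(C123)", f"has {label}", symbol, page)
         for label, symbol in zip(labels, symbols)]

for fact in facts:
    print(f"{fact.obj} {fact.relation}({fact.value})  [from {fact.source}]")
```

Keeping the source next to each fact is one simple way to carry the provenance that the slide lists as part of knowledge.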

9 Actualization (with Extraction Ontologies) Find me the price and mileage of all red Nissans – I want a 1990 or newer.

10 Data Extraction Demo

11 Semantic Annotation Demo

12 Free-Form Query Demo

13 Explanation: How it Works Extraction Ontologies Semantic Annotation Free-Form Query Interpretation

14 Extraction Ontologies Object sets Relationship sets Participation constraints Lexical Non-lexical Primary object set Aggregation Generalization/Specialization

15 Extraction Ontologies External Rep.: \s*[$]\s*(\d{1,3})*(\.\d{2})? Key Word Phrase Left Context: $ Data Frame: Internal Representation: float Values Key Words: ([Pp]rice)|([Cc]ost)| … Operators Operator: > Key Words: (more\s*than)|(more\s*costly)|…
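
A data frame of the kind outlined on this slide can be approximated as a small Python class. This is a minimal sketch, not the authors' implementation: the class name `DataFrame`, its methods, and the value regex (a simplified variant, not the exact pattern on the slide) are assumptions. The keyword and operator patterns follow the slide but omit its elided alternatives.

```python
import re
from dataclasses import dataclass, field


@dataclass
class DataFrame:
    """Hypothetical recognizer for one lexical object set (e.g. Price)."""
    name: str
    value_pattern: re.Pattern     # external representation of values
    keyword_pattern: re.Pattern   # keywords signalling the object set
    operators: dict = field(default_factory=dict)  # symbol -> keyword regex

    def to_internal(self, text: str) -> float:
        # Internal representation: float, with thousands separators removed.
        return float(text.replace(",", ""))

    def extract(self, text: str) -> list:
        # Return every value whose external representation matches.
        return [self.to_internal(m.group(1))
                for m in self.value_pattern.finditer(text)]

    def mentions(self, text: str) -> bool:
        # True if the text contains one of the object set's keywords.
        return self.keyword_pattern.search(text) is not None

    def find_operator(self, text: str):
        # Return the first operator whose keywords occur in the text.
        for symbol, pattern in self.operators.items():
            if pattern.search(text):
                return symbol
        return None


price = DataFrame(
    name="Price",
    # Left context "$", then comma-grouped digits and optional cents.
    value_pattern=re.compile(r"\$\s*(\d{1,3}(?:,\d{3})*(?:\.\d{2})?)"),
    keyword_pattern=re.compile(r"[Pp]rice|[Cc]ost"),
    operators={">": re.compile(r"more\s*than|more\s*costly")},
)

print(price.extract("2002 Jeep Liberty $7,995 Toll free 1-800-423-0334"))
print(price.find_operator("costs more than $5,000"))
```

Because the recognizer is declared as patterns rather than tied to page layout, the same object can be applied to any page in the domain, which is the resiliency argument made on the next slide.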

16 Generality & Resiliency of Extraction Ontologies Generality: assumptions about web pages – Data rich – Narrow domain – Document types Single-record documents (hard, but doable) Multiple-record documents (harder) Records with scattered components (even harder) Resiliency: declarative – Still works when web pages change – Works for new, unseen pages in the same domain – Scalable, but takes work to declare the extraction ontology

17 Semantic Annotation

18 Free-Form Query Interpretation Parse Free-Form Query (with respect to data extraction ontology) Select Ontology Formulate Query Expression Run Query Over Semantically Annotated Data

19 Parse Free-Form Query “Find me the price and mileage of all red Nissans – I want a 1996 or newer” Recognized: price, mileage, red, Nissan, 1996; “or newer” → >= Operator

20 Select Ontology “Find me the price and mileage of all red Nissans – I want a 1996 or newer”

21 Conjunctive queries and aggregate queries Mentioned object sets are all of interest. Values and operator keywords determine conditions. – Color = “red” – Make = “Nissan” – Year >= 1996 >= Operator Formulate Query Expression
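
The parse and formulate steps can be sketched in Python as follows. All names, the small lexicons, and the sample records are hypothetical. The slides formulate the query as a For/Let/Where/Return expression, whereas this sketch simply filters a list of dictionaries to show the same conjunctive logic.

```python
import operator
import re

# Hypothetical recognizers standing in for an extraction ontology's data frames.
VALUE_RECOGNIZERS = {
    "Color": re.compile(r"\b(red|blue|black|white)\b", re.I),
    "Make": re.compile(r"\b(Nissan|Toyota|Honda|Jeep)s?\b", re.I),
    "Year": re.compile(r"\b(19\d{2}|20\d{2})\b"),
}
OBJECT_SET_KEYWORDS = {
    "Price": re.compile(r"\bprice\b", re.I),
    "Mileage": re.compile(r"\bmileage\b", re.I),
}
# Operator keywords that directly follow a recognized value.
OPERATOR_KEYWORDS = [
    (">=", re.compile(r"\s+or\s+newer", re.I)),
    ("<=", re.compile(r"\s+or\s+older", re.I)),
]
OPS = {"=": operator.eq, ">=": operator.ge, "<=": operator.le}


def formulate(query):
    """Return (requested object sets, list of (name, operator, value))."""
    requested = [n for n, kw in OBJECT_SET_KEYWORDS.items() if kw.search(query)]
    conditions = []
    for name, recognizer in VALUE_RECOGNIZERS.items():
        for m in recognizer.finditer(query):
            op = "="  # default when no operator keyword follows the value
            for symbol, kw in OPERATOR_KEYWORDS:
                if kw.match(query[m.end():]):
                    op = symbol
                    break
            text = m.group(1)
            conditions.append((name, op, int(text) if text.isdigit() else text))
    return requested, conditions


def norm(value):
    # Compare strings case-insensitively; leave numbers untouched.
    return value.lower() if isinstance(value, str) else value


def run(requested, conditions, records):
    """Keep records satisfying every condition; project the requested fields."""
    return [{k: r[k] for k in requested} for r in records
            if all(OPS[op](norm(r[n]), norm(v)) for n, op, v in conditions)]


# Made-up annotated records, for illustration only.
records = [
    {"Make": "Nissan", "Color": "red", "Year": 1998, "Price": 11500, "Mileage": 117000},
    {"Make": "Nissan", "Color": "red", "Year": 1994, "Price": 3000, "Mileage": 150000},
    {"Make": "Toyota", "Color": "red", "Year": 2001, "Price": 9000, "Mileage": 80000},
]

query = "Find me the price and mileage of all red Nissans – I want a 1996 or newer"
requested, conditions = formulate(query)
print(requested, conditions)
print(run(requested, conditions, records))
```

The mentioned object sets (Price, Mileage) become the returned fields, while recognized values and operator keywords become the conjunctive conditions, mirroring the rules listed on the slide.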

22 For Let Where Return Formulate Query Expression

23 Run Query Over Semantically Annotated Data

24 Automating content annotation – Extraction-ontology creation: a few dozen person hours – Semi-automatic creation FOCIH (Form-based Ontology Creation and Information Harvesting) TISP (Table Interpretation by Sibling Pages) TANGO (Table ANalysis for Generating Ontologies) Stepping up to the envisioned Web of Knowledge – Current & future work More challenging annotation projects Semi-automatic annotation via synergistic bootstrapping Knowledge bundles for research studies – Practicalities Great! But Problems Still Need Resolution

25 Manual Creation

26

27 -Library of instance recognizers -Library of lexicons

28 Craig’s List Alerter Constructed as a “short” class project – Nine applications – A few dozen hours Demo

29 2002 Jeep Liberty $7,995 Toll free 1-800-423-0334 Alert! Alert! I found your Jeep Liberty for under $8,000.

30 FOCIH: Form-based Ontology Creation and Information Harvesting Forms (general familiarity) Information Harvesting Semi-automatic extraction ontology creation – Form-based generation of conceptual model – Instance-recognizer creation Lexicons Some pre-existing instance recognizers

31 FOCIH Form Creation

32 FOCIH Ontology Generation

33 FOCIH Information Harvesting

34 FOCIH Information-Harvesting Demo

35 TISP: Table Interpretation with Sibling Pages

36 Interpretation Technique: Sibling Page Comparison Same

37 Interpretation Technique: Sibling Page Comparison Almost Same

38 Interpretation Technique: Sibling Page Comparison Different Same

39 Technique Details Unnest tables Match tables in sibling pages – “Perfect” match (table for layout → discard) – “Reasonable” match (sibling table) Determine & use table-structure pattern – Discover pattern – Pattern usage – Dynamic pattern adjustment

40 Table Unnesting

41 Simple Tree Matching Algorithm Labels Values [Yang91] Match Score Categorization: Exact/Near-Exact, Sibling-Table, False
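
For readers unfamiliar with simple tree matching, the following Python sketch shows the standard dynamic-programming formulation usually attributed to [Yang91]. The `Node` class and the example trees are assumptions for illustration. The slides do not state the thresholds used to categorize a score as Exact/Near-Exact, Sibling-Table, or False, so none are given here.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical ordered tree node, e.g. one HTML tag in a table."""
    label: str
    children: list = field(default_factory=list)


def size(tree: Node) -> int:
    """Number of nodes in the tree."""
    return 1 + sum(size(child) for child in tree.children)


def simple_tree_matching(a: Node, b: Node) -> int:
    """Size of the largest order-preserving top-down matching of a and b."""
    if a.label != b.label:
        return 0  # differing roots: nothing below them can match
    m, n = len(a.children), len(b.children)
    # M[i][j] = best matching of the first i subtrees of a
    #           with the first j subtrees of b.
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            M[i][j] = max(
                M[i][j - 1],
                M[i - 1][j],
                M[i - 1][j - 1]
                + simple_tree_matching(a.children[i - 1], b.children[j - 1]),
            )
    return M[m][n] + 1  # +1 for the matched roots


def row(cells: int) -> Node:
    return Node("tr", [Node("td") for _ in range(cells)])


table_a = Node("table", [row(2), row(2)])
table_b = Node("table", [row(2), row(1)])

print(simple_tree_matching(table_a, table_a), size(table_a))
print(simple_tree_matching(table_a, table_b), size(table_b))
```

A match count equal to both tree sizes indicates structurally identical tables; dividing the count by a tree size gives a normalized score that could then be thresholded.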

42 Table Structure Patterns Regularity Expectations: ({L} {V})^n ({L})^n (({V})^n)^+ … Pattern combinations are also possible.
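
One way to picture pattern discovery is the Python sketch below. It assumes that a cell whose text is identical in two sibling pages is a label ({L}) and a cell whose text differs is a value ({V}); this is a simplification of the sibling-page comparison, and the function names and placeholder values are hypothetical.

```python
def classify_cells(table_a, table_b):
    """Mark each cell 'L' if equal in both sibling tables, else 'V'."""
    return ["".join("L" if x == y else "V" for x, y in zip(row_a, row_b))
            for row_a, row_b in zip(table_a, table_b)]


def detect_pattern(rows):
    """Return the name of the regularity the classified rows follow, if any."""
    if rows and all(r == "LV" for r in rows):
        return "({L} {V})^n"            # one label-value pair per row
    n = len(rows[0]) if rows else 0
    if (len(rows) >= 2 and n > 0 and rows[0] == "L" * n
            and all(r == "V" * n for r in rows[1:])):
        return "({L})^n (({V})^n)^+"    # header row, then value rows
    return None


# Row-oriented table: labels from the deck, second page's values are placeholders.
page_a = [["Genetic Position", "X:12.69 +/- 0.000 cM"],
          ["Genomic Position", "X:13518823..13515773 bp"]]
page_b = [["Genetic Position", "placeholder value 1"],
          ["Genomic Position", "placeholder value 2"]]

# Column-oriented table with made-up contents.
cars_a = [["Make", "Year"], ["Nissan", "1996"]]
cars_b = [["Make", "Year"], ["Jeep", "2002"]]

print(detect_pattern(classify_cells(page_a, page_b)))
print(detect_pattern(classify_cells(cars_a, cars_b)))
```

A value that happens to be the same on both pages would be misclassified as a label here, which is one reason a practical system would compare more than two sibling pages.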

43 Pattern Usage (Location.Genetic Position) = X:12.69 +/- 0.000 cM [mapping data] (Location.Genomic Position) = X:13518823..13515773 bp

44 Dynamic Pattern Adjustment ({L})^5 (({V})^5)^+ → ({L})^5 (({V})^5)^+ | ({L})^6 (({V})^6)^+

45 TISP Demo

46 TISP/FOCIH Extraction Ontology Creation Reverse engineer with TISP Adjust with FOCIH Data frames – Initialize lexicons with harvested data – Library of data frames—select and specialize

47 TISP/FOCIH Extraction Ontology Creation

48

49

50

51

52

53 TANGO: Table Analysis for Generating Ontologies Recognize and normalize table information Construct mini-ontologies from tables Discover inter-ontology mappings Merge mini-ontologies into a growing ontology

54 Recognize Table Information
                               Religion
Country      Population        Albanian  Muslim  Roman     Shi’a   Sunni   other
             (July 2001 est.)  Orthodox          Catholic  Muslim  Muslim
Afghanistan  26,813,057                                    15%     84%     1%
Albania      3,510,484         20%       70%     10%

55 Construct Mini-Ontology
                               Religion
Country      Population        Albanian  Muslim  Roman     Shi’a   Sunni   other
             (July 2001 est.)  Orthodox          Catholic  Muslim  Muslim
Afghanistan  26,813,057                                    15%     84%     1%
Albania      3,510,484         20%       70%     10%

56 Discover Mappings

57 Merge

58 TANGO Demo

59 Some More Challenging Annotation Applications Multimedia Annotation – Art Images – Music and Lyrics – Closed-captioning Video Historical Document Images – Names, Dates, Places, Events – Learned rules for OCR’d Named Entity Recognition Open Question Answering

60 Find me an image that is red, dark, scary, and beautiful.

61 Find something soothing but energetic: Good for recovering hospital patients. How about Mozart’s 40th Symphony? (Here it is.)

62 U.S. President Barack Obama visited Iraq Monday in a stop that was overshadowed by the question of when U.S. troops should go home. Obama made his opposition to the U.S.-led invasion of Iraq five years ago a centerpiece of his campaign and was in Baghdad to assess security in Iraq, where violence has fallen to its lowest level since early 2004. When has Barack Obama visited Iraq? Which U.S. Presidents have visited Iraq?

63 Find names, locations, events, and dates and associations among them for my great grandma Margaret Haines. I GERTRUDE SMITH (Mrs William E Haines deceased) Married shortly after graduation Died at age of 22 Was musician and taught piano lessons 1898 HOBART L BENEDICT Millburn Essex County N J Graduated from Rutgers 1902 and from New York Law School in 1904 with degrees of B Sc M Sc and LL B Married April 9 1907 to Martha C Bunnell One daughter Elizabeth Benedict Counsellor at law with offices in Elizabeth and Millburn MARTHA BUNNELL (Mrs Hobart L Benedict) Millburn Essex County N J Married to Hobart L Benedict on date above 1899 CORA SMITH (Mrs Louis Slingerland) 557 Third St South St Peters- burg Florida Married Louis Slingerland a former pupil of Connec- Farms High School Mr Slingerland is engaged in building business in St Petersburg JENNIE HAINES Elmwood Ave Union Union Co N J Graduated from State Normal School Trenton N J in 190 5 Principal of Hurden Looker School in Hillside Township formerly a part of Union Town- ship STELLA ILLSLEY (Mrs Harry Engel) Hollis Long Island N Y WALTER BOSCHEN Morris Ave Union N J Completed fourth year at Battin High School in 1900 Attended Rutgers College taking up civil engineering course Has been successful in the business world President of the W G Boschen Sales Co Inc manufacturer general agents for mechanical line GEORGE McQUAIDE Springfield N J Was employed by Morris County Traction Company 1900 No graduates 1901 No graduates 1902 MARGARET HAINES Elmwood Ave Union N J Took up stenography and typewriting and is now employed as private secretary of the Correspondence Department of the Singer Manufacturing Com- pany of Elizabeth N J ABBY HEADLEY (Mrs Leslie Ward) 5 Rose St Newark N J CLARENCE GRIGGS Stuyvesant Ave Union N J Graduated from Trenton State Normal School in 1905 having specialized in manual training Taught one half year at Neshanic N J one year at Lin- coin School Roselle N J Teaching manual 'training and 
mechanical drawing in Newark N J Has taken special courses in Columbia University 34

64 Learn rules to recognize names, even in less-than-ideal OCR’d documents. Seed models: – Prefix: “Mrs”, “Miss”, “Mr” – Initials: “A”, “B”, “C”, … – Given Name: “Charles”, “Francis”, “Herbert” – Surname: “Goodrich”, “Wells”, “White” – Stopword: “Jewell”, “Graves” Updates: – Prefix: first token in line – Given Name: between ‘Prefix’ and ‘Initial’ – Surname: between initial and M RS CHARLES A JEWELL MRS FRANCIS B COOI EN MRS P W ELILSWVORT MRs HERBERT C ADSWVORTH MRS HENRY E TAINTOR MR DANIEl H WELLS MRS ARTHUR L GOODRICH Miss JOSEPHINE WHITE Mss JULIA A GRAVES Ms H B LANGDON Miss MARY H ADAMS Miss ELIZA F Mix 'MRs MIARY C ST )NEC MIRS AI I ERT H PITKIN

65 Who was the first person to land on the Moon?

66 Build a page-layout, pattern-based annotator Automate layout recognition based on examples Auto-generate examples with extraction ontologies Synergistically run pattern-based annotator & extraction-ontology annotator Semi-Automatic Annotation via Synergistic Bootstrapping (Based on Nested Schemas with Regular Expressions)

67 PatML Editor Browser-Rendered Page Page Source Text Information Structure Tree

68

69 Synergistic Execution Extraction Ontology Document Conceptual Annotator (ontology-based annotation) Partially Annotated Document Structural Annotator (layout-driven annotation) Annotated Document Layout Patterns Pattern Generation

70 Knowledge Bundles for Research Studies To do a recent study about associations between lung cancer and tp53 polymorphism, researchers needed to: (1) do a keyword-based search on the SNP data repository for “tp53” within organism “homo sapiens”; (2) from the returned records, open each record page one by one and find those coding SNPs that have a minor allele frequency greater than 1%; (3) for each qualifying SNP, record the SNP ID and many properties of the SNP; (4) perform a keyword search in PubMed and skim the hundreds of manuscripts found to determine which manuscripts are related to the SNPs of interest and fit their search criteria; (5) extract the information of interest (e.g., the statistical information, patient information, and treatment information); and (6) organize it.

71 Knowledge Bundles for Research Studies (1) Search, (2) Filter, (3) Record information

72 Knowledge Bundles for Research Studies (4) High precision literature search

73 Knowledge Bundles for Research Studies (5) Extract by reverse engineering

74 Knowledge Bundles for Research Studies (6) Organize harvested information

75 Knowledge Bundles for Research Studies

76 Research Challenge: “I believe that a good biomedical scenario would be to select a topic which already has a large structured database (gene extraction, vitamins, blood), and then search for and find web pages that augment, support or refute specific aspects of that database.” – GN

77 Won’t just happen without sufficient content Niche applications – Historical Data (e.g. Genealogy) – Bio-research studies Local WoKs – Intra-organizational effort – Individual interests Practicalities: Bootstrapping the WoK (Future Work)

78 Potential Rapid growth – Thousands of ontologies – Millions of simultaneous queries – Billions of annotated pages – Trillions of facts Search-engine-like caching & query processing Practicalities: Scalability (Future Work)

79 Automatic (or near automatic) creation of extraction ontologies Automatic (or near automatic) annotation of web pages Simple but accurate query specification without specialized training Key to Success: Simplicity via Automation www.deg.byu.edu www.tango.byu.edu

