David W. Embley Brigham Young University Provo, Utah, USA WoK: A Web of Knowledge.

Presentation transcript:

David W. Embley Brigham Young University Provo, Utah, USA WoK: A Web of Knowledge

A Web of Pages → A Web of Facts
– Birthdate of my great grandpa Orson
– Price and mileage of red Nissans, 1990 or newer
– Location and size of chromosome 17
– US states with property crime rates above 1%

Toward a Web of Knowledge
Fundamental questions
– What is knowledge?
– What are facts?
– How does one know?
Philosophy
– Ontology
– Epistemology
– Logic and reasoning

Ontology
Existence → asks “What exists?”
Concepts, relationships, and constraints with formal foundation

Epistemology
The nature of knowledge → asks: “What is knowledge?” and “How is knowledge acquired?”
Populated conceptual model

Logic and Reasoning
Principles of valid inference – asks: “What is known?” and “What can be inferred?”
For us, it answers: what can be inferred (in a formal sense) from conceptualized data.
Find price and mileage of red Nissans, 1990 or newer

Making this Work → How?
Distill knowledge from the wealth of digital web data
Annotate web pages
Need a computational alembic to algorithmically turn raw symbols contained in web pages into knowledge
Fact Annotation … …

Turning Raw Symbols into Knowledge
Symbols: $ 11,500 117K Nissan CD AC
Data: price(11,500) mileage(117K) make(Nissan)
Conceptualized data:
– Car(C123) has Price($11,500)
– Car(C123) has Mileage(117,000)
– Car(C123) has Make(Nissan)
– Car(C123) has Feature(AC)
Knowledge
– “Correct” facts
– Provenance
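The progression from symbols to data to conceptualized data can be sketched in a few lines of Python. This is only an illustration under assumptions: the recognizer patterns, the function names, and the ad text `$ 11,500 117K Nissan CD AC` are hypothetical stand-ins, not the recognizers of the presented system.

```python
import re

# Hypothetical recognizers; the patterns are illustrative assumptions,
# not the ones used in the presented system.
RECOGNIZERS = {
    "Price": re.compile(r"\$\s*(\d{1,3}(?:,\d{3})*)"),
    "Mileage": re.compile(r"\b(\d+)K\b"),
    "Make": re.compile(r"\b(Nissan|Toyota|Honda)\b"),
    "Feature": re.compile(r"\b(AC|CD)\b"),
}


def extract_data(text):
    """Turn raw symbols into (object set, raw value) pairs."""
    data = []
    for name, pattern in RECOGNIZERS.items():
        for match in pattern.finditer(text):
            data.append((name, match.group(1)))
    return data


def conceptualize(car_id, data):
    """Attach each value to one car object and normalize the numbers."""
    facts = []
    for name, value in data:
        if name == "Price":
            value = int(value.replace(",", ""))
        elif name == "Mileage":
            value = int(value) * 1000  # "117K" means 117,000
        facts.append(("Car", car_id, "has", name, value))
    return facts


facts = conceptualize("C123", extract_data("$ 11,500 117K Nissan CD AC"))
for fact in facts:
    print(fact)
```

The last step on the slide, knowledge, would additionally require checking that the facts are correct and recording their provenance, which this sketch does not attempt.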

Actualization (with Extraction Ontologies) Find me the price and mileage of all red Nissans – I want a 1990 or newer.

Data Extraction Demo

Semantic Annotation Demo

Free-Form Query Demo

Explanation: How it Works Extraction Ontologies Semantic Annotation Free-Form Query Interpretation

Extraction Ontologies Object sets Relationship sets Participation constraints Lexical Non-lexical Primary object set Aggregation Generalization/Specialization

Extraction Ontologies
Data Frame:
– Internal Representation: float
– Values
  External Rep.: \s*[$]\s*(\d{1,3})*(\.\d{2})?
  Left Context: $
  Key Word Phrase
  Key Words: ([Pp]rice)|([Cc]ost)| …
– Operators
  Operator: >
  Key Words: (more\s*than)|(more\s*costly)|…
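A data frame of this kind could be modeled roughly as below. The class `DataFrameSketch`, its method names, and the value pattern (which adds thousands separators) are assumptions made for illustration; only the keyword and operator expressions follow the slide.

```python
import re
from dataclasses import dataclass, field


@dataclass
class DataFrameSketch:
    """A simplified, hypothetical stand-in for a data frame."""
    name: str
    value_pattern: str   # external representation of values
    keywords: str        # context keywords that signal the object set
    operators: dict = field(default_factory=dict)  # symbol -> keyword pattern

    def find_values(self, text):
        """Return every substring that looks like a value."""
        return [m.group(0) for m in re.finditer(self.value_pattern, text)]

    def to_internal(self, raw):
        """Convert the external representation to a float."""
        return float(raw.replace("$", "").replace(",", "").strip())

    def has_keyword(self, text):
        return re.search(self.keywords, text) is not None

    def find_operator(self, text):
        """Return the first operator whose keywords appear in the text."""
        for symbol, pattern in self.operators.items():
            if re.search(pattern, text):
                return symbol
        return None


price = DataFrameSketch(
    name="Price",
    value_pattern=r"\$\s*\d{1,3}(?:,\d{3})*(?:\.\d{2})?",  # assumed pattern
    keywords=r"[Pp]rice|[Cc]ost",
    operators={">": r"more\s*than|more\s*costly"},
)

values = [price.to_internal(v) for v in price.find_values("Price: $11,500 or best offer")]
operator = price.find_operator("anything more than $5,000")
print(values, operator)
```

Keeping the recognizers declarative, as plain patterns attached to an object set, is what lets the same data frame be reused for both extraction and query interpretation.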

Generality & Resiliency of Extraction Ontologies
Generality: assumptions about web pages
– Data rich
– Narrow domain
– Document types
  Single-record documents (hard, but doable)
  Multiple-record documents (harder)
  Records with scattered components (even harder)
Resiliency: declarative
– Still works when web pages change
– Works for new, unseen pages in the same domain
– Scalable, but takes work to declare the extraction ontology

Semantic Annotation

Free-Form Query Interpretation
– Parse Free-Form Query (with respect to data extraction ontology)
– Select Ontology
– Formulate Query Expression
– Run Query Over Semantically Annotated Data

Parse Free-Form Query
“Find me the price and mileage of all red Nissans – I want a 1996 or newer”
Recognized: price, mileage, red, Nissan, 1996, or newer (>= Operator)

Select Ontology “Find me the price and mileage of all red Nissans – I want a 1996 or newer”

Formulate Query Expression
Conjunctive queries and aggregate queries
Mentioned object sets are all of interest.
Values and operator keywords determine conditions.
– Color = “red”
– Make = “Nissan”
– Year >= 1996 (>= Operator)
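The step from a free-form request to a conjunctive query can be sketched as follows. The pattern tables, the function names `formulate` and `satisfies`, and the two annotated records are hypothetical; a real extraction ontology would supply the recognizers, and the annotated data would come from the annotation step.

```python
import re

# Hypothetical recognizers for a tiny car ontology (assumptions).
VALUE_PATTERNS = {
    "Color": r"\b(red|blue|green|white|black)\b",
    "Make": r"\b(Nissan|Toyota|Honda)s?\b",
    "Year": r"\b(19\d{2}|20\d{2})\b",
}
OPERATOR_PATTERNS = {">=": r"or\s+newer", "<=": r"or\s+older"}
OBJECT_SETS = {"Price": r"\bprice\b", "Mileage": r"\bmileage\b"}


def formulate(query):
    """Return (requested object sets, [(object set, operator, value)])."""
    requested = [name for name, pat in OBJECT_SETS.items()
                 if re.search(pat, query, re.IGNORECASE)]
    conditions = []
    for name, pat in VALUE_PATTERNS.items():
        match = re.search(pat, query)
        if not match:
            continue
        op, value = "=", match.group(1)
        if name == "Year":
            value = int(value)
            tail = query[match.end():]
            # An operator keyword right after the value changes the comparison.
            for symbol, op_pat in OPERATOR_PATTERNS.items():
                if re.match(r"\s*" + op_pat, tail):
                    op = symbol
        conditions.append((name, op, value))
    return requested, conditions


def satisfies(record, conditions):
    """True when the record meets every condition (conjunctive query)."""
    tests = {"=": lambda a, b: a == b,
             ">=": lambda a, b: a >= b,
             "<=": lambda a, b: a <= b}
    return all(name in record and tests[op](record[name], value)
               for name, op, value in conditions)


query = "Find me the price and mileage of all red Nissans - I want a 1996 or newer"
requested, conditions = formulate(query)

# Hypothetical semantically annotated data.
annotated = [
    {"Make": "Nissan", "Color": "red", "Year": 1998, "Price": 11500, "Mileage": 117000},
    {"Make": "Nissan", "Color": "red", "Year": 1994, "Price": 4000, "Mileage": 150000},
]
answers = [{k: r[k] for k in requested} for r in annotated if satisfies(r, conditions)]
print(conditions)
print(answers)
```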

For Let Where Return Formulate Query Expression

Run Query Over Semantically Annotated Data

Great! But Problems Still Need Resolution
Automating content annotation
– Extraction-ontology creation: a few dozen person hours
– Semi-automatic creation
  FOCIH (Form-based Ontology Creation and Information Harvesting)
  TISP (Table Interpretation by Sibling Pages)
  TANGO (Table ANalysis for Generating Ontologies)
Stepping up to the envisioned Web of Knowledge
– Current & future work
  More challenging annotation projects
  Semi-automatic annotation via synergistic bootstrapping
  Knowledge bundles for research studies
– Practicalities

Manual Creation

-Library of instance recognizers -Library of lexicons

Craig’s List Alerter Constructed as a “short” class project – Nine applications – A few dozen hours Demo

2002 Jeep Liberty $7,995 Toll free
Alert! Alert! I found your Jeep Liberty for under $8,000.
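The alert on this slide amounts to recognizing a price in a listing and comparing it with the user's limit. A minimal sketch, assuming a hypothetical `check_listing` helper and a simple dollar-amount pattern (the class project's actual implementation is not shown in the slides):

```python
import re


def check_listing(listing, wanted, max_price):
    """Return an alert message if the listing is the wanted item under max_price."""
    if wanted.lower() not in listing.lower():
        return None
    match = re.search(r"\$\s*(\d{1,3}(?:,\d{3})*)", listing)
    if not match:
        return None
    price = int(match.group(1).replace(",", ""))
    if price < max_price:
        return f"Alert! I found your {wanted} for under ${max_price:,}."
    return None


message = check_listing("2002 Jeep Liberty $7,995", "Jeep Liberty", 8000)
print(message)
```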

FOCIH: Form-based Ontology Creation and Information Harvesting
Forms (general familiarity)
Information Harvesting
Semi-automatic extraction ontology creation
– Form-based generation of conceptual model
– Instance-recognizer creation
  Lexicons
  Some pre-existing instance recognizers

FOCIH Form Creation

FOCIH Ontology Generation

FOCIH Information Harvesting

FOCIH Information-Harvesting Demo

TISP: Table Interpretation with Sibling Pages

Interpretation Technique: Sibling Page Comparison Same

Interpretation Technique: Sibling Page Comparison Almost Same

Interpretation Technique: Sibling Page Comparison Different Same

Technique Details
Unnest tables
Match tables in sibling pages
– “Perfect” match (table for layout → discard)
– “Reasonable” match (sibling table)
Determine & use table-structure pattern
– Discover pattern
– Pattern usage
– Dynamic pattern adjustment

Table Unnesting

Simple Tree Matching Algorithm Labels Values [Yang91] Match Score Categorization: Exact/Near-Exact, Sibling-Table, False
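For reference, the simple tree matching algorithm can be written as a short dynamic program over the children of two matching roots. The tree encoding as `(label, children)` pairs, the normalization by average tree size, and the two thresholds in `categorize` are assumptions for illustration; the slides do not give the actual score cutoffs.

```python
def simple_tree_matching(a, b):
    """Size of the largest order-preserving match between trees a and b.

    A tree is a (label, children) pair, where children is a list of trees.
    """
    if a[0] != b[0]:
        return 0
    kids_a, kids_b = a[1], b[1]
    m, n = len(kids_a), len(kids_b)
    best = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            best[i][j] = max(
                best[i][j - 1],
                best[i - 1][j],
                best[i - 1][j - 1]
                + simple_tree_matching(kids_a[i - 1], kids_b[j - 1]),
            )
    return best[m][n] + 1  # +1 for the matching roots


def size(tree):
    """Number of nodes in the tree."""
    return 1 + sum(size(child) for child in tree[1])


def categorize(a, b, exact=0.9, sibling=0.5):
    """Classify a table pair; both thresholds are hypothetical."""
    score = simple_tree_matching(a, b) / ((size(a) + size(b)) / 2)
    if score >= exact:
        return "exact/near-exact"
    if score >= sibling:
        return "sibling-table"
    return "false"


row = ("tr", [("td", []), ("td", [])])
two_rows = ("table", [row, row])
one_row = ("table", [row])
print(simple_tree_matching(two_rows, one_row), categorize(two_rows, one_row))
```

In the TISP setting described above, an exact or near-exact match suggests a layout table that can be discarded, while a reasonable match marks a sibling table worth interpreting.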

Table Structure Patterns
Regularity Expectations:
– ({L} {V})^n
– ({L})^n (({V})^n)+
– …
Pattern combinations are also possible.
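One way to picture how sibling comparison leads to these patterns: cells whose text is the same on both sibling pages behave like labels (L), and cells that differ behave like values (V). The sketch below assumes tables already unnested into equal-sized grids of cell text; the function names and the sample data are hypothetical, and a value that happens to coincide on both pages would be mislabeled by this simple rule.

```python
def classify_cells(table_a, table_b):
    """Mark a cell 'L' when both sibling tables hold the same text, else 'V'."""
    return [["L" if a == b else "V" for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(table_a, table_b)]


def detect_pattern(marks):
    """Return the regularity expectation the marks satisfy, if any."""
    rows = ["".join(row) for row in marks]
    if rows and all(row == "LV" for row in rows):
        return "({L} {V})^n"
    if len(rows) > 1:
        n = len(rows[0])
        if rows[0] == "L" * n and all(row == "V" * n for row in rows[1:]):
            return "({L})^n (({V})^n)+"
    return None


# Hypothetical sibling pages: same layout, different records.
page_a = [["Name", "abc-1"], ["Location", "X: 12.3"]]
page_b = [["Name", "xyz-2"], ["Location", "IV: 4.5"]]
print(detect_pattern(classify_cells(page_a, page_b)))

grid_a = [["Gene", "Chromosome"], ["abc-1", "X"]]
grid_b = [["Gene", "Chromosome"], ["xyz-2", "IV"]]
print(detect_pattern(classify_cells(grid_a, grid_b)))
```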

Pattern Usage (Location.Genetic Position) = X: / cM [mapping data] (Location.Genomic Position) = X: bp

Dynamic Pattern Adjustment
({L})^5 (({V})^5)+
({L})^5 (({V})^5)+ | ({L})^6 (({V})^6)+

TISP Demo

TISP/FOCIH Extraction Ontology Creation Reverse engineer with TISP Adjust with FOCIH Data frames – Initialize lexicons with harvested data – Library of data frames—select and specialize

TISP/FOCIH Extraction Ontology Creation

TANGO: Table Analysis for Generating Ontologies Recognize and normalize table information Construct mini-ontologies from tables Discover inter-ontology mappings Merge mini-ontologies into a growing ontology

Recognize Table Information

Country      Population        Religion
             (July 2001 est.)  Albanian  Muslim  Roman     Shi’a   Sunni   other
                               Orthodox          Catholic  Muslim  Muslim
Afghanistan  26,813,057                                    15%     84%     1%
Albania      3,510,484         20%       70%     10%

Construct Mini-Ontology

Country      Population        Religion
             (July 2001 est.)  Albanian  Muslim  Roman     Shi’a   Sunni   other
                               Orthodox          Catholic  Muslim  Muslim
Afghanistan  26,813,057                                    15%     84%     1%
Albania      3,510,484         20%       70%     10%
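Normalizing such a table means flattening the nested Religion header into plain facts that a mini-ontology can hold. The sketch below is an assumption-laden illustration: the column alignment of the percentages is inferred from the flattened slide text, and `normalize` is a hypothetical helper, not TANGO's actual procedure.

```python
# Column order assumed from the slide's header row.
RELIGIONS = ["Albanian Orthodox", "Muslim", "Roman Catholic",
             "Shi'a Muslim", "Sunni Muslim", "other"]

# Empty strings stand for blank cells (alignment is an assumption).
ROWS = [
    ("Afghanistan", "26,813,057", ["", "", "", "15%", "84%", "1%"]),
    ("Albania", "3,510,484", ["20%", "70%", "10%", "", "", ""]),
]


def normalize(rows, religions):
    """Flatten the table into (country, population) and
    (country, religion, percent) facts."""
    populations, percentages = [], []
    for country, population, cells in rows:
        populations.append((country, int(population.replace(",", ""))))
        for religion, cell in zip(religions, cells):
            if cell:
                percentages.append((country, religion, int(cell.rstrip("%"))))
    return populations, percentages


populations, percentages = normalize(ROWS, RELIGIONS)
for fact in percentages:
    print(fact)
```

The two fact lists correspond to two relationship sets rooted at Country, which is roughly the shape a mini-ontology for this table would take.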

Discover Mappings

Merge

TANGO Demo

Some More Challenging Annotation Applications
Multimedia Annotation
– Art Images
– Music and Lyrics
– Closed-captioning Video
Historical Document Images
– Names, Dates, Places, Events
– Learned rules for OCR’d Named Entity Recognition
Open Question Answering

Find me an image that is red, dark, scary, and beautiful.

Find something soothing but energetic: Good for recovering hospital patients. How about Mozart’s 40th Symphony? (Here it is.)

U.S. President Barack Obama visited Iraq Monday in a stop that was overshadowed by the question of when U.S. troops should go home. Obama made his opposition to the U.S.-led invasion of Iraq five years ago a centerpiece of his campaign and was in Baghdad to assess security in Iraq, where violence has fallen to its lowest level since early
When has Barack Obama visited Iraq?
Which U.S. Presidents have visited Iraq?

Find names, locations, events, and dates and associations among them for my great grandma Margaret Haines. I GERTRUDE SMITH (Mrs William E Haines deceased) Married shortly after graduation Died at age of 22 Was musician and taught piano lessons 1898 HOBART L BENEDICT Millburn Essex County N J Graduated from Rutgers 1902 and from New York Law School in 1904 with degrees of B Sc M Sc and LL B Married April to Martha C Bunnell One daughter Elizabeth Benedict Counsellor at law with offices in Elizabeth and Millburn MARTHA BUNNELL (Mrs Hobart L Benedict) Millburn Essex County N J Married to Hobart L Benedict on date above 1899 CORA SMITH (Mrs Louis Slingerland) 557 Third St South St Peters- burg Florida Married Louis Slingerland a former pupil of Connec- Farms High School Mr Slingerland is engaged in building business in St Petersburg JENNIE HAINES Elmwood Ave Union Union Co N J Graduated from State Normal School Trenton N J in Principal of Hurden Looker School in Hillside Township formerly a part of Union Town- ship STELLA ILLSLEY (Mrs Harry Engel) Hollis Long Island N Y WALTER BOSCHEN Morris Ave Union N J Completed fourth year at Battin High School in 1900 Attended Rutgers College taking up civil engineering course Has been successful in the business world President of the W G Boschen Sales Co Inc manufacturer general agents for mechanical line GEORGE McQUAIDE Springfield N J Was employed by Morris County Traction Company 1900 No graduates 1901 No graduates 1902 MARGARET HAINES Elmwood Ave Union N J Took up stenography and typewriting and is now employed as private secretary of the Correspondence Department of the Singer Manufacturing Com- pany of Elizabeth N J ABBY HEADLEY (Mrs Leslie Ward) 5 Rose St Newark N J CLARENCE GRIGGS Stuyvesant Ave Union N J Graduated from Trenton State Normal School in 1905 having specialized in manual training Taught one half year at Neshanic N J one year at Lin- coin School Roselle N J Teaching manual 'training and mechanical drawing in 
Newark N J Has taken special courses in Columbia University 34

Learn rules to recognize names, even in less-than-ideal OCR’d documents.
Seed models:
– Prefix: “Mrs”, “Miss”, “Mr”
– Initials: “A”, “B”, “C”, …
– Given Name: “Charles”, “Francis”, “Herbert”
– Surname: “Goodrich”, “Wells”, “White”
– Stopword: “Jewell”, “Graves”
Updates:
– Prefix: first token in line
– Given Name: between ‘Prefix’ and ‘Initial’
– Surname: between initial and
OCR’d sample:
M RS CHARLES A JEWELL
MRS FRANCIS B COOI EN
MRS P W ELILSWVORT
MRs HERBERT C ADSWVORTH
MRS HENRY E TAINTOR
MR DANIEl H WELLS
MRS ARTHUR L GOODRICH
Miss JOSEPHINE WHITE
Mss JULIA A GRAVES
Ms H B LANGDON
Miss MARY H ADAMS
Miss ELIZA F Mix
'MRs MIARY C ST )NEC
MIRS AI I ERT H PITKIN
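The idea of growing seed lexicons from positional cues can be sketched as below. This is a simplified illustration, not the learned-rule system from the slides: the function `learn_from_line` is hypothetical, case is normalized to sidestep some OCR noise, stopwords are not modeled, and treating everything after the last initial as the surname is an assumption, since the surname rule on the slide is cut off.

```python
def learn_from_line(line, lexicons):
    """Grow the given-name and surname lexicons from one OCR'd line."""
    tokens = line.upper().split()
    if not tokens or tokens[0] not in lexicons["prefix"]:
        return  # no trusted prefix at the start of the line
    initials = [i for i, tok in enumerate(tokens[1:], start=1) if len(tok) == 1]
    if not initials:
        return  # no initial to anchor the positional rules
    first, last = initials[0], initials[-1]
    # Tokens between the prefix and the first initial are given names.
    lexicons["given"].update(tokens[1:first])
    # Assumption: tokens after the last initial form the surname.
    lexicons["surname"].update(tokens[last + 1:])


lexicons = {
    "prefix": {"MRS", "MISS", "MR"},
    "given": {"CHARLES", "FRANCIS", "HERBERT"},
    "surname": {"GOODRICH", "WELLS", "WHITE"},
}

for line in ["MRS HENRY E TAINTOR", "MR DANIEl H WELLS", "MRS ARTHUR L GOODRICH"]:
    learn_from_line(line, lexicons)

print(sorted(lexicons["given"]))
print(sorted(lexicons["surname"]))
```

Lines whose prefix is itself garbled, such as the ones starting with "M RS" or "MIRS" in the sample, are skipped by this sketch; handling them would need fuzzy matching against the prefix lexicon.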

Who was the first person to land on the Moon?

Semi-Automatic Annotation via Synergistic Bootstrapping (Based on Nested Schemas with Regular Expressions)
– Build a page-layout, pattern-based annotator
– Automate layout recognition based on examples
– Auto-generate examples with extraction ontologies
– Synergistically run pattern-based annotator & extraction-ontology annotator

PatML Editor Browser-Rendered Page Page Source Text Information Structure Tree

Synergistic Execution Extraction Ontology Document Conceptual Annotator (ontology-based annotation) Partially Annotated Document Structural Annotator (layout-driven annotation) Annotated Document Layout Patterns Pattern Generation

Knowledge Bundles for Research Studies
To conduct a recent study of associations between lung cancer and tp53 polymorphism, researchers needed to:
(1) do a keyword-based search on the SNP data repository for “tp53” within organism “homo sapiens”;
(2) from the returned records, open each record page one by one and find those coding SNPs that have a minor allele frequency greater than 1%;
(3) for each qualifying SNP, record the SNP ID and many properties of the SNP;
(4) perform a keyword search in PubMed and skim the hundreds of manuscripts found to determine which manuscripts are related to the SNPs of interest and fit their search criteria;
(5) extract the information of interest (e.g., the statistical information, patient information, and treatment information); and
(6) organize it.
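Steps (2) and (3) are the kind of manual filtering that becomes a one-line query once the record pages are annotated. A minimal sketch, assuming hypothetical field names and made-up placeholder records (no real SNP identifiers or frequencies are used):

```python
# Placeholder records standing in for annotated SNP record pages.
records = [
    {"snp_id": "SNP-A", "coding": True, "minor_allele_frequency": 0.12},
    {"snp_id": "SNP-B", "coding": True, "minor_allele_frequency": 0.004},
    {"snp_id": "SNP-C", "coding": False, "minor_allele_frequency": 0.30},
]


def qualifying_snps(records, threshold=0.01):
    """Keep coding SNPs whose minor allele frequency exceeds the threshold."""
    return [r for r in records
            if r["coding"] and r["minor_allele_frequency"] > threshold]


selected = qualifying_snps(records)
print([r["snp_id"] for r in selected])
```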

Knowledge Bundles for Research Studies (1) Search, (2) Filter, (3) Record information

Knowledge Bundles for Research Studies (4) High precision literature search

Knowledge Bundles for Research Studies (5) Extract by reverse engineering

Knowledge Bundles for Research Studies (6) Organize harvested information

Knowledge Bundles for Research Studies

Research Challenge: “I believe that a good biomedical scenario would be to select a topic which already [has a] large structured database (gene extraction, vitamins, blood), and then search for and find web pages that augment, support or refute specific aspects of that database.” – GN

Practicalities: Bootstrapping the WoK (Future Work)
Won’t just happen without sufficient content
Niche applications
– Historical Data (e.g. Genealogy)
– Bio-research studies
Local WoKs
– Intra-organizational effort
– Individual interests

Practicalities: Scalability (Future Work)
Potential rapid growth
– Thousands of ontologies
– Millions of simultaneous queries
– Billions of annotated pages
– Trillions of facts
Search-engine-like caching & query processing

Key to Success: Simplicity via Automation
– Automatic (or near automatic) creation of extraction ontologies
– Automatic (or near automatic) annotation of web pages
– Simple but accurate query specification without specialized training