1 Cui Tao PhD Dissertation Defense Ontology Generation, Information Harvesting and Semantic Annotation For Machine-Generated Web Pages.

2 Motivation
Birth date of my great grandpa
Price and mileage of red Nissans, 1990 or newer
Protein and amino-acid information for gene cdk-4
US states with property crime rates above 1%

3 Search by Search Engine

4 Search the Hidden Web
The Hidden Web:
  – Hidden behind forms
  – Hard to query
“cdk-4”

5 Query for Data
The Hidden Web:
  – Hidden behind forms
  – Hard to query
Find the protein and the amino-acid information for gene “cdk-4”

6 A Web of Pages → A Web of Knowledge
Web of Knowledge
  – Machine-“understandable”
  – Publicly accessible
  – Queryable by standard query languages
Semantic annotation
  – Domain ontologies
  – Populated conceptual model
Problems to resolve
  – How do we create ontologies?
  – How do we annotate pages for ontologies?

7 Contributions of Dissertation Work
Web of Pages → Web of Knowledge
  – Knowledge & meta-knowledge extraction
  – Reformulation as machine-“understandable” knowledge
Automatic & semi-automatic solutions via:
  – Sibling tables (TISP/TISP++)
  – User-created forms (FOCIH)

8 Automatic Annotation with TISP (Table Interpretation with Sibling Pages)
Recognize tables (discard non-tables)
Locate table labels
Locate table values
Find label/value associations

9 Recognize Tables
Data Table
Layout Tables (discard)
Nested Data Tables

10 Find Label/Value Associations Example: (Identification.Gene model(s).Protein, Identification.Gene model(s).2) = WP:CE
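A label/value association like the one on this slide can be pictured as a lookup that follows a label path, then selects a row and a column. The sketch below is only an illustration: the nested-dictionary layout, the `lookup` helper, and every sample value other than the truncated `WP:CE` shown on the slide are assumptions.

```python
from typing import Any

# Hypothetical interpreted table: nested labels lead down to rows of values.
# Only "WP:CE" comes from the slide; the other values are placeholders.
interpreted = {
    "Identification": {
        "Gene model(s)": [
            {"Gene Model": "placeholder-model-1", "Protein": "placeholder-protein-1"},
            {"Gene Model": "placeholder-model-2", "Protein": "WP:CE"},
        ]
    }
}

def lookup(table: dict, label_path: list[str], row: int, column: str) -> Any:
    """Follow a label path, then pick a 1-based row and a column label."""
    node: Any = table
    for label in label_path:
        node = node[label]
    return node[row - 1][column]

# (Identification.Gene model(s).Protein, Identification.Gene model(s).2)
value = lookup(interpreted, ["Identification", "Gene model(s)"], 2, "Protein")
print(value)
```

The column path and the row index together identify one cell, which mirrors the two-part key in the slide's example.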

11 Interpretation Technique: Sibling Page Comparison

12 Interpretation Technique: Sibling Page Comparison Same

13 Interpretation Technique: Sibling Page Comparison Almost Same

14 Interpretation Technique: Sibling Page Comparison Different Same

15 Technique Details
Unnest tables
Match tables in sibling pages
  – “Perfect” match (table for layout → discard)
  – “Reasonable” match (sibling table)
Determine & use table-structure pattern
  – Discover pattern
  – Pattern usage
  – Dynamic pattern adjustment
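The idea behind sibling page comparison can be sketched as a cell-by-cell comparison of two tables from sibling pages: cells that stay the same are label candidates, cells that change are value candidates. This is a simplified illustration, not the dissertation's matching procedure, which the transcript does not show. It assumes both tables have the same shape, and the `threshold` parameter and sample tables are hypothetical.

```python
def compare_siblings(table_a, table_b):
    """Mark each aligned cell 'L' (same text, label candidate) or 'V' (differs, value)."""
    return [["L" if a == b else "V" for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(table_a, table_b)]

def match_kind(table_a, table_b, threshold=0.3):
    """Classify a table pair; the threshold value is an assumption."""
    roles = [r for row in compare_siblings(table_a, table_b) for r in row]
    if not roles:
        return "no match"
    shared = roles.count("L") / len(roles)
    if shared == 1.0:
        return "perfect"      # identical on both pages: layout table, discard
    return "reasonable" if shared >= threshold else "no match"

# Hypothetical sibling tables from two pages of the same site.
page1 = [["Gene", "cdk-4"], ["Species", "C. elegans"]]
page2 = [["Gene", "unc-22"], ["Species", "C. briggsae"]]
navigation = [["Home", "Search"]]

print(compare_siblings(page1, page2))
print(match_kind(page1, page2), match_kind(navigation, navigation))
```

One known weakness of this naive version: a value that happens to be identical on both pages would be misread as a label, which is one reason to compare more than two sibling pages.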

16 Table Unnesting

17 Table Structure Patterns
Regularity Expectations:
({L} {V})^n
({L})^n (({V})^n)^+
…
Pattern combinations are also possible.

18 Table Structure Patterns
({L})^n (({V})^n)^+
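The two patterns shown on these slides, label/value pairs repeated n times and a run of n labels followed by one or more runs of n values, can be checked against a table whose cells have already been marked as label (`L`) or value (`V`). The sketch below reads both as row-level patterns, which is an assumption, since the markup around `{L}` and `{V}` did not survive in the transcript. The function names are hypothetical.

```python
import re

def matches_label_value_pairs(rows):
    """({L} {V})^n: every row alternates label, value, label, value, ..."""
    return bool(rows) and all(
        re.fullmatch(r"(?:LV)+", "".join(row)) for row in rows
    )

def matches_header_then_values(rows):
    """({L})^n (({V})^n)^+: one row of n labels, then one or more rows of n values."""
    if len(rows) < 2:
        return False
    header, body = rows[0], rows[1:]
    n = len(header)
    return (
        n > 0
        and all(cell == "L" for cell in header)
        and all(len(row) == n and all(cell == "V" for cell in row) for row in body)
    )

pairs = [["L", "V"], ["L", "V"]]
grid = [["L", "L"], ["V", "V"], ["V", "V"]]
print(matches_label_value_pairs(pairs), matches_header_then_values(grid))
```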

19 Pattern Usage

20 Dynamic Pattern Adjustment

21 TISP++
Automatic ontology generation
Automatic information annotation

22 Ontology Generation – OSM
Object set: table labels
  – Lexical: labels that associate with actual values
  – Non-lexical: labels that associate with other tables
Relationship set: table nesting
Constraints: updates based on observation

23 Ontology Generation – OWL
Object set: OWL class
Relationship set: OWL object property
Lexical object set:
  – OWL datatype property
  – Different annotation properties to keep track of the provenance
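The mapping on this slide (non-lexical object sets to OWL classes, relationship sets to object properties, lexical object sets to datatype properties) can be sketched as a small Turtle generator. This is an illustrative sketch only: the `to_turtle` and `local_name` helpers, the `example.org` namespace, and the `_has_` naming scheme are assumptions, and the provenance annotation properties mentioned on the slide are left out.

```python
import re

def local_name(label):
    """Turn a table label into a safe identifier (naming scheme is an assumption)."""
    return re.sub(r"\W+", "_", label).strip("_")

def to_turtle(object_sets, relationships, base="http://example.org/onto#"):
    """object_sets maps label -> True if lexical; relationships lists (parent, child) nestings."""
    lines = [
        f"@prefix : <{base}> .",
        "@prefix owl: <http://www.w3.org/2002/07/owl#> .",
        "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .",
        "",
    ]
    for label, lexical in object_sets.items():
        kind = "owl:DatatypeProperty" if lexical else "owl:Class"
        lines.append(f':{local_name(label)} a {kind} ; rdfs:label "{label}" .')
    for parent, child in relationships:
        name = f"{local_name(parent)}_has_{local_name(child)}"
        lines.append(
            f":{name} a owl:ObjectProperty ; "
            f"rdfs:domain :{local_name(parent)} ; rdfs:range :{local_name(child)} ."
        )
    return "\n".join(lines)

ttl = to_turtle(
    {"Identification": False, "Gene model(s)": False, "Protein": True},
    [("Identification", "Gene model(s)")],
)
print(ttl)
```

Building the Turtle text by hand keeps the sketch free of third-party dependencies; a real implementation would more likely use an RDF library and escape label strings properly.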

Generated Ontology

26 RDF Graph

27 Query the Data
Find the protein and the amino-acid information for gene “cdk-4”
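Once the extracted data is stored as RDF-style triples, the slide's query becomes a two-step graph traversal. The sketch below substitutes a plain-Python triple match for a standard query language such as SPARQL. The property names `hasProtein` and `hasAminoAcidInfo` and all triple values are hypothetical placeholders, not data from the dissertation.

```python
# Placeholder triples; identifiers and values are invented for illustration only.
TRIPLES = [
    ("gene:cdk-4", "hasProtein", "protein:placeholder-1"),
    ("protein:placeholder-1", "hasAminoAcidInfo", "amino-acid info placeholder"),
    ("gene:other", "hasProtein", "protein:placeholder-2"),
]

def match(triples, s=None, p=None, o=None):
    """Return triples matching the given subject/predicate/object (None = wildcard)."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]

def protein_and_amino_acids(triples, gene):
    """Find each protein of a gene together with its amino-acid information."""
    answers = []
    for _, _, protein in match(triples, s=gene, p="hasProtein"):
        for _, _, info in match(triples, s=protein, p="hasAminoAcidInfo"):
            answers.append((protein, info))
    return answers

print(protein_and_amino_acids(TRIPLES, "gene:cdk-4"))
```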

28 TISP Evaluation
Applications
  – Commercial: car ads
  – Scientific: molecular biology
  – Geopolitical: US states and countries
Data: > 2,000 tables in 35 sites
Evaluation
  – Initial two sibling pages
      Correct separation of data tables from layout tables?
      Correct pattern recognition?
  – Remaining tables in site
      Information properly extracted?
      Able to detect and adjust for pattern variations?

29 Experimental Results
Table recognition: correctly discarded 157 of 158 layout tables
Pattern recognition: correctly found 69 of 72 structure patterns
Extraction and adjustments: 5 path adjustments and 34 label adjustments → all correct

30 TISP++ Performance
Performance depends on TISP
TISP test set
  – Generates all ontologies correctly
  – Annotates all information in tables correctly

31 Form-based Ontology Creation and Information Harvesting (FOCIH)
Personalized ontology creation by form
  – General familiarity
  – Reasonable conceptual framework
  – Appropriate correspondence
      Transformable to ontological descriptions
      Capable of accepting source data
Automated ontology creation
Automated information harvesting
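For readers who prefer rates to raw counts, the two recognition results on this slide can be converted to percentages. The counts come from the slide; the dictionary keys are just descriptive names chosen here.

```python
# Counts taken from the slide: (correct, total).
results = {
    "layout tables discarded": (157, 158),
    "structure patterns found": (69, 72),
}

# Accuracy as a fraction for each task.
rates = {name: correct / total for name, (correct, total) in results.items()}

for name, (correct, total) in results.items():
    print(f"{name}: {correct}/{total} = {rates[name]:.1%}")
```

This works out to roughly 99.4% for table recognition and 95.8% for pattern recognition.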

32 Form Creation

33 Created Sample Form

34 Generated Ontology View

35 Source-to-Form Mapping

36 Source-to-Form Mapping

37 Source-to-Form Mapping

38 Source-to-Form Mapping

39 Almost Ready to Harvest
Need reading path: DOM-tree structure
Need to resolve mapping problems
  – Pattern recognition
  – Instance recognition

40 Reading Path
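A reading path through the DOM tree can be pictured as the sequence of element steps from the document root to the node that holds a selected value. The sketch below is one possible illustration, assuming a well-formed page that `xml.etree.ElementTree` can parse; the `reading_path` helper, its path notation, and the sample page are hypothetical.

```python
import xml.etree.ElementTree as ET

def reading_path(root, target_text):
    """Return the steps from the root to the first element whose text equals target_text."""
    def walk(node, path):
        if (node.text or "").strip() == target_text:
            return path
        counts = {}
        for child in node:
            # Index each child among siblings with the same tag (1-based).
            counts[child.tag] = counts.get(child.tag, 0) + 1
            found = walk(child, path + [f"{child.tag}[{counts[child.tag]}]"])
            if found is not None:
                return found
        return None
    return walk(root, [root.tag])

# Hypothetical, well-formed sample page.
page = "<html><body><table><tr><td>Name</td><td>cdk-4</td></tr></table></body></html>"
root = ET.fromstring(page)
path = reading_path(root, "cdk-4")
print("/".join(path))
```

Because sibling pages share the same generated structure, a path recorded on one page can plausibly be replayed on the others to find the corresponding value.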

41 Pattern & Instance Recognition

42 Pattern & Instance Recognition

43 Pattern & Instance Recognition
regular expression for decimal number
left context
right context
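The combination named on this slide, a regular expression for a decimal number bracketed by a left and a right context, can be sketched with Python's `re` module. The `extract_decimal` helper, the exact decimal pattern, and the sample string are assumptions made for illustration.

```python
import re

def extract_decimal(text, left, right):
    """Return the decimal number found between a left and a right context, or None."""
    pattern = re.escape(left) + r"\s*(\d+\.\d+)\s*" + re.escape(right)
    found = re.search(pattern, text)
    return found.group(1) if found else None

# Hypothetical source string.
sample = "Molecular weight: 38.5 kD"
print(extract_decimal(sample, "Molecular weight:", "kD"))
```

Anchoring the number between two contexts lets a substring be pulled out of a larger cell value without capturing unrelated numbers elsewhere in the text.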

44 Pattern & Instance Recognition list pattern, delimiter is “,”

45 Pattern & Instance Recognition list pattern, delimiter is regular expression for percentage numbers and a comma

46 Pattern & Instance Recognition list pattern, delimiter is regular expression for percentage numbers and a comma
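The list patterns on the preceding slides split one source string into several values, using either a literal delimiter such as a comma or a delimiter described by a regular expression. The sketch below shows both cases. The `split_list` helper and sample strings are hypothetical, and the comma in the second delimiter is made optional so that a trailing percentage on the last item is also removed, which is an assumption about how the final item is handled.

```python
import re

def split_list(text, delimiter_regex):
    """Split text on a delimiter regex and drop empty pieces."""
    return [part.strip() for part in re.split(delimiter_regex, text) if part.strip()]

# Delimiter is a plain comma.
simple = split_list("red, blue, green", r",")

# Delimiter is a percentage number followed by an optional comma.
percent_delimiter = r"\s*\d+(?:\.\d+)?%\s*,?"
with_percentages = split_list("White 85.3%, Black 10.1%, Asian 4.6%", percent_delimiter)

print(simple)
print(with_percentages)
```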

47 Can Now Harvest

48 Can Now Harvest

49 Can Now Harvest

50 Semantic Annotation

51 Semantic Annotation

52 Semantic Annotation

53 Semantic Annotation

54 Semantic Annotation

55 Semantic Query

56 FOCIH Performance
Ontology creation
Semantic annotation
  – Depends on TISP performance
  – Depends on pattern and instance recognition performance

57 FOCIH Performance
Pattern and instance recognition:
  – Works with highly regular data
  – Tested 71 mappings
  – 25 full-string values (25/25 correct)
  – 38 substring values (29/38 correct)
  – 8 list patterns (6/8 correct)
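The per-category counts on this slide can be combined into an overall figure. The counts come from the slide; the overall rate is derived arithmetic, not a number the slides report.

```python
# (correct, tested) per mapping type, as listed on the slide.
mappings = {
    "full-string values": (25, 25),
    "substring values": (29, 38),
    "list patterns": (6, 8),
}

total_correct = sum(correct for correct, _ in mappings.values())
total_tested = sum(tested for _, tested in mappings.values())
overall = total_correct / total_tested

for name, (correct, tested) in mappings.items():
    print(f"{name}: {correct}/{tested} = {correct / tested:.1%}")
print(f"overall: {total_correct}/{total_tested} = {overall:.1%}")
```

The three categories sum to the 71 tested mappings, with 60 correct overall (about 84.5%); substring values and list patterns account for all of the errors.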

58 FOCIH Difficulties

59 FOCIH Difficulties

60 FOCIH Difficulties No selection

61 WoK via TISP

62 WoK via TISP

63 WoK via FOCIH

64 WoK via FOCIH

65 Contributions
TISP: automatic sibling table interpretation
TISP++:
  – Automatic ontology generation based on interpreted tables
  – Automatic semantic annotation for interpreted tables
FOCIH:
  – Semi-automatic personalized ontology creation
  – Automatic personalized information harvesting and semantic annotation
All together: contributes to turning the current Web of Pages into a Web of Knowledge

66 Future Work
Sibling pages in addition to sibling tables
Reverse engineer from ontologies to forms as a basis for information harvesting for already defined ontologies