Semiautomatic Generation of Data-Extraction Ontologies Master’s Thesis Proposal Yihong Ding.

Querying the Web (Two Approaches)
Enhanced query language
–Examples: WebSQL, WebOQL
–Sources: structured, or restructured before parsing
Wrapper
–Enables querying in a database-like fashion
–Depends on source format, so it is not resilient: the same topic in different formats needs different wrappers

Data-Extraction Ontology
Beyond the wrapper approach
–Extraction technique for data-rich, unstructured, multiple-record Web documents
–Does not depend on source format, so it is resilient: the same topic in different formats uses the same ontology
–Good experimental results
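The format independence claimed on this slide can be illustrated with a minimal sketch. It assumes each lexical object set carries a regular-expression recognizer that is applied to the raw text, whatever markup surrounds it. The `RECOGNIZERS` table, the patterns, the `extract` helper, and both sample ads are hypothetical and are not taken from the proposal.

```python
import re

# Hypothetical recognizers for two lexical object sets of a car-ad ontology.
# The patterns are illustrative assumptions, not the proposal's recognizers.
RECOGNIZERS = {
    "Make": re.compile(r"\b(Ford|Honda|Toyota)\b"),
    "Mileage": re.compile(r"\b(\d{1,3}(?:,\d{3})*)\s*(?:miles|mi)\b"),
}

def extract(text):
    """Return the first value recognized for each object set, or None."""
    record = {}
    for name, pattern in RECOGNIZERS.items():
        match = pattern.search(text)
        record[name] = match.group(1) if match else None
    return record

# The same topic in two different source formats (both made up).
PLAIN_AD = "1998 Honda Accord, 72,000 miles, $5,500"
TABLE_AD = "<tr><td>Toyota</td><td>Camry</td><td>105,300 mi</td></tr>"

plain_record = extract(PLAIN_AD)
table_record = extract(TABLE_AD)
```

Because the recognizers look only at the values and not at the surrounding layout, the same table handles both ads; a wrapper would typically need one rule set per layout.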

Main Difficulty (Creating the Data-Extraction Ontology)
Users must be experts
–database theory
–regular expression generation
Manual creation is impractical
–Very large information sources
–Frequently added sources of interest
–Many varying text formats

Semiautomatic Data-Extraction Generation
[Diagram labels: Input Knowledge Sources; Training Document(s); Validation Documents; Generation & Updating Process; Generated Data-Extraction Ontology]

Generation Process
For this research, three steps are expected:
–Gathering Knowledge
–Generating Initial Ontology
–Validation & Updating Strategy
Ontology Generation Performance Evaluation

Example: Extract Information from Country Library Web Site (http://www.tradeport.org/ts/countries/)
Car Advertisement XML Base
CIA Factbook XML Base

Learning & Discovering Algorithm
[Diagram labels: All; Car (Mileage, Make, …, Model); Country (Capital, …, Area, CountryName, Population, Agriculture); sources: Car Advertisement XML Base, CIA Factbook XML Base]
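The slides do not spell out the learning and discovering algorithm, so the following is only a rough sketch of one way knowledge could be gathered from an XML base and matched against a training document. The sample `CAR_XML`, the helpers `gather_knowledge` and `match_object_sets`, and the substring-matching rule are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# A made-up fragment standing in for a car advertisement XML base.
CAR_XML = """
<CarAds>
  <Car><Make>Honda</Make><Model>Accord</Model><Mileage>72,000</Mileage></Car>
  <Car><Make>Toyota</Make><Model>Camry</Model><Mileage>105,300</Mileage></Car>
</CarAds>
"""

def gather_knowledge(xml_text):
    """Map each leaf element name to the set of values seen for it."""
    root = ET.fromstring(xml_text)
    values = defaultdict(set)
    for element in root.iter():
        text = (element.text or "").strip()
        if len(element) == 0 and text:  # leaf element with a value
            values[element.tag].add(text)
    return dict(values)

def match_object_sets(knowledge, document_text):
    """Return the element names whose known values occur in the document."""
    return sorted(
        tag
        for tag, known_values in knowledge.items()
        if any(value in document_text for value in known_values)
    )

knowledge = gather_knowledge(CAR_XML)
candidates = match_object_sets(knowledge, "For sale: Toyota Camry, low miles")
```

Here `candidates` lists the object sets that the training document appears to mention; a real generator would also need to infer relationships and recognizers, which this sketch leaves out.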

Performance Evaluation
Measure precision and recall for each lexical object set in the generated extraction ontology
–Recall: what was generated with respect to what could have been generated
–Precision: what was generated with respect to what should not have been generated
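As a worked illustration of the evaluation, the sketch below computes precision and recall for one lexical object set using the standard set-based definitions. The function name `precision_recall` and the sample values are hypothetical; the proposal does not say how generated and expected items are represented.

```python
def precision_recall(generated, expected):
    """Standard set-based precision and recall for one lexical object set.

    generated: items the generator produced
    expected:  items it could (and should) have produced
    """
    generated, expected = set(generated), set(expected)
    correct = generated & expected
    # Precision is lowered by items that should not have been generated.
    precision = len(correct) / len(generated) if generated else 0.0
    # Recall is lowered by items that could have been generated but were not.
    recall = len(correct) / len(expected) if expected else 0.0
    return precision, recall

# Made-up example: values generated for a "Make" object set.
generated_makes = {"Ford", "Honda", "Toyota", "Accord"}
expected_makes = {"Ford", "Honda", "Toyota", "Nissan", "Mazda"}
precision, recall = precision_recall(generated_makes, expected_makes)
```

In this made-up case three of the four generated values are correct and three of the five expected values were found, so precision is 0.75 and recall is 0.6.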

Delimitation
Will not …
Consider all storage formats for existing knowledge
–XML
Consider all document formats
–HTML
–Plain Text
Let users update the input knowledge source at run-time

Contribution
Semi-automatically generate a data-extraction ontology
Exploit the existing knowledge
Link existing data-extraction tools
Create a partial library of regular expression recognizers
