Published by Lydia Sullivan. Modified over 9 years ago.
1
Semiautomatic Generation of Data-Extraction Ontologies
Master’s Thesis Proposal
Yihong Ding
2
Querying the Web (Two Approaches)
Enhanced query language
–Examples: WebSQL, WebOQL
–Sources: structured, or restructured before parsing
Wrapper
–Enables querying in a database-like fashion
–Depends on source format, so it is not resilient
–The same topic in different formats needs different wrappers
3
Data-Extraction Ontology
Beyond the wrapper approach
–Extraction technique for data-rich, unstructured, multiple-record Web documents
–Does not depend on source format, so it is resilient
–The same topic in different formats uses the same ontology
Good experimental results
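The format-independence claim above can be illustrated with a minimal sketch. It assumes that each lexical object set carries a regular expression recognizer and that recognizers are applied to the document text regardless of its layout. The `RECOGNIZERS` table, the patterns, the `extract` helper, and both sample ads are hypothetical and are not taken from the proposal.

```python
import re

# Hypothetical recognizers for two object sets of a car-ad ontology.
# Real recognizers would be far richer; these are for illustration only.
RECOGNIZERS = {
    "Make": re.compile(r"\b(Ford|Honda|Toyota)\b", re.IGNORECASE),
    "Mileage": re.compile(r"\b(\d{1,3}(?:,\d{3})+|\d+)\s*(?:miles|mi)\b",
                          re.IGNORECASE),
}

def extract(text):
    """Apply every recognizer to the text and collect the captured values."""
    return {
        name: [m.group(1) for m in pattern.finditer(text)]
        for name, pattern in RECOGNIZERS.items()
    }

# The same ad in two invented source formats.
plain_ad = "1998 Honda Civic, 85,000 miles, call 555-1234"
html_ad = "<tr><td>Honda</td><td>Civic</td><td>85,000 mi</td></tr>"

print(extract(plain_ad))
print(extract(html_ad))
```

Both calls yield the same values because the recognizers look at content patterns rather than at the surrounding markup, which is the property a format-specific wrapper lacks.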
4
Main Difficulty (Creating the Data-Extraction Ontology)
Users must be experts
–Database theory
–Regular expression generation
Manual creation is impractical
–Very large information sources
–Frequently added sources of interest
–Many varying text formats
5
Semiautomatic Data-Extraction Generation
[Diagram labels: Generation & Updating Process; Input Knowledge Sources; Training Document(s); Validation Documents; Generated Data-Extraction Ontology]
6
Generation Process
For this research, three steps are expected:
–Gathering Knowledge
–Generating Initial Ontology
–Validation & Updating Strategy
Ontology Generation
Performance Evaluation
7
Example: Extract Information from Country Library Web Site (http://www.tradeport.org/ts/countries/)
[Diagram labels: Car Advertisement XML Base; CIA Factbook XML Base]
8
Learning & Discovering Algorithm
[Diagram: tree rooted at All, with nodes Car (Mileage, Make, …, Model) and Country (Capital, …, Area, CountryName, Population, Agriculture); sources: CIA Factbook XML Base, Car Advertisement XML Base]
9
Learning & Discovering Algorithm
[Diagram: two copies of a tree rooted at All, with nodes Car (Mileage, Make, …, Model) and Country (Capital, …, Area, CountryName, Population, Agriculture)]
10
Performance Evaluation
Measure precision and recall for each lexical object set in the generated extraction ontology
–Recall: measure what was generated with respect to what could have been generated
–Precision: measure what was generated with respect to what should not have been generated
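A minimal sketch of this evaluation, using the standard definitions of precision and recall and assuming that each lexical object set can be compared as a set of values. The `precision_recall` helper and the sample values are hypothetical, not results from the proposal.

```python
def precision_recall(generated, expected):
    """Return (precision, recall) for one lexical object set.

    generated: values the generated ontology produced
    expected:  values that could (and should) have been produced
    """
    generated, expected = set(generated), set(expected)
    correct = generated & expected
    # Precision falls as more values are generated that should not have been.
    precision = len(correct) / len(generated) if generated else 0.0
    # Recall falls as more values that could have been generated are missed.
    recall = len(correct) / len(expected) if expected else 0.0
    return precision, recall

# Invented values for a "Make" object set.
generated_makes = {"Ford", "Honda", "Toyota", "Sale"}
expected_makes = {"Ford", "Honda", "Toyota", "Nissan", "Dodge"}

p, r = precision_recall(generated_makes, expected_makes)
print(f"precision={p:.2f} recall={r:.2f}")
```

Here three of the four generated values are correct and three of the five expected values are found, so the two measures differ. Reporting them per object set shows which parts of the generated ontology need updating.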
11
Delimitation
Will not …
Consider all storage formats for existing knowledge
–XML
Consider all document formats
–HTML
–Plain Text
Let users update the input knowledge source at run-time
12
Contribution
Semi-automatically generate a data-extraction ontology
Exploit the existing knowledge
Link existing data-extraction tools
Create a partial library of regular expression recognizers