Semiautomatic Generation of Data-Extraction Ontologies
Master’s Thesis Proposal
Yihong Ding

Querying the Web (Two Approaches)
Enhanced query language
–Examples: WebSQL, WebOQL
–Sources: structured, or restructured before parsing
Wrapper
–Enables querying in a database-like fashion
–Depends on source format, so it is not resilient
–The same topic in different formats needs different wrappers

Data-Extraction Ontology
Beyond the wrapper approach
–Extraction technique for data-rich, unstructured, multiple-record Web documents
–Does not depend on source format, so it is resilient
–The same topic in different formats uses the same ontology
Good experimental results

Main Difficulty (Creating the Data-Extraction Ontology)
Users must be experts
–database theory
–regular expression generation
Manual creation is impractical
–Very large information sources
–Frequently added sources of interest
–Many varying text formats

Semiautomatic Data-Extraction Generation
[Diagram labels: Input Knowledge Sources; Training Document(s); Validation Documents; Generation & Updating Process; Generated Data-Extraction Ontology]

Generation Process
For this research, three steps are expected:
–Gathering Knowledge
–Generating Initial Ontology
–Validation & Updating Strategy
Ontology Generation Performance Evaluation

Example: Extract Information from Country Library Web Site
Car Advertisement XML Base
CIA Factbook XML Base

Learning & Discovering Algorithm
[Diagram labels: All; Car (Mileage, Make, …, Model); Country (Capital, …, Area, CountryName, Population, Agriculture); sources: Car Advertisement XML Base, CIA Factbook XML Base]
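
One way to picture how an XML knowledge base could feed the learning step is to walk the XML tree and group leaf values under their parent concept. The sketch below is only an illustration under assumptions: the sample XML, its values, and the function name collect_object_sets are hypothetical, and the proposal does not specify the actual learning and discovering algorithm.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML knowledge base. Element names mirror the slide labels
# (All, Car, Country, ...); the values are made up for illustration.
SAMPLE_XML = """
<All>
  <Car>
    <Make>Ford</Make>
    <Model>Taurus</Model>
    <Mileage>85,000</Mileage>
  </Car>
  <Car>
    <Make>Honda</Make>
    <Model>Civic</Model>
    <Mileage>42,000</Mileage>
  </Car>
  <Country>
    <CountryName>Iceland</CountryName>
    <Capital>Reykjavik</Capital>
  </Country>
</All>
"""


def collect_object_sets(xml_text):
    """Map each parent element name to {leaf element name: [sample values]}."""
    root = ET.fromstring(xml_text.strip())
    object_sets = {}
    for parent in root.iter():
        for child in parent:
            text = (child.text or "").strip()
            # A leaf element with text is treated as a candidate lexical object set.
            if len(child) == 0 and text:
                leaves = object_sets.setdefault(parent.tag, {})
                leaves.setdefault(child.tag, []).append(text)
    return object_sets


object_sets = collect_object_sets(SAMPLE_XML)
for concept, leaves in object_sets.items():
    print(concept, leaves)
```

The collected sample values could then serve as raw material for lexicons or recognizers for each object set; how that is done is left open here.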

Performance Evaluation
Measure precision and recall for each lexical object set in the generated extraction ontology
Measure what was generated with respect to what could have been generated
Measure what was generated with respect to what should not have been generated
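
The two measures can be sketched with the standard set-based definitions of precision and recall. This is a minimal sketch, assuming that the generated and expected items for each lexical object set can be compared as plain sets; the names precision_recall and evaluate and the sample data are hypothetical, and the proposal does not state which items are compared.

```python
def precision_recall(generated, expected):
    """Return (precision, recall) for one lexical object set."""
    generated, expected = set(generated), set(expected)
    correct = generated & expected
    # Precision: share of generated items that should have been generated.
    precision = len(correct) / len(generated) if generated else 0.0
    # Recall: share of items that could have been generated and were.
    recall = len(correct) / len(expected) if expected else 0.0
    return precision, recall


def evaluate(generated_by_set, expected_by_set):
    """Score every lexical object set listed in the expected results."""
    return {
        name: precision_recall(generated_by_set.get(name, ()), expected)
        for name, expected in expected_by_set.items()
    }


# Hypothetical data: "Civic" is a wrongly generated Make entry,
# "Toyota" and "Nissan" were missed, and nothing was generated for Model.
generated_by_set = {"Make": {"Ford", "Honda", "Civic"}}
expected_by_set = {
    "Make": {"Ford", "Honda", "Toyota", "Nissan"},
    "Model": {"Taurus"},
}
scores = evaluate(generated_by_set, expected_by_set)
for name, (p, r) in scores.items():
    print(f"{name}: precision={p:.2f} recall={r:.2f}")
```

Returning 0.0 when a set is empty is a design choice made for this sketch; an evaluation could equally report such cases as undefined.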

Delimitation
Will not …
Consider all storage formats for existing knowledge
–XML
Consider all document formats
–HTML
–Plain Text
Let users update the input knowledge source at run-time

Contribution
Semi-automatically generate a data-extraction ontology
Exploit the existing knowledge
Link existing data-extraction tools
Create a partial library of regular expression recognizers
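
To make the idea of a library of regular expression recognizers concrete, the sketch below maps object-set names to compiled patterns and applies them to a piece of text. Everything in it is an assumption for illustration: the RECOGNIZERS table, the patterns, the extract function, and the sample advertisement are hypothetical and not taken from the proposal.

```python
import re

# Hypothetical recognizer library: object-set name -> compiled pattern.
RECOGNIZERS = {
    # Digits with optional thousands separators, followed by "miles".
    "Mileage": re.compile(r"\b\d{1,3}(?:,\d{3})*\s*miles\b", re.IGNORECASE),
    # A tiny illustrative lexicon of car makes.
    "Make": re.compile(r"\b(?:Ford|Honda|Toyota)\b"),
}


def extract(text):
    """Return every match of every recognizer, keyed by object-set name."""
    return {
        name: [match.group(0) for match in pattern.finditer(text)]
        for name, pattern in RECOGNIZERS.items()
    }


# Made-up sample advertisement text.
found = extract("1998 Ford Taurus, 85,000 miles, runs great")
print(found)
```

Keeping the recognizers in a plain dictionary means a generated ontology could add or replace entries without changing the extraction code.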