1
Information Extraction and Ontology Learning Guided by Web Directory Authors: Martin Kavalec, Vojtěch Svátek Presenter: Mark Vickers
2
Outline Introduction –Mining Indicator Terms –Integrating Rainbow –Ontological Analysis of Web Directories –IE and Ontology Learning Future Work Related Work Assessment
3
Introduction Goal: “…to extract information about (mostly generic) products, services and areas of competence of companies, from the free text chunks embedded in web presentations.” Taking advantage of: –Collections of extraction patterns –Ontologies of problem domains Approach: Combine Information Extraction With Ontologies –Ontologies can improve quality of IE –Extracted information can improve/extend ontologies –Bootstrapping
4
Introduction Uses Open Directory (http://dmoz.org) –Obtain labeled training data –Lightweight ontologies “The Open Directory Project is the largest, most comprehensive human-edited directory of the Web.”
6
Mining Indicator Terms Informative terms = generic names of products Indicator terms = situated near informative terms –Example: ‘our assortment includes…’ ‘in our shop you can buy…’ Assumption: Directory headings coincide with informatives Purpose: Generate extraction patterns based on Indicator terms They use deeper linguistic techniques
7
Mining Indicator Terms Example:…/Manufacturing/Materials/Metals/Steel/… Informative terms Match headings with text pages to find sentences containing informative terms Grab nearby words as indicator terms Generate extraction patterns from indicator terms
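The matching step on this slide can be sketched roughly as follows. This is only an illustration, not the authors' implementation: the function names `heading_terms` and `sentences_with_terms` are hypothetical, and splitting headings on underscores and "and" is an assumption about how directory headings would be tokenized.

```python
import re


def heading_terms(path):
    """Split a directory path into lowercase heading terms (assumed tokenization)."""
    terms = []
    for segment in path.strip("/").split("/"):
        # A heading such as "Masonry_and_Stone" is assumed to bundle two terms.
        for part in segment.replace("_", " ").split(" and "):
            part = part.strip().lower()
            if part:
                terms.append(part)
    return terms


def sentences_with_terms(sentences, terms):
    """Return (sentence, term) pairs where a heading term occurs as a whole word."""
    hits = []
    for sentence in sentences:
        for term in terms:
            if re.search(r"\b" + re.escape(term) + r"\b", sentence, re.IGNORECASE):
                hits.append((sentence, term))
    return hits
```

For example, `heading_terms("Manufacturing/Materials/Metals/Steel")` yields the four heading terms in lowercase, and sentences containing any of them become candidates for the indicator search.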
8
Mining Indicator Terms Choosing Indicator Terms –Syntactical analysis: Link Grammar Parser –Chose verbs occurring closest in parse tree to informative word –Arrange verbs into a frequency table –Order by ratio of frequency near informative term to frequency in general –Chose 8 most promising verbs
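The frequency-ratio ranking can be sketched as below. It assumes the verbs have already been pulled from the parser output (the Link Grammar step is not reproduced here), and the exact scoring formula is an assumption based only on the slide's wording; `rank_indicator_verbs` is a hypothetical name.

```python
from collections import Counter


def rank_indicator_verbs(near_verbs, all_verbs, top_n=8):
    """Rank verbs by (count near informative terms) / (count overall).

    near_verbs: verbs found closest to an informative term (assumed input).
    all_verbs:  verbs from the whole sample, including those in near_verbs.
    """
    near = Counter(near_verbs)
    overall = Counter(all_verbs)
    scored = []
    for verb, near_count in near.items():
        total = overall.get(verb, 0)
        if total == 0:
            continue  # no general frequency available, so no ratio
        scored.append((near_count / total, verb))
    # Highest ratio first; ties broken alphabetically for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [verb for _, verb in scored[:top_n]]
```

The default `top_n=8` mirrors the eight verbs mentioned on the slide; a real run would probably also need a minimum-count threshold so that rare verbs do not dominate the ranking.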
9
Mining Indicator Terms Preliminary Testing –Sampled 14,500 sentences containing heading terms –Randomly chose 130 sentences with indicators –Manually labeled to estimate if informative term was present or not Example: “We are equipped to run any grade of corrugated from E-flute to Triplewall, including all government grades.”
10
Mining Indicator Terms Preliminary Test Results Coverage: Non-Filtered 10 – 20 %, Pre-Filtered 70 – 80 %
11
Integration into Rainbow RAINBOW (Reusable Architecture for INtelligent Brokering Of Web information access) –Web Analysis Tasks: Sentence Extraction Explicit Metadata HTML Structure* Inline Image * Link Topology Structure* Page Similarity –Internal Communication: based on SOAP –Will use ontologies for verifying semantic consistency of web services provided within the distributed system
12
Integration into Rainbow Rainbow will help solve “coverage” problem of directory links pointing to ‘barren’ pages –Using Analysis of: Keywords and HTML Structure on start-up pages URLs of embedded links –Metadata Extractor will be navigated towards promising pages. –Looking for ‘about-us’ or ‘profile’ to find more syntactically correct text, for example.
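One way the URL-based part of this navigation might look is sketched below. The hint list and the function names are assumptions made for illustration; the slide only names 'about-us' and 'profile' as examples, and Rainbow's actual analysis also draws on keywords and HTML structure.

```python
from urllib.parse import urlparse

# Hypothetical hint list; only 'about-us' and 'profile' come from the slide.
PROMISING_HINTS = ("about-us", "about_us", "aboutus", "profile")


def is_promising(url):
    """True if the URL path contains one of the hint substrings."""
    path = urlparse(url).path.lower()
    return any(hint in path for hint in PROMISING_HINTS)


def promising_links(urls):
    """Keep only the embedded links that look worth following."""
    return [url for url in urls if is_promising(url)]
```

Matching on the path alone keeps the check cheap; link anchor text would be a natural second signal.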
13
Ontological Analysis of Web Directories Terms and Phrases in single heading belong to a small set of classes Parent-child relations belong to particular classes corresponding to ‘deep’ ontological relations.
- Industries
  - Construction_and_Maintenance
    - Materials_and_supplies
      - Masonry_and_Stone
        - Natural_Stone
          - International_Sources
            - Mexico
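To make the parent-child relations concrete, the heading chain can be turned into adjacent pairs, each of which would then be assigned a relation class. This is a minimal sketch; `parent_child_pairs` is a hypothetical helper, and the slash-separated path form is an assumption about how the hierarchy is stored.

```python
def parent_child_pairs(path):
    """Return (parent, child) pairs for consecutive headings in a directory path."""
    headings = [heading for heading in path.split("/") if heading]
    return list(zip(headings, headings[1:]))


pairs = parent_child_pairs(
    "Industries/Construction_and_Maintenance/Materials_and_supplies/"
    "Masonry_and_Stone/Natural_Stone/International_Sources/Mexico"
)
```

Here the last pair is `("International_Sources", "Mexico")`, which is clearly not a class-subclass relation, so each pair needs its own interpretation.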
14
Ontological Analysis of Web Directories Meta-ontology of directory headings Class Named Relations Class- subclass Relations Reflexive Binary Relations
15
Ontological Analysis of Web Directories Interpretation Rules
16
IE and Ontology Learning Extracting with plain indicator terms with simple heuristics works But Even Better: –Learn indicators for each class –Use ontology analysis to classify indicators found –Fill in database templates: true IE
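The slides do not spell out the "simple heuristics", so the following is only one plausible reading: take the words that follow an indicator term, up to the next punctuation mark, as a candidate informative phrase. Both the heuristic and the name `extract_candidates` are assumptions.

```python
import re


def extract_candidates(sentence, indicators):
    """Return (indicator, phrase) pairs for text following each indicator term."""
    candidates = []
    for indicator in indicators:
        # Capture everything after the indicator up to the next punctuation mark.
        pattern = r"\b" + re.escape(indicator) + r"\b\s+([^.,;:!?]+)"
        for match in re.finditer(pattern, sentence, re.IGNORECASE):
            candidates.append((indicator, match.group(1).strip()))
    return candidates
```

With class-specific indicators, the same pairs could be routed to different template slots (product, service, area of competence) rather than to a single list.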
17
IE and Ontology Learning Classify Headings Learn class-specific indicators Human Classifies Directory Headings (WordNet) Closed Loop Strategy:
18
Future Work Complete the information extraction & ontology learning loop. In relation to the Semantic Web, they want to adapt the technique to the standards of usual explicit metadata –Example: The extracted information can be converted to RDF triples, with indicator collections accessible over the web
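As a rough illustration of the RDF idea, an extracted fact could be serialized as an N-Triples line. The `example.org` URIs and the `offers` property are invented for this sketch; the slides do not name a vocabulary, and `to_ntriple` is a hypothetical helper.

```python
def to_ntriple(subject_uri, predicate_uri, literal):
    """Serialize one (subject, predicate, literal) fact as an N-Triples line."""
    # Escape backslashes, double quotes and newlines inside the literal.
    escaped = (
        literal.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )
    return f'<{subject_uri}> <{predicate_uri}> "{escaped}" .'


line = to_ntriple(
    "http://example.org/company/acme",  # hypothetical company URI
    "http://example.org/vocab#offers",  # hypothetical property
    "steel pipes",
)
```

A production version would more likely use an RDF library and an agreed vocabulary instead of hand-built strings.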
19
Related Work
Combining IE and Ontologies (without use of web directories) –Bootstrapping an Ontology-Based Information Extraction System
Advantages of using Link Grammar Parser –Learning to Generate Semantic Annotation for Domain Specific Sentences
Using Yahoo to classify whole documents –Turning Yahoo into an Automatic Web-Page Classifier
Similar work aimed at more structured information using search engines –Extracting Patterns and Relations from the World Wide Web
Bootstrapping and other statistical methods for IE –Text Classification by Bootstrapping with Keywords –Learning Dictionaries for Information Extraction by Multi-Level Bootstrapping
20
Assessment I don’t think indicator term learning is done (even though they say it is) Relies on ontology learning techniques that have not yet been decided Need to develop an official directory