1 Text mining tool for ontology engineering based on use of product taxonomy and web directory
Jan Nemrava and Vojtech Svatek
Department of Information and Knowledge Engineering, VSE Praha

2 Current state (DATESO 2005)
► Information extraction (IE) and ontology learning are frequently discussed issues in the Semantic Web field.
► Semi-automatic and automatic methods for ontology-based information extraction are needed.
► The Web is a great source of unstructured text.

3 Task is …
► Collect specific words (verbs in our case) that usually occur together with a particular product category, as support for ontology designers.
► Build small, specialized ontologies that each concern one product category and describe its frequent relations in common text.
► Make use of fulltext search engines and the DMOZ directory for retrieving information.
► Also use the UNSPSC (United Nations Standard Products and Services Code) product catalogue.

4
► Web directories are rarely valid taxonomies.
► It is easy to see that subheadings are often not specializations of headings.
► Some of them are not even concepts (names of entities) but properties that implicitly restrict the extension of a preceding concept in the hierarchy. Consider for example .../Industries/Construction and Maintenance/Materials and Supplies/Masonry_and_Stone/Natural Stone/International Sources/Mexico.

5 Proposal of method …
► Obtain so-called "indicator verbs" that characterize a particular term (a product category in our case) in UNSPSC.
► The particular terms will then be generalized, so that verbs indicative of the upper levels of these terms may be mined as well.
► Join the UNSPSC taxonomy and its list of products with the content of company websites to gain valuable information about verbs that usually occur in one sentence with some product category from the taxonomy.
► Use hand-classified web directories containing relevant web sites.

6 Task sequence decomposition
► Manually select a UNSPSC product and the corresponding product category from the DMOZ Business branch
   Search in directory heading names
   Search in web site descriptions
   Use fulltext
► 1) Input: URL of the DMOZ directory containing companies that manufacture the desired product.
     Output: List of company URLs.
► 2) Input: URL of a company website.
     Output: List of web pages containing the target term.
► 3) Input: Web page containing the term.
     Output: File with extracted sentences containing the term.
► 4) Input: Sentence with the term.
     Output: Tagged sentences.
► 5) Input: Verbs.
     Output: Lemmatized, grouped and saved verbs.
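
Steps 3 to 5 of the decomposition can be illustrated with a minimal Python sketch. It is not the authors' tool: the slides do not say which tagger or lemmatizer was used, so the VERB_LEMMAS lookup table is a hypothetical stand-in for both, and the function names and the sample page text are invented for illustration.

```python
import re
from collections import Counter

# Hypothetical stand-in for a real POS tagger and lemmatizer (step 4):
# maps inflected verb forms to lemmas; unlisted words count as non-verbs.
VERB_LEMMAS = {
    "lifts": "lift", "lifted": "lift", "lift": "lift",
    "handles": "handle", "handle": "handle",
    "includes": "include", "included": "include", "include": "include",
}

def sentences_with_term(text, term):
    """Step 3: return the sentences of `text` that mention `term`."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if term.lower() in s.lower()]

def count_verb_lemmas(sentences):
    """Steps 4-5: map each known verb form to its lemma and count lemmas."""
    counts = Counter()
    for sentence in sentences:
        for word in re.findall(r"[a-z]+", sentence.lower()):
            lemma = VERB_LEMMAS.get(word)
            if lemma:
                counts[lemma] += 1
    return counts

# Invented sample page text, for illustration only.
page = ("Our forklift lifts pallets up to two tons. "
        "Contact us for prices. "
        "The forklift handles narrow aisles and includes a safety cage.")
hits = sentences_with_term(page, "forklift")
verbs = count_verb_lemmas(hits)
```

In a real run the lookup table would be replaced by an actual tagger and lemmatizer, and the page text would come from the crawled company web sites of steps 1 and 2.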

7 Experiment
► Handling equipment branch: UNSPSC products with the corresponding DMOZ categories
► The goal is to find verbs that are:
   common to most products
   characteristic of one branch of products
   specific to a small group of products, or even to only one product
► 7 product categories; 303 verbs collected, occurring 7300 times on the web sites.
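
One rough way to sort collected verbs into these three groups is by the number of product categories in which each verb occurs. The sketch below is only an approximation of the goal stated on the slide: the 0.8 threshold, the function name, and the sample verbs and category names are all assumptions, not values from the experiment.

```python
def classify_verbs(verb_categories, total_categories, common_share=0.8):
    """Bucket verbs by how many product categories they occur in.

    verb_categories: dict mapping verb -> set of category names.
    common_share is an assumed cut-off, not a value from the slides.
    """
    groups = {"common": [], "branch": [], "specific": []}
    for verb, cats in sorted(verb_categories.items()):
        if len(cats) / total_categories >= common_share:
            groups["common"].append(verb)      # occurs almost everywhere
        elif len(cats) == 1:
            groups["specific"].append(verb)    # tied to a single category
        else:
            groups["branch"].append(verb)      # shared by a few categories
    return groups

# Hypothetical category names and verb occurrences, for illustration only.
all_cats = {"c1", "c2", "c3", "c4", "c5", "c6", "c7"}
sample = {
    "have": set(all_cats),
    "lift": {"c1", "c2", "c3"},
    "palletize": {"c4"},
}
groups = classify_verbs(sample, total_categories=len(all_cats))
```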

8 Experiment
word | lemma | next occurrences
include | include | include, includes, includes, includes, includes, includes, include, include, include, include, included, included, include, include, include, include, include, included, include
announced | announce | announced, announced, announced, announced, announce, announce, announced, announced, announced, announced
are | be | are (the same form, repeated for every further occurrence)
providing | provide | providing, providing, providing, providing, providing, providing, providing
feature | feature | featured, features, features, features, features, features, features, features, features, feature, feature
following | follow | following, following, following, following, following, following, following
leading | lead | leading, leading, leading, leading, leading, leading, leading, leading, leading
products | (products is not a verb) | products, products, products, products, products, products
including | include | including, including, including, including, including, including

9 Experiments
► Some verbs are obviously entirely neutral and do not characterize the products at all (be, have, provide and use).
► Some are connected with manufacturing (design, require, offer, make, contact, manufacture, develop, supply).
► Some describe activities of manipulating material (handle, lift, install and move).

10 Experiments
rank | lemma | Per cent | lemma | Croft | lemma | TF/IDF
1 | have | 43,01 | have | 8,58 | have | 1 318,40
2 | provide | 40,38 | provide | 7,41 | provide | 1 164,76
3 | design | 39,36 | design | 7,14 | design | 1 119,10
4 | use | 37,29 | use | 6,38 | use | 1 028,17
5 | lift | 26,47 | require | 5,32 | require | 802,81
6 | require | 26,43 | handle | 4,70 | lift | 703,11
7 | handle | 19,81 | lift | 4,70 | handle | 676,10
8 | mount | 17,75 | offer | 4,68 | offer | 648,62
9 | operate | 17,66 | allow | 4,31 | allow | 596,96
10 | truck | 17,61 | include | 4,30 | contact | 587,38
11 | allow | 17,25 | please | 4,29 | move | 582,57
12 | contact | 16,37 | make | 4,18 | please | 582,57
13 | offer | 15,99 | contact | 4,15 | include | 572,89
14 | meet | 15,91 | need | 4,06 | meet | 538,52
15 | include | 15,49 | install | 4,06 | make | 538,52

11
► Normalization: Fij = fij * (Vtj / V)
► Croft's normalization moderates the effect of high-frequency verbs: cf = K + (1 - K) * fij / mij
► TF/IDF: wij = fij * log2(N / n)
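
The three formulas translate directly into code. The slide does not define its symbols, so the readings used in the comments below are assumptions: fij as the frequency of verb i in category j, mij as the maximum verb frequency in category j, K as a constant between 0 and 1, N as the number of categories, and n as the number of categories in which the verb occurs. Vtj and V are passed through as given, since their meaning is not stated. The function names are illustrative.

```python
import math

def scaled_frequency(f_ij, v_tj, v):
    """Fij = fij * (Vtj / V); Vtj and V are not defined on the slide."""
    return f_ij * (v_tj / v)

def croft(f_ij, m_ij, k=0.5):
    """cf = K + (1 - K) * fij / mij.

    Assumes m_ij is the maximum frequency in category j and 0 <= k <= 1;
    the default k=0.5 is an assumption, not a value from the slides.
    """
    return k + (1 - k) * f_ij / m_ij

def tf_idf(f_ij, n_total, n_with_verb):
    """wij = fij * log2(N / n), with N and n read as category counts."""
    return f_ij * math.log2(n_total / n_with_verb)

# Small worked values, chosen only to make the arithmetic easy to check.
example_scaled = scaled_frequency(10, 1, 4)
example_croft = croft(5, 10, k=0.5)
example_tfidf = tf_idf(3, 8, 2)
```

With k = 0.5, a verb at half the maximum frequency gets a Croft score of 0.75, which shows how the constant compresses the range and moderates high-frequency verbs.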

12 Problems remaining …
► Automate the assignment of UNSPSC categories to DMOZ categories.
► Some UNSPSC categories have no appropriate DMOZ category, which leads to few or no web sites.
► Some categories are less informative.

13 Thank you!
