1
Rubryx Document Classification Technology
Authors: V.N. Polyakov, V.V. Sinitsin
2
State of the Art
The classification task is part of the information retrieval (IR) task.
Some successful solutions exist.
Benchmarks exist; the most popular is the Reuters-21578 text categorization test collection.
The best reported F1 values range from 0.753 to 0.92 (Sebastiani, 1999).
Existing machine learning technologies are not low-cost: they require a large volume of manual work.
3
Rubryx Technology
General Features
Method Description
Formal Task Description
Machine Learning Technology
Dictionary Development Technology
Examples Selection Technology
Test Results and New Heuristics
Applications and Tools
4
General Features
Rubryx can be characterized as follows. It:
is based on a controlled dictionary;
uses collocations in ranking texts;
uses machine learning technology;
uses hard classification;
uses multi-label text categorization;
uses both category-pivoted and document-pivoted text categorization.
One more characteristic feature can be added to the list, one that has not been widely used yet is highly promising: a lexical-meaning-based approach.
5
Method Description
1. Compile a directory and a general thematic dictionary.
2. For every rubric, an expert selects sample texts for the category (five documents).
3. Generate a micro-dictionary in a special format for the category (rubric), based on the frequency of occurrence of general-dictionary terms in the sample texts. Set a threshold for every rubric.
4. Carry out a complete classification under the category.
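Step 4 can be illustrated with a minimal Python sketch. The slides do not give the ranking formula, so the scoring rule below (a sum of document term frequency times micro-dictionary weight) is an assumption chosen only for illustration, as are the names `score` and `classify` and all the numbers. What the sketch does take from the slides is the per-rubric threshold and the hard, multi-label decision.

```python
def score(doc_freqs, micro_dictionary):
    """Assumed relevance score (not the Rubryx formula): for every term the
    document shares with the micro-dictionary, multiply the document's
    normalized frequency by the dictionary weight and sum the products."""
    return sum(freq * micro_dictionary[term]
               for term, freq in doc_freqs.items()
               if term in micro_dictionary)


def classify(doc_freqs, rubrics):
    """Hard, multi-label decision: return every rubric whose score reaches
    that rubric's own threshold. `rubrics` maps a rubric name to a
    (micro_dictionary, threshold) pair. The result may be empty."""
    return sorted(name
                  for name, (micro, threshold) in rubrics.items()
                  if score(doc_freqs, micro) >= threshold)


# Hypothetical rubrics, weights, and thresholds.
rubrics = {
    "metallurgy": ({"steel": 1.0, "blast furnace": 2.0}, 50.0),
    "economics": ({"stock exchange": 2.0, "steel": 0.5}, 30.0),
    "geography": ({"river": 1.0}, 10.0),
}

# Term frequencies of one document, per 1000 words (made-up values).
doc = {"steel": 40.0, "blast furnace": 20.0, "stock exchange": 10.0}

labels = classify(doc, rubrics)
print(labels)
```

Because each rubric is tested against its own threshold, a document can receive several labels or none, which is how hard multi-label categorization is usually understood.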
6
Formal Task Description
7
Machine Learning Technology
1. Compile a directory.
2. For every rubric, an expert selects sample texts for the category (five documents).
3. Generate a micro-dictionary.
4. Set a threshold for every rubric.
5. After these four steps, Rubryx is ready for use.
8
Dictionary Development Technology
1. We use an electronic terminological dictionary for the whole directory in special formats: three files, for one-word, two-word, and three-word terms respectively.
2. For every sample, we determine the list of terms in the format used, with their frequency of occurrence.
3. A term is placed in the micro-dictionary if it occurs in at least M samples. Usually M = 2.
4. The final micro-dictionary can be corrected by an expert.
Remarks:
1. Using collocations gives lexical meaning disambiguation.
2. Frequencies are normalized to a text size of 1000 words.
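The micro-dictionary construction can be sketched in Python as follows. This is a rough illustration under stated assumptions, not the Rubryx implementation: the tokenizer, the in-memory set standing in for the three dictionary files, the use of the mean normalized frequency as a term weight, and all function names are hypothetical. Only the one- to three-word terms, the normalization to 1000 words, and the "at least M samples" rule come from the slides.

```python
import re
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-word sequences as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def term_frequencies(text, dictionary):
    """Count dictionary terms of one, two, and three words in `text` and
    normalize the counts to a text size of 1000 words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())  # assumed tokenizer
    if not tokens:
        return {}
    counts = Counter()
    for n in (1, 2, 3):
        for gram in ngrams(tokens, n):
            if gram in dictionary:
                counts[gram] += 1
    scale = 1000.0 / len(tokens)
    return {term: count * scale for term, count in counts.items()}


def build_micro_dictionary(samples, dictionary, min_samples=2):
    """Keep the terms that occur in at least `min_samples` samples (M in the
    slides). The weight, the mean normalized frequency over all samples,
    is an assumption made for this sketch."""
    per_sample = [term_frequencies(s, dictionary) for s in samples]
    sample_count = Counter(t for freqs in per_sample for t in freqs)
    return {
        term: sum(f.get(term, 0.0) for f in per_sample) / len(samples)
        for term, k in sample_count.items()
        if k >= min_samples
    }


# Hypothetical general dictionary and expert-selected samples.
dictionary = {"steel", "blast furnace", "stock exchange"}
samples = [
    "Steel from the blast furnace",
    "The blast furnace makes steel",
    "Stock exchange news",
]
micro = build_micro_dictionary(samples, dictionary, min_samples=2)
print(micro)
```

In this toy run "stock exchange" occurs in only one sample, so with M = 2 it is left out of the micro-dictionary.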
9
Examples Selection Technology
1. Samples are selected by an expert. Samples are the documents most relevant to each rubric.
2. Only 3-5 samples are needed for each rubric, in contrast to the thousands of manually classified documents needed by ordinary machine learning technologies.
3. The machine learning technology in Rubryx also depends on the expert's qualification but requires less manual work.
10
Preliminary Results of Rubryx Testing on the Reuters-21578 text categorization test collection
F1 = 0.85 on the “places” and “topic” categories.
F1 = 1 on the “exchanges” category.
The “people” and “org” categories require new dictionaries of proper names to be developed.
Some new heuristics were generated to improve results in the “places” and “topic” categories: taking into account the position of terms in a clause, the grouping of terms in the text, and proper names.
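For reference, F1 is the harmonic mean of precision and recall. The Python sketch below computes it for a single category from two sets of document identifiers. The identifiers are made up, and the slides do not say how the per-category values were averaged, so the sketch does not try to reproduce the reported figures.

```python
def f1_score(assigned, relevant):
    """F1 for one category. `assigned` is the set of documents the classifier
    put in the category, `relevant` the set that truly belong to it."""
    true_positives = len(assigned & relevant)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(assigned)
    recall = true_positives / len(relevant)
    return 2 * precision * recall / (precision + recall)


# Made-up document identifiers for illustration.
assigned = {1, 2, 3, 4}
relevant = {2, 3, 4, 5, 6}
print(round(f1_score(assigned, relevant), 3))
```

Here precision is 3/4 and recall is 3/5, so F1 is 2/3.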
11
Summary of Advantages and Know-How
Lexical-meaning-based approach.
Using collocations gives lexical meaning disambiguation.
We use an electronic terminological dictionary and micro-dictionaries in special formats: three files, for one-word, two-word, and three-word terms respectively.
Only 3-5 samples are needed for each rubric, in contrast to the thousands of manually classified documents needed by ordinary machine learning technologies.
Comparable classification quality with low-cost machine learning.
12
Applications and Tools
Rubryx – text classification program (versions 1 and 2; see www.sowsoft.com/rubryx)
DicTools – utility for dictionary development
Spider – application program for text collection from the Internet with preliminary classification
Dictionaries
13
Rubryx – text classification program
Status: Completed application
14
DicTools – utility for dictionary development
Status: Completed application
15
Spider – application program for text collection from the Internet with preliminary classification
Starting from a given web address, the application collects all pages relevant to the rubric of interest.
1. We input a category and a starting URL.
2. Spider recursively follows all links and loads pages. All pages are classified, and uninteresting link paths are cut.
3. As a result, we obtain a substantial saving of traffic and time.
Status: Evaluation and testing
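The pruning idea behind Spider can be sketched in Python as a focused crawl over a toy in-memory web. Everything here is an assumption made for illustration: the `fetch` and `is_relevant` callables, the `max_pages` limit, and the queue-based traversal (the slides say only that links are followed recursively). A real crawler would replace `fetch` with HTTP requests and `is_relevant` with the classifier.

```python
from collections import deque


def focused_crawl(start_url, fetch, is_relevant, max_pages=100):
    """Collect relevant pages reachable from `start_url`.

    `fetch(url)` returns a (text, links) pair and `is_relevant(text)` stands
    in for the classifier. Links are followed only from relevant pages, so an
    irrelevant page cuts off every path that runs through it alone."""
    seen = {start_url}
    queue = deque([start_url])
    collected = []
    loaded = 0
    while queue and loaded < max_pages:
        url = queue.popleft()
        text, links = fetch(url)
        loaded += 1
        if not is_relevant(text):
            continue  # prune: do not follow links from this page
        collected.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return collected


# Toy web: URL -> (page text, outgoing links). All values are made up.
web = {
    "a": ("steel plant", ["b", "c"]),
    "b": ("football scores", ["d"]),
    "c": ("steel prices", ["a"]),
    "d": ("steel alloys", []),
}

pages = focused_crawl("a", lambda url: web[url], lambda text: "steel" in text)
print(pages)
```

Page "d" is never loaded because it is reachable only through the irrelevant page "b", which is where the saving in traffic and time comes from.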
16
English Dictionaries
Natural Language Processing (7775 terms)
Geography (5941 terms)
Metallurgy (4946 terms)
Polytechnical (37488 terms)
Economics (1806 terms)
Names of market exchanges (69080 terms)
17
Publications
V.N. Polyakov, V.V. Sinitsin. “Method Automatic Classification of Web-resource by Patterns”. In: Text Processing and Cognitive Technologies. Paper Collection. Issue 6. Edited by V.D. Solovyev, V.N. Polyakov. Kazan, Otechestvo, 120-126 (2001). (Article in Russian with abstract in English.)
V.N. Polyakov, V.V. Sinitsin. “Rubryx: Technology of Text Classification Using Lexical Meaning Based Approach”. In: Proc. of International Conference Speech and Computer, SPECOM-2003. Moscow, MSLU, 137-143 (2003).
18
Contact Information
Vladimir N. Polyakov, Moscow State Linguistic University, vladimir_polyakov@yahoo.com
Vladimir V. Sinitsyn, Moscow State Steel and Alloys Institute (Technological University), sowsoft@land.ru
Rubryx home pages (shareware): www.sowsoft.com/rubryx/ and www.rubryx.narod.ru