Rubryx Document Classification Technology Authors: V.N. Polyakov, V.V. Sinitsin.

State of the Art
• Classification is a subtask of information retrieval (IR).
• Some successful solutions exist.
• There are benchmarks; the most popular is the Reuters text categorization test collection.
• The best reported values of the F1 measure range up to 0.92 (Sebastiani, 1999).
• Existing machine learning technologies are not low-cost: they require a large volume of manual work.
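For reference, the F1 measure quoted on this slide is the harmonic mean of precision and recall. The slides do not show how it was computed for any system, so the function below is only a minimal sketch of the standard per-category definition; the function name and its count-based arguments are illustrative assumptions.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Standard F1 for one category: harmonic mean of precision and recall.

    Illustrative helper, not part of Rubryx.
    """
    if true_positives == 0:
        # No correct assignments: precision and recall are zero or undefined.
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)
```

For example, a category with 8 correct assignments, 2 spurious ones and 2 missed ones has precision 0.8 and recall 0.8, hence F1 = 0.8.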

Rubryx Technology
• General Features
• Method Description
• Formal Task Description
• Machine Learning Technology
• Dictionary Development Technology
• Examples Selection Technology
• Tests Results and New Heuristics
• Applications and Tools

General Features
Rubryx can be characterized as follows:
• is based on a controlled dictionary;
• uses collocations in ranking texts;
• uses machine learning technology;
• uses hard classification;
• uses multi-label text categorization;
• uses both category-pivoted and document-pivoted text categorization.
One more characteristic feature can be added to the list: a lexical meaning based approach, which has not been widely used so far but is highly promising.

Method Description
1. Compile a directory and a general thematic dictionary.
2. For every rubric, have an expert select sample texts for the category (five documents).
3. Generate a micro-dictionary in a special format for the category (rubric), based on the frequency of occurrence of terms from the general dictionary in the sample texts. Set a threshold for every rubric.
4. Carry out a complete classification under the category.

Formal Task Description

Machine Learning Technology
1. Compile a directory.
2. For every rubric, have an expert select sample texts for the category (five documents).
3. Generate a micro-dictionary.
4. Set a threshold for every rubric.
5. After these four steps Rubryx is ready for use.
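Once every rubric has a micro-dictionary and a threshold, hard multi-label classification can be sketched as below. The slides do not give the scoring formula, so the score used here (the sum of the weights of matched one-, two- and three-word terms, normalized to 1000 words) is an assumption, as are the names `score` and `classify` and the sample weights and thresholds.

```python
import re


def score(text, micro_dictionary):
    """Assumed score: summed weights of matched 1- to 3-word terms per 1000 words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    total = 0.0
    for n in (1, 2, 3):  # one-word, two-word and three-word terms
        for i in range(len(tokens) - n + 1):
            total += micro_dictionary.get(" ".join(tokens[i:i + n]), 0.0)
    return total * 1000 / max(len(tokens), 1)


def classify(text, rubrics):
    """Hard, multi-label decision: return every rubric whose score reaches its threshold.

    rubrics maps a rubric name to a (micro_dictionary, threshold) pair.
    """
    return sorted(
        name
        for name, (micro_dictionary, threshold) in rubrics.items()
        if score(text, micro_dictionary) >= threshold
    )


# Hypothetical rubrics with made-up weights and thresholds.
rubrics = {
    "metallurgy": ({"steel": 1.0, "steel alloy": 2.0}, 100.0),
    "economics": ({"market": 1.0, "stock exchange": 2.0}, 100.0),
}
labels = classify("Steel alloy prices rise on the stock exchange", rubrics)
```

A document may receive several rubrics or none; each rubric is a yes/no decision against its own threshold, which is what "hard" and "multi-label" mean here.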

Dictionary Development Technology
1. We use an electronic terminological dictionary for the whole directory in special formats: three files for one-word, two-word and three-word terms respectively.
2. For every sample we determine the list of terms, in the format used, with their frequency of occurrence.
3. A term is placed in the micro-dictionary if it occurs in at least M samples (usually M = 2).
4. The final micro-dictionary can be corrected by an expert.
Remarks:
1. Using collocations provides lexical meaning disambiguation.
2. Frequencies are normalized to a text size of 1000 words.
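The micro-dictionary steps above can be sketched as follows. This is a minimal illustration, not the Rubryx implementation: the function names, the simple tokenizer, and the choice to store each term's frequency averaged over all samples are assumptions. Only the normalization to 1000 words and the "at least M samples" rule come from the slide.

```python
import re
from collections import Counter


def term_frequencies(text, dictionary, norm_size=1000):
    """Count dictionary terms of 1 to 3 words, normalized to norm_size words of text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter()
    for n in (1, 2, 3):
        for i in range(len(tokens) - n + 1):
            term = " ".join(tokens[i:i + n])
            if term in dictionary:
                counts[term] += 1
    scale = norm_size / max(len(tokens), 1)
    return {term: count * scale for term, count in counts.items()}


def build_micro_dictionary(samples, dictionary, min_samples=2):
    """Keep terms occurring in at least min_samples samples (M on the slide).

    The stored weight is the normalized frequency averaged over all samples,
    which is an assumption made for this sketch.
    """
    per_sample = [term_frequencies(sample, dictionary) for sample in samples]
    micro = {}
    for term in dictionary:
        found = [freqs[term] for freqs in per_sample if term in freqs]
        if len(found) >= min_samples:
            micro[term] = sum(found) / len(samples)
    return micro


# Hypothetical general dictionary (lowercase terms) and expert-selected samples.
dictionary = {"steel", "steel alloy", "iron ore", "market"}
samples = ["steel alloy production", "steel alloy and iron ore", "iron market"]
micro = build_micro_dictionary(samples, dictionary)
```

Here "steel" and "steel alloy" appear in two samples and are kept, while "iron ore" and "market" appear in only one sample each and are dropped.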

Examples Selection Technology
1. Samples are selected by an expert; they are the documents most relevant to each rubric.
2. Only 3-5 samples are needed per rubric, in contrast to the thousands of manually classified documents needed by ordinary machine learning technologies.
3. The machine learning technology in Rubryx also depends on the expert's qualification, but requires less manual work.

Preliminary Results of Rubryx Testing on the Reuters text categorization test collection
• F1 = 0.85 on the “places” and “topic” categories.
• F1 = 1 on the “exchanges” category.
• The “people” and “org” categories require new dictionaries of proper names to be developed.
• Some new heuristics were devised to improve results in the “places” and “topic” categories: taking into account the position of terms in the clause, the grouping of terms in the text, and proper names.

Summary of Advantages and Know-how
• Lexical meaning based approach.
• Using collocations provides lexical meaning disambiguation.
• We use an electronic terminological dictionary and micro-dictionaries in special formats: three files for one-word, two-word and three-word terms respectively.
• Only 3-5 samples are needed per rubric, in contrast to the thousands of manually classified documents needed by ordinary machine learning technologies.
• Comparable classification quality with low-cost machine learning.

Applications and Tools
• Rubryx – text classification program (versions 1 and 2; see the Rubryx site)
• DicTools – utility for dictionary development
• Spider – application program for text collection from the Internet with preliminary classification
• Dictionaries

Rubryx – text classification program Status: Completed application

DicTools – utility for dictionary development Status: Completed application

Spider – application program for text collection from the Internet with preliminary classification
Starting from a given web address, the application collects all pages relevant to the rubric of interest.
1. We input a category and a starting URL.
2. Spider recursively follows all links and loads the pages. Every page is classified, and link paths that are not of interest are cut.
3. As a result, we obtain significant savings in traffic and time.
Status: Evaluation and testing
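The pruning idea behind Spider can be sketched as a focused crawl that does not follow links from pages classified as irrelevant. This is an illustration under stated assumptions, not the Spider code: `fetch` and `is_relevant` are hypothetical caller-supplied functions (the second stands in for the Rubryx classifier), and the sketch uses a breadth-first queue instead of recursion.

```python
from collections import deque


def focused_crawl(start_url, fetch, is_relevant, max_pages=100):
    """Collect relevant pages, cutting link paths that start at irrelevant pages.

    fetch(url) -> (text, links) and is_relevant(text) -> bool are hypothetical
    callbacks supplied by the caller.
    """
    seen = {start_url}
    queue = deque([start_url])
    relevant = []
    loaded = 0
    while queue and loaded < max_pages:
        url = queue.popleft()
        text, links = fetch(url)
        loaded += 1
        if not is_relevant(text):
            continue  # irrelevant page: its outgoing links are never loaded
        relevant.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return relevant


# In-memory stand-in for a web site: url -> (page text, outgoing links).
site = {
    "a": ("steel news", ["b", "c"]),
    "b": ("cooking tips", ["d"]),
    "c": ("steel alloys", ["a"]),
    "d": ("steel page behind an irrelevant one", []),
}
fetched = []


def fetch(url):
    fetched.append(url)
    return site[url]


result = focused_crawl("a", fetch, lambda text: "steel" in text)
```

Page "d" is never downloaded because the only path to it runs through the irrelevant page "b"; this is where the traffic and time savings come from, at the cost of missing relevant pages reachable only through irrelevant ones.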

English Dictionaries
• Natural Language Processing (7775 terms)
• Geography (5941 terms)
• Metallurgy (4946 terms)
• Polytechnical (37488 terms)
• Economics (1806 terms)
• Names of market exchanges (69080 terms)

Publications
• V.N. Polyakov, V.V. Sinitsin. “Method Automatic Classification of Web-resource by Patterns”, in Text Processing and Cognitive Technologies. Paper Collection. Issue 6. Edited by V.D. Solovyev, V.N. Polyakov. Kazan, Otechestvo (2001). (Article in Russian with abstract in English.)
• V.N. Polyakov, V.V. Sinitsin. “Rubryx: Technology of Text Classification Using Lexical Meaning Based Approach”, in Proc. of International Conference Speech and Computer. SPECOM Moscow, MSLU (2003).

Contact Information
Vladimir N. Polyakov, Moscow State Linguistic University
Vladimir V. Sinitsyn, Moscow State Steel and Alloys Institute (Technological University)
Rubryx HomePages (shareware):