Document Categorization


Document Categorization
Problem: given
– a collection of documents, and
– a taxonomy of subject areas
Classification: determine the subject area(s) most pertinent to each document
Indexing: select a set of keywords / index terms appropriate to each document

Classification Techniques
Manual (a.k.a. Knowledge Engineering)
– typically, rule-based expert systems
Machine Learning
– Probabilistic (e.g., Naïve Bayes)
– Decision Structures (e.g., Decision Trees)
– Profile-Based: compare each document to profiles of the subject classes, using similarity rules like those employed in IR
– Support Vector Machines (SVM)

Machine Learning Procedures
Usually train-and-test:
– Exploit an existing collection in which documents have already been classified
– A portion is used as the training set; another portion is used as a test set
– Permits measurement of classifier effectiveness
– Allows tuning of classifier parameters to yield maximum effectiveness
Single- vs. multi-label: can one document be assigned to multiple categories?
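The train-and-test protocol above can be sketched in a few lines. This is a minimal illustration, not the procedure used in the case study: the function name, split fraction, and toy labeled collection are all invented for the example.

```python
import random

def train_test_split(documents, train_fraction=0.7, seed=42):
    """Split a pre-classified collection into a training portion and a test portion."""
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    shuffled = documents[:]            # copy; leave the original collection intact
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Toy collection of (doc-id, label) pairs that were "already classified".
collection = [("doc-%d" % i, "aviation" if i % 2 else "chemistry") for i in range(10)]
train_set, test_set = train_test_split(collection)
```

The training portion is used to fit the classifier; the held-out test portion measures effectiveness and can guide parameter tuning.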

Automatic Indexing
Assign to each document up to k terms drawn from a controlled vocabulary
Typically reduced to a multi-label classification problem:
– each keyword corresponds to a class of documents for which that keyword is an appropriate descriptor
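The reduction above — one binary classifier per controlled-vocabulary term — can be sketched as follows. The "classifiers" here are just substring tests standing in for trained models; the vocabulary terms and sample text are hypothetical.

```python
# One binary decision function per controlled-vocabulary term.
# Real systems would train one classifier (e.g., an SVM) per term.
classifiers = {
    "aerodynamics": lambda text: "airflow" in text or "lift" in text,
    "oceanography": lambda text: "ocean" in text,
    "chemistry":    lambda text: "reaction" in text,
}

def index_document(text, k=2):
    """Assign up to k keywords; each term's classifier votes independently."""
    hits = [term for term, clf in classifiers.items() if clf(text)]
    return hits[:k]

keywords = index_document("lift and airflow over the ocean surface")
```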

Case Study: SVM Categorization
Document collection from DTIC:
– 10,000 documents previously classified manually
– Taxonomy of 25 broad subject fields, divided into a total of 251 narrower groups
– Document lengths average 2705 ± 1464 words, with 623 ± 274 significant unique terms

Document Collection

Sample: Broad Subject Fields
01 -- Aviation Technology
02 -- Agriculture
03 -- Astronomy and Astrophysics
04 -- Atmospheric Sciences
05 -- Behavioral and Social Sciences
06 -- Biological and Medical Sciences
07 -- Chemistry
08 -- Earth Sciences and Oceanography

Sample: Narrow Subject Groups
Aviation Technology
– 01 Aerodynamics
– 02 Military Aircraft Operations
– 03 Aircraft
  – 0301 Helicopters
  – 0302 Bombers
  – 0303 Attack and Fighter Aircraft
  – 0304 Patrol and Reconnaissance Aircraft

Distribution among Categories

Baseline
Establish a baseline for conventional techniques:
– classification by training an SVM for each subject area
– "off-the-shelf" document modeling and SVM libraries

Why SVM?
Prior studies have suggested good results with SVM:
– relatively immune to "overfitting" (fitting to coincidental relations encountered during training)
– low dimensionality of model parameters

Machine Learning: Support Vector Machines
Binary classifier:
– Finds the hyperplane with the largest margin separating the two classes of training samples
– Subsequently classifies items based on which side of the hyperplane they fall
[Figure: two-feature example (font size vs. line number) showing the separating hyperplane and its margin]
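The maximum-margin idea is easiest to see in one dimension, where the best boundary sits midway between the closest opposing training samples. This is only a one-dimensional analogue of what an SVM does in the high-dimensional term space of the case study; the sample values are invented.

```python
def max_margin_threshold(positives, negatives):
    """1-D maximum-margin boundary: midpoint between the closest opposing
    samples (assumes the classes are separable, negatives below positives)."""
    lo, hi = max(negatives), min(positives)   # the two "support vectors"
    boundary = (lo + hi) / 2.0
    margin = (hi - lo) / 2.0                  # distance from boundary to nearest sample
    return boundary, margin

def classify(x, boundary):
    """Label a new item by which side of the boundary it falls on."""
    return "positive" if x > boundary else "negative"

boundary, margin = max_margin_threshold([4.0, 5.5, 7.0], [1.0, 2.0, 3.0])
```

Note that only the two nearest samples determine the boundary; the rest could move without changing it, which is the defining property of support vectors.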

SVM Evaluation

Baseline SVM Evaluation
– Training & testing process repeated for multiple subject categories
– Determine accuracy:
  overall
  positive (ability to recognize new documents that belong in the class the SVM was trained for)
  negative (ability to reject new documents that belong to other classes)
– Explore training issues

SVM "Out of the Box"
– 16 broad categories with 150 or more documents
– Lucene library for model preparation
– LibSVM for SVM training & testing (no normalization or parameter tuning)
– Training set of 100/100 (positive/negative samples)
– Test set of 50/50
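The overall, positive, and negative accuracies reported for the baseline come down to simple counting over the test set. The 50/50 split below mirrors the baseline setup, but the prediction values are invented for illustration.

```python
def evaluate(predictions, labels):
    """Overall, positive (in-class), and negative (out-of-class) accuracy.
    labels: True = document belongs to the category the SVM was trained for."""
    pos = [(p, y) for p, y in zip(predictions, labels) if y]
    neg = [(p, y) for p, y in zip(predictions, labels) if not y]
    pos_acc = sum(p == y for p, y in pos) / len(pos)    # recognized in-class docs
    neg_acc = sum(p == y for p, y in neg) / len(neg)    # rejected out-of-class docs
    overall = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
    return overall, pos_acc, neg_acc

# 50 positive / 50 negative test documents, with some invented mistakes.
labels = [True] * 50 + [False] * 50
predictions = [True] * 45 + [False] * 5 + [False] * 48 + [True] * 2
overall, pos_acc, neg_acc = evaluate(predictions, labels)
```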

"OotB" Interpretation
– Reasonable performance on broad categories, given the modest training set size
– A related experiment showed that, with normalization and optimized parameter selection, accuracy could be improved by as much as an additional 10%

Training Set Size

Accuracy plateaus for training set sizes well under the number of terms in the document model.

Training Issues
Training set size:
– Concern: detailed subject groups may have too few known examples to perform effective SVM training in that subject
– Possible solution: the collection may have few positive examples, but it has many, many negative examples
Positive/negative training mixes:
– effects on accuracy

Increased Negative Training

Training Set Composition
Experiment performed with 50 positive training examples (OotB SVM training):
– increasing the number of negative training examples has little effect on overall accuracy
– but positive accuracy is reduced

Interpretation
– May indicate a weakness in SVM, or simply further evidence of the importance of optimizing SVM parameters
– May indicate the unsuitability of treating SVM output as a simple Boolean decision; it might do better as a "best fit" in a multi-label classifier
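The "best fit" idea suggested above — ranking per-category scores rather than taking each SVM's Boolean verdict — can be sketched as follows. The category names, scores, and thresholds are hypothetical; in practice the scores might be signed distances from each subject's SVM hyperplane.

```python
def best_fit(scores, threshold=0.0, k=2):
    """Rank categories by decision score and keep up to k that clear the threshold,
    instead of accepting every per-category Boolean 'yes'."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [cat for cat, s in ranked if s > threshold][:k]

# Invented per-category scores for one document (e.g., SVM decision values).
scores = {"aviation": 1.3, "chemistry": -0.4, "oceanography": 0.2}
categories = best_fit(scores)
```

A weakly positive score like 0.2 might be rejected by a hard Boolean rule tuned conservatively, but survives here because it ranks among the top k candidates.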