Dynamic Category Profiling for Text Filtering and Classification

Dynamic Category Profiling for Text Filtering and Classification
Rey-Long Liu
Dept. of Medical Informatics, Tzu Chi University, Taiwan, R.O.C.

Goal
- Promote both precision and recall of text filtering and classification by
  - Finding more suitable features (terms) to measure the degree of acceptance (DOA) of a document d with respect to a category c
  - Deriving a method to make filtering and classification decisions based on the DOA estimation

Motivation
- Previous techniques often find and employ features (terms) that are
  - Representative of a category
  - Discriminative, so that a category can be distinguished from others
- Unfortunately, content overlapping was often ignored
  - A document d may be classified into a category c only if they share the same content to some extent

An example
- Two categories
  - c1: computer networks
  - c2: computer animations
- Previous techniques tend to employ terms like “network” and “animation” as features
  - They are both representative and discriminative
- Unfortunately, the term “computer” may NOT be selected, although it helps to filter out documents that are about networks but NOT computers (e.g. traffic networks)

Therefore, features of a category should be selected dynamically when a document is entered
- To discriminate c from others
  - Features that correlate with c
  - Features that correlate with other categories
- To validate content overlapping
  - Features that appear in c but do not appear in d
  - Features that do not appear in c but appear in d
[Table: for each feature type, whether it is “Considered” or “Not considered” by the underlying classifier and by dynamic profiling]

The method: DP4FC
- DP4FC: Dynamic Profiling for Filtering & Classification
- Associating various classifiers with DP4FC
[Figure: DP4FC workflow. Training: Documents for Classifier Building, Classifier Building, Underlying Classifier, Documents for Threshold Tuning, DOA Estimation, Threshold Tuning. Testing: Document for TF & TC, DOA Estimation by Dynamic Profiling, Integrated TF & TC, Filtered Documents, Classified Documents.]

DOA estimation by dynamic profiling
Procedure DOAEstimationByDP(c, d), where
  (1) c is a category,
  (2) d is a document for thresholding or testing
Return: DOA value of d with respect to c
Begin
  (1) DOAbyDP = 0;
  (2) For each term t in c but not in d, do
    (2.1) DOAReduction = Support(t, c) × log2(IDF of t in training data and d);
    (2.2) DOAbyDP = DOAbyDP - DOAReduction;
  (3) For each term t in d but not in c, do
    (3.1) DOAReduction = Support(d, c) × log2(IDF of t in training data and d);
    (3.2) DOAbyDP = DOAbyDP - DOAReduction;
  (4) Return DOAbyDP;
End.
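The procedure above can be sketched in Python. This is only an illustration under stated assumptions, not the author's implementation: the slides do not define Support or the exact IDF formula, so the sketch takes per-term support weights as plain dictionaries (`category_support`, `doc_support`, both hypothetical names) and assumes IDF is the number of documents divided by the number of documents containing the term, counted over the training data plus d. Step (3.1) on the slide writes Support(d, c); the sketch assumes a per-term weight of t in d instead.

```python
import math


def idf(term, training_docs, doc):
    """Assumed IDF: pool size / document frequency, over training data plus d."""
    pool = training_docs + [doc]
    df = sum(1 for terms in pool if term in terms)
    return len(pool) / max(df, 1)  # guard against terms seen in no document


def doa_by_dp(category_support, doc_support, training_docs, doc):
    """Return the DOA of document d with respect to category c.

    category_support: dict term -> weight of the term in c (hypothetical Support)
    doc_support:      dict term -> weight of the term in d (assumption, see lead-in)
    training_docs:    list of term sets
    doc:              term set of d
    """
    doa = 0.0
    # Step (2): terms in c but not in d reduce the DOA.
    for term, support in category_support.items():
        if term not in doc:
            doa -= support * math.log2(idf(term, training_docs, doc))
    # Step (3): terms in d but not in c reduce the DOA.
    for term, support in doc_support.items():
        if term not in category_support:
            doa -= support * math.log2(idf(term, training_docs, doc))
    return doa


training = [{"computer", "network"}, {"computer", "animation"}, {"traffic", "network"}]
c1_support = {"computer": 0.5, "network": 0.5}

# A traffic-network document lacks "computer" and adds "traffic", so it is penalized twice.
traffic_doc = {"traffic", "network"}
print(doa_by_dp(c1_support, {"traffic": 0.5, "network": 0.5}, training, traffic_doc))
```

The value is never positive: 0 means the terms of d and c coincide, and each non-shared term pushes the DOA further down, with rarer terms (higher IDF) costing more.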

Making a filtering and classification decision
- Two thresholds are derived
  - One is based on the DOA values produced by the underlying classifier
  - The other is based on the DOA values produced by DP4FC
- A document may be classified into a category only if its DOA values are greater than or equal to the corresponding thresholds of the category
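The two-threshold rule can be written as a small check. This is a minimal sketch with hypothetical names (`accept`, `accepted_categories`); it additionally assumes that a document accepted by no category counts as filtered out, which the slide implies but does not state.

```python
def accept(doa_classifier, doa_dp, thr_classifier, thr_dp):
    """Accept d into c only if both DOA values reach their category thresholds."""
    return doa_classifier >= thr_classifier and doa_dp >= thr_dp


def accepted_categories(doc_scores, thresholds):
    """Return the categories that accept the document.

    doc_scores: dict category -> (DOA from underlying classifier, DOA from dynamic profiling)
    thresholds: dict category -> (threshold for classifier DOA, threshold for DP DOA)
    An empty result is treated as "filtered out" (assumption).
    """
    return [
        category
        for category, (doa_classifier, doa_dp) in doc_scores.items()
        if accept(doa_classifier, doa_dp, *thresholds[category])
    ]


scores = {"c1": (0.7, -1.0), "c2": (0.2, -0.5)}
limits = {"c1": (0.5, -2.0), "c2": (0.5, -2.0)}
print(accepted_categories(scores, limits))
```

Requiring both conditions means dynamic profiling can only veto a decision of the underlying classifier, never add a category the classifier rejected.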

Experiment

Aspects and settings
(1) Source of experimental data
  (A) Reuters-21578
  (B) Yahoo text hierarchy
(2) Split of test data
  (A) In-space test data (for evaluating TC)
  (B) Out-space test data (for evaluating TF)
(3) Split of the training data for classifier building (CB) and threshold tuning (TT)
  (A) 50% for CB; 50% for TT (with 2-fold cross validation)
  (B) 80% for CB; 20% for TT (with 5-fold cross validation)
(4) Parameter settings for the classifier
  Different sizes of feature sets on which the classification methodologies were built

The underlying classifier
- Rocchio’s classifier (RO)
  - η1 * Σ_{Doc∈P} Doc / |P| − η2 * Σ_{Doc∈N} Doc / |N|
  - P and N are the sets of positive and negative documents, respectively
- RO is often tested in both text filtering and classification
- Parameter setting
  - η1 = 16; η2 = 4
  - Previous studies showed that this setting is good for RO
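As an illustration of the Rocchio profile, the sketch below builds a category vector as the weighted centroid of the positive documents minus the weighted centroid of the negative documents, using the weights 16 and 4 from the slide as defaults. The function name `rocchio_prototype` and the dense list representation of document vectors are assumptions made for this example; the slides do not say how a DOA value is then computed from the profile, so that step is left out.

```python
def rocchio_prototype(positives, negatives, eta1=16.0, eta2=4.0):
    """Build a Rocchio category profile from document vectors.

    positives, negatives: lists of equal-length term-weight vectors (lists of floats).
    Returns eta1 * centroid(positives) - eta2 * centroid(negatives).
    """
    dim = len(positives[0])

    def centroid(docs):
        # An empty set contributes a zero vector.
        if not docs:
            return [0.0] * dim
        return [sum(vec[i] for vec in docs) / len(docs) for i in range(dim)]

    pos_centroid = centroid(positives)
    neg_centroid = centroid(negatives)
    return [eta1 * p - eta2 * n for p, n in zip(pos_centroid, neg_centroid)]


# Two positive and one negative document over a two-term vocabulary.
profile = rocchio_prototype([[1.0, 0.0], [1.0, 2.0]], [[0.0, 2.0]])
print(profile)
```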

Evaluation criteria
- For text classification
  - Precision (P), Recall (R), and F1 = 2PR/(P+R)
- For text filtering
  - Filtering Ratio (FR) = # out-space documents filtered out / # out-space documents
  - Average Misclassifications (AM) = # misclassifications / # out-space documents misclassified into the category space
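These criteria are simple ratios, sketched below. The function names and the input layout are assumptions made for illustration: the classification measures take counts of true positives, false positives and false negatives, and the filtering measures take, for each out-space document, the list of categories it was (wrongly) assigned to, with an empty list meaning the document was filtered out.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 = 2PR / (P + R) from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def filtering_ratio(assignments):
    """FR: share of out-space documents that were assigned to no category."""
    filtered = sum(1 for categories in assignments if not categories)
    return filtered / len(assignments)


def average_misclassifications(assignments):
    """AM: wrong category assignments per out-space document that was not filtered out."""
    misclassified = [categories for categories in assignments if categories]
    if not misclassified:
        return 0.0
    return sum(len(categories) for categories in misclassified) / len(misclassified)


# Four out-space documents: two filtered out, one put into 1 category, one into 3.
out_space = [[], [], ["c1"], ["c1", "c2", "c3"]]
print(filtering_ratio(out_space), average_misclassifications(out_space))
```

A higher FR is better, while a lower AM is better: AM shows how widely a document that slipped through was spread across categories.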

Results Performance (in F1) in processing in-space documents

Performance (in FR) in processing out-space documents

Performance (in AM) in processing out-space documents

Conclusion
- For each category, most documents should be filtered out
- Content overlapping between a document and a category is thus important
  - It measures how much d discusses content that is not in c, and vice versa
  - Unfortunately, it is often ignored by previous techniques
  - It calls for dynamic profiling for each category
- With dynamic profiling, the classifier’s performance may be both better and more stable in both filtering and classification

Thank you