Reduction of Training Noise for Text Classifiers
Rey-Long Liu
Dept. of Medical Informatics, Tzu Chi University, Taiwan
Outline
– Background
– Problem definition
– The proposed approach: TNR
– Empirical evaluation
– Conclusion
Background
Training of Text Classifiers
Given
– Training documents labeled with category labels
Return
– Text classifiers that can
  – Classify in-space documents (those that are relevant to some category of interest)
  – Filter out out-space documents (those that are relevant to none of the categories of interest)
Usage: retrieval and dissemination of information
Typical Problem: Noise in the Training Texts
The training documents are inevitably unsound and/or incomplete
– They therefore contain a lot of noise: terms that are irrelevant to the category but happen to appear in the training documents
Problem Definition
Goal & Motivation
Goal
– Develop a technique, TNR (Training Noise Reduction), that removes possible training noise for text classifiers
Motivation
– With the help of TNR, text classifiers can be trained to achieve better performance in
  – Classifying in-space documents
  – Filtering out out-space documents
Basic Idea
Term proximity as the key evidence for identifying noise
– In a training text d of a category c, a sequence of consecutive terms (in d) is treated as noise if those terms have many neighboring terms that are not related to c
– Such terms may simply happen to appear in d and hence are likely to be irrelevant to c
Related Work
No previous approach focused on fusing the relatedness scores of consecutive terms to identify training noise for text classifiers
– Term proximity was mainly employed to improve text ranking or to select features for building text classifiers
TNR can serve as a front-end processor for feature selection and classifier development techniques (e.g., SVM)
The Proposed Approach: TNR
Basic Definition
Positive correlation vs. negative correlation
– A term t is positively correlated to a category c if the occurrence of t in a document d increases the possibility of classifying d into c; otherwise t is negatively correlated to c
Therefore, TNR should remove those terms that are negatively correlated to c
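A common way to quantify this correlation (and the statistic named on the algorithm slide) is the χ² score over the term-category contingency table of the training set. The formula below is the conventional signed variant, given as an illustrative reference rather than the exact definition used in TNR; the sign convention is my assumption:

```latex
% Contingency counts over the training documents:
%   A = #docs of c containing t        B = #docs not of c containing t
%   C = #docs of c not containing t    D = #docs not of c not containing t
%   N = A + B + C + D
\chi^2(t,c) = \frac{N\,(AD - BC)^2}{(A+B)(C+D)(A+C)(B+D)}
% Sign convention (assumption): the correlation strength is +\chi^2 when
% AD > BC (t is over-represented in c) and -\chi^2 otherwise.
```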
The Main Hypothesis
Those terms (in d) that have many neighboring terms with negative or low correlation strength to c may simply happen to appear in d, and hence are likely to be the training noise in d
The Algorithm of TNR
(1) For a category c, sequentially scan each term t in d
  (1.1) Employ the χ² statistic to compute the cumulative correlation strength at t
    – Positive correlation if t is more likely to appear in documents of category c; otherwise negative correlation
    – A positive correlation increases the net strength; a negative correlation decreases it (a candidate noise segment starts where the net strength drops to NetS ≤ 0)
(2) Identify the term segments (in d) that are likely to be training noise
  (2.1) Noise = the text segments that are more likely to contain false-positive (FP) and true-negative (TN) terms
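A minimal sketch of this scan, assuming per-term signed χ² correlation scores are already available; the function name, the reset rule, and the segment-closing rule (from where the net strength becomes negative to where it is minimized, as annotated in the figure on the next slide) are my assumptions:

```python
def find_noise_segments(terms, strength):
    """Scan the terms of a training document in order and flag segments
    whose cumulative (net) correlation strength to category c turns negative.

    terms    -- list of terms in document order
    strength -- dict mapping term -> signed chi-square correlation to c
                (positive if the term favors c, negative otherwise)
    Returns a list of (start, end) index pairs marking candidate noise.
    """
    segments = []
    net = 0.0
    start = None                   # position where the net strength became negative
    min_net, min_pos = 0.0, None   # position where the net strength is minimized
    for i, t in enumerate(terms):
        net += strength.get(t, 0.0)
        if net < 0:
            if start is None:
                start = i
            if net < min_net:
                min_net, min_pos = net, i
        elif start is not None:
            # Net strength recovered: close the segment from where it
            # became negative to where it was minimized.
            segments.append((start, min_pos))
            net = 0.0              # assumption: restart the accumulation
            start, min_net, min_pos = None, 0.0, None
    if start is not None:
        segments.append((start, min_pos))
    return segments

# Example: the off-topic terms at positions 3 (jackpot) drag the net
# strength below zero, so a one-term noise segment (3, 3) is flagged.
scores = {"flu": 2.0, "lottery": -1.5, "jackpot": -1.5, "virus": 2.0}
print(find_noise_segments(["flu", "the", "lottery", "jackpot", "virus"], scores))
```

Removing the flagged segments before feature extraction would then provide the front-end processing for feature selection and SVM training mentioned on the Related Work slide.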
[Figure: net-strength curve over the term positions of d, annotated with the position at which the net strength becomes negative, the position at which the net strength is minimized, and the training noise identified between these two positions]
Empirical Evaluation
Experimental Data
Top-10 fatal diseases and top-20 cancers in Taiwan
– # of diseases: 28
– # of documents: 4669 (covering 5 aspects: etiology, diagnosis, treatment, prevention, and symptom)
– Source: Web sites of hospitals, healthcare associations, and the Department of Health in Taiwan
– Training documents: 2300 documents
– Test documents:
  – In-space documents: the remaining 2369 documents
  – Out-space documents: 446 documents about other diseases
Underlying Classifier
– The Support Vector Machine (SVM) classifier
Results: Classification of In-Space Documents
Evaluation criteria
– Micro-averaged F1 (MicroF1)
– Macro-averaged F1 (MacroF1)
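For reference, the conventional definitions of these two criteria (not spelled out on the slides; these are the standard formulas, pooling per-category true positives TP, false positives FP, and false negatives FN):

```latex
% Micro-averaged F1: pool the counts over all categories first.
P_{micro} = \frac{\sum_i TP_i}{\sum_i (TP_i + FP_i)}, \qquad
R_{micro} = \frac{\sum_i TP_i}{\sum_i (TP_i + FN_i)}, \qquad
MicroF_1 = \frac{2\,P_{micro}\,R_{micro}}{P_{micro} + R_{micro}}
% Macro-averaged F1: average the per-category F1 scores.
MacroF_1 = \frac{1}{|C|} \sum_{i=1}^{|C|} \frac{2\,P_i\,R_i}{P_i + R_i}
```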
Results: Filtering of Out-Space Documents
Evaluation criteria
– Filtering ratio (FR) = # out-space documents successfully rejected by all categories / # out-space documents
– Average number of misclassifications (AM) = # misclassifications of the out-space documents / # out-space documents
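A small sketch of how these two criteria can be computed from per-document predictions; the data layout (one set of assigned categories per out-space document) is my assumption, while the formulas follow the slide:

```python
def filtering_metrics(predictions):
    """Compute FR and AM over the out-space test documents.

    predictions -- list with one entry per out-space document; each entry
                   is the set of categories the classifier assigned to it
                   (an empty set means the document was rejected by all
                   categories, i.e., filtered out correctly).
    """
    n = len(predictions)
    rejected = sum(1 for cats in predictions if not cats)
    misclassifications = sum(len(cats) for cats in predictions)
    fr = rejected / n            # filtering ratio
    am = misclassifications / n  # average number of misclassifications
    return fr, am

# Example: 3 out-space documents; the first is rejected by all categories,
# the second is misclassified into one category, the third into two.
print(filtering_metrics([set(), {"flu"}, {"flu", "asthma"}]))  # (0.333..., 1.0)
```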
Conclusion
Text classifiers are essential for the archival and dissemination of information
Many text classifiers are built from a set of training documents
– The training documents are inevitably unsound and incomplete, and hence contain much training noise
Training noise can be reduced by analyzing the correlation types of consecutive terms
We show that this noise reduction helps improve state-of-the-art text classifiers