Using IR techniques to improve Automated Text Classification

Presentation transcript:

Using IR techniques to improve Automated Text Classification
NLDB04 – 9th International Conference on Applications of Natural Language to Information Systems
Teresa Gonçalves, Paulo Quaresma
tcg@di.uevora.pt, pq@di.uevora.pt
Departamento de Informática, Universidade de Évora, Portugal

Hello! My name is Teresa Gonçalves. I come from the University of Évora, Portugal, and I will present a paper named Using Information Retrieval techniques to improve Automated Text Classification.

Overview
- Application area: Text Categorisation
- Research area: Machine Learning
- Paradigm: Support Vector Machines
- Written language: European Portuguese and English
- Study: evaluation of preprocessing techniques

This paper presents preliminary work in the Automated Text Classification area using the Support Vector Machines paradigm. We evaluate some preprocessing techniques on two sets of documents: one written in the Portuguese language and the other, a very well known one, in English.

Datasets
- Multilabel classification datasets: each document can be classified into multiple concepts
- Written language: European Portuguese (PAGOD dataset) and English (Reuters dataset)

In both datasets each document can be classified into multiple concepts, which turns the problem at hand into a multilabel classification one. The English dataset is Reuters and the Portuguese one is PAGOD.

PAGOD dataset
Represents the decisions of the Portuguese Attorney General's Office since 1940.
Characteristics:
- 8151 documents, 96 Mbytes of characters
- 68886 distinct words
- Average document: 1339 words, 306 distinct
- Taxonomy of 6000 concepts, of which around 3000 are used
- 5 most used concepts (number of documents): 909, 680, 497, 410, 409

The PAGOD dataset represents the decisions of the Portuguese Attorney General's Office since 1940. It is a set of around 8100 documents with almost 70 thousand distinct words; each document averages 1339 words, of which 306 are distinct. Each document can belong to a set of concepts organised in a taxonomy of around 6000, of which only about half are used. 909 documents belong to the most used concept (pension for exceptional services), while only 409 documents belong to the 5th most used one (military).

Reuters-21578 dataset
Originally collected by the Carnegie Group from the Reuters newswire in 1987.
Characteristics:
- 9603 train documents, 3299 test documents (ModApté split)
- 31715 distinct words
- Average document: 126 words, 70 distinct
- Taxonomy of 135 concepts, 90 of which appear in the train/test sets
- 5 most used concepts (number of documents): train set 2861, 1648, 534, 428, 385; test set 1080, 718, 179, 148, 186

For the Reuters dataset we used the ModApté split, which gives 9603 documents for the train set and 3299 documents for the test set. Over the whole dataset there are almost 32 thousand distinct words; each document averages 126 words, of which 70 are distinct. Each document can belong to a set of 135 concepts, of which 90 appear in both the train and test sets. The number of documents belonging to the 5 most used concepts ranges from 2861 down to 385 in the train set.

Experiments
Document representation: bag-of-words
- Retain word frequencies
- Discard words that contain digits
Algorithm: linear SVM (WEKA software package)
Classes of preprocessing experiments:
- Feature reduction/construction
- Feature subset selection
- Term weighting

For the experiments we represented each document as a bag of words, retaining word frequencies and discarding all words containing digits. We used a linear SVM from the WEKA software package and ran three classes of preprocessing experiments: feature reduction/construction, feature subset selection and term weighting.
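The document representation described above can be sketched in a few lines of Python. This is an illustrative reconstruction, not the WEKA pipeline the paper actually used; the tokenisation rule (lowercase, alphanumeric tokens) is an assumption.

```python
import re
from collections import Counter

def bag_of_words(text):
    # Lowercase and tokenise; the exact tokeniser is an assumption.
    tokens = re.findall(r"\w+", text.lower())
    # Discard words that contain digits, as described on the slide.
    tokens = [t for t in tokens if not any(c.isdigit() for c in t)]
    # Retain word frequencies (not just presence).
    return Counter(tokens)

bow = bag_of_words("The court decision of 1940 cites the earlier decision")
# "1940" is discarded; "decision" keeps its frequency of 2
```

Each document then becomes a sparse term-frequency vector that the later weighting and selection steps operate on.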

Feature reduction/construction
Uses linguistic information:
- red1: no use of linguistic information
- red2: remove a list of non-relevant words (articles, pronouns, adverbs and prepositions)
- red3: remove the red2 word list and transform each word into its lemma (its stem for the English dataset)
Resources: Portuguese — POLARIS, a Portuguese lexical database; English — FreeWAIS stop-list and the Porter algorithm

For the feature reduction/construction experiments we used linguistic information in three settings: in the first we used the original words; in the second we removed a list of stop words; and in the third we additionally used the Porter stemming algorithm for English and the POLARIS lexical database to transform each Portuguese word into its lemma.

Feature subset selection
Uses a filtering approach: keep the features (words) that receive the highest scores.
Scoring functions:
- scr1: term frequency
- scr2: mutual information
- scr3: gain ratio
Threshold value:
- scr1: the number of times each word appears in all documents
- scr2, scr3: the same number of features as scr1
Experiments: sel1, sel50, sel100, sel200, sel400, sel800, sel1200, sel1600

For feature subset selection we used a filtering approach, retaining the words with the highest scores. We tried three scoring functions (term frequency, mutual information and gain ratio, the last two from the information theory field) and eight threshold values, ranging from all words (sel1) to retaining only words that appear, in the term frequency experiment, at least 1600 times.
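The scr1 filter with a frequency threshold can be sketched as below; this is an illustrative reconstruction of the selection rule, not the tooling used in the paper.

```python
from collections import Counter

def select_by_term_frequency(docs, threshold):
    # scr1: score each word by its total frequency across all documents
    # and keep words whose count reaches the threshold (sel1, sel50, ...).
    counts = Counter()
    for doc in docs:            # each doc is a list of tokens
        counts.update(doc)
    return {w for w, c in counts.items() if c >= threshold}

docs = [["tax", "pension", "tax"], ["pension", "service"], ["tax"]]
vocab = select_by_term_frequency(docs, 2)   # {'tax', 'pension'}
```

For scr2 and scr3 the same mechanism applies, except that words are ranked by mutual information or gain ratio and the cut is made so that the same number of features survives as under scr1.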

Number of attributes (with respect to threshold values)

The number of words obtained for each threshold value and for each dataset is shown in these graphs. As you can see, most words appear only a few times in the set of all documents.

Term weighting
Uses the document, collection and normalisation components:
- wgt1: binary representation, no collection component, normalised to unit length
- wgt2: raw term frequency, no collection or normalisation component
- wgt3: term frequency, no collection component, normalised to unit length
- wgt4: term frequency divided by the collection component and normalised to unit length

For term weighting we tried four settings: wgt1 is the binary representation; wgt2 is the raw term frequency; wgt3 is the term frequency normalised to unit length; and wgt4 is the popular tfidf measure.
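The four weighting schemes can be sketched as one function. This is an illustrative version under common conventions (natural-log idf, L2 normalisation); the paper does not spell out these details, so treat them as assumptions.

```python
import math

def weight(tf_vector, df, n_docs, scheme):
    # tf_vector: term -> raw count in one document
    # df: term -> number of documents containing the term (collection component)
    if scheme == "wgt2":                # raw term frequency, no normalisation
        return dict(tf_vector)
    if scheme == "wgt1":                # binary document component
        vec = {t: 1.0 for t in tf_vector}
    elif scheme == "wgt3":              # term frequency
        vec = {t: float(c) for t, c in tf_vector.items()}
    elif scheme == "wgt4":              # tf x idf (idf convention assumed)
        vec = {t: c * math.log(n_docs / df[t]) for t, c in tf_vector.items()}
    # wgt1, wgt3, wgt4 are all normalised to unit length.
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()}
```

For example, a document with counts {a: 3, b: 4} under wgt3 becomes the unit vector {a: 0.6, b: 0.8}.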

Experimental results
Method:
- PAGOD: 10-fold cross-validation
- Reuters: train and test sets (ModApté split)
Measures:
- Precision, recall and F1
- Micro- and macro-averaging for the top 5 concepts
- Significance tests at 95% confidence

For the experimental results we used 10-fold cross-validation for the Portuguese dataset and the train/test split for the English one. We measured precision, recall and F1 for each classifier and calculated the micro- and macro-averages over the 5 top concepts. All significance tests were made at 95% confidence.
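The measures above can be made concrete with a short sketch: per-concept precision/recall/F1 from true-positive, false-positive and false-negative counts, then macro-averaging (average the per-concept F1) versus micro-averaging (pool the counts first). This is the standard construction, shown for illustration.

```python
def prf1(tp, fp, fn):
    # Precision, recall and F1 from contingency counts.
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def micro_macro_f1(per_class):
    # per_class: list of (tp, fp, fn) tuples, one per concept.
    # Macro: average the per-concept F1 values.
    macro = sum(prf1(*c)[2] for c in per_class) / len(per_class)
    # Micro: pool the counts over concepts, then compute F1 once.
    tp = sum(c[0] for c in per_class)
    fp = sum(c[1] for c in per_class)
    fn = sum(c[2] for c in per_class)
    micro = prf1(tp, fp, fn)[2]
    return micro, macro
```

Micro-averaging weights frequent concepts more heavily, while macro-averaging treats every concept equally, which is why the two graphs in the results can disagree.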

PAGOD dataset

These graphs show, for the PAGOD dataset, the micro- and macro-averaged F1 measure, averaged over all threshold values of each experiment: feature reduction/construction (red1, red2, red3), feature subset selection scoring function (scr1, scr2, scr3) and term weighting (wgt1, wgt2, wgt3, wgt4). As you can see, the worst values were obtained for the raw term frequency (visible in both graphs), gain ratio (more easily seen in the macro-average graph) and lemmatisation experiments. The best values were obtained for the experiment that removes the words belonging to the stop list, scores them by term frequency retaining all words (no threshold value), and weights them by term frequency normalised to unit length.

Reuters dataset

These graphs show the values obtained in the same way for the Reuters dataset. In this dataset the worst values were also obtained for the raw term frequency and gain ratio experiments (visible in both graphs). The feature construction/reduction experiments show similar averaged results. The best values were obtained for the mutual information experiment at a threshold value of 400, with no significant difference between the stop-word and stemming experiments, nor between tfidf and term frequency normalised to unit length.

Results
PAGOD:
- Best combination: scr1 – red2 – wgt3 – sel1
- Worst values: scr3 and wgt2 experiments
Reuters:
- Best combination: scr2 – (red1, red3) – (wgt3, wgt4) – sel400
- Worst values: scr3 and wgt2 experiments

Discussion
SVM deals well with non-informative and non-independent features in different languages.
Results:
- Worse values for PAGOD: written language? more difficult concepts to learn? more imbalanced dataset?
- Best experiments differ between datasets: written language? domain of the documents?

Concluding, we can say that Support Vector Machines deal well with non-informative and non-independent features in different languages. The best experiments differ between the two datasets; here we can pose two hypotheses: it may be due to the written language or to the domain of the documents (legal vs. news). The results are worse for the Portuguese dataset, and again different reasons come to mind: more difficult concepts to learn, a more imbalanced dataset, or the written language itself.

Future work
Explore the impact of:
- the imbalanced nature of the datasets
- the use of morpho-syntactical information
- other datasets
Try more powerful document representations.

And now some future work: study the impact of the imbalanced nature of these datasets; use morpho-syntactical information in the feature construction/reduction experiments (for example, use just the verbs or the nouns as features); explore other datasets to find the reason for the differing best experiments; and finally, try more powerful document representations than the bag-of-words, using, for example, word order and syntactical and/or semantic information from the documents.

Scoring functions
- scr1: term frequency. The score is the number of times the feature appears in the dataset.
- scr2: mutual information. It evaluates the worth of a feature A by measuring its mutual information with the class C, I(C;A).
- scr3: gain ratio. The worth is the attribute's gain ratio with respect to the class.
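The mutual information score I(C;A) for a discrete attribute can be sketched directly from its definition. This is an illustrative computation over (class, attribute) observations, not the implementation used in the paper.

```python
import math
from collections import Counter

def mutual_information(pairs):
    # pairs: list of (class_value, attribute_value) observations.
    # I(C;A) = sum over (c,a) of p(c,a) * log2( p(c,a) / (p(c) * p(a)) )
    n = len(pairs)
    joint = Counter(pairs)                     # joint counts of (c, a)
    pc = Counter(c for c, _ in pairs)          # marginal counts of c
    pa = Counter(a for _, a in pairs)          # marginal counts of a
    mi = 0.0
    for (c, a), count in joint.items():
        p_ca = count / n
        # p(c,a) / (p(c) p(a)) simplifies to count * n / (pc[c] * pa[a]).
        mi += p_ca * math.log2(count * n / (pc[c] * pa[a]))
    return mi
```

A perfectly informative binary attribute yields 1 bit of mutual information with a binary class, while an independent attribute yields 0, which is exactly the ranking behaviour the scr2 filter exploits.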