Text Classification With Labeled and Unlabeled Data Presenter: Aleksandar Milisic Supervisor: Dr. David Albrecht.

Presentation transcript:

Text Classification With Labeled and Unlabeled Data Presenter: Aleksandar Milisic Supervisor: Dr. David Albrecht

Overview
- Text Classification – What and Why?
- Text Clustering
- Support Vector Machines (SVMs) with Cluster Features
- Course of Project
- Results
- Conclusion
- Future Work

Text Classification – What and Why?
- Text Classification: assigning documents to predefined classes (categories)
- Labeling manually is time-consuming and sometimes impossible, so the process needs to be automated
- To minimize labeling, automated text classifiers need to be able to utilize unlabeled data

How Does It Work?
- Text documents are represented using feature vectors
- Documents (both labeled and unlabeled) are clustered into groups of similar documents
  - features representing each document's relationship to the created clusters are added to the feature vectors
- The augmented feature vectors are then classified by a Support Vector Machine (SVM) (Raskutti et al., 2002)
- This novel approach was the basis of this project

Representing Text
Example text: "… With paperless offices becoming more common, companies start using document databases with classification schemes …"
Feature vector (word counts):
  Companies  1
  Document   3
  Distance   0
  ...
  Offices    1
  Unix       0
  Match      0
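A minimal sketch of the word-count representation shown on this slide. The function name `word_count_vector` and the lowercase regular-expression tokenizer are assumptions made for illustration; the slides do not say how the text was tokenized. Only the visible fragment of the example text is used here, so the counts differ from the slide, whose source text is truncated.

```python
import re
from collections import Counter


def word_count_vector(text, vocabulary):
    """Return the count of each vocabulary word in text, in vocabulary order."""
    # Assumed tokenizer: lowercase the text and keep runs of letters.
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


# Vocabulary words taken from the slide; the text is the visible fragment only.
vocabulary = ["companies", "document", "distance", "offices", "unix", "match"]
text = ("With paperless offices becoming more common, companies start using "
        "document databases with classification schemes")

vector = word_count_vector(text, vocabulary)
for word, count in zip(vocabulary, vector):
    print(f"{word:10s} {count}")
```

Words outside the vocabulary are simply ignored, and vocabulary words that do not occur in the text get a count of 0.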

Clustering
[Diagram: labeled and unlabeled feature vectors grouped into clusters]

SVMs With Cluster Features
[Diagram: labeled feature vectors (Class 1), labeled feature vectors (Not Class 1) and unlabeled feature vectors, with the support vectors, the separating hyperplane and the added features marked]

Augmented Feature Vector
[Diagram: added features | original word frequencies]
Examples of added features:
- binary “closest cluster” indicator
- similarity to cluster centroids
- etc.
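The two example features named on this slide can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: cosine similarity is assumed as the similarity measure, and the helper names `cosine`, `cluster_features` and `augment` are hypothetical. The added features are placed in front of the original word frequencies, following the layout on the slide.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def cluster_features(vector, centroids):
    """Binary closest-cluster indicator followed by the similarity to each centroid."""
    sims = [cosine(vector, c) for c in centroids]
    closest = max(range(len(sims)), key=sims.__getitem__)
    indicator = [1.0 if i == closest else 0.0 for i in range(len(sims))]
    return indicator + sims


def augment(vector, centroids):
    """Added cluster features first, then the original word frequencies."""
    return cluster_features(vector, centroids) + list(vector)


# Toy data, invented for illustration only.
document = [1, 0, 2]
centroids = [[1, 0, 1], [0, 3, 0]]
augmented = augment(document, centroids)
print(augmented)
```

With two centroids, each document gains four features: two indicator values and two similarities.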

Areas of Investigation
Investigated the following questions:
- The value of added cluster features
- Performance of different clustering algorithms:
  - Single-Pass (Raskutti et al., 2002)
  - Snob (Wallace and Boulton, 1968)
- Transductive Support Vector Machines (TSVMs) with clustering
- Different factors influencing performance
  - type of features, number of clusters etc.
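The slides do not spell out the Single-Pass algorithm, so the following is only a generic single-pass (leader-style) clustering sketch, which may differ from the variant of Raskutti et al. (2002). The assumptions are: cosine similarity, a fixed similarity `threshold`, and centroids kept as running means. Each document is visited once and either joins its most similar cluster or starts a new one.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def single_pass_cluster(vectors, threshold=0.5):
    """Cluster vectors in one pass; return (cluster index per vector, centroids)."""
    centroids = []    # running mean vector of each cluster
    sizes = []        # number of members in each cluster
    assignments = []
    for vec in vectors:
        # Find the most similar existing cluster, if any.
        best, best_sim = -1, -1.0
        for i, centroid in enumerate(centroids):
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best, best_sim = i, sim
        if best >= 0 and best_sim >= threshold:
            # Join the cluster and update its running mean.
            n = sizes[best]
            centroids[best] = [(c * n + x) / (n + 1)
                               for c, x in zip(centroids[best], vec)]
            sizes[best] = n + 1
            assignments.append(best)
        else:
            # Too dissimilar to every cluster: start a new one.
            centroids.append([float(x) for x in vec])
            sizes.append(1)
            assignments.append(len(centroids) - 1)
    return assignments, centroids


# Toy data, invented for illustration only.
docs = [[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9]]
assignments, centroids = single_pass_cluster(docs, threshold=0.5)
print(assignments, len(centroids))
```

The result depends on the order in which documents are visited and on the threshold, which indirectly controls the number of clusters.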

Course of Project
- Implemented Single-Pass clustering algorithm and tested with SVMs
  - using variations on number and type of features added
- Combined Snob with SVMs
  - using different attribute types
- Tested the approaches on two different data sets
  - with random splits containing 1%, 5% and 10% labeled data out of the whole training set
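The random splits described on this slide could be drawn along these lines. This is a sketch only: the function name `labeled_split`, the seed, and the training-set size of 1000 are hypothetical, since the slides do not give the sizes of the training sets or say how the splits were generated.

```python
import random


def labeled_split(n_documents, labeled_fraction, seed=0):
    """Randomly choose which training documents keep their labels.

    Returns (labeled indices, unlabeled indices), both sorted.
    """
    rng = random.Random(seed)
    indices = list(range(n_documents))
    rng.shuffle(indices)
    # Keep at least one labeled document even for tiny training sets.
    n_labeled = max(1, round(n_documents * labeled_fraction))
    return sorted(indices[:n_labeled]), sorted(indices[n_labeled:])


# Hypothetical training-set size of 1000 documents.
splits = {fraction: labeled_split(1000, fraction, seed=42)
          for fraction in (0.01, 0.05, 0.10)}
for fraction, (labeled, unlabeled) in splits.items():
    print(f"{fraction:.0%}: {len(labeled)} labeled, {len(unlabeled)} unlabeled")
```

Fixing the seed makes a split reproducible; using several seeds per fraction would give several random splits to average over.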

Initial Results
- Initial results showed that adding cluster features actually degrades SVM performance.
- Various slightly modified versions of the Single-Pass clustering algorithm, as well as Snob, were tested, all giving negative results when combined with SVMs.
- However, one approach showed an improvement...

Partitioning the Data
- Training set is divided into k partitions, with each partition being clustered separately
  - features are added to documents relative to the k sets of clusters
  - k partitions means k × the number of cluster features
  - used k = 5 in experiments
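A rough sketch of the partitioning idea, under several assumptions that are not in the slides: partitions are drawn at random, cosine similarity to each centroid is the added feature, and the clustering routine is passed in as `cluster_fn`. The stand-in used here returns a single mean centroid per partition purely so the example runs; any routine that returns a list of centroids could be substituted.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def partitioned_cluster_features(vectors, k, cluster_fn, seed=0):
    """Cluster k random partitions separately; describe every document
    relative to all k sets of centroids."""
    rng = random.Random(seed)
    order = list(range(len(vectors)))
    rng.shuffle(order)
    # Deal the shuffled indices into k partitions of near-equal size.
    partitions = [order[i::k] for i in range(k)]
    centroid_sets = [cluster_fn([vectors[i] for i in part]) for part in partitions]
    features = []
    for vec in vectors:
        row = []
        for centroids in centroid_sets:
            row.extend(cosine(vec, c) for c in centroids)
        features.append(row)
    return features


# Toy data and a placeholder clustering routine (one centroid per partition).
docs = [[i % 3 + 1, (i * 2) % 5, 1] for i in range(10)]
features = partitioned_cluster_features(
    docs, k=5, cluster_fn=lambda part: [mean_vector(part)])
print(len(features), len(features[0]))
```

Each document ends up with k times as many cluster features as a single clustering would give, which matches the "k × the number of cluster features" point on the slide.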

Results
[Plot: average number of bits for test set of size 600, by amount of labeled data]

Results (cont.)
[Plot: average number of bits for test set of size 3299, by amount of labeled data]

Conclusion
- Results suggest that performance of SVMs depends on:
  - number of features
  - type of features
  - clustering method
- Partitioning the data:
  - improves the quality of the features
  - improves overall performance
- Issues with use of Snob and clustering in general in text classification

Future Work
- Extending the SVM+Cluster approach to multi-labeled classification
- Investigating new sets of cluster features
- Determining:
  - optimal number of clusters used for adding cluster features
  - optimal number of partitions
- Investigating better methods of using Snob in text classification

References
Raskutti, B., Ferra, H. and Kowalczyk, A. (2002). “Using Unlabeled Data for Text Classification through Addition of Cluster Parameters”, In International Conference on Machine Learning (Accepted)
Wallace, C.S. and Boulton, D.M. (1968). “An Information Measure for Classification”, Computer Journal, Vol. 11, No. 2, pp