Email Classification using Co-Training
Harshit Chopra (02005001)
Kartik Desikan (02005103)
Motivation
- The Internet has grown vastly, with widespread usage of email
- Users receive many emails per day (~100)
- Need to organize emails according to user convenience
Classification
- Given training data {(x1, y1), …, (xn, yn)}
- Produce a classifier h : X → Y
- h maps each object xi in X to its classification label yi in Y
Features
- A feature is a measurement of some aspect of the given data
- Generally represented as binary functions
- E.g.: presence of hyperlinks is a feature; presence is indicated by 1, absence by 0
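The slide's own example, the presence-of-hyperlinks feature, can be sketched as a tiny binary feature function (the function name and the substring test are illustrative assumptions):

```python
def has_hyperlink(email_text):
    """Binary feature from the slide's example:
    1 if the text contains a hyperlink, 0 otherwise."""
    return 1 if ("http://" in email_text or "https://" in email_text) else 0
```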
Email Classification
- Emails can be classified as: spam vs. non-spam, important vs. non-important, etc.
- Text is represented as a bag of words:
  - words from headers
  - words from bodies
Supervised Learning
- Main tool for email management: text classification
- Text classification uses supervised learning
- Examples belonging to different classes are given as a training set
- E.g.: emails labeled as interesting and uninteresting
- Learning systems induce a general description of these classes
Supervised Classifiers
- Naïve Bayes classifier
- Support Vector Machines (SVM)
Naïve Bayes
- According to Bayes' theorem:
    P(c | d) = P(d | c) P(c) / P(d)
- Under the naïve conditional independence assumption:
    P(d | c) = P(w1 | c) P(w2 | c) … P(wn | c)
  where w1, …, wn are the words of document d
Naïve Bayes (contd.)
- Classify a text document d by choosing the class that maximizes the posterior:
    c* = argmax_c P(c) ∏i P(wi | c)
- E.g.: Class = {Interesting, Junk}
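A minimal sketch of this classification rule, assuming a multinomial bag-of-words model with Laplace smoothing (the smoothing choice and function names are assumptions, not stated on the slides):

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (tokens, label) pairs.
    Returns log-priors and Laplace-smoothed log-likelihoods."""
    class_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)   # label -> word frequency table
    vocab = set()
    for tokens, label in docs:
        word_counts[label].update(tokens)
        vocab.update(tokens)
    priors = {c: math.log(n / len(docs)) for c, n in class_counts.items()}
    likelihoods = {}
    for c in class_counts:
        total = sum(word_counts[c].values())
        likelihoods[c] = {w: math.log((word_counts[c][w] + 1) / (total + len(vocab)))
                          for w in vocab}
    return priors, likelihoods

def classify_nb(tokens, priors, likelihoods):
    """argmax over classes of log P(c) + sum_i log P(w_i | c);
    words unseen in training contribute nothing."""
    def score(c):
        return priors[c] + sum(likelihoods[c].get(w, 0.0) for w in tokens)
    return max(priors, key=score)
```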
Linear SVM
- Constructs a hyperplane separating the data into 2 classes with maximum margin
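A hand-rolled sketch of a linear maximum-margin classifier, trained by subgradient descent on the regularized hinge loss (a real experiment would use a standard SVM package; the learning rate, penalty, and epoch count here are arbitrary assumptions):

```python
def train_linear_svm(X, y, epochs=200, lr=0.05, lam=0.001):
    """Minimise (lam/2)*||w||^2 + hinge loss by subgradient descent.
    X: list of feature vectors, y: labels in {-1, +1}."""
    dim = len(X[0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:   # point inside the margin: push it out
                w = [wj - lr * (lam * wj - yi * xj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:            # correctly classified: only the L2 shrink applies
                w = [wj - lr * lam * wj for wj in w]
    return w, b

def predict(w, b, x):
    """Side of the hyperplane the point falls on."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1
```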
Problems with Supervised Learning
- Requires a lot of training data for accurate classification
- E.g.: Microsoft Outlook Mobile Manager requires ~600 labeled emails for best performance
- Labeling is a very tedious job for the average user
One Solution
- Observe the user's behavior to determine important emails
- E.g.: deleting mails without reading them might be an indication of junk
- Not reliable, and requires time to gather information
Another Solution
- Use semi-supervised learning: a few labeled examples coupled with a large amount of unlabeled data
- Possible algorithms:
  - Transductive SVM
  - Expectation Maximization (EM)
  - Co-Training
Co-Training Algorithm
- Based on the idea that some features are redundant – "redundantly sufficient"
- Step 1: Split the features into two sets, F1 and F2
- Step 2: Train two independent classifiers, C1 and C2, one for each feature set, producing two initial weak classifiers from the minimal labeled data
Co-Training Algorithm (contd.)
- Step 3: Each classifier Ci examines the unlabeled examples and labels them
- Step 4: Add the most confident positive and negative examples to the set of labeled examples
- Step 5: Loop back to Step 2
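The loop above (split the features into two views, train one classifier per view, let each add its most confident labelings to the shared pool, retrain) can be sketched as follows. The tiny Naïve Bayes helper, the round count, and picking one example per class per view are illustrative assumptions, not details from the paper:

```python
import math
from collections import Counter

def nb_fit(docs):
    """Tiny Naive Bayes trainer; docs = [(tokens, label), ...]."""
    labels = Counter(y for _, y in docs)
    words = {c: Counter() for c in labels}
    vocab = set()
    for toks, y in docs:
        words[y].update(toks)
        vocab.update(toks)
    return labels, words, vocab, len(docs)

def nb_proba(model, toks):
    """Posterior class probabilities with Laplace smoothing."""
    labels, words, vocab, n = model
    logs = {}
    for c in labels:
        total = sum(words[c].values())
        logs[c] = math.log(labels[c] / n) + sum(
            math.log((words[c][t] + 1) / (total + len(vocab))) for t in toks)
    top = max(logs.values())
    exps = {c: math.exp(v - top) for c, v in logs.items()}
    z = sum(exps.values())
    return {c: e / z for c, e in exps.items()}

def co_train(labeled, unlabeled, classes, rounds=3, per_class=1):
    """Co-training skeleton.
    labeled:   [((view1_tokens, view2_tokens), label), ...]
    unlabeled: [(view1_tokens, view2_tokens), ...]
    Each round, the classifier for each view labels its most confident
    unlabeled example(s) per class, which join the shared labeled pool."""
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(rounds):
        if not unlabeled:
            break
        m1 = nb_fit([(views[0], y) for views, y in labeled])
        m2 = nb_fit([(views[1], y) for views, y in labeled])
        picked = {}  # unlabeled index -> assigned class
        for view, model in ((0, m1), (1, m2)):
            for c in classes:
                candidates = [i for i in range(len(unlabeled)) if i not in picked]
                ranked = sorted(candidates,
                                key=lambda i: -nb_proba(model, unlabeled[i][view])[c])
                for i in ranked[:per_class]:
                    picked[i] = c
        labeled.extend((unlabeled[i], c) for i, c in picked.items())
        unlabeled = [u for i, u in enumerate(unlabeled) if i not in picked]
    # Retrain both view classifiers on the enlarged labeled pool
    m1 = nb_fit([(views[0], y) for views, y in labeled])
    m2 = nb_fit([(views[1], y) for views, y in labeled])
    return m1, m2
```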
Intuition behind Co-Training
- One classifier confidently predicts the class of an unlabeled example
- This in turn provides one more training example to the other classifier
- E.g.: two messages with subjects:
  - "Company meeting at 3 pm"
  - "Meeting today at 5 pm?"
Intuition behind Co-Training (contd.)
- Given: the 1st message is classified as "meeting"
- Then the 2nd message is also classified as "meeting" by the subject-based classifier
- The two messages are likely to have different body content
- Hence the classifier based on words in the body is provided with an extra training example
Co-Training on Email Classification
- Assumption: presence of redundantly sufficient features describing the data
- Two sets of features:
  - words from subject lines (header)
  - words from email bodies
Experiment Setup
- 1500 emails in 3 folders, divided as 250, 500 and 750 emails
- Division into 3 classification problems where the ratio of positive : negative examples is:
  - 1:5 (highly imbalanced problem)
  - 1:2 (moderately imbalanced problem)
  - 1:1 (balanced problem)
- Expected: the larger the imbalance, the worse the learning results
Experiment Setup (contd.)
- For each task, 25% of the examples are held out as a test set; the rest form the training set, with both labeled and unlabeled data
- Labeled data is selected randomly
- Stop words are removed and a stemmer is applied
- The pre-processed words form the feature set; for each feature/word, the word's frequency is the feature value
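The pre-processing pipeline on this slide (tokenize, remove stop words, stem, count frequencies) can be sketched as below; the stop-word list and the crude suffix-stripping rule are stand-ins for the real resources used (the paper would use a full stop-word list and a proper stemmer such as Porter's):

```python
import re
from collections import Counter

# Illustrative stop-word list; a real one is much longer
STOP_WORDS = {"the", "a", "an", "and", "in", "at", "of", "to", "is"}

def stem(word):
    """Crude suffix stripper standing in for a real stemmer (e.g. Porter)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def features(text):
    """Bag-of-words features: pre-processed word -> frequency."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(stem(t) for t in tokens if t not in STOP_WORDS)
```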
Experiment Run
- 50 co-training iterations; appropriate training examples supplied in each iteration
- Results represent the average of 10 runs of the training
- Learning algorithms used: Naïve Bayes, SVM
Result 1: Co-Training with Naïve Bayes
Absolute difference between accuracy in the 1st and 50th iteration:

                               Subject-based   Body-based   Combined
1:5 (highly imbalanced)           -17.11%       -19.78%      -19.64%
1:2 (moderately imbalanced)        -9.41%        -1.20%       -1.23%
1:1 (balanced)                      0.78%        -0.53%       -0.64%
Inference from Result 1
- Naïve Bayes performs badly
- Possible reasons:
  - Violation of conditional independence between the feature sets: subjects and bodies are not redundant and need to be used together
  - Great sparseness among the feature values
Result 2: Co-Training with Naïve Bayes and Feature Selection
Absolute difference between accuracy in the 1st and 50th iteration:

                               Subject-based   Body-based   Combined
1:5 (highly imbalanced)            -1.09%        -1.82%       -1.96%
1:2 (moderately imbalanced)         1.62%         4.31%        3.89%
1:1 (balanced)                      1.54%         6.78%        3.81%
Inference from Result 2
- Naïve Bayes with feature selection works a lot better
- Feature sparseness is the likely cause of the poor result of co-training with plain Naïve Bayes
Result 3: Co-Training with SVM
Absolute difference between accuracy in the 1st and 50th iteration:

                               Subject-based   Body-based   Combined
1:5 (highly imbalanced)             0.28%         1.80%
1:2 (moderately imbalanced)        14.89%        22.47%       17.42%
1:1 (balanced)                     12.36%        15.84%       18.15%
Inference from Result 3
- SVM clearly outperforms Naïve Bayes
- SVM works well for very large feature sets
Conclusion
- Co-training can be applied to email classification
- Its success depends on the learning method used
- SVM performs quite well as a learning method for email classification
References
- S. Kiritchenko, S. Matwin: Email Classification with Co-Training. In Proc. of CASCON, 2001.
- A. Blum, T. M. Mitchell: Combining Labeled and Unlabeled Data with Co-Training. In Proc. of the 11th Annual Conference on Computational Learning Theory, 1998.
- K. Nigam, R. Ghani: Analyzing the Effectiveness and Applicability of Co-training. In Proc. of the 9th International Conference on Information and Knowledge Management, 2000.
- http://en.wikipedia.org/wiki