Active Learning: An Example from Xu et al., "Training SpamAssassin with Active Semi-Supervised Learning"

Semi-Supervised and Active Learning
– Semi-supervised learning: using a combination of labeled and unlabeled examples, or using partially labeled examples.
– Active learning: having the learning system decide which examples to ask an oracle (e.g., a human user) to label.
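To make the distinction concrete, here is a minimal sketch of an uncertainty-sampling active-learning loop. It is not from the paper; the classifier choice, the `oracle` callable, and the query budget are illustrative assumptions:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def active_learning_loop(X_labeled, y_labeled, X_pool, oracle, n_queries=10):
    """Repeatedly ask the oracle to label the pool example the model
    is least confident about, then retrain. (Illustrative sketch.)"""
    clf = MultinomialNB()
    for _ in range(n_queries):
        clf.fit(X_labeled, y_labeled)
        probs = clf.predict_proba(X_pool)
        # Uncertainty sampling: pick the example whose top-class
        # probability is lowest.
        idx = np.argmin(probs.max(axis=1))
        X_labeled = np.vstack([X_labeled, X_pool[idx:idx + 1]])
        y_labeled = np.append(y_labeled, oracle(X_pool[idx]))  # oracle: assumed labeling callback
        X_pool = np.delete(X_pool, idx, axis=0)
    return clf.fit(X_labeled, y_labeled)
```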

SpamAssassin
SpamAssassin:
– Asks users to label email, but they often don't do it.
– Also, they may not label the "most informative" examples.
SpamAssassin "self-training":
– Train a classifier on a small number of labeled examples.
– Run the classifier on unlabeled examples. Add the ones classified with high confidence to the original training set. (Problem: the ones classified with high confidence are not necessarily the most informative ones.)
– Retrain the classifier with the new, larger training set.
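A minimal sketch of this self-training loop, assuming a MultinomialNB spam classifier and a fixed confidence threshold (both illustrative choices, not details from SpamAssassin itself):

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def self_train(X_labeled, y_labeled, X_unlabeled, threshold=0.95, max_rounds=5):
    """Grow the training set with the classifier's own confident
    predictions, then retrain. (Illustrative sketch.)"""
    clf = MultinomialNB()
    for _ in range(max_rounds):
        clf.fit(X_labeled, y_labeled)
        probs = clf.predict_proba(X_unlabeled)
        confident = probs.max(axis=1) >= threshold  # threshold is an assumed knob
        if not confident.any():
            break  # nothing new to add
        # Pseudo-label the high-confidence examples and move them
        # from the unlabeled pool into the training set.
        y_new = clf.classes_[probs[confident].argmax(axis=1)]
        X_labeled = np.vstack([X_labeled, X_unlabeled[confident]])
        y_labeled = np.append(y_labeled, y_new)
        X_unlabeled = X_unlabeled[~confident]
        if len(X_unlabeled) == 0:
            break
    return clf.fit(X_labeled, y_labeled)
```

Note that the loop only ever adds examples the current model already agrees with, which is exactly the weakness the slide points out: confident examples are rarely the most informative ones.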

Xu et al. paper: Method
– Supervised learning: train a Naive Bayes classifier on a small subset of (labeled) emails.
– Semi-supervised learning: then run SpamAssassin's self-training method, selecting a large number of new examples to add to the training set. Retrain the classifier.
– Active learning: cluster the remaining unlabeled emails using k-means (on term-frequency feature vectors) with Euclidean distance. Select q representative unlabeled emails, first from "pure" clusters, then from "impure" clusters, making sure that many clusters are sampled from. The emails selected from each cluster are the ones closest to the cluster centroids. Ask the user to label these q examples. For each of these q examples, if the corresponding cluster is "pure", propagate the label to a fraction p of the emails in that cluster. Add the newly labeled examples to the training set, and retrain the classifier.
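A sketch of the cluster-then-select step, using scikit-learn's KMeans on dense term-frequency vectors. The purity test, the pure-before-impure sampling order, and the propagation rule are simplified stand-ins for the paper's actual definitions; `q`, `p`, and the cluster count are illustrative parameters:

```python
import numpy as np
from sklearn.cluster import KMeans

def select_representatives(X_tf, n_clusters=50, q=20):
    """Cluster TF vectors with k-means and pick the email nearest each
    centroid as that cluster's representative, up to q emails total."""
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(X_tf)
    picks = []
    for c in range(n_clusters):
        members = np.flatnonzero(km.labels_ == c)
        if len(members) == 0:
            continue
        dists = np.linalg.norm(X_tf[members] - km.cluster_centers_[c], axis=1)
        picks.append(members[np.argmin(dists)])
    return km, picks[:q]  # one representative per cluster, q total

def propagate_label(km, X_tf, cluster_id, user_label, p=0.5):
    """If a cluster is treated as 'pure', copy the user's label to the
    fraction p of its members closest to the centroid (simplified rule)."""
    members = np.flatnonzero(km.labels_ == cluster_id)
    dists = np.linalg.norm(X_tf[members] - km.cluster_centers_[cluster_id], axis=1)
    keep = members[np.argsort(dists)[: max(1, int(p * len(members)))]]
    return keep, np.full(len(keep), user_label)
```

The design intuition is that centroid-nearest emails are the most typical of their cluster, so a single user-provided label is likely to hold for the cluster's core, letting one query yield many training examples.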

Xu et al. paper: Results
Ran on a large corpus (75K) of emails.