Active, Semi-Supervised Learning for Textual Information Access. Anastasia Krithara¹, Cyril Goutte², Massih-Reza Amini³, Jean-Michel Renders¹

Presentation transcript:

Active, Semi-Supervised Learning for Textual Information Access
Anastasia Krithara¹, Cyril Goutte², Massih-Reza Amini³, Jean-Michel Renders¹
07/07/06, International Workshop on Intelligent Information Access, Helsinki 2006
¹ Xerox Research Centre Europe, 6 chemin de Maupertuis, Meylan, FRANCE
² National Research Council Canada, Institute for Information Technology, Interactive Language Technologies Group, 101 St-Jean-Bosco Street, Gatineau, QC K1A 0R6, CANADA
³ Department of Computer Science, University of Paris VI, 8 rue du Capitaine Scott, Paris, FRANCE

Introduction (1)
Supervised learning: unlabeled examples go through an annotation process to produce labeled examples, which are then used to train the model (the classifier).
Problem: the annotation process is often costly and time-consuming.

Introduction (2)
Solutions:
- Semi-Supervised Learning
- Active Learning
Both address the same problem, but from different perspectives.

Outline
- The problem / solutions
- Our method: Active, Semi-Supervised PLSA
- Experiments
- Conclusions / Future work

Semi-Supervised Learning (SSL)
Given:
- a small set of labeled data L
- a large set of unlabeled data U
Train a model M on L ∪ U.
Unlabeled data can give us some valuable information about P(x).

Active Learning
Given:
- a small set of labeled data L
- a large set of unlabeled data U
Repeat:
- Train a model M on L
- Use M to test U
- Select the most useful example from U
- Ask the human expert to label it
- Add the labeled example to L
Until M reaches a certain performance level or a certain number of queries has been made.
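A minimal sketch of this pool-based loop, assuming a scikit-learn style classifier and a query_oracle stub standing in for the human expert (both the classifier choice and the stub are illustrative assumptions, not part of the slides):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def active_learning_loop(X_labeled, y_labeled, X_pool, query_oracle, n_queries=50):
    """Pool-based active learning with uncertainty sampling (binary case,
    dense numpy arrays assumed for simplicity)."""
    X_l, y_l = X_labeled.copy(), list(y_labeled)
    pool = list(range(X_pool.shape[0]))            # indices of still-unlabeled examples
    model = LogisticRegression(max_iter=1000)

    for _ in range(n_queries):
        model.fit(X_l, y_l)                        # train the model M on L
        probs = model.predict_proba(X_pool[pool])[:, 1]
        idx = int(np.argmin(np.abs(probs - 0.5)))  # most uncertain = most useful example
        chosen = pool.pop(idx)
        label = query_oracle(chosen)               # ask the human expert for the label
        X_l = np.vstack([X_l, X_pool[chosen]])     # add the newly labeled example to L
        y_l.append(label)
    return model
```

The selection line already uses the uncertainty criterion introduced later in the talk (probability closest to 0.5); any other notion of "most useful example" can be plugged in at that point.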

Combination of SSL and Active Learning
Given:
- a small set of labeled data L
- a large set of unlabeled data U
Repeat:
- Train a model M on L ∪ U (Semi-Supervised Learning)
- Use M to test U
- Select the most useful example from U
- Ask the human expert to label it
- Add the labeled example to L
Until M reaches a certain performance level or a certain number of queries has been made.
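Relative to the previous sketch, only the training step changes: it now uses both L and U. The illustration below uses scikit-learn's LabelPropagation purely as a generic stand-in for a semi-supervised learner; the model actually proposed in this talk is the semi-supervised PLSA described on the next slides.

```python
import numpy as np
from sklearn.semi_supervised import LabelPropagation

def train_semi_supervised(X_l, y_l, X_u):
    """Train on L ∪ U; unlabeled points are marked with the label -1."""
    X = np.vstack([X_l, X_u])
    y = np.concatenate([np.asarray(y_l), -np.ones(X_u.shape[0], dtype=int)])
    return LabelPropagation().fit(X, y)
```

The rest of the active-learning loop (testing U, querying, updating L) is unchanged.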

Active, Semi-Supervised PLSA (1)
We represent our document collection as a term-by-document matrix, i.e. as word-document co-occurrence counts n(w, d).
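For concreteness, a small sketch of how such a count matrix can be built with scikit-learn (the toy documents and the vectorizer settings are illustrative assumptions):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the pitcher threw a fastball",
        "the goalie stopped the puck",
        "the batter hit a home run"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)   # documents x terms; entry (d, w) is the count n(w, d)
print(X.shape)                       # (3, vocabulary size)
print(vectorizer.get_feature_names_out())
```

Note that scikit-learn returns a documents-by-terms matrix, i.e. the transpose of the term-by-document view, which is immaterial here.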

Active, Semi-Supervised PLSA (2)
Problem:
- Synonyms: different words with the same meaning
- Polysemes: words with multiple meanings
- Disconnection between topics and words
Solution: PLSA (Probabilistic Latent Semantic Analysis) aims to discover something about the meaning behind the words, in other words about the topics of the documents.

Active, Semi-Supervised PLSA (3)
We model our data by a mixture model, under the assumption that d and w are conditionally independent given the latent component:

    P(w, d) = Σ_c P(c) P(w|c) P(d|c)

where P(w|c) gives the word profile of a component and P(c|d) (obtained from P(c) and P(d|c) by Bayes' rule) gives the topics present in a document; c = 1 … K is the index over the K latent components.

Active, Semi-Supervised PLSA (4)
When the ratio of labeled to unlabeled documents is very low, some components contain only unlabeled examples. In that case arbitrary probabilities would be assigned to those components, which would lead to arbitrary decisions during classification.
Solution: introduce an additional "fake label" value z = L0:
- All labeled examples keep their label
- All unlabeled examples get the new "label" L0
After training the model, the probability mass obtained for the "fake" label L0 is redistributed onto the "real" labels.

Active, Semi-Supervised PLSA (5)
Taking the label z into account, our model becomes:

    P(w, d, z) = Σ_c P(c) P(w|c) P(d|c) P(z|c)

where c = 1 … K is the index over the K latent components. We then use a variant of the EM algorithm to train this multinomial mixture model. The (log-)likelihood of the data is:

    L = Σ_d Σ_w n(w, d) log P(w, d, z(d))

where z(d) is the (unique) label of document d and n(w, d) is the number of occurrences of word w in document d.

Active, Semi-Supervised PLSA (6)
The EM algorithm:
- E-step: compute the posterior probability of each latent component c for every (document, word, label) triple, from the current estimates of P(c), P(w|c), P(d|c) and P(z|c).
- M-step: re-estimate these multinomial parameters from the counts n(w, d), weighted by the E-step posteriors.
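The update formulas shown on the original slide are images and are not reproduced in this transcript. The sketch below implements the standard EM updates that follow from the mixture defined on the previous slide; it is an editor's reconstruction (dense arrays, random initialisation, no fake-label redistribution), not the authors' code.

```python
import numpy as np

def em_labeled_plsa(N, z, n_components, n_labels, n_iter=50, seed=0):
    """EM for the mixture P(w, d, z) = sum_c P(c) P(w|c) P(d|c) P(z|c).

    N : (D, W) array of counts n(w, d); z : length-D integer array of labels,
    where the "fake" label L0 of unlabeled documents is simply one more index.
    Dense responsibilities are kept for clarity, so this only suits small data.
    """
    rng = np.random.default_rng(seed)
    D, W = N.shape
    K = n_components

    # random initialisation of the multinomial parameters
    p_c = np.full(K, 1.0 / K)
    p_w_c = rng.dirichlet(np.ones(W), size=K)         # (K, W)
    p_d_c = rng.dirichlet(np.ones(D), size=K)         # (K, D)
    p_z_c = rng.dirichlet(np.ones(n_labels), size=K)  # (K, n_labels)

    for _ in range(n_iter):
        # E-step: R[d, w, c] = P(c | d, w, z(d)), proportional to the product
        # of the current parameter estimates.
        R = (p_c[None, None, :]
             * p_w_c.T[None, :, :]
             * p_d_c.T[:, None, :]
             * p_z_c.T[z][:, None, :])
        R /= R.sum(axis=2, keepdims=True) + 1e-12

        # M-step: re-estimate every multinomial from posterior-weighted counts.
        C = N[:, :, None] * R                  # expected counts, shape (D, W, K)
        p_w_c = C.sum(axis=0).T                # (K, W)
        p_d_c = C.sum(axis=1).T                # (K, D)
        p_c = C.sum(axis=(0, 1))               # (K,)
        p_z_c = np.stack([C[z == lbl].sum(axis=(0, 1))
                          for lbl in range(n_labels)], axis=1)
        p_w_c /= p_w_c.sum(axis=1, keepdims=True)
        p_d_c /= p_d_c.sum(axis=1, keepdims=True)
        p_z_c /= p_z_c.sum(axis=1, keepdims=True)
        p_c /= p_c.sum()

    return p_c, p_w_c, p_d_c, p_z_c
```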

Active, Semi-Supervised PLSA (7)
On top of the SSL model, we add Active Learning: at each round we query the unlabeled example the model is most uncertain about, i.e. the example with the highest entropy of the class posterior. In the binary case, this is the example whose class probability is closest to 0.5.
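A small sketch of this selection rule, assuming the trained model exposes class posteriors P(y|d) for the unlabeled documents (the array layout is an assumption):

```python
import numpy as np

def select_query(posteriors):
    """posteriors: (n_unlabeled, n_classes) array of P(y | d) for the pool U.
    Returns the index of the document with the highest class-posterior entropy."""
    eps = 1e-12
    entropy = -(posteriors * np.log(posteriors + eps)).sum(axis=1)
    return int(np.argmax(entropy))

# In the binary case, the highest-entropy document is the one whose probability
# is closest to 0.5: np.argmin(np.abs(posteriors[:, 1] - 0.5)) picks the same example.
```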

Experimental Setting (1)
Corpus: 3 binary problems from the 20-newsgroups dataset:
- rec.sport.baseball (994) vs. rec.sport.hockey (999)
- comp.sys.ibm.pc.hardware (982) vs. comp.sys.mac.hardware (961)
- talk.religion.misc (628) vs. alt.atheism (799)
They represent easy, moderate and hard problems, respectively.
Corpus split:
- 80% for the training set: 2 labeled examples (one of each category), the rest unlabeled
- 20% for the test set (for an unbiased estimate of the accuracy)
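A sketch of how such a split can be reproduced with scikit-learn's 20-newsgroups loader; the loader options, random seed and vectorization are assumptions, so the counts will not match the slide exactly:

```python
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

cats = ["rec.sport.baseball", "rec.sport.hockey"]    # the "easy" problem
data = fetch_20newsgroups(subset="all", categories=cats,
                          remove=("headers", "footers", "quotes"))
X = CountVectorizer().fit_transform(data.data)
y = np.asarray(data.target)

# 80% training / 20% held-out test
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# keep just 2 labeled training examples (one per category); the rest are unlabeled
labeled_idx = [int(np.where(y_tr == c)[0][0]) for c in (0, 1)]
unlabeled_idx = [i for i in range(X_tr.shape[0]) if i not in labeled_idx]
```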

Experimental Setting (2)
Comparison of the following methods:
- Semi-Supervised PLSA + Active Learning
- Semi-Supervised PLSA + Random Query
- SVM + Active Learning (choosing the examples closest to the margin)
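For reference, a sketch of the margin-based query rule used by the SVM baseline, with scikit-learn's LinearSVC as an illustrative stand-in for whichever SVM implementation was actually used:

```python
import numpy as np
from sklearn.svm import LinearSVC

def svm_margin_query(X_l, y_l, X_pool):
    """Return the index of the pool example closest to the SVM decision boundary."""
    svm = LinearSVC().fit(X_l, y_l)
    margins = np.abs(svm.decision_function(X_pool))
    return int(np.argmin(margins))

# The Random Query baseline simply replaces this choice with a uniformly random
# index, e.g. np.random.default_rng().integers(X_pool.shape[0]).
```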

Experimental Results: Baseball vs. Hockey
Comparison of the active semi-supervised PLSA (top) with semi-supervised PLSA querying random examples (middle) and SVM querying the examples closest to the margin (bottom).

Experimental Results: PC vs. Mac

Experimental Results: Religion vs. Atheism

Conclusions
- We proposed a method that combines SSL and active learning using PLSA
- The combination outperforms semi-supervised PLSA alone
- The harder the problem, the more active learning helps

Future Work
- More experiments with different datasets
- Use different Active Learning methods
- Take different costs into account
- Apply our method to multiclass problems
- …

Thank you. Questions?