The Power of Word Clusters for Text Classification
Noam Slonim and Naftali Tishby
Presented by: Yangzhe Xiao

Word clusters vs. words:
– Reduced feature dimensionality.
– More robust representation.
– 18% increase in classification accuracy.
Challenge: group similar words into word clusters that preserve the information about the document categories. Proposed solution: the Information Bottleneck (IB) method.
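As a concrete illustration of the feature mapping (a minimal sketch, not from the paper; word_to_cluster is a hypothetical output of the clustering procedure described in the following slides), each document's word counts are summed into word-cluster counts, shrinking the feature space:

```python
from collections import Counter

def cluster_features(doc_tokens, word_to_cluster, num_clusters):
    """Map a tokenized document from word counts to word-cluster counts.

    word_to_cluster: dict assigning each vocabulary word to a cluster id
    (assumed to come from an IB-style word clustering).
    """
    counts = [0] * num_clusters
    for word, n in Counter(doc_tokens).items():
        if word in word_to_cluster:          # ignore out-of-vocabulary words
            counts[word_to_cluster[word]] += n
    return counts

# Toy example with a 3-cluster vocabulary
word_to_cluster = {"goal": 0, "match": 0, "election": 1, "senate": 1, "the": 2}
print(cluster_features(["the", "match", "goal", "goal"], word_to_cluster, 3))
# -> [3, 0, 1]
```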

The IB method is based on the following idea: given the empirical joint distribution of two variables, one variable is compressed so that the mutual information about the other variable is preserved as much as possible. Concretely, we look for clusters of the members of the set X, denoted here by X̃, such that the mutual information I(X̃; Y) is maximized, under a constraint on the information extracted from X, I(X̃; X).
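For reference, this trade-off can be written as the standard IB variational principle (the usual formulation from the IB literature, written out here rather than copied from the slides):

```latex
% Information Bottleneck variational principle:
% compress X into \tilde{X} while preserving information about Y.
% Minimizing \mathcal{L} over the assignment p(\tilde{x} \mid x) trades off
% compression, I(\tilde{X};X), against relevance, I(\tilde{X};Y).
\mathcal{L}\big[p(\tilde{x}\mid x)\big]
  \;=\; I(\tilde{X};X) \;-\; \beta\, I(\tilde{X};Y),
\qquad
I(\tilde{X};Y)
  \;=\; \sum_{\tilde{x},\,y} p(\tilde{x},y)\,
        \log\frac{p(\tilde{x},y)}{p(\tilde{x})\,p(y)} .
```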

The problem admits an exact formal solution, without any assumption about the origin of the joint distribution p(x, y).

The formal solution assigns each x to a cluster x̃ with probability
p(x̃ | x) = p(x̃) / Z(β, x) · exp( −β · D_KL[ p(y|x) ∥ p(y|x̃) ] ),
where D_KL is the Kullback-Leibler divergence between the conditional distributions p(y|x) and p(y|x̃), Z(β, x) is a normalization factor, and the single positive parameter β determines the softness of the classification.
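A minimal NumPy sketch of how this soft assignment could be evaluated (illustrative only, not the authors' code; it assumes the conditionals p(y|x), the cluster conditionals p(y|x̃), and the cluster priors p(x̃) are already available as arrays):

```python
import numpy as np

def soft_assignments(p_y_given_x, p_y_given_t, p_t, beta):
    """Evaluate p(x~|x) proportional to p(x~) * exp(-beta * KL(p(y|x) || p(y|x~))).

    p_y_given_x: (n_words, n_classes) array of conditionals p(y|x)
    p_y_given_t: (n_clusters, n_classes) array of cluster conditionals p(y|x~)
    p_t:         (n_clusters,) array of cluster priors p(x~)
    beta:        positive scalar controlling the softness of the assignment
    """
    eps = 1e-12  # avoid log(0)
    # KL(p(y|x) || p(y|x~)) for every (word, cluster) pair -> shape (n_words, n_clusters)
    kl = np.sum(
        p_y_given_x[:, None, :]
        * (np.log(p_y_given_x[:, None, :] + eps) - np.log(p_y_given_t[None, :, :] + eps)),
        axis=2,
    )
    unnorm = p_t[None, :] * np.exp(-beta * kl)
    # divide by the normalization factor Z(beta, x)
    return unnorm / unnorm.sum(axis=1, keepdims=True)
```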

Agglomerative IB Algorithm
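As a rough sketch of the idea (not the authors' implementation), the greedy bottom-up variant starts with every word in its own cluster and repeatedly merges the pair whose merger loses the least information about Y, using the weighted Jensen-Shannon divergence between the clusters' class conditionals as the merge cost:

```python
import numpy as np

def js_divergence(p, q, pi1, pi2):
    """Jensen-Shannon divergence of distributions p, q with weights pi1, pi2."""
    eps = 1e-12
    m = pi1 * p + pi2 * q
    kl = lambda a, b: np.sum(a * (np.log(a + eps) - np.log(b + eps)))
    return pi1 * kl(p, m) + pi2 * kl(q, m)

def agglomerative_ib(p_t, p_y_given_t, n_clusters):
    """Greedy agglomerative IB: merge, at each step, the pair of clusters
    with the smallest information loss
        d(i, j) = (p_i + p_j) * JS(p(y|i), p(y|j)).
    Returns the members (original word indices) of each final cluster."""
    p_t = list(p_t)                                # cluster priors p(x~)
    p_y = [row.copy() for row in p_y_given_t]      # cluster conditionals p(y|x~)
    members = [[i] for i in range(len(p_t))]
    while len(p_t) > n_clusters:
        best, best_cost = None, np.inf
        for i in range(len(p_t)):
            for j in range(i + 1, len(p_t)):
                w = p_t[i] + p_t[j]
                cost = w * js_divergence(p_y[i], p_y[j],
                                         p_t[i] / w, p_t[j] / w)
                if cost < best_cost:
                    best, best_cost = (i, j), cost
        i, j = best
        w = p_t[i] + p_t[j]
        p_y[i] = (p_t[i] * p_y[i] + p_t[j] * p_y[j]) / w   # merged conditional
        p_t[i] = w                                         # merged prior
        members[i] += members[j]
        del p_t[j], p_y[j], members[j]
    return members
```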

Normalized information curves for all 10 iterations, for both the large and the small sample sizes.