1 Feature Selection with Conditional Mutual Information Maximin in Text Categorization (CIKM 2004)
2 Abstract
Feature selection
–Advantages
  Increases a classifier's computational speed
  Reduces the overfitting problem
–Drawbacks of existing methods
  They do not consider the mutual relationships among the features
  One feature's predictive power can be weakened by others
  The selected features tend to be biased towards major categories
–Contribution: CMIM (conditional mutual information maximin)
  Selects a set of individually discriminating and weakly dependent features
3 Information Theory Review
Assumptions
–X and Y are discrete random variables
–1-of-n classification problem
4 Information Theory Review
Goal: select a small number of features that carry as much information about the category as possible
–H. Yang (1999): directly estimating the joint probability suffers from the curse of dimensionality
–Assume that all random variables are discrete and each may take one of M different values
–It can be shown that the joint mutual information decomposes by the chain rule (see the identity below)
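Assuming the identity in question is the standard chain rule for joint mutual information between the category Y and the features F_1, ..., F_k, it can be written as:

```latex
% Chain rule for joint mutual information (standard identity, assumed to be
% the one referenced on the slide):
I(Y; F_1, \dots, F_k) = I(Y; F_1, \dots, F_{k-1}) + I(Y; F_k \mid F_1, \dots, F_{k-1})

% Conditional mutual information is non-negative:
I(Y; F_k \mid F_1, \dots, F_{k-1}) \ge 0

% Hence adding a feature F_k never decreases the joint mutual information:
I(Y; F_1, \dots, F_k) \ge I(Y; F_1, \dots, F_{k-1})
```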
5 Information Theory Review
–This suggests that adding a feature F_k will never decrease the joint mutual information (JMI).
–Approach
  Current: the k-1 already-selected features maximize the JMI
  Next: the feature that maximizes the conditional mutual information (CMI) given the selected features is added, so that the JMI of the k features is maximized
–Benefit
  Features can be selected one by one through an iterative, greedy process (sketched below)
  At the start, the feature that maximizes the mutual information (MI) with the category is selected first
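A minimal Python sketch of this greedy criterion for discrete features, estimating probabilities by simple counting; all function and variable names here are illustrative, not from the paper. Note that the joint conditioning on all selected features is exactly what becomes infeasible as k grows.

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """Empirical entropy H(X) in bits of a sequence of discrete symbols."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def cond_entropy(labels, given):
    """Empirical conditional entropy H(X | Z); `given` may hold tuples for a joint Z."""
    given = list(given)
    total = len(labels)
    h = 0.0
    for z in set(given):
        idx = [i for i, g in enumerate(given) if g == z]
        h += (len(idx) / total) * entropy([labels[i] for i in idx])
    return h

def mutual_info(y, f):
    """I(Y; F) = H(Y) - H(Y | F)."""
    return entropy(y) - cond_entropy(y, f)

def cond_mutual_info(y, f, z):
    """I(Y; F | Z) = H(Y | Z) - H(Y | F, Z)."""
    return cond_entropy(y, z) - cond_entropy(y, list(zip(f, z)))

def greedy_cmi_selection(X, y, k):
    """Ideal greedy selection: first feature by MI with the category, then each
    next feature by CMI conditioned jointly on ALL selected features.
    The joint conditioning is what suffers from the curse of dimensionality."""
    n_features = X.shape[1]
    selected = [max(range(n_features), key=lambda j: mutual_info(y, X[:, j]))]
    while len(selected) < k:
        joint = [tuple(row) for row in X[:, selected]]  # joint value of selected features
        remaining = (j for j in range(n_features) if j not in selected)
        selected.append(max(remaining, key=lambda j: cond_mutual_info(y, X[:, j], joint)))
    return selected

# Toy usage (synthetic data): the category depends only on features 0 and 3,
# so the greedy procedure should recover exactly those two features.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 6))
y = (X[:, 0] + 2 * X[:, 3]) % 3
print(greedy_cmi_selection(X, y, 2))
```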
6 CMIM Algorithm
–Deals with the computational problem when the conditioning dimension is high
–Because conditioning on more information reduces the remaining uncertainty, I(Y; F | F_1, ..., F_{k-1}) is taken to be smaller than any CMI conditioned on fewer of the selected features
–Therefore, it is estimated by the minimum of the lower-dimensional terms, i.e., I(Y; F | F_1, ..., F_{k-1}) ≈ min_{1≤j≤k-1} I(Y; F | F_j)
7 CMIM Algorithm
–Use the triplet form I(Y; F | F_j), which involves only three variables at a time
–Select the feature F that maximizes the minimum of I(Y; F | F_j) over the already-selected features F_j (a sketch follows)
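A minimal, self-contained Python sketch of this triplet-based selection rule; the names cmi and cmim_select are illustrative, and the paper does not specify an implementation. Each I(Y; F | F_j) is estimated from a 3-way contingency table, the first feature is chosen by plain mutual information, and each subsequent feature maximizes the minimum CMI over the features already selected.

```python
import numpy as np

def cmi(y, f, g):
    """Estimate I(Y; F | G) in bits for discrete 1-D arrays via a 3-way contingency table."""
    y, f, g = (np.asarray(a) for a in (y, f, g))
    yi = np.unique(y, return_inverse=True)[1]
    fi = np.unique(f, return_inverse=True)[1]
    gi = np.unique(g, return_inverse=True)[1]
    counts = np.zeros((yi.max() + 1, fi.max() + 1, gi.max() + 1))
    np.add.at(counts, (yi, fi, gi), 1)
    p = counts / counts.sum()                      # P(Y, F, G)
    p_g = p.sum(axis=(0, 1), keepdims=True)        # P(G)
    p_yg = p.sum(axis=1, keepdims=True)            # P(Y, G)
    p_fg = p.sum(axis=0, keepdims=True)            # P(F, G)
    with np.errstate(divide="ignore", invalid="ignore"):
        ratio = p * p_g / (p_yg * p_fg)
        terms = np.where(p > 0, p * np.log2(ratio), 0.0)
    return float(terms.sum())

def cmim_select(X, y, k):
    """CMIM: pick the first feature by I(Y; F); afterwards score each candidate F
    by min_j I(Y; F | F_j) over the already-selected F_j and pick the maximum."""
    X = np.asarray(X)
    n_features = X.shape[1]
    const = np.zeros(X.shape[0], dtype=int)        # conditioning on a constant gives plain MI
    selected = [max(range(n_features), key=lambda j: cmi(y, X[:, j], const))]
    while len(selected) < k:
        candidates = (j for j in range(n_features) if j not in selected)
        selected.append(max(candidates,
                            key=lambda j: min(cmi(y, X[:, j], X[:, s]) for s in selected)))
    return selected
```

Because only triplets (Y, F, F_j) are ever tabulated, no high-dimensional joint distribution has to be estimated, which is the point of the approximation on the previous slide.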
8 Experiment
10 Conclusion and Future Work
–Presents a CMI-based method and the CMIM algorithm to select features that are both individually discriminating and only weakly dependent on the features already selected
–Experiments show that both micro-averaged and macro-averaged classification results improve with this feature selection method, especially when the feature set is small and the number of categories is large
11 Conclusion and Future Work
CMIM's drawbacks
–Cannot deal with integer-valued or continuous features
–Ignores dependencies among families of three or more features
–Although CMIM greatly reduces the computational overhead, the complexity O(NV^3) is still not very attractive
Future work
–Decrease the complexity of CMIM
–Consider parametric density models to handle continuous features, and investigate other conditional models to efficiently formulate the features' mutual relationships