1 Feature Selection with Conditional Mutual Information Maximin in Text Categorization (CIKM 2004)
2 Abstract
Feature selection
–Advantages
  Increases a classifier's computational speed
  Reduces the overfitting problem
–Drawbacks of existing methods
  They do not consider the mutual relationships among the features
  One feature's predictive power can be weakened by others
  The selected features tend to be biased towards major categories
–Contribution: CMIM (conditional mutual information maximin)
  Selects a set of individually discriminating and weakly dependent features
3 Information Theory Review
Assumptions
–X and Y are discrete random variables
–1-of-n classification problem
4 Information Theory Review
Goal: select a small number of features that carry as much information about the category as possible
–H. Yang (1999): directly estimating the joint probability suffers from the curse of dimensionality
–Assume that all random variables are discrete and each may take one of M different values
–It can be shown that the joint mutual information decomposes by the chain rule (see the identity below)
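Assuming the identity in question is the standard chain rule for joint mutual information between the category Y and the features F_1, ..., F_k, it can be written as:

```latex
% Chain rule for joint mutual information (standard identity, assumed to be
% the one referenced on the slide):
I(Y; F_1, \dots, F_k) = I(Y; F_1, \dots, F_{k-1}) + I(Y; F_k \mid F_1, \dots, F_{k-1})

% Conditional mutual information is non-negative:
I(Y; F_k \mid F_1, \dots, F_{k-1}) \ge 0

% Hence adding a feature F_k never decreases the joint mutual information:
I(Y; F_1, \dots, F_k) \ge I(Y; F_1, \dots, F_{k-1})
```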
5 Information Theory Review
–This suggests that adding a feature F_k will never decrease the joint mutual information (JMI).
–Approach
  Current: the k-1 already-selected features maximize the JMI
  Next: the feature that maximizes the conditional mutual information (CMI) given the selected features is added, so that the JMI of the k features is maximized
–Benefit
  Features can be selected one by one through an iterative, greedy process (sketched below)
  At the start, the feature that maximizes the mutual information (MI) with the category is selected first
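A minimal Python sketch of this greedy criterion for discrete features, estimating probabilities by simple counting; all function and variable names here are illustrative, not from the paper. Note that the joint conditioning on all selected features is exactly what becomes infeasible as k grows.

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """Empirical entropy H(X) in bits of a sequence of discrete symbols."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def cond_entropy(labels, given):
    """Empirical conditional entropy H(X | Z); `given` may hold tuples for a joint Z."""
    given = list(given)
    total = len(labels)
    h = 0.0
    for z in set(given):
        idx = [i for i, g in enumerate(given) if g == z]
        h += (len(idx) / total) * entropy([labels[i] for i in idx])
    return h

def mutual_info(y, f):
    """I(Y; F) = H(Y) - H(Y | F)."""
    return entropy(y) - cond_entropy(y, f)

def cond_mutual_info(y, f, z):
    """I(Y; F | Z) = H(Y | Z) - H(Y | F, Z)."""
    return cond_entropy(y, z) - cond_entropy(y, list(zip(f, z)))

def greedy_cmi_selection(X, y, k):
    """Ideal greedy selection: first feature by MI with the category, then each
    next feature by CMI conditioned jointly on ALL selected features.
    The joint conditioning is what suffers from the curse of dimensionality."""
    n_features = X.shape[1]
    selected = [max(range(n_features), key=lambda j: mutual_info(y, X[:, j]))]
    while len(selected) < k:
        joint = [tuple(row) for row in X[:, selected]]  # joint value of selected features
        remaining = (j for j in range(n_features) if j not in selected)
        selected.append(max(remaining, key=lambda j: cond_mutual_info(y, X[:, j], joint)))
    return selected

# Toy usage (synthetic data): the category depends only on features 0 and 3,
# so the greedy procedure should recover exactly those two features.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 6))
y = (X[:, 0] + 2 * X[:, 3]) % 3
print(greedy_cmi_selection(X, y, 2))
```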
6 CMIM Algorithm
–Deals with the computational problem when the conditioning dimension is high
–Because conditioning on more information reduces the remaining uncertainty, I(Y; F | F_1, ..., F_{k-1}) is taken to be smaller than any CMI conditioned on fewer of the selected features
–Therefore, it is estimated by the minimum of the lower-dimensional terms, i.e., I(Y; F | F_1, ..., F_{k-1}) ≈ min_{1≤j≤k-1} I(Y; F | F_j)
7 CMIM Algorithm
–Use the triplet form I(Y; F | F_j), which involves only three variables at a time
–Select the feature F that maximizes the minimum of I(Y; F | F_j) over the already-selected features F_j (a sketch follows)
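A minimal, self-contained Python sketch of this triplet-based selection rule; the names cmi and cmim_select are illustrative, and the paper does not specify an implementation. Each I(Y; F | F_j) is estimated from a 3-way contingency table, the first feature is chosen by plain mutual information, and each subsequent feature maximizes the minimum CMI over the features already selected.

```python
import numpy as np

def cmi(y, f, g):
    """Estimate I(Y; F | G) in bits for discrete 1-D arrays via a 3-way contingency table."""
    y, f, g = (np.asarray(a) for a in (y, f, g))
    yi = np.unique(y, return_inverse=True)[1]
    fi = np.unique(f, return_inverse=True)[1]
    gi = np.unique(g, return_inverse=True)[1]
    counts = np.zeros((yi.max() + 1, fi.max() + 1, gi.max() + 1))
    np.add.at(counts, (yi, fi, gi), 1)
    p = counts / counts.sum()                      # P(Y, F, G)
    p_g = p.sum(axis=(0, 1), keepdims=True)        # P(G)
    p_yg = p.sum(axis=1, keepdims=True)            # P(Y, G)
    p_fg = p.sum(axis=0, keepdims=True)            # P(F, G)
    with np.errstate(divide="ignore", invalid="ignore"):
        ratio = p * p_g / (p_yg * p_fg)
        terms = np.where(p > 0, p * np.log2(ratio), 0.0)
    return float(terms.sum())

def cmim_select(X, y, k):
    """CMIM: pick the first feature by I(Y; F); afterwards score each candidate F
    by min_j I(Y; F | F_j) over the already-selected F_j and pick the maximum."""
    X = np.asarray(X)
    n_features = X.shape[1]
    const = np.zeros(X.shape[0], dtype=int)        # conditioning on a constant gives plain MI
    selected = [max(range(n_features), key=lambda j: cmi(y, X[:, j], const))]
    while len(selected) < k:
        candidates = (j for j in range(n_features) if j not in selected)
        selected.append(max(candidates,
                            key=lambda j: min(cmi(y, X[:, j], X[:, s]) for s in selected)))
    return selected
```

Because only triplets (Y, F, F_j) are ever tabulated, no high-dimensional joint distribution has to be estimated, which is the point of the approximation on the previous slide.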
8 Experiment
10 Conclusion and Future Work
–Presents a CMI-based method and the CMIM algorithm to select features that are both individually discriminating and only weakly dependent on the features already selected
–Experiments show that both micro-averaged and macro-averaged classification results improve with this feature selection method, especially when the feature set is small and the number of categories is large
11 Conclusion and Future Work
CMIM's drawbacks
–Cannot deal with integer-valued or continuous features
–Ignores dependencies among families of three or more features
–Although CMIM greatly reduces the computational overhead, the complexity O(NV^3) is still not very attractive
Future work
–Decrease the complexity of CMIM
–Consider parametric density models to handle continuous features, and investigate other conditional models to efficiently formulate the features' mutual relationships