Discriminative Naïve Bayesian Classifiers Kaizhu Huang Supervisors: Prof. Irwin King, Prof. Michael R. Lyu Markers: Prof. Lai Wan Chan, Prof. Kin Hong Wong



Outline
Background
–Classifiers
»Discriminative classifiers: Support Vector Machines
»Generative classifiers: Naïve Bayesian Classifiers
Motivation
Discriminative Naïve Bayesian Classifiers
Experiments
Discussions
Conclusion

Background: Discriminative Classifiers
–Directly maximize a discriminative function or a posterior function
–Example: Support Vector Machines (SVM)

Background: Generative Classifiers
–Model the joint distribution P(x|C) for each class and then use Bayes' rule to construct the posterior classifier P(C|x).
–Example: Naïve Bayesian Classifiers
»Model the distribution of each class under the assumption that the features are mutually independent given the class label. Since P(x) is constant w.r.t. C, combining this assumption with Bayes' rule gives P(C|x) ∝ P(C) ∏_i P(x_i|C).
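A minimal sketch of such a classifier for discrete features, assuming Laplace smoothing; the class and variable names below are hypothetical illustrations, not code from the thesis:

    import numpy as np

    class DiscreteNaiveBayes:
        """Naive Bayes for discrete features with Laplace smoothing (illustration only)."""

        def fit(self, X, y):
            X, y = np.asarray(X), np.asarray(y)
            n, d = X.shape
            self.classes_ = np.unique(y)
            self.log_prior_ = {}
            self.log_cond_ = {}   # log_cond_[c][j][value] = log P(x_j = value | C = c)
            for c in self.classes_:
                Xc = X[y == c]
                self.log_prior_[c] = np.log(len(Xc) / n)
                self.log_cond_[c] = []
                for j in range(d):
                    vals, cnts = np.unique(Xc[:, j], return_counts=True)
                    probs = (cnts + 1.0) / (cnts.sum() + len(vals))   # Laplace smoothing
                    self.log_cond_[c].append(dict(zip(vals, np.log(probs))))
            return self

        def predict_one(self, x):
            # Posterior score: log P(C) + sum_j log P(x_j | C); P(x) is dropped as a constant.
            floor = np.log(1e-6)   # crude fallback for feature values unseen during training
            scores = {c: self.log_prior_[c]
                         + sum(self.log_cond_[c][j].get(v, floor) for j, v in enumerate(x))
                      for c in self.classes_}
            return max(scores, key=scores.get)

Hypothetical usage: clf = DiscreteNaiveBayes().fit(X_train, y_train); y_hat = clf.predict_one(X_test[0]).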

Background: Comparison
Example of missing information (figure): from left to right, the original digit, a digit with 50% of its pixels missing, a digit with 75% missing, and an occluded digit.

Background
Why are generative classifiers not as accurate as discriminative classifiers?
Scheme of generative classifiers in a two-category classification task: split the pre-classified dataset into sub-dataset D1 for class 1 and sub-dataset D2 for class 2; estimate a distribution P1 to approximate D1 accurately and a distribution P2 to approximate D2 accurately; then use Bayes' rule to perform classification.
1. It is incomplete for generative classifiers to approximate only the intra-class information.
2. The inter-class discriminative information between the classes is discarded.

Background
Why are generative classifiers superior to discriminative classifiers in handling missing-information problems?
–SVM lacks the ability to reason under such uncertainty.
–NB can conduct inference under the estimated distribution: with A the full feature set and T ⊆ A the missing subset, NB marginalizes out the features in T and classifies using only the observed features in A\T.
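A short sketch of why this marginalization is trivial for NB, reusing the hypothetical DiscreteNaiveBayes from the earlier sketch and marking missing entries of x as None:

    import numpy as np

    def predict_with_missing(clf, x):
        """Classify x even when some entries are missing (None). Given the class,
        the features are independent, so summing a missing feature over all of its
        values contributes a factor of 1: its term is simply dropped."""
        floor = np.log(1e-6)
        scores = {}
        for c in clf.classes_:
            s = clf.log_prior_[c]
            for j, v in enumerate(x):
                if v is None:              # missing feature: marginalized out
                    continue
                s += clf.log_cond_[c][j].get(v, floor)
            scores[c] = s
        return max(scores, key=scores.get)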

Motivation
It seems that a good classifier should combine the strategies of discriminative classifiers and generative classifiers.
Our work trains one of the generative classifiers, the Naïve Bayesian Classifier, in a discriminative way.

Roadmap of our work Discriminative training

How does our work relate to other work?
1. Jaakkola and Haussler (NIPS '98): exploit generative models within discriminative classifiers. Difference: our method performs the reverse process, going from generative classifiers to discriminative classifiers.
2. Discriminative training of HMMs and GMMs (Beaufays et al., ICASSP '99; Hastie et al., JRSS '96). Difference: our method is designed for Bayesian classifiers.

How does our work relate to other work?
3. Logistic Regression (LR): optimizes the posterior distribution P(C|x) directly. Difference: LR encounters computational difficulties in handling missing-information problems; as the number of missing or unknown features grows, inference becomes intractable.
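The intractability can be made concrete with a standard argument (not taken from the slides): with observed features $x_O$ and missing features $x_T$, a purely conditional model such as LR would have to average its posterior over every completion of the missing part,

\[
P(C \mid x_O) \;=\; \sum_{x_T} P(C \mid x_O, x_T)\, P(x_T \mid x_O),
\]

which requires an input distribution $P(x_T \mid x_O)$ that LR does not model and, for discrete features, a sum whose number of terms grows exponentially with $|T|$. A generative model such as NB evaluates the same quantity in closed form.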

Roadmap of our work

Discriminative Naïve Bayesian Classifiers
Working scheme of the Naïve Bayesian Classifier: split the pre-classified dataset into sub-dataset D1 for class 1 and sub-dataset D2 for class 2; estimate a distribution P1 to approximate D1 accurately and a distribution P2 to approximate D2 accurately; use Bayes' rule to perform classification.
Mathematical explanation of the Naïve Bayesian Classifier: each class's distribution is fit by maximum likelihood, which is easily solved by the Lagrange multiplier method.
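For a single discrete feature within one class, the maximum-likelihood step referred to here has the familiar textbook form (shown for completeness, not copied from the slides):

\[
\max_{\theta}\; \sum_{v} n_v \log \theta_v
\quad \text{s.t.}\; \sum_{v} \theta_v = 1,\; \theta_v \ge 0
\qquad\Longrightarrow\qquad
\theta_v^{*} = \frac{n_v}{\sum_{u} n_u},
\]

where $n_v$ counts how often value $v$ occurs in that class's sub-dataset; setting the gradient of the Lagrangian $\sum_v n_v \log \theta_v + \lambda\,(1 - \sum_v \theta_v)$ to zero yields the relative-frequency solution.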

Discriminative Naïve Bayesian Classifiers (DNB)
Optimization function of DNB
On one hand, minimizing this function tries to approximate each class's dataset as accurately as possible; on the other hand, it also tries to enlarge the divergence between the classes (the divergence term).
Optimizing the joint distribution directly inherits NB's ability to handle missing-information problems.

Discriminative Naïve Bayesian Classifiers (DNB)
Complete optimization problem
Unlike in NB, the two class-conditional distributions P1 and P2 cannot be optimized separately, since they are now interacting variables.
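A schematic form consistent with the verbal description above (a per-class likelihood term plus a between-class divergence term); this is a sketch only, not necessarily the exact DNB objective:

\[
\min_{P_1, P_2}\;\Big[-\log P_1(D_1)\;-\;\log P_2(D_2)\;-\;\lambda\,\mathrm{Div}(P_1, P_2)\Big]
\quad\text{s.t. each } P_k \text{ is a valid naïve-factorized distribution,}
\]

where $\mathrm{Div}(\cdot,\cdot)$ denotes some divergence between the two class-conditional distributions and $\lambda \ge 0$ trades off data fit against class separation; the divergence term is what couples $P_1$ and $P_2$.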

Discriminative Naïve Bayesian Classifiers (DNB)
Solving the optimization problem
–A nonlinear optimization problem under linear constraints, solved using Rosen's gradient projection method.

Discriminative Naïve Bayesian Classifiers (DNB)
Gradient and projection matrix
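A generic sketch of one gradient-projection step for equality constraints of the form A p = b (e.g. probability-table rows summing to one); the function and variable names are hypothetical, and this is not the exact routine used for DNB:

    import numpy as np

    def projected_gradient_step(p, grad, A, step=0.1):
        """One Rosen-style projection step: the gradient is projected onto the
        null space of A so that the update keeps A @ p unchanged; non-negativity
        is handled here only by a crude clip."""
        P = np.eye(len(p)) - A.T @ np.linalg.inv(A @ A.T) @ A   # projection matrix
        d = -P @ grad                                           # projected descent direction
        return np.clip(p + step * d, 1e-12, None)               # keep parameters positive

    # Tiny usage example: three probabilities constrained to sum to 1.
    A = np.ones((1, 3))                   # constraint row: p1 + p2 + p3 = 1
    p = np.array([0.2, 0.3, 0.5])
    grad = np.array([1.0, -0.5, 0.2])     # stand-in for the objective's gradient
    p = projected_gradient_step(p, grad, A)
    print(p, p.sum())                     # the sum stays (essentially) equal to 1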

Extension to Multi-category Classification problems

Experimental results
Experimental setup
–Datasets
»5 benchmark datasets from the UCI machine learning repository
–Experimental environment
»Platform: Windows 2000
»Development tool: Matlab 6.5

Without information missing
Observations
–DNB outperforms NB on every dataset
–Compared with SVM, DNB wins on two datasets and loses on three
–SVM outperforms DNB on Segment and Satimage

With information missing
DNB conducts inference under the estimated joint distribution, marginalizing out the missing features.
SVM sets the missing features to 0 (the default way of processing unknown features in LIBSVM).
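For concreteness, the SVM baseline described here is plain zero imputation, sketched below with hypothetical names (the DNB side corresponds to the marginalization sketch given earlier):

    import numpy as np
    from sklearn.svm import SVC

    def zero_impute(X):
        """Replace missing entries (encoded as NaN) with 0, mirroring the
        LIBSVM-style default treatment of unknown features."""
        X = np.array(X, dtype=float, copy=True)
        X[np.isnan(X)] = 0.0
        return X

    # Hypothetical usage: train on complete data, test on data with missing entries.
    # svm = SVC().fit(X_train, y_train)
    # y_pred = svm.predict(zero_impute(X_test_missing))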

With information missing

Observations
 NB demonstrates a robust ability to handle missing-information problems.
 DNB inherits NB's ability to handle missing information while achieving higher classification accuracy than NB.
 SVM cannot deal with missing-information problems easily.
 On small datasets, DNB demonstrates a superior ability compared with NB.

Discussion
Why does SVM outperform DNB when no information is missing?
 SVM directly minimizes the error rate, while DNB minimizes an intermediate term.
 SVM assumes no model, while DNB assumes independence among the features ("all models are wrong but some are useful").

Discussion
How does DNB relate to the Fisher Discriminant (FD)?
 Using the difference of the means of the two classes as the divergence measure is less informative than using the full distributions.
 FD is usually used as a dimensionality-reduction method rather than a classification method.
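For reference, the Fisher criterion behind this comparison, in its standard textbook form (not taken from the slides):

\[
J(w) \;=\; \frac{\big(w^{\top}(m_1 - m_2)\big)^{2}}{\,w^{\top}(S_1 + S_2)\,w\,},
\]

where $m_k$ and $S_k$ are the mean and within-class scatter matrix of class $k$. FD separates the classes only through these first- and second-order statistics of a one-dimensional projection, whereas DNB's divergence term compares the estimated class-conditional distributions themselves.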

Discussion
Can DNB be extended to a general Bayesian Network (BN) classifier?
–Finding optimal general Bayesian network classifiers is an NP-complete problem.
–A structure-learning problem is involved: direct application of DNB encounters difficulties, since the structure is not fixed in restricted BNs.
Work on a tree-like discriminative Bayesian network classifier is ongoing.

Discussion
Discriminative training of tree-like Bayesian network classifiers
Two reference distributions are used in each iteration: approximate the empirical distribution of a class's own dataset as closely as possible, while staying as far as possible from the distribution of the other dataset.

Future work
Extensive evaluations of discriminative Bayesian network classifiers, including Discriminative Naïve Bayesian Classifiers and tree-like Bayesian network classifiers.

Conclusion
We develop a novel model named the Discriminative Naïve Bayesian Classifier.
It outperforms Naïve Bayesian Classifiers when no information is missing.
It outperforms SVMs in handling missing-information problems.