Detecting New a Priori Probabilities of Data Using Supervised Learning
Karpov Nikolay, Associate Professor, NRU Higher School of Economics


Agenda
- Motivation
- Problem statement
- Problem solution
- Results evaluation
- Conclusion

Motivation
In many applications of classification, the real goal is to estimate the relative frequency of each class in the unlabelled data (the a priori probabilities of the data). Examples: predicting election outcomes, happiness surveys, epidemiology.

Motivation
Classification is a data mining function that assigns each item in a collection to one of several target categories or classes. When we have both labeled and unlabeled data, classification is usually solved via supervised machine learning. Popular classes of supervised learning algorithms: Naïve Bayes, k-NN, SVMs, decision trees, neural networks, etc. We can simply use a "classify and count" strategy to estimate the a priori probabilities of the data, as sketched below. But is "classify and count" the optimal strategy for estimating relative frequencies?
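A minimal sketch of the "classify and count" strategy, assuming a scikit-learn-style classifier; the choice of logistic regression and the function name are illustrative assumptions, not part of the original slides:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def classify_and_count(X_train, y_train, X_test):
    """Estimate class prevalences in the unlabeled test set by counting predicted labels."""
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    predictions = clf.predict(X_test)
    # Relative frequency of each predicted class in the test data
    return {c: float((predictions == c).mean()) for c in clf.classes_}
```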

Motivation
A perfect classifier is also a perfect "quantifier" (i.e., estimator of class prevalence), but real applications may suffer from distribution drift (also called "shift" or "mismatch"), defined as a discrepancy between the class distribution of Tr and that of Te:
1. the prior probabilities p(ω_j) may change from training to test set;
2. the class-conditional distributions (aka "within-class densities") p(x|ω_j) may change;
3. the posterior probabilities p(ω_j|x) may change.
Standard ML algorithms are instead based on the assumption that training and test items are drawn from the same distribution. We are interested in the first case of distribution drift.


Problem statement
We have a training set Tr and a test set Te with p_Tr(ω_j) ≠ p_Te(ω_j). We have a vector of variables X and class indices ω_j, j = 1, ..., J. We know the class index of each item in the training set Tr. The task is to estimate p_Te(ω_j), j = 1, ..., J.

Problem statement
The training set is a table whose rows are items X1, X2, ... with feature values f1, f2, ... and a known class label ω; the test set has the same feature columns, but its class labels are unknown. The task may also be defined as approximating a distribution of classes when p_Train(ω_j) ≠ p_Test(ω_j).

Problem statement
Quality of the estimated class distribution can be measured by:
- Absolute Error
- Kullback-Leibler Divergence
- ...
(both measures are sketched below)
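A minimal sketch of these two measures, assuming the true and estimated class distributions are given as arrays of prevalences; the function names, the mean-over-classes form of Absolute Error, and the smoothing constant eps (to avoid division by zero) are illustrative assumptions:

```python
import numpy as np

def absolute_error(p_true, p_hat):
    """Mean absolute difference between true and estimated class prevalences."""
    p_true, p_hat = np.asarray(p_true, dtype=float), np.asarray(p_hat, dtype=float)
    return float(np.abs(p_true - p_hat).mean())

def kld(p_true, p_hat, eps=1e-12):
    """Kullback-Leibler divergence KL(p_true || p_hat) between class distributions."""
    p_true = np.asarray(p_true, dtype=float) + eps
    p_hat = np.asarray(p_hat, dtype=float) + eps
    return float(np.sum(p_true * np.log(p_true / p_hat)))
```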


Baseline algorithm: Adjusted Classify and Count
In the classification task we predict the value of the category. The trivial solution is to count the number of elements in each predicted class. We can adjust these counts with the help of the confusion matrix. A standard classifier is tuned to minimize FP + FN (or a proxy of it), but for quantification we need to minimize |FP - FN|. However, we can estimate the confusion matrix only on the training set. The p(ω_j) can then be found from the equations relating the observed proportions of predicted classes to the true prevalences, as sketched below.
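A minimal sketch of the adjusted classify-and-count correction for the binary case, assuming the true positive rate (tpr) and false positive rate (fpr) are estimated on the training set (e.g., by cross-validation); the function name and the clipping step are illustrative assumptions:

```python
def adjusted_classify_and_count(cc, tpr, fpr):
    """Correct the raw classify-and-count estimate cc = P(predicted positive).

    Uses tpr and fpr estimated on the training set:
        cc = tpr * p + fpr * (1 - p)   =>   p = (cc - fpr) / (tpr - fpr)
    """
    if tpr == fpr:
        return cc  # correction is undefined; fall back to the raw estimate
    p = (cc - fpr) / (tpr - fpr)
    return min(1.0, max(0.0, p))  # clip to a valid probability
```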

Which methods perform best?
The largest experimentation to date is likely: Esuli, A. and Sebastiani, F. (2015). Optimizing Text Quantifiers for Multivariate Loss Functions. ACM Transactions on Knowledge Discovery from Data, 9(4), Article 27. Fabrizio Sebastiani calls this problem Quantification. Different papers present different methods and use different datasets, baselines, and evaluation protocols; it is thus hard to have a precise view.

[Slide from F. Sebastiani, 2015]

Fuzzy classifier
A fuzzy (probabilistic) classifier estimates the posterior probability of each category from the training set, using the vector of variables X. If there is distribution drift of the a priori probabilities, p_Train(ω_j) ≠ p_Test(ω_j), the posterior probabilities should be retuned, and so our classification results will change.
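A minimal sketch of obtaining such posterior estimates, assuming a scikit-learn-style probabilistic classifier; the choice of Naïve Bayes and the helper name are illustrative assumptions:

```python
from sklearn.naive_bayes import MultinomialNB

def train_fuzzy_classifier(X_train, y_train):
    """Fit a probabilistic classifier; return it together with its training-set priors."""
    clf = MultinomialNB().fit(X_train, y_train)
    train_priors = clf.class_count_ / clf.class_count_.sum()
    return clf, train_priors

# Posterior probabilities p_Train(omega_j | x) for the unlabeled test items:
#   clf, train_priors = train_fuzzy_classifier(X_train, y_train)
#   posteriors = clf.predict_proba(X_test)   # shape: (n_test_items, n_classes)
```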

Adjusting to a distribution drift
If we know the new a priori probabilities, we can simply compute new values for the posterior probabilities (see the formula below). If we do not know the new a priori probabilities, we can estimate them iteratively, as proposed in: Saerens, M., Latinne, P., and Decaestecker, C. (2002). Adjusting the Outputs of a Classifier to New a Priori Probabilities: A Simple Procedure. Neural Computation 14(1), 21-41.
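The re-weighting formula referred to above, reconstructed from Saerens et al. (2002) since the slide showed it only as an image: each training-set posterior is rescaled by the ratio of new to old priors and renormalized over the classes.

```latex
\[
\hat{p}(\omega_j \mid \mathbf{x}) =
\frac{\dfrac{p_{\mathrm{new}}(\omega_j)}{p_{\mathrm{Train}}(\omega_j)}\, p_{\mathrm{Train}}(\omega_j \mid \mathbf{x})}
     {\sum_{k=1}^{J} \dfrac{p_{\mathrm{new}}(\omega_k)}{p_{\mathrm{Train}}(\omega_k)}\, p_{\mathrm{Train}}(\omega_k \mid \mathbf{x})}
\]
```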

EM algorithm* (a sketch follows below)
* Saerens, M., Latinne, P., and Decaestecker, C. (2002). Adjusting the Outputs of a Classifier to New a Priori Probabilities: A Simple Procedure. Neural Computation 14(1), 21-41.
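A minimal sketch of the prior-adjustment EM procedure described in Saerens et al. (2002), assuming the test-set posteriors produced under the training priors are given as a matrix; the function name, tolerance, and iteration cap are illustrative assumptions:

```python
import numpy as np

def em_prior_adjustment(posteriors, train_priors, max_iter=100, tol=1e-6):
    """Iteratively re-estimate test-set priors from training-set posteriors.

    posteriors   : (n_items, n_classes) array of p_Train(omega_j | x_i)
    train_priors : (n_classes,) array of p_Train(omega_j)
    Returns the estimated test-set priors p_Te(omega_j).
    """
    priors = np.asarray(train_priors, dtype=float)      # initialize with training priors
    for _ in range(max_iter):
        # E-step: rescale posteriors by the ratio of current to training priors, renormalize
        weighted = posteriors * (priors / train_priors)
        weighted /= weighted.sum(axis=1, keepdims=True)
        # M-step: new prior estimate = average adjusted posterior over the test items
        new_priors = weighted.mean(axis=0)
        if np.abs(new_priors - priors).max() < tol:
            return new_priors
        priors = new_priors
    return priors
```

The adjusted posteriors computed in the E-step can also be kept and used to re-classify the test items under the estimated priors.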


Results evaluation
We implemented the EM algorithm proposed by Saerens et al. (2002) and compared it with other methods. F. Sebastiani used baseline algorithms from George Forman. George Forman wrote those algorithms for HP and cannot share them because the code is too old. We can therefore compare results only by using the same datasets as Esuli and Sebastiani (2015) and the same Kullback-Leibler Divergence measure.

[Slide from F. Sebastiani, 2015]

Testing datasets*
* Esuli, A. and Sebastiani, F. (2015). Optimizing Text Quantifiers for Multivariate Loss Functions. ACM Transactions on Knowledge Discovery from Data, 9(4), Article 27.

Results evaluation [table from Esuli and Sebastiani, 2015]

Results evaluation: KLD on the RCV1-V2 and OHSUMED-S datasets (lower is better; ? marks cells that are illegible in the transcript)

By test-set class prevalence:
            VLP        LP         HP         VHP        total
EM          4,99E-04   1,91E-03   1,33E-03   5,31E-04   9,88E-04
SVM(KLD)    1,21E-03   1,02E-03   5,55E-??   ?,??E-04   1,13E-03

By distribution drift:
            VLD        LD         HD         VHD        total
EM          1,17E-04   1,49E-04   3,34E-04   3,35E-03   9,88E-04
SVM(KLD)    7,00E-04   7,54E-04   9,39E-04   2,11E-03   1,13E-03

By test-set class prevalence:
            VLP        LP         HP         VHP        total
EM          6,52E-05   1,497E-??  ?,??E-04   7,62E-06   1,32E-03
SVM(KLD)    2,09E-03   4,92E-04   7,19E-04   1,12E-03   1,32E-03

By distribution drift:
            VLD        LD         HD         VHD        total
EM          3,32E-04   4,92E-04   1,83E-03   4,29E-03   1,32E-03
SVM(KLD)    1.17E-??   ?          ?          ?,??E-03   1,32E-03


Conclusion
- Explored the problem of detecting new a priori probabilities of data using supervised learning.
- Implemented the EM algorithm, in which the a priori probabilities are estimated as a by-product.
- Implemented the baseline algorithms.
- Tested the EM algorithm on the datasets and compared it with the baseline and state-of-the-art algorithms.
- The EM algorithm shows good results.

Results
Algorithms available at:
Thank you for your attention!