Document Classification Method with Small Training Data


Document Classification Method with Small Training Data
Yasunari MAEDA (Kitami Institute of Technology)
Hideki YOSHIDA (Kitami Institute of Technology)
Toshiyasu Matsushima (Waseda University)

Topics
Overview of document classification
Document classification using distance
Document classification using a probabilistic model
Preparations
Our first previous method
Our second previous method
Experiments with the previous methods
Proposed method
Experiments
Conclusion

Overview of document classification
Key words in the documents are used. Example: classification of newspaper articles.
Key words for each class:
  economy: stocks, company, ...
  science: amino acid, computer, ...
  sport: baseball, swimming, ...
Result of classifying the articles art1-art6:
  economy: art1, art3
  science: art5, art6
  sport: art2, art4
For example, the article "A new characteristic of amino acid was found." contains the key word "amino acid", so it is classified into the class "science".
In many cases, each key word belongs to more than one class.

Document classification using distance
A new article is classified into the class whose distance from the article is minimum. Example: the articles of class A, class B and class C are plotted in a feature space; the distance between class A and the new article is minimum, so the new article is classified into class A.
The vector space model is well known. This approach is very easy to use in real cases, but there is no theoretical guarantee on accuracy.
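To make the distance-based approach concrete, the sketch below classifies a new article into the class whose centroid is closest under cosine distance. This is a minimal sketch of a generic vector space model, not the authors' implementation; the toy articles, vocabulary and class names are invented for the example.

```python
# Minimal sketch of distance-based classification (vector space model).
# Not the authors' implementation: classes, articles and key words are toy data.
from collections import Counter
import math

def tf_vector(words):
    """Term-frequency vector of a key-word sequence."""
    return Counter(words)

def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def centroid(vectors):
    """Average term-frequency vector of a class."""
    total = Counter()
    for vec in vectors:
        total.update(vec)
    return Counter({w: c / len(vectors) for w, c in total.items()})

training = {
    "economy": [["stocks", "company"], ["company", "market"]],
    "science": [["amino", "acid", "computer"], ["computer", "experiment"]],
    "sport":   [["baseball", "swimming"], ["baseball", "game"]],
}
centroids = {c: centroid([tf_vector(doc) for doc in docs]) for c, docs in training.items()}

new_article = ["amino", "acid", "found"]
# Maximum cosine similarity = minimum cosine distance.
best_class = max(centroids, key=lambda c: cosine(tf_vector(new_article), centroids[c]))
print(best_class)  # expected: "science"
```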

Document classification using a probabilistic model
Key words occur according to probability distributions. A new article is classified into the class whose error rate is minimum.
Example: the article "A new characteristic of amino acid was found." occurs as follows: the class "science" occurs, and then the key word "amino acid" occurs under the condition that the class "science" occurs. Parameters dominate these probability distributions.
Article classification = estimation of the class.
Our previous research minimizes the error rate with respect to the Bayes criterion, but its accuracy is low with small training data. We want to improve the accuracy with small training data.
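In other words (written here in generic notation, with $\boldsymbol{w}_{\mathrm{new}}$ the key-word string of the new article, $d^N$ the training data and $\mathcal{C}$ the set of classes), minimizing the error rate under the Bayes criterion amounts to choosing the class with the largest posterior probability given the new article and the training data:

$$\hat{c}_{\mathrm{new}} \;=\; \arg\max_{c \in \mathcal{C}} \; P\bigl(c \mid \boldsymbol{w}_{\mathrm{new}}, d^N\bigr) .$$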

Preparations (1/3)
$c$: a class of documents; $\mathcal{C}$: the set of classes.
$w$: a key word; $\mathcal{V}$: the set of key words.
$P(c \mid \theta)$: the probability of the event that class $c$ occurs; $\theta$: the parameter which dominates it; $\theta^*$: the true parameter, which is unknown.
$P(w \mid c, \eta)$: the probability of the event that key word $w$ occurs in a document in class $c$; $\eta$: the parameter which dominates it; $\eta^*$: the true parameter, which is unknown.
$d_{\mathrm{new}}$: a new document; $c_{\mathrm{new}}$: the class of the new document $d_{\mathrm{new}}$ ($c_{\mathrm{new}}$ is unknown).
$\boldsymbol{w}_{\mathrm{new}} = w_{\mathrm{new},1} \cdots w_{\mathrm{new},n_{\mathrm{new}}}$: the string of key words in the new document $d_{\mathrm{new}}$ ($\boldsymbol{w}_{\mathrm{new}}$ is known); $n_{\mathrm{new}}$: the number of key words in the new document.

Preparations (2/3)
$d^N = \{d_1, d_2, \dots, d_N\}$: the training data; $N$: the number of documents in the training data.
$c_i$: the class of the $i$-th document in the training data; $n_i$: the number of key words in the $i$-th document in the training data.
$\boldsymbol{w}_i = w_{i,1} \cdots w_{i,n_i}$: the string of key words in the $i$-th document in the training data; $w_{i,j}$: the $j$-th key word in the $i$-th document in the training data.
The probability of the event that the new document occurs:
$$P(c_{\mathrm{new}}, \boldsymbol{w}_{\mathrm{new}} \mid \theta^*, \eta^*) = P(c_{\mathrm{new}} \mid \theta^*) \prod_{j=1}^{n_{\mathrm{new}}} P(w_{\mathrm{new},j} \mid c_{\mathrm{new}}, \eta^*) . \quad (1)$$
The probability of the event that the training data occurs:
$$P(d^N \mid \theta^*, \eta^*) = \prod_{i=1}^{N} P(c_i \mid \theta^*) \prod_{j=1}^{n_i} P(w_{i,j} \mid c_i, \eta^*) . \quad (2)$$

Preparations (3/3)
Document classification problem: estimating the unknown class $c_{\mathrm{new}}$ of the new document $d_{\mathrm{new}}$, under the condition that the string of key words $\boldsymbol{w}_{\mathrm{new}}$ in $d_{\mathrm{new}}$ and the training data $d^N$ are given.

Our first previous method (1/2)
The class of the new document is estimated by
$$\hat{c}_{\mathrm{new}} = \arg\max_{c \in \mathcal{C}} \; \tilde{P}(c \mid d^N) \prod_{w \in \mathcal{V}} \tilde{P}(w \mid c, d^N)^{\,n_{\mathrm{new}}(w)} , \quad (3)$$
where
$$\tilde{P}(c \mid d^N) = \frac{N(c) + \alpha_c}{N + \sum_{c' \in \mathcal{C}} \alpha_{c'}} , \quad (4)$$
$$\tilde{P}(w \mid c, d^N) = \frac{N(w, c) + \beta_{w,c}}{\sum_{w' \in \mathcal{V}} \bigl( N(w', c) + \beta_{w',c} \bigr)} , \quad (5)$$
$\alpha_c$, $\beta_{w,c}$: the parameters of the Dirichlet distributions for $\theta$ and $\eta$ (the prior distributions for the unknown parameters).
$N(c)$: the number of documents in class $c$ in the training data.
$N(w, c)$: the number of occurrences of key word $w$ in the documents in class $c$ in the training data.
$n_{\mathrm{new}}(w)$: the number of occurrences of key word $w$ in the string $\boldsymbol{w}_{\mathrm{new}}$.
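A small sketch of how eq. (3)-(5) can be computed is given below, assuming the Dirichlet-smoothed form reconstructed above. The value 0.5 for the prior parameters follows the next slide; the data structures, function names and toy documents are invented for the example and are not the authors' code.

```python
# Sketch of the first previous method: Bayes-optimal classification with
# Dirichlet priors (assumed form of eq. (3)-(5); not the authors' code).
from collections import Counter, defaultdict
import math

def train_counts(training_docs):
    """training_docs: list of (class_label, [key words]). Returns N(c) and N(w, c)."""
    n_c = Counter()
    n_wc = defaultdict(Counter)
    for label, words in training_docs:
        n_c[label] += 1
        n_wc[label].update(words)
    return n_c, n_wc

def classify(words, n_c, n_wc, vocab, alpha=0.5, beta=0.5):
    """Eq. (3): argmax over classes of the smoothed log posterior."""
    n_docs = sum(n_c.values())
    best, best_score = None, -math.inf
    for c in n_c:
        # Eq. (4): predictive class probability with Dirichlet parameter alpha.
        score = math.log((n_c[c] + alpha) / (n_docs + alpha * len(n_c)))
        denom = sum(n_wc[c].values()) + beta * len(vocab)
        for w in words:
            # Eq. (5): predictive key-word probability with Dirichlet parameter beta.
            score += math.log((n_wc[c][w] + beta) / denom)
        if score > best_score:
            best, best_score = c, score
    return best

docs = [("science", ["amino", "acid", "computer"]), ("sport", ["baseball", "swimming"])]
n_c, n_wc = train_counts(docs)
vocab = {w for _, ws in docs for w in ws}
print(classify(["amino", "acid"], n_c, n_wc, vocab))  # expected: "science"
```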

Our first previous method (2/2)
The first previous method is the optimal method which minimizes the error rate with respect to the Bayes criterion. The value 0.5 is used for each parameter of the prior distributions in order to represent no prior information. However, the accuracy is low with small training data, because with small training data the accuracy depends strongly on the prior distributions. Our second previous method improves the accuracy with small training data.

Our second previous method (1/2)
We estimate the prior distributions using estimating data. The new documents and the training data occur from the same source, while the estimating data occurs from another source.
$$d_{\mathrm{e}}^M = \{d_{\mathrm{e},1}, d_{\mathrm{e},2}, \dots, d_{\mathrm{e},M}\} : \text{the estimating data for the prior distributions,} \quad (6)$$
$M$: the number of documents in the estimating data.
$c_{\mathrm{e},i}$: the class of the $i$-th document in the estimating data; $n_{\mathrm{e},i}$: the number of key words in the $i$-th document.
$\boldsymbol{w}_{\mathrm{e},i}$: the string of key words in the $i$-th document; $w_{\mathrm{e},i,j}$: the $j$-th key word in the $i$-th document.

Our second previous method (2/2)
The parameters $\alpha_c$ and $\beta_{w,c}$ in eq. (4) and eq. (5), i.e., the parameters of the Dirichlet prior distributions for $\theta$ and $\eta$, are estimated from the estimating data by eq. (7) and eq. (8).
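The exact forms of eq. (7) and eq. (8) are not reproduced here. To make the idea concrete, the sketch below assumes one natural choice, deriving each Dirichlet parameter from the corresponding class and key-word counts in the estimating data (with add-one smoothing so all parameters stay positive); this rule, the scale factor and all names are illustrative assumptions rather than the slides' equations.

```python
# Hedged sketch of the second previous method: Dirichlet prior parameters
# estimated from the estimating data. The counting rule below is an assumed
# stand-in for eq. (7)/(8), chosen only for illustration.
from collections import Counter, defaultdict

def estimate_priors(estimating_docs, vocab, scale=1.0):
    """estimating_docs: list of (class_label, [key words]).
    Returns alpha_c and beta_{w,c} to plug into eq. (4) and eq. (5)."""
    m_c = Counter()
    m_wc = defaultdict(Counter)
    for label, words in estimating_docs:
        m_c[label] += 1
        m_wc[label].update(words)
    m_docs = sum(m_c.values())
    # Assumed eq. (7): class prior parameter proportional to the class frequency.
    alpha = {c: scale * m_c[c] / m_docs for c in m_c}
    # Assumed eq. (8): key-word prior parameter from smoothed relative frequencies.
    beta = {c: {w: scale * (m_wc[c][w] + 1) / (sum(m_wc[c].values()) + len(vocab))
                for w in vocab} for c in m_c}
    return alpha, beta

est = [("science", ["amino", "acid"]), ("sport", ["baseball"]), ("science", ["computer"])]
vocab = {"amino", "acid", "baseball", "computer"}
alpha, beta = estimate_priors(est, vocab)
print(alpha["science"], beta["sport"]["baseball"])
```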

Experiments with the previous methods (1/2)
Comparison between our first previous method and our second previous method.
First previous method (prev1): 0.5 is used as each parameter of the prior distributions.
Second previous method (prev2): the prior distributions are estimated using the estimating data.
New documents: 10,000 (Japanese Mainichi Newspaper, 2007).
Training data: Japanese Mainichi Newspaper, 2007.
Estimating data: 50,000 (Japanese Mainichi Newspaper, 1994).

Experiments with the previous methods (2/2)
The accuracy of prev2 is higher than that of prev1 with small training data, but lower than that of prev1 with large training data.

Proposed method
The parameters $\alpha_c$ and $\beta_{w,c}$ in eq. (4) and eq. (5) are estimated by eq. (9) and eq. (10), using both the training data and the estimating data, so that the estimating data is used mainly when the training data is small and the training data is used mainly when the training data is large.
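The exact forms of eq. (9) and eq. (10) are not reproduced here. Following the behaviour described on the conclusion slide (rely mainly on the estimating data when the training data is small, and mainly on the training data when it is large), the sketch below assumes a simple weighting of the estimating-data counts that shrinks as the training set grows; the weighting rule, the constant n0 and all names are hypothetical choices for illustration only.

```python
# Hedged sketch of the proposed method: Dirichlet prior parameters for eq. (4)
# and eq. (5) derived from the estimating data, down-weighted as the training
# data grows. The blending rule is an assumption, not the slides' eq. (9)/(10).
from collections import Counter, defaultdict

def count(docs):
    """docs: list of (class_label, [key words]) -> (class counts, per-class word counts)."""
    c_cnt, w_cnt = Counter(), defaultdict(Counter)
    for label, words in docs:
        c_cnt[label] += 1
        w_cnt[label].update(words)
    return c_cnt, w_cnt

def proposed_priors(training_docs, estimating_docs, classes, vocab, n0=100.0):
    """Returns alpha_c and beta_{w,c} to plug into eq. (4) and eq. (5).
    With few training documents the estimating-data counts dominate the priors;
    with many training documents the priors shrink toward the uninformative 0.5."""
    n = len(training_docs)
    m_c, m_wc = count(estimating_docs)
    weight = n0 / (n0 + n)  # assumed weighting: close to 1 for small n, close to 0 for large n
    alpha = {c: weight * m_c[c] + 0.5 for c in classes}
    beta = {c: {w: weight * m_wc[c][w] + 0.5 for w in vocab} for c in classes}
    return alpha, beta

train = [("science", ["amino", "acid"])]
est = [("science", ["computer"]), ("sport", ["baseball"])] * 10
alpha, beta = proposed_priors(train, est, {"science", "sport"},
                              {"amino", "acid", "computer", "baseball"})
print(alpha)  # the estimating-data counts dominate because the training set is tiny
```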

Experiments (1/2)
Comparison between our first previous method and the newly proposed method.
First previous method (prev1): 0.5 is used as each parameter of the prior distributions.
Proposed method (pro): the prior distributions are estimated using both the training data and the estimating data.
New documents: 10,000 (Japanese Mainichi Newspaper, 2007).
Training data: Japanese Mainichi Newspaper, 2007.
Estimating data: 50,000 (Japanese Mainichi Newspaper, 1994).

Experiments (2/2)
The accuracy of pro is higher than that of prev1 at all sizes of training data.

Conclusion
The accuracy of our newly proposed method is higher than that of our first previous method with small training data, and the two accuracies are equal when the size of the training data is large. With small training data the proposed method mainly uses the estimating data; with large training data it mainly uses the training data.
Further work: we want to study a method for choosing the parameters used in eq. (9) and eq. (10).