Scalable Text Classification with Sparse Generative Modeling. Antti Puurula, Waikato University.

Presentation transcript:

Page 1: Scalable Text Classification with Sparse Generative Modeling
Antti Puurula, Waikato University

Page 2: Sparse Computing for Big Data
"Big Data"
– machine learning for processing vast datasets
Current solution: Parallel computing
– processing more data costs as much, or more
Alternative solution: Sparse computing
– scalable solutions, less expensive

Page 3: Sparse Representation
Example: document vector
– Dense: word count vector w = [w_1, …, w_N]
  w = [0, 14, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 1, 0, 0, 3, 0, 0, 0]
– Sparse: vectors [v, c] of indices v and nonzero counts c
  v = [2, 10, 14, 17], c = [14, 2, 1, 3]
Complexity: |w| vs. s(w), the number of nonzeros
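A minimal sketch of this dense-to-sparse conversion (illustrative only; note that Python indices are 0-based, while the slide's v is 1-based):

def to_sparse(w):
    """Return (v, c): indices of the nonzero entries of w and their counts."""
    v = [n for n, count in enumerate(w) if count != 0]
    c = [w[n] for n in v]
    return v, c

w = [0, 14, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 1, 0, 0, 3, 0, 0, 0]
v, c = to_sparse(w)
print(v)  # [1, 9, 13, 16]  (the slide's [2, 10, 14, 17], 1-based)
print(c)  # [14, 2, 1, 3]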

Page 4: Multinomial Naive Bayes (Sparse Inference with MNB)
Input: word vector w; output: label 1 ≤ m ≤ M
Sparse representation for the parameters p_m(n)
– Jelinek-Mercer interpolation: α p_s(n) + (1 − α) p^u_m(n)
– estimation: represent p^u_m with a hashtable
– inference: represent p^u_m with an inverted index
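The inverted-index inference can be sketched as follows. This is a hedged illustration, not the SparseGenerativeModel code: log_prior maps class m to log p(m), p_s[n] is the smoothing distribution, and index maps word id n to the (class, p^u_m(n)) pairs with nonzero unsmoothed probability.

import math

def score_classes(v, c, log_prior, p_s, index, alpha):
    # Shared background term, computed once over the document's s(w) nonzero words.
    base = sum(cnt * math.log(alpha * p_s[n]) for n, cnt in zip(v, c))
    scores = {m: lp + base for m, lp in log_prior.items()}
    # Class-specific corrections from the inverted index: only (word, class)
    # pairs with p^u_m(n) > 0 are ever touched.
    for n, cnt in zip(v, c):
        for m, p_um in index.get(n, ()):
            smoothed = alpha * p_s[n] + (1.0 - alpha) * p_um
            scores[m] += cnt * (math.log(smoothed) - math.log(alpha * p_s[n]))
    return scores

The background term costs O(s(w)) and the correction loop costs one update per posting, matching the complexities quoted on pages 7-10.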

Page 5: Sparse Inference with MNB
Dense representation: p_m(n)
[Figure: dense parameter table, rows p_1(n) … p_12(n) for the classes, columns p_m(1) … p_m(9) for the features]

Page 6: Sparse Inference with MNB
Dense representation: p_m(n)
[Figure: the same dense parameter table, M = 12 classes by N = 9 features]
Time complexity: O(s(w) M)

Page 7: Sparse Inference with MNB
Sparse representation: α p_s(n) + (1 − α) p^u_m(n)
[Figure: sparse class-conditional table p^u_m(n) next to a dense smoothing vector p_s(n) = 0.18, 0.07, 0.06, 0.06, 0.04, 0.02, 0.02, 0.01, …]
Time complexity: O(s(w) + Σ_m Σ_{n : p_m(n) > 0} 1)

Page 8: Sparse Inference with MNB

Page 9: Sparse Inference with MNB
O(s(w))

Page 10: Sparse Inference with MNB
O(Σ_m Σ_{n : p_m(n) > 0} 1)

Page 11: Multi-label Classifiers
Multi-label classification
– binary label vector l = [l_1, …, l_M] instead of a single label m
– 2^M possible label vectors, not directly solvable
Solved with multi-label extensions:
– Binary Relevance (Godbole & Sarawagi 2004)
– Label Powerset (Boutell et al. 2004)
– Multi-label Mixture Model (McCallum 1999)

Page 12: Multi-label Classifiers
Feature normalization: TF-IDF
– s(w_u) length normalization ("L0-norm")
– TF log-transform of counts, correcting "burstiness"
– IDF transform, unsmoothed Croft-Harper IDF
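One possible reading of this transform, as a sketch only: the exact combination below and the unsmoothed Croft-Harper form idf(n) = log((N_docs − df(n)) / df(n)) are assumptions, not taken from the talk.

import math

def tfidf(v, c, df, n_docs):
    """Transform a sparse count vector (v, c) into TF-IDF feature values."""
    s_w = len(v)  # "L0-norm": number of distinct words in the document
    feats = []
    for n, cnt in zip(v, c):
        tf = math.log(1.0 + cnt) / s_w            # log TF, length-normalized
        idf = math.log((n_docs - df[n]) / df[n])  # assumes 0 < df[n] < n_docs
        feats.append(tf * idf)
    return v, feats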

Page 13: Multi-label Classifiers
Classifier modification with metaparameters a
– a_1: Jelinek-Mercer smoothing of the conditionals p_m(n)
– a_2: count pruning in training with a threshold
– a_3: prior scaling, replacing p(l) by p(l)^a_3
– a_4: class pruning in classification with a threshold

Page 14: Multi-label Classifiers
Direct optimization of a with random search
– target f(a): development set F-score
Parallel random search
– iteratively sample points around the current maximum of f(a)
– generate points by dynamically adapted steps
– sample f(a) in I iterations of J parallel points
– I = 30, J = 50 → 1500 configurations of a sampled
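A compact sketch of such a search, with eval_fscore standing in for the development-set F-score f(a); the step-adaptation rule and the sequential (rather than parallel) evaluation are simplifications.

import random

def random_search(eval_fscore, a0, iters=30, points=50, step=0.5, shrink=0.8):
    best_a, best_f = list(a0), eval_fscore(a0)
    for _ in range(iters):
        # Sample J candidate configurations around the current best point.
        candidates = [[ai + random.uniform(-step, step) for ai in best_a]
                      for _ in range(points)]
        scores = [eval_fscore(a) for a in candidates]
        top = max(range(points), key=lambda i: scores[i])
        if scores[top] > best_f:
            best_a, best_f = candidates[top], scores[top]
        else:
            step *= shrink  # adapt the step size when no candidate improves
    return best_a, best_f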

Page 15: Experiment Datasets

Page 16: Experiment Results

Page 17: Conclusion
New idea: sparse inference
– reduces the time complexity of probabilistic inference
– demonstrated for multi-label classification
– applicable with different models (KNN, SVM, …) and uses (clustering, ranking, …)
Code available, with a Weka wrapper:
– Weka package manager: SparseGenerativeModel
–

Page 18: Multi-label classification
Binary Relevance (Godbole & Sarawagi 2004)
– each label decision is an independent binary problem
– positive multinomial vs. negative multinomial
  – negatives approximated with a background multinomial
– threshold parameter for improved accuracy
+ fast, simple, easy to implement
− ignores label correlations, poor performance
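An illustrative sketch of these per-label decisions; the probability floor for unseen words and the shared threshold are assumptions.

import math

def binary_relevance(v, c, pos_models, p_s, log_prior, threshold):
    """Return the labels whose positive-vs-background log-odds exceed the threshold."""
    labels = set()
    for m, p_m in pos_models.items():
        log_odds = log_prior[m]
        for n, cnt in zip(v, c):
            log_odds += cnt * (math.log(p_m.get(n, 1e-10)) - math.log(p_s[n]))
        if log_odds > threshold:
            labels.add(m)
    return labels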

Page 19: Multi-label classification
Label Powerset (Boutell et al. 2004)
– each labelset seen in training is mapped to a class
– hashtable for converting classes back to labelsets
+ models label correlations, good performance
− takes memory, cannot classify new labelsets
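A minimal sketch of the labelset-to-class mapping and its inverse table:

def build_powerset(train_labelsets):
    class_of, labelset_of = {}, []
    for labels in train_labelsets:
        key = frozenset(labels)
        if key not in class_of:          # each distinct labelset becomes one class
            class_of[key] = len(labelset_of)
            labelset_of.append(sorted(key))
    return class_of, labelset_of

class_of, labelset_of = build_powerset([{1, 3}, {2}, {1, 3}, {4, 5}])
print(len(labelset_of))  # 3 classes for 3 distinct labelsets
print(labelset_of[0])    # [1, 3]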

Page 20: Multi-label classification
Multi-label Mixture Model
– mixture for the prior
– classification with greedy search (McCallum 1999)
  – complexity: q times the MNB complexity, where q is the maximum labelset size s(l) seen in training
+ models labelsets, generalizes to new labelsets
− assumes a uniform linear decomposition
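A hedged sketch of the greedy labelset search, with score(labels) standing in for the mixture-model document score and all_labels a set of label ids; the exact stopping rule here is an assumption.

def greedy_labelset(score, all_labels, max_size):
    # Start from the best single label, then add labels while the score improves.
    chosen = {max(all_labels, key=lambda m: score({m}))}
    best = score(chosen)
    while len(chosen) < max_size:
        candidates = [(score(chosen | {m}), m) for m in all_labels - chosen]
        if not candidates:
            break
        top_score, top_label = max(candidates)
        if top_score <= best:
            break
        chosen.add(top_label)
        best = top_score
    return chosen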

Page 21: Multi-label classification
Multi-label Mixture Model
– related models: McCallum 1999, Ueda 2002
– like Label Powerset, but labelset conditionals decompose into a mixture of the label conditionals:
  p_l(n) = 1/s(l) Σ_m l_m p_m(n)
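In code this decomposition is a one-liner (a sketch; p[m] is assumed to be a dict of word probabilities for label m):

def labelset_conditional(labels, p, n):
    # p_l(n) = 1/s(l) * sum over labels m in l of p_m(n)
    return sum(p[m].get(n, 0.0) for m in labels) / len(labels)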

Page 22: Classifier optimization
Model modifications
– 1) Jelinek-Mercer smoothing of the conditionals: p_m(n) = α p_s(n) + (1 − α) p^u_m(n)
– 2) Count pruning. At most 8M conditional counts. On each count update: online pruning with running IDF estimates and meta-parameter a_2

Page 23: Classifier optimization
Model modifications
– 3) Prior scaling. Replace p(l) by p(l)^a_3; equivalent to LM scaling in speech recognition
– 4) Pruning in classification. Sort classes by rank and stop classification at a threshold a_4
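The sort and prune formulas on the original slide are not recoverable, so the following is only an assumed illustration of rank-based pruning: rank the classes by score and keep those within a_4 of the best log-score.

def prune_classes(scores, a4):
    """scores: dict class -> log-score; keep classes within a4 of the best."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    best = ranked[0][1]
    return [m for m, s in ranked if s >= best - a4]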

Page 24: Multinomial Naive Bayes (Sparse Inference with MNB)
– Bayes: p(w, m) = p(m) p_m(w)
– Naive: p(w, m) = p(m) Π_n p_m(w_n, n)
– Multinomial: p(w, m) ∝ p(m) Π_n p_m(n)^w_n
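In log space the multinomial line is the score the classifier computes (a small dense sketch; the sparse, smoothed version is sketched after page 4):

import math

def log_score(v, c, log_prior_m, p_m):
    # log p(w, m) = log p(m) + Σ_n w_n log p_m(n), dropping the multinomial coefficient
    return log_prior_m + sum(cnt * math.log(p_m[n]) for n, cnt in zip(v, c))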