Scalable Text Classification with Sparse Generative Modeling
Antti Puurula, Waikato University
Sparse Computing for Big Data
– "Big Data": machine learning for processing vast datasets
– Current solution: parallel computing – processing more data costs as much, or more
– Alternative solution: sparse computing – scalable solutions that cost less
Sparse Representation
Example: document vector
– Dense: word count vector w = [w_1, …, w_N]
  w = [0, 14, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 1, 0, 0, 3, 0, 0, 0]
– Sparse: vectors [v, c] of indices v and nonzero counts c
  v = [2, 10, 14, 17], c = [14, 2, 1, 3]
– Complexity: |w| vs. s(w), the number of nonzeros
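As a concrete illustration, here is a minimal Python sketch (not from the slides) of converting a dense count vector into the [v, c] index/value representation above.

```python
# Minimal sketch (not from the slides): converting a dense count vector to
# the [v, c] index/value representation, in time proportional to |w|.
def to_sparse(w):
    """Return (v, c): indices of nonzero entries of w and their counts."""
    v = [i for i, count in enumerate(w) if count != 0]
    c = [w[i] for i in v]
    return v, c

w = [0, 14, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 1, 0, 0, 3, 0, 0, 0]
v, c = to_sparse(w)
print(v, c)  # [1, 9, 13, 16] [14, 2, 1, 3]  (0-based; the slide uses 1-based indices)
```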
Sparse Inference with MNB
Multinomial Naive Bayes
– input: word vector w; output: label 1 ≤ m ≤ M
– sparse representation for the parameters p_m(n)
  – Jelinek-Mercer interpolation: α p_s(n) + (1 − α) p^u_m(n)
  – estimation: represent p^u_m with a hashtable
  – inference: represent p^u_m with an inverted index
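A minimal sketch of the estimation side, with hashtables (Python dicts) playing the role described above; the simple maximum-likelihood estimator is an assumption, not the paper's exact code.

```python
from collections import defaultdict

# Minimal sketch (assumed maximum-likelihood estimator, not the paper's code):
# accumulating class-conditional counts for p^u_m and the shared background p_s
# with hashtables (Python dicts), from sparse training documents.
def estimate(training_data):
    """training_data: iterable of ((v, c), m) sparse documents with label m."""
    class_counts = defaultdict(lambda: defaultdict(float))
    class_totals = defaultdict(float)
    background_counts = defaultdict(float)
    background_total = 0.0
    for (v, c), m in training_data:
        for n, count in zip(v, c):
            class_counts[m][n] += count
            class_totals[m] += count
            background_counts[n] += count
            background_total += count
    p_u = {m: {n: count / class_totals[m] for n, count in counts.items()}
           for m, counts in class_counts.items()}
    p_s = {n: count / background_total for n, count in background_counts.items()}
    return p_u, p_s
```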
Sparse Inference with MNB
Dense representation: p_m(n)
[figure: dense M × N parameter matrix, M classes × N features]
– time complexity: O(s(w) · M)
Sparse Inference with MNB
Sparse representation: α p_s(n) + (1 − α) p^u_m(n)
[figure: sparse class-conditional matrix p^u_m(n) plus a shared dense background vector p_s(n), shown with example values]
– time complexity: O(s(w) + Σ_m Σ_{n: w_n>0, p^u_m(n)>0} 1), i.e. only (word, class) pairs with nonzero sparse parameters are visited
Sparse Inference with MNB
The complexity decomposes into two parts:
– shared smoothing term α p_s(n): O(s(w))
– class-specific sparse terms (1 − α) p^u_m(n): O(Σ_m Σ_{n: w_n>0, p^u_m(n)>0} 1)
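A minimal sketch of the inverted-index scoring that gives this decomposition; the data layout and helper names are assumptions, not the released implementation. The shared α p_s(n) term is paid once per document word, and class-specific corrections touch only nonzero parameters.

```python
import math

# Minimal sketch (assumptions, not the released implementation): scoring one
# document with an inverted index over the sparse parameters p^u_m(n). The
# shared term alpha * p_s(n) costs O(s(w)); the class-specific corrections
# touch only (word, class) pairs with p^u_m(n) > 0.
def score_document(v, c, log_prior, p_s, inverted_index, alpha):
    """
    v, c           : sparse document (word indices and counts)
    log_prior      : dict m -> log p(m)
    p_s            : dict n -> smoothing distribution p_s(n)
    inverted_index : dict n -> list of (m, p^u_m(n)) pairs with p^u_m(n) > 0
    """
    # Baseline every class shares: sum_n w_n * log(alpha * p_s(n))
    baseline = sum(count * math.log(alpha * p_s[n]) for n, count in zip(v, c))
    scores = {m: lp + baseline for m, lp in log_prior.items()}
    # Sparse corrections: replace the baseline term wherever p^u_m(n) > 0
    for n, count in zip(v, c):
        background = alpha * p_s[n]
        for m, p_u in inverted_index.get(n, ()):
            scores[m] += count * (math.log(background + (1 - alpha) * p_u)
                                  - math.log(background))
    return scores  # dict m -> log p(m) + log p_m(w)
```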
Multi-label Classifiers
Multi-label classification:
– binary label vector l = [l_1, …, l_M] instead of a single label m
– 2^M possible label vectors, not directly solvable
Solved with multi-label extensions:
– Binary Relevance (Godbole & Sarawagi 2004)
– Label Powerset (Boutell et al. 2004)
– Multi-label Mixture Model (McCallum 1999)
Multi-label Classifiers
Feature normalization: TF-IDF (see the sketch below)
– s(w_u) length normalization ("L0-norm")
– TF log-transform of counts, corrects "burstiness"
– IDF transform, unsmoothed Croft-Harper IDF
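A minimal sketch of this normalization pipeline; the exact transforms are assumptions (one plausible reading of "TF log-transform", "unsmoothed Croft-Harper IDF" and "L0-norm" length normalization), not the paper's code.

```python
import math

# Minimal sketch; the exact transforms are assumptions, not the paper's code.
def tfidf(v, c, doc_freq, num_docs):
    """v, c: sparse document; doc_freq[n]: number of documents containing n
    (assumed < num_docs)."""
    weights = []
    for n, count in zip(v, c):
        tf = math.log(1.0 + count)                               # damp "burstiness"
        idf = math.log((num_docs - doc_freq[n]) / doc_freq[n])   # unsmoothed Croft-Harper
        weights.append(tf * idf)
    norm = float(len(v))                                         # s(w), the "L0-norm"
    return v, [x / norm for x in weights]
```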
Multi-label Classifiers
Classifier modification with metaparameters a:
– a_1: Jelinek-Mercer smoothing of the conditionals p_m(n)
– a_2: count pruning in training with a threshold
– a_3: prior scaling, replacing p(l) by p(l)^a_3
– a_4: class pruning in classification with a threshold
Multi-label Classifiers
Direct optimization of a with random search:
– target f(a): development-set F-score
Parallel random search (sketched below):
– iteratively sample points around the current maximum of f(a)
– generate points with dynamically adapted step sizes
– sample f(a) in I iterations of J parallel points
– I = 30, J = 50 → 1500 configurations of a sampled
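A minimal sketch of such a search loop; the shrinking-step rule and the serial evaluation of the J candidates are assumptions, not the paper's optimizer.

```python
import random

# Minimal sketch (assumed search procedure, not the paper's optimizer): sample
# J candidate metaparameter vectors per iteration around the current best and
# shrink the step size when no candidate improves the development-set F-score.
def random_search(f, a0, iterations=30, points=50, step=1.0):
    """f: metaparameter vector -> development-set F-score; a0: starting point."""
    best_a, best_f = list(a0), f(a0)
    for _ in range(iterations):
        candidates = [[x + random.uniform(-step, step) for x in best_a]
                      for _ in range(points)]
        scores = [f(a) for a in candidates]   # these J evaluations can run in parallel
        j = max(range(points), key=lambda i: scores[i])
        if scores[j] > best_f:
            best_a, best_f = candidates[j], scores[j]
        else:
            step *= 0.5                       # dynamically adapted step
    return best_a, best_f
```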
Experiment Datasets
Experiment Results
Conclusion
New idea: sparse inference
– reduces the time complexity of probabilistic inference
– demonstrated for multi-label classification
– applicable with different models (KNN, SVM, …) and uses (clustering, ranking, …)
Code available, with a Weka wrapper:
– Weka package manager: SparseGenerativeModel
Multi-label classification
Binary Relevance (Godbole & Sarawagi 2004)
– each label decision is an independent binary problem
– positive multinomial vs. negative multinomial
  – negatives approximated with a background multinomial
– threshold parameter for improved accuracy
+ fast, simple, easy to implement
− ignores label correlations, poor performance
Multi-label classification
Label Powerset (Boutell et al. 2004)
– each labelset seen in training is mapped to a class
– hashtable for converting classes back to labelsets (see the sketch below)
+ models label correlations, good performance
− takes memory, cannot classify new labelsets
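A minimal sketch of this reduction (not from the paper): each distinct training labelset becomes one class, and a hashtable maps predicted classes back to labelsets.

```python
# Minimal sketch (not from the paper): the Label Powerset reduction.
def build_powerset(labelsets):
    """labelsets: list of tuples of label ids, one per training document."""
    labelset_to_class, class_to_labelset, classes = {}, {}, []
    for ls in labelsets:
        key = tuple(sorted(ls))
        if key not in labelset_to_class:
            new_id = len(labelset_to_class)
            labelset_to_class[key] = new_id
            class_to_labelset[new_id] = key
        classes.append(labelset_to_class[key])
    return classes, class_to_labelset

# usage: train a single-label classifier on `classes`; at test time, map the
# predicted class id through `class_to_labelset` to recover its labelset.
```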
Multi-label classification
Multi-label Mixture Model
– mixture for the prior
– classification with greedy search (McCallum 1999)
  – complexity: q times the MNB complexity, where q is the maximum labelset size s(l) seen in training
+ models labelsets, generalizes to new labelsets
− assumes uniform linear decomposition
Multi-label classification
Multi-label Mixture Model
– related models: McCallum 1999, Ueda 2002
– like Label Powerset, but labelset conditionals decompose into a mixture of label conditionals:
  p_l(n) = 1/s(l) ∑_m l_m p_m(n)
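A minimal sketch (not from the paper) of composing a labelset conditional from the label conditionals according to this uniform mixture.

```python
# Minimal sketch (not from the paper): composing a labelset conditional as the
# uniform mixture p_l(n) = 1/s(l) * sum_m l_m * p_m(n) over the active labels.
def labelset_conditional(labelset, p_m):
    """labelset: set of active label ids; p_m: dict m -> dict n -> p_m(n)."""
    s_l = len(labelset)
    mixed = {}
    for m in labelset:
        for n, prob in p_m[m].items():
            mixed[n] = mixed.get(n, 0.0) + prob / s_l
    return mixed  # dict n -> p_l(n)
```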
Classifier optimization
Model modifications
– 1) Jelinek-Mercer smoothing of the conditionals: α p_s(n) + (1 − α) p^u_m(n)
– 2) Count pruning: at most 8M conditional counts; on each count update, online pruning with running IDF estimates and metaparameter a_2
Classifier optimization
Model modifications
– 3) Prior scaling: replace p(l) by p(l)^a_3
  – equivalent to LM scaling in speech recognition
– 4) Pruning in classification: sort classes by rank and stop classification on a threshold a_4 (one plausible reading is sketched below)
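A minimal sketch of modifications 3) and 4); the precise pruning rule (stop once a class falls a factor a_4 below the best score) is an assumption, one plausible reading of "sort classes by rank and stop on a threshold", not the paper's exact rule.

```python
import math

# Minimal sketch; the pruning rule is an assumption, not the paper's exact rule.
def scale_and_prune(log_likelihoods, log_priors, a3, a4):
    """Return surviving (class, score) pairs, best first; a4 in (0, 1]."""
    scored = {m: a3 * log_priors[m] + ll              # prior scaling p(l)^a3
              for m, ll in log_likelihoods.items()}
    ranked = sorted(scored.items(), key=lambda kv: kv[1], reverse=True)
    best = ranked[0][1]
    return [(m, s) for m, s in ranked if s >= best + math.log(a4)]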
Sparse Inference with MNB
Multinomial Naive Bayes
– Bayes: p(w, m) = p(m) p_m(w)
– Naive: p(w, m) = p(m) ∏_n p_m(w_n, n)
– Multinomial: p(w, m) ∝ p(m) ∏_n p_m(n)^w_n
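For contrast with the sparse inverted-index scoring shown earlier, here is a minimal sketch (not from the slides) of dense log-domain scoring of this model, with the O(s(w) · M) cost of the dense representation.

```python
import math

# Minimal sketch (not from the slides): dense log-domain scoring of
# p(w, m) ∝ p(m) ∏_n p_m(n)^w_n; cost is O(s(w) · M).
def dense_score(w, prior, p_m):
    """w: dense count vector; prior[m] = p(m); p_m[m][n] = p_m(n)."""
    return {m: math.log(prior[m]) + sum(count * math.log(p_m[m][n])
                                        for n, count in enumerate(w) if count > 0)
            for m in prior}
```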