Slide 1: Scalable Text Classification with Sparse Generative Modeling
Antti Puurula, Waikato University
Slide 2: Sparse Computing for Big Data
– "Big Data": machine learning for processing vast datasets
– Current solution: parallel computing – processing more data is as expensive, or more so
– Alternative solution: sparse computing – scalable solutions, less expensive
Slide 3: Sparse Representation
Example: document vector
– Dense: word count vector w = [w_1, …, w_N]
  w = [0, 14, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 1, 0, 0, 3, 0, 0, 0]
– Sparse: vectors [v, c] of indices v and nonzero counts c (conversion sketched below)
  v = [2, 10, 14, 17], c = [14, 2, 1, 3]
Complexity: |w| vs. s(w), the number of nonzeros
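As an illustration of the dense-to-sparse conversion above, a minimal Python sketch (the function name is hypothetical, not from the released toolkit):

```python
def to_sparse(w):
    """Convert a dense count vector w into parallel index/count lists.

    Returns (v, c): v holds the 1-based indices of nonzero entries and
    c the corresponding counts, so storage is O(s(w)) instead of O(|w|).
    """
    v, c = [], []
    for i, count in enumerate(w, start=1):
        if count != 0:
            v.append(i)
            c.append(count)
    return v, c

dense = [0, 14, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 1, 0, 0, 3, 0, 0, 0]
print(to_sparse(dense))  # ([2, 10, 14, 17], [14, 2, 1, 3])
```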
Slide 4: Multinomial Naive Bayes (Sparse Inference with MNB)
Input: word vector w; output: label 1 ≤ m ≤ M
Sparse representation for the parameters p_m(n)
– Jelinek-Mercer interpolation: α p_s(n) + (1 − α) p^u_m(n)
– estimation: represent p^u_m with a hashtable (storage sketched below)
– inference: represent p^u_m with an inverted index
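A sketch of how the Jelinek-Mercer interpolated conditional could be stored and looked up sparsely; the class name and layout are illustrative assumptions, not the toolkit's actual data structures:

```python
class SparseMNBParams:
    """Jelinek-Mercer smoothed conditionals stored sparsely.

    p_m(n) = alpha * p_s(n) + (1 - alpha) * p_u[m][n], where p_u[m] is a
    hashtable (dict) holding only the nonzero unsmoothed probabilities.
    """
    def __init__(self, alpha, p_s, p_u):
        self.alpha = alpha  # interpolation weight for the shared background p_s
        self.p_s = p_s      # dense background distribution, indexed by feature n
        self.p_u = p_u      # list of dicts: p_u[m][n] stored only when nonzero

    def cond_prob(self, m, n):
        # Features missing from p_u[m] fall back to the background term alone.
        return self.alpha * self.p_s[n] + (1.0 - self.alpha) * self.p_u[m].get(n, 0.0)
```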
Slide 5: Sparse Inference with MNB
Dense representation: p_m(n)
[Table: the conditional probabilities p_m(n) laid out densely, one row per feature n and one column per class m]
Slide 6: Sparse Inference with MNB
Dense representation: p_m(n)
[Table: the same dense layout of p_m(n) as on the previous slide, for M classes and N features]
Time complexity: O(s(w) · M)
Slide 7: Sparse Inference with MNB
Sparse representation: α p_s(n) + (1 − α) p^u_m(n)
[Table: shared background distribution p_s(n) plus the unsmoothed conditionals p^u_m(n), with zero entries omitted]
Time complexity: O(s(w) + Σ_m Σ_{n: w_n > 0, p^u_m(n) > 0} 1)
Slide 8: Sparse Inference with MNB
Slide 9: Sparse Inference with MNB
O(s(w))
Slide 10: Sparse Inference with MNB
O(Σ_m Σ_{n: w_n > 0, p^u_m(n) > 0} 1)
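A minimal sketch of the inverted-index scoring step that these two complexity terms describe: the document-dependent background part costs O(s(w)), and only classes with a nonzero p^u_m(n) for some word in the document receive a correction. Treat the structure and names as assumptions rather than the toolkit's exact implementation:

```python
import math
from collections import defaultdict

def sparse_mnb_scores(doc, log_priors, alpha, p_s, index):
    """Score all classes for a sparse document doc = [(n, w_n), ...].

    log_priors: log p(m) per class; p_s: background distribution over features;
    index[n]: postings list of (m, p_u) pairs for classes with p^u_m(n) > 0.
    """
    # Background part: identical for every class, computed once in O(s(w)).
    background = sum(w_n * math.log(alpha * p_s[n]) for n, w_n in doc)
    corrections = defaultdict(float)
    for n, w_n in doc:
        for m, p_u in index.get(n, []):
            smoothed = alpha * p_s[n] + (1.0 - alpha) * p_u
            # Swap the background-only term for the fully smoothed one.
            corrections[m] += w_n * (math.log(smoothed) - math.log(alpha * p_s[n]))
    return {m: lp + background + corrections[m] for m, lp in log_priors.items()}
```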
Slide 11: Multi-label Classifiers
Multi-label classification
– binary label vector l = [l_1, …, l_M] instead of a single label m
– 2^M possible label vectors, not directly solvable
Solved with multi-label extensions:
– Binary Relevance (Godbole & Sarawagi 2004)
– Label Powerset (Boutell et al. 2004)
– Multi-label Mixture Model (McCallum 1999)
Slide 12: Multi-label Classifiers
Feature normalization: TF-IDF (sketched below)
– s(w^u) length normalization ("L0-norm")
– TF log-transform of counts, corrects "burstiness"
– IDF transform, unsmoothed Croft-Harper IDF
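A rough sketch of the kind of TF-IDF transform these bullets describe; the exact formulas in the paper may differ, and in particular the precise forms of the unsmoothed Croft-Harper IDF and of the length normalization are assumptions here:

```python
import math

def tfidf_transform(doc, doc_freq, num_docs):
    """Reweight a sparse count document [(n, w_n), ...] with TF-IDF.

    TF: log-transform of the raw counts to dampen word burstiness.
    IDF: Croft-Harper style weight log((N - df) / df), used unsmoothed.
    Length normalization: divide by s(w), the number of unique terms.
    """
    s_w = len(doc)  # number of nonzero features ("L0-norm" of the document)
    weighted = []
    for n, w_n in doc:
        tf = math.log(1.0 + w_n)
        idf = math.log((num_docs - doc_freq[n]) / doc_freq[n])
        weighted.append((n, tf * idf / s_w))
    return weighted
```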
Slide 13: Multi-label Classifiers
Classifier modification with metaparameters a
– a_1: Jelinek-Mercer smoothing of the conditionals p_m(n)
– a_2: count pruning in training with a threshold
– a_3: prior scaling, replacing p(l) by p(l)^{a_3}
– a_4: class pruning in classification with a threshold
Slide 14: Multi-label Classifiers
Direct optimization of a with random search (sketched below)
– target f(a): development-set F-score
Parallel random search
– iteratively sample points around the current maximum of f(a)
– generate points with dynamically adapted step sizes
– evaluate f(a) over I iterations of J parallel points
– I = 30, J = 50 → 1500 configurations of a sampled
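A simplified sketch of the parallel random search loop described above; the step-adaptation rule and the way candidates would be farmed out in parallel are assumptions for illustration:

```python
import random

def random_search(f, init, num_iters=30, points_per_iter=50, init_step=0.5):
    """Maximize f(a) over metaparameter vectors a by iterated random sampling.

    Each iteration draws points_per_iter candidates around the current best
    point (these evaluations are embarrassingly parallel), keeps the best one,
    and adapts the step size depending on whether the maximum improved.
    """
    best_a, best_f, step = list(init), f(init), init_step
    for _ in range(num_iters):
        candidates = [[a + random.uniform(-step, step) for a in best_a]
                      for _ in range(points_per_iter)]
        scored = [(f(a), a) for a in candidates]  # parallelizable evaluations
        top_f, top_a = max(scored, key=lambda x: x[0])
        if top_f > best_f:
            best_f, best_a = top_f, top_a
            step *= 1.1   # improvement: widen the search slightly
        else:
            step *= 0.5   # no improvement: focus around the current best
    return best_a, best_f
```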
Slide 15: Experiment Datasets
Slide 16: Experiment Results
Slide 17: Conclusion
New idea: sparse inference
– reduces the time complexity of probabilistic inference
– demonstrated for multi-label classification
– applicable with different models (KNN, SVM, …) and uses (clustering, ranking, …)
Code available, with a Weka wrapper:
– Weka package manager: SparseGenerativeModel
– http://sourceforge.net/projects/sgmweka/
Slide 18: Multi-label Classification – Binary Relevance (Godbole & Sarawagi 2004)
– each label decision is an independent binary problem (sketched below)
– positive multinomial vs. negative multinomial; negatives approximated with a background multinomial
– threshold parameter for improved accuracy
+ fast, simple, easy to implement
− ignores label correlations, poor performance
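A hedged sketch of the binary-relevance decision rule on this slide; the scoring interfaces and the exact use of the threshold are illustrative assumptions:

```python
def binary_relevance_predict(doc, label_models, background_model, threshold):
    """Predict a binary label vector by treating each label independently.

    label_models[m](doc): log-likelihood of doc under label m's positive
    multinomial; background_model(doc): log-likelihood under the shared
    background multinomial standing in for the negative class.
    A label is switched on when the log-odds exceed the tuned threshold.
    """
    labels = []
    for positive_model in label_models:
        log_odds = positive_model(doc) - background_model(doc)
        labels.append(1 if log_odds > threshold else 0)
    return labels
```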
Slide 19: Multi-label Classification – Label Powerset (Boutell et al. 2004)
– each labelset seen in training is mapped to a class (mapping sketched below)
– a hashtable converts predicted classes back to labelsets
+ models label correlations, good performance
− takes memory, cannot classify new labelsets
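A minimal sketch of the labelset-to-class mapping; how the resulting classes are fed to the underlying single-label classifier is left out, and the function name is hypothetical:

```python
def build_label_powerset(training_labelsets):
    """Map each distinct labelset seen in training to a single class id.

    Returns (labelset_to_class, class_to_labelset): the first relabels the
    training documents, the second (a hashtable) converts a predicted class
    back into its labelset. Labelsets never seen in training cannot be
    produced at classification time.
    """
    labelset_to_class, class_to_labelset = {}, {}
    for labelset in training_labelsets:
        key = frozenset(labelset)
        if key not in labelset_to_class:
            class_id = len(labelset_to_class)
            labelset_to_class[key] = class_id
            class_to_labelset[class_id] = sorted(key)
    return labelset_to_class, class_to_labelset
```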
Slide 20: Multi-label Classification – Multi-label Mixture Model
– mixture for the prior (equation shown on the slide)
– classification with greedy search (McCallum 1999); complexity: q times the MNB complexity, where q is the maximum labelset size s(l) seen in training
+ models labelsets, generalizes to new labelsets
− assumes a uniform linear decomposition
Slide 21: Multi-label Classification – Multi-label Mixture Model
– related models: McCallum 1999, Ueda 2002
– like Label Powerset, but the labelset conditionals decompose into a mixture of the label conditionals (computed as sketched below):
  p_l(n) = 1/s(l) Σ_m l_m p_m(n)
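A small sketch of the labelset conditional as a uniform mixture of per-label conditionals, directly following the formula above; storing each p_m as a sparse dict is an assumption:

```python
def labelset_conditional(labelset, p, n):
    """Compute p_l(n) = 1/s(l) * sum over active labels m of p_m(n).

    labelset: iterable of active label indices m; p[m]: sparse dict of the
    label-m conditional, so features absent from p[m] contribute zero.
    """
    s_l = len(labelset)  # number of active labels, s(l)
    return sum(p[m].get(n, 0.0) for m in labelset) / s_l
```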
Slide 22: Classifier Optimization
Model modifications
– 1) Jelinek-Mercer smoothing of the conditionals (equation shown on the slide)
– 2) Count pruning: at most 8M conditional counts; on each count update, online pruning using running IDF estimates and the meta-parameter a_2
Slide 23: Classifier Optimization
Model modifications
– 3) Prior scaling: replace p(l) by p(l)^{a_3}, equivalent to LM scaling in speech recognition
– 4) Pruning in classification: sort the classes by rank and stop classification at a threshold a_4 (the sort key and pruning condition are given as equations on the slide)
Slide 24: Multinomial Naive Bayes (Sparse Inference with MNB)
– Bayes: p(w, m) = p(m) p_m(w)
– Naive: p(w, m) = p(m) Π_n p_m(w_n, n)
– Multinomial: p(w, m) ∝ p(m) Π_n p_m(n)^{w_n}