1 Scalable Text Mining with Sparse Generative Models
Antti Puurula. PhD thesis presentation, University of Waikato, New Zealand, 8th June 2015

2 Introduction
This thesis presents a framework for probabilistic text mining based on sparse generative models. Models developed in the framework show state-of-the-art effectiveness in both text classification and retrieval tasks. The proposed sparse inference for these models improves scalability, enabling text mining for very large-scale tasks.

3 Major Contributions of the Thesis
Formalizing multinomial modeling of text:
- Smoothing as two-state Hidden Markov Models
- Fractional counts as probabilistic data
- Weighted factors as log-linear models
Scalable inference on text:
- Sparse inference using inverted indices for statistical models
- Tied Document Mixture, a model benefiting from sparse inference
Extensive evaluation using a combined experimental setup for classification and retrieval

4 Defining Text Mining
“Knowledge Discovery in Textual Databases” (KDT) [Feldman and Dagan, 1995]
“Text Mining as Integration of Several Related Research Areas” [Grobelnik et al., 2000]
Definition used in this thesis: text mining is an interdisciplinary field of research on the automatic processing of large quantities of text data for valuable information

5 Related Fields and Application Domains

6 Volume of Text Mining Publications
References per year found for related fields using academic search engines

9 Scale of Text Data
Existing collections:
- Google Books, 30M books (2013)
- Twitter, 200M users, 400M messages per day (2013)
- WhatsApp, 430M users, 50B messages per day (2014)
Available research collections:
- English Wikipedia, 4.5M articles (2014)
- Google n-grams, 5-grams estimated from 1T words (2007)
- Annotated English Gigaword, 4B words with metadata (2012)
- TREC KBA, 394M annotated documents for classification (2014)

10 Text Mining Methodology in a Nutshell
- Normalize and map documents into a structured representation, such as a vector of word counts (a minimal sketch follows below)
- Segment a problem into machine learning tasks
- Solve the tasks using algorithms, most commonly linear models
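As a rough illustration of the first step, the following minimal Python sketch maps raw documents to bag-of-words count vectors; the lowercasing and regex tokenizer are illustrative assumptions, not the thesis pipeline.

```python
from collections import Counter
import re

def to_count_vector(text):
    """Normalize a document and map it to a bag-of-words count vector.
    Lowercasing and the regex tokenizer are illustrative choices only."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tokens)

# Example: two short documents become sparse count vectors
docs = ["Text mining scales with sparse models.",
        "Sparse models make text mining scalable."]
vectors = [to_count_vector(d) for d in docs]
```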

13 Linear Models for Text Mining
Multi-class linear scoring function:
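A standard multi-class linear scoring function of this kind can be written as follows, where $\mathbf{x}$ is the document feature vector and $\mathbf{w}_l$, $b_l$ are the weight vector and bias for label $l$ (notation assumed here):

```latex
\hat{l}(\mathbf{x}) = \operatorname*{arg\,max}_{l} \; \mathbf{w}_l^{\top}\mathbf{x} + b_l
```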

14 Multinomial Naive Bayes
Bayes model with multinomials conditioned on label variables: priors are categorical, label-conditionals are multinomial, and the normalizer is constant. A directed generative graphical model.
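With assumed notation $p(l)$ for the categorical prior, $p(w \mid l)$ for the multinomial label-conditionals, and $x_w$ for the count of word $w$ in the document, the MNB posterior is proportional to:

```latex
p(l \mid \mathbf{x}) \;\propto\; p(l) \prod_{w} p(w \mid l)^{x_w}
```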

20 Formalizing Smoothing of Multinomials
All smoothing methods for multinomials can be expressed as an interpolation of an unsmoothed label-conditional model with a background model, controlled by a smoothing weight. Discounting subtracts discounts from the counts used to estimate the unsmoothed label-conditional model.
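With assumed notation $p_u(w \mid l)$ for the unsmoothed label-conditional model, $p_b(w)$ for the background model, $\alpha_l$ for the smoothing weight, and $d_{lw}$ for the discounts applied to the counts $n_{lw}$, the interpolation can be sketched as:

```latex
p(w \mid l) = (1 - \alpha_l)\, p_u(w \mid l) + \alpha_l\, p_b(w),
\qquad
p_u(w \mid l) = \frac{\max(n_{lw} - d_{lw},\, 0)}{\sum_{v} n_{lv}}
```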

22 Two-State HMM Formalization of Smoothing
Replace the multinomial with a 0th-order categorical state-emission HMM with M=2 hidden states. Component m=2 is shared between the 2-state HMMs of all labels.

23 Two-State HMM Formalization of Smoothing (2)
Label-conditionals can be rewritten as a mixture over the two hidden states. Choosing the state priors and the two emission distributions appropriately implements the smoothed multinomials.
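With $p_u(w \mid l)$ the unsmoothed label-conditional, $p_b(w)$ the background model, and $\alpha_l$ the smoothing weight (symbols assumed), choosing $p(m{=}1 \mid l) = 1 - \alpha_l$, $p(w \mid m{=}1, l) = p_u(w \mid l)$, and $p(w \mid m{=}2) = p_b(w)$ recovers the smoothed multinomial:

```latex
p(w \mid l) = \sum_{m=1}^{2} p(m \mid l)\, p(w \mid m, l)
            = (1 - \alpha_l)\, p_u(w \mid l) + \alpha_l\, p_b(w)
```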

24 Two-State HMM Formalization of Smoothing (3)
Maximum likelihood estimation is difficult, due to a sum over hidden-state assignments inside the logarithm. Given a prior distribution over component assignments, expected log-likelihood estimation decouples across the components.

26 Formalizing Fractional Counts
Fractional counts are undefined for categorical and multinomial models. A formalization is possible with probabilistic data: a weight sequence matching a word sequence can be interpreted as the probabilities of the words occurring in the data. Expected log-likelihoods and log-probabilities given expected counts reproduce the results from using fractional counts.

27 Formalizing Fractional Counts (2)
Estimation with expected log-likelihood

28 Formalizing Fractional Counts (3)
Inference with expected log-probability
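Under assumed notation where $q_{dw}$ is the weight attached to word $w$ of document $d$ and is read as the probability of that word occurring, both steps can be sketched as:

```latex
% estimation: expected counts replace integer counts in the multinomial MLE
\hat{n}_{lw} = \sum_{d \in D_l} q_{dw}, \qquad
p_u(w \mid l) = \frac{\hat{n}_{lw}}{\sum_{v} \hat{n}_{lv}}
% inference: expected log-probability of a weighted document
\log p(d \mid l) = \sum_{w} q_{dw} \,\log p(w \mid l)
```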

29 Extending MNB with Scaled Factors
MNB is extended with scaled factors for label priors and document lengths, where the label prior and document length factors are raised to scaling weights and renormalized.
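With assumed scaling weights $a_1$ and $a_2$, and $n$ denoting document length, the scaled and renormalized factors might look like:

```latex
p_s(l) = \frac{p(l)^{a_1}}{\sum_{l'} p(l')^{a_1}}, \qquad
p_s(n \mid l) = \frac{p(n \mid l)^{a_2}}{\sum_{n'} p(n' \mid l)^{a_2}}
```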

31 Sparse Inference for MNB
Naive MNB posterior inference scores every label against every word of a document, so its complexity grows with the number of labels times the document length. Sparse inference uses an inverted index with precomputed values: a dense background score shared by all labels is computed once per document, per-label constants are precomputed, and only index entries for (word, label) pairs with nonzero training counts are visited. The time complexity then depends on the number of nonzero index entries touched rather than on the full label set. A minimal sketch follows.
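As a rough illustration of this idea (a minimal sketch with assumed data structures and a Jelinek-Mercer style smoothing weight, not the thesis implementation), the score of a word can be split into a dense background part shared by all labels and a sparse correction stored in an inverted index:

```python
import math
from collections import defaultdict

def build_sparse_mnb(label_word_counts, priors, background, alpha):
    """Precompute per-label constants and an inverted index of score corrections.

    label_word_counts: {label: {word: count}}  -- training counts (illustrative)
    priors: {label: p(l)}                      -- label priors
    background: {word: p_b(word)}              -- shared smoothing background model,
                                                  assumed to cover every word
    alpha: smoothing weight of the interpolation (assumed)
    """
    index = defaultdict(list)                  # word -> [(label, score delta)]
    constants = {l: math.log(p) for l, p in priors.items()}
    for label, counts in label_word_counts.items():
        total = sum(counts.values())
        for word, count in counts.items():
            smoothed = (1.0 - alpha) * count / total + alpha * background[word]
            fallback = alpha * background[word]
            # Store only the difference to the background-only score, so scoring
            # a document touches just the labels that actually contain its words.
            index[word].append((label, math.log(smoothed) - math.log(fallback)))
    return index, constants

def score_document(doc_counts, index, constants, background, alpha):
    """Score all labels while visiting only the inverted-index rows of the doc's words."""
    # Dense part, identical for every label: background-only score of the document.
    base = sum(c * math.log(alpha * background[w]) for w, c in doc_counts.items())
    scores = {label: const + base for label, const in constants.items()}
    # Sparse part: corrections for (word, label) pairs with nonzero training counts.
    for word, count in doc_counts.items():
        for label, delta in index.get(word, []):
            scores[label] += count * delta
    return scores
```

Because the corrections are zero for (word, label) pairs unseen in training, the inner loop touches only the inverted-index postings of the document's words.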

33 Sparse Inference for Structured Models
Extension to hierarchically smoothed sequence models. Inference complexity is again reduced from dense computation over all model parameters to sparse computation over the nonzero counts.

35 Sparse Inference for Structured Models (2)
A hierarchically smoothed sequence model combines component models at several levels of a smoothing hierarchy. With Jelinek-Mercer smoothing, the marginalization over components becomes sparse: terms that do not depend on which components contain a word can be shared, so only components with nonzero counts need to be visited, and the marginalization complexity is reduced accordingly.
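A sketch of this sparse marginalization under assumed notation, where $D_l$ is the set of component documents for label $l$, $p_u(w \mid d)$ is the unsmoothed document model, $\alpha_d$ its Jelinek-Mercer weight, and $p_s(w \mid l)$ the shared smoothing distribution; the second term does not depend on which documents contain $w$, so only documents with $n_{dw} > 0$ need to be visited:

```latex
\sum_{d \in D_l} p(d \mid l)\, p(w \mid d)
  = \sum_{\substack{d \in D_l \\ n_{dw} > 0}} p(d \mid l)\,(1 - \alpha_d)\, p_u(w \mid d)
  \;+\; \Big( \sum_{d \in D_l} p(d \mid l)\, \alpha_d \Big)\, p_s(w \mid l)
```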

37 Tied Document Mixture
Replace the label-conditional in MNB with a mixture over hierarchically smoothed document models:
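One way to write the mixture, with assumed notation where $D_l$ is the set of training documents with label $l$ and $p(w \mid d)$ is a hierarchically smoothed document model (document level interpolated with label and collection levels):

```latex
p(w \mid l) = \sum_{d \in D_l} p(d \mid l)\, p(w \mid d),
\qquad \text{e.g. } p(d \mid l) = \frac{1}{|D_l|}
```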

38 Experiments
Experiments on 16 text classification and 13 ranked retrieval datasets
Development and evaluation segments used, both further split into training and testing segments
Classification evaluated with Micro-Fscore, retrieval with MAP and NDCG
Models optimized for the evaluation measures using a Gaussian random search on the development test set (a minimal sketch follows)
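One simple way such a Gaussian random search over hyperparameters can be implemented (parameter handling, rounds, and step size are assumptions for illustration, not the thesis code):

```python
import random

def gaussian_random_search(evaluate, init_params, rounds=50, step=0.1):
    """Perturb the best-known hyperparameters with Gaussian noise and keep improvements.
    `evaluate` returns the development-set score (e.g. Micro-Fscore, MAP, or NDCG)."""
    best = dict(init_params)
    best_score = evaluate(best)
    for _ in range(rounds):
        candidate = {k: v + random.gauss(0.0, step) for k, v in best.items()}
        score = evaluate(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score
```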

42 Evaluated Modifications
MNB, TDM, VSM, LR, and SVM models with modifications were compared. A generalized TF-IDF weighting was used, with one parameter for length scaling and one for IDF lifting.
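One generalized TF-IDF parameterization consistent with this description, with assumed symbols $b$ for length scaling and $c$ for IDF lifting ($x_w$ the count of word $w$, $N$ documents in total, $N_w$ containing $w$):

```latex
x'_{w} = \frac{\log(1 + x_w)}{\big(\sum_{v} \log(1 + x_v)\big)^{b}}
         \cdot \Big( c + \log \frac{N}{N_w} \Big)
```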

53 Scalability Experiments
Large English Wikipedia dataset for multi-label classification, segmented into 2.34M training documents and 23.6k test documents
Pruned by number of features (from 10 upward), documents (from 10 upward), and labelsets (from 1 upward) into smaller sets
Scalability of naive vs. sparse inference evaluated on MNB and TDM
A maximum of 4 hours of computing time was allowed for each condition

55 Summary of Experiment Results
Effectiveness improvements to MNB:
- Choice of smoothing: small effect
- Feature weighting and scaled factors: large effect
- Tied Document Mixture: very large effect
BM25 was outperformed for ranking; classification results came close to a highly optimized SVM
Scalability from sparse inference: 10x inference time reduction in the largest completed case

56 Conclusion
Modified Bayes models are strong models for text mining tasks: sentiment analysis, spam classification, document categorization, ranked retrieval, …
Sparse inference enables scalability for new types of tasks and models
Possible future applications of the presented framework:
- Text clustering
- Text regression
- N-gram language modeling
- Topic models

57 Conclusion (2)
Thesis statement: “Generative models of text combined with inference using inverted indices provide sparse generative models for text mining that are both versatile and scalable, providing state-of-the-art effectiveness and high scalability for various text mining tasks.”
Truisms in theory that should be reconsidered:
- Naive Bayes as the “punching bag of machine learning”
- “The curse of dimensionality” and assumptions about what constitutes optimal time complexity

