Scalable Text Mining with Sparse Generative Models
Antti Puurula. PhD thesis presentation, University of Waikato, New Zealand, 8th June 2015
2
Introduction
This thesis presents a framework for probabilistic text mining based on sparse generative models
Models developed in the framework show state-of-the-art effectiveness in both text classification and retrieval tasks
The proposed sparse inference for these models improves scalability, enabling text mining for very large-scale tasks
3
Major Contributions of the Thesis
Formalizing multinomial modeling of text:
  Smoothing as two-state Hidden Markov Models
  Fractional counts as probabilistic data
  Weighted factors as log-linear models
Scalable inference on text:
  Sparse inference using inverted indices for statistical models
  Tied Document Mixture, a model benefiting from sparse inference
Extensive evaluation using a combined experimental setup for classification and retrieval
4
Defining Text Mining
“Knowledge Discovery in Textual Databases” (KDT) [Feldman and Dagan, 1995]
“Text Mining as Integration of Several Related Research Areas” [Grobelnik et al., 2000]
Definition used in this thesis: text mining is an interdisciplinary field of research on the automatic processing of large quantities of text data for valuable information
5
Related Fields and Application Domains
6
Volume of Text Mining Publications
References per year found for related fields using academic search engines
9
Scale of Text Data
Existing collections:
  Google Books: 30M books (2013)
  Twitter: 200M users, 400M messages per day (2013)
  WhatsApp: 430M users, 50B messages per day (2014)
Available research collections:
  English Wikipedia: 4.5M articles (2014)
  Google n-grams: 5-grams estimated from 1T words (2007)
  Annotated English Gigaword: 4B words with metadata (2012)
  TREC KBA: 394M annotated documents for classification (2014)
10
Text Mining Methodology in a Nutshell
Normalize and map documents into a structured representation, such as a vector of word counts (see the sketch below)
Segment a problem into machine learning tasks
Solve the tasks using algorithms, most commonly linear models
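As a minimal illustration of the first step (not taken from the slides), the sketch below maps a raw document to a bag-of-words count vector; the tokenizer and normalization choices are assumptions made for the example:

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase, tokenize on word-like characters, and count occurrences."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(tokens)

doc = "Scalable text mining scales text mining to large text collections."
counts = bag_of_words(doc)
print(counts.most_common(3))  # e.g. [('text', 3), ('mining', 2), ('scalable', 1)]
```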
13
Linear Models for Text Mining
Multi-class linear scoring function: score(l, d) = w_l · x_d + b_l, where x_d is the feature vector of document d, w_l is the weight vector and b_l the bias for label l; the predicted label is argmax_l score(l, d)
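A small sketch of this scoring rule; the weight matrix, biases, and feature vector below are illustrative toy values:

```python
import numpy as np

def predict_label(x, W, b):
    """Multi-class linear scoring: score(l, d) = w_l . x_d + b_l, pick the argmax."""
    scores = W @ x + b          # one score per label
    return int(np.argmax(scores))

# Toy example: 3 labels, 4 features (e.g. word counts).
W = np.array([[0.2, 0.0, 1.1, 0.3],
              [0.9, 0.4, 0.0, 0.1],
              [0.1, 1.2, 0.2, 0.0]])
b = np.array([0.0, -0.1, 0.2])
x = np.array([2.0, 0.0, 1.0, 3.0])
print(predict_label(x, W, b))
```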
14
Multinomial Naive Bayes
Bayes model with multinomials conditioned on label variables: p(l | d) ∝ p(l) ∏_w p(w | l)^c_w, where c_w is the count of word w in document d
Priors p(l) are categorical, label-conditionals p(w | l) are multinomial, and the normalizer p(d) is constant across labels
Directed generative graphical model
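A dense log-space scoring sketch for MNB, assuming smoothed label-conditional probabilities are already available; the labels, probabilities, and the tiny floor for unseen words are illustrative assumptions (smoothing proper is formalized on the later slides):

```python
import math
from collections import Counter

def mnb_log_score(doc_counts, log_prior, log_cond):
    """log p(l) + sum_w c_w * log p(w | l), computed per label; return the best label."""
    scores = {}
    for label, lp in log_prior.items():
        s = lp
        for word, count in doc_counts.items():
            s += count * log_cond[label].get(word, math.log(1e-6))  # tiny floor for unseen words
        scores[label] = s
    return max(scores, key=scores.get)

log_prior = {"spam": math.log(0.3), "ham": math.log(0.7)}
log_cond = {
    "spam": {"free": math.log(0.05), "offer": math.log(0.04), "meeting": math.log(0.001)},
    "ham":  {"free": math.log(0.005), "offer": math.log(0.002), "meeting": math.log(0.03)},
}
doc = Counter("free offer free".split())
print(mnb_log_score(doc, log_prior, log_cond))  # spam
```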
20
Formalizing Smoothing of Multinomials
All smoothing methods for multinomials can be expressed as p_s(w | l) = (1 − α) p_u(w | l) + α p_b(w), where p_u(w | l) is an unsmoothed label-conditional model, p_b(w) is the background model, and α is the smoothing weight
Discounting of counts subtracts discounts from the counts used to estimate p_u(w | l)
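A small sketch of this interpolation with a fixed smoothing weight (Jelinek-Mercer style); the counts and the weight are illustrative:

```python
from collections import Counter

def smoothed_conditional(word, label_counts, background_counts, alpha=0.1):
    """p_s(w | l) = (1 - alpha) * p_u(w | l) + alpha * p_b(w)."""
    label_total = sum(label_counts.values())
    bg_total = sum(background_counts.values())
    p_u = label_counts.get(word, 0) / label_total if label_total else 0.0
    p_b = background_counts.get(word, 0) / bg_total
    return (1 - alpha) * p_u + alpha * p_b

label_counts = Counter({"free": 5, "offer": 3})
background_counts = Counter({"free": 10, "offer": 5, "meeting": 20})
print(smoothed_conditional("meeting", label_counts, background_counts))  # backs off to the background model
```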
22
Two-State HMM Formalization of Smoothing
Replace the multinomial with a 0th-order categorical state-emission HMM with M = 2 hidden states: p(w | l) = Σ_m p(m | l) p(w | m, l)
Component m = 2 is shared between the 2-state HMMs for each label
23
Two-State HMM Formalization of Smoothing (2)
Label-conditionals can be rewritten as p(w | l) = p(m=1 | l) p(w | m=1, l) + p(m=2 | l) p(w | m=2)
Choosing p(m=2 | l) = α, p(w | m=1, l) = p_u(w | l), and p(w | m=2) = p_b(w) implements the smoothed multinomials
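A quick numeric check, with illustrative values, that the two-state mixture under these choices reproduces the interpolated multinomial from the previous slide:

```python
alpha = 0.1
p_u = {"free": 0.625, "offer": 0.375}              # p(w | m=1, l): unsmoothed label-conditional
p_b = {"free": 0.3, "offer": 0.2, "meeting": 0.5}  # p(w | m=2): shared background component
p_state = {1: 1 - alpha, 2: alpha}                 # p(m | l)

emission = {1: p_u, 2: p_b}
for w in p_b:
    two_state = sum(p_state[m] * emission[m].get(w, 0.0) for m in (1, 2))
    interpolated = (1 - alpha) * p_u.get(w, 0.0) + alpha * p_b[w]
    assert abs(two_state - interpolated) < 1e-12
    print(w, round(two_state, 4))
```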
24
Two-State HMM Formalization of Smoothing (3)
Maximum likelihood estimation is difficult, due to the sum over hidden component assignments
Given a prior distribution over component assignments, expected log-likelihood estimation decouples across the components
26
Formalizing Fractional Counts
Fractional counts are undefined for categorical and multinomial models
A formalization is possible with probabilistic data
A weight sequence matching a word sequence can be interpreted as probabilities of the words occurring in the data
Expected log-likelihoods and log-probabilities given expected counts reproduce the results from using fractional counts
27
Formalizing Fractional Counts (2)
Estimation with expected log-likelihood: maximizing the expected log-likelihood gives multinomial parameters proportional to the expected word counts
28
Formalizing Fractional Counts (3)
Inference with expected log-probability: the expected log-probability of a document is Σ_w E[c_w] log p(w | l)
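A compact sketch of both steps under this expected-count reading of fractional counts; the fractional weights and the tiny corpus are illustrative:

```python
import math
from collections import defaultdict

def expected_counts(weighted_words):
    """Sum per-word weights, reading each weight as the probability the word occurred."""
    counts = defaultdict(float)
    for word, weight in weighted_words:
        counts[word] += weight
    return counts

def estimate_multinomial(counts):
    """Maximizing expected log-likelihood gives normalized expected counts."""
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def expected_log_probability(doc_counts, model, floor=1e-9):
    """Expected log p(d | l) = sum_w E[c_w] * log p(w | l)."""
    return sum(c * math.log(model.get(w, floor)) for w, c in doc_counts.items())

training = [("free", 1.0), ("offer", 0.5), ("free", 0.25)]   # fractional, e.g. feature-weighted counts
model = estimate_multinomial(expected_counts(training))
doc = expected_counts([("free", 0.8), ("offer", 0.4)])
print(model, expected_log_probability(doc, model))
```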
29
Extending MNB with Scaled Factors
MNB with scaled factors for label priors and document lengths, where the label prior and document length factors are each raised to a scaling weight and renormalized (e.g. the scaled prior is p(l)^a / Σ_l' p(l')^a)
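A minimal sketch of scaling and renormalizing one such factor (the label prior); the exponent values are illustrative:

```python
import numpy as np

def scale_and_renormalize(probs, exponent):
    """Raise a categorical distribution to an exponent and renormalize (log-linear weighting)."""
    scaled = np.power(probs, exponent)
    return scaled / scaled.sum()

prior = np.array([0.7, 0.2, 0.1])
print(scale_and_renormalize(prior, 0.5))   # exponent < 1 flattens the prior
print(scale_and_renormalize(prior, 2.0))   # exponent > 1 sharpens it
```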
31
Sparse Inference for MNB
Naive MNB posterior inference scales with the number of labels times the number of words in the document
Sparse inference uses an inverted index with precomputed values: the smoothed score splits into a background term shared across labels and label-specific corrections for the (word, label) pairs with non-zero training counts, retrieved from the inverted index
Time complexity is reduced to the number of index postings matched by the document's words, plus a term linear in the number of labels for the priors
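A simplified sketch of this decomposition under Jelinek-Mercer smoothing (not the thesis code): the shared background term is dropped from the argmax, and only inverted-index postings for the document's words contribute label-specific corrections. All counts, priors, and the smoothing weight below are illustrative:

```python
import math
from collections import Counter, defaultdict

ALPHA = 0.2  # smoothing weight, illustrative

# Toy training counts: word -> {label: count}, plus label priors and a background model.
train_counts = {
    "free":    {"spam": 8, "ham": 1},
    "offer":   {"spam": 5},
    "meeting": {"ham": 6},
}
label_totals = {"spam": 13, "ham": 7}
log_prior = {"spam": math.log(0.4), "ham": math.log(0.6)}
bg_total = sum(sum(d.values()) for d in train_counts.values())
p_bg = {w: sum(d.values()) / bg_total for w, d in train_counts.items()}

# Inverted index of precomputed corrections: word -> [(label, log p_s(w|l) - log(alpha * p_bg(w)))].
index = defaultdict(list)
for w, per_label in train_counts.items():
    backoff = math.log(ALPHA * p_bg[w])
    for l, c in per_label.items():
        smoothed = (1 - ALPHA) * c / label_totals[l] + ALPHA * p_bg[w]
        index[w].append((l, math.log(smoothed) - backoff))

def sparse_mnb_argmax(doc_counts):
    """Score = log prior + corrections from matched postings; the shared background term is dropped."""
    scores = dict(log_prior)
    for w, c in doc_counts.items():
        for l, correction in index.get(w, ()):
            scores[l] += c * correction
    return max(scores, key=scores.get)

print(sparse_mnb_argmax(Counter("free offer offer".split())))   # spam
print(sparse_mnb_argmax(Counter("meeting".split())))            # ham
```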
33
Sparse Inference for Structured Models
Extension to hierarchically smoothed sequence models
Complexity is reduced from dense scoring over all labels and words to sparse scoring over the non-zero entries reached through the inverted index
35
Sparse Inference for Structured Models (2)
A hierarchically smoothed sequence model interpolates models at several levels of a hierarchy
With Jelinek-Mercer smoothing, marginalization can be computed sparsely, visiting only the components with non-zero counts for the observed words
Marginalization complexity is reduced accordingly
37
Tied Document Mixture
Replace the label-conditional in MNB with a mixture over hierarchically smoothed document models: p(d | l) = Σ_{d' ∈ D_l} p(d' | l) ∏_w p_s(w | d')^c_w, a mixture over the training documents D_l of label l, where each document model p_s(w | d') is smoothed toward the label model, which is in turn smoothed toward the collection model
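A small sketch of the hierarchical smoothing used for each document model in such a mixture (document counts backed off to the label model, which is backed off to the collection model); the two smoothing weights and the counts are illustrative:

```python
def hierarchical_prob(word, doc_counts, label_counts, coll_counts, a_doc=0.3, a_label=0.3):
    """p(w | d') smoothed toward p(w | l), which is smoothed toward the collection model p(w)."""
    p_coll = coll_counts.get(word, 0) / sum(coll_counts.values())
    p_label = (1 - a_label) * label_counts.get(word, 0) / sum(label_counts.values()) + a_label * p_coll
    p_doc = (1 - a_doc) * doc_counts.get(word, 0) / max(sum(doc_counts.values()), 1) + a_doc * p_label
    return p_doc

coll = {"free": 10, "offer": 5, "meeting": 20}
label = {"free": 8, "offer": 5}
doc = {"free": 3}
print(hierarchical_prob("offer", doc, label, coll))  # unseen in the document, backed off to label/collection
```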
38
Experiments
Experiments on 16 text classification and 13 ranked retrieval datasets
Development and evaluation segments used, both further split into training and testing segments
Classification evaluated with Micro-Fscore, retrieval with MAP and NDCG
Models optimized for the evaluation measures using a Gaussian random search on the development test set
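A minimal sketch of a Gaussian random search over a single parameter (not the thesis implementation): sample candidates from a Gaussian around the current best value and keep improvements on a development-set measure. The objective below is an illustrative stand-in:

```python
import random

def gaussian_random_search(evaluate, init, sigma=0.2, iterations=50, seed=0):
    """Propose candidates from a Gaussian around the current best and keep improvements."""
    rng = random.Random(seed)
    best_x, best_score = init, evaluate(init)
    for _ in range(iterations):
        candidate = best_x + rng.gauss(0.0, sigma)
        score = evaluate(candidate)
        if score > best_score:
            best_x, best_score = candidate, score
    return best_x, best_score

def dev_measure(alpha):
    """Stand-in for a development-set evaluation measure of one hyperparameter."""
    return -(alpha - 0.35) ** 2

print(gaussian_random_search(dev_measure, init=0.9))
```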
42
Evaluated Modifications
MNB, TDM, VSM, LR, and SVM models with modifications compared
A generalized TF-IDF weighting is used, with a parameter for length scaling of term frequencies and a parameter for IDF lifting
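One common way to parameterize such a weighting, shown only as an assumption (the slide's exact formula is not reproduced here): scale the term frequency by document length raised to an exponent, and add a constant lift to the IDF:

```python
import math

def generalized_tfidf(tf, doc_len, df, num_docs, length_scale=0.5, idf_lift=1.0):
    """Illustrative parametrization: length-scaled term frequency times lifted IDF."""
    scaled_tf = tf / (doc_len ** length_scale if doc_len else 1.0)
    lifted_idf = math.log(num_docs / df) + idf_lift
    return scaled_tf * lifted_idf

print(generalized_tfidf(tf=3, doc_len=120, df=25, num_docs=10000))
```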
53
Scalability Experiments
A large English Wikipedia dataset for multi-label classification, segmented into 2.34M training documents and 23.6k test documents
Pruned into smaller subsets by the number of features (from 10 upward), documents (from 10 upward), and labelsets (from 1 upward)
Scalability of naive vs. sparse inference evaluated on MNB and TDM
A maximum of 4 hours of computing time allowed for each condition
55
Summary of Experiment Results
Effectiveness improvements to MNB:
  Choice of smoothing: small effect
  Feature weighting and scaled factors: large effect
  Tied Document Mixture: very large effect
BM25 outperformed for ranking, and results close to a highly optimized SVM for classification
Scalability from sparse inference: 10x reduction in inference time in the largest completed case
56
Conclusion
Modified Bayes models are strong models for text mining tasks: sentiment analysis, spam classification, document categorization, ranked retrieval, ...
Sparse inference enables scalability for new types of tasks and models
Possible future applications of the presented framework:
  Text clustering
  Text regression
  N-gram language modeling
  Topic models
57
Conclusion (2)
Thesis statement: “Generative models of text combined with inference using inverted indices provide sparse generative models for text mining that are both versatile and scalable, providing state-of-the-art effectiveness and high scalability for various text mining tasks.”
Truisms in theory that should be reconsidered:
  Naive Bayes as the “punching bag of machine learning”
  “The curse of dimensionality” and assumptions about what time complexity is optimal