Scalable Text Mining with Sparse Generative Models
Antti Puurula. PhD thesis presentation, University of Waikato, New Zealand, 8th June 2015
2
Introduction
This thesis presents a framework for probabilistic text mining based on sparse generative models
Models developed in the framework show state-of-the-art effectiveness in both text classification and retrieval tasks
The proposed sparse inference for these models improves scalability, enabling text mining for very large-scale tasks
3
Major Contributions of the Thesis
Formalizing multinomial modeling of text:
  Smoothing as two-state Hidden Markov Models
  Fractional counts as probabilistic data
  Weighted factors as log-linear models
Scalable inference on text:
  Sparse inference using inverted indices for statistical models
  Tied Document Mixture, a model benefiting from sparse inference
Extensive evaluation using a combined experimental setup for classification and retrieval
4
Defining Text Mining
“Knowledge Discovery in Textual Databases” (KDT) [Feldman and Dagan, 1995]
“Text Mining as Integration of Several Related Research Areas” [Grobelnik et al., 2000]
Definition used in this thesis: text mining is an interdisciplinary field of research on the automatic processing of large quantities of text data for valuable information
5
Related Fields and Application Domains
6
Volume of Text Mining Publications
References per year found for related fields using academic search engines
9
Scale of Text Data
Existing collections:
  Google Books: 30M books (2013)
  Twitter: 200M users, 400M messages per day (2013)
  WhatsApp: 430M users, 50B messages per day (2014)
Available research collections:
  English Wikipedia: 4.5M articles (2014)
  Google n-grams: 5-grams estimated from 1T words (2007)
  Annotated English Gigaword: 4B words with metadata (2012)
  TREC KBA: 394M annotated documents for classification (2014)
10
Text Mining Methodology in a Nutshell
Normalize and map documents into a structured representation, such as a vector of word counts (see the sketch below)
Segment a problem into machine learning tasks
Solve the tasks using algorithms, most commonly linear models
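As a minimal illustration of the first step (not taken from the slides), the sketch below maps a raw document to a bag-of-words count vector; the tokenizer and normalization choices are assumptions made for the example:

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase, tokenize on word-like characters, and count occurrences."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(tokens)

doc = "Scalable text mining scales text mining to large text collections."
counts = bag_of_words(doc)
print(counts.most_common(3))  # e.g. [('text', 3), ('mining', 2), ('scalable', 1)]
```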
13
Linear Models for Text Mining
Multi-class linear scoring function: score(l, d) = w_l · x_d + b_l, where x_d is the feature vector of document d, w_l is the weight vector and b_l the bias for label l; the predicted label is argmax_l score(l, d)
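A small sketch of this scoring rule; the weight matrix, biases, and feature vector below are illustrative toy values:

```python
import numpy as np

def predict_label(x, W, b):
    """Multi-class linear scoring: score(l, d) = w_l . x_d + b_l, pick the argmax."""
    scores = W @ x + b          # one score per label
    return int(np.argmax(scores))

# Toy example: 3 labels, 4 features (e.g. word counts).
W = np.array([[0.2, 0.0, 1.1, 0.3],
              [0.9, 0.4, 0.0, 0.1],
              [0.1, 1.2, 0.2, 0.0]])
b = np.array([0.0, -0.1, 0.2])
x = np.array([2.0, 0.0, 1.0, 3.0])
print(predict_label(x, W, b))
```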
14
Multinomial Naive Bayes
Bayes model with multinomials conditioned on label variables: p(l | d) ∝ p(l) ∏_w p(w | l)^c_w, where c_w is the count of word w in document d
Priors p(l) are categorical, label-conditionals p(w | l) are multinomial, and the normalizer p(d) is constant across labels
Directed generative graphical model
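A dense log-space scoring sketch for MNB, assuming smoothed label-conditional probabilities are already available; the labels, probabilities, and the tiny floor for unseen words are illustrative assumptions (smoothing proper is formalized on the later slides):

```python
import math
from collections import Counter

def mnb_log_score(doc_counts, log_prior, log_cond):
    """log p(l) + sum_w c_w * log p(w | l), computed per label; return the best label."""
    scores = {}
    for label, lp in log_prior.items():
        s = lp
        for word, count in doc_counts.items():
            s += count * log_cond[label].get(word, math.log(1e-6))  # tiny floor for unseen words
        scores[label] = s
    return max(scores, key=scores.get)

log_prior = {"spam": math.log(0.3), "ham": math.log(0.7)}
log_cond = {
    "spam": {"free": math.log(0.05), "offer": math.log(0.04), "meeting": math.log(0.001)},
    "ham":  {"free": math.log(0.005), "offer": math.log(0.002), "meeting": math.log(0.03)},
}
doc = Counter("free offer free".split())
print(mnb_log_score(doc, log_prior, log_cond))  # spam
```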
20
Formalizing Smoothing of Multinomials
All smoothing methods for multinomials can be expressed as p_s(w | l) = (1 − α) p_u(w | l) + α p_b(w), where p_u(w | l) is an unsmoothed label-conditional model, p_b(w) is the background model, and α is the smoothing weight
Discounting of counts subtracts discounts from the counts used to estimate p_u(w | l)
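A small sketch of this interpolation with a fixed smoothing weight (Jelinek-Mercer style); the counts and the weight are illustrative:

```python
from collections import Counter

def smoothed_conditional(word, label_counts, background_counts, alpha=0.1):
    """p_s(w | l) = (1 - alpha) * p_u(w | l) + alpha * p_b(w)."""
    label_total = sum(label_counts.values())
    bg_total = sum(background_counts.values())
    p_u = label_counts.get(word, 0) / label_total if label_total else 0.0
    p_b = background_counts.get(word, 0) / bg_total
    return (1 - alpha) * p_u + alpha * p_b

label_counts = Counter({"free": 5, "offer": 3})
background_counts = Counter({"free": 10, "offer": 5, "meeting": 20})
print(smoothed_conditional("meeting", label_counts, background_counts))  # backs off to the background model
```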
22
Two-State HMM Formalization of Smoothing
Replace the multinomial with a 0th-order categorical state-emission HMM with M = 2 hidden states: p(w | l) = Σ_m p(m | l) p(w | m, l)
Component m = 2 is shared between the 2-state HMMs for each label
23
Two-State HMM Formalization of Smoothing (2)
Label-conditionals can be rewritten as p(w | l) = p(m=1 | l) p(w | m=1, l) + p(m=2 | l) p(w | m=2)
Choosing p(m=2 | l) = α, p(w | m=1, l) = p_u(w | l), and p(w | m=2) = p_b(w) implements the smoothed multinomials
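A quick numeric check, with illustrative values, that the two-state mixture under these choices reproduces the interpolated multinomial from the previous slide:

```python
alpha = 0.1
p_u = {"free": 0.625, "offer": 0.375}              # p(w | m=1, l): unsmoothed label-conditional
p_b = {"free": 0.3, "offer": 0.2, "meeting": 0.5}  # p(w | m=2): shared background component
p_state = {1: 1 - alpha, 2: alpha}                 # p(m | l)

emission = {1: p_u, 2: p_b}
for w in p_b:
    two_state = sum(p_state[m] * emission[m].get(w, 0.0) for m in (1, 2))
    interpolated = (1 - alpha) * p_u.get(w, 0.0) + alpha * p_b[w]
    assert abs(two_state - interpolated) < 1e-12
    print(w, round(two_state, 4))
```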
24
Two-State HMM Formalization of Smoothing (3)
Maximum likelihood estimation is difficult, due to the sum over hidden component assignments
Given a prior distribution over component assignments, expected log-likelihood estimation decouples across the components
26
Formalizing Fractional Counts
Fractional counts are undefined for categorical and multinomial models
A formalization is possible with probabilistic data
A weight sequence matching a word sequence can be interpreted as probabilities of the words occurring in the data
Expected log-likelihoods and log-probabilities given expected counts reproduce the results from using fractional counts
27
Formalizing Fractional Counts (2)
Estimation with expected log-likelihood: maximizing the expected log-likelihood gives multinomial parameters proportional to the expected word counts
28
Formalizing Fractional Counts (3)
Inference with expected log-probability: the expected log-probability of a document is Σ_w E[c_w] log p(w | l)
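A compact sketch of both steps under this expected-count reading of fractional counts; the fractional weights and the tiny corpus are illustrative:

```python
import math
from collections import defaultdict

def expected_counts(weighted_words):
    """Sum per-word weights, reading each weight as the probability the word occurred."""
    counts = defaultdict(float)
    for word, weight in weighted_words:
        counts[word] += weight
    return counts

def estimate_multinomial(counts):
    """Maximizing expected log-likelihood gives normalized expected counts."""
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def expected_log_probability(doc_counts, model, floor=1e-9):
    """Expected log p(d | l) = sum_w E[c_w] * log p(w | l)."""
    return sum(c * math.log(model.get(w, floor)) for w, c in doc_counts.items())

training = [("free", 1.0), ("offer", 0.5), ("free", 0.25)]   # fractional, e.g. feature-weighted counts
model = estimate_multinomial(expected_counts(training))
doc = expected_counts([("free", 0.8), ("offer", 0.4)])
print(model, expected_log_probability(doc, model))
```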
29
Extending MNB with Scaled Factors
MNB with scaled factors for label priors and document lengths, where the label prior and document length factors are each raised to a scaling weight and renormalized (e.g. the scaled prior is p(l)^a / Σ_l' p(l')^a)
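A minimal sketch of scaling and renormalizing one such factor (the label prior); the exponent values are illustrative:

```python
import numpy as np

def scale_and_renormalize(probs, exponent):
    """Raise a categorical distribution to an exponent and renormalize (log-linear weighting)."""
    scaled = np.power(probs, exponent)
    return scaled / scaled.sum()

prior = np.array([0.7, 0.2, 0.1])
print(scale_and_renormalize(prior, 0.5))   # exponent < 1 flattens the prior
print(scale_and_renormalize(prior, 2.0))   # exponent > 1 sharpens it
```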
31
Sparse Inference for MNB
Naive MNB posterior inference scales with the number of labels times the number of words in the document
Sparse inference uses an inverted index with precomputed values: the smoothed score splits into a background term shared across labels and label-specific corrections for the (word, label) pairs with non-zero training counts, retrieved from the inverted index
Time complexity is reduced to the number of index postings matched by the document's words, plus a term linear in the number of labels for the priors
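A simplified sketch of this decomposition under Jelinek-Mercer smoothing (not the thesis code): the shared background term is dropped from the argmax, and only inverted-index postings for the document's words contribute label-specific corrections. All counts, priors, and the smoothing weight below are illustrative:

```python
import math
from collections import Counter, defaultdict

ALPHA = 0.2  # smoothing weight, illustrative

# Toy training counts: word -> {label: count}, plus label priors and a background model.
train_counts = {
    "free":    {"spam": 8, "ham": 1},
    "offer":   {"spam": 5},
    "meeting": {"ham": 6},
}
label_totals = {"spam": 13, "ham": 7}
log_prior = {"spam": math.log(0.4), "ham": math.log(0.6)}
bg_total = sum(sum(d.values()) for d in train_counts.values())
p_bg = {w: sum(d.values()) / bg_total for w, d in train_counts.items()}

# Inverted index of precomputed corrections: word -> [(label, log p_s(w|l) - log(alpha * p_bg(w)))].
index = defaultdict(list)
for w, per_label in train_counts.items():
    backoff = math.log(ALPHA * p_bg[w])
    for l, c in per_label.items():
        smoothed = (1 - ALPHA) * c / label_totals[l] + ALPHA * p_bg[w]
        index[w].append((l, math.log(smoothed) - backoff))

def sparse_mnb_argmax(doc_counts):
    """Score = log prior + corrections from matched postings; the shared background term is dropped."""
    scores = dict(log_prior)
    for w, c in doc_counts.items():
        for l, correction in index.get(w, ()):
            scores[l] += c * correction
    return max(scores, key=scores.get)

print(sparse_mnb_argmax(Counter("free offer offer".split())))   # spam
print(sparse_mnb_argmax(Counter("meeting".split())))            # ham
```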
33
Sparse Inference for Structured Models
Extension to hierarchically smoothed sequence models
Complexity is reduced from dense scoring over all labels and words to sparse scoring over the non-zero entries reached through the inverted index
35
Sparse Inference for Structured Models (2)
A hierarchically smoothed sequence model interpolates models at several levels of a hierarchy
With Jelinek-Mercer smoothing, marginalization can be computed sparsely, visiting only the components with non-zero counts for the observed words
Marginalization complexity is reduced accordingly
37
Tied Document Mixture
Replace the label-conditional in MNB with a mixture over hierarchically smoothed document models: p(d | l) = Σ_{d' ∈ D_l} p(d' | l) ∏_w p_s(w | d')^c_w, a mixture over the training documents D_l of label l, where each document model p_s(w | d') is smoothed toward the label model, which is in turn smoothed toward the collection model
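A small sketch of the hierarchical smoothing used for each document model in such a mixture (document counts backed off to the label model, which is backed off to the collection model); the two smoothing weights and the counts are illustrative:

```python
def hierarchical_prob(word, doc_counts, label_counts, coll_counts, a_doc=0.3, a_label=0.3):
    """p(w | d') smoothed toward p(w | l), which is smoothed toward the collection model p(w)."""
    p_coll = coll_counts.get(word, 0) / sum(coll_counts.values())
    p_label = (1 - a_label) * label_counts.get(word, 0) / sum(label_counts.values()) + a_label * p_coll
    p_doc = (1 - a_doc) * doc_counts.get(word, 0) / max(sum(doc_counts.values()), 1) + a_doc * p_label
    return p_doc

coll = {"free": 10, "offer": 5, "meeting": 20}
label = {"free": 8, "offer": 5}
doc = {"free": 3}
print(hierarchical_prob("offer", doc, label, coll))  # unseen in the document, backed off to label/collection
```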
38
Experiments
Experiments on 16 text classification and 13 ranked retrieval datasets
Development and evaluation segments used, both further split into training and testing segments
Classification evaluated with Micro-Fscore, retrieval with MAP and NDCG
Models optimized for the evaluation measures using a Gaussian random search on the development test set
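A minimal sketch of a Gaussian random search over a single parameter (not the thesis implementation): sample candidates from a Gaussian around the current best value and keep improvements on a development-set measure. The objective below is an illustrative stand-in:

```python
import random

def gaussian_random_search(evaluate, init, sigma=0.2, iterations=50, seed=0):
    """Propose candidates from a Gaussian around the current best and keep improvements."""
    rng = random.Random(seed)
    best_x, best_score = init, evaluate(init)
    for _ in range(iterations):
        candidate = best_x + rng.gauss(0.0, sigma)
        score = evaluate(candidate)
        if score > best_score:
            best_x, best_score = candidate, score
    return best_x, best_score

def dev_measure(alpha):
    """Stand-in for a development-set evaluation measure of one hyperparameter."""
    return -(alpha - 0.35) ** 2

print(gaussian_random_search(dev_measure, init=0.9))
```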
42
Evaluated Modifications
MNB, TDM, VSM, LR, and SVM models with modifications compared
A generalized TF-IDF weighting is used, with a parameter for length scaling of term frequencies and a parameter for IDF lifting
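One common way to parameterize such a weighting, shown only as an assumption (the slide's exact formula is not reproduced here): scale the term frequency by document length raised to an exponent, and add a constant lift to the IDF:

```python
import math

def generalized_tfidf(tf, doc_len, df, num_docs, length_scale=0.5, idf_lift=1.0):
    """Illustrative parametrization: length-scaled term frequency times lifted IDF."""
    scaled_tf = tf / (doc_len ** length_scale if doc_len else 1.0)
    lifted_idf = math.log(num_docs / df) + idf_lift
    return scaled_tf * lifted_idf

print(generalized_tfidf(tf=3, doc_len=120, df=25, num_docs=10000))
```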
53
Scalability Experiments
A large English Wikipedia dataset for multi-label classification, segmented into 2.34M training documents and 23.6k test documents
Pruned into smaller subsets by the number of features (from 10 upward), documents (from 10 upward), and labelsets (from 1 upward)
Scalability of naive vs. sparse inference evaluated on MNB and TDM
A maximum of 4 hours of computing time allowed for each condition
55
Summary of Experiment Results
Effectiveness improvements to MNB:
  Choice of smoothing: small effect
  Feature weighting and scaled factors: large effect
  Tied Document Mixture: very large effect
BM25 outperformed for ranking, and results close to a highly optimized SVM for classification
Scalability from sparse inference: 10x reduction in inference time in the largest completed case
56
Conclusion
Modified Bayes models are strong models for text mining tasks: sentiment analysis, spam classification, document categorization, ranked retrieval, ...
Sparse inference enables scalability for new types of tasks and models
Possible future applications of the presented framework:
  Text clustering
  Text regression
  N-gram language modeling
  Topic models
57
Conclusion (2)
Thesis statement: “Generative models of text combined with inference using inverted indices provide sparse generative models for text mining that are both versatile and scalable, providing state-of-the-art effectiveness and high scalability for various text mining tasks.”
Truisms in theory that should be reconsidered:
  Naive Bayes as the “punching bag of machine learning”
  “The curse of dimensionality” and assumptions about what time complexity is optimal