Mining Topics in Documents: Standing on the Shoulders of Big Data


1 Mining Topics in Documents: Standing on the Shoulders of Big Data
Ashwin Mittal

2 INTRODUCTION

3 What and Why?

4 What and Why? Topic modeling is a method for finding groups of words (i.e., topics) from a collection of documents that best represent the information in the collection. It can also be thought of as a form of text mining: a way to obtain recurring patterns of words in textual material. It is used for discovering hidden topical patterns that are present across the collection, annotating documents according to these topics, and using these annotations to organize, search, and summarize texts.
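As a concrete illustration (a minimal sketch, not from the presentation; it assumes scikit-learn and a toy corpus), a standard LDA implementation can extract such word groups from a small document collection:

# Minimal topic-extraction sketch using scikit-learn's LDA on a toy corpus.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the battery life of this camera is great",
    "battery charge lasts long and the price is low",
    "the screen is bright and the picture quality is sharp",
    "price and cost matter more than the screen",
]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)            # document-term counts
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[::-1][:5]]
    print(f"topic {k}: {top}")                # each topic = a ranked group of words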

5 Existing Models: LDA and pLSA
Popular classic models for extracting topics from text documents.
They need a large amount of data, on the order of a thousand documents, to provide reliable statistics and generate coherent topics.
In practice, only a few document collections are that large.

6 What do we do?
INVENTING BETTER TOPIC MODELS: Topic models perform unsupervised learning. With small data there is not enough information to provide reliable statistics and generate coherent topics, so some external supervision is needed.
HUMAN INTERVENTION: Knowledge-based topic models ask the user for domain knowledge. The user may not know everything, or may not know what knowledge to provide for better results; this is not automatic.
LEARNING LIKE HUMANS DO (LIFELONG): Knowledge-based learning that mines prior knowledge automatically for future learning.

7 Motivation
Lifelong learning is possible in our context because of two observations:
MUST-LINK: Topics overlap across domains. E.g., price and cost frequently appear together in most product reviews and hence form a must-link. Most domains share a Battery topic; some share a Screen topic (products that have that feature).
CANNOT-LINK: One topical term occurs almost always while the other almost never does. E.g., picture and price should not come together; they are negatively correlated.

8 Approach
LEARNING LIKE HUMANS DO: Retain knowledge from the past and use it to help future learning; mine reliable knowledge from past results to guide the model inference and generate more coherent topics.
THE ALGORITHM MINES 2 FORMS OF KNOWLEDGE: Must-links (two words should be in the same topic, e.g., {price, cost}) and cannot-links (two words should NOT be in the same topic).
IT MUST ALSO DEAL WITH: Wrong knowledge and knowledge transitivity.

9 Drawbacks of other models
MC-LDA and DF-LDA assume that must-links and cannot-links are always correct and do not overlap.
Transitivity problem: given {light, bright} and {light, weight}, DF-LDA links all of them together as {light, bright, weight}. MC-LDA assumes a word has only one relevant must-link and ignores all the rest, so it might miss a good amount of knowledge.
The automatically generated cannot-links are very numerous; these models cannot handle them and crash.
Some models use only must-links, and most models assume the input knowledge is correct. Others rely on labeled documents to produce better-fitting topic models, or cater to language gaps with user-defined parameters.
LTM is the first model to perform lifelong learning, but it only considers must-links.

10 OVERALL ALGORITHM: AMC – Automatically generated Must-links and Cannot-links

11 Phase 1 (Initialization)
LDA is run on each domain corpus Di ∈ D to produce a set of topics Si. Let S = ∪i Si (the prior topic set). The algorithm then mines must-links (e.g., {price, cost}) from S using a multiple minimum supports frequent itemset mining (MS-FIM) algorithm. This phase is used only one time.
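A minimal sketch of Phase 1 under stated assumptions: run_lda and ms_fim are hypothetical helpers standing in for an LDA implementation and an MS-FIM implementation (neither is specified in the presentation), and each topic is kept as its list of top terms.

def phase1_initialization(domains, num_topics=15, top_n=15):
    """Build the prior topic set S and mine the initial must-links (sketch)."""
    S = []
    for D_i in domains:                          # each D_i is one domain corpus
        S_i = run_lda(D_i, num_topics, top_n)    # hypothetical: topics as top-term lists
        S.extend(S_i)                            # S = union of all S_i
    # Mine length-2 frequent itemsets, e.g. {"price", "cost"}, with per-item
    # minimum supports (MS-FIM); ms_fim is a hypothetical helper.
    must_links = ms_fim(S, itemset_length=2)
    return S, must_links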

12 Phase 2 (Lifelong Learning)

13 Phase 2 (Cont.)
Given a test document collection Dt, AMC generates its topics (the current topics) with the help of S:
Line 1: A Gibbs sampler is run on Dt with the must-links M (mined from S) to generate the topic set At; N is the number of sampling iterations.
Line 3: Based on At and S, the algorithm finds cannot-links C.
Line 4: The knowledge-based topic model (KBTM) continues with these M and C to improve At and generate the final set of topics.
Line 6: If the domain of At already exists in S, its topics in S are replaced with At; otherwise At is added to S.
Line 7: With the updated S, new must-links are mined, to be used in the next modeling task.
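A hedged sketch of the Phase 2 loop described above; gibbs_sampler_with_must_links, mine_cannot_links, kbtm, and mine_must_links are hypothetical helpers mirroring lines 1-7, and the domain bookkeeping (t.domain, D_t.name) is an assumed attribute layout, not the paper's actual code.

def amc_phase2(D_t, S, M, N=2000):
    """One lifelong-learning task for test collection D_t (sketch)."""
    A_t = gibbs_sampler_with_must_links(D_t, M, iterations=N)   # line 1
    C = mine_cannot_links(A_t, S)                               # line 3
    A_t = kbtm(D_t, M, C, iterations=N)                         # line 4: final topics
    S = [t for t in S if t.domain != D_t.name] + list(A_t)      # line 6: replace or add
    M = mine_must_links(S)                                      # line 7: for the next task
    return A_t, S, M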

14 Mining Must-Link Knowledge
A must-link {w1, w2} should reflect a semantic correlation between the two terms. E.g., we expect to see price and cost as topical terms in the same topic across many domains. Still, they may not always appear together in the price-related topic, due to errors in previous modeling. We therefore use a frequency-based approach: a topic is a distribution over terms, and the top-ranked terms under a topic are expected to carry the same semantic meaning, so we mainly employ the top terms to represent a topic.
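A small sketch of the "top terms represent a topic" idea, assuming a fitted scikit-learn LDA model as in the earlier example:

def top_terms(lda, terms, n=15):
    """Represent each topic by its n highest-probability terms (frequency-based view)."""
    return [[terms[i] for i in row.argsort()[::-1][:n]] for row in lda.components_]

# prior_topics = top_terms(lda, vectorizer.get_feature_names_out())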

15 Mining Must-Link Knowledge (Steps)
Given the set of prior topics S, we find sets of terms that appear together across multiple topics using a data mining technique, Frequent Itemset Mining (FIM).
Rare item problem: a single minimum support threshold is not appropriate. Terms of generic topics such as price and cost appear across most domains, while terms of specific topics such as screen occur only for products that have that feature. Setting the threshold too low generates spurious itemsets for the generic topics; setting it too high misses must-links from the less frequent topics.
We therefore use multiple minimum supports frequent itemset mining (MS-FIM):
Minimum item support (MIS) – each item is given its own MIS, so the minimum support of an itemset is not fixed.
Support difference constraint (SDC) – the supports of items in an itemset should not differ too much.
We keep only itemsets of length 2, which give fewer errors and a clearer semantic relationship, e.g., {battery, life}, {battery, power}, {battery, charge}, {price, expensive}, {price, pricy}, {cheap, expensive}, {price, cost}.
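A simplified sketch of the length-2 MS-FIM step, under stated assumptions: prior_topics is a list of top-term lists (one per prior topic), the per-item minimum supports follow the max(4, 35%) rule given later in the experimental settings, and the SDC check (difference in relative support) is a plausible interpretation rather than the paper's exact formulation.

from itertools import combinations
from collections import Counter

def mine_must_links(prior_topics, sdc=0.08):
    """Mine length-2 must-links across prior topics with multiple minimum supports (sketch)."""
    n = len(prior_topics)
    item_support = Counter(w for topic in prior_topics for w in set(topic))
    pair_support = Counter(
        frozenset(p) for topic in prior_topics for p in combinations(set(topic), 2)
    )
    mis = {w: max(4, 0.35 * s) for w, s in item_support.items()}  # per-item minimum support
    must_links = []
    for pair, sup in pair_support.items():
        w1, w2 = tuple(pair)
        within_sdc = abs(item_support[w1] - item_support[w2]) / n <= sdc
        if sup >= min(mis[w1], mis[w2]) and within_sdc:
            must_links.append({w1, w2})  # e.g. {"price", "cost"}
    return must_links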

16 Frequent Itemset Mining

17 Mining Cannot-Link Knowledge
We use the same frequency-based approach, but mining cannot-links from every term pair is not feasible: for V terms in the vocabulary there are on the order of O(V^2) possible cannot-links, and for a new test domain most of them would not even be useful because the terms do not appear there. We therefore focus only on terms relevant to Dt: from the current topics At, we extract cannot-links for each pair of top terms w1 and w2.

18 Mining Cannot-Link Knowledge (Steps)
Let Ndiff be the number of domains where w1 and w2 appear in different prior topics (p-topics), and Nshare be the number of domains where they appear in the same topic. Two conditions control the formation of a cannot-link:
Support ratio: Ndiff / (Nshare + Ndiff) >= threshold πc
Support: Ndiff is greater than the support threshold πdiff
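A minimal sketch of these two conditions, assuming domain_topics maps each of the other domains to its list of topics (as top-term lists); reading "different p-topics" as "both terms appear in the domain but never in the same topic" is my interpretation, and the default thresholds correspond to πc and πdiff from the experimental settings.

def is_cannot_link(w1, w2, domain_topics, pi_c=0.8, pi_diff=10):
    """Check the support-ratio and support conditions for a candidate cannot-link (sketch)."""
    n_share = n_diff = 0
    for topics in domain_topics.values():
        has_w1 = any(w1 in t for t in topics)
        has_w2 = any(w2 in t for t in topics)
        shared = any(w1 in t and w2 in t for t in topics)
        if shared:
            n_share += 1
        elif has_w1 and has_w2:
            n_diff += 1       # both terms occur in this domain, but never in the same topic
    if n_share + n_diff == 0:
        return False
    return n_diff / (n_share + n_diff) >= pi_c and n_diff > pi_diff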

19 THE AMC MODEL

20 Dealing with issues of Must-Link
MULTIPLE MEANINGS: E.g., light can refer to visibility or to weight.
DF-LDA has the transitivity problem discussed earlier: {w1, w2} and {w2, w3} ⇒ {w1, w2, w3}.
MC-LDA assumes only one relevant must-link per word and ignores all the rest, so it loses information.

21 Dealing with issues of Must-Link
MULTIPLE MEANINGS (Solution): Construct a must-link graph whose vertices are the must-links (e.g., m1 and m2); there is an edge between two must-links if they share a common term. For each edge we check how much the original topics that produced the two must-links overlap: let T1 and T2 be the topics for m1 and m2 respectively, and πoverlap be the threshold for distinguishing word senses. Edges whose topic overlap does not satisfy the threshold are deleted.
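A sketch of the edge-pruning step, with one loud assumption: the overlap measure used here (shared top terms over the smaller topic) stands in for the paper's exact inequality, which is not spelled out on the slide; the default πoverlap matches the 17% setting given later, and origin_topics is an assumed bookkeeping structure.

def prune_must_link_graph(must_links, origin_topics, pi_overlap=0.17):
    """Build the must-link graph and keep only edges whose source topics overlap enough (sketch).

    must_links: list of 2-term sets (the graph vertices).
    origin_topics: maps each must-link (as a frozenset) to the set of top terms
    of the topic it was mined from.
    """
    edges = []
    for i, m1 in enumerate(must_links):
        for m2 in must_links[i + 1:]:
            if not (m1 & m2):                 # edge only if they share a common term
                continue
            T1 = origin_topics[frozenset(m1)]
            T2 = origin_topics[frozenset(m2)]
            overlap = len(T1 & T2) / min(len(T1), len(T2))
            if overlap >= pi_overlap:         # same word sense: keep the edge
                edges.append((frozenset(m1), frozenset(m2)))
    return edges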

22 Dealing with issues of Must-Link
WRONG KNOWLEDGE: A must-link may not be correct at all due to errors in the model, e.g., {battery, beautiful} is not a correct must-link for any domain. A must-link may also be correct for one domain but wrong for another, e.g., {card, bill} is correct for the restaurant domain but not when card refers to a camera memory card.

23 Dealing with issues of Must-Link
WRONG KNOWLEDGE (Solution): Pointwise Mutual Information (PMI) measures the extent to which two terms tend to co-occur. P(w) is the probability of seeing term w in a random document; P(w1, w2) is the probability of seeing both terms co-occurring in a random document. A positive PMI indicates a semantic correlation between the terms; a non-positive PMI indicates little or none.
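From these definitions, PMI takes its standard form (with the probabilities estimated from document counts):

\mathrm{PMI}(w_1, w_2) \;=\; \log \frac{P(w_1, w_2)}{P(w_1)\, P(w_2)}

A positive value means the two terms co-occur more often than chance would predict.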

24 Dealing with issues of Cannot-Link
A cannot-link can wrongly pair terms that are semantically correlated, e.g., {battery, charger}. A cannot-link may also fit one domain but not another, e.g., {card, bill} is a correct cannot-link for the camera domain but not for the restaurant domain. A wrong cannot-link can even conflict with the must-links: {price, cost} and {price, pricy} are must-links, yet the system may find {pricy, cost} as a cannot-link. The automatically generated cannot-links are also large in number and difficult to handle, and low co-occurrence does not necessarily mean a negative relationship. Since wrong cannot-links are harder to detect and verify than wrong must-links, we detect them inside the sampling process.

25 Gibbs Sampler
SIMPLE PÓLYA URN MODEL (SPU): Terms are the colored balls and topics are the urns. A ball is drawn from an urn and put back together with another ball of the same color, so the contents of the urn change over time (the rich get richer). Each draw corresponds to assigning a topic to a term.
GENERALIZED PÓLYA URN MODEL (GPU): The drawn ball is put back as two balls of the same color along with some balls of other colors. This increases the proportion of those other colors in the urn and is called promotion.
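A toy simulation of the two urn schemes (a sketch for intuition only; colors stand for terms and the single urn for one topic). With an empty promotion table the draw reduces to the simple Pólya urn; a non-empty table implements the GPU-style promotion described above.

import random
from collections import Counter

def polya_draw(urn, promotion=None):
    """One draw: return the ball plus one of the same color, then promote related colors."""
    colors = list(urn)
    drawn = random.choices(colors, weights=[urn[c] for c in colors])[0]
    urn[drawn] += 1                                   # rich get richer (SPU behaviour)
    for other, extra in (promotion or {}).get(drawn, {}).items():
        urn[other] += extra                           # GPU promotion of related colors
    return drawn

urn = Counter(price=2, cost=1, screen=1)
promotion = {"price": {"cost": 1}}                    # drawing "price" also promotes "cost"
for _ in range(100):
    polya_draw(urn, promotion)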

26 Proposed M-GPU Model
Consider many urns being sampled simultaneously. Each time a term w is assigned to a topic k, each term w' that shares a must-link with w is also assigned to topic k by a certain amount, decided by the promotion matrix λw',w; w' is promoted by w. A scaling factor controls how much M-GPU should trust the word relationship indicated by PMI, which is helpful when a word has multiple senses/meanings.

27 Proposed M-GPU Model (Cont.)
For cannot-links, M-GPU uses two kinds of urns: topic urns, one per document, where topics are the colored balls; and term urns, one per topic, where words are the colored balls. Over multiple iterations we sample a term w from a term urn and transfer its cannot-term wc to an urn that has a higher proportion of wc, i.e., we decrease the probabilities of the cannot-terms under this topic while increasing their probabilities under some other topic. If no urn contains wc, a new urn is created for wc. As noted earlier, the cannot-link knowledge itself may not be correct.

28 SAMPLING DISTRIBUTIONS

29 Phase 1 – (a): Calculate the conditional probability of sampling a topic for term wi: enumerate each topic k and compute its probability, then sample a must-link mi that contains wi and is likely to have the word sense consistent with topic k.
Notation: w1 and w2 are the terms in must-link m; P(w | k) is the probability of term w under topic k; λw',w is the promotion matrix; nk,w is the number of times term w appears under topic k; β is the Dirichlet prior.
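The equation itself is an image on the original slide; a hedged reconstruction consistent with the notation above (and with the usual generalized Pólya urn treatment, so the exact expression may differ from the paper) is that the must-link mi = {w1, w2} containing wi is sampled with probability proportional to the product of its two terms' probabilities under topic k:

\Pr(m_i = \{w_1, w_2\} \mid k) \;\propto\; P(w_1 \mid k)\, P(w_2 \mid k),
\qquad
P(w \mid k) \;=\; \frac{\sum_{w'} \lambda_{w',w}\, n_{k,w'} + \beta}{\sum_{v=1}^{V} \left( \sum_{w'} \lambda_{w',v}\, n_{k,w'} + \beta \right)}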

30 Phase 1 – (b): Create the set of must-links {m'}, where m' is either mi or one of its neighbors in the must-link graph. The must-links in {m'} are likely to share the same word sense of term wi.

31 Phase 1 – (c) conditional probability of assigning topic k to term wi
Notation: n^-i denotes a count excluding the current assignment of zi (i.e., over z^-i); wi is the current term to be sampled, with its topic denoted by zi; nd,k is the number of times topic k is assigned to terms in document d; nk,w is the number of times term w appears under topic k; {mv'} is the set of must-links sampled for each term v.
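The conditional itself is shown as an image on the slide; a hedged reconstruction from the counts defined above, following the standard GPU-based Gibbs sampler (with the promotion λ applied only across word pairs covered by the sampled must-links {mv'}, a detail simplified here), is:

p(z_i = k \mid \mathbf{z}^{-i}, \mathbf{w}) \;\propto\;
\frac{n^{-i}_{d,k} + \alpha}{\sum_{k'=1}^{K} \left( n^{-i}_{d,k'} + \alpha \right)}
\times
\frac{\sum_{w'} \lambda_{w',w_i}\, n^{-i}_{k,w'} + \beta}{\sum_{v=1}^{V} \left( \sum_{w'} \lambda_{w',v}\, n^{-i}_{k,w'} + \beta \right)}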

32 Phase 2 – (a): For every cannot-term wc of wi, sample one instance qc of wc from topic zi, where zi is the topic assigned to term wi and dc denotes the document of the instance qc. If there is no instance of wc in zi, skip step (b).

33 Phase 2 – (b): For each instance qc drawn in Phase 2 (a), resample a topic k (not equal to zi). The superscript -qc denotes counts excluding the original assignment of qc; I(·) is an indicator function that restricts the ball to be transferred only to an urn (topic) containing a higher proportion of term wc. If no topic k has a higher proportion of wc than the original topic zc, keep the original topic assignment, i.e., assign zc to wc. {mc'} is the set of must-links sampled for the term wc.
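A compact sketch of Phase 2 as described on slides 32-33; sample_instance, proportion, and resample_topic are hypothetical helpers, the q_c.topic attribute is an assumed bookkeeping detail, and the exact resampling distribution (with the -qc counts and the indicator I) is simplified to the candidate restriction it encodes.

def phase2_cannot_link_step(w_i, z_i, cannot_terms, topics):
    """Push cannot-terms of w_i out of topic z_i, following slides 32-33 (sketch)."""
    for w_c in cannot_terms.get(w_i, []):
        q_c = sample_instance(w_c, topic=z_i)            # (a) one instance of w_c under z_i
        if q_c is None:
            continue                                     # no instance of w_c in z_i: skip (b)
        # (b) only topics with a higher proportion of w_c than z_i may receive the ball
        candidates = [k for k in topics
                      if k != z_i and proportion(w_c, k) > proportion(w_c, z_i)]
        if candidates:
            q_c.topic = resample_topic(q_c, candidates)  # transfer to a better-fitting urn
        # else: keep the original topic assignment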

34 EVALUATION

35 Experimental Settings
DATASETS: Two large datasets. The first contains 50 electronic products (domains); the second contains 50 non-electronic products (domains). Each domain has 1,000 reviews.

36 Experimental Settings
PARAMETER SETTINGS: 2,000 Gibbs iterations; α = 1, β = 0.1; K = 15 topics.
Minimum item support (MIS) = max(4, 35% of the item's actual support count in the data)
Support difference constraint (SDC) = 8%
Support ratio threshold for cannot-links (πc) = 80%
Support threshold for cannot-links (πdiff) = 10
Overlap ratio threshold for must-link graph edges (πoverlap) = 17%
Control factor determining the extent of promotion for must-links = 0.5

37 Experiment with First dataset
The 50 electronics domains have a large amount of topic overlap. Each domain in turn is treated as the test set Dt, while knowledge is mined from the other 49. Since the aim is to improve topic modeling on small datasets, 100 reviews are randomly sampled from the 1,000 reviews of the test domain; knowledge, however, is extracted from the full data (1,000 reviews) of each of the other 49 domains.

38 Topic Coherence
Topic Coherence evaluates the topics generated by each model. It correlates well with human expert labeling, and a higher Topic Coherence score indicates higher-quality topics.
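The deck does not reproduce the formula; assuming the Topic Coherence measure standard in this line of work (Mimno et al., 2011), for a topic t with top terms w_1^{(t)}, ..., w_M^{(t)}:

TC(t) \;=\; \sum_{m=2}^{M} \sum_{l=1}^{m-1} \log \frac{D\left(w^{(t)}_m,\, w^{(t)}_l\right) + 1}{D\left(w^{(t)}_l\right)}

where D(w) is the number of documents containing w and D(w, w') the number containing both; higher (less negative) values indicate more coherent topics.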

39 Topic Coherence (Observations)
AMC performs best; even AMC-M (the must-links-only variant) performs better than the rest. DF-LDA generates an exponential number of cannot-links and crashes. GK-LDA performs better than LDA but suffers from incorporating wrong knowledge. The Topic Coherence value for AMC increases from r = 1 to 3 and then stabilizes. Even with 1,000 documents, AMC gives better coherence than LTM by 47 points.

40 Human Evaluation
Two human judges, well familiar with Amazon product reviews, took part. Since 50 domains is a large number, 10 random domains were selected. AMC was compared with LDA (the basic knowledge-free topic model) and LTM (the lifelong learning model that achieved the highest results among the baselines). The judges were asked to label each topic as coherent (if its topical terms are semantically coherent) or incoherent, and to label each topical word as correct if it is coherently related to the concept represented by the topic, otherwise incorrect. Evaluation measure –

41 Human Evaluation (Observations)

42 Example Topics (Camera)

43 Experiment with both datasets

44 Conclusions
The paper proposes a lifelong learning algorithm, AMC, that mines prior knowledge from the results of past modeling and uses it to help future modeling. Two types of knowledge are mined: must-links and cannot-links. AMC not only exploits the learned knowledge but also deals with issues in the mined knowledge and improves it. It significantly outperforms existing state-of-the-art models. Future work includes maintaining the prior topics and updating the must-link knowledge when new topics are added.

45 THANK YOU

