Hetero-Labeled LDA: A Partially Supervised Topic Model with Heterogeneous Label Information
Dongyeop Kang (1), Youngja Park (2), Suresh Chari (2)
1. IT Convergence Laboratory, KAIST Institute, Korea
2. IBM T.J. Watson Research Center, NY, USA
Topic Discovery - Supervised
Topic classification: learn decision boundaries between classes from labeled data
Accurate topic classification for general domains
Very hard to build a model for business applications due to the labeled-data bottleneck
Topic Discovery – Unsupervised
Probabilistic topic modeling: learn a topic distribution for each class from data without label information, and assign new data to the most similar topic distribution
e.g., Latent Dirichlet Allocation (LDA)
Not sufficiently accurate or interpretable
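The unsupervised setting above can be sketched with off-the-shelf LDA, e.g. scikit-learn's implementation (a minimal illustration, not the code used in the paper; the toy documents and topic count are made up):

```python
# Unsupervised topic discovery with plain LDA via scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Tiny toy corpus (hypothetical); real corpora are the datasets in the paper.
docs = [
    "trade billion dollar export bank finance",
    "grain wheat corn oil sugar export",
    "game team player run pitch hit",
    "god christian jesus bible church",
]

# Bag-of-words counts, then fit a 2-topic model (K chosen by hand --
# no label information is used anywhere).
counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

# Per-document topic distribution theta: each row sums to 1.
theta = lda.transform(counts)
print(theta.shape)  # (4, 2)
```

New documents are then assigned to the most similar learned topic distribution; the drawback noted on the slide is that nothing forces the discovered topics to align with the categories a user cares about.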
Topic Discovery – Semi-supervised
Supervised topic modeling methods
- Supervised LDA [Blei & McAuliffe, 2007], Labeled LDA [Ramage, 2009]: document labels provided
Semi-supervised topic modeling methods
- Seeded LDA [Jagarlamudi, 2012], zLDA [Andrzejewski, 2009]: word labels/constraints provided
Limitations
- Only one kind of domain knowledge is supported
- The labels must cover the entire topic space: |L| = |T|
- All documents must be labeled in the training data: |Dunlabeled| = ∅
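The |L| = |T| limitation comes from how Labeled-LDA-style models constrain inference: a document's words may only be assigned topics from that document's label set, so every topic must appear as a label somewhere. A minimal sketch of that masking step (hypothetical numbers, not the paper's code):

```python
# Labeled-LDA-style hard constraint: zero out topics outside the document's
# label set and renormalize. If labels don't cover all K topics, the
# uncovered topics can never be used -- hence the |L| = |T| requirement.
import numpy as np

K = 4                                  # total number of topics
doc_label_set = {0, 2}                 # topics this document is labeled with

p = np.full(K, 1.0 / K)               # unconstrained topic probabilities
mask = np.zeros(K)
mask[list(doc_label_set)] = 1.0

p_constrained = p * mask
p_constrained /= p_constrained.sum()   # renormalize over labeled topics only
print(p_constrained)                   # mass only on topics 0 and 2
```

HL-LDA relaxes exactly this: a document's label set may cover only some of its topics, and unlabeled documents are allowed at all.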
Partially Semi-supervised Topic Modeling with Heterogeneous Labels
Generating labeled training samples is much more challenging for real-world applications
In most large companies, data are generated and managed independently by many different divisions
Different types of domain knowledge are available in different divisions
Can we discover accurate and meaningful topics with a small amount of various types of domain knowledge?
Hetero-Labeled LDA: Main Contributions
Heterogeneity: domain knowledge (labels) comes in different forms, e.g., document labels, topic-indicative features, a partial taxonomy
Partialness: only a small amount of labels is given; we address two kinds of partialness
- Partially labeled documents: |L| << |T|
- Partially labeled corpus: |Dlabeled| << |Dunlabeled|
Three levels of domain information: group information, label information, topic distribution
Challenges
Document labels (Ld), feature labels (Lw)
Example feature labels (Lw), as topic-indicative word sets:
{trade, billion, dollar, export, bank, finance}
{grain, wheat, corn, oil, oilseed, sugar, tonn}
{game, team, player, hit, dont, run, pitch}
{god, christian, jesus, bible, church, christ}
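The two kinds of partial knowledge on this slide can be represented with simple structures (a hypothetical sketch of the inputs, not the paper's actual data format; document IDs and topic names are made up):

```python
# Document labels Ld: each labeled document maps to a (possibly partial)
# set of topics. Most documents carry no label at all.
doc_labels = {
    "doc_01": {"finance"},
    "doc_02": {"grain", "finance"},   # multi-label document
}

# Feature labels Lw: topic-indicative seed words, as on the slide.
word_labels = {
    "finance":  {"trade", "billion", "dollar", "export", "bank"},
    "grain":    {"grain", "wheat", "corn", "oil", "oilseed", "sugar"},
    "baseball": {"game", "team", "player", "hit", "run", "pitch"},
    "religion": {"god", "christian", "jesus", "bible", "church"},
}

# A word's candidate topics are the topics whose seed set contains it.
def candidate_topics(word):
    return {t for t, seeds in word_labels.items() if word in seeds}

print(candidate_topics("export"))  # {'finance'}
```

The modeling challenge is that these two sources cover different (and incomplete) slices of the topic space, so neither can simply be plugged into an existing supervised topic model.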
Hetero-Labeled LDA: Heterogeneity
[Plate diagram: over D documents and K topics, words w are generated from topic assignments z, with per-document topic distribution θ (prior α) and topic-word distributions φ (prior β); document labels Λd (prior γ) and word labels Λw (prior δ) condition the model; Wd words per document]
Hetero-Labeled LDA: Partialness
[Plate diagram as above, with partialness constraints: Kd << K, Kw << K, Kd ∩ Kw ≠ ∅]
Hetero-Labeled LDA: Heterogeneity+Partialness
[Plate diagram: a hybrid constraint Ψ combines the document-specific topic distribution (over Kd topics) with the general topic distribution (over all K topics); word labels Λw range over Kw topics]
Hetero-Labeled LDA: Generative Process
Hetero-Labeled LDA: Inference & Learning
Gibbs sampling
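For reference, collapsed Gibbs sampling for plain LDA looks like the sketch below; HL-LDA's sampler follows the same count-based update, with the label constraints folded in. This is a generic illustration, not the paper's inference code, and the toy corpus is made up:

```python
# Collapsed Gibbs sampling for plain LDA (minimal sketch).
import numpy as np

def gibbs_lda(docs, K, V, alpha=0.1, beta=0.01, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    ndk = np.zeros((len(docs), K))      # doc-topic counts
    nkw = np.zeros((K, V))              # topic-word counts
    nk = np.zeros(K)                    # per-topic totals
    z = []                              # current topic assignments
    for d, doc in enumerate(docs):      # random initialization
        zd = rng.integers(K, size=len(doc))
        z.append(zd)
        for w, k in zip(doc, zd):
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]             # remove the current assignment
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                # p(z = k | rest) ∝ (n_dk + α) · (n_kw + β) / (n_k + Vβ)
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
                k = rng.choice(K, p=p / p.sum())
                z[d][i] = k             # resample and restore counts
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    theta = (ndk + alpha) / (ndk + alpha).sum(axis=1, keepdims=True)
    return theta, z

# Toy corpus of word-id lists with vocabulary size V = 4.
docs = [[0, 0, 1, 1], [2, 3, 2, 3]]
theta, _ = gibbs_lda(docs, K=2, V=4)
print(theta.shape)  # (2, 2)
```

In a label-constrained variant, the sampling distribution `p` would additionally be masked or reweighted according to the document and word labels before normalization.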
Experiments
Algorithms
- Baseline: LDA, LLDA, zLDA
- Proposed: HLLDA (L = T), HLLDA (L < T)
Evaluation metrics
- Prediction accuracy: the higher the better
- Clustering F-measure: the higher the better
- Variation of Information: the lower the better
Datasets
Dataset        N       V        T
Reuters        21,073  32,848   20
News20         19,997  82,780
Delicious.5K   5,000   890,429
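Two of the metrics above are easy to make concrete. The sketch below computes prediction accuracy and Variation of Information (VI = H(X) + H(Y) − 2·I(X;Y)) with scikit-learn and NumPy; the label vectors are made-up examples, and this is a generic illustration rather than the paper's evaluation script:

```python
import numpy as np
from sklearn.metrics import accuracy_score, mutual_info_score

# Hypothetical gold topics vs. predicted topics for six documents.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]

# Prediction accuracy: fraction of documents with the correct topic.
acc = accuracy_score(y_true, y_pred)        # higher is better

def entropy(labels):
    """Shannon entropy (in nats) of a labeling."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log(p)).sum()

# Variation of Information between the two labelings; lower is better,
# 0 iff the clusterings are identical. mutual_info_score is also in nats.
vi = entropy(y_true) + entropy(y_pred) - 2 * mutual_info_score(y_true, y_pred)
print(round(acc, 3), round(vi, 3))
```

Clustering F-measure is typically computed from pairwise or per-cluster precision/recall matchings and is omitted here for brevity.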
Experiment: Questions
Q1. How does a mixture of heterogeneous label information improve classification and clustering performance?
Multi-class Prediction Accuracy
Clustering F-Measure
Experiment: Questions
Q2. How does HLLDA improve performance under partial labeling?
- Partially labeled corpus: |Dlabeled| << |Dunlabeled|
- Partially labeled document: |L| << |T|; for a document, the provided label set covers only a subset of all the topics the document belongs to, and our goal is to predict the full set of topics for each document
Partially Labeled Documents: |L| << |T|
Partially Labeled Corpus: |Dlabeled| << |Dunlabeled|
Experiment: Questions
Q3. How interpretable are the generated topics?
- Comparison between LLDA and HLLDA
- User study for topic quality
News-20: LLDA (10) vs HLLDA (10)
[Figures: LLDA(10) with 10 document labels; HLLDA(10) with 10 document labels; LLDA(10) with another 10 document labels]
Delicious.5k: LLDA (10) vs HLLDA (10)
[Figures: LLDA(10) with 10 document labels; LLDA(10) with another 10 document labels; HLLDA(10) with 10 document labels]
User Study for Topic Quality
Number of topically irrelevant (red) and relevant (blue) words. The more blue words (and the fewer red words), the higher the topic quality.
Conclusions
- Proposed a novel algorithm for partially semi-supervised topic modeling
- Incorporates multiple kinds of heterogeneous domain knowledge that can be easily obtained in real life
- Supports two types of partialness: |L| << |T| and |Dlabeled| << |Dunlabeled|
- A unified graphical model
- Experimental results confirm that learning from multiple sources of domain information is beneficial (mutually reinforcing)
- HLLDA outperforms existing semi-supervised methods on classification and clustering tasks
THANK YOU contact: young_park@us.ibm.com