TOPIC MODELING FOR ASSOCIATED PRESS ARTICLES USING LATENT DIRICHLET ALLOCATION (LDA)
Final Project Presentation
Name: Samer Al-Khateeb
Instructor: Dr. Xiaowei Xu
Class: Information Science Principles/Theory (IFSC 7321)

OUTLINE
Introduction
Methodology
  A. Data Collection and Software Used
  B. Data Pre-Processing
Experiment and Evaluation
  A. Selecting the Optimal Model Parameters
  B. Training and Testing
  C. Results
Summary and Future Work
References

INTRODUCTION
What is topic modeling? "A topic modeling tool takes a single text (or corpus) and looks for patterns in the use of words; it is an attempt to inject semantic meaning into vocabulary," so words like Computer and Laptop end up in the same topic.

What is LDA? Topic-modeling approaches have evolved through:
1. Vector Space Model (VSM)
2. Latent Semantic Analysis (LSA)
3. Probabilistic Latent Semantic Analysis (pLSA)
4. Latent Dirichlet Allocation (LDA): "an important hierarchical Bayesian model for probabilistic topic modeling."

Types of LDA-based topic models:
1. Document-Topic Model
2. Author-Topic Model (ATM)
3. Relational Topic Model (RTM)
4. Labeled LDA (LaLDA)

Approximate inference methods:
1. Variational Bayes (VB)
2. Gibbs Sampling (GS)
3. Belief Propagation (BP)
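The generative story behind LDA can be sketched in a few lines of NumPy: each topic is a distribution over words, each document mixes topics, and each word is drawn by first picking a topic. The dimensions and hyperparameters below are toy values for illustration, not the ones used in this project.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: K topics, V vocabulary words, D documents.
K, V, D = 3, 8, 5
alpha, beta = 0.1, 0.01  # symmetric Dirichlet priors

# Topic-word distributions: phi[k] ~ Dirichlet(beta), one row per topic.
phi = rng.dirichlet([beta] * V, size=K)

def generate_document(n_words):
    """Draw one document from the LDA generative process."""
    theta = rng.dirichlet([alpha] * K)        # document-topic mixture
    z = rng.choice(K, size=n_words, p=theta)  # a topic for each word slot
    return [rng.choice(V, p=phi[k]) for k in z]  # a word from each topic

docs = [generate_document(20) for _ in range(D)]
print(docs[0])  # a list of 20 word ids
```

Inference (Gibbs sampling, variational Bayes, belief propagation) runs this story in reverse: given only the words, it recovers plausible phi and theta.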

METHODOLOGY
Software: Stanford Topic Modeling Toolbox (TMT).
Data: 2,250 Associated Press articles with 2 features.
Data pre-processing:
1. Removing all markup tags using Microsoft Excel.
2. Case folding for lower/upper-case words (The = tHE = ThE = thE = tHe → the).
3. Filtering:
   i. WordsAndNumbersOnlyFilter
   ii. MinimumLengthFilter(3)
   iii. TermMinimumDocumentCountFilter(4)
   iv. TermDynamicStopListFilter(30)
   v. DocumentMinimumLengthFilter(5)
   vi. StopWordFilter("en")
4. No stemming, although TMT provides a Porter stemmer.
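The effect of TMT's filter chain can be approximated in plain Python. The thresholds below come from the slide; the `preprocess` helper and the tiny stand-in stopword set are illustrative, not TMT's actual Scala API.

```python
import re
from collections import Counter

STOPWORDS = {"the", "and", "for", "was", "that"}  # stand-in for StopWordFilter("en")

def preprocess(docs, min_len=3, min_doc_count=4, top_stop=30, min_doc_words=5):
    # Case folding + WordsAndNumbersOnlyFilter: lowercase, keep word/number tokens.
    tokenized = [re.findall(r"[a-z0-9]+", d.lower()) for d in docs]
    # MinimumLengthFilter(3): drop tokens shorter than min_len characters.
    tokenized = [[t for t in doc if len(t) >= min_len] for doc in tokenized]
    # Document frequency of each surviving term.
    doc_freq = Counter(t for doc in tokenized for t in set(doc))
    # TermDynamicStopListFilter(30): drop the top_stop most frequent terms.
    dynamic_stop = {t for t, _ in doc_freq.most_common(top_stop)}
    # TermMinimumDocumentCountFilter(4): drop terms in fewer than min_doc_count docs.
    keep = lambda t: (doc_freq[t] >= min_doc_count
                      and t not in dynamic_stop and t not in STOPWORDS)
    tokenized = [[t for t in doc if keep(t)] for doc in tokenized]
    # DocumentMinimumLengthFilter(5): drop docs with too few remaining tokens.
    return [doc for doc in tokenized if len(doc) >= min_doc_words]
```

Note the order matters: the dynamic stoplist is computed from document frequencies before rare-term and short-document filtering, mirroring the slide's chain.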

EXPERIMENT AND EVALUATION
Selecting the optimal model parameters (K, α, β).
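Candidate settings of K, α, and β are typically compared by held-out perplexity, the exponentiated negative log-likelihood per token (lower is better). A minimal sketch of that comparison follows; the log-likelihood scores are hypothetical numbers, not results from this project.

```python
import math

def perplexity(total_log_likelihood, total_tokens):
    """Perplexity = exp(-log-likelihood per token); lower is better."""
    return math.exp(-total_log_likelihood / total_tokens)

# Hypothetical held-out log-likelihoods for three candidate K values.
candidates = {250: -81000.0, 375: -78000.0, 500: -79500.0}
n_tokens = 10000

scores = {K: perplexity(ll, n_tokens) for K, ll in candidates.items()}
best_K = min(scores, key=scores.get)
print(best_K)  # 375: the lowest perplexity wins
```

The same grid search extends to α and β: refit the model per setting and keep the combination with the lowest held-out perplexity.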

EXPERIMENT AND EVALUATION, CONTINUED
Training and Testing
1. Obtained the document-topic distributions for the trained model.
2. For testing, I used the same dataset.
3. I used the collapsed Gibbs sampler with 1,500 iterations because it is faster than the alternative method (collapsed variational Bayes approximation).
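The collapsed Gibbs sampler resamples each token's topic from its full conditional, given the counts of all other tokens. A compact sketch of the standard algorithm (toy word-id documents, not the AP corpus or TMT's implementation):

```python
import numpy as np

def collapsed_gibbs_lda(docs, K, V, alpha=0.05, beta=0.03, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    n_dk = np.zeros((len(docs), K))  # topic counts per document
    n_kw = np.zeros((K, V))          # word counts per topic
    n_k = np.zeros(K)                # total tokens per topic
    z = []                           # current topic assignment per token
    for d, doc in enumerate(docs):   # random initialization
        zd = rng.integers(K, size=len(doc))
        z.append(zd)
        for w, k in zip(doc, zd):
            n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]          # remove this token's current assignment
                n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
                # Full conditional p(z_i = k | rest), up to a constant.
                p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
                k = rng.choice(K, p=p / p.sum())
                z[d][i] = k
                n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    return n_dk, n_kw

toy_docs = [[0, 0, 1, 1], [2, 2, 3, 3], [0, 1, 0, 1]]  # word ids, V = 4
n_dk, n_kw = collapsed_gibbs_lda(toy_docs, K=2, V=4, iters=50)
```

After sampling, normalizing each row of (n_dk + α) estimates the document-topic distributions θ, and each row of (n_kw + β) estimates the topic-word distributions φ.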

EXPERIMENT AND EVALUATION, CONTINUED
Results

EXPERIMENT AND EVALUATION, CONTINUED
Results, continued

SUMMARY AND FUTURE WORK
Summary:
- Stanford Topic Modeling Toolbox (TMT)
- 2,250 Associated Press articles
- Collapsed Gibbs sampler
- Optimal parameters (K = 375, α = 0.05, β = 0.03)
- The top 20 words of each document
Future work:
- Try another inference method, such as variational Bayes (VB) or belief propagation (BP), on the same dataset and compare the optimal parameters obtained from all methods.
- Compare the training and testing results to see which inference method is more accurate and which produces results faster.

REFERENCES
[1] "Topic model," Wikipedia, the free encyclopedia. 30-Mar.
[2] K. Christidis, D. Apostolou, and G. Mentzas, "Exploring Customer Preferences with Probabilistic Topic Models."
[3] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet allocation," J. Mach. Learn. Res., vol. 3, pp. 993–1022, 2003.
[4] "Latent Dirichlet Allocation in C." [Online]. [Accessed: 25-Apr-2014].
[5] D. Ramage and E. Rosen, "Stanford Topic Modeling Toolbox," The Stanford Natural Language Processing Group, Sep. [Online]. [Accessed: 25-Apr-2014].
[6] "Scala (programming language)," Wikipedia, the free encyclopedia. 28-Apr.
[7] J. Zeng, "A topic modeling toolbox using belief propagation," J. Mach. Learn. Res., vol. 13, no. 1, pp. 2233–2236, 2012.

THANKS Questions?!