
1 Linmei HU 1, Juanzi LI 1, Zhihui LI 2, Chao SHAO 1, and Zhixing LI 1. 1 Knowledge Engineering Group, Dept. of Computer Science and Technology, Tsinghua University; 2 Dept. of Computer Science and Technology, Beijing Information Science and Technology University. Incorporating Entities in News Topic Modeling

2 Outline Motivation Related Work Approach Experiment Conclusion & Future Work

3 Motivation 78% of Internet users in China (461 million) read news online [Jun. 2013, CNNIC]. 1. Online news reading has become a popular habit. 2. With the overwhelming and growing volume of news articles, it is urgent to organize news to facilitate reading. 3. Named entities play a critical role in conveying semantic information. Why not cluster news according to entities?

4 Motivation Named entities, which refer to persons, locations, times, and organizations, play critical roles in conveying news semantics: who, when, where, and what.

5 Related Work LDA (Latent Dirichlet Allocation) –Blei, D.M., Ng, A.Y., Jordan, M.I. –In: Journal of Machine Learning Research 3 (2003) Entity topic models for mining documents associated with entities –Kim, H., Sun, Y., Hockenmaier, J., Han, J. –In: ICDM'12 (2012) Statistical entity-topic models –Newman, D., Chemudugunta, C., Smyth, P. –In: KDD (2006)

6 Related Work The dependency among topics, entities, and words. We propose the Entity Centered Topic Model (ECTM). We cluster news articles according to entity topics.

7 Our work Entity topic: a multinomial distribution over entities, represented by its top 15 entities. Word topic: a multinomial distribution over words, represented by its top 15 words. We generate entities and words separately. We assume that when writing a news article, the named entities are determined first, which implies a set of entity topics. Each entity topic then has a multinomial distribution over word topics, from which the word topics of the article are drawn. Entity Topic Word Topic
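The generative story on this slide (entities first, then word topics conditioned on entity topics) can be sketched as follows. This is a minimal illustration, not the paper's exact specification: the sizes, the symmetric Dirichlet priors `alpha`/`beta`, and all variable names are assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes and hyperparameters (not taken from the slides).
K_e, K_w = 4, 6          # number of entity topics and word topics
V_e, V_w = 50, 200       # entity and word vocabulary sizes
alpha, beta = 0.1, 0.01  # symmetric Dirichlet hyperparameters

# Global distributions.
phi_e = rng.dirichlet([beta] * V_e, size=K_e)   # each entity topic: a multinomial over entities
phi_w = rng.dirichlet([beta] * V_w, size=K_w)   # each word topic: a multinomial over words
pi = rng.dirichlet([alpha] * K_w, size=K_e)     # per entity topic: a multinomial over word topics

def generate_article(n_entities=10, n_words=80):
    # Step 1: entities are determined first.
    # Draw the article's entity-topic mixture, then entity-topic assignments and entities.
    theta = rng.dirichlet([alpha] * K_e)
    z_e = rng.choice(K_e, size=n_entities, p=theta)
    entities = [rng.choice(V_e, p=phi_e[z]) for z in z_e]
    # Step 2: each word's word topic is drawn from the word-topic distribution
    # of one of the article's entity topics, then the word itself is drawn.
    z_w = [rng.choice(K_w, p=pi[rng.choice(z_e)]) for _ in range(n_words)]
    words = [rng.choice(V_w, p=phi_w[z]) for z in z_w]
    return entities, words

entities, words = generate_article()
```

Note how the coupling matrix `pi` is what makes the model entity-centered: word topics have no document-level mixture of their own and are reached only through entity topics.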

8 ECTM (Entity Centered Topic Model): Denotations

9 ECTM (Entity Centered Topic Model) [1] Heinrich, G. Parameter estimation for text analysis. arbylon.net/publications/text-est.pdf
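The slide cites Heinrich's note on parameter estimation, but the sampling equations themselves did not survive the transcript. As a rough illustration of that estimation machinery, here is a minimal collapsed Gibbs sampler for plain LDA (ECTM's sampler extends the same scheme with entity counts); the toy corpus, sizes, and hyperparameters below are assumptions for demonstration only.

```python
import numpy as np

rng = np.random.default_rng(1)
K, V, alpha, beta = 3, 20, 0.5, 0.1                                # assumed sizes/priors
docs = [rng.integers(0, V, size=30).tolist() for _ in range(10)]   # toy corpus of word ids

D = len(docs)
ndk = np.zeros((D, K))      # document-topic counts
nkw = np.zeros((K, V))      # topic-word counts
nk = np.zeros(K)            # per-topic token totals
z = []                      # topic assignment for every token

# Random initialization of topic assignments and counts.
for d, doc in enumerate(docs):
    zd = rng.integers(0, K, size=len(doc))
    z.append(zd)
    for w, t in zip(doc, zd):
        ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1

# Collapsed Gibbs sweeps: resample each token's topic from its full conditional.
for it in range(50):
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            ndk[d, t] -= 1; nkw[t, w] -= 1; nk[t] -= 1
            # p(z_i = k | rest) ∝ (ndk + alpha) * (nkw + beta) / (nk + V*beta)
            p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
            t = rng.choice(K, p=p / p.sum())
            z[d][i] = t
            ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1

# Point estimates of the topic-word and document-topic distributions.
phi = (nkw + beta) / (nk[:, None] + V * beta)
theta = (ndk + alpha) / (ndk.sum(1, keepdims=True) + K * alpha)
```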

10 Data Set Source: Sina (Chinese news portal). Dataset1: Chile Earthquake event. Dataset2: national news. Dataset3: three events of different topics, including the Qinghai Earthquake, the Two Sessions in 2013, and Tsinghua University.

Table 2. Statistics of Datasets
          Articles  Words   Entities
Dataset1  63        25,482  1,657
Dataset2  700       15,862  5,357
Dataset3  1,800     19,597  10,981

11 Experimental Setup We evaluate ECTM's performance by perplexity, taking LDA and CorrLDA2 as baselines. To further analyze the entity topics generated by different models, we measure the average entropy of entity topics and the average sKL over all pairs of entity topics. Finally, we analyze and compare the overall results of the different models.

12 Perplexity Often one tries to model an unknown probability distribution p based on a training sample drawn from p. Given a proposed probability model q, one may evaluate q by asking how well it predicts a separate test sample x_1, x_2, ..., x_N, also drawn from p. The perplexity of the model q is defined as perplexity(q) = exp(-(1/N) Σ_i log q(x_i)). Better models q of the unknown distribution p will tend to assign higher probabilities q(x_i) to the test events. Thus, if they have lower perplexity, they are less surprised by the test sample.
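The perplexity measure described on this slide takes only a few lines to compute. In this sketch, `test_probs` is a hypothetical array holding the probability q(x_i) the model assigns to each held-out event:

```python
import numpy as np

def perplexity(test_probs):
    """Perplexity of a model assigning probability q(x_i) to each held-out event:
    exp(-(1/N) * sum_i log q(x_i)). Lower means the model is less surprised."""
    logs = np.log(np.asarray(test_probs, dtype=float))
    return float(np.exp(-logs.mean()))

# Sanity check: a model spreading probability uniformly over 8 outcomes
# has perplexity 8, regardless of the test-sample size.
print(perplexity(np.full(100, 1 / 8)))  # → 8.0 (up to floating point)
```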

13 Perplexity 1. We first determine the topic number by applying LDA to the three datasets and observing the trend of perplexity over topic numbers. 2. We then test the perplexity of the topic models with the same topic number. ECTM shows the lowest perplexity on all three datasets. Perplexity grows with the number of words or entities.

14 Entity Clustering 1. Definitions of entropy and sKL (symmetrical Kullback–Leibler divergence). 2. The underlined values show that ECTM is better at entity clustering, with lower entropy and larger sKL.
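The definitions referenced on this slide were shown in a figure that did not survive the transcript; the standard forms are H(p) = -Σ p log p and sKL(p, q) = KL(p||q) + KL(q||p). A minimal sketch, assuming entity topics are discrete distributions over the entity vocabulary (the smoothing constant `eps` is an implementation assumption to keep the logs finite):

```python
import numpy as np

def entropy(p):
    """Shannon entropy H(p) = -sum_i p_i log p_i.
    Lower entropy means a more concentrated (cleaner) entity topic."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]                      # 0 * log 0 is taken as 0
    return float(-(nz * np.log(nz)).sum())

def skl(p, q, eps=1e-12):
    """Symmetrical KL divergence: KL(p||q) + KL(q||p).
    Larger sKL between entity topics means more distinct clusters."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()    # renormalize after smoothing
    return float((p * np.log(p / q)).sum() + (q * np.log(q / p)).sum())
```

Averaging `entropy` over all entity topics and `skl` over all pairs of entity topics gives the two numbers compared in the table on this slide.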

15 Results: ECTM is good at news organization. (Figure: an example entity topic and its word topic.)

16 Results: CorrLDA2 vs. ECTM (√). (Figure: example word topics and entity topics from both models.)

17 Conclusion We propose an entity-centered topic model, ECTM, which models the generation of news articles by generating entities first and then words. We give evidence that ECTM clusters entities better than CorrLDA2, as measured by average entropy and sKL. Experimental results demonstrate the effectiveness of ECTM.

18 Future Work Event clustering: cluster news articles according to specific events. Develop hierarchical entity topic models to mine correlations between topics, and between topics and entities. Develop dynamic entity topic models, taking time into consideration.
