Incorporating Hierarchical Dirichlet Process into Tag-Topic Model 张明 2013.4

Agenda
• Introduction
• Tag-Topic Model
• Tag Hierarchical Dirichlet Process
• Experiments and evaluation
• Conclusion

Introduction
• With the rapid development of Web 2.0, the internet has brought a large amount of resources, such as blogs, Twitter, and encyclopedias.
• These resources contain a wealth of information that can be applied to a variety of fields in information processing to improve service quality, but relying on human experts to process this information is too inefficient.

Introduction
• In NLP, computer programs face several tasks that require human-level intelligence; that is, the programs should be endowed with the ability to understand language.
• One core issue is how to automatically obtain knowledge and use it effectively to achieve semantic analysis and computation.

Introduction
• Tagging has recently emerged as a popular way to organize user-generated content for Web 2.0 applications, such as blogs and bookmarks. In blogs, users can assign one or more tags to each blog. Usually, these tags reflect the subjects that the content is concerned with. Tags can be seen as labeled meta-information about the content, and they are beneficial for knowledge mining from blogs.

Introduction
• In this paper, we extend the Tag-Topic Model (TTM) [1] by using the Hierarchical Dirichlet Process (HDP) as its prior distribution. We assume that, before writing a blog, an author has a clear idea of which aspects the content will cover, and that for each aspect he will choose a tag to describe it.

LDA Generative model

Tag-Topic Model
• Basic idea: each document is associated with a mixture of tags, each tag can be viewed as a multinomial distribution over topics, and each topic is associated with a multinomial distribution over words.

Tag-Topic Model

THDP
• The THDP topic model draws upon the strengths of the two models (TTM and HDP), using the topic-based representation to model both the content of documents and the tags. In the THDP model, a group of tags, T_d, indicates the main purpose of the blog. For each word in the document, a tag is chosen uniformly at random from T_d. Then, as in the topic model, a topic is chosen from a distribution over topics specific to that tag, and the word is generated from the chosen topic.
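The three sampling steps described on this slide can be sketched as follows. This is a minimal illustration with a fixed, finite set of topics, not the authors' implementation; the function name `generate_document`, the toy tags, and the toy probabilities are all assumptions invented for the example.

```python
import random

def generate_document(doc_tags, tag_topic, topic_word, n_words, rng):
    """Sample (tag, topic, word) triples for one document.

    doc_tags   : list of tags attached to the document (T_d)
    tag_topic  : dict mapping tag -> list of topic probabilities
    topic_word : list with one dict (word -> probability) per topic
    """
    triples = []
    for _ in range(n_words):
        # Step 1: choose a tag uniformly at random from the document's tags.
        tag = rng.choice(doc_tags)
        # Step 2: choose a topic from that tag's distribution over topics.
        topic = rng.choices(range(len(topic_word)), weights=tag_topic[tag])[0]
        # Step 3: generate the word from the chosen topic's word distribution.
        words = list(topic_word[topic])
        weights = [topic_word[topic][w] for w in words]
        word = rng.choices(words, weights=weights)[0]
        triples.append((tag, topic, word))
    return triples

# Toy parameters, invented purely for illustration.
TAG_TOPIC = {"travel": [0.9, 0.1], "food": [0.2, 0.8]}
TOPIC_WORD = [
    {"flight": 0.5, "hotel": 0.5},
    {"noodle": 0.6, "tea": 0.4},
]

doc = generate_document(["travel", "food"], TAG_TOPIC, TOPIC_WORD, 20, random.Random(0))
```

In THDP the number of topics is not fixed in advance; the finite `TOPIC_WORD` list here only stands in for the topics that happen to be in use.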

• Given an underlying measure H on multinomial probability vectors, we select a random measure G_0, which provides a countably infinite collection of multinomial probability vectors; these can be viewed as the set of all topics that can be used in a given corpus. For the l-th tag in the j-th document in the corpus, we sample G_j using G_0 as a base measure; this selects the specific subset of topics to be used in tag l in document j. From G_j we then generate a document by repeatedly (1) choosing a tag with equal probability from the tag set associated with the document and (2) sampling a specific multinomial probability vector z_ji from G_j and sampling a word w_ji with probabilities z_ji. The overlap among the random measures G_j implements the sharing of topics among documents.
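One common way to make the G_0 / G_j hierarchy concrete is a truncated stick-breaking approximation. The sketch below is a hedged illustration only: the slides do not name the concentration parameters or the inference scheme, so `gamma`, `alpha`, the truncation level, and the group names `tag_a` and `tag_b` are assumptions. Only the topic weights are drawn; the topics themselves (draws from H) are omitted.

```python
import random

def stick_breaking(gamma, truncation, rng):
    """Global topic weights for G_0, truncated to `truncation` topics."""
    weights, remaining = [], 1.0
    for _ in range(truncation - 1):
        v = rng.betavariate(1.0, gamma)  # fraction broken off the remaining stick
        weights.append(remaining * v)
        remaining *= 1.0 - v
    weights.append(remaining)  # the last topic absorbs the leftover mass
    return weights

def group_weights(beta, alpha, rng):
    """Weights of one G_j: a Dirichlet(alpha * beta) draw over the shared topics."""
    draws = [rng.gammavariate(alpha * b, 1.0) for b in beta]
    total = sum(draws)
    return [d / total for d in draws]

rng = random.Random(0)
beta = stick_breaking(gamma=1.0, truncation=10, rng=rng)
pi_by_group = {name: group_weights(beta, alpha=5.0, rng=rng) for name in ["tag_a", "tag_b"]}
```

Every group reweights the same list of topics, which is how topics end up shared across groups in this approximation.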

Experiments and evaluation
• Dataset: The dataset used in the experiment comes from the blog corpus collected between October 2011 and December 2012, which was constructed by the National Language Resources Monitoring and Research Center, Network Media Branch. We filtered out blog texts that had no tags or contained fewer than 100 words, and applied some preprocessing: removing stop words and extremely common words, filtering out non-nominal words, and retaining only nouns and nominal phrases. The resulting dataset contains the tags and content of N = 927 blogs, with W = 10438 words in the vocabulary and T = 558 tags.
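The filtering rules above might look roughly like this in code. It is a sketch under assumptions: the record layout (`tags`, `words`), the function names, and applying the 100-word threshold before stop-word removal are all choices made for the example. The part-of-speech filtering is left out because it needs a tagger that the slides do not specify.

```python
def filter_corpus(posts, stop_words, min_words=100):
    """Drop untagged or short posts, then remove stop words from the rest."""
    kept = []
    for post in posts:
        # Skip posts with no tags or with fewer than `min_words` words.
        if not post["tags"] or len(post["words"]) < min_words:
            continue
        words = [w for w in post["words"] if w not in stop_words]
        kept.append({"tags": list(post["tags"]), "words": words})
    return kept

def corpus_stats(posts):
    """Return (N, W, T): number of posts, vocabulary size, number of distinct tags."""
    vocab = {w for p in posts for w in p["words"]}
    tags = {t for p in posts for t in p["tags"]}
    return len(posts), len(vocab), len(tags)
```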

Experiments and evaluation
• The perplexities of TTM and THDP for different topic numbers
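The slides do not say how perplexity was computed. As a hedged sketch, the standard definition (the exponential of the negative average log-likelihood per word) can be combined with the tag-mixture word probability implied by the generative process. The function names and toy parameters below are assumptions for illustration.

```python
import math

def word_prob(word, doc_tags, tag_topic, topic_word):
    """p(word | doc): average over the doc's tags of sum_z p(z | tag) * p(word | z)."""
    total = 0.0
    for tag in doc_tags:
        for z, p_z in enumerate(tag_topic[tag]):
            total += p_z * topic_word[z].get(word, 0.0)
    return total / len(doc_tags)

def perplexity(docs, tag_topic, topic_word):
    """exp(-average log-likelihood per word) over (tags, words) pairs."""
    log_lik, n_words = 0.0, 0
    for doc_tags, words in docs:
        for w in words:
            log_lik += math.log(word_prob(w, doc_tags, tag_topic, topic_word))
            n_words += 1
    return math.exp(-log_lik / n_words)

# Toy check: when every topic is uniform over 4 words, perplexity is 4.
UNIFORM = {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}
ppl = perplexity(
    [(["t1", "t2"], ["a", "b", "c"])],
    {"t1": [0.7, 0.3], "t2": [0.5, 0.5]},
    [UNIFORM, UNIFORM],
)
```

Lower perplexity means the model assigns higher probability to the held-out words. A real evaluation would also need smoothing so that unseen words do not get probability zero.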

Experiments and evaluation
• The number of topics at different iterations of THDP

Experiments and evaluation
• An illustration of 8 topics from the 114-topic solution for the dataset. Each topic is shown with the 10 words and 5 tags that have the highest probability conditioned on that topic.
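Ranking words and tags for a topic could be done along these lines. This is only a sketch: the function names and toy parameters are hypothetical, and the tag ranking assumes a uniform prior over tags, so that p(tag | topic) is proportional to p(topic | tag). The slides do not state how the tag probabilities were obtained.

```python
def top_words(word_probs, k):
    """The k words with the highest p(word | topic)."""
    return sorted(word_probs, key=word_probs.get, reverse=True)[:k]

def top_tags(tag_topic, topic, k):
    """The k tags with the highest p(tag | topic), assuming a uniform prior over tags."""
    scores = {tag: probs[topic] for tag, probs in tag_topic.items()}
    total = sum(scores.values())
    posterior = {tag: s / total for tag, s in scores.items()}
    return sorted(posterior, key=posterior.get, reverse=True)[:k]

# Toy parameters, invented purely for illustration.
words_for_topic = top_words({"flight": 0.5, "hotel": 0.3, "ticket": 0.2}, k=2)
tags_for_topic = top_tags({"travel": [0.9, 0.1], "food": [0.2, 0.8]}, topic=1, k=1)
```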

Conclusion
• In this paper, we propose the THDP model. The model uses the HDP as the prior distribution of TTM, which lets it infer the number of topics in the dataset automatically, link the tags to the topics of the document, and capture the semantics of a tag in the form of a topic distribution. Example results on the dataset demonstrate the consistent and promising performance of the proposed THDP. The computational expense of the proposed model is comparable to that of related topic models.

Thank you
