1
Link Distribution on Wikipedia [0407] KwangHee Park
2
Table of contents: Introduction, Topic modeling, Preliminary problem, Conclusion
3
Introduction. Why focus on links? When someone creates a new article on Wikipedia, they usually just link to the corresponding article in another language, or to similar and related articles; the article is then written out by other editors. Assumption: the link terms in a Wikipedia article are key terms that represent the specific characteristics of the article.
4
Introduction. The problem we want to solve is to analyze the latent topic distribution of a set of target documents by topic modeling.
5
Topic modeling. Topic: topics are latent concepts buried in the textual artifacts of a community, described by a collection of many terms that co-occur frequently in context (Laura Dietz, Avaré Stewart 2006, 'Utilize Probabilistic Topic Models to Enrich Knowledge Bases'). T = {W_1, ..., W_n}
6
Topic modeling. Bag-of-words assumption: the bag-of-words model is a simplifying assumption used in natural language processing and information retrieval, in which a text (such as a sentence or a document) is represented as an unordered collection of words, disregarding grammar and even word order (from Wikipedia). Each document in the corpus is represented by a vector of integers {f_1, f_2, ..., f_|W|}, where f_i is the frequency of the i-th word and |W| is the number of words in the vocabulary; a sketch follows.
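To make the representation concrete, here is a minimal bag-of-words sketch in Python; the toy documents are invented for illustration only:

    from collections import Counter

    # Toy corpus; in the slides' setting each document would be a Wikipedia article.
    docs = [
        "cancer is a disease and cancer research is active",
        "wikipedia is a free encyclopedia",
    ]

    # Build the vocabulary W over the whole corpus.
    vocab = sorted({w for d in docs for w in d.split()})

    # Represent each document as the integer vector {f_1, ..., f_|W|},
    # where f_i is the frequency of the i-th vocabulary word.
    def bow_vector(doc):
        counts = Counter(doc.split())
        return [counts[w] for w in vocab]

    for d in docs:
        print(bow_vector(d))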
7
Topic modeling. Instead of directly associating documents with words, associate each document with some topics and each topic with some significant words: Document = {T_n, T_k, ..., T_m}, e.g. Doc_1 → {T_n: 0.4, T_k: 0.3, ...}
8
Topic modeling. Based on the idea that documents are mixtures of topics, the model connects documents to topics and topics to terms.
9
Topic modeling. LSA performs dimensionality reduction using the singular value decomposition: the transformed word–document co-occurrence matrix X is factorized into three smaller matrices U, D, and V, where U provides an orthonormal basis for a spatial representation of words, D weights those dimensions, and V provides an orthonormal basis for a spatial representation of documents. A sketch follows.
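As an illustration of this factorization (not from the slides), a minimal sketch using numpy's SVD on a made-up word–document matrix:

    import numpy as np

    # Toy word-document co-occurrence matrix X: rows are words, columns are documents.
    X = np.array([
        [2.0, 0.0, 1.0, 0.0],
        [1.0, 1.0, 0.0, 0.0],
        [0.0, 2.0, 0.0, 1.0],
        [0.0, 0.0, 1.0, 2.0],
    ])

    # X = U D V^T: U is an orthonormal basis for words, D weights the
    # latent dimensions, V is an orthonormal basis for documents.
    U, d, Vt = np.linalg.svd(X, full_matrices=False)

    # Dimensionality reduction: keep only the k strongest latent dimensions.
    k = 2
    X_k = U[:, :k] @ np.diag(d[:k]) @ Vt[:k, :]
    print(np.round(X_k, 2))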
10
Topic modeling. pLSA: the observed word distributions are modeled by mixing word distributions per topic with topic distributions per document, i.e. P(w|d) = Σ_z P(w|z) P(z|d).
11
Topic modeling. LDA (Latent Dirichlet Allocation): the number of parameters to be estimated in pLSA grows with the size of the training set, and this is where LDA has an advantage. Alpha and beta are corpus-level parameters that are sampled once per corpus in the generative model (they sit outside the plates). [Slide shows the plate diagrams of pLSA and LDA.]
12
Topic modeling – our approach. Target documents: Wikipedia articles. Terms: the linked terms in each document. Modeling method: LDA. Modeling tool: the LingPipe API. A sketch of this pipeline follows.
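A minimal sketch of this pipeline, with two caveats: it uses the gensim library rather than the LingPipe API named on the slide, and the per-article linked-term lists are invented:

    from gensim import corpora, models

    # Each "document" is the list of linked terms from one Wikipedia article (toy data).
    linked_terms = [
        ["cancer", "tumor", "oncology", "chemotherapy"],
        ["encyclopedia", "wiki", "hyperlink", "article"],
        ["tumor", "chemotherapy", "radiation", "oncology"],
    ]

    dictionary = corpora.Dictionary(linked_terms)
    corpus = [dictionary.doc2bow(terms) for terms in linked_terms]

    # Fit LDA; num_topics and the priors (the slides' alpha and beta) are
    # assumptions chosen for this toy example.
    lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2,
                          alpha="auto", random_state=0)

    # Per-document topic distribution, e.g. {T_n: 0.4, T_k: 0.3, ...}.
    for bow in corpus:
        print(lda.get_document_topics(bow))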
13
Advantages of linked terms. No extra preprocessing is needed: no boundary detection, no stopword removal, no word stemming. They also carry more semantics, via the correlation between a term and its document, e.g. 'cancer' as a term vs. 'Cancer' as a document (the linked Wikipedia article).
14
Preliminary problem. How well do the link terms in a document represent the specific characteristics of that document? Link evaluation: calculate the similarity between documents.
15
Link evaluation. Similarity-based evaluation: calculate the similarity between terms, Sim_t(term1, term2); calculate the similarity between documents, Sim_d(doc1, doc2); then compare the two similarities.
16
Link evaluation. Sim_t, the similarity between terms, is not affected by the input term set; Sim_d, the similarity between documents, is significantly affected by the input term set. In the slide's similarity formula (Lin 1991), p and q are the topic distributions of the two documents being compared; a sketch follows.
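The slide cites Lin (1991), which introduces the Jensen–Shannon divergence; assuming Sim_d is derived from it, a minimal sketch comparing two topic distributions p and q:

    import numpy as np

    def js_divergence(p, q):
        # Jensen-Shannon divergence between two topic distributions (Lin 1991).
        p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
        m = 0.5 * (p + q)
        kl = lambda x: np.sum(np.where(x > 0, x * np.log2(x / m), 0.0))
        return 0.5 * kl(p) + 0.5 * kl(q)

    # Toy per-document topic distributions; with log base 2 the divergence
    # lies in [0, 1], so 1 - JSD can serve as a similarity score.
    p = [0.4, 0.3, 0.2, 0.1]
    q = [0.1, 0.2, 0.3, 0.4]
    print(1.0 - js_divergence(p, q))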
17
Link evaluation. Compare the top-10 most similar items for each link, e.g. for link A: the list of terms most similar to A as a term, and the list of documents most similar to A as a document. Compare the two lists by counting overlaps (sketched below). The experiment is currently in progress.
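A minimal sketch of the overlap count described above, with hypothetical top-10 lists:

    # Hypothetical top-10 neighbours of the link "cancer", ranked once by
    # Sim_t (as a term) and once by Sim_d (as a document).
    top_as_term = ["tumor", "oncology", "chemotherapy", "radiation", "biopsy",
                   "melanoma", "leukemia", "carcinoma", "lymphoma", "metastasis"]
    top_as_doc = ["tumor", "oncology", "surgery", "radiation", "biopsy",
                  "pathology", "leukemia", "carcinoma", "genetics", "anatomy"]

    # Evaluation score: the number of overlaps between the two lists.
    overlap = len(set(top_as_term) & set(top_as_doc))
    print(overlap)  # 6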
18
Conclusion. Topic modeling with the link distribution in Wikipedia: we need to measure how well the link distribution represents each article's characteristics. After that, we will analyze the topic distribution in a variety of ways; we expect the topic distribution can be applied in many applications.
19
Thank you