Published by Tracy Smith. Modified over 8 years ago.
1
Link Distribution on Wikipedia
[0422] KwangHee Park
2
Table of contents
Introduction
Similarity between documents
Error cases
Modifying the word bag
Conclusion
3
Introduction
Why focus on links?
When someone creates a new article in Wikipedia, they mostly just link to a source in another language or to similar and related articles. After that, the article is written by others.
Assumption: the link terms in a Wikipedia article are the key terms that represent the specific characteristics of that article.
4
Introduction
The problem we want to solve: analyze the latent distribution of a set of target documents by topic modeling.
5
Topic modeling – our approach
Target document = Wikipedia article
Terms = linked terms in the document
Modeling method: LDA
Modeling tool: LingPipe API
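To make "terms = linked terms" concrete, here is a minimal sketch of how the link terms of one article could be collected into a bag. It is not the deck's LingPipe pipeline: it assumes the article is available as raw wikitext with standard `[[Target]]` and `[[Target|label]]` links, and the names `LINK_RE` and `link_terms`, as well as the choice to skip namespaced links such as `File:`, are hypothetical.

```python
import re
from collections import Counter

# Matches [[Target]], [[Target#Section]] and [[Target|label]];
# only the link target is captured. Assumes standard wikitext syntax.
LINK_RE = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")


def link_terms(wikitext):
    """Return the link targets of an article, in order of appearance."""
    terms = []
    for match in LINK_RE.finditer(wikitext):
        target = match.group(1).strip()
        # Assumption: namespaced links (File:, Category:, ...) are not terms.
        if ":" in target:
            continue
        terms.append(target)
    return terms


sample = (
    "[[Cancer]] is a group of [[disease]]s involving "
    "[[cell growth|abnormal growth]]. See [[File:Tumor.jpg]] and [[Cancer]]."
)

# The bag of link terms is what would be handed to the topic model.
bag = Counter(link_terms(sample))
```

Because each term is already a link target, no tokenization, stopword removal, or stemming step appears in the sketch.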
6
Advantages of linked terms
No extra preprocessing is needed: boundary detection, stopword removal, word stemming.
Linked terms include more semantics: the correlation between a term and a document.
Ex) cancer as a term, cancer as a document
cancer A Cancer
7
Preliminary problem
How well do the link terms in a document represent the specific characteristics of that document?
Link evaluation: calculate the similarity between documents.
8
Link evaluation
Similarity-based evaluation
Calculate the similarity between documents: Sim_d{doc1, doc2}
Calculate the similarity between terms: Sim_t{term1, term2}
Compare the two similarities
9
Similarity between documents: Sim_d
The similarity between documents is significantly affected by the input term set.
Data set: 1536 documents
Disease domain: 208
Settlement domain: 1328
p, q = topic distributions of the two documents
Kullback-Leibler divergence
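The slide names the Kullback-Leibler divergence without showing how it is applied. A minimal sketch follows, using the standard definition D_KL(p || q) = sum_i p_i log(p_i / q_i) over two topic distributions. The `eps` floor for zero entries in q and the symmetric variant are assumptions added here for illustration; the deck does not say whether a symmetrized form was used.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) for two discrete distributions of equal length."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same number of topics")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            # Assumption: floor q_i at eps to avoid division by zero.
            total += pi * math.log(pi / max(qi, eps))
    return total


def symmetric_kl(p, q):
    """Average of both directions, so the score does not depend on order."""
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))


# Hypothetical topic distributions of two documents over three topics.
doc1 = [0.7, 0.2, 0.1]
doc2 = [0.6, 0.3, 0.1]
doc3 = [0.1, 0.1, 0.8]

close = symmetric_kl(doc1, doc2)
far = symmetric_kl(doc1, doc3)
```

A smaller divergence means the two documents have more similar topic distributions, so here `close` is smaller than `far`.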
10
Example – reasonable
11
Example – not good
12
Error analysis
Length problem: the proportion of a topic is overestimated.
If a document contains only a few link terms, the topic proportions of that document tend to be overestimated.
Ex) 1950 (year), 1960 (year), Papua New Guinea, cannibalism
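The length problem can be illustrated with the smoothed per-document estimate that is common in Gibbs-sampled LDA, theta_k = (n_k + alpha) / (N + K * alpha), where n_k is the number of the document's terms assigned to topic k. This is a sketch under that assumption; the deck does not state which estimator or which `alpha` was used, and the counts below are made up.

```python
def topic_proportions(topic_counts, alpha=0.1):
    """Smoothed topic proportions for one document.

    topic_counts[k] is the number of the document's terms assigned to
    topic k; alpha is a symmetric Dirichlet prior (assumed value).
    """
    n = sum(topic_counts)
    k = len(topic_counts)
    return [(count + alpha) / (n + k * alpha) for count in topic_counts]


# A document with only four link terms, all assigned to the same topic.
short_doc = topic_proportions([4, 0, 0])

# A document with forty link terms spread over three topics.
long_doc = topic_proportions([14, 13, 13])
```

With four terms the first topic receives about 95% of the mass, while the longer document stays near 35%, which matches the overestimation described on the slide.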
13
Error analysis
The link terms of some documents do not describe the document itself.
Ex) dates, countries, etc.
14
Demo website
For the disease domain: http://semanticweb.kaist.ac.kr/research/tmodel/
For the settlement domain: http://semanticweb.kaist.ac.kr/research/tmodel/sindex.php
For the disease + settlement domain: http://semanticweb.kaist.ac.kr/research/tmodel/dsindex.php
15
Modify word bag
Include non-link terms
Exclude noise terms
Use a weighted score for duplicated terms
Include incoming links
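The four modifications above could be combined roughly as follows. This is only a sketch: the function name `build_word_bag`, its parameters, and the log-damped weight for repeated terms are all assumptions, since the slide does not specify how the weighting or the noise list was defined.

```python
import math
from collections import Counter


def build_word_bag(link_terms, extra_terms=(), incoming_links=(),
                   noise_terms=frozenset(), dampen=True):
    """Build a weighted bag of terms for one article.

    link_terms     -- outgoing link targets of the article
    extra_terms    -- selected non-link terms from the article text
    incoming_links -- titles of articles that link to this one
    noise_terms    -- terms to drop, e.g. dates or country names
    dampen         -- if True, a term seen c times weighs 1 + log(c)
                      (hypothetical weighting), otherwise c
    """
    counts = Counter(link_terms)
    counts.update(extra_terms)
    counts.update(incoming_links)

    bag = {}
    for term, count in counts.items():
        if term in noise_terms:
            continue
        bag[term] = 1.0 + math.log(count) if dampen else float(count)
    return bag


bag = build_word_bag(
    ["Cancer", "Cancer", "1950", "Tumor"],
    extra_terms=["chemotherapy"],
    incoming_links=["Oncology"],
    noise_terms={"1950"},
)
```

Dropping terms such as dates addresses the error case where link terms do not describe the document, and adding non-link terms and incoming links lengthens very short bags.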
16
Conclusion
Topic modeling with the link distribution in Wikipedia
We need to measure how well the link distribution represents each article's characteristics.
After that, we will analyze the topic distribution in a variety of ways.
We expect that the topic distribution can be applied in many applications.
17
Thank you