Link Distribution in Wikipedia [0324] KwangHee Park
Table of contents
  Introduction
  Cluster using LDA
  Experiment: Disease, Settlement
  Demo
  Considering Application
Introduction
  Why focus on links
    When someone creates a new article in Wikipedia, they mostly just link to a source in another language or to similar and related articles.
    After that, the article is written by others.
  Assumption
    The link terms in a Wikipedia article are the key terms that represent the specific characteristics of that article.
Introduction
  Problem: what we want to solve
    Analyze the latent distribution of a set of target documents by clustering the link term set.
    Find the tendency of the latent distribution of a specific domain by limiting the input documents to that domain.
Process
  Terminology
    Term set = all terms in the input documents
    Topic = set of terms {W_i, …, W_n}
    Document = set of terms {W_k, W_l, …, W_n}
    Document = set of parts of topics {T_n, T_k, …, T_m}
      e.g. {Doc : 1} {T_n : 0.4, T_k : 0.3, …}
  Cluster the term set
  Find the latent distribution of each document
  Group by domain
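The terminology above can be pictured with a small Python sketch. Everything in it is illustrative: the document names, link terms, and the two hand-made topics `T1` and `T2` are hypothetical, and the mixture is computed by simple membership counting, which only stands in for the distribution a real topic model would learn.

```python
# Illustrative data only: document names, link terms and topics are hypothetical.
documents = {
    "Doc 1": ["Fever", "Infection", "Virus", "Fever"],
    "Doc 2": ["River", "County", "Population"],
    "Doc 3": ["Fever", "River", "Virus", "County", "Infection"],
}

# Term set = all terms in the input documents.
term_set = sorted({term for terms in documents.values() for term in terms})

# Topic = set of terms. These two topics are hand-made stand-ins for learned clusters.
topics = {
    "T1": {"Fever", "Infection", "Virus"},
    "T2": {"River", "County", "Population"},
}

def topic_mixture(terms, topics):
    """Share of a document's link terms that fall into each topic."""
    counts = {
        name: sum(term in members for term in terms)
        for name, members in topics.items()
    }
    total = sum(counts.values())
    return {name: (count / total if total else 0.0) for name, count in counts.items()}

# Document = set of parts of topics, e.g. {"T1": 0.6, "T2": 0.4} for "Doc 3".
mixtures = {name: topic_mixture(terms, topics) for name, terms in documents.items()}

for name, mixture in mixtures.items():
    print(name, mixture)
```

Each document ends up as a mapping from topic to weight, which is the `{T_n : 0.4, T_k : 0.3, …}` shape used on the slide.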
LDA
  The clustering technique
    The LDA model consists of a fixed number of topics.
    Each topic is modeled as a distribution over words.
    A document under LDA is modeled as a distribution over topics.
  [Figure: Term Set; Topic 1, Topic 2, Topic 3, …, Topic n; Doc 1, Doc 2, Doc 3]
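The three statements above can be made concrete with a toy LDA fit. The slides do not say which implementation or inference method was used, so the following is only a minimal sketch that assumes collapsed Gibbs sampling, written in plain Python. The function name `lda_gibbs`, the hyperparameters `alpha` and `beta`, and the example link terms are all hypothetical.

```python
import random

def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA (an assumed method, not the slides').

    docs: list of documents, each a list of link terms.
    Returns (vocab, doc_topic, topic_term), where doc_topic[d] is the topic
    distribution of document d and topic_term[k] is the term distribution of topic k.
    """
    rng = random.Random(seed)
    vocab = sorted({term for doc in docs for term in doc})
    term_id = {term: i for i, term in enumerate(vocab)}
    V, K = len(vocab), num_topics

    doc_topic_counts = [[0] * K for _ in docs]
    topic_term_counts = [[0] * V for _ in range(K)]
    topic_totals = [0] * K
    assignments = []

    # Give every term occurrence a random initial topic.
    for d, doc in enumerate(docs):
        doc_assign = []
        for term in doc:
            k = rng.randrange(K)
            doc_assign.append(k)
            doc_topic_counts[d][k] += 1
            topic_term_counts[k][term_id[term]] += 1
            topic_totals[k] += 1
        assignments.append(doc_assign)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, term in enumerate(doc):
                w = term_id[term]
                k = assignments[d][i]
                # Remove the current assignment from the counts.
                doc_topic_counts[d][k] -= 1
                topic_term_counts[k][w] -= 1
                topic_totals[k] -= 1
                # Unnormalized conditional probability of each topic.
                weights = [
                    (doc_topic_counts[d][t] + alpha)
                    * (topic_term_counts[t][w] + beta)
                    / (topic_totals[t] + V * beta)
                    for t in range(K)
                ]
                k = rng.choices(range(K), weights=weights)[0]
                assignments[d][i] = k
                doc_topic_counts[d][k] += 1
                topic_term_counts[k][w] += 1
                topic_totals[k] += 1

    # Smoothed estimates: each row is a probability distribution.
    doc_topic = [
        [(c + alpha) / (len(doc) + K * alpha) for c in counts]
        for doc, counts in zip(docs, doc_topic_counts)
    ]
    topic_term = [
        [(c + beta) / (topic_totals[k] + V * beta) for c in topic_term_counts[k]]
        for k in range(K)
    ]
    return vocab, doc_topic, topic_term

# Hypothetical link terms for four tiny documents.
docs = [
    ["Fever", "Virus", "Infection", "Fever"],
    ["Virus", "Infection", "Vaccine"],
    ["River", "County", "Population"],
    ["County", "Population", "River", "Census"],
]
vocab, doc_topic, topic_term = lda_gibbs(docs, num_topics=2)
for row in doc_topic:
    print([round(p, 2) for p in row])
```

The number of topics is fixed up front through `num_topics`, each row of `topic_term` is a distribution over terms, and each row of `doc_topic` is a distribution over topics, matching the three points on the slide.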
Experiment
  Domain: Disease
    #Doc: 208
    #Link terms: English: 46615, Spanish: 34560, French: 31747, Chinese: 9286, Korean: 3272
  Domain: Settlement
    #Doc: 1328
    #Link terms: English: , Spanish: , French: 150921, Chinese: 93227, Korean:
  Number of topics: 10, 20, 30, 40, 50, 75, 100, 125, 150, 175, 200, 225, 250
  Demo site
Considering Application
  Document classification
    Classify the domain of a target document by calculating the similarity between document topic distributions.
    Usage: template recommendation, …
  Domain characteristic
    [Figure: # of appearances / # of total docs, by topic number, for Disease and Settlement]
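The classification idea above can be sketched as follows. Several choices here are assumptions rather than details from the slides: a topic is taken to "appear" in a document when its weight reaches a hypothetical `threshold`, similarity is measured with cosine similarity, and the topic distributions are invented toy values over three topics.

```python
import math

def domain_characteristic(doc_topic_rows, threshold=0.1):
    """Per topic: (# of docs where the topic appears) / (# of total docs).

    'Appears' is assumed to mean weight >= threshold; the slides do not define it.
    """
    total = len(doc_topic_rows)
    num_topics = len(doc_topic_rows[0])
    return [
        sum(row[k] >= threshold for row in doc_topic_rows) / total
        for k in range(num_topics)
    ]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def classify(target, profiles):
    """Return the domain whose profile is most similar to the target distribution."""
    return max(profiles, key=lambda domain: cosine(target, profiles[domain]))

# Hypothetical topic distributions (3 topics) for documents of each domain.
disease_docs = [[0.70, 0.25, 0.05], [0.60, 0.35, 0.05], [0.80, 0.05, 0.15]]
settlement_docs = [[0.05, 0.15, 0.80], [0.15, 0.25, 0.60], [0.02, 0.08, 0.90]]

profiles = {
    "Disease": domain_characteristic(disease_docs),
    "Settlement": domain_characteristic(settlement_docs),
}

target = [0.65, 0.25, 0.10]  # topic distribution of a new, unlabeled document
predicted = classify(target, profiles)
print(predicted)
```

The predicted domain could then drive a template recommendation, for example suggesting the Disease template for a document classified as Disease.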
Template recommendation
  Starvation: Disease
  Trenton,_New_Jersey: Settlement
Thanks