Link Distribution in Wikipedia


1 Link Distribution in Wikipedia
Kwanghee Park 03/11

2 Introduction
Multilingual documents each have their own background knowledge and characteristics.
To synchronize them with each other, we have to analyze the characteristics of each and let them supplement each other.

3 Introduction
Focus on Wikipedia interlanguage links and internal links.
Find the topic distribution of each article connected by interlanguage links, and analyze those distributions.

4 Process overview
[Diagram: for language L, documents Doc 1, Doc 2, Doc 3 with Topic 1 ... Topic n and their term sets; for language L', documents Doc 1', Doc 2', Doc 3' with Topic 1 ... Topic n and their term sets.]

5 Process: LDA modeling
Decide the number of topics in each language
Because the languages have different amounts of terms: English >> Spanish ≈ French > Chinese >> Korean
Topic assigning
Terms have to be translated into a pivot language
Translation by voting: Wikipedia interlanguage links, Google translation, any other dictionary
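The slide names "translation by voting" without giving details, so the following is only a minimal sketch of one way such a vote could work. The function name `vote_translation`, the source names, and the tie-breaking rule (first source listed wins) are all assumptions made for illustration, not the author's method.

```python
from collections import Counter


def vote_translation(candidates):
    """Return the pivot-language term proposed by the most sources.

    `candidates` maps a source name (hypothetical labels) to the
    translation it proposes, or None when the source has no entry.
    Proposals are compared case-insensitively. On a tie, the proposal
    from the earliest-listed source wins (an assumed rule).
    """
    normalized = [t.strip().lower() for t in candidates.values() if t]
    if not normalized:
        return None
    votes = Counter(normalized)
    best = max(votes.values())
    for term in normalized:  # preserves the order of the sources
        if votes[term] == best:
            return term


# Hypothetical proposals for one source-language term.
proposals = {
    "interlanguage_link": "Fever",
    "machine_translation": "fever",
    "dictionary": "pyrexia",
}
print(vote_translation(proposals))
```

Listing the most trusted source first makes the tie-break favor it, which is one simple way to prefer interlanguage links over the other sources.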

6 Experiment
Target domain
Disease: 208 documents
Settlement: 1328 documents
Tools
LingPipe LDA API

7 Experiment - topic number
Focus on the number of section headings
Total number of section headings: 1215
Used over 3 times: 121
Used over 10 times: 36
Clustering run 13 times, with topic numbers 10, 20, 30, 40, 50, 75, 100, 125, 150, 175, 200, 225, 250
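The heading statistics above can be reproduced with a simple frequency count. This is a sketch under assumptions: the function name `heading_usage` is hypothetical, "total" is taken to mean the number of distinct headings, "over N times" is read as strictly more than N, and the toy input stands in for the real article data, which is not part of the slides.

```python
from collections import Counter

# The 13 topic numbers listed on the slide.
TOPIC_NUMBERS = [10, 20, 30, 40, 50, 75, 100, 125, 150, 175, 200, 225, 250]


def heading_usage(articles, thresholds=(3, 10)):
    """Count distinct section headings and how many exceed each threshold.

    `articles` is a list of heading lists, one list per article.
    Headings are compared case-insensitively (an assumption).
    """
    counts = Counter(
        heading.strip().lower() for headings in articles for heading in headings
    )
    summary = {"total": len(counts)}
    for t in thresholds:
        summary[f"over_{t}"] = sum(1 for c in counts.values() if c > t)
    return summary


# Toy data: two headings used four times each, one used once.
toy_articles = [["Symptoms", "Treatment"]] * 4 + [["History"]]
print(heading_usage(toy_articles))
```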

8 Experiment - topic number 30

9 Experiment - topic number 50

10 Experiment - topic number 100

11 Experiment - topic number 200

12 Experiment
World_Health_Organization; Census_2000; Andalucía,_Valle_del_Cauca; inflammation; Hashimoto's_thyroiditis; Fever; Hadley_cell; Obesity; cancer; Disability-adjusted_life_year; Max_Rubner; Infection; Surgery; Centers_for_Disease_Control_and_Prevention; headache; Insomnia; diarrhea; Census_in_the_United_Kingdom; Immune_system; gene; Adhesin

13 Experiment
Peak topic appearance
Experiment on a specific domain
Able to predict the shape of the topic distribution based on this peak topic
[Figure annotations: "Peak topic" marked three times]

14 Decide topic number
Ignore the peak topic
Choose a well-spread topic distribution, ignoring the peak topic
A good distribution means the topics are spread all over the documents
English = 100
Korean = 50
French, Chinese, Spanish = 75
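The slide does not say how "well spread" was judged, so the sketch below substitutes one plausible measure and should be read as an assumption: it treats the topic with the highest mean weight as the peak topic, drops it, and scores the remaining topics by normalized entropy (1.0 means perfectly even). The function name `spread_without_peak` and the toy matrices are hypothetical.

```python
import math


def spread_without_peak(doc_topic):
    """Return (peak_topic_index, spread score in [0, 1]).

    `doc_topic` is a list of per-document topic weight lists, all of
    the same length. The peak topic is the one with the highest mean
    weight; the score is the normalized entropy of the mean weights
    of the remaining topics (an assumed criterion).
    """
    n_topics = len(doc_topic[0])
    mean = [
        sum(row[k] for row in doc_topic) / len(doc_topic) for k in range(n_topics)
    ]
    peak = max(range(n_topics), key=mean.__getitem__)
    rest = [m for k, m in enumerate(mean) if k != peak]
    total = sum(rest)
    if total == 0 or len(rest) < 2:
        return peak, 0.0
    probs = [m / total for m in rest]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return peak, entropy / math.log(len(rest))


# Toy document-topic matrices: one even outside the peak, one skewed.
even = [[0.7, 0.1, 0.1, 0.1]]
skewed = [[0.5, 0.48, 0.01, 0.01]]
print(spread_without_peak(even), spread_without_peak(skewed))
```

Under this measure, one would fit the model once per candidate topic number and keep the number whose score is highest for that language.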

15 Future work
Term expansion
Link terms in the document, excluding stop words
Topic assigning
Translation issue
Translation by voting: Wikipedia interlanguage links, Google translation, any other dictionary
Detailed methods not decided yet
Medline, SNOMED

16 Template recommendation
Cluster documents from several specific domains together with unclassified text
Classify the domain by analyzing the topic distribution of the text
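Classifying a domain from a topic distribution could be sketched as a nearest-centroid comparison. This is an illustrative assumption rather than the method used in the slides: the helper names, the use of cosine similarity, and the toy three-topic vectors are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def centroid(rows):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def classify_domain(doc_topics, labeled):
    """Return the domain whose mean topic distribution is most similar.

    `labeled` maps a domain name to the topic distributions of its
    known documents; `doc_topics` is the unclassified text's distribution.
    """
    centroids = {domain: centroid(rows) for domain, rows in labeled.items()}
    return max(centroids, key=lambda d: cosine(doc_topics, centroids[d]))


# Toy topic distributions for the two domains named in the slides.
labeled = {
    "Disease": [[0.8, 0.1, 0.1], [0.7, 0.2, 0.1]],
    "Settlement": [[0.1, 0.8, 0.1], [0.2, 0.7, 0.1]],
}
print(classify_domain([0.6, 0.3, 0.1], labeled))
```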

17 Disease + Settlement T=50

18 Starvation, Trenton,_New_Jersey
Starvation → Disease
Trenton,_New_Jersey → Settlement

19 Disease + Settlement T=75

20 Disease + Settlement T = 100

21 Thanks
