1
Link Distribution in Wikipedia
Kwanghee Park 03/11
2
Introduction: Multilingual documents each have their own background knowledge and characteristics. To synchronize them with each other, we have to analyze the characteristics of each and let them supplement each other.
3
Introduction: Focus on Wikipedia interlanguage links and internal links.
Find the topic distribution of each article connected by interlanguage links, and analyze those distributions.
4
Process overview (diagram): documents Doc 1, Doc 2, Doc 3 in language L and Doc 1', Doc 2', Doc 3' in language L' are each modeled as topics (Topic 1, Topic 2, Topic 3, ..., Topic n), and the topics on each side are described by term sets.
5
Process: LDA modeling
Decide the number of topics for each language, because the languages have different amounts of terms: English >> Spanish ≒ French > Chinese >> Korean.
Topic assigning: terms have to be translated into a pivot language.
Translation by voting among: Wikipedia interlanguage links, Google translation, any other dictionary.
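The slides do not specify how translation by voting works. As a minimal sketch, assuming a plain majority vote over the candidates returned by each source, it could look like the following. The function name, the source names, and the tie-breaking rule (earliest listed source wins) are all hypothetical.

```python
from collections import Counter
from typing import Dict, Optional


def vote_translation(candidates: Dict[str, Optional[str]]) -> Optional[str]:
    """Return the pivot-language term proposed by the most sources.

    `candidates` maps a source name to its proposed translation, or None
    when that source has no answer. Ties go to the proposal that appears
    first in the mapping (an assumption, not stated in the slides).
    """
    votes: Counter = Counter()
    first_seen: Dict[str, int] = {}
    for position, proposal in enumerate(candidates.values()):
        if proposal is None:
            continue
        key = proposal.strip().lower()  # compare case-insensitively
        votes[key] += 1
        first_seen.setdefault(key, position)
    if not votes:
        return None
    # Most votes first; among equals, the earliest listed source wins.
    return min(votes, key=lambda k: (-votes[k], first_seen[k]))


# Hypothetical proposals for one source-language term.
proposals = {
    "interlanguage_link": "Infection",
    "machine_translation": "infection",
    "dictionary": "contagion",
}
print(vote_translation(proposals))  # infection
```

Listing the interlanguage link first makes it the tie-breaker, which may be reasonable if that source is considered the most reliable for Wikipedia article titles.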
6
Experiment
Target domains: Disease (208 documents), Settlement (1328 documents)
Tools: LingPipe LDA API
7
Experiment - topic number
Focus on the number of section headings.
Total number of section headings: 1215
Used more than 3 times: 121
Used more than 10 times: 36
Clustering was run 13 times, with topic numbers 10, 20, 30, 40, 50, 75, 100, 125, 150, 175, 200, 225, 250.
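The heading counts above can be reproduced with a simple frequency table. This is a sketch on toy data, not the author's script: the function name is hypothetical, headings are compared case-insensitively, and "used over N times" is read as strictly more than N. Both of those choices are assumptions.

```python
from collections import Counter
from typing import Iterable, List


def headings_used_more_than(headings_per_doc: Iterable[List[str]],
                            threshold: int) -> List[str]:
    """Return the section headings that occur more than `threshold` times."""
    counts = Counter(
        heading.strip().lower()
        for doc in headings_per_doc
        for heading in doc
    )
    return sorted(h for h, c in counts.items() if c > threshold)


# Toy corpus: one list of section headings per article.
docs = [
    ["Symptoms", "Causes", "Treatment"],
    ["Symptoms", "Treatment", "History"],
    ["Symptoms", "Causes", "Prevention"],
]
print(headings_used_more_than(docs, 1))  # ['causes', 'symptoms', 'treatment']
print(headings_used_more_than(docs, 2))  # ['symptoms']
```

The number of frequently reused headings gives a rough range for the topic number, which is consistent with the candidate values listed above.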
8
Experiment - topic number 30
9
Experiment - topic number 50
10
Experiment - topic number 100
11
Experiment - topic number 200
12
Experiment World_Health_Organization Census_2000
Andalucía,_Valle_del_Cauca inflammation Hashimoto's_thyroiditis Fever Hadley_cell Obesity cancer Disability-adjusted_life_year Max_Rubner Infection Surgery Centers_for_Disease_Control_and_Prevention headache Insomnia diarrhea Census_in_the_United_Kingdom Immune_system gene Adhesin
13
Experiment: Peak topic appearance
In a specific domain, the shape of the topic distribution can be predicted from this peak topic. (Figure: the peak topic is labeled in three places.)
14
Decide topic number: ignore the peak topic
Choose the topic number that gives a well-spread topic distribution once the peak topic is ignored. Well-spread means the topics are spread all over the document.
English = 100, Korean = 50, French, Chinese, Spanish = 75
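The slides do not define how spread is measured. One possible sketch, assuming spread is taken as the normalized Shannon entropy of a document's topic distribution after its largest (peak) topic is dropped, is shown below. The function names, the entropy measure, and the toy distributions are assumptions, not the method used in the experiment.

```python
import math
from typing import Dict, List


def spread_without_peak(topic_probs: List[float]) -> float:
    """Normalized entropy (0 to 1) of a distribution with its peak topic removed."""
    rest = sorted(topic_probs, reverse=True)[1:]  # drop the peak topic
    total = sum(rest)
    if total <= 0 or len(rest) < 2:
        return 0.0
    entropy = -sum((p / total) * math.log(p / total) for p in rest if p > 0)
    return entropy / math.log(len(rest))


def pick_topic_number(distributions_by_t: Dict[int, List[List[float]]]) -> int:
    """Pick the candidate topic number whose documents are most evenly spread.

    `distributions_by_t` maps a candidate topic number to the per-document
    topic distributions produced by an LDA run with that many topics.
    """
    def mean_spread(docs: List[List[float]]) -> float:
        return sum(spread_without_peak(d) for d in docs) / len(docs)

    return max(distributions_by_t, key=lambda t: mean_spread(distributions_by_t[t]))


# Toy distributions for two candidate topic numbers (hypothetical values).
candidates = {
    3: [[0.80, 0.19, 0.01]],
    4: [[0.70, 0.10, 0.10, 0.10]],
}
print(pick_topic_number(candidates))  # 4
```

Normalizing by the logarithm of the number of remaining topics keeps scores comparable across runs with different topic numbers.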
15
Future work
Term expansion: link terms in the document, excluding stop words.
Topic assigning: translation issue. Translation by voting among Wikipedia interlanguage links, Google translation, and any other dictionary. Detailed methods are not decided yet.
MEDLINE, SNOMED
16
Template recommendation
Cluster documents from several specific domains together with unclassified text.
Classify the domain of a text by analyzing its topic distribution.
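Classifying a text by its topic distribution could be sketched as a nearest-centroid comparison: average the topic distributions of the labeled documents in each domain, then assign the new text to the domain whose centroid is most similar. The slides name no classifier, so the cosine similarity, the function names, and the toy vectors below are assumptions.

```python
import math
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors: List[List[float]]) -> List[float]:
    """Component-wise mean of a list of topic distributions."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def classify_domain(topic_probs: List[float],
                    labeled: Dict[str, List[List[float]]]) -> str:
    """Return the domain whose centroid is closest to `topic_probs`."""
    centroids = {domain: centroid(vs) for domain, vs in labeled.items()}
    return max(centroids, key=lambda d: cosine(topic_probs, centroids[d]))


# Toy topic distributions over three topics (hypothetical values).
labeled = {
    "Disease": [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1]],
    "Settlement": [[0.1, 0.2, 0.7], [0.1, 0.1, 0.8]],
}
print(classify_domain([0.65, 0.25, 0.10], labeled))  # Disease
```

The same comparison applies to the combined Disease + Settlement runs, where each document's distribution comes from a single model trained on both domains.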
17
Disease + Settlement T=50
18
Examples: Starvation (domain: Disease), Trenton,_New_Jersey (domain: Settlement)
19
Disease + Settlement T=75
20
Disease + Settlement T = 100
21
Thanks