Published by Bertram Summers. Modified over 7 years ago.
1
Enhancing Wikipedia Search Results Using Text Mining
University of Ruhuna, Faculty of Science, Department of Computer Science, Sri Lanka
2
About this Research
This is my undergraduate research project for the Bachelor of Computer Science (Special) Degree programme.
Supervisors:
Mr. S.A.S Lorensuhewa, Senior Lecturer, Department of Computer Science, Faculty of Science, University of Ruhuna
Ms. M.A.L Kalyani, Lecturer
3
Problem Definition
Wikipedia is an online encyclopedia popular among web users. It has millions of articles on different subjects, and some of these articles are available in several languages. The Wikipedia search result page lists the Wikipedia articles related to a keyword entered by the user.
4
Problem Definition
Wikipedia search result page:
Problem: no content-based grouping of search results.
5
Problem Definition
The search result page presents a long list of links.
There is no way to categorize the search results based on their content. Articles with similar content are not even placed in adjacent positions on the search result page.
6
Proposed Solution
A search result clustering methodology.
Group the links returned by the Wikipedia search page for a particular keyword, based on the contents of the HTML documents those links point to. Label each group meaningfully.
7
Proposed Solution
[Diagram: the search result links (Link 1, Link 2, Link 3, ...) grouped under labeled topics (Topic 1, Topic 2, Topic 3).]
8
Proposed Solution: Potential Advantages
Finding the desired article among the search results becomes easier.
The different usages of a given keyword can be viewed very quickly.
Because Wikipedia is an encyclopedia, this solution makes it easier to use Wikipedia for this kind of analysis.
9
Methodology
To arrive at this solution, the research was organized around the following four research questions:
What is the best clustering algorithm for Wikipedia document clustering?
What is the optimum amount of text to extract from a Wikipedia article?
How can the optimum number of clusters for a given keyword be determined, so that the grouping is better?
How can the resulting clusters/groups be labeled meaningfully?
10
Methodology
This solution was derived empirically.
The process involved four experiments. The first 100 documents returned by the search result page for each of the following keywords were analyzed:
Latex
Nazi
Jaguar
Flipper
11
Methodology
The text analyzed is the textual content under the div tag mw-content-text.
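The slide does not show how the text was extracted, so the following is only a minimal sketch using Python's standard `html.parser`. It assumes that mw-content-text is the `id` attribute of the div, and the names `ContentTextExtractor` and `extract_content_text` are hypothetical. It also keeps any script or style text found inside the div, which a real pipeline would probably filter out.

```python
from html.parser import HTMLParser


class ContentTextExtractor(HTMLParser):
    """Collect text nested inside <div id="mw-content-text"> (assumed id)."""

    def __init__(self):
        super().__init__()
        self.div_depth = 0   # 0 means we are outside the target div
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.div_depth > 0:
            self.div_depth += 1          # a nested div inside the target
        elif dict(attrs).get("id") == "mw-content-text":
            self.div_depth = 1           # entering the target div

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth > 0:
            self.div_depth -= 1          # leaving a div inside (or the) target

    def handle_data(self, data):
        if self.div_depth > 0 and data.strip():
            self.chunks.append(data.strip())


def extract_content_text(html):
    """Return the text found under the target div, joined by spaces."""
    parser = ContentTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Made-up page fragment, for illustration only.
sample = (
    '<html><body><div id="siteSub">From Wikipedia</div>'
    '<div id="mw-content-text"><div><p>Latex is an <b>emulsion</b>.</p></div>'
    '<p>It is found in plants.</p></div>'
    '<div id="footer">Footer text</div></body></html>'
)
print(extract_content_text(sample))
```

Tracking only the nesting depth of div elements keeps the sketch short, since other tags inside the article body do not affect where the target div ends.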
12
Methodology
Prior to any of these experiments, two steps were performed on the dataset:
Text Preprocessing
Text Transformation (Attribute Generation)
13
Methodology: Text Preprocessing
Selected Wikipedia Article Text → HTML Tag Removal → Tokenization → Punctuation and Stop Words Removal → Stemming → Features
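The preprocessing steps named on this slide can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's code: the stop word list is a tiny made-up sample, the step order is assumed, and `naive_stem` is a crude suffix stripper standing in for whichever stemmer was actually used (the slides do not name one). All function names are hypothetical.

```python
import re
from html.parser import HTMLParser

# Tiny illustrative stop word list; a real list would be much longer.
STOP_WORDS = {"a", "an", "and", "are", "is", "in", "of", "the", "to"}


class _TagStripper(HTMLParser):
    """Keep only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(html):
    stripper = _TagStripper()
    stripper.feed(html)
    stripper.close()
    return " ".join(stripper.parts)


def tokenize(text):
    # Words and single punctuation marks become separate lowercase tokens.
    return re.findall(r"\w+|[^\w\s]", text.lower())


def naive_stem(token):
    # Stand-in for a real stemmer: strip one common English suffix.
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(html):
    tokens = tokenize(strip_html(html))
    tokens = [t for t in tokens if any(ch.isalnum() for ch in t)]  # drop punctuation
    tokens = [t for t in tokens if t not in STOP_WORDS]            # drop stop words
    return [naive_stem(t) for t in tokens]


print(preprocess("<p>The jaguars are hunting in the forests.</p>"))
# ['jaguar', 'hunt', 'forest']
```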
14
Methodology: Attribute Generation
Based on the features derived after Text Preprocessing, a TF-IDF (Term Frequency, Inverse Document Frequency) matrix is created.
$$\mathrm{TFIDF} = f(w)\,\log\frac{N}{D_w}$$
$f(w)$: frequency of phrase $w$ in the document
$D_w$: number of documents that contain $w$
$N$: number of documents in the document set
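As a rough illustration of the TF-IDF definition on this slide, the sketch below builds the matrix from already preprocessed token lists. The slide does not state the logarithm base, so the natural logarithm is assumed here; `tfidf_matrix` and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_matrix(docs):
    """docs: list of token lists. Returns (vocabulary, rows of TF-IDF scores)."""
    n_docs = len(docs)                                  # N
    vocab = sorted({t for doc in docs for t in doc})
    # D_w: number of documents that contain each term
    doc_freq = Counter(t for doc in docs for t in set(doc))
    rows = []
    for doc in docs:
        counts = Counter(doc)                           # f(w) for this document
        rows.append([counts[t] * math.log(n_docs / doc_freq[t]) for t in vocab])
    return vocab, rows


docs = [
    ["latex", "rubber", "rubber"],
    ["latex", "document"],
    ["latex", "paint"],
]
vocab, rows = tfidf_matrix(docs)
print(vocab)    # ['document', 'latex', 'paint', 'rubber']
print(rows[0])  # 'latex' scores 0.0 because it occurs in every document
```

A term that appears in every document gets a score of zero, which is why a keyword shared by the whole result set contributes nothing to separating the clusters.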
15
Experiment I
Conducted to select the more accurate clustering algorithm from:
K-means Clustering
Agglomerative Hierarchical Clustering
The 400 documents selected above were subjected to both clustering algorithms. The number of clusters was set to four in both cases. The resulting clusters were labeled by majority voting.
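The majority-voting step can be sketched as below. The slides do not spell out how accuracy was computed, so this assumes it is the fraction of documents whose search keyword matches the majority label of their cluster; `majority_vote_accuracy` and the sample assignments are hypothetical.

```python
from collections import Counter, defaultdict


def majority_vote_accuracy(cluster_ids, true_labels):
    """Label each cluster by its most common true label and score the result."""
    members = defaultdict(list)
    for cluster, label in zip(cluster_ids, true_labels):
        members[cluster].append(label)
    # The majority label of each cluster becomes that cluster's label.
    cluster_label = {
        cluster: Counter(labels).most_common(1)[0][0]
        for cluster, labels in members.items()
    }
    correct = sum(
        cluster_label[cluster] == label
        for cluster, label in zip(cluster_ids, true_labels)
    )
    return cluster_label, correct / len(true_labels)


# Made-up cluster assignments for eight documents.
cluster_ids = [0, 0, 0, 1, 1, 2, 2, 2]
keywords = ["Latex", "Latex", "Nazi", "Nazi", "Nazi", "Jaguar", "Jaguar", "Flipper"]
labels, accuracy = majority_vote_accuracy(cluster_ids, keywords)
print(labels)    # {0: 'Latex', 1: 'Nazi', 2: 'Jaguar'}
print(accuracy)  # 0.75
```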
16
Experiment I
To address research question I, this experiment was conducted in two ways, with features derived from:
First paragraph text
Full article text
17
Experiment I Results
K-means clustering outperformed agglomerative hierarchical clustering in terms of accuracy, for both first paragraph text and full article text.

Algorithm                               First Paragraph Text   Full Article Text
K-means Clustering                      82.75%                 76.5%
Agglomerative Hierarchical Clustering   68.75%                 70%
18
Experiment II
The objective is to analyze the distribution of the Average Summation of TF-IDF Scores (AS-TF-IDFS) of the features in the TF-IDF matrix.
$$\mathrm{Average\_sum\_of\_TFIDF}(F_n) = \frac{\sum_{i=1}^{N} \mathrm{TFIDF}(D_i, F_n)}{N}$$
$N$: number of documents
$\mathrm{TFIDF}(D_i, F_n)$: TF-IDF score of feature $F_n$ in document $D_i$
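Read as a column average of the TF-IDF matrix, the AS-TF-IDFS of each feature could be computed as in this small sketch. It assumes one row per document and one column per feature; `average_sum_of_tfidf` and the sample matrix are hypothetical.

```python
def average_sum_of_tfidf(matrix):
    """matrix[i][n] is the TF-IDF score of feature n in document i.

    Returns one AS-TF-IDFS value per feature (the column mean).
    """
    n_docs = len(matrix)
    n_features = len(matrix[0])
    return [
        sum(row[n] for row in matrix) / n_docs
        for n in range(n_features)
    ]


# Two documents, three features (made-up scores).
matrix = [
    [0.0, 2.0, 1.0],
    [0.0, 4.0, 0.0],
]
print(average_sum_of_tfidf(matrix))  # [0.0, 3.0, 0.5]
```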
19
Experiment II Results
20
Experiment II: Special Observations
Only a few high AS-TF-IDFS values can be observed.
$\mathrm{Max}_{\text{AS-TF-IDFS}}$ is significantly higher than $\mathrm{Min}_{\text{AS-TF-IDFS}}$.
There is a turning point (knee point).
The features with higher AS-TF-IDFS appear among the top features in most of the cluster centroids.
21
Experiment III
A new term is introduced:
$$\text{AS-TF-IDFS\_Threshold} = \mathrm{Min}_{\text{AS-TF-IDFS}} + \left(\mathrm{Max}_{\text{AS-TF-IDFS}} - \mathrm{Min}_{\text{AS-TF-IDFS}}\right) \times C$$
22
Experiment III
Document sets were selected pairwise for this experiment. For each pair of document sets:
The number of features whose AS-TF-IDFS is greater than AS-TF-IDFS_Threshold was taken as the number of clusters.
K-means clustering was performed.
The resulting clusters were labeled using majority voting.
The Total Error and the number of clusters were recorded.
The experiment was repeated with different values of $C$.
Finally, $C$ vs. average Total Error and $C$ vs. number of clusters were plotted in two separate graphs.
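The rule for choosing the number of clusters can be sketched as follows. It assumes the threshold is read as Min + (Max − Min) × C, that is, a fraction C of the way from the smallest to the largest AS-TF-IDFS; the function names and the sample scores are hypothetical.

```python
def as_tfidfs_threshold(scores, c):
    """Threshold at a fraction c between the minimum and maximum score."""
    lowest, highest = min(scores), max(scores)
    return lowest + (highest - lowest) * c


def number_of_clusters(scores, c):
    """Count the features whose AS-TF-IDFS is strictly above the threshold."""
    threshold = as_tfidfs_threshold(scores, c)
    return sum(1 for score in scores if score > threshold)


# Made-up AS-TF-IDFS values for five features.
scores = [0.25, 0.5, 1.0, 2.0, 4.25]
print(as_tfidfs_threshold(scores, 0.25))  # 1.25
print(number_of_clusters(scores, 0.25))   # 2
```

A larger C raises the threshold and so yields fewer clusters, which is the trade-off the C vs. number of clusters plot explores.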
23
Experiment III Results
The value of $C$ was concluded to be 0.25.
24
Experiment IV
Each document set was processed separately in this experiment. For labeling purposes:
First Paragraph Text of each article → HTML Tag Removal → Tokenization → Punctuation and Stop Words Removal → Lemmatization
25
Experiment IV
Keeping the $C$ value at 0.25, the number of clusters was determined.
For labeling each resulting cluster:
Lemmatized Texts of articles → Latent Dirichlet Allocation
In the evaluation, the relevance of the documents to the generated label of each cluster was assessed manually.
26
Experiment IV Results
Clustering with the features derived from the first paragraph text gave better accuracy than clustering with the complete article text.

First Paragraph Text
Keywords: Latex, Nazi, Jaguar, Flipper
Number of Documents: 100
Decided Number of Clusters: 13, 16, 4
Accuracy: 79%, 61%, 58%
Error: 21%, 39%, 42%

Full Article Text
Keywords: Latex, Nazi, Jaguar, Flipper
Number of Documents: 100
Decided Number of Clusters: 15, 19, 3, 21
Accuracy: 74%, 73%, 56%, 47%
Error: 26%, 27%, 44%, 53%
27
Proposed Methodology
28
Thank You!