Enhancing Wikipedia Search Results Using Text Mining


Enhancing Wikipedia Search Results Using Text Mining University of Ruhuna Faculty of Science Department of Computer Science Sri Lanka

About this Research: This is my undergraduate research project for the Bachelor of Computer Science (Special) Degree programme. Supervisors: Mr. S.A.S Lorensuhewa, Senior Lecturer, Department of Computer Science, Faculty of Science, University of Ruhuna; Ms. M.A.L Kalyani, Lecturer.

Problem Definition: Wikipedia is an online encyclopedia popular among web users. It has millions of articles on different subjects, and some of these articles are available in several languages. The Wikipedia search result page lists the Wikipedia articles related to a keyword entered by the user.

Problem Definition: Wikipedia search result page. Problem: there is no content-based grouping of the search results.

Problem Definition: The page presents a long list of links. There is no way to categorize the search results based on their content. Articles with similar content are not even placed in adjacent positions on the search result page.

Proposed Solution: A search result clustering methodology. Group the links returned by the Wikipedia search page for a particular keyword, based on the contents of the HTML documents those links point to. Label those groups meaningfully.

Proposed Solution (diagram): the search result links (Link 1, Link 2, Link 3, ...) are grouped under topics (Topic 1, Topic 2, Topic 3).

Proposed Solution, Potential Advantages: Finding the desired article among the search results becomes easier. The different usages of a given keyword can be viewed very quickly. Because Wikipedia is an encyclopedia, this solution makes it easier to use for this kind of analysis.

Methodology: To arrive at this solution, the research was organized around the following four research questions. (1) What is the best clustering algorithm for Wikipedia document clustering? (2) What is the optimum amount of text to extract from a Wikipedia article? (3) How can the optimum number of clusters for a given keyword be determined so as to obtain a better grouping? (4) How can the resulting clusters/groups be labeled meaningfully?

Methodology: This solution was derived empirically, through four experiments. The first 100 documents returned by the search result pages for each of the following keywords were analyzed: Latex, Nazi, Jaguar, Flipper.

Methodology: The text analyzed for each article is the textual content under the div tag mw-content-text.
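The extraction step on this slide can be sketched as follows. This is a minimal illustration using only the Python standard library, not the tooling used in the project. It assumes that mw-content-text is the id attribute of the div and that the markup is well formed (every non-void tag is closed). The names ContentTextExtractor and extract_content_text are hypothetical.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ContentTextExtractor(HTMLParser):
    """Collects text found inside the div whose id is `target_id` (assumed id)."""

    def __init__(self, target_id="mw-content-text"):
        super().__init__()
        self.target_id = target_id
        self.depth = 0        # nesting depth inside the target div (0 = outside)
        self.skip_depth = 0   # > 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth == 0:
            # Only the target div switches collection on.
            if tag == "div" and dict(attrs).get("id") == self.target_id:
                self.depth = 1
            return
        self.depth += 1
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self.depth == 0:
            return
        if tag in ("script", "style") and self.skip_depth > 0:
            self.skip_depth -= 1
        self.depth -= 1

    def handle_data(self, data):
        # Keep visible text only: inside the target div, outside script/style.
        if self.depth > 0 and self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_content_text(html):
    parser = ContentTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

Real article pages contain tables, references, and navigation boxes inside the same div, so a practical run would likely need further filtering.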

Methodology: Prior to any of these experiments, text preprocessing and text transformation (attribute generation) steps were performed on the dataset.

Methodology, Text Preprocessing (flow diagram): selected Wikipedia article text -> HTML tag removal -> tokenization -> punctuation and stop-word removal -> stemming -> features.
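A minimal sketch of such a preprocessing chain, under stated assumptions: the stop-word list is a tiny illustrative one, and naive_stem is a toy suffix stripper standing in for a real stemmer (the slides do not say which stemmer was used). All function names here are hypothetical.

```python
import re

# A tiny illustrative stop-word list; a real run would use a fuller list.
STOP_WORDS = {"a", "an", "and", "are", "as", "by", "in", "is", "it", "of",
              "on", "the", "to", "was", "with"}


def strip_html_tags(text):
    """Replace anything that looks like an HTML tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)


def tokenize(text):
    """Lowercase and keep alphabetic runs; this also discards punctuation."""
    return re.findall(r"[a-z]+", text.lower())


def naive_stem(token):
    """Toy suffix stripper (assumption): not the Porter algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(raw_html):
    """HTML tag removal, tokenization, stop-word removal, then stemming."""
    tokens = tokenize(strip_html_tags(raw_html))
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [naive_stem(t) for t in tokens]
```

The returned stems are the features from which the attribute matrix is built in the next step.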

Methodology, Attribute Generation: Based on the features derived from text preprocessing, a TF-IDF (Term Frequency-Inverse Document Frequency) matrix is created: $TFIDF = f(w) \log \frac{N}{D_w}$, where $f(w)$ is the frequency of phrase $w$ in the document, $D_w$ is the number of documents that contain $w$, and $N$ is the number of documents in the document set.
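The formula above can be turned into a small sketch. It assumes the natural logarithm (the slide does not state the base) and takes each document as a list of already preprocessed tokens. The function name tfidf_matrix is hypothetical.

```python
import math
from collections import Counter


def tfidf_matrix(documents):
    """documents: list of token lists. Returns (vocabulary, matrix)."""
    n_docs = len(documents)  # N
    vocabulary = sorted({term for doc in documents for term in doc})
    # D_w: number of documents that contain term w
    doc_freq = Counter(term for doc in documents for term in set(doc))
    matrix = []
    for doc in documents:
        counts = Counter(doc)  # f(w): raw frequency of w in this document
        matrix.append([counts[w] * math.log(n_docs / doc_freq[w])
                       for w in vocabulary])
    return vocabulary, matrix
```

Note that a term occurring in every document gets a score of zero, since log(N / D_w) = log(1) = 0.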

Experiment I: Conducted to select the more accurate clustering algorithm from K-means clustering and agglomerative hierarchical clustering. The 400 documents selected above were subjected to both clustering algorithms. The number of clusters was set to four in both cases. The resulting clusters were labeled by majority voting.
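Majority-vote labeling can be sketched as below. The sketch assumes each document carries the search keyword it was retrieved for as its true label, and that accuracy is the fraction of documents whose keyword matches the majority label of their cluster; the slides do not spell out the accuracy definition, so that part is an assumption. The function name is hypothetical.

```python
from collections import Counter


def majority_vote_accuracy(cluster_ids, true_labels):
    """Label each cluster by its most common true label, then score accuracy."""
    members = {}
    for cluster, label in zip(cluster_ids, true_labels):
        members.setdefault(cluster, []).append(label)
    # The majority label of each cluster becomes that cluster's label.
    cluster_label = {c: Counter(labels).most_common(1)[0][0]
                     for c, labels in members.items()}
    correct = sum(1 for c, label in zip(cluster_ids, true_labels)
                  if cluster_label[c] == label)
    return cluster_label, correct / len(true_labels)
```

The cluster assignments themselves would come from whichever clustering algorithm is being evaluated.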

Experiment I: With regard to research question 1, this experiment was conducted in two ways, with the features derived from either the first-paragraph text or the full article text.

Experiment I Results: K-means clustering outperformed agglomerative hierarchical clustering in accuracy, for both the first-paragraph text and the full article text.

Algorithm                              First paragraph text   Full article text
K-means clustering                     82.75%                 76.5%
Agglomerative hierarchical clustering  68.75%                 70%

Experiment II: The objective is to analyze the distribution of the Average Summation of TF-IDF Scores (AS-TF-IDFS) of the features in the TF-IDF matrix: $\text{Average\_sum\_of\_TFIDF}(F_n) = \frac{\sum_{i=1}^{N} (D_i, F_n)}{N}$, where $N$ is the number of documents and $(D_i, F_n)$ is the entry of the TF-IDF matrix for document $D_i$ and feature $F_n$.
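A minimal sketch of this score, reading $(D_i, F_n)$ as the TF-IDF score of feature $F_n$ in document $D_i$ (an interpretation of the slide's notation). The matrix is a list of rows, one row per document; the function name is hypothetical.

```python
def average_sum_of_tfidf(matrix):
    """matrix[i][n] is the TF-IDF score of feature n in document i.

    Returns one AS-TF-IDFS value per feature (the column mean).
    """
    n_docs = len(matrix)
    n_features = len(matrix[0])
    return [sum(row[n] for row in matrix) / n_docs for n in range(n_features)]
```

Sorting the returned values in descending order gives the distribution that the experiment examines.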

Experiment II Results

Experiment II, Special Observations: Only a few high AS-TF-IDFS values can be observed. $\mathrm{Max}_{\text{AS-TF-IDFS}}$ is significantly higher than $\mathrm{Min}_{\text{AS-TF-IDFS}}$. There is a turning point (knee point). The features with higher AS-TF-IDFS appear among the top features of most cluster centroids.

Experiment III: A new term is introduced: $\text{AS-TF-IDFS\_Threshold} = \mathrm{Min}_{\text{AS-TF-IDFS}} + (\mathrm{Max}_{\text{AS-TF-IDFS}} - \mathrm{Min}_{\text{AS-TF-IDFS}}) \times C$

Experiment III: Document sets were selected pairwise for this experiment. For each pair of document sets: the number of features whose AS-TF-IDFS is greater than AS-TF-IDFS_Threshold was taken as the number of clusters; K-means clustering was performed; the resulting clusters were labeled using majority voting; the total error and the number of clusters were recorded. The experiment was repeated with different values of $C$. Finally, $C$ vs. average total error and $C$ vs. number of clusters were plotted in two separate graphs.
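The cluster-count rule can be sketched as follows, reading the threshold as Min + (Max - Min) × C, that is, a point a fraction C of the way from the smallest to the largest AS-TF-IDFS value. That grouping of terms is an assumption about the slide's formula, and the function names are hypothetical.

```python
def as_tfidfs_threshold(scores, c):
    """Threshold located a fraction c of the way from min to max score."""
    low, high = min(scores), max(scores)
    return low + (high - low) * c


def number_of_clusters(scores, c):
    """Count the features whose AS-TF-IDFS exceeds the threshold."""
    threshold = as_tfidfs_threshold(scores, c)
    return sum(1 for s in scores if s > threshold)
```

A larger C raises the threshold and so yields fewer clusters, which is the trade-off the experiment explores by varying C.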

Experiment III Results: The value of $C$ was concluded to be 0.25.

Experiment IV: Each document set was processed separately in this experiment. For labeling purposes (flow diagram): first-paragraph text of each article -> HTML tag removal -> tokenization -> punctuation and stop-word removal -> lemmatization.

Experiment IV: Keeping the $C$ value at 0.25, the number of clusters was determined. To label each resulting cluster, the lemmatized texts of its articles were passed to Latent Dirichlet Allocation. In the evaluation, the relevance of the documents to the generated label of their cluster was assessed manually.

Experiment IV Results: Clustering with the features derived from the first-paragraph text gave better accuracy than clustering with the complete article text.

First Paragraph Text (keywords Latex, Nazi, Jaguar, Flipper; 100 documents; values as listed on the slide)
Decided number of clusters: 13, 16, 4
Accuracy: 79%, 61%, 58%
Error: 21%, 39%, 42%

Full Article Text
                            Latex  Nazi  Jaguar  Flipper
Number of documents         100    100   100     100
Decided number of clusters  15     19    3       21
Accuracy                    74%    73%   56%     47%
Error                       26%    27%   44%     53%

Proposed Methodology

Thank you!