An Automatic Construction of Arabic Similarity Thesaurus

An Automatic Construction of Arabic Similarity Thesaurus
Abdulaziz Al-Qabbany, AbdulMalik Al-Salman, Abdulrahman Almuhareb
CITALA 2009

Outline
Introduction
Thesauruses
Similarity Thesaurus
Proposed Improvement
The Experiment
Evaluation
Discussion
Conclusions and Future Work

Introduction
Thesaurus importance
Effective Information Retrieval systems
Vocabulary mismatch problem
Query Expansion

Thesauruses
Arabic thesauruses
Manual construction drawbacks: cost, time, subjectivity
Automatic construction approaches

Similarity Thesaurus
Qiu and Frei (1993) presented a query expansion model based on a similarity thesaurus.
Zazo et al. (2005) used the same approach to construct a Spanish similarity thesaurus.
Queries are expanded based on similarity to the query concept as a whole rather than similarity to the individual query terms.

Similarity Thesaurus (cont.)
Using a similarity thesaurus is analogous to translating from one language to another.
Example: the word قرص (disc) in different contexts:
قرص الدواء (medicine tablet)
قرص الشمس (the sun's disc)
قرص ضوئي (optical disc)

Similarity Thesaurus construction
The similarity thesaurus is a matrix that represents term similarities.
Each term is represented by a vector that determines its relation to each document.
The matrix is generated by calculating the similarities between term vectors.
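As a rough illustration of this construction, the following Python sketch builds one vector per term over a toy document collection and fills the term-by-term matrix with dot products. The raw-count weighting, the unit-length normalization, the whitespace tokenization, and the names term_vectors and similarity_matrix are assumptions made for the example; the slides do not state which weighting scheme was used.

```python
import math
from collections import Counter


def term_vectors(documents):
    """Map each term to a unit-length vector of its counts across documents."""
    counts = [Counter(doc.split()) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    vectors = {}
    for term in vocabulary:
        raw = [c[term] for c in counts]
        # Every vocabulary term occurs at least once, so the norm is non-zero.
        norm = math.sqrt(sum(x * x for x in raw))
        vectors[term] = [x / norm for x in raw]
    return vectors


def similarity_matrix(vectors):
    """Term-term similarity as the dot product of unit-length term vectors."""
    return {
        a: {b: sum(x * y for x, y in zip(va, vb)) for b, vb in vectors.items()}
        for a, va in vectors.items()
    }


# Toy collection (invented for illustration, not taken from the corpus).
docs = [
    "paris france president",
    "paris france elysee",
    "sun disc light",
]
thesaurus = similarity_matrix(term_vectors(docs))
print(round(thesaurus["paris"]["france"], 3))  # terms that always co-occur
print(round(thesaurus["paris"]["sun"], 3))     # terms that never co-occur
```

A nested dictionary keeps the sketch readable; for a collection of the size used in the experiment a sparse matrix representation would likely be needed.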

Query Expansion using Similarity Thesaurus
The similarity between the query q and any term t is computed as the sum of the similarity values between each query term and t:
$$\mathrm{SIM\_QT}(q, t) = \sum_{t_i \in q} \mathrm{SIM}(t_i, t)$$
where SIM(t_i, t) is the thesaurus similarity between query term t_i and t.
In response to any query, the terms can be ranked in descending order of their SIM_QT values.
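The ranking step can be sketched in Python as follows. The function names sim_qt_sum and expand_query, the nested-dictionary layout of the thesaurus, and all similarity values are hypothetical and chosen only to show the mechanics of summing and sorting.

```python
def sim_qt_sum(query_terms, candidate, sim):
    """Sum of the similarities between each query term and the candidate."""
    return sum(sim.get(q, {}).get(candidate, 0.0) for q in query_terms)


def expand_query(query_terms, sim, top_k=5):
    """Rank candidate terms by their summed similarity, highest first."""
    candidates = {t for q in query_terms for t in sim.get(q, {})}
    candidates -= set(query_terms)
    ranked = sorted(
        candidates,
        key=lambda t: sim_qt_sum(query_terms, t, sim),
        reverse=True,
    )
    return ranked[:top_k]


# Invented similarity values for a two-term query (illustration only).
toy_sim = {
    "jacques": {"french": 0.42, "elysee": 0.30, "straw": 0.47},
    "chirac": {"french": 0.40, "elysee": 0.33, "straw": 0.02},
}
print(expand_query(["jacques", "chirac"], toy_sim, top_k=3))
```

In this toy data the term "straw" is strongly similar to only one of the two query terms, yet it still enters the ranking because the sum does not look at how the values are spread.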

“SUM” method
The “SUM” method is appropriate when the similarity values between the query terms and the indexed term are consistent, that is, within the same range.
When the similarity values are inconsistent, the differences between them are not reflected in the total sum.
Similarity values are considered inconsistent when they contain outliers.
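A small worked example, using invented similarity values for a two-term query, shows why the sum alone cannot separate consistent from inconsistent values. Candidate a is moderately similar to both query terms, while candidate b is similar to only one of them, yet both receive the same total:

```latex
\begin{align*}
\text{candidate } a:&\quad 0.40 + 0.40 = 0.80 \\
\text{candidate } b:&\quad 0.78 + 0.02 = 0.80
\end{align*}
```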

Outliers
An outlier is a value that is considerably dissimilar to or inconsistent with the majority of the data.
[Figure: scatter plot with axes X and Y and one point labeled "outlier"]

Proposed Improvement
A given term should have a high similarity value with each individual term in the query in order to be considered related.
The dispersion of the similarity values is one of the factors that need to be considered in query expansion.
The total similarity value should remain the main factor in query expansion.

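Proposed Improvement (cont.)
Instead of using the sum of the similarity values, we use the mean of the values minus the standard error of the mean (SE):
$$\mathrm{SIM\_QT}(q, t) = \frac{1}{n} \sum_{t_i \in q} \mathrm{SIM}(t_i, t) - SE$$
where SIM(t_i, t) is the similarity between query term t_i and t.
The standard error of the mean is a measure of data dispersion:
$$SE = \frac{\sigma}{\sqrt{n}}$$
where σ is the standard deviation of the similarity values and n is the number of values.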
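A minimal Python sketch of this score is given below. The function name sim_qt_mean and the input values are hypothetical. The slides do not say whether the sample or the population standard deviation is used; the sketch assumes the sample standard deviation (statistics.stdev) and treats SE as zero for a single-term query, where the sample standard deviation is undefined.

```python
import math
import statistics


def sim_qt_mean(similarities):
    """Mean of the similarity values minus the standard error of the mean."""
    n = len(similarities)
    mean = statistics.mean(similarities)
    if n < 2:
        # Assumption: no dispersion penalty when there is only one value.
        return mean
    # Assumption: sample standard deviation; the slides do not specify.
    standard_error = statistics.stdev(similarities) / math.sqrt(n)
    return mean - standard_error


# Invented values: both candidates have the same sum and the same mean.
consistent = [0.40, 0.40]
with_outlier = [0.78, 0.02]
print(sim_qt_mean(consistent))
print(sim_qt_mean(with_outlier))
```

Under these assumptions the candidate whose similarities are spread unevenly across the query terms is scored lower, even though its sum is identical.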

The Experiment
We used the Agence France-Presse (AFP) Arabic news of the years 2004, 2005 and 2006 as the document collection.
This document collection is part of the LDC Arabic Gigaword corpus (Third Edition).
After examining the high-frequency terms in the collection, we chose 150 stop words.
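The frequency inspection that precedes stop-word selection can be sketched as below. This only produces a list of candidates for manual review; the function name stop_word_candidates, the whitespace tokenization, and the sample sentences are assumptions for illustration and do not reproduce the authors' procedure.

```python
from collections import Counter


def stop_word_candidates(documents, limit=150):
    """Return the most frequent tokens as candidates for manual review."""
    frequencies = Counter(
        token for doc in documents for token in doc.split()
    )
    return [term for term, _ in frequencies.most_common(limit)]


# Tiny invented sample; "في" (in) and "من" (from) are common prepositions.
sample_docs = [
    "في باريس",
    "في فرنسا",
    "من باريس في",
]
print(stop_word_candidates(sample_docs, limit=2))
```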

Document collection characteristics
Characteristic                            Number
Number of Documents                       208,596
Number of Terms                           435,846
Total Number of Term Occurrences          30,415,222
Average Number of Words per Document      69.78
Number of Processed Terms                 248,311

Evaluation
The objective of the evaluation was to assess the relevance strength of the produced terms.
The evaluation process was applied to both the “SUM” and “MEAN” methods.
We selected twenty common topics that belong to five different domains.

Evaluation (cont.)
For each topic, the top ten related terms were presented to five expert evaluators.
Each evaluator was asked to study the twenty topics carefully and then specify whether the produced terms are relevant.
Levels of relevance: Relevant, Somewhat Relevant, Irrelevant

Evaluation Results
The relevance strength of the standard “SUM” method was 95.0%, while the relevance strength of the “MEAN” method was 98.1%.

Discussion
We believe the main reason the “MEAN” method performs better is its ability to detect and exclude outliers.
Adding a single term to the query may completely change the concept of the query.
The candidate related term should have consistent similarities with all of the query terms.

Example
The response to a query about the former French president “جاك شيراك” (Jacques Chirac):

SUM: Most Related Terms and Values
الفرنسي (the French): 0.814
الاليزيه (the Élysée): 0.630
فرنسا (France): 0.556
الرئيس (the president): 0.503
سترو (Straw): 0.482

MEAN: Most Related Terms and Values
الفرنسي (the French): 0.383
الاليزيه (the Élysée): 0.283
فرنسا (France): 0.266
الرئيس (the president): 0.241
باريس (Paris): 0.214

Example (cont.)

Conclusions
The relevance strength of the standard “SUM” method was 95.0%, while the relevance strength of the “MEAN” method was 98.1%.
The “MEAN” method shows an improvement of about 3.3% over the “SUM” method.
We conclude that the “MEAN” method is more accurate mainly because it can detect and exclude outliers.
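The 3.3% figure is consistent with a relative gain over the “SUM” score rather than with the absolute difference of 3.1 percentage points. This reading is an inference from the two reported percentages, not something the slides state:

```latex
\[
\frac{98.1 - 95.0}{95.0} = \frac{3.1}{95.0} \approx 0.033 = 3.3\%
\]
```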

Future Work
Applying word stemming.
Producing collocations.
Constructing a single word-category thesaurus.
Using the similarity thesaurus in question answering.

End