An Automatic Construction of Arabic Similarity Thesaurus

Abdulaziz Al-Qabbany, AbdulMalik Al-Salman, Abdulrahman Almuhareb
CITALA 2009

Outline
Introduction
Thesauruses
Similarity Thesaurus
Proposed Improvement
The Experiment
Evaluation
Discussion
Conclusions and Future Work

Introduction
Thesaurus importance
Effective Information Retrieval systems
Vocabulary mismatch problem
Query Expansion

Thesauruses
Arabic thesauruses
Manual construction drawbacks: cost, time, subjectivity
Automatic construction approaches

Similarity Thesaurus
Qiu and Frei (1993) presented a query expansion model based on a similarity thesaurus. Zazo et al. (2005) used the same approach to construct a Spanish similarity thesaurus. The model expands a query with terms that are similar to the concept of the query as a whole, rather than to its individual terms.

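Similarity Thesaurus (cont.)
Using a similarity thesaurus is analogous to translating from one language to another.
Example: the word قرص takes a different meaning in each of قرص الدواء (medicine tablet), قرص الشمس (disc of the sun), and قرص ضوئي (optical disc).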

Similarity Thesaurus construction
The similarity thesaurus is a matrix of term-to-term similarities. Each term is represented by a vector that describes its relation to each document. The matrix is generated by calculating the similarities between the term vectors.
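The construction described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes raw term frequencies as the vector weights and cosine similarity between term vectors, neither of which the slides specify. The function names `build_term_vectors`, `cosine`, and `similarity_thesaurus` and the toy documents are hypothetical.

```python
import math
from collections import Counter, defaultdict


def build_term_vectors(documents):
    """Map each term to a sparse vector {doc_id: weight}.

    Assumption: the weight is the raw frequency of the term in the document.
    """
    vectors = defaultdict(dict)
    for doc_id, tokens in enumerate(documents):
        for term, freq in Counter(tokens).items():
            vectors[term][doc_id] = float(freq)
    return dict(vectors)


def cosine(u, v):
    """Cosine similarity between two sparse vectors (an assumed measure)."""
    dot = sum(w * v[d] for d, w in u.items() if d in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_thesaurus(documents):
    """Return a nested dict sim[t1][t2] holding term-term similarities."""
    vectors = build_term_vectors(documents)
    terms = sorted(vectors)
    sim = {t: {} for t in terms}
    for i, t1 in enumerate(terms):
        for t2 in terms[i:]:
            value = cosine(vectors[t1], vectors[t2])
            sim[t1][t2] = value
            sim[t2][t1] = value  # the matrix is symmetric
    return sim


# Toy documents, already tokenized (illustrative only).
docs = [
    ["paris", "france", "president"],
    ["paris", "france"],
    ["london", "minister"],
]
thesaurus = similarity_thesaurus(docs)
print(round(thesaurus["paris"]["france"], 3))
print(round(thesaurus["paris"]["london"], 3))
```

Terms that occur in the same documents ("paris" and "france") get a high similarity, while terms that never co-occur ("paris" and "london") get zero. For a collection of realistic size the full term-by-term matrix is large, so a practical version would likely keep only the strongest entries per term.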

Query Expansion using Similarity Thesaurus
The similarity between a query q and any term t is computed as the sum of the similarity values between each query term and t:
SIM_QT(q, t) = \sum_{t_i \in q} SIM(t_i, t)
where SIM(t_i, t) is the thesaurus similarity between query term t_i and t. In response to any query, the terms can be ranked in descending order of their SIM_QT values.
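The ranking step can be sketched as below. This is an illustrative sketch under stated assumptions: the thesaurus is held as a nested dictionary, query terms themselves are excluded from the candidates, and the similarity values in `toy_sim` are made up. The names `sim_qt_sum` and `rank_terms_sum` are hypothetical.

```python
def sim_qt_sum(query_terms, term, sim):
    """SUM method: add up the similarity of `term` to every query term."""
    return sum(sim.get(q, {}).get(term, 0.0) for q in query_terms)


def rank_terms_sum(query_terms, sim, top_k=10):
    """Rank candidate terms in descending order of their summed similarity."""
    # Assumption: the query terms themselves are not expansion candidates.
    candidates = {t for row in sim.values() for t in row} - set(query_terms)
    scored = [(t, sim_qt_sum(query_terms, t, sim)) for t in candidates]
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_k]


# Hypothetical similarity values between query terms (a, b) and terms (x, y, z).
toy_sim = {
    "a": {"x": 0.40, "y": 0.90, "z": 0.10},
    "b": {"x": 0.40, "y": 0.05, "z": 0.20},
}
ranking = rank_terms_sum(["a", "b"], toy_sim)
for term, score in ranking:
    print(term, round(score, 2))
```

With these toy values, "y" ranks first even though it is only weakly related to query term "b", because its single large similarity to "a" dominates the sum.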

“SUM” method
The “SUM” method is appropriate when the similarity values between the query terms and the indexed term are consistent, that is, within the same range. When the similarity values are inconsistent, the differences between them are not reflected in the total sum. Similarity values are considered inconsistent when they contain outliers.

Outliers
An outlier is a value that is considerably dissimilar to, or inconsistent with, the majority of the data.
[Figure: plot with axes X and Y, one point labeled "outlier"]

Proposed Improvement
A given term should have a high similarity value with each individual term in the query in order to be considered related. The dispersion of the similarity values is one of the factors that need to be considered in query expansion. The total similarity value should remain the main factor in query expansion.

Proposed Improvement (cont.)
Instead of the sum of the similarity values, we use the mean of the values minus the standard error of the mean (SE):
SIM_QT(q, t) = \frac{1}{n} \sum_{t_i \in q} SIM(t_i, t) - SE
The standard error of the mean is a measure of data dispersion:
SE = \frac{\sigma}{\sqrt{n}}
where SIM(t_i, t) is the similarity between query term t_i and t, \sigma is the standard deviation of the similarity values, and n is the number of values.
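A small sketch of the mean-minus-SE score, compared with the plain sum on made-up values. The slides do not say whether the standard deviation is the sample or the population form; this sketch assumes the sample form (`statistics.stdev`) and applies no penalty when only one value is available. The name `sim_qt_mean` is hypothetical.

```python
import math
import statistics


def sim_qt_mean(similarities):
    """MEAN method: mean of the similarity values minus the standard error."""
    n = len(similarities)
    mean = statistics.mean(similarities)
    if n < 2:
        return mean  # assumption: no dispersion penalty for a single value
    # Assumption: sample standard deviation (n - 1 in the denominator).
    se = statistics.stdev(similarities) / math.sqrt(n)
    return mean - se


# Hypothetical similarities of two candidate terms to a two-term query.
consistent = [0.40, 0.40]    # moderately similar to both query terms
with_outlier = [0.90, 0.05]  # very similar to one query term only

print("SUM :", round(sum(consistent), 3), round(sum(with_outlier), 3))
print("MEAN:", round(sim_qt_mean(consistent), 3), round(sim_qt_mean(with_outlier), 3))
```

On these values the sum favors the candidate with the outlier, while the mean-minus-SE score favors the candidate whose similarities are consistent across the query terms.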

The Experiment
We used the Agence France-Presse (AFP) Arabic news of the years 2004, 2005 and 2006 as the document collection. This collection is part of the LDC Arabic Gigaword corpus (Third Edition). After examining the high-frequency terms in the collection, we chose 150 stop words.
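One way to surface high-frequency terms for manual review is sketched below. It is only an illustration of that preparatory step: the slides do not describe the procedure used, and the name `stop_word_candidates` and the English stand-in tokens are hypothetical.

```python
from collections import Counter


def stop_word_candidates(documents, k):
    """Return the k most frequent terms as candidates for manual review."""
    counts = Counter(token for tokens in documents for token in tokens)
    return [term for term, _ in counts.most_common(k)]


# Toy tokenized documents; English tokens stand in for Arabic ones.
docs = [
    ["in", "paris"],
    ["in", "london"],
    ["from", "paris", "in"],
]
candidates = stop_word_candidates(docs, 2)
print(candidates)
```

Frequency alone is not enough: in this toy collection "paris" is as prominent as a function word, which is why the candidates still need to be examined before a term is added to the stop list.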

Document collection characteristics
Characteristic                            Number
Number of Documents                       208,596
Number of Terms                           435,846
Total Number of Term Occurrences          30,415,222
Average Number of Words per Document      69.78
Number of Processed Terms                 248,311

Evaluation
The objective of the evaluation was to assess the relevance strength of the produced terms. The evaluation process was applied to both the “SUM” and “MEAN” methods. We selected twenty common topics that belong to five different domains.

Evaluation (cont.)
For each topic, the top ten related terms were presented to five expert evaluators. Each evaluator was asked to study the twenty topics carefully and then specify whether each produced term was relevant.
Levels of relevance: Relevant, Somewhat Relevant, Irrelevant

Evaluation Results
The relevance strength of the standard “SUM” method was 95.0%, while the relevance strength of the “MEAN” method was 98.1%.

Discussion
We believe that the main reason the “MEAN” method performs better is its ability to detect and exclude outliers. Adding a single term to a query may completely change the concept of the query, so a candidate related term should have consistent similarities with all of the query terms.

Example
The response to a query about the former French president “جاك شيراك” (Jacques Chirac):
SUM Most Related Terms      Value
الفرنسي (French)            0.814
الاليزيه (Élysée)           0.630
فرنسا (France)              0.556
الرئيس (the president)      0.503
سترو (Straw)                0.482
MEAN Most Related Terms     Value
الفرنسي (French)            0.383
الاليزيه (Élysée)           0.283
فرنسا (France)              0.266
الرئيس (the president)      0.241
باريس (Paris)               0.214

Conclusions
The relevance strength of the standard “SUM” method was 95.0%, while the relevance strength of the “MEAN” method was 98.1%. The “MEAN” method therefore shows a relative improvement of about 3.3% over the “SUM” method (3.1 percentage points). We conclude that the “MEAN” method is more accurate mainly because it can detect and exclude outliers.

Future Work
Applying word stemming.
Producing collocations.
Constructing a single word-category thesaurus.
Using similarity thesaurus in question answering.
