University of Tehran FuFaIR: a Fuzzy Farsi Information Retrieval System Amir Nayyeri School of Electrical and Computer Engineering University of Tehran.

University of Tehran FuFaIR: a Fuzzy Farsi Information Retrieval System Amir Nayyeri School of Electrical and Computer Engineering University of Tehran Farhad Oroumchian University of Wollongong in Dubai

AICCSA06 2 Overview
- Persian Language
- Related Work
- Fuzzy IR
- Farsi IR
- FuFaIR Explanation
- Experimental Results
- Conclusion and Future Work

AICCSA06 3 Persian Language
- Spoken in several countries (Iran, Afghanistan, Tajikistan, …)
- The language has evolved over the years and has been influenced by many languages
- Contains foreign words from many languages such as Arabic, Turkish, French, English, …
- In some cases these words still follow the grammatical rules of their original languages, for example:
  “Maktab” مكتب (singular) → “MAKATEB” مكاتب (plural)
- In some cases these words can use the grammatical rules of both languages, e.g.:
  “Khabar” خبر (singular) → “AKHBAR” اخبار (Arabic) or “KHABAR-HA” خبرها (Persian)
- Morphological analyzers for this language need to deal with many forms of words

AICCSA06 4 Information Retrieval and Natural Language Processing for Persian (Farsi)
- The Faculty of Engineering of the University of Tehran started working on the processing of Persian about 7 years ago.
- For the past 3 years, the work has been a joint cooperation between UT and UOWD.
- Since then, several thousand experiments on the processing and retrieval of Persian text have been performed.

AICCSA06 5 Test Collections
1. Qavanin Collection
- Documents: Iranian Law Collection
- passages
- 41 queries and relevance judgments
2. Hamshahri Collection
- Documents: 300 MB of news from the Hamshahri newspaper
3. Part of Speech Tagging Collection
- A tag set of 40 tags
- tagged words

AICCSA06 6 Natural Language Processing
- Investigating automatic part-of-speech tagging based on machine learning approaches:
  - Probabilistic (Hidden Markov Model)
  - Rule based
  - Entropy based
  - Neural networks
- The best so far has reached 96% accuracy.

AICCSA06 7 Information Retrieval Experiments
All major retrieval models of English text retrieval, and their combinations, have been tested:
- Fuzzy Logic: MMM, Paice
- Vector Space
- Probabilistic: BM25
- N-Grams: N=2, N=3, N=4
- Combinational
With many different term weighting schemes.

AICCSA06 8 List of weights that produced the best results

Name      Weighting
tf.idf    tf*log(N/n) / ( (tf^2) * (qtf^2))
lnc.ltc   (1+log(tf))*(1+log(qtf))*log((1+N)/n) / ( (tf^2) * (qtf^2))
nxx.bpx   ( *tf/max tf) + log((N-n)/n)
tfc.nfc   tf*log(N/n)*( *qtf/max qtf)*log(N/n) / ( (tf^2) * (qtf^2))
tfc.nfx1  tf*log(N/n)*( *qtf/max qtf)*log(N/n) / ( (tf*log(N/n))^2 )
tfc.nfx2  tf*log(N/n)*( *qtf/max qtf)*log(N/n) / ( (tf^2))
Lnu.ltu   ((1+log(tf))*(1+log(qtf))*log((1+N)/n)) / ((1+log(average tf)) * ((1-s) + s * N.U.W / average N.U.W)^2)
Best

AICCSA06 9
No  System            No  System            No  System
    Fuzzy Logic                                 Vector Space
1   paice-tf.idf      11  mmm-tf.idf        20  2gram-Lnu.ltu
2   paice-lnc.ltc     12  mmm-lnc.ltc       21  2gram-tfc.nfx
3   paice-Lnu.ltu     13  mmm-Lnu.ltu       22  2gram-lnc.ltc
4   paice-nxx.bpx     14  mmm-nxx.bpx       23  3gram-Lnu.ltu
5   paice-tfc.nfx1    15  mmm-tfc.nfx1      24  3gram-tfc.nfx
6   paice-tfc.nfc     16  mmm-tfc.nfc       25  3gram-lnc.ltc
    Probabilistic
7   BM25                                    26  4gram-Lnu.ltu
8   2gram-BM25        17  vector-Lnu.ltu    27  4gram-tfc.nfx
9   3gram-BM25        18  vector-tfc.nfx2   28  4gram-lnc.ltc
10  4gram-BM25        19  vector-lnc.ltc
Best

AICCSA06 10 The context of the current work
- Improving the quality of Persian retrieval
- Improving IR systems that use Fuzzy Logic as their retrieval model

AICCSA06 11 Related Work – Fuzzy IR
- Fuzzy logic has been used in IR since the early days.
- But only a few fuzzy systems have shown superiority over classical approaches such as vector space.
- This has been confirmed for the Persian language as well.
- The current work has been mostly inspired by one of them:
  D.E. Losada, F.D. Hermida, A. Bugarin, S. Barro. Experiments on using fuzzy quantified sentences in adhoc retrieval. ACM Symposium on Applied Computing, 2004.

AICCSA06 12 Mixed Min & Max – MMM
Calculates the degree of membership of a document in the fuzzy set of the terms in the query, as follows:
OR query: (قيموميت يا حضانت) → (Guardian OR God Parent)
Q_or = (A_1 OR A_2 OR A_3 OR …)
SIM(Q_or, D) = C_or1 * max(d_A1, d_A2, …) + C_or2 * min(d_A1, d_A2, …)
AND query: (املاك و ثبت) → (Registration AND Properties)
Q_and = (A_1 AND A_2 AND A_3 AND …)
SIM(Q_and, D) = C_and1 * min(d_A1, d_A2, …) + C_and2 * max(d_A1, d_A2, …)
C_and and C_or are softness coefficients:
C_and1 in [0.5, 0.8], C_and2 = 1 – C_and1
C_or1 > 0.2, C_or2 = 1 – C_or1
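
The two MMM formulas above can be sketched in a few lines of Python. This is only an illustration of the formulas as stated on the slide, not the authors' implementation; the function name `mmm_similarity`, its argument names, and the sample membership values are assumptions made for this example.

```python
def mmm_similarity(memberships, operator, c1):
    """Mixed Min and Max similarity for a flat AND or OR query.

    memberships: degrees d_A1, d_A2, ... of the document in each query term's fuzzy set
    operator:    "and" or "or"
    c1:          softness coefficient (C_and1 or C_or1); the second coefficient is 1 - c1
    """
    if not memberships:
        raise ValueError("at least one term membership is required")
    if not 0.0 <= c1 <= 1.0:
        raise ValueError("softness coefficient must lie in [0, 1]")
    lowest, highest = min(memberships), max(memberships)
    c2 = 1.0 - c1
    if operator == "or":
        # OR: weight the maximum by C_or1 and the minimum by C_or2
        return c1 * highest + c2 * lowest
    if operator == "and":
        # AND: weight the minimum by C_and1 and the maximum by C_and2
        return c1 * lowest + c2 * highest
    raise ValueError("operator must be 'and' or 'or'")


# Hypothetical document with two query-term memberships.
doc = [0.9, 0.3]
print(mmm_similarity(doc, "or", 0.6))
print(mmm_similarity(doc, "and", 0.7))
```

With both coefficients set to 1.0 the model reduces to the plain fuzzy MAX (for OR) and MIN (for AND); smaller values soften the operators so that the other terms also contribute.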

AICCSA06 13 Paice Model
Calculates the degree of membership of a document in the fuzzy set of terms in the query, as follows:
AND query: (املاك و ثبت) → (Registration AND Properties)
Q_and = (A_1 and A_2 and A_3 and …)
OR query: (قيموميت يا حضانت) → (Guardian OR God Parent)
Q_or = (A_1 or A_2 or A_3 or …)
SIM(Q, D) = Σ_i r^(i-1) * td_i / Σ_i r^(i-1)
r = 1.0 for AND queries (td_i in ascending order)
r = 0.7 for OR queries (td_i in descending order)
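
A minimal Python sketch of the Paice similarity follows, assuming the formula is the usual weighted average with weights r^(i-1) over the sorted term memberships. The function name `paice_similarity` and the sample values are assumptions for illustration; the default r values are the ones given on the slide.

```python
def paice_similarity(memberships, operator, r=None):
    """Paice-model similarity for a flat AND or OR query.

    memberships: term membership degrees td_i of the document
    operator:    "and" (ascending order, default r = 1.0) or
                 "or"  (descending order, default r = 0.7)
    """
    if not memberships:
        raise ValueError("at least one term membership is required")
    if operator == "and":
        ordered, default_r = sorted(memberships), 1.0
    elif operator == "or":
        ordered, default_r = sorted(memberships, reverse=True), 0.7
    else:
        raise ValueError("operator must be 'and' or 'or'")
    if r is None:
        r = default_r
    # Position i (1-based) gets weight r**(i-1); here the index starts at 0.
    weights = [r ** i for i in range(len(ordered))]
    return sum(w * td for w, td in zip(weights, ordered)) / sum(weights)


# Hypothetical document with two query-term memberships.
doc = [0.2, 0.8]
print(paice_similarity(doc, "and"))
print(paice_similarity(doc, "or"))
```

Unlike MMM, every term membership contributes to the score, with the weight of each term decaying geometrically by its rank when r is below 1.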

AICCSA06 14 Comparison of Fuzzy Systems Experiments on Qavanin Collection

AICCSA06 15 Probabilistic Systems (BM25) Experiments on Qavanin Collection

AICCSA06 16 Comparison of Vector Space Systems With BM25 Experiments on Qavanin Collection

AICCSA06 17 Comparison of Best Vector Space With Best N-grams Experiments on Qavanin Collection

AICCSA06 18 FuFaIR
- The query is considered as a fuzzy set of relevant documents in the database
- The documents are sent to the client sorted by their degree of membership in the query's fuzzy set
- The larger the value of µ_i, the more relevant the document is to the query

AICCSA06 19 FuFaIR (Cont.)
- Each term is assigned a membership degree to a document based on the importance of that term for representing the document’s content.
- The membership degree can be computed with classical IR parameters such as tf/idf
- The input query is considered as an algebraic sentence whose elements are:
  - Terms
  - Fuzzy operators such as AND, OR, and NOT
- Applying the operators to the terms yields the final fuzzy set
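
The idea of treating a query as an algebraic sentence over terms and fuzzy operators can be illustrated with a small evaluator. This is a sketch under stated assumptions, not FuFaIR itself: it uses the standard MIN, MAX and 1 - x operators, the nested-tuple query format and the names `evaluate` and `rank` are invented for this example, and the per-term membership degrees are hypothetical numbers rather than values produced by the authors' weighting formula.

```python
def evaluate(query, doc_memberships):
    """Degree of membership of one document in the fuzzy set defined by the query.

    query: a term (str) or a tuple (operator, operand, ...) with operator
           "and", "or" or "not"; operands may be nested queries.
    doc_memberships: dict mapping term -> membership degree in [0, 1].
    """
    if isinstance(query, str):
        return doc_memberships.get(query, 0.0)  # absent term: degree 0
    operator, *operands = query
    values = [evaluate(operand, doc_memberships) for operand in operands]
    if operator == "and":
        return min(values)
    if operator == "or":
        return max(values)
    if operator == "not":
        if len(values) != 1:
            raise ValueError("'not' takes exactly one operand")
        return 1.0 - values[0]
    raise ValueError(f"unknown operator: {operator}")


def rank(query, collection):
    """Return (doc_id, degree) pairs sorted by decreasing membership degree."""
    scored = [(doc_id, evaluate(query, m)) for doc_id, m in collection.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical collection: membership degrees of each term in each document.
collection = {
    "d1": {"iran": 0.8, "usa": 0.6},
    "d2": {"iran": 0.5},
    "d3": {"usa": 0.9},
}
for doc_id, degree in rank(("and", "iran", ("not", "usa")), collection):
    print(doc_id, round(degree, 3))
```

The ranked list returned by `rank` corresponds to the sorted fuzzy set of documents that is sent back to the client.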

AICCSA06 20 FuFaIR (Cont.)
- The membership degree of a document to an individual term is defined as follows in our method:
  f_t,d = frequency of term t in document d
  idf(t) = inverse document frequency of term t

AICCSA06 21 Overview
- Persian Language
- Related Work
- Fuzzy IR
- Farsi IR
- Fuzzy Logic Overview
- FuFaIR Explanation
- Experimental Results
- Conclusion and Future Work

AICCSA06 22 Experimental Results
Parameters:
- The Hamshahri corpus has been used
- Total size of the collection: 300+ MB
- Indexing has been performed after stop word elimination
- No stemming has been applied
- 30 queries have been used for these experiments
- Precision has been computed for the top 20 retrieved documents.

AICCSA06 23 Experimental Results (Cont.)
Some sample queries:
- The Bidel music group concert: کنسرت موسيقي گروه بيدل
- Iran AND USA relations: روابط ايران و امريکا
- Economic benefit of Iran’s agriculture: سود اقتصادي کشاورزي ايران
- The punishment of doping in swimming: مجازات دوپينگ در شنا
- Cancer treatment methods: روشهاي درمان سرطان
- Classical music in Iran: موسيقي کلاسيک در ايران

AICCSA06 24 Experimental Results (Cont.)
- As a benchmark, the best Persian retrieval model so far has been selected: the Vector Space model with the Lnu.ltu weighting scheme.
- The pivot and slope parameters have been set to 13.36 and 0.75, respectively.
- The effectiveness of these values has been shown by previous work (see paper).
- To measure the performance of each run, precision at document cut-offs of 5, 10, 15 and 20 has been calculated and averaged over all 30 queries.
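
The evaluation measure described above (precision at fixed document cut-offs, averaged over queries) can be sketched as follows. The function names and the toy data are assumptions for illustration; they are not taken from the authors' evaluation scripts.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    top_k = ranked_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / k  # divide by k even if fewer than k documents were retrieved


def mean_precision_at_cutoffs(runs, cutoffs=(5, 10, 15, 20)):
    """Average precision@k over all queries, for each cut-off k.

    runs: list of (ranked_ids, relevant_ids) pairs, one pair per query.
    """
    return {
        k: sum(precision_at_k(ranked, relevant, k) for ranked, relevant in runs) / len(runs)
        for k in cutoffs
    }


# Two hypothetical queries over documents numbered 0..19.
runs = [
    (list(range(20)), {0, 1, 2, 3, 4, 10}),  # six relevant documents retrieved
    (list(range(20)), set()),                # no relevant documents retrieved
]
print(mean_precision_at_cutoffs(runs))
```

In the setting of the slides, `runs` would hold 30 entries, one per query, for each retrieval system being compared.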

AICCSA06 25 Experimental Results (Cont.) Comparison Results:

AICCSA06 26 Conclusion & Future Work
Conclusion
- Main contribution of this paper: design, implementation and testing of FuFaIR, a fuzzy retrieval system for the Persian language.
- Fuzzy quantifiers are also added to the original model to provide more flexibility
- In comparison with Vector Space, FuFaIR shows significantly better performance
Future work:
- Testing different interpretations of the fuzzy operators on the Persian corpora
- Examining the true value and contribution of a Persian stemmer in retrieval.

AICCSA06 27 Questions ?

AICCSA06 28 Conception of Fuzzy Logic
- Many decision-making and problem-solving tasks are too complex to be defined precisely
- However, people succeed by using imprecise knowledge
- Fuzzy logic resembles human reasoning in its use of approximate information and uncertainty to generate decisions.

AICCSA06 29 Natural Language
- Consider:
  - Joe is tall -- what is tall?
  - Joe is very tall -- how does this differ from tall?
- Natural language (like most other activities in life and indeed the universe) is not easily translated into the absolute terms of 0 (“false”) and 1 (“true”).

AICCSA06 30 Fuzzy Logic
- An approach to uncertainty that combines real values [0…1] and logic operations
- Fuzzy logic is based on the ideas of fuzzy set theory and fuzzy set membership often found in natural (e.g., spoken) language.

AICCSA06 31 Example: “Young”
- Example:
  - Ann is 28, 0.8 in set “Young”
  - Bob is 35, 0.1 in set “Young”
  - Charlie is 23, 1.0 in set “Young”
- Unlike statistics and probabilities, the degree does not describe the probability that the item is in the set, but rather the extent to which the item is in the set.

AICCSA06 32 Membership function of fuzzy logic
[Figure: degree of membership (DOM), from 0 to 1 with 0.5 marked, plotted against Age for the fuzzy sets Young, Middle and Old.]
Fuzzy values have associated degrees of membership in the set.

AICCSA06 33 Benefits of fuzzy logic
- You want the value to switch gradually as Young becomes Middle and Middle becomes Old. This is the idea of fuzzy logic.
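
The gradual switch between Young, Middle and Old can be sketched with trapezoidal membership functions. The breakpoints below (25, 40, 50, 65 and an upper age of 120) are hypothetical values chosen only for illustration; they are not read from the slide's figure and are not meant to reproduce the degrees quoted in the “Young” example.

```python
def trapezoid(x, a, b, c, d):
    """Trapezoidal membership: 0 up to a, rising a..b, 1 on [b, c], falling c..d, 0 after d."""
    if b <= x <= c:
        return 1.0  # plateau is checked first so shoulders with a == b or c == d work
    if x <= a or x >= d:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    return (d - x) / (d - c)


# Hypothetical breakpoints (in years) for the three fuzzy sets.
def young(age):
    return trapezoid(age, 0, 0, 25, 40)


def middle(age):
    return trapezoid(age, 25, 40, 50, 65)


def old(age):
    return trapezoid(age, 50, 65, 120, 120)


for age in (20, 32.5, 45, 57.5, 70):
    print(age, young(age), middle(age), old(age))
```

Between two neighbouring sets the degrees change smoothly, so a person aged 32.5 under these assumed breakpoints is partly Young and partly Middle instead of flipping from one crisp category to the other.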

AICCSA06 34 Fuzzy Set Operations
- Fuzzy OR (∪): the union of two fuzzy sets is the element-wise maximum (MAX) of the two sets.
- E.g.
  A = {1.0, 0.20, 0.75}
  B = {0.2, 0.45, 0.50}
  A ∪ B = {MAX(1.0, 0.2), MAX(0.20, 0.45), MAX(0.75, 0.50)} = {1.0, 0.45, 0.75}

AICCSA06 35 Fuzzy Set Operations
- Fuzzy AND (∩): the intersection of two fuzzy sets is the element-wise minimum (MIN) of the two sets.
- E.g.
  A ∩ B = {MIN(1.0, 0.2), MIN(0.20, 0.45), MIN(0.75, 0.50)} = {0.2, 0.20, 0.50}

AICCSA06 36 Fuzzy Set Operations
- The complement of a fuzzy variable with DOM x is (1-x).
- Complement: the complement of a fuzzy set is composed of the complements of all its elements.
- Example:
  A^c = {1 – 1.0, 1 – 0.2, 1 – 0.75} = {0.0, 0.8, 0.25}
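
The three fuzzy set operations from the last slides can be checked with a short Python sketch that reuses the example sets A and B. The function names are assumptions for illustration, and fuzzy sets are represented simply as lists of membership degrees over the same elements.

```python
def fuzzy_union(a, b):
    """Fuzzy OR: element-wise maximum of two fuzzy sets over the same elements."""
    if len(a) != len(b):
        raise ValueError("both fuzzy sets must cover the same elements")
    return [max(x, y) for x, y in zip(a, b)]


def fuzzy_intersection(a, b):
    """Fuzzy AND: element-wise minimum of two fuzzy sets over the same elements."""
    if len(a) != len(b):
        raise ValueError("both fuzzy sets must cover the same elements")
    return [min(x, y) for x, y in zip(a, b)]


def fuzzy_complement(a):
    """Fuzzy NOT: 1 - x for every degree of membership x."""
    return [1.0 - x for x in a]


# The example sets from the slides.
A = [1.0, 0.20, 0.75]
B = [0.2, 0.45, 0.50]
print(fuzzy_union(A, B))
print(fuzzy_intersection(A, B))
print(fuzzy_complement(A))
```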