Designing Cross-Language Information Retrieval System using various Techniques of Query Expansion and Indexing for Improved Performance
Zheng Wang

Hello everyone. Today I'm going to present a paper on cross-language information retrieval (CLIR).
Content
Introduction
Literature Review
Problem Statement
Proposed Approach
Conclusion

This paper gives a brief overview of cross-language information retrieval and related techniques in the field. The talk has five parts. First, we'll introduce what cross-language information retrieval is. Then we'll briefly review some papers in the field. After that, we'll look at the problems current systems have, and then at the approach proposed to address them. Finally, we'll give the conclusion of the proposed approach.
Overview
Cross-language information retrieval permits users to retrieve documents in a language other than that of the query.

In general, the query and the retrieved documents do not have to be in the same language. For example, you can use a Chinese query to retrieve both English and Chinese documents.
Literature Review
“Query Expansion for Cross Language Information Retrieval Improvement”
CLIR module: adds a translation service to the IR engine.
QE module: aims to provide corresponding expansion terms for the query terms.

In this part, we review four important papers that address different aspects of the field. We will go through the main concepts and techniques used in these papers rather than every detail.

The first paper is Query Expansion for … It mainly discusses how query expansion (QE) can be used to overcome the difference between translated data and human language. Basically, query expansion is the process by which a search engine adds search terms to a user's weighted search in order to improve precision and/or recall.

There are two important concepts in this paper. The first is the CLIR module, which adds a translation service to the IR engine. This module translates the contents before indexing them. The original text is kept in the engine's memory so that the original version can be presented to users; only the translation is indexed. The second concept is the QE module. It aims … The engine then searches the index of translated metadata with the expanded query from the QE module.
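To make the interplay of the two modules concrete, here is a minimal sketch, not the paper's implementation. It assumes a toy word-by-word French-to-English dictionary (`TRANSLATIONS`) and a hand-written expansion table (`EXPANSIONS`); both, like the class name `ClirEngine`, are made up for illustration.

```python
from collections import defaultdict

# Hypothetical toy resources; neither comes from the paper.
TRANSLATIONS = {"chat": "cat", "chaton": "kitten", "chien": "dog", "noir": "black"}
EXPANSIONS = {"cat": ["kitten"], "dog": ["puppy"]}


def translate(text):
    """Word-by-word translation; unknown words pass through unchanged."""
    return [TRANSLATIONS.get(word, word) for word in text.lower().split()]


class ClirEngine:
    def __init__(self):
        self.originals = {}            # doc_id -> original text, kept for display
        self.index = defaultdict(set)  # translated term -> doc_ids

    def add(self, doc_id, text):
        """CLIR module: translate before indexing, index only the translation."""
        self.originals[doc_id] = text
        for term in translate(text):
            self.index[term].add(doc_id)

    def search(self, query):
        """QE module: expand the query, then search the translated index."""
        terms = query.lower().split()
        expanded = list(terms)
        for term in terms:
            expanded.extend(EXPANSIONS.get(term, []))
        hits = set()
        for term in expanded:
            hits |= self.index.get(term, set())
        # Results are shown to the user in their original language.
        return [self.originals[doc_id] for doc_id in sorted(hits)]


engine = ClirEngine()
engine.add(1, "chat noir")
engine.add(2, "chaton")
engine.add(3, "chien")
print(engine.search("cat"))
```

Without the expansion step, the query "cat" would match only document 1; the expansion term "kitten" is what brings in document 2.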
Literature Review
“Enhanced Query Expansion in English-Arabic CLIR”
Query expansion proved to be effective.
Using the top retrieved documents, the concept of a query is enhanced by adding related terms to its context.
Co-occurrence analysis assumption: if two terms co-occur, they tend to be related.
Apply disambiguation to reach optimum effectiveness: not all expanded terms are necessarily related to the query.

The paper reports that query expansion proved to be effective. Using the top … The concept of a query … Related terms are extracted from these documents using co-occurrence analysis. Co-occurrence analysis assumes … The final expanded query is formed by adding those terms. Furthermore, optimum effectiveness can be reached by applying disambiguation. It assumes that not all … In this case, only the terms that are directly related to the initial query are considered and analyzed.
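As a rough illustration of expansion by co-occurrence, the sketch below counts how often each word appears in the same top-ranked document as a query term and appends the most frequent ones. It is a simplification under stated assumptions: co-occurrence is measured at document level, there is no stop-word removal or disambiguation step, and the function name `expand_query` and its `n_terms` parameter are hypothetical.

```python
from collections import Counter


def expand_query(query_terms, top_docs, n_terms=2):
    """Append the n_terms words that co-occur most often with the query terms."""
    query = set(query_terms)
    cooccurrence = Counter()
    for doc in top_docs:
        words = set(doc.lower().split())
        shared = len(query & words)  # how many query terms this document contains
        if shared == 0:
            continue
        for word in words - query:
            cooccurrence[word] += shared
    # Highest count first; ties broken alphabetically so the result is stable.
    ranked = sorted(cooccurrence.items(), key=lambda item: (-item[1], item[0]))
    return list(query_terms) + [word for word, _ in ranked[:n_terms]]


top_docs = ["oil price rises", "oil price falls", "oil export"]
print(expand_query(["oil"], top_docs, n_terms=1))
```

In this toy run "price" co-occurs with "oil" in two documents and every other word in only one, so "price" is the term that gets added.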
Literature Review
“Hindi - English Based Cross Language Information Retrieval System for Allahabad Museum”
Conversion of Hindi words to English words: a dictionary database
Retrieval of documents and images: vector-based model
Clustering: a method to process the data and classify documents into different classes and groups
Spell checkers: edit distance algorithm
Conversion of English documents to Hindi documents

Okay, the third paper is … The paper describes a system developed for CLIR. The documents and images used for query processing relate to the Allahabad Museum. The languages used were Hindi and English, and the documents were stored in English. The user can enter a query keyword in either Hindi or English. The system retrieves the relevant documents and displays them in the desired language.

To convert Hindi words to English words, a dictionary database was used. For words not in the dictionary database, a Hindi-character-to-English-character mapping was used.

To retrieve documents and images, a vector-based model was used. All the documents were in English. The documents are preprocessed by stemming and stop-word removal, after which each document is represented by a vector in the vector space. Once the documents have been retrieved, the output is shown in the language of the user's choice. If that is English, no conversion is needed; if it is Hindi, the conversion is done through an API.

The paper also uses a technique called clustering, a method that processes the data and classifies documents into different classes and groups, so that we know which object lies in which category and can find similarities easily. Applying clustering in the vector space reduced the search time complexity.

The paper also discusses the importance of spell checkers. The authors worked with the web as a corpus in order to design a flexible and dynamic spell checker, using the edit distance algorithm.
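The edit distance mentioned for the spell checker can be sketched as follows. This is the standard Levenshtein distance, offered as one plausible reading of "edit distance algorithm"; the paper's exact variant is not stated in the talk. The `suggest` helper, its `max_distance` parameter, and the small fixed vocabulary are assumptions for illustration, whereas the paper draws its vocabulary from the web.

```python
def edit_distance(a, b):
    """Levenshtein distance, computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # delete char_a
                            curr[j - 1] + 1,      # insert char_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def suggest(word, vocabulary, max_distance=2):
    """Return the closest vocabulary word within max_distance, or None."""
    candidates = [(edit_distance(word, entry), entry) for entry in vocabulary]
    candidates = [pair for pair in candidates if pair[0] <= max_distance]
    return min(candidates)[1] if candidates else None


print(edit_distance("kitten", "sitting"))
print(suggest("musem", ["museum", "music", "statue"]))
```

A misspelled query keyword such as "musem" is one insertion away from "museum", so that entry is preferred over "music", which needs two substitutions.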
Literature Review
“Cross Language Information Retrieval: In Indian Language Perspective”

This paper is simpler than the others, so we will go through it quickly. I give its name here for your reference. It reviews the work done by various researchers on CLIR systems for Indian languages.
Problem Statement
The existing CLIR system for English-Hindi has limitations, including the following:
Translation disambiguation
Out-of-vocabulary problem
Translating phrases
Named entities

This paper focuses on developing a better CLIR system for the Hindi language. These are some of the limitations of the current system. The authors are trying to develop a new system that overcomes them.
Proposed Approach
Processing the query
Building the index
A dictionary based on a particular domain.
Index of English words and their corresponding Hindi words.

Okay, let's see what approach they propose for developing a better system. The project is divided into two parts: first, processing the query, and second, building the index.

Query processing is done to simplify the query. The query can be in both English and Hindi. A count of the Hindi and English words is kept, so that during retrieval more weight can be given to documents in the language with the higher score.

The second module consists of an index that contains the English words, their corresponding Hindi words, and the relationships between them. The documents are processed and stored in the index. All terms are stored in English as well as in Hindi. Each English term also holds the term ID of its Hindi counterpart and the documents that contain the English term, and vice versa. So when a user types a query in English and wants the answer in Hindi, the Hindi term ID is selected and the Hindi documents corresponding to that term are retrieved. Here is the architecture of the system.
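As a rough illustration of the index described in these notes (a sketch under assumptions, not the authors' architecture), the code below stores terms in both languages, links each term ID to the ID of its translation, and keeps a posting set per term. The class and function names are hypothetical, and the Hindi test in `count_languages` simply checks for characters in the Unicode Devanagari block (U+0900 to U+097F).

```python
from collections import Counter, defaultdict


class BilingualIndex:
    def __init__(self):
        self.term_ids = {}                # (lang, term) -> term id
        self.linked = {}                  # term id -> id of its translation
        self.postings = defaultdict(set)  # term id -> ids of documents with the term

    def _term_id(self, lang, term):
        # Reuse the existing id, or assign the next free one.
        return self.term_ids.setdefault((lang, term), len(self.term_ids))

    def link(self, english, hindi):
        """Record that an English term and a Hindi term translate each other."""
        en_id = self._term_id("en", english)
        hi_id = self._term_id("hi", hindi)
        self.linked[en_id], self.linked[hi_id] = hi_id, en_id

    def add_document(self, doc_id, lang, text):
        for term in text.lower().split():
            self.postings[self._term_id(lang, term)].add(doc_id)

    def search(self, term, query_lang, answer_lang):
        term_id = self.term_ids.get((query_lang, term.lower()))
        if term_id is not None and answer_lang != query_lang:
            term_id = self.linked.get(term_id)  # follow the cross-language link
        if term_id is None:
            return set()
        return set(self.postings.get(term_id, set()))


def count_languages(query):
    """Count Hindi and English words in a query by script."""
    counts = Counter()
    for word in query.split():
        is_hindi = any("\u0900" <= ch <= "\u097f" for ch in word)
        counts["hi" if is_hindi else "en"] += 1
    return counts


index = BilingualIndex()
index.link("museum", "संग्रहालय")
index.add_document("d1", "en", "museum gallery")
index.add_document("d2", "hi", "संग्रहालय")
print(index.search("museum", "en", "hi"))
print(count_languages("museum संग्रहालय coins"))
```

The counts returned by `count_languages` could then be used to weight documents in the language with the higher count; how that weighting is applied is left open here, as the talk does not specify it.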
Conclusion
For an input query in Hindi or English for a particular domain, the documents will be retrieved.
The documents can be in both English and Hindi.
Precision and recall will be used as accuracy measures to evaluate the system.

Here is the conclusion for the new system. In summary, the documents can be in both English and Hindi, and precision and recall will be used to evaluate the system.
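For reference, precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that were retrieved. A minimal sketch of both measures follows; the function name and the document IDs are made up for illustration, and returning 0.0 for an empty set is a convention chosen here, not something the paper specifies.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Four documents retrieved, three relevant in total, two in common.
precision, recall = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
print(precision, recall)
```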
Review
What is a cross-language information retrieval system
Query expansion, CLIR module, QE module
Conversion between two languages, clustering
Problem and approach

Okay, finally, a very brief review of what we have learned. First we talked about what a CLIR system is. Then we reviewed four papers, where we met concepts such as query expansion, the CLIR module, and the QE module. Then we talked about how to enhance query expansion. In the third paper, we saw how to convert between two languages and how to reduce search complexity by applying clustering in the vector space. Finally, we presented some approaches to the current problems.
Thanks! Hope it’s helpful.