A Brief Survey on Cross-language Information Retrieval (CLIR) - Text Retrieval Perspective by Ying Alvarado (24401693) CSE 8337 Lecturer : Dr. Margaret.

Presentation transcript:

A Brief Survey on Cross-language Information Retrieval (CLIR) - Text Retrieval Perspective
by Ying Alvarado (24401693)
CSE 8337, Lecturer: Dr. Margaret Dunham
April 26, 2007

2 Outline
- Introduction
  - Concept
  - Why important
- Approach
  - CLIR problems
  - Resource
  - Approaches
  - Example techniques
- A CLIR application system
- CLIR effectiveness
- CLIR future tasks
- CLIR communities
- References

3 Cross Language IR
In this presentation, we focus on text IR only!
- Definition: users enter their query in one language and the system retrieves relevant documents in other languages. For example, a user may pose a query in English but retrieve relevant documents written in French.
- Example CLIR applications:
  - Cross-language retrieval from texts
  - Cross-language retrieval from audio and images
[1] Wikipedia
[2] Paul Clough, Bridging the language gap: making digital collections available to a multilingual society, presentation, 2005

4 Monolingual vs. Bilingual vs. Multilingual
- Monolingual IR: documents and user requests are in the same language.
  Request (L1) → IR system over Documents (L1) → Results (L1)
- Cross-language IR (bilingual IR): documents and user requests are in different languages.
  Request (L1, source language) → Cross-language IR (CLIR) system over Documents (L2) → Results (L2, target language)
[2] Paul Clough, Bridging the language gap: making digital collections available to a multilingual society, presentation, 2005

5 Monolingual vs. Bilingual vs. Multilingual (cont.)
- Multilingual IR: the collection holds documents in different languages (e.g. the Web), and search requests may be in any language.
  Request (L?) → Multilingual IR (MLIR) system over Documents (L2), Documents (L3), Documents (L4) → Results (L2, L3 or L4)

6 Why CLIR?
Top Ten Languages Used in the Web (Number of Internet Users by Language), Mar. 10, 2007

TOP TEN LANGUAGES IN THE INTERNET | % of all Internet Users | Internet Users by Language | Internet Penetration by Language | Internet Growth for Language ( ) | 2007 Estimate World Population for the Language
English | 29.5 % | 328,666,… | … | 139.6 % | 1,143,218,916
Chinese | 14.3 % | 159,001,… | … | 392.2 % | 1,351,737,925
Spanish | 8.0 % | 88,920,… | … | 260.3 % | 439,284,783
Japanese | 7.7 % | 86,300,… | … | 83.3 % | 128,646,345
German | 5.3 % | 58,711,… | … | 113.2 % | 96,025,053
French | 5.0 % | 55,521,… | … | 355.2 % | 387,820,873
Portuguese | 3.6 % | 40,216,… | … | 430.8 % | 234,099,347
Korean | 3.1 % | 34,120,… | … | 79.2 % | 74,811,368
Italian | 2.8 % | 30,763,… | … | 133.1 % | 59,546,696
Arabic | 2.6 % | 28,540,… | … | 931.8 % | 340,548,157
TOP TEN LANGUAGES | 81.7 % | 910,762,… | … | 181.4 % | 4,255,739,462
Rest of World Languages | 18.3 % | 203,511,… | … | 444.5 % | 2,318,926,955
WORLD TOTAL | 100.0 % | 1,114,274,… | … | 208.7 % | 6,574,666,417

[3] Internet World Stats

7 Why CLIR? (cont.)
- A collection may contain documents in many different languages, e.g. the Web. It would be impractical to form a query in each language.
- A document may be written in more than one language. For example:
  - Technical documents in which English jargon appears intermixed with narrative text in another language.
  - Academic works which cite the titles of documents in different languages.
- The user is not sufficiently fluent to express a query in a language, but is able to make use of the documents that are identified.
- The user is monolingual and wants to query in their native language, because they:
  - can judge relevance even if the results are not translated
  - have access to document translation
[4] D.W. Oard, A Survey of Multilingual Text Retrieval. Computer Science Technical Report Series; Vol. CS-TR
[2] Paul Clough, Bridging the language gap: making digital collections available to a multilingual society, presentation, 2005

8 CLIR problems
- Handling non-ASCII character sets
- Untranslatable search keys (out-of-vocabulary, OOV), e.g. compound words, proper names, special terms
- Multi-word concepts, e.g. phrases and idioms
- Ambiguity, e.g. homonymy and polysemy
- Word inflections, e.g. plurals and gender
[5] Ari Pirkola, et al. Dictionary-Based Cross-Language Information Retrieval: Problems, Methods, and Research Findings. Information Retrieval, Vol
[2] Paul Clough, Bridging the language gap: making digital collections available to a multilingual society, presentation, 2005

9 Resources for Translation
- Ontology: a representation of concepts and relationships.
- Thesaurus: most commonly, a listing of words with similar, related, or opposite meanings. It does not include definitions of the words.
- Bilingual dictionary: a list of words together with additional word-specific information.
- Bilingual controlled vocabulary: a carefully selected list of words and phrases used to tag units of information (documents or works) so that they can be retrieved more easily by a search.
- Corpora: the document collection itself.
[6] Jimmy Lin, Cross-Language and Multimedia Information Retrieval. Slides for LBSC 796/INFM 718R
[4] D.W. Oard, A Survey of Multilingual Text Retrieval. Computer Science Technical Report Series; Vol. CS-TR
[1] Wikipedia. Related pages.
[7] Metamodel.com. What are the differences between a vocabulary, a taxonomy, a thesaurus, an ontology, and a meta-model?

10 An example of controlled vocabulary
- The hierarchical relationships (BT = broader term, NT = narrower term):
  Women’s Pants
    BT Pants
    NT Casual Pants
    NT Dress Pants
    NT Sports Pants
- The equivalence relationship
[14] Boxes and Arrows
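
The BT/NT entries above can be held in a small data structure. The following is a minimal sketch, not taken from the slides: the `Term` class, the `preferred_term` helper and the synonym "Women's Trousers" are all hypothetical, added only to show how hierarchical and equivalence relationships might be stored and looked up.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    """One preferred term in a controlled vocabulary (hypothetical structure)."""
    name: str
    broader: list = field(default_factory=list)    # BT entries
    narrower: list = field(default_factory=list)   # NT entries
    use_for: list = field(default_factory=list)    # equivalent, non-preferred terms


vocabulary = {
    "Women's Pants": Term(
        name="Women's Pants",
        broader=["Pants"],
        narrower=["Casual Pants", "Dress Pants", "Sports Pants"],
        use_for=["Women's Trousers"],  # invented synonym, for illustration only
    ),
}


def preferred_term(word, vocab):
    """Map a word to its preferred term through the equivalence relationship."""
    for term in vocab.values():
        if word == term.name or word in term.use_for:
            return term.name
    return None  # the word is not covered by the vocabulary


print(preferred_term("Women's Trousers", vocabulary))
print(vocabulary["Women's Pants"].narrower)
```

Tagging and searching both go through `preferred_term`, which is what lets a controlled vocabulary retrieve a document regardless of which synonym the searcher typed.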

11 What to translate?
- Document translation
  - Text translation, e.g. translate the entire document collection into English → search the collection in English
  - Vector translation
- Query translation, e.g. translate an English query into a Chinese query → search the Chinese document collection
[6] Jimmy Lin, Cross-Language and Multimedia Information Retrieval. Slides for LBSC 796/INFM 718R. 2006

12 Tradeoffs
- Document translation
  - Documents can be translated and stored offline
  - Dependent on a high-quality automatic machine translation (MT) system
  - Does not easily deal with changing document sets
- Query translation
  - Often easier
  - Disambiguation of query terms may be difficult with short queries
[6] Jimmy Lin, Cross-Language and Multimedia Information Retrieval. Slides for LBSC 796/INFM 718R
[4] D.W. Oard, A Survey of Multilingual Text Retrieval. Computer Science Technical Report Series; Vol. CS-TR
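
To make the query-translation side of the tradeoff concrete, here is a minimal sketch of word-by-word translation with a bilingual dictionary. It is an illustration under stated assumptions, not a method from the slides: the dictionary entries and the `translate_query` helper are invented, every candidate translation is kept (no disambiguation), and untranslatable words are passed through unchanged.

```python
# Toy English-to-French dictionary; every entry is invented for illustration.
BILINGUAL_DICT = {
    "bank": ["banque", "rive"],  # ambiguous: financial institution vs. river bank
    "loan": ["prêt"],
}


def translate_query(query, dictionary):
    """Translate a query word by word, keeping all candidate translations."""
    translated = []
    for word in query.lower().split():
        # Out-of-vocabulary words (e.g. proper names) are kept as they are.
        translated.extend(dictionary.get(word, [word]))
    return translated


print(translate_query("Bank loan Sheffield", BILINGUAL_DICT))
```

With only two or three query words there is little context to decide between "banque" and "rive", which is the disambiguation difficulty named in the tradeoffs.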

13 Approaches to query translation
- Knowledge-based: several aspects of domain knowledge are manually encoded into a lexicon.
  - Ontology-based (concept driven)
  - Thesaurus-based
  - Dictionary-based
  - Drawbacks: lexicons are expensive to construct and lag behind the common use of terminology.
- Corpus-based: directly exploit statistical information about term usage in a corpus; automatically construct the lexicon.
  - Parallel corpora: document pairs, sentence pairs, term pairs
  - Comparable corpora: document pairs, similar content
  - Unaligned corpora: documents from the same domain, not translations of one another, not linked in any other way
[8] Miguel E. Ruiz, CLIR. Slides for school seminars
[9] Rada Mihalcea, Information Retrieval and Web Search. Class slides
[4] D.W. Oard, A Survey of Multilingual Text Retrieval. Computer Science Technical Report Series; Vol. CS-TR

14 Applying monolingual IR techniques
- Query expansion
- Relevance feedback
- Stemming
- Latent semantic analysis
- Parsing
- Part of speech tagging
- ……
[4] D.W. Oard, A Survey of Multilingual Text Retrieval. Computer Science Technical Report Series; Vol. CS-TR

15 Multilingual Thesauri
- Three construction techniques:
  - Build it from scratch
  - Translate an existing thesaurus
  - Merge monolingual thesauri
- For example, EuroWordNet:
  - 7 languages
  - Built from existing lexical resources
  - Has the same structure as Princeton WordNet
[8] Miguel E. Ruiz, CLIR. Slides for school seminars
[9] Rada Mihalcea, Information Retrieval and Web Search. Class slides. 2007

16 Pseudo-Relevance Feedback
- Also called blind feedback
- Assumes that the top n documents in the result set actually are relevant
- Steps:
  1. Enter query terms in French
  2. Find top French documents in parallel corpus
  3. Construct a query from English translations
  4. Perform a monolingual free text search
- Diagram: French Query Terms → French Text Retrieval System (Parallel Corpus) → Top ranked French Documents → English Translations → AltaVista → English Web Pages
[9] Rada Mihalcea, Information Retrieval and Web Search. Class slides. 2007
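
The four steps can be sketched in a few lines. This is a toy illustration under stated assumptions, not the system shown on the slide: the three-document parallel corpus is invented, ranking is plain term overlap instead of a real retrieval model, and `build_english_query` is a hypothetical helper. The final monolingual search (step 4) is left out.

```python
from collections import Counter

# Tiny parallel corpus of (French, English) document pairs, invented for illustration.
PARALLEL_CORPUS = [
    ("recherche d'information multilingue", "multilingual information retrieval"),
    ("traduction automatique de documents", "machine translation of documents"),
    ("recherche d'information sur le web", "web information retrieval"),
]


def overlap(query_terms, text):
    """Count how many query terms appear in the text (stand-in for a real ranker)."""
    words = set(text.split())
    return sum(1 for term in query_terms if term in words)


def build_english_query(french_query, corpus, top_n=2, num_terms=3):
    """Steps 1-3: rank the French side, then mine query terms from the English side."""
    query_terms = french_query.split()
    ranked = sorted(corpus, key=lambda pair: overlap(query_terms, pair[0]), reverse=True)
    counts = Counter()
    for _, english_text in ranked[:top_n]:  # blind feedback: assume the top n are relevant
        counts.update(english_text.split())
    return [term for term, _ in counts.most_common(num_terms)]


print(build_english_query("recherche d'information", PARALLEL_CORPUS))
```

The returned English terms would then be submitted to an ordinary English search engine, so no dictionary or MT system is needed at query time.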

17 Different level alignment in parallel corpora
- Document alignment
  - Already exists
  - Collected from existing corpora
  - Examine document external features
  - Examine document internal features
- Sentence alignment
  - Easily constructed from aligned documents
  - Match pattern of relative sentence lengths
  - Good first step for term alignment
- Term alignment
  - Using co-occurrence-based translation
[9] Rada Mihalcea, Information Retrieval and Web Search. Class slides. 2007

18 Example of term alignment
CSE8337 是一门关于信息存储和检索的课程。
CSE8337 is a class about information storage and retrieval.

19 Co-occurrence-based translation
- Align terms using co-occurrence statistics
  - The assumption is that the correct translations of query terms tend to co-occur in target language documents
  - How often does a term pair occur in sentence pairs?
  - Weighted by relative position in the sentences
  - Retain term pairs that occur unusually often
[9] Rada Mihalcea, Information Retrieval and Web Search. Class slides. 2007
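
One way to score "occurs unusually often" is an association measure over aligned sentence pairs. The sketch below uses the Dice coefficient, which is a choice made here for illustration and is not named on the slide; the sentence pairs are invented, and the position weighting mentioned above is omitted for brevity.

```python
from collections import Counter
from itertools import product

# Invented English-French sentence pairs, for illustration only.
SENTENCE_PAIRS = [
    ("information retrieval", "recherche information"),
    ("information storage", "stockage information"),
    ("retrieval system", "système recherche"),
]


def dice_scores(pairs):
    """Score each (source, target) term pair by the Dice coefficient of co-occurrence."""
    src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        src_words, tgt_words = set(src.split()), set(tgt.split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        pair_count.update(product(src_words, tgt_words))
    return {
        (s, t): 2 * n / (src_count[s] + tgt_count[t])
        for (s, t), n in pair_count.items()
    }


def best_translation(word, scores):
    """Return the target term with the highest score for a source word."""
    candidates = {t: v for (s, t), v in scores.items() if s == word}
    return max(candidates, key=candidates.get)


scores = dice_scores(SENTENCE_PAIRS)
print(best_translation("retrieval", scores))
print(best_translation("storage", scores))
```

A threshold on the score would implement the last bullet, retaining only the pairs that co-occur much more often than their individual frequencies suggest.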

20 Exploiting Unaligned Corpora
Example approach: category-based translation
1. Extract a large number of terms from unaligned corpora of the first and second languages
2. Assign a category to each extracted term by accessing monolingual thesauri of the first and second languages
3. Estimate category-to-category translation probabilities
4. Estimate term-to-term translation probabilities using said category-to-category translation probabilities
[15] David Hull, Terminology translation for unaligned comparable corpora using category based translation probabilities. United States Patent Filing date: Dec 18, Issue date: Apr 26, 2005
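
The last step, going from category-level to term-level probabilities, can be illustrated with a deliberately simple rule. This sketch does not reproduce the estimation procedure of the cited patent: the category tables and probabilities are invented, each term is given exactly one category, and the probability of a target category is assumed to be shared evenly among the target terms in it.

```python
# Hypothetical thesaurus lookups: term -> category, one table per language.
SOURCE_CATEGORY = {"bank": "finance", "river": "geography"}
TARGET_CATEGORY = {"banque": "finances", "prêt": "finances", "fleuve": "géographie"}

# Assumed P(target category | source category); the numbers are made up.
CATEGORY_TRANSLATION = {
    "finance": {"finances": 0.9, "géographie": 0.1},
    "geography": {"finances": 0.1, "géographie": 0.9},
}


def term_translation_probs(source_term):
    """Spread each category-level probability evenly over the terms in that category."""
    source_cat = SOURCE_CATEGORY[source_term]
    terms_by_cat = {}
    for term, cat in TARGET_CATEGORY.items():
        terms_by_cat.setdefault(cat, []).append(term)
    probs = {}
    for cat, terms in terms_by_cat.items():
        cat_prob = CATEGORY_TRANSLATION[source_cat].get(cat, 0.0)
        for term in terms:
            probs[term] = cat_prob / len(terms)
    return probs


print(term_translation_probs("bank"))
```

Under this simplification, terms in the same category ("banque" and "prêt") are indistinguishable, so a practical system would need further evidence to rank candidates within a category.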

21 In Summary
[Figure: taxonomy of cross-language text retrieval. Node labels: Cross-Language Text Retrieval; Controlled Vocabulary; Free Text; Query Translation; Document Translation (Text Translation, Vector Translation); Knowledge-based (Ontology-based, Dictionary-based, Thesaurus-based); Corpus-based (Parallel, Comparable, Unaligned; Term-aligned, Sentence-aligned, Document-aligned)]
[8] Miguel E. Ruiz, CLIR. Slides for school seminars. 2001

22 An experimental system
Automatic construction of parallel English-Chinese corpus for CLIR
- PTMiner: a parallel text mining system that finds parallel text on the Web
Parallel Text Mining Algorithm
1. Search for candidate sites: using existing Web search engines, search for candidate sites that may contain parallel pages (by using anchor text).
2. File name fetching: for each candidate site, fetch the URLs of Web pages that are indexed by the search engines.
3. Host crawling: starting from the URLs collected in the previous step, search through each candidate site separately for more URLs.
4. Pair scan: from the obtained URLs of each site, scan for possible parallel pairs (by analyzing document external features).
5. Download and verifying: download the parallel pages; determine file size, language and character set, text length, and HTML structure; filter out non-parallel pairs.
[10] Jiang Chen, et al. Automatic construction of parallel English-Chinese corpus for cross-language information retrieval. Proceedings of the sixth conference on Applied natural language processing. 2000

23 The workflow of the mining process
- Sample anchor texts: “english version” [“in english”, ……]
- Sample document external features:
  - “file-ch.html” vs. “file-en.html”
  - “…/chinese/…/file.html” vs. “…/english/…file.html”
- Sample document internal features: character set, HTML structure
[10] Jiang Chen, et al. Automatic construction of parallel English-Chinese corpus for cross-language information retrieval. Proceedings of the sixth conference on Applied natural language processing. 2000
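
The external features listed here suggest how a pair scan over crawled URLs might look. The sketch below is a hypothetical illustration, not PTMiner's code: the `SUBSTITUTIONS` rules are modeled on the two sample patterns above, `find_candidate_pairs` is an invented helper, and the example.org URLs are placeholders.

```python
# Hypothetical (English marker, Chinese marker) rules modeled on the sample features.
SUBSTITUTIONS = [("-en.", "-ch."), ("/english/", "/chinese/")]


def find_candidate_pairs(urls):
    """Pair each English-looking URL with its Chinese counterpart if one was crawled."""
    url_set = set(urls)
    pairs = []
    for url in urls:
        for en_marker, ch_marker in SUBSTITUTIONS:
            if en_marker in url:
                candidate = url.replace(en_marker, ch_marker)
                if candidate in url_set:
                    pairs.append((url, candidate))
    return pairs


crawled = [
    "http://example.org/file-en.html",
    "http://example.org/file-ch.html",
    "http://example.org/english/a/file.html",
    "http://example.org/chinese/a/file.html",
    "http://example.org/about.html",
]
for english_url, chinese_url in find_candidate_pairs(crawled):
    print(english_url, "<->", chinese_url)
```

Pairs found this way are only candidates; the internal features (character set, HTML structure, text length) are what the verification step uses to filter out pages that are not actually parallel.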

24 An alignment example
[10] Jiang Chen, et al. Automatic construction of parallel English-Chinese corpus for cross-language information retrieval. Proceedings of the sixth conference on Applied natural language processing. 2000

25 Part of the lexicons
t: true, f: false
Other techniques and tools used:
- Encoding scheme transformation (for Chinese)
- Sentence level segmentation
- Chinese word segmentation
- English expression extraction
- SILC: language and encoding identification system
[10] Jiang Chen, et al. Automatic construction of parallel English-Chinese corpus for cross-language information retrieval. Proceedings of the sixth conference on Applied natural language processing. 2000

26 Results
- Pairs of texts
- (Lexicon) C-E has a precision of 77%; E-C has a precision of 81.5%
- CLIR results: test corpus TREC5 and TREC6 Chinese track
[10] Jiang Chen, et al. Automatic construction of parallel English-Chinese corpus for cross-language information retrieval. Proceedings of the sixth conference on Applied natural language processing. 2000

27 Does CLIR work?
- Best systems at TREC-6 (1997):
  - English-French: 49% of highest French monolingual
  - English-German: 64% of highest German monolingual
- Best systems at CLEF (2002):
  - English-French: 83% of highest French monolingual
  - English-German: 86% of highest German monolingual
- Best systems at CLEF (2006):
  - English-French: 93.82% of best French monolingual
  - English-Portuguese: 90.91% of best Portuguese monolingual
[2] Paul Clough, Bridging the language gap: making digital collections available to a multilingual society, presentation, 2005
[16] Giorgio M. Di Nunzio, CLEF 2006: Ad Hoc Track Overview. 2006
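
Each figure above expresses a cross-language run as a percentage of a monolingual baseline on the same target collection. A minimal sketch of that arithmetic follows; the slide does not name the underlying measure, so mean average precision (MAP) is assumed here, and the two scores are made up rather than taken from TREC or CLEF.

```python
def relative_effectiveness(clir_score, monolingual_score):
    """Return the cross-language score as a percentage of the monolingual baseline."""
    return 100.0 * clir_score / monolingual_score


# Made-up MAP values, for illustration only.
print(relative_effectiveness(clir_score=0.25, monolingual_score=0.5))
```

Because the measure is a ratio, it says how much is lost by crossing the language barrier but not how good the monolingual baseline itself is.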

28 Future tasks
- Extend study scope: Web pages, medical literature, USENET newsgroup articles, records of legislative and legal proceedings, …
- Lower cost, improve efficiency
  - Pay more attention to indexing-time optimizations to improve query-time efficiency
- Consider the user’s perspective
  - Improve the utility of ranked lists
- Define suitable criteria for the construction of a valid multilingual Web corpus
- Get resources for resource-poor languages
[11] D.W. Oard, When You Come to a Fork in the Road, Take It: Multiple Futures for CLIR Research. SIGIR 2002 CLIR
[12] Fredric Gey, et al, CROSS LANGUAGE INFORMATION RETRIEVAL: A RESEARCH ROADMAP. SIGIR 2002 CLIR

29 CLIR Communities
- TREC Cross Language Track: currently focuses on the Arabic language
- Cross-Language Evaluation Forum (CLEF): a spinoff from TREC, covering many European languages
- NTCIR Asian Language Evaluation: covering Chinese, Japanese and Korean
[12] Fredric Gey, et al, CROSS LANGUAGE INFORMATION RETRIEVAL: A RESEARCH ROADMAP. SIGIR 2002 CLIR

30 CLEF
In CLEF 2006, eight tracks were offered to evaluate the performance of systems:
- multilingual document retrieval on news collections (Ad-hoc)
- cross-language structured scientific data (Domain-specific)
- interactive cross-language retrieval
- multiple language question answering
- cross-language retrieval on image collections
- cross-language speech retrieval
- multilingual web retrieval
- cross-language geographic retrieval
[13] Carol Peters, Cross-Language Evaluation Forum - CLEF. D-Lib Magazine, October 2006

31 References
[1] Wikipedia
[2] Paul Clough, Bridging the language gap: making digital collections available to a multilingual society, presentation, 2005
[3] Internet World Stats
[4] D.W. Oard, A Survey of Multilingual Text Retrieval. Computer Science Technical Report Series; Vol. CS-TR
[5] Ari Pirkola, et al. Dictionary-Based Cross-Language Information Retrieval: Problems, Methods, and Research Findings. Information Retrieval, Vol
[6] Jimmy Lin, Cross-Language and Multimedia Information Retrieval. Slides for LBSC 796/INFM 718R
[7] Metamodel.com. What are the differences between a vocabulary, a taxonomy, a thesaurus, an ontology, and a meta-model?
[8] Miguel E. Ruiz, CLIR. Slides for school seminars
[9] Rada Mihalcea, Information Retrieval and Web Search. Class slides
[10] Jiang Chen, et al. Automatic construction of parallel English-Chinese corpus for cross-language information retrieval. Proceedings of the sixth conference on Applied natural language processing
[11] D.W. Oard, When You Come to a Fork in the Road, Take It: Multiple Futures for CLIR Research. SIGIR 2002 CLIR
[12] Fredric Gey, et al, CROSS LANGUAGE INFORMATION RETRIEVAL: A RESEARCH ROADMAP. SIGIR 2002 CLIR
[13] Carol Peters, Cross-Language Evaluation Forum - CLEF. D-Lib Magazine, October 2006
[14] Boxes and Arrows
[15] David Hull, Terminology translation for unaligned comparable corpora using category based translation probabilities. United States Patent Filing date: Dec 18, Issue date: Apr 26, 2005
[16] Giorgio M. Di Nunzio, CLEF 2006: Ad Hoc Track Overview. 2006

32 Thank you!