Jimmy Lin College of Information Studies University of Maryland

Slides:



Advertisements
Similar presentations
Pseudo-Relevance Feedback For Multimedia Retrieval By Rong Yan, Alexander G. and Rong Jin Mwangi S. Kariuki
Advertisements

1 Statistical Machine Translation Bonnie Dorr Christof Monz CMSC 723: Introduction to Computational Linguistics Lecture 8 October 27, 2004.
Information Extraction Lecture 12 – Multilingual Extraction CIS, LMU München Winter Semester Dr. Alexander Fraser, CIS.
Cross-Language Retrieval INST 734 Module 11 Doug Oard.
Multimedia Retrieval. Outline Audio Retrieval Spoken information Music Document Image Analysis and Retrieval Video Retrieval.
Information Extraction Lecture 9 – Multilingual Extraction CIS, LMU München Winter Semester Dr. Alexander Fraser.
Ranked Retrieval INST 734 Module 3 Doug Oard. Agenda  Ranked retrieval Similarity-based ranking Probability-based ranking.
Query Dependent Pseudo-Relevance Feedback based on Wikipedia SIGIR ‘09 Advisor: Dr. Koh Jia-Ling Speaker: Lin, Yi-Jhen Date: 2010/01/24 1.
The Challenges of Multilingual Search Paul Clough The Information School University of Sheffield ISKO UK conference 8-9 July 2013.
Search Engines and Information Retrieval
Chapter 11 Beyond Bag of Words. Question Answering n Providing answers instead of ranked lists of documents n Older QA systems generated answers n Current.
Information Retrieval Review
Cross-Language Information Retrieval INFM 718R Session 10: November 2, 2011 Douglas W. Oard.
1 CS 430: Information Discovery Lecture 22 Non-Textual Materials 2.
Morris LeBlanc.  Why Image Retrieval is Hard?  Problems with Image Retrieval  Support Vector Machines  Active Learning  Image Processing ◦ Texture.
IR Challenges and Language Modeling. IR Achievements Search engines  Meta-search  Cross-lingual search  Factoid question answering  Filtering Statistical.
Cross-Language Retrieval LBSC 796/INFM 718R Douglas W. Oard Session 12: April 27, 2011.
1 CS 502: Computing Methods for Digital Libraries Lecture 20 Multimedia digital libraries.
Cross Language IR Philip Resnik Salim Roukos Workshop on Challenges in Information Retrieval and Language Modeling Amherst, Massachusetts, September 11-12,
INFO 624 Week 3 Retrieval System Evaluation
LYU 0102 : XML for Interoperable Digital Video Library Recent years, rapid increase in the usage of multimedia information, Recent years, rapid increase.
Visual Information Retrieval Chapter 1 Introduction Alberto Del Bimbo Dipartimento di Sistemi e Informatica Universita di Firenze Firenze, Italy.
An investigation of query expansion terms Gheorghe Muresan Rutgers University, School of Communication, Information and Library Science 4 Huntington St.,
Cross-Language Retrieval LBSC 796/INFM 718R Douglas W. Oard Session 12: November 26, 2007.
1 The Web as a Parallel Corpus  Parallel corpora are useful  Training data for statistical MT  Lexical correspondences for cross-lingual IR  Early.
Overview of Search Engines
Cross-Language Retrieval INST 734 Module 11 Doug Oard.
Information Extraction Lecture 9 – Multilingual Extraction Dr. Alexander Fraser, U. Munich September 11th, 2014 ISSALE: University of Colombo School of.
Search is not only about the Web An Overview on Printed Documents Search and Patent Search Walid Magdy Centre for Next Generation Localisation School of.
Information Retrieval in Practice
A New Approach for Cross- Language Plagiarism Analysis Rafael Corezola Pereira, Viviane P. Moreira, and Renata Galante Universidade Federal do Rio Grande.
Evaluating the Contribution of EuroWordNet and Word Sense Disambiguation to Cross-Language Information Retrieval Paul Clough 1 and Mark Stevenson 2 Department.
The Evolution of Shared-Task Evaluation Douglas W. Oard College of Information Studies and UMIACS University of Maryland, College Park, USA December 4,
August 21, 2002Szechenyi National Library Support for Multilingual Information Access Douglas W. Oard College of Information Studies and Institute for.
Search Engines and Information Retrieval Chapter 1.
Multimedia Databases (MMDB)
Evaluation Experiments and Experience from the Perspective of Interactive Information Retrieval Ross Wilkinson Mingfang Wu ICT Centre CSIRO, Australia.
©2008 Srikanth Kallurkar, Quantum Leap Innovations, Inc. All rights reserved. Apollo – Automated Content Management System Srikanth Kallurkar Quantum Leap.
1 Cross-Lingual Query Suggestion Using Query Logs of Different Languages SIGIR 07.
CLEF 2004 – Interactive Xling Bookmarking, thesaurus, and cooperation in bilingual Q & A Jussi Karlgren – Preben Hansen –
“ SINAI at CLEF 2005 : The evolution of the CLEF2003 system.” Fernando Martínez-Santiago Miguel Ángel García-Cumbreras University of Jaén.
1 CS 430 / INFO 430 Information Retrieval Lecture 23 Non-Textual Materials 2.
Translingual Topic Tracking with PRISE Gina-Anne Levow and Douglas W. Oard University of Maryland February 28, 2000.
The CLEF 2003 cross language image retrieval task Paul Clough and Mark Sanderson University of Sheffield
Information Retrieval and Web Search Cross Language Information Retrieval Instructor: Rada Mihalcea Class web page:
Mining the Web to Create Minority Language Corpora Rayid Ghani Accenture Technology Labs - Research Rosie Jones Carnegie Mellon University Dunja Mladenic.
MIRACLE Multilingual Information RetrievAl for the CLEF campaign DAEDALUS – Data, Decisions and Language, S.A. Universidad Carlos III de.
1 CS 430: Information Discovery Lecture 22 Non-Textual Materials: Informedia.
Elaine Ménard & Margaret Smithglass School of Information Studies McGill University [Canada] July 5 th, 2011 Babel revisited: A taxonomy for ordinary images.
The Anatomy of a Large-Scale Hyper textual Web Search Engine S. Brin, L. Page Presenter :- Abhishek Taneja.
Collocations and Information Management Applications Gregor Erbach Saarland University Saarbrücken.
PSEUDO-RELEVANCE FEEDBACK FOR MULTIMEDIA RETRIEVAL Seo Seok Jun.
Information Retrieval at NLC Jianfeng Gao NLC Group, Microsoft Research China.
Iterative Translation Disambiguation for Cross Language Information Retrieval Christof Monz and Bonnie J. Dorr Institute for Advanced Computer Studies.
From Text to Image: Generating Visual Query for Image Retrieval Wen-Cheng Lin, Yih-Chen Chang and Hsin-Hsi Chen Department of Computer Science and Information.
Results of the 2000 Topic Detection and Tracking Evaluation in Mandarin and English Jonathan Fiscus and George Doddington.
Measuring How Good Your Search Engine Is. *. Information System Evaluation l Before 1993 evaluations were done using a few small, well-known corpora of.
Information Retrieval
The Cross Language Image Retrieval Track: ImageCLEF Breakout session discussion.
1 CS 430 / INFO 430 Information Retrieval Lecture 17 Metadata 4.
Chapter. 3: Retrieval Evaluation 1/2/2016Dr. Almetwally Mostafa 1.
Combining Text and Image Queries at ImageCLEF2005: A Corpus-Based Relevance-Feedback Approach Yih-Cheng Chang Department of Computer Science and Information.
CS798: Information Retrieval Charlie Clarke Information retrieval is concerned with representing, searching, and manipulating.
Multilingual Search Shibamouli Lahiri
1 ICASSP Paper Survey Presenter: Chen Yi-Ting. 2 Improved Spoken Document Retrieval With Dynamic Key Term Lexicon and Probabilistic Latent Semantic Analysis.
Analysis of Experiments on Hybridization of different approaches in mono and cross-language information retrieval DAEDALUS – Data, Decisions and Language,
Cross-Language Information Retrieval Applied Natural Language Processing October 29, 2009 Douglas W. Oard.
Visual Information Retrieval
Multimedia Information Retrieval
Presentation transcript:

LBSC 796/INFM 718R: Week 11 Cross-Language and Multimedia Information Retrieval Jimmy Lin College of Information Studies University of Maryland Monday, April 17, 2006

Topics covered so far… Evaluation of IR systems Inner workings of IR black boxes Interacting with retrieval systems Interfaces in support of retrieval

Questions for Today What if the collection contains documents in a foreign language? What if the collection isn’t even comprised of textual documents?

Cross-Language IR Or “finding documents in languages you can’t read” Why would you want to do it? How would you do it?

Most Widely-Spoken Languages Source: Ethnologue (SIL), 1999

Global Trade Source: World Trade Organization 2000 Annual Report

Global Internet Users Web Pages Until the year 2000, more than half of all users of the Internet spoke English as their native language. Although the total number of English speakers on the Internet continues to increase, the number of Internet users that speak a first language other than English is rising far more rapidly. The consulting firm Global Reach estimates that more that two-thirds of all Internet users presently speak a first language other than English. Native speakers, Global Reach projection for 2004 (as of Sept, 2003)

A Community: CLEF CLEF = “Cross-Language Evaluation Forum” 8 tracks at CLEF 2005 Multilingual information retrieval Cross-language information retrieval Interactive cross-language information retrieval Multiple language question answering Cross-language retrieval on image collections Cross-language spoken document retrieval Multilingual Web retrieval Cross-language geographic retrieval

The Information Retrieval Cycle If you can’t understand the documents… Source Selection How do you formulate a query? Resource How do you know something is worth looking at? source reselection Query Formulation Query How can you understand the retrieved documents? System discovery Vocabulary discovery Concept discovery Document discovery Search Ranked List Selection Documents Examination Documents Delivery

CLIR CLIR = “Cross Language Information Retrieval” Typical setup User speaks only English Wants access to documents in a foreign language (e.g., Chinese or Arabic) Requirements User needs to understand retrieved documents! Interface must support browsing of documents in foreign languages How do we do it?

Two Approaches Query translation Document translation Translate both? Translate English query into Chinese query Search Chinese document collection Translate retrieved results back into English Document translation Translate entire document collection into English Search collection in English Translate both?

Query Translation Chinese Document Collection Chinese documents Translation System Retrieval Engine Results Chinese queries select examine English queries

Document Translation Chinese Document Collection Translation Results System Results select examine English Document Collection Retrieval Engine English queries

Tradeoffs Query Translation Document Translation Which is better? Often easier Disambiguation of query terms may be difficult with short queries Translation of documents must be performed at query time Document Translation Documents can be translate and stored offline Automatic translation can be slow Which is better? Often depends on the availability of language-specific resources (e.g., morphological analyzers) Both approaches present challenges for interaction

CLIR Issues No translation! Which translation? Wrong segmentation oil petroleum probe survey take samples Wrong segmentation restrain cymbidium goeringii oil petroleum probe survey take samples

Learning to Translate Lexicons Large text collections People Phrase books, bilingual dictionaries, … Large text collections Translations (“parallel”) Similar topics (“comparable”) People

Hieroglyphic Demotic Greek

Modern Rosetta Stones Newswire: Government: DE-News (German-English) Hong-Kong News, Xinhua News (Chinese-English) Government: Canadian-Hansards (French-English) Europarl (Danish, Dutch, English, Finnish, French, German, Greek, Italian, Portugese, Spanish, Swedish) UN Treaties (Russian, English, Arabic, …) The Bible (many, many languages)

Parallel Corpus Example from DE-News (8/1/1996) English: Diverging opinions about planned tax reform German: Unterschiedliche Meinungen zur geplanten Steuerreform English: The discussion around the envisaged major tax reform continues . German: Die Diskussion um die vorgesehene grosse Steuerreform dauert an . English: The FDP economics expert , Graf Lambsdorff , today came out in favor of advancing the enactment of significant parts of the overhaul , currently planned for 1999 . German: Der FDP - Wirtschaftsexperte Graf Lambsdorff sprach sich heute dafuer aus , wesentliche Teile der fuer 1999 geplanten Reform vorzuziehen .

Word-Level Alignment English Diverging opinions about planned tax reform Unterschiedliche Meinungen zur geplanten Steuerreform German English Madam President , I had asked the administration … Señora Presidenta, había pedido a la administración del Parlamento … Spanish

Learning Translations From alignments, automatically induce a translation lexicon Multiple Translations Translation Probabilities survey 探测 (p = 0.4) 试探 (p = 0.3) 测量 (p = 0.25) 样品 (p = 0.05)

A Translation Model From word-aligned bilingual text, we induce a translation model Example: where, p(探测|survey) = 0.4 p(试探|survey) = 0.3 p(测量|survey) = 0.25 p(样品|survey) = 0.05

Using Multiple Translations Weighted Structured Query Translation Takes advantage of multiple translations and translation probabilities TF and DF of query term e are computed using TF and DF of its translations:

Experiment Setup Does weighted structured query translation work? Test collection (from CLEF 2000-2003) ~ 44,000 documents in French 153 topics in English (and French, for comparison) IR system: Okapi weights Translation resources Europarl parallel corpus: ~ 100M on each side GIZA++ Statistical MT toolkit

Does it work? Runs: Results: Monolingual baseline One-best translation baseline Weighted structured query translation Results: Weighted structured query translation always beats one-best translation Weighted structured query translation performance approaches monolingual performance

Morphology and Segmentation For the query translation approach The retrieval engine needs to perform monolingual IR in a foreign language Morphology and segmentation pose problems Good segmenters and morphological analyzers are expensive to develop N-gram indexing provides a good solution Use character n-grams based on length of average word Performs about as well as with a good segmenter

Blind Relevance Feedback Augment the query representation with related terms Multiple opportunities for expansion Before doc translation: Enrich the vocabulary After doc translation: Mitigate translation errors Before query translation: Improve the query After query translation: Mitigate translation errors 41

Query Expansion/Translation source language query Source Language IR Query Translation Target Language IR results expanded source language query expanded target language terms source language collection target language collection Pre-translation expansion Post-translation expansion

McNamee and Mayfield Research questions: Setup: What are the effects of pre- and post- translation query expansion in CLIR? How is performance affected by quality of resources? Is CLIR simply measuring translation performance? Setup: CLEF 2001 test collection Dutch, French, German, Italian, Spanish queries English documents Varied the size translation lexicons (randomly threw out entries) Paul McNamee and James Mayfield. (2002) Comparing Cross-Language Query Expansion Techniques by Degrading Translation Resources. Proceedings of SIGIR 2002.

Query Expansion Effect

Lessons Both pre- and post- translation expansions help Pre-translation expansion is a bigger win… why? Translation resources are important!

Interaction CLIR poses some unique challenges for interaction How do you help users select translated query terms? How do you help users select document terms for query refinement? How do you compensate for poor translation quality?

Document Selection Can users recognize relevant documents in a cross-language retrieval setting? What’s the impact of translation quality? Ranked List Selection Documents Examination

Selection Experiment Comparison of two translation methods: Experimental setup (UMD, iCLEF 2001): English topics, French documents Each user works with the same hit list Can users make relevance judgments? What’s the effect of translation quality? Comparison of two translation methods: Term-for-term gloss translation (Gloss) Easily built for a wide range of language pairs Widely available bilingual word lists Machine translation (MT) Syntactic/semantic constraints improve accuracy & fluency Used Systran, a commercially available MT system Developing new language pairs is expensive (years)

Results Quantitative measures: Users with the MT system achieved higher F-score Observed behavior (from observational notes): Documents were usually examined in rank order Title alone was seldom used to judge documents as “relevant” Subjective reactions (from questionnaires): Everyone liked MT Only one participant liked anything about gloss translation MT was preferred overall

Making MIRACLEs Putting everything together in an interactive, cross-language retrieval system…

Key Points Good translation is the key to cross-language information retrieval Where does one obtain them? (e.g., bilingual dictionaries, aligned text, etc.) How does one use them? (e.g., query translation, document translation, etc.) CLIR performance approaches monolingual IR performance CLIR presents addition challenges for interaction support

Multimedia Retrieval We’re primarily going to focus on image and video search

A Picture…

… is comprised of pixels

This is nothing new! Seurat, Georges, A Sunday Afternoon on the Island of La Grande Jatte

Images and Video A digital image = a collection of pixels Each pixel has a “color” Different types of pixels Binary (1 bit): black/white Grayscale (8 bits) Color (3 colors, 8 bits each): red, green, blue A video is simply lots of images in rapid sequence Each image is called a frame Smooth motion requires about 24 frames/sec Compression is the key!

The Structure of Video Video Scenes Shots Frames

The Semantic Gap Raw Media Image-level descriptors Content descriptors This is what we have to work with Raw Media Image-level descriptors SKY MOUNTAINS TREES Content descriptors This is what we want Photo of Yosemite valley showing El Capitan and Glacier Point with the Half Dome in the distance Semantic content

The IR Black Box Index Documents Query Hits Multimedia Objects Representation Function Representation Function Query Representation Document Representation Index Comparison Function Hits

Recipe for Multimedia Retrieval Extract features Low-level features: blobs, textures, color histograms Textual annotations: captions, ASR, video OCR, human labels Match features From “bag of words” to “bag of features”

Demos Google Image Search Hermitage Museum IBM’s MARVEL System http://images.google.com/ http://www.hermitagemuseum.org/fcgi-bin/db2www/qbicSearch.mac/qbic?selLang=English http://mp7.watson.ibm.com/

Combination of Evidence

TREC For Video Retrieval? TREC Video Track (TRECVID) Started in 2001 Goal is to investigate content-based retrieval from digital video Focus on the shot as the unit of information retrieval (why?) Test Data Collection in 2004: 74 hours of CNN Headline News, ABC World News Tonight, C-SPAN http://www-nlpir.nist.gov/projects/trecvid/

Searching Performance A. Hauptmann and M. Christel. (2004) Successful Approaches in the TREC Video Retrieval Evaluations. Proceedings of ACM Multimedia 2004.

Interaction in Video Retrieval Discussion point: What unique challenges does video retrieval present for interactive systems?

Take-Away Message Multimedia IR systems build on the same basic set of tools as textual IR systems If you have a hammer, everything becomes a nail The feature set is different… but the ideas are the same Text is important!