The Challenges of Multilingual Search Paul Clough The Information School University of Sheffield ISKO UK conference 8-9 July 2013.

Slides:



Advertisements
Similar presentations
The Application of Machine Translation in CADAL Huang Chen, Chen Haiying Zhejiang University Libraries, Hangzhou, China
Advertisements

The Chinese Room: Understanding and Correcting Machine Translation This work has been supported by NSF Grants IIS Solution: The Chinese Room Conclusions.
Advanced Google Becoming a Power Googler. (c) Thomas T. Kaun 2005 How Google Works PageRank: The number of pages link to any given page. “Importance”
Jing-Shin Chang National Chi Nan University, IJCNLP-2013, Nagoya 2013/10/15 ACLCLP – Activities ( ) & Text Corpora.
A Syntactic Translation Memory Vincent Vandeghinste Centre for Computational Linguistics K.U.Leuven
Machine Translation (Level 2) Anna Sågvall Hein GSLT Course, September 2004.
© Tefko Saracevic, Rutgers University1 digital libraries and human information behavior Tefko Saracevic, Ph.D. School of Communication, Information and.
Morris LeBlanc.  Why Image Retrieval is Hard?  Problems with Image Retrieval  Support Vector Machines  Active Learning  Image Processing ◦ Texture.
Cross Language IR Philip Resnik Salim Roukos Workshop on Challenges in Information Retrieval and Language Modeling Amherst, Massachusetts, September 11-12,
Inducing Information Extraction Systems for New Languages via Cross-Language Projection Ellen Riloff University of Utah Charles Schafer, David Yarowksy.
INFO 624 Week 3 Retrieval System Evaluation
Properties of Text CS336 Lecture 3:. 2 Information Retrieval Searching unstructured documents Typically text –Newspaper articles –Web pages Other documents.
ReQuest (Validating Semantic Searches) Norman Piedade de Noronha 16 th July, 2004.
Advance Information Retrieval Topics Hassan Bashiri.
Collaborative Cross-Language Search Douglas W. Oard University of Maryland, College Park May 14, 2015SICS Workshop.
The LC-STAR project (IST ) Objectives: Track I (duration 2 years) Specification and creation of large word lists and lexica suited for flexible.
Overview of Search Engines
Cross-Language Retrieval INST 734 Module 11 Doug Oard.
Database Design IST 7-10 Presented by Miss Egan and Miss Richards.
 Official Site: facility.org/research/evaluation/clef-ip-10http:// facility.org/research/evaluation/clef-ip-10.
Search is not only about the Web An Overview on Printed Documents Search and Patent Search Walid Magdy Centre for Next Generation Localisation School of.
Developing Guidance on Cultural and Linguistic Issues Feng Dai Usability Engineering 26 Nov 2013.
KNOWLEDGE DATABASE Topics inside  Document sharing  Event marketing  Web content.
Challenges in Information Retrieval and Language Modeling Michael Shepherd Dalhousie University Halifax, NS Canada.
A New Approach for Cross- Language Plagiarism Analysis Rafael Corezola Pereira, Viviane P. Moreira, and Renata Galante Universidade Federal do Rio Grande.
Evaluating the Contribution of EuroWordNet and Word Sense Disambiguation to Cross-Language Information Retrieval Paul Clough 1 and Mark Stevenson 2 Department.
The Evolution of Shared-Task Evaluation Douglas W. Oard College of Information Studies and UMIACS University of Maryland, College Park, USA December 4,
August 21, 2002Szechenyi National Library Support for Multilingual Information Access Douglas W. Oard College of Information Studies and Institute for.
CS621 : Seminar-2008 DEEP WEB Shubhangi Agrawal ( )‏ Jayalekshmy S. Nair ( )‏
1 Teletranslation Context : An Infrastructural Shift Paradigm of Teletranslation Internet and computer-mediated communication with digital media Information.
STF 287 Multicultural Communication 1 Cultural inclusion in information and communications services Specialist Task Force STF 287 “User-oriented handling.
University of Dublin Trinity College Localisation and Personalisation: Dynamic Retrieval & Adaptation of Multi-lingual Multimedia Content Prof Vincent.
An Integrated Approach for Arabic-English Named Entity Translation Hany Hassan IBM Cairo Technology Development Center Jeffrey Sorensen IBM T.J. Watson.
Querying Across Languages: A Dictionary-Based Approach to Multilingual Information Retrieval Doctorate Course Web Information Retrieval Speaker Gaia Trecarichi.
FishBase Summary Page about Salmo salar in the standard Language of FishBase (English) ENBI-WP-11: Multilingual Access to European Biodiversity Sites through.
1 Cross-Lingual Query Suggestion Using Query Logs of Different Languages SIGIR 07.
Bing Hong OSIsoft Internationalization &
Thanks to Bill Arms, Marti Hearst Documents. Last time Size of information –Continues to grow IR an old field, goes back to the ‘40s IR iterative process.
The CLEF 2003 cross language image retrieval task Paul Clough and Mark Sanderson University of Sheffield
Information Retrieval and Web Search Cross Language Information Retrieval Instructor: Rada Mihalcea Class web page:
Search - on the Web and Locally Related directly to Web Search Engines: Part 1 and Part 2. IEEE Computer. June & August 2006.
1 Information Retrieval Acknowledgements: Dr Mounia Lalmas (QMW) Dr Joemon Jose (Glasgow)
Cross-Language Evaluation Forum (CLEF) IST Expected Kick-off Date: August 2001 Carol Peters IEI-CNR, Pisa, Italy Carol Peters: blabla Carol.
Interactive Probabilistic Search for GikiCLEF Ray R Larson School of Information University of California, Berkeley Ray R Larson School of Information.
Math Information Retrieval Zhao Jin. Zhao Jin. Math Information Retrieval Examples: –Looking for formulas –Collect teaching resources –Keeping updated.
Elaine Ménard & Margaret Smithglass School of Information Studies McGill University [Canada] July 5 th, 2011 Babel revisited: A taxonomy for ordinary images.
Collocations and Information Management Applications Gregor Erbach Saarland University Saarbrücken.
UNED at iCLEF 2008: Analysis of a large log of multilingual image searches in Flickr Victor Peinado, Javier Artiles, Julio Gonzalo and Fernando López-Ostenero.
WIRED Week 3 Syllabus Update (next week) Readings Overview - Quick Review of Last Week’s IR Models (if time) - Evaluating IR Systems - Understanding Queries.
Copenhagen, 6 June 2006 EC CHM Multilinguality Anton Cupcea Finsiel Romania.
The Structure of Information Retrieval Systems LBSC 708A/CMSC 838L Douglas W. Oard and Philip Resnik Session 1: September 4, 2001.
Iterative Translation Disambiguation for Cross Language Information Retrieval Christof Monz and Bonnie J. Dorr Institute for Advanced Computer Studies.
Endangered Species A Collaborative Teaching Unit.
For Monday Read chapter 24, sections 1-3 Homework: –Chapter 23, exercise 8.
Clarity Cross-Lingual Document Retrieval, Categorisation and Navigation Based on Distributed Services
Jane Reid, AMSc IRIC, QMUL, 30/10/01 1 Information seeking Information-seeking models Search strategies Search tactics.
Information Retrieval
JISC/NSF PI Meeting, June Archon - A Digital Library that Federates Physics Collections with Varying Degrees of Metadata Richness Department of Computer.
Comparing Document Segmentation for Passage Retrieval in Question Answering Jorg Tiedemann University of Groningen presented by: Moy’awiah Al-Shannaq
Foundations of Information Systems in Business. System ® System  A system is an interrelated set of business procedures used within one business unit.
What’s happening in iCLEF? (the iCLEF Flickr Challenge) Julio Gonzalo (UNED), Paul Clough (U. Sheffield), Jussi Karlgren (SICS), Javier Artiles (UNED),
Chapter. 3: Retrieval Evaluation 1/2/2016Dr. Almetwally Mostafa 1.
Multilingual Search Shibamouli Lahiri
For Monday Read chapter 26 Homework: –Chapter 23, exercises 8 and 9.
IT323 - Software Engineering 2 1 Tutorial 3.  Suggest ways in which the user interface to an e-commerce system such as an online stores might be adapted.
A Simple English-to-Punjabi Translation System By : Shailendra Singh.
Web Engineering.
College of Information
European Network of e-Lexicography
Presentation transcript:

The Challenges of Multilingual Search Paul Clough The Information School University of Sheffield ISKO UK conference 8-9 July 2013

 What is multilingual search?  MLIR and CLIR techniques  Four challenges  Thinking beyond search  Translation and language resources  Providing effective user support  Going from research to practice Outline ISKO UK conference 8-9 July 2013

The “Grand [search] Challenge” “Given a query in any medium and any language, select relevant items from a multilingual multimedia collection which can be in any medium and any language, and present them in the style or order most likely to be useful to the querier, with identical or near identical objects in different media or languages appropriately identified.” Douglas W. Oard and David Hull, AAAI Symposium on Cross- Language IR, Spring 1997, Stanford, USA ISKO UK conference 8-9 July 2013

The need for multilingual search ISKO UK conference 8-9 July 2013

 Multilingual Information Access (MLIA)  Accessing, querying and retrieving information from collections in any language  Includes search and browse functionalities  Cross-Language Information Retrieval (CLIR)  Querying multilingual collections in one language in order to retrieve documents in other languages  Bilingual information retrieval  Multilingual Information Retrieval (MLIR)  Process information (queries, documents, both) in multiple languages  Includes cross-language retrieval What is multilingual search? ISKO UK conference 8-9 July 2013

Documents (L 2 ) Cross-language IR (CLIR) system Request (L 1 ) Results (L 2 )  Documents and user requests are in different languages (bilingual retrieval) Cross-language search Source language Target language ISKO UK conference 8-9 July 2013

Multilingual search  Documents in collection in different languages, search requests in any language Documents (L 3 ) Multilingual IR (MLIR) system Request (L?) Results (L 2, L 3 and/or L 4 ) Documents (L 2 ) Documents (L 4 ) e.g. the Web ISKO UK conference 8-9 July 2013

Google web search ISKO UK conference 8-9 July 2013

Obvious question …  Some users are multilingual  Can formulate searches and judge relevance in many languages  Want the convenience of a single query  Some users are monolingual  Want to query in their native language  Can judge relevance even if results not translated  Have access to document translation  Objects retrieved are language-independent (e.g. images) ISKO UK conference 8-9 July 2013 … “Why do users want to retrieve documents they presumably can’t read?”

 MLIR and CLIR involves matching queries in one language with documents in another  Translate the query (query translation)  Translate the documents (document translation)  Translate queries and documents (e.g. into an intermediate language)  No translation  MLIR may also involve a merging step Matching queries and documents ISKO UK conference 8-9 July 2013

 The use of machine-readable, human-crafted bilingual dictionaries (and thesauri, word-lists and other similar resources)  The use of translation resources generated by statistical approaches from suitable training data  Require parallel or comparable corpora  The use of a ‘full’ machine translation system  Commercial or open source Translation resources ISKO UK conference 8-9 July 2013

Does CLIR/MLIR work?  In the lab CLIR/MLIR using various approaches and translation resources is effective  Results from the Cross-Language Evaluation Forum (CLEF) have shown steady increases  % of monolingual baseline for various tasks  % onwards for most commonly used language pairs  2009 up to 99% of monolingual baseline  Main findings  Well-tuned MT systems are highly effective  Combinations of approaches work well ISKO UK conference 8-9 July 2013

Challenges for multilingual search 1. Thinking beyond search ISKO UK conference 8-9 July 2013

Search in the broader context (includes culture and language) ISKO UK conference 8-9 July 2013

“Globalization is the process of making applications work seamlessly utilizing the user’s preferred language and culture. It involves not just programming and deployment skills but also cultural, translation and language expertise as well.” Multilingual search will typically be part of a wider globalization process and concerned with developing multilingual applications (or websites) that include search functions (i.e. search is not the end goal) Developing global applications ISKO UK conference 8-9 July 2013

 Culture is the behaviour typical of a group of people  Members of the same culture are likely to have the same knowledge of certain things and would think and act similarly in certain situations  Aspects to consider (for design) include  Religion, customs, colours, metaphors, icons and flags, and language  Crossing the cultural boundary includes adapting to a given market’s cultural conventions (localisation) The impact of culture ISKO UK conference 8-9 July 2013

 Translation of the product’s interface and documentation  Colours, images, graphics and icons adapting to cultural and legal requirements  Rendering displaying text correctly (e.g. does the new text fit inside the allocated space?)  Fonts making sure the correct fonts and characters for the target language  Bi-directional text needed in Arabic and other languages  Locale data how to display dates, time, number, currency and other regional data Aspects to consider for localisation ISKO UK conference 8-9 July 2013

Formatting and page layout For languages read right to left right alignment is predominantly used ISKO UK conference 8-9 July 2013

Differences in terminology ISKO UK conference 8-9 July 2013

Challenges for multilingual search 2. Languages and translation ISKO UK conference 8-9 July 2013

 Isn’t translation just replacing words in one language (source) with words in another language (target)?  Not quite so simple …  Often includes changes in syntax  Often includes translation of semantics and use of culturally-specific terms or concepts What is translation? ISKO UK conference 8-9 July 2013

 Out-Of-Vocabulary (OOV) terms  Ambiguity (source and target languages)  Lexical ambiguity (e.g. bat: cricket or animal?)  Syntactic ambiguity (e.g. “I saw the boy with the telescope”)  Translation of phrases (e.g. “George Bush”)  Translation of proper names  Word inflections (e.g. in Swedish)  Compounds (e.g. in German and Dutch)  Word segmentation (e.g. in Arabic and Chinese)  … Further translation challenges ISKO UK conference 8-9 July 2013

 Character sets  Punctuation and marks  Word separation  Digits  Writing direction  Formatting styles  Character shapes  Sort order and case  Date and time formatting Handling different languages ISKO UK conference 8-9 July 2013

 The use of machine-readable, human-crafted bilingual dictionaries (and thesauri, word-lists and other similar resources)  The use of translation resources generated by statistical approaches from suitable training data  Require parallel or comparable corpora  The use of a ‘full’ machine translation system  Commercial or open source Lexical resources for translation ISKO UK conference 8-9 July 2013

 There are approximately 6,800 known languages in the world and just over 2,000 have a writing system  Around 300 have some kind of language processing tools  CLIR/MLIR performance depends on the availability of high-quality translation resources and language processing tools  “Resource bottleneck”  Finding ways to acquire, maintain and update language tools and resources in economic manner  Building and sustaining resources is costly The “resource bottleneck” ISKO UK conference 8-9 July 2013

 ACCURAT project investigated automatically gathering materials from various online sources to train MT systems  Focused on generating resources for “rare” language pairs and specific domains (e.g. automotive industry)  Used comparable corpora to derive translations  Used various online resources  Wikipedia, news sites, parallel versions of websites, social media and crawls of general web pages  Download Toolkit: Example research: ACCURAT ISKO UK conference 8-9 July 2013

Challenges for multilingual search 3. Providing effective user interaction ISKO UK conference 8-9 July 2013

 Individuals can have a range of foreign language abilities and knowledge, e.g. passive vs. active skills  Can use translation to deal with language differences  Many people can use English  Often an interlingua (or default language) for many interfaces  User’s language skills will affect the design of interfaces, for example for search  e.g. monolingual users may need help formulating queries Adapting to language differences ISKO UK conference 8-9 July 2013

Supporting CLIR/MLIR ISKO UK conference 8-9 July 2013

Query translation (Mulinex) Step 1. Step 2. ISKO UK conference 8-9 July 2013

Translation of search results ISKO UK conference 8-9 July 2013

Inputting non-ASCII characters Example of non-ASCII keyboard character input (Arabvista.com). ISKO UK conference 8-9 July 2013

Giving users a choice of interface language ISKO UK conference 8-9 July 2013

Challenges for multilingual search 4. Going from research to practice ISKO UK conference 8-9 July 2013

 The technologies for CLIR/MLIR are proven in the lab, but where are the applications in practice?  For example, consider the lack of CLIR/MLIR in large-scale information services, such as Amazon  We need to go from research to practice  Guidelines for developing CLIR/MLIR applications  Collaborations between academics and business  Effective methods for knowledge transfer (in both directions) Are we stuck in the lab? ISKO UK conference 8-9 July 2013

Research to practice ISKO UK conference 8-9 July 2013

Summary  Search systems are used interactively so you must consider and design for end users  Implication: multilingual search is NOT just an engineering problem; you will also need to understand users’ cultural background and language abilities  Searching is part of wider information seeking activities supported by search applications  Implication: you may have to consider localisation of the whole service and not just querying support  Research has shown we can do cross-language search in multiple languages well (in the lab)  Implication: the challenge is going from research to practice ISKO UK conference 8-9 July 2013

Questions? Paul Clough Senior Lecturer in Information Retrieval ISKO UK conference 8-9 July 2013