Multilingual Information Retrieval


Multilingual Information Retrieval. Doug Oard, College of Information Studies and UMIACS, University of Maryland, College Park, USA. January 14, 2019, AFIRM

Global Trade [Chart labels: USA, EU, China, Japan, Hong Kong, South Korea] This chart shows the 15 nations with at least 100 billion dollars in annual imports and exports. Together, these nations account for 73% of the world’s exports. World trade thus defines nine major languages: English, German, Japanese, Chinese, French, Italian, Dutch, Korean, Spanish. There are three key drivers that decide which languages get attention. Where is the money? The G7 languages are well covered. Where are the people? This seems to have a much smaller effect. Where are the problems? This explains the interest in Farsi, Korean, etc. Source: Wikipedia (mostly 2017 estimates)

Most Widely-Spoken Languages Source: Ethnologue (SIL), 2018

Global Internet Users Web Pages

What Does “Multilingual” Mean? Mixed-language document: a document containing more than one language. Mixed-language collection: a collection of documents in different languages. Multi-monolingual system: can retrieve from a mixed-language collection. Cross-language system: a query in one language finds documents in another. (Truly) multilingual system: queries can find documents in any language.

A Story in Two Parts. (1) IR from the ground up in any language, focusing on document representation. (2) Cross-Language IR, to the extent time allows.

[Diagram: documents and the query each pass through a representation function, producing a document representation (stored in the index) and a query representation; a comparison function matches the two to produce hits.]

ASCII: American Standard Code for Information Interchange (ANSI X3.4-1968)
| 0 NUL | 32 SPACE | 64 @ | 96 ` |
| 1 SOH | 33 ! | 65 A | 97 a |
| 2 STX | 34 " | 66 B | 98 b |
| 3 ETX | 35 # | 67 C | 99 c |
| 4 EOT | 36 $ | 68 D | 100 d |
| 5 ENQ | 37 % | 69 E | 101 e |
| 6 ACK | 38 & | 70 F | 102 f |
| 7 BEL | 39 ' | 71 G | 103 g |
| 8 BS | 40 ( | 72 H | 104 h |
| 9 HT | 41 ) | 73 I | 105 i |
| 10 LF | 42 * | 74 J | 106 j |
| 11 VT | 43 + | 75 K | 107 k |
| 12 FF | 44 , | 76 L | 108 l |
| 13 CR | 45 - | 77 M | 109 m |
| 14 SO | 46 . | 78 N | 110 n |
| 15 SI | 47 / | 79 O | 111 o |
| 16 DLE | 48 0 | 80 P | 112 p |
| 17 DC1 | 49 1 | 81 Q | 113 q |
| 18 DC2 | 50 2 | 82 R | 114 r |
| 19 DC3 | 51 3 | 83 S | 115 s |
| 20 DC4 | 52 4 | 84 T | 116 t |
| 21 NAK | 53 5 | 85 U | 117 u |
| 22 SYN | 54 6 | 86 V | 118 v |
| 23 ETB | 55 7 | 87 W | 119 w |
| 24 CAN | 56 8 | 88 X | 120 x |
| 25 EM | 57 9 | 89 Y | 121 y |
| 26 SUB | 58 : | 90 Z | 122 z |
| 27 ESC | 59 ; | 91 [ | 123 { |
| 28 FS | 60 < | 92 \ | 124 | |
| 29 GS | 61 = | 93 ] | 125 } |
| 30 RS | 62 > | 94 ^ | 126 ~ |
| 31 US | 63 ? | 95 _ | 127 DEL |

The Latin-1 Character Set. ISO 8859-1: 8-bit characters for Western Europe (French, Spanish, Catalan, Galician, Basque, Portuguese, Italian, Albanian, Afrikaans, Dutch, German, Danish, Swedish, Norwegian, Finnish, Faroese, Icelandic, Irish, Scottish, and English). [Figure: the printable characters of 7-bit ASCII and the additional characters defined by ISO 8859-1]

Other ISO-8859 Character Sets [Figure: character charts for ISO 8859-2, -3, -4, -5, -6, -7, -8, and -9]

East Asian Character Sets. More than 256 characters are needed, so two-byte encoding schemes (e.g., EUC) are used. Several countries have unique character sets: GB in the People's Republic of China, BIG5 in Taiwan, JIS in Japan, KS in Korea, TCVN in Vietnam. Many characters appear in several languages: the Research Libraries Group developed EACC, a unified “CJK” character set for USMARC records.

Unicode. Single code for all the world’s characters (ISO Standard 10646). Separates “code space” from “encoding”. Code space extends Latin-1: the first 256 positions are identical. UTF-7 encoding will pass through email: it uses only 7-bit ASCII characters, carrying other text in a 64-character Base64 alphabet. UTF-8 encoding is designed for disk file systems.
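The split between code space and encoding can be seen directly in Python, whose standard codecs include Latin-1, UTF-8, and UTF-7. This is a minimal sketch; the sample characters are chosen only for illustration.

```python
# Code points are numbers; encodings are byte-level representations of them.
samples = ["A", "é", "中"]

for ch in samples:
    code_point = ord(ch)
    utf8 = ch.encode("utf-8")
    utf7 = ch.encode("utf-7")
    print(f"U+{code_point:04X}  utf-8={utf8!r}  utf-7={utf7!r}")

# The first 256 code points coincide with Latin-1, so a Latin-1 byte value
# equals the Unicode code point of the character it encodes.
latin1_bytes = "é".encode("latin-1")
print(latin1_bytes[0] == ord("é"))

# UTF-7 output stays within 7-bit ASCII, whatever the input.
print(all(b < 128 for b in "中文 é".encode("utf-7")))
```

The same character therefore has one code point but several possible byte sequences, which is why an indexer has to know (or detect) the encoding before it can compare strings.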

Limitations of Unicode. Produces larger files than Latin-1. Fonts may be hard to obtain for some characters. Some characters have multiple representations (e.g., accents can be part of a character or separate). Some characters look identical when printed, but they come from unrelated languages. Encoding does not define the “sort order”.
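Two of these limitations matter directly for indexing: equivalent strings may not compare equal, and look-alike characters are distinct code points. The sketch below uses Python's standard `unicodedata` module; the helper name `index_key` is hypothetical.

```python
import unicodedata

composed = "\u00e9"      # é as one precomposed code point
decomposed = "e\u0301"   # e followed by a combining acute accent

# The two strings render the same but are different code point sequences.
print(composed == decomposed)

latin_a = "A"            # LATIN CAPITAL LETTER A
cyrillic_a = "\u0410"    # CYRILLIC CAPITAL LETTER A
print(unicodedata.name(latin_a), "|", unicodedata.name(cyrillic_a))

def index_key(text):
    """Normalize to NFC before indexing so equivalent forms match."""
    return unicodedata.normalize("NFC", text)

print(index_key(decomposed) == index_key(composed))
```

Normalization handles the accent case; it does not merge the Latin and Cyrillic letters, which remain different characters.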

Strings and Segments. Retrieval is (often) a search for concepts, but what we actually search are character strings. What strings best represent concepts? In English, words are often a good choice; well-chosen phrases might also be helpful. In German, compounds may need to be split; otherwise queries using constituent words would fail. In Chinese, word boundaries are not marked. Thissegmentationproblemissimilartothatofspeech.

Tokenization. Words (from linguistics): morphemes are the units of meaning, combined to make words, e.g., Anti (disestablishmentarian) ism. Tokens (from computer science): Doug ’s running late !
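A token split like the one above can be approximated with a regular expression. This is a minimal sketch; the pattern is an assumption made for illustration and covers only simple English clitics and punctuation.

```python
import re

# Hypothetical pattern: a clitic such as 's, a run of word characters,
# or any single non-space punctuation mark.
TOKEN_PATTERN = re.compile(r"['’]\w+|\w+|[^\w\s]")

def tokenize(text):
    """Split text into word, clitic, and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

print(tokenize("Doug's running late!"))
```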

Morphological Segmentation: Swahili Example. a (he) + li (past tense) + ni (me) + andik (write) + ish (causer-effect) + a (declarative mode). Credit: Ramy Eskander

Morphological Segmentation: Somali Example. cun (eat) + t (she) + aa (present tense). Credit: Ramy Eskander

Stemming. Conflates words, usually preserving meaning. Rule-based suffix-stripping helps for English: {destroy, destroyed, destruction}: destr. Prefix-stripping is needed in some languages: Arabic {alselam}: selam [Root: SLM (peace)]. Imperfect: the goal is to usually be helpful. Overstemming: {centennial, century, center}: cent. Understemming: {acquire, acquiring, acquired}: acquir, but {acquisition}: acquis. Snowball: rule-based system for making stemmers.
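The understemming case can be reproduced with a toy suffix stripper. The rule list below is hypothetical and far smaller than any real stemmer's (it is not the Porter or Snowball rule set); it only shows how suffix rules leave "acquisition" with a different stem.

```python
# Toy rules for illustration only; real stemmers use many more rules
# and conditions on when a suffix may be removed.
SUFFIXES = ["ition", "ing", "ed", "e"]  # hypothetical rule list, longest first

def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

for w in ["acquire", "acquiring", "acquired", "acquisition"]:
    print(w, "->", toy_stem(w))
```

Three of the four forms conflate to "acquir", while "acquisition" becomes "acquis", so a query on one would miss documents indexed under the other.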

Longest Substring Segmentation. Greedy algorithm based on a lexicon. Start with a list of every possible term. For each unsegmented string: remove the longest single substring in the list; repeat until no substrings are found in the list.

Longest Substring Example. Possible German compound term (!): washington. List of German words: ach, hin, hing, sei, ton, was, wasch. Longest substring segmentation: was-hing-ton. Roughly translates as “What tone is attached?”
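The greedy procedure and the washington example can be sketched as follows. The function name and the tie-breaking rule (alphabetical among equally long entries, leftmost occurrence) are assumptions; the slides do not specify them.

```python
def longest_substring_segment(text, lexicon):
    """Greedily split text around the longest lexicon entry it contains."""
    if not text:
        return []
    # Try lexicon entries from longest to shortest (ties broken alphabetically).
    for term in sorted(lexicon, key=lambda t: (-len(t), t)):
        position = text.find(term) if term else -1
        if position != -1:
            left = text[:position]
            right = text[position + len(term):]
            return (longest_substring_segment(left, lexicon)
                    + [term]
                    + longest_substring_segment(right, lexicon))
    # No lexicon entry occurs in the text: keep the leftover piece whole.
    return [text]

german_words = ["ach", "hin", "hing", "sei", "ton", "was", "wasch"]
print("-".join(longest_substring_segment("washington", german_words)))
```

Here "hing" is the longest entry found in the string, which leaves "was" and "ton" on either side, so the greedy choice produces a plausible-looking but wrong split of a proper name.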

[Figure: English glosses for alternative segmentations of a Chinese string: oil / petroleum; probe / survey; take samples; restrain; cymbidium goeringii]

Probabilistic Segmentation. For an input string c1 c2 c3 … cn, try all possible partitions into w1 w2 w3 … (e.g., c1 | c2 c3 … cn; c1 c2 | c3 … cn; c1 | c2 | c3 … cn; etc.). Choose the highest probability partition: compute Pr(w1 w2 w3 …) using a language model. Challenges: search, probability estimation.
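One way to address the search challenge is dynamic programming rather than enumerating every partition. The sketch below assumes a unigram language model (words are independent), which is a simplification; the probabilities and the English example string are made up for illustration.

```python
import math

def segment(text, word_probs, max_len=10):
    """Return the most probable split of text under a unigram model, or None."""
    n = len(text)
    # best[i] = (log probability of the best split of text[:i], start of last word)
    best = [(-math.inf, None)] * (n + 1)
    best[0] = (0.0, None)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            p = word_probs.get(text[start:end])
            if p is None or best[start][0] == -math.inf:
                continue
            score = best[start][0] + math.log(p)
            if score > best[end][0]:
                best[end] = (score, start)
    if best[n][0] == -math.inf:
        return None  # no partition uses only known words
    words = []
    end = n
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return words[::-1]

# Made-up unigram probabilities.
toy_probs = {"the": 0.3, "cat": 0.2, "sat": 0.2, "cats": 0.1,
             "at": 0.1, "he": 0.05, "t": 0.05}
print(segment("thecatsat", toy_probs))
```

The split "the cat sat" (0.3 x 0.2 x 0.2) beats "the cats at" (0.3 x 0.1 x 0.1). Returning None for unknown strings is a design choice here; a real system would need smoothing, which is the probability estimation challenge.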

Non-Segmentation: N-gram Indexing. Consider a Chinese document c1 c2 c3 … cn. Don’t segment (you could be wrong!). Instead, treat every character bigram as a term: c1 c2, c2 c3, c3 c4, …, cn-1 cn. Break up queries the same way.
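Overlapping character bigrams take one line to generate. In this sketch the document and query strings are illustrative only.

```python
def char_ngrams(text, n=2):
    """Return overlapping character n-grams; no segmentation is attempted."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

document = "石油探测"  # illustrative string only
query = "探测"

doc_terms = char_ngrams(document)
query_terms = char_ngrams(query)
print(doc_terms)
print([t for t in query_terms if t in doc_terms])
```

Because documents and queries are broken up the same way, a query bigram matches wherever those two characters are adjacent in a document, including across true word boundaries.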

A “Term” is Whatever You Index: word sense, token, word, stem, character n-gram, phrase.

Summary. A term is whatever you index, so the key is to index the right kind of terms! Start by finding fundamental features: we have focused on character coded text, but the same ideas apply to handwriting, OCR, and speech. Combine characters into easily recognized units: words where possible, character n-grams otherwise. Apply further processing to optimize results: stemming, phrases, …

A Story in Two Parts. (1) IR from the ground up in any language, focusing on document representation. (2) Cross-Language IR, to the extent time allows.

Query-Language CLIR [Diagram: a translation system turns the Somali document collection into an English document collection; a retrieval engine searches it with English queries, and the user selects and examines the results.]

Document-Language CLIR [Diagram: a translation system turns English queries into Somali queries; a retrieval engine runs them against the Somali document collection, and the user selects and examines the resulting Somali documents.]

Query vs. Document Translation. Query translation: efficient for short queries (not relevance feedback); limited context for ambiguous query terms. Document translation: rapid support for interactive selection; need only be done once (if query language is same).

Indexing Time: Statistical Document Translation

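Language-Neutral Retrieval [Diagram: Somali query terms pass through query “translation” and English document terms pass through document “translation” into a shared representation; “interlingual” retrieval then produces a ranked list with scores (1: 0.91, 2: 0.57, 3: 0.36).]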

Translation Evidence. Lexical resources: phrase books, bilingual dictionaries, … Large text collections: translations (“parallel”), similar topics (“comparable”). Similarity: similar writing (if the character set is the same), similar pronunciation. People: may be able to guess topic from lousy translations. Fundamentally, there are four sources of knowledge that we can rely on when teaching a machine to translate. Perhaps the simplest is some form of dictionary. Dictionaries are very useful, but it is hard for machines to learn to select the right translation using a dictionary alone because the machine has no real sense of context. Large collections of text can provide that context, however, and in recent years they have proven to be very useful as a basis for building “machine translation” systems. The best results have been obtained using very large collections of translated documents, which we call a “parallel text collection”. The next two slides illustrate how that is done.

Types of Lexical Resources. Ontology: organization of knowledge. Thesaurus: ontology specialized to support search. Dictionary: rich word list, designed for use by people. Lexicon: rich word list, designed for use by a machine. Bilingual term list: pairs of translation-equivalent terms.

[Chart legend: full query; named entities added; named entities from term list; named entities removed]

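Backoff Translation. Lexicon might contain stems, surface forms, or some combination of the two. A document term is matched against the translation lexicon in four ways: (1) surface form to surface form: mangez matches mangez - eat; (2) stem to surface form: mangez, stemmed to mange, matches mange - eats (eat); (3) surface form to stem: mange matches mangez, stemmed to mange - eat; (4) stem to stem: mangez, stemmed to mange, matches mangent, stemmed to mange - eat.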
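One plausible reading of this slide is a four-stage lookup that tries exact surface forms first and falls back to stems only when needed. The sketch below follows that reading; the stem table and function names are hypothetical, and a real system would use an actual French stemmer.

```python
# Hypothetical stem table standing in for a real French stemmer.
STEMS = {"mangez": "mange", "mangent": "mange", "mange": "mange"}

def stem(word):
    return STEMS.get(word, word)

def backoff_translate(term, lexicon):
    """Return (translations, stage) for the first stage that matches, else None."""
    # Index the lexicon a second time by the stems of its source-side entries.
    stemmed_lexicon = {}
    for source, translations in lexicon.items():
        stemmed_lexicon.setdefault(stem(source), []).extend(translations)

    stages = [
        ("surface form to surface form", term, lexicon),
        ("stem to surface form", stem(term), lexicon),
        ("surface form to stem", term, stemmed_lexicon),
        ("stem to stem", stem(term), stemmed_lexicon),
    ]
    for stage, key, table in stages:
        if key in table:
            return table[key], stage
    return None

print(backoff_translate("mangez", {"mangez": ["eat"]}))
print(backoff_translate("mangez", {"mangent": ["eat"]}))
```

Ordering the stages from most to least exact means stemming is used only to recover coverage, not to override a precise match.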

[Image: the Rosetta Stone, carrying the same text in Hieroglyphic Egyptian, Demotic, and Greek]

Types of Bilingual Corpora. Parallel corpora (translation-equivalent pairs): document pairs, sentence pairs, term pairs. Comparable corpora (topically related): collection pairs.

Some Modern Rosetta Stones. News: DE-News (German-English); Hong-Kong News, Xinhua News (Chinese-English). Government: Canadian Hansards (French-English); Europarl (Danish, Dutch, English, Finnish, French, German, Greek, Italian, Portuguese, Spanish, Swedish); UN Treaties (Russian, English, Arabic, …). Religion: Bible, Koran, Book of Mormon.

Word-Level Alignment. English: Diverging opinions about planned tax reform. German: Unterschiedliche Meinungen zur geplanten Steuerreform. English: Madam President , I had asked the administration … Spanish: Señora Presidenta, había pedido a la administración del Parlamento …

A Translation Model. From word-aligned bilingual text, we induce a translation model. Example: p(探测|survey) = 0.4, p(试探|survey) = 0.3, p(测量|survey) = 0.25, p(样品|survey) = 0.05
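The slides do not say how the model is induced. One simple option, assumed here, is a relative-frequency estimate over aligned word pairs; the alignment counts below are made up so that the result reproduces the example probabilities.

```python
from collections import Counter, defaultdict

def estimate_translation_probs(aligned_pairs):
    """Relative-frequency estimate of p(f | e) from (e, f) aligned word pairs."""
    pair_counts = Counter(aligned_pairs)
    source_totals = Counter(e for e, _ in aligned_pairs)
    probs = defaultdict(dict)
    for (e, f), count in pair_counts.items():
        probs[e][f] = count / source_totals[e]
    return dict(probs)

# Made-up alignment counts: 8, 6, 5, and 1 links out of 20 for "survey".
aligned_pairs = ([("survey", "探测")] * 8 + [("survey", "试探")] * 6
                 + [("survey", "测量")] * 5 + [("survey", "样品")] * 1)
model = estimate_translation_probs(aligned_pairs)
print(model["survey"])
```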

Using Multiple Translations. Weighted Structured Query Translation takes advantage of multiple translations and translation probabilities. TF and DF of query term e are computed using TF and DF of its translations.
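A common way to do this, assumed here because the slide's formula is not reproduced in the transcript, is to weight each translation's statistics by its probability: TF(e, d) = sum over f of p(f|e) x TF(f, d), and DF(e) = sum over f of p(f|e) x DF(f). In the sketch below, all counts are made up.

```python
def weighted_tf(translation_probs, doc_term_freqs):
    """TF of query term e in one document: sum of p(f|e) * TF(f, d)."""
    return sum(p * doc_term_freqs.get(f, 0) for f, p in translation_probs.items())

def weighted_df(translation_probs, doc_freqs):
    """DF of query term e in the collection: sum of p(f|e) * DF(f)."""
    return sum(p * doc_freqs.get(f, 0) for f, p in translation_probs.items())

p_given_survey = {"探测": 0.4, "试探": 0.3, "测量": 0.25, "样品": 0.05}
doc_term_freqs = {"探测": 2, "测量": 4}                          # one document
doc_freqs = {"探测": 100, "试探": 40, "测量": 200, "样品": 20}   # whole collection

print(weighted_tf(p_given_survey, doc_term_freqs))
print(weighted_df(p_given_survey, doc_freqs))
```

The resulting TF and DF are generally fractional, which is fine for ranking functions that accept real-valued statistics.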

BM-25 [Equation: the BM-25 term weight, with its document frequency, term frequency, and document length components labeled]
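How the three components combine can be sketched with one widely used BM25 variant. The exact form on the slide is not reproduced in the transcript, so the IDF formulation and the default values k1 = 1.2 and b = 0.75 are assumptions.

```python
import math

def bm25_term_score(tf, df, doc_len, avg_doc_len, num_docs, k1=1.2, b=0.75):
    """Score contribution of one query term to one document."""
    # Document frequency: rarer terms get a larger weight.
    idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
    # Document length: longer-than-average documents are penalized.
    length_norm = 1 - b + b * doc_len / avg_doc_len
    # Term frequency: saturating, so repeated occurrences add less and less.
    return idf * tf * (k1 + 1) / (tf + k1 * length_norm)

print(bm25_term_score(tf=3, df=50, doc_len=120, avg_doc_len=100, num_docs=10_000))
```

Nothing in the function requires integer inputs, so probability-weighted TF and DF values from translated query terms can be passed in directly.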

Retrieval Effectiveness CLEF French

Bilingual Query Expansion [Diagram: pre-translation expansion runs the source language query through source language IR over a source language collection to produce an expanded source language query; after query translation, post-translation expansion runs target language IR over the target language collection to produce expanded target language terms and the final results.]

Query Expansion Effect Paul McNamee and James Mayfield, SIGIR-2002

Cognate Matching. Dictionary coverage is inherently limited: translation of proper names, translation of newly coined terms, translation of unfamiliar technical terms. Strategy: model derivational translation, either orthography-based or pronunciation-based.

Matching Orthographic Cognates. Retain untranslatable words unchanged: often works well between European languages. Rule-based systems: even off-the-shelf spelling correction can help! Subword (e.g., character-level) MT: trained using a set of representative cognates.
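The spelling-correction idea can be sketched with Python's standard `difflib`, used here as a stand-in for an off-the-shelf spelling corrector. The vocabulary, the cutoff of 0.8, and the function name are assumptions for illustration.

```python
import difflib

def match_cognate(term, target_vocabulary, cutoff=0.8):
    """Return the closest target-language spelling, or the term unchanged."""
    matches = difflib.get_close_matches(term, target_vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else term

# Toy target-language index vocabulary.
english_vocabulary = ["administration", "parliament", "president"]

print(match_cognate("parlamento", english_vocabulary))
print(match_cognate("pedido", english_vocabulary))
```

An untranslatable term with a close spelling in the target vocabulary is mapped to it; a term with no close match is retained unchanged, as the first strategy on the slide suggests.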

Matching Phonetic Cognates. Forward transliteration: generate all potential transliterations. Reverse transliteration: guess source string(s) that produced a transliteration. Match in phonetic space.

Cross-Language “Retrieval” [Diagram: a query passes through query translation to give a translated query, which search turns into a ranked list.] The answer can be given by looking at interactions within the search process, whether monolingual or multilingual. Besides being interested in nominate, interactive IR is also interested in the three yellow boxes in predict and choose.

Uses of “MT” in CLIR [Diagram: stages of the search process (query formulation, query, query translation, translated query, search, ranked list, selection, document examination, document use, query reformulation) annotated with the uses of MT: term translation, term matching, snippet translation, indicative translation, informative translation.]

Interactive Cross-Language Question Answering iCLEF 2004

Questions, Grouped by Difficulty
8 Who is the managing director of the International Monetary Fund?
11 Who is the president of Burundi?
13 Of what team is Bobby Robson coach?
4 Who committed the terrorist attack in the Tokyo underground?
16 Who won the Nobel Prize for Literature in 1994?
6 When did Latvia gain independence?
14 When did the attack at the Saint-Michel underground station in Paris occur?
7 How many people were declared missing in the Philippines after the typhoon “Angela”?
2 How many human genes are there?
10 How many people died of asphyxia in the Baku underground?
15 How many people live in Bombay?
12 What is Charles Millon's political party?
1 What year was Thomas Mann awarded the Nobel Prize?
3 Who is the German Minister for Economic Affairs?
9 When did Lenin die?
5 How much did the Channel Tunnel cost?

For Further Reading
Multilingual IR: Paul McNamee et al., Addressing Morphological Variation in Alphabetic Languages, SIGIR, 2009
African-Language IR: Open CLIR Challenge (Swahili), IARPA, 2018; Nkosana Malumba et al., AfriWeb: A Search Engine for a Marginalized Language, ICADL, 2015
Cross-Language IR: Jian-Yun Nie, Cross-Language Information Retrieval, Synthesis Lectures in HLT, Morgan & Claypool, 2010; Jianqiang Wang and Douglas W. Oard, Matching Meaning for Cross-Language Information Retrieval, Information Processing and Management, 2012