Multilingual Information Retrieval
Doug Oard, College of Information Studies and UMIACS, University of Maryland, College Park, USA. AFIRM, January 14, 2019.
Global Trade

This chart shows the 15 nations with at least $100 billion in annual imports and exports. Together, these nations account for 73% of the world’s exports. World trade thus defines nine major languages: English, German, Japanese, Chinese, French, Italian, Dutch, Korean, and Spanish. Three key drivers decide which languages get attention: Where is the money? (The G7 languages are well covered.) Where are the people? (This seems to have a much smaller effect.) Where are the problems? (This explains the interest in Farsi, Korean, etc.) Source: Wikipedia (mostly 2017 estimates)
Most Widely-Spoken Languages
Source: Ethnologue (SIL), 2018
Global Internet Users and Web Pages

[Chart: distribution of global Internet users and of Web pages by language.]
What Does “Multilingual” Mean?

Mixed-language document: a document containing more than one language
Mixed-language collection: a collection of documents in different languages
Multi-monolingual system: can retrieve from a mixed-language collection
Cross-language system: a query in one language finds documents in another
(Truly) multilingual system: queries can find documents in any language
A Story in Two Parts

IR from the ground up in any language (focusing on document representation)
Cross-Language IR (to the extent time allows)
[Diagram: documents and queries each pass through a representation function; a comparison function matches the query representation against an index of document representations to produce hits.]
ASCII: American Standard Code for Information Interchange (ANSI X3.4)

|  0 NUL | 32 SP | 64 @ |  96 `   |
|  1 SOH | 33 !  | 65 A |  97 a   |
|  2 STX | 34 "  | 66 B |  98 b   |
|  3 ETX | 35 #  | 67 C |  99 c   |
|  4 EOT | 36 $  | 68 D | 100 d   |
|  5 ENQ | 37 %  | 69 E | 101 e   |
|  6 ACK | 38 &  | 70 F | 102 f   |
|  7 BEL | 39 '  | 71 G | 103 g   |
|  8 BS  | 40 (  | 72 H | 104 h   |
|  9 HT  | 41 )  | 73 I | 105 i   |
| 10 LF  | 42 *  | 74 J | 106 j   |
| 11 VT  | 43 +  | 75 K | 107 k   |
| 12 FF  | 44 ,  | 76 L | 108 l   |
| 13 CR  | 45 -  | 77 M | 109 m   |
| 14 SO  | 46 .  | 78 N | 110 n   |
| 15 SI  | 47 /  | 79 O | 111 o   |
| 16 DLE | 48 0  | 80 P | 112 p   |
| 17 DC1 | 49 1  | 81 Q | 113 q   |
| 18 DC2 | 50 2  | 82 R | 114 r   |
| 19 DC3 | 51 3  | 83 S | 115 s   |
| 20 DC4 | 52 4  | 84 T | 116 t   |
| 21 NAK | 53 5  | 85 U | 117 u   |
| 22 SYN | 54 6  | 86 V | 118 v   |
| 23 ETB | 55 7  | 87 W | 119 w   |
| 24 CAN | 56 8  | 88 X | 120 x   |
| 25 EM  | 57 9  | 89 Y | 121 y   |
| 26 SUB | 58 :  | 90 Z | 122 z   |
| 27 ESC | 59 ;  | 91 [ | 123 {   |
| 28 FS  | 60 <  | 92 \ | 124 |   |
| 29 GS  | 61 =  | 93 ] | 125 }   |
| 30 RS  | 62 >  | 94 ^ | 126 ~   |
| 31 US  | 63 ?  | 95 _ | 127 DEL |
The Latin-1 Character Set

ISO 8859-1: 8-bit characters for Western Europe
French, Spanish, Catalan, Galician, Basque, Portuguese, Italian, Albanian, Afrikaans, Dutch, German, Danish, Swedish, Norwegian, Finnish, Faroese, Icelandic, Irish, Scottish, and English
The lower half holds the printable 7-bit ASCII characters; the upper half holds the additional characters defined by ISO 8859-1
Other ISO-8859 Character Sets

-2 (Central European), -3 (South European), -4 (North European), -5 (Cyrillic), -6 (Arabic), -7 (Greek), -8 (Hebrew), -9 (Turkish)
East Asian Character Sets

More than 256 characters are needed, so two-byte encoding schemes (e.g., EUC) are used
Several countries have unique character sets: GB in the People’s Republic of China, Big5 in Taiwan, JIS in Japan, KS in Korea, TCVN in Vietnam
Many characters appear in several languages
The Research Libraries Group developed EACC, a unified “CJK” character set for USMARC records
Unicode: a single code for all the world’s characters

ISO Standard 10646
Separates “code space” from “encoding”
The code space extends Latin-1: the first 256 positions are identical
UTF-7 encoding will pass through 7-bit channels (it uses only 64 printable ASCII characters)
UTF-8 encoding is designed for disk file systems
Limitations of Unicode

Produces larger files than Latin-1
Fonts may be hard to obtain for some characters
Some characters have multiple representations (e.g., accents can be part of a character or separate)
Some characters look identical when printed, but come from unrelated languages
The encoding does not define the “sort order”
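The multiple-representation problem can be seen directly with Python’s unicodedata module; matching fails unless both sides are normalized to the same form:

```python
import unicodedata

# "é" can be one precomposed code point (U+00E9) or "e" followed by a
# combining acute accent (U+0301).  The two print identically but
# compare unequal until normalized to a common form (here NFC).
precomposed = "\u00e9"   # é as a single character
decomposed = "e\u0301"   # e + combining accent

print(precomposed == decomposed)                                 # False
print(unicodedata.normalize("NFC", decomposed) == precomposed)   # True
```

IR systems therefore normalize text (typically to NFC or NFKC) before indexing and before query processing.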
Strings and Segments

Retrieval is (often) a search for concepts, but what we actually search are character strings
What strings best represent concepts?
In English, words are often a good choice, and well-chosen phrases might also be helpful
In German, compounds may need to be split; otherwise queries using constituent words would fail
In Chinese, word boundaries are not marked: Thissegmentationproblemissimilartothatofspeech
Tokenization

Words (from linguistics): morphemes are the units of meaning, combined to make words
  anti + (disestablishmentarian) + ism
Tokens (from computer science):
  Doug ’s running late !
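A minimal tokenizer along these lines can be sketched with a regular expression; the clitic handling here is a simplification for illustration, not a full linguistic analysis:

```python
import re

# Split off clitics like "'s" and punctuation as separate tokens,
# keeping ordinary words whole.
TOKEN = re.compile(r"'\w+|\w+|[^\w\s]")

def tokenize(text):
    return TOKEN.findall(text)

print(tokenize("Doug's running late!"))  # ['Doug', "'s", 'running', 'late', '!']
```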
Morphological Segmentation

Swahili example:
  a + li + ni + andik + ish + a
  he + past-tense + me + write + causer-effect + declarative-mode
Credit: Ramy Eskander
Morphological Segmentation

Somali example:
  cun + t + aa
  eat + she + present-tense
Credit: Ramy Eskander
Stemming: conflates words, usually preserving meaning

Rule-based suffix-stripping helps for English: {destroy, destroyed, destruction} → destr
Prefix-stripping is needed in some languages: Arabic {alselam} → selam [root: SLM (peace)]
Stemming is imperfect; the goal is to usually be helpful
Overstemming: {centennial, century, center} → cent
Understemming: {acquire, acquiring, acquired} → acquir but {acquisition} → acquis
Snowball: a rule-based system for making stemmers
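The understemming behavior above can be reproduced with a toy suffix stripper; the suffix list and minimum stem length below are illustrative assumptions, far simpler than the real Porter or Snowball rules:

```python
# A minimal rule-based suffix stripper.  Real stemmers use ordered rule
# sets with conditions on the stem; this sketch just strips the first
# matching suffix if enough stem remains.
SUFFIXES = ["itions", "ition", "ations", "ation", "ing", "ed", "es", "s", "e"]

def stem(word, min_stem=4):
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: len(word) - len(suffix)]
    return word

# Understemming: morphologically related words get different stems.
print(stem("acquire"), stem("acquiring"), stem("acquired"))  # acquir acquir acquir
print(stem("acquisition"))                                   # acquis
```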
Longest Substring Segmentation

A greedy algorithm based on a lexicon:
Start with a list of every possible term
For each unsegmented string, remove the longest single substring found in the list
Repeat until no substrings are found in the list
Longest Substring Example

A possible German compound term (!): washington
List of German words: ach, hin, hing, sei, ton, was, wasch
Longest substring segmentation: was-hing-ton
Roughly translates as “What tone is attached?”
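The greedy procedure can be sketched in a few lines, using the toy lexicon from the example:

```python
def segment(text, lexicon):
    """Greedy longest-substring segmentation: carve out the longest
    lexicon entry found anywhere in the string, then recurse on the
    pieces to its left and right."""
    if not text:
        return []
    for word in sorted(lexicon, key=len, reverse=True):
        idx = text.find(word)
        if idx != -1:
            left = segment(text[:idx], lexicon)
            right = segment(text[idx + len(word):], lexicon)
            return left + [word] + right
    return [text]  # no lexicon entry found; leave the span unsegmented

lexicon = ["ach", "hin", "hing", "sei", "ton", "was", "wasch"]
print("-".join(segment("washington", lexicon)))  # was-hing-ton
```

Note that the greedy choice of “hing” (the longest match) forces the was/ton split, which is exactly how the implausible reading arises.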
[Slide: a Chinese segmentation-ambiguity example. The same character string can segment into terms meaning oil / petroleum, probe / survey, take samples, and restrain, or into cymbidium goeringii (an orchid).]
Probabilistic Segmentation

For an input string c1 c2 c3 … cn, try all possible partitions into words w1 w2 w3 …:
  c1 | c2 c3 … cn
  c1 c2 | c3 … cn
  c1 c2 c3 | … cn
  etc.
Choose the highest-probability partition, computing Pr(w1 w2 w3 …) using a language model
Challenges: search, probability estimation
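A standard way to make the search tractable is dynamic programming over split points, scoring each partition with a unigram language model. The word probabilities below are illustrative, not estimated from a real corpus:

```python
import math

def segment_prob(text, unigram_probs, max_word_len=10):
    """Find the partition with the highest product of unigram word
    probabilities, via dynamic programming over split points."""
    n = len(text)
    # best[i] = (best log-probability of text[:i], split point that achieves it)
    best = [(0.0, 0)] + [(-math.inf, 0)] * n
    for i in range(1, n + 1):
        for j in range(max(0, i - max_word_len), i):
            word = text[j:i]
            if word in unigram_probs and best[j][0] > -math.inf:
                score = best[j][0] + math.log(unigram_probs[word])
                if score > best[i][0]:
                    best[i] = (score, j)
    if best[n][0] == -math.inf:
        return None  # no full partition exists under this vocabulary
    words, i = [], n
    while i > 0:
        j = best[i][1]
        words.append(text[j:i])
        i = j
    return list(reversed(words))

probs = {"was": 0.3, "hing": 0.1, "ton": 0.2, "wash": 0.2, "ing": 0.1}
print(segment_prob("washington", probs))  # ['was', 'hing', 'ton']
```

Here was·hing·ton (0.3 × 0.1 × 0.2) beats wash·ing·ton (0.2 × 0.1 × 0.2), so the language model, not greediness, picks the partition.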
Non-Segmentation: N-gram Indexing

Consider a Chinese document c1 c2 c3 … cn
Don’t segment (you could be wrong!)
Instead, treat every character bigram as a term: c1c2, c2c3, c3c4, …, cn-1cn
Break up queries the same way
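Character bigram extraction is a one-liner; any string works, so the Chinese example here is just for illustration:

```python
def char_bigrams(text):
    """Treat every overlapping character bigram as an indexing term,
    so no segmentation decision is ever needed."""
    return [text[i:i + 2] for i in range(len(text) - 1)]

print(char_bigrams("北京大学"))  # ['北京', '京大', '大学']
```

Queries are broken up the same way, so matching happens in bigram space rather than word space.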
A “Term” is Whatever You Index

Word sense
Token
Word
Stem
Character n-gram
Phrase
Summary: a term is whatever you index

So the key is to index the right kind of terms!
Start by finding fundamental features; we have focused on character-coded text, but the same ideas apply to handwriting, OCR, and speech
Combine characters into easily recognized units: words where possible, character n-grams otherwise
Apply further processing to optimize results: stemming, phrases, …
A Story in Two Parts

IR from the ground up in any language (focusing on document representation)
Cross-Language IR (to the extent time allows)
Query-Language CLIR

[Diagram: a Somali document collection passes through a translation system to produce an English document collection; a retrieval engine matches English queries against it, and the user examines and selects from the results.]
Document-Language CLIR

[Diagram: English queries pass through a translation system to produce Somali queries; a retrieval engine matches them against the Somali document collection, and the user examines and selects from the results.]
Query vs. Document Translation

Query translation:
  Efficient for short queries (but not for relevance feedback)
  Limited context for ambiguous query terms
Document translation:
  Rapid support for interactive selection
  Need only be done once (if the query language stays the same)
Indexing Time: Statistical Document Translation
Language-Neutral Retrieval

[Diagram: Somali query terms and English document terms are each “translated” into an interlingual representation, where retrieval produces a ranked list (1: 0.91, 2: 0.57, 3: 0.36).]
Translation Evidence

Lexical resources: phrase books, bilingual dictionaries, …
Large text collections: translations (“parallel”) or similar topics (“comparable”)
Similarity: similar writing (if the character set is the same), similar pronunciation
People: may be able to guess the topic from lousy translations

Fundamentally, there are four sources of knowledge that we can rely on when teaching a machine to translate. Perhaps the simplest is some form of dictionary. Dictionaries are very useful, but it is hard for machines to learn to select the right translation using a dictionary alone, because the machine has no real sense of context. Large collections of text can provide that context, however, and in recent years they have proven very useful as a basis for building machine translation systems. The best results have been obtained using very large collections of translated documents, which we call parallel text collections. The next two slides illustrate how that is done.
Types of Lexical Resources

Ontology: an organization of knowledge
Thesaurus: an ontology specialized to support search
Dictionary: a rich word list, designed for use by people
Lexicon: a rich word list, designed for use by a machine
Bilingual term list: pairs of translation-equivalent terms
[Chart: retrieval effectiveness for the full query, with named entities added, with named entities taken from the term list, and with named entities removed.]
Backoff Translation

The translation lexicon might contain stems, surface forms, or some combination of the two. Back off through progressively more aggressive matches between document terms and lexicon entries:
1. Document surface form matches a lexicon surface form: mangez ↔ mangez → eat
2. Document stem matches a lexicon surface form: mangez → mange ↔ mange → eats → eat
3. Document surface form matches a lexicon stem: mange ↔ mange (stem of mangez) → eat
4. Document stem matches a lexicon stem: mangez → mange ↔ mange (stem of mangent) → eat
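The backoff idea can be sketched with a toy suffix stripper standing in for a real stemmer; the mangez/mangent example follows the slide, but the stemmer and its suffix list are illustrative assumptions:

```python
def stem(word):
    # Toy stemmer: strip some French verb endings (illustrative only).
    for suffix in ("ent", "ez", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: len(word) - len(suffix)]
    return word

def backoff_translate(term, lexicon):
    """Try progressively more aggressive matches between a document
    term and a surface-form lexicon, stopping at the first success."""
    if term in lexicon:                      # 1. surface <-> surface
        return lexicon[term]
    if stem(term) in lexicon:                # 2. stem <-> surface
        return lexicon[stem(term)]
    stems = {stem(f): t for f, t in lexicon.items()}
    if term in stems:                        # 3. surface <-> stem
        return stems[term]
    return stems.get(stem(term))             # 4. stem <-> stem

print(backoff_translate("mangez", {"mangez": "eat"}))   # stage 1: eat
print(backoff_translate("mangez", {"mangent": "eat"}))  # stage 4: eat
```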
[Image: the Rosetta Stone, with the same text in hieroglyphic Egyptian, Demotic, and Greek.]
Types of Bilingual Corpora

Parallel corpora: translation-equivalent pairs (document pairs, sentence pairs, term pairs)
Comparable corpora: topically related (collection pairs)
Some Modern Rosetta Stones

News: DE-News (German-English); Hong Kong News, Xinhua News (Chinese-English)
Government: Canadian Hansards (French-English); Europarl (Danish, Dutch, English, Finnish, French, German, Greek, Italian, Portuguese, Spanish, Swedish); UN treaties (Russian, English, Arabic, …)
Religion: Bible, Koran, Book of Mormon
Word-Level Alignment

English: Diverging opinions about planned tax reform
German: Unterschiedliche Meinungen zur geplanten Steuerreform

English: Madam President , I had asked the administration …
Spanish: Señora Presidenta , había pedido a la administración del Parlamento …
A Translation Model

From word-aligned bilingual text, we induce a translation model. Example:
  p(探测|survey) = 0.4
  p(试探|survey) = 0.3
  p(测量|survey) = 0.25
  p(样品|survey) = 0.05
Using Multiple Translations

Weighted structured query translation takes advantage of multiple translations and their translation probabilities. The TF and DF of a query term e are computed from the TF and DF of its translations f:
  TF(e, d) = Σf p(f|e) · TF(f, d)
  DF(e) = Σf p(f|e) · DF(f)
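The computation can be sketched directly; the translation probabilities are the ones from the translation-model slide, and the toy term counts are invented for illustration:

```python
from collections import defaultdict

def weighted_tf_df(translations, tf_by_doc, df):
    """Estimate TF and DF for a query term from its translations.
    translations: {foreign_term: p(f|e)}
    tf_by_doc:    {foreign_term: {doc_id: tf}}
    df:           {foreign_term: document frequency}"""
    tf_e = defaultdict(float)
    df_e = 0.0
    for f, p in translations.items():
        df_e += p * df.get(f, 0)
        for doc, tf in tf_by_doc.get(f, {}).items():
            tf_e[doc] += p * tf
    return dict(tf_e), df_e

translations = {"探测": 0.4, "试探": 0.3, "测量": 0.25, "样品": 0.05}
tf_by_doc = {"探测": {"d1": 3}, "测量": {"d1": 1, "d2": 2}}
df = {"探测": 10, "试探": 4, "测量": 8, "样品": 2}

tf_e, df_e = weighted_tf_df(translations, tf_by_doc, df)
print(tf_e)  # d1 gets 0.4*3 + 0.25*1 = 1.45; d2 gets 0.25*2 = 0.5
print(df_e)  # 0.4*10 + 0.3*4 + 0.25*8 + 0.05*2 = 7.3
```

These synthetic TF and DF values then plug into an ordinary ranking function such as BM-25.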
BM-25

score(q, d) = Σ{t∈q} IDF(t) · TF(t,d) · (k1 + 1) / (TF(t,d) + k1 · (1 − b + b · |d| / avgdl))

where IDF(t) is computed from the document frequency of t, TF(t,d) is the term frequency of t in d, and |d| is the document length (avgdl is the average document length in the collection).
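A minimal sketch of the per-term score, using the common k1 and b defaults; the parameter values and counts are illustrative:

```python
import math

def bm25_term_score(tf, df, doc_len, avg_doc_len, n_docs, k1=1.2, b=0.75):
    """Score one query term: IDF from document frequency, a saturating
    term-frequency component, and length normalization via |d|/avgdl."""
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
    return idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))

# More occurrences help, with diminishing returns ...
print(bm25_term_score(tf=3, df=5, doc_len=100, avg_doc_len=120, n_docs=1000))
# ... and rare terms (low df) score higher than common ones.
print(bm25_term_score(tf=3, df=500, doc_len=100, avg_doc_len=120, n_docs=1000))
```

A full document score just sums this quantity over the query terms.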
Retrieval Effectiveness

[Chart: retrieval effectiveness results on the CLEF French collection.]
Bilingual Query Expansion

[Diagram: the source-language query can be expanded against a source-language collection before translation (pre-translation expansion), and the translated query can be expanded against the target-language collection (post-translation expansion), before final target-language retrieval.]
Query Expansion Effect
Paul McNamee and James Mayfield, SIGIR-2002
Cognate Matching

Dictionary coverage is inherently limited: translation of proper names, newly coined terms, and unfamiliar technical terms
Strategy: model derivational translation
  Orthography-based
  Pronunciation-based
Matching Orthographic Cognates

Retain untranslatable words unchanged: often works well between European languages
Rule-based systems: even off-the-shelf spelling correction can help!
Subword (e.g., character-level) MT, trained using a set of representative cognates
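One simple orthographic-cognate matcher is plain edit distance against the target-language vocabulary; the Spanish/English pair and tiny vocabulary below are illustrative:

```python
def edit_distance(a, b):
    """Plain Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

# Rank target-language vocabulary by closeness to an untranslatable term.
vocabulary = ["teléfono", "libro", "gato"]
print(min(vocabulary, key=lambda w: edit_distance("telephone", w)))  # teléfono
```

In practice the distance function is often learned from representative cognate pairs rather than treating all edits as equally costly.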
Matching Phonetic Cognates

Forward transliteration: generate all potential transliterations
Reverse transliteration: guess the source string(s) that produced a transliteration
Match in phonetic space
Cross-Language “Retrieval”

Query → Query Translation → Translated Query → Search → Ranked List
Interaction matters throughout the search process, monolingual or multilingual: interactive IR is interested not only in query formulation but also in how users predict and choose from the ranked list.
Uses of “MT” in CLIR

Term translation supports term matching during search
Query translation turns the formulated query into a translated query
Indicative (snippet) translation supports selection from the ranked list
Informative (full document) translation supports document examination and use
Each stage can feed back into query reformulation
Interactive Cross-Language Question Answering
iCLEF 2004
Questions, Grouped by Difficulty

8. Who is the managing director of the International Monetary Fund?
11. Who is the president of Burundi?
13. Of what team is Bobby Robson coach?
4. Who committed the terrorist attack in the Tokyo underground?
16. Who won the Nobel Prize for Literature in 1994?
6. When did Latvia gain independence?
14. When did the attack at the Saint-Michel underground station in Paris occur?
7. How many people were declared missing in the Philippines after the typhoon “Angela”?
2. How many human genes are there?
10. How many people died of asphyxia in the Baku underground?
15. How many people live in Bombay?
12. What is Charles Millon's political party?
1. What year was Thomas Mann awarded the Nobel Prize?
3. Who is the German Minister for Economic Affairs?
9. When did Lenin die?
5. How much did the Channel Tunnel cost?
For Further Reading

Multilingual IR: Paul McNamee et al., “Addressing Morphological Variation in Alphabetic Languages,” SIGIR, 2009
African-Language IR: Open CLIR Challenge (Swahili), IARPA, 2018; Nkosana Malumba et al., “AfriWeb: A Search Engine for a Marginalized Language,” ICADL, 2015
Cross-Language IR: Jian-Yun Nie, Cross-Language Information Retrieval, Synthesis Lectures on Human Language Technologies, Morgan & Claypool, 2010; Jianqiang Wang and Douglas W. Oard, “Matching Meaning for Cross-Language Information Retrieval,” Information Processing and Management, 2012