Cross Language Information Retrieval (CLIR)

Modern Information Retrieval, Sharif University of Technology, Fall 2005

The General Problem
Find documents written in any language
  Using queries expressed in a single language

We are rapidly constructing an extensive network infrastructure for moving information across national boundaries, but much remains to be done before linguistic barriers can be surmounted as effectively as geographic ones. Users seeking information from a digital library could benefit from the ability to query large collections once using a single language, even when more than one language is present in the collection. If the information they locate is not available in a language that they can read, some form of translation will be needed.

Traditional IR identifies relevant documents in the same language as the query. This problem is referred to as monolingual IR. Cross-language information retrieval (CLIR) tries to identify relevant documents in a language different from that of the query. This problem is increasingly acute for IR on the Web because the Web is a truly multilingual environment. In addition to the problems of monolingual IR, CLIR must cope with language differences between queries and documents. The key problem is query translation (or document translation).

The General Problem (cont)
Traditional IR identifies relevant documents in the same language as the query (monolingual IR)
Cross-language information retrieval (CLIR) tries to identify relevant documents in a language different from that of the query
This problem is increasingly acute for IR on the Web because the Web is a truly multilingual environment

Why is CLIR important?

Characteristics of the WWW
[Figure: Country of Origin of Public Web Sites, 2001 (% of Total). Source: OCLC Web Characterization, 2001]

Global Internet User Population
[Figure: charts for 2000 and 2005; surviving labels: English, English, Chinese. Source: Global Reach]

There are many predictions, and of course most of them will be wrong. But they at least raise interesting questions. And, of course, some of them will be right. The trick is to guess which ones!

CLIR is Multidisciplinary
CLIR involves researchers from the following fields: information retrieval, natural language processing, machine translation and summarization, speech processing, document image understanding, human-computer interaction

User Needs
Search a monolingual collection in a language that the user cannot read.
Retrieve information from a multilingual collection using a query in a single language.
Select images from a collection indexed with free text captions in an unfamiliar language.
Locate documents in a multilingual collection of scanned page images.

Why Do Cross-Language IR?
When users can read several languages
  Eliminates multiple queries
  Query in most fluent language
Monolingual users can also benefit
  If translations can be provided
  If it suffices to know that a document exists
  If text captions are used to search for images

Approaches to CLIR

Design Decisions
What to index?
  Free text or controlled vocabulary
What to translate?
  Queries or documents
Where to get translation knowledge?
  Dictionary, ontology, training corpus

Cross-Language Text Retrieval
[Figure: taxonomy of cross-language text retrieval approaches. Surviving node labels: Query Translation, Document Translation, Text Translation, Vector Translation, Controlled Vocabulary, Free Text, Knowledge-based, Corpus-based, Ontology-based, Dictionary-based, Thesaurus-based, Term-aligned, Sentence-aligned, Document-aligned, Unaligned, Parallel, Comparable]

Early Development
1964 International Road Research Documentation
  English, French and German thesaurus
1969 Pevzner
  Exact match with a large Russian/English thesaurus
1970 Salton
  Ranked retrieval with small English/German dictionary
1971 UNESCO
  Proposed standard for multilingual thesauri

Controlled Vocabulary Matures
1977 IBM STAIRS-TLS
  Large-scale commercial cross-language IR
1978 ISO Standard 5964
  Guidelines for developing multilingual thesauri
1984 EUROVOC thesaurus
  Now includes all 9 EC languages
1985 ISO Standard 5964 revised

Free Text Developments
1970, 1973 Salton
  Hand coded bilingual term lists
1990 Latent Semantic Indexing
1994 European multilingual IR project
  First precision/recall evaluation
1996 SIGIR Cross-lingual IR workshop
1998 EU/NSF digital library working group

Knowledge-based Techniques for Free Text Searching

Knowledge Structures for IR
Ontology
  Representation of concepts and relationships
Thesaurus
  Ontology specialized for retrieval
Bilingual lexicon
  Ontology specialized for machine translation
Bilingual dictionary
  Ontology specialized for human translation

Query vs. Document Translation
Query translation
  Very efficient for short queries
    Not as big an advantage for relevance feedback
  Hard to resolve ambiguous query terms
Document translation
  May be needed by the selection interface
    And supports adaptive filtering well
  Slow, but only need to do it once per document
    Poor scale-up to large numbers of languages

Language Identification
Can be specified using metadata
  Included in HTTP and HTML
Determined using word-scale features
  Which dictionary gets the most hits?
Determined using subword features
  Letter n-grams in electronic and printed text
  Phoneme n-grams in speech
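
The subword approach can be illustrated with a minimal sketch, assuming letter trigrams and a simple overlap score. The function names, the two training sentences, and the scoring rule are all hypothetical choices made for illustration; practical language identifiers train on far more text and typically use smoothed statistics.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Return a Counter of overlapping letter n-grams, padded with spaces."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def overlap(profile, sample):
    """Sum, over the sample's n-grams, the smaller of the two counts."""
    return sum(min(count, profile[gram]) for gram, count in sample.items())


def identify_language(text, profiles):
    """Pick the language whose n-gram profile overlaps most with the text."""
    sample = char_ngrams(text)
    return max(profiles, key=lambda lang: overlap(profiles[lang], sample))


# Toy training text (illustrative only; real profiles need far more data).
TRAINING = {
    "english": "the quick brown fox jumps over the lazy dog and the cat",
    "french": "le renard brun saute par dessus le chien paresseux et le chat",
}
PROFILES = {lang: char_ngrams(text) for lang, text in TRAINING.items()}

print(identify_language("the dog and the fox", PROFILES))
print(identify_language("le chien et le renard", PROFILES))
```

The word-scale variant ("which dictionary gets the most hits?") would follow the same shape, with whole words in place of trigrams.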

Document Translation Example
Approach
  Select a single query language
  Translate every document into that language
  Perform monolingual retrieval
Long documents provide enough context
  And many translation errors do not hurt retrieval
Much of the generation effort is wasted
  And choosing a single translation can hurt

Query Translation Example
Select controlled vocabulary search terms
Retrieve documents in desired language
Form monolingual query from the documents
Perform a monolingual free text search
[Figure: flow diagram. Surviving labels: Information Need, Thesaurus, French Query Terms, Controlled Vocabulary Multilingual Text Retrieval System, English Abstracts, Alta Vista, English Web Pages]

Machine Readable Dictionaries
Based on printed bilingual dictionaries
  Becoming widely available
Used to produce bilingual term lists
  Cross-language term mappings are accessible
    Sometimes listed in order of most common usage
  Some knowledge structure is also present
    Hard to extract and represent automatically
The challenge is to pick the right translation

Unconstrained Query Translation
Replace each word with every translation
  Typically 5-10 translations per word
About 50% of monolingual effectiveness
Ambiguity is a serious problem
Example: Fly (English)
  8 word senses (e.g., to fly a flag)
  13 Spanish translations (enarbolar, ondear, …)
  38 English retranslations (hoist, brandish, lift…)
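
A minimal sketch of the replacement step follows. The term list is hypothetical and tiny: "enarbolar" and "ondear" come from the example above, while the other entries and the pass-through rule for unknown words are assumptions added for illustration.

```python
# Hypothetical English-to-Spanish term list; entries are illustrative only.
TERM_LIST = {
    "fly": ["volar", "enarbolar", "ondear", "mosca"],
    "flag": ["bandera"],
}


def translate_query(query, term_list):
    """Replace each query word with every listed translation.

    Words missing from the term list are passed through unchanged
    (an assumed fallback, useful for proper names).
    """
    translated = []
    for word in query.lower().split():
        translated.extend(term_list.get(word, [word]))
    return translated


print(translate_query("fly flag", TERM_LIST))
```

Even in this toy case a two-word query grows to five terms, one of which ("mosca", the insect) has nothing to do with flags. That is the ambiguity problem in miniature.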

Exploiting Part-of-Speech Tags
Constrain translations by part of speech
  Noun, verb, adjective, …
  Effective taggers are available
Works well when queries are full sentences
  Short queries provide little basis for tagging
Constrained matching can hurt monolingual IR
  Nouns in queries often match verbs in documents
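
The constraint can be sketched as a lookup keyed by word and tag. Everything here is hypothetical: the term list, the tag names, and the assumption that the query arrives already tagged (no real tagger is called).

```python
# Hypothetical term list keyed by (English word, part of speech).
TAGGED_TERM_LIST = {
    ("fly", "verb"): ["volar", "ondear", "enarbolar"],
    ("fly", "noun"): ["mosca"],
}


def translate_tagged(tagged_query, term_list):
    """Translate (word, tag) pairs, keeping only translations for that tag."""
    result = []
    for word, tag in tagged_query:
        # Unknown (word, tag) pairs pass through untranslated.
        result.extend(term_list.get((word, tag), [word]))
    return result


print(translate_tagged([("fly", "noun")], TAGGED_TERM_LIST))
print(translate_tagged([("fly", "verb")], TAGGED_TERM_LIST))
```

The sketch also shows the weakness noted above: a wrong tag on a short query discards every correct translation.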

Phrase Indexing
Improves retrieval effectiveness two ways
  Phrases are less ambiguous than single words
  Idiomatic phrases translate as a single concept
Three ways to identify phrases
  Semantic (e.g., appears in a dictionary)
  Syntactic (e.g., parse as a noun phrase)
  Cooccurrence (words found together often)
Semantic phrase results are impressive

Corpus-based Techniques for Free Text Searching

Types of Bilingual Corpora
Parallel corpora: translation-equivalent pairs
  Document pairs
  Sentence pairs
  Term pairs
Comparable corpora
  Content-equivalent document pairs
Unaligned corpora
  Content from the same domain

Pseudo-Relevance Feedback
Enter query terms in French
Find top French documents in parallel corpus
Construct a query from English translations
Perform a monolingual free text search
[Figure: flow diagram. Surviving labels: French Query Terms, French Text Retrieval System, Parallel Corpus, Top ranked French Documents, English Translations, Alta Vista, English Web Pages]
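
The first three steps can be sketched as follows, assuming a tiny tokenized parallel corpus and a crude term-overlap ranker. The corpus, the function name, and the choice to keep the most frequent English terms are all hypothetical; the final step would hand the resulting query to any monolingual English search engine.

```python
from collections import Counter

# Hypothetical parallel corpus: each entry pairs a French document with its
# English translation, both already tokenized.
PARALLEL = [
    (["recherche", "information", "multilingue"],
     ["multilingual", "information", "retrieval"]),
    (["traduction", "automatique", "recherche"],
     ["machine", "translation", "retrieval"]),
    (["cuisine", "recettes"], ["cooking", "recipes"]),
]


def cross_language_feedback(french_query, corpus, top_docs=2, query_len=3):
    """Build an English query from the translations of the top French documents."""
    def overlap(pair):
        # Crude ranking: number of distinct query terms in the French side.
        return len(set(french_query) & set(pair[0]))

    ranked = sorted(corpus, key=overlap, reverse=True)[:top_docs]
    pool = Counter(term for _, english in ranked for term in english)
    # Most frequent terms first; ties broken alphabetically for determinism.
    return sorted(pool, key=lambda t: (-pool[t], t))[:query_len]


print(cross_language_feedback(["recherche", "information"], PARALLEL))
```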

Learning From Document Pairs
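Count how often each term occurs in each pair
  Treat each pair as a single document
[Table: term counts for five document pairs (rows Doc 1 to Doc 5) over English terms E1 to E5 and Spanish terms S1 to S4. Column positions were lost in extraction; the surviving counts per row are: Doc 1: 4, 2, 2, 1; Doc 2: 8, 4, 4, 2; Doc 3: 2, 2, 1, 2; Doc 4: 2, 1, 2, 1; Doc 5: 4, 1, 2, 1]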

Similarity-Based Dictionaries
Automatically developed from aligned documents
  Terms E1 and E3 are used in similar ways
    Terms E1 & S1 (or E3 & S4) are even more similar
  For each term, find most similar in other language
    Retain only the top few (5 or so)
Performs as well as dictionary-based techniques
  Evaluated on a comparable corpus of news stories
    Stories were automatically linked based on date and subject
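
One way to picture the construction is the sketch below, which assumes cosine similarity between rows of per-pair term counts. The counts are made up for illustration and are not the values from the slide's table; the similarity measure used in the original work is not stated here, so cosine is an assumption.

```python
import math

# Hypothetical term counts, one column per aligned document pair.
# These numbers are invented for illustration.
ENGLISH = {"E1": [4, 8, 2, 0, 0], "E2": [0, 0, 2, 2, 4]}
SPANISH = {"S1": [2, 4, 1, 0, 0], "S2": [0, 1, 2, 1, 2]}


def cosine(u, v):
    """Cosine similarity of two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_dictionary(source, target, top_k=1):
    """For each source term, keep the top_k most similar target terms."""
    return {
        term: sorted(target, key=lambda t: cosine(vec, target[t]),
                     reverse=True)[:top_k]
        for term, vec in source.items()
    }


print(similarity_dictionary(ENGLISH, SPANISH))
```

Raising `top_k` to about 5 would match the "retain only the top few" step.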

Generalized Vector Space Model
“Term space” of each language is different
  But the “document space” for a corpus is the same
Describe new documents based on the corpus
  Vector of cosine similarity to each corpus document
  Easily generated from a vector of term weights
    Multiply by the term-document matrix
Compute cosine similarity in document space
Excellent results when the domain is the same
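
The idea can be sketched in a few lines, assuming a hypothetical training corpus of three aligned document pairs and raw counts as term weights. An English query and a Spanish document are each described by their similarity to the corpus documents on their own language side, and the two descriptions are then compared directly.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def to_document_space(term_vector, corpus_side):
    """Describe a term vector by its cosine similarity to each corpus document."""
    return [cosine(term_vector, doc) for doc in corpus_side]


# Hypothetical bilingual training corpus of three aligned document pairs.
# Row i of each list is the same document in that language's term space.
ENGLISH_SIDE = [[2, 0, 0], [0, 3, 0], [0, 0, 1]]            # 3 English terms
SPANISH_SIDE = [[1, 0, 0, 1], [0, 2, 0, 0], [0, 0, 2, 0]]   # 4 Spanish terms

english_query = [1, 0, 0]      # uses English term 0 only
spanish_doc_a = [3, 0, 0, 2]   # resembles the Spanish side of pair 0
spanish_doc_b = [0, 0, 4, 0]   # resembles the Spanish side of pair 2

q = to_document_space(english_query, ENGLISH_SIDE)
score_a = cosine(q, to_document_space(spanish_doc_a, SPANISH_SIDE))
score_b = cosine(q, to_document_space(spanish_doc_b, SPANISH_SIDE))
print(round(score_a, 3), round(score_b, 3))
```

The query and the documents never share a term, yet `spanish_doc_a` outranks `spanish_doc_b` because both it and the query resemble the same training pair.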

Latent Semantic Indexing
Designed for better monolingual effectiveness
  Works well across languages too
    Cross-language is just a type of term choice variation
Produces short dense document vectors
  Better than long sparse ones for adaptive filtering
    Training data needs grow with dimensionality
  Not as good for retrieval efficiency
    Always 300 multiplications, even for short queries

Sentence-Aligned Parallel Corpora
Easily constructed from aligned documents
  Match pattern of relative sentence lengths
Not yet used directly for effective retrieval
  But all experiments have included domain shift
Good first step for term alignment
  Sentences define a natural context

Cooccurrence-Based Translation
Align terms using cooccurrence statistics
  How often does a term pair occur in sentence pairs?
    Weighted by relative position in the sentences
  Retain term pairs that occur unusually often
Useful for query translation
  Excellent results when the domain is the same
Also practical for document translation
  Term usage reinforces good translations
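
A minimal sketch of the counting step follows, assuming three hypothetical sentence pairs. It uses the Dice coefficient as the "unusually often" test, which is a swapped-in stand-in, and it omits the position weighting mentioned above.

```python
from collections import Counter
from itertools import product

# Hypothetical sentence-aligned pairs (English, Spanish), already tokenized.
SENTENCE_PAIRS = [
    (["the", "flag"], ["la", "bandera"]),
    (["the", "house"], ["la", "casa"]),
    (["a", "flag"], ["una", "bandera"]),
]


def cooccurrence_table(pairs):
    """Score term pairs with the Dice coefficient over aligned sentences."""
    joint, left, right = Counter(), Counter(), Counter()
    for source, target in pairs:
        left.update(set(source))
        right.update(set(target))
        joint.update(product(set(source), set(target)))
    return {
        (s, t): 2 * n / (left[s] + right[t]) for (s, t), n in joint.items()
    }


def best_translation(term, table):
    """Return the target term that cooccurs most strongly with `term`."""
    candidates = {t: score for (s, t), score in table.items() if s == term}
    return max(candidates, key=candidates.get)


TABLE = cooccurrence_table(SENTENCE_PAIRS)
print(best_translation("flag", TABLE))
print(best_translation("house", TABLE))
```

Normalizing by each term's own frequency is what keeps common words such as "la" from winning every alignment.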

Exploiting Unaligned Corpora
Documents about the same set of subjects
  No known relationship between document pairs
  Easily available in many applications
Two approaches
  Use a dictionary for rough translation
    But refine it using the unaligned bilingual corpus
  Use a dictionary to find alignments in the corpus
    Then extract translation knowledge from the alignments

Feedback with Unaligned Corpora
Pseudo-relevance feedback is fully automatic
  Augment the query with top ranked documents
Improves recall
  “Recenters” queries based on the corpus
  Short queries get the most dramatic improvement
Two opportunities:
  Query language: Improve the query
  Document language: Suppress translation error

Context Linking
Automatically align portions of documents
For each query term:
  Find translation pairs in corpus using dictionary
  Select a “context” of nearby terms
    e.g., +/- 5 words in each language
  Choose translations from most similar contexts
    Based on cooccurrence with other translation pairs
No reported experimental results

Language Encoding Standards
Language (alphabet) specific native encoding:
  Chinese: GB, Big5
  Western European: ISO (Latin1)
  Russian: KOI-8, ISO, CP-1251
UNICODE (ISO/IEC 10646)
  UTF-8: variable byte length
  UTF-16: two or four bytes per character
  UCS-2: fixed double-byte
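
The practical difference between these encodings is easy to see with Python's built-in codecs. The sketch below is illustrative only: the sample characters and the helper name `byte_lengths` are arbitrary choices, and `koi8_r` is assumed to be the KOI-8 variant meant above. Characters outside the Basic Multilingual Plane, such as the emoji, need four bytes in UTF-16 and cannot be represented in UCS-2.

```python
def byte_lengths(char):
    """Return the encoded length of one character in UTF-8 and UTF-16."""
    # "utf-16-be" is big-endian without a byte-order mark.
    return {codec: len(char.encode(codec)) for codec in ("utf-8", "utf-16-be")}


for char in ("A", "Я", "中", "😀"):
    print(repr(char), byte_lengths(char))

# The same Cyrillic letter maps to different single bytes in legacy encodings,
# which is why a document's encoding must be known before indexing it.
legacy = {codec: "Я".encode(codec) for codec in ("koi8_r", "cp1251")}
print({codec: data.hex() for codec, data in legacy.items()})
```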

Performance Evaluation

Constructing Test Collections
One collection for retrospective retrieval
  Start with a monolingual test collection
    Documents, queries, relevance judgments
  Translate the queries by hand
Need 2 collections for adaptive filtering
  Monolingual test collection in one language
  Plus a document collection in the other language
    Generate relevance judgments for the same queries

Evaluating Corpus-Based Techniques
Same domain evaluation
  Partition a bilingual corpus
  Design queries
  Generate relevance judgments for evaluation part
Cross-domain evaluation
  Can use existing collections and corpora
  No good metric for degree of domain shift

Evaluation Example
Corpus-based same domain evaluation
Use average precision as figure of merit

Technique                        Cross-lang  Monolingual  Ratio
Cooccurrence-based dictionary    0.43        0.47         91%
Pseudo-relevance feedback        0.40        0.44         90%
Generalized vector space model   0.38 [one value missing]  95%
Latent semantic indexing         0.31        0.37         84%
Dictionary-based translation     0.29 [one value missing]  61%

From Carbonell, et al., “Translingual Information Retrieval: A Comparative Evaluation,” IJCAI-97
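
For readers unfamiliar with the figure of merit, here is a minimal sketch of average precision for a single query, assuming the non-interpolated definition. The cited study may have used a different variant (for example an interpolated one), and the document identifiers below are invented.

```python
def average_precision(ranking, relevant):
    """Mean of the precision values at the rank of each relevant document.

    Relevant documents that are never retrieved contribute zero.
    """
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


ranking = ["d3", "d1", "d7", "d2"]   # hypothetical system output
relevant = {"d3", "d2"}              # hypothetical relevance judgments
print(average_precision(ranking, relevant))
```

The Ratio column is the cross-language score divided by the monolingual score for the same technique, so a value near 100% means translation cost little.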

User Interface Design

Query Formulation
Interactive word sense disambiguation
Show users the translated query
  Retranslate it for monolingual users
Provide an easy way of adjusting it
  But don’t require that users adjust or approve it

Selection and Examination
Document selection is a decision process
  Relevance feedback, problem refinement, read it
  Based on factors not used by the retrieval system
Provide information to support that decision
May not require very good translations
  e.g., Word-by-word title translation
  People can “read past” some ambiguity
    May help to display a few alternative translations

Summary
Controlled vocabulary
  Mature, efficient, easily explained
Dictionary-based
  Simple, broad coverage
Comparable and parallel corpora
  Effective in the same domain
Unaligned corpora
  Experimental
