Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001.


1 Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001

2 Agenda Questions Overview –The information –The users Cross-Language Search User Interaction

3 The Grand Plan Phase 1: What makes up an IR system? –perspectives on the elephant Phase 2: Representations –words, ratings Phase 3: Beyond English text –ideas applied in many settings

4 A Driving Example Visual History Foundation –Interviews with Holocaust survivors –39 years’ worth of audio/video –32 languages; accented, emotional speech –30 people, 2 years: $12 million Joint project: MALACH –VHF, IBM, JHU, UMD –http://www.clsp.jhu.edu/research/malach

5 Translation [Diagram relating translation to the search process: translingual search operates on the query and the document during information access; translingual browsing supports selecting and examining documents as information access shades into information use]

6 A Little (Confusing) Vocabulary Multilingual document –Document containing more than one language Multilingual collection –Collection of documents in different languages Multilingual system –Can retrieve from a multilingual collection Cross-language system –Query in one language finds document in another Translingual system –Queries can find documents in any language

7 Who needs Cross-Language Search? When users can read several languages –Eliminate multiple queries –Query in most fluent language Monolingual users can also benefit –If translations can be provided –If it suffices to know that a document exists –If text captions are used to search for images

8 Motivations Commerce Security Social

9 Global Internet Hosts Source: Network Wizards Jan 99 Internet Domain Survey

10 Global Web Page Languages [Chart] Source: Jack Xu, Excite@Home, 1999

11 European Web Content Source: European Commission, Evolution of the Internet and the World Wide Web in Europe, 1997

12 European Web Size Projection Source: Extrapolated from Grefenstette and Nioche, RIAO 2000

13 Global Internet Audio Almost 2000 Internet-accessible Radio and Television Stations Source: www.real.com, Feb 2000

14 13 Months Later About 2500 Internet-accessible Radio and Television Stations Source: www.real.com, Mar 2001

15 User Needs Assessment Who are the potential users? What goals do we seek to support? What language skills must we accommodate?

16 Global Languages Source: http://www.g11n.com/faq.html

17 Global Trade Billions of US Dollars (1999) Source: World Trade Organization 2000 Annual Report

18 Global Internet User Population [Chart: projected English-speaking vs. Chinese-speaking Internet user populations, 2000–2005] Source: Global Reach

19 Agenda Questions Overview Cross-Language Search User Interaction

20 The Search Process [Diagram: a document author infers concepts and selects document-language terms; a monolingual searcher chooses document-language query terms; a cross-language searcher chooses query-language terms; query-document matching connects the query side to the document side]

21 [Slide consists of an image only]

22 Some history: from controlled vocabulary to free text 1964 International Road Research –Multilingual thesauri 1970 SMART –Dictionary-based free-text cross-language retrieval 1978 ISO Standard 5964 (revised 1985) –Guidelines for developing multilingual thesauri 1990 Latent Semantic Indexing –Corpus-based free-text translingual retrieval

23 Multilingual Thesauri Build a cross-cultural knowledge structure –Cultural differences influence indexing choices Use language-independent descriptors –Matched to language-specific lead-in vocabulary Three construction techniques –Build it from scratch –Translate an existing thesaurus –Merge monolingual thesauri

24 Free Text CLIR What to translate? –Queries or documents Where to get translation knowledge? –Dictionary or corpus How to use it?

25 Translingual Retrieval Architecture [Diagram: a Chinese query undergoes Chinese term selection and feeds monolingual Chinese retrieval (ranked list 1: 0.72, 2: 0.48); after language identification, English and Chinese term selection feed cross-language retrieval (ranked list 3: 0.91, 4: 0.57, 5: 0.36)]

26 Evidence for Language Identification Metadata –Included in HTTP and HTML Word-scale features –Which dictionary gets the most hits? Subword features –Character n-gram statistics
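To make the subword idea concrete, here is a minimal Python sketch of character-n-gram language identification (not from the slides; the sample profile texts and the overlap scoring are illustrative assumptions): each language gets a character-trigram frequency profile, and a text is assigned to the profile it overlaps most.

```python
# Illustrative sketch of character-n-gram language ID; the sample
# profile texts below are made up for the example.
from collections import Counter

def ngram_profile(text, n=3):
    """Relative frequencies of character n-grams in a text."""
    text = text.lower()
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(grams.values())
    return {g: c / total for g, c in grams.items()}

def identify_language(text, profiles):
    """Assign the language whose profile shares the most n-gram mass."""
    doc = ngram_profile(text)
    overlap = lambda p: sum(min(doc.get(g, 0.0), w) for g, w in p.items())
    return max(profiles, key=lambda lang: overlap(profiles[lang]))

profiles = {lang: ngram_profile(sample) for lang, sample in {
    "en": "the quick brown fox jumps over the lazy dog near the old bank",
    "fr": "le renard brun rapide saute par dessus le chien paresseux",
}.items()}
print(identify_language("the dog ran over the bank", profiles))  # -> en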

27 Query-Language Retrieval [Diagram: English document terms are mapped to Chinese by document translation; monolingual Chinese retrieval then matches the Chinese query terms, producing a ranked list (3: 0.91, 4: 0.57, 5: 0.36)]

28 Example: Modular use of MT Select a single query language Translate every document into that language Perform monolingual retrieval

29 Is Machine Translation Enough? [Chart: retrieval effectiveness on TDT-3 Mandarin broadcast news, comparing Systran with balanced 2-best translation]

30 Document-Language Retrieval [Diagram: Chinese query terms are mapped to English by query translation; monolingual English retrieval then matches the English document terms, producing a ranked list (3: 0.91, 4: 0.57, 5: 0.36)]

31 Query vs. Document Translation Query translation –Efficient for short queries (not relevance feedback) –Limited context for ambiguous query terms Document translation –Rapid support for interactive selection –Need only be done once (if query language is same) Merged query and document translation –Can produce better effectiveness than either alone

32 The Short Query Challenge Source: Jack Xu, Excite@Home, 1999

33 Interlingual Retrieval [Diagram: Chinese query terms (via query translation) and English document terms (via document translation) meet in a shared interlingua, where interlingual retrieval produces a ranked list (3: 0.91, 4: 0.57, 5: 0.36)]

34 Key Challenges in CLIR [Example: a Chinese query whose candidate English renderings include probe / survey / take samples, cymbidium goeringii, restrain, and oil / petroleum, illustrating three problems: wrong segmentation, choosing which translation, and terms with no translation]

35 What’s a “Term?” Granularity of a “term” depends on the task –Long for translation, more fine-grained for retrieval Phrases improve translation two ways –Less ambiguous than single words –Idiomatic expressions translate as a single concept Three ways to identify phrases –Semantic (e.g., appears in a dictionary) –Syntactic (e.g., parse as a noun phrase) –Co-occurrence (appear together unexpectedly often)
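A small sketch of the third phrase-identification method, co-occurrence: flag bigrams whose pointwise mutual information is unexpectedly high. This is illustrative Python, not from the lecture; the toy corpus, minimum count, and threshold are invented.

```python
# Illustrative co-occurrence phrase finder using pointwise mutual
# information (PMI); counts, threshold, and the toy corpus are invented.
import math
from collections import Counter

def cooccurrence_phrases(tokens, min_count=2, threshold=2.0):
    """Flag bigrams that appear together unexpectedly often (high PMI)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    phrases = []
    for (a, b), c in bigrams.items():
        if c < min_count:
            continue
        pmi = math.log2((c / n) / ((unigrams[a] / n) * (unigrams[b] / n)))
        if pmi >= threshold:
            phrases.append((a, b, round(pmi, 2)))
    return sorted(phrases, key=lambda t: -t[2])

text = ("hong kong stock markets rose while hong kong banks fell "
        "and markets in hong kong stayed busy").split()
print(cooccurrence_phrases(text))  # [('hong', 'kong', 2.5)]
```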

36 Sources of Evidence for Translation Corpus statistics Lexical resources Algorithms The user

37 [Image: the Rosetta Stone — the same text in Egyptian Hieroglyphic, Demotic, and Greek]

38 Types of Bilingual Corpora Parallel corpora: translation-equivalent pairs –Document pairs –Sentence pairs –Term pairs Comparable corpora: topically related –Collection pairs –Document pairs

39 Exploiting Parallel Corpora Automatic acquisition of translation lexicons Statistical machine translation Corpus-guided translation selection Document-linked techniques

40 Lexicon acquisition from the WWW [Diagram: STRAND mining yields 3378 document pairs; chunk-level alignment produces 63K chunks (500K words); GIZA word alignment plus association statistics with frequency-based thresholding yield a 170K-entry translation lexicon. Example aligned chunk: “… cannot understand crew commands …” / “… ne comprenez pas les instructions de l’ equip…”]

41 Corpus-Guided Translation Selection Rank translation alternatives for each term –Pick the English word e that maximizes Pr(e) –Pick the English word e that maximizes Pr(e|c) –Pick English words e1…en maximizing Pr(e1…en | c1…cm) = statistical machine translation! Unigram language models are easy to build –Can use the collection being searched –Limits uncommon translation and spelling error effects
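As a sketch of the first option, picking the translation e that maximizes Pr(e) under a unigram model of the collection being searched might look like this in Python (the candidate set, toy corpus, and add-one smoothing are illustrative assumptions, not from the slides):

```python
# Sketch of unigram-LM translation selection; the candidate set and
# toy corpus are invented for illustration.
from collections import Counter

def select_translation(candidates, collection_tokens):
    """Pick the translation e that maximizes Pr(e) under a unigram
    model of the collection being searched (add-one smoothed)."""
    lm = Counter(collection_tokens)
    total = sum(lm.values())
    vocab = len(lm)
    return max(candidates, key=lambda e: (lm[e] + 1) / (total + vocab))

corpus = "the bank raised interest rates while the river bank flooded".split()
# German "Bank" could mean bank, bench, or embankment; the collection
# statistics prefer the common, correctly spelled alternative.
print(select_translation(["bank", "bench", "embankment"], corpus))  # bank
```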

42 Corpus-Based CLIR Example [Diagram: French query terms retrieve the top-ranked French documents from a parallel corpus via a French IR system; the English translations linked to those documents then query an English IR system, producing the top-ranked English documents]

43 Exploiting Comparable Corpora Blind relevance feedback –Existing CLIR technique + collection-linked corpus Lexicon enrichment –Existing lexicon + collection-linked corpus Dual-space techniques –Document-linked corpus

44 Blind Relevance Feedback Augment a representation with related terms –Find related documents, extract distinguishing terms Multiple opportunities: –Before doc translation: Enrich the vocabulary –After doc translation: Mitigate translation errors –Before query translation: Improve the query –After query translation: Mitigate translation errors Short queries get the most dramatic improvement
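A rough Python sketch of the idea (illustrative only; the term-scoring heuristic below is a stand-in for a proper TF*IDF-style selection): take the top-k retrieved documents, score their terms by how well they distinguish those documents from the rest of the ranking, and append the best m terms to the query.

```python
# Rough sketch of blind relevance feedback; the scoring heuristic is
# a stand-in for a proper TF*IDF-style term selection.
from collections import Counter

def expand_query(query_terms, ranked_docs, k=2, m=2):
    """Add the m terms that best distinguish the top-k retrieved
    documents (token lists, best first) from the rest of the ranking."""
    top, rest = ranked_docs[:k], ranked_docs[k:]
    top_tf = Counter(t for doc in top for t in doc)
    rest_df = Counter(t for doc in rest for t in set(doc))
    score = lambda t: top_tf[t] / (1 + rest_df[t])  # frequent in top, rare elsewhere
    candidates = [t for t in top_tf if t not in query_terms]
    return list(query_terms) + sorted(candidates, key=score, reverse=True)[:m]

docs = [["el", "nino", "weather", "pacific"],
        ["el", "nino", "pacific", "warming"],
        ["football", "scores", "weekend"]]
print(expand_query(["nino"], docs))  # ['nino', 'el', 'pacific']
```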

45 Example: Post-Translation “Document Expansion” [Diagram: each Mandarin Chinese document is automatically segmented and given a term-to-term translation; the translated document queries an English corpus, terms selected from the top 5 retrieved documents are added to the document to be indexed, and the enriched index then serves English queries]

46 Post-Translation Document Expansion [Chart: retrieval effectiveness on Mandarin newswire text, with and without post-translation document expansion]

47 Why Document Expansion Works Story-length objects provide useful context Ranked retrieval finds signal amid the noise Selective terms discriminate among documents –Enrich index with low DF terms from top documents Similar strategies work well in other applications –CLIR query translation –Monolingual spoken document retrieval

48 Lexicon Enrichment [Example: a context region containing “Cross-Language Evaluation Forum” is aligned with a foreign region containing unknown terms (“… Solto Extunifoc Tanixul Knadu …”), suggesting candidate translation pairs] Similar techniques can guide translation selection

49 Lexicon Enrichment Use a bilingual lexicon to align “context regions” –Regions with high coincidence of known translations Pair unknown terms with unmatched terms –Unknown: language A, not in the lexicon –Unmatched: language B, not covered by translation Treat the most surprising pairs as new translations Not yet tested in a CLIR application

50 Learning From Document Pairs [Table: term-by-document frequency counts for linked documents Doc 1–Doc 5, with English terms E1–E5 and Spanish terms S1–S4 as rows; terms with similar rows (e.g., E1 and S1) are candidate translations]

51 Similarity “Thesauri” For each term, find most similar in other language –Terms E1 & S1 (or E3 & S4) are used in similar ways Treat top related terms as candidate translations –Applying dictionary-based techniques Performed well on comparable news corpus –Automatically linked based on date and subject codes

52 Generalized Vector Space Model “Term space” of each language is different –Document links define a common “document space” Describe documents based on the corpus –Vector of similarities to each corpus document Compute cosine similarity in document space Very effective in a within-domain evaluation

53 Latent Semantic Indexing Cosine similarity captures noise with signal –Term choice variation and word sense ambiguity Signal-preserving dimensionality reduction –Conflates terms with similar usage patterns Reduces term choice effect, even across languages Computationally expensive
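For concreteness, a minimal numpy sketch of cross-language LSI under the dual-space setup above (the matrix values are invented; a real system would build the matrix from a large document-linked bilingual corpus): stack both languages' term rows in one matrix, truncate the SVD, and rank documents for a query posed in either language.

```python
# Minimal cross-language LSI sketch; matrix values are invented.
import numpy as np

# Term-by-document matrix over a document-linked bilingual corpus:
# rows are English terms E1-E3 followed by Spanish terms S1-S2.
X = np.array([
    [4., 2., 0., 2., 1.],
    [0., 8., 4., 4., 2.],
    [2., 0., 2., 2., 1.],
    [2., 1., 0., 2., 1.],
    [4., 1., 2., 1., 0.],
])

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2                     # signal-preserving truncation: keep k dimensions
docs_k = Vt[:k].T         # documents in the shared reduced space

def fold_in(query_vec):
    """Project a term-space query (terms from either language) into LSI space."""
    return (query_vec @ U[:, :k]) / s[:k]

q_k = fold_in(np.array([1., 0., 0., 0., 0.]))  # query using English term E1
sims = docs_k @ q_k / (np.linalg.norm(docs_k, axis=1) * np.linalg.norm(q_k))
print(np.argsort(-sims))  # document ranking, comparable across languages
```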

54 Cognate Matching Dictionary coverage is inherently limited –Translation of proper names –Translation of newly coined terms –Translation of unfamiliar technical terms Strategy: model derivational translation –Orthography-based –Pronunciation-based

55 Matching Orthographic Cognates Retain untranslatable words unchanged –Often works well between European languages Rule-based systems –Even off-the-shelf spelling correction can help! Character-level statistical MT –Trained using a set of representative cognates
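A minimal sketch of orthographic cognate matching by normalized edit distance (illustrative Python; the 0.34 cutoff is an arbitrary assumption, and real systems use rules or character-level statistical models as the slide notes):

```python
# Illustrative orthographic cognate matcher; the 0.34 cutoff is arbitrary.
def edit_distance(a, b):
    """Levenshtein distance by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def cognate_candidates(word, target_vocab, max_norm_dist=0.34):
    """Map an untranslatable word to orthographically close target words."""
    hits = [(t, edit_distance(word, t) / max(len(word), len(t)))
            for t in target_vocab]
    return sorted([h for h in hits if h[1] <= max_norm_dist], key=lambda h: h[1])

# French "ministre" has no dictionary entry; cognate matching rescues it.
print(cognate_candidates("ministre", ["minister", "ministry", "monster"]))
```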

56 Matching Phonetic Cognates Forward transliteration –Generate all potential transliterations Reverse transliteration –Guess source string(s) that produced a transliteration Match in phonetic space

57 Sources of Evidence for Translation Corpus statistics Lexical resources Algorithms The user

58 Types of Lexical Resources Ontology Thesaurus Lexicon Dictionary Bilingual term list

59 Dictionary-Based Query Translation Original query: El Nino and infectious diseases Term selection: “El Nino” infectious diseases Term translation: (dictionary coverage: “El Nino” is not found) Translation selection, query formulation, and structure: [the translated terms shown on the slide are not preserved in the transcript]

60 The Unbalanced Translation Problem Common query terms may have many translations Some of the translations may be rare IR systems give rare translations higher weights The wrong documents get highly ranked

61 Solution 1: Balanced Translation Replace each term with plausible translations –Common terms have many translations –Specific terms are more useful for retrieval Balance the contribution of each translation –Modular: duplicate translations –Integrated: average the contributions

62 Solution 2: “Structured Queries” Weight of term a in a document i depends on: –TF(a,i): frequency of term a in document i –DF(a): how many documents term a occurs in Build pseudo-terms from alternate translations –TF(syn(a,b), i) = TF(a,i) + TF(b,i) –DF(syn(a,b)) = |{docs with a} ∪ {docs with b}| Downweights terms with any common translation –Particularly effective for long queries
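The syn(…) pseudo-term statistics are simple to state in code. A minimal Python sketch (the postings data are invented, and real systems such as Inquery's #syn integrate this into the retrieval engine rather than computing it externally):

```python
# Sketch of Pirkola-style syn() pseudo-term statistics; postings data
# are invented for illustration.
def syn_tf(translations, doc_tf):
    """TF of syn(a, b, ...) in a document: sum of the members' TFs."""
    return sum(doc_tf.get(t, 0) for t in translations)

def syn_df(translations, postings):
    """DF of syn(a, b, ...): size of the UNION of member posting lists."""
    docs = set()
    for t in translations:
        docs |= postings.get(t, set())
    return len(docs)

postings = {"bank": {1, 2, 3, 9}, "bench": {4}, "embankment": {2, 7}}
alts = ["bank", "bench", "embankment"]        # translations of one query term
print(syn_tf(alts, {"bank": 3, "bench": 1}))  # 4
print(syn_df(alts, postings))                 # 6 -- a union, not a sum
```

Because DF is a union, a query term with even one common translation gets a large DF and hence a small weight, which is exactly the downweighting the slide describes.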

63 Computing Weights [The example query terms (1, 2, 3) and weight formulas on this slide were images and are not preserved in the transcript] Unbalanced: –Overweights query terms that have many translations Balanced (#sum): –Sensitive to rare translations Pirkola (#syn): –Deemphasizes query terms with any common translation

64 Relative Effectiveness NTCIR-2 ECIR Collection, CETA+LDC Dictionary, Inquery 3.1p1

65 Exploiting Part-of-Speech (POS) Constrain translations by part-of-speech –Requires POS tagger and POS-tagged lexicon Works well when queries are full sentences –Short queries provide little basis for tagging Constrained matching can hurt monolingual IR –Nouns in queries often match verbs in documents

66 Evaluation Collections TREC –TREC-6/7/8: English, French, German, Italian text –TREC-9: Chinese text –TREC-10: Arabic text CLEF –CLEF-1: English, French, German, Italian text –CLEF-2: Above, plus Spanish and Dutch TDT –TDT-2/3: Chinese and English, text and speech NTCIR –NTCIR-1: Japanese and English text –NTCIR-2: Japanese, Chinese and English text

67 Percent of Monolingual Performance Create document collection in language F Create queries and relevance judgments –Queries and relevance judgments in language F –Queries translated into language E Monolingual run uses F queries, F docs –Upper bound on performance Cross-language run uses E queries, F docs –Measure as percentage of the upper bound Caveat: “query expansion” effect!

68 A real example (HLT’2001) Combining two translation lexicons –Contained in downloaded dictionary –Extracted from bilingual WWW documents Interaction between stemming and translation –4-stage backoff, balanced translation Evidence combination –Baseline: dictionary with four-stage backoff –Merging: reranking informed by WWW tralex –Backoff: from baseline to WWW tralex Experiment and results

69 Four-stage Backoff Tralex might contain stems, surface forms, or some combination of the two. [Table: the four stages match the document word against the translation lexicon as (1) surface form vs. surface form, (2) stem vs. surface form, (3) surface form vs. stem, and (4) stem vs. stem; e.g., the surface form mangez fails to match, but its stem mange matches lexicon entries such as mange and mangent, yielding eat / eats] French stemmer: Oard, Levow, and Cabezas (2001); English: Inquery’s kstem
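A minimal Python sketch of the four-stage lookup (the toy stemmer and one-entry lexicon are stand-ins; the experiments used real French and English stemmers as cited above):

```python
# Sketch of the four-stage lookup; the toy stemmer and one-entry
# lexicon are stand-ins for real stemmers and tralexes.
def backoff_translate(word, stem, tralex):
    """Try (surface, surface), (stem, surface), (surface, stem), and
    (stem, stem) matches against the translation lexicon in turn."""
    stemmed_lex = {stem(k): v for k, v in tralex.items()}
    attempts = [(word, tralex),             # 1: surface form vs. surface form
                (stem(word), tralex),       # 2: document stem vs. surface form
                (word, stemmed_lex),        # 3: surface form vs. lexicon stem
                (stem(word), stemmed_lex)]  # 4: stem vs. stem
    for key, lexicon in attempts:
        if key in lexicon:
            return lexicon[key]
    return None  # a 5th stage could consult another tralex (see slide 71)

tralex = {"mange": ["eat", "eats"]}
toy_stem = lambda w: w[:-1] if w.endswith("z") else w  # not a real stemmer
print(backoff_translate("mangez", toy_stem, tralex))   # ['eat', 'eats']
```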

70 Tralex Merging by Voting [Diagram: for a French term F, the ranked dictionary proposes e1 (100), e2 (99), e3 (98), and the corpus tralex proposes e2 (100), e3 (99), e4, e5; summing the votes gives the ranked merge e2 (199), e3 (197), e1 (100), then e4 and e5]

71 Tralex Backoff Attempt 4-stage backoff using dictionary If not found, use corpus tralex as 5th stage [Diagram: dictionary-based 4-stage backoff falls through to a lookup in the WWW-based tralex, e.g., the surface form mangez → eat]

72 Experiment CLEF-2000 French document collection –~21M words, Le Monde, document translation English queries (34 topics with relevant docs) Inquery (using #sum operator) Conditions (balanced top-2 translation of docs): –Baseline: downloaded dictionary alone –Corpus (STRAND) tralex, thresholds N=1,2,3 –Tralex merging by voting –Backoff from dictionary tralex to corpus tralex

73 Results
Condition / Mean Average Precision:
–Baseline: downloaded dictionary: 0.2919
–STRAND corpus tralex (N=1): 0.2320
–STRAND corpus tralex (N=2): 0.2440
–STRAND corpus tralex (N=3): 0.2499
–Merging by voting: 0.2892
–Backoff from dictionary to corpus tralex: 0.3282 (+12% relative over baseline, p < .01)

74 Results Detail [Chart: per-topic change in mean average precision (Δ mAP)]

75 Topic Detection and Tracking English and Chinese news stories –Newswire, radio, and television sources Query-by-example, mixed language/source –Merged multilingual result set Set-based retrieval measures –Focus on utility

76 TDT: Evaluating CL Speech Retrieval [Timeline: development collection TDT-2 (Jan–Jun 98), 2265 manually segmented stories; evaluation collection TDT-3 (Oct–Dec 98), 3371 manually segmented stories; 17 topics for one collection and 56 for the other, each with a variable number of exemplars; exhaustive relevance assessment based on event overlap] English text topic exemplars: Associated Press, New York Times. Mandarin audio broadcast news: Voice of America.

77 Query by Example English newswire exemplars: “President Bill Clinton and Chinese President Jiang Zemin engaged in a spirited, televised debate Saturday over human rights and the Tiananmen Square crackdown, and announced a string of agreements on arms control, energy and environmental matters. There were no announced breakthroughs on American human rights concerns, including Tibet, but both leaders accentuated the positive …” Mandarin audio stories: 美国总统克林顿的助手赞扬中国官员允许电视现场直播克林顿和江泽民在首脑会晤后举行的联合记者招待会。特别是一九八九镇压民主运动的决定。他表示镇压天安门民主运动是错误的,他还批评了中国对西藏精神领袖达 国家安全事务助理伯格表示,这次直播让中国人第一次在种公开的论坛上听到围绕敏感的人权问题的讨论。在记者招待会上 … [English gloss: Aides to U.S. President Clinton praised Chinese officials for allowing live television coverage of the joint press conference Clinton and Jiang Zemin held after their summit, especially regarding the 1989 decision to suppress the democracy movement; he said the suppression of the Tiananmen democracy movement was wrong, and he also criticized China over Tibet’s spiritual leader. National Security Adviser Berger said the broadcast let Chinese viewers hear, for the first time in an open forum, discussion of sensitive human-rights issues. At the press conference …]

78 Known Item Retrieval Design queries to retrieve a single document Measure the rank of that document in the list Average the inverse of the rank across queries Use sign test for statistical significance Useful first pass evaluation strategy –Avoids the cost of relevance judgments –Poor mean inverse rank implies poor average precision –Does not distinguish well among fairly close systems
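The measure is easy to pin down in code. A small Python sketch of mean inverse (reciprocal) rank over known-item queries (the run data below are invented):

```python
# Sketch of mean inverse rank scoring; the run data are invented.
def mean_inverse_rank(runs):
    """Average 1/rank of the known target over queries; a miss scores 0.
    `runs` is a list of (ranked_doc_ids, target_doc_id) pairs."""
    total = 0.0
    for ranking, target in runs:
        if target in ranking:
            total += 1.0 / (ranking.index(target) + 1)
    return total / len(runs)

runs = [(["d3", "d1", "d7"], "d1"),   # rank 2 -> 0.5
        (["d2", "d9"], "d2"),         # rank 1 -> 1.0
        (["d4"], "d8")]               # not retrieved -> 0.0
print(mean_inverse_rank(runs))        # 0.5
```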

79 Evaluating Corpus-Based Techniques Within-domain evaluation (upper bound) –Partition a bilingual corpus into training and test –Use the training part to tune the system –Generate relevance judgments for evaluation part Cross-domain evaluation (fair) –Use existing corpora and evaluation collections –No good metric for degree of domain shift

80 Evaluating Tralex Coverage Lexicon size Vocabulary coverage of the collection Term instance coverage of the collection Term weight coverage of the collection Term weight coverage on representative queries Retrieval performance on a test collection

81 Outline Questions Overview Search Browsing

82 Interactive CLIR Important: Strong support for interactive relevance judgments can make up for less accurate nominations –[Hersh et al., SIGIR 2000] Practical: Interactive relevance judgments based on imperfect translations can beat fully automatic nominations alone –[Oard and Resnik, IP&M 1997]

83 User-Assisted Query Translation

84 Reverse Translation [Interface mock-up] Query in English: Swiss bank. Possible translations of bank, each with back-translations in parentheses: Bankgebäude ( ), bankverbindung (bank account, correspondent), bank (bench, settle), damm (causeway, dam, embankment), ufer (shore, strand, waterside), wall (parapet, rampart). “Click on a box to remove a possible translation” [Search] [Continue]

85 Selection Goal: Provide information to support decisions May not require very good translations –e.g., Term-by-term title translation People can “read past” some ambiguity –May help to display a few alternative translations

86 Language-Specific Selection [Interface mock-up] Query in English: Swiss bank → German: (Swiss) (Bankgebäude, bankverbindung, bank). English results: 1 (0.72) Swiss Bankers Criticized, AP / June 14, 1997; 2 (0.48) Bank Director Resigns, AP / July 24, 1997. German results (gloss-translated titles): 1 (0.91) U.S. Senator Warpathing, NZZ / June 14, 1997; 2 (0.57) [Bankensecret] Law Change, SDA / August 22, 1997; 3 (0.36) Banks Pressure Existent, NZZ / May 3, 1997

87 Translingual Selection [Interface mock-up] Query in English: Swiss bank; German query: (Swiss) (Bankgebäude, bankverbindung, bank). Merged result list: 1 (0.91) U.S. Senator Warpathing, NZZ, June 14, 1997; 2 (0.57) [Bankensecret] Law Change, SDA, August 22, 1997; 3 (0.52) Swiss Bankers Criticized, AP, June 14, 1997; 4 (0.36) Banks Pressure Existent, NZZ, May 3, 1997; 5 (0.28) Bank Director Resigns, AP, July 24, 1997

88 Merging Ranked Lists Types of evidence –Rank –Score Evidence combination –Weighted round robin –Score combination Parameter tuning –Condition-based –Query-based [Example: one run ranks voa4062 (0.22), voa3052 (0.21), voa4091 (0.17), …; another ranks voa4062 (0.52), voa2156 (0.37), voa3052 (0.31), …; the merged list begins voa4062, voa3052, voa2156, …]
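A minimal Python sketch of one of the combination options, weighted score combination (the min-max normalization and equal weights are assumptions, not necessarily the method behind the example above):

```python
# Sketch of weighted score combination; min-max normalization and
# equal weights are assumptions, and the run scores are invented.
def merge_by_score(runs, weights):
    """Merge ranked (doc, score) lists: normalize each run's scores to
    [0, 1], then sum weighted scores per document."""
    merged = {}
    for run, w in zip(runs, weights):
        scores = [s for _, s in run]
        lo, hi = min(scores), max(scores)
        for doc, s in run:
            norm = (s - lo) / (hi - lo) if hi > lo else 1.0
            merged[doc] = merged.get(doc, 0.0) + w * norm
    return sorted(merged.items(), key=lambda kv: -kv[1])

english_run = [("voa4062", 0.22), ("voa3052", 0.21), ("voa4091", 0.17)]
mandarin_run = [("voa4062", 0.52), ("voa2156", 0.37), ("voa3052", 0.31)]
print(merge_by_score([english_run, mandarin_run], [1.0, 1.0]))
# voa4062 ranks first: it tops both runs
```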

89 Examination Interface Two goals –Refine document delivery decisions – Support vocabulary discovery for query refinement Rapid translation is essential –Document translation retrieval strategies are a good fit –Focused on-the-fly translation may be a viable alternative

90 Uh oh…

91 State-of-the-Art Machine Translation

92 Term-by-Term Gloss Translation

93 Experiment Design [Table: four participants in a counterbalanced design; the topic key groups topics 11 and 13 as one breadth class and 17 and 29 as the other, and each participant searches two topics with System A and two with System B, with topic and system order rotated across participants]

94 An Experiment Session Task and system familiarization (30 minutes) –Gain experience with both systems 4 searches (20 minutes x 4): –Read topic description (in a language you know) –Examine translations (into that same language) –Select one of 5 relevance judgments (two clusters): Relevant, Somewhat relevant | Not relevant, Unsure, Not judged –Instructed to seek high precision 8 questionnaires –Initial (1), each topic (4), each system (2), final (1)

95 Measure of Effectiveness Unbalanced F-measure: F = 1 / (α/P + (1−α)/R) –P = precision –R = recall –α = 0.8 Favors precision over recall This models an application in which: –Fluent translation is expensive –Missing some relevant documents would be okay

96 French Results Overview [Chart: per-topic F-measure scores under CLEF and automatic (AUTO) relevance assessments]

97 English Results Overview [Chart: per-topic F-measure scores under CLEF and automatic (AUTO) relevance assessments]

98 Maryland Experiments MT is almost always better –Significant overall and for narrow topics alone (one-tailed t-test, p<0.05) F measure is less insightful for narrow topics –Always near 0 or 1 [Chart: per-topic results, broad topics on the left, narrow topics on the right]

99 Some Observations Small agreement with CLEF assessments! –Time pressure, precision bias, strict judgments Systran was fairly consistent across sites –Only when the language pair was the same Monolingual > Systran > Gloss –In both recall and precision UNED’s phrase translations improve recall –With no adverse effect on precision

100 Delivery Use may require high-quality translation –Machine translation quality is often rough Route to best translator based on: –Acceptable delay –Required quality (language and technical skills) –Cost

101 Summary Controlled vocabulary –Mature, efficient, easily explained Dictionary-based –Simple, broad coverage Collection-aligned corpus-based –Generally helpful Document- and Term-aligned corpus-based –Effective in the same domain User interface –Only very preliminary results available

102 Research Opportunities Segmentation & Phrase Indexing Lexical Coverage

103 No “muddiest point” question… … it’s all CLIR!

104 Concerns about Social Filtering Subjectivity: is what’s good for me good for you? (6) Sparse recommendations (4) Privacy and trust are in tension (3) Insufficient perceived self-interest (3) Difficulty of making inferences from behavior Changing information needs How to observe user behavior? Non-cooperative users Scalable ratings servers Adoption of new technology

105 Muddiest Points Architectures for using implicit feedback (4) Rocchio (4) –Is it commonly used? Machine learning techniques (2) –Especially hill climbing and rule induction How are links related to social filtering? The math behind Google How to reduce the marginal cost of ratings? Relationship between filtering and routing

