Published by Roberta Hubbard. Modified over 8 years ago.
1
Cross-Language Information Retrieval
Applied Natural Language Processing
October 29, 2009
Douglas W. Oard
2
What Do People Search For?
Searchers often don’t clearly understand
  – The problem they are trying to solve
  – What information is needed to solve the problem
  – How to ask for that information
The query results from a clarification process
Dervin’s “sense making”: Need, Gap, Bridge
3
Taylor’s Model of Question Formation
Q1 Visceral Need
Q2 Conscious Need
Q3 Formalized Need
Q4 Compromised Need (Query)
[Diagram labels: End-user Search, Intermediated Search]
4
Design Strategies
Foster human-machine synergy
  – Exploit complementary strengths
  – Accommodate shared weaknesses
Divide-and-conquer
  – Divide task into stages with well-defined interfaces
  – Continue dividing until problems are easily solved
Co-design related components
  – Iterative process of joint optimization
5
Human-Machine Synergy
Machines are good at:
  – Doing simple things accurately and quickly
  – Scaling to larger collections in sublinear time
People are better at:
  – Accurately recognizing what they are looking for
  – Evaluating intangibles such as “quality”
Both are pretty bad at:
  – Mapping consistently between words and concepts
6
Process/System Co-Design
7
Supporting the Search Process
[Diagram labels: Source Selection, Query Formulation, Query, Search, Ranked List, Selection, Document, Examination, Document, Delivery; IR System; Query Reformulation and Relevance Feedback; Source Reselection; Nominate, Choose, Predict]
8
Supporting the Search Process
[Diagram labels: Source Selection, Query Formulation, Query, Search, Ranked List, Selection, Document, Examination, Document, Delivery; IR System; Acquisition, Collection, Indexing, Index]
9
Search Component Model
[Diagram labels: Information Need, Query Formulation, Query, Query Processing, Representation Function, Query Representation; Document, Document Processing, Representation Function, Document Representation; Comparison Function, Retrieval Status Value; Human Judgment, Utility]
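The component model can be sketched in a few lines. This is a minimal illustration, not the system behind the slides: the bag-of-words representation, the cosine comparison, and the function names `represent`, `compare`, and `rank` are all assumptions chosen for brevity.

```python
import math
from collections import Counter


def represent(text):
    """Representation function: a lowercase bag of words (assumed, minimal choice)."""
    return Counter(text.lower().split())


def compare(query_rep, doc_rep):
    """Comparison function: cosine similarity, used here as the retrieval status value."""
    dot = sum(weight * doc_rep.get(term, 0) for term, weight in query_rep.items())
    query_norm = math.sqrt(sum(w * w for w in query_rep.values()))
    doc_norm = math.sqrt(sum(w * w for w in doc_rep.values()))
    norm = query_norm * doc_norm
    return dot / norm if norm else 0.0


def rank(query, documents):
    """Apply the same representation to query and documents, then sort by score."""
    query_rep = represent(query)
    scored = [(compare(query_rep, represent(doc)), doc) for doc in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


ranked = rank("information retrieval",
              ["information retrieval systems", "cooking recipes"])
```

The retrieval status value only orders documents; whether the top document is actually useful is decided by the human judgment on the other side of the model.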
10
Relevance
Relevance relates a topic and a document
  – Duplicates are equally relevant, by definition
  – Constant over time and across users
Pertinence relates a task and a document
  – Accounts for quality, complexity, language, …
Utility relates a user and a document
  – Accounts for prior knowledge
11
“Okapi” Term Weights
[Equation not preserved; annotated parts: TF component, IDF component]
12
A Ranking Function: Okapi BM25
[Equation not preserved; annotated parts: query, query term, document, document length, average document length, term frequency, document frequency, term frequency in query]
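The formula itself is missing from this transcript, but the surviving labels (document frequency, term frequency, document length, average document length, term frequency in query) match the usual textbook form of BM25. The sketch below implements that textbook form; the parameter names `k1`, `b`, `k3` and their default values are conventional choices and an assumption here, not values taken from the slide.

```python
import math
from collections import Counter


def bm25_score(query_terms, doc_terms, doc_freq, num_docs, avg_doc_len,
               k1=1.2, b=0.75, k3=7.0):
    """Score one document against one query with a textbook BM25 variant.

    doc_freq maps a term to the number of documents that contain it.
    k1, b and k3 are conventional defaults (assumed, not from the slide).
    """
    tf = Counter(doc_terms)      # term frequency in the document
    qtf = Counter(query_terms)   # term frequency in the query
    doc_len = len(doc_terms)
    score = 0.0
    for term, q_count in qtf.items():
        f = tf.get(term, 0)
        df = doc_freq.get(term, 0)
        if f == 0 or df == 0:
            continue  # term absent from the document or the collection
        # IDF component: rarer terms get larger weights.
        idf = math.log((num_docs - df + 0.5) / (df + 0.5))
        # TF component with document length normalization.
        length_norm = k1 * ((1 - b) + b * doc_len / avg_doc_len)
        tf_part = (k1 + 1) * f / (length_norm + f)
        # Query term frequency component.
        qtf_part = (k3 + 1) * q_count / (k3 + q_count)
        score += idf * tf_part * qtf_part
    return score
```

Note that this IDF form goes negative for terms that occur in more than half of the documents; implementations commonly clamp or smooth it.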
13
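Estimating TF and DF for Query Terms
[Figure: query term e1 linked to translations f1, f2, f3, f4 with probabilities 0.4, 0.3, 0.2, 0.1; the translations carry the count pairs (20, 50), (5, 40), (2, 30), (50, 200)]
0.4*20 + 0.3*5 + 0.2*2 + 0.1*50 = 14.9
0.4*50 + 0.3*40 + 0.2*30 + 0.1*200 = 58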
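The two sums on this slide are probability-weighted averages of the counts of the translations. A minimal sketch of that arithmetic follows; reading the first sum as the estimated TF and the second as the estimated DF follows the order in the slide title and is an assumption, as is the assignment of the values to f1 through f4. The name `expected_count` is illustrative.

```python
def expected_count(translation_probs, counts):
    """Weight each translation's count by its translation probability and sum."""
    return sum(prob * counts.get(term, 0)
               for term, prob in translation_probs.items())


# Translation probabilities for query term e1 (values from the slide).
p_f_given_e1 = {"f1": 0.4, "f2": 0.3, "f3": 0.2, "f4": 0.1}

# Counts of each translation (values from the slide; TF/DF reading is assumed).
tf_in_document = {"f1": 20, "f2": 5, "f3": 2, "f4": 50}
df_in_collection = {"f1": 50, "f2": 40, "f3": 30, "f4": 200}

tf_e1 = expected_count(p_f_given_e1, tf_in_document)    # about 14.9
df_e1 = expected_count(p_f_given_e1, df_in_collection)  # about 58
```

The estimated values can then be passed to a ranking function in place of counts for a term that never occurs in the document language.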
14
Learning to Translate
Lexicons
  – Phrase books, bilingual dictionaries, …
Large text collections
  – Translations (“parallel”)
  – Similar topics (“comparable”)
Similarity
  – Similar pronunciation, similar users
People
15
Hieroglyphic Demotic Greek
16
Statistical Machine Translation
Señora Presidenta, había pedido a la administración del Parlamento que garantizase
Madam President, I had asked the administration to ensure that
17
Bidirectional Translation
wonders of ancient world (CLEF Topic 151)
Unidirectional: se//0.31 demande//0.24 demander//0.08 peut//0.07 merveilles//0.04 question//0.02 savoir//0.02 on//0.02 bien//0.01 merveille//0.01 pourrait//0.01 si//0.01 sur//0.01 me//0.01 t//0.01 emerveille//0.01 ambition//0.01 merveilleusement//0.01 veritablement//0.01 cinq//0.01 hier//0.01
Bidirectional: merveilles//0.92 merveille//0.03 emerveille//0.03 merveilleusement//0.02
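The slide does not say how the two translation directions are combined. One simple possibility, assumed here purely for illustration, is to multiply the probability in each direction and renormalize, so that a candidate survives only if both directions support it. The function name `combine_bidirectional` and the reverse-direction probabilities below are hypothetical; only the forward probabilities come from the slide.

```python
def combine_bidirectional(p_f_given_e, p_e_given_f):
    """Multiply forward and reverse probabilities, then renormalize (assumed scheme)."""
    products = {f: p * p_e_given_f.get(f, 0.0) for f, p in p_f_given_e.items()}
    total = sum(products.values())
    if total == 0.0:
        return {}
    return {f: value / total for f, value in products.items() if value > 0.0}


# Forward direction: a few of the unidirectional values from the slide.
forward = {"se": 0.31, "demande": 0.24, "merveilles": 0.04, "merveille": 0.01}

# Reverse direction: HYPOTHETICAL values, invented for this illustration.
reverse = {"se": 0.0, "demande": 0.001, "merveilles": 0.6, "merveille": 0.5}

combined = combine_bidirectional(forward, reverse)
best = max(combined, key=combined.get)
```

With these toy numbers the function words drop out or shrink and `merveilles` becomes the top candidate, which is the qualitative effect the slide shows.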
18
Experiment Setup
Test collections
  Source               CLEF’01-03    TREC-5,6
  Query language       English       English
  Document language    French        Chinese
  # of topics          151           54
  # of documents       87,191        139,801
  Avg # of rel docs    23            95
Document processing
  - Stemming, accent-removal (CLEF French)
  - Word segmentation, encoding conversion (TREC Chinese)
  - Stopword removal (all collections)
Training statistical translation models (GIZA++)
  Languages            English-French           English-Chinese
  Parallel corpus      Europarl                 FBIS et al.
  # of sentence pairs  672,247                  1,583,807
  Models (iterations)  M1(10), HMM(5), M4(5)    M1(10)
20
Pruning Translations
Translations (probability): f1 (0.32), f2 (0.21), f3 (0.11), f4 (0.09), f5 (0.08), f6 (0.05), f7 (0.04), f8 (0.03), f9 (0.03), f10 (0.02), f11 (0.01), f12 (0.01)
Cumulative Probability Threshold    Translations kept
  0.0    f1
  0.1    f1
  0.2    f1
  0.3    f1
  0.4    f1 f2
  0.5    f1 f2
  0.6    f1 f2 f3
  0.7    f1 f2 f3 f4
  0.8    f1 f2 f3 f4 f5
  0.9    f1 f2 f3 f4 f5 f6 f7
  1.0    f1 f2 f3 f4 f5 f6 f7 f8 f9 f10 f11 f12
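The pruning rule implied by the figure can be sketched as follows: take translations in decreasing order of probability and stop as soon as the running total reaches the threshold, always keeping at least one. The function name `prune_translations` and the small tolerance `eps` (added to absorb floating-point rounding) are assumptions of this sketch.

```python
def prune_translations(translations, threshold, eps=1e-9):
    """Keep the most probable translations until their cumulative probability
    reaches the threshold; at least one translation is always kept."""
    ranked = sorted(translations, key=lambda pair: pair[1], reverse=True)
    kept, cumulative = [], 0.0
    for term, prob in ranked:
        kept.append((term, prob))
        cumulative += prob
        if cumulative >= threshold - eps:
            break
    return kept


# Translation probabilities as listed on the slide.
slide_translations = [
    ("f1", 0.32), ("f2", 0.21), ("f3", 0.11), ("f4", 0.09),
    ("f5", 0.08), ("f6", 0.05), ("f7", 0.04), ("f8", 0.03),
    ("f9", 0.03), ("f10", 0.02), ("f11", 0.01), ("f12", 0.01),
]

kept_counts = {t / 10: len(prune_translations(slide_translations, t / 10))
               for t in range(11)}
```

On the slide's distribution this keeps one translation for thresholds up to 0.3, two at 0.4 and 0.5, three at 0.6, four at 0.7, five at 0.8, seven at 0.9, and all twelve at 1.0. Whether the kept probabilities are renormalized afterwards is not stated on the slide, so the sketch leaves them unchanged.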
21
Unidirectional without Synonyms (PSQ)
[Plots for CLEF French and TREC-5,6 Chinese not preserved; diagram labels: Q, D]
Statistical significance vs monolingual (Wilcoxon signed rank test)
  – CLEF French: worse at peak
  – TREC-5,6 Chinese: worse at peak
22
Bidirectional with Synonyms (DAMM)
[Plots for CLEF French and TREC-5,6 Chinese not preserved; diagram labels: (Q) (D) vs. Q D]
DAMM significantly outperformed PSQ
DAMM is statistically indistinguishable from monolingual at peak
IMM: nearly as good as DAMM for French, but not for Chinese
23
Indexing Time
[Chart not preserved; caption: Dictionary-based vector translation, single Sun SPARC in 2001]
24
The Problem Space
Retrospective search
  – Web search
  – Specialized services (medicine, law, patents)
  – Help desks
Real-time filtering
  – Email spam
  – Web parental control
  – News personalization
Real-time interaction
  – Instant messaging
  – Chat rooms
  – Teleconferences
Key Capabilities
Map across languages
  – For human understanding
  – For automated processing
25
Making a Market
Multitude of potential applications
  – Retrospective search, email, IM, chat, …
  – Natural consequence of language diversity
Limiting factor is translation readability
  – Searchability is mostly a solved problem
Leveraging human translation has potential
  – Translation routing, volunteers, caching