Cross-Language Information Retrieval
Applied Natural Language Processing
October 29, 2009
Douglas W. Oard

What Do People Search For?
Searchers often don’t clearly understand
–The problem they are trying to solve
–What information is needed to solve the problem
–How to ask for that information
The query results from a clarification process
Dervin’s “sense making”: Need, Gap, Bridge

Taylor’s Model of Question Formation
Q1 Visceral Need
Q2 Conscious Need
Q3 Formalized Need
Q4 Compromised Need (Query)
[Diagram labels: End-user Search, Intermediated Search]

Design Strategies
Foster human-machine synergy
–Exploit complementary strengths
–Accommodate shared weaknesses
Divide-and-conquer
–Divide task into stages with well-defined interfaces
–Continue dividing until problems are easily solved
Co-design related components
–Iterative process of joint optimization

Human-Machine Synergy
Machines are good at:
–Doing simple things accurately and quickly
–Scaling to larger collections in sublinear time
People are better at:
–Accurately recognizing what they are looking for
–Evaluating intangibles such as “quality”
Both are pretty bad at:
–Mapping consistently between words and concepts

Process/System Co-Design

Supporting the Search Process
[Diagram. Stages: Source Selection, Query Formulation, Search (IR System), Selection, Examination, Delivery. Artifacts: Query, Ranked List, Document. Feedback loops: Query Reformulation and Relevance Feedback; Source Reselection. Annotations: Nominate, Choose, Predict.]

Supporting the Search Process
[Diagram. Stages: Source Selection, Query Formulation, Search (IR System), Selection, Examination, Delivery. Artifacts: Query, Ranked List, Document. Supporting the IR System: Acquisition, Collection, Indexing, Index.]

Search Component Model
[Diagram. Query side: Information Need, Query Formulation, Query, Query Processing (Representation Function), Query Representation. Document side: Document, Document Processing (Representation Function), Document Representation. Both representations feed the Comparison Function, which yields a Retrieval Status Value. Human Judgment yields Utility.]
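The component model above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's system: the bag-of-words representation, the dot-product comparison, and the names `represent`, `compare`, and `rank` are all assumptions made for the example.

```python
from collections import Counter


def represent(text):
    """Representation function: map raw text to a bag of lowercase words."""
    return Counter(text.lower().split())


def compare(query_rep, doc_rep):
    """Comparison function: return a retrieval status value (RSV).

    The RSV here is the dot product of the two term-count vectors,
    an illustrative choice rather than the one used in the lecture.
    """
    return sum(count * doc_rep[term] for term, count in query_rep.items())


def rank(query, documents):
    """Score every document against the query, highest RSV first."""
    query_rep = represent(query)
    scored = [(compare(query_rep, represent(doc)), doc) for doc in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical toy collection.
docs = [
    "ancient wonders of the world",
    "parliament administration report",
    "wonders never cease",
]
ranking = rank("wonders of ancient world", docs)
```

The same representation function is applied to the query and to each document, so the comparison function only ever sees two objects of the same kind. Human judgment of utility stays outside the code.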

Relevance
Relevance relates a topic and a document
–Duplicates are equally relevant, by definition
–Constant over time and across users
Pertinence relates a task and a document
–Accounts for quality, complexity, language, …
Utility relates a user and a document
–Accounts for prior knowledge

“Okapi” Term Weights
[Formula not preserved. Labeled parts: TF component, IDF component]

A Ranking Function: Okapi BM25
[Formula not preserved. Labeled quantities: query, query term, document, term frequency, document frequency, document length, average document length, term frequency in query]
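A minimal Python sketch of one widely used form of Okapi BM25, built from the quantities labeled on the slide. The exact variant shown in the lecture is not recoverable from the transcript, so the IDF form, the query-term-frequency factor, and the defaults `k1=1.2`, `b=0.75`, `k3=7.0` are assumptions. The function and argument names are hypothetical.

```python
import math


def bm25_score(query_tf, doc_tf, doc_len, avg_doc_len, df, num_docs,
               k1=1.2, b=0.75, k3=7.0):
    """Score one document against one query with a common BM25 variant.

    query_tf: term -> frequency of the term in the query
    doc_tf:   term -> frequency of the term in the document
    df:       term -> number of documents containing the term
    k1, b, k3 are tuning constants (assumed defaults, not from the slide).
    """
    score = 0.0
    for term, qtf in query_tf.items():
        tf = doc_tf.get(term, 0)
        if tf == 0 or term not in df:
            continue  # the term contributes nothing to this document
        # IDF component: rarer terms get larger weights.
        idf = math.log((num_docs - df[term] + 0.5) / (df[term] + 0.5))
        # TF component, normalized by document length.
        length_norm = (1 - b) + b * doc_len / avg_doc_len
        tf_part = ((k1 + 1) * tf) / (k1 * length_norm + tf)
        # Query term frequency component.
        qtf_part = ((k3 + 1) * qtf) / (k3 + qtf)
        score += idf * tf_part * qtf_part
    return score


# Hypothetical numbers: one query term, a document of average length.
example = bm25_score({"wonders": 1}, {"wonders": 1}, doc_len=100,
                     avg_doc_len=100, df={"wonders": 10}, num_docs=1000)
```

With a term frequency of 1, a query term frequency of 1, and a document of exactly average length, both the TF and query factors reduce to 1, so the score equals the IDF component alone.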

Estimating TF and DF for Query Terms
Query term e1 has translations f1, f2, f3, f4 with probabilities 0.4, 0.3, 0.2, 0.1
TF(e1) = 0.4*20 + 0.3*5 + 0.2*2 + 0.1*50 = 14.9
DF(e1) = 0.4*50 + 0.3*40 + 0.2*30 + 0.1*200 = 58
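The two weighted sums on this slide can be reproduced with a short Python sketch. It treats the first sum as the term frequency estimate and the second as the document frequency estimate, which is an assumption based on the order in the slide title. The helper name `expected_count` is hypothetical.

```python
def expected_count(translation_probs, counts):
    """Probability-weighted sum of per-translation counts.

    translation_probs: translation f -> probability of f given the query term
    counts:            translation f -> TF or DF observed for f
    """
    return sum(p * counts.get(f, 0) for f, p in translation_probs.items())


# Values taken from the slide: four translations of query term e1.
p_f_given_e1 = {"f1": 0.4, "f2": 0.3, "f3": 0.2, "f4": 0.1}
tf = {"f1": 20, "f2": 5, "f3": 2, "f4": 50}     # counts in the first sum
df = {"f1": 50, "f2": 40, "f3": 30, "f4": 200}  # counts in the second sum

tf_e1 = expected_count(p_f_given_e1, tf)  # about 14.9, up to float rounding
df_e1 = expected_count(p_f_given_e1, df)  # about 58, up to float rounding
```

The estimated counts for the query term can then be used by a ranking function in place of counts that would have been observed in a monolingual collection.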

Learning to Translate
Lexicons
–Phrase books, bilingual dictionaries, …
Large text collections
–Translations (“parallel”)
–Similar topics (“comparable”)
Similarity
–Similar pronunciation, similar users
People

[Image labels: Hieroglyphic, Demotic, Greek]

Statistical Machine Translation
Señora Presidenta, había pedido a la administración del Parlamento que garantizase
Madam President, I had asked the administration to ensure that

Bidirectional Translation
Query: wonders of ancient world (CLEF Topic 151)
Unidirectional: se//0.31 demande//0.24 demander//0.08 peut//0.07 merveilles//0.04 question//0.02 savoir//0.02 on//0.02 bien//0.01 merveille//0.01 pourrait//0.01 si//0.01 sur//0.01 me//0.01 t//0.01 emerveille//0.01 ambition//0.01 merveilleusement//0.01 veritablement//0.01 cinq//0.01 hier//0.01
Bidirectional: merveilles//0.92 merveille//0.03 emerveille//0.03 merveilleusement//0.02
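The slide shows the effect of bidirectional translation but not how the two directions are combined. The Python sketch below assumes one simple rule: multiply the probability in each direction and renormalize. All probabilities in it are made-up toy values, and the function name `bidirectional` is hypothetical.

```python
def bidirectional(p_f_given_e, p_e_given_f, e):
    """Combine both translation directions for source term e.

    Assumed rule: score(f) = p(f|e) * p(e|f), renormalized to sum to 1.
    Translations that are never translated back to e are dropped.
    """
    combined = {}
    for f, forward in p_f_given_e.get(e, {}).items():
        backward = p_e_given_f.get(f, {}).get(e, 0.0)
        if forward * backward > 0:
            combined[f] = forward * backward
    total = sum(combined.values())
    if total == 0:
        return {}
    return {f: value / total for f, value in combined.items()}


# Toy values (not from the slide): frequent function words dominate the
# forward direction but rarely translate back to "wonders".
p_f_given_e = {"wonders": {"se": 0.5, "merveilles": 0.3, "demande": 0.2}}
p_e_given_f = {
    "se": {"wonders": 0.01, "itself": 0.99},
    "merveilles": {"wonders": 0.9, "marvels": 0.1},
    "demande": {"wonders": 0.02, "asks": 0.98},
}
result = bidirectional(p_f_given_e, p_e_given_f, "wonders")
```

Under this rule a candidate keeps a high probability only if each side is a likely translation of the other, which is the kind of sharpening the slide's two lists illustrate.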

Experiment Setup
Test collections
Source               CLEF’01-03     TREC-5,6
Query language       English        English
Document language    French         Chinese
# of topics          151            54
# of documents       87,191         139,801
Avg # of rel docs    23             95
Document processing
- Stemming, accent-removal (CLEF French)
- Word segmentation, encoding conversion (TREC Chinese)
- Stopword removal (all collections)
Training statistical translation models (GIZA++)
Languages              English-French           English-Chinese
Parallel corpus        Europarl                 FBIS et al.
# of sentence pairs    672,247                  1,583,807
Models (iterations)    M1(10), HMM(5), M4(5)    M1(10)

Pruning Translations
[Figure: the set of translations retained at each Cumulative Probability Threshold from 0.0 to 1.0. Candidate translations and probabilities: f1 (0.32), f2 (0.21), f3 (0.11), f4 (0.09), f5 (0.08), f6 (0.05), f7 (0.04), f8 (0.03), f9 (0.03), f10 (0.02), f11 (0.01), f12 (0.01)]
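Pruning by a cumulative probability threshold can be sketched in Python as follows. The probabilities are the ones listed for f1 through f12 on the slide. Keeping at least one translation, renormalizing the survivors, and the small tolerance `eps` against floating-point rounding are assumptions of this sketch, and the name `prune_translations` is hypothetical.

```python
def prune_translations(translations, threshold, eps=1e-9):
    """Keep the most probable translations until their cumulative
    probability reaches the threshold, then renormalize the survivors.

    At least one translation is always kept. eps guards against
    floating-point rounding when the running sum lands on the threshold.
    """
    ranked = sorted(translations.items(), key=lambda item: item[1],
                    reverse=True)
    kept, cumulative = [], 0.0
    for f, p in ranked:
        kept.append((f, p))
        cumulative += p
        if cumulative >= threshold - eps:
            break
    if cumulative == 0:
        return {}
    return {f: p / cumulative for f, p in kept}


# Probabilities listed on the slide for translations f1..f12.
probs = {"f1": 0.32, "f2": 0.21, "f3": 0.11, "f4": 0.09, "f5": 0.08,
         "f6": 0.05, "f7": 0.04, "f8": 0.03, "f9": 0.03, "f10": 0.02,
         "f11": 0.01, "f12": 0.01}

# Number of translations kept at each threshold from 0.0 to 1.0.
sizes = [len(prune_translations(probs, t / 10)) for t in range(11)]
```

A low threshold keeps only the single most probable translation, while a threshold of 1.0 keeps every candidate, so the threshold trades translation precision against coverage.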

Unidirectional without Synonyms (PSQ)
[Figure panels: CLEF French; TREC-5,6 Chinese. Diagram labels: Q, D]
Statistical significance vs monolingual (Wilcoxon signed rank test)
–CLEF French: worse at peak
–TREC-5,6 Chinese: worse at peak

Bidirectional with Synonyms (DAMM)
[Figure panels: CLEF French; TREC-5,6 Chinese. Diagram labels: (Q) (D) vs. Q D]
DAMM significantly outperformed PSQ
DAMM is statistically indistinguishable from monolingual at peak
IMM: nearly as good as DAMM for French, but not for Chinese

Indexing Time
[Figure: dictionary-based vector translation, single Sun SPARC in 2001]

The Problem Space
Retrospective search
–Web search
–Specialized services (medicine, law, patents)
–Help desks
Real-time filtering
–Email spam
–Web parental control
–News personalization
Real-time interaction
–Instant messaging
–Chat rooms
–Teleconferences

Key Capabilities
Map across languages
–For human understanding
–For automated processing

Making a Market
Multitude of potential applications
–Retrospective search, email, IM, chat, …
–Natural consequence of language diversity
Limiting factor is translation readability
–Searchability is mostly a solved problem
Leveraging human translation has potential
–Translation routing, volunteers, caching
