Published by Cameron Wilkinson. Modified over 9 years ago.
1
DCU@FIRE 2012: Monolingual and Crosslingual SMS-based FAQ Retrieval
Johannes Leveling
CNGL, School of Computing, Dublin City University, Ireland
2
Outline
- Motivation
- System Setup and Changes
- Monolingual Experiments
- Crosslingual Experiments
  - SMT system
  - Training data
  - Translation results
- OOV Reduction
- FAQ Retrieval Results
- Conclusions and Future Work
3
Motivation
- Task: Given an SMS query, find FAQ documents answering the query
- Last year's DCU system:
  - SMS correction and normalisation
  - In-domain (ID) retrieval: three approaches (SOLR, Lucene, term overlap)
  - Out-of-domain (OOD) detection: three approaches (term overlap, normalized BM25 scores, ML)
  - Combination of ID retrieval and OOD results
4
Motivation
- This year's system:
  - Same SMS correction and normalisation
    - one more spelling correction resource (manually created)
  - Single retrieval approach: Lucene with BM25 retrieval model
  - Single OOD detection approach: IB-1 classification using TiMBL (machine learning)
    - additional features for term overlap and normalized BM25 scores
  - Trained statistical machine translation system for document translation (Hindi to English)
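A minimal sketch of the OOD detection idea, under stated assumptions: the slides name term overlap and normalized BM25 scores as features and IB-1 (a nearest-neighbour classifier) as the learner, but give no formulas. The overlap definition, the Euclidean distance, and the names `term_overlap` and `nearest_neighbour_label` are illustrative choices, not the DCU implementation or TiMBL's API.

```python
import math


def term_overlap(query_terms, doc_terms):
    """Fraction of distinct query terms that also occur in the document (assumed definition)."""
    query = set(query_terms)
    if not query:
        return 0.0
    return len(query & set(doc_terms)) / len(query)


def nearest_neighbour_label(features, training):
    """Return the label of the closest training example (1-NN).

    `training` is a list of (feature_vector, label) pairs. Euclidean distance
    is used here as a simple stand-in for the metric a real IB-1 learner applies.
    """
    closest = min(training, key=lambda example: math.dist(features, example[0]))
    return closest[1]


# Toy example: features are (term overlap, normalized retrieval score).
training = [((1.0, 1.0), "ID"), ((0.1, 0.2), "OOD")]
overlap = term_overlap(["open", "account", "fee"], ["open", "bank", "account"])
label = nearest_neighbour_label((overlap, 0.8), training)
```

A query whose best-matching FAQ shares few terms and has a low normalized score lands near the "OOD" examples and would be answered with "NONE".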
5
Questions
Investigate
- the influence of OOD detection on system performance
- the influence of out-of-vocabulary (OOV) words on crosslingual performance
6
Collection Statistics
Language          Documents  Training (rel/non_rel)  Test (rel/non_rel)
English           7251       4476 (3047/1429)        1733 (726/1007)
Hindi             1994       554 (173/381)           579 (200/379)
English to Hindi  1994       554 (173/381)           431 (75/356)
7
Monolingual Experiments (Setup)
- Experiments for English and Hindi
- Processing steps:
  - Normalize SMS and FAQ documents
  - Correct SMS queries
  - Retrieve answers
  - Detect OOD queries (or not), e.g. "NONE" queries
  - Produce final result
8
Crosslingual Experiments (Setup)
- Experiments for English to Hindi
- Additional translation step to translate Hindi FAQ documents into English
- Translation is based on a newly trained statistical machine translation (SMT) system
- Problems:
  - sparse training data → combination of different training resources
  - out-of-vocabulary (OOV) words → OOV reduction
9
Crosslingual Experiments (SMT System)
Training an SMT system
- Data preparation: tokenization/normalization scripts
- Data alignment: Giza++ for word-level alignment
- Phrase extraction: Moses MT toolkit
- Training a language model: SRILM for trigram LM with Kneser-Ney smoothing
- Tuning: minimum error rate tuning (MERT)
10
Crosslingual Experiments (Training Data)
- Agro (agricultural domain): 246 sentences
- Crowdsourced HI-EN data: 50k sentences
- EILMT (tourism domain): 6700 sentences
- ICON: 7000 sentences
- TIDES: 50k sentences
- FIRE ad-hoc queries: 200 titles, 200 descriptions
- Interlanguage Wikipedia links: 27k entries
- OPUS/KDE: 97k entries
- UWdict: 128k entries
11
Translation Results (Hindi to English)
Data                Training / Test / Development  BLEU
TIDES               49,504 / 697 / 988             13.30
Crowdsourced EN-HI  41,396 / 8000 / 4000           7.04
ICON                7000 / 500 / 500               25.38
12
OOV Reduction
- Problem: 15.4% untranslated words in translation output
- Idea: modify untranslated words to obtain a translation
- OOV reduction is based on two resources:
  - UWdict
  - Manually created transliteration lexicon (TRL): 639 entries
13
OOV Reduction
Word modifications:
- Character normalization, e.g.
  - replace Chandrabindu with Bindu
  - delete Virama character
  - replace long with short vowels
- Stemming
  - Lucene Hindi stemmer
- Transliteration
  - ITRANS transliteration rules
  - rules for cleaning up ITRANS results
- Decompounding
  - word split at every position into candidate constituents
  - word is decompounded if both constituents have a translation
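The character normalization and decompounding steps can be sketched as follows. This is a hedged illustration, not the system's code: the slides do not list the exact character mappings, so the table below covers only a few assumed Devanagari code points, and `normalize`, `decompound`, and the romanized toy lexicon are hypothetical names and data.

```python
# Illustrative subset of Devanagari normalizations (assumed, not the slides' full rule set).
NORMALIZATION_TABLE = str.maketrans({
    "\u0901": "\u0902",  # Chandrabindu -> Bindu (Anusvara)
    "\u094D": None,      # delete Virama
    "\u0908": "\u0907",  # independent vowel long I -> short I
    "\u090A": "\u0909",  # independent vowel long U -> short U
    "\u0940": "\u093F",  # vowel sign long I -> short I
    "\u0942": "\u0941",  # vowel sign long U -> short U
})


def normalize(word):
    """Apply the character-level normalizations to a single word."""
    return word.translate(NORMALIZATION_TABLE)


def decompound(word, lexicon):
    """Try every split position; accept the first split where both parts have a translation.

    Returns the pair of translations, or None if no split qualifies.
    """
    for position in range(1, len(word)):
        head, tail = word[:position], word[position:]
        if head in lexicon and tail in lexicon:
            return lexicon[head], lexicon[tail]
    return None


# Toy lexicon with romanized placeholder entries (hypothetical data).
toy_lexicon = {"rail": "rail", "gadi": "vehicle"}
parts = decompound("railgadi", toy_lexicon)
```

Returning the first valid split is a simplification; a real system might have to choose among several candidate splits.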
14
OOV Reduction Results (Hindi to English)
Lookup form              Lookup data  Count   % Reduction
original term            UWdict       4,728   14.5
original term            TRL          83      0.3
normalized term          UWdict       419     1.3
normalized term          TRL          24      0.1
stemmed term             UWdict       1,413   4.4
stemmed term             TRL          14      0.0
stemmed normalized term  UWdict       135     0.4
stemmed normalized term  TRL          0       0.0
compound constituents    UWdict       721     2.2
transliteration          N/A          24,973  76.8
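The row order of this table suggests a lookup cascade: each word form is tried against UWdict and then TRL, and transliteration is the fallback. The sketch below assumes exactly that order, which the slides do not state explicitly. It omits the compound-constituent step for brevity, and it assumes that "stemmed normalized" means stemming the normalized form. `reduce_oov` and its callback parameters are hypothetical names.

```python
def reduce_oov(term, uwdict, trl, normalize, stem, transliterate):
    """Return (translation, lookup form, lookup data) for an untranslated term.

    `uwdict` and `trl` are dict-like lexicons; `normalize`, `stem`, and
    `transliterate` are caller-supplied functions from string to string.
    """
    normalized = normalize(term)
    candidates = [
        ("original term", term),
        ("normalized term", normalized),
        ("stemmed term", stem(term)),
        ("stemmed normalized term", stem(normalized)),
    ]
    for form_name, form in candidates:
        for resource_name, resource in (("UWdict", uwdict), ("TRL", trl)):
            if form in resource:
                return resource[form], form_name, resource_name
    # Nothing matched: fall back to transliterating the original term.
    return transliterate(term), "transliteration", "N/A"


# Toy stand-ins for the real resources and word modifications.
toy_uwdict = {"abc": "X"}
toy_trl = {"ab": "Y"}
result = reduce_oov("abd", toy_uwdict, toy_trl, str.lower, lambda w: w[:2], str.upper)
```

Counting which branch each OOV word takes would yield a table with the same shape as the one above.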
15
FAQ Retrieval Results
Run  Language  OOD detection  OOV reduction  ID correct  OOD correct  MRR
1    EN        N              -              661/726     19/1007      0.937
2    EN        Y              -              595/726     981/1007     0.949
1    HI        N              -              77/379      13/379       0.473
2    HI        Y              -              26/379      375/379      0.880
1    EN2HI     N              N              29/75       41/1007      0.450
2    EN2HI     N              Y              22/75       60/1007      0.365
3    EN2HI     Y              Y              4/75        989/1007     0.444
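For reference, mean reciprocal rank (MRR) averages, over all queries, the reciprocal of the rank at which the first correct answer appears. The sketch below is the standard definition rather than the official evaluation script; treating "NONE" as an ordinary answer id for OOD queries is an assumption, and `mean_reciprocal_rank` is an illustrative name.

```python
def mean_reciprocal_rank(ranked_results, relevant):
    """Compute MRR.

    `ranked_results` maps query id -> ranked list of answer ids.
    `relevant` maps query id -> set of correct answer ids.
    A query with no correct answer in its ranking contributes 0.
    """
    if not ranked_results:
        return 0.0
    total = 0.0
    for query_id, ranking in ranked_results.items():
        gold = relevant.get(query_id, set())
        for rank, answer_id in enumerate(ranking, start=1):
            if answer_id in gold:
                total += 1.0 / rank
                break
    return total / len(ranked_results)


# Toy example: correct answer at rank 1, at rank 2, and missing.
runs = {"q1": ["a", "b"], "q2": ["c", "d"], "q3": ["NONE"]}
gold = {"q1": {"a"}, "q2": {"d"}, "q3": {"e"}}
score = mean_reciprocal_rank(runs, gold)  # (1 + 1/2 + 0) / 3
```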
16
Conclusions
- Monolingual experiments:
  - Good performance for English and Hindi
  - OOD detection improves MRR (but reduces number of correct ID queries)
- Crosslingual experiments:
  - Lower performance
  - OOD detection reduces MRR
  - OOV reduction reduces MRR
17
Future work
- Further analysis of our results needed
  - Normalization issues for MT training data?
  - Unbalanced OOD training data for Hindi and English?
  - Is there Hindi textese (e.g. abbreviations)?
  - Does the training data match the test data?
    - manually or automatically created
- Improve transliteration approach
- Comparison to other submissions
18
10q 4 ur @ensn