Slide 1: Flexible and Efficient Toolbox for Information Retrieval
MIRACLE group: José Miguel Goñi-Menoyo (UPM), José Carlos González-Cristóbal (UPM-Daedalus), Julio Villena-Román (UC3M-Daedalus)
Slide 2: Our approach
- New Year's resolution: work with all languages in CLEF (adhoc, image, web, geo, iCLEF, QA…)
- Wish list:
  - language-dependent stuff
  - language-independent stuff
  - versatile combination
  - fast
  - simple for non-computer scientists
- Not to reinvent the wheel again every year!
- Approach: a toolbox for information retrieval
Slide 3: Agenda
- Toolbox
- 2005 Experiments
- 2005 Results
- 2006 Homework
Slide 4: Toolbox basics
- Toolbox made of small, one-function tools
- Processing as a pipeline (borrowed from Unix): each combination of tools leads to a different run approach
- Shallow I/O interfaces: tools in several programming languages (C/C++, Java, Perl, PHP, Prolog…), with different design approaches, and from different sources (own development, downloads…)
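The pipeline idea can be sketched as follows. This is a minimal Python illustration with hypothetical stage names (`tokenize`, `filter_stopwords`, `lowercase`); the real MIRACLE tools are separate programs in several languages, chained through their shallow I/O interfaces.

```python
# Minimal sketch of the pipeline idea: each tool is a small
# one-function stage, and a run is just a particular composition.
# Stage names are illustrative, not the actual MIRACLE executables.

def tokenize(lines):
    for line in lines:
        for token in line.split():
            yield token.strip(".,;:!?")

def filter_stopwords(tokens, stopwords):
    for token in tokens:
        if token.lower() not in stopwords:
            yield token

def lowercase(tokens):
    for token in tokens:
        yield token.lower()

def run_pipeline(lines, stopwords):
    # A different combination of stages yields a different run.
    return list(lowercase(filter_stopwords(tokenize(lines), stopwords)))

print(run_pipeline(["The quick brown fox."], {"the"}))
# ['quick', 'brown', 'fox']
```

Because every stage consumes and produces a plain token stream, stages written in different languages can be swapped in and out without changing the rest of the pipeline.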
Slide 5: MIRACLE tools
- Tokenizer (pattern matching): isolates punctuation; splits sentences, paragraphs, and passages; identifies some entities (compounds, numbers, initials, abbreviations, dates); extracts indexing terms; own development (written in Perl) or "outsourced"
- Proper noun extraction: naive algorithm: uppercase words, unless stop-word, stop-clef, or verb/adverb
- Stemming: generally "outsourced"
- Transforming tools: lowercasing; normalization of accents and diacritical characters; transliteration
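The naive proper-noun heuristic can be sketched as below; `extract_proper_nouns` and its arguments are our names, and the stop-clef and verb/adverb checks are collapsed into a single stop-word set for brevity.

```python
# Sketch of the naive proper-noun heuristic described above:
# keep capitalized words unless they are stop-words (the stop-clef
# and verb/adverb lookups are omitted here for brevity).

def extract_proper_nouns(tokens, stopwords):
    nouns = []
    for token in tokens:
        if token[:1].isupper() and token.lower() not in stopwords:
            nouns.append(token)
    return nouns

print(extract_proper_nouns(
    ["The", "river", "Danube", "crosses", "Budapest"],
    {"the", "a", "of"}))
# ['Danube', 'Budapest']
```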
Slide 6: More MIRACLE tools
- Filtering tools: stop-words and stop-clefs; phrase pattern filter (for topics)
- Automatic translation: "outsourced" to available on-line resources or desktop applications: Bultra (En→Bu), Webtrance (En→Bu), AutTrans (Es→Fr, Es→Pt), MoBiCAT (En→Hu), Systran, BabelFish Altavista, Babylon, FreeTranslation, Google Language Tools, InterTrans, WordLingo, Reverso
- Semantic expansion: EuroWordNet; own resources for Spanish
- The philosopher's stone: the indexing and retrieval system
Slide 7: Indexing and retrieval system
- Implements the Boolean, vector-space, and probabilistic BM25 retrieval models
  - only BM25 was used in CLEF 2005
  - only the OR operator was used for terms
- Native support for UTF-8 (and other) encodings
  - no transliteration scheme is needed
  - good results for Bulgarian
- Much more efficient than our previous engines: several orders of magnitude faster at indexing
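As an illustration of the probabilistic model mentioned above, here is a common formulation of BM25 scoring. This is a generic sketch with the usual default parameters (k1 = 1.2, b = 0.75), not the MIRACLE engine's actual implementation.

```python
import math

# Generic Okapi-style BM25 scoring, as a sketch of the probabilistic
# model mentioned on the slide. Parameter values are the common
# defaults, not necessarily MIRACLE's settings.

def bm25_score(query_terms, doc_terms, doc_freq, n_docs, avg_len,
               k1=1.2, b=0.75):
    score = 0.0
    dl = len(doc_terms)  # document length in terms
    for term in query_terms:
        df = doc_freq.get(term, 0)  # number of docs containing term
        if df == 0:
            continue
        tf = doc_terms.count(term)  # term frequency in this doc
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
        # Saturating tf component, normalized by document length.
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avg_len))
    return score

# Repeating a query term in the document raises the score, but with
# diminishing returns (the tf component saturates).
print(bm25_score(["cat"], ["cat", "dog"], {"cat": 2}, 10, 2.0))
```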
Slide 8: Trie-based index
Example words: calm, cast, coating, coat, money, monk, month
Slide 9: First implementation: linked arrays
Example words: calm, cast, coating, coat, money, monk, month
Slide 10: Efficient tries: avoiding empty cells
Example words: abacus, abet, ace, baby, be, beach, bee
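The trie index on these slides can be illustrated with a minimal dictionary-based trie. The real engine uses the more compact layouts just discussed (linked arrays, then tries that avoid empty cells), so this sketch favors clarity over space.

```python
# A minimal dict-of-dicts trie over the slide's example word list.
# The MIRACLE engine stores tries in compact arrays without empty
# cells; this sketch only illustrates the shared-prefix structure.

def trie_insert(trie, word):
    node = trie
    for ch in word:
        node = node.setdefault(ch, {})
    node["$"] = True  # end-of-word marker

def trie_contains(trie, word):
    node = trie
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return "$" in node

trie = {}
for w in ["calm", "cast", "coating", "coat", "money", "monk", "month"]:
    trie_insert(trie, w)

print(trie_contains(trie, "coat"))   # True
print(trie_contains(trie, "mon"))    # False: prefix only, not a word
```

Shared prefixes ("coat"/"coating", "monk"/"money"/"month") are stored once, which is what makes trie indexing both space- and time-efficient for term lookup.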
Slide 11: Basic experiments
- S: standard sequence (tokenization, filtering, stemming, transformation)
- N: no stemming
- R: use of the narrative field in topics
- T: ignore the narrative field
- r1: pseudo-relevance feedback (with the 1st retrieved document)
- P: proper noun extraction (in topics)
Runs: SR, ST, r1SR, NR, NT, NP
Slide 12: Paragraph indexing
- H: paragraph indexing
- docpars (document paragraphs) are indexed instead of docs: term → doc1#1, doc69#5, …
- Combination of docpar relevances:
  rel_N = rel_mN + (α / n) · Σ_{j≠m} rel_jN
  where n = number of paragraphs retrieved for doc N, rel_jN = relevance of paragraph j of doc N, m = the paragraph with maximum relevance, and α = 0.75 (experimental)
Runs: HR, HT
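The combination formula above can be sketched directly; `combine_paragraphs` is our name for it, with α = 0.75 as on the slide.

```python
# Sketch of the docpar combination formula:
#   rel_N = rel_mN + (alpha / n) * sum of the other paragraph scores,
# where m is the best-scoring paragraph and alpha = 0.75.

def combine_paragraphs(par_scores, alpha=0.75):
    n = len(par_scores)            # paragraphs retrieved for this doc
    best = max(par_scores)         # rel_mN
    rest = sum(par_scores) - best  # sum over j != m
    return best + alpha / n * rest

print(round(combine_paragraphs([0.9, 0.4, 0.2]), 4))
# 1.05  (= 0.9 + 0.75/3 * 0.6)
```

The best paragraph dominates, while the remaining paragraphs contribute a damped bonus, so a document with several moderately relevant paragraphs outranks one with a single comparable paragraph.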
Slide 13: Combined experiments
- "Democratic system": documents with a good score in many experiments are likely to be relevant
- a: average: merging of several experiments, adding relevances
- x: WDX, an asymmetric combination of two experiments:
  - the first (most relevant) D documents from run A, non-weighted
  - the rest of the documents from run A, with weight W
  - all documents from run B, with weight X
  - relevance re-sorting
- Mostly used for combining base runs with proper-noun runs
Runs: aHRSR, aHTST, xNP01HR1, xNP01r1SR1
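A sketch of the WDX combination, assuming runs are given as (doc_id, relevance) lists sorted by decreasing relevance. How documents appearing in both runs combine is not stated on the slide, so this sketch makes the assumption that their weighted scores are added.

```python
# Sketch of the asymmetric WDX combination. D, W and X are the
# slide's parameters; the summing of scores for documents present
# in both runs is our assumption, not stated on the slide.

def wdx_combine(run_a, run_b, d, w, x):
    scores = {}
    for i, (doc, rel) in enumerate(run_a):
        # First D docs of run A keep their score; the rest get weight W.
        scores[doc] = rel if i < d else w * rel
    for doc, rel in run_b:
        # All docs of run B contribute with weight X.
        scores[doc] = scores.get(doc, 0.0) + x * rel
    # Relevance re-sorting.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

merged = wdx_combine([("d1", 1.0), ("d2", 0.8)], [("d2", 1.0)],
                     d=1, w=0.5, x=0.2)
print(merged[0][0])  # d1
```

With D small, the top of the base run is trusted as-is, while the tail and the second run (e.g. a proper-noun run) only reshuffle the lower ranks.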
Slide 14: Multilingual merging
- Standard approaches for merging:
  - no normalization, then relevance re-sorting
  - standard normalization, then relevance re-sorting
  - min-max normalization, then relevance re-sorting
- MIRACLE approach for merging: the number of docs selected from a collection (language) is proportional to the average relevance of its first N docs (N = 1, 10, 50, 125, 250, 1000); then one of the standard approaches is applied
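The min-max variant of the standard approach can be sketched as follows, assuming each per-language run is a list of (doc_id, relevance) pairs.

```python
# Sketch of min-max normalization followed by relevance re-sorting,
# one of the standard merging approaches listed above. Each run is
# one language's result list of (doc_id, relevance) pairs.

def min_max_merge(runs):
    merged = []
    for run in runs:
        rels = [rel for _, rel in run]
        lo, hi = min(rels), max(rels)
        span = (hi - lo) or 1.0  # avoid division by zero
        for doc, rel in run:
            # Rescale each run's scores into [0, 1] before merging.
            merged.append((doc, (rel - lo) / span))
    return sorted(merged, key=lambda kv: kv[1], reverse=True)

merged = min_max_merge([[("en1", 10.0), ("en2", 5.0)],
                        [("fr1", 0.9), ("fr2", 0.1)]])
print([doc for doc, _ in merged[:2]])  # ['en1', 'fr1']
```

Normalization matters because raw scores from different collections (languages, engines, translations) are not on a common scale; without it, one collection can dominate the merged list.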
Slide 15: Results
We performed… countless experiments! (just for the adhoc task)
Slide 16: Monolingual Bulgarian
Stemmer (UTF-8): Neuchâtel. Rank: 4th.

Slide 17: Bilingual English→Bulgarian (83% of monolingual)
En→Bu translation: Bultra, Webtrance. Rank: 1st.

Slide 18: Monolingual Hungarian
Stemmer: Neuchâtel. Rank: 3rd.

Slide 19: Bilingual English→Hungarian (87% of monolingual)
En→Hu translation: MoBiCAT. Rank: 1st.

Slide 20: Monolingual French
Stemmer: Snowball. Rank: >5th.

Slide 21: Bilingual English→French (79% of monolingual)
En→Fr translation: Systran. Rank: 5th.

Slide 22: Bilingual Spanish→French (81% of monolingual)
Es→Fr translation: ATrans, Systran. (Rank: 5th)

Slide 23: Monolingual Portuguese
Stemmer: Snowball. Rank: >5th (4th).

Slide 24: Bilingual English→Portuguese (55% of monolingual)
En→Pt translation: Systran. Rank: 3rd.

Slide 25: Bilingual Spanish→Portuguese (88% of monolingual)
Es→Pt translation: ATrans. (Rank: 2nd)

Slide 26: Multilingual-8 (En, Es, Fr)
Rank: 2nd [Fr, En], 3rd [Es].
Slide 27: Conclusions and homework
- Toolbox = "imagination is the limit"
- Focus on the interesting linguistic work instead of boring text manipulation
- Reusability (half of the work is already done for next year!)
- Keys to good results:
  - a fast IR engine is essential
  - native character-encoding support
  - topic narrative
  - good translation engines make the difference
- Homework: further development of the system modules and fine tuning for Spanish, French, Portuguese…