INAOE at GeoCLEF 2008: A Ranking Approach Based on Sample Documents
Esaú Villatoro-Tello, Manuel Montes-y-Gómez, Luis Villaseñor-Pineda
Language Technologies Laboratory, National Institute of Astrophysics, Optics and Electronics, Tonantzintla, México
General ideas
Our system focuses on the ranking process. It is based on the following hypotheses:
– Current IR machines are able to retrieve relevant documents for geographic queries
– Complete documents provide more and better elements for the ranking than isolated query terms
We aimed to show that:
– Using some query-related sample texts, it is possible to improve the final ranking of the retrieved documents
General architecture of our system
[Architecture diagram: in the first (retrieval) stage, the IR machine retrieves a small set of documents from the document collection for the query, and a feedback process selects sample texts from them; in the second (ranking) stage, query expansion produces a larger retrieved set, which the re-ranking process reorders using the selected sample texts to yield the final re-ranked documents]
Re-ranking process
[Diagram: each of the |S| sample texts is geo-expanded using the Geonames DB; the similarity calculation compares the sample texts against the |R| retrieved documents, producing different ranking proposals, which the information-fusion step merges into the final re-ranked list of documents]
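The similarity calculation in the diagram above can be sketched as follows. The slides only say that thematic and geographic similarities are computed with the cosine formula; the bag-of-words representation, the equal weighting of the two similarities, and the `score_document` helper are assumptions for illustration, not the authors' exact method.

```python
import math
from collections import Counter

def cosine(terms_a, terms_b):
    """Cosine similarity between two bags of words."""
    va, vb = Counter(terms_a), Counter(terms_b)
    dot = sum(va[t] * vb[t] for t in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def score_document(sample_thematic, sample_geo, doc_thematic, doc_geo):
    """Score a retrieved document against one sample text by combining a
    thematic and a geographic cosine similarity.  The 50/50 weighting is
    an illustrative assumption; the slides do not specify how the two
    similarities are combined."""
    return (0.5 * cosine(sample_thematic, doc_thematic)
            + 0.5 * cosine(sample_geo, doc_geo))
```

Scoring every retrieved document against one sample text in this way yields one ranking proposal; repeating it for each sample text yields the |S| proposals that the fusion step merges.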
System configuration: traditional modules
IR machine:
– Based on LEMUR
– Retrieves 1000 documents (original/expanded queries)
Feedback module:
– Based on blind relevance feedback
– Selects the top 5 retrieved documents as sample texts
Query expansion:
– Adds to the original query the five most frequent terms from the sample texts
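The feedback and query-expansion steps above can be sketched as one function: take the top-n retrieved documents as sample texts and append their most frequent unseen terms to the query. This is a minimal illustration; the tokenization, the absence of stopword filtering, and the function name are assumptions, not the LEMUR-based implementation the authors used.

```python
from collections import Counter

def expand_query(query_terms, ranked_docs, n_docs=5, n_terms=5):
    """Blind relevance feedback: treat the top-n retrieved documents as
    relevant sample texts and append their n most frequent terms (not
    already in the query) to the original query."""
    sample_texts = ranked_docs[:n_docs]
    counts = Counter()
    for doc in sample_texts:
        counts.update(token.lower() for token in doc.split())
    # Keep only terms absent from the original query, by frequency.
    candidates = [t for t, _ in counts.most_common() if t not in query_terms]
    return list(query_terms) + candidates[:n_terms]
```

For example, `expand_query(["flood", "europe"], docs, n_docs=3, n_terms=3)` returns the original two terms followed by the three most frequent new terms in the top three documents.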
System configuration: re-ranking module
Geo-expansion:
– Geo-terms are identified using the LingPipe NER
– Expands the geo-terms of the sample texts by adding their two nearest ancestors (e.g., Paris → France, Europe)
Similarity calculation:
– Considers thematic and geographic similarities; it is based on the cosine formula
Information fusion:
– Merges all the different ranking proposals into one single list using the round-robin technique
Evaluation points
[Same architecture diagram, with three evaluation points marked: the 1st EP after the initial retrieval, the 2nd EP after query expansion and the second retrieval, and the 3rd EP after the re-ranking process]
Experimental results: submitted runs

Eval. point  Experiment       Description
1st          inaoe-BASELINE1  Title + Description
             inaoe-BASELINE2  Title + Description + Narrative
2nd          inaoe-BRF        Baseline1 + 5 terms (from 5 docs)
3rd          inaoe-RRBF       re-rank: BRF-5-5, without any distinction
             inaoe-RRGeo      re-rank: BRF-5-5, distinction (thematic, geographic)
             inaoe-RRGeoExp   re-rank: BRF-5-5, distinction (thematic, geographic + expansion)

Relative improvements: +4.87%, +3.33%, +0%, +3.24%
Experimental results: additional runs
Sample texts were manually selected (from inaoe-BASELINE1); on average, two documents were selected for each topic

Eval. point  Experiment       Description
1st          inaoe-BASELINE1  Title + Description
2nd          inaoe-BRF        Baseline1 + 5 terms (from 2* docs)
3rd          inaoe-RRBF       re-rank: BRF-5-2*, without any distinction
             inaoe-RRGeo      re-rank: BRF-5-2*, distinction (thematic, geographic)
             inaoe-RRGeoExp   re-rank: BRF-5-2*, distinction (thematic, geographic + expansion)

Relative improvements: +26.4%, +15.8%, +28.3%, +3.24%
Final remarks
Results showed that query-related sample texts help improve the original ranking of the retrieved documents
Our experiments also showed that the proposed method is very sensitive to the presence of incorrect sample texts
Since our geo-expansion process is still very simple, we believe it is hurting the performance of the method
Ongoing work:
– A new method for selecting sample texts
– A new geographic expansion strategy that incorporates a more precise disambiguation step
Thank you!
Manuel Montes-y-Gómez
Language Technologies Laboratory, National Institute of Astrophysics, Optics and Electronics, Tonantzintla, México