Published by Phyllis Flynn. Modified over 9 years ago.
1
Learning in a Pairwise Term-Term Proximity Framework for Information Retrieval
Ronan Cummins, Colm O’Riordan
Digital Enterprise Research Institute
SIGIR 2009
2010. 07. 09.
Summarized by Jaehui Park, IDS Lab., Seoul National University
2
Copyright 2008 by CEBT
CONTENTS
  INTRODUCTION
  RELATED RESEARCH
  PROXIMITY MEASURES
  PROXIMITY RETRIEVAL MODEL
  EXPERIMENTS
    SETUP
    RESULTS
  CONCLUSION
3
INTRODUCTION
The occurrences of the query-terms in the document
  – Intuition: documents in which query-terms occur closer together should be ranked higher than documents in which the query-terms appear far apart.
The relationships between all query-terms
  – The pairwise similarity between terms
Contributions
  A list of term-term proximity measures
  An intuitive framework for the proximity model
  A machine learning approach to search through the space of term-term proximity functions
  Performance evaluations
4
PROXIMITY MEASURES
Example document D and query Q:
  Position:  1  2  3  4  5  6  7  8  9  10  11  12  13  14
  D:         a  b  c  d  a  b  d  e  f  g   h   a   i   j
  Q:         a  b
pos(D,a) = {1,5,12}, pos(D,b) = {2,6}
tf(D,a) = 3, tf(D,b) = 2
12 measures are introduced.
  The distance between the positions of a pair of terms in a document (1~6)
  Combining the term-frequencies of each term in the document (7,8)
  The terms in the entire query (9,10)
  Normalization measures (11,12)
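The position sets and term frequencies in the example can be derived from the token sequence. The following is a minimal sketch, assuming 1-based positions and whitespace tokenization; the helper name `positions` is hypothetical and not taken from the slides.

```python
from collections import defaultdict

def positions(doc_terms):
    """Map each term to the sorted list of its 1-based positions (hypothetical helper)."""
    pos = defaultdict(list)
    for i, term in enumerate(doc_terms, start=1):
        pos[term].append(i)
    return dict(pos)

# The example document D from the slide, one token per position.
D = "a b c d a b d e f g h a i j".split()

pos = positions(D)
# Term frequency is the number of positions recorded for the term.
tf = {term: len(p) for term, p in pos.items()}

print(pos["a"], pos["b"])  # [1, 5, 12] [2, 6]
print(tf["a"], tf["b"])    # 3 2
```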
5
PROXIMITY MEASURES
min_dist(a,b,D) = 1
  The minimum distance between any occurrences of a and b in D.
  – closeness -> relatedness
diff_avg_pos(a,b,D) = ((1+5+12)/3) - ((2+6)/2) = 2
  The difference between the average positions of a and b in D.
  – Where each term tends to occur
avg_dist(a,b,D) = ((1+5)+(3+1)+(10+6))/(2*3) = 26/6 = 4.33
  The average distance between a and b for all possible position combinations in D.
  – Promoting the terms that consistently occur close to one another in a localised area
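The three distance measures on this slide can be checked against the example positions. This is a sketch under the assumption that distance means the absolute difference of positions and that diff_avg_pos is reported as an absolute value; the function signatures are illustrative, not the authors' code.

```python
from itertools import product

# Positions of a and b in the example document D.
pos_a, pos_b = [1, 5, 12], [2, 6]

def min_dist(pa, pb):
    """Smallest absolute distance over all (a, b) position pairs."""
    return min(abs(x - y) for x, y in product(pa, pb))

def diff_avg_pos(pa, pb):
    """Absolute difference between the mean positions (absolute value is an assumption)."""
    return abs(sum(pa) / len(pa) - sum(pb) / len(pb))

def avg_dist(pa, pb):
    """Mean absolute distance over all (a, b) position pairs."""
    return sum(abs(x - y) for x, y in product(pa, pb)) / (len(pa) * len(pb))

print(min_dist(pos_a, pos_b))            # 1
print(diff_avg_pos(pos_a, pos_b))        # 2.0
print(round(avg_dist(pos_a, pos_b), 2))  # 4.33
```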
6
PROXIMITY MEASURES
avg_min_dist(a,b,D) = ((2-1)+(6-5))/2 = 1
  The average of the shortest distance between each occurrence of the least frequently occurring term and any occurrence of the other term.
  – The occurrence of a at position 12 may be completely unrelated to b
match_dist(a,b,D) = ((2-1)+(6-5))/2 = 1
  The smallest distance achievable when each occurrence of a term is uniquely matched to another occurrence of a term.
max_dist(a,b,D) = (12-6) = 6
  The maximum distance between any two occurrences of a and b.
  – Useful normalization factor
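A small sketch of avg_min_dist and match_dist on the example positions follows. It assumes that match_dist pairs every occurrence of the less frequent term with a distinct occurrence of the other term and averages the matched distances; that reading is an interpretation of the slide text, and the brute-force search over assignments is only practical for short position lists.

```python
from itertools import permutations

pos_a, pos_b = [1, 5, 12], [2, 6]

def _few_many(pa, pb):
    """Return (positions of the less frequent term, positions of the other term)."""
    return (pa, pb) if len(pa) <= len(pb) else (pb, pa)

def avg_min_dist(pa, pb):
    """Average, over occurrences of the rarer term, of the distance to the nearest other term."""
    few, many = _few_many(pa, pb)
    return sum(min(abs(x - y) for y in many) for x in few) / len(few)

def match_dist(pa, pb):
    """Smallest average distance under a one-to-one matching (assumed interpretation)."""
    few, many = _few_many(pa, pb)
    best_total = min(
        sum(abs(x - y) for x, y in zip(few, chosen))
        for chosen in permutations(many, len(few))
    )
    return best_total / len(few)

print(avg_min_dist(pos_a, pos_b))  # 1.0
print(match_dist(pos_a, pos_b))    # 1.0
```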
7
PROXIMITY MEASURES
sum(tf(a),tf(b)) = 3+2 = 5
  The sum of the term frequencies of a and b in D.
  – An implicit indication of the proximity of both terms
prod(tf(a),tf(b)) = 3*2 = 6
  The product of the term frequencies of a and b in D.
  – An implicit indication of the proximity of both terms
fullcover(Q,D) = 12
  The length of the document that covers all occurrences of the query-terms.
  – A query-specific measure
min-cover(Q,D) = 2
  The length of the document that covers all query-terms at least once.
  – min_dist + 1 for a two-term query
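The frequency and cover measures can be reproduced on the example as follows. This is an illustrative sketch: the function names `full_cover` and `min_cover` are hypothetical, cover lengths are counted in tokens, and the window search is a simple quadratic scan, not an optimized algorithm.

```python
D = "a b c d a b d e f g h a i j".split()
Q = ["a", "b"]

def full_cover(doc_terms, query_terms):
    """Length of the span from the first to the last occurrence of any query term."""
    hits = [i for i, t in enumerate(doc_terms, start=1) if t in set(query_terms)]
    return hits[-1] - hits[0] + 1 if hits else None

def min_cover(doc_terms, query_terms):
    """Length of the shortest window containing every matching query term at least once."""
    needed = set(query_terms) & set(doc_terms)
    if not needed:
        return None
    best = None
    for start in range(len(doc_terms)):
        seen = set()
        for end in range(start, len(doc_terms)):
            if doc_terms[end] in needed:
                seen.add(doc_terms[end])
            if seen == needed:
                length = end - start + 1
                best = length if best is None else min(best, length)
                break
    return best

tf_a, tf_b = D.count("a"), D.count("b")
print(tf_a + tf_b, tf_a * tf_b)  # 5 6
print(full_cover(D, Q))          # 12
print(min_cover(D, Q))           # 2
```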
8
PROXIMITY MEASURES
dl(D) = 14
  The length of the document.
  – A useful factor for normalization in IR
qt(Q,D) = 2
  The number of unique terms that match both document and query.
9
PROXIMITY MEASURES
Correlations of measures
  FBIS, FT, FR collections from TREC disks 4 and 5
  OHSUMED collection
  Re-ranking the top-N (=1000) documents from an initial ranked list using a proximity function
10
PROXIMITY MEASURES
Inverse correlations
Exceptions:
  qt: correlated with relevance
11
PROXIMITY RETRIEVAL MODEL
Extending a vector model
Documents and queries as matrices
  – Ex) 3-term query
  – w(): a standard term-weighting scheme
  – p(): a proximity function
No theoretical basis
  – An intuitive extension of a vector-based approach
  – Genetic Programming (GP) technique
    Combining some or all of the 12 proximity measures
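To make the matrix view concrete, here is a rough sketch for a 3-term query. It assumes the document score is the sum of the per-term weights w() (the diagonal) plus the pairwise proximity values p() for every query-term pair (the off-diagonal entries). The slide text does not give the exact combination, so this additive form, the toy weights, and the choice p = 1/min_dist are all assumptions made for illustration.

```python
from itertools import combinations, product

# Positions of three query terms in the example document D.
pos = {"a": [1, 5, 12], "b": [2, 6], "c": [3]}

# Hypothetical term weights standing in for a real scheme such as BM25.
toy_weights = {"a": 1.0, "b": 2.0, "c": 0.5}

def w(term):
    """Toy term weight (assumption; a real system would use a term-weighting scheme)."""
    return toy_weights[term]

def p(t1, t2):
    """Toy proximity function: inverse of the minimum distance (assumption)."""
    d = min(abs(x - y) for x, y in product(pos[t1], pos[t2]))
    return 1.0 / d

def score(query_terms):
    """Sum of diagonal term weights plus off-diagonal pairwise proximities (assumed form)."""
    diagonal = sum(w(t) for t in query_terms)
    off_diagonal = sum(p(t1, t2) for t1, t2 in combinations(query_terms, 2))
    return diagonal + off_diagonal

print(score(["a", "b", "c"]))  # 6.0
```

With n query terms the model evaluates n term weights and n(n-1)/2 proximity values, which is why the proximity function is applied to every query-term pair.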
12
EXPERIMENTAL SETUP
Term weighting scheme
  BM25 scheme
  Previous work
Proximity function
  The benchmark proximity functions
    BM25 + t()
    ES + t()
13
EXPERIMENTAL SETUP
GP process
  A heuristic stochastic search algorithm
Training
  Financial Times
    – 69500 documents
    – Queries: 25 title only, 30 title + descriptions
    – Fitness function: MAP
  GP
    – Ranking documents using the weighting scheme for top 3000 documents
    – 6 runs of GP
      Initial population of 2000 for 30 generations
      Elitist strategy
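The fitness function named above, mean average precision (MAP), can be sketched as follows. The document ids and relevance judgments are made-up toy values for illustration, and the function names are hypothetical; this is not the evaluation code used in the experiments.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average of precision values at the ranks where relevant documents appear."""
    if not relevant_ids:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            total += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return total / len(relevant_ids)

def mean_average_precision(runs):
    """Mean of average precision over (ranking, relevant set) pairs, one pair per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Toy rankings and judgments for two queries (hypothetical data).
runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),  # AP = (1/1 + 2/3) / 2
    (["d5", "d6"], {"d6"}),                    # AP = 1/2
]
print(round(mean_average_precision(runs), 4))  # 0.6667
```

In a GP setting, each candidate proximity function would be used to rank the training queries and its MAP would serve as its fitness value.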
14
EXPERIMENTAL RESULTS
Wilcoxon signed-rank test
15
EXPERIMENTAL RESULTS
Wilcoxon signed-rank test
16
CONCLUSION
We have outlined an extensive list of measures that may be used to capture the notion of proximity in a document.
We have indicated the potential correlation between each of the individual measures and relevance.
  min_dist is highly (inversely) correlated with relevance.
We outline an IR framework which incorporates the term-term similarities of all possible query-term pairs.
We adopt a population-based learning technique (GP) which learns useful proximity functions.
  An evaluation of three proximity functions
It is possible to use combinations of proximity measures to improve the performance of IR systems for both short and long queries.