Methods of data fusion in information retrieval rank vs. score combination D. Frank Hsu Department of Computer and Information Science Fordham University.

Methods of data fusion in information retrieval: rank vs. score combination
D. Frank Hsu, Department of Computer and Information Science, Fordham University, New York
Jacob Shapiro and Isak Taksa, Department of Statistics and Computer Information Systems, Baruch College, New York
DIMACS is a partnership of Rutgers University, Princeton University, AT&T Labs-Research, Bell Labs, NEC Research Institute and Telcordia Technologies (formerly Bellcore).
DIMACS Technical Report 2002-58, December 2002

Introduction
– Combination of multiple evidences (multiple query formulations, multiple retrieval schemes or systems) has been shown to be effective in data fusion in information retrieval.
– This paper shows that combination using rank performs better than combination using score under certain conditions.

Data fusion in Information Retrieval
– Data fusion is a process of combining information gathered by multiple agents into a single representation.
– In information retrieval, the multiple evidences to be combined result from different query formulations or from different retrieval schemes (models).
– Although why and how multiple evidences should be combined remain unanswered questions, researchers have come to realize the advantage and benefit of combining multiple evidences.

Data fusion in Information Retrieval
– When different systems are commensurable, combination using similarity scores is better than combination using only ranks.
– When multiple systems have incompatible scores, a data fusion (DF) method based on ranked outputs, rather than on the scores directly, is the proper method for combination.
– It is not clear that such a combination is possible among systems that have different methods for computing similarity scores.

Data fusion in Information Retrieval
– The LC (linear combination) model for fusion of IR systems combines the result lists of multiple IR systems by scoring each document with a weighted sum of the scores from each of the component systems.
– Vogt and Cottrell (1999) suggested that the LC model should only be used when the systems involved have high performance, a large overlap of relevant documents, and a small overlap of non-relevant documents.
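The weighted sum behind the LC model can be sketched as follows. This is a minimal illustration, not Vogt and Cottrell's implementation: the function name, the document ids, the scores and the equal weights are all hypothetical, and treating a document that a system did not retrieve as contributing 0 is an assumption.

```python
from collections import defaultdict

def linear_combination(score_lists, weights):
    """Fuse several systems' results by a weighted sum of their scores.

    score_lists: one dict per system, mapping document id -> similarity score.
    weights: one weight per system (hypothetical values).
    Assumption: a document missing from a system contributes 0 for that system.
    """
    fused = defaultdict(float)
    for scores, w in zip(score_lists, weights):
        for doc, s in scores.items():
            fused[doc] += w * s
    # Highest fused score first; ties broken by document id for determinism.
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical scores from two component systems.
system_a = {"d1": 0.9, "d2": 0.4, "d3": 0.1}
system_b = {"d1": 0.2, "d2": 0.8, "d4": 0.5}
fused = linear_combination([system_a, system_b], weights=[0.5, 0.5])
```

Here `fused` is a list of (document, fused score) pairs in descending order of fused score.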

Data fusion in Information Retrieval
– Data fusion using rank works better than using similarity scores if the runs in the combination have ‘different’ rank-similarity curves.

Model and Architecture
Each of the multiple evidences (say evidence A) is presented as a list of three columns (or vectors):
– x: indices
– r_A(x): rank data
– s_A(x): score function
– s_A(x) is sorted by value in descending order, and r_A(x) is the document at rank x.
– In the list below, d_2 is the highest-scoring document and its score is 10.
x        1   2   3   4   5   6   7   8   9   10
r_A(x)   2   3   5   6   8   4   10  7   1   9
s_A(x)   10  7   7   6   5   4   3   2   1   1

Model and Architecture
– [n] = {1, 2, 3, 4, …, n}
– [k, n] = [n] − [k−1] = {1, 2, 3, 4, …, n} − {1, 2, 3, 4, …, k−1} = {k, k+1, k+2, …, n}
– The rank-score function f_A for the system A is a function from [n] to [0, s] = {x ∈ R : 0 ≤ x ≤ s}, where s is the highest score the system A can have.

Model and Architecture
– The model uses a rank list, which consists of a rank function and a score function.
– We perform an analytical study and a simulation of the data fusion of different rank lists and investigate the effectiveness of these fusions.

Model and Architecture
Rank combination
– Given two ranked lists A and B with r_A(x), s_A(x) and r_B(x), s_B(x) respectively, let f_AB(x) be the average of the ranks of document x in A and in B.
– Sort the array f_AB(x) in ascending order and let s_f(x) be the resulting array.

Model and Architecture
Score combination
– Given two ranked lists A and B with r_A(x), s_A(x) and r_B(x), s_B(x) respectively, let g_AB(x) be the average of the scores of document x in A and in B.
– Sort the array g_AB(x) in descending order and let s_g(x) be the resulting array.
– The arrays f_AB(x) and g_AB(x) may contain duplicate values. When this happens, the smaller rank is chosen in the inverse mapping f_AB^(-1) or g_AB^(-1).
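The two combination procedures can be sketched in a few lines. This is an illustrative sketch under stated assumptions: each rank list is stored as a Python list whose k-th entry is the document at rank k, the combined value is taken to be the plain average of the two ranks (or of the two scores), and ties are broken in favour of the smaller document index. The helper names and the four-document lists are hypothetical.

```python
def positions(r):
    """Inverse of a rank list: map each document to its 1-based rank."""
    return {doc: rank for rank, doc in enumerate(r, start=1)}

def combine_by_rank(r_a, r_b):
    """Rank combination: average each document's two ranks, sort ascending."""
    pos_a, pos_b = positions(r_a), positions(r_b)
    f_ab = {doc: (pos_a[doc] + pos_b[doc]) / 2 for doc in pos_a}
    # Assumption: ties go to the smaller document index.
    r_c = sorted(f_ab, key=lambda doc: (f_ab[doc], doc))
    return f_ab, r_c

def combine_by_score(r_a, s_a, r_b, s_b):
    """Score combination: average each document's two scores, sort descending."""
    pos_a, pos_b = positions(r_a), positions(r_b)
    g_ab = {doc: (s_a[pos_a[doc] - 1] + s_b[pos_b[doc] - 1]) / 2 for doc in pos_a}
    r_d = sorted(g_ab, key=lambda doc: (-g_ab[doc], doc))
    return g_ab, r_d

# Hypothetical ranked lists over documents 1..4 (scores listed by rank).
r_a, s_a = [3, 1, 4, 2], [10, 8, 5, 1]
r_b, s_b = [1, 2, 3, 4], [9, 8.5, 8, 0]

f_ab, r_c = combine_by_rank(r_a, r_b)
g_ab, r_d = combine_by_score(r_a, s_a, r_b, s_b)
```

With these hypothetical inputs the two fused rankings differ in their top document, which is the kind of disagreement between rank and score combination that the analysis studies.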

Model and Architecture
(a) Ranked list A
x        1    2    3    4    5    6    7    8    9    10
r_A(x)   5    9    6    2    8    7    1    3    10   4
s_A(x)   10   9    8    7    6    5    4    3    2    1
(b) Ranked list B
x        1    2    3    4    5    6    7    8    9    10
r_B(x)   2    8    5    6    3    1    4    7    10   9
s_B(x)   10   7    6.4  6.2  4.2  4    3    2    1    0

Model and Architecture
(c) Combination of A and B by rank
x         1    2    3    4    5    6    7    8    9    10
f_AB(x)   6.5  2.5  4    8.5  2    3.5  7    6.5  6    9
s_f(x)    2    2.5  3.5  4    6    6.5  6.5  7    8.5  9
r_C(x)    5    2    6    3    9    1    8    7    4    10
(d) Combination of A and B by score
x         1    2    3    4    5    6    7    8    9    10
g_AB(x)   4.0  8.5  3.6  2.0  8.2  7.1  3.5  5    4.5  1.5
s_g(x)    8.5  8.2  7.1  5.0  4.5  4.0  3.6  3.5  2.0  1.5
r_D(x)    2    5    6    8    9    1    3    7    4    10

Model and Architecture
[Figure (a): two rank-score functions with n = 10, s = 10]

Analysis
– Problem 1. For what A and B is P(C) (or P(D)) > max{P(A), P(B)}?
– Problem 2. For what A and B is P(C) > P(D)?
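Both problems compare performance values P. The slides later use P@q and P_avg without spelling them out, so the sketch below assumes the usual IR definitions: precision at q, and non-interpolated average precision. The function names, the set of relevant documents and the two rankings are hypothetical.

```python
def precision_at(ranking, relevant, q):
    """Fraction of the top-q ranked documents that are relevant."""
    return sum(1 for doc in ranking[:q] if doc in relevant) / q

def average_precision(ranking, relevant):
    """Mean of the precision values at the ranks where relevant documents occur."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

# Hypothetical relevance judgments and two fused rankings C and D.
relevant = {1, 4}
r_c = [1, 3, 2, 4]
r_d = [3, 1, 2, 4]

p_c = average_precision(r_c, relevant)
p_d = average_precision(r_d, relevant)
rank_beats_score = p_c > p_d  # Problem 2 asks when this holds
```

Problem 1 would be checked the same way, comparing `p_c` or `p_d` with the larger of the two input performances.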

Analysis
Lemma 1. Let L_1 and L_2 meet at (x*, y*). Let […] and […]. Let r_A = e_A, the identity permutation, and r_B = t ∘ e_A, where t is the transposition (i, j), i < j. Then g_AB(i) ≥ g_AB(j) if and only if […].
Lemma 2. Let r_A, s_A, r_B, s_B be as defined in Lemma 1. Let (x*, y*) be the point which connects L_1 and L_2. Let i, j be such that i < x* < j < n. Let (x*, y+) be the midpoint between (x*, y*) and (x*, y(x*)). If g_AB(i) ≥ g_AB(j), then y+ ≥ g_AB(i).

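Analysis
Lemma 1 proof
– The functions for L_1 and L_2, which meet at (x*, y*), are obtained in a similar fashion.
– We have r_A(i) = i, r_A(j) = j and r_B(i) = j, r_B(j) = i.
– s_A(i) = y(i), s_A(j) = y(j) and s_B(i) = y_1(i), s_B(j) = y_2(j).
[Figure: the line L and the segments L_1 and L_2 meeting at (x*, y*); ranks from 1 to n on the horizontal axis, scores from 0 to s on the vertical axis, with i and j marked]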

Analysis
Lemma 1 proof
– g_AB(i) ≥ g_AB(j) if and only if s_A(i) + s_B(j) ≥ s_A(j) + s_B(i).
– Let d = sn − ny* + y* − sx* > 0.
– Multiply both sides by […] and simplify.
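The condition d > 0 has a simple geometric reading. The short derivation below assumes that L is the straight line through (1, s) and (n, 0), and writes y(x) for its height at x, as in Lemma 2.

```latex
y(x) = \frac{s\,(n - x)}{n - 1}
\quad\Longrightarrow\quad
(n - 1)\bigl(y(x^*) - y^*\bigr)
  = s\,(n - x^*) - (n - 1)\,y^*
  = sn - ny^* + y^* - sx^*
  = d .
```

So for n > 1, d > 0 holds exactly when y* < y(x*), that is, when the turning point (x*, y*) lies below the line L.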

Analysis
Theorem 1. Let A, B, C and D be defined as before. Let s_A = L and s_B = L_1 ∪ L_2, where L_1 and L_2 meet at (x*, y*) and are defined as above. Let r_A = e_A be the identity permutation. If r_B = t ∘ e_A, where t is the transposition (i, j), i < j, and q < x*, then P@q(C) ≥ P@q(D).
Theorem 2. With the definitions of Theorem 1, if q < x*, then P_avg(C) ≥ P_avg(D).

Simulation
– Assume the number of documents to be n = 500.
– The highest score given to any rank is s = 100.
– The total number of possible permutations as a rank data vector is 500!
– We fix s_A to be the straight line connecting the two end points (500, 0) and (1, 100).
– In each simulation, we generate ten thousand (10k) cases of r_B.
– For s_B we consider the special case of a combination of two straight lines with one turning point (x*, y*).

Simulation
Case 1
– r_A = e_A, the identity permutation; r_B = random.
– s_A is the straight line connecting (500, 0) and (1, 100). In fact, s_A has the formula y = (−100/499)(x − 500).
– For each permissible turning point (x*, y*) we generate 10k r_B's. Then we combine these 10k ranked lists B with the ranked list A.
– The results are listed in Figure 6 using P@50 and P_avg respectively.
– Each point (x*, y*) carries a tuple (a, b, c), where a, b and c are the numbers of cases out of the 10k cases such that P@50(C) < P@50(D), P@50(C) > P@50(D) and P@50(C) = P@50(D), respectively.
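The inputs of this simulation can be sketched as below. The formula for s_A is the one given above; the shape of s_B is an assumption (two segments running from (1, s) to the turning point and from the turning point to (n, 0)), and the function names, the seed and the example turning point (100, 20) are hypothetical.

```python
import random

def s_a(x, n=500, s=100):
    """Straight line through (1, s) and (n, 0); y = (-100/499)(x - 500) by default."""
    return -s / (n - 1) * (x - n)

def s_b(x, x_star, y_star, n=500, s=100):
    """Two straight segments meeting at the turning point (x_star, y_star).

    Assumption: the segments start at (1, s) and end at (n, 0), like s_a.
    Requires 1 < x_star < n.
    """
    if x <= x_star:
        return s + (y_star - s) * (x - 1) / (x_star - 1)
    return y_star * (n - x) / (n - x_star)

def random_rank_list(n=500, seed=0):
    """One random permutation of the documents 1..n, used as r_B."""
    return random.Random(seed).sample(range(1, n + 1), n)

# Hypothetical turning point below the line s_a.
x_star, y_star = 100, 20
r_b = random_rank_list()
scores_b = [s_b(x, x_star, y_star) for x in range(1, 501)]
```

Repeating `random_rank_list` with 10,000 different seeds would give the 10k cases of r_B for one turning point.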

Simulation
Figure 6(a): r_A = e_A, r_B = random; f_A = line between (0,100) and (500,0), f_B = single turning point at (x, y). P@50 at (x, y): P@50(C) vs P@50(D) (<, >, =), number of cases (total 10,000).

Simulation
Figure 6(c): r_A = e_A, r_B = random; f_A = line between (0,100) and (500,0), f_B = single turning point at (x, y). P_avg at (x, y): P_avg(C) vs P_avg(D) (<, >), percentages (number of cases: 10,000).

Simulation
Case 2
– r_A = random and r_B = random.
Figure 7(b): r_A = random, r_B = random; f_A = line between (0,100) and (500,0), f_B = single turning point at (x, y). P@50 at (x, y): P@50(C) vs P@50(D) (<, >, =), percentages (number of cases: 10,000).

Simulation
– Figures 6 and 7 exhibit certain features which are quite noticeable.
– When the turning point (x*, y*) is below the standard line (i.e. s_A) at certain locations, the performance of the combination using rank, P(C), is most likely to be better than that of the combination using score, P(D).

Simulation
– The performance of C and D as compared to those of A and B

Simulation
Figure 8: r_A = random, r_B = random; f_A = line between (0,100) and (500,0), f_B = single turning point at (x, y). P = P@50 at (x, y); x = (a, b), where a is rank and b is score; y = number of cases (total number of cases: 10,000).

Simulation
Figure 9 shows how (a, b, c) changes as the turning point (x*, y*) changes, where P = P@50:
– a = number of the 10k cases with P(C) > max{P(A), P(B)},
– b = number of the 10k cases with P(D) > max{P(A), P(B)}, and
– c = number of the 10k cases with min{P(C), P(D)} > max{P(A), P(B)}.
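The three counts can be tallied as sketched below. The function name is hypothetical and the four performance tuples are made-up inputs for illustration only, not results from the simulation.

```python
def count_improvements(cases):
    """Tally (a, b, c) over cases given as (P(A), P(B), P(C), P(D)) tuples."""
    a = b = c = 0
    for p_a, p_b, p_c, p_d in cases:
        best_input = max(p_a, p_b)
        a += p_c > best_input              # rank combination beats both inputs
        b += p_d > best_input              # score combination beats both inputs
        c += min(p_c, p_d) > best_input    # both combinations beat both inputs
    return a, b, c

# Made-up performance values, one tuple per simulated case.
cases = [
    (0.4, 0.5, 0.6, 0.55),
    (0.4, 0.5, 0.6, 0.5),
    (0.4, 0.5, 0.3, 0.7),
    (0.4, 0.5, 0.5, 0.5),
]
a, b, c = count_improvements(cases)
```

By construction c can never exceed a or b, since a case counted in c is counted in both of the others.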

Simulation
Figure 9(a): r_A = random, r_B = random; f_A = line between (0,100) and (500,0); x = P_l/P_h, y = t, and f_B = single turning point (t, t/5). P = P@50 at (x, y): P(C) vs max{P(A), P(B)} (>), percentage; number of cases for each y: 10,000.

Simulation
Figure 9(b): r_A = random, r_B = random; f_A = line between (0,100) and (500,0); x = P_l/P_h, y = t, and f_B = single turning point (t, t/5). P = P@50 at (x, y): P(D) vs max{P(A), P(B)} (>), percentage; number of cases for each y: 10,000.

Simulation
Figure 9(c): r_A = random, r_B = random; f_A = line between (0,100) and (500,0); x = P_l/P_h, y = t, and f_B = single turning point (t, t/5). P = P@50 at (x, y): min{P(C), P(D)} vs max{P(A), P(B)} (>), percentage; number of cases for each y: 10,000.

Discussion
Combination using rank vs. score
– Result in the […]: when […] is big enough, combination using ranks performs better than combination using scores.
– When the difference between r_A and r_B is a transposition (i, j) and q < x*, the performance of rank combination is at least as good as that of score combination.

Discussion
Effectiveness of combination
– A more recent study by Ng and Kantor ([16] and [17]) has identified two predictive variables: the Kendall distance and the performance ratio.
– The Kendall distance d_K(r_A, r_B) measures the degree of concordance between two different rank lists r_A and r_B.
– The performance ratio P_l/P_h measures the similarity of performance of the two IR schemes A and B.
– The distribution of the positive cases is clustered around P_l/P_h ~ 1.
– As to the Kendall distance d_K(r_A, r_B), we have not found any significant pattern.
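The two predictive variables can be sketched as follows. The slides do not give formulas, so both definitions here are assumptions: the Kendall distance is taken as the unnormalized count of document pairs that the two rank lists order differently, and P_l/P_h is read as the lower of the two performances divided by the higher. The function names are hypothetical.

```python
from itertools import combinations

def kendall_distance(r_a, r_b):
    """Number of document pairs ordered differently by the two rank lists."""
    pos_a = {doc: k for k, doc in enumerate(r_a)}
    pos_b = {doc: k for k, doc in enumerate(r_b)}
    return sum(
        1
        for x, y in combinations(r_a, 2)
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0
    )

def performance_ratio(p_a, p_b):
    """P_l / P_h: the lower performance divided by the higher one."""
    low, high = min(p_a, p_b), max(p_a, p_b)
    # The ratio is undefined when both performances are 0; 0.0 is a placeholder.
    return low / high if high > 0 else 0.0

same = kendall_distance([1, 2, 3, 4], [1, 2, 3, 4])
reversed_lists = kendall_distance([1, 2, 3, 4], [4, 3, 2, 1])
one_swap = kendall_distance([1, 2, 3], [2, 1, 3])
ratio = performance_ratio(0.2, 0.4)
```

A ratio close to 1 means the two schemes perform similarly, which is where the positive cases are reported to cluster.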

Segmentation result
– K-means: Lambda = 0.2, penalty = 0.01; 0.129569, 0.411073, 0.151609
– Bottom up: Lambda = 0.2, penalty = 0.0078125; 0.12659, 0.336444, 0.21302
– TMM: Lambda = 0.04, penalty = 0.1; 0.0924661, 0.260335, 0.141078
