
1 Result merging in a Peer-to-Peer Web Search Engine (Oberseminar AG5, MPI Informatik)
Speaker: Sergey Chernov
Supervisors: Prof. Gerhard Weikum, Christian Zimmer, Matthias Bender

2 Overview
1. Result merging problem
2. Selected result merging methods
3. Efficient index processing
4. Summary
5. References

3 Query processing in distributed IR
q – query; P – set of peers; P' – subset of peers most "relevant" for q;
R_i – ranked result list returned by peer P_i'; R_M – merged result list.
[Figure: query processing pipeline. Selection picks the subset {P1', P2', P3'} out of the peers P1..P6; Retrieval sends q to each selected peer, which returns its local result list R_i; Merging combines R1, R2, R3 into the final list R_M.]
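A minimal sketch of this three-stage pipeline, with the selection score and the merge step as placeholders for the methods discussed on the following slides (all names are illustrative, not from the talk):

```python
from typing import Callable

# A peer is modeled simply as a mapping doc_id -> local similarity score.
Peer = dict[str, float]

def process_query(query: str,
                  peers: list[Peer],
                  score_peer: Callable[[str, Peer], float],
                  num_selected: int = 3,
                  k: int = 10) -> list[tuple[str, float]]:
    # 1. Selection: rank peers by a database-selection score, keep P'.
    selected = sorted(peers, key=lambda p: score_peer(query, p),
                      reverse=True)[:num_selected]
    # 2. Retrieval: each selected peer returns its local top-k list R_i.
    local_lists = [sorted(p.items(), key=lambda x: x[1], reverse=True)[:k]
                   for p in selected]
    # 3. Merging: here naively by raw local score, which is exactly what
    #    the following slides argue is unsound across heterogeneous peers.
    merged = [item for lst in local_lists for item in lst]
    return sorted(merged, key=lambda x: x[1], reverse=True)[:k]
```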

4 Naive merging approaches
How can we combine the results of n peers when we need the top-k documents?
1. Retrieve the k documents with the highest similarity scores. Problem: local scores are incomparable.
2. Fetch the k best documents from each of the n peers, re-rank them, and select the top k out of k*n. Problem: communication is too expensive.
3. Take the same number k/n of documents from each peer in round-robin fashion (see the sketch below). Problem: some databases contain many highly relevant documents while others contain none.
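A minimal sketch of the third approach, assuming each input list is already sorted by local score:

```python
from itertools import zip_longest

def round_robin_merge(result_lists, k):
    """Interleave the peers' ranked lists, taking roughly k/n from each.

    This treats every peer as equally good, which is exactly the flaw
    noted above: a peer holding many relevant documents contributes no
    more than a peer holding none.
    """
    merged = []
    for rank_slice in zip_longest(*result_lists):
        merged.extend(doc for doc in rank_slice if doc is not None)
    return merged[:k]
```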

5 Merging properties in a Peer-to-Peer Web Search Engine
Heterogeneous databases (local scores are incomparable)
High scalability required (document fetching is very expensive)
Cooperative environment (statistics are available; common search model)
Highly dynamic (no room for learning methods)
Database selection (provides extra information for score normalization)

6 Collection vs. Data fusion
Collection fusion: merging results from disjoint document sets. Ideal retrieval effectiveness: 100% of the single-collection baseline.
Data fusion: merging results from a single shared document set. Ideal retrieval effectiveness: more than 100% of the single-collection baseline if the rankings are independent (e.g. TF*IDF and PageRank).
Our scenario, with overlapping document sets, lies in the middle, probably closer to collection fusion.
[Figure: three setups. Disjoint document sets: peers P1, P2, P3 index separate databases DB1, DB2, DB3. Shared document set: all peers index the same DB1. Overlapping document sets: the peers' databases DB1, DB2, DB3 partially overlap.]

7 Results merging problem
Objective: merge the documents returned from multiple sources into a single ranked list.
Difficulty: local document similarity scores may be incomparable because they are based on different collection statistics.
Solution: transform local scores into global ones.
Ideal merging: retrieval effectiveness is the same as if all documents were in a single collection, and the retrieval process is efficient.

8 Global IDF merging (Viles and French [6])
Global IDF: compute global IDF values from the peers' statistics, e.g.
  IDF_global(t) = log( Σ_i |D_i| / Σ_i DF_i(t) ),
where DF_i(t) is the number of documents on peer i that contain term t, and |D_i| is the total number of documents on peer i.
Assumption: document overlap among collections affects all terms proportionally to their IDFs.
Precision: in the disjoint case, 100% of the single-collection baseline; in the overlapping setup, unknown.
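A minimal sketch of this computation (the standard log(N/DF) form; the exact variant used in [6] may differ):

```python
import math

def global_idf(df_per_peer, ndocs_per_peer):
    """Global IDF from per-peer statistics.

    df_per_peer[i]    -- DF_i: documents on peer i containing the term
    ndocs_per_peer[i] -- |D_i|: total documents on peer i
    """
    global_df = sum(df_per_peer)       # global document frequency
    global_n = sum(ndocs_per_peer)     # global collection size
    return math.log(global_n / global_df)

# Example: three peers where the term appears in 10, 3 and 1 documents.
print(global_idf([10, 3, 1], [1000, 500, 800]))  # log(2300/14)
```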

9 ICF merging (Callan [1])
Inverted Collection Frequency: replace the IDF value with an ICF value, e.g.
  ICF(t) = log( |C| / CF(t) ),
where CF(t) is the number of peers whose collection contains term t, and |C| is the number of collections (peers) in the system.
Assumption: the ICF value is the analogue of IDF in the peer-to-peer setting; an important term occurs on only a small number of peers.
Precision: unknown in both the disjoint and the overlapping setup.
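A minimal sketch under the same reading (the exact smoothing of the ICF formula in [1] may differ):

```python
import math

def icf(num_peers_with_term, num_peers):
    # Terms that are rare across peers get high weight, mirroring IDF.
    return math.log(num_peers / num_peers_with_term)

print(icf(2, 50))   # a term on 2 of 50 peers: high weight
print(icf(45, 50))  # a term on 45 of 50 peers: low weight
```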

10 CORI-merging (1) (Callan [1])
Database selection step: the database score r_i of database (peer) i for the current query term q_i combines a DF-based part T and a CF-based part I:
  T = DF / (DF + 50 + 150 * cw / avg_cw)
  I = log( (|C| + 0.5) / CF ) / log( |C| + 1.0 )
  r_i = 0.4 + 0.6 * T * I
The constants are heuristics tuned for the INQUERY search engine.
cw is the number of indexing terms in the collection; avg_cw is the average number of indexing terms across collections.
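A direct transcription of the formulas above into a small sketch (the constants are the INQUERY heuristics as reconstructed from [1]):

```python
import math

def cori_term_score(df, cf, cw, avg_cw, num_collections):
    """CORI database score r_i for one query term."""
    t = df / (df + 50 + 150 * cw / avg_cw)            # DF-based part
    i = (math.log((num_collections + 0.5) / cf) /
         math.log(num_collections + 1.0))             # CF-based part
    return 0.4 + 0.6 * t * i
```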

11 CORI-merging (2) (Callan [1])
The database score R_i for a query q with n terms aggregates the per-term scores r_i(q_k) over the n query terms (e.g. their sum).
The minimum database score R_min is obtained by setting T = 0 for each term, the maximum score R_max by setting T = 1. The normalized database score is
  R'_i = (R_i - R_min) / (R_max - R_min),
so a low-score database can still contribute documents to the top-k.

12 CORI-merging (3) (Callan [1])
Normalizing document scores: the analogous min-max procedure applied to document scores yields D', reducing the effect of different local IDFs.
Scores for merging: the final score combines the normalized document score with the database score,
  D'' = (D' + 0.4 * D' * R'_i) / 1.4,
so both the DF and the CF effects are included via R_i.
Assessment: it is the most successful representative of the methods that combine a database score with local scores.
Precision: in the disjoint case, 70-100% of the single-collection baseline; in the overlapping setup, 70-100%.
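A compact sketch of the full CORI merge step, assuming each peer reports its normalized database score R'_i together with min-max normalized local document scores (the input layout is illustrative):

```python
def cori_merge(peer_results, k=10):
    """peer_results: list of (r_prime, [(doc_id, d_prime), ...]) pairs,
    where r_prime is the peer's normalized database score and d_prime
    a min-max normalized local document score.

    Combines them with the CORI heuristic D'' = (D' + 0.4*D'*R') / 1.4.
    """
    merged = []
    for r_prime, docs in peer_results:
        for doc_id, d_prime in docs:
            w = (d_prime + 0.4 * d_prime * r_prime) / 1.4
            merged.append((doc_id, w))
    merged.sort(key=lambda x: x[1], reverse=True)
    return merged[:k]
```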

13 Language modeling merging (Si et al. [3])
Obtain "fair", comparable scores by manipulating language models.
Notation: G – model of all peers' collections combined; C – a peer's collection model; D – a document; q – a query term; Q – the query; λ, β – smoothing parameters.
The idea is to smooth the local document model against both the peer collection C and the global collection G, roughly
  p(q|D) = λ p_ml(q|D) + (1-λ) [ β p(q|C) + (1-β) p(q|G) ]
(a reconstruction; see [3] for the exact form).
Assumptions: linear separation of evidence; correct segmentation of documents across collections.
Precision: equal to or better than CORI.
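A rough sketch of such a two-level smoothed query likelihood (the estimation procedure in [3] is more involved; names and defaults here are illustrative):

```python
def smoothed_query_likelihood(query_terms, p_doc, p_coll, p_global,
                              lam=0.5, beta=0.5):
    """p_doc, p_coll, p_global: term -> probability under the document,
    the peer collection C, and the all-peers collection G respectively.
    Because p(q|G) is shared by all peers, the resulting scores are
    comparable across peers.
    """
    score = 1.0
    for t in query_terms:
        background = (beta * p_coll.get(t, 0.0) +
                      (1 - beta) * p_global.get(t, 0.0))
        score *= lam * p_doc.get(t, 0.0) + (1 - lam) * background
    return score
```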

14 Methods summary
Selected methods: global IDF normalization, ICF normalization, CORI merging, language modeling merging.
Which method is the best? Future experiments will show.
What can we gain from merging in terms of computational efficiency? We can reduce the index processing cost; let us look at the following example.

15 Index processing optimization
Setting: query q = {A B C}, 2 selected peers, top-10 results needed. Index lists are processed with TA-sorted, which stops when WorstScore_10 > BestScore_cand.
[Figure: the querying peer P_q sends the query to peers P_1 and P_2, each scanning its sorted index lists for A, B and C.]
1. The query {A B C} and the global IDFs are posed to the selected peers.
2. Each peer processes its lists: P_1 reaches WorstScore = 0.6, BestScore = 0.8; P_2 reaches WorstScore = 0.8, BestScore = 0.9.
3. Each peer propagates its WorstScore (0.6 and 0.8) to the querying peer.
4. The largest WorstScore = 0.8 is returned to the peers; P_1 stops TA-sorted at WorstScore = 0.7, BestScore = 0.79, and P_2 stops at WorstScore = 0.8, BestScore = 0.79.
5. Without this exchange, a peer would stop only at WorstScore = 0.71, BestScore = 0.7; the earlier stop is our prize.
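A minimal sketch of the stopping condition, assuming index lists sorted by descending score; the per-candidate BestScore bookkeeping of a real TA-sorted implementation is simplified away, and remote_worst_score stands for the largest WorstScore propagated from the other peers:

```python
import heapq

def ta_sorted_top_k(index_lists, k, remote_worst_score=0.0):
    """Scan sorted (doc_id, score) lists round-robin and stop as soon as
    WorstScore_k > BestScore_cand. A higher remote WorstScore raises the
    threshold and lets the scan stop earlier."""
    scores = {}                         # doc_id -> accumulated score
    positions = [0] * len(index_lists)  # scan frontier per list
    while True:
        advanced = False
        for i, lst in enumerate(index_lists):
            if positions[i] < len(lst):
                doc, s = lst[positions[i]]
                positions[i] += 1
                scores[doc] = scores.get(doc, 0.0) + s
                advanced = True
        if not advanced:
            break
        top_k = heapq.nlargest(k, scores.values())
        worst = max(remote_worst_score,
                    top_k[-1] if len(top_k) == k else 0.0)
        # Upper bound for any unseen candidate: sum of the list frontiers.
        best_cand = sum(lst[p][1] if p < len(lst) else 0.0
                        for lst, p in zip(index_lists, positions))
        if worst > best_cand:
            break  # no unseen document can still enter the top-k
    return heapq.nlargest(k, scores.items(), key=lambda x: x[1])
```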

16 Future work
To-do list:
Several merging methods must be implemented and evaluated on an experimental testbed.
The effect of the index processing optimization should be investigated on a family of algorithms.
Other issues:
How can we compute PageRank in a distributed environment?
How can we incorporate bookmarks into result merging?
How can we obtain the best combination of similarity + PageRank + bookmarks?

17 References
[1] Callan, J. P. Distributed Information Retrieval. 2000.
[2] Callan, J. P., Lu, Z., Croft, W. B. Searching Distributed Collections with Inference Networks. 1995.
[3] Si, L., Jin, R., Callan, J., Ogilvie, P. A Language Modeling Framework for Resource Selection and Results Merging. 2002.
[4] Craswell, N. Methods for Distributed Information Retrieval. PhD thesis, 2000.
[5] Kirsch, S. T. Distributed search patent. U.S. Patent 5,659,732, 1997.
[6] Viles, C. L., French, J. C. Dissemination of Collection Wide Information in a Distributed Information Retrieval System. 1995.

