Published by Frederica Cain. Modified over 9 years ago.
1
«Full-text federated search of text-based digital libraries in peer-to-peer networks», Information Retrieval, 2006, Springer. Jie Lu, Jamie Callan, Language Technologies Institute, School of Computer Science, Carnegie Mellon University. Paper presentation: Konstantinos Zaharis, Dept. of Comp. & Comm. Engineering, UTH
2
Paper Outline: Introduction; Overview / prior research; Full-text federated search in p2p; Test data; Evaluation methodology and experimental settings; Results; Conclusions and future work
3
Introduction: Federated ~ distributed. Problem addressed: use of p2p nets as a search layer for text-based digital libraries (DLs). Why p2p? Because they need no central authority (decentralized) and can connect heterogeneous, multi-vendor and lightly managed enterprise nets. In short, they are robust and scalable.
4
Two types of environments in a p2p net. Cooperative p2p environments: each provider gives its own accurate resource description to each neighbouring directory service (hub). Uncooperative p2p environments: each directory service independently conducts query-based sampling to obtain sample documents from its neighbouring providers and builds its own resource descriptions from them.
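To make query-based sampling concrete, here is a minimal sketch under stated assumptions: the provider is simulated by a toy in-memory `toy_search` function, and the names `query_based_sampling`, `PROVIDER_DOCS`, `num_queries` and `docs_per_query` are hypothetical and not taken from the paper. The hub only sees what the provider's search interface returns and builds a term-frequency description from the sampled documents.

```python
import random
from collections import Counter


def query_based_sampling(search, seed_term, num_queries=5, docs_per_query=2, seed=0):
    """Probe a provider through its search interface and build a description.

    `search` is any callable that takes one query term and returns a ranked
    list of document texts (an assumption made for this sketch).
    """
    rng = random.Random(seed)
    sampled_docs = []
    term_freq = Counter()
    term = seed_term
    for _ in range(num_queries):
        # Keep only the top few documents returned for this one-term query.
        for doc in search(term)[:docs_per_query]:
            if doc not in sampled_docs:
                sampled_docs.append(doc)
                term_freq.update(doc.split())
        # Pick the next query term from the vocabulary learned so far.
        if term_freq:
            term = rng.choice(sorted(term_freq))
    return term_freq, sampled_docs


# Toy provider, used only to exercise the sketch.
PROVIDER_DOCS = [
    "peer to peer search",
    "federated search of digital libraries",
    "query routing in peer networks",
    "result merging for digital libraries",
]


def toy_search(term):
    return [d for d in PROVIDER_DOCS if term in d.split()]


description, sample = query_based_sampling(toy_search, "peer")
```

The resulting `description` plays the role of the resource description that a cooperative provider would have sent directly.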
5
Overview: Distributed IR poses three main problems: 1. Resource representation: discover the content areas covered by each DL. 2. Resource selection: decide which DLs are most appropriate for an information need, based on their descriptions. 3. Result merging: merge ranked retrieval results from a set of selected DLs.
6
Resource representation (prior research): STARTS cooperative protocol (Gravano et al., Proc. of ACM SIGMOD, 1997). Query-based sampling for uncooperative environments (Callan, 2000); directly relates to the “hidden web” problem.
7
Resource selection (prior research): Algorithms based on resource ranking (CORI, gGlOSS, Kullback-Leibler divergence based). The cut-off for the number of selected resources is usually set to a heuristic value (e.g. 5 or 10).
8
Result merging (prior research): 1st approach: normalize resource-specific document scores into resource-independent document scores (CORI, SSL merging algorithms). 2nd approach: recalculate document scores at the directory service (Kirsch algorithm; each resource provides summary statistics).
9
P2P network architecture: 1. Clients (information consumers): issue requests (queries). 2. Servers (information providers, DLs): route requests (query routing) to other servers (directory services) and respond to requests (retrieval). 3. Lower-level leaf nodes: providers and consumers. They connect only to hubs. 4. Upper-level hub nodes: directory services. They connect with leaves and with other hubs. 5. Query routing is the unique/critical issue in p2p nets.
10
Structured vs hierarchical p2p architecture. Important distinction: a structured architecture uses a DHT (distributed hash table), which maps every data object to a distributed key. Hierarchical architectures, in contrast, automatically discover the contents of each DL (appropriate for dynamic, heterogeneous, privacy-protected nets). Hierarchical architectures support sophisticated search techniques that are not constrained to controlled or small vocabularies (more appropriate for full-text search). However, they are more complex and incur higher communication costs. Common characteristic: construction of an overlay that organizes peers for efficient query routing (semantic overlay networks).
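As a rough illustration of the structured side of this distinction, the sketch below hashes a key to decide which peer is responsible for it. It is a simplification under stated assumptions: real DHTs use consistent hashing and overlay routing rather than a plain modulo, and the function name `peer_for_key` and the peer labels are hypothetical.

```python
import hashlib


def peer_for_key(key, peers):
    """Map a key (e.g. an index term) to the peer responsible for it."""
    digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
    # Simplification: plain modulo instead of a consistent-hashing ring.
    return peers[int(digest, 16) % len(peers)]


peers = ["peer-0", "peer-1", "peer-2", "peer-3"]
owner = peer_for_key("library", peers)
```

Every peer that hashes the same key reaches the same owner without any content discovery, which is also why this style suits exact-key lookup better than free full-text search.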
11
Existing implementations: PlanetP: each peer uses a TF.IDF algorithm to decide which peers to contact for an information request (Cuenca-Acuna and Nguyen, 2002). pSearch: uses the semantic vector of each document (obtained through LSI) to distribute document indices in a structured p2p net (Tang et al., 2003).
12
Paper contribution: Revise and adapt existing methods to solve these problems more efficiently in hierarchical p2p nets. Develop new approaches (e.g. resource ranking). Distinguish between cooperative and uncooperative environments. Support the thesis with extensive experimental results.
13
Resource description (1): Format: a collection language model (lists of terms and frequencies along with corpus statistics). Resource: can be a single provider (DL), a hub (multiple connected providers) or a neighborhood (all peers reachable from a hub). Description of providers: cf. slide #9. Description of hubs: aggregation of the descriptions of neighboring providers (within 1 hop).
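The description format and the hub-level aggregation can be sketched as follows. This is an illustrative sketch, not the paper's implementation: the dictionary layout and the function names `describe` and `aggregate` are assumptions, and tokenization is plain whitespace splitting.

```python
from collections import Counter


def describe(docs):
    """Collection language model: term frequencies plus corpus statistics."""
    tf = Counter()
    for doc in docs:
        tf.update(doc.lower().split())
    return {"tf": tf, "num_docs": len(docs), "num_terms": sum(tf.values())}


def aggregate(descriptions):
    """Hub description: sum the descriptions of the directly connected providers."""
    tf = Counter()
    num_docs = 0
    for d in descriptions:
        tf.update(d["tf"])
        num_docs += d["num_docs"]
    return {"tf": tf, "num_docs": num_docs, "num_terms": sum(tf.values())}


provider_a = describe(["peer to peer search", "federated search"])
provider_b = describe(["digital libraries", "peer networks"])
hub = aggregate([provider_a, provider_b])
```

Because the descriptions are just counts, aggregation is additive, so a hub never needs the providers' documents themselves.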
14
Neighborhood description: Routing indices: terms + frequencies + path to other docs (Crespo and Garcia-Molina, 2002b). Each hub calculates the resource description of its neighborhood and sends it to its neighboring hubs. Total # of documents aggregated in exponential time. Graph cycles must be detected/avoided because they affect the accuracy of the descriptions.
15
Resource selection (2): Query routing: directing queries to the peers that are most likely to contain relevant documents. Cost is proportional to the # of messages carrying the query. Flooding technique: accurate but inefficient (exponential # of query messages). Random forward technique: relatively efficient but inaccurate. Then what?
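The exponential cost of flooding can be seen with a small back-of-the-envelope calculation. The sketch assumes an idealized cycle-free topology in which every peer forwards the query to the same number of new neighbours; the function name `query_messages` and the fan-out values are hypothetical.

```python
def query_messages(forward_to, ttl):
    """Messages sent when each peer forwards a query to `forward_to` new peers.

    Assumes a cycle-free topology, so hop h generates forward_to ** h messages.
    """
    return sum(forward_to ** hop for hop in range(1, ttl + 1))


flooding = query_messages(forward_to=3, ttl=6)        # forward to every neighbour
random_forward = query_messages(forward_to=1, ttl=6)  # forward to one random neighbour
```

With a fan-out of 3 and a TTL of 6, flooding sends 1092 messages while forwarding to a single neighbour sends 6, which is the gap that content-based selection tries to close without losing accuracy.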
16
Resource ranking (full-text): Providers: use of the K-L divergence resource ranking algorithm to calculate P(Pi | Q) (Si and Callan, 2004). Hubs: same as above, with aggregation over selected neighborhoods. After ranking, the idea is to select the top-ranked entities by either a) specifying a predetermined number (less suitable for dynamically changing nets) or b) letting the entities learn their own threshold values automatically and autonomously.
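A minimal sketch of language-model resource ranking follows. It uses the query-likelihood form, which for a fixed query and a uniform prior over resources gives the same ordering as negative K-L divergence between the query model and the resource model. The Jelinek-Mercer smoothing and its weight `lam` are assumptions made for this sketch; the slides do not state which smoothing the paper uses, and all function names are hypothetical.

```python
import math
from collections import Counter


def query_log_likelihood(query_terms, resource_tf, global_tf, lam=0.5):
    """log P(Q | resource), smoothed with the global (all-resource) model."""
    res_total = sum(resource_tf.values())
    glob_total = sum(global_tf.values())
    score = 0.0
    for term in query_terms:
        p_res = resource_tf.get(term, 0) / res_total if res_total else 0.0
        p_glob = global_tf.get(term, 0) / glob_total if glob_total else 0.0
        p = lam * p_res + (1 - lam) * p_glob
        if p > 0:  # terms unseen everywhere carry no ranking information
            score += math.log(p)
    return score


def rank_resources(query_terms, resources):
    """Return (name, score) pairs, best resource first."""
    global_tf = Counter()
    for tf in resources.values():
        global_tf.update(tf)
    scored = [
        (name, query_log_likelihood(query_terms, tf, global_tf))
        for name, tf in resources.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


resources = {
    "A": Counter({"search": 5, "peer": 5}),
    "B": Counter({"music": 9, "peer": 1}),
}
ranking = rank_resources(["peer", "search"], resources)
```

Selecting resources then amounts to cutting this ranked list, either at a fixed position or at a learned score threshold.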
17
Unsupervised threshold learning method: Providers estimate ranking scores of relevant and non-relevant documents using the merged retrieval results of a set of training queries (set-based threshold learning). Hubs, however, use individual training queries for each member of their neighborhoods (individual-based threshold learning).
18
Result merging (3): Cooperative environments: use of the Kirsch algorithm (Kirsch, 1997), modified so that it no longer needs global statistics (lower cost). Uncooperative environments: no summary statistics are available, so the Semi-Supervised Learning (SSL) algorithm (Si and Callan, 2003a) is adapted: linear regression with local weights and overlapping documents.
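The regression idea behind SSL-style merging can be sketched as below. This is a simplified illustration under stated assumptions: it fits an ordinary (unweighted) least-squares line per provider, whereas the slide mentions local weights, and the function names and example scores are hypothetical. Each overlapping document contributes one training point: its provider-specific score and its resource-independent score at the hub.

```python
def fit_line(points):
    """Least-squares fit of y = slope * x + intercept over (x, y) points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    n = len(points)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        raise ValueError("need at least two distinct provider scores")
    slope = sum((x - mean_x) * (y - mean_y) for x, y in points) / var_x
    return slope, mean_y - slope * mean_x


def merge(results_by_provider, overlap_by_provider):
    """Map each provider's scores to the common scale, then sort all documents."""
    merged = []
    for provider, results in results_by_provider.items():
        slope, intercept = fit_line(overlap_by_provider[provider])
        for doc_id, score in results:
            merged.append((doc_id, slope * score + intercept))
    return sorted(merged, key=lambda pair: pair[1], reverse=True)


# (provider score, resource-independent score) for overlapping documents.
overlap = {
    "A": [(0.2, 0.5), (0.4, 0.9), (0.6, 1.3)],
    "B": [(10.0, 0.2), (20.0, 0.4), (30.0, 0.6)],
}
results = {"A": [("a1", 0.5)], "B": [("b1", 40.0)]}
merged = merge(results, overlap)
```

Although provider B reports a much larger raw score, its mapped score is lower, so `a1` is ranked first in the merged list.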
19
A real P2P network search model. Source: “Full text federated search in P2P networks”, Lu J., PhD Dissertation, CMU 2007
20
Test data and evaluation: Use of a WT10g-based testbed collection. # providers (websites) ~ 2,500. # hubs (content-based clustering) ~ 25. # documents ~ 1,500,000. Queries automatically generated (by extracting key terms from documents). Evaluation criteria: a) search accuracy and b) query routing efficiency.
21
Experimental settings: Four methods for resource selection: flooding; random selection; full-text selection using a fixed threshold (e.g. 1% of the top-ranked neighbouring hubs); full-text selection using learned thresholds. TTL (time-to-live) value for each query message set to 6. Query-based sampling for resource representation in uncooperative environments. # of training points to apply the SSL method set to 3.
22
Experimental results (numbers)
23
Experimental results show: Full-text selection performs better than flooding or random selection. Using learned thresholds for resource selection yields a few more query messages (than using a fixed threshold) but improves search accuracy. Uncooperative environments exhibit ~10% degradation in search performance in comparison to cooperative ones, which is generally considered acceptable.
24
Conclusions and future work: Enhance hub functionality so that a hub not only provides sufficient information for its connected providers, but also calculates a path to other, probably useful, peers (provider routing technique). The method works well for small/medium-sized p2p nets with regulated network structures and organized content distribution. But what happens in larger-scale networks? What happens in dynamically/temporally evolving nets? What about load balancing, dynamic clustering and fault tolerance?
25
Comments: The paper contained no intriguing ideas but proposed practical modifications to existing methods. The writing style showed frequent repetition, verbosity and often vagueness. It is obvious that researchers are more inclined towards better empirical results/tools for real-world applications than towards theoretical models. All references are taken from the reference list of the paper under review.
26
Any questions? Thank you for your attention!