Scalable Peer-to-Peer Web Retrieval with Highly Discriminative Keys
ICDE 2007
Ivana Podnar Žarko*, Martin Rajman, Toan Luu, Fabius Klemm, Karl Aberer
School of Computer and Communication Sciences, EPFL, Lausanne, Switzerland
*FER, University of Zagreb, Croatia
Contact:

Contents
– Motivation
– Indexing and Retrieval model (HDKs)
– Scalability analysis
– Experimental results
– Conclusion

Motivation
Clustered retrieval engines are reaching scalability limits
– Fast-growing public Web
– Immense volume of privately owned content that will never be indexed by search engines like Google or Yahoo
– Dynamically changing content
P2P retrieval as a scalable alternative
– Involve a large number of peer machines (millions)
– Exploit scalable P2P search techniques
– Support community-oriented search

P2P full-text retrieval
Goals
– retrieval performance comparable to state-of-the-art engines
– scalable in terms of generated traffic (indexing and retrieval)
Two basic approaches
– Document partitioning: unstructured overlay network for search (e.g. Gnutella)
– Term partitioning: structured overlay network for search (e.g. Chord, P-Grid)
Problem: communication cost for search [Li et al, IPTPS 2003]
– Document partitioning: broadcast search
– Term partitioning: long posting lists transmitted over the network, in particular when processing multi-term queries

Approach
Some facts about web retrieval
– queries are in general short (on average 2 to 3 terms)
– users pose queries containing frequent terms
– users are interested in a few high-precision answers (fast)
Full-text information retrieval engine built over a structured P2P network, specifically considering these observations
ALVIS PEERS
– EU FP6 research project

P2PIR Architecture
[Figure: IR PEER stack, repeated for three peers, with layers Ranking, HDK Indexing/Querying, P2P, and Web service IF, plus the components LI and GKI]
LI: local single-term index
GKI: global key index, storing pairs (k, postinglist(k))
Structured P2P network with N peers
– logarithmic lookup cost for keys
Large document collection D
Each peer P_i
– a) indexes part of the global collection, D(P_i)
– b) maintains part of the global index

Single-term P2P indexing
key = single term
Local indexes
– Peer1: t1:{d1, d2}, t2:{d1, d3}
– Peer2: t1:{d4, d5}, t2:{d6}
– Peer3: t1:{d7, d8}, t2:{d7}
Global single-term index
– t1:{d1, d2, d4, d5, d7, d8}
– t2:{d1, d3, d6, d7}
Querying peer: Q = {t1, t2}, answer t1,t2:{d1, d4, d7}
Retrieval traffic is not scalable!
– grows with the collection size D (Heaps' law), where D is the collection size in no. of terms
– experimentally linear, since frequent terms are used frequently in queries

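The single-term example can be replayed in a few lines of Python. This is only a minimal sketch, not the system's implementation: the names `build_global_index` and `postings_transferred` are made up for illustration, and the code assumes that answering a query ships the complete posting list of every query term to the querying peer.

```python
from collections import defaultdict

# Local single-term indexes of the three peers, copied from the slide.
local_indexes = {
    "Peer1": {"t1": ["d1", "d2"], "t2": ["d1", "d3"]},
    "Peer2": {"t1": ["d4", "d5"], "t2": ["d6"]},
    "Peer3": {"t1": ["d7", "d8"], "t2": ["d7"]},
}


def build_global_index(local_indexes):
    """Merge the per-peer posting lists into one global single-term index."""
    global_index = defaultdict(list)
    for peer_index in local_indexes.values():
        for term, postings in peer_index.items():
            global_index[term].extend(postings)
    return {term: sorted(postings) for term, postings in global_index.items()}


def postings_transferred(global_index, query):
    """Count postings sent to the querying peer (assumption: full lists)."""
    return sum(len(global_index.get(term, [])) for term in query)


global_index = build_global_index(local_indexes)
print(global_index["t1"])   # merged posting list of t1
print(postings_transferred(global_index, ["t1", "t2"]))  # 6 + 4 postings
```

Under this assumption the two-term query already moves 10 postings for an 8-document collection, and the posting list of every frequent term keeps growing as documents are added.
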
HDK-based P2P indexing
key = set of terms
DF max = 4
Local indexes
– Peer1: t1:{d1, d2}, t2:{d1, d3}
– Peer2: t1:{d4, d5}, t2:{d6}, t3:{d5, d6}
– Peer3: t1:{d7, d8}, t2:{d8}, t3:{d7}
Global key index
– t1:{d4, d1, d8, d5} (posting list truncated to top-DF max postings)
– t2:{d1, d3, d6, d8}
– k13 = (t1, t3):{d5, d7}
Querying peer: Q = {t1, t2, t3}, answer k13, k2:{d5, d7, d1, d3, d6, d8}
Retrieval traffic is bounded by DF max and query size!

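The slide does not spell out how a query is mapped to keys, so the following Python sketch assumes one plausible strategy: take every indexed key contained in the query, largest first, and skip keys that are subsets of an already selected key. The function names `map_query_to_keys` and `retrieve` are hypothetical; the index contents are those of the example.

```python
from itertools import combinations

DF_MAX = 4

# Global key index from the slide: keys are term sets and every
# posting list holds at most DF_MAX entries.
global_key_index = {
    frozenset({"t1"}): ["d4", "d1", "d8", "d5"],
    frozenset({"t2"}): ["d1", "d3", "d6", "d8"],
    frozenset({"t1", "t3"}): ["d5", "d7"],
}


def map_query_to_keys(query_terms, index):
    """Select indexed keys inside the query, largest first (assumed strategy).

    A key is skipped when it is a proper subset of a key selected earlier.
    """
    selected = []
    terms = sorted(query_terms)
    for size in range(len(terms), 0, -1):
        for combo in combinations(terms, size):
            key = frozenset(combo)
            if key in index and not any(key < chosen for chosen in selected):
                selected.append(key)
    return selected


def retrieve(query_terms, index):
    """Concatenate the posting lists of the selected keys, dropping duplicates."""
    result = []
    for key in map_query_to_keys(query_terms, index):
        for doc in index[key]:
            if doc not in result:
                result.append(doc)
    return result


keys = map_query_to_keys({"t1", "t2", "t3"}, global_key_index)
docs = retrieve({"t1", "t2", "t3"}, global_key_index)
print([sorted(k) for k in keys])  # [['t1', 't3'], ['t2']]
print(docs)                       # ['d5', 'd7', 'd1', 'd3', 'd6', 'd8']
```

For Q = {t1, t2, t3} this selects k13 and k2 and returns the six documents shown in the example, which stays within the bound of two keys times DF max = 8 postings.
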
Single-term vs. HDK-based P2P indexing
– comparable retrieval quality (extended vocabulary)
– vocabulary size could grow exponentially!

Keys and key filtering
Non-Discriminative Keys (NDKs), e.g. t1 is an NDK iff:
– t1 appears in more than DF max collection documents
– posting lists are truncated to the top-DF max documents
Highly-Discriminative Keys (HDKs), e.g. (t1, t2) is an HDK iff:
– t1 and t2 appear together in fewer than DF max collection documents (discriminative w.r.t. the document collection)
– t1 and t2 are each non-discriminative (redundancy filter)
– t1 and t2 occur within a window of size w (proximity filter)
– the no. of terms comprising a key is limited by s max (size filter)
– posting lists by definition contain at most DF max documents
Key filtering enables scalable indexing!

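The filters can be illustrated with a small, centralized Python sketch. It is not the distributed indexing procedure of the system: `find_keys` and its helpers are hypothetical names, the toy documents are invented, rare single terms are treated as discriminative keys of size 1, and a key occurring in exactly DF max documents is counted as discriminative (the slide leaves that boundary case open).

```python
from itertools import combinations


def df_within_window(key, docs, w):
    """Number of documents in which all key terms occur within w consecutive tokens."""
    count = 0
    for tokens in docs:
        if any(key <= set(tokens[i:i + w]) for i in range(len(tokens))):
            count += 1
    return count


def find_keys(docs, df_max, w, s_max):
    """Return (ndks, hdks) as sets of frozensets, built by increasing key size."""
    ndks, hdks = set(), set()
    vocabulary = {term for tokens in docs for term in tokens}
    for term in vocabulary:
        key = frozenset({term})
        if df_within_window(key, docs, w) > df_max:
            ndks.add(key)
        else:
            hdks.add(key)
    ndk_terms = sorted(term for key in ndks for term in key)
    for size in range(2, s_max + 1):          # size filter
        for combo in combinations(ndk_terms, size):
            key = frozenset(combo)
            # Redundancy filter: every sub-key of size - 1 must be an NDK.
            subkeys = combinations(combo, size - 1)
            if not all(frozenset(sub) in ndks for sub in subkeys):
                continue
            # Proximity filter: only co-occurrences inside the window count.
            df = df_within_window(key, docs, w)
            if df == 0:
                continue
            if df > df_max:
                ndks.add(key)
            else:
                hdks.add(key)
    return ndks, hdks


docs = [
    ["p2p", "web", "search"],
    ["p2p", "web", "index"],
    ["p2p", "search", "engine"],
    ["web", "retrieval", "engine"],
]
ndks, hdks = find_keys(docs, df_max=2, w=3, s_max=2)
print(sorted(sorted(k) for k in ndks))  # [['p2p'], ['web']]
print(frozenset({"p2p", "web"}) in hdks)  # True
```

In the toy collection both `p2p` and `web` occur in three documents and are therefore NDKs, while their combination occurs in only two and becomes an HDK.
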
Scalability analysis (indexing)
What is the upper bound on the index size for a very large document collection?
– D: collection size in no. of terms
– s: no. of terms comprising a key
– w: window size
– IS_s: index size associated with keys of size s
– P_f,(s-1): probability of NDK occurrences where the NDK size is (s-1)
[Table: key size (1, 2, ..., s) vs. index size (location index); constant?]

Scalability analysis (indexing)
Zipf model
[Figure: z(r) over rank r; labels: F_f, F_r, very frequent terms, frequent terms, rare terms, DF max, NDKs, HDKs]

Scalability analysis (indexing)
C increases with an increasing collection size, a remains constant.
[Figure: z(r) over rank r, with F_f and F_r marked, as D increases]
Theorem: The probability P_f,(s-1) of NDK occurrence remains constant!

Scalability analysis (retrieval)
Retrieval traffic is bounded by DF max and the number of keys a query is mapped to (constant)
Scalability is theoretically guaranteed, but what are the constants?
Experiments!

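A rough feel for the retrieval bound can be obtained with a short calculation. The sketch below assumes that a query can map at most to every subset of its terms of size 1 to s max, which ignores the proximity and redundancy filters and is therefore a loose upper bound; the function names are hypothetical.

```python
from math import comb


def max_keys_per_query(query_size, s_max):
    """Loose upper bound: all term subsets of size 1..s_max (assumption)."""
    largest = min(query_size, s_max)
    return sum(comb(query_size, s) for s in range(1, largest + 1))


def max_postings_per_query(query_size, s_max, df_max):
    """Each key contributes at most df_max postings."""
    return max_keys_per_query(query_size, s_max) * df_max


# A 3-term query with s_max = 3 and DF_max = 500:
print(max_keys_per_query(3, 3))           # 3 + 3 + 1 = 7 keys
print(max_postings_per_query(3, 3, 500))  # 7 * 500 = 3500 postings
```

The bound depends only on the query size, s max and DF max, not on the collection size D.
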
Experiment
System fully implemented in Java (available on request)
Document collection
– 20,000, ... documents from Wikipedia
Query log
– Wikipedia query log for 2 months (08/2004 and 09/2004)
– 3,000 randomly chosen queries from 2,000,000 unique queries with more than 20 hits
No. of peers: 4, 8, ..., 28
– PCs running RedHat Linux with 1 GB memory
– 100 Mbit Ethernet
– Each peer indexes documents
DF max = 400 or 500, s max = 3, w = 20

Indexing costs
[Figure: average index size per peer]
[Figure: average indexing traffic per peer]
HDK vs. single-term (ST) indexing
– experimentally: HDK / ST = 13.9 (for documents)
– theoretically: HDK / ST = 40.7 (overestimated upper bound!)

Retrieval costs
[Figure: retrieval traffic per query (Wikipedia query log)]
Retrieval traffic per query remains constant with a growing collection size for the HDK approach (linear for single-term)

Estimated total generated traffic
Assumptions
– monthly indexing
– no. of queries per month: 1.5 × 10^6 (true no. of queries from the Wikipedia log, conservative estimate)
For 1 billion documents, HDK generates 42 times less overall traffic

Retrieval performance
[Figure: overlap on top 20 documents]
Performance of the HDK-based approach is comparable to the centralized single-term engine with BM25

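The slide does not define the overlap measure precisely. A common reading, assumed in this Python sketch, is the fraction of the reference engine's top-k documents that also appear in the top-k of the other engine; the function name `top_k_overlap` and the document ids are hypothetical.

```python
def top_k_overlap(reference_ranking, other_ranking, k=20):
    """Fraction of the reference top-k that also appears in the other top-k."""
    top_reference = set(reference_ranking[:k])
    top_other = set(other_ranking[:k])
    if not top_reference:
        return 0.0
    return len(top_reference & top_other) / len(top_reference)


# Toy rankings (made-up ids), compared on the top 4 for brevity.
centralized = ["d1", "d2", "d3", "d4"]
hdk_based = ["d2", "d1", "d5", "d4"]
print(top_k_overlap(centralized, hdk_based, k=4))  # 0.75
```

The measure ignores the order inside the top-k, so it reports how many of the same documents are returned, not whether they are ranked identically.
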
Conclusion
– Novel indexing model based on indexing terms and term sets
– Theoretical scalability model proves that the proposed solution scales to large networks in terms of generated traffic, both for indexing and retrieval
– Running P2P prototype that exhibits retrieval performance fully comparable to a centralized term-based retrieval system
– Associated resource requirements (storage, bandwidth consumption) grow in a scalable way, as shown by experiments

Ongoing work
Further reduce the number of indexing keys using query-driven indexing, to produce and store only keys that are profitable for query answering

Acknowledgement
The work presented in this paper was carried out in the framework of the EPFL Center for Global Computing and supported by the Swiss National Funding Agency OFES as part of the European FP6 STREP project ALVIS (002068).