Query-Driven Indexing for Peer-to-Peer Text Retrieval

WWW 2007, Banff, Canada

G. Skobeltsyn, T. Luu, I. Podnar, M. Rajman, K. Aberer
Contact: Gleb Skobeltsyn
I. Podnar is currently affiliated with University of Zagreb, Croatia.

The work presented in this paper was (partly) carried out in the framework of the EPFL Center for Global Computing and supported by the Swiss National Funding Agency OFES as part of the European projects BRICKS (507457) and ALVIS (002068).

Our goal: Scalable full text web retrieval in a structured P2P network.

Features:
- Low bandwidth during retrieval, as posting lists of bounded size are transmitted.
- The content of the index adapts to the current query popularity distribution.
- Tradeoff between retrieval quality and index size (i.e., indexing cost).

Experiments: retrieval quality of the query-driven index when compared to Google

- Top-20 overlap measure: use Google to answer a query and compare the result to the union of the top-DFmax Google results for each of its indexed keys.
- QFmin: keys are indexed if contained in more than QFmin queries in the global query history.

Figures:
- Overlap achieved for different sizes of the query log, measured in number of days, with QFmin = 1, DFmax = 600.
- Overlap achieved for different values of DFmax with QFmin = 1.
- Overlap achieved for different values of QFmin / 3 months with DFmax = 600.

More details in:
- Skobeltsyn et al.: "Query-Driven Indexing for Scalable Peer-to-Peer Text Retrieval", in Infoscale'07, Suzhou, China, 2007
- Skobeltsyn et al.: "Web Text Retrieval with a P2P Query-Driven Index", in SIGIR'07, Amsterdam, The Netherlands
- Alvis project web site:

The naïve approach:
- A distributed single-term index maintains global posting lists for each single term in a DHT.
- To process a multi-term query abc, it intersects the full posting lists of a, b and c.
- Intersections lead to unscalable retrieval traffic.

Distributed query-driven index:
- A distributed query-driven index maintains truncated posting lists (TPLs), storing top-DFmax document references, for carefully selected term combinations (keys).
- To process a multi-term query abc, we compute the top-k results by collecting (truncated) posting lists for currently indexed combinations, e.g., ab or bc.
- We maintain a global query history and use it to identify popular (qf ≥ QFmin) and non-redundant combinations.

Figure: Processing the query abc with a query-driven index

Example of resolving a query:

what did babe ruth do in the 1920
>id=481, q="what did babe ruth do in the 1920"
"1920 babe ruth", qf=0 ----> 100%
"1920 babe", qf= > 9%
"1920 ruth", qf= > 33%
"babe ruth", qf= > 69%
"1920", qf= > 1%
"babe", qf= > 2%
"ruth", qf= > 7%
Size: 192, Keys used: 2, 94%
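
The contrast between the naïve single-term index and the query-driven index can be illustrated with a small single-machine sketch. It is not the authors' implementation: the function names (candidate_keys, build_index, process_query, naive_intersection), the toy term-count score, the rule of skipping keys already covered by a larger indexed key, and merging by maximum score are all assumptions made for illustration. The sketch also omits the DHT, stopword removal, and the filtering of redundant combinations mentioned in the poster.

```python
from itertools import combinations


def candidate_keys(terms):
    """Yield all non-empty term combinations of a query, largest first."""
    terms = sorted(set(terms))
    for size in range(len(terms), 0, -1):
        for combo in combinations(terms, size):
            yield frozenset(combo)


def build_index(docs, query_log, qf_min=1, df_max=600):
    """Index every key that occurs in at least qf_min logged queries.

    Each key maps to a truncated posting list of at most df_max
    (score, doc_id) pairs. The score is a toy stand-in (assumption).
    """
    qf = {}
    for query in query_log:
        for key in candidate_keys(query.split()):
            qf[key] = qf.get(key, 0) + 1

    index = {}
    for key, freq in qf.items():
        if freq < qf_min:
            continue
        scored = []
        for doc_id, text in docs.items():
            words = text.split()
            if all(term in words for term in key):
                scored.append((sum(words.count(term) for term in key), doc_id))
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        index[key] = scored[:df_max]  # truncation bounds the transfer size
    return index


def process_query(index, query, k=20):
    """Collect truncated posting lists of the largest indexed keys."""
    used, merged = [], {}
    for key in candidate_keys(query.split()):
        if key not in index:
            continue
        if any(key < bigger for bigger in used):
            continue  # proper subset of a key that was already fetched
        used.append(key)
        for score, doc_id in index[key]:
            merged[doc_id] = max(merged.get(doc_id, 0), score)
    ranked = sorted(merged, key=lambda d: (-merged[d], d))
    return ranked[:k], used


def naive_intersection(docs, query):
    """Baseline: intersect the full posting list of every query term."""
    postings = [
        {doc_id for doc_id, text in docs.items() if term in text.split()}
        for term in query.split()
    ]
    return set.intersection(*postings)


docs = {
    1: "babe ruth home run 1920 yankees",
    2: "babe ruth biography",
    3: "1920 olympics antwerp",
    4: "ruth 1920 season babe ruth",
}
index = build_index(docs, ["babe ruth", "1920 ruth"], qf_min=1, df_max=2)
results, used = process_query(index, "1920 babe ruth")
print(results, len(used))  # [4, 1] 2
```

In this toy run the three-term key is not indexed, so the query is answered from the two indexed pairs, and no list longer than df_max entries is ever fetched. The naïve baseline returns the same two documents here but has to scan the full posting list of every term.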
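
The top-20 overlap measure used in the experiments can be sketched in a similar way. The poster does not give an exact formula, so this is one plausible reading: the fraction of the reference top-k results that also appear in the union of the per-key result lists. The function name top_k_overlap and the normalization by the length of the reference list are assumptions. No search engine is called, and the result lists are passed in as plain Python lists.

```python
def top_k_overlap(reference_ranking, key_result_lists, k=20):
    """Fraction of the reference top-k found in the union of per-key lists."""
    reference = reference_ranking[:k]
    if not reference:
        return 0.0
    pool = set()
    for results in key_result_lists:
        pool.update(results)
    hits = sum(1 for doc in reference if doc in pool)
    return hits / len(reference)


# Hypothetical document identifiers, for illustration only.
reference = ["d1", "d2", "d3", "d4"]
per_key = [["d1", "d9"], ["d3", "d4", "d7"]]
print(top_k_overlap(reference, per_key, k=4))  # 0.75
```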