G.Skobeltsyn | Query-Driven Indexing for Scalable P2P Text Retrieval. Infoscale’07, June 6-8, 2007.

Presentation transcript:

Query-Driven Indexing for Scalable P2P Text Retrieval
Infoscale’07, June 6-8, 2007, Suzhou, China
Gleb Skobeltsyn, EPFL, Switzerland
June 6, 2007
Joint work with: Toan Luu, Ivana Podnar Žarko, Martin Rajman, Karl Aberer
Alvis

Goal
Our goal is to achieve scalable full-text retrieval with structured P2P networks (DHTs).
Each peer:
– Provides resources (bandwidth, storage)
– Searches the whole network
– Publishes its own documents

Naïve (single-term) approach
... is to distribute the global inverted index in a DHT:
Query: “epfl & gleb”
h(“epfl”) → {d1, d2}
h(“gleb”) → {d2, d3}
h(t’) → {d4, d5}
[Figure: peers of the DHT holding the posting lists; the labels {d1, d2} and {d2} mark the list being processed and the query result]
This slide was borrowed from the presentation by B. T. Loo, J. M. Hellerstein, R. Huebsch, S. Shenker, I. Stoica: Enhancing P2P File-Sharing with an Internet-Scale Query Processor
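The single-term scheme on this slide can be illustrated with a small sketch. It is only a toy model under stated assumptions: the class `ToyDHT`, the SHA-1-modulo placement standing in for h(·), and the `transferred` counter are all hypothetical and are not taken from the talk.

```python
import hashlib


class ToyDHT:
    """Toy stand-in for a DHT: each term is stored on the peer chosen by a hash."""

    def __init__(self, num_peers):
        self.peers = [dict() for _ in range(num_peers)]

    def _peer_for(self, term):
        # Assumption: h(term) is modeled as SHA-1 modulo the number of peers.
        digest = hashlib.sha1(term.encode("utf-8")).hexdigest()
        return int(digest, 16) % len(self.peers)

    def publish(self, term, doc_id):
        self.peers[self._peer_for(term)].setdefault(term, set()).add(doc_id)

    def lookup(self, term):
        return set(self.peers[self._peer_for(term)].get(term, set()))


def conjunctive_query(dht, terms):
    """Intersect the full posting lists of all terms, shortest list first.

    Returns the matching documents and the number of document references
    that had to be shipped from one peer to the next.
    """
    lists = sorted((dht.lookup(t) for t in terms), key=len)
    if not lists:
        return set(), 0
    result, transferred = lists[0], 0
    for nxt in lists[1:]:
        transferred += len(result)  # the intermediate list travels to the next peer
        result = result & nxt
    return result, transferred
```

With “epfl” published for d1, d2 and “gleb” for d2, d3, `conjunctive_query(dht, ["epfl", "gleb"])` yields `{"d2"}` after shipping two references. Because the lists here are not truncated, the shipped amount grows with the posting list length.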

Indexing with Highly Discriminative Keys
[1] Scalable Peer-to-Peer Web Retrieval with Highly Discriminative Keys, I. Podnar, M. Rajman, T. Luu, F. Klemm, K. Aberer, in ICDE’07, Istanbul, Turkey

Indexing with HDKs: main properties
Distributed index contains {key, PL} pairs:
– Each key corresponds to a term or a set of terms
– Each key is assigned to a posting list
– Each posting list stores at most DF_max top-ranked document references
Data-driven key generation:
– Each time a new document is indexed, some posting lists for a key k can reach the max size of DF_max
– It triggers the generation of new keys (k + other frequent keys)
– Proximity filter: a document qualifies for a key t1&t2 if t1 is close to t2 (specified by a window size w)
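Two of the properties above, truncation at DF_max and the proximity filter, can be sketched as follows. This is a minimal illustration, not the HDK implementation: the function names are hypothetical, and "close" is assumed here to mean that the two terms occur at most w token positions apart.

```python
import heapq


def truncate_posting_list(scored_docs, df_max):
    """Keep the df_max highest-scoring (doc_id, score) pairs, best first."""
    return heapq.nlargest(df_max, scored_docs, key=lambda pair: pair[1])


def within_window(tokens, t1, t2, w):
    """Proximity filter (assumed definition): True if some occurrence of t1
    lies at most w token positions away from some occurrence of t2."""
    pos1 = [i for i, tok in enumerate(tokens) if tok == t1]
    pos2 = [i for i, tok in enumerate(tokens) if tok == t2]
    return any(abs(i - j) <= w for i in pos1 for j in pos2)
```

For the token sequence "gleb works at epfl in lausanne", the pair gleb&epfl passes with w = 3, while gleb&lausanne does not.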

HDK – exhaustive data-driven indexing
Pros:
– The ICDE’07 paper proves that the number of keys grows linearly
– Elegant key generation mechanism
– Low bandwidth during query processing (PLs of limited size)
Cons:
– In practice the number of keys is LARGE: 68M for 0.6M docs
– High bandwidth consumption at indexing
Problem:
– Too many keys are superfluous (almost never used)

Query-Driven Indexing
Let’s index only what is queried!

Contents
Introduction
HDK approach for indexing
Query-driven approach for indexing/retrieval
– Indexing structure
– Example
– ONM
– Scalability
– Evaluation
Conclusion

Query-Driven Index (QDI)
Query-Driven Indexing strategy solves the “Too-Many-Keys” problem:
– Avoids maintenance of superfluous keys
– Generates only such keys that are requested by users
– Utilizes the query log to discover such keys
Problems:
– Indexing of a new key requires a bandwidth-efficient mechanism to obtain the top-k posting list associated with the key. Solution: Opportunistic Notification Mechanism (smart broadcast)
– An incomplete index causes degradation of the quality of query results. We show that the degradation is low

Which keys to index?
Each single term found in the document collection has to be indexed.
– We call all single-term keys a basic single-term index.
– The posting lists are truncated at DF_max.
A key k is non-superfluous and can be activated iff:
– k is popular: QF(k) ≥ QF_min, where QF(k) is the popularity of the key k derived from the available query log and QF_min is a parameter of our model (popularity filter).
– k contains from 2 to s_max terms: 2 ≤ |k| ≤ s_max, where s_max is a parameter of our model (size filter).
– all immediate sub-keys of k (of size |k|-1) are indexed and their associated posting lists are truncated (redundancy filter).
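The three filters above can be combined into a single activation test. The sketch below is one possible reading, with assumed data structures: keys are `frozenset`s of terms, `qf` maps a key to its query frequency, and `truncated` maps each indexed key to a flag saying whether its posting list was cut at DF_max. None of these names come from the talk.

```python
from itertools import combinations


def can_activate(key, qf, truncated, qf_min, s_max):
    """Return True if `key` passes the size, popularity and redundancy filters.

    key       -- frozenset of terms
    qf        -- dict: key -> query frequency observed so far
    truncated -- dict: indexed key -> True if its posting list is truncated
    """
    # Size filter: 2 <= |k| <= s_max.
    if not (2 <= len(key) <= s_max):
        return False
    # Popularity filter: QF(k) >= QF_min.
    if qf.get(key, 0) < qf_min:
        return False
    # Redundancy filter: every sub-key of size |k|-1 is indexed and truncated.
    for sub in combinations(sorted(key), len(key) - 1):
        if not truncated.get(frozenset(sub), False):
            return False
    return True
```

If the posting list of a term b is short enough to be stored in full, any key containing b fails the redundancy filter, since the complete list for b already answers such queries.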

QDI: Retrieval
[Figure: query lattice over the keys abc; ab, bc, ac; a, b, c]
Single-term index is generated
Process abc:
1) Probe P_abc
2) Probe P_ab, P_bc and P_ac
3) Probe P_a, P_b and P_c
4) Obtain top-DF_max results for a, b and c (ranked w.r.t. a, b and c respectively)
5) Contact peers in the list, re-rank the obtained results w.r.t. abc
6) Output top-10
Increment the QF for ab, bc and ac
Activate (index) ac
[Figure labels: peer; ?abc, nothing; ?abc, nothing; ?abc, +1; DF_max; popular]

QDI: Retrieval 2
[Figure: query lattice over the keys abc; ab, bc, ac; the nodes abc, ab and bc are marked]
Assume the frequency of b is below DF_max.
Note how the redundancy filter would simplify the lattice in such a case (grayed nodes cannot be activated).

QDI: Retrieval 3
[Figure: query lattice over the keys abc; ab, bc, ac]
Single-term index is generated and ac is indexed
Process abc:
1) Probe P_abc
2) Probe P_ab, P_bc and P_ac: obtain the result for ac
3) Probe P_b and obtain the result for b
4) Contact all peers in the list to re-rank the obtained results w.r.t. abc
5) Output top-10
Increment the QF for ab, bc and ac
[Figure labels: peer; ?abc, nothing; ?abc, nothing; ?abc]
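The probing order in the retrieval slides can be sketched as a top-down walk over the query lattice that skips any key already covered by a larger key found earlier. This is a simplified model under assumptions: the distributed index is a plain dictionary, probing P_k is a dictionary lookup, and `score` is a hypothetical callable standing in for the re-ranking done by contacting the peers that hold the documents.

```python
from itertools import combinations


def probe_lattice(query_terms, index):
    """Probe keys from the largest to the smallest term combination.

    index -- dict: frozenset of terms -> posting list (list of doc ids)
    Returns the posting lists found and the order in which keys were probed.
    """
    terms = sorted(set(query_terms))
    found, probes = {}, []
    for size in range(len(terms), 0, -1):
        for combo in combinations(terms, size):
            key = frozenset(combo)
            # A key that is a proper subset of an already found key is covered.
            if any(key < hit for hit in found):
                continue
            probes.append(key)
            postings = index.get(key)
            if postings is not None:
                found[key] = postings
    return found, probes


def rerank(found, score, top_n=10):
    """Merge all retrieved posting lists and re-rank w.r.t. the full query."""
    candidates = set().union(*found.values())
    return sorted(candidates, key=score, reverse=True)[:top_n]
```

With single-term entries for a, b, c plus an entry for ac, the query abc probes abc, ab, ac, bc and then only b, which matches the sequence on the “Retrieval 3” slide.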

Opportunistic Notification Mechanism
ONM is used to activate a new multi-term key.
ONM is a “smart” broadcast with the following features:
– It is based on the shower multicast [2]: each peer within a specified range is contacted only once
– Notifications are small and low-priority => piggybacking
– Broadcast is split into several multicast sessions, each time pruning low-score documents
– It uses the high-performance DHT layer [3]
[2] A. Datta, M. Hauswirth, R. Schmidt, R. John, K. Aberer: Range Queries in Tree-Structured Overlays, in P2P’05
[3] F. Klemm, J.-Y. Le Boudec, D. Kostic, K. Aberer: Improving the Throughput of Distributed Hash Tables Using Congestion-Aware Routing, in IPTPS’07
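The slide does not spell out how sessions prune low-score documents, so the following is only one possible interpretation and not the ONM protocol itself. It assumes that each session carries the score of the current k-th best document as a threshold, and that peers report only documents scoring above it. The function name, the `session_size` parameter and the `reported` counter are hypothetical.

```python
import heapq


def collect_top_k(peer_docs, k, session_size):
    """Gather the k best documents for a new key in several sessions.

    peer_docs    -- list with one entry per peer: a list of (score, doc_id)
    session_size -- number of peers contacted per session (assumed parameter)
    Returns the top-k list (best first) and how many documents were reported.
    """
    top = []        # min-heap holding the best k (score, doc_id) seen so far
    reported = 0
    for start in range(0, len(peer_docs), session_size):
        # The threshold is fixed at the start of a session (assumption).
        threshold = top[0][0] if len(top) == k else float("-inf")
        for docs in peer_docs[start:start + session_size]:
            for score, doc_id in docs:
                if score <= threshold:
                    continue  # pruned locally, never sent over the network
                reported += 1
                if len(top) < k:
                    heapq.heappush(top, (score, doc_id))
                else:
                    heapq.heappushpop(top, (score, doc_id))
    return sorted(top, reverse=True), reported
```

In this model later sessions report fewer documents because the threshold only rises, which is the bandwidth saving that the pruning bullet suggests.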

Scalability
The retrieval traffic is bounded by a constant due to truncated posting lists (depends on DF_max and the query size).
The indexing traffic depends on the number of keys to be activated:
– The number of keys in the HDK approach (UPPER BOUND) is proven to grow linearly with the number of peers, if each peer provides a limited number of documents
– The number of keys does not depend on the document collection size but only on the size of the query log
– We can use the QF_min parameter to adjust the tradeoff: indexing traffic vs. retrieval quality
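The dependence of the key count on the query log and on QF_min can be made concrete with a small sketch. It is a simplification under assumptions: every term combination of size 2 to s_max contained in a query counts towards QF, and the redundancy filter is ignored, so the result is an upper estimate of the keys that would be activated. The function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def query_frequencies(query_log, s_max):
    """Count how often each term combination (2..s_max terms) occurs in the log."""
    qf = Counter()
    for query in query_log:
        terms = sorted(set(query.split()))
        for size in range(2, min(s_max, len(terms)) + 1):
            for combo in combinations(terms, size):
                qf[frozenset(combo)] += 1
    return qf


def popular_keys(qf, qf_min):
    """Keys passing the popularity filter QF(k) >= QF_min."""
    return {key for key, count in qf.items() if count >= qf_min}
```

Only the query log enters the computation, so adding documents to the collection does not change the number of candidate keys, and raising `qf_min` can only shrink the set.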


Overlap experiment
– Use the Wikipedia query-log (9M queries/ ) to build the index
– Choose randomly 3K test queries
– Answer each test query with Google and compare to the union of top-DF_max Google results for each of its combinations that are indexed according to the logs
– Mimics our P2PIR system if Google’s ranking is used
[Figure: example showing an original query and its non-superfluous (indexed) combinations]

Overlap example
Cut-n-paste from the simulation log:
>id=481, q=“what did babe ruth do in the 1920”
“1920 babe ruth”, qf=0 ----> 100%
“1920 babe”, qf= > 9%
+++“1920 ruth”, qf= > 33%
+++“babe ruth”, qf= > 69%
---“1920”, qf= > 1%
---“babe”, qf= > 2%
---“ruth”, qf= > 7%
Size: 192, Keys used: 2, 94%
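The comparison behind such percentages can be sketched as a set overlap. The exact formula used in the simulation is not given on the slides, so this assumes the simplest reading: the fraction of the reference top results for the original query that also appear in the union of the result lists of the indexed combinations. The function name and argument layout are hypothetical.

```python
def overlap(reference_top, indexed_results):
    """Fraction of reference results covered by the union of indexed-key results.

    reference_top   -- list of doc ids returned for the original query
    indexed_results -- dict: indexed key -> list of doc ids (top-DF_max)
    """
    reference = set(reference_top)
    if not reference:
        return 0.0
    union = set().union(*indexed_results.values())
    return len(reference & union) / len(reference)
```

For a reference list d1..d4 and two indexed keys returning {d1, d2, d9} and {d2, d3}, the overlap under this definition is 3/4.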

Overlap with Google

Overlap with Yahoo

Overlap with Google (no/partial/full overlap)

P2P Index Simulations
– Number of keys depends only on the query log size and QF_min! Does not depend on the collection size!
– Number of keys is much smaller than for the HDK approach: 68M keys for 650K docs

Real query logs?
Wikipedia queries are unrealistic (too skewed) as users know what they want. Real web queries might perform worse?
Large-scale experiments with real web queries and the TREC collection in [4]
[4] Web Text Retrieval with a P2P Query-Driven Index, G. Skobeltsyn, T. Luu, I. Podnar Žarko, M. Rajman, K. Aberer. To appear in SIGIR’07

Conclusions
We presented the query-driven indexing strategy for scalable web text retrieval with structured P2P networks:
– Stores posting lists in a DHT for terms and term combinations
– Stores at most DF_max top document references in a posting list
– Efficiently collects the query statistics in a distributed fashion
– Based on these statistics, activates (indexes) only popular keys
– Computes the result of a multi-term query based only on the index entries available at the moment (no costly intersections)
We also showed that:
– With real query logs our approach achieves good retrieval quality
– The QF_min parameter adjusts the traffic/quality tradeoff

Thank you for your attention! Questions?