Scalable Peer-to-Peer Web Retrieval with Highly Discriminative Keys (ICDE 2007)


Ivana Podnar Žarko*, Martin Rajman, Toan Luu, Fabius Klemm, Karl Aberer
School of Computer and Communication Sciences, EPFL, Lausanne, Switzerland
*FER, University of Zagreb, Croatia
Contact:

Contents
– Motivation
– Indexing and Retrieval model (HDKs)
– Scalability analysis
– Experimental results
– Conclusion

Motivation
Clustered retrieval engines are reaching scalability limits
– Fast-growing public Web
– Immense volume of privately owned content that will never be indexed by search engines like Google or Yahoo
– Dynamically changing content
P2P retrieval as a scalable alternative
– Involve a large number of peer machines (millions)
– Exploit scalable P2P search techniques
– Support community-oriented search

P2P full text retrieval
Goals
– retrieval performance comparable to state-of-the-art engines
– scalable in terms of generated traffic (indexing and retrieval)
Two basic approaches
– Document partitioning → unstructured overlay network for search (e.g. Gnutella)
– Term partitioning → structured overlay network for search (e.g. Chord, P-Grid)
Problem: communication cost for search [Li et al., IPTPS 2003]
– Document partitioning: broadcast search
– Term partitioning: long posting lists transmitted over the network, in particular when processing multi-term queries

Approach
Some facts about web retrieval
– queries are in general short (on average 2 to 3 terms)
– users pose queries containing frequent terms
– users are interested in a few high-precision answers (fast)
Full-text information retrieval engine built over a structured P2P network, specifically considering these observations
ALVIS PEERS
– EU FP6 research project


P2PIR Architecture
[Figure: IR peers, each with the components Ranking, HDK Indexing/Querying, P2P, Web service IF, LI, and GKI]
LI: Local single-term index
GKI: Global key index (k, postinglist(k))
Structured P2P network with N peers
– logarithmic lookup cost for keys
Large document collection D
Each peer P_i
– a) indexes part of the global collection D, denoted D(P_i)
– b) maintains part of the global index
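To make "each peer maintains part of the global index" concrete, here is a minimal sketch of how a key could be assigned to a responsible peer. It uses plain hash-modulo placement, which is a deliberate simplification and not the routing scheme of Chord or P-Grid; the function name `responsible_peer` is hypothetical.

```python
import hashlib

def responsible_peer(key_terms, num_peers):
    """Map a key (a set of terms) to a peer number in [0, num_peers).

    Assumption: simple hash-modulo placement, used here only for illustration.
    A structured overlay would route the lookup in a logarithmic number of hops.
    """
    # Sort the terms so that the same term set always yields the same key string.
    canonical = " ".join(sorted(key_terms))
    digest = hashlib.sha1(canonical.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_peers

# The order of the terms does not change which peer stores the posting list.
peer_a = responsible_peer(["t1", "t3"], 28)
peer_b = responsible_peer(["t3", "t1"], 28)
print(peer_a == peer_b)
```

Canonicalizing the term order matters because a key is a set of terms, so (t1, t3) and (t3, t1) must resolve to the same entry of the global key index.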

Single-term P2P indexing (key = single term)
Local index
– Peer1: t1:{d1, d2}, t2:{d1, d3}
– Peer2: t1:{d4, d5}, t2:{d6}
– Peer3: t1:{d7, d8}, t2:{d7}
Global single-term index
– t1:{d1, d2, d4, d5, d7, d8}
– t2:{d1, d3, d6, d7}
Querying peer
– Q = {t1, t2}
– t1,t2:{d1, d4, d7}
Retrieval traffic is not scalable!
– grows with D (Heaps' law), where D is the collection size in no. of terms
– experimentally linear; frequent terms are used frequently in queries
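The single-term example can be replayed in a few lines. The sketch below merges the three local indexes from the slide into the global single-term index and counts how many postings a two-term query would pull over the network, assuming the full posting list of every query term is shipped to the querying peer. The function names are hypothetical.

```python
from collections import defaultdict

# Local single-term indexes, copied from the slide's example.
local_indexes = {
    "Peer1": {"t1": ["d1", "d2"], "t2": ["d1", "d3"]},
    "Peer2": {"t1": ["d4", "d5"], "t2": ["d6"]},
    "Peer3": {"t1": ["d7", "d8"], "t2": ["d7"]},
}

def build_global_index(local_indexes):
    """Concatenate the per-peer posting lists of each term."""
    global_index = defaultdict(list)
    for peer in sorted(local_indexes):
        for term, docs in local_indexes[peer].items():
            global_index[term].extend(docs)
    return dict(global_index)

def postings_transferred(global_index, query):
    """Assumption: every query term's full posting list is transferred."""
    return sum(len(global_index.get(term, [])) for term in query)

global_index = build_global_index(local_indexes)
traffic = postings_transferred(global_index, ["t1", "t2"])
print(global_index["t1"])  # ['d1', 'd2', 'd4', 'd5', 'd7', 'd8']
print(global_index["t2"])  # ['d1', 'd3', 'd6', 'd7']
print(traffic)             # 10
```

Because posting lists are never truncated here, the count grows with every document that contains a query term, which is the scalability problem the slide points out.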

HDK-based P2P indexing (key = set of terms)
Local index
– Peer1: t1:{d1, d2}, t2:{d1, d3}
– Peer2: t1:{d4, d5}, t2:{d6}, t3:{d5, d6}
– Peer3: t1:{d7, d8}, t2:{d8}, t3:{d7}
Global key index (DF_max = 4)
– t1:{d4, d1, d8, d5} (posting list truncated to top-DF_max postings)
– t2:{d1, d3, d6, d8}
– k13:{d5, d7}, where k13 = (t1, t3)
Querying peer
– Q = {t1, t2, t3}
– k13, k2:{d5, d7, d1, d3, d6, d8}
Retrieval traffic is bounded by DF_max and query size!
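The HDK example can be sketched the same way. The code below takes the global key index from the slide as given and maps the query to keys with a greedy rule (largest matching keys first, skipping keys whose terms are already covered). That mapping rule is an assumption made for illustration; the slide does not spell out the actual procedure. All function names are hypothetical.

```python
DF_MAX = 4

# Global key index from the slide; lists are already truncated to top-DF_MAX.
global_key_index = {
    ("t1",): ["d4", "d1", "d8", "d5"],
    ("t2",): ["d1", "d3", "d6", "d8"],
    ("t1", "t3"): ["d5", "d7"],
}

def map_query_to_keys(index, query_terms):
    """Greedy cover (assumed rule): prefer larger keys, skip redundant ones."""
    query = set(query_terms)
    candidates = [key for key in index if set(key) <= query]
    candidates.sort(key=lambda key: (-len(key), key))
    covered, chosen = set(), []
    for key in candidates:
        if not set(key) <= covered:
            chosen.append(key)
            covered |= set(key)
    return chosen

def retrieve(index, query_terms):
    """Fetch the posting lists of the chosen keys and merge them in order."""
    keys = map_query_to_keys(index, query_terms)
    docs = []
    for key in keys:
        for doc in index[key]:
            if doc not in docs:
                docs.append(doc)
    transferred = sum(len(index[key]) for key in keys)
    return keys, docs, transferred

keys, docs, transferred = retrieve(global_key_index, ["t1", "t2", "t3"])
print(keys)         # [('t1', 't3'), ('t2',)]
print(docs)         # ['d5', 'd7', 'd1', 'd3', 'd6', 'd8']
print(transferred)  # 6, which is at most DF_MAX * len(keys) = 8
```

Under this rule the query is answered from k13 and t2 only, so the traffic depends on DF_MAX and the number of keys, not on the collection size.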

Single-term vs. HDK-based P2P indexing
– comparable retrieval quality (extended vocabulary)
– voc. size could grow exponentially!

Keys and key filtering
Non-Discriminative Keys (NDKs), e.g. t1 is an NDK iff:
– t1 appears in more than DF_max collection documents
Posting lists of NDKs are truncated to the top-DF_max documents
Highly-Discriminative Keys (HDKs), e.g. (t1, t2) is an HDK iff:
– t1 and t2 appear together in fewer than DF_max collection documents (discriminative w.r.t. the document collection)
– t1 and t2 are each non-discriminative on their own (redundancy filter)
– t1 and t2 occur within a window of size w (proximity filter)
– the no. of terms comprising a key is limited by s_max (size filter)
Posting lists of HDKs by definition contain only ≤ DF_max documents
Key filtering enables scalable indexing!
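The four filters can be sketched as a brute-force classifier over a toy collection. This is an illustration under stated assumptions, not the system's indexing algorithm: a key with document frequency exactly DF_max is treated as discriminative, proximity is checked over sliding windows of w consecutive term positions, and the function name `find_keys` and the sample documents are made up.

```python
from itertools import combinations

def find_keys(docs, df_max, w, s_max):
    """Return (ndks, hdks), each mapping a sorted term tuple to its doc-id set."""
    # Proximity and size filters: only term sets of size <= s_max that
    # co-occur within a window of w consecutive positions become candidates.
    postings = {}
    for doc_id, terms in docs.items():
        for start in range(len(terms)):
            window = sorted(set(terms[start:start + w]))
            for size in range(1, s_max + 1):
                for key in combinations(window, size):
                    postings.setdefault(key, set()).add(doc_id)

    ndks, hdks = {}, {}
    for key in sorted(postings, key=len):  # classify smaller keys first
        sub_keys = [sub for sub in combinations(key, len(key) - 1) if sub]
        # Redundancy filter: every sub-key must itself be non-discriminative.
        if any(sub not in ndks for sub in sub_keys):
            continue
        if len(postings[key]) > df_max:
            ndks[key] = postings[key]
        else:
            hdks[key] = postings[key]
    return ndks, hdks

docs = {
    "d1": ["p2p", "search", "index"],
    "d2": ["p2p", "search", "web"],
    "d3": ["p2p", "index", "web"],
}
ndks, hdks = find_keys(docs, df_max=1, w=3, s_max=3)
print(sorted(hdks))
```

In this toy run every single term is an NDK, the pairs that occur in only one document become HDKs, and no three-term key survives because each one contains a pair that is already discriminative.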


Scalability analysis (indexing)
What is the upper bound on the index size for a very large document collection?
– D: collection size in no. of terms
– s: no. of terms comprising a key
– w: window size
– IS_s: index size associated with keys of size s
– P_f,(s-1): probability of NDK occurrences where NDK size is (s-1)
[Table: key size (1, 2, ..., s) against index size (location index); is P_f,(s-1) constant?]

Scalability analysis (indexing): Zipf model
[Figure: Zipf curve z(r) over rank r, with frequency thresholds F_f and F_r separating very frequent terms, frequent terms, and rare terms; F_r ≤ DF_max; NDKs on the frequent side, HDKs on the rare side]

Scalability analysis (indexing)
[Figure: Zipf curve z(r) over rank r with thresholds F_f and F_r, shown as D increases]
C increases for an increasing collection size; a remains constant.
Theorem: Probability P_f,(s-1) of NDK occurrence remains constant!
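The slide's own formula did not survive extraction. As a rough illustration only, assuming the common Zipf form with scale C and exponent a, the share of term occurrences contributed by the R highest-ranked terms does not depend on C. The symbols R and V (vocabulary size) are introduced here for illustration and are held fixed; this is not the slide's proof of the theorem.

```latex
% Assumed Zipf form (the slide's own formula is not preserved):
z(r) = \frac{C}{r^{a}}, \qquad C > 0,\; a > 0

% Share of occurrences from the R highest-ranked terms, vocabulary size V:
P(\text{rank} \le R)
  = \frac{\sum_{r=1}^{R} C\, r^{-a}}{\sum_{r=1}^{V} C\, r^{-a}}
  = \frac{\sum_{r=1}^{R} r^{-a}}{\sum_{r=1}^{V} r^{-a}}
```

The factor C cancels, which is consistent with the observation that C may grow with the collection while a stays constant.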

Scalability analysis (retrieval)
Retrieval traffic is bounded by DF_max and the number of keys a query is mapped to (constant)
Scalability is theoretically guaranteed, but what are the constants? Experiments!
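A quick worked bound shows what "bounded by DF_max and the number of keys" can mean numerically. The sketch assumes the worst case in which every subset of the query terms with at most s_max terms is a key in the index; the actual number of keys a query maps to may be smaller. The function names are hypothetical.

```python
from math import comb

def max_keys(query_len, s_max):
    """Worst case (assumed): every term subset of size <= s_max is a key."""
    return sum(comb(query_len, s) for s in range(1, min(query_len, s_max) + 1))

def max_postings(query_len, s_max, df_max):
    """Upper bound on postings transferred for one query."""
    return df_max * max_keys(query_len, s_max)

# Parameter values taken from the experiment setup: s_max = 3, DF_max = 400 or 500.
print(max_keys(3, 3))           # 7
print(max_postings(3, 3, 400))  # 2800
print(max_postings(3, 3, 500))  # 3500
```

Neither factor depends on the collection size, so the bound stays fixed as the collection grows.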


Experiment
System fully implemented in Java (available on request)
Document collection
– 20,000, ..., documents from Wikipedia
Query log
– Wikipedia query log for 2 months (08/2004 and 09/2004)
– 3,000 randomly chosen queries from 2,000,000 unique queries with more than 20 hits
No. of peers: 4, 8, ..., 28
– PCs running RedHat Linux with 1 GB memory
– 100 Mbit Ethernet
– Each peer indexes documents
DF_max = 400 or 500, s_max = 3, w = 20

Indexing costs
[Figures: average index size per peer; average indexing traffic per peer]
HDK vs. single-term (ST) indexing
– experimentally: HDK / ST = 13.9 (for documents)
– theoretically: HDK / ST = 40.7 (overestimated upper bound!)

Retrieval costs
Retrieval traffic per query (Wikipedia query log)
– remains constant with a growing collection size for the HDK approach (linear for single-term)

Estimated total generated traffic
Assumptions
– monthly indexing
– no. of queries per month: 1.5 × 10^6 (true no. of queries from the Wikipedia log, conservative estimate)
– for 1 billion documents, HDK generates 42 times less overall traffic

Retrieval performance
Overlap on top 20 documents
– comparable performance of the HDK-based approach to the centralized single-term engine with BM25
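The overlap measure can be sketched as the fraction of documents shared by the top-k lists of two engines. The normalization by k is an assumption, since the slide does not define the measure precisely, and the rankings below are made-up examples.

```python
def overlap_at_k(ranking_a, ranking_b, k=20):
    """Fraction of the top-k documents that both rankings have in common."""
    top_a = set(ranking_a[:k])
    top_b = set(ranking_b[:k])
    return len(top_a & top_b) / k

# Hypothetical rankings: a centralized engine vs. an HDK-based one.
centralized = ["d1", "d2", "d3", "d4"]
hdk_based = ["d2", "d1", "d5", "d4"]
print(overlap_at_k(centralized, hdk_based, k=4))  # 0.75
```

The measure ignores the order inside the top k, so two engines that return the same documents in a different order still score 1.0.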

Conclusion
– Novel indexing model based on indexing terms and term sets
– Theoretical scalability model proves the proposed solution scales to large networks in terms of generated traffic, both for indexing and retrieval
– Running P2P prototype that exhibits retrieval performance fully comparable to a centralized term-based retrieval system
– Associated resource requirements (storage, bandwidth consumption) grow in a scalable way, as shown by experiments

Ongoing work
Further reduce the number of indexing keys by using query-driven indexing to produce and store only keys that are profitable for query answering

Acknowledgement
The work presented in this paper was carried out in the framework of the EPFL Center for Global Computing and supported by the Swiss National Funding Agency OFES as part of the European FP6 STREP project ALVIS (002068).