pNear: Combining Content Clustering and Distributed Hash-Tables
Ronny Siebes, Vrije Universiteit Amsterdam, The Netherlands

Presentation outline
- Problem statement
- Related work
- Our approach
- Model
- Example
- Simulation
- Results

Problem statement
Searching content:
- Centralized and semi-centralized solutions (Google, KaZaa, Napster) often work fine, at least for popular content.
- In some cases, however, completely decentralized solutions are needed, especially where privacy, resistance to censorship, and undisclosed content play a role.

Problem statement
Main aim: how to efficiently find content in a completely distributed P2P network?
- For the user we want high satisfaction (precision and recall of documents).
- For the network we want high performance (low number of messages).
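As a reminder of the two user-side measures named above, here is a minimal sketch of how precision and recall could be computed for one query. The document identifiers are made up for illustration and do not come from the presentation.

```python
def precision_recall(retrieved: set[str], relevant: set[str]) -> tuple[float, float]:
    """Return (precision, recall) for one query.

    precision = fraction of retrieved documents that are relevant
    recall    = fraction of relevant documents that were retrieved
    """
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical document ids, for illustration only.
retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d2", "d4", "d5"}
p, r = precision_recall(retrieved, relevant)
print(p, round(r, 3))  # 0.5 0.667
```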

Related work and its limitations…
- Broadcasting / random forwarding (Gnutella)
  - Many messages and/or low recall
- Distributed hash tables (Pastry, Chord, CAN)
  - Expensive maintenance when peers move, leave, or join, or when content changes
  - Single point of failure for keys (traditional), or multiple points (multiplying the cost)
  - No load balancing
- Semantic overlay networks
  - Shared 'semantic' data structures (P-Search, Bibster (built on JXTA), [Crespo et al.])
    - Rich vs. small data structure
    - Expensive updating
  - Term clustering (Voulgaris et al., FreeNet)
    - Difficult to find clusters: the assumptions that a peer's questions are related to its expertise (cluster) and that a peer has a description may both be unrealistic in some cases

pNear
pNear combines:
- Distributed hash tables
- Term clustering
Goal: reduce the disadvantages of each individual approach while keeping its good properties.
Subgoal: get an overview of the properties of the combination, i.e. which parameters are important.

pNear
- Use the DHT to find a (small) set of relevant peers for a query, and use a SON (semantic overlay network) to find the remaining (larger) set of relevant peers.
- Only a small, fixed-size set of peers is stored on the node responsible for a key (instead of all of them, as in a pure DHT).
- Load balancing is handled automatically by the SON.

A pNear peer
- Each peer summarizes its content in a set of terms, which we call an expertise description.
- These descriptions are stored in expertise registers via the DHT, where the keys are a subset of the individual terms from those expertise descriptions.
- In this way, the expertise register responsible for a term maintains a list of the peers that have registered themselves for that term (i.e. it functions as a kind of 'yellow pages').
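The term-to-register mapping can be sketched as follows. This is an illustration under stated assumptions, not the pNear implementation: the SHA-1 hash, the 16-bit identifier space, the Chord-style successor rule, and the names `term_key`, `responsible_peer`, and `ExpertiseRegister` are all hypothetical. Peer ids are borrowed from the example later in the presentation.

```python
import hashlib
from bisect import bisect_left

ID_BITS = 16  # assumed size of the identifier space, kept small for readability


def term_key(term: str) -> int:
    """Hash a term into the identifier space (SHA-1 is an assumption)."""
    digest = hashlib.sha1(term.lower().encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (1 << ID_BITS)


def responsible_peer(key: int, peer_ids: list[int]) -> int:
    """Chord-style successor rule: first peer id >= key, wrapping around."""
    ring = sorted(peer_ids)
    return ring[bisect_left(ring, key) % len(ring)]


class ExpertiseRegister:
    """'Yellow pages' held by one peer: term -> {peer_id: expertise description}."""

    def __init__(self) -> None:
        self.entries: dict[str, dict[int, list[str]]] = {}

    def register(self, term: str, peer_id: int, description: list[str]) -> None:
        self.entries.setdefault(term, {})[peer_id] = description

    def lookup(self, term: str) -> list[int]:
        return sorted(self.entries.get(term, {}))


peers = [1223, 2665, 3100, 3443, 7665]
registers = {pid: ExpertiseRegister() for pid in peers}
description = ["animal", "dog", "bulldog", "mammal", "cat"]

# Peer 1223 registers under a small subset of its terms, not all of them.
for term in ["dog", "cat"]:
    owner = responsible_peer(term_key(term), peers)
    registers[owner].register(term, 1223, description)

cat_owner = responsible_peer(term_key("cat"), peers)
print(registers[cat_owner].lookup("cat"))  # [1223]
```

Because only a few terms per description are registered, each register holds a short list per term, which is the point of combining the DHT with the overlay.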

A pNear peer
- Besides maintaining expertise registers, each peer also has its own expertise cache, in which it stores peers with related expertise, resulting in a Semantic Overlay Network.
- These related peers are found via the clustering process.
- Peers use the expertise registers to get pointers to some peers relevant for their queries.
- Peers use the expertise caches to find the remaining peers within the clusters of relevant peers.
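A bounded expertise cache might look like the sketch below. The presentation does not say how relevance is computed or which entry is dropped when the cache is full, so the Jaccard overlap and the "evict the least related peer" policy are assumptions, as are the class and method names. Peer 4000 is a made-up peer; the other ids and descriptions come from the example slides.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Term-set overlap, used here as an assumed relevance measure."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


class ExpertiseCache:
    """Fixed-size store of peers whose expertise is related to the owner's."""

    def __init__(self, own_description: set[str], capacity: int = 50) -> None:
        self.own = own_description
        self.capacity = capacity
        self.peers: dict[int, set[str]] = {}

    def add(self, peer_id: int, description: set[str]) -> None:
        self.peers[peer_id] = description
        if len(self.peers) > self.capacity:
            # Assumed policy: drop the peer least related to the owner.
            worst = min(self.peers, key=lambda p: jaccard(self.own, self.peers[p]))
            del self.peers[worst]

    def recommend(self, terms: set[str], k: int) -> list[tuple[int, float]]:
        """Return up to k [peer_id, relevance] pairs for the given terms."""
        scored = [(pid, jaccard(terms, desc)) for pid, desc in self.peers.items()]
        scored = [s for s in scored if s[1] > 0]
        return sorted(scored, key=lambda s: -s[1])[:k]


cache = ExpertiseCache({"animal", "dog", "bulldog", "mammal", "cat"}, capacity=2)
cache.add(3100, {"vegetable", "tree", "mammal"})
cache.add(7665, {"animal", "dog", "train", "cat"})
cache.add(4000, {"car", "train"})  # hypothetical unrelated peer, evicted at once

print(sorted(cache.peers))           # [3100, 7665]
print(cache.recommend({"cat"}, 1))   # [(7665, 0.25)]
```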

A pNear peer
[Figure: components of a pNear peer: identifier, document storage, expertise description [t1, t2, …, tn], expertise cache, and expertise registers for terms t_x, t_y, t_z]

How pNear works
1. Let N be the set of neighbors of a peer p in the SON.
2. Select a (small) subset of terms S from the expertise description e_p of p.
3. For each term s in S:
   i. Hash the term s.
   ii. Send a 'register message' (containing e_p and the network id of p) via the DHT overlay, using the key of s as the message id, to the register responsible for the term s. The register responds with a set R of [peer_id, relevance] pairs.

How pNear works
4. Select a subset r from R (peers not visited before in the clustering process) and, for each selected peer r:
   i. Send an advertisement message (containing e_p and the id of p) to r.
   ii. r responds (if online) with a set R' of [peer_id, E, relevance] triples.
   iii. The [peer_id, relevance] pairs are added to R and the [peer_id, E] pairs are added to N.
   Continue step 4 until a maximum number of advertisement rounds is reached.
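The advertisement loop of step 4 can be simulated in memory as sketched below. The register lookup of steps 1 to 3 is taken as given and passed in as `register_hits`. The Jaccard relevance, the class and function names, and the choice to visit the most relevant candidates first are assumptions; the round and fan-out limits (5 and 4) and the recommendation limit (30) are the values listed in the simulation parameters.

```python
from dataclasses import dataclass, field


def relevance(a: set[str], b: set[str]) -> float:
    """Jaccard overlap of two term sets (an assumed relevance measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


@dataclass
class Peer:
    peer_id: int
    description: set[str]                                     # e_p
    cache: dict[int, set[str]] = field(default_factory=dict)  # N, the SON neighbors

    def on_advertisement(self, sender_id: int, sender_desc: set[str], max_recs: int = 30):
        """Receiving side of 4.i-4.ii: store the sender, reply with triples."""
        recs = [(pid, desc, relevance(sender_desc, desc))
                for pid, desc in self.cache.items() if pid != sender_id]
        recs = sorted((r for r in recs if r[2] > 0), key=lambda r: -r[2])[:max_recs]
        self.cache[sender_id] = sender_desc
        own = (self.peer_id, self.description, relevance(sender_desc, self.description))
        return [own] + recs


def cluster(peer: Peer, register_hits: list[tuple[int, float]],
            network: dict[int, Peer], max_rounds: int = 5, per_round: int = 4) -> None:
    candidates = dict(register_hits)  # R: peer_id -> relevance
    visited = {peer.peer_id}
    for _ in range(max_rounds):
        todo = sorted((p for p in candidates if p not in visited),
                      key=lambda p: -candidates[p])[:per_round]
        if not todo:
            break
        for target_id in todo:
            visited.add(target_id)
            target = network.get(target_id)
            if target is None:  # peer is offline, no response
                continue
            for pid, desc, rel in target.on_advertisement(peer.peer_id, peer.description):
                candidates.setdefault(pid, rel)  # 4.iii: extend R
                peer.cache[pid] = desc           # 4.iii: extend N


p1223 = Peer(1223, {"animal", "dog", "bulldog", "mammal", "cat"})
p3100 = Peer(3100, {"vegetable", "tree", "mammal"},
             cache={7665: {"animal", "dog", "train", "cat"}})
p7665 = Peer(7665, {"animal", "dog", "train", "cat"})
network = {p.peer_id: p for p in (p3100, p7665)}

cluster(p1223, register_hits=[(3100, 0.9)], network=network)
print(sorted(p1223.cache))  # [3100, 7665]
```

After the run, peers 3100 and 7665 also hold the description of 1223 in their caches, so the overlay links are created in both directions.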

How pNear works
Cost of clustering, in number of direct messages:
log(nrOfPeers) × #registered_terms + #advertisementMsgs + #advertisementResultMsgs
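To get a feeling for the magnitude, the clustering cost can be evaluated with the simulation parameters listed under Results (100,000 peers, 3 registered terms, 5 advertisement rounds of 4 neighbors each). Two things in this sketch are assumptions: the logarithm is taken to base 2, as is usual for DHT routing estimates, and every advertised peer is assumed to be online and to answer, which makes the figure an upper bound for one initialization.

```python
import math


def clustering_cost(nr_of_peers: int, registered_terms: int,
                    adv_msgs: int, adv_result_msgs: int) -> float:
    """log(nrOfPeers) x #registered_terms + #advertisementMsgs + #advertisementResultMsgs."""
    return math.log2(nr_of_peers) * registered_terms + adv_msgs + adv_result_msgs


rounds, per_round = 5, 4
adv_msgs = rounds * per_round          # at most 20 advertisements
cost = clustering_cost(100_000, 3, adv_msgs, adv_msgs)
print(round(cost, 1))  # 89.8
```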

How pNear works
1. Determine whether the registers need to be queried.
2. If yes, then for each query term q in the query:
   1. Send a query consult message (containing q and the peer id) to the register r responsible for term q.
   2. r responds (if online) with a set R of [peer_id, relevance] pairs.
   3. When possible, add to R a set of neighbors ([peer_id, relevance] pairs) from N that are relevant for the query (based on their expertise descriptions).
   If no, then add to R a set of neighbors ([peer_id, relevance] pairs) from N that are relevant for the query (based on their expertise descriptions).

How pNear works
3. Select a subset r from R (peers not visited before in the query process) and, for each selected peer r:
   i. Send a query message (containing the query and the peer's id) to r.
   ii. r responds (if online) with a set R' of [peer_id, relevance] pairs and a set of documents (answers to the query).
   iii. The [peer_id, relevance] pairs from R' are added to R, and the query results are presented to the user.
   Continue step 3 until a maximum number of query rounds has passed or until the user stops the process.
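The query rounds of step 3 can be sketched in the same in-memory style. The register consult of step 2 is taken as given and passed in as `register_hits`. The relevance measure (fraction of query terms found in a description), the document titles, the three-round limit, and all class and function names are hypothetical; the fan-out of 7 peers per round is the value from the simulation parameters, and the peer ids and descriptions follow the example slides.

```python
from dataclasses import dataclass, field


@dataclass
class Peer:
    peer_id: int
    description: set[str]
    documents: dict[str, set[str]] = field(default_factory=dict)  # title -> terms
    cache: dict[int, set[str]] = field(default_factory=dict)      # SON neighbors

    def on_query(self, query: set[str]):
        """Return ([peer_id, relevance] pointers from the cache, matching documents)."""
        docs = [title for title, terms in self.documents.items() if query & terms]
        pointers = [(pid, len(query & desc) / len(query))
                    for pid, desc in self.cache.items() if query & desc]
        return pointers, docs


def run_query(query: set[str], register_hits: list[tuple[int, float]],
              network: dict[int, Peer], max_rounds: int = 3, per_round: int = 7) -> list[str]:
    candidates = dict(register_hits)  # R: peer_id -> relevance
    visited: set[int] = set()
    results: list[str] = []
    for _ in range(max_rounds):
        todo = sorted((p for p in candidates if p not in visited),
                      key=lambda p: -candidates[p])[:per_round]
        if not todo:
            break
        for pid in todo:
            visited.add(pid)
            peer = network.get(pid)
            if peer is None:  # offline peers simply do not answer
                continue
            pointers, docs = peer.on_query(query)
            results.extend(docs)
            for other, rel in pointers:
                candidates.setdefault(other, rel)  # grow R with new pointers
    return results


p1223 = Peer(1223, {"animal", "dog", "bulldog", "mammal", "cat"},
             documents={"bulldog-care.txt": {"dog", "bulldog"},
                        "cat-breeds.txt": {"cat", "animal"}},
             cache={3100: {"vegetable", "cat", "mammal"}})
p3100 = Peer(3100, {"vegetable", "cat", "mammal"},
             documents={"catnip.txt": {"cat", "vegetable"}})
network = {p.peer_id: p for p in (p1223, p3100)}

# Peer 9999 asks for [cat]; the register pointed it to peer 1223.
results = run_query({"cat"}, register_hits=[(1223, 0.9)], network=network)
print(results)  # ['cat-breeds.txt', 'catnip.txt']
```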

How pNear works
Cost of querying, in number of messages:
- Using registers: log(nrOfPeers) × #queryterms + #queryMsgs + #queryResultMsgs
- Without using registers: #queryMsgs + #queryResultMsgs
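The difference between the two query modes is just the DHT lookup term, which a small calculation makes visible. The network size (100,000 peers) and the fan-out of 7 peers per query round come from the simulation parameters; the base-2 logarithm, the single query term, the three query rounds, and the assumption that every queried peer answers are illustrative choices.

```python
import math


def query_cost(nr_of_peers: int, query_terms: int, query_msgs: int,
               query_result_msgs: int, use_registers: bool) -> float:
    """Message cost of one query, with or without consulting the registers."""
    lookup = math.log2(nr_of_peers) * query_terms if use_registers else 0.0
    return lookup + query_msgs + query_result_msgs


rounds, per_round = 3, 7               # 3 rounds is a hypothetical choice
msgs = rounds * per_round              # at most 21 query messages
with_registers = query_cost(100_000, 1, msgs, msgs, use_registers=True)
without_registers = query_cost(100_000, 1, msgs, msgs, use_registers=False)
print(round(with_registers, 1), without_registers)  # 58.6 42.0
```

A peer that is already well clustered can skip the register consult and pay only the second figure.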

Example

 Extracting expertise description…

Example Suppose there is a peer with identifier 1223, which is unique in the system.

Example Peer 1223 has a set of documents (research articles) that it wants to share in the network.

Example An extraction tool (TextToOnto) is applied to the documents of peer 1223, resulting in the expertise description of the peer: [animal, dog, bulldog, mammal, cat].

Example  Registering expertise descriptions…

Example Peer 1223 joins the network with the expertise description [animal, dog, bulldog, mammal, cat].

Example Each peer has one or more expertise registers, in which peers can register their expertise descriptions.

Example Peer 1223 registers its expertise description (via DHT) in the topic register of the peers 3443 and [animal,dog,bulldog,mammal,cat] [(1223) animal,dog,bulldog,mammal,cat] Reg

Example Imagine that 2665 has a previously registered description from It calculates the relevance of 3100 for 1223 and in this case returns the [pointer, relevance] pair [animal,dog,bulldog,mammal,cat] [(3100) vegetable:9,tree:4,mammal:1] [(1223) animal,dog,bulldog,mammal,cat] Reg

Example 1223 sends an advertisement query to [animal,dog,bulldog,mammal,cat] [3100,0.9]

Example 3100 answers with its expertise description and also returns 7665 from its cache (as a [peer_id, relevance] pair) as a relevant peer for [animal,dog,bulldog,mammal,cat] [3100,0.9] [7665,0.8] [(7665) animal,dog, train,cat] [(1223) animal, dog, bulldog, mammal, cat] Cache

Example 1223 adds 3100 to its expertise cache [animal,dog,bulldog,mammal,cat] [3100,0.9] [7665,0.8] [(3100) vegetable:9,tree:4,mammal:1] Cache

Example 1223 sends 7665 an advertisement message, where 7665 stores the expertise description of [animal,dog,bulldog,mammal,cat] [3100,0.9] [7665,0.8] [(1223) animal, dog, bulldog, mammal, cat] Cache

Example  Solving a query…

Example Imagine a new peer 9999 that does not want to be clustered in the network

Example Peer 9999 has a query [cat]. Imagine that 2665 is the register responsible for that term; then 9999 sends a register consult message to 2665 (via DHT).

Example Peer 2665 answers that 1223 is a good candidate, and 9999 adds 1223 to the list of potential candidates [(3100) vegetable,tree,mammal] [(1223) animal,dog,bulldog,mammal,cat] [1223,0.9]

Example Peer 9999 queries peer [1223,0.9] [(3100) vegetable,cat,mammal]

Example Peer 1223 returns matching documents and 3100 as a relevant peer [1333,0.9] [3100,0.3] [(3100) vegetable,cat,mammal]

Example Peer 9999 queries [3100,0.9] [3100,0.3]

Example In this example 3100 also has some documents that match the query but no further pointers [3100,0.9] [3100,0.3]

Simulation
- The query set, crawled from Excite's "SearchSpy", consists of +/ real user queries
- For each query, we included the documents (web pages) from the at most 100 hits returned by Google.

Simulation
- For each web page we extracted around 100 terms with the TextToOnto NLP tool. These terms serve as the expertise description of a peer (web pages are represented as peers in our simulation), resulting in a dataset of more than 1M peers.
- Simulation platform runs up to peers

Results
Simulation parameters:
- Total number of nodes in the system: 100,000
- Maximum number of expertise descriptions in a peer's register: 5
- Maximum number of expertise descriptions in a peer's cache: 50
- Maximum number of recommendations given by a register: 5
- Maximum number of recommendations given by a cache: 30
- Maximum number of advertisement rounds per advertisement initialization: 5
- Maximum number of neighbors selected per advertisement round: 4
- Maximum number of neighbors selected per query round: 7
- Average number of terms to register selected from expertise description: 3

Results Only 3 out of 100 terms need to be registered to reach 35% recall after sending fewer than 200 query messages in the 100K network.

Results Cache size is more important than the number of recommendations. Still, only 50 slots are needed to reach a recall of 35% with less than 200K query messages.

Results Selecting more neighbors to query helps. Useful when user wants many results very fast

Results More advertising helps.

Results When on average 3 topics per expertise description (ED) are registered, the register does not need many more slots.

Results Even in a network of 400K peers, <200 query messages are needed to get a recall of 25%

Questions