1 pNear: Combining Content Clustering and Distributed Hash-Tables Ronny Siebes, Vrije Universiteit Amsterdam, The Netherlands ronny@cs.vu.nl

2 Presentation outline: Problem statement, Related work, Our approach, Model, Example, Simulation, Results

3 Problem statement Searching content: centralized and semi-centralized solutions (Google, KaZaa, Napster) often work fine, at least for popular content. However, in some cases completely decentralized solutions are needed, especially where privacy, censorship resistance, and undisclosed content play a role.

4 Problem statement Main aim: how to efficiently find content in a completely distributed P2P network? For the user we want high satisfaction (precision and recall of documents); for the network we want high performance (a low number of messages).

5 Related work and their limitations
- Broadcasting / random forwarding (Gnutella): many messages and/or low recall.
- Distributed hash-tables (Pastry, Chord, CAN): expensive maintenance when peers move/leave/join or content changes; a single point of failure per key (traditional), or multiple points at multiplied cost; no load balancing.
- Semantic overlay networks with shared 'semantic' data structures (P-Search, Bibster (built on JXTA), [Crespo et al.]): trade-off between rich and small data structures; expensive updating.
- Term clustering (Voulgaris et al., FreeNet): clusters are difficult to find; the assumptions that a peer's questions are related to its expertise (cluster) and that a peer has a description at all may both be unrealistic in some cases.

6 pNear pNear combines distributed hash-tables with term clustering. Goal: reduce the disadvantages of both individual approaches while keeping their good properties. Subgoal: get an overview of the properties of the combination — which parameters are important?

7 pNear Use the DHT to find a (small) set of relevant peers for a query, and use a SON (semantic overlay network) to find the remaining (larger) set of relevant peers. Only a small fixed set of peers is stored on the node responsible for a key (instead of all peers, as in a pure DHT). Load balancing is handled automatically by the SON.

8 A pNear peer Each peer describes/summarizes its content in a set of terms, which we call an expertise description. These descriptions are stored in expertise registers via the DHT, where the keys are a subset of the individual terms from those expertise descriptions. In this way, the expertise register responsible for a term maintains the list of peers that have registered themselves for that term (i.e., it functions as a kind of 'yellow pages').
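As an illustrative sketch (not the actual pNear implementation — the DHT routing of a Chord/Pastry-style overlay is replaced here by a plain dictionary, and the names `term_key`, `registers`, and `register` are hypothetical), storing expertise descriptions under term keys might look like:

```python
import hashlib

def term_key(term: str) -> int:
    """Hash a term into the DHT key space (32-bit here, for illustration)."""
    return int(hashlib.sha1(term.encode("utf-8")).hexdigest(), 16) % 2**32

# Stand-in for the DHT: key -> {peer_id: expertise description}.
# In a real deployment each key lives on the node responsible for it.
registers = {}

def register(peer_id, expertise, terms):
    """Store the peer's expertise description in the register of each selected term."""
    for term in terms:
        registers.setdefault(term_key(term), {})[peer_id] = expertise

# Peer 1223 registers itself under a subset of its expertise terms:
register("1223", ["animal", "dog", "bulldog", "mammal", "cat"], ["dog", "cat"])
```

A later lookup for 'cat' hashes the term the same way and reads the peer list stored under that key, which is what makes the register act as distributed 'yellow pages'.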

9 A pNear peer Besides maintaining expertise registers, each peer also has its own expertise cache, in which it stores peers with related expertise, resulting in a Semantic Overlay Network. These related peers are found via the clustering process. Peers use the expertise registers to get pointers to some peers relevant to a query, and use the expertise caches to find the remaining peers within the clusters of relevant peers.

10 A pNear peer [diagram] A peer consists of an identifier, its document storage, an expertise description [t1, t2, …, tn], an expertise cache, and the expertise registers for the terms (tx, ty, tz) it is responsible for.

11 How pNear works
1. Let N be the set of neighbors of a peer p in the SON.
2. Select a (small) subset of terms S from the terms in the expertise description e_p of p.
3. For each term s in S:
   i. Hash the term s.
   ii. Send a 'register message' (containing e_p and the network id of p) via the DHT overlay (using the key of s as message id) to the register responsible for the term s. The register responds with a set of [peer_id, relevance] pairs, R.

12 How pNear works
4. Select a subset r from R (peers not visited before in the clustering process) and:
   i. Send an advertisement message (containing e_p and the peer's id) to r.
   ii. r responds (if online) with a set of [peer_id, E, relevance] triples R'.
   iii. The [peer_id, relevance] pairs are added to R and the [peer_id, E] pairs are added to N.
Continue step 4 until a maximum number of advertisement rounds has passed.
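The advertisement rounds above can be sketched as a single-process simulation. All names here (`Peer`, `advertise`, `cluster`, `lookup_peer`) are hypothetical stand-ins; the real protocol exchanges network messages between nodes:

```python
class Peer:
    """Minimal stand-in for a remote peer reachable over the network."""
    def __init__(self, expertise, cache=()):
        self.expertise = expertise   # this peer's expertise description
        self.cache = list(cache)     # known [peer_id, expertise, relevance] triples

    def advertise(self, sender_id, sender_expertise):
        """Answer an advertisement with related peers from the cache."""
        return list(self.cache)

def cluster(my_id, my_expertise, seed, lookup_peer, max_rounds=5, fanout=4):
    """Run advertisement rounds: visit the best unvisited candidates in R
    and merge their recommendations into R and the neighbor set N."""
    R = dict(seed)                   # peer_id -> relevance (seeded by the registers)
    N = {}                           # peer_id -> expertise description
    visited = {my_id}
    for _ in range(max_rounds):
        batch = [p for p in sorted(R, key=R.get, reverse=True)
                 if p not in visited][:fanout]
        if not batch:
            break
        for p in batch:
            visited.add(p)
            peer = lookup_peer(p)    # None models an offline peer
            if peer is None:
                continue
            for rec_id, rec_exp, rec_rel in peer.advertise(my_id, my_expertise):
                R.setdefault(rec_id, rec_rel)
            N[p] = peer.expertise
    return R, N

# Seeded with 3100 (from a register), 1223 also discovers 7665 via 3100's cache:
network = {
    "3100": Peer(["vegetable", "tree", "mammal"],
                 cache=[("7665", ["animal", "dog", "train", "cat"], 0.8)]),
    "7665": Peer(["animal", "dog", "train", "cat"]),
}
R, N = cluster("1223", ["animal", "dog", "bulldog", "mammal", "cat"],
               {"3100": 0.9}, network.get)
```

The loop mirrors the slides: each round contacts at most `fanout` candidates (4 per advertisement round in the simulation settings), and newly recommended peers become candidates for the next round.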

13 How pNear works Cost of clustering, in number of direct messages: log(nrOfPeers) × #registeredTerms + #advertisementMsgs + #advertisementResultMsgs

14 How pNear works
1. Determine whether the registers need to be consulted.
2. If yes, then for each term q in the query:
   i. Send a query consult message (containing q and the peer_id) to the register r responsible for term q.
   ii. r responds (if online) with a set of [peer_id, relevance] pairs R.
   iii. When possible, add to R a set of neighbors ([peer_id, relevance] pairs) from N that are relevant for the query (based on their expertise descriptions).
   If no, then add to R a set of neighbors ([peer_id, relevance] pairs) from N that are relevant for the query (based on their expertise descriptions).

15 How pNear works
3. Select a subset r from R (peers not visited before in the query process) and:
   i. Send a query message (containing the query and the peer's id) to r.
   ii. r responds (if online) with a set of [peer_id, relevance] pairs R' and a set of documents (answers to the query).
   iii. The [peer_id, relevance] pairs are added to R and the query results are presented to the user.
Continue step 3 until a maximum number of query rounds have passed or until the user stops the process.
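A matching sketch of the query rounds (again a single-process simulation with hypothetical names; document matching is reduced here to simple term overlap, which is an assumption, not the system's actual relevance model):

```python
class QueriedPeer:
    """Minimal stand-in for a remote peer that can answer queries."""
    def __init__(self, docs, recs=()):
        self.docs = docs             # document name -> set of terms
        self.recs = list(recs)       # [peer_id, relevance] recommendations

    def answer(self, terms):
        """Return documents matching any query term plus cached recommendations."""
        hits = [d for d, doc_terms in self.docs.items() if doc_terms & set(terms)]
        return hits, list(self.recs)

def query(terms, seed, lookup_peer, max_rounds=3, fanout=7):
    """Run query rounds: send the query to the best unvisited candidates,
    collect documents, and follow peer recommendations."""
    R = dict(seed)                   # peer_id -> relevance (registers and/or N)
    visited, results = set(), []
    for _ in range(max_rounds):
        batch = [p for p in sorted(R, key=R.get, reverse=True)
                 if p not in visited][:fanout]
        if not batch:
            break
        for p in batch:
            visited.add(p)
            peer = lookup_peer(p)    # None models an offline peer
            if peer is None:
                continue
            docs, recs = peer.answer(terms)
            results.extend(docs)
            for rec_id, rec_rel in recs:
                R.setdefault(rec_id, rec_rel)
    return results

# The [cat] query from the example: 1223 answers and recommends 3100.
network = {
    "1223": QueriedPeer({"bulldogs.pdf": {"dog", "bulldog"},
                         "cats.pdf": {"cat", "animal"}},
                        recs=[("3100", 0.3)]),
    "3100": QueriedPeer({"mammals.pdf": {"mammal", "cat"}}),
}
found = query(["cat"], {"1223": 0.9}, network.get)
```

Note the symmetry with the clustering loop: the same visit-and-merge structure is reused, but here peers return documents alongside pointers.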

16 How pNear works Cost of querying, in number of messages:
- Using registers: log(nrOfPeers) × #queryTerms + #queryMsgs + #queryResultMsgs
- Without using registers: #queryMsgs + #queryResultMsgs
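Plugging in numbers makes the trade-off concrete. Assuming a base-2 logarithm for the DHT lookup hops (the slides do not specify the base, so this is an assumption typical for DHTs):

```python
import math

def query_cost(nr_of_peers, nr_query_terms, query_msgs, query_result_msgs,
               use_registers=True):
    """Message cost of one query per the formula above; each consulted
    register costs a log(n)-hop DHT lookup (base 2 assumed here)."""
    dht_msgs = math.log2(nr_of_peers) * nr_query_terms if use_registers else 0
    return dht_msgs + query_msgs + query_result_msgs

# A 2-term query in a 100,000-peer network, 200 query and 200 result messages:
with_registers = query_cost(100_000, 2, 200, 200)            # ~433 messages
without_registers = query_cost(100_000, 2, 200, 200, False)  # 400 messages
```

The register consultation adds only about 33 messages here: the DHT lookup cost grows logarithmically with network size, so the query/result traffic dominates.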

17 Example

18 Example: Extracting expertise descriptions…

19 Example Suppose a peer with identifier 1223, which is unique in the system.

20 Example Peer 1223 has a set of documents (research articles) that it wants to share in the network.

21 Example An extraction tool (TextToOnto) is applied to the documents, resulting in the expertise description of peer 1223: [animal, dog, bulldog, mammal, cat].

22 Example: Registering expertise descriptions…

23 Example Peer 1223, with expertise description [animal, dog, bulldog, mammal, cat], joins a network containing peers 2665, 3443, 3100, and 7665.

24 Example Each peer hosts one or more expertise registers, where peers can register their expertise descriptions.

25 Example Peer 1223 registers its expertise description [animal, dog, bulldog, mammal, cat] (via the DHT) in the registers of peers 3443 and 2665.

26 Example Imagine that 2665 holds a previously registered description from 3100: [(3100) vegetable:9, tree:4, mammal:1]. It calculates the relevance of 3100 for 1223 and, in this case, returns the [pointer, relevance] pair [3100, 0.9].

27 Example 1223 sends an advertisement message to 3100.

28 Example 3100 answers with its expertise description, and from its cache it also returns 7665 (a [peer_id, relevance] pair, [7665, 0.8]) as a relevant peer for 1223.

29 Example 1223 adds 3100 to its expertise cache.

30 Example 1223 sends 7665 an advertisement message, and 7665 stores the expertise description of 1223 in its cache.

31 Example: Solving a query…

32 Example Imagine a new peer, 9999, that does not want to be clustered in the network.

33 Example Peer 9999 has the query [cat]. Imagine that 2665 is the register responsible for that term; then 9999 sends a register consult message to 2665 (via the DHT).

34 Example Peer 2665 answers that 1223 is a good candidate ([1223, 0.9]), and 9999 adds 1223 to its list of potential candidates.

35 Example Peer 9999 queries peer 1223.

36 Example Peer 1223 returns matching documents and 3100 as a relevant peer ([3100, 0.3]).

37 Example Peer 9999 queries 3100.

38 Example In this example, 3100 also has some documents that match the query, but no further pointers.

39 Simulation The query set, crawled from Excite's "SearchSpy", consists of approximately 30,000 real user queries. For each query, we included the documents (web pages) from the (at most) 100 hits returned by Google.

40 Simulation For each web page we extracted around 100 terms with the TextToOnto NLP tool; these serve as the expertise description of each peer (web pages are represented as peers in our simulation), resulting in a dataset of more than 1M peers. The simulation platform runs up to 400,000 peers.

41 Results Simulation parameters:
- Total number of nodes in the system: 100,000
- Maximum number of expertise descriptions in a peer's register: 5
- Maximum number of expertise descriptions in a peer's cache: 50
- Maximum number of recommendations given by a register: 5
- Maximum number of recommendations given by a cache: 30
- Maximum number of advertisement rounds per advertisement initialization: 5
- Maximum number of neighbors selected per advertisement round: 4
- Maximum number of neighbors selected per query round: 7
- Average number of terms selected from the expertise description to register: 3

42 Results Only 3 of the ~100 terms need to be registered to reach 35% recall after sending fewer than 200 query messages in the 100K network.

43 Results Cache size is more important than the number of recommendations. Still, only 50 cache slots are needed to reach a recall of 35% with fewer than 200K query messages.

44 Results Selecting more neighbors to query helps; this is useful when the user wants many results quickly.

45 Results More advertising helps.

46 Results When, on average, 3 topics per expertise description are registered, the register does not need many more slots.

47 Results Even in a network of 400K peers, fewer than 200 query messages are needed to reach a recall of 25%.

48 Questions?

