Exploiting Content Localities for Efficient Search in P2P Systems Lei Guo 1 Song Jiang 2 Li Xiao 3 and Xiaodong Zhang 1 1 College of William and Mary,


1 Exploiting Content Localities for Efficient Search in P2P Systems
Lei Guo (1), Song Jiang (2), Li Xiao (3), and Xiaodong Zhang (1)
(1) College of William and Mary, USA
(2) Los Alamos National Laboratory, USA
(3) Michigan State University, USA

2 Peer-to-Peer Search
Two Performance Objectives
–Individual peer: improve the search quality
–Internet management: minimize the search cost
P2P user: “Fast, fast, fast, and the more the better!”
Network manager: “Don’t be so greedy, the Internet is shared by all the people!”

3 Existing Solutions
Existing solutions generally aim at one of the two objectives and perform poorly on the other
Flooding:
–Most effective for the user’s experience
–Least efficient in network resource utilization
Random walk:
–Traffic efficient, but
–Long response time and a limited number of search results
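
The trade-off between the two schemes can be illustrated with a small sketch. Everything here is an assumption made for illustration: the toy overlay (a 20-node ring with extra chords), the peers holding the file, the hop limit, and the walk length are invented, and the flooding model is simplified (a peer forwards to every neighbor, including the one it heard from). It only shows the qualitative point of the slide: flooding finds more holders but sends many more messages than a single random walker.

```python
import random
from collections import deque

def build_toy_overlay(n=20):
    """Hypothetical overlay: a ring where node i also links to i+5 and i-5."""
    return {i: [(i + 1) % n, (i - 1) % n, (i + 5) % n, (i - 5) % n] for i in range(n)}

def flood(graph, source, holders, ttl):
    """Flood a query up to `ttl` hops; return (messages sent, peers that hit)."""
    seen = {source}
    frontier = deque([(source, 0)])
    messages, hits = 0, set()
    while frontier:
        node, depth = frontier.popleft()
        if depth == ttl:
            continue                      # hop limit reached, do not forward
        for neighbor in graph[node]:
            messages += 1                 # every forwarded copy costs a message
            if neighbor not in seen:      # duplicates are dropped on arrival
                seen.add(neighbor)
                if neighbor in holders:
                    hits.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return messages, hits

def random_walk(graph, source, holders, steps, rng):
    """Forward the query to one random neighbor per step; return (messages, hits)."""
    node, hits = source, set()
    for _ in range(steps):
        node = rng.choice(graph[node])
        if node in holders:
            hits.add(node)
    return steps, hits

overlay = build_toy_overlay()
holders = {8, 13}                         # made-up peers that hold the file
flood_msgs, flood_hits = flood(overlay, 0, holders, ttl=4)
walk_msgs, walk_hits = random_walk(overlay, 0, holders, steps=10, rng=random.Random(1))
print(flood_msgs, sorted(flood_hits), walk_msgs, sorted(walk_hits))
```

On this toy overlay the flood reaches every holder within four hops, while the walker spends one message per step and may or may not pass a holder.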

4 Super-Node Architecture
Super-node
–Index server for its leaf nodes
Problems
–Index-based search has limits: it is hard for full-text search and impossible for encrypted content search
–A super-node is not responsible for the content quality of its leaf nodes
–The structure becomes large and inefficient: a leaf node has to connect to multiple super-nodes to avoid a single point of failure, which generates an increasingly large number of super-nodes

5 Gnutella Population in One Day (2003)
[Figure: number of peers and number of super peers over one day]
On average, one super node connects to only 3-4 peers!

6 Outline
Our Measurement Study
CAC: Constructing Content Abundant Cluster
SPIRP: Selectively Prefetching Indices from Responding Peers
CAC-SPIRP: Combining CAC and SPIRP
Performance Evaluation
Conclusion

7 Our Measurement Study
Existing measurement studies
–A small percentage of popular files account for most of the shared storage and transmissions in P2P systems
–A small number of peers contribute the majority of the files in P2P systems
–These findings are only indirect evidence of content locality: some files may never be accessed, or only rarely
Our purpose
–Fully understand the localities in the peer community and in individual peers
–Get first-hand traces for our simulation study

8 Trace Collection
Four-day crawl of the Gnutella network
–Based on the open source code of LimeWire Gnutella
–Session-based collection (for the whole lifetime of peers)
Query sending traces of different peers
–25,764 peers
–409,129 queries
Content indices of different peers
–Full indices of 18,255 peers
–37% free riders

9 Content Locality in the Peer Community
A small group of peers can reply to nearly all queries and provide most of the results
[Figure: Queries Replied by Top Query Responders (%) and Results Replied by Top Result Providers (%) versus Top Content Providers (in percentage); Percentage of Peers (%) versus Number of Queries and Number of Results (log scale)]
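
A curve of the kind plotted on this slide could be computed from a reply trace roughly as follows. This is a minimal sketch, not the authors' analysis code: the trace layout (a mapping from query id to the set of peers that replied) and the toy data are assumptions made for illustration.

```python
from collections import Counter

def coverage_by_top_responders(replies, top_fraction):
    """Fraction of answered queries replied to by at least one top responder.

    replies: dict mapping a query id to the set of peer ids that replied
             (hypothetical trace format).
    top_fraction: share of responding peers to keep, e.g. 0.05 for the top 5%.
    """
    # Rank peers by the number of distinct queries they replied to.
    reply_counts = Counter(peer for peers in replies.values() for peer in peers)
    ranked = [peer for peer, _ in reply_counts.most_common()]
    keep = max(1, round(len(ranked) * top_fraction))
    top = set(ranked[:keep])

    answered = [peers for peers in replies.values() if peers]
    if not answered:
        return 0.0
    covered = sum(1 for peers in answered if peers & top)
    return covered / len(answered)

# Made-up trace: peer "a" replies to four of five queries.
toy_replies = {
    "q1": {"a"},
    "q2": {"a", "b"},
    "q3": {"a", "c"},
    "q4": {"d"},
    "q5": {"a", "e"},
}
print(coverage_by_top_responders(toy_replies, 0.2))
```

Sweeping `top_fraction` from 0 to 1 would trace out one coverage curve; the same ranking by number of results instead of number of replies would give the result-provider curve.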

10 The Localities of Search Interests of Individual Peers
A peer can get search results from a small number of its top query responders: they share the same search interests
Similar to the idea in the Locality of Interest scheme, but our conclusion is based on real P2P systems
[Figure: Query Contributions (%) of Top Query Responders and Result Contributions (%) of Top Result Providers, for top 1, top 10, top 5%, top 10%, and top 20%]

11 Reorganizing the P2P Management Structure
Clustering the small number of content-abundant peers
Prefetching indices from the top query responders

12 CAC: Constructing Content Abundant Cluster
Objectives
–Clustering the small number of content-abundant peers in the P2P overlay
–Providing high-quality and fast service
Content Abundant Cluster
–An overlay on top of the P2P network
–Peers self-evaluate, self-identify, and self-organize
–Persistent public service for all peers in the system
–Strongly content-based (not index-based)
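
One way a peer might self-evaluate is by tracking its own success response rate, the quantity that appears on the x-axis of the later CAC member selection slide. The sketch below is only an assumption about how that could look: the class, the function name, the threshold of 0.02, and the minimum sample size are hypothetical and are not taken from the slides.

```python
from dataclasses import dataclass

@dataclass
class PeerStats:
    """Counters a peer keeps about the queries it has seen (hypothetical)."""
    queries_seen: int = 0
    queries_answered: int = 0

    def record(self, answered: bool) -> None:
        self.queries_seen += 1
        if answered:
            self.queries_answered += 1

    def success_response_rate(self) -> float:
        if self.queries_seen == 0:
            return 0.0
        return self.queries_answered / self.queries_seen

def should_join_cac(stats: PeerStats, threshold: float = 0.02, min_samples: int = 100) -> bool:
    """Self-identify as content abundant once enough queries were observed.

    threshold and min_samples are illustrative values, not from the source.
    """
    return stats.queries_seen >= min_samples and stats.success_response_rate() >= threshold

stats = PeerStats()
for i in range(200):
    stats.record(answered=(i % 25 == 0))   # answers 8 of 200 queries
print(stats.success_response_rate(), should_join_cac(stats))
```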

13 CAC: System Structure
[Figure: Clustering, Leveling, and Dynamic Update; the CAC and the surrounding peers, labeled with levels 0 through 4]

14 CAC: Search Operations
Queries are sent to the CAC first
–Up-flowing operation
–Flooding in the CAC
Unsatisfied queries are propagated from the CAC to the whole system
–Down-flooding operation
–Propagated from low levels to high levels

15 Up-flowing
[Figure: a query flows up from a peer toward the CAC across peers labeled with levels 0 through 4]

16 Down-flooding
[Figure: a query is flooded down from the CAC across peers labeled with levels 0 through 4; some links are marked as unused links]
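
The two search operations can be sketched as a two-phase procedure. This is a simplified illustration under stated assumptions, not the protocol itself: each non-CAC peer is assumed to have exactly one uplink to a peer one level lower, CAC flooding is counted as one message per other CAC member, and the names `level`, `uplink`, and `has_match` are hypothetical.

```python
from collections import defaultdict

def cac_search(level, uplink, has_match, source, needed):
    """Return (responding peers, messages sent, phase that ended the search).

    level:  dict peer -> level (0 means CAC member)
    uplink: dict peer -> its single parent one level lower (assumption)
    """
    # Up-flowing: follow uplinks until the query reaches the CAC.
    node, messages = source, 0
    while level[node] > 0:
        node = uplink[node]
        messages += 1

    # Flooding in the CAC (simplified: one message per other CAC member).
    cac = [peer for peer, lvl in level.items() if lvl == 0]
    messages += len(cac) - 1
    results = [peer for peer in cac if has_match(peer)]
    if len(results) >= needed:
        return results, messages, "cac"

    # Down-flooding: propagate level by level along reversed uplinks.
    children = defaultdict(list)
    for peer, parent in uplink.items():
        children[parent].append(peer)
    frontier = cac
    while frontier and len(results) < needed:
        next_frontier = []
        for peer in frontier:
            for child in children[peer]:
                messages += 1
                if child != source and has_match(child):
                    results.append(child)
                next_frontier.append(child)
        frontier = next_frontier
    return results, messages, "down-flooding"

# Made-up topology: two CAC members, two level-1 peers, one level-2 peer.
level = {"c1": 0, "c2": 0, "a": 1, "b": 1, "d": 2}
uplink = {"a": "c1", "b": "c2", "d": "a"}
files = {"c2": {"x"}, "b": {"y"}}

def holds(name):
    return lambda peer: name in files.get(peer, set())

print(cac_search(level, uplink, holds("x"), "d", 1))
print(cac_search(level, uplink, holds("y"), "d", 1))
```

The first query is satisfied inside the CAC; the second is not, so it is pushed down from level 0 to level 1, where peer `b` answers.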

17 SPIRP: Selectively Prefetching Indices from Responding Peers
Basic operations
–Peer I initiates a query q. On a hit, it displays the results; on a miss, it sends q
–Peer R responds to query q: it sends the query results and piggybacks the indices of all its shared files
–Peer I receives the response: it displays the search results and stores the piggybacked indices
Index updating
–Active updating of indices by responding peers
–Updating of indices on demand by requesting peers
Replacement of file indices

18 SPIRP Technique
[Figure: peer I issues Query = “Beethoven mp3” and asks “Where are these files?”; responding peers R1 and R2 share pop music and classic music]

19 SPIRP Technique
[Figure: peer I issues Query = “Beetle mp3” and asks “Where are these files?”; index entries at I shown as pop, classic, and NULL for R1 and R2]

20 SPIRP Technique
[Figure: peer I issues Query = “Beetle mp3”; classic and pop indices of R1 and R2 are shown]

21 SPIRP Technique
[Figure: peer I issues Query = “Beetle mp3”; classic and pop indices of R1 and R2 are shown]
Not enough space to save indices

22 SPIRP Technique
[Figure: peer I issues Query = “Beetle mp3”; classic and pop indices of R1 and R2 are shown]
Replacement complete
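
The SPIRP operations walked through on the preceding slides can be sketched as a small class. This is a minimal illustration under several assumptions: substring matching stands in for real query matching, the buffer capacity is counted in index entries rather than bytes, least-recently-used eviction is used because the slides do not state the replacement policy, and index updating is left out. The class name `SpirpPeer` and the file names are hypothetical.

```python
from collections import OrderedDict

class SpirpPeer:
    def __init__(self, name, shared_files, capacity):
        self.name = name
        self.shared_files = set(shared_files)
        self.capacity = capacity       # max cached index entries (assumed unit)
        self.cache = OrderedDict()     # responder name -> index, oldest first

    def respond(self, query):
        """Responder side: return matches plus the full index, piggybacked."""
        matches = {f for f in self.shared_files if query in f}
        if not matches:
            return None
        return matches, set(self.shared_files)

    def lookup_cache(self, query):
        """Requester side: search the prefetched indices before sending q."""
        hits = {}
        for responder, index in self.cache.items():
            matches = {f for f in index if query in f}
            if matches:
                hits[responder] = matches
        for responder in hits:
            self.cache.move_to_end(responder)   # mark as recently useful
        return hits

    def store_index(self, responder, index):
        """Store a piggybacked index, evicting old ones if space runs out."""
        self.cache[responder] = index
        self.cache.move_to_end(responder)
        while len(self.cache) > 1 and sum(len(i) for i in self.cache.values()) > self.capacity:
            self.cache.popitem(last=False)      # assumed LRU replacement

    def search(self, query, network):
        hits = self.lookup_cache(query)
        if hits:
            return hits, "cache"
        results = {}
        for peer in network:                    # stand-in for sending q
            if peer is self:
                continue
            reply = peer.respond(query)
            if reply:
                matches, index = reply
                results[peer.name] = matches
                self.store_index(peer.name, index)
        return results, "network"

r1 = SpirpPeer("R1", ["pop_song_a.mp3", "pop_song_b.mp3"], capacity=10)
r2 = SpirpPeer("R2", ["beethoven_symphony5.mp3", "beethoven_sonata.mp3"], capacity=10)
requester = SpirpPeer("I", [], capacity=3)
network = [requester, r1, r2]
print(requester.search("beethoven", network))   # miss: answered by R2, index stored
print(requester.search("sonata", network))      # hit in the prefetched index
```

A later query for pop music would miss, pull in the index of `R1`, and push the cache over its capacity, at which point the older `R2` index is replaced.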

23 CAC-SPIRP
CAC: application-level infrastructure
–Significantly reduces bandwidth consumption
–Good response time when queries succeed in the CAC
–Long response time when queries fail in the CAC
SPIRP: client-oriented and overlay independent
–Significantly reduces response time
–Small traffic when queries can be satisfied in the cache
–Same traffic as flooding when the cache misses
CAC-SPIRP
–The two techniques are easy to combine
–Considers the trade-off between the two performance objectives
–Has the merits of both: search quality and search cost

24 Simulation Environment
Content trace and query trace
–4-day Gnutella crawl in our measurement
Overlay topology
–Traces by Clip2 Distributed Search Solutions
Session duration
–Pareto distribution fitted from measurement results: P(x) = 14.5311 * x^(-1.8598)
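
Session durations following the fitted distribution could be drawn by inverse-transform sampling. The sketch rests on an interpretation that the slide does not confirm: P(x) is read as a normalized Pareto density alpha * x_min^alpha / x^(alpha + 1), so the shape is the exponent minus one and the minimum duration follows from the coefficient. The unit of x is not given on the slide and is left unspecified here.

```python
import random

COEFF = 14.5311      # coefficient from the slide
EXPONENT = 1.8598    # exponent from the slide, read as x^(-1.8598)

# Assumed mapping onto a Pareto density alpha * x_min**alpha / x**(alpha + 1).
ALPHA = EXPONENT - 1.0
X_MIN = (COEFF / ALPHA) ** (1.0 / ALPHA)

def sample_session_duration(rng):
    """Draw one duration by inverting the Pareto CDF 1 - (x_min / x)**alpha."""
    u = 1.0 - rng.random()             # uniform on (0, 1], avoids division by zero
    return X_MIN * u ** (-1.0 / ALPHA)

rng = random.Random(42)
durations = [sample_session_duration(rng) for _ in range(1000)]
print(ALPHA, X_MIN, min(durations))
```

Under this reading the shape parameter is below 1, so the distribution is heavy-tailed with no finite mean: most sessions are short and a few are very long.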

25 Evaluation Metrics
Query success rate
–CAC: success rate in the CAC (normalized to flooding)
–SPIRP: success rate in the local cache (normalized to flooding)
Overall network traffic
–Accumulated communication traffic for all queries, responses, and index transfers (normalized to flooding)
Average response time
–Measured as the number of routing hops (normalized to flooding)
Evaluated for different query satisfaction levels
–1, 10, and 50 results, representing different user demands
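
Normalizing to flooding can be sketched as follows. The per-query record layout, the function names, and all numbers are made up for illustration; in particular, averaging hops only over satisfied queries is an assumption, since the slide does not say how unsatisfied queries are counted.

```python
from statistics import mean

def summarize(records, min_results):
    """Aggregate per-query records (hypothetical layout) into the three metrics."""
    satisfied = [r for r in records if r["results"] >= min_results]
    return {
        "success_rate": len(satisfied) / len(records),
        "traffic": sum(r["messages"] for r in records),
        # Assumption: response time is averaged over satisfied queries only.
        "response_time": mean(r["hops"] for r in satisfied) if satisfied else float("inf"),
    }

def normalize_to_flooding(scheme, flooding):
    """Divide each metric of a scheme by the same metric under flooding."""
    return {key: scheme[key] / flooding[key] for key in scheme}

# Made-up records for two queries under each scheme.
flooding_records = [
    {"results": 12, "messages": 100, "hops": 2},
    {"results": 3, "messages": 100, "hops": 3},
]
scheme_records = [
    {"results": 10, "messages": 30, "hops": 1},
    {"results": 0, "messages": 50, "hops": 4},
]
normalized = normalize_to_flooding(
    summarize(scheme_records, min_results=1),
    summarize(flooding_records, min_results=1),
)
print(normalized)
```

A normalized traffic or response time below 1 means the scheme is cheaper or faster than flooding; a normalized success rate below 1 means it satisfies fewer queries.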

26 Performance Evaluation for CAC
The top 5% of content-abundant peers are good enough for cluster construction
[Figure: Success Rate in CAC (normalized), Overall Traffic (Normalized), and Avg Response Time (Normalized) versus Cluster Size (in percentage of P2P network size), each for Minimum Results = 1, 10, and 50]

27 CAC Member Selection
[Figure: Success Rate in CAC (normalized), Overall Traffic (Normalized), and Avg Response Time (Normalized) versus Success Response Rate of CAC Peers (0 to 0.04), each for Minimum Results = 1, 10, and 50]
Overall traffic is not sensitive to CAC member quality
Traffic can be significantly reduced even for randomly selected CAC members
CAC down-flooding is very efficient

28 CAC-SPIRP Overall Performance
[Figure: Success Rate in Local Cache, Overall Traffic (Normalized), and Average Response Time (Normalized) versus Size of Incoming Index Set Buffer (in MBytes, 0 to 10); series for Query Satisfaction = 1, 10, and 50, and for peers having 1 to 5, 10 to 20, 30 to 40, and at least 50 queries satisfied]
CAC-SPIRP reduces both the overall traffic and the response time significantly

29 Conclusion
CAC-SPIRP fundamentally addresses the P2P search problem through re-organization.
–Exploiting organizational content locality: CAC, a content-abundant cluster, provides high-quality and fast service.
–Exploiting user content locality: SPIRP, a client prefetching technique, speeds up search by avoiding unnecessary queries.

