1 Efficient Peer to Peer Keyword Searching Nathan Gray

2 Introduction
Current P2P systems (e.g., Chord, Freenet) do not provide keyword search functionality
The system developed uses a DHT that stores, for each keyword, the list of documents containing it

3 Introduction
Topics to be covered:
–Search model and design
–Simulation
Idea: the authors consider end-user latency the most important metric
–Most latency comes from network transfer time
–Goal: minimize the number of bytes sent

4 System Model
Search:
–Associating keywords with document IDs
–Retrieving document IDs matching keywords from the DHT
Inverted index:
–Maps each word to the list of documents in which it appears (see the sketch below)
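A minimal sketch (in Python, not from the slides) of what the inverted index plus DHT placement might look like; the names `build_inverted_index` and `dht_key` are illustrative, and using a 128-bit MD5 digest as the DHT key is just one plausible choice.

```python
import hashlib
from collections import defaultdict

def build_inverted_index(docs):
    """Map each word to the set of document IDs that contain it.
    `docs` is a dict of {doc_id: text}; names are illustrative."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in set(text.lower().split()):
            index[word].add(doc_id)
    return index

def dht_key(word):
    """Hash a keyword to a 128-bit key that decides which DHT node stores its list."""
    return hashlib.md5(word.encode()).hexdigest()

docs = {
    "doc1": "peer to peer keyword search",
    "doc2": "distributed hash table keyword lookup",
}
index = build_inverted_index(docs)
print(dht_key("keyword"), sorted(index["keyword"]))  # both documents match "keyword"
```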

5

6 Partitioning
Horizontal
–Requires all nodes to be contacted
–Queries are broadcast to all nodes
Vertical
–Minimizes the cost of searches: no more than k servers participate in a query of k keywords (sketched below)
–Most changes in a file system occur in bursts, so lazy updating can be used
–Queries go to only a small number of hosts
–Throughput grows linearly with system size
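A small illustrative sketch of vertical partitioning: each keyword's entire document list lives on a single home server chosen by hashing, so a k-keyword query contacts at most k servers. The server list and hashing scheme here are assumptions, not the paper's implementation.

```python
import hashlib

SERVERS = [f"server-{i}" for i in range(16)]  # illustrative node set

def home_server(keyword):
    """Vertical partitioning: a keyword's full document list is stored on
    the one server its hash maps to."""
    h = int(hashlib.md5(keyword.encode()).hexdigest(), 16)
    return SERVERS[h % len(SERVERS)]

def servers_for_query(keywords):
    """A query with k keywords contacts at most k servers."""
    return {home_server(k) for k in keywords}

print(servers_for_query(["efficient", "peer", "search"]))  # at most 3 servers
```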

7 Partitioning

8 Why Distribute the Searching?
Google succeeds with centralized searching, but:
–Concentrating both load and trust on a small number of hosts is undesirable
–A distributed system can run on voluntarily contributed end-user machines
–A distributed system benefits more from replication and is less susceptible to correlated failures

9 Ranking
Key idea: the order in which documents are presented to the user
–Google's PageRank exploits the hyperlinked nature of the web
–P2P content does not necessarily have the web's hyperlink structure, but word position and proximity can be used instead (see the sketch below)
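A hedged sketch of the position/proximity idea: score a document by how close together the query terms appear. This is one plausible scoring function, not the paper's ranking formula.

```python
def proximity_score(positions_a, positions_b):
    """Score a document by the smallest gap between any occurrence of the
    two query terms; closer terms score higher. Illustrative only."""
    best_gap = min(abs(pa - pb) for pa in positions_a for pb in positions_b)
    return 1.0 / (1 + best_gap)

# Example: token positions of two query terms within one document.
print(proximity_score([3, 40], [5, 90]))  # gap of 2 -> score ~0.33
```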

10 Update Discovery
The search engine must discover new, removed, or modified documents
Distributed environments benefit most from pushed updates rather than broadcast, for:
–Efficiency
–Currency of the index

11 P2P Search Support
Goal: show that keyword search in P2P is feasible
Remote servers are contacted
–To look up the mapping of words to documents
Peers are contacted across the network
–Intersection sets are calculated; the user usually wants only a small subset of the matching documents

12 P2P Search
Key challenge: perform efficient searches while limiting the amount of bandwidth used
Naive approach: server A sends its document list to server B, which computes the intersection A ∩ B
Server B discards most of the data A sent, because the intersection is much smaller than A's full list of documents matching its keyword (see the sketch below)
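A toy illustration of the naive exchange described above, with made-up list sizes: server A ships its whole list of 128-bit (16-byte) identifiers, yet only a small intersection survives at server B.

```python
# Naive two-keyword AND query: server A ships its whole document list to
# server B, which intersects it with its own list. Most of what A sent is
# discarded because the intersection is typically much smaller.
list_a = set(range(0, 10_000))       # docs containing keyword A (illustrative)
list_b = set(range(9_000, 12_000))   # docs containing keyword B (illustrative)

bytes_sent = len(list_a) * 16        # 128-bit (16-byte) document IDs
result = list_a & list_b
print(bytes_sent, len(result))       # 160000 bytes sent for a 1000-doc answer
```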

13

14 Bloom Filters (BF)
Recall: a Bloom filter summarizes membership in a set
In this paper:
–Bloom filters compress the data sent between servers (the intersections)
–This reduces the amount of communication

15 BF
Document identifiers are assumed to be 128-bit hashes
Bloom filters give roughly a 12:1 compression ratio (see the sketch below)
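A minimal Bloom filter sketch showing how server A could summarize its document list F(A) instead of sending the raw IDs; server B then keeps only its documents that might be in F(A). The filter parameters here are illustrative, not the paper's tuned sizes, and false positives still have to be removed in a later pass.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter; parameters are illustrative, not the paper's."""
    def __init__(self, num_bits=8192, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item):
        for i in range(self.num_hashes):
            h = hashlib.md5(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.num_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

# Server A summarizes F(A) in a Bloom filter instead of shipping raw IDs.
f_a = {f"doc{i}" for i in range(2000)}
bf = BloomFilter()
for doc in f_a:
    bf.add(doc)

# Server B keeps only documents that *might* be in F(A); false positives are
# possible, so a later pass against the real F(A) removes them.
f_b = {f"doc{i}" for i in range(1500, 2500)}
candidates = {doc for doc in f_b if doc in bf}
print(len(bf.bits), len(candidates))  # 1 KB filter vs. 2000 x 16-byte raw IDs
```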

16

17 Caches
Goal:
–Store more keywords
–Cache the Bloom filter for the successor host
–Keyword popularity follows a Zipf distribution (heavy-tailed)
–Popular keywords dominate
So caching the Bloom filter or the entire document list F(A) gives a high hit ratio

18 Cache
A higher cache hit rate reduces the number of excess bits sent
The compression ratio grows roughly linearly with that bit reduction
Consistency:
–A TTL scheme is used (sketched below)
–Updates happen at a keyword's primary location only
–A small staleness factor is expected, given assumed Web update patterns
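A sketch of the TTL-based consistency idea under stated assumptions (the class name and the 300-second default TTL are illustrative): cached entries simply expire, bounding staleness without any invalidation traffic.

```python
import time

class TTLCache:
    """Cached Bloom filters (or document lists) expire after `ttl` seconds,
    which bounds staleness without invalidation messages. Illustrative."""
    def __init__(self, ttl=300.0):
        self.ttl = ttl
        self.store = {}  # keyword -> (cached value, time stored)

    def put(self, keyword, value):
        self.store[keyword] = (value, time.monotonic())

    def get(self, keyword):
        entry = self.store.get(keyword)
        if entry is None:
            return None                     # miss: fetch from the successor host
        value, stored_at = entry
        if time.monotonic() - stored_at > self.ttl:
            del self.store[keyword]         # stale: treat as a miss
            return None
        return value                        # hit: skip one network round trip
```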

19 Incremental Results
Scalability: the user wants only a desired number of results
–Cost is O(n) with the size of the network
–Bloom filters and caching provide only a constant O(1) improvement in data sent
Chunks:
–Allow partial cache hits for each keyword
–Reduce the amount of cache allotted to each keyword
–Con: large CPU overhead
–Solution: send contiguous chunks and tell server B which portion of the hash space to test (see the sketch below)
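An illustrative sketch of the chunking idea: the document list is split into contiguous ranges of the 128-bit hash space, so each chunk can be sent or cached on its own and the receiving server knows which portion of the space it covers; intersection stops as soon as enough results are found. Helper names and sizes are assumptions.

```python
import hashlib

def doc_id(name):
    """Illustrative 128-bit integer ID derived from a document name."""
    return int(hashlib.md5(name.encode()).hexdigest(), 16)

def hash_chunks(doc_ids, num_chunks):
    """Split a document list into contiguous chunks of the 128-bit hash space."""
    chunk_size = (2 ** 128) // num_chunks
    chunks = [[] for _ in range(num_chunks)]
    for d in doc_ids:
        chunks[min(d // chunk_size, num_chunks - 1)].append(d)
    return chunks

def incremental_intersection(chunks_a, f_b, wanted):
    """Server B intersects one chunk at a time and stops once the user's
    desired number of results is reached."""
    results = []
    for chunk in chunks_a:
        results.extend(d for d in chunk if d in f_b)
        if len(results) >= wanted:
            break
    return results[:wanted]

f_a = [doc_id(f"doc{i}") for i in range(1000)]
f_b = {doc_id(f"doc{i}") for i in range(500, 1500)}
chunks = hash_chunks(sorted(f_a), num_chunks=8)
print(len(incremental_intersection(chunks, f_b, wanted=10)))  # 10
```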

20

21 Discussion
Two issues:
–End-to-end query latency
–Number of bytes sent
Bloom filters give a compression benefit, but introduce:
–Latency
–A probability of false positives
Caching:
–Reduces the false-positive probability
–Reduces bandwidth costs

22 Discussion
Incremental Results (IR)
–Assume the user wants only a fixed number of results
–1) Reduces the number of bytes sent
–2) End-to-end query latency stays roughly constant as the network grows
Risk: keywords that are popular but uncorrelated
–The entire search space must be checked
–This increases the number of bytes sent
–Bloom filters still give about a 10:1 compression ratio over the whole document list
Bloom filters and IR complicate ranking schemes
–Bloom filters cannot preserve the order of set members or convey metadata along with the result set

23 Discussion Cont'd
If IR sends later chunks with lower rank, the earlier results are the better ones
Risk: order within a chunk is lost
–Overall order is maintained, though
Key point:
–Ranking matters more than small bandwidth or latency benefits

24 Simulation
Goals:
–Test with a realistic number of nodes in the network
–Bloom filter threshold sizes
–Caching
–Incremental results

25 Simulation Characteristics
Document corpus: 1.85 GB of HTML, 1.17 million unique words
Three types of node distribution:
–Modems
–Backbone links
–Measurements of a Gnutella-like network
Randomized latencies:
–A 2,500-square-mile grid
–Packets assumed to travel 100,000 miles/sec (roughly the speed of light)

26 Simulation Cont'd
Documents: 128-bit identifiers
Process:
–Simulate the lookup of a keyword in the inverted index
–Map the index entry to M search results
–The node computes intersections
–Using a Bloom filter, send the intersection to the next host (depending on size, the whole document list may be sent instead)
–The host checks its cache for the next host's document list; on a hit, it performs that intersection itself and skips that communication phase

27 Experimental Results Goal: measure the performance effects of keyword search in a P2P network

28 Virtual Hosts
Concept: vary the number of virtual nodes/hosts per machine (see the sketch below)
Results:
–Little effect on the amount of data sent over the network
–Network times cut by 60% for local nodes
–Reduced chance of load-balancing issues
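A hedged sketch of the virtual-host concept: one physical machine takes several positions in the DHT ID space, which is one common way such schemes are implemented; the hashing here is illustrative, not necessarily the paper's.

```python
import hashlib

def virtual_ids(host, num_virtual):
    """Give one physical machine several positions ("virtual hosts") in the
    128-bit DHT ID space by hashing 'host#index'. Illustrative only."""
    return [int(hashlib.md5(f"{host}#{i}".encode()).hexdigest(), 16)
            for i in range(num_virtual)]

print(len(virtual_ids("machine-7", num_virtual=4)))  # one machine, four DHT positions
```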

29 BF and Caching
Bloom filter drawback: increased network transactions for false-positive checking
–The initial comparison
–A second pass to remove false positives

30

31 BF and Caching Results
Bloom filters:
–BF threshold: 300
–For smaller result sets, the entire result list is sent instead of a Bloom filter
–Why? The bandwidth benefit is much smaller than the latency introduced
Caching:
–Decreased the number of bytes sent
–Increased the optimal BF size (~24 bits/entry)
–50% decrease in the number of bytes sent per query

32 Conclusions
Keyword searching in P2P networks is feasible
Traffic grows linearly with the size of the network
Improved completeness relative to crawling (centralized keyword search)
Bloom filters, virtual hosts, caching, and incremental results:
–Reduce the network resources consumed
–Decrease end-to-end client search latency

