Text-Based Content Search and Retrieval in ad hoc P2P Communities
Francisco Matias Cuenca-Acuna, Thu D. Nguyen
Motivation
- It is hard to find information in current P2P infrastructures
  - They are designed for name-based search
  - They don’t have quality metrics
  - They don’t rank results
  - Most are optimized to find popular content
- The current Internet search model has proven effective at locating data
  - Intuitive term-based query model
  - Quality metrics and ranking are critical factors in the success of Internet search engines
  - They help users quickly pinpoint relevant documents in a vast repository
Goals & challenges
- Empower P2P communities with search capabilities similar to those of Internet search engines
  - No central servers
  - Fault tolerance
- Cannot employ the current model used by Internet search engines
  - No central management and administration
  - Resources are fragmented
  - Peers' behavior is uncontrolled
Summary of PlanetP
- Nodes maintain an index of their content
  - Represented as Bloom filters
- Indexes and directories are replicated everywhere
- Gossiping keeps peers synchronized
[Figure: two peers, each with local files, XML snippets, a Bloom filter over its keys [K1,..,Kn], and a copy of the local directory; the peers are linked by gossiping]
Local directory (as shown on each peer):
Nickname | Status  | IP | Keys
Alice    | Online  | …  | [K1,..,Kn]
Bob      | Offline | …  | [K1,..,Kn]
Charles  | Online  | …  | [K1,..,Kn]
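To make the "index represented as a Bloom filter" idea concrete, here is a minimal sketch in Python. The class name, the filter size, the number of hash functions, and the use of SHA-256 are all assumptions chosen for illustration; the slides do not specify how PlanetP builds its filters.

```python
import hashlib


class BloomFilter:
    """Compact, lossy set of terms: no false negatives, some false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        # Sizes are illustrative assumptions, not values from the slides.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, term):
        # Derive num_hashes bit positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{term}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, term):
        for pos in self._positions(term):
            self.bits |= 1 << pos

    def __contains__(self, term):
        return all((self.bits >> pos) & 1 for pos in self._positions(term))


def summarize(documents, **kwargs):
    """Build one filter over every term in a peer's local documents."""
    summary = BloomFilter(**kwargs)
    for text in documents:
        for term in text.lower().split():
            summary.add(term)
    return summary
```

A peer would gossip a summary like this instead of its full term list. A membership test can return a false positive, so a hit only means the peer might hold the term.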
Content search in PlanetP
[Figure: Diane issues a query. Steps shown in order: Query, Local lookup, Rank nodes, Contact candidates, Rank results, STOP. The local directory lists nicknames (Alice, Bob, Charles, Diane, Edward, Fred, Gary) with their keys [K1,..,Kn]; the lookup and ranking steps show the candidate peers Bob, Fred and Diane; the final ranking shows File 3, File 1, File 2]
The Vector Space model
- Documents and queries are represented as k-dimensional vectors
  - Words are weighted according to their relevance to the document
  - Documents are weighted according to their words
- The angle between a query vector and a document vector indicates their similarity
[Figure: a document vector and a query vector]
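The angle-based similarity can be illustrated with the cosine of the angle between two sparse term-weight vectors. This is a minimal sketch; the dictionary representation and the function name are assumptions made for the example.

```python
import math


def cosine_similarity(query, document):
    """Cosine of the angle between two sparse vectors given as {term: weight}."""
    shared = set(query) & set(document)
    dot = sum(query[t] * document[t] for t in shared)
    norm_q = math.sqrt(sum(w * w for w in query.values()))
    norm_d = math.sqrt(sum(w * w for w in document.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0  # an empty vector has no direction
    return dot / (norm_q * norm_d)
```

A value near 1 means the query and the document point in almost the same direction; a value of 0 means they share no weighted terms.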
Weight assignment (TFxIDF)
- Idea
  - Use per-document Term Frequency (TF) to weight words ($W_{D,t}$)
  - Use inverse global popularity (IDF) to find good discriminators among the query terms
- Intuition
  - TF indicates how related a document is to a particular concept
  - Inverse Document Frequency (IDF) identifies the words that are good discriminators between documents
- $W_{D,t} = f(\text{frequency of } t \text{ in } D)$
- $IDF_t = f(\text{No. documents} / \text{frequency of } t \text{ across documents})$
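The slide leaves the function f unspecified. The sketch below assumes one common logarithmic choice, a term weight of 1 + log(count) and an IDF of log(1 + N / f_t), and it assumes f_t is the number of documents containing t. Both choices and all function names are illustrative assumptions, not necessarily the authors' exact formulas.

```python
import math
from collections import Counter


def tf_weight(count):
    # Assumed form of W_{D,t}: dampened term frequency.
    return 1.0 + math.log(count) if count > 0 else 0.0


def idf(num_docs, doc_freq):
    # Assumed form of IDF_t: rarer terms get larger weights.
    return math.log(1.0 + num_docs / doc_freq)


def rank_documents(query_terms, documents):
    """documents maps a name to its list of terms; returns (name, score), best first."""
    counts = {name: Counter(terms) for name, terms in documents.items()}
    num_docs = len(documents)
    scores = {name: 0.0 for name in documents}
    for term in set(query_terms):
        doc_freq = sum(1 for c in counts.values() if term in c)
        if doc_freq == 0:
            continue  # term appears nowhere, so it cannot discriminate
        weight = idf(num_docs, doc_freq)
        for name, c in counts.items():
            scores[name] += weight * tf_weight(c[term])
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Note that computing `doc_freq` requires looking at every document in the collection, which is the global knowledge a central search engine has.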
Node & document ranking in PlanetP
- Unfortunately, IDF is not suited for P2P
  - It requires an appearance count for every word in the community
- We introduce the Inverse Peer Frequency (IPF)
  - $IPF_t = f(\text{No. peers} / \text{No. peers with documents containing } t)$
  - IPF can be computed with local information
  - IPF is compatible across the community
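Because every peer holds a copy of the directory with each peer's term summary, IPF can be computed without contacting anyone. The sketch below assumes IPF_t = log(1 + N / N_t) and ranks a peer by summing the IPF of the query terms its summary contains. The formula, the scoring rule, and the function names are assumptions for illustration; plain Python sets stand in for the Bloom filters.

```python
import math


def ipf(num_peers, peers_with_term):
    # Assumed form of f: terms held by fewer peers weigh more.
    return math.log(1.0 + num_peers / peers_with_term)


def rank_peers(query_terms, directory):
    """directory maps nickname -> term summary (anything supporting `in`).

    Returns (nickname, score) pairs, best first, omitting peers with no hit.
    """
    num_peers = len(directory)
    weights = {}
    for term in set(query_terms):
        holders = sum(1 for summary in directory.values() if term in summary)
        if holders:
            weights[term] = ipf(num_peers, holders)
    scores = {}
    for nickname, summary in directory.items():
        score = sum(w for term, w in weights.items() if term in summary)
        if score > 0:
            scores[nickname] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

The resulting order is the order in which candidate peers would be contacted.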
Stopping condition
- Intuitive idea: stop as soon as k documents are retrieved
  - Not good: a node might have few highly ranked documents and many that have a low rank
- We propose an adaptive approach:
  - Contact nodes one by one and keep a list of the top k documents retrieved
  - Stop contacting candidates when p nodes in a row fail to contribute to the top k
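The adaptive rule can be sketched as a loop over the ranked candidates that keeps a bounded heap of the best k documents and counts consecutive peers that add nothing to it. The function name, the `fetch` callback, and the scores in the usage note are hypothetical.

```python
import heapq


def adaptive_search(ranked_peers, fetch, k, p):
    """Contact peers best first; stop after p consecutive non-contributors.

    fetch(peer) returns that peer's matches as (score, doc_id) pairs.
    Returns (top documents, best first) and the list of peers contacted.
    """
    top = []  # min-heap holding at most k (score, doc_id) pairs
    misses = 0
    contacted = []
    for peer in ranked_peers:
        contacted.append(peer)
        contributed = False
        for score, doc_id in fetch(peer):
            if len(top) < k:
                heapq.heappush(top, (score, doc_id))
                contributed = True
            elif score > top[0][0]:
                heapq.heapreplace(top, (score, doc_id))
                contributed = True
        misses = 0 if contributed else misses + 1
        if misses >= p:
            break
    return sorted(top, reverse=True), contacted
```

With k=2 and p=1, a first peer returning scores 0.9 and 0.8, a second returning 0.95 and 0.1, and a third returning 0.2, the search contacts all three peers and stops there, because the third peer does not improve the top two.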
Evaluation method
- We use five well-known document collections
  - Each collection comes with a set of queries and relevance judgments
  - Here we present results for one (AP89)
- We measure recall and precision
Trace | Queries | Documents | Number of words | Collection size (MB)
AP
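Recall and precision follow directly from a query's retrieved set and its relevance judgments. This is a minimal sketch with hypothetical document identifiers.

```python
def precision_recall(retrieved, relevant):
    """Precision: fraction of retrieved documents that are relevant.
    Recall: fraction of relevant documents that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

For example, retrieving four documents of which two are among three relevant ones gives a precision of 0.5 and a recall of 2/3.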
Evaluation method
- We use a simulator to test our algorithm
  - With different file distributions
  - Against a central search engine
  - Quantifying the effect of not using an adaptive stopping condition
Results cont.
More results
- Adjusting the stop condition according to the community size and the number of results expected
  - We provide a linear function to determine p
- Recall as the community grows to 1000 (scalability)
- Overlap between PlanetP’s results and the ones obtained using standard TFxIDF
  - 80% on average
Conclusions
- PlanetP matches TFxIDF’s performance using the TFxIPF approximation
  - Gives P2P communities search capabilities as powerful as those of environments with centralized resources
  - TFxIPF is applicable beyond PlanetP
- PlanetP matches TFxIDF’s performance regardless of how documents are distributed throughout the community
- Our stopping heuristic limits searches to a small subset of the community, yet allows enough peers to be contacted to guarantee good results
Related Work
- Tapestry, Pastry, Chord and CAN
  - Implement a distributed hash table for P2P environments
  - Oriented towards name-based searches (for FS)
  - They already store all the information needed to implement TFxIPF
- CORI and GlOSS
  - Address the problem of indexing and searching distributed collections of documents
  - They build a centralized index that has total knowledge of word usage, so they don’t contact unnecessary nodes
Example
- Assume k=2 and p=1
- Documents marked with a tick have been judged relevant
- Documents marked with a cross are related but not relevant
[Figure: three nodes holding documents D11 D12 D13 D14, D21 D22 D23 D24, and D31 D32 D33 D34]
Trivial stop {D11, D12} {D21, D11} Adaptive stop {D11, D12}