
1 P2P Content Search: Give the Web Back to the People
Matthias Bender, Sebastian Michel, Peter Triantafillou, Gerhard Weikum, Christian Zimmer
Mariam John, CSE 6392, 06/27/2006

2 Why P2P Web Search
- Full-fledged web search is under the control of centralized search engines.
- There is growing concern about the world’s dependency on a few quasi-monopolistic search engines and their susceptibility to commercial interests, spam, censorship, etc.
- A P2P search engine might be more robust than centralized search, as the demise of a single server or site is unlikely to paralyze the entire search system.
- All this leads to the postulation that “the Web should be given back to the people”.

3 Challenges: Is P2P web search likely to work?
- A P2P search system has two main resource constraints: storage and bandwidth.
- Approach: distribute a conceptually global keyword index across a DHT-style network.
- From a query processing and IR viewpoint, one of the key issues is query routing: given a query, to which other peers should it be forwarded to obtain the top-k ranked result set?
- This decision requires statistical information about the data contents in the network. It can be made fairly efficient by utilizing a DHT-based distributed directory.

4 Challenges
- Efficiency of P2P query routing is only one side of the coin. How about the quality of the search result?
- The goal is to be as good as centralized search engines.
- The P2P approach faces the challenge that the index lists and statistical information that lead to good search results are scattered across the network.

5 System Architecture - Minerva
- Minerva is a fully operational distributed search engine consisting of autonomous peers, where:
  - Each peer has a local document collection. The local collection is indexed by inverted lists, one for each keyword or term.
  - A conceptually global but physically distributed directory, layered on top of a Chord-style distributed hash table (DHT), manages aggregated information about the peers’ local knowledge in compact form.
  - The Chord DHT partitions the term space such that each peer is responsible for the statistics and metadata of a randomized subset of terms within the directory.
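
The two building blocks on this slide, a per-peer inverted index and a Chord-style mapping from terms to responsible peers, can be sketched roughly as follows. This is a minimal illustration, not Minerva's code: the names `ring_id`, `build_inverted_index`, `responsible_peer`, the whitespace tokenizer, and the 32-bit ring size are all assumptions made for the example.

```python
import hashlib
from bisect import bisect_left
from collections import defaultdict

RING_BITS = 32  # assumed identifier-space size, chosen only to keep the sketch small


def ring_id(key: str) -> int:
    """Hash a string (term or peer name) onto the identifier ring."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** RING_BITS)


def build_inverted_index(docs: dict[str, str]) -> dict[str, list[str]]:
    """Map each term to the sorted list of local documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():  # naive tokenizer (assumption)
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def responsible_peer(term: str, peers: list[str]) -> str:
    """Return the term's successor on the ring: the first peer id at or after the term id."""
    ring = sorted((ring_id(p), p) for p in peers)
    ids = [pid for pid, _ in ring]
    pos = bisect_left(ids, ring_id(term)) % len(ring)  # wrap around past the last peer
    return ring[pos][1]
```

Because both terms and peers are hashed onto the same ring, every peer ends up responsible for a pseudo-random subset of terms, and any peer can compute who is responsible for a term without global coordination.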

6 Directory Maintenance
- In the publishing process, each peer distributes per-term summaries (Posts) of its local index to the global directory.
- For each term, the DHT determines the responsible peer, and this peer maintains a PeerList of all Posts for the term.
- The directory information is proactively replicated to ensure a certain degree of replication.
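
The publishing step can be illustrated with a small in-memory stand-in for the directory. This is only a sketch under stated assumptions: the `Post` fields, the use of document frequency as the published statistic, and the single `Directory` object (in place of DHT routing and replication) are hypothetical simplifications.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Post:
    peer: str      # publishing peer
    term: str
    doc_freq: int  # number of local documents containing the term (assumed statistic)


class Directory:
    """In-memory stand-in for the DHT-backed directory (no routing, no replication)."""

    def __init__(self) -> None:
        self._peer_lists: dict[str, dict[str, Post]] = defaultdict(dict)

    def publish(self, post: Post) -> None:
        # A newer Post from the same peer replaces its older Post for that term.
        self._peer_lists[post.term][post.peer] = post

    def peer_list(self, term: str) -> list[Post]:
        """All Posts for a term, highest document frequency first."""
        posts = self._peer_lists.get(term, {}).values()
        return sorted(posts, key=lambda p: (-p.doc_freq, p.peer))


def publish_local_index(peer: str, index: dict[str, list[str]], directory: Directory) -> None:
    """Publish one Post per term of a peer's local inverted index."""
    for term, doc_ids in index.items():
        directory.publish(Post(peer, term, len(doc_ids)))
```

In a real deployment each `publish` call would be routed to the directory peer responsible for the term, so the PeerList for one term lives on one peer (plus its replicas).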

7 Query Execution
- A query with multiple terms is processed as follows:
  - The query is first executed locally using the peer’s local index.
  - If the user considers this result unsatisfactory, the peer issues a PeerList request to the directory, looking up potentially promising peers for each query term separately.
  - The query is then executed completely on each of the selected remote peers.
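
The three steps above can be sketched as one function. Everything here is an illustrative assumption rather than Minerva's actual logic: indexes are plain dicts of per-term document scores, the directory is a dict of `(peer, doc_freq)` pairs, peers are ranked by summed document frequency, and duplicate documents are merged by keeping the best score. The optional `satisfied` callback stands in for the user's judgment of the local result.

```python
from collections import Counter


def run_local(index: dict[str, dict[str, float]], terms: list[str]) -> dict[str, float]:
    """Score documents on one peer by summing their per-term scores."""
    scores = Counter()
    for term in terms:
        for doc, score in index.get(term, {}).items():
            scores[doc] += score
    return dict(scores)


def select_peers(peer_lists: dict[str, list[tuple[str, int]]], max_peers: int) -> list[str]:
    """Rank peers by summed per-term document frequency (a deliberately simple measure)."""
    totals = Counter()
    for posts in peer_lists.values():
        for peer, doc_freq in posts:
            totals[peer] += doc_freq
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return [peer for peer, _ in ranked[:max_peers]]


def top_k(results: dict[str, float], k: int) -> list[tuple[str, float]]:
    return sorted(results.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


def execute_query(terms, local_index, directory, remote_indexes, k, max_peers=2, satisfied=None):
    # Step 1: execute the query on the local index.
    results = run_local(local_index, terms)
    if satisfied is not None and satisfied(results):
        return top_k(results, k)
    # Step 2: fetch one PeerList per query term and pick promising peers.
    peer_lists = {t: directory.get(t, []) for t in terms}
    # Step 3: run the complete query on each selected peer and merge the results.
    for peer in select_peers(peer_lists, max_peers):
        for doc, score in run_local(remote_indexes[peer], terms).items():
            results[doc] = max(results.get(doc, 0.0), score)
    return top_k(results, k)
```

Note that the directory is consulted once per term, while each selected remote peer evaluates the whole query; this is what keeps the directory small but leaves peer selection with only per-term statistics.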

8 Query Routing
- Most query routing techniques work well on disjoint data collections.
- What happens when autonomous peers crawl the web independently of each other?
- The result is overlap: the same information may be indexed by several peers.

9 Exploiting Correlations in Queries
- Directory information about term correlation can be exploited for query routing in several ways.
- First method:
  - Treat correlated term combinations as keys for DHT-based overlay networks.
  - The query initiator can locate the responsible directory peer by simply hashing the key and using standard DHT lookup routing.

10 Exploiting Correlations in Queries (cont’d)
- The directory entry directly provides the query initiator with a query-specific peerList that reflects the best peers for the entire query.
- What happens if this peerList is too short?
- The query initiator always has the fallback option of decomposing the query into its individual terms and retrieving peerLists for each term.
- What is the problem with the above method?
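
The first method and its fallback can be sketched as follows. The key format produced by `combo_key`, the truncated SHA-1 in `dht_key`, the `min_peers` threshold for "too short", and the dict used in place of real DHT lookups are all assumptions for illustration.

```python
import hashlib


def combo_key(terms: list[str]) -> str:
    """Canonical key for a term combination: lowercase, deduplicated, order-insensitive."""
    return "+".join(sorted({t.lower() for t in terms}))


def dht_key(key: str) -> int:
    """Hash a directory key into the DHT identifier space (32 bits here, an assumption)."""
    return int.from_bytes(hashlib.sha1(key.encode("utf-8")).digest()[:4], "big")


def lookup_peer_lists(terms: list[str], directory: dict[int, list[str]], min_peers: int):
    """Try the whole-query key first; fall back to per-term keys if the list is too short.

    `directory` maps DHT keys to peerLists and stands in for standard DHT lookup routing.
    """
    combined = directory.get(dht_key(combo_key(terms)), [])
    if len(combined) >= min_peers:
        return {"query": combined}
    # Fallback: decompose the query and retrieve one peerList per individual term.
    return {t: directory.get(dht_key(combo_key([t])), []) for t in terms}
```

In this sketch the fallback only runs after the whole-query lookup has already been paid for, so an unsuccessful first attempt adds extra lookups on top of the per-term ones.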

11 Exploiting Correlations in Queries (cont’d)
- We still collect peerLists for high-correlation term combinations.
- The directory is looked up for each query term separately.
- Whenever a directory peer has a good peerList for the entire query, this information is returned to the query initiator together with the per-term peerList.
- This causes no additional communication cost and provides the query initiator with the best available information on all individual terms as well as on the entire query.
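
A rough sketch of this second approach: the initiator performs only per-term lookups, and a directory peer attaches a whole-query peerList to its reply when it happens to hold one. The `DirectoryPeer` class, its two dict fields, the reply layout, and the `responsible` mapping (a stand-in for DHT routing) are hypothetical names introduced for this example.

```python
def combo_key(terms: list[str]) -> str:
    """Canonical key for a term combination: lowercase, deduplicated, order-insensitive."""
    return "+".join(sorted({t.lower() for t in terms}))


class DirectoryPeer:
    def __init__(self, term_lists: dict[str, list[str]], combo_lists: dict[str, list[str]]):
        self.term_lists = term_lists    # term -> per-term peerList
        self.combo_lists = combo_lists  # combination key -> peerList for correlated terms

    def lookup(self, term: str, query_terms: list[str]) -> dict[str, list[str]]:
        """Answer a per-term request; piggyback a whole-query peerList if one is stored."""
        reply = {"term": self.term_lists.get(term, [])}
        combo = self.combo_lists.get(combo_key(query_terms))
        if combo:
            reply["query"] = combo
        return reply


def collect_peer_lists(query_terms: list[str], responsible: dict[str, DirectoryPeer]):
    """One lookup per query term; returns (per-term peerLists, whole-query peerList or None)."""
    per_term, whole_query = {}, None
    for term in query_terms:
        reply = responsible[term].lookup(term, query_terms)
        per_term[term] = reply["term"]
        whole_query = whole_query or reply.get("query")
    return per_term, whole_query
```

The number of directory requests equals the number of query terms in every case; the whole-query peerList travels inside a reply that would have been sent anyway.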

12 Conclusion
- Research efforts in the area of P2P content search are driven by the desire to “give the Web back to the people”.
- This paper has explored the theme of leveraging the “power of users” in a P2P Web search engine.
- Observing user and community behavior is one potential key to better search result quality.

13 References
- Matthias Bender, Sebastian Michel, Peter Triantafillou, Gerhard Weikum, and Christian Zimmer, “Minerva: Collaborative P2P Search,” in VLDB, 2005.
- Matthias Bender, Sebastian Michel, Peter Triantafillou, Gerhard Weikum, and Christian Zimmer, “P2P Content Search: Give the Web Back to the People.”
- Matthias Bender, Sebastian Michel, Peter Triantafillou, Gerhard Weikum, and Christian Zimmer, “Improving Collection Selection with Overlap-Awareness,” in SIGIR, 2005.

