Published by Asher Lloyd. Modified over 8 years ago.
1
P2P Content Search: Give the Web Back to the People
Matthias Bender, Sebastian Michel, Peter Triantafillou, Gerhard Weikum, Christian Zimmer
Mariam John, CSE 6392, 06/27/2006
2
Why P2P Web Search
Full-fledged Web search is under the control of centralized search engines. There is growing concern about the world's dependency on a few quasi-monopolistic search engines and their susceptibility to commercial interests, spam, censorship, etc. A P2P search engine might be more robust than centralized search, because the demise of a single server or site is unlikely to paralyze the entire search system. All this leads to the postulate that “the Web should be given back to the people”.
3
Challenges: Is P2P Web Search Likely to Work?
A P2P search system has two main resource constraints: storage and bandwidth. The approach is to distribute a conceptually global keyword index across a DHT-style network. From a query processing and IR viewpoint, one of the key issues is query routing: given a query, to which other peers should it be forwarded to obtain the top-k ranked result set? This decision requires statistical information about the data contents in the network. Query routing can be made fairly efficient by using a DHT-based distributed directory.
4
Challenges
Efficiency of P2P query routing is only one side of the coin. What about the quality of the search results? The goal is to be as good as centralized search engines. The P2P approach faces the challenge that the index lists and statistical information that lead to good search results are scattered across the network.
5
System Architecture - Minerva
Minerva is a fully operational distributed search engine consisting of autonomous peers. Each peer has a local document collection, which is indexed by inverted lists, one for each keyword or term. A conceptually global but physically distributed directory, layered on top of a Chord-style distributed hash table (DHT), manages aggregated information about the peers' local knowledge in compact form. The Chord DHT partitions the term space such that each peer is responsible for the statistics and metadata of a randomized subset of terms within the directory.
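The term-space partitioning described above can be illustrated with a small sketch of Chord-style successor assignment. This is not Minerva's implementation: the 32-bit ring size, the use of SHA-1 on plain strings, and the names `ring_position` and `responsible_peer` are assumptions made for illustration, and real Chord resolves the successor through routed lookups rather than a global sorted list.

```python
import hashlib
from bisect import bisect_left

RING_BITS = 32  # assumed ring size, chosen only to keep the numbers small


def ring_position(key):
    """Hash a string (peer id or term) onto the identifier ring."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (1 << RING_BITS)


def responsible_peer(term, peer_ids):
    """Return the peer whose ring position is the first one at or after
    the term's position (its successor), wrapping around the ring."""
    ring = sorted((ring_position(p), p) for p in peer_ids)
    positions = [pos for pos, _ in ring]
    idx = bisect_left(positions, ring_position(term))
    return ring[idx % len(ring)][1]  # idx == len(ring) wraps to the first peer


if __name__ == "__main__":
    peers = ["peer-a", "peer-b", "peer-c"]
    for term in ["p2p", "search", "chord"]:
        print(term, "->", responsible_peer(term, peers))
```

Because terms are placed by hash value, each peer ends up responsible for a pseudo-random subset of terms, which matches the "randomized subset" described on the slide.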
6
Directory Maintenance
In the publishing process, each peer distributes per-term summaries (Posts) of its local index to the global directory. For each term, the DHT determines the responsible peer, and this peer maintains a PeerList of all Posts for that term. The system employs proactive replication of directory information so that a certain degree of replication is maintained.
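The publishing process can be sketched as follows, with a single in-memory object standing in for the distributed directory. The `Post` fields are hypothetical: the slide says only that Posts are per-term summaries, so the document-frequency statistic, the `Directory` class, and `build_posts` are assumptions for illustration, and replication is left out.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Post:
    peer_id: str
    term: str
    doc_freq: int  # hypothetical statistic: local documents containing the term


class Directory:
    """In-memory stand-in for the DHT-based directory (no replication)."""

    def __init__(self):
        # term -> {peer_id: Post}; one PeerList per term
        self._peer_lists = defaultdict(dict)

    def publish(self, post):
        # A newer Post from the same peer replaces its older one.
        self._peer_lists[post.term][post.peer_id] = post

    def peer_list(self, term):
        posts = self._peer_lists.get(term, {})
        return sorted(posts.values(), key=lambda p: p.doc_freq, reverse=True)


def build_posts(peer_id, inverted_index):
    """Summarize a local inverted index (term -> list of doc ids) as Posts."""
    return [Post(peer_id, term, len(docs)) for term, docs in inverted_index.items()]


if __name__ == "__main__":
    directory = Directory()
    for post in build_posts("peer-a", {"p2p": ["d1", "d2"], "web": ["d3"]}):
        directory.publish(post)
    for post in build_posts("peer-b", {"p2p": ["d4", "d5", "d6"]}):
        directory.publish(post)
    print(directory.peer_list("p2p"))
```

In a real deployment, `publish` would be routed through the DHT to whichever peer is responsible for the term.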
7
Query Execution
A query with multiple terms is processed as follows. The query is first executed locally using the peer's local index. If the user considers this result unsatisfactory, the peer issues a PeerList request to the directory, looking up potentially promising peers for each query term separately. The query is then executed in full on each of the remote peers.
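Two steps of this process lend themselves to a short sketch: combining the per-term PeerLists into one choice of remote peers, and merging the results that come back. The slide does not say how either is done, so summing per-term scores and keeping the best score per document are assumptions made here, and `select_peers` and `merge_top_k` are hypothetical names.

```python
def select_peers(query_terms, peer_lists, max_peers):
    """Pick remote peers from per-term PeerLists.

    peer_lists maps term -> list of (peer_id, score) pairs.
    Assumption: a peer's usefulness is the sum of its per-term scores.
    """
    totals = {}
    for term in query_terms:
        for peer_id, score in peer_lists.get(term, []):
            totals[peer_id] = totals.get(peer_id, 0.0) + score
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return [peer_id for peer_id, _ in ranked[:max_peers]]


def merge_top_k(result_lists, k):
    """Merge (doc_id, score) lists from several peers into one top-k list.

    Assumption: a document returned by several peers keeps its best score.
    """
    best = {}
    for results in result_lists:
        for doc_id, score in results:
            if score > best.get(doc_id, float("-inf")):
                best[doc_id] = score
    return sorted(best.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


if __name__ == "__main__":
    lists = {"p2p": [("p1", 3), ("p2", 1)], "search": [("p2", 5)]}
    print(select_peers(["p2p", "search"], lists, max_peers=1))
    print(merge_top_k([[("d1", 0.5), ("d2", 0.9)], [("d1", 0.7)]], k=2))
```

Scores from different peers are compared directly here, which is reasonable only if the peers score documents on a comparable scale.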
8
Query Routing
Most query routing techniques work well on disjoint data collections. But what happens when autonomous peers crawl the Web independently of each other? The result is overlap in the information indexed by the peers.
9
Exploiting Correlations in Queries
Directory information about term correlations can be exploited for query routing in several ways. First method: treat correlated term combinations as keys for the DHT-based overlay network. The query initiator can then locate the responsible directory peer by simply hashing the key and using standard DHT lookup routing.
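For a term combination to serve as a DHT key, every peer has to derive the same key from the same set of terms. One possible way to do that is sketched below; lowercasing, sorting, joining with "+", and the 32-bit ring are all assumptions for illustration, and `combination_key` and `dht_key` are hypothetical names.

```python
import hashlib


def combination_key(terms):
    """Build a canonical string for a term combination.

    Assumption: case and term order do not matter, so terms are
    lowercased, de-duplicated, and sorted before joining.
    """
    normalized = sorted({t.strip().lower() for t in terms})
    return "+".join(normalized)


def dht_key(terms, ring_bits=32):
    """Hash the canonical combination onto an identifier ring."""
    digest = hashlib.sha1(combination_key(terms).encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (1 << ring_bits)


if __name__ == "__main__":
    print(combination_key(["Web", "search"]))
    print(dht_key(["web", "search"]) == dht_key(["Search", "web"]))
```

The resulting key can then be looked up exactly like a single-term key.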
10
..cont’d
The directory entry directly provides the query initiator with a query-specific PeerList that reflects the best peers for the entire query. What happens if this PeerList is too short? The query initiator always has the fallback option of decomposing the query into its individual terms and retrieving PeerLists for each term. What is the problem with this method?
11
..cont’d
Second method: PeerLists are still collected for highly correlated term combinations, but the directory is looked up for each query term separately. Whenever a directory peer has a good PeerList for the entire query, this information is returned to the query initiator together with the per-term PeerList. This does not cause any additional communication cost, and it provides the query initiator with the best available information on all individual terms as well as on the entire query.
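The piggybacking idea above can be sketched as a lookup that always answers with the per-term PeerList and adds a query-specific PeerList when the directory peer happens to hold one. The `DirectoryPeer` class, its two dictionaries, and the `directory_for` callback (standing in for DHT routing) are hypothetical; the sketch shows only the message shape, not how good combinations are detected.

```python
class DirectoryPeer:
    """One directory peer holding per-term and per-combination PeerLists."""

    def __init__(self):
        self.term_lists = {}   # term -> list of peer ids
        self.combo_lists = {}  # frozenset of terms -> list of peer ids

    def lookup(self, term, full_query):
        """Answer a per-term request; piggyback a PeerList for the
        whole query if this peer has one (otherwise None)."""
        combo = self.combo_lists.get(frozenset(full_query))
        return self.term_lists.get(term, []), combo


def gather_peer_lists(query_terms, directory_for):
    """Issue one lookup per query term, as in the plain per-term scheme.

    directory_for(term) is assumed to return the directory peer that the
    DHT makes responsible for that term.
    """
    per_term = {}
    combo = None
    for term in query_terms:
        term_list, combo_list = directory_for(term).lookup(term, query_terms)
        per_term[term] = term_list
        if combo_list and combo is None:
            combo = combo_list  # keep the first query-specific list seen
    return per_term, combo


if __name__ == "__main__":
    a, b = DirectoryPeer(), DirectoryPeer()
    a.term_lists["p2p"] = ["p1", "p2"]
    b.term_lists["search"] = ["p2", "p3"]
    b.combo_lists[frozenset(["p2p", "search"])] = ["p2"]
    owners = {"p2p": a, "search": b}
    print(gather_peer_lists(["p2p", "search"], owners.__getitem__))
```

The number of lookups equals the number of query terms whether or not a query-specific list exists, which is the sense in which the extra information comes at no additional communication cost.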
12
Conclusion
Research efforts in the area of P2P content search are driven by the desire to “give the Web back to the people”. This paper has explored the theme of leveraging the “power of users” in a P2P Web search engine. Observing user and community behavior is one potential key to better search result quality.
13
References
Matthias Bender, Sebastian Michel, Peter Triantafillou, Gerhard Weikum, and Christian Zimmer, “Minerva: Collaborative P2P Search,” in VLDB, 2005.
Matthias Bender, Sebastian Michel, Peter Triantafillou, Gerhard Weikum, and Christian Zimmer, “P2P Content Search: Give the Web Back to the People.”
Matthias Bender, Sebastian Michel, Peter Triantafillou, Gerhard Weikum, and Christian Zimmer, “Improving Collection Selection with Overlap-Awareness,” in SIGIR, 2005.