1
Peer to Peer Information Retrieval
Going beyond Napster
2
What is P2P IR? No index on a central server
Content is distributed across all users of the system. Content is more than text: binary files and associated metadata.
3
An example of a P2P system
4
Why go P2P? Spiraling costs of maintaining indexes:
look at Google’s server farm. New content forces new thinking on IR: large binary files are hard to index. Freedom of speech: society is striving to communicate data that is being legislated against.
5
First P2P Systems: a central hash of distributed content
Only the central hash was used for queries. Disadvantages: scalability, known location of content, single point of failure. Advantages: quick searching, deterministic search results.
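The central-index design can be sketched in a few lines. This is a minimal illustration only: the `CentralIndex` class, the peer names and the keyword-splitting rule are assumptions made for the example, not the design of any particular system.

```python
from collections import defaultdict


class CentralIndex:
    """Hypothetical central index: maps a keyword to the peers holding matching files."""

    def __init__(self):
        self._entries = defaultdict(set)  # keyword -> {(peer, filename)}

    def register(self, peer, filename):
        # A peer announces a file; the index stores where it lives, not the file itself.
        for word in filename.lower().replace(".", " ").split():
            self._entries[word].add((peer, filename))

    def query(self, keyword):
        # A single lookup on a single server: quick and deterministic.
        return sorted(self._entries.get(keyword.lower(), set()))


index = CentralIndex()
index.register("peer-a", "blue song.mp3")
index.register("peer-b", "Blue Song.mp3")
index.register("peer-c", "red song.mp3")
print(index.query("blue"))
```

Every query goes through the one `index` object, which is what makes results deterministic and also what makes it a single point of failure.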
7
Bumps that caused change
Legal: centralized services were easy targets; owners of the index could not claim they had no knowledge of the content. Growth: the cost of maintaining the service grew; hardware requirements exploded.
8
Decentralized P2P: content is spread between users with no explicit intent
The centralized server is replaced by a self-maintaining network. Every user is also a server. There is no index of content. How do we search?
9
Searching Decentralized P2P Systems
Many methods, none perfected yet. Broadcast search. Advantages: every node takes part in the query. Disadvantages: as the system grows, network bandwidth and query time grow exponentially.
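A broadcast search can be sketched as a flood over the peer graph. The four-node `graph`, the `files` table and the `flood_search` function below are hypothetical, chosen only to make the message cost visible.

```python
from collections import deque


def flood_search(graph, files, start, wanted):
    """Broadcast a query from `start` to every reachable node.

    Returns (nodes holding `wanted`, number of query messages sent).
    """
    seen = {start}
    queue = deque([start])
    hits, messages = [], 0
    while queue:
        node = queue.popleft()
        if wanted in files.get(node, ()):
            hits.append(node)
        for neighbor in graph[node]:
            messages += 1  # each forward costs bandwidth, even if the neighbor has seen the query
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return hits, messages


# Assumed toy network: adjacency lists and the files each node shares.
graph = {"a": ["b", "c"], "b": ["a", "c", "d"], "c": ["a", "b", "d"], "d": ["b", "c"]}
files = {"c": {"song.mp3"}, "d": {"song.mp3"}}
hits, messages = flood_search(graph, files, "a", "song.mp3")
print(hits, messages)  # ['c', 'd'] 10
```

In this sketch every node forwards the query once to each neighbor, so the message count equals the sum of all node degrees: the cost follows the size of the network, not the number of matches.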
10
Intelligent P2P Crawls: ways to improve decentralized P2P queries
Intelligently place data (Freenet): by knowing the algorithm that distributes data, querying can be done more intelligently. Clustering (Fireworks model): clients with similar properties are logically grouped; queries that don’t apply to a group are not sent to that group of clients. Both change the paradigm of what kind of data is shared and the means of sharing.
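The clustering idea can be illustrated with a toy router. This is not the Fireworks model's actual algorithm; the topic labels, peer names and `clustered_search` function are assumptions that only show how grouping limits which clients see a query.

```python
def clustered_search(clusters, files, topic, wanted):
    """Send the query only to the cluster matching `topic`.

    Returns (peers holding `wanted`, number of peers contacted).
    """
    contacted = clusters.get(topic, [])
    hits = [peer for peer in contacted if wanted in files.get(peer, ())]
    return hits, len(contacted)


# Assumed grouping of peers by a shared property (here, a content topic).
clusters = {
    "music": ["peer-a", "peer-b"],
    "video": ["peer-c", "peer-d"],
    "text": ["peer-e"],
}
files = {"peer-b": {"song.mp3"}, "peer-c": {"clip.avi"}}

hits, contacted = clustered_search(clusters, files, "music", "song.mp3")
print(hits, contacted)  # ['peer-b'] 2
```

Only two of the five peers receive the query; the video and text clusters are never contacted.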
11
Other improvements: today, most networks still rely on brute-force search. CRC/MD5 hashing: a checksum of each file is computed; instead of searching metadata, search for the file hash. Files that are identical but mislabeled are still returned.
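Searching by checksum rather than by name can be sketched with Python's standard `hashlib`. The `shared` table, its peer names and the byte strings standing in for file contents are made up for the example.

```python
import hashlib


def file_digest(data):
    # MD5 checksum of the file contents, as a hex string.
    return hashlib.md5(data).hexdigest()


def search_by_hash(shared, digest):
    # Match on content checksum and ignore the filename entirely.
    return sorted(key for key, data in shared.items() if file_digest(data) == digest)


# Assumed shared files: (peer, filename) -> contents.
shared = {
    ("peer-a", "blue_song.mp3"): b"AUDIO-1",
    ("peer-b", "track01.mp3"): b"AUDIO-1",    # same content, mislabeled
    ("peer-c", "blue_song.mp3"): b"AUDIO-2",  # same name, different content
}

wanted = file_digest(b"AUDIO-1")
matches = search_by_hash(shared, wanted)
print(matches)
```

The mislabeled copy on `peer-b` is returned, while the file on `peer-c` that merely shares a name is not.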
12
Query time limiting: to save on inter-system bandwidth, searches terminate after X hops. The client ends the query after 100 results. Searches time out after X seconds.
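The hop limit and the result cap can be sketched as two extra conditions on a flood search. The chain-shaped `graph`, the `files` table and the `limited_search` function are hypothetical; the time-out is left out to keep the sketch deterministic.

```python
from collections import deque


def limited_search(graph, files, start, wanted, max_hops, max_results):
    """Flood a query, but stop forwarding after `max_hops` and stop at `max_results` hits."""
    seen = {start}
    queue = deque([(start, 0)])
    hits = []
    while queue and len(hits) < max_results:
        node, hops = queue.popleft()
        if wanted in files.get(node, ()):
            hits.append(node)
        if hops == max_hops:
            continue  # hop budget spent: this node does not forward the query
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, hops + 1))
    return hits


# Assumed toy network: a chain a - b - c - d.
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
files = {"b": {"song.mp3"}, "d": {"song.mp3"}}

print(limited_search(graph, files, "a", "song.mp3", max_hops=2, max_results=100))  # ['b']
print(limited_search(graph, files, "a", "song.mp3", max_hops=3, max_results=100))  # ['b', 'd']
```

With a two-hop limit the copy on `d` is never found: the limits save bandwidth at the cost of complete results.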
13
Distributed IR: traditional IR with the advantages of distributed systems. A central server still stores the index. Multiple brokers allow access to the data repository. Multiple gatherers crawl data near them. The advantages are seen on the data acquisition end.
14
Examples
15
Future Directions: the next steps will be a drastic rethinking of content placement, à la Freenet. Donate X amount of bandwidth and Y amount of HD space. Share Z directories of content. Actual content files are distributed to the network intelligently: the most requested files are blanketed across the network; unique files are still accessible.
16
Future directions for Traditional IR
Large central repositories such as Google will fade. The Internet will be fragmented into clusters of interest. Similar interest groups will have decentralized search facilities. An index of these groups will replace the Googles of today.