Efficient Peer to Peer Keyword Searching Nathan Gray.

Presentation transcript:

Efficient Peer to Peer Keyword Searching Nathan Gray

Introduction
Current applications (Chord, Freenet) don't provide keyword search functionality
The system developed uses a DHT that stores, for each keyword, the list of documents containing it

Introduction
Topics to be covered:
  – Search model and design
  – Simulation
Idea: the authors believe end-user latency is the most important metric
  – Most latency comes from network transfer time
  – Goal: minimize the number of bytes sent

System Model
Search
  – Associating keywords with document IDs
  – Retrieving document IDs matching keywords from the DHT
Inverted index
  – Maps each word found in a document to the list of documents where the word is found
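The inverted index described on this slide can be sketched in a few lines. This is a minimal single-machine illustration, not the authors' implementation; the function name `build_inverted_index` and the sample documents are assumptions made for the example.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each word to the sorted list of document IDs that contain it.

    `documents` is a dict of {doc_id: text}; both are hypothetical inputs.
    """
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    # Sorted lists make the output deterministic and easy to intersect later.
    return {word: sorted(ids) for word, ids in index.items()}


docs = {
    "doc1": "peer to peer keyword search",
    "doc2": "keyword search with bloom filters",
    "doc3": "distributed hash tables",
}
index = build_inverted_index(docs)
print(index["keyword"])
```

In the distributed setting each entry of `index` would live on the DHT node responsible for that word, rather than in one dictionary.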

Partitioning
Horizontal
  – Requires all nodes be contacted
  – Broadcast queries to all nodes
Vertical
  – Minimizes cost of searches -> ensures that no more than k servers participate in querying k keywords
  – Most changes in a file system (FS) occur in bursts -> utilize lazy updating
  – Send queries to a bounded number of hosts
  – Throughput grows linearly with system size
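The "no more than k servers for k keywords" property of vertical partitioning follows from each keyword having exactly one home node. A rough sketch, assuming a simple hash-modulo placement (a real DHT such as Chord uses consistent hashing instead; `node_for_keyword` and `nodes_for_query` are hypothetical names):

```python
import hashlib


def node_for_keyword(keyword, num_nodes):
    """Pick the node that stores the document list for `keyword`.

    Assumption: hash modulo node count stands in for a DHT lookup.
    """
    digest = hashlib.sha1(keyword.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % num_nodes


def nodes_for_query(keywords, num_nodes):
    """Return the set of nodes that must take part in a query."""
    return {node_for_keyword(k, num_nodes) for k in keywords}


query = ["peer", "keyword", "search"]
contacted = nodes_for_query(query, num_nodes=64)
print(len(contacted), "nodes contacted for", len(query), "keywords")
```

Under horizontal partitioning the same query would have to reach all 64 nodes, because any node might hold matching documents.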

Partitioning

Why Distribute the Searching?
Google succeeds and it uses centralized searching
  – Bad idea to concentrate both load and trust on a small number of hosts
  – A distributed system would run on voluntarily contributed end-user machines
  – A distributed system benefits more from replication
Less susceptible to correlated failures

Ranking
Key idea
  – Order of documents presented to the user
  – Google PageRank uses the hyperlinked nature of the web
  – P2P doesn't necessarily have the hyperlinked infrastructure of the web, but can use word position and proximity

Update Discovery
Search engine must discover new, removed, or modified documents
Distributed environments benefit most from pushed updates as opposed to broadcast
  – Efficiency
  – Currency of index

P2P Search Support
Want to show that keyword search in P2P is feasible
Remote servers contacted
  – Look up the mapping of words to documents
Peers contacted across the network
  – Intersection sets calculated -> the user usually wants only a small subset of matching documents

P2P Search
Key challenge:
  – Perform efficient searches
Limit the amount of bandwidth used
To calculate A ∩ B, server A's document list is sent to server B
Server B discards most of what server A sent, because the intersection is smaller than A (the documents matching A's keyword)

Bloom Filters (BF)
Recall: a BF summarizes membership in a set
In this paper,
  – BFs compress the data sent between servers (the intersections)
  – Reduce the amount of communication
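A rough sketch of how a Bloom filter can stand in for a full document list when two servers intersect their sets. Server A sends only the filter, server B returns the candidates that pass it, and server A drops the false positives. The `BloomFilter` class, the hash construction, and the default parameters are assumptions chosen for illustration, not the settings used in the paper.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter backed by a Python int used as a bit array."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


def intersect_via_bloom(docs_a, docs_b, bits_per_entry=10, num_hashes=4):
    """Compute A ∩ B while only shipping a filter and a candidate set."""
    # Server A: summarize its document list and send the filter to B.
    bf = BloomFilter(max(1, bits_per_entry * len(docs_a)), num_hashes)
    for doc in docs_a:
        bf.add(doc)
    # Server B: keep documents that might be in A (may include false positives).
    candidates = {doc for doc in docs_b if doc in bf}
    # Server A: remove false positives by checking against the real list.
    return candidates & set(docs_a)


a = {f"doc{i}" for i in range(100)}
b = {f"doc{i}" for i in range(50, 150)}
result = intersect_via_bloom(a, b)
print(len(result))
```

Because a Bloom filter never reports a false negative, the final check on server A yields the exact intersection; the cost is the extra round trip for the candidate set.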

BF
Data assumed to have 128-bit hashes
BFs generally give a 12:1 compression ratio
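To get a feel for where a compression ratio of this order comes from, the sketch below compares the size of a raw list of 128-bit IDs with a Bloom filter of the same set, and evaluates the standard false-positive estimate (1 - e^(-kn/m))^k. This is textbook Bloom filter arithmetic, not the paper's own analysis; the 12:1 figure on the slide may also account for traffic (such as returned candidates) that this one-way calculation ignores. Function names are hypothetical.

```python
import math

ID_BITS = 128  # document ID size stated on the slide


def one_way_compression_ratio(bits_per_entry):
    """Bits in a raw ID list divided by bits in a Bloom filter of the same set."""
    return ID_BITS / bits_per_entry


def false_positive_rate(bits_per_entry, num_hashes):
    """Standard estimate (1 - e^(-k*n/m))^k, with m/n = bits_per_entry."""
    return (1.0 - math.exp(-num_hashes / bits_per_entry)) ** num_hashes


for bits in (8, 16, 24):
    # k = (m/n) * ln 2 is the usual choice that minimizes false positives.
    k = max(1, round(bits * math.log(2)))
    print(bits, one_way_compression_ratio(bits), false_positive_rate(bits, k))
```

The trade-off is visible in the loop: fewer bits per entry compress better but let more false positives through, which then cost extra bytes when they are sent back for checking.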

Caches
Goal:
  – Want to store more keywords
  – Cache the BF for the successor host
  – Keyword popularity follows a Zipf distribution (heavy-tailed)
  – Popular keywords are dominant
So caching of the BF or the entire doc list F(A) gives a high hit ratio

Cache
A higher cache hit rate reduces the number of excess bits sent
Compression ratio grows linearly with the bit reduction
Consistency
  – TTL scheme used
  – Updates at the keyword's primary location only
  – Small staleness factor, expected given Web update patterns (assumption)
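The TTL consistency scheme mentioned here can be illustrated with a small cache whose entries simply expire after a fixed lifetime. This is a generic sketch, assuming one cache entry per keyword; the class name `TTLCache` and the injectable `clock` parameter are assumptions made so the example is easy to exercise.

```python
import time


class TTLCache:
    """Cache keyword -> value (e.g. a Bloom filter) with a fixed time-to-live."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be simulated
        self.entries = {}

    def put(self, keyword, value):
        self.entries[keyword] = (value, self.clock() + self.ttl)

    def get(self, keyword):
        entry = self.entries.get(keyword)
        if entry is None:
            return None  # miss: fetch from the keyword's primary location
        value, expires_at = entry
        if self.clock() >= expires_at:
            del self.entries[keyword]  # stale: treat as a miss
            return None
        return value


now = [0.0]
cache = TTLCache(ttl_seconds=10, clock=lambda: now[0])
cache.put("keyword", "filter-for-keyword")
fresh = cache.get("keyword")
now[0] = 10.0  # simulate 10 seconds passing
stale = cache.get("keyword")
print(fresh, stale)
```

With this scheme a cached entry can be out of date for at most one TTL period, which is the bounded staleness the slide refers to.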

Incremental Results
Look at scalability
  – Desired number of results
  – Cost is O(n) in the size of the network
  – BF and caching provide only a constant-factor, O(1), improvement in data sent
Chunks
  – Partial cache hits for each keyword
  – Reduce the amount of cache allotted to each keyword
  – Con: large CPU overhead
  – Solution: send contiguous chunks
Tell server B which portion of the hash to test
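The idea behind incremental results is to stop as soon as enough matches have been found, so the data sent depends on the number of results wanted rather than on the size of the document lists. A simplified sketch, assuming the first server's list is already in rank order and omitting the Bloom filter step; `incremental_intersection` and its parameters are hypothetical.

```python
def incremental_intersection(docs_a, docs_b, wanted, chunk_size):
    """Intersect chunk by chunk, stopping once `wanted` results are found.

    Returns (results, chunks_sent) so the communication cost is visible.
    """
    set_b = set(docs_b)
    results = []
    chunks_sent = 0
    for start in range(0, len(docs_a), chunk_size):
        chunk = docs_a[start:start + chunk_size]  # one contiguous chunk
        chunks_sent += 1
        results.extend(doc for doc in chunk if doc in set_b)
        if len(results) >= wanted:
            break  # enough results: later chunks are never sent
    return results[:wanted], chunks_sent


ranked_a = list(range(100))
evens = list(range(0, 100, 2))
first_five, sent_correlated = incremental_intersection(ranked_a, evens, wanted=5, chunk_size=10)
none_found, sent_uncorrelated = incremental_intersection(ranked_a, [], wanted=5, chunk_size=10)
print(first_five, sent_correlated, sent_uncorrelated)
```

The second call shows the worst case: when the two sets share nothing, every chunk has to be sent before the search can give up.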

Discussion
2 issues
  – End-to-end query latency
  – Number of bytes sent
BF gives compression bonus
  – Latency
  – Probability of false positives
Caching
  – Reduces FP probability
  – Reduces bandwidth costs

Discussion
Incremental Results (IR)
  – Assume the user wants only a limited number of results
    1) Reduces the number of bytes sent
    2) End-to-end query latency stays constant as the network grows
  – Risk: popular but uncorrelated results
  – Entire search space needs to be checked
  – Increases the number of bytes sent
  – BF still gives a 10:1 compression ratio over the whole document list
  – BF and IR complicate ranking schemes
  – BFs do not allow:
    Ordering of set members
    Conveying metadata along with the result set

Discussion Cont'd
If IR sends later chunks with lower rank -> previous results are better
Risk: order within a chunk is lost
  – Maintained overall though
Key:
  – Rank is more important than small bandwidth or latency benefits

Simulation
Goals:
  – Test the number of nodes in the network with realistic numbers
  – Bloom filter threshold sizes
  – Caching
  – Incremental results

Simulation Characteristics
Doc size: 1.85 Gb of HTML
1.17 million unique words
Three types of node distribution
  – Modems
  – Backbone links
  – Measurements of a Gnutella-like network
Randomized latencies
  – 2500 square mile grid
  – Packets assumed to travel 100K miles/sec (SOL)

Simulation Cont'd
Documents: identifiers of 128 bits
Process:
  – Simulate lookup of a keyword (KW) in the inverted index
  – Map index to M search results
  – Node -> intersections
  – Using BF, send intersection to another host (size dependent, might be whole doc list)
  – Host checks for the next host's doc list in cache
Yes? Performs the intersection for that host and skips that communication phase

Experimental Results
Goal: performance effects of keyword search in a P2P network

Virtual Hosts
Concept: varying the number of nodes/hosts per machine
Result:
  – Little effect on the amount of data sent over the network
  – Network times cut by 60% for local nodes
  – Reduced chance of load-balance issues

BF and Caching
BF: drawback is increased network transactions (FP checking)
  – Initial comparison
  – Remove false positives

BF and Caching Results
BF
  – BF threshold: 300
  – Smaller number of keywords requested -> entire result list sent
  – Why? Benefit in bandwidth << latency introduced
Caching
  – Decreased the number of bytes sent
  – Increased optimal BF size (~24 bits/entry)
  – 50% decrease in number of bytes sent per query

Conclusions
Keyword searching in P2P networks is feasible
Traffic growth is linear with the size of the network
Improved completeness relative to crawling (centralized keyword search)
BF/VH/Caching/Incremental Results:
  – Reduce network resources consumed
  – Decrease end-to-end client search latency