1 Exploiting locality for scalable information retrieval in peer-to-peer networks D. Zeinalipour-Yazti, Vana Kalogeraki, Dimitrios Gunopulos Manos Moschous.

Presentation transcript:

1 Exploiting locality for scalable information retrieval in peer-to-peer networks D. Zeinalipour-Yazti, Vana Kalogeraki, Dimitrios Gunopulos Manos Moschous November, 2004

2 Issues in p2p networks  Content-based vs. file-identifier information retrieval.  Dynamic networks (ad-hoc).  Scalability (global knowledge).  Query messages (flooding – network congestion).  Recall rate.  Efficiency (recall rate / query messages).  Query Response Time (QRT).

3 IR in pure p2p networks  BFS technique  Each peer forwards the query to all its neighbors  Simple  Performance  Network utilization  Use of TTL  RBFS technique  Each peer forwards the query to a random subset of its neighbors  Reduce query messages  Probabilistic algorithm
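The two flooding techniques on this slide can be sketched as a single routine. This is a minimal illustration rather than the authors' implementation: the function name `flood`, the `fraction` parameter, and the choice not to send a query back to the peer it came from are all assumptions made for the example.

```python
import random
from collections import deque

def flood(graph, source, ttl, fraction=1.0, seed=0):
    """TTL-limited flooding over an adjacency-list graph.

    fraction=1.0 behaves like BFS (forward to all neighbors);
    fraction<1.0 behaves like RBFS (forward to a random subset).
    Returns (set of reached nodes, number of messages sent).
    """
    rng = random.Random(seed)
    seen = {source}
    messages = 0
    frontier = deque([(source, None, ttl)])
    while frontier:
        node, sender, remaining = frontier.popleft()
        if remaining == 0:
            continue  # TTL exhausted, the query stops here
        # Assumption: a peer never sends the query back to its sender.
        candidates = [n for n in graph[node] if n != sender]
        if fraction < 1.0 and candidates:
            count = max(1, round(fraction * len(candidates)))
            candidates = rng.sample(candidates, count)
        for nxt in candidates:
            messages += 1  # every forward costs one message
            if nxt not in seen:  # duplicates are dropped on arrival
                seen.add(nxt)
                frontier.append((nxt, node, remaining - 1))
    return seen, messages
```

On a fully connected five-node graph with TTL=2, the BFS setting sends 16 messages while `fraction=0.5` sends 6, which illustrates the message reduction that RBFS trades against a probabilistic chance of missing peers.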

4 IR in pure p2p networks  >RES technique  Each peer forwards the query to some of its peers based on some aggregated statistics.  Heuristic: The Most Results in Past (for the last 10 queries).  Tends to explore:  The larger network segments.  The most stable neighbors.  But not necessarily the nodes that contain content related to the query.  >RES is a quantitative rather than qualitative approach.
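A rough sketch of the "Most Results in Past" heuristic follows. The class name, method names, and the use of a per-neighbor sliding window are assumptions for illustration; the slide only states that the statistic covers the last 10 queries.

```python
from collections import defaultdict, deque

class MostResultsInPast:
    """Rank neighbors by the results they returned for recent queries."""

    def __init__(self, window=10):
        # One bounded history per neighbor; old entries fall off automatically.
        self.history = defaultdict(lambda: deque(maxlen=window))

    def record(self, neighbor, num_results):
        self.history[neighbor].append(num_results)

    def select(self, neighbors, k):
        # Purely quantitative: the query content is never inspected.
        def score(n):
            return sum(self.history.get(n, ()))
        return sorted(neighbors, key=score, reverse=True)[:k]
```

Because `select` ignores the query text, a neighbor that returned many results for unrelated queries still ranks first, which is the quantitative-versus-qualitative limitation noted on the slide.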

5 The intelligent search mechanism (ISM)  Main Idea: Each peer estimates, for each query, which of its peers are most likely to reply to that query, and propagates the query message to those peers only.  Exploit the locality of past queries.  Some characteristics:  Entirely distributed (requires only local knowledge).  Scales well with the size of the network.  Scales well to large data sets.  Works well in dynamic environments.  High recall rates.  Minimizes the communication costs.

6 Architecture (ISM) (1/4)  Profiling structure:  Single queries table  LRU policy to keep the most recent queries  Table size is limited → good performance
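The profiling structure can be approximated with an ordered dictionary that evicts the least recently used query. The layout below (a query mapped to per-peer result counts) and the names `QueryProfile` and `record` are assumptions for this sketch; the slide does not specify the table's columns.

```python
from collections import OrderedDict

class QueryProfile:
    """Bounded table of recent queries with LRU eviction."""

    def __init__(self, capacity):
        self.capacity = capacity
        # Maps frozenset of query terms -> {peer: number of results}.
        self.table = OrderedDict()

    def record(self, query_terms, peer, num_results):
        key = frozenset(query_terms)
        hits = self.table.pop(key, {})
        hits[peer] = hits.get(peer, 0) + num_results
        self.table[key] = hits  # re-insert at the most-recent end
        while len(self.table) > self.capacity:
            self.table.popitem(last=False)  # drop the least recent query
```

Keeping the table small bounds both memory and the cost of scanning it on every routing decision.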

7 Architecture (ISM) (2/4)  Query Similarity function (cosine similarity)  Assumption: A peer that has a document relevant to a given query is also likely to have other documents that are relevant to other similar queries.  Qsim: Q² → [0,1]  L: the set of all words that have appeared in queries, {1,1,1,1}  q: {1,1,0,0}  q_i: {1,0,1,0}
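As a worked example, the standard cosine similarity can be applied to the two vectors from the slide. The function name `qsim` is chosen here to mirror Qsim; returning 0.0 for an all-zero vector is an assumption, since the slide does not cover that case.

```python
import math

def qsim(q, qi):
    """Cosine similarity of two equal-length term vectors."""
    dot = sum(a * b for a, b in zip(q, qi))
    norm_sq = sum(a * a for a in q) * sum(b * b for b in qi)
    if norm_sq == 0:
        return 0.0  # assumption: an empty query is similar to nothing
    return dot / math.sqrt(norm_sq)

q = [1, 1, 0, 0]
q_i = [1, 0, 1, 0]
similarity = qsim(q, q_i)  # one shared term out of two each: 1 / sqrt(2 * 2)
```

For these vectors the similarity is 0.5, and for binary vectors the result always lies in [0, 1], matching the stated range of Qsim.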

8 Architecture (ISM) (3/4)  Peer ranking (Relevance Rank)  P_i: each peer.  P_l: the decision-maker node.  a: allows us to add more weight to the most similar queries.  S(P_i, q_j): the number of results returned by P_i for query q_j.
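The ranking formula itself did not survive in this transcript. The sketch below assumes the form usually given for ISM, in which the rank of P_i for a query q is the sum over past queries q_j answered by P_i of Qsim(q_j, q)^a · S(P_i, q_j). The function names and the representation of the profile as a list of (terms, results-per-peer) pairs are illustrative assumptions.

```python
import math

def qsim(q, qj):
    """Cosine similarity of two keyword sets (binary term vectors)."""
    q, qj = set(q), set(qj)
    if not q or not qj:
        return 0.0
    return len(q & qj) / math.sqrt(len(q) * len(qj))

def relevance_rank(profile, query, a=1.0):
    """Assumed form: RR(P_i, q) = sum_j Qsim(q_j, q)**a * S(P_i, q_j).

    profile: iterable of (past_query_terms, {peer: results_returned}).
    Returns {peer: rank} as computed locally by the decision-maker node.
    """
    ranks = {}
    for past_terms, hits in profile:
        weight = qsim(query, past_terms) ** a
        for peer, results in hits.items():
            ranks[peer] = ranks.get(peer, 0.0) + weight * results
    return ranks
```

With a > 1, small similarities shrink faster than large ones, so the most similar past queries dominate the sum, which is the weighting role the slide attributes to a.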

9 Architecture (ISM) (4/4)  Search Mechanism  Invoke RR function.  Forward query to k (threshold) peers only.
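Once ranks are available, the forwarding step reduces to a top-k selection. A minimal sketch, assuming the ranks arrive as a dictionary and that unranked neighbors count as zero (both assumptions, as is the name `choose_targets`):

```python
import heapq

def choose_targets(ranks, neighbors, k):
    """Pick the k highest-ranked neighbors to receive the query."""
    return heapq.nlargest(k, neighbors, key=lambda peer: ranks.get(peer, 0.0))
```

A peer with no history yet gets rank 0, so early on the choice among unknown neighbors simply follows their order in the neighbor list.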

10 Experiments  Peerware: A distributed middleware infrastructure  GraphGen: generates network topologies.  dataPeer: p2p client which answers Boolean queries from its local XML repository (XQL).  SearchPeer: p2p client that performs queries and harvests answers back from a Peerware network (connects to a dataPeer and performs queries).

11 Experiments - DMP  If node P_k receives the same query q again with some TTL_2, where TTL_2 > TTL_1 (TTL_1 being the TTL of the copy it received earlier), we allow the TTL_2 message to proceed.  This may allow q to reach more peers than its predecessor.  Without this fix the BFS behaviour is not predictable and therefore is not able to find the nodes that we were supposed to find.  Our experiments revealed that almost 30% of the forwarded queries were discarded because of DMP.  The experimental results presented in this work do not suffer from DMP.  This is the reason why the number of messages is slightly higher (~30%) than the expected number of messages.  The total number of messages should be for n nodes each of which with a degree d_i.
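The fix described on this slide amounts to remembering, per query, the highest TTL seen so far instead of a simple seen/not-seen flag. A small sketch under that reading; the class and method names are hypothetical.

```python
class DuplicateFilter:
    """Decide whether a (possibly repeated) query should be forwarded."""

    def __init__(self):
        self.best_ttl = {}  # query id -> highest TTL seen so far

    def should_forward(self, query_id, ttl):
        previous = self.best_ttl.get(query_id)
        if previous is not None and ttl <= previous:
            return False  # an equally or more far-reaching copy already passed
        # First copy, or a later copy with a larger TTL: let it proceed.
        self.best_ttl[query_id] = ttl
        return True
```

A plain duplicate filter would discard the second copy even when it could travel further, which is how a slow, low-TTL copy can cut off part of the network.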

12 Experiments-DMP  Query examples  A set of 4 keywords  1 keyword >= 4 characters  # Query:  1. AUSTRIA INTERVENE DOES DOLLAR  2. APPROVES MEDITERRANEAN FINANCIAL PACKAGES  3. AGREES PEACE NEW MOVES  Random Topology: Each vertex selects its d neighbors randomly. Simple. Leads to connected topologies if the degree d > log_2 n.
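The random topology described here can be generated and checked for connectivity in a few lines. This is a sketch, not the GraphGen tool: the function names are made up, and links are treated as undirected, which the slide does not state.

```python
import math
import random
from collections import deque

def random_topology(n, d, seed=0):
    """Each vertex picks d distinct random neighbors; links are undirected."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    for v in range(n):
        others = [u for u in range(n) if u != v]
        for u in rng.sample(others, d):
            adj[v].add(u)
            adj[u].add(v)
    return adj

def is_connected(adj):
    """Breadth-first traversal from an arbitrary vertex."""
    start = next(iter(adj))
    seen = {start}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        for u in adj[v]:
            if u not in seen:
                seen.add(u)
                queue.append(u)
    return len(seen) == len(adj)

# The Set1 configuration: 104 nodes, degree 8 > log2(104), about 6.7.
topology = random_topology(104, 8)
```

With d above log_2 n the generated graph is connected with very high probability, though a single run of `is_connected` only checks one sample rather than proving the claim.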

13 Experiments (Set1)  Reuters – Peerware  Random topology of 104 nodes (static) with average degree 8 (running on a network of 75 workstations).  Categorize the documents by their country attribute (104 country files, one per node). Each country file has at least 5 articles.  Data Sets:  Reuters 10X10: set of 10 random queries which are repeated 10 consecutive times (high locality of similar queries) – better suited to ISM.  Reuters 400: set of 400 random queries which are uniformly sampled from the initial 104 country files (lower repetition).

14 Results (Set1) – Reuters 10X10 (1/4)  Reducing query messages  ISM finds the most documents compared to RBFS and >RES.  ISM achieves almost 90% (recall rate) while using only 38% of BFS’s messages.  ISM and >RES start out with low recall rate.  Suffer from low recall rate.

15 Results (Set1) – Reuters 10X10 (2/4)  Digging deeper by increasing TTL  Reach more nodes deeper.  ISM achieves 100% recall rate while using only 57% of BFS’s messages with TTL=4.

16 Results (Set1) – Reuters 10X10 (3/4)  Reducing query response time (QRT)  ~30-60% of BFS’s QRT for TTL=4 and ~60-80% for TTL=5.  ISM requires more time than >RES because its decision involves some computation over the past queries.

17 Results (Set1) – Reuters 400 (4/4)  Improving the recall rate over time  ISM achieves 95% recall rate while using 38% of BFS’s messages.  During queries major outbreaks occur in BFS.  ISM requires a learning period of about 100 queries before it starts competing with the performance of >RES.

18 Experiments (Set2)  TREC-LATimes Peerware (random topology of 1000 nodes – static)  It contains approximately 132,000 articles.  These articles were horizontally partitioned into 1000 documents (each document contains 132 articles).  Each peer shares one or more of the 1000 documents (replicated articles).

19 Experiments (Set2)  Data Sets:  TREC 100: a set of 100 queries out of the initial 150 topics.  TREC 10X10: a list of 10 randomly sampled queries, out of the initial 150 topics, which are repeated 10 consecutive times.  TREC 50X2: for which we first generated a set a=“50 randomly sampled queries out of the initial 150 topics” merged with a generated list of another 50 queries which are randomly sampled out of a.
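The two synthetic query sets could be built roughly as follows. Several details are assumptions, because the slide leaves them open: that "repeated 10 consecutive times" means the whole list of 10 is replayed 10 times, that the second half of TREC 50X2 is drawn from a with replacement, and that "merged" means appended. The function names are hypothetical.

```python
import random

def make_10x10(topics, rng):
    """10 sampled queries, with the whole list replayed 10 times (assumed)."""
    base = rng.sample(topics, 10)
    return base * 10

def make_50x2(topics, rng):
    """50 sampled queries followed by 50 more drawn from those 50 (assumed
    to be with replacement and appended after the first 50)."""
    a = rng.sample(topics, 50)
    repeats = [rng.choice(a) for _ in range(50)]
    return a + repeats

rng = random.Random(0)
topics = list(range(1, 151))  # stand-ins for the 150 initial topics
trec_10x10 = make_10x10(topics, rng)
trec_50x2 = make_50x2(topics, rng)
```

Both sets contain 100 queries, but the first repeats every query 10 times while the second repeats only some of them, which gives it the more uneven term frequency described for TREC 50X2 later in the deck.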

20 Results (Set2) – TREC100 (1/3)  Searching in a large-scale network topology  For TTL=5 we reach 859 of 1000 nodes (BFS).  For TTL=6 we reach 998 of 1000 nodes at a cost of 8500 m/q.  For TTL=7 we reach all nodes at a cost of 10,500 m/q.  ISM will not exhibit any learning behavior if the frequency of terms is very low.

21 Results (Set2) – TREC 10X10 (2/3)  The effect of high term frequency  The recall rate will improve dramatically if the frequency of terms is high.  ISM achieves higher recall rate than BFS (BFS’s TTL=5).  After the learning phase of queries it scores 120% of BFS’s recall rate while using 4 times fewer messages.

22 Results (Set2) – TREC 50X2 (3/3)  The effect of high term frequency  More realistic set: a few terms occur many times in queries and most terms occur less frequently.  ISM monotonically improves its recall rate and at the 90th query it again exceeds BFS performance.  >RES’s recall rate fluctuates and behaves as badly as RBFS if the queries don’t follow any constant pattern.

23 Experiments (Set3)  Searching in dynamic network topologies  Why network failures?  Misuse at the application layer (shutting down the PC without disconnecting).  Overwhelming amount of generated network traffic.  Poorly written p2p clients.  Simulate dynamic environment  Total number of suspended nodes is no more than drop_rate.  drop_rate is evaluated every k seconds against a random number r.  If r < drop_rate the node will break all incoming and outgoing connections (for l seconds).  In our experiments:  k=60,000 ms and l=60,000 ms.  TREC-LATimes Peerware with the TREC 10X10 query set.  drop_rate takes values in {0.0, 0.05, 0.1, 0.2}.  r is a random number which is uniformly generated in [ )
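The failure model can be simulated in a few lines. The sketch assumes r is drawn uniformly from [0, 1), since the interval is missing from the transcript, and it evaluates every node independently at each tick; the function name and return value are illustrative.

```python
import random

def simulate_failures(num_nodes, drop_rate, rounds,
                      k_ms=60_000, l_ms=60_000, seed=0):
    """Return the number of suspended nodes after each evaluation round."""
    rng = random.Random(seed)
    down_until = [0] * num_nodes  # time (ms) at which each node recovers
    suspended_per_round = []
    for step in range(rounds):
        now = step * k_ms
        for node in range(num_nodes):
            if down_until[node] > now:
                continue  # still suspended, not re-evaluated
            r = rng.random()  # assumption: uniform in [0, 1)
            if r < drop_rate:
                down_until[node] = now + l_ms  # drop all connections for l ms
        suspended_per_round.append(sum(t > now for t in down_until))
    return suspended_per_round

counts = simulate_failures(num_nodes=1000, drop_rate=0.1, rounds=20)
```

Because k and l are equal in the experiments, a node suspended in one round is back up and re-evaluated in the next, so the expected suspended fraction per round stays close to drop_rate.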

24 Results (Set3) (1/3)  BFS mechanism  The increase of drop_rate decreases the number of messages.  BFS does not exhibit any learning behavior at any level of drop_rate.  BFS is tolerant of small drop_rates (5%) because it is highly redundant.

25 Results (Set3) (2/3)  >RES mechanism  The increase of drop_rate decreases the number of messages.  >RES does not exhibit any learning behavior at any level of drop_rate.

26 Results (Set3) (3/3)  ISM mechanism  The increase of drop_rate decreases the number of messages.  Performs quite well at low levels of drop_rate.  Not expected to be tolerant of large drop_rates (the information gathered by the profiling structure becomes obsolete before it gets the chance to be utilized).

27 Extend ISM to different environments  The ISM mechanism could easily become the query routing protocol for some hybrid p2p environments (KaZaa, Gnutella).  Super peers form a backbone infrastructure (long-time network connectivity).  Regular peers are unstable and less powerful.  How could it work?  A regular peer obtains a list of active super peers.  It connects to one or more super peers and posts queries.  Each super peer utilizes the ISM mechanism and forwards the query to a selective subset of its super peer neighbors.

28 Thank you