Exploiting locality for scalable information retrieval in peer-to-peer networks. D. Zeinalipour-Yazti, Vana Kalogeraki, Dimitrios Gunopulos, Manos Moschous.

1 Exploiting locality for scalable information retrieval in peer-to-peer networks
D. Zeinalipour-Yazti, Vana Kalogeraki, Dimitrios Gunopulos
Manos Moschous
November, 2004

2 Issues in p2p networks
- Information retrieval based on content vs. on file identifiers.
- Dynamic networks (ad-hoc).
- Scalability (global knowledge).
- Query messages (flooding, network congestion).
- Recall rate.
- Efficiency (recall rate / query messages).
- Query Response Time (QRT).

3 IR in pure p2p networks
- BFS technique
  - Each peer forwards the query to all its neighbors.
  - Simple.
  - Performance.
  - Network utilization.
  - Use of TTL.
- RBFS technique
  - Each peer forwards the query to a random subset of its neighbors.
  - Reduces query messages.
  - Probabilistic algorithm.
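The two forwarding rules above can be compared with a small simulation. The following is a minimal sketch, not the implementation used in the experiments: the function name `flood`, the `fraction` parameter, and the adjacency-list graph are assumptions made for illustration. With `fraction=1.0` every neighbor receives the query (BFS); with a smaller value only a random subset does (RBFS).

```python
import random
from collections import deque


def flood(graph, source, ttl, fraction=1.0, rng=None):
    """Return (reached_nodes, messages_sent) for a TTL-limited flood.

    graph maps each node to a list of its neighbors (hypothetical layout).
    fraction=1.0 forwards to every neighbor (BFS); fraction<1.0 forwards
    to a random subset of the neighbors (RBFS).
    """
    rng = rng or random.Random()
    seen = {source}
    messages = 0
    frontier = deque([(source, ttl, None)])
    while frontier:
        node, remaining, sender = frontier.popleft()
        if remaining == 0:
            continue  # TTL exhausted, the query is not forwarded further
        # Never send the query back to the peer it came from.
        candidates = [n for n in graph[node] if n != sender]
        if fraction < 1.0 and candidates:
            size = max(1, round(fraction * len(candidates)))
            candidates = rng.sample(candidates, size)
        for neighbor in candidates:
            messages += 1  # duplicates still cost a message
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, remaining - 1, node))
    return seen, messages
```

Calling `flood(graph, source, ttl)` and `flood(graph, source, ttl, fraction=0.5)` on the same graph gives a rough feel for the trade-off between messages sent and nodes reached.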

4 IR in pure p2p networks
- >RES technique
  - Each peer forwards the query to some of its neighbors, based on aggregated statistics.
  - Heuristic: the Most Results in Past (over the last 10 queries).
  - Explores:
    - The larger network segments.
    - The most stable neighbors.
    - Not: the nodes which contain content related to the query.
- >RES is a quantitative rather than a qualitative approach.
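The heuristic can be sketched as a per-neighbor counter with a sliding window. This is an illustration under assumptions: the class name `MostResultsInPast`, the choice to keep the window per neighbor, and the parameter `k` (how many neighbors receive the query) are not specified on the slide.

```python
from collections import defaultdict, deque


class MostResultsInPast:
    """Pick the k neighbors that returned the most results recently."""

    def __init__(self, k, window=10):
        self.k = k
        # neighbor -> result counts for the last `window` queries it answered
        self.history = defaultdict(lambda: deque(maxlen=window))

    def record(self, neighbor, num_results):
        """Remember how many results a neighbor returned for one query."""
        self.history[neighbor].append(num_results)

    def select(self, neighbors):
        """Return up to k neighbors, highest aggregated result count first."""
        def score(neighbor):
            return sum(self.history.get(neighbor, ()))
        return sorted(neighbors, key=score, reverse=True)[: self.k]
```

Note that the score looks only at how much a neighbor returned, not at what the queries were about, which is the sense in which the approach is quantitative.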

5 The intelligent search mechanism (ISM)
- Main idea: each peer estimates, for each query, which of its peers are more likely to reply to that query, and propagates the query message to those peers only.
- Exploits the locality of past queries.
- Some characteristics:
  - Entirely distributed (requires only local knowledge).
  - Scales well with the size of the network.
  - Scales well to large data sets.
  - Works well in dynamic environments.
  - High recall rates.
  - Minimizes the communication costs.

6 Architecture (ISM) (1/4)
- Profiling structure:
  - Single queries table.
  - LRU policy to keep the most recent queries.
  - Table size is limited, which gives good performance.
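A bounded table with LRU eviction can be sketched with an ordered dictionary. The class name `QueryProfile` and the stored fields (which peer answered a query, and with how many results) are assumptions; the slide only states that a single size-limited queries table with an LRU policy is kept.

```python
from collections import OrderedDict


class QueryProfile:
    """Size-limited table of recent queries with least-recently-used eviction."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # query -> {peer: number of results}

    def record(self, query, peer, num_results):
        """Store a reply and mark the query as most recently used."""
        hits = self.entries.setdefault(query, {})
        hits[peer] = num_results
        self.entries.move_to_end(query)
        # Evict the least recently used queries once the table is full.
        while len(self.entries) > self.capacity:
            self.entries.popitem(last=False)
```

Keeping the table small bounds both the memory per peer and the cost of scanning it on every incoming query.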

7 Architecture (ISM) (2/4)
- Query similarity function Qsim (cosine similarity), Qsim : Q² → [0, 1].
- Assumption: a peer that has a document relevant to a given query is also likely to have other documents that are relevant to other similar queries.
- L: the set of all words that have appeared in queries.
- Example with four words: L = {1,1,1,1}, q = {1,1,0,0}, q_i = {1,0,1,0}.
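For the example vectors above, the standard cosine similarity is the dot product divided by the product of the vector lengths: one shared word, and each query has two words, so the value is 1 / (√2 · √2) = 0.5. A minimal sketch follows; the function name `qsim` and the tuple representation are chosen here for illustration.

```python
import math


def qsim(q, q_i):
    """Cosine similarity of two term vectors over the same vocabulary L."""
    dot = sum(a * b for a, b in zip(q, q_i))
    norm = math.sqrt(sum(a * a for a in q)) * math.sqrt(sum(b * b for b in q_i))
    return dot / norm if norm else 0.0


# The example from the slide: one of two words is shared.
similarity = qsim((1, 1, 0, 0), (1, 0, 1, 0))  # approximately 0.5
```

With binary term vectors the result always lies in [0, 1], matching the range of Qsim.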

8 Architecture (ISM) (3/4)
- Peer ranking (Relevance Rank, RR)
  - P_i: each peer.
  - P_l: the decision-maker node.
  - α: allows us to add more weight to the most similar queries.
  - S(P_i, q_j): the number of results returned by P_i for query q_j.
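The ranking formula itself is not reproduced in this transcript. One form that is consistent with the definitions above, and which should be read as an assumption rather than a quotation, is RR(P_i, q) = Σ_j Qsim(q_j, q)^α · S(P_i, q_j), summed over the past queries q_j that P_i answered. A sketch under that assumption, with queries represented as keyword tuples and the helper names chosen for illustration:

```python
import math


def cosine(q1, q2):
    """Cosine similarity of two keyword sets (binary term vectors)."""
    a, b = set(q1), set(q2)
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


def relevance_rank(peer, query, past, alpha=1.0):
    """Assumed form: sum over past queries of Qsim^alpha * results by peer.

    past is a list of (past_query, {peer: number_of_results}) pairs, as
    seen locally by the decision-making node.
    """
    return sum(
        cosine(q_j, query) ** alpha * results.get(peer, 0)
        for q_j, results in past
    )
```

Raising the similarity to a power α greater than 1 shrinks the contribution of weakly similar queries faster than that of strongly similar ones, which is how more weight ends up on the most similar queries.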

9 Architecture (ISM) (4/4)
- Search mechanism
  - Invoke the RR function.
  - Forward the query to k (threshold) peers only.
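Once each neighbor has a rank, the forwarding step reduces to a top-k selection. A minimal sketch, assuming the ranks are already available in a dictionary (the name `select_peers` is hypothetical):

```python
import heapq


def select_peers(ranks, k):
    """Return the k peers with the highest relevance rank, best first.

    ranks maps each neighboring peer to its rank for the current query.
    """
    return heapq.nlargest(k, ranks, key=ranks.get)
```

For example, `select_peers({"P1": 4.0, "P2": 0.5, "P3": 2.5}, 2)` keeps `P1` and `P3` and drops `P2`.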

10 Experiments
- Peerware: a distributed middleware infrastructure.
- GraphGen: generates network topologies.
- dataPeer: p2p client which answers boolean queries from its local XML repository (XQL).
- SearchPeer: p2p client that performs queries and harvests answers back from a Peerware network (it connects to a dataPeer and performs queries).

11 Experiments - DMP
- If node P_k has already received query q with TTL_1 and later receives the same query q with TTL_2, where TTL_2 > TTL_1, we allow the TTL_2 message to proceed.
  - This may allow q to reach more peers than its predecessor.
- Without this fix the BFS behaviour is not predictable, and BFS may fail to find the nodes that it was supposed to find.
- Our experiments revealed that almost 30% of the forwarded queries were discarded because of DMP.
- The experimental results presented in this work do not suffer from DMP.
  - This is why the number of messages is slightly higher (~30%) than the expected number of messages.
- The expected total number of messages is a function of the n nodes and the degree d_i of each node (the formula is not preserved in this transcript).
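The fix described above amounts to remembering, per query, the highest TTL seen so far instead of a plain seen/not-seen flag. A minimal sketch, with the class and method names chosen for illustration:

```python
class DuplicateFilter:
    """Drop duplicate queries unless the new copy carries a higher TTL."""

    def __init__(self):
        self.best_ttl = {}  # query id -> highest TTL seen so far

    def should_forward(self, query_id, ttl):
        """Return True if this copy of the query may proceed."""
        previous = self.best_ttl.get(query_id)
        if previous is not None and ttl <= previous:
            return False  # an equally or more far-reaching copy already passed
        self.best_ttl[query_id] = ttl
        return True
```

A copy with a higher TTL can still travel further than the earlier one did, so letting it through restores the reach that a plain duplicate check would cut off, at the price of extra messages.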

12 Experiments - DMP
- Query examples
  - A set of 4 keywords.
  - 1 keyword >= 4 characters.

  #  Query
  1  AUSTRIA INTERVENE DOES DOLLAR
  2  APPROVES MEDITERRANEAN FINANCIAL PACKAGES
  3  AGREES PEACE NEW MOVES

- Random topology: each vertex selects its d neighbors randomly.
  - Simple.
  - Leads to connected topologies if the degree d > log_2 n.
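The random topology can be sketched as follows. This is not GraphGen itself: the function names are hypothetical, and treating links as undirected (so a node may end up with more than d neighbors) is an assumption. The connectivity check is included because d > log_2 n makes a connected graph likely rather than certain.

```python
import random
from collections import deque


def random_topology(n, d, rng=None):
    """Each of the n vertices picks d distinct neighbors at random.

    Links are treated as undirected (an assumption), so a vertex can end
    up with more than d neighbors.
    """
    rng = rng or random.Random()
    graph = {v: set() for v in range(n)}
    for v in range(n):
        others = [u for u in range(n) if u != v]
        for u in rng.sample(others, d):
            graph[v].add(u)
            graph[u].add(v)
    return graph


def is_connected(graph):
    """Breadth-first check that every vertex is reachable from the first."""
    start = next(iter(graph))
    seen = {start}
    queue = deque([start])
    while queue:
        for u in graph[queue.popleft()]:
            if u not in seen:
                seen.add(u)
                queue.append(u)
    return len(seen) == len(graph)
```

For a network of 104 nodes, log_2 104 is about 6.7, so a degree of 8 satisfies the condition.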

13 Experiments (Set1)
- Reuters-21578 Peerware
  - Random topology of 104 nodes (static) with average degree 8 (running on a network of 75 workstations).
  - The documents are categorized by their country attribute (104 country files, one per node). Each country file has at least 5 articles.
- Data sets:
  - Reuters 10X10: a set of 10 random queries which are repeated 10 consecutive times (high locality of similar queries). This set suits ISM better.
  - Reuters 400: a set of 400 random queries which are uniformly sampled from the initial 104 country files (lower repetition).

14 Results (Set1) – Reuters 10X10 (1/4)
- Reducing query messages
  - ISM finds the most documents compared to RBFS and >RES.
  - ISM achieves a recall rate of almost 90% while using only 38% of BFS's messages.
  - ISM and >RES start out with a low recall rate.
  - Suffer from low recall rate.

15 Results (Set1) – Reuters 10X10 (2/4)
- Digging deeper by increasing the TTL
  - Reaches more nodes, deeper in the network.
  - With TTL=4, ISM achieves a 100% recall rate while using only 57% of BFS's messages.

16 Results (Set1) – Reuters 10X10 (3/4)
- Reducing query response time (QRT)
  - ~30-60% of BFS's QRT for TTL=4 and ~60-80% for TTL=5.
  - ISM requires more time than >RES because its decision involves some computation over the past queries.

17 Results (Set1) – Reuters 400 (4/4)
- Improving the recall rate over time
  - ISM achieves a 95% recall rate while using 38% of BFS's messages.
  - During queries 150-200, major outbreaks occur in BFS.
  - ISM requires a learning period of about 100 queries before it starts competing with the performance of >RES.

18 Experiments (Set2)
- TREC-LATimes Peerware (random topology of 1000 nodes, static)
  - It contains approximately 132,000 articles.
  - These articles were horizontally partitioned into 1000 documents (each document contains 132 articles).
  - Each peer shares one or more of the 1000 documents (replicated articles).

19 Experiments (Set2)
- Data sets:
  - TREC 100: a set of 100 queries out of the initial 150 topics.
  - TREC 10X10: a list of 10 randomly sampled queries, out of the initial 150 topics, which are repeated 10 consecutive times.
  - TREC 50X2: a set a of 50 queries randomly sampled out of the initial 150 topics, merged with a generated list of another 50 queries which are randomly sampled out of a.
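The two repeated workloads can be sketched as below. Several details are assumptions because the slide does not spell them out: that "repeated 10 consecutive times" means the whole list of 10 is replayed 10 times, that "merged" means concatenated, and that the second 50 queries are drawn from a with replacement. The function names are hypothetical.

```python
import random


def repeated_workload(topics, distinct=10, repeats=10, rng=None):
    """10X10-style workload: a sampled list of queries replayed several times."""
    rng = rng or random.Random()
    base = rng.sample(topics, distinct)
    return base * repeats


def mixed_workload(topics, distinct=50, rng=None):
    """50X2-style workload: a sampled set a, followed by re-samples from a."""
    rng = rng or random.Random()
    a = rng.sample(topics, distinct)
    # Assumption: the second half is drawn from a with replacement, so some
    # queries repeat several times and others only once.
    extra = [rng.choice(a) for _ in range(distinct)]
    return a + extra
```

Both generators produce 100 queries from 150 topics; they differ in how often a given query recurs, which is the property the ISM results depend on.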

20 Results (Set2) – TREC 100 (1/3)
- Searching in a large-scale network topology
  - For TTL=5 we reach 859 of 1000 nodes (BFS).
  - For TTL=6 we reach 998 of 1000 nodes at a cost of 8500 m/q (messages per query).
  - For TTL=7 we reach all nodes at a cost of 10,500 m/q.
  - ISM will not exhibit any learning behavior if the frequency of terms is very low.

21 Results (Set2) – TREC 10X10 (2/3)
- The effect of high term frequency
  - The recall rate improves dramatically if the frequency of terms is high.
  - ISM achieves a higher recall rate than BFS (BFS with TTL=5).
  - After a learning phase of 20-30 queries it scores 120% of BFS's recall rate while using 4 times fewer messages.

22 Results (Set2) – TREC 50X2 (3/3)
- The effect of high term frequency
  - A more realistic set: a few terms occur many times in queries and most terms occur less frequently.
  - ISM monotonically improves its recall rate and at the 90th query it again exceeds BFS's performance.
  - >RES's recall rate fluctuates and behaves as badly as RBFS if the queries don't follow any constant pattern.

23 Experiments (Set3)
- Searching in dynamic network topologies
- Why network failures?
  - Misuse at the application layer (shutting down the PC without disconnecting).
  - Overwhelming amount of generated network traffic.
  - Some poorly written p2p clients.
- Simulating a dynamic environment
  - The total number of suspended nodes is no more than drop_rate.
  - drop_rate is evaluated every k seconds against a random number r.
  - If r < drop_rate, the node breaks all incoming and outgoing connections (for l seconds).
- In our experiments:
  - k = 60,000 ms and l = 60,000 ms.
  - TREC-LATimes Peerware with the TREC 10X10 query set.
  - drop_rate ∈ {0.0, 0.05, 0.1, 0.2}.
  - r is a random number which is uniformly generated in [0.0, 1.0).
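The failure model can be sketched as a discrete simulation that is evaluated once every k milliseconds. The function name, the bookkeeping, and the reading of "no more than drop_rate" as a cap on the fraction of suspended nodes are assumptions made for illustration.

```python
import random


def simulate_failures(nodes, drop_rate, steps, k_ms=60_000, l_ms=60_000, rng=None):
    """Return, for each evaluation step, the set of nodes that are suspended."""
    rng = rng or random.Random()
    down_until = {}  # node -> time (ms) at which it reconnects
    timeline = []
    for step in range(steps):
        now = step * k_ms
        # Nodes whose suspension period has elapsed come back online.
        for node in [n for n, t in down_until.items() if t <= now]:
            del down_until[node]
        for node in nodes:
            if node in down_until:
                continue
            # Assumed cap: at most a drop_rate fraction of nodes is suspended.
            if len(down_until) >= drop_rate * len(nodes):
                break
            r = rng.random()  # uniform in [0.0, 1.0)
            if r < drop_rate:
                down_until[node] = now + l_ms  # break all connections for l ms
        timeline.append(set(down_until))
    return timeline
```

With k equal to l, as in the experiments, a node suspended at one evaluation is eligible to reconnect at the next one.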

24 Results (Set3) (1/3)
- BFS mechanism
  - Increasing the drop_rate decreases the number of messages.
  - BFS does not exhibit any learning behavior at any level of drop_rate.
  - BFS tolerates small drop_rates (5%) because it is highly redundant.

25 Results (Set3) (2/3)
- >RES mechanism
  - Increasing the drop_rate decreases the number of messages.
  - >RES does not exhibit any learning behavior at any level of drop_rate.

26 Results (Set3) (3/3)
- ISM mechanism
  - Increasing the drop_rate decreases the number of messages.
  - Performs quite well at low levels of drop_rate.
  - Not expected to be tolerant to large drop_rates (the information gathered by the profiling structure becomes obsolete before it gets the chance to be utilized).

27 Extend ISM to different environments
- The ISM mechanism could easily become the query routing protocol for some hybrid p2p environments (KaZaa, Gnutella).
  - Super peers form a backbone infrastructure (long-time network connectivity).
  - Regular peers are unstable and less powerful.
- How could it work?
  - A regular peer obtains a list of active super peers.
  - It connects to one or more super peers and posts queries.
  - The super peer utilizes the ISM mechanism and forwards the query to a selective subset of its super peer neighbors.

28 Thank you
