Published by Darren Norton. Modified over 9 years ago.
1
1 Exploiting locality for scalable information retrieval in peer-to-peer networks
D. Zeinalipour-Yazti, Vana Kalogeraki, Dimitrios Gunopulos
Manos Moschous
November, 2004
2
2 Issues in p2p networks
Content-based vs. file-identifier information retrieval.
Dynamic networks (ad-hoc).
Scalability (global knowledge).
Query messages (flooding causes network congestion).
Recall rate.
Efficiency (recall rate / query messages).
Query Response Time (QRT).
3
3 IR in pure p2p networks
BFS technique
  Each peer forwards the query to all its neighbors.
  Simple.
  Performance.
  Network utilization.
  Use of TTL.
RBFS technique
  Each peer forwards the query to a random subset of its neighbors.
  Reduces query messages.
  Probabilistic algorithm.
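The difference between the two forwarding rules can be sketched in a few lines. This is only an illustration: the function names, the `fraction` parameter, and the choice to skip the neighbor the query arrived from are assumptions, not details given on the slide.

```python
import math
import random


def bfs_targets(neighbors, sender=None):
    """BFS: forward to every neighbor (skipping the sender is an assumption)."""
    return [n for n in neighbors if n != sender]


def rbfs_targets(neighbors, sender=None, fraction=0.5, rng=random):
    """RBFS: forward to a random subset of the neighbors BFS would use.

    `fraction` is a hypothetical tuning knob: the share of neighbors to contact.
    """
    eligible = bfs_targets(neighbors, sender)
    if not eligible:
        return []
    count = max(1, math.ceil(fraction * len(eligible)))
    return rng.sample(eligible, count)


neighbors = ["a", "b", "c", "d", "e"]
all_targets = bfs_targets(neighbors, sender="a")
some_targets = rbfs_targets(neighbors, sender="a", fraction=0.5, rng=random.Random(0))
```

With four eligible neighbors and `fraction=0.5`, RBFS contacts two of them, so each hop sends half the messages BFS would, at the cost of possibly missing peers that hold results.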
4
4 IR in pure p2p networks
>RES technique
  Each peer forwards the query to some of its peers based on aggregated statistics.
  Heuristic: The Most Results in Past (for the last 10 queries).
  Tends to explore:
    The larger network segments.
    The most stable neighbors.
    But not the nodes which contain content related to the query.
  >RES is a quantitative rather than qualitative approach.
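A minimal sketch of the Most Results in Past heuristic, assuming each peer keeps a sliding window of result counts per neighbor. The class and method names are hypothetical, and the window of 10 follows the "last 10 queries" on the slide.

```python
from collections import deque


class MostResultsInPast:
    """Ranks neighbors by how many results they returned recently."""

    def __init__(self, window=10):
        self.window = window
        self.history = {}  # neighbor -> deque of result counts (newest last)

    def record(self, neighbor, num_results):
        """Remember how many results a neighbor returned for one query."""
        counts = self.history.setdefault(neighbor, deque(maxlen=self.window))
        counts.append(num_results)

    def choose(self, neighbors, k):
        """Pick the k neighbors with the most results inside the window."""
        def total(n):
            return sum(self.history.get(n, ()))
        return sorted(neighbors, key=total, reverse=True)[:k]


stats = MostResultsInPast(window=10)
stats.record("A", 5)
stats.record("B", 1)
stats.record("C", 3)
chosen = stats.choose(["A", "B", "C"], k=2)
```

The ranking uses only counts, never the query text, which is why the slide calls >RES quantitative rather than qualitative.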
5
5 The intelligent search mechanism (ISM)
Main idea: For each query, a peer estimates which of its peers are most likely to reply, and propagates the query message to those peers only.
Exploits the locality of past queries.
Some characteristics:
  Entirely distributed (requires only local knowledge).
  Scales well with the size of the network.
  Scales well to large data sets.
  Works well in dynamic environments.
  High recall rates.
  Minimizes communication costs.
6
6 Architecture (ISM) (1/4)
Profiling structure:
  Single queries table.
  LRU policy to keep the most recent queries.
  Table size is limited, for good performance.
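One way to picture a bounded queries table with LRU eviction is the sketch below. The table layout (query mapped to per-peer result counts) and all names are assumptions made for illustration; the slide only states that the table is single, size-limited, and managed with an LRU policy.

```python
from collections import OrderedDict


class QueriesTable:
    """Size-limited profile of past queries, evicting the least recently used."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # query -> {peer: number of results}

    def record(self, query, peer, num_results):
        """Store a reply and mark the query as most recently used."""
        hits = self.entries.setdefault(query, {})
        hits[peer] = num_results
        self.entries.move_to_end(query)
        while len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # drop the oldest query


table = QueriesTable(capacity=2)
table.record("q1", "peerA", 3)
table.record("q2", "peerB", 1)
table.record("q1", "peerC", 2)  # touching q1 makes q2 the oldest entry
table.record("q3", "peerA", 4)  # exceeds capacity, so q2 is evicted
remaining = list(table.entries)
```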
7
7 Architecture (ISM) (2/4)
Query similarity function (cosine similarity)
Assumption: A peer that has a document relevant to a given query is also likely to have other documents that are relevant to other similar queries.
$Qsim : Q^2 \to [0,1]$
L: the set of all words that have appeared in queries.
Example: L = {1,1,1,1}, q = {1,1,0,0}, $q_i$ = {1,0,1,0}
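The example vectors on this slide can be worked through with the standard cosine similarity, the dot product divided by the product of the vector lengths. The function name `qsim` mirrors the slide's Qsim; treating queries as 0/1 vectors over L follows the example, and the rest is a sketch.

```python
import math


def qsim(q_a, q_b):
    """Cosine similarity of two query vectors over the word set L."""
    dot = sum(x * y for x, y in zip(q_a, q_b))
    norm = math.sqrt(sum(x * x for x in q_a)) * math.sqrt(sum(y * y for y in q_b))
    return dot / norm if norm else 0.0


q = [1, 1, 0, 0]    # contains words 1 and 2 of L
q_i = [1, 0, 1, 0]  # contains words 1 and 3 of L
similarity = qsim(q, q_i)  # one shared word out of two each: about 0.5
```

The two queries share one of their two words, so the dot product is 1 and each vector has length sqrt(2), giving a similarity of about 0.5.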
8
8 Architecture (ISM) (3/4)
Peer ranking (Relevance Rank)
$P_i$: each peer.
$P_l$: the decision-maker node.
a: allows us to add more weight to the most similar queries.
$S(P_i, q_j)$: the number of results returned by $P_i$ for query $q_j$.
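The ranking formula itself is not preserved in this transcript. A minimal sketch, assuming the rank of peer $P_i$ for a new query q is the similarity-weighted sum of its past result counts, $\sum_j Qsim(q_j, q)^a \cdot S(P_i, q_j)$, taken over the past queries $q_j$ that $P_i$ answered. That form is an assumption consistent with the symbols defined above, and the function names are hypothetical.

```python
import math


def qsim(q_a, q_b):
    """Cosine similarity of two query vectors."""
    dot = sum(x * y for x, y in zip(q_a, q_b))
    norm = math.sqrt(sum(x * x for x in q_a)) * math.sqrt(sum(y * y for y in q_b))
    return dot / norm if norm else 0.0


def relevance_rank(answered, new_query, a=1.0):
    """Assumed form: sum over past queries of Qsim(q_j, q)**a * S(P_i, q_j).

    `answered` holds (q_j, S(P_i, q_j)) pairs for one peer P_i.
    """
    return sum(qsim(q_j, new_query) ** a * s for q_j, s in answered)


# Peer P_i returned 4 results for a similar query and 10 for an unrelated one.
answered_by_peer = [([1, 0, 1, 0], 4), ([0, 0, 0, 1], 10)]
q = [1, 1, 0, 0]
rank_a1 = relevance_rank(answered_by_peer, q, a=1.0)
rank_a2 = relevance_rank(answered_by_peer, q, a=2.0)
```

Under this assumed form, the 10 results for the unrelated query contribute nothing, and raising a from 1 to 2 shrinks the contribution of the half-similar query from 2 to 1, which matches the slide's remark that a adds weight to the most similar queries.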
9
9 Architecture (ISM) (4/4)
Search Mechanism
  Invoke the RR function.
  Forward the query to the k (threshold) best-ranked peers only.
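Once every neighbor has a rank, the forwarding step reduces to a top-k selection. A small sketch, with the hypothetical name `select_peers` and a plain dictionary of precomputed ranks standing in for the RR function:

```python
import heapq


def select_peers(ranks, k):
    """Return the k peers with the highest relevance rank, best first."""
    return heapq.nlargest(k, ranks, key=ranks.get)


ranks = {"p1": 2.0, "p2": 0.0, "p3": 5.5, "p4": 1.0}
targets = select_peers(ranks, k=2)
```

The threshold k is the knob that trades messages for recall: a larger k approaches BFS, a smaller k sends fewer messages.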
10
10 Experiments
Peerware: a distributed middleware infrastructure.
  GraphGen: generates network topologies.
  dataPeer: p2p client which answers boolean queries from its local XML repository (XQL).
  SearchPeer: p2p client that performs queries and harvests answers back from a Peerware network (connects to a dataPeer and performs queries).
11
11 Experiments - DMP
If node $P_k$, having already received query q with $TTL_1$, receives the same query q with some $TTL_2$, where $TTL_2 > TTL_1$, we allow the $TTL_2$ message to proceed. This may allow q to reach more peers than its predecessor.
Without this fix, BFS behaviour is not predictable, and BFS is therefore not able to find the nodes that it was supposed to find.
Our experiments revealed that almost 30% of the forwarded queries were discarded because of DMP.
The experimental results presented in this work do not suffer from DMP. This is why the number of messages is slightly higher (~30%) than the expected number of messages.
The expected total number of messages is a function of the n nodes and their degrees $d_i$ (the formula is not preserved in this transcript).
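The fix described on this slide amounts to remembering the largest TTL seen for each query instead of a plain seen/not-seen flag. A sketch under that reading, with hypothetical names:

```python
class DuplicateFilter:
    """Lets a repeated query through only if it arrives with a larger TTL."""

    def __init__(self):
        self.best_ttl = {}  # query id -> largest TTL seen so far

    def should_forward(self, query_id, ttl):
        """True for a new query, or for a duplicate that can travel further."""
        seen = self.best_ttl.get(query_id)
        if seen is not None and ttl <= seen:
            return False  # plain duplicate: nothing new can be reached
        self.best_ttl[query_id] = ttl
        return True


f = DuplicateFilter()
decisions = [
    f.should_forward("q", 2),  # first arrival
    f.should_forward("q", 2),  # same TTL: discard
    f.should_forward("q", 1),  # smaller TTL: discard
    f.should_forward("q", 3),  # larger TTL: allowed to proceed
]
```

Letting the larger-TTL copy proceed costs extra messages, which is consistent with the slide's note that message counts run higher than expected.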
12
12 Experiments-DMP
Query examples
  A set of 4 keywords.
  1 keyword >= 4 characters.
  #  Query
  1  AUSTRIA INTERVENE DOES DOLLAR
  2  APPROVES MEDITERRANEAN FINANCIAL PACKAGES
  3  AGREES PEACE NEW MOVES
Random Topology:
  Each vertex selects its d neighbors randomly.
  Simple.
  Leads to connected topologies if the degree $d > \log_2 n$.
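The random topology rule can be sketched as follows. Treating links as undirected, and the function names, are assumptions; the values n = 104 and d = 8 are taken from the first experiment set, where 8 exceeds log2(104), roughly 6.7.

```python
import math
import random
from collections import deque


def random_topology(n, d, rng=random):
    """Each vertex picks d distinct random neighbors; links are undirected (assumption)."""
    adjacency = {v: set() for v in range(n)}
    for v in range(n):
        others = [u for u in range(n) if u != v]
        for u in rng.sample(others, d):
            adjacency[v].add(u)
            adjacency[u].add(v)
    return adjacency


def is_connected(adjacency):
    """Breadth-first traversal from one vertex; connected if it reaches all of them."""
    start = next(iter(adjacency))
    seen = {start}
    frontier = deque([start])
    while frontier:
        v = frontier.popleft()
        for u in adjacency[v] - seen:
            seen.add(u)
            frontier.append(u)
    return len(seen) == len(adjacency)


n, d = 104, 8
degree_bound = math.log2(n)  # the slide's condition is d > log2(n)
graph = random_topology(n, d, random.Random(7))
connected = is_connected(graph)
```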
13
13 Experiments (Set1)
Reuters-21578 Peerware
  Random topology of 104 nodes (static) with average degree 8 (running on a network of 75 workstations).
  Documents are categorized by their country attribute (104 country files, one per node). Each country file has at least 5 articles.
Data Sets:
  Reuters 10X10: a set of 10 random queries which are repeated 10 consecutive times (high locality of similar queries); better suited to ISM.
  Reuters 400: a set of 400 random queries which are uniformly sampled from the initial 104 country files (lower repetition).
14
14 Results (Set1) – Reuters 10X10 (1/4)
Reducing query messages
ISM finds the most documents compared to RBFS and >RES.
ISM achieves almost 90% recall rate while using only 38% of BFS’s messages.
ISM and >RES start out with low recall rate.
Suffer from low recall rate.
15
15 Results (Set1) – Reuters 10X10 (2/4)
Digging deeper by increasing TTL
Reach more nodes, deeper in the network.
ISM achieves 100% recall rate while using only 57% of BFS’s messages with TTL=4.
16
16 Results (Set1) – Reuters 10X10 (3/4)
Reducing query response time (QRT)
~30-60% of BFS’s QRT for TTL=4 and ~60-80% for TTL=5.
ISM requires more time than >RES because its decision involves some computation over the past queries.
17
17 Results (Set1) – Reuters 400 (4/4)
Improving the recall rate over time
ISM achieves 95% recall rate while using 38% of BFS’s messages.
During queries 150-200, major outbreaks occur in BFS.
ISM requires a learning period of about 100 queries before it starts competing with the performance of >RES.
18
18 Experiments (Set2)
TREC-LATimes Peerware (random topology of 1000 nodes, static)
It contains approximately 132,000 articles.
These articles were horizontally partitioned into 1000 documents (each document contains 132 articles).
Each peer shares one or more of the 1000 documents (replicated articles).
19
19 Experiments (Set2)
Data Sets:
  TREC 100: a set of 100 queries out of the initial 150 topics.
  TREC 10X10: a list of 10 randomly sampled queries, out of the initial 150 topics, which are repeated 10 consecutive times.
  TREC 50X2: a set a of 50 queries randomly sampled from the initial 150 topics, merged with a generated list of another 50 queries which are randomly sampled out of a.
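The two synthetic query sets could be generated along these lines. Several details are assumptions not settled by the slide: that each query is repeated 10 times back to back, that "merged" means concatenation, and that the second 50 queries are drawn from a with replacement. The topic identifiers are placeholders.

```python
import random


def build_10x10(topics, rng):
    """10 sampled queries, each repeated 10 consecutive times (assumed ordering)."""
    chosen = rng.sample(topics, 10)
    return [q for q in chosen for _ in range(10)]


def build_50x2(topics, rng):
    """Set a of 50 sampled queries followed by 50 draws from a (with replacement)."""
    a = rng.sample(topics, 50)
    repeats = [rng.choice(a) for _ in range(50)]
    return a + repeats


topics = [f"topic-{i}" for i in range(150)]  # placeholder for the 150 TREC topics
rng = random.Random(42)
trec_10x10 = build_10x10(topics, rng)
trec_50x2 = build_50x2(topics, rng)
```

Both lists hold 100 queries, but 10X10 contains only 10 distinct ones while 50X2 contains 50, so the second set has much lower repetition.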
20
20 Results (Set2) – TREC100 (1/3)
Searching in a large-scale network topology
For TTL=5 we reach 859 of 1000 nodes (BFS).
For TTL=6 we reach 998 of 1000 nodes at a cost of 8500 messages per query (m/q).
For TTL=7 we reach all nodes at a cost of 10,500 m/q.
ISM will not exhibit any learning behavior if the frequency of terms is very low.
21
21 Results (Set2) – TREC 10X10 (2/3)
The effect of high term frequency
The recall rate will improve dramatically if the frequency of terms is high.
ISM achieves a higher recall rate than BFS (BFS’s TTL=5).
After the learning phase of 20-30 queries it scores 120% of BFS’s recall rate while using 4 times fewer messages.
22
22 Results (Set2) – TREC 50X2 (3/3)
The effect of high term frequency
A more realistic set: a few terms occur many times in queries and most terms occur less frequently.
ISM monotonically improves its recall rate, and at the 90th query it again exceeds BFS performance.
>RES’s recall rate fluctuates and behaves as badly as RBFS if the queries don’t follow any constant pattern.
23
23 Experiments (Set3)
Searching in dynamic network topologies
Why network failures?
  Misuse at the application layer (shutting down the PC without disconnecting).
  Overwhelming amount of generated network traffic.
  Poorly written p2p clients.
Simulating a dynamic environment
  Total number of suspended nodes is no more than drop_rate.
  drop_rate is evaluated every k ms against a random number r.
  If r < drop_rate, the node breaks all incoming and outgoing connections (for l ms).
  r is a random number which is uniformly generated in [0.0, 1.0).
In our experiments:
  k = 60,000 ms and l = 60,000 ms.
  TREC-LATimes Peerware with the TREC 10X10 query set.
  drop_rate takes the values 0.0, 0.05, 0.1 and 0.2.
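The suspension rule can be sketched as a per-node coin flip at each evaluation. The function names are hypothetical, and because k and l are equal in the experiments, the sketch treats each evaluation round as one suspension period. It does not enforce any cap on the total number of suspended nodes.

```python
import random


def should_suspend(drop_rate, rng=random):
    """Draw r uniformly from [0.0, 1.0) and suspend the node if r < drop_rate."""
    r = rng.random()
    return r < drop_rate


def suspended_per_round(num_nodes, drop_rate, rounds, rng):
    """Count how many nodes drop their connections in each evaluation round."""
    return [
        sum(should_suspend(drop_rate, rng) for _ in range(num_nodes))
        for _ in range(rounds)
    ]


rng = random.Random(1)
none_down = suspended_per_round(1000, 0.0, rounds=3, rng=rng)
some_down = suspended_per_round(1000, 0.05, rounds=3, rng=rng)
```

With drop_rate = 0.0 no node is ever suspended, since r is never below zero; with 0.05, about 50 of 1000 nodes are expected to be down in any given round.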
24
24 Results (Set3) (1/3)
BFS mechanism
Increasing drop_rate decreases the number of messages.
BFS does not exhibit any learning behavior at any level of drop_rate.
BFS is tolerant of small drop_rates (5%) because it is highly redundant.
25
25 Results (Set3) (2/3)
>RES mechanism
Increasing drop_rate decreases the number of messages.
>RES does not exhibit any learning behavior at any level of drop_rate.
26
26 Results (Set3) (3/3)
ISM mechanism
Increasing drop_rate decreases the number of messages.
Performs quite well at low levels of drop_rate.
Not expected to be tolerant of large drop_rates (the information gathered by the profiling structure becomes obsolete before it gets the chance to be utilized).
27
27 Extend ISM to different environments
The ISM mechanism could easily become the query routing protocol for some hybrid p2p environments (KaZaa, Gnutella).
  Super peers form a backbone infrastructure (long-time network connectivity).
  Regular peers are unstable and less powerful.
How could it work?
  A regular peer obtains a list of active super peers.
  It connects to one or more super peers and posts queries.
  Each super peer utilizes the ISM mechanism and forwards the query to a selective subset of its super peer neighbors.
28
28 Thank you