Large-Scale Monitoring of DHT Traffic
Ghulam Memon – University of Oregon
Reza Rejaie – University of Oregon
Yang Guo – Corporate Research, Thomson
Daniel Stutzbach – Stutzbach Enterprises
International Workshop on Peer-to-Peer Systems (IPTPS) 2009, Boston, MA.

3/5/2016 IPTPS 2009 Boston, MA.

Introduction
- Distributed Hash Tables (DHTs) provide a scalable approach to distributed content management, e.g. file sharing.
- DHTs have been an active area of research since 2001.
- DHTs have recently been deployed in real-world applications, e.g. Kad, Azureus, Mojito.
- Characterizing traffic in widely deployed DHTs allows us to:
  - Identify opportunities for performance improvement.
  - Detect anomalous behavior.
- Accurately capturing traffic in a widely deployed DHT is challenging.

Challenges in Capturing DHT Traffic
- The common approach to capturing DHT traffic is to use instrumented peers as monitors.
- A small number of monitors cannot capture an accurate view of traffic.
- A large number of monitors is expensive and may change and/or disrupt the DHT, e.g. 8 monitors per peer [Steiner: DBISP2P 2007].
- Goal: capture a representative view of DHT traffic efficiently, without changing and/or disrupting the system.

Classifying DHT Traffic
- Two types of messages are observed by peer p:
  - Routing traffic: messages that are routed by, but not destined to, peer p. Depends on DHT geometry and peer visibility.
  - Destination traffic: messages that are destined to peer p. Demonstrates DHT usage.
- We focus on capturing destination traffic.
[Figure: DHT traffic among peers P_r, P_i, P_i, and P_t, shown for both traffic types]

This paper presents
- Montra, a new approach to efficiently and accurately capture DHT traffic without disrupting the system.
  - Montra should be applicable to most DHTs.
- Validation of Montra over a deployed DHT, Kad.
- Preliminary characterization of Kad traffic.

Montra: Key Idea
- Real-world DHTs add redundancy to cope with churn:
  - Each file is published at multiple peers.
  - A search operation identifies multiple peers.
- If monitor peer P_m is the closest peer to the target peer P_t, P_m will observe all the destination traffic of P_t.

Montra: Key Idea
[Figure: ID space with labels 0x0, 0x1, 0x8, 0x9, 0xe, and 0xf, showing P_r, P_t at 0xe, and P_m at 0xf, plus a routing table listing 0x8, 0xe, and 0xf]
- The request originator (P_r) searches for the destination of content ID 0xe.
- Node 0xe (P_t) is closest to the requested ID 0xe.
- Monitor 0xf (P_m) captures the request.
- Placing one monitor per peer provides an accurate view of traffic.
- How can the impact on the system be avoided or minimized?
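
As a rough illustration of the example above, the sketch below applies the XOR distance of Kademlia-style DHTs such as Kad to the 4-bit toy IDs on the slide. The function names are hypothetical, and placing the monitor at the target's ID with the lowest bit flipped is an assumption made for illustration, not a documented detail of Montra.

```python
def xor_distance(a: int, b: int) -> int:
    """XOR distance between two IDs (Kademlia-style metric)."""
    return a ^ b


def closest_peers(target_id: int, peer_ids: list[int], count: int = 1) -> list[int]:
    """Return the `count` IDs closest to target_id, nearest first."""
    return sorted(peer_ids, key=lambda p: xor_distance(p, target_id))[:count]


def monitor_id_for(target_peer_id: int) -> int:
    """Hypothetical placement: flip the lowest bit, so the monitor
    sits at XOR distance 1 from its target peer."""
    return target_peer_id ^ 1


peers = [0x1, 0x8, 0x9, 0xE]         # regular peers (toy IDs from the slide)
p_t = closest_peers(0xE, peers)[0]   # destination peer for content ID 0xe
p_m = monitor_id_for(p_t)            # monitor placed next to P_t

# Once the monitor exists, a lookup that collects the two IDs closest
# to 0xe includes both the target and its monitor.
print([hex(p) for p in closest_peers(0xE, peers + [p_m], count=2)])
```

Real Kad IDs are 128 bits wide; the 4-bit values here only mirror the slide's example.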

Montra: Minimally Visible Monitors (MVMs)
- To minimize disruption of the system, we use Minimally Visible Monitors (MVMs).
- MVMs are visible only to (i.e. exchange messages only with) their target peer.
- Deploying a large number of MVMs causes minimal or no disruption in the system.
- Each MVM slightly changes the routing table of its target peer.
[Figure: requests and responses among P_r, P_t, and P_m in the ID space]

Montra - MVMs: Identifying Destination Peers
- In the presence of churn and packet loss, a single peer (or MVM) cannot reliably identify its destination traffic.
  - Closer peers may exist.
  - Requires a regional view of traffic.
- We monitor all peers in a continuous zone of the ID space, e.g. the 4-bit zone 0xa.
  - Periodically crawl to detect all the peers in the zone.
  - All the captured requests within a zone have a destination in that zone.
- Destination peers are identified during post-processing: for a given captured request, find the closest monitored peer.
[Figure: zone 0xa containing IDs 0xa8, 0xa9, 0xac, 0xad, 0xae, and 0xaf, some labeled as monitors P_m]
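
The post-processing step can be sketched as follows, assuming XOR distance (as in Kademlia-style DHTs such as Kad) and toy 8-bit IDs. The function names and the split of the figure's IDs into peers (0xa9, 0xad, 0xaf) and captured request IDs are assumptions for illustration only.

```python
ID_BITS = 8  # toy ID width for illustration (assumption)


def in_zone(node_id: int, zone_prefix: int, prefix_bits: int,
            id_bits: int = ID_BITS) -> bool:
    """True if the top `prefix_bits` bits of node_id equal zone_prefix."""
    return node_id >> (id_bits - prefix_bits) == zone_prefix


def identify_destinations(requested_ids, crawled_peers, zone_prefix, prefix_bits):
    """Map each captured request ID in the zone to the closest monitored peer."""
    zone_peers = [p for p in crawled_peers
                  if in_zone(p, zone_prefix, prefix_bits)]
    if not zone_peers:
        raise ValueError("no monitored peers found in the zone")
    return {
        rid: min(zone_peers, key=lambda p: p ^ rid)  # XOR-closest peer
        for rid in requested_ids
        if in_zone(rid, zone_prefix, prefix_bits)
    }


crawled = [0xA9, 0xAD, 0xAF, 0xB1]   # 0xb1 lies outside the 4-bit zone 0xa
captured = [0xA8, 0xAC, 0xAE]        # requested IDs seen by the monitors
destinations = identify_destinations(captured, crawled, 0xA, 4)
for rid, peer in destinations.items():
    print(hex(rid), "->", hex(peer))
```

A fuller version would also use request timestamps and successive crawl snapshots, so that each request is matched only against peers that were present when it was captured.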

Validation
- We quantify the accuracy of Montra from two angles, using the Kad network:
  - Content accuracy: what fraction of destination traffic per zone is captured?
  - Peer accuracy: how accurately does Montra determine destination peers?
- Validation methodology:
  - Instrumented Source
  - Instrumented Destination

Validation: Instrumented Source Validation
- Use an instrumented Kad client to send requests for random IDs in a zone (Instrumented Source).
- Log all requests and their destinations.
- Monitor the same zone using Montra.
- Compare source and monitor logs to determine content and peer accuracy.
- Uses a synthetic workload, but the requests are distributed over the entire zone.
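
One way to compare the two logs is sketched below. It assumes each log is a mapping from a request identifier to the destination peer recorded for that request. The log layout, the function names, and the exact metric definitions (peer accuracy computed only over captured requests) are assumptions for illustration, and the sample values are made up.

```python
def content_accuracy(source_log: dict, monitor_log: dict) -> float:
    """Fraction of the source's requests that also appear in the monitor log."""
    if not source_log:
        raise ValueError("source log is empty")
    captured = source_log.keys() & monitor_log.keys()
    return len(captured) / len(source_log)


def peer_accuracy(source_log: dict, monitor_log: dict) -> float:
    """Among captured requests, the fraction whose identified destination
    matches the destination logged by the instrumented source."""
    captured = source_log.keys() & monitor_log.keys()
    if not captured:
        return 0.0
    matches = sum(1 for r in captured if source_log[r] == monitor_log[r])
    return matches / len(captured)


# Made-up logs: request number -> destination peer ID.
source_log = {1: 0xA9, 2: 0xAD, 3: 0xAF, 4: 0xA9}
monitor_log = {1: 0xA9, 2: 0xAD, 3: 0xAD}   # request 4 missed, request 3 misattributed

print(content_accuracy(source_log, monitor_log))  # 3 of 4 requests captured
print(peer_accuracy(source_log, monitor_log))     # 2 of 3 destinations match
```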

Validation: Instrumented Destination Validation
- Use an instrumented Kad client to passively observe and log requests (Instrumented Destination).
- Monitor the same zone simultaneously.
- Compare destination and monitor logs, using some heuristics.
- Uses a real-world workload, but the requests are localized to the instrumented destination.

Validation: Results
[Figures: Content Accuracy; Peer Accuracy]
- Zone size decreases as the zone prefix length increases.
- Both figures show similar results.
- Instrumented Source: increasing the zone size beyond that of a 6-bit zone degrades accuracy.
  - The time taken to crawl a zone of 5 bits or fewer hinders prompt addition of MVMs.
- Instrumented Destination: zone size has minimal impact on accuracy.
  - MVMs are promptly added around the instrumented destination.

Characterization (Kad): Publish Request Rate
[Figures: Keywords; Files]
- How does the request rate vary across different zones?
- The heavily skewed behavior is consistent across different zones.
  - Each zone has some hot keywords and files.
- The publish rate for keywords is higher than for files.
  - Many common names occur in filenames.
- See the paper for more results.

Characterization (Kad): Relation Between Published and Searched Content
[Figures: Keywords; Files]
- What is the balance between supply and demand for a file?
  - Balance = Pub. / (Sear. + Pub.)
- 15% of files are searched but never published.
  - Newly popular files that are not yet widely available.
- 60% of files are published but never searched.
  - Popular files from the past that are highly available.
- 95% of keywords are published but never searched.
  - A very small pool of keywords is actually used.
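
The balance metric can be computed per file or keyword as in the minimal sketch below. It assumes Pub. and Sear. are the numbers of publish and search requests observed for an item. The function name and the sample counts are made up for illustration and are not data from the study.

```python
def balance(published: int, searched: int) -> float:
    """Balance = Pub / (Sear + Pub).

    1.0 means published but never searched; 0.0 means searched but
    never published.
    """
    total = published + searched
    if total == 0:
        raise ValueError("item has neither publish nor search requests")
    return published / total


# Made-up counts: item -> (publish requests, search requests).
counts = {"file_a": (12, 0), "file_b": (0, 4), "file_c": (3, 9)}
balances = {name: balance(p, s) for name, (p, s) in counts.items()}

never_searched = [n for n, b in balances.items() if b == 1.0]
never_published = [n for n, b in balances.items() if b == 0.0]
print(balances, never_searched, never_published)
```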

Conclusion
- Montra is a new technique for capturing DHT traffic accurately and efficiently without disrupting the system.
- Montra's accuracy was validated over the Kad network.
- We presented an initial characterization of traffic in Kad.
- Ongoing work:
  - Further evaluation of Montra over other DHTs, e.g. Azureus, Mojito.
  - Further analysis of captured traffic in Kad and other DHTs.
  - Exploring other uses of Montra, e.g. detecting botnet C&C.

Characterization (Kad): Search Request Rate
[Figures: Keywords; Files]
- Search-file and search-keyword requests have the lowest range of request rates.
  - Demonstrates user behavior.
- User behavior for keyword searches differs across zones.
  - Some zones have more popular keywords.
- User behavior for file searches is consistent across zones.