Transport Layer Identification of P2P Traffic

Thomas Karagiannis#, Andre Broido*, Michalis Faloutsos# and Kc Claffy*
#Department of Computer Science and Engineering, UC Riverside. *CAIDA, SDSC.
Presented by: Anirban Banerjee, Jin Shieh, Piyush Satapathy, Trivikram Phatak and Van Lepham, Department of Computer Science and Engineering, UC Riverside.

Problem Addressed
Develop a trustworthy mechanism for estimating P2P traffic.
Characterize P2P traffic using only knowledge of network parameters.
Use this to verify whether P2P traffic is actually declining.
9/21/2018 CS253, Fall 05

Motivation
Can we identify P2P traffic without inspecting payload? Sniffing payload can lead to privacy issues.
Can we verify the claim that P2P traffic is decreasing? Quoting IFPI: "music piracy files falls 27%".

Contribution
Develop a non-payload mechanism that identifies P2P traffic from network parameters.
Reverse engineer and analyze the 9 most popular P2P protocols.
Develop an effective methodology capable of identifying 99% of P2P flows and more than 95% of P2P bytes.
Identify approximately 10% additional P2P flows that payload analysis misses.

Contribution
Show conclusively that P2P traffic is not declining.

Outline
Introduction
Data Description
Previous Work
Payload Analysis of P2P Traffic and Limitations
Nonpayload Identification of P2P Traffic
Evaluation
P2P/File Sharing Traffic Trends
Conclusions

Introduction
Why is this work important?
It debunks the claim that P2P traffic is decreasing.
It provides a simple methodology, based on network parameters, to identify P2P flows.
It can successfully identify 99% of P2P flows and over 95% of P2P bytes.

Data Description
Traces sourced from: CAIDA’s backbone data kit; trace from an OC-48 link of a tier 1 ISP. Traces analyzed were taken on May 5, 2003; Jan 22, 2004; Feb 25, 2004; and April 21, 2004.
Software: CAIDA’s Coralreef suite; DAG 4 monitors (Endace); packet capture software from Univ. of Waikato.

Data Description contd.
Monitors captured 44 bytes of each packet: IP/TCP headers plus 4 bytes of payload.
60%-80% of packets are encapsulated with an MPLS label.
The Feb and April traces had 16 bytes of UDP/TCP payload (used to benchmark the non-payload mechanism).
IP addresses are anonymized (CryptoPAn).

Previous Work
Studies that characterize very specific P2P systems: bootstrapping in Gnutella, 5 months of a torrent's lifetime.
Topological characteristics of P2P networks.
Investigation of bottleneck bandwidth.
Possibility of caching and retrieval of content.

Previous Work
Signature-based payload identification techniques.
Studies by entities like Sprint.

Attention! How is this paper different from the pack?
Analysis of a diverse data set from a tier 1 ISP.
A large group of P2P applications is included in the study.
Methods using fixed ports, payload and transport layer dynamics are combined and cross-validated.

Payload Analysis of P2P Traffic & Limitations
Payload analysis identifies characteristic bit strings in the packet payload.
The 9 most popular P2P protocols are monitored: eDonkey, Fasttrack, BitTorrent, OpenNap & WinMX, Gnutella, MP2P, Soulseek, Ares, Direct Connect.
Most use proprietary protocols, so packet formats must be identified by additional analysis.

Limitations
Captured payload is only 16 bytes! The small size limits the number of heuristics that can reliably pinpoint a P2P flow.
Even worse, some of the older traces captured only 4 bytes of payload.
Some P2P protocols use HTTP requests and responses to transfer files. We can’t tell from “HTTP/ Partial Content” whether the flow is plain HTTP or P2P traffic.

More Limitations
An increasing number of P2P protocols use encryption and SSL. We can’t string match if the payload is encrypted.
Traffic is diverse; there are other P2P protocols. We can’t guarantee identification of all P2P flows, but the majority of P2P traffic is from the 9 protocols covered.

More Limitations
Unidirectional traces: some of the traces reflect only a single direction. If we see only the TCP acknowledgements of a P2P file download, we cannot identify the flow: there is no payload!
Even if a trace is bidirectional, routing may be asymmetric.
We can overcome some of these limitations by using the non-payload method.

Methodology
Use unique bit strings to identify P2P.
Protocol documentation is poor, so the P2P applications had to be installed and their traffic captured with tcpdump.

Methodology
Classify packets into flows keyed by the 5-tuple {source IP, destination IP, protocol, source port, destination port}.
Use the commonly accepted 64-second timeout: a flow expires if it is idle for 64 seconds.
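The flow classification described on this slide can be sketched roughly as follows. This is a minimal illustration, not the code used in the study: the packet tuple layout, the `build_flows` name and the per-flow counters are assumptions made for the example, and packets are assumed to arrive sorted by timestamp.

```python
FLOW_TIMEOUT = 64.0  # seconds of idleness after which a flow expires (from the slide)


def build_flows(packets):
    """Group packets into flows keyed by the 5-tuple.

    `packets` is an iterable of hypothetical tuples
    (timestamp, src_ip, dst_ip, proto, src_port, dst_port, size),
    assumed to be sorted by timestamp.
    """
    active = {}    # 5-tuple -> flow record currently open
    finished = []  # flows that expired because of the timeout
    for ts, src_ip, dst_ip, proto, src_port, dst_port, size in packets:
        key = (src_ip, dst_ip, proto, src_port, dst_port)
        flow = active.get(key)
        # An idle gap longer than the timeout closes the old flow.
        if flow is not None and ts - flow["last"] > FLOW_TIMEOUT:
            finished.append(flow)
            flow = None
        if flow is None:
            flow = {"key": key, "first": ts, "last": ts, "packets": 0, "bytes": 0}
            active[key] = flow
        flow["last"] = ts
        flow["packets"] += 1
        flow["bytes"] += size
    # Flows still open at the end of the trace are returned as well.
    finished.extend(active.values())
    return finished
```

With this sketch, two packets of the same 5-tuple that are 90 seconds apart end up in two separate flows, while packets 10 seconds apart share one.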

Three Methods
M1: Source or destination port matches well-known P2P ports.
M2: Compare payload (if any) to a table of known bit strings. If there is a match, flag the flow as P2P with the protocol type. Else, flag it as non-P2P.
M3: If a flow is identified as P2P by M2, hash its source and destination IPs into a table. All flows with an IP in the table will be classified as ‘possible P2P.’

Benefits and Concerns
Clients maintain many connections even when idle.
A user identified by M2 will participate in future P2P flows and will already be flagged.
But the user might just be browsing the web, gaming, or checking mail.
Reduce false positives by excluding from M3 flows with port 80, 8000, 8080, 25, 21, 22, etc.
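A rough sketch of how M1, M2 and M3 could fit together, including the M3 port exclusion, is shown here. The port set and the two signature prefixes are illustrative placeholders and not the tables used in the study; the `Flow` type and function names are likewise assumptions for the example. Flows are processed in order, so M3 only flags flows seen after an M2 match.

```python
from typing import List, NamedTuple, Optional


class Flow(NamedTuple):
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    payload: bytes  # first captured payload bytes, possibly empty


# Illustrative values only (assumption), not the study's port or string tables.
KNOWN_P2P_PORTS = {1214, 4662, 6346, 6881}
SIGNATURES = {b"\x13BitTorrent": "BitTorrent", b"GNUTELLA": "Gnutella"}
# Ports excluded from M3, as listed on the slide.
M3_EXCLUDED_PORTS = {80, 8000, 8080, 25, 21, 22}


def m1(flow: Flow) -> bool:
    """M1: source or destination port is a well-known P2P port."""
    return flow.src_port in KNOWN_P2P_PORTS or flow.dst_port in KNOWN_P2P_PORTS


def m2(flow: Flow) -> Optional[str]:
    """M2: return the protocol name if the payload starts with a known string."""
    for prefix, name in SIGNATURES.items():
        if flow.payload.startswith(prefix):
            return name
    return None


def label_flows(flows: List[Flow]) -> List[str]:
    """Apply M2, then M3 for flows that touch a previously flagged IP."""
    p2p_ips = set()
    labels = []
    for flow in flows:
        protocol = m2(flow)
        if protocol is not None:
            p2p_ips.update((flow.src_ip, flow.dst_ip))
            labels.append(protocol)
        elif (flow.src_ip in p2p_ips or flow.dst_ip in p2p_ips) and not (
            {flow.src_port, flow.dst_port} & M3_EXCLUDED_PORTS
        ):
            labels.append("possible P2P")
        else:
            labels.append("non-P2P")
    return labels
```

In this sketch a host flagged through one matched flow has its later payload-less flows labelled 'possible P2P', except those on the excluded ports such as 80.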

Payload Analysis Wrapup
M3 allows us to identify IPs that participate in P2P flows. It can be used to overcome some of the limitations.
Most payload analysis will be done using M2.
Similar work by Sen et al. examines only 5 protocols but inspects the full payload. The difference should be minor, since characteristic bit strings appear at the beginning of the payload. We also have M3.

Non-Payload Identification of P2P Traffic
Overview:
- Examines only the packet header to detect P2P flows
- First attempt at P2P identification on arbitrary ports without any inspection of user payload
- Based on 2 heuristics:
1. TCP/UDP heuristic - examines source-destination IP pairs that use both TCP and UDP to transfer data
2. {IP, port} heuristic - studies how P2P peers connect to each other, via the connection characteristics of {IP, port} pairs

Algorithm in a Nutshell
1. Data processing:
- Build the flow table as packets cross the link (same 5-tuple as in the payload method)
- Collect information on various characteristics of {IP, port} pairs, packet sizes used, and transfer flow sizes
2. Identification of potential P2P pairs:
- Flag traffic as P2P based on TCP/UDP usage and {IP, port} characteristics
3. False positives:
- False positives are nonP2P traffic detected as P2P traffic
- Compare against a set of heuristics that identify mail servers, DNS flows, malware, etc.

Heuristics 1 : (TCP/UDP) IP Pairs
If a source-destination IP pair concurrently uses both TCP and UDP as transport protocols, the flows between this pair are considered P2P, so long as neither the source nor the destination port is in the set given in the excluded-port table.

TCP/UDP Heuristic …
- 6 out of the 9 analyzed P2P protocols use both TCP and UDP.
- Other applications, such as DNS and streaming media, also use both TCP and UDP.
- The nonP2P applications that use both TCP and UDP are identified by the ports in the table.
Bottom line:
- Flag all source-destination IP pairs that use both TCP and UDP as P2P, but exclude the ones using a port number listed in the table.
- 98.5% P2P identification.
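The TCP/UDP heuristic can be sketched as below, under a few assumptions. The excluded-port table is not part of this transcript, so it is passed in as a parameter; the flow tuple layout and the function name are hypothetical; and the exclusion is applied per IP pair (one flow on an excluded port removes the whole pair), which is one possible reading of the slide.

```python
from collections import defaultdict


def tcp_udp_pairs(flows, excluded_ports):
    """Return the (src_ip, dst_ip) pairs that used both TCP and UDP.

    `flows` is an iterable of hypothetical tuples
    (src_ip, dst_ip, proto, src_port, dst_port) observed in one time interval.
    `excluded_ports` stands in for the slide's table of nonP2P ports.
    """
    protocols = defaultdict(set)  # IP pair -> transport protocols seen
    excluded = set()              # IP pairs that touched an excluded port
    for src_ip, dst_ip, proto, src_port, dst_port in flows:
        pair = (src_ip, dst_ip)
        protocols[pair].add(proto)
        if src_port in excluded_ports or dst_port in excluded_ports:
            excluded.add(pair)
    return {
        pair
        for pair, seen in protocols.items()
        if {"tcp", "udp"} <= seen and pair not in excluded
    }
```

For example, with `excluded_ports={53}` a pair that exchanges DNS over both TCP and UDP is not flagged, while a pair using both protocols on unlisted ports is.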

Heuristics 2 : {IP, port} Pairs
Steps:
1. A new P2P host A wants to connect to the P2P network.
2. A connects to a super peer in its host cache and informs it of A's IP address and port.
3. The super peer propagates the {IP, port} pair to the rest of the P2P network.
4. Peers willing to connect to A use this {IP, port} pair, {A,1}.

{IP, port} pair heuristic…
Observation:
- The number of distinct IPs (C, B) connected to A is equal to the number of distinct ports (10, 15) used to connect to it.
- Web/HTTP is the exception, since a Web client typically opens several concurrent connections to the server, each from a different source port: “Web traffic will have a higher ratio than P2P traffic of the number of distinct ports versus number of distinct IPs connected to the {IP, port} pair {W,80}”

Methodology
Step 1: Choose a time interval “t” and build the flow table for the link, based on the 5-tuple key and the 64-second flow timeout.
Step 2: Apply heuristic 1. If a source-destination IP pair concurrently uses both TCP and UDP during time “t” and does not use any port from the given table, then it is P2P.
Step 3: Apply heuristic 2. Examine all source {srcIP, srcport} and destination {dstIP, dstport} pairs during “t”. If the number of distinct connected IPs is equal to the number of distinct connected ports, for both source and destination, then it is P2P.
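Step 3 can be illustrated with a small sketch. It tracks, for every {IP, port} pair, the distinct IPs and distinct ports seen at the other end, and keeps the pairs where the two counts match. The flow tuple layout and function name are assumptions, each pair is evaluated independently here, and the `min_ips` threshold is a hypothetical addition so that a pair seen in a single flow (1 IP, 1 port) is not flagged trivially.

```python
from collections import defaultdict


def ip_port_candidates(flows, min_ips=2):
    """Return {IP, port} pairs whose distinct peer IPs equal distinct peer ports.

    `flows` is an iterable of hypothetical tuples
    (src_ip, dst_ip, proto, src_port, dst_port) observed in one interval.
    `min_ips` is an illustrative threshold, not a value from the slides.
    """
    peer_ips = defaultdict(set)    # (ip, port) -> IPs seen at the other end
    peer_ports = defaultdict(set)  # (ip, port) -> ports seen at the other end
    for src_ip, dst_ip, _proto, src_port, dst_port in flows:
        peer_ips[(src_ip, src_port)].add(dst_ip)
        peer_ports[(src_ip, src_port)].add(dst_port)
        peer_ips[(dst_ip, dst_port)].add(src_ip)
        peer_ports[(dst_ip, dst_port)].add(src_port)
    return {
        pair
        for pair in peer_ips
        if len(peer_ips[pair]) == len(peer_ports[pair])
        and len(peer_ips[pair]) >= min_ips
    }
```

Using the slide's example, peers B and C connecting to {A,1} from ports 10 and 15 give 2 IPs and 2 ports, so {A,1} is a candidate; a Web server {W,80} contacted by 2 clients over 3 source ports is not.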

False Positives
A false positive classifies a nonP2P pair as P2P.
Different heuristics decrease the risk of false positives:
Mail (SMTP, POP)
DNS
Gaming and malware
Other heuristics (scans, one-packet pairs, MSN Messenger servers, port history)

1. Mail heuristic:
Mail false positives are common because mail connection behavior resembles the {IP, port} heuristic.
Detecting false positives: for the source pair { ,25}, the set of destination ports is [3267, 25, 50827, 3301, 3872]. Port 25 appears in this set, so it is inferred that the IP is a mail server, and all of its flows are considered nonP2P.
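The mail check above might look like the following sketch. It only covers the port 25 case from the slide's example; the flow tuple layout and function name are assumptions, and the host names in the usage are made up.

```python
from collections import defaultdict

SMTP_PORT = 25


def mail_server_ips(flows):
    """Return IPs inferred to be mail servers.

    An IP is flagged when its {IP, 25} pair also talks to port 25 at the
    other end, as in the slide's example. `flows` is an iterable of
    hypothetical tuples (src_ip, dst_ip, proto, src_port, dst_port).
    """
    peer_ports = defaultdict(set)  # (ip, port) -> ports seen at the other end
    for src_ip, dst_ip, _proto, src_port, dst_port in flows:
        peer_ports[(src_ip, src_port)].add(dst_port)
        peer_ports[(dst_ip, dst_port)].add(src_port)
    return {
        ip
        for (ip, port), seen in peer_ports.items()
        if port == SMTP_PORT and SMTP_PORT in seen
    }
```

All flows of the returned IPs would then be treated as nonP2P. A host that only receives client connections on port 25 is not flagged by this particular check.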

2. DNS Heuristics
- The DNS protocol runs on top of both TCP and UDP, on port 53.
- DNS connection patterns are analogous to the {IP, port} pair heuristic.
Detecting false positives: for all flows where the source port is equal to the destination port and the port number is less than 501, both the source and destination {IP, port} pairs are considered nonP2P and are inserted in a list of definitively nonP2P pairs. For example, the observed pairs { ,53}, { ,53} are considered nonP2P due to the use of port 53 as both source and destination port in the flow 5-tuple.
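The equal-port rule is simple enough to sketch directly. The flow tuple layout and function name are assumptions; the 501 bound is the one stated on the slide.

```python
def same_port_non_p2p_pairs(flows, port_limit=501):
    """Return {IP, port} pairs marked as definitively nonP2P.

    A flow whose source and destination ports are equal and below
    `port_limit` marks both of its endpoints. `flows` is an iterable of
    hypothetical tuples (src_ip, dst_ip, proto, src_port, dst_port).
    """
    non_p2p = set()
    for src_ip, dst_ip, _proto, src_port, dst_port in flows:
        if src_port == dst_port and src_port < port_limit:
            non_p2p.add((src_ip, src_port))
            non_p2p.add((dst_ip, dst_port))
    return non_p2p
```

A flow from port 53 to port 53 marks both endpoints, whereas a flow between two equal high ports (say 6346 on both sides) is left for the other heuristics.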

3. Gaming and Malware Heuristic:
- On-line gaming runs on top of UDP.
- Malware tends to run over TCP.
- Both are analogous to the {IP, port} heuristic. Ex: for pair { , 27015}, number of distinct IPs = 3 and ports = 3; for pair { ,1990}, number of distinct IPs = 1 and ports = 1.
- To remove such pairs, a set of distinct average packet sizes and a set of distinct total transfer sizes are maintained for each pair. Two sets of port numbers are also maintained.

Gaming and Malware Heuristics…
Algorithm: a pair {IP, port} is nonP2P if
{ length(pair.avg_sizesSet) == 1 or length(pair.avg_pktssizesSet) < 3 }
and port not in knownP2PPortsSet
and { length(pair.IPSet) > 5 or port < 501 or port in Malwareportset }

4. Other Heuristics for False Positives:
A. Scans:
- Count the number of {IP, port} pairs having a specific IP.
- Reject all IPs that appear in a large number of {IP, port} pairs.
B. One-packet pairs:
- Remove all one-packet flows whose IPs do not appear in any other flows in the trace.
C. MSN Messenger servers:
- Port 1863 and three distinct destination IPs within the same prefix.
D. Port history:
- P2P clients randomize the port at which they accept connections.
- Failure is rare, because a large fraction of P2P users would have to change their client’s listening port.
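As an illustration of the scan heuristic (A), the sketch below counts how many distinct {IP, port} pairs each IP appears in and rejects IPs above a threshold. The slide only says "a large number", so the `max_pairs` default is a hypothetical placeholder, as is the function name.

```python
from collections import Counter


def scanning_ips(pairs, max_pairs=100):
    """Return IPs that appear in more than `max_pairs` distinct {IP, port} pairs.

    `pairs` is an iterable of (ip, port) tuples; duplicates are ignored.
    `max_pairs` is an illustrative threshold, not a value from the slides.
    """
    counts = Counter(ip for ip, _port in set(pairs))
    return {ip for ip, n in counts.items() if n > max_pairs}
```

An IP seen on ten different ports is rejected when `max_pairs=5`, while an IP seen on two ports is kept.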

Evaluation
Fraction of identified P2P traffic: payload vs. non-payload method

Evaluation contd.
Non-payload method: identified vs. missed P2P flows

Evaluation contd.
False positives vs. correctly identified P2P flows and bytes

Evaluation contd.
98% of misclassifications were based on pairs with fewer than 5 distinct IPs in the IPSet, i.e., these false positives are due to an insufficient sample for the specific pairs.
Figure: CDF of the number of distinct IPs in the {IP, port} pairs leading to misclassification.

Robustness of PTP Algorithm
Figure: effect of the time interval on missed and false positive flows. The spike is caused by an address space scan in the trace.

Payload vs Non-Payload Analysis
Privacy issues
Anonymization of IP addresses
Storage overhead
Processing overhead
Reverse engineering of protocols
Encryption

Payload vs Non-Payload Analysis
Payload methods cannot identify “The Invisible”.
Payload methods can identify, NOT discover.

Claims
P2P traffic is declining sharply.
P2P user populations are dropping by as much as 50%.

But …
The paper argues that those claims are not correct and that overall P2P traffic has not decreased over time.
Reason: media reports are rarely based on measurement, and even less on classification; they are based on telephone surveys or periodic samples of log files.

Factors Affecting P2P Traffic Comparison
44-byte packets
MPLS
ISP caching
P2P versus copyrighted traffic
Link utilization and time of the day
Conflicting traffic engineering goals

44-byte packets
In older traces, CAIDA monitors capture 44 bytes of each packet.
TCP/IP headers are typically 40 bytes without options => 4 bytes of payload for examination.
The UDP header is only 8 bytes.
For a fair comparison, use only 4 bytes of payload for all traces.

MPLS
60%-80% of the packets in the traces have 4-byte MPLS headers.
MPLS decreases the number of packets that can be matched against the string table (since there is no payload info).
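The byte budget behind the last two slides can be worked through in a few lines. This sketch assumes a 20-byte IP header and a 20-byte TCP header without options, and that the 4-byte MPLS label counts against the 44 captured bytes; the function name is made up for the example.

```python
CAPTURE_LEN = 44  # bytes captured per packet in the older traces
IP_HEADER = 20    # IPv4 header without options (assumption)
TCP_HEADER = 20   # TCP header without options (assumption)
UDP_HEADER = 8
MPLS_LABEL = 4


def visible_payload(transport, mpls=False):
    """Return how many payload bytes remain inside the captured 44 bytes."""
    transport_header = TCP_HEADER if transport == "tcp" else UDP_HEADER
    headers = IP_HEADER + transport_header + (MPLS_LABEL if mpls else 0)
    return max(0, CAPTURE_LEN - headers)
```

Under these assumptions a TCP packet leaves 4 payload bytes (44 - 40), a UDP packet leaves 16, and a TCP packet carrying an MPLS label leaves none, which is why MPLS-encapsulated packets cannot be matched against the string table.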

ISP Caching
ISPs cache P2P content to improve P2P performance.
Cached P2P requests do not reach the backbone.
Results therefore differ from those of previous years, before ISP caching took place.

P2P versus Copyrighted Traffic
The study cannot identify trends in the use of P2P networks for exchanging copyrighted material.

Link Utilization and Time of the Day
Different times of the day produce different results on the same link.
Therefore, compare P2P traffic relative to the total traffic volume.

Conflicting Traffic Engineering Goals
ISPs attempt to manipulate P2P traffic according to their economic objectives => this affects P2P traffic.
Networks that pay for transit
Networks that charge for transit

Result
Compare: D09 (May 03), D10 (Jan 04), D11 (Feb 04), and D13 (Apr 04).
Northbound and southbound directions.
Payload method (M1, M2, and M3) and non-payload method.
P2P volume is presented as a percentage of total traffic volume.

P2P traffic either stays the same or increases
Measurements based on port numbers underestimate P2P traffic by more than 50%.
Effectiveness of payload sizes: 4-byte vs 16-byte.
M3 vs non-payload.

Conclusion
The paper focuses on non-payload identification of P2P traffic, because P2P protocols can operate on arbitrary ports.
It develops an algorithm based on profiling the flow patterns of IP addresses.
The algorithm can identify “unknown” P2P protocols.

Future Works
Exploit the availability of bidirectional traces by merging IP pairs that appear in both directions of the link.
Consider additional heuristics that use knowledge of specific packet sizes that may reflect control traffic of P2P protocols.

Discussion
Pros:
Simple, effective methodology.
Can detect unknown P2P protocols.
Can bypass privacy issues.
Cons:
Only good for P2P apps that are not trying to hide.
The authors do not clearly specify whether they go back to the original data to benchmark the payload/non-payload results.
