Revealing Skype Traffic: When Randomness Plays with You
D. Bonfiglio (1), M. Mellia (1), M. Meo (1), D. Rossi (2), P. Tofanelli (3)
(1) Dipartimento di Elettronica, Politecnico di Torino; (2) ENST Télécom Paris; (3) Motorola Inc.
ACM SIGCOMM 2007
Presented by Te-Yuan Huang
Outline
Goal; Contribution; Know More about Skype; Classifiers; Experiments; Conclusions
Goal
Identify Skype traffic within aggregated traffic: direct sessions, over either UDP or TCP.
The algorithm should work in real time, be reliable, and be able to detect short flows (lasting only several seconds).
Importance of Skype Traffic Identification
Of interest to network operators for: network design and provisioning; traffic and performance monitoring; tariff policies; traffic differentiation.
Difference from Related Work
K.T. Chen et al., "Quantifying Skype USI": identifies only UDP traffic; needs the Skype login phase to be monitored; fails on backbone links; fails if the Skype login procedure is modified in any way.
K. Suh et al., "Characterizing and Detecting Relayed Traffic: A Case Study Using Skype": identifies only relayed Skype traffic.
Let's get our hands dirty: know more about Skype traffic sources
A Skype message
Skype Parameters
Rate: the codec rate.
Delta T: the Skype message framing time, i.e. the time between two subsequent Skype messages.
RF (Redundancy Factor): the number of past blocks that Skype retransmits.
Parameters Change with Network Conditions
Skype Communication Mode
End-to-End (E2E): a Skype user calls another Skype user.
End-to-Out (E2O): SkypeIn/SkypeOut, with the PSTN involved; voice data only (no video, file transfer, or IM).
Skype Codec
Codecs are selected automatically. ISAC is the preferred codec for E2E; G.729 is the preferred codec for E2O.
More on Skype Message
Skype encrypts its messages.
TCP: reliable transport; packets are received in the correct sequence (from the application layer's point of view), so the whole content of the message is encrypted.
UDP: unreliable transport; packets may arrive out of order, so an application-layer header is needed to restore the correct order. That header can only be obfuscated, so only part of the message is encrypted.
TCP E2E Message
All ciphered.
Layout: Frame, starting at byte 1.
UDP E2E Message
Identified fields:
ID: 16-bit identifier, randomly selected.
Fun: 5-bit field, masked by 0x8f, stating the payload type: 0x02, 0x03, 0x07, 0x0f for signaling messages; 0x0d for data messages (all 4 types of data). Not random, but obfuscated (mixed).
Frame: ciphered information.
Layout: bytes 1-2 ID, byte 3 Fun, byte 4 onward Frame.
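The field layout above can be illustrated with a minimal Python sketch. It assumes the ID occupies bytes 1-2 and Fun sits in byte 3, as the slide's layout suggests; the function name `parse_udp_e2e_som` and the big-endian reading of the ID are illustrative assumptions, not something the slides specify.

```python
FUN_MASK = 0x8F                          # mask applied to byte 3, per the slide
SIGNALING_FUN = {0x02, 0x03, 0x07, 0x0F}  # Fun values for signaling messages
DATA_FUN = 0x0D                           # Fun value for data messages


def parse_udp_e2e_som(payload: bytes):
    """Return (msg_id, fun, kind) for a UDP payload, or None if too short.

    Hypothetical helper: byte order of the ID is an assumption.
    """
    if len(payload) < 3:
        return None
    msg_id = int.from_bytes(payload[0:2], "big")  # 16-bit random identifier
    fun = payload[2] & FUN_MASK                    # strip the obfuscating bits
    if fun in SIGNALING_FUN:
        kind = "signaling"
    elif fun == DATA_FUN:
        kind = "data"
    else:
        kind = "unknown"
    return msg_id, fun, kind
```

For example, a third byte of 0x7d masks down to 0x0d, so the message is read as a data message even though three of its bits were obfuscated.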
E2O Message
Identified field:
CID: 4-byte Connection Identifier of the PSTN gateway; deterministic after the initial signaling.
Layout: bytes 1-4 CID, followed by Frame.
How to Identify Skype Traffic?
Chi-Square Classifier (CSC): uses knowledge of the ciphering mechanism.
Naïve Bayes Classifier (NBC): uses the general characteristics of VoIP traffic.
Payload-Based Classifier (PBC): looks into the non-ciphered SoM; used only for traffic over UDP.
Chi-Square Classifier (CSC)
Purpose: to know which portions of a message are encrypted.
Rationale, given a message:
Only the third byte is not random: probably an E2E Skype flow over UDP.
The first four bytes are deterministic and the others are ciphered: probably an E2O Skype flow over UDP.
The whole message is ciphered: probably a Skype flow transported by TCP.
Chi-Square Classifier (CSC) – Cont.
Chi-square distribution: observe an object's output n_TOT times. There are n possible outputs. The i-th output is expected to occur E_i times among the n_TOT observations and is observed to occur O_i times. Then
$$\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}$$
follows a chi-square distribution with n-1 degrees of freedom.
Chi-Square Classifier (CSC) – Cont.
For each flow, take the first G groups of b bits of each message. For each group g there are 2^b possible outputs. If the content of the flow is random, then E_i for each group is n_TOT / 2^b.
[Figure: a message split into groups 1, 2, 3, ..., G of b bits each]
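Chi-Square Classifier (CSC) – Cont.
Evaluate the test statistic for each group g as
$$\chi^2_g = \sum_{i=1}^{2^b} \frac{(O_i^g - E_i)^2}{E_i}, \qquad E_i = \frac{n_{TOT}}{2^b}$$
where O_i^g is the number of times the i-th value is observed in group g.
Define the thresholds by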
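A minimal Python sketch of the per-group statistic, assuming each message contributes its first G groups of b bits and that the expected count for every value is n_TOT / 2^b. The helper names `group_values` and `chi_square_per_group` are illustrative, not taken from the paper.

```python
def group_values(payload: bytes, G: int = 16, b: int = 4):
    """Split the first G*b bits of a payload into G integers of b bits each."""
    nbits = G * b
    nbytes = (nbits + 7) // 8
    if len(payload) < nbytes:
        raise ValueError("payload shorter than G*b bits")
    as_int = int.from_bytes(payload[:nbytes], "big")
    as_int >>= nbytes * 8 - nbits          # drop padding bits, if any
    mask = (1 << b) - 1
    return [(as_int >> (nbits - (g + 1) * b)) & mask for g in range(G)]


def chi_square_per_group(messages, G: int = 16, b: int = 4):
    """Return a list of G chi-square statistics, one per group of b bits."""
    n_tot = len(messages)
    expected = n_tot / 2 ** b              # E_i under the randomness hypothesis
    counts = [[0] * (2 ** b) for _ in range(G)]
    for msg in messages:
        for g, value in enumerate(group_values(msg, G, b)):
            counts[g][value] += 1          # O_i for group g
    return [sum((o - expected) ** 2 / expected for o in row) for row in counts]
```

Feeding it 100 identical messages gives 1500 for every group, the value a fully deterministic 4-bit block is expected to reach.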
Chi-Square Classifier (CSC) – Cont.
G = 16 groups of b = 4 bits are used.
E2E over UDP: block g = 5 or 6 is mixed; the others are random.
Classification criteria
Chi-Square Classifier (CSC) – Cont.
E2O over UDP. E2E or E2O over TCP. Otherwise: not Skype.
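The exact criteria are not reproduced in this transcript, so the following Python sketch only illustrates the rationale stated earlier in the deck: with G = 16 and b = 4, byte 3 corresponds to groups 5 and 6 and the first four bytes to groups 1 through 8. The function name `classify_csc`, the rule "a group is random if its statistic is below the threshold", and the choice to require at least one of groups 5 and 6 to be non-random are all assumptions; the threshold value is left to the caller.

```python
def classify_csc(chi2, threshold):
    """Classify a flow from its per-group chi-square statistics.

    chi2: list of G statistics (G >= 9), group 1 first.
    threshold: hypothetical cut-off; below it a group is treated as random.
    """
    G = len(chi2)
    if G < 9:
        raise ValueError("need at least 9 groups")
    is_random = [value < threshold for value in chi2]

    def all_random(groups):
        return all(is_random[g - 1] for g in groups)

    def none_random(groups):
        return not any(is_random[g - 1] for g in groups)

    every_group = range(1, G + 1)
    if all_random(every_group):
        return "TCP E2E/E2O"               # whole message ciphered
    others = [g for g in every_group if g not in (5, 6)]
    if all_random(others) and not all_random([5, 6]):
        return "UDP E2E"                   # only byte 3 is not random
    if none_random(range(1, 9)) and all_random(range(9, G + 1)):
        return "UDP E2O"                   # first four bytes deterministic
    return "not Skype"
```

A stricter variant could require both groups 5 and 6 to exceed the threshold; which one matches the paper's criteria cannot be decided from the slides alone.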
Chi-Square Classifier (CSC) – Cont.
Deterministic block: the test statistic grows linearly with n_TOT.
Chi-Square Classifier (CSC) – Cont.
Mixed block: if one bit is fixed and the others are random, the test statistic also increases linearly with n_TOT.
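The linear growth can be checked with a short derivation. This is a sketch under stated assumptions: the expected count is E_i = n_TOT / 2^b for every value, and in the mixed case the non-fixed bits are uniformly random and independent across messages.

```latex
% Deterministic block: one value is observed n_TOT times, the other 2^b - 1 never.
\chi^2_{\mathrm{det}}
  = \frac{\left(n_{TOT} - \frac{n_{TOT}}{2^b}\right)^2}{\frac{n_{TOT}}{2^b}}
    + (2^b - 1)\,\frac{n_{TOT}}{2^b}
  = n_{TOT}\,(2^b - 1)

% Mixed block, one bit fixed: half of the values never occur and the other half
% occur on average 2 E_i times each, which gives in expectation
\mathbb{E}\!\left[\chi^2_{\mathrm{mix}}\right] = n_{TOT} + 2^b - 2

% With b = 4 and n_TOT = 100: chi^2_det = 1500 and E[chi^2_mix] = 114.
```

By comparison, a fully random block has an expected statistic of 2^b - 1 = 15 regardless of n_TOT, which is why longer observation separates the three cases more clearly.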
Chi-Square Classifier (CSC) – Cont.
The chi-square test works only if the number of observations is large enough, that is, E_i = n_TOT / 2^b >= 5. With b = 4 this means n_TOT >= 5 x 2^4 = 80. Choose n_TOT = 100. Also, set
Naïve Bayes Classifier
Feature vector x = [x_i].
P{C|x}: the probability that the object belongs to class C, given that feature x is observed.
P{x|C}: the probability that feature x is observed, given that the object belongs to class C.
Bayes' rule: P{C|x} = P{x|C} P{C} / P{x}
Naïve Bayes Classifier – Cont.
Naïve: the features are assumed independent, so $P\{x|C\} = \prod_i P\{x_i|C\}$.
P{x|C} is called the belief.
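A minimal Python sketch of Bayes' rule under the independence assumption. The function name `naive_bayes_posterior`, the class labels, and the numbers in the usage note are illustrative assumptions, not values from the paper.

```python
import math


def naive_bayes_posterior(priors, likelihoods):
    """Compute P{C|x} for every class C.

    priors: dict mapping class name -> P{C}
    likelihoods: dict mapping class name -> list of per-feature P{x_i|C}
    """
    # Naive assumption: P{x|C} is the product of the per-feature likelihoods.
    joint = {c: priors[c] * math.prod(likelihoods[c]) for c in priors}
    evidence = sum(joint.values())          # P{x}, summed over all classes
    return {c: value / evidence for c, value in joint.items()}
```

With equal priors and likelihoods [0.8, 0.5] for one class against [0.2, 0.5] for the other, the posterior comes out as 0.8 against 0.2.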
NBC – Feature Selection
VoIP traffic: small message size; less bursty than data traffic.
Features:
Message size: observe a window of w messages at a time, x = [s_1, s_2, ..., s_w].
Average inter-packet gap (average-IPG).
NBC – Feature Selection
Belief: how to determine P{s_i|C} &
NBC – Feature Characterization
For each codec, the message size is determined by: the rate; the header length; the redundancy factor (RF); the message framing time (Delta T).
The message size can be modeled by a Gaussian distribution.
NBC – Feature Characterization
Map each codec to a Gaussian distribution.
Model the average-IPG as a Gaussian distribution, with
For constant bit rate codecs
For variable bit rate codecs
NBC – Derive Beliefs
NBC – Make Decision
Let
Define a threshold B_min.
If B > B_min: valid Skype flow. Otherwise: not a Skype flow.
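The decision step can be sketched in Python as follows. Because the defining equation of B is not reproduced in this transcript, the sketch assumes the belief is the average Gaussian log-likelihood of the message sizes in a window, maximized over the codec models; the names `window_belief` and `is_skype_flow`, the codec parameters, and the value of `b_min` in the usage note are all hypothetical.

```python
import math


def gaussian_log_pdf(x, mean, std):
    """Log-density of a Gaussian with the given mean and standard deviation."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)


def window_belief(sizes, mean, std):
    """Average log-likelihood of a window of message sizes (assumed belief)."""
    return sum(gaussian_log_pdf(s, mean, std) for s in sizes) / len(sizes)


def is_skype_flow(sizes, codec_models, b_min):
    """Return True if the best belief over all codec models exceeds b_min.

    codec_models: dict mapping codec name -> (mean, std) of the message size.
    """
    best = max(window_belief(sizes, mean, std) for mean, std in codec_models.values())
    return best > b_min
```

With a hypothetical model of mean 100 bytes and standard deviation 10 bytes and `b_min = -5`, a window of 100-byte messages is accepted while a window of 1400-byte messages is rejected.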
Payload Based Classifier (PBC)
Used as a cross-check for the previous two classifiers. Only useful for UDP traffic.
Two parts: per-flow identification; per-host identification.
PBC - Per-flow Identification
Uses the knowledge about the UDP E2E message.
Fun: 5-bit field, masked by 0x8f, stating the payload type: 0x02, 0x03, 0x07, 0x0f for signaling messages; 0x0d for data messages (all 4 types of data).
Layout: bytes 1-2 ID, byte 3 Fun, byte 4 onward Frame.
PBC - Per-flow Identification
Terminology:
n_TOT: the total number of packets in the flow.
n_sig: the number of Skype signaling messages.
n_E2E: the number of Skype E2E data/video/chat/voice messages.
n_E2O: the number of Skype E2O voice messages.
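The counters above can be gathered with a short Python sketch. It assumes Fun is read from byte 3 with the 0x8f mask and that E2O messages are recognized by a 4-byte CID already learned from the initial signaling; the function name `count_pbc_messages` and the `cid` parameter are illustrative assumptions, and the per-flow criteria applied to these counters are not part of the sketch.

```python
FUN_MASK = 0x8F
SIGNALING_FUN = {0x02, 0x03, 0x07, 0x0F}
DATA_FUN = 0x0D


def count_pbc_messages(payloads, cid=None):
    """Count n_tot, n_sig, n_e2e and n_e2o over the UDP payloads of one flow.

    cid: optional 4-byte connection identifier (hypothetical input); when
    given, payloads starting with it are counted as E2O messages.
    """
    counts = {"n_tot": 0, "n_sig": 0, "n_e2e": 0, "n_e2o": 0}
    for payload in payloads:
        counts["n_tot"] += 1
        if cid is not None and payload[:4] == cid:
            counts["n_e2o"] += 1
            continue
        if len(payload) < 3:
            continue                         # too short to carry a Fun byte
        fun = payload[2] & FUN_MASK
        if fun in SIGNALING_FUN:
            counts["n_sig"] += 1
        elif fun == DATA_FUN:
            counts["n_e2e"] += 1
    return counts
```

The ratios n_sig / n_tot, n_e2e / n_tot and n_e2o / n_tot can then be compared against whatever per-flow thresholds are chosen.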
PBC - Per-flow Identification Criteria
PBC - Per-host Identification
Known: a Skype client always uses the same UDP port to send and receive traffic. Before a conversation starts, signaling messages are exchanged between the two clients.
This makes it possible to identify a Skype client running at a specific IP address and port.
PBC - Per-host Identification Criteria to identify the Skype client IP/port
Experiment
Two data sets:
Campus: 95 hours, captured starting 2006/5/29. No P2P traffic is allowed; most traffic consists of TCP data flows.
ISP: one day, captured on 2006/5/15. All traffic is allowed; more heterogeneous; little Skype traffic is expected.
Measurement Result
Measurement Result – UDP, Campus
Measurement Result – UDP, ISP
Measurement Result - TCP
Parameter Tuning - B min
Parameter Tuning – χ²(Thr)
Parameter Tuning – B_min & χ²(Thr)
Conclusion
Reveals Skype traffic within aggregate streams of packets.
Two approaches: statistical properties of randomness; stochastic characteristics of voice traffic.
Negligible false positives; few false negatives left out.