Balancing Risk and Utility in Flow Trace Anonymization
Martin Burkhart, ETH Zurich
burkhart@tik.ee.ethz.ch
Joint work with Daniela Brauckhoff, Elisa Boschi, Martin May
Motivation
- Sharing of traffic measurements is crucial
  - Only a limited set of sources available
  - Reproducibility of results
  - Dynamics / variability of traffic
  - Get the big picture (e.g. Internet Storm Center)
  - Keep up with globalized attacks (e.g. botnets)
- More and more traces are collected but not shared
  - Data protection legislation
  - Security concerns
  - Competitive advantage
State-Of-The-Art: Anonymization
- Black marking
- Truncation
  - E.g. last bits of IP addresses
- Permutation
  - Random
  - (Partial) prefix-preserving IP address permutation
- Enumeration
  - E.g. timestamps: keep the logical order of events
- Categorization
- Randomization (data mining community)
- K-anonymity (data mining community)
The Tradeoff in Anonymization
- It's a trade-off
- RU-Maps
  - t: anonymization strength
  - X-axis: Utility(t)
  - Y-axis: Risk(t)
- Not quantitatively studied, lack of metrics
- Strongly dependent on the application / attacker model
[Figure: RU-map of Risk(t) against Utility(t). Legend entry "Algorithm X"; plotted points labeled t=0.1, t=0.2, t=0.4, t=0.7, Prefix Pres. and Random Perm.; one region is marked "Sweet Spot".]
A Case Study: IP Address Truncation
- Techniques that permute IP addresses 1:1 are reversible
  - Characteristic object sizes/frequencies, behavioral profiling, fingerprint active ports, exploit prefix structure
- Apply IP address truncation and evaluate the risk and utility dimensions
  - Lower risk: hosts are aggregated to subnets
  - Lower utility: resolution of entities is reduced
- Quantifying the tradeoff: how bad is it in numbers?

IP address       8 bits trunc.    16 bits trunc.
123.45.67.89     123.45.67.0      123.45.0.0
123.45.67.123    123.45.67.0      123.45.0.0
123.45.12.34     123.45.12.0      123.45.0.0
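The truncation in the table above can be reproduced with a few lines of Python. This is a minimal sketch: the function name `truncate_ipv4` is hypothetical and not taken from the slides, and it assumes truncation means zeroing the lowest x bits of an IPv4 address, which matches the example addresses shown.

```python
import ipaddress


def truncate_ipv4(addr: str, bits: int) -> str:
    """Zero the lowest `bits` bits of an IPv4 address (hypothetical helper)."""
    if not 0 <= bits <= 32:
        raise ValueError("bits must be between 0 and 32")
    value = int(ipaddress.IPv4Address(addr))
    # Keep only the upper (32 - bits) bits of the address.
    mask = (0xFFFFFFFF << bits) & 0xFFFFFFFF
    return str(ipaddress.IPv4Address(value & mask))


# The three example addresses from the slide.
for ip in ["123.45.67.89", "123.45.67.123", "123.45.12.34"]:
    print(ip, truncate_ipv4(ip, 8), truncate_ipv4(ip, 16))
```

After truncating 8 bits, the first two hosts can no longer be told apart, and after 16 bits all three collapse into the same prefix. That is both the risk reduction and the utility loss in miniature.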
Internal vs. External Prefixes
- Asymmetry in prefixes: is this reflected in
  - Risk reduction?
  - Utility reduction?
[Figure: unique count (log scale) vs. prefix length (32-x) for external and internal (AS 559) prefixes; annotations "Factor 3" and "Factor 53" at x = 8.]
Measuring Utility of Truncated Data
- Specific application: anomaly detection
- Compare detection quality of scans and (D)DoS attacks in original and truncated data
- Two IP-based metrics
  - Unique address count
  - Address entropy
- 3 weeks of NetFlow data
  - ~ 43 billion flows
  - SWITCH network
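The two metrics can be sketched as follows for one time window of flow records. The slides do not give the exact definitions, so this assumes that "address entropy" is the Shannon entropy (in bits) of the empirical distribution of addresses over flows. All function names are hypothetical.

```python
import ipaddress
import math
from collections import Counter


def truncate_ipv4(addr, bits):
    """Zero the lowest `bits` bits of an IPv4 address (hypothetical helper)."""
    mask = (0xFFFFFFFF << bits) & 0xFFFFFFFF
    return str(ipaddress.IPv4Address(int(ipaddress.IPv4Address(addr)) & mask))


def unique_count(addresses):
    """Number of distinct addresses seen in the window."""
    return len(set(addresses))


def address_entropy(addresses):
    """Shannon entropy (bits) of the address distribution; assumed definition."""
    counts = Counter(addresses)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# One address per flow in a toy window (made-up values for illustration only).
window = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.1.1"]
truncated = [truncate_ipv4(a, 8) for a in window]

print(unique_count(window), address_entropy(window))
print(unique_count(truncated), address_entropy(truncated))
```

In this toy window, truncating 8 bits halves the unique count (4 to 2) and lowers the entropy from 2.0 to about 0.81 bits. Running the same two functions on original and truncated traces yields the metric time series that the detection step compares.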
Measuring Detection Quality
- Ground truth: manual identification of scans/(D)DoS attacks
- Run a Kalman filter on metric timeseries
- Utility measured by AUC (area under the ROC curve)
  - Vary threshold
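The evaluation pipeline above can be sketched end to end. The slides do not specify the filter model, so this assumes a scalar random-walk Kalman filter whose absolute innovation (prediction error) serves as the anomaly score. The noise parameters, function names and toy series are all hypothetical. The AUC is obtained by sweeping the detection threshold over all observed scores and integrating the resulting ROC curve with the trapezoidal rule.

```python
def kalman_residuals(series, process_var=1.0, measurement_var=4.0):
    """Absolute innovations of a scalar random-walk Kalman filter (assumed model)."""
    estimate, error_var = series[0], 1.0
    residuals = []
    for z in series:
        # Predict: a random walk keeps the estimate; uncertainty grows.
        error_var += process_var
        innovation = z - estimate
        residuals.append(abs(innovation))
        # Update the estimate with the new measurement.
        gain = error_var / (error_var + measurement_var)
        estimate += gain * innovation
        error_var *= 1.0 - gain
    return residuals


def roc_auc(scores, labels):
    """Area under the ROC curve, built by varying the detection threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one anomalous and one normal sample")
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        flagged = [y for s, y in zip(scores, labels) if s >= threshold]
        tp = sum(flagged)
        fp = len(flagged) - tp
        points.append((fp / negatives, tp / positives))
    # Trapezoidal integration over (false positive rate, true positive rate).
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Toy metric time series with one labeled anomaly (made-up values).
series = [100, 101, 99, 100, 500, 100, 101]
labels = [0, 0, 0, 0, 1, 0, 0]
print(roc_auc(kalman_residuals(series), labels))
```

Computing this AUC once on the original trace and once per truncation level gives one utility value per x, which is what the utility comparison reports.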
Utility of Truncated Data
- Internal metrics degrade faster than external metrics
- Counts degrade faster than entropy
Approximating Risk of Host Identification
- In general: truncation of x bits leads to 2^(32-x) prefixes with 2^x addresses per prefix
- But: only a fraction (A) of potential addresses is usually active
- Hence: on average A*2^x active addresses per prefix
[Figure: example subnet 129.130.80.x with host addresses 1, 2, 3, ... 255; e.g. A = 10%]
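The arithmetic above is easy to check numerically. The slide states the expected number of active addresses per prefix, A*2^x. Turning that into a risk value requires an extra assumption, made here for illustration only: the attacker picks uniformly among the active hosts in a prefix, so the risk is the reciprocal of that number, capped at 1. The function names are hypothetical.

```python
def prefix_count(x):
    """Number of distinct prefixes left after truncating x bits."""
    return 2 ** (32 - x)


def active_hosts_per_prefix(x, active_fraction):
    """Expected number of active addresses per prefix: A * 2^x."""
    if not 0 < active_fraction <= 1:
        raise ValueError("active_fraction must be in (0, 1]")
    return active_fraction * 2 ** x


def identification_risk(x, active_fraction):
    """Assumed risk model: uniform guess among the active hosts in a prefix."""
    return min(1.0, 1.0 / active_hosts_per_prefix(x, active_fraction))


# Example from the slide: A = 10%.
for x in (0, 4, 8, 12, 16):
    print(x, prefix_count(x),
          active_hosts_per_prefix(x, 0.10),
          identification_risk(x, 0.10))
```

With A = 10%, truncating 8 bits leaves about 25.6 active hosts per prefix, so this model gives a risk of roughly 0.039. A smaller A means fewer hosts to hide among and therefore a higher risk at the same x.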
Risk of Truncated Data
- Risk for external addresses is higher due to sparsity!
- Constant offset:
[Figure legend: internal (total: 2.2 million), external (total: 4.3 billion)]
The Risk-Utility Tradeoff
[Figure: risk-utility plot with points for no truncation, 4 bits, 8 bits, 12 bits and 16 bits; the best tradeoff is marked.]

Metric              x     Utility    Risk
internal entropy    8     0.94       0.035
internal entropy    12    0.87       0.002
external entropy    16    0.97       0.02
Conclusion
- We made a quantitative evaluation of the risk-utility tradeoff in anonymization
- Entropy is much more resistant to truncation than unique counts
- Risk and utility degrade faster for internal addresses
- For detection of scans and (D)DoS attacks, it is possible to get a good tradeoff with high utility and low risk
Thank You for the Attention