Baik Hoh, Marco Gruteser, Hui Xiong, Ansaf Alrabady
All images are credited to “ACM”, Hoh et al. (2007), pp
Problem
GPS traces are taken from “probe” vehicles to provide services, here a traffic monitoring application. Each trace carries GPS location, heading, and speed data. Other research has shown that even if this data is anonymized, individual routes can be identified.
Problem: Traffic Monitoring
GPS points are mapped to a road segment, the average speed of those vectors is calculated, and congestion is inferred. Requirements: spatial accuracy and road coverage, achieved by “penetration rate”. Initial deployments fall short, and privacy suffers.
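The three steps above (map to a segment, average the speeds, infer congestion) can be sketched as follows. This is only an illustration, not the authors' pipeline: it assumes the samples are already map-matched to a segment id, and the `ratio` rule comparing mean speed against a free-flow speed is a hypothetical congestion test.

```python
from collections import defaultdict
from statistics import mean


def mean_speed_by_segment(samples):
    """samples: iterable of (segment_id, speed_kmh) pairs, already map-matched."""
    speeds = defaultdict(list)
    for segment_id, speed_kmh in samples:
        speeds[segment_id].append(speed_kmh)
    return {segment_id: mean(values) for segment_id, values in speeds.items()}


def congested_segments(samples, free_flow_kmh, ratio=0.5):
    """Flag segments whose mean probe speed is below ratio * free-flow speed.

    The ratio rule is an assumption made for this sketch.
    """
    means = mean_speed_by_segment(samples)
    return {seg for seg, speed in means.items() if speed < ratio * free_flow_kmh[seg]}
```

A segment with no probe samples never appears in the result, which is why road coverage depends on the penetration rate.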
Problem: How Privacy Is Compromised
Individuals can be identified by the starting and ending points in the GPS trace. Data points can be linked together using target tracking and “Maximum Likelihood Detection”: for a set of possible points, select the point with the highest probability of belonging to this route.
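The linking step can be illustrated with a small sketch. It assumes a simple dead-reckoning prediction from the last known position, heading, and speed, and an isotropic Gaussian error model around that prediction. Both are assumptions made for illustration; the function names and the `sigma_m` value are hypothetical.

```python
import math


def predict(x, y, heading_deg, speed_mps, dt_s):
    """Dead-reckon the next position; heading is degrees clockwise from north."""
    theta = math.radians(heading_deg)
    return (x + speed_mps * dt_s * math.sin(theta),
            y + speed_mps * dt_s * math.cos(theta))


def likelihood(candidate, predicted, sigma_m):
    """Unnormalized Gaussian likelihood of a candidate given the predicted position."""
    d2 = (candidate[0] - predicted[0]) ** 2 + (candidate[1] - predicted[1]) ** 2
    return math.exp(-d2 / (2 * sigma_m ** 2))


def most_likely_next(last, candidates, dt_s=60.0, sigma_m=200.0):
    """last: (x, y, heading_deg, speed_mps). Returns the index of the best candidate."""
    predicted = predict(*last, dt_s)
    scores = [likelihood(c, predicted, sigma_m) for c in candidates]
    return max(range(len(candidates)), key=scores.__getitem__)
```

Repeating this selection from one time slice to the next is what lets an attacker stitch anonymous samples back into a route.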
Problem: Existing Algorithms
Existing anonymity algorithms severely degrade the utility of the data.
Problem: Existing Algorithms
K-anonymity using CliqueCloak modifies trace data beyond usability, even though it is thought to be the most accurate system with any anonymity guarantee. Even with an anonymity set as small as 3, location accuracy drops down to between m, even with 2000 probes. Increasing the penetration rate would help, but higher penetration rates are not possible in early deployment, and lower-density areas of the map would never be accurate.
Factors to Consider
The longer an attacker can follow an individual trace, the better they can guess who the driver is and where the driver is going.
Relative Weighted Coverage Metric
When samples are withheld, road coverage decreases. Congestion monitoring is more important on popular routes, so those routes count for more in the metric. Coverage is limited by the original data set, so it cannot improve; it can only go down. High level: the metric measures the coverage delta between the original data set and the confused data set. It is a measure of data quality.
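One way to read the metric is as a weighted ratio: the weight of the road segments still covered after filtering, divided by the weight of the segments covered by the original data. The sketch below follows that reading. The per-segment `weight` table is a hypothetical input, and the exact weighting used by the authors is not given in these slides.

```python
def relative_weighted_coverage(original_segments, released_segments, weight):
    """Weighted share of originally covered segments that remain covered.

    weight: dict mapping segment id to its importance (hypothetical values).
    """
    original = set(original_segments)
    released = set(released_segments) & original  # filtering cannot add coverage
    total = sum(weight[s] for s in original)
    if total == 0:
        return 0.0
    return sum(weight[s] for s in released) / total
```

Because the released set is intersected with the original set, the result is at most 1.0, which matches the observation that coverage can only go down.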
Time-to-confusion Metric
The mean time-to-confusion (MTTC) is meant to be a measurement of privacy. The lower the average trackable trip time, the more privacy each individual has in the overall system. The time-to-confusion threshold is the limit on how long an individual can be tracked. High level: time-to-confusion is the length of time an individual can be “tracked” after de-anonymization.
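A minimal sketch of how a mean time-to-confusion could be computed is shown below. It assumes each trip is already annotated with a `confused` flag per sample (true where the tracker loses the vehicle), and it measures a tracked stretch as the time from its first to its last sample. Both are simplifying assumptions, and the function names are hypothetical.

```python
from statistics import mean


def tracking_durations(trip):
    """trip: list of (timestamp_s, confused) in time order.

    Returns the duration of each stretch of uninterrupted tracking.
    """
    durations = []
    start = None
    prev = None
    for timestamp_s, confused in trip:
        if confused:
            if start is not None:
                durations.append(prev - start)
            start = None
        else:
            if start is None:
                start = timestamp_s
            prev = timestamp_s
    if start is not None:
        durations.append(prev - start)
    return durations


def mean_ttc(trips):
    """Mean tracked duration over all stretches in all trips (0.0 if there are none)."""
    durations = [d for trip in trips for d in tracking_durations(trip)]
    return mean(durations) if durations else 0.0
```

The maximum over the same durations gives the worst case for any single stretch.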
“Uncertainty-aware” Algorithm
Calculates the probability that a particular point belongs to a “trip” and verifies that the trip cannot be followed, because other points exist that could just as probably fit that trip. High level: ensures that a specific level of uncertainty is maintained for every “trip” in the trace data.
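One common way to quantify “other points could just as probably fit” is the Shannon entropy of the candidate probabilities. The sketch below uses that measure as an assumption; the slides do not state the authors' exact formulation, and `min_entropy_bits` is a hypothetical threshold.

```python
import math


def normalize(likelihoods):
    """Turn non-negative candidate likelihoods into probabilities."""
    total = sum(likelihoods)
    if total <= 0:
        raise ValueError("at least one likelihood must be positive")
    return [value / total for value in likelihoods]


def entropy_bits(probabilities):
    """Shannon entropy in bits; zero-probability candidates contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def is_confused(likelihoods, min_entropy_bits):
    """True when the candidates are ambiguous enough that the trip cannot be followed."""
    return entropy_bits(normalize(likelihoods)) >= min_entropy_bits
```

Two equally likely candidates give exactly 1 bit of entropy, while a single dominant candidate gives a value close to 0.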
Put It All Together
Given all the points in a particular slice of time, if a single point could have been tracked longer than the time-to-confusion threshold, AND the point in this time slice can be correlated to that trace with high probability, that point is omitted from the set of published data. This allows tracking for a limited time but prevents tracking the entire trip. The starting location and ending location are not connected, so it is not possible to identify who the individual is or where they are going; thus privacy is preserved. Mean time-to-confusion is the average time between omitted points on a “trip”.
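The release rule described above can be sketched as a per-point decision. This is a simplified illustration under stated assumptions: the tracked duration and the link probability for each point are taken as given inputs, and the default `max_link_probability` is hypothetical. The 300-second default mirrors the 5-minute threshold used in the results.

```python
def should_release(tracked_so_far_s, link_probability,
                   max_ttc_s=300.0, max_link_probability=0.95):
    """Withhold a point only when BOTH conditions hold:
    it extends a trace already tracked past the threshold, and
    it can be linked to that trace with high probability."""
    tracked_too_long = tracked_so_far_s > max_ttc_s
    linkable = link_probability >= max_link_probability
    return not (tracked_too_long and linkable)


def filter_time_slice(points, max_ttc_s=300.0, max_link_probability=0.95):
    """points: list of (point_id, tracked_so_far_s, link_probability).

    Returns the ids of the points that may be published for this time slice.
    """
    return [point_id
            for point_id, tracked_s, probability in points
            if should_release(tracked_s, probability, max_ttc_s, max_link_probability)]
```

A point that is ambiguous, or that belongs to a trace tracked only briefly, is still published, which is how the scheme keeps most of the data.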
Data Used
Data were collected from 233 volunteer vehicles over 7 days. The data cover a 70 km by 70 km metropolitan area (70 km is about 43.5 miles). Samples are taken every minute while the ignition is “on”.
Results: Off-Peak, High Density
10am to 11:30am. Gray dots are released; black dots are excluded.
Results: On-Peak, High Density
5pm to 6:30pm. Gray dots are released; black dots are excluded.
Results: Comparison
Off-Peak vs. On-Peak
Results: Maximum TTC
If the uncertainty threshold (UT) is 40% and TTC = 5 min, 92.5% of points may be published. If UT is 99% and TTC = 5 min, still over 65% of points may be published. If only 92.5% of points are published and they are randomly selected, at least one route is traceable for 35 minutes.
Results: Median TTC
If UT is 40% and TTC = 5 min, MTTC is 1 minute for the data set. If UT is 99% and TTC = 5 min, MTTC is 1 minute for the data set. Publishing 80% of points randomly still left 15% of routes traceable for over 10 minutes (median not specified).
Results: Relative Weighted Road Coverage
When the uncertainty threshold is 95% and TTC = 5 min, 81% of data samples are released and road coverage is still 95%. If 20% of data samples are removed randomly, 80% of samples are published and road coverage is only 79.3%. Randomly throwing out data causes significantly more degradation.
Other Considerations
The authors also consider algorithm modifications to address reacquisition. Maximum TTC is still preserved, but quality is only marginally better than when data points are randomly removed. The authors do not make their algorithm aware of real topography, which an attacker could take advantage of; if topography were also considered, this problem could be averted. There are many open research areas (in 2007).
Conclusion
Intelligently removing data points to confuse a de-anonymization algorithm is successful even for low-penetration deployments.