User Fingerprinting Jeff Pang, Ben Greenstein, Ramki Gummadi, Srini Seshan, and David Wetherall Most slides borrowed from Ben
Location Privacy is at Risk You “The adversary” (a.k.a., some dude with a laptop) Your MAC address: 00:0E:35:CE:1F:59 Usually < 100m
Are pseudonyms enough? MAC address now: 00:0E:35:CE:1F:59 MAC address later: 00:AA:BB:CC:DD:EE
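A minimal sketch of what a MAC pseudonym could look like in practice, assuming the client simply draws a fresh random address for each session. The function name `random_pseudonym_mac` is hypothetical. Setting the locally administered bit is a common convention for randomized addresses, not something the slides specify (the example address above does not set it).

```python
import random

def random_pseudonym_mac(rng=random):
    """Return a random unicast, locally administered MAC address string."""
    octets = [rng.randrange(256) for _ in range(6)]
    # Set the locally administered bit (0x02) and clear the multicast bit (0x01).
    octets[0] = (octets[0] | 0x02) & 0xFE
    return ":".join(f"{o:02X}" for o in octets)

print(random_pseudonym_mac())
```

The rest of the talk asks whether swapping the address like this is enough once the other traffic characteristics stay the same.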
Implicit Identifiers Remain Consider one user at SIGCOMM 2004 Visible in an “anonymized” trace MAC addresses scrubbed Effectively a pseudonym Transferred 512 MB via BitTorrent => Crappy performance for everyone else Let’s call him Bob Can we figure out who Bob is?
Implicit Identifier: SSIDs SSIDs in Probe Requests Windows XP, Mac OS X probe for your preferred networks by default Set of networks advertised in a traffic sample Determined by a user’s preferred networks list SSID Probe: “roofnet” Bob
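As a rough illustration of the ssids identifier, the sketch below collects the set of SSIDs probed by each source address. It assumes the probe requests have already been parsed into `(source_address, ssid)` tuples; that record format and the name `ssid_sets` are hypothetical, and the sample data is made up.

```python
from collections import defaultdict

def ssid_sets(probe_requests):
    """Group probed SSIDs by source address.

    probe_requests: iterable of (source_address, ssid) tuples, a hypothetical
    pre-parsed form of 802.11 probe request frames.
    """
    by_source = defaultdict(set)
    for source, ssid in probe_requests:
        if ssid:  # skip broadcast probes, which carry an empty SSID
            by_source[source].add(ssid)
    return dict(by_source)

# Made-up example records.
probes = [
    ("pseudonym-1", "roofnet"),
    ("pseudonym-1", "linksys"),
    ("pseudonym-1", ""),
    ("pseudonym-2", "linksys"),
]
print(ssid_sets(probes))
```

A rare SSID such as "roofnet" narrows down the sender far more than a common one such as "linksys".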
What if Bob used pseudonyms? “roofnet” probe occurred during a different session than the BitTorrent download Can no longer explicitly associate “roofnet” with poor network etiquette Can we do it implicitly?
Implicit Identifier: Network Destinations Network Destinations Set of IP (address, port) pairs in a traffic sample In SIGCOMM, each visited by 1.15 users on average A user is likely to visit a site repeatedly (e.g., an email server) SSH/IMAP server: Bob
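The "visited by 1.15 users on average" statistic can be sketched as follows: for each destination, count the distinct users that contacted it, then average over destinations. The sketch assumes destinations are represented as `(ip, port)` tuples; the function name and the sample data are hypothetical, not taken from the SIGCOMM trace.

```python
from collections import defaultdict

def mean_users_per_destination(destinations_by_user):
    """Average number of distinct users that contacted each destination."""
    users_per_dest = defaultdict(set)
    for user, dests in destinations_by_user.items():
        for dest in dests:
            users_per_dest[dest].add(user)
    if not users_per_dest:
        return 0.0
    return sum(len(u) for u in users_per_dest.values()) / len(users_per_dest)

# Made-up example: one shared destination, two unique ones.
trace = {
    "bob": {("192.0.2.10", 22), ("192.0.2.20", 993)},
    "alice": {("192.0.2.20", 993), ("198.51.100.5", 80)},
}
print(mean_users_per_destination(trace))
```

A value close to 1 means most destinations are visited by a single user, which is what makes them identifying.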
What if the network is encrypted? Can’t see IP addresses through link-layer encryption like WPA Is Bob safe now?
Implicit Identifier: Broadcast Packet Sizes Broadcast Packet Sizes Set of broadcast packet sizes in a traffic sample E.g., NetBIOS naming advertisements from Windows machines; FileMaker and Microsoft Office advertise themselves In SIGCOMM, only 16% more unique tuples than unique sizes Broadcast packet sizes: 239, 245, 257 Bob
Implicit Identifier: MAC Protocol Fields MAC Protocol Fields Header bits (e.g., power mgmt., order) Supported rates Offered authentication algorithms MAC Protocol Fields: 11,4,2,1Mbps, WEP, etc. Bob
David J. Wetherall Anonymized Traces from SIGCOMM 2004 Search on WiGLE for “djw” in the Seattle area Google pinpoints David’s home (to within 200 ft) A pseudonym What else do implicit identifiers tell us?
Automating Implicit Identifiers TRAINING: Collect some traffic known to be from Bob OBSERVATION: Which traffic is from Bob?
Methodology Simulate using SIGCOMM and UCSD traces Split each trace into training data and observation data Sample = 1 hour of traffic to/from a user Assume pseudonyms “The adversary”
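The methodology can be sketched as two steps: bucket each user's traffic into one-hour samples, then split the samples by time into training and observation sets. This is a sketch under stated assumptions: packets are given as `(timestamp_seconds, user, feature_value)` tuples, and the split is a single cut-off hour. Both choices and all names are hypothetical.

```python
from collections import defaultdict

SAMPLE_SECONDS = 3600  # one sample = 1 hour of traffic to/from a user

def hourly_samples(packets):
    """Map (user, hour_index) to the set of feature values seen in that hour."""
    samples = defaultdict(set)
    for ts, user, value in packets:
        samples[(user, int(ts // SAMPLE_SECONDS))].add(value)
    return dict(samples)

def split_samples(samples, split_hour):
    """Samples before split_hour are training data; the rest are observations."""
    training = {k: v for k, v in samples.items() if k[1] < split_hour}
    observation = {k: v for k, v in samples.items() if k[1] >= split_hour}
    return training, observation

# Made-up packets: Bob probes in hour 0 and hour 2.
packets = [
    (10, "bob", "roofnet"),
    (200, "bob", "linksys"),
    (7300, "bob", "roofnet"),
]
training, observation = split_samples(hourly_samples(packets), split_hour=1)
print(training, observation)
```

In the observation set the user labels would be hidden behind pseudonyms; they are kept here only so that classifier output can be checked.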
Did this traffic sample come from Bob? How to convert implicit identifiers into features? Naïve Bayesian Classifier: We say sample s (with features f_i) is from Bob if Pr[s from Bob | s has features f_i] > T
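The decision rule can be sketched as a two-class naive Bayes posterior, assuming the features are conditionally independent given the class. The inputs are per-feature likelihoods Pr[f_i | Bob] and Pr[f_i | not Bob]; how those are estimated from training data is not shown here, and all names and numbers are hypothetical.

```python
import math

def posterior_from_bob(likelihoods_bob, likelihoods_other, prior_bob):
    """Pr[s from Bob | features] under the naive independence assumption.

    All likelihoods must be strictly positive (e.g. after smoothing),
    because the computation is done in log space.
    """
    log_bob = math.log(prior_bob) + sum(math.log(p) for p in likelihoods_bob)
    log_other = math.log(1.0 - prior_bob) + sum(math.log(p) for p in likelihoods_other)
    # Subtract the larger log score before exponentiating to avoid underflow.
    m = max(log_bob, log_other)
    num = math.exp(log_bob - m)
    return num / (num + math.exp(log_other - m))

def is_from_bob(likelihoods_bob, likelihoods_other, prior_bob, threshold):
    """Apply the slide's rule: classify as Bob if the posterior exceeds T."""
    return posterior_from_bob(likelihoods_bob, likelihoods_other, prior_bob) > threshold

# Made-up likelihoods for two features.
print(posterior_from_bob([0.9, 0.8], [0.1, 0.2], prior_bob=0.5))
```

Raising the threshold T trades true positives for fewer false positives.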
Did This Traffic Sample Come from Bob? Features: Set similarity (Jaccard Index), weighted by frequency [Figure: a profile from training compared with a sample for validation; example SSIDs linksys, IR_Guest, djw, SIGCOMM_1, ordered from rare to common]
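One way to sketch a frequency-weighted Jaccard index is to give each element a weight and divide the total weight of the intersection by the total weight of the union. The weighting used below (the inverse of how many users emit the element, so rare elements count more) is an assumption for illustration, and the counts are made up.

```python
def weighted_jaccard(profile, sample, weight):
    """Weighted Jaccard similarity between two sets.

    weight: function mapping an element to a positive weight.
    """
    union = profile | sample
    if not union:
        return 0.0
    total = sum(weight(e) for e in union)
    shared = sum(weight(e) for e in profile & sample)
    return shared / total

# Hypothetical number of users seen probing for each SSID.
user_counts = {"linksys": 50, "IR_Guest": 5, "djw": 1, "SIGCOMM_1": 100}

def inverse_frequency(ssid):
    return 1.0 / user_counts[ssid]

profile = {"linksys", "IR_Guest", "djw"}   # from training
sample = {"linksys", "djw", "SIGCOMM_1"}   # for validation
print(weighted_jaccard(profile, sample, inverse_frequency))
```

With these numbers the unweighted Jaccard index would be 2/4 = 0.5, while the weighted score is higher because the rare SSID "djw" appears in both sets.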
Individual Feature Accuracy 60% TPR with 99% FPR Higher FPR, likely due to not being user specific Useful in combination with other features, to rule out identities
Multi-feature Accuracy Samples from 1 in 4 users are identified >50% of the time with FPR [Plot legend: bcast + ssids + fields + netdests; bcast + ssids + fields; bcast + ssids]
Was Bob here today? Maybe… Suppose N users present Over an 8 hour day, 8*N opportunities to misclassify a user’s traffic Instead, say Bob is present iff multiple samples are classified as his
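The arithmetic behind this slide can be sketched as follows: with N users and one sample per user per hour, an 8-hour day yields 8*N samples, each of which may be wrongly attributed to Bob. Requiring several hits before declaring Bob present reduces the chance of a false alarm. The per-sample false positive rate and the hit threshold below are made-up values, and the function names are hypothetical.

```python
def expected_false_positives(n_users, hours, per_sample_fpr):
    """Expected number of samples wrongly classified as Bob's in one day."""
    return n_users * hours * per_sample_fpr

def bob_present(sample_is_bob, min_hits):
    """Declare Bob present only if at least min_hits samples match him.

    sample_is_bob: iterable of booleans, one per classified sample.
    """
    return sum(1 for hit in sample_is_bob if hit) >= min_hits

# 25 users, 8 hours, hypothetical 1% per-sample false positive rate.
print(expected_false_positives(25, 8, 0.01))
# One stray match is not enough when two hits are required.
print(bob_present([False, True, False, False], min_hits=2))
```

Even a small per-sample error rate produces a few false matches per day, which is why a single matching sample is weak evidence.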
Was Bob here today? In a busy coffee shop with 25 concurrent users, more than half (54%) can be identified with 90% accuracy 4 hour median to detect (4 samples) 27% can be identified with 99% accuracy
Conclusion: Pseudonyms Are Insufficient 4 new identifiers: netdests, ssids, fields, bcast Average user emits highly distinguishing identifiers Adversary can combine features Future Uncover more identifiers (timing, etc.) Validate on longer/more diverse traces (SSIDs stable in home setting for >=2 weeks) Build a better link layer