Jeffrey Pang Carnegie Mellon University

Quantifying and Mitigating Privacy Threats in Wireless Protocols and Services
Today, I will discuss the privacy problems in wireless protocols and services, and ways to improve them to better protect our privacy.

Location Privacy
Prevent third parties from collecting location traces. Location trace: identity + locations. Third parties: eavesdroppers and services. Trends → threats: wireless, mobility, location-based services.
Diagram labels: Yelp.com; network; identity + location.
More specifically, I will talk about location privacy. Location privacy is our ability to prevent third parties from collecting location traces of our movements. Third parties collect such traces by linking a consistent identity to the places we visit. This lets them link, say, an AA meeting place with the person who lives at a particular location. In this talk I will focus on two types of third parties that are enabled by recent computing trends: eavesdroppers and location-based services. Our increasing reliance on mobile wireless devices and location-based services means that these parties can collect such traces today.

Location Privacy Threats
These threats have recently drawn much concern because practically anyone can set up a surveillance network or location service to track users. We argue that the threats will only grow as we become more dependent on wireless devices, on location-based services, and on using both while mobile.

Thesis The existing protocols and techniques that wireless devices use to discover and communicate with each other pose risks to users' location privacy... But it is practical to redesign these protocols and techniques to substantially mitigate these threats.

Current Wireless Protocols
Bootstrap (out-of-band, e.g., password, PIN): Name: Alice’s Device, Secret: Alice<3Bob; Name: Bob’s Device, Secret: Alice<3Bob
Rendezvous: search probe (From: 11:22:33:44:55:66, To: BROADCAST), then announcement
Authenticate and Bind: credentials and key exchange between 11:22:33:44:55:66 and AA:BB:CC:DD:EE:FF → Ksession, used to encrypt & authenticate (confidentiality, authenticity, integrity)
Send Data: From: 11:22:33:44:55:66 To: AA:BB:CC:DD:EE:FF; From: AA:BB:CC:DD:EE:FF To: 11:22:33:44:55:66

Select: “Find the best AP near (47°36′35″N 122°19′59″W)”
Report: Alice’s Report on Bob’s AP, stored in the JiWire.com database. Location: ( , ). Quality: 1 Mbps, blocks VoIP.
Rendezvous, Authenticate and Bind, Send Data: search probe (From: 11:22:33:44:55:66, To: BROADCAST); announcement; credentials and key exchange with AA:BB:CC:DD:EE:FF.
Threats: eavesdropping threats; location service threats.
In the case of infrastructure such as hotspots, Alice may need to search for better hotspots using a hotspot directory, either online or offline. In addition, the directory may rely on its users to report on hotspots they have used in order to populate its database. In both cases, users potentially reveal their identity and location. The problem, of course, is that these protocols enable third parties such as eavesdroppers and location services to build location traces about us. These are the threats we will address in this talk.

Talk Overview
Quantifying tracking threats
Mitigating eavesdropping threats
Mitigating location service threats
Conclusions & open problems
In the remainder of this talk, I will present some of our results that show how easily real users can be tracked, and then I will show how we can build wireless protocols and services to mitigate these threats.

Quantifying tracking threats: previous techniques insufficient [MobiCom 07]
Mitigating eavesdropping threats: efficient protocols that conceal all bits [MobiSys 08]
Mitigating location service threats: anonymous & fraud-resistant reporting [MobiSys 09]
Conclusions & open problems

Diagram (over time): two sessions, each with Rendezvous (search probe From: 11:22:33:44:55:66 To: BROADCAST; announcement), Authenticate and Bind (credentials, key exchange), and Send Data. The first session is with AA:BB:CC:DD:EE:FF and the second with 22:1E:3E:4F:A1:45; the client uses 11:22:33:44:55:66 in both.
Eavesdroppers can observe our wireless traffic with commodity hardware. Because current protocols expose consistent unique identifiers, such as MAC addresses, in every packet, eavesdroppers can easily track us wherever we use our devices.

Proposed Countermeasures
Pseudonyms: per-session MAC addresses [Gruteser 05, Hu 06, Jiang 07, Stajano 05]
Diagram (over time): the same two sessions, but the client uses 11:22:33:44:55:66 with AA:BB:CC:DD:EE:FF in the first and 19:1A:1B:1C:1D:1E with 22:1E:3E:4F:A1:45 in the second.
To counteract these tracking threats, previous research has proposed using pseudonyms, that is, changing MAC addresses each time we engage in the wireless protocol. The protocol still works so long as the MAC address is consistent within a single session.
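To make the pseudonym idea concrete, here is a minimal sketch of picking a fresh MAC address per session. The function name is hypothetical and the cited proposals may choose addresses differently; the only fixed facts used are that the 0x02 bit of the first octet marks a locally administered address and the 0x01 bit marks multicast.

```python
import secrets


def random_pseudonym_mac() -> str:
    """Return a random unicast, locally administered MAC address (hypothetical helper)."""
    octets = bytearray(secrets.token_bytes(6))
    # Set the locally administered bit (0x02) and clear the multicast bit (0x01).
    octets[0] = (octets[0] | 0x02) & 0xFE
    return ":".join(f"{b:02X}" for b in octets)


# One address per session; it stays fixed only for the lifetime of that session.
session_1_mac = random_pseudonym_mac()
session_2_mac = random_pseudonym_mac()
print(session_1_mac, session_2_mac)
```

As the following slides argue, this hides only the explicit address; other fields and traffic patterns can still link sessions.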

Problem: Long-term Linkability
Rendezvous & authentication fields can be identifying (e.g., network names)
Diagram (over time): in both sessions the probe asks “Is Bob’s Network here?”, the reply is “Bob’s Network is here”, and the two sides exchange “Proof that I’m Alice” and “Proof that I’m Bob”, even though the client’s MAC address changes from 11:22:33:44:55:66 to 19:1A:1B:1C:1D:1E.
However, many other bits in the rendezvous and authentication phases remain exposed, and these fields, such as the network names that Alice is searching for, can be identifying as well.

Problem: Short-term Linkability
Sequences of packet sizes & timing can be identifying (e.g., broadcast packet sizes) [Pang 07, Liberatore 06, Saponas 07, Song 01, Wright 08, Wright 07]
Diagram (over time): the same two sessions as on the previous slide, with data packets of 247 bytes and 124 bytes appearing in both.
In addition, our work and other recent security research have shown that sequences of packet sizes and timings can reveal information about packet contents even if they are encrypted, and thus can be identifying. For example, we found that broadcast packet sizes are often consistent across a user’s sessions and distinguish different users. While a single packet size isn’t identifying, an eavesdropper can still link all the packets in a data stream together, giving him a lot of statistical information.

Fingerprint Accuracy
Question: Given some traffic samples from a device, can we identify when it is present in the future? (Known to be from Alice → Was Alice here?)
Developed a machine-learning-based identification algorithm (see dissertation for details). Fingerprints: network names, broadcast packet sizes, supported capabilities. Simulated user tracking with traffic from 500+ users. Simulated pseudonyms & encryption.
Of course, these fingerprints will not always be 100% accurate. So we ask: given a traffic sample that we know to be from Alice, such as one collected at her home, can we then identify when she is present at other locations? To evaluate this question, we developed an identification algorithm based on these fingerprints using standard machine learning techniques. We then simulated tracking on more than 500 devices in different wireless traces. We also simulated MAC address pseudonyms, so we cannot link a user’s sessions together using any explicit addresses.
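To give a feel for how a network-name fingerprint could be compared across sessions, here is a deliberately simple sketch. It is not the identification algorithm from the dissertation: it swaps in plain Jaccard set similarity with an arbitrary threshold, and the session contents are made up for illustration.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |a ∩ b| / |a ∪ b| (0.0 when both sets are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def looks_like(known: set, observed: set, threshold: float = 0.5) -> bool:
    """Guess whether an observed session comes from the known device (assumed threshold)."""
    return jaccard(known, observed) >= threshold


# Network names probed for in a sample known to be from Alice, and in two later sessions.
alice_sample = {"Bob's Network", "CMU", "cafe1"}
session_x = {"Bob's Network", "CMU"}
session_y = {"tmobile", "linksys"}

print(looks_like(alice_sample, session_x))
print(looks_like(alice_sample, session_y))
```

A real classifier would combine several fingerprints and weigh how common each value is; this sketch only shows why probed network names survive a change of MAC address.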

Fingerprint Accuracy
Question: Given some traffic samples from a device, can we identify when it is present in the future? (Known to be from Alice → Was Alice here?)
Results: 53% of devices can be identified with 90% accuracy when at a small hotspot for the day (5 devices/hour); 27% with 99% accuracy; 17% even if in a very busy hotspot (100 users/hour). More fingerprints exist → this is only a lower bound!
What we found was that the majority of devices can still be tracked with 90% accuracy if they were at a hotspot with 5 users per hour for a day. 27% of devices could be tracked with 99% accuracy, meaning we could answer the question “Was Alice here or not?” correctly at least 99% of the time. It is more difficult to answer this question in a busier hotspot, since we are more likely to misclassify someone else’s traffic as Alice’s and get a false positive. Nonetheless, we found that even in an environment with 100 users generating traffic each hour, 17% of devices could still be tracked with 90% accuracy. The most important thing to note, however, is that more fingerprints exist in practice; we examined only 3 in this study, and subsequent work has identified others, such as ones based on timing characteristics. Thus, these numbers may only be a lower bound on the accuracy a real eavesdropper may be able to achieve.

Is There a Common Defense?
Problem: Long-term linkability. Rendezvous (search probe “Is Bob’s Network here?” From: 11:22:33:44:55:66 To: BROADCAST; announcement “Bob’s Network is here”) and Authenticate and Bind (credentials, key exchange; “Proof that I’m Alice”, “Proof that I’m Bob”).
Problem: Short-term linkability. Send Data (From: 11:22:33:44:55:66 To: AA:BB:CC:DD:EE:FF; From: AA:BB:CC:DD:EE:FF To: 11:22:33:44:55:66).
These fingerprints exist for two main reasons. First, fields in the rendezvous and authentication/binding phases remain exposed and remain linkable across different sessions. Second, data packets in the same session remain linked at short time scales, exposing sequences of packet sizes and timings to eavesdroppers. How might we eliminate all this information?

Goal: Make All Bits Appear Random
Rendezvous, Authenticate and Bind: no bits linkable over the long term.
Send Data: many streams overlap in real traffic → much noisier side-channels.
We argue that we can make this entire class of attacks much more difficult to carry out if wireless protocols make all bits appear random to third parties. This ensures that no bits anywhere are linkable across sessions and that eavesdroppers cannot distinguish one person’s data packets from another’s, making it much more difficult to extract side-channel information from individual traffic streams.

Rendezvous: identifiers needed for rendezvous! (“How do I know if Bob is here?”)
Diagram phases: Rendezvous; Authenticate and Bind; Send Data.
Of course, this is non-trivial to do because we needed those fields for important protocol functions. For example, we needed an identifier in the rendezvous phase so that Alice and Bob can figure out whether the other party is nearby.

Send Data: identifiers needed for efficient packet filtering! Which packets are mine? Which key to use?
Source Address | Decryption key
12:34:56:78:90:ab | KAlice
00:00:99:99:11:11 | KCharlie
Diagram: packets of 500, 250, and 200 bytes arriving from different senders.
In addition, we needed those short-term MAC addresses in the data phase for efficient packet filtering. For example, Bob has some session state associated with each device sending him traffic, such as a decryption key, so he needs some way to figure out which key to use to decrypt the packets he receives. Moreover, Alice and Charlie here might actually be sending packets to different people, so Bob needs some way to determine which packets are for him, which is difficult without identifiers.

Mitigating eavesdropping threats Mitigating location service threats Conclusions & open problems

Design Requirements
When A generates Message to B, she sends: F(A, B, Message) → PrivateMessage, where F has these properties:
Confidentiality: Only A and B can determine Message.
Authenticity: B can verify A created PrivateMessage.
Integrity: B can verify Message not modified.
Unlinkability: Only A and B can link PrivateMessages to same sender or receiver.
Efficiency: B can process PrivateMessages as fast as he can receive them.
Diagram: A→B header…, unencrypted payload.
More specifically, what I mean…

Straw man: Encrypt Everything
Bootstrap: Name: Bob’s Network, Secret: Alice<3Bob; Name: Alice’s Laptop. Derive keys: KAB (key for Alice→Bob) and KBA (key for Bob→Alice).
Idea: Use bootstrapped keys to encrypt everything.
To obtain these security properties, the first thing we might try…

Straw man: Symmetric Key Protocol
Client: sends Probe “Bob”, protected with symmetric encryption (e.g., AES w/ random IV) and MAC: KAB. (Also shown: Probe “Lucy”.)
Service: holds KShared1, KShared2, KShared3, …, KAB, …, KSharedM (accounts + associations), tries to decrypt with each key, and checks the MAC: KAB. Cost: O(M).
To address this problem, we might instead try to use symmetric key encryption, because it is much faster than public key encryption…

Straw man: Symmetric Key Protocol
Client: sends Probe “Bob”, protected with symmetric encryption (e.g., AES w/ random IV) and MAC: KAB.
Service: holds KShared1, KShared2, KShared3, …, KAB, …, KSharedM (accounts + associations), tries to decrypt with each key, and checks the MAC: KAB.
Too slow! (APs have 100s of accounts.) 1.5 ms/packet (M=100). (Need < 200 μs/packet for g.)
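The O(M) cost of this straw man can be sketched as follows. This is only an illustration under assumptions: it uses HMAC-SHA256 from the Python standard library as the message authentication code and skips the decryption step, and the helper names are hypothetical. The point is that without any identifier the receiver has no choice but to try every key it holds.

```python
import hashlib
import hmac
import os


def tag(key: bytes, message: bytes) -> bytes:
    """Message authentication code (HMAC-SHA256 here, as an assumed stand-in)."""
    return hmac.new(key, message, hashlib.sha256).digest()


def find_key_by_trial(keys, message: bytes, received_tag: bytes):
    """Try each key in turn; return (index of matching key or None, keys tried)."""
    for attempts, key in enumerate(keys, start=1):
        if hmac.compare_digest(tag(key, message), received_tag):
            return attempts - 1, attempts
    return None, len(keys)


M = 100                                      # accounts + associations at the service
keys = [os.urandom(16) for _ in range(M)]
probe = b"Probe: Bob"
# Worst case: the sender's key is the last one the service tries.
index, attempts = find_key_by_trial(keys, probe, tag(keys[-1], probe))
print(index, attempts)
```

Packets addressed to someone else are even worse for the receiver: every one of them costs all M attempts before it can be discarded.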

SlyFi
Symmetric key almost works, but there is tension between:
Unlinkability: can’t expose the identity of the key
Efficiency: need to identify the key to avoid trying all keys
Idea: Identify the key in an unlinkable way.
Approach: Sender A and receiver B agree on tokens T1, T2, T3, … for the pair AB. A attaches Ti to each encrypted packet for B.
To try to regain efficiency, we make the following observation: … The key idea we leverage in the design of our solution, SlyFi, is…

SlyFi
Required properties: third parties cannot link Ti and Tj if i ≠ j; A doesn’t reuse Ti; A and B can compute Ti independently.
Client: sends Probe “Bob” with symmetric encryption (e.g., AES w/ random IV) and MAC: KAB, and attaches the token Ti = AES(KAB, i).
Service: looks up Ti in a hash table to get KAB, then checks the MAC: KAB. Cost: 150 μs/packet (software).
Main challenge: sender and receiver must synchronize i.
That is, SlyFi is basically the same as the symmetric key approach, but …
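The token lookup can be sketched like this. Two assumptions are worth stating loudly: the slide defines Ti = AES(KAB, i), but the Python standard library has no AES, so the sketch substitutes truncated HMAC-SHA256 as the keyed function; and the sender names and the fixed value of i are made up for illustration.

```python
import hashlib
import hmac
import os


def token(key: bytes, i: int) -> bytes:
    """Stand-in for Ti = AES(K, i): a keyed function of the counter (assumption: HMAC-SHA256)."""
    return hmac.new(key, i.to_bytes(8, "big"), hashlib.sha256).digest()[:16]


# Receiver side: one token key per sender (hypothetical senders).
token_keys = {"alice": os.urandom(16), "charlie": os.urandom(16)}
i = 7  # assume both sides already agree on i; synchronizing it is the main challenge
table = {token(key, i): sender for sender, key in token_keys.items()}

incoming = token(token_keys["alice"], i)
# One hash-table lookup identifies the sender's key, instead of trying every key.
print(table.get(incoming))
```

To anyone without the key, successive tokens look like unrelated random values, which is what keeps the lookup from reintroducing a linkable identifier.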

Data
i = transmission number
Data messages are only sent on established connections → on receipt of Ti, B computes the next expected token, Ti+1.
Handling message loss? Save the next k tokens in the table. Tolerates k consecutive losses (k=50 is enough [Reis ‘06]). No loss → compute one new token per reception.
Diagram: A sends T1, T2, T3, T4; while i = 3, B’s hash table holds T3 … T3+k-1.
In SlyFi, we synchronize i in two ways. First, we note that data messages are only sent over established connections. Therefore, we can simply make i the transmission number during a session. Bob expects to receive most messages, so he can simply compute token i+1 on receipt of token i. Of course, wireless is a lossy medium and some packets might be lost. For example, if T3 is lost, Bob would never recognize T4, because he would still be anticipating T3. To handle loss, he saves the next k tokens rather than just the next one. This allows him to tolerate k consecutive losses, and previous work has shown that more than 50 is very rare. If more losses occur, Alice and Bob simply reestablish the connection, as they would have to do today. Note that in the common case, when there is no loss, Bob computes only one additional token per packet reception, because the next k-1 tokens will already have been computed.
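The receiver-side window can be sketched as below. This is a sketch under assumptions rather than SlyFi’s implementation: the class and method names are hypothetical, HMAC-SHA256 again stands in for AES, and transmission numbers start at 1.

```python
import hashlib
import hmac


def token(key: bytes, i: int) -> bytes:
    """Stand-in for Ti = AES(K, i); HMAC-SHA256 is an assumption, not SlyFi's cipher."""
    return hmac.new(key, i.to_bytes(8, "big"), hashlib.sha256).digest()[:16]


class DataTokenWindow:
    """Receiver-side table holding the next k expected data tokens."""

    def __init__(self, key: bytes, k: int = 50):
        self.key, self.k = key, k
        self.next_i = 1        # lowest transmission number still expected
        self.by_token = {}     # token -> transmission number
        self.by_index = {}     # transmission number -> token
        for i in range(1, k + 1):
            self._add(i)

    def _add(self, i: int) -> None:
        t = token(self.key, i)
        self.by_token[t] = i
        self.by_index[i] = t

    def receive(self, tok: bytes):
        """Return the transmission number for tok, or None if it is not in the window."""
        i = self.by_token.get(tok)
        if i is None:
            return None
        # Forget tokens up to i so that none of them is accepted twice ...
        for j in range(self.next_i, i + 1):
            del self.by_token[self.by_index.pop(j)]
        # ... and extend the window so it again covers the next k tokens.
        for j in range(self.next_i + self.k, i + 1 + self.k):
            self._add(j)
        self.next_i = i + 1
        return i


key = b"\x01" * 16
window = DataTokenWindow(key, k=3)
print(window.receive(token(key, 1)))  # in-order packet
print(window.receive(token(key, 3)))  # packet 2 was lost
print(window.receive(token(key, 1)))  # old token is no longer accepted
```

When nothing is lost, each call to `receive` removes one token and computes exactly one new one, matching the common case described in the notes.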

Discovery/Binding
i = ???
Diagram: Probe: “Bob’s Device” with T1: not here. Probe: “Bob’s Device” with T2: not here. ... Probe: “Bob’s Device” with Ti: i = ???
Discovery messages are sent when the other party is not here → can’t rely on packet reception to synchronize i.
We cannot use the same approach for discovery and binding messages, however, because discovery messages are often sent when the other party is not present. For example, Alice sends discovery probes whenever she opens her laptop, even if Bob is not present.

Discovery/Binding i = current time/1 min AB Ti
Probe: “Bob’s Device” Ti AB T i-c T i+c AB … T i AB i = i =  hashtable i = current time/1 min Discovery messages only require long-term unlinkability Infrequent: only sent when trying to associate Narrow interface: single application, few side-channels Handling clock skew: Save previous c and next c tokens in table Tolerates clock skew of c minutes Instead, we recall that we only needed long-term unlinkability for discovery messages. This is because they are sent infrequently to begin with. Unlike data messages, which can be sent at rates of 10s of 1000s every second, discovery messages are only sent when searching for a new network or device. They are unlikely to occur more than a few times every couple minutes. In addition, they have a very narrow interface. Data messages encapsulate arbitrary application contents which can expose lots of side channels. But discovery messages are only used for one type of packet so they don’t have to expose any identifying side-channels. Thus, we use loosely synchronized time to compute i. i is just the current minute, so we compute a new i each minute. Of course, clocks may be slightly off. To handle clock skew of c minutes, bob just saves the previous c and next c tokens in his table, in addition to the token for the current minute. Again note that in steady state, bob will only have to compute 1 new token per time period because the next c-1 tokens will have been computed before.
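A minimal sketch of the time-based variant follows, with the same caveats as before: HMAC-SHA256 stands in for AES, the function names are hypothetical, and for brevity the table is rebuilt from scratch rather than updated by one token per minute as a real receiver would.

```python
import hashlib
import hmac


def token(key: bytes, i: int) -> bytes:
    """Stand-in for Ti = AES(K, i); HMAC-SHA256 is an assumption."""
    return hmac.new(key, i.to_bytes(8, "big"), hashlib.sha256).digest()[:16]


def minute_index(now_seconds: float) -> int:
    """i = current time / 1 min, rounded down."""
    return int(now_seconds // 60)


def discovery_table(key: bytes, now_seconds: float, c: int = 2) -> dict:
    """Tokens for minutes i-c .. i+c around the receiver's own clock."""
    i = minute_index(now_seconds)
    return {token(key, j): j for j in range(i - c, i + c + 1)}


key = b"k" * 16
receiver_now = 1_000_000.0
sender_now = receiver_now + 90           # the sender's clock runs 90 seconds ahead
probe_token = token(key, minute_index(sender_now))

print(probe_token in discovery_table(key, receiver_now, c=2))
print(probe_token in discovery_table(key, receiver_now, c=1))
```

In this example the two clocks fall two minute-slots apart, so the probe is recognized with c = 2 but missed with c = 1.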

SlyFi: Putting it Together
Bootstrap: Name: Alice’s Laptop, Secret: Alice<3Bob; Name: Bob’s Network, Secret: Alice<3Bob. Derive keys KAB and KBA, each with token, encrypt, and auth components.
Discover: each packet carries a token Ti = AES(KAB, i) and a nonce, followed by Enc(KAB, nonce, …) and MAC(KAB, …). Encrypted contents: from, to, capabilities, other protocol fields (“Is Bob’s Network here?” / “Bob’s Network is here”).
Authenticate and Bind: same packet layout (token, nonce, encrypted from, to, capabilities, other protocol fields). Credentials, key exchange → Ksession1,2.
Send Data: each packet carries a token tj = AES(K, j), where K comes from the session key and j is the transmission #, followed by Enc(K, t0, …) and MAC(K, …). Encrypted contents: from, to, seqno, …
Those are the two key primitives that enable SlyFi. Now I’ll quickly show how we put them together. Here is the original wireless protocol we had earlier. First, we derive a set of keys from the secret exchanged. This is basically the same as in existing protocols, except that we derive a set of keys for discovery in each direction: a key for token generation, a key for encryption, and a key for message authentication codes. Then we use the token generation key to compute discovery tokens using the time as the input. We also attach a nonce to each packet, which is used to encrypt the remainder with the encryption key. Because the nonce is random, the encrypted remainder also appears random. We do the same thing for authentication messages, where we generate and exchange a new session key between Alice and Bob. Finally, we use the session key to generate tokens in the data phase. Here we use the transmission number rather than the time, since the link has been set up. Note that we don’t need a separate nonce in this case, because we are assured that no token will be used twice. Thus the token effectively serves as a nonce.
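The key-derivation step can be illustrated with a short sketch. The talk does not say which derivation function SlyFi uses, so everything here is an assumption: HMAC-SHA256 over a direction-and-purpose label, 16-byte keys, and the function name `derive_keys`. It only shows the shape of the result: separate token, encrypt, and auth keys for each direction, computed independently by both parties from the same bootstrapped secret.

```python
import hashlib
import hmac


def derive_keys(secret: bytes, sender: str, receiver: str) -> dict:
    """Derive per-direction keys for token generation, encryption, and authentication.

    Assumed construction (not from the talk): HMAC-SHA256(secret, "sender->receiver:purpose").
    """
    keys = {}
    for purpose in ("token", "encrypt", "auth"):
        label = f"{sender}->{receiver}:{purpose}".encode()
        keys[purpose] = hmac.new(secret, label, hashlib.sha256).digest()[:16]
    return keys


secret = b"Alice<3Bob"   # the bootstrapped secret from the slide
k_ab = derive_keys(secret, "Alice's Laptop", "Bob's Network")   # Alice -> Bob
k_ba = derive_keys(secret, "Bob's Network", "Alice's Laptop")   # Bob -> Alice

print(sorted(k_ab))
print(k_ab["token"] != k_ba["token"])
```

Because the derivation is deterministic, Bob obtains exactly the same `k_ab` and `k_ba` from his copy of the secret without any further message exchange.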

SlyFi: Other Protocol Details
Broadcast Higher-layer binding Time synchronization Roaming Coexistence with Link-layer ACKs Multi-party discovery Preventing replay attacks See dissertation for details So I’ve described how we can deliver messages in an unlinkable fashion in slyfi for data transport and discovery and binding….

Performance Evaluation
Figures: time to set up a link; data throughput. (Previous proposal similar to symmetric key.)
Implemented as a software kernel module on embedded devices.
SlyFi is nearly as efficient as (wifi-open)

Solution Summary
Properties compared: confidentiality, authenticity, integrity, unlinkability, efficiency.
Today’s protocols (e.g., WPA): confidentiality, authenticity, and integrity for the data payload only.
Pseudonyms (e.g., [Gruteser 05, Jiang 07]): confidentiality, authenticity, and integrity for the data payload only; fingerprints remain.
Encrypt everything
SlyFi: Discovery/Binding (long-term unlinkability)
SlyFi: Data packets
Just to summarize the solutions we presented: with today’s protocols, such as WPA, we don’t get unlinkability, because MAC addresses link packets together. With pseudonyms we do a little better, but we saw that fingerprints can still be used to link packets together. The encrypt-everything approach makes all the bits we transmit appear random, eliminating these fingerprints, but we sacrifice efficiency in the process. Finally, we showed that SlyFi, using novel ways to synchronize random token sequences, can achieve all the security properties we want without sacrificing efficiency.

Mitigating eavesdropping threats Mitigating location service threats Conclusions & open problems Now I will discuss how we can mitigate location privacy threats from location services.

Select: Alice asks JiWire.com: “Find me the best AP near (47°36′35″N 122°19′59″W)”. Problem: identity + location.
Report: Alice’s report on Bob’s AP, sent to the JiWire.com report database. Location: ( , ). Quality: 1 Mbps, blocks VoIP.
SlyFi makes it difficult for eavesdroppers to track users, but users may have to report their identities to location-based services. For example, a user might submit his login name and location to a hotspot directory to find the best hotspot nearby.

Previous Countermeasures
Coarser or noisier location queries [Gruteser 03, Bettini 05, Mokbel 06, Krumm 07, Hoh 05]
Select: instead of asking JiWire.com “Find me the best AP near (47°36′35″N 122°19′59″W)”, Alice asks about Seattle, WA.
Doesn’t work for locations in reports (coarser locations make reports useless). Report: Alice’s report on ??? AP, sent to the JiWire.com report database. Location: Seattle, WA. Quality: 1 Mbps, blocks VoIP.
To prevent the directory from tracking fine-grained user movements, other researchers have proposed using coarser location queries, for example, asking for all hotspots in a city instead of those near a specific point, and then filtering the results locally. This works fine for queries, because we often want offline access anyway. But it doesn’t work for services that have users themselves submit reports about physical services. Making the location in a report coarser means not identifying the AP, which makes the report useless: it only says that “some AP” in Seattle has bandwidth X, not which one.
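The “ask coarsely, filter locally” pattern for queries can be sketched as follows. The AP names are borrowed from later slides but their coordinates are invented for illustration, and the helper names are hypothetical; the distance function is the standard haversine formula on a sphere of radius 6371 km.

```python
import math


def dms(deg: float, minutes: float, seconds: float) -> float:
    """Convert degrees/minutes/seconds to decimal degrees."""
    return deg + minutes / 60 + seconds / 3600


def haversine_km(lat1, lon1, lat2, lon2) -> float:
    """Great-circle distance in kilometres on a sphere of radius 6371 km."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = p2 - p1, math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def nearest(city_listing, lat, lon, limit=2):
    """Rank a downloaded city-wide listing on the client; the precise point is never sent."""
    return sorted(
        city_listing,
        key=lambda ap: haversine_km(lat, lon, ap["lat"], ap["lon"]),
    )[:limit]


# Hypothetical listing returned for the coarse query "all APs in Seattle, WA".
seattle = [
    {"name": "cafe1", "lat": 47.6100, "lon": -122.3330},
    {"name": "cafe2", "lat": 47.6200, "lon": -122.3500},
    {"name": "UW", "lat": 47.6550, "lon": -122.3080},
]
here = (dms(47, 36, 35), -dms(122, 19, 59))   # 47°36′35″N 122°19′59″W
print([ap["name"] for ap in nearest(seattle, *here)])
```

The directory only learns that the client is interested in Seattle; the ranking around the exact coordinates happens entirely on the client.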

Why are Reports Needed?
Measurement study of hotspots in Seattle: 13 locations in one district over 1 week.
Finding the best AP is non-trivial:
Large selection: 4 hotspot APs at each location, on average
Variable performance: AP bandwidth differs by up to 50x
Not obvious: official AP is not best at 30% of locations
Not testable: most APs cost money to use
→ Need historical data to choose the best AP
Before I discuss how we can solve this problem, we might ask why reports on hotspots are needed to begin with. To answer this question, we performed a measurement study of hotspots in Seattle. We found that selecting the best AP is non-trivial: there is a large selection even at a single location, APs vary substantially in performance, the official AP is not always the best, and users often cannot even test an AP’s performance before paying. So having historical reports helps users make much better selections.

Goal: Wifi-Reports
Client goal: get accurate summaries, don’t be trackable. Service goal: collect accurate reports.
So our goal is to build a system like this. First, a client measures the APs that he uses. Then he logs in to the service via an account authority, like Google. The account authority gives him the right to submit the report to a database where all reports are summarized. Future users can download and cache these summaries, as with the iPass hotspot client today, and then find the best APs nearby. The client’s goal, of course, is to get accurate summaries without being tracked. The service’s goal is to collect accurate reports from clients.

Design Requirements
Example reports: Bob’s Report on AP5, Bandwidth: 300 kbps. Mallory’s Report on AP4, Bandwidth: 10 Mbps. Bandwidth: 100 Mbps. Bob’s Reports on AP1, AP2, AP3, and AP4: “Doesn’t work!”
Client goal: get accurate summaries, don’t be trackable. Unlinkability: authority/databases cannot link a user’s reports.
Service goal: collect accurate reports. Limited influence: only count 1 report per AP, per user.
So how can we guarantee to users that they cannot be tracked? If they submit reports with their identity, they can easily be tracked by the account authority or the database, so we need to make the reports unlinkable. However, some clients may also submit many fraudulent reports, for example to boost their own AP’s reputation. So for the service to get mostly accurate reports, we also need to limit the influence of each user. Our goal is to count only one report per user per AP.

Threat Model
Account authority obeys protocol (violations can be detected)
Account authority prevents large-scale sybil attacks (e.g., signup requires credit card)
Most clients are honest
Unlinkability: authority/databases cannot link a user’s reports. Limited influence: only count 1 report per AP, per user.
To meet these conflicting goals, we make three assumptions. First, we assume that the account authority obeys the protocol I will describe, since violations can be detected. Second, we assume that the account authority can prevent a single user from obtaining many different accounts. This might be done by requiring a hard-to-forge credential such as a credit card number. This solution is imperfect, but it supports many reputation systems today. Finally, we assume that most users are honest; that is, they just download our measurement client and automatically measure and report on the APs that they use.

Straw Man Protocols
Straw man with limited influence: the client authenticates as Alice, measures cafe1, and submits R (R: report on cafe1). The service checks: if Alice has already submitted a report on cafe1 then abort, else save the report. The service sees Alice’s locations: cafe1, tmobile #3, Bob’s Network, Alcohol Anon Net, CMU, …
Straw man with unlinkability: the client measures cafe1 and submits R through a mix network. The database receives anonymous reports on cafe1 with Bandwidth: 100 Mb, 100 Mb, 5 Mb, 100 Mb, 5 Mb, 100 Mb, 100 Mb.

Wifi-Reports
Client: authenticate and download the list of all APs (cafe1, cafesolstice, tmobile #4, AT&T #54, …).
Client: generate a new key pair {kcafe1, k-1cafe1}. Blind the token kcafe1 → Tblind. Request: cafe1, Tblind.
Authority: if Alice requested cafe1 before then abort, else sign the token → Sblind. Reply: Sblind.
Client: unblind the signature → Scafe1 (kept in a stack of tokens). Measure cafe1. R: report on cafe1. Sign the report → SR.
Client: submit via mix network: cafe1, Scafe1, kcafe1, R, SR.
Database: verify the signatures. Delete old reports signed with kcafe1. (Report on cafe1, Bandwidth: 5 Mbps.)

Wifi-Reports
Unlinkability: the client blinds the token (kcafe1 → Tblind) before sending the request (cafe1, Tblind), unblinds the reply (Sblind → Scafe1), and submits cafe1, Scafe1, kcafe1, R, SR through a mix network.
Limited Influence: the authority checks “if Alice requested cafe1 before then abort, else sign the token → Sblind”, and the database verifies the signatures and deletes old reports signed with kcafe1.
Diagram: Report on cafe1, Bandwidth: 5 Mbps; Report on cafe2, Bandwidth: 5 Mbps.
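The blind-then-sign-then-unblind flow can be illustrated with textbook RSA blind signatures. This is a toy sketch under assumptions, not the Wifi-Reports implementation: the talk does not name the signature scheme, the modulus is built from two small Mersenne primes and offers no security, the token is hashed before signing, and all class and function names are hypothetical.

```python
import hashlib
import math
import secrets

# Toy public parameters (insecure, illustration only): N is the product of two Mersenne primes.
P, Q = 2**31 - 1, 2**61 - 1
N, E = P * Q, 65537


class Authority:
    """Signs one blinded token per (user, AP) and never sees the unblinded token."""

    def __init__(self):
        self._d = pow(E, -1, (P - 1) * (Q - 1))   # private exponent, held only by the authority
        self.issued = set()

    def sign_blinded(self, user: str, ap: str, blinded: int) -> int:
        if (user, ap) in self.issued:
            raise PermissionError("token for this AP was already requested")
        self.issued.add((user, ap))
        return pow(blinded, self._d, N)


def digest(token_key: bytes) -> int:
    """Hash the token (the per-AP public key) into the signing domain."""
    return int.from_bytes(hashlib.sha256(token_key).digest(), "big") % N


def blind(m: int):
    """Return (m * r^E mod N, r) for a random r that is invertible mod N."""
    while True:
        r = secrets.randbelow(N - 2) + 2
        if math.gcd(r, N) == 1:
            return (m * pow(r, E, N)) % N, r


def unblind(signed_blinded: int, r: int) -> int:
    return (signed_blinded * pow(r, -1, N)) % N


def verify(m: int, signature: int) -> bool:
    return pow(signature, E, N) == m


authority = Authority()
k_cafe1 = b"hypothetical public key for cafe1"    # stands in for kcafe1
m = digest(k_cafe1)
t_blind, r = blind(m)                                              # Tblind
s_cafe1 = unblind(authority.sign_blinded("alice", "cafe1", t_blind), r)  # Sblind -> Scafe1
print(verify(m, s_cafe1))
```

The database can check `verify` on a submitted report without knowing who requested the token, while the authority’s `issued` set is what caps each user at one token per AP.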

Wifi-Reports
Problem: Asking for a token reveals the target AP. Solution: Ask for the tokens for all APs in a city (APs in Seattle: shinkatea, starbucks2, tullys, cafe2, cafe1, UW, …).
Problem: Some users may submit bad reports (Report on cafe1, Bandwidth: 5 Mbps vs. Bandwidth: 100 Mb). Solution: Robust summary functions (e.g., median).
Protocol steps shown: authenticate and download list of APs; request: cafe1, Tblind; reply: Sblind; measure cafe1; submit via mix network: cafe1, Scafe1, kcafe1, R, SR.

Wifi-Reports: Other Details
Adding & removing APs; AP changes over time; rate limiting reports; AP spoofing attacks; eclipse attacks; side-channel attacks; collusion attacks; wireless channel quality. See dissertation for details.
There are, of course, several practical details that we need to deal with. For example, to ensure that the process of requesting tokens is anonymous, Alice can’t request tokens only for the APs that she uses, because then requesting a token would indicate that she has used that AP. Thus, Alice requests the tokens for all APs in the cities she visits, and the account authority cannot know which ones she has used. Second, some users may still submit bad reports, so we need to filter these out. To do this, we ensure that the functions used to compute summaries, such as the median, are robust to outliers. If most users are honest, this is sufficient.
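A small worked example shows why a robust summary function matters. The report values below are hypothetical; the fraud model (a report claiming infinite bandwidth) is the one used in the evaluation that follows.

```python
import math
import statistics

# Hypothetical bandwidth reports for one AP, in Mbps: nine honest, one fraudulent.
honest = [4.7, 4.8, 4.9, 5.0, 5.0, 5.0, 5.1, 5.2, 5.3]
reports = honest + [math.inf]          # fraud = "infinite bandwidth", 10% of reports

mean_summary = sum(reports) / len(reports)       # dragged to infinity by a single report
median_summary = statistics.median(reports)      # stays with the honest majority
print(mean_summary, median_summary)
```

A single fraudulent report destroys the mean, while the median moves only when fraudulent reports approach half of the total, which is why the assumption that most users are honest suffices.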

Evaluation
Performance overhead: What is the overhead of obtaining tokens? Implementation on single CPU server. Hotspot density estimated from JiWire.com. 0.02 cents/city/user on Amazon EC2. Overhead is small.
Besides providing privacy and limiting influence, Wifi-Reports must also be practical. Therefore, we implemented it and evaluated its performance on real data. First, we look at performance overhead. The first time a user wants to report on an AP in a city, he has to fetch tokens for all the APs in that city. This graph shows the time it takes to compute and download those tokens for the cities with the most APs, according to an existing hotspot directory. We see that even for the densest city, New York, the overhead is only several seconds. Furthermore, we could parallelize the computation to make it faster; doing this on, say, Amazon’s EC2 compute service would cost only 0.02 cents per user.

Evaluation
Performance overhead: What is the overhead of obtaining tokens? Implementation on single CPU server. Hotspot density estimated from JiWire.com. 0.02 cents/city/user on Amazon EC2. Overhead is small.
Prediction accuracy: How robust are predictions to fraud? Ground truth = measurement study. Fraud = report AP has infinite bandwidth. Robust to 10% fraud.
Figure: distribution of the ratio of predicted to actual throughput, including the ideal distribution.
Next we look at prediction accuracy in the face of fraud. That is, we simulate bad guys trying to boost the predicted bandwidth of an AP by reporting that it has infinite bandwidth. This graph shows how accurate predictions are as we increase the number of fraudulent reports for an AP. The x-axis shows the ratio of predicted to actual throughput, using our measurement study as the ground truth. If there were no fraud, we would get the red line, where our predictions are almost perfect 60% of the time. But we see that even if 10% of reports are fraudulent, the distribution of ratios is still nearly identical, so even a simple summary function like the median is robust to this amount of fraud in practice.

Mitigating eavesdropping threats Mitigating location service threats Conclusions & open problems To conclude, I’ll show how this all ties together and outline some problems that remain.

Goal: Prevent the collection of location traces (identity + locations) Select Alice: Find the best AP near (47°36′35″N 122°19′59″W) Rendezvous From: 11:22:33:44:55:66 To: BROADCAST Search probe Announcement To: AA:BB:CC:DD:EE:FF Credentials, key exchange From: AA:BB:CC:DD:EE:FF To: 11:22:33:44:55:66 Authenticate and Bind Send Data Recall that our goal was to prevent third parties from collecting location traces about us. To do this we need to conceal either our identity or locations that remain exposed in current protocols. JiWire.com Report Database Alice’s Report on Bob’s AP Location: ( , ) Quality: 1 Mbps, blocks VoIP Report

Improved Wireless Protocols
Goal: Prevent the collection of location traces (identity + locations) Select Coarser queries Hides location Alice: Give me the list of all APs in Seattle, WA Rendezvous Authenticate and Bind Send Data SlyFi Hides identity In this dissertation, I presented ways to do this in all these protocol phases. Previous work showed how location can be concealed in the selection phase using coarse location queries. We showed how a link layer protocol can hide identity, both explicit and implicit, by making all bits appear random. We also showed how reports to a location-based service can hide identity but still allow the service to regulate fraud. JiWire.com Report Database Report on Bob’s AP Location: ( , ) Quality: 1 Mbps, blocks VoIP Wifi-Reports Hides identity Report

Open Problem: Bootstrapping
Select Bootstrap Name: Alice’s Laptop Secret: Alice<3Bob Name: Bob’s Network Existing protocols require manual bootstrapping. Can we bootstrap SlyFi automatically and privately? Rendezvous Authenticate and Bind Send Data What problems remain? Well, first, in all link layer protocols, including SlyFi, there is still a manual bootstrapping phase where Alice and Bob exchange a secret. This can be cumbersome and non-intuitive. Thus there is interest in automating this phase as much as possible. An open question is how we can automate this for SlyFi in a private way. I discuss some initial solutions in my dissertation. Report Report on Bob’s AP Location: ( , ) Quality: 1 Mbps, blocks VoIP JiWire.com Report Database

Open Problem: Physical-layer Identifiers
Select Bootstrap Name: Alice’s Laptop Secret: Alice<3Bob Name: Bob’s Network Signal analyzers can detect physical-layer fingerprints Rendezvous Authenticate and Bind Send Data Secondly, eavesdroppers with more expensive equipment, such as signal analyzers, can potentially detect fingerprints at the physical layer, below the link layer. An open question is how we can modify the hardware to conceal these implicit identifiers without undue economic cost. How can we conceal physical-layer fingerprints? JiWire.com Report Database Report on Bob’s AP Location: ( , ) Quality: 1 Mbps, blocks VoIP

Open Problem: Collaborative Filtering
Select Bootstrap Name: Alice’s Laptop Secret: Alice<3Bob Name: Bob’s Network Collaborative filtering services need to link a person’s location history. What is the appropriate privacy model here? Rendezvous Authenticate and Bind Send Data Finally, location-based collaborative filtering services may need to link our location history to provide their service. This is because their goal is to help users find people with similar histories to their own. An open question is what the appropriate privacy model is for these types of services. “Find people with similar frequent places” Dating Site Alice’s frequent places cafe1: ( , ) CMU: ( , ) Report

Summary The existing protocols and techniques that wireless devices use to discover and communicate with each other pose risks to users' location privacy, but it is practical to redesign these protocols and techniques to substantially mitigate these threats. Acknowledgements Before I end, I just want to acknowledge that my work was not done in isolation. This slide shows the short list of the people that have most influenced it. Ben Greenstein Ramakrishna Gummadi Michael Kaminsky Tadayoshi Kohno Damon McCoy Bryan Parno Adrian Perrig Srinivasan Seshan Peter Steenkiste David Wetherall many others

=== MOTIVATION ===

What can Protocol Control Info Reveal?
Location traces can be deanonymized [Beresford 03, Hoh 05-07, Krum 07] Kim’s House 00:16:4E:11:22:33 How to link to personal ident

Who Should Care About Tracking?
End-users CRA Grand Challenge: “Give computer end-users privacy they can control” Service providers They can’t protect customers from eavesdroppers even if they don’t track users themselves Device manufacturers Privacy concerns about tracking can hurt sales (e.g., Intel CPUID debacle, Benetton RFID boycott) Now, there are obviously other ways that we can be tracked, for example by cell phone companies or by web pages. However, in each of these circumstances, we are trading our privacy for some useful service. Tracking by unknown eavesdroppers is concerning because we cannot control who gets our information. This should concern end-users just as they are concerned about web bugs tracking their web browsing habits. Service providers should be concerned because even if they protect their customers’ data, they can’t prevent other parties from tracking customers while they use the providers’ networks. And finally, history has shown that tracking concerns can hurt sales of products, such as when Intel added a unique ID to its CPUs or when Benetton added radio identifier tags to its clothing.

Fingerprints Related Work
Other fingerprints Device driver fingerprints [Franklin 06] TCP clock-skew fingerprints [Kohno 05] AP beacon clock skew [Jana 08] Physical layer fingerprints (using specialized hardware) [Brik 08, Patwari 07, Hall 04] Our contributions in comparison: First link-layer fingerprints for individual devices Enabling tracking when link-layer encryption is employed Enabling better coverage than some previous work Showing how to combine implicit identifiers

Unlinkable Tokens Related Work
Unlinkable tokens in discovery Public key protocol (slow in practice) [Abadi 04] Application layer protocol to find friends (uses hash-chain) [Cox 07] Unlinkable tokens in data transport General proposal, analysis for TCP (masking piecemeal) [Nikander 05] proposal (inefficient) [Armknecht 07] Bluetooth proposal (uses hash-chain) [Singelee 06] SlyFi contributions in comparison First to ensure no bits exposed (not masking identifiers piecemeal) First to handle all major wireless protocol functions First to leverage existing hardware (AES counter instead of hash-chain) First link layer protocol implementation & evaluation on real devices

Location Privacy Related Work
Privacy in location-based services, e.g., [Beresford ‘03, Gruteser ‘03, Schilit ‘03] Don’t protect against third party eavesdroppers Privacy in RFID, e.g., [Fishkin ‘04, Juels ‘05] General protocols do more than identification Privacy using temporary device addresses [Gruteser ’05, Hu ’06, Jiang ’07, Stajano ‘05]

Other Attacks Enabled User profiling attack Movie signature attack
User profiling, inventorying, relationship profiling [Greenstein 07, Jiang 07, Pang 07] Side-channel analysis on packet sizes and timing Exposes keystrokes, webpages, movies, VoIP calls, … [Liberatore 06, Saponas 07, Song 01, Wright 08, Wright 07] Home header Is “djw” here? “djw” is here User profiling attack ≈ DFT Movie signature attack So far, I have only talked about tracking, but the ability to link sessions together also enables other attacks, such as profiling and inventorying. For example, at the start of the talk, I showed how network names can reveal where you have been before. In addition, a sequence of linked data packets can reveal information about its contents even if encrypted. This includes things like keystrokes, movies streamed, and even words in your voice-over-IP calls. This is because, in aggregate, packet sizes and timings can act as fingerprints for content as well as identity. Keystroke timing attack

=== MOBISYS ===

Why not GSM Pseudonyms? GSM pseudonym properties
Provider must assign new pseudonym to client to change it Only a single application used on GSM network GSM pseudonyms not sufficient when Both parties in discovery want to be private May require using pseudonym when the provider is not present (e.g., during discovery) Many applications with many side-channels Must accommodate device heterogeneity, evolution

802.11w: Protected Management Frames
Confidentiality Authenticity Unlinkability Integrity Efficiency Data Payload Data Payload Data Payload 802.11i (WPA) Unicast Frames 802.11i w Long Term MAC Pseudonyms Public Key Symmetric Key Long Term SlyFi: Discovery/Binding SlyFi: Data packets

Straw man: Public Key Protocol
Client Service Check signature: Try to decrypt K-1Bob KAlice Probe “Bob” Sign: K-1Alice Key-private encryption (e.g., ElGamal) KBob Too slow in practice! FIX XXXX One way we can do this is … ~100 ms/packet O(1) Based on [Abadi ’04]

PHY Layer Signatures Physical layer fingerprints [Brik 08, Patwari 07, Hall 04] Require uncommon and/or expensive hardware Not as accurate in all circumstances These may be obscured by adding analog noise in hardware SlyFi raises the bar and is a necessary first step [Figure: packets sent to the AP, some attributed to Alice or Charlie and others unattributed ("???")]

Linking with Signal Strength
Attack: website fingerprinting [Liberatore ‘06] Attacker has 5 nodes to record packets’ RSSIs Attacker uses k-means clustering to determine which packets belong to each client. Set of RSSIs is the feature vector. Varying RSSI by +/- 5 dB reduces accuracy even further to 30% [Bauer ‘08] Side-channel attack accuracy degrades significantly even if attacker tries to use signal strength to link packets

Why not Time for Data Transport?
Data messages: Frequent: sent often to deliver data (1000+ pkts/sec) Wide interface: many applications, many side-channels → Linkability at short timescales is NOT usually OK → Can NOT use loosely synchronized time to synchronize i [Figure: sequence of tokens Ti for the pair AB]

Other Protocol Details
Broadcast All broadcast packets routed through the AP Use same shared key for all the clients of the AP Higher-layer binding Clients report “pseudonym MAC address”-to-IP address bindings to AP AP answers all ARP queries Time synchronization and roaming Use protected broadcast to transmit timestamps, same BSSID info Coexistence with Encapsulate SlyFi in “anonymous” frame with unused FC code Clients first search for SlyFi AP, then fall back to non-private AP search Link-layer ACKs If fast enough, just acknowledge last SlyFi token sent Our software implementation uses windowed ACKs

Multi-Party Discovery
[Figure: packet layout. One token per receiver, e.g. Ti = AES(KAB, i) for the pair AB and a corresponding token for the pair AC. Each token is followed by "Kpayload, offset", encrypted and authenticated under the pair key: Enc(KAB, 0, …), MAC(KAB, …). The Search Probe payload is protected with a random key: Enc(Kpayload1, 0, …), MAC(Kpayload2, …).] On receipt, check each 16-byte block that could be a token. 1500 byte packets → up to 31 lookups per packet. Can be done in parallel in hardware. What if there are more than 31 receivers? Option 1: Duplicate packet with different tokens in each. Option 2: Approximate token matching; open problem and future work.
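The receive-side filtering described on this slide can be sketched roughly as follows. This is an illustration under stated assumptions, not the SlyFi implementation: HMAC-SHA256 truncated to 16 bytes stands in for AES(KAB, i) because Python's standard library has no AES primitive, and the function names, keys, and packet layout are hypothetical.

```python
import hashlib
import hmac

TOKEN_LEN = 16  # the slide's tokens are 16-byte blocks

def token(key: bytes, i: int) -> bytes:
    """Stand-in for T_i = AES(K_AB, i): a keyed pseudorandom 16-byte value.

    Assumption: truncated HMAC-SHA256 replaces AES for this sketch only.
    """
    return hmac.new(key, i.to_bytes(8, "big"), hashlib.sha256).digest()[:TOKEN_LEN]

def build_table(peer_keys: dict, i: int) -> dict:
    """Map each peer's expected token for index i back to the peer's name."""
    return {token(key, i): peer for peer, key in peer_keys.items()}

def find_sender(packet: bytes, table: dict, max_blocks: int):
    """Check each leading 16-byte block that could be a token.

    Only table lookups are needed; nothing is decrypted to filter a packet.
    """
    for b in range(max_blocks):
        block = packet[b * TOKEN_LEN:(b + 1) * TOKEN_LEN]
        if len(block) < TOKEN_LEN:
            break
        if block in table:
            return table[block]
    return None  # no token matched: the packet is not for this receiver

# Hypothetical usage: Bob shares one key with Alice and one with Carol.
bob_keys = {"alice": b"K_AB-demo-key", "carol": b"K_CB-demo-key"}
table = build_table(bob_keys, i=42)

# Alice addresses two receivers; the token for Bob is the second block.
pkt = token(b"K_AD-demo-key", 42) + token(b"K_AB-demo-key", 42) + b"encrypted payload"
sender = find_sender(pkt, table, max_blocks=4)
print(sender)
```

Because the table is rebuilt only when the index i advances, the per-packet cost is a handful of dictionary lookups regardless of how many keys the receiver holds.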

Packet Format Token Unencrypted Message Tryst: Discovery/Binding
Shroud: Data Transport

Protocol Timing Diagram
Discover Authenticate and Bind Send Data

Performance Evaluation
Comparison protocols wifi-open: with no security wifi-wpa: with WPA PSK/CCMP public-key: straw man symmetric-key: straw man armknecht: previous header encryption proposal Similar We argued that SlyFi is faster than the encrypt everything approach because it doesn’t have to do any cryptographic operations to filter packets, only a table lookup. To evaluate how well SlyFi can perform in practice, … Background traffic AP with 500 accounts 50 associations Measure Alice’s connection to Bob Experiment:

“Encrypt everything” fails to set up many links
Link Setup Failures Setup Failures Time to set up a link “Encrypt everything” fails to set up many links

Token Computation Time
(Once every token interval) Using software AES, 256 MHz Geode processor Token computation time is negligible

Data Throughput vs. Packet Size
SlyFi data transport overhead is similar to WPA

SlyFi data filtering is about as efficient as 802.11
Data Throughput Higher = Better With simulated AES hardware Performs like symmetric key Next we look at data throughput. That is, how fast can Alice send Bob traffic once the link has been established…. All lines decrease as the rate of background traffic increases… SlyFi has a small overhead compared to wifi-open, which has no security, but both degrade in the same graceful manner. This is in contrast to the other header encryption protocol, in green, which performs like the symmetric key protocol because it must try all keys to filter out a packet. … SlyFi data filtering is about as efficient as

Empirical Stream Interleaving
Many streams interleaved even at short timescales

Empirical Background Probe Rate
Background probes are frequent in practice

Empirical # Probes per Client
Some clients probe for many network names

=== MOBICOM ===

Fingerprint Summary 802.11 Networks: Public Home Enterprise
SSIDs in probes Broadcast pkt sizes MAC header fields e.g. supported rates, auth algs. e.g. NetBIOS, mDNS queries (e.g., iTunes) e.g. home, work network names SSID: djw mDNS packet sizes: 245, 239 rates: 11, 2, 1Mbps SSID: djw mDNS packet sizes: 245, 239 rates: 11, 2, 1Mbps time

..... Tracking 802.11 Users Tracking scenario:
Every user changes pseudonyms every hour Adversary monitors some locations One hourly traffic sample from each user in each location Build a profile from training samples: First collect some traffic known to be from user X and from random others tcpdump ..... ? ? tcpdump Traffic at 2-3PM Traffic at 3-4PM Traffic at 4-5PM Then classify observation samples

Sample Classification Algorithm
Core question: Did traffic sample s come from user X? A simple approach: naïve Bayes classifier Derive probabilistic model from training samples Given s with features F, answer “yes” if: Pr[ s from user X | s has features F ] > T for a selected threshold T. F = feature set derived from implicit identifiers Verbiage: In our paper, we show that an adversary can pick an appropriate threshold T using training samples. I omit this now for brevity.
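The decision rule on this slide can be sketched as below. This is a minimal illustration of a naive Bayes posterior test; the prior, the per-feature likelihoods, and the function names are hypothetical values chosen for the example, not numbers from the paper.

```python
def posterior_from_user(likelihoods_user, likelihoods_other, prior_user):
    """Pr[s from X | s has features F] under the naive Bayes independence assumption.

    likelihoods_user[k]  = Pr[value of feature k | sample is from X]
    likelihoods_other[k] = Pr[value of feature k | sample is from someone else]
    """
    p_user = prior_user
    p_other = 1.0 - prior_user
    for l_user, l_other in zip(likelihoods_user, likelihoods_other):
        p_user *= l_user
        p_other *= l_other
    # Bayes' rule: normalize over the two hypotheses.
    return p_user / (p_user + p_other)

def classify(likelihoods_user, likelihoods_other, prior_user, threshold):
    """Answer 'yes, s is from X' only if the posterior exceeds threshold T."""
    return posterior_from_user(likelihoods_user, likelihoods_other, prior_user) > threshold

# Hypothetical sample with two features, and a 1% prior that a sample is from X.
post = posterior_from_user([0.8, 0.6], [0.1, 0.2], prior_user=0.01)
print(round(post, 3))
print(classify([0.8, 0.6], [0.1, 0.2], 0.01, threshold=0.1))
print(classify([0.8, 0.6], [0.1, 0.2], 0.01, threshold=0.5))
```

Even though both features are far more likely under X, the small prior keeps the posterior modest, which is why the choice of T matters.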

Sample Classification Algorithm
Deriving features F from implicit identifiers Set similarity (Jaccard Index), weighted by frequency: linksys djw SIGCOMM_1 PROFILE FROM TRAINING IR_Guest SAMPLE FROM OBSERVATION Rare Common w(e) = low w(e) = high Note that this feature is a real-valued ratio (and thus is continuous). Therefore, we estimate its probability distributions from the training data using standard density estimation.
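A weighted Jaccard index can be sketched as follows. The slide does not give the weighting formula, so the inverse-frequency weight used here (rarer names count more) is an assumption, and the SSID counts, set memberships, and function names are hypothetical.

```python
def weighted_jaccard(profile: set, sample: set, weight) -> float:
    """Sum of weights over the intersection divided by sum over the union."""
    union = profile | sample
    if not union:
        return 0.0
    shared = profile & sample
    return sum(weight(e) for e in shared) / sum(weight(e) for e in union)

# Hypothetical: number of users in the training data who probed for each SSID.
ssid_counts = {"linksys": 200, "djw": 1, "SIGCOMM_1": 150, "IR_Guest": 3}

def inverse_frequency(e):
    # Assumed weighting: an SSID seen from few users is more identifying.
    return 1.0 / ssid_counts.get(e, 1)

profile = {"linksys", "djw", "SIGCOMM_1"}   # from training
sample = {"linksys", "djw", "IR_Guest"}     # from observation

score = weighted_jaccard(profile, sample, inverse_frequency)
unweighted = weighted_jaccard(profile, sample, lambda e: 1.0)
print(unweighted, round(score, 3))
```

Under this weighting, sharing the rare name "djw" raises the similarity well above the unweighted value, while sharing the common name "linksys" contributes almost nothing.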

Evaluating Classification Effectiveness
Simulate tracking scenario with wireless traces: Split each trace into training and observation phases Simulate pseudonym changes for each user X
Trace                          Duration  Profiled Users  Total Users
SIGCOMM conf. (2004)           4 days    377             465
UCSD office building (2006)    1 day     153             615
Apartment building (2006)      14 days   39              196

Evaluating Classification Effectiveness
Question: Is observation sample s from user X? Evaluation metrics: True positive rate (TPR) Fraction of user X’s samples classified correctly False positive rate (FPR) Fraction of other samples classified incorrectly = ??? Measure TPR = Fix T for FPR Verbiage: we note that an adversary can lower the FPR by increasing the classification threshold at the cost of TPR. For example, this graph shows the tradeoff for the MAC protocol fields feature. We assume that the adversary wants a low FPR (= 1%) and show the TPR that they can achieve. (See paper) Pr[ s from user X | s has features F ] > T
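The "fix T for a target FPR, then measure TPR" procedure can be sketched as follows. The scores are made-up posterior values and the 10% target is chosen only because the toy data set is tiny; the talk uses a 1% target. Function names are hypothetical.

```python
def rates(scores_from_x, scores_from_others, threshold):
    """Return (TPR, FPR) for the rule 'classify as X if score > threshold'."""
    tpr = sum(s > threshold for s in scores_from_x) / len(scores_from_x)
    fpr = sum(s > threshold for s in scores_from_others) / len(scores_from_others)
    return tpr, fpr

def pick_threshold(scores_from_x, scores_from_others, max_fpr):
    """Smallest candidate threshold whose FPR is at most max_fpr.

    TPR can only fall as the threshold rises, so the smallest qualifying
    threshold gives the best TPR at that FPR on these samples.
    """
    candidates = sorted(set(scores_from_x) | set(scores_from_others))
    for t in candidates:
        tpr, fpr = rates(scores_from_x, scores_from_others, t)
        if fpr <= max_fpr:
            return t, tpr, fpr
    return None  # not reached for max_fpr >= 0: the largest score gives FPR 0

# Hypothetical posterior scores from training samples.
scores_x = [0.9, 0.8, 0.3, 0.7]
scores_others = [0.1, 0.2, 0.3, 0.05, 0.6, 0.15, 0.25, 0.35, 0.02, 0.12]

result = pick_threshold(scores_x, scores_others, max_fpr=0.1)
print(result)
```

Raising the threshold past the chosen value would push the FPR to zero here, but only by also rejecting more of user X's own samples.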

Results: Individual Feature Accuracy
1.0 TPR  60% TPR  30% Individual implicit identifiers give evidence of identity

Results: Multiple Feature Accuracy
Users with TPR >50%: Public: 63% Home: 31% Enterprise: 27% Public Home Enterprise netdests ssids bcast fields We can identify many users in all environments

Results: Multiple Feature Accuracy
Public networks: ~20% users identified >90% of the time Public Home Enterprise netdests ssids bcast fields Some users much more distinguishable than others

One Application Question: Was user X here today?
More difficult to answer: Suppose N users present each hour Over an 8 hour day, 8N opportunities to misclassify Decide user X is here only if multiple samples are classified as his Revised: Was user X here today for a few hours?
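The effect of requiring multiple matching samples can be illustrated with a binomial calculation. This sketch assumes user X is absent, that each of the 8N samples is independently misclassified with probability equal to the per-sample FPR, and uses hypothetical values N = 5 and FPR = 1%; none of these numbers come from the evaluation.

```python
from math import comb

def prob_at_least(k, n, p):
    """P[at least k of n independent trials succeed], each with probability p."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

FPR = 0.01        # per-sample false positive rate (assumed)
HOURS = 8         # samples per user per day
N = 5             # users present each hour (hypothetical)
samples = HOURS * N   # 8N opportunities to misclassify

# Probability of wrongly deciding "X was here" when X was absent,
# if the adversary requires at least k samples classified as X.
p_k1 = prob_at_least(1, samples, FPR)
p_k2 = prob_at_least(2, samples, FPR)
p_k3 = prob_at_least(3, samples, FPR)
print(round(p_k1, 3), round(p_k2, 3), round(p_k3, 3))
```

Under these assumptions, a single-sample rule is wrong on roughly a third of days, while requiring two or three matching samples cuts the error sharply, which motivates the revised question.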

Results: Tracking with 90% Accuracy
Many users can be identified

=== WIFI-REPORTS ===

official key pair for cafe1, …
Wifi-Reports
Service: {Kcafe1, K-1cafe1} = official key pair for cafe1, …
List of all APs: cafe1 = Kcafe1, … (cafe1, cafesolstice, tmobile #4, AT&T #54)
Client: authenticate and download list of APs
Client: {kcafe1, k-1cafe1} ← new key pair; r ← random(); Tblind ← blind(Kcafe1, H(kcafe1), r)
Client to service: request: cafe1, Tblind
Service: If Alice already requested cafe1 then abort, else: Sblind ← sign(K-1cafe1, Tblind)
Service to client: reply: Sblind
Client: Scafe1 ← unblind(Kcafe1, Sblind, r); verify(Kcafe1, Scafe1, H(kcafe1)) = 1
Client: measure cafe1; R ← new report on cafe1 (Report on cafe1 Bandwidth: 5 Mbps); SR ← sign(k-1cafe1, H(R))
Client to database: send over mix network: submit: cafe1, Scafe1, kcafe1, R, SR
Database: verify(Kcafe1, Scafe1, H(kcafe1)) = 1; verify(kcafe1, SR, H(R)) = 1; Delete old reports signed with kcafe1
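The blind/sign/unblind/verify steps above can be sketched with textbook RSA blinding. The slide does not say which blind signature scheme Wifi-Reports uses, so RSA is an assumption here, as are the toy key size (far too small to be secure), the function names, and the placeholder bytes standing in for the fresh reporting key kcafe1.

```python
import hashlib
import secrets
from math import gcd

# Toy "official" key pair for one AP, built from two known Mersenne primes.
P, Q = 2**31 - 1, 2**61 - 1
N, PHI = P * Q, (P - 1) * (Q - 1)
E = 65537                 # public exponent (plays the role of K_cafe1)
D = pow(E, -1, PHI)       # private exponent (K^-1_cafe1); needs Python 3.8+

def H(data: bytes) -> int:
    """Hash to an integer modulo N."""
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % N

def blind(m: int, r: int) -> int:
    return (m * pow(r, E, N)) % N

def sign(x: int) -> int:
    return pow(x, D, N)

def unblind(s_blind: int, r: int) -> int:
    return (s_blind * pow(r, -1, N)) % N

def verify(sig: int, m: int) -> bool:
    return pow(sig, E, N) == m

issued = set()  # (user, AP) pairs that already received a token

def issue(user: str, ap: str, t_blind: int) -> int:
    """Service side: at most one token per user per AP, signed without seeing H(k)."""
    if (user, ap) in issued:
        raise PermissionError("token for this AP already issued to this user")
    issued.add((user, ap))
    return sign(t_blind)

# Client side: blind the hash of a fresh per-AP reporting key.
report_pubkey = b"placeholder bytes for the fresh public key k_cafe1"
m = H(report_pubkey)
while True:
    r = secrets.randbelow(N - 2) + 2
    if gcd(r, N) == 1:    # r must be invertible modulo N
        break
t_blind = blind(m, r)

s_blind = issue("alice", "cafe1", t_blind)   # service signs the blinded value
s_cafe1 = unblind(s_blind, r)                # client removes the blinding
ok = verify(s_cafe1, m)                      # anyone can check the token
print(ok)
```

The service authenticates Alice when issuing, which lets it enforce one token per AP, but the unblinded token it later sees attached to a report cannot be linked back to her request.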

Bandwidth of commercial APs in Seattle (by location)
Are Reports Needed? Bandwidth of commercial APs in Seattle (by location) The first question we ask, however, is whether it is useful to have users submit reports to a hotspot directory to begin with. To answer it, we performed a measurement study of hotspots in Seattle over a one-week period. We want to see if there is substantial variance in performance or functionality. This graph shows the download throughput achieved at a number of hotspot APs in one commercial district. There is one point per AP, and we organize them by the specific location where the measurement was taken. First we see that there are lots of APs to choose from, even at a single location. red = “official” AP grey = other visible AP

Are Reports Needed? Bandwidth of commercial APs in Seattle (by location) Mean capacity range is 2 orders of magnitude. Note the log scale on the y-axis. red = “official” AP grey = other visible AP

Are Reports Needed? Bandwidth of commercial APs in Seattle (by location) In addition, note that even the “official” AP at a given location is not always the best. Moreover, there are differences in cost and other metrics. Finally, we note that most of the best APs were “pay-per-access”, so users could not test APs beforehand without paying money. So choosing an AP blindly is probably going to result in suboptimal performance. Having historical reports would thus help users select the best AP for them. red = “official” AP grey = other visible AP

Wifi-Reports Improves AP Selection
