Automatic for the people: Reducing inadvertent leaks by personal machines Landon Cox Duke University
Inadvertent leaks Usability and privacy: A Study of Kazaa... ‣ Good and Krekelberg, CHI, 2003 ‣ In 12 hours, found 150 inboxes on Kazaa ‣ Observed people downloading dummy inbox Problem hasn’t gone away
Stories from 2009
Technical solution? [Diagram: a reference monitor applies a policy to each process's access to the network, files, and IPC] ‣ Servers: Asbestos, HiStar, Flume ‣ Languages: Jif, Laminar, Resin ‣ Desktop: PrivacyScope, TightLip ‣ Policy source: Dev, Admin, User, Automation
Automatic policy specification State of the art: pattern matching ‣ Look for strings that look like SSNs, CCs, etc. ‣ find_SSNs, Firefly, SENF, Spider, etc. ‣ A bit brittle and error-prone ‣ High false positive/negative rates Let’s take a different approach
Key observations 1) Personal machines often cache sensitive data 2) Servers force clients to access files using crypto 3) Crypto is general technique, used across admin. domains and applications
RedFlag overview Identifies processes that store decrypted data ‣ Unobtrusive (requires no user input) ‣ Compatible with legacy applications ‣ Compatible with existing Internet protocols High-level insights ‣ Stop trying to figure out what sensitive data looks like ‣ Use heuristics of how sensitive data is handled
Caveats We cannot stop all inadvertent leaks ‣ Stop large, important class of leaks Trust and threat model ‣ Uncompromised host ‣ No IP spoofing or DNS hijacking ‣ Correct, trusted reference monitor (take your pick) ‣ Buggy/absent access-control policies
RedFlag system overview Monitor sockets Inspect process Compose rules
Monitoring sockets Goal ‣ Try to identify incoming encrypted data ‣ Only at application level (e.g., SSL) Easy for most widely used apps ‣ Look at remote port (e.g., 443 or 993) Not always sufficient ‣ Non-standard ports: Skype, Groove, Groupwise ‣ XMPP sends SSL, non-SSL data to same port (5222/TCP)
Information entropy Compute entropy score for ambiguous ports ‣ Negligible performance overhead ‣ If score above threshold (~7.9 bits/byte), invoke inspection process Can induce false positives ‣ Compressed data sent in the clear (e.g., mp3s) ‣ On-the-fly compression schemes (e.g., HTTP Content-Encoding: gzip) Luckily, doesn’t need to be 100% accurate ‣ Really just a performance optimization to save work ‣ Only used as a first-pass filter ‣ Correct any mistakes in inspection phase
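A minimal sketch of the entropy score, assuming it is the Shannon entropy of the byte-value histogram of a received buffer. The function names and the choice to score one whole buffer at a time are illustrative assumptions, not details from the slides; only the ~7.9 bits/byte threshold comes from the talk.

```python
import math
from collections import Counter

# Approximate threshold quoted on the slide, in bits per byte.
ENTROPY_THRESHOLD = 7.9


def entropy_bits_per_byte(data: bytes) -> float:
    """Shannon entropy of the byte histogram, between 0.0 and 8.0."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def needs_inspection(payload: bytes) -> bool:
    """First-pass filter (hypothetical name): flag high-entropy payloads."""
    return entropy_bits_per_byte(payload) > ENTROPY_THRESHOLD
```

Short buffers cannot reach the threshold (a buffer needs well over 200 distinct byte values to score above 7.9), so a real monitor would presumably accumulate a reasonably large window before scoring. Compressed plaintext also scores high, which is the false-positive case the slide mentions.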
RedFlag system overview Monitor sockets Inspect process Compose rules
Inspect process Goals of inspection ‣ Infer when file write depends on network read ‣ Determine whether file write is decrypted data Use taint-tracking ‣ Too slow to perform in critical path of desktop apps ‣ Perform asynchronously via deterministic replay ‣ Fork if network monitor flags process (port or entropy) ‣ Log libc calls in original, use log in replay process ‣ Attach taint-tracker to replayed process (e.g., PIN) ‣ Perform analysis on a free core in the background
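The log-and-replay idea can be illustrated with a toy sketch: the original run records the result of each nondeterministic call, and the replayed run consumes the log instead of touching the network. The `Recorder`, `Replayer`, and `app` names are hypothetical, and this does not reflect how Jockey or any real replay tool intercepts libc calls.

```python
class Recorder:
    """Original run: perform each call and append its result to a log."""

    def __init__(self):
        self.log = []

    def call(self, name, fn, *args):
        result = fn(*args)
        self.log.append((name, result))
        return result


class Replayer:
    """Replayed run: return logged results without performing the call."""

    def __init__(self, log):
        self._entries = iter(log)

    def call(self, name, fn, *args):
        logged_name, result = next(self._entries)
        if logged_name != name:
            raise RuntimeError(f"replay diverged: expected {logged_name}, got {name}")
        return result


def app(env, read_network):
    """Stand-in application: one network read followed by local work."""
    data = env.call("recv", read_network)
    return data.upper()
```

Because the replayed process sees the same inputs in the same order, a slow analysis such as taint tracking can be attached to it without delaying the original.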
Taint tracking Implement with PIN ‣ Rewrite instructions to propagate taint ‣ Record taint in shadow memory Key questions ‣ What are the taint sources? ‣ What info to send to the policy composer?
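As a rough illustration of shadow memory, the sketch below keeps one label per tainted byte address and propagates labels on copies. It is a toy model in Python under assumed names (`ShadowMemory`, `register_source`, and so on), not the Pin tool itself, and it handles only copy-style propagation.

```python
class ShadowMemory:
    """Toy shadow memory: byte address -> source ID (absent means untainted)."""

    def __init__(self):
        self.labels = {}   # address -> index into self.sources
        self.sources = []  # source ID -> description of the taint source

    def register_source(self, description):
        self.sources.append(description)
        return len(self.sources) - 1

    def taint(self, addr, length, source_id):
        """Mark a buffer as derived from a source, e.g. after a network read."""
        for a in range(addr, addr + length):
            self.labels[a] = source_id

    def copy(self, dst, src, length):
        """Propagate labels for a memcpy-style move of `length` bytes."""
        # Snapshot first so overlapping ranges behave like memmove.
        moved = [self.labels.get(src + i) for i in range(length)]
        for i, label in enumerate(moved):
            if label is None:
                self.labels.pop(dst + i, None)  # untainted data clears taint
            else:
                self.labels[dst + i] = label

    def sources_of(self, addr, length):
        """Sources a buffer depends on, e.g. checked at a file write."""
        return {self.sources[self.labels[a]]
                for a in range(addr, addr + length) if a in self.labels}
```

At a file write, `sources_of` on the written buffer answers the first inspection goal: whether the write depends on an earlier network read, and on which one.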
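Shadow memory [Figure: each byte of the address space (e.g., “<!DOCTYPE html PUBLIC...”) has a one-byte taint label in shadow memory; the label is an ID into an ID/Source table with entries such as “/tmp/attach.pdf, :443”] ‣ Fine when there is no ambiguity about the source ‣ But what about ambiguous ports?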
Ambiguous ports Search process memory for AES s-boxes ‣ S-boxes are set by algorithm designer ‣ S-boxes are unlikely to appear randomly ‣ (also look for well-known transformations)
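A minimal sketch of the s-box search, assuming the table appears in memory as the standard 256 contiguous bytes. The S-box is derived from its definition (multiplicative inverse in GF(2^8) followed by the affine transform) rather than pasted in; `find_sbox` is a hypothetical helper name.

```python
def _gf_mul(a: int, b: int) -> int:
    """Multiply in GF(2^8) modulo the AES polynomial x^8 + x^4 + x^3 + x + 1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r


def _rotl8(x: int, s: int) -> int:
    return ((x << s) | (x >> (8 - s))) & 0xFF


def aes_sbox() -> bytes:
    """Build the AES S-box: field inverse, then the affine transformation."""
    box = []
    for x in range(256):
        inv = 0 if x == 0 else next(y for y in range(1, 256) if _gf_mul(x, y) == 1)
        s = inv
        for shift in range(1, 5):
            s ^= _rotl8(inv, shift)
        box.append(s ^ 0x63)
    return bytes(box)


AES_SBOX = aes_sbox()


def find_sbox(image: bytes) -> int:
    """Offset of the first byte-wise AES S-box in a memory image, or -1."""
    return image.find(AES_SBOX)
```

Implementations that store the table as 32-bit words or as precomputed T-tables would not match this byte pattern, which is presumably why the slide also mentions looking for well-known transformations.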
Ambiguous ports If we find s-boxes in a library data section ‣ Assume image is a crypto library ‣ Vast majority of crypto libraries include AES implementation Instrument lib to set “crypto bit” of inbound taint labels ‣ If crypto bit == 1, network data was “routed” through crypto lib ‣ If crypto bit == 0, assume network data was not decrypted Also use s-boxes as taint source ‣ Data derived from s-boxes have “AES bit” set ‣ Can use to gauge strength of crypto algorithm Taint label (byte): ID index, AES bit, crypto bit
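The one-byte taint label can be sketched as a packed bit field. The slide names the three fields but not their positions, so the layout below (low six bits for the ID index, bit 6 for the AES bit, bit 7 for the crypto bit) is purely an assumption for illustration, as are the function names.

```python
# Assumed layout; the slides do not specify bit positions.
ID_MASK = 0x3F     # bits 0-5: index into the source table
AES_BIT = 0x40     # bit 6: data derived from an AES s-box
CRYPTO_BIT = 0x80  # bit 7: data passed through a crypto library


def pack_label(source_id: int, aes: bool = False, crypto: bool = False) -> int:
    if not 0 <= source_id <= ID_MASK:
        raise ValueError("source ID does not fit in the label")
    return source_id | (AES_BIT if aes else 0) | (CRYPTO_BIT if crypto else 0)


def unpack_label(label: int):
    """Return (source_id, aes_bit, crypto_bit)."""
    return label & ID_MASK, bool(label & AES_BIT), bool(label & CRYPTO_BIT)


def through_crypto_lib(label: int) -> int:
    """Applied by the instrumented library to inbound network taint."""
    return label | CRYPTO_BIT


def derived_from_sbox(label: int) -> int:
    """Applied when a value is computed from an s-box lookup."""
    return label | AES_BIT
```

With this layout a single byte distinguishes at most 64 sources, a limit that follows from the assumed field widths rather than from anything stated in the talk.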
RedFlag system overview Monitor sockets Inspect process Compose rules
Compose rules Taint-tracking gives three pieces of info ‣ Description of network source ‣ If data was routed through crypto library ‣ If data was derived from AES s-box Can use this to compose policies
Compose rules Same source ‣ Allow sensitive files to be copied back to their source ‣ Raise alert otherwise ‣ Generalize hostnames (e.g., *.google.com) Obfuscation vs. confidentiality ‣ Many P2P clients use crypto to obfuscate ‣ Aren’t trying to protect data so use weak algorithms ‣ (e.g., BitTorrent and LimeWire explicitly do not support AES) ‣ If ambiguous port + no AES, then ignore file
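The two rules above can be sketched as a small decision function. Everything here is an assumption made for illustration: the `FileTaint` record, the port set, the "last two labels" hostname generalization, and the `"allow"`/`"alert"` return values are not taken from RedFlag.

```python
from dataclasses import dataclass

# Ports treated as unambiguous crypto ports (examples from the talk).
WELL_KNOWN_CRYPTO_PORTS = {443, 993}


@dataclass(frozen=True)
class FileTaint:
    source_host: str
    source_port: int
    crypto_bit: bool  # data was routed through a crypto library
    aes_bit: bool     # data was derived from an AES s-box


def generalize(host: str) -> str:
    """Naive generalization: 'mail.google.com' -> '*.google.com'."""
    parts = host.split(".")
    if len(parts) < 2:
        return host
    return "*." + ".".join(parts[-2:])


def is_sensitive(taint: FileTaint) -> bool:
    if taint.source_port in WELL_KNOWN_CRYPTO_PORTS:
        return True
    # Ambiguous port: require strong crypto, not mere obfuscation.
    return taint.crypto_bit and taint.aes_bit


def decide(taint: FileTaint, dest_host: str) -> str:
    """Return 'allow' or 'alert' for sending a tainted file to dest_host."""
    if not is_sensitive(taint):
        return "allow"  # ambiguous port + no AES: ignore the file
    if generalize(dest_host) == generalize(taint.source_host):
        return "allow"  # same source
    return "alert"
```

The generalization is deliberately crude: it mishandles IP addresses and multi-part public suffixes such as `co.uk`, which a real policy composer would need to treat separately.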
RedFlag implementation Runs on Ubuntu 8.10 Modified Jockey for logging/replay ‣ Supports multi-threaded programs ‣ User-level thread library PIN tool for tainting ‣ Based on sequential taint tracker from Speck ‣ Modified to allow tainting during replay ‣ Implemented s-box search, crypto and AES bits in taint label
Evaluation Accuracy ‣ How well can RedFlag identify crypto libraries using s-boxes? ‣ How well does RedFlag categorize sensitive files? Performance ‣ Will asynchronous taint-tracking fall behind?
Identifying crypto libraries Looked at 10 Ubuntu programs ‣ checkgmail, thunderbird ‣ IM: pidgin ‣ P2P: Azureus, LimeWire, Skype, Transmission ‣ Web: Firefox, Opera, wget Successfully identified crypto libs in all ‣ Including custom implementations, plugins (flash player) ‣ Interesting case: Opera folds crypto into executable
Categorizing sensitive files Non-sensitive files ‣ Used Firefox ‣ Loaded 30 most popular websites (Alexa) ‣ RedFlag produced no false positives/negatives Sensitive files ‣ Downloaded 17 representative sensitive docs ‣ Firefox, thunderbird, pidgin
Categorizing sensitive files
Taint-tracking performance
Conclusions RedFlag automates policy specification ‣ Heuristic-based approach ‣ Monitor process behavior, not file content ‣ Sensitive files usually downloaded using crypto ‣ Deal with ambiguous ports using entropy scores, AES s-boxes Evaluation highlights ‣ Automatically identified crypto libraries ‣ Correctly categorized files in 45/47 scenarios ‣ No false positives, three false negatives ‣ Sufficient idle time in long-running process
Thanks! I’m happy to take questions