Dynamic information-flow tracking Landon Cox March 24, 2017
Information flow
  Crucial goal of a secure system
    Prevent inappropriate information flows
    Can model “appropriateness” with a lattice of tags
      i.e., only allow “low” objects to flow into “high” objects
    Non-interference := all flows are appropriate
  Information-flow analysis
    Helps track where sensitive data goes
    Getting this right is tricky
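A minimal sketch of the lattice check described above, assuming tags are sets of category names ordered by subset, so the empty set is "low" and a larger set is "higher". The names `can_flow`, `join`, `LOW`, and `HIGH` are illustrative and do not come from the lecture.

```python
# Tags are modeled as frozensets of category names (an assumption for this sketch).
LOW = frozenset()                 # public data
HIGH = frozenset({"password"})    # data derived from a password


def can_flow(src_tag: frozenset, dst_tag: frozenset) -> bool:
    """A flow is appropriate when the destination is at least as high as the source."""
    return src_tag <= dst_tag


def join(a: frozenset, b: frozenset) -> frozenset:
    """Least upper bound of two tags: the tag of anything computed from both."""
    return a | b


low_to_high = can_flow(LOW, HIGH)   # allowed
high_to_low = can_flow(HIGH, LOW)   # rejected
```

Under this model, non-interference amounts to `can_flow` holding for every flow the system performs.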
Information flow
  Building blocks
    Storage objects (information receptacles)
    Processes (move information to/from objects)
  Tracking information
    Tag (or label) describes information sensitivity
    Each storage object is assigned a tag
    Need to update tags as processes execute
Information flow
  Issue 1: precision
    Say that the storage object is an address space
    If process P reads sensitive data item D, P’s entire address space is tagged
  What must we assume about any of P’s outputs?
    Must assume that they contain sensitive information
  Which processes are allowed to communicate with P?
    Other processes that are allowed to read D
  Why is this problematic?
    Probably want P to communicate with processes that can’t access D
    Hard to do anything useful otherwise
Information flow
  Issue 1: precision
    Say that the storage object is an address space
    If process P reads sensitive data item D, P’s entire address space is tagged
  Example (diagram labels: SSH client, uid/pw, Declassifier, Password file)
      accept uid/pw;
      if (pw not in file) {
        return error;
      } else {
        fork/exec shell;
      }
  How do you solve this?
    Often use a trusted “declassifier”
    Small piece of code trusted to remove tags
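A hedged sketch of what a declassifier for the password check could look like. It assumes a hypothetical `Labeled` wrapper that carries a value together with its tags. The function `declassify_password_check` is the only code trusted to drop tags, and it releases a single bit (match or no match). `hmac.compare_digest` is the standard-library constant-time comparison.

```python
import hmac


class Labeled:
    """A value plus the set of tags describing its sensitivity (illustrative type)."""

    def __init__(self, value, tags=frozenset()):
        self.value = value
        self.tags = frozenset(tags)


def declassify_password_check(stored: Labeled, supplied: str) -> Labeled:
    """Trusted code: compares against the tagged secret and returns an untagged bool."""
    ok = hmac.compare_digest(stored.value.encode(), supplied.encode())
    return Labeled(ok, tags=frozenset())  # tags deliberately removed here


# Hypothetical password-file entry, tagged because it came from the password file.
stored = Labeled("hunter2", {"pwfile"})
result = declassify_password_check(stored, "hunter2")
```

Because only the small declassifier touches the tagged data, the rest of the login path can stay untagged and the forked shell does not inherit the password file's tag.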
Information flow
  Issue 1: precision
    Say that the storage object is an address space
    If process P reads sensitive data item D, P’s entire address space is tagged
  What else could we do to improve precision?
    Use finer-grained storage objects
    Tag program variables or memory words
  What are the implications for performance?
    Have to update tags much more frequently
      i.e., every time an instruction executes
    Can introduce a lot of overhead
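To make the precision difference concrete, here is a small sketch comparing the two granularities on the same workload. The functions `process_grained` and `variable_grained` and the sample data are assumptions for illustration, not code from any real tracker.

```python
def process_grained(reads, outputs):
    """One tag set for the whole address space: every output inherits it."""
    process_tag = frozenset().union(*(tags for _, tags in reads))
    return {name: process_tag for name in outputs}


def variable_grained(reads, outputs):
    """One tag set per variable: an output inherits only its own sources' tags."""
    var_tags = dict(reads)
    return {
        name: frozenset().union(*(var_tags[src] for src in sources))
        for name, sources in outputs.items()
    }


# The process reads sensitive item D and public item E, then emits two outputs.
reads = [("D", frozenset({"secret"})), ("E", frozenset())]
outputs = {"out1": ["D"], "out2": ["E"]}

coarse = process_grained(reads, outputs)
fine = variable_grained(reads, outputs)
```

With process-grained tags both outputs are marked secret, even though `out2` depends only on public data. With variable-grained tags only `out1` is marked, at the cost of bookkeeping on every assignment.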
Tracking explicit flows
  Propagate taint tags with data flows
    c ← a op b          taint(c) ← taint(a) ∪ taint(b)
  Example
    setTaint(a, t)      taint(a) ← {t}
    c = a + b           taint(c) ← {t} ∪ {} = {t}
    Send(c, foo.net)    Can foo.net see a?
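The propagation rule above can be sketched with a small wrapper type. `Tainted` and `send` are hypothetical names for this example, and only addition is implemented.

```python
class Tainted:
    """A value paired with its set of taint tags (illustrative type)."""

    def __init__(self, value, taint=frozenset()):
        self.value = value
        self.taint = frozenset(taint)

    def __add__(self, other):
        # c <- a op b  implies  taint(c) <- taint(a) U taint(b)
        if not isinstance(other, Tainted):
            other = Tainted(other)
        return Tainted(self.value + other.value, self.taint | other.taint)


def send(data, host, log):
    """Stand-in for a network sink: records what would leave and with which tags."""
    log.append((host, data.value, set(data.taint)))


a = Tainted(3, {"t"})   # setTaint(a, t)
b = Tainted(4)          # untainted
c = a + b               # taint(c) = {t} U {} = {t}

log = []
send(c, "foo.net", log)
```

In this sketch the sink sees the tag `t` on `c`, so the answer to the slide's question is that foo.net receives data derived from `a`, and the tracker can flag it.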
Information flow
  Issue 2: explicit vs implicit flows
    Two ways to propagate information
      Explicitly := direct transfer from one object to another
      Implicitly := indirect transfer, usually via control flow
    // a is sensitive
    void foo (int a) {
      int b, w, x, y, z;
      a = 11;
      b = 5;
      w = a * 2;
      x = b + 1;
      y = w + 1;
      z = x + y;
      print (z);
    }
  Each line is an explicit flow from source operands to destination operand
  Very easy to implement: just interpose on each instruction to update each var’s tag
Information flow
  Issue 2: explicit vs implicit flows
    // a is sensitive
    void foo (int a) {
      int x, y;
      if (a > 10) {
        x = 1;
      }
      y = 10;
      print (x);
      print (y);
    }
  Where is the implicit flow?
  How would you update x’s tag?
Information flow
  Issue 2: explicit vs implicit flows
    // a is sensitive
    void foo (int a) {
      int x, y;
      if (a > 10) {
        x = 1;
      } else {
        y = 10;
      }
      print (x);
      print (y);
    }
  What is tricky about this code?
Information flow
  Issue 2: explicit vs implicit flows
    // a is sensitive
    void foo (int a) {
      int x, y;
      if (a > 10) {
        baz (&x);
      } else {
        bar (&y);
      }
      print (x);
      print (y);
    }
  What is trickier about this code?
Information flow
  Issue 2: explicit vs implicit flows
    // a is sensitive
    void foo (int a) {
      int x, y;
      if (a > 10) {
        exit(0);
      } else {
        exit(1);
      }
      y = 10;
      print (x);
      print (y);
    }
  Where is the implicit flow here?
  How would you track this?
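One common way to approximate implicit-flow tracking is to keep a "control context" tag that is raised while executing code guarded by a tainted condition. The sketch below assumes that approach. The `Tracker` class and its method names are hypothetical, and it tracks only tags, not values.

```python
from contextlib import contextmanager


class Tracker:
    def __init__(self):
        self.taint = {}            # variable name -> frozenset of tags
        self.pc = [frozenset()]    # stack of control-context tags

    def _tags_of(self, names):
        return self.pc[-1].union(*(self.taint.get(n, frozenset()) for n in names))

    @contextmanager
    def branch(self, cond_vars):
        """Enter code guarded by a condition that reads cond_vars."""
        self.pc.append(self._tags_of(cond_vars))
        try:
            yield
        finally:
            self.pc.pop()

    def assign(self, dst, src_vars=()):
        """dst gets its sources' tags plus the current control-context tags."""
        self.taint[dst] = self._tags_of(src_vars)


def foo(a_value):
    t = Tracker()
    t.taint["a"] = frozenset({"secret"})   # a is sensitive
    if a_value > 10:
        with t.branch(["a"]):
            t.assign("x")                  # x = 1, inside the tainted branch
    t.assign("y")                          # y = 10, outside the branch
    return t.taint


taken = foo(11)
not_taken = foo(5)
```

When the branch is taken, `x` picks up the tag of `a`. When it is not taken, `x` is never assigned and stays untagged, yet printing it still reveals that `a` was at most 10. Observing only the executed path misses that leak, and the `exit(0)` / `exit(1)` variant leaks through the exit status without any assignment at all.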
Hidden channels
  Get system to communicate in unintended ways
  Example: TENEX (supposedly secure OS)
    Created a team to break in
    Team had all passwords within 48 hours … oops.
  Goal: require 256^8 tries to see if password is right
  Password checker
    for (i=0; i<8; i++) {
      if (input[i] != password[i]) {
        break;
      }
    }
Hidden channels: TENEX
  Password checker
    for (i=0; i<8; i++) {
      if (input[i] != password[i]) {
        break;
      }
    }
  How to break? (user passes in input buffer, virtual memory faults are visible)
    Specially arrange the input’s layout in memory
    Force a page fault if the second character is read
    If you get a fault, the first character was right
    Do again for third, fourth, … eighth character
    Can check the password in at most 256*8 = 2,048 tries
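The attack can be simulated in a few lines. This is a sketch under stated assumptions: `GuardedBuffer` models an input buffer whose bytes past a chosen boundary sit on an unmapped page, `check` mirrors the slide's byte-by-byte loop, and `SECRET` is a made-up 8-byte password.

```python
class PageFault(Exception):
    """Raised when the checker touches a byte on the unmapped page."""


class GuardedBuffer:
    """Input buffer whose bytes at index >= boundary lie on an unmapped page."""

    def __init__(self, data, boundary):
        self.data = data
        self.boundary = boundary

    def __getitem__(self, i):
        if i >= self.boundary:
            raise PageFault(i)
        return self.data[i]


def check(inp, password):
    """Byte-by-byte comparison that stops at the first mismatch."""
    for i in range(len(password)):
        if inp[i] != password[i]:
            return False
    return True


def crack(oracle, length=8):
    known = bytearray()
    tries = 0
    for k in range(length):
        for g in range(256):
            tries += 1
            guess = bytes(known) + bytes([g]) + bytes(length - k - 1)
            buf = GuardedBuffer(guess, boundary=k + 1)
            try:
                ok = oracle(buf)
            except PageFault:
                ok = True  # checker went on to byte k+1, so byte k matched
            if ok:
                known.append(g)
                break
    return bytes(known), tries


SECRET = b"s3cr3t!!"  # hypothetical 8-byte password
recovered, tries = crack(lambda buf: check(buf, SECRET))
```

Each position costs at most 256 guesses, so the total is bounded by 256*8 rather than 256^8. For the last byte there is no following read to fault on, so the sketch relies on the checker's return value instead.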
Course administration
  Project proposals
    Due today (ok if you send it to me by Monday)
    Guidelines in the syllabus
    One page should be fine
  Amount of work
    Three weeks of effort
    Focus on answering one interesting question
High-level overview of today’s modern phone-based system
  Cloud: large-scale analysis, collection, dissemination
  Mobile: present at work, home, and play
  Sensors: rich, personal data
  Devices place computation, communication, and sensing at the heart of nearly all human activity.
  Sensors have access to lots of rich, personal data.
  Connectivity to the cloud allows users to participate in large-scale services that make use of this rich, personal data.
App-centric operating systems
  Apps access sensitive information in many contexts
    Location, images, and communication
    Home, work, and play
  Apps run on behalf of many stakeholders
    Users, services, developers, platform providers, advertisers
  How do we manage apps instead of users?
Monitoring app behavior Permissions are coarse. No insight into what is collected and by whom.
Consumer: “Why is my wallpaper app sending my phone number to another country?” http://blog.mylookout.com/2010/07/mobile-application-analysis-blackhat/
Enterprise: “Who is collecting information about our workers?”
Wider interest in the issue Earlier http://online.wsj.com/article/SB20001424052748703806304576242923804770968.html
Emerging malware threat
  (Charts: new mobile malware [1]; new mobile malware families or variants [2])
  [1] McAfee Threats Report: Q1 2012 - http://www.mcafee.com/us/resources/reports/rp-quarterly-threat-q1-2012.pdf
  [2] F-Secure Mobile Threat Report Q1 2012 - http://www.f-secure.com/weblog/archives/MobileThreatReport_Q1_2012.pdf
Where does data go after you grant access?
Monitoring goals
  Monitor where apps send data
    What happens after you grant access?
    Is observed behavior expected?
  Monitor apps at runtime
    Want users to monitor their own apps
    Must balance accuracy and efficiency
  Solution: TaintDroid
    Original collaboration with Penn State, Intel
    Will Enck (NCSU), Jaeyeon Jung (Samsung), others
Taint tracking
  TaintDroid: system-wide taint tracking for Android
    Records “explicit” data dependencies via taint tags
    Does not capture “implicit” data dependencies
  Three steps
    Tag data as it enters the app
    Track how information propagates
    Check tags of emitted data
Taint tracking
  TaintDroid: system-wide taint tracking for Android
    Records “explicit” data dependencies via taint tags
    Does not capture “implicit” data dependencies
  Key issues for tag propagation
    How are tags stored?
    What is the tag-propagation logic?
    Is tracking precise and efficient?
  Project website: http://appanalysis.org
Tag propagation
  Goal: balance precision and efficiency
  (Chart axes: fast to slow, imprecise to precise)
    Process-grained: fast but imprecise (all outputs tainted)
    Instruction-grained: precise but slow (2-20x overhead)
    Ideal: fast and precise
Multi-level approach
  Variable-level tracking through Dalvik VM (DEX instructions)
  Patch state after native method invocation
  Extend tracking to IPC and file system
  (Diagram labels: two stacks of application code on a Dalvik VM exchanging a msg, native system libraries, network, file system)
    Message-level tracking: msg between applications
    Variable-level tracking: Dalvik VM
    Method-level tracking: native system libraries
    File-level tracking: file system
Variable-level tracking Tag-propagation logic for Dalvik executables (DEX)
Variable-level tracking
  Modified Dalvik VM
    Store and propagate 32-bit tags
  Local vars and args
    Store tags adjacent to vars on stack
    Correspond to VM registers
    64-bit vars require two tags
  Class fields
    Store tags inside heap objects
  Arrays
    One tag per array
    Trade precision for efficient storage
  Performance optimizations
    Per-variable tags reduce storage overhead
    Adjacent tags provide spatial locality
  (Stack diagram, each slot followed by its taint tag: out0, out0 taint tag, out1, out1 taint tag, an unused slot at SP, VM goop, then at FP: v0 == local0, v0 taint tag, v1 == local1, v1 taint tag, v2 == in0, …, v4 taint tag)
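A rough model of the storage layout described above, written as a sketch rather than as TaintDroid's actual data structures. The bit assignments in `Tag`, and the `Frame` and `TaintedArray` classes, are assumptions made for illustration.

```python
from enum import IntFlag


class Tag(IntFlag):
    """Taint tag as a bit vector that fits in 32 bits (bit positions are made up)."""
    NONE = 0
    LOCATION = 1 << 0
    IMEI = 1 << 1
    MIC = 1 << 2


class Frame:
    """Registers and tags interleaved: slots[2*i] is v_i, slots[2*i + 1] is its tag."""

    def __init__(self, nregs):
        self.slots = [0] * (2 * nregs)

    def set(self, i, value, tag=Tag.NONE):
        self.slots[2 * i] = value
        self.slots[2 * i + 1] = int(tag) & 0xFFFFFFFF

    def get(self, i):
        return self.slots[2 * i]

    def tag(self, i):
        return Tag(self.slots[2 * i + 1])

    def add(self, dst, a, b):
        """v_dst = v_a + v_b, with the tag being the bitwise OR (set union)."""
        self.set(dst, self.get(a) + self.get(b), self.tag(a) | self.tag(b))


class TaintedArray:
    """One tag for the whole array: a tainted store taints every later load."""

    def __init__(self, n):
        self.items = [0] * n
        self.tag = Tag.NONE

    def store(self, i, value, tag=Tag.NONE):
        self.items[i] = value
        self.tag |= tag

    def load(self, i):
        return self.items[i], self.tag


frame = Frame(3)
frame.set(0, 5, Tag.LOCATION)
frame.set(1, 7, Tag.IMEI)
frame.add(2, 0, 1)

arr = TaintedArray(4)
arr.store(0, 1, Tag.MIC)
```

Keeping each tag next to its register means a propagation step touches memory that is likely already cached. The single array tag saves space but makes `arr.load(3)` report the microphone tag even though only element 0 was tainted.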
Method-grained tracking
  Huge opportunity for performance gains
    JNI code is often CPU intensive
  Challenge for method-grained tracking
    In worst case, must manually reason about side-effects
    Luckily, a very simple heuristic works most of the time
    class java.lang.Math {
      public static double cos (double d);
    }
Method-grained tracking
  Tainting heuristic
    “Assign union of arguments’ tags to return value on exit.”
    Most JNI methods have no side effects
    Many JNI methods operate on native types
  When it doesn’t work, use method profiles
    Generic framework for defining argument/retval dependencies
    So far, only needed to define for IBM charset converter
    See paper for more details …
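The heuristic is simple enough to sketch directly. `call_native` is a hypothetical wrapper, arguments are passed as `(value, tags)` pairs, and `math.cos` stands in for the native `cos` method on the slide.

```python
import math


def call_native(fn, *args):
    """Run an opaque native function; tag the result with the union of argument tags.

    Each argument is a (value, tags) pair. Nothing inside fn is tracked.
    """
    values = [value for value, _ in args]
    tags = frozenset().union(*(t for _, t in args))
    return fn(*values), tags


# A location-derived angle passed to a side-effect-free native method.
result, result_tags = call_native(math.cos, (0.0, frozenset({"location"})))

# Two arguments with different tags: the return value carries both.
_, mixed_tags = call_native(math.pow, (2.0, frozenset({"imei"})), (3.0, frozenset({"mic"})))
```

This works for methods like `cos` that only read primitive arguments and return a value. A method that writes through an object reference would need an explicit profile describing which arguments taint which others, which this sketch does not model.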
Method-grained tracking
  Found 2,844 JNI methods in Android source
    913 did not use Object references
    Others could induce false negatives
  Third-party JNI is not supported
    Apps must be written entirely in Java
    Survey of Android Market: ~25% used a .so file
    Subject of ongoing research
Evaluation
  Is TaintDroid fast and precise?
  (Chart axes: fast to slow, imprecise to precise)
    Process-grained: imprecise (all outputs tainted)
    Instruction-grained: precise but slow (2-20x overhead)
    TaintDroid: where does it fall?
Performance evaluation
  (Benchmark chart, higher is better; callouts: 14% overhead, and 20% overhead from extra memory accesses)
  Not shown: 4.4% memory overhead
Performance evaluation
  (Benchmark chart, higher is better)
  Reasons for efficiency
    (1) Method-grained tracking of JNI calls
    (2) Spatial locality of taint tags
    (3) One tag per array
App study
  Selected 30 apps from Android Market
    Biased toward popular apps
    Sampled from 12 categories
  App permissions
    Access to Internet
    Access to location, camera, phone state, mic
  No native libraries
  Ran apps manually under TaintDroid
App study Of 105 flagged connections, only 37 to expected servers
App study: location
  15 of 30 apps shared location with an ad server
    admob.com, ad.qwapi.com, ads.mobclix.com, data.flurry.com
  Most traffic was plaintext (e.g., AdMob HTTP GET)
    ...&s=a14a4a93f1e4c68&..&t=062A1CB1D476DE85B717D9195A6722A9&d%5Bcoord%5D=47.661227890000006%2C-122.31589477&...
  data.flurry.com used binary format
  In no cases were users informed by EULA
  In one case, app sent location every 30 seconds
App study: phone identifiers
  7 apps sent device id (IMEI)
  2 apps sent phone info (Ph. #, IMSI*, ICC-ID)
  Done without informing the user
    One app’s EULA indicated the IMEI was sent
    Another app sent the hash of the IMEI
  Frequency was app-specific
    One sent info every time the phone booted
appanalysis.org
  Source code available: http://appanalysis.org/
    Most recent version is for Android 4.3
  Great platform for research
    Compatible with vast majority of Android apps
    Playground for all kinds of information-flow projects
  Video demo by Peter Gilbert
TaintDroid demo http://www.youtube.com/watch?v=qnLujX1Dw4Y
Media coverage Earlier
Limitations
  Implicit flows
    Fundamentally difficult problem
    Can handle passwords (SpanDex, USENIX Sec)
  Native code
    Ongoing work
    Talk to Ali!