Protecting Host Systems from Imperfect Hardware Accelerators Lena E. Olson PhD Final Defense August 17th, 2016
2 Accelerators are increasingly popular… Good for performance, energy-efficiency, programmability, exciting new applications…. What can we do if they’re imperfect?
Executive Summary Motivation What are accelerators? Why are they popular? What does imperfect mean? I. Accelerator Security Taxonomy Define threat landscape Anticipate threats rather than fixing one by one 3
Executive Summary II. Border Control Protects host memory from stray reads/writes III. Crossing Guard Protects host from coherence protocol violations Eases accelerator development 4
Overview 5 Motivation I. Accelerator Security Taxonomy II. Border Control III. Crossing Guard
What is an accelerator? Broadly: Specialized hardware that can perform a subset of computation tasks with higher performance and/or lower energy than a CPU 6
Types of Accelerators 7 A9, from SoCs, soft-IP accelerators FPGA accelerators IBM CAPI CCIX
Example Accelerators 8 Lots of (GP)GPU papers!
However… What if accelerator hardware is imperfect? Due to bugs? Due to malicious design? 9
Overview 10 Motivation I. Accelerator Security Taxonomy II. Border Control III. Crossing Guard
I. Accelerator Security Taxonomy Security Implications of Third-Party Accelerators* Lena E. Olson, Simha Sethumadhavan, Mark D. Hill 11 *CAL, June 2016
Motivating Example: GPU leaks Guess which website left this data in the GPU texture memory? 12 “Stealing Webpages Rendered on Your Browser by Exploiting GPU Vulnerabilities”, Lee et al. (Oakland ’14)
Why a taxonomy? Could discover and fix threats one by one Hard to patch existing hardware Doesn’t fix root problem Taxonomy provides a framework What classes of threats are there? Where are they coming from? How to prevent them? 13
Threat Scope Accelerator Scope Only affect processes running on accelerator Example: GPU leaks data between processes Challenge: Cannot fix accelerator internals Defense: Don’t run sensitive process on untrusted accelerator System Scope Can affect processes not running on accelerator Example: Bad access to system memory Challenge: Affects unrelated processes! Defense: Good system / interface design 14
Security: CIA model 3 considerations for security Confidentiality: Can someone steal data? Integrity: Do we get the right answer? Availability: Can we use the resource? Integrity & Availability also important for reliability! 15
Accelerator Risk Categories 16 Configuration, Computation, Termination Access to {accelerator, host} memory Microarchitectural commands, Exceptions/interrupts Power
Threat Matrix: Accelerator Scope 17
Key: known exploit
Columns: Confidentiality | Integrity | Availability
Configuration: Side-channel, kleptography | Kleptography, wrong output | Lock up accelerator
Computation: Side-channel, kleptography | Kleptography, wrong output | Lock up accelerator
Termination: Failure to clear registers / memory / cache | Stale data in registers / memory / cache | Fail to release resources
Accel. Memory: Bad access; Evict others
System Memory: Side-channel
µarch Commands: Inconsistent (stale) data
Exceptions: Side-channel
Power: Power analysis attacks; Excessive heat; Unreliability; Excessive heat damage
Threat Matrix: System Scope 18
Columns: Confidentiality | Integrity | Availability
Configuration: Incorrect registers (e.g. CR3); Incorrect registers
Computation:
Termination: Stale translations; Fail to release resources
Accel. Memory:
System Memory: Bad access; Saturate bandwidth, cause swapping
µarch Commands: Snoop on coherence traffic; ignored invalidations | Ignore invalidations | Excessive / ignored coherence requests
Exceptions: Spurious exceptions / interrupts
Power: Excessive heat
Example Defenses Reset accelerator upon termination Limits performance; non-volatile memory? ARM TrustZone Coarse-grained: trusted vs. untrusted Protection at interfaces 19
Our Focus Accelerators that Share unified virtual memory with host Share unified physical memory with host May participate in coherence with host …but, which are less trusted than the CPU Or, which don’t need full access to everything! If compromised, can affect the host memory, not just processes running on accelerator! 20
Two Memory Access Threats Accesses to invalid addresses Wild writes Reads to sensitive data Effectively, allow full access to host system! Our solution: II. Border Control Incorrect accelerator coherence protocols Incorrect messages Deadlocks Denial of service attacks Our solution: III. Crossing Guard 21
Overview 22 Motivation I. Accelerator Security Taxonomy II. Border Control III. Crossing Guard
Border Control: Sandboxing Accelerators* Lena E. Olson, Jason Power, Mark D. Hill, David A. Wood 23 *MICRO, December 2015
Threat Model Protect host from incorrect or malicious accelerators that could perform stray reads, violating confidentiality stray writes, violating integrity of host processes that do and do NOT run on the accelerator 24 Question: Which accesses are stray?
Principle of Least Privilege 25 "Every program and every user of the system should operate using the least set of privileges necessary to complete the job. Primarily, this principle limits the damage that can result from an accident or error." (Jerome Saltzer) The Border Control authors apply the same principle to every hardware component.
Accelerator Access Permissions What permissions should an accelerator have? NOT to OS data NOT to sensitive data from other processes Principle of Least Privilege: to what it needs Access to addresses corresponding to process it is currently running These can be found in the page table We will use page permissions (like prior work) 26
Example System 27 [Diagram labels: CPU, caches ($$), MMU, TLB, accelerators, memory or shared LLC. Legend: trusted data path, untrusted data path, address translation path, translation update path. Questions posed on the diagram: Address translation? Security?]
Full IOMMU 28 [Diagram labels: CPU, caches ($$), MMU, TLB, accelerator, full IOMMU, memory or shared LLC. Legend: trusted data path, untrusted data path, address translation path, translation update path.]
Full IOMMU Challenges 29 IOMMU’s Address Translation Service (ATS) translates every memory reference to the host. + Protection - Translation latency - Bandwidth - Synonyms in virtual caches? - Coherence? Can add (physical) caches and TLB…
Bypassable IOMMU (Baseline) 30 [Diagram labels: CPU, caches ($$), MMU, TLB, accelerators with their own caches ($$) and TLBs, IOMMU, memory or shared LLC holding OS memory (Q) and process memory (P). Legend: trusted data path, untrusted data path, address translation path, translation update path. Requests shown: memory request with virtual addr = V; memory request with phys. addr = P.]
Bypassable IOMMU (Baseline) 31 [Same diagram, with an additional memory request with phys. addr = Q.]
We can’t remove the caches and TLBs Too slow! Why not use trusted design for caches and TLBs? So… caches are the problem? 32
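CAPI-like 33 [Diagram labels: CPU, caches ($$), MMU, TLB, accelerators, caches ($$) and TLBs on the accelerator path, IOMMU, memory or shared LLC holding OS memory (Q) and process memory (P). Legend: trusted data path, untrusted data path, address translation path, translation update path. Question posed on the diagram: Cache access latency?]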
Summary Comparison 34
                      | Full IOMMU | Bypassable IOMMU | CAPI-like
TLB + Caches?         | No         | Yes              | Slow
Customizable Caches?  | No         | Yes              | No
Safe?                 | Yes        | No               | Yes
Border Control 35 [Diagram labels: CPU, caches ($$), MMU, TLB, accelerators with their own caches ($$) and TLBs, IOMMU, Border Control, memory or shared LLC holding OS memory (Q) and process memory (P). Legend: trusted data path, untrusted data path, address translation path, translation update path.]
Border Control 36 [Same diagram. Requests shown: memory request with virtual addr = V; memory request with phys. addr = P.]
Border Control 37 [Same diagram. Request shown: memory request with phys. addr = Q.]
Border Control: Implementation One Border Control instance per accelerator Protection Table In system memory Contains all needed permissions by PPN Sufficient for correct design 0.006% physical memory overhead Border Control Cache (BCC) Caches recent permissions A 64 byte entry covers 512 4KB pages 38
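The 0.006% figure follows directly from the table layout: 2 permission bits for every 4 KB (32,768-bit) page. A minimal sketch of that arithmetic is below; the function names and the 16 GiB example size are illustrative assumptions, not values from the slides.

```python
PAGE_SIZE_BYTES = 4096  # 4 KB pages, as stated on the slide
BITS_PER_PAGE = 2       # one read bit and one write bit per physical page


def protection_table_overhead(page_size_bytes=PAGE_SIZE_BYTES,
                              bits_per_page=BITS_PER_PAGE):
    """Fraction of physical memory consumed by the Protection Table."""
    return bits_per_page / (page_size_bytes * 8)


def protection_table_bytes(physical_memory_bytes,
                           page_size_bytes=PAGE_SIZE_BYTES,
                           bits_per_page=BITS_PER_PAGE):
    """Table size in bytes for a given amount of physical memory."""
    pages = physical_memory_bytes // page_size_bytes
    return (pages * bits_per_page + 7) // 8  # round up to whole bytes


if __name__ == "__main__":
    print(f"overhead: {protection_table_overhead() * 100:.4f}%")
    # Hypothetical 16 GiB host, chosen only to make the number concrete.
    print(f"table for 16 GiB: {protection_table_bytes(16 * 2**30)} bytes")
```

Under these assumptions a 16 GiB host would need a 1 MiB table per accelerator.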
Protection Table Design 39 Flat physically indexed table in memory 2 bits (R/W) per physical page Initialized to 0 (no permission) Lazily updated on IOMMU translation Checked on all accelerator memory requests What about execute permission?
Example entries:
PPN | R | W
... |   |
N-4 | 0 | 0
N-3 | 1 | 0
N-2 | 1 | 0
N-1 | 0 | 0
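The table behavior described on this slide (flat, indexed by physical page number, 2 bits per page, zero-initialized, filled lazily on translation, consulted on every accelerator request) can be sketched in software as follows. This is only an illustration of the data structure, not the hardware design; the class name, method names, and bit encoding are assumptions.

```python
class ProtectionTable:
    """Flat table indexed by physical page number (PPN), 2 bits per page.

    Illustrative software model only; names and bit encoding are assumed.
    """
    READ = 0b10
    WRITE = 0b01

    def __init__(self, num_pages, page_size=4096):
        self.num_pages = num_pages
        self.page_size = page_size
        # All zeros means no permission for any page.
        self.bits = bytearray((num_pages * 2 + 7) // 8)

    def _locate(self, ppn):
        if not 0 <= ppn < self.num_pages:
            raise IndexError("PPN out of range")
        return divmod(ppn * 2, 8)  # (byte index, bit offset within byte)

    def _get(self, ppn):
        byte, shift = self._locate(ppn)
        return (self.bits[byte] >> shift) & 0b11

    def grant(self, ppn, perms):
        """Lazy update: called when the IOMMU completes a translation."""
        byte, shift = self._locate(ppn)
        self.bits[byte] |= (perms & 0b11) << shift

    def revoke(self, ppn):
        """Clear both bits, e.g. on a permission downgrade."""
        byte, shift = self._locate(ppn)
        self.bits[byte] &= ~(0b11 << shift) & 0xFF

    def check(self, paddr, is_write):
        """Consulted on every accelerator memory request."""
        ppn = paddr // self.page_size
        if not 0 <= ppn < self.num_pages:
            return False  # addresses outside the table are never allowed
        needed = self.WRITE if is_write else self.READ
        return bool(self._get(ppn) & needed)
```

For example, after `table.grant(5, ProtectionTable.READ)`, a read of physical address `0x5000` passes the check while a write to the same page, or any access to a page that was never translated, is blocked.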
Summary Comparison 40
                      | Full IOMMU | Bypassable IOMMU | CAPI-like | Border Control
TLB + Caches?         | No         | Yes              | Slow      | Yes
Customizable Caches?  | No         | Yes              | No        | Yes
Safe?                 | Yes        | No               | Yes       | Yes
EVALUATION: GPGPU accelerator as safety stress-test, gem5-gpu, Rodinia benchmarks
Border Control Overheads 41 Takeaway: On average 0.48% performance overhead vs. unsafe Moderately-Threaded GPU
II. Border Control Summary Bad addresses blocked: check! 2 bits / (4KB) page = 0.006% space overhead Could be optimized further On average, 0.48% (moderately threaded) performance overhead What about bad coherence messages? 42
Overview 43 Motivation I. Accelerator Security Taxonomy II. Border Control III. Crossing Guard
III. Crossing Guard Mediating Host-Accelerator Coherence Interactions* Lena E. Olson, Mark D. Hill, David A. Wood 44 *Currently under submission
Threat Model Protect host from incorrect or malicious accelerators that could perform stray reads, violating confidentiality stray writes, violating integrity incorrect coherence activity, violating availability of host processes that do and do NOT run on the accelerator 45
Crossing Guard Goals 1. Allow accelerators customized caches 2. Simple, standardized coherence interface Work with many diverse host protocols 3. Provide safety for the host system No unexpected messages No deadlocks 46
1. Why Customize Caches? CPU caches have to work with all workloads Accelerators may only run some workloads! Streaming? More prefetching. GPGPUs? Relax coherence between GPU cores. Etc…. 47
2. Why Simple Interface? Redesigning for each host is too much work Intel, AMD, ARM, IBM, Oracle… CCIX shows companies care! Host protocols may be proprietary Host protocols are complex! 48
2. Why Simple Interface? 49 (Transition table in style of Sorin et al.)
3. Why Host Safety? 50 [Diagram: directory, accelerator cache (#0), cache #1, cache #2, with the accelerator and CPU sides marked. The directory entry for block A is in state S with sharers 1 and 2.]
3. Why Host Safety? 51 [Same diagram, now with an Ack message for block A; the outcome is marked "? ? ?".]
3. Why Host Safety? 52 [Same components. The directory entry for block A is in state MT with owner 0 (the accelerator cache, which holds A in M). The directory sends Inv (Req: dir) and its entry moves to MT_I.]
Crossing Guard Overview Hardware implemented in trusted host Implements simple, standard interface complex enough to allow hierarchical protocol works with range of host protocols safe for host maintains Border Control protections Moves protocol complexity into XG hardware Only implemented once per host system By experts! 53
1. Customize Caches 54 Designed + implemented two sample systems. Shown: private per-core L1 at accelerator. [Diagram labels: Accel L1, XG, CPU L1, Host Directory / L2.]
1. Customize Caches 55 Designed + implemented two sample systems. Shown: private L1s + shared L2 at accelerator. [Diagram labels: Accel L1, Accel L2, XG, CPU L1, Host Directory / L2.]
2. Simple Interface 56
Accelerator → Host requests: GetS, GetM, PutS, PutE, PutM
Host → Accelerator responses: DataS, DataE, DataM, Writeback Ack
Host → Accelerator requests: Invalidate
Accelerator → Host responses: InvAck, Clean Writeback, Dirty Writeback
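The full message vocabulary on this slide is small enough to write down as four enumerations, one per direction and role. The sketch below does only that; the Python class and function names are hypothetical, and only the message names come from the slide.

```python
from enum import Enum


class AccelRequest(Enum):
    """Requests sent from the accelerator to the host."""
    GET_S = "GetS"
    GET_M = "GetM"
    PUT_S = "PutS"
    PUT_E = "PutE"
    PUT_M = "PutM"


class HostResponse(Enum):
    """Responses sent from the host to the accelerator."""
    DATA_S = "DataS"
    DATA_E = "DataE"
    DATA_M = "DataM"
    WRITEBACK_ACK = "Writeback Ack"


class HostRequest(Enum):
    """Requests sent from the host to the accelerator."""
    INVALIDATE = "Invalidate"


class AccelResponse(Enum):
    """Responses sent from the accelerator to the host."""
    INV_ACK = "InvAck"
    CLEAN_WRITEBACK = "Clean Writeback"
    DIRTY_WRITEBACK = "Dirty Writeback"


def sent_by_accelerator(message):
    """True if the message type originates at the accelerator."""
    return isinstance(message, (AccelRequest, AccelResponse))
```

Keeping the direction in the type makes it easy for a checker to reject, say, a `DataM` that claims to come from the accelerator.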
2. Simple Interface 57 Single-level Accelerator Cache using Crossing Guard Interface
2. Simple Interface 58 Implemented Crossing Guard interface to two host protocols: AMD Hammer-like (exclusive MOESI) and MESI Inclusive. Modularity: host and accelerator protocol choices are independent.
2. Simple Interface 59 [Animated diagram of a GetM for block A. Components: directory, accelerator cache, cache #0, cache #1, cache #2. Messages shown: GetM, Inv (Req: 0), Ack, Data (Acks: -2), DataM, UnblockM. A table with Addr, State, Acks, Reqs, and Timer columns steps through I, IM, SM, and M. The directory entry for A moves from S (sharers 1, 2) through SM_MB to M (owner 0). The accelerator cache entry passes through a transient state B before reaching M.]
2. Simple Interface 60 [Same diagram, with the accelerator cache entry in transient state IM instead of B and without the Inv (Req: 0) message.]
3. Host Safety 61 [Diagram: the earlier Ack scenario. Components: directory, accelerator cache, cache #0, cache #1, cache #2. The directory entry for A is in state S (sharers 1, 2); an Ack message is shown; a table with Addr, State, Acks, Reqs, and Timer columns shows A in state I.]
3. Host Safety 62 [Animated diagram: the earlier invalidation scenario. The directory entry for A starts in MT (owner 0), moves to MT_I after sending Inv (Req: dir), and then to WB after Data arrives. The table with the Timer column shows A moving from M to MI (Acks 0, Reqs dir, Timer 1210) and then to I. Time labels: 200, 210, 500, 1000, 1500.]
3. Host Safety Crossing Guard Guarantees to Host: 1. Accelerator requests must be correct a) Consistent with block stable state b) Consistent with block transient state 2. Accelerator responses must be correct a) Consistent with block stable state b) Consistent with block transient state c) Within a reasonable time 63 ( + Border Control Protections!)
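To make these guarantees concrete, here is a minimal sketch of the kind of checks involved: a request is compared against the stable state recorded for the block, and a response to an Invalidate must both match that state and arrive before a deadline. The legality tables, the timeout value, and all names are illustrative assumptions, not the actual Crossing Guard transition tables; transient-state checks (1b, 2b) are omitted for brevity.

```python
# Assumed legality tables for illustration; not taken from the slides.
LEGAL_REQUESTS = {
    "I": {"GetS", "GetM"},
    "S": {"GetM", "PutS"},
    "E": {"PutE"},
    "M": {"PutM"},
}
LEGAL_INV_RESPONSES = {
    "I": {"InvAck"},
    "S": {"InvAck"},
    "E": {"Clean Writeback", "Dirty Writeback"},
    "M": {"Dirty Writeback"},
}


class GuardSketch:
    """Toy checker for accelerator requests and responses (hypothetical)."""

    def __init__(self, timeout=1000):
        self.timeout = timeout
        self.stable = {}        # addr -> "I"/"S"/"E"/"M"; absent means "I"
        self.inv_deadline = {}  # addr -> time by which Invalidate must be answered

    def request_ok(self, addr, request):
        """Guarantee 1a: request is consistent with the block's stable state."""
        return request in LEGAL_REQUESTS[self.stable.get(addr, "I")]

    def host_invalidate(self, addr, now):
        """Host sends Invalidate; start the response timer."""
        self.inv_deadline[addr] = now + self.timeout

    def response_ok(self, addr, response, now):
        """Guarantees 2a and 2c: response matches state and arrives in time."""
        deadline = self.inv_deadline.get(addr)
        if deadline is None or now > deadline:
            return False  # unsolicited, or too late
        if response not in LEGAL_INV_RESPONSES[self.stable.get(addr, "I")]:
            return False
        del self.inv_deadline[addr]
        self.stable[addr] = "I"
        return True

    def timed_out(self, now):
        """Blocks whose Invalidate was not answered by the deadline."""
        return sorted(a for a, d in self.inv_deadline.items() if now > d)
```

A real design also has to decide what to tell the host when `timed_out` reports a block; this sketch only detects the condition.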
Crossing Guard Variants Full State Crossing Guard Inclusive directory of accelerator state + Places few restrictions on host protocol + Can hide all errors - Requires tag + metadata storage for all blocks Transactional Crossing Guard Stores only data for in-flight transactions + Small storage + Provides most safety properties - Requires some host tolerance 64
Evaluation 1. Does it provide coherence to correct accelerator? 2. Does it provide safety to host? 3. Does it allow high performance? 65
Correctness Testing Are coherence invariants maintained when the accelerator is acting correctly? How? Random tester Store-Load pairs to random addresses Check integrity of data Local coverage: > 99% 66
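The random-tester idea can be sketched in a few lines: store a random value to a random address from one requester, load it back from another, and flag any mismatch. The `DictMemory` stand-in and all names below are hypothetical; the tester in the talk drives a simulated cache hierarchy rather than a Python dictionary.

```python
import random


class DictMemory:
    """Trivial stand-in for the system under test (assumed interface)."""

    def __init__(self):
        self.data = {}

    def store(self, requester, addr, value):
        self.data[addr] = value

    def load(self, requester, addr):
        return self.data.get(addr, 0)


def random_store_load_test(memory, num_pairs=1000, num_addresses=64,
                           num_requesters=4, seed=0):
    """Issue store-load pairs to random addresses and check data integrity."""
    rng = random.Random(seed)
    for _ in range(num_pairs):
        addr = rng.randrange(num_addresses)
        value = rng.getrandbits(32)
        writer = rng.randrange(num_requesters)
        reader = rng.randrange(num_requesters)
        memory.store(writer, addr, value)
        if memory.load(reader, addr) != value:
            return False  # a load observed stale or corrupted data
    return True
```

Using a small address range on purpose forces many requesters to contend for the same blocks, which is where coherence bugs tend to show up.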
Fuzz Testing Is host safety maintained when accelerator misbehaves? How? Replace accelerator cache with evil controller Generates random coherence messages to random addresses Desired outcome: No deadlocks / crashes Local Coverage: > 99.3% 67
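A minimal sketch of the fuzzing loop: an "evil controller" emits random accelerator-side messages to random addresses, and the harness counts how often the handler under test fails. The message names come from the interface slide; the handler interface, function names, and the use of exceptions to model crashes are assumptions for illustration.

```python
import random

# Accelerator-to-host message types from the interface slide.
ACCEL_MESSAGES = [
    "GetS", "GetM", "PutS", "PutE", "PutM",
    "InvAck", "Clean Writeback", "Dirty Writeback",
]


def evil_controller(num_messages, num_addresses=16, seed=0):
    """Yield random (message, address) pairs with no regard for protocol."""
    rng = random.Random(seed)
    for _ in range(num_messages):
        yield rng.choice(ACCEL_MESSAGES), rng.randrange(num_addresses)


def fuzz(handler, num_messages=10_000, seed=0):
    """Feed random messages to handler(message, addr); return crash count.

    Any exception raised by the handler is counted as a crash. Deadlock
    detection is out of scope for this sketch.
    """
    crashes = 0
    for message, addr in evil_controller(num_messages, seed=seed):
        try:
            handler(message, addr)
        except Exception:
            crashes += 1
    return crashes
```

The desired outcome from the slide corresponds to `fuzz(handler)` returning 0 for the host-side handler being tested.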
Performance Testing Tertiary concern, but cannot degrade performance too much gem5-gpu Rodinia workloads CAVEATS: Immaturity of workloads / infrastructure Directly comparing coherence protocols hard General trends only! 68
Performance (Hammer-like) 69
Performance: MESI Inclusive 70
III. Crossing Guard Summary Provides simple, standardized interface to ease accelerator development Correctness when accelerator is correct Host safety when accelerator is incorrect Low performance overhead 71
Overview 72 Motivation I. Accelerator Security Taxonomy II. Border Control III. Crossing Guard
Publications “Crossing Guard: Mediating Host-Accelerator Coherence Interactions” Olson, Hill, Wood (under submission) “Border Control: Sandboxing Accelerators” Olson, Power, Hill, Wood (MICRO 2015) “Security Implications of Third-Party Accelerators” Olson, Sethumadhavan, Hill (CAL 2016) “Probabilistic Directed Writebacks for Exclusive Caches” Olson, Hill (TR 2016) “Revisiting Stack Caches for Energy Efficiency”, Olson, Eckert, Manne, Hill (TR 2014) 73
Conclusion 74 Accelerators raise new security questions We can design secure interfaces To prevent bad memory accesses To prevent coherence bugs To ease accelerator development At low overhead, so people might use them!
Questions? 75 [Photo: Investigating Border Control at the Canada-USA border]
Backup Follows 76
Why now? Breakdown of Dennard Scaling 3D Die Stacking Cool new programming models like HSA, CAPI allow unified memory address space Less copying data Great for programmability! Tight integration with host 77
Company Reputations “Companies would never produce malicious hardware, their reputation would be ruined!” 78
Border Control Operation 79 [Diagram labels: accelerator, TLB, caches ($$), IOMMU, Border Control, BC Cache, Protection Table, memory. Legend: trusted data path, untrusted data path, address translation path, translation update path, Border Control update path.]
To summarize… 80 Full IOMMU: safe, but no caches (slow). Bypassable IOMMU: has caches and TLB (very fast), but totally unsafe. CAPI-like: safe, and has caches and TLB, but longer access latency and less designer control. Can we do better?
To summarize again… 81 Full IOMMU: safe, but no caches (slow). Bypassable IOMMU: has caches and TLB (very fast), but totally unsafe. CAPI-like: safe, and has caches and TLB, but longer access latency and less designer control. Border Control: safe, physical caches + TLB, AND fast. EVALUATION: GPGPU accelerator as safety stress-test, gem5-gpu, Rodinia benchmarks
Simulation Parameters 82
Comparison of Configurations 83
Border Control Overheads Highly-Threaded GPU 84 Takeaway: On average 0.15% performance overhead vs. unsafe
Border Control Cache 85 Takeaway: A small (1KB) BCC is sufficient for our workloads
TLB Shootdown Steps If page was read-only: update entry in Protection Table and BCC If page was read-write: 1. Invalidate entry in TLB 2. Flush dirty blocks from page in accelerator cache 3. Update entry in Protection Table and BCC 86
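The downgrade sequence above can be sketched as a single function. The containers standing in for the accelerator TLB, the accelerator cache's dirty blocks, the Protection Table, and the BCC are plain dictionaries here, and every name is a hypothetical placeholder for the corresponding hardware structure.

```python
def downgrade_page(ppn, was_writable, new_perms,
                   tlb, dirty_blocks, protection_table, bcc):
    """Apply a permission downgrade for one physical page.

    tlb:              dict mapping virtual page number -> PPN
    dirty_blocks:     dict mapping PPN -> list of dirty block offsets
    protection_table: dict mapping PPN -> permission bits
    bcc:              dict mapping PPN -> cached permission bits
    Returns the list of steps performed, in order.
    """
    steps = []
    if was_writable:
        # 1. Invalidate any TLB entry that maps to this page.
        for vpn in [v for v, p in tlb.items() if p == ppn]:
            del tlb[vpn]
        steps.append("invalidate TLB entry")
        # 2. Flush dirty blocks from the page before permissions change,
        #    so pending writebacks are not blocked afterwards.
        dirty_blocks.pop(ppn, None)
        steps.append("flush dirty blocks")
    # 3. (or the only step for read-only pages) Update table and BCC.
    protection_table[ppn] = new_perms
    if ppn in bcc:
        bcc[ppn] = new_perms
    steps.append("update Protection Table and BCC")
    return steps
```

Read-only pages take only the final step because the accelerator cannot hold dirty data for them.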
Border Control Flush Overhead 87 Takeaway: Permission downgrades affect performance, but not much
Information Flow Tracking Goal: track untrusted information, prevent it from modifying sensitive data / control e.g., prevent buffer overflow in software Hardware-assisted techniques: prevent threats from bugs in software (same address space) – different threat than Border Control Hardware (e.g. Tiwari et al., ISCA 2011) – very powerful technique, but high area/runtime overhead and not transparent to software 88
Mondriaan Replacement for traditional page table + TLB Allows fine-grained permissions Border Control is independent of the policy for deciding permissions But permission granularity might mean alternate Protection Table organizations are better 89
Single-Level Cache 90
Simulation Parameters 91
Time Spent Simulating (Random) 92
Configuration                    | Time
XG Full + Hammer + 1 Level       | 5.28 years
XG Full + Hammer + 2 Level       | 2.51 years
XG Full + MESI Inc + 1 Level     | 133 days
XG Full + MESI Inc + 2 Level     | 223 days
XG Trans. + Hammer + 1 Level     | 3.17 years
XG Trans. + Hammer + 2 Level     | 1.38 years
XG Trans. + MESI Inc + 1 Level   | 90 days
XG Trans. + MESI Inc + 2 Level   | 103 days
TOTAL                            | 13.9 years
Full Coverage %s (Random)
Full State XG     | Single-level | Two-level
Hammer-like       |              |
MESI Inclusive    |              |
Transactional XG  | Single-level | Two-level
Hammer-like       |              |
MESI Inclusive    |              |
Time Spent Simulating (Fuzz) 94
Configuration                      | Time
XG Full + Hammer-like              | 1.62 years
XG Full + MESI Inclusive           | 287 days
XG Transactional + Hammer-like     | 5.3 years
XG Transactional + MESI Inclusive  | 41 days
Total                              | 7.82 years
Full Coverage %s (Fuzz) 95
Full State Crossing Guard     | Fuzz Tester
Hammer-like                   | 99.3
MESI Inclusive                | 99.7
Transactional Crossing Guard  | Fuzz Tester
Hammer-like                   | 99.7
MESI Inclusive                | 100
Performance: Hammer-like 96
Performance: MESI Inclusive 97
Old Slides 99
3. Why Host Safety? 100 Accelerator cache Directory Addr A: ? Addr A: RW Addr A: Not Present in caches ? ? ? Ack Addr: A
Directory 3. Why Host Safety? 101 Accelerator cache Addr A: M Addr A: RW Addr A: M, owned by accelerator Fwd-GetM Addr: A
Directory Crossing Guard Example 102 Accelerator cache Addr A: M Addr A: RW Addr A: M, owned by accelerator A: waiting for WB Writeback Addr: A Fwd-GetM Addr: A Invalidate Addr: A
Directory Crossing Guard Example 103 Accelerator cache Addr A: M Addr A: RW Addr A: M, owned by accelerator A: waiting for WB Invalidate Addr: A Writeback Addr: A Fwd-GetM Addr: A
Where to next? 104
What I’ve Learned 1. Anticipate questions, make backup slides =) 2. Talk to colleagues! They’re really smart. 3. If you can’t explain why your idea is exciting, no one will care about it. 4. Be confident! 105