1
Protecting Host Systems from Imperfect Hardware Accelerators
Lena E. Olson
PhD Final Defense, August 17th, 2016
2
Accelerators are increasingly popular…
Good for performance, energy-efficiency, programmability, exciting new applications…
What can we do if they’re imperfect?
3
Executive Summary
Motivation
  What are accelerators?
  Why are they popular?
  What does imperfect mean?
I. Accelerator Security Taxonomy
  Define threat landscape
  Anticipate threats rather than fixing one by one
4
Executive Summary
II. Border Control
  Protects host memory from stray reads/writes
III. Crossing Guard
  Protects host from coherence protocol violations
  Eases accelerator development
5
Overview
Motivation
I. Accelerator Security Taxonomy
II. Border Control
III. Crossing Guard
6
What is an accelerator?
Broadly: specialized hardware that can perform a subset of computation tasks with higher performance and/or lower energy than a CPU
7
Types of Accelerators
SoCs, soft-IP accelerators
FPGA accelerators
IBM CAPI
CCIX
[Image: A9, from www.chipworks.com]
8
Example Accelerators
Lots of (GP)GPU papers!
9
However…
What if accelerator hardware is imperfect?
  Due to bugs?
  Due to malicious design?
10
Overview
Motivation
I. Accelerator Security Taxonomy
II. Border Control
III. Crossing Guard
11
I. Accelerator Security Taxonomy
Security Implications of Third-Party Accelerators*
Lena E. Olson, Simha Sethumadhavan, Mark D. Hill
*CAL, June 2016
12
Motivating Example: GPU leaks
Guess which website left this data in the GPU texture memory?
“Stealing Webpages Rendered on Your Browser by Exploiting GPU Vulnerabilities”, Lee et al. (Oakland ’14)
13
Why a taxonomy?
Could discover and fix threats one by one
  Hard to patch existing hardware
  Doesn’t fix root problem
Taxonomy provides a framework
  What classes of threats are there?
  Where are they coming from?
  How to prevent them?
14
Threat Scope
Accelerator Scope
  Only affect processes running on accelerator
  Example: GPU leaks data between processes
  Challenge: Cannot fix accelerator internals
  Defense: Don’t run sensitive process on untrusted accelerator
System Scope
  Can affect processes not running on accelerator
  Example: Bad access to system memory
  Challenge: Affects unrelated processes!
  Defense: Good system / interface design
15
Security: CIA model
3 considerations for security
  Confidentiality: Can someone steal data?
  Integrity: Do we get the right answer?
  Availability: Can we use the resource?
Integrity & Availability also important for reliability!
16
Accelerator Risk Categories
Configuration, Computation, Termination
Access to {accelerator, host} memory
Microarchitectural commands, Exceptions/interrupts
Power
17
Threat Matrix: Accelerator Scope
Legend: Known exploit
Columns: Confidentiality, Integrity, Availability
Configuration
  Confidentiality: Side-channel, kleptography
  Integrity: Kleptography, wrong output
  Availability: Lock up accelerator
Computation
  Confidentiality: Side-channel, kleptography
  Integrity: Kleptography, wrong output
  Availability: Lock up accelerator
Termination
  Confidentiality: Failure to clear registers / memory / cache
  Integrity: Stale data in registers / memory / cache
  Availability: Fail to release resources
Accel. Memory
  Bad access
  Evict others
System Memory
  Side-channel
µarch Commands
  Inconsistent (stale) data
Exceptions
  Side-channel
Power
  Power analysis attacks
  Excessive heat Unreliability
  Excessive heat damage
18
Threat Matrix: System Scope
Columns: Confidentiality, Integrity, Availability
Configuration
  Incorrect registers (e.g. CR3)
  Incorrect registers
Computation
  (no entries)
Termination
  Stale translations
  Fail to release resources
Accel. Memory
  (no entries)
System Memory
  Bad access
  Saturate bandwidth, cause swapping
µarch Commands
  Confidentiality: Snoop on coherence traffic; ignored invalidations
  Integrity: Ignore invalidations
  Availability: Excessive / ignored coherence requests
Exceptions
  Spurious exceptions / interrupts
Power
  Excessive heat
19
Example Defenses
Reset accelerator upon termination
  Limits performance; non-volatile memory?
ARM TrustZone
  Coarse-grained: trusted vs. untrusted
Protection at interfaces
20
Our Focus
Accelerators that
  Share unified virtual memory with host
  Share unified physical memory with host
  May participate in coherence with host
…but, which are less trusted than the CPU
  Or, which don’t need full access to everything!
If compromised, can affect the host memory, not just processes running on accelerator!
21
Two Memory Access Threats
Accesses to invalid addresses
  Wild writes
  Reads to sensitive data
  Effectively, allow full access to host system!
  Our solution: II. Border Control
Incorrect accelerator coherence protocols
  Incorrect messages
  Deadlocks
  Denial of service attacks
  Our solution: III. Crossing Guard
22
Overview
Motivation
I. Accelerator Security Taxonomy
II. Border Control
III. Crossing Guard
23
Border Control: Sandboxing Accelerators*
Lena E. Olson, Jason Power, Mark D. Hill, David A. Wood
*MICRO, December 2015
24
Threat Model
Protect host from incorrect or malicious accelerators that could perform
  stray reads, violating confidentiality
  stray writes, violating integrity
of host processes that do and do NOT run on the accelerator
Question: Which accesses are stray?
25
Principle of Least Privilege
“Every program and every user of the system should operate using the least set of privileges necessary to complete the job. Primarily, this principle limits the damage that can result from an accident or error.” (Jerome Saltzer)
Slide overlays: “hardware component”, “Border Control Authors”
26
Accelerator Access Permissions
What permissions should an accelerator have?
  NOT to OS data
  NOT to sensitive data from other processes
Principle of Least Privilege: to what it needs
  Access to addresses corresponding to process it is currently running
  These can be found in the page table
We will use page permissions (like prior work)
27
Example System
[Figure: CPU with caches ($$), MMU and TLB; accelerators with caches ($$); memory or shared LLC. Legend: trusted data path, untrusted data path, address translation path, translation update path. Annotations on the accelerator side: Address translation? Security?]
28
Full IOMMU
[Figure: CPU with caches ($$), MMU and TLB; accelerator attached through a Full IOMMU; memory or shared LLC. Legend: trusted data path, untrusted data path, address translation path, translation update path.]
29
Full IOMMU Challenges
IOMMU’s Address Translation Service (ATS) translates every memory reference to host
  + Protection
  - Translation latency
  - Bandwidth
  - Synonyms in virtual caches?
  - Coherence?
Can add (physical) caches and TLB…
30
Bypassable IOMMU (Baseline)
[Figure, two builds: CPU with caches ($$), MMU and TLB; accelerators with their own caches ($$) and TLBs; IOMMU; memory or shared LLC holding OS Memory (Q) and Process Memory (P). Legend: trusted data path, untrusted data path, address translation path, translation update path. Requests shown: Mem req: Virtual addr = V; Mem req: Phys. addr = P; and, in the second build, Mem req: Phys. addr = Q.]
32
So… caches are the problem?
We can’t remove the caches and TLBs
  Too slow!
Why not use trusted design for caches and TLBs?
33
CAPI-like
[Figure: CPU with caches ($$), MMU and TLB; accelerators with caches ($$) and TLBs; IOMMU; memory or shared LLC holding OS Memory (Q) and Process Memory (P). Legend: trusted data path, untrusted data path, address translation path, translation update path. Annotation: Cache access latency?]
34
Summary Comparison
                      Full IOMMU  Bypassable IOMMU  CAPI-like
TLB + Caches?         No          Yes               Slow
Customizable Caches?  No          Yes               No
Safe?                 Yes         No                Yes
35
Border Control
[Figure, three builds: the bypassable-IOMMU system with a Border Control unit between the untrusted accelerators (caches, TLBs) and memory or shared LLC, which holds OS Memory (Q) and Process Memory (P). Legend: trusted data path, untrusted data path, address translation path, translation update path. Requests shown: Mem req: Virtual addr = V and Mem req: Phys. addr = P in the second build; Mem req: Phys. addr = Q in the third.]
38
Border Control: Implementation
One Border Control instance per accelerator
Protection Table
  In system memory
  Contains all needed permissions by PPN
  Sufficient for correct design
  0.006% physical memory overhead
Border Control Cache (BCC)
  Caches recent permissions
  A 64 byte entry covers 512 4KB pages
39
Protection Table Design
Flat physically indexed table in memory
  2 bits (R/W) per physical page
  Initialized to 0 (no permission)
  Lazily updated on IOMMU translation
  Checked on all accelerator memory requests
What about execute permission?

PPN   R  W
0     0  0
1     1  1
2     1  0
3     0  0
...
N-4   0  0
N-3   1  0
N-2   1  0
N-1   0  0
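The table described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the hardware design from the talk: the class and method names (`ProtectionTable`, `grant`, `revoke`, `check`), the four-pages-per-byte packing, and the choice to OR new permissions into an entry are all hypothetical.

```python
from enum import IntFlag

PAGE_SHIFT = 12  # 4 KB pages, as on the slide


class Perm(IntFlag):
    NONE = 0
    R = 1
    W = 2


class ProtectionTable:
    """Flat, physically indexed table holding 2 permission bits per page."""

    def __init__(self, num_pages):
        self.num_pages = num_pages
        # Four 2-bit entries per byte; all zero means "no permission".
        self.bits = bytearray((num_pages + 3) // 4)

    def _locate(self, ppn):
        if not 0 <= ppn < self.num_pages:
            raise IndexError("PPN outside physical memory")
        return ppn // 4, (ppn % 4) * 2

    def get(self, ppn):
        byte, shift = self._locate(ppn)
        return Perm((self.bits[byte] >> shift) & 0b11)

    def grant(self, ppn, perm):
        # Lazy update: assumed to be called when the IOMMU completes a
        # translation for this page on behalf of the accelerator.
        byte, shift = self._locate(ppn)
        self.bits[byte] |= int(perm) << shift

    def revoke(self, ppn):
        byte, shift = self._locate(ppn)
        self.bits[byte] &= ~(0b11 << shift) & 0xFF

    def check(self, paddr, is_write):
        # Applied to every accelerator memory request (physical address).
        ppn = paddr >> PAGE_SHIFT
        if ppn >= self.num_pages:
            return False
        needed = Perm.W if is_write else Perm.R
        return needed in self.get(ppn)
```

A request to a page that was never translated for the accelerator finds a zero entry and is rejected, which is the sandboxing property the slide relies on.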
40
Summary Comparison
                      Full IOMMU  Bypassable IOMMU  CAPI-like  Border Control
TLB + Caches?         No          Yes               Slow       Yes
Customizable Caches?  No          Yes               No         Yes
Safe?                 Yes         No                Yes
EVALUATION
  GPGPU accelerator safety stress-test
  gem5-gpu
  Rodinia Benchmarks
41
Border Control Overheads
Moderately-Threaded GPU
Takeaway: On average 0.48% performance overhead vs. unsafe
42
II. Border Control Summary
Bad addresses blocked: check!
2 bits / (4KB) page = 0.006% space overhead
  Could be optimized further
On average, 0.48% (moderately threaded) performance overhead
What about bad coherence messages?
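The space-overhead figure follows directly from the table layout: 2 bits per 4 KB page is 2 / (4096 × 8) of physical memory. The short check below works that out; the function names and the 16 GiB example size are illustrative assumptions, not values from the talk.

```python
BITS_PER_PAGE = 2    # R and W bits, per the slide
PAGE_BYTES = 4096    # 4 KB pages


def overhead_fraction(bits_per_page=BITS_PER_PAGE, page_bytes=PAGE_BYTES):
    """Fraction of physical memory consumed by the Protection Table."""
    return bits_per_page / (page_bytes * 8)


def table_bytes(phys_mem_bytes, bits_per_page=BITS_PER_PAGE,
                page_bytes=PAGE_BYTES):
    """Protection Table size in bytes for a given physical memory size."""
    pages = phys_mem_bytes // page_bytes
    return (pages * bits_per_page + 7) // 8


print(f"{overhead_fraction() * 100:.4f}%")   # 0.0061%
print(table_bytes(16 * 2**30))               # 1048576 bytes (1 MiB) for 16 GiB
```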
43
Overview
Motivation
I. Accelerator Security Taxonomy
II. Border Control
III. Crossing Guard
44
III. Crossing Guard
Mediating Host-Accelerator Coherence Interactions*
Lena E. Olson, Mark D. Hill, David A. Wood
*Currently under submission
45
Threat Model
Protect host from incorrect or malicious accelerators that could perform
  stray reads, violating confidentiality
  stray writes, violating integrity
  incorrect coherence activity, violating availability
of host processes that do and do NOT run on the accelerator
46
Crossing Guard Goals
1. Allow accelerators customized caches
2. Simple, standardized coherence interface
   Work with many diverse host protocols
3. Provide safety for the host system
   No unexpected messages
   No deadlocks
47
1. Why Customize Caches?
CPU caches have to work with all workloads
Accelerators may only run some workloads!
  Streaming? More prefetching.
  GPGPUs? Relax coherence between GPU cores.
  Etc…
48
2. Why Simple Interface?
Redesigning for each host is too much work
  Intel, AMD, ARM, IBM, Oracle…
  CCIX shows companies care!
Host protocols may be proprietary
Host protocols are complex!
49
2. Why Simple Interface?
[Figure: transition table in the style of Sorin et al.]
50
3. Why Host Safety?
[Figure, three builds: a directory (columns Addr, State, Owner/Sharers, Req), the accelerator cache (#0), and caches #1 and #2.
 Builds 1 and 2: the directory holds block A in state SS with sharers 1, 2; the accelerator cache holds A in I; the accelerator sends an Ack (???).
 Build 3: the directory holds A in state MT with owner 0 and the accelerator cache holds A in M; the directory sends Inv (Req: dir) and moves A to MT_I.]
53
Crossing Guard Overview
Hardware implemented in trusted host
Implements simple, standard interface
  complex enough to allow hierarchical protocol
  works with range of host protocols
  safe for host
  maintains Border Control protections
Moves protocol complexity into XG hardware
  Only implemented once per host system
  By experts!
54
1. Customize Caches
Designed + implemented two sample systems
  Private Per-Core L1 at Accelerator
  Private L1s + Shared L2 at Accelerator
[Figure, one per system: Accel L1 (and Accel L2 in the second system), XG, CPU L1, Host Directory / L2.]
56
2. Simple Interface
Accelerator → Host Requests: GetS, GetM, PutS, PutE, PutM
Host → Accelerator Responses: DataS, DataE, DataM, Writeback Ack
Host → Accelerator Requests: Invalidate
Accelerator → Host Responses: InvAck, Clean Writeback, Dirty Writeback
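The message vocabulary above is small enough to write down as four enumerations. The sketch below does that in Python; the message names come from the slide, while the `EXPECTED_HOST_RESPONSES` pairing (which response may answer which request) is an assumption made for illustration and is not stated in the talk.

```python
from enum import Enum


class AccelRequest(Enum):      # accelerator -> host
    GET_S = "GetS"
    GET_M = "GetM"
    PUT_S = "PutS"
    PUT_E = "PutE"
    PUT_M = "PutM"


class HostResponse(Enum):      # host -> accelerator
    DATA_S = "DataS"
    DATA_E = "DataE"
    DATA_M = "DataM"
    WRITEBACK_ACK = "Writeback Ack"


class HostRequest(Enum):       # host -> accelerator
    INVALIDATE = "Invalidate"


class AccelResponse(Enum):     # accelerator -> host
    INV_ACK = "InvAck"
    CLEAN_WRITEBACK = "Clean Writeback"
    DIRTY_WRITEBACK = "Dirty Writeback"


# Assumed pairing, for illustration only.
EXPECTED_HOST_RESPONSES = {
    AccelRequest.GET_S: {HostResponse.DATA_S, HostResponse.DATA_E},
    AccelRequest.GET_M: {HostResponse.DATA_M},
    AccelRequest.PUT_S: {HostResponse.WRITEBACK_ACK},
    AccelRequest.PUT_E: {HostResponse.WRITEBACK_ACK},
    AccelRequest.PUT_M: {HostResponse.WRITEBACK_ACK},
}


def is_expected_response(request, response):
    """True if `response` is an allowed answer to `request` in this sketch."""
    return response in EXPECTED_HOST_RESPONSES[request]
```

With only one host-to-accelerator request type (Invalidate), an accelerator cache designer has a single incoming request to handle.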
57
2. Simple Interface
[Figure: Single-level Accelerator Cache using Crossing Guard Interface]
58
2. Simple Interface
Implemented Crossing Guard interface to two host protocols
  AMD Hammer-like Exclusive MOESI
  MESI Inclusive
Modularity: Host and Accelerator protocol choice is independent
59
2. Simple Interface
[Figure, animated example of a GetM for block A: a directory (columns Addr, State, Owner/Sharers, Req), the accelerator cache, caches #1 and #2, and Cache #0 (columns Addr, State, Acks, Reqs, Timer).
 Directory entry for A: SS with sharers 1, 2; then SM_MB with sharers 1, 2 and Req 0; finally M with owner 0.
 Cache #0 entry for A: I, IM, SM (Acks -2), SM (Acks -1), M.
 Messages shown: GetM, Inv (Req: 0), Data (Acks: -2), Ack, Ack, DataM, UnblockM.
 The accelerator cache moves from I to M.]
61
3. Host Safety
[Figure, two animated examples: a directory (columns Addr, State, Owner/Sharers, Req), the accelerator cache, caches #1 and #2, and Cache #0 (columns Addr, State, Acks, Reqs, Timer).
 Example 1: the directory holds A in SS with sharers 1, 2; Cache #0 holds A in I; the accelerator cache sends an Ack.
 Example 2: the directory holds A in MT with owner 0 and sends Inv (Req: dir), moving A to MT_I; the Cache #0 entry goes from M to MI with Reqs = dir and Timer = 1210, and an Inv is sent on. Times shown: 200, 210, 500, 1000, 1500. Data is then sent; the Cache #0 entry ends in I and the directory entry in WB.]
63
3. Host Safety
Crossing Guard Guarantees to Host:
1. Accelerator requests must be correct
   a) Consistent with block stable state
   b) Consistent with block transient state
2. Accelerator responses must be correct
   a) Consistent with block stable state
   b) Consistent with block transient state
   c) Within a reasonable time
( + Border Control Protections!)
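A rough Python sketch of guarantees 1a, 2a and 2c follows. It is not the Crossing Guard implementation: the `GuardSketch` class, the `ALLOWED_REQUESTS` and `EXPECTED_RESPONSES` tables, and the timeout handling are assumptions chosen to make the idea concrete, and transient-state checks (1b, 2b) are left out. The timeout of 1000 is inferred from the timer values on the preceding slide (Inv seen at 210, timer 1210).

```python
from dataclasses import dataclass, field

# Assumed legality tables, keyed by the accelerator's stable state.
ALLOWED_REQUESTS = {
    "I": {"GetS", "GetM"},
    "S": {"GetM", "PutS"},
    "E": {"PutE", "PutM"},
    "M": {"PutM"},
}
EXPECTED_RESPONSES = {
    "I": {"InvAck"},
    "S": {"InvAck"},
    "E": {"CleanWriteback", "DirtyWriteback"},
    "M": {"DirtyWriteback"},
}


@dataclass
class GuardSketch:
    timeout: int
    state: dict = field(default_factory=dict)      # addr -> stable state
    deadlines: dict = field(default_factory=dict)  # addr -> response deadline

    def request_ok(self, addr, msg):
        """1a: an accelerator request must fit the block's stable state."""
        return msg in ALLOWED_REQUESTS[self.state.get(addr, "I")]

    def host_invalidate(self, addr, now):
        """Host sent Invalidate: start a timer for the accelerator's reply."""
        self.deadlines[addr] = now + self.timeout

    def accel_response(self, addr, msg, now):
        """2a and 2c: a reply must be solicited, on time, and fit the state."""
        deadline = self.deadlines.get(addr)
        if deadline is None or now > deadline:
            return False
        if msg not in EXPECTED_RESPONSES[self.state.get(addr, "I")]:
            return False
        del self.deadlines[addr]
        self.state[addr] = "I"
        return True

    def expire(self, now):
        """Blocks whose timers ran out; the guard would answer the host itself."""
        late = sorted(a for a, d in self.deadlines.items() if now > d)
        for addr in late:
            del self.deadlines[addr]
            self.state[addr] = "I"
        return late
```

The point of the timer is that the host never waits on an untrusted component indefinitely: once a deadline passes, the trusted side can complete the transaction.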
64
Crossing Guard Variants
Full State Crossing Guard
  Inclusive directory of accelerator state
  + Places few restrictions on host protocol
  + Can hide all errors
  - Requires tag + metadata storage for all blocks
Transactional Crossing Guard
  Stores only data for in-flight transactions
  + Small storage
  + Provides most safety properties
  - Requires some host tolerance
65
Evaluation
1. Does it provide coherence to correct accelerator?
2. Does it provide safety to host?
3. Does it allow high performance?
66
Correctness Testing
Are coherence invariants maintained when accelerator is acting correctly?
How? Random tester
  Store-Load pairs to random addresses
  Check integrity of data
Local coverage: > 99%
67
Fuzz Testing
Is host safety maintained when accelerator misbehaves?
How? Replace accelerator cache with evil controller
  Generates random coherence messages to random addresses
Desired outcome: No deadlocks / crashes
Local Coverage: > 99.3%
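The "evil controller" idea can be illustrated with a small generator of random (message, address) pairs. This is a hypothetical stand-in, not the gem5 tester used in the talk: the function name, the 64-byte block size, and the seeding scheme are assumptions; only the message names come from the interface slide.

```python
import random

# Every message an accelerator may legally or illegally send.
ACCEL_MESSAGES = [
    "GetS", "GetM", "PutS", "PutE", "PutM",
    "InvAck", "CleanWriteback", "DirtyWriteback",
]


def fuzz_messages(count, num_blocks, seed=0, block_bytes=64):
    """Yield `count` random (message, block address) pairs.

    No attempt is made to keep the stream protocol-legal; that is the point.
    A fixed seed makes a failing run reproducible.
    """
    rng = random.Random(seed)
    for _ in range(count):
        msg = rng.choice(ACCEL_MESSAGES)
        addr = rng.randrange(num_blocks) * block_bytes
        yield msg, addr
```

Keeping `num_blocks` small concentrates traffic on a few addresses, which makes races between host and accelerator messages more likely.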
68
Performance Testing
Tertiary concern, but cannot degrade performance too much
gem5-gpu
Rodinia workloads
CAVEATS:
  Immaturity of workloads / infrastructure
  Directly comparing coherence protocols hard
  General trends only!
69
Performance (Hammer-like)
70
Performance: MESI Inclusive
71
III. Crossing Guard Summary
Provides simple, standardized interface to ease accelerator development
Correctness when accelerator is correct
Host safety when accelerator is incorrect
Low performance overhead
72
Overview
Motivation
I. Accelerator Security Taxonomy
II. Border Control
III. Crossing Guard
73
Publications
“Crossing Guard: Mediating Host-Accelerator Coherence Interactions”, Olson, Hill, Wood (under submission)
“Border Control: Sandboxing Accelerators”, Olson, Power, Hill, Wood (MICRO 2015)
“Security Implications of Third-Party Accelerators”, Olson, Sethumadhavan, Hill (CAL 2016)
“Probabilistic Directed Writebacks for Exclusive Caches”, Olson, Hill (TR 2016)
“Revisiting Stack Caches for Energy Efficiency”, Olson, Eckert, Manne, Hill (TR 2014)
74
Conclusion
Accelerators raise new security questions
We can design secure interfaces
  To prevent bad memory accesses
  To prevent coherence bugs
  To ease accelerator development
  At low overhead, so people might use them!
75
Questions?
[Photo: Investigating Border Control at the Canada-USA Border]
76
Backup Follows
77
Why now?
Breakdown of Dennard Scaling
3D Die Stacking
Cool new programming models like HSA, CAPI allow unified memory address space
  Less copying data
  Great for programmability!
  Tight integration with host
78
Company Reputations
“Companies would never produce malicious hardware, their reputation would be ruined!”
79
Border Control Operation
[Figure: accelerator with caches ($$) and TLB; IOMMU; Border Control with its BC Cache; memory holding the Protection Table. Legend: trusted data path, untrusted data path, address translation path, translation update path, Border Control update path.]
80
To summarize…
Full IOMMU
  Safe, but no caches (slow)
Bypassable IOMMU
  Has caches, TLB – very fast!
  Totally unsafe
CAPI-like
  Safe, and has caches and TLB…
  But longer access latency, less designer control
Can we do better?
81
To summarize again…
Full IOMMU
  Safe, but no caches (slow)
Bypassable IOMMU
  Has caches, TLB – very fast!
  Totally unsafe
CAPI-like
  Safe, and has caches and TLB…
  But longer access latency, less designer control
Border Control
  Safe, physical caches+TLB, AND fast
EVALUATION
  GPGPU accelerator safety stress-test
  gem5-gpu
  Rodinia Benchmarks
82
Simulation Parameters
83
Comparison of Configurations
84
Border Control Overheads
Highly-Threaded GPU
Takeaway: On average 0.15% performance overhead vs. unsafe
85
Border Control Cache
Takeaway: A small (1KB) BCC is sufficient for our workloads
86
TLB Shootdown Steps
If page was read-only: update entry in Protection Table and BCC
If page was read-write:
  1. Invalidate entry in TLB
  2. Flush dirty blocks from page in accelerator cache
  3. Update entry in Protection Table and BCC
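The steps above can be sketched as a single function. The data structures here are deliberately simple stand-ins and all hypothetical: dictionaries for the Protection Table and BCC, a set of PPNs for the accelerator TLB, and a dictionary from block address to dirty flag for the accelerator cache.

```python
PAGE_BYTES = 4096  # 4 KB pages


def shoot_down(ppn, new_perms, prot_table, bcc, accel_tlb, accel_cache):
    """Downgrade permissions for one physical page.

    prot_table, bcc: dict mapping PPN -> frozenset of "R"/"W"
    accel_tlb:       set of PPNs with cached translations
    accel_cache:     dict mapping block address -> dirty flag
    Returns the block addresses that were written back.
    """
    old_perms = prot_table.get(ppn, frozenset())
    flushed = []
    if "W" in old_perms:
        # Step 1: invalidate the accelerator's TLB entry.
        accel_tlb.discard(ppn)
        # Step 2: flush dirty blocks belonging to this page.
        base = ppn * PAGE_BYTES
        for addr in sorted(accel_cache):
            if base <= addr < base + PAGE_BYTES and accel_cache[addr]:
                flushed.append(addr)
                del accel_cache[addr]
    # Step 3 (the only step for read-only pages): update table and BCC.
    prot_table[ppn] = frozenset(new_perms)
    if ppn in bcc:
        bcc[ppn] = frozenset(new_perms)
    return flushed
```

The ordering matters for read-write pages: dirty data must leave the accelerator cache before write permission disappears, otherwise the later writeback would itself be blocked as a stray write.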
87
Border Control Flush Overhead
Takeaway: Permission downgrades affect performance, but not much
88
Information Flow Tracking
Goal: track untrusted information, prevent it from modifying sensitive data / control
  e.g., prevent buffer overflow in software
Hardware-assisted techniques: prevent threats from bugs in software (same address space) – different threat than Border Control
Hardware (e.g. Tiwari et al., ISCA 2011) – very powerful technique, but high area/runtime overhead and not transparent to software
89
Mondriaan
Replacement for traditional page table + TLB
Allows fine-grained permissions
Border Control is independent of the policy for deciding permissions
  But permission granularity might mean alternate Protection Table organizations are better
90
Single-Level Cache
91
Simulation Parameters
92
Time Spent Simulating (Random)
Configuration                     Time
XG Full + Hammer + 1 Level        5.28 years
XG Full + Hammer + 2 Level        2.51 years
XG Full + MESI Inc + 1 Level      133 days
XG Full + MESI Inc + 2 Level      223 days
XG Trans. + Hammer + 1 Level      3.17 years
XG Trans. + Hammer + 2 Level      1.38 years
XG Trans. + MESI Inc + 1 Level    90 days
XG Trans. + MESI Inc + 2 Level    103 days
TOTAL                             13.9 years
93
Full Coverage %s (Random)
Full State XG     Single-level  Two-level
Hammer-like       99            99.8
MESI Inclusive    100           99.4
Transactional XG  Single-level  Two-level
Hammer-like       99.3          99.5
MESI Inclusive    100           99.7
94
Time Spent Simulating (Fuzz)
Configuration                      Time
XG Full + Hammer-like              1.62 years
XG Full + MESI Inclusive           287 days
XG Transactional + Hammer-like     5.3 years
XG Transactional + MESI Inclusive  41 days
Total                              7.82 years
95
Full Coverage %s (Fuzz)
Full State Crossing Guard     Fuzz Tester
Hammer-like                   99.3
MESI Inclusive                99.7
Transactional Crossing Guard  Fuzz Tester
Hammer-like                   99.7
MESI Inclusive                100
96
Performance: Hammer-like
97
Performance: MESI Inclusive
98
Template
[Figure: blank template for the coherence examples: directory (Addr, State, Owner/Sharers, Req), accelerator cache, caches #1 and #2, Cache #0 (Addr, State, Acks, Reqs, Timer), with sample GetM and Ack messages.]
99
Old Slides
100
3. Why Host Safety?
[Figure, two builds: accelerator cache and directory.
 Build 1: the directory records Addr A as not present in caches; an Ack for Addr A arrives (???).
 Build 2: the directory records Addr A as M, owned by accelerator; the accelerator cache holds A in M; a Fwd-GetM for Addr A is sent.
 Both builds also show an entry "Addr A: RW".]
102
Crossing Guard Example
[Figure, two builds: accelerator cache, Crossing Guard, and directory. The directory records Addr A as M, owned by accelerator; the accelerator cache holds A in M; an entry "Addr A: RW" is shown; Crossing Guard records A as waiting for WB. Messages shown: Fwd-GetM Addr: A, Invalidate Addr: A, Writeback Addr: A.]
104
Where to next?
105
What I’ve Learned
1. Anticipate questions, make backup slides =)
2. Talk to colleagues! They’re really smart.
3. If you can’t explain why your idea is exciting, no one will care about it.
4. Be confident!