Protecting Host Systems from Imperfect Hardware Accelerators
Lena E. Olson
PhD Final Defense, August 17th, 2016

Accelerators are increasingly popular…
Good for performance, energy-efficiency, programmability, exciting new applications…
What can we do if they’re imperfect?

Executive Summary
Motivation
  • What are accelerators?
  • Why are they popular?
  • What does imperfect mean?
I. Accelerator Security Taxonomy
  • Define threat landscape
  • Anticipate threats rather than fixing one by one

Executive Summary
II. Border Control
  • Protects host memory from stray reads/writes
III. Crossing Guard
  • Protects host from coherence protocol violations
  • Eases accelerator development

Overview
  • Motivation
  • I. Accelerator Security Taxonomy
  • II. Border Control
  • III. Crossing Guard

What is an accelerator?
Broadly: Specialized hardware that can perform a subset of computation tasks with higher performance and/or lower energy than a CPU

Types of Accelerators
  • SoCs, soft-IP accelerators
  • FPGA accelerators
  • IBM CAPI
  • CCIX
[Image: A9, from www.chipworks.com]

Example Accelerators
Lots of (GP)GPU papers!

However…
What if accelerator hardware is imperfect?
  • Due to bugs?
  • Due to malicious design?

I. Accelerator Security Taxonomy
Security Implications of Third-Party Accelerators*
Lena E. Olson, Simha Sethumadhavan, Mark D. Hill
*CAL, June 2016

Motivating Example: GPU leaks
Guess which website left this data in the GPU texture memory?
“Stealing Webpages Rendered on Your Browser by Exploiting GPU Vulnerabilities”, Lee et al. (Oakland ’14)

Why a taxonomy?
Could discover and fix threats one by one
  • Hard to patch existing hardware
  • Doesn’t fix root problem
Taxonomy provides a framework
  • What classes of threats are there?
  • Where are they coming from?
  • How to prevent them?

Threat Scope
Accelerator Scope
  • Only affect processes running on accelerator
  • Example: GPU leaks data between processes
  • Challenge: Cannot fix accelerator internals
  • Defense: Don’t run sensitive process on untrusted accelerator
System Scope
  • Can affect processes not running on accelerator
  • Example: Bad access to system memory
  • Challenge: Affects unrelated processes!
  • Defense: Good system / interface design

Security: CIA model
3 considerations for security
  • Confidentiality: Can someone steal data?
  • Integrity: Do we get the right answer?
  • Availability: Can we use the resource?
Integrity & Availability also important for reliability!

Accelerator Risk Categories
  • Configuration, Computation, Termination
  • Access to {accelerator, host} memory
  • Microarchitectural commands, Exceptions/interrupts
  • Power

Threat Matrix: Accelerator Scope
Columns: Confidentiality | Integrity | Availability
Legend: Known exploit
Configuration: Side-channel, kleptography | Kleptography, wrong output | Lock up accelerator
Computation: Side-channel, kleptography | Kleptography, wrong output | Lock up accelerator
Termination: Failure to clear registers / memory / cache | Stale data in registers / memory / cache | Fail to release resources
Accel. Memory: Bad access; Evict others
System Memory: Side-channel
µarch Commands: Inconsistent (stale) data
Exceptions: Side-channel
Power: Power analysis attacks | Excessive heat → Unreliability | Excessive heat → damage

Threat Matrix: System Scope
Columns: Confidentiality | Integrity | Availability
Configuration: Incorrect registers (e.g. CR3); Incorrect registers
Computation: -
Termination: Stale translations; Fail to release resources
Accel. Memory: -
System Memory: Bad access; Saturate bandwidth, cause swapping
µarch Commands: Snoop on coherence traffic; ignored invalidations | Ignore invalidations | Excessive / ignored coherence requests
Exceptions: Spurious exceptions / interrupts
Power: Excessive heat

Example Defenses
  • Reset accelerator upon termination
    • Limits performance; non-volatile memory?
  • ARM TrustZone
    • Coarse-grained: trusted vs. untrusted
  • Protection at interfaces

Our Focus
  • Accelerators that
    • Share unified virtual memory with host
    • Share unified physical memory with host
    • May participate in coherence with host
  • …but, which are less trusted than the CPU
    • Or, which don’t need full access to everything!
  • If compromised, can affect the host memory, not just processes running on accelerator!

Two Memory Access Threats
  • Accesses to invalid addresses
    • Wild writes
    • Reads to sensitive data
    • Effectively, allow full access to host system!
    Our solution: II. Border Control
  • Incorrect accelerator coherence protocols
    • Incorrect messages
    • Deadlocks
    • Denial of service attacks
    Our solution: III. Crossing Guard

Border Control: Sandboxing Accelerators*
Lena E. Olson, Jason Power, Mark D. Hill, David A. Wood
*MICRO, December 2015

Threat Model
Protect host from incorrect or malicious accelerators that could perform
  • stray reads, violating confidentiality
  • stray writes, violating integrity
of host processes that do and do NOT run on the accelerator
Question: Which accesses are stray?

Principle of Least Privilege
“Every program and every user of the system should operate using the least set of privileges necessary to complete the job. Primarily, this principle limits the damage that can result from an accident or error.” (Jerome Saltzer)
Slide annotations: “hardware component”; “Border Control Authors”

Accelerator Access Permissions
  • What permissions should an accelerator have?
    • NOT to OS data
    • NOT to sensitive data from other processes
  • Principle of Least Privilege: to what it needs
    • Access to addresses corresponding to process it is currently running
    • These can be found in the page table
  • We will use page permissions (like prior work)

Example System
[Figure: CPU and accelerators with caches, MMU and TLB, memory or shared LLC. Legend: trusted data path, untrusted data path, address translation path, translation update path. Questions marked on the figure: Address translation? Security?]

Full IOMMU
[Figure: the example system with a full IOMMU between the accelerator and memory or shared LLC. Legend: trusted data path, untrusted data path, address translation path, translation update path.]

Full IOMMU Challenges
  • IOMMU’s Address Translation Service (ATS) translates every memory reference to host
    + Protection
    - Translation latency
    - Bandwidth
    - Synonyms in virtual caches?
    - Coherence?
  • Can add (physical) caches and TLB…

Bypassable IOMMU (Baseline)
[Figure, two builds: accelerators with their own caches and TLBs, with the IOMMU on the address translation path only. Memory holds OS Memory (Q) and Process Memory (P). Build 1: Mem req: Virtual addr = V is translated and issued as Mem req: Phys. addr = P. Build 2: the accelerator issues Mem req: Phys. addr = Q. Legend: trusted data path, untrusted data path, address translation path, translation update path.]

So… caches are the problem?
  • We can’t remove the caches and TLBs
    • Too slow!
  • Why not use trusted design for caches and TLBs?

CAPI-like
[Figure: the same system with a trusted design for the accelerator caches and TLB. Memory holds OS Memory (Q) and Process Memory (P). Callout: Cache access latency?]

Summary Comparison
                       Full IOMMU   Bypassable IOMMU   CAPI-like
TLB + Caches?          No           Yes                Slow
Customizable Caches?   No           Yes                No
Safe?                  Yes          No                 Yes

Border Control
[Figure, three builds: Border Control sits on the untrusted data path between the accelerator caches/TLBs and memory or shared LLC, which holds OS Memory (Q) and Process Memory (P). Build 2: Mem req: Virtual addr = V is translated to Phys. addr = P, and Mem req: Phys. addr = P passes through Border Control. Build 3: Mem req: Phys. addr = Q arrives at Border Control. Legend: trusted data path, untrusted data path, address translation path, translation update path.]

Border Control: Implementation
  • One Border Control instance per accelerator
  • Protection Table
    • In system memory
    • Contains all needed permissions by PPN
    • Sufficient for correct design
    • 0.006% physical memory overhead
  • Border Control Cache (BCC)
    • Caches recent permissions
    • A 64 byte entry covers 512 4KB pages
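The 0.006% figure follows from the table layout: 2 permission bits for every 4KB page. A minimal arithmetic sketch, assuming 4KB pages throughout; the function names and the 16 GiB example are illustrative, not taken from the slides:

```python
PAGE_BYTES = 4 * 1024   # 4KB pages, as on the slide
BITS_PER_PAGE = 2       # one read bit and one write bit per physical page


def overhead_percent() -> float:
    """Protection Table bits per page over data bits per page, in percent."""
    return 100.0 * BITS_PER_PAGE / (PAGE_BYTES * 8)


def protection_table_bytes(physical_memory_bytes: int) -> int:
    """Size of a flat table holding 2 bits for every physical page."""
    pages = physical_memory_bytes // PAGE_BYTES
    return (pages * BITS_PER_PAGE + 7) // 8


if __name__ == "__main__":
    gib = 1 << 30
    print(f"{overhead_percent():.4f}% of physical memory")
    print(f"{protection_table_bytes(16 * gib) // 1024} KiB for 16 GiB (example size)")
```

For a hypothetical 16 GiB host this comes to 1 MiB of table.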

Protection Table Design
  • Flat physically indexed table in memory
  • 2 bits (R/W) per physical page
  • Initialized to 0 (no permission)
  • Lazily updated on IOMMU translation
  • Checked on all accelerator memory requests

PPN   R  W
0     0  0
1     1  1
2     1  0
3     0  0
...
N-4   0  0
N-3   1  0
N-2   1  0
N-1   0  0

What about execute permission?
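The table design above can be modeled in a few lines. This is a software sketch only, not the hardware from the talk: the class and method names, the bit order inside each 2-bit entry, and the choice to overwrite an entry on each translation are all assumptions made for illustration.

```python
PAGE_BYTES = 4096
READ, WRITE = 0b10, 0b01   # bit order within the 2-bit entry is an assumption


class ProtectionTable:
    """Flat table indexed by physical page number (PPN), 2 bits per page."""

    def __init__(self, num_pages: int) -> None:
        self.num_pages = num_pages
        # Four 2-bit entries per byte; all zero means "no permission".
        self.bits = bytearray((num_pages + 3) // 4)

    def get(self, ppn: int) -> int:
        return (self.bits[ppn // 4] >> ((ppn % 4) * 2)) & 0b11

    def set(self, ppn: int, perms: int) -> None:
        shift = (ppn % 4) * 2
        cleared = self.bits[ppn // 4] & ~(0b11 << shift) & 0xFF
        self.bits[ppn // 4] = cleared | ((perms & 0b11) << shift)

    def on_translation(self, ppn: int, readable: bool, writable: bool) -> None:
        """Lazy update: record page permissions when the IOMMU translates."""
        self.set(ppn, (READ if readable else 0) | (WRITE if writable else 0))

    def check(self, paddr: int, is_write: bool) -> bool:
        """Run on every accelerator memory request (physical address)."""
        ppn = paddr // PAGE_BYTES
        if not 0 <= ppn < self.num_pages:
            return False
        return bool(self.get(ppn) & (WRITE if is_write else READ))


if __name__ == "__main__":
    table = ProtectionTable(num_pages=1024)
    P, Q = 5, 7                      # hypothetical process page and OS page
    table.on_translation(P, readable=True, writable=False)
    print(table.check(P * PAGE_BYTES + 64, is_write=False))  # read of P
    print(table.check(P * PAGE_BYTES + 64, is_write=True))   # write to read-only P
    print(table.check(Q * PAGE_BYTES, is_write=False))       # Q was never translated
```

Because entries start at zero and are only filled in on translation, a page the accelerator never legitimately translated (Q here) fails the check without any extra bookkeeping.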

Summary Comparison
                       Full IOMMU   Bypassable IOMMU   CAPI-like   Border Control
TLB + Caches?          No           Yes                Slow        Yes
Customizable Caches?   No           Yes                No          Yes
Safe?                  Yes          No                 Yes         Yes

EVALUATION: GPGPU → accelerator safety stress-test; gem5-gpu; Rodinia Benchmarks

Border Control Overheads: Moderately-Threaded GPU
[Figure: performance overhead per benchmark.]
Takeaway: On average 0.48% performance overhead vs. unsafe

II. Border Control Summary
  • Bad addresses blocked: check!
  • 2 bits / (4KB) page = 0.006% space overhead
    • Could be optimized further
  • On average, 0.48% (moderately threaded) performance overhead
  • What about bad coherence messages?

III. Crossing Guard
Mediating Host-Accelerator Coherence Interactions*
Lena E. Olson, Mark D. Hill, David A. Wood
*Currently under submission

Threat Model
Protect host from incorrect or malicious accelerators that could perform
  • stray reads, violating confidentiality
  • stray writes, violating integrity
  • incorrect coherence activity, violating availability
of host processes that do and do NOT run on the accelerator

Crossing Guard Goals
1. Allow accelerators customized caches
2. Simple, standardized coherence interface
   • Work with many diverse host protocols
3. Provide safety for the host system
   • No unexpected messages
   • No deadlocks

1. Why Customize Caches?
  • CPU caches have to work with all workloads
  • Accelerators may only run some workloads!
    • Streaming? More prefetching.
    • GPGPUs? Relax coherence between GPU cores.
    • Etc…

2. Why Simple Interface?
  • Redesigning for each host is too much work
    • Intel, AMD, ARM, IBM, Oracle…
    • CCIX shows companies care!
  • Host protocols may be proprietary
  • Host protocols are complex!

2. Why Simple Interface?
[Figure: transition table in the style of Sorin et al.]

3. Why Host Safety?
[Figure, three builds: a directory and three caches (accelerator cache #0, cache #1, cache #2) tracking block A. Builds 1 and 2: the directory holds A in SS with sharers 1, 2, and an Ack arrives that the host does not expect (“? ? ?”). Build 3: the accelerator cache holds A in M, the directory holds A in MT with owner 0, sends Inv (Req: dir), and moves to MT_I.]

Crossing Guard Overview
  • Hardware implemented in trusted host
  • Implements simple, standard interface
    • complex enough to allow hierarchical protocol
    • works with range of host protocols
    • safe for host
    • maintains Border Control protections
  • Moves protocol complexity into XG hardware
    • Only implemented once per host system
    • By experts!

1. Customize Caches
  • Designed + implemented two sample systems
    • Private Per-Core L1 at Accelerator
    • Private L1s + Shared L2 at Accelerator
[Figure, two builds: Accel L1 (and, in the second build, Accel L2) connect through XG to the Host Directory / L2, alongside the CPU L1.]

2. Simple Interface
Accelerator → Host Requests
  • GetS, GetM
  • PutS, PutE, PutM
Host → Accelerator Responses
  • DataS, DataE, DataM
  • Writeback Ack
Host → Accelerator Requests
  • Invalidate
Accelerator → Host Responses
  • InvAck, Clean Writeback, Dirty Writeback
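The whole interface is thirteen message types in four groups, which is small enough to write down as a lookup table. The sketch below uses the message names from the slide; the `Direction` enum, the dictionary layout, and `is_legal` are illustrative choices, not part of the Crossing Guard design.

```python
from enum import Enum


class Direction(Enum):
    ACCEL_TO_HOST = "accelerator -> host"
    HOST_TO_ACCEL = "host -> accelerator"


# Message names come from the slide; the grouping structure is an assumption.
INTERFACE = {
    (Direction.ACCEL_TO_HOST, "request"): {"GetS", "GetM", "PutS", "PutE", "PutM"},
    (Direction.HOST_TO_ACCEL, "response"): {"DataS", "DataE", "DataM", "WritebackAck"},
    (Direction.HOST_TO_ACCEL, "request"): {"Invalidate"},
    (Direction.ACCEL_TO_HOST, "response"): {"InvAck", "CleanWriteback", "DirtyWriteback"},
}


def is_legal(direction: Direction, kind: str, message: str) -> bool:
    """True if the message type belongs to the given direction and kind."""
    return message in INTERFACE.get((direction, kind), set())


if __name__ == "__main__":
    print(is_legal(Direction.ACCEL_TO_HOST, "request", "GetM"))
    print(is_legal(Direction.ACCEL_TO_HOST, "request", "Invalidate"))
    print(sum(len(names) for names in INTERFACE.values()), "message types in total")
```

A type check like this only rejects messages that are never legal in a given direction; whether a legal message type is valid for a particular block at a particular moment depends on per-block state.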

2. Simple Interface
[Figure: Single-level Accelerator Cache using Crossing Guard Interface]

2. Simple Interface
  • Implemented Crossing Guard interface to two host protocols
    • AMD Hammer-like Exclusive MOESI
    • MESI Inclusive
  • Modularity: Host and Accelerator protocol choice is independent

2. Simple Interface
[Figure, two builds: a worked GetM example with a directory, the accelerator cache, and caches #0, #1, #2. The Crossing Guard table (columns Addr, State, Acks, Reqs, Timer) steps block A through I, IM, SM (Acks -2, then -1) and M. The directory moves A from SS (sharers 1, 2) through SM_MB to M (owner 0). Messages shown: GetM, Inv (Req: 0), Ack, Data (Acks: -2), DataM, UnblockM.]

3. Host Safety
[Figure, two builds: a directory, the accelerator cache, and caches #0, #1, #2. Build 1: the Crossing Guard entry for block A is in I while the accelerator sends an Ack; the directory holds A in SS with sharers 1, 2. Build 2: the Crossing Guard entry for A is in M and the directory holds A in MT with owner 0; the directory sends Inv (Req: dir) and moves to MT_I; the Crossing Guard entry moves to MI with Reqs = dir and Timer = 1210; Data is sent; the entry ends in I and the directory in WB. Timeline labels: Time: 200, 210, 500, 1000, 1500.]

3. Host Safety
Crossing Guard Guarantees to Host:
1. Accelerator requests must be correct
   a) Consistent with block stable state
   b) Consistent with block transient state
2. Accelerator responses must be correct
   a) Consistent with block stable state
   b) Consistent with block transient state
   c) Within a reasonable time
(+ Border Control Protections!)
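A rough software sketch of what checking guarantees 1a and 2c could look like. Everything here is an assumption for illustration: the per-state legality table, the `GuardSketch` class, and the timeout value are hypothetical and are not the transition tables used in Crossing Guard. Transient-state checks (1b, 2b) are left out to keep the example short.

```python
from dataclasses import dataclass, field

# Assumed legality table: requests that make sense in each stable state the
# guard has recorded for a block. Illustrative only.
ALLOWED_REQUESTS = {
    "I": {"GetS", "GetM"},
    "S": {"GetM", "PutS"},
    "E": {"PutE"},
    "M": {"PutM"},
}
RESPONSES = {"InvAck", "CleanWriteback", "DirtyWriteback"}
GRANTS = {"DataS": "S", "DataE": "E", "DataM": "M"}


@dataclass
class GuardSketch:
    timeout: int = 1000                            # hypothetical cycle budget
    state: dict = field(default_factory=dict)      # addr -> recorded stable state
    deadline: dict = field(default_factory=dict)   # addr -> invalidate deadline

    def accel_request(self, addr: int, msg: str) -> bool:
        """1a: forward a request only if it fits the recorded stable state."""
        return msg in ALLOWED_REQUESTS[self.state.get(addr, "I")]

    def host_data(self, addr: int, msg: str) -> None:
        """Record the state granted by a DataS / DataE / DataM response."""
        self.state[addr] = GRANTS[msg]

    def host_invalidate(self, addr: int, now: int) -> None:
        """Start the clock when the host asks the accelerator to invalidate."""
        self.deadline[addr] = now + self.timeout

    def accel_response(self, addr: int, msg: str, now: int) -> bool:
        """2c: accept a response only if one is outstanding and it is on time."""
        due = self.deadline.get(addr)
        if msg not in RESPONSES or due is None or now > due:
            return False
        del self.deadline[addr]
        self.state[addr] = "I"
        return True

    def overdue(self, now: int) -> list:
        """Blocks whose invalidate went unanswered past the deadline."""
        return sorted(a for a, due in self.deadline.items() if now > due)
```

Blocks returned by `overdue` are the ones where the guard, rather than the accelerator, would have to complete the transaction toward the host; how that is done is outside this sketch.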

Crossing Guard Variants
  • Full State Crossing Guard
    • Inclusive directory of accelerator state
    + Places few restrictions on host protocol
    + Can hide all errors
    - Requires tag + metadata storage for all blocks
  • Transactional Crossing Guard
    • Stores only data for in-flight transactions
    + Small storage
    + Provides most safety properties
    - Requires some host tolerance

Evaluation
1. Does it provide coherence to correct accelerator?
2. Does it provide safety to host?
3. Does it allow high performance?

Correctness Testing
  • Are coherence invariants maintained when accelerator is acting correctly?
  • How? Random tester
    • Store-Load pairs to random addresses
    • Check integrity of data
  • Local coverage: > 99%
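The store-load idea can be shown with a toy, single-threaded tester. This is a sketch of the general technique only, not the tester used in the evaluation; `random_test`, `DictMemory`, and `StaleMemory` are hypothetical names, and a real coherence tester drives many caches concurrently.

```python
import random


class DictMemory:
    """A trivially correct memory model used as the system under test."""

    def __init__(self) -> None:
        self.cells = {}

    def store(self, addr: int, value: int) -> None:
        self.cells[addr] = value

    def load(self, addr: int):
        return self.cells.get(addr)


class StaleMemory(DictMemory):
    """A deliberately broken model: stores to address 3 are dropped."""

    def store(self, addr: int, value: int) -> None:
        if addr != 3:
            super().store(addr, value)


def random_test(memory, num_pairs: int, num_addrs: int, seed: int = 0) -> list:
    """Issue store-load pairs to random addresses; return mismatches seen."""
    rng = random.Random(seed)
    mismatches = []
    for _ in range(num_pairs):
        addr = rng.randrange(num_addrs)
        value = rng.getrandbits(32)
        memory.store(addr, value)
        got = memory.load(addr)
        if got != value:
            mismatches.append((addr, value, got))
    return mismatches


if __name__ == "__main__":
    print(len(random_test(DictMemory(), 1000, 16)), "mismatches on correct model")
    print(len(random_test(StaleMemory(), 1000, 16)), "mismatches on broken model")
```

Using a small address range on purpose makes addresses collide often, which is what exercises the interesting cases.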

Fuzz Testing
  • Is host safety maintained when accelerator misbehaves?
  • How? Replace accelerator cache with evil controller
    • Generates random coherence messages to random addresses
  • Desired outcome: No deadlocks / crashes
  • Local Coverage: > 99.3%
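A minimal sketch of the message generator behind such a fuzz test. The accelerator-side message names follow the interface slide; the function name, the 64-byte block size, and the restriction to accelerator-side message types are assumptions made for illustration.

```python
import random

# Accelerator -> host message types from the interface slide.
ACCEL_MESSAGES = [
    "GetS", "GetM", "PutS", "PutE", "PutM",
    "InvAck", "CleanWriteback", "DirtyWriteback",
]


def evil_controller(num_messages: int, num_blocks: int, seed: int = 0,
                    block_bytes: int = 64):
    """Yield (address, message) pairs with no regard for protocol state."""
    rng = random.Random(seed)
    for _ in range(num_messages):
        addr = rng.randrange(num_blocks) * block_bytes
        yield addr, rng.choice(ACCEL_MESSAGES)


if __name__ == "__main__":
    for addr, msg in evil_controller(5, num_blocks=8, seed=1):
        print(hex(addr), msg)
```

Seeding the generator makes any deadlock or crash it provokes reproducible.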

Performance Testing
  • Tertiary concern, but cannot degrade performance too much
  • gem5-gpu
  • Rodinia workloads
  • CAVEATS:
    • Immaturity of workloads / infrastructure
    • Directly comparing coherence protocols hard
    • General trends only!

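Performance: Hammer-like
[Figure: performance results for the Hammer-like host protocol.]

Performance: MESI Inclusive
[Figure: performance results for the MESI Inclusive host protocol.]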

III. Crossing Guard Summary
  • Provides simple, standardized interface to ease accelerator development
  • Correctness when accelerator is correct
  • Host safety when accelerator is incorrect
  • Low performance overhead

Publications
  • “Crossing Guard: Mediating Host-Accelerator Coherence Interactions”
    Olson, Hill, Wood (under submission)
  • “Border Control: Sandboxing Accelerators”
    Olson, Power, Hill, Wood (MICRO 2015)
  • “Security Implications of Third-Party Accelerators”
    Olson, Sethumadhavan, Hill (CAL 2016)
  • “Probabilistic Directed Writebacks for Exclusive Caches”
    Olson, Hill (TR 2016)
  • “Revisiting Stack Caches for Energy Efficiency”
    Olson, Eckert, Manne, Hill (TR 2014)

Conclusion
  • Accelerators raise new security questions
  • We can design secure interfaces
    • To prevent bad memory accesses
    • To prevent coherence bugs
    • To ease accelerator development
    • At low overhead, so people might use them!

Questions?
[Photo: Investigating Border Control at the Canada-USA Border. Labels: CANADA; No passport.]

Backup Follows

Why now?
  • Breakdown of Dennard Scaling
  • 3D Die Stacking
  • Cool new programming models like HSA, CAPI allow unified memory address space
    • Less copying data
    • Great for programmability!
    • Tight integration with host

Company Reputations
“Companies would never produce malicious hardware, their reputation would be ruined!”

Border Control Operation
[Figure: accelerator with TLB and caches, Border Control with its BC Cache, the IOMMU, and memory holding the Protection Table. Legend: trusted data path, untrusted data path, address translation path, translation update path, Border Control update path.]

To summarize…
  • Full IOMMU
    • Safe, but no caches (slow)
  • Bypassable IOMMU
    • Has caches, TLB: very fast!
    • Totally unsafe
  • CAPI-like
    • Safe, and has caches and TLB…
    • But longer access latency, less designer control
Can we do better?
  • Border Control
    • Safe, physical caches + TLB, AND fast

EVALUATION: GPGPU → accelerator safety stress-test; gem5-gpu; Rodinia Benchmarks

Simulation Parameters
[Table not captured in the transcript.]

Comparison of Configurations
[Figure not captured in the transcript.]

Border Control Overheads: Highly-Threaded GPU
[Figure: performance overhead per benchmark.]
Takeaway: On average 0.15% performance overhead vs. unsafe

Border Control Cache
[Figure: sensitivity to BCC size.]
Takeaway: A small (1KB) BCC is sufficient for our workloads

TLB Shootdown Steps
  • If page was read-only:
    • Update entry in Protection Table and BCC
  • If page was read-write:
    1. Invalidate entry in TLB
    2. Flush dirty blocks from page in accelerator cache
    3. Update entry in Protection Table and BCC
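The two cases above can be traced with plain dictionaries and sets standing in for the hardware structures. This is a sketch under stated assumptions: `downgrade` and its parameters are hypothetical, permissions are modeled as sets of "r" and "w", and the accelerator cache is reduced to a set of dirty physical block addresses.

```python
PAGE_BYTES = 4096


def downgrade(ppn, new_perms, protection_table, bcc, tlb, dirty_blocks):
    """Apply a permission downgrade for one physical page.

    protection_table: dict ppn -> set of "r"/"w" (the in-memory table)
    bcc:              dict ppn -> set of "r"/"w" (cached copies only)
    tlb:              dict vpn -> ppn (accelerator TLB contents)
    dirty_blocks:     set of physical addresses of dirty accelerator blocks
    Returns the addresses that were flushed.
    """
    flushed = []
    if "w" in protection_table.get(ppn, set()):
        # 1. Invalidate the TLB entries that map to this page.
        for vpn in [v for v, p in tlb.items() if p == ppn]:
            del tlb[vpn]
        # 2. Flush dirty blocks belonging to this page.
        flushed = sorted(a for a in dirty_blocks if a // PAGE_BYTES == ppn)
        dirty_blocks.difference_update(flushed)
    # 3. (or the only step for a read-only page) update the table and the BCC.
    protection_table[ppn] = set(new_perms)
    if ppn in bcc:
        bcc[ppn] = set(new_perms)
    return flushed


if __name__ == "__main__":
    table = {5: {"r", "w"}, 6: {"r"}}
    bcc = {5: {"r", "w"}}
    tlb = {0x10: 5, 0x11: 6}
    dirty = {5 * PAGE_BYTES, 5 * PAGE_BYTES + 64}
    print(downgrade(5, {"r"}, table, bcc, tlb, dirty), table[5], tlb)
```

The flush happens before the permission update, so dirty data written while the page was still writable reaches memory before writes to that page start being rejected.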

Border Control Flush Overhead
[Figure: performance with permission downgrades.]
Takeaway: Permission downgrades affect performance, but not much

Information Flow Tracking
  • Goal: track untrusted information, prevent it from modifying sensitive data / control
    • e.g., prevent buffer overflow in software
  • Hardware-assisted techniques: prevent threats from bugs in software (same address space); a different threat than Border Control
  • Hardware (e.g. Tiwari et al., ISCA 2011): very powerful technique, but high area/runtime overhead and not transparent to software

Mondriaan
  • Replacement for traditional page table + TLB
  • Allows fine-grained permissions
  • Border Control is independent of the policy for deciding permissions
    • But permission granularity might mean alternate Protection Table organizations are better

Single-Level Cache
[Figure not captured in the transcript.]

Simulation Parameters
[Table not captured in the transcript.]

Time Spent Simulating (Random)
Configuration                       Time
XG Full + Hammer + 1 Level          5.28 years
XG Full + Hammer + 2 Level          2.51 years
XG Full + MESI Inc + 1 Level        133 days
XG Full + MESI Inc + 2 Level        223 days
XG Trans. + Hammer + 1 Level        3.17 years
XG Trans. + Hammer + 2 Level        1.38 years
XG Trans. + MESI Inc + 1 Level      90 days
XG Trans. + MESI Inc + 2 Level      103 days
TOTAL                               13.9 years

Full Coverage %s (Random)
Full State XG       Single-level   Two-level
Hammer-like         99             99.8
MESI Inclusive      100            99.4

Transactional XG    Single-level   Two-level
Hammer-like         99.3           99.5
MESI Inclusive      100            99.7

Time Spent Simulating (Fuzz)
Configuration                       Time
XG Full + Hammer-like               1.62 years
XG Full + MESI Inclusive            287 days
XG Transactional + Hammer-like      5.3 years
XG Transactional + MESI Inclusive   41 days
Total                               7.82 years

Full Coverage %s (Fuzz)
Full State Crossing Guard       Fuzz Tester
Hammer-like                     99.3
MESI Inclusive                  99.7

Transactional Crossing Guard    Fuzz Tester
Hammer-like                     99.7
MESI Inclusive                  100

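Performance: Hammer-like
[Figure: performance results for the Hammer-like host protocol.]

Performance: MESI Inclusive
[Figure: performance results for the MESI Inclusive host protocol.]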

Template
[Figure template: directory, accelerator cache, and caches #0, #1, #2 with state tables for block A (Crossing Guard columns: Addr, State, Acks, Reqs, Timer; directory columns: Addr, State, Owner/Sharers, Req). Messages shown: GetM, Ack.]

Old Slides

3. Why Host Safety?
[Figure, two builds: an accelerator cache and a directory. Build 1: the accelerator sends Ack (Addr: A) while the directory records Addr A as not present in caches; labels: Addr A: ?, Addr A: RW, “? ? ?”. Build 2: the accelerator cache holds Addr A in M, the directory records A as M, owned by accelerator, and sends Fwd-GetM (Addr: A).]

Crossing Guard Example
[Figure, two builds: the accelerator cache holds Addr A in M (Addr A: RW); the directory records A as M, owned by accelerator. Crossing Guard records “A: waiting for WB”. Messages shown: Fwd-GetM (Addr: A), Invalidate (Addr: A), Writeback (Addr: A).]

Where to next?

What I’ve Learned
1. Anticipate questions, make backup slides =)
2. Talk to colleagues! They’re really smart.
3. If you can’t explain why your idea is exciting, no one will care about it.
4. Be confident!
