Managing GPU Concurrency in Heterogeneous Architectures

Presentation transcript:

Managing GPU Concurrency in Heterogeneous Architectures (MICRO 2014)

Baseline heterogeneous architecture
- Throughput-optimized GPU cores and latency-optimized CPU cores on the same chip
- All cores are connected to the shared last-level cache (LLC) and memory controllers (MCs) via an on-chip interconnect
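As a rough illustration only (not from the paper), the baseline chip could be captured in a configuration struct like the one below; every field name, and the idea of an explicit per-core warp limit, is an assumption made for exposition.

```cpp
// Illustrative sketch of the baseline heterogeneous chip described above.
// All identifiers are assumptions, not taken from the paper.
struct HeterogeneousChipConfig {
    int num_gpu_cores;           // throughput-optimized, many warps in flight per core
    int num_cpu_cores;           // latency-optimized, few threads per core
    int num_llc_slices;          // banks of the shared last-level cache (LLC)
    int num_memory_ctrls;        // memory controllers (MCs)
    int max_warps_per_gpu_core;  // upper bound on GPU TLP, the knob tuned later
    // CPU and GPU cores share the interconnect, LLC, and MCs, which is where
    // the two application classes interfere with each other.
};
```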

Motivation: application interference
- Contention in shared hardware resources causes performance loss (20%)
- The GPU is the dominant consumer of shared resources due to its high thread-level parallelism (TLP)
- CPU applications are affected much more than GPU applications (85%)

Motivation: latency tolerance
- High GPU thread-level parallelism leads to memory system congestion and low CPU performance
- The GPU can tolerate memory latency through multi-threading
- Better performance is possible at lower TLP, which motivates GPU TLP management

Effects of GPU TLP on GPU performance
When GPU TLP is reduced, GPU performance can be:
- Better: less cache thrashing and less congestion in the memory subsystem
- Worse: reduced parallelism and latency tolerance
- Unchanged

Effects of GPU TLP on CPU performance
When GPU TLP is reduced, CPU performance can be:
- Better: less congestion in the memory subsystem
- Unchanged

Proposal I: CM-CPU (CPU-Centric Concurrency Management)
- Main goal: reduce GPU concurrency to boost CPU performance
- Two congestion metrics:
  - Memory congestion: number of requests stalled because a memory controller is full
  - Network congestion: number of requests stalled because the reply network is full
- CPU performance drops as congestion rises

Proposal I: CM-CPU (continued)
- Each congestion metric is classified as low, medium, or high
- If at least one metric is high: decrease the number of active warps
- If both metrics are low: increase the number of active warps
- Otherwise: keep the number of warps unchanged
- Downside: insufficient GPU latency tolerance due to low TLP
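To make the decision rule concrete, here is a minimal C++ sketch of the CM-CPU logic as described on these slides. The threshold values, the one-warp step size, the sampling-interval framing, and all identifiers are illustrative assumptions; the paper's actual mechanism may differ in details.

```cpp
#include <algorithm>

enum class Level { Low, Medium, High };

// Classify a congestion metric (stalled requests observed in the sampling
// interval) against two assumed thresholds.
Level classify(int stalled_requests, int t_low, int t_high) {
    if (stalled_requests < t_low)  return Level::Low;
    if (stalled_requests < t_high) return Level::Medium;
    return Level::High;
}

// CM-CPU: once per interval, adjust the number of active warps on a GPU core
// based on memory-controller and reply-network congestion.
int cm_cpu_update(int active_warps, Level mem_congestion, Level net_congestion,
                  int min_warps, int max_warps) {
    if (mem_congestion == Level::High || net_congestion == Level::High)
        return std::max(active_warps - 1, min_warps);  // throttle GPU concurrency
    if (mem_congestion == Level::Low && net_congestion == Level::Low)
        return std::min(active_warps + 1, max_warps);  // allow more concurrency
    return active_warps;                               // otherwise, unchanged
}
```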

Proposal II: CM-BAL (Balanced Concurrency Management)
- stallGPU: number of cycles in which a GPU core fails to issue a warp
- stallGPU captures the latency tolerance of GPU cores: a high stallGPU indicates low latency tolerance and high memory contention

Proposal II: CM-BAL (continued)
- Part 1: the same as CM-CPU
- Part 2: override CM-CPU and increase GPU TLP when stallGPU grows by more than a threshold k
- A higher k makes it more difficult to improve GPU latency tolerance
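Building on the cm_cpu_update sketch above, one possible reading of CM-BAL's two parts in code; treating k as a bound on the per-interval growth of stallGPU and the one-warp step size are assumptions made for illustration.

```cpp
// CM-BAL: Part 1 reuses the CM-CPU decision; Part 2 overrides it and raises
// GPU TLP when latency tolerance has dropped, i.e. when stallGPU grew by more
// than k since the previous interval. Reuses Level and cm_cpu_update from the
// CM-CPU sketch above.
int cm_bal_update(int active_warps, Level mem_congestion, Level net_congestion,
                  long long stall_gpu_now, long long stall_gpu_prev,
                  long long k, int min_warps, int max_warps) {
    int decision = cm_cpu_update(active_warps, mem_congestion, net_congestion,
                                 min_warps, max_warps);       // Part 1
    if (stall_gpu_now - stall_gpu_prev > k)                   // Part 2
        decision = std::min(active_warps + 1, max_warps);     // favor the GPU
    return decision;  // a larger k makes this override fire less often
}
```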

GPU performance results (GPU/CPU workloads)
- DYNCTA: +2%
- CM-CPU: -11%
- CM-BAL1: +7%

CPU performance results (GPU/CPU workloads)
- DYNCTA: +2%
- CM-CPU: +24%
- CM-BAL1: +7%

System performance
- Overall System Speedup = (1 − α) × WS_CPU + α × SU_GPU
- α is between 0 and 1; a higher α gives higher importance to GPU performance
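As a small numeric illustration of how α trades off the two sides, the sketch below evaluates the metric using CM-CPU's reported changes from the results slides (+24% CPU, -11% GPU) as example inputs; treating those percentages as normalized speedups of 1.24 and 0.89 is an assumption made only for this example.

```cpp
// Overall System Speedup: OSS = (1 - alpha) * WS_CPU + alpha * SU_GPU.
double overall_system_speedup(double ws_cpu, double su_gpu, double alpha) {
    return (1.0 - alpha) * ws_cpu + alpha * su_gpu;
}

// Example with assumed inputs ws_cpu = 1.24, su_gpu = 0.89:
//   alpha = 0.0  -> OSS = 1.24   (only CPU performance matters)
//   alpha = 0.5  -> OSS = 1.065  (CPU and GPU weighted equally)
//   alpha = 1.0  -> OSS = 0.89   (only GPU performance matters)
```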

Conclusions
- Sharing the memory hierarchy causes CPU and GPU applications to interfere with each other
- Existing GPU TLP management techniques are not well-suited for heterogeneous architectures
- This work proposes two GPU TLP management techniques for heterogeneous architectures:
  - CM-CPU reduces GPU TLP to improve CPU performance
  - CM-BAL is similar to CM-CPU, but increases GPU TLP when it detects low latency tolerance in GPU cores
  - TLP can be tuned based on the user's preference for higher CPU or GPU performance