Neither More Nor Less: Optimizing Thread-level Parallelism for GPGPUs Onur Kayıran, Adwait Jog, Mahmut Kandemir, Chita R. Das.

Presentation transcript:

Neither More Nor Less: Optimizing Thread-level Parallelism for GPGPUs Onur Kayıran, Adwait Jog, Mahmut Kandemir, Chita R. Das

GPU Computing
GPUs are known for providing high thread-level parallelism (TLP).
(Diagram: evolution from single-core uni-processors, to multi-cores, to many-cores / GPUs.)

"Too much of anything is bad, but too much good whiskey is barely enough" - Mark Twain

Executive Summary
- Current state-of-the-art thread-block schedulers make use of the maximum available TLP
- More threads → more memory requests → contention in the memory sub-system
- Proposal: a thread-block scheduling algorithm that optimizes TLP and reduces memory sub-system contention
- Improves average application performance by 28%

Outline: Proposal, Background, Motivation, DYNCTA, Evaluation, Conclusions

GPU Architecture (diagram: threads are grouped into warps and Cooperative Thread Arrays (CTAs); the CTA scheduler assigns CTAs to SIMT cores, each with its own warp scheduler, ALUs, and L1 caches; the cores connect through an interconnect to the L2 cache and DRAM)

GPU Scheduling (diagram: the CTA scheduler distributes CTAs to cores; within each core, the warp scheduler selects warps from the assigned CTAs and feeds them into the pipeline)

Properties of CTAs
- Threads within a CTA synchronize using barriers.
- There is no synchronization across threads belonging to different CTAs.
- CTAs can be distributed to cores in any order.
- Once assigned to a core, a CTA cannot be preempted.
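To make the barrier property concrete, here is a minimal CUDA sketch (the kernel name, the 256-thread block size, and the shared-memory tile are illustrative assumptions, not from the talk). __syncthreads() waits only for the threads of the same CTA; CUDA provides no corresponding barrier across CTAs within a kernel launch.

// Minimal sketch: a barrier synchronizes only the threads of one CTA (thread block).
// Kernel name, 256-thread blocks, and the shared tile are assumptions for illustration.
__global__ void blockLocalReverse(const int *in, int *out, int n)
{
    __shared__ int tile[256];                    // private to this CTA
    int gid  = blockIdx.x * blockDim.x + threadIdx.x;
    int base = blockIdx.x * blockDim.x;

    if (gid < n)
        tile[threadIdx.x] = in[gid];

    __syncthreads();   // waits for the threads of THIS CTA only; other CTAs are unaffected

    int src = blockDim.x - 1 - threadIdx.x;      // reverse within the CTA's own tile
    if (gid < n && base + src < n)
        out[gid] = tile[src];
}

Any coordination across CTAs has to go through separate kernel launches or global-memory atomics, which is why the CTA scheduler is free to distribute CTAs to cores in any order.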

Properties of CTAs
The number of CTAs executing on a core is limited by:
- the number of threads per CTA
- the amount of shared memory per core
- the number of registers per core
- a hard limit (which depends on the CUDA version for NVIDIA GPUs)
- the resources required by the application kernel
These factors in turn limit the available TLP on the core. By default, if enough CTAs are available, a core executes the maximum number of CTAs that these limits allow, as the sketch below illustrates.
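The resulting per-core (per-SM) CTA limit can be queried from the CUDA runtime. A minimal host-side sketch, assuming a placeholder kernel, a 256-thread block size, and no dynamic shared memory:

#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel; in practice, pass the actual kernel whose occupancy you care about.
__global__ void myKernel(float *data) { if (data) data[threadIdx.x] = 0.0f; }

int main()
{
    int    blockSize    = 256;   // threads per CTA (assumed)
    size_t dynSmemBytes = 0;     // dynamic shared memory per CTA (assumed)

    // The runtime combines the register, shared-memory, thread, and hard CTA limits
    // and reports how many CTAs of this kernel can be resident on one SM.
    int ctasPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&ctasPerSM, myKernel,
                                                  blockSize, dynSmemBytes);

    printf("Resident CTAs per SM for this kernel: %d\n", ctasPerSM);
    return 0;
}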

Outline: Proposal, Background, Motivation, DYNCTA, Evaluation, Conclusions

Effect of TLP on GPGPU Performance (plot; labeled potential improvements: 19% and 39%)

Effect of TLP on GPGPU Performance (plot; labeled potential improvements: 16%, 51%, and 95%)

Why is more TLP not always optimal?

Why is more TLP not always optimal?
- More threads result in a larger working data set, which causes cache contention.
- More L1 misses cause more network injections, so network latency increases.
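A rough, illustrative calculation (the per-thread footprint here is an assumption, not a number from the talk): if 1024 resident threads each keep just 64 bytes of frequently reused data, their aggregate working set is 64 KB, twice the 32 KB L1 data cache used in the evaluation, so lines tend to be evicted before they are reused; lowering the resident thread count shrinks that footprint and, with it, the L1 miss traffic injected into the network.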

Outline: Proposal, Background, Motivation, DYNCTA, Evaluation, Conclusions

DYNCTA Approach
Executing the optimal number of CTAs for each application would require exhaustive per-application analysis, and is therefore impractical.
Idea: dynamically modulate the number of CTAs on each core using the CTA scheduler.

DYNCTA Approach
Objective 1: keep the cores busy.
- If a core has nothing to execute, give it more threads.
Objective 2: do not keep the cores TOO busy.
- If the memory sub-system is congested due to a high number of threads, lower TLP to reduce contention.
- If the memory sub-system is not congested, increase TLP to improve latency tolerance.

DYNCTA Approach
Objective 1: keep the cores busy.
- Monitor C_idle, the number of cycles during which a core does not have anything to execute.
- If it is high, increase the number of CTAs executing on the core.
Objective 2: do not keep the cores TOO busy.
- Monitor C_mem, the number of cycles during which a core is waiting for data to come back from memory.
- If it is low, increase the number of CTAs executing on the core.
- If it is high, decrease the number of CTAs executing on the core.

DYNCTA Overview (decision table, following the two objectives above)
- C_idle high → increase the number of CTAs
- C_idle low and C_mem low → increase the number of CTAs
- C_idle low and C_mem medium → no change in the number of CTAs
- C_idle low and C_mem high → decrease the number of CTAs
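A minimal sketch of this modulation logic, using the counters and the threshold values listed in the backup parameter table (t_idle = 16, t_mem_l = 128, t_mem_h = 384); the function and structure names are illustrative, not the authors' implementation:

// Illustrative sketch of DYNCTA's per-core CTA modulation (names assumed).
// Called once per sampling period for every core by the CTA scheduler.
struct CoreState {
    int n;        // number of CTAs currently allowed to run on this core
    int c_idle;   // cycles with no threads to execute (pipeline not stalled)
    int c_mem;    // cycles during which all warps were waiting on memory
};

void modulateCTAs(CoreState &core, int n_max,
                  int t_idle = 16, int t_mem_l = 128, int t_mem_h = 384)
{
    if (core.c_idle > t_idle) {
        if (core.n < n_max) core.n++;       // core starves for work: raise TLP
    } else if (core.c_mem < t_mem_l) {
        if (core.n < n_max) core.n++;       // memory not congested: more TLP hides latency
    } else if (core.c_mem > t_mem_h) {
        if (core.n > 1) core.n--;           // memory congested: pause a CTA to cut contention
    }
    // Medium c_mem: leave the CTA count unchanged.

    core.c_idle = 0;                         // reset counters for the next sampling period
    core.c_mem  = 0;
}

Per the backup slides, each core would start from n = ⌊N/2⌋, where N is the per-core CTA limit described earlier.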

CTA Pausing
Once assigned to a core, a CTA cannot be preempted! Then, how to decrement the number of CTAs?
PAUSE: warps of the most recently assigned CTA are deprioritized in the warp scheduler.
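A sketch of how such pausing could look inside a warp scheduler (an assumption-level illustration, not the authors' code): warps of the most recently assigned CTA are marked as paused and are only issued when no other warp is ready.

// Illustrative warp selection with CTA pausing (names and structure assumed).
#include <vector>

struct Warp {
    int  cta_id;   // CTA this warp belongs to
    bool ready;    // warp has an issuable instruction this cycle
};

// Returns the index of the warp to issue, or -1 if nothing is ready.
// 'paused_cta' is the most recently assigned CTA whenever the allowed
// CTA count drops below the number of CTAs already resident on the core.
int selectWarp(const std::vector<Warp> &warps, int paused_cta)
{
    int pausedFallback = -1;
    for (int i = 0; i < (int)warps.size(); ++i) {
        if (!warps[i].ready) continue;
        if (warps[i].cta_id != paused_cta)
            return i;                 // prefer warps of non-paused CTAs
        if (pausedFallback < 0)
            pausedFallback = i;       // remember a ready-but-deprioritized warp
    }
    return pausedFallback;            // a paused warp issues only if nothing else can
}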

Outline: Proposal, Background, Motivation, DYNCTA, Evaluation, Conclusions

Evaluation Methodology
Evaluated on GPGPU-Sim, a cycle-accurate GPU simulator.
Baseline architecture:
- 30 SIMT cores, 8 memory controllers, crossbar connected
- 1300 MHz, SIMT width = 8, max threads/core
- 32 KB L1 data cache, 8 KB texture and constant caches
- GDDR3 at 800 MHz
Applications considered (31 in total) from:
- MapReduce applications
- Rodinia (heterogeneous applications)
- Parboil (throughput-computing-focused applications)
- NVIDIA CUDA SDK (GPGPU applications)

Dynamism (plots: the number of CTAs over time under DYNCTA, shown against the default number of CTAs and the optimal number of CTAs)

Average Number of CTAs

IPC (plot of IPC results; labeled improvements: 39%, 13%, and 28%)

Outline: Proposal, Background, Motivation, DYNCTA, Evaluation, Conclusions

Conclusions
- Maximizing TLP is not always optimal in terms of performance.
- We propose a CTA scheduling algorithm, DYNCTA, that optimizes TLP at the cores based on application characteristics.
- DYNCTA reduces cache, network, and memory contention.
- DYNCTA improves average application performance by 28%.

THANKS! QUESTIONS?

BACKUP

Utilization (diagram: timelines of CTAs 1-8 assigned to Core 1 and Core 2 under two different assignments, showing the resulting idle periods)

Initial n
All cores are initialized with ⌊N/2⌋ CTAs. Starting with 1 CTA and starting with ⌊N/2⌋ CTAs usually converge to the same value. Starting with the default number of CTAs might not be as effective.

Comparison against the optimal CTA count
The optimal number of CTAs might differ across execution intervals for applications that exhibit compute-intensive and memory-intensive behavior at different intervals. Our algorithm outperforms the optimal (static) number of CTAs in some applications.

Parameters
- Nact: active time, where cores can fetch new warps
- Ninact: inactive time, where cores cannot fetch new warps
- RACT: active time ratio, Nact / (Nact + Ninact)
- C_idle: the number of core cycles during which the pipeline is not stalled, but there are no threads to execute
- C_mem: the number of core cycles during which all the warps are waiting for their data to come back
- t_idle: threshold that determines whether C_idle is low or high (value: 16)
- t_mem_l & t_mem_h: thresholds that determine whether C_mem is low, medium, or high (values: 128 & 384)
- Sampling period: the number of cycles over which counters are gathered before a modulation decision is made

Round Trip Fetch Latency

Other Metrics
- L1 data miss rate: 71% → 64%
- Network latency: ↓ 33%
- Active time ratio: ↑ 14%

Sensitivity
- Larger systems with 56 and 110 cores: around 20% performance improvement
- MSHR size reduced from 64 to 32 and 16: 0.3% and 0.6% performance loss, respectively
- DRAM frequency of 1333 MHz: 1% performance loss
- Sampling period between 2048 and 4096 cycles: 0.1% performance loss
- Thresholds varied between 50% and 150% of the default values: losses between 0.7% and 1.6%