Jason Jong Kyu Park, Yongjun Park, and Scott Mahlke

Presentation transcript:

ELF: Maximizing Memory-level Parallelism for GPUs with Coordinated Warp and Fetch Scheduling
Jason Jong Kyu Park¹, Yongjun Park², and Scott Mahlke¹
¹University of Michigan, Ann Arbor   ²Hongik University

GPUs in Modern Computer Systems

GPU Usage
Throughput-oriented: resource utilization
Latency-critical: frame rate

Reality: Achieved Performance
A GTX 480 achieves only 40% of peak performance.

GPU Execution Model
Inside a streaming multiprocessor (SM)
[Diagram: fetch unit, warps, and warp schedulers (issue) inside an SM]

GPU Execution Model
Commonly used warp scheduling policies:
RR (round-robin)
GTO (greedy-then-oldest)
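As a point of reference for the two baselines named above, here is a minimal, illustrative sketch (not taken from the paper) of how an RR and a GTO warp scheduler pick the next warp to issue; the Warp fields and the assumption that warp id equals list index are mine.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Warp:
    wid: int            # warp id (assumed equal to its list index)
    age: int            # lower = older (assigned at launch)
    ready: bool = True  # has an issuable instruction this cycle

def rr_select(warps: List[Warp], last_issued: int) -> Optional[Warp]:
    """Round-robin: start searching from the warp after the last issued one."""
    n = len(warps)
    for i in range(1, n + 1):
        w = warps[(last_issued + i) % n]
        if w.ready:
            return w
    return None

def gto_select(warps: List[Warp], greedy_wid: Optional[int]) -> Optional[Warp]:
    """Greedy-then-oldest: keep issuing the same warp until it stalls,
    then fall back to the oldest ready warp."""
    if greedy_wid is not None and warps[greedy_wid].ready:
        return warps[greedy_wid]
    ready = [w for w in warps if w.ready]
    return min(ready, key=lambda w: w.age) if ready else None
```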

Exposed Memory Latency
Assuming the Maxwell architecture:
~300 cycles of memory latency
16 warps per warp scheduler (4 warp schedulers)
Up to 2 instruction issues per cycle
Number of instructions per warp needed to hide memory latency: 300 / 15 × 2 ≈ 40
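A back-of-the-envelope reading of the number above; the split of 300/15 as "latency divided by the other warps available to hide it" is my interpretation, not stated explicitly on the slide.

```latex
\[
\underbrace{\frac{300 \ \text{cycles}}{15 \ \text{other warps}}}_{\text{cycles each warp must cover}}
\times
\underbrace{2 \ \tfrac{\text{instructions}}{\text{cycle}}}_{\text{issue width}}
\approx 40 \ \text{instructions per warp}
\]
```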

Stall Reasons
[Chart: breakdown of stall cycles by reason; values shown include 18%, 6%, 27%, and 8%]

Objective of This Work: Maximize Memory-level Parallelism
Problem: memory dependency stalls
Solution: Earliest Load First (ELF) scheduling; coordinated warp and fetch scheduling
[Diagram: L1I cache, fetch, warp schedulers (issue), LSU]

Objective of This Work: Maximize Memory-level Parallelism
Problem: fetch stalls
Solution: instruction prefetching
[Same pipeline diagram, highlighting the L1I cache and fetch stage]

Objective of This Work: Maximize Memory-level Parallelism
Problem: memory conflict stalls
Solution: memory reordering
[Same pipeline diagram, highlighting the LSU]

Earliest Load First (ELF)
Overlap memory latency with other operations.
[Timeline figure: per-warp instruction streams (W0-W3) with compute, memory load, and memory latency segments; the GTO schedule loses issue slots waiting on memory, while ELF reaches the loads earlier]
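A minimal sketch of the ELF selection rule as I read it from the slides: among ready warps, prefer the one whose next memory load is fewest instructions away, so that loads are issued, and their latency overlapped, as early as possible. The field name dist_to_next_load is illustrative.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Warp:
    wid: int
    age: int                 # lower = older
    ready: bool
    dist_to_next_load: int   # instructions until the next memory load (from compile-time hints)

def elf_select(warps: List[Warp]) -> Optional[Warp]:
    """Earliest Load First: issue the ready warp that will reach a
    memory load soonest; break ties with age (oldest first)."""
    ready = [w for w in warps if w.ready]
    if not ready:
        return None
    return min(ready, key=lambda w: (w.dist_to_next_load, w.age))
```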

Distance to Next Memory Load
Detected at compile time.
[Figure: instruction stream of compute, memory load, and branch instructions annotated with distances such as 4, 2, 8, 5]
Most of the time the distance simply decreases by one per issued instruction; it is recomputed after a memory load or a branch.
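A hypothetical sketch of how a per-warp distance counter could be maintained at issue time, following the rule above (decrement on ordinary instructions, reload the compiler-provided value after loads and branches); the program_points table and instruction fields are invented for illustration.

```python
def update_distance(warp, inst, program_points):
    """Update warp.dist_to_next_load after issuing `inst`.

    program_points maps a PC to the compile-time distance (in
    instructions) from that point to the next memory load.
    """
    if inst.is_memory_load or inst.is_branch:
        # The simple countdown no longer holds: reload the distance
        # embedded for the instruction that executes next.
        warp.dist_to_next_load = program_points[inst.next_pc]
    else:
        # Common case: one instruction closer to the next load.
        warp.dist_to_next_load = max(0, warp.dist_to_next_load - 1)
```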

Transferring Program Points
The JIT compiler embeds program-point information (distances to the next memory load) in the binary.
[Figure: control-flow example with compute, memory load, and branch instructions and distances such as 7, 4, 21, 8, 11, 17; at a branch, the minimum over the successor paths is taken]
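A simplified sketch, under my reading of the slide, of how such distances could be computed at compile time with a backward pass over the instruction stream, taking the minimum over successor paths at branches; the IR representation is made up for illustration.

```python
def compute_load_distances(instructions, successors):
    """For each instruction index i, compute the minimum number of
    instructions until the next memory load along any path.

    instructions: list of objects with an `is_memory_load` flag
    successors:   dict mapping index -> list of successor indices
    """
    INF = float("inf")
    dist = [INF] * len(instructions)
    changed = True
    while changed:                      # iterate to a fixed point (handles loops)
        changed = False
        for i in reversed(range(len(instructions))):
            if instructions[i].is_memory_load:
                d = 0
            else:
                succ = [dist[s] + 1 for s in successors.get(i, [])]
                d = min(succ) if succ else INF
            if d < dist[i]:
                dist[i] = d
                changed = True
    return dist
```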

Coordinated Warp and Fetch Scheduling
Use the same warp priority for instruction fetch as for issue.
[Diagram: fetch and the issue schedulers share the same priority order]

Instruction Prefetching
Prefetch instructions, but avoid memory contention: prefetch only when outstanding memory requests are below half of the maximum.
[Diagram: an L1I cache miss triggers the prefetch path ahead of fetch]
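A tiny sketch of the gating condition stated above; the parameter names are mine.

```python
def should_prefetch(outstanding_mem_requests: int, max_mem_requests: int) -> bool:
    """Issue an instruction prefetch only when the memory system is less
    than half occupied, so prefetches do not contend with data requests."""
    return outstanding_mem_requests < max_mem_requests // 2
```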

Memory Reordering
Allow other memory requests to go first.
When a request stalls in the LSU because of a memory conflict (MSHR full, associativity stall), it is set aside and re-executed later; requests that hit under a miss or have no conflict proceed.
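A rough sketch of the reordering idea, assuming a simple replay queue in front of the LSU; the cache interface and its return values are invented for illustration.

```python
from collections import deque

def lsu_cycle(new_requests, cache, replay_queue: deque):
    """Memory reordering sketch: requests that would stall on a memory
    conflict (MSHR full, associativity stall) are set aside and retried
    later, letting conflict-free requests behind them go first."""
    pending = list(replay_queue) + list(new_requests)
    replay_queue.clear()
    for req in pending:
        status = cache.access(req)      # assumed to return 'hit', 'miss', or 'conflict'
        if status == 'conflict':
            replay_queue.append(req)    # re-execute later
        # 'hit' and 'miss' (hit-under-miss, no conflict) proceed as usual
```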

Memory Reordering (example)
[Figure legend: load from Warp 0, store from Warp 0, load/store from Warp N, stalled requests]

Experimental Setup
Simulator: GPGPU-Sim v3.2.2, modeling a GTX 480
  1536 threads per SM
  32768 registers per SM
  48 kB shared memory per SM
Workloads: 25 benchmarks from the NVIDIA SDK, Parboil, and Rodinia
Baseline: greedy-then-oldest (GTO) scheduler

Performance Contribution
[Chart: per-technique performance contributions; values shown include 4.1%, 4.9%, 2.5%, 12%, 49%, 54%, 79%, and 79%]

Performance Comparison (speedup over GTO)
2-LV: -3%, CCWS: +1%, DYNCTA: -2% — good for cache-sensitive applications, but not better than GTO in general
ELF: +12% — tackles three types of stalls, effective in general

Summary
Memory latency is still a problem in GPUs: a compute-to-memory ratio of ≈ 40 is needed to hide it.
ELF maximizes memory-level parallelism through:
  Earliest-load-first scheduling
  Coordinated warp and fetch scheduling
  Instruction prefetching
  Memory reordering
Result: 12% performance improvement

Questions?