Classification-Driven Search for Effective SM Partitioning in Multitasking GPUs
Xia Zhao*, Zhiying Wang+, Lieven Eeckhout*
*Ghent University, +National University of Defense Technology
The 32nd ACM International Conference on Supercomputing (ICS 2018)

GPU deployment in the cloud
GPU multitasking support

GPU spatial multitasking
GPU with 20 streaming multiprocessors (SMs); apps execute on different SMs.
Even SM partitioning (widely used): same SM count per app.
2 applications: App1 - 10 SMs, App2 - 10 SMs.

Alternative SM partitioning
GPU with 20 streaming multiprocessors (SMs).
Even SM partitioning: App1 - 10 SMs, App2 - 10 SMs.
Uneven SM partitioning, possibly power-gating unused SMs.

Why SM partitioning matters
Exploring uneven SM partitioning on a 24-SM GPU, comparing even and uneven SM partitionings for 3 typical workloads (X_Y denotes the number of SMs assigned to the left and right application):
GAUSSIAN_LEU - performance opportunity
LBM_DWT2D - energy opportunity
BINO_LEU - no opportunity
How do we find the most effective SM partition?

Workload classification: heterogeneous workloads
Memory-intensive apps: LBM, DWT2D, LAVAMD. Compute-intensive apps: DXTC, LEU, MERGE.
Key take-away messages:
Opportunity: optimize system throughput (STP) by shifting SMs from the memory-intensive app to the compute-intensive app.
Overall system performance can be significantly improved through SM partitioning for heterogeneous workload mixes; system performance is maximized by assigning a relatively small fraction of the available SMs to the memory-intensive application.

Workload classification: memory-intensive workloads
Memory-intensive apps: LBM, DWT2D, LAVAMD, GAUSSIAN.
Key take-away messages:
Opportunity: reduce power consumption. Fewer SMs should be assigned to memory-intensive apps, and performance may even improve for cache-sensitive apps.
(i) When co-executing memory-intensive applications, there is no need to allocate all available SMs in the GPU: optimum performance is achieved when allocating a fraction of the available SMs, which creates an opportunity to save power. (ii) The number of SMs assigned to each memory-intensive application depends on the co-executing application and hence needs to be determined dynamically. (iii) Reducing the number of active SMs sometimes leads to a performance boost due to reduced cache contention.

Workload classification: compute-intensive workloads
Compute-intensive apps: BINO, LEU, DXTC, MERGE.
Key take-away messages:
Cannot optimize performance and cannot optimize power; even SM partitioning should be kept.
Different workloads offer different opportunities, which points towards classification-driven search.

SM sensitivity - compute apps
Performance increases linearly with SM count.

SM sensitivity - memory apps
Performance saturates quickly as SM count grows.
How do we classify applications?

BW utilization is not enough
Candidate metrics: LLC APKI (accesses per kilo instructions), LLC miss rate, and memory BW utilization.
The previously used memory bandwidth utilization does not work well; classification based on a single metric is not effective.

Off-SM bandwidth model
The GPU is split into an SM part (the streaming multiprocessors, connected through a crossbar network) and an off-SM part (the L2 slices and memory controllers, MCs).
SM part: total bandwidth the apps demand. Off-SM part: total bandwidth the GPU provides.
Bandwidth demand < provided bandwidth: performance increases linearly with SM count.
Bandwidth demand > provided bandwidth: performance saturates quickly.

Data bandwidth the GPU provides
Total data bandwidth is determined by the memory bandwidth, the L2 cache bandwidth, and the NoC bandwidth:
Provided bandwidth = min {NoC_BW, F(MEM, L2)}
F(MEM, L2) is built from the L2 bandwidth, the L2 cache hit ratio, and the memory bandwidth utilization.
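To make the provided-bandwidth side concrete, here is a minimal Python sketch. The slide says F(MEM, L2) is built from the L2 bandwidth, the L2 hit ratio, and the memory bandwidth utilization; the specific functional form below (memory bandwidth amplified by the L2 hit ratio, capped by the raw L2 bandwidth) is our assumption for illustration, not the paper's exact formula.

```python
def provided_bandwidth(noc_bw, l2_bw, mem_bw, l2_hit_ratio):
    """Data bandwidth (GB/s) the off-SM part can supply to the SMs.

    Assumption: every byte delivered to the SMs passes through the L2,
    so an L2 hit ratio h amplifies memory bandwidth by 1 / (1 - h),
    capped by the raw L2 bandwidth. The NoC caps the final result.
    """
    if l2_hit_ratio >= 1.0:
        f_mem_l2 = l2_bw  # all hits: the L2 itself is the limit
    else:
        f_mem_l2 = min(l2_bw, mem_bw / (1.0 - l2_hit_ratio))
    return min(noc_bw, f_mem_l2)

# Example with the NoC and memory bandwidths from the evaluation setup
# (538 GB/s and 177 GB/s); the 600 GB/s L2 bandwidth and 0.4 hit ratio
# are illustrative values.
print(provided_bandwidth(noc_bw=538, l2_bw=600, mem_bw=177, l2_hit_ratio=0.4))
```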

Data bandwidth per app needs
Bandwidth demand = SM count x bandwidth demand per SM.
Per-SM demand follows from the memory accesses per cycle (IPC x L2 accesses per kilo instructions) and the SM operating frequency.
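A hedged sketch of the demand side and the resulting classification. The decomposition into IPC, L2 APKI, and SM frequency follows the slide; the 32-byte access granularity and the direct comparison against the provided bandwidth are illustrative assumptions.

```python
def bandwidth_demand(num_sms, ipc_per_sm, l2_apki, freq_ghz, access_bytes=32):
    """Aggregate off-SM data bandwidth (GB/s) an application demands.

    Per-SM demand = IPC x (L2 APKI / 1000) x access size x frequency.
    The 32-byte access granularity is an assumption for illustration.
    """
    accesses_per_cycle = ipc_per_sm * l2_apki / 1000.0
    per_sm_gbps = accesses_per_cycle * access_bytes * freq_ghz  # bytes/ns = GB/s
    return num_sms * per_sm_gbps

def classify(demand_gbps, provided_gbps):
    """Demand below supply: compute-intensive (performance scales with SMs).
    Demand above supply: memory-intensive (performance saturates)."""
    return "compute-intensive" if demand_gbps < provided_gbps else "memory-intensive"

# Illustrative use: 10 SMs at 1.4 GHz, IPC 1.0, 50 L2 accesses per kilo insts.
demand = bandwidth_demand(num_sms=10, ipc_per_sm=1.0, l2_apki=50, freq_ghz=1.4)
print(demand, classify(demand, provided_gbps=295))
```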

CD-search: performance mode
GPU with 20 streaming multiprocessors (SMs); the workload consists of a memory-intensive app and a compute-intensive app.
How it works (see the sketch below):
1. Start with even partitioning and record performance.
2. Iteratively stall the SMs assigned to the memory-intensive app.
3. Stop stalling when the performance of the memory-intensive app changes.
4. Assign the stalled SMs to the compute-intensive app.
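A sketch of the performance-mode search loop under the steps above. The hooks measure_ipc and stall_sms are hypothetical stand-ins for the hardware/simulator interface, and the 2% tolerance for "performance changes" is an assumed threshold.

```python
def cd_search_performance(total_sms, measure_ipc, stall_sms, tolerance=0.02):
    """CD-search performance mode (sketch, not the paper's exact algorithm).

    Start from an even split; iteratively take SMs away from the
    memory-intensive app (by stalling them) while its performance stays
    flat, then hand the freed SMs to the compute-intensive app.
    """
    mem_sms = total_sms // 2                 # even partitioning to start
    baseline = measure_ipc("mem_app")        # record reference performance
    while mem_sms > 1:
        stall_sms("mem_app", 1)              # stall one more of the mem-app's SMs
        if abs(measure_ipc("mem_app") - baseline) / baseline > tolerance:
            stall_sms("mem_app", -1)         # undo the last stall: performance changed
            break
        mem_sms -= 1
    return mem_sms, total_sms - mem_sms      # stalled SMs go to the compute app

# Toy usage with stub hooks: performance stays flat, so all but one SM
# migrates to the compute-intensive app.
print(cd_search_performance(20, measure_ipc=lambda app: 100.0,
                            stall_sms=lambda app, n: None))
```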

CD-search: energy mode
GPU with 20 streaming multiprocessors (SMs); the workload consists of two memory-intensive apps (app1, app2).
How it works (see the sketch below):
1. Start with even partitioning and record performance.
2. Stall all but one SM for each memory-intensive app.
3. Estimate the required SM count from IPC_10SMs and IPC_1SM.
4. Iteratively resume one SM at a time until performance saturates.
5. Power-gate the stalled SMs.
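A sketch of the energy-mode SM-count estimate. Assuming performance scales roughly linearly with SM count until saturation, the IPC measured at the even partition (IPC_10SMs) divided by the single-SM IPC (IPC_1SM) approximates the SM count at which the app saturates; the interpolation below is our reading of the slide, not the paper's exact estimator.

```python
import math

def estimate_sm_count(ipc_even, ipc_one_sm, max_sms=10):
    """Estimate how many SMs a memory-intensive app needs (sketch).

    Assumption: before saturation, IPC grows roughly linearly at the
    single-SM rate, so the app saturates near ipc_even / ipc_one_sm SMs.
    CD-search then resumes one SM at a time from this estimate until
    performance stops improving, and power-gates the rest.
    """
    estimate = math.ceil(ipc_even / ipc_one_sm)
    return max(1, min(estimate, max_sms))

# e.g. IPC of 400 with 10 SMs but 80 with 1 SM: roughly 5 SMs suffice.
print(estimate_sm_count(ipc_even=400, ipc_one_sm=80))
```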

Experimental setup
GPGPU-sim, 24 SMs, Fermi-like architecture.
14 applications from Rodinia, Parboil, the CUDA SDK, etc.: 7 compute-intensive and 7 memory-intensive apps.
91 two-app workloads: 49 heterogeneous, 21 memory-intensive, and 21 compute-intensive workloads.
Crossbar NoC with 538 GB/s bisection bandwidth; 6 MCs with 177 GB/s memory bandwidth; 2 LLC slices per MC, 768 KB total LLC.
GPUWattch power model at a 40 nm technology node.

Performance results: heterogeneous workloads
STP improves by 10.4% on average; ANTT improves by 22% on average.

Energy results: homogeneous workloads
Power reduction of 25% on average.

Energy results: homogeneous workloads (continued)
Performance is neutral on average: STP improves for cache-sensitive applications, and degrades for a few applications due to their time-varying execution behavior.

SMK discussion
GPU with 20 streaming multiprocessors (SMs).
SMK [ISCA'16]: co-executes 2 applications on each SM, so App1 and App2 each run on all 20 SMs.
Maestro [ASPLOS'17]: combines SMK with spatial multitasking, but still assumes even SM partitioning.

SMK discussion (continued)
SMK severely degrades performance for some applications. CD-search and Maestro both improve performance, and CD-search + Maestro achieves the highest performance: 24.4% improvement on average.

Conclusions
We demonstrate the opportunity for uneven SM partitioning, with both performance and power potential.
We propose CD-search, which first classifies the workloads and then identifies a suitable SM partitioning, using an off-SM bandwidth model to classify apps.
In the evaluation, CD-search improves system throughput by 10.4% for heterogeneous workloads and reduces power by 25% for homogeneous workloads.

Classification-Driven Search for Effective SM Partitioning in Multitasking GPUs
Xia Zhao*, Zhiying Wang+, Lieven Eeckhout*
*Ghent University, +National University of Defense Technology
The 32nd ACM International Conference on Supercomputing (ICS 2018)