Power Characteristics of Irregular GPGPU Programs
Jared Coplin and Martin Burtscher
Department of Computer Science


Introduction
■ GPU-based accelerators
  ■ Used in high-performance computing
  ■ Spreading in PCs and handheld devices
■ Power and energy efficiency
  ■ Power → heat
  ■ Affects the electric bill and battery life
  ■ ~50x boost in performance per watt needed for exascale computing
■ Important research area
  ■ Need to develop techniques to reduce power and energy
  ■ First have to develop an understanding of the power consumption behavior of regular and irregular programs

Regular vs. Irregular Algorithms
■ Irregular
  ■ Control flow and/or memory access patterns are data dependent
  ■ Runtime behavior cannot be statically predicted
  ■ Example: BST — the values and their order affect the shape of the binary search tree
■ Regular
  ■ Not data dependent
  ■ Dynamic behavior can be determined based only on
    ■ Input size (not values)
    ■ Data structure starting addresses
  ■ Example: matrix-vector multiplication
■ A minimal kernel sketch contrasting the two follows below
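As an illustration of this distinction (not taken from the talk), the CUDA sketch below contrasts a regular kernel, whose control flow and addresses depend only on the input size, with an irregular one, whose loop bounds and memory accesses depend on the input values; the kernel and parameter names are invented for this example, and host setup is omitted.

```cuda
// Regular: every thread's control flow and addresses depend only on n,
// not on the data values (dense matrix-vector multiplication).
__global__ void matvec(const float* A, const float* x, float* y, int n)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n) {
        float sum = 0.0f;
        for (int col = 0; col < n; col++) {
            sum += A[row * n + col] * x[col];
        }
        y[row] = sum;
    }
}

// Irregular: the trip count and the addresses read depend on the input
// values (the graph topology stored in CSR form), so neither the control
// flow nor the memory access pattern can be predicted statically.
__global__ void neighborSum(const int* rowPtr, const int* colIdx,
                            const float* val, float* out, int numVerts)
{
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v < numVerts) {
        float sum = 0.0f;
        for (int e = rowPtr[v]; e < rowPtr[v + 1]; e++) {
            sum += val[colIdx[e]];   // data-dependent indirection
        }
        out[v] = sum;
    }
}
```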

Benchmark Programs
■ LonestarGPU: common, real-world irregular codes
  ■ Barnes-Hut (BH)
  ■ Breadth-First Search (BFS)
  ■ Delaunay Mesh Refinement (DMR)
  ■ Minimum Spanning Tree (MST)
  ■ Points-to Analysis (PTA)
  ■ Single-Source Shortest Paths (SSSP)
  ■ Survey Propagation (NSP)
■ Parboil: mostly regular, throughput-computing codes
  ■ Lattice-Boltzmann Method Fluid Dynamics (LBM)
  ■ Two-Point Angular Correlation Function (TPACF)
  ■ Sparse Matrix-Vector Multiplication (SPMV)

Methodology
■ Power profiles
  ■ Power as a function of time
  ■ Key quantities: idle power, tail power, active threshold, active runtime (a sketch of extracting them from a profile follows below)
■ Test bed
  ■ K20c compute GPU
  ■ K20 Power Tool
  ■ Three frequency settings: default, 614 MHz, and 324 MHz
  ■ Multiple program inputs and implementations
Figure 1: Sample power profile, annotated with active runtime, idle power, and tail power
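As a rough illustration of how a profile can be reduced to the quantities listed above, the host-side sketch below integrates a sampled power trace. The `Sample` struct, the threshold-based definition of "active", and all names are assumptions for this example, not the actual K20 Power Tool interface.

```cuda
#include <cstdio>
#include <vector>

// Hypothetical trace entry as produced by a sampling power meter:
// a timestamp in seconds and the instantaneous board power in watts.
struct Sample { double t, watts; };

// Reduces a power profile to active runtime, active energy, and average
// active power. A sample counts as "active" when its power exceeds the
// idle power by more than the active threshold.
void summarizeProfile(const std::vector<Sample>& trace,
                      double idlePower, double activeThreshold)
{
    double activeRuntime = 0.0;   // seconds spent above the threshold
    double activeEnergy  = 0.0;   // joules consumed while active
    for (size_t i = 1; i < trace.size(); i++) {
        const double dt = trace[i].t - trace[i - 1].t;
        if (trace[i].watts > idlePower + activeThreshold) {
            activeRuntime += dt;
            activeEnergy  += trace[i].watts * dt;  // rectangle rule
        }
    }
    const double avgActivePower =
        (activeRuntime > 0.0) ? activeEnergy / activeRuntime : 0.0;
    printf("active runtime = %.2f s, active energy = %.1f J, "
           "average active power = %.1f W\n",
           activeRuntime, activeEnergy, avgActivePower);
}

int main()
{
    // Tiny synthetic trace: ~1.5 s of idle, 2 s of activity, then a tail
    // that stays below the active threshold while above idle power.
    std::vector<Sample> trace = {
        {0.0, 25}, {0.5, 25}, {1.0, 26}, {1.5, 110}, {2.0, 115},
        {2.5, 108}, {3.0, 112}, {3.5, 33}, {4.0, 28}, {4.5, 26}
    };
    summarizeProfile(trace, 25.0, 10.0);
    return 0;
}
```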

Idealized Power Profile
■ 1. GPU receives work
■ 2. Power draw is stable
■ 3. All cores finish
■ Profile can be captured with two parameters: active runtime and average power (see the energy relation below)
Figure 2: Idealized power profile
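Under this idealized, rectangular model the active energy follows directly from those two parameters; the relation below is implied by the slide rather than stated in it:

```latex
% Rectangular profile: the two parameters fully determine the energy.
E_{\mathrm{active}} = \bar{P}_{\mathrm{active}} \times t_{\mathrm{active}}
```

Note that this relation holds for any profile when \(\bar{P}_{\mathrm{active}}\) is the true time-averaged power; what the idealized model adds is that the instantaneous power stays close to that average, which the irregular profiles shown later do not.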

Regular Codes
■ Active power is not quite constant
■ Profiles basically follow the idealized shape
■ Power peaks at different levels
Figure 3: Power profile of three regular codes

Irregular Codes
■ MST and PTA
  ■ Profiles contain many peaks and valleys
  ■ Dynamically changing data dependencies
■ No such thing as a standard profile for irregular codes
■ NSP
  ■ Topology driven
  ■ Exhibits regular power behavior
Figure 4a: Power profile of irregular codes

Irregular Codes (cont.)
■ BFS and SSSP
  ■ Topology driven
  ■ Exhibit regular power behavior
  ■ Lots of unnecessary work
■ DMR
  ■ Many peaks and valleys
  ■ ~90 seconds of near-idle power show a loss of parallelism
■ BH
  ■ Appears mostly regular
  ■ 10k bodies, 10k time steps
  ■ Irregularity is masked by the short runtimes of the individual kernels
Figure 4b: Power profile of irregular codes

BH 100k Bodies, 100 Time Steps
■ 100k bodies, 100 time steps
■ Individual kernel invocations are evident
■ Higher average power draw with 100k bodies than with 10k bodies (red dashed line)
  ■ 10k bodies are not enough to fill the GPU
■ Irregularity within each time step is still not visible
Figure 5: Power profile of BH with 100k bodies and 100 time steps

BH 22M Bodies, Single Time Step
■ Profile annotations: regularized main kernel, two similar irregular kernels, one more irregular kernel, a very short regular kernel, and a decrease due to load imbalance
Figure: Power profile of BH with 22M bodies and a single time step

Idealized-Regular-Irregular
■ Idealized and regular
  ■ Similar shapes
■ Irregular
  ■ Obviously different: load imbalance, memory access patterns, control flow
  ■ Power fluctuates wildly
    ■ Averages 82 W
    ■ 60% difference from high to low
  ■ Cannot be accurately captured by averages
Figure 6: Comparison of idealized, regular, and irregular profiles

Different Implementations of BFS
■ WLA
  ■ Topology driven
  ■ Lowest average power and peak power
■ Atomic
  ■ Fastest topology-driven version
  ■ Irregular power profile
■ WLC and WLW
  ■ Data driven
  ■ Irregularity is masked by short runtimes
■ Shape of the profile depends on the implementation strategy (see the kernel sketch below)
Figure 7: Profile of different implementations of BFS
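To make the implementation-strategy distinction concrete, here is a hedged CUDA sketch of one topology-driven and one worklist-based (data-driven) BFS step; these are illustrative kernels with invented names, not the WLA/WLC/WLW/Atomic codes measured in the talk, and the host loop that relaunches them level by level is omitted.

```cuda
// Topology-driven step: every vertex is inspected in every iteration,
// even when only a few are on the current frontier (lots of wasted work,
// but a fairly uniform amount of work per launch). The host relaunches
// with increasing curLevel until *changed stays 0.
__global__ void bfsTopology(const int* rowPtr, const int* colIdx,
                            int* level, int curLevel, int numVerts,
                            int* changed)
{
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v < numVerts && level[v] == curLevel) {
        for (int e = rowPtr[v]; e < rowPtr[v + 1]; e++) {
            int n = colIdx[e];
            if (level[n] > curLevel + 1) {  // benign race: same value written
                level[n] = curLevel + 1;
                *changed = 1;
            }
        }
    }
}

// Data-driven (worklist) step: only frontier vertices are processed and
// newly reached vertices are appended with atomics; the amount of work
// per launch tracks the frontier size, so runtime and power vary.
__global__ void bfsWorklist(const int* rowPtr, const int* colIdx,
                            int* level, const int* inWl, int inSize,
                            int* outWl, int* outSize, int curLevel)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < inSize) {
        int v = inWl[i];
        for (int e = rowPtr[v]; e < rowPtr[v + 1]; e++) {
            int n = colIdx[e];
            // atomicMin returns the old level; push only on improvement.
            if (atomicMin(&level[n], curLevel + 1) > curLevel + 1) {
                outWl[atomicAdd(outSize, 1)] = n;
            }
        }
    }
}
```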

SSSP with Different Frequencies
■ Topology-driven implementation
■ Same input (road map of the USA)
■ Default setting
  ■ 705 MHz core; 2600 MHz memory
■ 614
  ■ 614 MHz core; 2600 MHz memory
  ■ Energy stays about the same
■ 324
  ■ 324 MHz core; 324 MHz memory
  ■ Energy increases
  ■ Terrible for memory-bound codes
Figure 8: Profile of SSSP at different frequencies

PTA with Different Inputs
■ Distinctly different profiles
  ■ Pine averages 15 W more than vim
  ■ Tshark has an initial spike followed by a 1.5-second valley before ramping up
  ■ Pine and vim have more spikes and ramp down slightly over time
■ Cannot use a profile from one input to characterize the power draw over time of another
Figure 9: Profile of PTA for different inputs

Summary
■ Regular codes
  ■ Often similar to the idealized profile
■ Irregular codes
  ■ No such thing as a standard power profile
  ■ Implementation and input can greatly affect the power profile
  ■ Early behavior may not be indicative of later behavior
  ■ Cannot easily be captured by averages
■ Power over time
  ■ Must be considered and re-evaluated for each input and any code modification
  ■ Provides an understanding of software effects on the hardware

The work reported herein is supported by:
■ U.S. National Science Foundation (DUE, CNS, CNS, and CCF grants)
■ Texas State University
■ NVIDIA Corporation

Questions?