GPU Power Model Nandhini Sudarsanan Nathan Vanderby Neeraj Mishra Usha Vinodh

Slides:

Advertisements

Similar presentations

Intermediate GPGPU Programming in CUDA

Advertisements

Instructor Notes We describe motivation for talking about underlying device architecture because device architecture is often avoided in conventional.

Scalable Multi-Cache Simulation Using GPUs Michael Moeng Sangyeun Cho Rami Melhem University of Pittsburgh.

A Complete GPU Compute Architecture by NVIDIA Tamal Saha, Abhishek Rawat, Minh Le {ts4rq, ar8eb,

Optimization on Kepler Zehuan Wang

Exploiting Graphics Processors for High- performance IP Lookup in Software Routers Author: Jin Zhao, Xinya Zhang, Xin Wang, Yangdong Deng, Xiaoming Fu.

GPUs on Clouds Andrew J. Younge Indiana University (USC / Information Sciences Institute) UNCLASSIFIED: 08/03/2012.

Acceleration of the Smith– Waterman algorithm using single and multiple graphics processors Author : Ali Khajeh-Saeed, Stephen Poole, J. Blair Perot. Publisher:

1 Threading Hardware in G80. 2 Sources Slides by ECE 498 AL : Programming Massively Parallel Processors : Wen-Mei Hwu John Nickolls, NVIDIA.

Tools for Investigating Graphics System Performance

DCABES 2009 China University Of Geosciences 1 The Parallel Models of Coronal Polarization Brightness Calculation Jiang Wenqian.

1 ITCS 6/8010 CUDA Programming, UNC-Charlotte, B. Wilkinson, March 22, 2011 Branching.ppt Control Flow These notes will introduce scheduling control-flow.

CUDA Programming Lei Zhou, Yafeng Yin, Yanzhi Ren, Hong Man, Yingying Chen.

University of Michigan Electrical Engineering and Computer Science Amir Hormati, Mehrzad Samadi, Mark Woh, Trevor Mudge, and Scott Mahlke Sponge: Portable.

A Source-to-Source OpenACC compiler for CUDA Akihiro Tabuchi †1 Masahiro Nakao †2 Mitsuhisa Sato †1 †1. Graduate School of Systems and Information Engineering,

To GPU Synchronize or Not GPU Synchronize? Wu-chun Feng and Shucai Xiao Department of Computer Science, Department of Electrical and Computer Engineering,

Shekoofeh Azizi Spring  CUDA is a parallel computing platform and programming model invented by NVIDIA  With CUDA, you can send C, C++ and Fortran.

Accelerating SQL Database Operations on a GPU with CUDA Peter Bakkum & Kevin Skadron The University of Virginia GPGPU-3 Presentation March 14, 2010.

An Introduction to Programming with CUDA Paul Richmond

GPU Computing with CBI Laboratory. Overview GPU History & Hardware – GPU History – CPU vs. GPU Hardware – Parallelism Design Points GPU Software.

Enhancing GPU for Scientific Computing Some thoughts.

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign 1 ECE 498AL Lectures 7: Threading Hardware in G80.

CuMAPz: A Tool to Analyze Memory Access Patterns in CUDA

BY: ALI AJORIAN ISFAHAN UNIVERSITY OF TECHNOLOGY 2012 GPU Architecture 1.

1 Chapter 04 Authors: John Hennessy & David Patterson.

Parallel Applications Parallel Hardware Parallel Software IT industry (Silicon Valley) Users Efficient Parallel CKY Parsing on GPUs Youngmin Yi (University.

Y. Kotani · F. Ino · K. Hagihara Springer Science + Business Media B.V Reporter: 李長霖.

High Performance Computing with GPUs: An Introduction Krešimir Ćosić, Thursday, August 12th, LSST All Hands Meeting 2010, Tucson, AZ GPU Tutorial:

Evaluating FERMI features for Data Mining Applications Masters Thesis Presentation Sinduja Muralidharan Advised by: Dr. Gagan Agrawal.

NVIDIA Fermi Architecture Patrick Cozzi University of Pennsylvania CIS Spring 2011.

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 1 CS 395 Winter 2014 Lecture 17 Introduction to Accelerator.

GPU Architecture and Programming

Compiling Several Classes of Communication Patterns on a Multithreaded Architecture Gagan Agrawal Department of Computer and Information Sciences Ohio.

Autonomic scheduling of tasks from data parallel patterns to CPU/GPU core mixes Published in: High Performance Computing and Simulation (HPCS), 2013 International.

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL, University of Illinois, Urbana-Champaign 1 ECE 498AL Lectures 8: Threading Hardware in G80.

Some key aspects of NVIDIA GPUs and CUDA. Silicon Usage.

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE498AL, University of Illinois, Urbana-Champaign 1 ECE498AL Lecture 4: CUDA Threads – Part 2.

QCAdesigner – CUDA HPPS project

Compiler and Runtime Support for Enabling Generalized Reduction Computations on Heterogeneous Parallel Configurations Vignesh Ravi, Wenjing Ma, David Chiu.

 GPU Power Model Nandhini Sudarsanan Nathan Vanderby Neeraj Mishra Usha Vinodh

CUDA Basics. Overview What is CUDA? Data Parallelism Host-Device model Thread execution Matrix-multiplication.

 GPU Power Model Nandhini Sudarsanan Nathan Vanderby Neeraj Mishra Usha Vinodh

Euro-Par, 2006 ICS 2009 A Translation System for Enabling Data Mining Applications on GPUs Wenjing Ma Gagan Agrawal The Ohio State University ICS 2009.

Auto-tuning Dense Matrix Multiplication for GPGPU with Cache

MATE-CG: A MapReduce-Like Framework for Accelerating Data-Intensive Computations on Heterogeneous Clusters Wei Jiang and Gagan Agrawal.

Weekly Report- Reduction Ph.D. Student: Leo Lee date: Oct. 30, 2009.

University of Michigan Electrical Engineering and Computer Science Adaptive Input-aware Compilation for Graphics Engines Mehrzad Samadi 1, Amir Hormati.

An Integrated GPU Power and Performance Model (ISCA’10, June 19–23, 2010, Saint-Malo, France. International Symposium on Computer Architecture)

Sunpyo Hong, Hyesoon Kim

Martin Kruliš by Martin Kruliš (v1.0)1.

My Coordinates Office EM G.27 contact time:

Fast and parallel implementation of Image Processing Algorithm using CUDA Technology On GPU Hardware Neha Patil Badrinath Roysam Department of Electrical.

CS 179: GPU Computing LECTURE 2: MORE BASICS. Recap Can use GPU to solve highly parallelizable problems Straightforward extension to C++ ◦Separate CUDA.

Single Instruction Multiple Threads

Parallel Programming Models

Introduction to CUDA Li Sung-Chi Taiwan Evolutionary Intelligence Laboratory 2016/12/14 Group Meeting Presentation.

Gwangsun Kim, Jiyun Jeong, John Kim

Employing compression solutions under openacc

CS427 Multicore Architecture and Parallel Computing

Basic CUDA Programming

CMPE419 Mobile Application Development

Accelerating MapReduce on a Coupled CPU-GPU Architecture

Linchuan Chen, Xin Huo and Gagan Agrawal

Presented by: Isaac Martin

NVIDIA Fermi Architecture

ECE498AL Spring 2010 Lecture 4: CUDA Threads – Part 2

Mattan Erez The University of Texas at Austin

CMPE419 Mobile Application Development

6- General Purpose GPU Programming

CS Introduction to Operating Systems

Presentation transcript:

GPU Power Model Nandhini Sudarsanan Nathan Vanderby Neeraj Mishra Usha Vinodh Chi Xu

Outline Introduction and Motivation Analytical Model Description Experiment Setup Results Conclusion and Further Work

Introduction

Motivation

Outline Introduction and Motivation Analytical Model Description o Parser o Power Model Experiment Setup Results Conclusion and Further Work

Parser

Outline Introduction and Motivation Analytical Model Description o Parser o Power Model Experiment Setup Results Conclusion and Further Work

Power Model PTX Level

Power Model Assembly Level

Outline Introduction and Motivation Analytical Model Description o Parser o Power Model Experiment Setup Results Conclusion and Further Work

Experiment Setup - Hardware Measure Power Consumption and Temperature o Current Clamp for PCIE & GPU Power Cable  Data Acquisition 100Hz o GPU Performance Counter o Sample 10Hz, GPU sensor

Experiment Setup - Software Driver API Generate and Modify PTX code o Minimize control loops CUDA 4.0 o Built in Binary -> Assembly Converter (cuobjdump) MATLAB to build model Remote login

CUDA- Fermi Architecture Third Generation Streaming Multiprocessor(SM) o 32 CUDA cores per SM, 4x over GT200 o 1024 thread block size, 2x over GT200 o Unified address space enables full C++ support o Improved Memory Subsystem

Benchmarks Small number of overhead operations (loop counters, initialization, etc.). Computational intensive work to allow for an experiment of significant length for accurate current measurement. Exhibit high utilization of the CUDA cores, few data hazards as possible. Grid and block sizes appropriately so that all SM are used, since idle SM leak. Accordingly 7 benchmarks were selected from CUDA SDK.

Benchmarks For this project we tested out a few benchmarks. 2D convolution Matrix Multipication Vector Addition Vector Reduction Scalar Product DCT 8x8 3DFD

Limitations of PTX Higher level than assembly o Divide & Sqrt: 1 PTX line, library in assembly Compiler optimizations from PTX -> assembly Doesn’t reflect RAW dependencies Performance counters use assembly

Outline Introduction and Motivation Analytical Model Description o Parser o Power Model Experiment Setup Results Conclusion and Further Work

Results

Outline Introduction and Motivation Analytical Model Description o Parser o Power Model Experiment Setup Results Conclusion and Further Work

Conclusion Further Work o Take into account context switches o Consider Multiple kernels running simultaneously

The End Thanks Q&A