GPU Power Model Nandhini Sudarsanan Nathan Vanderby Neeraj Mishra Usha Vinodh

Presentation transcript:

GPU Power Model
Nandhini Sudarsanan, Nathan Vanderby, Neeraj Mishra, Usha Vinodh, Chi Xu

Outline
  Introduction and Motivation
  Analytical Model Description
  Experiment Setup
  Results
  Conclusion and Further Work

Introduction

Motivation

Outline
  Introduction and Motivation
  Analytical Model Description
    o Parser
    o Power Model
  Experiment Setup
  Results
  Conclusion and Further Work

Parser

Power Model PTX Level

Power Model Assembly Level

Experiment Setup - Hardware
  Measure Power Consumption and Temperature
    o Current clamp on the PCIe slot & GPU power cable
        Data acquisition at 100 Hz
    o GPU performance counters
    o GPU sensor, sampled at 10 Hz
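
The clamp readings can be turned into power and energy figures in a few lines. The sketch below is only an illustration, not the authors' processing script. It assumes both clamped supplies are 12 V rails (the slides do not state the voltage); the function name `power_from_current` and the sample currents are made up for the example.

```python
SAMPLE_RATE_HZ = 100.0   # data-acquisition rate stated on the slide
RAIL_VOLTAGE_V = 12.0    # ASSUMPTION: both clamped supplies are 12 V rails


def power_from_current(pcie_amps, cable_amps,
                       voltage=RAIL_VOLTAGE_V, rate_hz=SAMPLE_RATE_HZ):
    """Return (average power in W, energy in J) from two current traces.

    pcie_amps / cable_amps are equally long lists of clamp readings in
    amperes, one per sample, taken at rate_hz.
    """
    if len(pcie_amps) != len(cable_amps):
        raise ValueError("traces must have the same number of samples")
    if not pcie_amps:
        raise ValueError("traces must not be empty")
    dt = 1.0 / rate_hz
    # Instantaneous power per sample: P = V * (I_slot + I_cable)
    watts = [voltage * (a + b) for a, b in zip(pcie_amps, cable_amps)]
    energy_j = sum(w * dt for w in watts)   # rectangle-rule integration
    avg_w = sum(watts) / len(watts)
    return avg_w, energy_j


# Hypothetical traces: 4 samples (40 ms) from each clamp
avg_w, energy_j = power_from_current([2.0, 2.5, 3.0, 2.5],
                                     [8.0, 9.5, 10.0, 9.5])
print(f"average power: {avg_w:.1f} W, energy: {energy_j:.2f} J")
```

At 100 Hz each sample covers 10 ms, which is why the benchmarks need to run long enough to yield many samples per kernel.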

Experiment Setup - Software
  Driver API
  Generate and modify PTX code
    o Minimize control loops
  CUDA 4.0
    o Built-in binary -> assembly converter (cuobjdump)
  MATLAB to build model
  Remote login

CUDA - Fermi Architecture
  Third Generation Streaming Multiprocessor (SM)
    o 32 CUDA cores per SM, 4x over GT200
    o 1024 thread block size, 2x over GT200
    o Unified address space enables full C++ support
    o Improved Memory Subsystem

Benchmarks
  Small number of overhead operations (loop counters, initialization, etc.).
  Computationally intensive work, so that each experiment runs long enough for accurate current measurement.
  High utilization of the CUDA cores, with as few data hazards as possible.
  Grid and block sizes chosen so that all SMs are used, since idle SMs still leak power.
  Accordingly, 7 benchmarks were selected from the CUDA SDK.
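
The grid-sizing criterion comes down to simple arithmetic. The sketch below is an illustration under stated assumptions: `launch_config` is a hypothetical helper, the SM count of 14 is a placeholder (the slides do not name the GPU), and having at least one block per SM is a necessary condition for keeping every SM busy, not a sufficient one.

```python
def launch_config(n_elements, threads_per_block, num_sms):
    """Return (blocks, covers_all_sms) for a 1D launch.

    blocks is the grid size needed to cover n_elements; covers_all_sms is
    True when there is at least one block per SM.
    """
    if not 1 <= threads_per_block <= 1024:
        # 1024 is the Fermi thread-block limit quoted on the Fermi slide
        raise ValueError("threads_per_block must be in 1..1024")
    # Integer ceiling division: round up so the last partial block is kept
    blocks = (n_elements + threads_per_block - 1) // threads_per_block
    return blocks, blocks >= num_sms


NUM_SMS = 14  # ASSUMPTION: placeholder SM count, not taken from the slides

for n, tpb in [(1 << 20, 256), (2048, 512)]:
    blocks, ok = launch_config(n, tpb, NUM_SMS)
    print(f"n={n}, block={tpb}: grid={blocks}, all SMs covered: {ok}")
```

A small problem with large blocks (the second case) produces fewer blocks than SMs, so some SMs would sit idle and contribute only leakage to the measurement.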

Benchmarks
  For this project we tested the following benchmarks:
    o 2D Convolution
    o Matrix Multiplication
    o Vector Addition
    o Vector Reduction
    o Scalar Product
    o DCT 8x8
    o 3DFD

Limitations of PTX
  Higher level than assembly
    o Divide & Sqrt: one PTX line each, but a library routine in assembly
  Compiler optimizations are applied from PTX -> assembly
  Doesn’t reflect RAW dependencies
  Performance counters work at the assembly level
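
To see why a PTX-level count can mislead, consider a simple opcode tally. The sketch below is not the authors' parser; `count_opcodes` and the sample PTX fragment are hypothetical, written only for this example. It counts `div` and `sqrt` once each, even though each expands into a multi-instruction routine in the final assembly.

```python
import re
from collections import Counter

# Hypothetical kernel body, written for this example only
SAMPLE_PTX = """
// load two operands, combine them, store the result
ld.global.f32   %f1, [%rd1];
ld.global.f32   %f2, [%rd2];
add.f32         %f3, %f1, %f2;
div.rn.f32      %f4, %f3, %f2;
sqrt.rn.f32     %f5, %f4;
@%p1 bra        DONE;
st.global.f32   [%rd3], %f5;
DONE:
ret;
"""


def count_opcodes(ptx_text):
    """Tally base opcodes (the part before the first '.') in PTX text."""
    counts = Counter()
    for raw in ptx_text.splitlines():
        line = raw.split("//", 1)[0].strip()      # drop line comments
        if not line or line in ("{", "}"):
            continue
        if line.startswith(".") or line.endswith(":"):
            continue                              # directives and labels
        line = re.sub(r"^@!?%\w+\s+", "", line)   # strip guard predicate
        match = re.match(r"[A-Za-z_]\w*", line)
        if match:
            counts[match.group(0)] += 1
    return counts


for opcode, n in sorted(count_opcodes(SAMPLE_PTX).items()):
    print(f"{opcode}: {n}")
```

Running the same tally over the cuobjdump assembly output would need a different tokenizer, since the assembly syntax differs from PTX.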

Results

Conclusion
  Further Work
    o Take into account context switches
    o Consider multiple kernels running simultaneously

The End Thanks Q&A