Performance modeling in GPGPU computing. Wenjing Xu. Professor: Dr. Box.



What's GPGPU?
- GPU-accelerated computing is the use of a graphics processing unit (GPU) together with a CPU to accelerate scientific, engineering, and enterprise applications.

What's modeling?
- A model is a simplified representation of a system or phenomenon.
- It is the most explicit way to describe a system or phenomenon.
- We use the parameters we set to build formulas for analyzing the system.

Related work
- Hong and Kim [3] introduce two metrics, Memory Warp Parallelism (MWP) and Computation Warp Parallelism (CWP), to describe the GPU parallel architecture.
- Zhang and Owens [4] develop a performance model based on their microbenchmarks so that they can identify bottlenecks in the program.
- Supada [5] proposes a performance model that accounts for memory latencies varying with the data type and the type of memory.

1 Introduction and background
- Different applications and devices cannot use the same settings.
- Find the relationships among the parameters in this model, and choose the best block size for each application on each device to reach peak performance.

Different data sizes combined with different block sizes give different performance

How the GPU works

Memory latency hiding

The structure of threads

Specification of GeForce GTX 650

Parameters

Block size setting under thread limits
- $N_{MB} \ge N_{TB} = N \cdot N_{TW} \ge N_{RT} / N_{RB}$, where $N$ is an integer.
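The inequality above can be turned into a short enumeration of admissible block sizes. This is a minimal sketch, not the presenter's code. It assumes that N_MB is the maximum number of threads per block, N_TW the number of threads per warp, N_RT the maximum number of resident threads per multiprocessor, and N_RB the maximum number of resident blocks per multiprocessor; the transcript does not preserve the slide that defines these symbols. The function name and the numeric values are illustrative assumptions as well.

```python
def candidate_block_sizes(n_mb, n_tw, n_rt, n_rb):
    """List block sizes N_TB = N * N_TW with N_RT / N_RB <= N_TB <= N_MB.

    Symbol meanings are assumed (see the text above):
      n_mb: max threads per block, n_tw: threads per warp,
      n_rt: max resident threads per multiprocessor,
      n_rb: max resident blocks per multiprocessor.
    """
    lower = n_rt / n_rb  # smallest block size that can still fill the multiprocessor
    return [n * n_tw
            for n in range(1, n_mb // n_tw + 1)
            if n * n_tw >= lower]


# Illustrative limits (assumed, not read from the slides).
sizes = candidate_block_sizes(n_mb=1024, n_tw=32, n_rt=2048, n_rb=16)
print(sizes[0], sizes[-1], len(sizes))
```

Every value returned is a whole number of warps, so no warp in a block is left partially filled.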

Memory resource

Block size setting under streaming multiprocessor resources
- $M_R / M_{TR} \ge N \cdot N_{TB}$, where $N$ is an integer.
- $N \cdot N_{TB} \le N_{RT}$, where $N$ is an integer.
- $N \le M_{SM} / M_{SB}$
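The three resource inequalities can be checked together for a given number of blocks N. The sketch below is a hedged illustration under assumed symbol meanings: M_R is registers per multiprocessor, M_TR registers per thread, N_RT the resident-thread limit, M_SM shared memory per multiprocessor, and M_SB shared memory per block. The transcript does not define them. The function names, the numbers, and the treatment of kernels that use no shared memory are assumptions added for the example.

```python
def fits_on_sm(n_blocks, n_tb, m_r, m_tr, n_rt, m_sm, m_sb):
    """Return True if n_blocks blocks of n_tb threads satisfy all three limits."""
    threads = n_blocks * n_tb
    register_ok = m_r / m_tr >= threads   # M_R / M_TR >= N * N_TB
    thread_ok = threads <= n_rt           # N * N_TB <= N_RT
    # N <= M_SM / M_SB; assumed vacuous when the kernel uses no shared memory.
    shared_ok = m_sb == 0 or n_blocks <= m_sm / m_sb
    return register_ok and thread_ok and shared_ok


def max_resident_blocks(n_tb, m_r, m_tr, n_rt, m_sm, m_sb):
    """Largest N for which fits_on_sm holds (0 if even one block does not fit)."""
    n = 0
    while fits_on_sm(n + 1, n_tb, m_r, m_tr, n_rt, m_sm, m_sb):
        n += 1
    return n


# Illustrative kernel and device values (assumed, not from the slides).
limits = dict(m_r=65536, m_tr=32, n_rt=2048, m_sm=49152, m_sb=8192)
print(max_resident_blocks(256, **limits))
```

With these example values the shared-memory bound is the tightest one, which is the kind of bottleneck the inequalities are meant to expose.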

Conclusion
- Although more threads can hide memory access latency, more threads also need more resources. Finding the balance point between resource limits and memory latency is a shortcut to reaching peak performance. Across different applications and devices, this performance model shows its advantage: it is adaptable, and it lets an application run at its best tuning without rework or redesign.
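One way to look for that balance point is to sweep the warp-aligned block sizes and keep the one that leaves the most threads resident on a multiprocessor. This is a sketch under stated assumptions, not the model from the slides: it uses resident threads as a stand-in for latency-hiding ability, it ignores register and shared-memory allocation granularity, and the symbol meanings, function names, and numbers are all illustrative.

```python
def resident_threads(n_tb, n_rt, n_rb, m_r, m_tr, m_sm, m_sb):
    """Threads resident on one multiprocessor for block size n_tb (simplified)."""
    by_threads = n_rt // n_tb                        # resident-thread limit
    by_registers = (m_r // m_tr) // n_tb             # register-file limit
    by_shared = m_sm // m_sb if m_sb > 0 else n_rb   # shared-memory limit
    blocks = min(n_rb, by_threads, by_registers, by_shared)
    return blocks * n_tb


def best_block_size(n_mb, n_tw, **limits):
    """Smallest warp-aligned block size that maximizes resident threads."""
    sizes = range(n_tw, n_mb + 1, n_tw)
    # max() returns the first maximal element, so ties go to the smaller block.
    return max(sizes, key=lambda n_tb: resident_threads(n_tb, **limits))


# Illustrative device and kernel values (assumed, not from the slides).
limits = dict(n_rt=2048, n_rb=16, m_r=65536, m_tr=32, m_sm=49152, m_sb=12288)
best = best_block_size(n_mb=1024, n_tw=32, **limits)
print(best, resident_threads(best, **limits))
```

Changing only the `limits` dictionary re-tunes the choice for another device or kernel, which mirrors the adaptability claimed in the conclusion.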