The Performance Potential for Single Application Heterogeneous Systems. Henry Wong* and Tor M. Aamodt§. *University of Toronto, §University of British Columbia.

Presentation transcript:

1 The Performance Potential for Single Application Heterogeneous Systems Henry Wong* and Tor M. Aamodt § *University of Toronto § University of British Columbia

2 Intuition suggests integrating parallel and sequential cores on a single chip should provide performance benefits by lowering communication overheads.

3 This work: Perform limit study of heterogeneous architecture performance when running a single general purpose application. Two main results: Single thread performance (read-after-write latency) of GPUs ought to improve for GPUs to accelerate a wider set of non-graphics workloads. Putting CPU and accelerator on single chip does not seem to improve performance “much” versus separate CPU and accelerator.

4 Outline Introduction Background: - GPU Computing / Heterogeneous - Barrel processing (relevant to GPUs) Limit Study Model - Sequential and Parallel Models - Dynamic programming algorithm - Modeling Bandwidth Results

5 Graphics Processing Unit (GPU) Polygons Textures Lights

6 Programmable GPU Rendering pipeline Polygons go in Pixels come out DX10 has 3 programmable stages

7 GPU/Stream Computing Use shader processors without rendering pipeline C-like high-level language for convenience

8 Separate GPU + CPU Off-chip latency Copy data between memory spaces

9 Single-Chip Lower latency Single memory address space: Share data, don't copy

10 Sequential Performance of Parallel Processor Contemporary GPUs have slow single-thread performance. “Designed for cache miss” => use “barrel processing” to hide off-chip latency. This impacts the minimum read-to-write latency for a single thread. Not an issue if you have 10^6 pixels, each requiring a 100-instruction-long thread.

11 Sequential Performance of Parallel Processor GPUs can do many operations per clock cycle. Nvidia G80 needs 3072 independent instructions every 24 clocks to keep pipelines filled. Can model G80 as executing up to 3072 independent scalar instructions every 24 clocks. For a single thread, the CPU produces results ~100x faster: 2 IPC * 2x clock speed * 24-clock instruction latency. Parallel Instruction Latency = ratio of read-to-write latency of dependent instructions on the parallel processor (measured in CPU clock cycles) to CPU CPI.
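The ~100x figure above can be checked with a few lines of arithmetic. This is a minimal sketch, not the authors' code: the function name and parameter names are hypothetical, and the "2" for clock speed is read here as the CPU clock being about twice the GPU clock.

```python
def parallel_instruction_latency(gpu_latency_clocks, cpu_to_gpu_clock_ratio, cpu_ipc):
    """Read-to-write latency on the parallel processor, expressed in CPU
    clock cycles, divided by the CPU's cycles per instruction (CPI)."""
    latency_in_cpu_cycles = gpu_latency_clocks * cpu_to_gpu_clock_ratio
    cpu_cpi = 1.0 / cpu_ipc
    return latency_in_cpu_cycles / cpu_cpi


# Numbers from the slide: 24-clock latency, CPU clock ~2x the GPU clock, CPU IPC of 2.
g80_like_latency = parallel_instruction_latency(24, 2.0, 2.0)
print(g80_like_latency)  # 96.0, i.e. roughly 100x

# Peak throughput implied by "3072 independent instructions every 24 clocks".
peak_instructions_per_clock = 3072 / 24
print(peak_instructions_per_clock)  # 128.0
```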

12 Limit Study Optimistic abstract model of GPU and CPU. “ILP limit study”-type trace analysis with optimistic assumptions. Assume constant CPI (=1.0) for sequential core. Parallel processor is ideal data flow processor, but with read-after-write latency some multiple of the sequential core clock. Parallel processor has unlimited parallelism. Optimally schedule instructions on cores using dynamic programming algorithm.
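The two abstract cores described above can be sketched as cost functions over a trace. This is an illustrative sketch under stated assumptions, not the authors' tool: a trace is represented (hypothetically) as a list in which entry i holds the indices of the earlier instructions that instruction i reads from.

```python
def sequential_cycles(trace):
    """Sequential core: constant CPI of 1.0, so one cycle per instruction."""
    return len(trace)


def parallel_cycles(trace, latency):
    """Ideal dataflow processor with unlimited parallelism: an instruction
    starts as soon as all its producers have finished, and its result is
    available `latency` sequential-core cycles later."""
    finish = []
    for producers in trace:
        start = max((finish[p] for p in producers), default=0)
        finish.append(start + latency)
    return max(finish, default=0)


# Five instructions; the longest dependence chain (0 or 1 -> 2 -> 3) has three links.
example_trace = [[], [], [0, 1], [2], []]
print(sequential_cycles(example_trace))   # 5
print(parallel_cycles(example_trace, 1))  # 3
print(parallel_cycles(example_trace, 4))  # 12
```

With a latency of 1 the parallel processor wins (3 cycles versus 5); with a latency of 4 the sequential core wins, which is the trade-off the limit study explores.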

13 Trace Analysis Assumptions Perfect branch prediction Perfect memory disambiguation Remove stack-pointer dependencies Remove induction variable dependencies by removing all instructions that depend (dynamically) only on compile time constants.

14 Scheduling a Trace

15 Dynamic Programming Switching between processors takes time. Find optimal schedule by decomposing problem, using optimal solution to subproblem to create optimal solution to larger problem. Input: Trace of N instructions. Output: Optimum (minimum) number of cycles required to execute on abstract heterogeneous processor model. [Figure: instruction trace divided into alternating serial and parallel segments]
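One way to realize such a dynamic program is sketched below. It is an assumption-laden illustration, not the authors' algorithm: the trace format (a list of producer indices per instruction), the function name, the rule that execution starts on the sequential core, and the rule that a parallel segment only waits on dependences inside the segment are all hypothetical simplifications.

```python
def optimal_cycles(trace, latency, switch_cost):
    """Minimum cycles to run `trace` when contiguous segments alternate
    between a CPI=1 sequential core and an ideal dataflow processor whose
    read-after-write latency is `latency` cycles. Each change of processor
    costs `switch_cost` cycles."""
    n = len(trace)
    inf = float("inf")
    best_seq = [inf] * (n + 1)  # first i instructions done, last ran sequentially
    best_par = [inf] * (n + 1)  # first i instructions done, last ran in parallel
    best_seq[0] = 0             # assumption: execution begins on the sequential core
    for j in range(n + 1):
        if j > 0:
            # Run instruction j-1 on the sequential core (one cycle).
            best_seq[j] = min(best_seq[j - 1], best_par[j - 1] + switch_cost) + 1
        # Try every parallel segment that starts at instruction j.
        start = best_seq[j] + switch_cost
        depth = {}
        segment_depth = 0
        for i in range(j, n):
            # Producers before j finished in an earlier segment.
            d = 1 + max((depth[p] for p in trace[i] if p >= j), default=0)
            depth[i] = d
            segment_depth = max(segment_depth, d)
            best_par[i + 1] = min(best_par[i + 1], start + segment_depth * latency)
    return min(best_seq[n], best_par[n])


# Eight independent instructions followed by a four-instruction dependence chain.
example_trace = [[] for _ in range(8)] + [[7], [8], [9], [10]]
print(optimal_cycles(example_trace, latency=4, switch_cost=1))    # 10
print(optimal_cycles(example_trace, latency=4, switch_cost=100))  # 12
```

In the first call the best schedule runs the eight independent instructions in parallel and the chain sequentially; in the second the switch cost is high enough that staying on the sequential core is optimal. The nested loops make the sketch quadratic in trace length.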

16 Optimal algorithm is quadratic in instruction trace length. Approximation: First, sort trace of instructions in dataflow order to uncover parallelism. Then, apply dynamic programming over traces of 30,000 instructions.
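The approximation above can be sketched as a reorder-then-window step. This is a hedged illustration only: the slide does not say how dependences that cross a window boundary are handled, so the sketch assumes they count as already satisfied, and all function names and the trace format (a list of producer indices per instruction) are hypothetical.

```python
def dataflow_depths(trace):
    """Depth of each instruction in the dependence graph (1 = no producers)."""
    depths = []
    for producers in trace:
        depths.append(1 + max((depths[p] for p in producers), default=0))
    return depths


def dataflow_order(trace):
    """Stable sort by dataflow depth, with producer indices remapped.
    A producer always has a smaller depth than its consumer, so every
    dependence still points to an earlier instruction."""
    depths = dataflow_depths(trace)
    order = sorted(range(len(trace)), key=lambda i: depths[i])
    new_index = {old: new for new, old in enumerate(order)}
    return [[new_index[p] for p in trace[old]] for old in order]


def windows(trace, size=30_000):
    """Split a trace into fixed-size windows with window-local indices.
    Assumption: producers outside the window are treated as satisfied."""
    for start in range(0, len(trace), size):
        chunk = trace[start:start + size]
        yield [[p - start for p in producers if p >= start] for producers in chunk]


reordered = dataflow_order([[], [0], [], [2], [1, 3]])
print(reordered)                         # [[], [], [0], [1], [2, 3]]
print(list(windows(reordered, size=2)))  # [[[], []], [[], []], [[]]]
```

Each window would then be handed to the dynamic programming step independently, which keeps the quadratic cost bounded by the window size rather than the full trace length.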

17 Bandwidth Latency of a mode switch depends upon the amount of data produced by the old processor and consumed on the new processor. Use earliest-deadline-first scheduling. Simple model of bandwidth, e.g., max 32 bits every 8 cycles. Allow overlap of computation with communication. Iterative model: Use average mode switch latency from last iteration as fixed mode switch latency for next iteration. Results based upon actual implied latency of last iteration.
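The iterative part of this model can be sketched as a fixed-point loop. This is a simplified illustration, not the authors' implementation: it ignores earliest-deadline-first ordering and computation/communication overlap, and `schedule` is a hypothetical callback that stands in for the scheduler, taking a fixed switch latency and returning the number of bits moved at each mode switch it chose.

```python
import math


def transfer_cycles(bits, bits_per_transfer=32, cycles_per_transfer=8):
    """Cycles to move `bits` at the slide's example rate of 32 bits every 8 cycles."""
    return math.ceil(bits / bits_per_transfer) * cycles_per_transfer


def iterate_switch_latency(schedule, initial_latency=0.0, iterations=10):
    """Repeatedly schedule with a fixed mode-switch latency, then replace that
    latency with the average latency implied by the data actually moved."""
    latency = initial_latency
    for _ in range(iterations):
        bits_per_switch = schedule(latency)
        if not bits_per_switch:
            break  # no mode switches were scheduled, nothing to average
        implied = [transfer_cycles(bits) for bits in bits_per_switch]
        latency = sum(implied) / len(implied)
    return latency


def toy_schedule(fixed_latency):
    """Hypothetical scheduler that always picks the same three switches."""
    return [32, 64, 128]


print(transfer_cycles(32))  # 8
print(transfer_cycles(33))  # 16
print(iterate_switch_latency(toy_schedule))
```

With a real scheduler the set of switches would change as the fixed latency changes, so the loop may need several iterations to settle; the toy scheduler converges after one.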

18 Experiment Setup PTLSim (x86-64): micro-op traces. SimPoint (phase classification): ~12 x 10M instruction segments. Benchmarks: SPEC 2000, PhysicsBench, SimpleScalar (used as a benchmark), microbenchmarks.

19 Average Parallelism As in prior ILP limit studies: lots of parallelism.

20 Instructions Scheduled on Parallel Cores As parallel processor’s sequential performance gets worse, more instructions scheduled on sequential core.

21 Parallelism on Parallel Processor As parallel processor’s sequential performance gets worse, work scheduled on parallel core needs to be more parallel.

22 Speedup over Sequential Core Applications exist with enough parallelism to fully utilize GPU function units. [Plot annotation: GPU]

23 Speedup over Sequential Core “General Purpose” Workloads: Performance limited by sequential performance (read-after-write latency) of parallel cores. [Plot annotation: GPU]

24 Slowdown of infinite communication cost (NoSwitch) Up to 5x performance improvement versus infinite cost. Communication cost matters most for GPU like parallel instruction latency. So, put on same chip?

25 Slowdown due to 100,000 cycles of mode-switch latency Can achieve 85% of the performance of single-chip with large (but not infinite) mode switch latency.

26 Mode Switches Number of mode switches decreases with increasing mode switch cost. More mode switches occur at intermediate values of parallel instruction latency. [Plot panels: zero cycles, 10 cycles, 1000 cycles]

27 PCI Express-like Bandwidth (and Latency) 1.07x to 1.48x performance improvement if latency is reduced to zero and bandwidth is made infinite. Less improvement if parallel instruction latency is reduced, e.g., by a better accelerator architecture.

28 Conclusions & Caveats GPUs could tackle more general-purpose applications if single-thread performance were better. Performance improvement due to integrating CPU and accelerator on a single chip (versus separate CPU and accelerator) does not appear staggering. Bandwidth has greater impact than latency. Caveats: It’s a limit study. Heterogeneous may still make sense for other reasons… e.g., if it is cheaper to add parallel cores than another chip socket, power, etc…

29 Future Work Control dependence analysis Model interesting design points in more detail

30 Bandwidth sensitivity for GPU-like parallel instruction latency

31 Proportion of instructions on parallel processor

32 Slowdown of infinite communication Twophase shows strong sensitivity to communication latency for widely varying parallel instruction latency