Chapter 4 Data-Level Parallelism in Vector, SIMD, and GPU Architectures Topic 14 The Roofline Visual Performance Model Prof. Zhang Gang, gzhang@tju.edu.cn

Presentation transcript:

Chapter 4 Data-Level Parallelism in Vector, SIMD, and GPU Architectures Topic 14 The Roofline Visual Performance Model Prof. Zhang Gang gzhang@tju.edu.cn School of Computer Sci. & Tech. Tianjin University, Tianjin, P. R. China

The Roofline model
The Roofline model is a visually intuitive way to compare the potential floating-point performance of variations of SIMD architectures. It ties together floating-point performance, memory performance, and arithmetic intensity in a two-dimensional graph. Arithmetic intensity is the ratio of floating-point operations performed per byte of memory accessed (FLOPs per byte read).

Arithmetic intensity
Arithmetic intensity = (total number of floating-point operations for a program) / (total number of data bytes transferred to main memory during program execution)
Figure 4.10 Arithmetic intensity
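As a sketch of this definition, consider the DAXPY kernel (y = a*x + y) on double-precision data; the operation and byte counts below follow the usual accounting (one multiply and one add per element, three 8-byte accesses per element), not a measurement from the text:

```python
def arithmetic_intensity(total_flops, total_bytes):
    """Arithmetic intensity: floating-point operations per byte of memory traffic."""
    return total_flops / total_bytes

n = 1_000_000
flops = 2 * n            # one multiply and one add per element
bytes_moved = 3 * 8 * n  # read x[i], read y[i], write y[i]; 8-byte doubles
ai = arithmetic_intensity(flops, bytes_moved)
print(f"DAXPY arithmetic intensity: {ai:.3f} FLOP/byte")  # about 0.083
```

An intensity this low (well under 1 FLOP/byte) places DAXPY firmly under the memory-bandwidth part of the roofline on most machines.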

How to find the peak memory performance?
Peak floating-point performance can be found from the hardware specifications. Many of the kernels in this case study do not fit in on-chip caches, so peak memory performance is set by the memory system behind the caches. Note that we need the peak memory bandwidth available to the processors, not just the bandwidth at the DRAM pins. One way to find the delivered peak memory performance is to run the Stream benchmark.
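The idea behind Stream can be sketched in a few lines of Python; this is only a rough illustration (the real Stream benchmark is a tuned C program with several kernels and run rules), and the measured figure will be a delivered bandwidth, below the DRAM pin rate:

```python
import time

# Buffers large enough to spill out of on-chip caches (100 MB each; assumed size).
N = 100_000_000
src = bytearray(N)
dst = bytearray(N)

t0 = time.perf_counter()
dst[:] = src                      # copy kernel: N bytes read + N bytes written
t1 = time.perf_counter()

gb_moved = 2 * N / 1e9            # count both the read and the write traffic
print(f"Delivered copy bandwidth: {gb_moved / (t1 - t0):.1f} GB/s")
```

The number this prints is what belongs on the roofline's diagonal, since it is the bandwidth the processor actually sees through the cache hierarchy.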

Examples of the Roofline model
The NEC SX-9 is a vector supercomputer; the Intel Core i7 920 is a multicore processor with SIMD extensions. Note that the graph uses a log–log scale, and that the Roofline for a given computer is drawn just once.
Figure 4.11 Roofline model for one NEC SX-9 and the Intel Core i7 920

How could we plot the peak memory performance?
Since the X-axis is FLOP/byte and the Y-axis is FLOP/sec, a constant bytes/sec rate appears as a diagonal line at a 45-degree angle on the log–log plot. We can express the limits as a formula to plot these lines in the graph:
Attainable GFLOP/s = Min(Peak Memory BW × Arithmetic Intensity, Peak Floating-Point Performance)
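This Min() can be written down directly. The machine numbers below are hypothetical, chosen only to show the two regimes and the "ridge point" where the diagonal bandwidth line meets the flat compute ceiling:

```python
def attainable_gflops(peak_gflops, peak_bw_gbs, ai):
    """Roofline ceiling: memory-bound below the ridge point, compute-bound above it."""
    return min(peak_bw_gbs * ai, peak_gflops)

# Hypothetical machine: 40 GFLOP/s peak compute, 16 GB/s peak memory bandwidth.
ridge = 40.0 / 16.0   # arithmetic intensity where the two ceilings meet (2.5 FLOP/byte)
for ai in (0.25, ridge, 8.0):
    print(f"AI = {ai:g} FLOP/byte -> {attainable_gflops(40.0, 16.0, ai):g} GFLOP/s")
```

Kernels with arithmetic intensity below the ridge point hit the bandwidth diagonal; kernels above it hit the flat peak-FLOP roof, which is exactly the shape seen in Figure 4.11.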

Exercises
What is the meaning of the Roofline model?
What is the meaning of arithmetic intensity?
How do you find the peak memory performance?
How do you plot the peak memory performance?