CS 395 Last Lecture Summary, Anti-summary, and Final Thoughts

Summary (1): Architecture
– Modern architecture designs are driven by energy constraints.
– Shortening latencies is too costly, so hardware uses parallelism to increase potential throughput.
– Some parallelism is implicit (out-of-order superscalar execution), but it has limits.
– Other parallelism is explicit (vectorization and multithreading) and relies on software to unlock it.

Summary (2): Memory
– Memory technologies trade off energy and cost for capacity, with SRAM registers at one end and spinning-platter hard disks at the other.
– Locality (relationships between memory accesses) can help us get the best of all cases.
– Caching is the hardware-only solution for capturing locality, but software-driven solutions exist too (memcache for files, etc.).

Summary (3): Software
Want to fully occupy your hardware?
– Express locality (tiling)
– Vectorize (compiler or manual)
– Multithread (e.g. OpenMP)
– Accelerate (e.g. CUDA, OpenCL)
Take the cost into consideration: unless you're optimizing in your free time, your time isn't free.
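
As a reminder of what the last option above ("Accelerate") looks like, here is a minimal CUDA sketch; the saxpy kernel, sizes, and use of unified memory are illustrative, not from the slides:

    #include <cstdio>
    #include <cuda_runtime.h>

    // y = a*x + y, one element per thread
    __global__ void saxpy(int n, float a, const float *x, float *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }

    int main() {
        const int n = 1 << 20;
        float *x, *y;
        cudaMallocManaged(&x, n * sizeof(float));  // unified memory keeps the sketch short
        cudaMallocManaged(&y, n * sizeof(float));
        for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

        saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);
        cudaDeviceSynchronize();

        printf("y[0] = %f\n", y[0]);               // expect 4.0
        cudaFree(x); cudaFree(y);
        return 0;
    }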

Research Perspective (2010)
Can we generalize and categorize the most important, generally applicable GPU computing software optimizations?
– Across multiple architectures
– Across many applications
What kinds of performance trends are we seeing from successive GPU generations?
Conclusion: GPUs aren't special, and parallel programming is getting easier.

Application Survey
– Surveyed the GPU Computing Gems chapters
– Studied the Parboil benchmarks in detail
Results: eight (for now) major categories of optimization transformations
– The performance impact of individual optimizations on certain Parboil benchmarks is included in the paper.

1. (Input) Data Access Tiling
[Diagram: local accesses are served either by a cache (implicit copy from DRAM) or by a scratchpad (explicit copy from DRAM).]
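
A minimal sketch of the explicit-copy case: staging a tile of input into the scratchpad (shared memory), here for a 1D box stencil. The kernel name, tile size, and radius are illustrative:

    #define TILE 256    // launch with TILE threads per block
    #define RADIUS 3

    // Each block stages its input tile plus halo in shared memory once,
    // so each input element is read from DRAM once instead of (2*RADIUS+1) times.
    __global__ void stencil1d(const float *in, float *out, int n) {
        __shared__ float tile[TILE + 2 * RADIUS];
        int g = blockIdx.x * blockDim.x + threadIdx.x;  // global index
        int l = threadIdx.x + RADIUS;                   // local (tile) index

        tile[l] = (g < n) ? in[g] : 0.0f;
        if (threadIdx.x < RADIUS) {                     // load the halo cells
            tile[l - RADIUS] = (g >= RADIUS) ? in[g - RADIUS] : 0.0f;
            tile[l + TILE] = (g + TILE < n) ? in[g + TILE] : 0.0f;
        }
        __syncthreads();

        if (g < n) {
            float sum = 0.0f;
            for (int k = -RADIUS; k <= RADIUS; ++k)
                sum += tile[l + k];                     // all reads hit the scratchpad
            out[g] = sum;
        }
    }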

2. (Output) Privatization
– Avoid contention by aggregating updates locally.
– Requires storage resources to keep copies of data structures.
[Diagram: private results are merged into local results, which are merged into global results.]
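
The classic instance is a histogram: each block accumulates into a private copy in shared memory and merges into the global result once at the end. A sketch, with the bin count and names assumed for illustration:

    #define NUM_BINS 256

    __global__ void histogram(const unsigned char *data, int n,
                              unsigned int *global_bins) {
        __shared__ unsigned int bins[NUM_BINS];  // per-block private copy
        for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
            bins[b] = 0;
        __syncthreads();

        // Contended atomic updates stay in fast on-chip shared memory...
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
             i += gridDim.x * blockDim.x)
            atomicAdd(&bins[data[i]], 1u);
        __syncthreads();

        // ...and each block touches the global copy only NUM_BINS times.
        for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
            atomicAdd(&global_bins[b], bins[b]);
    }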

Running Example: SpMV
[Diagram: sparse matrix–vector multiplication, Ax = v, with the sparse matrix A stored as Row, Col, and Data arrays (a CSR-style layout).]
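
For reference, a gather-style CSR SpMV kernel with one thread per row (a sketch; the array names follow the slide's Row/Col/Data labels):

    // v = A*x with A in CSR form:
    // row[i]..row[i+1] delimit row i's entries in col[] and data[].
    __global__ void spmv_csr(int num_rows, const int *row, const int *col,
                             const float *data, const float *x, float *v) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < num_rows) {
            float dot = 0.0f;
            for (int j = row[i]; j < row[i + 1]; ++j)
                dot += data[j] * x[col[j]];
            v[i] = dot;  // each thread gathers its own row: no write contention
        }
    }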

3. "Scatter to Gather" Transformation
[Diagram: SpMV recast so that each output element of v gathers the inputs it needs, instead of each input element scattering updates to output locations.]
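
For contrast, a scatter-style sketch of the same computation with one thread per matrix entry (coordinate form; names illustrative): every update needs a global atomic because many threads may target the same output row. The transformation replaces this with the per-row gather kernel shown above.

    // Scatter form: one thread per nonzero; v must be zeroed beforehand.
    __global__ void spmv_scatter(int nnz, const int *rowidx, const int *col,
                                 const float *data, const float *x, float *v) {
        int e = blockIdx.x * blockDim.x + threadIdx.x;
        if (e < nnz)
            atomicAdd(&v[rowidx[e]], data[e] * x[col[e]]);  // contended update
    }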

4. Binning
– Build a data structure (bins) that maps each output to the small subset of inputs that can contribute to it, so each output's gather loop stays short. A sketch follows.
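
A minimal host-side sketch of binning points along one axis with a counting sort, so a later gather kernel can examine only nearby bins. The uniform bin width and all names are illustrative assumptions:

    #include <vector>

    // Sort point indices into uniform 1D bins: count, exclusive-scan, scatter.
    // Assumes every position lies in [min_pos, min_pos + num_bins * bin_w).
    void bin_points(const std::vector<float> &pos, float min_pos, float bin_w,
                    int num_bins, std::vector<int> &bin_start,
                    std::vector<int> &binned_idx) {
        std::vector<int> count(num_bins, 0);
        for (float p : pos)
            ++count[(int)((p - min_pos) / bin_w)];

        bin_start.assign(num_bins + 1, 0);
        for (int b = 0; b < num_bins; ++b)      // exclusive scan of the counts
            bin_start[b + 1] = bin_start[b] + count[b];

        std::vector<int> cursor(bin_start.begin(), bin_start.end() - 1);
        binned_idx.resize(pos.size());
        for (int i = 0; i < (int)pos.size(); ++i)
            binned_idx[cursor[(int)((pos[i] - min_pos) / bin_w)]++] = i;
    }

bin_start[b]..bin_start[b+1] then delimits bin b's point indices in binned_idx, mirroring the Row array of the CSR layout above.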

5. Regularization (Load Balancing)
– Reshape irregular work so that the threads in a group do similar amounts of it; a sketch follows.
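
Applied to the SpMV running example, a common regularization is padding CSR's ragged rows to a uniform length (the ELL layout), so every thread's loop runs the same number of iterations. A host-side sketch, with zero padding assumed:

    #include <algorithm>
    #include <vector>

    // Pad each CSR row out to the longest row's length (ELL format).
    // Padded entries use column 0 and value 0, so they contribute nothing.
    void csr_to_ell(int num_rows, const std::vector<int> &row,
                    const std::vector<int> &col, const std::vector<float> &data,
                    std::vector<int> &ell_col, std::vector<float> &ell_val,
                    int &max_len) {
        max_len = 0;
        for (int i = 0; i < num_rows; ++i)
            max_len = std::max(max_len, row[i + 1] - row[i]);

        ell_col.assign((size_t)num_rows * max_len, 0);
        ell_val.assign((size_t)num_rows * max_len, 0.0f);
        for (int i = 0; i < num_rows; ++i)
            for (int j = row[i]; j < row[i + 1]; ++j) {
                int k = j - row[i];
                // Column-major padding keeps per-warp loads coalesced.
                ell_col[(size_t)k * num_rows + i] = col[j];
                ell_val[(size_t)k * num_rows + i] = data[j];
            }
    }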

6. Compaction
– Pack the useful elements of a sparse or flagged data set into a dense array, so later kernels launch no idle threads; see the sketch below.
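
A minimal compaction sketch using Thrust's stream-compaction primitive (the predicate and names are illustrative):

    #include <thrust/copy.h>
    #include <thrust/device_vector.h>

    struct is_nonzero {
        __host__ __device__ bool operator()(float v) const { return v != 0.0f; }
    };

    // Keep only the nonzero elements of d_in, packed densely into d_out.
    int compact(const thrust::device_vector<float> &d_in,
                thrust::device_vector<float> &d_out) {
        d_out.resize(d_in.size());
        auto end = thrust::copy_if(d_in.begin(), d_in.end(),
                                   d_out.begin(), is_nonzero());
        d_out.resize(end - d_out.begin());
        return (int)d_out.size();
    }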

7. Data Layout Transformation
– Reorganize how data is arranged in memory so that the accesses threads make together land on contiguous addresses, e.g. converting an array of structures (AoS) to a structure of arrays (SoA); a sketch follows.
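
A sketch of the AoS-to-SoA instance of this transformation. With SoA, consecutive threads reading the same field touch consecutive addresses, which coalesces into few DRAM transactions (types and sizes illustrative):

    // AoS: thread i reading pts_aos[i].x strides by sizeof(Point),
    // so a warp's loads are scattered across DRAM.
    struct Point { float x, y, z; };
    Point pts_aos[1024];

    // SoA: thread i reading pts_soa.x[i] touches consecutive floats,
    // so a warp's loads coalesce.
    struct Points {
        float x[1024];
        float y[1024];
        float z[1024];
    };
    Points pts_soa;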

8. Granularity Coarsening
– Parallel execution often requires redundant computation and coordination work.
– Merging multiple threads into one allows reuse of results, reducing redundancy.
[Diagram: timelines of 4-way vs. 2-way parallel execution, splitting each thread's work into essential and redundant parts.]
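
A before/after sketch of 2x coarsening on a toy finite-difference kernel (names illustrative; both kernels assume in[] has n + 1 elements). The coarsened kernel loads the shared input once into a register and amortizes index math and bounds checks over two outputs, at the cost of half as many threads:

    // Before: one output per thread; neighboring threads reload in[i + 1].
    __global__ void diff1(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i + 1] - in[i];
    }

    // After 2x coarsening: two outputs per thread.
    __global__ void diff2(const float *in, float *out, int n) {
        int i = 2 * (blockIdx.x * blockDim.x + threadIdx.x);
        if (i >= n) return;
        float mid = in[i + 1];                 // loaded once, used twice
        out[i] = mid - in[i];
        if (i + 1 < n) out[i + 1] = in[i + 2] - mid;
    }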

How much faster do applications really get each hardware generation?

Unoptimized Code Has Improved Drastically
– Orders-of-magnitude speedups in many cases.
– Hardware does not solve all problems:
  – Uncoalesced memory accesses (lbm)
  – Highly contended atomics (bfs)

Optimized Code Is Improving Faster than "Peak Performance"
– Caches capture locality that the scratchpad can't capture efficiently (spmv, stencil).
– Increased local storage capacity enables extra optimization (sad).
– Some benchmarks need atomic throughput more than FLOPS (bfs, histo).

Optimization Still Matters
– Hardware never changes algorithmic complexity (cutcp).
– Caches do not solve layout problems for big data (lbm).
– Coarsening still makes a big difference (cutcp, sgemm).
– Many artificial performance cliffs are gone (sgemm, tpacf, mri-q).

Stuff We Haven't Covered
– Profiling: good tools exist that go beyond simple timing (cache misses, etc.). If you can't find out why a particular piece of code is taking so long, look into hardware performance counters.
– Patterns and practice: we covered some of the major optimization patterns, but only the basic ones. Many optimization patterns are algorithmic.

Fill Out Evaluations!