High Performance Computing (CS 540)

Presentation transcript:

High Performance Computing (CS 540): Overview and Challenge
Jeremy Johnson
Dept. of Computer Science, Drexel University

High Performance Computing Tools
- Algorithms
  - FFT (Cooley-Tukey)
  - Integer multiplication (Karatsuba, Schönhage-Strassen)
  - Matrix multiplication (blocked, Strassen, Coppersmith-Winograd)
- Compiler optimization
  - Loop unrolling, loop fusion
  - Tiling (see the sketch after this list)
  - Instruction reordering, common subexpression elimination (CSE)
- High-performance computer architecture
  - Instruction-level parallelism
  - Memory hierarchy
  - Vectorization (short vector, e.g. SSE)
  - Parallelism (multithreading, multicore, SMP, GPU)
- Autotuning
  - ATLAS, FFTW, GMP, SPIRAL
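To make the tiling entry above concrete, here is a minimal sketch in C (not from the course materials) of a blocked matrix multiplication; the block size BS is an assumed, machine-dependent tuning parameter of the kind autotuners such as ATLAS search for empirically.

    /* Blocked (tiled) matrix multiplication: C += A * B for n x n
     * row-major matrices. Tiling keeps a BS x BS working set resident
     * in cache, improving reuse over the naive triple loop.
     * BS is a hypothetical tuning parameter; it is machine-dependent. */
    #include <stddef.h>

    #define BS 64  /* assumed block size */

    static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

    void matmul_tiled(size_t n, const double *A, const double *B, double *C)
    {
        for (size_t ii = 0; ii < n; ii += BS)
            for (size_t kk = 0; kk < n; kk += BS)
                for (size_t jj = 0; jj < n; jj += BS)
                    /* multiply one BS x BS block (edges handled by min_sz) */
                    for (size_t i = ii; i < min_sz(ii + BS, n); i++)
                        for (size_t k = kk; k < min_sz(kk + BS, n); k++) {
                            double a = A[i * n + k];  /* hoist invariant load */
                            for (size_t j = jj; j < min_sz(jj + BS, n); j++)
                                C[i * n + j] += a * B[k * n + j];
                        }
    }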

The power of a good algorithm
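The slide's body is not recoverable from the transcript, but its point is illustrated by Karatsuba multiplication, listed above: replacing four half-size products with three gives O(n^log2(3)) ≈ O(n^1.585) instead of O(n^2), a gap no amount of low-level tuning closes for large n. Below is a minimal sketch for polynomial coefficients (the big-integer case adds carry propagation); the names and the base-case cutoff are illustrative assumptions.

    /* Karatsuba multiplication of two degree-(n-1) polynomials with
     * integer coefficients, n a power of two. Three half-size products
     * instead of four: T(n) = 3T(n/2) + O(n) => O(n^1.585).
     * res has length 2n-1. Illustrative sketch only. */
    #include <stdlib.h>
    #include <string.h>

    static void karatsuba(const long *a, const long *b, long *res, size_t n)
    {
        if (n <= 4) {                       /* base case: schoolbook O(n^2) */
            memset(res, 0, (2 * n - 1) * sizeof *res);
            for (size_t i = 0; i < n; i++)
                for (size_t j = 0; j < n; j++)
                    res[i + j] += a[i] * b[j];
            return;
        }
        size_t h = n / 2;
        long *lo  = calloc(2 * h - 1, sizeof *lo);   /* a_lo * b_lo */
        long *hi  = calloc(2 * h - 1, sizeof *hi);   /* a_hi * b_hi */
        long *mid = calloc(2 * h - 1, sizeof *mid);  /* (a_lo+a_hi)(b_lo+b_hi) */
        long *as = malloc(h * sizeof *as), *bs = malloc(h * sizeof *bs);
        for (size_t i = 0; i < h; i++) {
            as[i] = a[i] + a[i + h];
            bs[i] = b[i] + b[i + h];
        }
        karatsuba(a, b, lo, h);             /* three recursive products */
        karatsuba(a + h, b + h, hi, h);
        karatsuba(as, bs, mid, h);
        memset(res, 0, (2 * n - 1) * sizeof *res);
        for (size_t i = 0; i < 2 * h - 1; i++) {
            res[i]         += lo[i];
            res[i + h]     += mid[i] - lo[i] - hi[i];  /* middle term */
            res[i + 2 * h] += hi[i];
        }
        free(lo); free(hi); free(mid); free(as); free(bs);
    }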

Matrix Multiplication Performance
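The performance plot itself is not in the transcript. As a hedged sketch of how such numbers might be gathered, the harness below times the naive triple loop against the tiled version sketched earlier (matmul_tiled); the matrix size, initialization, and timing method are all assumptions.

    /* Timing harness comparing naive vs. tiled matmul in GFLOP/s.
     * Assumes matmul_tiled from the earlier sketch is linked in.
     * Absolute numbers vary by machine, which is the slide's point. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    void matmul_tiled(size_t n, const double *A, const double *B, double *C);

    static void matmul_naive(size_t n, const double *A, const double *B,
                             double *C)
    {
        for (size_t i = 0; i < n; i++)
            for (size_t j = 0; j < n; j++) {
                double s = 0.0;
                for (size_t k = 0; k < n; k++)
                    s += A[i * n + k] * B[k * n + j];
                C[i * n + j] += s;
            }
    }

    int main(void)
    {
        size_t n = 1024;                       /* assumed problem size */
        double *A = malloc(n * n * sizeof *A);
        double *B = malloc(n * n * sizeof *B);
        double *C = calloc(n * n, sizeof *C);
        for (size_t i = 0; i < n * n; i++) { A[i] = 1.0; B[i] = 2.0; }

        clock_t t0 = clock();
        matmul_naive(n, A, B, C);
        double naive_s = (double)(clock() - t0) / CLOCKS_PER_SEC;

        t0 = clock();
        matmul_tiled(n, A, B, C);
        double tiled_s = (double)(clock() - t0) / CLOCKS_PER_SEC;

        double flops = 2.0 * n * n * n;        /* one mul + one add each */
        printf("naive: %.2f GFLOP/s  tiled: %.2f GFLOP/s\n",
               flops / naive_s / 1e9, flops / tiled_s / 1e9);
        free(A); free(B); free(C);
        return 0;
    }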

Challenge of Obtaining Efficient Code
Approximate speedups from exploiting each hardware feature:
- Multiple threads: 2x
- Vector instructions: 3x
- Memory hierarchy: 5x
High-performance library development has become a nightmare.
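As one concrete instance of the vector-instruction factor above, here is a hedged SSE2 sketch (not from the slides) of a daxpy-style loop, y += a*x, processing two doubles per instruction; the function name and the choice of unaligned loads are assumptions.

    /* SSE2 sketch of the "vector instructions" factor: a daxpy-style
     * loop processing two doubles per instruction. Illustrative only;
     * the function name and unaligned loads are assumptions. */
    #include <emmintrin.h>  /* SSE2 intrinsics */
    #include <stddef.h>

    void daxpy_sse(size_t n, double a, const double *x, double *y)
    {
        __m128d va = _mm_set1_pd(a);         /* broadcast a to both lanes */
        size_t i = 0;
        for (; i + 2 <= n; i += 2) {         /* vectorized body: 2 doubles */
            __m128d vx = _mm_loadu_pd(x + i);
            __m128d vy = _mm_loadu_pd(y + i);
            vy = _mm_add_pd(vy, _mm_mul_pd(va, vx));
            _mm_storeu_pd(y + i, vy);
        }
        for (; i < n; i++)                   /* scalar remainder */
            y[i] += a * x[i];
    }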