Presentation transcript:

Is Fortran getting further and further away from the Hardware Architecture?
John Levesque, Director, Cray Supercomputing Center of Excellence

Future hardware directions
- Flop-rich environment
  - More flops per clock
  - More cores per socket
  - More threads per core
- The memory wall is getting thicker and thicker
  - Increased importance of cache re-use
  - Effective use of hardware and software prefetching

Future Compiler/User Concerns
- Flop-rich environment: more flops per clock, more cores per socket
  - Compilers must vectorize more code; vectorization is no longer a concern for Cray and NEC alone. The new DO CONCURRENT construct will help.
  - Compilers and users must employ threads across the cores to mitigate memory bandwidth issues
    - OpenMP and shared caches?
    - Pthreads to hide memory latency (a sketch of both approaches follows)
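
A minimal sketch of the two approaches above, assuming a simple axpy-style kernel (the routine and array names are illustrative, not from the talk). DO CONCURRENT (Fortran 2008) asserts iteration independence so the compiler may vectorize; the OpenMP form spreads the same loop across the cores of a socket.

    subroutine axpy_variants(n, a, x, y)
      implicit none
      integer, intent(in)    :: n
      real,    intent(in)    :: a, x(n)
      real,    intent(inout) :: y(n)
      integer :: i

      ! Fortran 2008: assert that the iterations are independent,
      ! giving the compiler license to vectorize the loop.
      do concurrent (i = 1:n)
        y(i) = y(i) + a * x(i)
      end do

      ! OpenMP: spread the same loop across the cores of a socket,
      ! one way to mitigate per-core memory bandwidth limits.
      !$omp parallel do
      do i = 1, n
        y(i) = y(i) + a * x(i)
      end do
      !$omp end parallel do
    end subroutine axpy_variants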

Future Compiler/User Concerns
- Increased importance of cache re-use
  - Users must take care to block loops so as to use the multiple levels of cache more effectively (a sketch follows)
  - Not trivial
  - May not be portable: different machines have different cache sizes
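
A sketch of the loop blocking described above, using a matrix transpose as the stock example (the routine is illustrative; the block size is an assumption that must be tuned to each machine's cache sizes, which is precisely the portability problem the slide notes).

    subroutine blocked_transpose(n, a, b)
      implicit none
      integer, intent(in)  :: n
      real,    intent(in)  :: a(n, n)
      real,    intent(out) :: b(n, n)
      integer, parameter :: bs = 64   ! block size: a machine-dependent guess
      integer :: ii, jj, i, j

      ! Walk the matrix in bs-by-bs tiles so that both the unit-stride
      ! reads of a and the strided writes of b stay resident in cache.
      do jj = 1, n, bs
        do ii = 1, n, bs
          do j = jj, min(jj + bs - 1, n)
            do i = ii, min(ii + bs - 1, n)
              b(j, i) = a(i, j)
            end do
          end do
        end do
      end do
    end subroutine blocked_transpose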

Future Compiler/User Concerns
- The memory wall
  - Users must avoid striding through memory, storing values that need not be stored, and so on
  - Never, never use an array when a scalar will do (see the example below)
  - Don't pass array sections
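
A small illustration of "never use an array when a scalar will do" (the dot-product loop is invented for illustration): the array temporary forces every partial result out through memory and back, while the scalar accumulator can live in a register.

    subroutine dot_two_ways(n, x, y, s)
      implicit none
      integer, intent(in)  :: n
      real,    intent(in)  :: x(n), y(n)
      real,    intent(out) :: s
      real    :: tmp(n)   ! array temporary: needless memory traffic
      integer :: i

      ! Wasteful form: every partial product is stored to memory ...
      do i = 1, n
        tmp(i) = x(i) * y(i)
      end do
      s = sum(tmp)        ! ... and read back in a second pass.

      ! Preferred form: the scalar accumulator stays in a register.
      s = 0.0
      do i = 1, n
        s = s + x(i) * y(i)
      end do
    end subroutine dot_two_ways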

Fortran features that ignore hardware issues and performance
- Derived types
  - Inhibit many compilers' ability to vectorize DO loops
  - Can help cache re-use
- Modules
  - Inhibit the compiler and the user from scoping variables, e.g. scoping a data element from a module as private; COMMON blocks have the same problem
  - Sub-modules may help
- Array-section passing
  - Incurs increased data motion to and from memory at the subroutine boundary
- Array assignments
  - Imply that the operation is performed over the full extent of the array (see the example below)
  - Very cache-unfriendly
  - It takes a good compiler to solve this
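
A hedged example of the array-assignment point (the arrays are invented for illustration): written in array syntax, each statement sweeps the full extent of the arrays, so b is streamed through the cache twice; a compiler must fuse the two statements to recover the single-pass loop a programmer would write by hand.

    subroutine fused_vs_array(n, a, c, b, d)
      implicit none
      integer, intent(in)  :: n
      real,    intent(in)  :: a(n), c(n)
      real,    intent(out) :: b(n), d(n)
      integer :: i

      ! Array syntax: two full sweeps over the data unless the
      ! compiler fuses them; b falls out of cache between passes.
      b = a + c
      d = b * 2.0

      ! Fused loop: one pass; each b(i) is reused while still hot.
      do i = 1, n
        b(i) = a(i) + c(i)
        d(i) = b(i) * 2.0
      end do
    end subroutine fused_vs_array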

What can be done in the language?
- A DONTNEEDTOSTORE directive (a hypothetical sketch follows)
- Advisories to users about the performance pitfalls of certain Fortran statements
- PLEASE, PLEASE don't say it is up to the compiler. Some compilers can deal with these issues, but not all of them can, and performance portability goes out the window.
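
A purely hypothetical sketch of how the proposed directive might be spelled (DONTNEEDTOSTORE exists in no compiler; the !DIR$ spelling and placement are invented here): it would tell the compiler that an intermediate array is consumed immediately and need not be stored back to memory.

    subroutine scaled_update(n, a, b)
      implicit none
      integer, intent(in)    :: n
      real,    intent(in)    :: a(n)
      real,    intent(inout) :: b(n)
      real :: t(n)

      ! Hypothetical directive (not real syntax in any compiler):
      ! assert that t is only an intermediate value, so the compiler
      ! may keep it in registers instead of writing it to memory.
      !DIR$ DONTNEEDTOSTORE t
      t = a * 2.0
      b = b + t
    end subroutine scaled_update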