High Performance on the J90 Systems
David Turner & Tom DeBoni
NERSC User Services Group
April 1999

Philosophical Ramblings
Design for optimization? Where to start? When to stop?

J90 Potential
STREAM benchmark results: sustainable memory bandwidth (John McCalpin, SGI)

Kernel   Operation           bytes/iter   FLOPS/iter
COPY     a(i)=b(i)               16            0
TRIAD    a(i)=b(i)+q*c(i)        24            2
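Written out as Fortran loops, the two kernels look like this (a minimal sketch; the array names and the scalar Q follow the slide's notation):

      ! COPY: one 8-byte load and one 8-byte store per iteration
      ! (16 bytes moved), no floating-point work
      DO I = 1, N
         A(I) = B(I)
      ENDDO

      ! TRIAD: two loads and one store (24 bytes), plus one multiply
      ! and one add (2 FLOPS) per iteration
      DO I = 1, N
         A(I) = B(I) + Q*C(I)
      ENDDO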

STREAM Results
COPY and TRIAD bandwidths and MFLOPS at several CPU counts for Cray_C90, Cray_J90, Cray_T3E, SGI_Origin_2K, and Sun_UE. [numeric values lost in transcription]

STREAM Results (cont.)
Single-CPU COPY, TRIAD, and MFLOPS figures for Cray_C90, Cray_J90, Compaq_AlphaServer_DS, IBM_RS, Cray_T3E, SGI_Origin_2K, Generic_440BX, Sun_Ultra, Sun_UE, and Apple_Mac_G3. [numeric values lost in transcription]

Tools
f90 (with lots of options)
ja ; ./name ; ja -cst -n name   (job accounting)
hpm   (hardware performance monitor)
prof
flowview
atexpert

Program “SLOW”

      PROGRAM SLOW
      IMPLICIT NONE
      INTEGER, PARAMETER :: DIMSIZE = ...   ! value lost in transcription
      REAL, DIMENSION(DIMSIZE) :: X, Y, Z
      INTEGER :: I, J
      X = RANF()
      Y = RANF()
      DO J = 1, 10
         DO I = 1, DIMSIZE
            Z(I) = LOG(SIN(X(I))**2 + COS(Y(I))**4)
         END DO
         PRINT *, Z(DIMSIZE-1)
      END DO
      STOP
      END PROGRAM SLOW

No Optimization

f90 -O0 -r6 -O,msgs,negmsgs -o slow slow.f90

      x = RANF()
cf90: VECTOR SLOW, File = slow.f90, Line = 8
  A loop starting at line 8 was vectorized.
      y = RANF()
cf90: VECTOR SLOW, File = slow.f90, Line = 9
  A loop starting at line 9 was vectorized.

Moderate Optimization

f90 -O1 -r6 -O,msgs,negmsgs -o slow slow.f90

      do j = 1, 10
cf90: VECTOR SLOW, File = slow.f90, Line = 10
  A loop starting at line 10 was not vectorized because it contains
  input/output operations at line 14.
      DO i = 1, DIMSIZE
cf90: VECTOR SLOW, File = slow.f90, Line = 11
  A loop starting at line 11 was vectorized.
      z(i) = LOG(SIN(x(i))**2 + COS(y(i))**4)
cf90: SCALAR SLOW, File = slow.f90, Line = 12
  An exponentiation was replaced by optimization.  This may cause
  numerical differences.
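The exponentiation message means the compiler strength-reduced the small integer powers into multiplies; conceptually (an illustrative hand-expansion, not actual compiler output), line 12 becomes:

      S = SIN(x(i))
      C = COS(y(i))
      z(i) = LOG(S*S + (C*C)*(C*C))   ! **2 and **4 done with multiplies

Because the multiply sequence does not round identically to the library power routine, results can differ in the last bits, hence the warning about numerical differences.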

High Optimization

f90 -O3 -r6 -O,msgs,negmsgs -o slow slow.f90

cf90: TASKING SLOW, File = slow.f90, Line = 10
  A loop starting at line 10 was not tasked because it contains
  input/output operations at line 14.
cf90: TASKING SLOW, File = slow.f90, Line = 11
  A loop starting at line 11 was tasked.

Optimization Results

Opt   NCPUS   Elapsed   User   Sys
[timing values lost in transcription]

2 CPU Speedup

Concurrent CPUs * Connect seconds = CPU seconds   (one line per CPU)
Concurrent CPUs (Avg.) * Connect seconds (total) = CPU seconds (total)
[numeric values lost in transcription]

3 CPU Speedup

Concurrent CPUs * Connect seconds = CPU seconds   (one line per CPU)
Concurrent CPUs (Avg.) * Connect seconds (total) = CPU seconds (total)
[numeric values lost in transcription]

4 CPU Speedup

Concurrent CPUs * Connect seconds = CPU seconds   (one line per CPU)
Concurrent CPUs (Avg.) * Connect seconds (total) = CPU seconds (total)
[numeric values lost in transcription]
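To read these tables: average concurrency is total CPU seconds divided by total connect seconds, and that ratio is the effective speedup. For example (hypothetical numbers; the slides' own values were lost), a run that accumulates 140 CPU seconds over 40 connect seconds has an average concurrency of 140 / 40 = 3.5, so on 4 CPUs it achieves roughly a 3.5x speedup, or about 88% parallel efficiency.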

Useful F90 Options
-e (0 or i)          initializes storage, or flags use of uninitialized variables
-e n                 flags nonstandard Fortran usage
-e v                 makes all variables static
-g                   same as -G0
-G (0 or 1)          sets debugging level to statement or block
-m (0 - 4)           message verbosity (0 gives most output)
-N (72, 80, or 132)  source line length
-O                   optimization levels 0, 1, 2, 3, plus aggress, fastint,
                     msgs, negmsgs, inline(0-3), scalar(0-3), task(0-3), vector(0-3)
-r (0-6, ...)        listing levels (6 is EVERYthing)
-R (a, b, c)         runtime checking: args, array bounds, indexing
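As an illustration (a made-up combination, not from the slides), a debugging build might combine several of these:

f90 -G1 -ei -Rabc -m1 -N80 -o slow slow.f90

while the performance runs in this talk use lines like f90 -O3 -r6 -O,msgs,negmsgs -o slow slow.f90.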

Using flowtrace/flowview

f90 -O1 -ef -o slow slow.f90
./slow
flowview -Luch > slow.flow

Routine   Tot Time   Percentage   Accum%
SUB2      5.66E...
SUB1      2.43E...
SLOW      1.11E...
[exponents and percentage columns lost in transcription]

Using prof

f90 -O1 -l prof -o slow slow.f90
./slow
prof -x ./slow > slow.prof
profview slow.prof

profview Output
[screenshot not reproduced in transcript]

Optimization Strategies
- First, let the compiler do it
- Vectorize and scalar-optimize, then parallelize
- Vectorization can give you a factor-of-10 speedup
- Scalar optimization can improve performance by ...% [value lost in transcription]
- Parallelism gives you at most a linear speedup
- Memory contention inhibits gains from parallelism
- Let the compiler advise you
- Add directives where appropriate
- Be sure you tell the truth (directives assert facts the compiler cannot verify)
- Check your answers

Scalar Optimization
- Subroutine or function inlining
- Fast (32-bit) integers: -Ofastint, -Oallfastint
- Use INTERFACE specifications if passing array sections (see the sketch below)
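A minimal sketch of the INTERFACE point (the routine name and shapes are hypothetical): with an explicit interface and assumed-shape dummies, the compiler can pass a non-contiguous array section by descriptor instead of copying it to a contiguous temporary on every call:

      INTERFACE
         SUBROUTINE CALC(X, Y, Z)
            REAL :: X(:), Y(:), Z(:)   ! assumed-shape dummies
         END SUBROUTINE CALC
      END INTERFACE
      ...
      CALL CALC(A(1:N:2), B(1:N:2), C(1:N:2))   ! strided sections, no copy-in/copy-out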

Vectorization
[slide content not captured in transcript]

Inhibitors to Vectorization
- Function or subroutine references: inline the routine, push the loop into it, or split the loop
- Backwards data dependencies: reorder the loop, or use a temporary vector (see the sketch below)
- I/O statements
- Character or bit manipulations
- Branches into the loop, or backward branches out of the loop
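A minimal sketch of the temporary-vector fix (hypothetical arrays; the call-reference inhibitor is treated on the next slides). Here each iteration reads an element that a later iteration overwrites, which the compiler may treat as a dependence and refuse to vectorize:

      DO I = 1, N-1
         A(I) = A(I+1) + B(I)
      ENDDO

Copying the values read "ahead" into a scratch vector first leaves two clean, vectorizable loops:

      DO I = 1, N-1
         T(I) = A(I+1)
      ENDDO
      DO I = 1, N-1
         A(I) = T(I) + B(I)
      ENDDO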

Nonvectorizable Code

      DO I = 1, N
         CALL CALC(X(I), Y(I), Z(I))
      ENDDO
      ...
      SUBROUTINE CALC(X, Y, Z)
      Z = ALOG(SQRT((SIN(X) * COS(Y)) ** X))
      RETURN
      END

Inlining

      DO I = 1, N
         Z(I) = ALOG(SQRT((SIN(X(I))*COS(Y(I)))**X(I)))
      ENDDO

Pushing

      CALL CALC(X, Y, Z, N)
      ...
      SUBROUTINE CALC(X, Y, Z, N)
      DIMENSION X(N), Y(N), Z(N)
      DO I = 1, N
         Z(I) = ALOG(SQRT((SIN(X(I))*COS(Y(I)))**X(I)))
      ENDDO
      RETURN
      END

Splitting

      DO I = 1, N
         A(I) = ABS(CALC(C(I)))
         B(I) = A(I) ** T * SQRT(C(I))
         A(I) = SIN(ALOG(C(I)))
      ENDDO

Splitting (cont.)

      EXTERNAL CALC
      DO I = 1, N
         A(I) = ABS(CALC(C(I)))
      ENDDO
      DO I = 1, N
         B(I) = A(I) ** T * SQRT(C(I))
         A(I) = SIN(ALOG(C(I)))
      ENDDO

Scalar Recurrence

      DIMENSION A(1000), C(1000)
      DO J = 1, M
         S = BB
         DO I = 1, N
            S = S * C(I)
            A(I) = A(I) + S
         ENDDO
      ENDDO

Loop starting at line 7 was unrolled 16 times.

Scalar Recurrence (cont.)

      DIMENSION A(1000), C(1000), S(1000)
      DO I = 1, M
         S(I) = BB
      ENDDO
      DO I = 1, N
         DO J = 1, M
            S(J) = S(J) * C(I)
            A(I) = A(I) + S(J)
         ENDDO
      ENDDO

Loop starting at line 5 was unrolled 2 times.
A loop starting at line 5 was vectorized.
A loop starting at line 9 was vectorized.

Compiler Vector Directives
CDIR$ directive  or  !DIR$ directive

VECTOR, NOVECTOR   Turn vectorization on or off until end of program unit.
IVDEP              Ignore vector dependencies in next loop.
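For example (a hypothetical indexed update): with an index vector IX, the compiler must assume two iterations might touch the same element of A, so it will not vectorize. If the programmer knows the IX(I) values are all distinct, IVDEP asserts that the loop is safe:

CDIR$ IVDEP
      DO I = 1, N
         A(IX(I)) = A(IX(I)) + B(I)
      ENDDO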

Parallel Computing
- Multitasking, microtasking, autotasking, parallel processing, multiprocessing, etc.
- This is “fine-grained” parallelism: the parallelism mostly comes from loop slicing
- One possible goal: parallelize outer loop(s), vectorize inner loop(s) (see the sketch below)
- F90 is capable of autotasking, but it can always benefit from help
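A minimal sketch of that goal, using the deck's own !MIC$ syntax on a hypothetical loop nest: the outer J loop is sliced across CPUs while each CPU runs the inner I loop in vector mode:

!MIC$ DOALL SHARED(A,B,N,M), PRIVATE(I,J)
      DO J = 1, M              ! tasked: iterations divided among CPUs
         DO I = 1, N           ! vectorized on each CPU
            A(I,J) = 2.0 * B(I,J)
         ENDDO
      ENDDO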

Parallelism
[slide content not captured in transcript]

Parallelism, cont.
[slide content not captured in transcript]

Data “Scoping”

      DIMENSION A(N)
      SUM = 0.0
      DO I = 1, N
         TEMP = DEEP_THOUGHT(A,I)
         SUM = SUM + TEMP * A(I)
      ENDDO

A, N      Shared, read-only everywhere
I, TEMP   Private, read-write everywhere
SUM       Shared, read-write everywhere

Compiler Tasking Directives

      DIMENSION A(N)
      SUM = 0.0
!MIC$ DOALL SHARED(A,N), PRIVATE(I,TEMP)
      DO I = 1, N
         TEMP = DEEP_THOUGHT(A,I) * A(I)
!MIC$ GUARD
         SUM = SUM + TEMP
!MIC$ ENDGUARD
      ENDDO

Threshold Test

      DIMENSION A(N)
      SUM = 0.0
!MIC$ DOALL VECTOR
!MIC$ IF(N.GT.1000)
!MIC$ SHARED(A,N), PRIVATE(I,TEMP)
      DO I = 1, N
         TEMP = DEEP_THOUGHT(A,I)
!MIC$ GUARD
         SUM = SUM + TEMP * A(I)
!MIC$ ENDGUARD
      ENDDO

Helping F90 with Parallelism

      DIMENSION A(N), SUM(NumTasks)
!MIC$ DOALL SHARED(A,N), PRIVATE(J,I,TEMP)
      DO J = 1, NumTasks
         SUM(J) = 0.0
!MIC$ CNCALL
         DO I = 1, N
            SUM(J) = SUM(J) + DEEP_THOUGHT(A,I,J) * A(I)
         ENDDO
      ENDDO
      TSUM = 0.0
      DO J = 1, NumTasks
         TSUM = TSUM + SUM(J)
      ENDDO

Helping F90 with Directives

Useful compiler directives for tasking:
  CASE, ENDCASE
  CNCALL
  DOALL
  DOPARALLEL, ENDDO
  GUARD, ENDGUARD
  MAXCPUS
  NUMCPUS
  PERMUTATION
  PARALLEL, ENDPARALLEL

These all begin with !MIC$
NOTE: There are also OpenMP directives (see the sketch below)...
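For comparison, the Compiler Tasking Directives example above written with OpenMP directives (a sketch; the OpenMP reduction clause replaces the GUARD/ENDGUARD critical section):

!$OMP PARALLEL DO SHARED(A,N), PRIVATE(I,TEMP), REDUCTION(+:SUM)
      DO I = 1, N
         TEMP = DEEP_THOUGHT(A,I) * A(I)
         SUM = SUM + TEMP
      ENDDO
!$OMP END PARALLEL DO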

Helping F90 with Directives, cont.

Directive parameters:
  AUTOSCOPE, IF, MAXCPUS, PRIVATE, SAVELAST, SHARED

Directive work distribution:
  CHUNKSIZE, GUIDED, NCPUS_CHUNKS, NUMCHUNKS, SINGLE, VECTOR

These all augment !MIC$ directives.
NOTE: There are also OpenMP directive parameters...

atexpert

f90 -eX -O3 -r6 -o slow slow.f90
setenv NCPUS 1
./slow
atexpert

atexpert Output
[screenshot not reproduced in transcript]

atexpert Output, cont.
[screenshot not reproduced in transcript]

atexpert Output, cont.
[screenshot not reproduced in transcript]