1 TACC KNL Training
Carlos Rosales, John Cazes, Kent Milfeld

2 Hands-on Session 1: KNL Introduction Lab
Kent Milfeld

3 Login
Usernames and passwords will be handed out before the lab.
Login to Stampede:
$ ssh <username>@login-knl1.stampede.tacc.utexas.edu
(The hostname contains a lower case L (l) followed by the number one (1).)
This is a Haswell login node. You can compile code here and submit batch jobs, but we will get interactive access to a KNL compute node and compile and execute directly on a KNL processor in just a bit!

4 Get Files for Compiling
Untar all lab files into your directory:
$ tar xvf ~train00/knl_labs.tar
Move into the newly created knl_intro directory:
$ cd knl_intro

5 List Queues
Display the list of SLURM queues (partitions):
$ sinfo          # list includes nodes
$ idev -queues   # just the list of queues
Each queue has nodes that are configured for a specific Memory Mode and Clustering. This is the list you will see (more about this in class):
Queues (Partitions): Mem. Mode - Clustering
Flat-All2All*
Cache-All2All
Flat-Quadrant
Cache-Quadrant
Flat-SNC-4
Cache-SNC-4
*Flat-All2All is the default partition (queue).

6 Access a KNL node interactively
Interactively access a KNL node with idev:
$ idev           # default: 30 min, Flat-All2All node
  ...            # SLURM reporting; when you have access,
c###-###...$     # an interactive prompt will appear
The c###-### in the new prompt is the KNL node you are on (you can even ssh to this node in another window; see the last slide if you want 2 windows).

7 Access a KNL node interactively
Let's look at the hardware, and some useful commands:
$ grep processor /proc/cpuinfo        # What is the processor count?
$ hwloc-ls -l                         # How many logical cores are there? (See L#.)
$ hwloc-ls -p                         # Which physical cores have been disabled? (See P#.)
What is the Memory Mode and Clustering for this node?
$ sinfo --format="%.12P %.5a %.6D"    # Queue names (mem. mode-clustering), up/down, node counts
See the load on each cpu:
$ top                                 # Hit the 1 key (no return); the load of each cpu is shown.
                                      # Resize the terminal to see more cpus! (Hit q to quit.)

8 Compile and Run OpenMP code
Remember to compile with -xMIC-AVX512 (on compute nodes AND login nodes):
$ ifort -qopenmp -xMIC-AVX512 omp_hello.F90 -o omp_hello
or
$ icc -qopenmp -xMIC-AVX512 omp_hello.c -o omp_hello
$ export OMP_NUM_THREADS=68 OMP_DISPLAY_ENV=TRUE
$ ./omp_hello | sort
(OMP_DISPLAY_ENV=TRUE will show OpenMP details at the beginning of the run.)
Repeat with 4 threads per core:
$ export OMP_NUM_THREADS=272
$ ./omp_hello | sort
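The lab's omp_hello source is not reproduced in this transcript; a minimal sketch of what such a hello program typically looks like (the exact output format is an assumption) is:

/* omp_hello.c -- minimal OpenMP hello sketch (assumed; not the actual lab file) */
#include <stdio.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel
    {
        int tid  = omp_get_thread_num();    /* this thread's id            */
        int nthr = omp_get_num_threads();   /* total threads in the team   */
        printf("Hello from thread %3d of %3d\n", tid, nthr);
    }
    return 0;
}

Piping the output through sort, as in the lab commands, simply puts the thread ids in order.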

9 Compile and Run MPI code
Remember to compile with -xMIC-AVX512 (on compute nodes AND login nodes):
$ mpif90 -xMIC-AVX512 mpi_hello.F90 -o mpi_hello
or
$ mpicc -xMIC-AVX512 mpi_hello.c -o mpi_hello
$ mpiexec.hydra -np 68 ./mpi_hello | sort
or
$ mpirun -n 68 ./mpi_hello | sort
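Likewise, mpi_hello is not shown in this transcript; a minimal sketch of an MPI hello program (output format assumed) is:

/* mpi_hello.c -- minimal MPI hello sketch (assumed; not the actual lab file) */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's rank      */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of ranks    */
    printf("Hello from rank %3d of %3d\n", rank, size);
    MPI_Finalize();
    return 0;
}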

10 Vector Performance
Compile vector.c as a native MIC application:
$ icc -qopenmp -O3 -xMIC-AVX512 ./vector.c -o vec
And also as a MIC application, but with vectorization disabled:
$ icc -qopenmp -O3 -xMIC-AVX512 -no-vec ./vector.c -o novec
Run both executables and take note of the timing difference. How much speedup comes from the vectorization? Does this make sense given what you have learned about the KNL architecture?
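vector.c itself is not included in this transcript, but the optimization reports on the next two slides refer to arrays x, y, and z, rand()-based initialization loops, and a simple vectorizable main loop. A sketch consistent with those reports (the array size, repetition count, and the arithmetic are assumptions) is:

/* vector.c -- sketch consistent with the optimization reports on the next
 * slides; array size, repetitions, and the arithmetic are assumptions.     */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N    32768
#define REPS 100000

static double x[N], y[N], z[N];

int main(void)
{
    int i;

    /* Initialization loops: the rand() call prevents vectorization here. */
    for (i = 0; i < N; i++) x[i] = (double)rand() / RAND_MAX;
    for (i = 0; i < N; i++) y[i] = (double)rand() / RAND_MAX;

    double t0 = omp_get_wtime();
    for (int r = 0; r < REPS; r++) {
        /* Main loop: unit stride, no function calls -- vectorizable. */
        for (i = 0; i < N; i++)
            z[i] = x[i] + y[i];
        x[0] += z[N - 1] * 1.0e-12;   /* keep the compiler from hoisting the work */
    }
    double t1 = omp_get_wtime();

    printf("z[0] = %f  time = %f s\n", z[0], t1 - t0);
    return 0;
}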

11 Vector Reports (I)
Let's get some information about the vectorization in this example code. Compile the code again, but add a basic optimization report option to the compilation line:
$ icc -qopenmp -O3 -xMIC-AVX512 -qopt-report=2 ./vector.c -o vec
This will generate a report file called vector.optrpt. Open the optimization report file with your favorite text editor, or simply cat the contents to the screen:
$ cat ./vector.optrpt

12 Vector Reports (II)
There is a lot of information in the optimization report file. We find out that our array initialization can't be vectorized because we call an external function, rand(), on lines 34 and 35 of the example:
LOOP BEGIN at ./vector.c(34,2)
   remark #15527: loop was not vectorized: function call to rand(void) cannot be vectorized   [ ./vector.c(34,33) ]
LOOP END
LOOP BEGIN at ./vector.c(35,2)
   remark #15527: loop was not vectorized: function call to rand(void) cannot be vectorized   [ ./vector.c(35,33) ]
LOOP END
But the main loop has been vectorized:
LOOP BEGIN at ./vector.c(45,3)
   remark #15300: LOOP WAS VECTORIZED
LOOP END

13 Vector Reports (III)
Let's use a higher reporting level in order to find out more about the quality of the main loop vectorization:
$ icc -qopenmp -O3 -xMIC-AVX512 -qopt-report=4 ./vector.c -o vec
LOOP BEGIN at ./vector.c(45,3)
   remark #15388: vectorization support: reference x has aligned access   [ ./vector.c(46,4) ]
   remark #15388: vectorization support: reference y has aligned access   [ ./vector.c(46,4) ]
   remark #15388: vectorization support: reference z has aligned access   [ ./vector.c(46,4) ]
   remark #15305: vectorization support: vector length 8
   remark #15399: vectorization support: unroll factor set to 8
   remark #15300: LOOP WAS VECTORIZED
   remark #15448: unmasked aligned unit stride loads: 2
   remark #15449: unmasked aligned unit stride stores: 1
   remark #15475: --- begin vector loop cost summary ---
   remark #15476: scalar loop cost: 11
   remark #15477: vector loop cost: 0.620
   remark #15478: estimated potential speedup:
   remark #15488: --- end vector loop cost summary ---
   remark #25015: Estimate of max trip count of loop=4
LOOP END
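The "aligned access" remarks mean the compiler could assume 64-byte alignment for x, y, and z. How vector.c arranges this is not shown; a small sketch of two common ways to obtain 64-byte-aligned arrays (both are assumptions, not necessarily what vector.c does) is:

/* Sketch: obtaining 64-byte-aligned arrays so an AVX-512 compiler can emit
 * aligned loads/stores. Whether vector.c does it this way is an assumption. */
#include <stdio.h>
#include <stdlib.h>

#define N 32768

/* Option 1: static array with an alignment attribute */
static double xs[N] __attribute__((aligned(64)));

int main(void)
{
    /* Option 2: aligned heap allocation (POSIX) */
    double *xh = NULL;
    if (posix_memalign((void **)&xh, 64, N * sizeof(double)) != 0)
        return 1;

    printf("static array at %p\n", (void *)xs);
    printf("heap   array at %p\n", (void *)xh);
    free(xh);
    return 0;
}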

14 Hands-on Session 2: Vectorization Reports and Memory Configuration
Carlos Rosales

15 Setup
Login to Stampede:
$ ssh <username>@login-knl1.stampede.tacc.utexas.edu
(The hostname contains a lower case L (l) followed by the number one (1).)
Move into the directory:
$ cd knl_mem
Obtain an interactive session on Stampede:
$ idev -p Flat-All2All
The idev command will start an interactive session on a KNL compute node.

16 Memory configuration
If you are still on a compute node (c###-### in your prompt), please exit now, and also exit after every test in this exercise so that you run each test in the specified queue. In this exercise you will submit interactive jobs to queues with KNL nodes in different memory configurations. Exit each session after the exercise is completed.

17 Cache (All-to-all / Quadrant)
Get an interactive session on a node with the Cache-All2All configuration:
$ idev -p Cache-All2All
Once the session starts, check the NUMA configuration:
$ numactl -H
available: 1 nodes (0)
node 0 cpus: 0 1 2 ... 271
node 0 size: … MB
All of the DDR4 (~96 GB) appears as the single NUMA node 0; in Cache mode the MCDRAM acts as a cache and is not visible as a NUMA node.
Note: the Cache-Quadrant configuration will look exactly the same.

18 Cache (SubNUMA-4)
Get an interactive session on a node with the Cache-SNC-4 configuration and check its configuration:
$ idev -p Cache-SNC-4
$ numactl -H
available: 4 nodes (0-3)
node 0 cpus: …
node 0 size: … MB
…
node 3 cpus: …
node 3 size: … MB
Important features:
4 NUMA nodes.
All NUMA nodes are assigned to the DDR4, since we are using Cache mode.
Memory is equally distributed across the NUMA nodes ( 96 GB / 4 = 24 GB ).
Nodes 0 and 1 have 18 CPU cores associated with them, while nodes 2 and 3 have 16 (no tile splitting is allowed).

19 Flat (All-to-all / Quadrant)
Get an interactive session on a node with the Flat-All2All configuration and check its configuration:
$ idev -p Flat-All2All
$ numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 ... 271
node 0 size: … MB
node 1 cpus:
node 1 size: … MB
Important features:
MCDRAM is NUMA node 1 ( 16 GB ) and DDR4 is NUMA node 0 ( 96 GB, the default allocation target ).
The CPUs associated with NUMA node 0 are also associated with NUMA node 1 (the MCDRAM node has no CPUs of its own).

20 Flat (SubNUMA-4)
Get an interactive session on a node with the Flat-SNC-4 configuration and check its configuration:
$ idev -p Flat-SNC-4
$ numactl -H
available: 8 nodes (0-7)
node 0 cpus: …
node 0 size: … MB
…
node 3 cpus: …
node 3 size: … MB
node 4 cpus:
node 4 size: 4096 MB
…
node 7 cpus:
node 7 size: 4096 MB
Important features:
Eight NUMA nodes: (0,1,2,3) are DDR4 and (4,5,6,7) are MCDRAM.
Memory is equally split across the NUMA nodes ( 96/4 = 24 GB DDR4, 16/4 = 4 GB MCDRAM ).
The CPUs for NUMA node 0 are also associated with NUMA node 4, etc.

21 Allocating in MCDRAM (I)
The STREAM benchmark is the industry standard for measuring memory bandwidth. In this exercise you will run it on both DDR4 and MCDRAM and look at the difference in performance.
Make sure you are on the login node. Change directory into the stream benchmark:
$ cd stream
Build the executable:
$ icc -qopenmp -O2 -xMIC-AVX512 ./stream.c -o stream
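stream.c here is the standard STREAM source, which is not reproduced in this transcript. Its bandwidth numbers come from timing a few simple array kernels; a reduced sketch of just the triad kernel (the array size and scalar are assumptions, and the real benchmark also times copy, scale, and add, repeats each kernel, and validates the results) is:

/* Reduced sketch of a STREAM-triad style kernel (not the full benchmark):
 * N and the scalar value are assumptions.                                  */
#include <stdio.h>
#include <omp.h>

#define N 20000000            /* large enough that the arrays do not fit in cache */

static double a[N], b[N], c[N];

int main(void)
{
    long j;
    double scalar = 3.0;

    #pragma omp parallel for
    for (j = 0; j < N; j++) { a[j] = 1.0; b[j] = 2.0; c[j] = 0.0; }

    double t0 = omp_get_wtime();
    #pragma omp parallel for
    for (j = 0; j < N; j++)
        a[j] = b[j] + scalar * c[j];          /* triad: 2 loads + 1 store per element */
    double t1 = omp_get_wtime();

    /* 3 arrays x 8 bytes moved per element */
    printf("Triad bandwidth: %.1f MB/s\n", 3.0 * 8.0 * N / (t1 - t0) / 1.0e6);
    return 0;
}

Because the arrays are placed wherever the process's memory policy says, binding the run with numactl (next slide) is enough to move the whole benchmark between DDR4 and MCDRAM.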

22 Allocating in MCDRAM (II)
In your interactive session, set the number of OpenMP threads to 68:
$ idev -p Flat-All2All
$ export OMP_NUM_THREADS=68
Run binding to DDR4 and then to MCDRAM, and compare the results:
$ numactl --membind=0 ./stream
$ numactl --membind=1 ./stream
Does the performance difference you measured match what you learned about the architectural features of KNL? What performance do you see if you run in the Cache-All2All queue with the same number of threads (and no numactl)?

23 Hands-on Session 3: KNL Affinity Lab
Kent Milfeld

24 Login
Login to Stampede:
$ ssh <username>@login-knl1.stampede.tacc.utexas.edu
Move into the affinity directory created by the untar command in the intro lab:
$ cd knl_affinity
Access a KNL node:
$ idev          # we call this the idev window

25 Login again in another window
1.) In another terminal window on your laptop, login to login-knl1.
2.) ssh to the KNL compute node.
3.) Run top so that you can watch the cpu loads.
$ ssh <username>@login-knl1.stampede.tacc.utexas.edu
$ ssh c###-###          # we call this the ssh window
$ top
Hit the "1" key, and adjust the screen size/font so you can see 137 cpus, or at least 69.

26 What you will do and learn
In the following 3 sections you will set up OpenMP environments for:
1.) evaluating the effects of the PROC_BIND policy (distribution);
2.) placing OpenMP threads on HW-threads; and
3.) floating OpenMP threads on KNL cores.

27 Create an OpenMP load generator
Compile omp_load.c (it uses functions in timers.c and load.c):
$ icc -xMIC-AVX512 -qopenmp load.c timers.c \
      omp_load.c -o omp_load
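The sources load.c, timers.c, and omp_load.c are not reproduced here. A minimal sketch of a load generator with the same purpose (keeping every OpenMP thread busy so its HW-thread shows up loaded in top; the run time and the busy-work are assumptions) is:

/* omp_load sketch (assumed): each thread spins on floating-point work so
 * its HW-thread shows close to 100% load in top. Not the actual lab code. */
#include <stdio.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel
    {
        double x  = 1.0 + omp_get_thread_num();
        double t0 = omp_get_wtime();
        while (omp_get_wtime() - t0 < 60.0)     /* burn cpu for ~60 seconds */
            x = x * 1.0000001 + 0.0000001;      /* cheap dependent FP work  */
        printf("thread %3d finished, x = %f\n", omp_get_thread_num(), x);
    }
    return 0;
}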

28 Distribution (PROC_BIND spread)
Set the spread policy and run 1 and 2 threads per core (with 68 and 136 threads):
$ export OMP_DISPLAY_ENV=TRUE     # what does this do?
$ export OMP_PROC_BIND=spread
$ export OMP_NUM_THREADS=68
$ ./omp_load                      # watch the top display
$ export OMP_NUM_THREADS=136
$ ./omp_load                      # watch the top display
Expected occupation:
68 threads  (1st HW-thread of every core):      [ 0, 1,..., 67]
136 threads (1st & 3rd HW-thread of each core): [ 0, 1,..., 67] [---] [136,137,...,203] [---]
Notation: '...' inside a bracket means the whole inclusive sequence is occupied ([0,1,...,5,6] = [0,1,2,3,4,5,6]); '---' means those HW-threads are NOT occupied ([0,1,---,5,6] = [0,1,5,6]).

29 Distribution (PROC_BIND close)
Set the close policy and run 1 and 2 threads per core (with 68 and 136 threads):
$ export OMP_DISPLAY_ENV=TRUE
$ export OMP_PROC_BIND=close
$ export OMP_NUM_THREADS=68
$ ./omp_load                      # watch the top display
$ export OMP_NUM_THREADS=136
$ ./omp_load                      # watch the top display
Expected occupation:
68 threads  (1st quadrant):        [ 0, 1,..., 17,---] [ 68, 69,..., 84,---] [136,137,...,152,---] [204,205,...,220,---]
136 threads (1st & 2nd quadrants): [ 0, 1,..., 33,---] [ 68, 69,...,101,---] [136,137,...,169,---] [204,205,...,237,---]

30 Locations (PLACES)
Specify 4 places with a stride of 2, starting from HW-thread 0:
$ export OMP_DISPLAY_ENV=TRUE     # what does this do?
$ export OMP_PLACES="{0},{2},{4},{6}"
$ export OMP_NUM_THREADS=4
$ ./omp_load                      # watch the top display
The interval form {start}:count:stride is equivalent:
$ export OMP_PLACES="{0}:4:2"
$ ./omp_load
Expected occupation (1 thread on each of the 1st 4 tiles): [ 0, 2, 4, 6,---]

31 Locations (PLACES)
Specify 4 places with a stride of 68, starting from HW-thread 0 (the 4 HW-threads of the 1st core):
$ export OMP_DISPLAY_ENV=TRUE     # what does this do?
$ export OMP_PLACES="{0}:4:68"
$ export OMP_NUM_THREADS=4
$ ./omp_load                      # watch the top display
Expected occupation (1st core): [ 0,---] [ 68,---] [136,---] [204,---]
Specify 8 places with a stride of 2, using two interval expressions (1st & 2nd HW-threads in the 1st 4 tiles):
$ export OMP_PLACES="{0}:4:2,{68}:4:2"
$ export OMP_NUM_THREADS=8
$ ./omp_load
Expected occupation: [ 0, 2, 4, 6,---] [ 68,70,72,74,---] [---]

32 Locations (masked PLACES)
A list within a place forms a mask of HW-thread locations where an OpenMP thread can execute. The top command only shows the present load at a location, not the mask. In this exercise omp_viewload reports the mask for each thread:
Each row is the mask for one thread-id, labeled by the thread-id on the left.
Each column of the mask (columns 0 - 285) represents a HW-thread id.
The header demarks HW-thread ids in groups of 10 and includes the beginning value of each group.
The bits set in the mask are indicated by a digit, which is the first (ones) digit of the HW-thread id (group value + first digit = HW-thread id).
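The omp_viewload source is not included in this transcript. Conceptually such a tool can be built on sched_getaffinity(); a minimal sketch (which simply lists each thread's allowed HW-thread ids instead of printing the digit-matrix format described above) is:

/* Sketch of an omp_viewload-style affinity report (assumed implementation):
 * each OpenMP thread prints the HW-thread ids its mask allows it to run on. */
#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel
    {
        cpu_set_t mask;
        sched_getaffinity(0, sizeof(mask), &mask);   /* 0 = calling thread */

        #pragma omp critical   /* keep each thread's line intact */
        {
            printf("thread %3d allowed on HW-threads:", omp_get_thread_num());
            for (int c = 0; c < CPU_SETSIZE; c++)
                if (CPU_ISSET(c, &mask))
                    printf(" %d", c);
            printf("\n");
        }
    }
    return 0;
}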

33 Locations (masking PLACES for cores)
Change to the mask directory (in the $HOME/knl_affinity directory):
$ cd mask
Build the omp_viewload executable:
$ make
Create 68 places, each a mask covering one core, and execute omp_viewload with 68 OpenMP threads:
$ export OMP_NUM_THREADS=68
$ export OMP_PLACES="{0,68,136,204}:68"
$ ./omp_viewload     # what does each line mean? See the next slide for help.
                     # This will run for a while -- use ^C to exit.

34 Locations (Mask Report)
Occupation Report from omp_viewload with
OMP_NUM_THREADS=68  OMP_PLACES={0,68,136,204}:68
Each line is (thread-id, HW-thread ids in the mask), e.g.:
(2,  {  2, 70,138,206})
…
(15, { 15, 83,151,219})
…

35 Locations (masking PLACES for tiles)
Create 34 places, each a mask covering one tile, and execute omp_viewload with 34 OpenMP threads (a thread on each tile):
$ export OMP_NUM_THREADS=34
$ export OMP_PLACES="{0:2,68:2,136:2,204:2}:34:2"
$ ./omp_viewload     # what does each line mean?

36 Another window to your KNL interactive node (optional)
How to create another window to your interactive KNL compute node:
$ ssh <username>@login-knl1.stampede.tacc.utexas.edu
$ ssh c###-###       # get the node name from your prompt or from squeue
c###-###...$         # this is your 2nd window on the KNL
                     # run top, compile code, etc.
Example:
login-knl1.stampede(1)$ whoami
train450
login-knl1.stampede(1)$ squeue
  JOBID  PARTITION    NAME      USER      ST  TIME  NODES  NODELIST(REASON)
  755    Flat-All2Al  idv…      milfeld   R   …     …      c…
  754    Flat-All2Al  idv11568  train450  R   …     …      c…
login-knl1.stampede(2)$ ssh c…
c…(1)$

37 Done!

