TACC KNL Training
Carlos Rosales, John Cazes, Kent Milfeld
9/20/2018
Hands-on Session 1: KNL Introduction Lab
Kent Milfeld, milfeld@tacc.utexas.edu
Login
Usernames and passwords will be handed out before the lab.
Log in to Stampede:
$ ssh <username>@login-knl1.stampede.tacc.utexas.edu
(In the hostname, "knl1" ends with a lowercase L (l) followed by the number one (1).)
This is a Haswell login node. You can compile code and submit batch jobs here, but in just a bit we will get interactive access to a KNL compute node and compile and execute directly on a KNL processor!
Get Files for Compiling
Untar all lab files into your directory:
$ tar xvf ~train00/knl_labs.tar
Move into the newly created (knl_intro) directory:
$ cd knl_intro
List Queues
Display the list of SLURM queues (partitions):
$ sinfo          # list includes nodes
$ idev -queues   # just the list of queues
Each queue has nodes configured for a specific Memory Mode and Clustering mode. The queues (partitions) are named Mem. Mode - Clustering:
Flat-All2All*    Cache-All2All
Flat-Quadrant    Cache-Quadrant
Flat-SNC-4       Cache-SNC-4
This is the list you will see; more about these modes in class.
*Flat-All2All is the default partition (queue).
Tip: if you use vi you might find the vim graphical cheat-sheet useful: http://www.viemu.com/vi-vim-cheat-sheet.gif
Access a KNL Node Interactively
Interactively access a KNL node with idev:
$ idev           # default: 30 min, Flat-All2All node
$ ...            # SLURM reporting; when you have access,
c###-###...$     # an interactive prompt will appear.
c###-### in the new prompt is the KNL node you are on (you can even ssh to this node in another window; see the last slide if you want 2 windows).
Access a KNL Node Interactively (continued)
Let's look at the hardware, with some useful commands:
$ grep processor /proc/cpuinfo   # What is the processor count?
$ hwloc-ls -l                    # How many logical cores are there? (See L#.)
$ hwloc-ls -p                    # Which physical cores have been disabled? (See P#.)
What is the Memory Mode and Clustering for this node?
$ sinfo --format="%.12P %.5a %.6D"   # Queue names (mem. mode - clustering), up/down, node count
See the load on each cpu:
$ top   # Hit the 1 key (no return); the load of each cpu is shown.
        # Resize the terminal to see more cpus! (Hit q to quit.)
Compile and Run OpenMP Code
Remember to compile with -xMIC-AVX512 (on compute nodes AND login nodes):
$ ifort -qopenmp -xMIC-AVX512 omp_hello.F90 -o omp_hello
or
$ icc -qopenmp -xMIC-AVX512 omp_hello.c -o omp_hello
$ export OMP_NUM_THREADS=68 OMP_DISPLAY_ENV=TRUE   # OMP_DISPLAY_ENV=TRUE shows the OpenMP environment details at the beginning of the run.
$ ./omp_hello | sort
Now try 4 threads per core and run again:
$ export OMP_NUM_THREADS=272
$ ./omp_hello | sort
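For reference, omp_hello is just a parallel "hello world". A minimal sketch of what such a program might look like (the lab's actual omp_hello.c may differ) is:

#include <stdio.h>
#include <omp.h>

int main(void) {
    /* Each OpenMP thread prints its id and the total thread count. */
    #pragma omp parallel
    printf("Hello from thread %3d of %3d\n",
           omp_get_thread_num(), omp_get_num_threads());
    return 0;
}

Piping the output through sort simply puts the (otherwise interleaved) lines in thread order.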
Compile and Run MPI Code
Remember to compile with -xMIC-AVX512 (on compute nodes AND login nodes):
$ mpif90 -xMIC-AVX512 mpi_hello.F90 -o mpi_hello
or
$ mpicc -xMIC-AVX512 mpi_hello.c -o mpi_hello
$ mpiexec.hydra -np 68 ./mpi_hello | sort
or
$ mpirun -n 68 ./mpi_hello | sort
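Similarly, mpi_hello is a minimal MPI "hello world". A sketch of what such a program might look like (the lab's actual mpi_hello.c may differ) is:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's rank */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of ranks */
    printf("Hello from rank %3d of %3d\n", rank, size);
    MPI_Finalize();
    return 0;
}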
Vector Performance
Compile vector.c as a native MIC application:
$ icc -qopenmp -O3 -xMIC-AVX512 ./vector.c -o vec
And also as a MIC application, but with vectorization disabled:
$ icc -qopenmp -O3 -xMIC-AVX512 -no-vec ./vector.c -o novec
Run both executables and take note of the timing difference. How much speedup comes from the vectorization? Does this make sense given what you have learned about the KNL architecture?
Vector Reports (I)
Let's get some information about the vectorization in this example code. Compile the code again, but add a basic optimization report option to the compilation line:
$ icc -qopenmp -O3 -xMIC-AVX512 -qopt-report=2 \
      ./vector.c -o vec
This will generate a report file called vector.optrpt.
Open the optimization report file with your favorite text editor, or simply cat the contents to screen:
$ cat ./vector.optrpt
Vector Reports (II)
There is a lot of information in the optimization report file. We find out that our array initialization can't be vectorized, because we call an external function (rand) in lines 34 and 35 of the example:
LOOP BEGIN at ./vector.c(34,2)
   remark #15527: loop was not vectorized: function call to rand(void) cannot be vectorized   [ ./vector.c(34,33) ]
LOOP END
LOOP BEGIN at ./vector.c(35,2)
   remark #15527: loop was not vectorized: function call to rand(void) cannot be vectorized   [ ./vector.c(35,33) ]
But the main loop has been vectorized:
LOOP BEGIN at ./vector.c(45,3)
   remark #15300: LOOP WAS VECTORIZED
Vector Reports (III)
Let's use a higher reporting level in order to find out more about the quality of the main loop vectorization:
$ icc -qopenmp -O3 -xMIC-AVX512 -qopt-report=4 ./vector.c -o vec
LOOP BEGIN at ./vector.c(45,3)
   remark #15388: vectorization support: reference x has aligned access   [ ./vector.c(46,4) ]
   remark #15388: vectorization support: reference y has aligned access   [ ./vector.c(46,4) ]
   remark #15388: vectorization support: reference z has aligned access   [ ./vector.c(46,4) ]
   remark #15305: vectorization support: vector length 8
   remark #15399: vectorization support: unroll factor set to 8
   remark #15300: LOOP WAS VECTORIZED
   remark #15448: unmasked aligned unit stride loads: 2
   remark #15449: unmasked aligned unit stride stores: 1
   remark #15475: --- begin vector loop cost summary ---
   remark #15476: scalar loop cost: 11
   remark #15477: vector loop cost: 0.620
   remark #15478: estimated potential speedup: 17.600
   remark #15488: --- end vector loop cost summary ---
   remark #25015: Estimate of max trip count of loop=4
LOOP END
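For orientation, a loop structure consistent with the remarks above (rand() calls in the initialization loops, and a vectorizable main loop doing 2 loads and 1 store per iteration on aligned arrays x, y, z) might look like the sketch below. This is only a guess at the shape of the code; the lab's actual vector.c will differ in details such as array sizes, repetition counts, and the operation being timed.

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N 256

int main(void) {
    /* 64-byte alignment lets the compiler report aligned accesses. */
    static double x[N] __attribute__((aligned(64)));
    static double y[N] __attribute__((aligned(64)));
    static double z[N] __attribute__((aligned(64)));

    for (int i = 0; i < N; i++) x[i] = (double)rand() / RAND_MAX;  /* not vectorized: rand() */
    for (int i = 0; i < N; i++) y[i] = (double)rand() / RAND_MAX;  /* not vectorized: rand() */

    double t0 = omp_get_wtime();
    for (int i = 0; i < N; i++)          /* this loop vectorizes: 2 loads, 1 store */
        z[i] = x[i] * y[i];
    double t1 = omp_get_wtime();

    printf("time = %g s   z[0] = %g\n", t1 - t0, z[0]);
    return 0;
}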
Hands-on Session 2: Vectorization Reports, Memory Configuration
Carlos Rosales, carlos@tacc.utexas.edu
Setup
Log in to Stampede:
$ ssh <username>@login-knl1.stampede.tacc.utexas.edu
(In the hostname, "knl1" ends with a lowercase L (l) followed by the number one (1).)
Move into the directory:
$ cd knl_mem
Obtain an interactive session on Stampede:
$ idev -p Flat-All2All
The idev command will start an interactive session on a KNL compute node.
Memory Configuration
If you are still on a compute node (cxxx-xxx in your prompt) please exit now, and also exit after every test in this exercise so that you run them in the specified queue.
In this exercise you will submit interactive jobs to queues with KNL nodes in different memory configurations. Exit each session after the exercise is completed.
Cache (All-to-all / Quadrant)
Get an interactive session on a node with the Cache-All2All configuration:
$ idev -p Cache-All2All
Once the session starts, check the NUMA configuration:
$ numactl -H
Note: the Cache-Quadrant configuration will look exactly the same.
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271
node 0 size: 98207 MB
[Slide diagram: the single NUMA node (node 0) is the DDR4; MCDRAM acts as cache.]
Cache (SubNUMA-4)
Get an interactive session on a node with the Cache-SNC-4 configuration and check its configuration:
$ idev -p Cache-SNC-4
$ numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221
node 0 size: 24479 MB
…
node 3 cpus: 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271
node 3 size: 24576 MB
Important features:
- 4 NUMA nodes
- All NUMA nodes are assigned to the DDR4, since we are using Cache mode
- Memory is equally distributed across the NUMA nodes ( 96 GB / 4 = 24 GB )
- Nodes 0 and 1 have 18 CPU cores associated with them, while nodes 2 and 3 have 16 (no tile splitting allowed)
[Slide diagram: all four NUMA nodes are DDR4]
Flat (All-to-all / Quadrant)
Get an interactive session on a node with the Flat-All2All configuration and check its configuration:
$ idev -p Flat-All2All
$ numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271
node 0 size: 98207 MB
node 1 cpus:
node 1 size: 16384 MB
Important features:
- MCDRAM is NUMA node 1 ( 16 GB ) and DDR4 is NUMA node 0 ( 96 GB, the default allocation )
- The CPUs associated with NUMA node 0 are also associated with NUMA node 1
[Slide diagram: node 0 = DDR4, node 1 = MCDRAM]
Flat (SubNUMA-4)
Get an interactive session on a node with the Flat-SNC-4 configuration and check its configuration:
$ idev -p Flat-SNC-4
$ numactl -H
available: 8 nodes (0-7)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221
node 0 size: 24479 MB
…
node 3 cpus: 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271
node 3 size: 24576 MB
node 4 cpus:
node 4 size: 4096 MB
node 7 cpus:
node 7 size: 4096 MB
Important features:
- Eight NUMA nodes: (0,1,2,3) are DDR4 and (4,5,6,7) are MCDRAM
- Memory is equally split across the NUMA nodes ( 96/4 = 24 GB DDR4, 16/4 = 4 GB MCDRAM )
- The CPUs for NUMA node 0 are also associated with NUMA node 4, etc.
[Slide diagram: nodes 0-3 = DDR4, nodes 4-7 = MCDRAM]
Allocating in MCDRAM (I)
The STREAM benchmark is the industry standard for measuring memory bandwidth. In this exercise you will run it on both DDR4 and MCDRAM and look at the difference in performance.
Make sure you are on the login node. Change directory into the stream benchmark directory:
$ cd stream
Build the executable:
$ icc -qopenmp -O2 -xMIC-AVX512 ./stream.c -o stream
Allocating in MCDRAM (II)
In your interactive session, set the number of OMP threads to 68:
$ idev -p Flat-All2All
$ export OMP_NUM_THREADS=68
Run binding to DDR4 and to MCDRAM, and compare the results:
$ numactl --membind=0 ./stream
$ numactl --membind=1 ./stream
Does the performance difference you measured match what you learned about the architectural features of KNL?
What performance do you see if you run in the Cache-All2All queue with the same number of threads (and no numactl)?
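As a rough guide (not from the lab handout): on KNL the MCDRAM typically sustains several times the STREAM bandwidth of the six-channel DDR4 (on the order of 400+ GB/s vs. roughly 90 GB/s), so the two numactl runs should differ dramatically. As an aside that is not required for this lab, code can also place individual arrays in MCDRAM directly through the memkind library's hbwmalloc interface, assuming memkind is installed; a minimal sketch:

#include <stdio.h>
#include <hbwmalloc.h>   /* memkind high-bandwidth-memory allocator */

int main(void) {
    size_t n = 1000000;
    /* Request MCDRAM for this array; with the default policy the allocation
       falls back to DDR4 if high-bandwidth memory is not available. */
    double *a = (double *) hbw_malloc(n * sizeof(double));
    if (a == NULL) { fprintf(stderr, "hbw_malloc failed\n"); return 1; }
    for (size_t i = 0; i < n; i++) a[i] = 1.0;
    printf("a[0] = %f\n", a[0]);
    hbw_free(a);
    return 0;
}

Such a program would typically be linked with -lmemkind.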
Hands-on Session 3: KNL Affinity Lab
Kent Milfeld, milfeld@tacc.utexas.edu
Login
Log in to Stampede:
$ ssh <username>@login-knl1.stampede.tacc.utexas.edu
(In the hostname, "knl1" ends with a lowercase L (l) followed by the number one (1).)
Move into the affinity directory created by the untar command in the intro lab:
$ cd knl_affinity
Access a KNL node:
$ idev    # we call this the idev window
Login Again in Another Window
1.) In another terminal window on your laptop, log in to login-knl1.
2.) ssh to the KNL compute node.
3.) Run top so that you can watch the cpu loads.
$ ssh <username>@login-knl1.stampede.tacc.utexas.edu
$ ssh c###-###    # we call this the ssh window
$ top
Hit the "1" key, and adjust the screen size/font so you can see 137 cpus, or at least 69.
What You Will Do and Learn
In the following 3 sections you will set up OpenMP environments for:
1.) evaluating the effects of the PROC_BIND policy (distribution);
2.) placing OpenMP threads on HW-threads; and
3.) floating OpenMP threads on KNL cores.
Create an OpenMP Load Generator
Compile omp_load.c (it uses functions in timers.c and load.c):
$ icc -xMIC-AVX512 -qopenmp load.c timers.c \
      omp_load.c -o omp_load
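For reference, the heart of a load generator like this is simply sustained compute work inside an OpenMP parallel region, so that each thread's activity shows up in top. A minimal self-contained sketch (the lab's actual omp_load.c, load.c, and timers.c will differ) is:

#include <stdio.h>
#include <math.h>
#include <omp.h>

int main(void) {
    double t0 = omp_get_wtime();
    #pragma omp parallel
    {
        /* Busy work: each thread runs on whatever place the affinity
           policy assigns it, so its load appears on those cpus in top. */
        double s = 0.0;
        for (long i = 0; i < 200000000L; i++)
            s += sin((double) i) * cos((double) i);
        if (omp_get_thread_num() == 0)
            printf("thread 0 partial sum = %g\n", s);
    }
    printf("elapsed: %.2f s\n", omp_get_wtime() - t0);
    return 0;
}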
Distribution (PROC_BIND spread)
Set the spread policy and run with 1 and then 2 threads per core (68 and 136 threads):
$ export OMP_DISPLAY_ENV=TRUE    # what does this do?
$ export OMP_PROC_BIND=spread
$ export OMP_NUM_THREADS=68
$ ./omp_load                     # Watch the top display.
(68 threads: 1st HW-thread of every core)   [ 0, 1,..., 67]
$ export OMP_NUM_THREADS=136
$ ./omp_load                     # Watch the top display.
(136 threads: 1st & 3rd HW-thread of each core)   [ 0, 1,..., 67] [ --- ] [136, 137,...,203]
Notation: #1,...,#2 means the sequence from #1 to #2 (inclusive) is occupied: [0,1,...,5,6] = [0,1,2,3,4,5,6].
#1,---,#2 means the sequence between #1 and #2 (exclusive) is NOT occupied: [0,1,---,5,6] = [0,1,5,6].
Distribution (PROC_BIND close)
Set the close policy and run with 1 and then 2 threads per core (68 and 136 threads):
$ export OMP_DISPLAY_ENV=TRUE    # what does this do?
$ export OMP_PROC_BIND=close
$ export OMP_NUM_THREADS=68
$ ./omp_load                     # Watch the top display.
(68 threads: 1st quadrant)   [ 0, 1,..., 17,---] [ 68, 69,..., 84,---] [136,137,...,152,---] [204,205,...,220,---]
$ export OMP_NUM_THREADS=136
$ ./omp_load                     # Watch the top display.
(136 threads: 1st & 2nd quadrants)   [ 0, 1,..., 33,---] [ 68, 69,...,101,---] [136,137,...,169,---] [204,205,...,237,---]
(Same occupied/not-occupied notation as the previous slide.)
Locations (PLACES)
Specify 4 places with a stride of 2, starting from HW-thread 0:
$ export OMP_DISPLAY_ENV=TRUE    # what does this do?
$ export OMP_PLACES="{0},{2},{4},{6}"
$ export OMP_NUM_THREADS=4
$ ./omp_load                     # Watch the top display.
(1 thread on each of the 1st 4 tiles)   [ 0, 2, 4, 6,---]
The same placement can be written in interval form, {start}:count:stride:
$ export OMP_PLACES="{0}:4:2"
$ ./omp_load
(1 thread on each of the 1st 4 tiles)   [ 0, 2, 4, 6,---]
Locations (PLACES)
Specify 4 places with a stride of 68, starting from HW-thread 0:
$ export OMP_DISPLAY_ENV=TRUE    # what does this do?
$ export OMP_PLACES="{0}:4:68"
$ export OMP_NUM_THREADS=4
$ ./omp_load                     # Watch the top display.
(all 4 HW-threads of the 1st core)   [ 0,---] [ 68,---] [136,---] [204,---]
Specify 8 places with a stride of 2, starting from HW-thread 0 (uses two interval expressions):
$ export OMP_PLACES="{0}:4:2,{68}:4:2"
$ export OMP_NUM_THREADS=8
$ ./omp_load
(1st & 2nd HW-threads of the 1st 4 tiles)   [ 0, 2, 4, 6,---] [ 68,70,72,74,---] [---]
Locations (masked PLACES)
A list within a place forms a mask of HW-thread locations where an OpenMP thread can execute. The top command only shows the present load at a location, not the mask. In this exercise, omp_viewload reports the mask for each thread, like this:
thrd |         |   10    |   20    |  30 |...   HEADER (group value)
  0   0---------------6------------------5-----...   mask (digit)
  1   -1---------------7------------------6----...
  2   --2---------------8------------------7---...
  ...
Each row is a mask for a thread-id, labeled by the thread-id on the left. Each column of the mask, columns 0 - 285, represents a HW-thread id. The header marks the HW-thread ids in groups of 10 (| | | |...) and includes the beginning value for each group (| 10 | 20 | 30 | ...). The bits set in the mask are indicated by a digit (e.g. 0---------------6------------------5), which is the last digit of the HW-thread id (group value + digit = HW-thread id).
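For the curious: on Linux, a per-thread mask like this can be read with sched_getaffinity(). The sketch below shows the idea; omp_viewload's actual implementation and output format may differ.

#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h>
#include <omp.h>

int main(void) {
    #pragma omp parallel
    {
        cpu_set_t mask;
        /* pid 0 means "the calling thread"; the mask holds the
           HW-threads this OpenMP thread is allowed to run on. */
        sched_getaffinity(0, sizeof(mask), &mask);
        #pragma omp critical
        {
            printf("thread %3d allowed on:", omp_get_thread_num());
            for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
                if (CPU_ISSET(cpu, &mask)) printf(" %d", cpu);
            printf("\n");
        }
    }
    return 0;
}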
Locations (masking PLACES for cores)
Change to the mask directory (inside the knl_affinity directory):
$ cd mask
Build the omp_viewload executable:
$ make
Create 68 places, each a mask for one core, and execute omp_viewload with 68 OpenMP threads:
$ export OMP_NUM_THREADS=68
$ export OMP_PLACES="{0,68,136,204}:68"
$ omp_viewload    # What does each line mean? See the next slide for help.
                  # This will run for a while; use ^C to exit.
The place list {0,68,136,204}:68 creates 68 places; place i is {0+i, 68+i, 136+i, 204+i}, i.e. the 4 HW-threads of core i.
Locations (Mask Report)
Occupation report from omp_viewload for OMP_NUM_THREADS=68 and OMP_PLACES={0,68,136,204}:68, shown as (thread-id, HW-thread ids):
(2, { 2,70,138,206})
…
(15, { 15,83,151,219})
…
Locations (masking PLACES for tiles)
Create 34 places, each a mask for one tile, and execute omp_viewload with 34 OpenMP threads (one thread per tile):
$ export OMP_NUM_THREADS=34
$ export OMP_PLACES="{0:2,68:2,136:2,204:2}:34:2"
$ omp_viewload    # What does each line mean?
Here {0:2} means HW-threads 0 and 1, so the first place is the 8 HW-threads of tile 0 (cores 0 and 1); the :34:2 replicates that place 34 times with a stride of 2, one place per tile.
Another Window to Your KNL Interactive Node (Optional)
How to create another window to your interactive KNL compute node:
$ ssh <username>@login-knl1.stampede.tacc.utexas.edu
$ ssh c###-###       # get the node name from your idev prompt or from squeue
$ ...
c###-###...$         # This is your 2nd window on the KNL.
                     # Run top, compile code, etc.
Example:
login-knl1.stampede(1)$ whoami
train450
login-knl1.stampede(1)$ squeue
  JOBID   PARTITION      NAME     USER ST   TIME NODES NODELIST(REASON)
    755 Flat-All2Al idv11588  milfeld  R  13:06     1 c561-002
    754 Flat-All2Al idv11568 train450  R  16:22     1 c561-001
login-knl1.stampede(2)$ ssh c561-001
…
c561-001.stampede(1)$
Done!