TACC KNL Training
Carlos Rosales, John Cazes, Kent Milfeld
9/20/2018
Hands-on Session 1: KNL Introduction Lab
Kent Milfeld, milfeld@tacc.utexas.edu
Login
Usernames and passwords will be handed out before the lab.
Log in to Stampede:
$ ssh <username>@login-knl1.stampede.tacc.utexas.edu
(In the hostname, "knl1" ends with a lowercase L (l) followed by the number one (1).)
This is a Haswell login node. You can compile code and submit batch jobs here, but in just a bit we will get interactive access to a KNL compute node and compile and execute directly on a KNL processor!
Get Files for Compiling
Untar all lab files into your directory:
$ tar xvf ~train00/knl_labs.tar
Move into the newly created (knl_intro) directory:
$ cd knl_intro
List Queues
Display the list of SLURM queues (partitions):
$ sinfo          # list includes nodes
$ idev -queues   # just the list of queues
Each queue has nodes configured for a specific Memory Mode and Clustering mode. The queues (partitions) are named Mem. Mode - Clustering:
Flat-All2All*    Cache-All2All
Flat-Quadrant    Cache-Quadrant
Flat-SNC-4       Cache-SNC-4
This is the list you will see; more about these modes in class.
*Flat-All2All is the default partition (queue).
Tip: if you use vi you might find the vim graphical cheat-sheet useful: http://www.viemu.com/vi-vim-cheat-sheet.gif
Access a KNL Node Interactively
Interactively access a KNL node with idev:
$ idev           # default: 30 min, Flat-All2All node
$ ...            # SLURM reporting; when you have access,
c###-###...$     # an interactive prompt will appear.
c###-### in the new prompt is the KNL node you are on (you can even ssh to this node in another window; see the last slide if you want 2 windows).
Access a KNL Node Interactively (continued)
Let's look at the hardware, with some useful commands:
$ grep processor /proc/cpuinfo   # What is the processor count?
$ hwloc-ls -l                    # How many logical cores are there? (See L#.)
$ hwloc-ls -p                    # Which physical cores have been disabled? (See P#.)
What is the Memory Mode and Clustering for this node?
$ sinfo --format="%.12P %.5a %.6D"   # Queue names (mem. mode - clustering), up/down, node count
See the load on each cpu:
$ top   # Hit the 1 key (no return); the load of each cpu is shown.
        # Resize the terminal to see more cpus! (Hit q to quit.)
Compile and Run OpenMP Code
Remember to compile with -xMIC-AVX512 (on compute nodes AND login nodes):
$ ifort -qopenmp -xMIC-AVX512 omp_hello.F90 -o omp_hello
or
$ icc -qopenmp -xMIC-AVX512 omp_hello.c -o omp_hello
$ export OMP_NUM_THREADS=68 OMP_DISPLAY_ENV=TRUE   # OMP_DISPLAY_ENV=TRUE shows the OpenMP environment details at the beginning of the run.
$ ./omp_hello | sort
Now try 4 threads per core and run again:
$ export OMP_NUM_THREADS=272
$ ./omp_hello | sort
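For reference, omp_hello is just a parallel "hello world". A minimal sketch of what such a program might look like (the lab's actual omp_hello.c may differ) is:

#include <stdio.h>
#include <omp.h>

int main(void) {
    /* Each OpenMP thread prints its id and the total thread count. */
    #pragma omp parallel
    printf("Hello from thread %3d of %3d\n",
           omp_get_thread_num(), omp_get_num_threads());
    return 0;
}

Piping the output through sort simply puts the (otherwise interleaved) lines in thread order.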
Compile and Run MPI Code
Remember to compile with -xMIC-AVX512 (on compute nodes AND login nodes):
$ mpif90 -xMIC-AVX512 mpi_hello.F90 -o mpi_hello
or
$ mpicc -xMIC-AVX512 mpi_hello.c -o mpi_hello
$ mpiexec.hydra -np 68 ./mpi_hello | sort
or
$ mpirun -n 68 ./mpi_hello | sort
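Similarly, mpi_hello is a minimal MPI "hello world". A sketch of what such a program might look like (the lab's actual mpi_hello.c may differ) is:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's rank */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of ranks */
    printf("Hello from rank %3d of %3d\n", rank, size);
    MPI_Finalize();
    return 0;
}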
Vector Performance
Compile vector.c as a native MIC application:
$ icc -qopenmp -O3 -xMIC-AVX512 ./vector.c -o vec
And also as a MIC application, but with vectorization disabled:
$ icc -qopenmp -O3 -xMIC-AVX512 -no-vec ./vector.c -o novec
Run both executables and take note of the timing difference. How much speedup comes from the vectorization? Does this make sense given what you have learned about the KNL architecture?
Vector Reports (I)
Let's get some information about the vectorization in this example code. Compile the code again, but add a basic optimization report option to the compilation line:
$ icc -qopenmp -O3 -xMIC-AVX512 -qopt-report=2 \
      ./vector.c -o vec
This will generate a report file called vector.optrpt.
Open the optimization report file with your favorite text editor, or simply cat the contents to screen:
$ cat ./vector.optrpt
Vector Reports (II)
There is a lot of information in the optimization report file. We find out that our array initialization can't be vectorized, because we call an external function (rand) in lines 34 and 35 of the example:
LOOP BEGIN at ./vector.c(34,2)
   remark #15527: loop was not vectorized: function call to rand(void) cannot be vectorized   [ ./vector.c(34,33) ]
LOOP END
LOOP BEGIN at ./vector.c(35,2)
   remark #15527: loop was not vectorized: function call to rand(void) cannot be vectorized   [ ./vector.c(35,33) ]
But the main loop has been vectorized:
LOOP BEGIN at ./vector.c(45,3)
   remark #15300: LOOP WAS VECTORIZED
Vector Reports (III)
Let's use a higher reporting level in order to find out more about the quality of the main loop vectorization:
$ icc -qopenmp -O3 -xMIC-AVX512 -qopt-report=4 ./vector.c -o vec
LOOP BEGIN at ./vector.c(45,3)
   remark #15388: vectorization support: reference x has aligned access   [ ./vector.c(46,4) ]
   remark #15388: vectorization support: reference y has aligned access   [ ./vector.c(46,4) ]
   remark #15388: vectorization support: reference z has aligned access   [ ./vector.c(46,4) ]
   remark #15305: vectorization support: vector length 8
   remark #15399: vectorization support: unroll factor set to 8
   remark #15300: LOOP WAS VECTORIZED
   remark #15448: unmasked aligned unit stride loads: 2
   remark #15449: unmasked aligned unit stride stores: 1
   remark #15475: --- begin vector loop cost summary ---
   remark #15476: scalar loop cost: 11
   remark #15477: vector loop cost: 0.620
   remark #15478: estimated potential speedup: 17.600
   remark #15488: --- end vector loop cost summary ---
   remark #25015: Estimate of max trip count of loop=4
LOOP END
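For orientation, a loop structure consistent with the remarks above (rand() calls in the initialization loops, and a vectorizable main loop doing 2 loads and 1 store per iteration on aligned arrays x, y, z) might look like the sketch below. This is only a guess at the shape of the code; the lab's actual vector.c will differ in details such as array sizes, repetition counts, and the operation being timed.

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N 256

int main(void) {
    /* 64-byte alignment lets the compiler report aligned accesses. */
    static double x[N] __attribute__((aligned(64)));
    static double y[N] __attribute__((aligned(64)));
    static double z[N] __attribute__((aligned(64)));

    for (int i = 0; i < N; i++) x[i] = (double)rand() / RAND_MAX;  /* not vectorized: rand() */
    for (int i = 0; i < N; i++) y[i] = (double)rand() / RAND_MAX;  /* not vectorized: rand() */

    double t0 = omp_get_wtime();
    for (int i = 0; i < N; i++)          /* this loop vectorizes: 2 loads, 1 store */
        z[i] = x[i] * y[i];
    double t1 = omp_get_wtime();

    printf("time = %g s   z[0] = %g\n", t1 - t0, z[0]);
    return 0;
}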
Hands-on Session 2: Vectorization Reports, Memory Configuration
Carlos Rosales, carlos@tacc.utexas.edu
Setup
Log in to Stampede:
$ ssh <username>@login-knl1.stampede.tacc.utexas.edu
(In the hostname, "knl1" ends with a lowercase L (l) followed by the number one (1).)
Move into the directory:
$ cd knl_mem
Obtain an interactive session on Stampede:
$ idev -p Flat-All2All
The idev command will start an interactive session on a KNL compute node.
Memory Configuration
If you are still on a compute node (cxxx-xxx in your prompt) please exit now, and also exit after every test in this exercise so that you run them in the specified queue.
In this exercise you will submit interactive jobs to queues with KNL nodes in different memory configurations. Exit each session after the exercise is completed.
Cache (All-to-all / Quadrant)
Get an interactive session on a node with the Cache-All2All configuration:
$ idev -p Cache-All2All
Once the session starts, check the NUMA configuration:
$ numactl -H
Note: the Cache-Quadrant configuration will look exactly the same.
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271
node 0 size: 98207 MB
[Slide diagram: the single NUMA node (node 0) is the DDR4; MCDRAM acts as cache.]
Cache (SubNUMA-4)
Get an interactive session on a node with the Cache-SNC-4 configuration and check its configuration:
$ idev -p Cache-SNC-4
$ numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221
node 0 size: 24479 MB
…
node 3 cpus: 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271
node 3 size: 24576 MB
Important features:
- 4 NUMA nodes
- All NUMA nodes are assigned to the DDR4, since we are using Cache mode
- Memory is equally distributed across the NUMA nodes ( 96 GB / 4 = 24 GB )
- Nodes 0 and 1 have 18 CPU cores associated with them, while nodes 2 and 3 have 16 (no tile splitting allowed)
[Slide diagram: all four NUMA nodes are DDR4]
Flat (All-to-all / Quadrant)
Get an interactive session on a node with the Flat-All2All configuration and check its configuration:
$ idev -p Flat-All2All
$ numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271
node 0 size: 98207 MB
node 1 cpus:
node 1 size: 16384 MB
Important features:
- MCDRAM is NUMA node 1 ( 16 GB ) and DDR4 is NUMA node 0 ( 96 GB, the default allocation )
- The CPUs associated with NUMA node 0 are also associated with NUMA node 1
[Slide diagram: node 0 = DDR4, node 1 = MCDRAM]
Flat (SubNUMA-4)
Get an interactive session on a node with the Flat-SNC-4 configuration and check its configuration:
$ idev -p Flat-SNC-4
$ numactl -H
available: 8 nodes (0-7)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221
node 0 size: 24479 MB
…
node 3 cpus: 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271
node 3 size: 24576 MB
node 4 cpus:
node 4 size: 4096 MB
node 7 cpus:
node 7 size: 4096 MB
Important features:
- Eight NUMA nodes: (0,1,2,3) are DDR4 and (4,5,6,7) are MCDRAM
- Memory is equally split across the NUMA nodes ( 96/4 = 24 GB DDR4, 16/4 = 4 GB MCDRAM )
- The CPUs for NUMA node 0 are also associated with NUMA node 4, etc.
[Slide diagram: nodes 0-3 = DDR4, nodes 4-7 = MCDRAM]
Allocating in MCDRAM (I)
The STREAM benchmark is the industry standard for measuring memory bandwidth. In this exercise you will run it on both DDR4 and MCDRAM and look at the difference in performance.
Make sure you are on the login node. Change directory into the stream benchmark directory:
$ cd stream
Build the executable:
$ icc -qopenmp -O2 -xMIC-AVX512 ./stream.c -o stream
Allocating in MCDRAM (II)
In your interactive session, set the number of OMP threads to 68:
$ idev -p Flat-All2All
$ export OMP_NUM_THREADS=68
Run binding to DDR4 and to MCDRAM, and compare the results:
$ numactl --membind=0 ./stream
$ numactl --membind=1 ./stream
Does the performance difference you measured match what you learned about the architectural features of KNL?
What performance do you see if you run in the Cache-All2All queue with the same number of threads (and no numactl)?
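As a rough guide (not from the lab handout): on KNL the MCDRAM typically sustains several times the STREAM bandwidth of the six-channel DDR4 (on the order of 400+ GB/s vs. roughly 90 GB/s), so the two numactl runs should differ dramatically. As an aside that is not required for this lab, code can also place individual arrays in MCDRAM directly through the memkind library's hbwmalloc interface, assuming memkind is installed; a minimal sketch:

#include <stdio.h>
#include <hbwmalloc.h>   /* memkind high-bandwidth-memory allocator */

int main(void) {
    size_t n = 1000000;
    /* Request MCDRAM for this array; with the default policy the allocation
       falls back to DDR4 if high-bandwidth memory is not available. */
    double *a = (double *) hbw_malloc(n * sizeof(double));
    if (a == NULL) { fprintf(stderr, "hbw_malloc failed\n"); return 1; }
    for (size_t i = 0; i < n; i++) a[i] = 1.0;
    printf("a[0] = %f\n", a[0]);
    hbw_free(a);
    return 0;
}

Such a program would typically be linked with -lmemkind.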
Hands-on Session 3: KNL Affinity Lab
Kent Milfeld, milfeld@tacc.utexas.edu
Login
Log in to Stampede:
$ ssh <username>@login-knl1.stampede.tacc.utexas.edu
(In the hostname, "knl1" ends with a lowercase L (l) followed by the number one (1).)
Move into the affinity directory created by the untar command in the intro lab:
$ cd knl_affinity
Access a KNL node:
$ idev    # we call this the idev window
Login Again in Another Window
1.) In another terminal window on your laptop, log in to login-knl1.
2.) ssh to the KNL compute node.
3.) Run top so that you can watch the cpu loads.
$ ssh <username>@login-knl1.stampede.tacc.utexas.edu
$ ssh c###-###    # we call this the ssh window
$ top
Hit the "1" key, and adjust the screen size/font so you can see 137 cpus, or at least 69.
What You Will Do and Learn
In the following 3 sections you will set up OpenMP environments for:
1.) evaluating the effects of the PROC_BIND policy (distribution);
2.) placing OpenMP threads on HW-threads; and
3.) floating OpenMP threads on KNL cores.
Create an OpenMP Load Generator
Compile omp_load.c (it uses functions in timers.c and load.c):
$ icc -xMIC-AVX512 -qopenmp load.c timers.c \
      omp_load.c -o omp_load
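For reference, the heart of a load generator like this is simply sustained compute work inside an OpenMP parallel region, so that each thread's activity shows up in top. A minimal self-contained sketch (the lab's actual omp_load.c, load.c, and timers.c will differ) is:

#include <stdio.h>
#include <math.h>
#include <omp.h>

int main(void) {
    double t0 = omp_get_wtime();
    #pragma omp parallel
    {
        /* Busy work: each thread runs on whatever place the affinity
           policy assigns it, so its load appears on those cpus in top. */
        double s = 0.0;
        for (long i = 0; i < 200000000L; i++)
            s += sin((double) i) * cos((double) i);
        if (omp_get_thread_num() == 0)
            printf("thread 0 partial sum = %g\n", s);
    }
    printf("elapsed: %.2f s\n", omp_get_wtime() - t0);
    return 0;
}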
Distribution (PROC_BIND spread)
Set the spread policy and run with 1 and then 2 threads per core (68 and 136 threads):
$ export OMP_DISPLAY_ENV=TRUE    # what does this do?
$ export OMP_PROC_BIND=spread
$ export OMP_NUM_THREADS=68
$ ./omp_load                     # Watch the top display.
(68 threads: 1st HW-thread of every core)   [ 0, 1,..., 67]
$ export OMP_NUM_THREADS=136
$ ./omp_load                     # Watch the top display.
(136 threads: 1st & 3rd HW-thread of each core)   [ 0, 1,..., 67] [ --- ] [136, 137,...,203]
Notation: #1,...,#2 means the sequence from #1 to #2 (inclusive) is occupied: [0,1,...,5,6] = [0,1,2,3,4,5,6].
#1,---,#2 means the sequence between #1 and #2 (exclusive) is NOT occupied: [0,1,---,5,6] = [0,1,5,6].
Distribution (PROC_BIND close)
Set the close policy and run with 1 and then 2 threads per core (68 and 136 threads):
$ export OMP_DISPLAY_ENV=TRUE    # what does this do?
$ export OMP_PROC_BIND=close
$ export OMP_NUM_THREADS=68
$ ./omp_load                     # Watch the top display.
(68 threads: 1st quadrant)   [ 0, 1,..., 17,---] [ 68, 69,..., 84,---] [136,137,...,152,---] [204,205,...,220,---]
$ export OMP_NUM_THREADS=136
$ ./omp_load                     # Watch the top display.
(136 threads: 1st & 2nd quadrants)   [ 0, 1,..., 33,---] [ 68, 69,...,101,---] [136,137,...,169,---] [204,205,...,237,---]
(Same occupied/not-occupied notation as the previous slide.)
Locations (PLACES)
Specify 4 places with a stride of 2, starting from HW-thread 0:
$ export OMP_DISPLAY_ENV=TRUE    # what does this do?
$ export OMP_PLACES="{0},{2},{4},{6}"
$ export OMP_NUM_THREADS=4
$ ./omp_load                     # Watch the top display.
(1 thread on each of the 1st 4 tiles)   [ 0, 2, 4, 6,---]
The same placement can be written in interval form, {start}:count:stride:
$ export OMP_PLACES="{0}:4:2"
$ ./omp_load
(1 thread on each of the 1st 4 tiles)   [ 0, 2, 4, 6,---]
Locations (PLACES)
Specify 4 places with a stride of 68, starting from HW-thread 0:
$ export OMP_DISPLAY_ENV=TRUE    # what does this do?
$ export OMP_PLACES="{0}:4:68"
$ export OMP_NUM_THREADS=4
$ ./omp_load                     # Watch the top display.
(all 4 HW-threads of the 1st core)   [ 0,---] [ 68,---] [136,---] [204,---]
Specify 8 places with a stride of 2, starting from HW-thread 0 (uses two interval expressions):
$ export OMP_PLACES="{0}:4:2,{68}:4:2"
$ export OMP_NUM_THREADS=8
$ ./omp_load
(1st & 2nd HW-threads of the 1st 4 tiles)   [ 0, 2, 4, 6,---] [ 68,70,72,74,---] [---]
Locations (masked PLACES)
A list within a place forms a mask of HW-thread locations where an OpenMP thread can execute. The top command only shows the present load at a location, not the mask. In this exercise, omp_viewload reports the mask for each thread, like this:
thrd |         |   10    |   20    |  30 |...   HEADER (group value)
  0   0---------------6------------------5-----...   mask (digit)
  1   -1---------------7------------------6----...
  2   --2---------------8------------------7---...
  ...
Each row is a mask for a thread-id, labeled by the thread-id on the left. Each column of the mask, columns 0 - 285, represents a HW-thread id. The header marks the HW-thread ids in groups of 10 (| | | |...) and includes the beginning value for each group (| 10 | 20 | 30 | ...). The bits set in the mask are indicated by a digit (e.g. 0---------------6------------------5), which is the last digit of the HW-thread id (group value + digit = HW-thread id).
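For the curious: on Linux, a per-thread mask like this can be read with sched_getaffinity(). The sketch below shows the idea; omp_viewload's actual implementation and output format may differ.

#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h>
#include <omp.h>

int main(void) {
    #pragma omp parallel
    {
        cpu_set_t mask;
        /* pid 0 means "the calling thread"; the mask holds the
           HW-threads this OpenMP thread is allowed to run on. */
        sched_getaffinity(0, sizeof(mask), &mask);
        #pragma omp critical
        {
            printf("thread %3d allowed on:", omp_get_thread_num());
            for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
                if (CPU_ISSET(cpu, &mask)) printf(" %d", cpu);
            printf("\n");
        }
    }
    return 0;
}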
Locations (masking PLACES for cores)
Change to the mask directory (inside the knl_affinity directory):
$ cd mask
Build the omp_viewload executable:
$ make
Create 68 places, each a mask for one core, and execute omp_viewload with 68 OpenMP threads:
$ export OMP_NUM_THREADS=68
$ export OMP_PLACES="{0,68,136,204}:68"
$ omp_viewload    # What does each line mean? See the next slide for help.
                  # This will run for a while; use ^C to exit.
The place list {0,68,136,204}:68 creates 68 places; place i is {0+i, 68+i, 136+i, 204+i}, i.e. the 4 HW-threads of core i.
Locations (Mask Report)
Occupation report from omp_viewload for OMP_NUM_THREADS=68 and OMP_PLACES={0,68,136,204}:68, shown as (thread-id, HW-thread ids):
(2, { 2,70,138,206})
…
(15, { 15,83,151,219})
…
Locations (masking PLACES for tiles)
Create 34 places, each a mask for one tile, and execute omp_viewload with 34 OpenMP threads (one thread per tile):
$ export OMP_NUM_THREADS=34
$ export OMP_PLACES="{0:2,68:2,136:2,204:2}:34:2"
$ omp_viewload    # What does each line mean?
Here {0:2} means HW-threads 0 and 1, so the first place is the 8 HW-threads of tile 0 (cores 0 and 1); the :34:2 replicates that place 34 times with a stride of 2, one place per tile.
Another Window to Your KNL Interactive Node (Optional)
How to create another window to your interactive KNL compute node:
$ ssh <username>@login-knl1.stampede.tacc.utexas.edu
$ ssh c###-###       # get the node name from your idev prompt or from squeue
$ ...
c###-###...$         # This is your 2nd window on the KNL.
                     # Run top, compile code, etc.
Example:
login-knl1.stampede(1)$ whoami
train450
login-knl1.stampede(1)$ squeue
  JOBID   PARTITION      NAME     USER ST   TIME NODES NODELIST(REASON)
    755 Flat-All2Al idv11588  milfeld  R  13:06     1 c561-002
    754 Flat-All2Al idv11568 train450  R  16:22     1 c561-001
login-knl1.stampede(2)$ ssh c561-001
…
c561-001.stampede(1)$
Done!