1
Computational Methods in Astrophysics
Dr Rob Thacker (AT319E)
2
Column vs Row major storage
In FORTRAN the storage is such that A(3,N) is laid out as A(1,1), A(2,1), A(3,1), A(1,2), A(2,2), A(3,2), ... in memory: the first index is contiguous in memory (column major). In C/C++ an array A[N][3] is laid out as A[0][0], A[0][1], A[0][2], A[1][0], A[1][1], A[1][2], ...: the last index is contiguous (row major).
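A minimal C sketch of the consequence (the extents N and M are arbitrary illustrations): looping with the last index innermost walks memory contiguously, matching the row-major layout; the equivalent Fortran code would put the first index innermost.

#include <stddef.h>

#define N 1000          /* hypothetical extents, for illustration only */
#define M 3

static double A[N][M];

/* Cache-friendly traversal in C: the last index (j) varies fastest,
   so consecutive iterations touch consecutive memory locations. */
void zero_row_major(void)
{
    for (size_t i = 0; i < N; ++i)
        for (size_t j = 0; j < M; ++j)
            A[i][j] = 0.0;
}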
3
Loop Level Parallelism
Consider the single-precision vector multiply-add operation Y = aX + Y ("SAXPY").

Serial FORTRAN:
      do i=1,n
        Y(i)=a*X(i)+Y(i)
      end do

Serial C/C++:
for (i=1;i<=n;++i) {
  Y[i]+=a*X[i];
}

FORTRAN with OpenMP:
C$OMP PARALLEL DO
C$OMP& DEFAULT(NONE)
C$OMP& PRIVATE(i),SHARED(X,Y,n,a)
      do i=1,n
        Y(i)=a*X(i)+Y(i)
      end do

C/C++ with OpenMP:
#pragma omp parallel for \
        private(i) shared(X,Y,n,a)
for (i=1;i<=n;++i) {
  Y[i]+=a*X[i];
}
4
In more detail

C$OMP PARALLEL DO
C$OMP& DEFAULT(NONE)
C$OMP& PRIVATE(i),SHARED(X,Y,n,a)
      do i=1,n
        Y(i)=a*X(i)+Y(i)
      end do

The C$OMP lines are comment pragmas for FORTRAN; the ampersand is necessary for continuation lines. PARALLEL DO denotes that this is a region of code for parallel execution. DEFAULT(NONE) is good programming practice: you must then declare the nature of all variables. Thread PRIVATE variables: each thread must have its own copy of the variable (in this case i is the only private variable). Thread SHARED variables: all threads can access these variables, but must not update individual memory locations simultaneously.
5
Reminder

Recall that private variables are uninitialized on entry to a parallel region. The motivation is that there is no need to copy values in from the serial section. Similarly, private variables do not carry a value back into the serial parts of the code, so there is no need to copy from the parallel to the serial section. The API provides two mechanisms to circumvent these issues: FIRSTPRIVATE and LASTPRIVATE.
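A minimal C sketch of the two clauses (the names x, offset and n are hypothetical): FIRSTPRIVATE copies the serial value in, LASTPRIVATE copies the value from the sequentially last iteration back out.

#include <omp.h>

void scale_and_shift(double *x, int n)
{
    double offset = 1.0;   /* set in the serial section */
    int i;

    /* firstprivate: each thread's private copy of offset starts at 1.0.
       lastprivate: after the loop, i holds the value it would have had
       after a serial execution of the loop.                             */
    #pragma omp parallel for firstprivate(offset) lastprivate(i)
    for (i = 0; i < n; ++i)
        x[i] += offset;

    /* here i == n, just as in serial code */
}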
6
Load balancing

When running in parallel you are only as fast as your slowest thread. In the example the total work is 40 seconds and there are 4 CPUs, so the best possible time is 40/4 = 10 secs; all four threads would have to take 10 secs each to reach that maximum speed-up. With poor load balance the slowest thread dominates: only a 40/16 = 2.5 speed-up despite using 4 processors.
7
STATIC scheduling Simplest of the four
If SCHEDULE is unspecified, STATIC scheduling will result. The default behaviour is simply to divide the iterations among the threads in contiguous blocks of roughly n/(# threads). SCHEDULE(STATIC,chunksize) instead creates a cyclic distribution of chunks of iterations.
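For reference, the same clause in C syntax (a sketch; expensive_work() is a hypothetical per-iteration routine):

#include <omp.h>

double expensive_work(int i);          /* hypothetical kernel */

void run_static(double *y, int n)
{
    int i;

    /* With no schedule clause each thread gets one contiguous block of
       roughly n/nthreads iterations; schedule(static,1) instead deals
       the iterations out cyclically, one at a time.                    */
    #pragma omp parallel for schedule(static, 1) private(i) shared(y, n)
    for (i = 0; i < n; ++i)
        y[i] = expensive_work(i);
}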
8
Comparison (16 iterations on 4 threads)

STATIC, no chunksize:
THREAD 1: 1 2 3 4
THREAD 2: 5 6 7 8
THREAD 3: 9 10 11 12
THREAD 4: 13 14 15 16

STATIC, chunksize=1:
THREAD 1: 1 5 9 13
THREAD 2: 2 6 10 14
THREAD 3: 3 7 11 15
THREAD 4: 4 8 12 16
9
Chunksize & Cache line issues
If you are accessing arrays using the loop index, as below, ensure that the chunksize is greater than the number of words in a cache line; otherwise you get false sharing:

C$OMP PARALLEL DO
C$OMP& PRIVATE(i,..), SHARED(a,..)
C$OMP& SCHEDULE(STATIC,1)
      do i=1,n
        **work**
        a(i)=...
      end do
10
Drawbacks to STATIC scheduling
STATIC scheduling is a naïve method of achieving good load balance. It works well if the work in each iteration is constant. If the work varies by a factor of ~10 then cyclic distribution of the iterations can usually help; if it varies by larger factors then dynamic balancing is usually better.
11
DYNAMIC scheduling DYNAMIC scheduling is a personal favourite
Specify using SCHEDULE(DYNAMIC,chunksize). This is a simple implementation of a master-worker type distribution of iterations: the master thread passes off chunks of iterations of size chunksize to the worker threads. It is not a silver bullet: if the load imbalance is too severe (i.e. one thread takes longer than all the rest combined) an algorithm rewrite is necessary. It is also not good if you need a regular access pattern for data locality.
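In C the same idea looks like this (a sketch; variable_work() is a hypothetical routine whose cost varies strongly from iteration to iteration):

#include <omp.h>

double variable_work(int i);           /* hypothetical, uneven cost per call */

void run_dynamic(double *y, int n)
{
    int i;

    /* Each thread grabs the next chunk of 8 iterations from the runtime
       as soon as it finishes its current chunk, so fast threads are not
       left waiting on slow ones.                                         */
    #pragma omp parallel for schedule(dynamic, 8) private(i) shared(y, n)
    for (i = 0; i < n; ++i)
        y[i] = variable_work(i);
}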
12
Master-Worker Model (diagram): THREAD 1, THREAD 2 and THREAD 3 each send a REQUEST to the Master, which hands back the next chunk of iterations.
13
Performance issues - False Sharing
You may parallelize your algorithm and find that performance is less than stellar (figure: speed-up versus Ncpu).
14
Example

A possible cause of poor performance is something called 'false sharing'. Consider this simple code which sums the rows of a matrix:

      integer m,n,i,j
      real a(m,n),s(m)
C$OMP PARALLEL DO
C$OMP& PRIVATE(i,j)
C$OMP& SHARED(s,a)
      do i=1,m
        s(i)=0.0
        do j=1,n
          s(i)=s(i)+a(i,j)
        end do
      end do
15
Execution time line Set m=4, what happens in each thread?
Since memory is laid out in four-word cache lines, at each stage all four threads are fighting for the same cache line:

t=0:  s(1)=0.0            s(2)=0.0            s(3)=0.0            s(4)=0.0
t=1:  s(1)=s(1)+a(1,1)    s(2)=s(2)+a(2,1)    s(3)=s(3)+a(3,1)    s(4)=s(4)+a(4,1)
t=2:  s(1)=s(1)+a(1,2)    s(2)=s(2)+a(2,2)    s(3)=s(3)+a(3,2)    s(4)=s(4)+a(4,2)
16
Cache line invalidation
For each thread, prior to the next operation on s(), it must retrieve a new version of the s(1:4) cache line from main memory (its own copy of s(1:4) has been invalidated by the other threads' updates). Fetches from anywhere other than the local cache are much slower, and the result is a significantly increased run time.
17
Simple solution

Just need to spread out the elements of s() so that each of them sits in its own cache line:

      integer m,n,i,j
      real a(m,n),s(32,m)
C$OMP PARALLEL DO
C$OMP& PRIVATE(i,j)
C$OMP& SHARED(s,a)
      do i=1,m
        s(1,i)=0.0
        do j=1,n
          s(1,i)=s(1,i)+a(i,j)
        end do
      end do
18
Layout of s(32,m) (diagram): with 8-word cache lines, the elements actually used, s(1,i-1), s(1,i), s(1,i+1), s(1,i+2), are now separated by (multiple) cache lines, since each 32-word column s(1:32,i) spans several lines.
19
Tips to avoid false sharing
Minimize the number of variables that are shared. Segregate rarely updated variables from those that are updated frequently ("volatile"). Isolate the volatile items in separate cache lines. You have to accept the waste of memory to improve performance (see the sketch below).
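A C sketch of the padding idea, assuming 64-byte (8-double) cache lines; it mirrors the s(32,m) trick used earlier, with only the first element of each padded row actually used.

#define PAD 8    /* doubles per cache line: an assumption about the hardware */

/* Row sums with padded accumulators: s[i][0] holds the sum for row i and
   the remaining PAD-1 doubles are never touched, so no two threads ever
   update the same cache line.                                            */
void row_sums(int m, int n, double a[m][n], double s[m][PAD])
{
    int i, j;

    #pragma omp parallel for private(i, j) shared(a, s, m, n)
    for (i = 0; i < m; ++i) {
        s[i][0] = 0.0;
        for (j = 0; j < n; ++j)
            s[i][0] += a[i][j];
    }
}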
20
OpenMP subroutines and functions
OMP_SET_NUM_THREADS (s)
OMP_GET_NUM_THREADS (f)
OMP_GET_MAX_THREADS (f)
OMP_GET_THREAD_NUM (f)
OMP_GET_NUM_PROCS (f)
OMP_IN_PARALLEL (f)
OMP_SET_DYNAMIC (s)
OMP_GET_DYNAMIC (f)
OMP_SET_NESTED (s)
OMP_GET_NESTED (f)
(s) = subroutine, (f) = function; the C/C++ versions are all lower case.
21
Useful functions/subroutines
OMP_SET_NUM_THREADS is useful to change the number of threads of execution:

      call OMP_SET_NUM_THREADS(num_threads)
or
      void omp_set_num_threads(int num_threads)

However, it will accept values higher than the number of CPUs, which will result in poor execution times.
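A small C example of the call (a sketch; it just requests 4 threads and reports how many were actually used):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    omp_set_num_threads(4);   /* requesting more threads than CPUs is
                                 accepted, but hurts execution time     */

    #pragma omp parallel
    {
        if (omp_get_thread_num() == 0)
            printf("executing with %d threads\n", omp_get_num_threads());
    }
    return 0;
}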
22
OMP_GET_NUM_THREADS Returns the number of threads currently being used for execution Will obviously produce a different result when executed in a parallel loop, versus a serial part of the code – be careful!
23
OMP_GET_MAX_THREADS While OMP_GET_NUM_THREADS will return the number of threads being used OMP_GET_MAX_THREADS returns the maximum possible number This value will be the same whether the code is executing in a serial or parallel section Remember: that is not true for OMP_GET_NUM_THREADS!
24
OMP_GET_THREAD_NUM

A very useful function: it returns your thread number, from 0 to (number of threads)-1. It can be used to control access to data by using the thread number to index the start and end points of a section, and it can also be useful in debugging race conditions.
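A C sketch of that indexing idea (the array x, its length n and the doubling "work" are all hypothetical): each thread uses its id to compute the start and end of its own slice.

#include <omp.h>

void process_blocks(double *x, int n)
{
    #pragma omp parallel shared(x, n)
    {
        int nthreads = omp_get_num_threads();
        int tid      = omp_get_thread_num();       /* 0 .. nthreads-1 */
        int blk      = n / nthreads;
        int js       = tid * blk;
        int je       = (tid == nthreads - 1) ? n : js + blk;

        for (int k = js; k < je; ++k)
            x[k] *= 2.0;                            /* stand-in for real work */
    }
}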
25
Environment variables
The OpenMP standard defines a number of environment variables (some of which we have met) OMP_NUM_THREADS OMP_SCHEDULE OMP_DYNAMIC OMP_NESTED
26
Environment variables
All of these variables can be overridden by using the subroutines we discussed (with the exception of the OMP_SCHEDULE variable) OMP_DYNAMIC is set to false by default OMP_NESTED is set to false by default If you are using nesting it is probably safer to ensure you turn on nesting within the code
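A minimal sketch of overriding those defaults in code rather than through the environment (only calls from the list two slides back are used):

#include <omp.h>

int main(void)
{
    omp_set_dynamic(0);   /* same effect as OMP_DYNAMIC=false (the default) */
    omp_set_nested(1);    /* turn nesting on explicitly in the code, as
                             recommended, rather than relying on OMP_NESTED */
    return 0;
}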
27
Thread Stack size

One of the most significant variables is not defined within the OpenMP standard. Each parallel thread may require a large stack on which to hold its private variables. Typical sizes are 1-8 MB, but certain codes may require more; I often run problems where I need over 100 MB of thread stack, for example.
28
Quick example

Consider the following code:

      real r(3,128**3)
C$OMP PARALLEL DO
C$OMP& PRIVATE(r,i)
      do i=1,10
        call work(r)
      end do

Because r is PRIVATE, you are assigning 128**3 * 3 words onto each thread stack, i.e. 24 MB.
29
Conditional Parallelism
To save code replication, OpenMP allows you to determine at run time whether a loop should be executed in parallel or serial: 'conditional parallelism'. A PARALLEL DO directive can carry an IF clause; only if the IF expression is true is the loop executed in parallel:

C$OMP PARALLEL DO
C$OMP& IF(L.gt.Lrmax)
C$OMP& PRIVATE(...)
      code....
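The C form of the same clause (a sketch; Lrmax is a hypothetical threshold playing the role of the slide's L.gt.Lrmax test):

#include <omp.h>

void saxpy_conditional(double a, const double *x, double *y, int n, int Lrmax)
{
    int i;

    /* The loop body is identical either way; the if clause decides at run
       time whether it executes with a team of threads or serially.        */
    #pragma omp parallel for if(n > Lrmax) private(i) shared(x, y, n, a)
    for (i = 0; i < n; ++i)
        y[i] += a * x[i];
}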
30
How this avoids replication
Suppose you have a routine that is applied to a certain data set, but you may have one large data set or, alternatively, many small data sets.

Option 1: a single instance of call foo, run in parallel across the machine.
Option 2: multiple serial calls to foo, issued in parallel.
31
Conditional Parallelism
To determine whether the routine is already executing in parallel, the omp_in_parallel function can be used:

      subroutine foo(bar)
      real bar
      logical inpar
      inpar = omp_in_parallel()
C$OMP PARALLEL DO
C$OMP& IF(.NOT. inpar)
C$OMP& PRIVATE(...)
C$OMP& SHARED(...)
      do i=1,N
        work....
      end do
      return
      end
32
Comments Applications that have a recursive nature can often be significantly reduced in size by using conditional parallelism Most significant benefit is probably in code correctness – any bugs need only be fixed once!
33
Conditional Compilation
It's good practice to keep your OpenMP code usable in serial. However, if you call OpenMP functions they may not be defined when the code is compiled without OpenMP. Conditional compilation provides a simple way around this.
34
Conditional Compilation
Two methods can be used to remove calls to OpenMP routines. The first is the C$ conditional-compilation sentinel, which is only compiled when OpenMP is enabled:

C$    i=OMP_GET_THREAD_NUM()

Alternatively, C preprocessor statements can be used:

#ifdef _OPENMP
      i=OMP_GET_THREAD_NUM()
#endif
35
Simple Example

Suppose you use the thread-number function from the previous example:

C$OMP PARALLEL DO
C$OMP& PRIVATE(j)
      do i=1,n
        j=OMP_GET_THREAD_NUM()
        js=j*iblk+1
        je=(j+1)*iblk
        do k=js,je
          ! Use j to point to parts of an array
        end do
      end do
C$OMP END PARALLEL DO

To keep a serial build working, add a default value (and ensure all constants are adjusted accordingly):

C$OMP PARALLEL DO
C$OMP& PRIVATE(j)
      do i=1,n
        j=1
C$      j=OMP_GET_THREAD_NUM()
        js=j*iblk+1
        je=(j+1)*iblk
        do k=js,je
          ! Use j to point to parts of an array
        end do
      end do
C$OMP END PARALLEL DO
36
Parallel I/O

Provided you are prepared to read and write multiple files, OpenMP can give a moderate I/O speed-up. Even if you are writing only one array, sections of it can be written within the files, e.g.

      write(13) (r(i),i=1,block)
      write(13) (r(i),i=block+1,2*block)
37
I/O continued

However, writing to one file system implies a bandwidth limit; multiple file systems on one machine can show great benefits. Note: some machines require that applications be compiled with special flags, and you may need a parallel I/O flag at compile time.
38
Code Example

This writes the array r() out to nthread files simultaneously:

      blck_len=n/OMP_GET_MAX_THREADS()
C$OMP PARALLEL DO
C$OMP& PRIVATE(i,filename,js,je)
C$OMP& SHARED(blck_len,r)
      do i=1,nthread
        write(filename,'("r",i4.4,".out")') i
        js=(i-1)*blck_len+1
        je=i*blck_len
        open(i+10,file=filename,status='new',form='unformatted')
        write(i+10) (r(j),j=js,je)
        close(i+10)
      end do
39
Performance examples: the speed-up achieved ultimately depends on the system you are on.
40
Alternative programming methodology
OpenMP also allows you to program in a way more akin to MPI (the SPMD model). Using the PARALLEL construct you don't need a do loop, or sections, to execute in parallel:

C$OMP PARALLEL
      call foo(i,a,b,c)
C$OMP END PARALLEL
41
Execution (diagram): each of the four threads makes its own call to foo.
42
PARALLEL environment

To do useful work it is necessary to have each subroutine call access different data. You can use the thread id to select different pieces of an array, similar in principle to using process ids in MPI. Good scaling requires that the replicated calls really do perform parallel work, rather than just replicating execution.
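A C sketch of the SPMD pattern (foo and its arguments are hypothetical): every thread makes the same call, and the routine uses the thread id to pick its own portion of the data.

#include <omp.h>

void foo(int tid, int nthreads, double *a);   /* hypothetical worker routine */

void run_spmd(double *a)
{
    #pragma omp parallel shared(a)
    {
        /* each thread works on the slice of a() selected by its id */
        foo(omp_get_thread_num(), omp_get_num_threads(), a);
    }
}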
43
Pros/cons of this approach
Cons: it may be difficult to take an already written code and convert it to this style, and it can also lead to code that is tough to debug. Pros: codes written in this style often scale very well, and it is potentially very useful if you are starting from scratch.
44
BARRIERS & SINGLE SECTIONS
Within the parallel region it may be necessary to synchronize and/or have only one thread execute a piece of code. This can be achieved by using the SINGLE clause and the BARRIER directive:

C$OMP PARALLEL DEFAULT(SHARED)
      CALL WORK(X)
C$OMP BARRIER
C$OMP SINGLE
      CALL OUTPUT(X)
      CALL INPUT(Y)
C$OMP END SINGLE
      CALL WORK(Y)
C$OMP END PARALLEL

The BARRIER blocks until all threads arrive; the SINGLE section is executed by the first thread to arrive at that part of the code.
45
SINGLE DIRECTIVE

The point of using the SINGLE directive is to avoid the overhead of stopping and then restarting the parallel section. There is an implied barrier at the END SINGLE statement; this can be avoided by adding NOWAIT:

C$OMP END SINGLE NOWAIT
46
MASTER DIRECTIVE Instead of using SINGLE, one can also use the MASTER directive to select code that only the master thread should execute Unlike the SINGLE directive there is no implied barrier at the END MASTER statement MASTER regions can be used to keep tabs on the values of control variables SINGLE regions are better for I/O (for example)
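A C sketch contrasting the two directives (the printed quantities are hypothetical): master has no implied barrier, while single has one unless nowait is added.

#include <stdio.h>
#include <omp.h>

void report(int step, double residual)
{
    #pragma omp parallel shared(step, residual)
    {
        /* executed by thread 0 only; the other threads do not wait here */
        #pragma omp master
        printf("step %d: residual = %g\n", step, residual);

        /* executed by whichever thread arrives first; all threads wait
           at the implied barrier at the end of the single region        */
        #pragma omp single
        printf("step %d written once\n", step);
    }
}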
47
DO

Within parallel regions, DO loops can be specified as parallel DO loops. It is exactly the same idea as loop-level parallelism: you have to specify the scope of the variables, the scheduling, and so on.
48
Example for 4 threads

Without a DO directive every thread executes the whole loop, so 'Hello' is printed four times by each thread:

C$OMP PARALLEL
      do i=1,4
        write(*,*) 'Hello'
      end do
C$OMP END PARALLEL

With the DO directive the iterations are shared out, so each thread prints 'Hello' once:

C$OMP PARALLEL
C$OMP DO
      do i=1,4
        write(*,*) 'Hello'
      end do
C$OMP END DO
C$OMP END PARALLEL
49
Task Queue in the SPMD model
      program tasks
      integer my_task,mytid,index
      index=1
C$OMP PARALLEL
C$OMP& DEFAULT(NONE)
C$OMP& PRIVATE(mytid),SHARED(index)
C$OMP CRITICAL
      mytid=my_task(index)
C$OMP END CRITICAL
      do while (mytid.ne.-1)
        call work(mytid)
C$OMP CRITICAL
        mytid=my_task(index)
C$OMP END CRITICAL
      end do
C$OMP END PARALLEL
      end

      integer function my_task(index)
      integer index,MAX
      MAX=some value
      if (index.lt.MAX) then
        index=index+1
        my_task=index
      else
        my_task=-1
      end if
      return
      end
50
WORKSHARE

This is a new feature in the OpenMP 2.0 standard. It allows parallelization of f90 array expressions and FORALL statements. For example, if A and B are conforming arrays:

C$OMP WORKSHARE
      A=B
C$OMP END WORKSHARE

The directive was brought in to get around problems with f90 array syntax.
51
Memory locality “LOCATION, LOCATION, LOCATION!”
Strive for memory locality at all times. Example: we used linked lists to control access to particles. Replacing them with ordered lists gave a 35% speed improvement; reordering the particle indices to ensure sequential access gave a 240%(!) speed improvement.
52
Examples

Poor locality, linked list:

      j= !start value
 1    if (j.gt.0) then
        x1=r(1,j)
        work...
        j=ll(j)
        goto 1
      end if

Slightly better locality, ordered sequential list:

      js= !start value
      je= js+n
      do k=js,je
        j=ol(k)
        x1=r(1,j)
        work...
      end do
53
Best locality:

      js= !start value
      je= js+n
      do j=js,je
        x1=r(1,j)
        work...
      end do

You need to reorder the values of r(,) for this to work, but that penalty is more than recouped by the increased speed of this kind of loop.
54
Example
55
Rules of thumb

Cache-friendly algorithms (i.e. applications which access memory in a regular pattern) are good candidates to scale to large numbers of CPUs. Applications with irregular access to memory (cache unfriendly) are poor candidates, unless they perform a very large number of calculations on the retrieved data: cache-unfriendly algorithms implicitly require a high computation-to-communication ratio.