1
Introduction to Scientific Computing Doug Sondak (sondak@bu.edu) Boston University Scientific Computing and Visualization
2
Outline: Introduction, Software, Parallelization, Hardware
3
Introduction What is Scientific Computing? – Need for speed – Need for memory Simulations tend to grow until they overwhelm available resources – If I can simulate 1000 neurons, wouldn’t it be cool if I could do 2000? 10000? 10^87? Example – flow over an airplane – It has been estimated that if a teraflop machine were available, it would take about 200,000 years to solve (resolving all scales). If Homo erectus had a teraflop machine, we could be getting the result right about now.
4
Introduction (cont’d) Optimization – Profile serial (1-processor) code Tells where most time is consumed – Is there any “low-hanging fruit”? Faster algorithm Optimized library Wasted operations Parallelization – Break problem up into chunks – Solve chunks simultaneously on different processors
5
Software
6
Compiler The compiler is your friend (usually) Optimizers are quite refined – Always try the highest level Usually -O3 Sometimes -fast, -O5, … Loads of flags, many for optimization Good news – many compilers will automatically parallelize for shared-memory systems Bad news – this usually doesn’t work well
7
Software Libraries – Solver is often a major consumer of CPU time – Numerical Recipes is a good book, but many algorithms are not optimal – LAPACK is a good resource – Libraries are often available that have been optimized for the local architecture Disadvantage – not portable
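A minimal sketch (my example, not from the slides) of leaning on LAPACK instead of hand-rolled solver code; it assumes LAPACK is installed, that the Fortran routine is reachable from C under the common dgesv_ naming convention, and that you link with something like -llapack:

/* Minimal sketch: solve A x = b with LAPACK's dgesv instead of writing
   your own solver.  Assumes the Fortran routine is callable as dgesv_
   (the usual convention) and that LAPACK is linked in. */
#include <stdio.h>

extern void dgesv_(int *n, int *nrhs, double *a, int *lda,
                   int *ipiv, double *b, int *ldb, int *info);

int main(void)
{
    int n = 2, nrhs = 1, lda = 2, ldb = 2, ipiv[2], info;
    double a[4] = { 3.0, 1.0,     /* column-major 2x2 matrix */
                    1.0, 2.0 };
    double b[2] = { 9.0, 8.0 };   /* right-hand side; overwritten with x */

    dgesv_(&n, &nrhs, a, &lda, ipiv, b, &ldb, &info);

    if (info == 0)
        printf("x = (%g, %g)\n", b[0], b[1]);   /* expect (2, 3) */
    else
        printf("dgesv failed, info = %d\n", info);
    return 0;
}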
8
Parallelization
9
Divide and conquer! – divide operations among many processors – perform operations simultaneously – if serial run takes 10 hours and we hit the problem with 5000 processors, it should take about 7 seconds to complete, right? not so easy, of course
10
Parallelization (cont’d) problem – some calculations depend upon previous calculations – can’t be performed simultaneously – sometimes tied to the physics of the problem, e.g., time evolution of a system want to maximize amount of parallel code – occasionally easy – usually requires some work
11
Parallelization (3) method used for parallelization may depend on hardware distributed memory – each processor has own address space – if one processor needs data from another processor, must be explicitly passed shared memory – common address space – no message passing required
12
Parallelization (4) Diagrams: distributed memory – proc 0 … proc 3, each with its own mem 0 … mem 3; shared memory – proc 0 … proc 3 all attached to a single mem; mixed memory – proc 0 … proc 3 in two groups, each group sharing mem 0 or mem 1
13
Parallelization (5) MPI – for both distributed and shared memory – portable – freely downloadable OpenMP – shared memory only – must be supported by compiler (most do) – usually easier than MPI – can be implemented incrementally
14
MPI Computational domain is typically decomposed into regions – One region assigned to each processor Separate copy of program runs on each processor
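A minimal sketch (my example, not the course code) of “separate copy of program runs on each processor”: every MPI process executes the same source, and MPI_Comm_rank tells each copy which region of the decomposed domain it owns.

/* Compile with an MPI wrapper such as mpicc and launch with mpirun. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, nprocs;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which copy am I?       */
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs); /* how many copies total? */

    /* e.g., this process would work on region "rank" of the domain */
    printf("process %d of %d: working on my region\n", rank, nprocs);

    MPI_Finalize();
    return 0;
}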
15
MPI Discretized domain to solve flow over an airfoil System of coupled PDEs solved at each point
16
MPI Decomposed domain for 4 processors
17
MPI Since points depend on adjacent points, must transfer information after each iteration This is done with explicit calls in the source code
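As an illustration of what those explicit calls look like, here is a hedged sketch (my own 1-D example with ghost cells, not the actual airfoil code): after each iteration, every process sends its rightmost interior value to the right neighbor, which receives it into a ghost cell; the opposite direction would be handled symmetrically.

#include <stdio.h>
#include <mpi.h>

#define NLOCAL 4                       /* interior points per process */

int main(int argc, char *argv[])
{
    double u[NLOCAL + 2];              /* u[0] and u[NLOCAL+1] are ghost cells */
    int rank, nprocs, i;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    for (i = 0; i <= NLOCAL + 1; i++) u[i] = rank;   /* stand-in for one iteration's result */

    if (rank < nprocs - 1)             /* send rightmost interior point to the right */
        MPI_Send(&u[NLOCAL], 1, MPI_DOUBLE, rank + 1, 0, MPI_COMM_WORLD);
    if (rank > 0)                      /* receive left neighbor's value into my ghost cell */
        MPI_Recv(&u[0], 1, MPI_DOUBLE, rank - 1, 0, MPI_COMM_WORLD, &status);

    printf("rank %d: left ghost cell now %g\n", rank, u[0]);

    MPI_Finalize();
    return 0;
}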
18
MPI Diminishing returns – Sending messages can get expensive – Want to maximize ratio of computation to communication Parallel speedup, parallel efficiency T = time n = number of processors
19
Speedup
20
Parallel Efficiency
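For reference, the standard definitions of these two quantities, written in terms of the time T and the number of processors n defined two slides back:

Speedup: S(n) = T(1) / T(n)
Parallel efficiency: E(n) = S(n) / n = T(1) / (n * T(n))

A perfectly parallel code would give S(n) = n and E(n) = 1; communication cost and serial sections push both below that ideal.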
21
OpenMP Usually loop-level parallelization An OpenMP directive is placed in the source code before the loop – Assigns subset of loop indices to each processor – No message passing since each processor can “see” the whole domain for(i=0; i<N; i++){ do lots of stuff }
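A minimal sketch of the directive placement (my example; assumes a compiler with OpenMP support, typically enabled with a flag such as -fopenmp):

/* The pragma before the loop assigns a subset of the indices i to each
   thread; no message passing is needed because every thread sees x. */
#include <stdio.h>

#define N 1000000

int main(void)
{
    static double x[N];
    int i;

    #pragma omp parallel for
    for (i = 0; i < N; i++) {
        x[i] = 2.0 * i;        /* "do lots of stuff" */
    }

    printf("x[N-1] = %g\n", x[N - 1]);
    return 0;
}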
22
OpenMP Can’t guarantee order of operations Parallelize this loop on 2 processors – example of how to do it wrong!

for(i = 0; i < 7; i++) a[i] = 1;
for(i = 1; i < 7; i++) a[i] = 2*a[i-1];

 i   a[i] (serial)   a[i] (parallel)
 0         1               1
 1         2               2   (Proc. 0)
 2         4               4   (Proc. 0)
 3         8               8   (Proc. 0)
 4        16               2   (Proc. 1)
 5        32               4   (Proc. 1)
 6        64               8   (Proc. 1)
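A hedged, runnable version of the same mistake (my sketch, assuming OpenMP and 2 threads): the second loop reads a[i-1] that the other thread may not have written yet, which is how the parallel column above can arise.

/* WARNING: deliberately wrong parallelization, for demonstration only. */
#include <stdio.h>

int main(void)
{
    double a[7];
    int i;

    for (i = 0; i < 7; i++) a[i] = 1;

    /* Loop-carried dependence: a[i] needs a[i-1] from the same loop,
       so splitting the iterations between threads breaks the ordering.
       The exact output depends on how the runtime schedules the threads. */
    #pragma omp parallel for num_threads(2)
    for (i = 1; i < 7; i++) a[i] = 2*a[i-1];

    for (i = 0; i < 7; i++) printf("a[%d] = %g\n", i, a[i]);
    return 0;
}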
23
Hardware
24
A faster processor is obviously good, but: – Memory access speed is often a big driver Cache – a critical element of memory system Processors have internal parallelism such as pipelines and multiply-add instructions
25
Cache Cache is a small chunk of fast memory between the main memory and the registers Memory hierarchy: registers – primary cache – secondary cache – main memory
26
Cache (cont’d) Variables are moved from main memory to cache in lines – L1 cache line sizes on our machines Opteron (blade cluster) 64 bytes Power4 (p-series) 128 bytes PPC440 (Blue Gene) 32 bytes Pentium III (linux cluster) 32 bytes If variables are used repeatedly, code will run faster since cache memory is much faster than main memory
27
Cache (cont’d) Why not just make the main memory out of the same stuff as cache? – Expensive – Runs hot – This was actually done in Cray computers Liquid cooling system
28
Cache (cont’d) Cache hit – Required variable is in cache Cache miss – Required variable not in cache – If cache is full, something else must be thrown out (sent back to main memory) to make room – Want to minimize number of cache misses
29
Cache example Main memory holds x[0] x[1] … x[9] followed by a, b, …; a “mini” cache holds 2 lines, 4 words each Loop: for(i=0; i<10; i++) x[i] = i;
30
Cache example (cont’d) for(i=0; i<10; i++) x[i] = i; (we will ignore i itself for simplicity) Need x[0], not in cache – cache miss – load the line x[0] x[1] x[2] x[3] from memory into cache – the next 3 loop indices result in cache hits
31
Cache example (cont’d) Need x[4], not in cache – cache miss – load the line x[4] x[5] x[6] x[7] from memory into the second cache line – the next 3 loop indices result in cache hits
32
Cache example (cont’d) Need x[8], not in cache – cache miss – load the line holding x[8] x[9] a b from memory – but there is no room in cache! an old line must be replaced (sent back to main memory)
33
Cache (cont’d) Contiguous access is important In C, multidimensional array is stored in memory as a[0][0] a[0][1] a[0][2] …
34
Cache (cont’d) In Fortran and Matlab, multidimensional array is stored the opposite way: a(1,1) a(2,1) a(3,1) …
35
Cache (cont’d) Rule: Always order your loops appropriately

C:
for(i=0; i<N; i++){
  for(j=0; j<N; j++){
    a[i][j] = 1.0;
  }
}

Fortran:
do j = 1, n
  do i = 1, n
    a(i,j) = 1.0
  enddo
enddo
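A small hedged sketch (my test, not from the slides) that makes the rule concrete in C: time both loop orders; the contiguous order is normally much faster because it walks through cache lines in storage order.

/* Compare contiguous (row-major) vs. strided (column-major) access in C. */
#include <stdio.h>
#include <time.h>

#define N 2000

static double a[N][N];

int main(void)
{
    int i, j;
    clock_t t0, t1;

    t0 = clock();
    for (i = 0; i < N; i++)        /* good: follows C storage order */
        for (j = 0; j < N; j++)
            a[i][j] = 1.0;
    t1 = clock();
    printf("i-j order: %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);

    t0 = clock();
    for (j = 0; j < N; j++)        /* bad: jumps N*8 bytes between accesses */
        for (i = 0; i < N; i++)
            a[i][j] = 1.0;
    t1 = clock();
    printf("j-i order: %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);

    return 0;
}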
36
SCF Machines
37
p-series Shared memory IBM Power4 processors 32 KB L1 cache per processor 1.41 MB L2 cache per pair of processors 128 MB L3 cache per 8 processors
38
p-series
machine     Proc. spd   # procs   memory
kite        1.3 GHz     32        32 GB
pogo        1.3 GHz     32        32 GB
frisbee     1.3 GHz     32        32 GB
domino      1.3 GHz     16        16 GB
twister     1.1 GHz     8         16 GB
scrabble    1.1 GHz     8         16 GB
marbles     1.1 GHz     8         16 GB
crayon      1.1 GHz     8         16 GB
litebrite   1.1 GHz     8         16 GB
hotwheels   1.1 GHz     8         16 GB
39
Blue Gene Distributed memory 2048 processors – 1024 2-processor nodes IBM PowerPC 440 processors – 700 MHz 512 MB memory per node (per 2 processors) 32 KB L1 cache per node 2 MB L2 cache per node 4 MB L3 cache per node
40
BladeCenter Hybrid memory 56 processors – 14 4-processor nodes AMD Opteron processors – 2.6 GHz 8 GB memory per node (per 4 processors) – Each node has shared memory 64 KB L1 cache per 2 processors 1 MB L2 cache per 2 processors
41
Linux Cluster Hybrid memory 104 processors – 52 2-processor nodes Intel Pentium III processors – 1.3 GHz 1 GB memory per node (per 2 processors) – Each node has shared memory 16 KB L1 cache per 2 processors 512 KB L2 cache per 2 processors
42
For More Information SCV web site http://scv.bu.edu/ Today’s presentations are available at http://scv.bu.edu/documentation/presentations/ under the title “Introduction to Scientific Computing and Visualization”
43
Next Time G & T code Time it – Look at effect of compiler flags Profile it – Where is time consumed? Modify it to improve serial performance