Parallel Spectral Methods: Solving Elliptic Problems with FFTs


1 Parallel Spectral Methods: Solving Elliptic Problems with FFTs
Kathy Yelick, CS267 Lecture 19, 03/21/2007

2 References
Previous CS267 lectures
Lecture by Geoffrey Fox
FFTW project
Spiral project

3 Poisson’s equation arises in many models
3D: 2u/x2 + 2u/y2 + 2u/z2 = f(x,y,z) f represents the sources; also need boundary conditions 2D: 2u/x2 + 2u/y2 = f(x,y) 1D: d2u/dx2 = f(x) Electrostatic or Gravitational Potential: Potential(position) Heat flow: Temperature(position, time) Diffusion: Concentration(position, time) Fluid flow: Velocity,Pressure,Density(position,time) Elasticity: Stress,Strain(position,time) Variations of Poisson have variable coefficients 03/21/2007 CS267 Lecture 19

4 Algorithms for the 2D (3D) Poisson Equation (N = n² (n³) variables)
Algorithm          Serial             PRAM                       Memory             #Procs
Dense LU           N^3                N                          N^2                N^2
Band LU            N^2 (N^(7/3))      N                          N^(3/2) (N^(5/3))  N (N^(4/3))
Jacobi             N^2 (N^(5/3))      N (N^(2/3))                N                  N
Explicit Inv.      N^2                log N                      N^2                N^2
Conj. Gradients    N^(3/2) (N^(4/3))  N^(1/2) (N^(1/3)) * log N  N                  N
Red/Black SOR      N^(3/2) (N^(4/3))  N^(1/2) (N^(1/3))          N                  N
Sparse LU          N^(3/2) (N^2)      N^(1/2)                    N*log N (N^(4/3))  N
FFT                N*log N            log N                      N                  N
Multigrid          N                  log² N                     N                  N
Lower bound        N                  log N                      N
(3D complexities in parentheses where they differ from 2D.)
PRAM is an idealized parallel model with zero-cost communication.
Reference: James Demmel, Applied Numerical Linear Algebra, SIAM, 1997.
Notes:
- Band LU serial complexity: dimension * bandwidth² = n² * n² in 2D and n³ * (n²)² in 3D
- Band LU storage: dimension * bandwidth; Band LU #procs = bandwidth²
- Jacobi complexity = complexity(SpMV) * #iterations, where #iterations = condition number = n² in 2D and 3D
- CG complexity = [complexity(SpMV) + complexity(dot product)] * #iterations, where #iterations = sqrt(condition number) = n in 2D and 3D
- Red/Black SOR complexity = complexity(SpMV) * #iterations, where #iterations = sqrt(condition number) = n in 2D and 3D

5 Solving Poisson’s Equation with the FFT
Express any 2D function defined on 0 ≤ x,y ≤ 1 as a series
   φ(x,y) = Σ_j Σ_k φ_jk sin(πjx) sin(πky)
Here the φ_jk are called the Fourier coefficients of φ(x,y).
The inverse of this is
   φ_jk = 4 ∫₀¹ ∫₀¹ φ(x,y) sin(πjx) sin(πky) dx dy
Poisson's equation ∂²φ/∂x² + ∂²φ/∂y² = f(x,y) becomes
   Σ_j Σ_k (-π²j² - π²k²) φ_jk sin(πjx) sin(πky) = Σ_j Σ_k f_jk sin(πjx) sin(πky)
where the f_jk are the Fourier coefficients of f(x,y), i.e.
   f(x,y) = Σ_j Σ_k f_jk sin(πjx) sin(πky)
This implies the PDE can be solved exactly, algebraically:
   φ_jk = f_jk / (-π²j² - π²k²)

6 Solving Poisson’s Equation with the FFT
So the solution of Poisson's equation involves the following steps:
1) Find the Fourier coefficients f_jk of f(x,y) by performing the integral.
2) Form the Fourier coefficients of φ by φ_jk = f_jk / (-π²j² - π²k²).
3) Construct the solution φ(x,y) by performing the sum.
There is another version of this (the Discrete Fourier Transform) which deals with functions defined at grid points, not directly with the continuous integral. Also, the mathematically simplest transform uses exp(-2πijx), not sin(πjx).
Let us first consider the 1D discrete version of this case. The PDE case normally deals with discretized functions, as these are needed for other parts of the problem.
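As a concrete illustration of steps 1-3, here is a minimal serial sketch using FFTW's type-I discrete sine transform (FFTW_RODFT00) in place of the continuous integral. The function name poisson_solve, the n x n interior grid, and the reuse of the continuous eigenvalues -π²(j² + k²) on the discrete grid are illustrative assumptions, not something specified in the lecture.

    /* Sketch: solve u_xx + u_yy = f on the unit square with zero Dirichlet
     * boundary conditions via 2D sine transforms (FFTW's DST-I).
     * f and u are n*n row-major arrays of interior grid values. */
    #include <fftw3.h>
    #include <math.h>

    void poisson_solve(double *f, double *u, int n) {
        /* Step 1: sine-transform f to get its coefficients f_jk (into u). */
        fftw_plan fwd = fftw_plan_r2r_2d(n, n, f, u, FFTW_RODFT00,
                                         FFTW_RODFT00, FFTW_ESTIMATE);
        fftw_execute(fwd);
        /* Step 2: divide each coefficient by -pi^2 (j^2 + k^2). */
        for (int j = 1; j <= n; j++)
            for (int k = 1; k <= n; k++)
                u[(j - 1) * n + (k - 1)] /= -(M_PI * M_PI) * (j * j + k * k);
        /* Step 3: inverse sine transform; RODFT00 is its own inverse up to
         * a factor of 2(n+1) per dimension, hence the final scaling. */
        fftw_plan bwd = fftw_plan_r2r_2d(n, n, u, u, FFTW_RODFT00,
                                         FFTW_RODFT00, FFTW_ESTIMATE);
        fftw_execute(bwd);
        for (int i = 0; i < n * n; i++)
            u[i] /= 4.0 * (n + 1) * (n + 1);
        fftw_destroy_plan(fwd);
        fftw_destroy_plan(bwd);
    }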

7 Serial FFT
Let i = sqrt(-1) and index matrices and vectors from 0.
The Discrete Fourier Transform of an m-element vector v is F*v, where F is the m-by-m matrix defined by
   F[j,k] = ω^(j*k)
and ω is
   ω = e^(2πi/m) = cos(2π/m) + i*sin(2π/m)
ω is a complex number whose m-th power is 1 (ω^m = 1) and is therefore called an m-th root of unity.
E.g., for m = 4: ω = i, ω² = -1, ω³ = -i, ω⁴ = 1.
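A small sketch (hypothetical, not from the slides) that prints this matrix for m = 4, where ω = i:

    /* Print the m x m DFT matrix F[j][k] = omega^(j*k) for m = 4. */
    #include <complex.h>
    #include <math.h>
    #include <stdio.h>

    int main(void) {
        int m = 4;
        double complex omega = cexp(2.0 * M_PI * I / m);   /* = i for m = 4 */
        for (int j = 0; j < m; j++) {
            for (int k = 0; k < m; k++) {
                double complex f = cpow(omega, j * k);
                printf("(%4.1f,%4.1f) ", creal(f), cimag(f));
            }
            printf("\n");
        }
        /* Rows: 1 1 1 1 / 1 i -1 -i / 1 -1 1 -1 / 1 -i -1 i */
        return 0;
    }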

8 Using the 1D FFT for filtering
Signal = sin(7t) + 0.5 sin(5t) at 128 points
Noise = random number bounded by 0.75
Filter by zeroing out FFT components < 0.25

9 Using the 2D FFT for image compression
Image = 200x320 matrix of values
Compress by keeping the largest 2.5% of FFT components
A similar idea is used by JPEG: it breaks an image into 8x8 pixel blocks, does a DCT on each block, quantizes the coefficients (rounds them so they can be stored in very few bits), and stores those bits.

10 Related Transforms
Most applications require multiplication by both F and inverse(F).
Multiplying by F and inverse(F) are essentially the same: inverse(F) is the complex conjugate of F divided by n.
For solving the Poisson equation and various other applications, we use variations on the FFT:
- The sine transform: the imaginary part of F
- The cosine transform: the real part of F
The algorithms are similar, so we will focus on the forward FFT.

11 Serial Algorithm for the FFT
Compute the FFT of an m-element vector v, F*v:
   (F*v)[j] = Σ_{k=0}^{m-1} F[j,k] * v[k]
            = Σ_{k=0}^{m-1} ω^(j*k) * v[k]
            = Σ_{k=0}^{m-1} (ω^j)^k * v[k]
            = V(ω^j)
where V is defined as the polynomial V(x) = Σ_{k=0}^{m-1} x^k * v[k].
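Read literally, this gives an O(m²) method: evaluate the polynomial V at each of the m points ω^j. A sketch of one such evaluation via Horner's rule (the name dft_point is illustrative); this is the baseline the FFT improves on:

    #include <complex.h>
    #include <math.h>

    /* Evaluate (F*v)[j] = V(omega^j): O(m) per point, O(m^2) in total. */
    double complex dft_point(const double complex *v, int m, int j) {
        double complex x = cexp(2.0 * M_PI * I * j / m);   /* omega^j */
        double complex acc = 0;
        for (int k = m - 1; k >= 0; k--)
            acc = acc * x + v[k];   /* Horner: v[k] + x*(v[k+1] + x*(...)) */
        return acc;
    }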

12 Divide-and-Conquer FFT
V can be evaluated using divide-and-conquer:
   V(x) = Σ_{k=0}^{m-1} x^k * v[k]
        = v[0] + x²*v[2] + x⁴*v[4] + … + x*(v[1] + x²*v[3] + x⁴*v[5] + …)
        = V_even(x²) + x*V_odd(x²)
V has degree m-1, so V_even and V_odd are polynomials of degree m/2 - 1.
We evaluate these at the points (ω^j)² for 0 ≤ j ≤ m-1.
But these are really just m/2 different points, since
   (ω^(j+m/2))² = (ω^j * ω^(m/2))² = ω^(2j) * ω^m = (ω^j)²
So an FFT on m points is reduced to 2 FFTs on m/2 points: divide and conquer!

13 Divide-and-Conquer FFT
FFT(v, ω, m)
   if m = 1 return v[0]
   else
      v_even = FFT(v[0:2:m-2], ω², m/2)
      v_odd  = FFT(v[1:2:m-1], ω², m/2)
      ω_vec  = [ω⁰, ω¹, …, ω^(m/2-1)]        (precomputed)
      return [v_even + (ω_vec .* v_odd), v_even - (ω_vec .* v_odd)]
The .* above is component-wise multiplication; [… , …] constructs an m-element vector from two m/2-element vectors.
This results in an O(m log m) algorithm.
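A runnable C rendering of this pseudocode, as a sketch: radix-2 only (m must be a power of two), with heap-allocated temporaries for clarity rather than speed.

    #include <complex.h>
    #include <math.h>
    #include <stdlib.h>

    /* In-place recursive radix-2 FFT of v[0..m-1], using the slide's
     * convention omega = e^(2*pi*i/m). */
    void fft(double complex *v, int m) {
        if (m == 1) return;
        double complex *even = malloc((m / 2) * sizeof *even);
        double complex *odd  = malloc((m / 2) * sizeof *odd);
        for (int k = 0; k < m / 2; k++) {   /* v[0:2:m-2] and v[1:2:m-1] */
            even[k] = v[2 * k];
            odd[k]  = v[2 * k + 1];
        }
        fft(even, m / 2);   /* recursive calls use omega^2 implicitly */
        fft(odd,  m / 2);
        for (int j = 0; j < m / 2; j++) {
            double complex w = cexp(2.0 * M_PI * I * j / m);   /* omega^j */
            v[j]         = even[j] + w * odd[j];
            v[j + m / 2] = even[j] - w * odd[j];
        }
        free(even);
        free(odd);
    }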

14 An Iterative Algorithm
The call tree of the divide-and-conquer FFT algorithm is a complete binary tree with log m levels.
An iterative algorithm, using loops rather than recursion, goes through each level of the tree starting at the bottom.
The algorithm overwrites v[i] with (F*v)[bitreverse(i)].
Practical algorithms combine recursion (for the memory hierarchy) and iteration (to avoid function call overhead).
Call tree for m = 16, written with index bit patterns (x = either bit; even indices go left, odd go right):
   FFT(xxxx) → FFT(xxx0), FFT(xxx1)
             → FFT(xx00), FFT(xx10), FFT(xx01), FFT(xx11)
             → FFT(x000), FFT(x100), …, FFT(x111)
             → leaves in bit-reversed order: FFT(0), FFT(8), FFT(4), FFT(12), FFT(2), FFT(10), FFT(6), FFT(14), FFT(1), FFT(9), FFT(5), FFT(13), FFT(3), FFT(11), FFT(7), FFT(15)
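The index permutation, as a small self-contained sketch:

    /* Reverse the low log2(m) bits of i; the iterative FFT leaves
     * (F*v)[bitreverse(i)] in v[i].  E.g., for m = 16, bitreverse(1) = 8. */
    unsigned bitreverse(unsigned i, int logm) {
        unsigned r = 0;
        for (int b = 0; b < logm; b++) {
            r = (r << 1) | (i & 1);   /* shift the low bit of i into r */
            i >>= 1;
        }
        return r;
    }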

15 Parallel 1D FFT
The data dependencies in the 1D FFT form a butterfly pattern.
A PRAM algorithm takes O(log m) time:
- each step to the right is parallel
- there are log m steps
What about communication cost? See the LogP paper for details.

16 Block Layout of 1D FFT
Using a block layout (m/p contiguous elements per processor):
- No communication in the last log(m/p) steps
- Each of the first log(p) steps requires fine-grained communication

17 Cyclic Layout of 1D FFT
Using a cyclic layout (one element per processor, wrapped):
- No communication in the first log(m/p) steps
- Communication in the last log(p) steps

18 Parallel Complexity
m = vector size, p = number of processors
f = time per flop = 1
a = startup for a message (in units of f)
b = time per word in a message (in units of f)
Time(blockFFT) = Time(cyclicFFT) = 2*m*log(m)/p + log(p)*a + m*log(p)/p * b

19 FFT With "Transpose"
If we start with a cyclic layout, the first log(p) steps require no communication.
Then transpose the vector, and the last log(m/p) steps (now in block layout) also require no communication.
All communication is in the transpose.
Note: this example has log(m/p) = log(p); if log(m/p) > log(p), more phases/layouts will be needed. We will work with this assumption for simplicity.

20 Why is the Communication Step Called a Transpose?
It is analogous to transposing an array: view the vector as a 2D array of m/p by p.
Note: the same idea is useful for uniprocessor caches.
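A small illustrative sketch making the correspondence explicit for m = 16, p = 4: in the cyclic layout processor r owns column r of the (m/p)-by-p array, and after the transpose it owns row r, which is exactly the block layout.

    #include <stdio.h>

    /* Cyclic: proc r owns elements q*p + r (column r of the array).
     * Block:  proc r owns elements r*(m/p) + q (row r, post-transpose). */
    int main(void) {
        int m = 16, p = 4;
        for (int r = 0; r < p; r++) {
            printf("proc %d  cyclic:", r);
            for (int q = 0; q < m / p; q++) printf(" %2d", q * p + r);
            printf("   block:");
            for (int q = 0; q < m / p; q++) printf(" %2d", r * (m / p) + q);
            printf("\n");
        }
        return 0;
    }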

21 Complexity of the FFT with Transpose
If no communication is pipelined (an overestimate!):
Time(transposeFFT) = 2*m*log(m)/p        (same as before)
                   + (p-1) * a           (was log(p) * a)
                   + m*(p-1)/p² * b      (was m*log(p)/p * b)
If communication is pipelined, so that we do not pay for p-1 messages, the second term becomes simply a, rather than (p-1)*a.
This is close to optimal. See the LogP paper for details.
See also the following papers on the class resource page:
- A. Sahai, "Hiding Communication Costs in Bandwidth-Limited FFT"
- R. Nishtala et al., "Optimizing bandwidth limited problems using one-sided communication"
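Plugging illustrative, made-up machine parameters into the two models (slides 18 and 21) shows the trade: the transpose version pays more message startups, (p-1)*a versus log(p)*a, but moves less data, m*(p-1)/p² versus m*log(p)/p words. The numbers below are assumptions for illustration, not measurements.

    #include <math.h>
    #include <stdio.h>

    /* Compare the butterfly and (unpipelined) transpose FFT models. */
    int main(void) {
        double m = 1 << 20, p = 64;    /* problem and machine size */
        double a = 1e4, b = 1e2;       /* startup and per-word cost, in flops */
        double t_fly = 2*m*log2(m)/p + log2(p)*a + m*log2(p)/p * b;
        double t_tr  = 2*m*log2(m)/p + (p - 1)*a + m*(p - 1)/(p*p) * b;
        printf("butterfly model: %.3g flop-units\n", t_fly);   /* ~1.0e7 */
        printf("transpose model: %.3g flop-units\n", t_tr);    /* ~2.9e6 */
        return 0;
    }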

22 Comment on the 1D Parallel FFT
The above algorithm leaves the data in bit-reversed order.
- Some applications, like Poisson, can use it this way.
- Others require another transpose-like operation.
Other parallel algorithms also exist:
- A very different 1D FFT, due to Edelman, is based on the Fast Multipole algorithm.
- It uses less communication to produce a non-bit-reversed result.

23 Higher-Dimensional FFTs
FFTs on 2 or 3 dimensions are defined as 1D FFTs on the vectors along each dimension; e.g., a 2D FFT does 1D FFTs on all rows and then on all columns.
There are 3 obvious possibilities for the 2D FFT:
(1) a 2D blocked layout for the matrix, using parallel 1D algorithms for each row and column
(2) a block row layout, using serial 1D FFTs on rows, followed by a transpose, then more serial 1D FFTs
(3) a block row layout, using serial 1D FFTs on rows, followed by parallel 1D FFTs on columns
Option 2 is best, if we overlap communication and computation; see the sketch below.
For a 3D FFT the options are similar: 2 phases are done with serial FFTs, followed by a transpose for the 3rd; in practice the communication can be overlapped with the 2nd phase.
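A serial analogue of option 2, as a sketch that reuses the recursive fft() from the divide-and-conquer slide: FFT the rows, transpose, then FFT the rows again (the former columns). In the parallel version the transpose is the only communication step.

    #include <complex.h>

    void fft(double complex *v, int m);   /* the earlier 1D sketch */

    /* 2D FFT of an n x n row-major array A, n a power of two. */
    void fft2d(double complex *A, int n) {
        for (int r = 0; r < n; r++)
            fft(&A[r * n], n);              /* 1D FFTs on all rows */
        for (int r = 0; r < n; r++)         /* in-place transpose */
            for (int c = r + 1; c < n; c++) {
                double complex t = A[r * n + c];
                A[r * n + c] = A[c * n + r];
                A[c * n + r] = t;
            }
        for (int r = 0; r < n; r++)
            fft(&A[r * n], n);              /* 1D FFTs on former columns */
    }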

24 FFTW – Fastest Fourier Transform in the West
Produces an FFT implementation optimized for:
- your version of the FFT (complex, real, …)
- your value of n (arbitrary, possibly prime)
- your architecture
Close to optimal for serial; can be improved for parallel.
Similar in spirit to PHiPAC/ATLAS/Sparsity.
Won the 1999 Wilkinson Prize for Numerical Software.
Widely used for serial FFTs. Version 2 had parallel FFTs, but they are no longer supported: layout constraints from users/apps plus network differences are hard to support.

25 Bisection Bandwidth
The FFT requires one (or more) transpose operations: every processor sends 1/p of its data to each other one.
Bisection bandwidth limits this performance. Bisection bandwidth is the bandwidth across the narrowest part of the network; it is important in global transpose operations, all-to-all, etc.
"Full bisection bandwidth" is expensive:
- The fraction of machine cost spent on the network is increasing
- Fat-tree and full-crossbar topologies may be too expensive, especially on machines with 100K and more processors
- SMP clusters often limit bandwidth at the node level

26 Modified LogGP Model
(Diagram: a message from P0 to P1 with send overhead o_send, network latency L, receive overhead o_recv, and gap g between consecutive sends; no overlap.)
EEL: end-to-end latency (1/2 round trip)
g: minimum time between small message sends
G: additional gap per byte for larger messages

27 Historical Perspective
(Chart: ½ round-trip latency measurements.)
Potential performance advantage for fine-grained, one-sided programs.
Potential productivity advantage for irregular applications.

28 General Observations
The overlap potential is the difference between the gap and the overhead:
- No potential if the CPU is tied up throughout the message send (e.g., no send-side DMA)
- Grows with message size for machines with DMA, because the per-byte cost is handled by the NIC
- Grows with the amount of network congestion, because the gap grows as the network becomes saturated
The remote overhead is 0 for machines with RDMA.

29 GASNet Communications System
GASNet offers put/get communication:
- One-sided: no remote CPU involvement is required in the API (the key difference from MPI). The message contains the remote address, there is no need to match it with a receive, and no implicit ordering is required.
- Used in language runtimes (UPC, etc.)
- Supports fine-grained and bulk transfers
- Split-phase communication
The stack: compiler-generated code runs on a language-specific runtime, which runs on GASNet, which runs on the network hardware.
The core API is useful in many ways: it allows a quick port to a new platform and offers a "working" implementation.

30 Performance of 1-Sided vs 2-sided Communication: GASNet vs MPI
Comparison on Opteron/InfiniBand: GASNet's vapi-conduit versus OSU MPI 0.9.5.
Up to large message sizes (> 256 KB), GASNet provides up to a 2.2X improvement in streaming bandwidth.
The half-power point (N/2) differs by one order of magnitude.

31 GASNet: Performance for mid-range message sizes
(Bandwidth charts; up is good.)
The LAPI and VAPI conduits increase bandwidth relative to MPI and generally meet or exceed what MPI can provide.
GASNet usually reaches saturation bandwidth before MPI (it has fewer costs to amortize) and usually outperforms MPI at medium message sizes, often by a large margin.

32 NAS FT Case Study
The performance of the Exchange (Alltoall) is critical:
- The communication-to-computation ratio increases with faster, more optimized 1D FFTs
- It is determined by the available bisection bandwidth
- It accounts for 30-40% of the application's total runtime
Two ways to reduce the Exchange cost:
1. Use a better network (higher bisection bandwidth)
2. Overlap the all-to-all with computation (where possible): "break up" the exchange
The default NAS FT Fortran/MPI implementation relies on #1; our approach uses UPC/GASNet and builds on #2. (This started as a CS267 project.)
The 1D partition of the 3D grid is a limitation: at most N processors for an N³ grid.
The HPC Challenge benchmark has a large 1D FFT (which can be viewed as 3D or more with the proper roots of unity).

33 3D FFT Operation with Global Exchange
(Diagram: 1D-FFTs on columns; then transpose + 1D-FFTs on rows, with the rows divided among threads; then the Exchange (Alltoall), sending each chunk to its destination thread; then the last 1D-FFT, shown from Thread 0's view.)
A single communication operation (the global exchange) sends THREADS large messages.
Computation and communication occur in separate phases.

34 Communication Strategies for 3D FFT
Three approaches:
- Chunk (all rows with the same destination): wait for the 2nd-dimension FFTs to finish; minimizes the number of messages
- Slab (all rows in a single plane with the same destination): wait for the chunk of rows destined for one processor to finish; overlaps communication with computation
- Pencil (a single row): send each row as it completes; maximizes overlap and matches the natural layout
Joint work with Chris Bell, Rajesh Nishtala, Dan Bonachea.

35 Decomposing NAS FT Exchange into Smaller Messages
Three approaches:
- Chunk: wait for the 2nd-dimension FFTs to finish
- Slab: wait for the chunk of rows destined for one processor to finish
- Pencil: send each row as it completes
Example message size breakdown for Class D (2048 x 1024 x 1024) at 256 processors:
   Exchange (default)               512 KB
   Slabs (set of contiguous rows)    65 KB
   Pencils (single row)              16 KB
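To sanity-check these sizes (assuming 16-byte double-complex elements and a 1D partition of the first dimension, stated here as an assumption): each of the 256 processors owns 2048/256 = 8 planes of 1024 x 1024 elements, or 128 MB, and sends an equal share to each of the 256 destinations, 128 MB / 256 = 512 KB per exchange message. One row (a pencil) is 1024 elements x 16 bytes = 16 KB, and the 1024 rows of a plane split over 256 destinations give 4 rows per destination per plane, 4 x 16 KB = 64 KB per slab, matching the ~65 KB above.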

36 Overlapping Communication
Goal: make use of "all the wires."
- Distributed-memory machines allow asynchronous communication
- The Berkeley non-blocking extensions expose GASNet's non-blocking operations
Approach: break up the all-to-all communication.
- Interleave row computations and row communications, since the 1D FFT is independent across rows
- The decomposition can be into slabs (contiguous sets of rows) or pencils (individual rows)
Pencils allow an earlier start for the communication "phase" and improved local cache use, but produce more, smaller messages (for the same total volume). A sketch of the interleaving follows.
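The lecture's implementation uses UPC with the GASNet non-blocking extensions; purely as an illustration of the same slab interleaving, here is a sketch in MPI terms. All names, the argument layout, and the fft() helper are assumptions, not the actual NAS FT code.

    #include <complex.h>
    #include <mpi.h>

    void fft(double complex *v, int m);   /* the 1D FFT sketched earlier */

    /* Slab-style overlap: as soon as the rows destined for processor d
     * have been FFTed, start their non-blocking send, then compute the
     * next slab while the network moves the previous one. */
    void fft_phase_with_overlap(double complex *data, int nprocs,
                                int rows_per_slab, int row_len,
                                MPI_Request *reqs) {
        for (int d = 0; d < nprocs; d++) {
            double complex *slab = data + (size_t)d * rows_per_slab * row_len;
            for (int r = 0; r < rows_per_slab; r++)
                fft(slab + (size_t)r * row_len, row_len);  /* compute slab d */
            MPI_Isend(slab, rows_per_slab * row_len, MPI_C_DOUBLE_COMPLEX,
                      d, 0, MPI_COMM_WORLD, &reqs[d]);     /* overlap send */
        }
        MPI_Waitall(nprocs, reqs, MPI_STATUSES_IGNORE);    /* drain sends */
    }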

37 NAS FT: UPC Non-blocking MFlops
The Berkeley UPC compiler supports the non-blocking UPC extensions.
They produce a 15-45% speedup over the best blocking UPC version.
The non-blocking version requires about 30 extra lines of UPC code.

38 NAS FT Variants Performance Summary
Shown are the largest classes/configurations possible on each test machine.
MPI is not particularly tuned for many small/medium-size messages in flight (long message-matching queue depths).

39 Pencil/Slab optimizations: UPC vs MPI
Same data, viewed in the context of what MPI is able to overlap: "for the amount of time that MPI spends in communication, how much of that time can UPC effectively overlap with computation?"
On InfiniBand, UPC overlaps almost all of the time MPI spends in communication.
On Elan3, UPC obtains more overlap than MPI as the problem scales up.

40 Summary of Overlap in FFTs
One-sided communication has performance advantages:
- It is a better match for most networking hardware: most cluster networks have RDMA support, and machines with global address space support (X1, Altix) are shown elsewhere
- Smaller messages may make better use of the network: they spread communication over a longer period of time and postpone the bisection bandwidth pain
- Smaller messages can also prevent cache thrashing during packing: packing overheads are avoided when the natural message size is reasonable

41 FFTW: the "Fastest Fourier Transform in the West"
• C library for real & complex FFTs (arbitrary size/dimensionality), plus parallel versions for threads & MPI
• Computational kernels (80% of the code) are automatically generated
• Self-optimizes for your hardware (picks the best composition of steps) = portability + performance
• Free software

42 FFTW performance: power-of-two sizes, double precision
(Benchmark charts on an 833 MHz Alpha EV6, a 2 GHz PowerPC G5, a 2 GHz AMD Opteron, and a 500 MHz UltraSPARC IIe.)

43 FFTW performance: non-power-of-two sizes, double precision
(Benchmark charts on an 833 MHz Alpha EV6 and a 2 GHz AMD Opteron.)
Unusual: non-power-of-two sizes receive as much optimization as powers of two … because we let the code do the optimizing.

44 FFTW performance: double precision, 2.8 GHz Pentium IV, 2-way SIMD (SSE2)
(Charts for powers of two and non-powers-of-two.)
Exploiting CPU-specific SIMD instructions (rewriting the code) is easy … because we let the code write itself.

45 Why is FFTW fast? three unusual features
FFTW implements many FFT algorithms, and has three unusual features:
1. A planner picks the best composition by measuring the speed of different combinations.
2. The resulting plan is executed with explicit recursion, which enhances locality.
3. The base cases of the recursion are codelets: highly optimized dense code automatically generated by a special-purpose "compiler".

46 FFTW is easy to use
Key fact: usually, many transforms of the same size are required, so the one-time cost of planning is amortized. Using the actual FFTW3 names (the slide abbreviates them):

    #include <fftw3.h>

    fftw_complex *x = fftw_malloc(n * sizeof(fftw_complex));
    fftw_plan p = fftw_plan_dft_1d(n, x, x, FFTW_FORWARD, FFTW_MEASURE);
    /* ... fill x (FFTW_MEASURE may overwrite it during planning) ... */
    fftw_execute(p);        /* repeat as needed */
    fftw_destroy_plan(p);
    fftw_free(x);

47 Why is FFTW fast? three unusual features
FFTW implements many FFT algorithms, and has three unusual features:
1. A planner picks the best composition by measuring the speed of different combinations.
2. The resulting plan is executed with explicit recursion, which enhances locality.
3. The base cases of the recursion are codelets: highly optimized dense code automatically generated by a special-purpose "compiler".

48 FFTW Uses Natural Recursion
(Diagram: a size-8 DFT with radix p = 2 splits into two size-4 DFTs, which split into four size-2 DFTs.)
But the traditional implementation is non-recursive: a breadth-first traversal making log2(n) passes over the whole array.

49 Traditional cache solution: Blocking
(Same size-8 DFT diagram, traversed breadth-first but in blocks of size = cache.)
Breadth-first, but with blocks sized to the cache … this requires a program specialized for the cache size.

50 Recursive Divide & Conquer is Good
(Same size-8 DFT diagram, traversed depth-first.) [Singleton, 1967]
The subproblems eventually become small enough to fit in the cache … no matter what size the cache is.

51 Cache Obliviousness
• A cache-oblivious algorithm does not know the cache size; it can be optimal for any machine and for all levels of cache simultaneously.
• Cache-oblivious algorithms exist for many other problems, too [Frigo et al. 1999], all via the recursive divide & conquer approach.

52 Why is FFTW fast? three unusual features
FFTW implements many FFT algorithms, and has three unusual features:
1. A planner picks the best composition by measuring the speed of different combinations.
2. The resulting plan is executed with explicit recursion, which enhances locality.
3. The base cases of the recursion are codelets: highly optimized dense code automatically generated by a special-purpose "compiler".

