Third French-Japanese PAAP Workshop, 2009/4/21
A Volumetric 3-D FFT on Clusters of Multi-Core Processors
Daisuke Takahashi
University of Tsukuba, Japan
Outline
–Background
–Objectives
–Approach
–3-D FFT Algorithm
–Volumetric 3-D FFT Algorithm
–Performance Results
–Conclusion
Background
The fast Fourier transform (FFT) is an algorithm widely used today in science and engineering.
Parallel 3-D FFT algorithms on distributed-memory parallel computers have been well studied.
November 2008 TOP500 Supercomputing Sites:
–Roadrunner: 1,105.00 TFlops (129,600 cores)
–Jaguar (Cray XT5 QC 2.3 GHz): 1,059.00 TFlops (150,152 cores)
The number of cores per system keeps increasing.
Background (cont'd)
A typical decomposition for performing a parallel 3-D FFT is slabwise.
–A 3-D array of size n1 x n2 x n3 is distributed along the third dimension.
–n3 must be greater than or equal to the number of MPI processes.
This becomes an issue at very large node counts on a massively parallel cluster of multi-core processors.
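As a minimal sketch of this scalability constraint (function names are hypothetical, not from the original slides), the maximum usable MPI process count under each decomposition can be written down directly:

```python
def max_procs_slab(n1, n2, n3):
    # Slab (1-D) decomposition along the third dimension:
    # at most n3 processes can each own a non-empty slab.
    return n3

def max_procs_2d(n1, n2, n3):
    # 2-D (volumetric) decomposition along the second and third
    # dimensions: up to n2 * n3 processes can own a non-empty pencil.
    return n2 * n3

# For a 256 x 256 x 256 FFT, a slab decomposition caps out at 256
# processes, while a 2-D decomposition can use up to 65,536.
```

This is why the slab approach alone cannot exploit thousands of cores for moderate problem sizes.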
Related Work
Scalable framework for 3-D FFTs on the Blue Gene/L supercomputer [Eleftheriou et al. 03, 05]
–Based on a volumetric decomposition of data.
–Scales well up to 1,024 nodes for 3-D FFTs of size 128 x 128 x 128.
3-D FFT on the 6-D network torus QCDOC parallel supercomputer [Fang et al. 07]
–3-D FFTs of size 128 x 128 x 128 scale well on QCDOC up to 4,096 nodes.
Objectives
Implementation and evaluation of a highly scalable 3-D FFT on a massively parallel cluster of multi-core processors.
Reduce the communication time for larger numbers of MPI processes.
A comparison between the 1-D and 2-D distributions for the 3-D FFT.
Approach
Some previously presented volumetric 3-D FFT algorithms [Eleftheriou et al. 03, 05; Fang et al. 07] use a 3-D distribution for the 3-D FFT.
–These schemes require three all-to-all communications.
We use a 2-D distribution for the volumetric 3-D FFT.
–It requires only two all-to-all communications.
3-D FFT
The 3-D discrete Fourier transform (DFT) of an n1 x n2 x n3 array x is given by

y(k1, k2, k3) = Σ_{j3=0}^{n3−1} Σ_{j2=0}^{n2−1} Σ_{j1=0}^{n1−1} x(j1, j2, j3) ω_{n1}^{j1 k1} ω_{n2}^{j2 k2} ω_{n3}^{j3 k3},

where ω_n = e^{−2πi/n}.
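A direct, unoptimized evaluation of this triple sum (a sketch for checking correctness, not the algorithm used in the talk) can be written with one 1-D DFT matrix per axis and verified against a library FFT:

```python
import numpy as np

def dft3d_direct(x):
    """Evaluate the 3-D DFT triple sum directly via per-axis DFT matrices."""
    n1, n2, n3 = x.shape
    # W[i, a] = omega_n^{i*a} = exp(-2*pi*1j*i*a/n) for each axis length.
    W1 = np.exp(-2j * np.pi * np.outer(np.arange(n1), np.arange(n1)) / n1)
    W2 = np.exp(-2j * np.pi * np.outer(np.arange(n2), np.arange(n2)) / n2)
    W3 = np.exp(-2j * np.pi * np.outer(np.arange(n3), np.arange(n3)) / n3)
    # y[i,j,k] = sum_{a,b,c} W1[i,a] W2[j,b] W3[k,c] x[a,b,c]
    return np.einsum('ia,jb,kc,abc->ijk', W1, W2, W3, x)
```

The direct sum costs O(n^2) operations per axis, which is exactly what the FFT reduces to O(n log n); it is useful only as a reference implementation.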
1-D distribution along z-axis (with a slab decomposition)
1. FFTs in the x-axis
2. FFTs in the y-axis
3. FFTs in the z-axis
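The three steps above can be simulated in a single address space (a sketch: numpy arrays stand in for per-process slabs, and a concatenate/re-split stands in for the all-to-all exchange):

```python
import numpy as np

def fft3d_slab(x, p):
    """Simulate a slab-decomposed 3-D FFT with p 'processes'."""
    n1, n2, n3 = x.shape
    assert n3 % p == 0 and n2 % p == 0
    # Each process owns one slab x[:, :, lo:hi] (1-D distribution along z).
    slabs = [x[:, :, r * (n3 // p):(r + 1) * (n3 // p)].copy()
             for r in range(p)]
    # Steps 1-2: FFTs along x and y are local to each slab.
    slabs = [np.fft.fft(np.fft.fft(s, axis=0), axis=1) for s in slabs]
    # All-to-all: redistribute so each process owns full z-lines
    # (the array is now distributed along y instead of z).
    full = np.concatenate(slabs, axis=2)
    slabs_y = [full[:, r * (n2 // p):(r + 1) * (n2 // p), :]
               for r in range(p)]
    # Step 3: FFTs along z, local after the exchange.
    slabs_y = [np.fft.fft(s, axis=2) for s in slabs_y]
    return np.concatenate(slabs_y, axis=1)
```

The single global exchange in the middle is the one all-to-all communication the slab scheme needs; it is also where the p ≤ n3 limit comes from.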
2-D distribution along y- and z-axes (with a volumetric domain decomposition)
1. FFTs in the x-axis
2. FFTs in the y-axis
3. FFTs in the z-axis
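The 2-D (volumetric) version can be simulated the same way (a sketch: a dict keyed by the py x pz process grid stands in for the processes, and two block exchanges stand in for the two all-to-all communications):

```python
import numpy as np

def fft3d_2d(x, py, pz):
    """Simulate a 3-D FFT with a 2-D decomposition on a py x pz grid."""
    n1, n2, n3 = x.shape
    assert n1 % py == 0 and n2 % py == 0 and n2 % pz == 0 and n3 % pz == 0
    # Process (q, r) owns x[:, q-th y-block, r-th z-block]; x-lines local.
    b = {(q, r): x[:, q * (n2 // py):(q + 1) * (n2 // py),
                   r * (n3 // pz):(r + 1) * (n3 // pz)].copy()
         for q in range(py) for r in range(pz)}
    # Step 1: FFTs along x are local.
    for k in b:
        b[k] = np.fft.fft(b[k], axis=0)
    # First all-to-all (among the py processes of each z-column):
    # trade y-blocks for x-blocks so every process holds full y-lines.
    for r in range(pz):
        col = np.concatenate([b[q, r] for q in range(py)], axis=1)
        for q in range(py):
            b[q, r] = col[q * (n1 // py):(q + 1) * (n1 // py), :, :]
    # Step 2: FFTs along y are now local.
    for k in b:
        b[k] = np.fft.fft(b[k], axis=1)
    # Second all-to-all (among the pz processes of each row):
    # trade z-blocks for y-blocks so every process holds full z-lines.
    for q in range(py):
        row = np.concatenate([b[q, r] for r in range(pz)], axis=2)
        for r in range(pz):
            b[q, r] = row[:, r * (n2 // pz):(r + 1) * (n2 // pz), :]
    # Step 3: FFTs along z are local.
    for k in b:
        b[k] = np.fft.fft(b[k], axis=2)
    # Reassemble (result is split by py along x and by pz along y).
    rows = [np.concatenate([b[q, r] for r in range(pz)], axis=1)
            for q in range(py)]
    return np.concatenate(rows, axis=0)
```

Note that each exchange involves only one row or column of the process grid, not all py * pz processes at once; this is the key to the latency savings discussed next.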
Communication time of 1-D distribution
Let us assume, for an n-point FFT (16-byte complex elements):
–Latency of communication: L (sec)
–Bandwidth: W (Byte/s)
–The number of processors: p
One all-to-all communication.
Communication time of 1-D distribution:
T_1D = (p − 1)(L + 16n / (p^2 W)) (sec)
Communication time of 2-D distribution
Two all-to-all communications (for p = P_y x P_z processors):
–P_z simultaneous all-to-all communications among the P_y processors in the y-axis.
–P_y simultaneous all-to-all communications among the P_z processors in the z-axis.
Communication time of 2-D distribution (for a square grid P_y = P_z = √p):
T_2D = 2(√p − 1)(L + 16n / (p^{3/2} W)) (sec)
Comparing communication time
Communication time of 1-D distribution: T_1D = (p − 1)(L + 16n / (p^2 W))
Communication time of 2-D distribution: T_2D = 2(√p − 1)(L + 16n / (p^{3/2} W))
Comparing the two equations: the latency term grows as O(p) for the 1-D distribution but only as O(√p) for the 2-D distribution, while the 2-D distribution moves about twice the data. Hence the communication time of the 2-D distribution is less than that of the 1-D distribution for larger numbers of processors, where latency dominates.
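These two cost models are easy to evaluate numerically (a sketch: the formulas model one all-to-all of p − 1 messages of 16n/p^2 bytes versus two all-to-alls over a √p x √p grid; the bandwidth value below is an assumed figure, while the 1.2 µs latency is the HCA's quoted MPI latency):

```python
def t_1d(n, p, L, W):
    # One all-to-all among p processes:
    # p - 1 messages of 16n / p^2 bytes each.
    return (p - 1) * (L + 16 * n / (p ** 2 * W))

def t_2d(n, p, L, W):
    # Two all-to-alls, each among sqrt(p) processes (square grid assumed):
    # each is sqrt(p) - 1 messages of 16n / p^{3/2} bytes.
    s = int(round(p ** 0.5))
    return 2 * (s - 1) * (L + 16 * n / (p * s * W))
```

With, say, n = 2^24, L = 1.2e-6 s and an assumed W = 2.5e9 Byte/s, the 1-D model wins at small p (half the data volume) while the 2-D model wins at large p (far fewer latency-bound messages), matching the comparison above.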
Performance Results
To evaluate parallel 3-D FFTs, we compared the 1-D and 2-D distributions for FFTs of various sizes on 1 to 4,096 cores.
Target parallel machine:
–T2K-Tsukuba system (256 nodes, 4,096 cores used).
–The flat MPI programming model was used.
–MVAPICH 1.2.0 was used as the communication library.
–The compiler used was Intel Fortran compiler 10.1.
T2K-Tsukuba System
Specification:
–Number of nodes: 648 (Appro Xtreme-X3 Server)
–Theoretical peak performance: 95.4 TFlops
–Node configuration: 4 sockets of quad-core AMD Opteron 8356 (Barcelona, 2.3 GHz)
–Total main memory size: 20 TB
–Network interface: DDR InfiniBand Mellanox ConnectX HCA x 4
–Network topology: Fat Tree
–Full-bisection bandwidth: 5.18 TB/s
Computation Node of T2K-Tsukuba
[Block diagram: four quad-core Opteron sockets linked by HyperTransport at 8 GB/s full-duplex, each socket with 2 GB 667 MHz DDR2 DIMM x 4; NVIDIA nForce 3050/3600 bridges providing PCI-Express x16/x8 and PCI-X; two Mellanox MHGH28-XTC ConnectX HCAs (1.2 µs MPI latency, 4X DDR 20 Gb/s) attached over 4 GB/s full-duplex links.]
Discussion (1/2)
For the smallest FFT size, we can clearly see that communication overhead dominates the execution time.
–In this case, the total working set size is only 1 MB.
On the other hand, the 2-D distribution scales well up to 4,096 cores for the largest FFT size.
–Performance on 4,096 cores is over 401 GFlops, about 1.1% of theoretical peak.
–Excluding the all-to-all communications, performance is over 10 TFlops, about 26.7% of theoretical peak.
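The quoted efficiency figures can be sanity-checked against the machine's specification (a sketch, assuming the Barcelona core sustains 4 floating-point operations per cycle at its 2.3 GHz clock):

```python
cores = 4096
flops_per_cycle = 4            # assumed for quad-core Opteron 8356
peak = cores * 2.3e9 * flops_per_cycle   # theoretical peak of 4,096 cores

frac_total = 401e9 / peak      # whole-run efficiency
frac_compute = 10e12 / peak    # efficiency excluding all-to-all time
```

This gives a peak of roughly 37.7 TFlops for the 4,096 cores used, so 401 GFlops is about 1.1% of peak and 10 TFlops is in the neighborhood of the quoted 26.7%, confirming that nearly all the lost efficiency sits in the all-to-all communication.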
Discussion (2/2)
For smaller numbers of cores, the performance of the 1-D distribution is better than that of the 2-D distribution.
–This is because the total communication volume of the 1-D distribution is half that of the 2-D distribution.
However, for larger numbers of cores, the performance of the 2-D distribution is better than that of the 1-D distribution because of its lower latency cost.
Conclusions
We implemented a volumetric parallel 3-D FFT on clusters of multi-core processors.
We showed that a 2-D distribution effectively improves performance by reducing the communication time for larger numbers of MPI processes.
The proposed volumetric parallel 3-D FFT algorithm is most advantageous on massively parallel clusters of multi-core processors.
We achieved performance of over 401 GFlops on the T2K-Tsukuba system with 4,096 cores.