Performance Evaluations of Finite Difference Applications Realized on a Single Flux Quantum Circuits-Based Reconfigurable Accelerator

Hiroaki Honda (1), Farhad Mehdipour (2), Hiroshi Kataoka (1), Koji Inoue (1), and Kazuaki J. Murakami (1)
(1) Department of Advanced Information Technology, Kyushu University, Fukuoka, Japan
(2) Center for Japan-Egypt Cooperation in Science and Technology, Kyushu University, Fukuoka, Japan

Agenda
- Introduction
  - Single-flux quantum (SFQ) circuit
  - SFQ-reconfigurable data-path (RDP) processor
  - Objective
- Implementing an Application on SFQ-RDP
  - Tool chain
  - Code modification
  - DFG extraction and mapping
- Performance Evaluation
  - Comparison with GPU and GPP results
- Conclusions

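Top500 Supercomputer Ranking and Projection
- 1 ExaFlop/s (= 10^9 GFlop/s) may be attained around 2019, and 10 ExaFlop/s around 2022, that is, within only the next ten years.
- The PetaFlop/s (= 10^6 GFlop/s) era began in 2009; the projection implies a 1000-fold speedup in 10 years.
- [Figure: Top500 performance projection, annotated at 1 EFlop/s and at 10 EFlop/s (2022)]
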
Energy Consumption Estimation for Floating Point Units (FPUs)
- Power per FPU (2 GHz) is larger than 10 mW (CMOS, ~8 nm in ~2019) 1)
- Power per 1 GFlop/s is larger than 5 mW
- Energy consumption of the FPUs alone for a 10 ExaFlop/s system is larger than 5 mW * 10 * 10^9 = 50 MW (1 ExaFlop/s = 10^9 GFlop/s)
- Memory, network, storage, and other components consume additional power
- Constructing a 10 ExaFlop/s supercomputer system from CMOS circuit processors is extremely power consuming
1) p. 178

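The 50 MW figure above follows from a unit conversion, which the short sketch below reproduces. The function name and its parameters are illustrative, not taken from the slides; the only inputs are the 5 mW per GFlop/s lower bound and the 10 ExaFlop/s target quoted on this slide.

```python
# Lower-bound FPU power for an exascale system, using the slide's figures.
# Names here are illustrative assumptions, not part of the original deck.

GFLOPS_PER_EXAFLOPS = 10**9  # 1 ExaFlop/s = 10^9 GFlop/s (as stated on the slide)


def fpu_power_megawatts(exaflops, milliwatts_per_gflops=5.0):
    """Return the FPU-only power in MW for a system of the given size."""
    gflops = exaflops * GFLOPS_PER_EXAFLOPS
    milliwatts = gflops * milliwatts_per_gflops
    return milliwatts / 1e9  # 1 MW = 10^9 mW


print(fpu_power_megawatts(10))  # 50.0
```

The estimate covers floating-point units only; memory, network, and storage power would come on top of it.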
Single-Flux Quantum (SFQ) Circuit
- [Figure: superconducting loop with a Josephson junction; SFQ pulse (quantized magnetic flux), ~1 mV, 2~3 ps]
- Pulse logic: bit-serial/bit-slice description for 32/64 bits
Advantages
- Ultra-high-speed switching
- Ultra-low power
- No cost for latches
- Suitable for pipeline processing
- x 10~100 faster operation, x ~1/10 energy consumption
Disadvantages
- Difficult to implement feedback loops and conditional branches
- No practical SFQ memory

Single-Flux Quantum-Reconfigurable Data Path (SFQ-RDP) Computer
- Large-scale two-dimensional floating-point unit array, data-path architecture
- Reconfigurable Operand Routing Network (ORN)
- No on-chip memory
- Dynamically reconfigurable PEs and ORNs
- Data flow is unidirectional: no feedback loops
- Minimal amount of control circuits
- 2 ports / 1 port for input / output data accesses
- ~2.5 TFLOPS/chip
- PE: one FPU and data-through units
- ORN: network connecting PEs to PEs

CREST-JST SFQ-RDP Project (2006~): A Low-Power, High-Performance Reconfigurable Processor Based on Single-Flux Quantum Circuits
Goals:
- Discovering appropriate computation-intensive scientific applications
- Developing compiler tools
- Developing performance evaluation tools
- Designing the SFQ-LSRDP architecture
Research groups:
- Yokohama National Univ.: SFQ-FPU chip, cell library
- Kyushu Univ.: architecture, compiler, and applications
- Nagoya Univ.: SFQ-RDP chip, cell library, and wiring
- Nagoya Univ.: CAD for logic design and arithmetic circuits
- Superconducting Research Lab. (SRL): SFQ process

Prototype 2x3 SFQ-RDP Processor and SFQ-MUL FPU
2x3 SFQ-RDP processor 1)
- 8-bit ALUs implementing: ADD, SUB, AND, OR, XOR
- Frequency: 25 GHz
- Process: 2 µm
- Area: 6.84 x 6.72 mm^2
- Power: 4.1 mW
SFQ floating-point multiplier 2)
- …-bit FPUs: adder, multiplier
- MUL frequency: 32 GHz
- Performance: 2.6 GFLOPS
- Number of junctions: … JJs
- Power consumption: 3.5 mW
- Circuit area: 6.22 x 3.78 mm^2
1) Fujimaki, et al., "Demonstration of an SFQ-Based Accelerator Prototype for a High-Performance Computer," ASC08.
2) H. Hara, et al., "Design and Implementation of SFQ Half-Precision Floating-Point Multipliers," ASC08.

Objectives
- Evaluate performance by implementing practical applications, and show that the SFQ-RDP computer system can compute them efficiently
  - Applications: 2D-Diffusion, 2D Finite-Difference Time-Domain (2D-FDTD)
  - Comparison of execution times with GPP and GPU
Points
- 2D-FPU array, data-flow architecture: Data Flow Graphs (DFGs) are extracted from the applications and mapped onto the SFQ-RDP
- Compiler tools: compiler tools have to be developed
- No on-chip memory: DMA transfer from DRAM has to be fully used to avoid random accesses
- Dynamically reconfigurable PEs and ORNs: a one-time reconfiguration is enough for both the Diffusion and FDTD applications

Tool Chain for Implementation of an Application on SFQ-RDP
- Application: C/Fortran code
- Code modification using the SFQ-RDP API -> modified code
- Compiler developed for SFQ-RDP -> object code (GPP)
- Data Flow Graph (DFG) extraction (semi-manual) -> extracted DFG
- Placement and routing tool -> RDP configuration file (SFQ-RDP)
- Inputs: RDP library file (function definitions and declarations), RDP architecture description
- The tool chain is almost complete

Implementing an Application on SFQ-RDP: 2D Diffusion
- Basic Finite Difference Method (FDM) formula
- Time development calculation by FDM: the value at time step n+1 is calculated from points at time step n
- [Figure: FDM grid with x-axis (space, index i), y-axis (space, index j), and n-axis (time, index n)]
- In/Out: 5 / 1, Ops: 7

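The FDM formula on this slide did not survive extraction. As a reference sketch, the update below restates the formula used in the code on the next slide. The continuous equation and the expressions for the coefficients are assumptions: they are the standard explicit (forward-time, centered-space) discretization with a constant diffusion coefficient D, with i taken along x and j along y, and are not confirmed by the slides.

```latex
% Assumed continuous problem (standard 2D diffusion, constant D):
\frac{\partial f}{\partial t}
  = D \left( \frac{\partial^2 f}{\partial x^2}
           + \frac{\partial^2 f}{\partial y^2} \right)

% Update formula as used in the deck's code (n -> n+1):
f^{(n+1)}_{i,j}
  = C_0 \left( f^{(n)}_{i-1,j} + f^{(n)}_{i+1,j} \right)
  + C_1 \left( f^{(n)}_{i,j-1} + f^{(n)}_{i,j+1} \right)
  + C_2 \, f^{(n)}_{i,j}

% Coefficients under the assumed explicit scheme:
C_0 = \frac{D \, \Delta t}{\Delta x^2}, \qquad
C_1 = \frac{D \, \Delta t}{\Delta y^2}, \qquad
C_2 = 1 - 2 C_0 - 2 C_1
```

Counting terms in the update gives 5 input values, 1 output value, and 7 operations (4 additions and 3 multiplications), which matches the In/Out and Ops figures on the slide.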
Code Implementation and Modification for SFQ-RDP

Original Code for GPP (n ⇒ n+1):
    loop n
      loop i, j
        f(n+1)[i,j] = C0 * ( f(n)[i-1,j] + f(n)[i+1,j] )
                    + C1 * ( f(n)[i,j-1] + f(n)[i,j+1] )
                    + C2 * f(n)[i,j]
    end
  Extracted DFG: In/Out 5 / 1, Ops 7, Byte/Flop 3.42

Unrolled Loop Code for SFQ-RDP (n ⇒ n+1), 9 formulas in the loop body:
    loop n
      loop i, j, (+3, +3)
        f(n+1)[i,j]     = C0 * ( f(n)[i-1,j] + f(n)[i+1,j] )
                        + C1 * ( f(n)[i,j-1] + f(n)[i,j+1] )
                        + C2 * f(n)[i,j]
        f(n+1)[i+1,j]   = C0 * ( f(n)[i,j] + f(n)[i+2,j] )
                        + C1 * ( f(n)[i+1,j-1] + f(n)[i+1,j+1] )
                        + C2 * f(n)[i+1,j]
        f(n+1)[i+2,j]   = …
        …
        f(n+1)[i+2,j+2] = …
    end
  DFG extraction: In/Out 21 / 9, Ops 7 * 9, Byte/Flop 1.9

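The following is a minimal Python sketch of the two loop forms above, written to check that the 3x3 unrolled body computes the same values as the original loop and that one unrolled body reads 21 distinct input points. It models only the arithmetic, not the SFQ-RDP API or the compiler; all function names are illustrative, and copying boundary values unchanged is an assumption made to keep the example small.

```python
def update_point(f, i, j, c0, c1, c2):
    """5-point update for a single grid point (7 floating-point operations)."""
    return (c0 * (f[i - 1][j] + f[i + 1][j])
            + c1 * (f[i][j - 1] + f[i][j + 1])
            + c2 * f[i][j])


def step_original(f, c0, c1, c2):
    """One time step (n -> n+1), one output per loop iteration."""
    ni, nj = len(f), len(f[0])
    g = [row[:] for row in f]  # boundary values are copied unchanged
    for i in range(1, ni - 1):
        for j in range(1, nj - 1):
            g[i][j] = update_point(f, i, j, c0, c1, c2)
    return g


def step_unrolled_3x3(f, c0, c1, c2):
    """Same time step with stride (+3, +3): 9 outputs per loop body."""
    ni, nj = len(f), len(f[0])
    if (ni - 2) % 3 or (nj - 2) % 3:
        raise ValueError("interior size must be a multiple of 3 in each axis")
    g = [row[:] for row in f]
    for i in range(1, ni - 1, 3):
        for j in range(1, nj - 1, 3):
            # The 9 formulas of the unrolled body, written as a small loop.
            for di in range(3):
                for dj in range(3):
                    g[i + di][j + dj] = update_point(f, i + di, j + dj,
                                                     c0, c1, c2)
    return g


def block_input_points(i, j):
    """Distinct f(n) points read by the 3x3 block anchored at (i, j)."""
    points = set()
    for di in range(3):
        for dj in range(3):
            a, b = i + di, j + dj
            points.update({(a, b), (a - 1, b), (a + 1, b),
                           (a, b - 1), (a, b + 1)})
    return points


grid = [[float(r * 8 + c) for c in range(8)] for r in range(8)]
same = step_original(grid, 0.2, 0.2, 0.2) == step_unrolled_3x3(grid, 0.2, 0.2, 0.2)
print(same, len(block_input_points(1, 1)))  # True 21
```

Unrolling shares neighbouring inputs between the 9 formulas, so the body needs 21 inputs rather than 9 * 5 = 45, which is where the lower Byte/Flop value comes from.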
Mapping Extracted DFG onto SFQ-RDP
- [Figure: extracted DFG -> placement and routing -> DFG mapping result (RDP configuration data)]

Improving Data Access Efficiency: Data Structure Conversion for DMA Transfer
- The unrolled loop includes 21 inputs and 9 outputs per calculation; reading them directly from f[i,j] results in random memory accesses
- Data structure conversion: all two-dimensional f[i,j] values are divided and stored as two one-dimensional arrays, A[] and B[]
- 15 (A) + 15 (B) input data are accessed via the two input ports; 9 output data are accessed
- Sequential memory accesses: DMA transfer becomes possible
- [Figure: f[i,j] -> A[i], B[i], with double buffering]

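The slides do not say how f[i,j] is divided between A[] and B[]. The sketch below shows one hypothetical layout that happens to give the same 15 + 15 input count: for each 3x3 output block, A holds a 5-row by 3-column strip (the block plus its upper and lower neighbours) and B holds a 3-row by 5-column strip (the block plus its left and right neighbours). The function names and the layout itself are assumptions for illustration only.

```python
def pack_block(f, i, j):
    """Return the two 15-value streams for the 3x3 block anchored at (i, j)."""
    a = [f[r][c] for r in range(i - 1, i + 4) for c in range(j, j + 3)]
    b = [f[r][c] for r in range(i, i + 3) for c in range(j - 1, j + 4)]
    return a, b


def pack_grid(f):
    """Concatenate per-block streams so A and B are read strictly in order."""
    ni, nj = len(f), len(f[0])
    if (ni - 2) % 3 or (nj - 2) % 3:
        raise ValueError("interior size must be a multiple of 3 in each axis")
    stream_a, stream_b = [], []
    for i in range(1, ni - 1, 3):
        for j in range(1, nj - 1, 3):
            a, b = pack_block(f, i, j)
            stream_a.extend(a)
            stream_b.extend(b)
    return stream_a, stream_b


def block_update(a, b, c0, c1, c2):
    """Compute the 9 outputs of one block from its two 15-value streams."""
    out = []
    for di in range(3):
        for dj in range(3):
            up = a[3 * di + dj]
            centre = a[3 * (di + 1) + dj]
            down = a[3 * (di + 2) + dj]
            left = b[5 * di + dj]
            right = b[5 * di + dj + 2]
            out.append(c0 * (up + down) + c1 * (left + right) + c2 * centre)
    return out


grid = [[float(r * 5 + c) for c in range(5)] for r in range(5)]
A, B = pack_grid(grid)
print(len(A), len(B), len(block_update(A, B, 0.2, 0.2, 0.2)))  # 15 15 9
```

In this layout each block reads only consecutive elements of A and B, so both input ports can be fed by sequential transfers, at the cost of storing the 9 block values twice.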
Performance Evaluation
- GPP: simulation by a cycle-accurate processor simulator
- SFQ-RDP: performance evaluation modeling (estimation of execution times)
System Architecture
- [Figure: system architecture; the SFQ-RDP has 2 input ports / 1 output port]
System Configuration
  GPP   Processor type             Out-of-order
        Frequency                  3.2 GHz
        Instruction issue width    4 inst./CC
        L1 data cache              64 KB
        L2 unified cache           4 MB
        Latency of main memory     300 CC
  RDP   Frequency (SFQ-RDP)        80 GHz
        Reconfiguration latency    … CC
        Main memory bandwidth *    141.7, … GB/s
        No. of PEs in a row        22
        No. of PEs in a column     15
  * Bandwidth numbers are based on those used for the GPU calculations

Results of Performance Evaluation
  Columns: application, SFQ-RDP (GFlop/s), GPU (GFlop/s), Ratio (by GPU), Ratio (by GPP)
  2D-Diffusion    … )
  2D-FDTD         … )
  …D-Diffusion    … ) ---
  1D-Vibration    … ) ---
- Results are comparable to the GPU
- The SFQ-RDP processor, which is implemented with superconducting circuits and a simple 2D-array architecture, can be used as an efficient accelerator
1) T. Aoki, et al., "CUDA programming primer" (in Japanese), Kougakusya, ISBN-10: …
2) N. Takada, et al., "Speeding up of FDTD finite difference calculations by efficient use of GPU and shared memory" (in Japanese), Proceedings of Forum of Information Science and Technology.
3) H. Kataoka, et al., "Reducing Preprocessing Overhead Times in a Reconfigurable Accelerator of Finite Difference Applications," SAAHPC 10, Jul.

Why Can We Achieve Comparable Results?
  Columns: # of operations, # of I/O 1), Byte/Flop, estimated GFlop/s 2) (max. BW 159.0 GB/s)
RDP calc.
- Original formula (1 output): 7 operations; 5+1 = 6 I/O; 6*4/7 = 3.42 Byte/Flop; ~4.7 GFlop/s (random access: ~16 GB/s)
- Unrolled loop formula (9 outputs): 7 * 9 = 63 operations; 21+9 = 30 I/O; 30 * 4 / 63 = 1.90 Byte/Flop; ~8.4 GFlop/s (random access: ~16 GB/s)
- Data structure conversion for DMA transfer: 7 * 9 = 63 operations; … I/O; … * 4 / 63 = … Byte/Flop; … GFlop/s (DMA: ~159.0 GB/s)
- With GPP calc., comm., and other overheads: 50.6 (DMA: ~159.0 GB/s)
GPU calc.
- Aoki et al. 3) 4): …
1) Based on the utilization of HW for rearrangement of input data
2) Single-precision calculation, BW 159.0 GB/s, GeForce GTX 285
3) GeForce GTX 285, 1 proc. calculation (1024x1204 mesh)
4) T. Aoki, et al., "CUDA programming primer" (in Japanese), Kougakusya, ISBN-10: …

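The first two rows of the table can be reproduced with a simple bandwidth-bound model: Byte/Flop is the number of 4-byte single-precision words moved per operation, and the estimated rate is the memory bandwidth divided by Byte/Flop. The sketch below assumes that model (it is inferred from the numbers on the slide, not stated there) and uses illustrative function names. The DMA row is left out because its I/O count is not recoverable from the slide.

```python
BYTES_PER_WORD = 4  # single-precision values, as in footnote 2)


def bytes_per_flop(n_inputs, n_outputs, n_ops):
    """Memory traffic per floating-point operation for one loop body."""
    return (n_inputs + n_outputs) * BYTES_PER_WORD / n_ops


def bandwidth_bound_gflops(bandwidth_gb_per_s, byte_per_flop):
    """Upper bound on GFlop/s when memory bandwidth is the only limit."""
    return bandwidth_gb_per_s / byte_per_flop


RANDOM_ACCESS_BW = 16.0  # GB/s, the slide's figure for random access

original = bytes_per_flop(5, 1, 7)        # 1 output per body
unrolled = bytes_per_flop(21, 9, 7 * 9)   # 9 outputs per body

# The slide lists 3.42 for the first ratio; 24/7 is 3.4286 before truncation.
print(round(original, 2), round(bandwidth_bound_gflops(RANDOM_ACCESS_BW, original), 1))  # 3.43 4.7
print(round(unrolled, 2), round(bandwidth_bound_gflops(RANDOM_ACCESS_BW, unrolled), 1))  # 1.9 8.4
```

Under this model, unrolling helps by lowering Byte/Flop, while the data structure conversion helps by raising the usable bandwidth from the random-access figure toward the DMA figure.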
Conclusions and Future Works
Conclusions
- A Single-Flux Quantum Reconfigurable Data-Path (SFQ-RDP) processor with a two-dimensional floating-point array architecture, implemented with superconducting circuits, was introduced.
- Two-dimensional Heat (2D-Heat) and two-dimensional Finite Difference Time Domain (2D-FDTD) applications were implemented on the SFQ-RDP, and their performance was evaluated.
- For 2D-Heat and 2D-FDTD, computation 50.6 and 79.0 times faster, respectively, than on a general-purpose processor was achievable; these performance values are comparable to reported GPU results.
- The SFQ-RDP accelerator can be used for practical scientific calculations, especially those based on finite difference methods.
Future Works
- Implementation and performance evaluation of other applications

Other SFQ-RDP research members
- CAD for logic design and arithmetic circuits: Prof. N. Takagi (Leader), Prof. K. Takagi (Kyoto Univ.)
- SFQ-RDP chip, cell library, and wiring: Prof. A. Fujimaki, Prof. H. Akaike, Prof. M. Tanaka (Nagoya Univ.)
- SFQ-FPU chip, cell library: Prof. N. Yoshikawa (Yokohama National Univ.)
- SFQ process: Dr. S. Nagasawa, Dr. M. Hidaka (SRL)
Acknowledgement
This research was supported by Core Research for Evolutional Science and Technology (CREST) of Japan Science and Technology Corporation (JST).