Kyushu University KL, Malaysia Hardware and Software Requirements for Implementing a High-Performance Superconductivity Circuits-Based Accelerator Farhad Mehdipour, Hiroaki Honda, Hiroshi Kataoka, Koji Inoue, Kazuaki Murakami Kyushu University, Japan
CREST-JST (2006~): Low-power, high-performance, reconfigurable processor using single-flux quantum (SFQ) circuits (SFQ-LSRDP)
- Kyushu Univ. (K. Murakami, K. Inoue, H. Honda, F. Mehdipour, H. Kataoka): architecture, compiler and applications
- Superconducting Research Lab. (SRL) (S. Nagasawa et al.): SFQ process
- Yokohama National Univ. (N. Yoshikawa et al.): SFQ-FPU chip, cell library
- Nagoya Univ. (A. Fujimaki et al.): SFQ-RDP chip, cell library, and wiring
- Nagoya Univ. (N. Takagi (Leader) et al.): CAD for logic design and arithmetic circuits
Our mission: architecture, compiler and application development
Outline of Large-Scale Reconfigurable Data-Path (LSRDP) Processor
SFQ features:
- High-speed switching and signal transmission
- Low power consumption
- Compact implementation (smaller area)
- Suitable for pipeline processing
How it works
Components: GPP, memory, memory controller, buffers, LSRDP
Program running on the GPP:
    inst; …
    conf_LSRDP ( );
    Loop:
        rearrange_input_data ( );
        set_IO_info ( );
        run_LSRDP ( );
        inst; …
        sync_lsrdp ( );
        rearrange_output_data ( );
    End_Loop
    inst; …
Steps:
- conf_LSRDP ( ): conf. bit-stream
- rearrange_input_data ( ): GPP
- set_IO_info ( ): memory controller
- run_LSRDP ( )
- sync_lsrdp ( ): GPP waiting for the LSRDP; LSRDP terminating the operation
- rearrange_output_data ( ): GPP
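The call sequence on this slide can be mocked in a few lines. The sketch below is a Python stand-in, not the real runtime: the class `MockLSRDP`, its method bodies, and the doubling "kernel" are hypothetical. Only the function names and the order in which the host issues them come from the slide.

```python
class MockLSRDP:
    """Hypothetical stand-in for the accelerator; it records the call order."""

    def __init__(self):
        self.trace = []      # names of the calls, in the order issued
        self.in_buf = []     # models the input buffer
        self.out_buf = []    # models the output buffer

    def conf_LSRDP(self):
        # On the slide this loads the configuration bit-stream.
        self.trace.append("conf_LSRDP")

    def rearrange_input_data(self, block):
        self.trace.append("rearrange_input_data")
        self.in_buf = list(block)

    def set_IO_info(self):
        self.trace.append("set_IO_info")

    def run_LSRDP(self):
        self.trace.append("run_LSRDP")
        # Placeholder computation (assumption): double every input value.
        self.out_buf = [2 * x for x in self.in_buf]

    def sync_lsrdp(self):
        # On the slide the GPP waits here until the LSRDP terminates.
        self.trace.append("sync_lsrdp")

    def rearrange_output_data(self):
        self.trace.append("rearrange_output_data")
        return list(self.out_buf)


def run_host_program(blocks):
    """Issue the slide's call sequence once per data block."""
    dev = MockLSRDP()
    dev.conf_LSRDP()                      # configured once, before the loop
    results = []
    for block in blocks:
        dev.rearrange_input_data(block)
        dev.set_IO_info()
        dev.run_LSRDP()
        # Ordinary GPP instructions ("inst; ...") could run here.
        dev.sync_lsrdp()
        results.append(dev.rearrange_output_data())
    return results, dev.trace
```

Calling `run_host_program([[1, 2], [3]])` runs the loop body twice after a single configuration step, which is the structure the listing on the slide shows.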
Architecture Exploration
LSRDP layouts and ORN structures (figure); panel labels:
- MCL = 1
- Number of rows = 1.5×M, number of columns = 4×MCL
- Number of rows = 2×M, number of columns = 6×MCL+2
- MCL = 1
- Number of rows = 1.5×M, number of columns = 4×MCL+1
- MCL = 2
PE structures (built from FU and TU):
- Basic PE arch.: 3-inps/2-outs
- PE arch. I: 4-inps/3-outs
- PE arch. II: 3-inps/3-outs
LSRDP Tool Chain
Application C code → modifying application code (inserting LSRDP instructions in the code) → modified application code
1: flow of the assembly code generation for GPP
- ISAcc or COINS compiler, using the LSRDP library file (function definitions & declarations)
- Output: binary code
2: flow of configuration bit-stream generation for the LSRDP
- DFG extraction, producing data flow graphs
- Placing and routing tool, using the LSRDP architecture description
- Output: configuration file + various text & schematic reports
Simulator: performance evaluation
Mapping DFGs onto LSRDP
Inputs: DFG, LSRDP architecture description
Steps:
- Placing input nodes
- Placing operational & output nodes
- Routing nets
- Routing IO nets
Output: final map
Figure label: longest connections
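To make the placement steps concrete, here is a minimal sketch of one common way to assign DFG nodes to rows: each node goes one row below its deepest predecessor (as-soon-as-possible levelling). The slide does not say how the actual tool places nodes, so the method, the function names, and the toy DFG are all assumptions. The row span of each connection is also reported, since a connection that spans more than one row has to pass through an intermediate row.

```python
def asap_rows(dfg):
    """Assign each node a row index.

    dfg maps node -> list of predecessor nodes; input nodes have an empty list.
    Inputs land on row 0, every other node one row below its deepest predecessor.
    """
    rows = {}

    def row_of(node):
        if node not in rows:
            preds = dfg[node]
            rows[node] = 0 if not preds else 1 + max(row_of(p) for p in preds)
        return rows[node]

    for node in dfg:
        row_of(node)
    return rows


def connection_spans(dfg, rows):
    """Number of rows each connection (pred, node) crosses."""
    return {(p, n): rows[n] - rows[p] for n, preds in dfg.items() for p in preds}


# Toy DFG (hypothetical): t1 = a + b; t2 = t1 * c; out = t2
toy = {"a": [], "b": [], "c": [], "t1": ["a", "b"], "t2": ["t1", "c"], "out": ["t2"]}
toy_rows = asap_rows(toy)
toy_spans = connection_spans(toy, toy_rows)
longest = max(toy_spans, key=toy_spans.get)   # the connection c -> t2, spanning 2 rows
```

In this toy graph the input `c` is consumed two rows down, so it is the longest connection and would need one pass-through element in row 1.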
Global routing algorithms
Routing DFG connections between source and destination PEs:
- Exhaustive search-based: very time consuming
- Branch and bound alg.: very fast
Figure legend: src, dest; PEs marked vacant or fully-occupied
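The contrast between the two approaches can be illustrated on a toy model. The sketch below is not the authors' algorithm. It assumes a grid of PEs in which a connection visits one cell per row, may shift at most `max_step` columns between consecutive rows, and must avoid occupied cells. The cost is the total horizontal displacement. The grid model, the cost function, and every name here are hypothetical.

```python
from itertools import product


def route_exhaustive(src, dst, occupied, width, max_step):
    """Try every combination of intermediate columns; return (cost, path) or None."""
    (r0, c0), (r1, c1) = src, dst
    best = None
    for mids in product(range(width), repeat=r1 - r0 - 1):
        cols = [c0, *mids, c1]
        cells = [(r0 + i, c) for i, c in enumerate(cols)]
        if any(cell in occupied for cell in cells[1:-1]):
            continue
        steps = [abs(b - a) for a, b in zip(cols, cols[1:])]
        if any(s > max_step for s in steps):
            continue
        cost = sum(steps)
        if best is None or cost < best[0]:
            best = (cost, cells)
    return best


def route_branch_and_bound(src, dst, occupied, width, max_step):
    """Depth-first search with pruning; return (cost, path) or None."""
    (r0, c0), (r1, c1) = src, dst
    best = [None]                      # best[0] holds (cost, path) once found

    def visit(row, col, cost, path):
        rows_left = r1 - row
        gap = abs(c1 - col)
        # Bound 1: destination column unreachable in the remaining rows.
        if gap > rows_left * max_step:
            return
        # Bound 2: even the straightest completion cannot beat the best so far.
        if best[0] is not None and cost + gap >= best[0][0]:
            return
        if rows_left == 0:
            best[0] = (cost, path)     # gap is 0 here because of bound 1
            return
        # Branch: try the columns closest to the destination first.
        lo, hi = max(0, col - max_step), min(width - 1, col + max_step)
        for nxt in sorted(range(lo, hi + 1), key=lambda c: abs(c1 - c)):
            cell = (row + 1, nxt)
            if cell in occupied and cell != dst:
                continue
            visit(row + 1, nxt, cost + abs(nxt - col), path + [cell])

    visit(r0, c0, 0, [src])
    return best[0]
```

Both functions return a route of the same cost. The exhaustive version enumerates `width ** (rows - 1)` candidates, while the pruned search abandons a partial route as soon as it cannot reach the destination or cannot improve on the best route already found.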
Micro-Routing: Problem Definition
Inputs:
- LSRDP basic specifications: layout, width (W), MCL, PE arch., etc.; list of connections b/w consecutive rows
- ORN structure, including: the number of CBs and T2s in each row; the number of CB rows; topology of connections among CBs
Output:
- Detailed routes via cross-bar switches: the list of CBs used for routing each connection; configuration of CBs
Figure: ORN between the i-th row and the (i+1)-th row
A micro-routing algorithm has been implemented for the LSRDP with underlying layout II and PE arch. III
ORN Micro-routing: Example
Micro-nets:
- (PE1 → PE5)
- (PE2 → PE5, PE6, PE7)
- (PE3 → PE6, PE8)
- (PE4 → PE7, PE8)
Legend: 1/2CB: 1-input/2-output; CB: 2-input/2-output
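The micro-nets of this example are easy to hold in a small data structure. The sketch below stores them as a mapping from source PE to destination PEs and derives two quantities a router would need. The helper names are hypothetical, and the split count assumes that fan-out is built only from two-way (1-input/2-output) elements, which the slide does not state.

```python
from collections import Counter

# Micro-nets from the slide: source PE -> destination PEs in the next row.
micro_nets = {
    "PE1": ["PE5"],
    "PE2": ["PE5", "PE6", "PE7"],
    "PE3": ["PE6", "PE8"],
    "PE4": ["PE7", "PE8"],
}


def point_to_point(nets):
    """Expand each multi-destination net into (source, destination) pairs."""
    return [(src, dst) for src, dsts in nets.items() for dst in dsts]


def fan_in(nets):
    """Number of incoming connections per destination PE."""
    return Counter(dst for dsts in nets.values() for dst in dsts)


def splits_needed(nets):
    """Two-way splits per source, assuming fan-out uses only 1-in/2-out elements.

    A net with k destinations then needs k - 1 splits.
    """
    return {src: len(dsts) - 1 for src, dsts in nets.items()}
```

For this example there are eight point-to-point connections, every destination PE receives exactly two of them, and PE2 is the only source that needs two splits.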
ORN Micro-Routing Example: Heat-8x2, ORN b/w 3rd and 4th rows (figure: PEs in 3rd row, PEs in 4th row)
Specifications of Attempted DFGs
Columns: DFG, total # of nodes, # of inputs, # of outputs, # of ops
Rows: Heat-8x, Heat-8x, Heat-16x, Poisson-3x, Vibration-4x, Vibration-8x, Vibration-8x, ERI, ERI
Example of a DFG Mapping: Vibration-8x2
Results of routing nets using the proposed algorithms
Columns: DFG; avg. hor. C.L.; avg./max. ver. C.L.; # of global/micro nets to route; time to map (sec)
- Heat-8x: /336/
- Heat-8x: /5 68/
- Heat-16x: /7 204/
- Poisson-3x: /16 67/
- Vibration-4x: /9 50/
- Vibration-8x: /10 154/
- Vibration-8x: /16 348/
- ERI: /9 111/
- ERI: /9 95/
Thank you for your attention! Any questions?
10 TFLOPS SFQ-RDP computer
- SFQ RDP (32 PE × 32 chips) (2.5 GFLOPS/PE), SFQ 0.5 μm process, 4.2 K
- SMAC: streaming memory access controller
- CMOS CPU (one chip)
- (34 chips) × 4 MCM
- Memory bandwidth per MCM: 256 GB/s (= 16 GB/s × 16 channels)
- 2 TB memory module (FB-DIMM 128 GB × 16 modules)
Diagram labels: SMAC, SB, ORN, FPU
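The headline figures on this slide can be cross-checked with simple arithmetic. The sketch below assumes one particular reading of the slide: 32 PEs per RDP chip, 32 RDP chips per MCM, and 4 MCMs in the system. That reading, and all constant and function names, are assumptions. The numbers themselves are taken from the slide.

```python
# Figures read off the slide; how they combine is this sketch's assumption.
PES_PER_CHIP = 32          # "32 PE"
RDP_CHIPS_PER_MCM = 32     # "x 32 chips", taken here as per MCM
GFLOPS_PER_PE = 2.5
MCM_COUNT = 4
CHANNEL_GB_PER_S = 16
CHANNELS_PER_MCM = 16
DIMM_GB = 128
DIMM_COUNT = 16


def peak_tflops():
    """Peak rate of the whole system in TFLOPS."""
    per_mcm_gflops = PES_PER_CHIP * RDP_CHIPS_PER_MCM * GFLOPS_PER_PE
    return per_mcm_gflops * MCM_COUNT / 1000.0


def bandwidth_per_mcm():
    """Memory bandwidth per MCM in GB/s."""
    return CHANNEL_GB_PER_S * CHANNELS_PER_MCM


def memory_tb():
    """Capacity of the memory module in TB (1 TB = 1024 GB)."""
    return DIMM_GB * DIMM_COUNT / 1024.0
```

Under this reading the peak rate is 10.24 TFLOPS, which agrees with the "10 TFLOPS" label, and the bandwidth and capacity come out at 256 GB/s and 2 TB as stated.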
Development of RDP Architecture
Chip micro-architecture:
- Two types of PEs: FPA and FPM
- PE layout: checkered pattern
- PE: three inputs (A, B, C) → three outputs (A(*B), B, C)
- Three scales of RDP (small, medium and large)
Diagram labels: FU, TU, FP; TU: data through
RDP parameters (optimized by total number of JJs):
Columns: # input, # output, width, height, MCL, total JJs (∝ RDP size)
Rows: RDP-S … K, RDP-M … K, RDP-L … K
Development of RDP Compiler
Application C code → modifying application code (manual: inserting LSRDP instructions in the code) → modified code
1: flow of the assembly code generation for GPP
- ISAcc or COINS compiler, using the RDP library file (function definitions & declarations)
- Output: .asm code for MIPS-based GPP
2: flow of configuration bit-stream generation for the RDP
- DFG extraction (semi-manual), producing data flow graphs
- Placement and routing tool, using the RDP architecture description
- Output: configuration file + various text and schematic reports
Simulator: performance evaluation
Development of RDP Oriented Algorithms
- One-dimensional heat and vibrational equations
- Two-dimensional heat and FDTD equations
- Two-electron repulsion integral calculation in quantum chemistry
- Runge-Kutta calculation for ordinary differential equations
Performance Evaluation
Two-dimensional heat equation (1024x1024 mesh): SFQ-RDP 1): 50.6 GFlop/s vs. GPU 2): 63.0 GFlop/s
1) Evaluation method:
- RDP: execution time model; DFG has 21 inputs, 9 outputs, and 63 operations
- GPP: cycle-accurate processor simulator
- BW: 159.0 GB/s
2) T. Aoki and A. Nukada, "CUDA programming premier," Kougakusya, ISBN-10: , 2009 (in Japanese).
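For readers unfamiliar with the workload, the sketch below shows one explicit finite-difference update of the two-dimensional heat equation using the standard 5-point stencil. The slide does not say which discretisation the authors mapped onto the RDP, so the scheme and all names here are assumptions. The second helper counts the distinct grid points a tile of outputs reads. For a 3×3 tile this gives 21 reads and 9 writes, which happens to match the DFG size quoted on the slide, and the update as written uses 7 arithmetic operations per point (63 for 9 points). Whether the authors tiled the computation this way is not stated.

```python
def heat_step(u, alpha):
    """One explicit time step on a 2D grid with fixed boundary values.

    u     : list of rows (lists of floats)
    alpha : k * dt / h**2; the explicit scheme is stable for alpha <= 0.25
    """
    n, m = len(u), len(u[0])
    new = [row[:] for row in u]            # boundary cells are copied unchanged
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            new[i][j] = u[i][j] + alpha * (
                u[i - 1][j] + u[i + 1][j] + u[i][j - 1] + u[i][j + 1]
                - 4.0 * u[i][j]
            )
    return new


def tile_footprint(rows, cols):
    """(distinct points read, points written) for a rows x cols output tile."""
    reads = set()
    for i in range(rows):
        for j in range(cols):
            reads.update({(i, j), (i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)})
    return len(reads), rows * cols
```

A single hot cell in the middle of a cold 3×3 grid, stepped with `alpha = 0.25`, relaxes to zero in one step because all four neighbours are zero.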