1 An Accelerator Based on Single-Flux Quantum Circuits for a High-Performance Reconfigurable Computer F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue*

Presentation transcript:

1 An Accelerator Based on Single-Flux Quantum Circuits for a High-Performance Reconfigurable Computer
F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue* and K. Murakami*
*Graduate School of Information Science and Electrical Engineering, Kyushu University, Japan
**Institute of Systems, Information Technologies and Nanotechnologies (ISIT), Fukuoka, Japan

WAHA 2009 Kyushu University
2 Agenda
Introduction
Large-Scale Reconfigurable Data-Path (LSRDP)
General Architecture and Specifications
Design Procedure and Tool Chain
Preliminary Results
Conclusions and Future Work

3 Introduction
Parallel computer clusters with General-Purpose Processors (GPPs) are often used for HPC.
Various accelerators are used with GPPs for further performance improvement:
  PowerXCell, GPGPU, GRAPE-DR, ClearSpeed, etc.
  Small size and low power consumption compared to processors with similar performance
[Figures: TSUBAME with NVIDIA Tesla S; Roadrunner with PowerXCell]

4 Single Flux Quantum Large Scale Reconfigurable Data-Path (SFQ-LSRDP)
Conventional accelerators demand a large memory bandwidth for high-performance computation.
On-chip memories are often used to hide memory access latency.
Large-Scale Reconfigurable Data-Path (LSRDP):
  is introduced as an alternative accelerator
  reduces the number of memory accesses
  is implemented with Single-Flux Quantum (SFQ) circuits instead of CMOS circuits
  is suitable for high-performance scientific computations

5 Outline of Large-Scale Reconfigurable Data-Path (LSRDP) processor
Features:
  Data Flow Graphs (DFGs) extracted from critical calculation parts are directly mapped
  Pipelined execution
  Burst transfer is used to move rearranged input/output data from/to memory
The reconfigurable data-path includes:
  A large number of floating-point Functional Units (FUs)
  Reconfigurable Operand Routing Network (ORN)
  Dynamic reconfiguration facilities
  Streaming Buffers (SB) for I/O ports
Implementation by SFQ circuits
[Block diagram: GPP, Main Memory, SMAC, Scratchpad Memory, SB, and the LSRDP as rows of FUs separated by ORNs]

6 Single-Flux Quantum (SFQ) against CMOS
CMOS issues:
  High electric power consumption
  High heat radiation and difficulties in high-density packing
  Memory wall problem, which limits the processing speed
SFQ features:
  High-speed switching and signal transmission
  Low power consumption
  Compact implementation of a system (small area)
  No cost for latches
  Suitable for pipelined processing of data streams
  Serial bit-level processing

7 CREST-JST (2006~): Low-power, high-performance, reconfigurable processor using single-flux quantum circuits (SFQ-LSRDP)
Kyushu Univ. (Prof. K. Murakami, Dr. K. Inoue, Dr. H. Honda, Dr. F. Mehdipour, H. Kataoka): architecture, compiler and applications
Superconducting Research Lab. (SRL) (Dr. S. Nagasawa et al.): SFQ process
Yokohama National Univ. (Prof. N. Yoshikawa et al.): SFQ-FPU chip, cell library
Nagoya Univ. (Prof. A. Fujimaki et al.): SFQ-RDP chip, cell library, and wiring
Nagoya Univ. (Prof. N. Takagi (Leader) et al.): CAD for logic design and arithmetic circuits

8 Goals of the Project
Discovering appropriate applications
Developing compiler tools
Developing performance analysis tools
Designing and implementing the SFQ-LSRDP architecture, considering the features and limitations of SFQ circuits

9 LSRDP General Architecture and Specifications

10 Parameters to Be Decided Within the LSRDP Design Procedure
Maximum Connection Length (MCL) between consecutive rows?
PE: combination of a Functional Unit (FU) and a data Transfer Unit (TU)
Reconfiguration mechanism? (PE, ORN, immediate data)
Layout: FU types (ADD/SUB and MUL)?
Core structure: a matrix of PEs
Width and height?
On-chip memory configuration?

11 LSRDP Architecture: Processing Elements
The FU implements basic 64-bit double-precision floating-point operations: ADD, SUB and MUL.
The TU (transfer unit) is a routing resource for transferring data from one row to a non-consecutive row.
[Figure: a PE includes two components (FU and TU) and offers four functionalities]

12 Layout Types: Type I
Each PE implements ADD/SUB and MUL.
Flexible, but consumes a lot of resources.
[Figure: W x H array of PEs with an ORN between rows; every PE contains A, M and T. A: ADD/SUB, M: MUL, T: Transfer Unit]

13 Layout Types: Type II (Checkered)
Each PE implements ADD/SUB or MUL.
[Figure: W x H array of PEs with an ORN between rows; ADD/SUB+TU and MUL+TU PEs are arranged in a checker pattern]

14 Layout Types: Type III (Striped)
Each PE implements ADD/SUB or MUL.
[Figure: W x H array of PEs with an ORN between rows; rows of MUL+TU PEs alternate with rows of ADD/SUB+TU PEs]
Type II or III: which one is more efficient?

15 Maximum Connection Length (MCL)
MCL: maximum horizontal distance between two PEs located in two consecutive rows
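The MCL definition can be made concrete with a small sketch. It assumes a hypothetical representation in which every connection of a mapped DFG is a pair of column indices, one for the source PE in a row and one for the destination PE in the next row; the function name and data layout are illustrative only and are not taken from the authors' tools.

```python
def max_connection_length(edges):
    """Return the MCL for a list of (src_col, dst_col) connections.

    Each pair links a PE in one row to a PE in the next row, so the
    horizontal distance is the absolute difference of the columns.
    """
    return max((abs(dst - src) for src, dst in edges), default=0)


# Hypothetical mapping with four connections between consecutive rows.
edges = [(0, 0), (1, 3), (4, 2), (5, 5)]
print(max_connection_length(edges))
```

A larger MCL means each ORN has to reach further sideways, which is why the later slides treat it as a cost to be minimized.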

16 An ORN Structure
An ORN consists of 2-bit shift registers and 1-by-2 and 2-by-2 cross-bar switches.
[Figure: ORN built from 2-bit shift registers and cross-bar switches]
A. Fujimaki, et al., "Demonstration of an SFQ-Based Accelerator Prototype for a High-Performance Computer," ASC08, 2008.

17 Dynamic Reconfiguration Mechanism
Three bit-stream lines are used for dynamic reconfiguration of:
  Immediate registers (64-bit) in each PE
  Selector bits for the muxes that select the input data of FUs
  Cross-bar switches in ORNs

18 Design Procedure and Tool Chain

19 Compiler and Design Flow
DFGs are manually generated from critical parts of applications.
DFG mapping results are used for:
  Analyzing LSRDP architecture statistics
  Generating LSRDP configuration bit-streams

20 LSRDP Design Procedure
[Flow chart: for each parameter, the DFGs and the LSRDP hardware constraints are used to find an appropriate value]

21 Benchmark Applications for Design Procedures
Finite difference method calculation of 2nd-order partial differential equations:
  1-dim. heat equation (Heat)
  1-dim. vibration equation (Vibration)
  2-dim. Poisson equation (Poisson)
Quantum chemistry application:
  Recursive parts of Electron Repulsion Integral calculation (ERI-Rec)
Only ADD/SUB and MUL operations are used in the critical calculations of all the above applications.

22 DFG Extraction: Heat Equation
1-dim. heat equation for T(x,t)
Calculation by the Finite Difference Method (FDM) (A is const.)
Basic DFG corresponding to the minimum FDM calculation
The basic DFG can be extended in the horizontal and vertical directions to make a larger DFG.
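The slide's own formula did not survive in the transcript, so the following is only a sketch under an assumption: it uses the standard explicit FDM update for the 1-dim. heat equation, T(x_i, t_{j+1}) = T(x_i, t_j) + A * (T(x_{i+1}, t_j) - 2 T(x_i, t_j) + T(x_{i-1}, t_j)), with A a constant. The function name and the fixed-boundary treatment are illustrative choices, not the authors' code.

```python
def heat_step(T, A):
    """Advance the temperature profile T by one explicit FDM time step.

    Interior points use only additions, subtractions and multiplications;
    the two boundary values are held fixed (an assumption of this sketch).
    """
    new = list(T)
    for i in range(1, len(T) - 1):
        new[i] = T[i] + A * (T[i + 1] - 2.0 * T[i] + T[i - 1])
    return new


T0 = [0.0, 0.0, 1.0, 0.0, 0.0]
print(heat_step(T0, 0.25))
```

Each interior update corresponds to one small DFG; computing several neighbouring points or several time steps at once widens or deepens that graph.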

23 Example of extracted DFGs: Heat
A huge sample DFG (Heat)
Inputs: 32
Outputs: 16
Operations: 721
Immediates: 364

24 DFG Classification
Because DFG sizes span a broad range, DFGs are classified as S, M, L or XL with respect to their size and their number of input/output nodes.
In total, 24 DFGs are prepared as benchmark DFGs.

25 Mapping DFGs onto LSRDP
[Figure: a DFG mapped onto the LSRDP, with the longest connections marked]

26 Preliminary Results

27 LSRDP Specifications: Width & Height
[Table: LSRDP dimensions and the number of input/output ports. Columns: # of input ports, # of output ports, width, height. Rows: LSRDP-S, LSRDP-M, LSRDP-L]

28 LSRDP Specifications: MCL
| LSRDP | MCL (avg/max) | ORN size: no. of inputs (avg/max), outputs |
| LSRDP-S | 4/8 | 18/34, 3 |
| LSRDP-M | 5/9 | 22/38, 3 |
| LSRDP-L | 5/9 | 22/34, 3 |
Needs further MCL optimization.

29 Analyzing Various LSRDP Layouts
Layout II can be used instead of Layout I to obtain a smaller LSRDP architecture with lower power consumption and implementation cost (except for the ERI1 DFG, which gives a better size with Layout III).
[Chart legend: Layout I, Layout II]

30 LSRDP at One Glance (1/2)
Functional units: ADD/SUB, MUL
Layout: Type II (checker pattern)
Operations: 64-bit floating point
Processing structure: Pipelined
PE structure: FU, T, FU+T, T+T
LSRDP size (Small / Medium / Large):
  No. of inp/out ports: 19/12 38/24
  Width/Height: 16/16, 32/16, 64/32
Conf. bit-stream size (Small / Medium / Large):
  Imm. regs: 16*16*64, 32*16*64, 64*32*64
  ORNs: 16*BSS(ORN), 32*BSS(ORN), 64*BSS(ORN)
  PEs: 16*16*2, 32*16*2, 64*32*2
ORN:
  Inputs, outputs: 22, 3 26, 3
  Structure: Cross-bar switch
  Conn. type: One-directional
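The bit-stream entries on this slide are simple products, which the sketch below evaluates. The multipliers are copied from the slide; BSS(ORN) is not given there (it presumably denotes the bit-stream size of one ORN), so it is left as a parameter, and the function and dictionary names are hypothetical.

```python
# (width, height, ORN multiplier) per LSRDP size, as listed on the slide.
SIZES = {"Small": (16, 16, 16), "Medium": (32, 16, 32), "Large": (64, 32, 64)}


def config_bits(width, height, n_orn, bss_orn):
    """Return (immediate-register bits, PE bits, ORN bits).

    bss_orn is a placeholder for BSS(ORN), whose value is not on the slide.
    """
    imm_bits = width * height * 64  # one 64-bit immediate register per PE
    pe_bits = width * height * 2    # two configuration bits per PE
    orn_bits = n_orn * bss_orn
    return imm_bits, pe_bits, orn_bits


for name, (w, h, n_orn) in SIZES.items():
    imm, pe, _ = config_bits(w, h, n_orn, bss_orn=0)
    print(f"{name}: {imm} immediate bits, {pe} PE bits")
```

The immediate registers dominate the PE-related part of the bit-stream, since each PE carries 64 immediate bits against 2 selector bits.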

31 LSRDP at One Glance (2/2)
Internal memory:
  Type: Immediate registers
  Size and count: 64-bit registers, one register for each PE
  Communication mechanism: Serial
External memory:
  No. of memory modules: 16
  Data trans. rate: 1800 Mbps/pin
  Overall data trans. rate: 24 GB/s
  Mem. to LSRDP bus width: 64 bit
  Channels per module: Two
Reconf. mechanism: Bit-serial configuration through a serial chain

32 Preliminary Performance Evaluation
GPP: execution time measured by means of a processor simulator
LSRDP: estimated by performance modeling
Base processor configuration:
  Processor type: Out-of-order
  GPP operating frequency: 3.2 GHz
  Inst. issue width: 4 instructions/cc
  Inst. decode width: 4 instructions/cc
  Cache configuration:
    L1 data: 64KB (128B entry, 2-way, 2cc)
    L1 instruction: 64KB (64B entry, 1-way, 1cc)
    L2 unified: 4MB (128B entry, 4-way, 16cc)
  Latency of main memory: 300cc
  L2 to main memory: bus width 64 bytes, freq. 800 MHz
GPP+LSRDP configuration:
  LSRDP operating frequency: 80 GHz
  Reconfiguration latency: 1cc
  Latency between SPM and LSRDP: 1cc
  Latency between main memory and SPM: 7500cc
  Bandwidth between SPM and LSRDP: max. 64 * 8 bytes/cc
  Bandwidth between main memory and SPM: 102.4 GB/sec

33 Preliminary Performance Evaluation (Heat)
Data reuse is employed to avoid data rearrangement as well as frequent data retrieval from the scratchpad memory.
Basic: SB only
Reuse: SB + SPM

34 Preliminary Performance Evaluation (Poisson)
Only a small fraction of the execution time is processing time on the LSRDP; the main fraction consists of various overhead times and the execution time on the GPP.

35 Conclusions & Future Work
A high-performance computer comprising an accelerator (LSRDP) implemented with superconducting circuits was introduced.
24 benchmark Data Flow Graphs (DFGs) were manually generated.
The LSRDP micro-architecture is designed from the characteristics of scientific applications via a quantitative approach.
LSRDP is promising for resolving issues that originate from CMOS technology as well as for achieving considerable performance.
Future work:
  To achieve higher performance, the various overhead costs, mainly related to data management, must be reduced.
  To reduce the implementation cost of LSRDP, we will focus on reducing the maximum connection length and the ORN size.

36 Acknowledgement
This research was supported in part by Core Research for Evolutional Science and Technology (CREST) of Japan Science and Technology Corporation (JST).

37 Thanks! Any Questions?

38 Backup Slides

39 SFQ (Single Flux Quantum) Circuit
High speed, low power consumption, and operation on a different principle from CMOS
[Figure: superconductivity loop with a Josephson junction (tunneling effect) holding a single flux quantum Φ0. Labels: L, Ic, Ib, 2mV, 2ps]

40 Mapping Results
For each class, many extra TUs are needed to map all DFGs.
[Chart: PE types (FU T T TT)]

41 Connection Length Minimization: Results
Final optimized Maximum Connection Length (MCL) results:
| LSRDP | MCL (avg/max) |
| RDP-S | 4/9 |
| RDP-M | 5/9 |
| RDP-L | 9.3/19 |
ORNs should provide a connection length of 9 in LSRDP-S/M (MCL = 9).
For LSRDP-L, MCL = 19, which implies a serious implementation cost. Is it possible to decrease it?

42 Distributions of Connection Lengths
93% of connection lengths are 0 to 2.
Only a small fraction of the connections results in larger ORNs.
[Chart x-axis: connection length]

43 Analyzing Various LSRDP Layouts
Layout II can be used instead of Layout I to obtain a smaller LSRDP architecture with lower power consumption and implementation cost.
Layouts I and II achieve similar, small sizes for the majority of DFGs (except the ERI1 DFG, which gives a better size with Layout III).

44 Why is only the ERI1 DFG suited to Layout III?
[Figure: Heat and ERI1 DFGs compared on Layout II and Layout III]

45 FU Layout for DIV, SQRT, EXP operations
Pipelined execution is based on ADD and MUL latency.
Where should FUs with a different latency be placed?
Heterogeneous configuration of the FU array?
16-bit floating-point DIV, SQRT, and EXP functional units have already been developed with current SFQ technology.
[Figure: FU array with ORNs (Operand Routing Networks) between rows, including a DIV unit with three times larger latency]

46 Estimated performance improvement of 2-dim Poisson equation by LSRDP calc.
[Chart: execution time normalized by GPP (3GHz) calc. vs. main memory bandwidth [GByte/sec]]

47 Estimated performance improvement of ERI calculation by LSRDP (3GHz)

48 Recursive Parts of Electron Repulsion Integral Formula (ERI-Rec)
DFG sizes are already determined by the original recursive formula.
| DFG | No. of operations | No. of inputs | No. of outputs |
| (ps,ss) | 9 | 8 | 3 |
| (ps,ps) | 51 | 16 | 9 |
| (pp,ss) | 66 | 14 | 9 |
| (pp,ps) | | | |
| (pp,pp) | | | |

49 What types of software/algorithms are suitable for LSRDP?
Cases where the same calculations have to be performed repeatedly: LSRDP is used as a high-throughput accelerator.
Cases where the input/output data size is small compared with the number of operations.
[Figure: small input, large amount of calculations on the LSRDP without memory access, small output]

50 Exploration of suitable applications for LSRDP
Investigating applicability to various applications.
Application:
  Matrix elements calculation
  Molecular integral calculations in the molecular orbital method
  Monte Carlo type simulation
  etc.
Numerical calculation library:
  Special function (promising?)
  Differential equation
  Numerical integration
  Matrix operation (difficult?)
  Triangular matrix
  Simultaneous equation
  etc.

51 Recursive Parts of Electron Repulsion Integral Formula in Molecular Orbital Calc.
Recursive calculation up to (pp,pp)
(ss,ss)(m) and all coefficients are given as input (i,j,k,l = x,y,z); a p function has 3 components (as a 1-dim. array).
# of inputs: max. 28
# of outputs: 1 ~ 81
Each DFG has only ADD (SUB) and MUL FUs.
DFG sizes are determined by the original calculation algorithm.

52 DFG Distribution for each application
DFGs differ in their # of FUs and # of inputs and outputs.
[Chart: # of FUs vs. # of inputs. Series: Poisson (3 DFGs), Vibration (7), Heat (6), ERI-Rec (8)]

53 Example of MCL (Heat)
[Figure: original Heat DFG (I/O: 8/4, FUs: 32) and its mapping result, with the MCL marked]

54 Example of extracted DFGs (ERI-Rec)
Maximum DFG of ERI-Rec, (p_i p_j, p_k p_l): Inputs: 28, Outputs: 81, FUs: 1004, Immediates: 0
After vertical partitioning: Inputs: 24, Outputs: 1, FUs: 108, Immediates: 0

55 Poisson Equation
2D Poisson eq., solved by the Successive Over-Relaxation (SOR) method (ω is const.)
To obtain u^(n+1)(x_i, y_j) in the next iteration, the current values of five variables are needed: u^(n)(x_i, y_j), u^(n)(x_{i±1}, y_j) and u^(n)(x_i, y_{j±1}).
Red/Black Gauss-Seidel
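The slide's SOR formula was lost in the transcript. The sketch below therefore assumes the textbook five-point form for the equation ∇²u = f on a uniform grid with spacing h: u_new = (1 - ω) u + (ω / 4) (u_east + u_west + u_north + u_south - h² f). The sign convention for f, the function name and the grid layout are assumptions of this sketch rather than details from the presentation.

```python
def sor_sweep(u, f, h, omega, color):
    """Update, in place, all interior points whose (i + j) parity equals color.

    Points of one colour depend only on points of the other colour, so a
    red sweep followed by a black sweep gives one Red/Black iteration.
    """
    rows, cols = len(u), len(u[0])
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            if (i + j) % 2 == color:
                neighbours = u[i + 1][j] + u[i - 1][j] + u[i][j + 1] + u[i][j - 1]
                gauss_seidel = 0.25 * (neighbours - h * h * f[i][j])
                u[i][j] = (1.0 - omega) * u[i][j] + omega * gauss_seidel
    return u


# 3x3 grid: boundary fixed at 1.0, one interior unknown, zero source term.
u = [[1.0, 1.0, 1.0], [1.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
f = [[0.0] * 3 for _ in range(3)]
sor_sweep(u, f, h=1.0, omega=1.0, color=0)
print(u[1][1])
```

The update touches exactly the five values named on the slide and uses only ADD/SUB and MUL once ω/4 and h² are folded into constants.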

56 Example of extracted DFGs (Poisson)
Maximum Poisson DFG
Inputs: 32
Outputs: 1
FUs: 721
Immediates: 364

57 Performance Evaluation: Simulation Environment
[Block diagram: GPP, Main Memory, LSRDP]
GPP: execution time measured with a processor simulator
LSRDP: estimated by performance modeling
Variable parameters:
  Frequency of GPP and LSRDP
  Bandwidth between main memory and LSRDP
  Reconfiguration latency
  # of FPUs in LSRDP
  Supported FPU types (Add, Mul, Div, Exp, Sqrt and Error function units are supported)
A streaming buffer is used in the LSRDP chip.
I/O data is sorted in the main memory.

58 Estimated performance improvement of 1-dim heat equation by LSRDP calc.
[Chart x-axis: main memory bandwidth [GByte/sec]]

59 Estimated performance improvement of 1-dim heat equation by LSRDP calc.
[Chart: execution time normalized by GPP (3GHz) calc. vs. main memory bandwidth [GByte/sec]]

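60 Increase in the number of iterations through DFG enlargement in the Poisson Red/Black method
4+1 input nodes, 1 central output node: one calculation of the SOR formula
9+4 input nodes, 1 central output node: two iterations of the SOR formula
Enlarging the DFG increases the number of iterations that can be computed at once.
The number of required inputs increases accordingly.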

61 Implementation of Heat calculation to LSRDP
Original GPP code:
  Loop j
    Loop i
      T(xi,tj)
  End Loop
LSRDP code:
  LSRDP Reconfiguration
  Loop j'
    Input Data Rearrangement
    Loop N
      LSRDP pipeline exec. (FDM DFG calc.)
    End Loop
    Output Data Rearrangement
  End Loop

62 Implementation of Poisson calculation to LSRDP
Original GPP code:
  Loop Iter
    Loop i
      Loop j
        u(xi,yj)
  End Loop
LSRDP code:
  LSRDP Reconfiguration
  Loop Iter'
    Input Data rearrangement
    Loop N
      LSRDP pipeline exec. (FDM DFG calc.)
    End Loop
    Output Data rearrangement
  End Loop

63 Implementation of ERI-Rec calculation to LSRDP
Initial Integral Calc.: 1/Sqrt, Exp, Fm(T) are utilized => GPP calculation
Recursive Calc.: only ADD/SUB, MUL => LSRDP calculation
Original GPP code:
  Loop I,J,K,L
    Loop contraction
      Initial Integral Calc.
      Recursive Calc.
    End Loop
    Partial Fock Calc.
  End Loop
LSRDP code:
  Loop I,J,K,L
    LSRDP Reconfiguration
    Loop contraction
      Initial Integral Calc.
    End Loop
    Input Data rearrangement
    Loop N
      LSRDP pipeline calc. (Recursive DFG calc.)
    End Loop
    Output Data rearrangement
    Partial Fock Calc.
  End Loop

64 Vertical vs. Horizontal DFG Decomposition
Original:
  Loop N
    Reconfiguration
    Loop M
      LSRDP pipeline calc.
  End Loop
Vertical decomp.:
  Loop n ( > N)
    Reconfiguration
    Loop M
      LSRDP pipeline calc.
  End Loop
Horizontal decomp.:
  Loop N
    Reconfiguration
    Loop M
      1st LSRDP pipeline calc.
  End Loop
  Loop N
    Reconfiguration
    Loop M
      2nd LSRDP pipeline calc.
  End Loop

68 Performance Evaluation: Execution Time Modeling
[Diagram: execution time is broken down into calculation time and stall time. Labels: latency between LSRDP and memory for the first input and last output; sort data + reconfiguration + send signal for communication + stall when required bandwidth > memory bandwidth; total pipeline depth in the given program + # of rows of LSRDP (latency of LSRDP)]

69 Layout Types: Type I
Each PE implements ADD/SUB and MUL.
Total no. of PEs = W * H
Total area = W * H * [Area(MUL) + Area(ADD/SUB) + Area(TU)] + Area(ORNs)
[Figure: W x H array of PEs with an ORN between rows; every PE contains A, M and T. A: ADD/SUB, M: MUL, T: Transfer Unit]

70 Layout Types: Type II
Each PE implements ADD/SUB or MUL.
Total no. of PEs = W * H
Total area = ½ * W * H * [Area(MUL) + Area(TU)] + ½ * W * H * [Area(ADD/SUB) + Area(TU)] + Area(ORNs)
[Figure: W x H array of PEs with an ORN between rows; ADD/SUB+TU and MUL+TU PEs]

71 Layout Types: Type III
Each PE implements ADD/SUB or MUL.
Total no. of PEs = W * H
Total area = ½ * W * H * [Area(MUL) + Area(TU)] + ½ * W * H * [Area(ADD/SUB) + Area(TU)] + Area(ORNs)
[Figure: W x H array of PEs with an ORN between rows; ADD/SUB+TU and MUL+TU PEs]
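The area formulas for the three layout types can be compared numerically. In the sketch below the function names are hypothetical and the unit areas in the example are made-up placeholders, not SFQ cell areas from the project; only the formulas themselves come from the slides.

```python
def area_type1(w, h, a_mul, a_add, a_tu, a_orn):
    """Type I: every PE contains a MUL, an ADD/SUB and a TU."""
    return w * h * (a_mul + a_add + a_tu) + a_orn


def area_type2_or_3(w, h, a_mul, a_add, a_tu, a_orn):
    """Types II and III: half the PEs are MUL+TU, the other half ADD/SUB+TU."""
    half = 0.5 * w * h
    return half * (a_mul + a_tu) + half * (a_add + a_tu) + a_orn


# Placeholder unit areas in arbitrary units, chosen only for illustration.
params = dict(w=16, h=16, a_mul=4, a_add=2, a_tu=1, a_orn=100)
print(area_type1(**params))
print(area_type2_or_3(**params))
```

For equal W, H and ORN area, Types II and III share the same formula, so they differ only in how well DFGs map onto them (and hence in the W and H actually required).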

72 CB Various Functionalities
CB: two inputs / two outputs; four cases are possible; reconfigurable
½ CB: one input / two outputs; four cases are possible; reconfigurable

73 An ORN Structure
The number of FPUs is M, and the number of Transfer Units (T) is also M; MCL is the maximum connection length if we consider FPUs only. The ORN then needs:
  ½ CB: 2×M
  T2: M + 4×MCL + 2
  CB: (2×MCL+1) × (4×M-1)
CB: 351 JJs; ½ CB: 216 JJs
* T2 is a 2-bit shift register
"+": scalable; pipelined; easily re-designed for any number of N and M
"-": large number of Josephson junctions; M number of ½ CB and (2×M+1)×MCL number of CB
Reduction of the number of Josephson junctions is essential!
A. Fujimaki, et al., "Demonstration of an SFQ-Based Accelerator Prototype for a High-Performance Computer," ASC08, 2008.

74 Reconfiguration Mechanism
[Figures: PE & ORN reconfiguration structure; state diagram of reconfiguration in LSRDP]

75 LSRDP Design Procedure

76 10 TFLOPS SFQ-RDP computer
SFQ RDP (32 FPUs × 32 chips, 4 GFLOPS/FPU), SFQ 0.5 um process
SFQ Streaming Buffer (64 Kb × 2 chips)
(34 chips) × 4 MCMs
Memory bandwidth per MCM: 256 GB/s (= 16 GB/s × 16 channels)
2 TB memory module (FB-DIMM 128 GB × 16 modules)
CMOS CPU (1 chip)
[Block diagram: SMAC, SB, and rows of FPUs separated by ORNs; label: 4.2 K]
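The figures on this slide can be cross-checked with a few multiplications. The sketch assumes that "32FPU×32chips" means 32 FPUs per chip and 32 RDP chips per MCM (consistent with 34 chips per MCM once the 2 streaming-buffer chips are added); that reading, and all variable names, are assumptions. The slide does not say how the 10 TFLOPS in the title relates to the peak value computed here.

```python
FPUS_PER_CHIP = 32    # assumed reading of "32FPU"
CHIPS_PER_MCM = 32    # assumed reading of "32chips"
GFLOPS_PER_FPU = 4
MCMS = 4

peak_per_mcm_gflops = FPUS_PER_CHIP * CHIPS_PER_MCM * GFLOPS_PER_FPU
peak_total_gflops = peak_per_mcm_gflops * MCMS

mem_bw_per_mcm_gbs = 16 * 16   # 16 GB/s per channel, 16 channels
memory_total_gb = 128 * 16     # 128 GB FB-DIMM, 16 modules

print(peak_per_mcm_gflops, peak_total_gflops)
print(mem_bw_per_mcm_gbs, memory_total_gb)
```

The bandwidth and capacity products reproduce the 256 GB/s and 2 TB quoted on the slide.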

77 Power Consumption Comparison for 10 TFLOPS computers
| | MPU | Mem. | HD | Freezer | Air conditioning | Total |
| CMOS RISC (90 nm) | 125 kW | 12.5 kW | 5 kW | | 43 kW | 186 kW |
| SFQ RDP (0.5 um) | 3 W + 3.3 W | 250 W | 100 W | 1 kW | 0.1 kW | 1.5 kW |
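The totals in this comparison can be reproduced by summing the components. The assignment of values to columns follows one reading of the flattened table (in particular, that the CMOS system has no freezer entry and that 43 kW is its air conditioning); the dictionary and function names are illustrative.

```python
GFLOPS = 10_000  # both systems are described as 10 TFLOPS computers

# Component power in watts, as read from the slide.
cmos_w = {"MPU": 125_000, "Mem": 12_500, "HD": 5_000, "Air conditioning": 43_000}
sfq_w = {"MPU": 3 + 3.3, "Mem": 250, "HD": 100, "Freezer": 1_000, "Air conditioning": 100}


def total_and_ratio(parts, gflops=GFLOPS):
    """Return (total power in W, power per performance in W/GFLOPS)."""
    total = sum(parts.values())
    return total, total / gflops


for name, parts in (("CMOS RISC", cmos_w), ("SFQ RDP", sfq_w)):
    total, ratio = total_and_ratio(parts)
    print(f"{name}: {total:.1f} W, {ratio:.4f} W/GFLOPS")
```

The sums come to about 185.5 kW and 1.46 kW, which the slide rounds to 186 kW and 1.5 kW, a ratio of roughly two orders of magnitude.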

78 Power Consumption / Performance
| System | Performance (GFlops) | Power consumption (W) | Power cons. / perf. (W/GFlops) | Scope |
| SFQ-RDP | ~10K | ~1.5K | ~ | 0.5 um, whole system |
| SFQ-RDP | ~10K | ~6.3 (MPU) | ~0.63*10^(-3) | 0.5 um, MPU |
| GRAPE-DR | | | | Chip |
| CSX (ClearSpeed) | | | | Chip |
| Cell | 192 (single) | 32 | 0.17 | Inside chip, SPE core |
| Cell (eDP) | 1.33PF(#) ? | 12960*110W ? | | Roadrunner |
| GeForce8800GTX | 518 (single) | | | Chip |
| SX | | | | nodes; whole system: 15.4MW |
| CMOS RISC (90nm) | ~10K | ~186K | ~18.6 | |