Optimizing the Architecture of SFQ-RDP (Single Flux Quantum- Reconfigurable Datapath) F. Mehdipour*, Hiroaki Honda **, H. Kataoka, K. Inoue and K. Murakami*

Slides:

Advertisements

Similar presentations

Philips Research ICS 252 class, February 3, The Trimedia CPU64 VLIW Media Processor Kees Vissers Philips Research Visiting Industrial Fellow

Advertisements

International Symposium on Low Power Electronics and Design Qing Xie, Mohammad Javad Dousti, and Massoud Pedram University of Southern California ISLPED.

Commercial FPGAs: Altera Stratix Family Dr. Philip Brisk Department of Computer Science and Engineering University of California, Riverside CS 223.

Zhiguo Ge, Weng-Fai Wong, and Hock-Beng Lim Proceedings of the Design, Automation, and Test in Europe Conference, 2007 (DATE’07) April /4/17.

Parallell Processing Systems1 Chapter 4 Vector Processors.

Performance Evaluations of Finite Difference Applications Realized on a Single Flux Quantum Circuits-Based Reconfigurable Accelerator Hiroaki Honda 1,

1 An Accelerator Based on Single-Flux Quantum Circuits for a High-Performance Reconfigurable Computer F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue*

Kyushu University KL, Malaysia Hardware and Software Requirements for Implementing a High-Performance Superconductivity Circuits-Based Accelerator Farhad.

1/42 Changkun Park Title Dual mode RF CMOS Power Amplifier with transformer for polar transmitters March. 26, 2007 Changkun Park Wave Embedded Integrated.

Extensible Processors. 2 ASIP Gain performance by:  Specialized hardware for the whole application (ASIC). −  Almost no flexibility. −High cost.  Use.

PipeRench: A Coprocessor for Streaming Multimedia Acceleration Seth Goldstein, Herman Schmit et al. Carnegie Mellon University.

University of Michigan Electrical Engineering and Computer Science 1 Reducing Control Power in CGRAs with Token Flow Hyunchul Park, Yongjun Park, and Scott.

Feb 14 th 2005University of Utah1 Microarchitectural Wire Management for Performance and Power in partitioned architectures Rajeev Balasubramonian Naveen.

04/26/2006VLSI Design & Test Seminar Series 1 Phase Delay in MAC-based Analog Functional Testing in Mixed-Signal Systems Jie Qin, Charles Stroud, and Foster.

University of Michigan Electrical Engineering and Computer Science Data-centric Subgraph Mapping for Narrow Computation Accelerators Amir Hormati, Nathan.

Network-on-Chip: Communication Synthesis Department of Computer Science Texas A&M University.

Feb 14 th 2005University of Utah1 Microarchitectural Wire Management for Performance and Power in Partitioned Architectures Rajeev Balasubramonian Naveen.

Distributed Microarchitectural Protocols in the TRIPS Prototype Processor Sankaralingam et al. Presented by Cynthia Sturton CS 258 3/3/08.

Dynamic Hardware Software Partitioning A First Approach Komal Kasat Nalini Kumar Gaurav Chitroda.

Performance and Power Efficient On-Chip Communication Using Adaptive Virtual Point-to-Point Connections M. Modarressi, H. Sarbazi-Azad, and A. Tavakkol.

Approaching Ideal NoC Latency with Pre-Configured Routes George Michelogiannakis, Dionisios Pnevmatikatos and Manolis Katevenis Institute of Computer Science.

Introduction to Interconnection Networks. Introduction to Interconnection network Digital systems(DS) are pervasive in modern society. Digital computers.

February 12, 1998 Aman Sareen DPGA-Coupled Microprocessors Commodity IC’s for the Early 21st Century by Aman Sareen School of Electrical Engineering and.

Lecture 2: Field Programmable Gate Arrays September 13, 2004 ECE 697F Reconfigurable Computing Lecture 2 Field Programmable Gate Arrays.

A Thermal-Aware Mapping Algorithm for Reducing Peak Temperature of an Accelerator Deployed in a 3D Stack A Thermal-Aware Mapping Algorithm for Reducing.

CuMAPz: A Tool to Analyze Memory Access Patterns in CUDA

Paper Review I Coarse Grained Reconfigurable Arrays Presented By: Matthew Mayhew I.D.# ENG*6530 Tues, June, 10,

RICE UNIVERSITY Implementing the Viterbi algorithm on programmable processors Sridhar Rajagopal Elec 696

A Reconfigurable Processor Architecture and Software Development Environment for Embedded Systems Andrea Cappelli F. Campi, R.Guerrieri, A.Lodi, M.Toma,

CAD for Physical Design of VLSI Circuits

LOPASS: A Low Power Architectural Synthesis for FPGAs with Interconnect Estimation and Optimization Harikrishnan K.C. University of Massachusetts Amherst.

Paper Review: XiSystem - A Reconfigurable Processor and System

A RISC ARCHITECTURE EXTENDED BY AN EFFICIENT TIGHTLY COUPLED RECONFIGURABLE UNIT Nikolaos Vassiliadis N. Kavvadias, G. Theodoridis, S. Nikolaidis Section.

Automated Design of Custom Architecture Tulika Mitra

Speculative Software Management of Datapath-width for Energy Optimization G. Pokam, O. Rochecouste, A. Seznec, and F. Bodin IRISA, Campus de Beaulieu

CSE 494: Electronic Design Automation Lecture 2 VLSI Design, Physical Design Automation, Design Styles.

A Combined Analytical and Simulation-Based Model for Performance Evaluation of a Reconfigurable Instruction Set Processor Farhad Mehdipour, H. Noori, B.

1 Towards Optimal Custom Instruction Processors Wayne Luk Kubilay Atasu, Rob Dimond and Oskar Mencer Department of Computing Imperial College London HOT.

PDCS 2007 November 20, 2007 Accelerating the Complex Hessenberg QR Algorithm with the CSX600 Floating-Point Coprocessor Yusaku Yamamoto 1 Takafumi Miyata.

1 Fly – A Modifiable Hardware Compiler C. H. Ho 1, P.H.W. Leong 1, K.H. Tsoi 1, R. Ludewig 2, P. Zipf 2, A.G. Oritz 2 and M. Glesner 2 1 Department of.

L11: Lower Power High Level Synthesis(2) 성균관대학교 조 준 동 교수

A Reconfigurable Low-power High-Performance Matrix Multiplier Architecture With Borrow Parallel Counters Counters : Rong Lin SUNY at Geneseo

CML REGISTER FILE ORGANIZATION FOR COARSE GRAINED RECONFIGURABLE ARCHITECTURES (CGRAs) Dipal Saluja Compiler Microarchitecture Lab, Arizona State University,

An Integrated Temporal Partitioning and Mapping Framework for Handling Custom Instructions on a Reconfigurable Functional Unit Farhad Mehdipour †, Hamid.

A Hybrid Design Space Exploration Approach for a Coarse-Grained Reconfigurable Accelerator Farhad Mehdipour, Hamid Noori, Hiroaki Honda, Koji Inoue, Kazuaki.

Feb 14 th 2005University of Utah1 Microarchitectural Wire Management for Performance and Power in Partitioned Architectures Rajeev Balasubramonian Naveen.

1 Synthesizing Datapath Circuits for FPGAs With Emphasis on Area Minimization Andy Ye, David Lewis, Jonathan Rose Department of Electrical and Computer.

Design Space Exploration for a Coarse Grain Accelerator Farhad Mehdipour, Hamid Noori, Morteza Saheb Zamani*, Koji Inoue, Kazuaki Murakami Kyushu University,

Developing an Architecture for a Single-Flux Quantum Based Reconfigurable Accelerator F. Mehdipour, Hiroaki Honda *, H. Kataoka, K. Inoue and K. Murakami.

1 Energy-Efficient Register Access Jessica H. Tseng and Krste Asanović MIT Laboratory for Computer Science, Cambridge, MA 02139, USA SBCCI2000.

A Programmable Single Chip Digital Signal Processing Engine MAPLD 2005 Paul Chiang, MathStar Inc. Pius Ng, Apache Design Solutions.

High Performance, Low Power Reconfigurable Processor for Embedded Systems Farhad Mehdipour, Hamid Noori, Koji Inoue, Kazuaki Murakami Kyushu University,

ROUTING ARCHITECTURE AND ALGORITHMS FOR A SUPERCONDUCTIVITY CIRCUITS-BASED COMPUTING HARDWARE Farhad Mehdipour, Hiroaki Honda, Hiroshi Kataoka, Koji Inoue,

Let’s Open Up New Fields for Next 10X! Koji Inoue Kyushu University, Japan

An Automated Development Framework for a RISC Processor with Reconfigurable Instruction Set Extensions Nikolaos Vassiliadis, George Theodoridis and Spiridon.

1 Field-programmable Gate Array Architectures and Algorithms Optimized for Implementing Datapath Circuits Andy Gean Ye University of Toronto.

CML Path Selection based Branching for CGRAs ShriHari RajendranRadhika Thesis Committee : Prof. Aviral Shrivastava (Chair) Prof. Jennifer Blain Christen.

1 of 14 Lab 2: Formal verification with UPPAAL. 2 of 14 2 The gossiping persons There are n persons. All have one secret to tell, which is not known to.

Cache Pipelining with Partial Operand Knowledge Erika Gunadi and Mikko H. Lipasti Department of Electrical and Computer Engineering University of Wisconsin—Madison.

Winter-Spring 2001Codesign of Embedded Systems1 Essential Issues in Codesign: Architectures Part of HW/SW Codesign of Embedded Systems Course (CE )

1 of 14 Lab 2: Design-Space Exploration with MPARM.

Univ. of TehranIntroduction to Computer Network1 An Introduction to Computer Networks University of Tehran Dept. of EE and Computer Engineering By: Dr.

1 Architecture of Datapath- oriented Coarse-grain Logic and Routing for FPGAs Andy Ye, Jonathan Rose, David Lewis Department of Electrical and Computer.

INTRODUCTION TO MICROPROCESSORS

CGRA Express: Accelerating Execution using Dynamic Operation Fusion

INTRODUCTION TO MICROPROCESSORS

INTRODUCTION TO MICROPROCESSORS

Dynamically Reconfigurable Architectures: An Overview

Masamitsu Tanaka, Nagoya Univ.

A New Design Approach for High-Throughput Arithmetic Circuits for Single-Flux-Quantum Microprocessors Masamitsu Tanaka, Nagoya Univ., JSPS Co-workers:

Presentation transcript:

Optimizing the Architecture of SFQ-RDP (Single Flux Quantum- Reconfigurable Datapath) F. Mehdipour*, Hiroaki Honda **, H. Kataoka*, K. Inoue* and K. Murakami* *Graduate School of Information Science and Electrical Engineering, Kyushu University, Japan **Institute of Systems, Information Technologies and Nanotechnologies (ISIT), Fukuoka, Japan

SSV 2009 Kyushu University 2 CREST-JST (2006~): Low-power, high-performance, reconfigurable processor using single-flux quantum circuits SFQ-LSRDP Prof. K. Murakami Dr. K. Inoue Dr. H. Honda Dr. F. Mehdipour H. Kataoka Kyushu Univ. Architecture, Compiler and Applications Dr. S. Nagasawa et al. Superconducting Research Lab. (SRL) SFQ process Prof. N. Yoshikawa et al. Yokohama National Univ. SFQ-FPU chip, cell library Prof. A. Fujimaki et al. Nagoya Univ. SFQ-RDP chip, cell library, and wiring Prof. N. Takagi (Leader) et al. Nagoya Univ. CAD for logic design and arithmetic circuits

SSV 2009 Kyushu University 3 Agenda Introduction Large-Scale Reconfigurable Data-Path (LSRDP) General Architecture and Specifications Design Procedure and Tool Chain Preliminary Results Conclusions and Future Work

SSV 2009 Kyushu University 4 Introduction For performance improvement various accelerators are used with GPPs PowerXcell, GPU, GRAPE-DR, ClearSpeed, etc. Small size and low power consumption comparing to processors with similar performance NVIDIA Tesla S1070

SSV 2009 Kyushu University 5 Acceleration Through a Data-Path Processor Mechanism Acceleration by using a data-path accelerator Augmenting the accelerator to the base processor Executes hot portions of applications on the accelerator

SSV 2009 Kyushu University 6 How a Reconfigurable Processor Works Application code Main Memory GPP critical code Non-critical code critical code Non-critical code LSRDP

SSV 2009 Kyushu University 7 Coupling an Accelerator to a Processor Coprocessor Processor RFU Memory Attached Processor Bridge Tight Coupling Loose Coupling Tight Coupling

SSV 2009 Kyushu University 8 Motivation Conventional accelerators: A large memory bandwidth is demanded in conventional accelerators for high-performance computation On chip memories are often used to hide memory access latency Large-Scale Reconfigurable Data-Path (LSRDP): is introduced as an alternative accelerator reduces the no. of memory accesses by utilizing data-path

SSV 2009 Kyushu University 9 Outline of Large-Scale Reconfigurable Data-Path (LSRDP) processor Features: Data Flow Graphs (DFGs) extracted from critical calculation parts are directly mapped Pipeline execution Burst transfer is used for input /output rearranged data from/to memory Main Memory GPP ORN : : : : ORN : Operand Routing Network... FU... FU... FU LSRDP :::...: SB SMAC Scratchpad Memory Reconfigurable data-path includes: A large number of floating point Functional Units (FUs) Arranged as arrays Reconfigurable Operand Routing Network : (ORN) Dynamic reconfiguration facilities Streaming Buffer (SB) for I/O ports

SSV 2009 Kyushu University 10 Single-Flux Quantum (SFQ) against CMOS CMOS issues: (if LSRDP has 32x32 FUs) high electric power consumption high heat radiation and difficulties in high-density packing SFQ Features: High-speed switching and signal transmission Low power consumption Compact implementation of a system (small area) No cost for latch Suitable for pipeline processing of data stream Serial bit-level processing

SSV 2009 Kyushu University 11 Goals of the Project Discovering appropriate scientific applications Developing compiler tools Developing performance analyzing tools Designing and Implementing SFQ-LSRDP architecture considering the features and limitations of SFQ circuits

LSRDP General Architecture and Specifications

SSV 2009 Kyushu University 13 Parameters Should Be Decided Within the LSRDP Design Procedure Maximum Connection Length (MCL) between consecutive rows? (impossible to implement full cross bar) PE: combination of a Functional Unit (FU) and a data Transfer Unit (TU) Reconfiguration mechanism? (PE, ORN, Immediate data) Layout: FU types (ADD/SUB and MUL)? Core structure: a rectangular matrix of PEs Width and Height ? On-chip memory configuration?

SSV 2009 Kyushu University 14 LSRDP Architecture Processing Elements FU implements basic 64-bit double-precision floating point operations including: ADD, SUB and MUL TU (transfer unit) as a routing resource for transferring data from a row to an inconsecutive row FUTU FU TU FUTU FUTUFU PE including Two components Four functionalities

SSV 2009 Kyushu University 15 Layout Types- Type I W ORN … A T M A T M A T M A T M A T M … A T M A T M A T M A T M A T M … A T M A T M A T M A T M A T M … A T M A T M A T M A T M A T M ADD/SUB MUL TU Each PE implements ADD/SUB and MUL M A T : ADD/SUB : MUL : Transfer Unit H Flexible but consume a lot of resources

SSV 2009 Kyushu University 16 W ORN … MTATATATMT … MTATATATMT … MTATATATMT … MTATATATMT Layout Types- Type II (Checkered) H Each PE implements ADD/SUB or MUL ADD/SUBTUMULTU

SSV 2009 Kyushu University 17 W ORN … MTMTMTMTMT … ATATATATAT … MTMTMTMTMT … ATATATATAT Layout Types- Type III (Striped) H Each PE implements ADD/SUB or MUL ADD/SUBTUMULTU Type II or III, which one is more efficient?

SSV 2009 Kyushu University 18 Maximum Connection Length (MCL) MCL: maximum horizontal distance between two PEs located in two subsequent rows

SSV 2009 Kyushu University 19 An ORN Structure A. Fujimaki, et al., Demonstration of an SFQ-Based Accelerator Prototype for a High-Performance Computer,” ASC08, ORN is consisted of 2-bit shift registers, 1-by-2 and 2-by-2 cross bar switches 2bit shift register ORN

SSV 2009 Kyushu University 20 Dynamic Reconfiguration Mechanism

SSV 2009 Kyushu University 21 Dynamic Reconfiguration Architecture Three bit-stream lines for dynamic reconfiguration of: Immediate registers (64bit) in each PE Selector bits for muxes selecting the input data of FUs Cross-bar switches in ORNs

Design Procedure and Tool Chain

SSV 2009 Kyushu University 23 Compiler and Design Flow DFGs are manually generated from critical parts of applications DFG mapping results are used for Analyzing LSRDP architecture statistics Generating LSRDP configuration bit-streams

SSV 2009 Kyushu University 24 Benchmark Applications for Design Procedures Finite differential method calculation of 2 nd order partial differential equations 1dim-Heat equation (Heat) 1dim-Vibration equation (Vibration) 2dim-Poisson equation (Poisson) Quantum chemistry application Recursive parts of Electron Repulsion Integral calculation (ERI-Rec) Only ADD/SUB and MUL operations are used in the critical calculations of all above applications

SSV 2009 Kyushu University 25 DFG Extraction- Heat Equation 1-dim. heat equation for T(x,t) Calculation by Finite Difference Method (FDM) (A is const.) Basic DFG corresponding to minimum FDM calculation Basic DFG can be extended to horizontal and vertical directions to make a larger DFG

SSV 2009 Kyushu University 26 Example of extracted DFGs- Heat Inputs: 32 Outputs: 16 Operations: 721 Immediates: 364 A huge sample DFG (Heat)

SSV 2009 Kyushu University 27 DFG Distribution for each application # of FUs # of Inputs Poisson (3) Vibration (7) Heat (6) ERI-Rec (8 DFGs) DFGs have different qualities in terms of the # of FUs, # of Inputs and Outputs 24DFGs

SSV 2009 Kyushu University 28 DFG Classification Due to broad range of DFG sizes DFGs are classified as S, M, L, XL with respect to their size and the number of Input/Output nodes => LSRDP designing processes for S, M, L, XL, respectively Totally, 24 DFGs are prepared for benchmark Apps.

SSV 2009 Kyushu University 29 Mapping DFGs onto LSRDP Longest connections

SSV 2009 Kyushu University 30 LSRDP Design Procedure For each parameter Appropriate values for all parameters DFGs & LSRDP HW constraints

Preliminary Results

SSV 2009 Kyushu University 32 LSRDP Specifications: Width & Height # of Input ports # of Output ports WidthHeight LSRDP-S LSRDP-M LSRDP-L LSRDP Dimensions and the number of input/output ports

SSV 2009 Kyushu University 33 LSRDPMCL (avg/max) ORN Size- No of Inps (avg/max), Outs LSRDP-S4/818/34, 3 LSRDP-M5/922/38, 3 LSRDP-L5/922/34, 3 LSRDP Specifications: MCL Further MCL optimization needed

SSV 2009 Kyushu University 34 Analyzing Various LSRDP Layouts Layout II can be used instead of Layout I to obtain a smaller LSRDP (Except ERI1 DFG which gives better size for Layout III)

SSV 2009 Kyushu University 35 LSRDP at One Glance (1/2) Functional unitsADD/SUB, MUL LayoutType II (checker pattern) Operations64-bit floating point Processing structurePipelined PE structureFU, T, FU+T, T+T LSRDP SizeSmallMediumLarge No. of inp/out ports19/12 38/24 Width/Height16/1632/1664/32 Conf. bit-stream size Imm. Regs16*16*6432*16*6464*32*64 ORNs16*BSS(ORN)32* BSS(ORN)64*BSS(ORN) PEs16*16* 232*16*264*32* 2 ORNinputs, outputs22, 326, 3 StructureCross-bar switch Conn. TypeOne-directional

SSV 2009 Kyushu University 36 LSRDP at One Glance (2/2) Internal memoryTypeImmediate registers Size and count64-bit registers, One reg. for each PE Communication mechanismSerial External memoryNo. of memory modules16 Date trans. rate1800Mbps/pin Overall data trans. rate24 GB/s Mem. to LSRDP bus width64 bit Channels per moduleTwo Reconf. mechanismBit serial configuration through a serial chain

SSV 2009 Kyushu University 37 Preliminary Performance Evaluation Processor typeOut-of-order GPP operating frequency3.2GHz Inst. issue width4 instruction/cc Inst. decode width4 instruction/cc Cache configurationL1 data64KB(128B Entry, 2way, 2cc) L1 instruction64KB(64B Entry, 1way, 1cc) L2 unified4MB(128B Entry, 4way, 16cc) Latency of main memory300cc L2 to main memoryBus width64 Bytes Freq800 MHz LSRDP operating frequency80 GHz Reconfiguration Latency1cc Latency SPM  LSRDP latency 1cc Latency Main Memory  SPM 7500cc Bandwidth SPM  LSRDP Max. 64 * 8 Bytes/cc Bandwidth Main Memory  SPM 102.4GB/sec Base processor configuration GPP+LSRDP configuration GPP ： Exec. time measurement by means of a processor simulator LSRDP ： Estimation by performance modeling

SSV 2009 Kyushu University 38 Preliminary Performance Evaluation (Heat) Data reusing is employed to avoid the need for data rearrangement as well as frequently data retrieval from the scratchpad memory. Basic: SB only Reuse: SB + SPM

SSV 2009 Kyushu University 39 Preliminary Performance Evaluation (Poisson) A small fraction is related to processing time on LSRDP and the main fraction concerns to various overhead times as well as the execution time on GPP

SSV 2009 Kyushu University 40 Conclusions & Future Work A high-performance computer comprising an accelerator (LSRDP) implemented by superconducting circuits was introduced. 24 benchmark Data Flow Graphs (DFGs) were manually generated. LSRDP micro-architecture is designed based on characteristics of scientific applications via a quantitative approach. LSRDP is promising for resolving issues originated from CMOS technology as well as achieving considerable performances. Future Work: To achieve higher performance it is required to reduce various overhead costs mainly related to data management part. To reduce the implementation cost of LSRDP, we will focus on reducing maximum connection length and ORN size.

SSV 2009 Kyushu University 41 Acknowledgement This research was supported in part by Core Research for Evolutional Science and Technology (CREST) of Japan Science and Technology Corporation (JST).

SSV 2009 Kyushu University 42 2x3 RDP processor prototype 8-bit ALUs implementing: ADD, SUB, AND, OR, XOR 25GHz Frequency 6-bit Data transfer shift registers 16-bit I/O shift registers 21 Pipeline stages 7-bit Data width Area: 6.84 x 6.72 mm 2 Total number of Junctions ： 14040JJs Bias current ： 1.652A Fujimaki, et al., Demonstration of an SFQ-Based Accelerator Prototype for a High-Performance Computer,” ASC08, 2008