Performance Evaluations of Finite Difference Applications Realized on a Single Flux Quantum Circuits-Based Reconfigurable Accelerator Hiroaki Honda 1,

Presentation transcript:

Performance Evaluations of Finite Difference Applications Realized on a Single Flux Quantum Circuits-Based Reconfigurable Accelerator
Hiroaki Honda (1), Farhad Mehdipour (2), Hiroshi Kataoka (1), Koji Inoue (1), and Kazuaki J. Murakami (1)
(1) Department of Advanced Information Technology, Kyushu University, Fukuoka, Japan
(2) Center for Japan-Egypt Cooperation in Science and Technology, Kyushu University, Fukuoka, Japan

Agenda
Introduction
  Single-flux quantum (SFQ) circuit
  SFQ-reconfigurable data-path (RDP) processor
  Objective
Implementing an Application on SFQ-RDP
  Tool chain
  Code modification
  DFG extraction and mapping
Performance Evaluation
  Comparison with GPU and GPP results
Conclusions

Top500 Supercomputer Ranking and Projection
1 ExaFlop/s (= 10^9 GFlop/s) can be attained in ~2019, and 10 ExaFlop/s in ~2022?? (only in the next ten years)
The PetaFlop/s (= 10^6 GFlop/s) world started in 2009: a 1000-times speedup in 10 years
[Figure: Top500 performance projection, marked at 1 EFlop/s and at 10 EFlop/s (2022)]

Energy Consumption Estimation for Floating Point Units (FPUs)
Power / [1 FPU (2 GHz)] is larger than 10 mW (CMOS, ~8 nm in ~2019) 1)
Power / [1 GFlop/s] is larger than 5 mW
Energy consumption of the FPUs for a 10 ExaFlop/s system is larger than 5 mW * 10 * 10^9 = 50 MW !! (1 ExaFlop/s = 10^9 GFlop/s)
Additional power is consumed by memory, network, storage, ...
It is extremely power consuming to construct a 10 ExaFlop/s supercomputer system with CMOS circuit processors
1) p178
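The arithmetic on this slide can be checked with a few lines of Python. This is only a sketch of the slide's own estimate: the 5 mW per GFlop/s figure is the slide's lower bound, and the function and constant names are illustrative, not taken from the presentation.

```python
# Back-of-the-envelope check of the FPU power estimate on the slide.
# Names are hypothetical; the 5 mW figure is the slide's lower bound.
MILLIWATT_PER_GFLOPS = 5.0      # power per 1 GFlop/s of FPU throughput
GFLOPS_PER_EXAFLOPS = 1e9       # 1 ExaFlop/s = 10^9 GFlop/s


def fpu_power_megawatts(exaflops, milliwatt_per_gflops=MILLIWATT_PER_GFLOPS):
    """Lower bound on FPU power (in MW) for a system of the given ExaFlop/s."""
    milliwatts = milliwatt_per_gflops * exaflops * GFLOPS_PER_EXAFLOPS
    return milliwatts / 1e9     # 1 MW = 10^9 mW


if __name__ == "__main__":
    print(fpu_power_megawatts(10), "MW for the FPUs of a 10 ExaFlop/s system")
```

The estimate counts FPUs only; as the slide notes, memory, network, and storage add to it.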

Single-Flux Quantum (SFQ) Circuit
Pulse logic: bit-serial/slice description for 32/64 bits
Advantages
  Ultra-high-speed switching
  Ultra-low power
  No cost for latch
  Suitable for pipeline processing
  x10~100 faster operation, x~1/10 energy consumption
Disadvantages
  Difficult to implement feedback loops and conditional branches
  No practical SFQ memory
[Figure: superconductivity loop with a Josephson junction; SFQ pulse (quantized magnetic flux), 2~3 ps, ~1 mV]

Single-Flux Quantum-Reconfigurable Data Path (SFQ-RDP) Computer
Large-scale two-dimensional floating-point unit array, data-path architecture
Reconfigurable Operand Routing Network (ORN)
Dynamically reconfigurable PEs and ORNs
No on-chip memory
Data flow is unidirectional; no feedback loop
Minimal amount of control circuits
2-port input / 1-port output data accesses
~2.5 TFLOPS/chip
PE: one FPU and data through units
ORN: network connecting PEs to PEs

CREST-JST SFQ-RDP Project (2006~): A Low-Power, High-Performance Reconfigurable Processor Based on Single-Flux Quantum Circuits
Goals:
  Discovering appropriate computation-intensive scientific applications
  Developing compiler tools
  Developing performance evaluation tools
  Designing the SFQ-LSRDP architecture
Participants (SFQ-RDP):
  Yokohama National Univ.: SFQ-FPU chip, cell library
  Kyushu Univ.: architecture, compiler, and applications
  Nagoya Univ.: SFQ-RDP chip, cell library, and wiring
  Nagoya Univ.: CAD for logic design and arithmetic circuits
  Superconducting Research Lab. (SRL): SFQ process

Prototype 2x3 SFQ-RDP Processor and SFQ-MUL FPU
2x3 SFQ-RDP processor 1)
  8-bit ALUs implementing: ADD, SUB, AND, OR, XOR
  Frequency: 25 GHz
  Process: 2 μm
  Area: 6.84 x 6.72 mm^2
  Power: 4.1 mW
SFQ floating-point multiplier 2)
  FPUs: adder, multiplier
  MUL frequency: 32 GHz
  Performance: 2.6 GFLOPS
  Number of junctions: JJs
  Power consumption: 3.5 mW
  Circuit area: 6.22 x 3.78 mm^2
1) Fujimaki, et al., "Demonstration of an SFQ-Based Accelerator Prototype for a High-Performance Computer," ASC08
2) H. Hara, et al., "Design and Implementation of SFQ Half-Precision Floating-Point Multipliers," ASC08

Objectives
Performance evaluations by implementing practical applications, showing the possibility of efficient computations on an SFQ-RDP computer system
Applications: 2D-diffusion, 2D Finite-Difference Time-Domain (2D-FDTD)
Comparisons of execution times with GPP and GPU
Points
  2D-FPU array, data-flow architecture: Data Flow Graphs (DFGs) are extracted from applications and mapped onto the SFQ-RDP
  Compiler tools: compiler tools have to be developed
  No on-chip memory: DMA transfer of DRAM has to be fully used to avoid random accesses
  Dynamically reconfigurable PEs and ORNs: one-time reconfiguration is enough for both the Diffusion and FDTD applications

Tool Chain for Implementation of an Application on SFQ-RDP
Application: C/Fortran code
Code modification using SFQ-RDP API -> Modified code
Compiler developed for SFQ-RDP -> Object code (GPP)
Data Flow Graph (DFG) extraction (semi-manual) -> Extracted DFG
Placement and routing tool -> RDP configuration file (SFQ-RDP)
Input: RDP library file (functions definition & declaration), RDP architecture description
The tool chain is almost complete

Implementing an Application on SFQ-RDP: 2D Diffusion
Basic Finite Difference Method (FDM) formula
Time development calculation by FDM (time = n points to n+1)
[Figure: grid with n-axis (time), x-axis (space, index i), and y-axis (space, index j)]
In/Out: 5 / 1, Ops: 7

Code Implementation and Modification for SFQ-RDP

Original Code for GPP (n ⇒ n+1)
  loop n
    loop i, j
      f(n+1)[i,j] = C0 * ( f(n)[i-1,j] + f(n)[i+1,j] ) + C1 * ( f(n)[i,j-1] + f(n)[i,j+1] ) + C2 * f(n)[i,j]
  end
  Extracted DFG: In/Out 5 / 1, Ops 7, Byte/Flop 3.4

Unrolled Loop Code for SFQ-RDP (n ⇒ n+1), 9 formulas in loop body
  loop n
    loop i, j, (+3, +3)
      f(n+1)[i,j] = C0 * ( f(n)[i-1,j] + f(n)[i+1,j] ) + C1 * ( f(n)[i,j-1] + f(n)[i,j+1] ) + C2 * f(n)[i,j]
      f(n+1)[i+1,j] = C0 * ( f(n)[i,j] + f(n)[i+2,j] ) + C1 * ( f(n)[i+1,j-1] + f(n)[i+1,j+1] ) + C2 * f(n)[i+1,j]
      f(n+1)[i+2,j] = …
      …
      f(n+1)[i+2,j+2] = …
  end
  Extracted DFG: In/Out 21 / 9, Ops 7 * 9, Byte/Flop 1.9
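The two loop forms on this slide can be compared in a short Python sketch. It is an illustration under stated assumptions, not the authors' SFQ-RDP code: boundary values are held fixed, the interior size is required to be a multiple of 3, and the function names (`step_reference`, `step_unrolled`, `tile_inputs`) are hypothetical. The inner 3x3 tile stands in for the nine unrolled formulas.

```python
def update(f, i, j, c0, c1, c2):
    """The 5-point formula from the slide: 3 multiplications and 4 additions."""
    return (c0 * (f[i - 1][j] + f[i + 1][j])
            + c1 * (f[i][j - 1] + f[i][j + 1])
            + c2 * f[i][j])


def step_reference(f, c0, c1, c2):
    """One time step n -> n+1, one output per loop iteration."""
    g = [row[:] for row in f]           # boundary values are copied unchanged
    for i in range(1, len(f) - 1):
        for j in range(1, len(f[0]) - 1):
            g[i][j] = update(f, i, j, c0, c1, c2)
    return g


def step_unrolled(f, c0, c1, c2):
    """Same step with i and j advanced by 3: nine outputs per loop body."""
    if (len(f) - 2) % 3 or (len(f[0]) - 2) % 3:
        raise ValueError("interior size must be a multiple of 3")
    g = [row[:] for row in f]
    for i in range(1, len(f) - 1, 3):
        for j in range(1, len(f[0]) - 1, 3):
            for a in (i, i + 1, i + 2):         # the nine formulas of the tile
                for b in (j, j + 1, j + 2):
                    g[a][b] = update(f, a, b, c0, c1, c2)
    return g


def tile_inputs(i, j):
    """Distinct grid points read by the 3x3 tile whose first point is (i, j)."""
    points = set()
    for a in (i, i + 1, i + 2):
        for b in (j, j + 1, j + 2):
            points.update({(a, b), (a - 1, b), (a + 1, b), (a, b - 1), (a, b + 1)})
    return points


if __name__ == "__main__":
    grid = [[float((3 * r + 5 * c) % 7) for c in range(8)] for r in range(8)]
    same = step_unrolled(grid, 0.1, 0.1, 0.6) == step_reference(grid, 0.1, 0.1, 0.6)
    print("unrolled matches reference:", same)
    print("inputs per 3x3 tile:", len(tile_inputs(1, 1)))
```

Neighbouring outputs share inputs, so the tile reads 21 distinct values for 9 outputs instead of 9 * 5 = 45, which is why the In/Out count on the slide changes from 5 / 1 to 21 / 9.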

Mapping Extracted DFG onto SFQ-RDP
Placement and routing: extracted DFG -> DFG mapping result -> RDP configuration data

Improving Data Access Efficiency: Data Structure Conversion for DMA Transfer
The unrolled loop includes 21 inputs and 9 outputs per calculation
Before conversion: f[i,j] is read by random memory accesses
Data structure conversion: all two-dimensional f[i,j] values are divided and stored as two one-dimensional arrays, A[] and B[]
15 (A) + 15 (B) input data are accessed via the two input ports; 9 output data are accessed
After conversion: A[i] and B[i] are read by sequential memory accesses, so DMA transfer can be used
Double buffering
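The idea of trading random accesses for a sequential stream can be sketched as follows. This is a simplified illustration, not the layout used on the slide: it packs the 21 inputs of each 3x3 tile into one flat buffer, whereas the presentation splits the data into two arrays A[] and B[] of 15 values each per tile, a layout the transcript does not spell out. All names (`footprint`, `tile_origins`, `pack_for_dma`) are hypothetical.

```python
# Neighbour offsets of the 5-point formula (centre, up, down, left, right).
OFFSETS = ((0, 0), (-1, 0), (1, 0), (0, -1), (0, 1))


def footprint(i, j):
    """Row-major list of the 21 grid points read by the 3x3 tile at (i, j)."""
    points = {(a + da, b + db)
              for a in range(i, i + 3)
              for b in range(j, j + 3)
              for da, db in OFFSETS}
    return sorted(points)


def tile_origins(rows, cols):
    """First point of every 3x3 tile covering the interior of the grid."""
    return [(i, j) for i in range(1, rows - 1, 3) for j in range(1, cols - 1, 3)]


def pack_for_dma(f, tiles):
    """Copy every tile's inputs, in processing order, into one flat buffer.

    A consumer can then read 21 consecutive values per tile instead of
    gathering them from scattered addresses.
    """
    return [f[a][b] for (i, j) in tiles for (a, b) in footprint(i, j)]


if __name__ == "__main__":
    grid = [[10 * r + c for c in range(8)] for r in range(8)]
    tiles = tile_origins(8, 8)
    stream = pack_for_dma(grid, tiles)
    print(len(tiles), "tiles,", len(stream), "values in the stream")
```

Packing duplicates the values shared between neighbouring tiles, so the stream is longer than the grid itself; the sketch accepts that cost in exchange for strictly sequential reads.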

Performance Evaluation
GPP: simulation by a cycle-accurate processor simulator
SFQ-RDP: performance evaluation modeling
Estimation of execution times

System Configuration
GPP
  Processor type: Out-of-Order
  Freq.: 3.2 GHz
  Inst. issue width: 4 Inst./CC
  L1 data cache: 64 KB
  L2 unified cache: 4 MB
  Latency of main mem.: 300 CC
RDP
  Freq. (SFQ-RDP): 80 GHz
  Reconfiguration latency: CC
  Main mem. bandwidth *: 141.7, GB/s
  No. PEs in a row: 22
  No. PEs in a column: 15
* BW numbers are based on ones for GPU calculation

System Architecture
[Figure: system architecture with 2 input / 1 output ports]

Results of Performance Evaluation
[Table: SFQ-RDP (GFlop/s), GPU (GFlop/s), Ratio (by GPU), and Ratio (by GPP) for 2D-Diffusion, 2D-FDTD, D-Diffusion, and 1D-Vibration; the last two rows have entries marked ---]
Comparable results to GPU
The SFQ-RDP processor, which is implemented with superconductivity circuits and a simple 2D-array architecture, can be used as an efficient accelerator
1) T. Aoki, et al., "CUDA programming primer," (Japanese), Kougakusya, ISBN-10:
2) N. Takada, et al., "Speeding up of FDTD finite difference calculations by efficient use of GPU and shared memory," (Japanese), Proceedings of Forum of Information Science and Technology
3) H. Kataoka, et al., "Reducing Preprocessing Overhead Times in a Reconfigurable Accelerator of Finite Difference Applications," SAAHPC 10, Jul

Why Can We Achieve Comparable Results?
Columns: # of operations; # of I/O 1); Byte/Flop; estimation of GFlop/s 2) (max. BW 159.0 GB/s)
RDP calc.
  Original formula (1 output): 7 operations; I/O 5+1 = 6; Byte/Flop 6*4/7 = 3.42; ~4.7 GFlop/s (random access: ~16 GB/s)
  Unrolled loop formula (9 outputs): 7*9 = 63 operations; I/O 21+9 = 30; Byte/Flop 30*4/63 = 1.90; ~8.4 GFlop/s (random access: ~16 GB/s)
  Data structure conversion for DMA transfer: 7*9 = 63 operations; I/O 21+9 = 30; Byte/Flop 30*4/63 = 1.90; (DMA: ~159.0 GB/s)
  With GPP calc., comm., and other overheads: 50.6 (DMA: ~159.0 GB/s)
GPU calc.
  Aoki et al. 3) 4)
1) Based on the utilization of HW for rearrangement of input data
2) Single-precision calculation, BW 159.0 GB/s, GeForce GTX 285
3) GeForce GTX 285, 1 proc. calculation (1024x1204 mesh)
4) T. Aoki, et al., "CUDA programming primer," (Japanese), Kougakusya, ISBN-10:
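The Byte/Flop and GFlop/s columns on this slide follow from a simple bandwidth-bound model: bytes moved per operation, then bandwidth divided by that ratio. The sketch below reproduces that arithmetic; the function names are hypothetical, and the model deliberately ignores the GPP, communication, and other overheads the slide lists separately.

```python
def byte_per_flop(n_inputs, n_outputs, n_ops, bytes_per_word=4):
    """Bytes moved per floating-point operation (single precision: 4 bytes)."""
    return (n_inputs + n_outputs) * bytes_per_word / n_ops


def bandwidth_bound_gflops(bandwidth_gb_s, bytes_per_flop):
    """Upper bound on GFlop/s when memory bandwidth is the only limit."""
    return bandwidth_gb_s / bytes_per_flop


if __name__ == "__main__":
    original = byte_per_flop(5, 1, 7)        # one output per loop body
    unrolled = byte_per_flop(21, 9, 7 * 9)   # nine outputs per loop body
    print("Byte/Flop, original:", original)
    print("Byte/Flop, unrolled:", unrolled)
    print("random access, original:", bandwidth_bound_gflops(16.0, original))
    print("random access, unrolled:", bandwidth_bound_gflops(16.0, unrolled))
    print("DMA, unrolled:", bandwidth_bound_gflops(159.0, unrolled))
```

With ~16 GB/s of random-access bandwidth the model gives about 4.7 and 8.4 GFlop/s, matching the first two rows of the slide. Loop unrolling lowers Byte/Flop, and the data structure conversion raises the usable bandwidth from ~16 GB/s to ~159.0 GB/s, so the two changes multiply.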

Conclusions and Future Works
Conclusions
  A Single-Flux Quantum Reconfigurable Data-Path (SFQ-RDP) processor with a two-dimensional floating-point unit array architecture, implemented with superconducting circuits, was introduced.
  Two-dimensional heat (2D-Heat) and finite-difference time-domain (2D-FDTD) applications were implemented on the SFQ-RDP, and performance evaluations were conducted.
  For 2D-Heat and 2D-FDTD, computation 50.6 and 79.0 times faster, respectively, than on a general-purpose processor was achievable, and these performance values were comparable to reported GPU results.
  The SFQ-RDP accelerator can be used for practical scientific calculations, especially those based on finite difference methods.
Future Works
  Implementations and performance evaluations of other applications

Acknowledgement
This research was supported by Core Research for Evolutional Science and Technology (CREST) of Japan Science and Technology Corporation (JST).
Other SFQ-RDP research members
  CAD for logic design and arithmetic circuits: Prof. N. Takagi (Leader), Prof. K. Takagi (Kyoto Univ.)
  SFQ-RDP chip, cell library, and wiring: Prof. A. Fujimaki, Prof. H. Akaike, Prof. M. Tanaka (Nagoya Univ.)
  SFQ-FPU chip, cell library: Prof. N. Yoshikawa (Yokohama National Univ.)
  SFQ process: Dr. S. Nagasawa, Dr. M. Hidaka (SLRC)