Published by Kristopher Byrd. Modified over 9 years ago.
1
Developing an Architecture for a Single-Flux Quantum Based Reconfigurable Accelerator
F. Mehdipour, Hiroaki Honda*, H. Kataoka, K. Inoue and K. Murakami
Graduate School of Information Science and Electrical Engineering, Kyushu University, Japan
*Institute of Systems, Information Technologies and Nanotechnologies (ISIT), Fukuoka, Japan
E-mail: farhad@c.csce.kyushu-u.ac.jp
2
2009/01/29
Agenda
- Introduction
- SFQ-LSRDP General Architecture
- The Design Procedure and Tool Chain
- Input/Output Nodes Placement
- Area Minimization
- Experimental Results
- Conclusions
3
CREST-JST SFQ-RDP Project (2006~): A low-power, high-performance reconfigurable processor based on single-flux quantum circuits (SFQ-LSRDP)
- Prof. K. Murakami et al., Kyushu Univ.: architecture, compiler and applications
- Dr. S. Nagasawa et al., Superconducting Research Lab. (SRL): SFQ process
- Prof. N. Yoshikawa et al., Yokohama National Univ.: SFQ-FPU chip, cell library
- Prof. A. Fujimaki et al., Nagoya Univ.: SFQ-RDP chip, cell library, and wiring
- Prof. N. Takagi (Leader) et al., Nagoya Univ.: CAD for logic design and arithmetic circuits
4
Goals
- Discovering appropriate scientific applications
- Developing compiler tools
- Developing performance analysis tools
- Designing and implementing the SFQ-LSRDP architecture, considering the features and limitations of SFQ circuits
5
How a reconfigurable processor works
[Figure: application code divided into computation-intensive (critical) code and non-critical code; block diagram with main memory, the GPP, and the LSRDP (rows of PEs with ORNs between them).]
6
Single-flux quantum (SFQ) against CMOS
CMOS main issues in implementing a large accelerator:
- High electric power consumption
- High heat radiation
- Difficulties in high-density packing
SFQ features:
- High-speed switching and signal transmission
- Low power consumption
- Compact implementation (smaller area)
- Suitable for pipeline processing of data streams
7
Outline of large-scale reconfigurable data-path (LSRDP) processor
Features:
- Handling data flow graphs (DFGs) extracted from scientific applications
- Pipeline execution
- Burst transfer of rearranged input/output data from/to memory
- Reduced number of memory accesses (alleviating the memory wall problem)
Reconfigurable data-path components:
- A matrix of a large number of floating-point Functional Units (FUs)
- Reconfigurable Operand Routing Network (ORN)
- Dynamic reconfiguration facilities
- Streaming Buffer (SB) for I/O ports
[Figure: block diagram with main memory, GPP, scratchpad memory, SMAC, SB, and the LSRDP (rows of PEs separated by ORNs). ORN: Operand Routing Network.]
8
SFQ-LSRDP General Architecture
9
LSRDP architecture: Processing Elements
- FU (Functional Unit): implements basic 64-bit double-precision floating-point operations, including ADD/SUB and MUL
- TU (Transfer Unit): a routing resource for transferring data between non-consecutive rows
- A PE includes two components (FU and TU) and offers four functionalities
[Figure: a PE with its input and output ports, and an example mapping that uses a MUL node and a TU.]
10
PE structures
- PE basic arch.: 3 inputs / 2 outputs
- PE arch. I: 4 inputs / 3 outputs
- PE arch. II: 3 inputs / 3 outputs
[Figure: FU and TU arrangement and usable FU/TU combinations for each PE architecture.]
11
Layout types: Type I
- Each PE implements ADD/SUB and MUL
- Flexible but consumes a lot of resources
[Figure: W x H array of PEs with an ORN between rows; every PE contains A, M and T. A: ADD/SUB, M: MUL, T: Transfer Unit.]
12
Layout types: Type II
- Each PE implements ADD/SUB or MUL
[Figure: W x H array of PEs with an ORN between rows; every row follows the pattern MT AT AT AT MT (ADD/SUB+TU and MUL+TU PEs).]
13
Maximum connection length (MCL): Definition
MCL: the maximum horizontal distance between two connected PEs located in two subsequent rows
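The MCL definition can be illustrated with a minimal sketch. It assumes a placement given as (row, column) coordinates and a list of connections between adjacent rows; the function name, the data layout and the sample values are hypothetical and not part of the presented tool chain.

```python
def max_connection_length(placement, edges):
    """Return the MCL of a placed DFG.

    placement: dict mapping node name -> (row, column)   (assumed layout)
    edges:     iterable of (source, destination) pairs
    """
    mcl = 0
    for src, dst in edges:
        src_row, src_col = placement[src]
        dst_row, dst_col = placement[dst]
        # The definition only covers PEs in two subsequent rows.
        if dst_row - src_row != 1:
            raise ValueError(f"{src}->{dst} does not connect subsequent rows")
        mcl = max(mcl, abs(dst_col - src_col))
    return mcl


# Hypothetical three-node example: two PEs in row 0 feed one PE in row 1.
placement = {"a": (0, 0), "b": (0, 3), "c": (1, 1)}
edges = [("a", "c"), ("b", "c")]
print(max_connection_length(placement, edges))  # 2
```

Connections that span more than one row would be split by TUs before this measure applies, which is why the sketch rejects them.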
14
An ORN structure
The ORN consists of 2-bit shift registers and 1-by-2 and 2-by-2 crossbar switches.
[Figure: ORN built from 2-bit shift registers.]
Source: A. Fujimaki, et al., "Demonstration of an SFQ-Based Accelerator Prototype for a High-Performance Computer," ASC08, 2008.
15
Dynamic reconfiguration architecture
Three bit-stream lines for dynamic reconfiguration of:
- Immediate registers (64-bit) in each PE
- Selector bits for the muxes selecting the input data of FUs
- Crossbar switches in ORNs
16
What should be decided during the design procedure
- Maximum Connection Length (MCL)?
- ORN size and structure?
- Reconfiguration mechanism? (PE, ORN, immediate data)
- Layout: FU types (ADD/SUB and MUL)? Width and height?
- On-chip memory configuration?
- The number of I/O ports?
17
The Design Procedure and Tool Chain
18
Compiler and design flow
- DFGs are manually generated
- DFG mapping results are employed for:
  - Analyzing LSRDP architecture statistics (a quantitative approach)
  - Generating LSRDP configuration bit-streams
19
Benchmark applications
- Finite difference method calculation of 2nd-order partial differential equations
  - 1-dim. heat equation (Heat)
  - 1-dim. vibration equation (Vibration)
  - 2-dim. Poisson equation (Poisson)
- Quantum chemistry application
  - Recursive parts of Electron Repulsion Integral calculation (ERI-Rec)
- Types of operations in the calculations: ADD/SUB and MUL
20
DFG extraction: Heat equation
- 1-dim. heat equation for T(x,t)
- Calculation by the Finite Difference Method (FDM) (A is const.)
- The basic DFG can be extended in the horizontal and vertical directions to make a larger DFG
[Figure: basic DFG of one FDM update.]
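As a hedged illustration of the kind of computation such a DFG encodes, the sketch below applies the textbook explicit finite-difference update for the 1-dim. heat equation, T_new[i] = T[i] + A * (T[i+1] - 2*T[i] + T[i-1]). This exact form, the fixed boundary values and the name `heat_step` are assumptions rather than details taken from the slide. The update uses only ADD/SUB and MUL, which matches the operation types listed for the benchmarks.

```python
def heat_step(T, A):
    """One explicit FDM time step for the 1-dim. heat equation.

    T: list of temperatures at the grid points (boundaries kept fixed,
       which is an assumption of this sketch)
    A: constant coefficient (combines diffusivity, dt and dx)
    """
    new_T = list(T)
    for i in range(1, len(T) - 1):
        # Only ADD/SUB and MUL operations are needed per grid point.
        new_T[i] = T[i] + A * (T[i + 1] - 2.0 * T[i] + T[i - 1])
    return new_T


print(heat_step([0.0, 1.0, 0.0], 0.25))  # [0.0, 0.5, 0.0]
```

Applying the same update to more grid points widens the graph, and chaining several time steps deepens it, which is one way to read the horizontal and vertical extension of the basic DFG.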
21
A sample DFG (Heat)
- Inputs: 32
- Outputs: 16
- Operations: 721
- Immediates: 364
22
DFG mapping flow
[Figure: a mapped DFG with the longest connections highlighted; MCL = 2.]
23
Placing Input/Output Nodes
24
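Fan-out based I/O nodes placement
- $n_i$: the number of children of input node $i$
- $C_{i,1}, C_{i,2}, C_{i,3}, \dots, C_{i,n_i}$: locations of the children of input node $i$
- $X$: location of input node $i$
- Total Connection Length: $TCL = |C_{i,1} - X| + |C_{i,2} - X| + \dots + |C_{i,n_i} - X|$
- Objective: minimize TCL
  - $n_i = 1$: $X = C_{i,1}$
  - $n_i = 2$: $C_{i,1} \le X \le C_{i,2}$
  - $n_i = 3$: $X = C_{i,2}$
  - $n_i \ge 2$: $X = C_{i,j}$, $j = 2, \dots, n_i - 1$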
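The cases listed for small n_i agree with a general property of sums of absolute distances: TCL is minimized when X is a median of the child locations. The sketch below is an illustration of that property, not the authors' placement code; the function names and sample coordinates are hypothetical.

```python
def total_connection_length(x, children):
    """TCL = sum of |C_ij - X| over all children of one input node."""
    return sum(abs(c - x) for c in children)


def best_input_position(children):
    """Return a position minimizing TCL: a median of the child locations.

    For an even number of children every point between the two middle
    children is optimal; this sketch returns the lower of the two.
    """
    ordered = sorted(children)
    return ordered[(len(ordered) - 1) // 2]


children = [2, 5, 9]            # hypothetical column indices of the children
x = best_input_position(children)
print(x, total_connection_length(x, children))  # 5 7
```

For three children the result is the middle child, matching the n_i = 3 case, and for two children any X between them gives the same TCL.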
25
One main reason for the large MCL
Input ports are far from each other.
26
Proximity-factor based placement
- The proximity factor indicates how far apart a pair of input ports should be located
- For a pair of input nodes: the larger the number of close common descendants, the higher the proximity factor assigned
- $S_{i,j}$: the set of common descendants of input nodes $i$ and $j$
- $D_{k,i}$ ($= D_{k,j}$): distance of common descendant node $k$ from input nodes $i$ and $j$ (equal to the ASAP execution level of the node)
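The slide defines S_{i,j} and D_{k,i} but this transcript does not carry the proximity-factor formula itself. The sketch below therefore uses a stand-in score, the sum of 1/D_k over the common descendants, chosen only because it grows with the number of common descendants and with their closeness. That scoring rule, the function names and the sample graph are all assumptions.

```python
from itertools import combinations


def reachable(children, start):
    """All descendants of `start` in a DAG given as node -> list of children."""
    seen, stack = set(), list(children.get(start, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(children.get(node, ()))
    return seen


def asap_levels(children, inputs):
    """ASAP level: 0 for inputs, else 1 + the largest level among parents."""
    parents = {}
    for src, dsts in children.items():
        for dst in dsts:
            parents.setdefault(dst, []).append(src)
    levels = {i: 0 for i in inputs}

    def level(node):
        if node not in levels:
            levels[node] = 1 + max(level(p) for p in parents[node])
        return levels[node]

    for node in parents:
        level(node)
    return levels


def proximity_factors(children, inputs):
    """ASSUMED score: sum of 1/D_k over the common descendants S_ij."""
    levels = asap_levels(children, inputs)
    reach = {i: reachable(children, i) for i in inputs}
    return {
        (i, j): sum(1.0 / levels[k] for k in reach[i] & reach[j])
        for i, j in combinations(inputs, 2)
    }


# Hypothetical DFG: I1 and I2 feed node a; a and I3 feed node b.
dfg = {"I1": ["a"], "I2": ["a"], "I3": ["b"], "a": ["b"]}
pf = proximity_factors(dfg, ["I1", "I2", "I3"])
print(pf[("I1", "I2")], pf[("I1", "I3")])  # 1.5 0.5
```

Under this stand-in score I1 and I2 would be placed next to each other before I3 is considered, since they share the closest descendant.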
27
Proximity factor: Example
Input nodes I1 and I2 should be located closer to each other than to I3.
[Figure: example DFG with input nodes I1, I2, I3 and operation nodes 1 to 7.]
28
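Input nodes placement alg.: Example
Rule for placing input node j: if C(l) > C(r) then l = l+1, L[l] = j; else r = r+1, L[r] = j
- Placing the 1st input node with the highest proximity factor: node 1 takes the center port, N/2
- Placing the 2nd input node with the highest proximity factor: node 2 takes port N/2-1, next to node 1
[Figure: row of input ports ..., N/2-3, N/2-2, N/2-1, N/2, N/2+1, N/2+2, N/2+3, ... before and after each step.]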
29
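Input ports placement alg.: Example
Placing the i-th input node, with the already placed nodes (..., 2, 1, 3, ...) occupying the ports between the left free port N/2-K (pointer l) and the right free port N/2+M (pointer r):
- If C(l) > C(r): node i is placed at the left free port, N/2-K
- If C(r) > C(l): node i is placed at the right free port, N/2+M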
30
Area Minimization
31
Estimating the area of a PE
Assumptions:
- Area(FU) = Area(ADD/SUB) = Area(MUL)
- Area(TU) = Area(MUX) ~ 0.1 x Area(FU)

| PE architecture | Layout I: Area(PE) | Layout II: Area(PE) |
|---|---|---|
| PE basic arch. | 2.1 x Area(FU) | 1.1 x Area(FU) |
| PE arch. I | 2.2 x Area(FU) | 1.2 x Area(FU) |
| PE arch. II | 2.2 x Area(FU) | 1.2 x Area(FU) |

[Figure: FU and TU blocks of each PE architecture, with op/sel mux inputs A, B, C.]
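The PE area figures follow from simple bookkeeping in units of Area(FU). The sketch below reproduces them under two assumptions that are inferred from the numbers rather than stated on the slide: a Layout I PE holds two FU-sized blocks (ADD/SUB and MUL) and a Layout II PE holds one, and the basic PE adds one TU/MUX-sized unit while PE arch. I and II add two.

```python
AREA_FU = 1.0   # unit area: Area(FU) = Area(ADD/SUB) = Area(MUL)
AREA_TU = 0.1   # Area(TU) = Area(MUX) ~ 0.1 x Area(FU)

# Inferred block counts (assumptions of this sketch, see lead-in).
FUS_PER_LAYOUT = {"I": 2, "II": 1}
SMALL_UNITS_PER_ARCH = {"basic": 1, "arch_I": 2, "arch_II": 2}


def pe_area(pe_arch, layout):
    """Estimated PE area in units of Area(FU)."""
    area = (FUS_PER_LAYOUT[layout] * AREA_FU
            + SMALL_UNITS_PER_ARCH[pe_arch] * AREA_TU)
    return round(area, 1)


for arch in ("basic", "arch_I", "arch_II"):
    print(arch, pe_area(arch, "I"), pe_area(arch, "II"))
```

The loop prints one row per PE architecture with the Layout I and Layout II estimates, which can be compared against the values on the slide.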
32
Estimating the ORN area: PE basic arch. (3 inputs / 2 outputs)
$\mathrm{Area(ORN)} = 1.5 \times W \times (4 \times MCL) \times \mathrm{Area(CB)}$
- $W$: the number of PEs in an RDP row
- Number of rows = $1.5 \times W$
- Number of columns = $4 \times MCL$
[Figure: ORN for the basic PE with MCL = 1.]
33
Estimating the ORN area: PE arch. I (4 inputs / 3 outputs)
$\mathrm{Area(ORN)} = 2 \times W \times (6 \times MCL + 2) \times \mathrm{Area(CB)}$
- Number of rows = $2 \times W$
- Number of columns = $6 \times MCL + 2$
[Figure: ORN for PE arch. I with MCL = 1.]
34
Estimating the ORN area: PE arch. II (3 inputs / 3 outputs)
$\mathrm{Area(ORN)} = 1.5 \times W \times (4 \times MCL + 1) \times \mathrm{Area(CB)}$
- Number of rows = $1.5 \times W$
- Number of columns = $4 \times MCL + 1$
[Figure: ORN for PE arch. II with MCL = 2.]
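The three ORN area estimates (basic PE, PE arch. I and PE arch. II) share the shape rows x columns x Area(CB), so they can be captured in one small function. The sketch below encodes the coefficients from the slides; the function name, the dictionary layout and the sample width W = 10 are hypothetical.

```python
# (row factor, column factor, column offset) per PE architecture:
# rows = row_factor * W, columns = col_factor * MCL + col_offset
ORN_MODELS = {
    "basic":   (1.5, 4, 0),
    "arch_I":  (2.0, 6, 2),
    "arch_II": (1.5, 4, 1),
}


def orn_area_in_cb(pe_arch, width, mcl):
    """Area of one ORN in units of Area(CB), the crossbar switch area."""
    row_factor, col_factor, col_offset = ORN_MODELS[pe_arch]
    rows = row_factor * width
    columns = col_factor * mcl + col_offset
    return rows * columns


# Hypothetical row width of 10 PEs.
print(orn_area_in_cb("basic", 10, 1))    # 60.0
print(orn_area_in_cb("arch_I", 10, 1))   # 160.0
print(orn_area_in_cb("arch_II", 10, 2))  # 135.0
```

Because the column count grows linearly with MCL in every model, reducing MCL during placement shrinks each ORN directly.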
35
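A modified connection length measurement
New measurement technique for the net length between a source (src) and a destination (dest), with horizontal distance $d_h$ and vertical distance $d_v$:
- Initial C.L. = $d_h$
- Modified C.L. = $d_h / d_v$
[Figure: a source with two destinations, dest1 and dest2. For one, C.L.(previous) = 3 and C.L.(new) = 1; for the other, C.L.(previous) = 3 and C.L.(new) = 3.]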
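The two measures can be compared with a short sketch. The (row, column) coordinates below are assumptions picked so that the results reproduce the two pairs of values on the slide (previous 3 and new 1, previous 3 and new 3); the function name is hypothetical.

```python
from fractions import Fraction


def connection_length(src, dst, modified=True):
    """Connection length between two placed nodes given as (row, column).

    modified=False: initial measure, C.L. = d_h
    modified=True:  modified measure, C.L. = d_h / d_v
    """
    d_v = abs(dst[0] - src[0])
    d_h = abs(dst[1] - src[1])
    if d_v == 0:
        raise ValueError("source and destination must be in different rows")
    return Fraction(d_h, d_v) if modified else d_h


src = (0, 0)
far_down = (3, 3)   # three rows below, three columns away (assumed)
next_row = (1, 3)   # next row, three columns away (assumed)

print(connection_length(src, far_down, modified=False),
      connection_length(src, far_down))   # 3 1
print(connection_length(src, next_row, modified=False),
      connection_length(src, next_row))   # 3 3
```

Dividing by d_v reflects that a connection spanning several rows can spread its horizontal displacement over the intermediate ORNs, so each individual hop is shorter.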
36
A modified connection length measurement: Example
[Figure: candidate PE positions for a node with two parents (Parent 1 and Parent 2), each labeled with its pair of connection lengths to the two parents, once measured as $d_h$ and once as $d_h / d_v$.]
- When C.L. is measured as $d_h$, the chosen position gives MCL = 2
- When C.L. is measured as $d_h / d_v$, the chosen position gives MCL = 1
37
MCL minimization: Using an MCL threshold
- A maximum threshold is assumed for the MCL
- During the placement process, for each CL larger than the threshold, the vertical distance increases as: $d_v = CL / \mathrm{MCL\_Threshold}$
Example (max. permitted length = 2):
- The PE with the min. C.L. to the source has $d_v = 1$ and $d_h = 3$ > max. permitted length
- $d_v = d_v + [3/2] = d_v + 1 = 2$
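The threshold rule can be sketched as follows, reading the bracket in the slide's example as integer (floor) division, since [3/2] evaluates to 1 there. That reading and the function name are assumptions; the actual placement tool may round differently.

```python
def stretched_vertical_distance(d_h, d_v, mcl_threshold):
    """Push a destination further down when its connection is too long.

    If the horizontal distance d_h exceeds the permitted MCL, the vertical
    distance grows by floor(d_h / mcl_threshold); otherwise it is unchanged.
    """
    if d_h <= mcl_threshold:
        return d_v
    return d_v + d_h // mcl_threshold


# Values from the slide's example: d_h = 3, d_v = 1, max permitted length = 2.
print(stretched_vertical_distance(3, 1, 2))  # 2
```

Moving the destination one row further down lets a TU carry the value through the intermediate row, so each hop can stay within the permitted length.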
38
Basic placement and routing vs. integrated placement and routing
Basic placement and routing flow (inputs: DFG, LSRDP architecture description):
1. Placing input nodes
2. Placing operational and output nodes
3. Routing nets
4. Routing I/O nets
5. Final map
Integrated placement and routing flow (inputs: DFG, LSRDP architecture description):
1. Placing input nodes using the PF-based alg.
2. Placing operational nodes and routing nets (node by node)
3. Placing output nodes
4. Routing output nets
5. Final map
39
Experimental Results
40
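Specifications of the benchmark DFGs

| DFG | # of nodes | # of inputs | # of outputs | # of pure ops | Max. input-node fan-out | Max. fan-out |
|---|---|---|---|---|---|---|
| Heat-8x1 | 34 | 6 | 4 | 16 | 2 | 1 |
| Heat-8x2 | 60 | 8 | 4 | 32 | 3 | 3 |
| Heat-16x2 | 172 | 16 | 12 | 96 | 3 | 3 |
| Poisson-3x3 | 62 | 18 | 1 | 33 | 3 | 2 |
| Vibration-4x2 | 48 | 8 | 4 | 24 | 3 | 2 |
| Vibration-8x2 | 136 | 16 | 12 | 72 | 4 | 4 |
| ERI-1 | 20 | 8 | 3 | 9 | 3 | 1 |
| ERI-2 | 76 | 16 | 9 | 51 | 3 | 3 |
| ERI-3 | 89 | 14 | 9 | 66 | 3 | 6 |
| ERI-4 | 67 | 19 | 1 | 47 | 4 | 3 |
| Max | 172 | 19 | 12 | 96 | 4 | 6 |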
41
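Evaluation results for various architectures: MCL and ORN sizes

MCL:

| PE architecture | Layout-I, S1 | Layout-I, S2 | Layout-II, S1 | Layout-II, S2 |
|---|---|---|---|---|
| PE basic arch. | 14 | 6 | 15 | 12 |
| PE arch. I | 8 | 3 | 9 | 4 |
| PE arch. II | 10 | 4 | 12 | 7 |

ORN size (overall), in units of CB area:

| PE architecture | Layout-I, S1 | Layout-I, S2 | Layout-II, S1 | Layout-II, S2 |
|---|---|---|---|---|
| PE basic arch. | 25116 | 12600 | 29250 | 24336 |
| PE arch. I | 30600 | 13680 | 38080 | 17680 |
| PE arch. II | 18696 | 8237 | 23520 | 14790 |

| Setting | Input nodes placement | Connection length measurement |
|---|---|---|
| S1 | fan-out based | $l_h$ |
| S2 | proximity-factor based | $l_{hv}$ |

S2 results in a smaller MCL and ORN size for both layout types.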
42
Evaluation results for various architectures: no. of utilized PEs

No. of PEs (overall):

| PE architecture | Layout-I, S1 | Layout-I, S2 | Layout-II, S1 | Layout-II, S2 |
|---|---|---|---|---|
| PE basic arch. | 580 | 683 | 330 | 344 |
| PE arch. I | 634 | 713 | 384 (single Layout-II value) | |
| PE arch. II | 627 | 669 | 360 | 384 |

By using $l_{hv}$, a larger number of RDP rows is utilized, so a larger number of PEs is employed for S2.
43
Evaluation results for various architectures: overall LSRDP area (KJJ)

Overall LSRDP area (KJJ):

| PE architecture | Layout-I, S1 | Layout-I, S2 | Layout-II, S1 | Layout-II, S2 |
|---|---|---|---|---|
| PE basic arch. (3 inputs / 2 outputs) | 36923 | 34193 | 29200 | 27040 |
| PE arch. I (4 inputs / 3 outputs) | 48083 | 35995 | 36190 | 25031 |
| PE arch. II (3 inputs / 3 outputs) | 35307 | 31258 | 27266 | 23451 |

- S2 results in a smaller overall area in terms of KJJ for both layout types
- Layout II results in a smaller area
- PE arch. II gives a smaller area
44
A sample ORN implementation
[Figure: block diagram of a high-frequency test bench: data_in, input shift register, circuit under test, output shift register, data_out, with a ladder and clocks clkin_lfin, clkin_lfout and clkin_hf.]
[Figure: photograph of a chip with a 1-to-3 ORN prototype test bench; scale bar 5 mm.]
45
Conclusions
- SFQ-LSRDP is a basic core of a high-performance, low-power computer
- Data Flow Graphs (DFGs) extracted from scientific applications are mapped onto the LSRDP
- The LSRDP micro-architecture is designed based on the characteristics of DFGs via a quantitative approach
- LSRDP is promising for resolving issues originating from CMOS technology as well as for achieving remarkable performance
Acknowledgement: This research was supported in part by Core Research for Evolutional Science and Technology (CREST) of Japan Science and Technology Corporation (JST).
46
Thanks for your attention! Any questions?