Masamitsu Tanaka, Nagoya Univ.

ISEC'05 / O-A01
Design of a pipelined 8-bit-serial single-flux-quantum microprocessor with multiple ALUs
Masamitsu Tanaka, Nagoya Univ.
Co-workers: T. Kawamoto (1), Y. Yamanashi (2), Y. Kamiya (1), A. Akimoto (2), K. Fujiwara (2), A. Fujimaki (1), N. Yoshikawa (2), H. Terai (3), and S. Yorozu (4)
(1) Nagoya Univ., (2) Yokohama National Univ., (3) NICT, (4) ISTEC-SRL
Acknowledgment: This work was supported by the NEDO through ISTEC as Collaborative Research and Superconductors Network Device Project.

Outline
Introduction
Microarchitecture
Component & chip design
Performance estimation
Conclusion

Introduction
Single-flux-quantum (SFQ) logic: high-speed, low-power operation; ballistic signal transport using passive interconnects.
A high-end microprocessor unit (MPU) is one of the most attractive SFQ applications.
Examples: FLUX chip (TRW & SUNY); CORE1α (NU & YNU).
CORE1α: demonstrated up to 21-GHz operation (bit processing); 200 MIPS, 2.3 mW, 7220 JJs.
[Figure: CORE1α version 10 with 4-byte memory, fabricated using the NEC Nb standard process II; scale bar 1 mm.]

Motivation for CORE1β
We have started to develop much more powerful SFQ MPUs, called CORE1β, by extending CORE1α.
[Figure: peak performance [operations/s], axis marks 10^8, 10^9, 10^10. Labels: CORE1α (demonstrated), CORE1β5 (designed), target of CORE1β, state-of-the-art CMOS MPUs.]

Strategies to improve performance
Pipelining: several instructions are overlapped in execution. A common technique in conventional MPUs.
Multiple ALUs (the forwarding architecture): bit-serial data are processed in cascaded ALUs. A technique unique to serial SFQ MPUs.
[Figure label: forwarding buffer (holds the result of the last operation).]
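The idea of processing bit-serial data in cascaded ALUs can be sketched in software. This is only an illustration, not the circuit from the talk: the function names, the choice of addition followed by XOR, and the LSB-first bit order are all assumptions (LSB-first is needed here so the carry can ripple forward). The point is that the second ALU consumes each result bit as soon as the first ALU produces it, without waiting for the full 8-bit word.

```python
from typing import Iterable, Iterator, List


def to_bits(value: int, width: int = 8) -> List[int]:
    """Split an integer into a list of bits, LSB first (assumed bit order)."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits: Iterable[int]) -> int:
    """Reassemble an integer from an LSB-first stream of bits."""
    return sum(bit << i for i, bit in enumerate(bits))


def serial_add(xs: Iterable[int], ys: Iterable[int]) -> Iterator[int]:
    """Bit-serial adder: emits one sum bit per step, keeping only a carry."""
    carry = 0
    for x, y in zip(xs, ys):
        yield x ^ y ^ carry
        carry = (x & y) | (carry & (x ^ y))


def serial_xor(xs: Iterable[int], ys: Iterable[int]) -> Iterator[int]:
    """Bit-serial XOR: stateless, one output bit per input pair."""
    for x, y in zip(xs, ys):
        yield x ^ y


def cascaded(a: int, b: int, c: int, width: int = 8) -> int:
    """Compute (a + b) XOR c with two chained bit-serial stages.

    The generator returned by serial_add is fed directly into serial_xor,
    so the second stage pulls each bit from the first stage on demand.
    """
    stage_a = serial_add(to_bits(a, width), to_bits(b, width))
    stage_b = serial_xor(stage_a, to_bits(c, width))
    return from_bits(stage_b)


if __name__ == "__main__":
    print(cascaded(100, 55, 0xFF))
```

Because the word is only 8 bits wide, the final carry is dropped, so the addition wraps modulo 256.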

Pipeline Implementation
SFQ circuits are basically well suited to pipelining: almost all SFQ logic gates have a latch function.
However, deep (bit-level) pipelining is too difficult to control, and dependences between instructions (pipeline hazards) would degrade performance.
We use two different types of clocks:
  System clock for pipelining (target: 1-2 GHz).
  Local clocks for bit operations on instructions and data (targets: 25 GHz and 20 GHz, respectively).

Microarchitecture
[Figure: block diagram with sections Instruction Fetch, Instruction Decode, Execution, Write Back.]

Features of CORE1β5
Microarchitecture: 7-stage pipeline, issuing instructions every other cycle. The datapath is composed of four 8-bit registers, dual cascaded ALUs, and several buffers.
Instruction set: 8 primary instructions (16-bit), 7 register-register operations.
Circuit scale, etc.: ~10,000 junctions, 5 mm2 chip. Bias lines are completely shielded.
[Figure labels: GND, circuit, bias supplying line, superconductive shielding.]

Instruction Set

Primary Instructions
Name    Meaning
R-type  register-register operation
LD      load
ST      store
BEQZ    branch on equal to zero
BNEZ    branch on not equal to zero
J       jump
HLT     halt
NOP     no operation

R-type Operations
Name    Meaning
Add     addition
Sub     subtraction
And     logical AND
Or      logical OR
Xor     logical exclusive OR
PassX   pass one operand
PassY

Component Design
Key issues of pipelining in CORE1β:
  To control and configure every component in step with successive instructions (see P-B06 by Y. Yamanashi et al.).
  To avoid conflict or interference of bit-serial data between stages in any component.

Pipeline timing (instructions are issued every other cycle):
time   ->
inst1  IF0 IF1 ID0 ID1 EX0 EX1 WB
inst2          IF0 IF1 ID0 ID1 EX0 EX1 WB
inst3                  IF0 IF1 ID0 ID1 EX0 EX1 WB
inst4                          IF0 IF1 ID0 ID1 EX0
inst5                                  IF0 IF1 ID0
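The issue pattern above (seven stages, a new instruction every other cycle) can be reproduced with a small scheduling model. This is a sketch for checking stage occupancy only; the function names and the table representation are illustrative and do not come from the talk.

```python
from typing import List

# Stage names as they appear in the timing diagram.
STAGES = ["IF0", "IF1", "ID0", "ID1", "EX0", "EX1", "WB"]
ISSUE_INTERVAL = 2  # one instruction issued every other system-clock cycle


def schedule(n_instructions: int, n_cycles: int) -> List[List[str]]:
    """Return a table rows[i][c]: stage of instruction i in cycle c, or ''."""
    rows = []
    for i in range(n_instructions):
        start = i * ISSUE_INTERVAL
        row = [""] * n_cycles
        for offset, name in enumerate(STAGES):
            if start + offset < n_cycles:
                row[start + offset] = name
        rows.append(row)
    return rows


def stage_conflicts(rows: List[List[str]]) -> List[int]:
    """List the cycles in which two instructions occupy the same stage."""
    n_cycles = len(rows[0]) if rows else 0
    conflicts = []
    for c in range(n_cycles):
        active = [row[c] for row in rows if row[c]]
        if len(active) != len(set(active)):
            conflicts.append(c)
    return conflicts


def render(rows: List[List[str]]) -> str:
    """Format the table with fixed-width columns, one instruction per line."""
    lines = []
    for i, row in enumerate(rows, start=1):
        cells = "".join(f"{stage:<4}" for stage in row).rstrip()
        lines.append(f"inst{i}  {cells}")
    return "\n".join(lines)


if __name__ == "__main__":
    table = schedule(n_instructions=5, n_cycles=11)
    print(render(table))
    print("conflicting cycles:", stage_conflicts(table))
```

In this model no stage is ever shared by two instructions, and in any given cycle the occupied stages are either all even-positioned (IF0, ID0, EX0, WB) or all odd-positioned (IF1, ID1, EX1), which follows directly from the issue interval of two.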

Design of Datapath
The timing design of the datapath is very difficult.
The register file must be read out and written simultaneously, which requires the most precise timing control.
We have inserted buffers with a new structure.
[Figure: datapath block diagram. Labels: write clock, read clock, register file (4 x 8 bit), SRB1, SRB2, DRB, trigger, ALUa, FB, D2FFs, ALUb. SRB: source register buffer; DRB: destination register buffer.]

Register File
The register file is designed to be read out and written simultaneously with the same clock. We confirmed successful operation up to 18 GHz.
[Figure: on-chip test circuit; plot of dc bias margin [%] versus clock frequency [GHz].]
[M. Tanaka et al., to be published in Physica C]

ALU (Single)
We have implemented the ALU and confirmed all of its functions up to 23 GHz.
[Figure: test circuit for the ALU; test result of subtraction, plotted as dc bias margin [%] versus clock frequency [GHz].]
[M. Tanaka et al., to be published in Physica C]

Chip Layout
We have confirmed several partial operations.
CORE1β version 3: 10,923 JJs, 1000 MOPS (peak), 4-stage pipeline. Fabricated using the NEC 2.5 kA/cm2 Nb standard process II.
[Figure: chip layout (scale bar 1 mm) and CORE1β version 3 test result (high-speed local clocks).]

Critical Path Timing
System clock: 665.9 ps in an odd-numbered stage (EX1 stage of R-type). Estimated peak performance is 1500 MOPS.
Local clocks:
  Instruction (25 GHz): not critical.
  Data (20 GHz): 31.4 ps in the buffers.
The target local clock frequencies will be achievable.
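The timing figures on this slide can be cross-checked with simple arithmetic. The sketch below converts the quoted critical paths into clock rates and margins. The last step is an assumption, not something stated in the talk: it supposes that peak performance equals the system clock rate divided by the issue interval (every other cycle) and multiplied by two operations per instruction (one per cascaded ALU). That reading happens to reproduce the quoted 1500 MOPS, but the authors' actual estimate may be derived differently.

```python
# Figures quoted in the talk.
SYSTEM_CRITICAL_PATH_PS = 665.9   # EX1 stage of an R-type instruction
DATA_CRITICAL_PATH_PS = 31.4      # in the buffers
INSTR_CLOCK_GHZ = 25.0            # local clock target for instruction bits
DATA_CLOCK_GHZ = 20.0             # local clock target for data bits
INSTR_BITS = 16
DATA_BITS = 8
ISSUE_INTERVAL_CYCLES = 2         # instructions issued every other cycle

# Assumption for illustration only: each instruction can perform
# one operation in each of the two cascaded ALUs.
OPS_PER_INSTRUCTION = 2


def period_ps(freq_ghz: float) -> float:
    """Clock period in picoseconds for a frequency in GHz."""
    return 1000.0 / freq_ghz


def max_freq_ghz(path_ps: float) -> float:
    """Highest clock frequency in GHz allowed by a critical path in ps."""
    return 1000.0 / path_ps


# System clock permitted by the 665.9 ps critical path (about 1.50 GHz).
system_clock_ghz = max_freq_ghz(SYSTEM_CRITICAL_PATH_PS)

# Slack between the 20 GHz data clock period and the buffer critical path.
data_margin_ps = period_ps(DATA_CLOCK_GHZ) - DATA_CRITICAL_PATH_PS

# Time to shift one full word through at the local clock rates.
instr_shift_ps = INSTR_BITS * period_ps(INSTR_CLOCK_GHZ)
data_shift_ps = DATA_BITS * period_ps(DATA_CLOCK_GHZ)

# Peak performance under the assumption stated above.
peak_mops = (system_clock_ghz * 1000.0 / ISSUE_INTERVAL_CYCLES
             * OPS_PER_INSTRUCTION)

if __name__ == "__main__":
    print(f"system clock limit : {system_clock_ghz:.2f} GHz")
    print(f"data clock margin  : {data_margin_ps:.1f} ps")
    print(f"16-bit instr shift : {instr_shift_ps:.0f} ps")
    print(f"8-bit data shift   : {data_shift_ps:.0f} ps")
    print(f"assumed peak       : {peak_mops:.0f} MOPS")
```

The 31.4 ps buffer path sits well inside the 50 ps period of a 20 GHz clock, which is consistent with the statement that the local clock targets are achievable.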

Conclusion
We have designed an 8-bit SFQ MPU with a peak performance of 1500 MOPS.
  Employed the forwarding architecture, using two cascaded ALUs to enhance register-register operations.
  Implemented pipelining in the bit-serial MPU by separating the clocks for pipelining and bit processing, and by improving the controller and datapath for pipelining.
We have already finished component tests and are testing the whole CORE1β MPU. Several partial operations have been confirmed so far.
