An Accelerator Based on Single-Flux Quantum Circuits for a High-Performance Reconfigurable Computer
F. Mehdipour*, Hiroaki Honda**, H. Kataoka*, K. Inoue* and K. Murakami*
*Graduate School of Information Science and Electrical Engineering, Kyushu University, Japan
**Institute of Systems, Information Technologies and Nanotechnologies (ISIT), Fukuoka, Japan
WAHA 2009 Kyushu University
Agenda
- Introduction
- Large-Scale Reconfigurable Data-Path (LSRDP)
- General Architecture and Specifications
- Design Procedure and Tool Chain
- Preliminary Results
- Conclusions and Future Work
Introduction
- Parallel computer clusters built from General-Purpose Processors (GPPs) are often used for HPC
- Various accelerators are used with GPPs for further performance improvement: PowerXcell, GPGPU, GRAPE-DR, ClearSpeed, etc.
- These accelerators are small and consume little power compared to processors with similar performance
[Figure labels: TSUBAME, NVIDIA Tesla S, Roadrunner with PowerXcell]
Single-Flux Quantum Large-Scale Reconfigurable Data-Path (SFQ-LSRDP)
- Conventional accelerators demand a large memory bandwidth for high-performance computation
- On-chip memories are often used to hide memory access latency
- The Large-Scale Reconfigurable Data-Path (LSRDP):
  - is introduced as an alternative accelerator
  - reduces the number of memory accesses
  - is implemented with Single-Flux Quantum (SFQ) circuits instead of CMOS circuits
  - is suitable for high-performance scientific computations
Outline of the Large-Scale Reconfigurable Data-Path (LSRDP) processor
Features:
- Data Flow Graphs (DFGs) extracted from critical calculation parts are mapped directly
- Pipelined execution
- Burst transfer is used for moving rearranged input/output data from/to memory
The reconfigurable data-path includes:
- A large number of floating-point Functional Units (FUs)
- A reconfigurable Operand Routing Network (ORN)
- Dynamic reconfiguration facilities
- Streaming Buffers (SBs) for I/O ports
- Implementation by SFQ circuits
[Figure: block diagram with Main Memory, GPP, SMAC, Scratchpad Memory, and the LSRDP (SB, rows of FUs, ORNs)]
Single-Flux Quantum (SFQ) versus CMOS
CMOS issues:
- High electric power consumption
- High heat radiation and difficulty in high-density packing
- The memory wall problem, which limits processing speed
SFQ features:
- High-speed switching and signal transmission
- Low power consumption
- Compact implementation of a system (small area)
- No cost for latches
- Suitable for pipelined processing of data streams
- Serial bit-level processing
CREST-JST (2006~): Low-power, high-performance, reconfigurable processor using single-flux quantum circuits (SFQ-LSRDP)
- Kyushu Univ. (Prof. K. Murakami, Dr. K. Inoue, Dr. H. Honda, Dr. F. Mehdipour, H. Kataoka): architecture, compiler and applications
- Superconducting Research Lab. (SRL) (Dr. S. Nagasawa et al.): SFQ process
- Yokohama National Univ. (Prof. N. Yoshikawa et al.): SFQ-FPU chip, cell library
- Nagoya Univ. (Prof. A. Fujimaki et al.): SFQ-RDP chip, cell library, and wiring
- Nagoya Univ. (Prof. N. Takagi (Leader) et al.): CAD for logic design and arithmetic circuits
Goals of the Project
- Discovering appropriate applications
- Developing compiler tools
- Developing performance analysis tools
- Designing and implementing the SFQ-LSRDP architecture, considering the features and limitations of SFQ circuits
LSRDP General Architecture and Specifications
Parameters to Be Decided Within the LSRDP Design Procedure
- Core structure: a matrix of PEs. Width and height?
- PE: a combination of a Functional Unit (FU) and a data Transfer Unit (TU)
- Layout: FU types (ADD/SUB and MUL)?
- Maximum Connection Length (MCL) between consecutive rows?
- Reconfiguration mechanism? (PE, ORN, immediate data)
- On-chip memory configuration?
LSRDP Architecture: Processing Elements
- The FU implements basic 64-bit double-precision floating-point operations: ADD, SUB and MUL
- The TU (Transfer Unit) is a routing resource for transferring data from one row to a non-consecutive row
- Each PE includes two components and offers four functionalities
Layout Types: Type I
- Each PE implements ADD/SUB and MUL
- Flexible, but consumes a lot of resources
[Figure: W x H array of PEs with ORNs; every row repeats "A T M" (A: ADD/SUB, M: MUL, T: Transfer Unit)]
Layout Types: Type II (Checkered)
- Each PE implements ADD/SUB or MUL
[Figure: W x H array of PEs with ORNs; rows shown as "MTATATATMT" (A: ADD/SUB, M: MUL, T: TU)]
Layout Types: Type III (Striped)
- Each PE implements ADD/SUB or MUL
- Type II or III: which one is more efficient?
[Figure: W x H array of PEs with ORNs; rows alternate between "MTMTMTMTMT" and "ATATATATAT" (A: ADD/SUB, M: MUL, T: TU)]
Maximum Connection Length (MCL)
MCL: the maximum horizontal distance between two PEs located in two consecutive rows
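As a rough illustration of this definition, the sketch below computes the MCL of an already mapped DFG. The placement format (a dict from node name to a `(row, col)` pair) and the edge list are assumptions made for this example and are not the format of the project's mapping tool.

```python
def max_connection_length(placement, edges):
    """Largest horizontal distance over all edges between consecutive rows.

    placement: dict mapping node -> (row, col)   (hypothetical format)
    edges:     iterable of (src, dst) node pairs
    """
    mcl = 0
    for src, dst in edges:
        r_src, c_src = placement[src]
        r_dst, c_dst = placement[dst]
        if r_dst - r_src != 1:
            # Longer edges would first have to be routed through TUs.
            raise ValueError(f"edge {src}->{dst} does not join consecutive rows")
        mcl = max(mcl, abs(c_dst - c_src))
    return mcl


# Toy mapping: two PEs in row 0 feed one PE in row 1.
placement = {"a": (0, 0), "b": (0, 3), "c": (1, 1)}
edges = [("a", "c"), ("b", "c")]
print(max_connection_length(placement, edges))  # 2
```

The edge `b -> c` spans two columns, so it determines the MCL in this toy case.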
An ORN Structure
- The ORN consists of 2-bit shift registers and 1-by-2 and 2-by-2 crossbar switches
Reference: A. Fujimaki, et al., "Demonstration of an SFQ-Based Accelerator Prototype for a High-Performance Computer," ASC08.
Dynamic Reconfiguration Mechanism
Three bit-stream lines are used for dynamic reconfiguration of:
- Immediate registers (64-bit) in each PE
- Selector bits for the muxes that select the input data of FUs
- Crossbar switches in ORNs
Design Procedure and Tool Chain
Compiler and Design Flow
- DFGs are manually generated from critical parts of applications
- DFG mapping results are used for:
  - Analyzing LSRDP architecture statistics
  - Generating LSRDP configuration bit-streams
LSRDP Design Procedure
[Figure: design flow; inputs: DFGs & LSRDP HW constraints; step: for each parameter; output: appropriate value for each parameter]
Benchmark Applications for Design Procedures
- Finite difference method calculation of 2nd-order partial differential equations
  - 1-dim heat equation (Heat)
  - 1-dim vibration equation (Vibration)
  - 2-dim Poisson equation (Poisson)
- Quantum chemistry application
  - Recursive parts of Electron Repulsion Integral calculation (ERI-Rec)
- Only ADD/SUB and MUL operations are used in the critical calculations of all the above applications
DFG Extraction: Heat Equation
- 1-dim heat equation for T(x,t)
- Calculation by the Finite Difference Method (FDM) (A is const.)
- The basic DFG corresponds to the minimum FDM calculation
- The basic DFG can be extended in the horizontal and vertical directions to make a larger DFG
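The slide's equation did not survive extraction, so the following is only a sketch of what a minimum FDM calculation usually looks like. It assumes the standard explicit update T(x_i, t_{j+1}) = T(x_i, t_j) + A * (T(x_{i+1}, t_j) - 2 T(x_i, t_j) + T(x_{i-1}, t_j)) with fixed boundary values; the function name `heat_step` is hypothetical.

```python
def heat_step(T, A):
    """One explicit FDM time step; the two boundary points are held fixed."""
    new = list(T)
    for i in range(1, len(T) - 1):
        # Only ADD/SUB and MUL are needed per grid point.
        new[i] = T[i] + A * (T[i + 1] - 2.0 * T[i] + T[i - 1])
    return new


T = [0.0, 0.0, 1.0, 0.0, 0.0]
print(heat_step(T, 0.25))  # [0.0, 0.25, 0.5, 0.25, 0.0]
```

Under this assumption, the body of the loop is the kind of small DFG that gets replicated horizontally (more grid points) and vertically (more time steps) to build a larger DFG.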
Example of extracted DFGs: Heat
A huge sample DFG (Heat): Inputs: 32, Outputs: 16, Operations: 721, Immediates: 364
DFG Classification
- Because DFG sizes span a broad range, DFGs are classified as S, M, L or XL with respect to their size and the number of input/output nodes
- In total, 24 DFGs are prepared as benchmarks
Mapping DFGs onto LSRDP
[Figure: DFG mapping example; label: longest connections]
Preliminary Results
LSRDP Specifications: Width & Height
LSRDP dimensions and the number of input/output ports
[Table: columns # of input ports, # of output ports, Width, Height; rows LSRDP-S, LSRDP-M, LSRDP-L; cell values missing in the source]
LSRDP Specifications: MCL
LSRDP    | MCL (avg/max) | ORN size: no. of inputs (avg/max), outputs
LSRDP-S  | 4/8           | 18/34, 3
LSRDP-M  | 5/9           | 22/38, 3
LSRDP-L  | 5/9           | 22/34, 3
Needs further MCL optimization
Analyzing Various LSRDP Layouts
- Layout II can be used instead of Layout I to obtain a smaller LSRDP architecture with lower power consumption and implementation cost (except for the ERI1 DFG, which gives a better size with Layout III)
[Figure panels: Layout I, Layout II]
LSRDP at One Glance (1/2)
Functional units: ADD/SUB, MUL
Layout: Type II (checker pattern)
Operations: 64-bit floating point
Processing structure: Pipelined
PE structure: FU, T, FU+T, T+T
LSRDP size                | Small       | Medium      | Large
Width/Height              | 16/16       | 32/16       | 64/32
Conf. bit-stream size
  Imm. regs               | 16*16*64    | 32*16*64    | 64*32*64
  ORNs                    | 16*BSS(ORN) | 32*BSS(ORN) | 64*BSS(ORN)
  PEs                     | 16*16*2     | 32*16*2     | 64*32*2
No. of inp/out ports: 19/12, 38/24
ORN
- Inputs, outputs: 22, 3; 26, 3
- Structure: Cross-bar switch
- Conn. type: One-directional
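The bit-stream entries for immediate registers and PEs are simple products, so they can be evaluated directly. The sketch below does that for the three sizes; the function name and dictionary keys are illustrative, and the ORN part is left out because BSS(ORN) is not given numerically.

```python
def config_bits(width, height, imm_bits=64, pe_bits=2):
    """Configuration bits for immediate registers and PE selectors.

    ORN bits are omitted: BSS(ORN) has no numeric value on the slide.
    """
    pes = width * height
    return {"imm_regs": pes * imm_bits, "pes": pes * pe_bits}


for name, (w, h) in {"Small": (16, 16), "Medium": (32, 16), "Large": (64, 32)}.items():
    print(name, config_bits(w, h))
# Small {'imm_regs': 16384, 'pes': 512}
# Medium {'imm_regs': 32768, 'pes': 1024}
# Large {'imm_regs': 131072, 'pes': 4096}
```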
LSRDP at One Glance (2/2)
Internal memory
- Type: Immediate registers
- Size and count: 64-bit registers, one register for each PE
- Communication mechanism: Serial
External memory
- No. of memory modules: 16
- Data trans. rate: 1800 Mbps/pin
- Overall data trans. rate: 24 GB/s
- Mem. to LSRDP bus width: 64 bit
- Channels per module: Two
Reconf. mechanism: Bit-serial configuration through a serial chain
Preliminary Performance Evaluation
Base processor configuration
- Processor type: Out-of-order
- GPP operating frequency: 3.2 GHz
- Inst. issue width: 4 instructions/cc
- Inst. decode width: 4 instructions/cc
- Cache configuration:
  - L1 data: 64 KB (128 B entry, 2-way, 2 cc)
  - L1 instruction: 64 KB (64 B entry, 1-way, 1 cc)
  - L2 unified: 4 MB (128 B entry, 4-way, 16 cc)
- Latency of main memory: 300 cc
- L2 to main memory: bus width 64 Bytes, freq. 800 MHz
GPP+LSRDP configuration
- LSRDP operating frequency: 80 GHz
- Reconfiguration latency: 1 cc
- Latency between SPM and LSRDP: 1 cc
- Latency between main memory and SPM: 7500 cc
- Bandwidth between SPM and LSRDP: max. 64 * 8 Bytes/cc
- Bandwidth between main memory and SPM: 102.4 GB/sec
Evaluation method
- GPP: execution time measured by means of a processor simulator
- LSRDP: estimated by performance modeling
Preliminary Performance Evaluation (Heat)
- Data reuse is employed to avoid data rearrangement as well as frequent data retrieval from the scratchpad memory
- Basic: SB only
- Reuse: SB + SPM
Preliminary Performance Evaluation (Poisson)
- Only a small fraction of the execution time is processing time on the LSRDP; the main fraction consists of various overhead times and the execution time on the GPP
Conclusions & Future Work
- A high-performance computer comprising an accelerator (LSRDP) implemented with superconducting circuits was introduced
- 24 benchmark Data Flow Graphs (DFGs) were manually generated
- The LSRDP micro-architecture is designed via a quantitative approach, based on the characteristics of scientific applications
- LSRDP is promising for resolving issues originating from CMOS technology while achieving considerable performance
Future work:
- To achieve higher performance, the various overhead costs, mainly related to the data management part, must be reduced
- To reduce the implementation cost of LSRDP, we will focus on reducing the maximum connection length and ORN size
Acknowledgement
This research was supported in part by Core Research for Evolutional Science and Technology (CREST) of Japan Science and Technology Corporation (JST).
Thanks! Any Questions?
Backup Slides
SFQ (Single Flux Quantum) Circuit
- High speed, low power consumption, and an operating principle different from that of CMOS
[Figure labels: Φ0, L, Ic, Ib, 2 mV, 2 ps, tunneling effect, single flux quantum, superconducting loop, Josephson junction]
Mapping Results
- For each class, many extra TUs are needed to map all DFGs
[Figure legend: PE types FU, T, FU+T, T+T]
Connection Length Minimization: Results
Final optimized Maximum Connection Length (MCL) results:
LSRDP  | MCL (avg/max)
RDP-S  | 4/9
RDP-M  | 5/9
RDP-L  | 9.3/19
- ORNs should provide a connection length of 9 in LSRDP-S/M (MCL = 9)
- For LSRDP-L, MCL = 19, which means a serious implementation cost. Is it possible to decrease it?
Distributions of Connection Lengths
- 93% of connection lengths are 0 ~ 2
- Only a small fraction of the connections results in larger ORNs
[Figure: distribution of connection lengths; axis: connection length]
Analyzing Various LSRDP Layouts
- Layout II can be used instead of Layout I to obtain a smaller LSRDP architecture with lower power consumption and implementation cost
- Nearly the same small sizes are achieved with Layouts I and II for the majority of DFGs (except the ERI1 DFG, which gives a better size with Layout III)
Why is only the ERI1 DFG suited to Layout III?
[Figure labels: Heat, ERI1, Layout III, Layout II]
FU Layout for DIV, SQRT, EXP Operations
- 16-bit floating-point DIV, SQRT, and EXP functional units have already been developed with current SFQ technology
- Pipeline execution is based on ADD and MUL latency; DIV has three times larger latency
- Where should FUs with different latency be placed?
- Heterogeneous configuration of the FU array?
[Figure: rows of FUs separated by ORNs (ORN: Operand Routing Network), with a DIV unit among the FUs]
Estimated performance improvement of 2-dim Poisson equation by LSRDP calc.
[Figure: execution time normalized by GPP (3 GHz) calc. vs. main memory bandwidth [GByte/sec]]
Estimated performance improvement of ERI calculation by LSRDP (3GHz)
Recursive Parts of Electron Repulsion Integral Formula (ERI-Rec)
- DFG sizes are already determined by the original recursive formula
DFG      | No. of operations | No. of inputs | No. of outputs
(ps,ss)  | 9                 | 8             | 3
(ps,ps)  | 51                | 16            | 9
(pp,ss)  | 66                | 14            | 9
(pp,ps)  |                   |               |
(pp,pp)  |                   |               |
What types of software/algorithms are suitable for LSRDP?
- Cases where the same calculation has to be performed repeatedly: LSRDP is used as a high-throughput accelerator
- Cases where the input/output data size is small compared with the number of operations
[Figure labels: small size of input, small size of output, large amount of calculations, memory access (crossed out), LSRDP]
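To make the second criterion concrete, the ratio of operations to I/O words can be computed for two DFGs reported elsewhere in these slides: the large Heat DFG (721 operations, 32 inputs, 16 outputs) and the largest ERI-Rec DFG (1004 FUs, 28 inputs, 81 outputs). The metric name `ops_per_io` is an illustrative choice, not a quantity defined by the authors.

```python
def ops_per_io(operations, inputs, outputs):
    """Operations performed per input/output word of a DFG."""
    return operations / (inputs + outputs)


print(round(ops_per_io(721, 32, 16), 2))   # 15.02  (Heat)
print(round(ops_per_io(1004, 28, 81), 2))  # 9.21   (ERI-Rec, (pp,pp))
```

A higher ratio means more computation per memory access, which is the property the slide asks of a suitable application.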
Exploration of suitable applications for LSRDP
Applications:
- Matrix element calculation
- Molecular integral calculations in the molecular orbital method
- Monte Carlo type simulation
- etc.
Numerical calculation libraries:
- Special functions (promising?)
- Differential equations
- Numerical integration
- Matrix operations (difficult?): triangular matrices, simultaneous equations
- etc.
Applicability to various applications is being investigated
Recursive Parts of Electron Repulsion Integral Formula in Molecular Orbital Calc.
- Recursive calculation up to (pp,pp)
- (ss,ss)(m) and all coefficients are given as inputs
- (i,j,k,l = x,y,z): a p function has 3 components (as a 1-dim array)
- # of inputs: max. 28; # of outputs: 1 ~ 81
- Each DFG has only ADD (SUB) and MUL FUs
- DFG sizes are determined by the original calculation algorithm
DFG Distribution for each application
- DFGs have different characteristics in terms of the # of FUs and the # of inputs and outputs
[Figure: # of FUs vs. # of inputs for Poisson (3), Vibration (7), Heat (6) and ERI-Rec (8 DFGs)]
Example of MCL (Heat)
[Figure: original Heat DFG (I/O: 8/4, FUs: 32) and its mapping result, with the MCL marked]
Example of extracted DFGs (ERI-Rec)
- Maximum DFG of ERI-Rec, (p_i p_j, p_k p_l): Inputs: 28, Outputs: 81, FUs: 1004, Immediates: 0
- Vertical partitioning: Inputs: 24, Outputs: 1, FUs: 108, Immediates: 0
Poisson Equation
- 2D Poisson eq., solved with the Successive Over-Relaxation (SOR) method (ω is const.)
- To obtain u^(n+1)(x_i, y_j) in the next iteration, the current values of five variables are needed: u^(n)(x_i, y_j), u^(n)(x_{i±1}, y_j) and u^(n)(x_i, y_{j±1})
- Red/Black Gauss-Seidel
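The slide's formula is not preserved, so the following is a minimal sketch of one red/black SOR sweep under stated assumptions: the equation is taken as laplace(u) = f on a uniform grid with spacing h, discretised with the usual 5-point stencil, and boundary values are kept fixed. The function name and argument layout are hypothetical.

```python
def sor_red_black_sweep(u, f, h, omega):
    """One red/black SOR sweep for the 5-point discretisation of laplace(u) = f.

    u and f are lists of lists; u is updated in place and returned.
    """
    n, m = len(u), len(u[0])
    for colour in (0, 1):  # 0: points with i + j even, 1: points with i + j odd
        for i in range(1, n - 1):
            for j in range(1, m - 1):
                if (i + j) % 2 != colour:
                    continue
                # Gauss-Seidel value from the four neighbours.
                gs = 0.25 * (u[i - 1][j] + u[i + 1][j]
                             + u[i][j - 1] + u[i][j + 1] - h * h * f[i][j])
                # Relaxation mixes the old value with the Gauss-Seidel value.
                u[i][j] = (1.0 - omega) * u[i][j] + omega * gs
    return u


# 3 x 3 grid: boundary fixed at 1, one interior point, f = 0.
u = [[1.0] * 3 for _ in range(3)]
u[1][1] = 0.0
f = [[0.0] * 3 for _ in range(3)]
sor_red_black_sweep(u, f, h=1.0, omega=1.0)
print(u[1][1])  # 1.0
```

Points of one colour depend only on points of the other colour, so all updates within a colour are independent of each other, and each update uses the five values listed on the slide.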
Example of extracted DFGs (Poisson)
Maximum Poisson DFG: Inputs: 32, Outputs: 1, FUs: 721, Immediates: 364
Performance Evaluation: Simulation Environment
- GPP: execution time measured by a processor simulator
- LSRDP: estimated by performance modeling
Variable parameters:
- Frequency of GPP and LSRDP
- Bandwidth between main memory and LSRDP
- Latency of reconfiguration
- # of FPUs in LSRDP
- Supported FPU types (Add, Mul, Div, Exp, Sqrt and error function units are supported)
- A streaming buffer is used in the LSRDP chip
- I/O data is sorted in the main memory
[Figure: GPP, Main Memory, LSRDP]
Estimated performance improvement of 1-dim heat equation by LSRDP calc.
[Figure: axis: main memory bandwidth [GByte/sec]]
Estimated performance improvement of 1-dim heat equation by LSRDP calc.
[Figure: execution time normalized by GPP (3 GHz) calc. vs. main memory bandwidth [GByte/sec]]
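Increasing the number of iterations by enlarging the DFG in the Poisson Red/Black method
- 4+1 input nodes, 1 central output node: one calculation of the SOR formula
- 9+4 input nodes, 1 central output node: two iterations of the SOR formula
- The required number of inputs increases accordingly
- Enlarging the DFG increases the number of iterations that can be calculated at once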
Implementation of Heat calculation to LSRDP
Original GPP code:
  Loop j
    Loop i
      T(xi,tj)
    End Loop
  End Loop
LSRDP code:
  LSRDP Reconfiguration
  Loop j'
    Input Data Rearrangement
    Loop N
      LSRDP pipeline exec. (FDM DFG calc.)
    End Loop
    Output Data Rearrangement
  End Loop
Implementation of Poisson calculation to LSRDP
Original GPP code:
  Loop Iter
    Loop i
      Loop j
        u(xi,yj)
      End Loop
    End Loop
  End Loop
LSRDP code:
  LSRDP Reconfiguration
  Loop Iter'
    Input Data Rearrangement
    Loop N
      LSRDP pipeline exec. (FDM DFG calc.)
    End Loop
    Output Data Rearrangement
  End Loop
Implementation of ERI-Rec calculation to LSRDP
Original GPP code:
  Loop I,J,K,L
    Loop contraction
      Initial Integral Calc.
      Recursive Calc.
    End Loop
    Partial Fock Calc.
  End Loop
LSRDP code:
  Loop I,J,K,L
    LSRDP Reconfiguration
    Loop contraction
      Initial Integral Calc.
    End Loop
    Input Data Rearrangement
    Loop N
      LSRDP pipeline calc. (Recursive DFG calc.)
    End Loop
    Output Data Rearrangement
    Partial Fock Calc.
  End Loop
- Initial Integral Calc.: 1/Sqrt, Exp and Fm(T) are used => GPP calculation
- Recursive Calc.: only ADD/SUB and MUL => LSRDP calculation
Vertical vs. Horizontal DFG Decomposition
Original:
  Loop N
  Reconfiguration
  Loop M
    LSRDP pipeline calc.
  End Loop
Vertical decomp.:
  Loop n ( > N)
  Reconfiguration
  Loop M
    LSRDP pipeline calc.
  End Loop
Horizontal decomp.:
  Loop N
  Reconfiguration
  Loop M
    1st LSRDP pipeline calc.
  End Loop
  Loop N
  Reconfiguration
  Loop M
    2nd LSRDP pipeline calc.
  End Loop
Performance Evaluation: Execution Time Modeling
Execution time components:
- Calculation time
- Stall time
- Latency between LSRDP and memory (for the first input and the last output)
Annotations on the slide:
- Sort data + Reconfig. + Send signal for comm. + Stall when Bandwidth req > Bandwidth mem
- Total pipeline depth in the given program + # of rows of LSRDP (latency of LSRDP)
Layout Types: Type I
- Each PE implements ADD/SUB and MUL
- Total no. of PEs = W * H
- Total area = W * H * [Area(MUL) + Area(ADD/SUB) + Area(TU)] + Area(ORNs)
[Figure: W x H array of PEs with ORNs; every row repeats "A T M" (A: ADD/SUB, M: MUL, T: Transfer Unit)]
Layout Types: Type II
- Each PE implements ADD/SUB or MUL
- Total no. of PEs = W * H
- Total area = ½ * W * H * [Area(MUL) + Area(TU)] + ½ * W * H * [Area(ADD/SUB) + Area(TU)] + Area(ORNs)
[Figure: W x H array of PEs with ORNs; rows shown as "MTATATATMT" (A: ADD/SUB, M: MUL, T: TU)]
Layout Types: Type III
- Each PE implements ADD/SUB or MUL
- Total no. of PEs = W * H
- Total area = ½ * W * H * [Area(MUL) + Area(TU)] + ½ * W * H * [Area(ADD/SUB) + Area(TU)] + Area(ORNs)
[Figure: W x H array of PEs with ORNs; rows alternate between "MTMTMTMTMT" and "ATATATATAT" (A: ADD/SUB, M: MUL, T: TU)]
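The area formulas for the three layout types can be compared numerically. The sketch below transcribes them directly; the unit areas in the example (MUL = 4, ADD/SUB = 2, TU = 1, ORNs = 0) are made-up placeholders chosen only to show the arithmetic, not measured SFQ cell areas.

```python
def area_type1(w, h, a_mul, a_add, a_tu, a_orns):
    """Type I: every PE holds a MUL, an ADD/SUB and a TU."""
    return w * h * (a_mul + a_add + a_tu) + a_orns


def area_type2_or_3(w, h, a_mul, a_add, a_tu, a_orns):
    """Types II and III: half the PEs are MUL+TU, the other half ADD/SUB+TU."""
    return (0.5 * w * h * (a_mul + a_tu)
            + 0.5 * w * h * (a_add + a_tu)
            + a_orns)


# Placeholder unit areas (hypothetical), 16 x 16 array, ORN area ignored.
print(area_type1(16, 16, 4, 2, 1, 0))       # 1792
print(area_type2_or_3(16, 16, 4, 2, 1, 0))  # 1024.0
```

With identical ORN area, Types II and III always come out smaller than Type I by half of W * H * [Area(MUL) + Area(ADD/SUB)], which follows from subtracting the two formulas.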
CB: Various Functionalities
- CB: two inputs, two outputs; four cases are possible; reconfigurable
- ½ CB: one input, two outputs; four cases are possible; reconfigurable
An ORN Structure
- The number of FPUs is M, and the number of Transfer Units (T) is also M; MCL is the maximum connection length if only FPUs are considered
- Component counts: ½ CB: 2×M; T2: (M + 4×MCL + 2); CB: (2×MCL + 1) × (4×M - 1)
- CB: 351 JJs; ½ CB: 216 JJs (T2 is a 2-bit shift register)
- "+": scalable, pipelined, easily re-designed for any number of N and M
- "-": large number of Josephson junctions (M number of ½ CB and (2×M+1)×MCL number of CB)
- Reduction of the number of Josephson junctions is essential!
Reference: A. Fujimaki, et al., "Demonstration of an SFQ-Based Accelerator Prototype for a High-Performance Computer," ASC08, 2008.
Reconfiguration Mechanism
[Figures: PE & ORN reconfiguration structure; state diagram of reconfiguration in LSRDP]
LSRDP Design Procedure
10 TFLOPS SFQ-RDP computer
- SFQ RDP: 32 FPUs × 32 chips (4 GFLOPS / FPU)
- SFQ Streaming Buffer: 64 Kb × 2 chips
- (34 chips) × 4 MCMs
- Memory bandwidth per MCM: 256 GB/s (= 16 GB/s × 16 channels)
- CMOS CPU (1 chip)
- 2 TB memory module (FB-DIMM 128 GB × 16 modules)
- SFQ 0.5 um process
[Figure labels: SMAC, SB, ORN, FPU, 4.2 K]
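The figures on this slide can be cross-checked with a few multiplications. The sketch assumes that "32 FPU × 32 chips" and "64 Kb × 2 chips" are per-MCM counts, which is consistent with the stated 34 chips per MCM; the variable names are illustrative.

```python
fpus_per_chip = 32
rdp_chips_per_mcm = 32      # assumption: the 32 RDP chips are per MCM
sb_chips_per_mcm = 2        # assumption: the 2 streaming-buffer chips are per MCM
mcms = 4
gflops_per_fpu = 4

chips_per_mcm = rdp_chips_per_mcm + sb_chips_per_mcm
aggregate_gflops = fpus_per_chip * rdp_chips_per_mcm * mcms * gflops_per_fpu
bandwidth_per_mcm_gb_s = 16 * 16        # 16 GB/s x 16 channels
memory_gb = 128 * 16                    # FB-DIMM 128 GB x 16 modules

print(chips_per_mcm, aggregate_gflops, bandwidth_per_mcm_gb_s, memory_gb)
# 34 16384 256 2048
```

Under these assumptions the aggregate FPU throughput comes out above the 10 TFLOPS in the slide title; the slide does not say how the two figures relate.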
Power Consumption Comparison for 10 TFLOPS computers
- CMOS RISC (90 nm): MPU 125 kW; Mem. 12.5 kW; HD 5 kW; Air conditioning 43 kW; Total 186 kW
- SFQ RDP (0.5 um): MPU 3 W + 3.3 W; Mem. 250 W; HD 100 W; Freezer 1 kW; Air conditioning 0.1 kW; Total 1.5 kW
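The totals in this comparison can be checked by summing the components. The sketch assumes the flattened cells line up with the columns MPU, Mem., HD, Freezer, Air Conditioning and Total, with no freezer entry for the CMOS system; that column assignment is an assumption, supported by the fact that the sums then match the stated totals after rounding.

```python
# All values in watts; the column assignment is an assumption (see above).
cmos_w = {"MPU": 125e3, "Mem": 12.5e3, "HD": 5e3, "Air conditioning": 43e3}
sfq_w = {"MPU": 3 + 3.3, "Mem": 250, "HD": 100, "Freezer": 1e3,
         "Air conditioning": 100}

performance_gflops = 10_000  # 10 TFLOPS

for name, parts in (("CMOS RISC", cmos_w), ("SFQ RDP", sfq_w)):
    total_w = sum(parts.values())
    print(f"{name}: total {total_w / 1e3:.1f} kW, "
          f"{total_w / performance_gflops:.3f} W/GFLOPS")
# CMOS RISC: total 185.5 kW, 18.550 W/GFLOPS
# SFQ RDP: total 1.5 kW, 0.146 W/GFLOPS
```

The CMOS sum of 185.5 kW rounds to the 186 kW given as the total, and the SFQ sum of about 1.46 kW rounds to 1.5 kW.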
Power Consumption / Performance
System             | Performance (GFlops) | Power consumption (W) | Power cons. / perf. (W/GFlops) |
SFQ-RDP            | ~10K                 | ~1.5K                 | ~                              | um, whole system
SFQ-RDP            | ~10K                 | ~6.3 (MPU)            | ~0.63*10^(-3)                  | 0.5 um, MPU
GRAPE-DR           |                      |                       |                                | Chip
CSX (ClearSpeed)   |                      |                       |                                | Chip
Cell               | 192 (single)         | 32                    | 0.17                           | Inside chip, SPE core
Cell (eDP)         | 1.33 PF (#) ?        | 12960*110 W ?         |                                | Roadrunner
GeForce8800GTX     | 518 (single)         |                       |                                | Chip
SX                 |                      |                       |                                | nodes; whole system: 15.4 MW
CMOS RISC (90 nm)  | ~10K                 | ~186K                 | ~18.6                          |