ORION2.0: A Fast and Accurate NoC Power and Area Model for Early-Stage Design Space Exploration Andrew B. Kahng ¶ Bin Li ‡ Li-Shiuan Peh ‡ Kambiz Samadi.

Presentation transcript:

ORION2.0: A Fast and Accurate NoC Power and Area Model for Early-Stage Design Space Exploration Andrew B. Kahng ¶ Bin Li ‡ Li-Shiuan Peh ‡ Kambiz Samadi ¶ ¶ University of California, San Diego ‡ Princeton University April 21,

Outline
- Motivation
- ORION 2.0 Framework
- Dynamic Power Modeling
- Leakage Power Modeling
- Area Modeling
- Validation and Significance Assessment
- Conclusions

Motivation
- Many-core chips
  - NoCs are needed to interconnect many-core chips
- Power efficiency of NoCs is important
  - Performance used to be the primary concern
  - Now power efficiency is critical
  - 28% of total power in the Intel 80-core Teraflops chip is due to interconnection networks (routers + links)
- Rapid power estimation is needed to trade off alternative architectures
  - Rapid power-area tradeoffs at the architectural level
- Our goal: develop accurate models that are easily usable by system-level designers early in the design cycle

Related Work
- Real-chip power measurements (Isci et al. 03)
- RTL-level NoC power estimation (A. Banerjee et al. 07 and N. Banerjee et al. 04)
  - Simulation is slow
  - Requires detailed RTL modeling
  - Not suitable for early-stage NoC design space exploration
- Architectural-level power estimation
  - Interconnection network (Patel et al. 97): the model is not instantiated with architectural parameters, so it is not suitable for exploring tradeoffs in router microarchitecture
  - Uniprocessor power modeling (Wattch: Brooks et al. 00 and SimplePower: Ye et al. 00)
  - NoC power modeling (ORION 1.0: Wang et al. 02)
- ORION 1.0 has been widely used in early-stage design space exploration for NoC power-performance tradeoff analysis

ORION 1.0 Modeling Methodology
- Power models are derived for the major building blocks (FIFO, crossbar, and arbiter)
- For each component, a canonical structure is described in terms of architectural and technological parameters
- Detailed analysis is performed to determine parameterized capacitance equations
- Capacitance equations and switching activity estimates are combined to determine power consumption
- Power models are based on detailed estimates of gate and wire capacitance and switching activity

Limitations of ORION 1.0

| Parameter  | Description         | ORION 1.0 | ORION 2.0 |
|------------|---------------------|-----------|-----------|
| B          | #buffers            | yes       | yes       |
| F          | flit-width          | yes       | yes       |
| P          | #ports              | yes       | yes       |
| V          | #virtual channels   | yes       | yes       |
| X          | #crossbar ports     | yes       | yes       |
| tech       | technology node     | yes       | yes       |
| f_clk      | clock frequency     | yes       | yes       |
| V_dd       | supply voltage      | yes       | yes       |
| N_pipeline | #pipeline stages    | -         | yes       |
| App        | application domain  | -         | yes       |
| D          | chip dimension      | -         | yes       |

[Table: power (mW) of ORION 1.0 (V1) vs. the Intel 80-core chip for buffer, crossbar, arbiter, link, clock, and total. The power values are not preserved in this transcript; the surviving entries are a 5.1 GHz clock frequency, a 1.2 V supply voltage, and the annotations "Up to 8.1X diff." and "10.3X diff."]

ORION 2.0: Accurate NoC Router Models
- Inputs to ORION 2.0
  - Circuit implementation and buffering scheme: SRAM and register FIFO; MUX-tree and matrix crossbar; different arbitration schemes; hybrid buffering scheme
  - Architectural parameters: # of ports; # of buffers; # of xbar ports; # of VCs; voltage, frequency
  - Technology parameters: interconnect parameters; device parameters; scaling factors for future technologies; …
- [Figure: router block diagram with input buffers (Buf I, E, W, N, S), crossbar (in/out I, E, W, N, S), arbiter (req/grant I, E, W, N, S), link and source blocks, write control, and request signals]
- Built on top of ORION 1.0
- Uses our automatic/semi-automatic flows to obtain technology inputs
- Provides significant accuracy improvement compared with ORION 1.0

ORION 2.0 Improvements
- Power subcomponents
  - ORION 1.0: crossbar; links (dynamic power); arbiter (dynamic power); buffer (SRAM-based)
  - ORION 2.0: clock; crossbar; links (hybrid buffering, leakage power); arbiter (VC allocator model, leakage power); buffer (SRAM-based, flip-flop-based)
- Model infrastructure
  - ORION 2.0: application-specific technology-level adjustment; updated capacitance and transistor sizes
- Area
  - ORION 1.0: area (router)
  - ORION 2.0: more accurate router area model; link area model

Model Technology Inputs
- Inputs for power calculation
  - Leakage current values (obtained from Liberty (.lib) / SPICE)
  - Input capacitance for different repeater sizes (Liberty, Predictive Technology Models (PTM))
- Inputs for area calculation
  - Wire dimensions (Interconnect Technology Format (ITF) / LEF / ITRS)
  - Cell area is available from Liberty; for future technologies, ITRS A-factors or proposed area models can be used
- We also provide data for (1) high-performance (HP) and (2) low-power (LOP) device types for 90nm and 65nm
- Scaling factors for 45nm and 32nm technologies were obtained from ITRS 2007 / MASTAR5.0

Dynamic Power Modeling
- Dynamic power: switching capacitance
- Clock power:
  - $P_{clk} = \alpha \times C_{clk} \times V_{dd}^2 \times f$ (α: switching activity factor)
  - $C_{clk} = C_{\text{sram-fifo}} + C_{\text{pipeline-registers}} + C_{\text{register-fifo}} + C_{\text{wiring}}$
- Physical links: power is due to charging and discharging of the capacitive load
  - $P_d = \alpha \times C_{load} \times V_{dd}^2 \times f$, with $C_{load} = C_{ground} + C_{coupling} + C_{input}$
- Register-based FIFO: implemented as shift registers
- Virtual channel allocator: two models added
- Other components: we use ORION 1.0 models with updated transistor and technology parameters

Clock Power (1)
- Clock power depends heavily on the clock distribution topology; we assume an H-tree topology
- $C_{clk} = C_{\text{sram-fifo}} + C_{\text{pipeline-registers}} + C_{\text{register-fifo}} + C_{\text{clock-wiring}}$
- Memory structures: the precharge circuitry is a capacitive load on the clock network
  - Due to the precharge transistor $T_c$: $C_{chg} = C_g(T_c) + C_d(T_c)$
  - $C_{\text{sram-fifo}} = (P_r + P_w) \times F \times B \times C_{chg}$
  - where $P_r$, $P_w$, $F$, and $B$ are #read ports, #write ports, flit-width, and #buffers, respectively
- Pipeline registers: due to the different stages in a router
  - We assume a D flip-flop (DFF) as the building block for pipeline registers
  - $C_{\text{pipeline-registers}} = N_{pipeline} \times F \times C_{ff}$, where $C_{ff}$ is the DFF capacitance
- Register-based FIFO: due to the DFF capacitance used in the registers
  - $C_{\text{register-fifo}} = F \times B \times C_{ff}$
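The clock-load terms on this slide can be sketched in a few lines of Python. This is an illustrative sketch, not ORION 2.0 code: the class and function names are made up, the split into separate SRAM and register buffer counts is an assumption (so that an unused FIFO type contributes zero), and the activity factor in `clock_power` is assumed to default to 1.

```python
from dataclasses import dataclass


@dataclass
class ClockLoad:
    """Inputs to the clock-capacitance sum (names are hypothetical)."""
    read_ports: int        # P_r
    write_ports: int       # P_w
    flit_width: int        # F, bits per flit
    sram_buffers: int      # B for the SRAM FIFO (0 if unused; assumption)
    register_buffers: int  # B for the register FIFO (0 if unused; assumption)
    pipeline_stages: int   # N_pipeline
    c_chg: float           # C_chg = C_g(T_c) + C_d(T_c), in farads
    c_ff: float            # C_ff, clock load of one DFF, in farads
    c_clock_wiring: float  # C_clock-wiring, in farads


def clock_capacitance(p: ClockLoad) -> float:
    """C_clk = C_sram-fifo + C_pipeline-registers + C_register-fifo + C_clock-wiring."""
    c_sram_fifo = (p.read_ports + p.write_ports) * p.flit_width * p.sram_buffers * p.c_chg
    c_pipeline = p.pipeline_stages * p.flit_width * p.c_ff
    c_register_fifo = p.flit_width * p.register_buffers * p.c_ff
    return c_sram_fifo + c_pipeline + c_register_fifo + p.c_clock_wiring


def clock_power(c_clk: float, vdd: float, freq: float, activity: float = 1.0) -> float:
    """P_clk = activity * C_clk * V_dd^2 * f (activity default is an assumption)."""
    return activity * c_clk * vdd ** 2 * freq
```

All capacitance values passed in are placeholders the caller must supply from a technology library; none are provided here.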

Clock Power (2)
- Wiring load: due to (1) wiring and (2) clock tree buffers
- Example: 5-level H-tree clock distribution
  - [Equation for $C_{wire}$ not preserved in the transcript], where $D$ and $C_w$ are chip dimension and per-unit-length wire capacitance, respectively
- The capacitive contribution of the clock buffers requires estimating the number of buffer stages, $k$
  - [Equation for $k$ not preserved in the transcript], where $R_{int}$, $C_{int}$, $R_d$, and $C_{gate}$ are clock tree network wire resistance, wire capacitance, drive resistance, and input gate capacitance of a minimum-size inverter, respectively
  - [Equations for $R_{int}$ and $C_{int}$ not preserved in the transcript], where $\rho$, $C_{area}$, and $C_{fringe}$ are resistivity, unit area capacitance, and unit fringe capacitance, respectively
- $C_{\text{clock-wiring}} = k C_{gate} + C_{wire}$
- Clock leakage power is due to clock buffers

Repeater and Wire Power Models
- Repeaters (buffers) are used in links and in the clock tree network
- Leakage power has two main components: (1) sub-threshold leakage and (2) gate-tunneling current
  - Depending on design conditions, we compute the leakage power at different temperature conditions: (1) 25 °C, (2) 80 °C, and (3) 110 °C
  - Both components depend linearly on device size:
    - $p_s = (p_s^n + p_s^p) / 2$
    - $p_s^n = k_0^n + k_1^n \times w_n$
    - $p_s^p = k_0^p + k_1^p \times w_p$
- Dynamic power can be calculated as:
  - $p_d = a \times c_l \times v_{dd}^2 \times f$
  - $c_l = c_i + c_g + c_c$
  - $p_d$, $a$, $c_l$, $v_{dd}$, and $f$ are dynamic power, activity factor, load capacitance, supply voltage, and frequency, respectively
  - Load capacitance is composed of the input capacitance of the next repeater ($c_i$) and the ground ($c_g$) and coupling ($c_c$) capacitances of the driven wire
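As a quick illustration of the two repeater formulas on this slide, the sketch below evaluates the linear leakage fit and the dynamic power expression. The function names are hypothetical, and the regression coefficients k0/k1 must come from the user's own characterization; no real technology values are implied.

```python
def repeater_leakage_power(w_n, w_p, k0_n, k1_n, k0_p, k1_p):
    """p_s = (p_s^n + p_s^p) / 2, with each term linear in device width."""
    p_s_n = k0_n + k1_n * w_n  # NMOS contribution
    p_s_p = k0_p + k1_p * w_p  # PMOS contribution
    return (p_s_n + p_s_p) / 2


def repeater_dynamic_power(a, c_i, c_g, c_c, v_dd, f):
    """p_d = a * c_l * v_dd^2 * f, with c_l = c_i + c_g + c_c."""
    c_l = c_i + c_g + c_c  # next-repeater input + wire ground + wire coupling
    return a * c_l * v_dd ** 2 * f
```

Since the coefficients depend on temperature, a caller would keep one coefficient set per temperature condition (25 °C, 80 °C, 110 °C) and pick the set matching the design condition.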

Interconnect Optimization: Buffering
- Conventional delay-optimal buffering
  - Unrealistic buffer sizes
  - High dynamic / leakage power
  - Suboptimal
- Our approach: iterative optimization of a hybrid objective (power + delay)
  - Search for the optimal number and size of repeaters
  - Can be extended to other interconnect optimizations (e.g., wire sizing and driver sizing)
- [Figure: Pareto-optimal frontier of the power-delay tradeoff of a 5mm interconnect in 90nm / 65nm]
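To make the hybrid-objective idea concrete, here is a minimal sketch that searches over the number and size of repeaters for a weighted sum of normalized delay and power. It is not the authors' optimizer: it swaps the iterative search for an exhaustive grid search, uses a textbook Elmore-style delay expression as an assumed delay model, and every technology number in `TECH` is a hypothetical placeholder.

```python
from itertools import product

# Hypothetical technology numbers, for illustration only (not ORION 2.0 data).
TECH = {
    "r_wire": 2.0e5,    # wire resistance per metre (ohm/m)
    "c_wire": 2.0e-10,  # wire capacitance per metre (F/m)
    "r_drv": 1.0e4,     # output resistance of a minimum-size repeater (ohm)
    "c_in": 1.0e-15,    # input capacitance of a minimum-size repeater (F)
    "k0": 1.0e-8,       # leakage intercept per repeater (W)
    "k1": 2.0e-8,       # leakage slope per unit repeater size (W)
    "vdd": 1.2,
    "freq": 1.0e9,
    "activity": 0.15,
}


def delay(n, s, length, t=TECH):
    """Elmore-style delay of a wire split into n segments, each driven by a size-s repeater."""
    seg = length / n
    r_w, c_w = t["r_wire"] * seg, t["c_wire"] * seg
    per_segment = (0.69 * (t["r_drv"] / s) * (c_w + s * t["c_in"])
                   + r_w * (0.38 * c_w + 0.69 * s * t["c_in"]))
    return n * per_segment


def power(n, s, length, t=TECH):
    """Dynamic power of wire plus repeater inputs, plus linear-in-size repeater leakage."""
    c_total = t["c_wire"] * length + n * s * t["c_in"]
    dynamic = t["activity"] * c_total * t["vdd"] ** 2 * t["freq"]
    leakage = n * (t["k0"] + t["k1"] * s)
    return dynamic + leakage


def optimize(length, weight, max_n=20, max_s=100, t=TECH):
    """Return (n, s) minimizing weight*delay + (1-weight)*power, both normalized."""
    grid = list(product(range(1, max_n + 1), range(1, max_s + 1)))
    d_ref = min(delay(n, s, length, t) for n, s in grid)  # best achievable delay
    p_ref = min(power(n, s, length, t) for n, s in grid)  # best achievable power

    def cost(ns):
        n, s = ns
        return (weight * delay(n, s, length, t) / d_ref
                + (1.0 - weight) * power(n, s, length, t) / p_ref)

    return min(grid, key=cost)
```

With `weight=1.0` the search reduces to delay-optimal buffering; lowering the weight trades delay for power, and sweeping it traces out points on a power-delay frontier like the one in the figure.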

Virtual Channel Allocator Model
- Provides three virtual channel (VC) allocation models
- Traditional two-stage VC allocator model
  - Most widely used
  - Power consumption increases rapidly as the number of VCs increases
- Added: one-stage VC allocator model
  - Lower power consumption
  - Lower matching probability
- Added: VC selection model
  - Proposed by Kumar et al., "A 4.6Tbits/s 3.6GHz Single-cycle NoC Router with a Novel Switch Allocator in 65nm CMOS", ICCD07
  - Low power and high performance

Leakage Power Modeling
- Leakage power: subthreshold and gate
  - From 65nm onward, gate leakage becomes significant
  - $I'_{sub}(i,s)$ and $I'_{gate}(i,s)$ are subthreshold and gate leakage currents per unit transistor width for a specific technology
  - $W_{sub}(i,s)$ and $W_{gate}(i,s)$ are the effective widths of component $i$ at input state $s$ for subthreshold and gate leakage, respectively
- Key circuit components: INVx1, NAND2x1, NOR2x1, and DFF
- Leakage currents are computed at different transistor junction temperatures: (1) 110 °C, (2) 80 °C, and (3) 25 °C
- Same methodology as in ORION 1.0
- Leakage current values are all obtained through SPICE simulation using foundry SPICE models

Arbiter Leakage Power Model
- Three arbitration schemes: (1) matrix, (2) round-robin (RR), and (3) queuing
- Example: matrix arbiter with R requesters
  - One R×R matrix keeps the priorities
  - The grant logic can be implemented as a tree of NOR and INV gates, and the R×R matrix can be constructed using DFFs
  - NOR2, INV, and DFF represent a 2-input NOR gate, an inverter gate, and a D flip-flop, respectively
- Further details on the modeling methodology are in Chen et al

Router Area Model
- As the number of cores increases, the area occupied by communication components becomes significant (19% of total tile area in the Intel 80-core Teraflops chip)
- Gate area model by Yoshida et al. (DAC'04)
- Link area model by Carloni et al. (ASPDAC'08)
- Matrix arbiter:
  - $Area_{arbiter} = Area_{NOR2x1} \times 2(R-1)R + Area_{DFF} \times R(R-1)/2 + Area_{INVx1} \times R$
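The matrix arbiter area expression is a gate count multiplied by per-cell areas, which the following sketch makes explicit. The function name and argument names are hypothetical, and the cell areas are inputs the caller would read from a Liberty file or similar source.

```python
def matrix_arbiter_area(requesters, area_nor2x1, area_dff, area_invx1):
    """Area of a matrix arbiter with R requesters, per the slide's formula."""
    r = requesters
    n_nor2 = 2 * (r - 1) * r    # NOR2x1 gates in the grant logic
    n_dff = r * (r - 1) // 2    # DFFs holding the priority matrix (always an integer)
    n_inv = r                   # INVx1 gates, one per requester
    return area_nor2x1 * n_nor2 + area_dff * n_dff + area_invx1 * n_inv
```

For example, a 5-requester arbiter uses 40 NOR2x1 gates, 10 DFFs, and 5 inverters, so its area grows roughly quadratically with the number of requesters.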

Repeater and Wire Area Models
- For existing technologies, the area of a repeater can be calculated as:
  - $a_r = \tau_0 + \tau_1 \times (w_n + w_p)$
  - $a_r$ denotes repeater area; $\tau_0$ and $\tau_1$ are coefficients obtained by linear regression; $w_n$ and $w_p$ are the widths of the NMOS and PMOS devices, respectively
- For future technologies, feature size ($F$), contacted pitch ($CP$), row height ($RH$), and cell width ($CW$) can be used to estimate the area:
  - $NF = (w_p + w_n + 2 \times F) / RH$
  - $CW = NF \times (F + CP) + CP$
  - $a_r = RH \times CW$
- Wiring area can be calculated as:
  - $a_w = (n \times (w_w + s_w) + s_w) \times L$
  - $a_w$ denotes wire area, $n$ is the bit width of the bus, and $w_w$, $s_w$, and $L$ are wire width, wire spacing, and wire length, respectively
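The three area expressions on this slide translate directly into code. The sketch below is illustrative only: function names are hypothetical, NF is left as a real-valued quantity exactly as the slide writes it (the slide does not define NF or say whether it is rounded), and all inputs must share consistent length units.

```python
def repeater_area_existing(w_n, w_p, tau0, tau1):
    """a_r = tau0 + tau1 * (w_n + w_p); tau0, tau1 come from a linear regression."""
    return tau0 + tau1 * (w_n + w_p)


def repeater_area_future(w_n, w_p, feature_size, contacted_pitch, row_height):
    """Estimate repeater area from layout pitches when no cell library exists."""
    nf = (w_p + w_n + 2 * feature_size) / row_height              # NF, as on the slide
    cell_width = nf * (feature_size + contacted_pitch) + contacted_pitch  # CW
    return row_height * cell_width                                # a_r = RH * CW


def wire_area(bus_bits, wire_width, wire_spacing, length):
    """a_w = (n * (w_w + s_w) + s_w) * L for an n-bit bus."""
    return (bus_bits * (wire_width + wire_spacing) + wire_spacing) * length
```

The extra `s_w` term in the wire area accounts for the spacing on the outer edge of the bus, so an n-bit bus occupies n wire pitches plus one additional spacing.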

ORION 2.0: Validations and Results
- Validation: two Intel NoC chips
  - (1) Intel 80-core Teraflops: high-performance many-core design
  - (2) Intel SCC: ultra low-power communication core
- ORION 2.0 offers significant accuracy improvement
- [Table: %diff (ORION 2.0 vs. Intel 80-core) for buffer, crossbar, arbiter, clock, and link; values not preserved in the transcript]
- [Figure: chart comparing Intel 80-core, ORION 2.0, and ORION 1.0]

Impact on System-Level Design
- Testcases
  - VPROC: video processor with 42 cores and 128-bit datawidth
  - dVOPD: dual video object plane decoder with 26 cores and 128-bit datawidth
- System-level impact: communication-driven synthesis in COSI-OCC
  - Accurate ORION 2.0 models lead to a better-performing NoC
  - Relative power due to an additional port is not as high in ORION 2.0 as in ORION 1.0
- [Figure: NoC topologies showing routers R1 and R2]

Conclusions
- Accurate models can drive effective NoC design space exploration
- ORION 1.0 is inaccurate for current and future technology nodes
- Proposed accurate power and area models for network routers (ORION 2.0)
- Presented a reproducible methodology for extracting inputs to our models
- Maintained the ORION 1.0 interface while significantly improving the accuracy of the models, so switching to ORION 2.0 is easy!

ORION 2.0 Release
- ORION 2.0 Website:

System-Level NoC Power Modeling Example V. Soteriou, N. Eisley, H. Wang, B. Li, L.S. Peh, TVLSI’07 Polaris Toolchain