ORION2.0: A Fast and Accurate NoC Power and Area Model for Early-Stage Design Space Exploration Andrew B. Kahng ¶ Bin Li ‡ Li-Shiuan Peh ‡ Kambiz Samadi.

ORION2.0: A Fast and Accurate NoC Power and Area Model for Early-Stage Design Space Exploration
Andrew B. Kahng ¶, Bin Li ‡, Li-Shiuan Peh ‡, Kambiz Samadi ¶
¶ University of California, San Diego; ‡ Princeton University
April 21, 2009

Outline
- Motivation
- ORION2.0 Framework
- Dynamic Power Modeling
- Leakage Power Modeling
- Area Modeling
- Validation and Significance Assessment
- Conclusions

Motivation
- Many-core chips
  - NoCs are needed to interconnect many-core chips
- Power efficiency of NoCs is important
  - Performance used to be the primary concern
  - Now power efficiency is critical
  - 28% of total power in the Intel 80-core Teraflops chip is due to interconnection networks (routers + links)
- Need rapid power estimation to trade off alternative architectures
  - Rapid power-area tradeoffs at the architectural level
- Our goal: develop accurate models that are easily usable by system-level designers early in the design cycle

Related Work
- Real-chip power measurements (Isci et al. 03)
- RTL-level NoC power estimation (A. Banerjee et al. 07, and N. Banerjee et al. 04)
  - Simulation is slow
  - Requires detailed RTL modeling
  - Not suitable for early-stage NoC design space exploration
- Architectural-level power estimation
  - Interconnection network (Patel et al. 97): the model is not instantiated with architectural parameters, so it is not suitable for exploring tradeoffs in router microarchitecture
  - Uniprocessor power modeling (Wattch: Brooks et al. 00 and SimplePower: Ye et al. 00)
  - NoC power modeling (ORION 1.0: Wang et al. 02)
- ORION 1.0 has been widely used for early-stage design space exploration and NoC power-performance tradeoff analysis

ORION 1.0 Modeling Methodology
- Power models are derived for the major building blocks (FIFO, crossbar, and arbiter)
- For each component, a canonical structure is described in terms of architectural and technological parameters
- Detailed analysis is performed to determine parameterized capacitance equations
- Capacitance equations and switching activity estimates are combined to determine power consumption
- Power models are based on detailed estimates of gate and wire capacitance and switching activity

Limitations of ORION 1.0

Model parameters and the values used for the comparison:

Parameter    Description          ORION 1.0   ORION 2.0   Value
B            #buffers             yes         yes         16
F            flit-width           yes         yes         39
P            #ports               yes         yes         5
V            #virtual channels    yes         yes         2
X            #crossbar ports      yes         yes         5
tech         technology node      yes         yes         65nm
f_clk        clock frequency      yes         yes         5.1 GHz
V_dd         supply voltage       yes         yes         1.2V
N_pipeline   #pipeline stages     -           yes         -
App          application domain   -           yes         -
D            chip dimension       -           yes         -

Power: ORION 1.0 (V1) estimate vs. Intel 80-core:

Component   ORION 1.0 (mW)   Intel 80-core (mW)
Buffer      25.2             203.3
Crossbar    53.2             138.6
Arbiter     11.1             64.7
Link        -                212.5
Clock       -                304.9
Total       89.5             924

- Per-component power differs by up to 8.1X
- Total power differs by 10.3X
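
The two ratios quoted on this slide follow directly from the power numbers. As a quick check, the sketch below recomputes them using only the values from the slide; the dictionary and function names are illustrative and not part of ORION.

```python
# Per-router power in mW, copied from the slide (ORION 1.0 has no link or clock entry).
orion1_mw = {"buffer": 25.2, "crossbar": 53.2, "arbiter": 11.1, "total": 89.5}
intel_mw = {"buffer": 203.3, "crossbar": 138.6, "arbiter": 64.7, "total": 924.0}


def underestimate_ratio(component):
    """Intel 80-core power divided by the ORION 1.0 estimate for one component."""
    return intel_mw[component] / orion1_mw[component]


# Ratios rounded to one decimal place, as on the slide.
ratios = {name: round(underestimate_ratio(name), 1) for name in orion1_mw}
print(ratios)
```

The buffer entry gives the largest per-component ratio, and the total entry gives the overall ratio.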


ORION 2.0: Accurate NoC Router Models
- Inputs to ORION 2.0
  - Circuit implementation and buffering scheme: SRAM and register FIFO; MUX-tree and matrix crossbar; different arbitration schemes; hybrid buffering scheme
  - Architectural parameters: # of ports; # of buffers; # of xbar ports; # of VCs; voltage, frequency
  - Technology parameters: interconnect parameters; device parameters; scaling factors for future technologies; …
- Built on top of ORION 1.0
- Uses our automatic/semi-automatic flows to obtain technology inputs
- Provides significant accuracy improvement compared with ORION 1.0
[Figure: router block diagram with input buffers (Buf I/E/W/N/S), arbiter (req/grant I/E/W/N/S), crossbar (in/out I/E/W/N/S), link sources, write control, and request signals]

ORION 2.0 Improvements
- ORION 1.0
  - Power subcomponents: crossbar; links (dynamic power); arbiter (dynamic power); buffer (SRAM-based)
  - Area: router
- ORION 2.0
  - Power subcomponents: clock; crossbar; links (hybrid buffering, leakage power); arbiter (VC allocator model, leakage power); buffer (SRAM-based, flip-flop-based)
  - Model infrastructure: application-specific technology-level adjustment; updated capacitance and transistor sizes
  - Area: more accurate router area model; link area model

Model Technology Inputs
- Inputs for power calculation
  - Leakage current values (obtained from Liberty (.lib) / SPICE)
  - Input capacitance for different repeater sizes (Liberty, Predictive Technology Models (PTM))
- Inputs for area calculation
  - Wire dimensions (Interconnect Technology Format (ITF) / LEF / ITRS)
  - Cell area is available from Liberty; for future technologies, ITRS A-factors or proposed area models can be used
- We also provide data for (1) high-performance (HP) and (2) low-power (LOP) device types for 90nm and 65nm
- Scaling factors for 45nm and 32nm technologies were obtained from ITRS 2007 / MASTAR5.0


Dynamic Power Modeling
- Dynamic power: switching capacitance
- Clock power:
  - $P_{clk} = \alpha \times C_{clk} \times V_{dd}^2 \times f$, where $\alpha$ is the switching activity factor
  - $C_{clk} = C_{\text{sram-fifo}} + C_{\text{pipeline-registers}} + C_{\text{register-fifo}} + C_{\text{clock-wiring}}$
- Physical links: power is due to charging and discharging of the capacitive load
  - $P_d = \alpha \times C_{load} \times V_{dd}^2 \times f$, with $C_{load} = C_{ground} + C_{coupling} + C_{input}$
- Register-based FIFO: implemented as shift registers
- Virtual channel allocator: two models added
- Other components: we use ORION 1.0 models with updated transistor and technology parameters
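
The switching-power relation on this slide can be sketched in a few lines. This is only an illustration of the formula, not ORION code: the function names and the capacitance values are hypothetical, while the 1.2 V supply and 5.1 GHz clock are the values quoted earlier in the deck.

```python
def dynamic_power(activity, c_load, v_dd, f_clk):
    """Switching power in watts: activity factor x capacitance x Vdd^2 x frequency."""
    return activity * c_load * v_dd ** 2 * f_clk


def link_load_capacitance(c_ground, c_coupling, c_input):
    """Link load capacitance in farads: ground + coupling + input capacitance."""
    return c_ground + c_coupling + c_input


# Hypothetical capacitances (farads), chosen only to exercise the formula.
c_load = link_load_capacitance(c_ground=100e-15, c_coupling=150e-15, c_input=10e-15)
p_link = dynamic_power(activity=0.5, c_load=c_load, v_dd=1.2, f_clk=5.1e9)
print(f"link dynamic power: {p_link * 1e3:.3f} mW")
```

Because power is linear in capacitance and activity but quadratic in supply voltage, voltage is the most sensitive input to this estimate.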

Clock Power (1)
- Clock power depends heavily on the clock distribution topology; we assume an H-tree topology
- $C_{clk} = C_{\text{sram-fifo}} + C_{\text{pipeline-registers}} + C_{\text{register-fifo}} + C_{\text{clock-wiring}}$
- Memory structures: the precharge circuitry is a capacitive load on the clock network
  - Due to the precharge transistor $T_c$: $C_{chg} = C_g(T_c) + C_d(T_c)$
  - $C_{\text{sram-fifo}} = (P_r + P_w) \times F \times B \times C_{chg}$
  - where $P_r$, $P_w$, $F$, $B$ are #read ports, #write ports, flit-width, and #buffers, respectively
- Pipeline registers: due to the different stages in a router
  - Assume the D-flip-flop (DFF) as the building block for pipeline registers
  - $C_{\text{pipeline-registers}} = N_{pipeline} \times F \times C_{ff}$, where $C_{ff}$ is the DFF capacitance
- Register-based FIFO: due to the DFF capacitance used in the registers
  - $C_{\text{register-fifo}} = F \times B \times C_{ff}$
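
The three router-internal clock loads on this slide can be evaluated as below. This is a minimal sketch of the stated equations, not the ORION implementation: the class name, field names, and the unit capacitances are hypothetical; F = 39 and B = 16 are the flit-width and buffer count quoted earlier in the deck.

```python
from dataclasses import dataclass


@dataclass
class RouterClockLoad:
    read_ports: int       # P_r
    write_ports: int      # P_w
    flit_width: int       # F
    buffers: int          # B
    pipeline_stages: int  # N_pipeline
    c_chg: float          # precharge load C_g(T_c) + C_d(T_c), farads
    c_ff: float           # DFF capacitance, farads

    def c_sram_fifo(self):
        """(P_r + P_w) x F x B x C_chg."""
        ports = self.read_ports + self.write_ports
        return ports * self.flit_width * self.buffers * self.c_chg

    def c_pipeline_registers(self):
        """N_pipeline x F x C_ff."""
        return self.pipeline_stages * self.flit_width * self.c_ff

    def c_register_fifo(self):
        """F x B x C_ff."""
        return self.flit_width * self.buffers * self.c_ff


# Hypothetical unit capacitances and pipeline depth, used only for illustration.
load = RouterClockLoad(read_ports=1, write_ports=1, flit_width=39, buffers=16,
                       pipeline_stages=5, c_chg=1e-15, c_ff=2e-15)
print(load.c_sram_fifo(), load.c_pipeline_registers(), load.c_register_fifo())
```

The clock wiring term is left out here because it depends on the H-tree geometry rather than on the router parameters.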

Clock Power (2)
- Wiring load: due to (1) wiring and (2) clock tree buffers
- Example: 5-level H-tree clock distribution
  - [equation not captured in transcript], where $D$ and $C_w$ are chip dimension and per-unit-length wire capacitance, respectively
- The capacitive contribution of the clock buffers requires an estimate of the number of buffer stages, $k$
  - [equation not captured in transcript], where $R_{int}$, $C_{int}$, $R_d$, and $C_{gate}$ are clock tree network wire resistance, wire capacitance, drive resistance, and input gate capacitance of a minimum-size inverter, respectively
  - [equation not captured in transcript], where $\rho$, $C_{area}$, and $C_{fringe}$ are resistivity, unit area capacitance, and unit fringe capacitance, respectively
- $C_{\text{clock-wiring}} = k \, C_{gate} + C_{wire}$
- Clock leakage power is due to clock buffers

Repeater and Wire Power Models
- Repeaters (buffers) are used in links and in the clock tree network
- Leakage power has two main components: (1) sub-threshold leakage and (2) gate-tunneling current
  - Depending on design conditions, leakage power is computed at one of three temperatures: (1) 25 °C, (2) 80 °C, and (3) 110 °C
  - Both components depend linearly on device size:
    $p_s = (p_s^n + p_s^p) / 2$
    $p_s^n = k_0^n + k_1^n \times w_n$
    $p_s^p = k_0^p + k_1^p \times w_p$
- Dynamic power can be calculated as:
    $p_d = a \times c_l \times v_{dd}^2 \times f$
    $c_l = c_i + c_g + c_c$
  - $p_d$, $a$, $c_l$, $v_{dd}$, and $f$ are dynamic power, activity factor, load capacitance, supply voltage, and frequency, respectively
  - Load capacitance is composed of the input capacitance of the next repeater ($c_i$) and the ground ($c_g$) and coupling ($c_c$) capacitances of the driven wire
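
The repeater equations on this slide translate directly into code. The sketch below assumes nothing beyond those equations; the function names are illustrative, and the fitting coefficients k0/k1 and device widths are made-up numbers, since the real coefficients come from the technology characterization described in the deck.

```python
def repeater_leakage_power(w_n, w_p, k0_n, k1_n, k0_p, k1_p):
    """Average of the NMOS and PMOS leakage terms, each linear in device width."""
    p_s_n = k0_n + k1_n * w_n
    p_s_p = k0_p + k1_p * w_p
    return (p_s_n + p_s_p) / 2


def repeater_dynamic_power(a, c_i, c_g, c_c, v_dd, f):
    """a x c_l x v_dd^2 x f, with c_l = c_i + c_g + c_c."""
    c_l = c_i + c_g + c_c
    return a * c_l * v_dd ** 2 * f


# Hypothetical coefficients (W and W/m) and widths (m), for illustration only.
p_leak = repeater_leakage_power(w_n=1e-6, w_p=2e-6,
                                k0_n=1e-9, k1_n=2e-3,
                                k0_p=2e-9, k1_p=1e-3)
p_dyn = repeater_dynamic_power(a=0.5, c_i=10e-15, c_g=100e-15, c_c=150e-15,
                               v_dd=1.2, f=5.1e9)
print(f"leakage: {p_leak:.2e} W, dynamic: {p_dyn:.2e} W")
```

A separate set of k0/k1 coefficients would be needed for each of the three temperature conditions listed on the slide.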

Interconnect Optimization: Buffering
- Conventional delay-optimal buffering
  - Unrealistic buffer sizes
  - High dynamic / leakage power
  - Suboptimal
- Our approach: iterative optimization of a hybrid objective (power + delay)
  - Search for the optimal number and size of repeaters
  - Can be extended to other interconnect optimizations (e.g., wire sizing and driver sizing)
[Figure: Pareto-optimal frontier of the power-delay tradeoff of a 5mm interconnect in 90nm / 65nm]
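
To make the idea of a hybrid power + delay objective concrete, here is a small sketch that searches over repeater count and size. Several things in it are assumptions rather than the authors' method: it uses an exhaustive grid search instead of the iterative optimization mentioned on the slide, a textbook Elmore-style delay expression, total switched capacitance as the power proxy, and a weighted sum of normalized delay and power as the objective. All electrical values are hypothetical; only the 5 mm length comes from the slide.

```python
from itertools import product


def line_delay(n, s, length, r_w, c_w, r_d, c_g):
    """Elmore-style delay (s) of a wire split into n segments, each driven by a size-s repeater."""
    seg = length / n
    r_drv = r_d / s  # driver resistance shrinks with repeater size
    c_in = c_g * s   # input capacitance grows with repeater size
    stage = 0.69 * r_drv * (c_w * seg + c_in) + r_w * seg * (0.38 * c_w * seg + 0.69 * c_in)
    return n * stage


def switched_capacitance(n, s, length, c_w, c_g):
    """Power proxy (F): wire capacitance plus total repeater input capacitance."""
    return c_w * length + n * s * c_g


def best_buffering(length, r_w, c_w, r_d, c_g, weight, max_n=20, max_size=200):
    """Return (n, size, delay, capacitance) minimizing weight*delay + (1-weight)*power, both normalized."""
    designs = [
        (n, s, line_delay(n, s, length, r_w, c_w, r_d, c_g),
         switched_capacitance(n, s, length, c_w, c_g))
        for n, s in product(range(1, max_n + 1), range(1, max_size + 1))
    ]
    d_ref = min(d for _, _, d, _ in designs)
    c_ref = min(c for _, _, _, c in designs)
    return min(designs, key=lambda x: weight * x[2] / d_ref + (1.0 - weight) * x[3] / c_ref)


# Hypothetical 5 mm wire: 0.2 ohm/um, 0.2 fF/um; minimum inverter: 10 kohm, 1 fF.
WIRE = dict(length=5e-3, r_w=2e5, c_w=2e-10, r_d=1e4, c_g=1e-15)
delay_only = best_buffering(weight=1.0, **WIRE)
balanced = best_buffering(weight=0.5, **WIRE)
print("delay-only:", delay_only[:2], "balanced:", balanced[:2])
```

Sweeping `weight` between 0 and 1 traces out a set of power-delay tradeoff points, which is the kind of frontier the slide's figure refers to.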

Virtual Channel Allocator Model
- Provides three virtual channel (VC) allocation models
- Traditional two-stage VC allocator model
  - Most widely used
  - Power consumption increases rapidly as the number of VCs increases
- Added: one-stage VC allocator model
  - Lower power consumption
  - Lower matching probability
- Added: VC selection model
  - Proposed by Kumar et al., "A 4.6Tbits/s 3.6GHz Single-cycle NoC Router with a Novel Switch Allocator in 65nm CMOS", ICCD07
  - Low power and high performance


Leakage Power Modeling
- Leakage power: subthreshold and gate
  - From 65nm and beyond, gate leakage becomes significant
- $I'_{sub}(i,s)$ and $I'_{gate}(i,s)$ are the subthreshold and gate leakage currents per unit transistor width for a specific technology
- $W_{sub}(i,s)$ and $W_{gate}(i,s)$ are the effective widths of component $i$ at input state $s$ for subthreshold and gate leakage, respectively
- Key circuit components: INVx1, NAND2x1, NOR2x1, and DFF
- Leakage currents are computed at different transistor junction temperatures: (1) 110 °C, (2) 80 °C, and (3) 25 °C
- Same methodology as in ORION 1.0
- Leakage current values are all obtained through SPICE simulation using foundry SPICE models
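
The slide defines per-state effective widths and per-unit-width leakage currents, but the equation combining them did not survive in this transcript. The sketch below assumes one plausible combination: each input state contributes width times unit current for both leakage mechanisms, weighted by the probability of that state. The weighting rule, the function names, and every number are assumptions made for illustration.

```python
def leakage_current(states):
    """Expected leakage current of one component.

    states: iterable of (probability, w_sub, i_sub_unit, w_gate, i_gate_unit),
    one tuple per input state s. The probability weighting is an assumption.
    """
    total = 0.0
    for prob, w_sub, i_sub_unit, w_gate, i_gate_unit in states:
        total += prob * (w_sub * i_sub_unit + w_gate * i_gate_unit)
    return total


def leakage_power(i_leak, v_dd):
    """Leakage power: leakage current times supply voltage."""
    return i_leak * v_dd


# Hypothetical inverter: widths in um, unit currents in nA/um, equally likely inputs.
inv_states = [
    (0.5, 0.2, 100.0, 0.4, 10.0),  # input = 0
    (0.5, 0.4, 50.0, 0.2, 10.0),   # input = 1
]
i_inv = leakage_current(inv_states)        # nA
p_inv = leakage_power(i_inv, v_dd=1.2)     # nW
print(f"INV leakage: {i_inv:.1f} nA, {p_inv:.1f} nW")
```

In practice the unit currents would be taken from SPICE characterization at the chosen junction temperature, as the slide describes.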

Arbiter Leakage Power Model
- Three arbitration schemes: (1) matrix, (2) round-robin (RR), and (3) queuing
- Example: matrix arbiter with R requesters
  - One R×R matrix keeps the priorities
  - Grant logic can be implemented as a tree of NOR and INV gates, and the R×R matrix can be constructed using DFFs
  - NOR2, INV, and DFF represent the 2-input NOR gate, inverter gate, and DFF, respectively
- Further details on the modeling methodology in Chen et al. 2003


Router Area Model
- As the number of cores increases, the area occupied by communication components becomes significant (19% of total tile area in the Intel 80-core Teraflops chip)
- Gate area model by Yoshida et al. (DAC'04)
- Link area model by Carloni et al. (ASPDAC'08)
- Example: matrix arbiter
  $Area_{arbiter} = Area_{NOR2x1} \times 2(R-1)R + Area_{DFF} \times R(R-1)/2 + Area_{INVx1} \times R$
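
The matrix arbiter area expression is a weighted gate count, which the following sketch evaluates. The gate counts follow the slide's formula; the function names are illustrative and the cell areas are hypothetical placeholders for values that would come from a Liberty file.

```python
def matrix_arbiter_gate_counts(r):
    """Return (NOR2 gates, DFFs, inverters) for a matrix arbiter with r requesters."""
    nor2 = 2 * (r - 1) * r
    dff = r * (r - 1) // 2  # r*(r-1) is always even, so integer division is exact
    inv = r
    return nor2, dff, inv


def matrix_arbiter_area(r, area_nor2, area_dff, area_inv):
    """Arbiter area as the sum of cell areas weighted by gate counts."""
    nor2, dff, inv = matrix_arbiter_gate_counts(r)
    return area_nor2 * nor2 + area_dff * dff + area_inv * inv


# Hypothetical cell areas in um^2 for a 5-requester arbiter.
area = matrix_arbiter_area(r=5, area_nor2=1.0, area_dff=4.0, area_inv=0.5)
print(matrix_arbiter_gate_counts(5), area)
```

The NOR2 and DFF counts both grow quadratically with the number of requesters, so arbiter area is most sensitive to port and VC counts.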

Repeater and Wire Area Models
- For existing technologies, the area of a repeater can be calculated as:
    $a_r = \tau_0 + \tau_1 \times (w_n + w_p)$
  - $a_r$ denotes repeater area; $\tau_0$ and $\tau_1$ are coefficients obtained by linear regression; $w_n$ and $w_p$ are the widths of the NMOS and PMOS devices, respectively
- For future technologies, feature size ($F$), contacted pitch ($CP$), row height ($RH$), and cell width ($CW$) can be used to estimate the area ($F$ here is the feature size, not the flit-width):
    $NF = (w_p + w_n + 2 \times F) / RH$
    $CW = NF \times (F + CP) + CP$
    $a_r = RH \times CW$
- Wiring area can be calculated as:
    $a_w = (n \times (w_w + s_w) + s_w) \times L$
  - $a_w$ denotes wire area, $n$ is the bit width of the bus, and $w_w$, $s_w$, and $L$ are wire width, wire spacing, and wire length
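
The three area expressions on this slide can be evaluated as follows. This is a direct transcription of the formulas under the assumption that NF is used as a real number exactly as written (no rounding); the function names and all dimensions are hypothetical, with lengths in micrometers.

```python
def repeater_area_existing(w_n, w_p, tau0, tau1):
    """Regression-based repeater area: tau0 + tau1 * (w_n + w_p)."""
    return tau0 + tau1 * (w_n + w_p)


def repeater_area_future(w_n, w_p, feature, contacted_pitch, row_height):
    """Layout-based repeater area estimate for technologies without library data."""
    nf = (w_p + w_n + 2 * feature) / row_height             # NF
    cell_width = nf * (feature + contacted_pitch) + contacted_pitch  # CW
    return row_height * cell_width


def wire_area(n_bits, wire_width, wire_spacing, length):
    """Bus area: n wires with spacing on both sides, times wire length."""
    return (n_bits * (wire_width + wire_spacing) + wire_spacing) * length


# Hypothetical dimensions in um: a 39-bit, 5 mm bus and a small repeater.
a_w = wire_area(n_bits=39, wire_width=0.1, wire_spacing=0.1, length=5000.0)
a_r = repeater_area_future(w_n=1.0, w_p=2.0, feature=0.045,
                           contacted_pitch=0.18, row_height=1.4)
print(f"wire area: {a_w:.1f} um^2, repeater area: {a_r:.4f} um^2")
```

For a bus this long, the wiring term dominates the repeater term by several orders of magnitude, which is why link area is modeled separately from router area.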


ORION2.0: Validations and Results
- Validation: two Intel NoC chips
  - (1) Intel 80-core Teraflops: high-performance many-core design
  - (2) Intel SCC: ultra low-power communication core
- ORION2.0 offers significant accuracy improvement

Component   %diff (ORION 2.0 vs. Intel 80-core)
Buffer      -14.8
Crossbar    16.9
Arbiter     -9.0
Clock       -20.9
Link        8.8

[Figure legend: Intel 80-core, ORION 2.0, ORION 1.0]

Impact on System-Level Design
- Testcases
  - VPROC: video processor with 42 cores and 128-bit datawidth
  - dVOPD: dual video object plane decoder with 26 cores and 128-bit datawidth
- System-level impact: communication-driven synthesis in COSI-OCC
  - Accurate ORION 2.0 models lead to better-performing NoC
  - Relative power due to an additional port is not as high in ORION 2.0 as in ORION 1.0
[Figure: network diagrams with routers labeled R1 and R2]

Conclusions
- Accurate models can drive effective NoC design space exploration
- ORION 1.0 is inaccurate for current and future technology nodes
- Proposed accurate power and area models for network routers (ORION 2.0)
- Presented a reproducible methodology for extracting inputs to our models
- Maintained the ORION 1.0 interface while significantly improving the accuracy of the models
  - Switching to ORION 2.0 is easy!

ORION 2.0 Release
- ORION 2.0 website: http://www.princeton.edu/~peh/orion.html

System-Level NoC Power Modeling Example
- Polaris Toolchain (V. Soteriou, N. Eisley, H. Wang, B. Li, L.S. Peh, TVLSI'07)
