1 Power-Aware System on a Chip
A. Laffely, J. Liang, R. Tessier, C. A. Moritz, W. Burleson
University of Massachusetts Amherst
Boston Area Architecture Conference, 30 Jan 2003
{alaffely, jliang, tessier, moritz,
This material is based upon work supported by the National Science Foundation under Grant No Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
2 Motivation
Problem:
- Need low-power architectures for wireless DSP
- How to support dynamic clock and voltage scaling in heterogeneous systems with data-dependent workloads (granularity, overhead, control)
A Solution:
- Use the modularity of the SoC; apply scaling at the IP core level
- Apply discrete frequency and voltage scaling
- Use interconnect utilization measures and data rate requirements to dynamically control scaling
3 Overview Adaptive System-on-a-Chip Implementation Approach Preliminary Results Conclusions and Challenges
4 Adaptive System-on-a-Chip
- Tiled architecture with mesh interconnect
- Point-to-point communication pipeline
- Allows for heterogeneous cores
  - Differing sizes, clock rates, voltages
- Low-overhead core interface for on-chip bus substitute for streaming applications
- Based on static scheduling
  - Fast and predictable
[Figure labels: Tile, Proc, Multiplier, FPGA, Core, ctrl, Communication Interface, North, South, East, West]
5 aSoC Implementation technology Full custom
6 Some Results
- 9- and 16-core systems tested for IIR, MPEG encoding, and image processing applications
- ~2x the performance of the CoreConnect bus (burst and hierarchical)
- ~1.5x the performance of an oblivious routing network [1] (dynamic routing)
- Max speedup is 5x
[1] W. Dally and H. Aoki, "Deadlock-Free Adaptive Routing in Multicomputer Networks Using Virtual Channels," IEEE Transactions on Parallel and Distributed Systems, April 1993
7 Dynamic Properties of Statically Routed System?
- Dynamically parameterizable cores proposed to save power
  - Motion Estimation core (by P. Jain, UMass) changes from 256 cycles/pixel to 16 cycles/pixel based on input data
- Streams within the scheduled communication pipeline can be blocked and back up, or go unused
- Inefficient to simply run at the fastest rate
[Figure labels: ME, DCT, Latency Changes with Data, Scheduled Communications]
8 Key Features for Dynamic Power Reduction
- SoC modularity
  - Sets a manageable granularity for voltage scaling
- Heterogeneous cores
  - Multiple on-chip clocks and voltages already supported
  - Core interface already handles synchronization and level conversion
- Statically scheduled
  - Interconnect traffic indicates system bottlenecks
9 Approach
- Stream-based cores
  - Limited buffering
- Core-ports
  - Single buffer for each stream to cross the clock/voltage barrier between core and interface
  - Reading/writing success rates indicate core utilization
    - Input blocked: core too slow
    - Output blocked: core too fast
- Controller
  - Interprets core-port success rates to adjust local clock and voltage
[Figure labels: Processing Pipeline, Interconnect Buffer, Input Core-port, Core, Output Core-port, Blocked, Clock and Supply Controller, Local Vdd, Local Clock]
10 Power-Aware System: Core Utilization Measurement
- Accumulate failures at each core-port to control clock changes
  - Blocked: add 1
  - Success: subtract 1
- Threshold and compare input and output failure counts
  - Many input, few output: increase frequency
  - Many output, few input: decrease frequency
  - Many or few of both: do nothing
[Figure labels: Interconnect Interface, Core-port In, Core, Core-port Out, In/Out Data, Blocked, count, Compare and Threshold, Increase or Decrease Local Clock]
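The counting and comparison rule on this slide can be sketched in software as follows. This is only an illustrative model, not the hardware described in the talk: the class and function names, the saturation limits (floor at 0, ceiling at `max_count`), and the threshold value are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass
class PortCounter:
    """Up/down counter of blocked transfers at one core-port (hypothetical model)."""
    count: int = 0
    max_count: int = 15  # assumed saturation limit, not given on the slide

    def record(self, blocked: bool) -> None:
        # Blocked transfer: add 1. Successful transfer: subtract 1.
        # Clamping to [0, max_count] is an assumption of this sketch.
        if blocked:
            self.count = min(self.count + 1, self.max_count)
        else:
            self.count = max(self.count - 1, 0)


def clock_decision(in_count: int, out_count: int, threshold: int = 8) -> int:
    """Return +1 (increase frequency), -1 (decrease frequency) or 0 (hold).

    The threshold of 8 is an assumed value chosen for illustration.
    """
    many_in = in_count >= threshold
    many_out = out_count >= threshold
    if many_in and not many_out:
        return +1  # input backs up: core is too slow
    if many_out and not many_in:
        return -1  # output backs up: core is too fast
    return 0       # many or few of both: do nothing
```

For example, ten consecutive blocked reads with successful writes drive the input count past the assumed threshold, so `clock_decision` asks for a faster clock.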
11 Power-Aware System: Local Clock Selection
- Derived from high-frequency global clock
- 8 possible values (Global Clock / $2^n$)
- Move one step up or down each transition
[Figure labels: Global Clock, count, /128, /64, /32, /16, /8, /4, /2, /1, From Rate Measurement, Core Local Clock]
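A minimal sketch of the divider selection, assuming the eight settings are held in a table ordered from slowest to fastest and that a request beyond either end is simply ignored. The function names and the clamping behavior are assumptions of this sketch.

```python
# Divider settings from the slide, ordered slowest to fastest (global clock / 2**n).
DIVIDERS = [128, 64, 32, 16, 8, 4, 2, 1]


def step_clock(index: int, decision: int) -> int:
    """Move at most one position along DIVIDERS per transition.

    decision > 0 asks for a faster clock, decision < 0 for a slower one.
    Clamping at the ends of the table is an assumption of this sketch.
    """
    step = max(-1, min(1, decision))
    return max(0, min(len(DIVIDERS) - 1, index + step))


def local_frequency(global_hz: float, index: int) -> float:
    """Local clock frequency for the divider at the given table index."""
    return global_hz / DIVIDERS[index]
```

Because each transition moves only one position, going from the slowest to the fastest setting takes seven transitions.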
12 Power-Aware System: Voltage Selection System
- Choose one of 4 supply voltages
- Look-up table (LUT) used to match voltage to frequency setting for a specific core
- Using cascading buffers, core Vdd can change within 30 ns (250 nm technology)
[Figure labels: From Clock Selector, LUT, V1, V2, V3, V4, Core Local Supply]
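The per-core look-up table might be modeled as below. The table contents are hypothetical: the sketch assumes V1 is the highest supply and V4 the lowest, and that every divider slower than /8 shares the lowest supply. A real table would come from characterizing the specific core.

```python
from enum import Enum


class Supply(Enum):
    V1 = 1  # assumed highest supply
    V2 = 2
    V3 = 3
    V4 = 4  # assumed lowest supply


# Hypothetical LUT for one core: clock divider -> supply rail.
VOLTAGE_LUT = {
    1: Supply.V1,
    2: Supply.V2,
    4: Supply.V3,
    8: Supply.V4,
    16: Supply.V4,
    32: Supply.V4,
    64: Supply.V4,
    128: Supply.V4,
}


def select_supply(divider: int) -> Supply:
    """Return the supply rail matched to the current clock divider."""
    return VOLTAGE_LUT[divider]
```

Keeping one table per core lets cores with different critical paths use the same four rails at different frequency settings.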
13 Vdd Selection Criteria
- As Vdd decreases, delay increases exponentially
- Use the curve to match available clock frequencies to voltages
- The voltage drop reduces power by 70%, 84%, and 89%
- $P = C\,V_{dd}^{2}\,f$
[Figure: Normalized Core Critical Path Delay vs. Vdd; axes: Voltage, Normalized Delay; marked points: Max Speed, 1/2 Speed, 1/4 Speed, 1/8 Speed]
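The arithmetic behind these figures can be checked with a short sketch. It rests on two assumptions that the slide does not state: that the 70%, 84%, and 89% reductions come from the Vdd-squared term alone, and that they correspond to the 1/2, 1/4, and 1/8 speed points in that order. Under those assumptions the sketch recovers the implied Vdd ratio and the combined relative power once the lower frequency is included.

```python
import math


def relative_dynamic_power(v_ratio: float, f_ratio: float) -> float:
    """P / P_max for P = C * Vdd**2 * f, with C unchanged."""
    return v_ratio ** 2 * f_ratio


def implied_voltage_ratio(power_reduction: float) -> float:
    """Vdd / Vdd_max implied by a fractional power reduction from voltage alone."""
    return math.sqrt(1.0 - power_reduction)


# (frequency ratio, quoted power reduction); this pairing is an assumption.
SPEED_STEPS = [(0.5, 0.70), (0.25, 0.84), (0.125, 0.89)]


def summarize():
    """Return (f_ratio, implied v_ratio, combined relative power) per step."""
    rows = []
    for f_ratio, reduction in SPEED_STEPS:
        v_ratio = implied_voltage_ratio(reduction)
        rows.append((f_ratio, v_ratio, relative_dynamic_power(v_ratio, f_ratio)))
    return rows
```

With these assumptions, the 1/4 speed point implies a supply of about 0.4 of the maximum and a combined dynamic power of about 4% of the full-speed value.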
14 Power Savings
- Two-core system
  - ME chooses among 3 different algorithms based on input data
  - DCT runs at a constant rate
- Core power from Synopsys RTL simulation
[Figure labels: ME, DCT]
15 Test System Results
- Simple test case
  - Core 1 starts 16x too fast
  - Core 2 starts 8x too slow
[Figure: Relative Clock Frequency vs. Number of Clock Cycles for Core 1 and Core 2]
16 Key Issues
- Count value required to control frequency shifting?
  - May be application and core dependent
- Core characterization
  - Not easy, data dependent
  - Some tools exist for StrongARM (JouleTrack, A. Sinha, MIT)
- Benchmark development
  - A bit tedious
17 Conclusions
- SoC: a good candidate platform for voltage scaling implementation
  - Convenient granularity
  - Low overhead
  - Easily measurable control mechanism
- Hardware
  - Preliminary results
  - Now test real benchmarks and data