Instruction-based System-level Power Evaluation of System-on-a-chip Peripheral Cores
Tony Givargis, Frank Vahid*
Dept. of Computer Science & Engineering, University of California, Riverside
*also with the Center for Embedded Computer Systems, UC Irvine
Joerg Henkel
NEC C&C Research, Princeton, New Jersey
This work was supported by the National Science Foundation under grant # CCR , and by a Design Automation Conference graduate scholarship.
System-on-a-chip (SOC)
Want to explore alternative cores, parameter settings, and applications.
Gate/RT-level simulation is too slow for this exploration.
[Figure: an SOC with Microprocessor, Cache, Memory, Bridge, Peripheral1, and Peripheral2, running Application1 or Application2; a core database offers Peripheral1, Peripheral2_a, Peripheral2_b, ….]
SOC System-level Power Estimation
Microprocessor:
  Tiwari/Malik/Wolfe 94: instruction set simulator
  Marculescu/Pedram 96: instruction trace reduction
Plus cache, memory & bus:
  Simunic/Benini/DeMicheli 99: extended instruction simulator
  Givargis/Vahid/Henkel 99: trace reductions
Still need a system-level method for peripherals: the 3-step method.
[Figure: two views of the same SOC (Microprocessor, Cache, Memory, Bridge, Peripheral, Application), one labeled "SOC: System-level model" and one labeled "SOC: Gate-level model".]
Core Provider's Step 1: Instruction-based System-Level Model Creation
A system simulation model is already commonly used, and is required in the VSIA standard.
It executes ~1000x faster than a gate-level model.
[Figure: core database containing UART, JPEG decode, ….; the UART system-level model consists of the instructions Reset() …, Enable_tx() …, Enable_rx() …, Send() …, Receive() ….]
Core Provider's Step 2: Low-level Per-instruction Power Evaluation
Measure the power of the gate/layout model, per instruction.
Use a unique testbench per instruction; this may take hours or days.
The low-level model differentiates cores from other SOC modules, enabling accurate power estimation.

UART instruction   Energy
Reset              13 µJ
Enable_tx          23 µJ
Enable_rx          18 µJ
Send               76 µJ
Receive            44 µJ

Must account for core parameters, here the buffer size:

UART instruction   Energy by buffer size (2 bytes, 4 bytes, 8 bytes, 16 bytes)
Reset              13 µJ, 14 µJ
Enable_tx          23 µJ, 25 µJ, 24 µJ
Enable_rx          18 µJ, 19 µJ
Send               76 µJ, 77 µJ, 89 µJ, 115 µJ
Receive            44 µJ, 49 µJ, 55 µJ, 64 µJ
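The parameterized table above can be read as a lookup keyed by instruction and buffer size. The following is a minimal sketch, not the authors' implementation: the names BUFFER_SIZES, ENERGY_UJ, and energy_uj are hypothetical, only the two rows that list a value for every buffer size (Send and Receive) are used, and the values are assumed to be in microjoules, as the uJtot accumulator in Step 3 suggests.

```python
# Hypothetical per-instruction energy lookup, parameterized by buffer size.
# Values are taken from the Send and Receive rows of the slide's table and
# are assumed to be in microjoules.
BUFFER_SIZES = (2, 4, 8, 16)  # buffer size in bytes, one table column each

ENERGY_UJ = {
    "Send": (76, 77, 89, 115),
    "Receive": (44, 49, 55, 64),
}


def energy_uj(instruction, buffer_size):
    """Return the energy of one instruction for the given buffer size."""
    column = BUFFER_SIZES.index(buffer_size)  # raises ValueError if unsupported
    return ENERGY_UJ[instruction][column]


print(energy_uj("Send", 16))
print(energy_uj("Receive", 4))
```

A core provider would fill in one such row per instruction and one column per supported parameter value, so that a user who changes the buffer size gets updated energy figures without rerunning the low-level measurements.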
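Core Provider's Step 3: Back Annotation of System Model
Each instruction of the system-level model in the core database (UART, JPEG decode, ….) is annotated with its measured energy.

UART instruction   Energy
Reset              13 µJ
Enable_tx          23 µJ
Enable_rx          18 µJ
Send               76 µJ
Receive            44 µJ

UART model after back annotation:
  Reset() … uJtot += 13
  Enable_tx() … uJtot += 23
  Enable_rx() … uJtot += 18
  Send() … uJtot += 76
  Receive() … uJtot += 44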
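Back annotation can be sketched as a system-level model whose instruction methods each add their measured energy to a running total. This is an illustrative sketch under stated assumptions: the class name UartModel and the attribute uj_tot are hypothetical, the functional behavior of each instruction (the "…" on the slide) is left out, and the energies are the single-parameter values from the table.

```python
class UartModel:
    """Hypothetical back-annotated system-level model of the UART core."""

    def __init__(self):
        self.uj_tot = 0  # accumulated energy, assumed to be in microjoules

    def reset(self):
        # functional behavior of Reset would go here
        self.uj_tot += 13

    def enable_tx(self):
        # functional behavior of Enable_tx would go here
        self.uj_tot += 23

    def enable_rx(self):
        # functional behavior of Enable_rx would go here
        self.uj_tot += 18

    def send(self):
        # functional behavior of Send would go here
        self.uj_tot += 76

    def receive(self):
        # functional behavior of Receive would go here
        self.uj_tot += 44


uart = UartModel()
uart.reset()
uart.enable_tx()
for _ in range(3):
    uart.send()
print(uart.uj_tot)  # 13 + 23 + 3 * 76 = 264
```

Because the energy bookkeeping is one addition per instruction, the annotated model runs at essentially the same speed as the unannotated system-level model.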
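Core "Power Modes" Require Extra Effort by Core Provider
Unlike a microprocessor, certain peripheral core instructions can greatly modify the power consumption of other instructions.
Must create a power mode transition function, and measure power per instruction per mode.

Mode 1: Idle
UART instruction   Energy by buffer size (2 bytes, 4 bytes, 8 bytes, 16 bytes)
Reset              11 µJ, 13 µJ, 14 µJ
Enable_tx          27 µJ, 32 µJ, 31 µJ
Enable_rx          17 µJ, 18 µJ, 19 µJ, 18 µJ
Send               17 µJ, 19 µJ, 20 µJ
Receive            14 µJ, 15 µJ, 17 µJ, 18 µJ

Mode 2: Enabled
UART instruction   Energy by buffer size (2 bytes, 4 bytes, 8 bytes, 16 bytes)
Reset              13 µJ, 14 µJ
Enable_tx          23 µJ, 25 µJ, 24 µJ
Enable_rx          18 µJ, 19 µJ
Send               76 µJ, 77 µJ, 89 µJ, 115 µJ
Receive            44 µJ, 49 µJ, 55 µJ, 64 µJ

Mode transitions: Enable_tx or Enable_rx moves the core from Mode 1 (Idle) to Mode 2 (Enabled); Reset moves it from Mode 2 (Enabled) back to Mode 1 (Idle).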
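The power mode idea amounts to a small state machine: each instruction is charged the energy measured for the current mode, then the transition function selects the next mode. The sketch below is one possible reading, not the authors' code. The names next_mode and total_energy_uj are hypothetical; the energies are the first value listed on each row of the two mode tables (assumed to be the smallest buffer size, in microjoules); and charging an instruction in the mode that holds before its transition is an assumption.

```python
IDLE, ENABLED = "idle", "enabled"

# Energy per instruction per mode; first value of each table row (assumption).
ENERGY_UJ = {
    IDLE: {"Reset": 11, "Enable_tx": 27, "Enable_rx": 17, "Send": 17, "Receive": 14},
    ENABLED: {"Reset": 13, "Enable_tx": 23, "Enable_rx": 18, "Send": 76, "Receive": 44},
}


def next_mode(mode, instruction):
    """Power mode transition function taken from the slide's state diagram."""
    if instruction in ("Enable_tx", "Enable_rx"):
        return ENABLED
    if instruction == "Reset":
        return IDLE
    return mode  # Send and Receive leave the mode unchanged


def total_energy_uj(trace, mode=IDLE):
    """Sum the energy of an instruction trace, tracking the power mode."""
    total = 0
    for instruction in trace:
        total += ENERGY_UJ[mode][instruction]  # charge in the current mode
        mode = next_mode(mode, instruction)
    return total


trace = ["Send", "Enable_tx", "Send", "Reset", "Send"]
two_mode = total_energy_uj(trace)                         # 17 + 27 + 76 + 13 + 17
single_mode = sum(ENERGY_UJ[ENABLED][i] for i in trace)   # ignores the Idle mode
print(two_mode, single_mode)  # 150 264
```

On this made-up trace, a single-mode model that always uses the Enabled figures reports 264 instead of 150, which illustrates why a missing mode can introduce a large error.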
User Performs System Simulation, Which Yields Power Data
Simulation takes only seconds or minutes.
[Figure: the user selects cores (UART, JPEG decode, ….) from the core database for the SOC (Microprocessor, Cache, Memory, Bridge, Peripheral, Application); the per-core energies are summed (+) into the total energy.]
Results: Image-decode Accelerator
Examined 3 peripheral cores: UART, DMA, JPEG.
Compared our instruction-based system-level method with:
  Gate-level simulation: slow but accurate
  "Databook" RT-level: cycle-accurate simulation, using databook average-power values

Energy (mJ)
Method                       Simulation time   UART   DMA   JPEG
Gate-level                   40,980 sec
"Databook" RT-level          2,700 sec          %     38%   14%
Instr.-based system-level    14 sec            2%     5%    1%
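The simulation times reported above translate directly into speedup factors. The short calculation below is only arithmetic on the three reported times; the names SIM_TIME_SEC and speedup are hypothetical.

```python
# Reported simulation times for the image-decode accelerator, in seconds.
SIM_TIME_SEC = {
    "gate-level": 40_980,
    "databook RT-level": 2_700,
    "instruction-based": 14,
}


def speedup(slow, fast):
    """Ratio of the slower method's simulation time to the faster one's."""
    return SIM_TIME_SEC[slow] / SIM_TIME_SEC[fast]


print(round(speedup("gate-level", "instruction-based")))         # 2927
print(round(speedup("databook RT-level", "instruction-based")))  # 193
```

For this example the instruction-based model is roughly 2,900 times faster than gate-level simulation and roughly 190 times faster than the "Databook" RT-level simulation.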
Results: Importance of Power Modes
Proper power-mode selection is critical for peripheral cores.
Too few modes or the wrong modes can lead to large errors.
[Table: UART example comparing gate-level energy (mJ), system-level energy (mJ), and error (%) for single-mode, two-mode, and four-mode models.]
Conclusions
The introduced instruction-based method:
  Is accurate (less than 5% error)
  Is fast (1000x speedup over gate-level)
  Fits with current core-based methodology
The concept of power modes is necessary for accuracy.
Future work includes:
  Trace-simulator-based approach (10x speedup)
  Trace-analysis-based approach (100x speedup)