1 Energy Savings and Speedups from Partitioning Critical Software Loops to Hardware in Embedded Systems Greg Stitt, Frank Vahid, Shawn Nematbakhsh University of California, Riverside ACM Transactions on Embedded Computing Systems, February 2004
2 General Idea Increase performance and/or save energy of a single embedded system program. Take advantage of embedded properties: –Fairly specific applications that rarely change. –Small loops account for a large portion of exec time. Dedicate a configurable logic device or an ASIC to perform the loop function efficiently. For overall energy savings, the speedup must be great enough to overcome the increase in “exec” power.
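The last point on this slide is an energy = power × time argument. A minimal sketch of it follows, assuming a single average power figure for the software-only system and another for the system with the added hardware; the function names and the numbers in the usage note are hypothetical and not taken from the paper.

```python
def energy(power_w, time_s):
    """Energy in joules for a constant average power over a time span."""
    return power_w * time_s


def energy_savings(sw_power_w, sw_time_s, hw_power_w, speedup):
    """Fraction of energy saved by the partitioned system.

    sw_power_w : average power of the software-only system
    hw_power_w : average power once the custom hardware is added (assumed higher)
    speedup    : software-only time divided by partitioned time
    """
    e_sw = energy(sw_power_w, sw_time_s)
    e_hw = energy(hw_power_w, sw_time_s / speedup)
    return 1.0 - e_hw / e_sw


def break_even_speedup(sw_power_w, hw_power_w):
    """Speedup at which the partitioned system uses exactly the same energy."""
    return hw_power_w / sw_power_w
```

For example, if adding hardware raises average power from 1.0 W to 1.5 W, `break_even_speedup(1.0, 1.5)` is 1.5: any speedup below that costs energy, and a speedup of 3 halves it.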
3 Study Uniqueness What separates this study from others? –Simple HW/SW partitioning method (no complex search algorithms). –Focus on embedded systems. –Extensive evaluation of energy savings.
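To illustrate what a "simple" profile-driven partitioning might look like, here is a sketch that ranks loops by their share of execution time and bounds the overall speedup with Amdahl's law. This is not the authors' tool; the profile format, the function names, and the `max_loops` cutoff are assumptions made for illustration.

```python
def select_critical_loops(profile, max_loops=3):
    """Pick the loops that take the largest share of execution time.

    profile: dict mapping a loop name to its fraction of total execution time.
    Returns a list of (name, fraction) pairs, largest first.
    """
    ranked = sorted(profile.items(), key=lambda item: item[1], reverse=True)
    return ranked[:max_loops]


def overall_speedup(loop_fraction, loop_speedup):
    """Amdahl's law: whole-program speedup when only the loops are accelerated."""
    return 1.0 / ((1.0 - loop_fraction) + loop_fraction / loop_speedup)
```

With a hypothetical profile `{"fir": 0.6, "crc": 0.2, "init": 0.05}`, selecting two loops covers 80% of the time; if hardware runs those loops 10 times faster, `overall_speedup(0.8, 10.0)` is about 3.57, which shows how the remaining software limits the gain.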
4 Critical Loops [Figure omitted; surviving label: “Ave.”]
5 Partitioning Methods Modified apps with critical loops moved to hardware using Synopsys register-transfer VHDL. Configurable system logic (CSL) is the bus master. CSL accesses memory directly or through DMA. CPU–CSL communication is via shared memory (including CSL registers) and direct signals. With an ASIC, no DMA.
6 Partitioning Methods Handshaking routines used for activating custom hardware (CSL or ASIC) when entering “critical” region.
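The slide does not show the handshaking routines themselves. The sketch below is a purely hypothetical software model of one common start/done handshake over shared registers; the register names, the tick-based hardware model, and the example loop body are all assumptions, not the paper's implementation.

```python
class SharedRegisters:
    """Hypothetical memory-mapped registers visible to both CPU and hardware."""

    def __init__(self):
        self.start = 0   # set by the CPU to request a run
        self.done = 0    # set by the hardware when the result is ready
        self.arg = 0     # input written by the CPU
        self.result = 0  # output written by the hardware


class LoopAccelerator:
    """Stand-in for the CSL or ASIC block that implements one critical loop."""

    def __init__(self, regs):
        self.regs = regs

    def tick(self):
        # On a simulated clock tick, run the loop once if a start is pending.
        if self.regs.start and not self.regs.done:
            self.regs.result = sum(i * i for i in range(self.regs.arg))
            self.regs.done = 1


def run_critical_region(regs, hw, n, max_ticks=1000):
    """CPU-side routine that replaces the software loop."""
    regs.arg = n
    regs.done = 0
    regs.start = 1                 # activate the hardware
    for _ in range(max_ticks):     # CPU waits (polls) until done is raised
        hw.tick()
        if regs.done:
            break
    else:
        raise TimeoutError("hardware did not signal completion")
    regs.start = 0                 # release the handshake
    return regs.result
```

In this model the CPU does nothing useful while it waits, so CPU and hardware execution do not overlap.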
7 Speedup & E-savings (estimation) Software loops replaced with handshaking behavior. HW cycles/loop always calculated as worst-case. Simulated: 100 MHz MIPS, 25 MHz 8051, max possible CSL speeds after synthesis. Xilinx Virtex power estimator for CSL (0.18 µm FPGA, 1.8 V, XCV50E). Measured active/idle power in Triscend’s parts: CPUidle = 0.85 × CPUactive, CSLidle = 0.125 × CSLactive. Power of interconnect and memory gathered through physical measurement of Triscend parts. Total system Energy =
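The expression for total system energy is cut off on this slide. As a sketch of one plausible accounting, and not the paper's formula, the code below uses the two idle ratios quoted above and assumes that the CPU idles while the CSL runs (and vice versa) and that memory and interconnect draw a constant power throughout. All function and parameter names are hypothetical.

```python
CPU_IDLE_RATIO = 0.85    # CPUidle = 0.85 * CPUactive (from the slide)
CSL_IDLE_RATIO = 0.125   # CSLidle = 0.125 * CSLactive (from the slide)


def total_energy(p_cpu_active, p_csl_active, p_other, t_cpu, t_csl):
    """Estimate total energy in joules under a no-overlap assumption.

    p_cpu_active, p_csl_active : active power of CPU and CSL, in watts
    p_other                    : memory + interconnect power, assumed constant
    t_cpu                      : seconds the CPU is active (CSL idle)
    t_csl                      : seconds the CSL is active (CPU idle)
    """
    t_total = t_cpu + t_csl
    e_cpu = p_cpu_active * t_cpu + CPU_IDLE_RATIO * p_cpu_active * t_csl
    e_csl = p_csl_active * t_csl + CSL_IDLE_RATIO * p_csl_active * t_cpu
    e_other = p_other * t_total
    return e_cpu + e_csl + e_other
```

With made-up values of 1.0 W CPU, 0.4 W CSL, 0.2 W for everything else, 2 s of CPU time and 1 s of CSL time, the model gives 3.95 J. Note how the high CPU idle ratio means the CPU still contributes most of the energy even while the CSL does the work.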
8 Speedup & E-savings (estimation)
9 Gates
10 Speedup & E-savings (measurement) Single-chip microproc/CSL devices from Triscend: –E5 (25 MHz) –A7 (40 MHz) Digital multimeter used for current/voltage measurement, time with timer (!). Subset of benchmarks measured. Good speedups and energy savings; energy “estimates even look conservative” (only on MIPS). … But comparing a 100 MHz MIPS (sim) to a 40 MHz ARM7 (measured)?
11 Speedup & E-savings (measurement)
12 Speedup & E-savings (ASIC) Estimations of a uP and custom logic on a single ASIC. Synopsys synthesis and power estimation tools for 0.18 µm. Ave. estimated speedups increased to 4.0 from 3.2, due to increase in clock speed. E-savings up to 50% from 34%. Ave. # of gates down to 5,738 from 10,507.
13 Voltage Scaling Additional energy saved if voltage scaling factored in. Because of the increased performance, clock speed may be slowed, and voltage reduced to attain equivalent performance. On average, Vscaling gives an additional 14% of E-savings.
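As a rough illustration of why slack from a speedup can be traded for energy, the sketch below uses the standard first-order CMOS relation (dynamic power ≈ C·V²·f) and assumes, as a simplification not taken from the paper, that supply voltage can drop in proportion to clock frequency down to some floor. The function name and the floor parameter are hypothetical, static power is ignored, and the model does not attempt to reproduce the 14% average quoted above.

```python
def dynamic_energy_ratio(clock_fraction, min_voltage_fraction=0.0):
    """Dynamic energy after scaling divided by dynamic energy before.

    clock_fraction       : new clock / original clock, in (0, 1]
    min_voltage_fraction : lowest usable voltage as a fraction of nominal

    For a fixed number of cycles, dynamic energy is proportional to V^2,
    so only the voltage fraction matters once the clock has been chosen.
    """
    if not 0.0 < clock_fraction <= 1.0:
        raise ValueError("clock_fraction must be in (0, 1]")
    # Assumption: voltage tracks frequency linearly, but not below the floor.
    voltage_fraction = max(clock_fraction, min_voltage_fraction)
    return voltage_fraction ** 2
```

Under these assumptions, slowing the clock to 80% of nominal cuts dynamic energy to about 64%; if the voltage can only drop to 90% of nominal, the ratio is about 81% no matter how far the clock is slowed.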
14 Voltage Scaling [Figure omitted; axis label: “Percent Speed (clock) Reduction”]
15 Conclusion Moving a small amount of critical code to hardware can provide speedups and/or energy savings. Single-chip CPU/configurable-logic devices can give large improvements over CPU-only implementations. Extensive hardware/software partitioning exploration is not needed; basic profiling is enough.
16 Discussion Ideas Can the gains seen on this benchmark suite carry over to actual applications? Why did they simulate a 100 MHz MIPS but use a 40 MHz ARM? How would the results be different on more modern microprocs? XScale? AVR? Do these newer CPUs have much better performance/power ratios? Pg. 223: “parallel execution”? Do they actually have parallel exec going on? Pg. 224 says no. How does having a DMA option allow “almost any software region” to be implemented on HW more easily? 85% idle power for 8051??!! (pg. 225). Obviously not “sleeping.”