Improving Embedded System Software Speed and Energy Using Microprocessor/FPGA Platform ICs
Frank Vahid, Associate Professor
Dept. of Computer Science and Engineering, University of California, Riverside
Also with the Center for Embedded Computer Systems at UC Irvine
http://www.cs.ucr.edu/~vahid
This research has been supported by the National Science Foundation, NEC, Trimedia, and Triscend
Frank Vahid, UC Riverside
General Purpose vs. Special Purpose
- Amazing to think this came from wolves
- Standard tradeoff
- Oct. 14, 2002, Cincinnati, Ohio: physicians at Cincinnati Children's Hospital Medical Center reported duct tape effective at treating warts

General Purpose vs. Single-Purpose Processors

  total = 0
  for i = 1 to N loop
    total += M[i]
  end loop

Designers have long known that:
- General-purpose processors are flexible
- Single-purpose processors are fast
General purpose: controller (control logic and state register), datapath (IR, PC, register file, general ALU), program memory (assembly code for: total = 0; for i = 1 to ...), data memory
Single purpose: controller (control logic, state register), datapath (i, total, +), data memory
ENIAC, 1940s: its flexibility was the big deal
Tradeoffs: flexibility, design cost, time-to-market, performance, power efficiency, size
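The slide's loop, written as plain C, is the software side of the tradeoff: every add costs an instruction fetch and decode on the general-purpose datapath, whereas a single-purpose processor hardwires just the counter, register, and adder. A minimal sketch (the function name and demonstration values are mine, not from the slide):

```c
/* Software version of the slide's loop. The general-purpose processor
 * executes an instruction stream to perform each add; a single-purpose
 * processor would instead dedicate a counter (i), a register (total),
 * and one adder, with no instruction-fetch overhead per iteration. */
int sum_array(const int M[], int n) {
    int total = 0;
    for (int i = 0; i < n; i++)
        total += M[i];
    return total;
}
```

The same computation maps to either implementation; what differs is flexibility (change the C and recompile) versus speed and energy (the dedicated datapath does only this, but does it fast).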
Mixing General and Single-Purpose Processors
A.k.a. hardware/software partitioning
- Hardware: single-purpose processors (coprocessor, accelerator, peripheral, etc.)
- Software: general-purpose processors (though hardware underneath!)
Especially important for embedded systems
- Computers embedded in devices (cameras, cars, toys, even people)
- Speed, cost, time-to-market, power, size, ... demands are tough
Digital camera chip (lens, CCD): microcontroller, CCD preprocessor, pixel coprocessor, A2D, D2A, JPEG codec, DMA controller, memory controller, ISA bus interface, UART, LCD control, display control, multiplier/accumulator

How Is Partitioning Done for Embedded Systems?
- Partitioning into hw and sw blocks is done early, during the conceptual stage
- Sw design is done separately from hw design
- Attempts since the late 1980s to automate it have not yet succeeded
- Partitioning manually is reasonably straightforward:
  - The spec is informal and not machine readable
  - Sw algorithms may differ from hw algorithms
  - No compelling need for tools
Flow: informal spec, system partitioning, sw spec / hw spec, sw design / hw design, processor / ASIC

New Platforms Invite New Efforts in Hw/Sw Partitioning
- New single-chip platforms contain both a general-purpose processor and an FPGA
- FPGA: field-programmable gate array
  - Programmable just like software, hence flexible
  - Intended largely to implement single-purpose processors
- Can we perform a later partitioning to improve the software too?
Flow: informal spec, system partitioning, sw spec / hw spec, sw design / hw design, then a later partitioning targeting processor + FPGA / ASIC

Commercial Single-Chip Microprocessor/FPGA Platforms
Triscend E5 chip (2000): configurable logic, 8051 processor plus other peripherals, memory
- Based on the 8-bit 8051 CISC core
- 10 Dhrystone MIPS at 40 MHz
- Up to 40K logic gates
- Cost only about $4

Single-Chip Microprocessor/FPGA Platforms
Atmel FPSLIC (Field-Programmable System-Level IC)
- Based on the AVR 8-bit RISC core
- 20 Dhrystone MIPS
- 5K-40K logic gates
- $5-$10
Courtesy of Atmel

Single-Chip Microprocessor/FPGA Platforms
Triscend A7 chip (2001)
- Based on the ARM7 32-bit RISC processor
- 54 Dhrystone MIPS at 60 MHz
- Up to 40K logic gates
- $10-$20 in volume
Courtesy of Triscend

Single-Chip Microprocessor/FPGA Platforms
Altera's Excalibur EPXA10 (2002)
- ARM922T hard core
- 200 Dhrystone MIPS at 200 MHz
- ~200K to ~2 million logic gates
Source: www.altera.com

Single-Chip Microprocessor/FPGA Platforms
Xilinx Virtex-II Pro (2002), PowerPC based
- 420 Dhrystone MIPS at 300 MHz
- 1 to 4 PowerPCs
- 4 to 16 gigabit serial transceivers (622 Mbps to 3.125 Gbps)
- 12 to 216 multipliers
- Millions of logic gates
- 200K to 4M bits RAM
- 204 to 852 I/O
- $100-$500 (>25,000 units)
Courtesy of Xilinx

Single-Chip Microprocessor/FPGA Platforms
Why wouldn't future microprocessor chips include some amount of on-chip FPGA?
One argument against: area
- Lots of silicon area taken up by the FPGA
- An FPGA is about 20-30 times less area efficient than custom logic
- FPGAs used to be for prototyping, too big for final products
But chip trends imply that FPGAs will be O.K. in final products...
How Much Custom Logic Is Enough?
1993: ~1 million logic transistors per IC. Perhaps a bit small:
- 8-bit processor: 50,000 transistors
- Pentium: 3 million transistors
- MPEG decoder: several million transistors

How Much Custom Logic Is Enough?
1996: ~5-8 million logic transistors. Reasonably sized.

How Much Custom Logic Is Enough?
1999: ~10-50 million logic transistors. Probably plenty big for most of us.

How Much Custom Logic Is Enough?
2002: ~100-200 million logic transistors. More than typically necessary.

How Much Custom Logic Is Enough?
2008: >1 billion logic transistors. Perhaps very few people could design this.

Very Few Companies Can Design High-End ICs
The design productivity gap
[Chart: logic transistors per chip (millions) and productivity (K transistors/staff-month), 1981-2009. IC capacity follows Moore's Law and grows faster than designer productivity, opening a widening gap. Source: ITRS'99]
- Designer productivity is growing at a slower rate than chip capacity
- 1981: 100 designer-months, ~$1M
- 2002: 30,000 designer-months, ~$300M
Single-Chip Platforms with On-Chip FPGAs
- So, big FPGAs on-chip are O.K.: high-end ICs are becoming out of reach of mainstream designers, who couldn't have used all that silicon area anyway
- But couldn't designers use custom logic instead of FPGAs to make smaller chips and save costs?

Shrinking Chips
- Yes, but there's a limit: chips are becoming pin limited
- The pads connecting to external pins can only shrink so far (a football huddle can only get so small)
- That pad-ring area will exist whether we use it all or not

Trend Towards Pre-Fabricated Platforms: ASSPs
- ASSP: application-specific standard product, a domain-specific pre-fabricated IC (e.g., a digital camera IC)
- ASIC: application-specific IC, a unique IC design
- ASSP revenue > ASIC revenue, and ASSP design starts > ASIC design starts (design starts ignore quantity of the same IC)
- ASIC design starts are decreasing, due to the strong benefits of using pre-fabricated devices
Source: Gartner/Dataquest, September '01

Microprocessor/FPGA Platforms
- Trends point towards such platforms increasing in popularity
- Can we automatically partition the software to utilize the FPGA, for improved speed and energy?

Automatic Hardware/Software Partitioning
- Since the late 1980s the goal has been spec in, hw/sw out
- But no successful commercial tool yet. Why?
Ideal flow: "spec" into a partitioner, yielding software (compilation, processor) and hardware (synthesis, ASIC/FPGA)

// From MediaBench's JPEG codec
GLOBAL(void)
jpeg_fdct_ifast (DCTELEM * data)
{
  DCTELEM tmp0, tmp1, tmp2, tmp3, tmp4, tmp5, tmp6, tmp7;
  DCTELEM tmp10, tmp11, tmp12, tmp13;
  DCTELEM z1, z2, z3, z4, z5, z11, z13;
  DCTELEM *dataptr;
  int ctr;
  SHIFT_TEMPS

  /* Pass 1: process rows. */
  dataptr = data;
  for (ctr = DCTSIZE-1; ctr >= 0; ctr--) {
    tmp0 = dataptr[0] + dataptr[7];
    tmp7 = dataptr[0] - dataptr[7];
    tmp1 = dataptr[1] + dataptr[6];
    …
// Thousands of lines like this in dozens of files

Why No Successful Tool Yet?
- Most research has focused on extensive exploration, with roots in VLSI CAD:
  - Decompose the problem into fine-grained operations
  - Apply sophisticated partitioning algorithms: min-cut, dynamic programming, simulated annealing, tabu search, genetic evolution, etc.
- Is this overkill? The "spec" becomes 1000s of nodes fed to a partitioner (like circuit partitioning)
We Really Only Need to Consider a Few Loops, Due to the 90-10 Rule
- The recent appearance of embedded benchmark suites enables analysis and understanding of the real problem
- We've examined UCLA's MediaBench, NetBench, and Motorola's Powerstone; currently examining EEMBC (the embedded equivalent of SPEC)
- UCR loop analysis tools based on SimpleScalar and Simics
(JPEG codec excerpt as on the earlier slide)
[Chart: each loop assigned a number, sorted by its fraction of contribution to total execution time]
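A simple way to rank loops by execution frequency, in the spirit of the trace-based analysis above, is to count how often each backward branch target appears in an instruction-address trace, since a jump to a lower address approximates a loop back edge. This is a toy stand-in for what simulator-based tools do; the trace and addresses below are made up for illustration:

```c
#include <stddef.h>

#define MAX_ADDR 256  /* toy address space for the sketch */

/* Count executions of each backward-branch target in an address trace.
 * A transition to a lower address is treated as one loop iteration
 * re-entering the loop head at that target address. */
void count_back_edges(const int *trace, size_t n, int counts[MAX_ADDR]) {
    for (int a = 0; a < MAX_ADDR; a++) counts[a] = 0;
    for (size_t i = 1; i < n; i++)
        if (trace[i] < trace[i - 1])   /* backward jump: loop iteration */
            counts[trace[i]]++;
}

/* Return the address of the most frequently re-entered loop head. */
int hottest_loop(const int counts[MAX_ADDR]) {
    int best = 0;
    for (int a = 1; a < MAX_ADDR; a++)
        if (counts[a] > counts[best]) best = a;
    return best;
}
```

Sorting the nonzero counts descending gives exactly the kind of per-loop contribution ranking shown in the chart.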
The 90-10 Rule Holds for Embedded Systems
In fact, the most frequent loop alone took 50% of execution time, using 1% of the code

So Need We Only Consider the First Few Loops? Not Necessarily
- What if programs were self-similar w.r.t. the 90-10 rule? Remove the most frequent loop: does the 90-10 rule still hold?
- Intuition might say yes: remove the loop, and we have another program
[Charts: % remaining execution time and achievable speedup, per loop (loops 1-10)]
- The data say otherwise: we need only speed up the first few loops, since after that the speedups are limited
- Good from a tool perspective!
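The limit above is just Amdahl's law: if a loop accounts for fraction f of execution time and the loop itself is sped up by factor s in hardware, overall speedup is 1 / ((1 - f) + f / s). A small sketch (the 50% figure is from the previous slide; other values are illustrative):

```c
/* Overall application speedup from accelerating one loop (Amdahl's law).
 * f = fraction of execution time spent in the loop,
 * s = speedup of the loop itself on the FPGA (sw time / hw time). */
double overall_speedup(double f, double s) {
    return 1.0 / ((1.0 - f) + f / s);
}
```

With f = 0.5 (the hottest loop from the previous slide) and s = 10, overall speedup is 1/0.55, about 1.82; even s = 1000 cannot push it past 2. That is why only the first few loops matter, and why enormous per-loop speedups buy little.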
Manually Partitioned Several Powerstone Benchmarks onto Triscend A7 and E5 Chips
- Used a multimeter and timer to measure performance and power
- Obtained good speedups and energy savings by partitioning software among the microprocessor and on-chip FPGA
(Triscend A7 development board; E5 IC)
Simulation-Based Results for More Benchmarks
(Quicker than physical implementation; results matched reasonably well)
- Speedup of 3.2 and energy savings of 34% obtained with only 10,500 gates (avg)

Looking at Multiple Loops per Benchmark
- Manually created several partitioned versions of each benchmark
- Most speedup gained with the first 20,000 gates. Surprisingly few gates!
References:
- Stitt, Grattan and Vahid, Field-Programmable Custom Computing Machines (FCCM), 2002
- Stitt and Vahid, IEEE Design and Test, Dec. 2002
- J. Villarreal, D. Suresh, G. Stitt, F. Vahid and W. Najjar, Design Automation of Embedded Systems, 2002 (to appear)
Ideal Speedups for Different Architectures
- Varied the loop speedup ratio (sw time / hw time of the loop itself) to see the impact of a faster microprocessor or a slower FPGA: 30, 20, 10 (base case), 5, and 2
- Loop speedups of 5 or more work fine for the first few loops, and are not hard to achieve

Ideal Energy Savings for Different Architectures
- Varied the loop power ratio (FPGA power / microprocessor power) to account for different architectures: 2.5, 2.0, 1.5 (base case), and 1.0
- Energy savings are quite resilient to these variations
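A plausible first-order model behind these two sensitivity studies (my sketch, not necessarily the exact model used): normalize all-software energy to 1; after partitioning, the loop's fraction f of time runs on the FPGA at power ratio r and loop speedup s, so relative energy is (1 - f) + r * f / s. The model ignores processor idle power while the FPGA runs.

```c
/* First-order energy model for hw/sw partitioning (illustrative sketch).
 * f = fraction of sw execution time spent in the partitioned loop,
 * s = loop speedup on the FPGA (sw time / hw time),
 * r = loop power ratio (FPGA power / microprocessor power).
 * Returns fractional energy savings versus the all-software system. */
double energy_savings(double f, double s, double r) {
    double relative_energy = (1.0 - f) + r * f / s;
    return 1.0 - relative_energy;
}
```

With f = 0.5, s = 10, and the base-case r = 1.5, savings are 1 - (0.5 + 0.075) = 42.5%; raising r to 2.5 only drops this to 37.5%. Because s divides r, a decent loop speedup makes the savings insensitive to the power ratio, consistent with the slide's observation.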
How Is Automated Partitioning Done?
(The previous data were obtained manually)
Flow: informal spec, system partitioning, sw spec / hw spec, sw design / hw design, then partitioning targeting processor + FPGA / ASIC

Source-Level Partitioning
1. The compiler front-end converts sw source code into an intermediate format, such as SUIF (Stanford University Intermediate Format)
2. Hw/sw partitioning: the intermediate format is explored for hardware candidates
3. The compiler back-end emits assembly and object files, and hw source is generated
4. The assembler and linker produce the binary; synthesis turns the hw source into netlists
5. The binary runs on the processor; the netlists configure the FPGA

Problems with Source-Level Partitioning
Though technically superior, source-level partitioning:
- Disrupts the standard commercial tool flow significantly
- Requires a special compiler (ouch!)
  - Multiple source languages, changing source languages (C SUIF compiler, C++ SUIF compiler, Java...?)
  - How to deal with library code, assembly code, object code?

Binary Partitioning
1. Source code is first compiled, assembled, and linked to create a binary
2. Hw/sw partitioning: candidate hardware regions (a few small, frequent loops) are decompiled for partitioning
3. Hw source is generated and synthesized into netlists, and the binary is updated to use the hardware
4. The updated binary runs on the processor; the netlists configure the FPGA

Binary-Level Partitioning Results (ICCAD'02)
- Source-level: average speedup 1.5, average energy savings 27%, average 4,361 gates
- Binary-level: average speedup 1.4, average energy savings 13%, large area overhead averaging 10,325 gates
Binary Partitioning Could Eventually Lead to Dynamic Hw/Sw Partitioning
- Dynamic software optimization is gaining interest (e.g., HP's Dynamo)
- What better optimization than moving code to the FPGA?
- Add an on-chip component that:
  - Detects the most frequent sw loops
  - Decompiles a loop
  - Performs compiler optimizations
  - Synthesizes to a netlist
  - Places and routes the netlist onto a (simple) FPGA
  - Updates the sw to call the FPGA
- A self-improving IC: can be invisible to the designer, appearing simply as an efficient processor
- HARD! Much future work.
(On chip: processor, I$, D$, profiler, DMA, memory, configurable logic)
Conclusions
- Hardware/software partitioning can significantly improve software speed and energy
- Single-chip microprocessor/FPGA platforms, increasing in popularity, make such partitioning even more attractive
- A successful commercial tool is still on the horizon
  - Binary-level partitioning may help in some cases
  - Source-level partitioning can yield massive parallelism (Profs. Najjar/Payne)
  - Future dynamic hw/sw partitioning possible?
- The distinction between sw and hw is continually being blurred!
- Many people involved: Greg Stitt, Roman Lysecky, Shawn Nematbakhsh, Dinesh Suresh, Walid Najjar, Jason Villarreal, Tom Payne, several others...
- Support from NSF, Triscend, and soon SRC...
- Exciting new directions!