Frank Vahid, UC Riverside 1 Self-Improving Configurable IC Platforms Frank Vahid Associate Professor Dept. of Computer Science and Engineering University of California, Riverside Also with the Center for Embedded Computer Systems at UC Irvine Co-PI: Walid Najjar, Professor, CS&E, UCR
Frank Vahid, UC Riverside 2 Goal: Platform Self-Tunes to Executing Application Download standard binary Platform adjusts to executing application Result is better speed and energy Why and How? Application Platform
Frank Vahid, UC Riverside 3 Platforms Pre-designed programmable platforms Reduce NRE cost, time-to-market, and risk Platform designer amortizes design cost over large volumes Many (if not most) will include FPGA Today: Triscend, Altera, Xilinx, Atmel More sure to come As FPGA vendors license to SoC makers FPGA MemProcessor L1 Cache Periph 1 JPEG Sample Platform Processor, cache, memory, FPGA, etc. Modern IC costs are feasible mostly in very high volumes
Frank Vahid, UC Riverside 4 Hardware/Software Partitioning Improves Speed and Energy FPGA Mem Processor L1 Cache Periph1JPEG But requires partitioning CAD tool O.K. in some flows In mainstream software flows, hard to integrate Standard Sw Tools Hw/Sw Parti- tioner idleuP active idleuP FPGA
Frank Vahid, UC Riverside 5 Idea: Perform Partitioning Dynamically (and hence Transparently) Add components on-chip: Profile Decompile frequent loops Optimize Synthesize Place and route onto FPGA Update Sw to call FPGA Transparent No impact on tool flow Dynamic software optimization, software binary updating, and dynamic binary translation are proven technologies But how can you profile, decompile, optimize, synthesize, and p&r, on-chip? DAG & LC MemProcessor L1 Cache Profiler Explorer Dynamic Partitioning Module Decompiler, Optimizer Synthesis, Place and Route FPGA
Frank Vahid, UC Riverside 6 Dynamic Partitioning Requires Lean Tools How can you run Synopsys/Cadence/Xilinx tools on-chip, when they currently run on powerful workstations? Key – our tools only need be good enough to speedup critical loops Most time spent in small loops (e.g., Mediabench, Netbench, EEMBC) Created ultra-lean versions of the tools Quality not necessarily as good, but good enough Runs on a 60 MHz ARM 7 Loop
Frank Vahid, UC Riverside 7 Dynamic Hw/Sw Partitioning Tool Chain DAG & LC FPGA MemProcessor L1 Cache Profiler Explorer Partitioner Binary Loop Profiler Small, Frequent Loops Loop Decompilation Place & Route Hw Synthesis Binary Modification Updated Binary DMA Configuration Bitfile Creation Tech. Mapping Architecture targeted for loop speedup, simple P&R We’ve developed efficient profiler Hw We’re continuing to extend these tools to handle more benchmarks Decompiler, Optimizer Synthesis, Place and Route
Frank Vahid, UC Riverside 8 Dynamic Hw/Sw Partitioning Results DAG & LC FPGA MemProcessor L1 Cache Profiler Explorer Partitioner Decompiler, Optimizer Synthesis, Place and Route
Frank Vahid, UC Riverside 9 Dynamic Hw/Sw Partitioning Results Powerstone, NetBench, and EEMBC examples, most frequent 1 loop only Average speedup very close to ideal speedup of 2.4 Not much left on the table in these examples Dynamically speeding up inners loops on FPGAs is feasible using on-chip tools ICCAD’02 (Stitt/Vahid) – Binary-level partitioning in general is very effective
Frank Vahid, UC Riverside 10 Configurable Cache: Why? ARM920T: Caches consume half of total processor system power (Segars 01) M*CORE: Unified cache consumes half of total processor sys. power (Lee/Moyer/Arends 99) DAG & LC FPGA MemProcessor L1 Cache Profiler Explorer Dynamic Partitioning Module Decompiler Synthesis Place and Route
Frank Vahid, UC Riverside 11 Best Cache for Embedded Systems? Diversity of associativity, line size, total size
Frank Vahid, UC Riverside 12 Cache Design Dilemmas Associativity Low: low power, good performance for many programs High: better performance on more programs Total size Small: lower power if working set small, (less area) Big: better performance/power if working set large Line size Small: better when poor spatial locality Big: better when good spatial locality Most caches are a compromise for many programs Work best on average But embedded systems run one/few programs Want best cache for that one program vs.
Frank Vahid, UC Riverside 13 Solution to the Cache Design Dilemna Configurable cache Design physical cache that can be reconfigured 1-way, 2-ways, or 4-ways Way concatenation – new technique, ISCA’03 (Zhang/Vahid/Najjar) Four 2K ways, plus concatenation logic 8K, 4K or 2K byte total size Way shutdown, ISCA’03 Gates Vdd, saves both dynamic and static power, some performance overhead (5%) 16, 32 or 64 byte line size Variable line fetch size, ISVLSI’03 Physical 16 byte line, one, two or four physical line fetches Note: this is a single physical cache, not a synthesizable core
Frank Vahid, UC Riverside 14 Configurable Cache Design: Way Concatenation (4, 2 or 1 way) index c1c1 c3c3 c0c0 c2c2 a 11 a 12 reg 1 reg 0 sense amps column mux tag part tag address mux driver c1c1 line offset data output critical path c0c0 c2c2 c0c0 c1c1 6x64 c3c3 c2c2 c3c3 a 31 tag address a 13 a 12 a 11 a 10 index a 5 a 4 line offset a 0 Configuration circuit data array bitline Trivial area overhead, no performance overhead
Frank Vahid, UC Riverside 15 Configurable Cache Design Metrics We computed power, performance, energy and size using CACTI models Our own layout (0.13 TSMC CMOS), Cadence tools Energy: considered cache, memory, bus, and CPU stall Powerstone, MediaBench, and SPEC benchmarks Used SimpleScalar for simulations
Frank Vahid, UC Riverside 16 Configurable Cache Energy Benefits 40%-50% energy savings on average Compared to conventional 4-way and 1-way assoc., 32-byte line size AND, best for every example (remember, conventional is compromise)
Frank Vahid, UC Riverside 17 Future Work Dynamic cache tuning More advanced dynamic partitioning Automatic frequent loop detection On-chip exploration tool Better decompilation, synthesis Better FPGA fabric, place and route Approach: continue to extend to support more benchmarks Extend to platforms with multiple processors Scales well – processors can share on-chip partitioning tools
Frank Vahid, UC Riverside 18 Conclusions Self-improving configurable ICs Provide excellent speed and energy improvements Require no modification to existing software flows Can thus be widely adopted We’ve shown the idea is practical Lean on-chip tools are possible Now need to make them even better Extensive research into algorithms, designs and architecture is needed