GSRC Annual Symposium, Sep 29-30, 2008
Full-System Chip Multiprocessor Power Evaluations Using FPGA-Based Emulation
Abhishek Bhattacharjee, Gilberto Contreras, Margaret Martonosi, PRINCETON UNIVERSITY
Appears in the International Symposium on Low Power Electronics and Design (ISLPED), '08
Concurrent, Task

Motivation
- Detailed performance/power tradeoffs at the µarch level are crucial
- SW simulators are traditionally used at the µarch stage, e.g. Wattch, SimplePower, HotSpot
  - Flexible, low development time
- But SW simulators are slow
  - More complex chips
  - More complex design space
  - Need to model OS, workload interaction
- SW is increasingly removed from modeling requirements

Proposed Solutions
1. Run application snippets, ignore OS
   - Accuracy and credibility are compromised
2. Parallelize SW simulator
   - Shared data structures (e.g. LLC, coherence) limit scalability
3. Hardware runtime monitoring
   - Restricted view of components, and requires an existing design

Our Approach
Develop an FPGA-based performance/power emulator that models a proposed CMP
- Emulation rate of 65 MHz: run full apps and the Linux 2.6 kernel
- Programmable: insert relevant activity monitors, model various architectures
- Combines the best of SW simulators and HW runtime monitoring
Bottom line: get the detail and full-system effects of real measurements before the chip is built
First full-system power/performance FPGA emulation of a CMP running a full Linux 2.6 distribution with multiprogramming and multithreading support

FPGA-Based CMP Emulation Infrastructure

Step 1: Choosing a Target FPGA Platform
- Currently use the BEE2 (control unit)
- Will utilize user FPGA units as the design scales
- Methodology extensible to other platforms

Step 1: Choose a Candidate Core Design
- Currently use the Leon3 Sparc V8 VHDL core
- 90% of LUTs and 30% of BRAM on 1 V2P, with a 65 MHz clock
- Methodology extensible to other core designs

Step 2: Inserting Event Counters
- Memory-mapped counters
- Added instructions to the ISA for counter start/stop/reset
- 36 counters
- 3% of LUTs, no impact on operating freq.

Step 3: Power Model Development
- Power model form is: [equation not recovered from the source; its terms are labeled "un-clock-gated + leakage power" and "dynamic power"]
- Get E_i from gate-level simulations
  - Write instruction µbenchmarks
  - Get the Leon3 gate-level netlist from Synopsys Design Compiler
  - Feed µbenchmarks and netlist into Synopsys PrimeTime to get the component power breakdown

Step 4: System Integration and Linux 2.6 Boot

Design
  Core          Leon3 Sparc V8 VHDL core
  Organization  4-core, L1 snoopy cache coherence (ARM bus)
  Pipeline      Single-issue, in-order, 7-stage
  Funct. Units  Adder, Shifter, Pipelined Mul/Div
  L1 I-Cache    8 KB, 2-way, 32-byte lines, LRR
  L1 D-Cache    4 KB, 2-way, 32-byte lines, LRR, write-through, virtually addressed
  MMU           8-entry I and D TLBs, LRU

[Figure: block diagram of the emulated CMP. Sparc V8 Core 0 through Core N, each with a 3-port reg. file, 7-stage integer pipeline, 4KB I$, 8KB D$, and event counters, attached to the AHB bus through a 64-bit AHB cont.]

Power Model Validation and Speedup Results
- Power model validation against Synopsys PrimeTime demonstrates under 8% error
  - We use micro-benchmarks and 5 distinct 10^6-instruction snapshots from SPEC 2006 benchmarks (Mcf, Libquantum, Bzip2, Gcc, Sjeng)
- ~35x speedup measured over Multifacet GEMS/Ruby
  - Even greater speedup expected when modeling pipeline, more cores, and power, and when using a faster FPGA clock

Case Study: Activity Migration
- The emulator is ideal for activity migration (AM) studies
- Hotspots depend on component power, which is available from the emulator
- On-chip temperature rise/fall times are ~100 ms; the emulator is fast enough to run the OS and applications well beyond this range

Runtime Power Profiling
[Figure: emulated CMP (SparcV8 Core 0 through Core N with event counters, AHB bus, main memory module) connected to a host PC through I/O (RS-232, Ethernet). Linux 2.6 running multithreaded and multiprogrammed workloads. Integrated power models are fed by event counters.]
- Modify the Linux kernel to read counters within the 10 ms timer interrupt and deduce power trends
[Figure: runtime power traces. Annotations: CPU 1: master, CPU 0: idle; Barrier: CPU0 spin-waiting; Possible Reg. File hotspot; Bzip2: high activity, high power; Mcf: large working set, high stalls, low power; Mcf: data cached, high power; CPU 0 (Bzip2) overheats; CPU 0 (Mcf) cools off; Migration Triggered]

Conclusions
- Successfully implemented an FPGA-based perf./power emulator booting Linux 2.6 and running full applications
- Combines HW speeds (35x speedup over GEMS) with SW programmability
- Provides power models accurate to within 8% of Synopsys simulations
- Successfully demonstrated an activity migration case study
- FPGAs track Moore's Law: available resources increase as the architectures modeled become more complex

[Figure: FPGA Platform: BEE2 Control Unit]

Acknowledgements
This work was supported in part by the Gigascale Systems Research Center, funded under the Focus Center Research Program, a Semiconductor Research Corporation Program. In addition, this work was supported by the National Science Foundation under grant CNS
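
Illustrative sketch (not from the poster): the power model of Step 3 combines per-event energies E_i with event-counter readings, but the equation itself is not legible in this text. The Python below assumes a common linear form, static power plus the sum of E_i times count_i divided by the sampling interval, which may differ from the authors' exact model. The class name, event names, and all numeric values are hypothetical placeholders chosen only to make the arithmetic easy to follow.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class LinearPowerModel:
    """Assumed linear counter-based power model (hypothetical form)."""

    # Constant term: un-clock-gated + leakage power, in watts (placeholder value).
    static_watts: float
    # Energy per counted event E_i, in joules (placeholder values, not measured data).
    energy_per_event: Mapping[str, float]

    def average_power(self, counts: Mapping[str, int], interval_s: float) -> float:
        """Return average power in watts over one sampling interval.

        counts holds the event-counter deltas seen during the interval.
        An event name missing from energy_per_event raises KeyError.
        """
        if interval_s <= 0:
            raise ValueError("interval_s must be positive")
        dynamic_joules = sum(
            self.energy_per_event[name] * n for name, n in counts.items()
        )
        return self.static_watts + dynamic_joules / interval_s


# Hypothetical usage: counter deltas read once per 10 ms timer interrupt.
model = LinearPowerModel(
    static_watts=0.5,
    energy_per_event={"icache_access": 1e-9, "regfile_read": 2e-9},
)
sample = {"icache_access": 1_000_000, "regfile_read": 500_000}
power = model.average_power(sample, interval_s=0.010)
print(f"average power over the interval: {power:.3f} W")
```

Keeping the per-event energies in a plain mapping means a model recalibrated from new gate-level simulations can be swapped in without touching the code that reads the counters.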