GSRC Annual Symposium Sep 29-30, 2008 Full-System Chip Multiprocessor Power Evaluations Using FPGA-Based Emulation Abhishek Bhattacharjee, Gilberto Contreras,

Abhishek Bhattacharjee, Gilberto Contreras, Margaret Martonosi, Princeton University
Appears in the International Symposium on Low Power Electronics and Design (ISLPED) '08
Concurrent, Task 1.1.2.5

Motivation
- Detailed performance/power tradeoffs at the µarch level are crucial
- SW simulators are traditionally used at the µarch stage (e.g., Wattch, SimplePower, HotSpot)
  - Flexible, low development time
- But SW simulators are slow:
  - More complex chips
  - More complex design space
  - Need to model OS and workload interaction
- SW is increasingly removed from modeling requirements

Proposed Solutions
1. Run application snippets, ignore the OS → accuracy and credibility are compromised
2. Parallelize the SW simulator → shared data structures (e.g., LLC, coherence) limit scalability
3. Hardware runtime monitoring → restricted view of components; requires an existing design

Our Approach
- Develop an FPGA-based performance/power emulator that models a proposed CMP
- Emulation rate of 65 MHz → run full applications and the Linux 2.6 kernel
- Programmable → insert relevant activity monitors, model various architectures
- Combine the best of SW simulators and HW runtime monitoring
- Bottom line: get the detail and full-system effects of real measurements before the chip is built
- First full-system power/performance FPGA emulation of a CMP running a full Linux 2.6 distribution with multiprogramming and multithreading support

Step 1: Choosing a Target FPGA Platform
- Currently use the BEE2 (control unit)
- Will utilize the user FPGA units as the design scales
- Methodology extensible to other platforms

Step 1: Choose a Candidate Core Design
- Currently use the Leon3 SPARC V8 VHDL core
- 90% of LUTs and 30% of BRAM on one Virtex-II Pro (V2P) FPGA, with a 65 MHz clock
- Methodology extensible to other core designs

Step 2: Inserting Event Counters
- Memory-mapped counters
- Added instructions to the ISA for counter start/stop/reset
- 36 counters → 3% of LUTs, no impact on operating frequency

Step 3: Power Model Development
- Power model form: [Equation: un-clock-gated + leakage power + dynamic power]
- Get E_i from gate-level simulations:
  1. Write 500-1000-instruction µbenchmarks
  2. Get the Leon3 gate-level netlist from Synopsys Design Compiler
  3. Feed the µbenchmarks and netlist into Synopsys PrimeTime to get a component power breakdown

Step 4: System Integration and Linux 2.6 Boot

FPGA-Based CMP Emulation Infrastructure
Design
- Core: Leon3 SPARC V8 VHDL core
- Organization: 4-core, L1 snoopy cache coherence (ARM bus)
- Pipeline: single-issue, in-order, 7-stage
- Functional units: adder, shifter, pipelined mul/div
- L1 I-cache: 8 KB, 2-way, 32-byte lines, LRR
- L1 D-cache: 4 KB, 2-way, 32-byte lines, LRR, write-through, virtually addressed
- MMU: 8-entry I and D TLBs, LRU
[Figure: SPARC V8 cores 0 to N, each with a 3-port register file, 7-stage integer pipeline, 4 KB I$, 8 KB D$, and event counters, connected through a 64-bit AHB controller to the AHB bus]

Power Model Validation and Speedup Results
- Power model validation against Synopsys PrimeTime demonstrates under 8% error
- We use micro-benchmarks and 5 distinct 10⁶-instruction snapshots from SPEC 2006 benchmarks (Mcf, Libquantum, Bzip2, Gcc, Sjeng)
- ~35x speedup measured over Multifacet GEMS/Ruby
- Even greater speedup expected when modeling the pipeline, more cores, and power, and when using a faster FPGA clock

Case Study: Activity Migration
- The emulator is ideal for activity migration (AM) studies
- Hotspots depend on component power → available from the emulator
- On-chip temperature rise/fall times are ~100 ms → the emulator is fast enough to run the OS and applications well beyond this range
[Figure: emulated CMP (SPARC V8 cores 0 to N with event counters, AHB bus, main memory module) connected to a host PC over RS-232 and Ethernet I/O]
Linux 2.6 running multithreaded and multiprogrammed workloads. Integrated power models are fed by event counters.
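The slides name the power model's ingredients (per-event energies E_i from gate-level simulation, an un-clock-gated + leakage term, a dynamic term, and event counters read every 10 ms) but the equation itself did not survive. The following is a minimal Python sketch of how sampled counters could feed such a model, assuming a linear form P = P_static + Σ_i E_i·N_i / T over a sampling interval T, and assuming a simple threshold on exponentially smoothed power as the migration trigger. The class and function names, the 32-bit counter width, the event names, energies, and threshold are all hypothetical placeholders, not values from the paper.

```python
from dataclasses import dataclass


@dataclass
class PowerModel:
    """Hypothetical linear model: P = P_static + sum_i(E_i * N_i) / T."""
    static_power_w: float                 # un-clock-gated + leakage power (placeholder)
    energy_per_event_j: dict[str, float]  # E_i in joules per counted event (placeholders)

    def estimate(self, counts: dict[str, int], interval_s: float) -> float:
        """Average power in watts over one sampling interval of length T = interval_s."""
        if interval_s <= 0:
            raise ValueError("interval_s must be positive")
        dynamic_j = sum(self.energy_per_event_j[name] * n for name, n in counts.items())
        return self.static_power_w + dynamic_j / interval_s


def counter_deltas(prev: dict[str, int], curr: dict[str, int],
                   width_bits: int = 32) -> dict[str, int]:
    """Events counted between two samples; masking tolerates a single wrap-around."""
    mask = (1 << width_bits) - 1
    return {name: (curr[name] - prev[name]) & mask for name in curr}


@dataclass
class HotspotMonitor:
    """Flags migration when exponentially smoothed power exceeds a threshold.

    alpha approximates (sample period / thermal time constant), e.g. 10 ms / 100 ms.
    The smoothing is a crude stand-in for a real thermal model.
    """
    threshold_w: float
    alpha: float = 0.1
    smoothed_w: float = 0.0

    def update(self, power_w: float) -> bool:
        self.smoothed_w += self.alpha * (power_w - self.smoothed_w)
        return self.smoothed_w > self.threshold_w


if __name__ == "__main__":
    model = PowerModel(static_power_w=0.5,
                       energy_per_event_j={"ialu": 2e-9, "dcache": 5e-9})
    deltas = counter_deltas({"ialu": 0, "dcache": 0},
                            {"ialu": 1_000_000, "dcache": 200_000})
    print(f"{model.estimate(deltas, interval_s=0.010):.2f} W")  # prints 0.80 W

    monitor = HotspotMonitor(threshold_w=0.5)
    flags = [monitor.update(1.0) for _ in range(8)]  # constant 1 W load
    print(flags.index(True) + 1)  # prints 7: migration flagged on the 7th sample
```

A single threshold keeps the sketch short; a practical trigger would likely add hysteresis so that a task does not bounce between cores on every sample.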
SparcV8 Core N  Modify Linux kernel to read counters within 10ms timer interrupt and deduce power trends Runtime Power Profiling CPU 1: master, CPU 0: idle Barrier: CPU0 spin- waiting Possible Reg. File hotspot Bzip2 –high activity, high power Mcf – large working set, high stalls, low power Mcf – data cached, high powerCPU 0 (Bzip2) overheats CPU 0 (Mcf) cools off Migration Triggered  Successfully implemented FPGA-based perf. /power emulator booting Linux 2.6 and running full applications  Combines HW speeds (35x speedup over GEMS) with SW programmability  Provides power models accurate within 8% Synopsys simulations  Successfully demonstrated activity migration case study  FPGAs track Moore’s Law: available resources increase as architectures modeled become more complex Conclusions FPGA Platform: BEE2 Control Unit This work was supported in part by the Gigascale Systems Research Center, funded under the Focus Center Research Program, a Semiconductor Research Corporation Program. In addition, this work was supported by the National Science Foundation under grant CNS-0720561 Acknowledgements