GSRC Annual Symposium Sep 29-30, 2008 Full-System Chip Multiprocessor Power Evaluations Using FPGA-Based Emulation Abhishek Bhattacharjee, Gilberto Contreras,

Presentation transcript:

GSRC Annual Symposium, Sep 29-30, 2008
Full-System Chip Multiprocessor Power Evaluations Using FPGA-Based Emulation
Abhishek Bhattacharjee, Gilberto Contreras, Margaret Martonosi, Princeton University
Appears in the International Symposium on Low Power Electronics and Design (ISLPED), '08
Concurrent, Task

Motivation
- Detailed performance/power tradeoffs at the µarch level are crucial
- SW simulators are traditionally used at the µarch stage
  - e.g. Wattch, SimplePower, HotSpot
  - Flexible, low development time
- But SW simulators are slow
  - More complex chips
  - More complex design space
  - Need to model OS, workload interaction

Proposed Solutions
SW is increasingly removed from modeling requirements
1. Run application snippets, ignore OS
   -> Accuracy and credibility are compromised
2. Parallelize SW simulator
   -> Shared data structures (e.g. LLC, coherence) limit scalability
3. Hardware runtime monitoring
   -> Restricted view of components and requires existing design

Our Approach
- Develop an FPGA-based performance/power emulator that models a proposed CMP
  - Emulation rate of 65 MHz -> run full apps, Linux 2.6 kernel
  - Programmable -> insert relevant activity monitors, model various architectures
- Combine the best of SW simulators and HW runtime monitoring
- Bottom line: get the detail and full-system effects of real measurements before the chip is built
- First full-system power/performance FPGA emulation of a CMP running a full Linux 2.6 distribution with multiprogramming and multithreading support

FPGA-Based CMP Emulation Infrastructure Design
Step 1: Choosing a Target FPGA Platform
- Currently use the BEE2 (control unit)
- Will utilize user FPGA units as design scales
- Methodology extensible to other platforms
Step 1: Choose a Candidate Core Design
- Currently use Leon3 Sparc V8 VHDL core
- 90% LUTs, 30% BRAM on 1 V2P with 65 MHz clock
- Methodology extensible to other core designs
Step 2: Inserting Event Counters
- Memory-mapped counters
- Added instructions to ISA for counter start/stop/reset
- 36 counters -> 3% LUTs, no impact on operating freq.
Step 3: Power Model Development
- Power model form is: [equation not preserved in the transcript; its two terms are labeled "Un-clock gated + leakage power" and "Dynamic power"]
- Get E_i from gate-level simulations
  - Write instruction µbenchmarks
  - Get Leon3 gate-level netlist from Synopsys Design Compiler
  - Feed µbenchmarks and netlist into Synopsys PrimeTime to get component power breakdown
Step 4: System Integration and Linux 2.6 Boot

Design
  Core           Leon3 Sparc V8 VHDL core
  Organization   4-core, L1 snoopy cache coherence (ARM bus)
  Pipeline       Single-issue, in-order, 7-stage
  Funct. Units   Adder, Shifter, Pipelined Mul/Div
  L1 I-Cache     8 KB, 2-way, 32-byte lines, LRR
  L1 D-Cache     4 KB, 2-way, 32-byte lines, LRR, write-through, virtually addressed
  MMU            8-entry I and D TLBs, LRU

[Block diagram: Sparc V8 Core 0 through Sparc V8 Core N, each with a 3-port reg. file, 7-stage integer pipeline, 4KB I$, 8KB D$, and event counters; 64-bit AHB controller; AHB bus]

Power Model Validation and Speedup Results
- Power model validation against Synopsys PrimeTime demonstrates under 8% error
  - We use micro-benchmarks and 5 distinct 10^6-instruction snapshots from SPEC 2006 benchmarks (Mcf, Libquantum, Bzip2, Gcc, Sjeng)
- ~35x speedup measured over Multifacet GEMS/Ruby
- Even greater speedup expected when modeling pipeline, more cores, power, and when using a faster FPGA clock

Case Study: Activity Migration
- Emulator is ideal for AM studies
- Hotspots depend on component power -> available from emulator
- On-chip temperature rise/fall times ~100 ms -> emulator is fast enough to run OS and applications well beyond this range

[System diagram: Host PC; I/O (RS-232, Ethernet); emulated CMP with SparcV8 Core 0 through SparcV8 Core N, event counters, AHB bus, and main memory module. Linux 2.6 running multithreaded and multiprogrammed workloads. Integrated power models are fed by event counters.]

Runtime Power Profiling
- Modify Linux kernel to read counters within the 10 ms timer interrupt and deduce power trends

[Plot annotations: CPU 1: master, CPU 0: idle; Barrier: CPU0 spin-waiting; Possible Reg. File hotspot; Bzip2: high activity, high power; Mcf: large working set, high stalls, low power; Mcf: data cached, high power; CPU 0 (Bzip2) overheats; CPU 0 (Mcf) cools off; Migration Triggered]

Conclusions
- Successfully implemented FPGA-based perf./power emulator booting Linux 2.6 and running full applications
- Combines HW speeds (35x speedup over GEMS) with SW programmability
- Provides power models accurate to within 8% of Synopsys simulations
- Successfully demonstrated activity migration case study
- FPGAs track Moore's Law: available resources increase as architectures modeled become more complex

FPGA Platform: BEE2 Control Unit

Acknowledgements
This work was supported in part by the Gigascale Systems Research Center, funded under the Focus Center Research Program, a Semiconductor Research Corporation Program. In addition, this work was supported by the National Science Foundation under grant CNS
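
Illustrative sketch (not from the slides)
The transcript describes a power model with a static term (un-clock gated + leakage power) and a dynamic term driven by event counters and per-event energies E_i, sampled by the kernel every 10 ms. The exact equation is not preserved, so the Python sketch below assumes one common form: static power plus the sum of E_i times the counter increase over the interval, divided by the interval length. All counter names, energy values, the static power figure, and the migration rule are hypothetical placeholders, not values or policies from the presentation.

```python
from typing import Mapping, Optional, Sequence, Tuple

# Hypothetical per-event energies in nanojoules. In the slides, E_i comes
# from gate-level simulation; these numbers are made up for illustration.
ENERGY_PER_EVENT_NJ = {
    "icache_access": 0.20,
    "dcache_access": 0.25,
    "regfile_access": 0.10,
    "alu_op": 0.05,
}

STATIC_POWER_W = 0.5        # hypothetical un-clock-gated + leakage power
SAMPLE_INTERVAL_S = 0.010   # 10 ms timer interrupt, as stated in the slides


def interval_power(
    prev: Mapping[str, int],
    curr: Mapping[str, int],
    energies_nj: Mapping[str, float] = ENERGY_PER_EVENT_NJ,
    static_w: float = STATIC_POWER_W,
    interval_s: float = SAMPLE_INTERVAL_S,
) -> float:
    """Estimate average power (watts) over one sampling interval.

    Assumes counters only increase and do not wrap between samples.
    """
    dynamic_nj = sum(
        energies_nj[name] * (curr[name] - prev[name]) for name in energies_nj
    )
    return static_w + dynamic_nj * 1e-9 / interval_s


def pick_migration(
    core_power_w: Sequence[float], threshold_w: float
) -> Optional[Tuple[int, int]]:
    """Hypothetical trigger: if the hottest core exceeds the threshold,
    return (hot_core, cool_core) as a swap suggestion; otherwise None."""
    indices = range(len(core_power_w))
    hot = max(indices, key=lambda i: core_power_w[i])
    cool = min(indices, key=lambda i: core_power_w[i])
    if core_power_w[hot] > threshold_w and hot != cool:
        return hot, cool
    return None


if __name__ == "__main__":
    before = {name: 0 for name in ENERGY_PER_EVENT_NJ}
    after = dict(before, icache_access=1_000_000, alu_op=400_000)
    print("estimated power (W):", interval_power(before, after))
    print("migration:", pick_migration([0.9, 0.4], threshold_w=0.8))
```

A real trigger would more likely act on a temperature estimate derived from component power, since the slides note that hotspots depend on component power and that on-chip temperature changes over roughly 100 ms; the power threshold here only keeps the sketch short.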