EECS Electrical Engineering and Computer Sciences
Berkeley Par Lab: Parallel Computing Laboratory

A Case for FAME: FPGA Architecture Model Execution
Zhangxi Tan, Andrew Waterman, Henry Cook, Sarah Bird, Krste Asanovic, David Patterson
The Parallel Computing Lab, UC Berkeley
ISCA '10

A Brief History of Time
- Hardware prototyping was initially popular with architects
- But prototyping each point in a design space is expensive
- Simulators became the popular, cost-effective alternative
- Software Architecture Model Execution (SAME) simulators are the most popular
- SAME performance scaled with uniprocessor performance scaling

The Multicore Revolution
- Abrupt change to multicore architectures
- HW and SW systems are larger and more complex:
  - Timing-dependent nondeterminism
  - Dynamic code generation
  - Automatic tuning of app kernels
- We need more simulation cycles than ever

The Multicore Simulation Gap
- As the number of cores increases exponentially, the time to model a target cycle increases accordingly
- SAME is difficult to parallelize because of cycle-by-cycle interactions
- Relaxed simulation synchronization may not work
- We must bridge the simulation gap
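The scaling problem can be made concrete with a toy Python sketch (illustrative only, not any real simulator): a lockstep SAME simulator must step every core model once per target cycle, so host work per target cycle grows linearly with core count.

```python
def simulate(cores, target_cycles):
    """Lockstep cycle-by-cycle simulation: each target cycle steps every core model."""
    host_steps = 0
    for _ in range(target_cycles):
        for core in cores:
            core["cycle"] += 1   # advance this core's model by one target cycle
            host_steps += 1      # one unit of host work per core per target cycle
    return host_steps

one_core = simulate([{"cycle": 0}], 1000)
sixteen = simulate([{"cycle": 0} for _ in range(16)], 1000)
print(sixteen // one_core)  # host work grows 16x for the same target interval
```

Parallelizing the inner loop is hard precisely because cores can interact within a single target cycle, which is the cycle-by-cycle dependence the slide points out.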

One Decade of SAME
[Table: median instructions simulated per benchmark, median #cores, and median instructions simulated per core for ISCA papers a decade apart; the surviving figures show core counts growing from 1 to 16 while per-benchmark runs stay around 100M instructions.]
- The effect: dramatically shorter (~10 ms) simulation runs

FAME: FPGA Architecture Model Execution
- The SAME approach provides inadequate simulation throughput and latency
- We need a fundamentally new strategy to maximize useful experiments per day
- We want the flexibility of SAME and the performance of hardware
- Ours: FPGA Architecture Model Execution (FAME), cf. SAME, Software Architecture Model Execution
- Why FPGAs?
  - FPGA capacity is scaling with Moore's Law; a few cores now fit on a die
  - Highly concurrent programming model with cheap synchronization

Non-FAME: FPGA Computers
- FPGA computers: using FPGAs to build a production computer
- RAMP Blue (UCB 2006):
  - 1008 MicroBlaze cores
  - No MMU, message passing only
  - Requires lots of hardware: 21 BEE2 boards (a full rack) / 84 FPGAs
  - RTL directly mapped to the FPGA, so time-consuming to modify
- Cool and useful, but not a flexible simulator

FAME: System Simulators in FPGAs
[Diagram: Target System A (cores with I$/D$ behind a shared L2$/interconnect and DRAM) and Target System B (cores with private L2$s and DRAM), both mapped onto one Host System, the FAME simulator.]

A Vast FAME Design Space
- The FAME design space is even larger than SAME's
- Three dimensions of FAME simulators:
  - Direct or Decoupled: does one host cycle model one target cycle?
  - Full RTL or Abstract RTL?
  - Host single-threaded or host multi-threaded?
- See the paper for a FAME taxonomy!

FAME Dimension 1: Direct vs. Decoupled
- Direct FAME: compile the target RTL to the FPGA
  - Problem: common ASIC structures map poorly to FPGAs
  - Solution: resource-efficient multi-cycle FPGA mapping
- Decoupled FAME: decouple host cycles from target cycles
  - Full RTL is still modeled, so timing accuracy is still guaranteed
[Diagram: a target register file with four read ports (Rd1..Rd4) and two write ports, and its decoupled host implementation: an FSM time-multiplexing a smaller register file with two read ports and one write port.]
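As a rough illustration of the decoupled mapping, here is a toy Python model of a target register file with four read ports and two write ports served by a host memory with two read ports and one write port; the port counts follow the slide's register-file example, but the code itself is a hypothetical sketch, not the paper's RTL.

```python
def target_cycle_on_host(regfile, reads, writes, read_ports=2, write_ports=1):
    """Serve one target register-file cycle on a narrower host memory.

    Returns (read_values, host_cycles): the FSM spends as many host cycles
    as needed, so one target cycle maps to several host cycles."""
    host_cycles = 0
    values = []
    for i in range(0, len(reads), read_ports):        # two reads per host cycle
        values.extend(regfile[r] for r in reads[i:i + read_ports])
        host_cycles += 1
    for i in range(0, len(writes), write_ports):      # one write per host cycle
        for reg, val in writes[i:i + write_ports]:
            regfile[reg] = val
        host_cycles += 1
    return values, host_cycles

regs = list(range(32))
vals, cycles = target_cycle_on_host(regs, reads=[1, 2, 3, 4], writes=[(5, 42), (6, 7)])
print(vals, cycles)  # [1, 2, 3, 4] read back; 4 host cycles model 1 target cycle
```

The target machine's timing is unchanged by this multiplexing, which is why decoupling trades host cycles for FPGA resources without sacrificing accuracy.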

FAME Dimension 2: Full RTL vs. Abstract RTL
- Decoupled FAME models the full RTL of the target machine
  - But we don't have full RTL in the initial design phase
  - And full RTL is too much work for design space exploration
- Abstract FAME: model the target RTL at a high level
  - For example, split timing and functional models (à la SAME)
  - Also enables runtime parameterization: run different simulations without re-synthesizing the design
- The advantages of Abstract FAME come at a cost: model verification
  - The timing of the abstract model is not guaranteed to match the target machine
[Diagram: the target RTL is abstracted into a functional model plus a timing model.]
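A minimal sketch of the timing/functional split, with made-up instruction latencies as the runtime parameters: the functional model computes what happened, while the timing model separately charges target cycles, so timings can change without touching (or re-synthesizing) the functional side.

```python
def functional_step(state, instr):
    """Functional model: execute one instruction of a tiny made-up ISA."""
    op, dst, a, b = instr
    if op == "add":
        state[dst] = state[a] + state[b]
    elif op == "mul":
        state[dst] = state[a] * state[b]
    return op

def run(program, latencies):
    """Functional-first, timing-directed loop: execute, then charge cycles."""
    state, cycles = {"r0": 2, "r1": 3, "r2": 0, "r3": 0}, 0
    for instr in program:
        op = functional_step(state, instr)   # what happened
        cycles += latencies[op]              # how long it took (a runtime parameter)
    return state, cycles

prog = [("add", "r2", "r0", "r1"), ("mul", "r3", "r2", "r0")]
state, cycles = run(prog, {"add": 1, "mul": 3})
print(state["r3"], cycles)  # same functional result under any latency settings
```

Swapping in a different `latencies` table models a different target machine without re-running synthesis, which is exactly the runtime-parameterization benefit named above; the verification risk is that the charged cycles may not match any real RTL.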

FAME Dimension 3: Single- or Multi-threaded Host
- Problem: can't fit a big manycore on the FPGA, even abstracted
- Problem: long host latencies reduce utilization
- Solution: host multithreading
[Diagram: four target CPUs emulated by a multithreaded emulation engine on the FPGA: a single hardware pipeline with multiple copies of CPU state (PCs and register files).]
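Host multithreading can be sketched in a few lines of Python (a caricature of the hardware, with invented details): one shared "pipeline" advances a different target CPU's state each host cycle, so a long latency for one target core can be hidden behind work for the others.

```python
def host_multithreaded(num_cpus, host_cycles):
    """One shared host pipeline round-robins over per-CPU state copies."""
    pcs = [0] * num_cpus                # one copy of architectural state per target CPU
    for host_cycle in range(host_cycles):
        cpu = host_cycle % num_cpus     # round-robin thread select
        pcs[cpu] += 4                   # shared pipeline advances that CPU's PC
    return pcs

print(host_multithreaded(4, 12))  # each of 4 target CPUs advanced 3 instructions
```

Only the state (here, `pcs`) is replicated; the pipeline logic exists once, which is what lets a modest FPGA model a much larger target.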

Metrics besides Cycles: Power, Area, Cycle Time
- FAME simulators determine how many cycles a program takes to run
- Computing power/area/cycle time is the SAME old story:
  - Push the target RTL through a VLSI flow
  - Use analytical or empirical models
- Collecting event statistics for model inputs is much faster than with SAME

RAMP Gold: A Multithreaded FAME Simulator
- Rapid, accurate simulation of manycore architectural ideas using FPGAs
- The initial version models 64 cores of SPARC V8 with a shared memory system on a $750 board
- Hardware FPU and MMU; boots an OS
[Table: cost, performance (MIPS), and simulations per day for Simics (SAME) at about $2,000 versus RAMP Gold (FAME) at $2,000 plus the $750 board.]

RAMP Gold Target Machine
[Diagram: 64 SPARC V8 cores, each with an I$ and D$, connected through a shared L2$/interconnect to DRAM.]

RAMP Gold Model
- SPARC V8 ISA; one-socket manycore target (64 cores)
- Split functional/timing model, both in hardware:
  - The functional model executes the ISA
  - The timing model captures pipeline timing detail
- Host multithreading of both the functional and timing models
- Functional-first, timing-directed
- Built for Xilinx Virtex-5 systems [RAMP Gold, DAC '10]
[Diagram: a functional model pipeline with architected state beside a timing model pipeline with timing state; the target is 64 cores with I$/D$ behind a shared L2$/interconnect and DRAM.]

Case Study: Manycore OS Resource Allocation
- Spatial resource allocation in a manycore system is hard: combinatorial explosion in the number of apps and resources
- Idea: use predictive models of app performance to make allocation easier on the OS
  - HW partitioning for performance isolation (so the models still work when apps run together)
- Problem: evaluating the effectiveness of the resulting scheduling decisions requires running hundreds of schedules for billions of cycles each
  - Simulation-bound: 8.3 CPU-years for Simics!
- See the paper for app modeling strategy details
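The combinatorial explosion can be quantified with the standard stars-and-bars count (the app and resource numbers below are illustrative, not the case study's actual configuration): the number of ways to split R identical resource units among A applications is C(R+A-1, A-1).

```python
from math import comb

def allocations(resources, apps):
    """Stars-and-bars: ways to divide identical resource units among apps."""
    return comb(resources + apps - 1, apps - 1)

print(allocations(64, 2))   # 65 ways to split 64 cores between 2 apps
print(allocations(64, 8))   # over a billion allocations with 8 apps
```

Each additional partitionable resource (cache ways, memory bandwidth) multiplies this count further, which is why exhaustively evaluating schedules in software simulation becomes intractable.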

Case Study: Manycore OS Resource Allocation

Case Study: Manycore OS Resource Allocation
- The technique appears to perform very well on synthetic or reduced-input workloads, but is lackluster on realistic ones!

RAMP Gold Performance
- FAME (RAMP Gold) vs. SAME (Simics) performance
- PARSEC parallel benchmarks, large input sets
- More than 250x faster than the full-system simulator for a 64-core target system

Researcher Productivity is Inversely Proportional to Latency
- Simulation latency is even more important than throughput
- How long before the experimenter gets feedback?
- How many experimenter-days are wasted if there was an error in the experimental setup?
[Table: median and maximum experiment latency in days for FAME vs. SAME.]

Fallacy: FAME is too hard
- FAME simulators are more complex, but not greatly so
  - Efficient, complete SAME simulators are also quite complex
- Most experiments only need to change the timing model
  - RAMP Gold's timing model is only 1000 lines of SystemVerilog
  - We modeled Globally Synchronized Frames [Lee08] in 3 hours and 100 LOC
- Corollary fallacy: architects don't need to write RTL
  - We design hardware; we shouldn't be scared of HDL

Fallacy: FAME Costs Too Much
- Running SAME in the cloud (EC2) is much more expensive!
  - FAME: 5 XUP boards at $750 each; power at $0.10 per kWh
  - SAME: EC2 Medium-High instances at $0.17 per hour
[Table: FAME runs in 257 hours, costs $3,750 for the first experiment and about $10 for the next, offsetting about 0.1 trees of carbon; SAME takes 73,000 hours, which at $0.17 per hour is roughly $12,410.]
- Are architects good stewards of the environment? SAME uses the energy of 45 seconds of the Gulf oil spill!
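The cost comparison can be cross-checked from the quantities the slide states (board count and price, EC2 rate, runtime); this is just arithmetic on those figures, not additional data.

```python
# Figures as stated on the slide.
fame_boards = 5 * 750                # 5 XUP boards at $750 each
ec2_rate = 0.17                      # $/hour per EC2 Medium-High instance
same_hours = 73_000                  # SAME runtime in the cloud

print(fame_boards)                   # $3,750 up-front for the FAME hardware
print(round(ec2_rate * same_hours))  # ~$12,410 to rent the equivalent SAME work
```

Note the asymmetry: the FAME cost is a one-time purchase (subsequent experiments cost only power), while the EC2 bill recurs with every experiment.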

Fallacy: statistical sampling will save us
- Sampling may not make sense for multiprocessors: timing is now architecturally visible
  - It may be OK for transactional workloads
- Even if sampling is appropriate, runtime is dominated by functional warming, so we still need FAME
  - The FAME simulator ProtoFlex (CMU) was originally designed for this purpose
- Parallel programs of the future will likely be dynamically adaptive and auto-tuned, which may render sampling useless

Challenge: Simulator Debug Loop can be Longer
- It takes 2 hours to push RAMP Gold through the CAD tools
- Software RTL simulation to debug the simulator is also very slow
- The SAME debug loop is only minutes long
- But the sheer speed of FAME eases some tasks: try debugging and porting a complex parallel program with SAME

Challenge: FPGA CAD Tools
- Compared to ASIC tools, FPGA tools are immature
- We encountered 84 formally tracked bugs while developing RAMP Gold, including several in the formal verification tools!
- This is by far FAME's biggest barrier (help us, industry!)
- On the bright side, the more people using FAME, the better

When should Architects still use SAME?
- SAME is still appropriate in some situations:
  - Pure functional simulation
  - ISA design
  - Uniprocessor pipeline design
- FAME is necessary for manycore research with modern applications

Conclusions
- FAME uses FPGAs to build simulators, not computers
- FAME works, it's fast, and we're using it
- SAME doesn't cut it, so use FAME!
- Thanks to the entire RAMP community for contributions to the FAME methodology
- Thanks to NSF, DARPA, Xilinx, SPARC International, IBM, Microsoft, Intel, and UC Discovery for funding support

RAMP Gold source code is available: