1 RAMP Models and Platforms Krste Asanovic UC Berkeley RAMP Retreat, Berkeley, CA January 15, 2009.

Presentation transcript:

1 RAMP Models and Platforms Krste Asanovic UC Berkeley RAMP Retreat, Berkeley, CA January 15, 2009

2 Much confusion about RAMP
Frequently asked questions:
- When will RAMP be finished/usable?
- What ISA does RAMP use?
- Can RAMP model my new feature “X”?
- How accurate is RAMP?
- Why so many different RAMP projects?
- Why is there not more sharing among projects?

3 Not much confusion about software simulators
Rarely asked questions:
- When will software simulation be finished/usable?
- What ISA do software simulators use?
- Can a software simulator model my new feature “X”?
- How accurate is software simulation?
- Why so many software simulators?
- Why is there not more sharing among software simulators?

4 RAMP is a consortium, not a project
- Many projects with different goals
  - sometimes multiple per site
- So far, much sharing of ideas and techniques
  - Very healthy and active community
- Some sharing of low-level infrastructure
  - Boards + platform-level interfaces to DRAM, Ethernet, etc.
- Not a single complete infrastructure that everyone uses
  - and that’s been OK, and might continue to be OK

5 Run Model of Target on Host Platform
[Figure labels: Target Machine (CPU, Interconnect Network, DRAM); Host Platform; Hard Work]

6 RAMP Projects’ Goals
- Model some target machine trading off:
  - Fidelity
  - Model design effort
  - Emulation speed (and capacity)

7 Space of Target Machines
- Which ISA?
  - x86, SPARC, PowerPC, Alpha, ARM, MIPS?
- In-order or out-of-order cores?
- How many cores?
  - 1, 16, 256, 1M?
- Processor+memory of general-purpose machine, or whole SoC including I/O devices?
- Accelerators, GPUs?
- Which operating system? Hypervisor?

8 ISA Wars
- Original pick to standardize around was SPARC
  - Open standard
  - Available verification suite
  - Simplest ISA with extensive general-purpose software support (i.e., desktop/server development environment available); SGI/MIPS sorely missed…
  - Leon implementation for FPGA
  - Simics
- But the intent was always to support multiple ISAs

9 ISA usage in RAMP models
- UCB RAMP Blue: Microblaze++
  - Xilinx soft core modified to add 64-bit FPU
- Stanford RAMP Red: PowerPC
  - Used Virtex-II Pro hard cores
- UT FAST: x86
  - Functional simulation in software on front-end machine (or on PowerPC hard core)
- UT RAMP White: PowerPC -> SPARC
  - Initial version used hard PowerPC cores, moving to Leon soft cores
- MIT/Intel HASIM: Alpha -> x86?
  - Initially Alpha ISA, eventually to form basis of x86/uOP machine
- CMU ProtoFLEX: SPARC
  - “SPARC three ways” (own core + emulation on hard PowerPC core + emulation on front-end machine)
- UCB RAMP Gold & Internet-in-a-Box: SPARC
  - Own core design
- UCB/LBNL Green Flash: Tensilica
  - RTL generated from Tensilica tools

10 Supporting new ISAs
- x86 still very desirable, but difficult
  - FAST software functional model is probably current best approach if want to play with different timings
  - Microcoded functional model would be good way to go if had resources (HASIM?)
  - Even with working functional model, timing model is difficult? Adding new features difficult?
- ARM also desirable for mobile device modeling
  - Renewed interest in engaging here
- MIT/IBM PowerPC work in progress, could form functional model
- But nobody does this for fun - only to advance their own research goals…

11 Commercial/Existing RTL Cores
- Originally seen as big benefit of RAMP
- But didn’t turn out that way in practice (except for prototyping usage model - see later)
- Cores don’t provide features we need, too big, too difficult to modify
- For simple ISAs (i.e. non-x86), biggest help is ISA verification suites, and/or *really* simple synthesizable ISA pipeline to form basis of functional model

12 Operating System Support
- Currently only ProtoFLEX, FAST, RAMP-White support OS
  - Others can run one application with proxy mechanism for I/O
- Reflects interests of groups. OS is not primary subject of research for groups building models so far.
  - RAMP Gold to add support for ParLab OS work (Tessellation)
  - Green Flash to add support for HPC-style microkernel

13 Target systems
- From a few, to millions of cores
  - Scaling simulation to 100s of cores was a shared goal
  - But smaller core counts (16-128) very interesting also
  - Huge core counts (>1E6) also of interest
- Single node versus clusters
  - RAMP Blue & Internet-in-a-box are message-passing clusters
  - Rest are shared-memory systems
- Memory hierarchy and cache coherence protocols
  - Wide variety of possibilities
- Desktop/Laptop/Server versus Handheld or SoC
  - What is important to model for given research topic?
- Accelerators/GPUs
  - Even wider variety than CPU ISAs/microarchitectures

14 Wide variety, how to reuse?
Proposal:
- ISA functional models
  - also FPU across ISAs
  - Perhaps even common uOP engine across all ISAs?
- CPU Microarchitecture timing model
  - E.g., in-order superscalar, out-of-order with unified physical register file
- Memory functional model
  - Host-level caches + memory interleaving
- Memory hierarchy timing models
  - On-chip network types as subset
- I/O bus shims
  - To allow random RTL to be attached for I/O devices and non-GPU accelerators
This won’t be easy, as have to agree on interfaces between these components, might need further specialization
Definitely need more experience doing all of the above

15 Simulator Types
- Functional model only (no timing)
- RTL models (functional includes timing)
  - Also used for chip prototyping
- Split functional and timing models
- + Hybrids of above

16 Simulator Mapping Styles
- Gate-level emulator (Quickturn, Palladium)
  - ~1MHz
- Direct RTL emulator
  - 5-20MHz
- FPGA-tuned RTL emulator
  - 20-50MHz
- Virtualized RTL emulator
  - MHz
- Host-multithreaded models
  - >100MHz

17

18 RAMP Blue Release 2/25/ design available from RAMP website - ramp.eecs.berkeley.edu

19 Climate System Design Concept: Strawman Design Study
- 10PF sustained, ~120 m^2, <3MWatts, <$75M
- 32 boards per rack; 100 ~25KW power + comms
- 32 chip + memory clusters per board ( W
- 32 processors per 65nm chip; 83 7W
- 8 DRAM per processor chip: ~50 GB/s
- VLIW CPU:
  - 128b load-store + 2 DP MUL/ADD + integer op/DMA per cycle
  - Synthesizable at 650MHz in commodity 65nm
  - 1mm^2 core, mm^2 with inst cache, data cache data RAM, DMA interface, 0.25mW/MHz
  - Double precision SIMD FP: 4 ops/cycle (2.7GFLOPs)
  - Vectorizing compiler, cycle-accurate simulator, debugger GUI (existing part of Tensilica Tool Set)
  - 8 channel DMA for streaming from on/off chip DRAM
  - Nearest neighbor 2D communications grid
[Figure labels: Proc Array, RAM, CPU, K D, 2x128b, 32K I, 8 chan DMA, Opt. 8MB embedded DRAM, External DRAM interface, Master Processor, Comm Link Control]

20 Virtualized RTL Improves FPGA Resource Usage
RAMP allows units to run at varying target-host clock ratios to optimize area and overall performance
- Example 1: Multiported register file
  - For example, Sun Niagara has 3 read ports and 2 write ports to 6KB of register storage
  - If RTL mapped directly, requires 48K flip-flops
    - Slow cycle time, large area
  - If mapping into block RAMs (one read + one write per cycle), takes 3 host cycles and 3x2KB block RAMs
    - Faster cycle time (~3X) and far fewer resources
- Example 2: Large L2/L3 caches
  - Current FPGAs only have ~1MB of on-chip SRAM
  - Use on-chip SRAM to build cache of active piece of L2/L3 cache, stall target cycle if access misses and fetch data from off-chip DRAM
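The numbers in Example 1 can be checked with a short calculation. This is a back-of-the-envelope sketch only: the function names are hypothetical, and the 2KB block RAM size is an assumption read off the "3x2KB" figure on the slide rather than a statement about any particular FPGA family.

```python
import math


def direct_mapping_flip_flops(storage_bytes: int) -> int:
    """Flip-flops needed if every register bit is mapped to its own flip-flop."""
    return storage_bytes * 8


def host_cycles_per_target_cycle(target_reads: int, target_writes: int,
                                 host_reads: int = 1, host_writes: int = 1) -> int:
    """Host cycles needed to serve all target ports by time-multiplexing
    a memory that offers host_reads reads and host_writes writes per host cycle."""
    return max(math.ceil(target_reads / host_reads),
               math.ceil(target_writes / host_writes))


def block_rams_needed(storage_bytes: int, bram_bytes: int) -> int:
    """Block RAMs needed purely for capacity (assumed block RAM size)."""
    return math.ceil(storage_bytes / bram_bytes)


if __name__ == "__main__":
    storage = 6 * 1024  # 6KB of register storage, as on the slide
    print("flip-flops if mapped directly:", direct_mapping_flip_flops(storage))
    print("host cycles per target cycle:", host_cycles_per_target_cycle(3, 2))
    print("2KB block RAMs:", block_rams_needed(storage, 2 * 1024))
```

The read ports are the bottleneck here: three target reads over one host read port set the 3:1 target-host clock ratio, while the two writes fit within the same three host cycles.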

21 Host Multithreading (Zhangxi Tan (UCB), Chung (CMU))
- Multithreading emulation engine reduces FPGA resource use and improves emulator throughput
- Hides emulation latencies (e.g., communicating across FPGAs)
- Single hardware pipeline with multiple copies of CPU state
[Figure labels: Target Model (CPU 1, CPU 2, CPU 3, CPU 4); Multithreaded Host Emulation Engine (on FPGA) with per-thread PC, I$, IR, GPR, D$]
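A minimal software sketch of the host-multithreading idea follows: one shared "pipeline" function steps several target CPUs in round-robin order, with a separate copy of architectural state per target CPU. The toy instruction set (`addi`, `halt`), the class name, and the round-robin policy are all assumptions made for illustration; this is not the RAMP Gold or ProtoFLEX design.

```python
from dataclasses import dataclass, field


@dataclass
class TargetCpuState:
    """Per-target-CPU architectural state (one copy per host thread)."""
    pc: int = 0
    regs: list = field(default_factory=lambda: [0] * 4)
    halted: bool = False


def step(state: TargetCpuState, program: list) -> None:
    """Execute one instruction of one target CPU on the shared host pipeline."""
    if state.pc >= len(program):
        state.halted = True
        return
    op, dst, imm = program[state.pc]
    if op == "addi":
        state.regs[dst] += imm
    elif op == "halt":
        state.halted = True
        return
    else:
        raise ValueError(f"unknown op {op!r}")
    state.pc += 1


def run_host_multithreaded(programs: list, max_host_cycles: int = 1000):
    """Interleave all target CPUs on a single pipeline, one per host cycle."""
    states = [TargetCpuState() for _ in programs]
    host_cycles = 0
    while host_cycles < max_host_cycles and not all(s.halted for s in states):
        tid = host_cycles % len(states)  # round-robin thread select
        if not states[tid].halted:
            step(states[tid], programs[tid])
        host_cycles += 1
    return states, host_cycles


if __name__ == "__main__":
    progs = [[("addi", 0, i + 1), ("addi", 0, 10), ("halt", 0, 0)] for i in range(4)]
    final_states, cycles = run_host_multithreaded(progs)
    print([s.regs[0] for s in final_states], cycles)
```

Because consecutive host cycles belong to different target CPUs, a long-latency operation for one target CPU can overlap with useful work for the others, which is the latency-hiding effect the slide refers to.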

22 Split Functional/Timing Models (HASIM: Emer (MIT/Intel); FAST: Chiou (UT Austin))
- Functional model executes CPU ISA correctly, no timing information
  - Only need to develop functional model once for each ISA
- Timing model captures pipeline timing details, does not need to execute code
  - Much easier to change timing model for architectural experimentation
  - Without RTL design, cannot be 100% certain that timing is accurate
- Many possible splits between timing and functional model
[Figure labels: Functional Model, Timing Model]
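One possible split can be sketched in a few lines: a functional model that executes instructions and emits a trace, and a timing model that only consumes the trace and charges latencies. The three-instruction toy ISA, the trace format, and the fixed per-opcode latencies are assumptions for illustration; HASIM and FAST partition the work differently and in far more detail.

```python
def functional_model(program, regs=None):
    """Execute the program correctly; return final registers and an opcode trace.
    Carries no timing information."""
    regs = dict(regs or {})
    trace = []
    for op, dst, a, b in program:
        if op == "li":       # load immediate: dst <- a
            regs[dst] = a
        elif op == "add":    # dst <- regs[a] + regs[b]
            regs[dst] = regs.get(a, 0) + regs.get(b, 0)
        elif op == "mul":    # dst <- regs[a] * regs[b]
            regs[dst] = regs.get(a, 0) * regs.get(b, 0)
        else:
            raise ValueError(f"unknown op {op!r}")
        trace.append(op)
    return regs, trace


def timing_model(trace, latency):
    """Charge a latency per traced instruction; never executes any code."""
    return sum(latency[op] for op in trace)


if __name__ == "__main__":
    program = [
        ("li", "r1", 3, None),
        ("li", "r2", 4, None),
        ("mul", "r3", "r1", "r2"),
        ("add", "r4", "r3", "r1"),
    ]
    regs, trace = functional_model(program)
    # Two hypothetical microarchitectures share the same functional run.
    print(regs["r4"], timing_model(trace, {"li": 1, "add": 1, "mul": 3}))
    print(regs["r4"], timing_model(trace, {"li": 1, "add": 1, "mul": 5}))
```

Changing the multiplier latency only touches the table passed to `timing_model`; the functional model and its results are unchanged, which is the reuse argument made on the slide.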

23 RAMP White: Hari Angepat, Derek Chiou (UT Austin)
- Scalable Coherent Shared Memory Multiprocessor
- Support standard shared memory programming models
[Figure labels: Leon 3 (Mst, Slv, Dbg, Int), Leon3 shim, MP IntCntrl, DSU, Eth, DDR2, AHB bus, AHB shim, Intersection Unit, NIU, Router]

24 Multithreaded Func. & Timing Models (RAMP Gold: UCB)
- MT-Unit multiplexes multiple target units on a single host engine
- MT-Channel multiplexes multiple target channels over a single host link
[Figure labels: Functional Model Pipeline, Arch State, Timing Model Pipeline, Timing State, MT-Unit, MT-Channels]
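The channel-multiplexing idea can be illustrated with a small sketch in which messages from several target channels share one host link by carrying a channel tag. The tagging scheme, the round-robin arbitration, and the function names are assumptions for illustration and are not taken from the RAMP Gold implementation.

```python
from collections import deque


def mux_channels(channels):
    """Interleave messages from several target channels onto one host link.
    Each host-link word is a (channel_id, payload) pair."""
    queues = [deque(c) for c in channels]
    link = []
    while any(queues):                      # an empty deque is falsy
        for cid, q in enumerate(queues):    # round-robin over target channels
            if q:
                link.append((cid, q.popleft()))
    return link


def demux_channels(link, num_channels):
    """Recover the per-channel message streams at the far end of the host link."""
    out = [[] for _ in range(num_channels)]
    for cid, payload in link:
        out[cid].append(payload)
    return out


if __name__ == "__main__":
    target_channels = [["a0", "a1"], ["b0"], ["c0", "c1", "c2"]]
    host_link = mux_channels(target_channels)
    print(host_link)
    print(demux_channels(host_link, len(target_channels)))
```

Per-channel ordering is preserved because each channel's messages enter and leave the link in FIFO order, even though messages from different channels are interleaved.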

25 CMU Simics/RAMP Simulator 16-CPU Shared-memory UltraSPARC III Server (SunFire 3800) BEE2 Platform

26 What Hardware Platforms?
- RTL mapping approaches
  - Need large amounts of logic
  - Selected BEE2, and then designed BEE3 for this emulation style
  - Observed that don’t need much interconnect bandwidth (memory + inter-board links) because RTL cores are slow and latency sensitive
- Host-multithreading allows large systems to be mapped to small (one?) FPGA (e.g., cores on ML505)
  - Logic gate count not as critical, need to focus on on-chip capacity, off-chip memory bandwidth and total memory capacity per FPGA (conventional processor memory hierarchy issues multiplied by multithreading factor)
  - One big FPGA with lots of fast memory channels would be ideal
- Software functional emulation (FAST) or transplant (ProtoFLEX)
  - Focus on fast coherent connection to front-end x86 CPU
  - Hypertransport, FSB, QPI interfaces better than PCI I/O connections

27 Summary
- Many reasons for great divergence in RAMP projects
  - Different ISAs, different target machines, different research topics, different emulation styles
- Sharing possible, but hard work and more experience needed
Questions?