1 RAMP Infrastructure Krste Asanovic UC Berkeley RAMP Tutorial, ISCA/FCRC, San Diego June 10, 2007.

Presentation transcript:

1 RAMP Infrastructure Krste Asanovic UC Berkeley RAMP Tutorial, ISCA/FCRC, San Diego June 10, 2007

2 RAMP: An infrastructure to build simulators using FPGAs

3 Run Target Model on Host Platform
[Figure labels: Host Platform; CPU, Interconnect Network, DRAM; Target Model; "Hard Work"]

4 Reduce, Reuse, Recycle
- Reduce effort to build target models
  - Users just build components; infrastructure handles connections (the RDL Compiler)
- Reuse components by having good abstractions
  - Across different target models
  - Across different host platforms: XUP, Calinx, BEE2, BEE3, also Altera (see Greg)
- Recycle existing IP for use as simulation models
  - Commercial processor RTL is its own model

5 RAMP Target Models
- Units
  - Relatively large chunks of functionality, e.g., processor + L1 cache
  - User-written in some HDL or software
- Channels
  - Point-to-point, unidirectional, two kinds:
    - FIFO channel: flow-controlled interface
    - Pipeline channel: simple shift register, bits drop off the end
  - Generated by RAMP infrastructure
[Figure: Unit A, Unit B, and Unit C connected by a FIFO channel and a pipeline channel]
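The difference between the two channel kinds can be sketched in a few lines of Python. This is only an illustrative software model: the class names `PipelineChannel` and `FifoChannel` and their methods are hypothetical and are not part of the RAMP infrastructure or RDL.

```python
from collections import deque


class PipelineChannel:
    """Fixed-latency shift register (hypothetical model).

    A value written in one cycle falls out `latency` cycles later,
    whether or not the receiver is ready for it. Assumes latency >= 1.
    """

    def __init__(self, latency):
        if latency < 1:
            raise ValueError("latency must be at least 1")
        self.stages = deque([None] * latency, maxlen=latency)

    def shift(self, value):
        """Advance one target cycle: push `value`, return what drops off the end."""
        out = self.stages[0]
        self.stages.append(value)  # maxlen silently discards the oldest stage
        return out


class FifoChannel:
    """Flow-controlled queue (hypothetical model): check readiness before use."""

    def __init__(self, depth):
        self.depth = depth
        self.items = deque()

    def ready_to_enq(self):
        return len(self.items) < self.depth

    def ready_to_deq(self):
        return len(self.items) > 0

    def enq(self, value):
        if not self.ready_to_enq():
            raise RuntimeError("enq on a full FIFO channel")
        self.items.append(value)

    def deq(self):
        if not self.ready_to_deq():
            raise RuntimeError("deq on an empty FIFO channel")
        return self.items.popleft()
```

In this sketch a receiver that ignores the output of `PipelineChannel.shift` simply loses the value, while `FifoChannel` holds data until it is dequeued and pushes back on the sender when full.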

6 Target FIFO Channel Parameters
- Need buffering of at least (Forward + Reverse) latency to get full bandwidth over link
- RAMP infrastructure instantiates channel with desired parameters
[Figure labels: Forward Latency, Reverse Latency, Buffering, Datawidth; handshake signals RDY/ENQ and RDY/DEQ]
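The buffering rule can be checked with a small cycle-level simulation. The sketch below assumes a credit-style link: the sender holds one credit per buffer slot, data takes `fwd_latency` cycles to arrive, the receiver dequeues immediately, and the freed slot is reported back after `rev_latency` cycles. The function name and this credit mechanism are assumptions made for illustration; the slide only states the buffering requirement.

```python
from collections import deque


def link_throughput(buffering, fwd_latency, rev_latency, cycles=10_000):
    """Return delivered items per cycle for a credit-controlled link.

    Hypothetical model; assumes fwd_latency >= 1 and rev_latency >= 1.
    """
    credits = buffering
    data_in_flight = deque()     # cycle numbers at which data arrives
    credits_in_flight = deque()  # cycle numbers at which credits return
    delivered = 0

    for now in range(cycles):
        # Data arriving now is dequeued at once, which frees a buffer slot.
        while data_in_flight and data_in_flight[0] <= now:
            data_in_flight.popleft()
            delivered += 1
            credits_in_flight.append(now + rev_latency)

        # Credits that have completed the reverse trip become usable.
        while credits_in_flight and credits_in_flight[0] <= now:
            credits_in_flight.popleft()
            credits += 1

        # The sender injects at most one item per cycle, if it holds a credit.
        if credits > 0:
            credits -= 1
            data_in_flight.append(now + fwd_latency)

    return delivered / cycles


if __name__ == "__main__":
    for buffering in (1, 2, 3, 4, 5):
        print(buffering, link_throughput(buffering, fwd_latency=2, rev_latency=2))
```

With forward and reverse latencies of 2 cycles each, this model sustains close to one item per cycle once buffering reaches 4, and roughly half that with buffering of 2, which matches the (Forward + Reverse) rule.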

7 Target Pipeline Channel Parameters
- Only recommended for expert use in target models (should use FIFO channels and latency-insensitive protocols in target design)
[Figure labels: Forward Latency, Datawidth]

8 RAMP Description Language (RDL)
- User describes target model topology, channel parameters, and (manual) mapping to host platform FPGAs using RDL
- RDL Compiler (RDLC) generates configurations
[Figure: Target: Unit A, Unit B, Unit C. Host: RDLC places the units on FPGA1 and FPGA2 inside generated unit wrappers; generated links carry channels.]
[ Greg Gibeling, UCB ]

9 Virtual Target Clock

10 Virtualized RTL Improves FPGA Resource Usage
- RAMP allows units to run at varying target-host clock ratios to optimize area and overall performance
- Example 1: Multiported register file
  - For example, Sun Niagara has 3 read ports and 2 write ports to 6KB of register storage
  - If RTL is mapped directly, requires 48K flip-flops: slow cycle time, large area
  - If mapped into block RAMs (one read + one write per cycle), takes 3 host cycles and 3x2KB block RAMs: faster cycle time (~3X) and far fewer resources
- Example 2: Large L2/L3 caches
  - Current FPGAs only have ~1MB of on-chip SRAM
  - Use on-chip SRAM to build a cache of the active piece of the L2/L3 cache; stall the target cycle if an access misses and fetch data from off-chip DRAM
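The numbers in Example 1 can be reproduced with simple arithmetic. The sketch below is one reading of the slide: it assumes one flip-flop per stored bit, that the register file's ports are time-multiplexed over a block RAM port pair that serves one read and one write per host cycle, and that each block RAM holds 2KB (taken from the "3x2KB" figure). The function names are hypothetical.

```python
import math


def flip_flops_needed(storage_bytes):
    """Direct RTL mapping: one flip-flop per stored bit."""
    return storage_bytes * 8


def host_cycles_per_target_cycle(read_ports, write_ports,
                                 bram_reads=1, bram_writes=1):
    """Host cycles needed to serve every port access of one target cycle."""
    return max(math.ceil(read_ports / bram_reads),
               math.ceil(write_ports / bram_writes))


def block_rams_needed(storage_bytes, bram_bytes):
    """Block RAMs needed to hold the register storage."""
    return math.ceil(storage_bytes / bram_bytes)


if __name__ == "__main__":
    niagara_bytes = 6 * 1024  # 6KB of register storage, from the slide
    print(flip_flops_needed(niagara_bytes))               # 49152, i.e. 48K
    print(host_cycles_per_target_cycle(3, 2))             # 3
    print(block_rams_needed(niagara_bytes, 2 * 1024))     # 3
```

The three reads dominate, so one target cycle costs three host cycles; the two writes fit within the same three cycles.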

11 Start/Done Timing Interface
- Wrapper generated by RDL asserts “Start” on the physical FPGA cycle when the inputs to the unit are ready for the next target cycle
- Unit asserts “Done” when it finishes the target cycle and its outputs are ready
- Unit can take a variable amount of time
- Unvirtualized RTL unit can connect “Done” to “Start” (but must not clock until “Start”)
[Figure: Wrapper around Unit with inputs In1 and In2, output Out, and Start/Done signals]
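The handshake can be sketched as a software model. Everything here is hypothetical: `MultiCycleUnit` stands in for a unit that needs a configurable number of host cycles per target cycle, and `run_wrapper` stands in for the generated wrapper. For simplicity the sketch assumes the inputs for each target cycle are always available, so Start is asserted as soon as the previous target cycle is done.

```python
class MultiCycleUnit:
    """Hypothetical unit that adds its two inputs, taking several host cycles."""

    def __init__(self, host_cycles):
        if host_cycles < 1:
            raise ValueError("a target cycle needs at least one host cycle")
        self.host_cycles = host_cycles
        self.remaining = 0
        self.pending = None

    def start(self, in1, in2):
        """Wrapper asserts Start: latch inputs for the next target cycle."""
        self.pending = (in1, in2)
        self.remaining = self.host_cycles

    def tick(self):
        """Advance one host cycle. Returns (done, out)."""
        if self.remaining == 0:
            return False, None  # idle: nothing started
        self.remaining -= 1
        if self.remaining == 0:
            in1, in2 = self.pending
            return True, in1 + in2  # unit asserts Done with outputs ready
        return False, None


def run_wrapper(unit, inputs):
    """Drive one Start/Done handshake per target cycle.

    Returns the outputs and the total number of host cycles consumed.
    """
    outputs = []
    host_cycles = 0
    for in1, in2 in inputs:  # one input pair per target cycle
        unit.start(in1, in2)
        done = False
        while not done:
            done, out = unit.tick()
            host_cycles += 1
        outputs.append(out)
    return outputs, host_cycles
```

The outputs depend only on the sequence of target cycles, while the host-cycle count changes with how long the unit takes; a unit built with `host_cycles=1` behaves like unvirtualized RTL in this model.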

12 Distributed Timing Models

13 Distributed Timing Example
- Pipeline target channel implemented as distributed FIFO with at least L buffers
[Figure: Target: Unit A and Unit B connected by a channel of latency L. Host: Unit A and Unit B, each with Start/Done signals, connected by a FIFO with RDY, ENQ, and DEQ signals.]

14 Timing Target FIFO Channel
- Can build timed credit-based flow control (CBFC) FIFO inside target model, using pipeline channels for communicating data forwards and credits backwards
- But this puts two CBFCs in series (one in target unit, one hidden in host implementation of pipeline channels)
- RDL can generate a unified FIFO that merges both of these behind the FIFO interface
[Figure labels: Target: Latency L, Credits, Credit control, RDY/ENQ, RDY/DEQ]

15 Other Automatically Generated Networks
- Control network has workstation as master and every unit as slave device
  - Memory-mapped interface with block transfers
  - Used for initialization, stats gathering, debugging, and monitoring
- Units can connect to DRAM resources outside of timed target channels
  - Used to support emulation and virtualization state
- Units can communicate with each other outside of timed target channels
  - Supports arbitrary communication, e.g., for distributed stats gathering

16 Wide Variety of RAMP Simulators

17 Simulator Design Choices
- Structural Analog versus Highly Virtualized
- Functional-only versus Functional+Timing
- Timing via (virtual) RTL design versus separate functional and timing models
- Hybrid software/hardware simulators
- We’re trying to build layers of abstractions that are useful to all types of simulator
- Also, trying to make modules in different styles interoperate

18 Effective Abstractions Hide Details

19 …But Provide Inter-Operability

20 Work in Progress: Stay Tuned