RAMP Gold : An FPGA-based Architecture Simulator for Multiprocessors Zhangxi Tan, Andrew Waterman, David Patterson, Krste Asanovic Parallel Computing Lab,

RAMP Gold: An FPGA-based Architecture Simulator for Multiprocessors
Zhangxi Tan, Andrew Waterman, David Patterson, Krste Asanovic
Parallel Computing Lab, EECS, UC Berkeley
March 2010

Outline
- Overview
- RAMP Gold HW Architecture and Implementation
- RAMP Gold Software Infrastructure
- Usage Case and Live Demo
- Future work

Overview
- Purpose of RAMP Gold
  - An FPGA-based simulator for shared-memory multicore targets, built for the Par Lab
  - Use cases: architecture, OS, and application studies
- Highlights of RAMP Gold
  - Runs on a $750 Xilinx XUP V5 board
  - Written in SystemVerilog; no special CAD tools required; works with standard FPGA CAD flows (Synplify/ISE/ModelSim)
  - Two orders of magnitude faster than Simics+GEMS
  - Runtime-configurable parameters without resynthesis
  - Full RTL verification environment and software infrastructure
  - BSD and GNU licenses

Simulation Jargon
- Target vs. host
  - Target: the system/architecture being simulated, e.g. a SPARC v8 CMP
  - Host: the platform on which the simulator runs, e.g. FPGAs
- Functional model and timing model
  - Functional: computes the result of each instruction
  - Timing: determines how long each instruction takes
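
The functional/timing split above can be illustrated with a small behavioral sketch in Python. This is not RAMP Gold code: the three-operation instruction set, the `Instr` type, and the 10-cycle memory latency are assumptions made up for the example. The point is only that one function produces results and a separate one produces cycle counts.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    op: str      # hypothetical mini-ISA: "add", "ld", or "st"
    a: int = 0   # first operand, or the address for ld/st
    b: int = 0   # second operand, or the value to store


def functional_step(instr, memory):
    """Functional model: compute the instruction's result, nothing else."""
    if instr.op == "add":
        return instr.a + instr.b
    if instr.op == "ld":
        return memory.get(instr.a, 0)
    if instr.op == "st":
        memory[instr.a] = instr.b
        return None
    raise ValueError(f"unknown op {instr.op!r}")


def timing_step(instr, mem_latency=10):
    """Timing model: decide how many target cycles the instruction costs."""
    return mem_latency if instr.op in ("ld", "st") else 1


def run(program, mem_latency=10):
    memory, results, cycles = {}, [], 0
    for instr in program:
        results.append(functional_step(instr, memory))
        cycles += timing_step(instr, mem_latency)
    return results, cycles


program = [Instr("st", 0x40, 7), Instr("ld", 0x40), Instr("add", 2, 3)]
results, cycles = run(program)
```

Changing `mem_latency` alters only `cycles`, never `results`, which is the property that lets timing parameters be varied without touching the functional side.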

RAMP Gold Overall Setup
- Both the functional and timing models run on the FPGA
- App server: controls the simulation and services system calls and I/O

Target Machine Template
- 64-core SPARC v8 shared-memory machine
- Configurable two-level cache + multichannel DRAM

RAMP Gold Performance vs. Simics
- PARSEC parallel benchmarks running on a research OS
- >250x faster than a full-system simulator for a 64-core multiprocessor target

RAMP Gold Model Key Concepts
- Decoupled functional/timing model, both in hardware
  - Enables many optimizations that suit the FPGA fabric
  - Increases modeling efficiency and module reuse
- Host multithreading of both functional and timing models
  - Hides emulation latencies and improves resource utilization
  - Time-multiplexing effects are patched by the timing model
[Figure: functional model pipeline with architectural state, coupled to a timing model pipeline with timing state]

Host Multithreading
- Example: simulating four independent CPUs
[Figure: the target model (CPU 0 to CPU 3) mapped onto one functional CPU model on the FPGA: a single pipeline (I$, IR, decode, ALU, D$) with per-thread PCs and GPRs, fed by a thread-select stage]
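
A rough behavioral sketch of the idea, in Python: one host pipeline is time-multiplexed across several target CPUs, each with its own PC and register state. The round-robin thread select, the single accumulator register, and the "program is a list of numbers to add" instruction set are all simplifying assumptions for illustration, not the RAMP Gold design.

```python
class HostMultithreadedCPU:
    """One host pipeline serving several target CPUs in round-robin order."""

    def __init__(self, programs):
        self.programs = programs
        self.pc = [0] * len(programs)    # per-target-CPU program counter
        self.acc = [0] * len(programs)   # per-target-CPU register state
        self.host_cycles = 0

    def step(self):
        # Thread select: pick the target CPU served in this host cycle.
        tid = self.host_cycles % len(self.programs)
        self.host_cycles += 1
        prog = self.programs[tid]
        if self.pc[tid] < len(prog):     # finished CPUs just idle their slot
            self.acc[tid] += prog[self.pc[tid]]
            self.pc[tid] += 1
        return tid

    def done(self):
        return all(pc >= len(p) for pc, p in zip(self.pc, self.programs))


# Four independent target CPUs, as in the slide's example.
programs = [[1, 2, 3], [10, 20], [100], [5, 5, 5, 5]]
cpu = HostMultithreadedCPU(programs)
schedule = []
while not cpu.done():
    schedule.append(cpu.step())
```

Each target CPU advances at most one instruction every four host cycles here, so the time-multiplexing is visible in host time only; accounting for target time is left to the timing model.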

Functional Model
- Full SPARC v8 support (FP, MMU, I/O)
- Passes the SPARC v8 certification tests
- Runs Linux and a research OS

Timing Model
- Simple CPU timing (every instruction takes 1 cycle except LD/ST) combined with a detailed memory timing model
- Cache models: only tags are stored in BRAMs
  - Runtime-configurable parameters: associativity, size, line size, number of banks, latency, etc.
  - Models the three Cs of cache misses (compulsory, capacity, conflict) but not the fourth (coherence); coherence support is coming soon
- DRAM model: bandwidth-delay pipe with optional QoS
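
To make the "tags only" point concrete, here is a minimal Python sketch of a cache timing model that keeps no data at all: it tracks tags per set and returns a latency for each access. The LRU replacement policy, the default sizes, and the latency values are assumptions chosen for the example; the slide does not state which policy or numbers RAMP Gold uses.

```python
from collections import OrderedDict


class TagOnlyCache:
    """Set-associative cache timing model: stores tags, never data."""

    def __init__(self, size_bytes=1024, line_bytes=32, assoc=2,
                 hit_latency=1, miss_latency=20):
        self.line_bytes = line_bytes
        self.assoc = assoc
        self.num_sets = size_bytes // (line_bytes * assoc)
        self.hit_latency = hit_latency
        self.miss_latency = miss_latency
        # One ordered dict per set; insertion order doubles as LRU order.
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, paddr):
        """Return the latency in target cycles for a load/store to paddr."""
        line = paddr // self.line_bytes
        index = line % self.num_sets
        tag = line // self.num_sets
        ways = self.sets[index]
        if tag in ways:
            ways.move_to_end(tag)        # mark as most recently used
            self.hits += 1
            return self.hit_latency
        if len(ways) >= self.assoc:
            ways.popitem(last=False)     # evict the least recently used tag
        ways[tag] = True
        self.misses += 1
        return self.miss_latency


# 128-byte, 2-way cache with 32-byte lines: 2 sets.
cache = TagOnlyCache(size_bytes=128, line_bytes=32, assoc=2)
latencies = [cache.access(a) for a in (0, 4, 64, 128, 0)]
```

Because the parameters are plain constructor arguments, a different associativity or line size only needs a new object, loosely mirroring the runtime-configurable parameters listed above.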

Debugging and Simulation Configuration
- Frontend app server
  - Reliable Gigabit Ethernet connection to the FPGA
  - Periodically polls the simulator to serve I/O requests
  - Transparent to the target (no side effects on simulated timing)
- 64-bit hardware performance counters collect runtime statistics
  - 657 counters in the timing model + 10 host counters
  - Can be read by either target apps or the app server
  - Ring interconnect for counters (easy to add and remove)

Host Performance
- Timing synchronization is the largest overhead
- Tiny host $/TLBs are not on the performance-critical path
- Host DRAM bandwidth is not a problem (<15% utilization)

Implementation
- Single FPGA: 90 MHz, 2 GB DDR2 SODIMM
- ~2 hours CAD turnaround time on a mid-range workstation
- BRAM-bound, but logic resources remain to fit more pipelines

Software Tools
- SPARC cross compiler with binutils/gcc/glibc
  - Supports most POSIX programs
  - Static and dynamic linking support
  - Built from GNU GCC (4.3.2)
- Full software and hardware debugging suite
  - Low-cost XUP boards sometimes do not work out of the box
  - FPGA CAD tools are very bad

Target Software
- Proxy Kernel: single-protection-domain application host
  - Runs programs statically linked against glibc
  - Forwards I/O system calls to the x86/Linux host PC
  - Presents a simple hard-threads API for multithreaded programs
  - Very easy to modify
- ROS: UCB's manycore research OS
  - Provides multiprogramming support
  - Sufficiently POSIX-compliant to run many programs
  - Much easier to modify than Linux
  - Runs on more than 64 cores

Infrastructure

Case Studies
- Parallel application studies for software programmers
- Parallel OS for system researchers
- Adding hardware performance counters for advanced debugging
- Micro-architecture studies: adding features and modifying existing timing models
- Adding new instructions: changing the functional model

Appserver 101
- Appserver command-line options:
    Usage: sparc_app [-f ] [-p ] [-s] [binary] [args]
- Platform memory tests
  - App server memory test:
      sparc_app -p64 hw memtest none
  - Proxy kernel memory test (stress test):
      sparc_app -p64 hw path/kernel.ramp path/memtest

For Application Programmers
- Main usage scenario: use the runtime-configurable timing model without any FPGA hardware change
- Use hard-threads to write a parallel hello world program that runs on the proxy kernel
- Compile the program using the cross toolchain:
    sparc-ros-gcc -o hello hello.cpp -lhart
- Measure performance using performance counters:
    sparc_app -s1 -p64 hw kernel.ramp hello
- Change the target machine configuration on the fly and rerun the experiment: edit the file appserver.conf

For OS Developers
- The usage model is similar to that for application programmers
- The proxy kernel is a good starting point for learning the bootstrapping process
- ROS is a fully functional kernel
- Demo: boot the ROS kernel using the appserver:
    sparc_app -p64 -fappserver_ros.conf hw your_kernel none

Adding Hardware Performance Counters
- Two types of counter interface
  - Global counter:
  - Local (per-core) counter:
- Modify the Verilog file to add more counters on the ring:

    perfctr_io #(.NLOCAL(num_of_local), .NGLOBAL(num_of_global))
      gen_tm_counter(
        .gclk,
        .rst,
        .bus_out(io_out),
        .bus_in(io_in),
        .bus_sel(),                        // IO bus interface
        .global_inc(global_counter_inc),
        .local_inc(local_counter_inc),
        .local_tid(local_counter_tid)
      );

- Modify the app server to support more counters: add your counter definition in TestAppServer/perfcnt.h
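
The difference between the two counter types can be sketched behaviorally in Python. The port names `global_inc`, `local_inc`, and `local_tid` come from the instantiation above; everything else (the `PerfCounters` class, treating each increment input as one bit per counter, the `nthreads` parameter) is an assumption about how such a block might behave, not a description of the actual `perfctr_io` module.

```python
MASK64 = (1 << 64) - 1   # counters are 64 bits wide and wrap around


class PerfCounters:
    """Behavioral model: NGLOBAL shared counters, NLOCAL counters per thread."""

    def __init__(self, nlocal, nglobal, nthreads=64):
        self.global_ctr = [0] * nglobal
        self.local_ctr = [[0] * nlocal for _ in range(nthreads)]

    def tick(self, global_inc, local_inc, local_tid):
        """One cycle: bump every counter whose increment bit is set.

        global_inc: one bit per global counter.
        local_inc:  one bit per local counter.
        local_tid:  the thread whose local counters are bumped this cycle.
        """
        for i, inc in enumerate(global_inc):
            if inc:
                self.global_ctr[i] = (self.global_ctr[i] + 1) & MASK64
        row = self.local_ctr[local_tid]
        for i, inc in enumerate(local_inc):
            if inc:
                row[i] = (row[i] + 1) & MASK64


ctrs = PerfCounters(nlocal=2, nglobal=1, nthreads=4)
ctrs.tick(global_inc=[1], local_inc=[1, 0], local_tid=2)
ctrs.tick(global_inc=[0], local_inc=[1, 1], local_tid=2)
ctrs.tick(global_inc=[1], local_inc=[0, 1], local_tid=3)
```

In this sketch a global counter sees events from every thread, while a local counter is indexed by the thread ID that accompanies the increment, which matters under host multithreading because one pipeline serves many target cores.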

Adding Features to Timing Models
- Timing models are much simpler than functional models: ~1000 LoC vs. 35,000 LoC
- Example 1: changing the cache replacement policy
- Example 2: adding memory QoS
  - Lee et al., "Globally-Synchronized Frames for Guaranteed Quality-of-Service in On-Chip Networks," ISCA 2008
  - ~100 lines of code added to the timing model
  - A new DRAM model
  - Several memory-mapped registers added on the functional I/O bus for configuration

Adding New Instructions
- Adding instructions to a feed-through pipeline is straightforward
  - FPU instructions were added as new instructions within a week
  - This included a new register file, decode, exception/commit, and microcode
- Example: adding new atomic instructions through microcode
  - 4 global scratchpad registers (not visible to the programmer) in the main integer register file for temporary storage
  - Two write ports, so that scratchpad registers can be updated along with architectural register changes

Steps of Adding Instructions
- Add proper decoding logic in function decode_dsp_add_logic of regacc_dma.sv
- Update the writeback/exception stage in file exception_dma.sv to trap to microcode
  - Edit function decode_microcode_mode to trap to microcode
  - Edit function rd_gen to write the address to scratch register 0 and load the data to scratch register 1
- Edit the microcode ROM in Microcode.sv:

    // SWAP
    * : begin
      uco.uend    = '0;
      uco.cwp_rs1 = '0;
      uco.cwp_rd  = '0;
      uco.inst    = {LDST, 5'b0, ST, REGADDR_SCRATCH_0 | UCI_MASK, 1'b1, 13'b0};
    end
    10: begin
      uco.uend    = '1;
      uco.cwp_rs1 = '0;
      uco.cwp_rd  = '0;
      uco.inst    = {FMT3, 5'b0, IADD, REGADDR_SCRATCH_1 | UCI_MASK, 1'b1, 13'b0};
    end

Future Work
- Cache coherence models (soon)
- Realistic interconnect model (soon)
- Better CPU core model (next major version)
- Support for other ISAs (next major version)

Further References
- Research papers
  - Use case: "A Case for FAME: FPGA Architecture Model Execution," ISCA 2010
  - RAMP Gold design: "RAMP Gold: An FPGA-based Architecture Simulator for Multiprocessors," DAC 2010
- Beta release

Backup Slides

Functional/Timing Model Interface

    // FM -> TM
    typedef struct {
      bit        valid;    // timing token between FM and TM
      bit [5:0]  tid;      // thread ID
      bit        run;      // CPU state
      bit        replay;   // this instruction needs to be replayed by FM
      bit        retired;  // retiring an instruction
      bit [31:0] inst;     // the instruction that was retired
      bit [31:0] paddr;    // load/store physical address
      bit [31:0] npc;      // PC of next fetched insn
    } tm_cpu_ctrl_token_type;

    // TM -> FM
    typedef struct {
      bit        valid;    // timing token between FM and TM
      bit [5:0]  tid;      // thread ID
      bit        run;      // run bit
    } tm2cpu_token_type;
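
For host-side tooling or test benches it can help to see the FM -> TM token as a flat bit vector. The Python sketch below packs and unpacks the fields listed above using their declared widths. Placing the first field in the most significant bits follows the convention for SystemVerilog packed structs, but the source struct is not declared `packed`, so the layout here is an assumption, as are the helper names `pack` and `unpack`.

```python
# (name, width in bits), in declaration order of tm_cpu_ctrl_token_type
FM2TM_FIELDS = [
    ("valid", 1), ("tid", 6), ("run", 1), ("replay", 1), ("retired", 1),
    ("inst", 32), ("paddr", 32), ("npc", 32),
]

TOKEN_BITS = sum(width for _, width in FM2TM_FIELDS)


def pack(fields, values):
    """Concatenate fields into one integer, first field in the top bits."""
    word = 0
    for name, width in fields:
        value = values[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word = (word << width) | value
    return word


def unpack(fields, word):
    """Inverse of pack: peel fields off starting from the low bits."""
    values = {}
    for name, width in reversed(fields):
        values[name] = word & ((1 << width) - 1)
        word >>= width
    return values


token = {
    "valid": 1, "tid": 5, "run": 1, "replay": 0, "retired": 1,
    "inst": 0x01000000,   # SPARC nop encoding
    "paddr": 0x40,
    "npc": 0x1004,
}
word = pack(FM2TM_FIELDS, token)
roundtrip = unpack(FM2TM_FIELDS, word)
```

The 6-bit `tid` field gives 64 distinct thread IDs, which matches the 64-core target described earlier.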