ProtoFlex: An Architectural Exploration Vehicle Using FPGA-Accelerated Full-System Multiprocessor Simulations

Presentation transcript:

ProtoFlex: An Architectural Exploration Vehicle Using FPGA-Accelerated Full-System Multiprocessor Simulations
Eric S. Chung, Michael K. Papamichael, James C. Hoe, Babak Falsafi, and Ken Mai
Computer Architecture Lab at Carnegie Mellon (PROTOFLEX)
Our work in this area has been supported in part by NSF, IBM, Intel, Sun, Xilinx, and Microsoft.

Field Programmable Gate Arrays
What are FPGAs?
– Sea of programmable logic, RAMs
– Capacity: 30 RISC cores, 0.5-3 MB
– Speed: MHz
[Figure: FPGA fabric of interconnect switches, lookup tables (LUTs) with inputs I0-I5 and output F, and BlockRAMs. A LUT can be used as a 6-input gate (truth table) or as a 64 x 1-bit SRAM.]
Typical applications
– ASIC replacement, logic emulation
– Communications, auto industry, consumer electronics, defense, etc.
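The dual use of a LUT mentioned on this slide (6-input gate or 64x1 RAM) can be illustrated with a small software model. This is only a sketch: the helper names `make_lut6` and `lut6_read` are hypothetical and do not come from any FPGA toolchain.

```python
def make_lut6(func):
    """Precompute the 64-entry truth table of a 6-input boolean function."""
    table = []
    for index in range(64):
        # Bit k of the table index is the value of input I_k.
        inputs = [(index >> k) & 1 for k in range(6)]
        table.append(func(*inputs) & 1)
    return table


def lut6_read(table, i0, i1, i2, i3, i4, i5):
    """Evaluate the LUT: the six inputs form an address into the table."""
    index = i0 | (i1 << 1) | (i2 << 2) | (i3 << 3) | (i4 << 4) | (i5 << 5)
    return table[index]


# Used as a gate: the table is fixed at configuration time (here, a 6-input AND).
and6 = make_lut6(lambda a, b, c, d, e, f: a & b & c & d & e & f)

# Used as a 64x1 RAM: same lookup structure, but the table is written at run time.
ram = [0] * 64
ram[37] = 1  # store a 1 at address 37 (binary 100101)
```

In both cases the read path is identical; only who fills in the 64 bits differs.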

The ProtoFlex Simulator
History
– Project started (circa 2007) to build a scalable, full-system multiprocessor simulator using Field-Programmable Gate Arrays
Key Features
– Functional simulation of 16-way UltraSPARC III server (~50-90 MIPS)
– Using hybrid simulation, runs real server apps + Solaris OS
– Employs multithreading to virtualize # CPUs per FPGA core
[Figure labels: Hybrid Simulation, Virtualization]
9-Jul-16, IISWC 2009 Tutorial / Eric S. Chung and Michael Papamichael

The ProtoFlex Simulator
User perceives UltraSPARC III functional simulator
– Software user interface
– Applications load directly from Simics checkpoints
– Features: state viewing, scripting, single-stepping, terminal, statistics

Availability: Now
What we’re giving out
– Source RTL, scripts for SPARCV9 CPU model
– XUPV5 Reference Design for EDK 10.1
– Virtutech Simics plug-ins for hybrid simulation
– Top-level SW controller, user command-line interface
– Documentation through online wiki
Why open source?
– Demonstration of FPGAs as viable architecture research vehicle
– Facilitate adoption of hybrid simulation & host multithreading
– Encourage building on top of our work

Tutorial Agenda
Part I: Core Concepts (45 min)
– FPGA virtualization techniques
– Simulator implementation
Part II: Hands-on Tutorial (2 hr)
– Download, familiarize with tool flows
– Access CMU’s remote FPGA infrastructure
– Compile, run code within FPGA-simulated 4-CPU UltraSPARC III server

Part I: Core Concepts

Full-system Functional Simulators
Convenient exploration of real (or future) HW
– Can boot OS, run commercial apps
But too slow for large-scale (>100-way) MP studies
– Existing functional simulators are single-threaded
– High instrumentation overhead

FPGA-based simulation
Advantages
– Fast and flexible
– Low HW instrumentation performance overhead

FPGA-based simulation
Caveat: simulation w/ FPGAs is more complex than SW
– Large-MP configurations (>100) → expensive to scale
– Full-system support → need device models + full ISA

Reducing Complexity
1. Hybrid Full-System Simulation
2. Multiprocessor Host Interleaving
[Figure: (1) a CPU with memory and devices, split into common-case and uncommon behaviors; (2) multiple target processors sharing memory, interleaved 4-way.]

Hybrid Full-System Simulation
3 ways to map target components
– FPGA-only
– Simulation-only
– Transplantable
CPUs can fall back to SW by “transplanting” between hosts
– Only common-case instructions/behaviors implemented in FPGA
– Complex, rare behaviors relegated to software
[Figure: processors and memory on the FPGA host; devices (CD, FIBRE, GFX, NIC, PCI, TERM, SCSI) on the PC host simulator; a processor transplants to the PC on an I/O instruction.]
Transplants reduce full-system complexity
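The transplant idea above can be sketched as a dispatch loop: common-case instructions execute on the FPGA host, and anything else moves the CPU to the software simulator for that instruction and back again. The names `FPGA_OPS`, `Cpu`, `step` and `run` are hypothetical, and the set of "common-case" operations is an assumption chosen for illustration, not the actual BlueSPARC partitioning.

```python
from dataclasses import dataclass

# Assumption: an illustrative set of operations implemented in the FPGA.
FPGA_OPS = {"add", "load", "store", "branch"}


@dataclass
class Cpu:
    pc: int = 0
    host: str = "fpga"  # which host currently holds the CPU state


def step(cpu, op, log):
    """Execute one instruction, transplanting to software when needed."""
    if op not in FPGA_OPS:
        cpu.host = "software"   # transplant out: state moves to the PC simulator
    log.append((cpu.host, op))  # record which host executed the instruction
    cpu.pc += 4                 # placeholder for the instruction's real effect
    cpu.host = "fpga"           # transplant back for the next instruction


def run(trace):
    cpu, log = Cpu(), []
    for op in trace:
        step(cpu, op, log)
    return cpu, log


cpu, log = run(["add", "fmul", "load", "io_write"])
```

Here `fmul` and `io_write` stand in for the rare behaviors that are relegated to software, so the FPGA never needs to implement them.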

Multiprocessor Host Interleaving
[Figure: # processors to model in functional simulator vs. # soft cores implemented in FPGA, comparing a 1-to-1 mapping between target and host CPUs with 4-to-1 host interleaving.]
Advantages:
– Trade away FPGA throughput for smaller implementation
– Decouple logical simulated size from FPGA host size
– Host processor in FPGA can be made very simple
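The difference between a 1-to-1 mapping and 4-to-1 host interleaving comes down to how target CPU ids are assigned to host cores. A minimal sketch, where the function name `interleave` and the contiguous assignment policy are assumptions made for illustration:

```python
def interleave(num_target_cpus, contexts_per_core):
    """Assign target CPU ids to host cores; returns one list per host core."""
    if contexts_per_core < 1:
        raise ValueError("contexts_per_core must be at least 1")
    # Ceiling division: number of host cores needed to hold every target CPU.
    num_host_cores = -(-num_target_cpus // contexts_per_core)
    hosts = [[] for _ in range(num_host_cores)]
    for cpu in range(num_target_cpus):
        hosts[cpu // contexts_per_core].append(cpu)
    return hosts


one_to_one = interleave(16, 1)   # 16 host cores, one target CPU each
four_to_one = interleave(16, 4)  # 4 host cores, four target CPUs each
```

The simulated system size (16 CPUs) is the same in both cases; only the number of host cores, and therefore the FPGA area, changes.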

FPGA Host Processor
Architecturally executes multiple contexts
– Existing multithreaded micro-architectures are good candidates
Our design: Instruction-Interleaved Pipeline
– Switch in new processor context on each cycle
– Simple, efficient design w/ no stalling or forwarding
– Long-latency tolerance (e.g., cache miss, transplants)
– Coherence is “free” between contexts mapped to same pipeline
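The per-cycle context switch can be pictured as a round-robin scheduler that issues from a different context every cycle and skips contexts waiting on a long-latency event. The sketch below is an illustrative model only: the function `schedule` and the `blocked_until` bookkeeping are assumptions, not the BlueSPARC context scheduler.

```python
def schedule(num_contexts, num_cycles, blocked_until=None):
    """Return the context issued on each cycle (None if no context is ready).

    blocked_until maps a context id to the first cycle on which it is ready
    again, e.g. after a cache miss or a transplant.
    """
    blocked_until = dict(blocked_until or {})
    issued = []
    next_ctx = 0
    for cycle in range(num_cycles):
        chosen = None
        # Scan round-robin, starting after the most recently issued context.
        for offset in range(num_contexts):
            ctx = (next_ctx + offset) % num_contexts
            if blocked_until.get(ctx, 0) <= cycle:
                chosen = ctx
                break
        issued.append(chosen)
        if chosen is not None:
            next_ctx = (chosen + 1) % num_contexts
    return issued


no_stalls = schedule(4, 8)            # contexts simply rotate
with_miss = schedule(4, 6, {1: 4})    # context 1 is unavailable until cycle 4
```

Because consecutive cycles issue from different contexts, the other contexts keep the pipeline busy while context 1 waits, which is the latency tolerance the slide refers to.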

Outline
– Introduction
– The ProtoFlex Simulator
– Applications
– Platform Release

Top-level Simulator View
[Figure: an FPGA containing the BlueSPARC core, a PowerPC (or MicroBlaze), a processor bus and an interface to memory, connected over Ethernet (or PCIe) to a Core 2 Duo PC that hosts the simulated I/O devices (using a full-system simulator).]
– Simics simulates devices, memory-mapped I/O and DMA (~ µsec)
– Nearby core simulates unimplemented SPARC instructions (~10-50 µsec)

The BlueSPARC Core
Mirrors the Virtutech Simics model
– 64-bit SPARCV9 ISA + US III extensions
– 8 register windows, 4 global register files
– 512-entry D-TLB, 128-entry I-TLB
Implementation
– 14-stage, multi-threaded pipeline, switches context on each cycle
– On Virtex-5: XST synthesis ~148MHz, placed & routed 100MHz
– Parameterized non-blocking caches
– FP + rare MMU instructions are SW-emulated by nearby MicroBlaze
[Figure: BlueSPARC pipeline with context scheduler, I-fetch address generate and tag check, I-TLB stages 1-2, US III decoder, 64-bit ALU stages 1-2, D-TLB stages 1-3, D-cache address generate and tag check, multi-cycle instruction unit and writeback; nonblocking I-cache, nonblocking D-cache and integer RF in BRAM; arbiter to DDR memory.]

Why UltraSPARC III?
Popular ISA used in architecture community
– Many simulation users will find this model familiar
High-quality reference model from Virtutech
– Simics has complete & validated reference model (CPU and system devices)
– Runs commercial out-of-box software and OS
– Simics system simulation supports ~300 CPUs in target model
Backwards compatibility
– US III workloads brought up in Simics are usable in ProtoFlex

Core Design Statistics
Runs at 100MHz on V5
– Synthesizes up to 148MHz using standard tools (ISE XST)
Logic usage
– 23.5 KLUTs (11.3% of LX330T)
BRAM usage
– 120 BRAMs for 16-context configuration (37% of LX330T)
Future optimizations
– Moving paging structures to SRAM or DRAM can reduce BRAM usage by a significant amount
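As a quick sanity check on the utilization figures above, the device totals they imply can be recovered from the slide's own numbers. This is plain arithmetic on the quoted values; the variable names are illustrative, and the implied totals should be checked against the Xilinx data sheet rather than taken from this sketch.

```python
# Figures quoted on the slide.
luts_used = 23_500       # 23.5 KLUTs
lut_fraction = 0.113     # 11.3% of the LX330T
brams_used = 120         # 16-context configuration
bram_fraction = 0.37     # 37% of the LX330T

# Device totals implied by those percentages (approximate, since the
# percentages are rounded on the slide).
implied_total_luts = luts_used / lut_fraction
implied_total_brams = brams_used / bram_fraction

print(f"implied LUTs on device:  ~{implied_total_luts:,.0f}")
print(f"implied BRAMs on device: ~{implied_total_brams:.0f}")
```

The two percentages point at a device with roughly 208K LUTs and about 324 BRAMs, so BRAM rather than logic is the scarcer resource for the 16-context core.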

Hybrid host partitioning choices
BlueSPARC
– Integer ALU
– Register Windows
– Traps/Interrupts
– I-/D-MMU
– Memory + Atomics
– 36% special SPARC insns
PowerPC405
– Multimedia
– Floating Point
– Rare TLB operations
– 64% special SPARC insns
Simics on PC
– PCI bus
– ISP2200 Fibre Channel
– I21152 PCI bridge
– IRQ bus
– Text Console
– SBBC PCI device
– Serengeti I/O PROM
– Cheerio-hme NIC
– SCSI bus
Implementation = 60% of basic ISA, 36% of special SPARC insns, 0% devices

BlueSPARC on the BEE2 Platform
Functionally models 16-CPU UltraSPARC III Server
– HW components are hosted on BEE2 FPGA platform
– Virtutech Simics used as full-system I/O simulator
Berkeley Emulation Engine 2
– 5 Virtex-II Pro 70s (~66 KLUTs)
– 2 embedded PowerPCs per FPGA
– Only use 2 out of 5 FPGAs (pipeline + DDR controllers)

HW vs. SW Simulation Performance
Simulation Infrastructure:
– Full-system functional simulator (e.g. Simics)
– Instrumentation (e.g. functional cache model)
[Figure: Simics (SW-based) vs. BlueSPARC (HW-based), each optionally instrumented with a functional CMP cache model and a functional branch predictor model; MIPS compared for SW and HW.]
– NO instrumentation: speedup 1.23x
– WITH instrumentation: speedup 37x

Using ProtoFlex
Add passive monitors
– Counters, histograms
– Roll your own
Trace-based simulation
– Collect dynamic traces
– Feed traces to functional-first timing model
Sampled Program Monitoring
– Use MicroBlaze (or PPC) to monitor core/memory state
– Unintrusive profiling w/o changes to target SW
[Figure: the BlueSPARC pipeline annotated with counters and histogram trackers, a timing model, and an FPGA hard/soft core (PowerPC or MicroBlaze).]

Applications of ProtoFlex
Examples
– Functional-first CMP cache coherency model for first-order timing models and functional warming [TRETS’09]
– Real-time stack trace profiling
– CMP interconnect model (in progress)
– Realistic CPU traffic generators (in progress)
… running real 16-CPU server workloads
– Oracle TPC-C, IBM DB2 TPC-C, TPC-H, SPEC2K
[Figure: Piranha CMP cache (first-order timing model) producing statistics + warmed coherency & tag states.]

What does the RTL look like?
We use Bluespec SystemVerilog (high-level, synthesizable HDL)
– 4-8 weeks learning curve for normal HDL users
– Requires BSV compiler (free for academics)
– Paper in MEMOCODE’09 describes BSV coding/validation of core
Sample code:

2-stage ALU

    rule split_ALU_pipeline (True);
      …
      p1 = piperegs[DECODE];
      piperegs[ALU1] <= doALUStage1(p1, alu_ifc);
      p2 = piperegs[ALU1];
      piperegs[ALU2] <= doALUStage2(p2, alu_ifc);
      …
    endrule

1-stage ALU

    rule merged_ALU_pipeline (True);
      …
      p1 = piperegs[DECODE];
      p_tmp1 = doALUStage1(p1, alu_ifc);
      p_tmp2 = doALUStage2(p_tmp1, alu_ifc);
      piperegs[ALU] <= p_tmp2;
      …
    endrule

Other Simulator Features
Changing Core Parameters
– Number of CPU contexts
– Cache sizes
– Merge/split pipeline stages
– Enable/disable modules for profiling & debugging
– Clock frequency (10 MHz – 100 MHz)
– Set optimal LUTRAM size (16 = V2P, 64 = V5)
– Choose LUTRAMs or BRAMs for any CPU state

Current Limitations
– Simulator is currently non-deterministic
– Simics checkpoint saving not included in release
– No hardware DP FPU
– No coherency between multiple BlueSPARCs

Platform Release: XUPV5
Why XUPv5?
– Inexpensive (~$750), easily accessible
– Standard tool flows (EDK, ISE)
– Reference design portable to other platforms: just drop in our ‘pcores’
Supporting other platforms
– Future ports to BEE3 & Xilinx Accelerated Computing Platform (ACP)
– Contributions are welcome!

Required Equipment
– BlueSPARC
– PFMON + Simics
– Bitstream Loader + RS232 Terminal

XUPv5 Overview
– Virtex-5 LX110T
– DDR2 Memory (up to 2GB)
– 1MB SRAM
– PCI Express x1
– Serial Port

XUPv5 Plugged into Host System

Reference Design Block Diagram
[Figure: XUPv5 LX110T with BlueSPARC, MicroBlaze, PCI Express, serial port and BRAMs on the PLB; a multi-port memory controller to DRAM and an SRAM controller to SRAM.]

BlueSPARC
– Wrapped in EDK IP core, connects to PLB & NPI
– 100MHz
– 4 CPU contexts
– 64KB I&D L1 caches

BRAM
81% utilization
– Core 51% (76 out of 148)
– Rest 30% (45 out of 148)
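The BRAM budget on the XUPv5 can be re-derived from the block counts quoted on the slide. This is a small arithmetic sketch with illustrative variable names; it uses only the 76, 45 and 148 figures from the source.

```python
total_brams = 148                 # BRAM total used on the slide for the LX110T
core_brams, rest_brams = 76, 45   # BlueSPARC core vs. rest of the design

core_pct = 100 * core_brams / total_brams
rest_pct = 100 * rest_brams / total_brams
used_pct = 100 * (core_brams + rest_brams) / total_brams
free_brams = total_brams - core_brams - rest_brams

print(f"core {core_pct:.1f}%, rest {rest_pct:.1f}%, total {used_pct:.1f}%")
print(f"BRAMs left over: {free_brams}")
```

The slide's 81% matches the sum of the two truncated percentages (51 + 30); computed directly from 121 of 148 blocks the figure is closer to 82%, which leaves 27 BRAMs unused.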

PCI Express
– 200 MB/sec bandwidth
– ~10 µsec RTT latency
– Socket abstraction

DDR2 Memory Controller (Multi-Port Memory Controller)
– 1.5GB/s peak BW
– 115ns latency
– Multiple ports/interfaces

Linux PC
Software requirements
– OpenSuSE Linux 11.0
– Bluespec SystemVerilog Compiler v C
– Xilinx ISE 10.1, Xilinx EDK 10.1
– Virtutech Simics
– Synopsys VCS Y (optional for Verilog simulations)
We provide
– Linux PCI Express DMA driver
– Simics hybrid simulation plug-in modules
– ProtoFlex MONitor tool (PFMON)

Linux PC
Runs PFMON (ProtoFlex MONitor)
– Orchestrates communication between Simics & BlueSPARC
– Provides a command-line interface to the simulator (like the Simics console)
Runs Simics
– Handles I/O, FPGA core and memory initialization
[Figure: the Linux PC runs PFMON and Simics; PFMON talks to BlueSPARC.]

From RTL to Running System
1. Bluespec → Verilog: Bluespec compiler, ~30 minutes
2. Verilog → Bitstream: Xilinx EDK, ~3 hours
3. Bitstream → Working System: stream mem. image over PCIe, ~1 minute (for 1GB image)

XUPv5 Caveats
Differences from BEE2 platform
– Uses Virtex-5 LX110T → no PowerPC core, fewer BRAMs
– Uses soft MicroBlaze core instead → 5X slower than PPC
– Lower memory capacity (1 GB max)
Our implementation on XUPv5
– Uses PCI Express instead of Ethernet → fast mem loading
– Limited to simulating 4-CPU system
– Not enough resources for “extras” (e.g., cache simulator)

BlueSPARC Demo on BEE2
Demo application
– On-Line Transaction Processing benchmark (TPC-C) in Oracle
– Runs in Solaris 8 (unmodified binary)
– FPGA + memory directly loaded from Simics checkpoint
[Figure: BEE2 platform]
RAMP 2008 Tutorial / Eric S. Chung and Michael Papamichael

BEE2 Demo Configuration
– 4 GB memory (4 DDR2 controllers)
– Ethernet (to Simics on PC)
– Virtex-II Pro 70 (BlueSPARC & PowerPC)
– RS232 (debugging)
– FACS (FPGA Accelerated Cache Simulator), fed with memory references

Demo: Web-based Real-time Viewing of Statistics

Part II: Hands-on Tutorial

Objectives
– Download open source code
– Familiarize with standard tool flows
– Implement an FPGA design from scratch
– Prepare Simics workload and checkpoint
– Load Simics checkpoint into FPGA
– Extract runtime statistics

CMU Remote Setup

Groups
We have a total of FOUR remote workstations
– Please divide evenly into 4 groups
– We will assign a group number to you (1-4)
Each group will need two laptops:
1) Remote Desktop Connection (Windows XP)
2) Web browser for displaying online instructions

Instructions
Navigate to (URL missing)
1) Click Here
2) Register new account here

Thank you!
Acknowledgments
– Funding for this work provided by: NSF CCF , NSF CNS , C2S2 Marco Center, SUN, and Microsoft
– Paul Hartke & Xilinx for FPGA donations & support
Topics & References
– Hybrid Simulation and FPGA Core Multithreading: A Complexity-Effective Architecture for Accelerating Full-System Multiprocessor Simulations Using FPGAs [FPGA’08]
– Functional-First, First-Order Timing Model for a CMP cache: ProtoFlex: Towards Scalable, Full-System Multiprocessor Simulations Using FPGAs [TRETS’09]
– FPGA Core Design & Validation Using Flight Data Recorder: Implementing a High-performance Multithreaded Microprocessor: A Case Study in High-level Design and Validation [MEMOCODE’09]

Thanks! Any questions?
Acknowledgements
We would like to thank our colleagues in the RAMP and TRUSS projects.