Computer Architecture Lab at ProtoFlex: An Architectural Exploration Vehicle Using FPGA-Accelerated Full-System Multiprocessor Simulations Eric S. Chung, Michael K. Papamichael, James C. Hoe, Babak Falsafi, and Ken Mai P ROTO F LEX Our work in this area has been supported in part by NSF, IBM, Intel, Sun, Xilinx, and Microsoft.
Field Programmable Gate Arrays What are FPGAs? –Sea of programmable logic, RAMs –Capacity: 30 RISC cores, 0.5-3MB –Speed: MHz 7/9/20162 Interconnect switch Lookup Tables (LUTs) FI 0 I 1 I 2 I 3 I 4 I … LUTs can be used as a 6- input gate or a 64x1 RAM Truth Table BlockRAMs 64 x 1-bit SRAM I0I0 I1I1 I2I2 I3I3 I4I4 I5I5 F Typical applications –ASIC replacement, logic emulation –Communications, auto industry, consumer electronics, defense, etc.
The ProtoFlex Simulator History –Project started (circa 2007) to build a scalable, full-system multiprocessor simulator using Field-Programmable Gate Arrays Key Features –Functional simulation of 16-way UltraSPARC III server (~50-90 MIPS) –Using hybrid simulation, runs real server apps + Solaris OS –Employs multithreading to virtualize # CPUs per FPGA core 3 Hybrid Simulation Virtualization 9-Jul-16IIWSC 2009 Tutorial / Eric S. Chung and Michael Papamichael
The ProtoFlex Simulator User perceives UltraSPARC III functional simulator –Software user interface –Applications load directly from Simics checkpoints –Features: state viewing, scripting, single-stepping, terminal, statistics 4 9-Jul-16IIWSC 2009 Tutorial / Eric S. Chung and Michael Papamichael
Availability: Now What we’re giving out –Source RTL, scripts for SPARCV9 CPU model –XUPV5 Reference Design for EDK 10.1 –Virtutech Simics plug-ins for hybrid simulation –Top-level SW controller, user command-line interface –Documentation through online wiki Why open source? –Demonstration of FPGAs as viable architecture research vehicle –Facilitate adoption of hybrid simulation & host multithreading –Encourage building on top of our work 5
9-Jul-16IIWSC 2009 Tutorial / Eric S. Chung and Michael Papamichael6 Tutorial Agenda Part I: Core Concepts (45min) –FPGA Virtualization techniques –Simulator implementation Part II: Hands-on Tutorial (2 hr) –Download, familiarize with tool flows –Access CMU’s remote FPGA infrastructure –Compile, run code within FPGA-simulated 4-CPU UltraSPARC III server
9-Jul-16IIWSC 2009 Tutorial / Eric S. Chung and Michael Papamichael7 Part I: Core Concepts
9-Jul-16IIWSC 2009 Tutorial / Eric S. Chung and Michael Papamichael 8 Full-system Functional Simulators Convenient exploration of real (or future) HW –Can boot OS, run commercial apps But too slow for large-scale (>100-way) MP studies –Existing functional simulators single-threaded –High instrumentation overhead
9-Jul-16IIWSC 2009 Tutorial / Eric S. Chung and Michael Papamichael 9 FPGA-based simulation Advantages –Fast and flexible –Low HW instrumentation performance overhead
9-Jul-16IIWSC 2009 Tutorial / Eric S. Chung and Michael Papamichael 10 FPGA-based simulation Caveat: simulation w/ FPGAs more complex than SW –Large-MP configurations (>100) expensive to scale –Full-system support need device models + full ISA
9-Jul-16IIWSC 2009 Tutorial / Eric S. Chung and Michael Papamichael 11 Reducing Complexity Hybrid Full-System Simulation Multiprocessor Host Interleaving 21 CPU P Memory Devices P Common-case behaviors Uncommon behaviors Memory 4-way P P P P P P P P P
9-Jul-16IIWSC 2009 Tutorial / Eric S. Chung and Michael Papamichael 12 3 P Hybrid Full-System Simulation 3 ways to map target components FPGA-only Simulation-only Transplantable CPUs can fallback to SW by “transplanting” between hosts –Only common-case instructions/behaviors implemented in FPGA –Complex, rare behaviors relegated to software PP Memory CD FIBRE GFX NIC PCI TERM SCSI PC Host Simulator FPGA host 1 2 I/O instr P P transplant Transplants reduce full-system complexity
9-Jul-16IIWSC 2009 Tutorial / Eric S. Chung and Michael Papamichael 13 Multiprocessor Host Interleaving # processors to model in functional simulator 1-to-1 mapping between target and host CPUs # soft cores implemented in FPGA 1-to-1 P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P P 4-to-1 host interleaving P P P P P P P P P P P P P P P P P P P P Advantages: Trade away FPGA throughput for smaller implementation Decouple logical simulated size from FPGA host size Host processor in FPGA can be made very simple
9-Jul-16IIWSC 2009 Tutorial / Eric S. Chung and Michael Papamichael 14 FPGA Host Processor Architecturally executes multiple contexts –Existing multithreaded micro-architectures are good candidates Our design: Instruction-Interleaved Pipeline –Switch in new processor context on each cycle –Simple, efficient design w/ no stalling or forwarding –Long-latency tolerance (e.g., cache miss, transplants) –Coherence is “free” between contexts mapped to same pipeline
9-Jul-16IIWSC 2009 Tutorial / Eric S. Chung and Michael Papamichael 15 Outline Introduction The ProtoFlex Simulator Applications Platform Release
9-Jul-16IIWSC 2009 Tutorial / Eric S. Chung and Michael Papamichael 16 Top-level Simulator View FPGA Core 2 Duo BlueSPARC Core PowerPC (or Microblaze) Ethernet (or PCIe) Interface to Memory Processor Bus Simulated I/O Devices (using full-system simulator) Simics simulates devices, memory- mapped I/O and DMA (~ µsec) Nearby core simulates unimplemented SPARC instructions (~10-50 µsec)
The BlueSPARC Core Mirrors the Virtutech Simics model –64-bit SPARCV9 ISA + US III extensions –8 register windows, 4 global register files –512-entry D-TLB, 128 I-TLB Implementation –14-stage, multi-threaded pipeline, switch context on each cycle –On Virtex-5, XST~148MHz, Placed & 100MHz –Parameterized non-blocking caches –FP + rare MMU instructions are SW-emulated by nearby uBlaze 17 I-TLB Stage 1 I-TLB Stage 2 64-bit ALU Stage 1 Context Scheduler I-Fetch Address Generate I-Fetch Tag Check US III Decoder 64-bit ALU Stage 2 D-TLB Stage 1 D-TLB Stage 2 D-TLB Stage 3 D-Cache Address Generate D-Cache Tag Check Multi-Cycle Instruction Unit Nonblocking I-cache (BRAM) Writeback Arbiter to DDR Memory Integer RF (BRAM) Nonblocking D-cache (BRAM)
Why UltraSPARC III? Popular ISA used in architecture community – –Many simulation users will find this model familiar High-quality reference model from Virtutech –Simics has complete & validated reference model (CPU and system devices) –Runs commercial out-of-box software and OS –Simics system simulation supports ~ 300 CPUs in target model Backwards compatibility –US III workloads brought up in Simics usable in ProtoFlex 18
Core Design Statistics Runs 100MHz on V5 –Synthesizes up to 148MHz using standard tools (ISE XST) Logic usage –23.5 KLUTs (11.3% LX330T) BRAM usage –120 BRAMs for 16-context configuration (37% LX330T) Future optimizations –Paging structures to SRAM or DRAM can reduce BRAM by significant amount 19
9-Jul-16IIWSC 2009 Tutorial / Eric S. Chung and Michael Papamichael 20 Hybrid host partitioning choices BlueSPARCPowerPC405Simics on PC Integer ALU Register Windows Traps/Interrupts I-/D-MMU Memory + Atomics 36% special SPARC insns Multimedia Floating Point Rare TLB operations 64% special SPARC insns PCI bus ISP2200 Fibre Channel I21152 PCI bridge IRQ bus Text Console SBBC PCI device Serengeti I/O PROM Cheerio-hme NIC SCSI bus Implementation = 60% of basic ISA, 36% of special SPARC insns, 0% devices
9-Jul-16IIWSC 2009 Tutorial / Eric S. Chung and Michael Papamichael 21 BlueSPARC on the BEE2 Platform Functionally models 16-CPU UltraSPARC III Server –HW components are hosted on BEE2 FPGA platform Virtutech Simics used as full-system I/O simulator 5 Virtex-II Pro 70s (~66 KLUTs) 2 embedded PowerPCs per FPGA Only use 2 out of 5 FPGAs (pipeline + DDR controllers) Berkeley Emulation Engine 2
Functional CMP Cache Model Functional Branch Predictor Model HW vs. SW Simulation Performance Simulation Infrastructure: –Full-system functional simulator (e.g. Simics) –Instrumentation (e.g. functional cache model) Functional Branch Predictor Model Functional CMP Cache Model ? BlueSPARC Simics SW-based HW-based SWHW NO Instrumentation Speedup: 1.23x MIPS
Functional CMP Cache Model Functional Branch Predictor Model HW vs. SW Simulation Performance Simulation Infrastructure: –Full-system functional simulator (e.g. Simics) –Instrumentation (e.g. functional cache model) Functional Branch Predictor Model Functional CMP Cache Model ? BlueSPARC Simics SW-based HW-based SWHW WITH Instrumentation Speedup: 37x MIPS
9-Jul-16IIWSC 2009 Tutorial / Eric S. Chung and Michael Papamichael 24 Outline Introduction The ProtoFlex Simulator Applications Platform Release
Using ProtoFlex Add passive monitors –Counters, histograms –Roll your own 25 I-TLB Stage 1 I-TLB Stage 2 64-bit ALU Stage 1 Context Scheduler I-Fetch Address Generate I-Fetch Tag Check US III Decoder 64-bit ALU Stage 2 D-TLB Stage 1 D-TLB Stage 2 D-TLB Stage 3 D-Cache Address Generate D-Cache Tag Check Multi-Cycle Instruction Unit Nonblocking I-cache (BRAM) Writeback Arbiter to DDR Memory Integer RF (BRAM) Nonblocking D-cache (BRAM) Trace-based simulation –Collect dynamic traces –Feed traces to functional-first timing model Sampled Program Monitoring –Use micro-blaze (or PPC) to monitor core/memory state –Unintrusive profiling w/o changes to target SW Counters Histogram Tracker Histogram Trackers Timing Model FPGA Hard/Soft Core (PowerPC or MicroBlaze)
Applications of ProtoFlex Examples –Functional-first CMP cache coherency model for first-order timing models and functional warming [TRETS’09] –Real-time stack trace profiling –CMP interconnect model (in progress) –Realistic CPU traffic generators (in progress) … running real 16-CPU server workloads –Oracle TPC-C, IBM DB/2 TPC-C, TPC-H, SPEC2K 26 Piranha CMP Cache (First-Order Timing Model) Statistics + Warmed Coherency & Tag States
What does the RTL look like? We use Bluespec System Verilog (high-level, synthesizable HDL) –4-8 weeks learning curve for normal HDL users –Requires BSV compiler (free for academics) –Paper in MEMOCODE’09 describes BSV coding/validation of core Sample code: 27 rule split_ALU_pipeline (True); … p1 = piperegs[DECODE]; piperegs[ALU1] <= doALUStage1(p1, alu_ifc); p2 = piperegs[ALU1]; piperegs[ALU2] <= doALUStage2(p2, alu_ifc); … endrule rule merged_ALU_pipeline (True); … p1 = piperegs[DECODE]; p_tmp1 = doALUStage1(p1, alu_ifc); p_tmp2 = doALUStage2(p_tmp1, alu_ifc); piperegs[ALU] <= p_tmp2; … endrule 2-stage ALU 1-stage ALU
Other Simulator Features Changing Core Parameters –Number of CPU contexts –Cache sizes –Merge/split pipeline stages –Enable/disable modules for profiling & debugging –Clock frequency 10 MHz – 100 MHz) –Set optimal LUTRAM size (16 = V2P, 64 = V5) –Choose LUTRAMs or BRAMs for any CPU state 28
Current Limitations Simulator is currently non-deterministic Simics checkpoint saving not included in release No hardware DP FPU No coherency between multiple BlueSPARCs 9-Jul-16IIWSC 2009 Tutorial / Eric S. Chung and Michael Papamichael 29
9-Jul-16IIWSC 2009 Tutorial / Eric S. Chung and Michael Papamichael 30 Outline Introduction The ProtoFlex Simulator Applications Platform Release
Platform Release: XUPV5 Why XUPv5? –Inexpensive (~$750), easily accessible –Standard tool flows (EDK, ISE) –Reference design portable to other platforms – just drop in our ‘pcores’ Supporting other platforms –Future ports to BEE3 & Xilinx Accelerated Computing Platform (ACP) –Contributions are welcomed! 31
Required Equipment 32 BlueSPARC PFMON + Simics Bitstream Loader + RS232 Terminal
33 XUPv5 Overview Virtex-5 LX110T DDR2 Memory –up to 2GB 1ΜΒ SRAM PCI Express x1 Serial Port
XUPv5 Plugged into Host System 9-Jul-16IIWSC 2009 Tutorial / Eric S. Chung and Michael Papamichael 34
SRAM Controller Multi-Port Memory Controller Reference Design Block Diagram 35 PLB BlueSPARCMicroBlaze PCI Express BRAM Serial Port BRAM XUPv5 LX110T DRAM SRAM
Controller Multi-Port Memory Controller BlueSPARC 36 PLB MicroBlaze BRAM Serial Port BRAM XUPv5 LX110T DRAM SRAM PCI Express Wrapped in EDK IP core –connects to PLB & NPI 100MHz 4 CPU contexts 64KB I&D L1 caches BlueSPARC
SRAM Controller Multi-Port Memory Controller BlueSPARC 37 PLB BlueSPARCMicroBlaze PCI Express Serial Port XUPv5 LX110T DRAM SRAM PCI Express 81% utilization –Core 51% (76 out of 148) –Rest 30% (45 out of 148) BRAM
SRAM Controller Multi-Port Memory Controller PCI Express 38 PLB BlueSPARCMicroBlaze BRAM Serial Port BRAM XUPv5 LX110T DRAM SRAM 200 MB/sec bandwidth ~10 usec RTT latency Socket Abstraction PCI Express
SRAM Controller DDR2 Memory Controller 39 PLB BlueSPARCMicroBlaze Ethernet BRAM Serial Port BRAM XUPv5 LX110T DRAM SRAM PCI Express 1.5GB/s peak BW 115ns latency Multiple ports/interfaces Multi-Port Memory Controller
40 Linux PC Software requirements –OpenSuSE Linux 11.0 –Bluespec SystemVerilog Compiler v C –Xilinx ISE 10.1, Xilinx EDK 10.1 –Virtutech Simics –Synopsys VCS Y (optional for Verilog simulations) We provide –Linux PCI Express DMA driver –Simics hybrid simulation plug-in modules –ProtoFlex MONitor tool (PFMON)
41 Linux PC Runs PFMON (ProtoFlex MONitor) –Orchestrates communication between Simics & BlueSPARC –Provides CLI interface to simulator (like Simics Console) Runs Simics –Handles I/O, FPGA Core and memory initialization BlueSPARC PFMON Linux PC Simics
42 From RTL to Running System Bluespec Verilog – Bluespec compiler – ~30 minutes Verilog Bitstream – Xilinx EDK – ~ 3 hours Bitstream Working System – Stream mem. image over PCIe – ~ 1 minute (for 1GB image) Bluespec code Verilog code Bitstream Working System 3
XUPv5 Caveats Differences from BEE2 platform –Uses Virtex-5 LX110T no PowerPC core, fewer BRAMs –Uses soft microblaze core instead 5X slower than PPC –Lower memory capacity (1 GB max) Our implementation on XUPv5 –Uses PCI express instead of ethernet fast mem loading –Limited to simulating 4-CPU system –Not enough resources for “extras” (e.g., cache simulator) 9-Jul-16IIWSC 2009 Tutorial / Eric S. Chung and Michael Papamichael 43
9-Jul-16RAMP 2008 Tutorial / Eric S. Chung and Michael Papamichael 44 BlueSPARC Demo on BEE2 44 Demo application –On-Line Transaction Processing benchmark (TPC-C) in Oracle –Runs in Solaris 8 (unmodified binary) –FPGA + Memory directly loaded from Simics checkpoint BEE2 Platform
45 BEE2 Demo Configuration 4 GB memory (4 DDR2 Controllers) Ethernet (to Simics on PC) Virtex-II Pro 70 (BlueSPARC & PowerPC) RS232 (Debugging) FACS (FPGA Accelerated Cache Simulator) Memory references
Demo Web-based Real-time Viewing of Statistics 46
9-Jul-16IIWSC 2009 Tutorial / Eric S. Chung and Michael Papamichael 47 Part II: Hands-on Tutorial
Objectives Download open source code Familiarize with standard tool flows Implement an FPGA design from scratch Prepare Simics workload and checkpoint Load Simics checkpoint into FPGA Extract runtime statistics 9-Jul-16IIWSC 2009 Tutorial / Eric S. Chung and Michael Papamichael 48
CMU Remote Setup 7/9/
9-Jul-16IIWSC 2009 Tutorial / Eric S. Chung and Michael Papamichael 50 Groups We have a total of FOUR remote workstations –Please divide evenly into 4 groups –We will assign a group number to you (1-4) Each group will need two laptops: 1) Remote Desktop Connection (Windows XP) 2) Web browser for displaying online instructions
Instructions Navigate to 7/9/2016IIWSC 2009 Tutorial / Eric S. Chung and Michael Papamichael 51 1) Click Here 2) Register new account here
Thank you! Acknowledgments –Funding for this work provided by: NSF CCF , NSF CNS , C2S2 Marco Center, SUN, and Microsoft –Paul Hartke & Xilinx for FPGA donations & support Topics & References –Hybrid Simulation and FPGA Core Multithreading A Complexity-Effective Architecture for Accelerating Full-System Multiprocessor Simulations Using FPGAs [FPGA’08] –Functional-First, First-Order Timing Model for a CMP cache ProtoFlex: Towards Scalable, Full-System Multiprocessor Simulations Using FPGAs [TRETS’09] –FPGA Core Design & Validation Using Flight Data Recorder Implementing a High-performance Multithreaded Microprocessor: A Case Study in High-level Design and Validation [MEMOCODE’09] 52
9-Jul-16IIWSC 2009 Tutorial / Eric S. Chung and Michael Papamichael 53 Thanks! Any questions? Acknowledgements We would like to thank our colleagues in the RAMP and TRUSS projects.