ProtoFlex: Status Update and Design Experiences
Eric S. Chung, Michael Papamichael, Eriko Nurvitadhi, James C. Hoe, Babak Falsafi, Ken Mai
Computer Architecture Lab
Our work in this area has been supported in part by NSF, IBM, Intel, and Xilinx.
Full-system Functional Simulation (Review)
Effective substitute for real (or non-existent) HW
– Can boot OS, run commercial apps
– Important in SW research & computer architecture
But too slow for large-scale MP studies
– Multicore won't help existing tools
– A serious challenge for large-MP (1000-way) simulation
Alternative: FPGA-based Simulation
Only 10x slower in clock frequency than custom HW
But FPGAs are harder to use than software
– Simulating large MPs (100- to 1000-way) can't be done trivially
– Simulating full-system support needs devices + the entire ISA
The "build-all" strategy in FPGAs = significant effort + resources
[Figure: full-system components to model in FPGAs: CPU, memory, MMU, PCI bus, Ethernet/graphics/disk/SCSI controllers, DMA, IRQ, terminal]
Reducing Complexity w/ Virtualization
1. Hybrid Full-System Simulation
– Making multiple physical resources appear as a single logical resource
– Only frequent target full-system behaviors are hosted in the FPGA; infrequent behaviors are relegated to SW
2. Virtualized MP Simulation
– Making a single physical resource appear as multiple logical resources
– Logical CPUs are multiplexed onto fewer physical FPGA-hosted CPUs
Outline
– Hybrid Full-System Simulation
– Virtualized Multiprocessor Simulation
– BlueSPARC Implementation
– Design Experiences
– Future Work
Hybrid Full-System Simulation
3 ways to map a target component onto the hybrid simulation host:
1. FPGA-only
2. Simulation-only (software)
3. Transplantable
CPUs can fall back to SW by "transplanting" between hosts
– Only common-case instructions/behaviors implemented in FPGA
– Remaining behaviors relegated to SW (many of the complex ones, it turns out)
Transplants reduce full-system design effort
[Figure: software-only simulation vs. hybrid simulation: CPU state transplants between the FPGA host and a software full-system simulator hosting memory, MMU, and I/O devices (PCI, Fibre Channel, graphics, NIC, terminal, SCSI)]
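The transplant mechanism above can be sketched as a dispatch loop. This is a minimal software illustration, not the actual BlueSPARC implementation: the `FpgaEngine` and `SoftwareSimulator` classes and the `COMMON_CASE` set are hypothetical stand-ins.

```python
# Hypothetical sketch of hybrid simulation with CPU transplants:
# common-case instructions execute in the (modeled) FPGA engine;
# anything else transplants the CPU context to a software simulator.

COMMON_CASE = {"add", "ld", "st", "branch"}  # assumed frequent ops

class FpgaEngine:
    """Stand-in for the FPGA-hosted pipeline."""
    def execute(self, ctx, instr):
        ctx["pc"] += 1  # simplified architectural update

class SoftwareSimulator:
    """Stand-in for the software full-system simulator."""
    def execute(self, ctx, instr):
        ctx["pc"] += 1
        ctx["transplants"] = ctx.get("transplants", 0) + 1

def run(trace, fpga, sw):
    ctx = {"pc": 0}
    for instr in trace:
        if instr in COMMON_CASE:
            fpga.execute(ctx, instr)   # fast path stays on the FPGA
        else:
            sw.execute(ctx, instr)     # rare behavior: transplant to SW
    return ctx

ctx = run(["add", "ld", "tlb_demap", "st", "fp_div"],
          FpgaEngine(), SoftwareSimulator())
print(ctx["pc"], ctx["transplants"])  # 5 instructions, 2 transplants
```

The point of the sketch is the asymmetry: only the dispatch decision and the common-case path need to be fast; the rare path can be arbitrarily slow and complex because it runs in software.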
Outline
– Hybrid Full-System Simulation
– Virtualized Multiprocessor Simulation
– BlueSPARC Implementation
– Design Experiences
– Future Work
Virtualized Multiprocessor Simulation
Problem: large-scale simulation configurations are challenging to implement in FPGAs using structurally-accurate approaches
Structural accuracy: 1-to-1 mapping between target and host CPUs (# processors in target model = # host processors implemented in FPGA)
– Pros: fastest possible solution, only 10x slower than real HW
– Cons: difficult to build for large-scale configs (e.g., >100-way)
Virtualized Multiprocessor Simulation
Host interleaving: multiplex target processors onto a smaller # of FPGA-hosted "engines" (e.g., 4-to-1, ~40x slower than real HW)
Advantages:
– Decouple logical target system size from FPGA host size
– Scale FPGA host as needed to deliver required performance
– High target-to-host (TH) ratio simplifies/consolidates HW (e.g., fewer nodes in cache coherence, interconnect)
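The slowdown figures on this and the previous slide follow a simple first-order model, made explicit below. The 10x structurally-accurate baseline and the linear scaling with TH ratio are taken from the slides; real slowdown also depends on how fully the interleaved engine is utilized.

```python
# First-order slowdown model for host interleaving, per the slides:
# a structurally-accurate (1-to-1) FPGA host runs ~10x slower than
# real hardware, and multiplexing TH target CPUs onto one engine
# scales that linearly (assuming the engine stays fully utilized).

BASE_SLOWDOWN = 10  # 1-to-1 FPGA host vs. real HW (from the slides)

def interleaved_slowdown(target_cpus, host_engines):
    th_ratio = target_cpus / host_engines
    return BASE_SLOWDOWN * th_ratio

print(interleaved_slowdown(4, 4))    # 1-to-1 mapping -> 10.0
print(interleaved_slowdown(16, 4))   # 4-to-1 mapping -> 40.0
```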
What's inside an FPGA host processor?
An "engine" that architecturally executes multiple contexts
– Existing multithreaded designs are good candidates
– Choice is influenced by the TH (target-to-host) ratio
We propose an interleaved pipeline (e.g., TERA-style)
– Best suited for a high TH ratio
– Switches in a new CPU context on each cycle
– Simple, efficient design w/ no stalling or forwarding
– Tolerates long latencies (e.g., cache misses, transplants)
– Coherence is "free" between CPUs mapped onto the same engine
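The per-cycle context switch can be sketched as round-robin scheduling over thread contexts. This is an illustrative software model of the issue order only, not the 14-stage hardware pipeline.

```python
# Illustrative model of a TERA-style interleaved engine: each cycle,
# the engine issues one instruction from the next context in
# round-robin order, so a given context never has two consecutive
# instructions in flight (hence no stalling or forwarding logic).

from collections import deque

def interleave(contexts, cycles):
    """contexts: list of context IDs; returns the per-cycle issue order."""
    ready = deque(contexts)
    issued = []
    for _ in range(cycles):
        ctx = ready.popleft()   # pick the next ready context
        issued.append(ctx)      # issue one instruction for it
        ready.append(ctx)       # rotate it to the back of the queue
    return issued

# 4 target CPUs multiplexed onto one engine (4-to-1 TH ratio):
print(interleave([0, 1, 2, 3], 8))  # [0, 1, 2, 3, 0, 1, 2, 3]
```

With enough contexts, the gap between two issues of the same context exceeds the pipeline depth, which is why hazards between a context's own instructions disappear by construction.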
Outline
– Hybrid Full-System Simulation
– Virtualized Multiprocessor Simulation
– BlueSPARC Implementation
– Design Experiences
– Future Work
Implementation: BlueSPARC Simulator
Target: 16-CPU shared-memory UltraSPARC III server (SunFire 3800)
Host: BEE2 platform
BlueSPARC Simulator (continued)
Processing nodes:    16 64-bit UltraSPARC III contexts; 14-stage instruction-interleaved pipeline
L1 caches:           Split I/D, 64KB, 64B lines, direct-mapped, writeback; non-blocking loads/stores; 16-entry MSHR, 4-entry store buffer
Clock frequency:     90MHz on Xilinx V2P70
Main memory:         4GB total
Resources (V2P70):   33,508 LUTs (50%), 222 BRAMs (67%) w/o stats+debug; 43,206 LUTs (65%), 238 BRAMs (72%) with
Instrumentation:     All internal state fully traceable; attachable to FPGA-based CMP cache simulator
EDA tools:           Xilinx EDK 9.2i, Bluespec System Verilog
Statistics:          25K lines of Bluespec, 511 rules, 89 module types
Checkpointing:       Fully compatible with Simics checkpoints; can load AND generate checkpoints
BlueSPARC Host Microarchitecture
[Figure: host pipeline diagram]
64-bit ISA, SW-visible MMU, complex memory → high # of pipeline stages
Hybrid Host Partitioning Choices
BlueSPARC (FPGA):
– add/sub/shift/logical, multiply/divide, register windows
– 38/103 SPARC ASIs, interprocessor x-calls, device interrupts
– I-/D-MMU + TLB miss, loads/stores/atomics, VIS block memory
Micro-transplant (on-chip simulation, PowerPC405):
– 65/103 SPARC ASIs, VIS I/II multimedia
– FP add/sub/mul/div + traps, FP/INT conversion, trap on integer arithmetic
– alignment, fixed-point arithmetic, TLB/cache diagnostics, TLB demap
Transplant (off-chip simulation, Simics on PC):
– PCI bus, I21152 PCI bridge, IRQ bus
– ISP2200 Fibre Channel, Fibre Channel SCSI disk/cdrom, SCSI bus
– Text console, SBBC PCI device, Serengeti I/O PROM, Cheerio-hme NIC
Performance
– Perf comparable to Simics-fast
– 39x speedup on average over Simics-trace
Outline
– Hybrid Full-System Simulation
– Virtualized Multiprocessor Simulation
– BlueSPARC Implementation
– Design Experiences
– Future Work
Design Experiences: 2007 Timeline
January–February:    Initial virtualization ideas; analysis + simulation of interleaving; ISA profiling of apps for hybrid partitioning; initial specifications for host pipeline
March:               Simics API wrappers + software experiments
April–November:      BlueSPARC RTL development; validation tools
November–December:   Host performance instrumentation and writeup*
* To appear in FPGA'08
Design Experiences (cont.)
What was important:
– Developing effective validation strategies (more on next slide)
– An existing reference model (Simics) to study and compare against
– Efficient mapping of state to FPGA resources (e.g., 16 PCs → LUT-based distributed RAM)
– Coping with long Xilinx builds by easing up on timing constraints
– "Judicious" Bluespec
What was NOT important:
– Meeting 100MHz timing for every Xilinx build (i.e., deep pipelining)
– Implementing every piece of functionality as efficiently/fast as possible
Validation
THE most challenging aspect of this project
Strategies used:
– Auto-generated torture tests + hand-written test cases
– Auto-ported test cases from the OpenSPARC T1 framework to UltraSPARC III
– Validated single-threaded + multithreaded ISA execution against Simics (both in Verilog simulation and in FPGA)
– Flight data recorder for non-deterministic interleaving of CPUs
– Batched Verilog simulations w/ varying parameters
– Validated the non-blocking memory system against "shadow" flat memories during Verilog simulation (caught self-modifying code bugs)
– >200 synthesizable assertions wired to ChipScope
– Built-in deadlock/error detectors
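The shadow-memory strategy above can be illustrated in software: every store updates both the model under test and a simple flat reference memory, and every load is cross-checked so a divergence is flagged at the first bad access. The `BuggyMemory` class and its injected off-by-one store are hypothetical, purely for illustration.

```python
# Illustrative shadow-memory validation: a flat reference dict runs
# alongside the memory model under test; loads are cross-checked so
# bugs surface at the first incorrect access, not much later.

class BuggyMemory:
    """Hypothetical memory model with an injected store bug."""
    def __init__(self):
        self.mem = {}
    def store(self, addr, val):
        # injected bug: corrupts stores to one address
        self.mem[addr] = val + 1 if addr == 0x40 else val
    def load(self, addr):
        return self.mem.get(addr, 0)

def run_with_shadow(dut, ops):
    shadow = {}  # flat reference memory
    for op, addr, val in ops:
        if op == "st":
            dut.store(addr, val)
            shadow[addr] = val
        else:  # "ld"
            got, want = dut.load(addr), shadow.get(addr, 0)
            if got != want:
                return f"mismatch at {hex(addr)}: got {got}, want {want}"
    return "ok"

ops = [("st", 0x10, 7), ("ld", 0x10, None),
       ("st", 0x40, 5), ("ld", 0x40, None)]
print(run_with_shadow(BuggyMemory(), ops))
# mismatch at 0x40: got 6, want 5
```

The same idea extends to self-modifying code: since instruction fetches are also loads, a fetch that disagrees with the shadow memory exposes a stale-instruction bug immediately.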
In Retrospect…
What I would have done differently to begin with:
– Write the entire USIII functional model myself in software first
– Take more advantage of Verilog PLI for validation (interface to C)
– Don't over-engineer HDL
– Don't upgrade tools unless necessary (e.g., trial license runs out)
– Build the validation infrastructure w/ batching capabilities earlier
– Automated "binary search" tool for bug hunting
– Re-write DDR2 async FIFOs without BRAMs
– Fast memory checkpoint loader (3GB images per run = 25 min)
– Simple, correct >> fast, buggy
Future Work
Scalability
– Burden of proof for 1000-way simulation?
– Investigate cache-coherence/interconnect mechanisms for combining multiple interleaved pipelines
Virtualization design spaces
– On-chip storage virtualization (e.g., architectural state)
– Memory + disk capacity (e.g., HW-based demand paging?)
– Virtualizing instrumentation (e.g., paging functional cache tags)
Fast instrumentation tools
– Understanding systems at multiple levels of abstraction (beyond ISA)
– Validation + analysis: beyond the ISA, how to sanity-check app + system behavior?
BlueSPARC Demo on BEE2
Demo application:
– On-Line Transaction Processing benchmark (TPC-C) in Oracle
– Runs in Solaris 8 (unmodified binary)
– FPGA + memory directly loaded from a Simics checkpoint
BEE2 platform:
– Virtex-II Pro 70 (PowerPC & BlueSPARC)
– 4 DDR2 controllers + 4GB memory
– Ethernet (to Simics on PC), RS232 (debugging)
Conclusion
The "build-all" simulation approach in FPGAs is challenging
Two virtualization techniques for reducing complexity:
– Hybrid: attain full-system support by deferring rare behaviors to SW
– Virtualized MP: decouples target system size from host size
BlueSPARC proof-of-concept:
– Models a 16-CPU UltraSPARC III server
– Perf comparable to Simics-fast, 39x faster on average than Simics-trace
Thanks! Questions?