
1 ProtoFlex: Status Update and Design Experiences
Eric S. Chung, Michael Papamichael, Eriko Nurvitadhi, James C. Hoe, Babak Falsafi, Ken Mai
{echung, enurvita, jhoe, babak, kenmai}@ece.cmu.edu
Computer Architecture Lab at Carnegie Mellon
Our work in this area has been supported in part by NSF, IBM, Intel, and Xilinx.

2 Full-system Functional Simulation (review)
Effective substitute for real (or non-existent) HW
–Can boot OS, run commercial apps
–Important in SW research & computer architecture
But too slow for large-scale MP studies
–Multicore won’t help existing tools
–Is a serious challenge for large-MP (1000-way) simulation

3 Alternative: FPGA-based simulation
Only 10x slower in clock freq than custom HW
But FPGAs are harder to use than software
–Simulating large-MP (100- to 1000-way) → can’t be done trivially
–Simulating full-system support → need devices + entire ISA
The “build-all” strategy in FPGAs = significant effort + resources
[Figure: full-system components that would all have to be built in FPGAs: CPU, memory, PCI bus, Ethernet controller, graphics card, I/O MMU controller, disk, DMA controller, IRQ controller, terminal, SCSI controller]

4 Reducing complexity w/ virtualization
1. Hybrid Full-System Simulation
–Only frequent behaviors hosted in FPGA. Relegate infrequent to SW.
–Making multiple physical resources appear as a single logical resource
2. Virtualized MP Simulation
–Logical CPUs multiplexed onto fewer physical CPUs.
–Making a single physical resource appear as multiple logical resources
[Figure: target full-system behaviors split into frequent (FPGA) and infrequent (software); multiple logical CPUs mapped onto one FPGA CPU's host resources]

5 Outline
Hybrid Full-System Simulation
Virtualized Multiprocessor Simulation
BlueSPARC Implementation
Design Experiences
Future Work

6 Hybrid Full-System Simulation
3 ways to map a target component to the hybrid simulation host:
1. FPGA-only
2. Simulation-only
3. Transplantable
CPUs can fall back to SW by “transplanting” between hosts
–Only common-case instructions/behaviors implemented in FPGA
–Remaining behaviors relegated to SW (turns out many of the complex ones)
Transplants reduce full-system design effort
[Figure: hybrid simulation with an FPGA host next to a software full-system simulator host (CPU, memory, MMU, Fibre, graphics, NIC, PCI, terminal, SCSI); a CPU transplants to software on an I/O instruction. Shown for comparison: software-only simulation.]
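
The transplant idea on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `FPGA_IMPLEMENTED`, `CpuState`, `HybridHost`, and `run`, the opcode strings, and the fixed 4-byte PC increment are all hypothetical and do not come from ProtoFlex.

```python
from dataclasses import dataclass, field

# Hypothetical set of common-case operations implemented on the FPGA host.
FPGA_IMPLEMENTED = {"add", "sub", "load", "store"}


@dataclass
class CpuState:
    """Architectural state that moves between hosts during a transplant."""
    pc: int = 0
    regs: dict = field(default_factory=dict)


@dataclass
class HybridHost:
    fpga_executed: int = 0
    transplants: int = 0

    def step(self, state: CpuState, op: str) -> str:
        """Execute one instruction and report which host ran it."""
        if op in FPGA_IMPLEMENTED:
            self.fpga_executed += 1
            host = "fpga"
        else:
            # Unimplemented behavior: the CPU state is handed to the software
            # simulator, executed there, and then handed back to the FPGA.
            self.transplants += 1
            host = "software"
        state.pc += 4  # either host advances the same architectural state
        return host


def run(trace):
    """Run a list of opcode names and return (host, state, placement)."""
    host, state = HybridHost(), CpuState()
    placement = [host.step(state, op) for op in trace]
    return host, state, placement
```

For a trace such as `["add", "load", "io_read", "sub"]`, only the hypothetical `io_read` triggers a transplant; the other three stay on the FPGA host.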

7 Outline
Hybrid Full-System Simulation
Virtualized Multiprocessor Simulation
BlueSPARC Implementation
Design Experiences
Future Work

8 Virtualized Multiprocessor Simulation
Problem: large-scale simulation configurations are challenging to implement in FPGAs using structurally-accurate approaches
Structural accuracy: 1-to-1 mapping between target and host CPUs
Pros: fastest possible solution, only 10x slower than real HW
Cons: difficult to build for large-scale configs (e.g., >100-way)
[Figure: # processors in target model vs. # host processors implemented in FPGA; the 1-to-1 mapping is 10x slower than real HW]

9 Virtualized Multiprocessor Simulation
Host interleaving: multiplex target processors onto fewer FPGA-hosted processors
Advantages:
–Decouple logical target system size from FPGA host size
–Scale FPGA host as needed to deliver required performance
–High target-to-host ratio (TH) simplifies/consolidates HW (e.g., fewer nodes in cache coherence, interconnect)
[Figure: # processors in target model vs. # host “engines” implemented in FPGA; a 4-to-1 mapping is 40x slower than real HW]
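
The two figures quoted on these slides (1-to-1 at 10x, 4-to-1 at 40x) are consistent with slowdown growing linearly in the TH ratio. A minimal sketch of that arithmetic follows, assuming the linear model holds in general; the function name and the default penalty parameter are illustrative and not from the slides.

```python
def slowdown_vs_real_hw(target_cpus: int, host_engines: int,
                        fpga_clock_penalty: float = 10.0) -> float:
    """Estimate slowdown relative to real hardware.

    Assumption: slowdown = FPGA clock penalty x target-to-host (TH) ratio,
    which matches the 1-to-1 (10x) and 4-to-1 (40x) points on the slides.
    """
    if target_cpus <= 0 or host_engines <= 0:
        raise ValueError("CPU counts must be positive")
    th_ratio = target_cpus / host_engines
    return fpga_clock_penalty * th_ratio
```

Under this model, `slowdown_vs_real_hw(16, 4)` gives 40.0, and mapping all 16 target CPUs onto a single engine would give 160.0.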

10 What’s inside an FPGA host processor?
An “engine” that architecturally executes multiple contexts
–Existing multithreaded designs are good candidates
–Choice is influenced by TH ratio (target-to-host ratio)
We propose an interleaved pipeline (e.g., TERA-style)
–Best suited for high TH ratio
–Switch in new CPU context on each cycle
–Simple, efficient design w/ no stalling or forwarding
–Long-latency tolerance (e.g., cache miss, transplants)
–Coherence is “free” between CPUs mapped onto same engine
[Figure: multiple target CPUs mapped onto one host CPU]
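
A rough Python sketch of per-cycle context interleaving is shown below. It assumes strict round-robin issue in which a context waiting on a long-latency event simply gives up its slot; the slides do not specify the actual scheduling policy, and the names `issue_schedule` and `min_reissue_gap` are hypothetical.

```python
def issue_schedule(num_contexts, cycles, blocked_until=None):
    """Return the context issued on each cycle (None marks an unused slot).

    blocked_until maps a context id to the first cycle at which it may
    issue again, e.g. after a cache miss or a transplant completes.
    """
    blocked_until = blocked_until or {}
    schedule = []
    for cycle in range(cycles):
        ctx = cycle % num_contexts  # strict round-robin slot owner
        if cycle < blocked_until.get(ctx, 0):
            schedule.append(None)   # slot unused; other contexts are unaffected
        else:
            schedule.append(ctx)
    return schedule


def min_reissue_gap(schedule):
    """Smallest number of cycles between two issues of the same context."""
    last_issue, gap = {}, None
    for cycle, ctx in enumerate(schedule):
        if ctx is None:
            continue
        if ctx in last_issue:
            distance = cycle - last_issue[ctx]
            gap = distance if gap is None else min(gap, distance)
        last_issue[ctx] = cycle
    return gap
```

One way to read the "no stalling or forwarding" bullet: if the re-issue gap for a context is at least the pipeline depth, its previous instruction has left the pipeline before the next one enters. With 16 contexts the gap under this policy is 16 cycles.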

11 Outline
Hybrid Full-System Simulation
Virtualized Multiprocessor Simulation
BlueSPARC Implementation
Design Experiences
Future Work

12 Implementation: BlueSPARC simulator
Target: 16-CPU shared-memory UltraSPARC III server (SunFire 3800)
Host: BEE2 platform

13 BlueSPARC Simulator (continued)
Processing nodes: 16 64-bit UltraSPARC III contexts; 14-stage instruction-interleaved pipeline
L1 caches: split I/D, 64KB, 64B, direct-mapped, writeback; non-blocking loads/stores; 16-entry MSHR, 4-entry store buffer
Clock frequency: 90MHz on Xilinx V2P70
Main memory: 4GB total
Resources (Xilinx V2P70):
–33,508 LUTs (50%), 222 BRAMs (67%) w/o stats+debug
–43,206 LUTs (65%), 238 BRAMs (72%)
Instrumentation: all internal state fully traceable; attachable to FPGA-based CMP cache simulator*
EDA tools: Xilinx EDK 9.2i, Bluespec SystemVerilog
Statistics: 25K lines Bluespec, 511 rules, 89 module types
Checkpointing: fully compatible with Simics checkpoints; can load AND generate checkpoints

14 BlueSPARC host microarchitecture
64-bit ISA, SW-visible MMU, complex memory → high # of pipeline stages

15 Hybrid host partitioning choices
BlueSPARC (FPGA):
–add/sub/shift/logical
–multiply/divide
–register windows
–38/103 SPARC ASIs
–interprocessor x-calls
–device interrupts
–I-/D-MMU + tlb miss
–loads/stores/atomics
–VIS block memory
Micro-transplant (on-chip simulation, PowerPC405):
–65/103 SPARC ASIs
–VIS I/II multimedia
–FP add/sub/mul/div + traps
–FP/INT conversion
–trap on integer arithmetic
–alignment
–fixed-point arithmetic
–tlb/cache diagnostics
–tlb demap
Transplant (off-chip simulation, Simics on PC):
–PCI bus
–ISP2200 Fibre Channel
–I21152 PCI bridge
–IRQ bus
–Fibre Channel SCSI disk/cdrom
–Text console
–SBBC PCI device
–Serengeti I/O PROM
–Cheerio-hme NIC
–SCSI bus
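
The three-level partitioning can be pictured as a lookup from a behavior to the host that handles it. The sketch below is illustrative only. The behavior keys are hypothetical labels for a few entries read off the slide's three columns, and the fall-through to the off-chip simulator for anything unlisted is an assumption rather than documented behavior.

```python
from collections import Counter
from enum import Enum


class Host(Enum):
    FPGA = "BlueSPARC pipeline (FPGA)"
    MICRO_TRANSPLANT = "micro-transplant (on-chip PowerPC405)"
    TRANSPLANT = "transplant (Simics on PC, off-chip)"


# Hypothetical behavior labels; only a handful of slide entries are shown.
PARTITION = {
    "add": Host.FPGA,
    "load": Host.FPGA,
    "store": Host.FPGA,
    "register_window": Host.FPGA,
    "fp_add": Host.MICRO_TRANSPLANT,
    "tlb_demap": Host.MICRO_TRANSPLANT,
    "pci_bus_access": Host.TRANSPLANT,
    "scsi_disk_access": Host.TRANSPLANT,
}


def classify(behavior: str) -> Host:
    # Assumption: anything unlisted falls through to the full software simulator.
    return PARTITION.get(behavior, Host.TRANSPLANT)


def host_shares(profile: dict) -> dict:
    """Fraction of dynamic events handled by each host for a given profile."""
    totals = Counter()
    for behavior, count in profile.items():
        totals[classify(behavior)] += count
    grand_total = sum(totals.values())
    if grand_total == 0:
        raise ValueError("profile contains no events")
    return {host: totals[host] / grand_total for host in Host}
```

Fed with an instruction-mix profile, `host_shares` gives a quick view of how much work each level would absorb, which is the kind of question the ISA profiling step in the project timeline addresses.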

16 Performance
Perf comparable to Simics-fast
39x speedup on average over Simics-trace

17 Outline
Hybrid Full-System Simulation
Virtualized Multiprocessor Simulation
BlueSPARC Implementation
Design Experiences
Future Work

18 Design experiences
2007 Timeline
January-February: initial virtualization ideas; analysis + simulation of interleaving; ISA profiling of apps for hybrid partitioning; initial specifications for host pipeline
March: Simics API wrappers + software experiments
April-November: BlueSPARC RTL development; validation tools
November-December: host performance instrumentation and writeup*
* To appear in FPGA’08

19 Design experiences (cont)
What was important:
–Developing effective validation strategies (more on next slide)
–Existing reference model (Simics) to study and compare against
–Efficient mapping of state to FPGA resources (e.g., 16 PCs → 16-bit LUT-based distributed RAM)
–Coping with long Xilinx builds by easing up on timing constraints
–“Judicious” Bluespec
What was NOT important:
–Meeting 100MHz timing for every Xilinx build (i.e., deep pipelining)
–Implementing every functionality as efficiently/fast as possible

20 Validation
THE most challenging aspect of this project
Strategies used
–Auto-generated torture tests + hand-written test cases
–Auto-port test cases from OpenSPARC T1 framework to UltraSPARC III
–Validated single-threaded + multithreaded ISA execution against Simics (both in Verilog simulations and in FPGA)
–Flight data recorder for non-deterministic interleaving of CPUs
–Batched Verilog simulations w/ varying parameters
–Validate non-blocking memory system with “shadow” flat memories during Verilog simulation → caught self-modifying code bugs
–More than 200 synthesizable assertions to Chipscope
–Built-in deadlock/error detectors
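
The “shadow” flat memory strategy can be illustrated with a small Python sketch: every write is mirrored into a trivially correct flat memory, and every read from the design under test is compared against it. All class and function names here are hypothetical, and the deliberately buggy store-buffer model exists only to show a mismatch being caught; it is not a description of any actual BlueSPARC bug.

```python
class FlatMemory:
    """Trivially correct reference: one dictionary, no buffering."""

    def __init__(self):
        self.data = {}

    def write(self, addr, value):
        self.data[addr] = value

    def read(self, addr):
        return self.data.get(addr, 0)


class BuggyStoreBufferMemory(FlatMemory):
    """Hypothetical design under test: loads ignore buffered stores."""

    def __init__(self):
        super().__init__()
        self.pending = []

    def write(self, addr, value):
        # Bug on purpose: the store is buffered and never forwarded to loads.
        self.pending.append((addr, value))


def check_against_shadow(dut, ops):
    """Replay ops on the DUT and a shadow memory; return read mismatches.

    ops is a list of (kind, addr, value) tuples with kind "write" or "read".
    Each mismatch is reported as (op_index, addr, dut_value, shadow_value).
    """
    shadow = FlatMemory()
    mismatches = []
    for index, (kind, addr, value) in enumerate(ops):
        if kind == "write":
            dut.write(addr, value)
            shadow.write(addr, value)
        else:
            got, want = dut.read(addr), shadow.read(addr)
            if got != want:
                mismatches.append((index, addr, got, want))
    return mismatches
```

A write followed by a read of the same address passes on `FlatMemory` and is flagged on the buggy model, which is the pattern a store followed by an instruction fetch of the same line would produce in a self-modifying-code test.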

21 In retrospect…
What I would have done differently to begin with
–Write entire USIII functional model myself in software first
–Take more advantage of Verilog PLI for validation (interface to C)
–Don’t over-engineer HDL
–Don’t upgrade tools unless necessary (e.g., trial license runs out)
–Validation infrastructure w/ batching capabilities (do earlier!)
–Automated “binary search” tool for bug hunting
–Re-write DDR2 async FIFOs without BRAMs
–Fast memory checkpoint loader (3GB images per run = 25 min)
–Simple, correct >> Fast, buggy
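
One plausible shape for the automated “binary search” bug-hunting tool is a bisection over the number of instructions executed from a checkpoint. The sketch below assumes deterministic replay and that, once the design diverges from the reference, it stays diverged. The `diverged` callback is a hypothetical stand-in for "replay n instructions on both and compare architectural state"; the slides do not describe the tool's actual design.

```python
def first_divergence(diverged, max_steps):
    """Return the smallest step count at which diverged(n) is True.

    Assumptions: diverged(0) is False (both models start from the same
    checkpoint) and diverged is monotonic. Returns None if no divergence
    is observed within max_steps.
    """
    if not diverged(max_steps):
        return None
    lo, hi = 0, max_steps  # invariant: not diverged(lo) and diverged(hi)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if diverged(mid):
            hi = mid
        else:
            lo = mid
    return hi
```

The search needs about log2(max_steps) replays, so a failure somewhere in a million instructions is localized in roughly 20 runs.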

22 Future Work
Scalability
–Burden-of-proof for 1000-way simulation?
–Investigate cache-coherence/interconnect mechanisms for combining multiple interleaved pipelines
Virtualization design spaces
–On-chip storage virtualization (e.g., architectural state)
–Memory + disk capacity (e.g., HW-based demand paging?)
–Virtualizing instrumentation (e.g., paging functional cache tags)
Fast instrumentation tools
–Understanding systems at multiple levels of abstraction (beyond ISA)
–Validation + analysis: beyond ISA, how to sanity-check app + sys behavior?

23 BlueSPARC Demo on BEE2
Demo application
–On-Line Transaction Processing benchmark (TPC-C) in Oracle
–Runs in Solaris 8 (unmodified binary)
–FPGA + memory directly loaded from Simics checkpoint
[Figure: BEE2 platform with 4 DDR2 controllers + 4 GB memory, Ethernet (to Simics on PC), Virtex-II Pro 70 (PowerPC & BlueSPARC), RS232 (debugging)]

24 Conclusion
“Build-all” simulation approach in FPGAs is challenging
Two virtualization techniques for reducing complexity
–Hybrid: attain full-system by deferring rare behaviors to SW
–Virtualized MP: decouples target system size from host size
BlueSPARC proof-of-concept
–Models 16-CPU UltraSPARC III server
–Comparable perf to Simics-fast, 39x faster on avg than Simics-trace
Thanks! Questions? echung@ece.cmu.edu
ProtoFlex (http://www.ece.cmu.edu/~simflex/protoflex.html)

