Computer Architecture Lab at 1 ProtoFlex: Status Update and Design Experiences Eric S. Chung, Michael Papamichael, Eriko Nurvitadhi, James C. Hoe, Babak.

Presentation transcript:

1 ProtoFlex: Status Update and Design Experiences
Eric S. Chung, Michael Papamichael, Eriko Nurvitadhi, James C. Hoe, Babak Falsafi, Ken Mai {echung, enurvita, jhoe, babak,
Computer Architecture Lab at
ProtoFlex
Our work in this area has been supported in part by NSF, IBM, Intel, and Xilinx.

2 Full-system Functional Simulation (REVIEW)
Effective substitute for real (or non-existent) HW
– Can boot OS, run commercial apps
– Important in SW research & computer architecture
But too slow for large-scale MP studies
– Multicore won’t help existing tools
– Is serious challenge for large-MP (1000-way) simulation

3 Alternative: FPGA-based simulation
Only 10x slower in clock freq than custom HW
But FPGAs harder to use than software
– Simulating large-MP (100- to 1000-way) → can’t be done trivially
– Simulating full-system support → need devices + entire ISA
The “build-all” strategy in FPGAs = significant effort + resources
Figure labels (everything hosted in FPGAs): Memory, PCI Bus, Ethernet controller, Graphics card, I/O MMU controller, Disk, DMA controller, IRQ controller, Terminal, SCSI controller, CPU

4 Reducing complexity w/ virtualization
1 Hybrid Full-System Simulation
– Only frequent behaviors hosted in FPGA. Relegate infrequent to SW.
– Making multiple physical resources appear as a single logical resource
2 Virtualized MP Simulation
– Logical CPUs multiplexed onto fewer physical CPUs.
– Making a single physical resource appear as multiple logical resources
Figure labels: Target full-system behaviors; FPGA (frequent); Software (infrequent); CPU; Host resources

5 Outline
Hybrid Full-System Simulation
Virtualized Multiprocessor Simulation
BlueSPARC Implementation
Design Experiences
Future Work

6 Hybrid Full-System Simulation
3 ways to map target component to hybrid simulation host
– FPGA-only
– Simulation-only
– Transplantable
CPUs can fall back to SW by “transplanting” between hosts
– Only common-case instructions/behaviors implemented in FPGA
– Remaining behaviors relegated to SW (turns out many of the complex ones)
Transplants reduce full-system design effort
Figure labels (hybrid simulation): FPGA host with CPU; software full-system simulator host with CPU, Memory, MMU, Fibre, Graphics, NIC, PCI, Terminal, SCSI; (1) I/O instr; (2) CPU transplant
Figure labels (software-only simulation): software full-system simulator host with CPU, Memory, MMU, Fibre, Graphics, NIC, PCI, Terminal, SCSI

7 Outline
Hybrid Full-System Simulation
Virtualized Multiprocessor Simulation
BlueSPARC Implementation
Design Experiences
Future Work

8 Virtualized Multiprocessor Simulation
Problem: large-scale simulation configurations challenging to implement in FPGAs using structurally-accurate approaches
Structural-accuracy: 1-to-1 mapping between target and host CPUs
Pros: fastest possible solution, only 10x slower than real HW
Cons: difficult to build for large-scale configs (e.g., >100-way)
Figure labels: # processors in target model; # host processors implemented in FPGA; 1-to-1; 10x slower than real HW

9 Virtualized Multiprocessor Simulation
Host Interleaving: multiplex target processors onto fewer # FPGA-hosted processors
Advantages:
– Decouple logical target system size from FPGA host size
– Scale FPGA host as-needed to deliver required performance
– High target-to-host ratio (TH) simplifies/consolidates HW (e.g., fewer # nodes in cache coherence, interconnect)
Figure labels: # processors in target model; # host “engines” implemented in FPGA; 4-to-1; 40x slower than real HW
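The two data points on these slides (1-to-1 at 10x, 4-to-1 at 40x) suggest that slowdown grows in proportion to the target-to-host ratio. The sketch below works through that arithmetic under exactly that assumption; the function name and the linear model are illustrative and are not taken from the slides.

```python
def simulation_slowdown(target_cpus: int, host_engines: int, clock_gap: int = 10) -> int:
    """Estimated slowdown versus real hardware.

    Assumes (hypothetically) that slowdown is the FPGA clock gap (10x on the
    slides) multiplied by the target-to-host (TH) ratio, and that target CPUs
    divide evenly across host engines.
    """
    if host_engines <= 0 or target_cpus % host_engines != 0:
        raise ValueError("target CPUs must divide evenly across host engines")
    th_ratio = target_cpus // host_engines
    return clock_gap * th_ratio


if __name__ == "__main__":
    # A 16-CPU target hosted on 16, 4, and 1 engine(s).
    for engines in (16, 4, 1):
        print(f"{16 // engines}-to-1: {simulation_slowdown(16, engines)}x slower")
```

Under this model, shrinking the host from 16 engines to 4 trades a 4x longer run for roughly a quarter of the pipelines to build and validate.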

10 What’s inside an FPGA host processor?
An “engine” that architecturally executes multiple contexts
– Existing multithreaded designs are good candidates
– Choice is influenced by TH ratio (target-to-host ratio)
We propose an interleaved pipeline (e.g., TERA-style)
– Best suited for high TH ratio
– Switch in new CPU context on each cycle
– Simple, efficient design w/ no stalling or forwarding
– Long-latency tolerance (e.g., cache miss, transplants)
– Coherence is “free” between CPUs mapped onto same engine
Figure labels: CPU; HOST CPU
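To make "switch in new CPU context on each cycle" concrete, here is a minimal software sketch of an interleaved engine. It is not the BlueSPARC pipeline: the `Context` class, the fixed round-robin order, and the `miss_latency` parameter are assumptions chosen for illustration. It shows one context waiting on a long-latency event while the others keep retiring instructions.

```python
from dataclasses import dataclass


@dataclass
class Context:
    cpu_id: int
    pc: int = 0          # instructions retired so far (stand-in for a real PC)
    ready_at: int = 0    # cycle at which a pending long-latency event completes


def run_interleaved(num_contexts, cycles, misses=(), miss_latency=8):
    """Give each cycle's issue slot to the next context, round-robin.

    `misses` is a set of (cpu_id, pc) pairs that incur a long-latency event
    (e.g. a cache miss or transplant) the first time they are issued.
    """
    contexts = [Context(i) for i in range(num_contexts)]
    pending = set(misses)
    trace = []
    for cycle in range(cycles):
        ctx = contexts[cycle % num_contexts]       # new context every cycle
        if cycle < ctx.ready_at:
            trace.append((cycle, ctx.cpu_id, "idle"))    # only this slot is lost
        elif (ctx.cpu_id, ctx.pc) in pending:
            pending.discard((ctx.cpu_id, ctx.pc))
            ctx.ready_at = cycle + miss_latency          # retry once the miss returns
            trace.append((cycle, ctx.cpu_id, "miss"))
        else:
            ctx.pc += 1
            trace.append((cycle, ctx.cpu_id, "retire"))
    return contexts, trace


if __name__ == "__main__":
    ctxs, _ = run_interleaved(4, 16, misses={(1, 0)})
    print([c.pc for c in ctxs])
```

Because consecutive pipeline slots belong to different contexts, no slot depends on the result of the slot just before it, which is the intuition behind the "no stalling or forwarding" bullet.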

11 Outline
Hybrid Full-System Simulation
Virtualized Multiprocessor Simulation
BlueSPARC Implementation
Design Experiences
Future Work

12 Implementation: BlueSPARC simulator
Figure labels: 16-CPU Shared-memory UltraSPARC III Server (SunFire 3800); BEE2 Platform

13 BlueSPARC Simulator (continued)
Processing Nodes: 16 64-bit UltraSPARC III contexts; 14-stage instruction-interleaved pipeline
L1 caches: Split I/D, 64KB, 64B, direct-mapped, writeback; Non-blocking loads/stores; 16-entry MSHR, 4-entry store buffer
Clock frequency: 90MHz on Xilinx V2P70
Main memory: 4GB total
Resources (Xilinx V2P70): 33,508 LUTs (50%), 222 BRAMs (67%) w/o stats+debug; 43,206 LUTs (65%), 238 BRAMs (72%)
Instrumentation: All internal state fully traceable; Attachable to FPGA-based CMP cache simulator*
EDA tools: Xilinx EDK 9.2i, Bluespec System Verilog
Statistics: 25K lines Bluespec, 511 rules, 89 module types
Checkpointing: Fully compatible with Simics checkpoints; Can load AND generate checkpoints
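As a worked example of the L1 parameters listed above, the sketch below splits an address into tag, index, and offset for a 64KB direct-mapped cache. It assumes that "64B" is the line size; the function and constant names are illustrative and do not come from the BlueSPARC sources.

```python
LINE_BYTES = 64                 # assumed: "64B" on the slide is the line size
CACHE_BYTES = 64 * 1024         # 64KB per L1 cache
NUM_LINES = CACHE_BYTES // LINE_BYTES            # 1024 lines when direct-mapped
OFFSET_BITS = LINE_BYTES.bit_length() - 1        # 6 bits select a byte in a line
INDEX_BITS = NUM_LINES.bit_length() - 1          # 10 bits select the line


def split_address(addr: int):
    """Return (tag, index, offset) for a direct-mapped lookup."""
    offset = addr & (LINE_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (NUM_LINES - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


if __name__ == "__main__":
    print(split_address(0x12345))                # (1, 141, 5)
    print(split_address(0x12345 + CACHE_BYTES))  # same index, different tag
```

Two addresses exactly 64KB apart land on the same line with different tags, which is the conflict case a direct-mapped design has to evict for.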

14 BlueSPARC host microarchitecture
64-bit ISA, SW-visible MMU, complex memory → high # of pipeline stages

15 Hybrid host partitioning choices
BlueSPARC (FPGA): add/sub/shift/logical; multiply/divide; register windows; 38/103 SPARC ASIs; interprocessor x-calls; device interrupts; I-/D-MMU + tlb miss; Loads/stores/atomics
Micro-transplant (on-chip simulation): VIS block memory; 65/103 SPARC ASIs; VIS I/II multimedia; FP add/sub/mul/div + traps; FP/INT conversion; trap on integer arithmetic; alignment; fixed-point arithmetic; tlb/cache diagnostics; tlb demap
Transplant (off-chip simulation): PCI bus; ISP2200 Fibre Channel; I21152 PCI bridge; IRQ bus; Fibre Channel SCSI disk/cdrom; Text Console; SBBC PCI device; Serengeti I/O PROM; Cheerio-hme NIC; SCSI bus
Figure labels: ON-CHIP FPGA: BlueSPARC, Micro-transplants (PowerPC405); OFF-CHIP: Transplants (Simics on PC)
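The three-way partition above can be read as a dispatch table: common operations stay in the FPGA pipeline, rarer ones go to the on-chip PowerPC, and device accesses go to Simics on the PC. The sketch below is a loose software analogy of that routing. The operation names, the `HOST_OF` table, and the choice to send unknown operations off-chip are all hypothetical.

```python
from collections import Counter

FPGA = "BlueSPARC (FPGA)"
MICRO = "micro-transplant (on-chip PowerPC405)"
FULL = "transplant (off-chip Simics on PC)"

# Hypothetical operation classes, grouped the way the slide groups behaviors.
HOST_OF = {
    "add": FPGA, "shift": FPGA, "multiply": FPGA, "load": FPGA, "store": FPGA,
    "fp_add": MICRO, "vis_block_memory": MICRO, "tlb_demap": MICRO,
    "pci_access": FULL, "scsi_access": FULL,
}


def dispatch(op: str) -> str:
    """Pick the host for one operation; anything unlisted falls back off-chip
    (an assumption of this sketch, chosen so nothing is silently dropped)."""
    return HOST_OF.get(op, FULL)


def host_histogram(trace):
    """Count how many operations in a trace each host would handle."""
    return Counter(dispatch(op) for op in trace)


if __name__ == "__main__":
    trace = ["add", "load", "add", "store", "fp_add", "load", "pci_access"]
    for host, count in host_histogram(trace).items():
        print(f"{count:2d}  {host}")
```

The partition pays off only when a trace like this one is dominated by the first group, which is why the timeline slide lists ISA profiling of applications before the pipeline was specified.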

16 Performance
Perf comparable to Simics-fast
39x speedup on average over Simics-trace

17 Outline
Hybrid Full-System Simulation
Virtualized Multiprocessor Simulation
BlueSPARC Implementation
Design Experiences
Future Work

18 Design experiences
2007 Timeline
January-February: Initial virtualization ideas; Analysis + simulation of interleaving; ISA profiling of apps for hybrid partitioning; Initial specifications for host pipeline
March: Simics API wrappers + software experiments
April-November: BlueSPARC RTL development; Validation tools
November-December: Host performance instrumentation and writeup*
* To appear in FPGA’08

19 Design experiences (cont)
What was important:
– Developing effective validation strategies (more on next slide)
– Existing reference model (Simics) to study and compare against
– Efficient mapping of state to FPGA resources (e.g., 16 PCs → 16-bit LUT-based distributed RAM)
– Coping with long Xilinx builds by easing up on timing constraints
– “Judicious” Bluespec
What was NOT important:
– Meeting 100MHz timing for every Xilinx build (i.e., deep pipelining)
– Implementing every functionality as efficiently/fast as possible

20 Validation
THE most challenging aspect of this project
Strategies used
– Auto-generated torture tests + hand-written test cases
– Auto-port test-cases from OpenSPARC T1 framework to UltraSPARC III
– Validated single-threaded + multithreaded ISA execution against Simics (both in Verilog Simulations and in FPGA)
– Flight data recorder for non-deterministic interleaving of CPUs
– Batched Verilog simulations w/ varying parameters
– Validate non-blocking memory system with “shadow” flat memories during Verilog simulation → caught self-modifying code bugs
– >200 synthesizable assertions to Chipscope
– Built-in deadlock/error detectors
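The "shadow" flat memory strategy can be sketched in software: every store is mirrored into a trivially correct flat memory, and every load from the memory system under test is compared against it. The code below is a toy illustration of that idea, not the project's Verilog harness. `WriteBackCache`, `check_against_shadow`, and the word-granular lines are all assumptions made for brevity.

```python
class WriteBackCache:
    """Toy direct-mapped write-back cache over a backing dict (one word per line)."""

    def __init__(self, backing, num_lines=4):
        self.backing = backing
        self.num_lines = num_lines
        self.lines = {}  # index -> (addr, value, dirty)

    def _fill(self, addr):
        idx = addr % self.num_lines
        held = self.lines.get(idx)
        if held is not None and held[0] != addr:
            old_addr, old_val, dirty = held
            if dirty:
                self.backing[old_addr] = old_val   # write back the evicted line
            held = None
        if held is None:
            self.lines[idx] = (addr, self.backing.get(addr, 0), False)
        return idx

    def load(self, addr):
        return self.lines[self._fill(addr)][1]

    def store(self, addr, value):
        idx = self._fill(addr)
        self.lines[idx] = (addr, value, True)


def check_against_shadow(ops, memory_system):
    """Replay (kind, addr, value) ops; return the index of the first load that
    disagrees with the flat shadow memory, or None if all loads match."""
    shadow = {}
    for i, (kind, addr, value) in enumerate(ops):
        if kind == "store":
            memory_system.store(addr, value)
            shadow[addr] = value
        else:
            if memory_system.load(addr) != shadow.get(addr, 0):
                return i
    return None


if __name__ == "__main__":
    ops = [("store", 0, 1), ("store", 4, 2), ("load", 0, None), ("load", 4, None)]
    print(check_against_shadow(ops, WriteBackCache({})))
```

Addresses 0 and 4 map to the same line in this toy cache, so the sequence forces a dirty eviction, the kind of path where a dropped writeback would show up as a mismatch.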

21 In retrospect…
What I would have done differently to begin with
– Write entire USIII functional model myself in software first
– Take more advantage of Verilog PLI for validation (interface to C)
– Don’t over-engineer HDL
– Don’t upgrade tools unless necessary (e.g., trial license runs out)
– Validation infrastructure w/ batching capabilities (do earlier!)
– Automated “binary search” tool for bug hunting
– Re-write DDR2 Async FIFOs without BRAMs
– Fast memory checkpoint loader (3GB images per run = 25m)
– Simple, correct >> Fast, buggy
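One way to read the "binary search" bullet: given a run that is known to match the reference at some instruction count and known to differ at a later one, bisect on the instruction count to find the first divergence. The sketch below assumes a hypothetical `diverged(n)` predicate (for example, run n instructions from a checkpoint and compare state against the reference model) and assumes that once the state diverges it stays diverged. Neither assumption comes from the slides.

```python
def first_divergence(diverged, lo: int, hi: int) -> int:
    """Return the smallest n in (lo, hi] for which diverged(n) is True.

    Requires diverged(lo) to be False and diverged(hi) to be True, and assumes
    divergence is monotonic (it never heals once it has appeared).
    """
    if diverged(lo) or not diverged(hi):
        raise ValueError("need a known-good lower bound and a known-bad upper bound")
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if diverged(mid):
            hi = mid      # bug already visible: look earlier
        else:
            lo = mid      # still matches the reference: look later
    return hi


if __name__ == "__main__":
    # Stand-in for "run n instructions and compare against the reference model".
    probes = []

    def diverged(n):
        probes.append(n)
        return n >= 1337

    print(first_divergence(diverged, 0, 1_000_000), "found in", len(probes), "probes")
```

Each probe would be a full simulation run in practice, so needing about 20 probes instead of a linear scan over a million instructions is the whole point.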

22 Future Work
Scalability
– Burden-of-proof for 1000-way simulation?
– Investigate cache-coherence/interconnect mechanisms for combining multiple interleaved pipelines
Virtualization design spaces
– On-chip storage virtualization (e.g., architectural state)
– Memory + disk capacity (e.g., HW-based demand paging?)
– Virtualizing instrumentation (e.g., paging functional cache tags)
Fast instrumentation tools
– Understanding systems at multiple levels of abstraction (beyond ISA)
– Validation+analysis: beyond ISA, how to sanity-check app+sys behavior?

23 BlueSPARC Demo on BEE2
Demo application
– On-Line Transaction Processing benchmark (TPC-C) in Oracle
– Runs in Solaris 8 (unmodified binary)
– FPGA + Memory directly loaded from Simics checkpoint
Figure labels (BEE2 Platform): 4 DDR2 Controllers + 4 GB memory; Ethernet (to Simics on PC); Virtex-II Pro 70 (PowerPC & BlueSPARC); RS232 (Debugging)

24 Conclusion
“Build-all” simulation approach in FPGAs is challenging
Two virtualization techniques for reducing complexity
– Hybrid: attain full-system by deferring rare behaviors to SW
– Virtualized MP: decouples target system size from host size
BlueSPARC proof-of-concept
– Models 16-CPU UltraSPARC III server
– Comparable perf to Simics-fast, 39x faster on average than Simics-trace
Thanks! Questions?
ProtoFlex (