© Derek Chiou 1 Functional/Timing Split in UT FAST Derek Chiou, Dam Sunwoo, Joonsoo Kim, Nikhil Patil, William Reinhart, D. Eric Johnson, Jebediah Keefe, Hari Angepat University of Texas at Austin Electrical and Computer Engineering

1/17/08 RAMP Retreat 2 First, Some Terminology Host: the system on which a simulator runs, e.g. a Dell 390 with a single 1.8GHz Core 2 Duo and 4GB of RAM, or a Xilinx FPGA board. Target: the system being modeled, e.g. an Alpha processor, or that same Dell 390. Example: Host = your desktop, Simulator = Simplescalar (sim-alpha), Target = Alpha 21264.

Test of size 1/17/08RAMP Retreat 3 FAST Goals RTL-level cycle-accuracy Complex ISA capable (x86, PowerPC) Complex micro-architecture capable (Intel Core 2) Off-the-shelf OS, apps (Windows, MS Word, Linux) Lesson I learned from my grad school career Can run other stuff too of course MP-capable (scale with FPGA resources) Fast (10MIPS range) (Relatively) easy to implement, modify, extend

Test of size 1/17/08RAMP Retreat 4 FAST Prototype in Real Time

1/17/08 RAMP Retreat 5 FAST: Speculative Functional/Timing Partitioning Proven partitioning (FastSim). The FM executes instructions to completion and pushes an instruction trace to the TM; the FM's instructions are used as the TM's fetched instructions. If functional insts != timing insts, the TM forces the FM to roll back (e.g., branch mis-speculation and resolve, memory ordering). Clean inst trace/rollback interface. Optimize the common case! The FM runs independently from the TM when functional insts == timing insts, so it is easy to parallelize, and a better target uArch simulates faster. Factorized, not partitioned! (FM + TM) < (monolithic simulator): the FM is fairly simple (only functionality) and the TM is fairly simple (only timing). [Diagram: the Functional Model (ISA + peripherals: instructions, architectural registers, peripheral functionality) sends an inst trace to the Timing Model (micro-architecture: caches, arbitration, pipelining, associativity, ...)]
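A minimal software sketch of this trace/rollback interface, in C++ (all type and function names here are hypothetical and greatly simplified, not the UT FAST code): the FM runs ahead producing a trace, and the TM forces a rollback the moment the traced path differs from the path it actually fetched.

#include <cstdint>
#include <iostream>
#include <vector>

// Hypothetical trace record: what the FM tells the TM about one committed instruction.
struct TraceEntry {
    uint64_t inst_num;        // dynamic instruction number
    uint64_t pc, next_pc;
};

// Toy functional model: executes a made-up program with a taken branch every 8th instruction.
class FunctionalModel {
    uint64_t pc_ = 0x1000;
    uint64_t inst_num_ = 0;
public:
    TraceEntry execute_one() {                       // run one instruction to completion
        uint64_t next = (pc_ % 32 == 28) ? pc_ + 0x100 : pc_ + 4;
        TraceEntry e{inst_num_++, pc_, next};
        pc_ = next;
        return e;
    }
    // Roll back so that dynamic instruction `inst_num` fetches from `pc` instead.
    // (A real FM also restores registers, memory, and peripherals, e.g. from a checkpoint.)
    void set_pc(uint64_t inst_num, uint64_t pc) { inst_num_ = inst_num; pc_ = pc; }
};

int main() {
    FunctionalModel fm;
    std::vector<TraceEntry> trace;
    for (int i = 0; i < 16; ++i) trace.push_back(fm.execute_one());   // FM runs ahead of the TM

    uint64_t tm_fetch_pc = 0x1000;                   // the path the TM's front end actually fetches
    for (const TraceEntry& e : trace) {
        if (e.pc != tm_fetch_pc) {                   // functional insts != timing insts
            std::cout << "mismatch at inst " << e.inst_num << ": force FM rollback\n";
            fm.set_pc(e.inst_num, tm_fetch_pc);      // TM forces the FM onto its own path
            break;                                   // a real TM then consumes the regenerated trace
        }
        // ... TM models fetch/decode/execute/commit timing for `e` here ...
        tm_fetch_pc = e.pc + 4;                      // toy TM front end: static not-taken prediction
    }
}

In the real simulator the TM also steers the FM down the mis-speculated path and back again when the branch resolves; the sketch only shows the mismatch detection and the set_pc-style resteer.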

1/17/08 RAMP Retreat 6 High Level FAST Architecture: A Parallelized Simulator Parallelized between FM & TM. A parallel target would have a parallelized FM, each target core conceptually running on a separate host core, plus a parallelized TM. The TM parallelizes nicely in hardware (FPGA): latency tolerant, infrequent round-trips. Stats can be done in hardware! (no performance impact) [Diagram: the Functional Model (ISA + peripherals) on the host sends a trace to the Timing Model (micro-architecture) on the FPGA]

Test of size 1/17/08RAMP Retreat 7 What Is A FAST Functional Model? Requirements Fast, Full System, generate instruction trace, support rollback Hardware functional models Fast, but FPGA implementation difficult to make complete x86, boots Windows? Simple, resource efficient FM sufficient (but need trace, rollback) Software functional models Bochs, QEMU, Simics, SimNow, SimOS, etc. Run on fastest hardware we know about to execute an ISA Full system

1/17/08 RAMP Retreat 8 What is a FAST Timing Model? [Diagram: a classic pipelined datapath (PC, instruction memory, GPR file, immediate extend, ALU, data memory, bypass/interlock) driven by the instruction trace]

Test of size 1/17/08RAMP Retreat 9 Modular Timing Models: Modules + Connectors Modules model timing functionality E.g., rename, caches, etc. Built hierarchically for extensibility CAM, FIFOs, arbiters, etc. Branch predictors, Caches, TLBs, Schedulers, ALUs Fetch, Decode, Rename, RS, ROB Many are essentially wires (e.g., ALU) Often written to execute one operation E.g., Rename, Cache Executed multiple times per target cycle for wider processor, higher associativity Simplifies implementation, tradeoff time for space Connectors connect modules Abstract timing from modules Throughput (input, output), delay, maxTransactions Stats and tracing Bill Reinhart
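As an illustration of the "executed multiple times per target cycle" point, here is a hypothetical C++ sketch (names invented, not the Bluespec modules themselves): a rename module written to handle one instruction per invocation is simply called once per issue slot to model a wider target, trading host time for host space.

#include <array>
#include <cstdint>
#include <cstdio>

// Hypothetical sketch: a timing-model "Rename" module written to handle ONE
// instruction per invocation, then invoked N times per target cycle to model
// an N-wide rename stage.
struct RenameModule {
    std::array<uint16_t, 32> map{};    // architectural -> physical register map
    uint16_t next_phys = 32;
    uint16_t rename_one(uint8_t arch_dest) {   // one operation per call
        uint16_t phys = next_phys++;
        map[arch_dest] = phys;
        return phys;
    }
};

int main() {
    RenameModule rename;
    const int TARGET_WIDTH = 4;                 // model a 4-wide target machine
    uint8_t dests[TARGET_WIDTH] = {1, 2, 1, 3}; // arch dest regs of this cycle's rename group

    // One target cycle = TARGET_WIDTH invocations of the same narrow module.
    for (int slot = 0; slot < TARGET_WIDTH; ++slot) {
        uint16_t p = rename.rename_one(dests[slot]);
        std::printf("slot %d: r%d -> p%d\n", slot, dests[slot], p);
    }
}

The same time-for-space trade applies to caches: a single-ported tag lookup can be invoked once per way per target cycle to model higher associativity.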

Test of size 1/17/08RAMP Retreat 10 Microcode Compiler Intention Automate generation of new ISA instructions Automatically retarget new micro-architectures Necessary for x86 Uses the LLVM Compiler Infrastructure developed at UIUC Compile the Bochs CPU model Bochs is another portable x86 full-system emulator. Backend retargeted to micro-op ISA Generates microcode that “runs” on the timing model Over 99% dynamic inst coverage for most INT benchmarks floating point instructions not yet supported Average 1.27 uOps per handled dynamic x86 instruction Microcode ISA is simple load/store with some x86 specialization Build a functional model as a processor executing microcode ISA? Nikhil Patil

Test of size 1/17/08RAMP Retreat 11 Prototype Overview Software functional model Eventually hardware functional model, but software sim exists FPGA-based timing model written in Bluespec Complex OoO micro-architecture fits in a single FPGA DRC or XUP trace Functional Model Software Timing Model Bluespec HDL ProcessorFPGA DRC Computer HT Xilinx FPGA PowerPC 405

1/17/08 RAMP Retreat 12 Current Prototype Functional Model Derived from QEMU: fast (JIT), boots Linux and Windows; supports x86, x86-64, PowerPC, Sparc, ARM, MIPS, ... The prototype currently supports x86. Added tracing and rollback (implemented with checkpoints), including I/O (keyboard, mouse, video). Hosts: x86 machines, or a PowerPC inside an FPGA. Dam Sunwoo, Jeb Keefe

Test of size 1/17/08RAMP Retreat 13 Current Prototype Timing Model: OOO Superscalar Joonsoo Kim, Nikhil Patil, Bill Reinhart, Eric Johnson

Test of size 1/17/08RAMP Retreat 14 Current Simulator Performance on DRC Includes Operating System Code

Test of size 1/17/08RAMP Retreat 15 FAST Future Work Improve performance (currently TM bottlenecked) TM (5MIPS-20MIPS) FM (10MIPS-100MIPS) Optimize simulator (10MIPS-20MIPS) Hardware FM (FPGA ~100MIPS)  Simple pipe running microcode ISA  Multithreaded (Protoflex) to improve throughput Real processor??? (debug for trace, transactional for rollback?) FAST-MP Need MP host PowerPC ISA support (this month) Power estimation capabilities in FPGAs Don’t slow down FAST performance

1/17/08 RAMP Retreat 16 FAST and RAMP-Classic: A Comparison
Cores: FAST builds arbitrary, complex cores from a simple core + rollback + TM; RAMP needs full RTL that fits in the FPGA.
ISAs/OSs: FAST supports an arbitrary ISA/OS (x86/Windows, PowerPC, ...); RAMP depends on the core (PowerPC, Sparc).
Accuracy: FAST is RTL cycle-accurate, with host cycles >= target cycles enabling reuse of hardware resources; RAMP is only as accurate as the target RTL available (unless a timing model is used), with host cycle == target cycle.
Speed: FAST trades complexity for resources; RAMP runs as fast as the FPGA will run.
Scalability: FAST depends on FPGA resources (TM costs); RAMP depends on FPGA resources.

1/17/08 RAMP Retreat 17 FAST-MP: RAMP as a functional model How to do FAST-MP? Need a multicore host! RAMP will not be able to accurately model an Intel Core X micro-architecture, unless it becomes a much simpler in-order pipeline, and even then, will Intel give us RTL? But a RAMP processor can execute the ISA. Target ISA == host ISA: add trace and rollback capabilities. Target ISA != host ISA: simulate the target ISA (hardware support for target ISA, trace, rollback?). RAMP then becomes a FAST functional model; use a timing model to predict arbitrary behavior.

Test of size 1/17/08RAMP Retreat 18 FAST & RAMP-White Shared Infrastructure FAST connectors as White connectors Stats, timing FAST modules as White modules Quad ported RAM CAM/Cache on top of quad-ported RAM Multi-host cycle caches Branch predictors Load/store units, bus interface units Interconnection networks Eventually, full processors? Executing microcode, software cracking? FAST TM as RAMP TM At that point, RAMP == FAST-MP

1/17/08 RAMP Retreat 19 Conclusions Split functional/timing can be cycle-accurate: roll back (FAST) or timing-directed (timing-first?) (HASim). We believe that rollback can be done relatively cheaply. It can be done in software at about an order of magnitude performance impact, and hardware support appears to be reasonable: log old values and play back (easy for a simple core), or a transactional processor? Simple cores + rollback as the host for the functional model? Virtualization for more threads is orthogonal, but multiplicative in resources.

Test of size 1/17/08RAMP Retreat 20 Another View of FAST Old processor/system modeling new processor/system Leverage the fact that most of the time, functionality and order is identical Differences in timing between old and new modeled in timing model Roll back/re-execute to deal with differences Use current generation host as functional model for next generation target? Trace implemented by debug support Roll back implemented with transactional support

1/17/08 RAMP Retreat 21 Host == Target Modules in FAST FAST can leverage host modules that are full and accurate implementations of the corresponding target modules, e.g. host TLB == target TLB, with memory requests issued when the timing model schedules them.

1/17/08 RAMP Retreat 22 FAST-MP: Mixing [Diagram: processors (P) and intersection units (IU)]

Test of size 1/17/08RAMP Retreat 23 What Is A FAST Functional Model? Requirements Fast Full System Generates instruction trace Supports rollback Hardware functional models (very fast) Real processor doesn’t support trace/rollback FPGA implementation difficult to make complete x86, boots Windows? Software functional models exist today Bochs, QEMU, Simics, SimNow, SimOS, etc. Relatively fast, full system Run on fastest hardware we know about to execute an ISA Can be modified to generate trace/support rollback

1/17/08 RAMP Retreat 24 Step 1: Improving Performance via Parallelization Parallel slowdown due to communication? The FM runs ahead speculatively, so round-trip communication is infrequent: a round trip is needed only when (functional path != timing path). Microprocessors have the same problem: multiple issue and deep pipelines only work if the predicted path is correct. The FM is like a perfect front end of the processor, and the real uArch (TM) slows it down, so the better the target micro-architecture, the faster the simulator. [Diagram: the Functional Model (ISA + peripherals) and the Timing Model (micro-architecture) exchange a trace on the host]
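A hypothetical C++ sketch of that producer/consumer structure (thread and queue names are illustrative; the real FM/TM boundary crosses from software to the FPGA): the FM thread fills a bounded trace queue ahead of the TM thread, and a round trip back to the FM is needed only on a path mismatch.

#include <condition_variable>
#include <cstdint>
#include <deque>
#include <iostream>
#include <mutex>
#include <thread>

// Bounded trace queue between the FM thread (producer) and the TM thread (consumer).
struct BoundedTrace {
    std::deque<uint64_t> q;            // traced PCs (stand-in for full trace records)
    std::mutex m;
    std::condition_variable cv;
    static constexpr size_t CAP = 1024;

    void push(uint64_t pc) {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [&] { return q.size() < CAP; });
        q.push_back(pc);
        cv.notify_all();
    }
    uint64_t pop() {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [&] { return !q.empty(); });
        uint64_t pc = q.front(); q.pop_front();
        cv.notify_all();
        return pc;
    }
};

int main() {
    BoundedTrace trace;
    const int N = 100000;

    std::thread fm([&] {                       // functional model: runs ahead speculatively
        uint64_t pc = 0x1000;
        for (int i = 0; i < N; ++i) { trace.push(pc); pc += 4; }
    });
    std::thread tm([&] {                       // timing model: drains the trace
        uint64_t cycles = 0;
        for (int i = 0; i < N; ++i) {
            uint64_t pc = trace.pop();
            (void)pc;                          // a round trip back to the FM happens only on mismatch
            cycles += 1;                       // model one (ideal) target cycle per instruction
        }
        std::cout << "target cycles: " << cycles << "\n";
    });
    fm.join();
    tm.join();
}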

Test of size 1/17/08RAMP Retreat 25 Current Prototype Functional Model Derived from QEMU Fast (JIT), boots Linux, Windows Supports x86, x86-64, PowerPC, Sparc, ARM, MIPS, … Prototype currently supports x86 Added tracing, rollback (implemented with checkpoint) Including I/O (keyboard, mouse, video) Hosts x86 machines PowerPC inside of an FPGA PowerPC target by January about 1 month to port Dam Sunwoo, Jeb Keefe

Test of size 1/17/08RAMP Retreat 26 Current Prototype Timing Model Joonsoo Kim, Nikhil Patil, Bill Reinhart, Eric Johnson

Test of size 1/17/08RAMP Retreat 27 Microcode Compiler Intention Automate generation of new ISA instructions Automatically retarget new micro-architectures Necessary for x86 Uses the LLVM Compiler Infrastructure developed at UIUC Compile the Bochs CPU model Bochs is another portable x86 full-system emulator. Backend retargeted to micro-op ISA Generates microcode that “runs” on the timing model Over 99% dynamic inst coverage for most INT benchmarks floating point instructions not yet supported Average 1.27 uOps per handled dynamic x86 instruction Nikhil Patil

Test of size 1/17/08RAMP Retreat 28 Current Simulator Performance on DRC Includes Operating System Code

Test of size 1/17/08RAMP Retreat 29 Performance Details Timing model is current bottleneck 100MHz host cycle (not pushing timing) Currently taking ~30 host (FPGA) cycles per target cycle, max about 54 cycles (currently max latency defines target clock) BP is a simple gshare predictor Functional model Unoptimized modified QEMU With perfect BP, immediate return from TM, 5.4MIPS FM/TM communication 469ns blocking read from Opteron on DRC (has gotten better) Poll every other basic block 13ns/word for burst write
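As a rough sanity check on these numbers (my arithmetic, not from the slide): at a 100MHz host clock and ~30 host cycles per target cycle, the TM sustains a few million target cycles per second, which at an IPC near one lands in the low-MIPS range quoted elsewhere in the deck.

#include <cstdio>

int main() {
    // Back-of-the-envelope estimate from the figures above (illustrative only).
    const double host_hz = 100e6;                  // 100 MHz FPGA host clock
    const double host_cycles_per_target = 30.0;    // ~30 host cycles per simulated target cycle
    const double target_cycles_per_sec = host_hz / host_cycles_per_target;
    std::printf("target cycles/s: %.2f M\n", target_cycles_per_sec / 1e6);  // ~3.3M
    // At roughly 0.5-1.5 instructions per target cycle, the TM-limited simulation
    // rate is a few MIPS, consistent with the 1.2-5.4 MIPS figures in the deck.
}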

1/17/08 RAMP Retreat 30 Some Related Work (there is a lot) Software: functional/timing partitioned. Asim, current M5, Timing-First, and Opal are all timing-model driven: the timing model tells the functional model what to do and when to do it. FastSim (Schnarr et al., ASPLOS 98): functional/timing split with rollback when functional path != timing path, but instrumented binaries, not parallelized, no hardware. Hardware: HASim, Hardware ASim (Emer et al.): timing-first, seven points of communication between FM & TM, requires an infinitely-renamed out-of-order FM, currently supports a simplified MIPS ISA.

Test of size 1/17/08RAMP Retreat 31 Conclusions/Future Work It works Current FAST simulator prototype 1.2MIPS (unoptimized), about 1000 times slower than target Timely: during architecture phase Complete: runs Windows, Linux Transparent: extensive, hardware-based stats Relatively inexpensive, easy to build and extend (Some) future work Optimize 5MIPS soon, 10MIPS-20MIPS later (hardware FM using uCode?) More realistic timing model & calibration Tattler: automatic bottleneck detection CMP/SMP targets

Test of size 1/17/08RAMP Retreat 32 FAST Relative Speeds (Current Prototype)

Test of size 1/17/08RAMP Retreat 33 X86 Micro-Op Coverage

Test of size 1/17/08RAMP Retreat 34 Number of uOps/x86 Instruction

1/17/08 RAMP Retreat 35 Outline
What is FAST (1 minute): targeted properties.
Demo (30 sec): runs x86, Windows at speeds fast enough to interact.
How? Start with the partition: proven strategy (FastSim); rollback to handle branch mis-speculation; simplifies full-system capabilities, since the complexity of the full system is encapsulated in the functional model; the functional model can be a full-system simulator or a processor; the FM passes a trace to the TM; the TM is simple, so it can model complex structures at fairly low weight.
Improve performance: parallelize on the FM/TM boundary; round-trip communication is infrequent, permitting the FM to run ahead speculatively. Doesn't mis-speculation slow things down? Learn from computer architecture (microcoded, pipelined, OoO): speculation permits a computer system to run faster, and the key is not to mis-speculate often. Parallelize on the functional/timing boundary; this handles branch prediction and anything where the functional path is not equal to the target path; rollback is required from the FM. The bottleneck is the timing model. Parallelize the TM? Difficult to do in software; practical limitations due to the number of processors that communicate quickly. Hardware (FPGA) is ideal: parallelizes nicely; the TM partition simplifies hardware; latency tolerant, infrequent round-trips.
Prototype: overview of the prototype; software functional model running on a processor host; TM in Bluespec on the FPGA. Describe some details: DRC or XUP; FM description; TM description; block diagram; microcode compiler.
Performance: relative performance; bottlenecks.
Future work.
Conclusions.

1/17/08 RAMP Retreat 36 Functional Model Modifications Need to: roll back and force a branch; roll back, restore, and continue. How? set_pc(inst_num, pc): set a particular dynamic instance of an instruction to the instruction pointed to by PC. This is sufficient, and can be implemented with checkpoints of ISA state, memory, and peripherals.
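A hypothetical C++ sketch of a checkpoint-based set_pc (all names invented; the real FM checkpoints full ISA state, memory, and peripherals, while this toy state is only a PC and a few registers): restore the nearest earlier checkpoint, replay forward to the requested dynamic instruction, then force the requested PC.

#include <cstdint>
#include <map>

// Toy architectural state; a real FM also checkpoints memory and peripherals.
struct State { uint64_t pc = 0x1000; uint64_t regs[16] = {0}; };

class CheckpointingFM {
    State s_;
    uint64_t inst_num_ = 0;
    std::map<uint64_t, State> ckpts_;          // inst_num -> saved state
    static constexpr uint64_t INTERVAL = 1000; // checkpoint every 1000 instructions

    void execute_one() {                       // toy "instruction": bump a register, advance PC
        s_.regs[s_.pc % 16] += 1;
        s_.pc += 4;
        ++inst_num_;
    }
public:
    void step() {
        if (inst_num_ % INTERVAL == 0) ckpts_[inst_num_] = s_;
        execute_one();
    }
    // set_pc(inst_num, pc): make dynamic instruction `inst_num` fetch from `pc`.
    // Assumes a checkpoint exists at or before inst_num (one is always taken at inst 0).
    void set_pc(uint64_t inst_num, uint64_t pc) {
        auto it = ckpts_.upper_bound(inst_num); // first checkpoint AFTER inst_num...
        --it;                                   // ...so the previous one is at or before it
        s_ = it->second;
        inst_num_ = it->first;
        ckpts_.erase(ckpts_.upper_bound(inst_num_), ckpts_.end());  // drop now-stale checkpoints
        while (inst_num_ < inst_num) execute_one();  // replay up to the rollback point
        s_.pc = pc;                             // ...then force the requested PC
    }
};

int main() {
    CheckpointingFM fm;
    for (int i = 0; i < 5000; ++i) fm.step();
    fm.set_pc(4321, 0x2000);                    // e.g. the TM forces a mis-speculated path
    for (int i = 0; i < 100; ++i) fm.step();
}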

Test of size 1/17/08RAMP Retreat 37

Test of size 1/17/08RAMP Retreat 38 Why

Test of size 1/17/08RAMP Retreat 39 Why Are Simulators Slow? Complexity 40 entry fully-assoc iTLB 40 entry fully-assoc dTLB 16-way L2 cache Many schedulers Multiple Decode Out-of-order issue I/O Multiple cores, processors Modeling PARALLELISM Can we parallelize simulator?

Test of size 1/17/08RAMP Retreat 40 Parallelizing Simulators Software Starting from 10KHz sequential simulator, 10MHz performance requires 1K processors Impractical If a software simulator could be parallelized, use same techniques to make the target faster? Hardware It’s what processors are made from Plenty of parallelism Difficult to implement? FPGAs for configurability

Test of size 1/17/08RAMP Retreat 41 FPGA Modeling Complex Processors on FPGA(s) Compile RTL for FPGA Pentium fits in large FPGA 3.1M transistors (Lu, Intel) Issues Fit: Core 2 in a single FPGA? Impossible: Processors grow as fast as FPGAs Multiple FPGAs to model one processor Full RTL required A lot of RTL Difficult to cover all cases Difficult to modify FPGA Pentium Pentium Core 2

Test of size 1/17/08RAMP Retreat 42 Strawman Solution Partition Break problem down into multiple pieces Target module boundaries obeyed? Replace some/all RTL with pre-written modules and/or easier to write (behavioral/software) code Hybrid simulation Partitioned simulator, each partition running in potentially different host technology

1/17/08 RAMP Retreat 43 Simple Module-based Software/Hardware Partitioning Partitioning on module boundaries over FPGAs and software, e.g. Simplescalar + an FPGA L1 DCache (Suh, WARFP 2006). [Diagram: the pipelined datapath with instruction $/mem, GPR file, immediate extend, ALU, data $/memory, and bypass] IS THERE A BETTER PARTITIONING?

1/17/08 RAMP Retreat 44 FAST [Diagram: the Functional Model (ISA: instructions, architectural registers, peripheral functionality) passes an inst trace to the Timing Model (micro-architecture: fetch, decode, rename, reservation stations, scheduling window, reorder buffer), implemented on the FPGA]

Test of size 1/17/08RAMP Retreat 45 More Complexity Caches/TLBs? Keep tags, pass address (virtual and physical if necessary) Hits, misses determined but don’t need data Superscalar (multiple issue)? “Fetch and issue” multiple instructions assuming they meet boundary constraints Multiple “functional units” Schedulers Reorder buffer/instruction window Pipeline control along with instructions NO DATAPATH only part of control path
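A hypothetical C++ sketch of the "keep tags, pass addresses" idea (structure parameters and latencies are invented): the timing model's cache stores only valid bits and tags, decides hit or miss, and charges a latency; the data itself stays in the functional model.

#include <cstdint>
#include <cstdio>
#include <vector>

// Tags-only cache timing model: no data array, only enough state to decide hit/miss.
class CacheTimingModel {
    struct Way { bool valid = false; uint64_t tag = 0; };
    std::vector<std::vector<Way>> sets_;       // [set][way]
    unsigned set_bits_, line_bits_;
public:
    CacheTimingModel(unsigned set_bits, unsigned ways, unsigned line_bits)
        : sets_(1u << set_bits, std::vector<Way>(ways)),
          set_bits_(set_bits), line_bits_(line_bits) {}

    // Returns the access latency in target cycles for this (physical) address.
    unsigned access(uint64_t paddr) {
        uint64_t set = (paddr >> line_bits_) & ((1ull << set_bits_) - 1);
        uint64_t tag = paddr >> (line_bits_ + set_bits_);
        auto& ways = sets_[set];
        for (auto& w : ways)
            if (w.valid && w.tag == tag) return 3;   // hit latency (illustrative)
        ways[0].valid = true;                         // miss: fill, trivial replacement policy
        ways[0].tag = tag;
        return 3 + 100;                               // miss: add a memory latency (illustrative)
    }
};

int main() {
    CacheTimingModel l1(/*set_bits=*/6, /*ways=*/4, /*line_bits=*/6);  // 64 sets x 4 ways x 64B lines
    std::printf("first access:  %u cycles\n", l1.access(0x12340));
    std::printf("second access: %u cycles\n", l1.access(0x12344));    // same line -> hit
}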

1/17/08 RAMP Retreat 46 A Timing Model [Diagram: iTLB, iCache, dTLB, dCache, Align & Pick, Decode, Sched, and L2 Cache blocks, plus memory & I/O timing models, fed by the Functional Model] Can Easily Execute in Parallel in Hardware

Test of size 1/17/08RAMP Retreat 47 Trace-Driven Simulation? Described a trace-driven simulator Accurate if “functional path” == “target path” “Ideal” processor Unfortunately, functional path is not always equal to timing path Even for simple pipelines Real processors speculatively execute, assuming their target path is the functional path

1/17/08 RAMP Retreat 48 Branch Prediction [Diagram: the same timing-model pipeline (iTLB, iCache, dTLB, dCache, Align & Pick, Decode, Sched, L2 Cache, memory & I/O timing models), now with wrong-path instructions flowing through it] Implement BP in the timing model: the timing model forces the ISA simulator to mis-speculate, then rollback and restore. Branch predictor in the ISA simulator? BP only works in a processor if it is fairly accurate; FAST simulators take advantage of the fact that most of the time the micro-architecture is on the right path. Most complexity (BP, parallelism) can be handled this way.

Test of size 1/17/08RAMP Retreat 49 Speculative Simulation FAST simulators themselves are speculative Speculate functional path == target path Timing model detects functional path != target path Forces functional model down wrong path OR Returns functional model to functional path Speculation reduces need for round-trip communication Makes FAST latency tolerant (good for parallelization) Functional model reusable Timing model determines target behavior

Test of size 1/17/08RAMP Retreat 50 FAST Tool Flow Will be using Intel AWB for stats, etc.

Test of size 1/17/08RAMP Retreat 51 Performance Details Functional model Unoptimized modified QEMU Driving an FPGA-based timing model Standard producer/consumer parallel communication problems 469ns blocking round-trip from Opteron to FPGA Poll every other basic block BP is a simple gshare predictor Timing model (unoptimized) 100MHz host cycle right now (not pushing timing) Currently taking ~30 host cycles per target cycle, max about 54 cycles (currently max latency defines target clock) Complex timing model fits into single FPGA

Test of size 1/17/08RAMP Retreat 52 Outline Motivation FAST: Parallelized complex core, full-system simulator RAMP-White: Parallel hosts running parallel targets

Test of size 1/17/08RAMP Retreat 53 RAMP-White Requirements Coherent shared memory experimental platform Configurable coherence protocol, engine Scalable to the same level as other RAMP machines 1K eventual target Down to 2 Full system (OS, I/O, etc.) Intentions ISA/Architecture independent (like all RAMP efforts) Use different cores Integrate components from other RAMP participants A test-bed for sharing IP

1/17/08 RAMP Retreat 54 Texas Modifications to RAMP-White New code in Bluespec rather than Verilog/VHDL: many advantages including interfaces and configurability; my group's hardware development is exclusively Bluespec; free/low cost to academics. Start with the XUP board (we had XUP before BEE2). Support Leon3 and PowerPC: started with the embedded PowerPC but hit Linux and non-MP core issues, so restarted with Leon3 (Linux, MP-capable core) to get the RAMP-White infrastructure to work; plan to port back to embedded PowerPC, and it should be easy to move to a soft PowerPC.

Test of size 1/17/08RAMP Retreat 55 High-Level Architecture Philosophy Flexibility Avoid wasted work Easy changes Module-agnostic Processors, network, I/O, etc. Interfaces Complete set of necessary interfaces All communication via messages Fixed fields, but fields are configurable “shims” connect components to White infrastructure Use existing IP

1/17/08 RAMP Retreat 56 RAMP-White Block Diagram [Diagram: Network Router, Intersection Unit (IU), Memory Controller (MC), IO & Platform Devices, Processor, Network Interface (NIU), Coherent $; the processor-dependent portion is marked]

1/17/08 RAMP Retreat 57 Three Phase Approach to Hardware H1: Incoherent shared memory No hardware global cache, just global shared memory support Optional cache for local memory However, software can maintain coherence if necessary Network virtual memory Run a simulator on top of the processor Ring network H2: Ring-based coherence (scalable bus) Requires a coherent cache, IU awareness Running what is essentially a snoopy protocol True coherence engine not required But, very restricted communication Sufficient for testing, modeling many targets H3: General network-based coherence Requires general coherence engine, general network H4: Different cores [Diagram: the per-phase node organization built from processors (P), caches ($), coherent caches (C$), intersection units (IU), memory controllers (MC), and I/O]

Test of size 1/17/08RAMP Retreat 58 Operating System Issues with SMP OS on embedded PowerPC Incoherent cache Load-reservation/store-conditional instructions not MP capable Also missing TLB Invalidation & OpenPIC (interprocessor interrupts, bring-up) How scalable anyways? (1K processors) Four phase approach using RAMP-White hardware O1: Separate OS per core (PowerPC) working for XUP Region of memory is global (mmap) Locks using regular loads/stores + sequential consistency O2: SMP OS on Leon3 Use RAMP-White scalable hardware across multiple FPGAs Snoopy cache O3: SMP OS on Leon3 using directory cache O4: SMP OS on PowerPC port

Test of size 1/17/08RAMP Retreat 59 Programmer View Sequential consistency PowerPC Global addresses labeled as uncached  Ordered accesses from PowerPC 405 Coherent global cache still uncached from processor Soft cores can be weaker User interface Terminal per core/OS if desired Mmap to map shared memory

Test of size 1/17/08RAMP Retreat 60 H1/O1 RAMP-White Hari Angepat did the work Components Written in Bluespec NIU code complete and tested 2 processor ring (PowerPC) IU code complete and tested Processor Slave (no coherence right now) PLB Master/slave interface (I/O) NIU interface Hardware intended to target different ISAs PLB master and slave shims written Some preliminary OS work Multi-image mmap interface running

Test of size 1/17/08RAMP Retreat 61 Current RAMP-White Phase 1 Intersection Unit (IU) IO & Platform Devices PPC 405/ Leon3 Network Interface (NIU) Memory Controller (MC) PLB shim Intersection Unit (IU) PPC 405/ Leon3 Network Interface (NIU) Linux

Test of size 1/17/08RAMP Retreat 62 Our Long Term Plans H1/O1, XUP working June 2007 PowerPC With multi-OS, limited device support H2/O2 BEE2, Leon3 Coherent cache, IU forwarding modifications SMP H3/O3 BEE2/BEE3, Leon3 Arbitrary network, cache coherency engine Getting network from Washington, Berkeley H4/O4 BEE2/BEE3, PowerPC Port back to PowerPC

Test of size 1/17/08RAMP Retreat 63 RAMP Conclusions RAMP-White architecture Phased approach minimizes wasted work Designed to be easy to modify for your purpose Many architectures only require modified coherence engine, maybe cache ISA/implementation agnostic Care taken to not be specific RAMP-White Phase 1 works Running on XUP Phase 2 close to working Running on BEE2

1/17/08 RAMP Retreat 64 Future Work: FAST + RAMP-White FAST is currently not scalable: need parallel functional models and parallel timing models, with the parallel functional model running on a parallel host to maintain speed. The intention is to run FAST on top of RAMP-White, modifying the soft core to support checkpoint and rollback in hardware. FAST provides cycle-accurate simulation of complex cores, RAMP provides a high-performance, scalable host, and FAST+RAMP provides a scalable complex-core simulator.

Test of size 1/17/08RAMP Retreat 65 Acknowledgements Students Dam Sunwoo (FM) Joonsoo Kim (TM, Interface) Nikhil Patil (TM, tools) Bill Reinhart (TM connector) Hari Angepat (RAMP-White) Eric Johnson (verification, Linux) Funding DOE Early Career, NSF, SRC Intel, IBM, Xilinx, Freescale Software Bluespec Open-source full system simulators QEMU, Bochs

© Derek Chiou 66 Extra slides

1/17/08 RAMP Retreat 67 What is in a Trace? Conceptually, everything a functional model can produce: flattened opcode, virtual/physical address of instruction, virtual/physical address of data, source registers, destination registers, condition code source and destination registers, exceptions, etc. Can be heavily compressed, e.g., a simulator TLB to avoid sending the physical address.
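A hypothetical C++ layout of one (uncompressed) trace record following the list above; the field names and widths are illustrative, and the real encoding is heavily compressed as noted.

#include <cstdint>

// One trace record, roughly everything the FM can tell the TM about an instruction.
struct TraceRecord {
    uint64_t inst_vaddr;     // virtual address of the instruction
    uint64_t inst_paddr;     // physical address of the instruction
    uint32_t flat_opcode;    // flattened (decoded) opcode / micro-op
    uint64_t data_vaddr;     // virtual address of the memory operand, if any
    uint64_t data_paddr;     // physical address of the memory operand, if any
    uint8_t  src_regs[3];    // source architectural registers
    uint8_t  dst_regs[2];    // destination architectural registers
    uint8_t  cc_src, cc_dst; // condition-code source/destination
    uint8_t  exception;      // exception raised, if any
    uint64_t next_vaddr;     // where control flow goes next (for path checking)
};

int main() { TraceRecord r{}; (void)r; }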

Test of size 1/17/08RAMP Retreat 68 Modules Modules predict what happens each cycle Which instruction is scheduled? Modules hierarchically constructed CAM, FIFOs, arbiters, etc. Branch predictors, Caches, TLBs, Schedulers, ALUs Fetch, Decode, Rename, RS, ROB Many are essentially wires (e.g., ALU) Often written to execute one operation (simplifies) E.g., Rename, Cache Executed multiple times per target cycle for wider processor, higher associativity Simplifies implementation, tradeoff time for space Joonsoo Kim, Nikhil Patil

1/17/08 RAMP Retreat 69 Connector Interface Commit: indicates the module is done with the target cycle. Done: indicates the Connector is done with the target cycle. Enq, First, Deq: just like a FIFO. [Diagram: producer port (Enq, Commit, Done) and consumer port (Deq, Commit, First, Done)] Bill Reinhart
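A hypothetical C++ rendering of that interface (the real connectors are Bluespec, and each port carries its own Commit/Done signals; this sketch flattens them to one flag per side): the FIFO half moves transactions between modules, and the commit/done half marks target-cycle boundaries.

#include <cstdint>
#include <deque>
#include <stdexcept>

template <typename T>
class Connector {
    std::deque<T> q_;
    bool producer_committed_ = false, consumer_committed_ = false;
public:
    // FIFO-like half of the interface.
    void enq(const T& x) { q_.push_back(x); }
    const T& first() const {
        if (q_.empty()) throw std::runtime_error("empty connector");
        return q_.front();
    }
    void deq() { q_.pop_front(); }

    // Target-cycle handshake: each side commits when it is done with this target cycle;
    // the connector is "done" once both sides have committed.
    void commit_producer() { producer_committed_ = true; }
    void commit_consumer() { consumer_committed_ = true; }
    bool done() const { return producer_committed_ && consumer_committed_; }
    void next_target_cycle() { producer_committed_ = consumer_committed_ = false; }
};

int main() {
    Connector<uint32_t> c;
    c.enq(42);                 // producer module pushes a transaction
    c.commit_producer();       // ...and declares it is done with this target cycle
    uint32_t v = c.first();    // consumer module examines and consumes it
    c.deq();
    c.commit_consumer();
    if (c.done()) c.next_target_cycle();
    (void)v;
}

Throughput, minimum delay, and maximum outstanding transactions would be additional connector parameters in the real design, as described on the Connectors slides.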

Test of size 1/17/08RAMP Retreat 70 32b Address in Shared Memory Machine?? 4GB possible per BEE2 FPGA Need more than 32b Eventually, hope for 64b soft-core processors For now two options: live with 4GB space Or, provide one more layer of translation Physical address in certain region is global virtual address Translated by hardware to node + physical address Also useful for multiple OSs in single memory OSs tend to assume they own physical address 0
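A hypothetical C++ sketch of that extra translation layer (window base, window size, and node numbering are invented): physical addresses inside the designated global window are decomposed by hardware into a node number plus the physical address on that node.

#include <cstdint>
#include <cstdio>

struct GlobalAddr { uint32_t node; uint64_t paddr; bool is_remote; };

constexpr uint64_t GLOBAL_BASE     = 0x80000000ull;  // start of the global window (illustrative)
constexpr uint64_t PER_NODE_WINDOW = 0x10000000ull;  // 256MB of each node exposed globally (illustrative)

GlobalAddr translate(uint64_t local_paddr, uint32_t my_node) {
    if (local_paddr < GLOBAL_BASE)
        return {my_node, local_paddr, false};         // ordinary local physical address
    uint64_t off  = local_paddr - GLOBAL_BASE;
    uint32_t node = static_cast<uint32_t>(off / PER_NODE_WINDOW);
    uint64_t rpa  = off % PER_NODE_WINDOW;            // physical address on the owning node
    return {node, rpa, node != my_node};
}

int main() {
    GlobalAddr g = translate(GLOBAL_BASE + 3 * PER_NODE_WINDOW + 0x1234, /*my_node=*/0);
    std::printf("node %u, paddr 0x%llx, remote=%d\n",
                g.node, static_cast<unsigned long long>(g.paddr), g.is_remote);
}

The same indirection lets multiple OS images coexist, since each OS can keep believing it owns physical address 0 within its own node.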

1/17/08 RAMP Retreat 71 Node Architecture [Diagram: node organizations built from processors (P), caches ($), coherent caches (C$), intersection units (IU), memory controllers (MC), and I/O]

1/17/08 RAMP Retreat 72 Generalized Architecture [Diagram: Proc, $, IU, NIU, MC, Mem, OPB bridge, PLB; Intersection Unit and Network Interface Unit; processor-dependent and processor-independent portions marked]

Test of size 1/17/08RAMP Retreat 73 Intersection Unit Processor interface Slave Snoop Network interface Master (send) Slave (receive) Memory interface Master (issue memory requests) Hooks for coherency engine Bluespec nice to specify coherence engine Incoherent version is a special case Programmable memory regions Global (local and remote) Local translation Intersection Unit (IU) Memory Controller (MC) IO & Platform Devices Processor Network Interface (NIU) Coherent $

Test of size 1/17/08RAMP Retreat 74 Network Interface Unit Currently two virtual channels Split into two components Msg composition/Queuing Net transmit/receive Insert/extract for ring Intended to permit other net- specific transmit/receive One input/one output Creates a simple unidirectional ring Can interface to more advanced fabrics Intersection Unit (IU) Memory Controller (MC) IO & Platform Devices Processor Network Interface (NIU) Coherent $

Test of size 1/17/08RAMP Retreat 75 Sharing IP: Some Preliminary Experience We looked at RAMP-Red XUP Used some code (PLB master) Red-BEE is not ready to distribute Looking for switch code Berkeley’s code on CVS repository But, we can’t use memory controller because we don’t have BEE2 board yet Bluespec We are spinning almost all of our own code right now Would like to steal software OS (kernel proxy) SMP OS port Naming MPI reference design in BEE2 repository Is that RAMP-Blue? A central CVS repository for RAMP code?

1/17/08 RAMP Retreat 76 Sharing Over the Long Term Processor is shared Leon PowerPC MicroBlaze Everything else MC is shared Xilinx or Berkeley Coherent cache can be shared Transactional/traditional Borrow Stanford's? Coherency engine can be shared CMU/Stanford IU functionality can be shared Trying to make ours general NIU can be shared Borrow half from Berkeley? Network can be shared Borrow Berkeley's? [Diagram: Proc, $, IU, NIU, MC, Mem, peripherals, CCE]

Test of size 1/17/08RAMP Retreat 77 IU Internal Message Defaults PRI: High priority, Low priority CMD: Read, Write, Coherence, … PERM: Modified, Exclusive, Shared, Invalid SIZE: Byte, word, double word, cache-line GADDR: global address (translated by IU) DATA: dependent on size Bluespec permits easy modification for your protocol PRICMDPERMSIZETAG GADDR DATA

Test of size 1/17/08RAMP Retreat 78 Network Message PRI: High and Low DEST,SRC: destination, source of message SIZE: Total message size NETTAG: network tag (optional) CMD: network command (optional) MESSAGE: data PRIDESTSRCNETTAGCMD MESSAGE SIZE
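Hypothetical C++ structs mirroring the two message layouts on this and the previous slide (the real formats are Bluespec types whose field widths are configurable; widths here are invented):

#include <cstdint>
#include <vector>

enum class Pri  : uint8_t { High, Low };
enum class Cmd  : uint8_t { Read, Write, Coherence /* ... */ };
enum class Perm : uint8_t { Modified, Exclusive, Shared, Invalid };

struct IUMessage {                 // IU-internal message: PRI CMD PERM SIZE TAG GADDR DATA
    Pri      pri;
    Cmd      cmd;
    Perm     perm;
    uint8_t  size;                 // byte, word, double word, cache line
    uint16_t tag;
    uint64_t gaddr;                // global address, translated by the IU
    std::vector<uint8_t> data;     // length depends on size
};

struct NetMessage {                // network message: PRI DEST SRC SIZE NETTAG CMD MESSAGE
    Pri      pri;
    uint16_t dest, src;
    uint16_t size;                 // total message size
    uint16_t nettag;               // optional network tag
    uint8_t  cmd;                  // optional network command
    std::vector<uint8_t> payload;  // e.g. a serialized IUMessage
};

int main() { IUMessage m{Pri::High, Cmd::Read, Perm::Shared, 3, 7, 0x1000, {}}; (void)m; }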

1/17/08 RAMP Retreat 79 Intersection Unit Internals [Diagram: Intersection Unit Controller with BRAMs and Global Address Translation hardware, ports to Proc, IO, and Net on each side, connected to the Memory Controller & DRAM Controller]

© Derek Chiou 80 Fast, Full-System, Cycle-Accurate Computer Simulators via Parallelization Derek Chiou University of Texas at Austin Electrical and Computer Engineering

Test of size 1/17/08RAMP Retreat 81 My Ideal Computer Simulator Fast: as fast as possible 2-3 orders of magnitude slower than target? Fast enough to run real datasets to completion Interactive? Timely: enough time to make decisions Accurate: produce cycle-accurate numbers Complete: run unmodified operating systems, applications,… Transparent: full visibility, no performance hit Inexpensive: need thousands Flexible: quick changes, generate from RTL

1/17/08 RAMP Retreat 82 Current Software Simulators Performance-Detail tug-of-war: higher performance means fewer simulation tasks, higher detail means more simulation tasks. Simulator: slowdown / time to simulate 2 minutes of target time:
ISA (SimNow, Simics): minutes
Caches: hours
OOO (Asim, ~1-10 KIPS): 10K-1M / 1 year
RTL: 100M-1B / 2000 years
CMP: ????

Test of size 1/17/08RAMP Retreat 83 Why So Slow? Complexity 40 entry fully-assoc iTLB 40 entry fully-assoc dTLB 16-way L2 cache Many schedulers Decoder I/O Modeling PARALLELISM Parallelize simulator? Sampling, benchmarking are orthogonal

Test of size 1/17/08RAMP Retreat 84 Outline Motivation How to Parallelize? Functional Models and Timing Models Status, Conclusions

Test of size 1/17/08RAMP Retreat 85 Parallelizing Simulators Software Starting from 10KHz sequential simulator, 10MHz performance requires 1K processors Impractical If a software simulator could be parallelized, use same techniques to make the target faster? Hardware It’s what processors are made from Plenty of parallelism Difficult to implement? FPGAs for configurability

Test of size 1/17/08RAMP Retreat 86 FPGA Modeling Complex Processors on FPGA(s) Compile RTL for FPGA Pentium fits in large FPGA 3.1M transistors (Lu, Intel) Issues Fit: Core 2 in a single FPGA? Impossible: Processors grow as fast as FPGAs Multiple FPGAs to model one processor Full RTL required A lot of RTL Difficult to cover all cases Difficult to modify FPGA Pentium Pentium Core 2

Test of size 1/17/08RAMP Retreat 87 Strawman Solution Partition Break problem down into multiple pieces Target module boundaries obeyed? Replace some/all RTL with pre-written modules and/or easier to write (behavioral/software) code Hybrid simulation Partitioned simulator, each partition running in potentially different host technology

1/17/08 RAMP Retreat 88 Simple Module-based Software/Hardware Partitioning Partitioning on module boundaries over FPGAs and software, e.g. Simplescalar + an FPGA L1 DCache (Suh, WARFP 2006). [Diagram: the pipelined datapath with instruction $/mem, GPR file, immediate extend, ALU, data $/memory, and bypass] IS THERE A BETTER PARTITIONING?

Test of size 1/17/08RAMP Retreat 89 Our Partitioning: Functionality/Timing Boundaries Proven Software Partitioning Asim, FastSim, etc. Factorized, not partitioned! Functional simulators exist Timing model becomes very simple Promotes reuse Functional/Timing Simplifies timing model Latency tolerant Separate FM & TM Software functional model? Hardware timing model? Balance time/space Functional Model (ISA + peripherals) Timing Model (Micro-architecture) Instructions Architectural registers Peripheral functionality ….. Caches Arbitration Pipelining Associativity …. Inst trace

Test of size 1/17/08RAMP Retreat 90 Partition on ISA/Timing Proven Partitioning Asim, Simplescalar, Timing-First, FastSim, etc. Simplifies simulator. Promotes reuse Same performance in software Asim at 10KHz Most of the time spent in timing model! Hardware??? Functional Model (ISA) Timing Model (Micro-architecture) Instructions Architectural registers Peripheral functionality ….. Fetch Decode Rename Reservation stations Scheduling window Reorder buffer …. Inst trace

Test of size 1/17/08RAMP Retreat 91 FAST Functional Model (ISA) Timing Model (Micro-architecture) Instructions Architectural registers Peripheral functionality ….. Fetch Decode Rename Reservation stations Scheduling window Reorder buffer …. Inst trace FPGA

1/17/08 RAMP Retreat 92 What is in a Trace? Conceptually, everything a functional model can produce: flattened opcode, virtual/physical address of instruction, virtual/physical address of data, source registers, destination registers, condition code source and destination registers, exceptions, etc. Can be heavily compressed, e.g., a simulator TLB to avoid sending the physical address.

1/17/08 RAMP Retreat 93 What is a FAST Timing Model? [Diagram: a classic pipelined datapath (PC, instruction memory, GPR file, immediate extend, ALU, data memory, bypass/interlock) driven by the instruction trace] Stats gathering in hardware => no performance impact

Test of size 1/17/08RAMP Retreat 94 More Complexity Caches/TLBs? Keep tags, pass address (virtual and physical if necessary) Hits, misses determined but don’t need data Superscalar (multiple issue)? “Fetch and issue” multiple instructions assuming they meet boundary constraints Multiple “functional units” Schedulers Reorder buffer/instruction window Pipeline control along with instructions NO DATAPATH only part of control path

1/17/08 RAMP Retreat 95 A Timing Model [Diagram: iTLB, iCache, dTLB, dCache, Align & Pick, Decode, Sched, and L2 Cache blocks, plus memory & I/O timing models, fed by the Functional Model]

1/17/08 RAMP Retreat 96 Driving a Timing Model [Diagram: the same timing-model pipeline (iTLB, iCache, dTLB, dCache, Align & Pick, Decode, Sched, L2 Cache, memory & I/O timing models) fed by the Functional Model] Can Easily Execute in Parallel in Hardware

Test of size 1/17/08RAMP Retreat 97 Trace-Driven Simulation? What we’ve described is a trace-driven simulator Accurate if “functional path” == “target path” “Ideal” processor Unfortunately, functional path is not always equal to timing path Real processors speculatively execute, assuming their target path is the functional path

1/17/08 RAMP Retreat 98 Branch Prediction [Diagram: the same timing-model pipeline (iTLB, iCache, dTLB, dCache, Align & Pick, Decode, Sched, L2 Cache, memory & I/O timing models), now with wrong-path instructions flowing through it] Implement BP in the timing model: the timing model forces the ISA simulator to mis-speculate, then rollback and restore. Branch predictor in the ISA simulator? BP only works in a processor if it is fairly accurate; FAST simulators take advantage of the fact that most of the time the micro-architecture is on the right path. Most complexity (BP, parallelism) can be handled this way.

Test of size 1/17/08RAMP Retreat 99 Speculative Simulation FAST simulators themselves are speculative Speculate functional path == target path Timing model detects functional path != target path Forces functional model down wrong path OR Returns functional model to functional path Speculation reduces need for round-trip communication Makes FAST latency tolerant (good for parallelization) Functional model reusable Timing model determines target behavior

Test of size 1/17/08RAMP Retreat 100 A Parallel Simulator Functional runs in parallel with timing Functional model can run in parallel OoO superscalar processor? Timing model runs in parallel in hardware Hard to not run in parallel Every register-to-register transition could potentially be parallelized

Test of size 1/17/08RAMP Retreat 101 Outline Motivation How to Parallelize? Functional Models, Timing Models and Tools Status, Related Work, Conclusions

Test of size 1/17/08RAMP Retreat 102 Functional Models Hardware functional models Difficult to make complete (use Protoflex methods?) Size Software functional models exist today Fast Full system Very usable Bochs, QEMU, Simics, SimNow, SimOS, etc. Runs on fastest hardware we know about to execute an ISA Can we use them? What modifications?

1/17/08 RAMP Retreat 103 Functional Model Modifications Need to: roll back and force a branch; roll back, restore, and continue. How? set_pc(inst_num, pc): set a particular dynamic instance of an instruction to the instruction pointed to by PC. This is sufficient, and can be implemented with checkpoints of ISA state, memory, and peripherals.

Test of size 1/17/08RAMP Retreat 104 Our Functional Model Derived from QEMU Fast Boots Linux, Windows x86, x86-64, PowerPC, Sparc, ARM, MIPS, … We support x86 for now Added tracing, checkpoint, rollback Runs on x86 hosts as well as embedded PowerPC host inside FPGA PowerPC in the works Dam Sunwoo

Test of size 1/17/08RAMP Retreat 105 Modular Timing Models Modules Models timing functionality Rename, caches, etc. Connectors Attach modules together Abstract timing from modules Throughput (input, output), delay, maxTransactions Stats and tracing

Test of size 1/17/08RAMP Retreat 106 Modules Modules predict what happens each cycle Which instruction is scheduled? Modules hierarchically constructed CAM, FIFOs, arbiters, etc. Branch predictors, Caches, TLBs, Schedulers, ALUs Fetch, Decode, Rename, RS, ROB Many are essentially wires (e.g., ALU) Often written to execute one operation (simplifies) E.g., Rename, Cache Executed multiple times per target cycle for wider processor, higher associativity Simplifies implementation, tradeoff time for space Joonsoo Kim, Nikhil Patil

Test of size 1/17/08RAMP Retreat 107 Connector Interface Commit Indicates module is done with target cycle Done Indicates Connector is done with target cycle Enq, First, Deq Just like FIFO Enq Commit Done DeqCommitFirstDone Bill Reinhart

Test of size 1/17/08RAMP Retreat 108 Current Timing Model

Test of size 1/17/08RAMP Retreat 109 FAST Tool Flow Will be using Intel AWB for stats, etc.

Test of size 1/17/08RAMP Retreat 110 Outline Motivation How to Parallelize? Functional Models and Timing Models Status, Related Work, Conclusions

Test of size 1/17/08RAMP Retreat 111 Execution Platforms DRC Computer Opteron + FPGA on HT Xilinx development boards XUP, ML-310 (embedded PowerPC) Research Accelerator for MP (RAMP) A shared infrastructure for MP development Uses BEE2/BEE3 FPGA boards 4/5 large FPGAs and fast off-board I/O Large collaboration between Berkeley, CMU (Hoe), MIT, Stanford, Texas, Washington and Intel.

Test of size 1/17/08RAMP Retreat 112 Simulator Performance on DRC Includes Operating System Code

Test of size 1/17/08RAMP Retreat 113 Comparison to Related Work

Test of size 1/17/08RAMP Retreat 114 Acknowledgements Students Dam Sunwoo (FM) Joonsoo Kim (TM, Interface) Nikhil Patil (TM, tools) Bill Reinhart (TM connector) Eric Johnson (verification, Linux) Funding DOE Early Career, NSF, SRC Intel, IBM, Xilinx, Freescale Software Bluespec Open-source full system simulators QEMU, Bochs

Test of size 1/17/08RAMP Retreat 115 Conclusions Fast: Expect 2MIPS-10MIPS soon 2-4 orders of magnitude slower than target Interactive Timely: quick model building Accurate: produce cycle-accurate numbers Complete: runs Linux/Windows + apps now, compile apps on standard machines Transparent: Hardware stats provides full visibility, no performance hit Inexpensive: $300/$600 XUP boards can fit many timing models $9K (academic) DRC is still cost effective per cycle Flexible: tools

© Derek Chiou 116 Backup Slides

Test of size 1/17/08RAMP Retreat 117 Connectors Abstract timing information from modules Cannot do so perfectly (state within modules) Characteristics Throughput (input and output), minimum delay, maximum outstanding Entries are equal and fully shiftable If you have distinct lanes, need separate connectors Provide stats gathering, tracing 512 entry trigger-based trace Stats funneled BACK through connector helps with place-and-route (Nikhil Patil) Bill Reinhart

1/17/08 RAMP Retreat 118 Example Connector Interface Code (in ALU Module)
module mkALU#(ConsumerPort#(Tuple2#(Maybe#(PReg_t), ROBTag_t)) inQ,
              ProducerPort#(Tuple2#(Maybe#(PReg_t), Execute2ROB_t)) outQ) (ALU);
  rule pass;
    inQ.deq;
    match {.dest, .robtag} = inQ.first;
    outQ.enq(tuple2(dest, Execute2ROB_t{robtag: robtag, exception: Invalid}));
  endrule
  rule done (inQ.done || outQ.done);
    inQ.commit(True);
    outQ.commit(True);
  endrule
endmodule
ALU latency defined by Connector min-delay!

Test of size 1/17/08RAMP Retreat 119 FAST Tool Flow Will be using AWB for stats, etc.

1/17/08 RAMP Retreat 120 Microcode Compiler Uses the LLVM Compiler Infrastructure developed at UIUC. Compiles the Bochs CPU model (Bochs is another portable x86 full-system emulator); the backend is retargeted to the micro-op "ISA". The intention is to automate generation of new ISA instructions and to automatically retarget new micro-architectures. Generates microcode that "runs" on the timing model. Nikhil Patil

1/17/08 RAMP Retreat 121 Example Microcode Compile: the "ADD r/m32, r32" instruction.
void BX_CPU_C::ADD_EdGd(bxInstruction_c *i)
{
  Bit32u op2_32, op1_32, sum_32;
  op2_32 = BX_READ_32BIT_REG(i->nnn());
  if (i->modC0()) {
    op1_32 = BX_READ_32BIT_REG(i->rm());
    sum_32 = op1_32 + op2_32;
    BX_WRITE_32BIT_REGZ(i->rm(), sum_32);
  }
  else {
    read_RMW_virtual_dword(i->seg(), RMAddr(i), &op1_32);
    sum_32 = op1_32 + op2_32;
    write_RMW_virtual_dword(sum_32);
  }
  SET_FLAGS_OSZAPC_32(op1_32, op2_32, sum_32, BX_INSTR_ADD32);
}
Generated microcode when ModRM != 0xC0:
  %u0 = ADDRGEN
  LOADd %v0:[%u0] -> %u1
  cc,%u1 = add %u1, %nnn
  STOREd %v0:[%u0] <- %u1
When ModRM == 0xC0:
  cc,%rm = add %rm, %nnn

1/17/08 RAMP Retreat 122 RTL to Timing Model [Diagram: the pipelined datapath (PC, instruction memory, GPR file, immediate extend, ALU, data memory, bypass/interlock) driven by the trace] The timing model perfectly models the RTL. Verification??? But do you really want to do it this way?

Test of size 1/17/08RAMP Retreat 123 Traces to Timing Model Interface provides ability for functional model to pass instruction trace to timing model Storage that timing model components can read Compression TLB to eliminate physical address Cache static information (src/dest registers, opcode, etc.) Joonsoo Kim
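A hypothetical C++ sketch of the compression idea (the struct layout and the "first occurrence carries the static info" protocol are invented): static per-PC information is cached on both sides so repeat instructions send only dynamic fields, and a simulator-side TLB can likewise stand in for physical addresses so they need not be transmitted at all.

#include <cstdint>
#include <cstdio>
#include <unordered_map>

struct StaticInfo { uint32_t flat_opcode; uint8_t srcs[3]; uint8_t dsts[2]; };

struct CompressedRecord {
    uint64_t pc;
    bool     has_static;     // true only the first time this PC is sent
    StaticInfo stat;         // valid when has_static
    uint64_t data_vaddr;     // dynamic fields are always sent
};

class TraceSender {
    std::unordered_map<uint64_t, StaticInfo> seen_;   // mirrors the receiver's static-info cache
public:
    CompressedRecord emit(uint64_t pc, const StaticInfo& s, uint64_t data_vaddr) {
        auto it = seen_.find(pc);
        if (it != seen_.end()) return {pc, false, {}, data_vaddr};  // static info already cached
        seen_[pc] = s;
        return {pc, true, s, data_vaddr};
    }
};

int main() {
    TraceSender tx;
    StaticInfo add{0x01, {1, 2, 0}, {3, 0}};
    CompressedRecord a = tx.emit(0x1000, add, 0xdead0);  // first time: full record
    CompressedRecord b = tx.emit(0x1000, add, 0xbeef0);  // second time: compressed
    std::printf("first has_static=%d, second has_static=%d\n", a.has_static, b.has_static);
}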

Test of size 1/17/08RAMP Retreat 124 Adding Modules Easy to add modules to improve accuracy No functionality, only timing E.g., DRAM model Model banks, row register, control timing, refresh Results in some requests taking longer If modeling memory controller, reorder memory operations as well

Test of size 1/17/08RAMP Retreat 125 Long Term Plans/Impact CMP/SMP support Early power estimation (with Freescale) Simulator available for software development/tuning before hardware True co-development Performance/power revs of software Design derived from simulator Write cycle-accurate simulator, automatically generate design (at RTL + libraries level) Change the way computer systems are designed and evaluated

1/17/08 RAMP Retreat 126 Demo Compile code, run on the simulator. BP: perfect vs. gshare; see the difference in simulator performance. #ALUs: 1 vs. 8. [Diagram: pipeline with Fetch, Decode, Rename, R/S, ROB, BP, Br, ALU, TLB, Ld/st, L1$, L2$]

Test of size 1/17/08RAMP Retreat 127 Better Partitioning? Traditional partitioning on module boundaries Timing and functionality Is there a better way? 64b adder cannot be implemented as a single monolithic entity But, 64 1b adders very tractable  Can compose a 64b adder from 64 x 1b adders Is there a better partitioning for simulation?

Test of size 1/17/08RAMP Retreat 128 Current/Future Work Finish uni-processor version Very close, running on one platform Debug rest of pipeline Shared/coherent bus MP version Porting QEMU to MP host MP timing model Hardware functional model Tool chain

Test of size 1/17/08RAMP Retreat 129 FPGA Resources for TM OOO, branch prediction, ROB, two ALUs, 1 branch unit, 1 load/store unit (32 entry load/store queue), iTLB, dTLB 7221 slices (52% of a 2VP30) 9 block RAMs (6% of a 2VP30) NOTE: large structures have not yet been mapped to block RAMs (trace buffer, ROB) Configurable cache model (old Verilog version) 32KB 4-way set associative cache with 16B cache-lines 165 slices (1% of a 2VP30) 17 block RAMs (12% of a 2VP30) 2MB 4-way set-associative cache with 64B cache-lines 140 slices (1% of a 2VP30) 40 block RAMs (29% of a 2VP30) 2VP30 is an old FPGA found on a $600 list-price XUP board Current FPGAs 10 times as much logic (330K logic cells (2VP30 is around 30K)) 3 to 4 times as much block RAM

Test of size 1/17/08RAMP Retreat 130 Current Limitations No datapath Data speculation requires datapath Control path/data path crossover Informing loads, Query loads Can support But, can require additional communication Quite possible

Test of size 1/17/08RAMP Retreat 131 FAST Tool Flow

1/17/08 RAMP Retreat 132 RTL to Timing Model [Diagram: the pipelined datapath (PC, instruction memory, GPR file, immediate extend, ALU, data memory, bypass/interlock) driven by the trace] The timing model perfectly models the RTL. Verification???

1/17/08 RAMP Retreat 133 Hardware Related Work FAST is the first FPGA-based functional/timing-model-partitioned simulator that we know of. HASim (Emer et al.). FPGA-based processor emulators: academic (RAMP, MIT UMUM, Scale, Stanford FAST, ATLAS, CMU, ...) and Intel (Shih-Lien Lu): !Flexible (huge effort), !Accurate (old processors, not modeling everything).

Test of size 1/17/08RAMP Retreat 134 Software Related Work FastSim: Schnarr, Larus (1998) Direct execution with rollback, memoization for speed Simplescalar accuracy Emer, et al: Asim Intel cycle-accurate, about 10KHz PTLSim: Yourst Full system x86, about 200KHz Seems similar to Simplescalar in terms of accuracy Uses Xen to fast forward

Test of size 1/17/08RAMP Retreat 135 Simulator Users and Tradeoffs Architects: e.g., Matt Performance/Power/Reliability Timely, ~Accurate, Flexible !Fast, !Complete Software: e.g., Wen-Mei Development, tuning Fast, OS !Accurate, !Flexible, !Timely Implementation (RTL) Correctness Accurate, ~Timely !Fast, !Complete

Test of size 1/17/08RAMP Retreat 136 Stolen from Impossible????

Test of size 1/17/08RAMP Retreat 137 Microprocessors One instruction executes to completion before next starts Implementation is different leading to different performance

Test of size 1/17/08RAMP Retreat 138 Microprocessors In-Order Front Out-of-Order Middle In-Order End

1/17/08 RAMP Retreat 139 Current Software Simulators Performance-Detail tug-of-war: higher performance means fewer simulation tasks, higher detail means more simulation tasks. Simulator: slowdown / time to simulate 2 minutes of target time:
ISA (SimNow, Simics): minutes
Caches: hours
OOO (Asim, ~1-10 KIPS): 10K-1M / 1 year
RTL: 100M-1B / 2000 years
CMP: ????

1/17/08 RAMP Retreat 140 Why Build? (Anant) Software won't work unless you are building hardware Motivation for software tools Large data sets Hard problems better understood and show up once you begin building Have to solve hard problems More radical the idea, more important it is to build World only trusts end-to-end results Cycle simulator only becomes accurate after hardware gets precisely defined Needed for commercialization

Test of size 1/17/08RAMP Retreat 141 Functional Models Hardware functional models Difficult to make complete (use Protoflex methods?) Size Software functional models exist today Fast Full system Very usable Bochs, QEMU, Simics, SimNow, SimOS, etc. Runs on fastest hardware we know about to execute an ISA Can we use them? What modifications?