FPGA-based Fast, Cycle-Accurate Full System Simulators
Derek Chiou, Huzefa Sanjeliwala, Dam Sunwoo, John Xu and Nikhil Patil
University of Texas at Austin
Wouldn’t it be nice to have a simulator that is
- Fast: 10M cycles per second, fast enough to run real datasets to completion
- Accurate: produces cycle-accurate numbers
- Complete: runs real operating systems and applications
- Transparent: can see everything in the processor, with no performance hit
- Inexpensive: we need thousands of them
- Usable: quick changes, easy to see performance
Software?
- Software-based simulators inherently cannot achieve this speed and be cycle-accurate at the same time
  - A 128-entry, fully associative TLB at the limit requires 128 load and compare operations
  - Arbitration requires first looking across multiple bidders
  - There are lots of these structures in a complex processor!
- Thousands to tens of thousands of events
- Even with perfect parallelism, need a lot of CPUs
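To make the TLB point concrete, here is a minimal sketch of what a software simulator has to do for one fully associative lookup. The function name `tlb_lookup` and the `(vpn, ppn)` entry layout are assumptions made for illustration; they are not taken from any particular simulator. Hardware performs all 128 compares in parallel in one cycle, while this loop performs them one after another.

```python
from typing import List, Optional, Tuple

TLB_ENTRIES = 128  # size quoted on the slide


def tlb_lookup(tlb: List[Tuple[int, int]], vpn: int) -> Tuple[Optional[int], int]:
    """Search every entry in turn.

    Returns (physical page number or None on a miss, compares performed).
    """
    compares = 0
    for entry_vpn, ppn in tlb:  # one load and one compare per entry
        compares += 1
        if entry_vpn == vpn:
            return ppn, compares
    return None, compares


# Hypothetical TLB contents: virtual page n maps to physical page n + 0x1000.
tlb = [(vpn, vpn + 0x1000) for vpn in range(TLB_ENTRIES)]
```

A hit in the last entry, or any miss, costs the full 128 sequential compares, and this is only one of many such structures evaluated every simulated cycle.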
Hardware
- Clearly, hardware is necessary
- Reconfigurability (read: FPGAs) is required for flexibility
- But how?
Full Implementation?
- Take RTL code, compile for FPGA
- Implementing a full system in an FPGA is prohibitively large
  - Shih-Lin Lu’s group has fit a single original Pentium (586, 3.1M transistors) in the largest Xilinx FPGA
  - Emulate a Pentium M in a single FPGA? 140M transistors
- Instead, what about
  - Accurately (to cycle resolution) simulating its behavior
  - Running real, unmodified applications and OS
  - With full visibility at full speed?
- If execution speeds are reasonable, do I care?
Can I Partition the Problem?
- A 64b adder is way too big to be implemented as a single monolithic entity
- But I can implement 64 1b adders very easily, with very little state and complexity
- Partitioning is good if possible
- But how to partition?
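The adder analogy can be sketched in a few lines: a 64-bit add built from 64 copies of a 1-bit full adder, each with almost no state. The names `full_adder` and `ripple_add` are illustrative, and a ripple-carry chain is assumed here only because it is the simplest way to compose the 1-bit pieces.

```python
from typing import Tuple


def full_adder(a: int, b: int, cin: int) -> Tuple[int, int]:
    """1-bit full adder: returns (sum bit, carry out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def ripple_add(x: int, y: int, width: int = 64) -> Tuple[int, int]:
    """Add two width-bit values by chaining 1-bit adders.

    Returns (width-bit result, final carry out).
    """
    carry = 0
    result = 0
    for i in range(width):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result, carry
```

For example, `ripple_add(5, 7)` produces the same sum as a monolithic adder, while each partition only ever handles three input bits.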
Classic Partitioning
- On module boundaries
  - Caches, memories, ALUs, processors, memory controllers
- Partitioning doesn’t save state or complexity, but enables the design to be partitioned over multiple FPGAs and software
- Problems?
[Figure: pipelined datapath showing PC, instruction cache/memory, GPR file, immediate extend, ALU, data cache/memory, and bypass paths]
Functional/Timing Partition
- Functional model simulates the ISA
- Timing model simulates the micro-architecture
- Asim and SimpleScalar are written like this
  - Software
  - One processor
  - Lots of interaction between functional and timing models
  - Intended to avoid rollback of any component
- Put the timing model in an FPGA???
  - Parallel component executed in hardware!
UT FAST Partitioning
- On the ISA/micro-architecture boundary (ISA + FPGA)
- Instruction trace generated by an ISA simulator (e.g., Bochs, Simics)
  - Fast, full system, but no timing information (could be hardware!!!)
- What do we need to simulate in the timing model?
[Figure: pipelined datapath driven by the instruction trace, showing PC, instruction memory, GPR file, immediate extend, ALU, data memory, and bypass paths]
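The split can be illustrated with a toy pair of models. This is a sketch under stated assumptions, not the FAST implementation: `TraceEntry`, `functional_model`, and `timing_model` are hypothetical names, the trace format is invented, and the timing model is a single-issue in-order pipeline with a one-cycle load-use stall. The point is that the timing side consumes only the trace and never computes data values.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional, Sequence, Tuple


@dataclass(frozen=True)
class TraceEntry:
    pc: int
    op: str                 # "alu" or "load" in this toy ISA
    dest: Optional[int]     # destination register number
    srcs: Tuple[int, ...]   # source register numbers


def functional_model(program: Sequence[tuple]) -> Iterator[TraceEntry]:
    """Stand-in for the ISA simulator: emits instructions in execution order."""
    for index, (op, dest, srcs) in enumerate(program):
        yield TraceEntry(index * 4, op, dest, tuple(srcs))


def timing_model(trace: Iterable[TraceEntry], load_use_penalty: int = 1) -> int:
    """Count issue cycles from the trace alone; no register values are needed."""
    cycles = 0
    prev = None
    for entry in trace:
        cycles += 1
        # Stall when an instruction reads the result of the preceding load.
        if prev is not None and prev.op == "load" and prev.dest in entry.srcs:
            cycles += load_use_penalty
        prev = entry
    return cycles


program = [("load", 1, [2]), ("alu", 3, [1, 4]), ("alu", 5, [3, 3])]
total_cycles = timing_model(functional_model(program))
```

Because the two halves communicate only through the trace, either one could be replaced (for instance, by hardware) without changing the other.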
UT FAST Complex Processors
- Straight pipelines are easy; what about
- Caches/TLBs?
  - Keep tags, pass address (virtual and physical if necessary)
  - Hits and misses are determined, but the data is not needed
- Superscalar (multiple issue)?
  - “Fetch and issue” multiple instructions, assuming they meet boundary constraints
  - Multiple “functional units”
  - Reservation stations
  - Reorder buffer
- Pipeline control along with instructions
- NO DATAPATH!!!
- Timing model speed almost unimportant!
  - Multi-cycle memories to create more ports
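The "keep tags, don't need data" idea can be sketched as a cache model that stores only tags. The class name `TagOnlyCache`, the direct-mapped organization, and the default sizes are assumptions chosen for brevity; a real timing model would match the target's associativity and replacement policy.

```python
class TagOnlyCache:
    """Direct-mapped cache model that tracks tags only, never data."""

    def __init__(self, num_sets: int = 64, line_bytes: int = 32):
        self.num_sets = num_sets
        self.line_bytes = line_bytes
        self.tags = [None] * num_sets  # one tag per set, no data array
        self.hits = 0
        self.misses = 0

    def access(self, addr: int) -> bool:
        """Return True on a hit, False on a miss (the tag is then filled)."""
        line = addr // self.line_bytes
        index = line % self.num_sets
        tag = line // self.num_sets
        if self.tags[index] == tag:
            self.hits += 1
            return True
        self.tags[index] = tag
        self.misses += 1
        return False
```

The model needs only the address stream from the trace to decide hit or miss, which is all the timing model requires to charge the right latency.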
Example of Complication: Branch Prediction
- Must process mis-speculated instructions in the timing model
- Implement BP in the timing model
  - Timing model forces the ISA simulator to mis-speculate
  - Rollback, restore
  - Requires support from the ISA simulator
- Branch predictor predictor in the ISA simulator?
- BP only works in a processor if it’s fairly accurate
  - FAST simulators take advantage of the fact that, most of the time, the micro-architecture is on the right path
  - Most complexity (BP, parallelism) can be handled this way
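The force-then-rollback flow can be sketched as follows. Everything here is hypothetical: `FunctionalModel`, its `checkpoint`/`rollback`/`step` methods, and `resolve_branch` are invented for illustration and do not correspond to the Bochs or FAST interfaces. On a misprediction the timing side steers the functional model down the wrong path for a few instructions, then restores the checkpoint and resumes on the correct path.

```python
from typing import List, Optional, Tuple


class FunctionalModel:
    """Toy ISA simulator whose whole state is a PC and one register."""

    def __init__(self) -> None:
        self.pc = 0
        self.reg = 0

    def checkpoint(self) -> Tuple[int, int]:
        return (self.pc, self.reg)

    def rollback(self, snapshot: Tuple[int, int]) -> None:
        self.pc, self.reg = snapshot

    def step(self, forced_pc: Optional[int] = None) -> int:
        """Execute one instruction, optionally redirected to forced_pc."""
        if forced_pc is not None:
            self.pc = forced_pc
        executed = self.pc
        self.reg += 1   # stand-in for an architectural side effect
        self.pc += 4
        return executed


def resolve_branch(fm: FunctionalModel, predicted: int, actual: int,
                   wrong_path_len: int = 3) -> List[Tuple[str, int]]:
    """Return the instruction PCs the timing model sees, wrong path included."""
    seen = []
    if predicted != actual:
        snapshot = fm.checkpoint()
        pc = predicted
        for _ in range(wrong_path_len):
            seen.append(("wrong", fm.step(forced_pc=pc)))
            pc = None                 # keep running sequentially on the wrong path
        fm.rollback(snapshot)         # undo wrong-path side effects
    seen.append(("right", fm.step(forced_pc=actual)))
    return seen
```

When the prediction is correct, which is the common case, no checkpoint is restored and the functional model simply keeps running ahead.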
Status & Conclusions
- 1MHz to 100MHz, cycle-accurate, full-system, multiprocessor simulator
  - Well, not quite that fast right now, but we are using an embedded 300MHz PowerPC 405 to simplify
- x86, boots Linux and Windows, targeting Pentium D-like processors and beyond (Dam Sunwoo, Nikhil Patil)
  - Bochs functional model (looking at much faster models)
  - Heavily modified (instruction trace and rollback)
- Branch-predicted superscalar model almost done in Bluespec and Verilog (John Xu, Huzefa Sanjeliwala)
  - Have straight pipeline 486 model with TLBs and caches
- Statistics gathered in hardware
  - Very little, if any, probe effect
- Tools to semi-automate micro-architectural and ISA-level exploration
  - Orthogonality of models makes both simpler