Closely-Coupled Timing-Directed Partitioning in HAsim
Michael Pellauer†, Murali Vijayaraghavan†, Michael Adler‡, Arvind†, Joel Emer†‡
† MIT CS and AI Lab, Computation Structures Group
‡ Intel Corporation, VSSAD Group
To appear in: ISPASS 2008
Motivation
We want to simulate target platforms quickly. We also want to construct simulators quickly.
Partitioned simulators are a known technique from traditional performance models:
[Diagram: Timing Partition (micro-architecture, resource contention, dependencies) interacting with Functional Partition (ISA, off-chip communication)]
- Simplifies the timing model
- Amortizes functional-model design effort over many timing models
- The functional partition can be extremely FPGA-optimized
Different Partitioning Schemes
As categorized by Mauer, Hill, and Wood (source: [MAUER 2002], ACM SIGMETRICS).
We believe that a timing-directed solution, with both partitions on the FPGA, will ultimately lead to the best performance.
Functional Partition in Software (Asim)
- Get Instruction (at a given address)
- Get Dependencies
- Get Instruction Results
- Read Memory *
- Speculatively Write Memory * (locally visible)
- Commit or Abort instruction
- Write Memory * (globally visible)
* Optional, depending on instruction type
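A minimal C++ sketch of what this operation interface looks like to the timing partition. The class, type, and method names here are illustrative assumptions, not the actual Asim/HAsim API; the operations simply mirror the list above.

```cpp
// Illustrative sketch of the functional-partition operation interface.
// Names and types are assumptions, not the real Asim/HAsim API.
#include <cstdint>

using Addr  = uint64_t;
using Token = uint32_t;            // per-instruction handle (see "getToken" later)

struct Inst       { uint32_t bits; };
struct Deps       { /* source/destination register names, etc. */ };
struct ExecResult { Addr nextPC; bool taken; };

class FunctionalPartition {
public:
    virtual ~FunctionalPartition() = default;

    virtual Inst       getInstruction(Token t, Addr pc) = 0;  // fetch at a given address
    virtual Deps       getDependencies(Token t)         = 0;  // decode
    virtual ExecResult getResults(Token t)              = 0;  // execute the datapath
    virtual void       readMemory(Token t)              = 0;  // optional: loads
    virtual void       specWriteMemory(Token t)         = 0;  // optional: locally visible stores
    virtual void       commit(Token t)                  = 0;  // retire the instruction, or...
    virtual void       abort(Token t)                   = 0;  // ...undo a wrong-path instruction
    virtual void       writeMemory(Token t)             = 0;  // optional: globally visible stores
};
```

A timing model simulates a target cycle by invoking some subset of these operations for each in-flight instruction, in the phase order described on the next slide.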
Execution in Phases
[Diagram: example instruction streams expressed as phase sequences (F = Fetch, D = Decode, X = eXecute, R = Read memory, W = Write memory, C = Commit, A = Abort), e.g. F-D-X-R-C, F-D-X-W-C-W, F-D-X-C, F-D-X-R-A]
The Emer Assertion: all data dependencies can be represented via these phases.
Detailed Example: 3 Different Timing Models
Executing the same instruction sequence:
Functional Partition in Hardware?
Requirements:
- Support these operations in hardware
- Allow for out-of-order execution, speculation, and rollback
Challenges:
- Minimize operation execution times
- Pipeline wherever possible
- Trade off between BRAMs and multiported RAMs
- Race conditions due to extreme parallelism
Functional Partition as a Pipeline
Conveys the concept well, but poor performance.
[Diagram: functional partition as a pipeline (Token Gen, Fet, Dec, Exe, Mem, LCom, GCom) with the RegFile/register state and memory state, driven by the timing model]
Implementation: Large Scoreboards in BRAM
- A series of tables in BRAM stores information about each in-flight instruction
- Tables are indexed by a "token", which the timing partition also uses to refer to each instruction
- New operation "getToken" allocates a space in the tables
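A minimal sketch of the token-indexed tables, assuming they can be modeled as plain arrays; the sizes, field layouts, and names are illustrative, not the HAsim implementation (which holds these tables in FPGA BRAM).

```cpp
// Illustrative model of token-indexed scoreboard tables with getToken allocation.
#include <array>
#include <cstddef>
#include <cstdint>
#include <optional>

constexpr std::size_t MAX_IN_FLIGHT = 256;   // depth of each scoreboard table
using Token = uint16_t;

struct FetchEntry  { uint64_t pc; uint32_t inst; };
struct DecodeEntry { uint8_t srcs[2]; uint8_t dest; };
struct ExecEntry   { uint64_t result; bool taken; };

class Scoreboards {
    // One "BRAM" table per phase, all indexed by the same token.
    std::array<FetchEntry,  MAX_IN_FLIGHT> fetchTbl{};
    std::array<DecodeEntry, MAX_IN_FLIGHT> decodeTbl{};
    std::array<ExecEntry,   MAX_IN_FLIGHT> execTbl{};
    std::array<bool,        MAX_IN_FLIGHT> busy{};    // allocation bitmap
    Token next = 0;

public:
    // getToken: allocate a free slot; the timing partition then uses this
    // token to name the instruction in every subsequent operation.
    std::optional<Token> getToken() {
        for (std::size_t i = 0; i < MAX_IN_FLIGHT; ++i) {
            Token t = static_cast<Token>((next + i) % MAX_IN_FLIGHT);
            if (!busy[t]) { busy[t] = true; next = (t + 1) % MAX_IN_FLIGHT; return t; }
        }
        return std::nullopt;                          // every slot is in flight
    }
    void freeToken(Token t) { busy[t] = false; }      // on commit or abort

    FetchEntry&  fetch (Token t) { return fetchTbl[t];  }
    DecodeEntry& decode(Token t) { return decodeTbl[t]; }
    ExecEntry&   exec  (Token t) { return execTbl[t];   }
};
```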
Implementing the Operations See paper for details (also extra slides)
Assessment: Three Timing Models
- Unpipelined target
- 5-Stage Pipeline
- MIPS R10K-like out-of-order superscalar
Assessment: Target Performance
Targets have an idealized memory hierarchy.
Assessment: Simulator Performance
Some correspondence between the target and the functional partition is very helpful.
Assessment: Reuse and Physical Stats
Where is functionality implemented?
[Table: for each design (Unpipelined, 5-Stage, Out-of-Order), the structures IMem, Program Counter, Branch Predictor, Scoreboard/ROB, Reg File, Maptable/Freelist, ALU, DMem, Store Buffer, Snapshots/Rollback are provided either by the timing model or by the Functional Partition; N/A where a design has no such structure]
FPGA usage (Virtex-IIPro 70, using ISE 8.1i):
Metric                  | Unpipelined | 5-Stage     | Out-of-Order
FPGA Slices             | 6,599 (20%) | 9,220 (28%) | 22,873 (69%)
Block RAMs              | 18 (5%)     | 25 (7%)     | –
Clock Speed             | 98.8 MHz    | 96.9 MHz    | 95.0 MHz
Average FMR             | –           | –           | –
Simulation Rate         | 2.4 MHz     | 14 MHz      | 6 MHz
Average Simulator IPS   | 2.4 MIPS    | 5.1 MIPS    | 4.7 MIPS
Future Work: Simulating Multicores
Scheme 1: Duplicate both partitions
[Diagram: each timing model (A–D) is paired with its own functional reg state + datapath, backed by functional memory state; interaction occurs between each pair]
Scheme 2: Cluster timing partitions
[Diagram: timing models A–D share a single functional reg state + datapath and functional memory state; interaction still occurs there]
Use a context ID to reference all state lookups (see the sketch below).
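A minimal sketch of the context-ID idea under Scheme 2, assuming the shared functional register state is simply replicated per context and every lookup is keyed by the issuing timing model's context ID; names and sizes are illustrative assumptions, not the HAsim implementation.

```cpp
// Illustrative context-keyed state lookup for a shared functional partition.
#include <array>
#include <cstddef>
#include <cstdint>

constexpr std::size_t NUM_CONTEXTS  = 4;   // e.g. one context per simulated core
constexpr std::size_t NUM_ARCH_REGS = 32;

using ContextId = uint8_t;
using RegVal    = uint64_t;

// One architectural register file per context, all held in the single
// shared functional partition; every access names its context.
class SharedRegState {
    std::array<std::array<RegVal, NUM_ARCH_REGS>, NUM_CONTEXTS> regs{};
public:
    RegVal read (ContextId c, unsigned r) const     { return regs[c][r]; }
    void   write(ContextId c, unsigned r, RegVal v) { regs[c][r] = v;    }
};

// Usage: timing model B (context 1) reads architectural register 5.
//   SharedRegState state;
//   RegVal v = state.read(/*context*/ 1, /*reg*/ 5);
```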
Future Work: Simulating Multicores
Scheme 3: Multiplex the timing models themselves
- Leverages HAsim A-Ports in the timing model
- Out of scope of today's talk
[Diagram: timing models A–D multiplexed over a single functional reg state + datapath and functional memory state; interaction still occurs there, with a context ID referencing all state lookups]
Future Work: Unifying with the UT-FAST model
UT-FAST is functional-first; this can be unified into timing-directed simply by doing "execute-at-fetch".
[Diagram: a functional emulator running in software produces an execution stream for the timing partition on the FPGA; wrong-path execution triggers a resteer of the emulator]
Summary
- Described a scheme for closely-coupled timing-directed partitioning in which both partitions are suitable for on-FPGA implementation
- Demonstrated such a scheme's benefits: very good reuse, very good area/clock speed, and a good FPGA-to-Model cycle Ratio
  - Caveat: assumes some correspondence between the timing model and the functional partition (recall the unpipelined target)
- We plan to extend this using contexts for hardware multiplexing [Chung 07]
- Future: rare complex operations (such as syscalls) could be done in software using virtual channels
Questions?
Extra Slides
Functional Partition Fetch
Functional Partition Decode
Functional Partition Execute
Functional Partition Back End
Timing Model: Unpipelined
5-Stage Pipeline Timing Model
Out-Of-Order Superscalar Timing Model