Slide 1: Closely-Coupled Timing-Directed Partitioning in HAsim
Michael Pellauer† (pellauer@csail.mit.edu), Murali Vijayaraghavan†, Michael Adler‡, Arvind†, Joel Emer†‡
† MIT CS and AI Lab, Computation Structures Group
‡ Intel Corporation, VSSAD Group
To appear in: ISPASS 2008
Slide 2: Motivation
- We want to simulate target platforms quickly
- We also want to construct simulators quickly
- Partitioned simulators are a known technique from traditional performance models
  [Diagram: Functional Partition (ISA, off-chip communication) interacting with Timing Partition (micro-architecture, resource contention, dependencies)]
- Simplifies the timing model
- Amortizes functional model design effort over many models
- The Functional Partition can be extremely FPGA-optimized
Slide 3: Different Partitioning Schemes
- As categorized by Mauer, Hill, and Wood (Source: [MAUER 2002], ACM SIGMETRICS)
- We believe that a timing-directed solution, with both partitions on the FPGA, will ultimately lead to the best performance
Slide 4: Functional Partition in Software (Asim)
- Get Instruction (at a given address)
- Get Dependencies
- Get Instruction Results
- Read Memory *
- Speculatively Write Memory * (locally visible)
- Commit or Abort Instruction
- Write Memory * (globally visible)
* Optional depending on instruction type
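A minimal sketch of what this set of operations could look like as a software interface; the class name, handle type, and signatures below are illustrative assumptions for exposition, not the actual Asim (or HAsim) API.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

using Addr        = std::uint64_t;
using InstrHandle = std::uint32_t;   // identifies one in-flight instruction

struct Dependencies {
    std::vector<unsigned> srcRegs;   // architectural source registers
    std::vector<unsigned> dstRegs;   // architectural destination registers
};

// Hypothetical sketch of the functional-partition operations listed on this
// slide. Names and signatures are illustrative assumptions, not Asim's API.
class FunctionalPartition {
public:
    virtual ~FunctionalPartition() = default;

    virtual InstrHandle getInstruction(Addr pc) = 0;                    // fetch at a given address
    virtual Dependencies getDependencies(InstrHandle h) = 0;            // sources and destinations
    virtual void getInstructionResults(InstrHandle h) = 0;              // execute the datapath op
    virtual std::optional<std::uint64_t> readMemory(InstrHandle h) = 0; // optional: loads
    virtual void speculativelyWriteMemory(InstrHandle h) = 0;           // optional: locally visible store
    virtual void commit(InstrHandle h) = 0;                             // commit the instruction
    virtual void abort(InstrHandle h) = 0;                              // abort / roll back
    virtual void writeMemory(InstrHandle h) = 0;                        // optional: globally visible store
};
```

The timing partition would then drive these calls in the phase order shown on the next slide.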
Slide 5: Execution in Phases
[Diagram: example instructions broken into phases corresponding to the operations on the previous slide: F (fetch), D (decode), X (execute), R (read memory), W (write memory), C (commit), A (abort)]
The Emer Assertion: All data dependencies can be represented via these phases
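Building on the hypothetical interface sketched after Slide 4, this is one way a simple unpipelined timing model might step through the phases for a single instruction; the helper name and the isLoad/isStore/misspeculated flags are assumptions standing in for information a real timing model would derive from decode and branch resolution.

```cpp
// Sketch of one instruction driven through the phases in order
// (F, D, X, R/W, then C or A). Uses the FunctionalPartition, Addr,
// InstrHandle, and Dependencies types from the sketch after Slide 4.
void stepUnpipelined(FunctionalPartition& fp, Addr pc,
                     bool isLoad, bool isStore, bool misspeculated) {
    InstrHandle h     = fp.getInstruction(pc);     // F: fetch
    Dependencies deps = fp.getDependencies(h);     // D: decode
    fp.getInstructionResults(h);                   // X: execute
    if (isLoad)  fp.readMemory(h);                 // R: read memory
    if (isStore) fp.speculativelyWriteMemory(h);   // locally visible write
    if (misspeculated) {
        fp.abort(h);                               // A: roll back the wrong path
    } else {
        fp.commit(h);                              // C: commit
        if (isStore) fp.writeMemory(h);            // W: globally visible write
    }
    (void)deps;  // an unpipelined model has no inter-instruction overlap to track
}
```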
Slide 6: Detailed Example: 3 Different Timing Models
Executing the same instruction sequence:
Slide 7: Functional Partition in Hardware?
Requirements:
- Support these operations in hardware
- Allow for out-of-order execution, speculation, rollback
Challenges:
- Minimize operation execution times
- Pipeline wherever possible
- Tradeoff between BRAM/multiport RAMs
- Race conditions due to extreme parallelism
Slide 8: Functional Partition As Pipeline
Conveys concept well, but poor performance.
[Diagram: the functional partition as a pipeline of stages (Token Gen, Fet, Dec, Exe, Mem, LCom, GCom) driven by the timing model, with the register state (RegFile) and memory state alongside]
Slide 9: Implementation: Large Scoreboards in BRAM
- Series of tables in BRAM store information about each in-flight instruction
- Tables are indexed by a "token", also used by the timing partition to refer to each instruction
- New operation "getToken" allocates a space in the tables
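A rough software analogue of token-indexed scoreboard tables with a getToken allocator; the field names, table size, and free-list mechanism are illustrative assumptions, and in the actual design these tables live in FPGA block RAM rather than software containers.

```cpp
#include <array>
#include <cstdint>
#include <deque>
#include <stdexcept>

constexpr std::size_t MAX_IN_FLIGHT = 256;
using Token = std::uint16_t;   // index into every per-instruction table

struct TokenTables {
    // One entry per in-flight instruction, all indexed by the same token.
    std::array<std::uint64_t, MAX_IN_FLIGHT> pc{};
    std::array<std::uint32_t, MAX_IN_FLIGHT> instruction{};
    std::array<std::uint64_t, MAX_IN_FLIGHT> result{};
    std::array<bool,          MAX_IN_FLIGHT> memOpDone{};

    std::deque<Token> freeList;   // tokens not currently in use

    TokenTables() {
        for (Token t = 0; t < MAX_IN_FLIGHT; ++t) freeList.push_back(t);
    }

    // "getToken": allocate a slot in the tables for a new in-flight instruction.
    Token getToken() {
        if (freeList.empty()) throw std::runtime_error("no free tokens");
        Token t = freeList.front();
        freeList.pop_front();
        return t;
    }

    // Commit or abort eventually returns the token for reuse.
    void freeToken(Token t) { freeList.push_back(t); }
};
```

The token doubles as the name the timing partition uses for an instruction, so every table lookup is just an indexed read of a small, fixed-size memory.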
Slide 10: Implementing the Operations
See paper for details (also extra slides).
Slide 11: Assessment: Three Timing Models
- Unpipelined Target
- MIPS R10K-like out-of-order superscalar
- 5-Stage Pipeline
Slide 12: Assessment: Target Performance
Targets have an idealized memory hierarchy.
Slide 13: Assessment: Simulator Performance
Some correspondence between target and functional partition is very helpful.
Slide 14: Assessment: Reuse and Physical Stats

Where is functionality implemented:
[Table: for each design element (IMem, Program Counter, Branch Predictor, Scoreboard/ROB, Reg File, Maptable/Freelist, ALU, DMem, Store Buffer, Snapshots/Rollback), whether it is provided by the timing model or by the reusable Functional Partition in the Unpipelined, 5-Stage, and Out-of-Order designs; structures a design does not use are marked N/A]

FPGA usage (Virtex IIPro 70, using ISE 8.1i):

                            Unpipelined    5-Stage       Out-of-Order
    FPGA Slices             6599 (20%)     9220 (28%)    22,873 (69%)
    Block RAMs              18 (5%)        25 (7%)
    Clock Speed             98.8 MHz       96.9 MHz      95.0 MHz
    Average FMR             41.1           7.49          15.6
    Simulation Rate         2.4 MHz        14 MHz        6 MHz
    Average Simulator IPS   2.4 MIPS       5.1 MIPS      4.7 MIPS
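How the last three rows of the FPGA usage table relate (FMR is the FPGA-to-Model Cycle Ratio mentioned in the Summary); the formulas below are a reading of the table, and the instructions-per-model-cycle term is inferred from the numbers rather than stated on the slide:

    Simulation Rate ≈ FPGA Clock / FMR
        e.g. 98.8 MHz / 41.1 ≈ 2.4 MHz for the unpipelined design
    Simulator IPS ≈ Simulation Rate × target instructions per model cycle
        e.g. 2.4 MHz × ~1 ≈ 2.4 MIPS for the unpipelined design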
Slide 15: Future Work: Simulating Multicores
- Scheme 1: Duplicate both partitions
  [Diagram: each timing model (A-D) is paired with its own functional register state + datapath, all sharing the functional memory state; interaction occurs between each timing model and its datapath]
- Scheme 2: Cluster timing partitions
  [Diagram: timing models A-D share a single functional register state + datapath and the functional memory state; interaction still occurs there; a context ID is used to reference all state lookups]
Slide 16: Future Work: Simulating Multicores (continued)
- Scheme 3: Perform multiplexing of the timing models themselves
  - Leverage HAsim A-Ports in the timing model
  - Out of scope of today's talk
  [Diagram: multiplexed timing models A-D share the functional register state + datapath and the functional memory state; interaction still occurs there; a context ID is used to reference all state lookups]
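A small sketch of the context-ID idea from Schemes 2 and 3: a single shared state structure serves several multiplexed timing models, with the context ID selecting which simulated core's state a lookup touches. The structure and names are assumptions for exposition, not HAsim's implementation.

```cpp
#include <array>
#include <cstdint>

constexpr std::size_t NUM_CONTEXTS  = 4;    // one per simulated core (A-D)
constexpr std::size_t NUM_ARCH_REGS = 32;

using ContextId = std::uint8_t;

// One shared register-state structure serves every timing model; the
// context ID picks which core's architectural state is read or written.
struct SharedRegState {
    std::array<std::array<std::uint64_t, NUM_ARCH_REGS>, NUM_CONTEXTS> regs{};

    std::uint64_t read(ContextId ctx, unsigned r) const { return regs[ctx][r]; }
    void write(ContextId ctx, unsigned r, std::uint64_t v) { regs[ctx][r] = v; }
};
```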
Slide 17: Future Work: Unifying with the UT-FAST Model
- UT-FAST is functional-first: a functional emulator running in software produces an execution stream for the timing partition on the FPGA, which can resteer the emulator
- This can be unified into timing-directed: just do "execute-at-fetch"
[Diagram: functional partition and timing partition, with the execution stream flowing from the software emulator to the FPGA and a resteer path going back]
Slide 18: Summary
- Described a scheme for closely-coupled timing-directed partitioning
  - Both partitions are suitable for on-FPGA implementation
- Demonstrated such a scheme's benefits:
  - Very good reuse
  - Very good area / clock speed
  - Good FPGA-to-Model Cycle Ratio
    - Caveat: assumes some correspondence between the timing model and the functional partition (recall the unpipelined target)
- We plan to extend this using contexts for hardware multiplexing [Chung 07]
- Future: rare complex operations (such as syscalls) could be done in software using virtual channels
Slide 19: Questions?
pellauer@csail.mit.edu
Slide 20: Extra Slides
pellauer@csail.mit.edu
Slide 21: Functional Partition: Fetch
Slide 22: Functional Partition: Decode
Slide 23: Functional Partition: Execute
Slide 24: Functional Partition: Back End
Slide 25: Timing Model: Unpipelined
Slide 26: 5-Stage Pipeline Timing Model
Slide 27: Out-of-Order Superscalar Timing Model