
A Fully Buffered Memory System Simulator. Rami Nasr, M.S. Thesis and ENEE 759H Course Project, Thursday May 12th, 2005.


1 A Fully Buffered Memory System Simulator
Rami Nasr, M.S. Thesis and ENEE 759H Course Project, Thursday May 12th, 2005

2 Another Simulator?
Sim-DRAM exists and supports FB-DIMM. Why write another simulator?
- Sim-DRAM still had a few unworkable bugs in its FB-DIMM model when I began my study.
- FB-DIMM is radically different than other memory architectures. New simulator => fresh start.
- FBsim is made exclusively for simulating and studying the FB-DIMM architecture. Easier to study FB-DIMM with an exclusive simulator.
- Different scheduler, mapping algorithm, approach, style, section of study in the FB-DIMM design space.
- FBsim is ideal for simulating ‘unreasonably’ high memory request rates and studying channel saturation effects.
- The two simulators can be used to validate each other’s results in FB-DIMM studies.
- Writing a memory simulator was a great experience for me.

3 FBsim Overview
- All code written from scratch.
- Standalone product. Does not currently interface with CPU simulators or memory traces. Instead probabilistically models memory transactions according to user specifications.
  => Does not actually store memory data
- Written in ANSI C. ~5000 lines of code. Code organized into header files, commented, quite easy to hack.
- Fast. For each memory channel, 1 second simulates ~10ms (or ~1ms during channel saturation) on a 2.4 GHz Pentium 4.
- Supports Open & Closed Page Mode, Fixed & Variable Latency Mode.
- Supports output of macro and micro (frame by frame) simulation data.
- Does not model channel init, maintenance, sync. overhead.
- Does not model memory refresh.
- Does not model power consumption, and power timing limitations (tFAW etc.).
- The above options can be incorporated readily into future versions.

4 FBsim Overview 2
[Diagram blocks: Input Transaction Generator, Address Mapper, Channel Scheduler 0, Channel Scheduler 1, ... Channel Scheduler 7]
A Frame Iteration:
- Try to generate transactions.
- Map any generated transactions to their channel scheduler.
- Fire each scheduler once.
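The three steps of a frame iteration can be sketched as a simple loop. This is only an illustration in Python (FBsim itself is written in ANSI C). The class `ChannelScheduler`, the functions `generate_transactions`, `map_to_channel` and `run`, the one-transaction-per-frame generator, the modulo mapping and the first-in-first-out issue policy are all hypothetical placeholders, not FBsim's actual code.

```python
import random
from collections import deque

NUM_CHANNELS = 8  # the slide's diagram shows channel schedulers 0 through 7


class ChannelScheduler:
    """Hypothetical stand-in: queues transactions and issues one per frame."""

    def __init__(self):
        self.pending = deque()
        self.issued = 0

    def enqueue(self, txn):
        self.pending.append(txn)

    def fire(self):
        # Placeholder policy (assumption): issue the oldest pending transaction.
        if self.pending:
            self.pending.popleft()
            self.issued += 1


def generate_transactions(rng, request_prob):
    """Hypothetical generator: produces at most one transaction per frame."""
    if rng.random() < request_prob:
        return [{"addr": rng.randrange(1 << 20), "is_read": rng.random() < 2 / 3}]
    return []


def map_to_channel(txn):
    # Placeholder mapping (assumption): plain modulo interleave over channels.
    return txn["addr"] % NUM_CHANNELS


def run(frames, request_prob=0.5, seed=0):
    rng = random.Random(seed)
    schedulers = [ChannelScheduler() for _ in range(NUM_CHANNELS)]
    generated = 0
    for _ in range(frames):
        # Step 1: try to generate transactions.
        for txn in generate_transactions(rng, request_prob):
            # Step 2: map each generated transaction to its channel scheduler.
            schedulers[map_to_channel(txn)].enqueue(txn)
            generated += 1
        # Step 3: fire each scheduler once.
        for sched in schedulers:
            sched.fire()
    return generated, schedulers
```

Calling `run(1000)` returns the number of generated transactions and the eight scheduler objects, whose `issued` counters can then be inspected.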

5 Input Transaction Model
[Figure panels: Step Distributions; Normal (Gaussian) Distributions]
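As a rough illustration of the two distribution families named on this slide, the sketch below draws samples from a step (piecewise-uniform) distribution and from a normal distribution. The slide does not say which transaction parameters FBsim draws from these distributions or how they are parameterized, so the function names, the `(low, high, weight)` step format and the clamping floor are all assumptions.

```python
import random


def sample_step(rng, steps):
    """Draw from a step distribution.

    steps is a list of (low, high, weight) tuples. A step is chosen in
    proportion to its weight, then a value is drawn uniformly between
    its low and high bounds.
    """
    weights = [w for _, _, w in steps]
    low, high, _ = rng.choices(steps, weights=weights, k=1)[0]
    return rng.uniform(low, high)


def sample_normal(rng, mean, stddev, floor=0.0):
    """Draw from a normal distribution, clamped so it never goes below floor."""
    return max(floor, rng.gauss(mean, stddev))


rng = random.Random(1)
steps = [(0.0, 10.0, 3.0), (50.0, 60.0, 1.0)]  # hypothetical user specification
step_samples = [sample_step(rng, steps) for _ in range(1000)]
normal_samples = [sample_normal(rng, mean=20.0, stddev=5.0) for _ in range(1000)]
```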

6 Input Transaction Model 2
[Figure panels: Bus Trace Viewer; FBsim Model]

7 Address Mapping
- Physical address must be mapped somehow to the right channel, DIMM, rank, bank, row, and column.
- FBsim built to support different DIMM capacities, different channel capacities, even unbalanced configurations
  => Algorithm needed to map incoming transaction to DIMM

WHILE (a non zero row sum exists) {
    WHILE (visit each channel with a non zero row sum exactly once) {
        The next 'result' is that channel's DIMM with the highest number.
        Decrement that DIMM's number by 1.
        Decrement the row sum by 1.
    }
}

Modulus = 4+2+1+2 = 9
[Figure panels: Closed Page Mode; Open Page Mode]
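The mapping pseudocode on this slide can be made concrete with a short Python sketch. It assumes each channel is a row of per-DIMM "numbers" (relative capacities), that ties between equal DIMMs go to the lowest index, and that the resulting sequence is indexed by address modulo its length. The 2-channel, 2-DIMM weights below are hypothetical values chosen only so that they total 9 like the slide's example; the slide's original table is not recoverable from the transcript.

```python
def build_interleave_table(weights):
    """Build the (channel, dimm) visiting sequence.

    weights[c][d] is the relative capacity ("number") of DIMM d on channel c.
    """
    remaining = [list(row) for row in weights]  # work on a copy
    row_sums = [sum(row) for row in remaining]
    table = []
    while any(row_sums):  # a non-zero row sum exists
        for ch, row in enumerate(remaining):
            if row_sums[ch] == 0:
                continue  # only visit channels with a non-zero row sum
            dimm = row.index(max(row))  # highest number; ties to lowest index (assumption)
            table.append((ch, dimm))
            row[dimm] -= 1
            row_sums[ch] -= 1
    return table


def map_address(table, block_index):
    """Pick the (channel, dimm) target for a block index, modulo the table length."""
    return table[block_index % len(table)]


# Hypothetical unbalanced configuration: the weights sum to 4+2+1+2 = 9.
table = build_interleave_table([[4, 2], [1, 2]])
print(len(table))   # the modulus
print(table)
```

Each (channel, DIMM) pair appears in the table as many times as its weight, so larger DIMMs receive proportionally more of the address space while consecutive blocks still alternate between channels for as long as possible.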

8 Channel Scheduler

9 FB-DIMM Frame Format Review
SouthBound (SB) Frame could be a:
- Channel Frame (not modeled in FBsim)
- Command Frame (up to three DRAM commands, with only one command possible to each DIMM in the channel)
- Command + Wdata Frame (holds one DRAM command, plus one DDR beat of write data)
NorthBound (NB) Frame could be a:
- Channel Frame (not modeled in FBsim)
- Read Response Frame (holds two DDR beats of returned read data)
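The Command Frame constraints above (at most three DRAM commands, at most one per DIMM) can be illustrated with a small greedy packer. This is a sketch under assumptions: the function name, the `(dimm, command)` tuple format and the greedy priority-order policy are hypothetical and are not taken from FBsim's scheduler.

```python
MAX_COMMANDS_PER_FRAME = 3  # from the slide: up to three DRAM commands per frame


def pack_command_frame(pending):
    """Greedily choose commands for one southbound Command Frame.

    pending is a list of (dimm, command) tuples in priority order.
    Returns (frame, leftover): the commands placed in this frame and the
    commands that must wait for a later frame.
    """
    frame, leftover, used_dimms = [], [], set()
    for dimm, cmd in pending:
        if len(frame) < MAX_COMMANDS_PER_FRAME and dimm not in used_dimms:
            frame.append((dimm, cmd))
            used_dimms.add(dimm)  # only one command per DIMM in a frame
        else:
            leftover.append((dimm, cmd))
    return frame, leftover


pending = [(0, "ACT"), (0, "RD"), (1, "ACT"), (2, "PRE"), (3, "ACT")]
frame, leftover = pack_command_frame(pending)
```

Here the second command for DIMM 0 is deferred because DIMM 0 already has a command in the frame, and the command for DIMM 3 is deferred because the frame is full.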

10 Some of my Results
- 1x8 achieved 7.9 GBps before saturating (82%)
- 2x4 achieved 15.6 GBps (82%)
- 4x2 achieved 31.3 GBps (82%)
- 8x1 achieved 45.2 GBps (59%!)
Case Study Conclusion
- With at least two DIMMs on each channel, performance scales very well in FB-DIMM.
- More than two DIMMs only increases capacity, not throughput.
- Adding each DIMM adds ~5ns average channel latency in FLM, and slightly over half that in VLM.
- In closed page mode, only 82% of peak theoretical throughput of a channel can be reached.
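The figures above can be cross-checked with a little arithmetic. The sketch below assumes that "NxM" means N channels with M DIMMs per channel (the slide does not spell this out, but throughput grows with the first number) and that each percentage is the fraction of peak theoretical throughput. Dividing per-channel throughput by that fraction gives the per-channel peak each row implies.

```python
RESULTS = [
    # (channels, dimms_per_channel, achieved_GBps, fraction_of_peak)
    (1, 8, 7.9, 0.82),
    (2, 4, 15.6, 0.82),
    (4, 2, 31.3, 0.82),
    (8, 1, 45.2, 0.59),
]


def per_channel_summary(results):
    """Return (channels, dimms, per_channel_GBps, implied_peak_GBps) per row."""
    rows = []
    for channels, dimms, achieved, fraction in results:
        per_channel = achieved / channels
        implied_peak = per_channel / fraction
        rows.append((channels, dimms, per_channel, implied_peak))
    return rows


for channels, dimms, per_channel, implied_peak in per_channel_summary(RESULTS):
    print(f"{channels}x{dimms}: {per_channel:.2f} GBps per channel, "
          f"implied peak {implied_peak:.2f} GBps")
```

Under these assumptions all four rows imply a per-channel peak of roughly 9.5 to 9.6 GBps, so the reported percentages are mutually consistent, and the 8x1 drop shows up as a fall in per-channel throughput rather than a change in the peak.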

11 Some of my Results 2
- In Closed Page Mode with a 2:1 read/write ratio, a reordering window of ~12 transactions achieves the best possible performance (channel saturation) for an FB-DIMM channel scheduler. Increasing the window size beyond this has no benefit.
- The more skewed the read/write ratio, the bigger the scheduling window needs to be (at 4:1, it's ~18).
- In Variable Latency Mode, a reordering window of size ~20 achieves the best possible performance.

12 Some of my Results 3
- Micro-study shows that in Closed Page Mode, the FB channel can at most reach ~93% write data utilization on the SB, and ~84% read data utilization on the NB.
- Micro-study showed that FBsim channel utilization was slightly worse for non 2:1 read/write ratios (it was 2% worse for 4:1).
- FBsim scheduler can quite straightforwardly be made more adaptive to read/write ratio of transactions in scheduler.

13 Future Ideas with FBsim (me)
- I’m graduating this semester (if Dr Jacob and Mr (Dr?) Wang so please), and escaping to the corporate world.
  => Writing a guide for FBsim along with some ideas for future work.
- Anyone who wishes to take over development is eagerly encouraged to. If so, I would be happy to help get things rolling by email or in person.
- Feel free to access & use anything in FBsim or my thesis paper.
- I strongly believe a very interesting paper or three can quite quickly come out of this research area.

14 Future Ideas with FBsim 2
- For credibility in a paper, add an interface between FBsim and a CPU simulator or memory traces. Run real benchmarks through FBsim. Compare and contrast these results with the transaction modeling results.
- AND/OR add more functionality and provable realism to the transaction modeler. Study this.
- Best yet, integrate FBsim into the Sim-DRAM package as an added option.
- Add modeling for channel overhead, memory refresh overhead, error simulation and error handling, power consumption constraints and metrics.
- Enhance adaptivity of FBsim scheduler to non 2:1 read/write ratios.
- Experiment with address mapping algorithm and load balancing.
- Experiment with different types of scheduler implementations (e.g. ones not based on pattern matching). *involved*
- Study hardware constraints in FB-DIMM channel scheduling.

15 More Possible FB-DIMM Studies
- Channel utilization and configuration trade-offs for Open Page Mode
- Performance degradation of shrinking scheduler reorder window size
- Relaxation on critical DRAM device parameters (density, nBanks, timing constraints, clock frequency) allowed by FB-DIMM architecture
- OR optimizing the FB-DIMM architecture by increasing the SB and NB channel widths (adding lines) or bitrates, and maybe modifying the frame protocol
- AMB is a logic device on a memory module!! Can add buffers, arithmetic units, processing power, etc.

16 Special Thanks to..
- Dr Jacob for introducing me to the field and guiding my progress
- David Wang for the course lectures and material

