1
A Talk on Adaptive History-Based Memory Scheduling
David J. Pelster December 4, 2018
2
Basis of Talk
[1] I. Hur and C. Lin, "Adaptive History-Based Memory Schedulers," Proceedings of the 37th International Symposium on Microarchitecture (MICRO-37), Portland, Oregon, 2004.
[2] W. A. Wulf and S. A. McKee, "Hitting the Memory Wall: Implications of the Obvious," Computer Architecture News, 23(1):20-24, March 1995.
[3] J. Hennessy and D. Patterson, Computer Architecture: A Quantitative Approach, Second Edition.
3
Introduction & Motivation
4
Introduction 1/9 Increasing gap in processor and memory performance
5
Introduction 2/9
Memory operations comprise 20 to 40% of a typical instruction mix [2]; on average, at least 1 of every 5 instructions references memory!
Typical instruction behavior [3]: lw – … << low-latency computations >> sw –
What does this mean? Stall. Stall. Stall.
6
Introduction 3/9
If those memory operations are not satisfied by the on-chip caches, main memory must supply the data.
Main memory is composed of DRAM, a high-latency resource operating at 133 MHz/266 MHz. Fast, you might say? Even a page hit can take 2-3 bus cycles.
And what if it isn't a page hit… in any bank?
7
Introduction 4/9
Functional block diagram of a 128 Meg x 4 SDRAM (from the Micron SDRAM datasheet).
8
Introduction 5/9
What about re-ordering the memory instructions before dispatching them to memory? The processor only removes false dependencies!
Current memory controllers: the processor passes instructions to DRAM in FIFO order. Why? It's simple. But it can result in poor bandwidth because of the artificially imposed ordering, for example when subsequent accesses go to the same bank.
So, how about re-ordering the memory operations at the memory controller?
9
Introduction 6/9
Re-ordering at the controller has been done to some extent, but current policies suffer from two limitations:
They are greedy algorithms. E.g., if multiple pending operations are ready and do not exhibit bank contention, the controller simply schedules the oldest operation first. What's the problem? Greedy choices ignore longer-term scheduling effects.
They only consider hardware characteristics and disregard application behavior. E.g. ...
10
The Power5 Memory Controller [1]
Two re-order queues, an arbiter, and the CAQ.
The daxpy kernel performs 2 loads per store, which could induce a bottleneck.
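To make that 2:1 mix concrete, here is the daxpy loop in C. Each iteration loads x[i] and y[i] and stores y[i], so at the instruction level the read re-order queue sees roughly twice the traffic of the write queue.

```c
/* daxpy: y[i] = a * x[i] + y[i]
 * Each iteration performs 2 loads (x[i], y[i]) and 1 store (y[i]),
 * so the memory stream settles to roughly a 2:1 read/write mix. */
void daxpy(int n, double a, const double *x, double *y)
{
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```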
11
Introduction 8/9 These authors address both limitations by:
Tracking recently scheduled memory operations, i.e., operations that have moved from the CAQ to the DRAM. The arbiter can "log" past operations and use this information to account for longer-term scheduling effects, that is, to conform to some ideal read/write ratio.
Enter the history-based arbiter, which uses an FSM to encode scheduling policies.
12
Introduction 9/9 These FSMs have a goal, such as:
Try to minimize overall latency.
Try to conform to an ideal access ratio.
Try to choose between the above two.
Enter the adaptive history-based arbiters, which adaptively choose among multiple history-based arbiters.
The authors provide a solution and evaluate it on a cycle-accurate Power5 simulator.
13
A little background on the architecture
14
Background 1/3 The IBM Power5: an SMP chip with 276 million transistors
Improvements over the Power4: a larger L2 cache, SMT, power-saving features, and an on-chip memory controller.
2 processors per chip, each with a split L1 cache and sharing a unified L2 cache.
Each memory controller is shared by the 2 processors.
The 2 reorder queues can each hold 8 memory requests; each request is an entire L2 cache line (or a portion of an L3 line).
15
The Power5 Memory Controller [1]
The arbiter selects the most appropriate commands from the read/write re-order queues to place in the CAQ. From that point on, references are handled in FIFO order. The controller keeps track of up to 12 previous commands.
16
Background 3/3 The Power5 Memory System
The authors assume DDR2-266 DRAM chips and a 5-D structure: two ports connect the memory controller to the DRAM, and the DRAM is organized as 4 ranks of 4 banks, each containing pages (rows and columns).
Conflicts can occur on ports, ranks, banks, and pages; bank delays are the greatest.
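As a rough illustration of that 5-D organization, the sketch below decodes a physical address into port/rank/bank/row/column coordinates. The field widths and bit positions are invented for the example; the talk does not give the actual Power5 address mapping.

```c
#include <stdint.h>

/* Hypothetical 5-D coordinates of a DRAM access: port, rank, bank,
 * row (page), and column. Field sizes mirror the organization above
 * (2 ports, 4 ranks, 4 banks); the bit layout is illustrative only. */
typedef struct {
    unsigned port;   /* 2 ports  -> 1 bit  */
    unsigned rank;   /* 4 ranks  -> 2 bits */
    unsigned bank;   /* 4 banks  -> 2 bits */
    unsigned row;    /* page index          */
    unsigned col;    /* column within page  */
} dram_coord_t;

static dram_coord_t decode_addr(uint64_t paddr)
{
    dram_coord_t c;
    c.col  = (paddr >> 7)  & 0x3FF;   /* assumes a 128B request size */
    c.port = (paddr >> 17) & 0x1;
    c.rank = (paddr >> 18) & 0x3;
    c.bank = (paddr >> 20) & 0x3;
    c.row  =  paddr >> 22;
    return c;
}
/* Two commands that hit the same rank and bank but a different row
 * suffer the costliest (bank/page) conflict; rank and port conflicts
 * are cheaper, per the ordering on this slide. */
```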
17
The Authors’ Solution
18
Solution The focus: history-based and adaptive
Several FSMs encode history-based policies with distinct goals (distinct arbiters), e.g. minimize latency, balance reads and writes, or decide probabilistically.
Combine several FSMs, each optimized for only one command pattern.
An arbiter selector observes the recent command pattern and periodically chooses the best history-based arbiter.
Now we have an adaptive controller.
19
Arbiter Selection [1]
Figure: read and write counters (Rcnt, Wcnt) and a counter Ccnt; the arbiter choice is based on the R/W ratio and is updated every Ccnt.
20
Solution History-Based Arbiters: similar to branch predictors
The arbiter selects the next command based on the history of previously serviced commands, but also uses the currently en-queued commands to help with the decision.
The policy is encoded in an FSM: each state represents a possible "history string," where the string xy means x was serviced before y.
Within each state, the next state is chosen based on the criteria of the arbiter and the available pending commands (a minimal code sketch follows; the next slide shows a possible FSM scenario).
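A minimal sketch of the idea, assuming a history length of 2 and commands classified only as read or write (the paper's FSMs also distinguish the port). Each state is a two-command history string with a preference order over what to issue next; the arbiter issues the highest-preference command type that actually has a pending entry, then shifts the history. The preference table here is illustrative, not taken from [1].

```c
#include <stdbool.h>

typedef enum { CMD_READ = 0, CMD_WRITE = 1 } cmd_t;

/* State = (second-to-last, last) serviced command: RR, RW, WR, WW.
 * preference[state][k] lists command types from most to least preferred. */
static const cmd_t preference[4][2] = {
    /* RR */ { CMD_WRITE, CMD_READ  },
    /* RW */ { CMD_READ,  CMD_WRITE },
    /* WR */ { CMD_READ,  CMD_WRITE },
    /* WW */ { CMD_READ,  CMD_WRITE },
};

static int state_of(cmd_t older, cmd_t newer) { return older * 2 + newer; }

/* pending[t] is true when at least one command of type t is queued. */
cmd_t arbitrate(int *state, const bool pending[2])
{
    for (int k = 0; k < 2; k++) {
        cmd_t choice = preference[*state][k];
        if (pending[choice]) {
            /* the old "last" command becomes the new "older" entry */
            *state = state_of((cmd_t)(*state & 1), choice);
            return choice;
        }
    }
    return CMD_READ; /* nothing pending; caller should check first */
}
```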
21
A Possible FSM Scenario [1]
22
History-Based Arbiter [1]
Optimizes for a desired R/W ratio: pending commands are prioritized based on their deviation from the desired ratio (in the figure, x = reads, y = writes).
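A hedged sketch of how the deviation criterion could be scored (the exact rule from [1] is not reproduced here): for each candidate command type, compute how far the recent read/write mix would drift from the target ratio if that type were issued next, and prefer the smaller drift.

```c
#include <stdlib.h>

/* Compare reads/writes against a target ratio want_r/want_w without
 * dividing: the cross-product difference grows with the deviation. */
static int deviation(int reads, int writes, int want_r, int want_w)
{
    return abs(reads * want_w - writes * want_r);
}

/* history_r/history_w: reads and writes among recently issued commands.
 * Returns 1 if issuing a read next keeps the mix closer to the target. */
int prefer_read(int history_r, int history_w, int want_r, int want_w)
{
    int dev_if_read  = deviation(history_r + 1, history_w, want_r, want_w);
    int dev_if_write = deviation(history_r, history_w + 1, want_r, want_w);
    return dev_if_read <= dev_if_write;
}
```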
23
History-Based Arbiter [1]
Optimizes for minimized latency: pending commands are prioritized based on the expected latency of the scheduled command. The authors develop a latency cost model covering transitions such as a read-to-write (R2W) to a different bank, a read-to-read (R2R) to the same port but a different rank, a write-to-read (W2R) to the same port, etc.
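A sketch of what such a cost model might look like. The only figures taken from the talk are the 75 ns bank-conflict and 30 ns rank-conflict delays quoted later in the simulation parameters; the bus-turnaround cost and the exact conflict conditions are assumptions.

```c
/* Estimate the delay a candidate command would incur, given the command
 * just issued and which DRAM resources the two share. */
typedef struct {
    int is_read;                 /* 1 = read, 0 = write */
    int port, rank, bank, row;
} cmd_info_t;

static int est_delay_ns(const cmd_info_t *prev, const cmd_info_t *next)
{
    int delay = 0;
    if (prev->rank == next->rank && prev->bank == next->bank &&
        prev->row != next->row)
        delay += 75;             /* bank (page) conflict, per the talk */
    else if (prev->rank == next->rank)
        delay += 30;             /* rank conflict, per the talk */
    if (prev->port == next->port && prev->is_read != next->is_read)
        delay += 10;             /* assumed read/write turnaround cost */
    return delay;
}
/* The minimize-latency arbiter would rank pending commands by
 * est_delay_ns() and issue the cheapest one. */
```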
24
History-Based Arbiter [1]
Probabilistic arbiter: the prioritization criteria above must be hard-coded, but what if the right criteria are not known? A random number is periodically generated to determine the state transition rules. The threshold is system dependent and would be determined experimentally.
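A minimal sketch of the probabilistic choice, assuming a single tunable threshold that splits the draw between a read-first and a write-first transition rule; the threshold value here is made up.

```c
#include <stdlib.h>

#define READ_FIRST_THRESHOLD 0.6  /* assumed; tuned experimentally per system */

/* Re-drawn periodically; returns 1 to favor reads for the next interval,
 * 0 to favor writes. */
int prefer_read_probabilistic(void)
{
    double r = (double)rand() / RAND_MAX;
    return r < READ_FIRST_THRESHOLD;
}
```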
25
Simulation Parameters and Results
26
Simulation Parameters 1/4
Schedulers Studied
1. The current Power5 system: in-order scheduling.
2. Memoryless scheduling by Rixner et al.: selects commands that do not conflict with operations currently in the DRAM.
3. Their adaptive history-based controller, using the 3 versions of the command-pattern arbiters... It uses a history length of 2 and, with that short history, aims mainly to reduce port and rank conflicts.
27
Simulation Parameters 2/4
Methodology: uses the same cycle-accurate simulator the designers of the IBM Power5 used, accurate to within 1% of the behavior of the real Power5.
Parameters: 1.6 GHz; demand misses are given priority over prefetches.
Three levels of cache:
L1D: 64KB, 4-way set associative
L1I: 128KB, 2-way set associative
L2: 3x640KB, 10-way set associative, with 128B blocks
L3: off-chip, 36MB
28
Simulation Parameters 3/4
Parameters (cont.): DDR2-266 SDRAM chips running at 266 MHz, with a 75ns delay for bank conflicts and a 30ns delay for rank conflicts.
3 history-based arbiters with history length = 2; because commands are distinguished by read/write and by the 2 ports, each FSM has 16 states.
The three arbiters target the command patterns 1R2W, 1R1W, and 2R1W.
29
Simulation Parameters 4/4
The Adaptive History-Based Arbiter combines these 3 and uses:
the 2R1W arbiter (3) when R/W >= 1.2
the 1R1W arbiter (2) when 0.8 < R/W < 1.2
the 1R2W arbiter (1) when R/W <= 0.8
Selection among these arbiters is performed every 10000 processor cycles; results are insensitive to any period of 100 or greater. A sketch of the selection logic follows.
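A sketch of the selection step using the thresholds above; the counter names and the integer-only ratio test are illustrative, not taken from the actual hardware.

```c
typedef enum { ARB_1R2W, ARB_1R1W, ARB_2R1W } arbiter_t;

#define SELECTION_PERIOD 10000   /* processor cycles, per the talk */

/* rcnt/wcnt: reads and writes observed since the last selection;
 * both would be reset after each decision. */
static arbiter_t select_arbiter(unsigned long rcnt, unsigned long wcnt)
{
    /* compare rcnt/wcnt against 1.2 and 0.8 without floating point */
    if (rcnt * 10 >= wcnt * 12)
        return ARB_2R1W;         /* read-heavy:  R/W >= 1.2 */
    if (rcnt * 10 <= wcnt * 8)
        return ARB_1R2W;         /* write-heavy: R/W <= 0.8 */
    return ARB_1R1W;             /* balanced: 0.8 < R/W < 1.2 */
}
```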
30
Results: 4 Stream benchmarks, 8 NAS benchmarks, 14 microbenchmarks
The Stream benchmarks measure sustainable bandwidth.
The 8 NAS benchmarks are data-intensive scientific benchmarks.
The 14 microbenchmarks explore a wider range of machine configurations; each uses a different R/W ratio, where xRyW means x read streams and y write streams.
31
Stream Benchmarks [1]: Copy and Scale generate 2 reads per write; Sum and Triad generate 3 reads per write.
32
NAS Benchmarks [1]: in the bottom 2 graphs, the CPU clock is 4x faster.
33
Microbenchmarks [1]: the arbiter selector picks the best arbiter in every case except 3R2W (ratio = 1.5).
34
Potential Bottlenecks [1]
Figure annotations: 1 full, 3 empty, 4 full, 2 full.
35
Conclusions
36
Conclusions
The hardware cost is small: the scheduler adds only 0.038% to the total area of the current Power5.
Given their benchmark testing:
Claim: they achieve 95-98% of the IPC of an idealized DRAM that never experiences any sort of hardware hazard.
Claim: IPC improves 63% for Stream over in-order scheduling and 19.1% over memoryless scheduling.
Claim: IPC improves 10.9% for NAS over in-order scheduling and 5.1% over memoryless scheduling.
37
Questions?