
1 A Performance Comparison of DRAM Memory System Optimizations for SMT Processors
Zhichun Zhu, ECE Department, Univ. of Illinois at Chicago
Zhao Zhang, ECE Department, Iowa State Univ.

2 Feb. 15, 2005 HPCA-11 DRAM Memory Optimizations
Optimizations at the DRAM side can make a big difference on single-threaded processors:
- Enhancement of the chip interface/interconnect
- Access scheduling [Hong et al. HPCA'99, Mathew et al. HPCA'00, Rixner et al. ISCA'00]
- DRAM-side locality [Cuppu et al. ISCA'99, ISCA'01, Zhang et al. MICRO'00, Lin et al. HPCA'01]

3 Feb. 15, 2005 HPCA-11 How Does SMT Impact the Memory Hierarchy?
- Less performance loss per cache miss to DRAM memory, hence lower benefit from DRAM-side optimizations?
- But more cache misses due to cache contention, hence much more pressure on main memory
- Is DRAM memory design more important or not?

4 Feb. 15, 2005 HPCA-11 Outline
- Motivation
- Memory optimization techniques
- Thread-aware memory access scheduling
  - Outstanding request-based
  - Resource occupancy-based
- Methodology
- Memory performance analysis on SMT systems
  - Effectiveness of single-thread techniques
  - Effectiveness of thread-aware schemes
- Conclusion

5 Feb. 15, 2005 HPCA-11 Memory Optimization Techniques
- Page modes
  - Open page: good for programs with good locality
  - Close page: good for programs with poor locality
- Mapping schemes (see the address-mapping sketch after this slide)
  - Exploitation of concurrency (multiple channels, chips, banks)
  - Avoidance of row buffer conflicts
- Memory access scheduling
  - Reordering of concurrent accesses
  - Reducing average latency and improving bandwidth utilization
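A minimal address-mapping sketch (my illustration, not the paper's scheme): low-order block-address bits select the channel and bank so that consecutive cache blocks spread across them, which exposes concurrency across channels and banks; all bit widths and parameter names are hypothetical.

```cpp
#include <cstdint>

// Hypothetical address-mapping sketch: block address -> (channel, bank, row, column).
// Bit widths are illustrative, not the paper's configuration.
struct DramCoord {
    uint32_t channel, bank, row, column;
};

DramCoord mapAddress(uint64_t blockAddr,
                     uint32_t numChannels,   // e.g. 2, 4, or 8 as in the evaluated systems
                     uint32_t banksPerChip,  // e.g. 4
                     uint32_t colsPerRow) {  // blocks per row buffer
    DramCoord c;
    // Low-order bits interleave consecutive blocks across channels and banks;
    // higher-order bits pick the column within a row and then the DRAM row.
    c.channel = static_cast<uint32_t>(blockAddr % numChannels);  blockAddr /= numChannels;
    c.bank    = static_cast<uint32_t>(blockAddr % banksPerChip); blockAddr /= banksPerChip;
    c.column  = static_cast<uint32_t>(blockAddr % colsPerRow);   blockAddr /= colsPerRow;
    c.row     = static_cast<uint32_t>(blockAddr);                // remaining bits select the row
    return c;
}
```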

6 Feb. 15, 2005 HPCA-11 Memory Access Scheduling for Single-Threaded Systems
- Hit-first: a row buffer hit has a higher priority than a row buffer miss
- Read-first: a read has a higher priority than a write
- Age-based: an older request has a higher priority than a newer one
- Criticality-based: a critical request has a higher priority than a non-critical one (a comparator sketch combining the first three rules follows this slide)
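As an illustrative sketch (field names are assumptions, not the paper's simulator code), the comparator below encodes the hit-first, read-first, and age-based rules as a strict priority order:

```cpp
#include <cstdint>

// Hypothetical request descriptor; field names are assumptions for illustration.
struct MemRequest {
    bool     rowBufferHit;  // would hit the currently open row
    bool     isRead;        // read vs. write
    uint64_t arrivalCycle;  // older requests have smaller arrival times
};

// Returns true if request a should be scheduled before request b,
// applying hit-first, then read-first, then age-based ordering.
bool scheduleBefore(const MemRequest& a, const MemRequest& b) {
    if (a.rowBufferHit != b.rowBufferHit) return a.rowBufferHit;  // hit-first
    if (a.isRead != b.isRead)             return a.isRead;        // read-first
    return a.arrivalCycle < b.arrivalCycle;                       // age-based: older first
}
```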

7 Feb. 15, 2005 HPCA-11 Memory Access Concurrency with Multithreaded Processors
[Figure: processor-to-memory request streams for a single-threaded vs. a multi-threaded processor]

8 Feb. 15, 2005 HPCA-11 Thread-Aware Memory Scheduling
- New dimension in memory scheduling for SMT systems: considering the current state of each thread
- Thread states related to memory accesses:
  - Number of outstanding requests
  - Number of processor resources occupied

9 Feb. 15, 2005 HPCA-11 Outstanding Request-Based Scheme
- Request-based: a request generated by a thread with fewer pending requests has a higher priority (see the selection sketch after this slide)
[Figure: timelines contrasting two service orders for row-buffer hits A1-A4 from thread A and B1-B2 from thread B]
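A minimal selection sketch (assumed data structures, not the paper's implementation) of the outstanding request-based rule: among queued requests, prefer the one whose issuing thread has the fewest requests still pending:

```cpp
#include <cstddef>
#include <vector>

// requestThreadIds[i] is the thread that issued queued request i;
// outstandingPerThread[t] counts how many of thread t's requests are still pending.
// Returns the index of the request whose thread has the fewest pending requests.
size_t pickByFewestOutstanding(const std::vector<int>& requestThreadIds,
                               const std::vector<int>& outstandingPerThread) {
    size_t best = 0;
    for (size_t i = 1; i < requestThreadIds.size(); ++i) {
        if (outstandingPerThread[requestThreadIds[i]] <
            outstandingPerThread[requestThreadIds[best]])
            best = i;  // fewer pending requests -> higher priority
    }
    return best;
}
```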

10 Feb. 15, 2005 HPCA-11 Outstanding Request-Based Scheme
- Request-based, with hit-first and read-first applied on top
- For SMT processors, sustained memory bandwidth is more important than the latency of an individual access
[Figure: timelines contrasting two service orders when thread A issues row-buffer hits A1-A4 and thread B issues row-buffer misses B1-B2]

11 Feb. 15, 2005 HPCA-11 Resource Occupancy-Based Scheme
- ROB-based: higher priority to requests from threads holding more ROB entries
- IQ-based: higher priority to requests from threads holding more IQ entries
- Hit-first and read-first are applied on top (a minimal tie-break sketch follows this slide)
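As an illustrative sketch (not the paper's code), the occupancy-based rule can be expressed as a tie-break applied after the hit-first and read-first rules from the earlier comparator; swapping ROB occupancy for IQ occupancy gives the IQ-based variant:

```cpp
#include <vector>

// Resource occupancy-based tie-break (illustrative): given two competing requests that
// hit-first and read-first could not separate, prefer the one from the thread currently
// holding more ROB entries. Pass per-thread IQ occupancy instead for the IQ-based variant.
bool occupancyPrefers(int threadA, int threadB,
                      const std::vector<int>& robEntriesHeld) {
    return robEntriesHeld[threadA] > robEntriesHeld[threadB];  // more ROB entries -> higher priority
}
```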

12 Feb. 15, 2005 HPCA-11 Methodology
- Simulator
  - SMT extension of sim-Alpha
  - Event-driven memory simulator (DDR SDRAM and Direct Rambus DRAM)
- Workloads
  - Mixtures of SPEC 2000 applications
  - 2-, 4-, and 8-thread workloads
  - "ILP", "MIX", and "MEM" workload mixes

13 Feb. 15, 2005 HPCA-11 Simulation Parameters
- Processor speed: 3 GHz
- Fetch width: 8 inst.
- Baseline fetch policy: DWarn.2.8
- Pipeline depth: 11
- Issue queue size: 64 Int., 32 FP
- Reorder buffer size: 256/thread
- Physical register num: 384 Int., 384 FP
- Load/store queue size: 64 LQ, 64 SQ
- L1 caches: 64KB I/D, 2-way, 1-cycle latency
- L2 cache: 512KB, 2-way, 10-cycle latency
- L3 cache: 4MB, 4-way, 20-cycle latency
- MSHR entries: (16+4 prefetch)/cache
- Memory channels: 2/4/8
- Memory BW/channel: 200 MHz, DDR, 16B width
- Memory banks: 4 banks/chip
- DRAM access latency: 15ns row, 15ns column, 15ns precharge (an illustrative configuration sketch follows this slide)
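Purely for illustration, the memory-system portion of the parameter table could be captured in a small configuration struct like the sketch below; the struct and field names are assumptions, only the values come from the slide.

```cpp
// Illustrative memory-system configuration mirroring the table above;
// names and units are assumptions, values are taken from the parameter list.
struct MemoryConfig {
    int   numChannels;        // 2, 4, or 8 in the evaluated configurations
    int   banksPerChip;       // 4 banks/chip
    int   busMHz;             // 200 MHz channel clock
    bool  doubleDataRate;     // DDR: transfers on both clock edges
    int   busWidthBytes;      // 16B-wide channel
    float rowLatencyNs;       // 15 ns row access
    float columnLatencyNs;    // 15 ns column access
    float prechargeLatencyNs; // 15 ns precharge
};

// One of the evaluated configurations (4 channels), per the parameter table.
static const MemoryConfig kFourChannelDDR = {4, 4, 200, true, 16, 15.0f, 15.0f, 15.0f};
```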

14 Feb. 15, 2005 HPCA-11 Workload Mixes
- 2-thread
  - ILP: bzip2, gzip
  - MIX: gzip, mcf
  - MEM: mcf, ammp
- 4-thread
  - ILP: bzip2, gzip, sixtrack, eon
  - MIX: gzip, mcf, bzip2, ammp
  - MEM: mcf, ammp, swim, lucas
- 8-thread
  - ILP: gzip, bzip2, sixtrack, eon, mesa, galgel, crafty, wupwise
  - MIX: gzip, mcf, bzip2, ammp, sixtrack, swim, eon, lucas
  - MEM: mcf, ammp, swim, lucas, equake, applu, vpr, facerec

15 Feb. 15, 2005 HPCA-11 Performance Loss Due to Memory Access

16 Feb. 15, 2005 HPCA-11 Memory Access Concurrency

17 Feb. 15, 2005 HPCA-11 Memory Channel Configurations

18 Feb. 15, 2005 HPCA-11 Memory Channel Configurations

19 Feb. 15, 2005 HPCA-11 Mapping Schemes

20 Feb. 15, 2005 HPCA-11 Memory Access Concurrency

21 Feb. 15, 2005 HPCA-11 Thread-Aware Schemes

22 Feb. 15, 2005 HPCA-11 Conclusion
DRAM optimizations have significant impacts on the performance of SMT (and likely CMP) processors:
- They are mainly effective when a workload mix includes some memory-intensive programs
- Performance is sensitive to the memory channel organization
- DRAM-side locality is harder to exploit due to contention among threads
- Thread-aware access scheduling schemes do bring good performance

