
1 A Performance Comparison of DRAM Memory System Optimizations for SMT Processors
Zhichun Zhu, ECE Department, Univ. of Illinois at Chicago
Zhao Zhang, ECE Department, Iowa State Univ.

2 Feb. 15, 2005 HPCA-11 DRAM Memory Optimizations
Optimizations at the DRAM side can make a big difference on single-threaded processors:
- Enhancement of the chip interface/interconnect
- Access scheduling [Hong et al. HPCA'99, Mathew et al. HPCA'00, Rixner et al. ISCA'00]
- DRAM-side locality [Cuppu et al. ISCA'99, ISCA'01, Zhang et al. MICRO'00, Lin et al. HPCA'01]

3 Feb. 15, 2005 HPCA-11 How Does SMT Impact the Memory Hierarchy?
- Less performance loss per cache miss to DRAM memory, hence lower benefit from DRAM-side optimizations?
- But more cache misses due to cache contention, hence much more pressure on main memory
- Is DRAM memory design more important or not?

4 Feb. 15, 2005 HPCA-11 Outline
- Motivation
- Memory optimization techniques
- Thread-aware memory access scheduling
  - Outstanding request-based
  - Resource occupancy-based
- Methodology
- Memory performance analysis on SMT systems
  - Effectiveness of single-thread techniques
  - Effectiveness of thread-aware schemes
- Conclusion

5 Feb. 15, 2005 HPCA-11 Memory Optimization Techniques
- Page modes
  - Open page: good for programs with good locality
  - Close page: good for programs with poor locality
- Mapping schemes (see the address-mapping sketch after this slide)
  - Exploitation of concurrency (multiple channels, chips, banks)
  - Avoidance of row buffer conflicts
- Memory access scheduling
  - Reordering of concurrent accesses
  - Reducing average latency and improving bandwidth utilization
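A minimal address-mapping sketch (my illustration, not the paper's scheme): low-order block-address bits select the channel and bank so that consecutive cache blocks spread across them, which exposes concurrency across channels and banks; all bit widths and parameter names are hypothetical.

```cpp
#include <cstdint>

// Hypothetical address-mapping sketch: block address -> (channel, bank, row, column).
// Bit widths are illustrative, not the paper's configuration.
struct DramCoord {
    uint32_t channel, bank, row, column;
};

DramCoord mapAddress(uint64_t blockAddr,
                     uint32_t numChannels,   // e.g. 2, 4, or 8 as in the evaluated systems
                     uint32_t banksPerChip,  // e.g. 4
                     uint32_t colsPerRow) {  // blocks per row buffer
    DramCoord c;
    // Low-order bits interleave consecutive blocks across channels and banks;
    // higher-order bits pick the column within a row and then the DRAM row.
    c.channel = static_cast<uint32_t>(blockAddr % numChannels);  blockAddr /= numChannels;
    c.bank    = static_cast<uint32_t>(blockAddr % banksPerChip); blockAddr /= banksPerChip;
    c.column  = static_cast<uint32_t>(blockAddr % colsPerRow);   blockAddr /= colsPerRow;
    c.row     = static_cast<uint32_t>(blockAddr);                // remaining bits select the row
    return c;
}
```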

6 Feb. 15, 2005 HPCA-11 Memory Access Scheduling for Single-Threaded Systems
- Hit-first: a row buffer hit has a higher priority than a row buffer miss
- Read-first: a read has a higher priority than a write
- Age-based: an older request has a higher priority than a newer one
- Criticality-based: a critical request has a higher priority than a non-critical one (a comparator sketch combining the first three rules follows this slide)
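As an illustrative sketch (field names are assumptions, not the paper's simulator code), the comparator below encodes the hit-first, read-first, and age-based rules as a strict priority order:

```cpp
#include <cstdint>

// Hypothetical request descriptor; field names are assumptions for illustration.
struct MemRequest {
    bool     rowBufferHit;  // would hit the currently open row
    bool     isRead;        // read vs. write
    uint64_t arrivalCycle;  // older requests have smaller arrival times
};

// Returns true if request a should be scheduled before request b,
// applying hit-first, then read-first, then age-based ordering.
bool scheduleBefore(const MemRequest& a, const MemRequest& b) {
    if (a.rowBufferHit != b.rowBufferHit) return a.rowBufferHit;  // hit-first
    if (a.isRead != b.isRead)             return a.isRead;        // read-first
    return a.arrivalCycle < b.arrivalCycle;                       // age-based: older first
}
```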

7 Feb. 15, 2005 HPCA-11 Memory Access Concurrency with Multithreaded Processors
[Figure: processor-to-memory request streams for a single-threaded vs. a multi-threaded processor]

8 Feb. 15, 2005 HPCA-11 Thread-Aware Memory Scheduling
- New dimension in memory scheduling for SMT systems: considering the current state of each thread
- Thread states related to memory accesses:
  - Number of outstanding requests
  - Number of processor resources occupied

9 Feb. 15, 2005 HPCA-11 Outstanding Request-Based Scheme
- Request-based: a request generated by a thread with fewer pending requests has a higher priority (see the selection sketch after this slide)
[Figure: timelines contrasting two service orders for row-buffer hits A1-A4 from thread A and B1-B2 from thread B]
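A minimal selection sketch (assumed data structures, not the paper's implementation) of the outstanding request-based rule: among queued requests, prefer the one whose issuing thread has the fewest requests still pending:

```cpp
#include <cstddef>
#include <vector>

// requestThreadIds[i] is the thread that issued queued request i;
// outstandingPerThread[t] counts how many of thread t's requests are still pending.
// Returns the index of the request whose thread has the fewest pending requests.
size_t pickByFewestOutstanding(const std::vector<int>& requestThreadIds,
                               const std::vector<int>& outstandingPerThread) {
    size_t best = 0;
    for (size_t i = 1; i < requestThreadIds.size(); ++i) {
        if (outstandingPerThread[requestThreadIds[i]] <
            outstandingPerThread[requestThreadIds[best]])
            best = i;  // fewer pending requests -> higher priority
    }
    return best;
}
```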

10 Feb. 15, 2005 HPCA-11 Outstanding Request-Based Scheme
- Request-based, with hit-first and read-first applied on top
- For SMT processors, sustained memory bandwidth is more important than the latency of an individual access
[Figure: timelines contrasting two service orders when thread A issues row-buffer hits A1-A4 and thread B issues row-buffer misses B1-B2]

11 Feb. 15, 2005 HPCA-11 Resource Occupancy-Based Scheme
- ROB-based: higher priority to requests from threads holding more ROB entries
- IQ-based: higher priority to requests from threads holding more IQ entries
- Hit-first and read-first are applied on top (a minimal tie-break sketch follows this slide)
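As an illustrative sketch (not the paper's code), the occupancy-based rule can be expressed as a tie-break applied after the hit-first and read-first rules from the earlier comparator; swapping ROB occupancy for IQ occupancy gives the IQ-based variant:

```cpp
#include <vector>

// Resource occupancy-based tie-break (illustrative): given two competing requests that
// hit-first and read-first could not separate, prefer the one from the thread currently
// holding more ROB entries. Pass per-thread IQ occupancy instead for the IQ-based variant.
bool occupancyPrefers(int threadA, int threadB,
                      const std::vector<int>& robEntriesHeld) {
    return robEntriesHeld[threadA] > robEntriesHeld[threadB];  // more ROB entries -> higher priority
}
```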

12 Feb. 15, 2005 HPCA-11 Methodology
- Simulator
  - SMT extension of sim-Alpha
  - Event-driven memory simulator (DDR SDRAM and Direct Rambus DRAM)
- Workloads
  - Mixtures of SPEC 2000 applications
  - 2-, 4-, and 8-thread workloads
  - "ILP", "MIX", and "MEM" workload mixes

13 Feb. 15, 2005 HPCA-11 Simulation Parameters
- Processor speed: 3 GHz
- Fetch width: 8 inst.
- Baseline fetch policy: DWarn.2.8
- Pipeline depth: 11
- Issue queue size: 64 Int., 32 FP
- Reorder buffer size: 256/thread
- Physical register num: 384 Int., 384 FP
- Load/store queue size: 64 LQ, 64 SQ
- L1 caches: 64KB I/D, 2-way, 1-cycle latency
- L2 cache: 512KB, 2-way, 10-cycle latency
- L3 cache: 4MB, 4-way, 20-cycle latency
- MSHR entries: (16+4 prefetch)/cache
- Memory channels: 2/4/8
- Memory BW/channel: 200 MHz, DDR, 16B width
- Memory banks: 4 banks/chip
- DRAM access latency: 15ns row, 15ns column, 15ns precharge (an illustrative configuration sketch follows this slide)
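Purely for illustration, the memory-system portion of the parameter table could be captured in a small configuration struct like the sketch below; the struct and field names are assumptions, only the values come from the slide.

```cpp
// Illustrative memory-system configuration mirroring the table above;
// names and units are assumptions, values are taken from the parameter list.
struct MemoryConfig {
    int   numChannels;        // 2, 4, or 8 in the evaluated configurations
    int   banksPerChip;       // 4 banks/chip
    int   busMHz;             // 200 MHz channel clock
    bool  doubleDataRate;     // DDR: transfers on both clock edges
    int   busWidthBytes;      // 16B-wide channel
    float rowLatencyNs;       // 15 ns row access
    float columnLatencyNs;    // 15 ns column access
    float prechargeLatencyNs; // 15 ns precharge
};

// One of the evaluated configurations (4 channels), per the parameter table.
static const MemoryConfig kFourChannelDDR = {4, 4, 200, true, 16, 15.0f, 15.0f, 15.0f};
```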

14 Feb. 15, 2005 HPCA-11 Workload Mixes
- 2-thread
  - ILP: bzip2, gzip
  - MIX: gzip, mcf
  - MEM: mcf, ammp
- 4-thread
  - ILP: bzip2, gzip, sixtrack, eon
  - MIX: gzip, mcf, bzip2, ammp
  - MEM: mcf, ammp, swim, lucas
- 8-thread
  - ILP: gzip, bzip2, sixtrack, eon, mesa, galgel, crafty, wupwise
  - MIX: gzip, mcf, bzip2, ammp, sixtrack, swim, eon, lucas
  - MEM: mcf, ammp, swim, lucas, equake, applu, vpr, facerec

15 Feb. 15, 2005 HPCA-11 Performance Loss Due to Memory Access

16 Feb. 15, 2005 HPCA-11 Memory Access Concurrency

17 Feb. 15, 2005 HPCA-11 Memory Channel Configurations

18 Feb. 15, 2005 HPCA-11 Memory Channel Configurations

19 Feb. 15, 2005 HPCA-11 Mapping Schemes

20 Feb. 15, 2005 HPCA-11 Memory Access Concurrency

21 Feb. 15, 2005 HPCA-11 Thread-Aware Schemes

22 Feb. 15, 2005 HPCA-11 Conclusion
DRAM optimizations have significant impacts on the performance of SMT (and likely CMP) processors:
- They are mainly effective when a workload mix includes some memory-intensive programs
- Performance is sensitive to the memory channel organization
- DRAM-side locality is harder to exploit due to contention among threads
- Thread-aware access scheduling schemes do bring good performance

