Is SC + ILP = RC?
Chris Gniady, Babak Falsafi, and T. N. Vijaykumar, Purdue University
Presented by Vamshi Kadaru
Spring 2005: CS 7968 Parallel Computer Architecture
Introduction
- Multiprocessors are widely available: how do we maximize performance?
- Atomicity of operations (synchronization)
- Allow in-order processors to overlap store latency with other work (e.g., bypassing loads, overlapping with network latency)
- Allow processors to execute out of order (speculation)
- There is a trade-off between programmability and performance
- To simplify programming, implement a shared-memory abstraction
Memory Models
- Shared-memory systems implement memory consistency models
- Different models make different guarantees; the processor can reorder and overlap memory operations as long as the guarantees are upheld
- Sequential Consistency (SC) is the simplest model: memory operations appear to execute in program order
- Relaxed memory models require only some memory operations to perform in program order
- Release Consistency (RC) is the most permissive of the relaxed memory models, and so offers the best performance
Current Memory Consistency Models
- Sequential Consistency (SC): HP and MIPS processors
- Processor Consistency (PC): Intel processors
- Total Store Order (TSO): Sun SPARC
- Release Consistency (RC): Sun SPARC, DEC Alpha, IBM PowerPC
Current Optimizations
Techniques used to exploit ILP:
- Branch prediction
- Multiple instructions executed per cycle
- Non-blocking caches to overlap memory operations
- Out-of-order execution
- Precise exceptions and speculative execution, implemented with a reorder buffer
Comparing SC and RC
Sequential Consistency (SC):
- Guarantees memory order in hardware
- Easier to program
- Its conservative ordering limits performance
Release Consistency (RC):
- Memory order is enforced in software, by the programmer
- Harder to program; more burden on the programmer
- Achieves the highest performance because ordering constraints are made explicit
SC Implementations
Current SC implementations use ILP optimizations:
- Hardware prefetching and non-blocking caches to overlap loads and stores, using the reorder buffer
- Speculative load execution, using the reorder buffer and a special history buffer to roll back on an invalidation
Limitations:
- Stores cannot bypass other memory operations
- Long-latency remote stores fill the relatively small reorder buffer and load/store queue, blocking the pipeline
- Capacity and conflict misses in small L2 caches cause frequent rollbacks
RC Implementations
- RC allows the programmer to specify ordering constraints among specific memory operations (fence instructions) to enforce order
- RC implementations use store buffering to allow loads and stores to bypass pending stores
- Unlike SC, RC can use binding prefetches to perform loads in the reorder buffer
- RC can also relax ordering around fence instructions and use rollback mechanisms when a memory-model violation occurs
SC Programmability with RC Performance
SC can approach RC if the hardware provides support for:
- Speculatively relaxing the order of loads and stores
- Loads and stores appearing to take place atomically and in program order
- Instructions executing out of program order
- Remembering processor state for rollbacks
Limitations (costs):
- Speculative memory order carries no guarantees until operations commit
- Rollbacks must be infrequent (enough buffering space is needed)
SC++ Architecture
- Modeled after the MIPS R10000
- The Speculative History Queue (SHiQ) enables prefetching and non-blocking caches; other processors still observe SC
- The history buffer allows speculative retirement, unblocking stores in the reorder buffer (ROB)
- The load/store queue takes stores from the ROB
- The Block Lookup Table (BLT) holds block addresses for the SHiQ
Experimental Setup
- Simulator: RSIM, modeling an 8-node DSM
- Each DSM node is an R10000-like processor
- The memory-model implementations use non-blocking caches, hardware prefetching for loads and stores, and speculative load execution
- No speculative retirement is done in either SC or RC
Base System Configuration
- Each R10000-like processor node has the above configuration
- A large L2 cache eliminates capacity and conflict misses
- The base configuration is used unless otherwise specified
Some Points to Remember
Both the SC and RC implementations:
- Use non-blocking caches
- Use hardware prefetching for loads and stores
- Perform speculative loads
SC++ additionally uses:
- The Speculative History Queue (SHiQ)
- The Block Lookup Table (BLT)
Rollback costs:
- Rolling back instructions in the reorder buffer takes one cycle
- Rolling back instructions in the SHiQ takes four cycles
Results – Base System
- Speedup is normalized to the SC implementation
- RC outperforms SC, with the largest gain on radix
- SC++ performs better than or equal to RC; on raytrace it performs significantly better
Results – Network Latency
- Network latency is increased by 4x
- RC hides the network latency by overlapping stores
- SC++ with an unbounded SHiQ (SC++inf) keeps up with RC
- raytrace performs worse because the longer network latency dominates its lock access patterns
Results – Reorder Buffer Size
- A larger reorder buffer allows more time for prefetches, hiding store latencies and speeding up both SC and RC
- In raytrace, neither SC nor RC speeds up: memory operations do not overlap much
- In unstructured, the gap grows, due to an increase in the number of rollbacks in SC
Results – SHiQ Size and Speculative Stores
- The absence of speculative stores causes significant performance loss in radix and raytrace
- Reducing the SHiQ size degrades performance in em3d and radix
Results – L2 Cache Size
Two effects of a smaller L2 cache:
- Less room for speculative state, which widens the gap
- Many more load misses for both SC and RC, which can narrow the performance gap
Observations:
- In lu and radix, the high load miss rate degrades performance
- SC++ is also sensitive to rollbacks caused by replacements
Conclusions
- SC can perform on par with RC if the hardware provides enough support for speculation
- SC++ allows speculative bypassing for both loads and stores
- SC++ keeps additional overhead off the processor pipeline's critical paths using two structures:
  - SHiQ: stores speculative state and absorbs remote latencies
  - BLT: allows fast lookups into the SHiQ