hLRC: Lazy Release Consistency For GPUs


1 hLRC: Lazy Release Consistency For GPUs
John Alsop, Marc Orr, Brad Beckmann, David Wood

2 Motivation
GPU systems use simple coherence actions and scoped synchronization. Scopes complicate memory models and limit dynamic sharing, which motivates innovations to the memory system. Past work: Remote-scope promotion (RSP) [ASPLOS'15] allows dynamic sharing with scopes, but further complicates memory models, and the initial implementation scales poorly. Past work: DeNovo coherence for GPUs [MICRO'15] eliminates scopes and avoids costly coherence actions for registered data, but adds registration overhead. We introduce heterogeneous Lazy Release Consistency (hLRC): exploit synchronization locality to avoid costly coherence actions (similar to LRC in ISCA'92), with scopes for performance, not correctness.
[diagram: work-groups wg0 and wg1 (threads t0-t3) with per-work-group scopes wg_scope0/wg_scope1 inside a shared agent scope; data is flushed at the agent-scope boundary]
Modern GPU systems rely on simple, bulk coherence actions to keep cached data coherent, flushing or invalidating the entire L1 cache at synchronization boundaries. To mitigate these high synchronization costs, scoped memory models are used that allow programmers to limit the visibility required by a synchronization operation. As past work has argued, scopes complicate the memory model and limit the types of dynamic data sharing that can be exploited. For example, if an access in one scope (e.g. wg0) may ever synchronize with an access from another scope (wg1), both accesses must always use a common scope level; otherwise the program has a heterogeneous race. This is problematic for patterns like work stealing. Previous work has tried to address these inefficiencies. RSP allows a programmer to promote the scope of a local synchronization on a remote core, but the proposed implementation scales poorly, and the added semantics complicate the memory model even further. DeNovo has shown that it is possible to achieve efficient heterogeneous coherence without using scopes, avoiding the added complexity of scopes and the issues with dynamic sharing.
DeNovo achieves this using a registration mechanism to track modified data, enabling this data to avoid coherence actions and exploit more locality. However, modified data is an uncommon source of reuse in many GPU kernels, and registering all modified data incurs overheads that often outweigh the benefits. As an alternative, we introduce heterogeneous Lazy Release Consistency (hLRC). hLRC combines ideas from RSP (it avoids coherence actions entirely for locally scoped synchronization), from DeNovo (it does not require scopes to exploit locality, and it uses DeNovo registration to track synchronization variables), and from lazy release consistency for distributed-shared-memory CPU systems (it detects synchronization locality and performs coherence actions only when a synchronization variable moves, which indicates a potential remote synchronization).
M. S. Orr et al., "Synchronization using remote-scope promotion," ASPLOS 2015.
M. D. Sinclair et al., "Efficient GPU synchronization without scopes: Saying no to complex consistency models," MICRO 2015.
P. Keleher et al., "Lazy release consistency for software distributed shared memory," ISCA 1992.

3 Scoped Synchronization
Hower et al., ASPLOS 2014
[diagram: work-groups wg0 (t0, t1) and wg1 (t2, t3) inside an agent scope; thread 0 performs st(data,2) then a release, thread 2 performs an acquire then ld(R1, data); a st_rel_wg(L, 0)/cas_acq_wg(&L, 0, 1) pair within a work-group is OK, a st_rel_agt(L, 0)/cas_acq_agt(&L, 0, 1) pair is OK, but a wg-scope release paired with an agent-scope acquire is a RACE]
Let's first discuss why scoped synchronization can be so problematic. With a scoped memory model such as HRF, we can easily use wg-scoped accesses to synchronize between threads in the same work-group. The threads share an L1 cache, so no coherence actions are necessary. We can also use agent-scoped accesses to synchronize between threads that share an agent scope, since this triggers coherence actions (an L1 flush and invalidate) that expand the visibility of shared data to agent scope. However, if we try to synchronize without using a shared scope in both threads, this forms a heterogeneous synchronization race, and the resulting program is not guaranteed to be SC. Writing correct HRF programs is difficult and requires careful consideration of the relative scopes of communicating threads. In addition, dynamic sharing patterns such as work stealing, where a thread synchronizes with local threads in the common case but occasionally needs to synchronize with a remote stealing thread, cannot achieve efficient reuse. This has motivated changes and alternative models, discussed next.
Complex memory model: heterogeneous races. Dynamic sharing not possible.
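The scope rules above can be sketched in a few lines of Python. This is a toy model with assumed names (WorkGroupCache, release, acquire), not a real GPU simulator; it only shows why a wg-scope release paired with an agent-scope acquire reads stale data.

```python
# A toy model of HRF-style scoped release/acquire: wg-scope releases keep
# data in the L1, agent-scope releases flush it to the shared L2.

class WorkGroupCache:
    def __init__(self, l2):
        self.l2 = l2
        self.lines = {}              # this work-group's L1 contents

    def store(self, addr, val):
        self.lines[addr] = val       # stores stay in the L1 until flushed

    def load(self, addr):
        return self.lines.get(addr, self.l2.get(addr))

    def release(self, scope):
        if scope == "agent":         # agent-scope release: flush L1 to L2
            self.l2.update(self.lines)

    def acquire(self, scope):
        if scope == "agent":         # agent-scope acquire: invalidate L1
            self.lines.clear()

l2 = {}
wg0, wg1 = WorkGroupCache(l2), WorkGroupCache(l2)

wg0.store("data", 42)
wg0.release("wg")                    # wg-scope release: data never reaches L2
wg1.acquire("agent")
stale = wg1.load("data")             # heterogeneous race: wg1 sees nothing

wg0.release("agent")                 # matching agent-scope release
wg1.acquire("agent")
fresh = wg1.load("data")             # now the write is visible through L2
print(stale, fresh)                  # -> None 42
```

The mismatched pair silently returns stale data, which is exactly why HRF classifies it as a race rather than defining a result.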

4 Remote Scope Promotion
Orr et al., ASPLOS 2015
Key idea: promote scope at remote cores. Add new memory orders for remote synch. Implementation uses broadcasts: remote-scope acquire broadcasts a flush; remote-scope release broadcasts an invalidate.
[diagram: work-groups wg0 and wg1 with scopes wg_scope0 through wg_scopeN inside an agent scope; a remote-scope operation broadcasts flushes to every other CU]
Remote scope promotion adds flexibility to the HRF memory model by allowing local synchronization accesses at remote cores to be promoted to a common scope level. A new memory order is added, and the programmer must identify when such a promotion could be necessary. The promotion is implemented by broadcasting coherence actions to all CUs in the desired promotion level. E.g., if an acquire in wg1 wants to synchronize with a wg-scope release in wg0 (e.g. for a steal from a remote task queue), it uses remote agent scope, which broadcasts flush commands to all other CUs in the agent scope, effectively promoting the scope of all prior releases on those CUs. Additional global RMW lock and unlock actions are also needed for remote-scope operations that write a value, described in more detail in the paper. This strategy allows more flexible synchronization patterns, but it does nothing to address the complexity of the scoped memory model, and it scales poorly due to the broadcast operations.
Enables dynamic sharing. Adds complexity to the memory model. Implementation does not scale to larger core counts.
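The broadcast implementation described above can be sketched as a toy model; class and function names here are hypothetical illustrations, not AMD's design. The point is that a remote-agent acquire must flush every other CU's L1, so its cost grows linearly with the CU count.

```python
# A toy model of remote-scope promotion: a remote-agent acquire broadcasts
# flushes to all other CUs so their prior wg-scope releases become visible.

class CU:
    def __init__(self, l2):
        self.l2, self.l1 = l2, {}

    def store(self, addr, val):
        self.l1[addr] = val          # wg-scope release leaves this in the L1

    def load(self, addr):
        return self.l1.get(addr, self.l2.get(addr))

def remote_acquire(acquirer, all_cus):
    """Promote prior wg-scope releases on all CUs; return the action count."""
    actions = 0
    for cu in all_cus:
        if cu is not acquirer:
            cu.l2.update(cu.l1)      # broadcast flush to each remote L1
            actions += 1
    acquirer.l1.clear()              # plus the local invalidate
    return actions + 1

l2 = {}
cus = [CU(l2) for _ in range(128)]
cus[0].store("task", 7)              # wg-scope release in CU0
before = cus[1].load("task")         # not yet visible to CU1
cost = remote_acquire(cus[1], cus)   # 128 coherence actions at 128 CUs
after = cus[1].load("task")          # visible after the broadcast flush
print(before, cost, after)           # -> None 128 7
```

Doubling the CU count doubles the cost of every remote-scope operation, which is the scaling problem the evaluation later quantifies.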

5 DeNovo coherence Supports DRF (scope-free) memory model
Sinclair et al., MICRO 2015
Key idea: registration reduces flush/invalidation costs. Obtain exclusive registration for writes and atomics; the L2 keeps track of the up-to-date copy (local or remote). Release: register non-registered dirty data. Acquire: invalidate all non-registered data.
[diagram: work-groups wg0 and wg1 inside an agent scope; written data stays registered ("R: data") in wg0's L1 across the flush]
We next consider DeNovo coherence. The key idea of DeNovo is not to avoid performing invalidates and flushes on synchronization actions; instead, DeNovo aims to make them less expensive. It does this by obtaining exclusive registration for writes and atomic accesses, tracked by the L2. The L2 will therefore always have either an up-to-date copy of cached data or the id of the core with the up-to-date copy. On a release, all non-registered dirty data obtains registration from the L2. On an acquire, only the non-registered data in a core is invalidated. The idea is that, if a producer in wg0 writes some data, it can synchronize with a consumer at any other point in the system. If the next access is local (e.g. a task queue pop), the registered data is unaffected by the synchronization, and DeNovo can exploit locality in this data. If the next access is remote (e.g. a remote steal), the L2 can tell the reader where to find an up-to-date copy. DeNovo directly addresses the problem of scoped synchronization complexity, enabling a data-race-free (scopeless) memory model. It is able to exploit significant reuse in registered data, since data can stay locally registered indefinitely. However, reuse for non-registered data is limited. The overheads of registration (the added state, L2 inclusivity, and the indirection required when requested data is remotely registered) can therefore harm performance when there is minimal reuse of modified data. This is true in the graph applications studied in this paper, where most L1 reuse comes from read-read locality.
Supports DRF (scope-free) memory model. Exploits locality in registered data. Reuse for non-registered data is limited. Added state, indirection for registration.
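The registration mechanism can be sketched as follows. This is an assumed, simplified model of the behavior described above (registration per address, no coherence regions), not the published protocol in full detail.

```python
# A toy model of DeNovo-style registration: the L2 tracks which core owns
# the up-to-date copy, so registered lines survive acquires and remote
# readers reach them through one level of indirection.

class SharedL2:
    def __init__(self):
        self.data = {}
        self.owner = {}              # addr -> id of CU holding registered copy

class CU:
    def __init__(self, cid, l2):
        self.cid, self.l2 = cid, l2
        self.l1, self.registered = {}, set()

    def store(self, addr, val):
        self.l1[addr] = val

    def release(self):
        # Register non-registered dirty data instead of flushing it.
        for addr in self.l1:
            self.l2.owner[addr] = self.cid
            self.registered.add(addr)

    def acquire(self):
        # Invalidate only non-registered lines; registered data survives.
        self.l1 = {a: v for a, v in self.l1.items() if a in self.registered}

    def load(self, addr):
        if addr in self.l1:
            return self.l1[addr]
        owner = self.l2.owner.get(addr)
        if owner is not None:
            return cus[owner].l1[addr]   # indirection to the registered copy
        return self.l2.data.get(addr)

l2 = SharedL2()
cus = [CU(i, l2) for i in range(2)]
cus[0].store("data", 5)
cus[0].release()                     # CU0 registers "data" at the L2
cus[0].acquire()                     # registered line survives the acquire
local_reuse = cus[0].load("data")    # local hit: reuse exploited
remote_read = cus[1].load("data")    # remote read via L2 owner indirection
print(local_reuse, remote_read)      # -> 5 5
```

The same mechanism that enables local reuse (the owner table) is what costs extra state and an indirection hop on remote reads.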

6 Heterogeneous Lazy Release Consistency For GPUs
Our paper
Key idea: exploit synch locality to avoid coherence actions. Use DeNovo registration for atomics only. Lazily perform coherence actions only when remote synch is detected (LRC): a change of atomic registration indicates a possible remote synch. Invalidate when a cache obtains registration; flush when a cache loses registration.
[diagram: work-groups wg0 and wg1 inside an agent scope; the atomic's registration moving from wg0 to wg1 triggers a flush at wg0 and an invalidate at wg1]
Heterogeneous lazy release consistency takes ideas from both models. Like scoped synchronization, it tries to avoid coherence actions completely when synchronization is local. However, instead of using synchronization scope to detect local synchronization, it uses the locality of the synchronization variable itself. Like DeNovo, it uses registration at the L2 to track atomic variables. Unlike DeNovo, it does not obtain registration for written data, greatly reducing registration overhead. By tracking atomic variables, hLRC can detect remote synchronization and perform coherence actions only at those points, similar to lazy release consistency schemes for distributed shared memory. This works because each acquire or release access must obtain local registration. If the registration location of an atomic variable changes, this indicates a possible remote synchronization, and cached data should be flushed or invalidated accordingly: invalidated when a cache obtains registration, and flushed when a cache loses registration. To see how this automatically avoids coherence actions, consider the same system. A release in wg0 synchronizes on an atomic, bringing it into its cache and invalidating the non-registered data. All subsequent synchronization accesses from the same work-group will hit in the cache and avoid coherence actions. It is only when a thread from wg1 accesses the atomic that the atomic registration changes, triggering a flush at wg0 and an invalidate at wg1.
This scheme automatically detects local synchronization rather than relying on scopes, it adds minimal state overhead relative to DeNovo, and, perhaps most importantly, scopes are not necessary for correctness: any atomic access can correctly synchronize with any other atomic access. The downside is that the scheme is sensitive to any atomic-variable movement, triggering potentially wasteful actions when synchronization locality is absent. These disadvantages can be mitigated by using scopes to indicate the expected level of synchronization, which is described in more detail in the paper.
Synch locality automatically detected. Minimal added registration overhead. Scopes not necessary for correctness. Sensitive to synchronization locality.
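The lazy flush/invalidate rule above can be sketched as a small state machine. The names (System, sync, atomic_owner) are illustrative assumptions; the sketch only tracks which cache holds an atomic's registration and emits actions when that changes.

```python
# A toy model of hLRC: coherence actions fire only when an atomic's
# registration moves between caches, so repeated local synchronization
# on the same atomic is free.

class System:
    def __init__(self, num_cus):
        self.l2 = {}
        self.l1 = [dict() for _ in range(num_cus)]
        self.atomic_owner = {}       # atomic addr -> CU currently registered
        self.actions = []            # log of coherence actions performed

    def sync(self, cid, atomic):
        """An acquire or release on `atomic` by CU `cid`."""
        owner = self.atomic_owner.get(atomic)
        if owner == cid:
            return                   # local synchronization: no action at all
        if owner is not None:
            self.l2.update(self.l1[owner])        # flush CU losing registration
            self.l1[owner].clear()
            self.actions.append(("flush", owner))
        self.l1[cid].clear()         # invalidate CU gaining registration
        self.actions.append(("invalidate", cid))
        self.atomic_owner[atomic] = cid

s = System(2)
s.sync(0, "lock")                    # wg0 obtains registration: 1 invalidate
s.sync(0, "lock")                    # repeated local synch: no actions
s.sync(0, "lock")
s.sync(1, "lock")                    # remote steal: flush wg0, invalidate wg1
print(s.actions)
# -> [('invalidate', 0), ('flush', 0), ('invalidate', 1)]
```

Three local synchronizations cost one action total; only the remote steal pays the flush/invalidate pair, which is the locality hLRC is built to exploit.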

7 Overview Comparison
Local registration?  Scopes: none;  Scopes+RSP: none;  DeNovo: atomics and written data;  hLRC: atomics only
To review: Scopes and Scopes+RSP do not register any data and incur no registration overhead. DeNovo registers atomics and written data, and hLRC uses DeNovo registration for atomics only.

8 Overview Comparison
Local registration?     Scopes: none;  Scopes+RSP: none;  DeNovo: atomics and written data;  hLRC: atomics only
Synch cost depends on:  Scopes: scope;  Scopes+RSP: scope, # processors;  DeNovo: registered data reuse;  hLRC: location of synchronization variable
For scoped synchronization, the coherence cost of a synchronization access depends mostly on the scope used for the synchronization. For Scopes+RSP, it again depends on the scope, but it can also depend on the number of processors in the case of a broadcast operation. The coherence cost of synchronization in DeNovo depends on the level of reuse in registered data, and for hLRC it depends on the registered location of the target synchronization variable.

9 Overview Comparison
Local registration?     Scopes: none;  Scopes+RSP: none;  DeNovo: atomics and written data;  hLRC: atomics only
Synch cost depends on:  Scopes: scope;  Scopes+RSP: scope, # processors;  DeNovo: registered data reuse;  hLRC: location of synchronization variable
Acquire actions:
  Scopes:      wg scope: none;  agent scope: invalidate L1
  Scopes+RSP:  wg scope: none;  agent scope: invalidate L1;  remote agent scope: bcast flush + invalidate L1
  DeNovo:      invalidate non-registered L1 data
  hLRC:        registered in local L1: none;  registered at L2: invalidate L1;  registered at remote L1: flush remote L1 + invalidate local L1
Now let's examine the actions necessary for each synchronization access type. With scoped synchronization, an access with acquire semantics triggers no action if it is wg scope, and an L1 invalidation if agent scope. This is true for Scopes+RSP as well, but there is an added remote-agent scope that triggers a broadcast flush in addition to an L1 invalidation. DeNovo always invalidates non-registered data on an acquire. hLRC does nothing if the atomic access hits in the L1, triggers an L1 invalidation if the variable is registered at the L2, and triggers a flush at the owning L1 plus an invalidate at the local L1 if it is registered at a remote L1.

10 Overview Comparison
Local registration?     Scopes: none;  Scopes+RSP: none;  DeNovo: atomics and written data;  hLRC: atomics only
Synch cost depends on:  Scopes: scope;  Scopes+RSP: scope, # processors;  DeNovo: registered data reuse;  hLRC: location of synchronization variable
Acquire actions:
  Scopes:      wg scope: none;  agent scope: invalidate L1
  Scopes+RSP:  wg scope: none;  agent scope: invalidate L1;  remote agent scope: bcast flush + invalidate L1
  DeNovo:      invalidate non-registered L1 data
  hLRC:        registered in local L1: none;  registered at L2: invalidate L1;  registered at remote L1: flush remote L1 + invalidate local L1
Release actions:
  Scopes:      wg scope: none;  agent scope: flush L1
  Scopes+RSP:  wg scope: none;  agent scope: flush L1;  remote agent scope: flush L1 + bcast invalidate
  DeNovo:      register non-registered L1 dirty data
  hLRC:        registered in local L1: none;  registered at L2: invalidate L1;  registered at remote L1: flush remote L1 + invalidate local L1
Similarly, a release in Scopes and Scopes+RSP only triggers a flush if it is agent scope, and it additionally performs a broadcast invalidate for remote agent scope. DeNovo must obtain registration for non-registered dirty L1 data on a release. hLRC performs the same actions as for an acquire: nothing if locally registered, an L1 invalidate if registered at the L2, and a flush of the owning L1 plus an invalidate of the local L1 if registered at a remote L1.

11 Methodology Simulation environment
Extended version of the AMD gem5 APU simulator: 128 CUs, 16KB L1, 4MB L2. Workloads: benchmarks from the Pannotia benchmark suite: Single-Source Shortest Path (SSSP), Graph Coloring (color), PageRank (PR), with 9 graph inputs from the Florida sparse matrix collection, modified to use per-CU task queues and work stealing to mitigate load imbalance.
We next compare these configurations on a cycle-level architectural simulator. We use an extended version of the publicly available AMD gem5 APU simulator, with 128 CUs, each with 4 SIMD units and 40 hardware wavefronts. Each CU has a 16KB private L1 cache, and a 4MB L2 is shared by all GPU CUs. Finally, a memory controller is shared with an on-chip CPU. We use three benchmarks from the Pannotia benchmark suite (SSSP, graph coloring, and PageRank) with three graph inputs each, taken from the Florida sparse matrix collection. The benchmarks have been modified from their original versions to use per-CU task queues and work stealing to mitigate load imbalance in the workloads.

12 Methodology
Scenario     Coherence    Steal enabled?  Pop scope  Steal scope
baseline     Scopes       no              agent      N/A
scope-only   Scopes       no              wg         N/A
steal-only   Scopes       yes             agent      agent
RSP          Scopes+RSP   yes             wg         remote agent
DeNovo-B     DeNovo*      yes             N/A        N/A
hLRC         hLRC         yes             N/A        N/A
We compare six configurations in our evaluation, with the parameters of each described here. We use scoped synchronization as a baseline and include three versions to demonstrate the relative impact of stealing and local scope. Baseline does not steal and uses agent scope for all synchronization. Scope-only also does not steal, but uses wg scope for all synchronization. Steal-only has stealing enabled, but uses agent scope for all synchronization. RSP adds RSP to scoped synchronization, using wg scope for pop operations and remote agent scope for steals. DeNovo-B uses a simplified version of the DeNovo protocol described in prior work: it tracks registration at block granularity and does not use coherence regions. hLRC is the coherence scheme previously described. DeNovo and hLRC do not use scopes and enable scope-free memory models.
*Block-granularity DeNovo is used, with no coherence regions

13 Evaluation: Speedup
The first thing we notice is that, at 128 CUs, RSP is not an effective method for remote stealing, degrading performance by up to 33%. DeNovo-B performs best for multiple workloads, but in others it is limited by registration overheads. hLRC for the most part matches or exceeds the performance of the best of scope-only and steal-only, and it achieves the largest average speedup.

14 Conclusions
Scopes restrict synchronization efficiency and add complexity. RSP embraces scopes: avoid coherence actions for local synchronization, implement flexibility via broadcasts; but the memory model becomes more complex, and remote-scope actions scale poorly. DeNovo rejects scopes: don't avoid coherence actions, make them cheaper; no heterogeneous races and good registered reuse, but limited other reuse and registration overhead. hLRC automatically avoids coherence actions using synchronization locality: no heterogeneous races, with scopes used for performance when synch locality is absent.
To conclude, we've shown that scoped synchronization limits efficiency for certain synchronization patterns and adds complexity. RSP keeps scopes and coherence-action avoidance, and it addresses the reuse limitations of the model by adding a new remote-promotion access type. However, it does nothing about memory-model complexity, and the new scope type incurs significant overhead at larger system sizes. DeNovo rejects scopes and aims to make coherence actions cheaper rather than avoiding them. By registering written data and atomics, it can exploit significant reuse in registered data while using a scope-free model. However, it is not as good at exploiting other types of locality, which are more common in existing kernels, and the overheads of registration can harm performance. hLRC uses elements of both. Like scoped synchronization and RSP, it avoids coherence actions entirely when possible, but it does so based on the locality of the synchronization variable rather than the scope of the synchronization access. Like DeNovo, it avoids heterogeneous races, and any two atomic accesses can synchronize correctly, but it still enables the use of scopes for performance optimization when synchronization locality is absent.

15 Disclaimer & Attribution
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes. AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. ATTRIBUTION © 2016 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. SPEC is a registered trademark of the Standard Performance Evaluation Corporation (SPEC). Other names are for informational purposes only and may be trademarks of their respective owners.

16 Backup Slides

17 Evaluation: Coherence Actions
Here we see the relative proportions of the different types of invalidate coherence actions triggered for each configuration. The light-blue portion shows the invalidations/flushes that occur at each kernel boundary in every configuration. Scope-only of course performs none; steal-only and DeNovo perform an invalidate or flush/StReg at every pop or steal (dark blue), although DeNovo's are less expensive due to registration; RSP incurs huge numbers of remote coherence actions (yellow) due to the 128-CU broadcasts; and hLRC only performs a flush or invalidate when a synchronization variable moves (green), which is relatively rare in the work-stealing apps studied.
RSP broadcasts trigger huge amounts of wasteful invalidations and flushes. DeNovo must invalidate/flush on every acquire/release, although these are cheaper due to registration. hLRC only flushes and invalidates when a synch variable moves (infrequent for work-stealing apps).

18 Evaluation: Memory, Synch latencies
These graphs show the cumulative latencies of each type of memory operation (top graph) and synchronization operation (lower graph). Due to the massive parallelism in GPUs, it is not possible to relate these directly to the speedup data, but they give a sense of some of the opposing forces at work within each configuration. We clearly see the effects of RSP broadcasts: whenever stealing occurs, the broadcast flushes and invalidates cause load and synch latency to increase. DeNovo achieves improved memory and release latency where reuse of written data is possible (mainly through the L2), but the inability to exploit read-read reuse and the indirection for registered data increase load and release latency for other workloads. hLRC generally only decreases load and release latency, but it can increase atomic access latency because atomic accesses must sometimes trigger multiple sequential coherence actions.
RSP: broadcast actions increase load and acquire latency. DeNovo: increased write-read and write-write reuse (mainly through the L2), decreased read-read reuse. hLRC: generally decreased load and release latency, increased atomic access latency (when steals are common).

19 hLRC: Adding Scope For Performance
[diagram: work-group 0 (thread 0) performs st(V,2) and st_rel_wg(L, 0); work-group 1 (thread 2) performs cas_acq_wg(&L, 0, 1) and ld(R1, V); each transfer of the atomic between work-groups triggers a flush and an invalidate]
Let's now examine hLRC's main weakness: its sensitivity to synchronization locality. The previous evaluation performs all atomic accesses at wg scope; that is, it registers these variables locally in the L1. However, this can be detrimental to performance if the variables are known to exhibit little locality. For example, a synchronization variable that is repeatedly passed between multiple work-groups may trigger excessive coherence actions. Work-group 0 only needs release semantics, but it triggers an invalidation whenever it brings the atomic into its cache and a flush whenever the atomic leaves its cache. Similarly, work-group 1 may only need acquire semantics, but the transfer of the atomic variable triggers both an invalidation and a flush. What's more, these actions are performed sequentially and on demand, potentially adding to the critical path of the program.
If synchronization has minimal locality, hLRC can suffer from excessive actions on the critical path.

20 hLRC: Adding Scope For Performance
[diagram: work-group 0 (thread 0) performs st(data,2) and st_rel_agt(L, 0); work-group 1 (thread 2) performs cas_acq_agt(&L, 0, 1) and ld(R1, data); the atomic stays registered at the L2]
This is where using scope as a performance optimization comes in. If a synchronization variable is expected to exhibit minimal locality, agent scope can be used, and registration is then only necessary at the L2 cache level. Agent-scope acquires and releases trigger L1 invalidations and flushes just as they did for scoped synchronization, but this avoids the wasteful cost of invalidating the cache on a release or flushing on an acquire. In addition, these actions are performed preemptively, off the critical path.
If synchronization has minimal locality ⇒ use agent scope
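The trade-off above can be sketched by counting actions for an atomic that ping-pongs between wg0 and wg1. The counts are a simplified reading of the slides, not measured data, and the function name is hypothetical.

```python
# A sketch of hLRC coherence-action counts for a low-locality atomic:
# at wg scope the registration moves on every access, while agent scope
# keeps the atomic registered at the L2.

def round_trip_actions(scope):
    # One round: wg0 releases L, then wg1 acquires L.
    if scope == "wg":
        # The atomic's registration moves on every access: each work-group
        # pays an invalidate on gaining it and a flush on losing it, both
        # performed on demand, on the critical path.
        return [("invalidate", "wg0"), ("flush", "wg0"),
                ("invalidate", "wg1"), ("flush", "wg1")]
    # Agent scope: the atomic stays registered at the L2, so the release
    # only flushes and the acquire only invalidates, preemptively.
    return [("flush", "wg0"), ("invalidate", "wg1")]

print(len(round_trip_actions("wg")))     # -> 4 critical-path actions
print(len(round_trip_actions("agent")))  # -> 2 off-critical-path actions
```

Agent scope halves the action count per round and, more importantly, moves the remaining actions off the critical path.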

21 hLRC: Adding Scope For Performance
By default, hLRC obtains local registration for all atomics. Remote steals are expected to exhibit less locality ⇒ try using agent scope for steals. Speedup is comparable: hLRC is able to exploit synch locality in the stealing threads.
As mentioned before, hLRC obtains local (L1) registration, or wg scope, for all atomics in the evaluation shown. Since remote steals are expected to be less common and to exhibit less locality, it makes sense to use agent scope for these accesses. However, when we evaluated this, we found the speedup to be comparable to using wg scope only: it turns out that there is sufficient locality in successive steals for these accesses to benefit from wg scope.

22 Remote Scope Promotion
Key idea: promote scope at remote cores. Add a new memory order for remote synch. Remote-scope acquire: broadcast flush. Remote-scope release: broadcast invalidate.
Acquire actions:
  Scopes:  wg scope: none;  agent scope: invalidate L1
  RSP:     adds remote agent scope: bcast flush + invalidate L1
Release actions:
  Scopes:  agent scope: flush L1
  RSP:     adds remote agent scope: flush L1 + bcast invalidate
[diagram: work-groups wg0 and wg1 inside an agent scope; a remote-scope operation broadcasts flushes to the other CUs]
Flexible sharing possible. Adds complexity to the memory model. Does not scale to larger core counts.

23 DeNovo coherence Supports DRF (scope-free) memory model
Key idea: register data to avoid flush/invalidation costs. Obtain registration for writes and atomic accesses; the L2 keeps track of the up-to-date copy (local or remote). Acquire: invalidate all non-registered data. Release: register non-registered dirty data.
Acquire actions:
  Scopes:  wg scope: none;  agent scope: invalidate L1
  DeNovo:  invalidate non-registered L1 data
Release actions:
  Scopes:  agent scope: flush L1
  DeNovo:  register non-registered dirty L1 data
[diagram: work-groups wg0 and wg1 inside an agent scope; written data stays registered ("R: data") in wg0's L1]
Supports DRF (scope-free) memory model. Added state, indirection for registration.

24 Heterogeneous Lazy Release Consistency
Key idea: exploit synchronization locality to reduce coherence actions. DeNovo registration for atomics only. Invalidate when a cache obtains registration; flush when a cache loses registration.
Acquire actions:
  Scopes:  wg scope: none;  agent scope: invalidate L1
  hLRC:    registered in local L1: none;  registered at L2: invalidate L1;  registered at remote L1: flush remote L1 + invalidate local L1
Release actions:
  Scopes:  agent scope: flush L1
  hLRC:    same as acquire actions
[diagram: work-groups wg0 and wg1 inside an agent scope; the atomic's registration ("R: atomic") moves with the synchronizing work-group]
Scopes are optional. Only flush/invalidate when atomic registration changes.

25 L1 Hit rate

