Automatic Detection of Extended Data-Race-Free Regions. Alexandra Jimborean, Jonatan Waern, Per Ekemark, Stefanos Kaxiras (Sweden); Alberto Ros (Murcia, Spain). Speaker notes: present self and team. Previous decades: time is money (performance). Current decade: power is money (energy). This work improves energy efficiency without damaging performance.
Motivation and goal. Multi- and many-core processors and heterogeneous architectures pose challenges for scalable cache coherence, which motivates simplifying or even abandoning hardware cache coherence. Goal: software-hardware co-designs with compiler-assisted cache coherence.
Background: traditional cache coherence. Cache coherence ensures correct propagation of changes to data held in multiple private caches. Problem: not all updates to variables have to be propagated, and unnecessary propagation of updates uses resources in vain. Private data does not require coherence! Solution: design a dual-mode cache coherence protocol and enable coherence on demand.
Motivation: private-shared data. Data can be classified as private or shared at runtime or statically. The compiler identifies most accesses that target temporarily private data! Goal: identify the largest period during which data remains private, i.e. extended data-race-free (xDRF) regions.
Outline: xDRF regions of private accesses; examples; compile-time xDRF analysis; xDRF regions for optimizing coherence protocols; results (performance and energy on many-cores).
Regions: data-race-free code as input. The input is data-race-free code, i.e. there are no data races given the synchronization. [Figure: code delimited by LOCK and UNLOCK.]
Regions: synchronization points. The input is data-race-free code, i.e. there are no data races given the synchronization. Synchronizations divide the code into two categories. [Figure: code delimited by LOCK and UNLOCK.]
Regions: DRF and nDRF regions. The input is data-race-free code, i.e. there are no data races given the synchronization. Synchronizations divide the code into two categories: DRF regions, outside synchronized code, and nDRF regions, i.e. synchronized code. Parallel DRF regions do not share data. Disclaimer: DRF/nDRF is our notation. [Figure: code delimited by LOCK and UNLOCK.]
Regions: happens-before sync. When there is data-flow across a synchronization point, the sync enforces a happens-before ordering. Happens-before syncs are xDRF boundaries.
Regions: lock-set sync. When there is no data-flow across a synchronization point, the sync only ensures atomicity and the nDRF is an enclave. xDRF regions extend across enclave nDRFs.
Regions: lock-set sync. With no data-flow across a synchronization point, the sync only ensures atomicity and the nDRF is an enclave. Extend the synchronization-free semantics across enclave synchronization points.
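A minimal Python sketch of a lock-set style sync, under stated assumptions: the function `run`, the nested `worker`, and the `state` dictionary are hypothetical names chosen for illustration, not code from the talk. The lock only protects a shared counter, and nothing written before the lock in one thread is read after it in another, so in the talk's terms this nDRF would be a candidate enclave.

```python
import threading

def run(n_threads=4):
    """Run workers whose only shared access sits inside a lock (hypothetical example)."""
    state = {"counter": 0}            # shared, touched only inside the nDRF
    results = [None] * n_threads      # each thread writes only its own slot
    lock = threading.Lock()

    def worker(idx):
        # DRF region (before): thread-private computation
        local = sum(x * x for x in range(idx + 1))
        # nDRF region: synchronized code, ensures atomicity of the counter update
        with lock:
            state["counter"] += 1
        # DRF region (after): uses only private data, no data-flow across the sync
        results[idx] = local

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["counter"], results
```

Because the two DRF regions around the lock exchange no data with other threads, treating them as one synchronization-free region would not change the result of `run`.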
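xDRF example: data-flow across sync. When the code on the two sides of the sync conflicts, the sync is an xDRF boundary and the two sides are different xDRF regions. Without a conflict, the xDRF region extends across the sync. [Figure: conflicting accesses marked "Conflict".]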
xDRF example: data-flow across sync. A signal-wait implemented with locks breaks the xDRF region.
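As an illustration of a signal-wait with data-flow across the sync, here is a small Python sketch. The function `handoff` and the `shared` dictionary are hypothetical; `threading.Condition` stands in for the lock-based signal-wait on the slide. The producer writes a value before signalling and the consumer reads it after waiting, so the regions around the sync conflict and the sync would have to remain an xDRF boundary.

```python
import threading

def handoff():
    """Producer writes a value, then signals; consumer waits, then reads it."""
    shared = {"value": None, "ready": False}
    seen = []
    cond = threading.Condition()      # condition variable backed by a lock

    def producer():
        shared["value"] = 42          # DRF-before: write
        with cond:                    # nDRF: signal under the lock
            shared["ready"] = True
            cond.notify()

    def consumer():
        with cond:                    # nDRF: wait under the lock
            while not shared["ready"]:
                cond.wait()
        seen.append(shared["value"])  # DRF-after: read conflicts with the write

    c = threading.Thread(target=consumer)
    p = threading.Thread(target=producer)
    c.start()
    p.start()
    c.join()
    p.join()
    return seen[0]
```

The read of `shared["value"]` is only correct because the wait happens after the signal, which is exactly the happens-before ordering that an enclave nDRF does not need.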
xDRF example: no data-flow across sync. With a signal-wait implemented with locks but no data-flow across the sync, the nDRF is an enclave in the xDRF region.
xDRF example: data-flow across transitive sync. A conflict can also arise through transitive synchronization. [Figure: conflicting accesses marked "Conflict".]
xDRF static analysis. Distinguish between nDRF regions that are xDRF boundaries and nDRF regions that are enclaves in xDRF regions.
Step 1, nDRF regions: identify nDRF regions. Identify the synchronization points (nDRFs) and check the sync variable of each nDRF.
Step 1, nDRF regions: matching nDRF regions. Identify nDRFs that sync with one another. Matching nDRFs sync on the same synchronization variable.
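The matching step can be sketched as a grouping by synchronization variable. This is a simplified illustration, not the analysis from the paper: `match_ndrfs` is a hypothetical helper, and sync variables are assumed to be plain names, whereas a real compiler would need alias information to decide whether two nDRFs use the same variable.

```python
from collections import defaultdict

def match_ndrfs(ndrfs):
    """Group nDRF ids by sync variable.

    ndrfs: iterable of (ndrf_id, sync_var) pairs (hypothetical representation).
    Returns a dict mapping each sync variable to the nDRFs that use it.
    """
    groups = defaultdict(list)
    for ndrf_id, sync_var in ndrfs:
        groups[sync_var].append(ndrf_id)
    return dict(groups)
```

For example, `match_ndrfs([("n1", "lockA"), ("n2", "lockB"), ("n3", "lockA")])` places `n1` and `n3` in the same group because both sync on `lockA`.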
Step 2, DRF regions: DRF paths. For the currently analyzed nDRF, the DRF region before is the union of the DRF paths before it, and the DRF region after is the union of the DRF paths after it. The surrounding nDRFs are either not yet processed or previously identified as xDRF boundaries. Example: DRF-before = {1,2}, DRF-after = {3}.
Step 2, DRF regions (example). For the currently analyzed nDRF: DRF-before = {1,2,3}, DRF-after = {4,5}. [Figure: DRF paths before and after the currently analyzed nDRF, bounded by other nDRFs.]
Step 2, DRF regions (example). For the currently analyzed nDRF: DRF-before = {1,2,3}, DRF-after = {1,3}. More details in the paper.
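One way to picture "union of DRF paths" is a traversal of the control-flow graph that starts at the analyzed nDRF and stops at any other nDRF. The sketch below is an assumption-laden illustration: `drf_region`, `reverse`, and the dictionary-based CFG are hypothetical, and the example graph is invented so that it reproduces the sets DRF-before = {1,2,3} and DRF-after = {1,3}; it is not the figure from the slides.

```python
from collections import deque

def drf_region(cfg, start, stops):
    """Collect blocks reachable from `start` without passing through a block in `stops`.

    cfg:   dict mapping a block to its successor blocks (hypothetical CFG).
    stops: set of nDRF blocks that bound the traversal.
    """
    seen = set()
    work = deque(cfg.get(start, ()))
    while work:
        block = work.popleft()
        if block in stops or block in seen:
            continue
        seen.add(block)
        work.extend(cfg.get(block, ()))
    return seen

def reverse(cfg):
    """Return the CFG with every edge flipped, for walking backwards."""
    rev = {}
    for src, dsts in cfg.items():
        for dst in dsts:
            rev.setdefault(dst, []).append(src)
    return rev

# Assumed example: block 2 enters a loop 1 -> L -> 3 -> 1, where L is the nDRF.
loop_cfg = {2: [1], 1: ["L"], "L": [3], 3: [1]}
drf_after = drf_region(loop_cfg, "L", {"L"})             # {1, 3}
drf_before = drf_region(reverse(loop_cfg), "L", {"L"})   # {1, 2, 3}
```

In a loop, the same DRF block can appear both before and after the nDRF, which is why the two sets overlap in this example.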
Step 2, DRF regions: merging DRF regions. Example: DRF-before = {1,2}, DRF-after = {3}. Two conditions are checked: (1) no conflict between DRF-before and DRF-after; (2) no conflict between the DRFs and the current nDRF.
Step 3, xDRF regions: merging DRF regions. Example: DRF-before = {1,2}, DRF-after = {3}. If (1) there is no conflict between DRF-before and DRF-after and (2) there is no conflict between the DRFs and the current nDRF, the current nDRF is an enclave: merge DRF-before and DRF-after into one xDRF region.
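The two conditions can be sketched with read and write sets. This is a simplified illustration under assumptions: `Region`, `conflicts`, and `is_enclave` are hypothetical names, memory locations are plain strings, and a conflict is taken in the usual sense of two accesses to the same location with at least one write. The actual analysis compares accesses across threads and relies on compiler alias information, which this sketch leaves out.

```python
from dataclasses import dataclass, field

@dataclass
class Region:
    """Read and write sets of a code region (hypothetical representation)."""
    reads: set = field(default_factory=set)
    writes: set = field(default_factory=set)

def conflicts(a, b):
    """True if a and b touch a common location and at least one access is a write."""
    return bool(a.writes & (b.reads | b.writes)) or bool(b.writes & a.reads)

def is_enclave(drf_before, drf_after, ndrf):
    """Apply the two checks: DRF-before vs DRF-after, and each DRF vs the nDRF."""
    if conflicts(drf_before, drf_after):
        return False
    return not (conflicts(drf_before, ndrf) or conflicts(drf_after, ndrf))
```

With `Region(writes={"a"})` before, `Region(writes={"b"})` after, and an nDRF that only updates `"counter"`, `is_enclave` returns `True`; if the region after instead reads `"a"`, it returns `False`.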
Step 3, xDRF regions. For the currently analyzed nDRF: xDRF-before = {1,2}, xDRF-after = {3}. The surrounding nDRFs are either not yet processed or previously identified as non-enclave. If there is no conflict, merge xDRF-before and xDRF-after into one xDRF region. [Figure labels: xDRF before, enclave, currently analyzed nDRF, xDRF after.]
Step 3, xDRF regions: transitive xDRF regions. With matching enclave nDRFs: xDRF-before = {1,2,3,4}, xDRF-after = {5}. More details in the paper.
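To show how regions grow as consecutive nDRFs are found to be enclaves, here is a deliberately simplified Python sketch for straight-line code only. The helper `build_xdrf` and its inputs are hypothetical: the DRF regions are assumed to be listed in program order, with one flag per nDRF between neighbours. The real analysis works on a control-flow graph and on matching nDRFs across threads.

```python
def build_xdrf(drfs, enclave_flags):
    """Merge consecutive DRF regions across nDRFs flagged as enclaves.

    drfs:          DRF region ids in program order (straight-line code assumed).
    enclave_flags: enclave_flags[i] is True if the nDRF between drfs[i] and
                   drfs[i + 1] was classified as an enclave.
    """
    if not drfs:
        return []
    if len(enclave_flags) != len(drfs) - 1:
        raise ValueError("need exactly one flag per nDRF between DRF regions")
    regions = [[drfs[0]]]
    for drf, enclave in zip(drfs[1:], enclave_flags):
        if enclave:
            regions[-1].append(drf)   # enclave: the xDRF region extends across it
        else:
            regions.append([drf])     # boundary: start a new xDRF region
    return regions
```

With five DRF regions and flags `[True, True, True, False]`, the first four merge into one xDRF region and the fifth stays separate.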
Evaluation: xDRF in practice. Number of static nDRFs: 70% non-enclave (inherent to the applications), 20% enclave (automatically detected), 10% potentially enclave (oracle). Number of executed xDRF regions: enclave nDRFs are on the hot path, and the compiler approaches the oracle.
xDRF and coherence: how are xDRF regions useful? In a compiler-assisted cache coherence protocol, deactivate coherence for xDRF accesses (temporarily private data).
xDRF and coherence: traditional cache coherence. Traditional c.c.: 3×N coherence actions.
xDRF and coherence: optimized cache coherence. Traditional c.c.: 3×N coherence actions. Optimized c.c.: N+1 coherence actions.
xDRF and coherence: xDRF cache coherence. Traditional c.c.: 3×N coherence actions. Optimized c.c.: N+1 coherence actions. [Figure: xDRF c.c. shown alongside traditional and optimized c.c.]
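The two counts stated on the slides can be compared with a few lines of Python. The function `coherence_actions` is a hypothetical helper, and N is taken to be whatever quantity the slide counts, which the extracted text does not define; only the traditional (3×N) and optimized (N+1) formulas are used.

```python
def coherence_actions(n):
    """Return (traditional, optimized, ratio) for the counts 3*N and N+1."""
    traditional = 3 * n
    optimized = n + 1
    return traditional, optimized, optimized / traditional
```

For N = 4 this gives 12 actions against 5, and as N grows the ratio (N+1)/(3N) approaches one third.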
Evaluation: performance of the coherence protocol, normalized to a traditional protocol. DRF only: -4%. Compiler: 7%. Oracle: 8%. The compiler is competitive with the oracle.
Evaluation: energy savings of the coherence protocol. DRF only: -10%. Compiler: 12%. Oracle: 16%. The compiler is competitive with the oracle.
Conclusions. Compiler techniques detect xDRF regions. xDRF regions enable cache coherence optimizations. Performance: 7%; energy savings: 12%.
Thank you! Travelling to this conference was financed by the Swedish Wenner-Gren foundation.
Future work: enable compiler optimizations over xDRF regions. Typical compiler optimizations for multi-threaded code operate within synchronization-free regions. xDRF extends the scope of these optimizations, making them more effective.
Future work: larger xDRF regions. Instead of splitting xDRF regions upon a conflict, place nDRF markings around the conflicting accesses, so that any shared (conflicting) accesses are handled as nDRF. xDRF limits: barriers, signal-waits, joins.