Automatic Detection of Extended Data-Race-Free Regions

Presentation transcript:

Automatic Detection of Extended Data-Race-Free Regions
Alexandra Jimborean, Jonatan Waern, Per Ekemark, Stefanos Kaxiras (Sweden), Alberto Ros (Murcia, Spain)
Speaker notes: present self and team. In previous decades time was money, so the focus was performance; in the current decade power is money, so the focus is energy. This work improves energy efficiency without hurting performance.

Motivation and goal
Multi- and many-core processors and heterogeneous architectures make scalable cache coherence challenging; one response is to simplify or even abandon hardware cache coherence.
Goal: software-hardware co-designs, compiler-assisted cache coherence.

Background: traditional cache coherence
Cache coherence ensures correct propagation of changes to data held in multiple private caches.
Problem: not all updates to variables have to be propagated, and unnecessary propagation wastes resources. Private data does not require coherence!
Solution: design a dual-mode cache coherence protocol and enable coherence on demand.

Motivation: private vs. shared data
Data can be classified as private or shared at runtime or statically. The compiler identifies most such accesses, targeting temporarily private data: the aim is to identify the largest period during which data remains private, the extended data-race-free (xDRF) regions.

Outline
- xDRF regions of private accesses
- Examples
- Compile-time xDRF analysis
- xDRF regions for optimizing coherence protocols
- Results: performance and energy on many-cores

Data-race-free code as input
The input is data-race-free code: no data races, given the synchronization (e.g., LOCK/UNLOCK around shared accesses).

Synchronization points
The synchronization points divide the code into two categories.

DRF and nDRF regions
- DRF regions: code outside synchronized code
- nDRF regions: the synchronized code itself (DRF/nDRF are our notations)
Key property: DRF regions executing in parallel do not share data.
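As a concrete illustration (a minimal sketch with invented names, not code from the talk), consider a data-race-free pthreads program: the critical section is the nDRF region, and the code before and after it forms DRF regions.

    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    static long shared_sum = 0;              /* shared; accessed only inside the nDRF */

    static void *worker(void *arg)
    {
        long local = 0;                      /* DRF region: thread-private computation */
        for (long i = 0; i < 1000; i++)
            local += i;

        pthread_mutex_lock(&m);              /* nDRF region: synchronized code */
        shared_sum += local;
        pthread_mutex_unlock(&m);

        /* DRF region again: more thread-private work would follow here */
        return arg;
    }

    int main(void)
    {
        pthread_t t[4];
        for (int i = 0; i < 4; i++)
            pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < 4; i++)
            pthread_join(t[i], NULL);
        printf("shared_sum = %ld\n", shared_sum);
        return 0;
    }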

Happens-before synchronization
Data flows across the synchronization point, so the synchronization enforces a happens-before ordering. Such happens-before synchronizations are xDRF boundaries.

Lock-set synchronization
No data flows across the synchronization point; the synchronization only ensures atomicity. The nDRF is enclave, and xDRF regions extend across enclave nDRFs: the synchronization-free semantics is extended across enclave synchronization points.
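A hypothetical illustration (the names are invented, not taken from the talk): the lock below only makes the counter update atomic, and nothing computed inside the critical section is used by the surrounding code, so the nDRF is enclave and the DRF regions before and after it belong to one xDRF region.

    #include <pthread.h>

    static pthread_mutex_t stats_lock = PTHREAD_MUTEX_INITIALIZER;
    static long items_processed = 0;        /* shared statistics counter */

    void process_chunk(int *chunk, int n)   /* chunk is thread-private data */
    {
        for (int i = 0; i < n; i++)         /* DRF: private data only */
            chunk[i] *= 2;

        pthread_mutex_lock(&stats_lock);    /* nDRF: provides atomicity only */
        items_processed += n;               /* no value flows out of the critical section */
        pthread_mutex_unlock(&stats_lock);

        for (int i = 0; i < n; i++)         /* DRF: same private data as before the lock */
            chunk[i] += 1;
    }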

Data flow across synchronization
When there is a conflict, i.e., data flows across the synchronization, the xDRF region cannot extend across it: the synchronization is an xDRF boundary that separates different xDRF regions.
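In contrast, a sketch (again with invented names) where data does flow across the synchronization: the value read under the lock was written by another thread and is used after the critical section, so the synchronization enforces a happens-before and forms an xDRF boundary.

    #include <pthread.h>

    static pthread_mutex_t box_lock = PTHREAD_MUTEX_INITIALIZER;
    static int box;                          /* value handed over from a producer thread */

    int consume(void)
    {
        pthread_mutex_lock(&box_lock);       /* nDRF: reads data written by another thread */
        int v = box;
        pthread_mutex_unlock(&box_lock);

        return v * v;                        /* the DRF code after the sync uses v, so     */
                                             /* data flowed across the sync: xDRF boundary */
    }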

Signal-wait implemented with locks, with data flow across the synchronization: it breaks the xDRF region.

Signal-wait implemented with locks, with no data flow across the synchronization: it is enclave in the xDRF region.
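A rough sketch of that case (invented names): a signal-wait built from a lock where only the flag itself is touched inside the critical sections, and the waiter consumes no data produced by the signaller, so the synchronization can remain enclave.

    #include <pthread.h>
    #include <sched.h>

    static pthread_mutex_t flag_lock = PTHREAD_MUTEX_INITIALIZER;
    static int ready = 0;                    /* the only shared variable involved */

    void signal_ready(void)
    {
        pthread_mutex_lock(&flag_lock);      /* nDRF */
        ready = 1;
        pthread_mutex_unlock(&flag_lock);
    }

    void wait_ready(void)
    {
        for (;;) {
            pthread_mutex_lock(&flag_lock);  /* nDRF: only the flag is accessed */
            int done = ready;
            pthread_mutex_unlock(&flag_lock);
            if (done)
                break;
            sched_yield();                   /* back off until signalled */
        }
        /* no data produced by the signalling thread is consumed here */
    }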

Transitive synchronization
Data can also flow across a chain of synchronizations; such a transitive conflict likewise creates an xDRF boundary.

xDRF static analysis
The analysis distinguishes between nDRF regions that represent xDRF boundaries and nDRF regions that are enclave in xDRF regions.

Step 1: nDRF regions
Identify the synchronization points (nDRFs) and check the synchronization variable of each nDRF.

Matching nDRF regions
Identify nDRFs that synchronize with one another: matching nDRFs synchronize on the same synchronization variable.

Step 2: DRF regions (DRF paths)
For the currently analyzed nDRF, collect the DRF paths before it (DRF-before) and after it (DRF-after), considering only nDRFs not yet processed or previously identified as xDRF boundaries. In the slide's figure, DRF-before = {1,2} and DRF-after = {3}.
The DRF region before is the union of the DRF paths before; the DRF region after is the union of the DRF paths after.

A larger example from the slide: DRF-before = {1,2,3} and DRF-after = {4,5}, with the DRF paths spanning several nDRFs.

The same DRF path can appear both before and after the analyzed nDRF: here DRF-before = {1,2,3} and DRF-after = {1,3}. More details in the paper.

Merging DRF regions
Two conditions are checked:
1. No conflict between DRF-before and DRF-after.
2. No conflict between the DRF regions and the current nDRF.

Step 3: xDRF regions
If both conditions hold, the current nDRF is enclave: merge DRF-before and DRF-after into one xDRF region.
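A deliberately simplified model of this merge decision (a sketch of the idea only, not the analysis from the paper: regions are reduced to sets of shared-variable accesses, thread-private accesses are assumed to be filtered out already, and a conflict is an overlapping access with at least one write).

    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>

    #define MAX_ACCESSES 16

    typedef struct {
        const char *var;                  /* name of the accessed shared variable */
        bool write;                       /* true if the access is a write */
    } Access;

    typedef struct {
        Access acc[MAX_ACCESSES];
        int n;
    } Region;                             /* DRF-before, DRF-after, or the nDRF itself */

    /* Two regions conflict if they access the same variable and at least one writes. */
    static bool conflict(const Region *a, const Region *b)
    {
        for (int i = 0; i < a->n; i++)
            for (int j = 0; j < b->n; j++)
                if (strcmp(a->acc[i].var, b->acc[j].var) == 0 &&
                    (a->acc[i].write || b->acc[j].write))
                    return true;
        return false;
    }

    /* Conditions from the slide: the nDRF is enclave (DRF-before and DRF-after merge
     * into one xDRF) iff (1) DRF-before and DRF-after do not conflict and
     * (2) neither conflicts with the current nDRF. */
    static bool is_enclave(const Region *before, const Region *ndrf, const Region *after)
    {
        return !conflict(before, after) &&
               !conflict(before, ndrf) &&
               !conflict(after, ndrf);
    }

    int main(void)
    {
        /* Statistics-counter case: the surrounding DRF code touches no shared data. */
        Region before1 = { .n = 0 };
        Region ndrf1   = { { { "items_processed", true } }, 1 };
        Region after1  = { .n = 0 };
        printf("stats counter enclave: %s\n",
               is_enclave(&before1, &ndrf1, &after1) ? "yes" : "no");

        /* Hand-off case: DRF code before one sync writes a shared buffer that DRF code
         * after the matching sync reads, so the sync must be an xDRF boundary. */
        Region before2 = { { { "shared_buf", true } }, 1 };
        Region ndrf2   = { .n = 0 };
        Region after2  = { { { "shared_buf", false } }, 1 };
        printf("hand-off enclave: %s\n",
               is_enclave(&before2, &ndrf2, &after2) ? "yes" : "no");
        return 0;
    }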

The same merging is then applied at the xDRF level, considering nDRFs not yet processed or previously identified as non-enclave: in the example, xDRF-before = {1,2} and xDRF-after = {3}. If there is no conflict, merge xDRF-before and xDRF-after into one xDRF region.

Transitive xDRF regions
Matching enclave nDRFs are handled transitively: in the example, xDRF-before = {1,2,3,4} and xDRF-after = {5}. More details in the paper.

Evaluation: xDRF in practice
Number of static nDRFs: 70% non-enclave (inherent to the applications), 20% enclave (automatically detected), 10% potentially enclave (oracle).
Number of executed xDRF regions: enclave nDRFs are on the hot path, and the compiler approaches the oracle.

How are xDRF regions useful?
In a compiler-assisted cache coherence protocol, coherence can be deactivated for xDRF accesses (temporarily private data).
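One way to picture how the compiler communicates this to the protocol (the marker names and the empty stubs below are invented for illustration; the actual co-designed interface may differ): xDRF regions are bracketed so the dual-mode protocol knows when accesses are temporarily private and when a real boundary is reached.

    /* Hypothetical markers standing in for whatever the co-designed protocol uses;
     * the stubs are empty so the sketch compiles. */
    void xdrf_region_begin(void) { /* would switch accesses to the non-coherent mode */ }
    void xdrf_region_end(void)   { /* would flush/self-invalidate and re-enable coherence */ }

    void worker_marked(long *local, long n)
    {
        xdrf_region_begin();                /* compiler-inserted: start of an xDRF region */
        for (long i = 0; i < n; i++)
            local[i] += i;                  /* temporarily private: no coherence needed */
        /* an enclave nDRF (e.g. the statistics lock) would NOT end the region here */
        xdrf_region_end();                  /* compiler-inserted: a non-enclave sync follows */
    }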

Traditional cache coherence: 3 x N coherence actions in the example.

Optimized cache coherence: N + 1 coherence actions in the same example, versus 3 x N with the traditional protocol.

xDRF cache coherence: the coherence deactivation is applied across entire xDRF regions (traditional c.c.: 3 x N coherence actions; optimized c.c.: N + 1).
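As a rough back-of-the-envelope reading of these counts (the formulas are the slide's; the concrete N = 100 is only an illustration): a traditional protocol would need about 3 x 100 = 300 coherence actions where the optimized protocol needs 100 + 1 = 101, and the ratio 3N / (N + 1) approaches 3 as N grows.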

Performance with the coherence protocol, normalized to a traditional protocol: DRF-only -4%, compiler-detected xDRF 7%, oracle 8%. The compiler is competitive with the oracle.

Energy savings with the coherence protocol: DRF-only -10%, compiler-detected xDRF 12%, oracle 16%. The compiler is competitive with the oracle.

Conclusions
- Compiler techniques to detect xDRF regions
- xDRF regions enable cache coherence optimizations
- Performance improvement of 7%, energy savings of 12%

Thank you!
Automatic Detection of Extended Data-Race-Free Regions
Alexandra Jimborean, Jonatan Waern, Per Ekemark, Stefanos Kaxiras (Sweden), Alberto Ros (Murcia, Spain)
Travelling to this conference was financed by the Swedish Wenner-Gren Foundation.

Future work: enable compiler optimizations over xDRF regions
Typical compiler optimizations for multi-threaded code operate within synchronization-free regions; xDRF extends the scope of these optimizations, making them more effective (sketched below).
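A sketch of the kind of transformation this would enable (an invented example, not one from the talk): without region information the compiler must conservatively reload the shared value scale after every lock acquire; if the lock is known to be an enclave nDRF inside one xDRF region, the load of scale can be hoisted out of the loop and kept in a register across the critical section.

    #include <pthread.h>

    static pthread_mutex_t count_lock = PTHREAD_MUTEX_INITIALIZER;
    static long count = 0;
    static double scale = 1.5;              /* shared, but not written inside this xDRF region */

    void scale_all(double *a, long n)       /* a points to thread-private data */
    {
        for (long i = 0; i < n; i++) {
            a[i] *= scale;                  /* with xDRF information, this load of 'scale'  */
                                            /* could be hoisted out of the loop, across the */
                                            /* enclave lock below                           */
            pthread_mutex_lock(&count_lock);    /* enclave nDRF: atomicity only */
            count++;
            pthread_mutex_unlock(&count_lock);
        }
    }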

Future work (continued)
Instead of splitting xDRF regions upon a conflict, place nDRF markings around the conflicting accesses. The xDRF limits remain barriers, signal-waits, and joins; any shared (conflicting) accesses are handled as nDRF, yielding larger xDRF regions.