Extending Open64 with Transactional Memory features Jiaqi Zhang Tsinghua University.

Presentation transcript:

Extending Open64 with Transactional Memory features Jiaqi Zhang Tsinghua University

Contents
– Background
– Design
– Implementation
– Optimization
– Experiment
– Conclusion

Transactional Memory Background
Trend toward concurrent programming
Current solution:
– Locks
– Flaws:
    Association between locks and data
    Deadlock
    Not composable

Transactional Memory Background

    class Account{
        int balance;
        lock mylock;
        bool credit(int amount);
        bool debit(int amount);
    };
    bool credit(int amount){ acquire(mylock); balance+=amount; release(mylock); }
    bool debit(int amount){ acquire(mylock); balance-=amount; release(mylock); }

Composing the two operations exposes an inconsistent state between the calls:
    transfer(Account a, Account b, int amount){
        a.credit(amount);
        b.debit(amount);
    }

Fixing it with locks:
    acquire(a.mylock); acquire(b.mylock);
    a.credit(amount);
    b.debit(amount);
    release(a.mylock); release(b.mylock);
– Poor abstraction of class Account
– Deadlock
– Exposed implementation details

With transactional memory:
    transfer(Account a, Account b, int amount){
        atomic{
            a.credit(amount);
            b.debit(amount);
        }
    }
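The lock-based fix on this slide can be sketched in plain C. This is only an illustration, assuming POSIX mutexes stand in for the slide's `lock` type; `account_init` and the address-ordering rule are additions that do not appear in the slides. It shows the point being made: to avoid deadlock, `transfer` has to reach into each account's lock and impose a global locking order.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C counterpart of the slide's Account class. */
typedef struct {
    int balance;
    pthread_mutex_t lock;
} Account;

void account_init(Account *a, int balance) {
    a->balance = balance;
    pthread_mutex_init(&a->lock, NULL);
}

/* Credits `a` and debits `b` while holding both locks, as on the slide. */
void transfer(Account *a, Account *b, int amount) {
    assert(a != b);
    /* Assumed convention: always lock the lower address first, so two
       opposite transfers running concurrently cannot deadlock. */
    Account *first = ((uintptr_t)a < (uintptr_t)b) ? a : b;
    Account *second = (first == a) ? b : a;

    pthread_mutex_lock(&first->lock);
    pthread_mutex_lock(&second->lock);
    a->balance += amount;   /* a.credit(amount) */
    b->balance -= amount;   /* b.debit(amount)  */
    pthread_mutex_unlock(&second->lock);
    pthread_mutex_unlock(&first->lock);
}
```

Every caller that composes account operations must know about the lock and the ordering rule, which is the abstraction leak the atomic block is meant to remove.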

Transactional Memory Background
Current Implementations
– TM libraries
    DSTM
    DracoSTM
    TL2
    TinySTM
    ……..
Function calls:
    TM_INIT()/TM_SHUTDOWN()
    TM_ATOMIC_BEGIN()/TM_ATOMIC_END()
    TM_SHARED_READ()/TM_SHARED_WRITE()
Explicit Transaction
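To make "explicit transaction" concrete, here is a rough sketch of what hand-instrumented code looks like with library-style calls. The macro names come from the slide, but their argument shapes and definitions here are assumptions: each one is mapped onto a single global mutex purely so the example compiles and runs, which is not how DSTM, TL2, or TinySTM work.

```c
#include <assert.h>
#include <pthread.h>

/* Stand-in definitions (assumed): one global lock instead of a real STM. */
static pthread_mutex_t tm_global_lock = PTHREAD_MUTEX_INITIALIZER;

#define TM_INIT()                  ((void)0)
#define TM_SHUTDOWN()              ((void)0)
#define TM_ATOMIC_BEGIN()          pthread_mutex_lock(&tm_global_lock)
#define TM_ATOMIC_END()            pthread_mutex_unlock(&tm_global_lock)
#define TM_SHARED_READ(var)        (var)
#define TM_SHARED_WRITE(var, val)  ((var) = (val))

int shared_total = 0;

/* The programmer marks the transaction and every shared access by hand. */
void add_to_total(int amount) {
    TM_ATOMIC_BEGIN();
    TM_SHARED_WRITE(shared_total, TM_SHARED_READ(shared_total) + amount);
    TM_ATOMIC_END();
}
```

The burden of wrapping each shared read and write is what compiler support aims to take off the programmer.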

Transactional Memory Background
Current Implementations
– Compilers
    Intel C++ STM Compiler
    Tanger
    OpenTM
    GCC

Design
Programming Interfaces

    #pragma tm atomic [clause]
        structured block
    clauses:
        readonly
        private(var list)
        shared(var list)

    #pragma tm abort

    #pragma tm function
        function declaration

    #pragma tm waiver
        function declaration
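A small usage sketch of these directives follows. The pragma spellings are taken from the slide; the placement of the `readonly` clause and the function names `record_hit`, `record_many`, and `read_hits` are assumptions for illustration. A compiler without the extension ignores unknown pragmas (typically with a warning), so this compiles anywhere but is only atomic under the extended Open64.

```c
#include <assert.h>

int hits = 0;

/* Assumed usage: mark a function as callable from inside transactions. */
#pragma tm function
void record_hit(void);

void record_hit(void) {
    hits++;
}

int record_many(int n) {
    for (int k = 0; k < n; k++) {
        /* Structured block executed as one transaction. */
        #pragma tm atomic
        {
            record_hit();
        }
    }
    return hits;
}

int read_hits(void) {
    int snapshot;
    /* Assumed clause placement: a transaction that only reads shared data. */
    #pragma tm atomic readonly
    {
        snapshot = hits;
    }
    return snapshot;
}
```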

Design
TM runtime interfaces (TL2)

Interface                                               Description
Thread* TxNewThread()                                   Allocate a new Thread structure to keep logs
TxStart(Thread* Self, jmp_buf* buf, int flags)          Start a new transaction for current thread
TxCommit(Thread* Self)                                  Commit the current transaction
TxLoad(Thread* Self, void* addr)                        Perform synchronized load from given memory address
TxStore(Thread* Self, void* addr, intptr_t val)         Perform synchronized store to given memory address
TxStoreLocal(Thread* Self, void* addr, intptr_t val)    Perform locally logged store to given memory address
TxAbort(Thread* Self)                                   Abort the current transaction and re-execute
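The interplay of `TxStart`, `TxAbort`, and the `jmp_buf` is easier to see in a toy. The sketch below is a single-threaded stand-in with an undo log, written only to illustrate "abort and re-execute"; it is not TL2, the return types and the `Thread` layout are assumptions, and `demo_abort_once` is a hypothetical driver.

```c
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>
#include <stdint.h>

#define LOG_CAP 64

typedef struct { intptr_t *addr; intptr_t old; } UndoEntry;

/* Assumed layout: the real Thread structure is not shown in the slides. */
typedef struct Thread {
    jmp_buf *env;             /* where TxAbort jumps back to */
    UndoEntry log[LOG_CAP];   /* old values, restored on abort */
    size_t n;
    int aborts;
} Thread;

void TxStart(Thread *Self, jmp_buf *buf, int flags) {
    (void)flags;
    Self->env = buf;
    Self->n = 0;
}

intptr_t TxLoad(Thread *Self, void *addr) {
    (void)Self;
    return *(intptr_t *)addr;
}

void TxStore(Thread *Self, void *addr, intptr_t val) {
    assert(Self->n < LOG_CAP);
    Self->log[Self->n].addr = (intptr_t *)addr;
    Self->log[Self->n].old = *(intptr_t *)addr;
    Self->n++;
    *(intptr_t *)addr = val;
}

void TxCommit(Thread *Self) {
    Self->n = 0;              /* keep the writes, drop the undo log */
}

void TxAbort(Thread *Self) {
    while (Self->n > 0) {     /* roll back in reverse order */
        Self->n--;
        *Self->log[Self->n].addr = Self->log[Self->n].old;
    }
    Self->aborts++;
    longjmp(*Self->env, 1);   /* re-execute from the setjmp point */
}

intptr_t shared_counter = 0;

/* Runs one transaction that is forced to abort on its first attempt. */
intptr_t demo_abort_once(void) {
    static Thread self;       /* static so state survives the longjmp */
    static jmp_buf buf;
    self.aborts = 0;
    shared_counter = 10;
    setjmp(buf);
    TxStart(&self, &buf, 0);
    TxStore(&self, &shared_counter, TxLoad(&self, &shared_counter) + 5);
    if (self.aborts == 0) {
        TxAbort(&self);       /* undo the store and jump back */
    }
    TxCommit(&self);
    return shared_counter;
}
```

The first attempt's store is rolled back before the retry, so the increment is applied exactly once.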

Design
Wrapper functions
– To ease the process of integrating new TM libraries

Called by programmers:
    tm_init()/tm_finalize()
    tm_thread_start()/tm_thread_end()
Inserted by compiler:
    __tm_atomic_begin()/__tm_atomic_end()
    __tm_shared_read()/__tm_shared_read_float()
    __tm_shared_write()/__tm_shared_write_float()
    __tm_local_write()/__tm_local_write_float()

More wrapper functions are needed for other data types and additional TM semantics.
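One reason typed wrappers are needed is that the runtime interface moves `intptr_t` words, so a `float` has to be repacked on the way in and out. The sketch below shows one way such a wrapper could look. The signatures are assumptions (the slides give only names), the leading underscores are dropped to avoid reserved identifiers, the shared float is assumed to live in a word-sized slot, and `TxLoad`/`TxStore` are pass-through stand-ins rather than a real TM library.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef struct Thread { int unused; } Thread;

/* Pass-through stand-ins for the runtime calls (no real synchronization). */
intptr_t TxLoad(Thread *Self, void *addr) {
    (void)Self;
    return *(intptr_t *)addr;
}

void TxStore(Thread *Self, void *addr, intptr_t val) {
    (void)Self;
    *(intptr_t *)addr = val;
}

/* Hypothetical float wrappers modelled on __tm_shared_read_float() and
   __tm_shared_write_float(): copy the float's bits through a word. */
float tm_shared_read_float(Thread *Self, intptr_t *slot) {
    assert(sizeof(float) <= sizeof(intptr_t));
    intptr_t word = TxLoad(Self, slot);
    float value;
    memcpy(&value, &word, sizeof value);
    return value;
}

void tm_shared_write_float(Thread *Self, intptr_t *slot, float value) {
    assert(sizeof(float) <= sizeof(intptr_t));
    intptr_t word = 0;
    memcpy(&word, &value, sizeof value);
    TxStore(Self, slot, word);
}
```

Swapping in another TM library would then mean changing only the bodies of the wrappers, which is the integration benefit the slide names.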

Design Optimization – Eliminate redundant calls to runtime libraries

Implementation General Transformation

Implementation
General Transformation
– #pragma tm atomic
    setjmp();
    __tm_atomic_begin();
– simple statements
    a = b+c;
        PARM #address of c
        CALL
        LDID
        STID #tm_preg_num_0
        PARM #address of b
        CALL
        LDID
        STID #tm_preg_num_1
        LDID #tm_preg_num_0
        LDID #tm_preg_num_1
        ADD
        PARM
        PARM #address of a
        CALL
– control flow statements (IF, WHILE_DO)
    for(;i<10;i++){ }
        PARM #address of i
        CALL
        LDID
        STID #tm_preg_num_0
        WHILE_DO
            LDID #tm_preg_num_0
            INTCONST 9
            LE
        BODY
            BLOCK
            …………….
            PARM #address of i
            CALL
            LDID
            STID #tm_preg_num_0
            END_BLOCK

Implementation
General Transformation

Source:
    int i = 0;
    #pragma tm atomic
    {
        int j = 0;
        for(i=0;i<20;i++) {
            for(j=0;j<10;j++) {
                result++;
            }
        }
    }

After transformation:
    int i = 0;
    jmp_buf jbuf;
    _setjmp(jbuf);
    TxStart(Self, jbuf);
    TxStore(Self, &j, 0);
    for (TxStore(Self, &i, 0); TxLoad(Self, &i)<20; TxStore(Self, &i, TxLoad(Self, &i)+1)){
        for(TxStore(Self, &j, 0); TxLoad(Self, &j)<10; TxStore(Self, &j, TxLoad(Self, &j)+1)){
            TxStore(Self, &result, TxLoad(Self, &result)+1);
        }
    }
    TxCommit(Self);
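To get a feel for the cost of this naive transformation, the transformed loop can be run against counting stubs. The sketch below assumes pass-through versions of `TxStart`, `TxLoad`, `TxStore`, and `TxCommit` that only count calls (they are not TL2), widens the variables to `intptr_t` to match the store interface, and wraps the code in a hypothetical `run_instrumented` function.

```c
#include <assert.h>
#include <setjmp.h>
#include <stdint.h>

/* Counting stand-ins for the runtime: no logging, no conflict detection. */
typedef struct Thread { long loads; long stores; } Thread;

void TxStart(Thread *Self, jmp_buf *buf, int flags) {
    (void)buf; (void)flags;
    Self->loads = 0;
    Self->stores = 0;
}
intptr_t TxLoad(Thread *Self, void *addr) {
    Self->loads++;
    return *(intptr_t *)addr;
}
void TxStore(Thread *Self, void *addr, intptr_t val) {
    Self->stores++;
    *(intptr_t *)addr = val;
}
void TxCommit(Thread *Self) { (void)Self; }

intptr_t result = 0;

/* The transformed nested loop, with every access to i, j, and result
   going through a barrier call. */
intptr_t run_instrumented(Thread *Self) {
    intptr_t i = 0;
    intptr_t j;
    jmp_buf jbuf;
    result = 0;
    setjmp(jbuf);   /* retry point; never taken here since nothing aborts */
    TxStart(Self, &jbuf, 0);
    TxStore(Self, &j, 0);
    for (TxStore(Self, &i, 0); TxLoad(Self, &i) < 20;
         TxStore(Self, &i, TxLoad(Self, &i) + 1)) {
        for (TxStore(Self, &j, 0); TxLoad(Self, &j) < 10;
             TxStore(Self, &j, TxLoad(Self, &j) + 1)) {
            TxStore(Self, &result, TxLoad(Self, &result) + 1);
        }
    }
    TxCommit(Self);
    return result;
}
```

Only 200 of the loads and 200 of the stores touch `result`; the rest are spent on the loop counters, which motivates removing barriers on variables that do not need them.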

Implementation
Functions – clone and instrument

Declaration:
    #pragma tm function
    void calculate(){}

Generated versions:
    void calculate()
    __tm_cloned__calculate()    //instrumented

Call site in the source:
    #pragma tm atomic
    {
        calculate();
    }

Call site after transformation:
    #pragma tm atomic
    {
        __tm_cloned__calculate();
    }
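The clone-and-instrument idea can be sketched by hand. Below, `calculate` keeps its plain body for callers outside transactions, while a second copy routes its shared accesses through the runtime. The function body, the extra `Thread*` parameter, and the name `tm_cloned_calculate` (written without the slide's reserved `__` prefix) are assumptions; `TxLoad` and `TxStore` are counting stand-ins, not a real TM library.

```c
#include <assert.h>
#include <stdint.h>

typedef struct Thread { long barriers; } Thread;

/* Counting stand-ins for the runtime calls. */
intptr_t TxLoad(Thread *Self, void *addr) {
    Self->barriers++;
    return *(intptr_t *)addr;
}
void TxStore(Thread *Self, void *addr, intptr_t val) {
    Self->barriers++;
    *(intptr_t *)addr = val;
}

intptr_t total = 0;

/* Original version: used by calls outside any transaction. */
void calculate(void) {
    total = total + 1;
}

/* Hand-written stand-in for the compiler-generated clone: same logic,
   but shared accesses go through TxLoad/TxStore. */
void tm_cloned_calculate(Thread *Self) {
    TxStore(Self, &total, TxLoad(Self, &total) + 1);
}
```

Keeping both versions means non-transactional callers pay no barrier cost, while calls inside an atomic block are redirected to the instrumented copy.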

Implementation
Optimization

Source:
    int i = 0;
    #pragma tm atomic
    {
        int j = 0;
        for(i=0;i<20;i++) {
            for(j=0;j<10;j++) {
                result++;
            }
        }
    }

Before optimization:
    int i = 0;
    jmp_buf jbuf;
    _setjmp(jbuf);
    TxStart(Self, jbuf);
    TxStore(Self, &j, 0);
    for (TxStore(Self, &i, 0); TxLoad(Self, &i)<20; TxStore(Self, &i, TxLoad(Self, &i)+1)){
        for(TxStore(Self, &j, 0); TxLoad(Self, &j)<10; TxStore(Self, &j, TxLoad(Self, &j)+1)){
            TxStore(Self, &result, TxLoad(Self, &result)+1);
        }
    }
    TxCommit(Self);

Transaction local variables: detected by the frontend

Implementation
Optimization

Source:
    int i = 0;
    #pragma tm atomic
    {
        int j = 0;
        for(i=0;i<20;i++) {
            for(j=0;j<10;j++) {
                result++;
            }
        }
    }

With barriers on the transaction local variable j removed:
    int i = 0;
    jmp_buf jbuf;
    _setjmp(jbuf);
    TxStart(Self, jbuf);
    j=0;
    for (TxStore(Self, &i, 0); TxLoad(Self, &i)<20; TxStore(Self, &i, TxLoad(Self, &i)+1)){
        for(j=0; j<10;j++){
            TxStore(Self, &result, TxLoad(Self, &result)+1);
        }
    }
    TxCommit(Self);

Barrier free variables: detected according to their storage class

Implementation
Optimization

Source:
    int i = 0;
    #pragma tm atomic
    {
        int j = 0;
        for(i=0;i<20;i++) {
            for(j=0;j<10;j++) {
                result++;
            }
        }
    }

With the barrier free variable i also optimized:
    int i = 0;
    jmp_buf jbuf;
    _setjmp(jbuf);
    TxStart(Self, jbuf);
    j=0;
    for (; i<20; TxStoreLocal(Self, &i, i+1)){
        for(j=0; j<10;j++){
            TxStore(Self, &result, TxLoad(Self, &result)+1);
        }
    }
    TxCommit(Self);
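The effect of the two optimizations can be checked by running the final form of the loop against counting stubs. As a rough sketch, the stand-ins below only count calls and write through (they are not TL2, and the real `TxStoreLocal` would also log the old value), the variables are widened to `intptr_t`, and `run_optimized` is a hypothetical wrapper around the slide's code.

```c
#include <assert.h>
#include <setjmp.h>
#include <stdint.h>

/* Counting stand-ins for the runtime calls. */
typedef struct Thread { long loads; long stores; long local_stores; } Thread;

void TxStart(Thread *Self, jmp_buf *buf, int flags) {
    (void)buf; (void)flags;
    Self->loads = 0;
    Self->stores = 0;
    Self->local_stores = 0;
}
intptr_t TxLoad(Thread *Self, void *addr) {
    Self->loads++;
    return *(intptr_t *)addr;
}
void TxStore(Thread *Self, void *addr, intptr_t val) {
    Self->stores++;
    *(intptr_t *)addr = val;
}
void TxStoreLocal(Thread *Self, void *addr, intptr_t val) {
    Self->local_stores++;
    *(intptr_t *)addr = val;
}
void TxCommit(Thread *Self) { (void)Self; }

intptr_t result = 0;

/* Optimized loop: j is accessed directly, i is read directly and written
   with the cheaper local store, and only result keeps full barriers. */
intptr_t run_optimized(Thread *Self) {
    intptr_t i = 0;
    intptr_t j;
    jmp_buf jbuf;
    result = 0;
    setjmp(jbuf);   /* retry point; never taken here since nothing aborts */
    TxStart(Self, &jbuf, 0);
    j = 0;
    for (; i < 20; TxStoreLocal(Self, &i, i + 1)) {
        for (j = 0; j < 10; j++) {
            TxStore(Self, &result, TxLoad(Self, &result) + 1);
        }
    }
    TxCommit(Self);
    return result;
}
```

In this form the full barriers are confined to the 200 updates of `result`, plus 20 local stores for `i`.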

Implementation
Optimization
– Optimization opportunities detection strategy
    Pthread parallel task
    – transaction local: declared in tm atomic scope
    – barrier free: auto variables
    Cloned transactional function
    – transaction local: declared in the function
    OpenMP parallel task
    – transaction local: declared in tm atomic scope
    – barrier free: declared in micro task, marked in OpenMP private clause
    Checking readonly transactions
– Limitation
    Conservative design for pointers
    Needs programmers to participate in optimization

Preliminary Experiments
Comparison with a fine-grained lock-based application

Preliminary Experiments
Comparison with a manually instrumented application

Preliminary Experiments

    #pragma tm atomic
    {
        int j;
        (*new_centers_len[index])++;
        for(j=0;j<nfeatures;j++){
            new_centers[index][j]+=feature[i][j];
        }
    }

Clause shown on the slide: private(feature)

Conclusion & Future work
An infrastructure for TM on Open64
– Replaceable TM implementation
– Optimization
More experiments on non-trivial applications are desired
Future work:
– Nested transaction
    FastDB: 8 out of 75 critical regions contain nested transactions
– Signal processing
    FastDB: 28 out of 75 critical regions contain signal processing
    PARSEC: 20 out of 55 critical regions contain signal processing
– Event handler
– Indirect calls
– Dealing with legacy code
– …

Thanks