Code Generation and Optimization for Transactional Memory Construct in an Unmanaged Language Programming Systems Lab Microprocessor Technology Labs Intel.

Slides:

Advertisements

Similar presentations

Inferring Locks for Atomic Sections Cornell University (summer intern at Microsoft Research) Microsoft Research Sigmund CheremTrishul ChilimbiSumit Gulwani.

Advertisements

Compiler and Runtime Support for Efficient Software Transactional Memory Vijay Menon Programming Systems Lab Ali-Reza Adl-Tabatabai, Brian T. Lewis, Brian.

TRAMP Workshop Some Challenges Facing Transactional Memory Craig Zilles and Lee Baugh University of Illinois at Urbana-Champaign.

Software Transactional Objects Guy Eddon Maurice Herlihy TRAMP 2007.

Copyright 2008 Sun Microsystems, Inc Better Expressiveness for HTM using Split Hardware Transactions Yossi Lev Brown University & Sun Microsystems Laboratories.

IBM Research: Software Technology © 2006 IBM Corporation 1 Programming Language X10 Christoph von Praun IBM Research HPC WPL Sandia National Labs December.

Enabling Speculative Parallelization via Merge Semantics in STMs Kaushik Ravichandran Santosh Pande College.

Software Transactional Memory and Conditional Critical Regions Word-Based Systems.

Alias Speculation using Atomic Regions (To appear at ASPLOS 2013) Wonsun Ahn*, Yuelu Duan, Josep Torrellas University of Illinois at Urbana Champaign.

Exploiting Distributed Version Concurrency in a Transactional Memory Cluster Kaloian Manassiev, Madalin Mihailescu and Cristiana Amza University of Toronto,

Data Marshaling for Multi-Core Architectures M. Aater Suleman Onur Mutlu Jose A. Joao Khubaib Yale N. Patt.

Monitoring Data Structures Using Hardware Transactional Memory Shakeel Butt 1, Vinod Ganapathy 1, Arati Baliga 2 and Mihai Christodorescu 3 1 Rutgers University,

Hardware Transactional Memory for GPU Architectures Wilson W. L. Fung Inderpeet Singh Andrew Brownsword Tor M. Aamodt University of British Columbia In.

Scalable and Precise Dynamic Datarace Detection for Structured Parallelism Raghavan RamanJisheng ZhaoVivek Sarkar Rice University June 13, 2012 Martin.

Transactional Memory – Implementation Lecture 1 COS597C, Fall 2010 Princeton University Arun Raman 1.

McRT-Malloc: A Scalable Non-Blocking Transaction Aware Memory Allocator Ali Adl-Tabatabai Ben Hertzberg Rick Hudson Bratin Saha.

U NIVERSITY OF M ASSACHUSETTS, A MHERST – Department of Computer Science The Implementation of the Cilk-5 Multithreaded Language (Frigo, Leiserson, and.

Thread-Level Transactional Memory Decoupling Interface and Implementation UW Computer Architecture Affiliates Conference Kevin Moore October 21, 2004.

Transactional Memory (TM) Evan Jolley EE 6633 December 7, 2012.

PARALLEL PROGRAMMING with TRANSACTIONAL MEMORY Pratibha Kona.

1 Lecture 21: Transactional Memory Topics: consistency model recap, introduction to transactional memory.

[ 1 ] Agenda Overview of transactional memory (now) Two talks on challenges of transactional memory Rebuttals/panel discussion.

Lock vs. Lock-Free memory Fahad Alduraibi, Aws Ahmad, and Eman Elrifaei.

University of Michigan Electrical Engineering and Computer Science 1 Parallelizing Sequential Applications on Commodity Hardware Using a Low-Cost Software.

1 Lecture 7: Transactional Memory Intro Topics: introduction to transactional memory, “lazy” implementation.

1 Lecture 23: Transactional Memory Topics: consistency model recap, introduction to transactional memory.

Language Support for Lightweight transactions Tim Harris & Keir Fraser Presented by Narayanan Sundaram 04/28/2008.

Department of Computer Science Presenters Dennis Gove Matthew Marzilli The ATOMO ∑ Transactional Programming Language.

XCalls: Safe I/O in Memory Transactions Haris Volos, Andres Jaan Tack, Neelam Goyal +, Michael Swift, Adam Welc § University of Wisconsin - Madison + §

A Dynamic Elimination-Combining Stack Algorithm Gal Bar-Nissan, Danny Hendler and Adi Suissa Department of Computer Science, BGU, January 2011 Presnted.

©2009 HP Confidential1 A Proposal to Incorporate Software Transactional Memory (STM) Support in the Open64 Compiler Dhruva R. Chakrabarti HP Labs, USA.

Software Transactional Memory system for C++ Serge Preis, Ravi Narayanaswami Intel Corporation.

Orchestration by Approximation Mapping Stream Programs onto Multicore Architectures S. M. Farhad (University of Sydney) Joint work with Yousun Ko Bernd.

1 Improving Productivity With Fine-grain Compiler-based Checkpointing Chuck (Chengyan) Zhao Prof. Greg Steffan Prof. Cristiana Amza Allan Kielstra* Dept.

University of Michigan Electrical Engineering and Computer Science 1 Dynamic Acceleration of Multithreaded Program Critical Paths in Near-Threshold Systems.

View-Oriented Parallel Programming for multi-core systems Dr Zhiyi Huang World 45 Univ of Otago.

Accelerating Precise Race Detection Using Commercially-Available Hardware Transactional Memory Support Serdar Tasiran Koc University, Istanbul, Turkey.

Reduced Hardware NOrec: A Safe and Scalable Hybrid Transactional Memory Alexander Matveev Nir Shavit MIT.

1 Efficient Type and Memory Safety for Tiny Embedded Systems John Regehr Nathan Cooprider Will Archer Eric Eide University of Utah School of Computing.

Extending Open64 with Transactional Memory features Jiaqi Zhang Tsinghua University.

WormBench A Configurable Application for Evaluating Transactional Memory Systems MEDEA Workshop Ferad Zyulkyarov 1, 2, Sanja Cvijic 3, Osman.

A Qualitative Survey of Modern Software Transactional Memory Systems Virendra J. Marathe Michael L. Scott.

CS5204 – Operating Systems Transactional Memory Part 2: Software-Based Approaches.

Integrating and Optimizing Transactional Memory in a Data Mining Middleware Vignesh Ravi and Gagan Agrawal Department of ComputerScience and Engg. The.

Colorama: Architectural Support for Data-Centric Synchronization Luis Ceze, Pablo Montesinos, Christoph von Praun, and Josep Torrellas, HPCA 2007 Shimin.

Aritra Sengupta, Swarnendu Biswas, Minjia Zhang, Michael D. Bond and Milind Kulkarni ASPLOS 2015, ISTANBUL, TURKEY Hybrid Static-Dynamic Analysis for Statically.

Low-Overhead Software Transactional Memory with Progress Guarantees and Strong Semantics Minjia Zhang, 1 Jipeng Huang, Man Cao, Michael D. Bond.

Hybrid Transactional Memory Sanjeev Kumar, Michael Chu, Christopher Hughes, Partha Kundu, Anthony Nguyen, Intel Labs University of Michigan Intel Labs.

The ATOMOS Transactional Programming Language Mehdi Amirijoo Linköpings universitet.

StealthTest: Low Overhead Online Software Testing Using Transactional Memory Jayaram Bobba, Weiwei Xiong*, Luke Yen †, Mark D. Hill, and David A. Wood.

Software Transactional Memory Should Not Be Obstruction-Free Robert Ennals Presented by Abdulai Sei.

How D can make concurrent programming a piece of cake Bartosz Milewski D Programming Language.

CS492B Analysis of Concurrent Programs Transactional Memory Jaehyuk Huh Computer Science, KAIST Based on Lectures by Prof. Arun Raman, Princeton University.

CoreDet: A Compiler and Runtime System for Deterministic Multithreaded Execution Tom Bergan Owen Anderson, Joe Devietti, Luis Ceze, Dan Grossman To appear.

High-level Interfaces for Scalable Data Mining Ruoming Jin Gagan Agrawal Department of Computer and Information Sciences Ohio State University.

AtomCaml: First-class Atomicity via Rollback Michael F. Ringenburg and Dan Grossman University of Washington International Conference on Functional Programming.

1 Compiler Support for Efficient Software-only Checkpointing Chuck (Chengyan) Zhao Dept. of Computer Science University of Toronto Ph.D. Thesis Exam Sept.

Compiler and Runtime Support for Efficient Software Transactional Memory Vijay Menon Ali-Reza Adl-Tabatabai, Brian T. Lewis, Brian R. Murphy, Bratin Saha,

Tuning Threaded Code with Intel® Parallel Amplifier.

Adaptive Software Lock Elision

Mihai Burcea, J. Gregory Steffan, Cristiana Amza

Minh, Trautmann, Chung, McDonald, Bronson, Casper, Kozyrakis, Olukotun

Part 2: Software-Based Approaches

PHyTM: Persistent Hybrid Transactional Memory

Aritra Sengupta Man Cao Michael D. Bond and Milind Kulkarni

Martin Rinard Laboratory for Computer Science

Replication-based Fault-tolerance for Large-scale Graph Processing

Lecture 6: Transactions

Lecture 22: Consistency Models, TM

Dynamic Performance Tuning of Word-Based Software Transactional Memory

Presentation transcript:

Code Generation and Optimization for Transactional Memory Construct in an Unmanaged Language Programming Systems Lab Microprocessor Technology Labs Intel Corporation *Computer Science Division University of California, Berkeley Cheng Wang, *Wei-Yu Chen, Youfeng Wu, Bratin Saha, Ali Adl-Tabatabai

2 Motivation Existing Transactional Memory (TM) constructs focus on managed Language Efficient software transactional memory (STM) takes advantages of managed language features –Optimistic Versioning (direct update memory with backup) –Optimistic Read (invisible read) Challenges in Unmanaged Language (e.g. C) –Consistency No type safety, first-class exception handling –Function call No just-in-time compilation –Stack rollback Stack alias –Conflict detection Not object oriented

3 Contributions First to introduce comprehensive transactional memory construct to C programming language –Transaction, function called within transaction, transaction rollback, … First to support transactions in a production-quality optimizing C compiler –Code generation, optimization, indirect function calls, … Novel STM algorithm and API that supports optimizing compiler in an unmanaged environment –quiescent transaction, stack rollback, …

4 Outline TM Language Construct STM Runtime Code Generation and Optimization Experimental Results Related Work Conclusion

5 TM Language Constructs #pragma tm_atomic { stmt1; stmt2; } #pragma tm_atomic { stmt 1; #pragma tm_atomic { stmt2; … tm_abort(); } } #pragma tm_function int foo(int); int bar(int); #pragma tm_atomic { foo(3); // OK bar(10); // ERROR } foo(2) // OK bar(1) // OK

6 Consistency Problem #pragma tm_atomic { if(tq->free) { for(temp1 = tq->free; temp1->next &&…, temp1 = temp1->next); task_struct[p_id].loc_free = tq->free; tq->free = temp1->next; temp1->next = NULL; … } #pragma tm_atomic { if(tq->free) { for(temp2 = tq->free; temp2->next &&…, temp2 = temp2->next); task_struct[p_id].loc_free = tq->free; tq->free = temp2->next; temp2->next = NULL; … } } Memory Fault NULL Not NULL Solution: timestamp based aggressive consistent checking Thread 1Thread 2 shared free list local free list

7 Inconsistency Caused by Privatization #pragma tm_atomic { if(tq->free) { for(temp1 = tq->free; temp1->next &&…, temp1 = temp1->next); task_struct[p_id1].loc_free = tq->free; tq->free = temp1->next; temp1->next = NULL; … } temp1 = task_struct[p_id1].loc_free; /* process temp */ task_struct[p_id1].loc_free = temp1->next; temp1->next = NULL; #pragma tm_atomic { if(tq->free) { for(temp2 = tq->free; temp2->next &&…, temp2 = temp2->next); task_struct[p_id2].loc_free = tq->free; tq->free = temp2->next; temp2->next = NULL; … } temp2 = task_struct[p_id2].loc_free; /* process temp */ task_struct[p_id2].loc_free = temp2->next; temp2->next = NULL; NULL Not NULL Memory Fault NULL Solution: Quiescent Transaction Thread 1 Thread 2

8 Quiescent Transaction #pragma tm_atomic { if(tq->free) { for(temp1 = tq->free; temp1->next &&…, temp1 = temp1->next); task_struct[p_id1].loc_free = tq->free; tq->free = temp1->next; temp1->next = NULL; … } temp1 = task_struct[p_id1].loc_free; /* process temp */ task_struct[p_id1].loc_free = temp1->next; temp1->next = NULL; #pragma tm_atomic { if(tq->free) { for(temp2 = tq->free; temp2->next &&…, temp2 = temp2->next); task_struct[p_id2].loc_free = tq->free; tq->free = temp2->next; temp2->next = NULL; … } temp2 = task_struct[p_id2].loc_free; /* process temp */ task_struct[p_id2].loc_free = temp2->next; temp2->next = NULL; Not NULL Consistency Checking Fail Quiescent Thread 1 Thread 2

9 TM Runtime Issues (Stack Rollback) #pragma tm_atomic { … foo() … // abort } foo() { int a; bar(&a) … } bar(int *p) { … *p  … … } Stack Crash back a rollback a Solution: Selective Stack Rollback

10 Optimization Issues (Redundant Barrier) #pragma tm_atomic { a = b + 1; …; // may alias a or b a = b + 1; } desc = stmGetTxnDesc(); rec1 = IRComputeTxnRec(&b); ver1 = IRRead(desc, rec1); t = b; IRCheckRead(desc, rec1, ver1); desc = stmGetTxnDesc(); rec2 = IRComputeTxnRec(&a); IRWrite(desc, rec2); IRUndoLog(desc, &a); a = t + 1; desc = stmGetTxnDesc(); rec1 = IRComputeTxnRec(&b); ver1 = IRRead(desc, rec1); t = b; IRCheckRead(desc, rec1, ver1); desc = stmGetTxnDesc(); rec2 = IRComputeTxnRec(&a); IRWrite(desc, rec2); IRUndoLog(desc, &a); a = t + 1; not redundant

11 Experiment Setup Target System –16-way IBM eServer xSeries 445, 2.2GHz Xeon –Linux , icc v9.0 (with STM), -O3 Benchmarks –3 synthetic concurrent data structure benchmarks Hashtable, btree, avltree –8 SPLASH-2 benchmarks 4 SPLASH-2 benchmarks spend little time in critical sections –Fine-grained lock v. coarse-grained lock v. STM Coarse-grain lock: replace all locks with a single global lock STM: –Replace all lock sections with transactions –Put non-transactional conflicting accesses in transactions

12 Hashtable STM scales similarly as fine grain lock Manual and compiler STM comparable performance

13 FMM STM is much better than coarse-grain lock FMM threads time (seconds) fine lock stm coarse lock no consistency

14 Splash 2 STM can be more scalable than locks raytrace threads time (seconds) fine lock stm coarse lock

15 Optimization Benefits barnescholeskyfmmraytraceradiositygeo-mean Normalized Execution Time no opt barrier elim inlining full opt no consistency lock The overhead is within 15%, with average only 6.4%

16 Related Work Transactional Memory –[Herlihy, ISCA93] –[Ananian, HPCA05], [Rajwar, ISCA05], [Moore, HPCA06], [Hammond, ASPLOS04], [McDonald, ISCA06], [Saha, MICRO 06] Software Transactional Memory –[Shavit, PODC95], [Herlihy, PODC03], [Harris, ASPLOS04] Prior work on TM constructs in managed languages –[Adl-Tabatabai, PLDI06], [Harris, PLDI06], [Carlstrom, PLDI06], [Ringengerg, ICFP05] Efficient STM –[Saha, PPoPP06] Time-stamp based approach –[Dice, DISC06], [Riegel, DISC06]

17 Conclusion We solve the key STM compiler problems for unmanaged languages –Aggressive consistency checking –Static function cloning –Selective stack rollback –Cache-line based conflict detection We developed a highly optimized STM compiler –Efficient register rollback –Barrier elimination –Barrier inlining We evaluated our STM compiler with well-known parallel benchmarks –The optimized STM compiler can achieve most of the hand-coded benefits –There are opportunities for future performance tuning and enhancement

Questions ?

19 STM Runtime API TxnDesc* stmGetTxnDesc(); uint32 stmStart(TxnDesc*, TxnMemento*); uint32 stmStartNested(TxnDesc*, TxnMemento*); void stmCommit(TxnDesc*); void stmCommitNested(TxnDesc*); void stmUserAbort(TxnDesc*); void stmAbort(TxnDesc*); uint32 stmValidate(TxnDesc*); uint32* stmComputeTxnRec(uint32* addr); uint32 stmRead(TxnDesc*, uint32* txnRec); void stmCheckRead(TxnDesc*, uint32* txnRec, uint32 version); void stmWrite(TxnDesc*,uint32* txnRec); Void stmUndoLog(TxnDesc*, uint32* addr,uint32 size);

20 Data Structures

21 Example 1 #pragma tm_atomic { t = head; Head = t->next; } … = *t; #pragma tm_atomic { s = head; *s = …; }

22 Example 2 #pragma tm_atomic { t = head; head = t->next; } … = *t; #pragma tm_atomic { s = head; *s = …; head = s->next; }

23 Example 3 #pragma tm_atomic { t = head; head = t->next; } *t = …; #pragma tm_atomic { s = head; … = *s; head = s->next; }

24 Optimization Issues (Register Checkpointing) Checkpointing all the live-in local data does not work with compiler optimizations across transaction boundary Checkpointing Code t2_bkup = t2; while(setjmp(…)) { t2 = t2_bkup; } stmStart(…) t1 = 0; t2 = t1 + t2; … stmCommit(…); t1 = t3; t3 = 1; Source Code #pragma tm_atomic { t1 = 0; t2 = t1 + t2; … } t1 = t3; t3 = 1; Optimized Code t2_backup = t2; t1 = 0; while(setjmp(…)) { t2 = t2_bkup; } stmStart(…); t2 = t1 + t2; t1 = t3; t3 = 1; … stmCommit(…); can not recover Abort

25 TimeStamp based Consistency Checking #pragma tm_atomic { if(tq->free) { for(temp1 = tq->free; temp1->next &&…, temp1 = temp1->next); task_struct[p_id].loc_free = tq->free; tq->free = temp1->next; temp1->next = NULL; … } #pragma tm_atomic { if(tq->free) { for(temp2 = tq->free; temp2->next &&…, temp2 = temp2->next); task_struct[p_id].loc_free = tq->free; tq->free = temp2->next; temp2->next = NULL; … } Version 1 Global Timestamp 1 Version 0 Version 1 Global Timestamp 0 Local Timestamp 0 Thread 1Thread 2 Local Timestamp 0

26 Checkpointing Approach #pragma tm_atomic { t2 = t1 + t2; t1 = t3; t3 = 1; … } t2_bkup = t2 t3_bkup = t3 t1 = 0 t2 = t2_bkup t3 = t3_bkup t1 = 0 retry entry #pragma tm_atomic { t1 = 0 t2 = t1 + t2; … } t1 = t3; t3 = 1; t2_bkup = t2 t3_bkup = t3 t2 = t2_bkup t3 = t3_bkup retry entrynormal entry Optimization normal entry

27 Function Clone Source Code #pragma tm_function void foo(…) { … } #pragma tm_atomic { foo(); (*fp)(); } STM Code : &foo_tm : // normal version no-op maker … // normal code : // transactional version … // code for transaction foo_tm(); if(*fp == “no-op marker”) (**(fp-4))(); // call foo_tm else handle non-TM binary Point to transactional version Unique Marker

28 STM is much better than coarse-grain lock (fine lock ???)