1 Evaluating the Impact of Thread Escape Analysis on Memory Consistency Optimizations Chi-Leung Wong, Zehra Sura, Xing Fang, Kyungwoo Lee, Samuel P. Midkiff,

Presentation transcript:

1 Evaluating the Impact of Thread Escape Analysis on Memory Consistency Optimizations
Chi-Leung Wong, Zehra Sura, Xing Fang, Kyungwoo Lee, Samuel P. Midkiff, Jaejin Lee and David Padua
University of Illinois at Urbana-Champaign, IBM T.J. Watson Research Center, Purdue University, Seoul National University

2 Outline
Memory Models
The Pensieve System
Escape Analyses
Qualitative Impact of Escape Analyses on Delay Set Analysis and Synchronization Analysis
Experimental Results
Conclusion

3 Memory Models
Consider the following code segments:
– Thread 1: data = 100; data_ready = true;
– Thread 2: while (!data_ready); t = data;
Can t == 0?
– Yes, if reordering happens, e.g. Thread 1 executes as: data_ready = true; data = 100;
  Such reordering can be done by the compiler and by the hardware.
– Memory models tell us the answer. Sequential consistency says no.
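The two code segments above can be written as a small Java program. This is only a sketch: the class name FlagHandoff and the method run() are hypothetical, and declaring the flag volatile is one standard Java way to rule out the reordering, not the mechanism used by Pensieve.

```java
// Hypothetical demo class; names are illustrative and not taken from the slides.
class FlagHandoff {
    static int data = 0;
    // volatile: the write to data may not be reordered after the write to
    // dataReady, and the read of data may not move before the read of dataReady.
    static volatile boolean dataReady = false;

    static int run() {
        data = 0;
        dataReady = false;
        int[] t = new int[1];
        Thread consumer = new Thread(() -> {
            while (!dataReady) { /* spin until the flag is set */ }
            t[0] = data;
        });
        Thread producer = new Thread(() -> {
            data = 100;
            dataReady = true;
        });
        consumer.start();
        producer.start();
        try {
            producer.join();
            consumer.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return t[0];
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

With the volatile flag, the consumer can only leave the loop after the producer's write to data is visible, so run() returns 100. If dataReady were a plain field, the Java memory model would allow t == 0, and the spin loop would not even be guaranteed to terminate.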

4 Objective of the Pensieve Project
Sequential consistency (SC) on top of Intel x86 memory models
– Implementation based on Jikes RVM
All analyses are done at JIT compilation time
Need to minimize both analysis time and application execution time

5 Enforcing SC
Done by enforcing orders between memory accesses
– Not all orderings need to be enforced
– Only the orders that are really needed are enforced
Delay Set Analysis (DSA) [SS88] computes such orders
Our approach: an approximation of DSA
– Orders are enforced by inserting fences in the generated code
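Fence insertion can be imitated by hand at the Java source level. The sketch below is an assumption-laden illustration, not Pensieve's generated code: it uses the static fence methods of java.lang.invoke.VarHandle (Java 9 or later), and the class name FenceHandoff is hypothetical. Both fields are plain, so the required orders come only from the fences.

```java
import java.lang.invoke.VarHandle;

// Hypothetical demo class; requires Java 9+ for the VarHandle fence methods.
class FenceHandoff {
    static int data = 0;               // plain field
    static boolean dataReady = false;  // plain field, deliberately not volatile

    static int run() {
        data = 0;
        dataReady = false;
        int[] t = new int[1];
        Thread consumer = new Thread(() -> {
            while (!dataReady) {
                // Intended to keep the read of dataReady from being hoisted
                // out of the loop, so the flag is re-read on every iteration.
                VarHandle.fullFence();
            }
            VarHandle.fullFence();     // enforce: read dataReady before read data
            t[0] = data;
        });
        Thread producer = new Thread(() -> {
            data = 100;
            VarHandle.fullFence();     // enforce: write data before write dataReady
            dataReady = true;
        });
        consumer.start();
        producer.start();
        try {
            producer.join();
            consumer.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return t[0];
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

Only two program orders are enforced here, one per thread. A compiler that places fences automatically has to discover such orders by analysis, which is the role the slides assign to DSA.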

6 Original DSA
Program edge
– x executes before y in the same thread
Conflict edge
– x and x’ are conflicting accesses: the order of the accesses affects the program outcome
– In this paper, two accesses conflict if they are to the same memory location and one of them is a write
[Figure: example graph over accesses x, y, x’, y’]

7 Original DSA (Cont’d)
Critical cycle
– Minimal: a smaller cycle cannot be formed using a subset of its nodes
– Mixed: contains both program edges and conflict edges
Enforce the program edges that lie on a critical cycle
[Figure: example cycles over accesses x, y, x’, y’, z, labeled Minimal, Not minimal, Mixed, and Not mixed]

8 Approximate DSA
Approximation of a critical cycle
– x precedes y
– There are conflicting accesses x’ for x and y’ for y
– y’ precedes x’
Enforce the program edges on approximate critical cycles
[Figure: accesses x, y, x’, y’]
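The approximate critical cycle test can be sketched as a brute-force search. Everything here is a hypothetical model, not Pensieve's implementation: accesses are straight-line per thread, identified by thread id and position, and conflicts are assumed to occur only between different threads.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; class and method names are illustrative, not Pensieve's.
class ApproxDelaySet {
    static final class Access {
        final int thread;     // id of the executing thread
        final int index;      // position in that thread's program order
        final String loc;     // abstract memory location
        final boolean write;  // true for a write, false for a read

        Access(int thread, int index, String loc, boolean write) {
            this.thread = thread;
            this.index = index;
            this.loc = loc;
            this.write = write;
        }

        @Override public String toString() { return "T" + thread + "." + index; }
    }

    // Program edge: a executes before b in the same thread.
    static boolean precedes(Access a, Access b) {
        return a.thread == b.thread && a.index < b.index;
    }

    // Conflict edge: same location and at least one write.
    // Assumption: only accesses in different threads are treated as conflicting.
    static boolean conflict(Access a, Access b) {
        return a.thread != b.thread && a.loc.equals(b.loc) && (a.write || b.write);
    }

    // True if some x' conflicts with x, some y' conflicts with y, and y' precedes x'.
    static boolean onApproxCycle(Access x, Access y, List<Access> all) {
        for (Access xp : all) {
            if (!conflict(x, xp)) continue;
            for (Access yp : all) {
                if (conflict(y, yp) && precedes(yp, xp)) return true;
            }
        }
        return false;
    }

    // Program edges x -> y that must be enforced under the approximation.
    static List<String> delays(List<Access> all) {
        List<String> out = new ArrayList<>();
        for (Access x : all) {
            for (Access y : all) {
                if (precedes(x, y) && onApproxCycle(x, y, all)) out.add(x + " -> " + y);
            }
        }
        return out;
    }

    // The data / data_ready example, plus one access to a location nobody else touches.
    static List<Access> example() {
        List<Access> a = new ArrayList<>();
        a.add(new Access(1, 0, "data", true));
        a.add(new Access(1, 1, "data_ready", true));
        a.add(new Access(1, 2, "tmp", true));
        a.add(new Access(2, 0, "data_ready", false));
        a.add(new Access(2, 1, "data", false));
        return a;
    }

    public static void main(String[] args) {
        System.out.println(delays(example()));
    }
}
```

For this input the sketch reports two delays, T1.0 -> T1.1 and T2.0 -> T2.1: the write order in thread 1 and the read order in thread 2. The access to tmp has no conflicting partner, so no order involving it is enforced.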

9 The Pensieve System
[Figure: system diagram with the labels Source Program, Program Analyses (Thread Escape Analysis, Synchronization Analysis, Delay Set Analysis), Orders to Enforce, Code Optimizations, Fence Insertion & Optimization, Target Program]

10 Escape Analyses
Identify objects that may be accessed by two or more threads
Output: a set of variables
– {v | v points to an object that may be accessed by >= 2 threads}

11 Impact on Delay Set Analysis
x, y, y’, x’ must all be escaping accesses
– A cycle cannot be formed if any one of them is not an escaping access
Fewer escaping accesses implies fewer possible pairs (x, y)
– Fewer checks to be done
– Fewer delays
[Figure: accesses x, y, x’, y’]
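A rough count shows why fewer escaping accesses means fewer checks. The sketch assumes a single thread with straight-line code, so that every pair of accesses is program ordered; the class DelayCheckCount and its methods are hypothetical.

```java
// Hypothetical sketch: counts candidate (x, y) pairs for one straight-line thread.
class DelayCheckCount {
    // Number of program-ordered pairs (x, y) among n accesses: n * (n - 1) / 2.
    static long orderedPairs(int n) {
        return (long) n * (n - 1) / 2;
    }

    // Only pairs in which both x and y are escaping accesses need to be checked.
    static long candidatePairs(boolean[] escaping) {
        int k = 0;
        for (boolean e : escaping) {
            if (e) k++;
        }
        return orderedPairs(k);
    }

    public static void main(String[] args) {
        boolean[] escaping = {true, false, true, false, false, true, false, false, true, false};
        System.out.println(orderedPairs(escaping.length) + " pairs without escape information");
        System.out.println(candidatePairs(escaping) + " pairs with escape information");
    }
}
```

With 10 accesses of which 4 escape, the candidate pairs drop from 45 to 6. The reduction is quadratic in the fraction of accesses proven thread-local.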

12 Impact on Synchronization Analysis
Synchronization analysis reduces the number of conflict edges considered by DSA
– Considers the synchronized construct
– Considers calls to start() and join()
Our system only considers t1.join()
– if it can be matched with some t2.start() call
– and t1 and t2 are not escaping
More precise escape information -> more join() calls matched -> more precise DSA result
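The ordering that start() and join() provide can be seen in a small Java program. The class JoinOrdering is a hypothetical example; it relies only on the happens-before rules for Thread.start() and Thread.join() in the Java Language Specification.

```java
// Hypothetical demo class: the thread object stays local to run() and never escapes.
class JoinOrdering {
    static int result; // plain field: no volatile, no lock

    static int run() {
        result = 0;                       // ordered before the child's run() by start()
        Thread child = new Thread(() -> { result = 42; });
        child.start();
        try {
            child.join();                 // the child's write is ordered before this returns
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return result;                    // cannot run concurrently with the child's write
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

The write in the child and the accesses in the parent touch the same location, but they can never execute concurrently, so run() returns 42. An analysis that matches the join() with the start() can drop the corresponding conflict edges, and it can only do that matching safely when it knows which thread object each call refers to.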

13 Escape Analyses Comparison
In this study, we compare 4 algorithms:
– Connectivity Analysis (Pensieve)
– Field Base Analysis (Pensieve)
For comparison purposes:
– Bogda’s Analysis: Removing Unnecessary Synchronization in Java (OOPSLA 1999)
– Ruf’s Analysis: Effective Synchronization Removal for Java (PLDI 2000)

14 Connectivity Escape Analysis
An object is escaping if both of the following hold:
– It is reachable by more than one thread, in one of two possible ways:
  reachable from a static field, or passed from a thread constructor
– It is accessed by more than one thread
Does not assume that this is escaping in run() by default
Field insensitive for most memory accesses
– i.e. does not distinguish x.f from x.g
– except for accesses to Runnable objects

15 Field Base Escape Analysis
An object is escaping if
– it is reachable from a static field
– it is passed from a thread constructor
Does not assume that this is escaping in run() by default
– similar to connectivity base analysis
Field sensitive
– Suppose O1 and O2 are of the same type:
  O1.f is different from O1.g
  O1.f is the same as O2.f
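The difference between the two abstractions can be pictured as the key under which an access is recorded. This is one possible reading of the slides, not the analyses' actual data structures; the class FieldSensitivity and both key functions are hypothetical.

```java
// Hypothetical sketch: how each abstraction might name the access obj.field.
class FieldSensitivity {
    // Field-insensitive (connectivity style): only the abstract object matters,
    // so x.f and x.g fall into the same bucket.
    static String insensitiveKey(String obj, String type, String field) {
        return obj;
    }

    // Field-based: the declaring type and field name matter, the object does not,
    // so O1.f and O2.f fall into the same bucket while O1.f and O1.g do not.
    static String fieldBasedKey(String obj, String type, String field) {
        return type + "." + field;
    }

    public static void main(String[] args) {
        System.out.println(insensitiveKey("O1", "T", "f").equals(insensitiveKey("O1", "T", "g")));
        System.out.println(fieldBasedKey("O1", "T", "f").equals(fieldBasedKey("O1", "T", "g")));
        System.out.println(fieldBasedKey("O1", "T", "f").equals(fieldBasedKey("O2", "T", "f")));
    }
}
```

Under this reading, if only field f of an object is shared between threads, a field-insensitive analysis also marks accesses to g as escaping, while a field-based one does not. In exchange, the field-based keys merge the same field across all objects of a type.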

16 Bogda’s Escape Analysis
An object is escaping if it is reachable:
– by a static field
– by a Runnable object
– via more than 1 field reference

17 Ruf’s Escape Analysis
An object is escaping if both of the following hold:
– It is reachable from either a static field or a Runnable object
– It is synchronized by more than one thread
Adapted for our own use
– “synchronized” is replaced by “accessed”

18 Experimental Settings (Machine)
Intel (Dell PowerEdge 6600 SMP)
– 4 hyperthreaded 1.5 GHz Intel Xeon processors
– 1 MB cache each
– 6 GB system memory

19 Experimental Settings (Software)
Original
– default Jikes RVM implementation
– base case for performance comparison
Enforcing SC
– Empty
– Arg Escaping
– Connectivity analysis
– Field-base analysis
– Bogda’s analysis (bogda)
– Ruf’s analysis

20 Measurements
Escape Analysis Time
Impact on Delay Set Analysis Time
Impact on Synchronization Analysis Time
Slowdown due to fence insertion
– Delay Set Analysis only
– Delay Set Analysis with Synchronization Analysis

21 Escape Analysis Time

22 Impact on Delay Set Analysis Time

23 Impact on Synchronization Analysis Time

24 Escape+DSA+ Synchronization Analysis Time / Compilation Time

25 Slowdown (DSA Only)

26 Slowdown (DSA+Sync Analysis)

27 Slowdown of connect (DSA+Sync Analysis)

28 Conclusions
Evaluated the interaction between escape analysis and synchronization/delay set analysis
Results for montecarlo and jbb motivate enabling field sensitivity for the connectivity base analysis

29 Backup Slides Follow

30 Number of Delay Checks Performed

31 Total Compilation Time

32 Number of Delays Found (DSA Only)

33 Number of Delays Found (DSA + Sync Analysis)