Understanding Parallelism-Inhibiting Dependences in Sequential Java Programs
Atanas (Nasko) Rountev, Kevin Van Valkenburgh, Dacong Yan, P. Sadayappan
Ohio State University
PRESTO: Program Analyses and Software Tools Research Group, Ohio State University

Overview and Motivation
Multi-core hardware presents a great software evolution challenge.
Our goals:
– Characterize how much inherent method-level parallelism exists in a given sequential Java program
– Identify run-time data dependences that reduce this inherent parallelism
– Do this before the actual parallelization is attempted
The technique:
– Analyzes dependences observed during a program run, similarly to existing work on ILP/loop analysis
– "Turns off" dependences to find bottlenecks

Abstract Model of Best-Case Parallelism
Method-level parallelism:
– Code in a method executes sequentially
– When a call occurs, the callee method starts executing concurrently with the caller method
Best-case (hypothetical) scenario:
– Each statement executes as soon as possible
– Must respect data dependences from the run of the sequential program: e.g., if statement instance s4 is data dependent on s1 and s2, and is preceded by s3 in the same method, then time(s4) = 1 + max( time(s1), time(s2), time(s3) )
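As a concrete illustration of this rule (a hypothetical example, not from the original slides), consider a caller whose two callees run concurrently with it in the best-case model:

```java
// Hypothetical illustration (not from the original slides) of the timestamp rule.
// In the best-case model, writeA() and writeB() run concurrently with caller(),
// so s4's timestamp is determined by its data dependences on s1 and s2 and by
// the preceding statement s3: time(s4) = 1 + max(time(s1), time(s2), time(s3)).
class TimestampExample {
    static int a, b;

    static void caller() {
        writeA();                 // callee containing s1 starts concurrently
        writeB();                 // callee containing s2 starts concurrently
        int local = 5;            // s3: previous statement in the same method
        int sum = a + b + local;  // s4: data dependent on s1 and s2, preceded by s3
        System.out.println(sum);
    }

    static void writeA() { a = 1; }   // s1
    static void writeB() { b = 2; }   // s2
}
```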

Available Parallelism
Compute a best-case parallel timestamp time(s) for each statement instance s during the sequential run:
– N = number of statement instances in the run
– T = largest timestamp
N/T is the available parallelism in the program
– Austin-Sohi [ISCA'92] for instruction-level parallelism
What does it mean?
– Independent of number of processors, cost of thread creation and synchronization, thread scheduling, etc.
– Impossible to achieve
– Allows comparisons between programs, and searching for parallelism bottlenecks within a program
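As a made-up numeric illustration of this ratio: a run that executes N = 1,000,000 statement instances and whose largest best-case timestamp is T = 400,000 has available parallelism N/T = 2.5, i.e., on average 2.5 statement instances could have been in flight at once under the best-case model.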

Specific Contributions
A run-time analysis algorithm for measuring the available parallelism:
– Bytecode instrumentation
– Run-time computation of timestamps
A bottleneck analysis built on top of it:
– Finds problematic instance fields
Measurements on a large set of programs:
– Overall, low available parallelism; data dependences through instance fields are partly to blame
Three case studies

Bytecode Instrumentation (1/2)
Done on three-address IR (Soot framework)
Memory locations of interest: static fields (class fields), instance fields (object fields), array elements, some locals
Example: for an instance field f, add two new shadow instance fields f_w and f_r to the class
– For a run-time object O, O.f_w stores the timestamp of the last statement instance that wrote to O.f
– In O.f_r, store the largest timestamp among all reads of O.f since the last write to O.f
Special local variable control in each method
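A minimal sketch of the added shadow fields (my own reconstruction, not the tool's actual output; the class name and the timestamp type are assumptions):

```java
// Sketch (my own reconstruction) of the shadow fields added for an instance
// field f. The names f_w and f_r follow the slide; using long timestamps is
// an assumption.
class Node {
    int f;        // the original instance field

    // Added by the instrumentation:
    long f_w;     // timestamp of the last statement instance that wrote this.f
    long f_r;     // largest timestamp among all reads of this.f since the last write
}
```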

Bytecode Instrumentation (2/2)
Example: for an assignment x.f = y, add
– control = 1 + max( x.f_r, x.f_w, control, x_w, y_w )
  – x.f_r : read-before-write
  – x.f_w : write-before-write
  – control : previous statement in the method
  – x_w and y_w : if their values come as return values of calls that have already occurred
Example: for a call x = y.m(z)
– control = 1 + max( y_w, z_w )
– Initialize the callee's control with the caller's control
– Make the call
– Set x_w to be the callee's control
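A sketch of how the instrumented forms of these two statements might look (my own reconstruction, not the tool's actual output; the Shadow channel for returning the callee's control is a hypothetical stand-in):

```java
// Sketch (my reconstruction, not the tool's actual output) of instrumented
// forms of "x.f = y" and "x = y.m(z)". The shadow names f_r, f_w, control,
// x_w, y_w, z_w follow the slides; the Shadow class is a hypothetical channel
// for passing the callee's control back to the caller.
class InstrumentationSketch {
    static class Node { int f; long f_r, f_w; }            // shadow fields from slide (1/2)
    static class Shadow { static long calleeControl; }     // hypothetical timestamp channel
    interface Worker { int m(int z); }

    long control;   // timestamp of the previous statement in this method

    void assignExample(Node x, int y, long x_w, long y_w) {
        // Instrumented "x.f = y"
        control = 1 + Math.max(Math.max(x.f_r, x.f_w),
                               Math.max(control, Math.max(x_w, y_w)));
        x.f = y;
        x.f_w = control;   // this statement is now the last writer of x.f
        x.f_r = 0;         // plausible detail: reset the reads-since-last-write timestamp
    }

    int callExample(Worker y, int z, long y_w, long z_w) {
        // Instrumented "x = y.m(z)"
        control = 1 + Math.max(y_w, z_w);
        Shadow.calleeControl = control;     // callee's control starts from the caller's
        int x = y.m(z);                     // make the call; callee advances Shadow.calleeControl
        long x_w = Shadow.calleeControl;    // x's value becomes available when the callee finished
        return x;
    }
}
```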

What Next?
We can measure the largest timestamp T and the available parallelism N/T, but what does it mean?
– By itself, not much, although "low" is likely to be bad
First client: characterize the memory locations
– E.g., ignore ("turn off") all dependences through static fields, and see the increase in available parallelism
Second client: hunting for bad instance fields
– Variant 1: turn off all dependences through a field f; the higher the increase, the more suspicious the field
– Variant 2: turn off all dependences except through f; compare with a run with all instance fields turned off
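One plausible way to realize "turning off" a field is sketched below (my own illustration, not the authors' implementation): the ignored field's shadow timestamps simply do not participate in the control update.

```java
import java.util.Set;

// Minimal sketch (my own illustration, not the authors' implementation) of
// "turning off" dependences through selected instance fields: when updating
// control for a write x.f = y, an ignored field's shadow timestamps are skipped.
class TurnOffSketch {
    static final Set<String> IGNORED_FIELDS = Set.of("Node.f");   // hypothetical field id

    static long controlAfterWrite(String fieldId, long f_r, long f_w,
                                  long control, long x_w, long y_w) {
        long rest = Math.max(control, Math.max(x_w, y_w));
        if (IGNORED_FIELDS.contains(fieldId)) {
            return 1 + rest;                              // dependence through f is ignored
        }
        return 1 + Math.max(rest, Math.max(f_r, f_w));    // normal update, as on the previous slide
    }
}
```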

Experimental Evaluation
26 single-threaded Java programs from 4 benchmark suites: SPEC JVM98, Java Grande 2.0, Java rewrite of Olden, and DaCapo
Instrumented both the application and the standard Java libraries
Question 1: what is the run-time overhead?
– Median slowdown: 29×, which is comparable with similar existing work; definitely room for improvement
– If the standard Java libraries are not instrumented, the overhead is reduced by about 56%, but the precision suffers

Question 2: what is the available parallelism?
– Quite low: median value is 2.1
– Need code analysis and changes
Question 3: who is to blame?
– Turning off static fields makes almost no difference
– Turning off array elements makes little difference
– Turning off instance fields significantly increases the available parallelism
Available parallelism per benchmark: antlr 1.6, bh 47.4, bisort 1.1, bloat 5.1, chart 2.0, compress 1.2, db 1.2, em3d 4.0, euler 1.6, fop 2.4, health 116.9, jack 1.7, javac 2.7, jess 5.2, jython 2.2, luindex 4.0, moldyn 1.8, montecarlo 1.3, mpegaudio 1.6, pmd 2.0, power 134.1, raytrace 2.4, raytracer 1.2, search 1.6, tsp 38.2, voronoi 32.1

Case Studies (1/2)
Study 1: moldyn shows a two-orders-of-magnitude increase when static fields are ignored
– Static fields epot, vir, and interactions are updated in method force (e.g., epot = epot + expr)
– All calls to force conflict with each other, but in reality are independent; a simple code change (of the kind sketched after this slide) increases the available parallelism from 1.8 to
Study 2: Java Grande raytrace shows lower than expected available parallelism
– Five different parallelism bottlenecks
– Examine a highly-ranked field, change the code, recompute rankings (did the field become harmless?)
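A sketch of the kind of change described for Study 1 (my own illustration with assumed names, not the benchmark's actual code): each force call computes its contribution locally, and the shared static accumulator is updated once by a final reduction, so the calls no longer depend on each other.

```java
// Sketch (my own illustration, not moldyn's actual code) of turning the shared
// static-field update into a reduction over independent per-call contributions.
class MoldynSketch {
    static class Particle { double x, y, z; }

    static double epot;   // shared static accumulator, as on the slide

    // Hypothetical helper standing in for the real potential computation.
    static double computePotential(Particle p) { return p.x * p.x + p.y * p.y + p.z * p.z; }

    // Before: every force() call reads and writes the shared static field,
    // so each call depends on the previous one.
    static void forceBefore(Particle p) {
        epot = epot + computePotential(p);
    }

    // After: each call returns its own contribution; the reduction into the
    // static field happens once, so the calls no longer conflict.
    static double forceAfter(Particle p) {
        return computePotential(p);
    }

    static void runAll(Particle[] particles) {
        double sum = 0.0;
        for (Particle p : particles) {
            sum += forceAfter(p);   // independent calls in the best-case model
        }
        epot = sum;
    }
}
```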

Case Studies (2/2)
Study 3: em3d shows the greatest increase when instance fields are ignored
– The problem arises during the building of a directed graph
– When adding a node n2 to n1.successor_list, the code also increments n2.num_predecessors
– If n2 is also added to n3.successor_list, the two increments of n2.num_predecessors are serialized
Solution: break up the computation (sketched below)
– Populate all successor_list fields – highly parallel
– Next, traverse all successor_list fields and compute all num_predecessors fields – very little parallelism
– Available parallelism jumps from 4.0 to 48.9
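A sketch of this restructuring (my own illustration, not em3d's actual code): phase 1 builds only the successor lists, and phase 2 derives the predecessor counts from the finished lists, so the per-node counter updates no longer interleave with, and serialize, the list construction.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch (my own illustration, not em3d's actual code) of the two-phase
// graph construction described on the slide.
class Em3dSketch {
    static class Node {
        final List<Node> successors = new ArrayList<>();
        int numPredecessors;
    }

    // Phase 1: populate successor lists only (highly parallel in the best-case model).
    static void buildEdges(Node[] nodes, int[][] edges) {
        for (int[] e : edges) {
            nodes[e[0]].successors.add(nodes[e[1]]);   // e[0] -> e[1]
        }
    }

    // Phase 2: compute predecessor counts by traversing the finished lists.
    static void countPredecessors(Node[] nodes) {
        for (Node n : nodes) {
            for (Node succ : n.successors) {
                succ.numPredecessors++;
            }
        }
    }
}
```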

Summary
A measure of inherent method-level parallelism in a sequential Java program
– Will be easy to generalize to loop-level parallelism
Run-time analysis algorithm: bytecode instrumentation computes best-case timestamps
Interesting applications:
– How harmful are the data dependences through certain memory locations?
– What are the effects of code changes?

Questions?