Trace-Based Automatic Parallelization in the Jikes RVM
Borys Bradel, University of Toronto

Introduction
Automatically parallelize programs by using traces
Target shared memory multiprocessors
Use traces
  – Collect
  – Package
  – Execute in parallel
Modify Jikes RVM
  – Initial results

Trace Definition
A trace is a frequently executed sequence of unique basic blocks or instructions

    public static int foo() {
      int a=0;
      for (int i=0;i<n;i++)
        a+=i;
      return a;
    }

    B0: a=0; i=0; goto B2
    B1: a+=i; i++
    B2: if (i<n) goto B1
    B3: return a

[Figure: control flow graph of foo() with Trace 1 marked]
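To make the definition concrete, here is a minimal sketch that runs foo() one basic block at a time and counts how often each block executes. The class name TraceSketch, the helper methods, and the loop bound n = 5 are assumptions made for illustration; the slides do not show how Jikes RVM actually collects traces.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: foo() from the slide, rewritten as explicit
// basic blocks so that the executed block sequence can be recorded.
final class TraceSketch {
    static int n = 5; // assumed loop bound; the slide leaves n unspecified

    // Runs foo() block by block and appends each executed block id to 'executed'.
    static int fooWithBlocks(List<Integer> executed) {
        int a = 0, i = 0, block = 0;
        while (true) {
            executed.add(block);
            switch (block) {
                case 0: a = 0; i = 0; block = 2; break;  // B0: a=0; i=0; goto B2
                case 1: a += i; i++; block = 2; break;   // B1: a+=i; i++; continue at B2
                case 2: block = (i < n) ? 1 : 3; break;  // B2: if (i<n) goto B1, else B3
                default: return a;                       // B3: return a
            }
        }
    }

    // Counts how often each of the four blocks ran.
    static int[] blockCounts(List<Integer> executed) {
        int[] counts = new int[4];
        for (int b : executed) counts[b]++;
        return counts;
    }

    public static void main(String[] args) {
        List<Integer> executed = new ArrayList<>();
        int result = fooWithBlocks(executed);
        int[] counts = blockCounts(executed);
        System.out.println("result=" + result + " executed=" + executed);
        for (int b = 0; b < counts.length; b++)
            System.out.println("B" + b + " ran " + counts[b] + " time(s)");
    }
}
```

With the assumed bound, B0 and B3 each run once while the loop blocks B1 and B2 run repeatedly. A repeating sequence like this is the sort of frequently executed block sequence that the definition describes.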

Benefits
Source code not required
Granularity of parallelism can vary
Restrict control flow
Simple to identify

System Overview
[Diagram. Components: Single-Threaded Program, Run-Time Compiler, Compiled Methods, Traces. Stages: Extraction, Context Passing, Parallel Execution]

Extraction BB1’ BB3’BB2 BB4’ start end BB4 BB1 BB3BB2 BB4 start end Trace 1

[Figure: the same two graphs, with the trace copy (BB1’, BB3’, BB4’) placed in a Separate Method that has its own prologue and epilogue. Labels on the calling side: prologue, call trace, epilogue]

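[Figure: context passing. Original blocks: BB1 (c=…), BB3 (…=a), BB4 (…=b). Trace copies in the Separate Method: BB1’ (c’=…), BB3’ (…=a’), BB4’ (…=b’). Labels on the calling side: save a,b; call trace; c=c’; check exit. Labels in the Separate Method: a’=a,b’=b; save c’ at exit1, exit2, and exit3]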
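The labels in the figure (save a,b; call trace; c=c’; check exit) can be sketched as ordinary Java. Everything here is hypothetical: the Context class, the method names, the placeholder computation of c’, and the exit conditions are invented for illustration and are not taken from the Jikes RVM implementation.

```java
// Hypothetical sketch of context passing around an extracted trace.
final class ContextPassingSketch {
    // Assumed uniform context object: live-in values a and b, live-out value c.
    static final class Context {
        int a, b; // saved by the caller before the call
        int c;    // saved by the trace before each exit
    }

    // Stand-in for the separate trace method; returns the number of the exit taken.
    static int trace(Context ctx) {
        int a1 = ctx.a, b1 = ctx.b;             // a'=a, b'=b
        int c1 = a1 + b1;                       // c'=... (placeholder computation)
        if (c1 < 0) { ctx.c = c1; return 1; }   // save c', exit1 (assumed condition)
        if (a1 > b1) { ctx.c = c1; return 2; }  // save c', exit2 (assumed condition)
        ctx.c = c1;                             // save c', exit3
        return 3;
    }

    // Caller side: save a,b; call trace; c=c'; check exit.
    static String callTrace(int a, int b) {
        Context ctx = new Context();
        ctx.a = a;               // save a,b
        ctx.b = b;
        int exit = trace(ctx);   // call trace
        int c = ctx.c;           // c=c'
        return "exit" + exit + " c=" + c; // a real caller would branch on the exit number
    }

    public static void main(String[] args) {
        System.out.println(callTrace(1, 2));
        System.out.println(callTrace(5, 2));
        System.out.println(callTrace(-4, 1));
    }
}
```

Passing one context object keeps the call signature the same for every trace, which is one possible reading of the "Uniform" goal listed under Challenges.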

Challenges
Extract basic blocks
Create and call separate method
  – Reflection
  – Jikes Entrypoints
Pass context
  – Efficient
  – Uniform

Parallel Execution
Execute multiple traces in parallel
Execute the same set of traces
  – Similar to data level parallelism
Execute different sets of traces
  – Similar to task level parallelism
Traces need to be set up and scheduled
Our initial focus is on data level parallelism

Parallel Execution
[Figure: timeline for Processor 1 and Processor 2. Labels: sequential execution, parallel setup, start trace execution, then repeated prologue, trace, epilogue sequences on each processor]

Parallel Execution
[Figure: timeline for Processor 1 and Processor 2. Labels: sequential execution, parallel setup, start trace execution, then a single prologue, trace …, epilogue sequence on each processor]

Strongly Connected Component
A graph that contains traces and edges between them such that paths exist between all trace pairs
[Figure: example graph of traces; node contents not recoverable]
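One standard way to find such components is Tarjan's algorithm. The slides do not say which algorithm the system uses, so the following is only a sketch under that assumption. Traces are represented as integer ids with successor lists, and the class name TraceSccSketch is hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch: Tarjan's algorithm over a graph whose nodes are trace ids.
final class TraceSccSketch {
    private final List<List<Integer>> succ;   // succ.get(t) = traces reachable from t by one edge
    private final int[] index, low;
    private final boolean[] onStack;
    private final Deque<Integer> stack = new ArrayDeque<>();
    private final List<List<Integer>> sccs = new ArrayList<>();
    private int next = 0;

    private TraceSccSketch(List<List<Integer>> succ) {
        this.succ = succ;
        int n = succ.size();
        index = new int[n];
        low = new int[n];
        onStack = new boolean[n];
        Arrays.fill(index, -1); // -1 marks an unvisited trace
    }

    // Returns every strongly connected component as a list of trace ids.
    static List<List<Integer>> find(List<List<Integer>> succ) {
        TraceSccSketch t = new TraceSccSketch(succ);
        for (int v = 0; v < succ.size(); v++)
            if (t.index[v] < 0) t.visit(v);
        return t.sccs;
    }

    private void visit(int v) {
        index[v] = low[v] = next++;
        stack.push(v);
        onStack[v] = true;
        for (int w : succ.get(v)) {
            if (index[w] < 0) {              // tree edge: recurse, then propagate low-link
                visit(w);
                low[v] = Math.min(low[v], low[w]);
            } else if (onStack[w]) {         // edge back into the current component
                low[v] = Math.min(low[v], index[w]);
            }
        }
        if (low[v] == index[v]) {            // v is the root of a component
            List<Integer> comp = new ArrayList<>();
            int w;
            do {
                w = stack.pop();
                onStack[w] = false;
                comp.add(w);
            } while (w != v);
            sccs.add(comp);
        }
    }

    public static void main(String[] args) {
        // Traces 0 -> 1 -> 2 -> 0 form a cycle; trace 3 is only reachable from trace 2.
        List<List<Integer>> succ = List.of(List.of(1), List.of(2), List.of(0, 3), List.of());
        System.out.println(find(succ));
    }
}
```

In the example graph, traces 0, 1, and 2 can all reach one another and so form one component, while trace 3 forms a component on its own.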

[Figure: timeline for Processor 1 and Processor 2 executing components in parallel. Labels: parallel setup, prologue, …, epilogue, repeated on each processor]

Challenges
Identifying SCCs
Handling dependence
  – Induction
  – Reduction
Setting up parallel execution
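The induction and reduction dependences can be illustrated with the loop from foo(). In this minimal sketch, each worker receives its own range of the induction variable i and a private partial sum for the reduction on a, and the partial sums are combined after all workers finish. The class name, the use of plain Java threads, and the long accumulator are assumptions; this is not the mechanism implemented in the modified Jikes RVM.

```java
// Hypothetical sketch: splitting the loop "for (i=0;i<n;i++) a+=i" across workers.
final class ReductionSketch {
    // Sequential reference version of the loop.
    static long sequential(int n) {
        long a = 0;
        for (int i = 0; i < n; i++) a += i;
        return a;
    }

    static long parallel(int n, int workers) {
        long[] partial = new long[workers];
        Thread[] threads = new Thread[workers];
        for (int w = 0; w < workers; w++) {
            final int id = w;
            // Induction: each worker starts i at its own precomputed value.
            final int start = (int) ((long) n * id / workers);
            final int end = (int) ((long) n * (id + 1) / workers);
            threads[w] = new Thread(() -> {
                long local = 0;                        // Reduction: private partial sum
                for (int i = start; i < end; i++) local += i;
                partial[id] = local;                   // published once, at the end
            });
            threads[w].start();
        }
        long a = 0;
        try {
            for (int w = 0; w < workers; w++) {
                threads[w].join();                     // join makes partial[w] visible here
                a += partial[w];                       // combine the partial sums
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return a;
    }

    public static void main(String[] args) {
        System.out.println(sequential(1000) + " " + parallel(1000, 2));
    }
}
```

Splitting the reduction this way is valid because addition is associative and commutative, so the order in which the partial sums are combined does not change the result.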

Preliminary Results
Modified Jikes RVM
  – Extract traces and SCCs
  – Execute SCCs in parallel
Run on 2 processor Athlon system with 512MB RAM
Java Grande Section 3 Benchmarks
Measurements
  – Performance

Performance
[Chart: only the value label 8.5 survives]

Related Work
Automatic Parallelization
Traces
Runtime Systems
Program and Alias Analysis

Remaining Challenges
Infrastructure for parallel execution of traces
Granularity
Data dependence
  – Beyond induction and reduction
  – Analysis vs speculation
Control dependence
Load balancing
Data locality
Online system