Speculative Parallelization of Applications on Multicores
Rajiv Gupta, Chen Tian, Min Feng, Vijay Nagarajan

Presentation transcript:

Slide 1: Speculative Parallelization of Applications on Multicores (Rajiv Gupta, Chen Tian, Min Feng, Vijay Nagarajan)

Slide 2: Speculative Parallelization
Goal: exploit parallelism that is frequently observed but not guaranteed to be present, because of dependences:
- Dependences due to cold code
- Dependences that are harmless
- Dependences that are silent

Slide 3: Outline
- Thread-based execution model: non-speculative thread and state; speculative threads and state; committing results; rollback-free recovery.
- Software-only approach: coarse-grained parallelism; speculative parallelization of loops; speedups on a real machine.

Slide 4: Execution Model
- Main thread:
  - Performs non-speculative computation: non-parallelizable code and parts of the parallelized code.
  - Controls the parallel threads: initialization and memory allocation; termination and misspeculation checks; committing results in order.
- Multiple parallel threads perform speculative computations, e.g., speculative loop bodies.

Slide 5: Execution Model
[Figure: the static code of a loop is split into a prologue (p), a speculative body (sp), and an epilogue (e); sequential execution runs p1, sp1, e1, p2, sp2, e2, and so on.]

Slide 6: Execution Model
[Figure: the same loop under parallel execution; the main thread runs the prologues and epilogues while parallel threads P1, P2, ... run the speculative bodies.]
1. In-order commit.
2. At any time, only two threads are executing, so the main thread does not require a separate core.

Slide 7: Memory State
- Non-speculative state (D space): maintained by the main thread.
- Speculative state (P space): allocated by the main thread and used by the parallel threads; results are either committed to D space or discarded.
- Coordinating state (C space): version numbers for the variables in D space; mapping tables for the variables in P space.
[Figure: the main thread owns the D space and C space; each parallel thread has its own P space.]
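
A minimal C sketch of how a mapping-table entry and the per-thread coordinating state might be represented; the type and field names are illustrative assumptions, not the authors' implementation:

    #include <stddef.h>

    /* One mapping-table entry: where a variable lives in D space and in
     * P space, how large it is, which version was snapshotted when it was
     * copied, and whether the speculative copy was written (so it must be
     * copied back to D space on commit). */
    typedef struct {
        void  *d_addr;   /* address of the variable in D space          */
        void  *p_addr;   /* address of its speculative copy in P space  */
        size_t size;     /* number of bytes copied                      */
        int    version;  /* C-space version seen when the copy was made */
        int    written;  /* write flag: copy back on commit?            */
    } map_entry;

    /* Per-thread coordinating state: the mapping table for its P space. */
    typedef struct {
        map_entry *entries;
        int        count;
    } c_space;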

Slide 8: Copy Operations
- Naive scheme:
  - Copy-in: copy values from D space to P space when work is assigned.
  - Copy-out: copy values from P space to D space when the speculation check succeeds.
- Optimized scheme:
  - Use profiling to discover the access pattern of each variable in the speculative loop body: In-Out, Only-In, Only-Out, or Thread-Local.
  - Variables untouched in the profiling run (unknown pattern) are copied on the fly through message passing.
- Mapping table (a sketch of the copy operations follows):
  - Holds the mapping information for these variables: D-space address, P-space address, size, version, and write flag.
  - Updated when variables are copied into P space; consulted when variables are copied back to D space.
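
A minimal sketch of the naive copy-in/copy-out operations, reusing the map_entry and c_space types sketched above; versions[] and var_id are illustrative stand-ins for the C-space version numbers:

    #include <string.h>

    /* Copy-in: make a speculative copy in P space and record it in the
     * mapping table, snapshotting the variable's current version. */
    void copy_in(c_space *cs, void *d_addr, void *p_addr, size_t size,
                 int var_id, const int *versions) {
        memcpy(p_addr, d_addr, size);
        map_entry *e = &cs->entries[cs->count++];
        e->d_addr  = d_addr;
        e->p_addr  = p_addr;
        e->size    = size;
        e->version = versions[var_id];
        e->written = 0;
    }

    /* Copy-out: after the speculation check succeeds, write the dirty
     * speculative copies back to D space. */
    void copy_out(const c_space *cs) {
        for (int i = 0; i < cs->count; i++)
            if (cs->entries[i].written)
                memcpy(cs->entries[i].d_addr, cs->entries[i].p_addr,
                       cs->entries[i].size);
    }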

Slide 9: Misspeculation Check
- Version numbers, maintained by the main thread for each variable potentially read or written by parallel threads. The version number is copied into the mapping table when the corresponding variable is copied into P space.
- Misspeculation check, performed by the main thread: for every entry in the mapping table, compare its version number with the one maintained by the main thread.
  - If all match, the speculation succeeds: perform the copy-out operations and update the version numbers accordingly.
  - If any version number differs, the speculation fails because some earlier thread has changed that variable's value: re-execute the speculative body with the latest values.
- This is a value-based dependence check with rollback-free recovery (a sketch follows).
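
A minimal sketch of the check under the same assumptions as above; var_id_of() is a hypothetical lookup from a D-space address to its variable id:

    enum { OK, FAIL };

    int var_id_of(void *d_addr);   /* hypothetical D-address -> id lookup */

    int speculation_check(const c_space *cs, int *versions) {
        /* Compare each entry's snapshotted version with the current one. */
        for (int i = 0; i < cs->count; i++) {
            const map_entry *e = &cs->entries[i];
            if (e->version != versions[var_id_of(e->d_addr)])
                return FAIL;   /* an earlier thread updated this variable */
        }
        /* Success: commit the results, then bump the versions of the
         * variables this thread wrote. */
        copy_out(cs);
        for (int i = 0; i < cs->count; i++)
            if (cs->entries[i].written)
                versions[var_id_of(cs->entries[i].d_addr)]++;
        return OK;
    }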

Slide 10: On-the-fly Copying
- Access checks consult the mapping table at loads, stores, and pointer assignments.
- Reducing the overhead of access checks:
  - Stack and global variables: based on the classification (In-Out, Only-In, Only-Out, Thread-Local).
  - Heap: optimizations beyond classification:
    - Locally created objects require no checks.
    - Once an object is copied, its other fields are accessed without checking.
    - If a variable is copy-on-write only, no checks are needed at loads; since the version number is not copied on a read, misspeculation detection is implicitly carried out by another copied variable.
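
A minimal sketch of the access check inserted before a speculative access; lookup() and request_copy_from_main() are hypothetical helpers (in the real system the checks are inserted by the compiler):

    /* Redirect a speculative access to the P-space copy, copying the
     * variable on the fly if it has not been copied yet. */
    map_entry *lookup(c_space *cs, void *d_addr);              /* hypothetical */
    map_entry *request_copy_from_main(c_space *cs,
                                      void *d_addr, size_t size);  /* hypothetical */

    void *access_check(c_space *cs, void *d_addr, size_t size) {
        map_entry *e = lookup(cs, d_addr);
        if (e == NULL)   /* not yet in P space: ask the main thread to
                            send the value via message passing */
            e = request_copy_from_main(cs, d_addr, size);
        return e->p_addr;   /* perform the load/store on the copy */
    }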

Slide 11: Other Enhancements
- Reducing thread idling:
  - Scenario: an earlier thread finishes its task, but the main thread has not yet finished assigning tasks to later threads and hence cannot serve the earlier thread. Performance fell when 4 or more parallel threads were used.
  - Solution: assign more work to each thread through loop unrolling (see the sketch below).
- Reducing the misspeculation rate:
  - Scenario: the value of a speculative variable being used by a thread is changed by an earlier thread, so the speculation fails. For benchmark 181.mcf, the misspeculation rate rises as more threads are used.
  - Solution: delay the copying of some variables via the on-the-fly mechanism, which increases the chance of getting the latest version.
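
A minimal illustration of the unrolling idea; UNROLL and the helper names are assumptions, not from the talk:

    #define UNROLL 4   /* iterations handed out per assignment (illustrative) */

    void speculative_body(int tid, int u);   /* hypothetical */

    /* Each speculative task now covers UNROLL consecutive iterations, so
     * the main thread assigns work less often and fast workers spend less
     * time idle waiting for their next task. */
    void speculative_task(int tid) {
        for (int u = 0; u < UNROLL; u++)
            speculative_body(tid, u);   /* one original loop body per step */
    }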

Slide 12: Speculative Parallelization of a Loop
- Prologue: input statements (e.g., fgets); loop counters.
- Epilogue: output statements (e.g., printf); statements highly dependent on the previous iteration (e.g., line_handled).
- Speculative body: the remainder. The loop-carried dependence on error_num rarely manifests itself.

An example from 197.parser:

    while (...) {
        line = read_one_line(input_file);   /* prologue */
        if (line cannot be parsed) {        /* speculative body */
            error_num++;
        } else {
            result = parse(line);
        }
        line_handled++;                     /* epilogue */
        print(result);
    }

Slide 13: Main Thread

Original sequential loop:

    while (...) {
        Prologue code;
        Speculative body code;
        Epilogue code;
    }

Transformed main thread:

    /* create threads and initialize their tasks */
    for (i = 0; i < Num_Proc; i++) {
        allocate P and C space for thread i;
        Prologue code;
        create thread i to execute thrd_func(i);
    }
    i = 0;
    while (...) {
        /* handle misspeculation */
        while (speculation_check(i) == FAIL) {
            update P and C space for thread i;
            re-execute thrd_func(i);
        }
        /* in-order commit */
        commit result and execute Epilogue code;
        /* assign a new iteration */
        Prologue code;
        update P and C space for thread i;
        ask thread i to execute thrd_func(i);
        i = (i + 1) % Num_Proc;
    }
    wait for all threads' completion and execute Epilogue code;

Slide 14: Parallel Thread

    void *thrd_func(i) {
        while (1) {
            wait for the "start" message;
            Speculative body code;   /* access checks precede/follow loads,
                                        stores, and pointer assignments */
            send "finish" message;
        }
    }
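
A minimal runnable sketch of this start/finish handshake for a single worker, using POSIX condition variables; the slides do not specify the messaging mechanism, so this is one plausible realization:

    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t m  = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;
    static int start = 0, finished = 0;

    static void speculative_body(void) { /* placeholder for the loop body */ }

    static void *thrd_func(void *arg) {
        (void)arg;
        for (;;) {
            pthread_mutex_lock(&m);            /* wait for "start" message */
            while (!start)
                pthread_cond_wait(&cv, &m);
            start = 0;
            pthread_mutex_unlock(&m);

            speculative_body();                /* speculative computation */

            pthread_mutex_lock(&m);            /* send "finish" message */
            finished = 1;
            pthread_cond_broadcast(&cv);
            pthread_mutex_unlock(&m);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t;
        pthread_create(&t, NULL, thrd_func, NULL);

        pthread_mutex_lock(&m);                /* main thread: assign a task */
        start = 1;
        pthread_cond_broadcast(&cv);
        while (!finished)                      /* ... then wait for "finish" */
            pthread_cond_wait(&cv, &m);
        pthread_mutex_unlock(&m);

        printf("speculative task finished\n");
        return 0;
    }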

Slide 15: Experimental Setup
[Toolchain diagram: the binary and a small input feed a Pin-based profiling tool, which produces the dependence graph and access patterns; objdump extracts symbols; the source code, the profile, and a transformation template feed the LLVM compiler infrastructure, which emits the x86 binary (-native option).]
Hardware: Dell PowerEdge 1900, two Intel Xeon quad-cores, 3 GHz, 16 GB.

Slide 16: Experimental Setup
- Benchmarks:
  - 5 SPEC benchmarks: 197.parser, 181.mcf, 130.li, 256.bzip2, and 255.vortex.
  - 1 MiBench benchmark: CRC32 (best speedup achieved among all benchmarks).
- Variables in the speculative body, obtained via profiling.
  [Table: counts of Only-In, Only-Out, Thread-Local, and In-Out variables per benchmark; the values are not recoverable from the transcript.]
- Machine: Dell PowerEdge 1900 server with two quad-core processors, 3 GHz, and 16 GB.

Slide 17: Execution Speedups
- All benchmarks achieve their best speedup when 8 threads are used.
- The highest speedups range from 3.7 to 7.8 across the benchmarks.

Slide 18: Thread Idling
[Chart: thread-idling results; no data survives in the transcript.]

Slide 19: Delayed Copying
- Without delayed copying, the misspeculation rate of 181.mcf increases from 0.7% to 17.5% as the number of parallel threads grows from 2 to 8.
- With delayed copying, the misspeculation rate of 181.mcf stays below 10%.
- The misspeculation rate of the other benchmarks is less than 2%.

Slide 20: Copy Optimization
Three schemes considered:
1. All: every variable is copied before the parallel thread starts work; unnecessary copying occurs.
2. On-the-fly: every variable is copied on the fly via message passing; every access must check whether the variable has already been copied into P space.
3. Opt.: profiling is used to determine when to copy each variable.

Slide 21: Copy Optimization (results)
- Results shown for 4 threads.
- Opt. outperforms the other two schemes.
- On-the-fly outperforms All when heap accesses dominate (bzip2, mcf).

Slide 22: Overhead – Instruction Count
- Overhead breakdown per core when 8 threads are used.
- No more than 7% of total instructions are spent on operations related to the execution model.

    Program      Copy on Start  Copy On-the-fly  Exception Check  Misspec. Check  Setup
    197.parser   3.51%          0.33%            0.02%            1.76%           0.62%
    181.mcf      0.08%          0%               ?                1.08%           0.07%
    130.li       1.32%          0.25%            0.06%            1.03%           0.48%
    256.bzip2    1.97%          0.13%            0.08%            2.81%           2.15%
    255.vortex   5.28%          0.04%            0.01%            1.25%           0.39%
    CRC32        ?              0%               ?                0.01%           0.32%

    ("?" marks values not recoverable from the transcript.)

Slide 23: Overhead – Memory Space
- For most benchmarks, the space overhead is around 2-3x.
- 256.bzip2 is an outlier: a large chunk of the heap must be copied to P space.