Executing Parallel Programs with Potential Bottlenecks Efficiently
Yoshihiro Oyama, Kenjiro Taura (visiting UCSD), Akinori Yonezawa
University of Tokyo

Programs We Consider
Programs that update shared data frequently under mutual exclusion: many exclusive methods (e.g., synchronized methods in Java) invoked concurrently on one bottleneck object (e.g., a counter).
Context: implementation of concurrent object-oriented languages on SMPs and DSM machines.

Amdahl’s Law
Suppose 90% of a program can execute in parallel and 10% must execute sequentially (the bottleneck):

    int foo(…) {
      int x = 0, y = 0;
      parallel for (…) { ... }
      lock(); printf(…); unlock();     /* 10%: must execute sequentially (bottleneck) */
      parallel for (…) { c[i] = 0; }
      parallel for (…) { baz(5); }     /* 90%: can execute in parallel */
      return x * 2 + y;
    }

You expect 10 times speedup, and 10 times is the most you can get. But can you really gain 10 times speedup?
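For reference, the bound the slide appeals to is Amdahl's law; with sequential fraction s and N processors:

    \[
      S(N) \;=\; \frac{1}{\,s + \frac{1-s}{N}\,},
      \qquad
      \lim_{N \to \infty} S(N) \;=\; \frac{1}{s} \;=\; \frac{1}{0.10} \;=\; 10
    \]

so with a 10% sequential part, no number of processors can give more than a 10-fold speedup.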

Speedup Curves for Programs with Bottlenecks
(figure: execution time vs. number of PEs, ideal curve vs. real curve)
"Excessive" processors may be used, because it is difficult to predict dynamic behavior, and because different phases need different numbers of PEs.

Preliminary Experiments Using a Simple Counter Program in C
Solaris threads on an Ultra Enterprise; each processor increments a shared counter in parallel.
The execution time did not remain constant as PEs were added; it increased dramatically.
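A minimal sketch of such an experiment (the original used Solaris threads; this analogue uses POSIX threads, and the thread count and iteration count are arbitrary):

    /* Each thread increments one shared, mutex-protected counter. */
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define INCS_PER_THREAD 1000000

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static long counter = 0;

    static void *worker(void *arg) {
        (void)arg;
        for (int i = 0; i < INCS_PER_THREAD; i++) {
            pthread_mutex_lock(&lock);    /* every PE contends here */
            counter++;
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(int argc, char **argv) {
        int n = (argc > 1) ? atoi(argv[1]) : 4;   /* number of threads/PEs */
        pthread_t *t = malloc(sizeof(pthread_t) * n);
        for (int i = 0; i < n; i++) pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < n; i++) pthread_join(t[i], NULL);
        printf("counter = %ld\n", counter);
        free(t);
        return 0;
    }

Timing this for a growing thread count should exhibit the effect the slide describes: ideally the wall-clock time would stay flat, since the counter section is strictly sequential anyway.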

Goal
Efficient execution of programs with bottlenecks, focusing on synchronization of methods.
Bring the time to execute a whole program in parallel closer to the time to execute only its bottlenecks sequentially.
(figure: execution time on 1 PE vs. 50 PEs, split into "bottleneck parts" and "other parts")

What Problem Should We Solve?
Stop the increase of the time consumed in bottlenecks!
(figure: 1 PE vs. 50 PEs; in a naïve implementation the bottleneck part grows with the PE count, while in an ideal implementation it stays constant)

Put It in Prof. Ito’s Terminology!
He aims at keeping the PP/M ≧ SP/S property.
Our work aims at keeping the PP/M ≧ PP/S property: performance on 100 PEs should be higher than that on 1 PE!

Presentation Overview
Examples of potential bottlenecks
Two naïve schemes and their problems
 – Local-based execution
 – Owner-based execution
Our scheme
 – Detachment of requests
 – Priority mechanism using compare-and-swap
 – Two compile-time optimizations
Performance evaluation & related work

Examples of Potential Bottleneck Objects
 – Objects introduced to reuse MT-unsafe functions easily in a multithreaded environment
 – Abstract I/O objects
 – Stubs in distributed systems (one stub conducts all communications in a site)
 – Shared global variables (e.g., counters collecting statistics)
It is sometimes difficult to eliminate them.

Local-based Execution (e.g., Implementation with Spin-locks)
Each PE executes methods by itself, so each PE references/updates the object by itself.
Advantage: no need to move "computation".
Disadvantage: cache misses when referencing the object, due to invalidation/update of cache lines by other processors.
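As a concrete (hypothetical) instance of this scheme, a C11 test-and-set spin lock in which the calling PE runs the method body itself:

    #include <stdatomic.h>

    /* Local-based execution: the calling PE acquires the lock and then
     * touches the object's data itself, so the object's cache lines
     * migrate to every PE that calls the method. */
    typedef struct {
        atomic_flag lock;   /* initialize with ATOMIC_FLAG_INIT */
        long count;         /* example instance variable        */
    } counter_t;

    void locked_increment(counter_t *c) {
        while (atomic_flag_test_and_set(&c->lock))
            ;               /* spin until the lock is released  */
        c->count++;         /* the object is referenced locally */
        atomic_flag_clear(&c->lock);
    }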

Confirmation of Overhead in Local-based Execution
C program on an Ultra Enterprise. The overhead of referencing/updating the object increases with the number of PEs; it occupies 1/3 of the whole execution time on 60 PEs.

Owner-based Execution
Owner = the processor currently holding the object's lock. Request = a data structure containing method info.
If an owner is present, a non-owner creates and inserts a request; if no owner is present, the processor becomes the owner and executes the method itself.

Owner-based Execution with Simple Blocking Locks
Requests are dequeued one by one, guarded by auxiliary locks. One processor is likely to execute multiple methods consecutively.

Advantages/Disadvantages of Owner-based Execution
Advantage: fewer cache misses when referencing the object.
Disadvantages (focusing on the owner's execution, which typically gives the critical path):
 – overhead to move "computation"
 – synchronization operations for a queue
 – waiting time to manipulate a queue
 – cache misses to read requests
Can they be reduced?

Overview of Our Scheme
Improve simple blocking locks:
 – Detach requests: reduce the frequency of mutex operations
 – Give high priority to the owner: reduce the time required to take control of requests
 – Prefetch requests: reduce cache misses in reading requests
Our scheme is realized implicitly by the compiler and runtime of a concurrent object-oriented language.

Data Structures
Requests are managed with a list (see the struct sketch below):
 – A 1-word pointer area (the lock area) is added to each object
 – Non-owner: creates and inserts a request
 – Owner: picks requests out and executes them
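A plausible C rendering of this layout (field names are illustrative; the real runtime also encodes an "owner present, no pending requests" state in the lock word, which is omitted here):

    #include <stdatomic.h>

    typedef struct request {
        struct request *next;      /* next pending request            */
        void (*method)(void *);    /* the exclusive method to execute */
        void *args;                /* its boxed arguments             */
    } request_t;

    typedef struct object {
        _Atomic(request_t *) head; /* the added 1-word lock/list area */
        long count;                /* example instance variable       */
    } object_t;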

Design Policy
The owner's behavior determines the critical path, so we make the owner's execution fast above all; we allow non-owners' execution to be slow.
The battle in a bottleneck: 1 owner vs. 99 non-owners. We should help the owner!

Non-owners Inserting a Request
A non-owner updates the list head with compare-and-swap and retries if interrupted. Non-owners repeat the loop until they succeed.
(figure, animated over three slides: new requests being pushed onto the object's request list)
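A minimal sketch of the non-owner's retry loop with C11 atomics, using the hypothetical request_t/object_t layout above:

    #include <stdatomic.h>

    /* Non-owner path: push a request onto the list head with
     * compare-and-swap, retrying whenever another processor intervened. */
    void insert_request(object_t *obj, request_t *req) {
        request_t *old = atomic_load(&obj->head);
        do {
            req->next = old;   /* link on top of the current head */
        } while (!atomic_compare_exchange_weak(&obj->head, &old, req));
        /* on failure, old is refreshed with the new head and we retry */
    }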

Owner Detaching Requests
Important: the whole list is detached at once. The update uses swap, which always succeeds, so the owner is never interrupted by other processors.
(figure, animated over two slides: the owner swapping out the object's entire request list)

Owner Executing Requests
Benefit 1: no synchronization operations by the owner. The detached requests are executed in turn without mutex operations, while non-owners keep inserting new requests without disturbing the owner.
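A companion sketch of the owner's side under the same assumed layout: one swap detaches the whole list and cannot fail, and the detached requests then run with no further synchronization:

    #include <stdatomic.h>

    void drain_requests(object_t *obj) {
        /* take the whole list, leaving an empty one behind (never retries) */
        request_t *req = atomic_exchange(&obj->head, NULL);
        while (req != NULL) {
            request_t *next = req->next;  /* save the link first          */
            req->method(req->args);       /* only the owner executes here */
            req = next;
        }
    }

Note that the detached list comes out newest-first; the FIFO question this raises is discussed on the last slide.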

Giving Higher Priority to the Owner
Insertion by a non-owner (compare-and-swap) may fail many times; detachment by the owner (swap) always succeeds in a constant number of steps.
Benefit 2: the owner never spins to manipulate requests.

Compile-time Optimization (1/2): Prefetch Requests
While one request is processed, the next request is prefetched.
Benefit 3: reduce cache misses to read requests.

    while (req != NULL) {
      PREFETCH(req->next);   /* start fetching the next request early */
      EXECUTE(req);
      req = req->next;
    }

Compile-time Optimization (2/2): Caching Instance Variables in Registers
Non-owners do not reference/update the object while detached requests are processed, so instance variables can be passed in registers.
Two versions of code are provided for one method:
 – code to process requests: uses instance variables in memory
 – code to execute methods directly: uses instance variables in registers
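A simplified sketch of the register-caching idea on the hypothetical counter object: the instance variable is loaded once, a whole detached batch of increment requests runs against the local copy, and one write-back happens at the end (names are illustrative, and real generated code would dispatch per method rather than hard-code the body):

    /* Register version: valid while non-owners cannot touch the object. */
    void drain_increments_cached(object_t *obj, request_t *req) {
        long count = obj->count;    /* load the IV from memory once  */
        while (req != NULL) {
            count++;                /* method body uses the register */
            req = req->next;
        }
        obj->count = count;         /* single write-back at the end  */
    }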

Achieving Similar Effects in Low-level Languages (e.g., in C)
"Always spin-lock" approach:
 – wastes CPU cycles and memory bandwidth
 – risks deadlocks
"Find bottlenecks, then rewrite code" approach (implements owner-based execution only in bottlenecks):
 – harder than the "support in a high-level language" approach
 – implementing owner-based execution by hand is troublesome
 – bottlenecks appear dynamically in some programs

Experimental Results (1/2)

Experimental Results (2/2)

Interesting Results Using a Simple Counter Program in C
With simple blocking locks, waiting time was the largest overhead: 70% of the owner's whole execution time.
Our scheme is efficient even on a uniprocessor (execution times):
 – spin-locks: 641 msec
 – simple blocking locks: 1025 msec
 – our scheme: 810 msec

Related Work (1/3): Execution of Methods Invoked in Parallel
 – ICC++ [Chien et al. 96]: detects nonexclusive methods through static analysis
 – Concurrent Aggregates [Chien 91]: realizes interleaving through explicit programming
 – Cooperative technique [Barnes 93]: a PE entering a critical section later "helps" its predecessors
These focus on exposing parallelism among nonexclusive operations; none remarks on performance loss in bottlenecks.

Related Work (2/3): Efficient Spin-locks under Contention
 – MCS lock [Mellor-Crummey et al. 91]: provides a separate spin area for each processor
 – Exponential backoff [Anderson 90]: a heuristic to "withdraw" processors that failed to acquire the lock; needs some skill to tune its parameters
These locks still give local-based execution, hence low locality in referencing bottleneck objects.

Related Work (3/3): Efficient Java Monitors
 – Bimodal object locking [Onodera et al. 98], thin locks [Bacon et al. 98]: affected our low-level implementation; fall back to unoptimized "fat locks" for contended objects
 – Meta-locks [Agesen et al. 99]: a clever technique similar to MCS locks; no busy-waiting even for contended objects
Their primary concern is the uncontended case; they do not take locality of object references into account.

Summary
Existing schemes suffer serious performance loss:
 – spin-locks: low locality of object references
 – blocking locks: overhead in the contended request queue
Our scheme gives very fast execution on contended objects through highly optimized owner-based execution.
Excellent performance: several times faster than the simple schemes (several hundred percent speedup)!

Future Work
Reduce the large memory use that occurs in some cases:
 – a long list of requests may form
 – the problem is common to owner-based schemes
 – this work focused on time efficiency, not space efficiency
 – simple solution: when memory used for requests exceeds some threshold, switch dynamically to local-based execution
Increase/decrease PEs according to execution status:
 – the system automatically decides the "best" number of PEs for each program point
 – this eliminates excessive processors altogether

(The slides from here on are shown during the Q&A session.)

More Detailed Measurements Using a Counter Program in C
Solaris threads on a Sun Ultra Enterprise; each processor increments a shared counter.

No Guarantee of FIFO Order
A method invoked later may be executed earlier, because the detached list is in LIFO order.
 – Simple solution: "reverse" the detached requests (see the sketch below)
 – Better solution: can we use a queue instead of a list? Are 64-bit compare-and-swap/swap necessary?
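The "reverse" fix is ordinary in-place list reversal; a sketch over the hypothetical request_t from earlier, which the owner would apply once to each detached batch before executing it:

    /* The detached list is newest-first (LIFO); reversing it restores
     * FIFO order.  Returns the oldest pending request as the new head. */
    request_t *reverse_requests(request_t *req) {
        request_t *prev = NULL;
        while (req != NULL) {
            request_t *next = req->next;
            req->next = prev;   /* flip this link */
            prev = req;
            req = next;
        }
        return prev;
    }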