CS 3304 Comparative Languages
Lecture 22: Concurrency - Fundamentals 5 April 2012
Introduction
The classic von Neumann (stored-program) model of computing has a single thread of control; parallel programs have more than one. Motivations for concurrency: to capture the logical structure of a problem, to exploit extra processors for speed, and to cope with separate physical devices. Concurrent: any system in which two or more tasks may be underway (at an unpredictable point in their execution). Concurrent and parallel: more than one task can be physically active at once (more than one processor). Concurrent, parallel, and distributed: processors are associated with devices that are physically separated in the real world.
Process
A process or thread is a potentially active execution context. Processes/threads can come from multiple CPUs, from kernel-level multiplexing of a single physical machine, or from language- or library-level multiplexing of a kernel-level abstraction. They can run in true parallel, unpredictably interleaved, or run-until-block. Most work focuses on the first two cases, which are equally difficult to deal with. A process can be thought of as an abstraction of a physical processor.
Levels of Parallelism
Circuit and gate level: signals can propagate down thousands of connections at once. Instruction-level parallelism (ILP): among the instructions of machine-language programs. Vector parallelism: an operation repeated on every element of a very large data set. Thread level: coarser grain (multicore processors); represents a fundamental shift in which parallelism must be written explicitly into the high-level structure of the program.
Levels of Abstraction
“Black box” parallel libraries: parallelism under the hood. Programmer-specified mutually independent tasks. Synchronization serves to eliminate races between threads by controlling the ways in which their actions can interleave in time. A race condition occurs whenever two or more threads are racing toward points in the code at which they touch some common object; the behavior of the system depends on which thread gets there first. Synchronization makes some sequence of instructions (a critical section) appear to be atomic, i.e., to happen all at once. Implementing synchronization mechanisms may require a good understanding of the hardware and the run-time system.
Race Conditions
A race condition occurs when actions in two processes are not synchronized and program behavior depends on the order in which the actions happen. Race conditions are not all bad; sometimes any of the possible program outcomes is acceptable (e.g., workers taking things off a task queue). An example of a race condition we want to avoid: suppose processors A and B share memory, and both try to increment variable X at more or less the same time. Very few processors support arithmetic operations directly on memory, so each processor executes

    LOAD X
    INC
    STORE X

If both processors execute these instructions simultaneously, X could go up by one or by two.
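A minimal C/pthreads sketch of this race (the ITERS constant and variable names are illustrative; with no synchronization the final value of x frequently falls short of the expected total, especially when compiled without optimization):

    #include <pthread.h>
    #include <stdio.h>

    #define ITERS 1000000
    static long x = 0;                      /* the shared X */

    static void *increment(void *arg) {
        (void)arg;
        for (int i = 0; i < ITERS; i++)
            x++;                            /* load X; increment; store X */
        return NULL;
    }

    int main(void) {
        pthread_t a, b;
        pthread_create(&a, NULL, increment, NULL);
        pthread_create(&b, NULL, increment, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        printf("x = %ld, expected %d\n", x, 2 * ITERS);   /* often comes up short */
        return 0;
    }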
Synchronization
Synchronization is the act of ensuring that events in different processes happen in a desired order. Synchronization can be used to eliminate race conditions: in our example we need to synchronize the increment operations to enforce mutual exclusion on access to X. Most synchronization can be regarded as one of the following: Mutual exclusion: making sure that only one process is executing a critical section (e.g., touching a variable) at a time, usually with a mutual exclusion lock (acquire/release). Condition synchronization: making sure that a given process does not proceed until some condition holds (e.g., that a variable contains a given value).
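The same sketch with a mutual exclusion lock around the increment (a pthread mutex standing in for the acquire/release lock described above) now always produces the expected total:

    #include <pthread.h>
    #include <stdio.h>

    #define ITERS 1000000
    static long x = 0;
    static pthread_mutex_t x_lock = PTHREAD_MUTEX_INITIALIZER;

    static void *increment(void *arg) {
        (void)arg;
        for (int i = 0; i < ITERS; i++) {
            pthread_mutex_lock(&x_lock);     /* acquire */
            x++;                             /* critical section */
            pthread_mutex_unlock(&x_lock);   /* release */
        }
        return NULL;
    }

    int main(void) {
        pthread_t a, b;
        pthread_create(&a, NULL, increment, NULL);
        pthread_create(&b, NULL, increment, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        printf("x = %ld\n", x);              /* now 2 * ITERS */
        return 0;
    }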
Multithreaded Programs
Motivations: hardware (I/O, device handlers), simulation (discrete events), web-based applications (browsers). The use of many threads ensures that comparatively fast operations do not wait for slow operations. If a thread blocks, the implementation automatically switches to another thread. Preemptive approach: the implementation automatically switches among threads even when there is no blocking, to prevent any one thread from monopolizing the CPU.
Dispatch Loop Alternative
Dispatch loop: centralizes the handling of all delay-inducing events in a single dispatch loop. The programmer must identify all tasks and the corresponding state information; it is very complex to subdivide tasks and save their state. The principal problem is that it hides the algorithmic structure of the program: every distinct task could be described elegantly if not for the fact that we must return to the top of the dispatch loop at every delay-inducing operation. It turns the program inside out: the management of tasks is explicit, while the control flow within tasks is implicit.
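A toy dispatch loop in C, assuming a made-up next_event stand-in for select()/poll(): each task's progress has to be kept in explicit state, and every delay-inducing point becomes a return to the top of the loop:

    #include <stdio.h>

    typedef enum { EV_INPUT, EV_TIMER, EV_DONE } event_t;

    static event_t next_event(void) {          /* stand-in for select()/poll() */
        static event_t script[] = { EV_INPUT, EV_TIMER, EV_INPUT, EV_DONE };
        static int i = 0;
        return script[i++];
    }

    int main(void) {
        int input_stage = 0;                   /* explicitly saved state for the input task */
        for (;;) {
            switch (next_event()) {            /* the single dispatch loop */
            case EV_INPUT:
                printf("input task, stage %d\n", input_stage++);
                break;                         /* back to the top instead of blocking */
            case EV_TIMER:
                printf("timer task\n");
                break;
            case EV_DONE:
                return 0;
            }
        }
    }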
Multiprocessor Architecture
Single-site (nondistributed) parallel computers: either processors share access to common memory, or processors must communicate with messages. Shared-memory machines are typically referred to as multiprocessors: smaller ones (2-8 processors) are usually symmetric, while larger ones employ a distributed-memory architecture in which each memory bank is physically adjacent to a particular processor or small group of processors. Two main classes of programming notation: synchronized access to shared memory, where a processor can read and write remote memory without the assistance of another processor; and message passing between processes that don't share memory, which requires the active participation of processors at both ends.
Shared Memory
To implement synchronization you have to have something that is atomic: it happens all at once, as an indivisible action. In most machines, reads and writes of individual memory locations are atomic (note that this is not trivial; memory and/or buses must be designed to arbitrate and serialize concurrent accesses). In early machines, reads and writes of individual memory locations were all that was atomic. To simplify the implementation of mutual exclusion, hardware designers began in the late 1960s to build so-called read-modify-write, or fetch-and-phi, instructions into their machines.
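A minimal test-and-set spin lock built on C11 <stdatomic.h>, as a software stand-in for such read-modify-write instructions (a production lock would add backoff and tuned memory ordering):

    #include <stdatomic.h>
    #include <stdio.h>

    static atomic_flag lock = ATOMIC_FLAG_INIT;

    static void acquire(void) {
        /* test-and-set reads the old value and sets the flag in one atomic step */
        while (atomic_flag_test_and_set(&lock))
            ;                               /* spin until the old value was clear */
    }

    static void release(void) {
        atomic_flag_clear(&lock);
    }

    int main(void) {
        acquire();
        puts("in the critical section");
        release();
        return 0;
    }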
Memory Coherence
Caches pose a serious problem on shared-memory machines: a processor that has cached a particular memory location will not see changes made to that location by another processor. The coherence problem: how to keep cached copies of a memory location consistent with one another. Bus-based symmetric machines make this relatively easy: they leverage the broadcast nature of the bus, so when a processor needs to write a cache line it requests an exclusive copy and waits for other processors to invalidate their copies. Without a broadcast bus it is difficult: notifications take time to propagate, and the order of updates is important (consistency is hard). Supercomputers must also accommodate nonuniform access times and the lack of hardware support for shared memory across the full machine.
Concurrent Programming Fundamentals
Thread: an active entity that the programmer thinks of as running concurrently with other threads. Threads are built on top of one or more processes provided by the operating system: a heavyweight process has its own address space, while lightweight processes share an address space. Task: a well-defined unit of work that must be performed by some thread; a collection of threads may share a common “bag of tasks.” Terminology is inconsistent across systems and authors.
Communication and Synchronization
Communication: any mechanism that allows one thread to obtain information produced by another. With shared memory, the program's variables are accessible to multiple threads; with message passing, threads have no common state. Synchronization: any mechanism that allows the programmer to control the relative order in which operations occur on different threads. With shared memory it is not implicit and requires special constructs; with message passing it is implicit. Synchronization implementation: spinning (busy-waiting), in which a thread runs in a loop reevaluating some condition (makes no sense on a uniprocessor); or blocking (scheduler-based), in which the waiting thread voluntarily relinquishes its processor to some other thread (this requires a data structure associated with the synchronization action).
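A small C/pthreads sketch contrasting the two: the first wait spins on an atomic flag, the second blocks on a condition variable until a setter thread signals it (the flag and function names are illustrative):

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>
    #include <unistd.h>

    static atomic_int ready_spin = 0;
    static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t cv = PTHREAD_COND_INITIALIZER;
    static int ready_block = 0;

    static void *setter(void *arg) {
        (void)arg;
        sleep(1);                              /* pretend to produce something */
        atomic_store(&ready_spin, 1);          /* release the spinning waiter */
        pthread_mutex_lock(&m);
        ready_block = 1;
        pthread_cond_signal(&cv);              /* wake the blocked waiter */
        pthread_mutex_unlock(&m);
        return NULL;
    }

    int main(void) {
        pthread_t t;
        pthread_create(&t, NULL, setter, NULL);

        while (!atomic_load(&ready_spin))      /* spinning (busy-wait) */
            ;

        pthread_mutex_lock(&m);
        while (!ready_block)                   /* blocking (scheduler-based) */
            pthread_cond_wait(&cv, &m);
        pthread_mutex_unlock(&m);

        puts("both waits satisfied");
        return pthread_join(t, NULL);
    }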
Languages and Libraries
Thread-level concurrency can come from an explicitly concurrent language, from compiler-supported extensions to traditional sequential languages, or from a library package outside the language proper. Support for concurrency appeared in Algol 68, and support for coroutines in Simula. An explicitly concurrent programming language has the advantage of compiler support. Representative examples:

                  Shared memory              Message passing    Distributed computing
    Language      Java, C#
    Extension     OpenMP                                        Remote procedure call
    Library       pthreads, Win32 threads    MPI                Internet libraries
Thread Creation Syntax
Six principal options: co-begin, parallel loops, launch-at-elaboration, fork/join, implicit receipt, and early reply. The first two delimit threads with special control-flow constructs. The SR language provides all six options. Java, C#, and most libraries provide fork/join. Ada provides launch-at-elaboration and fork/join. OpenMP provides co-begin and parallel loops. RPC systems provide implicit receipt.
Co-Begin
A compound statement that calls for concurrent execution of its constituent statements:

    co-begin    -- all n statements run concurrently
        stmt_1
        stmt_2
        ...
        stmt_n
    end

This was the principal means of creating threads in Algol 68. OpenMP:

    #pragma omp sections
    {
    # pragma omp section
        { printf("thread 1 here\n"); }
    # pragma omp section
        { printf("thread 2 here\n"); }
    }
Parallel Loops
Loops whose iterations can be executed concurrently. OpenMP/C:

    #pragma omp parallel for
    for (int i = 0; i < 3; i++) {
        printf("thread %d here\n", i);
    }

C#/Parallel FX:

    Parallel.For(0, 3, i => {
        Console.WriteLine("Thread " + i + " here");
    });

High-Performance Fortran:

    forall (i=1:n-1)
        A(i) = B(i) + C(i)
        A(i+1) = A(i) + A(i+1)
    end forall

OpenMP also provides support for scheduling and data semantics.
Launch-at-Elaboration
The code for a thread is declared with syntax resembling that of a subroutine with no parameters. When the declaration is elaborated, a thread is created to execute the code. Ada:

    procedure P is
        task T is
            ...
        end T;
    begin -- P
        ...
    end P;

Task T begins to execute as soon as control enters procedure P. When control reaches the end of P, it waits for T to complete before returning.
Fork/Join
Fork is more general than co-begin, parallel loops, and launch-at-elaboration, which all produce properly nested thread executions. Join allows a thread to wait for the completion of a previously forked thread. Ada supports the definition of task types, with shared variables used for communication. In Java, parameters are passed at start time: objects of a class implementing the Runnable interface are passed to an Executor object. Cilk prepends spawn to an ordinary function call, e.g. spawn foo(args), with a highly efficient mechanism for scheduling the resulting tasks.
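A fork/join sketch with the pthreads library, where pthread_create is the fork and pthread_join waits for the previously forked thread (the worker function and argument values are illustrative):

    #include <pthread.h>
    #include <stdio.h>

    static void *worker(void *arg) {
        printf("child thread: got argument %ld\n", (long)arg);
        return (void *)42L;                       /* value collected at the join */
    }

    int main(void) {
        pthread_t t;
        void *result;
        pthread_create(&t, NULL, worker, (void *)7L);   /* fork */
        printf("parent continues concurrently\n");
        pthread_join(t, &result);                       /* join */
        printf("child returned %ld\n", (long)result);
        return 0;
    }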
Implicit Receipt
RPC systems create a new thread automatically in response to an incoming request from some other address space. A server binds a communication channel to a local thread body or subroutine; when a request comes in, a new thread springs into existence to handle it. The bind operation grants remote clients the ability to perform a fork within the server's address space. The process is not fully automatic.
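A rough C/pthreads sketch of the idea, with next_request and handle_request as hypothetical stand-ins for a bound channel and its handler; each incoming request gets its own thread:

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct { int id; } request_t;

    static int next_request(request_t *r) {   /* stand-in for receiving on a bound channel */
        static int n = 0;
        if (n >= 3) return 0;                 /* pretend only three requests arrive */
        r->id = ++n;
        return 1;
    }

    static void *handle_request(void *arg) {  /* the routine bound to the channel */
        request_t *r = arg;
        printf("serving request %d\n", r->id);
        free(r);
        return NULL;
    }

    int main(void) {
        request_t incoming;
        while (next_request(&incoming)) {
            request_t *r = malloc(sizeof *r);
            *r = incoming;
            pthread_t t;                             /* the implicit "fork" in the server */
            pthread_create(&t, NULL, handle_request, r);
            pthread_detach(t);
        }
        pthread_exit(NULL);   /* end only the main thread; handlers finish on their own */
    }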
Early Reply
A sequential subroutine call can be thought of as involving a single thread or two threads. With a single thread, the caller saves its current context (program counter and registers), executes the subroutine, and returns to what it was doing before. With two threads, one executes the caller and another executes the callee; the call is essentially a fork/join pair. The callee does not have to terminate, only to complete its portion of the work: in languages like SR or Hermes the callee can execute a reply operation that returns results to the caller without terminating. The portion of the callee prior to the reply is like the constructor of a Java/C# thread.
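A sketch of early reply built from fork/join primitives in C/pthreads: the callee signals a condition variable to "reply" with its result and then keeps running (the variable names and result value are illustrative, not the SR/Hermes mechanism itself):

    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t replied = PTHREAD_COND_INITIALIZER;
    static int result, have_result = 0;

    static void *callee(void *arg) {
        (void)arg;
        pthread_mutex_lock(&m);
        result = 42;                          /* the portion of work the caller needs */
        have_result = 1;
        pthread_cond_signal(&replied);        /* the "reply": caller may proceed */
        pthread_mutex_unlock(&m);
        printf("callee: continuing after the reply\n");   /* post-reply work */
        return NULL;
    }

    int main(void) {
        pthread_t t;
        pthread_create(&t, NULL, callee, NULL);   /* the call = fork */
        pthread_mutex_lock(&m);
        while (!have_result)
            pthread_cond_wait(&replied, &m);      /* wait for the reply, not termination */
        pthread_mutex_unlock(&m);
        printf("caller: got %d\n", result);
        pthread_join(t, NULL);                    /* eventual join */
        return 0;
    }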
Implementation of Threads
Threads are usually implemented on top of one or more processes provided by the operating system. Making every thread a separate process has drawbacks: processes are too expensive, every operation requires a system call, and they provide features (e.g., priorities) that are seldom used. Putting all threads in a single process precludes parallel execution on a multicore or multiprocessor machine, and if the currently running thread makes a system call that blocks, none of the program's other threads can run.
Two-Level Thread Implementation
User-level threads are implemented on top of kernel-level processes, and similar code appears at both levels of the system: the language run-time system implements threads on top of one or more processes, and the operating system implements processes on top of one or more physical processors. The typical implementation starts with coroutines. Turning coroutines into threads requires hiding the argument to transfer by implementing a scheduler, implementing a preemption mechanism, and allowing data structures to be shared.
Coroutines
Multiple execution contexts, only one of which is active. Transfer(other):

    -- save all callee-saves registers on the stack, including ra and fp
    *current := sp
    current := other
    sp := *current
    -- pop all callee-saves registers (including ra, but not sp)
    -- return (into a different coroutine!)

other and current are pointers to context blocks. A context block contains sp and may contain other information as well (priority, I/O status, etc.). There is no need to change the PC; it always changes at the same place. A new coroutine is created in a state that looks as if it is blocked in transfer (or we could let it execute and then "detach", which is basically early reply).
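For comparison, coroutine-style transfer can be sketched with the POSIX ucontext calls (getcontext/makecontext/swapcontext), which manage the context blocks for us; the interface is obsolescent but still available on common Unix systems:

    #include <stdio.h>
    #include <ucontext.h>

    static ucontext_t main_ctx, co_ctx;
    static char co_stack[64 * 1024];

    static void coroutine_body(void) {
        printf("coroutine: first activation\n");
        swapcontext(&co_ctx, &main_ctx);     /* transfer back to main */
        printf("coroutine: resumed\n");
    }

    int main(void) {
        getcontext(&co_ctx);                 /* build a context that looks blocked in transfer */
        co_ctx.uc_stack.ss_sp = co_stack;
        co_ctx.uc_stack.ss_size = sizeof co_stack;
        co_ctx.uc_link = &main_ctx;          /* where to go when the body finishes */
        makecontext(&co_ctx, coroutine_body, 0);

        swapcontext(&main_ctx, &co_ctx);     /* transfer(other): save ours, load theirs */
        printf("main: back from coroutine\n");
        swapcontext(&main_ctx, &co_ctx);     /* resume where the coroutine left off */
        printf("main: coroutine finished\n");
        return 0;
    }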
Uniprocessor Scheduling
A thread is either blocked or runnable: current_thread is the thread running “on a process”; ready_list is a queue of runnable threads; waiting queues hold threads blocked waiting for conditions. Fairness: each thread gets a frequent “slice” of the processor. Cooperative multithreading: any long-running thread must yield the processor explicitly from time to time. Building a scheduler means adding the ability to “put a thread/process to sleep” and run something else: start with coroutines, make uniprocessor run-until-block threads, add preemption, and finally add multiple processors.
Run-until-Block
We need to get rid of the explicit argument to transfer. ready_list holds threads that are runnable but not running:

    procedure reschedule:
        t : thread := dequeue(ready_list)
        transfer(t)

To do this safely, we need to save current_thread. Suppose we are just relinquishing the processor for the sake of fairness (as in MacOS or Windows 3.1):

    procedure yield:
        enqueue(ready_list, current_thread)
        reschedule

Now suppose we are implementing synchronization:

    sleep_on(q):
        enqueue(q, current_thread)
        reschedule

Some other thread/process will move us to the ready list when we can continue.
Preemption
Use timer interrupts (in an OS) or signals (in a library package) to trigger involuntary yields. This requires that we protect the scheduler data structures:

    procedure yield:
        disable_signals
        enqueue(ready_list, current_thread)
        reschedule
        re-enable_signals

Note that reschedule takes us to a different thread, possibly running code other than yield. Every call to reschedule must be made with signals disabled, and must re-enable them upon its return:

    if not <desired condition>
        sleep_on <condition queue>
    re-enable signals
Multiprocessor Scheduling
True or quasi-parallelism introduces races between calls in separate OS processes, so additional synchronization is needed to make scheduler operations in separate processes atomic:

    procedure yield:
        disable_signals
        acquire(scheduler_lock)    // spin lock
        enqueue(ready_list, current_thread)
        reschedule
        release(scheduler_lock)
        re-enable_signals

    if not <desired condition>
        sleep_on <condition queue>
    re-enable signals
Summary
Six different constructs for creating threads: co-begin, parallel loops, launch-at-elaboration, fork/join, implicit receipt, and early reply. Most concurrent programming systems implement their language- or library-level threads on top of a collection of OS-level processes, and the OS implements its processes on top of a collection of hardware processors.