Introduction to Concurrency: Hardware & Software

1 Introduction to Concurrency: Hardware & Software
Adar Amir
Advanced Topics in Concurrent Programming, Winter 17/18 seminar

2 Hardware

3 Introduction to Concurrency
Example
- Assume multiple threads share a resource that can be used by only one thread at a time.
- Assume the lock is a Boolean field: if the field is false, the lock is free.
- getAndSet(val) atomically swaps val with the lock field; it is the only operation that manipulates the lock value.
- If the call returns false, then the lock was free (and the caller now holds it); if it returns true, the lock was already held.
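A sketch of the lock this describes, the book's test-and-set lock (TASLock), built on java.util.concurrent's AtomicBoolean:

    import java.util.concurrent.atomic.AtomicBoolean;

    // Test-and-set lock: spin until getAndSet(true) returns false,
    // i.e., until we are the thread that flipped the lock from free to held.
    public class TASLock {
        private final AtomicBoolean state = new AtomicBoolean(false);

        public void lock() {
            while (state.getAndSet(true)) {
                // spin: someone else holds the lock
            }
        }

        public void unlock() {
            state.set(false);   // mark the lock free again
        }
    }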

4 Introduction to Concurrency
Example (continued)
- The two lock algorithms shown are logically the same.
[Figure: screenshot from "The Art of Multiprocessor Programming"]
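The screenshot is not reproduced here; the two algorithms it compares are the TASLock above and the book's test-and-test-and-set lock (TTASLock), sketched below, which spins on an ordinary read and attempts the atomic getAndSet() only when the lock looks free:

    import java.util.concurrent.atomic.AtomicBoolean;

    // Test-and-test-and-set lock: logically equivalent to TASLock,
    // but spins on a plain read instead of a read-modify-write.
    public class TTASLock {
        private final AtomicBoolean state = new AtomicBoolean(false);

        public void lock() {
            while (true) {
                while (state.get()) {
                    // spin on a (locally cached) read
                }
                if (!state.getAndSet(true)) {
                    return;     // we flipped the lock from free to held
                }
            }
        }

        public void unlock() {
            state.set(false);
        }
    }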

5 Introduction to Concurrency
Example (continued)
- An experiment on a multiprocessor system measured the elapsed time for n threads to execute a short critical section (a shared resource access) one million times.
[Figure: screenshot from "The Art of Multiprocessor Programming"]

6 Multiprocessor Systems
- A multiprocessor consists of multiple hardware processors, each of which executes a sequential program.
- The basic unit of time is a cycle: the time it takes a processor to fetch and execute a single instruction.
- The processors are hardware devices that execute software threads. Typically, each processor runs a thread for a while, sets it aside, and turns its attention to another thread.
In absolute terms, cycle times change as technology advances (from about 10 million cycles per second in 1980 to about 3000 million in 2005), and they vary from one platform to another (processors that control toasters have longer cycles than processors that control web servers). Nevertheless, the relative cost of instructions such as memory access changes slowly when expressed in terms of cycles.

7 Introduction to Concurrency
Memory & Caches
- Processors share a main memory: a large array of words, indexed by address.
- The core architectural problem: processors and main memory are far apart. It takes a long time for a processor to read a value from memory, a long time to write one, and longer still to be sure the value has actually been installed in memory.
- Simplifying somewhat, a processor reads a value from memory by sending a message containing the desired address; the response message contains the associated data, that is, the contents of memory at that address. A processor writes a value by sending the address and the new data, and the memory sends back an acknowledgment when the new data has been installed.
- We can alleviate this problem by introducing one or more caches: small memories that are closer to the processors and are therefore much faster.
- Caches are logically situated "between" the processor and the memory: when a processor attempts to read from a given memory address, it first looks to see whether the value is already in the cache; if so, it does not need to perform the slower access to memory. In the same way, a processor that writes to an address already in the cache avoids the slower access to memory.
- If the desired address's value is found, we say the processor hits in the cache; otherwise it misses. The proportion of requests satisfied in the cache is called the cache hit ratio (or hit rate).

8 Introduction to Concurrency
Memory & Caches
- Caches are effective because most programs display a high degree of locality: if a processor reads or writes a memory address (also called a memory location), it is likely to read or write the same location again soon; and if it reads or writes a location, it is also likely to read or write nearby locations soon.
- To exploit this second observation, caches typically operate at a granularity larger than a single word: a cache holds a group of neighboring words called a cache line (sometimes called a cache block).
- Most processors have two levels of caches, called the L1 and L2 caches. The L1 cache typically resides on the same chip as the processor and takes one or two cycles to access. The L2 cache may reside either on or off chip and may take tens of cycles to access. Both are significantly faster than the hundreds of cycles required to access main memory.

9 Introduction to Concurrency
Memory & Caches
- Caches are expensive, so only a fraction of the memory locations will fit in a cache at the same time. We would therefore like the cache to hold the values of the most highly used locations.
- This requires a replacement policy, which determines which cache line to replace to make room for a given new location:
- If the replacement policy is free to replace any line, we say the cache is fully associative.
- If there is only one line that can be replaced, we say the cache is direct-mapped.
- If we split the difference, allowing any line from a set of size k to be replaced to make room for a given line, we say the cache is k-way set-associative.
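To make the mapping concrete, here is a toy calculation (not from the slides) of which set a given address falls into in a set-associative cache, under the illustrative assumptions of 64-byte lines and 128 sets:

    // Toy address-to-set mapping for a set-associative cache.
    // The parameters are illustrative, not tied to any real processor.
    public class CacheMapping {
        static final int LINE_SIZE = 64;   // bytes per cache line
        static final int NUM_SETS  = 128;  // sets in the cache

        // Direct-mapped is the k = 1 special case of this scheme;
        // fully associative is the single-set special case.
        static int setIndex(long address) {
            long lineNumber = address / LINE_SIZE; // which line holds the address
            return (int) (lineNumber % NUM_SETS);  // which set that line maps to
        }

        public static void main(String[] args) {
            // Addresses 64 bytes apart land in consecutive sets...
            System.out.println(setIndex(0x1000));  // prints 64
            System.out.println(setIndex(0x1040));  // prints 65
            // ...while addresses 64 * 128 bytes apart collide in one set.
            System.out.println(setIndex(0x1000 + LINE_SIZE * NUM_SETS)); // 64
        }
    }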

10 Introduction to Concurrency
Memory & Caches
- Main memory is shared, but caches are not. Memory contention occurs when one processor reads or writes a memory address that is cached by another.
- If both processors are only reading the data without modifying it, the data can be cached at both processors. If, however, one processor tries to update the shared cache line, the other's copy must be invalidated to ensure that it does not read an out-of-date value.
- Otherwise: processor 1 writes to address A in main memory while address A sits in processor 2's cache; processor 2 then reads address A and sees a stale value.
- The problem of keeping the caches in agreement is called cache coherence.

11 Introduction to Concurrency
MESI: a cache coherence protocol in which each cache line is in one of four states:
- Modified: the line has been modified in the cache, and it must eventually be written back to main memory. No other processor has this line cached.
- Exclusive: the line has not been modified, and no other processor has this line cached.
- Shared: the line has not been modified, and other processors may have this line cached.
- Invalid: the line does not contain meaningful data.
Example of the MESI protocol's state transitions:
(a) Processor A reads data from address a and stores the data in its cache in the exclusive state.
(b) When processor B attempts to read from the same address, A detects the address conflict and responds with the associated data. Now a is cached at both A and B in the shared state.
(c) If B writes to the shared address a, it changes its state to modified and broadcasts a message warning A (and any other processor that might have that data cached) to set its cache line state to invalid.
(d) If A then reads from a, it broadcasts a request, and B responds by sending the modified data both to A and to the main memory, leaving both copies in the shared state.
[Figure: screenshot from "The Art of Multiprocessor Programming"]
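As a reading aid, the transitions above can be collected into a toy state machine for a single line, seen from one cache. This is a deliberate simplification, not from the slides, and the event names are illustrative rather than hardware terminology:

    // Toy MESI state machine for one cache line, from one cache's view.
    enum MesiState { MODIFIED, EXCLUSIVE, SHARED, INVALID }

    enum Event { LOCAL_READ, LOCAL_WRITE, REMOTE_READ, REMOTE_WRITE }

    class MesiLine {
        MesiState state = MesiState.INVALID;

        void onEvent(Event e) {
            switch (state) {
                case INVALID:
                    // step (a): a read fills the line as exclusive
                    // (it would be SHARED if another cache held it, as in (b))
                    if (e == Event.LOCAL_READ)  state = MesiState.EXCLUSIVE;
                    if (e == Event.LOCAL_WRITE) state = MesiState.MODIFIED;
                    break;
                case EXCLUSIVE:
                    if (e == Event.LOCAL_WRITE)  state = MesiState.MODIFIED;
                    if (e == Event.REMOTE_READ)  state = MesiState.SHARED;   // (b)
                    if (e == Event.REMOTE_WRITE) state = MesiState.INVALID;
                    break;
                case SHARED:
                    if (e == Event.LOCAL_WRITE)  state = MesiState.MODIFIED; // (c), broadcast invalidate
                    if (e == Event.REMOTE_WRITE) state = MesiState.INVALID;  // (c), other side
                    break;
                case MODIFIED:
                    if (e == Event.REMOTE_READ)  state = MesiState.SHARED;   // (d), write data back
                    if (e == Event.REMOTE_WRITE) state = MesiState.INVALID;
                    break;
            }
        }
    }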

12 Introduction to Concurrency
Interconnect
- The interconnect is a communication medium that links processors to processors, and processors to memory.
- There are essentially two kinds of interconnect architectures in use: SMP and NUMA.

13 Introduction to Concurrency
Interconnect
- In an SMP (symmetric multiprocessing) architecture, processors and memory are linked by a bus interconnect. Both the processors and the main memory have bus controller units in charge of sending and listening for messages broadcast on the bus.
- SMP architectures are the most common because they are the easiest to build, but they are not scalable to large numbers of processors: eventually the bus becomes overloaded.
- In a NUMA (nonuniform memory access) architecture, a collection of nodes is linked by a point-to-point network. Each node contains one or more processors and a local memory. One node's local memory is accessible to the other nodes, and together the nodes' memories form a global memory shared by all processors.
- Networks are more complex than buses and require more elaborate protocols, but they scale better than buses to large numbers of processors.

14 Introduction to Concurrency
Interconnect
- The interconnect is a finite resource shared among the processors.
- If one processor uses too much of the interconnect's bandwidth, the others may be delayed.
[Figure: screenshots from Wikipedia and cs.ucla.edu]

15 Introduction to Concurrency
Spinning
- A processor is spinning if it is repeatedly testing some word in memory.
- On an SMP architecture without caches, spinning is a very bad idea: each time the processor reads the memory, it consumes bus bandwidth without accomplishing work, and because the bus is a broadcast medium, these requests may prevent other processors from making progress.
- On a NUMA architecture without caches, spinning is acceptable if the address resides in the processor's local memory.

16 Introduction to Concurrency
Spinning
- On an SMP or NUMA architecture with caches, spinning consumes significantly fewer resources.
- The first time the processor reads the address, it takes a cache miss and loads the contents of that address into a cache line. Thereafter, as long as that data remains unchanged, the processor simply rereads from its own cache, consuming no interconnect bandwidth.

17 Introduction to Concurrency
Back to the example: a reminder of what getAndSet() does.
[Figure: screenshot from "The Art of Multiprocessor Programming"]
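As a stand-in for the screenshot, the following sequential sketch illustrates only the meaning of getAndSet(); in AtomicBoolean the read and the write happen as a single atomic hardware step:

    // Sequential sketch of getAndSet semantics. NOT a real
    // implementation: AtomicBoolean does this as one atomic step.
    public class GetAndSetSketch {
        private boolean state = false;

        public boolean getAndSet(boolean val) {
            boolean prior = state;  // remember the old value
            state = val;            // install the new value
            return prior;           // report what was there before
        }
    }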

18 Back to example (assuming SMP)
- TASLock: each getAndSet() is broadcast on the bus, because it writes to state. Even worse, these getAndSet() calls force the other processors to discard their own cached copies of the lock.
- TTASLock: while the lock is held, waiting threads spin on a locally cached copy of state.
- The TTASLock is itself far from ideal, however. When the lock is released, all its cached copies are invalidated, and all waiting threads call getAndSet(true), resulting in a burst of traffic, smaller than that of the TASLock but nevertheless significant.
[Figure: screenshot from "The Art of Multiprocessor Programming"]

19 Multi-core & Multi-threaded architectures
- In a multi-threaded architecture, a single processor may execute two or more threads at once. Many modern processors have substantial internal parallelism: they can execute instructions out of order, or in parallel (e.g., keeping both fixed- and floating-point units busy), or even execute instructions speculatively.
- In a multi-core architecture, multiple processors are placed on the same chip.

20 Multi-core & Multi-threaded architectures
- In a multi-core (SMP) architecture, each processor on the chip typically has its own L1 cache, but the processors share a common L2 cache, avoiding the need to invoke the cumbersome cache coherence protocol for on-chip communication.
- Modern processor architectures combine multi-core with multi-threading: multiple individually multi-threaded cores may reside on the same chip.
- The context switches on some multi-core chips are inexpensive and are performed at a very fine granularity, essentially context switching on every instruction. Multi-threading thus serves to hide the high latency of accessing memory: whenever a thread accesses memory, the processor allows another thread to execute.
[Figure: screenshot from "The Art of Multiprocessor Programming"]

21 Introduction to Concurrency
Memory Consistency
- When a processor writes a value to memory, that value is kept in its cache and marked as dirty, meaning that it must eventually be written back to main memory.
- On most modern processors, write requests are not applied to memory when they are issued. Rather, they are collected in a hardware queue called a write buffer and applied to memory together at a later time.
- This pays off twice. First, it is often more efficient to issue a number of requests all at once, a phenomenon called batching. Second, if a thread writes to an address more than once, the earlier request can be discarded, saving a trip to memory, a phenomenon called write absorption.

22 Introduction to Concurrency
Memory Consistency
- The use of write buffers has a very important consequence: the order in which reads and writes reach memory is not necessarily the order in which they occur in the program.
- Different architectures provide different guarantees about the extent to which memory reads and writes can be reordered.
- A natural expectation: if two processors each first write their own flag and then read the other's flag location, then at least one of them will see the other's newly written flag value. With write buffers, this expectation can fail.
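A minimal Java sketch of that flag example (not from the slides). With plain int fields, the Java memory model and the hardware's write buffers both permit the counterintuitive outcome in which each thread reads 0:

    // Each thread writes its own flag, then reads the other's. Intuitively
    // at least one thread should observe a 1, but without synchronization
    // the writes may still sit in store buffers when the reads happen, so
    // r1 == 0 && r2 == 0 is a permitted (and observable) outcome.
    public class ReorderingDemo {
        static int x = 0, y = 0;
        static int r1, r2;

        public static void main(String[] args) throws InterruptedException {
            Thread a = new Thread(() -> { x = 1; r1 = y; });
            Thread b = new Thread(() -> { y = 1; r2 = x; });
            a.start(); b.start();
            a.join();  b.join();
            System.out.println("r1=" + r1 + ", r2=" + r2);
        }
    }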

23 Introduction to Concurrency
Memory Consistency
- All architectures allow you to force your writes to take place in the order they are issued, but at a price: a memory barrier.
- A memory barrier instruction (sometimes called a fence) flushes the write buffer, ensuring that all writes issued before the barrier become visible to the processor that issued the barrier.
- Memory barriers are expensive (hundreds of cycles, maybe more) and should be used only when necessary. On the other hand, synchronization bugs can be very difficult to track down, so memory barriers should be used liberally rather than relying on complex platform-specific guarantees about limits to memory instruction reordering.
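In Java one rarely issues fences by hand; declaring the fields volatile makes the JVM insert the required barriers, and an explicit full fence exists as VarHandle.fullFence() (Java 9+). A minimal sketch, reusing the flag example above:

    import java.lang.invoke.VarHandle;

    // With volatile flags, the Java memory model forbids the
    // r1 == 0 && r2 == 0 outcome of the previous sketch: the JVM emits
    // the barriers needed to keep volatile accesses in order.
    public class BarrierSketch {
        static volatile int x = 0, y = 0;
        static int r1, r2;

        public static void main(String[] args) throws InterruptedException {
            Thread a = new Thread(() -> { x = 1; r1 = y; });
            Thread b = new Thread(() -> { y = 1; r2 = x; });
            a.start(); b.start();
            a.join();  b.join();
            System.out.println("r1=" + r1 + ", r2=" + r2); // never 0, 0

            VarHandle.fullFence(); // the explicit fence primitive, shown
                                   // only to name it; redundant here
        }
    }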

24 Hardware Synchronization Instructions
- Modern multiprocessor architectures support powerful synchronization primitives, which languages rely on to implement synchronization.
- The compare-and-swap (CAS) instruction takes three arguments: an address a in memory, an expected value e, and an update value v. It returns a Boolean.
- If the memory at address a contains the expected value e, CAS writes the update value v to that address and returns true; otherwise it leaves the memory unchanged and returns false.
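In Java, CAS is exposed through the atomic classes, e.g. AtomicInteger.compareAndSet(expected, update). A minimal sketch (not from the slides) of the standard CAS retry loop, here implementing a lock-free counter:

    import java.util.concurrent.atomic.AtomicInteger;

    // compareAndSet(expected, update) installs `update` and returns true
    // only if the current value equals `expected`.
    public class CasCounter {
        private final AtomicInteger value = new AtomicInteger(0);

        public int increment() {
            while (true) {
                int current = value.get();
                int next = current + 1;
                if (value.compareAndSet(current, next)) {
                    return next;    // our CAS won
                }
                // another thread changed the value in between; retry
            }
        }
    }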

25 Hardware Synchronization Instructions
- Another hardware synchronization primitive is a pair of instructions: load-linked and store-conditional (LL/SC).
- The LL instruction reads from an address a.
- A later SC instruction to a attempts to store a new value at that address, but succeeds only if the contents of address a are unchanged since that thread issued the earlier LL instruction to a.

26 Software

27 Java

28 Introduction to Concurrency
Java
- The Java programming language uses a concurrency model in which threads manipulate objects by calling the objects' methods, coordinating these possibly concurrent calls using various language and library constructs.
- In Java, a thread is usually a subclass of java.lang.Thread, which provides methods for creating threads, starting them, suspending them, and waiting for them to finish.

29 Introduction to Concurrency
Java
To set up a thread:
- Create a class implementing the Runnable interface; the class's run() method is executed by the thread.
- A Runnable object can be turned into a thread by passing it to the Thread class constructor.
[Figure: screenshot from "The Art of Multiprocessor Programming"]
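A minimal sketch of that pattern (the slide's screenshot is not reproduced here; names are illustrative):

    // A Runnable holds the code the thread will execute; the Thread
    // constructor turns it into a thread (created, but not yet running).
    public class ThreadSetup {
        public static void main(String[] args) {
            Runnable task = () -> System.out.println(
                    "hello from " + Thread.currentThread().getName());
            Thread t = new Thread(task);
            // t exists but does nothing until start() is called (next slide)
        }
    }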

30 Introduction to Concurrency
Java
- After a thread has been created, it must be started, by calling start(). The thread that calls this method returns immediately.
- If the caller wants to wait for the thread to finish, it must join the thread, by calling join(). The caller is blocked until the thread's run() method returns.
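A minimal sketch of start() and join():

    public class StartJoin {
        public static void main(String[] args) throws InterruptedException {
            Thread t = new Thread(() -> System.out.println("worker running"));
            t.start();  // caller returns immediately; t now runs concurrently
            t.join();   // block until t's run() method returns
            System.out.println("worker finished");
        }
    }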

31 Introduction to Concurrency
Java
- An application reads value a from a given memory address and computes a new value c for that location. It intends to store c, but only if the value in the address has not changed since it was read.
- One might think that applying a CAS with expected value a and update value c would accomplish this goal.
- There is a problem: a thread could have overwritten the value a with another value b, and later written a again to the address. The CAS then succeeds even though the location changed in between.
[Figure: screenshot from "The Art of Multiprocessor Programming"]
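This is commonly known as the ABA problem. One standard Java mitigation (not shown on the slide) is AtomicStampedReference, which pairs the value with a stamp that changes on every update, so an a-to-b-to-a history is detected:

    import java.util.concurrent.atomic.AtomicStampedReference;

    // The stamp grows on every update, so the CAS fails if the location
    // was changed and changed back between our read and our write.
    public class StampedCell {
        private final AtomicStampedReference<Integer> cell =
                new AtomicStampedReference<>(0, 0);

        public boolean incrementIfUnchanged() {
            int[] stampHolder = new int[1];
            Integer current = cell.get(stampHolder); // value + stamp, together
            Integer next = current + 1;
            // Succeeds only if both the reference and the stamp are unchanged.
            return cell.compareAndSet(current, next,
                                      stampHolder[0], stampHolder[0] + 1);
        }
    }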

32 Introduction to Concurrency
Java - Monitors
- Java provides a number of built-in ways to synchronize access to shared data; one of them is called the monitor model.
- Suppose we want to manage a queue of calls, where each call needs to be taken care of by a different thread.
- It is easy to see that the (unsynchronized) class on the slide does not work correctly if two operators try to dequeue a call at the same time.
[Figure: screenshot from "The Art of Multiprocessor Programming"]
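A sketch of the kind of unsynchronized queue the screenshot showed (the names and the array-based layout here are illustrative assumptions):

    // Broken without synchronization: two threads calling deq() at the
    // same time can read the same head value and return the same call.
    public class CallQueue {
        private final String[] calls = new String[100];
        private int head = 0, tail = 0;

        public void enq(String call) {
            calls[(tail++) % calls.length] = call;
        }

        public String deq() {
            String call = calls[head % calls.length]; // two threads may read
            head++;                                   // the same slot here
            return call;
        }
    }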

33 Introduction to Concurrency
Java - Monitors
- Java provides a useful built-in mechanism to support mutual exclusion: each object has an (implicit) lock.
- If a thread A acquires the object's lock (or, equivalently, locks that object), then no other thread can acquire that lock until A releases it.
- If a class declares a method to be synchronized, that method implicitly acquires the lock when it is called and releases it when it returns.
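Applying this to the queue sketch above, marking the methods synchronized makes enq() and deq() mutually exclusive:

    // Each method acquires the object's implicit lock on entry and
    // releases it on return, so enq() and deq() never run concurrently.
    public class SynchronizedCallQueue {
        private final String[] calls = new String[100];
        private int head = 0, tail = 0;

        public synchronized void enq(String call) {
            calls[(tail++) % calls.length] = call;
        }

        public synchronized String deq() {
            return calls[(head++) % calls.length];
        }
    }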

34 Introduction to Concurrency
Java - Monitors
What should a thread do if there are no calls waiting in the queue?
[Figure: screenshot from "The Art of Multiprocessor Programming"]

35 Introduction to Concurrency
Java - Monitors
[Figure: screenshot from "The Art of Multiprocessor Programming"]

36 Introduction to Concurrency
Java - Monitors: DEADLOCK!
[Figure: screenshot from "The Art of Multiprocessor Programming"]
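Why the slide shouts deadlock: a sketch continuing the queue example (the exact code on the screenshot may differ) in which the consumer spins inside a synchronized method, so the lock is never released and no producer can ever enqueue a call:

    // deq() holds the monitor lock while it spins, so enq() can never
    // acquire the lock to add a call: the system deadlocks.
    public class SpinningCallQueue {
        private final String[] calls = new String[100];
        private int head = 0, tail = 0;

        public synchronized void enq(String call) {
            calls[(tail++) % calls.length] = call; // never runs once a
        }                                          // consumer is spinning

        public synchronized String deq() {
            while (head == tail) {
                // spin with the lock held: nobody else can enter enq()
            }
            return calls[(head++) % calls.length];
        }
    }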

37 Introduction to Concurrency
Java - Monitors
- In Java, each object provides a wait() method that unlocks the object and suspends the caller.
- While that thread is waiting, another thread can lock and change the object. Later, when the suspended thread resumes, it locks the object again before it returns from the wait() call.
[Figure: screenshot from "The Art of Multiprocessor Programming"]
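A sketch of the standard fix: the consumer wait()s when the queue is empty, releasing the lock, and the producer wakes the waiters with notifyAll() (described on the next slide):

    // wait() releases the lock while the consumer sleeps; enq() calls
    // notifyAll() so waiting consumers recheck the condition.
    public class WaitingCallQueue {
        private final String[] calls = new String[100];
        private int head = 0, tail = 0;

        public synchronized void enq(String call) {
            calls[(tail++) % calls.length] = call;
            notifyAll();                      // wake any waiting consumers
        }

        public synchronized String deq() throws InterruptedException {
            while (head == tail) {
                wait(); // releases the lock, suspends, re-locks on resume
            }
            return calls[(head++) % calls.length];
        }
    }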

38 Introduction to Concurrency
Java - Monitors
- The notify() method wakes up one waiting thread, chosen arbitrarily from the set of waiting threads. When that thread awakens, it competes for the lock like any other thread.
- The notifyAll() method wakes up all waiting threads.

39 Java – Yielding and sleeping
In addition to the wait() method, which allows a thread holding a lock to release the lock and pause, Java provides other ways for a thread that does not hold a lock to pause:
- A yield() call pauses the thread, asking the scheduler to run something else. The scheduler decides whether to pause the thread, and when to restart it.
- A call to sleep(t), where t is a time value, instructs the scheduler not to run that thread for that duration.
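A minimal sketch:

    public class PauseDemo {
        public static void main(String[] args) throws InterruptedException {
            Thread.yield();      // hint to the scheduler: run something else
            Thread.sleep(1000);  // do not schedule this thread for ~1 second
            System.out.println("back after yielding and sleeping");
        }
    }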

40 Java – Thread-Local Objects
- Often it is useful for each thread to have its own private instance of a variable.
- Java supports such thread-local objects through the ThreadLocal<T> class, which manages a collection of objects of type T, one for each thread.
- The ThreadLocal class provides get() and set() methods that read and update the thread's local value. The initialValue() method is called the first time a thread tries to get the value of a thread-local object.
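A minimal sketch (ThreadLocal.withInitial is the modern shorthand for overriding initialValue()):

    public class ThreadLocalDemo {
        // Each thread gets its own counter, initialized to 0 on first get().
        static final ThreadLocal<Integer> counter =
                ThreadLocal.withInitial(() -> 0);

        public static void main(String[] args) throws InterruptedException {
            Runnable task = () -> {
                counter.set(counter.get() + 1); // updates this thread's copy only
                System.out.println(Thread.currentThread().getName()
                        + " sees " + counter.get()); // each thread prints 1
            };
            Thread a = new Thread(task), b = new Thread(task);
            a.start(); b.start();
            a.join();  b.join();
        }
    }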

41 C#

42 Introduction to Concurrency
C#
- C# provides a threading model similar to Java's. C# threads are implemented by the System.Threading.Thread class.
- When you create a thread, you tell it what to do by passing it a ThreadStart delegate. A delegate in C# is similar to a function pointer.
[Figure: screenshot from "The Art of Multiprocessor Programming"]

43 Introduction to Concurrency
- As in Java, after a thread has been created, it must be started. The call causes the thread to run, while the caller returns immediately.
- If the caller wants to wait for the thread to finish, it must join the thread; the caller is blocked until the thread's method returns.

44 Introduction to Concurrency
[Figure: screenshot from "The Art of Multiprocessor Programming"]

45 Introduction to Concurrency
C# - Monitors
- For simple mutual exclusion, C# provides the ability to lock an object, much like the synchronized modifier in Java.
- Unlike Java, C# does not allow you to use a lock statement to modify a method directly. Instead, the lock statement is used to enclose the method body.
[Figure: screenshot from "The Art of Multiprocessor Programming"]

46 Introduction to Concurrency
C# - Monitors
- Unlike in Java, where every object is an implicit monitor, in C# you must explicitly create the monitor associated with an object.
- To acquire a monitor lock, call Monitor.Enter(this); to release the lock, call Monitor.Exit(this).
- Each monitor has a single implicit condition, which is waited upon by Monitor.Wait(this) and signaled by Monitor.Pulse(this) or Monitor.PulseAll(this).

47 Introduction to Concurrency
C# - Monitors
[Figure: screenshot from "The Art of Multiprocessor Programming"]

48 Introduction to Concurrency
C# - Monitors
[Figure: screenshot from "The Art of Multiprocessor Programming"]

49 C# - Thread-Local Objects
- C# provides a very simple way to make a static field thread-local: simply prefix the field declaration with the attribute [ThreadStatic].
- Do not provide an initial value for a [ThreadStatic] field, because the initialization happens only once; each thread's copy of the field initially has that type's default value: zero for integers, null for references, and so on.
[Figure: screenshot from "The Art of Multiprocessor Programming"]

50 Pthreads

51 Introduction to Concurrency
Pthreads
- Pthreads is the main thread library used in C and C++. The function pthread_create() creates and starts a thread; unlike in Java or C#, a single call does both.
- Its first argument is a pointer to the thread itself; the second allows you to specify various attributes of the thread; the third is a pointer to the code the thread is to run (in C# this would be a delegate, and in Java a Runnable object); and the fourth is the argument to the thread function.
[Figure: screenshot from "The Art of Multiprocessor Programming"]

52 Introduction to Concurrency
Pthreads
- A thread terminates when its function returns or when it calls pthread_exit().
- Threads can also be joined, by calling pthread_join(); the thread's exit status is stored in the last argument.
[Figure: screenshot from "The Art of Multiprocessor Programming"]

53 Introduction to Concurrency
Pthreads
For example, the following program prints out a simple per-thread message.
[Figure: screenshot from "The Art of Multiprocessor Programming"]

54 Introduction to Concurrency
Pthreads
- The Pthreads library calls its locks mutexes. A mutex is created by calling pthread_mutex_init(), locked by pthread_mutex_lock(), and unlocked by pthread_mutex_unlock().
[Figure: screenshot from "The Art of Multiprocessor Programming"]

55 Introduction to Concurrency
Pthreads
- The Pthreads library also provides condition variables, created by calling pthread_cond_init(); as usual, the second argument sets attributes to nondefault values.
- pthread_cond_wait() releases a lock and waits on a condition variable; pthread_cond_signal() and pthread_cond_broadcast() awaken suspended threads.
- Unlike in Java or C#, the association between a lock and a condition variable is explicit, not implicit.
- Because C is not garbage collected, threads, locks, and condition variables all provide destroy functions that allow their resources to be reclaimed.
[Figure: screenshot from "The Art of Multiprocessor Programming"]

56 Introduction to Concurrency
Pthreads
[Figure: screenshot from "The Art of Multiprocessor Programming"]

57 Introduction to Concurrency
Pthreads
[Figure: screenshot from "The Art of Multiprocessor Programming"]

58 Questions?

