Published by Esther Dora Grant. Modified over 9 years ago.
1
Module 1: Parallel Programming and Threads
2
Parallelism and Concurrency: System and Environment
Parallelism: exploit system resources to speed up computation.
Concurrency: respond quickly and properly to events from the environment or from other parts of the system.
[Figure labels: Environment, System, Events]
Practical Parallel and Concurrent Programming DRAFT: comments to msrpcpcp@microsoft.com 1/3/2016
3
Components and Parallelism
A component can use parallelism internally to improve performance.
Usually, clients need not be aware of internal parallelism.
Why would the interface change because of internal parallelism?
[Figure labels: Client, call, return, A, C, m(0), m(1), …, m(N-1)]
4
Encapsulating Parallelism
A component can have a parallel implementation.
Whether or not there is internal parallelism is an “implementation” detail.
The behavior of the parallel implementation should be the “same” as that of the sequential one, where the component specification defines “same”.
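The idea of hiding parallelism behind an unchanged interface can be sketched as follows. This is a minimal illustration, not taken from the slides: the names sum_sequential, sum_parallel, struct slice and the worker count NUM_WORKERS are all hypothetical. Both functions have the same signature and return the same value, so a client cannot tell which one it is calling.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical component interface: sum the n values in a[]. */
long sum_sequential(const int *a, size_t n) {
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += a[i];
    return total;
}

#define NUM_WORKERS 4   /* assumed worker count, chosen for illustration */

struct slice {
    const int *start;   /* first element of this worker's slice */
    size_t count;       /* number of elements in the slice */
    long partial;       /* written only by the owning worker */
};

static void *sum_slice(void *arg) {
    struct slice *s = arg;
    s->partial = sum_sequential(s->start, s->count);
    return NULL;
}

/* Same signature and result as sum_sequential; the threads are internal. */
long sum_parallel(const int *a, size_t n) {
    pthread_t tid[NUM_WORKERS];
    struct slice part[NUM_WORKERS];
    int started[NUM_WORKERS];
    size_t chunk = n / NUM_WORKERS;
    long total = 0;

    for (int w = 0; w < NUM_WORKERS; w++) {
        part[w].start = a + (size_t)w * chunk;
        /* the last worker also takes the remainder */
        part[w].count = (w == NUM_WORKERS - 1) ? n - (size_t)w * chunk : chunk;
        part[w].partial = 0;
        started[w] = (pthread_create(&tid[w], NULL, sum_slice, &part[w]) == 0);
        if (!started[w])
            sum_slice(&part[w]);   /* fall back to computing in the caller */
    }
    for (int w = 0; w < NUM_WORKERS; w++) {
        if (started[w])
            pthread_join(tid[w], NULL);   /* partial is safe to read after join */
        total += part[w].partial;
    }
    return total;
}
```

Here “same” is easy to define because integer addition is associative: any split of the array yields the same total. For floating-point data the specification would have to say how much the results may differ.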
5
Examples
Parallel parsing of HTML
Parallel XML query processing
Use of commands in Linux
Applying the same command to multiple files
Searching different Internet sites
6
A simple microprocessor model ~ 1985
Single h/w thread
Instructions execute one after the other
Memory access time ~ clock cycle time
ALU: arithmetic logic unit
[Figure: main memory and a processor core with one ALU executing an instruction stream (1, 2, 3, 4, 5, ...), animated over clock cycles 0 to 12; completion times shown: 2, 4, 6, 9, 12]
7
FastFwd Two Decades (circa 2005): Power Hungry Superscalar with Caches
Multiple levels of cache: 2 cycles for L1, 20 cycles for L2, 200 cycles for memory
Dynamic out-of-order instruction execution
Pipelined memory accesses
Speculation: execute instructions before a branch is resolved
[Figure: main memory, L2 cache (4MB), L1 cache (64KB) and ALU executing an instruction stream (1, 2, 3, 4, 5, ...); completion times shown include 2, 204 (main memory) and 226 (hit in L2)]
9
Power wall: we can’t clock processors faster.
Memory wall: many workloads’ performance is dominated by memory access times.
Instruction-level parallelism (ILP) wall: we can’t find extra work to keep functional units busy while waiting for memory accesses.
10
Multi-core h/w – common L2
[Figure: two cores, each with an ALU and an L1 cache and its own instruction stream (1, 2, 3, 4, 5, ...); L2 cache; main memory]
11
Multi-core h/w – additional L3
[Figure: two single-threaded cores, each with an L1 cache and its own instruction stream (1, 2, 3, 4, 5, ...); L2 cache; L3 cache; main memory]
12
SMP multiprocessor
[Figure: two single-threaded cores, each with an L1 cache and its own instruction stream (1, 2, 3, 4, 5, ...); L2 cache; main memory]
13
NUMA multiprocessor (non-uniform memory access)
[Figure: four nodes joined by an interconnect; each node has two single-threaded cores with L1 caches, an L2 cache, and memory & directory]
14
Three kinds of parallel hardware
Multi-threaded cores
  Increase utilization of a core or memory b/w
  Peak ops/cycle fixed
Multiple cores
  Increase ops/cycle
  Don’t necessarily scale caches and off-chip resources proportionately
Multi-processor machines
  Increase ops/cycle
  Often scale cache & memory capacities and b/w proportionately
15
Sequential Program
int sum = 5;
for (int i = 0; i < 5; i++)
    sum += i;
16
Sequential Program
Determinism
  Given a current program state and a code fragment, determine the next program state
Termination
  Prove that a program terminates
  Usually depends on loop or procedure recursion termination conditions
17
Parallel Programs
Concurrent
  Non-deterministic: given state and code, next state is ???
  Non-terminating
Distributed
  Concurrent but can survive partial failure of thread or process
Byzantine
  Distributed but can survive partial failure at the worst time and in the worst way
18
Program Representation
#include <stdio.h>                      /* handled by the preprocessor */
static int X = 5;                       /* space allocated and initialized at compile time */
int main(int argc, char *argv[]) {      /* argc, argv: local variables */
    printf("%d %s \n", argc, argv[0]);  /* "%d %s \n": string constant */
    return 0;
}
19
Program Representation (after compilation)
Object Module (.o, .obj)
  Code
  Uninitialized static data
  Initialized static data
    X, 32 bits, 0x00000005
    String, 64 bits, “%d %s \n”
  Symbol Table
    Defined: main
    Referenced: printf
20
Linking and Loading
Linker (can also create libraries)
  Combines multiple object modules into one
  Satisfies any symbol references among the combined modules
Loader
  Combines object modules and libraries into an executable file (a.out or .exe)
  All symbol references must be satisfied
  Symbol table used by debuggers
Dynamic linking
  Stops program on reference to an undefined symbol, finds obj in file system, links and continues
Demand loading
  Symbol ref satisfied before execution but load delayed
21
Program Representation at Runtime
Same as in the object modules
  Code
  Static data
Created at runtime
  Procedure call frame stack
  Heap to support new/delete on dynamic variables
    & Ref -- implicit pointer variable to data in the heap
    * -- explicit pointer variable to data in the heap
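The three kinds of runtime storage can be seen side by side in a few lines of C. This is a sketch under stated assumptions: make_boxed, calls_so_far and call_count are hypothetical names, and because C has no new/delete it uses malloc/free for the heap.

```c
#include <stdlib.h>

static int call_count = 0;   /* static data: one copy, lives for the whole run */

/* Returns a heap-allocated copy of value; the caller must free() it. */
int *make_boxed(int value) {
    int local = value;                    /* lives in this call's stack frame */
    int *boxed = malloc(sizeof *boxed);   /* lives in the heap */
    call_count++;
    if (boxed != NULL)
        *boxed = local;                   /* copy out before the frame disappears */
    return boxed;                         /* the heap object outlives the frame */
}

/* Reads the static counter; it keeps its value between calls. */
int calls_so_far(void) {
    return call_count;
}
```

Returning the address of local instead of boxed would be a bug, because the frame that holds local is popped when make_boxed returns.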
22
Coroutine, state vector
State vector: the hardware registers that must be saved when losing control of a physical processor and restored when gaining control of a physical processor.
Coroutine: the data structure that holds a saved state vector.
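Standard C offers a small-scale analogy to saving and restoring a state vector: setjmp stores enough of the calling environment in a jmp_buf to resume there later, and longjmp restores it. The sketch below is only an analogy, not how an operating system switches threads, and the function name save_and_restore_once is hypothetical.

```c
#include <setjmp.h>

/* Saves the execution state once, restores it once, and returns how many
   times control passed through the point after setjmp. */
int save_and_restore_once(void) {
    jmp_buf state;              /* holds the saved execution state */
    volatile int passes = 0;    /* volatile so the value survives longjmp */

    if (setjmp(state) == 0) {
        passes = 1;             /* first pass: the state was just saved */
        longjmp(state, 1);      /* restore it; setjmp now appears to return 1 */
    } else {
        passes = passes + 1;    /* second pass: resumed from the saved state */
    }
    return passes;
}
```

The jmp_buf plays the role the slide gives to the coroutine: a data structure holding saved state so execution can continue from where it was captured.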
23
Intel x86 State Vector
AX=0000 BX=0000 CX=0000 DX=0000 SI=0000 DI=0000
SP=FFEE  top-of-stack pointer
BP=0000  procedure call frame pointer
DS=0AD5  data segment pointer
SS=0AD5  stack segment pointer
CS=0AD5  code segment pointer
ES=0AD5
IP=0100  instruction pointer (next instruction to execute)
NV UP EI PL NZ NA PO NC  (processor status bits)

CS:IP      Code Bytes  Instruction
0AD5:0100  E8 FD 00    CALL 0200
24
C Procedure Call Frame
25
Command Line
./a.out apple do* pear
The shell expands command-line arguments.
./a.out apple donut doright pear
argc = 5
argv[0] = “/Users/bobcook/home/bin/a.out”
argv[1] = “apple”
argv[2] = “donut”
argv[3] = “doright”
argv[4] = “pear”
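From the program's side, the expansion has already happened by the time main runs: the pattern do* never appears in argv, only the names it matched. A minimal sketch, assuming a hypothetical helper count_with_prefix that main would call with its own argc and argv:

```c
#include <stdio.h>
#include <string.h>

/* Prints every argument and returns how many of argv[1..argc-1]
   start with the given prefix. */
int count_with_prefix(int argc, char *argv[], const char *prefix) {
    size_t len = strlen(prefix);
    int matches = 0;

    printf("argc = %d\n", argc);
    for (int i = 0; i < argc; i++) {
        printf("argv[%d] = \"%s\"\n", i, argv[i]);
        /* skip argv[0], which is the program name */
        if (i > 0 && strncmp(argv[i], prefix, len) == 0)
            matches++;
    }
    return matches;
}
```

With the arguments shown on the slide, a call such as count_with_prefix(argc, argv, "do") would count donut and doright.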
26
Apple Xcode Debugger
27
Macintosh-5:fact bobcook$ gcc -g main.c
Macintosh-5:fact bobcook$ gdb a.out
(gdb) list
1	#include
2
3	int factorial(int n) {
4	    if (n<2)
5	        return 1;
6	    return n*factorial(n-1);
7	}
8
9	int main(int argc, char *argv[]) {
10	    printf("%d\n", factorial(atoi(argv[1])));
28
(gdb) set args 5
(gdb) b 5
Breakpoint 1 at 0x1f0a: file main.c, line 5.
(gdb) r
Starting program: /Users/bobcook/Desktop/fact/a.out 5
Reading symbols for shared libraries +. done
Breakpoint 1, factorial (n=1) at main.c:5
5	        return 1;
(gdb) bt
#0 factorial (n=1) at main.c:5
#1 0x00001f1f in factorial (n=2) at main.c:6
#2 0x00001f1f in factorial (n=3) at main.c:6
#3 0x00001f1f in factorial (n=4) at main.c:6
#4 0x00001f1f in factorial (n=5) at main.c:6
#5 0x00001f52 in main (argc=2, argv=0xbffff840) at main.c:10
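The backtrace shows one frame per active call to factorial. The same count can be obtained without a debugger by threading a depth counter through the recursion. This is an illustrative sketch; factorial_traced and factorial_frames are hypothetical names, not part of the program in the session above.

```c
/* Like factorial, but records the deepest call depth that was reached. */
static int factorial_traced(int n, int depth, int *max_depth) {
    if (depth > *max_depth)
        *max_depth = depth;     /* one more frame is active than before */
    if (n < 2)
        return 1;
    return n * factorial_traced(n - 1, depth + 1, max_depth);
}

/* Stores n! in *result and returns the number of factorial frames that
   were on the stack at the deepest point of the recursion. */
int factorial_frames(int n, int *result) {
    int max_depth = 0;
    *result = factorial_traced(n, 1, &max_depth);
    return max_depth;
}
```

For n = 5 the deepest point has five factorial frames active, matching frames #0 through #4 in the backtrace.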
29
Context Block
User struct in UNIX
Operating system information to define its virtual processor
  Coroutine
  Code, data, stack, heap segments
  User id, group id, process id, parent id
  Resource usage information
  Scheduling information (priority)
30
Process
A program in execution
Thread -- entity within a process that can be scheduled for execution
  Coroutine, thread id, thread priority, thread local storage, a unique call stack
  All threads in a process share code, data, heap
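The split between shared data and a per-thread call stack can be made concrete with a short pthreads sketch. The names shared_slots, fill_slot, run_threads and the thread count NUM_THREADS are assumptions made for illustration. Every thread sees the same static array, while id and private_total exist separately on each thread's own stack.

```c
#include <pthread.h>
#include <stdint.h>

#define NUM_THREADS 3   /* assumed thread count, for illustration */

static int shared_slots[NUM_THREADS];   /* static data: visible to every thread */

static void *fill_slot(void *arg) {
    int id = (int)(intptr_t)arg;        /* local: on this thread's own stack */
    int private_total = 0;              /* local: one copy per thread */
    for (int i = 1; i <= 10; i++)
        private_total += i * (id + 1);
    shared_slots[id] = private_total;   /* each thread writes only its own slot */
    return NULL;
}

/* Runs the threads and returns the sum of all slots, or -1 if a thread
   could not be created. */
int run_threads(void) {
    pthread_t tid[NUM_THREADS];
    int created = 0, total = 0;

    while (created < NUM_THREADS &&
           pthread_create(&tid[created], NULL, fill_slot,
                          (void *)(intptr_t)created) == 0)
        created++;
    for (int t = 0; t < created; t++)
        pthread_join(tid[t], NULL);     /* wait before reading shared data */
    if (created < NUM_THREADS)
        return -1;
    for (int t = 0; t < NUM_THREADS; t++)
        total += shared_slots[t];
    return total;
}
```

Because each thread writes a different element and the reader waits for pthread_join, the result does not depend on how the threads interleave.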
31
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

void *p(void *arg) {
    int i;
    (void)arg;                      /* the argument (34) is not used here */
    for (i = 0; i < 5; i++) {
        printf("X\n");
        sleep(1);
    }
    pthread_exit((void *)99);
}

int main() {                        // X Y interleaving is unpredictable
    pthread_t x;
    void *r;
    int i, rc;
    /* make the calls outside assert() so they still run when NDEBUG is defined */
    rc = pthread_create(&x, NULL, p, (void *)34);
    assert(rc == 0);
    for (i = 0; i < 5; i++) {
        printf("Y\n");
        sleep(1);
    }
    rc = pthread_join(x, &r);       /* r receives the thread's exit value (99) */
    assert(rc == 0);
    return 0;
}
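The program above passes 34 to the thread and receives 99 back through pthread_join, but uses neither value. A small sketch of how both directions can carry data, assuming hypothetical names add_one and run_worker; the integer is cast through intptr_t so it fits in the void * that pthreads expects.

```c
#include <pthread.h>
#include <stdint.h>

/* Thread body: interprets its argument as an integer and hands back n + 1. */
static void *add_one(void *arg) {
    intptr_t n = (intptr_t)arg;     /* recover the integer passed through void * */
    /* returning from the start routine has the same effect as calling
       pthread_exit with this value */
    return (void *)(n + 1);
}

/* Runs add_one in a new thread and returns its exit value,
   or -1 if the thread could not be created or joined. */
long run_worker(long input) {
    pthread_t tid;
    void *result = NULL;

    if (pthread_create(&tid, NULL, add_one, (void *)(intptr_t)input) != 0)
        return -1;
    if (pthread_join(tid, &result) != 0)    /* result receives the exit value */
        return -1;
    return (long)(intptr_t)result;
}
```

Passing a small integer inside the pointer avoids sharing memory between the threads; larger inputs or results would normally be passed as a pointer to a struct that outlives the thread.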
32
Thread State Transitions
33
Multi-Thread Debugging
[Figure: several threads, each with a Thread ID and its own Call Stack (1st frame … Nth frame); frames hold local variables]