1 Multithreaded Programming Concepts. Myongji University, Sugwon Hong
2 Why Multi-Core? Until recently, increasing the clock frequency was the holy grail of processor designers seeking higher performance. Raising the clock speed has hit a dead end, however, because of power consumption and overheating. Designers have found that running several cores at a lower frequency is much more efficient than running a single core at a much higher frequency.
3 Power and Frequency (source : Intel Academy program)
4 A little bit of history: In the past, performance scaling in single-core processors was achieved by increasing the clock frequency. As processors shrank and clock frequencies rose, two problems emerged: excess power consumption and overheating, and memory access time that failed to keep pace with the increasing clock frequencies.
5 Instruction/data-level parallelism: Since 1993, processor designers have supported parallel execution at the instruction and data levels. Instruction-level parallelism: an out-of-order execution pipeline and multiple functional units execute instructions in parallel. Data-level parallelism: MultiMedia eXtensions (MMX) in 1997 and Streaming SIMD Extensions (SSE).
6 Hyper-Threading: In 2002, Intel used additional copies of execution resources to execute two separate threads simultaneously on the same processor core. This multithreading idea eventually led to the introduction of the dual-core processor in 2005.
7 Evolution of Multi-Core Technology (source : Intel Academy program)
8 Multi-processor Architecture: Shared-memory multiprocessor (SMP). Non-shared-memory architectures: Massively Parallel Processor (MPP) and Cluster. [Figure: SMP, with CPUs attached to a shared memory; MPP, with CPUs attached to interconnected memories]
9 Multi-processors vs. Multi-cores: Shared-memory multiprocessors (SMP). Multiple threads on a single core (SMT). Multiple threads on multiple cores (CMT). Tricky acronyms: CMP (Chip Multi-Processor), SMT (Simultaneous MultiThreading), CMT (Chip-level MultiThreading).
10 CMT processor products: 1st generation: Sun Microsystems (late 2005); Intel Dual-Core Xeon (2005); Intel Quad-Core Xeon (late 2006); AMD Quad-Core Opteron (2007); 8-Core (??)
11 Thread: A thread is a sequential flow of instructions executed within a program. Thread vs. Process: A process always has one main thread, which initializes the process and begins executing its instructions. Any thread can create other threads within the process. All of them share the code and data segments, but each thread has its own stack.
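The thread model described here can be sketched with POSIX threads. This is a minimal illustration, not the course's own code: the names `worker`, `run_workers`, and `results` are hypothetical, and a POSIX system (compiling with `-pthread`) is assumed.

```c
#include <assert.h>
#include <pthread.h>

#define NUM_THREADS 4

/* Data segment: one copy, visible to every thread in the process. */
static int results[NUM_THREADS];

/* Each thread runs this function on its own stack, so `local` is private. */
static void *worker(void *arg)
{
    int id = *(int *)arg;
    int local = id * id;   /* lives on this thread's stack */
    results[id] = local;   /* each thread writes a distinct slot: no race */
    return NULL;
}

/* Called from the main thread: creates the workers, waits for them,
   and returns the sum of the shared results, or -1 on failure. */
int run_workers(void)
{
    pthread_t threads[NUM_THREADS];
    int ids[NUM_THREADS];
    int created = 0;
    int sum = 0;

    for (int i = 0; i < NUM_THREADS; ++i) {
        ids[i] = i;
        if (pthread_create(&threads[i], NULL, worker, &ids[i]) != 0)
            break;
        ++created;
    }
    for (int i = 0; i < created; ++i)
        pthread_join(threads[i], NULL);
    if (created != NUM_THREADS)
        return -1;

    for (int i = 0; i < NUM_THREADS; ++i)
        sum += results[i];
    return sum;
}
```

Calling `run_workers()` from `main` shows both halves of the statement: the workers communicate through the shared `results` array, while `id` and `local` exist once per thread.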
12 Thread in a Process (figure)
13 Why use threads? Threads are intended to improve the performance and responsiveness of a program. Quick turnaround time: completing a single job in the smallest amount of time possible. High throughput: finishing the most tasks in a fixed amount of time.
14 Risks of using Threads: If threads are not used properly, they can lead to degraded performance, unpredictable behavior, and error conditions such as data races (race conditions) and deadlock. They also bring extra burdens: code complexity, portability issues, and difficulty in testing and debugging.
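15 Race condition: It happens when two or more threads access a shared variable at the same time and at least one of them writes to it. “It is nondeterministic!” For example, when Thread A and Thread B both execute the statement area = area / (1.0 + x*x), the final value of area depends on how their reads and writes interleave.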
16 (source : Intel Academy program)
17 How to deal with race conditions: Synchronization. Critical region. Mutual exclusion.
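The three ideas can be sketched together with a POSIX mutex. This is an illustrative example rather than the slide's `area` code: it swaps in a simpler shared counter, and the names `add_to_counter`, `run_counter_demo`, and `counter_lock` are hypothetical. A POSIX system (compiling with `-pthread`) is assumed.

```c
#include <assert.h>
#include <pthread.h>

#define NUM_THREADS 4
#define INCREMENTS 100000

static long counter = 0;  /* shared variable */
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

static void *add_to_counter(void *arg)
{
    (void)arg;
    for (int i = 0; i < INCREMENTS; ++i) {
        pthread_mutex_lock(&counter_lock);    /* enter critical region */
        counter = counter + 1;                /* read-modify-write, one thread at a time */
        pthread_mutex_unlock(&counter_lock);  /* leave critical region */
    }
    return NULL;
}

/* Runs NUM_THREADS threads that all update `counter` and returns the
   final value, or -1 if a thread could not be created. */
long run_counter_demo(void)
{
    pthread_t threads[NUM_THREADS];
    int created = 0;

    counter = 0;
    for (int i = 0; i < NUM_THREADS; ++i) {
        if (pthread_create(&threads[i], NULL, add_to_counter, NULL) != 0)
            break;
        ++created;
    }
    for (int i = 0; i < created; ++i)
        pthread_join(threads[i], NULL);

    return created == NUM_THREADS ? counter : -1;
}
```

With the lock in place the result is always `NUM_THREADS * INCREMENTS`. Removing the lock and unlock calls turns the update into a data race, and the result may then vary from run to run.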
18 Concurrency vs. Parallelism: The two terms are often used interchangeably, but conventional wisdom draws the following distinction. Concurrency: two or more threads are in progress at the same time, normally on a single processor. Parallelism: two or more threads are executed simultaneously on multiple cores.
19 Performance criteria: Speedup, Efficiency, Granularity, Load balance
20 Speedup: The most noticeable quantitative measure compares the execution time of the best serial algorithm with that of the parallel algorithm. Speedup = Ts / Tp, where Ts = serial time and Tp = parallel time. Amdahl’s Law: Speedup = 1 / [S + (1 - S)/n + H(n)], where S is the fraction of time spent executing the serial portion, H(n) is the parallel overhead, and n is the number of cores.
21 Example: Consider painting a fence. Suppose it takes 30 min to get ready to paint and 30 min to clean up after painting. Assume that it takes 1 min to paint a single picket and there are 300 pickets. What are the speedups when 1, 2, 10, and 100 painters do this job, respectively? What is the maximum speedup? What if you use a spray gun to paint the fence? What happens if the fence owner uses a spray gun to paint 300 pickets in 1 hr?
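One way to work the painter questions is to plug the numbers into Amdahl's Law. The sketch below assumes that setup and cleanup (60 min) cannot be shared among painters, that the 300 min of painting divides evenly, and that there is no overhead, H(n) = 0. The function names are hypothetical.

```c
#include <assert.h>

/* Amdahl's Law with overhead: 1 / [S + (1 - S)/n + H(n)]. */
double amdahl_speedup(double serial_fraction, int n, double overhead)
{
    assert(n > 0);
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n + overhead);
}

/* Fence example: 30 min setup + 30 min cleanup are serial,
   300 pickets at 1 min each can be split among the painters. */
double fence_speedup(int painters)
{
    const double serial_min = 30.0 + 30.0;
    const double parallel_min = 300.0;
    const double s = serial_min / (serial_min + parallel_min); /* 60/360 = 1/6 */

    return amdahl_speedup(s, painters, 0.0); /* assumption: H(n) = 0 */
}
```

Under these assumptions, `fence_speedup` gives 1.0 for 1 painter, about 1.71 for 2, 4.0 for 10, and about 5.71 for 100. As the number of painters grows, the speedup approaches but never exceeds 1/S = 6, because the 60 serial minutes remain.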
22 Parallel Efficiency: A measure of how efficiently core resources are used during parallel computation. Efficiency = (Speedup / Number of Threads) * 100%. In the previous example, suppose you learned that the painters were busy, on average, for less than 6% of the entire job time but were still paid for the whole time. Do you think you got your money’s worth from the 100 painters?
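The efficiency formula can be checked against the fence numbers with a small helper. This is a sketch: the function name is hypothetical, and the times used in the note below (360 min for one painter, 63 min for 100 painters) assume the painting divides evenly while setup and cleanup do not.

```c
#include <assert.h>

/* Efficiency = (Speedup / Number of Threads) * 100%,
   with Speedup = serial time / parallel time. */
double parallel_efficiency_percent(double serial_time, double parallel_time,
                                   int threads)
{
    assert(parallel_time > 0.0 && threads > 0);
    double speedup = serial_time / parallel_time;
    return speedup / threads * 100.0;
}
```

For 100 painters, `parallel_efficiency_percent(360.0, 63.0, 100)` comes out at roughly 5.7%, which matches the "less than 6%" figure in the question.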
23 Granularity: The ratio of computation to synchronization. Coarse-grained: concurrent threads have a large amount of computation between synchronization events. Fine-grained: concurrent threads have very little computation between synchronization events.
24 Load Balance: Balancing the workload among multiple threads. If more work is assigned to some threads, the threads with less work will sit idle until the threads with more work finish. All the cores must be kept busy to get maximum performance. For load balancing, which task size is better: large or small?
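The task-size question can be explored with a small model. The sketch below is an assumption-laden illustration, not a real scheduler: each task goes, in order, to whichever worker becomes free first, scheduling overhead is ignored, and the name `makespan` and the `MAX_WORKERS` limit are hypothetical.

```c
#include <assert.h>

#define MAX_WORKERS 64

/* Hands each task, in order, to the worker that becomes free first
   and returns the time at which the last worker finishes. */
double makespan(const double *task_time, int num_tasks, int num_workers)
{
    double busy_until[MAX_WORKERS] = {0.0};
    double finish = 0.0;

    assert(num_workers > 0 && num_workers <= MAX_WORKERS);

    for (int t = 0; t < num_tasks; ++t) {
        int next = 0; /* index of the worker that is free earliest */
        for (int w = 1; w < num_workers; ++w)
            if (busy_until[w] < busy_until[next])
                next = w;
        busy_until[next] += task_time[t];
    }
    for (int w = 0; w < num_workers; ++w)
        if (busy_until[w] > finish)
            finish = busy_until[w];
    return finish;
}
```

With two workers, three large tasks of 8, 2, and 2 time units finish at time 8 while one worker idles for 4 units. The same 12 units split into twelve tasks of 1 unit finish at time 6. In this model smaller tasks balance better, but in practice each extra task adds scheduling and synchronization cost, which is the granularity trade-off.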
25 Flash Demo
26 Computer Memory Hierarchy: CPU; L1 cache (~1s of cycles); L2 cache (~1s to ~10 cycles); main memory (~100s of cycles); disk (~1000s of cycles)
27 Architecture consideration (1): To obtain better performance, we need to understand how the work is done inside the machine. Cache: cache line (cache block, e.g. 64 bytes); data moves between memory and the caches in units of cache lines; caches may be shared or separate between cores; a cache miss is very costly; cache coherency is needed when the caches are separate; replacement policies such as LRU.
28 Architecture consideration (2): Memory management: paging, translation look-aside buffer (TLB). Inside the CPU: registers.
29 False sharing: Assume the cache line is 64 bytes and the arrays below are shared by both threads. What happens if the two threads execute at the same time?
    int a[1000]; int b[1000];
    Thread 1: while (...) a[998] = i * 1000;
    Thread 2: while (...) b[0] = i;
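A rough way to see the problem, and one common remedy, is sketched below. The assumptions are a 64-byte cache line (as on the slide), 4-byte `int`, `b` placed immediately after `a`, and `a` starting on a cache-line boundary. None of these is guaranteed by the C language. The names `same_cache_line`, `shared_counters`, and `padded_counters` are hypothetical, and `_Alignas` requires C11.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE_SIZE 64 /* assumed line size, as on the slide */

/* Returns 1 if two byte offsets, measured from a cache-line-aligned
   base address, fall in the same cache line. */
int same_cache_line(size_t offset1, size_t offset2)
{
    return offset1 / CACHE_LINE_SIZE == offset2 / CACHE_LINE_SIZE;
}

/* Unpadded: the two fields are adjacent, so they will usually share
   a line and writes from two threads will invalidate each other. */
struct shared_counters {
    int a;
    int b;
};

/* Padded: each field is aligned to its own cache line, so the two
   threads no longer contend for the same line. */
struct padded_counters {
    _Alignas(CACHE_LINE_SIZE) int a;
    _Alignas(CACHE_LINE_SIZE) int b;
};
```

Under the stated layout assumptions, `a[998]` sits at byte offset 3992 and `b[0]` at 4000, so `same_cache_line(3992, 4000)` is true: every write by one thread forces the other core to reload the line even though the threads never touch the same variable. Aligning each thread's data to its own line, as in `padded_counters`, trades memory for the removal of that traffic.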
30 Poor cache utilization: What is the difference between the following two code fragments?
    int a[1000][1000];
    for (i = 0; i < 100; ++i)
        for (j = 0; j < 1000; ++j)
            a[i][j] = i * j;
    int b[1000][1000];
    for (i = 0; i < 100; ++i)
        for (j = 0; j < 1000; ++j)
            b[j][i] = i * j;
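The difference between the two fragments lies in the order in which memory is touched, which can be made concrete by measuring the stride of the inner loop. The sketch below is illustrative: it uses a single hypothetical array `grid` and full 1000 x 1000 loops, and its function names are not from the slide.

```c
#include <assert.h>
#include <stddef.h>

#define N 1000

static int grid[N][N];

/* Row order: the inner loop walks along one row, so consecutive
   iterations touch adjacent ints in memory. */
long long fill_row_order(void)
{
    long long sum = 0;
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j) {
            grid[i][j] = i * j;
            sum += grid[i][j];
        }
    return sum;
}

/* Column order: the inner loop jumps a whole row between iterations,
   so nearly every access lands in a different cache line. */
long long fill_column_order(void)
{
    long long sum = 0;
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j) {
            grid[j][i] = i * j;
            sum += grid[j][i];
        }
    return sum;
}

/* Distance in bytes between consecutive inner-loop accesses. */
size_t row_order_stride(void)
{
    return (size_t)((char *)&grid[0][1] - (char *)&grid[0][0]);
}

size_t column_order_stride(void)
{
    return (size_t)((char *)&grid[1][0] - (char *)&grid[0][0]);
}
```

Both functions compute the same values, but the row-order loop advances `sizeof(int)` bytes per step, so with 4-byte `int` one 64-byte cache line serves 16 accesses. The column-order loop advances `N * sizeof(int)` bytes per step and uses only one `int` from each line it loads.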
31 Poor Cache Utilization - with eggs (source : Intel Academy program)
32 Good Cache Utilization – with eggs (source : Intel Academy program)