An Introduction To PARALLEL PROGRAMMING Ing. Andrea Marongiu
The Multicore Revolution is Here! More instruction-level parallelism hard to find Very complex designs needed for small gain Thread-level parallelism appears alive and well Clock frequency scaling is slowing drastically Too much power and heat when pushing the envelope Cannot communicate across chip fast enough Better to design small local units with short paths Effective use of billions of transistors Easier to reuse a basic unit many times Potential for very easy scaling Just keep adding processors/cores for higher (peak) performance
Vocabulary in the Multi Era AMP, Asymmetric MP: Each processor has local memory, tasks statically allocated to one processor SMP, Shared-Memory MP: Processors share memory, tasks dynamically scheduled to any processor
Vocabulary in the Multi Era Heterogeneous: Specialization among processors. Often different instruction sets. Usually AMP design. Homogeneous: all processors have the same instruction set, can run any task, usually SMP design.
Future Embedded Systems
The First Software Crisis 60’s and 70’s: PROBLEM: Assembly Language Programming Need to get abstraction and portability without losing performance SOLUTION: High-level Languages (Fortran and C) Provided “common machine language” for uniprocessors
The Second Software Crisis 80’s and 90’s: PROBLEM: Inability to build and maintain complex and robust applications requiring multi-million lines of code developed by hundreds of programmers Need for composability, malleability and maintainability SOLUTION: Object-Oriented Programming (C++ and Java) Better tools and software engineering methodology (design patterns, specification, testing)
The Third Software Crisis Today: PROBLEM: Solid boundary between hardware and software High-level languages abstract away the hardware Sequential performance is left behind by Moore’s Law SOLUTION: What’s under the hood? Language features for architectural awareness
The Software becomes the Problem, AGAIN Parallelism required to gain performance Parallel hardware is “easy” to design Parallel software is (very) hard to write Fundamentally hard to grasp true concurrency Especially in complex software environments Existing software assumes single-processor Might break in new and interesting ways Multitasking is no guarantee that it will run correctly on a multiprocessor
Parallel Programming Principles Coverage (Amdahl’s Law) Communication/Synchronization Granularity Load Balance Locality
Coverage Can we use more, less powerful (and less power-hungry) cores to achieve the same performance?
Coverage Amdahl's Law: The performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used. Speedup = old running time / new running time = 100 seconds / 60 seconds = 1.67
Amdahl’s Law p = fraction of work that can be parallelized n = the number of processors Speedup = 1 / ((1 - p) + p/n)
Implications of Amdahl’s Law Speedup tends to 1/(1-p) as the number of processors tends to infinity Parallel programming is worthwhile when programs have a lot of work that is parallel in nature
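The following is a small sketch, not part of the original slides, that simply evaluates the formula above for a growing number of processors; the fraction p = 0.9 is an arbitrary example value. It shows the speedup flattening toward the 1/(1-p) limit.

    #include <stdio.h>

    /* Evaluate Amdahl's Law: speedup of a program whose fraction p can be
       parallelized, when run on n processors. */
    static double amdahl_speedup(double p, int n)
    {
        return 1.0 / ((1.0 - p) + p / n);
    }

    int main(void)
    {
        const double p = 0.9;   /* example value: 90% of the work is parallel */
        for (int n = 1; n <= 1024; n *= 2)
            printf("n = %4d   speedup = %6.2f\n", n, amdahl_speedup(p, n));
        printf("limit for n -> infinity = %6.2f\n", 1.0 / (1.0 - p));
        return 0;
    }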
Overhead of Parallelism Given enough parallel work, this is the biggest barrier to getting desired speedup Parallelism overheads include: cost of starting a thread or process cost of communicating shared data cost of synchronizing extra (redundant) computation Tradeoff: Algorithm needs sufficiently large units of work to run fast in parallel (i.e. large granularity), but not so large that there is not enough parallel work
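As a rough illustration of these fixed costs, here is a hedged POSIX-threads sketch (pthreads is an assumption, not something the slide prescribes; compile with -pthread) that times creating and joining threads that do almost no work. The per-thread cost it prints is pure overhead that real computation must amortize.

    #include <pthread.h>
    #include <stdio.h>
    #include <time.h>

    #define NTHREADS 100

    static void *tiny_task(void *arg)
    {
        /* Far less work than the cost of creating the thread itself */
        *(int *)arg += 1;
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        int data[NTHREADS];
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < NTHREADS; i++) {
            data[i] = i;
            pthread_create(&tid[i], NULL, tiny_task, &data[i]);
        }
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(tid[i], NULL);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_nsec - t0.tv_nsec) / 1e3;
        printf("thread create/join overhead: %.2f us per thread\n", us / NTHREADS);
        return 0;
    }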
Parallel Programming Principles Coverage (Amdahl’s Law) Communication/Synchronization Granularity Load Balance Locality
Communication/Synchronization Only a few programs are “embarrassingly” parallel Programs have sequential parts and parallel parts Need to orchestrate parallel execution among processors Synchronize threads to make sure dependencies in the program are preserved Communicate results among threads to ensure a consistent view of data being processed
Communication/Synchronization Shared Memory Communication is implicit. One copy of data shared among many threads Atomicity, locking and synchronization essential for correctness Synchronization is typically in the form of a global barrier Distributed Memory Communication is explicit through messages Cores access local memory Data distribution and communication orchestration is essential for performance Synchronization is implicit in messages
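A minimal shared-memory sketch, again assuming POSIX threads (not named in the slides): one copy of the data is shared by every thread, a mutex provides the atomicity required for correctness, and a global barrier keeps all threads from proceeding until every update is done.

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4

    static long shared_sum = 0;      /* one copy of data, shared among many threads */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_barrier_t barrier;

    static void *worker(void *arg)
    {
        long id = (long)arg;

        /* Communication is implicit: all threads write the same variable,
           so locking is essential for correctness. */
        pthread_mutex_lock(&lock);
        shared_sum += id;
        pthread_mutex_unlock(&lock);

        /* Global barrier: nobody continues until all updates are visible. */
        pthread_barrier_wait(&barrier);

        if (id == 0)
            printf("sum seen after the barrier: %ld\n", shared_sum);
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];

        pthread_barrier_init(&barrier, NULL, NTHREADS);
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&tid[i], NULL, worker, (void *)i);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(tid[i], NULL);
        pthread_barrier_destroy(&barrier);
        return 0;
    }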
Parallel Programming Principles Coverage (Amdahl’s Law) Communication/Synchronization Granularity Load Balance Locality
Granularity Granularity is a qualitative measure of the ratio of computation to communication Computation stages are typically separated from periods of communication by synchronization events
Granularity Fine-grain Parallelism Low computation to communication ratio Small amounts of computational work between communication stages Less opportunity for performance enhancement High communication overhead Coarse-grain Parallelism High computation to communication ratio Large amounts of computational work between communication events More opportunity for performance increase Harder to load balance efficiently
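To make the ratio concrete, here is a hedged pthreads sketch (not from the slides) of the same reduction written both ways: the fine-grain version pays one lock/unlock per element, while the coarse-grain version does a large block of private computation and synchronizes only once per thread.

    #include <pthread.h>
    #include <stdio.h>

    #define N        (1 << 20)
    #define NTHREADS 4

    static int data[N];
    static long total = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    /* Fine grain: one synchronization (lock/unlock) per element of work. */
    static void *fine_grain(void *arg)
    {
        for (int i = (long)arg; i < N; i += NTHREADS) {
            pthread_mutex_lock(&lock);
            total += data[i];
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    /* Coarse grain: a large private computation, one synchronization at the end. */
    static void *coarse_grain(void *arg)
    {
        long local = 0;
        for (int i = (long)arg; i < N; i += NTHREADS)
            local += data[i];
        pthread_mutex_lock(&lock);
        total += local;
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    static long run(void *(*fn)(void *))
    {
        pthread_t tid[NTHREADS];
        total = 0;
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&tid[i], NULL, fn, (void *)i);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(tid[i], NULL);
        return total;
    }

    int main(void)
    {
        for (int i = 0; i < N; i++)
            data[i] = 1;
        printf("fine-grain   sum = %ld\n", run(fine_grain));
        printf("coarse-grain sum = %ld (same result, far fewer synchronizations)\n",
               run(coarse_grain));
        return 0;
    }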
Parallel Programming Principles Coverage (Amdahl’s Law) Communication/Synchronization Granularity Load Balance Locality
The Load Balancing Problem Processors that finish early have to wait for the processor with the largest amount of work to complete Leads to idle time, lowers utilization Particularly urgent with barrier synchronization With unbalanced workloads, the slowest core dictates overall execution time
Static Load Balancing Programmer makes decisions and assigns a fixed amount of work to each processing core a priori Works well for homogeneous multicores All cores are the same Each core has an equal amount of work Not so well for heterogeneous multicores Some cores may be faster than others Work distribution is uneven
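A hedged sketch of the static scheme under the stated assumptions (homogeneous cores, equal work per item); pthreads, the work() function and the divisibility of N by the number of threads are illustrative choices, not taken from the slides.

    #include <pthread.h>
    #include <stdio.h>

    #define N        16
    #define NTHREADS 4

    static int result[N];

    static int work(int i) { return i * i; }   /* placeholder for the real per-item job */

    /* Static load balancing: each core gets a fixed, contiguous block decided a priori. */
    static void *worker(void *arg)
    {
        long id    = (long)arg;
        int  chunk = N / NTHREADS;             /* assumes N is divisible by NTHREADS */
        for (int i = id * chunk; i < (id + 1) * chunk; i++)
            result[i] = work(i);
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&tid[i], NULL, worker, (void *)i);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(tid[i], NULL);
        for (int i = 0; i < N; i++)
            printf("%d ", result[i]);
        printf("\n");
        return 0;
    }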
Dynamic Load Balancing Workload is partitioned into small tasks. Available tasks for processing are pushed into a work-queue When one core finishes its allocated task, it takes on further work from the queue. The process continues until all tasks are assigned to some core for processing. Ideal for codes where work is uneven, and for heterogeneous multicores
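And a matching sketch of the dynamic scheme: here the work-queue is reduced to a shared next-task index protected by a mutex, and a core that finishes early simply pops more tasks. Task contents and counts are illustrative assumptions, not from the slides.

    #include <pthread.h>
    #include <stdio.h>

    #define NTASKS   32
    #define NTHREADS 4

    static int next_task = 0;                  /* index of the next task in the "queue" */
    static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;
    static int done_by[NTHREADS];              /* how many tasks each core processed */

    static void process(int task) { (void)task; /* placeholder for real work */ }

    static void *worker(void *arg)
    {
        long id = (long)arg;
        for (;;) {
            /* Pop the next available task; a core that finishes early takes more. */
            pthread_mutex_lock(&qlock);
            int task = next_task < NTASKS ? next_task++ : -1;
            pthread_mutex_unlock(&qlock);
            if (task < 0)
                break;
            process(task);
            done_by[id]++;
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&tid[i], NULL, worker, (void *)i);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(tid[i], NULL);
        for (int i = 0; i < NTHREADS; i++)
            printf("core %d processed %d tasks\n", i, done_by[i]);
        return 0;
    }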
Parallel Programming Principles Coverage (Amdahl’s Law) Communication/Synchronization Granularity Load Balance Locality
Memory Access Latency Uniform Memory Access (UMA) – Shared Memory Centrally located shared memory All processors are equidistant (access times) Non-Uniform Memory Access (NUMA) Shared memory – Processors have the same address space Data is directly accessible by all, but cost depends on the distance Placement of data affects performance Distributed memory – Processors have private address spaces Data access is local, but cost of messages depends on the distance Communication must be efficiently architected
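On a NUMA Linux machine one concrete way to control placement is libnuma, which is an assumption of this sketch (the slides do not name any API; link with -lnuma): the calling thread is pinned to a node and its data is allocated on that same node, so accesses stay local instead of crossing the interconnect.

    #include <numa.h>      /* libnuma; link with -lnuma */
    #include <stdio.h>

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "NUMA is not available on this machine\n");
            return 1;
        }

        int node = 0;                    /* place everything on node 0 as an example */
        size_t bytes = 1 << 20;

        numa_run_on_node(node);          /* run this thread on the CPUs of that node */
        double *local = numa_alloc_onnode(bytes, node);  /* keep its data on the same node */
        if (!local)
            return 1;

        /* Accesses now stay local; allocating on a remote node would add
           interconnect latency to every miss. */
        for (size_t i = 0; i < bytes / sizeof(double); i++)
            local[i] = (double)i;

        numa_free(local, bytes);
        return 0;
    }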
Locality of Memory Accesses (UMA Shared Memory) Parallel computation is serialized due to memory contention and lack of bandwidth
Locality of Memory Accesses (UMA Shared Memory) Distribute data to relieve contention and increase effective bandwidth
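One hedged sketch of what distributing the data can mean even inside a single shared memory: instead of all threads updating counters packed into the same cache line (which serializes them on memory and coherence traffic), each thread gets its own cache-line-sized slot. The 64-byte line size and pthreads are assumptions, not taken from the slides.

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4
    #define ITERS    10000000

    /* Each thread's counter is padded to its own cache line, so updates do not
       contend for the same memory location/line (64 bytes is a common line size,
       but this is an assumption). */
    struct padded { long count; char pad[64 - sizeof(long)]; };
    static struct padded counter[NTHREADS];

    static void *worker(void *arg)
    {
        long id = (long)arg;
        for (long i = 0; i < ITERS; i++)
            counter[id].count++;           /* private, uncontended location */
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        long total = 0;
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&tid[i], NULL, worker, (void *)i);
        for (int i = 0; i < NTHREADS; i++) {
            pthread_join(tid[i], NULL);
            total += counter[i].count;
        }
        printf("total = %ld\n", total);
        return 0;
    }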
Locality of Memory Accesses (NUMA Shared Memory) [Figure: each CPU has a private SPM, all connected through an interconnect to a larger shared memory] Once parallel tasks have been assigned to different processors, the physical placement of the data they touch (arrays A and B) can have a great impact on performance!
int main() {
  /* Task 1 */
  for (i = 0; i < n; i++)
    A[i][rand()] = foo ();
  /* Task 2 */
  for (j = 0; j < n; j++)
    B[j] = goo ();
}
Locality in Communication (Message Passing)
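The slide's figure is not reproduced here; as a hedged illustration of explicit, distance-aware communication, this MPI sketch (MPI is an assumption, not named on the slide) has each rank exchange a value only with its nearest neighbor in a ring, the kind of pattern that keeps messages local on the interconnect.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int right = (rank + 1) % size;          /* nearest neighbor in a ring */
        int left  = (rank - 1 + size) % size;
        int send  = rank, recv = -1;

        /* Communication is explicit: data moves only when a message is sent.
           Keeping partners adjacent keeps traffic local on the interconnect. */
        MPI_Sendrecv(&send, 1, MPI_INT, right, 0,
                     &recv, 1, MPI_INT, left,  0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        printf("rank %d received %d from its left neighbor\n", rank, recv);
        MPI_Finalize();
        return 0;
    }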