The Art of Parallel Processing
Ahmad Siavashi
April 2017
The Software Crisis
"As long as there were no machines, programming was no problem at all; when we had a few weak computers, programming became a mild problem, and now we have gigantic computers, programming has become an equally gigantic problem."
Edsger W. Dijkstra, 1972 Turing Award Lecture
1st Software Crisis (~ '60s and '70s)
Problem: assembly language programming; low abstraction, hardware dependent, ill-suited to complex programs
Solution: high-level languages such as C and FORTRAN
2nd Software Crisis (~ '80s and '90s)
Problem: handling multi-million-line codebases; maintainability and composability
Solution: C++, C#, Java, libraries, design patterns, software engineering methods and tools
Until Yesterday
Programmers were oblivious to processors
High-level languages abstracted away the system (e.g., Java bytecode is machine independent)
Performance was left to Moore's Law
Moore's Law (Gordon Moore)
1965: circuit complexity doubles every year
1975 (revised): circuit complexity doubles roughly every two years (often quoted as every 18 months)
The Free Lunch
Instead of improving the software, wait for the hardware to improve.
E.g., spending two years doubling a program's speed is a poor investment if technology improves 50% per year: in two years the hardware alone delivers 1.5^2 = 2.25x.
So the software investment is wrong, unless it also exploits the new technology.
But Today, the Free Lunch Is Over!
Limiting forces (the "brick wall"):
Power wall
Memory wall
Instruction-Level Parallelism (ILP) wall
Limit #1: Power Wall
$P_{\text{dynamic}} \propto C \cdot V_{dd}^2 \cdot f$
$P_{\text{static}} = I_{\text{leakage}} \cdot V_{dd}$
where $C$ = capacitance, $V_{dd}$ = supply voltage, $f$ = clock frequency
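As a rough worked example (assuming, as is common in dynamic voltage and frequency scaling, that the supply voltage can be lowered roughly in proportion to the clock frequency), scaling both $V_{dd}$ and $f$ down by 15% cuts dynamic power by almost 40% while giving up only 15% of single-core speed:

$P'_{\text{dynamic}} \propto C \cdot (0.85\,V_{dd})^2 \cdot (0.85\,f) = 0.85^3 \cdot C \cdot V_{dd}^2 \cdot f \approx 0.61 \cdot P_{\text{dynamic}}$

This trade-off is one reason vendors moved toward more, slower cores instead of ever-higher clock rates.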
Limit #2: Memory Wall
Each access to main memory costs hundreds of CPU cycles
Source: Sun World Wide Analyst Conference, Feb. 25, 2003
Limit #3: ILP Wall
Since about 1985, essentially all processors have used pipelining to overlap the execution of instructions.
This potential overlap among instructions is called Instruction-Level Parallelism (ILP), since the instructions can be evaluated in parallel.
Ordinary programs are written and executed sequentially; ILP lets the compiler and the processor overlap the execution of multiple instructions, or even change the order in which instructions are executed.
How much ILP exists in a program is highly application specific.
Limit #3: ILP Wall (Cont'd)
Example: a loop that adds two 1000-element arrays. Every iteration of the loop can overlap with any other iteration; ILP techniques exploit this by unrolling the loop, either statically by the compiler or dynamically by the hardware (see the sketch below).

for (int i = 0; i < 1000; i++) {
    C[i] = A[i] + B[i];
}
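A minimal sketch (not from the slides) of what static unrolling might produce; the function name and the unroll factor of 4 are illustrative:

/* Four independent additions per iteration: a superscalar processor can
   issue and execute them in parallel, since no iteration depends on another. */
void add_unrolled(const int *A, const int *B, int *C) {
    for (int i = 0; i < 1000; i += 4) {
        C[i]     = A[i]     + B[i];
        C[i + 1] = A[i + 1] + B[i + 1];
        C[i + 2] = A[i + 2] + B[i + 2];
        C[i + 3] = A[i + 3] + B[i + 3];
    }
}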
Limit #3: ILP Wall (Cont'd)
Superscalar designs were the state of the art: multiple instruction issue, dynamic scheduling, speculative execution, etc.
You may have heard of these, but you haven't needed to know about them to write software.
Unfortunately, these sources of performance have largely been used up.
Revolution Is Happening Now
Chip density continues to increase (~2x every 2 years), but clock speed does not.
The number of processor cores may double instead.
There is little or no hidden parallelism (ILP) left to be found.
Parallelism must be exposed to and managed by software.
Evolution of Microprocessors, 1971-2015
References:
Intel processors: Shekhar Borkar and Andrew A. Chien, "The Future of Microprocessors," Communications of the ACM, Vol. 54, No. 5, pp. 67-77. doi:10.1145/1941487.1941507
Oracle M7: Timothy Prickett Morgan, "Oracle Cranks Up The Cores To 32 With Sparc M7 Chip," EnterpriseTech - Systems Edition, August 13, 2014.
"If you were plowing a field, which would you rather use? Two strong oxen or 1024 chickens?"
Seymour Cray
Pictorial Depiction of Amdahl's Law
Unaffected fraction $(1 - F)$: unchanged. Affected fraction $F$: reduced to $F/S$.
$\text{Overall speedup} = \dfrac{\text{Execution time without enhancement}}{\text{Execution time with enhancement}} = \dfrac{1}{(1-F) + \dfrac{F}{S}}$
where $F$ = the fraction enhanced and $S$ = the speedup of the enhanced fraction
Amdahl's Law
$\text{Overall speedup} = \dfrac{1}{(1-F) + \dfrac{F}{S}}$, where $F$ = the fraction enhanced and $S$ = the speedup of the enhanced fraction
Example: overall speedup if we make 80% of a program run 20% faster: $F = 0.8$, $S = 1.2$, so $\dfrac{1}{(1-0.8) + \dfrac{0.8}{1.2}} \approx 1.15$
Gene Myron Amdahl
Amdahl's Law for Multicores
$\text{Overall speedup} = \dfrac{1}{(1-F) + \dfrac{F}{N}}$, where $F$ = the parallelizable fraction and $N$ = the number of cores
Hence, running a program on a dual core does not mean getting twice the performance.
Parallelization only pays off when the parallel fraction $F$ clearly outweighs the serial fraction $(1-F)$ plus the parallelization overhead.
Gene Myron Amdahl
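To see why the serial fraction dominates, consider the limiting behaviour of the formula above (the value $F = 0.9$ below is purely illustrative):

$\lim_{N \to \infty} \dfrac{1}{(1-F) + \dfrac{F}{N}} = \dfrac{1}{1-F}, \qquad \text{e.g. } F = 0.9 \;\Rightarrow\; \text{speedup} \le \dfrac{1}{0.1} = 10 \text{ even with unlimited cores.}$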
Myths and Realities
Myth: 2 x 3 GHz is better than a single 6 GHz core. Not necessarily:
6 GHz is better for single-threaded apps
6 GHz may even be better for multi-threaded apps because of parallelization overheads (discussed later)
So why might a dual-core processor still run the same app faster? The OS can run on the other core, dedicating an otherwise idle core to the app.
3rd Software Crisis (Present)
Sequential performance is being left behind by Moore's Law, yet we still need continuous, reasonable performance improvements to support new features and larger datasets.
We need these improvements while sustaining portability and maintainability, without unduly increasing the complexity faced by the programmer; this is critical to keeping up with the current rate of evolution in software.
Solution? Concurrency is the next major revolution in how we write software, but there is no perfect solution or implementation yet.
There is little or no hidden parallelism (ILP) left to be found; parallelism must be exposed to and managed by software.
The vast majority of programmers today don't grok concurrency, just as the vast majority of programmers 15 years ago didn't yet grok objects.
Concurrency vs. Parallelism
Concurrency: multiple tasks are in progress at once and may be interleaved on a single processor. Parallelism: multiple tasks literally execute at the same time on different processing units.
Classes of Parallelism and Parallel Architectures
Data parallelism: the same operation applied across many data elements
Task parallelism: different tasks (functions) executed concurrently
Race Condition
A race condition occurs when the output depends on the sequence or timing of accesses to a shared resource.

// Shared variable
int sum = 0;

// t threads each run the loop below; sum++ is a read-modify-write,
// so concurrent increments can be lost
for (int i = 0; i < N; i++)
    sum++;

A sketch of the race and one common fix follows.
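A minimal, self-contained sketch (not from the slides; the thread and iteration counts are arbitrary) demonstrating the lost-update race with POSIX threads and one common fix using a mutex. Compile with a C compiler and -pthread:

#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4
#define N 1000000

static long sum = 0;                        /* shared variable */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *racy(void *arg) {              /* lost updates are possible */
    (void)arg;
    for (int i = 0; i < N; i++)
        sum++;                              /* unsynchronized read-modify-write */
    return NULL;
}

static void *safe(void *arg) {              /* updates are serialized */
    (void)arg;
    for (int i = 0; i < N; i++) {
        pthread_mutex_lock(&lock);
        sum++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

static void run(void *(*worker)(void *), const char *label) {
    pthread_t tid[NUM_THREADS];
    sum = 0;
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_create(&tid[i], NULL, worker, NULL);
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(tid[i], NULL);
    printf("%s: sum = %ld (expected %ld)\n", label, sum, (long)NUM_THREADS * N);
}

int main(void) {
    run(racy, "racy");   /* typically prints less than the expected value */
    run(safe, "mutex");  /* always prints the expected value              */
    return 0;
}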
Principles of Parallel Programming
Finding enough parallelism (Amdahl's law)
Granularity
Locality
Load balance
Coordination and synchronization
Performance modeling
All of these make parallel programming even harder than sequential programming.
Flynn's Classification
Classifies architectures by the number of concurrent instruction streams and data streams: SISD, SIMD, MISD, and MIMD.
SISD: Single Instruction, Single Data
Uniprocessors
Implicit parallelism: pipelining, hyperthreading, speculative execution, dynamic scheduling (scoreboard algorithm, Tomasulo's algorithm), etc.
Intel Pentium-4 Hyperthreading
MIMD: Multiple Instruction, Multiple Data
A type of parallel system
Every processor may execute a different instruction stream and work with a different data stream
Parallel Computer Memory Architectures
Shared memory: a single address space
Non-Uniform Memory Access (NUMA)
Uniform Memory Access (UMA)
AMD Orochi Die Floorplan (based on AMD's Bulldozer microarchitecture)
OpenMP (Open Multi-Processing)
An Application Program Interface (API) used to explicitly direct multi-threaded, shared-memory parallelism
The API is specified for C/C++ and Fortran
Available on most major platforms: Unix/Linux and Windows
OpenMP (Cont'd)
Fork-join model: a master thread runs sequentially until it reaches a parallel region, forks a team of threads to execute the region, and joins them back into a single thread at the end.
OpenMP (Cont'd)
Example: the iterations of the loop below are divided among 4 threads (a fuller sketch follows).

#pragma omp parallel for num_threads(4)
for (int i = 0; i < 40; i++)
    A[i] = i;
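A minimal, self-contained sketch (not from the slides; N and the thread count are arbitrary) combining a parallel loop with a reduction, OpenMP's idiomatic way of avoiding the sum++ race shown earlier. Compile with, e.g., gcc -fopenmp:

#include <stdio.h>

#define N 1000

int main(void) {
    int A[N];
    long sum = 0;

    /* Iterations are divided among 4 threads; each iteration is independent. */
    #pragma omp parallel for num_threads(4)
    for (int i = 0; i < N; i++)
        A[i] = i;

    /* reduction(+:sum) gives each thread a private copy of sum and combines
       the copies at the end, so there is no race on the shared variable.    */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += A[i];

    printf("sum = %ld (expected %ld)\n", sum, (long)(N - 1) * N / 2);
    return 0;
}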
Parallel Computer Memory Architectures (Cont'd)
Distributed memory: no global address space
Computer Cluster
MPI (Message Passing Interface)
Addresses the message-passing parallel programming model: data is moved from the address space of one process to that of another
Originally designed for distributed-memory architectures
MPI (Cont'd)
Today, MPI runs on virtually any hardware platform: distributed memory, shared memory, or hybrid
All parallelism is explicit: the programmer is responsible for correctly identifying and expressing it
MPI (Cont'd)
Communication model: processes exchange data through explicit point-to-point operations (send and receive) and collective operations (e.g., broadcast, reduce).
MPI (Cont'd)
Structure: a typical MPI program initializes the library with MPI_Init, queries its rank and the number of processes with MPI_Comm_rank and MPI_Comm_size, exchanges messages, and shuts down with MPI_Finalize (a minimal example follows).
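A minimal, self-contained sketch (not from the slides; the value sent and the process count are illustrative). Compile with mpicc and run with, e.g., mpirun -np 2:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank, size;

    MPI_Init(&argc, &argv);                    /* enter the MPI environment */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);      /* this process's id         */
    MPI_Comm_size(MPI_COMM_WORLD, &size);      /* total number of processes */

    printf("Hello from rank %d of %d\n", rank, size);

    if (size >= 2) {
        int value;
        if (rank == 1) {
            value = 42;                        /* illustrative payload      */
            MPI_Send(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        } else if (rank == 0) {
            MPI_Recv(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("Rank 0 received %d from rank 1\n", value);
        }
    }

    MPI_Finalize();                            /* leave the MPI environment */
    return 0;
}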
SIMD: Single Instruction, Multiple Data
A type of parallel computer
All processing units execute the same instruction at any given clock cycle
Each processing unit can operate on a different data element
(A CPU-side vector sketch follows.)
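A minimal sketch of SIMD on a CPU (not from the slides; assumes an x86 processor with SSE): a single _mm_add_ps instruction adds four floats at once.

#include <stdio.h>
#include <xmmintrin.h>

int main(void) {
    float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
    float c[4];

    __m128 va = _mm_loadu_ps(a);       /* load 4 floats into one 128-bit register */
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_add_ps(va, vb);    /* one instruction, four additions          */
    _mm_storeu_ps(c, vc);

    for (int i = 0; i < 4; i++)
        printf("%.1f ", c[i]);         /* prints: 11.0 22.0 33.0 44.0              */
    printf("\n");
    return 0;
}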
GPU
Stands for Graphics Processing Unit
Typical integration scheme: a discrete card plugged into the motherboard (e.g., NVIDIA GeForce GTX 1080)
NVIDIA CUDA
NVIDIA Fermi
NVIDIA Pascal
How Does It Work?
1. Copy input data from PC (host) memory to GPU memory
2. The CPU calls the computation (launches it on the GPU)
3. The GPU produces the result in GPU memory
4. Copy the result back from GPU memory to PC memory
(A CUDA sketch of these steps follows.)
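A minimal CUDA C sketch of the four steps (not from the slides; the array size and the add_one kernel are illustrative, and error checking is omitted). Compile with nvcc:

#include <stdio.h>
#include <cuda_runtime.h>

#define N 1024

/* Illustrative kernel: each GPU thread increments one element. */
__global__ void add_one(int *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
        data[i] += 1;
}

int main(void) {
    int h_data[N], *d_data;
    for (int i = 0; i < N; i++) h_data[i] = i;

    cudaMalloc((void **)&d_data, N * sizeof(int));

    /* 1. Copy input data from host (PC) memory to GPU memory */
    cudaMemcpy(d_data, h_data, N * sizeof(int), cudaMemcpyHostToDevice);

    /* 2. The CPU calls the computation: launch the kernel on the GPU */
    add_one<<<(N + 255) / 256, 256>>>(d_data);

    /* 3. Wait for the GPU to finish producing the result */
    cudaDeviceSynchronize();

    /* 4. Copy the result from GPU memory back to host memory */
    cudaMemcpy(h_data, d_data, N * sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(d_data);
    printf("h_data[0] = %d, h_data[%d] = %d\n", h_data[0], N - 1, h_data[N - 1]);
    return 0;
}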
OpenCL
An open standard for parallel programming across heterogeneous platforms, supported by vendors such as NVIDIA, AMD, Intel, Altera, etc.
Thank You
Questions about parallel processing? Ask Ahmad Siavashi.