1
The Art of Parallel Processing
The Art of Parallel Processing. Ahmad Siavashi, April 2017
2
The Software Crisis
"As long as there were no machines, programming was no problem at all; when we had a few weak computers, programming became a mild problem, and now we have gigantic computers, programming has become an equally gigantic problem."
Edsger W. Dijkstra, 1972 Turing Award Lecture
3
1st Software Crisis (~ ’60s & ‘70s)
Assembly language programming: low abstraction, hardware dependent, inadequate for complex programs.
The answer: higher-level languages such as C, FORTRAN, …
4
2nd Software Crisis (~ ’80s & ‘90s)
The challenge: handling multi-million-line code bases while preserving maintainability and composability.
The answer: C++, C#, Java, libraries, design patterns, software engineering methods, and tools.
5
Until Yesterday
Programmers were oblivious to processors: high-level languages abstracted away the system (e.g., Java bytecode is machine independent). Performance was left to Moore's Law.
6
Moore's Law (Gordon Moore)
1965: circuit complexity doubles every year.
1975 (revised): circuit complexity doubles every 1.5 years.
7
The Free Lunch
Instead of improving the software, wait for the hardware to improve. E.g., doubling the speed of a program by optimizing it may take 2 years; if technology improves 50% per year, the hardware alone improves by 1.5² = 2.25× in those 2 years. So the investment is wrong, unless the software also employs the new technology.
8
But Today, The Free Lunch Is Over!
There were limiting forces (the brick wall): the power wall, the memory wall, and the instruction-level parallelism (ILP) wall.
9
Limit #1: Power Wall
Power_dynamic ∝ C · V_dd² · f, where C = capacitance, V_dd = supply voltage, and f = frequency.
Power_static = I_leakage · V_dd
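As an illustrative calculation (not from the original slide): since dynamic power scales with V_dd² · f, lowering both the supply voltage and the frequency by 15% cuts dynamic power to roughly 0.85² × 0.85 ≈ 0.61 of its original value, a reduction of almost 40%. Once V_dd could no longer be lowered safely, raising f further produced more heat than chips could dissipate.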
10
Limit #2: Memory Wall
Each access to main memory costs hundreds of CPU cycles. (Source: Sun World Wide Analyst Conference, Feb. 25, 2003)
11
Limit #3: ILP Wall
Since 1985, all processors have used pipelining to overlap the execution of instructions. This potential overlap among instructions is called instruction-level parallelism (ILP), since the instructions can be evaluated in parallel. Ordinary programs are written and executed sequentially; ILP allows the compiler and the processor to overlap the execution of multiple instructions, or even to change the order in which instructions are executed. How much ILP exists in a program is very application specific.
12
Limit #3: ILP Wall (Cont’d)
Example: a loop that adds two 1000-element arrays. Every iteration of the loop can overlap with any other iteration. Such techniques work by unrolling the loop, either statically by the compiler or dynamically by the hardware.

    for (int i = 0; i < 1000; i++) {
        C[i] = A[i] + B[i];
    }
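As a rough sketch (not from the original slides) of what unrolling by a factor of 4 looks like, whether done by the compiler or mirrored by the hardware: the four additions in each unrolled iteration are independent and can be issued in parallel.

    /* assumes the trip count (1000) is a multiple of 4 */
    for (int i = 0; i < 1000; i += 4) {
        C[i]     = A[i]     + B[i];
        C[i + 1] = A[i + 1] + B[i + 1];
        C[i + 2] = A[i + 2] + B[i + 2];
        C[i + 3] = A[i + 3] + B[i + 3];
    }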
13
Limit #3: ILP Wall (Cont’d)
Superscalar designs were the state of the art: multiple instruction issue, dynamic scheduling, speculative execution, etc. You may have heard of these, but you haven't needed to know about them to write software! Unfortunately, these sources of performance have been used up.
14
Revolution is Happening Now
Chip density is continuing to increase (~2× every 2 years), but clock speed is not; the number of processor cores may double instead. There is little or no hidden parallelism (ILP) to be found: parallelism must be exposed to and managed by software.
15
Evolution of Microprocessors 1971-2015
Ref: Intel processors: Shekhar Borkar and Andrew A. Chien, "The Future of Microprocessors," Communications of the ACM, Vol. 54, No. 5. Oracle M7: Timothy Prickett Morgan, "Oracle Cranks Up The Cores To 32 With Sparc M7 Chip," Enterprise Tech - Systems Edition, August 13, 2014.
16
"If you were plowing a field, which would you rather use? Two strong oxen or 1024 chickens?" - Seymour Cray
17
Pictorial Depiction of Amdahl’s Law
Before the enhancement, the execution time consists of an unaffected fraction (1 − F) and an affected fraction F; after the enhancement, the unaffected part is unchanged and the affected part shrinks to F/S.

Overall speedup = (execution time without enhancement) / (execution time with enhancement) = 1 / ((1 − F) + F/S)

F = the fraction enhanced; S = the speedup of the enhanced fraction.
18
Amdahl's Law (Gene Myron Amdahl)
Overall speedup = 1 / ((1 − F) + F/S), where F = the fraction enhanced and S = the speedup of the enhanced fraction.
Example: the overall speedup if we make 80% of a program run 20% faster. F = 0.8, S = 1.2, so speedup = 1 / ((1 − 0.8) + 0.8/1.2) ≈ 1.15.
19
Amdahl’s Law for Multicores
Overall speedup = 1 / ((1 − F) + F/N), where F = the parallelizable fraction and N = the number of cores.
Hence, running a program on a dual core doesn't mean getting two times the performance. There is a net speedup (> 1) only when the gain on the parallel fraction F outweighs the serial fraction (1 − F) plus the parallelization overhead.
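As a minimal sketch (not from the slides; the helper name and the choice of F = 0.8 are illustrative), the program below evaluates the formula and shows how quickly the serial fraction dominates:

    #include <cstdio>

    // Amdahl's law for N cores: speedup = 1 / ((1 - F) + F / N),
    // where F is the parallelizable fraction of the program.
    double amdahl(double F, double N) {
        return 1.0 / ((1.0 - F) + F / N);
    }

    int main() {
        // With F = 0.8, two cores give ~1.67x (not 2x), and even an
        // unlimited number of cores cannot exceed 1 / (1 - F) = 5x.
        std::printf("2 cores:    %.2fx\n", amdahl(0.8, 2));
        std::printf("16 cores:   %.2fx\n", amdahl(0.8, 16));
        std::printf("1024 cores: %.2fx\n", amdahl(0.8, 1024));
        return 0;
    }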
20
Myths and Realities
Myth: 2 × 3 GHz is better than 6 GHz. Wrong: 6 GHz is better for single-threaded apps, and 6 GHz may even be better for multi-threaded apps because of the overheads involved (discussed later). So why might a dual-core processor run the same app faster? The OS can run on the other core, dedicating an idle core to the app.
21
3rd Software Crisis (Present)
Solution? Concurrency is the next major revolution in how we write software, and there is no perfect solution or implementation yet. There is little or no hidden parallelism (ILP) to be found; parallelism must be exposed to and managed by software. The vast majority of programmers today don't grok concurrency, just as the vast majority of programmers 15 years ago didn't yet grok objects. Sequential performance has been left behind by Moore's law. We still need continuous and reasonable performance improvements to support new features and larger datasets, and we need them while sustaining portability and maintainability, without unduly increasing the complexity faced by the programmer. This is critical to keeping up with the current rate of evolution in software.
22
Concurrency vs. Parallelism
23
Classes of Parallelism and Parallel Architectures
Data Parallelism
Task Parallelism
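A minimal sketch (not from the slides; it uses OpenMP, which is introduced later in the deck, and both function names are illustrative) contrasting the two classes: in data parallelism every thread applies the same operation to a different piece of the data, while in task parallelism different threads run different computations at the same time.

    #include <omp.h>
    #include <cstdio>

    // Data parallelism: the same operation on different chunks of the arrays.
    void data_parallel(const float* a, const float* b, float* c, int n) {
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }

    // Task parallelism: two independent tasks run concurrently.
    void task_parallel() {
        #pragma omp parallel sections
        {
            #pragma omp section
            { std::printf("task A on thread %d\n", omp_get_thread_num()); }

            #pragma omp section
            { std::printf("task B on thread %d\n", omp_get_thread_num()); }
        }
    }

    int main() {
        float a[8] = {0}, b[8] = {0}, c[8] = {0};
        data_parallel(a, b, c, 8);
        task_parallel();
        return 0;
    }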
24
Race Condition

    // Shared variable
    int sum = 0;

    // t threads running the code below
    for (int i = 0; i < N; i++)
        sum++;

A race condition exists where the output depends on the sequence in which a shared resource is accessed; here, concurrent unsynchronized increments of sum make the final value unpredictable.
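A minimal sketch (not from the slides) that both exhibits the race with a plain int and removes it with std::atomic; the number of lost updates varies from run to run.

    #include <atomic>
    #include <thread>
    #include <vector>
    #include <cstdio>

    int main() {
        const int N = 100000, T = 4;

        int racy_sum = 0;                // unsynchronized shared variable
        std::atomic<int> safe_sum{0};    // atomic read-modify-write

        std::vector<std::thread> threads;
        for (int t = 0; t < T; ++t)
            threads.emplace_back([&] {
                for (int i = 0; i < N; ++i) {
                    ++racy_sum;   // data race: increments can be lost
                    ++safe_sum;   // atomic: always counted exactly once
                }
            });
        for (auto& th : threads) th.join();

        std::printf("racy: %d  safe: %d  expected: %d\n",
                    racy_sum, safe_sum.load(), N * T);
        return 0;
    }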
25
Principles of Parallel Programming
Finding enough parallelism (Amdahl's law), granularity, locality, load balance, coordination and synchronization, and performance modeling: all of these things make parallel programming even harder than sequential programming.
26
Flynn’s Classification
27
SISD: Single Instruction, Single Data (uniprocessors).
Implicit parallelism: pipelining, hyperthreading, speculative execution, dynamic scheduling (the scoreboard algorithm, Tomasulo's algorithm), etc.
28
Intel Pentium-4 Hyperthreading
29
MIMD: Multiple Instruction, Multiple Data. A type of parallel system in which every processor may be executing a different instruction stream while working with a different data stream.
30
Parallel Computer Memory Architectures
Shared memory: a single address space. Two variants: Uniform Memory Access (UMA) and Non-Uniform Memory Access (NUMA).
31
AMD Orochi Die Floorplan
Based on AMD's Bulldozer Microarchitecture
32
OpenMP (Open Multi-Processing): an Application Program Interface (API) used to explicitly direct multi-threaded, shared-memory parallelism. The API is specified for C/C++ and Fortran and is available on most major platforms (Unix/Linux, Windows).
33
OpenMP (Cont'd): the Fork–Join Model
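A minimal sketch of the fork-join model (not from the slides; 4 threads are requested for illustration, and the output order is nondeterministic): the master thread forks a team at the parallel directive and joins it at the end of the block.

    #include <omp.h>
    #include <cstdio>

    int main() {
        std::printf("master thread, before the fork\n");

        #pragma omp parallel num_threads(4)   // fork: a team of threads starts here
        {
            std::printf("hello from thread %d of %d\n",
                        omp_get_thread_num(), omp_get_num_threads());
        }                                     // join: the team synchronizes here

        std::printf("master thread, after the join\n");
        return 0;
    }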
34
OpenMP (Cont'd) Example:

    #pragma omp parallel for num_threads(4)
    for (int i = 0; i < 40; i++)
        A[i] = i;
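The directive divides the 40 iterations among the 4 threads; each thread writes a disjoint subset of A's elements, so no extra synchronization is needed. With GCC, for instance, OpenMP code like this is built by adding the -fopenmp flag.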
35
Parallel Computer Memory Architectures (Cont’d)
Distributed memory: no global address space.
36
Computer Cluster
37
MPI (Message Passing Interface): addresses the message-passing parallel programming model, in which data is moved from the address space of one process to that of another process. Originally designed for distributed-memory architectures.
38
MPI (Cont'd)
Today, MPI runs on virtually any hardware platform: distributed memory, shared memory, or hybrid. All parallelism is explicit; the programmer is responsible for correctly identifying parallelism.
39
MPI (Cont'd) Communication Model
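A minimal point-to-point sketch of this communication model (not from the slides): rank 0 moves an integer from its own address space into rank 1's, using matching send and receive calls.

    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            int value = 42;
            // Explicitly move data from rank 0's address space to rank 1's.
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            int value = 0;
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            std::printf("rank 1 received %d from rank 0\n", value);
        }

        MPI_Finalize();
        return 0;
    }

Such a program is typically compiled with the mpicc/mpicxx wrapper and launched with something like mpirun -np 2.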
40
MPI (Cont'd) Structure
41
SIMD: Single Instruction, Multiple Data. A type of parallel computer in which all processing units execute the same instruction at any given clock cycle, while each processing unit can operate on a different data element.
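A minimal sketch (not from the slides; it assumes an x86 CPU with SSE, and the function name is illustrative): one _mm_add_ps instruction adds four float lanes at once, so the loop processes four elements per iteration.

    #include <immintrin.h>

    // Adds two float arrays four elements at a time with SSE.
    // Assumes n is a multiple of 4; a real version would handle the tail.
    void add_simd(const float* a, const float* b, float* c, int n) {
        for (int i = 0; i < n; i += 4) {
            __m128 va = _mm_loadu_ps(&a[i]);   // load 4 floats from a
            __m128 vb = _mm_loadu_ps(&b[i]);   // load 4 floats from b
            __m128 vc = _mm_add_ps(va, vb);    // add all 4 lanes in one instruction
            _mm_storeu_ps(&c[i], vc);          // store 4 results to c
        }
    }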
42
GPU: stands for "Graphics Processing Unit". Integration scheme: a card on the motherboard (e.g., the NVIDIA GeForce GTX 1080).
43
NVIDIA CUDA
44
NVIDIA Fermi
45
NVIDIA Pascal
46
How Does It Work?
1. Copy the input data from PC (host) memory to GPU memory.
2. The CPU calls the computation on the GPU.
3. The GPU produces the result in GPU memory.
4. Copy the result back to PC memory.
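A minimal CUDA C++ sketch of this four-step flow (an illustration, not taken from the slides; the kernel name and sizes are arbitrary): the host allocates GPU memory, copies the inputs over, launches a kernel, and copies the result back.

    #include <cuda_runtime.h>
    #include <cstdio>

    // Kernel executed by many GPU threads, one array element per thread.
    __global__ void add(const float* a, const float* b, float* c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];
    }

    int main() {
        const int n = 1 << 20;
        const size_t bytes = n * sizeof(float);

        float *ha = new float[n], *hb = new float[n], *hc = new float[n];
        for (int i = 0; i < n; i++) { ha[i] = float(i); hb[i] = 2.0f * i; }

        float *da, *db, *dc;
        cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);

        // 1. Copy input data from PC (host) memory to GPU memory.
        cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

        // 2. + 3. The CPU launches the computation; the GPU produces the result.
        add<<<(n + 255) / 256, 256>>>(da, db, dc, n);
        cudaDeviceSynchronize();

        // 4. Copy the result back to PC memory.
        cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);

        std::printf("hc[100] = %.1f\n", hc[100]);   // expect 300.0

        cudaFree(da); cudaFree(db); cudaFree(dc);
        delete[] ha; delete[] hb; delete[] hc;
        return 0;
    }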
47
OpenCL: supported by NVIDIA, AMD, Intel, Altera, etc.
48
Thank you. You are asking 'Ahmad Siavashi' about "Parallel Processing".