The Art of Parallel Processing


1 The Art of Parallel Processing
Ahmad Siavashi, April 2017

2 The Software Crisis
"As long as there were no machines, programming was no problem at all; when we had a few weak computers, programming became a mild problem, and now we have gigantic computers, programming has become an equally gigantic problem." Edsger W. Dijkstra, 1972 Turing Award Lecture

3 1st Software Crisis (~ ’60s & ‘70s)
Problem: programming complex programs in assembly language, with low abstraction and hardware dependence. Solution: high-level languages such as C, FORTRAN, …

4 2nd Software Crisis (~ ’80s & ‘90s)
Problem: handling multi-million lines of code; goals: maintainability and composability. Solution: C++, C#, Java, libraries, design patterns, software engineering methods, tools.

5 Until Yesterday
Programmers were oblivious to processors; high-level languages abstracted away the system (e.g., Java bytecode is machine independent). Performance was left to Moore's Law.

6 Moore's Law
Gordon Moore, 1965: circuit complexity doubles every year. Revised in 1975: circuit complexity doubles every 1.5 years.

7 The Free Lunch
Instead of improving the software, wait for the hardware to improve. E.g., doubling the speed of a program may take 2 years of work; if technology improves 50% per year, hardware alone delivers 1.5^2 = 2.25x in those 2 years. So the investment is wrong, unless it also employs the new technology.

8 But Today, The Free Lunch Is Over!
There were limiting forces (the "brick wall"): the power wall, the memory wall, and the instruction-level parallelism (ILP) wall.

9 Limit #1: Power Wall
Power_dynamic ∝ C · V_dd^2 · f, where C = capacitance, V_dd = supply voltage, and f = frequency.
Power_static = I_leakage · V_dd.
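A quick worked consequence (the numbers are mine, for illustration only): because dynamic power scales with V_dd^2 · f, cutting both supply voltage and frequency by 30% gives 0.7^2 × 0.7 ≈ 0.34, i.e., roughly a 3x reduction in dynamic power for a 30% drop in clock speed. This is why several slower cores can be more power-efficient than one fast core.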

10 Limit #2: Memory Wall
Each memory access requires hundreds of CPU cycles. Source: Sun World Wide Analyst Conference, Feb. 25, 2003.

11 Limit #3: ILP Wall
Since 1985, all processors have used pipelining to overlap the execution of instructions. This potential overlap among instructions is called instruction-level parallelism (ILP), since the instructions can be evaluated in parallel. Ordinary programs are written and executed sequentially; ILP allows the compiler and the processor to overlap the execution of multiple instructions, or even to change the order in which instructions are executed. How much ILP exists in a program is very application specific.

12 Limit #3: ILP Wall (Cont’d)
Example: a loop that adds two 1000-element arrays. Every iteration of the loop can overlap with any other iteration. Such techniques work by unrolling the loop, either statically by the compiler or dynamically by the hardware.
for (int i = 0; i < 1000; i++) { C[i] = A[i] + B[i]; }
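As a sketch of what static unrolling looks like (hand-written for illustration; real compiler output varies), the same loop unrolled by a factor of four exposes four independent additions per iteration:

    for (int i = 0; i < 1000; i += 4) {
        // No statement depends on another, so a superscalar
        // core can issue the four additions in parallel.
        C[i]     = A[i]     + B[i];
        C[i + 1] = A[i + 1] + B[i + 1];
        C[i + 2] = A[i + 2] + B[i + 2];
        C[i + 3] = A[i + 3] + B[i + 3];
    }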

13 Limit #3: ILP Wall (Cont’d)
Superscalar designs were the state of the art: multiple instruction issue, dynamic scheduling, speculative execution, etc. You may have heard of these, but you haven't needed to know about them to write software! Unfortunately, these sources of speedup have been used up.

14 Revolution is Happening Now
Chip density is continuing to increase ~2x every 2 years, but clock speed is not. The number of processor cores may double instead. There is little or no hidden parallelism (ILP) left to be found; parallelism must be exposed to and managed by software.

15 Evolution of Microprocessors 1971-2015
Ref: Intel processors: Shekhar Borkar and Andrew A. Chien, "The Future of Microprocessors," Communications of the ACM, Vol. 54, No. 5. Oracle M7: Timothy Prickett Morgan, "Oracle Cranks Up The Cores To 32 With Sparc M7 Chip," Enterprise Tech - Systems Edition, August 13, 2014.

16
"If you were plowing a field, which would you rather use? Two strong oxen or 1024 chickens?" - Seymour Cray

17 Pictorial Depiction of Amdahl’s Law
The affected fraction F shrinks to F/S; the unaffected fraction (1 - F) is unchanged.
Overall speedup = (execution time without enhancement) / (execution time with enhancement) = 1 / ((1 - F) + F/S)
F = the fraction enhanced; S = the speedup of the enhanced fraction.

18 Amdahl's Law
Overall speedup = 1 / ((1 - F) + F/S), where F = the fraction enhanced and S = the speedup of the enhanced fraction.
Example: the overall speedup if we make 80% of a program run 20% faster. F = 0.8, S = 1.2, so speedup = 1 / ((1 - 0.8) + 0.8/1.2) ≈ 1.153.
Gene Myron Amdahl

19 Amdahl’s Law for Multicores
Overall speedup = 1 / ((1 - F) + F/N), where F = the parallelizable fraction and N = the number of cores.
Hence, running a program on a dual core doesn't mean getting two times the performance.
The parallel fraction must dominate: unless F far exceeds (1 - F) plus the parallelization overhead, little speedup is gained.
Gene Myron Amdahl
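A minimal sketch in C that tabulates the formula (the function name and the sample fractions are mine, for illustration):

    #include <stdio.h>

    /* Amdahl's law: overall speedup when a fraction f of the
     * program is parallelized across n cores. */
    double amdahl_speedup(double f, int n) {
        return 1.0 / ((1.0 - f) + f / n);
    }

    int main(void) {
        double fractions[] = { 0.50, 0.90, 0.99 };
        for (int i = 0; i < 3; i++)
            for (int n = 2; n <= 64; n *= 2)
                printf("F=%.2f N=%2d speedup=%5.2f\n",
                       fractions[i], n, amdahl_speedup(fractions[i], n));
        return 0;
    }

Even with F = 0.90, 64 cores yield under 9x, which is exactly the slide's point.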

20 Myths and Realities
Is 2 x 3 GHz better than 6 GHz? Wrong: 6 GHz is better for single-threaded apps, and may even be better for multi-threaded apps because of the overheads involved (discussed later). So why might a dual-core processor still run the same app faster? The OS can run on one core, dedicating the other, idle core to the app.

21 3rd Software Crisis (Present)
Sequential performance is left behind by Moore's Law, yet we need continuous and reasonable performance improvements to support new features and larger datasets, while sustaining portability and maintainability without unduly increasing the complexity faced by the programmer. This is critical to keeping up with the current rate of evolution in software.
Solution? Concurrency is the next major revolution in how we write software, though no perfect solution or implementation exists yet. There is little or no hidden parallelism (ILP) left to be found; parallelism must be exposed to and managed by software. The vast majority of programmers today don't grok concurrency, just as the vast majority of programmers 15 years ago didn't yet grok objects.

22 Concurrency Vs Parallelism

23 Classes of Parallelism and Parallel Architectures
Data parallelism; task parallelism.

24 Race Condition
A race condition: the output depends on the sequence in which a shared resource is accessed.
// Shared variable
int sum = 0;
// t threads running the code below
for (int i = 0; i < N; i++)
    sum++;
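A runnable sketch of this race using OpenMP, which appears later in the deck (the demo code is mine, not from the slides): four threads increment the shared counter, and without synchronization the final value usually falls short.

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        const int N = 1000000;
        int sum = 0;

        // Four threads all execute sum++ on the same variable;
        // the read-modify-write sequences interleave and updates get lost.
        #pragma omp parallel for num_threads(4)
        for (int i = 0; i < N; i++)
            sum++;

        // Typically prints a value well below 1000000.
        printf("sum = %d (expected %d)\n", sum, N);

        // One standard fix: #pragma omp parallel for reduction(+:sum)
        return 0;
    }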

25 Principles of Parallel Programming
Finding enough parallelism (Amdahl's law), granularity, locality, load balance, coordination and synchronization, and performance modeling. All of these things make parallel programming even harder than sequential programming.

26 Flynn’s Classification
Flynn's taxonomy classifies architectures by their instruction and data streams: SISD, SIMD, MISD, and MIMD.

27 SISD
Single instruction, single data: uniprocessors. Implicit parallelism: pipelining, hyperthreading, speculative execution, dynamic scheduling (the scoreboard and Tomasulo's algorithms), etc.

28 Intel Pentium-4 Hyperthreading

29 MIMD
Multiple instruction, multiple data: a type of parallel system in which every processor may be executing a different instruction stream while working with a different data stream.

30 Parallel Computer Memory Architectures
Shared memory: a single address space, with uniform memory access (UMA) or non-uniform memory access (NUMA).

31 AMD Orochi Die Floorplan
Based on AMD's Bulldozer microarchitecture.

32 OpenMP
Open Multi-Processing: an application program interface (API) used to explicitly direct multi-threaded, shared-memory parallelism. The API is specified for C/C++ and Fortran and runs on most major platforms, including Unix/Linux and Windows.

33 OpenMP (Cont’d)
Fork-join model: a master thread forks a team of threads for each parallel region and joins them when the region ends.

34 OpenMP (Cont’d)
Example:
#pragma omp parallel for num_threads(4)
for (int i = 0; i < 40; i++)
    A[i] = i;
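Padded out to a complete program (the surrounding main and print loop are mine), this compiles with gcc -fopenmp; the 40 iterations are split across the 4 threads, typically 10 consecutive iterations each under a static schedule:

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        int A[40];

        // Fork: 4 threads each execute a share of the iterations.
        #pragma omp parallel for num_threads(4)
        for (int i = 0; i < 40; i++)
            A[i] = i;

        // Join: only the master thread continues past the loop.
        for (int i = 0; i < 40; i++)
            printf("%d ", A[i]);
        printf("\n");
        return 0;
    }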

35 Parallel Computer Memory Architectures (Cont’d)
Distributed memory: no global address space.

36 Computer Cluster

37 MPI
Message Passing Interface: addresses the message-passing parallel programming model, in which data is moved from the address space of one process to that of another. Originally designed for distributed-memory architectures.

38 MPI (Cont’d)
Today, MPI runs on virtually any hardware platform: distributed memory, shared memory, or hybrid. All parallelism is explicit; the programmer is responsible for correctly identifying and expressing it.

39 MPI (Cont’d)
Communication model: processes exchange data through explicit send and receive operations.
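A minimal point-to-point sketch (my own example, not from the slides; build with mpicc and run with mpirun -np 2): rank 0 sends an integer to rank 1.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            int value = 42;
            // Explicitly move data from rank 0's address space to rank 1's.
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            int value;
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }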

40 MPI (Cont’d)
Structure: an MPI program initializes the library, communicates through a communicator such as MPI_COMM_WORLD, and finalizes before exit.

41 SIMD
Single instruction, multiple data: a type of parallel computer in which all processing units execute the same instruction at any given clock cycle, while each processing unit can operate on a different data element.
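For a concrete taste on an ordinary CPU (my own illustration using x86 SSE intrinsics; compilers can also produce such code by auto-vectorizing the earlier array-addition loop), one instruction here adds four floats at once:

    #include <stdio.h>
    #include <xmmintrin.h>   // SSE intrinsics

    int main(void) {
        float A[4] = { 1, 2, 3, 4 };
        float B[4] = { 10, 20, 30, 40 };
        float C[4];

        __m128 a = _mm_loadu_ps(A);    // load 4 floats
        __m128 b = _mm_loadu_ps(B);
        __m128 c = _mm_add_ps(a, b);   // one instruction, 4 additions
        _mm_storeu_ps(C, c);

        for (int i = 0; i < 4; i++)
            printf("%.0f ", C[i]);     // prints: 11 22 33 44
        printf("\n");
        return 0;
    }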

42 GPU
Stands for "graphics processing unit." Integration scheme: a card on the motherboard, e.g., the NVIDIA GeForce GTX 1080.

43 NVIDIA CUDA

44 NVIDIA Fermi

45 NVIDIA Pascal

46 How Does It Work?
1. Copy input data from PC memory to GPU memory. 2. The CPU calls the computation on the GPU. 3. The GPU computes the result. 4. Copy the result from GPU memory back to PC memory.
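The same four steps in CUDA C (a minimal sketch of mine; the kernel and sizes are illustrative, and error checking is omitted):

    #include <stdio.h>
    #include <cuda_runtime.h>

    // Kernel: each GPU thread increments one element.
    __global__ void add_one(int *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] += 1;
    }

    int main(void) {
        const int n = 1024;
        int host[1024];
        for (int i = 0; i < n; i++) host[i] = i;

        int *dev;
        cudaMalloc(&dev, n * sizeof(int));

        // 1. Copy input data from PC memory to GPU memory.
        cudaMemcpy(dev, host, n * sizeof(int), cudaMemcpyHostToDevice);

        // 2. Call the computation; 3. the GPU computes the result.
        add_one<<<(n + 255) / 256, 256>>>(dev, n);

        // 4. Copy the result back to PC memory.
        cudaMemcpy(host, dev, n * sizeof(int), cudaMemcpyDeviceToHost);

        cudaFree(dev);
        printf("host[0] = %d, host[1023] = %d\n", host[0], host[1023]);
        return 0;
    }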

47 OpenCL
An open standard for parallel programming supported across vendors: NVIDIA, AMD, Intel, Altera, etc.

48 Thank You
Thank you. Questions about "Parallel Processing"? Ask Ahmad Siavashi.

