The Art of Parallel Processing

Presentation transcript:

The Art of Parallel Processing Ahmad Siavashi April 2017

The Software Crisis "As long as there were no machines, programming was no problem at all; when we had a few weak computers, programming became a mild problem, and now we have gigantic computers, programming has become an equally gigantic problem." Edsger W. Dijkstra, 1972 Turing Award Lecture

1st Software Crisis (~'60s & '70s)
- Problem: writing complex programs in assembly language, with low abstraction and heavy hardware dependence
- Answer: high-level languages such as C and FORTRAN

2nd Software Crisis (~'80s & '90s)
- Problem: handling multi-million-line programs; maintainability and composability
- Answer: C++, C#, Java; libraries, design patterns, software engineering methods and tools

Until Yesterday
- Programmers were oblivious to the processor: high-level languages abstracted away the system (e.g., Java bytecode is machine independent)
- Performance was left to Moore's Law

Moore's Law (Gordon Moore)
- 1965: circuit complexity doubles every year
- 1975 (revised): circuit complexity doubles every 1.5 years

The Free Lunch
- Instead of improving the software, wait for the hardware to improve
- E.g., hand-tuning a program to double its speed may take 2 years; if the technology improves 50% per year, in those 2 years the hardware gets 1.5^2 = 2.25x faster on its own
- So the investment in tuning is wrong, unless the tuned software also exploits the new technology

But Today, the Free Lunch Is Over! There were limiting forces (the "brick wall"):
- Power wall
- Memory wall
- Instruction-Level Parallelism (ILP) wall

Limit #1: Power Wall
$P_{dynamic} \propto C \cdot V_{dd}^2 \cdot f$, where $C$ = capacitance, $V_{dd}$ = supply voltage, $f$ = frequency
$P_{static} = I_{leakage} \cdot V_{dd}$
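As a rough worked illustration (the numbers are my own, assuming dynamic power dominates and that frequency scales roughly linearly with supply voltage), this is why multicore chips sidestep the power wall:

$P_{dynamic} \propto C \cdot V_{dd}^2 \cdot f$

Two cores at $0.8\,V_{dd}$ and $0.8\,f$: $P \propto 2 \cdot C \cdot (0.8\,V_{dd})^2 \cdot (0.8\,f) \approx 1.02 \cdot C\,V_{dd}^2\,f$

Potential throughput: $2 \times 0.8\,f = 1.6\,f$

For roughly the same power budget, the two slower cores offer up to 1.6x the throughput, provided the software can actually use both.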

Limit #2: Memory Wall
Each memory access requires hundreds of CPU cycles. (Source: Sun World Wide Analyst Conference, Feb. 25, 2003)

Limit #3: ILP Wall
Since 1985, essentially all processors have used pipelining to overlap the execution of instructions. This potential overlap among instructions is called Instruction-Level Parallelism (ILP), since the instructions can be evaluated in parallel. Ordinary programs are written and executed sequentially; ILP lets the compiler and the processor overlap the execution of multiple instructions, or even change the order in which instructions are executed. How much ILP exists in a program is highly application specific.

Limit #3: ILP Wall (Cont'd)
Example: a loop that adds two 1000-element arrays. Every iteration of the loop can overlap with any other iteration. The overlap is exposed by unrolling the loop, either statically by the compiler or dynamically by the hardware.

for (int i = 0; i < 1000; i++) { C[i] = A[i] + B[i]; }
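A sketch of what that unrolling amounts to, hand-unrolled by a factor of 4 (assuming the trip count is a multiple of 4):

float A[1000], B[1000], C[1000];
/* Unrolled by 4: the four statements in each iteration are independent,
   so a superscalar processor can issue and execute them in parallel. */
for (int i = 0; i < 1000; i += 4) {
    C[i]     = A[i]     + B[i];
    C[i + 1] = A[i + 1] + B[i + 1];
    C[i + 2] = A[i + 2] + B[i + 2];
    C[i + 3] = A[i + 3] + B[i + 3];
}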

Limit #3: ILP Wall (Cont'd)
- Superscalar designs were the state of the art: multiple instruction issue, dynamic scheduling, speculative execution, etc.
- You may have heard of these, but you haven't needed to know about them to write software!
- Unfortunately, these sources of performance have been used up.

Revolution Is Happening Now
- Chip density is still increasing (~2x every 2 years), but clock speed is not
- The number of processor cores may double instead
- There is little or no hidden parallelism (ILP) left to be found
- Parallelism must be exposed to and managed by software

Evolution of Microprocessors, 1971-2015
Ref: Intel processors: Shekhar Borkar and Andrew A. Chien, "The Future of Microprocessors," Communications of the ACM, Vol. 54, No. 5, pp. 67-77, doi:10.1145/1941487.1941507. Oracle M7: Timothy Prickett Morgan, "Oracle Cranks Up The Cores To 32 With Sparc M7 Chip," EnterpriseTech, Systems Edition, August 13, 2014.

"If you were plowing a field, which would you rather use? Two strong oxen or 1024 chickens?" - Seymour Cray

Pictorial Depiction of Amdahl's Law
The unaffected fraction $(1-F)$ stays unchanged; the affected fraction $F$ shrinks to $F/S$.
$\text{Overall speedup} = \dfrac{\text{Execution time without enhancement}}{\text{Execution time with enhancement}} = \dfrac{1}{(1-F) + \frac{F}{S}}$
where $F$ = the fraction enhanced and $S$ = the speedup of the enhanced fraction.

Amdahl's Law (Gene Myron Amdahl)
$\text{Overall speedup} = \dfrac{1}{(1-F) + \frac{F}{S}}$, where $F$ = the fraction enhanced and $S$ = the speedup of the enhanced fraction.
Example: the overall speedup if we make 80% of a program run 20% faster: $F = 0.8$, $S = 1.2$, so $\dfrac{1}{(1-0.8) + \frac{0.8}{1.2}} \approx 1.15$.

Amdahl's Law for Multicores
$\text{Overall speedup} = \dfrac{1}{(1-F) + \frac{F}{N}}$, where $F$ = the parallelizable fraction and $N$ = the number of cores.
Hence, running a program on a dual core does not mean getting twice the performance: the extra cores only pay off when the parallel fraction $F$ outweighs the serial fraction $(1-F)$ plus the parallelization overhead.
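A small helper (illustrative only, not from the slides) that evaluates the formula shows how quickly the returns diminish: with F = 0.9, even 1024 cores stay below the 1/(1-F) = 10x ceiling.

#include <stdio.h>

/* Amdahl's law for N cores; F is the parallelizable fraction. */
double amdahl_speedup(double F, int N) {
    return 1.0 / ((1.0 - F) + F / N);
}

int main(void) {
    printf("F=0.9, N=2:    %.2fx\n", amdahl_speedup(0.9, 2));    /* ~1.82x */
    printf("F=0.9, N=8:    %.2fx\n", amdahl_speedup(0.9, 8));    /* ~4.71x */
    printf("F=0.9, N=1024: %.2fx\n", amdahl_speedup(0.9, 1024)); /* ~9.91x */
    return 0;
}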

Myths and Realities
- Myth: 2 x 3 GHz is better than 1 x 6 GHz. Not necessarily: 6 GHz is better for single-threaded apps, and may even be better for multi-threaded apps because of parallelization overheads (discussed later)
- So why might a dual-core processor still run the same app faster? The OS can run on the other core, leaving one core dedicated to the app

3rd Software Crisis (Present)
- Sequential performance has been left behind by Moore's Law, yet we still need continuous, reasonable performance improvements to support new features and larger datasets
- We need those improvements while sustaining portability and maintainability, without unduly increasing the complexity faced by the programmer; this is critical to keep up with the current rate of evolution in software
- Solution? Concurrency is the next major revolution in how we write software; there is no perfect solution or implementation yet
- There is little or no hidden parallelism (ILP) left to be found; parallelism must be exposed to and managed by software
- "The vast majority of programmers today don't grok concurrency, just as the vast majority of programmers 15 years ago didn't yet grok objects." (Herb Sutter)

Concurrency vs. Parallelism
Concurrency is about structuring a program as independently progressing tasks (dealing with many things at once); parallelism is about actually executing computations simultaneously on multiple processing units (doing many things at once).

Classes of Parallelism and Parallel Architectures
- Data parallelism: the same operation is applied to many data items at once
- Task parallelism: different, largely independent tasks run at the same time
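A compact OpenMP sketch contrasting the two classes (decode_audio and decode_video are hypothetical task functions, not from the slides):

#include <omp.h>

void decode_audio(void);   /* hypothetical independent task */
void decode_video(void);   /* hypothetical independent task */

void classes_of_parallelism(const float *A, const float *B, float *C, int n) {
    /* Data parallelism: the same operation applied to different elements. */
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        C[i] = A[i] + B[i];

    /* Task parallelism: different, independent operations run concurrently. */
    #pragma omp parallel sections
    {
        #pragma omp section
        decode_audio();
        #pragma omp section
        decode_video();
    }
}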

Race Condition
A race condition exists when the output depends on the sequence or timing of accesses to a shared resource.

// Shared variable
int sum = 0;
// t threads running the code below
for (int i = 0; i < N; i++)
    sum++;
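One way to make the race visible and then remove it, sketched with OpenMP (an atomic update is used here; a reduction clause would be the more idiomatic fix):

#include <omp.h>
#include <stdio.h>

#define N 400000

int main(void) {
    int racy_sum = 0, safe_sum = 0;

    /* Racy: threads read-modify-write the shared counter without
       coordination, so updates are lost and the result is usually < N. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        racy_sum++;

    /* Fixed: the increment is made atomic, so no update is lost. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        #pragma omp atomic
        safe_sum++;
    }

    printf("racy: %d   safe: %d (expected %d)\n", racy_sum, safe_sum, N);
    return 0;
}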

Principles of Parallel Programming
- Finding enough parallelism (Amdahl's law)
- Granularity
- Locality
- Load balance
- Coordination and synchronization
- Performance modeling
All of these make parallel programming even harder than sequential programming.

Flynn's Classification
Computers are classified by the number of concurrent instruction streams and data streams: SISD (single instruction, single data), SIMD (single instruction, multiple data), MISD (multiple instruction, single data), and MIMD (multiple instruction, multiple data).

SISD
- Single Instruction, Single Data: uniprocessors
- Implicit parallelism: pipelining, hyperthreading, speculative execution, dynamic scheduling (the scoreboard algorithm, Tomasulo's algorithm), etc.

Intel Pentium-4 Hyperthreading

MIMD
- Multiple Instructions, Multiple Data: a type of parallel system
- Every processor may be executing a different instruction stream and working with a different data stream

Parallel Computer Memory Architectures: Shared Memory
- A single address space shared by all processors
- Uniform Memory Access (UMA)
- Non-Uniform Memory Access (NUMA)

AMD Orochi Die Floorplan, based on AMD's Bulldozer microarchitecture

OpenMP (Open Multi-Processing)
- An Application Program Interface (API) used to explicitly direct multi-threaded, shared-memory parallelism
- The API is specified for C/C++ and Fortran
- Available on most major platforms, including Unix/Linux and Windows

OpenMP (Cont'd): the Fork-Join Model
A master thread runs sequentially until it reaches a parallel region, forks a team of threads to execute that region, and joins them back into a single thread when the region ends.

OpenMP (Cont'd) Example:

#pragma omp parallel for num_threads(4)
for (int i = 0; i < 40; i++)
    A[i] = i;
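A self-contained version of that snippet (the surrounding scaffolding is mine), typically compiled with something like gcc -fopenmp:

#include <omp.h>
#include <stdio.h>

int main(void) {
    int A[40];

    /* Fork: the 40 iterations are split among 4 threads. */
    #pragma omp parallel for num_threads(4)
    for (int i = 0; i < 40; i++)
        A[i] = i;

    /* Join: execution continues sequentially once all threads are done. */
    printf("A[39] = %d\n", A[39]);
    return 0;
}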

Parallel Computer Memory Architectures (Cont'd): Distributed Memory
- No global address space

Computer Cluster

MPI (Message Passing Interface)
- Addresses the message-passing parallel programming model: data is moved from the address space of one process to that of another
- Originally designed for distributed-memory architectures

MPI (Cont'd)
- Today, MPI runs on virtually any hardware platform: distributed memory, shared memory, and hybrid
- All parallelism is explicit: the programmer is responsible for correctly identifying and expressing the parallelism

MPI (Cont'd) Communication Model

MPI (Cont'd) Structure
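Since the structure diagram did not survive the transcript, here is a minimal sketch of the usual MPI program skeleton (initialize, query rank and size, exchange a message, finalize); typically built with mpicc and launched with mpirun -np 2:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's id         */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */

    if (rank == 0) {
        int data = 42;
        MPI_Send(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* to rank 1 */
    } else if (rank == 1) {
        int data;
        MPI_Recv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 of %d received %d from rank 0\n", size, data);
    }

    MPI_Finalize();
    return 0;
}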

SIMD
- Single Instruction, Multiple Data: a type of parallel computer
- All processing units execute the same instruction at any given clock cycle
- Each processing unit can operate on a different data element
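One concrete CPU-side form of SIMD is vector instructions; a hedged sketch using x86 SSE intrinsics (assumes an SSE-capable processor and n divisible by 4):

#include <xmmintrin.h>   /* SSE intrinsics */

/* Adds two float arrays four elements at a time: one instruction
   performs four additions on different data elements. */
void add_simd(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);   /* load 4 floats  */
        __m128 vb = _mm_loadu_ps(&b[i]);
        __m128 vc = _mm_add_ps(va, vb);    /* 4 adds at once */
        _mm_storeu_ps(&c[i], vc);
    }
}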

GPU
- Stands for "Graphics Processing Unit"
- Integration scheme: a card plugged into the motherboard (e.g., NVIDIA GeForce GTX 1080)

NVIDIA CUDA

NVIDIA Fermi

NVIDIA Pascal

How Does It Work?
1. Copy input data from PC memory to GPU memory
2. The CPU calls the computation on the GPU
3. The GPU produces the result in GPU memory
4. Copy the result back from GPU memory to PC memory
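A hedged host-side sketch of those four steps using the CUDA runtime API; vector_add is a hypothetical kernel, and the launch itself is left in a comment because it uses CUDA-specific syntax:

#include <cuda_runtime.h>

void run_on_gpu(const float *hA, const float *hB, float *hC, int n) {
    size_t bytes = n * sizeof(float);
    float *dA, *dB, *dC;
    cudaMalloc((void **)&dA, bytes);
    cudaMalloc((void **)&dB, bytes);
    cudaMalloc((void **)&dC, bytes);

    /* 1. Copy input data from PC memory to GPU memory */
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    /* 2. Call the computation (kernel launch, CUDA syntax):
          vector_add<<<(n + 255) / 256, 256>>>(dA, dB, dC, n);  */

    /* 3. Wait for the GPU to produce the result */
    cudaDeviceSynchronize();

    /* 4. Copy the result from GPU memory back to PC memory */
    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
}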

OpenCL: supported across vendors such as NVIDIA, AMD, Intel, Altera, etc.

Thank you. For questions about "Parallel Processing", ask Ahmad Siavashi.