Presentation transcript:

Chapter 6: Parallelism

Why parallelism? Heat (power) and energy.
What is the difference between power and energy? Power is the rate at which energy is used (watts); energy is power accumulated over time (joules), which is what the battery or the electricity bill ultimately pays for.

Introduction (§6.1)

Goal: connecting multiple computers to get higher performance
  - Multiprocessors
  - Scalability, availability, power efficiency
Task-level (process-level) parallelism: high throughput for independent jobs
Parallel processing program: a single program run on multiple processors
Multicore microprocessors: chips with multiple processors (cores)

Hardware and Software

Hardware
  - Serial: e.g., Pentium 4
  - Parallel: e.g., quad-core Xeon e5345
Software
  - Sequential: e.g., matrix multiplication
  - Concurrent: e.g., operating system
Sequential or concurrent software can run on serial or parallel hardware.
Challenge: making effective use of parallel hardware.

Parallelism in the Book

§2.11: Parallelism and Instructions (Synchronization)
§3.6: Parallelism and Computer Arithmetic (Subword Parallelism)
§4.10: Parallelism and Advanced Instruction-Level Parallelism
§5.10: Parallelism and Memory Hierarchies (Cache Coherence)

Parallel Programming (§6.2 The Difficulty of Creating Parallel Processing Programs)

Parallel software is the problem.
Need to get significant performance improvement
  - Otherwise, just use a faster uniprocessor, since it's easier!
Difficulties
  - Partitioning
  - Coordination
  - Communications overhead
Reporting analogy: split the story into equal-sized pieces, pay the communication overhead, balance the load, and allow time to synchronize.
The more processors (reporters), the harder this coordination becomes.

Amdahl's Law

Who remembers Amdahl's Law? If not the details, then what it means at a high level...

Amdahl's Law

Sequential part can limit speedup.
Example: 100 processors, 90× speedup?
  T_new = T_parallelizable/100 + T_sequential
  Speedup = T_old / T_new
          = T_old / (T_parallelizable/100 + T_sequential)
          = 1 / (F_parallelizable/100 + F_sequential)
Setting Speedup = 90 and solving gives F_parallelizable = 0.999.
The sequential part must be no more than 0.1% of the original execution time.
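As a quick sanity check of the arithmetic above, here is a small C sketch (ours, not from the slides; the helper name amdahl_speedup is made up) that evaluates the speedup formula for the slide's numbers:

    #include <stdio.h>

    /* Amdahl's Law: speedup = 1 / ((f_par / p) + (1 - f_par))
       f_par = fraction of the original time that is parallelizable
       p     = number of processors */
    static double amdahl_speedup(double f_par, int p)
    {
        return 1.0 / (f_par / p + (1.0 - f_par));
    }

    int main(void)
    {
        /* The slide's example: 100 processors, parallel fraction 0.999 */
        printf("f_par = 0.999, p = 100 -> speedup = %.1f\n",
               amdahl_speedup(0.999, 100));   /* ~91x, i.e. roughly the 90x target */
        printf("f_par = 0.990, p = 100 -> speedup = %.1f\n",
               amdahl_speedup(0.990, 100));   /* only ~50x once 1% is sequential */
        return 0;
    }

Running it shows roughly 91× when F_parallelizable = 0.999, and only about 50× once the sequential part grows to 1% of the original time.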

Scaling Example

Workload: sum of 10 scalars, plus a 10 × 10 matrix sum
Speedup going from 10 to 100 processors?
Single processor: Time = (10 + 100) × t_add
10 processors:
  Time = 10 × t_add + (100/10) × t_add = 20 × t_add
  Speedup = 110/20 = 5.5 (55% of potential)
100 processors:
  Time = 10 × t_add + (100/100) × t_add = 11 × t_add
  Speedup = 110/11 = 10 (10% of potential)
Assumes the load can be balanced across processors.
Now I want you to repeat these calculations with a 100 × 100 matrix.

Scaling Example (cont.)

What if the matrix size is 100 × 100?
Single processor: Time = (10 + 10000) × t_add
10 processors:
  Time = 10 × t_add + (10000/10) × t_add = 1010 × t_add
  Speedup = 10010/1010 = 9.9 (99% of potential)
100 processors:
  Time = 10 × t_add + (10000/100) × t_add = 110 × t_add
  Speedup = 10010/110 = 91 (91% of potential)
Assuming the load is balanced.
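Both scaling calculations (and the 10 × 10 case on the previous slide) can be reproduced with a short C sketch. This is our own illustration, using the slides' cost model of one t_add per scalar addition and per matrix element, and assuming perfect load balance:

    #include <stdio.h>

    /* Cost model from the slides, in units of t_add:
       10 scalar additions (not parallelized) + n*n matrix additions
       divided evenly across p processors. */
    static double scaled_time(int n, int p)
    {
        return 10.0 + (double)(n * n) / p;
    }

    int main(void)
    {
        int sizes[] = {10, 100};
        int procs[] = {1, 10, 100};

        for (int s = 0; s < 2; s++) {
            int n = sizes[s];
            double t1 = scaled_time(n, 1);       /* single-processor time */
            for (int q = 1; q < 3; q++) {
                int p = procs[q];
                double tp = scaled_time(n, p);
                printf("%dx%d matrix, %3d processors: time = %6.0f t_add, "
                       "speedup = %.1f (%.0f%% of potential)\n",
                       n, n, p, tp, t1 / tp, 100.0 * (t1 / tp) / p);
            }
        }
        return 0;
    }

Its output matches the slides: 5.5× and 10× for the 10 × 10 matrix, 9.9× and 91× for the 100 × 100 matrix.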

Strong vs. Weak Scaling

Strong scaling: problem size fixed
  - As in the previous example
Weak scaling: problem size grows in proportion to the number of processors
  - 10 processors, 10 × 10 matrix: Time = 20 × t_add
  - 100 processors, 32 × 32 matrix (about 1000 elements): Time = 10 × t_add + (1000/100) × t_add = 20 × t_add
Constant performance in this example
Exception: the memory hierarchy!

Instruction and Data Streams (§6.3 SISD, MIMD, SIMD, SPMD, and Vector)

An alternate classification (Flynn's taxonomy):

                            Data Streams
                            Single                    Multiple
  Instruction   Single      SISD: Intel Pentium 4     SIMD: SSE instructions of x86
  Streams       Multiple    MISD: No examples today   MIMD: Intel Xeon e5345

SPMD: Single Program, Multiple Data
  - A parallel program on a MIMD computer
  - Conditional code for different processors (see the sketch below)
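"Conditional code for different processors" is the essence of SPMD. Below is a minimal sketch of ours (using POSIX threads as stand-ins for processors; the names worker and NPROCS are made up): every thread runs the same program, and an if on the thread's ID selects its role.

    #include <pthread.h>
    #include <stdio.h>

    #define NPROCS 4

    /* SPMD style: every "processor" runs the same function,
       but conditional code selects different work based on its ID. */
    static void *worker(void *arg)
    {
        long id = (long)arg;

        if (id == 0)
            printf("processor %ld: handling the sequential/coordination part\n", id);
        else
            printf("processor %ld: working on my slice of the data\n", id);
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NPROCS];

        for (long id = 0; id < NPROCS; id++)
            pthread_create(&t[id], NULL, worker, (void *)id);
        for (int i = 0; i < NPROCS; i++)
            pthread_join(t[i], NULL);
        return 0;
    }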

Example: DAXPY (Y = a × X + Y)

Conventional MIPS code:
      l.d    $f0,a($sp)       ;load scalar a
      addiu  r4,$s0,#512      ;upper bound of what to load
loop: l.d    $f2,0($s0)       ;load x(i)
      mul.d  $f2,$f2,$f0      ;a × x(i)
      l.d    $f4,0($s1)       ;load y(i)
      add.d  $f4,$f4,$f2      ;a × x(i) + y(i)
      s.d    $f4,0($s1)       ;store into y(i)
      addiu  $s0,$s0,#8       ;increment index to x
      addiu  $s1,$s1,#8       ;increment index to y
      subu   $t0,r4,$s0       ;compute bound
      bne    $t0,$zero,loop   ;check if done

Vector MIPS code:
      l.d      $f0,a($sp)     ;load scalar a
      lv       $v1,0($s0)     ;load vector x
      mulvs.d  $v2,$v1,$f0    ;vector-scalar multiply
      lv       $v3,0($s1)     ;load vector y
      addv.d   $v4,$v2,$v3    ;add y to product
      sv       $v4,0($s1)     ;store the result
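For reference, the loop both code sequences above implement can be written in a few lines of C. This is our rendering, assuming 64 double-precision elements (512 bytes / 8 bytes per element, matching the loop bound in the MIPS code):

    /* DAXPY: Y = a*X + Y over 64 double-precision elements. */
    void daxpy(double a, const double x[64], double y[64])
    {
        for (int i = 0; i < 64; i++)
            y[i] = a * x[i] + y[i];
    }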

Vector Processors

Highly pipelined function units
Stream data from/to vector registers to the units
  - Data collected from memory into registers
  - Results stored from registers to memory
Example: vector extension to MIPS
  - 32 × 64-element registers (64-bit elements)
  - Vector instructions
      lv, sv: load/store vector
      addv.d: add vectors of double
      addvs.d: add scalar to each element of a vector of double
Significantly reduces instruction-fetch bandwidth

Vector vs. Scalar

Vector architectures and compilers
  - Simplify data-parallel programming
Explicit statement of the absence of loop-carried dependences
  - Reduced checking in hardware
Regular access patterns benefit from interleaved and burst memory
Avoid control hazards by avoiding loops
More general than ad-hoc media extensions (such as MMX, SSE)
  - Better match with compiler technology

SIMD

Operate elementwise on vectors of data
  - E.g., MMX and SSE instructions in x86
  - Multiple data elements in 128-bit wide registers
All processors execute the same instruction at the same time
  - Each with a different data address, etc.
Simplifies synchronization
Reduced instruction control hardware
Works best for highly data-parallel applications
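As a concrete taste of the SSE instructions mentioned above, here is a short C sketch of ours using compiler intrinsics (not from the slides): each _mm_add_ps adds four single-precision floats packed into one 128-bit register.

    #include <xmmintrin.h>   /* SSE intrinsics: 128-bit registers, 4 floats each */

    /* Elementwise c[i] = a[i] + b[i], four floats per instruction.
       Assumes n is a multiple of 4; unaligned loads/stores for simplicity. */
    void add_sse(const float *a, const float *b, float *c, int n)
    {
        for (int i = 0; i < n; i += 4) {
            __m128 va = _mm_loadu_ps(&a[i]);
            __m128 vb = _mm_loadu_ps(&b[i]);
            _mm_storeu_ps(&c[i], _mm_add_ps(va, vb));
        }
    }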

Vector vs. Multimedia Extensions

Vector instructions have a variable vector width; multimedia extensions have a fixed width.
Vector instructions support strided access; multimedia extensions do not (see the example below).
Vector units can be a combination of pipelined and arrayed functional units.
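To see why strided access matters, consider summing one column of a row-major matrix, as in the C sketch below (our example): consecutive accesses are n elements apart, which a vector machine can fetch with a single strided vector load, while a fixed-width multimedia extension can only load contiguous data directly.

    /* Strided access: reading column j of a row-major n x n matrix touches
       memory with a stride of n elements, not contiguously. */
    double sum_column(const double *m, int n, int j)
    {
        double sum = 0.0;
        for (int i = 0; i < n; i++)
            sum += m[i * n + j];   /* stride of n elements between accesses */
        return sum;
    }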