Computer Organization CS224 Fall 2012 Lesson 52

Introduction  Goal: connecting multiple computers to get higher performance l Multiprocessors l Scalability, availability, power efficiency  Job-level (process-level) parallelism l High throughput for independent jobs  Parallel processing program l Single program run on multiple processors  Multicore microprocessors l Chips with multiple processors (cores) §9.1 Introduction

Types of Parallelism
- Data-Level Parallelism (DLP)
- Thread-Level Parallelism (TLP)
- Instruction-Level Parallelism (ILP), e.g., pipelining
[Figure: timelines contrasting how DLP, TLP, and ILP overlap work over time]

Hardware and Software
- Hardware
  - Serial: e.g., Pentium 4
  - Parallel: e.g., quad-core Xeon e5345
- Software
  - Sequential: e.g., matrix multiplication
  - Concurrent: e.g., operating system
- Sequential/concurrent software can run on serial/parallel hardware
  - Challenge: making effective use of parallel hardware

What We've Already Covered
- §2.11: Parallelism and Instructions
  - Synchronization
- §3.6: Parallelism and Computer Arithmetic
  - Associativity
- §4.10: Parallelism and Advanced Instruction-Level Parallelism
- §5.8: Parallelism and Memory Hierarchies
  - Cache Coherence (actually, we skipped this)
- §6.9: Parallelism and I/O
  - Redundant Arrays of Inexpensive Disks

Parallel Programming (§7.2: The Difficulty of Creating Parallel Processing Programs)
- Parallel software is the problem
- Need to get significant performance improvement
  - Otherwise, just use a faster uniprocessor, since it's easier!
- Difficulties
  - Partitioning
  - Coordination
  - Communications overhead

Amdahl's Law
- Sequential part can limit speedup
- Example: 100 processors, 90× speedup?
  - T_new = T_parallelizable/100 + T_sequential
  - Speedup = 1 / ((1 − F_parallelizable) + F_parallelizable/100) = 90
  - Solving: F_parallelizable = 0.999
- Need sequential part to be 0.1% of original time (99.9% needs to be parallelizable)
- Obviously, less-than-expected speedups are common!
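
As a quick check of the arithmetic above, here is a minimal C sketch (the helper names amdahl_speedup and required_fraction are ours, not from the slides) that evaluates Amdahl's Law and solves it for the fraction that must be parallelizable to hit a target speedup:

    #include <stdio.h>

    /* Speedup predicted by Amdahl's Law for parallel fraction f on p processors. */
    static double amdahl_speedup(double f, int p) {
        return 1.0 / ((1.0 - f) + f / p);
    }

    /* Parallel fraction required to reach speedup s on p processors,
       from solving s = 1 / ((1 - f) + f/p) for f. */
    static double required_fraction(double s, int p) {
        return (1.0 - 1.0 / s) / (1.0 - 1.0 / (double)p);
    }

    int main(void) {
        /* The slide's example: 100 processors, target 90x speedup. */
        double f = required_fraction(90.0, 100);
        printf("fraction that must be parallel: %.4f\n", f);   /* ~0.9989, i.e. about 99.9% */
        printf("speedup with 99.9%% parallel:  %.1f\n",
               amdahl_speedup(0.999, 100));                    /* ~91x, matching the slide */
        return 0;
    }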

Scaling Example
- Workload: sum of 10 scalars, and 10 × 10 matrix sum
  - Speed up from 10 to 100 processors
- Single processor: Time = (10 + 100) × t_add
- 10 processors
  - Time = 10 × t_add + 100/10 × t_add = 20 × t_add
  - Speedup = 110/20 = 5.5 (55% of potential)
- 100 processors
  - Time = 10 × t_add + 100/100 × t_add = 11 × t_add
  - Speedup = 110/11 = 10 (10% of potential)
- Assumes load can be balanced across processors

Scaling Example (cont)
- What if matrix size is 100 × 100?
- Single processor: Time = (10 + 10000) × t_add
- 10 processors
  - Time = 10 × t_add + 10000/10 × t_add = 1010 × t_add
  - Speedup = 10010/1010 = 9.9 (99% of potential)
- 100 processors
  - Time = 10 × t_add + 10000/100 × t_add = 110 × t_add
  - Speedup = 10010/110 = 91 (91% of potential)
- Assuming load is balanced
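
The same bookkeeping fits in a few lines of C; this sketch (the helper names time_in_tadds and report are our own) reproduces both the 10 × 10 and 100 × 100 cases above:

    #include <stdio.h>

    /* Execution time, in units of t_add: the scalar additions run serially,
       the matrix additions are split evenly across p processors. */
    static double time_in_tadds(int scalars, int matrix_elems, int p) {
        return (double)scalars + (double)matrix_elems / p;
    }

    static void report(int scalars, int matrix_elems, int p) {
        double t1 = time_in_tadds(scalars, matrix_elems, 1);
        double tp = time_in_tadds(scalars, matrix_elems, p);
        printf("%d scalars + %5d elems on %3d procs: speedup = %4.1f (%.0f%% of potential)\n",
               scalars, matrix_elems, p, t1 / tp, 100.0 * (t1 / tp) / p);
    }

    int main(void) {
        report(10, 100, 10);      /* 10x10 matrix:   5.5 (55%) */
        report(10, 100, 100);     /* 10x10 matrix:  10.0 (10%) */
        report(10, 10000, 10);    /* 100x100 matrix: 9.9 (99%) */
        report(10, 10000, 100);   /* 100x100 matrix: 91  (91%) */
        return 0;
    }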

Strong vs Weak Scaling
- Strong scaling: problem size fixed
  - As in previous example
- Weak scaling: problem size proportional to number of processors
  - 10 processors, 10 × 10 matrix: Time = 20 × t_add
  - 100 processors, 32 × 32 matrix: Time = 10 × t_add + 1000/100 × t_add = 20 × t_add
  - Constant performance in this example
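
A minimal sketch of the weak-scaling point, assuming roughly 10 matrix elements per processor as on the slide, shows the execution time staying flat as processors are added:

    #include <stdio.h>

    int main(void) {
        int scalars = 10;                             /* sequential part */
        for (int p = 10; p <= 100; p *= 10) {
            int matrix_elems = 10 * p;                /* grow problem with p: 100, then 1000 */
            double time = scalars + (double)matrix_elems / p;   /* in units of t_add */
            printf("p = %3d, %4d elements: time = %.0f x t_add\n", p, matrix_elems, time);
        }
        return 0;                                     /* prints 20 x t_add in both cases */
    }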

Shared Memory (§7.3: Shared Memory Multiprocessors)
- SMP: shared memory multiprocessor
  - Hardware provides single physical address space for all processors
  - Synchronize shared variables using locks
  - Memory access time
    - UMA (uniform) vs. NUMA (nonuniform)
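
The slides do not name a particular synchronization API; as one concrete illustration of "synchronize shared variables using locks," here is a small POSIX-threads sketch in which several threads update a shared counter under a mutex:

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4

    static long counter = 0;                       /* shared variable in one address space */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg) {
        (void)arg;
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);             /* only one thread updates counter at a time */
            counter++;
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t[NTHREADS];
        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);              /* wait for all threads */
        printf("counter = %ld\n", counter);        /* 400000 with the lock; racy without it */
        return 0;
    }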

Example: Sum Reduction
- Sum 100,000 numbers on a 100-processor UMA
  - Each processor has ID: 0 ≤ Pn ≤ 99
  - Partition 1000 numbers per processor
  - Initial summation on each processor:

        sum[Pn] = 0;
        for (i = 1000*Pn; i < 1000*(Pn+1); i = i + 1)
          sum[Pn] = sum[Pn] + A[i];

- Now need to add these partial sums
  - Reduction: divide and conquer
  - Half the processors add pairs, then quarter, ...
  - Need to synchronize between reduction steps

Example: Sum Reduction

    half = 100;
    repeat
      synch();                        /* barrier: wait for all processors */
      if (half % 2 != 0 && Pn == 0)
        sum[0] = sum[0] + sum[half-1];
        /* Conditional sum needed when half is odd;
           Processor0 gets missing element */
      half = half / 2;                /* dividing line on who sums */
      if (Pn < half)
        sum[Pn] = sum[Pn] + sum[Pn+half];
    until (half == 1);
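
The synch() call and the processor ID Pn above are pseudocode. One way to make the whole reduction concrete, purely as an illustrative sketch, is a POSIX-threads version that uses a pthread_barrier_t for synch() and 4 threads in place of 100 processors (pthread_barrier_t requires a POSIX platform such as Linux):

    #include <pthread.h>
    #include <stdio.h>

    #define P 4                          /* number of "processors" (threads)   */
    #define N 100000                     /* numbers to sum; N % P == 0 assumed */

    static double A[N];
    static double sum[P];                /* one partial sum per thread         */
    static pthread_barrier_t barrier;    /* plays the role of synch()          */

    static void *worker(void *arg) {
        int Pn = (int)(long)arg;

        /* Step 1: each thread sums its own 1/P slice of A. */
        sum[Pn] = 0.0;
        for (int i = (N / P) * Pn; i < (N / P) * (Pn + 1); i++)
            sum[Pn] += A[i];

        /* Step 2: divide-and-conquer reduction, as on the slide. */
        int half = P;
        do {
            pthread_barrier_wait(&barrier);      /* synch() */
            if (half % 2 != 0 && Pn == 0)
                sum[0] += sum[half - 1];         /* odd case: thread 0 picks up extra */
            half = half / 2;
            if (Pn < half)
                sum[Pn] += sum[Pn + half];
        } while (half > 1);
        return NULL;
    }

    int main(void) {
        for (int i = 0; i < N; i++) A[i] = 1.0;  /* expected total: N */

        pthread_barrier_init(&barrier, NULL, P);
        pthread_t t[P];
        for (long i = 0; i < P; i++) pthread_create(&t[i], NULL, worker, (void *)i);
        for (int i = 0; i < P; i++) pthread_join(t[i], NULL);
        pthread_barrier_destroy(&barrier);

        printf("total = %.0f\n", sum[0]);        /* prints 100000 */
        return 0;
    }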