CSC3050 – Computer Architecture

CSC3050 – Computer Architecture Prof. Yeh-Ching Chung School of Science and Engineering Chinese University of Hong Kong, Shenzhen

Introduction
Goal: connecting multiple computers to get higher performance
  Multiprocessors
  Scalability, availability, power efficiency
Job-level (process-level) parallelism
  High throughput for independent jobs
Parallel processing program
  Single program run on multiple processors
Multicore microprocessors
  Chips with multiple processors (cores)

Hardware and Software
Sequential/concurrent software can run on serial/parallel hardware
Challenge: making effective use of parallel hardware

Parallel Programming
Parallel software is the problem
Need to get significant performance improvement
  Otherwise, just use a faster uniprocessor, since it's easier!
Difficulties
  Partitioning, load balancing
  Coordination, synchronization
  Communications overhead
  Sequential dependencies

Amdahl's Law (1)
t_new = t_improvable / M_improvement + t_unimprovable
Speed-up = t_old / t_new
         = t_old / (t_old − t_improvable + t_improvable / M_improvement)
         = 1 / ((1 − f_improvable) + f_improvable / M_improvement)

Amdahl's Law (2)
To achieve a 90× speed-up using 100 processors:
Speed-up = 90 = 1 / ((1 − f_improvable) + f_improvable / 100)
Solving gives f_improvable = 0.999, so f_unimprovable = f_sequential = 0.1%
The sequential part can limit speed-up
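
As a quick check of the formula (not part of the original slides), here is a small C sketch that evaluates Amdahl's Law for an improvable fraction f and improvement factor M; the function name amdahl_speedup is just an illustrative choice.

#include <stdio.h>

/* Speed-up predicted by Amdahl's Law for an improvable fraction f
   and an improvement factor M (e.g., number of processors). */
double amdahl_speedup(double f, double M) {
    return 1.0 / ((1.0 - f) + f / M);
}

int main(void) {
    /* The slide's case: with M = 100, reaching ~90x requires
       roughly 99.9% of the work to be improvable. */
    printf("f = 0.999 -> %.1fx\n", amdahl_speedup(0.999, 100.0));  /* ~91.0 */
    printf("f = 0.990 -> %.1fx\n", amdahl_speedup(0.990, 100.0));  /* ~50.3 */
    return 0;
}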

Scaling Example
Calculate the sum of 10 scalars and the sum of two 10×10 matrices
Single processor: t_total = (10 + 100) × t_add = 110 t_add
10 processors: t_total = 10 × t_add + (100/10) × t_add = 20 t_add
  Speed-up = 110/20 = 5.5 (55% of ideal)
100 processors: t_total = 10 × t_add + (100/100) × t_add = 11 t_add
  Speed-up = 110/11 = 10 (10% of ideal)

Scaling Example (cont'd)
What if the matrix size is 100 × 100?
Single processor: t_total = (10 + 10000) × t_add = 10010 t_add
10 processors: t_total = 10 × t_add + (10000/10) × t_add = 1010 t_add
  Speed-up = 10010/1010 = 9.9 (99% of ideal)
100 processors: t_total = 10 × t_add + (10000/100) × t_add = 110 t_add
  Speed-up = 10010/110 = 91 (91% of ideal)
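
For reference (not in the original slides), the same cost model can be tabulated with a short C sketch; it assumes, as the slides do, that the 10 scalar additions stay sequential while the n×n matrix additions divide evenly across p processors.

#include <stdio.h>

/* Slide cost model: time in units of t_add. */
double total_time(int n, int p) {
    return 10.0 + (double)(n * n) / p;
}

int main(void) {
    int sizes[] = {10, 100};
    int procs[] = {1, 10, 100};
    for (int s = 0; s < 2; s++) {
        double t1 = total_time(sizes[s], 1);      /* single-processor time */
        for (int q = 0; q < 3; q++) {
            int p = procs[q];
            double tp = total_time(sizes[s], p);
            printf("n=%3d p=%3d time=%6.0f t_add speed-up=%5.1f (%3.0f%% of ideal)\n",
                   sizes[s], p, tp, t1 / tp, 100.0 * (t1 / tp) / p);
        }
    }
    return 0;
}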

Strong vs Weak Scaling
Strong scaling: speed-up achieved on a multiprocessor without increasing the size of the problem
Weak scaling: speed-up achieved on a multiprocessor while increasing the size of the problem proportionally to the increase in the number of processors

Multithreading
Difficult to continue to extract instruction-level parallelism (ILP) from a single sequential thread of control
Many workloads can make use of thread-level parallelism (TLP)
  TLP from multiprogramming (run independent sequential jobs)
  TLP from multithreaded applications (run one job faster using parallel threads)
Multithreading uses TLP to improve utilization of a single processor

Examples of Threads
A web browser
  One thread displays images
  One thread retrieves data from the network
A word processor
  One thread displays graphics
  One thread reads keystrokes
  One thread performs spell checking in the background
A web server
  One thread accepts requests
  When a request comes in, a separate thread is created to service it
  Many threads to support thousands of client requests
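
To make the web-server pattern concrete (this example is not from the slides), here is a minimal POSIX-threads sketch in which an accepting thread hands each incoming request to its own worker thread; handle_request is a hypothetical stand-in for real service code.

#include <pthread.h>
#include <stdio.h>

/* Hypothetical request handler: one thread services one request. */
void *handle_request(void *arg) {
    int id = *(int *)arg;
    printf("thread servicing request %d\n", id);
    return NULL;
}

int main(void) {
    pthread_t workers[4];
    int ids[4] = {0, 1, 2, 3};
    /* The accepting thread creates a separate thread per request. */
    for (int i = 0; i < 4; i++)
        pthread_create(&workers[i], NULL, handle_request, &ids[i]);
    for (int i = 0; i < 4; i++)
        pthread_join(workers[i], NULL);
    return 0;
}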

Multithreading on a Chip
Find a way to "hide" true data dependency stalls, cache miss stalls, and branch stalls by finding instructions (from other process threads) that are independent of those stalling instructions
Hardware multithreading – increase the utilization of resources on a chip by allowing multiple processes (threads) to share the functional units of a single processor
  Processor must duplicate the state hardware for each thread – a separate register file, PC, instruction buffer, and store buffer for each thread
  The caches, Translation Look-aside Buffers (TLBs), Branch History Table (BHT), Branch Target Buffer (BTB), and Register Update Unit (RUU) can be shared (although the miss rates may increase if they are not sized accordingly)
  The memory can be shared through virtual memory mechanisms
  Hardware must support efficient thread context switching

Types of Multithreading
Fine-grained multithreading
  Switch threads after each cycle
  Interleave instruction execution
  If one thread stalls, others are executed
Coarse-grained multithreading
  Only switch on a long stall (e.g., L2-cache miss)
  Simplifies hardware, but does not hide short stalls
  Also has pipeline start-up costs

Simultaneous Multithreading (SMT)
In a multiple-issue, dynamically scheduled processor
  Schedule instructions from multiple threads
  Instructions from independent threads execute when functional units are available
  Within threads, dependencies are handled by scheduling and register renaming
Example: Intel Pentium 4 HyperThreading
  Two threads: duplicated registers, shared functional units and caches

Threading on a 4-way SS Processor Example

The Big Picture: Where Are We Now?
Multiprocessor – a computer system with at least two processors
  Can deliver high throughput for independent jobs via job-level (process-level) parallelism
  And can improve the run time of a single program that has been specially crafted to run on a multiprocessor – a parallel processing program

Multicores Now Universal
The power challenge has forced a change in microprocessor design
Since 2002 the rate of improvement in program response time has slowed from a factor of 1.5 per year to less than a factor of 1.2 per year
Today's microprocessors typically contain more than one core – chip multiprocessors (CMPs) – in a single IC

Product          AMD Barcelona   Intel Nehalem   IBM Power 6   Sun Niagara 2
Cores per chip   4               4               2             8
Clock rate       2.5 GHz         ~2.5 GHz?       4.7 GHz       1.4 GHz
Power            120 W           ~100 W?         ~100 W?       94 W

Shared Memory Multiprocessor
SMP: shared memory multiprocessor
  Hardware provides a single physical address space for all processors
  Synchronize shared variables using locks
Memory access time
  UMA (uniform) vs. NUMA (non-uniform)
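
A minimal sketch (not from the slides) of "synchronize shared variables using locks": several threads share one address space and guard a shared counter with a pthread mutex; without the lock, concurrent updates could be lost.

#include <pthread.h>
#include <stdio.h>

long counter = 0;                                  /* shared variable */
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;  /* protects counter */

void *bump(void *arg) {
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);
        counter++;                  /* guarded update to shared state */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t[4];
    for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, bump, NULL);
    for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter);   /* 400000 with the lock held */
    return 0;
}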

Shared Address Space Example
Add 100,000 numbers in shared memory using 100 processors (P0–P99)
Each processor sums its assigned 1000 numbers:

sum[Pn] = 0;
for (i = 1000*Pn; i < 1000*(Pn+1); i += 1)
    sum[Pn] += A[i];        /* sum the assigned areas */

Gather the partial sums (reduction):

half = 100;                 /* 100 processors in multiprocessor */
do {
    synch();                /* wait for partial sum completion */
    if (half % 2 != 0 && Pn == 0)
        sum[0] += sum[half-1];
    half = half/2;
    if (Pn < half)
        sum[Pn] += sum[Pn+half];
} while (half > 1);
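
For readers who want to run the idea, here is a hedged pthreads version of the same partial-sum plus reduction, scaled down to 4 threads and 4000 numbers; pthread_barrier_wait plays the role of the slide's synch() primitive, and the sizes are illustrative choices.

#include <pthread.h>
#include <stdio.h>

#define P 4
#define N 4000

double A[N], sum[P];          /* shared address space */
pthread_barrier_t barrier;    /* stands in for synch() */

void *worker(void *arg) {
    int Pn = *(int *)arg;
    int chunk = N / P;
    sum[Pn] = 0;
    for (int i = chunk * Pn; i < chunk * (Pn + 1); i++)
        sum[Pn] += A[i];                  /* sum the assigned area */

    int half = P;
    do {
        pthread_barrier_wait(&barrier);   /* wait for partial sums of this round */
        if (half % 2 != 0 && Pn == 0)
            sum[0] += sum[half - 1];      /* odd count: P0 takes the extra element */
        half = half / 2;
        if (Pn < half)
            sum[Pn] += sum[Pn + half];
    } while (half > 1);
    return NULL;
}

int main(void) {
    for (int i = 0; i < N; i++) A[i] = 1.0;
    pthread_barrier_init(&barrier, NULL, P);
    pthread_t t[P];
    int ids[P];
    for (int i = 0; i < P; i++) { ids[i] = i; pthread_create(&t[i], NULL, worker, &ids[i]); }
    for (int i = 0; i < P; i++) pthread_join(t[i], NULL);
    printf("total = %.0f\n", sum[0]);     /* expect 4000 */
    return 0;
}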

Reduction

Message Passing
Each processor has a private physical address space
Hardware sends/receives messages between processors

Sum Reduction (1)
Sum 100,000 numbers on 100 processors
First distribute 1000 numbers to each processor
Then do the partial sums:

sum = 0;
for (i = 0; i < 1000; i = i + 1)
    sum += AN[i];

Reduction
  Half the processors send, the other half receive and add
  Then a quarter send, a quarter receive and add, ...

Sum Reduction (2)
Given send() and receive() operations:

limit = 100;
half = 100;                 /* 100 processors */
do {
    half = (half+1)/2;      /* send vs. receive dividing line */
    if (Pn >= half && Pn < limit)
        send(Pn - half, sum);
    if (Pn < (limit/2))
        sum = sum + receive();
    limit = half;           /* upper limit of senders */
} while (half > 1);         /* exit with final sum */

Send/receive also provide synchronization
Assumes send/receive take similar time to addition (which is unrealistic)
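
The slides assume abstract send()/receive() primitives; as a real-world counterpart (not mentioned in the slides), here is a short MPI sketch in which each rank sums its private 1000 numbers and MPI_Reduce performs the tree combination that the halving loop above spells out by hand.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);
    int Pn, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &Pn);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double AN[1000], sum = 0.0, total = 0.0;
    for (int i = 0; i < 1000; i++) AN[i] = 1.0;   /* private local data */
    for (int i = 0; i < 1000; i++) sum += AN[i];  /* local partial sum */

    /* Combine partial sums on rank 0; MPI uses a tree of message
       exchanges much like the explicit halving loop above. */
    MPI_Reduce(&sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (Pn == 0) printf("total = %.0f\n", total);

    MPI_Finalize();
    return 0;
}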

Message Passing Multiprocessors

Pros and Cons of Message Passing
Message sending and receiving is much slower than addition
But message-passing multiprocessors are much easier for hardware designers to design
  No need to worry about cache coherence, for example
The advantage for programmers is that communication is explicit, so there are fewer "performance surprises" than with the implicit communication in cache-coherent SMPs
However, it's harder to port a sequential program to a message-passing multiprocessor, since every communication must be identified in advance
  With cache-coherent shared memory, the hardware figures out what data needs to be communicated