CSL718 : Multiprocessors 13th April, 2006 Introduction


CSL718 : Multiprocessors 13th April, 2006 Introduction Anshul Kumar, CSE IITD

Parallel Architectures: Flynn's Classification [1966]. Four architecture categories: SISD, SIMD, MISD, MIMD.

MIMD (figure: multiple control units C, each driving a processor P over an instruction stream IS; each processor accesses memory M over its own data stream DS)

Parallel Architectures: Sima's Classification. Parallel architectures (PAs) divide into data-parallel architectures and function-parallel architectures.

Function-Parallel Architectures.
Instruction-level PAs (ILPs): pipelined processors, VLIWs, superscalar processors.
Thread-level PAs and process-level PAs (MIMDs): shared-memory MIMD and distributed-memory MIMD, built using general-purpose processors.

Issues from the user's perspective:
- Specification / program design: explicit parallelism, or implicit parallelism plus a parallelizing compiler
- Partitioning / mapping to processors
- Scheduling / mapping to time instants: static or dynamic
- Communication and synchronization

Parallelizing example

for (i = 0; i < n; i++) {
  m = m + 3;
  a[i] = (a[m] + a[m+1] + a[m+2]) / 3;
}

Can all iterations be done in parallel?
Dependence 1: m = m + 3 carries a value from each iteration to the next (an induction variable; with m suitably initialized, m = 3i in iteration i).
Dependence 2: iteration i reads elements that later iterations overwrite, e.g. a[1] = (a[3]+a[4]+a[5])/3 reads values that iterations 3, 4, 5 will write, and likewise a[4] = (a[12]+a[13]+a[14])/3.

Parallelizing example - contd.
Eliminate the dependence based on the induction variable:

for (i = 0; i < n; i++) {
  m = i * 3;
  a[i] = (a[m] + a[m+1] + a[m+2]) / 3;
}

(m is now computed directly from i, so iterations no longer pass m to each other; the forward dependence through a[] remains.)

Parallelizing example - contd.
Eliminate the forward dependence using a double buffer:

for (i = 0; i < n; i++) {
  m = i * 3;
  aa[i] = (a[m] + a[m+1] + a[m+2]) / 3;
}
barrier();
for (i = 0; i < n; i++)
  a[i] = aa[i];

(All reads go to the original a[], all writes go to the buffer aa[]; the barrier separates the two phases before copying back.)

Parallelizing example - contd.
Parallelization using dynamic thread creation and scheduling:

schedule(0);
for (i = 0; i < n; i++) {
  wait_till_scheduled(i);
  m = i * 3;
  a[i] = (a[m] + a[m+1] + a[m+2]) / 3;
  if (i != 0) schedule(3*i);
  schedule(3*i + 1);
  schedule(3*i + 2);
}

(Iteration i reads a[3i..3i+2], so it must finish before iterations 3i, 3i+1, 3i+2 overwrite those elements; each iteration therefore releases exactly those successors once its reads are done.)

Grain size and performance (figure: speedup vs. grain size; at fine grain speedup is overhead-limited, at coarse grain it is limited by load imbalance and lack of parallelism, with an optimum grain size in between)

Speed up and efficiency. Speedup Sp = T1 / Tp, where T1 is the execution time on one processor and Tp the time on p processors; efficiency Ep = Sp / p.

Amdahl's Law: with serial fraction s, the speedup on p processors is Sp = 1 / (s + (1 - s)/p). For s = 0 this gives the ideal Sp = p; for s = 1 it stays at Sp = 1 (figure: Sp plotted against s, dropping sharply from p toward 1, already below 2 at s = 0.5).

Generalization (figure: actual speedup Sp plotted against processor count p, alongside the ideal linear Sp = p line)

Shared Memory Architecture

Design Space of Shared Memory Architectures:
- Extent of address space sharing
- Location of memory modules
- Uniformity of memory access

Address Space (figure: processors P1..P4). Three options: each processor sees an exclusive address space; each sees a partly exclusive, partly shared address space; all processors see the same shared address space.

Location of Memory (figure): centralized (all memory modules M on one side of the interconnection network, processors P on the other), distributed (a memory module attached to each processor), or mixed (some memory local to each processor plus memory reached across the network).

Clustered Architecture (figure: clusters of processors P and memories M, each cluster joined by its own interconnection network; the clusters, together with additional global memory modules, are connected by a global interconnection network)

Uniformity of Access.
UMA (Uniform Memory Access): uniformity across the memory address space and across processors; a UMA machine is a symmetric shared-memory multiprocessor (SMP).
NUMA (Non-Uniform Memory Access): a distributed shared-memory multiprocessor; variants include CC-NUMA (Cache Coherent NUMA) and COMA (Cache Only Memory Architecture).

Location and Sharing (table crossing memory location: centralized, mixed, distributed, against extent of sharing: full, partial, none; UMA corresponds to centralized memory, NUMA to distributed memory).

Shared Memory with Caches. Multiple copies of data may exist, giving the problem of cache coherence. Cache coherence protocols must answer three questions: What action is taken? Which processors/caches communicate? What is the status of each block?

What action is taken? Invalidate other caches and/or memory: send a signal/message immediately, and copy information only when unavoidable (similar to a write-back policy). Or update other caches and/or memory: write simultaneously at all places, sending modifications immediately (similar to a write-through policy).

Which processors/caches communicate? Snoopy protocol: broadcast invalidate or update messages; all processors snoop on the bus. Directory-based protocol: maintain a directory (a list of copies) and communicate selectively; the directory may be centralized (at memory) or distributed (among the caches).

Status of each cache block? Combinations of valid/invalid, private/shared, and clean/dirty:
- Simplest protocol (3 states): invalid, (shared) clean, private dirty
- Berkeley protocol (4 states): invalid, (shared) clean, private dirty, shared dirty
- Illinois and Firefly protocols (4 states): invalid, shared clean, private clean, private dirty
- Dragon protocol (5 states): invalid, shared clean, shared dirty, private clean, private dirty

Simplest invalidation protocol
Use 3 states: invalid, shared clean, private dirty (state diagram, built up over four slides; transitions are labeled with CPU events and snooped bus events, INV marking an invalidation on the bus):
- invalid -> shared clean on a CPU RD miss
- invalid -> private dirty on a CPU WR miss
- shared clean -> private dirty on a CPU WR (INV sent on the bus)
- private dirty -> shared clean on a bus RD miss (the dirty copy is supplied and demoted)
- shared clean -> invalid and private dirty -> invalid on a bus WR miss (INV received)