ENGS 116 Lecture 15, Slide 1: Multiprocessors and Thread-Level Parallelism
Vincent Berk, November 12th, 2008
Reading for Friday: Sections 4.1 – 4.3
Reading for Monday: Sections 4.4 – 4.9

ENGS 116 Lecture 15, Slide 2: Parallel Computers
Definition: “A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast.” — Almasi and Gottlieb, Highly Parallel Computing, 1989
Questions about parallel computers:
– How large a collection?
– How powerful are the processing elements?
– How do they cooperate and communicate?
– How is data transmitted?
– What type of interconnection?
– What are the HW and SW primitives for the programmer?
– Does it translate into performance?

ENGS 116 Lecture 15, Slide 3: Parallel Processors “Religion”
The dream of computer architects since 1960: replicate processors to add performance, rather than designing a faster processor.
Led to innovative organizations tied to particular programming models, since “uniprocessors can’t keep going”
– e.g., the claim that uniprocessors must stop getting faster due to the limit of the speed of light: made in 1972, …, 1989, ..., 2008
– Borders on religious fervor: you must believe!
– Fervor damped somewhat when 1990s companies went out of business: Thinking Machines, Kendall Square, ...
The argument now is the “pull” of the opportunity of scalable performance, not the “push” of a uniprocessor performance plateau.
Recent advancement: Sun Niagara “T1 CoolThreads” => 8 simple in-order execution cores that switch threads on stalls.

ENGS 116 Lecture 15, Slide 4: Opportunities: Scientific Computing
Nearly unlimited demand (Grand Challenge problems):

App                    Perf (GFLOPS)  Memory (GB)
48-hour weather        0.1            0.1
72-hour weather        3              1
Pharmaceutical design  100            10
Global Change, Genome  1000           1000

(Figure 1-2, page 25, of Culler, Singh, Gupta [CSG97])
Successes in some real industries:
– Petroleum: reservoir modeling
– Automotive: crash simulation, drag analysis, engine design
– Aeronautics: airflow analysis, engine design, structural mechanics
– Pharmaceuticals: molecular modeling
– Entertainment: full-length movies (“Finding Nemo”)

ENGS 116 Lecture 15, Slide 5: Flynn’s Taxonomy
SISD (Single Instruction, Single Data)
– Uniprocessors
MISD (Multiple Instruction, Single Data)
– ???
SIMD (Single Instruction, Multiple Data)
– Examples: Illiac-IV, CM-2
– Simple programming model
– Low overhead
– Flexibility
– All custom integrated circuits
MIMD (Multiple Instruction, Multiple Data)
– Examples: Sun Enterprise 10000, Cray T3D, SGI Origin
– Flexible
– Uses off-the-shelf microprocessors

ENGS 116 Lecture 15, Slide 6:
[Figure: Flynn’s classification of computer systems — (a) SISD computer, (b) SIMD computer, (c) MIMD computer; each drawn as control unit(s), processor unit(s), and memory module(s) connected by instruction and data streams.]

ENGS 116 Lecture 15, Slide 7: Parallel Framework
Programming Model:
– Multiprogramming: lots of jobs, no communication
– Shared address space: communicate via memory
– Message passing: send and receive messages
– Data parallel: several agents operate on several data sets simultaneously, then exchange information globally and simultaneously (shared or message passing)
Communication Abstraction:
– Shared address space: e.g., load, store, atomic swap (see the lock sketch below)
– Message passing: e.g., send, receive library calls
– Debate over this topic (ease of programming, scaling)
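As a concrete illustration of the atomic-swap primitive listed above, here is a minimal C11 sketch (not from the lecture; the names are illustrative) of a spin lock built from nothing but atomic exchange:

#include <stdatomic.h>

/* A spin lock built from the atomic-swap (exchange) primitive:
   repeatedly swap 1 into the flag until we observe the old value 0. */
static atomic_int lock_flag = 0;

void spin_lock(void) {
    while (atomic_exchange(&lock_flag, 1) == 1)
        ;  /* another processor holds the lock; keep spinning */
}

void spin_unlock(void) {
    atomic_store(&lock_flag, 0);  /* release: the lock becomes visibly free */
}

A processor acquires the lock only when its swap returns the old value 0; everyone else keeps swapping 1 for 1 until the holder stores 0 again.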

ENGS 116 Lecture 15, Slide 8: Diagrams of Shared and Distributed Memory Architectures
[Figure: a shared-memory system shows processors P connected through a network to shared memory modules M; a distributed-memory system pairs each processor P with a local memory M, with the P–M nodes joined by a network.]

ENGS 116 Lecture 15, Slide 9: Shared Address/Memory Multiprocessor Model
Communicate via load and store
– Oldest and most popular model
Based on timesharing: processes on multiple processors vs. sharing a single processor
Process: a virtual address space and ≥ 1 thread of control
– Multiple processes can overlap (share), but ALL threads share a process address space
Writes to the shared address space by one thread are visible to reads by other threads
– Usual model: shared code, private stack, some shared heap, some private heap (shared page table!)
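A minimal pthreads sketch of this model (illustrative, not from the lecture): both threads live in one process address space, so an ordinary store by one thread is visible to an ordinary load by another:

#include <pthread.h>
#include <stdio.h>

/* Both threads share one process address space, so the producer's
   store is visible to main's load -- no explicit messages needed. */
static int shared_value;

static void *producer(void *arg) {
    (void)arg;
    shared_value = 42;   /* communicate via an ordinary store */
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, producer, NULL);
    pthread_join(t, NULL);   /* join orders the store before our load */
    printf("read %d from shared memory\n", shared_value);
    return 0;
}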

ENGS 116 Lecture 15, Slide 10: Example: Small-Scale MP Designs
Memory: centralized, with uniform memory access (“UMA”) and bus interconnect, I/O
Examples: Sun Enterprise, SGI Challenge, Intel SystemPro
[Figure: several processors, each with one or more levels of cache, sharing a bus to main memory and the I/O system.]

ENGS 116 Lecture 15, Slide 11: SMP Interconnect
Processors connected to memory AND to I/O
Bus-based: all memory locations have equal access time, so SMP = “Symmetric MP”
– Sharing limits BW as we add processors and I/O
Crossbar: expensive to expand
Multistage network: less expensive to expand than a crossbar, with more BW than a bus
“Dance Hall” designs: all processors on the left, all memories on the right

ENGS 116 Lecture 15, Slide 12: Large-Scale MP Designs
Memory: distributed, with nonuniform memory access (“NUMA”) and a scalable interconnect (distributed memory)
[Figure: many processor + cache nodes, each with local memory and I/O, joined by a low-latency, high-reliability interconnection network. Typical access times: 1 cycle to the local cache, 40 cycles to local memory, 100 cycles across the network to remote memory.]

ENGS 116 Lecture 15, Slide 13: Shared Address Model Summary
Each processor can name every physical location in the machine
Each process can name all data it shares with other processes
Data transfer via load and store
Data size: byte, word, ... or cache blocks
Uses virtual memory to map virtual addresses to local or remote physical addresses
Memory hierarchy model applies: communication now moves data into the local processor’s cache (just as a load moves data from memory to cache)
– Latency, BW, scalability of communication?

ENGS 116 Lecture 15, Slide 14: Message Passing Model
Whole computers (CPU, memory, I/O devices) communicate via explicit I/O operations
– Essentially NUMA, but integrated at the I/O devices instead of the memory system
Send specifies a local buffer + the receiving process on the remote computer
Receive specifies the sending process on the remote computer + a local buffer to place the data
– Usually send includes a process tag, and receive has a rule on tags: match one, match any
– Synchronization: when send completes, when the buffer is free, when the request is accepted, receive waits for send
Send + receive => memory-to-memory copy, where each side supplies a local address, AND they perform pairwise synchronization!
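A hedged sketch of this send/receive pairing using MPI (the tag value and buffer names are illustrative assumptions):

#include <mpi.h>
#include <stdio.h>

/* Two-process message-passing sketch: rank 0 sends a buffer, rank 1
   receives it. Send names the destination rank + a tag; receive names
   the source rank + the tag and a local buffer, as described above. */
int main(int argc, char **argv) {
    int rank, data = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        data = 99;
        MPI_Send(&data, 1, MPI_INT, /*dest=*/1, /*tag=*/7, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&data, 1, MPI_INT, /*source=*/0, /*tag=*/7,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", data);
    }
    MPI_Finalize();
    return 0;
}

Note how the receive names both the sending process (source rank) and a tag rule, exactly as the slide describes.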

ENGS 116 Lecture 15, Slide 15: Logical View of a Message Passing System
[Figure: two processors, each running a process, connected through a communications network; each process has a port with a receiver’s buffer or queue into which messages are delivered.]

ENGS 116 Lecture 15, Slide 16: Message Passing Model (continued)
Send + receive => memory-to-memory copy, with synchronization through the OS, even on 1 processor
History of message passing:
– Network topology was important, because a node could only send to its immediate neighbors
– Typically synchronous, blocking send & receive
– Later, DMA with non-blocking sends; DMA on receive into a buffer until the processor issues a receive, at which point the data is transferred to local memory
– Later, SW libraries to allow arbitrary communication
Example: IBM SP-2, RS6000 workstations in racks
– Network Interface Card has an Intel 960 processor
– 8 × 8 crossbar switch as the communication building block
– 40 MB/sec per link

ENGS 116 Lecture 15, Slide 17: Message Passing Model — Pipes
A special type of message passing: the pipe
– A pipe is a point-to-point connection
– Data is pushed in one way by the first process and received, in order, by the second process
– Data can stay in the pipe indefinitely
– Asynchronous, simplex communication
Advantages:
– Simple implementation
– Works with send/receive calls
– Good way of implementing producer-consumer systems
– No round-trip latency, since communication is one-way
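A minimal POSIX sketch of such a pipe (illustrative): the parent pushes bytes in one end and the child receives them, in order, from the other:

#include <unistd.h>
#include <stdio.h>
#include <string.h>

/* Point-to-point pipe: parent pushes bytes one way, child reads them
   out in order -- simplex and asynchronous, since the data can sit
   buffered in the pipe until the reader drains it. */
int main(void) {
    int fd[2];
    char buf[32];

    if (pipe(fd) == -1) return 1;

    if (fork() == 0) {                       /* child: the consumer */
        close(fd[1]);                        /* close unused write end */
        ssize_t n = read(fd[0], buf, sizeof buf - 1);
        if (n > 0) { buf[n] = '\0'; printf("consumer got: %s\n", buf); }
        return 0;
    }
    close(fd[0]);                            /* parent: the producer */
    write(fd[1], "hello", strlen("hello"));  /* push data one way */
    close(fd[1]);
    return 0;
}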

ENGS 116 Lecture 15, Slide 18: Communication Models
Shared memory
– Processors communicate through a shared address space
– Easy on small-scale machines
– Advantages: model of choice for uniprocessors and small-scale MPs; ease of programming; lower latency; easier to use hardware-controlled caching
Message passing
– Processors have private memories and communicate via messages
– Advantages: less hardware, easier to design; focuses attention on costly non-local operations
Either SW model can be supported on either HW base

ENGS 116 Lecture 15, Slide 19: How Do We Develop a Parallel Program?
Convert a sequential program to a parallel program:
– Partition the program into tasks and combine tasks into processes
– Coordinate data accesses, communication, and synchronization
– Map the processes to processors
Or create a parallel program from scratch
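As a small worked example of these steps (a sketch with illustrative names), the code below converts a sequential array sum into a parallel one: the loop is partitioned into two index ranges (tasks), the tasks are mapped onto two pthreads, and the joins coordinate completion:

#include <pthread.h>
#include <stdio.h>

#define N 1000
static int a[N];
static long partial[2];          /* one private result slot per task */

struct range { int lo, hi, id; };

/* Task: sum one partition of the array (the "partition" step). */
static void *sum_task(void *arg) {
    struct range *r = arg;
    long s = 0;
    for (int i = r->lo; i < r->hi; i++) s += a[i];
    partial[r->id] = s;          /* private slot: no locking needed */
    return NULL;
}

int main(void) {
    for (int i = 0; i < N; i++) a[i] = 1;

    pthread_t t[2];
    struct range r[2] = { {0, N / 2, 0}, {N / 2, N, 1} };
    for (int i = 0; i < 2; i++)            /* "map" tasks onto threads */
        pthread_create(&t[i], NULL, sum_task, &r[i]);
    for (int i = 0; i < 2; i++)            /* "coordinate": join as barrier */
        pthread_join(t[i], NULL);

    printf("sum = %ld\n", partial[0] + partial[1]);
    return 0;
}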

ENGS 116 Lecture 15, Slide 20: Review: Parallel Framework
Layers:
– Programming Model: multiprogramming (lots of jobs, no communication); shared address space (communicate via memory); message passing (send and receive messages); data parallel (several agents operate on several data sets simultaneously, then exchange information globally and simultaneously, shared or message passing)
– Communication Abstraction: shared address space (e.g., load, store, atomic swap); message passing (e.g., send, receive library calls); debate over this topic (ease of programming, scaling)
Layer stack: Programming Model / Communication Abstraction / Interconnection SW/OS / Interconnection HW

ENGS 116 Lecture 15, Slide 21: Fundamental Issues
Fundamental issues in parallel machines:
1) Naming
2) Synchronization
3) Latency and Bandwidth

ENGS 116 Lecture 15, Slide 22: Fundamental Issue #1: Naming
Naming:
– what data is shared
– how it is addressed
– what operations can access the data
– how processes refer to each other
The choice of naming affects the code produced by a compiler: via load, where we just remember an address, versus keeping track of a processor number and a local virtual address for message passing
The choice of naming affects replication of data: via load in a cache memory hierarchy, or via SW replication and consistency

ENGS 116 Lecture 15, Slide 23: Fundamental Issue #1: Naming (continued)
Global physical address space: any processor can generate the address and access the location in a single operation
Global virtual address space: the address space of each process can be configured to contain all shared data of the parallel program
Segmented shared address space: locations are named uniformly for all processes of the parallel program
Independent local physical addresses

ENGS 116 Lecture 15, Slide 24: Fundamental Issue #2: Synchronization
To cooperate, processes must coordinate
Message passing provides implicit coordination with the transmission or arrival of data
A shared address system needs additional operations to coordinate explicitly: e.g., write a flag, awaken a thread, interrupt a processor
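A minimal pthreads sketch of the “write a flag, awaken a thread” coordination named above (names are illustrative):

#include <pthread.h>

/* Explicit coordination in a shared address space: the producer writes
   a flag and wakes the waiter, instead of the coordination being
   implicit in a message's arrival. */
static pthread_mutex_t m  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;
static int data_ready = 0;

void signal_ready(void) {
    pthread_mutex_lock(&m);
    data_ready = 1;              /* write the flag */
    pthread_cond_signal(&cv);    /* awaken the waiting thread */
    pthread_mutex_unlock(&m);
}

void wait_ready(void) {
    pthread_mutex_lock(&m);
    while (!data_ready)          /* re-check guards against spurious wakeups */
        pthread_cond_wait(&cv, &m);
    pthread_mutex_unlock(&m);
}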

ENGS 116 Lecture 15, Slide 25: Fundamental Issue #3: Latency and Bandwidth
Bandwidth
– Need high bandwidth in communication
– Cannot scale perfectly, but should stay close
– Match the limits in network, memory, and processor
– The overhead to communicate is a problem in many machines
Latency
– Affects performance, since the processor may have to wait
– Affects ease of programming, since it requires more thought to overlap communication and computation
Latency hiding
– How can a mechanism help hide latency?
– Examples: overlap message send with computation, prefetch data, switch to other tasks
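The first latency-hiding example above — overlapping a message send with computation — looks roughly like this with MPI’s non-blocking primitives (a sketch assuming MPI as the message-passing layer; the wrapper name send_and_compute is illustrative):

#include <mpi.h>

/* Latency hiding: start the send, compute while the message is in
   flight, and only block for completion when we must. */
void send_and_compute(double *buf, int n, int dest,
                      double *local, int m) {
    MPI_Request req;
    MPI_Isend(buf, n, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD, &req);

    for (int i = 0; i < m; i++)         /* useful work overlaps the send */
        local[i] = local[i] * 2.0 + 1.0;

    MPI_Wait(&req, MPI_STATUS_IGNORE);  /* wait only after computing */
}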

ENGS 116 Lecture 15, Slide 26: Small-Scale — Shared Memory
Caches serve to:
– Increase bandwidth versus bus/memory
– Reduce the latency of access
– Valuable for both private data and shared data
What about cache coherence?
[Figure: several processors, each with one or more levels of cache, sharing a bus to main memory and the I/O system.]

ENGS 116 Lecture 15, Slide 27: The Problem — Cache Coherency
[Figure: a CPU with a cache holding copies A’ and B’ of memory locations A and B, plus an I/O device, shown in three scenarios:
(a) Cache and memory coherent: A’ = A & B’ = B
(b) After “Output A gives 100” — cache and memory incoherent: A’ ≠ A (A stale)
(c) After “Input 400 to B” — cache and memory incoherent: B’ ≠ B (B’ stale)]

ENGS 116 Lecture 15, Slide 28: What Does Coherency Mean?
Informally:
– “Any read must return the most recent write”
– Too strict and too difficult to implement
Better:
– “Any write must eventually be seen by a read”
– All writes are seen in proper order (“serialization”)
Two rules to ensure this:
– “If P writes x and P1 reads it, P’s write will be seen by P1 if the read and write are sufficiently far apart”
– Writes to a single location are serialized: seen in one order; the latest write will be seen; otherwise reads could see writes in an illogical order (an older value after a newer value)

ENGS 116 Lecture 1529 The cache-coherence problem for a single memory location (X), read and written by two processors (A and B). This example assumes a write-through cache.

ENGS 116 Lecture 15, Slide 30: Potential HW Coherency Solutions
Snooping solution (snoopy bus):
– Send all requests for data to all processors
– Processors snoop to see if they have a copy and respond accordingly
– Requires broadcast, since caching information is kept at the processors
– Works well with a bus (a natural broadcast medium)
– Dominates for small-scale machines (most of the market)
Directory-based schemes:
– Keep track of what is being shared in one centralized place
– Distributed memory => distributed directory for scalability (avoids bottlenecks)
– Send point-to-point requests to processors via the network
– Scale better than snooping
– Actually existed BEFORE snooping-based schemes
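A hedged C sketch of what a directory-based scheme keeps per memory block (the field names and the 64-processor limit are assumptions for illustration):

#include <stdint.h>

/* One directory entry per memory block: which caches hold a copy
   (presence bits) and whether some cache holds it modified. The home
   node consults this to send point-to-point invalidations instead of
   broadcasting on a bus. */
enum dir_state { DIR_UNCACHED, DIR_SHARED, DIR_MODIFIED };

struct dir_entry {
    enum dir_state state;
    uint64_t presence;            /* bit i set => processor i has a copy */
};

/* On a write miss, invalidate every sharer, then record the new owner. */
void dir_write_miss(struct dir_entry *e, int writer,
                    void (*send_invalidate)(int proc)) {
    for (int p = 0; p < 64; p++)
        if ((e->presence & (1ULL << p)) && p != writer)
            send_invalidate(p);   /* point-to-point, not broadcast */
    e->presence = 1ULL << writer;
    e->state = DIR_MODIFIED;
}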

ENGS 116 Lecture 15, Slide 31: Basic Snoopy Protocols
Write-invalidate protocol:
– Multiple readers, single writer
– Write to shared data: an invalidate is sent to all caches, which snoop and invalidate any copies
– Read miss: write-through — memory is always up to date; write-back — snoop in the caches to find the most recent copy
Write-broadcast protocol (typically write-through):
– Write to shared data: broadcast on the bus; processors snoop and update any copies
– Read miss: memory is always up to date
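The write-invalidate protocol can be sketched as a per-block state machine; below is a simplified MSI-style sketch in C (an illustration, not the exact protocol from the text):

/* Simplified MSI write-invalidate sketch: the state of one cache block. */
enum msi { INVALID, SHARED, MODIFIED };

/* Local processor writes the block: gain exclusive ownership by
   placing an invalidate on the bus, which other caches snoop. */
enum msi on_cpu_write(enum msi s, void (*bus_invalidate)(void)) {
    if (s != MODIFIED)
        bus_invalidate();        /* all other copies become INVALID */
    return MODIFIED;             /* single writer */
}

/* Another cache's invalidate seen on the bus: drop our copy. */
enum msi on_snoop_invalidate(enum msi s) {
    (void)s;
    return INVALID;
}

/* Another cache's read miss seen on the bus: with write-back caches,
   the owner of a dirty copy supplies the data and drops to SHARED. */
enum msi on_snoop_read(enum msi s, void (*supply_data)(void)) {
    if (s == MODIFIED)
        supply_data();           /* most recent copy comes from our cache */
    return (s == INVALID) ? INVALID : SHARED;
}

on_snoop_read is where the write-back read-miss case above shows up: the cache holding the modified copy, not memory, supplies the most recent data.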

ENGS 116 Lecture 15, Slide 32:
[Figure: an example of an invalidation protocol working on a snooping bus for a single cache block (X) with write-back caches.]

ENGS 116 Lecture 15, Slide 33:
[Figure: an example of a write-update (broadcast) protocol working on a snooping bus for a single cache block (X) with write-back caches.]

ENGS 116 Lecture 15, Slide 34: Basic Snoopy Protocols — Invalidate vs. Broadcast
Write invalidate versus write broadcast:
– Invalidate requires one bus transaction per write-run (a sequence of writes to one location by the same CPU); e.g., if a CPU writes X ten times before another CPU reads it, invalidate costs one invalidation where broadcast costs ten bus updates
– Invalidate exploits spatial locality: one transaction per block
– Broadcast has lower latency between a write and the subsequent read on another processor