CS 213: Parallel Processing Architectures

Slides:



Advertisements
Similar presentations
CMSC 611: Advanced Computer Architecture
Advertisements

Distributed Systems CS
SE-292 High Performance Computing
Multiprocessors— Large vs. Small Scale Multiprocessors— Large vs. Small Scale.
CS 213: Parallel Processing Architectures Laxmi Narayan Bhuyan Lecture3.
Cache Coherent Distributed Shared Memory. Motivations Small processor count –SMP machines –Single shared memory with multiple processors interconnected.
Multiprocessors CSE 4711 Multiprocessors - Flynn’s Taxonomy (1966) Single Instruction stream, Single Data stream (SISD) –Conventional uniprocessor –Although.
1 Burroughs B5500 multiprocessor. These machines were designed to support HLLs, such as Algol. They used a stack architecture, but part of the stack was.
1 Introduction to MIMD Architectures Sima, Fountain and Kacsuk Chapter 15 CSE462.
Multiprocessors ELEC 6200: Computer Architecture and Design Instructor : Agrawal Name: Nam.
CS 213: Parallel Processing Architectures Laxmi Narayan Bhuyan
Multiprocessors CSE 471 Aut 011 Multiprocessors - Flynn’s Taxonomy (1966) Single Instruction stream, Single Data stream (SISD) –Conventional uniprocessor.
ENGS 116 Lecture 151 Multiprocessors and Thread-Level Parallelism Vincent Berk November 12 th, 2008 Reading for Friday: Sections Reading for Monday:
DAP Spr.‘98 ©UCB 1 Lecture 18: Review. DAP Spr.‘98 ©UCB 2 Cache Organization (1) How do you know if something is in the cache? (2) If it is in the cache,
Parallel Processing Architectures Laxmi Narayan Bhuyan
Arquitectura de Sistemas Paralelos e Distribuídos Paulo Marques Dep. Eng. Informática – Universidade de Coimbra Ago/ Machine.
CPE 731 Advanced Computer Architecture Multiprocessor Introduction
CMSC 611: Advanced Computer Architecture Parallel Computation Most slides adapted from David Patterson. Some from Mohomed Younis.
Parallel Architectures
Multiple Processor Systems. Multiprocessor Systems Continuous need for faster and powerful computers –shared memory model ( access nsec) –message passing.
Chapter 2 Parallel Architecture. Moore’s Law The number of transistors on a chip doubles every years. – Has been valid for over 40 years – Can’t.
Multiprocessing. Going Multi-core Helps Energy Efficiency William Holt, HOT Chips 2005 Adapted from UC Berkeley "The Beauty and Joy of Computing"
Spring 2003CSE P5481 Issues in Multiprocessors Which programming model for interprocessor communication shared memory regular loads & stores message passing.
Orange Coast College Business Division Computer Science Department CS 116- Computer Architecture Multiprocessors.
MBG 1 CIS501, Fall 99 Lecture 23: Intro to Multi-processors Michael B. Greenwald Computer Architecture CIS 501 Fall 1999.
Outline Why this subject? What is High Performance Computing?
3/12/2013Computer Engg, IIT(BHU)1 PARALLEL COMPUTERS- 2.
1 Lecture 17: Multiprocessors Topics: multiprocessor intro and taxonomy, symmetric shared-memory multiprocessors (Sections )
CDA-5155 Computer Architecture Principles Fall 2000 Multiprocessor Architectures.
CMSC 611: Advanced Computer Architecture Shared Memory Most slides adapted from David Patterson. Some from Mohomed Younis.
Lecture 13 Parallel Processing. 2 What is Parallel Computing? Traditionally software has been written for serial computation. Parallel computing is the.
Advanced Computer Architecture CS 704 Advanced Computer Architecture Lecture 34 Multiprocessors (Shared Memory Architectures) Prof. Dr. M. Ashraf Chughtai.
Group Members Hamza Zahid (131391) Fahad Nadeem khan Abdual Hannan AIR UNIVERSITY MULTAN CAMPUS.
These slides are based on the book:
Auburn University COMP8330/7330/7336 Advanced Parallel and Distributed Computing Parallel Hardware Dr. Xiao Qin Auburn.
18-447: Computer Architecture Lecture 30B: Multiprocessors
Multiprocessor Systems
CMSC 611: Advanced Computer Architecture
CS5102 High Performance Computer Systems Thread-Level Parallelism
Parallel Computers Definition: “A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast.”
Parallel Processing - introduction
Course Outline Introduction in algorithms and applications
The University of Adelaide, School of Computer Science
CS 147 – Parallel Processing
Multi-Processing in High Performance Computer Architecture:
CMSC 611: Advanced Computer Architecture
Chapter 6 Multiprocessors and Thread-Level Parallelism
Lecture 1: Parallel Architecture Intro
CPE 631 Lecture 18: Multiprocessors
Kai Bu 13 Multiprocessors So today, we’ll finish the last part of our lecture sessions, multiprocessors.
Chapter 17 Parallel Processing
Multiprocessors - Flynn’s taxonomy (1966)
Multiple Processor Systems
Introduction to Multiprocessors
CPE 631 Session 20: Multiprocessors
Parallel Processing Architectures
Lecture 24: Memory, VM, Multiproc
Distributed Systems CS
Shared Memory. Distributed Memory. Hybrid Distributed-Shared Memory.
Chapter 4 Multiprocessors
The University of Adelaide, School of Computer Science
Lecture 17 Multiprocessors and Thread-Level Parallelism
Lecture 24: Virtual Memory, Multiprocessors
Lecture 23: Virtual Memory, Multiprocessors
Multiprocessor and Thread-Level Parallelism Chapter 4
Database System Architectures
Lecture 17 Multiprocessors and Thread-Level Parallelism
CPE 631 Lecture 20: Multiprocessors
The University of Adelaide, School of Computer Science
Lecture 17 Multiprocessors and Thread-Level Parallelism
Presentation transcript:

CS 213: Parallel Processing Architectures Laxmi Narayan Bhuyan http://www.cs.ucr.edu/~bhuyan Lecture3

Amdahl’s Law and Parallel Computers Amdahl’s Law (f: original fraction sequential) Speedup = 1 / [(f + (1-f)/n] = n/[1+(n-1)f], where n = No. of processors A portion f is sequential => limits parallel speedup Speedup <= 1/ f Ex. What fraction sequential to get 80X speedup from 100 processors? Assume either 1 processor or 100 fully used 80 = 1 / [(f + (1-f)/100] => f = 0.0025 Only 0.25% sequential! => Must be a highly parallel program

Popular Flynn Categories SISD (Single Instruction Single Data) Uniprocessors MISD (Multiple Instruction Single Data) ???; multiple processors on a single data stream SIMD (Single Instruction Multiple Data) Examples: Illiac-IV, CM-2 Simple programming model Low overhead Flexibility All custom integrated circuits (Phrase reused by Intel marketing for media instructions ~ vector) MIMD (Multiple Instruction Multiple Data) Examples: Sun Enterprise 5000, Cray T3D, SGI Origin Flexible Use off-the-shelf micros MIMD current winner: Concentrate on major design emphasis <= 128 processor MIMD machines

Classification of Parallel Processors SIMD – EX: Illiac IV and Maspar

Data Parallel Model Operations can be performed in parallel on each element of a large regular data structure, such as an array 1 Control Processor (CP) broadcasts to many PEs. The CP reads an instruction from the control memory, decodes the instruction, and broadcasts control signals to all PEs. Condition flag per PE so that can skip Data distributed in each memory Early 1980s VLSI => SIMD rebirth: 32 1-bit PEs + memory on a chip was the PE Data parallel programming languages lay out data to processor

Classification of Parallel Processors MIMD - True Multiprocessors 1. Message Passing Multiprocessor - Interprocessor communication through explicit message passing through “send” and “receive operations. EX: IBM SP2, Cray XD1, and Clusters 2. Shared Memory Multiprocessor – All processors share the same address space. Interprocessor communication through load/store operations to a shared memory. EX: SMP Servers, SGI Origin, HP V-Class, Cray T3E Their advantages and disadvantages?

Communication Models Shared Memory Message passing Processors communicate with shared address space Easy on small-scale machines Advantages: Model of choice for uniprocessors, small-scale MPs Ease of programming Lower latency Easier to use hardware controlled caching Message passing Processors have private memories, communicate via messages Less hardware, easier to design Good scalability Focuses attention on costly non-local operations Virtual Shared Memory (VSM)

Message Passing Model Whole computers (CPU, memory, I/O devices) communicate as explicit I/O operations Essentially NUMA but integrated at I/O devices vs. memory system Send specifies local buffer + receiving process on remote computer Receive specifies sending process on remote computer + local buffer to place data Usually send includes process tag and receive has rule on tag: match 1, match any Synch: when send completes, when buffer free, when request accepted, receive wait for send Send+receive => memory-memory copy, where each supplies local address, AND does pair-wise synchronization!

Shared Address/Memory Multiprocessor Model Communicate via Load and Store Oldest and most popular model Based on timesharing: processes on multiple processors vs. sharing single processor process: a virtual address space and ~ 1 thread of control Multiple processes can overlap (share), but ALL threads share a process address space Writes to shared address space by one thread are visible to reads of other threads Usual model: share code, private stack, some shared heap, some private heap

Advantages shared-memory communication model Compatibility with SMP hardware Ease of programming when communication patterns are complex or vary dynamically during execution Ability to develop apps using familiar SMP model, attention only on performance critical accesses Lower communication overhead, better use of BW for small items, due to implicit communication and memory mapping to implement protection in hardware, rather than through I/O system HW-controlled caching to reduce remote comm. by caching of all data, both shared and private.

More Message passing Computers Cluster: Computers connected over high-bandwidth local area network (Ethernet or Myrinet) used as a parallel computer Network of Workstations (NOW): Homogeneous cluster – same type computers Grid: Computers connected over wide area network

Advantages message-passing communication model The hardware can be simpler Communication explicit => simpler to understand; in shared memory it can be hard to know when communicating and when not, and how costly it is Explicit communication focuses attention on costly aspect of parallel computation, sometimes leading to improved structure in multiprocessor program Synchronization is naturally associated with sending messages, reducing the possibility for errors introduced by incorrect synchronization Easier to use sender-initiated communication, which may have some advantages in performance

Another Classification for MIMD Computers Centralized Memory: Shared memory located at centralized location – may consist of several interleaved modules – same distance from any processor – Symmetric Multiprocessor (SMP) – Uniform Memory Access (UMA) Distributed Memory: Memory is distributed to each processor – improves scalability (a) Message passing architectures – No processor can directly access another processor’s memory (b) Hardware Distributed Shared Memory (DSM) Multiprocessor – Memory is distributed, but the address space is shared (c) Software DSM – A level of o/s built on top of message passing multiprocessor to give a shared memory view to the programmer.

Software DSM Advantages: Scalability, get more memory bandwidth, lower local memory latency Drawback: Longer remote communication latency, Software model more complex

Major Shared Memory Styles Centralized shared memory ("Uniform Memory Access" time or "Shared Memory Processor") Decentralized Shared memory (memory module with CPU) Advantages: Scalability, get more memory bandwidth, lower local memory latency Drawback: Longer remote communication latency, Software model more complex

Symmetric Multiprocessor (SMP) Memory: centralized with uniform access time (“uma”) and bus interconnect Examples: Sun Enterprise 5000 , SGI Challenge, Intel SystemPro

SMP Interconnect Processors to Memory AND to I/O Bus based: all memory locations equal access time so SMP = “Symmetric MP” Can have interleaved memories Performance limited by bus bandwidth Crossbar based: All memory access times are equal = SMP Provides higher bandwidth with interleaved memories Difficult to scale due to centralized control

Distributed Shared Memory – Non-Uniform Shared Memory Access (NUMA)

Cache-Coherent Non-Uniform Memory Access Machine (CC-NUMA) Memory distributed to each processor, but the address space is shared => Offers scalability of message passing but shared memory programming with low latency Non-uniform Memory Access (NUMA) time depending on the memory location Each processor has a local cache, the cache coherence is maintained by hardware (through Directory) => CC-NUMA

Scalable, High Perf. Interconnection Network At Core of Parallel Computer Arch. Requirements and trade-offs at many levels Elegant mathematical structure Deep relationships to algorithm structure Managing many traffic flows Electrical / Optical link properties Little consensus interactions across levels Performance metrics? Cost metrics? Workload? => Need holistic understanding

Performance Metrics: Latency and Bandwidth Need high bandwidth in communication Match limits in network, memory, and processor Challenge is link speed of network interface vs. bisection bandwidth of network Latency Affects performance, since processor may have to wait Affects ease of programming, since requires more thought to overlap communication and computation Overhead to communicate is a problem in many machines Latency Hiding How can a mechanism help hide latency? Increases programming system burden Examples: overlap message send with computation, prefetch data, switch to other tasks

Fundamental Issues 3 Issues to characterize parallel machines 1) Naming 2) Synchronization 3) Performance: Latency and Bandwidth

Fundamental Issue #1: Naming Naming: how to solve large problem fast what data is shared how it is addressed what operations can access data how processes refer to each other Choice of naming affects code produced by a compiler; via load where just remember address or keep track of processor number and local virtual address for msg. passing Choice of naming affects replication of data; via load in cache memory hierarchy or via SW replication and consistency

Fundamental Issue #1: Naming Global physical address space: any processor can generate, address and access it in a single operation memory can be anywhere: virtual addr. translation handles it Global virtual address space: if the address space of each process can be configured to contain all shared data of the parallel program Segmented shared address space: locations are named <process number, address> uniformly for all processes of the parallel program

Fundamental Issue #2: Synchronization To cooperate, processes must coordinate Message passing is implicit coordination with transmission or arrival of data Shared address => additional operations to explicitly coordinate: e.g., write a flag, awaken a thread, interrupt a processor

Summary: Parallel Framework Programming Model Communication Abstraction Interconnection SW/OS Interconnection HW Programming Model: Multiprogramming : lots of jobs, no communication Shared address space: communicate via memory Message passing: send and recieve messages Data Parallel: several agents operate on several data sets simultaneously and then exchange information globally and simultaneously (shared or message passing) Communication Abstraction: Shared address space: e.g., load, store, atomic swap Message passing: e.g., send, recieve library calls Debate over this topic (ease of programming, scaling) => many hardware designs 1:1 programming model