SE-292 High Performance Computing


SE-292 High Performance Computing
Intro to Concurrent Programming & Parallel Architecture
R. Govindarajan, govind@serc

PARALLEL ARCHITECTURE
A parallel machine is a computer system with more than one processor. Special parallel machines are designed to keep the interaction (communication) overhead between processors low.
Questions: What about multicores? What about a network of machines? Both qualify, but in a network the time involved in interaction can be high, since the system is designed assuming the machines are more or less independent. (Lec22)

Classification of Parallel Machines: Flynn's Classification
Machines are classified by their number of instruction streams and data streams. An instruction stream is a path to instruction memory (PC); a data stream is a path to data memory.
SISD: single instruction stream, single data stream.
SIMD: single instruction stream, multiple data streams.
MIMD: multiple instruction streams, multiple data streams.

PARALLEL PROGRAMMING
Recall Flynn: SIMD, MIMD. On the programming side the corresponding styles are SPMD and MPMD.
Shared memory: programming with threads. Message passing: programming with explicit messages.
Speedup = (execution time on one processor) / (execution time on n processors). (Lec24)

How Much Speedup is Possible? (Amdahl's Law)
Let s be the fraction of the sequential execution time of a given program that cannot be parallelized. Assume the program is parallelized so that the remaining part (1 - s) is perfectly divided to run in parallel across n processors. Then

Speedup(n) = 1 / (s + (1 - s)/n)

The maximum achievable speedup, 1/s, is limited by the sequential fraction of the sequential program. (Lec24)

Understanding Amdahl's Law
Example computation: min over all (i, j) of |A[i,j] - B[i,i]|, an n^2-element reduction.
[Figure: three concurrency-vs-time profiles: (a) serial execution (concurrency 1 for time n^2); (b) naive parallel (concurrency p for the n^2/p element-wise phase, then a serial combining phase); (c) parallel (the combining phase also parallelized).]

Classification 2: Shared Memory vs Message Passing
Shared memory machine: the n processors share a physical address space, and communication can be done through this shared memory. The alternative is sometimes referred to as a message passing machine or a distributed memory machine.
[Figure: a shared memory machine (processors connected through an interconnect to a common main memory) vs a message passing machine (processor-memory pairs connected by an interconnect).]

Shared Memory Machines
The shared memory could itself be distributed among the processor nodes: each processor might have some portion of the shared physical address space that is physically close to it and therefore accessible in less time.
Terms: shared vs private; local vs remote; centralized vs distributed shared; UMA vs NUMA (Non-Uniform Memory Access) architecture.

Shared Memory Architecture
[Figure: left, distributed shared memory with Non-Uniform Memory Access (NUMA): processor-cache-memory nodes connected by a network. Right, centralized shared memory with Uniform Memory Access (UMA): processors with caches connected through a network to shared memory modules.] © RG@SERC,IISc

MultiCore Structure
[Figure: a single multicore chip with eight cores C0-C7, each with a private L1 cache, sharing an L2 cache and the connection to memory.]

NUMA Architecture
[Figure: two sockets, each with cores C0-C7, private L1 and L2 caches, a shared L3 cache, and an integrated memory controller (IMC) to its local memory. The sockets are connected by QPI links, so remote memory is reached through the other socket, at higher latency.]

Distributed Memory Architecture (Cluster)
[Figure: processor-cache-memory nodes connected by a network.]
Message passing architecture: memory is private to each node; processes communicate by messages.

Parallel Architecture: Interconnections
Indirect interconnects: nodes are connected to an interconnection medium, not directly to each other. Examples: shared bus, multiple bus, crossbar, MIN (multistage interconnection network).
Direct interconnects: nodes are connected directly to each other. Topologies: linear, ring, star, mesh, torus, hypercube.
Routing techniques determine how the route taken by a message from source to destination is decided.

Indirect Interconnects
[Figure: shared bus, multiple bus, crossbar switch built from 2x2 crossbars, and a multistage interconnection network.]

Direct Interconnect Topologies
[Figure: linear array, ring, star, 2D mesh, torus, and hypercube (binary n-cube), shown for n = 2 and n = 3.]

Shared Memory Architecture: Caches
[Figure: P1 and P2 each read X = 0 into their caches; P1 writes X = 1; P2 then reads X again, hits in its own cache, and sees the stale value 0. Cache hit: wrong data!]

Cache Coherence Problem
If each processor in a shared memory multiprocessor has a data cache, there is a potential data consistency problem: the cache coherence problem. It arises when a shared variable is modified while copies sit in private caches. Objective: processes should not read stale data.
Solutions: hardware cache coherence mechanisms, or compiler-assisted (software) cache coherence.

Example: Write Once Protocol
Assumption: a shared bus interconnect where all cache controllers monitor all bus activity, called snooping. Since there is only one operation on the bus at a time, cache controllers can be built to take corrective action and enforce coherence in the caches. Corrective action could involve updating or invalidating a cache block.

Invalidation Based Cache Coherence
[Figure: P1 and P2 both read X = 0; when P1 writes X = 1, an invalidate goes out on the bus and P2's cached copy of X is invalidated; P2's next read of X misses and obtains the new value 1.]

Snoopy Cache Coherence and Locks
[Figure: one processor holds lock L and is in its critical section while the others spin, each repeatedly executing Test&Set(L); every Test&Set writes L, invalidating the other caches' copies and generating bus traffic. When the holder releases L (L = 0), one spinning processor's Test&Set obtains the lock.]

Space of Parallel Computing
Programming models (what the programmer uses in coding applications; specifies synchronization and communication): shared address space, message passing, data parallel.
Parallel architectures: shared memory, either centralized shared memory (UMA) or distributed shared memory (NUMA); and distributed memory, a.k.a. message passing, e.g. clusters.

Memory Consistency Model
The order in which memory operations will appear to execute; equivalently, what value can a read return? It is a contract between application software and the system, and it affects both ease of programming and performance.

Implicit Memory Model
Sequential consistency (SC) [Lamport]: the result of an execution appears as if operations from different processors executed in some sequential (interleaved) order, with the memory operations of each processor appearing in program order.
[Figure: processors P1, P2, P3, ..., Pn taking turns accessing a single memory.]

NUMA Architecture (with I/O)
[Figure: the two-socket NUMA system again, each socket with cores C0-C7, private L1 and L2 caches, a shared L3 cache, and an IMC to local memory; QPI links connect the sockets and an IO-Hub with PCI-E slots, through which a NIC attaches.]

Using MultiCores to Build a Cluster
[Figure: four multicore nodes (Node 0-3), each with its own memory and NIC, connected through a network switch. A message such as Send A(1:N) travels from Node 0's memory through its NIC, across the switch, to the destination node.]