Case Study in Computational Science & Engineering - Lecture 2

Slide 1: Parallel Architecture Models
- Shared Memory – Dual/Quad Pentium, Cray T90, IBM Power3 node
- Distributed Memory – Cray T3E, IBM SP2, networks of workstations
- Distributed Shared Memory – SGI Origin 2000, Convex Exemplar

Slide 2: Shared Memory Systems (SMP)
[Figure: four processor/cache pairs (P/c) attached to shared memory over a bus]
- Any processor can access any memory location at equal cost (Symmetric Multi-Processor)
- Tasks "communicate" by writing and reading common memory locations
- Easier to program
- Cannot scale beyond around 30 PEs (bus bottleneck)
- Most workstation vendors make SMPs today (SGI, Sun, HP, Digital; Pentium)
- Cray Y-MP, C90, T90 (crossbar between PEs and memory)

Slide 3: Cache Coherence in SMPs
[Figure: four processor/cache pairs (P/c) attached to shared memory over a bus]
- Each processor's cache holds the most recently accessed values
- If a word cached in several caches is modified, all copies must be made consistent
- Bus-based SMPs use an efficient mechanism for this: the snoopy bus
- The snoopy bus monitors all writes and marks other cached copies invalid
- When a processor finds a cache word invalid, it fetches a fresh copy from shared memory
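The cost of keeping copies consistent shows up even when threads never touch the same word. The small free-form Fortran/OpenMP program below is an illustration of false sharing added here (it is not part of the lecture): with pad = 1 the four per-thread counters share one cache line, so every write makes the snoopy protocol invalidate the other caches' copies and the line ping-pongs between processors; with pad = 16 each counter sits in its own line and the coherence traffic disappears.

    ! Illustration only (not from the lecture): time this with pad = 1
    ! versus pad = 16 to see false sharing and coherence traffic.
    program false_sharing
      use omp_lib
      implicit none
      integer, parameter :: nthreads = 4
      integer, parameter :: pad = 16      ! set to 1 to force false sharing
      integer(8) :: counts(pad*nthreads)
      integer    :: t
      integer(8) :: i
      counts = 0
      call omp_set_num_threads(nthreads)
      !$omp parallel private(t, i) shared(counts)
      t = omp_get_thread_num()
      do i = 1, 50000000_8
         ! each thread writes only its own counter, yet with pad = 1 all
         ! counters occupy one cache line, so the writes still conflict
         counts(pad*t + 1) = counts(pad*t + 1) + 1
      end do
      !$omp end parallel
      print *, 'total increments:', sum(counts)
    end program false_sharing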

Slide 4: Distributed Memory Systems
[Figure: four nodes, each a processor (P) with cache (c), memory (M), and network interface card (NIC), joined by an interconnection network]
- Each processor can only access its own memory
- Explicit communication by sending and receiving messages
- More tedious to program
- Can scale to hundreds or thousands of processors
- Cache coherence is not needed
- Examples: IBM SP-2, Cray T3E, workstation clusters

Slide 5: Distributed Shared Memory
[Figure: four processor/cache pairs reaching physically distributed memories through an interconnection network]
- Each processor can directly access any memory location
- Physically distributed memory; many simultaneous accesses
- Non-uniform memory access costs
- Examples: Convex Exemplar, SGI Origin 2000
- Complex hardware and high cost of cache coherence
- Software DSM systems (e.g. TreadMarks) implement the shared-memory abstraction on top of distributed memory systems

Slide 6: Parallel Programming Models
- Shared-address-space models:
  – BSP (Bulk Synchronous Parallel model)
  – HPF (High Performance Fortran)
  – OpenMP
- Message passing – partitioned address space:
  – PVM, MPI [Ch. 8 of I. Foster's book "Designing and Building Parallel Programs" (available online)]
- Higher-level programming environments:
  – PETSc: Portable, Extensible Toolkit for Scientific Computation
  – POOMA: Parallel Object-Oriented Methods and Applications

Slide 7: OpenMP
- Standard sequential Fortran/C model
- Single global view of data
- Automatic parallelization by the compiler
- User can provide loop-level directives
- Easy to program
- Only available on shared-memory machines

Slide 8: High Performance Fortran
- Global shared address space, similar to the sequential programming model
- User provides data-mapping directives
- User can provide information on loop-level parallelism
- Portable: available on all three types of architecture
- Compiler automatically synthesizes message-passing code where needed
- Restricted to dense arrays and regular distributions
- Performance is not consistently good

Slide 9: Message Passing
- A program is a collection of tasks
- Each task can only read and write its own data
- Tasks communicate data by explicitly sending and receiving messages
- Porting a sequential program requires translating the global shared view into a local partitioned view
- Tedious to program and debug
- Very good performance

Slide 10: Illustrative Example
[Figure: arrays a(20,20) and b(20,20)]

    Real a(n,n), b(n,n)
    Do k = 1, NumIter
      Do i = 2, n-1
        Do j = 2, n-1
          a(i,j) = (b(i-1,j) + b(i,j-1) + b(i+1,j) + b(i,j+1))/4
        End Do
      End Do
      Do i = 2, n-1
        Do j = 2, n-1
          b(i,j) = a(i,j)
        End Do
      End Do
    End Do

Slide 11: Example: OpenMP
[Figure: arrays a(20,20) and b(20,20) – global shared view of data]

    Real a(n,n), b(n,n)
    c$omp parallel shared(a,b) private(i,j,k)
    Do k = 1, NumIter
    c$omp do
      Do i = 2, n-1
        Do j = 2, n-1
          a(i,j) = (b(i-1,j) + b(i,j-1) + b(i+1,j) + b(i,j+1))/4
        End Do
      End Do
    c$omp do
      Do i = 2, n-1
        Do j = 2, n-1
          b(i,j) = a(i,j)
        End Do
      End Do
    End Do
    c$omp end parallel

Slide 12: Example: HPF (1D partition)
[Figure: arrays a(20,20) and b(20,20) partitioned by rows across P0–P3 – global shared view of data]

    Real a(n,n), b(n,n)
    chpf$ Distribute a(block,*), b(block,*)
    Do k = 1, NumIter
    chpf$ independent, new(j)
      Do i = 2, n-1
        Do j = 2, n-1
          a(i,j) = (b(i-1,j) + b(i,j-1) + b(i+1,j) + b(i,j+1))/4
        End Do
      End Do
    chpf$ independent, new(j)
      Do i = 2, n-1
        Do j = 2, n-1
          b(i,j) = a(i,j)
        End Do
      End Do
    End Do

Slide 13: Example: HPF (2D partition)
[Figure: arrays a(20,20) and b(20,20) partitioned into 2-D blocks – global shared view of data]

    Real a(n,n), b(n,n)
    chpf$ Distribute a(block,block)
    chpf$ Distribute b(block,block)
    Do k = 1, NumIter
    chpf$ independent, new(j)
      Do i = 2, n-1
        Do j = 2, n-1
          a(i,j) = (b(i-1,j) + b(i,j-1) + b(i+1,j) + b(i,j+1))/4
        End Do
      End Do
    chpf$ independent, new(j)
      Do i = 2, n-1
        Do j = 2, n-1
          b(i,j) = a(i,j)
        End Do
      End Do
    End Do

Slide 14: Message Passing: Local View
[Figure: the global shared view a(20,20), b(20,20) versus the local partitioned view al(5,20), bl(5,20) on P0–P3; communication is required at the partition boundaries, and adding ghost cells extends the local array to bl(0:6,20) while al stays al(5,20)]
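To make the local partitioned view concrete, the following small program (our own, using the figure's sizes n = 20 rows over P = 4 processors, so NdivP = 5) prints the mapping from each global row to its owning processor and local row. This is exactly the bookkeeping the programmer must do by hand when porting to message passing.

    ! Illustration (ours): who owns each global row, and at which local
    ! index, for the figure's sizes n = 20 and P = 4.
    program index_map
      implicit none
      integer, parameter :: n = 20, P = 4, NdivP = n/P   ! 5 rows each
      integer :: g, owner, lrow
      do g = 1, n
         owner = (g - 1)/NdivP        ! processor that owns global row g
         lrow  = g - owner*NdivP      ! its index in the local al/bl
         print '(a,i3,a,i2,a,i2)', 'global row ', g, ' -> P', owner, &
               ', local row ', lrow
      end do
    end program index_map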

Slide 15: Example: Message Passing
[Figure: local partitioned view with ghost cells – al(5,20), bl(0:6,20); the ghost cells are communicated by message passing]

    Real al(NdivP,n), bl(0:NdivP+1,n)
    me = get_my_procnum()
    Do k = 1, NumIter
      if (me /= P-1) send(me+1, bl(NdivP,1:n))
      if (me /= 0)   recv(me-1, bl(0,1:n))
      if (me /= 0)   send(me-1, bl(1,1:n))
      if (me /= P-1) recv(me+1, bl(NdivP+1,1:n))
      i1 = 1
      if (me == 0) i1 = 2
      i2 = NdivP
      if (me == P-1) i2 = NdivP-1
      Do i = i1, i2
        Do j = 2, n-1
          al(i,j) = (bl(i-1,j) + bl(i,j-1) + bl(i+1,j) + bl(i,j+1))/4
        End Do
      End Do
      ...
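The send/recv above are generic pseudo-calls. Below is our own rendering (not from the lecture; the subroutine name is made up) of the same ghost-row exchange with actual MPI Fortran calls. MPI_Sendrecv sidesteps the deadlock risk of hand-ordering blocking sends and receives, and MPI_PROC_NULL turns the missing neighbors at the boundary ranks into no-ops; because Fortran arrays are column-major, a row of bl is strided in memory, so it is staged through contiguous buffers.

    ! Illustration only: the generic send/recv rendered with real MPI
    ! calls (Fortran bindings); the subroutine name is our own.
    subroutine exchange_ghost_rows(bl, NdivP, n, me, P)
      use mpi
      implicit none
      integer, intent(in)    :: NdivP, n, me, P
      real,    intent(inout) :: bl(0:NdivP+1, n)
      real    :: sbuf(n), rbuf(n)
      integer :: up, down, ierr, status(MPI_STATUS_SIZE)

      ! missing neighbors at the boundaries become MPI_PROC_NULL,
      ! which turns the corresponding send/recv into a no-op
      up   = merge(me + 1, MPI_PROC_NULL, me < P - 1)
      down = merge(me - 1, MPI_PROC_NULL, me > 0)

      sbuf = bl(NdivP, :)              ! my last interior row goes up...
      call MPI_Sendrecv(sbuf, n, MPI_REAL, up,   0,               &
                        rbuf, n, MPI_REAL, down, 0,               &
                        MPI_COMM_WORLD, status, ierr)
      if (me > 0) bl(0, :) = rbuf      ! ...ghost row arrives from below

      sbuf = bl(1, :)                  ! my first interior row goes down...
      call MPI_Sendrecv(sbuf, n, MPI_REAL, down, 0,               &
                        rbuf, n, MPI_REAL, up,   0,               &
                        MPI_COMM_WORLD, status, ierr)
      if (me < P - 1) bl(NdivP+1, :) = rbuf   ! ...ghost row from above
    end subroutine exchange_ghost_rows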

Slide 16: Comparison of Models
- Program porting/development effort: OpenMP = HPF << MPI
- Portability across systems: HPF = MPI >> OpenMP (shared memory only)
- Applicability: MPI = OpenMP >> HPF (dense arrays only)
- Performance: MPI > OpenMP >> HPF

Slide 17: PETSc
- Higher-level parallel programming model
- Aims to provide both ease of use and high performance for numerical PDE solution
- Uses an efficient message-passing implementation underneath, but:
  – provides a global view of data arrays
  – the system takes care of the needed message passing
- Portable across shared- and distributed-memory systems
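As a flavor of that global view, here is a minimal, hedged sketch in the Fortran interface of recent PETSc releases (our illustration, not the lecture's; exact header paths and argument conventions vary by version). The program states only the global size of a vector; PETSc chooses the partitioning across processes and performs whatever communication its operations need internally.

    ! Illustration only (ours); compile as a .F90 file so the
    ! preprocessor resolves the #include line.
    program petsc_flavor
    #include <petsc/finclude/petscvec.h>
      use petscvec
      implicit none
      Vec            x
      PetscErrorCode ierr
      PetscInt       nglobal
      PetscScalar    one

      call PetscInitialize(PETSC_NULL_CHARACTER, ierr)
      nglobal = 1000
      one     = 1.0

      ! only the *global* size is stated; PETSc decides how the vector
      ! is partitioned across the processes of PETSC_COMM_WORLD
      call VecCreate(PETSC_COMM_WORLD, x, ierr)
      call VecSetSizes(x, PETSC_DECIDE, nglobal, ierr)
      call VecSetFromOptions(x, ierr)

      ! operations are expressed on the global object; any message
      ! passing they require happens inside PETSc
      call VecSet(x, one, ierr)

      call VecDestroy(x, ierr)
      call PetscFinalize(ierr)
    end program petsc_flavor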