CS 267: Introduction to Parallel Machines and Programming Models
James Demmel
CS267 Lecture 3, 01/24/2006



Outline
- Overview of parallel machines (~hardware) and programming models (~software)
  - Shared memory
  - Shared address space
  - Message passing
  - Data parallel
  - Clusters of SMPs
  - Grid
- A parallel machine may or may not be tightly coupled to a programming model
  - Historically, tight coupling
  - Today, portability is important
- Trends in real machines

A Generic Parallel Architecture
[Figure: processors P connected to memories M through an interconnection network]
- Where is the memory physically located?

Parallel Programming Models
- Control
  - How is parallelism created?
  - What orderings exist between operations?
  - How do different threads of control synchronize?
- Data
  - What data is private vs. shared?
  - How is logically shared data accessed or communicated?
- Operations
  - What are the atomic (indivisible) operations?
- Cost
  - How do we account for the cost of each of the above?

Simple Example
- Consider computing the sum of a function applied to an array: s = f(A[0]) + f(A[1]) + ... + f(A[n-1])
- Parallel decomposition:
  - Each evaluation and each partial sum is a task
  - Assign n/p numbers to each of p procs
  - Each computes independent "private" results and a partial sum
  - One (or all) collects the p partial sums and computes the global sum
- Two classes of data:
  - Logically shared: the original n numbers, the global sum
  - Logically private: the individual function evaluations
  - What about the individual partial sums?
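For concreteness, here is a minimal serial C sketch of this running example; the array contents and the choice of f are placeholders, not something specified on the slide.

```c
#include <stdio.h>

#define N 8

/* Placeholder for the function applied to each element. */
static double f(double x) { return x * x; }

int main(void) {
    double A[N] = {1, 2, 3, 4, 5, 6, 7, 8};
    double s = 0.0;
    /* Serial version: one evaluation and one partial sum per element. */
    for (int i = 0; i < N; i++)
        s += f(A[i]);
    printf("s = %g\n", s);
    return 0;
}
```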

Programming Model 1: Shared Memory
- Program is a collection of threads of control
  - Can be created dynamically, mid-execution, in some languages
- Each thread has a set of private variables, e.g., local stack variables
- Also a set of shared variables, e.g., static variables, shared common blocks, or global heap
- Threads communicate implicitly by writing and reading shared variables
- Threads coordinate by synchronizing on shared variables
[Figure: threads P0, P1, ..., Pn, each with private memory (e.g., its own i), all reading and writing a shared variable s in shared memory]

Shared Memory Code for Computing a Sum

  static int s = 0;

  Thread 1:
    for i = 0, n/2-1
      s = s + f(A[i])

  Thread 2:
    for i = n/2, n-1
      s = s + f(A[i])

- The problem is a race condition on variable s in the program
- A race condition or data race occurs when:
  - two processors (or two threads) access the same variable, and at least one does a write
  - the accesses are concurrent (not synchronized), so they could happen simultaneously
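Below is a hedged pthreads sketch of the racy version above: two threads update the shared s with no synchronization, so updates can be lost. The array size, the choice of f, and the two-thread split are illustrative assumptions.

```c
#include <pthread.h>
#include <stdio.h>

#define N 1000000
static double A[N];                 /* zero-initialized, stand-in data */
static double s = 0.0;              /* shared, updated without synchronization */

static double f(double x) { return x + 1.0; }

/* Each thread sums its half directly into the shared variable s. */
static void *worker(void *arg) {
    long lo = (long)arg * (N / 2), hi = lo + N / 2;
    for (long i = lo; i < hi; i++)
        s = s + f(A[i]);            /* read-modify-write: the data race */
    return NULL;
}

int main(void) {
    pthread_t t[2];
    for (long p = 0; p < 2; p++) pthread_create(&t[p], NULL, worker, (void *)p);
    for (long p = 0; p < 2; p++) pthread_join(t[p], NULL);
    printf("s = %g (expected %d; may be smaller due to lost updates)\n", s, N);
    return 0;
}
```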

Shared Memory Code for Computing a Sum (what the hardware actually does)

  static int s = 0;

  Thread 1:
    compute f(A[i]) and put in reg0
    reg1 = s
    reg1 = reg1 + reg0
    s = reg1

  Thread 2:
    compute f(A[i]) and put in reg0
    reg1 = s
    reg1 = reg1 + reg0
    s = reg1

- Assume s=27, f(A[i])=7 on Thread 1 and f(A[i])=9 on Thread 2
- For this program to work, s should be 43 at the end, but it may be 43, 34, or 36
- The atomic operations are reads and writes
  - You never see half of one number
  - All computations happen in (private) registers

Improved Code for Computing a Sum

  static int s = 0;
  static lock lk;

  Thread 1:
    local_s1 = 0
    for i = 0, n/2-1
      local_s1 = local_s1 + f(A[i])
    lock(lk);
    s = s + local_s1
    unlock(lk);

  Thread 2:
    local_s2 = 0
    for i = n/2, n-1
      local_s2 = local_s2 + f(A[i])
    lock(lk);
    s = s + local_s2
    unlock(lk);

- Since addition is associative, it's OK to rearrange the order of the partial sums
- Most computation is on private variables
  - Sharing frequency is also reduced, which might improve speed
  - But there is still a race condition on the update of shared s
  - The race condition is fixed by adding locks as shown above (only one thread can hold a lock at a time; others wait for it)
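A corresponding pthreads sketch of the improved version, with private partial sums and a mutex playing the role of lock lk (again with illustrative f and sizes):

```c
#include <pthread.h>
#include <stdio.h>

#define N 1000000
static double A[N];
static double s = 0.0;
static pthread_mutex_t lk = PTHREAD_MUTEX_INITIALIZER;   /* plays the role of "lock lk" */

static double f(double x) { return x + 1.0; }

static void *worker(void *arg) {
    long lo = (long)arg * (N / 2), hi = lo + N / 2;
    double local_s = 0.0;                 /* private partial sum */
    for (long i = lo; i < hi; i++)
        local_s += f(A[i]);
    pthread_mutex_lock(&lk);              /* only one thread at a time updates s */
    s += local_s;
    pthread_mutex_unlock(&lk);
    return NULL;
}

int main(void) {
    pthread_t t[2];
    for (long p = 0; p < 2; p++) pthread_create(&t[p], NULL, worker, (void *)p);
    for (long p = 0; p < 2; p++) pthread_join(t[p], NULL);
    printf("s = %g\n", s);                /* now deterministic: N */
    return 0;
}
```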

Machine Model 1a: Shared Memory
- Processors all connected to a large shared memory, typically over a shared bus; each processor has its own cache ($)
- Typically called Symmetric Multiprocessors (SMPs)
  - SGI, Sun, HP, Intel, IBM SMPs (nodes of Millennium, SP)
  - Multicore chips (our common future)
- Difficulty scaling to large numbers of processors
  - <= 32 processors typical
- Advantage: uniform memory access (UMA)
- Cost: much cheaper to access data in cache than main memory
[Figure: processors P1, P2, ..., Pn, each with a cache, sharing one memory over a bus]

Problems Scaling Shared Memory Hardware
- Why not just put more processors on (with a larger memory)?
  - The memory bus becomes a bottleneck
- An example from the Parallel Spectral Transform Shallow Water Model (PSTSWM) demonstrates the problem
  - Experimental results (and slide) from Pat Worley at ORNL
  - This is an important kernel in atmospheric models
  - 99% of the floating point operations are multiplies or adds, which generally run well on all processors
  - But it sweeps through memory with little reuse of operands, so it uses the bus and shared memory frequently
  - These experiments show serial performance, with one "copy" of the code running independently on varying numbers of procs
    - The best case for shared memory: no sharing
    - But the data doesn't all fit in the registers/cache

Example: Problem in Scaling Shared Memory (from Pat Worley, ORNL)
[Figure: per-process performance vs. number of processes]
- Performance degradation is a "smooth" function of the number of processes
- There is no shared data between the copies, so there should be perfect parallelism
- (The code was run for 18 vertical levels with a range of horizontal sizes.)

Machine Model 1b: Distributed Shared Memory
- Memory is logically shared, but physically distributed
  - Any processor can access any address in memory
  - Cache lines (or pages) are passed around the machine
- SGI Origin is the canonical example (+ research machines)
  - Scales to 512 processors (SGI Altix (Columbia) at NASA/Ames)
- Limitation is the cache coherency protocol: how to keep cached copies of the same address consistent
[Figure: processors P1, P2, ..., Pn, each with a cache and a local memory, connected by a network]

Programming Model 2: Message Passing
- Program consists of a collection of named processes
  - Usually fixed at program startup time
  - Thread of control plus local address space -- NO shared data
  - Logically shared data is partitioned over local processes
- Processes communicate by explicit send/receive pairs
  - Coordination is implicit in every communication event
  - MPI (Message Passing Interface) is the most commonly used SW
[Figure: processes P0, P1, ..., Pn, each with private memory (its own s and i), exchanging values over a network via send/receive]

Computing s = A[1]+A[2] on Each Processor
- First possible solution (what could go wrong?):

  Processor 1:
    xlocal = A[1]
    send xlocal, proc2
    receive xremote, proc2
    s = xlocal + xremote

  Processor 2:
    xlocal = A[2]
    receive xremote, proc1
    send xlocal, proc1
    s = xlocal + xremote

- Second possible solution:

  Processor 1:
    xlocal = A[1]
    send xlocal, proc2
    receive xremote, proc2
    s = xlocal + xremote

  Processor 2:
    xlocal = A[2]
    send xlocal, proc1
    receive xremote, proc1
    s = xlocal + xremote

- What if send/receive acts like the telephone system? The post office?
- What if there are more than 2 processors?
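As a concrete (but not slide-provided) illustration, the same exchange can be written in MPI; MPI_Sendrecv pairs the send and the receive so the program is correct whether the transport behaves like the "telephone" (synchronous) or the "post office" (buffered). The data values are placeholders.

```c
#include <mpi.h>
#include <stdio.h>

/* Run with exactly 2 processes: rank 0 plays processor 1, rank 1 plays processor 2. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double A[2] = {3.0, 4.0};            /* illustrative values of A[1], A[2] */
    double xlocal = A[rank];
    double xremote;
    int other = 1 - rank;

    /* Paired send+receive: no deadlock regardless of send semantics. */
    MPI_Sendrecv(&xlocal, 1, MPI_DOUBLE, other, 0,
                 &xremote, 1, MPI_DOUBLE, other, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    double s = xlocal + xremote;
    printf("rank %d: s = %g\n", rank, s);
    MPI_Finalize();
    return 0;
}
```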

MPI: the de facto standard
- MPI has become the de facto standard for parallel computing using message passing
- Pros and cons of standards:
  - MPI finally created a standard for applications development in the HPC community -> portability
  - The MPI standard is a least common denominator building on mid-80s technology, so it may discourage innovation
- The programming model reflects the hardware!
- "I am not sure how I will program a Petaflops computer, but I am sure that I will need MPI somewhere" (HDS 2001)

Machine Model 2a: Distributed Memory
- Cray T3E, IBM SP2
- PC clusters (Berkeley NOW, Beowulf)
- IBM SP-3, Millennium, and CITRIS are distributed memory machines, but the nodes are SMPs
- Each processor has its own memory and cache but cannot directly access another processor's memory
- Each "node" has a Network Interface (NI) for all communication and synchronization
[Figure: nodes P0, P1, ..., Pn, each with its own memory and NI, connected by an interconnect]

Tflop/s Clusters
The following are examples of clusters configured out of separate network and processor components:
- 72% of the Top 500 (Nov 2005), 2 of the top 10
- Dell cluster at Sandia (Thunderbird) is #4 on the Top 500
  - Intel 3.6 GHz processors
  - 64 TFlops peak, 38 TFlops Linpack
  - Infiniband connection network
- Walt Disney Feature Animation (The Hive)
  - Intel 3 GHz processors
  - Gigabit Ethernet
- Saudi Oil Company is #107
- Credit Suisse/First Boston is #108
- For more details use the "database/sublist generator" at the TOP500 web site

Machine Model 2b: Internet/Grid Computing
- Example: SETI@Home, running on ~500,000 PCs
  - ~1000 CPU years per day
  - 485,821 CPU years so far
- Sophisticated data & signal processing analysis
- Distributes datasets from the Arecibo Radio Telescope
- Next step: Allen Telescope Array

Programming Model 2b: Global Address Space
- Program consists of a collection of named threads
  - Usually fixed at program startup time
  - Local and shared data, as in the shared memory model
  - But shared data is partitioned over local processes
  - The cost model says remote data is expensive
- Examples: UPC, Titanium, Co-Array Fortran
- Global Address Space programming is an intermediate point between message passing and shared memory
[Figure: threads P0, P1, ..., Pn, each with private memory, plus a shared array s[] partitioned so that s[i] lives with thread i]

Machine Model 2c: Global Address Space
- Cray T3D, T3E, X1, and HP Alphaserver cluster
- Clusters built with Quadrics, Myrinet, or Infiniband
- The network interface supports RDMA (Remote Direct Memory Access)
  - The NI can directly access memory without interrupting the CPU
  - One processor can read/write memory with one-sided operations (put/get)
  - Not just a load/store as on a shared memory machine
    - Continue computing while waiting for the memory op to finish
  - Remote data is typically not cached locally
[Figure: nodes P0, P1, ..., Pn, each with memory and an NI, connected by an interconnect; the global address space may be supported in varying degrees]
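The put/get idea can be sketched with MPI-2 one-sided operations in C. This is only an analogy under stated assumptions (the slide's actual examples are RDMA hardware and languages like UPC, Titanium, and Co-Array Fortran, not this API); the window layout and values below are illustrative.

```c
#include <mpi.h>
#include <stdio.h>

/* Run with at least 2 processes. Rank 0 reads rank 1's exposed double with a
   one-sided MPI_Get: rank 1 never calls a matching receive. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = 100.0 + rank;          /* each process exposes one double */
    MPI_Win win;
    MPI_Win_create(&local, sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    double remote = 0.0;
    MPI_Win_fence(0, win);
    if (rank == 0)
        MPI_Get(&remote, 1, MPI_DOUBLE, 1, 0, 1, MPI_DOUBLE, win);
    MPI_Win_fence(0, win);                /* completes the one-sided access */

    if (rank == 0) printf("rank 0 got %g from rank 1\n", remote);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```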

Programming Model 3: Data Parallel
- Single thread of control consisting of parallel operations
- Parallel operations applied to all (or a defined subset) of a data structure, usually an array
  - Communication is implicit in parallel operators
  - Elegant and easy to understand and reason about
  - Coordination is implicit: statements are executed synchronously
  - Similar to the Matlab language for array operations
- Drawbacks:
  - Not all problems fit this model
  - Difficult to map onto coarse-grained machines

  A = array of all data
  fA = f(A)
  s = sum(fA)
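A rough C/OpenMP rendering of the pseudocode above (A = data, fA = f(A), s = sum(fA)); the sizes and f are placeholders, and the reduction stands in for the model's implicit communication.

```c
#include <stdio.h>

#define N 1024

static double f(double x) { return 2.0 * x; }   /* placeholder elementwise function */

int main(void) {
    double A[N], fA[N], s = 0.0;
    for (int i = 0; i < N; i++) A[i] = i;

    /* fA = f(A): the same operation applied to every element */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        fA[i] = f(A[i]);

    /* s = sum(fA): a parallel reduction */
    #pragma omp parallel for reduction(+:s)
    for (int i = 0; i < N; i++)
        s += fA[i];

    printf("s = %g\n", s);
    return 0;
}
```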

Machine Model 3a: SIMD System
- A large number of (usually) small processors
  - A single "control processor" issues each instruction
  - Each processor executes the same instruction
  - Some processors may be turned off on some instructions
- Originally these machines were specialized for scientific computing, and few were made (CM2, Maspar)
- The programming model can be implemented in the compiler
  - mapping n-fold parallelism to p processors, n >> p, but it's hard (e.g., HPF)
[Figure: a control processor broadcasting instructions to many small processors, each with its own memory and NI, over an interconnect]

Machine Model 3b: Vector Machines
- Vector architectures are based on a single processor
  - Multiple functional units
  - All performing the same operation
  - Instructions may specify large amounts of parallelism (e.g., 64-way), but the hardware executes only a subset in parallel
- Historically important
  - Overtaken by MPPs in the 90s
- Re-emerging in recent years
  - At a large scale in the Earth Simulator (NEC SX6) and Cray X1
  - At a small scale in SIMD media extensions to microprocessors
    - SSE, SSE2 (Intel: Pentium/IA64)
    - Altivec (IBM/Motorola/Apple: PowerPC)
    - VIS (Sun: Sparc)
- Key idea: the compiler does some of the difficult work of finding parallelism, so the hardware doesn't have to

Vector Processors
- Vector instructions operate on a vector of elements
  - These are specified as operations on vector registers
  - A supercomputer vector register holds ~32-64 elements
- The number of elements is larger than the amount of parallel hardware, called vector pipes or lanes (say 2-4)
- The hardware performs a full vector operation in (#elements per vector register) / (#pipes) steps
[Figure: a scalar add r3 = r1 + r2 vs. a vector add vr3 = vr1 + vr2, which logically performs #elements adds in parallel but actually performs #pipes adds per step]
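The SIMD media extensions mentioned on the previous slide work the same way at a small scale. Here is a small C sketch using SSE intrinsics (illustrative data): each _mm_add_ps performs 4 single-precision adds at once, playing the role of one vector pipe, while the loop covers the full "vector".

```c
#include <xmmintrin.h>   /* SSE intrinsics (Intel/AMD) */
#include <stdio.h>

int main(void) {
    float x[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float y[8] = {10, 20, 30, 40, 50, 60, 70, 80};
    float z[8];

    /* 8-element "vector" processed 4 elements per step by the SIMD unit. */
    for (int i = 0; i < 8; i += 4) {
        __m128 vx = _mm_loadu_ps(&x[i]);
        __m128 vy = _mm_loadu_ps(&y[i]);
        _mm_storeu_ps(&z[i], _mm_add_ps(vx, vy));
    }

    for (int i = 0; i < 8; i++) printf("%g ", z[i]);
    printf("\n");
    return 0;
}
```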

Cray X1 Node (figure source: J. Levesque, Cray)
[Figure: one X1 node built from custom blocks: four SSPs, each with scalar (S) and vector (V) units and a 0.5 MB cache, sharing a 2 MB Ecache at 400/800 MHz, with multi-GB/s paths to local memory and the network; 25.6 Gflops peak (32 bit)]
- The Cray X1 builds a larger "virtual vector", called an MSP
- 4 SSPs (each a 2-pipe vector processor) make up an MSP
- The compiler will (try to) vectorize/parallelize across the MSP

Cray X1: Parallel Vector Architecture
- Cray combines several technologies in the X1:
  - Vector processors (MSPs)
  - Shared caches (unusual on earlier vector machines)
  - 4-processor nodes sharing up to 64 GB of memory
  - Single System Image to 4096 processors
  - Remote put/get between nodes (faster than MPI)

Earth Simulator Architecture
- Parallel vector architecture
  - High speed (vector) processors
  - High memory bandwidth (vector architecture)
  - Fast network (new crossbar switch)
- Rearranging commodity parts can't match this performance

Machine Model 4: Clusters of SMPs
- SMPs are the fastest commodity machines, so use them as a building block for a larger machine with a network
- Common names:
  - CLUMP = Cluster of SMPs
  - Hierarchical machines, constellations
- Many modern machines look like this:
  - Millennium, IBM SPs, ASCI machines
- What is an appropriate programming model #4?
  - Treat the machine as "flat": always use message passing, even within an SMP (simple, but ignores an important part of the memory hierarchy)
  - Shared memory within one SMP, but message passing outside of an SMP
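One hedged sketch of the second option combines OpenMP (shared memory inside an SMP node) with MPI (message passing between nodes). The summed quantity, the problem size, and the one-MPI-process-per-node layout are assumptions for illustration.

```c
#include <mpi.h>
#include <stdio.h>

#define N_PER_NODE 1000000

/* Hybrid sketch: OpenMP threads share memory within a node,
   MPI passes messages between nodes. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double node_sum = 0.0;
    /* Shared-memory parallelism inside the node. */
    #pragma omp parallel for reduction(+:node_sum)
    for (int i = 0; i < N_PER_NODE; i++)
        node_sum += 1.0;                  /* stand-in for f(A[i]) */

    /* Message passing between nodes. */
    double global_sum = 0.0;
    MPI_Reduce(&node_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) printf("global sum = %g\n", global_sum);
    MPI_Finalize();
    return 0;
}
```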

Outline
- Overview of parallel machines and programming models
  - Shared memory
  - Shared address space
  - Message passing
  - Data parallel
  - Clusters of SMPs
- Trends in real machines

TOP500
- A listing of the 500 most powerful computers in the world
- Yardstick: Rmax from Linpack (Ax=b, dense problem)
- Updated twice a year:
  - ISC'xy in Germany, June xy
  - SC'xy in USA, November xy
- All data available from the TOP500 web site

Extra Slides

TOP500 list: data shown
- Manufacturer: manufacturer or vendor
- Computer: type indicated by manufacturer or vendor
- Installation Site: customer
- Location: location and country
- Year: year of installation / last major update
- Customer Segment: academic, research, industry, vendor, classified
- # Processors: number of processors
- Rmax: maximal LINPACK performance achieved
- Rpeak: theoretical peak performance
- Nmax: problem size for achieving Rmax
- N1/2: problem size for achieving half of Rmax
- Nworld: position within the TOP500 ranking

TOP500: The TOP10 (2003) [table slide]

Continents – Performance [chart slide]

Continents – Performance [chart slide]

Customer Types [chart slide]

Manufacturers [chart slide]

Manufacturers – Performance [chart slide]

Processor Types [chart slide]

Architectures [chart slide]

NOW – Clusters [chart slide]

Analysis of TOP500 Data
- Annual performance growth: about a factor of 1.82
- Two factors contribute almost equally to the annual total performance growth:
  - Processor number grows per year on average by a factor of 1.30, and
  - Processor performance grows by a factor of 1.40 (compared to 1.58 for Moore's Law)
- Strohmaier, Dongarra, Meuer, and Simon, Parallel Computing 25, 1999

Summary
- Historically, each parallel machine was unique, along with its programming model and programming language
  - It was necessary to throw away software and start over with each new kind of machine
- Now we distinguish the programming model from the underlying machine, so we can write portably correct codes that run on many machines
  - MPI is now the most portable option, but it can be tedious
- Writing portably fast code requires tuning for the architecture
  - The algorithm design challenge is to make this process easy
  - Example: picking a blocksize, not rewriting the whole algorithm

Reading Assignment
- Extra reading for today:
  - Cray X1
  - Clusters
  - "Parallel Computer Architecture: A Hardware/Software Approach" by Culler, Singh, and Gupta, Chapter 1
- Next week: current high performance architectures
  - Shared memory (for Monday):
    - "Memory Consistency and Event Ordering in Scalable Shared-Memory Multiprocessors", Gharachorloo et al, Proceedings of the International Symposium on Computer Architecture, 1990
    - Or read about the Altix system on the web
  - Blue Gene L (for Wednesday)

PC Clusters: Contributions of Beowulf
- An experiment in parallel computing systems
- Established a vision of low cost, high end computing
- Demonstrated the effectiveness of PC clusters for some (not all) classes of applications
- Provided networking software
- Conveyed findings to a broad community (great PR)
  - Tutorials and book
- Design standard to rally the community!
  - Standards beget: books, trained people, software ... a virtuous cycle
(Adapted from Gordon Bell, presentation at Salishan 2000)

Open Source Software Model for HPC
- Linus's law, named after Linus Torvalds, the creator of Linux, states that "given enough eyeballs, all bugs are shallow"
- All source code is "open"
  - Everyone is a tester
  - Everything proceeds a lot faster when everyone works on one code (HPC: nothing gets done if resources are scattered)
- Software is or should be free (Stallman)
  - Anyone can support and market the code for any price
  - Zero-cost software attracts users!
- Prevents the community from losing HPC software (CM5, T3E)

Cluster of SMP Approach
- A supercomputer is a stretched high-end server
- A parallel system is built by assembling nodes that are modest-size, commercial SMP servers: just put more of them together
(Image from LLNL)