Parallel Programming, Chapter 2: Introduction to Parallel Architectures. Johnnie Baker. January 23, 2011.

Overview
Some types of parallel architectures:
– SIMD vs. MIMD
– Multi-cores from Intel and AMD
– SunFire SMP
– BG/L supercomputer
– Clusters
– Later in the semester we'll look at NVIDIA GPUs
An abstract architecture for parallel algorithms
Discussion
Sources for this lecture:
– Primary source: Mary Hall, CS4961, University of Utah
– Larry Snyder
– Textbook, Chapter 2

An Abstract Parallel Architecture
– How is parallelism managed?
– Where is the memory physically located?
– Is it connected directly to processors?
– What is the connectivity of the network?

Why Look at Architectures?
There is no canonical parallel computer; there is a diversity of parallel architectures
– Hence, there is no canonical parallel programming language
Architecture has an enormous impact on performance
– And we wouldn't write parallel code if we didn't care about performance
Many parallel architectures fail commercially
– We can't always tell what will be around in future years
The challenge is to write parallel code that abstracts away architectural features, focuses on their commonality, and is therefore easily ported from one platform to another.

Viewpoint (Jim Demmel*)
Historically, each parallel machine was unique, along with its programming model and programming language; it was necessary to throw away software and start over with each new kind of machine.
Now we distinguish the programming model from the underlying machine, so we can write portably correct code that runs on many machines.
– MPI is now the most portable option, but it can be tedious.
– Writing portable fast code requires tuning for the architecture.
The parallel algorithm design challenge is to make this process easy.
– Example: picking a block size, not rewriting the whole algorithm.
* James Demmel is a distinguished professor of Mathematics & CS at UC Berkeley.

Six Parallel Architectures (from text)
– Multi-cores: Intel Core2Duo, AMD Opteron (shared memory)
– Symmetric Multiprocessor (SMP): SunFire E25K (shared memory)
– Heterogeneous architecture (ignore for now): IBM Cell B/E
– Clusters (ignore for now): commodity desktops (typically PCs) with a high-speed interconnect (distributed memory)
– Supercomputers: IBM BG/L (distributed memory)

Multi-core 1: Intel Core2Duo
– Two 32-bit Pentium cores on one chip
– Private 32K L1 caches for instructions (L1-I) and data (L1-D)
– Shared 2M–4M L2 cache
– MESI cc-protocol provides cache coherence
– Shared bus control and memory
– Fast on-chip communication using shared memory

MESI Cache Coherence
States: M = modified, E = exclusive, S = shared, I = invalid
Think of it as a finite state machine:
– Upon loading, a line is marked E; subsequent reads are OK; a write marks it M
– Seeing another processor's load, mark the line S
– A write to an S line sends I to all other copies and marks the line M
– Another processor's read of an M line writes it back and marks it S
– A read or write to an I line misses
Related scheme: MOESI (O = owned; used by AMD)
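
To make the transitions concrete, here is a minimal sketch of the MESI state machine for a single cache line, written in C. The event names and the trace at the bottom are our illustration, not from the slides, and a real protocol also distinguishes whether any other cache holds the line on a read miss:

    #include <stdio.h>

    /* MESI states for one cache line. */
    typedef enum { M, E, S, I } mesi_t;

    /* Events a line can see: reads/writes by this core, and
     * reads/writes snooped from other cores on the bus. */
    typedef enum { LOCAL_READ, LOCAL_WRITE, REMOTE_READ, REMOTE_WRITE } event_t;

    static const char *name(mesi_t s) {
        static const char *names[] = { "Modified", "Exclusive", "Shared", "Invalid" };
        return names[s];
    }

    /* Next-state function, following the transitions on the slide. */
    static mesi_t mesi_next(mesi_t s, event_t e) {
        switch (e) {
        case LOCAL_READ:
            /* Miss on I: fetch the line. (Real MESI marks it E when no
             * other cache holds it; we conservatively mark it S.) */
            return (s == I) ? S : s;
        case LOCAL_WRITE:
            return M;  /* invalidates all other copies */
        case REMOTE_READ:
            /* An M line is written back before being shared. */
            return (s == M || s == E) ? S : s;
        case REMOTE_WRITE:
            return I;  /* our copy is now stale */
        }
        return s;
    }

    int main(void) {
        mesi_t s = E;  /* line just loaded, exclusively */
        event_t trace[] = { LOCAL_WRITE, REMOTE_READ, REMOTE_WRITE, LOCAL_READ };
        for (int i = 0; i < 4; i++) {
            mesi_t n = mesi_next(s, trace[i]);
            printf("%-9s -> %s\n", name(s), name(n));
            s = n;
        }
        return 0;
    }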

AMD Dual-Core Opteron (in CHPC systems)
– Two 64-bit processors on a chip
– 64K private L1 data and instruction caches
– 1 MB private L2 cache
– MOESI cc-protocol
– Direct Connect shared memory
– Fast on-chip communication between processors

Comparison
– The two multi-cores differ fundamentally in memory hierarchy structure and cache coherence protocol
– More details about these protocols are in the textbook

Classical Shared-Memory Symmetric Multiprocessor (SMP)
– All processors connected by a bus
– Cache coherence maintained by "snooping" on the bus
– The bus serializes communication, usually limiting the number of processors to ~20
– The AMD dual-core architecture is well suited to constructing an SMP

SunFire E25K
– An example of an SMP
– 4 UltraSPARC processors per board
– 18 boards connected with a crossbar switch
– In the figure (not reproduced here), dotted lines represent snooping

Crossbar Switch
– A crossbar is a network connecting each processor to every other processor
– In the figure (not reproduced here), wires do not intersect unless a circled connection is shown
– Crossbars grow as n², making them impractical for large n
– Here only 4 boards are shown, but the SunFire E25K has 18 boards

Crossbar in SunFire E25K
– The crossbar gives low latency for snoops, allowing for shared memory
– An 18 × 18 crossbar is basically the limit, due to the n² rate of increase
– Raising the number of processors per node will, on average, increase congestion
– How could we make a larger machine?
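
To see why the n² growth bites so quickly, here is a tiny illustrative computation of crosspoint counts (the sizes beyond the E25K's 18 boards are hypothetical):

    #include <stdio.h>

    int main(void) {
        /* An n x n crossbar needs n * n crosspoint switches. */
        int sizes[] = { 4, 18, 64, 256 };
        for (int i = 0; i < 4; i++) {
            int n = sizes[i];
            printf("%3d x %-3d crossbar: %6d crosspoints\n", n, n, n * n);
        }
        return 0;
    }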

Heterogeneous Chip Designs
– An alternative to replicating a standard processor several times is to augment a standard processor with one or more specialized attached processors
– The standard processor performs the general, hard-to-parallelize part of the computation, while the attached processors perform the compute-intensive portion
Examples:
– Graphics Processing Units (GPUs)
– Field Programmable Gate Arrays (FPGAs)
– Cell processor, designed for video games

A Heterogeneous Chip: The Cell Processor
The Cell processor is a joint development by Sony, IBM, and Toshiba. It has a 64-bit PowerPC core and eight specialized cores called synergistic processing elements (SPEs).
– Each SPE supports 32-bit vector operations, with one operation performed on several data values in parallel
– A high-speed Element Interconnect Bus (EIB) connects the SPEs
Cell emphasizes performance and hardware simplicity over programming convenience:
– SPEs do not have coherent memory
– Programmers must carefully manage the movement of data to and from the SPEs so that the vector units can be kept busy
– When successful, Cell processors can produce impressive throughputs

Figure 2.6: Architecture of the Cell processor. The architecture is designed to move data: the high-speed I/O controllers have a capacity of 76.8 GB/s; each of the two channels to RAM runs at 12.8 GB/s; the EIB is theoretically capable of 204.8 GB/s.

Clusters
– Parallel computers made from commodity parts
– Nodes connected by commodity interconnects, such as Gigabit Ethernet, Myrinet, or InfiniBand
– Instructions for assembling a typical system such as the one below are available on the Web:
  – 8 nodes, each an 8-way Power4 processor with 32 GB RAM and 2 disks
  – One control processor
  – Myrinet 16-port switch plus 8 PCI adapters on 4 boards
  – Open-source software
– Can also be built using pre-packaged blade servers; more detail on how to build one is given in the textbook

Supercomputer: Blue Gene/L Node
– The figure (not reproduced here) gives the logical organization of a BG/L node
– BG/L has 65,536 dual-core nodes
– Processors run at a moderate 700 MHz

BG/L Interconnect
– Separate networks for control and data; each network implementation can then be specialized for its type of message, which also reduces congestion
– The 3D torus is for standard interprocessor transfers
– The collective network is used for fast evaluation of reductions
– The entire BG/L expands to a 64×32×32 3D torus

Blue Gene/L Specs
– BG/L was the fastest computer in the world (#1 on the Top500 list) when the textbook was published
– A 64×32×32 torus = 65,536 dual-core nodes
– Cut-through routing gives a worst-case latency of 6.4 µs
– Processor nodes are dual PPC-440 with "Double Hummer" FPUs
– The collective network performs global reductions for the "usual" functions
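
As an illustrative aside (our code, not the slides'), the worst-case hop count in a torus is easy to compute: the wraparound links mean the farthest node along a dimension of size d is d/2 hops away, and the per-dimension distances add:

    #include <stdio.h>

    /* Worst-case hop count in a 3D torus: with wraparound links, the
     * farthest node along a dimension of size d is d/2 hops away. */
    static int torus_max_hops(int x, int y, int z) {
        return x / 2 + y / 2 + z / 2;
    }

    int main(void) {
        /* The full BG/L machine is a 64 x 32 x 32 torus. */
        printf("max hops in 64x32x32 torus: %d\n",
               torus_max_hops(64, 32, 32));  /* 32 + 16 + 16 = 64 */
        return 0;
    }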

Summary of Architectures
Two main classes:
Complete connection: CMPs, SMPs, crossbars
– CMPs are chip multiprocessors
– Preserve a single memory image
– Complete connection limits scaling to …
– Available to everyone (multi-core)
Sparse connection: clusters, supercomputers, networked computers used for parallelism (Grid)
– Separate memory images
– Can grow "arbitrarily" large
– Available to everyone with LOTS of air conditioning
Programming differences are significant

Parallel Architecture Model
How do we develop portable parallel algorithms for current and future parallel architectures, a moving target?
Strategy: adopt an abstract parallel machine model for use in thinking about algorithms
1. Review how we compare algorithms on sequential architectures
2. Introduce the CTA model (Candidate Type Architecture)
3. Discuss how it relates to today's set of machines

How did we do it for sequential architectures?
Sequential model: the Random Access Machine (RAM)
– Control, ALU, (unlimited) memory, [input, output]
– The fetch/execute cycle runs one instruction, pointed at by the PC
– Memory references are "unit time," independent of location
  – This gives the RAM its name, in preference to "von Neumann machine"
  – "Unit time" is not literally true, but caches provide that illusion when effective
– Executes "3-address" instructions
The focus in developing sequential algorithms, at least in courses, is on reducing the amount of computation (useful even if imprecise)
– Treat memory time as negligible
– Ignore overheads

Interesting Historical Parallel Architecture Model: PRAM
Parallel Random Access Machine (PRAM)
– Unlimited number of processors
– Processors are standard RAM machines, executing synchronously
– Memory references are "unit time"
– The outcome of collisions at memory is specified: EREW, CREW, CRCW, …
The model fails to capture true performance behavior on most current architectures:
– Synchronous execution with unit-cost memory references does not scale (for multi-tasking executions)
– Therefore, parallel hardware typically implements non-uniform-cost memory references

Candidate Type Architecture (CTA Model)
– A model with P standard processors, degree d, and latency λ
– Node == processor + memory + NIC*
– Key property: a local memory reference costs 1; a global memory reference costs λ
* The NIC is the network interface chip; see p. 48 of the text.
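
A hypothetical sketch of how the CTA cost accounting works (the function name and the numbers are ours, not the model's): total memory cost is local references at unit cost plus non-local references at cost λ, so shifting references from global to local memory pays off dramatically:

    #include <stdio.h>

    /* CTA accounting: local references cost 1, non-local cost lambda. */
    static long cta_memory_cost(long local_refs, long nonlocal_refs, long lambda) {
        return local_refs + nonlocal_refs * lambda;
    }

    int main(void) {
        long lambda = 500;  /* hypothetical latency ratio */
        /* One million references, split two different ways: */
        printf("mostly local:     %ld\n", cta_memory_cost(990000, 10000, lambda));
        printf("mostly non-local: %ld\n", cta_memory_cost(10000, 990000, lambda));
        return 0;
    }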

Estimated Values for Lambda
– The estimated values (tabulated on the original slide, not reproduced here) capture the inherent property that data locality is important
– Different values of λ can lead to different algorithm strategies

Key Lesson from CTA
The Locality Rule:
– Fast programs tend to maximize the number of local memory references and minimize the number of non-local memory references
– Keep as much of the memory access as possible private (i.e., where λ = 1)
– As much as possible, use non-private memory access (i.e., where λ > 1) only for communication, not for data movement
The Locality Rule in practice:
– It is usually more efficient to add a fair amount of redundant computation to avoid non-local accesses (e.g., the random number generator example, sketched below)
This is the most important thing you will learn in this class!
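
As an illustrative sketch of the random number generator idea (our own code, assuming OpenMP and the POSIX rand_r; the slides give no implementation): rather than all threads drawing from one shared generator, each thread redundantly runs its own generator seeded from its thread id, trading a little extra computation for purely local (λ = 1) accesses:

    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    int main(void) {
        double global_sum = 0.0;

        #pragma omp parallel reduction(+:global_sum)
        {
            /* Each thread owns a private generator state, so every access
             * is local and no thread contends for a shared generator. */
            unsigned int seed = 1234u + (unsigned int)omp_get_thread_num();
            double local_sum = 0.0;
            for (int i = 0; i < 1000000; i++)
                local_sum += rand_r(&seed) / (double)RAND_MAX;
            global_sum += local_sum;
        }
        printf("sum = %f\n", global_sum);
        return 0;
    }

Compile with OpenMP enabled, e.g. gcc -fopenmp.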

Back to the Abstract Parallel Architecture
– A key feature is how memory is connected to processors!

Memory Reference Mechanisms
Shared memory:
– All processors have access to a global address space
– Refer to remote data or local data in the same way, through normal loads and stores
– Usually, caches must be kept coherent with the global store
Message passing and distributed memory:
– Memory is partitioned, and each partition is associated with an individual processor
– Remote data are accessed through explicit communication (sends and receives)
– Two-sided: both a send and a receive are needed (see the sketch below)
One-sided communication (a hybrid mechanism):
– Supports a global shared address space, but with no coherence guarantees
– Access to remote data is through gets and puts
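
A minimal sketch of the two-sided (message passing) mechanism in C with MPI, which the Demmel slide called the most portable option (the ranks and payload here are our illustration):

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        int rank, value;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;
            /* Two-sided: this send must be matched by the receive below. */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }
        MPI_Finalize();
        return 0;
    }

Run with at least two processes, e.g. mpirun -np 2 ./a.out. One-sided communication would instead use MPI_Put/MPI_Get on an MPI window, with no matching receive.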

Brief Discussion
Why is it good to have different parallel architectures?
– Some may be better suited for specific application domains
– Some may be better suited for a particular community
– Cost
– Exploring new ideas
And different programming models/languages?
– They relate to architectural features
– Same reasons: application domains, user community, cost, exploring new ideas

Conceptual: CTA for Shared Memory Architectures?
– The CTA does not capture the global memory of SMPs
– This forces a discipline: the application developer should think about locality even if remote data is referenced identically to local data; otherwise, performance will suffer
– Anecdotally, codes written for distributed memory have been shown to run faster on shared-memory architectures than shared-memory programs
– Similarly, GPU codes (which require a partitioning of memory) have recently been shown to run well on conventional multi-cores

Summary of Lecture
– Exploration of different kinds of parallel architectures
– Key features:
  – How are processors connected?
  – How is memory connected to processors?
  – How is parallelism represented/managed?
– Models for parallel architectures