CS4961 Parallel Programming
Lecture 3: Introduction to Parallel Architectures
Mary Hall, August 30, 2011

Administrative UPDATE
Nikhil office hours:
-Monday, 2-3 PM, MEB 3115 Desk #12
-Lab hours on Tuesday afternoons during programming assignments
We'll spend some class time going over the homework
Next homework will be due next Thursday, Sept. 8
-I'll post the assignment by Thursday's class, maybe earlier

Homework 1: Parallel Programming Basics
Turn in electronically on the CADE machines using the handin program: "handin cs4961 hw1 "
Problem 1 (#1.3 in textbook): Try to write pseudo-code for the tree-structured global sum illustrated in Figure 1.1. Assume the number of cores is a power of two (1, 2, 4, 8, ...).
Hints: Use a variable divisor to determine whether a core should send its sum or receive and add. The divisor should start with the value 2 and be doubled after each iteration. Also use a variable core_difference to determine which core should be partnered with the current core. It should start with the value 1 and also be doubled after each iteration. For example, in the first iteration 0 % divisor = 0 and 1 % divisor = 1, so 0 receives and adds, while 1 sends. Also in the first iteration 0 + core_difference = 1 and 1 - core_difference = 0, so 0 and 1 are paired in the first iteration.
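As a rough illustration of the scheme described in the hints (an illustrative sequential simulation only, not the pseudo-code the assignment asks you to write), the C sketch below runs the divisor / core_difference pairing over an array of made-up per-core partial sums; with 8 "cores" the global sum ends up at core 0 after three iterations.

    #include <stdio.h>

    #define NUM_CORES 8   /* assumed to be a power of two, per the problem statement */

    int main(void) {
        /* Hypothetical per-core partial sums; in the assignment each core
           computes its own value. */
        int sum[NUM_CORES] = {1, 2, 3, 4, 5, 6, 7, 8};

        int divisor = 2;          /* doubles after each iteration */
        int core_difference = 1;  /* doubles after each iteration */

        while (core_difference < NUM_CORES) {
            for (int core = 0; core < NUM_CORES; core++) {
                if (core % divisor == 0) {
                    /* This core "receives" from its partner and adds. */
                    sum[core] += sum[core + core_difference];
                }
                /* A core with core % divisor != 0 either "sends" to
                   core - core_difference now or is already idle from an
                   earlier iteration. */
            }
            divisor *= 2;
            core_difference *= 2;
        }
        printf("global sum = %d\n", sum[0]);  /* 1+2+...+8 = 36, held by core 0 */
        return 0;
    }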

Homework 1: Parallel Programming Basics
Problem 2: I recently had to tabulate results from a written survey that had four categories of respondents: (I) students; (II) academic professionals; (III) industry professionals; and (IV) other. The number of respondents in each category was very different; for example, there were far more students than other categories. The respondents selected the category to which they belonged and then answered 32 questions with five possible responses: (i) strongly agree; (ii) agree; (iii) neutral; (iv) disagree; and (v) strongly disagree. My family members and I tabulated the results "in parallel" (assume there were four of us).
-(a) Identify how data parallelism can be used to tabulate the results of the survey. Keep in mind that each individual survey is on a separate sheet of paper that only one "processor" can examine at a time. Identify scenarios that might lead to load imbalance with a purely data parallel scheme.
-(b) Identify how task parallelism and combined task and data parallelism can be used to tabulate the results of the survey to improve upon the load imbalance you have identified.

Today's Lecture
Flynn's Taxonomy
Some types of parallel architectures
-Shared memory
-Distributed memory
These platforms are things you will probably use
-CADE Lab1 machines (Intel Nehalem i7)
-Sun Ultrasparc T2 (water, next assignment)
-Nvidia GTX260 GPUs in Lab1 machines
And for fun, Jaguar, the fastest computer in the US
Sources for this lecture:
-Textbook
-Jim Demmel, UC Berkeley
-Notes on various architectures

Reading this week: Chapter 2 in textbook
Chapter 2: Parallel Hardware and Parallel Software
2.1 Some background
-The von Neumann architecture
-Processes, multitasking, and threads
2.2 Modifications to the von Neumann Model
-The basics of caching
-Cache mappings
-Caches and programs: an example
-Virtual memory
-Instruction-level parallelism
-Hardware multithreading
2.3 Parallel Hardware
-SIMD systems
-MIMD systems
-Interconnection networks
-Cache coherence
-Shared-memory versus distributed-memory

An Abstract Parallel Architecture
How is parallelism managed?
Where is the memory physically located? Is it connected directly to processors?
What is the connectivity of the network?

Why are we looking at a bunch of architectures?
There is no canonical parallel computer; rather, a diversity of parallel architectures
-Hence, there is no canonical parallel programming language
Architecture has an enormous impact on performance
-And we wouldn't write parallel code if we didn't care about performance
Many parallel architectures fail to succeed commercially
-Can't always tell what is going to be around in N years
The challenge is to write parallel code that abstracts away architectural features, focuses on their commonality, and is therefore easily ported from one platform to another.

Retrospective (Jim Demmel)
Historically, each parallel machine was unique, along with its programming model and programming language.
It was necessary to throw away software and start over with each new kind of machine.
Now we distinguish the programming model from the underlying machine, so we can write portably correct codes that run on many machines.
Parallel algorithm design challenge is to make this process easy.
-Still somewhat of an open research problem

The von Neumann Architecture
[Figure 2.1: the von Neumann architecture]
Conceptually, a von Neumann architecture executes one instruction at a time.

von Neumann bottleneck

Locality and Parallelism
Large memories are slow, fast memories are small
Program should do most work on local data
[Figure: conventional storage hierarchy -- each processor with its own cache, L2 cache, L3 cache, and memory, with potential interconnects between them]
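A tiny, made-up C example of why "most work on local data" matters: both loop nests below sum the same matrix, but the first walks memory contiguously (C arrays are row-major), so consecutive accesses reuse cache lines, while the second strides across rows and pays a memory-hierarchy penalty on almost every access.

    #include <stdio.h>

    #define N 1024

    static double a[N][N];   /* C stores this row by row (row-major order) */

    int main(void) {
        double sum = 0.0;

        /* Good locality: the inner loop touches consecutive addresses. */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum += a[i][j];

        /* Poor locality: the inner loop strides by N doubles per step. */
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                sum += a[i][j];

        printf("sum = %f\n", sum);
        return 0;
    }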

Uniprocessor and Parallel Architectures
Achieve performance by addressing the von Neumann bottleneck
Reduce memory latency
-Access data from "nearby" storage: registers, caches, scratchpad memory
-We'll look at this in detail in a few weeks
Hide or tolerate memory latency
-Multithreading and, when necessary, context switches while memory is being serviced
-Prefetching, predication, speculation
Uniprocessors that execute multiple instructions in parallel
-Pipelining
-Multiple issue
-SIMD multimedia extensions

How Does a Parallel Architecture Improve on this Further?
Computation and data partitioning focus a single processor on a subset of data that can fit in nearby storage
Can achieve performance gains with simpler processors
-Even if individual processor performance is reduced, throughput can be increased
Complements instruction-level parallelism techniques
-Multiple threads operate on distinct data
-Exploit ILP within a thread
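As a small, hypothetical illustration of computation and data partitioning, the sketch below uses the usual block-partitioning arithmetic to give each of p processors a contiguous chunk of an n-element array; early ranks absorb the remainder when p does not divide n evenly.

    #include <stdio.h>

    /* Processor `rank` (0..p-1) gets the half-open index range [first, last). */
    static void block_range(int n, int p, int rank, int *first, int *last) {
        int base  = n / p;
        int extra = n % p;
        *first = rank * base + (rank < extra ? rank : extra);
        *last  = *first + base + (rank < extra ? 1 : 0);
    }

    int main(void) {
        int n = 10, p = 4;   /* made-up sizes for illustration */
        for (int rank = 0; rank < p; rank++) {
            int first, last;
            block_range(n, p, rank, &first, &last);
            printf("processor %d works on elements [%d, %d)\n", rank, first, last);
        }
        return 0;
    }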

Flynn's Taxonomy
Copyright © 2010, Elsevier Inc. All rights reserved.
SISD: single instruction stream, single data stream (the classic von Neumann machine)
SIMD: single instruction stream, multiple data streams
MISD: multiple instruction streams, single data stream (not covered)
MIMD: multiple instruction streams, multiple data streams

More Concrete: Parallel Control Mechanisms
[Figure comparing parallel control mechanisms; diagram not captured in the transcript]

Two main classes of parallel architecture organizations
Shared memory multiprocessor architectures
-A collection of autonomous processors connected to a memory system.
-Supports a global address space where each processor can access each memory location.
Distributed memory architectures
-A collection of autonomous systems connected by an interconnect.
-Each system has its own distinct address space, and processors must explicitly communicate to share data.
-Clusters of PCs connected by a commodity interconnect are the most common example.

Programming Shared Memory Architectures
A shared-memory program is a collection of threads of control.
-Threads are created at program start or possibly dynamically.
Each thread has private variables, e.g., local stack variables
-Also a set of shared variables, e.g., static variables, shared common blocks, or the global heap.
Threads communicate implicitly by writing and reading shared variables.
Threads coordinate through locks and barriers implemented using shared variables.
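To make the shared-memory model concrete, here is a minimal sketch using POSIX threads (a hypothetical example chosen for illustration; the course may use a different threading interface). Each thread computes a private value and adds it into a shared sum under a lock.

    #include <pthread.h>
    #include <stdio.h>

    #define NUM_THREADS 4

    static long shared_sum = 0;                              /* shared variable */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER; /* shared lock */

    static void *worker(void *arg) {
        long id = (long)arg;         /* private variable: each thread has its own copy */
        long local = (id + 1) * 10;  /* made-up per-thread work */

        pthread_mutex_lock(&lock);   /* coordinate through a lock... */
        shared_sum += local;         /* ...and communicate by writing a shared variable */
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    int main(void) {
        pthread_t threads[NUM_THREADS];
        for (long t = 0; t < NUM_THREADS; t++)
            pthread_create(&threads[t], NULL, worker, (void *)t);
        for (int t = 0; t < NUM_THREADS; t++)
            pthread_join(threads[t], NULL);
        printf("shared_sum = %ld\n", shared_sum);   /* 10 + 20 + 30 + 40 = 100 */
        return 0;
    }

Compile with something like gcc -pthread; the lock is what keeps the concurrent updates to shared_sum from racing.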

Programming Distributed Memory Architectures
A distributed-memory program consists of named processes.
-A process is a thread of control plus a local address space -- NO shared data.
-Logically shared data is partitioned over the local processes.
Processes communicate by explicit send/receive pairs.
-Coordination is implicit in every communication event.
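By contrast, a minimal distributed-memory sketch in MPI (shown only to illustrate explicit send/receive pairs; the exact environment you will use may differ): the variable value lives in each process's own address space, so process 1 must send it explicitly and process 0 must post a matching receive.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[]) {
        int rank, value;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* each named process learns its rank */

        if (rank == 1) {
            value = 42;   /* exists only in process 1's address space */
            MPI_Send(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);        /* explicit send */
        } else if (rank == 0) {
            MPI_Recv(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);                               /* matching receive */
            printf("process 0 received %d from process 1\n", value);
        }

        MPI_Finalize();
        return 0;
    }

Run with at least two processes (e.g., mpirun -np 2); the coordination between the two processes is implicit in the send/receive pair itself.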

Shared Memory Architecture 1: Intel i7 860 Nehalem (CADE Lab1)
Four processor cores, each with a 32KB L1 instruction cache, a 32KB L1 data cache, and a 256KB unified L2 cache
Shared 8MB L3 cache
Up to 16 GB main memory (DDR3 interface)
Cores connected by a bus (interconnect)

More on Nehalem and Lab1 machines -- ILP
Target users are general-purpose
-Personal use
-Games
-High-end PCs in clusters
Support for the SSE 4.2 SIMD instruction set
Hyperthreading: 8 hardware threads (two threads per core on four cores)
Multiscalar execution (4-way issue per thread)
Out-of-order execution
Usual branch prediction, etc.
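As a rough illustration of what an SSE-style SIMD extension buys you (a hypothetical example using compiler intrinsics, not taken from the course materials), the sketch below adds four pairs of floats with a single SIMD instruction.

    #include <stdio.h>
    #include <xmmintrin.h>   /* SSE intrinsics */

    int main(void) {
        float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
        float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
        float c[4];

        __m128 va = _mm_loadu_ps(a);      /* load 4 floats into one 128-bit register */
        __m128 vb = _mm_loadu_ps(b);
        __m128 vc = _mm_add_ps(va, vb);   /* one instruction adds all 4 lanes */
        _mm_storeu_ps(c, vc);

        for (int i = 0; i < 4; i++)
            printf("c[%d] = %.1f\n", i, c[i]);
        return 0;
    }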

Shared Memory Architecture 2: Sun Ultrasparc T2 Niagara (water)
Eight processor cores, each with its own floating-point unit (FPU)
Four memory controllers, each serving two 512KB L2 cache banks (eight banks total)
Full crossbar (interconnect)

More on Niagara
Target applications are server-class business operations
Characterization:
-Floating point?
-Array-based computation?
Support for the VIS 2.0 SIMD instruction set
64-way multithreading (8-way per processor, 8 processors)

Shared Memory Architecture 3: GPUs (Lab1 has Nvidia GTX 260 accelerators)
24 multiprocessors, with 8 SIMD processors per multiprocessor
SIMD execution of warpsize threads (from a single block)
Multithreaded execution across different instruction streams
Complex and largely programmer-controlled memory hierarchy
-Shared device memory
-Per-multiprocessor "shared memory"
-Some other constrained memories (constant and texture memories/caches)
-No standard data cache
[Figure: a device with multiprocessors 0 through 23; each multiprocessor contains an instruction unit, processors 0 through 7 with registers, per-multiprocessor shared memory, a constant cache, and a texture cache; all multiprocessors share device memory]

Jaguar (3rd fastest computer in the world)
Peak performance of 2.33 petaflops
224,256 AMD Opteron cores
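A back-of-the-envelope check of the peak number, under assumptions the slide does not state (2.6 GHz cores, each able to complete 4 floating-point operations per cycle):

    224,256 cores × 4 flops/cycle × 2.6 × 10^9 cycles/second ≈ 2.33 × 10^15 flops/second ≈ 2.33 petaflops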

Shared Memory Architecture 4: Each Socket is a 12-core AMD Opteron Istanbul
[Figure: 6-core "processor" blocks, each with a shared 6MB L3 cache and 8 GB of main memory (DDR3 interface), connected by HyperTransport links (interconnect)]

Shared Memory Architecture 4, continued: Each Socket is a 12-core AMD Opteron Istanbul
Six cores per "processor", each with a 64KB L1 instruction cache, a 64KB L1 data cache, and a 512KB unified L2 cache
Shared 6MB L3 cache
8 GB main memory (DDR3 interface)
HyperTransport link (interconnect)

Jaguar is a Cray XT5 (plus XT4)
Interconnect is a 3-D mesh (3-dimensional toroidal mesh)

Summary of Architectures
Two main classes
Complete connection: CMPs, SMPs, X-bar
-Preserve a single memory image
-Complete connection limits scaling to a small number of processors (say, 32, or 256 with a heroic network)
-Available to everyone (multi-core)
Sparse connection: clusters, supercomputers, networked computers used for parallelism
-Separate memory images
-Can grow "arbitrarily" large
-Available to everyone with LOTS of air conditioning
Programming differences are significant

Brief Discussion
Why is it good to have different parallel architectures?
-Some may be better suited for specific application domains
-Some may be better suited for a particular community
-Cost
-Exploring new ideas
And why different programming models/languages?
-They relate to architectural features
-Application domains, user community, cost, exploring new ideas

Summary of Lecture
Exploration of different kinds of parallel architectures
-And their impact on programming models
Key features
-How are processors connected?
-How is memory connected to processors?
-How is parallelism represented/managed?
Next time
-Memory systems and interconnects
-Models of memory and communication latency