CS4961 Parallel Programming, Lecture 4: Memory Systems and Interconnects. Mary Hall, September 1, 2011.

Administrative
Nikhil office hours:
-Monday, 2-3 PM
-Lab hours on Tuesday afternoons during programming assignments
First homework will be treated as extra credit
-If you turned it in late, or turn it in up until tomorrow, you receive half credit

Homework 2: Due Before Class, Thursday, Sept. 8
Submit with ‘handin cs4961 hw2’
Problem 1 (Coherence), #2.15 in textbook:
(a) Suppose a shared-memory system uses snooping cache coherence and write-back caches. Also suppose that core 0 has the variable x in its cache, and it executes the assignment x = 5. Finally, suppose that core 1 doesn’t have x in its cache, and after core 0’s update to x, core 1 tries to execute y = x. What value will be assigned to y? Why?
(b) Suppose that the shared-memory system in the previous part uses a directory-based protocol. What value will be assigned to y? Why?
(c) Can you suggest how any problems you found in the first two parts might be solved?

Homework 2, cont.
Problem 2 (Bisection width/bandwidth):
(a) What is the bisection width and bisection bandwidth of a 3-d toroidal mesh?
(b) A planar mesh is just like a toroidal mesh, except that it doesn’t have the wraparound links. What is the bisection width and bisection bandwidth of a square planar mesh?
Problem 3 (in general, not specific to any algorithm): How is algorithm selection impacted by the value of λ?

Homework 2, cont.
Problem 4 (λ concept), #2.10 in textbook:
Suppose a program must execute 10^12 instructions in order to solve a particular problem. Suppose further that a single-processor system can solve the problem in 10^6 seconds (about 11.6 days). So, on average, the single-processor system executes 10^6, or a million, instructions per second. Now suppose that the program has been parallelized for execution on a distributed-memory system. Suppose also that if the parallel program uses p processors, each processor will execute 10^12/p instructions, and each processor must send 10^9(p-1) messages. Finally, suppose that there is no additional overhead in executing the parallel program. That is, the program will complete after each processor has executed all of its instructions and sent all of its messages, and there won’t be any delays due to things such as waiting for messages.
(a) Suppose it takes 10^-9 seconds to send a message. How long will it take the program to run with 1000 processors, if each processor is as fast as the single processor on which the serial program was run?
(b) Suppose it takes 10^-3 seconds to send a message. How long will it take the program to run with 1000 processors?

Today’s Lecture
Key architectural features affecting parallel performance:
-Memory systems
-Memory access time
-Cache coherence
-Interconnect
A few more parallel architectures:
-Sunfire SMP
-BG/L supercomputer
An abstract architecture for parallel algorithms
Sources for this lecture:
-Textbook
-Larry Snyder’s lecture notes

An Abstract Parallel Architecture
-How is parallelism managed?
-Where is the memory physically located? Is it connected directly to processors?
-What is the connectivity of the network?

Memory Systems
There are three features of memory systems that affect parallel performance:
-Latency: the time between initiating a memory request and its being serviced by the memory (in microseconds).
-Bandwidth: the rate at which the memory system can service memory requests (in GB/s).
-Coherence: since all processors can view all memory locations, coherence is the property that all processors see the memory image in the same state.

Uniform Memory Access (UMA) multicore system (Figure 2.5)
Time to access all the memory locations will be the same for all the cores.

Non-uniform Memory Access (NUMA) multicore system (Figure 2.6)
A memory location a core is directly connected to can be accessed faster than a memory location that must be accessed through another chip.

Cache coherence
Programmers have no control over caches and when they get updated.
(Figure 2.17: a shared-memory system with two cores and two caches)

Cache coherence
x = 2; /* shared variable */
y0 privately owned by Core 0
y1 and z1 privately owned by Core 1
y0 eventually ends up = 2
y1 eventually ends up = 6
z1 = ???
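A sketch of the statements behind these values. Only the results appear on the slide, so the statement order and the later assignment x = 7 are assumptions, reconstructed to match the style of the textbook’s example:

    int x = 2;                 /* shared variable, initially 2           */

    /* Time 0:  Core 0: y0 = x;      -> y0 = 2                           */
    /*          Core 1: y1 = 3 * x;  -> y1 = 3 * 2 = 6                   */
    /* Time 1:  Core 0: x = 7;       updates core 0's cached copy; with
                                     write-back caches and no coherence,
                                     core 1's copy may stay stale        */
    /* Time 2:  Core 1: z1 = 4 * x;  -> 4 * 7 = 28 if core 1 sees the
                                     update, 4 * 2 = 8 if it reads its
                                     stale cached copy: hence z1 = ???   */

Without a coherence mechanism, the value of z1 depends on when (or whether) core 0’s update reaches core 1, which is exactly the problem snooping and directory protocols solve.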

Snooping Cache Coherence
-The cores share a bus. Any signal transmitted on the bus can be “seen” by all cores connected to the bus.
-When core 0 updates the copy of x stored in its cache, it also broadcasts this information across the bus.
-If core 1 is “snooping” the bus, it will see that x has been updated and can mark its copy of x as invalid.
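A minimal write-invalidate sketch of the mechanism in C. This is illustrative, not a real protocol such as MESI; the one-line-per-core cache and all names are simplifying assumptions:

    #include <stdbool.h>

    #define NUM_CORES 4

    typedef struct {
        int  tag;        /* which memory block this line holds   */
        int  data;
        bool valid;
    } cache_line_t;

    cache_line_t cache[NUM_CORES];  /* one line per core, for brevity */

    /* Every write is broadcast on the bus; every other core snoops it
       and invalidates a matching copy. */
    void bus_broadcast_write(int writer, int tag) {
        for (int c = 0; c < NUM_CORES; c++) {
            if (c != writer && cache[c].valid && cache[c].tag == tag)
                cache[c].valid = false;   /* mark the stale copy invalid */
        }
    }

    void core_write(int core, int tag, int value) {
        cache[core].tag   = tag;
        cache[core].data  = value;
        cache[core].valid = true;
        bus_broadcast_write(core, tag);   /* snooping cores invalidate */
    }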

Directory-Based Cache Coherence
-Uses a data structure called a directory that stores the status of each cache line.
-When a variable is updated, the directory is consulted, and the cache controllers of the cores that have that variable’s cache line in their caches invalidate their copies.
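A minimal sketch of a directory entry and the write path it implies, assuming a bit-vector of sharers; the names and the commented-out send_invalidate message are hypothetical:

    #include <stdint.h>
    #include <stdbool.h>

    typedef struct {
        uint32_t sharers;  /* bit i set => core i caches this block     */
        bool     dirty;    /* true if some core holds a modified copy   */
    } dir_entry_t;

    /* On a write by `core`, consult the directory, send an invalidation
       to every other sharer, then record the writer as sole owner. */
    void directory_write(dir_entry_t *e, int core) {
        for (int c = 0; c < 32; c++) {
            if (c != core && (e->sharers & (1u << c))) {
                /* send_invalidate(c);  -- hypothetical message send */
            }
        }
        e->sharers = 1u << core;   /* writer is now the only sharer */
        e->dirty   = true;
    }

Unlike snooping, nothing is broadcast: only the cores the directory lists as sharers receive traffic, which is why directories scale to distributed-memory systems where a shared bus does not.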

Definitions for Homework Problem #1
Write-through cache
-A cache line is written to main memory when it is written to the cache.
Write-back cache
-The data is marked as dirty in the cache, and written back to memory when it is replaced by a new cache line.

False Sharing
-A cache line contains more than one machine word.
-When multiple processors access the same cache line, the hardware treats it like a potential race condition, even if they access different elements.
-Can cause significant coherence traffic even though no data is actually shared.
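A small C/pthreads sketch of how false sharing arises and one common fix, padding. The 64-byte line size and all names are assumptions (compile with gcc -pthread):

    #include <pthread.h>

    long counters[2];            /* adjacent words: one cache line      */

    void *worker(void *arg) {
        long id = (long)arg;
        for (long i = 0; i < 100000000; i++)
            counters[id]++;      /* each write invalidates the other
                                    core's copy of the shared line      */
        return NULL;
    }

    /* Fix: pad each counter so it occupies its own (assumed 64-byte)
       cache line, at the cost of wasted space. */
    struct { long value; char pad[64 - sizeof(long)]; } padded[2];

    int main(void) {
        pthread_t t0, t1;
        pthread_create(&t0, NULL, worker, (void *)0);
        pthread_create(&t1, NULL, worker, (void *)1);
        pthread_join(t0, NULL);
        pthread_join(t1, NULL);
        return 0;
    }

The two threads never touch the same word, yet the line ping-pongs between caches; switching the loop to the padded version typically removes the coherence traffic.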

Shared memory interconnects
Bus interconnect
-A collection of parallel communication wires together with some hardware that controls access to the bus.
-Communication wires are shared by the devices that are connected to the bus.
-As the number of devices connected to the bus increases, contention for use of the bus increases, and performance decreases.

Shared memory interconnects
Switched interconnect
-Uses switches to control the routing of data among the connected devices.
-Crossbar:
-Allows simultaneous communication among different devices.
-Faster than buses.
-But the cost of the switches and links is relatively high.
-The number of switches grows as n², making crossbars impractical for large n (e.g., on the order of a million crossing points for 1,000 nodes).

Figure 2.7:
(a) A crossbar switch connecting 4 processors (Pi) and 4 memory modules (Mj)
(b) Configuration of internal switches in a crossbar
(c) Simultaneous memory accesses by the processors

SunFire E25K
-4 UltraSparcs
-Dotted lines represent snooping
-18 boards connected with crossbars
-Basically the limit
-Increasing processors per node will, on average, increase congestion

Distributed memory interconnects
Two groups:
-Direct interconnect
-Each switch is directly connected to a processor-memory pair, and the switches are connected to each other.
-Indirect interconnect
-Switches may not be directly connected to a processor.
The cost of an interconnect can be a big part of the cost of a machine, governed by:
-Number of links
-Number of switches
-Available bandwidth (highly optimized links are expensive)

Direct interconnect (Figure 2.8): a ring and a toroidal mesh

Like memory systems, we also consider bandwidth and latency in message passing.
Message transmission time = l + n/b, where:
-l = latency (seconds)
-b = bandwidth (bytes per second)
-n = length of message (bytes)
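A worked example with assumed numbers (not from the slides): a 1 MB message over a link with 5 microseconds latency and 1 GB/s bandwidth:

    t = l + n/b
      = 5 x 10^-6 s + (10^6 bytes) / (10^9 bytes/s)
      = 5 x 10^-6 s + 10^-3 s
      = 1.005 x 10^-3 s  (about 1 ms)

For small messages the latency term l dominates; for large messages the n/b term does.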

Definitions
Bisection width
-A measure of “number of simultaneous communications” or “connectivity”.
-How many simultaneous communications can take place “across the divide” between the halves?
Bisection bandwidth
-A measure of network quality.
-Instead of counting the number of links joining the halves, it sums the bandwidth of the links.

Two bisections of a ring (Figure 2.9)

A bisection of a toroidal mesh (Figure 2.10)
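Putting the numbers behind Figures 2.9 and 2.10 together (these follow from the definitions above):

    Ring with p nodes:               bisection width = 2
    Square toroidal mesh, p = q^2:   bisection width = 2q = 2*sqrt(p)
      (the cut crosses q rings, each contributing 2 links
       because of the wraparound)
    Bisection bandwidth = (bisection width) x (bandwidth per link),
      assuming all links have the same bandwidth.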

Fully connected network (Figure 2.11)
-Each switch is directly connected to every other switch.
-Bisection width = p²/4
-Impractical to build for large p

Indirect interconnects
Simple examples of indirect networks:
-Crossbar
-Omega network
Often shown with unidirectional links and a collection of processors, each of which has an outgoing and an incoming link, and a switching network.

A generic indirect network (Figure 2.13)

Supercomputer: BG/L Node
-Actually has multiple interconnects

BG/L Interconnect
-Separate networks for control and data
-Can then specialize network implementation for type of message
-Also reduces congestion
-3-d torus
-Collective network

Blue Gene/L Specs
-BG/L was the fastest computer in the world (#1 on the Top500 list) when the textbook was published
-A 64x32x32 torus = 65K 2-core processors
-Cut-through routing gives a worst-case latency of 6.4 µs
-Processor nodes are dual PPC-440 with “double hummer” FPUs
-Collective network performs global reduce for the “usual” functions

Summary of Architectures
Two main classes:
Complete connection: CMPs, SMPs, X-bar
-Preserve single memory image
-Complete connection limits scaling to …
-Available to everyone (multi-core)
Sparse connection: Clusters, Supercomputers, Networked computers used for parallelism (Grid)
-Separate memory images
-Can grow “arbitrarily” large
-Available to everyone with LOTS of air conditioning
Programming differences are significant

Parallel Architecture Model
How do we develop portable parallel algorithms for current and future parallel architectures, a moving target?
Strategy:
-Adopt an abstract parallel machine model for use in thinking about algorithms
1. Review how we compare algorithms on sequential architectures
2. Introduce the CTA model (Candidate Type Architecture)
3. Discuss how it relates to today’s set of machines

How did we do it for sequential architectures?
Sequential model: Random Access Machine
-Control, ALU, (unlimited) memory, [input, output]
-Fetch/execute cycle runs one instruction pointed at by the PC
-Memory references are “unit time,” independent of location
-Gives the RAM its name, in preference to von Neumann
-“Unit time” is not literally true, but caches provide that illusion when effective
-Executes “3-address” instructions
The focus in developing sequential algorithms, at least in courses, is on reducing the amount of computation (useful even if imprecise)
-Treat memory time as negligible
-Ignore overheads

Interesting Historical Parallel Architecture Model: PRAM
Parallel Random Access Machine (PRAM)
-Unlimited number of processors
-Processors are standard RAM machines, executing synchronously
-Memory reference is “unit time”
-Outcome of collisions at memory is specified
-EREW, CREW, CRCW, …
Model fails to capture true performance behavior:
-Synchronous execution with unit-cost memory references does not scale
-Therefore, parallel hardware typically implements non-uniform-cost memory references

Candidate Type Architecture (CTA Model)
-A model with P standard processors, degree d, and latency λ
-Node == processor + memory + NIC
-Key property: a local memory reference costs 1; a global memory reference costs λ
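As a concrete (and purely illustrative) way to use the model, the sketch below estimates running time by charging 1 per local reference and λ per global reference. The λ value of 500 and the reference counts are assumptions chosen for the example, not values from the slides:

    #include <stdio.h>

    /* CTA cost model: each local reference costs 1 time unit, each
       non-local (global) reference costs lambda. Illustrative only. */
    double cta_time(double local_refs, double global_refs, double lambda) {
        return local_refs + lambda * global_refs;
    }

    int main(void) {
        double lambda = 500.0;  /* assumed latency, for illustration */
        /* Strategy A: more total references, but mostly local.      */
        printf("A: %g\n", cta_time(1e6, 1e4, lambda));  /* 6.0e6  */
        /* Strategy B: fewer references, but communication-heavy.    */
        printf("B: %g\n", cta_time(1e5, 1e5, lambda));  /* 5.01e7 */
        return 0;
    }

Under these assumptions, strategy A wins by almost an order of magnitude despite doing more total work, which is the point the model is built to expose.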

Estimated Values for Lambda
-Captures the inherent property that data locality is important.
-But different values of λ can lead to different algorithm strategies.

Key Lesson from CTA
Locality Rule:
-Fast programs tend to maximize the number of local memory references and minimize the number of non-local memory references.
This is the most important thing you will learn in this class!
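A C sketch of the locality rule applied to a distributed sum. Here global_add stands in for whatever remote-update mechanism a machine provides; it is stubbed locally so the fragment compiles, and all names are hypothetical:

    /* Hypothetical remote-update primitive: on a real machine this
       would be a message or remote atomic; stubbed so it compiles. */
    void global_add(double *loc, double v) { *loc += v; }

    /* Bad: every element touches a remote location, so the CTA cost
       is roughly my_n * lambda per node. */
    void sum_nonlocal(double *my_data, long my_n, double *global_sum) {
        for (long i = 0; i < my_n; i++)
            global_add(global_sum, my_data[i]);  /* lambda cost each */
    }

    /* Good: accumulate locally (unit cost per element), then make a
       single global update, so the cost is roughly my_n + lambda. */
    void sum_local(double *my_data, long my_n, double *global_sum) {
        double local = 0.0;
        for (long i = 0; i < my_n; i++)
            local += my_data[i];                 /* unit cost each    */
        global_add(global_sum, local);           /* one lambda update */
    }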

Summary of Lecture
Memory Systems
-UMA vs. NUMA
-Cache coherence
Interconnects
-Bisection bandwidth
Models for parallel architectures
-CTA model