1 Friday, September 22, 2006 If one ox could not do the job they did not try to grow a bigger ox, but used two oxen. -Grace Murray Hopper (1906-1992)

2 Today §Block matrix operations §Network topologies

3 Strided access §Stride: a sequence of memory reads and writes to addresses, each separated from the last by a constant interval called the stride length §Unit stride: consecutive elements are accessed (stride of one element), the most cache-friendly access pattern

4
do i = 1, N
  do j = 1, N
    A[i] = A[i] + B[j]
  enddo
enddo
N is large, so B[j] cannot remain in cache until it is used again in the next iteration of the outer loop. Little reuse between touches. How many cache misses for A and B?

5 Blocking
do i = 1, N
  do j = 1, N, S
    do jj = j, MIN(j+S-1, N)
      A[i] = A[i] + B[jj]
    enddo
  enddo
enddo
Original loop:
do i = 1, N
  do j = 1, N
    A[i] = A[i] + B[j]
  enddo
enddo

6 Blocking
do j = 1, N, S
  do i = 1, N
    do jj = j, MIN(j+S-1, N)
      A[i] = A[i] + B[jj]
    enddo
  enddo
enddo
Original loop:
do i = 1, N
  do j = 1, N
    A[i] = A[i] + B[j]
  enddo
enddo
S is the maximum number of elements of B that can remain in cache between two iterations of the i loop. This transformation is called blocking (or strip mining). How many cache misses for A and B?

7 Operation Count vs. Memory Operations §Example: matrix multiplication performs O(n³) operations on O(n²) data, so each element can in principle be reused O(n) times §Previous example? O(N²) operations on O(N) data, so reuse is limited to the elements of B within a block

8 §Block matrix operations

9 Matrix multiplication
int i, j, k;
for (i = 0; i < n; i++) {
  for (j = 0; j < n; j++) {
    for (k = 0; k < n; k++) {
      c[i][j] = c[i][j] + a[i][k] * b[k][j];
    }
  }
}
Remember to initialize c[i][j] to zero.

10 Matrix multiplication with blocking
int i, j, k, ii, jj, kk;
for (ii = 0; ii < n; ii += S) {
  for (jj = 0; jj < n; jj += S) {
    for (kk = 0; kk < n; kk += S) {
      for (i = ii; i < min(ii + S, n); i++) {
        for (j = jj; j < min(jj + S, n); j++) {
          for (k = kk; k < min(kk + S, n); k++) {
            c[i][j] = c[i][j] + a[i][k] * b[k][j];
          }
        }
      }
    }
  }
}
Remember to initialize c[i][j] to zero.

11 Exercise §Matrix Vector Multiplication

12 Cache coherence in multiprocessor systems §Suppose two processors on a shared bus have loaded the same variable. §If one processor changes the value of that variable, then what?

13 Cache coherence in multiprocessor systems §Suppose two processors on a shared bus have loaded the same variable. §If one processor changes the value of that variable, the hardware must either: l Invalidate the other copies l Update the other copies

15 Cache coherence in multiprocessor systems §What if a processor reads a data item only once initially? Under an update protocol, every later write by another processor is broadcast to a copy that will never be read again, wasting bus bandwidth. §The invalidate protocol is therefore more commonly used.

16 False Sharing (multiprocessor) §Two processors are accessing different data items in the same cache block. §What happens if they both attempt to write to it?

17 False Sharing (multiprocessor) §Two processors are accessing different data items in the same cache block. §What happens if they both attempt to write to it? Each write invalidates the other processor's copy, so the block ping-pongs between the caches even though no data item is actually shared. §Padding in data structures can place the items in different blocks (a space vs. time tradeoff)

18 Network Topologies §Bus-based, crossbar, and multistage networks §Earth Simulator: crossbar §IBM SP-2: multistage network

19 Network Topologies §A completely connected network requires a large number of links: n(n-1)/2 for n nodes. §A star topology needs only n-1 links, but the central node is a bottleneck.

20 Network Topologies §1-D torus (ring) §Intel Paragon: 2-D mesh §BlueGene/L: 3-D torus §Cray T3E: 3-D torus (a 3-D cube with wraparound links)

21 §2-D and 3-D meshes are common in parallel computers §Regularly structured computation maps naturally to a 2-D mesh §3-D network topologies suit 3-D problem domains such as weather modeling and structural modeling