CSE 260 – Introduction to Parallel Computation
Topic 6: Models of Parallel Computers
October 11-18, 2001

Presentation transcript:

Slide 1: CSE 260 – Introduction to Parallel Computation
Topic 6: Models of Parallel Computers, October 11-18, 2001

Slide 2: Models of Computation
What's a model good for??
– Provides a way to think about computers. Influences the design of architectures, languages, and algorithms.
– Provides a way of estimating how well a program will perform. Cost in the model should be roughly the same as the cost of executing the program.

Slide 3: Outline
– RAM model of sequential computing
– PRAM
– Fat tree
– PMH
– BSP
– LogP

Slide 4: The Random Access Machine Model
RAM model of serial computers:
– Memory is a sequence of words, each capable of containing an integer.
– Each memory access takes one unit of time.
– Basic operations (add, multiply, compare) take one unit of time.
– Instructions are not modifiable.
– Read-only input tape, write-only output tape.

Slide 5: Has RAM influenced our thinking?
Language design: No way to designate registers, cache, or DRAM. The most convenient disk access is as streams. How do you express atomic read/modify/write?
Machine & system design: It's not very easy to modify code. Systems pretend instructions are executed in order.
Performance analysis: Primary measures are operations/sec (MFlop/sec, MHz, ...). What's the difference between Quicksort and Heapsort??

Slide 6: What about parallel computers?
The RAM model is generally considered a very successful "bridging model" between programmer and hardware. "Since RAM is so successful, let's generalize it for parallel computers..."

Slide 7: PRAM [Parallel Random Access Machine]
A PRAM is composed of:
– P processors, each with its own unmodifiable program.
– A single shared memory composed of a sequence of words, each capable of containing an arbitrary integer.
– A read-only input tape.
– A write-only output tape.
The PRAM model is a synchronous, MIMD, shared address space parallel computer. (Introduced by Fortune and Wyllie, 1978.)

Slide 8: More PRAM taxonomy
Different protocols can be used for reading and writing shared memory.
– EREW – exclusive read, exclusive write: a program isn't allowed to have two processors access the same memory location at the same time.
– CREW – concurrent read, exclusive write.
– CRCW – concurrent read, concurrent write: needs a protocol for arbitrating write conflicts.
– CROW – concurrent read, owner write: each memory location has an official "owner".
A PRAM can emulate a message-passing machine by partitioning memory into private memories.

Slide 9: Broadcasting on a PRAM
"Broadcast" can be done on a CREW PRAM in O(1) steps:
– Broadcaster sends the value to shared memory.
– Processors read it from shared memory.
Requires lg(P) steps on an EREW PRAM (see the sketch below).
[Figure: broadcaster B writes to shared memory M, read by processors P ... P.]
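A minimal serial simulation of the lg(P) EREW broadcast by recursive doubling; in each round, every processor that already holds the value copies it into one fresh cell, so no cell is read or written twice in the same round. The cell layout and P = 8 are illustrative assumptions, not part of the slide.

    /* Sketch: EREW broadcast by recursive doubling, simulated serially. */
    #include <stdio.h>

    #define P 8

    int main(void) {
        int cell[P] = {42};   /* cell[0] starts with the broadcaster's value */
        int have = 1;         /* processors currently holding the value */
        int rounds = 0;
        while (have < P) {
            for (int i = 0; i < have && have + i < P; i++)
                cell[have + i] = cell[i];   /* processor i writes cell have+i */
            have *= 2;
            rounds++;
        }
        printf("value broadcast to %d processors in %d rounds\n", P, rounds);
        return 0;   /* prints 3 rounds = lg(8) */
    }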

Slide 10: Finding Max on a CRCW PRAM
We can find the max of N distinct numbers x[1], ..., x[N] in constant time using N^2 processors!
Number the processors P_rs with r, s in {1, ..., N}.
– Initialization: P_1s sets A[s] = 1.
– Eliminate non-maxes: if x[r] < x[s], P_rs sets A[r] = 0. Requires concurrent reads & writes.
– Find winner: if A[r] = 1, P_r1 sets max = x[r].
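A serial walk-through of the three CRCW steps; the loops stand in for the N^2 virtual processors P_rs, and the sample values are made up for illustration. All concurrent writes to A[r] write the same value (0), so any CRCW arbitration rule gives the same result.

    /* Sketch: the O(1)-time CRCW max algorithm, simulated serially. */
    #include <stdio.h>

    #define N 5

    int main(void) {
        int x[N + 1] = {0, 7, 3, 9, 1, 4};  /* 1-based, distinct values */
        int A[N + 1], max = 0;

        for (int s = 1; s <= N; s++)        /* step 1: P_1s, in parallel */
            A[s] = 1;
        for (int r = 1; r <= N; r++)        /* step 2: P_rs, in parallel */
            for (int s = 1; s <= N; s++)
                if (x[r] < x[s])
                    A[r] = 0;               /* concurrent write of 0 */
        for (int r = 1; r <= N; r++)        /* step 3: P_r1, in parallel */
            if (A[r] == 1)
                max = x[r];

        printf("max = %d\n", max);          /* prints: max = 9 */
        return 0;
    }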

Slide 11: Some questions
1. What if the x[i]'s aren't necessarily distinct?
2. Can you sort N numbers in constant time, using only N^k processors (for some k)?
3. How fast can you sort on a CREW PRAM?
4. Does any of this have any practical significance????

Slide 12: PRAM is not a great success
Many theoretical papers about fine-grained algorithmic techniques and distinctions between the various modes. The results seem irrelevant:
– Performance predictions are inaccurate.
– Hasn't led to programming languages.
– Hardware doesn't have fine-grained synchronous steps.

Slide 13: Fat Tree Model (Leiserson, 1985)
– Processors at leaves of the tree.
– A group of k^2 processors is connected by a k-width bus.
– k^2 processors fit in (k lg 2k)^2 area.
– Area-universal: can simulate t steps of any p-processor computer in t lg p steps.

Slide 14: Fat Tree Model inspired the CM-5
Up to 1024 nodes in a fat tree:
– 20 MB/sec/node within a group-of-4.
– 10 MB/sec/node within a group-of-16.
– 5 MB/sec/node among larger groups.
Node = 33 MHz SPARC plus four 33 MFlop/sec vector units.
Plus a fast, narrow "control network" for parallel prefix operations.

Slide 15: What happened to fat trees?
The CM-5 had many interesting features:
– Active-message VSM software layer.
– Randomized routing.
– Fast control network.
It was somewhat successful, but died anyway:
– Using the floating point unit well wasn't easy.
– Perhaps not sufficiently COTS-like to compete.
Fat trees live on, but aren't highlighted...
– IBM SP and others have less bandwidth between cabinets than within a cabinet.
– Seen more as a flaw than a feature.

Slide 16: Another look at the RAM model
RAM analysis says matrix multiply is O(N^3):
    for i = 1 to N
        for j = 1 to N
            for k = 1 to N
                C[i,j] += A[i,k]*B[k,j]
Is it??
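The same loop nest as runnable C, a sketch assuming column-major (Fortran-style) storage to match the RS/6000 discussion on the next slides; the function name and the flat-array layout are my choices, not the slides'.

    #include <stddef.h>

    /* Naive O(N^3)-operation matrix multiply, the slide's loop nest.
       A, B, C are N x N, stored column-major: element (i,j) is at i + j*N.
       As k varies, B is walked with stride 1 (down a column) while A is
       walked with stride N (across a row) -- the access pattern analyzed
       on slide 19. */
    void matmul_naive(size_t N, const double *A, const double *B, double *C) {
        for (size_t i = 0; i < N; i++)
            for (size_t j = 0; j < N; j++)
                for (size_t k = 0; k < N; k++)
                    C[i + j * N] += A[i + k * N] * B[k + j * N];
    }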

Slide 17: Matrix Multiply on RS/6000
[Plot: execution time vs. matrix size; the measured times fit T = N^4.7.]
O(N^3) performance would show constant cycles/flop; the measured performance looks much closer to O(N^5).
Size 2000 took 5 days; [...] would take 1095 years.

CSE Models18 Column major storage layout cachelines Blue row of matrix is stored in red cacheline

Slide 19: Memory Accesses in Matrix Multiply
    for i = 1 to N
        for j = 1 to N
            for k = 1 to N
                C[i,j] += A[i,k]*B[k,j]
– B: sequential access through the entire matrix. When the cache (or TLB, or memory) can't hold the entire B matrix, there will be a miss on every line.
– A: stride-N access to one row.* When the cache (or TLB, or memory) can't hold a row of A, there will be a miss on each access.
* assumes data is in column-major order
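A back-of-envelope version of that count, as a hedged sketch; the function name and the assumption that every A access and every B cacheline misses are mine, not the slide's.

    /* Rough miss model for the naive loop nest when the cache can hold
       neither B nor a row of A (column-major layout assumed):
       - B is read sequentially, so it misses once per cacheline,
       - A is read with stride N, so it misses on every access. */
    double naive_miss_estimate(double N, double words_per_line) {
        double B_misses = (N * N * N) / words_per_line;
        double A_misses = N * N * N;
        return A_misses + B_misses;
    }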

Slide 20: Matrix Multiply on RS/6000
[Plot: cycles per iteration vs. matrix size, with annotated thresholds: a page miss every 512 iterations, then a page miss every iteration, a TLB miss every iteration, and a cache miss every 16 iterations.]

Slide 21: Where are we?
– The RAM model says naive matrix multiply is O(N^3).
– Experiments show it's O(N^5)-ish.
– The explanation involves cache, TLB, and main memory limits and block sizes.
Conclusion: memory features are important and should be included in the model.

Slide 22: Models of memory behavior
Uniprocessor models looking at data access costs:
– Two-level models (main memory & cache): Floyd ('72), Hong & Kung ('81).
– Hierarchical Memory Model: accessing memory location i costs f(i). Aggarwal, Alpern, Chandra & Snir ('87).
– Block Transfer Model: moving a block of length k at location i costs k + f(i). Aggarwal, Chandra & Snir ('87).
– Memory Hierarchy Model: multilevel memory, block moves, extends to parallelism. Alpern & Carter ('90).

Slide 23: Memory Hierarchy model
A uniprocessor is a sequence of memory modules:
– The highest level is a large, low-speed memory.
– The processor (level 0) is a tiny, high-speed memory.
Modules are connected by channels; all channels can be active simultaneously.
Data are moved in fixed-size blocks:
– A block is a chunk of contiguous data.
– Block size depends on the level.
[Figure: DISK - DRAM - cache - regs, connected in a chain.]

Slide 24: Does the MH model influence your thinking?
Say your computer is a sequence of modules. You want to move data to the fast one at the bottom, and moving contiguous chunks of data is faster. How do you accomplish this??
One possible answer: divide & conquer.
(Mini project – does the model suggest anything for your favorite algorithm?)

Slide 25: Visualizing Matrix Multiplication
[Figure: the computation cube for C = A B, with axes i and j.]
A "stick" of computation is the dot product of a row of A with a column of B:
c_ij = Σ_k a_ik b_kj

Slide 26: Visualizing Matrix Multiplication
[Figure: the same cube partitioned into sub-cubes.]
A "cubelet" of computation is the product of a submatrix of A with a submatrix of B.
– Data involved is proportional to surface area.
– Computation is proportional to volume.

Slide 27: MH algorithm for C = AB
Partition the computation into "cubelets": each cubelet requires an s x s submatrix of A and of B, so 3 s^2 data are needed, and it allows s^3 multiply-adds (see the sketch below).
The parent module gives a child a sequence of cubelets; choose s to ensure all the data fit into the child's memory. The child sub-partitions each cubelet into still smaller pieces.
This was known as "blocking" or "tiling" long before the MH model was invented (but was rarely applied recursively).
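One level of that recursion, sketched in C; the function name, flat column-major layout, and edge handling are my choices (the MH algorithm would apply the same split recursively at every memory level).

    /* One level of tiling: s x s "cubelets", with s chosen so that three
       s x s submatrices (of A, B, and C) fit in the child's memory. */
    #include <stddef.h>

    void matmul_tiled(size_t N, size_t s,
                      const double *A, const double *B, double *C) {
        for (size_t jj = 0; jj < N; jj += s)
          for (size_t kk = 0; kk < N; kk += s)
            for (size_t ii = 0; ii < N; ii += s)
              /* one cubelet: C[ii.., jj..] += A[ii.., kk..] * B[kk.., jj..] */
              for (size_t j = jj; j < jj + s && j < N; j++)
                for (size_t k = kk; k < kk + s && k < N; k++) {
                  double b = B[k + j * N];     /* column-major layout */
                  for (size_t i = ii; i < ii + s && i < N; i++)
                      C[i + j * N] += A[i + k * N] * b;
                }
    }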

Slide 28: Theory of the MH algorithm for C = AB
The "Uniform" Memory Hierarchy (UMH) model looks similar to actual computers: block size, number of blocks per module, and transfer time per item grow by a constant factor per level.
– Naive matrix multiplication is O(N^5) on UMH, similar to the observed performance.
– The tiled algorithm is O(N^3) on UMH, and gets about 90% of "peak performance" on many computers.
Moral: good MH algorithm => good in practice.

Slide 29: Visualizing computers in the MH model
– Height of a module = lg(blocksize).
– Width = lg(number of blocks).
– Length of a channel = lg(transfer time).
[Figure: two DISK-DRAM-cache-regs chains. One computer is reasonably well balanced; the other isn't: it doesn't satisfy the "wide cache principle" (square submatrices don't fit), and its bandwidth is too low.]

Slide 30: Parallel Memory Hierarchy (PMH) model
Alpern & Carter: "Since the MH model is so great, let's generalize it for parallel computers!"
A computer is a tree of memory modules:
– The largest memory is at the root.
– Children have less memory and more compute power.
Four parameters per module: block size, number of blocks, transfer time from the parent, and number of children.
Homogeneous => all modules at a level have the same parameters.
(PMH ignores the difference between shared and distributed address space computation.)

Slide 31: Some Parallel Architectures
[Figures: four machines drawn as module trees – a NOW (network over main memories, caches, and registers, with disks); a shared-disk system (network over disks, then main memories and caches); a vector supercomputer (extended storage, main memory, scalar cache and vector registers, DISK); and "The Grid".]

Slide 32: PMH model of a multi-tier computer
[Figure: a tree rooted at Secondary Storage (magnetic storage), then an internodal network over nodes of DRAM; within each node, L2 and L1 caches (SRAM) above processors P1 ... Pn, whose registers feed the functional units.]

Slide 33: Observations
– PMH can model heterogeneous systems as well as homogeneous ones.
– More expensive computers have more parallelism and higher bandwidth near the leaves.
– Computers are getting more levels & more branching.
– Parallelizing code for PMH is very similar to tuning it for a memory hierarchy: break the computation into independent blocks (needed for parallelization) and send blocks of work to the children.

Slide 34: BSP (Bulk Synchronous Parallel) Model
Valiant, "A Bridging Model for Parallel Computation", CACM, Aug '90.
CORRECTION!! I have been confusing BSP with the "Phase PRAM" model (Gibbons, SPAA '89), which indeed is a shared-memory model with periodic barrier synchronizations.
In BSP, each processor has local memory. A "one-sided" communication style is advocated:
– There are globally known "symbolic addresses" (like VSM).
– Data may be inconsistent until the next barrier synchronization.
– Valiant suggests a hashing implementation of puts and gets.

Slide 35: BSP Programs
BSP programs are composed of supersteps (see the sketch below):
– In each superstep, processors execute up to L computational steps using locally stored data, and can also send and receive messages.
– Processors synchronize at the end of the superstep (at which time all messages have been received).
Oxford BSP is a library of C routines for implementing BSP programs. It provides:
– Direct Remote Memory Access (a VSM layer).
– Bulk Synchronous Message Passing (somewhat like non-blocking message passing in MPI).
[Figure: a sequence of supersteps, each followed by a synch.]
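A minimal superstep sketch written against BSPlib, the descendant of the Oxford BSP library; it assumes the standard BSPlib primitives (bsp_begin, bsp_push_reg, bsp_put, bsp_sync) rather than quoting the slide's exact library.

    /* One-sided broadcast in two supersteps: register a symbolic address,
       then processor 0 puts a value into every other processor's copy. */
    #include <bsp.h>
    #include <stdio.h>

    int main(void) {
        bsp_begin(bsp_nprocs());
        int value = 0;
        bsp_push_reg(&value, sizeof value);  /* a "symbolic address" */
        bsp_sync();                          /* registration takes effect */

        if (bsp_pid() == 0) {
            value = 42;
            for (int p = 1; p < bsp_nprocs(); p++)
                bsp_put(p, &value, &value, 0, sizeof value);
        }
        bsp_sync();                          /* puts visible after the barrier */

        printf("processor %d holds %d\n", bsp_pid(), value);
        bsp_end();
        return 0;
    }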

Slide 36: Parameters of the BSP Model
– P = number of processors.
– s = processor speed (steps/second); observed, not "peak".
– L = time to do a barrier synchronization (steps/synch).
– g = cost of sending a message (steps/word); measure g when all processors are communicating.
– h0 = minimum number of messages per superstep. For h >= h0, the cost of sending h messages is hg. h0 is similar to the block size in the PMH model.

Slide 37: BSP Notes
– The number of processors in the model can be greater than the number of processors of the machine (this makes it easier for the computer to complete the remote memory operations).
– Not all processors need to join a barrier synch.
– Time for a superstep = (1/s) x ( max over processors of operations performed + g x max(messages sent or received by a processor, h0) + L ).
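That cost formula transcribed directly into a helper function; the names and units are mine, with w_max and h_max standing for the per-superstep maxima over processors.

    /* BSP superstep time in seconds: (w_max + g*max(h_max, h0) + L) / s,
       where w_max is the most computation steps done by any processor and
       h_max the most messages sent or received by any processor. */
    double bsp_superstep_seconds(double s, double w_max, double h_max,
                                 double g, double L, double h0) {
        double h = (h_max > h0) ? h_max : h0;
        return (w_max + g * h + L) / s;
    }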

Slide 38: Some representative BSP parameters
[Table, from oldwww.comlab.ox.ac.uk/oucl/groups/bsp/index.html (1998): for four P = 8 machines – a Pentium II NOW on switched Ethernet, a Cray T3E, an IBM SP, and a Pentium NOW on serial Ethernet – it lists s (MFlop/s), L (flops/synch), g (flops/word, 32-bit), and n_1/2 for h0; the numeric entries are not recoverable here.]
NOTE: Benchmarks for determining s were not tuned.

Slide 39: LogP Model
Developed by Culler, Karp, Patterson, etc. – famous guys at Berkeley.
Models communication costs in a multicomputer.
Influenced by MPP architectures (circa 1993), notably the CM-5:
– each node is a powerful processor with large memory
– the interconnection structure has limited bandwidth
– the interconnection structure has significant latency

Slide 40: LogP parameters
– L: latency – time for a message to go from P_sender to P_receiver.
– o: overhead – time either processor is occupied sending or receiving a message. The processor can't do anything else for o cycles.
– g: gap – minimum time between messages. A processor can have at most ceil(L/g) messages in transit at a time. The gap includes overhead time (so overhead <= gap).
– P: number of processors.
L, o, and g are measured in cycles.

Slide 41: Efficient Broadcasting in LogP
[Figure: a time diagram of an optimal broadcast tree over P0 ... P7; each message costs o (send overhead) + L (latency) + o (receive overhead), and successive sends by one processor are separated by g. The picture shows P=8, L=6, g=4, o=...]
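As a worked companion to the figure, a small sketch that counts how many processors the greedy LogP broadcast tree (the kind of schedule the picture depicts) reaches by time t. Since the figure's o value is cut off above, o = 2 is an assumed value here, and the function names are mine.

    /* Greedy LogP broadcast: every informed processor starts a new send
       every g cycles; a message begun at time t makes its receiver an
       informed sender at t + o + L + o. */
    #include <stdio.h>

    static int informed(int t, int L, int o, int g) {
        if (t < 0) return 0;
        int n = 1;                            /* this processor itself */
        for (int slot = 0; slot + 2 * o + L <= t; slot += g)
            n += informed(t - slot - 2 * o - L, L, o, g);
        return n;
    }

    int main(void) {
        int L = 6, g = 4, o = 2, P = 8;       /* o = 2 is assumed */
        int t = 0;
        while (informed(t, L, o, g) < P)
            t++;
        printf("P = %d reached at t = %d cycles\n", P, t);  /* t = 24 */
        return 0;
    }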