Implications for Programming Models
Todd C. Mowry
CS 495
September 12, 2002

Implications for Programming Models
Two models: shared address space (SAS) and explicit message passing (MP)
SAS may or may not provide coherent replication; focus primarily on the former case
Assume distributed memory in all cases
Recall that any model can be supported on any architecture; assume both are supported efficiently
Assume communication in SAS is only through loads and stores (see the sketch below)
Assume communication in SAS is at cache-block granularity
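To make the contrast concrete, here is a minimal sketch (not part of the original slides) of the same producer/consumer exchange under the two models, assuming pthreads-style shared memory on the SAS side and MPI on the MP side; the array `data`, its size, and the rank assignments are hypothetical.

```c
/* Minimal sketch, assuming pthreads-style SAS and MPI for message passing.
 * SAS: communication is implicit -- ordinary loads and stores to shared
 *      memory, moved by the memory system at cache-block granularity.
 * MP:  communication is explicit -- the owner sends, the user receives. */
#include <mpi.h>

#define N 1024
double data[N];   /* SAS: visible to all threads; MP: private to each process */

/* Shared address space: the consumer just reads; remote values arrive
 * through ordinary loads. */
double sas_consume(void) {
    double sum = 0.0;
    for (int i = 0; i < N; i++)
        sum += data[i];
    return sum;
}

/* Message passing: the producer must name the receiver, the consumer must
 * name the sender, and the transfer is an explicit event. */
void mp_exchange(int rank) {
    if (rank == 0)
        MPI_Send(data, N, MPI_DOUBLE, /*dest=*/1, /*tag=*/0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(data, N, MPI_DOUBLE, /*src=*/0, /*tag=*/0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
}
```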

Issues to Consider
Functional issues
–Naming
–Replication and coherence
–Synchronization
Organizational issues
–Granularity at which communication is performed
Performance issues
–Endpoint overhead of communication (latency and bandwidth depend on the network, so are considered similar)
–Ease of performance modeling
Cost issues
–Hardware cost and design complexity

Naming
SAS: similar to a uniprocessor; the system does it all
MP: each process can only directly name the data in its own address space
Need to specify from where to obtain, or to where to transfer, non-local data
Easy for regular applications (e.g. Ocean)
Difficult for applications with irregular, time-varying data needs
–Barnes-Hut: where are the parts of the tree that I need? (changes with time)
–Raytrace: where are the parts of the scene that I need? (unpredictable)
Solution methods exist
–Barnes-Hut: an extra phase determines needs and transfers data before the computation phase
–Raytrace: scene-oriented rather than ray-oriented approach
–Both: emulate an application-specific shared address space using hashing (see the sketch below)
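As a concrete illustration of the naming difference, here is a minimal sketch assuming a block distribution of a global array and a hash for irregular objects; the helpers `owner_of`, `local_index`, and `owner_of_key` are hypothetical, not part of any standard API.

```c
/* Minimal sketch of naming under message passing, assuming a global array
 * of N elements block-distributed over P processes.  Under SAS the program
 * simply uses the global index i; under MP it must first compute which
 * process owns element i and then request it from that process. */
#define N 4096
static int P = 16;                    /* number of processes (hypothetical) */

int owner_of(int i)    { return i / (N / P); }   /* which process holds i  */
int local_index(int i) { return i % (N / P); }   /* where it lives there   */

/* For irregular, time-varying structures (Barnes-Hut tree nodes, Raytrace
 * scene objects) the same idea is applied through a hash on the object's
 * key, emulating an application-specific shared address space. */
int owner_of_key(unsigned long key) { return (int)(key % (unsigned long)P); }
```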

Replication
Who manages it (i.e. who makes local copies of data)?
–SAS: system; MP: program
Where in the local memory hierarchy is replication first done?
–SAS: cache (or memory too); MP: main memory
At what granularity is data allocated in the replication store?
–SAS: cache block; MP: program-determined
How are replicated data kept coherent?
–SAS: system; MP: program
How is replacement of replicated data managed?
–SAS: dynamically, at fine spatial and temporal grain (every access)
–MP: at phase boundaries (see the sketch below), or by emulating a cache in main memory in software
Of course, SAS affords many more options too (discussed later)
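Below is a minimal sketch of program-managed replication under message passing, assuming MPI and a 1-D row decomposition of a grid; the buffer names and sizes are hypothetical.

```c
/* Minimal sketch, assuming MPI and a 1-D row decomposition of a grid:
 * before each compute phase the program itself replicates the neighbor
 * rows it will read into local "ghost" buffers in main memory; the copies
 * stay valid for the whole phase and are refreshed at the next boundary.
 * (Under cache-coherent SAS the hardware would instead replicate the same
 * data into the cache, block by block, on demand.) */
#include <mpi.h>

#define COLS 1024
static double my_top_row[COLS], my_bottom_row[COLS];    /* owned data     */
static double ghost_above[COLS], ghost_below[COLS];     /* local replicas */

void exchange_ghosts(int rank, int nprocs) {
    MPI_Status st;
    if (rank > 0)               /* replicate the upper neighbor's boundary row */
        MPI_Sendrecv(my_top_row,    COLS, MPI_DOUBLE, rank - 1, 0,
                     ghost_above,   COLS, MPI_DOUBLE, rank - 1, 0,
                     MPI_COMM_WORLD, &st);
    if (rank < nprocs - 1)      /* replicate the lower neighbor's boundary row */
        MPI_Sendrecv(my_bottom_row, COLS, MPI_DOUBLE, rank + 1, 0,
                     ghost_below,   COLS, MPI_DOUBLE, rank + 1, 0,
                     MPI_COMM_WORLD, &st);
}
```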

Amount of Replication Needed
Mostly local data accessed => little replication
Cache-coherent SAS: cache holds the active working set
–replaces at fine temporal and spatial grain (so little fragmentation too)
–small enough working sets => little or no replication needed in memory
Message passing, or SAS without hardware caching:
Replicate all data needed in a phase in main memory
–replication overhead can be very large (Barnes-Hut, Raytrace)
–limits scalability of problem size with the number of processors
Emulate a cache in software to achieve fine-temporal-grain replacement (see the sketch below)
–expensive to manage in software (hardware is better at this)
–may have to be conservative in the size of cache used
–fine-grained messages generated by misses are expensive (in message passing)
–programming cost for managing the cache and coalescing messages
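A minimal sketch of the software-cache idea mentioned above, assuming a small direct-mapped table of remote blocks kept in main memory; `fetch_remote_block` stands in for whatever request/reply or bulk-get mechanism the underlying system provides and is an assumption, not a real API.

```c
/* Minimal sketch of emulating a cache in software: a small direct-mapped
 * table of remote blocks held in main memory.  On a miss the program must
 * fetch the block itself; in message passing this generates exactly the
 * fine-grained message the slide warns is expensive. */
#define BLOCK_WORDS 16
#define CACHE_LINES 256

typedef struct {
    int    valid;
    long   tag;                       /* global block number */
    double words[BLOCK_WORDS];
} SwLine;

static SwLine sw_cache[CACHE_LINES];  /* static storage: all lines start invalid */

/* Hypothetical hook: whatever request/reply messages or bulk get the
 * system provides for fetching a remote block into local memory. */
extern void fetch_remote_block(long block, double *dst);

double sw_cached_read(long global_word) {
    long    block = global_word / BLOCK_WORDS;
    SwLine *line  = &sw_cache[block % CACHE_LINES];
    if (!line->valid || line->tag != block) {    /* miss: refill in software */
        fetch_remote_block(block, line->words);
        line->tag   = block;
        line->valid = 1;
    }
    return line->words[global_word % BLOCK_WORDS];
}
```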

Communication Overhead and Granularity
Overhead is directly related to the hardware support provided
–lower in SAS (by an order of magnitude or more)
Major tasks:
Address translation and protection
–SAS uses the MMU
–MP requires software protection, usually involving the OS in some way
Buffer management
–fixed-size small messages in SAS are easy to handle in hardware
–flexible-sized messages in MP usually need software involvement
Type checking and matching
–MP does it in software: lots of possible message types due to flexibility
A lot of research has gone into reducing these costs in MP, but they remain much larger
Naming, replication, and overhead favor SAS
Many irregular MP applications now emulate SAS/a cache in software

Block Data Transfer
Fine-grained communication is not the most efficient for long messages
–latency and overhead, as well as traffic (headers for each cache line)
SAS: can use block data transfer
–explicit in the system we assume, but can be automated at page or object level in general (more later)
–especially important for amortizing overhead when it is high
–latency can be hidden by other techniques too
Message passing:
–overheads are larger, so block transfer is more important (see the sketch below)
–but very natural to use, since messages are explicit and flexible; inherent in the model
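A minimal sketch of the amortization point, assuming MPI: sending N doubles one element at a time pays the per-message endpoint overhead N times, while coalescing them into a single block transfer pays it once.

```c
/* Minimal sketch, assuming MPI: per-message overhead is paid once per send,
 * so coalescing N elements into one message amortizes it. */
#include <mpi.h>

#define N 100000
static double buf[N];

void fine_grained_send(int dest) {   /* N messages: overhead dominates */
    for (int i = 0; i < N; i++)
        MPI_Send(&buf[i], 1, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
}

void block_send(int dest) {          /* 1 message: overhead amortized over N */
    MPI_Send(buf, N, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
}
```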

Synchronization
SAS: separate from communication (data transfer)
–programmer must orchestrate it separately (see the sketch below)
Message passing:
–mutual exclusion by fiat
–event synchronization is already in the send-receive match in the synchronous case
–need separate orchestration (using probes or flags) in the asynchronous case
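A minimal sketch of the contrast, assuming pthreads for the shared address space and MPI for message passing: under SAS the data transfer (a store to a shared variable) and the event synchronization (a flag guarded by a mutex and condition variable) are orchestrated separately by the programmer, while under synchronous message passing the blocking receive is itself the event synchronization.

```c
/* Minimal sketch of event synchronization in the two models. */
#include <pthread.h>
#include <mpi.h>

/* SAS: the store to 'value' (communication) and the flag/condition variable
 * (synchronization) are separate, programmer-orchestrated steps. */
static double value;                   /* shared data */
static int ready = 0;                  /* shared flag */
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  c = PTHREAD_COND_INITIALIZER;

void sas_producer(double v) {
    pthread_mutex_lock(&m);
    value = v;                         /* communication: ordinary store */
    ready = 1;                         /* synchronization: separate flag */
    pthread_cond_signal(&c);
    pthread_mutex_unlock(&m);
}

double sas_consumer(void) {
    pthread_mutex_lock(&m);
    while (!ready)
        pthread_cond_wait(&c, &m);
    double v = value;
    pthread_mutex_unlock(&m);
    return v;
}

/* MP: the blocking receive both transfers the data and signals the event;
 * mutual exclusion is implicit because each process touches only its own
 * memory. */
double mp_consumer(void) {
    double v;
    MPI_Recv(&v, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    return v;
}
```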

Hardware Cost and Design Complexity
Higher in SAS, and especially in cache-coherent SAS
But both are more complex issues:
Cost
–must be compared with the cost of replication in memory
–depends on market factors, sales volume, and other non-technical issues
Complexity
–must be compared with the complexity of writing high-performance programs
–reduced by increasing experience

Performance Model
Three components:
–modeling the cost of primitive system events of different types (see the sketch below)
–modeling the occurrence of these events in the workload
–integrating the two in a model to predict performance
The second and third are the most challenging
The second is where cache-coherent SAS is more difficult: replication and communication are implicit, so the events of interest are implicit
–similar to the problems introduced by caching in uniprocessors
MP has a good guideline: messages are expensive, send them infrequently
Difficult for irregular applications in either case (but more so in SAS)
Block transfer, synchronization, cost/complexity, and performance modeling are advantageous for MP
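As a minimal sketch of the first component, the cost of a primitive communication event can be modeled in the familiar overhead + latency + size/bandwidth form; the structure, field names, and any constants plugged in are placeholders, not measured values.

```c
/* Minimal sketch of costing a primitive communication event,
 *   time = overhead + latency + bytes / bandwidth,
 * and of predicting total communication time for a workload that performs
 * a given number of such events.  All values are placeholders. */
typedef struct {
    double overhead_us;     /* per-message endpoint overhead (dominant term) */
    double latency_us;      /* network latency */
    double bandwidth_MBps;  /* asymptotic bandwidth; 1 MB/s = 1 byte/us */
} CommModel;

/* Cost of one primitive communication event, in microseconds. */
double msg_time_us(const CommModel *m, double bytes) {
    return m->overhead_us + m->latency_us + bytes / m->bandwidth_MBps;
}

/* Predicted communication time for n_msgs events of avg_bytes each:
 * fewer, larger messages win because the overhead term is paid per message. */
double predicted_comm_time_us(const CommModel *m, long n_msgs, double avg_bytes) {
    return (double)n_msgs * msg_time_us(m, avg_bytes);
}
```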

Summary for Programming Models
Given the tradeoffs, the architect must address:
–Is hardware support for SAS (transparent naming) worthwhile?
–Is hardware support for replication and coherence worthwhile?
–Should explicit communication support also be provided in SAS?
Current trend:
–tightly-coupled multiprocessors support cache-coherent SAS in hardware
–the other major platform is clusters of workstations or multiprocessors, which currently don't support SAS in hardware and mostly use message passing

Summary
Crucial to understand the characteristics of parallel programs
–implications for a host of architectural issues at all levels
Architectural convergence has led to:
Greater portability of programming models and software
–many performance issues are similar across programming models too
Clearer articulation of performance issues
–used to use the PRAM model for algorithm design
–now models that incorporate communication cost (BSP, LogP, ...)
–emphasis in modeling has shifted to the endpoints, where cost is greatest
–but we need techniques to model application behavior, not just machines
Performance issues trade off with one another; iterative refinement
Ready to understand how workloads are used to evaluate systems issues