CS4961 Parallel Programming
Lecture 4: CTA, cont.; Data and Task Parallelism
Mary Hall, September 2, 2010

Homework 2, Due Friday, Sept. 10, 11:59 PM
To submit your homework:
-Submit a PDF file
-Use the “handin” program on the CADE machines
-Use the following command: “handin cs4961 hw2 …”
Problem 1 (based on #1 in the text, p. 59): Consider the Try2 algorithm for “count3s” from Figure 1.9 on p. 19 of the text. Assume you have an input array of 1024 elements and 4 threads, and that the input data is evenly split among the four processors so that accesses to the input array are local and have unit cost. Assume there is an even distribution of appearances of 3 among the elements assigned to each thread, a constant we call NTPT. What is a bound for the memory cost for a particular thread predicted by the CTA, expressed in terms of λ and NTPT?

Homework 2, cont.
Problem 2 (based on #2 in the text, p. 59): Now provide a bound for the memory cost for a particular thread predicted by the CTA for the Try4 algorithm of Figure 1.14 on p. 23 (or Try3, assuming each element is placed on a separate cache line).
Problem 3: For these examples, how is algorithm selection impacted by the value of NTPT?
Problem 4 (in general, not specific to this problem): How is algorithm selection impacted by the value of λ?

Brief Recap of Course So Far
-Technology trends that make parallel computing increasingly important
-History from scientific simulation and supercomputers
-Data dependences and reordering transformations
-Fundamental Theory of Dependence
-Tuesday, we looked at a lot of different kinds of parallel architectures
-Diverse!
-Shared memory vs. distributed memory
-Scalability through hierarchy

Today’s Lecture
How to write software for a moving hardware target?
-Abstract away specific details
-Want to write machine-independent code
Candidate Type Architecture (CTA) Model
-Captures inherent tradeoffs without details of hardware choices
-Summary: Locality is Everything!
Data parallel and task parallel constructs and how to express them
Sources for this lecture:
-Larry Snyder
-Grama et al., Introduction to Parallel Computing

Parallel Architecture Model
How to develop portable parallel algorithms for current and future parallel architectures, a moving target?
Strategy:
-Adopt an abstract parallel machine model for use in thinking about algorithms
1. Review how we compare algorithms on sequential architectures
2. Introduce the CTA model (Candidate Type Architecture)
3. Discuss how it relates to today’s set of machines

How did we do it for sequential architectures?
Sequential model: Random Access Machine (RAM)
-Control, ALU, (unlimited) memory, [input, output]
-The fetch/execute cycle runs one instruction at a time, pointed at by the PC
-Memory references are “unit time,” independent of location
-This gives the RAM its name, in preference to “von Neumann machine”
-“Unit time” is not literally true, but caches provide that illusion when they are effective
-Executes “3-address” instructions
The focus in developing sequential algorithms, at least in courses, is on reducing the amount of computation (useful even if imprecise)
-Treat memory time as negligible
-Ignore overheads

Interesting Historical Parallel Architecture Model: PRAM
Parallel Random Access Machine (PRAM)
-Unlimited number of processors
-Processors are standard RAM machines, executing synchronously
-Memory references are “unit time”
-The outcome of collisions at memory is specified: EREW, CREW, CRCW, …
The model fails to capture true performance behavior
-Synchronous execution with unit-cost memory references does not scale
-Therefore, parallel hardware typically implements non-uniform-cost memory references

Candidate Type Architecture (CTA Model)
A model with P standard processors, degree d, and latency λ
-Node == processor + memory + NIC
-Key property: a local memory reference costs 1; a global (non-local) memory reference costs λ
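To make the key property concrete, one way to write the CTA memory-cost estimate for a single thread (a rough sketch in the spirit of the model, not a formula quoted from the text) is

  T_mem ≈ N_local · 1 + N_non-local · λ

where N_local counts references satisfied in the node’s own memory and N_non-local counts references to other nodes. For example, a thread that makes 256 local array reads plus k references to a remote counter would be charged roughly 256 + k·λ.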

Estimated Values for Lambda
-Captures the inherent property that data locality is important
-Different values of λ can lead to different algorithm strategies

Locality Rule
Definition (p. 53):
-Fast programs tend to maximize the number of local memory references and minimize the number of non-local memory references.
The Locality Rule in practice:
-It is usually more efficient to add a fair amount of redundant computation to avoid non-local accesses (e.g., the random number generator example).
This is the most important thing you need to learn in this class!
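A minimal sketch of the random-number-generator idea in C with OpenMP (an assumption: this follows the spirit of the example, not the text’s exact code; the constants and function name are illustrative). Each thread seeds and advances its own generator state redundantly, so every memory reference during sampling is local:

#include <stdlib.h>
#include <omp.h>

/* Monte Carlo circle-hit count: each thread keeps its own RNG state in a
 * local variable, trading a little redundant seeding and bookkeeping for
 * purely local memory references (the Locality Rule in action). */
long count_circle_hits(long samples_per_thread)
{
    long hits = 0;
    #pragma omp parallel reduction(+:hits)
    {
        unsigned int seed = 1234u + (unsigned)omp_get_thread_num(); /* thread-local state */
        for (long i = 0; i < samples_per_thread; i++) {
            double x = rand_r(&seed) / (double)RAND_MAX;
            double y = rand_r(&seed) / (double)RAND_MAX;
            if (x * x + y * y <= 1.0)
                hits++;
        }
    }
    return hits;   /* pi ≈ 4 * hits / (samples_per_thread * num_threads) */
}

(rand_r is the re-entrant POSIX generator; sharing a single generator instead would make every call a potentially non-local, serialized reference.)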

Memory Reference Mechanisms
Shared memory
-All processors have access to a global address space
-Refer to remote data or local data in the same way, through normal loads and stores
-Usually, caches must be kept coherent with the global store
Message passing and distributed memory
-Memory is partitioned, and each partition is associated with an individual processor
-Remote data access is through explicit communication (sends and receives)
-Two-sided (both a send and a receive are needed)
One-sided communication (a hybrid mechanism)
-Supports a global shared address space but no coherence guarantees
-Access to remote data is through gets and puts
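A minimal sketch of the difference in C (the function and variable names are illustrative, not from the text; the MPI calls are standard two-sided message passing):

#include <mpi.h>

/* Shared memory: remote data is read with an ordinary load.  The access
 * looks identical to a local one, but per the CTA it may cost ~λ. */
double shared_read(const double *global_array, long i)
{
    return global_array[i];
}

/* Message passing: the same read needs cooperation from both sides.
 * 'owner' holds the element in its partition; 'requester' wants a copy. */
double two_sided_read(double *local_part, long local_i,
                      int owner, int requester, int my_rank)
{
    double value = 0.0;
    if (my_rank == owner)
        MPI_Send(&local_part[local_i], 1, MPI_DOUBLE,
                 requester, /* tag */ 0, MPI_COMM_WORLD);
    else if (my_rank == requester)
        MPI_Recv(&value, 1, MPI_DOUBLE,
                 owner, /* tag */ 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    return value;
}

(One-sided communication would replace the send/receive pair with a single get or put on a shared window, keeping the global address space but without coherence guarantees.)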

Brief Discussion
Why is it good to have different parallel architectures?
-Some may be better suited for specific application domains
-Some may be better suited for a particular community
-Cost
-Explore new ideas
And different programming models/languages?
-Relate to architectural features
-Application domains, user community, cost, exploring new ideas

Conceptual: CTA for Shared Memory Architectures?
The CTA does not capture the global memory in SMPs
It forces a discipline
-The application developer should think about locality even if remote data is referenced identically to local data!
-Otherwise, performance will suffer
-Anecdotally, codes written for distributed memory have been shown to run faster on shared memory architectures than shared memory programs
-Similarly, GPU codes (which require a partitioning of memory) have recently been shown to run well on conventional multi-cores

Definitions of Data and Task Parallelism
Data parallel computation:
-Perform the same operation on different items of data at the same time; the parallelism grows with the size of the data.
Task parallel computation:
-Perform distinct computations -- or tasks -- at the same time; with the number of tasks fixed, the parallelism is not scalable.
Summary
-Mostly we will study data parallelism in this class
-Data parallelism facilitates very high speedups and scaling to supercomputers
-Hybrid (mixing of the two) is increasingly common
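A minimal sketch of the distinction in C with OpenMP (which we meet next week); the filter routines are hypothetical stand-ins, not library calls:

/* Hypothetical filters: both read 'in' and write their own output. */
static void low_pass(const double *in, long n, double *out)
{ for (long i = 0; i < n; i++) out[i] = 0.5 * in[i]; }
static void high_pass(const double *in, long n, double *out)
{ for (long i = 0; i < n; i++) out[i] = in[i] - 0.5 * in[i]; }

/* Data parallelism: the same operation on every element; the available
 * parallelism grows with n. */
void scale_all(double *a, long n, double c)
{
    #pragma omp parallel for
    for (long i = 0; i < n; i++)
        a[i] *= c;
}

/* Task parallelism: two distinct computations at the same time; the
 * parallelism stays fixed at two, no matter how large n gets. */
void two_filters(const double *signal, long n,
                 double *out_low, double *out_high)
{
    #pragma omp parallel sections
    {
        #pragma omp section
        low_pass(signal, n, out_low);
        #pragma omp section
        high_pass(signal, n, out_high);
    }
}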

Parallel Formulation vs. Parallel Algorithm
Parallel formulation
-Refers to a parallelization of a serial algorithm.
Parallel algorithm
-May represent an entirely different algorithm than the one used serially.
In this course, we primarily focus on “parallel formulations.”

Steps to Parallel Formulation (refined from Lecture 2)
Computation decomposition/partitioning:
-Identify pieces of work that can be done concurrently
-Assign tasks to multiple processors (processes, used equivalently)
Data decomposition/partitioning:
-Decompose input, output, and intermediate data across the different processors
Manage access to shared data and synchronization:
-Coherent view, safe access for input or intermediate data
UNDERLYING PRINCIPLES: Maximize concurrency and reduce overheads due to parallelization! Maximize potential speedup!
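As a hedged illustration of these steps on “count3s” (a sketch in OpenMP, not the text’s Try1–Try4 code): the loop iterations are the concurrently executable work, the parallel loop schedule block-partitions the input array across threads, and access to the shared count is managed by giving each thread a private copy that is combined at the end.

/* count3s: count occurrences of 3 in array[0..n-1].
 * Computation decomposition: each iteration is independent work.
 * Data decomposition: the parallel loop block-partitions 'array' across threads.
 * Shared-data management: 'count' is a reduction, so each thread accumulates
 * into a private local copy and there is no contention on a shared counter. */
long count3s(const int *array, long n)
{
    long count = 0;
    #pragma omp parallel for reduction(+:count)
    for (long i = 0; i < n; i++)
        if (array[i] == 3)
            count++;
    return count;
}

(Compile with an OpenMP flag, e.g. -fopenmp with GCC or Clang.)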

Concept of Threads
This week’s homework was made more difficult because we didn’t have a concrete way of expressing the parallelism features of our code!
The text introduces Peril-L as a neutral language for describing parallel programming constructs
-Abstracts away details of existing languages
-Architecture independent
-Data parallel
-Based on C, for universality
Next week we will instead learn OpenMP
-Similar to Peril-L

Common Notions of Task-Parallel Thread Creation (not in Peril-L)
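The code shown on this slide was not captured in the transcript. As a stand-in, here is a minimal sketch of one common notion of task-parallel thread creation, using POSIX threads (an assumption of this sketch, not necessarily the slide’s example): a fork/join pattern in which each created thread runs a distinct task.

#include <pthread.h>
#include <stdio.h>

/* Each created thread runs a distinct task; the creator later joins it. */
static void *task_body(void *arg)
{
    int task_id = *(int *)arg;
    printf("running task %d\n", task_id);
    return NULL;
}

int main(void)
{
    pthread_t threads[2];
    int ids[2] = {0, 1};

    for (int i = 0; i < 2; i++)
        pthread_create(&threads[i], NULL, task_body, &ids[i]);  /* fork */
    for (int i = 0; i < 2; i++)
        pthread_join(threads[i], NULL);                          /* join */
    return 0;
}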

Examples of Task and Data Parallelism
-Looking for all the appearances of “University of Utah” on the world-wide web
-A series of signal processing filters applied to an incoming signal
-The same signal processing filters applied to a large known signal
-Climate model from Lecture 1

Summary of Lecture
-CTA model
-How to develop a parallel code
-Locality is everything!
-Added in data decomposition
Next time:
-Our first parallel programming language!
-Data parallelism in OpenMP