Lecture 7: Design of Parallel Programs Part II Lecturer: Simon Winberg.



- Aside: Saturn V, one giant leap for digital computer engineering
- Step 3: Decomposition and granularity
- Class activity
- Step 4: Communications

A short (and, I hope, inspiring) trip back to the 1960s, the era of rockets and trips to the moon, and in particular to an early form of HPEC system used to control the Saturn V.

A short case study of a (yesteryear) high performance embedded computer: the Apollo Saturn V Launch Vehicle Digital Computer (LVDC).
Apollo Saturn V launch: think of the many subsystems and the challenges of getting 1960s tech to deliver.
You all surely know what the Saturn V is, but just in case you don't: launch of Apollo 4, the first Saturn V, as seen live on CBS.

Saturn V Launch Vehicle (i.e. the rocket), Apollo 11
- Uncomfortable but eager flight crew: Neil Armstrong (Commander), Buzz Aldrin (Lunar Module Pilot), Michael Collins (Command Module Pilot)
- Safety abort rocket
- About 3,200 °C produced by the F-1 rocket engines, all together giving approximately 160 million horsepower

We all know that Apollo 11, with many thanks to the Saturn V, resulted in:
“That's one small step for [a] man, one giant leap for mankind” (Neil Armstrong, July 21, 1969)
But the successful launch of the Saturn V owes much to the LVDC (Launch Vehicle Digital Computer).
The LVDC was of course quite a fancy circuit back in those days, and it may have been ahead of its time in some ways too, which connects nicely to the aims of this course.
Here is an interesting perspective on the LVDC, giving some concept of how an application (i.e. launching a rocket) can lead to significant breakthroughs in technology, and in computer design in particular.

The Apollo Saturn V Launch Vehicle Digital Computer (LVDC) Circuit Board Other suggestions: Computer for Apollo (1960s documentary):


A forerunner of modern circuit design technology (back in its day)! The LVDC has some design characteristics in common with today's reconfigurable computers. It was (a bit) like an FPGA or a programmable logic device (PLD): you would slot in a card that adjusted interconnects and configured the ways processing modules would connect. It had large parts of a reusable, reconfigurable system (a customizable, task-specific system) but, rather than being field programmable, it was more hardwired: a kind of rapid ASIC technology.
“That’s just one small computer, but one giant leap for computer kind” (especially considering high performance embedded systems!)

Design of Parallel Programs EEE4084F

Steps in designing parallel programs
The hardware may come first or later. The main steps:
1. Understand the problem
2. Partitioning (separation into main tasks)
3. Decomposition & Granularity
4. Communications
5. Identify data dependencies
6. Synchronization
7. Load balancing
8. Performance analysis and tuning

Step 3: Decomposition and Granularity EEE4084F

- Decomposition: how the problem can be divided up (looked at earlier):
  - Functional decomposition
  - Domain (or data) decomposition
- Granularity:
  - How big or small are the parts that the problem has been decomposed into?
  - How interrelated are the sub-tasks?

- The computation : communication ratio can help to decide whether a problem is fine or coarse grained.
  - 1 : 1 = each intermediate result needs a communication operation
  - 100 : 1 = 100 computations (or intermediate results) require only one communication operation
  - 1 : 100 = each computation needs 100 communication operations

- Fine grained:
  - One part / sub-process requires a great deal of communication with other parts to complete its work, relative to the amount of computing it does (the computation : communication ratio is low, approaching 1:1).
- Coarse grained:
  - A coarse-grained parallel task is largely independent of other tasks, but still requires some communication to complete its part. The computation : communication ratio is high (say around 100:1).
- Embarrassingly parallel:
  - So coarse that there is no or very little interrelation between parts / sub-processes.
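The classification above can be sketched as a small helper. This is only an illustration: the function names and the two thresholds (`fine_below`, `coarse_from`) are assumptions chosen to match the "approaching 1:1" and "around 100:1" guidance, not fixed definitions, and only the zero-communication case is treated as embarrassingly parallel here.

```python
import math


def comp_comm_ratio(computations, communications):
    """Number of computations performed per communication operation."""
    if communications == 0:
        return math.inf  # no communication at all
    return computations / communications


def classify_granularity(computations, communications,
                         fine_below=10.0, coarse_from=100.0):
    """Label a task by its computation : communication ratio.

    The thresholds are illustrative assumptions, not standard values.
    """
    ratio = comp_comm_ratio(computations, communications)
    if math.isinf(ratio):
        return "embarrassingly parallel"
    if ratio >= coarse_from:
        return "coarse grained"
    if ratio < fine_below:
        return "fine grained"
    return "intermediate"


if __name__ == "__main__":
    for comp, comm in [(1, 1), (100, 1), (1, 100), (500, 0)]:
        print(f"{comp} : {comm} -> {classify_granularity(comp, comm)}")
```

In practice the counts would come from profiling or from analysing the algorithm, and the point at which communication starts to dominate depends on the platform.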

- Fine grained:
  - Problem broken into (usually many) very small pieces.
  - Problems where any one piece is highly interrelated to others (e.g., having to look at relations between neighboring gas molecules to determine how a cloud of gas molecules behaves).
  - Sometimes, attempts to parallelize fine-grained solutions increase the solution time.
  - For very fine-grained problems, computational performance is limited both by start-up time and by the speed of the fastest single CPU in the cluster.

- Coarse grained:
  - Breaking the problem into larger pieces.
  - Usually, a low level of interrelation (e.g., can separate into parts whose elements are unrelated to other parts).
  - These solutions are generally easier to parallelize than fine-grained ones, and
  - Usually, parallelization of these problems provides significant benefits.
  - Ideally, the problem is found to be “embarrassingly parallel” (this can of course also be the case for fine-grained solutions).

- Many image processing problems are suited to coarse-grained solutions, e.g., calculations can be performed on individual pixels or small sets of pixels without requiring knowledge of any other pixel in the image.
- Scientific problems tend to lie between coarse and fine granularity. These solutions may require some amount of interaction between regions, so the individual processors doing the work need to collaborate and exchange results (i.e., there is a need for synchronization and message passing).
- E.g., any one element in the data set may depend on the values of its nearest neighbors. If the data is decomposed into two parts that are each processed by a separate CPU, then the CPUs will need to exchange boundary information.
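The boundary exchange in the last point can be sketched in a few lines. This is a sequential illustration under stated assumptions: the three-point averaging stencil and the function names are hypothetical, and the two "CPUs" are simulated by two list slices rather than real processes. Each part receives one extra "halo" element copied from its neighbor before computing.

```python
def smooth(values):
    """Replace each interior element by the mean of itself and its two neighbors."""
    out = values[:]
    for i in range(1, len(values) - 1):
        out[i] = (values[i - 1] + values[i] + values[i + 1]) / 3
    return out


def smooth_in_two_parts(values):
    """Same computation, with the data decomposed into two parts."""
    half = len(values) // 2
    left, right = values[:half], values[half:]

    # Boundary exchange: each part gets a copy of its neighbor's edge element.
    left_with_halo = left + [right[0]]
    right_with_halo = [left[-1]] + right

    # Each part is now processed independently; the halo cell is dropped afterwards.
    left_out = smooth(left_with_halo)[:-1]
    right_out = smooth(right_with_halo)[1:]
    return left_out + right_out


if __name__ == "__main__":
    data = [0.0, 3.0, 6.0, 9.0, 12.0, 3.0, 0.0, 6.0]
    print(smooth_in_two_parts(data) == smooth(data))
```

Without the exchanged halo elements, the two values next to the cut would be computed from incomplete neighborhoods and the split result would differ from the serial one.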

- Which of the following are more fine-grained, and which are coarse-grained?
  - Matrix multiply
  - FFTs
  - Decryption code breaking
  - (Deterministic) finite state machine validation / termination checking
  - Map navigation (e.g., shortest path)
  - Population modelling

- Which of the following are more fine-grained, and which are coarse-grained?
  - Matrix multiply: fine grained by data, coarse grained by function
  - FFTs: fine grained
  - Decryption code breaking: coarse grained
  - (Deterministic) finite state machine validation / termination checking: coarse grained
  - Map navigation (e.g., shortest path): coarse grained
  - Population modelling: coarse grained

Step 4: Communications EEE4084F

- The communication needs between tasks depend on your solution.
- Communications are not needed for:
  - Minimal or no shared data or results.
  - E.g., an image processing routine where each pixel is dimmed (e.g., 50% dim). Here, the image can easily be separated between many tasks that act entirely independently of one another.
  - This is usually the case for embarrassingly parallel solutions.
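The dimming example can be sketched as follows. This is a minimal illustration, assuming the image is a list of rows of integer pixel values; `dim_row` and `dim_image` are hypothetical names. A thread pool is used only to keep the sketch short; for CPU-bound work in CPython a process pool would be the more usual choice.

```python
from concurrent.futures import ThreadPoolExecutor


def dim_row(row, factor=0.5):
    """Dim every pixel in one row; needs no data from any other row."""
    return [int(pixel * factor) for pixel in row]


def dim_image(image, workers=4):
    """Hand each row to a worker; the tasks never communicate with each other."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(dim_row, image))


if __name__ == "__main__":
    image = [[100, 200], [50, 255]]
    print(dim_image(image))
```

The only coordination is handing out rows and collecting results, which is why this kind of task is called embarrassingly parallel.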

- The communication needs between tasks depend on your solution.
- Communications are needed for:
  - Parallel applications that need to share results or boundary information.
  - E.g., modeling 2D heat diffusion over time: this could be divided into multiple parts, but boundary results need to be shared. Changes to an element in the middle of a partition only have an effect on the boundary after some time.
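The last point can be checked with a small experiment. The sketch below is an assumption-laden illustration, not the lecture's model: it uses an explicit five-point stencil with a hypothetical diffusion coefficient `alpha`, holds the outermost cells fixed, places a single hot cell in the middle of one partition, and counts time steps until a chosen column first becomes warm.

```python
def diffuse_step(grid, alpha=0.1):
    """One explicit five-point-stencil update; the outermost cells stay fixed."""
    rows, cols = len(grid), len(grid[0])
    new = [row[:] for row in grid]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            neighbors = (grid[r - 1][c] + grid[r + 1][c]
                         + grid[r][c - 1] + grid[r][c + 1])
            new[r][c] = grid[r][c] + alpha * (neighbors - 4 * grid[r][c])
    return new


def steps_until_column_warms(size, column, alpha=0.1, max_steps=1000):
    """Time steps until heat from the central cell first reaches `column`."""
    grid = [[0.0] * size for _ in range(size)]
    mid = size // 2
    grid[mid][mid] = 100.0  # single hot element in the middle of the partition
    for step in range(1, max_steps + 1):
        grid = diffuse_step(grid, alpha)
        if any(grid[r][column] != 0.0 for r in range(size)):
            return step
    return None  # the column never warmed within max_steps


if __name__ == "__main__":
    print(steps_until_column_warms(size=11, column=8))
```

With this stencil, heat moves at most one cell per time step, so a partition's interior can be updated for several steps before the change matters to a neighboring partition; the shared boundary values still need to be exchanged every step.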

- Cost of communications
- Latency vs. bandwidth
- Visibility of communications
- Synchronous vs. asynchronous communications
- Scope of communications
- Efficiency of communications
- Overhead and complexity

- Communication between tasks carries overheads, such as:
  - CPU cycles, memory and other resources that could be used for computation are instead used to package and transmit data.
  - Communication also needs synchronization between tasks, which may result in tasks spending time waiting instead of working.
  - Competing communication traffic can also saturate the network bandwidth, causing performance loss.
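One common way to reason about these overheads is a simple linear cost model: each message pays a fixed latency plus its size divided by the bandwidth. The sketch below assumes that model; the function names and the default latency and bandwidth figures are made-up illustrative values, not measurements of any real network.

```python
def transfer_time(message_bytes, latency_s, bandwidth_bytes_per_s):
    """Estimated time for one message: fixed latency plus size / bandwidth."""
    return latency_s + message_bytes / bandwidth_bytes_per_s


def total_time(num_messages, bytes_per_message,
               latency_s=50e-6, bandwidth_bytes_per_s=1e9):
    """Estimated time to send num_messages equal-sized messages one after another."""
    return num_messages * transfer_time(
        bytes_per_message, latency_s, bandwidth_bytes_per_s)


if __name__ == "__main__":
    many_small = total_time(1000, 1_000)    # 1000 messages of 1 kB each
    one_large = total_time(1, 1_000_000)    # the same data in a single message
    print(f"many small: {many_small:.6f} s, one large: {one_large:.6f} s")
```

Under this model the same amount of data costs much more when sent as many small messages, because the latency is paid once per message. This is one reason fine-grained solutions suffer more from communication overhead.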

- Cloud computing
- Step 4: Communications (cont.)
- Step 5: Identify data dependencies
Post-lecture (voluntary) assignment: Refer back to slide 24 (factors related to communication). Use your favourite search engine to read up further on these factors, and think about how the hardware design aspects of a computing platform can benefit or impact these issues.