Lecture 4: Temporal and Spatial Computing, Shared Memory, Granularity
Lecturer: Simon Winberg
Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)

Syllabus: Lectures 1-5
Might ask something about Prac 1 or 2 (i.e. OCTAVE or pthreads) …
45 minutes, usual lecture venue

- Temporal & spatial computing
- Extracting concurrency
- Shared memory
- Granularity
Licensing details on the last slide

Temporal and Spatial Computation

Temporal computation (the traditional paradigm):
- Typical of programmers
- Things done over time steps

Spatial computation:
- Suited to hardware
- Possibly more intuitive?
- Things related in a space

Temporal example (Octave-style):
A = input("A = ? ");
B = input("B = ? ");
C = input("B multiplier ? ");
X = A + B * C
Y = A - B * C

Spatial example: (diagram) a dataflow graph in which inputs A?, B? and C? feed a multiplier (*), whose output feeds an adder (+) producing X! and a subtractor (-) producing Y!

Which do you think is easier to make sense of? The spatial form can provide a clearer indication of relative dependencies.

- Being able to comprehend and extract the parallelism, or the properties of concurrency, from a process or algorithm is essential to accelerating computation.
- The Reconfigurable Computing (RC) advantage: the computing platform is able to adapt according to the concurrency inherent in a particular application, in order to accelerate computation for that specific application.
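To make "extracting concurrency" concrete, here is a minimal sketch (my own illustration, not from the slides) that evaluates the X and Y outputs of the earlier temporal/spatial example in two POSIX threads; once B * C is formed, the + and - branches of the dataflow graph are independent and can run concurrently:

#include <pthread.h>
#include <stdio.h>

/* Shared inputs and outputs (fixed illustrative values; the slide's Octave
 * version reads them interactively with input()). */
static double A = 2.0, B = 3.0, C = 4.0;
static double X, Y;

/* Each thread evaluates one independent branch of the dataflow graph. */
static void *compute_x(void *arg) { (void)arg; X = A + B * C; return NULL; }
static void *compute_y(void *arg) { (void)arg; Y = A - B * C; return NULL; }

int main(void)
{
    pthread_t tx, ty;
    pthread_create(&tx, NULL, compute_x, NULL);
    pthread_create(&ty, NULL, compute_y, NULL);
    pthread_join(tx, NULL);
    pthread_join(ty, NULL);
    printf("X = %g, Y = %g\n", X, Y);   /* X = 14, Y = -10 */
    return 0;
}

For such a tiny computation the thread start-up cost dwarfs the arithmetic (a granularity issue revisited later in this lecture); the point is only to show the two independent branches that the spatial view makes explicit.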

- The choice of memory architecture is not necessarily dependent on the 'Flynn classification'.
- For an SISD computer this aspect is largely irrelevant (but note that a PC with a GPU and DMA does not really fall into the SISD category).

- Generally, all processors have access to all memory in a global address space.
- Processors operate independently, but they can share the same global memory.
- Changes to global memory made by one processor are seen by the other processors.
- Shared memory machines can be divided into two types, based on memory access times:
  - Uniform Memory Access (UMA), or
  - Non-Uniform Memory Access (NUMA)

Uniform Memory Access (UMA):
- Common today in the form of Symmetric Multi-Processor (SMP) machines
- Identical processors
- Equal access, and equal access times, to memory
- Cache coherent: when one processor writes a location in shared memory, all other processors are updated. Cache coherency is implemented at the hardware level.
(Diagram: several CPUs connected to a single shared MEMORY.)

Non-Uniform Memory Access (NUMA):
- Not all processors have the same access time to all of the memories
- Memory access across the interconnect link is slower
- If cache coherency is maintained, it may also be called CC-NUMA (Cache Coherent NUMA)
(Diagram: two SMPs, each a CPU with its own MEMORY, joined by an interconnect bus.) This architecture has two SMPs connected via a bus. When a CPU on SMP 1 needs to access memory connected to SMP 2, there will be some form of lag, which may be a few times slower than access to SMP 1's own memory.

Advantages:
- The global address space gives a user-friendly programming approach (as discussed for the shared memory programming model)
- Sharing data between tasks is fast and uniform due to the proximity of memory to the CPUs
Disadvantages:
- Major drawback: lack of scalability between memory and CPUs.
- Adding CPUs can increase traffic on the shared memory-CPU path (for cache coherent systems it also increases the traffic associated with cache/memory management)

Disadvantages (continued):
- The programmer is responsible for implementing/using synchronization constructs to make sure that global memory is accessed correctly.
- It becomes more difficult and expensive to design and construct shared memory machines with ever increasing numbers of processors.
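As a minimal sketch of that synchronization responsibility (my own example, not from the slides), two POSIX threads incrementing a shared counter need a mutex so that the read-modify-write on global memory is not interleaved:

#include <pthread.h>
#include <stdio.h>

static long counter = 0;                      /* shared (global) memory */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&lock);            /* synchronization construct */
        counter++;                            /* protected read-modify-write */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld (expected 2000000)\n", counter);
    return 0;
}

Without the mutex the final count is typically less than 2 000 000: the hardware keeps the caches coherent, but it does not make the increment itself atomic.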

Distributed memory:
- Similar to shared memory in many respects, but requires a communications network to share memory.
- Each processor has its own local memory (not directly addressable by the other processors).
- Processors are connected via a communication network; the network fabric varies and could simply be Ethernet.
- Cache coherency does not apply: when a CPU changes its local memory, the hardware does not notify the other processors. If this is needed, the programmer has to provide the functionality.
- The programmer is responsible for implementing the methods by which one processor can access the memory of a different processor.
(Diagram: several CPUs, each with its own Local Memory, joined by a Communications network.)
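As a hedged illustration of what "the programmer implements the access" means in practice (the slides do not prescribe a library; MPI is just one common message-passing choice), one process explicitly sends a value held in its local memory and another explicitly receives it:

#include <mpi.h>
#include <stdio.h>

/* Run with, e.g.: mpirun -np 2 ./a.out */
int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double value = 0.0;
    if (rank == 0) {
        value = 42.0;                  /* exists only in rank 0's local memory */
        MPI_Send(&value, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %g from rank 0\n", value);
    }

    MPI_Finalize();
    return 0;
}

There is no global address space here: rank 1 cannot simply read rank 0's variable; the value has to be moved over the network by explicit calls.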

Advantages:
- Memory is scalable with the number of processors
- Each processor can access its own memory quickly, without communication overheads and without having to maintain cache coherency (as a UMA shared memory system must)
- Cost benefits: use of commercial off-the-shelf (COTS) processors and networks

Disadvantages:
- The programmer takes on the responsibility for data consistency, synchronization and communication between processors.
- Existing (legacy) programs based on shared global memory may be difficult to port to this model.
- It may be more difficult to write applications for distributed memory systems than for shared memory systems.
- Performance is restricted by non-uniform memory access (NUMA) effects, i.e. a memory access bottleneck in which remote accesses may be many times slower than in shared memory systems.

Hybrid distributed-shared memory:
- Simply a network of shared memory systems (possibly within one computer, or a cluster of separate computers)
- Used in many modern supercomputer designs today
- The shared memory part is usually UMA (cache coherent)
- Pros & cons? The best and the worst of both worlds.
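A common way to program such a hybrid machine (not mandated by the slides; this is just one widely used combination) is MPI between the distributed-memory nodes and OpenMP threads within each shared-memory node. A minimal sketch:

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank;

    /* MPI handles communication between the distributed-memory nodes... */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* ...while OpenMP threads share memory within each node. */
    #pragma omp parallel
    {
        printf("rank %d, thread %d of %d\n",
               rank, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}

Built with something like mpicc -fopenmp, each MPI rank then spawns a team of threads over the cores of its own node.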

- Granularity (a characteristic of the problem):
  - How big or small are the parts into which the problem has been decomposed?
  - How interrelated are the sub-tasks?
- Decomposition (part of the development process):
  - How the problem can be divided up; relates closely to the granularity of the problem
  - Functional decomposition
  - Domain (or data) decomposition (see the sketch below)
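As a small, hedged illustration of domain (data) decomposition (my own example; the slide only names the technique, and OpenMP is just one convenient way to express it in C), an array sum can be split so that each thread works on its own slice of the data:

#include <omp.h>
#include <stdio.h>

#define N 1000000

static double data[N];

int main(void)
{
    double sum = 0.0;

    for (int i = 0; i < N; i++)
        data[i] = 1.0;                 /* fill the domain with sample data */

    /* Domain decomposition: the data (iteration space) is split across
     * threads; each thread sums its own chunk and the partial results
     * are combined by the reduction. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += data[i];

    printf("sum = %g\n", sum);         /* 1000000 */
    return 0;
}

Functional decomposition, by contrast, would give each thread a different kind of task (as the two-thread X/Y example earlier in this lecture did) rather than a different slice of the same data.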

- The computation : communication ratio can help to decide whether a problem is fine grained or coarse grained.
  - 1 : 1 = each intermediate result needs a communication operation
  - 100 : 1 = 100 computations (or intermediate results) require only one communication operation
  - 1 : 100 = each computation needs 100 communication operations
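A rough worked example (the numbers are invented for illustration, not taken from the slides): in the array-sum sketch above, each of, say, 4 workers performs about 1 000 000 / 4 = 250 000 additions and then contributes a single partial sum, so the computation : communication ratio is on the order of 250 000 : 1 and the problem behaves as very coarse grained. If instead every single addition had to be reported to a master process, the ratio would approach 1 : 1 and the same problem would behave as fine grained.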

- Fine grained: one part / sub-process requires a great deal of communication with other parts to complete its work, relative to the amount of computing it does (the computation : communication ratio is low, approaching 1 : 1).
- Coarse grained: a coarse-grained parallel task is largely independent of other tasks, but still requires some communication to complete its part. The computation : communication ratio is high (say around 100 : 1).
- Embarrassingly parallel: so coarse that there is no, or very little, interrelation between the parts / sub-processes.

Fine grained:
- The problem is broken into (usually many) very small pieces
- Problems where any one piece is highly interrelated with others (e.g., having to look at the relations between neighboring gas molecules to determine how a cloud of gas molecules behaves)
- Sometimes, attempts to parallelize fine-grained solutions actually increase the solution time.
- For very fine-grained problems, computational performance is limited both by start-up time and by the speed of the fastest single CPU in the cluster.

Coarse grained:
- The problem is broken into larger pieces
- Usually a low level of interrelation (e.g., the problem can be separated into parts whose elements are unrelated to other parts)
- These solutions are generally easier to parallelize than fine-grained ones, and
- Usually, parallelization of these problems provides significant benefits.
- Ideally, the problem is found to be "embarrassingly parallel" (this can of course also be the case for fine-grained solutions)

- Many image processing problems are suited to coarse-grained solutions, e.g. calculations can be performed on individual pixels or small sets of pixels without requiring knowledge of any other pixel in the image.
- Scientific problems tend to lie between coarse and fine granularity. These solutions may require some amount of interaction between regions, so the individual processors doing the work need to collaborate and exchange results (i.e., there is a need for synchronization and message passing).
- E.g., any one element in the data set may depend on the values of its nearest neighbors. If the data is decomposed into two parts that are each processed by a separate CPU, then the CPUs will need to exchange boundary information (see the sketch below).
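The boundary exchange mentioned above is often implemented as a "halo" or "ghost cell" swap. The following is a hedged sketch under assumed conditions (a 1-D domain split between exactly two MPI ranks, with invented array sizes; the slides do not specify an implementation):

#include <mpi.h>
#include <stdio.h>

#define LOCAL_N 8   /* interior points owned by each rank (illustrative size) */

/* u[0] and u[LOCAL_N + 1] are ghost cells that hold the neighbor's
 * boundary value after the exchange. */
int main(int argc, char **argv)
{
    int rank;
    double u[LOCAL_N + 2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int i = 0; i <= LOCAL_N + 1; i++)
        u[i] = rank;                     /* dummy data for illustration */

    if (rank == 0) {
        /* send my rightmost interior value; receive rank 1's leftmost */
        MPI_Sendrecv(&u[LOCAL_N], 1, MPI_DOUBLE, 1, 0,
                     &u[LOCAL_N + 1], 1, MPI_DOUBLE, 1, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    } else if (rank == 1) {
        /* send my leftmost interior value; receive rank 0's rightmost */
        MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, 0, 0,
                     &u[0], 1, MPI_DOUBLE, 0, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    printf("rank %d ghost cells: left = %g, right = %g\n",
           rank, u[0], u[LOCAL_N + 1]);
    MPI_Finalize();
    return 0;
}

Each rank does LOCAL_N's worth of computation per iteration but exchanges only one value with its neighbor, which is what pushes such problems towards the coarse-grained end of the spectrum.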

Which of the following are more fine-grained, and which are coarse-grained?
- Matrix multiply
- FFTs
- Decryption / code breaking
- (Deterministic) finite state machine validation / termination checking
- Map navigation (e.g., shortest path)
- Population modelling

Which of the following are more fine-grained, and which are coarse-grained?
- Matrix multiply: fine-grained data, coarse-grained function
- FFTs: fine grained
- Decryption / code breaking: coarse grained
- (Deterministic) finite state machine validation / termination checking: coarse grained
- Map navigation (e.g., shortest path): coarse grained
- Population modelling: coarse grained

Review of readings for week #3

Image sources:
- Gold bar: Wikipedia (open commons)
- IBM Blade (CC BY 2.0)
- Takeaway, clock, factory and smoke: public domain (CC0)
- Forest of trees: Wikipedia (open commons)
- Moore's Law graph, processor families per supercomputer over the years: all creative commons, commons.wikimedia.org

Disclaimers and copyright/licensing details
I have tried to follow the correct practices concerning copyright and licensing of material, particularly image sources that have been used in this presentation. I have put much effort into trying to make this material open access so that it can be of benefit to others in their teaching and learning practice. Any mistakes or omissions with regards to these issues I will correct when notified. To the best of my understanding, the material in these slides can be shared according to the Creative Commons "Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)" license, and that is why I selected that license to apply to this presentation (it's not because I particularly want my slides referenced, but more to acknowledge the sources and generosity of others who have provided free material, such as the images I have used).