
1 Lecture 4: Temporal and Spatial Computing, Shared Memory, Granularity. Lecturer: Simon Winberg. License: Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)

2 Syllabus: Lectures 1-5. Might ask something about prac 1 or 2 (i.e. OCTAVE or pthreads) … 45 minutes, usual lecture venue.

3  Temporal & spatial computing
 Extracting concurrency
 Shared memory
 Granularity
Licensing details: last slide

4 Temporal and Spatial Computation
Temporal Computation: the traditional paradigm, typical of programmers; things are done over time steps.
Spatial Computation: suited to hardware, possibly more intuitive; things are related in a space.
Temporal form (Octave-style code):
  A = input("A = ? ");
  B = input("B = ? ");
  C = input("B multiplier ? ");
  X = A + B * C
  Y = A - B * C
Spatial form (dataflow diagram): inputs A?, B? and C? feed a multiplier (B * C), whose output feeds an adder producing X! and a subtractor producing Y!.
Which do you think is easier to make sense of? The spatial form can provide a clearer indication of relative dependencies.

5  Being able to comprehend and extract the parallelism, or properties of concurrency, from a process or algorithm is essential to accelerating computation.
 The Reconfigurable Computing (RC) advantage: the computing platform is able to adapt according to the concurrency inherent in a particular application, in order to accelerate computation for that specific application.

6  The choice of memory architecture is not necessarily dependent on the ‘Flynn classification’.
 For a SISD computer this aspect is largely irrelevant (but consider that a PC with a GPU and DMA does not really fall in the SISD category).

7  Generally, all processors have access to all memory in a global address space.
 Processors operate independently, but they can share the same global memory.
 Changes to global memory made by one processor are seen by the other processors.
 Shared memory machines can be divided into two types, depending on memory access times:
 Uniform Memory Access (UMA), or
 Non-Uniform Memory Access (NUMA)

8  Common today in the form of Symmetric Multi-Processor (SMP) machines
 Identical processors
 Equal access and access times to memory
 Cache coherent: when one processor writes a location in shared memory, all other processors are updated. Cache coherency is implemented at the hardware level (see the sketch below).
(Diagram: several CPUs connected to a single shared MEMORY.)
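To make the shared global address space concrete, here is a minimal sketch (my own illustration, not from the slides) using the pthreads of prac 2: a value written to a global variable by one thread is visible to another once they synchronize. The names are hypothetical.

  /* Minimal sketch: a global variable lives in one address space that all
     threads see. Build with: gcc shared_demo.c -o shared_demo -lpthread */
  #include <pthread.h>
  #include <stdio.h>

  static int shared_value = 0;       /* in the shared global address space */

  static void *writer(void *arg) {
      shared_value = 42;             /* one thread updates global memory */
      return NULL;
  }

  int main(void) {
      pthread_t t;
      pthread_create(&t, NULL, writer, NULL);
      pthread_join(t, NULL);         /* the join orders the write before our read */
      printf("shared_value = %d\n", shared_value);   /* prints 42 */
      return 0;
  }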

9  Not all processors have the same access time to all memories
 Memory access across the link is slower
 If cache coherency is maintained, it may also be called CC-NUMA (Cache Coherent NUMA)
(Diagram: two SMPs, each with its own CPUs and MEMORY, joined by an interconnect bus.)
This architecture has two SMPs connected via a bus. When a CPU on SMP 1 needs to access memory connected to SMP 2, there will be some form of lag, which may be a few times slower than access to SMP 1's own memory (a memory-placement sketch follows below).
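On Linux, a programmer can control where memory is placed on a NUMA machine. The sketch below uses the libnuma library; that library choice, and the node number, are my assumptions for illustration, not something the slides prescribe.

  /* Hedged sketch: place a buffer on NUMA node 0 with Linux's libnuma.
     CPUs on other nodes will see slower access to it.
     Build with: gcc numa_demo.c -o numa_demo -lnuma */
  #include <numa.h>
  #include <stdio.h>

  int main(void) {
      if (numa_available() < 0) {        /* kernel without NUMA support */
          fprintf(stderr, "NUMA not available\n");
          return 1;
      }
      size_t size = 1 << 20;             /* 1 MiB, an arbitrary example size */
      char *buf = numa_alloc_onnode(size, 0);   /* memory pinned to node 0 */
      if (buf != NULL) {
          buf[0] = 1;                    /* touch the page so it is actually allocated */
          numa_free(buf, size);
      }
      return 0;
  }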

10  Advantages:
 Global address space gives a user-friendly programming approach (as discussed in the shared memory programming model)
 Sharing data between tasks is fast and uniform due to the proximity of memory to CPUs
 Disadvantages:
 Major drawback: lack of scalability between memory and CPUs.
 Adding CPUs can increase traffic on the shared memory-CPU path (for cache coherent systems it also increases traffic associated with cache/memory management)

11  Disadvantages (continued):
 The programmer is responsible for implementing/using synchronization constructs to ensure that global memory is accessed correctly (see the sketch below).
 It becomes more difficult and expensive to design and construct shared memory machines with ever increasing numbers of processors.
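As a sketch of that synchronization burden (my illustration, again assuming pthreads): two threads increment one shared counter, and the mutex is what makes the result correct; remove the lock/unlock pair and the updates can race.

  /* Sketch: programmer-supplied synchronization for correct access to
     global memory. Build with: gcc mutex_demo.c -o mutex_demo -lpthread */
  #include <pthread.h>
  #include <stdio.h>

  static long counter = 0;                       /* shared global memory */
  static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

  static void *worker(void *arg) {
      for (int i = 0; i < 100000; i++) {
          pthread_mutex_lock(&lock);             /* protect the read-modify-write */
          counter++;
          pthread_mutex_unlock(&lock);
      }
      return NULL;
  }

  int main(void) {
      pthread_t t1, t2;
      pthread_create(&t1, NULL, worker, NULL);
      pthread_create(&t2, NULL, worker, NULL);
      pthread_join(t1, NULL);
      pthread_join(t2, NULL);
      printf("counter = %ld\n", counter);        /* 200000, thanks to the mutex */
      return 0;
  }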

12  Similar in structure to shared memory systems, but a communications network is required to share data
(Diagram: several CPUs, each with its own Local Memory, connected by a Communications network.)
 Each processor has its own local memory (not directly accessible through the other processors’ memory addresses)
 Processors are connected via a communication network – the network fabric varies; it could simply be Ethernet
 Cache coherency does not apply (when a CPU changes its local memory, the hardware does not notify the other processors – if needed, the programmer has to provide this functionality)
 The programmer is responsible for implementing methods by which one processor can access the memory of a different processor (see the message-passing sketch below)
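One common way of providing that access is a message-passing library such as MPI; the sketch below is my own illustration (the slides do not mandate MPI) of copying a value from one processor's local memory to another's.

  /* Sketch: explicit communication between local memories using MPI.
     Build with: mpicc send_demo.c -o send_demo ; run: mpirun -np 2 ./send_demo */
  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv) {
      int rank, value;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      if (rank == 0) {
          value = 123;                           /* exists only in rank 0's local memory */
          MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
      } else if (rank == 1) {
          MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
          printf("rank 1 received %d\n", value); /* arrived by message, not by sharing */
      }
      MPI_Finalize();
      return 0;
  }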

13  Advantages:
 Memory is scalable with the number of processors
 Each processor can access its own memory quickly, without communication overheads or the need to maintain cache coherency (as UMA shared memory systems must)
 Cost benefits: use of commercial off-the-shelf (COTS) processors and networks

14  Disadvantages:
 The programmer takes on responsibility for data consistency, synchronization and communication between processors.
 Existing (legacy) programs based on shared global memory may be difficult to port to this model.
 It may be more difficult to write applications for distributed memory systems than for shared memory systems.
 Restricted by non-uniform memory access (NUMA) performance (meaning a memory access bottleneck that may be many times slower than in shared memory systems)

15  Simply a network of shared memory systems (possibly in one computer, or a cluster of separate computers)
 Used in many modern supercomputer designs today
 The shared memory part is usually UMA (cache coherent)
 Pros & cons? – the best and worst of both worlds (see the hybrid sketch below)
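A hedged sketch of the hybrid model, under my assumption of an MPI + OpenMP toolchain (a common pairing, but not one the slides prescribe): MPI spans the distributed memory between nodes, while OpenMP threads share memory within each node.

  /* Sketch: hybrid distributed + shared memory.
     Build with: mpicc -fopenmp hybrid.c -o hybrid ; run: mpirun -np 2 ./hybrid */
  #include <mpi.h>
  #include <omp.h>
  #include <stdio.h>

  int main(int argc, char **argv) {
      int rank;
      MPI_Init(&argc, &argv);                    /* distributed memory across nodes */
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      #pragma omp parallel                       /* shared memory within a node */
      {
          printf("node (rank) %d, thread %d\n", rank, omp_get_thread_num());
      }
      MPI_Finalize();
      return 0;
  }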

16  Granularity (a characteristic of the problem)
 How big or small are the parts that the problem has been decomposed into?
 How interrelated are the sub-tasks?
 Decomposition (a development process)
 How the problem can be divided up; relates closely to the granularity of the problem
 Functional decomposition
 Domain (or data) decomposition (see the sketch below)
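A small illustration of domain (data) decomposition (my example; the array size and worker count are arbitrary): the data set is split into equal chunks, each of which could become an independent sub-task, and the chunk size sets the granularity.

  /* Sketch: domain decomposition of an array sum into per-worker chunks.
     Sequential here to keep it short; each chunk could run on its own CPU. */
  #include <stdio.h>

  #define N 8
  #define NWORKERS 2

  int main(void) {
      int data[N] = {1, 2, 3, 4, 5, 6, 7, 8};
      int partial[NWORKERS] = {0};
      int chunk = N / NWORKERS;              /* granularity: elements per sub-task */
      for (int w = 0; w < NWORKERS; w++)     /* each w = one independent sub-task */
          for (int i = w * chunk; i < (w + 1) * chunk; i++)
              partial[w] += data[i];
      /* the combine step is where communication would happen */
      printf("sum = %d\n", partial[0] + partial[1]);
      return 0;
  }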

17  This (computation : communication) ratio can help to decide whether a problem is fine or coarse grained.
 1 : 1 = each intermediate result needs a communication operation
 100 : 1 = 100 computations (or intermediate results) require only one communication operation
 1 : 100 = each computation needs 100 communication operations

20  Fine Grained:
 One part / sub-process requires a great deal of communication with other parts to complete its work, relative to the amount of computing it does (the computation : communication ratio is low, approaching 1:1)
 Coarse Grained:
 A coarse-grained parallel task is largely independent of other tasks, but still requires some communication to complete its part. The computation : communication ratio is high (say around 100:1).
 Embarrassingly Parallel:
 So coarse that there is little or no interrelation between parts/sub-processes

21  Fine grained:
 The problem is broken into (usually many) very small pieces
 Problems where any one piece is highly interrelated with others (e.g., having to look at the relations between neighboring gas molecules to determine how a cloud of gas molecules behaves)
 Sometimes attempts to parallelize fine-grained solutions actually increase the solution time.
 For very fine-grained problems, computational performance is limited both by start-up time and by the speed of the fastest single CPU in the cluster.

22  Coarse grained:
 The problem is broken into larger pieces
 Usually a low level of interrelation (e.g., the problem can be separated into parts whose elements are unrelated to other parts)
 These solutions are generally easier to parallelize than fine-grained ones, and
 Usually parallelization of these problems provides significant benefits.
 Ideally, the problem is found to be “embarrassingly parallel” (this can of course also be the case for fine-grained solutions)

23  Many image processing problems are suited to coarse-grained solutions, e.g. calculations can be performed on individual pixels or small sets of pixels without requiring knowledge of any other pixel in the image.
 Scientific problems tend to be between coarse and fine granularity. These solutions may require some amount of interaction between regions, therefore the individual processors doing the work need to collaborate and exchange results (i.e., there is a need for synchronization and message passing).
 E.g., any one element in the data set may depend on the values of its nearest neighbors. If the data is decomposed into two parts that are each processed by a separate CPU, then the CPUs will need to exchange boundary information (see the sketch below).
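A hedged sketch of that boundary exchange, assuming MPI and exactly two ranks (my assumptions, chosen to match the two-CPU example above): each rank owns half of a 1-D domain and swaps a single boundary (halo) value with its neighbor.

  /* Sketch: exchanging boundary (halo) values between two ranks.
     Build with: mpicc halo.c -o halo ; run: mpirun -np 2 ./halo */
  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv) {
      int rank, other, halo;
      int local[4] = {0};                    /* this rank's half of the data set */
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      other = 1 - rank;                      /* assumes exactly two ranks */
      local[0] = local[3] = rank + 1;        /* illustrative boundary data */
      /* Sendrecv swaps boundary cells; it avoids the deadlock risk of two
         blocking sends. Rank 0 sends its right edge, rank 1 its left edge. */
      MPI_Sendrecv(&local[rank == 0 ? 3 : 0], 1, MPI_INT, other, 0,
                   &halo, 1, MPI_INT, other, 0,
                   MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      printf("rank %d received neighbor boundary value %d\n", rank, halo);
      MPI_Finalize();
      return 0;
  }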

24  Which of the following are more fine-grained, and which are coarse-grained?
 Matrix multiply
 FFTs
 Decryption code breaking
 (deterministic) Finite state machine validation / termination checking
 Map navigation (e.g., shortest path)
 Population modelling

25  Which of the following are more fine-grained, and which are coarse-grained?
 Matrix multiply - fine-grained data decomposition, coarse-grained functional decomposition
 FFTs - fine grained
 Decryption code breaking - coarse grained
 (deterministic) Finite state machine validation / termination checking - coarse grained
 Map navigation (e.g., shortest path) - coarse grained
 Population modelling - coarse grained

26  Review of readings for week #3

27 Image sources:
Gold bar: Wikipedia (open commons)
IBM Blade (CC BY 2.0) ref: http://www.flickr.com/photos/hongiiv/407481199/
Takeaway, clock, factory and smoke: public domain CC0 (http://pixabay.com/)
Forest of trees: Wikipedia (open commons)
Moore's Law graph, processor families per supercomputer over years: all creative commons, commons.wikimedia.org
Disclaimers and copyright/licensing details
I have tried to follow the correct practices concerning copyright and licensing of material, particularly image sources that have been used in this presentation. I have put much effort into trying to make this material open access so that it can be of benefit to others in their teaching and learning practice. Any mistakes or omissions with regards to these issues I will correct when notified. To the best of my understanding the material in these slides can be shared according to the Creative Commons “Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)” license, and that is why I selected that license to apply to this presentation (it's not because I particularly want my slides referenced, but more to acknowledge the sources and generosity of others who have provided free material such as the images I have used).

