
1 Revisiting a slide from the syllabus: CS 525 will cover
Parallel and distributed computing architectures
– Shared memory processors
– Distributed memory processors
– Multi-core processors
Parallel programming
– Shared memory programming (OpenMP)
– Distributed memory programming (MPI)
– Thread-based programming
Parallel algorithms
– Sorting
– Matrix-vector multiplication
– Graph algorithms
Applications
– Computational science and engineering
– High-performance computing

2 Revisiting a slide from the syllabus (the same list as above, with the aspects this lecture covers highlighted)
This lecture hopes to touch on each of the highlighted aspects via one running example

3 Intel Nehalem: an example of a multi-core shared-memory architecture
A few of the features:
– 2 sockets, 4 cores per socket, 2 hyper-threads per core → 16 hardware threads in total
– Processor speed: 2.5 GHz
– 24 GB total memory
– Cache: L1 (32 KB, on-core), L2 (2x6 MB, on-core), L3 (8 MB, shared)
– Memory: virtually globally shared (from the programmer's point of view)
– Multithreading: simultaneous (multiple instructions from ready threads can be executed in a given cycle)
(Block diagram shown on the slide)
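Not part of the slide, but as a quick way to check the "16 threads in total" figure on a machine like this: the OpenMP runtime can report the number of logical processors and the default thread-team size. A minimal C sketch (compile with, e.g., gcc -fopenmp; the printed values depend on the machine and on settings such as OMP_NUM_THREADS):

#include <stdio.h>
#include <omp.h>

int main(void) {
    /* Logical processors visible to the runtime:
       2 sockets x 4 cores x 2 hyper-threads = 16 on the machine above. */
    printf("logical processors: %d\n", omp_get_num_procs());

    /* Default upper bound on the number of threads in a parallel region. */
    printf("max OpenMP threads: %d\n", omp_get_max_threads());

    #pragma omp parallel
    {
        #pragma omp single
        printf("threads in this parallel region: %d\n", omp_get_num_threads());
    }
    return 0;
}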

4 Overview of Programming Models
Programming models provide support for expressing concurrency and synchronization
Process-based models assume that all data associated with a process is private by default, unless otherwise specified
Lightweight processes and threads assume that all memory is global (see the sketch below)
– A thread is a single stream of control in the flow of a program
Directive-based programming models extend the threaded model by facilitating the creation and synchronization of threads
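To make the thread model concrete (all memory global, synchronization left to the programmer), here is a minimal POSIX threads sketch, not taken from the slides: two threads update the same global counter, which separate processes could not do without explicitly setting up shared memory. Compile with, e.g., gcc -pthread.

#include <stdio.h>
#include <pthread.h>

long counter = 0;                               /* global: visible to every thread */
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void *work(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);              /* synchronization is explicit */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, work, NULL);
    pthread_create(&t2, NULL, work, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);         /* 200000: both threads shared one address space */
    return 0;
}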

5 Advantages of Multithreading (threaded programming)
Threads provide software portability
Multithreading enables latency hiding
Multithreading takes scheduling and load-balancing burdens away from the programmer
Multithreaded programs are significantly easier to write than message-passing programs
– Becoming increasingly widespread in use due to the abundance of multicore platforms

6 Shared Address Space Programming APIs
The POSIX Threads (Pthreads) API
– Has emerged as the standard threads API
– Low-level primitives (relatively difficult to work with)
OpenMP
– A directive-based API for programming shared address space platforms (has become a standard); an example follows below
– Used with Fortran, C, and C++
– Directives provide support for concurrency, synchronization, and data handling without having to explicitly manipulate threads
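A minimal illustration of the directive style (not from the slides): a single pragma parallelizes a loop with a reduction, and the runtime handles thread creation, work distribution, and combining the per-thread partial sums. Compile with, e.g., gcc -fopenmp.

#include <stdio.h>

int main(void) {
    const int n = 1000000;
    double sum = 0.0;

    /* The directive creates a team of threads, splits the iterations among
       them, gives each thread a private partial sum, and combines the
       partial sums at the end -- no explicit thread management is needed. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += 1.0 / (i + 1.0);

    printf("partial harmonic sum = %f\n", sum);
    return 0;
}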

7 Parallelizing Graph Algorithms
Challenges:
– Runtime is dominated by memory latency rather than processor speed
– Little work is done while visiting a vertex or an edge, so there is little computation to hide the memory access cost
– Access patterns are determined only at runtime, so prefetching techniques are inapplicable
– Data locality is poor, making it difficult to obtain good memory system performance
For these reasons, parallel performance
– on distributed memory machines is often poor
– on shared memory machines is often better
We consider here graph coloring as an example of a graph algorithm to parallelize on shared memory machines

8 Graph coloring
Graph coloring is an assignment of colors (positive integers) to the vertices of a graph such that adjacent vertices get different colors
The objective is to find a coloring that uses the fewest colors
Examples of applications:
– Concurrency discovery in parallel computing (illustrated in the figure on the slide)
– Sparse derivative computation
– Frequency assignment
– Register allocation, etc.

9 A greedy algorithm for coloring
Graph coloring is NP-hard to solve optimally (and even to approximate)
The following greedy algorithm (GREEDY, shown as a listing on the slide) gives very good solutions in practice
Complexity of GREEDY: O(|E|) (thanks to the way the array forbiddenColors is used)
– color is a vertex-indexed array that stores the color of each vertex
– forbiddenColors is a color-indexed array used to mark colors that are impermissible for the vertex under consideration
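The GREEDY listing itself appears only as an image on the slide; the following is a C sketch of a greedy coloring in that style. The names color and forbiddenColors follow the slide, while the CSR-style adjacency arrays (adjIndex, adj) and the natural vertex ordering are assumptions made here for illustration.

#include <stdlib.h>

/* Greedy coloring sketch.  n vertices; the neighbors of vertex v are
 * adj[adjIndex[v] .. adjIndex[v+1]-1].  On return, color[v] holds the
 * color of v (1-based).  Returns the number of colors used. */
int greedy_color(int n, const int *adjIndex, const int *adj, int *color) {
    /* forbiddenColors[c] == v means color c is forbidden for vertex v.
     * Stamping entries with the current vertex id avoids resetting the
     * array for every vertex, which keeps the total work linear in the
     * graph size, as noted on the slide. */
    int *forbiddenColors = malloc((n + 1) * sizeof(int));
    int maxColor = 0;

    for (int c = 0; c <= n; c++) forbiddenColors[c] = -1;
    for (int v = 0; v < n; v++) color[v] = 0;

    for (int v = 0; v < n; v++) {
        /* Mark the colors already used by v's neighbors as forbidden. */
        for (int j = adjIndex[v]; j < adjIndex[v + 1]; j++)
            if (color[adj[j]] > 0) forbiddenColors[color[adj[j]]] = v;

        /* Assign the smallest color not forbidden for v. */
        int c = 1;
        while (forbiddenColors[c] == v) c++;
        color[v] = c;
        if (c > maxColor) maxColor = c;
    }
    free(forbiddenColors);
    return maxColor;
}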

10 Parallelizing Greedy Coloring
Desired goal: parallelize GREEDY such that
– the parallel runtime is roughly O(|E|/p) when p processors (threads) are used
– the number of colors used is nearly the same as in the serial case
This is difficult to achieve since GREEDY is inherently sequential
Challenge: come up with a way to create concurrency in a nontrivial way

11 A potentially “generic” parallelization technique
“Standard” partitioning
– Break the given problem into p independent subproblems of almost equal size
– Solve the p subproblems concurrently
– The main work lies in the decomposition step, which is often no easier than solving the original problem
“Relaxed” partitioning
– Break the problem into p subproblems of almost equal size that are not necessarily entirely independent
– Solve the p subproblems concurrently
– Detect inconsistencies in the solutions concurrently
– Resolve any inconsistencies
– Can potentially be used successfully if the resolution in the fourth step involves only local adjustments

12 “Relaxed partitioning” applied to parallelizing greedy coloring
Speculation and iteration:
– Color as many vertices as possible concurrently, tentatively tolerating potential conflicts; then detect and resolve conflicts afterwards (iteratively)

13 Parallel Coloring on Shared Memory Platforms (using Speculation and Iteration)
Lines 4 and 9 of the algorithm listing on the slide (Algorithm 2, not reproduced in this transcript) can be parallelized using the OpenMP directive #pragma omp parallel for
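Since Algorithm 2 is shown only as an image on the slide, the following C/OpenMP sketch is a reconstruction of the speculation-and-iteration scheme it describes, not the slide's exact code: one parallel loop colors the current vertex set speculatively, a second parallel loop detects conflicts, and the conflicting vertices are recolored in the next round. The data structures (CSR adjacency, work lists U and R) and the rule for recoloring the lower-numbered endpoint of a conflicting edge are assumptions for illustration; phase 1 uses a parallel region with an inner for so each thread can keep its own forbidden-color scratch array.

#include <stdlib.h>
#include <omp.h>

/* Speculative, iterative parallel coloring sketch.  Neighbors of v are
 * adj[adjIndex[v] .. adjIndex[v+1]-1]; color[] is 1-based, 0 = uncolored. */
void parallel_color(int n, const int *adjIndex, const int *adj, int *color) {
    int *U = malloc(n * sizeof(int));   /* vertices to color this round */
    int *R = malloc(n * sizeof(int));   /* conflicting vertices for the next round */
    int uSize = n, rSize;

    for (int v = 0; v < n; v++) { U[v] = v; color[v] = 0; }

    while (uSize > 0) {
        /* Phase 1: color all vertices in U concurrently and speculatively. */
        #pragma omp parallel
        {
            /* Per-thread scratch: mark[c] == v means color c is forbidden for v. */
            int *mark = malloc((n + 2) * sizeof(int));
            for (int c = 0; c < n + 2; c++) mark[c] = -1;

            #pragma omp for schedule(dynamic, 64)
            for (int i = 0; i < uSize; i++) {
                int v = U[i];
                for (int j = adjIndex[v]; j < adjIndex[v + 1]; j++)
                    if (color[adj[j]] > 0) mark[color[adj[j]]] = v;
                int c = 1;
                while (mark[c] == v) c++;
                color[v] = c;           /* may clash with a neighbor colored concurrently */
            }
            free(mark);
        }

        /* Phase 2: detect conflicts concurrently; recolor the lower-numbered
         * endpoint of each conflicting edge in the next round. */
        rSize = 0;
        #pragma omp parallel for schedule(dynamic, 64)
        for (int i = 0; i < uSize; i++) {
            int v = U[i];
            for (int j = adjIndex[v]; j < adjIndex[v + 1]; j++) {
                int w = adj[j];
                if (v < w && color[v] == color[w]) {
                    int pos;
                    #pragma omp atomic capture
                    pos = rSize++;
                    R[pos] = v;
                    break;
                }
            }
        }

        int *tmp = U; U = R; R = tmp;   /* conflicting vertices become the next work list */
        uSize = rSize;
    }
    free(U); free(R);
}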

14 Sample experimental results of Algorithm 2 (Iterative) on Nehalem: I
Graph (RMAT-G): 16.7M vertices; 133.1M edges; degree (avg = 16, max = 1,278, variance = 416)

15 Sample experimental results of Algorithm 2 (Iterative) on Nehalem: II
Graph (RMAT-B): 16.7M vertices; 133.7M edges; degree (avg = 16, max = 38,143, variance = 8,086)

16 Recap
Saw an example of a multi-core architecture
– Intel Nehalem
Took a bird's-eye view of parallel programming models
– OpenMP (highlighted)
Saw an example of the design of a graph algorithm
– Graph coloring
Saw some experimental results

