Parallel Computing Platforms


1 Parallel Computing Platforms
- Motivation: High Performance Computing
- Dichotomy of Parallel Computing Platforms
- Communication Model of Parallel Platforms
- Physical Organization of Parallel Platforms
- Communication Costs in Parallel Machines

2 High Performance Computing
The computing power required to solve computationally intensive and/or data-intensive problems effectively in science, engineering, and other emerging disciplines. It is provided through parallel computers and parallel programming techniques. Is this compute power not available otherwise?

3 Elements of a Parallel Computer
- Hardware: multiple processors, multiple memories, an interconnection network.
- System software: a parallel operating system; programming constructs to express/orchestrate concurrency.
- Application software: parallel algorithms.
Goal: utilize the hardware, system, and application software to either achieve speedup (S = Ts/Tp; ideally Tp = Ts/p) or solve problems requiring a large amount of memory.
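Written out, the speedup goal reads as follows (a standard formulation; the ideal case is what the slide's Tp = Ts/p expresses):

```latex
S \;=\; \frac{T_s}{T_p} \;\le\; p,
\qquad
\text{ideal case: } T_p = \frac{T_s}{p} \;\Rightarrow\; S = p .
```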

4 Dichotomy of Parallel Computing Platforms
- Logical organization: the machine as the user sees it, as presented via its system software.
- Physical organization: the actual hardware architecture.
The physical architecture is, to a large extent, independent of the logical architecture.

5 Logical Organization
An explicitly parallel program must specify both concurrency and the interaction between concurrent tasks. That is, logically there are two critical components of parallel computing:
- Control structure: how parallel tasks are expressed.
- Communication model: the mechanism for specifying interaction.
Parallelism can be expressed at various levels of granularity, from the instruction level to processes.

6 Control Structure of Parallel Platforms
Processing units in parallel computers either operate under the centralized control of a single control unit or work independently. If a single control unit dispatches the same instruction to multiple processors (each working on different data), the model is referred to as single instruction stream, multiple data stream (SIMD). If each processor has its own control unit, each processor can execute different instructions on different data items; this model is called multiple instruction stream, multiple data stream (MIMD).
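As an illustration of the SIMD control model, the C sketch below simulates lock-step execution of a conditional across a handful of processing elements. The lane count and data values are hypothetical, and a real SIMD machine applies the activity mask in hardware rather than with explicit loops:

```c
#include <stdio.h>

#define LANES 4  /* hypothetical number of SIMD processing elements */

/* Simulates SIMD execution of: if (a[i] > 0) b[i] = a[i]; else b[i] = -a[i];
 * Every lane receives the same instruction stream; an activity mask
 * decides which lanes commit results at each step. */
int main(void) {
    int a[LANES] = { 3, -1, 4, -5 };
    int b[LANES];
    int mask[LANES];

    /* Step 1: all lanes evaluate the predicate (same instruction). */
    for (int lane = 0; lane < LANES; ++lane)
        mask[lane] = (a[lane] > 0);

    /* Step 2: the "then" branch is issued to every lane; only active lanes commit. */
    for (int lane = 0; lane < LANES; ++lane)
        if (mask[lane]) b[lane] = a[lane];

    /* Step 3: the "else" branch is issued with the mask inverted. */
    for (int lane = 0; lane < LANES; ++lane)
        if (!mask[lane]) b[lane] = -a[lane];

    for (int lane = 0; lane < LANES; ++lane)
        printf("b[%d] = %d\n", lane, b[lane]);
    return 0;
}
```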

7 SIMD and MIMD Processors
A typical SIMD architecture (a) and a typical MIMD architecture (b).

8 SIMD Processors
[Diagram: a single instruction stream drives processors A, B, and C, each with its own data input stream and data output stream.]

9 MIMD Processors
[Diagram: processors A, B, and C each receive their own instruction stream (A, B, C) and operate on their own data input and output streams.]

10 SIMD & MIMD Processors (cont’d)
SIMD computers:
- rely on the regular structure of computations (such as those in image processing);
- require less hardware than MIMD computers (a single control unit);
- require less memory;
- are specialized: not suited to all applications.
In contrast to SIMD processors, MIMD processors can execute different programs on different processors. The single program, multiple data (SPMD) model executes the same program on different processors (sketched below). SPMD and MIMD are closely related in terms of programming flexibility and underlying architectural support.
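A minimal SPMD sketch using MPI (assuming an MPI installation; compile with mpicc and launch with mpirun): every process runs the same program and differentiates its behavior by branching on its rank.

```c
#include <mpi.h>
#include <stdio.h>

/* SPMD: one program runs on every processor; behavior is
 * differentiated by the process rank. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0)
        printf("Coordinator: %d processes running the same program\n", size);
    else
        printf("Worker %d: operating on my own slice of the data\n", rank);

    MPI_Finalize();
    return 0;
}
```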

11 Logical Organization: Communication Model
There are two primary forms of data exchange between parallel tasks:
- accessing a shared data space, and
- exchanging messages.
Platforms that provide a shared data space are called shared-address-space machines or multiprocessors. Platforms that support message passing are called message-passing platforms or multicomputers.

12 Shared-Address-Space Platforms
Part (or all) of the memory is accessible to all processors. Processors interact by modifying data objects stored in this shared address space. If the time taken by a processor to access any memory word in the system (global or local) is identical, the platform is classified as a uniform memory access (UMA) machine; otherwise, it is a non-uniform memory access (NUMA) machine.

13 NUMA and UMA Shared-Address-Space Platforms
Typical shared-address-space architectures: (a) uniform-memory-access shared-address-space computer; (b) uniform-memory-access shared-address-space computer with caches and memories; (c) non-uniform-memory-access shared-address-space computer with local memory only.

14 NUMA and UMA Shared-Address-Space Platforms
The distinction between NUMA and UMA platforms is important from the point of view of algorithm design: NUMA machines require locality in the underlying algorithms for good performance. Programming shared-address-space platforms is easier because reads and writes are implicitly visible to other processors. However, reads and writes to shared data must be coordinated, and caches in such machines require coordinated access to multiple copies of the same data. This leads to the cache coherence problem.

15 Shared-Address-Space vs. Shared Memory Machines
- A shared address space is a programming abstraction.
- Shared memory is a physical machine attribute.
It is possible to provide a shared address space on top of physically distributed memory: such machines are called distributed shared memory machines. Shared-address-space machines are commonly programmed using Pthreads and OpenMP (see the OpenMP sketch below).
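A minimal OpenMP sketch of shared-address-space programming, assuming a compiler with OpenMP support (e.g., gcc -fopenmp); the array size and the work done per element are hypothetical:

```c
#include <omp.h>
#include <stdio.h>

#define N 1000000  /* hypothetical problem size */

/* The array a lives in memory visible to every thread; the pragma
 * divides the loop iterations among the threads. */
int main(void) {
    static double a[N];

    #pragma omp parallel for        /* iterations split across threads */
    for (int i = 0; i < N; ++i)
        a[i] = 2.0 * i;             /* writes are implicitly visible to all threads */

    printf("a[N-1] = %f (up to %d threads)\n",
           a[N - 1], omp_get_max_threads());
    return 0;
}
```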

16 Message-Passing Platforms
These platforms consist of a set of processors, each with its own (exclusive) memory. Instances of this view arise naturally from clustered workstations and non-shared-address-space multicomputers. These platforms are programmed using (variants of) send and receive primitives; libraries such as MPI and PVM provide such primitives (see the MPI sketch below).
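A minimal send/receive sketch in C using MPI (the payload value is hypothetical; PVM offers analogous primitives). Run with at least two processes:

```c
#include <mpi.h>
#include <stdio.h>

/* Rank 0 sends a value to rank 1 using the send/receive primitives
 * mentioned above. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        int payload = 42;  /* hypothetical data */
        MPI_Send(&payload, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int payload;
        MPI_Recv(&payload, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d from rank 0\n", payload);
    }

    MPI_Finalize();
    return 0;
}
```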

17 Physical Organization: Interconnection Networks (ICNs)
Interconnection networks provide processor-to-processor and processor-to-memory connections. They are classified as static or dynamic:
- Static (direct) networks consist of a number of point-to-point links; historically used to link processors to processors in distributed-memory systems.
- Dynamic (indirect) networks consist of switching elements to which the various processors attach; historically used to link processors to memory in shared-memory systems.

18 Static and Dynamic Interconnection Networks
Classification of interconnection networks: (a) a static network; and (b) a dynamic network.

19 Network Topologies
[Classification diagram: interconnection networks split into static and dynamic. Static networks include 1-D, 2-D, and hypercube (HC) topologies. Dynamic networks are bus-based (single or multiple bus) or switch-based (single-stage (SS), multistage (MS), or crossbar).]

20 Network Topologies: Static ICNs
Static (fixed) interconnection networks are characterized by fixed paths, unidirectional or bidirectional, between processors.
- Completely connected networks (CCNs): number of links O(N²); delay complexity O(1).
- Limited connection networks (LCNs): linear arrays, ring (loop) networks, two-dimensional arrays, tree networks, and cube networks.

21 Network Topologies: Dynamic ICNs
A variety of dynamic network topologies have been proposed and implemented: bus-based, crossbar, multistage, etc. These topologies trade off performance for cost. Commercial machines often implement hybrids of multiple topologies for reasons of packaging, cost, and available components.

22 Network Topologies: Buses
- Shared medium; ideal for broadcasting information.
- The distance between any two nodes is constant.
- The bandwidth of the shared bus is a major bottleneck; local memories can improve performance.
- Scalable in terms of cost, unscalable in terms of performance.

23 Network Topologies: Crossbars
A crossbar uses a p×m grid of switches to connect p inputs to m outputs in a non-blocking manner. The cost of a crossbar of p processors grows as Θ(p²). Crossbars are scalable in terms of performance but unscalable in terms of cost. A completely non-blocking crossbar network connecting p processors to b memory banks.

24 Network Topologies: Multistage Networks
Multistage networks strike a compromise between the cost scalability of the bus and the performance scalability of the crossbar. The schematic of a typical multistage interconnection network.

25 Network Topologies: Multistage Omega Network
One of the most commonly used multistage interconnects is the Omega network. This network consists of log p stages, where p is the number of inputs/outputs. At each stage, input i is connected to output j by the perfect-shuffle rule:

  j = 2i,           0 ≤ i ≤ p/2 − 1
  j = 2i + 1 − p,   p/2 ≤ i ≤ p − 1
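The mapping above can be computed directly; a small C sketch (the input count p is hypothetical and must be a power of two):

```c
#include <stdio.h>

/* Perfect-shuffle connection used at each Omega-network stage:
 * output j for input i among p inputs (p a power of two).
 * Equivalent to a one-bit left circular rotation of i's log2(p) bits. */
unsigned shuffle(unsigned i, unsigned p) {
    if (i < p / 2)
        return 2 * i;          /* 0 <= i < p/2 */
    else
        return 2 * i + 1 - p;  /* p/2 <= i < p */
}

int main(void) {
    unsigned p = 8;            /* 8 inputs/outputs, i.e. log2(8) = 3 stages */
    for (unsigned i = 0; i < p; ++i)
        printf("input %u -> output %u\n", i, shuffle(i, p));
    return 0;
}
```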

26 Network Topologies: Multistage Omega Network
Each stage of the Omega network implements a perfect shuffle as follows: A perfect shuffle interconnection for eight inputs and outputs.

27 Network Topologies: Completely Connect and Star Networks
- The completely connected network is the static counterpart of the crossbar: performance scales very well, but the hardware complexity is not realizable for large values of p.
- The star network is the static counterpart of the bus: the central processor is the bottleneck.
(a) A completely-connected network of eight nodes; (b) a star-connected network of nine nodes.

28 Network Topologies: Linear Arrays, Meshes, and k-d Meshes
In a linear array, each node has two neighbors, one to its left and one to its right. If the nodes at either end are connected, we refer to it as a 1-D torus or a ring. A generalization to 2 dimensions has nodes with 4 neighbors, to the north, south, east, and west. A further generalization to d dimensions has nodes with 2d neighbors. A special case of a d-dimensional mesh is a hypercube. Here, d = log p, where p is the total number of nodes.

29 Network Topologies: Linear Arrays and Meshes
Two and three dimensional meshes: (a) 2-D mesh with no wraparound; (b) 2-D mesh with wraparound link (2-D torus); and (c) a 3-D mesh with no wraparound.

30 Network Topologies: Hypercubes and their Construction
Construction of hypercubes from hypercubes of lower dimension.

31 Network Topologies: Properties of Hypercubes
- The distance between any two nodes is at most log p.
- Each node has log p neighbors.
- The distance between two nodes equals the number of bit positions in which their labels differ (the Hamming distance).
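The distance rule translates directly into code: XOR the two node labels and count the set bits. A small C sketch (the node labels are hypothetical):

```c
#include <stdio.h>

/* Hypercube distance = Hamming distance between node labels. */
int hypercube_distance(unsigned a, unsigned b) {
    unsigned diff = a ^ b;   /* bits set where the labels differ */
    int dist = 0;
    while (diff) {           /* count the set bits */
        dist += diff & 1u;
        diff >>= 1;
    }
    return dist;
}

int main(void) {
    /* d = 4 hypercube (p = 16 nodes): nodes 0101 and 0110 differ in 2 bits */
    printf("distance(5, 6) = %d\n", hypercube_distance(5u, 6u)); /* prints 2 */
    return 0;
}
```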

32 Network Topologies: Tree-Based Networks
Complete binary tree networks: (a) a static tree network; and (b) a dynamic tree network.

33 Evaluation Metrics for ICNs
The following evaluation metrics are the criteria used to characterize the cost and performance of static ICNs:
- Diameter: the maximum distance between any two nodes; smaller is better.
- Connectivity: the minimum number of arcs that must be removed to break the network into two disconnected networks; larger is better.
- Bisection width: the minimum number of arcs that must be removed to partition the network into two equal halves; larger is better.
- Cost: the number of links in the network; smaller is better.

34 Evaluating Static Interconnection Networks
Network                    Diameter          Bisection Width   Arc Connectivity   Cost (No. of links)
Completely-connected       1                 p²/4              p − 1              p(p − 1)/2
Star                       2                 1                 1                  p − 1
Complete binary tree       2 log((p+1)/2)    1                 1                  p − 1
Linear array               p − 1             1                 1                  p − 1
2-D mesh, no wraparound    2(√p − 1)         √p                2                  2(p − √p)
2-D wraparound mesh        2⌊√p/2⌋           2√p               4                  2p
Hypercube                  log p             p/2               log p              (p log p)/2
Wraparound k-ary d-cube    d⌊k/2⌋            2k^(d−1)          2d                 dp

35 Evaluating Dynamic Interconnection Networks
Network         Diameter   Bisection Width   Arc Connectivity   Cost (No. of links)
Crossbar        1          p                 1                  p²
Omega Network   log p      p/2               2                  (p/2) log p
Dynamic Tree    2 log p    1                 2                  p − 1

36 Communication Costs in Parallel Machines
Along with idling and contention, communication is a major source of overhead in parallel programs. Communication cost depends on many factors, including network topology, data handling, and routing.

37 Message Passing Costs in Parallel Computers
The communication cost of a data-transfer operation depends on:
- Startup time (ts): the time to add headers/trailers, apply error correction, execute the routing algorithm, and establish the connection between source and destination.
- Per-hop time (th): the time to travel between two directly connected nodes (node latency).
- Per-word transfer time (tw): the inverse of the channel bandwidth (proportional to 1/channel-width).

38 Store-and-Forward Routing
A message traversing multiple hops is completely received at an intermediate hop before being forwarded to the next hop. The total communication cost for a message of size m words to traverse l communication links is

  t_comm = ts + (m·tw + th)·l

In most platforms, th is small, and the above expression can be approximated by

  t_comm = ts + m·l·tw
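A small C sketch of this cost model; all parameter values are hypothetical and chosen only to illustrate that the approximation is close when th is small:

```c
#include <stdio.h>

/* Store-and-forward cost model from above: t = ts + (m*tw + th) * l. */
double sf_cost(double ts, double th, double tw, double m, double l) {
    return ts + (m * tw + th) * l;
}

int main(void) {
    double ts = 100.0, th = 1.0, tw = 0.5;  /* microseconds, hypothetical */
    double m = 1000.0;                      /* message size in words */
    double l = 4.0;                         /* number of hops */
    printf("exact:  %.1f us\n", sf_cost(ts, th, tw, m, l)); /* 2104.0 */
    printf("approx: %.1f us\n", ts + m * l * tw);           /* 2100.0 */
    return 0;
}
```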

39 Routing Techniques
Passing a message from node P0 to P3: (a) through a store-and-forward communication network; (b) and (c) extending the concept to cut-through routing. The shaded regions represent the time during which the message is in transit. The startup time associated with this message transfer is assumed to be zero.

40 Packet Routing
Store-and-forward makes poor use of communication resources. Packet routing breaks messages into packets and pipelines them through the network. Since packets may take different paths, each packet must carry routing information, error checking, sequencing, and other related header information. The total communication time for packet routing is approximated by

  t_comm = ts + th·l + tw·m

where the factor tw accounts for overheads in packet headers.

41 Cut-Through Routing
Cut-through routing takes the concept of packet routing to an extreme by further dividing messages into basic units called flits. Since flits are typically small, the header information must be minimized. This is done by forcing all flits to take the same path, in sequence:
- A tracer message first programs all intermediate routers; all flits then take the same route.
- Error checks are performed on the entire message, rather than on individual flits.
- No sequence numbers are needed.

42 Cut-Through Routing (cont’d)
The total communication time for cut-through routing is approximated by

  t_comm = ts + l·th + tw·m

This is identical in form to the packet-routing expression; however, tw is typically much smaller.

43 Simplified Cost Model for Communicating Messages
The cost of communicating a message between two nodes l hops away using cut-through routing is given by

  t_comm = ts + l·th + tw·m

In this expression, the per-hop term l·th is typically much smaller than the startup and per-word terms, so it contributes little, particularly when m is large. Furthermore, it is often not possible to control routing and the placement of tasks. For these reasons, we can approximate the cost of message transfer by

  t_comm = ts + tw·m
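A small C sketch comparing the full cut-through expression with the simplified one, using hypothetical parameter values; for large m the l·th term is negligible:

```c
#include <stdio.h>

/* Full cut-through model t = ts + l*th + tw*m versus the
 * simplified model t = ts + tw*m. */
int main(void) {
    double ts = 100.0, th = 1.0, tw = 0.5;  /* microseconds, hypothetical */
    double m = 10000.0;                     /* message size in words */
    double l = 10.0;                        /* number of hops */

    double full = ts + l * th + tw * m;     /* 5110.0 */
    double simplified = ts + tw * m;        /* 5100.0 */

    printf("full: %.1f us, simplified: %.1f us (error %.2f%%)\n",
           full, simplified, 100.0 * (full - simplified) / full);
    return 0;
}
```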

44 Notes on the Simplified Cost Model
The given cost model allows algorithms to be designed in an architecture-independent manner. However, it makes the following assumptions:
- Communication between any pair of nodes takes equal time.
- The underlying network is uncongested.
- The underlying network is completely connected.
- Cut-through routing is used.

