Distributed and Parallel Processing George Wells
Terminology (cont.)
Flynn’s Taxonomy
- Single Instruction stream, Single Data stream (SISD) = serial computer
- Single Instruction stream, Multiple Data stream (SIMD) = processor arrays / vector processors / GPUs
- Multiple Instruction stream, Single Data stream (MISD)
- Multiple Instruction stream, Multiple Data stream (MIMD) = multiprocessors
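A minimal sketch (my own illustration in Python, using NumPy and the standard multiprocessing module, none of which appear in the slides) loosely contrasting the SIMD and MIMD styles: one instruction applied across many data elements, versus independent instruction streams each working on its own data.

# SIMD style: one instruction (a vector add) applied to many data elements at once.
import numpy as np
a = np.arange(8)
b = np.arange(8)
print(a + b)        # element-wise addition across the whole array

# MIMD style: independent instruction streams running on separate processors.
from multiprocessing import Pool

def square(x):      # one "program" ...
    return x * x

def cube(x):        # ... and a different one, running concurrently
    return x * x * x

if __name__ == "__main__":
    with Pool(2) as pool:
        r1 = pool.apply_async(square, (3,))
        r2 = pool.apply_async(cube, (3,))
        print(r1.get(), r2.get())   # 9 27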
Terminology
Middleware
- Connectivity software
- Functional set of APIs
- “the software layer that lies between the operating system and the applications on each side”
Terminology
Data Access
- DBMSs house data and manage access to it
- Allow disparate data sources to be viewed in a consistent way
- Database middleware – data passing
Terminology
MOM – Message Oriented Middleware
- Resides between applications and the network infrastructure
- Refers to the process of distributing data and control through the exchange of messages
- Includes message passing and message queueing models
- Asynchronous and synchronous communication
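A minimal sketch of the message-queueing style (my own illustration: plain Python threads and queue.Queue stand in for a real MOM product, and the queue name and message format are made up). The sender enqueues a message and carries on (asynchronous send), while the receiver blocks on the queue (synchronous receive).

import queue, threading

orders = queue.Queue()          # stands in for a named queue managed by the MOM

def producer():
    orders.put({"id": 1, "item": "widget"})   # asynchronous send: returns immediately
    print("producer: message queued, continuing with other work")

def consumer():
    msg = orders.get()          # synchronous receive: blocks until a message arrives
    print("consumer: processing", msg)
    orders.task_done()

threading.Thread(target=producer).start()
threading.Thread(target=consumer).start()
orders.join()                   # wait until the queued message has been processed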
Granularity
The term grain is used to indicate the amount of computation performed between synchronisations:
- Coarse grain – large amounts of computation between synchronisations
- Fine grain – little computation between synchronisations
Communication : Computation Ratio
- Important performance characteristic when communication is explicit (e.g. message passing)
- Related to grain size
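A small worked example (my own illustration, not from the slides): for an n × n grid computation block-partitioned across p processors, each processor computes on roughly n²/p points per step but exchanges only its block boundary of roughly 4n/√p points, so the communication-to-computation ratio falls (the grain coarsens) as n grows.

import math

def comm_to_comp_ratio(n, p):
    # Assumed cost model: one unit of work per grid point per step,
    # one value exchanged per boundary point of each processor's block.
    computation = n * n / p                 # points computed per processor per step
    communication = 4 * n / math.sqrt(p)    # boundary points exchanged per step
    return communication / computation

for n in (100, 1000, 10000):
    print(n, round(comm_to_comp_ratio(n, p=16), 4))
# The ratio shrinks as n grows: larger grain, relatively less communication.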
Hardware Models
- The RAM (Random Access Machine) model provides a useful abstraction
- We can reason about the performance of algorithms, etc.
- Can we create a similar model for parallel systems?
PRAM – Parallel Random Access Machine
- Multiple processing units connected to a shared memory unit
- Instructions executed in lock-step
  - Simplifies synchronisation
- Multiple simultaneous accesses to one memory location
  - Differing approaches: disallowed; must all write the same value; one (randomly selected) write succeeds; etc.
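A minimal sketch (my own illustration; the function and policy labels are hypothetical, though they correspond to the standard exclusive-write and concurrent-write PRAM variants) of resolving simultaneous writes to one shared-memory cell under the differing approaches listed above.

import random

def resolve_writes(pending, policy):
    """Resolve simultaneous writes (a list of (processor, value) pairs) to one cell."""
    if policy == "exclusive":                  # concurrent writes disallowed
        if len(pending) > 1:
            raise RuntimeError("concurrent write not permitted")
        return pending[0][1]
    if policy == "common":                     # all writers must write the same value
        values = {v for _, v in pending}
        if len(values) > 1:
            raise RuntimeError("writers disagree")
        return values.pop()
    if policy == "arbitrary":                  # one (randomly selected) writer succeeds
        return random.choice(pending)[1]
    raise ValueError(policy)

# Processors 0 and 1 both write to the same location in the same lock-step cycle:
print(resolve_writes([(0, 42), (1, 42)], "common"))     # 42
print(resolve_writes([(0, 42), (1, 99)], "arbitrary"))  # 42 or 99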
Problem
- PRAM does not adequately model memory behaviour
  - Assumes all memory accesses take unit time
  - Overhead of enforcing consistency grows with the number of processors
CTA – Candidate Type Architecture
- Distinguishes between local and non-local memory accesses
- Multiple processors connected by some form of “network”
CTA
[Figure: processors P0, P1, P2, …, Pm connected by an interconnection network; each node comprises a processor, memory and a NIC, with a small fixed number of network connections (1 <= n <= 6)]
CTA
Data references can be:
- Local (unit cost)
- Non-local (cost λ, the non-local memory latency – a multiple of the local cost)
Models for non-local access:
- Shared memory – high hardware cost, poor scalability
- 1-sided communication – one processor “gets” and “puts” non-local data; requires synchronisation
- Message passing – explicit “send” and “receive” required
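A minimal sketch (my own illustration; the value of λ and the reference counts are made up) of how the CTA cost model charges local versus non-local references, so the same algorithm can be compared under different data placements.

def cta_cost(local_refs, nonlocal_refs, lam=100):
    """Estimated memory-access cost under the CTA model.

    Local references cost one unit each; non-local references cost
    lam units each (an assumed non-local latency of 100x local cost)."""
    return local_refs * 1 + nonlocal_refs * lam

# Same total number of references, different data placements:
print(cta_cost(local_refs=10_000, nonlocal_refs=0))     # all data local
print(cta_cost(local_refs=9_000, nonlocal_refs=1_000))  # 10% non-local dominates the cost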
Processor Topologies
Criteria to measure effectiveness in implementing parallel algorithms:
- Diameter of the network = largest distance between two nodes
- Bisection width = minimum number of edges that must be removed to split the network in two
- Number of edges per node
- Maximum edge length
Processor Topologies
Ideal organisation:
- Low diameter – gives a lower bound on complexity for algorithms that require communication between arbitrary nodes
- High bisection width – in algorithms with large amounts of data movement, the size of the data divided by the bisection width puts a lower bound on complexity
- Number of edges per node constant, independent of network size – scalability
- Maximum edge length constant – scalability
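A minimal sketch (my own illustration; brute-force, so only practical for tiny networks) that computes two of the criteria above – diameter and bisection width – for a small graph given as an adjacency list, here a 2 × 2 mesh standing in for any of the topologies that follow.

from collections import deque
from itertools import combinations

def diameter(adj):
    """Largest shortest-path distance between any two nodes (BFS from every node)."""
    best = 0
    for src in adj:
        dist = {src: 0}
        q = deque([src])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        best = max(best, max(dist.values()))
    return best

def bisection_width(adj):
    """Minimum number of edges crossing any split of the nodes into two equal halves.

    Brute force over all balanced partitions, so only feasible for small graphs."""
    nodes = list(adj)
    best = None
    for half in combinations(nodes, len(nodes) // 2):
        half = set(half)
        crossing = sum(1 for u in half for v in adj[u] if v not in half)
        best = crossing if best is None else min(best, crossing)
    return best

# A 2 x 2 mesh: node 0 adjacent to 1 and 2, node 3 adjacent to 1 and 2.
mesh = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
print(diameter(mesh), bisection_width(mesh))   # 2 2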
Processor Topologies
- Mesh
- Pyramid
- Shuffle-Exchange
- Butterfly
- Hypercube / Cube-connected
- Cube-connected Cycles
- Others: Binary Tree; Hypertree; de Bruijn Network; minimum path
Simple 2-D Mesh
Wrap-around Mesh
Toroidal Wrap-around Mesh
Pyramid
- Attempt to combine the advantages of mesh networks and tree networks
- A pyramid of size p is a 4-ary tree of height log₄ p
Shuffle-Exchange Network
- Solid arrows = shuffle connections; dashed arrows = exchange connections
- Used for Discrete Fourier Transforms and sorting bitonic sequences
- Necklace of i = the nodes which a data item (starting at position i) traverses in response to shuffles
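A minimal sketch (my own illustration; the slides themselves give no formulas) of the two connection types for n = 2^k nodes: the shuffle sends node i to the left cyclic rotation of its k-bit address (equivalently 2i mod (n − 1), with nodes 0 and n − 1 fixed), the exchange flips the lowest address bit, and repeatedly following shuffles from node i traces out its necklace.

def shuffle(i, k):
    """Perfect shuffle: left cyclic rotation of the k-bit address of node i."""
    n = 1 << k
    return ((i << 1) | (i >> (k - 1))) & (n - 1)

def exchange(i):
    """Exchange connection: flip the least significant bit of the address."""
    return i ^ 1

def necklace(i, k):
    """Nodes visited by a data item starting at node i under repeated shuffles."""
    seen, j = [], i
    while j not in seen:
        seen.append(j)
        j = shuffle(j, k)
    return seen

k = 3                             # an 8-node shuffle-exchange network
for i in range(1 << k):
    print(i, "shuffle ->", shuffle(i, k), "exchange ->", exchange(i), "necklace:", necklace(i, k))
# e.g. the necklace of 1 is [1, 2, 4]; nodes 0 and 7 each form a necklace of length 1.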
Butterfly
Hypercube
Cube Connected Cycles