
1 Parallel Computing
Erik Robbins

2 Limits on single-processor performance
 Over time, computers have become better and faster, but there are constraints to further improvement.
 Physical barriers
   Heat and electromagnetic interference limit chip transistor density.
   Signal propagation, and hence processor speed, is ultimately constrained by the speed of light.
 Economic barriers
   Cost will eventually rise beyond the price anyone is willing to pay.

3 Parallelism
 Improvement of processor performance by distributing the computational load among several processors.
 The processing elements can be diverse:
   A single computer with multiple processors
   Several networked computers

4 Drawbacks to Parallelism
 Adds cost.
 Imperfect speed-up:
   Given n processors, perfect speed-up would imply an n-fold increase in power.
   A small portion of a program which cannot be parallelized will limit the overall speed-up.
   "The bearing of a child takes nine months, no matter how many women are assigned." (Fred Brooks)

5 Amdahl's Law
 The speed-up achievable with N processors is given by the equation:
   S = 1 / ((1 - P) + P / N)
   S is the speed-up of the program (as a factor of its original sequential runtime)
   P is the fraction of the program that is parallelizable
   As N grows without bound, this reduces to S = 1 / (1 - P), the maximum possible speed-up.
 Web Applet
   http://www.cs.iastate.edu/~prabhu/Tutorial/CACHE/amdahl.html

6 Amdahl’s Law
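
To make the formula on slide 5 concrete, here is a minimal Python sketch (not part of the original slides; the function name and sample values are illustrative) that evaluates Amdahl's law for a few parallel fractions and processor counts:

```python
def amdahl_speedup(p, n):
    """Speed-up predicted by Amdahl's law.

    p: fraction of the program that can be parallelized (0 <= p <= 1)
    n: number of processors
    """
    return 1.0 / ((1.0 - p) + p / n)

# The serial fraction (1 - p) caps the speed-up no matter how many
# processors are added: with p = 0.90 the limit is 1 / (1 - 0.90) = 10x.
for p in (0.5, 0.9, 0.99):
    for n in (2, 8, 64, 1024):
        print(f"P = {p:.2f}, {n:5d} processors -> speed-up {amdahl_speedup(p, n):6.2f}")
```

Even with 1,024 processors, a program that is only 90% parallelizable cannot exceed a 10x speed-up, which is the point of the quote on slide 4.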

7 History of Parallel Computing – Examples
 1954 – IBM 704
   Gene Amdahl was a principal architect.
   Used fully automatic floating-point arithmetic commands.
 1962 – Burroughs Corporation D825
   Four-processor computer.
 1967 – Amdahl and Daniel Slotnick publish a debate about the feasibility of parallel computing.
   Amdahl's Law coined.
 1969 – Honeywell Multics system
   Capable of running up to eight processors in parallel.
 1970s – Cray supercomputers (SIMD architecture)
 1984 – Synapse N+1
   First bus-connected multiprocessor with snooping caches.

8 History of Parallel Computing – Overview of Evolution
 1950s – Interest in parallel computing began.
 1960s & 70s – Advancements surfaced in the form of supercomputers.
 Mid-1980s – Massively parallel processors (MPPs) came to dominate the top end of computing.
 Late-1980s – Clusters (a type of parallel computer built from large numbers of computers connected by a network) competed with and eventually displaced MPPs.
 Today – Parallel computing has become mainstream, based on multi-core processors in home computers. Scaling of Moore's Law predicts a transition from a few cores to many.

9 Multiprocessor Architectures
 Instruction Level Parallelism (ILP)
   Superscalar and VLIW
 SIMD Architectures (single instruction stream, multiple data streams)
   Vector Processors
 MIMD Architectures (multiple instruction streams, multiple data streams)
   Interconnection Networks
   Shared Memory Multiprocessors
   Distributed Computing
 Alternative Parallel Processing Approaches
   Dataflow Computing
   Neural Networks (SIMD)
   Systolic Arrays (SIMD)
   Quantum Computing

10 Superscalar
 A design methodology that allows multiple instructions to be executed simultaneously in each clock cycle.
 Analogous to adding another lane to a highway; the "additional lanes" are called execution units.
 Instruction Fetch Unit
   A critical component.
   Retrieves multiple instructions simultaneously from memory and passes them to the...
 Decoding Unit
   Determines whether the instructions have any type of dependency.

11 VLIW
 Superscalar processors rely on both hardware and the compiler.
 VLIW processors rely entirely on the compiler.
   They pack independent instructions into one long instruction which tells the execution units what to do (see the sketch below).
 The compiler cannot have an overall picture of the run-time code.
   It is compelled to be conservative in its scheduling.
 The VLIW compiler also arbitrates all dependencies.
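
To make "packing independent instructions" concrete, here is a small, hypothetical Python sketch (not from the original slides; real VLIW compilers model latencies, do speculation, and much more). It greedily groups instructions into bundles, starting a new bundle whenever an instruction has a register dependency on something already in the current bundle:

```python
def pack_bundles(instructions):
    """Greedily pack instructions into VLIW-style bundles.

    Each instruction is a (dest_register, source_registers) pair.
    An instruction joins the current bundle only if it has no
    read-after-write, write-after-read, or write-after-write
    dependency on an instruction already in that bundle.
    """
    bundles, current, written, read = [], [], set(), set()
    for dest, srcs in instructions:
        depends = (any(s in written for s in srcs)   # RAW
                   or dest in written                # WAW
                   or dest in read)                  # WAR
        if depends:                                  # close the bundle, start a new one
            bundles.append(current)
            current, written, read = [], set(), set()
        current.append((dest, srcs))
        written.add(dest)
        read.update(srcs)
    if current:
        bundles.append(current)
    return bundles

program = [("r1", ("r2", "r3")),   # r1 = r2 op r3
           ("r4", ("r5", "r6")),   # independent -> same bundle
           ("r7", ("r1", "r4"))]   # uses r1 and r4 -> new bundle
print(pack_bundles(program))       # two bundles
```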

12 Vector Processors
 Often referred to as supercomputers (the Cray series being the most famous).
 Based on vector arithmetic.
   A vector is a fixed-length, one-dimensional array of values, or an ordered series of scalar quantities.
   Operations include addition, subtraction, and multiplication.
 Each instruction specifies a set of operations to be carried out over an entire vector.
 Vector registers – specialized registers that can hold several vector elements at one time.
 Vector instructions are efficient for two reasons (illustrated below):
   The machine fetches fewer instructions.
   The processor knows it will have a continuous source of data and can pre-fetch pairs of values.
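
The efficiency argument above can be seen in software with NumPy, whose array operations behave like a vector instruction: one operation is issued over an entire vector instead of one scalar at a time. A minimal sketch (array size and variable names are arbitrary):

```python
import numpy as np

n = 100_000
a = np.arange(n, dtype=np.float64)
b = np.ones(n, dtype=np.float64)

# Scalar style: one add is issued per loop iteration, one element at a time.
c_scalar = np.empty_like(a)
for i in range(n):
    c_scalar[i] = a[i] + b[i]

# Vector style: a single operation is applied to the whole array, the way
# one vector instruction operates on an entire vector register.
c_vector = a + b

assert np.array_equal(c_scalar, c_vector)
print(c_vector[:5])    # [1. 2. 3. 4. 5.]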

13 MIMD Architectures
 Communication is essential for synchronized processing and data sharing.
 The manner of passing messages determines the overall design.
 Two approaches:
   Shared Memory – one large memory accessed identically by all processors.
   Interconnection Network – each processor has its own memory, but processors are allowed to access each other's memories via the network.
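
The two MIMD communication styles can be sketched with Python's multiprocessing module: a shared counter versus messages over a queue. This is only a software-level analogy for the hardware organizations above, and the worker names are invented for illustration:

```python
from multiprocessing import Process, Queue, Value

def shared_memory_worker(counter):
    # Shared-memory style: every worker updates the same memory location.
    with counter.get_lock():
        counter.value += 1

def message_passing_worker(rank, queue):
    # Message-passing style: each worker keeps private state and communicates
    # only by sending messages.
    queue.put(f"hello from worker {rank}")

if __name__ == "__main__":
    # Shared memory: four processes increment one shared counter.
    counter = Value("i", 0)
    workers = [Process(target=shared_memory_worker, args=(counter,)) for _ in range(4)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    print("shared counter:", counter.value)    # 4

    # Message passing: four processes send messages over a queue.
    queue = Queue()
    workers = [Process(target=message_passing_worker, args=(r, queue)) for r in range(4)]
    for w in workers:
        w.start()
    for _ in range(4):
        print(queue.get())                     # four greetings, in any order
    for w in workers:
        w.join()
```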

14 Interconnection Networks
 Categorized according to topology, routing strategy, and switching technique.
 Networks can be either static or dynamic, and either blocking or non-blocking.
   Dynamic – allows the path between two entities (two processors, or a processor and a memory) to change between communications; static networks keep paths fixed.
   Blocking – does not allow new connections in the presence of other simultaneous connections.

15 Network Topologies
 The way in which the components are interconnected.
 A major determining factor in the overhead of message passing.
 Efficiency is limited by:
   Bandwidth – the information-carrying capacity of the network
   Message latency – the time required for the first bit of a message to reach its destination
   Transport latency – the time a message spends in the network
   Overhead – message-processing activities in the sender and receiver

16 Static Topologies
 Completely Connected – all components are connected to all other components.
   Expensive to build and difficult to manage.
 Star – has a central hub through which all messages must pass.
   Excellent connectivity, but the hub can be a bottleneck.
 Linear Array or Ring – each entity can communicate directly with its two neighbors.
   Other communications have to go through multiple entities.
 Mesh – links each entity to four (2-D) or six (3-D) neighbors.
 Tree – arranges entities in tree structures.
   Potential for bottlenecks at the root.
 Hypercube – multidimensional extension of mesh networks in which each dimension has two processors.

17 Static Topologies
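
As a worked example of the topologies on slide 16, this small sketch (illustrative, not from the slides) compares the link count and worst-case hop distance (diameter) of a ring and a hypercube with the same number of entities; in a d-dimensional hypercube of 2^d nodes, each node has d neighbors and the diameter is d:

```python
def ring_stats(n):
    """Links and diameter (worst-case hop count) of an n-node ring."""
    return {"nodes": n, "links": n, "diameter": n // 2}

def hypercube_stats(d):
    """Links and diameter of a d-dimensional hypercube with 2**d nodes.

    Each node connects to the d nodes whose binary labels differ in exactly
    one bit, giving d * 2**d / 2 links and a diameter of d hops.
    """
    n = 2 ** d
    return {"nodes": n, "links": d * n // 2, "diameter": d}

# Sixteen entities: a ring needs up to 8 hops, a 4-D hypercube only 4.
print(ring_stats(16))        # {'nodes': 16, 'links': 16, 'diameter': 8}
print(hypercube_stats(4))    # {'nodes': 16, 'links': 32, 'diameter': 4}
```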

18 Dynamic Topology
 Dynamic networks use either a bus or a switch to alter routes through the network.
 Bus-based networks are the simplest and most efficient when the number of entities is moderate.
   A bottleneck can result as the number of entities grows large.
   Parallel buses can alleviate bottlenecks, but at considerable cost.

19 Switches
 Crossbar Switches
   Are either open or closed.
   A crossbar network is a non-blocking network.
   With only one switch at each crosspoint, n entities require n^2 switches; in reality, many switches may be required at each crosspoint.
   Practical only in high-speed multiprocessor vector computers.
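
The quadratic cost mentioned on slide 19 is easy to tabulate; the short sketch below (illustrative) assumes one switch per crosspoint, as the slide does:

```python
# One switch per crosspoint: connecting n inputs to n outputs
# takes n * n switches, so cost grows quadratically with n.
for n in (4, 16, 64, 256):
    print(f"{n:4d} entities -> {n * n:6d} crosspoint switches")
```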

20 Switches
 2x2 Switches
   Capable of routing their inputs to different destinations.
   Two inputs and two outputs.
   Four states:
     Through (inputs feed directly through to the outputs)
     Cross (upper input directed to lower output and vice versa)
     Upper broadcast (upper input broadcast to both outputs)
     Lower broadcast (lower input broadcast to both outputs)
   The Through and Cross states are the ones relevant to interconnection networks.

21 2x2 Switches
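
The four switch states listed on slide 20 can be captured in a few lines of code. The sketch below (names are illustrative, not from the slides) models a single 2x2 switch as a function from its state and two inputs to its two outputs:

```python
from enum import Enum

class SwitchState(Enum):
    THROUGH = "through"            # in0 -> out0, in1 -> out1
    CROSS = "cross"                # in0 -> out1, in1 -> out0
    UPPER_BROADCAST = "upper"      # in0 -> both outputs
    LOWER_BROADCAST = "lower"      # in1 -> both outputs

def route(state, in0, in1):
    """Return (out0, out1) for a 2x2 switch in the given state."""
    if state is SwitchState.THROUGH:
        return in0, in1
    if state is SwitchState.CROSS:
        return in1, in0
    if state is SwitchState.UPPER_BROADCAST:
        return in0, in0
    return in1, in1                # LOWER_BROADCAST

print(route(SwitchState.THROUGH, "A", "B"))   # ('A', 'B')
print(route(SwitchState.CROSS, "A", "B"))     # ('B', 'A')
```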

22 Shared Memory Multiprocessors
 Tightly coupled systems that use the same memory.
 Three arrangements:
   Global Shared Memory – a single memory shared by multiple processors.
   Distributed Shared Memory – each processor has a local memory that is shared with the other processors.
   Global Shared Memory with a separate cache at each processor.

23 UMA Shared Memory
 Uniform Memory Access
   All memory accesses take the same amount of time.
   One pool of shared memory, and all processors have equal access.
 Scalability of UMA machines is limited. As the number of processors increases:
   Switched networks quickly become very expensive.
   Bus-based systems saturate when the bandwidth becomes insufficient.
   Multistage networks run into wiring constraints and significant latency.

24 NUMA Shared Memory
 Nonuniform Memory Access
   Provides each processor with its own piece of memory.
   Processors see this memory as a single contiguous addressable entity.
   Nearby memory takes less time to read than memory that is farther away, so memory access time is inconsistent.
 Prone to cache coherence problems.
   Each processor maintains a private cache.
   Modified data needs to be updated in all caches.
   Special hardware units known as snoopy cache controllers monitor the shared bus for writes.
     Write-through with update – updates stale values in other caches.
     Write-through with invalidation – removes stale values from other caches.
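
A toy model of write-through with invalidation may help. In the Python sketch below (class and method names are hypothetical, and real controllers work on cache blocks and bus transactions, not dictionaries), every write goes through to shared memory and invalidates other caches' copies, so a later read by another processor re-fetches the fresh value:

```python
class SnoopySystem:
    """Toy write-through-with-invalidation model: one shared memory,
    one private cache per processor, and a broadcast "bus" that every
    cache snoops for writes."""

    def __init__(self, num_cpus):
        self.memory = {}                                  # shared memory
        self.caches = [dict() for _ in range(num_cpus)]   # one private cache per CPU

    def read(self, cpu, addr):
        cache = self.caches[cpu]
        if addr not in cache:                  # miss: fetch from shared memory
            cache[addr] = self.memory.get(addr, 0)
        return cache[addr]

    def write(self, cpu, addr, value):
        self.memory[addr] = value              # write-through to shared memory
        self.caches[cpu][addr] = value
        for other, cache in enumerate(self.caches):
            if other != cpu:                   # snoop: drop stale copies elsewhere
                cache.pop(addr, None)

system = SnoopySystem(2)
system.read(1, 0x10)           # CPU 1 caches address 0x10 (value 0)
system.write(0, 0x10, 42)      # CPU 0 writes; CPU 1's cached copy is invalidated
print(system.read(1, 0x10))    # 42, re-fetched from shared memory
```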

25 Distributed Computing
 Means different things to different people.
   In a sense, all multiprocessor systems are distributed systems.
 The term is usually used to refer to a very loosely coupled multicomputer system.
   Such systems depend on a network for communication among processors.

26 Grid Computing
 An example of distributed computing.
 Uses the resources of many computers connected by a network (e.g., the Internet) to solve computational problems that are too large for any single supercomputer.
 Global Computing
   A specialized form of grid computing that uses the computing power of volunteers whose computers work on a problem while they would otherwise be idle.
   SETI@Home screen saver
     A six-year run accumulated two million years of CPU time and 50 TB of data.

27 Questions?

