1 Introduction to Parallel Processing with Multi-core Part III – Architecture
Jie Liu, Ph.D. Professor Department of Computer Science Western Oregon University USA

2 Part III outline
Three models of parallel computation
Processor organizations
Processor arrays
Multiprocessors
Multi-computers
Flynn's taxonomy
Affordable parallel computers
Algorithms with processor organizations

3 Processor Organization
In a parallel computer, processors need to "cooperate." To do so, a processor must be able to "reach" other processors. The method of connecting the processors in a parallel computer is called the processor organization. In a processor organization chart, vertices represent processors and edges represent communication paths.

4 Processor Organization Criteria
Diameter: the largest distance between two nodes. The lower the better, because it affects communication costs.
Bisection width: the minimum number of edges that must be removed to divide the network into two halves (within one node). The higher the better, because it affects the number of concurrent communication channels.
Number of edges per node: we consider this best when it is constant, i.e., independent of the number of processors, because it affects scalability.
Maximum edge length: again, we consider this best when it is constant, i.e., independent of the number of processors, because it affects scalability.
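As a quick worked example (ours, not from the slides): a ring of n processors has diameter about n/2, bisection width 2, and exactly 2 edges per node, so it scores well on the last two criteria but poorly on diameter, which grows linearly with n.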

5 Mesh Networks A mesh always has a dimension q, which could be 1, 2, 3, or even higher. Each interior node can communicate with 2q other processors.

6 Mesh Networks (2) For a q-dimensional mesh with n = k^q nodes (k nodes along each dimension, as shown)
Diameter: q(k - 1) (too large to support NC-class algorithms)
Bisection width: k^(q-1) (reasonable)
Maximum number of edges per node: 2q (constant – good)
Maximum edge length: constant (good)
Many parallel computers used this architecture because it is simple and scalable
The Intel Paragon XP/S used this architecture
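A quick worked check (the arithmetic here is ours, not the slide's): a 2-D mesh with q = 2 and k = 16 has n = 16^2 = 256 processors, diameter 2(16 - 1) = 30, and bisection width 16^(2-1) = 16.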

7 Hypertree Networks A hypertree of degree k = 4 and depth d = 2

8 Hypertree Networks For a hypertree of degree k and depth d (generally, we only consider the case k = 4)
Number of nodes: 2^d (2^(d+1) - 1)
Diameter: 2d (good for designing NC-class algorithms)
Bisection width: 2^(d+1) (reasonable)
Maximum number of edges per node: 6 (essentially constant)
Maximum edge length: changes depending on d
Only one parallel computer, Thinking Machines' CM-5 (Connection Machine), used this architecture
The designed maximum number of processors was 64K
The processors were vector processors capable of performing 32 pairs of arithmetic operations per clock cycle
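A worked check using the formulas above as reconstructed (the numbers are ours, not shown on the slide): for the degree-4 hypertree of depth d = 2 pictured on the previous slide, the number of nodes is 2^2 (2^3 - 1) = 28, the diameter is 2 x 2 = 4, and the bisection width is 2^3 = 8.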

9 Butterfly Network A butterfly network has (k + 1)2^k nodes; the one on the right has k = 3. In practice, rank 0 and rank k are combined, so each node has four connections. Each rank contains n = 2^k nodes. If n(i, j) is the jth node on the ith rank, then it connects to two nodes on rank i-1: n(i-1, j) and n(i-1, m), where m is the integer formed by inverting the ith most significant bit in the k-bit binary representation of j. For example, n(2, 3) is connected to n(1, 3) and n(1, 1), because 3 is 011 and inverting the second most significant bit makes it 001, which is 1.
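To make the wiring rule concrete, here is a small Python sketch (an illustration, not code from the presentation; butterfly_up_neighbors is a made-up helper name):

# Sketch of the butterfly wiring rule described above: node n(i, j) on rank i
# connects straight up to n(i-1, j) and across to n(i-1, m), where m flips the
# i-th most significant bit of j's k-bit label.

def butterfly_up_neighbors(i, j, k):
    """Return the two rank i-1 nodes that n(i, j) connects to (1 <= i <= k)."""
    straight = j                      # same column, one rank up
    flip_bit = k - i                  # position of the i-th most significant of k bits
    cross = j ^ (1 << flip_bit)       # invert that bit
    return (i - 1, straight), (i - 1, cross)

# Reproduces the slide's example for k = 3: n(2, 3) -> n(1, 3) and n(1, 1).
print(butterfly_up_neighbors(2, 3, 3))   # ((1, 3), (1, 1))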

10 Butterfly Network (2) For a butterfly network with (k + 1)2^k nodes
Diameter: 2k - 1 (good for designing NC-class algorithms)
Bisection width: (very good)
Maximum number of edges per node: 4 (constant)
Maximum edge length: changes depending on k
The network is also called an omega (Ω) network
A few computers used this connection network, including BBN's TC2000

11 Routing on Butterfly Network
To route a message from a node on rank 0 to a node on rank k, each switch node picks off the lead bit from the message. If the bit is zero, the message goes to the left; otherwise, it goes to the right. The chart shows routing from n(0, 2) to n(3, 5). Routing a message from rank k back to a node on rank 0 works the same way: each switch node picks off the lead bit and goes left on a zero, right on a one.
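The routing idea can be sketched in Python (a hedged illustration, not the presentation's code; it implements destination-tag routing, choosing the straight or cross edge at each rank so that the column label comes to match the destination, which corresponds to the left/right rule above):

# Destination-tag routing on a butterfly: starting at n(0, src), one bit of the
# destination label is consumed per rank, most significant bit first.

def butterfly_route(src, dst, k):
    """Return the (rank, column) switch nodes visited from n(0, src) to n(k, dst)."""
    path = [(0, src)]
    col = src
    for i in range(1, k + 1):
        bit = k - i                        # position of the i-th most significant bit
        if (col >> bit) & 1 != (dst >> bit) & 1:
            col ^= 1 << bit                # take the cross edge
        path.append((i, col))              # otherwise the straight edge was taken
    return path

# Reproduces the slide's example: routing from n(0, 2) to n(3, 5) when k = 3.
print(butterfly_route(2, 5, 3))            # [(0, 2), (1, 6), (2, 4), (3, 5)]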

12 Hypercube A hypercube is a butterfly in which each column of switch nodes is collapsed into a single node. A binary n-cube has 2^n processors and an equal number of switch nodes. The chart on the right shows a hypercube of degree 4. Two switch nodes are connected if their binary labels differ in exactly one bit position.
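A tiny Python sketch of this adjacency rule (illustrative only, not from the slides):

# Two hypercube nodes are connected exactly when their binary labels differ in
# one bit position, so a node's neighbors are obtained by flipping each bit in turn.

def hypercube_neighbors(label, k):
    """All neighbors of `label` in a k-dimensional hypercube."""
    return [label ^ (1 << b) for b in range(k)]

print(hypercube_neighbors(0b0101, 4))   # [4, 7, 1, 13]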

13 Hypercube Measures For a hypercube with n = 2^k nodes
Diameter: k (good for designing NC-class algorithms)
Bisection width: n/2 (very good)
Maximum number of edges per node: k
Maximum edge length: depends on the number of nodes
Routing: just find the differing bits and correct them one bit at a time, either from left to right or from right to left. For example, to go from 0100 to 1011 we can go 0100 → 1100 → 1000 → 1010 → 1011, or 0100 → 0101 → 0111 → 0011 → 1011
A company named nCUBE Corporation made machines of this structure with up to k = 13 (theoretically). The company was later bought by Oracle
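The bit-fixing routing described above, sketched in Python (illustrative, not the presentation's code; hypercube_route is a made-up helper name):

# Walk the differing bits of source and destination one at a time, here from
# the most significant bit to the least significant.

def hypercube_route(src, dst, k):
    """Return the node labels visited when correcting bits left to right."""
    path, cur = [src], src
    for bit in reversed(range(k)):           # most significant bit first
        if (cur ^ dst) & (1 << bit):         # this bit still differs
            cur ^= 1 << bit
            path.append(cur)
    return path

# Reproduces the slide's first route: 0100 -> 1100 -> 1000 -> 1010 -> 1011.
print([format(x, "04b") for x in hypercube_route(0b0100, 0b1011, 4)])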

14 Shuffle-Exchange Network
It has n = 2^k nodes numbered 0, 1, …, n - 1. It has two kinds of connections: shuffle and exchange. Exchange connections link two nodes whose numbers differ in their least significant bit. Shuffle connections link node i with the node whose k-bit binary label is a one-position left cyclic rotation of i's label (equivalently, node 2i mod (n - 1), with node n - 1 linked to itself). The slide shows a small example Shuffle-Exchange network.
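A minimal Python sketch of the two link types (illustrative, not from the slides; exchange and shuffle are hypothetical helper names):

# Shuffle-exchange connections for a network with n = 2**k nodes.

def exchange(i):
    """Exchange link: flip the least significant bit."""
    return i ^ 1

def shuffle(i, k):
    """Shuffle link: one-bit left cyclic rotation of i's k-bit label."""
    return ((i << 1) | (i >> (k - 1))) & ((1 << k) - 1)

# With k = 3 (8 nodes): node 5 (101) exchanges with 4 (100) and shuffles to 3 (011).
print(exchange(5), shuffle(5, 3))   # 4 3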

15 Shuffle-Exchange Network
For a Shuffle-Exchange Network with n = 2^k nodes
Diameter: 2k - 1 (good for designing NC-class algorithms)
Bisection width: approximately n/k (very good)
Maximum number of edges per node: 2
Maximum edge length: depends on the number of nodes
Routing is not easy. It is hard to build a real Shuffle-Exchange Network because the links cross each other. This architecture is studied mainly for its theoretical significance

16 Summary

17 Processor Array

18 Processor Array (2) Parallel computers that employ processor array technology can perform many arithmetic operations per clock cycle, achieved either with pipelined vector processors, such as the Cray-1, or with processor arrays, such as Thinking Machines' CM-200.
This type of parallel computer did not really survive because:
it is expensive ($$$$$) because of the special CPUs
it is hard to utilize all the processors
it cannot handle if-then-else types of statements well, because all the processors must carry out the same instruction
partitioning is very difficult
it really needs very large amounts of data, which makes adequate I/O nearly impossible

19 Multiprocessors Parallel computers with multiple CPUs and a shared memory space.
+ can use commodity CPUs, so costs are reasonable
+ support multiple users
+ different CPUs can execute different instructions
UMA – uniform memory access, also called symmetric multiprocessor (SMP) – all the processors access any memory address with the same cost
The Sequent Symmetry could have up to 32 Intel 80386 processors
All the CPUs share the same bus
The problem with SMP/UMA is that the number of processors is limited
NUMA – nonuniform memory access – a processor can access its own memory, though it is also accessible by other processors, much more cheaply than remote memory
Processors are connected through some connection network, such as a butterfly
Kendall Square Research machines supported over 1000 processors
The connection network costs too much, around 40% of the overall cost

20 UMA VS. NUMA

21 Cache Coherence Problem

22 Multicomputers Parallel computers with multiple CPUs and NO shared memory. Processors interact through message passing.
+ all the advantages of multiprocessors, and it is possible to have a large number of CPUs
- message passing is hard to implement and takes a lot of time to carry out
The first generation of message passing was store-and-forward, where a processor receives the complete message and then forwards it to the next processor (iPSC and nCUBE/10)
The second generation of message passing was circuit-switched, where a path is first established at some cost, and subsequent messages then use the path without the start-up cost (iPSC/2 and nCUBE 2)
The cost of message passing has several components (a rough model is sketched below):
the startup time, which must be paid even if you send an empty message
the per-byte transfer cost
the cost of one floating point operation, listed for comparison
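A rough sketch of that cost model in Python (the shape of the formula is standard, but the startup and per-byte figures below are made-up placeholders, not values from the slides):

# Message-passing cost model matching the components listed above:
# a fixed startup time plus a per-byte transfer cost (both in seconds).
# The default figures are placeholders for illustration only.

def message_time(num_bytes, startup=100e-6, per_byte=10e-9):
    """Estimated time to send num_bytes: startup latency + per-byte cost."""
    return startup + num_bytes * per_byte

# Even an empty message pays the full startup cost.
print(message_time(0), message_time(1_000_000))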

23 Multicomputers – nCUBE
An nCUBE parallel computer has three parts: the front end, the back end, and the I/O subsystem.
The front end is a fully functioning computer
The back-end nodes, each a computer of its own, run a simple OS that supports message passing
Note that the capability of the front end stays the same regardless of the number of processors at the back end
The largest nCUBE can have 8K processors

24 Multicomputers – CM-5 Each node consists of a SPARC CPU, up to 32 MB of RAM, and four pipelined vector units, each rated at 32 MFLOPS. It can have up to 16K nodes, with a theoretical peak speed of 2 teraflops (in 1991).

25 Flynn’s Taxonomy
SISD – single-core PC
SIMD – processor array or CM-200
MISD – systolic array
MIMD – multi-core PC, nCUBE, Symmetry, CM-5, Paragon XP/S

                        Single Data    Multiple Data
Single-Instruction      SISD           SIMD
Multiple-Instruction    MISD           MIMD

26 Inexpensive “Parallel Computers”
Beowulf – PCs connected by a switch
NOW – workstations on an intranet
Multi-core – PCs with a few multicore CPUs

             # of nodes    Cost    Performance    Easy to program    Dedicated
Beowulf      Few to 100    OK                                        Yes
NOW          100s          none                                      No
Multi-core   Two to few    low     Low

27 Summation on Hypercube

28 Summation on Hypercube
for j (log p) -1 down to 0 do { for all where 0 <= i < p - 1 if ( i < ) // variable tmp on receives //value of sum from tmp <= [i ] sum sum = sum + tmp }

29 Summation on Hypercube (2) What could the code look like?

30 Summation on 2-D mesh –SIMD Code
The mesh has l x l processors, 1-based.

for i ← l - 1 down to 1 do                // push from right to left
{
    for all P[j, i], where 1 <= j <= l do     // only l processors are working
    {
        tmp <= [j, i+1] sum               // variable tmp on P[j, i] receives the value of sum from P[j, i+1]
        sum = sum + tmp
    }
}
for i ← l - 1 down to 1 do                // push from bottom up
{
    for all P[i, 1] do                        // really only two processors are working
    {
        tmp <= [i+1, 1] sum               // variable tmp on P[i, 1] receives the value of sum from P[i+1, 1]
        sum = sum + tmp
    }
}
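A hedged Python simulation of the mesh summation (again not the slides' SIMD pseudocode; it uses 0-based indexing instead of the slide's 1-based):

# Partial sums are pushed right-to-left along each row, then bottom-up along
# the first column, leaving the total at the top-left node.

def mesh_sum(grid):
    """Sum an l x l list of lists; the result accumulates at grid[0][0]."""
    l = len(grid)
    sums = [row[:] for row in grid]
    for i in reversed(range(l - 1)):          # push from right to left
        for j in range(l):
            sums[j][i] += sums[j][i + 1]
    for i in reversed(range(l - 1)):          # push from bottom up
        sums[i][0] += sums[i + 1][0]
    return sums[0][0]

print(mesh_sum([[1, 2], [3, 4]]))             # 10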

31 Summation on 2-D mesh

32 Summation on Shuffle-exchange
Shuffle-exchange SIMD code:

for j ← 0 to (log p) - 1 do
{
    for all Pi, where 0 <= i <= p - 1
    {
        Shuffle(sum) <= sum
        Exchange(tmp) <= sum
        sum = sum + tmp
    }
}
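A hedged Python simulation of the shuffle-exchange summation (not the slides' SIMD code; the shuffle link is modeled as a one-bit left rotation as defined on slide 14, and the exchange link as flipping the least significant bit):

import math

def shuffle_exchange_sum(values):
    """After log2(p) rounds every 'processor' holds the total."""
    p = len(values)
    k = int(math.log2(p))
    sums = list(values)
    rotate = lambda i: ((i << 1) | (i >> (k - 1))) & (p - 1)      # shuffle destination
    for _ in range(k):
        shuffled = [0] * p
        for i in range(p):
            shuffled[rotate(i)] = sums[i]                          # Shuffle(sum) <= sum
        tmp = [shuffled[i ^ 1] for i in range(p)]                  # Exchange(tmp) <= sum
        sums = [shuffled[i] + tmp[i] for i in range(p)]            # sum = sum + tmp
    return sums

print(shuffle_exchange_sum([1, 2, 3, 4, 5, 6, 7, 8]))   # [36, 36, 36, 36, 36, 36, 36, 36]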

