Chapter 2 Program and Network Properties


1 Chapter 2 Program and Network Properties

2 Conditions of parallelism
Data and resource dependences: The ability to execute several program segments in parallel requires each segment to be independent of the other segments. Dependence graphs are used to describe these relationships. The nodes of the graph correspond to program statements (instructions), and the directed edges with different labels show the relations among the statements. Analysis of the dependence graph shows where opportunities exist for parallelization.

3 Data dependence Flow dependence: a statement S2 is flow dependent on statement S1 if an execution path exists from S1 to S2 and at least one output of S1 feeds in as input to S2. Antidependence: statement S2 is antidependent on statement S1 if S2 follows S1 in program order and the output of S2 overlaps the input to S1. Output dependence: two statements are output dependent if they produce (write) the same output variable. I/O dependence: read and write are I/O statements. I/O dependence occurs not because the same variable is involved but because the same file is referenced by both I/O statements.
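
The four dependence types can be illustrated with a small, hypothetical C fragment (the variable names and the file are made up purely for illustration):

#include <stdio.h>

void data_dependences(void) {
    int a, b = 2, c = 3, d;
    a = b + c;              /* S1: writes a                                   */
    d = a * 2;              /* S2: reads a  -> S2 is flow dependent on S1     */
    b = 7;                  /* S3: writes b, which S1 read -> antidependence  */
    a = d - 1;              /* S4: writes a again -> output dependent on S1   */
    (void)a; (void)b;
}

void io_dependence(FILE *f) {
    int v = 0;
    (void)fscanf(f, "%d", &v);  /* S5: reads file f                           */
    fprintf(f, "%d\n", v);      /* S6: writes the same file -> I/O dependence,*/
                                /*     even though no variable is shared      */
}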

4

5 Example
S1: load R1, A
S2: add R2, R1
S3: move R1, R3
S4: store B, R1

S1: read (4), A(i)    (read array A from tape unit 4)
S2: rewind (4)
S3: write (4), B(i)   (write array B onto tape unit 4)
S4: rewind (4)

6

7 Control dependence: this refers to the situation where the order of execution of statements cannot be determined before run time, e.g. a conditional (if) statement. Control dependence often prohibits parallelism.
for (i = 1; i <= n; i++) {
    if (x[i - 1] == 0)
        x[i] = 0;
    else
        x[i] = 1;
}

8 Resource dependence: deals with conflicts in using shared resources.
When the conflicting resource is an ALU, it is called ALU dependence; for example, shared floating-point units or registers give rise to ALU dependence. When the conflicting resource is a memory (storage) location, it is called storage dependence.

9 Bernstein’s Conditions
In 1966, Bernstein derived a set of conditions under which two processes can execute in parallel. Input set: all input variables needed to execute the process. Output set: all output variables generated after the execution of the process.

10 To show the operation of Bernstein's conditions, consider the following instructions of a sequential program:
I1: x = (a + b) / (a * b)
I2: y = (b + c) * d
I3: z = x² + (a * e)
Now the read sets and write sets of I1, I2 and I3 are as follows:
R1 = {a, b}      W1 = {x}
R2 = {b, c, d}   W2 = {y}
R3 = {x, a, e}   W3 = {z}

11 Now let us find out whether I1 and I2 can execute in parallel.
R1 ∩ W2 = φ, R2 ∩ W1 = φ, W1 ∩ W2 = φ. That means I1 and I2 are independent of each other. Similarly, for I1 || I3: R1 ∩ W3 = φ, R3 ∩ W1 ≠ φ, W1 ∩ W3 = φ.

12 Hence I1 and I3 are not independent of each other.
For I2 || I3: R2 ∩ W3 = φ, R3 ∩ W2 = φ, W3 ∩ W2 = φ. Hence I2 and I3 are independent of each other. Thus I1 and I2, and I2 and I3, are parallelizable, but I1 and I3 are not.

13 Consider two processes P1 and P2 having I1 and I2 as input sets and O1 and O2 as output sets.
These two processes can execute in parallel if they satisfy the following conditions:
I1 ∩ O2 = φ (anti-independent)
I2 ∩ O1 = φ (flow independent)
O1 ∩ O2 = φ (output independent)

14 The input set is also called the read set or the domain of the process.
The output set is also called the write set or the range of the process. As an example, consider the following five statements:
P1: C = D * E
P2: M = G + C
P3: A = B + C
P4: C = L + M
P5: F = G / E
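
A minimal sketch (not from the slides) of how Bernstein's conditions can be checked mechanically for P1 through P5: each variable is mapped to one bit, so the read and write sets become bitmasks and set intersection becomes a bitwise AND.

#include <stdio.h>

enum { A = 1<<0, B = 1<<1, C = 1<<2, D = 1<<3, E = 1<<4,
       F = 1<<5, G = 1<<6, L = 1<<7, M = 1<<8 };

typedef struct { const char *name; unsigned in, out; } Proc;

static const Proc p[] = {
    { "P1", D | E, C },   /* C = D * E */
    { "P2", G | C, M },   /* M = G + C */
    { "P3", B | C, A },   /* A = B + C */
    { "P4", L | M, C },   /* C = L + M */
    { "P5", G | E, F },   /* F = G / E */
};

/* Bernstein: Pi || Pj iff Ii ∩ Oj, Ij ∩ Oi and Oi ∩ Oj are all empty. */
static int parallel(const Proc *x, const Proc *y) {
    return !(x->in & y->out) && !(y->in & x->out) && !(x->out & y->out);
}

int main(void) {
    for (int i = 0; i < 5; i++)
        for (int j = i + 1; j < 5; j++)
            if (parallel(&p[i], &p[j]))
                printf("%s || %s\n", p[i].name, p[j].name);
    return 0;   /* prints the five parallelizable pairs listed on a later slide */
}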

15

16 Only five pairs can execute in parallel:
P1-P5, P2-P3, P2-P5, P3-P5 and P4-P5. The parallelism relation is commutative: P1 || P2 implies P2 || P1. But it is not transitive: P1 || P2 and P2 || P3 do not necessarily guarantee P1 || P3.

17 Hardware and software parallelism
For the implementation of parallelism we need special hardware and software support, and there is often a mismatch problem between the two.

18 Hardware parallelism: this refers to the type of parallelism defined by the machine architecture and hardware multiplicity. Hardware parallelism is often a function of cost and performance tradeoffs. It displays the resource utilization patterns of simultaneously executable operations.

19 One way to characterize the parallelism in a processor is by the number of instruction issues per machine cycle. If a processor issues k instructions per machine cycle, it is called a k-issue processor. A conventional processor takes one or more machine cycles to issue a single instruction; such processors are called one-issue machines. For example, the Intel i960CA is a three-issue processor: one arithmetic, one memory-access and one branch instruction can be issued per cycle.

20 Software parallelism: this type of parallelism is defined by the control and data dependences of the program.

21 Mismatch example. Software parallelism: the program has eight instructions, four loads and four arithmetic operations, which could complete in three cycles, so the software parallelism is 8/3 = 2.67 instructions per cycle. Hardware parallelism: a 2-issue processor that can execute only one load (memory access) and one arithmetic operation simultaneously needs seven cycles, so the hardware parallelism is 8/7 = 1.14 instructions per cycle. Using a dual-processor system, the eight instructions take six cycles, giving 8/6 = 1.33 instructions per cycle.

22

23 Of the many types of software parallelism, the two most important are control parallelism and data parallelism. Control parallelism allows two or more operations to be performed in parallel, e.g. pipelined operations. Data parallelism is parallelism in which almost the same operation is performed over many data elements by many processors simultaneously.

24 To solve the mismatch problem
To solve the problem of hardware and software mismatch, one approach is to develop compilation support; the other is hardware redesign for more efficient exploitation by an intelligent compiler. Ideally one designs the compiler and the hardware jointly, since interaction between the two can lead to a better solution to the mismatch problem. Hardware and software design tradeoffs also exist in terms of cost, complexity, expandability, compatibility and performance.

25

26 Program partitioning and scheduling
Grain size or granularity is a measure of the amount of computation involved in a software process. The simplest measure is to count the number of instructions in a grain (program segment). Grain size determines the basic program segment chosen for parallel processing. Grain sizes are commonly described as fine, medium or coarse, depending on the processing level involved.

27 Latency is a time measure of the communication overhead incurred between machine subsystems.
For example, memory latency is the time required by a processor to access memory, and synchronization latency is the time required for two processors to synchronize with each other.

28

29 Levels of parallelism. Instruction level: a typical grain contains fewer than 20 instructions, called fine grain; it is easy to detect parallelism at this level. Loop level: here the grain size is fewer than 500 instructions. Procedure level: this corresponds to medium grain and contains fewer than 2000 instructions; detection of parallelism is more difficult at this level than at the fine-grain level.

30 Sub-program level: the number of instructions ranges into the thousands, forming coarse grain. Job level: this corresponds to the parallel execution of essentially independent jobs (programs); the grain size can be as high as tens of thousands of instructions in a single program.

31 Kruatrachue algorithm
Each node is represented by a pair (n, s), where n is the node name and s is the grain size. Each edge is labelled (v, d), where v is the output variable passed along the edge and d is the communication delay.

32

33

34

35 Node duplication Algorithm

36 Program Flow mechanisms
Conventional computers are based on a control-flow mechanism by which the order of program execution is explicitly stated in the user program. Data flow computers are based on a data-driven mechanism which allows the execution of any instruction to be driven by data availability. Reduction computers are based on a demand-driven mechanism which initiates an operation based on the demand for its results by other computations.

37 Control flow computers
Conventional von Neumann computers use a program counter to sequence the execution of instructions in a program, and use shared memory to hold program instructions and data. In data flow computers, an instruction is executed as soon as its operands are available; data go directly to the instructions that need them, and computational results (data tokens) are passed directly between instructions.

38 The data generated by an instruction is duplicated into many copies and forwarded directly to all needy instructions. Data tokens, once consumed by an instruction, are no longer available for reuse. A data flow computer requires neither shared memory nor a program counter; it requires only special mechanisms to detect data availability and to match data tokens with needy instructions.

39 A data flow architecture
Arvind and his associates at MIT have developed a tagged-token architecture for building data flow computers. The global architecture consists of n processing elements (PEs) interconnected by an n*n routing network. Within each PE, the machine provides a token-matching mechanism which dispatches only those instructions whose input data (tokens) are already available.

40

41 Instructions are stored in the program memory.
Each datum is tagged with the address of the instruction to which it belongs. Tagged tokens enter the PE through the local path, and it is the machine's job to match up data with the same tag and route them to needy instructions. Each instruction represents a synchronization operation.

42 Another synchronization mechanism, called the I-structure, is provided within each PE. The I-structure is a tagged memory unit for the overlapped usage of a data structure by both the producer and the consumer processes. Each word of the I-structure uses a 2-bit tag indicating whether the word is empty, is full, or has pending read requests.
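
A rough C sketch (an assumed layout, not the actual MIT hardware) of an I-structure word with a 2-bit presence tag; a read of an empty word is deferred until the producer writes the value:

#include <stdio.h>

typedef enum { EMPTY, FULL, PENDING_READ } Tag;   /* the 2-bit presence tag */

typedef struct { Tag tag; int value; } IWord;

int i_read(IWord *w, int *out) {       /* consumer: succeeds only if FULL   */
    if (w->tag == FULL) { *out = w->value; return 1; }
    w->tag = PENDING_READ;             /* remember that a reader is waiting */
    return 0;
}

void i_write(IWord *w, int v) {        /* producer: the write also          */
    w->value = v;                      /* satisfies any pending read        */
    w->tag = FULL;
}

int main(void) {
    IWord w = { EMPTY, 0 };
    int v;
    printf("read before write: %s\n", i_read(&w, &v) ? "full" : "deferred");
    i_write(&w, 42);
    if (i_read(&w, &v)) printf("read after write: %d\n", v);
    return 0;
}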

43 Demand driven mechanisms
The computation is triggered by the demand for an operation's result, e.g. a = ((b + 1) * c - (d / e)). A data-driven computation takes a bottom-up approach, starting from the innermost operations. Such computations are also called eager evaluation because the operations are carried out immediately after all their operands become available.

44 A demand-driven computation takes a top-down approach.
The operations start only when the value of a is demanded. This is also called lazy evaluation, because the operations are executed only when their results are required by another instruction.
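
A simplified C sketch (values chosen arbitrarily) contrasting the two evaluation orders for a = ((b + 1) * c) - (d / e): the data-driven version evaluates the innermost operations as soon as their operands exist, while the demand-driven version wraps the innermost subexpressions in thunks that are computed only when a is actually demanded.

#include <stdio.h>

static double b = 3, c = 4, d = 8, e = 2;

/* A thunk: a cached computation that runs only when forced. */
typedef struct Thunk { int done; double value; double (*compute)(void); } Thunk;

static double force(Thunk *t) {
    if (!t->done) { t->value = t->compute(); t->done = 1; }
    return t->value;
}

static double calc_sum(void) { return b + 1; }   /* innermost operations */
static double calc_div(void) { return d / e; }
static Thunk sum_t = { 0, 0, calc_sum };
static Thunk div_t = { 0, 0, calc_div };

int main(void) {
    /* Data-driven (eager): bottom-up, operands consumed as soon as available. */
    double eager = (b + 1) * c - d / e;

    /* Demand-driven (lazy): nothing runs until a is demanded here; the demand
       propagates top-down into the thunks. */
    double lazy = force(&sum_t) * c - force(&div_t);

    printf("eager = %g, lazy = %g\n", eager, lazy);   /* both print 12 */
    return 0;
}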

45 System interconnect network
Static and dynamic networks are used for interconnecting computer subsystems or for constructing multiprocessors and multicomputers. Static networks are formed of point-to-point direct connections which do not change during program execution. Dynamic networks are implemented with switched channels which are dynamically configured to match the communication demand.

46 Network properties and Routing
Node degree: the number of edges incident on a node; the in-degree plus the out-degree gives the total node degree. The node degree reflects the number of I/O ports required per node, and thus the cost of the node. Diameter D: the maximum shortest path between any two nodes, where path length is measured by the number of links traversed. Network size: the total number of nodes.

47 Bisection width b: the minimum number of edges along a cut that divides the network into two equal halves. In a communication network each edge corresponds to a channel with w bit wires, so the wire bisection width is B = bw; B reflects the wiring density of the network.

48 Data routing functions
A data routing network is used for inter-PE data exchange. The routing network can be static, such as the hypercube routing network used in the TMC CM-2, or dynamic, such as the multistage network used in the IBM GF11. Commonly seen data routing functions among PEs include shifting, rotation, permutation (one to one), broadcast, multicast, personalized communication, shuffle, etc. These routing functions can be implemented on ring, mesh, hypercube and other networks.

49 Permutations: for n objects there are n! permutations by which the n objects can be reordered, and the set of all permutations forms a permutation group. E.g. the permutation π = (a,b,c)(d,e) maps, in circular fashion, a to b, b to c, c to a, d to e and e to d. The cycle (a,b,c) has a period of 3 and (d,e) has a period of 2, so the period of the whole permutation is 2 * 3 = 6. Permutations can be implemented with a crossbar switch or a multistage network.
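
A small C sketch computing the period of this example. In general the period of a permutation is the least common multiple of its cycle lengths, which here is lcm(3, 2) = 6 and coincides with the product because 3 and 2 are coprime. The array pi[] encodes (a,b,c)(d,e) with a..e mapped to indices 0..4.

#include <stdio.h>

static long gcd(long x, long y) { return y ? gcd(y, x % y) : x; }
static long lcm(long x, long y) { return x / gcd(x, y) * y; }

int main(void) {
    int pi[5] = { 1, 2, 0, 4, 3 };      /* image of each element             */
    int seen[5] = { 0 };
    long period = 1;

    for (int i = 0; i < 5; i++) {
        if (seen[i]) continue;
        int len = 0;                    /* walk one cycle, count its length  */
        for (int j = i; !seen[j]; j = pi[j]) { seen[j] = 1; len++; }
        period = lcm(period, len);
    }
    printf("period = %ld\n", period);   /* 6 for the (a,b,c)(d,e) example    */
    return 0;
}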

50 Perfect Shuffle and Exchange
The perfect shuffle is obtained by shifting the node address 1 bit to the left and wrapping the most significant bit around to the least significant position. Hypercube routing functions: for a 3-cube, three routing functions are defined, one along each dimension.
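
A small C sketch (node count chosen for illustration) of these routing functions on n-bit node addresses: shuffle rotates the address one bit to the left, exchange complements the lowest bit, and the hypercube routing function for dimension i complements bit i of the address.

#include <stdio.h>

unsigned shuffle(unsigned x, unsigned n) {            /* rotate left by 1 bit */
    unsigned msb = (x >> (n - 1)) & 1u;               /* bit that wraps round */
    return ((x << 1) | msb) & ((1u << n) - 1u);       /* keep only n bits     */
}

unsigned exchange(unsigned x) { return x ^ 1u; }      /* complement bit 0     */

unsigned cube(unsigned x, unsigned i) { return x ^ (1u << i); } /* dimension i */

int main(void) {
    unsigned n = 3;                                   /* N = 2^3 = 8 nodes    */
    for (unsigned i = 0; i < (1u << n); i++)
        printf("%u: shuffle %u, exchange %u, cube2 %u\n",
               i, shuffle(i, n), exchange(i), cube(i, 2));
    return 0;
}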

51

52

53 Network performance parameters
Functionality: how the network supports data routing, interrupt handling, synchronization, etc. Network latency: the worst-case time delay for a unit message to be transferred through the network. Bandwidth: the maximum data transfer rate, in Mbytes/s, through the network. Hardware complexity: the implementation costs, such as those of wires, switches, etc. Scalability: the ability of a network to be expanded with increasing machine resources.

54 Static Connection Network
Static networks use direct links which are fixed once built. This type of network is more suitable for computers where the communication pattern is fixed or predictable.

55 Linear array: a 1-D network in which N nodes are connected by N-1 links.
Internal nodes have degree 2 and the two end nodes have degree 1. The diameter is N-1 and the bisection width is b = 1.

56 Ring and chordal ring: a ring can be unidirectional or bidirectional.
N nodes are connected by N links and every node has degree 2. The diameter is N/2 for a bidirectional ring and N for a unidirectional ring. The bisection width is b = 2. By increasing the node degree from 2 to 3 or 4 we obtain a chordal ring. In general, the more links that are added, the higher the node degree and the shorter the network diameter.

57

58 Barrel shifter: e.g. N = 16 nodes. Network size N = 2^n, node degree d = 2n - 1, diameter D = n/2. Node i is connected to node j if |j - i| = 2^r (mod N) for some r = 0, 1, 2, ..., n-1.
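
A minimal C sketch (n = 4 chosen for illustration) listing the neighbours of one node in a barrel shifter; the links at distance +2^r and -2^r give the node degree 2n - 1, because the +2^(n-1) and -2^(n-1) links reach the same node.

#include <stdio.h>

int main(void) {
    unsigned n = 4, N = 1u << n, i = 0;           /* list neighbours of node 0 */
    for (unsigned r = 0; r < n; r++) {
        unsigned step = 1u << r;
        printf("node %u <-> node %u and node %u\n",
               i, (i + step) % N, (i + N - step) % N);
    }
    return 0;   /* +-1, +-2, +-4, +-8: the +8 and -8 links coincide, degree 7 */
}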

59 Tree: a k-level completely balanced binary tree has N = 2^k - 1 nodes; e.g. a 5-level tree has 31 nodes. The maximum node degree is 3 and the diameter is 2(k - 1). Star: a 2-level tree with a high node degree of d = N - 1 and a constant diameter of 2.

60

61 Fat tree: the channel width of a fat tree increases as we ascend from the leaves to the root. The conventional tree suffers from a bottleneck toward the root, since traffic toward the root becomes heavier. The fat-tree idea has been applied in the Connection Machine CM-5.

62 Mesh and torus: a 3*3 mesh network is shown. The mesh has been implemented in the Illiac IV, MPP, DAP and CM-2. A k-dimensional mesh with N = n^k nodes has an interior node degree of 2k and a network diameter of k(n - 1). In a 2-D mesh the node degree at boundary and corner nodes is 3 or 2 respectively. The Illiac network and the torus are variations of the mesh.
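
A short C sketch collecting the formulas from the last few slides for one small example (N = 16 nodes and a 5-level binary tree, both chosen for illustration); it simply evaluates the diameter and bisection-width expressions above.

#include <stdio.h>

int main(void) {
    int N = 16;                      /* nodes in the array, ring and mesh    */
    int n = 4;                       /* mesh side: n * n = N                 */
    int k = 5;                       /* levels of the balanced binary tree   */

    /* Linear array: N-1 links, diameter N-1, bisection width 1.             */
    printf("linear array: diameter %d, bisection 1\n", N - 1);

    /* Bidirectional ring: N links, diameter N/2, bisection width 2.         */
    printf("ring:         diameter %d, bisection 2\n", N / 2);

    /* 2-D mesh (k = 2 dimensions): interior degree 4, diameter 2(n-1).      */
    printf("2-D mesh:     diameter %d\n", 2 * (n - 1));

    /* k-level binary tree: N = 2^k - 1 nodes, diameter 2(k-1).              */
    printf("binary tree:  %d nodes, diameter %d\n", (1 << k) - 1, 2 * (k - 1));
    return 0;
}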

63

64 Cube connected cycles

65 Dynamic connection Networks
Buses

66 Switch modules: an a*b switch module has a inputs and b outputs. In practice, a and b are often chosen as integer powers of 2, that is, a = b = 2^k for some k.

67 Multistage interconnection Networks

68 Omega Switch

69

70

