
1 Introduction Why parallel? Because a large amount of computation is required.

2 Example Real-time 3D imaging in surgery requires about 10^15 operations per second, while an ordinary computer performs about 10^7 operations per second. 10^15 / 10^7 ≈ 10^8 seconds; 10^8 / 10^5 ≈ 10^3 days ≈ 3 years.
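A quick back-of-the-envelope check of the slide's arithmetic (the operation rates are the slide's own estimates, and 10^5 seconds per day is the slide's rounding of 86,400):

```python
# Rates taken from the slide's example.
required_ops_per_sec = 10**15   # real-time 3D imaging for surgery
machine_ops_per_sec = 10**7     # a typical sequential computer

# One second's worth of imaging work, run on one machine:
seconds = required_ops_per_sec / machine_ops_per_sec   # 10^8 seconds
days = seconds / 10**5                                 # ~10^5 seconds per day
years = days / 365

print(seconds, days, round(years, 1))  # 1e8 seconds, 1e3 days, about 2.7 years
```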

3 Other fields that need massive computation Aircraft Testing New Drug Development Oil Exploration Modeling Fusion Reactors Economic Planning Cryptanalysis Managing Large Databases Astronomy Biomedical Analysis …

4 The speed of computers Speed of light: 3×10^8 m/s. Components placed too close together interfere with each other. Speed has doubled every 3 or 4 years. Unfortunately, it is evident that this trend will soon come to an end: a simple law of physics.

5 Parallelism A "human-wave" tactic, but with CPUs: throw many of them at the problem. Nowadays many computers contain more than one CPU.

6 Models of Computation Flynn's classification: SISD (Single Instruction, Single Data) MISD (Multiple Instruction, Single Data) SIMD (Single Instruction, Multiple Data) MIMD (Multiple Instruction, Multiple Data)

7 SISD Computers Sequential (serial) computation

8 Example Summation of n numbers: Sum = 0 For I = 1 to n Sum = Sum + A(I) Next Needs n-1 additions: O(n) time
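The sequential loop above can be written directly in Python (a straightforward transcription; the list `a` plays the role of the n numbers):

```python
def sequential_sum(a):
    """Sum n numbers one at a time, as on an SISD machine: O(n) time."""
    total = 0          # Sum = 0
    for x in a:        # For I = 1 to n
        total += x     #   Sum = Sum + A(I)
    return total

print(sequential_sum([3, 1, 4, 1, 5]))  # 14
```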

9 MISD Computers

10 Example Test whether a positive integer n is prime or not. n is prime: it has no divisors except 1 and n. n is composite: otherwise. Each processor tests one number between 1 and n as a candidate divisor. Each processor needs O(1) operations, so O(1) time.
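A sketch of the MISD idea in Python. On a real MISD machine the candidate divisors would all be tested simultaneously in O(1) time; here the "processors" are simulated one after another by the loop:

```python
def is_prime_misd(n):
    """Simulate the MISD primality test: 'processor' i checks whether i
    divides n. With n - 2 processors working at once, each does O(1)
    work, giving O(1) parallel time; this sequential simulation is O(n)."""
    if n < 2:
        return False
    # Processor i (2 <= i <= n-1) tests one candidate divisor.
    return all(n % i != 0 for i in range(2, n))

print(is_prime_misd(13), is_prime_misd(15))  # True False
```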

11 SIMD Computers

12 PRAM (Parallel Random Access Machine) Shared-Memory (SM) SIMD computers

13 How does a processor i pass a number to another processor j? Through the shared memory: i writes the number into a memory location and j reads it. O(1) time

14 Read/Write access modes 1. EREW (Exclusive Read, Exclusive Write) 2. CREW (Concurrent Read, Exclusive Write) 3. ERCW (Exclusive Read, Concurrent Write) 4. CRCW (Concurrent Read, Concurrent Write)

15 How to resolve conflicts in CRCW? (a) the smallest-numbered processor is allowed to write, and access is denied to all other processors. (b) writing is allowed only if the values to be written by all processors are equal; otherwise access is denied to all processors. (c) sum the values to be written.

16 Simulating multiple accesses on an EREW computer: (i) N multiple accesses (read): broadcasting procedure: 1. P1 → P2 2. P1, P2 → P3, P4 3. P1, P2, P3, P4 → P5, P6, P7, P8 … O(log N) steps
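The doubling procedure above, simulated sequentially in Python. Each "step" of the outer loop corresponds to one parallel step in which every current holder copies the value to one new processor, so the number of holders doubles and only O(log N) steps are needed:

```python
def erew_broadcast(value, n):
    """Broadcast `value` from processor 1 to all n processors by doubling.
    have[i] is the local copy held by processor i+1 (None = not yet received).
    In each step, processor i copies to processor holders+i: every read and
    write touches a distinct location, so the accesses are exclusive."""
    have = [value] + [None] * (n - 1)
    holders, steps = 1, 0
    while holders < n:
        for i in range(holders):          # these copies happen in parallel
            j = holders + i
            if j < n:
                have[j] = have[i]
        holders = min(2 * holders, n)
        steps += 1
    return have, steps

have, steps = erew_broadcast(5, 8)
print(have, steps)  # [5, 5, 5, 5, 5, 5, 5, 5] 3
```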

17 [Figure: broadcasting the datum D = 5 from P1 through an array A to P1–P8; after each doubling step, twice as many processors hold the value 5.]

18 (ii) m out of N multiple accesses (read), m ≤ N Each memory location is associated with a binary tree.

19 [Figure: the m = 4 requesting processors P1, P2, P4, P7 climb the binary tree associated with datum d, recording their numbers at the tree nodes.]

20 [Figure: the datum d then travels back down the tree along the recorded path, and a copy of d reaches each of P1, P2, P4, P7.]

21 Dividing a shared memory into blocks

22 Interconnection networks SIMD computers The memory is distributed among the N processors. Two processors can communicate with each other if they are connected by a line (link). Simultaneous communication between several pairs of processors is allowed.

23 (1) fully connected networks There are N(N-1)/2 links.

24 (2) linear array (1-dimensional array) Processor P_i is connected to P_{i-1} and P_{i+1}, for 2 ≤ i ≤ N-1.

25 (3) two-dimensional array (2-D mesh-connected)

26 P(j, k) has links connecting to P(j+1, k), P(j-1, k), P(j, k+1) and P(j, k-1), when those processors exist.
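The mesh links can be listed with a small helper (a sketch assuming a mesh without wraparound, so boundary processors simply have fewer links):

```python
def mesh_neighbors(j, k, rows, cols):
    """Neighbors of P(j, k) in a rows x cols 2-D mesh (no wraparound):
    P(j+1,k), P(j-1,k), P(j,k+1), P(j,k-1), kept only when inside the mesh."""
    candidates = [(j + 1, k), (j - 1, k), (j, k + 1), (j, k - 1)]
    return [(a, b) for a, b in candidates if 0 <= a < rows and 0 <= b < cols]

print(mesh_neighbors(0, 0, 4, 4))       # corner processor: 2 links
print(len(mesh_neighbors(2, 2, 4, 4)))  # interior processor: 4 links
```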

27 (4) tree connection (tree machines)

28 The sons of P_i: P_2i and P_2i+1 The parent of P_i: P_⌊i/2⌋
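These index rules translate directly into code (a sketch assuming processors are numbered from the root P_1 downward, as on the slide):

```python
def tree_links(i):
    """Links of processor P_i in a tree machine rooted at P_1:
    parent P_{i // 2} (None for the root), sons P_{2i} and P_{2i+1}."""
    parent = i // 2 if i > 1 else None
    return parent, 2 * i, 2 * i + 1

print(tree_links(1))  # (None, 2, 3)
print(tree_links(5))  # (2, 10, 11)
```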

29 (5) shuffle-exchange connection shuffle links: P_i → P_j where j = 2i if 0 ≤ i ≤ N/2 - 1, and j = 2i - (N-1) if N/2 ≤ i ≤ N-1 exchange links: P_i ↔ P_{i+1} if i is even, 0 ≤ i ≤ N-1
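The two link types can be sketched as index functions. The shuffle link is equivalently a one-bit left cyclic shift of i's binary representation, which the comments note below:

```python
def shuffle_link(i, n):
    """Shuffle link of P_i in an n-processor shuffle-exchange network:
    j = 2i for 0 <= i <= n/2 - 1, else j = 2i - (n - 1).
    This is a left cyclic shift of i's binary representation."""
    return 2 * i if i <= n // 2 - 1 else 2 * i - (n - 1)

def exchange_link(i):
    """Exchange link: P_i <-> P_{i+1} when i is even (complement bit 0)."""
    return i + 1 if i % 2 == 0 else i - 1

n = 8
print([shuffle_link(i, n) for i in range(n)])  # [0, 2, 4, 6, 1, 3, 5, 7]
print([exchange_link(i) for i in range(n)])    # [1, 0, 3, 2, 5, 4, 7, 6]
```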

30 (6) cube connection (n-cube,hypercube) 3-cube:

31 [Figure: a 3-cube network with 8 nodes.]

32 4-cube:

33 An n-cube connected network consists of 2^n nodes (N = 2^n). Binary representation of node i: i_{n-1} i_{n-2} … i_j … i_1 i_0. Node i connects to each node obtained by complementing one bit i_j of this representation. Each processor has n links.
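Complementing one bit is an XOR with a power of two, so the n neighbors of a node can be listed in one line (a sketch; node numbers run from 0 to 2^n - 1):

```python
def hypercube_neighbors(i, n):
    """Neighbors of node i in an n-cube: complement bit j of i's n-bit
    binary representation for each j, giving exactly n links per node."""
    return [i ^ (1 << j) for j in range(n)]

print(hypercube_neighbors(0, 3))  # [1, 2, 4]
print(hypercube_neighbors(5, 3))  # [4, 7, 1]
```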

34 adding 8 numbers:

35 This concept can be implemented on a tree machine, and also on other models. Time: O(log n), where n = # of inputs. For m sets, each of n numbers, pipelining takes log n + (m-1) steps.
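The tree-machine addition can be simulated by combining values pairwise: each pass of the loop corresponds to one level of the tree (one parallel step), so the number of values halves each step and O(log n) steps suffice:

```python
def parallel_sum(a):
    """Add n numbers by pairwise combination, as on a tree machine.
    Each iteration simulates one parallel step at one tree level."""
    values = list(a)
    steps = 0
    while len(values) > 1:
        values = [values[i] + values[i + 1] if i + 1 < len(values)
                  else values[i]                    # odd element carries over
                  for i in range(0, len(values), 2)]
        steps += 1
    return values[0], steps

total, steps = parallel_sum([1, 2, 3, 4, 5, 6, 7, 8])
print(total, steps)  # 36 3  (adding 8 numbers takes log 8 = 3 steps)
```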

36 MIMD Computers

37 MIMD computers sharing a common memory: multiprocessors, tightly coupled machines. MIMD computers with an interconnection network: multicomputers, loosely coupled machines, distributed systems.

38 Analyzing Algorithms ● running time ● speedup ● cost ● efficiency By analogy with a construction project: ● the time until the job is finished ● compared with the time one worker would take ● person-years (or person-months) ● one worker's person-years / n workers' person-years

39 Running Time The time from when the first processor begins computing until the last processor ends computing.

40 Running Time Sum = 0 (1 time) For I = 1 to n (N+1 times) Sum = Sum + A(I) (N times) Next (N+1 times) Total: 3N+3 times = O(N)

41 g(n) = O(f(n)): at most f(n). Example: g(n) = 3n+3, f(n) = n: 3n+3 ≤ 5n when n ≥ 2, so c = 5, n0 = 2.

42 Upper Bound g(n) = O(f(n)): at most f(n); there exist constants c > 0 and n0 such that g(n) ≤ c·f(n) for all n ≥ n0.

43 [Figure: for n ≥ n0, the curve g(n) lies below c·f(n).]

44 g(n) = Ω(f(n)): at least f(n). Example: g(n) = 3n+3, f(n) = n: 3n+3 ≥ 2n when n ≥ 1, so c = 2, n0 = 1.

45 Lower Bound g(n) = Ω(f(n)): at least f(n); there exist constants c > 0 and n0 such that g(n) ≥ c·f(n) for all n ≥ n0.

46 [Figure: for n ≥ n0, the curve g(n) lies above c·f(n).]

47 adding 8 numbers: time: O(log n)

48 The time complexity of heapsort is O(n log n), so the upper bound of sorting is O(n log n). The lower bound of (comparison-based) sorting is Ω(n log n). (Why?) Thus, heapsort is optimal.

49 For matrix multiplication, the lower bound is Ω(n^2) and the upper bound is O(n^2.5). (There exists such an algorithm.)

50 speedup = (running time of the fastest sequential algorithm) / (running time of the parallel algorithm), i.e., the time one CPU takes / the time taken in parallel.

51 Adding n numbers on a tree machine Time: O(log n) Processors: n-1 speedup = O(n / log n)

52 Bitonic sorting Time: O(log^2 n) Processors: O(n) speedup = O(n log n / log^2 n) = O(n / log n)

53 cost = parallel running time × # of processors = t(n) × p(n)

54 s(n): time for the fastest sequential algo. If s(n)=t(n)*p(n), the algorithm is cost optimal (achieves optimal speedup).

55 Adding n numbers Processors: O(n / log n), Time: O(log n) → cost optimal Each processor first adds log n numbers sequentially.
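The cost-optimal scheme can be simulated in two phases, matching the slide: each of roughly n / log n "processors" first sums its own block of about log n numbers sequentially, then the partial sums are combined pairwise as on a tree machine. Cost = O(n / log n) processors × O(log n) time = O(n), matching the sequential algorithm:

```python
import math

def cost_optimal_sum(a):
    """Cost-optimal addition of n numbers.
    Phase 1: ~n/log n processors each sum a block of ~log n numbers.
    Phase 2: partial sums are combined pairwise in O(log n) steps."""
    n = len(a)
    chunk = max(1, math.ceil(math.log2(n)))        # ~log n numbers per processor
    partial = [sum(a[i:i + chunk]) for i in range(0, n, chunk)]
    while len(partial) > 1:                        # pairwise combining phase
        partial = [partial[i] + partial[i + 1] if i + 1 < len(partial)
                   else partial[i]
                   for i in range(0, len(partial), 2)]
    return partial[0]

print(cost_optimal_sum(list(range(1, 17))))  # 136
```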

56 efficiency = s(n) / cost = s(n) / (t(n) × p(n)) e.g., bitonic sorting: efficiency = O(n log n) / (n × log^2 n) = O(1 / log n)

