Introduction 因為需要 large amount of computation. Why parallel?
Example /10 7 ≒ 10 8 秒 10 8 /10 5 ≒ 10 3 天≒ 3 年 用在外科手術時 , 每秒約需 個計算 3D 立體影像 一般的電腦每秒約可算 10 7 個計算
其他需要大量計算的領域 Aircraft Testing New Drug Development Oil Exploration Modeling Fusion Reactors Economic Planning Cryptanalysis Managing Large Databases Astronomy Biomedical Analysis …
電腦的速度 光速 : 3×10 8 m/s Components 太近 : 會互相干擾 每 3 、 4 年增快一倍 Unfortunately It is evident that this trend will soon come to an end. A simple law of physics
Parallelism 人 (CPU) 海戰術 目前許多電腦中都不只一顆 CPU
Models of Computation Flynn's classification: SISD(Single Instruction, Single Data) MISD(Multiple Instruction, Single Data) SIMD(Single Instruction, Multiple Data) MIMD(Multiple Instruction, Multiple Data)
SISD Computers Sequential(serial) Computation
Example summation of n numbers: Sum = 0 For I = 1 to n Sum = Sum + I Next need n-1 additions O(n) time
MISD Computers
Example test whether a positive integer is prime or not. n is prime: no divisors except 1 & n. n is composite: otherwise Each processor tests a number between 1 and n. Each processor needs O(1) operation. O(1) time
SIMD Computers
PRAM (Parallel Random Access Machine) Share-Memory(SM) SIMD computers
How does a processor i pass a number to another processor j? O(1) time
WriteRead 1. EREW (Exclusive Read, Exclusive Write) 2. CREW (Concurrent Read, Exclusive Write) 3. ERCW (Exclusive Read, Concurrent Write) 4. CRCW (Concurrent Read, Concurrent Write)
How to resolve conflicts in CRCW? (b) the values to be written in all processors are equal, otherwise access is denied to all processor. (c) sum the values to be written. (a) the smallest-numbered processor is allowed to write, and access is denied to all other processor.
Simulating multiple accesses on an EREW computer: (i)N multiple accesses (read) : broadcasting procedure: 1. P1→P2 2. P1,P2, → P3,P4 3. P1,P2,P3,P4 → P5,P6,P7,P8 : : O(log N) steps
D P1P1 P2P2 P3P3 P4P4 P5P5 P6P6 P7P7 P8P8 A
(ii) m out of N multiple access(read), m ≦ N Each memory location is associated with a binary tree.
[1][2][4][7] [1][4] [7][1][7] d P1P1 P2P2 P4P4 P7P7 [1][2][4][7][1][2][4]
[1] [4][2] [1][7] d P1P1 P2P2 P4P4 P7P7 d[1] d [7]d [4][7] d d dd dd
Dividing a shared memory into blocks
Interconnection networks SIMD computers The memory is distributed among the N processors. Every pair of processors can communicate with each other if they are connected by a line(link). Instantaneous communication between several pairs of processors is allowed.
(1) fully connected networks There are links.
(2) linear array(1-dimensional array) 1-D array P i-1 Pi Pi P i+1, 2 ≦ i ≦ N – 1
(3) two-dimensional array (2-D mesh-connected)
P(j,k) has links connecting to P(j+1,k), P(j-1,k), P(j,k+1) and P(j,k-1).
(4) tree connection (tree machines)
The sons of P i : P 2i and P 2i+1 The parent of P i : P i/2
(5) shuffle-exchange connection shuffle links: P i P j j=2i if 0 i j=2i-(N-1) if exchange links: P i P i+1 if i is even i N-1
(6) cube connection (n-cube,hypercube) 3-cube:
4-cube:
An n-cube connected network consists of 2 n node. (N=2 n ) binary representation of node i: i n-1 i n-2... i j... i 1 i 0 connects to i n-1 i n i 1 i 0 Each processor has n links.
adding 8 numbers:
This concept can be implemented on a tree machine. This concept can also be implemented on other models. time: O(log n), n:# of inputs m sets, each of n numbers: pipelining: log n+(m-1) steps.
MIMD Computers
MIMD computers sharing a common memory: multiprocessors, tightly coupled machines MIMD computers with an interconnection network: multicomputers, loosely coupled machines, distributed system
Analyzing Algorithms ● running time ● speedup ● cost ● efficiency 以工程為例 : ● 到完工所花的時間 ● 和一個人所花的時間比較 ● 人年 ( 或人月 ) ● 一人之人年 /n 人之人年
Running Time 從 the first processor begins computing 到 the last processor ends computing 的時間
Running Time Sum = 0 For I = 1 to n Sum = Sum + I Next 1次1次 N+1 次 N次N次 共 3N+3 次 = O(N)
g(n)=O(f(n)): at most f(n), g(n) = 3n + 3 f(n) = n 3n+3 ≦ 5 n 當 n ≧ 2 時 c n0n0
Upper Bound g(n)=O(f(n)): at most f(n),
g(n) cf(n) n0n0
g(n) = 3n + 3 f(n) = n 3n+3 ≧ 2 n 當 n ≧ 1 時 c n0n0 g(n)= Ω(f(n)): at least f(n),
Lower Bound g(n)= Ω(f(n)): at least f(n),
g(n) cf(n) n0n0
adding 8 numbers: time: O(log n)
The time complexity of heapsort is O(n log n). The lower bound of sorting is Ω(n log n). (why?) Thus, heapsort is optimal. The upper bound of sorting is O(n log n).
For matrix multiplication, the lower bound is Ω(n 2 ) and the upper bound is O(n 2.5 ). (There exists such an algorithm.)
speedup = 即 : 一個 CPU 所花的時間 / 平行所花的時間
Adding n number on a tree machine Time: O(log n) Processor: n-1 speedup=O(n/log n)
Bitonic sorting speedup=O(nlogn/log 2 n) =O(n/log n) Time: O(log 2 n) Processor: O(n)
cost = parallel time * # of processor = t(n) * p(n)
s(n): time for the fastest sequential algo. If s(n)=t(n)*p(n), the algorithm is cost optimal (achieves optimal speedup).
adding n numbers processor: O(n/log n)} cost optimal time: O(log n) Each processor first adds logn numbers sequentially.
efficiency = e.g. bitonic sorting efficiency: