
1 CIT 668: System Architecture
Parallel Computing

2 Topics
What is Parallel Computing?
Why use Parallel Computing?
Types of Parallelism
Amdahl’s Law
Flynn’s Taxonomy of Parallel Computers
Parallel Memory Architectures
Parallel Programming Models
Images from the LLNL Parallel Computing Tutorial, Wikipedia, or Majd F. Sakr’s parallel computation lectures unless otherwise noted.

3 Serial and Parallel Computation

4 History of Parallel Computation

5 Parallel Computation Breaking a problem into pieces, using multiple computing resources to solve each piece, and reassembling the partial solutions into the final answer. Parallelism is limited by data dependencies.

6 Data Dependencies

7 Data Dependencies

8 Parallel Terminology Task: A logically discrete section of computational work. A task is typically a thread or process at the OS level. Communications: Parallel tasks typically need to exchange data over a shared memory bus or over a network. Synchronization: Coordination of parallel tasks in real time, often implemented by establishing a synchronization point within an application where a task cannot proceed further until the other tasks reach that point. Scalability: Ability of a parallel system to demonstrate a proportionate increase in speed with the addition of more processors. Embarrassingly Parallel: Solving many similar but independent tasks in parallel with little or no need for coordination between tasks.
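A minimal sketch of an embarrassingly parallel workload using Python's standard multiprocessing module; the task function and inputs here are illustrative, not from the lecture.

from multiprocessing import Pool

def check_candidate(n):
    # Independent task: trial-division primality test for one number.
    return (n, n > 1 and all(n % d for d in range(2, int(n ** 0.5) + 1)))

if __name__ == "__main__":
    candidates = range(1_000_000, 1_000_100)
    with Pool() as pool:                                    # one worker process per core by default
        results = pool.map(check_candidate, candidates)     # no coordination between tasks
    print(sum(1 for _, is_prime in results if is_prime), "primes found")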

9 Parallel Granularity
Fine grain: relatively small amounts of computation done between communication events; low computation-to-communication ratio; facilitates load balancing.
Coarse grain: relatively large amounts of computation done between communication events; high computation-to-communication ratio; difficult to load balance.

10 Why Parallel Computing?

11 The Real World is Parallel

12 Modeling Science & Engineering Problems

13 Reasons for Parallel Computation
Limits to serial computing: CPU clock speeds have increased only slowly since 2003.
Solve problems faster: reduce time by using more resources.
Solve larger problems: scientific problems, web-scale applications.

14 Types of Parallelism
Bit-level Parallelism
Instruction-level Parallelism
Data-level Parallelism
Task-level Parallelism

15 The Processor The brain of the computer: a functional unit that interprets and carries out instructions (mathematical and logical operations). Also called the CPU; it consists of a control unit plus a datapath (ALU and registers), built from hundreds of millions of transistors.

16 Processor Components: Control
Control Unit: the processor’s supervisor; runs the fetch/execute cycle (fetch, decode, execute, store).
Program Counter (PC): stores the address of the instruction to be fetched.
Instruction Register (IR): holds the most recently fetched instruction.

17 Processor Components: Datapath
Register File: general-purpose storage locations inside the processor that hold addresses or values; a 32-bit processor typically has 32-bit registers.
Arithmetic Logic Unit (ALU): a set of functional units that perform arithmetic and logic operations.

18 Bit-level Parallelism
Increase the processor word size to operate on more bits at once.
Task: add two 64-bit numbers.
32-bit CPU: must complete two 32-bit additions plus handle the carry between them.
64-bit CPU: adds the two 64-bit integers in a single instruction.
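A rough illustration of the idea in Python, assuming a hypothetical 32-bit machine: the 64-bit sum has to be assembled from two 32-bit additions plus carry handling, while a 64-bit ALU does the same work in one add.

MASK32 = 0xFFFFFFFF

def add64_on_32bit(a, b):
    # Split each 64-bit operand into 32-bit halves.
    a_lo, a_hi = a & MASK32, (a >> 32) & MASK32
    b_lo, b_hi = b & MASK32, (b >> 32) & MASK32
    lo = a_lo + b_lo                     # first 32-bit add produces the low word plus a carry
    carry = lo >> 32
    hi = (a_hi + b_hi + carry) & MASK32  # second 32-bit add must wait for the carry
    return (hi << 32) | (lo & MASK32)

assert add64_on_32bit(2**63 - 1, 1) == 2**63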

19 Evolution of Processor Word Size

20 Instruction Level Parallelism (ILP)
Running independent instructions on separate execution units simultaneously.
Serial execution: if each instruction takes one cycle, the program takes 3 clock cycles to run.
x = a + b
y = c + d
z = x + y
Parallel execution: the first two instructions are independent, so they can execute simultaneously. The third instruction depends on the first two, so it must execute afterwards: 2 clock cycles to run the program.

21 Instruction Pipelining
Improve ILP by splitting the processing of a single machine-language instruction into a series of independent steps. The CPU can then issue instructions at the rate of the slowest step, allowing a higher clock speed.
IF = Instruction Fetch, ID = Instruction Decode, EX = Execute, MEM = Memory access, WB = Register write back

22 Sequential Laundry
Sequential laundry = 8 hours for 4 loads. (Timeline figure, 6 PM to 2 AM: loads A-D each pass through four 30-minute stages, one load finishing completely before the next begins.)

23 Pipelined Laundry
Pipelined laundry = 3.5 hours for 4 loads! (Timeline figure, starting at 6 PM: loads A-D overlap, with a new load entering each stage as soon as the previous load vacates it.)

24 Pipelining Lessons
Pipelining doesn’t decrease the latency of a single task; it increases the throughput of the entire workload.
Multiple tasks operate simultaneously using different resources.
Potential speedup = number of pipeline stages.
Time to fill the pipeline and time to drain it reduce speedup: 2.3X vs. 4X in this example.
Speedup is limited by the slowest pipeline stage.
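A small calculation, using the numbers on this slide (4 loads, four 30-minute stages), showing why fill and drain time hold the speedup to about 2.3X here while a long run of loads approaches the 4X stage count:

def serial_time(n_tasks, n_stages, stage_hours):
    return n_tasks * n_stages * stage_hours

def pipelined_time(n_tasks, n_stages, stage_hours):
    # First task fills the pipeline; after that, one task finishes per stage time.
    return (n_stages + n_tasks - 1) * stage_hours

for loads in (4, 1000):
    speedup = serial_time(loads, 4, 0.5) / pipelined_time(loads, 4, 0.5)
    print(loads, "loads: speedup =", round(speedup, 2), "x")
# 4 loads: 8.0 / 3.5 hours = 2.29x; 1000 loads: ~3.99x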

25 Processor Pipeline Execution
(Pipeline diagram: instructions 1-8 overlapped, each passing through the IF, ID, ALU, MEM, and WR stages in successive clock cycles, with a new instruction entering the pipeline every cycle.)

26 Hazards
Problems that prevent pipelined execution.
Data hazards: must wait for a previous instruction to compute the data that the current instruction needs as input.
Structural hazards: a processor component (execution unit or memory unit) is needed by two simultaneous instructions.
Control hazards: conditional branch statements offer two alternatives for the next instruction to fetch.

27 Working Around Hazards
Instruction re-ordering: re-order instructions to extract more ILP; data dependencies limit re-ordering.
Branch prediction: predict which branch is likely to be taken, then execute those instructions; loops usually stay in the loop and exit only once.
Loop unrolling: the programmer or compiler replicates the body of a loop to expose more parallelism (see the sketch below).
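A sketch of the loop-unrolling transformation. In a compiled language, the four independent partial sums in the unrolled body are what expose extra ILP; the Python version below only illustrates the shape of the transformation, and the arrays are made up.

def dot(a, b):
    total = 0.0
    for i in range(len(a)):
        total += a[i] * b[i]          # every iteration depends on the previous total
    return total

def dot_unrolled4(a, b):
    # Body replicated 4x with four independent accumulators.
    s0 = s1 = s2 = s3 = 0.0
    n4 = len(a) - len(a) % 4
    for i in range(0, n4, 4):
        s0 += a[i] * b[i]
        s1 += a[i + 1] * b[i + 1]
        s2 += a[i + 2] * b[i + 2]
        s3 += a[i + 3] * b[i + 3]
    for j in range(n4, len(a)):       # leftover elements
        s0 += a[j] * b[j]
    return s0 + s1 + s2 + s3

assert dot([1, 2, 3, 4, 5], [5, 4, 3, 2, 1]) == dot_unrolled4([1, 2, 3, 4, 5], [5, 4, 3, 2, 1])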

28 Superscalar Pipelined Execution

29 AMD Athlon Architecture

30 Superscalar Pipeline

31 Pipeline Depth and Issue Width
CPU                  Year  Clock Speed  Pipeline Stages  Issue Width  Cores  Power
80486                1989  25 MHz       5                1            1      5 W
Pentium              1993  66 MHz       5                2            1      10 W
Pentium Pro          1997  150 MHz      10               3            1      29 W
Pentium 4            2001  1500 MHz     22               3            1      75 W
Pentium 4 Prescott   2004  3600 MHz     31               3            1      103 W
Core 2 Conroe        2006  2930 MHz     14               4            2      —
Core 2 Yorkfield     2008  —            16               4            4      95 W
Core i7 Gulftown     2010  3460 MHz     —                4            6      130 W

32 Pipeline Depth and Issue Width

33 Data Parallelism Distribute data across different computing nodes so that the same operation is performed on different parts of the same data structure. Also known as loop-level parallelism. If each loop iteration depends on results from the previous iteration, the loop cannot be parallelized.

34 Task Parallelism Different operations on different data sets
Each processor performs a different task. Each processor communicates with the other tasks to get its inputs and return its results.
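A minimal task-parallel sketch using Python's multiprocessing: two different operations run as separate processes on the same input and report back through a queue. The functions and data are illustrative, not from the lecture.

from multiprocessing import Process, Queue

def mean_task(data, out):
    out.put(("mean", sum(data) / len(data)))                # one kind of operation

def outlier_task(data, out):
    out.put(("outliers", sum(1 for x in data if x > 90)))   # a different operation

if __name__ == "__main__":
    scores = [70, 80, 95, 99, 60]
    out = Queue()
    tasks = [Process(target=mean_task, args=(scores, out)),
             Process(target=outlier_task, args=(scores, out))]
    for t in tasks:
        t.start()
    results = dict(out.get() for _ in tasks)                # collect one result per task
    for t in tasks:
        t.join()
    print(results)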

35 Multiple CPUs

36 Multicore Multicore CPU chips contain multiple complete processors
Individual L1 and shared L2 caches OS and applications see each core as an independent processor Each core can run a separate task A single application must be divided into multiple tasks to improve performance

37 Multicore Organization Alternatives
Figure 18.8 from Computer Organization and Architecture, 8th edition

38 Core 2 Duo and Core i7 Architectures
Figures from Computer Organization and Architecture, 8th edition

39 High Core CPUs Use On-Chip Network
CPUs with a low core count use ~1000 wires per core to connect the cores to the L3 cache and to each other. To scale to more cores, the Core i7 uses a ring bus to connect the cores and L3 cache slices.
Core i7 2600k Ring Bus Architecture

40 Multicore vs. Multiprocessor
                                Multicore   Multiprocessor
Inter-processor communication   Faster      Slower
Memory bandwidth per CPU        Lower       Higher
Power consumption per CPU       Lower       Higher
Cost per CPU                    Lower       Higher

41 Simultaneous Multi-Threading
CPU presents virtual cores to OS CPU duplicates PC, IR, and thread state registers But keeps same number of execution units OS feeds two threads at once to CPU Improves ILP by having multiple instruction streams that are unlikely to have cross-stream dependencies

42 CPU Parallelism Architectures Compared
Parallelism types combined to increase parallel capacity. x86 cores have been superscalar since Pentium in 1993. Server x86 CPUs use simultaneous multithreading since Pentium 4 Xeon in 2000. Figure 18.1 from Computer Organization and Architecture, 8th edition

43 Flynn’s Taxonomy

44 Flynn’s Taxonomy
                 Single Instruction           Multiple Instruction
Single Data      SISD: Pentium III            MISD: none today
Multiple Data    SIMD: SSE instruction set    MIMD: Xeon e5345 (Clovertown)

45 Single Instruction Single Data
Serial computation Single instruction = only one instruction stream acted on by CPU during one clock cycle Single data = only one data stream used as input during any clock cycle

46 Single Instruction Multiple Data
Parallel computation.
Single instruction = all processing units are given the same instruction in any clock cycle.
Multiple data = each processing unit can operate on a different data element.
Applications: graphics (e.g., the Radeon R770 GPU).

47 Multiple Instruction Single Data
Parallel computation Single data = A single data stream is fed into multiple processing units Multiple instruction = Each processing unit operates on data independently via its own instruction stream

48 Multiple Instruction Multiple Data
Parallel computation Multiple instruction = each processing unit may execute a different instruction stream Multiple data = each processing unit may work with a different data stream Examples: multicore, grid computing

49 Taxonomy of Parallel Architectures
Figure 17.1 from Computer Organization and Architecture, 8th edition

50 Amdahl’s Law

51 Scaling Parallel Computing
Just add more processors?
Hardware limitations: memory-CPU bus bandwidth on the local machine; network bandwidth and latency.
Software limitations: most algorithms have limits to scalability; supporting libraries may have their own limits.

52 Amdahl’s Law Speedup due to enhancement E is
Speedup = (execution time without E) / (execution time with E)
Suppose E accelerates a fraction P (P < 1) of the task by a factor S (S > 1) and leaves the remainder unaffected. Then:
Execution time with E = Execution time without E × [(1 − P) + P/S]
Speedup = 1 / [(1 − P) + P/S]
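The formula above, expressed as a small Python helper (a sketch; the function name is ours):

def amdahl_speedup(p, s):
    # p: fraction of the task that is enhanced; s: speedup factor of that fraction.
    return 1.0 / ((1.0 - p) + p / s)

print(amdahl_speedup(0.5, 2.0))    # 1.33x overall when half the work runs twice as fast
print(amdahl_speedup(0.5, 1e9))    # approaches 2.0x as the enhanced part's time goes to ~0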

53 Amdahl’s Law: Example
Consider an application whose work is divided into the following four components:
Work load:   Memory access 10%,  Computation 70%,  Disk access 10%,  Network access 10%
What is the expected percent improvement in execution time if:
Memory access speed is doubled?   5%
Computation speed is doubled?     35%
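Checking the two answers with the formula from the previous slide (a quick sketch):

def new_time_fraction(p, s):
    # Remaining execution time, as a fraction of the original, when a
    # fraction p of the work is sped up by a factor s.
    return (1.0 - p) + p / s

print(1 - new_time_fraction(0.10, 2))   # memory access doubled -> 0.05, i.e. 5% faster
print(1 - new_time_fraction(0.70, 2))   # computation doubled   -> 0.35, i.e. 35% faster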

54 Amdahl’s Law for Parallelization
Speedup = 1 / [(1 − P) + P/S]
Let P be the parallelizable portion of the code. As the number of processors increases, the time to run the parallel portion of the program, P/S, tends toward zero, reducing the equation to:
Speedup = 1 / (1 − P)
If P = 0, then speedup = 1 (no improvement).
If P = 1, then speedup grows without limit.
If P = 0.5, then the maximum speedup is 2.

55 Amdahl’s Law: Parallel Example
Consider an application whose work is divided into the following four functions:
Work load:   f1 4%,  f2 10%,  f3 80%,  f4 6%
Assume f1, f3, and f4 can be parallelized, but f2 must be computed serially.
Parallelizing which function would best improve performance?   f3
What is the best performance speedup that could be reached by parallelizing all three parallelizable functions?   10X

56 Amdahl’s Law: Time Example
Consider an application whose work is divided into the following four functions:
Work load:   f1 2 ms,  f2 5 ms,  f3 40 ms,  f4 3 ms
Assume f1, f3, and f4 can be parallelized, but f2 must be computed serially. Running the whole program takes 50 ms.
What is the best running time that can be achieved by parallelizing f1, f3, and f4?   5 ms
Why can’t parallelizing the program decrease the total running time below that time?   5 ms is the time required for the serial part; even if the parallel part takes 0 ms, f2 still takes 5 ms to run.
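A quick check of the 5 ms floor, varying the number of processors applied to the parallelizable 45 ms:

serial_ms = 5                    # f2 must run serially
parallel_ms = 2 + 40 + 3         # f1 + f3 + f4 = 45 ms

for n in (1, 4, 16, 1_000_000):
    print(n, "processors:", serial_ms + parallel_ms / n, "ms")
# Total time approaches, but never drops below, the 5 ms required by f2.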

57 Amdahl’s Law

58 Parallel Memory Architectures
Uniform Memory Architecture
Non-Uniform Memory Architecture
Distributed Memory
Hybrid Distributed-Shared Memory

59 Uniform Memory Architecture (UMA)
Global shared address space for all memory.
Symmetric Multi-Processing (SMP).
Does not scale much beyond 8 CPUs due to memory contention.

60 Non-Uniform Memory Architecture (NUMA)
Global shared address space in which processors have fast access to nearby memory, while memory access across links is slower. Better scaling than UMA, but still limited by memory contention.

61 Distributed Memory Each CPU has own local address space; changes by each CPU are not visible to other CPUs CPUs must use network to exchange data Highly scalable: CPUs have fast local RAM Data communication is programmer’s responsibility

62 Hybrid Distributed-Shared Memory
Shared memory components are SMP nodes Distributed memory is network of SMP nodes Used by most large supercomputers + clouds

63 Hybrid Distributed-Shared in Cloud

64 Parallel Programming Models
Shared Memory Model
Threads Model
Data Parallel Model
Message Passing Model

65 Parallel Programming Models
An abstraction above the hardware level for programmers to use. Any programming model can be used with any parallel architecture. Model performance may depend on architecture. Model choice depends on problem being solved and programmer preference.

66 Shared Memory Tasks share a common global address space, which they read and write asynchronously. Software mechanisms such as locks and semaphores are used to control access to the shared memory.
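A minimal sketch of the shared-memory model using Python threads, with a lock protecting a shared counter; the variable names are ours.

import threading

counter = 0                          # data in the shared address space
lock = threading.Lock()

def work(iterations):
    global counter
    for _ in range(iterations):
        with lock:                   # lock controls access to the shared data
            counter += 1

threads = [threading.Thread(target=work, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)                       # 400000; without the lock, updates could be lost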

67 Shared Memory Model
Advantages: Program development is simplified since the process owns the data stored in memory; referencing data in shared memory is similar to traditional serial programming.
Disadvantages: It is difficult to understand and manage data locality; keeping data local to the CPU working on it is faster, but bus traffic results when other CPUs try to access that data.

68 Threads Divide single program into multiple concurrent execution paths called threads. Threads are distributed across CPUs and can be executed on multiple CPUs simultaneously.

69 Threads Each thread has local data structures specific to that thread.
However, all threads share the common process global memory space.
Threads are associated with shared memory architectures.
Threads can be scheduled by the OS or by middleware such as the language VM (green threads). Green threads start and synchronize faster, but most implementations cannot use multiple CPUs.

70 Data Parallel The data set is divided into chunks, and operations are performed on each chunk concurrently. Tasks work on different parts of the same data structure, performing the same operation on each instance of the data (e.g., multiply each array element by X).
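A small data-parallel sketch matching the example above: the array is split into chunks and every worker applies the same operation (multiply by X) to its chunk. Names and sizes are illustrative.

from multiprocessing import Pool

X = 3

def scale_chunk(chunk):
    # Same operation applied to every element of this task's chunk.
    return [value * X for value in chunk]

if __name__ == "__main__":
    data = list(range(12))
    chunks = [data[i:i + 4] for i in range(0, len(data), 4)]   # one chunk per task
    with Pool(processes=len(chunks)) as pool:
        scaled = pool.map(scale_chunk, chunks)
    print([v for chunk in scaled for v in chunk])               # reassemble the result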

71 Message Passing Tasks use only their own local memory.
Tasks can be on multiple machines. Data exchanged between tasks by sending and receiving messages. Data transfer requires cooperation between tasks.
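A minimal message-passing sketch with Python's multiprocessing Pipe: each task keeps its own local data and exchanges it only by explicit send and receive. An MPI program would follow the same pattern with its own send/receive calls.

from multiprocessing import Process, Pipe

def worker(conn):
    local_data = conn.recv()         # cooperative transfer: receive my chunk
    conn.send(sum(local_data))       # compute with local memory only, send result back
    conn.close()

if __name__ == "__main__":
    parent_end, child_end = Pipe()
    p = Process(target=worker, args=(child_end,))
    p.start()
    parent_end.send(list(range(100)))           # explicit message to the other task
    print("partial sum:", parent_end.recv())    # 4950
    p.join()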

72 Parallel Code Example

73 Example: Array Processing
if MASTER
    initialize array
    send each WORKER info on the chunk it owns
    send each WORKER its chunk of the initial array
    recv results from workers
elsif WORKER
    recv info on my chunk
    recv array data
    do j = first col .. last col
        do i = 1, n
            a(i,j) = f(i,j)
    send results to MASTER
The array is divided into chunks; each CPU owns a chunk and executes the portion of the loop corresponding to it.

74 Example: Heat Equation
The Heat Equation describes change in temperature in a region over time, given initial temperature distribution and boundary conditions. Divide region into chunks and iterate, allowing heat from nearby chunks to change temperature in chunk for next iteration.

75 Example: Heat Equation
To compute U(x,y): each element is updated using the values of its neighboring elements from the previous iteration. (The slide shows the finite-difference update formula and the serial program’s nested loops.)

76 Example: Parallel Heat Equation Solver
Divide array into chunks. Data dependencies: Interior elements are independent of other tasks. Border elements are dependent on other tasks, so tasks must communicate. Master sends initial state to worker tasks and collects results from workers. Workers compute heat equation, communicating state of border elements with adjacent worker tasks.
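A sketch of the per-chunk computation each worker performs: a standard explicit finite-difference update. The grid, the coefficients cx and cy, and the function name are assumptions, and the border exchange is indicated only in comments.

def heat_step(u, cx=0.1, cy=0.1):
    # One time step over this worker's chunk; u is a 2-D list of temperatures.
    rows, cols = len(u), len(u[0])
    new_u = [row[:] for row in u]
    # Interior elements depend only on data this task already holds.
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            new_u[i][j] = (u[i][j]
                           + cx * (u[i + 1][j] + u[i - 1][j] - 2 * u[i][j])
                           + cy * (u[i][j + 1] + u[i][j - 1] - 2 * u[i][j]))
    # In the parallel version, border rows/columns would now be exchanged with
    # the adjacent worker tasks before the next iteration.
    return new_u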

77 Parallel Code

78 Key Points Granularity refers to the ratio of computation to communication.
Levels of Parallelism:
Bit-level parallelism
Data-level parallelism (loop-level)
Instruction-level parallelism: pipelining, superscalar
Task-level parallelism: multi-CPU, multi-core, hyperthreading
Flynn’s Taxonomy: SIMD, MIMD

79 Key Points
Amdahl’s Law: no matter how many processors are used, speed-up is limited by the sequential portion of the program.
Parallel Memory Architectures: shared memory for SMP (UMA, NUMA); distributed memory for clusters and large-scale multiprocessors; most clouds use a hybrid shared-distributed architecture.
Parallel Programming Models: Shared Memory, Threads, Data Parallel, Message Passing.

