PARALLEL PROCESSING COMPARATIVE STUDY
CONTEXT How can we finish a job in a short time? Solution: use a faster worker. Drawbacks: a worker's speed has a limit, and this is inadequate for long jobs.
CONTEXT How can we finish a calculation in a short time? Solution: use a faster calculator (processor) [1960-2000]. Drawbacks: processor speed has reached a limit, and this is inadequate for long calculations.
CONTEXT How can we finish a job in a short time? Solution: 1. Use a faster worker (inadequate for long jobs). 2. Use more than one worker concurrently.
CONTEXT How can we finish a calculation in a short time? Solution: 1. Use a faster processor (inadequate for long calculations). 2. Use more than one processor concurrently: Parallelism.
CONTEXT Definition: Parallelism is the concurrent use of more than one processing unit (CPUs, processor cores, GPUs, or combinations of them) in order to carry out calculations more quickly.
THE GOAL Parallelism needs: 1. A parallel computer (more than one processor). 2. Accommodating the calculation to the parallel computer.
THE GOAL Parallel computers: several parallel computers exist on the hardware market, and they differ in their architecture. Several classifications exist: based on the instruction and data streams (Flynn's classification), based on the degree of memory sharing, and others.
THE GOAL Flynn's classification: A. Single Instruction, Single Data stream (SISD). B. Single Instruction, Multiple Data streams (SIMD). C. Multiple Instruction, Single Data stream (MISD). D. Multiple Instruction, Multiple Data streams (MIMD).
THE GOAL Memory-sharing-degree classification: A. Shared memory. B. Distributed memory. C. Hybrid distributed-shared memory.
THE GOAL Parallelism needs: 1. A parallel computer (more than one processor). 2. Accommodating the calculation to the parallel computer: dividing the calculation and data between the processors, and defining the execution scenario (how the processors cooperate).
THE GOAL The accommodation of a calculation to a parallel computer is called parallel processing; it depends closely on the architecture.
THE GOAL Goal: a comparative study between 1. the shared-memory parallel processing approach and 2. the distributed-memory parallel processing approach.
PLAN 1. Distributed-memory parallel processing approach. 2. Shared-memory parallel processing approach. 3. Case study problems. 4. Comparison results and discussion. 5. Conclusion.
DISTRIBUTED MEMORY PARALLEL PROCESSING APPROACH
DISTRIBUTED MEMORY PARALLEL PROCESSING APPROACH Distributed-Memory Computer (DMC) = Distributed-Memory System (DMS) = Massively Parallel Processor (MPP).
DISTRIBUTED MEMORY PARALLEL PROCESSING APPROACH Distributed-memory computer architecture (diagram).
DISTRIBUTED MEMORY PARALLEL PROCESSING APPROACH Architecture of nodes. Nodes can be: identical processors (pure DMC), different types of processors (hybrid DMC), or different types of nodes with different architectures (heterogeneous DMC).
DISTRIBUTED MEMORY PARALLEL PROCESSING APPROACH Architecture of the interconnection network. There is no shared memory space between nodes, so the network is the only means of node-to-node communication, and network performance directly influences the performance of a parallel program on a DMC. Network performance depends on: 1. topology, 2. physical connectors (wires, etc.), 3. routing technique. The evolution of DMCs depends closely on the evolution of networking.
DISTRIBUTED MEMORY PARALLEL PROCESSING APPROACH The DMC used in our comparative study: a heterogeneous DMC, a modest cluster of workstations with three nodes: a Sony laptop (Intel i3 processor), an HP laptop (Intel i3 processor), and an HP laptop (Intel Core 2 Duo processor). Communication network: 100 Mbps Ethernet.
DISTRIBUTED MEMORY PARALLEL PROCESSING APPROACH Parallel software development for a DMC. The designer's main tasks: 1. global calculation decomposition and task assignment, 2. data decomposition, 3. definition of the communication scheme, 4. synchronization study.
DISTRIBUTED MEMORY PARALLEL PROCESSING APPROACH Parallel software development for a DMC. Important considerations for efficiency: 1. minimize communication, 2. avoid barrier synchronization (a sketch of one common way to address both follows).
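As an illustration of both recommendations, here is a minimal sketch, not taken from this study, that uses only standard nonblocking MPI calls (MPI_Isend, MPI_Irecv, MPI_Waitall) to let a node compute on local data while messages are in flight, so no global barrier is needed; the function and variable names are hypothetical.

/* Sketch: overlap communication with computation via nonblocking MPI,
   avoiding a blocking exchange followed by MPI_Barrier. */
#include <mpi.h>

void exchange_and_compute(double *halo_out, double *halo_in, int n,
                          int neighbor, double *interior, int m) {
    MPI_Request reqs[2];
    /* start the halo exchange with the neighboring node ... */
    MPI_Isend(halo_out, n, MPI_DOUBLE, neighbor, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(halo_in,  n, MPI_DOUBLE, neighbor, 0, MPI_COMM_WORLD, &reqs[1]);
    /* ... and meanwhile do interior work that needs no remote data */
    for (int i = 0; i < m; i++)
        interior[i] *= 2.0;               /* stand-in for real computation */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    /* only now read halo_in; no global barrier was required */
}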
DISTRIBUTED MEMORY PARALLEL PROCESSING APPROACH Implementation environments. Several implementation environments exist, notably PVM (Parallel Virtual Machine) and MPI (Message Passing Interface).
DISTRIBUTED MEMORY PARALLEL PROCESSING APPROACH MPI application anatomy. All the nodes execute the same code, yet the nodes do not all do the same work. This apparent contradiction is resolved by the SPMD (Single Program, Multiple Data) application form: the single program branches on the process rank, so the processes can be organized into one controller and several workers. A minimal sketch follows.
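A minimal SPMD controller/worker sketch, assuming only standard MPI; the task payload and the "work" are placeholders, not the study's actual computation.

/* Every rank runs this same binary; the rank decides who is the
   controller and who is a worker. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                      /* controller */
        for (int w = 1; w < size; w++) {
            int task = w * 100;           /* hypothetical task descriptor */
            MPI_Send(&task, 1, MPI_INT, w, 0, MPI_COMM_WORLD);
        }
        for (int w = 1; w < size; w++) {
            int result;
            MPI_Recv(&result, 1, MPI_INT, w, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("result from worker %d: %d\n", w, result);
        }
    } else {                              /* worker */
        int task, result;
        MPI_Recv(&task, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        result = task + rank;             /* stand-in for real computation */
        MPI_Send(&result, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}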
SHARED MEMORY PARALLEL PROCESSING APPROACH Several shared-memory parallel computers (SMPCs) are on the market, e.g., multi-core PCs (Intel i3/i5/i7, AMD). Which SMPC do we use? The GPU: originally built for image processing, the GPU is now a domestic supercomputer. Characteristics: the cheapest and fastest shared-memory parallel computer, but with a hard parallel design.
SHARED MEMORY PARALLEL PROCESSING APPROACH The GPU architecture; the implementation environment.
SHARED MEMORY PARALLEL PROCESSING APPROACH GPU architecture. Like a classical processing unit, the Graphics Processing Unit is composed of two main components: A. calculation units, B. storage units.
SHARED MEMORY PARALLEL PROCESSING The GPU architecture: the implementation environments are 1. CUDA, for GPUs manufactured by NVIDIA, and 2. OpenCL, independent of the GPU architecture.
SHARED MEMORY PARALLEL PROCESSING CUDA program anatomy (diagram).
SHARED MEMORY PARALLEL PROCESSING Q: How do we execute the code fragments to be parallelized on the GPU? A: By calling a kernel. Q: What is a kernel? A: A kernel is a function callable from the host and executed on the device simultaneously by many threads in parallel.
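To make this concrete, here is a minimal, self-contained CUDA sketch (not from this study; the kernel name scale and the sizes are illustrative) showing a kernel definition and its launch from the host.

// Each thread scales one element of an array in parallel.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard stray threads
        data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    // kernel launch: a grid of blocks, each with 256 threads
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    scale<<<blocks, threads>>>(d_data, 2.0f, n);
    cudaDeviceSynchronize();                        // wait for the device

    cudaFree(d_data);
    printf("kernel finished\n");
    return 0;
}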
SHARED MEMORY PARALLEL PROCESSING Kernel launch (diagrams).
Design recommendations: 1. Use shared memory to reduce the time spent accessing global memory. 2. Reduce the number of idle threads (control divergence) to fully utilize the GPU's resources. Both are illustrated in the sketch below.
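A sketch of both recommendations, assuming a launch with 256 threads per block (illustrative, not the study's code): each block stages a tile of global memory in fast on-chip shared memory and reduces it there, and the strided loop keeps the active threads contiguous, limiting control divergence.

// Per-block sum reduction staged through shared memory.
__global__ void sum_tiles(const float *in, float *out, int n) {
    __shared__ float tile[256];                     // on-chip, per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;     // one global read each
    __syncthreads();                                // tile fully loaded

    // tree reduction entirely in shared memory; halving the stride keeps
    // neighboring threads active together (little divergence)
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            tile[threadIdx.x] += tile[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        out[blockIdx.x] = tile[0];                  // one result per block
}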
CASE STUDY PROBLEMS Two problems are used in the comparison: matrix multiplication and Pi approximation.
COMPARISON Comparison criteria, analysis, and conclusions.
COMPARISON Criteria 1: Time-cost factor = Tp × Ch. Tp: parallel execution time (in milliseconds). Ch: hardware cost (in Saudi riyals, SAR). Hardware costs (Ch): GPU, 5000 SAR; cluster of workstations, 9630 SAR.
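For example, with made-up run times just to show how the criterion works: a run taking Tp = 100 ms on the GPU scores 100 × 5000 = 500,000, while the same run taking 60 ms on the cluster scores 60 × 9630 = 577,800; the lower score wins, so the GPU would be preferred here even though it was slower, because it is cheaper.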
COMPARISON Conclusion: the GPU is better when we need to perform a large number of small calculations (few iterations each). However, if we need to perform a single calculation with a large number of iterations, the cluster of workstations is the better choice.
COMPARISON Criteria 2: required memory. Matrix multiplication problem. Graphics Processing Unit: the global-memory-based method requires 6 × N × N × s, while the shared-memory-based method requires 8 × N × N × s (N: matrix dimension, s: element size). Cluster of workstations: the cluster used contains three nodes, and the requirement is (19/3) × N × N × s per node.
COMPARISON Criteria 2: required memory. Pi approximation problem. Graphics Processing Unit: the size of the work arrays depends on the number of threads used, so the required memory is (number of threads) × (number of arrays) × (element size). Cluster of workstations: a small amount of memory is used on each node, roughly 15 × (element size).
COMPARISON Criteria 2: required memory. Conclusion: we cannot judge which parallel approach is better on the required-memory criterion; this criterion depends on the intrinsic characteristics of the problem at hand.
COMPARISON Criteria 3: the gap between the theoretical complexity and the effective complexity, calculated by GAP = ((Texp / Tth) − 1) × 100, where Texp is the experimental parallel time, Tth is the theoretical parallel time, Tth = Tseq / P, Tseq is the sequential time, and P is the number of processing units.
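A quick worked example with made-up numbers: if Tseq = 100 ms and P = 4, then Tth = 25 ms; measuring Texp = 30 ms gives GAP = ((30 / 25) − 1) × 100 = 20%, i.e., the program ran 20% slower than the ideal, while measuring Texp = 20 ms would give GAP = −20%, a negative gap.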
COMPARISON Criteria 3: gap results for the cluster of workstations (chart).
COMPARISON Criteria 3: gap results for the Graphics Processing Unit (chart).
COMPARISON Criteria 3 conclusion: on the GPU, the execution time of a parallel program can come in below the theoretically expected time, which is impossible on a cluster of workstations because of the communication overhead. To minimize the gap, or keep it constant, on the cluster of workstations the designer has to keep the number and sizes of communicated messages as constant as possible while the problem size increases.
COMPARISON Criteria 4: efficiency (chart).
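For reference, since the next slide uses "efficiency" and "speedup" interchangeably, the standard definitions, assumed here because the slides do not state them, are speedup S = Tseq / Tpar and efficiency E = S / P for P processing units.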
COMPARISON Criteria 4: efficiency. Conclusion: the efficiency (speedup) is much better on the GPU than on the cluster of workstations.
COMPARISON Important notification (chart).
COMPARISON Criteria 5: difficulty of development: CUDA vs. MPI.
COMPARISON Criteria 6: necessary hardware and software materials. GPU: NVIDIA GT 525M. Cluster of workstations: 3 PCs, a switch, an Internet modem, and wires.
CONCLUSION
PARALLEL PROCESSING COMPARATIVE STUDY Shared-memory parallel processing approach (Graphics Processing Unit, GPU) vs. distributed-memory parallel processing approach (cluster of workstations). GPUs and clusters are the two main components of the world's fastest computers (such as Shaheen). To compare them we used two different problems (matrix multiplication and Pi approximation) and six comparison criteria.
GPU: more adequate for the data-level parallelism form | Cluster: more adequate for the task-level parallelism form
GPU: a big number of small calculations | Cluster: one big calculation
Both: memory requirements depend on the problem's characteristics
GPU: can beat the expected run time (null or negative gap) | Cluster: impossible
GPU: complicated design and programming | Cluster: less complicated
GPU: very practical implementation environment | Cluster: complicated