1
Introduction to Parallel Processing, Dr. Guy Tel-Zur, Lecture Slides No. 1
2
Introduction to Parallel Processing. Course number: 361-1-3621. Course website: http://www.ee.bgu.ac.il/~tel-zur/2003/pp.html
3
Course Objectives: The goal of this course is to provide an in-depth introduction to modern parallel processing. The course will cover both theoretical and practical aspects of parallel processing.
4
Course structure: introduction; parallelization techniques; parallel applications in scientific and engineering computing; hands-on practice; additional topics (enrichment, future trends...)
5
Plan for the first lecture: an introduction to "Introduction to Parallel Computing"; basic concepts; a short description of the parallel cluster on which the exercises will run next week and throughout the course.
6
Let's begin…
7
What is parallel {computing, processing}? Parallel Computing, Parallel Processing, Cluster Computing, Beowulf Clusters, HPC (High Performance Computing)
8
Oxford Dictionary of Science: A technique that allows more than one process – stream of activity – to be running at any given moment in a computer system, hence processes can be executed in parallel. This means that two or more processors are active among a group of processes at any instant.
9
Is the term "parallel computer" identical to the term "supercomputer"?
10
A Supercomputer: An extremely high-performance computer that has a large amount of main memory and very fast processors… Often the processors run in parallel.
11
A Supercomputer A definition from: http://www.cray.com/supercomputing
12
What is a Supercomputer? A supercomputer is defined simply as the most powerful class of computers at any point in time. Supercomputers are used to solve large and complex problems which would be insurmountable by smaller, less powerful computers. Since the pioneering Cray-1 ® system arrived in 1976, supercomputers have contributed enormously to the advancement of knowledge and the quality of human life. Problems of major economic, scientific and strategic importance typically are addressed by supercomputers years before becoming tractable on less-capable systems.
13
Why Study Parallel Architecture? Parallelism: provides an alternative to a faster clock for performance; applies at all levels of system design (H/W–S/W integration); is a fascinating topic; is increasingly central in information processing, science and engineering.
14
The Demand for Computational Speed: There is a continual demand for greater computational speed from a computer system than is currently possible. Areas requiring great computational speed include numerical modeling and simulation of scientific and engineering problems. Computations must be completed within a "reasonable" time period.
15
Large Memory Requirements: Use parallel computing to execute larger problems that require more memory than exists on a single computer.
16
Grand Challenge Problems: A grand challenge problem is one that cannot be solved in a reasonable amount of time with today's computers. Obviously, an execution time of 10 years is always unreasonable. Examples: modeling large DNA structures, global weather forecasting, modeling the motion of astronomical bodies.
17
Scientific Computing Demand
18
Cluster Computing – An Example
19
Cluster Computing – Cont'd: Linux NetworX; 11.2 Tflops Linux cluster; 4.6 TB of aggregate memory; 138.2 TB of aggregate local disk space; 1152 total nodes plus separate hot-spare cluster and development cluster; 2,304 Intel 2.4 GHz Xeon processors. http://www.llnl.gov/linux/mcr/
20
Exercise: Assume a galaxy contains 10^11 stars. Estimate the time required to compute 100 iterations, given an O(N^2) computation, on a computer with a compute power of 1 GFLOPS.
21
Solution: For 10^11 stars there are 10^22 interactions. The total number of operations, including 100 iterations, is 10^24. Therefore the computation time will be 10^24 / 10^9 = 10^15 seconds, which is roughly 3×10^7 years.
22
Solution – continued: Computing according to N log(N): taking log base 2, N log2(N) ≈ 10^11 × 36.5 ≈ 3.7×10^12 operations per iteration, i.e. about 3.7×10^14 operations for 100 iterations, or roughly 4 days at 1 GFLOPS. Conclusion: improving the algorithm is usually far more important than adding processors!
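The two estimates above can be reproduced with a short script (a sketch; the slide leaves the logarithm base unspecified, so base 2 is assumed here):

```python
import math

FLOPS = 1e9        # compute power: 1 GFLOPS
N = 1e11           # stars in the galaxy
ITERATIONS = 100

# O(N^2) algorithm: every star interacts with every other star.
ops_n2 = N**2 * ITERATIONS              # ~1e24 operations
seconds_n2 = ops_n2 / FLOPS             # ~1e15 seconds
years_n2 = seconds_n2 / (365 * 24 * 3600)

# O(N log N) algorithm (e.g. a tree code), assuming log base 2.
ops_nlogn = N * math.log2(N) * ITERATIONS
days_nlogn = ops_nlogn / FLOPS / (24 * 3600)

print(f"O(N^2):     {years_n2:.1e} years")   # tens of millions of years
print(f"O(N log N): {days_nlogn:.1f} days")  # a few days
```

The gap between "millions of years" and "days" is the point of the slide: the algorithmic improvement dwarfs anything extra processors could buy.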
23
Technology Trends
24
Clock Frequency Growth Rate
25
Parallelism is good, but it has a price! Not every problem can be parallelized; parallelizing software is not easy; hardware availability; development time versus other alternatives (future technology); cost and manpower.
26
Parallel Architecture Considerations Resource Allocation: – how large a collection? – how powerful are the elements? – how much memory? Data access, Communication and Synchronization – how do the elements cooperate and communicate? – how are data transmitted between processors? – what are the abstractions and primitives for cooperation? Performance and Scalability – how does it all translate into performance? – how does it scale?
27
Conventional Computer
28
Shared Memory System
29
Message-Passing Multi-computer
30
The approach: Divide the problem into parts that can be run in parallel. Each part of the problem is a process that will run on a single processor. To transfer data/results between the processors, messages must be sent between them (Message Passing). Other methods exist as well, and we will discuss them later in the course.
31
Distributed Shared Memory
32
Flynn (1966) Taxonomy SISD - a single instruction stream-single data stream computer. SIMD - a single instruction stream-multiple data stream computer. MIMD - a multiple instruction stream- multiple data stream computer.
33
Multiple Program Multiple Data (MPMD)
34
Single Program Multiple Data (SPMD): A single source program. Each processor executes its own copy of this program, independently and not in synchronism.
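The SPMD idea can be illustrated without any MPI machinery: one program text, parameterized only by a rank. This is only a sketch; the "processors" are simulated by a loop, where a real MPI program would obtain the rank from MPI_Comm_rank and run the copies concurrently.

```python
# SPMD sketch: the SAME program text runs on every processor;
# behavior differs only through the processor's rank.

NPROCS = 4
DATA = list(range(16))

def spmd_program(rank: int, nprocs: int, data: list) -> int:
    """One copy of the single source program, executed per rank."""
    # Each rank independently picks its own slice of the data...
    my_chunk = data[rank::nprocs]
    # ...and computes a partial result on it.
    return sum(my_chunk)

# Run every copy (on a cluster these would run concurrently,
# independently and not in synchronism).
partials = [spmd_program(r, NPROCS, DATA) for r in range(NPROCS)]
total = sum(partials)   # the final "reduction" step
print(partials, total)  # [24, 28, 32, 36] 120
```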
35
Message-Passing Multi-computers
36
The communication network plays a significant role in a parallel computer cluster! In the following slides we review the characteristic parameters of the communication network.
37
Network Criteria – 1/6 Bandwidth Network Latency Communication Latency (H/W+S/W) Message Latency (see next slide)
38
Network Criteria – 2/6: Message Latency. The slide plots the time to send a message against the message size: the intercept of the line is the latency, and the bandwidth is the inverse of the slope. time = latency + (1/rate) × size_of_message. Latency is sometimes described as "time to send a message of zero bytes". This is true only for this simple model; the number quoted is sometimes misleading.
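The linear cost model on this slide can be written directly as code (a sketch; the latency and bandwidth numbers below are made-up illustrative values, not measurements of any real network):

```python
def message_time(size_bytes: float, latency_s: float, bandwidth_Bps: float) -> float:
    """Simple linear model: time = latency + size / bandwidth."""
    return latency_s + size_bytes / bandwidth_Bps

# Illustrative values only: 50 microseconds latency, 100 MB/s bandwidth.
LATENCY = 50e-6
BANDWIDTH = 100e6

small = message_time(0, LATENCY, BANDWIDTH)    # "zero-byte" message: pure latency
large = message_time(1e8, LATENCY, BANDWIDTH)  # 100 MB message: bandwidth-dominated

print(small, large)
```

This makes the slide's caveat concrete: for small messages the latency term dominates, so the "zero-byte" number alone says little about large-message performance.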
39
Network Criteria – 3/6: Bisection Width – the number of links that must be cut in order to divide the network into two equal parts (for the example network on the slide: 2).
40
Network Criteria – 4/6: Diameter – the maximum distance between any two nodes (for the example network on the slide: P/2).
41
Network Criteria – 5/6: Connectivity – the multiplicity of paths between any two nodes (for the example network on the slide: 2).
42
Network Criteria – 6/6: Cost – the number of links (for the example network on the slide: P).
43
Exercise: Compute the properties of a network of P processors that is fully connected, as in the figure:
44
פתרון Diameter = 1 Bisection=p^2/4 Connectivity=p-1 Cost=p(p-1)/2
45
Solution for the bisection – continued: Number of links: p(p-1)/2. Internal links in each half: (p/2)(p/2-1)/2. Internal links in both halves: (p/2)(p/2-1). Number of links being cut: p(p-1)/2 – (p/2)(p/2-1) = p^2/4
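The derivation can be checked by brute force: build the complete graph on p nodes, split it into two equal halves, and count the links crossing the cut (a sketch using only the standard library):

```python
from itertools import combinations

def fully_connected_properties(p: int) -> dict:
    """Network criteria of a fully connected network of p nodes (p even)."""
    nodes = range(p)
    links = list(combinations(nodes, 2))   # every pair of nodes is linked
    half_a = set(range(p // 2))            # one of the two equal halves
    crossing = [(u, v) for u, v in links
                if (u in half_a) != (v in half_a)]  # links cut by the bisection
    return {
        "cost": len(links),          # p(p-1)/2
        "bisection": len(crossing),  # p^2/4
        "diameter": 1,               # every node is a direct neighbor
        "connectivity": p - 1,       # p-1 disjoint paths between any pair
    }

props = fully_connected_properties(8)
print(props)  # cost 28, bisection 16, diameter 1, connectivity 7
```

Counting the crossing links directly reproduces the p^2/4 result of the algebra above.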
46
2D Mesh
47
Example: Intel Paragon
48
A Binary Tree – 1/2
49
A Binary Tree – 2/2 Fat tree: Thinking Machines CM-5, 1993
50
3D Hypercube Network
51
4D Hypercube Network
52
Embedding – 1/2
53
Embedding – 2/2
54
Deadlock
55
Ethernet
56
Ethernet Frame Format
57
Point-to-Point Communication
58
Performance Computation/Communication ratio Speedup Factor Overhead Efficiency Cost Scalability Gustafson’s Law
59
Computation/Communication Ratio
60
Speedup Factor The maximum speedup is n (linear speedup)
61
Speedup and Comp/Comm Ratio: Speedup ≤ Sequential Work / Max (Work + Synch Wait Time + Comm Cost)
62
Overhead – things that limit the speedup: serial parts of the computation; some processors compute while others are idle; communication time for sending messages; extra computation in the parallel version not appearing in the serial version.
63
Amdahl’s Law (1967): S(n) = n / (1 + (n - 1) f), where f is the serial fraction of the computation.
64
Amdahl’s Law – continued: With only 5% of the computation being serial, the maximum speedup is 20, regardless of the number of processors (as n grows, S(n) approaches 1/f = 1/0.05 = 20).
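The saturation of the speedup can be seen numerically (a minimal sketch of the law as stated on the previous slide):

```python
def amdahl_speedup(n: int, serial_fraction: float) -> float:
    """Amdahl's Law: S(n) = n / (1 + (n - 1) * f)."""
    return n / (1 + (n - 1) * serial_fraction)

# With f = 5% serial, the speedup saturates at 1/f = 20
# no matter how many processors are added.
for n in (10, 100, 1000, 100000):
    print(n, round(amdahl_speedup(n, 0.05), 2))
```

Adding processors beyond a certain point buys almost nothing: going from 1,000 to 100,000 processors moves the speedup only from about 19.6 to about 20.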
65
Speedup
66
Efficiency: E = S(n)/n is the fraction of time that the processors are being used. If E = 100% then S(n) = n.
67
Cost: A cost-optimal algorithm is one whose cost (execution time × number of processors) is proportional to the single-processor cost (i.e. execution time).
68
Scalability An imprecise term Reflects H/W and S/W scalability How to get increased performance when the H/W increased? What H/W is needed when problem size (e.g. # cells) is increased? Problem dependent!
69
Gustafson’s Law (1988) – 1/3: Gives an argument against the pessimistic conclusion of Amdahl’s Law. Rather than assume that the problem size is fixed, we should assume that the parallel execution time is fixed. Define a scaled speedup for the case of increasing the number of processors as well as the problem size.
70
Gustafson’s Law – 2/3: S(scaled) = s + (1 - s) n = n - (n - 1) s, where s is the serial fraction measured on the parallel system.
71
Gustafson’s Law – 3/3 An Example: Assume we have n=20 and a serial fraction of s=0.05 S(scaled)=0.05+0.95*20=19.05, while the Speedup according to Amdahl’s Law is: S=20/(0.05(20-1)+1)=10.26
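The worked example on this slide can be checked with a few lines (a sketch of the two laws exactly as given above):

```python
def amdahl(n: int, s: float) -> float:
    """Fixed problem size: S = n / (1 + (n - 1) * s)."""
    return n / (1 + (n - 1) * s)

def gustafson(n: int, s: float) -> float:
    """Fixed parallel execution time: S = s + (1 - s) * n."""
    return s + (1 - s) * n

n, s = 20, 0.05
print(round(gustafson(n, s), 2))  # 19.05
print(round(amdahl(n, s), 2))     # 10.26
```

Same machine, same serial fraction, yet the scaled speedup is nearly double: the two laws answer different questions (fixed problem vs. fixed time).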
72
Exercise: A computer cluster contains 10 processors, each with a compute power of 200 MFLOPS. What is the performance of the cluster, in units of MFLOPS, if 10% of the code were serial and 90% of the code were parallel?
73
Solution: If all the code were parallel, the compute power would be 10 × 200 = 2000 MFLOPS. In our case, 10% of the code is executed by a single computer and the remaining 90% by 10 computers; therefore the speedup is 1 / (0.1 + 0.9/10) ≈ 5.26, and the cluster delivers about 5.26 × 200 ≈ 1053 MFLOPS.
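The same Amdahl-style calculation as code (a sketch; the helper name is ours, not from the slides):

```python
def cluster_mflops(n_procs: int, mflops_each: float, serial_fraction: float) -> float:
    """Effective cluster performance under Amdahl's Law."""
    speedup = 1 / (serial_fraction + (1 - serial_fraction) / n_procs)
    return speedup * mflops_each

perf = cluster_mflops(10, 200, 0.10)
print(round(perf, 1))  # 1052.6
```

Barely half the nominal 2000 MFLOPS is delivered, even though only 10% of the code is serial.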
74
Domain Decomposition: Mapping the problem onto the topology of the parallel cluster; dividing the problem into separate computational units in an optimal way: Load Balance, Granularity.
75
Load Balance – 1/2 All processors must be kept busy! The parallel cluster may not be homogenous (CPUs, memory, users/jobs, network…)
76
Load Balance – 2/2 Static versus Dynamic techniques Static: Algorithmic assignment based on input; won’t change Low runtime overhead Computation must be predictable Preferable when applicable (except in multiprogrammed/heterogeneous environment) Dynamic: Adapt at runtime to balance load Can increase communication and reduce locality Can increase task management overheads
77
Determining Task Granularity: Task granularity is the amount of work associated with a task. General rule: coarse-grained => often less load balance; fine-grained => more overhead, often more communication and contention.
78
Algorithms: Adding 8 Numbers
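The slide's figure is not reproduced here, but the classic parallel approach to this problem is a tree reduction: pairs of numbers are added in parallel, then pairs of partial sums, and so on, taking log2(n) rounds instead of n-1 sequential additions. A minimal sketch (assuming the slide shows this standard scheme):

```python
def tree_sum(values):
    """Sum a list by pairwise (tree) reduction. All additions within one
    round are independent, so they could execute in parallel; 8 numbers
    therefore need only 3 rounds instead of 7 sequential additions."""
    vals = list(values)
    rounds = 0
    while len(vals) > 1:
        # Add adjacent pairs; carry over an unpaired last element.
        vals = [vals[i] + vals[i + 1] for i in range(0, len(vals) - 1, 2)] + \
               ([vals[-1]] if len(vals) % 2 else [])
        rounds += 1
    return vals[0], rounds

total, rounds = tree_sum([1, 2, 3, 4, 5, 6, 7, 8])
print(total, rounds)  # 36 3
```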
79
Summary – Terms Defined – 1 Flynn Taxonomy Message Passing Shared Memory Bandwidth Latency Bisection Width Diameter Connectivity Cost Meshes, Trees, Hypercubes… Deadlock
80
Summary – Terms Defined – 2 Embedding Process Amdahl’s Law Speedup Factor Efficiency Cost Scalability Gustafson’s Law Load Balance
81
Next Week's Class: The next class will take place in the computer lab, 3rd floor of the Electrical and Computer Engineering building, room 330. Don't forget to open an account on the parallel cluster and on the classroom computers (Email)!!! A student who does not open a computer account will not be able to do the exercise!!!
82
Task #2: From http://www.lam-mpi.org/tutorials/ download and print the file "MPI quick reference sheet". Linux tutorial: http://www.ctssn.com/, learn at least lessons 1, 2 and 3.
83
Cluster Computing: COTS – Commodity Off-The-Shelf; free O/S, e.g. Linux; LOBOS – Lots Of Boxes On the Shelf; PCs connected by a fast network.
84
The Dwarves 1/5 12 (+2) PCs of several types Red Hat Linux 6.0-6.2 Fast Ethernet – 100Mbps Myrinet Network 1.28+1.28Gbps, SAN
85
The Dwarves – 2/5 There are 12 computers with Linux operating system. dwarf[1-12] or dwarf[1-12]m dwarf1[m], dwarf3[m]-dwarf7[m] - Pentium II 300 MHz, dwarf9[m]-dwarf12[m] - Pentium III 450 MHz (dual CPU), dwarf2[m], dwarf8[m] - Pentium III 733 MHz (dual CPU).
86
The Dwarves – 3/5 6 PII at 300MHz processors 8 PIII at 450MHz processors 4 PIII at 733MHz processors Total: 18 processors, ~8GFlops
87
The Dwarves – 4/5 dwarf1..dwarf12 – node names for the Fast Ethernet link; dwarf1m..dwarf12m – node names for the Myrinet network
88
The Dwarves 5/5 GNU FORTRAN / C Compilers PVM / MPI
89
Cluster Computing - 1
90
Cluster Computing - 2
91
Cluster Computing - 3
92
Cluster Computing - 4
93
Linux http://www.ee.bgu.ac.il/~tel-zur/linux.html
94
Linux In Google: Linux: 38,600,000 Microsoft: 21,500,000 Bible: 7,590,000
95
BGU Cray “Negev”