1
Introduction to Parallel Processing, Dr. Guy Tel-Zur, Lecture Slides No. 1
2
Introduction to Parallel Processing. Course number: 361-1-3621. Course website: http://www.ee.bgu.ac.il/~tel-zur/2003/pp.html
3
Course Objectives: The goal of this course is to provide an in-depth introduction to modern parallel processing. The course will cover both theoretical and practical aspects of parallel processing.
4
Course structure: introduction; parallelization techniques; parallel applications in scientific and engineering computing; hands-on practice; additional topics (enrichment, future trends...)
5
Plan for the first lecture: an introduction to "Introduction to Parallel Computing"; basic concepts; a short description of the parallel cluster on which the exercises will run next week and throughout the course.
6
Let's begin…
7
What is parallel {computing, processing}? Parallel Computing, Parallel Processing, Cluster Computing, Beowulf Clusters, HPC (High Performance Computing)
8
Oxford Dictionary of Science: A technique that allows more than one process – stream of activity – to be running at any given moment in a computer system, hence processes can be executed in parallel. This means that two or more processors are active among a group of processes at any instant.
9
Is the term "parallel computer" identical to the term "supercomputer"?
10
A Supercomputer: An extremely high-performance computer that has a large amount of main memory and very fast processors… Often the processors run in parallel.
11
A Supercomputer A definition from: http://www.cray.com/supercomputing
12
What is a Supercomputer? A supercomputer is defined simply as the most powerful class of computers at any point in time. Supercomputers are used to solve large and complex problems which would be insurmountable by smaller, less powerful computers. Since the pioneering Cray-1 ® system arrived in 1976, supercomputers have contributed enormously to the advancement of knowledge and the quality of human life. Problems of major economic, scientific and strategic importance typically are addressed by supercomputers years before becoming tractable on less-capable systems.
13
Why Study Parallel Architecture? Parallelism: provides an alternative to a faster clock for performance; applies at all levels of system design (H/W–S/W integration); is a fascinating topic; is increasingly central in information processing, science and engineering.
14
The Demand for Computational Speed: There is a continual demand for greater computational speed from a computer system than is currently possible. Areas requiring great computational speed include numerical modeling and simulation of scientific and engineering problems. Computations must be completed within a "reasonable" time period.
15
Large Memory Requirements: Use parallel computing to execute larger problems that require more memory than exists on a single computer.
16
Grand Challenge Problems: A grand challenge problem is one that cannot be solved in a reasonable amount of time with today's computers. Obviously, an execution time of 10 years is always unreasonable. Examples: modeling large DNA structures, global weather forecasting, modeling the motion of astronomical bodies.
17
Scientific Computing Demand
18
Cluster Computing – An Example
19
Cluster Computing – Cont'd: Linux NetworX; 11.2 Tflops Linux cluster; 4.6 TB of aggregate memory; 138.2 TB of aggregate local disk space; 1152 total nodes plus separate hot-spare cluster and development cluster; 2,304 Intel 2.4 GHz Xeon processors. http://www.llnl.gov/linux/mcr/
20
Exercise: Assume a galaxy contains 10^11 stars. Estimate the time required to compute 100 iterations, given an O(N^2) computation, on a computer with a compute power of 1 GFLOPS.
21
Solution: For 10^11 stars there are 10^22 interactions. The total number of operations, including 100 iterations, is 10^24. Therefore the computation time will be 10^24 / 10^9 = 10^15 seconds, which is roughly 3×10^7 years.
22
Solution – continued: Computing according to N log(N): taking log base 2, N log2(N) ≈ 10^11 × 36.5 ≈ 3.7×10^12 operations per iteration, i.e. about 3.7×10^14 operations for 100 iterations, or roughly 4 days at 1 GFLOPS. Conclusion: improving the algorithm is usually far more important than adding processors!
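The two estimates above can be reproduced with a short script (a sketch; the slide leaves the logarithm base unspecified, so base 2 is assumed here):

```python
import math

FLOPS = 1e9        # compute power: 1 GFLOPS
N = 1e11           # stars in the galaxy
ITERATIONS = 100

# O(N^2) algorithm: every star interacts with every other star.
ops_n2 = N**2 * ITERATIONS              # ~1e24 operations
seconds_n2 = ops_n2 / FLOPS             # ~1e15 seconds
years_n2 = seconds_n2 / (365 * 24 * 3600)

# O(N log N) algorithm (e.g. a tree code), assuming log base 2.
ops_nlogn = N * math.log2(N) * ITERATIONS
days_nlogn = ops_nlogn / FLOPS / (24 * 3600)

print(f"O(N^2):     {years_n2:.1e} years")   # tens of millions of years
print(f"O(N log N): {days_nlogn:.1f} days")  # a few days
```

The gap between "millions of years" and "days" is the point of the slide: the algorithmic improvement dwarfs anything extra processors could buy.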
23
Technology Trends
24
Clock Frequency Growth Rate
25
Parallelism is good, but it has a price! Not every problem can be parallelized; parallelizing software is not easy; hardware availability; development time versus other alternatives (future technology); cost and manpower.
26
Parallel Architecture Considerations Resource Allocation: – how large a collection? – how powerful are the elements? – how much memory? Data access, Communication and Synchronization – how do the elements cooperate and communicate? – how are data transmitted between processors? – what are the abstractions and primitives for cooperation? Performance and Scalability – how does it all translate into performance? – how does it scale?
27
Conventional Computer
28
Shared Memory System
29
Message-Passing Multi-computer
30
The approach: Divide the problem into parts that can be run in parallel. Each part of the problem is a process that will run on a single processor. To transfer data/results between the processors, messages must be sent between them (Message Passing). Other methods exist as well, and we will discuss them later in the course.
31
Distributed Shared Memory
32
Flynn (1966) Taxonomy SISD - a single instruction stream-single data stream computer. SIMD - a single instruction stream-multiple data stream computer. MIMD - a multiple instruction stream- multiple data stream computer.
33
Multiple Program Multiple Data (MPMD)
34
Single Program Multiple Data (SPMD): A single source program. Each processor executes its own copy of this program, independently and not in synchronism.
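The SPMD idea can be illustrated without any MPI machinery: one program text, parameterized only by a rank. This is only a sketch; the "processors" are simulated by a loop, where a real MPI program would obtain the rank from MPI_Comm_rank and run the copies concurrently.

```python
# SPMD sketch: the SAME program text runs on every processor;
# behavior differs only through the processor's rank.

NPROCS = 4
DATA = list(range(16))

def spmd_program(rank: int, nprocs: int, data: list) -> int:
    """One copy of the single source program, executed per rank."""
    # Each rank independently picks its own slice of the data...
    my_chunk = data[rank::nprocs]
    # ...and computes a partial result on it.
    return sum(my_chunk)

# Run every copy (on a cluster these would run concurrently,
# independently and not in synchronism).
partials = [spmd_program(r, NPROCS, DATA) for r in range(NPROCS)]
total = sum(partials)   # the final "reduction" step
print(partials, total)  # [24, 28, 32, 36] 120
```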
35
Message-Passing Multi-computers
36
The communication network plays a significant role in a parallel computer cluster! In the following slides we review the characteristic parameters of the communication network.
37
Network Criteria – 1/6 Bandwidth Network Latency Communication Latency (H/W+S/W) Message Latency (see next slide)
38
Network Criteria – 2/6: Message Latency. The slide plots the time to send a message against the message size: the intercept of the line is the latency, and the bandwidth is the inverse of the slope. time = latency + (1/rate) × size_of_message. Latency is sometimes described as "time to send a message of zero bytes". This is true only for this simple model; the number quoted is sometimes misleading.
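The linear cost model on this slide can be written directly as code (a sketch; the latency and bandwidth numbers below are made-up illustrative values, not measurements of any real network):

```python
def message_time(size_bytes: float, latency_s: float, bandwidth_Bps: float) -> float:
    """Simple linear model: time = latency + size / bandwidth."""
    return latency_s + size_bytes / bandwidth_Bps

# Illustrative values only: 50 microseconds latency, 100 MB/s bandwidth.
LATENCY = 50e-6
BANDWIDTH = 100e6

small = message_time(0, LATENCY, BANDWIDTH)    # "zero-byte" message: pure latency
large = message_time(1e8, LATENCY, BANDWIDTH)  # 100 MB message: bandwidth-dominated

print(small, large)
```

This makes the slide's caveat concrete: for small messages the latency term dominates, so the "zero-byte" number alone says little about large-message performance.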
39
Network Criteria – 3/6: Bisection Width – the number of links that must be cut in order to divide the network into two equal parts (for the example network on the slide: 2).
40
Network Criteria – 4/6: Diameter – the maximum distance between any two nodes (for the example network on the slide: P/2).
41
Network Criteria – 5/6: Connectivity – the multiplicity of paths between any two nodes (for the example network on the slide: 2).
42
Network Criteria – 6/6: Cost – the number of links (for the example network on the slide: P).
43
Exercise: Compute the properties of a network of P processors that is fully connected, as in the figure:
44
פתרון Diameter = 1 Bisection=p^2/4 Connectivity=p-1 Cost=p(p-1)/2
45
Solution for the bisection – continued: Number of links: p(p-1)/2. Internal links in each half: (p/2)(p/2-1)/2. Internal links in both halves: (p/2)(p/2-1). Number of links being cut: p(p-1)/2 – (p/2)(p/2-1) = p^2/4
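The derivation can be checked by brute force: build the complete graph on p nodes, split it into two equal halves, and count the links crossing the cut (a sketch using only the standard library):

```python
from itertools import combinations

def fully_connected_properties(p: int) -> dict:
    """Network criteria of a fully connected network of p nodes (p even)."""
    nodes = range(p)
    links = list(combinations(nodes, 2))   # every pair of nodes is linked
    half_a = set(range(p // 2))            # one of the two equal halves
    crossing = [(u, v) for u, v in links
                if (u in half_a) != (v in half_a)]  # links cut by the bisection
    return {
        "cost": len(links),          # p(p-1)/2
        "bisection": len(crossing),  # p^2/4
        "diameter": 1,               # every node is a direct neighbor
        "connectivity": p - 1,       # p-1 disjoint paths between any pair
    }

props = fully_connected_properties(8)
print(props)  # cost 28, bisection 16, diameter 1, connectivity 7
```

Counting the crossing links directly reproduces the p^2/4 result of the algebra above.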
46
2D Mesh
47
Example: Intel Paragon
48
A Binary Tree – 1/2
49
A Binary Tree – 2/2 Fat tree: Thinking Machines CM-5, 1993
50
3D Hypercube Network
51
4D Hypercube Network
52
Embedding – 1/2
53
Embedding – 2/2
54
Deadlock
55
Ethernet
56
Ethernet Frame Format
57
Point-to-Point Communication
58
Performance Computation/Communication ratio Speedup Factor Overhead Efficiency Cost Scalability Gustafson’s Law
59
Computation/Communication Ratio
60
Speedup Factor The maximum speedup is n (linear speedup)
61
Speedup and Comp/Comm Ratio: Speedup ≤ Sequential Work / Max (Work + Synch Wait Time + Comm Cost)
62
Overhead – things that limit the speedup: serial parts of the computation; some processors compute while others are idle; communication time for sending messages; extra computation in the parallel version not appearing in the serial version.
63
Amdahl’s Law (1967): S(n) = n / (1 + (n - 1) f), where f is the serial fraction of the computation.
64
Amdahl’s Law – continued: With only 5% of the computation being serial, the maximum speedup is 20, regardless of the number of processors (as n grows, S(n) approaches 1/f = 1/0.05 = 20).
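The saturation of the speedup can be seen numerically (a minimal sketch of the law as stated on the previous slide):

```python
def amdahl_speedup(n: int, serial_fraction: float) -> float:
    """Amdahl's Law: S(n) = n / (1 + (n - 1) * f)."""
    return n / (1 + (n - 1) * serial_fraction)

# With f = 5% serial, the speedup saturates at 1/f = 20
# no matter how many processors are added.
for n in (10, 100, 1000, 100000):
    print(n, round(amdahl_speedup(n, 0.05), 2))
```

Adding processors beyond a certain point buys almost nothing: going from 1,000 to 100,000 processors moves the speedup only from about 19.6 to about 20.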
65
Speedup
66
Efficiency: E = S(n)/n is the fraction of time that the processors are being used. If E = 100% then S(n) = n.
67
Cost: A cost-optimal algorithm is one whose cost (execution time × number of processors) is proportional to the single-processor cost (i.e. execution time).
68
Scalability An imprecise term Reflects H/W and S/W scalability How to get increased performance when the H/W increased? What H/W is needed when problem size (e.g. # cells) is increased? Problem dependent!
69
Gustafson’s Law (1988) – 1/3: Gives an argument against the pessimistic conclusion of Amdahl’s Law. Rather than assume that the problem size is fixed, we should assume that the parallel execution time is fixed. Define a scaled speedup for the case of increasing the number of processors as well as the problem size.
70
Gustafson’s Law – 2/3: S(scaled) = s + (1 - s) n = n - (n - 1) s, where s is the serial fraction measured on the parallel system.
71
Gustafson’s Law – 3/3 An Example: Assume we have n=20 and a serial fraction of s=0.05 S(scaled)=0.05+0.95*20=19.05, while the Speedup according to Amdahl’s Law is: S=20/(0.05(20-1)+1)=10.26
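The worked example on this slide can be checked with a few lines (a sketch of the two laws exactly as given above):

```python
def amdahl(n: int, s: float) -> float:
    """Fixed problem size: S = n / (1 + (n - 1) * s)."""
    return n / (1 + (n - 1) * s)

def gustafson(n: int, s: float) -> float:
    """Fixed parallel execution time: S = s + (1 - s) * n."""
    return s + (1 - s) * n

n, s = 20, 0.05
print(round(gustafson(n, s), 2))  # 19.05
print(round(amdahl(n, s), 2))     # 10.26
```

Same machine, same serial fraction, yet the scaled speedup is nearly double: the two laws answer different questions (fixed problem vs. fixed time).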
72
Exercise: A computer cluster contains 10 processors, each with a compute power of 200 MFLOPS. What is the performance of the cluster, in units of MFLOPS, if 10% of the code were serial and 90% of the code were parallel?
73
Solution: If all the code were parallel, the compute power would be 10 × 200 = 2000 MFLOPS. In our case, 10% of the code is executed by a single computer and the remaining 90% by 10 computers; therefore the speedup is 1 / (0.1 + 0.9/10) ≈ 5.26, and the cluster delivers about 5.26 × 200 ≈ 1053 MFLOPS.
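The same Amdahl-style calculation as code (a sketch; the helper name is ours, not from the slides):

```python
def cluster_mflops(n_procs: int, mflops_each: float, serial_fraction: float) -> float:
    """Effective cluster performance under Amdahl's Law."""
    speedup = 1 / (serial_fraction + (1 - serial_fraction) / n_procs)
    return speedup * mflops_each

perf = cluster_mflops(10, 200, 0.10)
print(round(perf, 1))  # 1052.6
```

Barely half the nominal 2000 MFLOPS is delivered, even though only 10% of the code is serial.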
74
Domain Decomposition: Mapping the problem onto the topology of the parallel cluster; dividing the problem into separate computational units in an optimal way: Load Balance, Granularity.
75
Load Balance – 1/2 All processors must be kept busy! The parallel cluster may not be homogenous (CPUs, memory, users/jobs, network…)
76
Load Balance – 2/2 Static versus Dynamic techniques Static: Algorithmic assignment based on input; won’t change Low runtime overhead Computation must be predictable Preferable when applicable (except in multiprogrammed/heterogeneous environment) Dynamic: Adapt at runtime to balance load Can increase communication and reduce locality Can increase task management overheads
77
Determining Task Granularity: Task granularity is the amount of work associated with a task. General rule: coarse-grained => often less load balance; fine-grained => more overhead, often more communication and contention.
78
Algorithms: Adding 8 Numbers
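The slide's figure is not reproduced here, but the classic parallel approach to this problem is a tree reduction: pairs of numbers are added in parallel, then pairs of partial sums, and so on, taking log2(n) rounds instead of n-1 sequential additions. A minimal sketch (assuming the slide shows this standard scheme):

```python
def tree_sum(values):
    """Sum a list by pairwise (tree) reduction. All additions within one
    round are independent, so they could execute in parallel; 8 numbers
    therefore need only 3 rounds instead of 7 sequential additions."""
    vals = list(values)
    rounds = 0
    while len(vals) > 1:
        # Add adjacent pairs; carry over an unpaired last element.
        vals = [vals[i] + vals[i + 1] for i in range(0, len(vals) - 1, 2)] + \
               ([vals[-1]] if len(vals) % 2 else [])
        rounds += 1
    return vals[0], rounds

total, rounds = tree_sum([1, 2, 3, 4, 5, 6, 7, 8])
print(total, rounds)  # 36 3
```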
79
Summary – Terms Defined – 1 Flynn Taxonomy Message Passing Shared Memory Bandwidth Latency Bisection Width Diameter Connectivity Cost Meshes, Trees, Hypercubes… Deadlock
80
Summary – Terms Defined – 2 Embedding Process Amdahl’s Law Speedup Factor Efficiency Cost Scalability Gustafson’s Law Load Balance
81
Next Week's Class: The next class will take place in the computer lab, 3rd floor of the Electrical and Computer Engineering building, room 330. Don't forget to open an account on the parallel cluster and on the classroom computers (Email)!!! A student who does not open a computer account will not be able to do the exercise!!!
82
Task #2: From http://www.lam-mpi.org/tutorials/ download and print the file "MPI quick reference sheet". Linux tutorial: http://www.ctssn.com/, learn at least lessons 1, 2 and 3.
83
Cluster Computing: COTS – Commodity Off-The-Shelf; free O/S, e.g. Linux; LOBOS – Lots Of Boxes On the Shelf; PCs connected by a fast network.
84
The Dwarves 1/5 12 (+2) PCs of several types Red Hat Linux 6.0-6.2 Fast Ethernet – 100Mbps Myrinet Network 1.28+1.28Gbps, SAN
85
The Dwarves – 2/5 There are 12 computers with Linux operating system. dwarf[1-12] or dwarf[1-12]m dwarf1[m], dwarf3[m]-dwarf7[m] - Pentium II 300 MHz, dwarf9[m]-dwarf12[m] - Pentium III 450 MHz (dual CPU), dwarf2[m], dwarf8[m] - Pentium III 733 MHz (dual CPU).
86
The Dwarves – 3/5 6 PII at 300MHz processors 8 PIII at 450MHz processors 4 PIII at 733MHz processors Total: 18 processors, ~8GFlops
87
The Dwarves – 4/5 dwarf1..dwarf12 – node names for the Fast Ethernet link; dwarf1m..dwarf12m – node names for the Myrinet network
88
The Dwarves 5/5 GNU FORTRAN / C Compilers PVM / MPI
89
Cluster Computing - 1
90
Cluster Computing - 2
91
Cluster Computing - 3
92
Cluster Computing - 4
93
Linux http://www.ee.bgu.ac.il/~tel-zur/linux.html
94
Linux In Google: Linux: 38,600,000 Microsoft: 21,500,000 Bible: 7,590,000
95
BGU Cray “Negev”