CS61C: Machine Structures
Lecture #28: Parallel Computing
Andy Carle, Summer 2005 © UCB
inst.eecs.berkeley.edu/~cs61c/su05

Scientific Computing
° Traditional Science
 1) Produce theories and designs on "paper"
 2) Perform experiments or build systems
 - This has become difficult, expensive, slow, and dangerous for fields on the leading edge
° Computational Science
 - Use ultra-high-performance computers to simulate the systems we are interested in
° Acknowledgement
 - Many of the concepts and some of the content of this lecture were drawn from Prof. Jim Demmel's CS 267 lecture slides

Example Applications
° Science
 - Global climate modeling
 - Biology: genomics, protein folding, drug design
 - Astrophysical modeling
 - Computational chemistry
 - Computational materials science and nanoscience
° Engineering
 - Semiconductor design
 - Earthquake and structural modeling
 - Computational fluid dynamics (airplane design)
 - Combustion (engine design)
 - Crash simulation
° Business
 - Financial and economic modeling
 - Transaction processing, web services, and search engines
° Defense
 - Nuclear weapons: test by simulation
 - Cryptography

Performance Requirements
° Terminology
 - Flop: floating-point operation
 - Flops/second: standard metric for expressing the computing power of a system
° Global Climate Modeling
 - Divide the world into a grid (e.g., 10 km spacing)
 - Solve fluid dynamics equations to determine what the air does at each grid point every minute
  - Requires about 100 Flops per grid point per minute
 - This is an extremely simplified view of how the atmosphere works; to be maximally effective you would need to simulate many additional systems on a much finer grid

Performance Requirements (2)
° Computational Requirements
 - To keep up with real time (i.e., simulate one minute per wall-clock minute): 8 Gflops/sec
 - Weather prediction (7 days in 24 hours): 56 Gflops/sec
 - Climate prediction (50 years in 30 days): 4.8 Tflops/sec
 - Climate prediction experimentation (50 years in 12 hours): 288 Tflops/sec
° Perspective
 - Pentium 4, 1.4 GHz, 1 GB RAM, 4x100 MHz FSB
  - ~320 Mflops/sec, effective
  - Climate prediction would take ~1233 years

What Can We Do?
° Wait
 - Moore's Law tells us things are getting better; why not stall for the moment?
° Parallel Computing!

Prohibitive Costs
° Rock's Law
 - The cost of building a semiconductor chip fabrication plant that is capable of producing chips in line with Moore's Law doubles every four years

How Fast Can a Serial Computer Be?
° Consider a 1 Tflop/sec sequential machine:
 - Data must travel some distance, r, to get from memory to the CPU
 - To get 1 data element per cycle, data must reach the CPU 10^12 times per second at the speed of light, c = 3x10^8 m/s
 - Thus r < c / 10^12 = 0.3 mm
 - So all of the data we want to process must be stored within 0.3 mm of the CPU
° Now put 1 Tbyte of storage in a 0.3 mm x 0.3 mm area:
 - Each word occupies about 3 square Angstroms, the size of a very small atom
 - Maybe someday, but it most certainly isn't going to involve transistors as we know them

What is Parallel Computing?
° Dividing a task among multiple processors to arrive at a unified (meaningful) solution
 - For today, we will focus on systems with many processors executing identical code
° How is this different from multiprogramming (which we've touched on some in this course)?
° How is this different from distributed computing?

Recent History
° Parallel computing as a field exploded in popularity in the mid-1990s
° This resulted in an "arms race" between universities, research labs, and governments to have the fastest supercomputer in the world
° Source: top500.org

Current Champions
° BlueGene/L (IBM/DOE), Rochester, United States
 - 0.7 GHz PowerPC 440 processors
° Columbia (NASA/Ames), Mountain View, United States
 - 1.5 GHz SGI Altix processors
° Earth Simulator (Earth Simulator Ctr.), Yokohama, Japan
 - 5120 processors, SX6 Vector
° Data source: top500.org

Administrivia
° Proj 4 due Friday
° HW8 (optional) due Friday
° Final exam on Friday
 - Yeah, sure, you can have 3 one-sided cheat sheets
 - But I really don't think they'll help you all that much
° Course survey in lab today

Parallel Programming
° Processes and Synchronization
° Processor Layout
° Other Challenges
 - Locality
 - Finding parallelism
 - Parallel overhead
 - Load balance

Processes
° We need a mechanism to intelligently split the execution of a program
° Fork:

    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        int pid = fork();              /* creates a second, nearly identical process */
        if (pid == 0)
            printf("I am the child.");
        if (pid != 0)
            printf("I am the parent.");
        return 0;
    }

° What will this print?

Processes (2)
° We don't know! Two potential orderings:
 - I am the child.I am the parent.
 - I am the parent.I am the child.
 - This situation is a simple race condition. This type of problem can get far more complicated…
° Modern parallel compilers and runtime environments hide the details of actually calling fork() and moving the processes to individual processors, but the complexity of synchronization remains
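(Not from the original slides: one minimal way to remove the race above on a POSIX system is to have the parent block on waitpid() until the child has printed, as in the sketch below. This is our own illustrative example.)

    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void) {
        pid_t pid = fork();
        if (pid == 0) {
            printf("I am the child.");
            return 0;                     /* child exits after printing */
        }
        waitpid(pid, NULL, 0);            /* parent blocks until the child finishes */
        printf("I am the parent.");       /* output order is now deterministic */
        return 0;
    }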

Synchronization
° How do processors communicate with each other?
° How do processors know when to communicate with each other?
° How do processors know which other processor has the information they need?
° When you are done computing, which processor, or processors, have the answer?

Synchronization (2)
° Some of the logistical complexity of these operations is reduced by standard communication frameworks
 - Message Passing Interface (MPI)
° Sorting out the issue of who holds what data can be made easier with the use of explicitly parallel languages
 - Unified Parallel C (UPC)
 - Titanium (a parallel Java variant)
° Even with these tools, much of the skill and challenge of parallel programming is in resolving these problems
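(The slides do not show MPI code; as a rough illustration of the message-passing style, here is a minimal sketch of our own in C in which the process with rank 1 sends one integer to rank 0.)

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        int rank, size, value;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which process am I? */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many processes total? */

        if (rank == 1) {
            value = 42;
            MPI_Send(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);   /* send to rank 0 */
        } else if (rank == 0) {
            MPI_Recv(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);                          /* wait for rank 1 */
            printf("Rank 0 of %d received %d from rank 1\n", size, value);
        }

        MPI_Finalize();
        return 0;
    }

With a typical MPI installation this would be compiled with mpicc and launched with something like mpirun -np 2; it needs at least two processes to do anything.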

Processor Layout: Generalized View
° [Diagram: processors (P), each with a local memory (M), connected through an interconnection network to "Memory"]
 - M = memory local to one processor
 - Memory = memory local to all other processors

Processor Layout (2)
° Shared Memory: [Diagram: processors P1 … Pn, each with a cache ($), attached to a common bus and a single shared memory]
° Distributed Memory: [Diagram: processors P0, P1, …, Pn, each with its own memory and network interface (NI), connected by an interconnect]

Processor Layout (3)
° Clusters of SMPs
 - n of the N total processors share one memory
 - Simple shared-memory communication within one cluster of n processors
 - Explicit network-type calls to communicate from one group of n to another
° Understanding the processor layout that your application will be running on is crucial!

Parallel Locality
° We now have to expand our view of the memory hierarchy to include remote machines
° Remote memory is reached over a network
 - Bandwidth vs. latency becomes important
° [Diagram: memory hierarchy of Regs, Cache, Memory, Remote Memory, and Local and Remote Disk, with Instr. Operands, Blocks, Cache Blocks, and Large Data Blocks transferred between levels]

Amdahl's Law
° Applications can almost never be completely parallelized
° Let s be the fraction of work done sequentially, so (1-s) is the fraction that can be parallelized, and let P = number of processors

    Speedup(P) = Time(1) / Time(P)
              <= 1 / (s + (1-s)/P)
              <= 1/s

° Even if the parallel portion of your application speeds up perfectly, your performance may be limited by the sequential portion
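(A quick worked check of the bound, using our own numbers rather than anything from the slides: with s = 0.1 and P = 100, Speedup <= 1/(0.1 + 0.9/100) ≈ 9.2, and no number of processors can push it past 1/s = 10. The small C sketch below just evaluates the bound for a few processor counts.)

    #include <stdio.h>

    /* Amdahl's Law: upper bound on speedup for sequential fraction s and P processors */
    double amdahl_bound(double s, int P) {
        return 1.0 / (s + (1.0 - s) / P);
    }

    int main(void) {
        double s = 0.1;                       /* assume 10% of the work is sequential */
        int procs[] = {10, 100, 1000};
        for (int i = 0; i < 3; i++)
            printf("P = %4d  ->  speedup <= %.2f (limit 1/s = %.1f)\n",
                   procs[i], amdahl_bound(s, procs[i]), 1.0 / s);
        return 0;
    }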

Parallel Overhead
° Given enough parallel work, these overheads are the biggest barriers to getting the desired speedup
° Parallelism overheads include:
 - cost of starting a thread or process
 - cost of communicating shared data
 - cost of synchronizing
 - extra (redundant) computation
° Each of these can be in the range of milliseconds (many millions of flops) on some systems
° Tradeoff: the algorithm needs sufficiently large units of work to run fast in parallel (i.e., large granularity), but not so large that there is not enough parallel work
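(To make "cost of starting a thread or process" concrete, here is a small timing sketch of our own, not from the slides: it measures the average cost of fork() plus waitpid() on a POSIX system. The trial count and the use of gettimeofday() are arbitrary choices.)

    #include <stdio.h>
    #include <sys/time.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void) {
        struct timeval start, end;
        const int trials = 1000;

        gettimeofday(&start, NULL);
        for (int i = 0; i < trials; i++) {
            pid_t pid = fork();
            if (pid == 0)
                _exit(0);               /* child does no work; we only time process startup */
            waitpid(pid, NULL, 0);      /* parent waits so processes don't pile up */
        }
        gettimeofday(&end, NULL);

        double usec = (end.tv_sec - start.tv_sec) * 1e6 + (end.tv_usec - start.tv_usec);
        printf("Average fork+wait overhead: %.1f microseconds\n", usec / trials);
        return 0;
    }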

Load Balance
° Load imbalance is the time that some processors in the system are idle due to:
 - insufficient parallelism (during that phase)
 - unequal-size tasks
° Examples of the latter:
 - adapting to "interesting parts of a domain"
 - tree-structured computations
 - fundamentally unstructured problems
° Algorithms need to carefully balance the load

Summary
° Parallel computing is a multi-billion-dollar industry driven by interesting and useful scientific computing applications
° It is extremely unlikely that sequential computing will ever again catch up with the processing power of parallel systems
° Programming parallel systems can be extremely challenging, but it builds upon many of the concepts you've learned this semester in 61C