CPE 779 Parallel Computing - Spring 2010
Lecture 1: Introduction
Walid Abu-Sufah, University of Jordan
Course website: http://www1.ju.edu.jo/ecourse/abusufah/cpe779_Spr10/index.html
Acknowledgment: Collaboration
This course is being offered in collaboration with:
- The IMPACT research group at the University of Illinois, http://impact.crhc.illinois.edu/
- The Universal Parallel Computing Research Center (UPCRC) at the University of Illinois, http://www.upcrc.illinois.edu/
- The Computation-based Science and Technology Research Center (CSTRC) of the Cyprus Institute, http://cstrc.cyi.ac.cy/
Acknowledgment: Slides
Some of the slides used in this course are based on slides by:
- Kathy Yelick, University of California at Berkeley, http://www.cs.berkeley.edu/~yelick/cs194f07
- Jim Demmel, University of California at Berkeley, and Horst Simon, Lawrence Berkeley National Laboratory (LBNL), http://www.cs.berkeley.edu/~demmel/cs267_Spr10/
- Wen-mei Hwu and Sanjay Patel, University of Illinois, and David Kirk, NVIDIA Corporation, http://courses.ece.illinois.edu/ece498/al/
Course Motivation
In the last few years, conventional sequential processors have stopped getting faster.
- Previously, clock speed doubled roughly every 18 months.
All computers will be parallel, which means all programs will have to become parallel programs.
- Especially programs that need to run faster.
Course Motivation (continued)
There will be a huge change in the entire computing industry.
- Previously, the industry depended on selling new computers that ran users' existing programs faster without requiring any reprogramming.
Multicore and many-core chips have started a revolution in the software industry.
Course Motivation (continued)
Large research efforts to address this issue are underway.
Computer companies: Intel, Microsoft, NVIDIA, IBM, etc.
- Parallel programming is a concern for the entire computing industry.
Universities:
- Berkeley's ParLab (2008: $20 million grant)
- The Universal Parallel Computing Research Center at the University of Illinois (2008: $20 million grant)
Course Goals
The purpose of this course is to teach students the skills needed to develop applications that can take advantage of on-chip parallelism.
Part 1 (~4 weeks): techniques that are most appropriate for multicore architectures and the use of parallelism to improve program performance. Topics include:
- performance analysis and tuning
- data and task parallelism
- shared data structures
- load balancing
- synchronization
Course Goals (continued - I)
Part 2 (~12 weeks): provide students with knowledge and hands-on experience in developing application software for massively parallel processors (100s or 1000s of cores).
- Use NVIDIA GPUs and the CUDA programming language.
- To program these processors effectively, students will acquire in-depth knowledge of data parallel programming principles, parallelism models, communication models, and the resource limitations of these processors.
Outline of rest of lecture
- Why powerful computers must use parallel processors (all computers, including your laptops and handhelds)
- Examples of Computational Science and Engineering (CSE) problems which require powerful computers (commercial problems too)
- Why writing (fast) parallel programs is hard (but things are improving)
- Principles of parallel computing performance
- Structure of the course
What is Parallel Computing?
Parallel computing: using multiple processors in parallel to solve problems (execute applications) more quickly than with a single processor.
Examples of parallel machines:
- A cluster computer: multiple PCs combined with a high speed network
- A shared memory multiprocessor (SMP*): multiple processors connected to a single memory system
- A chip multiprocessor (CMP): multiple processors (called cores) on a single chip
Concurrent execution comes from the desire for performance.
* Technically, SMP stands for "Symmetric Multi-Processor".
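As a concrete illustration (an assumed example, not from the slides), the sketch below uses OpenMP, one of the programming models covered later in the course, to split a simple summation across the cores of a chip multiprocessor. The array size and the use of a reduction are illustrative choices.

```c
/* A minimal sketch (illustrative, not from the slides): summing an array
   in parallel with OpenMP on a multicore (CMP) machine.
   Build with: gcc -fopenmp -O2 sum.c -o sum */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void) {
    const long n = 10000000;              /* illustrative problem size */
    double *a = malloc(n * sizeof *a);
    for (long i = 0; i < n; i++) a[i] = 1.0;

    double sum = 0.0;
    double t0 = omp_get_wtime();
    /* Each core adds up a share of the array; the reduction combines
       the per-thread partial sums into one result. */
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < n; i++)
        sum += a[i];
    double t1 = omp_get_wtime();

    printf("sum = %.0f computed with up to %d threads in %.4f s\n",
           sum, omp_get_max_threads(), t1 - t0);
    free(a);
    return 0;
}
```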
Units of Measure
High Performance Computing (HPC) units:
- flop: floating point operation
- flop/s: floating point operations per second
- byte: size of data (a double precision floating point number is 8 bytes)
Typical sizes are millions, billions, trillions, ...
- Mega: Mflop/s = 10^6 flop/s; Mbyte = 2^20 = 1,048,576 ≈ 10^6 bytes
- Giga: Gflop/s = 10^9 flop/s; Gbyte = 2^30 ≈ 10^9 bytes
- Tera: Tflop/s = 10^12 flop/s; Tbyte = 2^40 ≈ 10^12 bytes
- Peta: Pflop/s = 10^15 flop/s; Pbyte = 2^50 ≈ 10^15 bytes
- Exa: Eflop/s = 10^18 flop/s; Ebyte = 2^60 ≈ 10^18 bytes
- Zetta: Zflop/s = 10^21 flop/s; Zbyte = 2^70 ≈ 10^21 bytes
- Yotta: Yflop/s = 10^24 flop/s; Ybyte = 2^80 ≈ 10^24 bytes
Current fastest (public) machine: ~2.3 Pflop/s
Up-to-date list at www.top500.org
High Performance Computing, HPC
Parallel computers have been used for decades, mostly in computational science, engineering, business, and defense.
- Problems too large to solve on one processor: use 100s or 1000s of processors.
Examples of challenging computations in science:
- Global climate modeling
- Biology: genomics, protein folding, drug design
- Astrophysical modeling
- Computational chemistry
- Computational materials science and nanoscience
High Performance Computing, HPC (continued)
Examples of challenging computations in engineering:
- Semiconductor design
- Earthquake and structural modeling
- Computational fluid dynamics (airplane design)
- Combustion (engine design)
- Crash simulation
Examples of challenging computations in business:
- Financial and economic modeling
- Transaction processing, web services, and search engines
Examples of challenging computations in defense:
- Nuclear weapons: test by simulation
- Cryptography
Economic Impact of HPC
Airlines:
- Logistics optimization systems run on parallel computers.
- Savings: approximately $100 million per airline per year.
Automotive design:
- Major automotive companies use large systems (500+ CPUs) for CAD-CAM, crash testing, structural integrity, and aerodynamics. One company has a 500+ CPU parallel system.
- Savings: approximately $1 billion per company per year.
Semiconductor industry:
- Semiconductor firms use large systems (500+ CPUs) for device electronics simulation and logic validation.
- Savings: approximately $1 billion per company per year.
Securities industry (note: old data):
- Savings: approximately $15 billion per year for U.S. home mortgages.
Why powerful computers are parallel (as of 2007: all of them)
What is New in Parallel Computing Now?
In the 80s and 90s many companies "bet" on parallel computing and failed.
- Computers got faster too quickly for there to be a large market.
What is new now?
- The entire computing industry has bet on parallelism.
- There is a desperate need for parallel programmers.
Let's see why...
Technology Trends: Microprocessor Capacity
Moore's Law: 2x transistors per chip every 1.5 years.
- Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months.
- Microprocessors have become smaller, denser, and more powerful.
Slide source: Jack Dongarra
Microprocessor Transistors and Clock Rate
(Figures: growth in transistors per chip; increase in clock rate)
In 2002: Why bother with parallel programming? Just wait a year or two...
Limit #1: Power Density
(Figure: power density in W/cm^2 vs. year, 1970-2010, for Intel processors from the 4004 through the Pentium and P6; reference levels include a hot plate, a nuclear reactor, a rocket nozzle, and the Sun's surface. Source: Patrick Gelsinger, Intel)
Scaling clock speed (business as usual) will not work.
"Can soon put more transistors on a chip than can afford to turn on." -- Patterson '07
Limit #2: Hidden Parallelism Tapped Out
(Figure: uniprocessor performance growth; VAX: 25%/year from 1978 to 1986; RISC + x86: 52%/year from 1986 to 2002. From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006)
Application performance was increasing by 52% per year, as measured by the SpecInt benchmarks.
- About half of this was due to increasing transistor density.
- About half was due to architecture changes, e.g., Instruction Level Parallelism (ILP).
Limit #2: Hidden Parallelism Tapped Out
Superscalar (SS) designs were the state of the art; many forms of parallelism were not visible to the programmer:
- multiple instruction issue
- dynamic scheduling: hardware discovers parallelism between instructions
- speculative execution: look past predicted branches
- non-blocking caches: multiple outstanding memory operations
Unfortunately, these sources of hidden parallelism have been used up.
More Limits: How fast can a serial computer be?
Consider a sequential machine with 1 Tflop/s performance and 1 Tbyte of memory (a single processor executing 10^12 flop/s with 10^12 bytes of memory):
- Data must travel some distance, r, to get from memory to processor.
- distance = speed x time; the maximum speed at which data can travel is the speed of light, 3x10^8 m/s.
- To get 1 data element per cycle, the time available is 1/10^12 seconds.
- Thus r < (3x10^8) / 10^12 = 0.3 mm.
Now put 1 Tbyte of storage in a 0.3 mm x 0.3 mm area:
- Each bit occupies about 1 square Angstrom, the size of a small atom.
No choice but parallelism.
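To make the arithmetic above easy to reproduce, here is a small sketch (illustrative, not from the slides) that recomputes the speed-of-light bound and the resulting area per bit.

```c
/* A small sketch (illustrative, not from the slides) that recomputes the
   speed-of-light argument: at 1 Tflop/s, data fetched once per cycle can
   travel at most r = c * t, which bounds the machine's physical size. */
#include <stdio.h>

int main(void) {
    const double c = 3.0e8;          /* speed of light, m/s */
    const double rate = 1.0e12;      /* 1 Tflop/s => 1e12 cycles/s (one flop per cycle) */
    const double bytes = 1.0e12;     /* 1 Tbyte of memory */

    double t = 1.0 / rate;           /* time per cycle, seconds */
    double r = c * t;                /* max distance data can travel per cycle, meters */
    double area = r * r;             /* all memory must fit in an r x r square */
    double area_per_bit = area / (bytes * 8.0);

    printf("r = %.1e m (%.2f mm)\n", r, r * 1e3);
    printf("area per bit = %.2e m^2 (about %.1f square Angstroms)\n",
           area_per_bit, area_per_bit / 1e-20);
    return 0;
}
```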
Revolution is Happening Now
- Chip density is continuing to increase ~2x every 2 years.
- Clock speed is not increasing.
- The number of processor cores may double instead.
- There is little or no hidden parallelism (ILP) left to be found.
- Parallelism must be exposed to and managed by software.
Source: Intel, Microsoft (Sutter) and Stanford (Olukotun, Hammond)
Parallelism in 2010?
- All major processor vendors are producing multicore chips.
- Every machine will soon be a parallel machine.
- To keep doubling performance, parallelism must double.
Which commercial applications can use this parallelism?
- Do they have to be rewritten from scratch?
Will all programmers have to be parallel programmers?
- A new software model is needed.
- Try to hide complexity from most programmers, eventually.
- In the meantime, we need to understand it.
The computer industry is betting on this big change, but it does not have all the answers.
- Berkeley's ParLab and the University of Illinois UPCRC were established to work on this.
Moore's Law reinterpreted
- The number of cores per chip will double every two years.
- Clock speed will not increase (and may decrease).
- We need to deal with systems with millions of concurrent threads.
- We need to deal with inter-chip parallelism as well as intra-chip parallelism.
Outline
- Why powerful computers must be parallel processors (all computers, including your laptop)
- Why writing (fast) parallel programs is hard
- Principles of parallel computing performance
Why writing (fast) parallel programs is hard
Principles of Parallel Computing
- Finding enough parallelism (Amdahl's Law)
- Granularity
- Locality
- Load balance
- Coordination and synchronization
- Performance modeling
All of these things make parallel programming harder than sequential programming.
Finding Enough Parallelism: Amdahl's Law
- T_1 = execution time using 1 processor (serial execution time)
- T_P = execution time using P processors
- S = serial fraction of the computation (i.e., the fraction that can only be executed on 1 processor)
- C = fraction of the computation that can be executed by P processors
Then S + C = 1 and
  T_P = S*T_1 + (C*T_1)/P = (S + C/P)*T_1
Speedup: Ψ(P) = T_1 / T_P = 1 / (S + C/P) ≤ 1/S
Maximum speedup (i.e., when P = ∞): Ψ_max = 1/S. Example: S = 0.05 gives Ψ_max = 20.
Currently the fastest machine has ~224,000 processors.
Even if the parallel part speeds up perfectly, performance is limited by the sequential part.
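A small sketch (illustrative, not from the slides) that evaluates the formula above for a few processor counts; the serial fraction S = 0.05 matches the example on the slide.

```c
/* A minimal sketch (illustrative, not from the slides): evaluating Amdahl's
   Law, speedup(P) = 1 / (S + C/P) with C = 1 - S, for the slide's example
   serial fraction S = 0.05. */
#include <stdio.h>

/* Predicted speedup on p processors for serial fraction s. */
static double amdahl_speedup(double s, long p) {
    double c = 1.0 - s;              /* parallelizable fraction */
    return 1.0 / (s + c / (double)p);
}

int main(void) {
    const double s = 0.05;           /* serial fraction from the slide's example */
    long procs[] = {1, 4, 16, 64, 1024, 224000};
    for (int i = 0; i < 6; i++)
        printf("P = %6ld  speedup = %6.2f  (limit 1/S = %.0f)\n",
               procs[i], amdahl_speedup(s, procs[i]), 1.0 / s);
    return 0;
}
```

Even at ~224,000 processors the predicted speedup stays just below the 1/S = 20 limit, which is the point of the slide.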
Speedup Barriers: (a) Overhead of Parallelism
Even given enough parallel work, overhead is a big barrier to getting the desired speedup.
Parallelism overheads include:
- the cost of starting a thread or process
- the cost of communicating shared data
- the cost of synchronizing
- extra (redundant) computation
Each of these can be in the range of milliseconds on some systems (= millions of flops).
Tradeoff: the algorithm needs sufficiently large units of work to run fast in parallel (i.e., large granularity), but not so large that there is not enough parallel work.
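As an illustration of the granularity tradeoff (an assumed example, not from the slides), the sketch below times the same OpenMP parallel loop for a tiny and a large amount of work; for the tiny loop, thread startup and synchronization overhead can dominate the useful computation.

```c
/* A minimal sketch (illustrative, not from the slides): parallel overhead vs.
   granularity. The same OpenMP loop is timed for a tiny and a large problem
   size; for the tiny one, thread creation/synchronization overhead can exceed
   the useful work. Build with: gcc -fopenmp -O2 overhead.c -o overhead */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

static void timed_parallel_sum(const double *a, long n) {
    double sum = 0.0;
    double t0 = omp_get_wtime();
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < n; i++)
        sum += a[i];
    double t1 = omp_get_wtime();
    printf("n = %9ld  time = %.6f s  (sum = %.0f)\n", n, t1 - t0, sum);
}

int main(void) {
    const long big = 20000000, small = 1000;   /* illustrative sizes */
    double *a = malloc(big * sizeof *a);
    for (long i = 0; i < big; i++) a[i] = 1.0;

    timed_parallel_sum(a, small);   /* overhead-dominated: too little work per thread */
    timed_parallel_sum(a, big);     /* work-dominated: overhead is amortized */
    free(a);
    return 0;
}
```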
31
Speedup Barriers: (b) Working on Non Local Data Large memories are slow, fast memories are small Parallel processors, collectively, have large, fast cache the slow accesses to “remote” data we call “communication” Algorithm should do most work on local data Proc Cache L2 Cache L3 Cache Memory Conventional Storage Hierarchy Proc Cache L2 Cache L3 Cache Memory Proc Cache L2 Cache L3 Cache Memory potential interconnects 31CPE 779 Parallel Computing - Spring 2010
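To make the locality point concrete (an assumed example, not from the slides), the sketch below sums a matrix stored in row-major order twice: once walking along rows (contiguous, cache-friendly accesses) and once along columns (strided, cache-unfriendly accesses).

```c
/* A minimal sketch (illustrative, not from the slides): the cost of poor
   locality. C stores matrices in row-major order, so the row-wise walk
   touches memory contiguously while the column-wise walk strides through
   it and misses in the cache far more often.
   Build with: gcc -fopenmp -O2 locality.c -o locality */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N 4096   /* illustrative matrix dimension */

int main(void) {
    double *m = malloc((size_t)N * N * sizeof *m);
    for (long i = 0; i < (long)N * N; i++) m[i] = 1.0;

    double t0 = omp_get_wtime(), sum = 0.0;
    for (int i = 0; i < N; i++)          /* row-wise: unit stride, cache-friendly */
        for (int j = 0; j < N; j++)
            sum += m[(long)i * N + j];
    double t1 = omp_get_wtime();

    double sum2 = 0.0;
    for (int j = 0; j < N; j++)          /* column-wise: stride N, cache-unfriendly */
        for (int i = 0; i < N; i++)
            sum2 += m[(long)i * N + j];
    double t2 = omp_get_wtime();

    printf("row-wise: %.3f s, column-wise: %.3f s (sums %.0f, %.0f)\n",
           t1 - t0, t2 - t1, sum, sum2);
    free(m);
    return 0;
}
```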
Speedup Barriers: (c) Load Imbalance
Load imbalance occurs when some processors in the system are idle due to:
- insufficient parallelism (during that phase)
- unequal task sizes
The algorithm needs to balance the load.
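One common remedy (an assumed example, not from the slides) is dynamic scheduling: instead of assigning each thread a fixed block of iterations, idle threads grab the next chunk of work, which helps when task sizes are unequal.

```c
/* A minimal sketch (illustrative, not from the slides): balancing unequal
   task sizes with OpenMP's dynamic schedule. With a static schedule, the
   thread that owns the expensive iterations finishes last while the others
   sit idle; schedule(dynamic) hands out iterations on demand.
   Build with: gcc -fopenmp -O2 balance.c -o balance */
#include <stdio.h>
#include <omp.h>

/* Artificial task whose cost grows with its index, so the work is unequal. */
static double task(int i) {
    double x = 0.0;
    for (long k = 0; k < (long)i * 10000; k++)
        x += 1e-9 * (double)k;
    return x;
}

int main(void) {
    const int ntasks = 512;          /* illustrative number of tasks */
    double total = 0.0;

    double t0 = omp_get_wtime();
    #pragma omp parallel for schedule(dynamic, 8) reduction(+:total)
    for (int i = 0; i < ntasks; i++)
        total += task(i);
    double t1 = omp_get_wtime();

    printf("dynamic schedule: %.3f s (total = %g)\n", t1 - t0, total);
    return 0;
}
```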
Course Mechanics
Web page: http://www1.ju.edu.jo/ecourse/abusufah/cpe779_spr10/index.html
Grading:
- Five programming assignments
- Final projects (proposals due Wednesday, April 21); a project could be parallelizing an application, developing an application using CUDA/OpenCL, or performance-model-driven tuning of a parallel application on a multicore and/or a GPU
- Midterm: Wednesday, April 7, 5:30-6:45
- Final: Thursday, May 27, 5:30-7:30
Rough List of Topics (for details see the syllabus)
- Basics of computer architecture, memory hierarchies, performance
- Parallel programming models and machines
  - shared memory and multithreading
  - data parallelism, GPUs
- Parallel languages and libraries
  - OpenMP
  - CUDA
- General techniques
  - load balancing, performance modeling and tools
- Applications
Reading Materials: Required Textbooks
- David B. Kirk and Wen-mei W. Hwu, Programming Massively Parallel Processors: A Hands-on Approach, Morgan Kaufmann, February 5, 2010. (Most chapters are available online in draft form; visit http://courses.ece.illinois.edu/ece498/al/Syllabus.html)
- Calvin Lin and Larry Snyder, Principles of Parallel Programming, Addison-Wesley, 2009.
Reading Materials: References
- Ian Foster, Designing and Building Parallel Programs, Addison-Wesley (available online at http://www.mcs.anl.gov/~itf/dbpp/)
- Randima Fernando, GPU Gems: Programming Techniques, Tips and Tricks for Real-Time Graphics, Addison-Wesley Professional, 2004 (available online on the NVIDIA Developer Site; see http://http.developer.nvidia.com/GPUGems/gpugems_part01.html)
- Grama, A., Gupta, A., Karypis, G., and Kumar, V., Introduction to Parallel Computing, Second Edition, Addison-Wesley, 2003.
Reading Materials: Tutorials
See the course website for tutorials and other online resources.