CPE 779 Parallel Computing Lecture 1: Introduction


CPE 779 Parallel Computing, Lecture 1: Introduction
Walid Abu-Sufah, University of Jordan
CPE 779 Parallel Computing - Spring 2010

Acknowledgment: Collaboration
This course is being offered in collaboration with:
-The IMPACT research group at the University of Illinois
-The Universal Parallel Computing Research Center (UPCRC) at the University of Illinois
-The Computation-based Science and Technology Research Center (CSTRC) of the Cyprus Institute

Acknowledgment: Slides
Some of the slides used in this course are based on slides by:
-Kathy Yelick, University of California at Berkeley
-Jim Demmel, University of California at Berkeley, and Horst Simon, Lawrence Berkeley National Lab (LBNL)
-Wen-mei Hwu and Sanjay Patel of the University of Illinois, and David Kirk, NVIDIA Corporation

Course Motivation
In the last few years, conventional sequential processors have stopped getting faster.
-Previously, clock speed doubled roughly every 18 months.
All computers will be parallel >>> all programs will have to become parallel programs
-Especially programs that need to run faster.

Course Motivation (continued)
There will be a huge change in the entire computing industry.
-Previously, the industry depended on selling new computers that ran users' existing programs faster without the users having to reprogram them.
Multi-core and many-core chips have started a revolution in the software industry.

Course Motivation (continued)
Large research activities are underway to address this issue.
Computer companies: Intel, Microsoft, NVIDIA, IBM, etc.
-Parallel programming is a concern for the entire computing industry.
Universities:
-Berkeley's ParLab (2008: $20 million grant)
-The Universal Parallel Computing Research Center of the University of Illinois (2008: $20 million grant)

Course Goals
The purpose of this course is to teach students the skills needed to develop applications that can take advantage of on-chip parallelism.
Part 1 (~4 weeks): focus on the techniques that are most appropriate for multicore architectures and on using parallelism to improve program performance. Topics include:
-performance analysis and tuning
-data techniques
-shared data structures
-load balancing and task parallelism
-synchronization

Course Goals (continued - I)
Part 2 (~12 weeks): provide students with knowledge and hands-on experience in developing applications software for massively parallel processors (100s or 1000s of cores).
-Use NVIDIA GPUs and the CUDA programming language.
-To program these processors effectively, students will acquire in-depth knowledge about:
 -data parallel programming principles
 -parallelism models
 -communication models
 -resource limitations of these processors

Outline of the rest of the lecture
-Why powerful computers must use parallel processors (all of them, including your laptops and handhelds)
-Examples of Computational Science and Engineering (CSE) problems which require powerful computers (commercial problems too)
-Why writing (fast) parallel programs is hard (but things are improving)
-Principles of parallel computing performance
-Structure of the course

What is Parallel Computing?
Parallel computing: using multiple processors in parallel to solve problems (execute applications) more quickly than with a single processor.
Examples of parallel machines:
-A cluster computer: multiple PCs combined with a high-speed network
-A shared memory multiprocessor (SMP*): multiple processors connected to a single memory system
-A chip multi-processor (CMP): multiple processors (called cores) on a single chip
Concurrent execution comes from the desire for performance.
* Technically, SMP stands for "Symmetric Multi-Processor"
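Illustration (not from the original slides): a minimal sketch in C of what "using multiple processors to finish the same work sooner" looks like in practice, assuming a compiler with OpenMP support; the problem size n is made up for the example.

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void) {
    const long n = 10000000L;            /* made-up problem size */
    double *a = malloc(n * sizeof *a);
    if (!a) return 1;

    double sum = 0.0;
    double t0 = omp_get_wtime();

    /* The loop iterations are divided among the available cores;
       each core's partial sum is combined by the reduction clause. */
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < n; i++) {
        a[i] = 0.5 * (double)i;
        sum += a[i];
    }

    printf("sum = %g, time = %g s, up to %d threads\n",
           sum, omp_get_wtime() - t0, omp_get_max_threads());
    free(a);
    return 0;
}

Compiled with OpenMP enabled (e.g., gcc -O2 -fopenmp), the same source runs the loop on however many cores the machine provides, which is exactly the SMP/CMP case described above.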

Units of Measure
High Performance Computing (HPC) units are:
-Flop: floating point operation
-Flop/s: floating point operations per second
-Bytes: size of data (a double precision floating point number is 8 bytes)
Typical sizes are millions, billions, trillions...
-Mega: Mflop/s = 10^6 flop/sec; Mbyte = 2^20 ~ 10^6 bytes
-Giga: Gflop/s = 10^9 flop/sec; Gbyte = 2^30 ~ 10^9 bytes
-Tera: Tflop/s = 10^12 flop/sec; Tbyte = 2^40 ~ 10^12 bytes
-Peta: Pflop/s = 10^15 flop/sec; Pbyte = 2^50 ~ 10^15 bytes
-Exa: Eflop/s = 10^18 flop/sec; Ebyte = 2^60 ~ 10^18 bytes
-Zetta: Zflop/s = 10^21 flop/sec; Zbyte = 2^70 ~ 10^21 bytes
-Yotta: Yflop/s = 10^24 flop/sec; Ybyte = 2^80 ~ 10^24 bytes
Current fastest (public) machine: ~2.3 Pflop/s; an up-to-date list is maintained online.
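To make the flop-rate units concrete, here is a small illustrative sketch (not from the slides) that estimates the Mflop/s achieved by a simple vector update; the vector length and the count of 2 flops per iteration (one multiply, one add) are assumptions of the example.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    const long n = 20000000L;            /* made-up vector length */
    double *x = malloc(n * sizeof *x);
    double *y = malloc(n * sizeof *y);
    if (!x || !y) return 1;
    for (long i = 0; i < n; i++) { x[i] = 1.0; y[i] = 2.0; }

    clock_t t0 = clock();
    for (long i = 0; i < n; i++)
        y[i] = y[i] + 3.0 * x[i];        /* 2 flops per iteration */
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

    /* 2*n flops were done in 'secs' seconds; report the rate in
       Mflop/s, i.e., units of 10^6 floating point operations per second. */
    printf("%.1f Mflop/s (y[0] = %g)\n", 2.0 * n / secs / 1e6, y[0]);
    free(x); free(y);
    return 0;
}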

High Performance Computing (HPC)
Parallel computers have been used for decades, mostly in computational science, engineering, business, and defense.
-Problems too large to solve on one processor; use 100s or 1000s of processors.
Examples of challenging computations in science:
-Global climate modeling
-Biology: genomics; protein folding; drug design
-Astrophysical modeling
-Computational chemistry
-Computational materials science and nanoscience

High Performance Computing, HPC (continued)
Examples of challenging computations in engineering:
-Semiconductor design
-Earthquake and structural modeling
-Computational fluid dynamics (airplane design)
-Combustion (engine design)
-Crash simulation
Examples of challenging computations in business:
-Financial and economic modeling
-Transaction processing, web services, and search engines
Examples of challenging computations in defense:
-Nuclear weapons -- test by simulations
-Cryptography

Economic Impact of HPC
Airlines:
-Logistics optimization systems run on parallel computers.
-Savings: approx. $100 million per airline per year.
Automotive design:
-Major automotive companies use large systems (500+ CPUs) for CAD-CAM, crash testing, structural integrity, and aerodynamics.
-One company has a 500+ CPU parallel system.
-Savings: approx. $1 billion per company per year.
Semiconductor industry:
-Semiconductor firms use large systems (500+ CPUs) for device electronics simulation and logic validation.
-Savings: approx. $1 billion per company per year.
Securities industry (note: old data):
-Savings: approx. $15 billion per year for U.S. home mortgages.

Why powerful computers are parallel: all of them (2007)

What is New in Parallel Computing Now?
In the 80s and 90s many companies "bet" on parallel computing and failed.
-Computers got faster too quickly for there to be a large market.
What is new now? The entire computing industry has bet on parallelism.
-Let's see why...
There is a desperate need for parallel programmers.

Technology Trends: Microprocessor Capacity
2X transistors per chip every 1.5 years: called "Moore's Law."
Microprocessors have become smaller, denser, and more powerful.
Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months.
Slide source: Jack Dongarra

Microprocessor Transistors and Clock Rate
[Figures: growth in transistors per chip; increase in clock rate]
In 2002: Why bother with parallel programming? Just wait a year or two...

Limit #1: Power Density
[Figure: power density (W/cm^2) of Intel processors by year, heading toward that of a hot plate, a nuclear reactor, a rocket nozzle, and the Sun's surface. Source: Patrick Gelsinger, Intel]
Scaling clock speed (business as usual) will not work.
"Can soon put more transistors on a chip than can afford to turn on." -- Patterson '07

Limit #2: Hidden Parallelism Tapped Out
[Figure: uniprocessor performance over time. VAX: 25%/year, 1978 to 1986; RISC + x86: 52%/year, 1986 to 2002. From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006]
Application performance was increasing by 52% per year, as measured by the SpecInt benchmarks:
-about ½ due to transistor density
-about ½ due to architecture changes, e.g., Instruction Level Parallelism (ILP)

Limit #2: Hidden Parallelism Tapped Out (continued)
Superscalar (SS) designs were the state of the art; many forms of parallelism were not visible to the programmer:
-multiple instruction issue
-dynamic scheduling: hardware discovers parallelism between instructions
-speculative execution: look past predicted branches
-non-blocking caches: multiple outstanding memory operations
Unfortunately, these sources of hidden parallelism have been used up.

More Limits: How Fast Can a Serial Computer Be?
Consider a 1 Tflop/s, 1 Tbyte sequential machine (a single processor executing 10^12 flop/sec with 10^12 bytes of memory):
-Data must travel some distance, r, to get from memory to processor: r = speed x time.
-The maximum speed at which data can travel is the speed of light, 3x10^8 m/s.
-To get 1 data element per cycle, time = 1/10^12 seconds.
-Thus r < (3x10^8)/10^12 m = 0.3 mm.
Now put 1 Tbyte of storage in a 0.3 mm x 0.3 mm area:
-Each bit occupies about 1 square Angstrom, the size of a small atom.
No choice but parallelism.
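A worked restatement of the calculation above, using the slide's numbers and taking 1 Tbyte as 10^12 bytes = 8x10^12 bits:

\[
r \;<\; c \cdot t \;=\; \left(3\times 10^{8}\ \mathrm{m/s}\right)\times\left(10^{-12}\ \mathrm{s}\right) \;=\; 3\times 10^{-4}\ \mathrm{m} \;=\; 0.3\ \mathrm{mm}
\]
\[
\text{area per bit} \;\approx\; \frac{(3\times 10^{-4}\ \mathrm{m})^{2}}{8\times 10^{12}} \;\approx\; 1.1\times 10^{-20}\ \mathrm{m}^{2} \;\approx\; 1\ \text{square Angstrom}
\]

That is roughly the cross-section of a small atom, so a 1 Tbyte memory cannot realistically be packed into that area, and the single-processor design has nowhere to go but parallelism.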

Revolution is Happening Now
-Chip density is continuing to increase ~2x every 2 years.
-Clock speed is not increasing.
-The number of processor cores may double instead.
-There is little or no hidden parallelism (ILP) left to be found.
-Parallelism must be exposed to and managed by software.
Source: Intel, Microsoft (Sutter) and Stanford (Olukotun, Hammond)

Parallelism in 2010?
-All major processor vendors are producing multicore chips: every machine will soon be a parallel machine.
-To keep doubling performance, parallelism must double.
-Which commercial applications can use this parallelism? Do they have to be rewritten from scratch?
-Will all programmers have to be parallel programmers?
 -A new software model is needed.
 -Try to hide complexity from most programmers -- eventually.
 -In the meantime, we need to understand it.
-The computer industry is betting on this big change, but does not have all the answers.
 -Berkeley's ParLab and the University of Illinois UPCRC were established to work on this.

Moore's Law Reinterpreted
-The number of cores per chip will double every two years.
-Clock speed will not increase (and may decrease).
-We will need to deal with systems with millions of concurrent threads.
-We will need to deal with inter-chip parallelism as well as intra-chip parallelism.

Outline
-Why powerful computers must be parallel processors (all of them, including your laptop)
-Why writing (fast) parallel programs is hard
-Principles of parallel computing performance

Why writing (fast) parallel programs is hard

Principles of Parallel Computing
-Finding enough parallelism (Amdahl's Law)
-Granularity
-Locality
-Load balance
-Coordination and synchronization
-Performance modeling
All of these things make parallel programming harder than sequential programming.

Finding Enough Parallelism: Amdahl's Law
Let:
-T1 = execution time using 1 processor (serial execution time)
-Tp = execution time using P processors
-S = serial fraction of the computation (i.e., the fraction that can only be executed on 1 processor)
-C = fraction of the computation that can be executed by P processors
Then S + C = 1, and
  Tp = S*T1 + (C*T1)/P = (S + C/P)*T1
  Speedup = Ψ(P) = T1/Tp = 1/(S + C/P) <= 1/S
Maximum speedup (i.e., when P = infinity): Ψmax = 1/S. Example: S = 0.05 gives Ψmax = 20.
Currently the fastest machine has ~224,000 processors.
Even if the parallel part speeds up perfectly, performance is limited by the sequential part.
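The small C program below (an illustrative sketch, not part of the original slides) evaluates the speedup formula above for a few processor counts, using the slide's example of a 5% serial fraction; the processor counts chosen are arbitrary.

#include <stdio.h>

/* Amdahl's Law: speedup with p processors when a fraction s is serial. */
static double amdahl_speedup(double s, double p) {
    return 1.0 / (s + (1.0 - s) / p);
}

int main(void) {
    const double s = 0.05;                         /* serial fraction from the slide */
    const double procs[] = {1, 4, 16, 64, 1024, 224000};
    const int np = (int)(sizeof procs / sizeof procs[0]);

    for (int i = 0; i < np; i++)
        printf("P = %8.0f  speedup = %6.2f\n", procs[i], amdahl_speedup(s, procs[i]));

    /* As P grows, the speedup approaches 1/s = 20, no matter how many
       processors are added. */
    return 0;
}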

Speedup Barriers: (a) Overhead of Parallelism
Even given enough parallel work, overhead is a big barrier to getting the desired speedup.
Parallelism overheads include:
-cost of starting a thread or process
-cost of communicating shared data
-cost of synchronizing
-extra (redundant) computation
Each of these can be in the range of milliseconds (= millions of flops) on some systems.
Tradeoff: the algorithm needs sufficiently large units of work to run fast in parallel (i.e., large granularity), but not so large that there is not enough parallel work.
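To illustrate the granularity tradeoff, here is a sketch (assuming OpenMP; the array dimensions are made up): putting the parallel directive on the outer loop gives each thread a large unit of work, while putting it on the tiny inner loop would pay the thread startup and synchronization overhead once per row.

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void) {
    const long nrows = 1000000, ncols = 8;          /* made-up problem size */
    double *a = malloc(nrows * ncols * sizeof *a);
    if (!a) return 1;
    for (long k = 0; k < nrows * ncols; k++) a[k] = 1.0;

    /* Good granularity: the pragma is on the outer loop, so the cost of
       forking and joining the thread team is paid once and each thread
       gets a large contiguous block of rows. */
    #pragma omp parallel for
    for (long i = 0; i < nrows; i++)
        for (long j = 0; j < ncols; j++)
            a[i * ncols + j] *= 2.0;

    /* Poor granularity (for contrast): putting the pragma on the
       8-iteration inner loop instead would pay that overhead a
       million times for almost no work per parallel region. */

    printf("a[0] = %g\n", a[0]);
    free(a);
    return 0;
}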

Speedup Barriers: (b) Working on Non-Local Data
-Large memories are slow; fast memories are small.
-Parallel processors, collectively, have a large, fast cache.
-The slow accesses to "remote" data are what we call "communication."
-The algorithm should do most of its work on local data.
[Figure: conventional storage hierarchy -- each processor with its own cache, L2 cache, L3 cache, and memory, with potential interconnects between the memories]
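A minimal sketch of what "do most work on local data" means in practice (illustrative, not from the slides): traversing a matrix in the order it is laid out in memory keeps accesses in the fast, local levels of the hierarchy, while striding across rows forces slow trips to main memory. The matrix size N is an assumption of the example.

#include <stdio.h>
#include <stdlib.h>

#define N 4096

int main(void) {
    /* Row-major N x N matrix stored as one contiguous, zero-initialized block. */
    double *a = calloc((size_t)N * N, sizeof *a);
    if (!a) return 1;
    double sum = 0.0;

    /* Cache-friendly: consecutive iterations touch consecutive memory
       locations, so most accesses hit in the nearby fast levels. */
    for (long i = 0; i < N; i++)
        for (long j = 0; j < N; j++)
            sum += a[i * N + j];

    /* Cache-unfriendly (for contrast): indexing a[j * N + i] instead
       strides by N doubles each step and misses the cache constantly. */

    printf("sum = %g\n", sum);
    free(a);
    return 0;
}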

Speedup Barriers: (c) Load Imbalance
Load imbalance occurs when some processors in the system are idle, due to:
-insufficient parallelism (during that phase)
-unequal-size tasks
The algorithm needs to balance the load.
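One common way to balance unequal-size tasks, sketched below under the assumption of an OpenMP compiler (the do_task function and the task sizes are made up for illustration): dynamic scheduling hands out iterations a few at a time, so threads that finish small tasks early pick up more work instead of sitting idle.

#include <stdio.h>
#include <omp.h>

/* Hypothetical task whose cost grows with its index, so a static
   block partition would leave the low-index threads idle at the end. */
static double do_task(int t) {
    double x = 0.0;
    for (long k = 0; k < (long)t * 1000; k++)
        x += 1.0 / (k + 1.0);
    return x;
}

int main(void) {
    const int ntasks = 1000;
    double total = 0.0;

    /* schedule(dynamic, 4): each thread grabs 4 tasks at a time and
       comes back for more, which evens out the unequal task sizes. */
    #pragma omp parallel for schedule(dynamic, 4) reduction(+:total)
    for (int t = 0; t < ntasks; t++)
        total += do_task(t);

    printf("total = %g (%d threads available)\n", total, omp_get_max_threads());
    return 0;
}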

Course Mechanics
-Web page:
Grading:
-Five programming assignments
-Final project (proposals due Wednesday, April 21). Possible projects:
 -parallelizing an application
 -developing an application using CUDA/OpenCL
 -performance-model-driven tuning of a parallel application on a multicore and/or a GPU
-Midterm: Wednesday, April 7, 5:30-6:45
-Final: Thursday, May 27, 5:30-7:30

Rough List of Topics (for details see the syllabus)
-Basics of computer architecture, memory hierarchies, performance
-Parallel programming models and machines
 -Shared memory and multithreading
 -Data parallelism, GPUs
-Parallel languages and libraries
 -OpenMP
 -CUDA
-General techniques
 -Load balancing, performance modeling and tools
-Applications

Reading Materials: Textbooks (required)
-David B. Kirk and Wen-mei W. Hwu, Programming Massively Parallel Processors: A Hands-on Approach, Morgan Kaufmann, February 5, 2010. (Most chapters are available online in draft form.)
-Calvin Lin and Larry Snyder, Principles of Parallel Programming, Addison-Wesley, 2009.

Reading Materials: References
-Ian Foster, Designing and Building Parallel Programs, Addison-Wesley. (Available online.)
-Randima Fernando, GPU Gems: Programming Techniques, Tips and Tricks for Real-Time Graphics, Addison-Wesley Professional, 2004. (Available online on the NVIDIA Developer site.)
-Grama, A., Gupta, A., Karypis, G., and Kumar, V., Introduction to Parallel Computing, Second Edition, Addison-Wesley, 2003.

Reading Materials: Tutorials
See the course website for tutorials and other online resources.