Slide 1: Part I - Fundamental Concepts
Slide 2: About This Presentation

This presentation is intended to support the use of the textbook Introduction to Parallel Processing: Algorithms and Architectures (Plenum Press, 1999, ISBN 0-306-45970-1). It was prepared by the author in connection with teaching the graduate-level course ECE 254B: Advanced Computer Architecture: Parallel Processing, at the University of California, Santa Barbara. Instructors can use these slides in classroom teaching and for other educational purposes. Any other use is strictly prohibited. © Behrooz Parhami

Edition: First. Released: Spring 2005. Revised: Spring 2006, Fall 2008, Fall 2010, Winter 2013, Winter 2014.
Slide 3: Part I - Fundamental Concepts

Topics in This Part
Chapter 1: Introduction to Parallelism
Chapter 2: A Taste of Parallel Algorithms
Chapter 3: Parallel Algorithm Complexity
Chapter 4: Models of Parallel Processing
Slide 4: 1 Introduction to Parallelism

Topics in This Chapter
1.1 Why Parallel Processing?
1.2 A Motivating Example
1.3 Parallel Processing Ups and Downs
1.4 Types of Parallelism: A Taxonomy
1.5 Roadblocks to Parallel Processing
1.6 Effectiveness of Parallel Processing
Slide 5: Some Resources

1. Our textbook; followed closely in lectures:
   Parhami, B., Introduction to Parallel Processing: Algorithms and Architectures, Plenum Press, 1999
2. Recommended book; complementary software topics:
   Herlihy, M. and N. Shavit, The Art of Multiprocessor Programming, Morgan Kaufmann, revised 1st ed., 2012
3. Free on-line book (Creative Commons License):
   Matloff, N., Programming on Parallel Machines: GPU, Multicore, Clusters and More, 341 pp., PDF file
   http://heather.cs.ucdavis.edu/~matloff/158/PLN/ParProcBook.pdf
4. Useful free on-line course, sponsored by NVIDIA:
   "Introduction to Parallel Programming," CPU/GPU programming with CUDA (Compute Unified Device Architecture)
   https://www.udacity.com/course/cs344
Slide 6: 1.1 Why Parallel Processing?

The quest for higher-performance digital computers seems unending. In the past two decades, the performance of microprocessors has enjoyed exponential growth. The growth of microprocessor speed/performance by a factor of 2 every 18 months is known as Moore's law.
Slide 7: 1.1 Why Parallel Processing? (continued)

This growth is the result of a combination of two factors:
1. Increase in complexity (related both to higher device density and to larger size) of VLSI chips, projected at the time to rise to around 10 M transistors per chip for microprocessors.
2. Introduction of, and improvements in, architectural features such as on-chip cache memories, large instruction buffers, multiple instruction issue per cycle, and multithreading.
Slide 8: 1.1 Why Parallel Processing? (continued)

Moore's law was originally formulated in 1965 in terms of the doubling of chip complexity every year (later revised to every 18 months) [Scha97]. Moore's law seems to hold regardless of how one measures processor performance:
- counting the number of executed instructions per second (IPS)
- counting the number of floating-point operations per second (FLOPS)
- using sophisticated benchmark suites that attempt to measure the processor's performance on real applications
Slide 9: 1.1 Why Parallel Processing? (continued)
Slide 10: 1.1 Why Parallel Processing? (continued)

Even though it is expected that Moore's law will continue to hold for the near future, there is a limit that will eventually be reached. That some previous predictions about when the limit will be reached have proven wrong does not alter the fact that a limit, dictated by physical laws, does exist. The most easily understood physical limit is the one imposed by the finite speed of signal propagation along a wire.
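As a rough back-of-the-envelope illustration (my numbers, not from the slides): even at the speed of light, a signal covers only about 30 cm per nanosecond, so one cycle of a 3 GHz clock allows at most about 10 cm of wire to be traversed, and on-chip signals propagate considerably slower than light.

```latex
% Illustrative bound, assuming the signal speed is at most c (real wires are slower)
\[
  c \approx 3\times 10^{8}\ \mathrm{m/s}
  \quad\Longrightarrow\quad
  \text{distance per cycle at } 3\ \mathrm{GHz}
  \;\le\; \frac{3\times 10^{8}\ \mathrm{m/s}}{3\times 10^{9}\ \mathrm{cycles/s}}
  \;=\; 10\ \mathrm{cm}.
\]
```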
Slide 11: Why High-Performance Computing?

1. Higher speed (solve problems faster)
   Important when there are "hard" or "soft" deadlines; e.g., 24-hour weather forecast
2. Higher throughput (solve more problems)
   Important when we have many similar tasks to perform; e.g., transaction processing
3. Higher computational power (solve larger problems)
   e.g., weather forecast for a week rather than 24 hours, or with a finer mesh for greater accuracy
Slide 12: 1.2 A Motivating Example

Fig. 1.3 The sieve of Eratosthenes yielding a list of 10 primes for n = 30. Marked elements have been distinguished by erasure from the list.
[Figure: the list 2-30 shown after initialization, Pass 1 (multiples of 2 erased), Pass 2 (multiples of 3 erased), and Pass 3 (multiples of 5 erased), leaving the primes 2, 3, 5, 7, 11, 13, 17, 19, 23, 29.]

Key observation: any composite number has a prime factor that is no greater than its square root.
Slide 13: Single-Processor Implementation of the Sieve

Fig. 1.4 Schematic representation of the single-processor solution for the sieve of Eratosthenes (the list of numbers is held in a bit-vector).
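A minimal sequential sketch of this scheme (my code, not the textbook's; a Python bytearray stands in for the bit-vector):

```python
def sieve(n):
    """Return the primes <= n using the sieve of Eratosthenes."""
    is_prime = bytearray([1]) * (n + 1)        # stands in for the bit-vector
    is_prime[0:2] = b"\x00\x00"                # 0 and 1 are not prime
    p = 2
    while p * p <= n:                          # any composite has a prime factor <= sqrt(n)
        if is_prime[p]:
            for m in range(p * p, n + 1, p):   # mark multiples of the current prime
                is_prime[m] = 0
        p += 1
    return [i for i in range(2, n + 1) if is_prime[i]]

print(sieve(30))   # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```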
Slide 14: Control-Parallel Implementation of the Sieve

Fig. 1.5 Schematic representation of a control-parallel solution for the sieve of Eratosthenes.
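One way to read the control-parallel scheme (a hedged sketch of mine, not the author's code): the processors share a single mark array, and each one repeatedly claims the next still-unmarked candidate and crosses out its multiples. In CPython the threads mainly illustrate the control structure rather than deliver real speedup, because of the global interpreter lock.

```python
import threading
from math import isqrt

def control_parallel_sieve(n, p):
    """Sketch: p threads share one mark array; each claims the next unmarked candidate."""
    is_prime = [True] * (n + 1)
    is_prime[0] = is_prime[1] = False
    next_candidate = [2]
    lock = threading.Lock()

    def worker():
        while True:
            with lock:                            # claim the next unmarked candidate
                c = next_candidate[0]
                while c <= isqrt(n) and not is_prime[c]:
                    c += 1
                if c > isqrt(n):
                    return
                next_candidate[0] = c + 1
            # Mark multiples outside the lock. A thread may occasionally claim a
            # not-yet-marked composite; the extra marking is redundant but harmless.
            for m in range(c * c, n + 1, c):
                is_prime[m] = False

    threads = [threading.Thread(target=worker) for _ in range(p)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [i for i in range(2, n + 1) if is_prime[i]]

print(control_parallel_sieve(1000, 3)[:10])   # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```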
Slide 15: Running Time of the Sequential/Parallel Sieve

Fig. 1.6 Control-parallel realization of the sieve of Eratosthenes with n = 1000 and 1 ≤ p ≤ 3.
Slide 16: Data-Parallel Implementation of the Sieve

Fig. 1.7 Data-parallel realization of the sieve of Eratosthenes. Assume at most √n processors, so that all prime factors dealt with are in P1 (which broadcasts them); that is, √n < n / p.
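A rough, sequential simulation of this data decomposition (my sketch; the "broadcast" is just a function call): the range 2..n is split into p equal segments, processor P1 owns the lowest segment and announces each prime it discovers there, and every processor marks that prime's multiples within its own segment.

```python
from math import isqrt

def data_parallel_sieve(n, p):
    """Sketch: split 2..n into p segments; P1 finds the primes <= sqrt(n) and 'broadcasts' them."""
    assert p <= isqrt(n), "need p <= sqrt(n) so every needed prime lies in P1's segment"
    seg = (n - 1 + p - 1) // p                  # numbers per processor, covering 2..n
    segments = [list(range(2 + i * seg, min(n, 1 + (i + 1) * seg) + 1)) for i in range(p)]
    marks = [[True] * len(s) for s in segments]

    def mark_multiples(i, q):                   # "processor i" marks multiples of q
        for j, v in enumerate(segments[i]):
            if v != q and v % q == 0:
                marks[i][j] = False

    # P1 scans its own segment; each prime it finds is broadcast to all processors.
    for j, q in enumerate(segments[0]):
        if marks[0][j] and q <= isqrt(n):
            for i in range(p):
                mark_multiples(i, q)

    return [v for s, m in zip(segments, marks) for v, ok in zip(s, m) if ok]

print(data_parallel_sieve(30, 3))   # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```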
Slide 17: One Reason for Sublinear Speedup: Communication Overhead

Fig. 1.8 Trade-off between communication time and computation time in the data-parallel realization of the sieve of Eratosthenes.
Slide 18: Another Reason for Sublinear Speedup: Input/Output Overhead

Fig. 1.9 Effect of a constant I/O time on the data-parallel realization of the sieve of Eratosthenes.
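One simple way to formalize this effect (my framing, not a formula from the slides): if the computation time shrinks with p but an input/output time t_IO stays fixed, the achievable speedup is capped no matter how many processors are used.

```latex
% Model assumptions (mine): perfectly parallel compute time t_comp, constant serial I/O time t_IO
\[
  S(p) \;=\; \frac{t_{\mathrm{comp}} + t_{\mathrm{IO}}}{\,t_{\mathrm{comp}}/p + t_{\mathrm{IO}}\,}
  \;\xrightarrow[\;p \to \infty\;]{}\; 1 + \frac{t_{\mathrm{comp}}}{t_{\mathrm{IO}}}.
\]
```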
Slide 19: 1.3 Parallel Processing Ups and Downs

Parallel processing, in the literal sense of the term, is used in virtually every modern computer. For example, overlapping I/O with computation is a form of parallel processing, as is the overlap between instruction preparation and execution in a pipelined processor. Other widely used forms of parallelism or concurrency include multiple functional units (e.g., separate integer and floating-point ALUs, or two floating-point multipliers in one ALU) and multitasking (which allows overlap between computation and the memory load necessitated by a page fault).
Slide 20: 1.3 Parallel Processing Ups and Downs (continued)

In this book, however, the term parallel processing is used in the restricted sense of having multiple (usually identical) processors for the main computation, not for I/O or other peripheral activities.
Slide 21: 1.3 Parallel Processing Ups and Downs (continued)

The history of parallel processing has had its ups and downs (read: company formations and bankruptcies!), with what appears to be a 20-year cycle. Serious interest in parallel processing started in the 1960s, and commercial interest resurfaced in the 1980s. Driven primarily by contracts from the defense establishment and other federal agencies in the United States, numerous companies were formed to develop parallel systems.
Slide 22: 1.3 Parallel Processing Ups and Downs (continued)

However, three factors led to another recess:
1. Government funding in the United States and other countries dried up, in part related to the end of the Cold War between the NATO allies and the Soviet bloc.
2. Commercial users in banking and other data-intensive industries were either saturated or disappointed by application difficulties.
3. Microprocessors developed so fast in terms of performance/cost ratio that custom-designed parallel machines always lagged in cost-effectiveness.
Slide 23: 1.4 Types of Parallelism: A Taxonomy

Fig. 1.11 The Flynn-Johnson classification of computer systems. (Flynn's four categories, SISD, SIMD, MISD, and MIMD, are based on single versus multiple instruction and data streams; the Johnson extension further subdivides MIMD systems by global versus distributed memory and shared-variable versus message-passing communication.)
Slide 24: 1.5 Roadblocks to Parallel Processing

Grosch's law: economy of scale applies, or power = cost².
  Rebuttal: no longer valid; in fact, we can buy more MFLOPS of computing power with micros than with supers.
Minsky's conjecture: speedup tends to be proportional to log p.
  Rebuttal: has its roots in the analysis of memory bank conflicts; can be overcome.
Tyranny of IC technology: uniprocessors suffice (x10 faster every 5 years).
  Rebuttal: faster ICs make parallel machines faster too; and what about x1000?
Tyranny of vector supercomputers: familiar programming model.
  Rebuttal: not all computations involve vectors; besides, parallel vector machines exist.
Software inertia: billions of dollars of investment in software.
  Rebuttal: new programs will be written; even uniprocessors benefit from parallelism specification.
Amdahl's law: unparallelizable code severely limits the speedup (see the next slide).
Slide 25: Amdahl's Law

Fig. 1.12 Limit on speedup according to Amdahl's law:

s = 1 / [f + (1 – f)/p] ≤ min(p, 1/f)

where f = fraction unaffected and p = speedup of the rest.
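A quick numeric check (values are mine, not from the slide): with an unaffected fraction f = 0.05 and the rest sped up by p = 100,

```latex
\[
  s \;=\; \frac{1}{0.05 + 0.95/100} \;\approx\; 16.8,
\]
```

well below p = 100 and below the asymptotic limit 1/f = 20.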
Slide 26: 1.6 Effectiveness of Parallel Processing

p      Number of processors
W(p)   Work performed by p processors
T(p)   Execution time with p processors; T(1) = W(1), T(p) ≤ W(p)
S(p)   Speedup = T(1) / T(p)
E(p)   Efficiency = T(1) / [p T(p)]
R(p)   Redundancy = W(p) / W(1)
U(p)   Utilization = W(p) / [p T(p)]
Q(p)   Quality = T³(1) / [p T²(p) W(p)]

Fig. 1.13 Task graph exhibiting limited inherent parallelism: W(1) = 13, T(1) = 13, T(∞) = 8.
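These measures are not independent; from the definitions above (and using W(1) = T(1)) one gets, for example:

```latex
\[
  E(p) = \frac{S(p)}{p}, \qquad
  U(p) = R(p)\,E(p), \qquad
  Q(p) = \frac{E(p)\,S(p)}{R(p)}.
\]
```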
Slide 27: Reduction or Fan-in Computation

Fig. 1.14 Computation graph for finding the sum of 16 numbers.

Example: adding 16 numbers on 8 processors, with unit-time additions.

Zero-time communication:
E(8) = 15 / (8 × 4) ≈ 47%
S(8) = 15 / 4 = 3.75
R(8) = 15 / 15 = 1
Q(8) = 1.76

Unit-time communication:
E(8) = 15 / (8 × 7) ≈ 27%
S(8) = 15 / 7 ≈ 2.14
R(8) = 22 / 15 ≈ 1.47
Q(8) = 0.39
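A small helper (mine, not from the book) that plugs the slide-26 definitions into this example and reproduces the numbers above:

```python
def metrics(p, T1, Tp, Wp):
    """Speedup, efficiency, redundancy, utilization, quality per the slide-26 definitions."""
    S = T1 / Tp
    E = T1 / (p * Tp)
    R = Wp / T1                  # W(1) = T(1) for this example
    U = Wp / (p * Tp)
    Q = T1**3 / (p * Tp**2 * Wp)
    return S, E, R, U, Q

# Adding 16 numbers on 8 processors requires 15 unit-time additions.
print(metrics(8, 15, 4, 15))   # zero-time communication: S = 3.75, E ~ 0.47, R = 1, Q ~ 1.76
print(metrics(8, 15, 7, 22))   # unit-time communication: S ~ 2.14, E ~ 0.27, R ~ 1.47, Q ~ 0.39
```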