Extending Amdahl’s Law in the Multicore Era
Erlin Yao, Yungang Bao, Guangming Tan and Mingyu Chen
Institute of Computing Technology, Chinese Academy of Sciences
{baoyg, tgm,

A Brief Intro of ICT, CAS
ICT has built the fastest HPC in China, Dawning 5000, which delivers 233.5 TFlops and ranks 10th in the Top500. ICT has also developed the Loongson CPU.

Outline
I. Background and Related Work
II. Model of Multicore Scalability
III. Symmetric Multicore Chips
IV. Asymmetric Multicore Chips
V. Dynamic Multicore Chips
VI. Conclusion and Future Work

We Are in the Multicore Era
The mainstream market is already dominated by multicore processors:
– Intel: 2-core Core Duo, 4-core Core i7
– AMD: 2-core Athlon, 4-core Opteron
– IBM: 2-core POWER6, 9-core Cell
– Sun: 8-core T1/T2
– …

Many-Core Is Coming
Some processor vendors have announced or released their many-core processors:
– Tilera: 64-core
– Intel: 80-core
– GPGPU: 100x-core
– …

Revisiting Amdahl’s Law in the Multi/Many-Core Era
Assume that a fraction f of a program’s execution time is infinitely parallelizable with no scheduling overhead, while the remaining fraction, 1 − f, is totally sequential. Using p processors to accelerate the parallel fraction gives the fixed-size speedup, in which the amount of work to be executed is independent of the number of processors:
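The formula itself was on the slide but is missing from this transcript; it is the standard Amdahl speedup:

```latex
\text{Speedup}(f, p) = \frac{1}{(1-f) + \dfrac{f}{p}}
```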

Implications of Amdahl’s Law
Despite its simplicity, Amdahl’s law applies broadly and gives important insights such as:
(i) Attack the common case: when f is small, optimizing the parallel fraction has little effect.
(ii) The aspects you ignore also limit speedup: even if p approaches infinity, the speedup is bounded by 1/(1 − f). For example, f = 0.9 caps the speedup at 10 no matter how many processors are used.

Mark Hill et al.’s Insights
Hill and Marty apply Amdahl’s law to multicore hardware by constructing a cost model for the number and performance of cores on one chip. Their conclusion: obtaining optimal multicore performance requires further research both in extracting more parallelism and in making sequential cores faster. Woo and Lee extended Hill’s work by taking power and energy into account.

Motivation of Our Work
The revised Amdahl’s law model provides a better understanding of multicore scalability, but there has been little theoretical analysis of it. This paper presents our theoretical analysis of multicore scalability and attempts to find the optimal results under different conditions.

Model of Multicore Scalability
We adopt the cost model for multicore hardware proposed by Hill and Marty, which includes two assumptions:
– First, a multicore chip of a given size and technology generation can contain at most n base core equivalents (BCEs).
– Second, an individual core built from more resources (r BCEs) can achieve better sequential performance perf(r), with 1 < perf(r) < r.
The architecture of multicore chips can be classified into three types: symmetric, asymmetric, and dynamic.

Model: Symmetric
A symmetric multicore chip requires that all its cores have the same cost. Example: given 16 BCEs,
– r = 8 → 2 cores × 8 BCEs/core
– r = 4 → 4 cores × 4 BCEs/core
Given a resource budget of n BCEs, we have n/r cores, each with r BCEs and per-core performance perf(r). Then we get:
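The slide’s formula is not in the transcript; from Hill and Marty’s model it is

```latex
\text{Speedup}_{\mathrm{symmetric}}(f, n, r) = \frac{1}{\dfrac{1-f}{\mathit{perf}(r)} + \dfrac{f \cdot r}{\mathit{perf}(r) \cdot n}}
```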

Model: Asymmetric
In an asymmetric multicore chip, some cores are more powerful than the others. Example: given 16 BCEs,
– 1 four-BCE core and 12 base cores, or
– 1 six-BCE core and 10 base cores.
Given a resource budget of n BCEs, we have 1 + n − r cores: one larger core with r BCEs and n − r base cores with 1 BCE each. Then we get:
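Again reconstructing the missing formula from Hill and Marty’s model:

```latex
\text{Speedup}_{\mathrm{asymmetric}}(f, n, r) = \frac{1}{\dfrac{1-f}{\mathit{perf}(r)} + \dfrac{f}{\mathit{perf}(r) + n - r}}
```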

Model: Dynamic
A dynamic multicore chip can dynamically combine up to r cores into one core in order to boost sequential performance.
– In sequential mode, it executes with performance perf(r) when the dynamic techniques use r BCEs.
– In parallel mode, it obtains performance n by using all base cores in parallel.
Then we get:
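The missing formula, again from Hill and Marty’s model:

```latex
\text{Speedup}_{\mathrm{dynamic}}(f, n, r) = \frac{1}{\dfrac{1-f}{\mathit{perf}(r)} + \dfrac{f}{n}}
```

For readers who want to experiment with the three models, here is a minimal Python sketch (not part of the original slides), assuming perf(r) = r^c with 0 < c < 1 as the later slides do:

```python
# Minimal sketch of the three speedup models above (Hill-and-Marty cost model).
# Assumption, made explicit on later slides: perf(r) = r**c with 0 < c < 1.

def perf(r: float, c: float = 0.5) -> float:
    """Sequential performance of a core built from r BCEs."""
    return r ** c

def speedup_symmetric(f: float, n: int, r: int, c: float = 0.5) -> float:
    """n/r cores, each built from r BCEs and running at perf(r)."""
    return 1.0 / ((1 - f) / perf(r, c) + f * r / (perf(r, c) * n))

def speedup_asymmetric(f: float, n: int, r: int, c: float = 0.5) -> float:
    """One r-BCE core plus n - r base cores of 1 BCE each."""
    return 1.0 / ((1 - f) / perf(r, c) + f / (perf(r, c) + n - r))

def speedup_dynamic(f: float, n: int, r: int, c: float = 0.5) -> float:
    """Sequential mode runs at perf(r); parallel mode uses all n BCEs."""
    return 1.0 / ((1 - f) / perf(r, c) + f / n)

if __name__ == "__main__":
    f, n = 0.975, 256  # hypothetical parameters, for illustration only
    for r in (1, 4, 16, 64, 256):
        print(f"r={r:3d}  sym={speedup_symmetric(f, n, r):8.2f}  "
              f"asym={speedup_asymmetric(f, n, r):8.2f}  "
              f"dyn={speedup_dynamic(f, n, r):8.2f}")
```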

Symmetric Multicore Chips
– For fixed n and r, the speedup is an increasing function of f.
– For fixed f and r, the speedup is an increasing function of n.
Hence increasing both the parallel fraction (f) and the number of base cores (n) improves the speedup of a symmetric multicore chip. For fixed f and n, we have the following theorem:

Symmetric Multicore Chips
For any fixed f and c (where perf(r) = r^c):
– If f < c, the maximum speedup is achieved at r = n.
– If f > c and n is not big, the maximum speedup is achieved at r = 1.
– If f > c and n is big enough, the maximum speedup lies between the extremes: to obtain optimal multicore performance, the BCE resources should be divided among cores that each offer reasonable individual performance (see the numeric sketch below).
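As a hypothetical numeric illustration of the last case, one can scan r with the sketch above (parameters chosen arbitrarily):

```python
# Hypothetical usage of speedup_symmetric from the sketch above:
# with f = 0.9 > c = 0.5 and n = 256, the best r lands between the extremes.
f, n, c = 0.9, 256, 0.5
best_r = max(range(1, n + 1), key=lambda r: speedup_symmetric(f, n, r, c))
print(best_r, speedup_symmetric(f, n, best_r, c))  # best_r is neither 1 nor n
```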

Symmetric Multicore Chips
If n is big enough, will the maximum speedup always be achieved between the extremes for any perf(x) < x? No. Counterexamples:
– (i) perf(x) = kx, for any 0 < k < 1;
– (ii) perf(x) = x^c, for any f < c < 1.

Asymmetric Multicore Chips
Similarly, increasing both the parallel fraction (f) and the number of BCEs (n) improves the speedup of an asymmetric multicore chip. For fixed f and n, we have the following theorem:

Asymmetric Multicore Chips
– If f > c and n is not big, the maximum speedup is achieved at r = 1.
– If f < c and n is not big, the maximum speedup is achieved at r = n.
– For any fixed f and c, if n is big enough, the maximum speedup is achieved at some r_0 with 1 < r_0 < n.

Asymmetric Multicore Chips
Note that the optimal r_0 in Theorem 2 cannot be solved for analytically. r_0 is linear in n, so when n is big enough, r_0 itself becomes arbitrarily large.
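A sketch of why r_0 lacks a closed form, assuming perf(r) = r^c as above: maximizing the asymmetric speedup means minimizing its denominator D(r) = (1 − f) r^{−c} + f (r^c + n − r)^{−1}, and the first-order condition D′(r_0) = 0 is transcendental in r_0:

```latex
\frac{c\,(1-f)}{r_0^{\,c+1}} = \frac{f\left(1 - c\, r_0^{\,c-1}\right)}{\left(r_0^{\,c} + n - r_0\right)^{2}}
```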

Asymmetric Multicore Chips
If n is big enough, will the maximum speedup always be achieved between the extremes for any perf(x) < x? No. Counterexample:
– perf(x) = kx, for any f < k < 1.
It does hold for saturated functions, like p(x) = x^c or p(x) = kx^c + mx^{c′} + …, where c, c′ < 1.

Asymmetric Multicore Chips
Under the simplistic assumptions of Amdahl’s law, it makes the most sense to devote extra resources to increasing only one core’s capability. In fact, we have the following theorem: although the architecture of an asymmetric multicore chip with one large core and many base cores was originally assumed for simplicity, it is indeed the optimal architecture in the sense of speedup.

Dynamic Multicore Chips
As before, increasing both f and n enhances the speedup of a dynamic multicore chip. For fixed f and n:
– If perf(r) is an increasing function, the speedup is also an increasing function of r.
– Therefore the maximum speedup is always achieved at r = n.
Dynamic multicore chips can offer potential speedups that are greater than, and never worse than, those of symmetric or asymmetric multicore chips with identical perf(r) functions. So researchers should continue to investigate methods that approximate a dynamic multicore chip.

Potentials of Maximum Speedups
Recall that under Amdahl’s law, even if the number of processors approaches infinity, the speedup is bounded by 1/(1 − f). In the multicore models, increasing n improves the speedup continuously: under the assumption perf(r) = r^c, when n approaches infinity the speedup can also approach infinity, even if the performance index c is small.
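A short reconstruction of that claim for the symmetric chip: choosing r = n (a single n-BCE core) in the symmetric speedup formula gives

```latex
\text{Speedup}_{\mathrm{symmetric}}(f, n, n)
  = \frac{1}{\dfrac{1-f}{n^{c}} + \dfrac{f \cdot n}{n^{c} \cdot n}}
  = \frac{1}{(1-f)\,n^{-c} + f\,n^{-c}}
  = n^{c},
```

which grows without bound as n → ∞ for any c > 0.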

Implications and Results
We have investigated a theoretical analysis of multicore scalability and given quantitative conditions for obtaining optimal multicore performance. The theorems and corollary provide computer architects with a better understanding of multicore design types, enabling them to make more informed tradeoffs. However, our precise quantitative results should be treated with caution, because the real world is much more complex and the model considered here ignores many important structures. This theoretical analysis aims to provide insights for future work.

Future Work
– In real applications, the parallel fraction f cannot be infinitely parallelizable: the parallel degree may be bounded by some constant d, or even be random in some circumstances.
– Introduce practical structures, such as the memory hierarchy, shared caches, etc.
– More cores might allow more parallelism at larger problem sizes, so fixed-time speedup, as in Gustafson’s law, should be considered.
– …

Acknowledgements
We would like to thank Professor Mark Hill for his valuable comments and suggestions. We also appreciate the help of Dr. Mark Squillante and the MAMA organizers’ arrangement of this video presentation.

Thanks
Questions and comments are welcome.