Reevaluating Amdahl's Law in the Multicore Era. Xian-He Sun, Scalable Computing Software Lab, Illinois Institute of Technology. 8/15/2015.


Slide 1: Reevaluating Amdahl's Law in the Multicore Era. Xian-He Sun, Illinois Institute of Technology, Chicago, Illinois.

Slide 2: TOP 10 Machines (6/2004). (Rank, site, computer, country; processor counts and TF/s were lost in transcription.)
1. Earth Simulator Center: Earth Simulator / NEC (Japan)
2. LLNL: Thunder / Intel Itanium 2 Tiger4 1.4 GHz, Quadrics (USA)
3. LANL: ASCI Q / AlphaServer SC (USA)
4. IBM Rochester: BlueGene/L DD1 Prototype (USA)
5. NCSA: Tungsten / PowerEdge 1750, P4 Xeon 3.06 GHz (USA)
6. ECMWF: eServer pSeries 690, IBM (UK)
7. RIKEN: RIKEN Super Combined Cluster / Fujitsu (Japan)
8. IBM Thomas J. Watson: BlueGene/L DD2 Prototype (USA)
9. PNNL: Integrity rx2600, Itanium 2 1.5 GHz (USA)
10. Shanghai Supercomputer Center: Dawning 4000A, Opteron 2.2 GHz (China)

Slide 3: Cloud: Integrated Resource. Provides virtual computing environments on demand.

Slide 4: Scalable Computing: the Way to High Performance. IBM BG/P (source: ANL ALCF).
- Chip: 4 cores, 850 MHz, 8 MB EDRAM, 13.6 GF/s; supports 4-way SMP
- Compute card: 1 chip, 20 DRAMs, 2.0 GB DDR
- Node board: 32 compute and 0-2 I/O cards (32 chips, 4x4x2), 435 GF/s, 64 GB
- Rack: 32 node cards (1024 chips, 4096 procs), 14 TF/s, 2 TB, cabled 8x8x16
- Petaflops system: 72 racks, 1 PF/s, 144 TB
- Maximum system: 256 racks, 3.5 PF/s, 512 TB
- Front-end / service nodes: System p servers, Linux SLES10; HPC software: compilers, GPFS, ESSL, LoadLeveler

Slide 5: Multicore Adds Another Dimension. IBM Cell: 8 slave cores + 1 master core, 2005. Sun T2: 8 cores, 2007. Intel Dunnington: 6 cores, 2008. AMD Phenom: 4 cores, 2007.

Slide 6: Not in the Mood to Scale Up, Yet. AMD Opteron "Istanbul": 6 cores, 2009. Sun UltraSPARC Rock: 16 cores, 2009. Intel Dunnington: 6 cores, 2008. IBM POWER7: 8 cores, 2010.

Slide 7: Why Not Scale Up the Number of Cores? Perception or technology?

Slide 8: Whereas the Technology is Available. Kilocore: 256-core prototype, by Rapport Inc. GRAPE-DR chip: 512 cores (Japan; GRAPE-DR test board). Tesla C1060: 240 cores, by NVIDIA. Quadro FX 5800: 240 cores, by NVIDIA. NVIDIA Fermi: 512 CUDA cores.

Slide 9: It All Starts with Amdahl's Law. Gene M. Amdahl, "Validity of the Single-Processor Approach to Achieving Large Scale Computing Capabilities", 1967. Amdahl's law (Amdahl's speedup model), where f is the parallel portion of the work and n the number of processors: Speedup_fixed-size(n) = 1 / ((1 - f) + f/n). Implication: no matter how large n grows, speedup is bounded by 1/(1 - f).
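The law on this slide is easy to check numerically; a minimal sketch in plain Python (not part of the talk's own software):

```python
def amdahl_speedup(f, n):
    """Fixed-size speedup: parallel fraction f of the work runs on n processors."""
    return 1.0 / ((1.0 - f) + f / n)

# Even with 95% of the work parallel, speedup saturates near 1/(1-f) = 20.
for n in (8, 64, 1024):
    print(n, round(amdahl_speedup(0.95, n), 2))
```

Going from 64 to 1024 processors buys barely 4x more speedup here, which is exactly the pessimistic implication the slide refers to.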

Slide 10: Amdahl's Law for Multicore (Hill & Marty). Hill & Marty, "Amdahl's Law in the Multicore Era", IEEE Computer, July 2008. Studies the limitations of multicore architecture based on Amdahl's law, from a parallel-processing and hardware-design perspective:
- A chip has n BCEs (Base Core Equivalents)
- A more powerful core of performance perf(r) can be built from r BCEs; choosing r is the design question
- Two organizations are compared: symmetric and asymmetric

Slide 11: Amdahl's Law for Multicore (Hill & Marty). Speedup of the symmetric architecture (n/r cores, each of performance perf(r)): Speedup_symmetric(f, n, r) = 1 / ((1-f)/perf(r) + f·r/(perf(r)·n)). Speedup of the asymmetric architecture (one perf(r) core plus n - r base cores): Speedup_asymmetric(f, n, r) = 1 / ((1-f)/perf(r) + f/(perf(r) + n - r)).
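Hill and Marty's symmetric and asymmetric speedups can be sketched directly; perf(r) = sqrt(r) is the illustrative assumption they use in their examples, not something fixed by the model:

```python
import math

def perf(r):
    # Hill & Marty's illustrative assumption: r BCEs yield sqrt(r) performance.
    return math.sqrt(r)

def symmetric_speedup(f, n, r):
    # n BCEs grouped into n/r identical cores, each of performance perf(r)
    return 1.0 / ((1.0 - f) / perf(r) + f * r / (perf(r) * n))

def asymmetric_speedup(f, n, r):
    # one perf(r) core (which also runs the serial part) plus n - r base cores
    return 1.0 / ((1.0 - f) / perf(r) + f / (perf(r) + n - r))

# With f = 0.9, n = 64 BCEs, r = 4, the asymmetric design wins: 12.8 vs 15.5.
print(round(symmetric_speedup(0.9, 64, 4), 1), round(asymmetric_speedup(0.9, 64, 4), 1))
```

The asymmetric design wins because the fat core accelerates the serial part while the base cores still contribute to the parallel part, which is the paper's central observation.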

Slide 12: History Repeats Itself (back to 1988)? The fastest computers of their day (Cray X-MP, Cray Y-MP, IBM 7030 Stretch, IBM 7950 Harvest) all had at most 8 processors, citing Amdahl's law.

Slide 13: Terms of Scalable Computing (today). LANL Roadrunner: 18,360 processors, 130,464 cores, 2009, world's fastest supercomputer. TACC Ranger: 15,744 processors, 2008. ANL Intrepid: 20,480 processors, 2008. The scale is far beyond the implications of Amdahl's law.

Slide 14: Scalable Computing. Tacit assumptions in Amdahl's law: the problem size is fixed, and the speedup emphasizes time reduction. Gustafson's law, 1988: the fixed-time speedup model; the work is scaled from 1 to (1-f) + n·f in the same time, giving Speedup_fixed-time = (1-f) + n·f. Sun and Ni's law, 1990: the memory-bounded speedup model, where the work grows with the available memory.
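The three models on this slide can be sketched side by side; g(n) below is the factor by which the workload grows when memory grows n-fold (the Sun-Ni parameter):

```python
def amdahl(f, n):
    # fixed-size model, for comparison
    return 1.0 / ((1.0 - f) + f / n)

def gustafson(f, n):
    # fixed-time model: scaled work (1-f) + n*f finishes in the original time
    return (1.0 - f) + n * f

def sun_ni(f, n, g):
    # memory-bounded model: work grows by g(n) when memory grows n-fold
    gn = g(n)
    return ((1.0 - f) + f * gn) / ((1.0 - f) + f * gn / n)

# Sun-Ni generalizes both: g(n) = 1 recovers Amdahl, g(n) = n recovers Gustafson.
```

Picking g makes the relationship between the laws explicit, which is why the memory-bounded model is the most general of the three.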

Slide 15: Revisiting Amdahl's Law for Multicore. Hill and Marty's findings.

Slide 16: Fixed-time Model for Multicore. Emphasis on the work finished within a fixed time. The problem size is scaled from w to w', where w' is the work finished within the fixed time when the number of cores scales from r to mr. The scaled fixed-time speedup is then Speedup_fixed-time = w'/w = (1 - f) + m·f.

Slide 17: Fixed-time Speedup for Multicore. Scales linearly in the number of cores.

Slide 18: Memory-bounded Model for Multicore. The problem size is scaled from w to w*, where w* is the work executed under the memory limitation (each core has its own L1 cache). w* = g(m)·w, where g(m) is the workload increase as the memory capacity increases m times (g(m) = 0.38·m^{3/2} for matrix multiplication: 2N³ computation vs. 3N² memory). The scaled memory-bounded speedup follows.
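The 0.38 constant can be checked with a short derivation: with W = 2N³ flops and M = 3N² words of memory, the largest matrix that fits has N = sqrt(M/3), so W(M) = 2(M/3)^{3/2} = (2/3^{3/2})·M^{3/2} ≈ 0.38·M^{3/2}, and an m-fold memory increase scales the work by m^{3/2}. A quick sketch of this arithmetic:

```python
def matmul_work(memory_words):
    # largest matrix order that fits: M = 3*N**2  =>  N = sqrt(M/3); work W = 2*N**3
    n = (memory_words / 3.0) ** 0.5
    return 2.0 * n ** 3

coef = 2.0 / 3.0 ** 1.5   # the 0.38... constant on the slide
# Quadrupling memory scales the work by 4**1.5 = 8x.
growth = matmul_work(4 * 3_000_000) / matmul_work(3_000_000)
```

Since the work (and hence the parallel part) grows faster than the core count, the memory-bounded speedup outpaces the linear fixed-time speedup for this kernel.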

Slide 19: Memory-bounded Speedup for Multicore. Scales linearly, and better than fixed-time, for matrix multiplication (2N³ computation vs. 3N² memory).

Slide 20: Perspective: a Comparison. Scalable computing shows an optimistic view.

Slide 21: Results and Implications. Result: the scalable computing concept and the two scaled speedup models are applicable to multicore architecture. Implication 1: Amdahl's law (Hill & Marty) presents a limited and pessimistic view. Implication 2: multicore is scalable in terms of the number of cores. Implication 3: the memory-bounded model reveals the relation between scalability and memory capacity and requirements. Question: is data access the actual performance constraint of multicore architecture?

Slide 22: Processor-Memory Performance Gap. Processor performance increases rapidly: ~52% per year for uniprocessors until 2004, ~25% since then; the new trend is multicore/manycore architecture (e.g., the Intel TeraFlops chip, 2007), so aggregate processor performance grows much faster still. Memory performance improves only ~9% per year, so the processor-memory speed gap keeps increasing. (Sources: Intel, OCZ.)
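A back-of-the-envelope compounding check of why the gap keeps widening, using the slide's rates (~52%/year for processors, ~9%/year for memory; the exact figure depends on which period is taken):

```python
# Compound ten years of 52%/year processor vs. 9%/year memory improvement.
proc = 1.52 ** 10   # processor performance relative to year 0
mem = 1.09 ** 10    # memory performance relative to year 0
gap = proc / mem    # the processor-memory gap grows to roughly 28x in a decade
```

Even at the slower post-2004 processor rate the ratio still compounds every year, which is what makes data access, not computation, the long-term constraint.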

Slide 23: Multicore Scalability Analysis. Architecture: n cores, with data contention on the shared L2; increasing the number of cores does not improve data-access speed.

Slide 24: Synchronization/Communication Time. Dense solver: k³ computation over k² memory. Two phases: a computing phase and a communication phase. Application: iterative solvers.

Slide 25: Data Access as the Scalability Constraint. Phased computing model (embarrassingly parallel meta-tasks). Assume a task has two parts, w = w_p + w_c: the data-processing work w_p and the data-communication (access) work w_c. Fixed-size speedup with data access taken into consideration.
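Under the phased model, if only the data-processing part w_p parallelizes across m cores while the data-access part w_c stays serialized on the shared path (the assumption this slide suggests), the fixed-size speedup is bounded by (w_p + w_c)/w_c. A minimal sketch under that assumption:

```python
def fixed_size_speedup(wp, wc, m):
    # processing work wp splits across m cores; access work wc does not speed up
    return (wp + wc) / (wp / m + wc)

# With wp = 90 and wc = 10, no core count can push the speedup past (90+10)/10 = 10.
```

This is Amdahl's law again, with the serial fraction reinterpreted as data access: the memory wall, not the algorithm, plays the role of the sequential part.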

Slide 26: Scaled Speedup under the Memory Wall. Assuming the data-access time is fixed: the fixed-time model constraint gives the fixed-time speedup and the memory-bounded speedup. With g(m) > m, the memory-bounded speedup is bigger than the fixed-time speedup; when g(m) equals one, memory-bounded is the same as fixed-size; when g(m) equals m, memory-bounded is the same as fixed-time.

Slide 27: Mitigating the Memory-Wall Effect. Result: multicore is scalable, but under the assumption that data-access time is fixed and does not increase with the amount of work or the number of cores. Implication: data access is the bottleneck and needs attention. Data prefetching: software prefetching is adaptive but competes for computing power and is costly; hardware prefetching is fixed, simple, and less powerful. Our solutions: Data Access History Cache (DAHC) and server-based push prefetching.

Slide 28: Hybrid Adaptive Prefetching Architecture. [Figure: per-core L1 caches issue demand requests; prefetching hints come from the programmer, a pre-compiler, and pre-execution; data-access histories feed prediction (sequential, stride, Markov); a prefetch generator fills a prefetch queue ahead of the access scheduler, in front of memory and disk.]
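The stride predictor named in this figure can be sketched in a few lines as a toy software model (not the talk's hardware design): track the last address and stride per instruction, and once the same stride repeats, predict the next address.

```python
class StridePrefetcher:
    """Toy per-PC stride predictor: predict addr + stride after a repeated stride."""

    def __init__(self):
        self.table = {}  # pc -> (last_addr, last_stride, confirmed)

    def access(self, pc, addr):
        """Record an access; return a prefetch address, or None if not confident."""
        last = self.table.get(pc)
        prediction = None
        if last is not None:
            last_addr, last_stride, _confirmed = last
            stride = addr - last_addr
            if stride == last_stride and stride != 0:
                prediction = addr + stride      # stride confirmed: issue prefetch
                self.table[pc] = (addr, stride, True)
            else:
                self.table[pc] = (addr, stride, False)
        else:
            self.table[pc] = (addr, 0, False)
        return prediction
```

Two accesses establish a candidate stride and the third confirms it, mirroring the confidence counters (one per predictor) that the DAHC slide lists.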

Slide 29: Data Access History Cache: a Hardware Solution for Memory. [Figure: the DAHC sits alongside the L1 and L2 caches, holding tagged access-history entries with per-predictor counters (sequential, stride, Markov) and a timer that drive the prefetcher.]

Slide 30: Push-IO: a Software Solution for I/O.

Slide 31: Dynamic Application-Specific I/O Optimization Architecture. Illinois Institute of Technology & Argonne National Laboratory.

Slide 32: Conclusion. Cloud computing and multicore/manycore architecture lead to the future of computing. Multicore architecture is scalable: scaling up the number of cores can continually improve performance, if the data-access delay is fixed. Data access is the limiting factor of performance. Mitigating the memory wall with data prefetching: Data Access History Cache (DAHC) and server-based push prefetching. Mitigating the memory wall with an application-specific data-access system.

Slide 33: Thank you!