
Agenda
– Project discussion
– Modeling Critical Sections in Amdahl's Law and its Implications for Multicore Design, S. Eyerman, L. Eeckhout, ISCA'10 [pdf]
– Benchmarking guidelines
– Regular vs. irregular parallel applications

Last time: Amdahl’s law
Under what assumptions?
Speedup = 1 / ((1 - F) + F / N)
– Code is infinitely parallelizable
– No parallelization overheads
– No synchronization
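A quick worked example (the numbers are illustrative, not from the slides): with parallel fraction F = 0.95 and N = 16 processors,

  Speedup = 1 / ((1 - 0.95) + 0.95 / 16) = 1 / 0.109375 ≈ 9.1

so even a 5% sequential fraction limits 16 processors to roughly 9x speedup.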

Assuming multiple BCEs (base core equivalents).
Q: How do we design a multicore for maximum speedup?
Assumed Perf(R) = sqrt(R) for a core built from R BCEs
Two problems:
– symmetric vs. asymmetric multicore chips
– area allocation: e.g., with 16 BCEs, a symmetric chip can use sixteen 1-BCE cores, four 4-BCE cores, or one 16-BCE core

For asymmetric multicore chips (one R-BCE core plus N - R 1-BCE cores):
Serial fraction 1 - F: as before, it runs on one core at rate Perf(R), so serial time = (1 - F) / Perf(R)
Parallel fraction F:
– one core at rate Perf(R)
– N - R cores at rate 1
– parallel time = F / (Perf(R) + N - R)
Therefore, w.r.t. one base core:
Asymmetric Speedup = 1 / ( (1 - F) / Perf(R) + F / (Perf(R) + N - R) )
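To get a feel for the numbers, the sketch below (in C) evaluates the asymmetric speedup formula above with Perf(R) = sqrt(R) and N = 256 BCEs, as in the figure that follows; the parallel fraction F = 0.975 is an arbitrary illustration value, not a number from the paper.

  /* Evaluate the asymmetric-multicore speedup formula from the slide:
     Speedup(F, N, R) = 1 / ((1-F)/perf(R) + F/(perf(R) + N - R)),
     with perf(R) = sqrt(R) and N = 256 BCEs. */
  #include <stdio.h>
  #include <math.h>

  static double perf(double r) { return sqrt(r); }

  static double asym_speedup(double F, double N, double R) {
      return 1.0 / ((1.0 - F) / perf(R) + F / (perf(R) + N - R));
  }

  int main(void) {
      const double N = 256.0, F = 0.975;   /* F chosen only for illustration */
      const double R_values[] = {1, 4, 16, 64, 256};
      for (int i = 0; i < 5; i++) {
          double R = R_values[i];
          printf("R = %3.0f BCEs (%3.0f cores): speedup = %6.1f\n",
                 R, 1 + (N - R), asym_speedup(F, N, R));
      }
      return 0;
  }

With R = 1 the chip is 256 identical base cores and the formula reduces to plain Amdahl speedup; larger R trades core count for single-core performance on the serial fraction.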

[Figure: asymmetric speedup for 256 BCEs, for configurations with 256, 253, 241, 193, and 1 cores (i.e., an increasingly large big core).]

Amdahl assumptions:
– Code is infinitely parallelizable
– No parallelization overheads
– No synchronization
Now add synchronization: critical sections, entered at random times (?!)
The parallel fraction splits into a non-critical-section part and a critical-section part: instead of f_seq + f_par = 1, we have f_seq + f_par,ncs + f_par,cs = 1.

Average time in critical sections
– The paper also derives an estimate for the maximum time spent in critical sections.

Normalized execution time on N cores, as the sum of the four terms from the model:
f_seq + f_par,ncs / N + f_par,cs · P_cs · P_ctn + f_par,cs · (1 - P_cs · P_ctn) / N
where, in the paper's notation, P_cs is the probability of executing inside a critical section and P_ctn is the probability that a critical section is contended.
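The sketch below (in C) sums these four terms and reports Speedup = 1 / time, assuming the terms are normalized to single-core execution time; the fractions, probability values, and core count are made-up illustration values. It shows how the serialized critical-section term comes to dominate as P_cs · P_ctn grows.

  /* Speedup with critical sections, using the four time terms above:
     contended critical-section work (probability Pcs*Pctn) serializes,
     the rest of the parallel work scales with N. */
  #include <stdio.h>

  static double speedup(double f_seq, double f_ncs, double f_cs,
                        double Pcs_Pctn, double N) {
      double time = f_seq
                  + f_ncs / N
                  + f_cs * Pcs_Pctn              /* serialized critical sections */
                  + f_cs * (1.0 - Pcs_Pctn) / N; /* uncontended ones scale with N */
      return 1.0 / time;
  }

  int main(void) {
      double f_seq = 0.01, f_cs = 0.10, f_ncs = 1.0 - f_seq - f_cs, N = 64;
      for (double p = 0.0; p <= 1.0; p += 0.25)   /* p = Pcs * Pctn */
          printf("Pcs*Pctn = %.2f -> speedup = %5.1f\n", p,
                 speedup(f_seq, f_ncs, f_cs, p, N));
      return 0;
  }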

Speedup for an asymmetric processor as a function of the big-core size (b) and small-core size (s), for different contention rates, assuming 256 BCEs. The fraction of time spent in sequential code is 1%.

Design space exploration across symmetric, asymmetric, and ACS multicore processors, varying the fraction of time spent in critical sections and their contention rates. The fraction of time spent in sequential code is 1%. (ACS = accelerated critical sections.)

Agenda
– Project discussion
– Modeling Critical Sections in Amdahl's Law and its Implications for Multicore Design, S. Eyerman, L. Eeckhout, ISCA'10 [pdf]
– 12 ways to fool the masses
– Regular vs. irregular parallel applications

If you were plowing a field, which would you rather use? Two oxen, or 1024 chickens? (Attributed to S. Cray)

David H. Bailey, “Twelve Ways to Fool the Masses When Giving Performance Results on Parallel Computers”, Supercomputing Review, August 1991.
1. Quote only 32-bit performance results, not 64-bit results.
2. Present performance figures for an inner kernel, and then represent these figures as the performance of the entire application.
3. Quietly employ assembly code and other low-level language constructs.
4. Scale up the problem size with the number of processors, but omit any mention of this fact.
5. Quote performance results projected to a full system.
6. Compare your results against scalar, unoptimized code on Crays.
7. When direct run time comparisons are required, compare with an old code on an obsolete system.
8. If MFLOPS rates must be quoted, base the operation count on the parallel implementation, not on the best sequential implementation.
9. Quote performance in terms of processor utilization, parallel speedups, or MFLOPS per dollar.
10. Mutilate the algorithm used in the parallel implementation to match the architecture.
11. Measure parallel run times on a dedicated system, but measure conventional run times in a busy environment.
12. If all else fails, show pretty pictures and animated videos, and don't talk about performance.

Roadmap
– Project discussion
– Modeling Critical Sections in Amdahl's Law and its Implications for Multicore Design, S. Eyerman, L. Eeckhout, ISCA'10 [pdf]
– 12 ways to fool the masses
– Regular vs. irregular parallel applications

Definitions
Regular applications
– key data structures are vectors and dense matrices
– simple access patterns, e.g., array indices are affine functions of for-loop indices
– examples: MMM, Cholesky and LU factorizations, stencil codes, FFT, …
Irregular applications
– key data structures are lists, priority queues, trees, DAGs, and graphs, usually implemented using pointers or references
– complex access patterns
– examples: see the following slides
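A tiny invented example of the contrast (in C): the first loop's array index is an affine function of the loop variable, so a compiler can analyze its accesses at compile time; the second loop walks a linked list, so the access pattern depends on runtime pointer values.

  /* Regular vs. irregular access patterns (toy example). */
  #include <stdio.h>
  #include <stddef.h>

  struct node { double val; struct node *next; };

  /* Regular: A[2*i + 1] is an affine function of i, so dependence analysis
     and automatic parallelization are possible at compile time. */
  double sum_regular(const double *A, size_t n) {
      double s = 0.0;
      for (size_t i = 0; 2 * i + 1 < n; i++)
          s += A[2 * i + 1];
      return s;
  }

  /* Irregular: the next element to touch is stored in the data itself,
     so the access pattern is only known at runtime. */
  double sum_irregular(const struct node *head) {
      double s = 0.0;
      for (const struct node *p = head; p != NULL; p = p->next)
          s += p->val;
      return s;
  }

  int main(void) {
      double A[8] = {0, 1, 2, 3, 4, 5, 6, 7};
      struct node c = {3.0, NULL}, b = {2.0, &c}, a = {1.0, &b};
      printf("regular: %f  irregular: %f\n", sum_regular(A, 8), sum_irregular(&a));
      return 0;
  }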

Regular application example: Stencil computation
(e.g.) Finite-difference method for solving PDEs
– discrete representation of domain: grid
– values at interior points are updated using values at neighbors
– values at boundary points are fixed
Data structure:
– dense arrays
Parallelism:
– values at the next time step can be computed simultaneously
– parallelism is not dependent on runtime values
– compiler can find the parallelism: the spatial loops are DO-ALL loops

  // Jacobi iteration with 5-point stencil
  // initialize array A
  for time = 1, nsteps
    for <i,j> in [2,n-1] x [2,n-1]:
      temp(i,j) = 0.25 * (A(i-1,j) + A(i+1,j) + A(i,j-1) + A(i,j+1))
    for <i,j> in [2,n-1] x [2,n-1]:
      A(i,j) = temp(i,j)

[Figure: arrays A and temp at time steps t_n and t_n+1]
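The pseudocode above maps directly onto nested loops over dense arrays. Below is a minimal runnable sketch in C; the grid size, step count, and boundary values are arbitrary illustration choices, and the OpenMP pragmas are one common way to mark the DO-ALL spatial loops (they are not taken from the slides).

  /* Minimal Jacobi 5-point stencil sketch in C with OpenMP. */
  #include <stdio.h>
  #define N 256          /* grid is N x N; boundary rows/columns stay fixed */
  #define NSTEPS 100

  static double A[N][N], temp[N][N];

  int main(void) {
      /* initialize: interior = 0, boundary = 1 (any fixed boundary works) */
      for (int i = 0; i < N; i++)
          for (int j = 0; j < N; j++)
              A[i][j] = (i == 0 || j == 0 || i == N-1 || j == N-1) ? 1.0 : 0.0;

      for (int t = 0; t < NSTEPS; t++) {
          /* spatial loop is a DO-ALL loop: every (i,j) reads only old values
             of A and writes its own temp(i,j), so iterations are independent */
          #pragma omp parallel for collapse(2)
          for (int i = 1; i < N-1; i++)
              for (int j = 1; j < N-1; j++)
                  temp[i][j] = 0.25 * (A[i-1][j] + A[i+1][j] + A[i][j-1] + A[i][j+1]);

          #pragma omp parallel for collapse(2)
          for (int i = 1; i < N-1; i++)
              for (int j = 1; j < N-1; j++)
                  A[i][j] = temp[i][j];
      }
      printf("A[N/2][N/2] after %d steps: %f\n", NSTEPS, A[N/2][N/2]);
      return 0;
  }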

Delaunay Mesh Refinement
Iterative refinement to remove badly shaped triangles:

  while there are bad triangles do {
    pick a bad triangle;
    find its cavity;
    retriangulate the cavity;  // may create new bad triangles
  }

Don't-care non-determinism:
– the final mesh depends on the order in which bad triangles are processed
– applications do not care which mesh is produced
Data structure:
– graph in which nodes represent triangles and edges represent triangle adjacencies
Parallelism:
– bad triangles whose cavities do not overlap can be processed in parallel
– parallelism is dependent on runtime values; compilers cannot find this parallelism
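To show the shape of the computation only, here is a schematic worklist skeleton in C. It is not a real mesh refinement code: "triangles" are plain integer ids and retriangulation is a stub that randomly creates new work, but it illustrates why the amount of work, and hence the available parallelism, is only known at runtime.

  /* Schematic worklist skeleton (NOT real Delaunay refinement): processing
     one work item may create new work items, so neither the iteration count
     nor the parallelism can be determined at compile time. */
  #include <stdio.h>
  #include <stdlib.h>

  #define MAX_WORK 1024

  static int worklist[MAX_WORK];
  static int top = 0;

  static void push(int tri) { if (top < MAX_WORK) worklist[top++] = tri; }
  static int  pop(void)     { return worklist[--top]; }

  /* Stub for "find cavity + retriangulate": with some probability the new
     triangles are themselves bad and go back on the worklist. */
  static void retriangulate(int tri, int *next_id) {
      (void)tri;                /* a real code would use the triangle's geometry */
      if (rand() % 4 == 0)
          push((*next_id)++);
  }

  int main(void) {
      int next_id = 100;
      for (int i = 0; i < 10; i++) push(i);   /* initial bad triangles */

      int processed = 0;
      while (top > 0) {                 /* don't-care non-determinism: any order works */
          int tri = pop();              /* pick a bad triangle */
          retriangulate(tri, &next_id); /* may add new bad triangles */
          processed++;
      }
      printf("processed %d triangles\n", processed);
      return 0;
  }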