Introduction to Scientific Computing on Linux Clusters Doug Sondak Linux Clusters and Tiled Display Walls July 30 – August 1, 2002

Outline: Why Clusters?; Parallelization (example: Game of Life); performance metrics; Ways to Fool the Masses; summary

Why Clusters? Scientific computing has traditionally been performed on fast, specialized machines. Buzzword - Commodity Computing –clustering cheap, off-the-shelf processors –can achieve good performance at a low cost if the applications scale well

Clusters (2) 102 clusters in the current Top 500 list. Reasonable parallel efficiency is the key. Generally use message passing, even if there are shared-memory CPUs in each box

Compilers Linux Fortran compilers (F90/95) –available from many vendors, e.g., Absoft, Compaq, Intel, Lahey, NAG, Portland Group, Salford –g77 is free, but it is restricted to Fortran 77 and is relatively slow

Compilers (2) Intel offers a free, unsupported Fortran compiler for non-commercial purposes –full F95 –OpenMP compilers/f60l/noncom.htm

Compilers (3)

Compilers (4) Linux C/C++ compilers –gcc/g++ seems to be the standard, usually described as a good compiler –also available from vendors, e.g., Compaq, Intel, Portland Group

Parallelization of Scientific Codes

Domain Decomposition Typically perform operations on arrays –e.g., setting up and solving a system of equations. Domain decomposition: –arrays are broken into chunks, and each chunk is handled by a separate processor –processors operate simultaneously on their own chunks of the array
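A minimal sketch of one-dimensional domain decomposition with MPI in C; the global size N, the per-chunk work, and the final reduction are placeholders for illustration, not part of the original slides.

    #include <mpi.h>
    #include <stdio.h>

    #define N 1000000                 /* global problem size (illustrative) */

    int main(int argc, char **argv)
    {
        int rank, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* each processor owns a contiguous chunk of the global index range */
        int chunk = N / nprocs;
        int ilo = rank * chunk;
        int ihi = (rank == nprocs - 1) ? N : ilo + chunk;  /* last rank takes the remainder */

        /* all processors operate on their own chunks simultaneously */
        double local_sum = 0.0;
        for (int i = ilo; i < ihi; i++)
            local_sum += (double)i;                        /* stand-in for real work */

        /* combine the partial results */
        double global_sum = 0.0;
        MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("sum = %f\n", global_sum);

        MPI_Finalize();
        return 0;
    }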

Other Methods Parallelization is also possible without domain decomposition –less common –e.g., process one set of inputs while reading another set of inputs from a file

Embarrassingly Parallel If operations are completely independent of one another, this is called embarrassingly parallel –e.g., initializing an array –some Monte Carlo simulations –not usually the case
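A sketch of an embarrassingly parallel Monte Carlo estimate of pi: each processor draws its own samples and communicates nothing until the final tally. The sample count and the simplistic seeding are illustrative assumptions.

    #include <mpi.h>
    #include <stdlib.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        long nsamples = 1000000;        /* samples per processor (illustrative) */
        srand(1234 + rank);             /* crude per-processor seeding, for illustration only */

        long hits = 0;
        for (long i = 0; i < nsamples; i++) {
            double x = (double)rand() / RAND_MAX;
            double y = (double)rand() / RAND_MAX;
            if (x * x + y * y <= 1.0) hits++;   /* point falls inside the quarter circle */
        }

        /* the only communication: add up the independent tallies */
        long total_hits = 0;
        MPI_Reduce(&hits, &total_hits, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("pi is approximately %f\n",
                   4.0 * total_hits / (nsamples * (double)nprocs));

        MPI_Finalize();
        return 0;
    }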

Game of Life An early, simple cellular automaton –created by John Conway 2-D grid of cells –each has one of 2 states (“alive” or “dead”) –cells are initialized with some distribution of alive and dead states

Game of Life (2) At each time step, states are modified based on the states of adjacent cells (including diagonals) Rules of the game: –3 alive neighbors - alive –2 alive neighbors - no change –other - dead
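A serial sketch of one time step implementing these rules, assuming the n-by-n grid is stored with an extra one-cell border so neighbor indexing never runs off the array; the array names and border handling are illustrative.

    /* One Game of Life time step on an n-by-n grid stored with a one-cell
       border; border cells are simply held dead in this serial sketch. */
    void life_step(int n, int cur[n + 2][n + 2], int next[n + 2][n + 2])
    {
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= n; j++) {
                /* count the 8 neighbors, including diagonals */
                int alive = 0;
                for (int di = -1; di <= 1; di++)
                    for (int dj = -1; dj <= 1; dj++)
                        if (di != 0 || dj != 0)
                            alive += cur[i + di][j + dj];

                if (alive == 3)      next[i][j] = 1;           /* 3 alive neighbors: alive */
                else if (alive == 2) next[i][j] = cur[i][j];   /* 2 alive neighbors: no change */
                else                 next[i][j] = 0;           /* otherwise: dead */
            }
        }
    }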

Game of Life (3)

Game of Life (4) Parallelize on 2 processors –assign a block of columns to each processor Problem - what happens at the split between the two blocks?

Game of Life (5) Solution - overlap cells Each time step, pass the overlap data from processor to processor
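A sketch of the overlap exchange with MPI, assuming each processor stores its block of columns contiguously (column by column) with one ghost column on each side; the function and variable names are illustrative, not from the slides.

    #include <mpi.h>

    /* Exchange one overlap column with each neighboring processor (a sketch).
       Each processor holds columns 1..ncols_local of its block plus ghost
       columns 0 and ncols_local+1; a column of nrows ints is contiguous. */
    void exchange_overlap(int nrows, int ncols_local, int grid[][nrows],
                          int rank, int nprocs)
    {
        int left  = (rank > 0)          ? rank - 1 : MPI_PROC_NULL;
        int right = (rank < nprocs - 1) ? rank + 1 : MPI_PROC_NULL;

        /* send last owned column to the right neighbor, receive left ghost column */
        MPI_Sendrecv(grid[ncols_local], nrows, MPI_INT, right, 0,
                     grid[0],           nrows, MPI_INT, left,  0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* send first owned column to the left neighbor, receive right ghost column */
        MPI_Sendrecv(grid[1],               nrows, MPI_INT, left,  1,
                     grid[ncols_local + 1], nrows, MPI_INT, right, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }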

Message Passing The largest bottleneck to good parallel efficiency is usually message passing –much slower than number crunching Set up your algorithm to minimize message passing –minimize the surface-to-volume ratio of the subdomains
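As a worked example of the surface-to-volume idea, assume an N-by-N grid of cells spread over p processors, with one layer of overlap data exchanged across each internal boundary (these numbers are illustrative, not from the slides):

    % communication ("surface") per processor relative to computation ("volume")
    \text{strips of columns:}\quad
      \frac{\text{surface}}{\text{volume}} \approx \frac{2N}{N^2/p} = \frac{2p}{N}
    \qquad
    \text{square blocks:}\quad
      \frac{\text{surface}}{\text{volume}} \approx \frac{4N/\sqrt{p}}{N^2/p} = \frac{4\sqrt{p}}{N}

For p > 4 the square blocks exchange less data per processor; the general rule is to cut the domain so that the interfaces between subdomains are as short as possible, which is also the point of the two-processor example on the next slide.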

Domain Decomp. For this domain: To run on 2 processors, decompose like this: Not like this:

How to Pass Msgs. MPI is the recommended method –PVM may also be used MPICH –most common –free download Others also available, e.g., LAM
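For reference, a minimal MPI point-to-point message in C (a sketch; the value sent, the tag, and the two-process layout are arbitrary choices for illustration).

    #include <mpi.h>
    #include <stdio.h>

    /* processor 0 sends one value to processor 1; run with at least 2 processes */
    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            double x = 3.14;
            MPI_Send(&x, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            double x;
            MPI_Recv(&x, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("received %f\n", x);
        }

        MPI_Finalize();
        return 0;
    }

With MPICH or LAM this would typically be built with an MPI wrapper compiler such as mpicc and launched with mpirun.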

How to Pass Msgs. (2) Some MPI tutorials –Boston University –NCSA

Performance

Code Timing How well has the code been parallelized? CPU time vs. wallclock time –both are seen in the literature –I prefer wallclock, but only for dedicated processors –CPU time doesn’t account for load imbalance Timers: the unix time command, the Fortran system_clock subroutine, MPI_Wtime
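A minimal sketch of wallclock timing with MPI_Wtime; the barriers are an added assumption (they make load imbalance show up in the measured time), and the timed section is a placeholder.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);        /* start everyone together */
        double t0 = MPI_Wtime();            /* wallclock time in seconds */

        /* ... section of code being timed goes here ... */

        MPI_Barrier(MPI_COMM_WORLD);        /* wait for the slowest processor */
        double t1 = MPI_Wtime();

        if (rank == 0)
            printf("wallclock time: %f s\n", t1 - t0);

        MPI_Finalize();
        return 0;
    }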

Parallel Speedup Quantifies how well we have parallelized our code: S_n = parallel speedup, n = number of processors, T_1 = time on 1 processor, T_n = time on n processors
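In this notation, the standard definition is

    S_n = \frac{T_1}{T_n}

and the ideal value is S_n = n.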

Parallel Speedup (2)

Parallel Efficiency η_n = parallel efficiency, T_1 = time on 1 processor, T_n = time on n processors, n = number of processors
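With the same symbols, parallel efficiency is the speedup per processor:

    \eta_n = \frac{T_1}{n\,T_n} = \frac{S_n}{n}

Perfect scaling corresponds to η_n = 1.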

Parallel Efficiency (2)

Parallel Efficiency (3) What is a “reasonable” level of parallel efficiency? Depends on –how much CPU time you have available –when the paper is due Can think of (1 - η_n) as “wasted” CPU time My personal rule of thumb: ~60%

Parallel Efficiency (4) Superlinear speedup –parallel efficiency > 1.0 –sometimes quoted in the literature –generally attributed to cache effects: the subdomains fit entirely in cache while the entire domain does not –this is very problem dependent –be suspicious!

Amdahl’s Law There are always some operations which are performed serially We want a large fraction of the code to execute in parallel

Amdahl’s Law (2) Let the fraction of the run time spent in code that executes serially be denoted s Let the fraction spent in code that executes in parallel be denoted p

Amdahl’s Law (3) Noting that p = (1 - s), the parallel speedup is given by Amdahl’s Law:
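Assuming the parallel fraction scales perfectly across n processors, the time on n processors is T_n = (s + (1-s)/n) T_1, so

    S_n = \frac{T_1}{T_n} = \frac{1}{s + \dfrac{1-s}{n}}

and no matter how many processors are used, the speedup is bounded by 1/s.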

Amdahl’s Law (4) The parallel efficiency gives an alternate version of Amdahl’s Law:
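Dividing the speedup by n:

    \eta_n = \frac{S_n}{n} = \frac{1}{n\,s + (1-s)}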

Amdahl’s Law (5)

Amdahl’s Law (6) Should we despair? –No! –bigger machines solve bigger problems, which tend to have a smaller value of s –if you want to run on a large number of processors, try to minimize s

Ways to Fool the Masses Full title: “Twelve Ways to Fool the Masses When Giving Performance Results on Parallel Computers” Created by David Bailey of NASA Ames in 1991 The following is a selection of the “ways,” some paraphrased

Ways to Fool (2) Scale the problem size with the number of processors Project results linearly –e.g., 2 proc., 1 hr, projected linearly to 7200 proc., 1 sec. Present the performance of a kernel and represent it as the performance of the application

Ways to Fool (3) Compare with old code on an obsolete system Quote MFLOPS based on the parallel implementation, not the best serial implementation –this increases the number of operations rather than decreasing the time

Ways to Fool (4) Quote parallel speedup while making sure the single-processor version is slow Mutilate the algorithm used in the parallel implementation to match the architecture –e.g., explicit vs. implicit PDE solvers Measure parallel times on a dedicated system, serial times in a busy environment

Ways to Fool (5) If all else fails, show pretty pictures and animated videos, and don’t talk about performance.

Summary Clusters are viable platforms for relatively low-cost scientific computing Parallel considerations are similar to those on other platforms MPI is a free, effective message-passing API Be careful with performance timings