Introduction to Scientific Computing Doug Sondak Boston University Scientific Computing and Visualization

Outline Introduction Software Parallelization Hardware

Introduction What is Scientific Computing? – Need for speed – Need for memory Simulations tend to grow until they overwhelm the available resources – If I can simulate 1000 neurons, wouldn’t it be cool if I could do 2000? 10,000? More? Example – flow over an airplane – It has been estimated that, even on a teraflop machine, resolving all scales of the flow would take about 200,000 years. If Homo erectus had had a teraflop machine, we could be getting the result right about now.

Introduction (cont’d) Optimization – Profile the serial (1-processor) code Tells you where most of the time is consumed – Is there any “low-hanging fruit”? Faster algorithm Optimized library Wasted operations Parallelization – Break the problem up into chunks – Solve the chunks simultaneously on different processors
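
As a concrete illustration (added here, not part of the original slides), one common way to profile a serial code on Linux is gprof; the program and file names below are hypothetical:

    # compile with profiling instrumentation (gcc shown; other compilers have similar flags)
    gcc -O2 -pg -o mysim mysim.c

    # run the program normally; this writes gmon.out in the current directory
    ./mysim

    # print a flat profile showing where the time was spent
    gprof mysim gmon.out | less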

Software

Compiler The compiler is your friend (usually) Optimizers are quite refined – Always try the highest optimization level Usually -O3 Sometimes -fast, -O5, … Loads of flags, many for optimization Good news – many compilers will automatically parallelize for shared-memory systems Bad news – this usually doesn’t work well
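
For example (a sketch with a hypothetical source file; the exact flags and what they enable vary by compiler, so check the man page):

    # GNU compiler, common aggressive-but-safe optimization level
    gcc -O3 -o mysim mysim.c

    # some vendor compilers accept shortcuts such as -fast or -O5;
    # verify what they turn on before relying on them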

Software Libraries – The solver is often a major consumer of CPU time – Numerical Recipes is a good book, but many of its algorithms are not optimal – LAPACK is a good resource – Libraries are often available that have been optimized for the local architecture Disadvantage of such tuned libraries – not portable
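
As an added illustration (not from the original slides; a sketch assuming a Fortran-style LAPACK interface, linked with something like -llapack), here is how a small linear system could be solved with LAPACK’s dgesv from C:

    /* Solve the 2x2 system A x = b with LAPACK's dgesv. */
    #include <stdio.h>

    /* Fortran LAPACK routine: solves A*X = B for a general square matrix A */
    extern void dgesv_(int *n, int *nrhs, double *a, int *lda,
                       int *ipiv, double *b, int *ldb, int *info);

    int main(void)
    {
        int n = 2, nrhs = 1, lda = 2, ldb = 2, ipiv[2], info;

        /* A stored column-major (Fortran order): A = [2 1; 1 3] */
        double a[4] = { 2.0, 1.0,    /* first column  */
                        1.0, 3.0 };  /* second column */
        double b[2] = { 3.0, 5.0 };  /* right-hand side, overwritten with x */

        dgesv_(&n, &nrhs, a, &lda, ipiv, b, &ldb, &info);

        if (info == 0)
            printf("x = (%f, %f)\n", b[0], b[1]);
        else
            printf("dgesv failed, info = %d\n", info);
        return 0;
    }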

Parallelization

Divide and conquer! – divide operations among many processors – perform operations simultaneously – if serial run takes 10 hours and we hit the problem with 5000 processors, it should take about 7 seconds to complete, right? not so easy, of course

Parallelization (cont’d) problem – some calculations depend upon previous calculations – can’t be performed simultaneously – sometimes tied to the physics of the problem, e.g., time evolution of a system want to maximize amount of parallel code – occasionally easy – usually requires some work

Parallelization (3) method used for parallelization may depend on hardware distributed memory – each processor has own address space – if one processor needs data from another processor, must be explicitly passed shared memory – common address space – no message passing required

Parallelization (4) [Diagram of three memory organizations for four processors: distributed memory – proc 0–3 each have their own memory (mem 0–3); shared memory – all four processors share a single memory; mixed memory – groups of processors share a memory (proc 0–1 on mem 0, proc 2–3 on mem 1).]

Parallelization (5) MPI – works for both distributed and shared memory – portable – freely downloadable OpenMP – shared memory only – must be supported by the compiler (most do) – usually easier than MPI – can be implemented incrementally

MPI Computational domain is typically decomposed into regions – One region assigned to each processor Separate copy of program runs on each processor

MPI Discretized domain for solving flow over an airfoil A system of coupled PDEs is solved at each point

MPI Decomposed domain for 4 processors

MPI Since points depend on adjacent points, information must be transferred between processors after each iteration This is done with explicit calls in the source code
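
A minimal sketch of such an exchange (added for illustration, not from the original slides; the array layout, neighbor ranks, and tag are hypothetical), using MPI_Sendrecv to swap one boundary value with each neighbor in a 1-D decomposition:

    /* Exchange ghost cells after each iteration. u[0] and u[N+1] are ghost
     * cells; u[1]..u[N] are owned by this process. Processes at the ends of
     * the domain can pass MPI_PROC_NULL as the missing neighbor. */
    #include <mpi.h>

    void exchange_boundaries(double *u, int N, int left, int right, MPI_Comm comm)
    {
        const int tag = 99;

        /* send my first owned value left, receive the right neighbor's
           first value into my right ghost cell */
        MPI_Sendrecv(&u[1],   1, MPI_DOUBLE, left,  tag,
                     &u[N+1], 1, MPI_DOUBLE, right, tag,
                     comm, MPI_STATUS_IGNORE);

        /* send my last owned value right, receive the left neighbor's
           last value into my left ghost cell */
        MPI_Sendrecv(&u[N], 1, MPI_DOUBLE, right, tag,
                     &u[0], 1, MPI_DOUBLE, left,  tag,
                     comm, MPI_STATUS_IGNORE);
    }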

MPI Diminishing returns – Sending messages can get expensive – Want to maximize the ratio of computation to communication Parallel speedup: S(n) = T(1) / T(n) Parallel efficiency: E(n) = S(n) / n where T = run time and n = number of processors
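
As a worked illustration with hypothetical numbers (not from the original slides): if the serial run takes T(1) = 600 minutes and the same job on n = 16 processors takes T(16) = 50 minutes, then the speedup is S = 600 / 50 = 12 and the efficiency is E = 12 / 16 = 0.75, i.e. 75% – communication and any remaining serial work account for the rest.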

Speedup

Parallel Efficiency

OpenMP Usually loop-level parallelization An OpenMP directive is placed in the source code before the loop – Assigns a subset of the loop indices to each processor – No message passing is needed since each processor can “see” the whole domain

    for(i=0; i<N; i++){
       /* do lots of stuff */
    }
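
A minimal sketch of what such a directive looks like in C (added here, not from the original slides; the loop body is hypothetical, and the compiler must be given its OpenMP flag, e.g. -fopenmp for gcc):

    /* split the loop iterations among the available threads */
    void scale(double *x, double *y, int N, double c)
    {
        int i;

        #pragma omp parallel for
        for (i = 0; i < N; i++) {
            y[i] = c * x[i];   /* iterations are independent, so this is safe */
        }
    }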

OpenMP Can’t guarantee the order of operations Example of how to do it wrong: parallelize the second loop on 2 processors

    for(i = 0; i < 7; i++)
       a[i] = 1;

    for(i = 1; i < 7; i++)      /* a[i] depends on a[i-1] from the previous iteration */
       a[i] = 2*a[i-1];

[Table on the original slide: i, a[i] (serial), a[i] (parallel), with the indices split between Proc. 0 and Proc. 1.] Serially, a becomes 1, 2, 4, 8, 16, 32, 64. In parallel, the processor handling the upper indices may read its first a[i-1] before the other processor has updated it, so the results are wrong.
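
One way to remove the dependence in this particular case (a sketch added for illustration, not from the original slides) is to compute each element directly, making the iterations independent:

    void fill(int *a)
    {
        int i;

        /* each a[i] equals 2^i, so no iteration depends on another */
        #pragma omp parallel for
        for (i = 0; i < 7; i++)
            a[i] = 1 << i;     /* 1, 2, 4, 8, 16, 32, 64 */
    }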

Hardware

A faster processor is obviously good, but: – Memory access speed is often a big driver Cache – a critical element of the memory system Processors have internal parallelism such as pipelines and multiply-add instructions

Cache Cache is a small chunk of fast memory between the main memory and the registers [Diagram of the memory hierarchy: registers – primary cache – secondary cache – main memory]

Cache (cont’d) Variables are moved from main memory to cache in lines – L1 cache line sizes on our machines: Opteron (blade cluster) 64 bytes, Power4 (p-series) 128 bytes, PPC440 (Blue Gene) 32 bytes, Pentium III (Linux cluster) 32 bytes If variables are used repeatedly, the code will run faster, since cache memory is much faster than main memory
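
To make the line sizes concrete (an added illustration, not on the original slide): a double is 8 bytes, so a 64-byte Opteron line holds 8 consecutive doubles, a 128-byte Power4 line holds 16, and the 32-byte PPC440 and Pentium III lines hold 4. Sequential access to a double array therefore costs roughly one cache miss per 8, 16, or 4 elements, respectively.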

Cache (cont’d) Why not just make the main memory out of the same stuff as cache? – Expensive – Runs hot – This was actually done in Cray computers, which used a liquid cooling system

Cache (cont’d) Cache hit – Required variable is in cache Cache miss – Required variable not in cache – If cache is full, something else must be thrown out (sent back to main memory) to make room – Want to minimize number of cache misses

Cache example Main memory holds … x[0] x[1] x[2] x[3] x[4] x[5] x[6] x[7] x[8] x[9] a b … and a “mini” cache holds 2 lines of 4 words each The loop to be executed:

    for(i=0; i<10; i++)
       x[i] = i;

Cache example (cont’d) (We will ignore the loop variable i for simplicity.) We need x[0], which is not in cache – a cache miss – so the line x[0] x[1] x[2] x[3] is loaded from memory into the cache The next 3 loop indices result in cache hits

Cache example (cont’d) Next we need x[4], which is not in cache – another cache miss – so the line x[4] x[5] x[6] x[7] is loaded into the second cache line The next 3 loop indices result in cache hits

Cache example (cont’d) Next we need x[8], which is not in cache – a cache miss – but now there is no room in the cache! An old line must be thrown out, so x[0] x[1] x[2] x[3] is replaced by the new line x[8] x[9] a b

Cache (cont’d) Contiguous access is important In C, a multidimensional array is stored in memory row by row: a[0][0] a[0][1] a[0][2] …

Cache (cont’d) In Fortran and Matlab, a multidimensional array is stored the opposite way, column by column: a(1,1) a(2,1) a(3,1) …

Cache (cont’d) Rule: Always order your loops appropriately

C:
    for(i=0; i<N; i++){
       for(j=0; j<N; j++){
          a[i][j] = 1.0;
       }
    }

Fortran:
    do j = 1, n
       do i = 1, n
          a(i,j) = 1.0
       enddo
    enddo
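
To see the effect yourself (a sketch added for illustration, not from the original slides; the array size and timing method are arbitrary), the two C loop orders can be timed:

    /* Compare row-order vs. column-order initialization of a matrix.
     * On most systems the row-order version (contiguous access in C)
     * is noticeably faster for large N. */
    #include <stdio.h>
    #include <time.h>

    #define N 4000

    static double a[N][N];

    int main(void)
    {
        int i, j;
        clock_t t0, t1;

        t0 = clock();
        for (i = 0; i < N; i++)        /* contiguous: cache-friendly */
            for (j = 0; j < N; j++)
                a[i][j] = 1.0;
        t1 = clock();
        printf("row order:    %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);

        t0 = clock();
        for (j = 0; j < N; j++)        /* strided: cache-unfriendly */
            for (i = 0; i < N; i++)
                a[i][j] = 1.0;
        t1 = clock();
        printf("column order: %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);

        return 0;
    }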

SCF Machines

p-series Shared memory IBM Power4 processors 32 KB L1 cache per processor 1.41 MB L2 cache per pair of processors 128 MB L3 cache per 8 processors

p-series

machine     Proc. spd   # procs   memory
kite        1.3 GHz     32        32 GB
pogo        1.3 GHz     32        32 GB
frisbee     1.3 GHz     32        32 GB
domino      1.3 GHz     16        16 GB
twister     1.1 GHz     8         16 GB
scrabble    1.1 GHz     8         16 GB
marbles     1.1 GHz     8         16 GB
crayon      1.1 GHz     8         16 GB
litebrite   1.1 GHz     8         16 GB
hotwheels   1.1 GHz     8         16 GB

Blue Gene Distributed memory 2048 processors – 1024 two-processor nodes IBM PowerPC 440 processors – 700 MHz 512 MB memory per node (per 2 processors) 32 KB L1 cache per node 2 MB L2 cache per node 4 MB L3 cache per node

BladeCenter Hybrid memory 56 processors – 14 4-processor nodes AMD Opteron processors – 2.6 GHz 8 GB memory per node (per 4 processors) – Each node has shared memory 64 KB L1 cache per 2 processors 1 MB L2 cache per 2 processors

Linux Cluster Hybrid memory 104 processors – 52 2-processor nodes Intel Pentium III processors – 1.3 GHz 1 GB memory per node (per 2 processors) – Each node has shared memory 16 KB L1 cache per 2 processors 512 KB L2 cache per 2 processors

For More Information SCV web site Today’s presentations are available on the SCV web site under the title “Introduction to Scientific Computing and Visualization”

Next Time G & T code Time it – Look at the effect of compiler flags Profile it – Where is the time consumed? Modify it to improve serial performance