Parallel Computing: Overview
John Urbanic
April 23, 2002

Introduction to Parallel Computing
Why we need parallel computing
How such machines are built
How we actually use these machines

New Applications

Clock Speeds

Clock Speeds
When the PSC went from a 2.7 GFlop Y-MP to a 16 GFlop C90, the clock only got 50% faster. The rest of the speed increase was due to increased use of parallel techniques:
More processors (8 → 16)
Longer vector pipes (64 → 128)
Parallel functional units (2)
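Checking the arithmetic (an inference from the numbers on the slide, not stated there), the individual factors multiply out to roughly the observed jump in peak speed:

\frac{16\ \text{GFlop}}{2.7\ \text{GFlop}} \approx 6 \approx \underbrace{1.5}_{\text{clock}} \times \underbrace{2}_{\text{processors }(8 \to 16)} \times \underbrace{2}_{\text{vector length }(64 \to 128)}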

Clock Speeds
So, we want as many processors working together as possible. How do we do this? There are two distinct elements:
Hardware – the vendor does this
Software – you, at least today

Amdahl’s Law
How many processors can we really use? Let’s say we have a legacy code such that it is only feasible to convert half of the heavily used routines to parallel:

Amdahl’s Law
If we run this on a parallel machine with five processors, our code now takes about 60s. We have sped it up by about 40%. Let’s say we use a thousand processors: we have now sped up our code by about a factor of two.
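For concreteness, here is the arithmetic behind those figures (assuming a 100 s serial runtime and a parallel fraction p = 0.5, which is consistent with the slide's "about 60s" and "about 40%"):

T(N) = (1 - p)\,T_1 + \frac{p\,T_1}{N}, \qquad S(N) = \frac{T_1}{T(N)} = \frac{1}{(1 - p) + p/N}

T(5) = 50\,\text{s} + 10\,\text{s} = 60\,\text{s}, \qquad T(1000) = 50\,\text{s} + 0.05\,\text{s} \approx 50\,\text{s} \;\Rightarrow\; S(1000) \approx 2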

Amdahl’s Law
This seems pretty depressing, and it does point out one limitation of converting old codes one subroutine at a time. However, most new codes, and almost all parallel algorithms, can be written almost entirely in parallel (usually, the “start up” or initial input I/O code is the exception), resulting in significant practical speedups. This can be quantified by how well a code scales, which is often measured as efficiency.
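The usual definitions (standard, though not written out on the slide) are:

S(N) = \frac{T_1}{T_N}, \qquad E(N) = \frac{S(N)}{N}

so the thousand-processor Amdahl example above achieves S ≈ 2 but E ≈ 0.002; a code that scales well keeps E close to 1 as N grows.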

Shared Memory
Easiest to program. There are no real data distribution or communication issues. Why doesn’t everyone use this scheme?
Limited number of processors (tens) – only so many processors can share the same bus before conflicts dominate.
Limited memory size – memory shares the bus as well. Accessing one part of memory will interfere with access to other parts.

Distributed Memory
Number of processors only limited by physical size (tens of meters). Memory only limited by the number of processors times the maximum memory per processor (very large). However, physical packaging usually dictates no local disk per node and hence no virtual memory. Since local and remote data have very different access times, data distribution is very important. We must minimize communication.
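The slides show no code, but it may help to see what "communication" means here. A minimal sketch, assuming MPI (which is not named on these slides) and a hypothetical nearest-neighbor exchange: each node can use its own slice of the data directly, while any remote value has to be requested as an explicit message.

! Distributed-memory sketch (illustrative; MPI and the variable names are
! assumptions, not taken from the slides).
program halo
  implicit none
  include 'mpif.h'
  integer :: rank, nprocs, left, right, ierr
  integer :: status(MPI_STATUS_SIZE)
  real :: local(1000), edge

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

  local = real(rank)                        ! each node owns only its local slice
  left  = mod(rank - 1 + nprocs, nprocs)    ! neighbors in a ring of nodes
  right = mod(rank + 1, nprocs)

  ! Remote data costs a message: send our boundary element to the right
  ! neighbor and receive the left neighbor's boundary element.
  call MPI_Sendrecv(local(1000), 1, MPI_REAL, right, 0, &
                    edge,        1, MPI_REAL, left,  0, &
                    MPI_COMM_WORLD, status, ierr)
  print *, 'rank', rank, 'received', edge

  call MPI_Finalize(ierr)
end program halo

Because every such exchange pays the network's latency, fewer and larger messages are better – which is the sense in which we must minimize communication.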

Common Distributed Memory Machines
CM-2
CM-5
T3E
Workstation Cluster
SP3
TCS

Common Distributed Memory Machines
While the CM-2 is SIMD (one instruction unit for multiple processors), all the new machines are MIMD (multiple instructions for multiple processors) and based on commodity processors:
SP-2 – POWER2
CM-5 – SPARC
T3E – Alpha
Workstations – your pick
TCS – Alpha
Therefore, the single most defining characteristic of any of these machines is probably the network.

Latency and Bandwidth
Even with the "perfect" network we have here, performance is determined by two more quantities that, together with the topologies we'll look at, pretty much define the network: latency and bandwidth.
Latency can nicely be defined as the time required to send a message with 0 bytes of data. This number often reflects either the overhead of packing your data into packets, or the delays in making intervening hops across the network between two nodes that aren't next to each other.
Bandwidth is the rate at which very large packets of information can be sent. If there were no latency, this would be the rate at which all data is transferred. It often reflects the physical capability of the wires and electronics connecting nodes.
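A standard first-order model (not written on the slide) ties the two together. For an n-byte message, with latency L and bandwidth B:

T(n) = L + \frac{n}{B}

so T(0) = L recovers the definition of latency, while for very large messages T(n) \approx n/B and the transfer rate is set by bandwidth alone.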

Token-Ring/Ethernet with Workstations

Complete Connectivity

Super Cluster / SP2

CM-2

Binary Tree

CM-5 Fat Tree

INTEL Paragon (2-D Mesh)

3-D Torus
The T3E has Global Addressing hardware, and this helps to simulate shared memory. Torus means that the “ends” are connected. This means A is really connected to B and the cube has no real boundary.
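In index terms (an illustration, not from the slide), the wraparound is just modular arithmetic. Along a dimension of size k:

\text{neighbors of node } i:\quad (i + 1) \bmod k \quad\text{and}\quad (i - 1 + k) \bmod k

so node 0 and node k − 1 are directly connected and no node sits on a boundary.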

TCS Fat Tree

Data Parallel
Only one executable. Do computation on arrays of data using array operators. Do communications using array shift or rearrangement operators. Good for array-oriented problems with static load balancing, and for SIMD machines.
Variants: FORTRAN 90, CM FORTRAN, HPF, C*, CRAFT
Strengths:
1. Scales transparently to different size machines
2. Easy debugging, as there is only one copy of code executing in a highly synchronized fashion
Weaknesses:
1. Much wasted synchronization
2. Difficult to balance load
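As a minimal sketch of the style (illustrative only; the slide itself contains no code), whole-array operators express the computation and a shift operator expresses the data movement:

! Minimal Fortran 90 data-parallel sketch (illustrative, not from the slides).
program dp_sketch
  implicit none
  real :: a(1000), b(1000), c(1000)
  b = 1.0
  c = 2.0
  a = b + c                 ! computation with whole-array operators
  a = cshift(a, shift=1)    ! communication expressed as an array shift operator
  print *, sum(a)
end program dp_sketch

The same statements run unchanged on machines of different sizes, which is the "scales transparently" strength listed above.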

Data Parallel – Cont’d
Data Movement in FORTRAN 90

Data Parallel – Cont’d
When to use Data Parallel:
– Very array-oriented programs: FEA, Fluid Dynamics, Neural Nets, Weather Modeling
– Very synchronized operations: Image processing, Math analysis

Work Sharing
Splits up tasks (as opposed to arrays in data parallel) such as loops amongst separate processors. Do computation on loops that are automatically distributed. Do communication as a side effect of data/loop distribution (not important on shared memory machines). If you have used CRAYs before, think of this as “advanced multitasking.” Good for shared memory implementations.
Strengths:
1. Directive based, so it can be added to existing serial codes
Weaknesses:
1. Limited flexibility
2. Efficiency dependent upon structure of existing serial code
3. May be very poor with distributed memory
Variants: CRAFT, Multitasking
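As a hedged illustration of the directive-based style (using OpenMP, which is not among the variants the slide names; CRAFT and Cray multitasking are), a serial loop becomes work-shared by adding a directive and leaving the loop body untouched:

! Work-sharing sketch using OpenMP directives (an assumption for
! illustration; the slide's variants are CRAFT and Cray multitasking).
program ws_sketch
  implicit none
  integer, parameter :: n = 100000
  integer :: i
  real :: a(n), b(n)
  b = 2.0
!$omp parallel do
  do i = 1, n
     a(i) = 2.0 * b(i)      ! iterations of this loop are split amongst processors
  end do
!$omp end parallel do
  print *, a(1), a(n)
end program ws_sketch

Compiled without OpenMP support, the directive is just a comment and the code runs serially, which is why this approach can be retrofitted onto existing serial codes.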

Work Sharing – Cont’d
When to use Work Sharing:
Very large / complex / old existing codes: Gaussian 90
Already multitasked codes: Charmm
Portability (directive based) (not recommended)

Load Balancing
An important consideration, which can be controlled by communication, is load balancing. Consider the case where a dataset is distributed evenly over 4 sites. Each site will run a piece of code which uses the data as input and attempts to find a convergence. It is possible that the data contained at sites 0, 2, and 3 may converge much faster than the data at site 1. If this is the case, the three sites which finished first will remain idle while site 1 finishes.
When attempting to balance the amount of work being done at each site, one must take into account the speed of the processing site, the communication "expense" of starting and coordinating separate pieces of work, and the amount of work required by various pieces of data. There are two forms of load balancing: static and dynamic.

Load Balancing – Cont’d
Static Load Balancing
In static load balancing, the programmer must make a decision and assign a fixed amount of work to each processing site a priori. Static load balancing can be used in either the Master-Slave (Host-Node) programming model or the "Hostless" programming model.

Load Balancing – Cont’d
Static Load Balancing yields good performance when:
– homogeneous cluster
– each processing site has an equal amount of work
Poor performance when:
– heterogeneous cluster where some processors are much faster (unless this is taken into account in the program design)
– work distribution is uneven
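A minimal sketch of the static, a priori assignment (illustrative only; the counts and variable names are hypothetical): each of nprocs sites gets a fixed, contiguous block of the n work items, decided before the run and never adjusted.

! Static load balancing sketch: work is divided into fixed blocks a priori
! (illustrative; n and nprocs are hypothetical values).
program static_lb
  implicit none
  integer, parameter :: n = 1000        ! total work items
  integer, parameter :: nprocs = 4      ! number of processing sites
  integer :: rank, lo, hi, chunk

  chunk = (n + nprocs - 1) / nprocs     ! items per site, rounded up
  do rank = 0, nprocs - 1
     lo = rank * chunk + 1
     hi = min((rank + 1) * chunk, n)
     print *, 'site', rank, 'is assigned items', lo, 'through', hi
  end do
end program static_lb

If the sites differ in speed or the items differ in cost, this fixed split is exactly where the poor-performance cases above come from.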

Load Balancing – Cont’d
Dynamic Load Balancing
Dynamic load balancing can be further divided into the categories:
– task-oriented: when one processing site finishes its task, it is assigned another task (this is the most commonly used form).
– data-oriented: when one processing site finishes its task before other sites, the site with the most work gives the idle site some of its data to process (this is much more complicated because it requires an extensive amount of bookkeeping).
Dynamic load balancing can be used only in the Master-Slave programming model.
Ideal for:
– codes where tasks are large enough to keep each processing site busy
– codes where work is uneven
– heterogeneous clusters
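A minimal sketch of the task-oriented form in the Master-Slave model (illustrative only: MPI is an assumption, the tags and the placeholder "work" are hypothetical, and it assumes there are at least as many tasks as workers). The master hands out the next task to whichever worker reports back first, so faster workers automatically end up doing more tasks.

! Task-oriented dynamic load balancing sketch, Master-Slave model.
! Illustrative only: MPI is an assumption, and the "work" is a placeholder.
! Assumes ntasks >= number of workers.
program master_worker
  implicit none
  include 'mpif.h'
  integer, parameter :: ntasks = 100
  integer :: rank, nprocs, ierr, status(MPI_STATUS_SIZE)
  integer :: task, next, worker, stop_sig
  real    :: result

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
  stop_sig = 0

  if (rank == 0) then
     ! Master: seed each worker with one task.
     next = 1
     do worker = 1, nprocs - 1
        call MPI_Send(next, 1, MPI_INTEGER, worker, 1, MPI_COMM_WORLD, ierr)
        next = next + 1
     end do
     ! As each result arrives, give that worker the next task, or tell it to stop.
     do task = 1, ntasks
        call MPI_Recv(result, 1, MPI_REAL, MPI_ANY_SOURCE, 2, &
                      MPI_COMM_WORLD, status, ierr)
        worker = status(MPI_SOURCE)
        if (next <= ntasks) then
           call MPI_Send(next, 1, MPI_INTEGER, worker, 1, MPI_COMM_WORLD, ierr)
           next = next + 1
        else
           call MPI_Send(stop_sig, 1, MPI_INTEGER, worker, 0, MPI_COMM_WORLD, ierr)
        end if
     end do
  else
     ! Worker: keep asking for work until the stop tag arrives.
     do
        call MPI_Recv(task, 1, MPI_INTEGER, 0, MPI_ANY_TAG, &
                      MPI_COMM_WORLD, status, ierr)
        if (status(MPI_TAG) == 0) exit
        result = real(task)**2              ! placeholder for the real work
        call MPI_Send(result, 1, MPI_REAL, 0, 2, MPI_COMM_WORLD, ierr)
     end do
  end if

  call MPI_Finalize(ierr)
end program master_worker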