CS591x – Cluster Computing and Parallel Programming
Parallel Computer Architecture and Software Models
It’s all about performance. Greater performance is the reason for parallel computing. Many types of scientific and engineering programs are too large and too complex for traditional uniprocessors. Such large problems are common in ocean modeling, weather modeling, astrophysics, solid state physics, power systems….
FLOPS – a measure of performance
FLOPS – Floating Point Operations per Second… a measure of how much computation can be done in a certain amount of time
MegaFLOPS – MFLOPS – 10^6 FLOPS
GigaFLOPS – GFLOPS – 10^9 FLOPS
TeraFLOPS – TFLOPS – 10^12 FLOPS
PetaFLOPS – PFLOPS – 10^15 FLOPS
How fast…
Cray 1 – ~150 MFLOPS
Pentium 4 – 3–6 GFLOPS
IBM’s BlueGene – 70+ TFLOPS
PSC’s Big Ben – 10 TFLOPS
Humans – it depends:
as calculators – 0.001 MFLOPS
as information processors – 10 PFLOPS
FLOPS vs. MIPS
FLOPS is only concerned with floating point calculations; other performance issues include memory latency, cache performance, I/O capacity…
See www.Top500.org – biannual performance reports and rankings of the fastest computers in the world
Performance
Speedup(n processors) = time(1 processor) / time(n processors) ** Culler, Singh and Gupta, Parallel Computer Architecture: A Hardware/Software Approach
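As a quick illustration (a minimal C sketch; the 1000-second and 125-second timings are made-up numbers, not measurements from the course):

    #include <stdio.h>

    /* Speedup(n processors) = time(1 processor) / time(n processors) */
    double speedup(double time_1, double time_n) {
        return time_1 / time_n;
    }

    int main(void) {
        /* illustrative timings: 1000 s on 1 processor, 125 s on 8 */
        printf("speedup = %.1f\n", speedup(1000.0, 125.0));  /* prints 8.0 */
        return 0;
    }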
Consider… [map of the Indian Ocean] (from: www.lib.utexas.edu/maps/indian_ocean.html)
… a model of the Indian Ocean – 73,000,000 square kilometers
One data point per 100 meters – 7,300,000,000 surface points
Model the ocean at depth – say every 10 meters down to 200 meters – 20 depth data points
Every 10 minutes for 4 hours – 24 time steps
So – 73 x 10^6 (sq. km of surface) x 10^2 (points per sq. km) x 20 (depth points per column) x 24 (time steps) = 3,504,000,000,000 data points in the model grid
Suppose 100 instructions per grid point – 350,400,000,000,000 instructions in the model
Then – imagine that you have a computer that can run 1 billion (10^9) instructions per second:
3.504 x 10^14 / 10^9 = 350,400 seconds, or about 97 hours
But – on a 10 teraflops (10^13 FLOPS) computer:
3.504 x 10^14 / 10^13 = 35.0 seconds
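The same arithmetic as a small C sketch (the constants come straight from the slides above):

    #include <stdio.h>

    int main(void) {
        double surface_km2  = 73.0e6;   /* Indian Ocean surface area (sq. km)       */
        double pts_per_km2  = 100.0;    /* one point per 100 m = 10 x 10 per sq. km */
        double depth_pts    = 20.0;     /* every 10 m down to 200 m                 */
        double time_steps   = 24.0;     /* every 10 minutes for 4 hours             */
        double instr_per_pt = 100.0;    /* instructions per grid point              */

        double grid_pts = surface_km2 * pts_per_km2 * depth_pts * time_steps;
        double instrs   = grid_pts * instr_per_pt;

        printf("grid points:  %.4g\n", grid_pts);             /* 3.504e+12        */
        printf("instructions: %.4g\n", instrs);               /* 3.504e+14        */
        printf("at 10^9 instr/s: %.0f s (%.1f hours)\n",
               instrs / 1e9, instrs / (1e9 * 3600.0));        /* 350400 s, 97.3 h */
        printf("at 10^13 FLOPS:  %.1f s\n", instrs / 1e13);   /* 35.0 s           */
        return 0;
    }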
Gaining performance: Pipelining
More instructions, faster – more instructions in execution at the same time in a single processor
Not usually an attractive strategy these days – why?
Instruction Level Parallelism (ILP)
Based on the fact that many instructions do not depend on the instructions before them…
The processor has extra hardware to execute several instructions at the same time… multiple adders…
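For example (a tiny C sketch with hypothetical values): the first three additions are mutually independent, so a processor with multiple adders can issue them simultaneously; the final addition depends on all three results and must wait.

    /* ILP illustration: the three sums below do not depend on each other,
       so hardware with several adders can execute them in the same cycle. */
    int sum_of_pairs(int b, int c, int e, int f, int h, int i) {
        int a = b + c;    /* independent                    */
        int d = e + f;    /* independent                    */
        int g = h + i;    /* independent                    */
        return a + d + g; /* depends on a, d, g – must wait */
    }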
Pipelining and ILP are not the solution to our problem – why?
They give only incremental improvements in performance, and they have largely been done already
We need orders-of-magnitude improvements in performance
Gaining Performance: Vector Processors
Scientific and engineering computations are often vector and matrix operations – e.g., graphic transformations, such as shifting object x to the right
Redundant arithmetic hardware and vector registers operate on an entire vector in one step (SIMD)
Gaining Performance: Vector Processors
Declining popularity for a while – hardware expensive
Popularity returning – applications in science, engineering, cryptography, media/graphics; the Earth Simulator
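A sketch of the kind of loop vector hardware is built for (illustrative; modern compilers also auto-vectorize loops like this for SIMD units):

    #define N 1024

    /* c = a + b: on a vector processor this whole loop collapses into a
       few vector-register operations instead of N separate scalar adds. */
    void vec_add(const float a[N], const float b[N], float c[N]) {
        for (int j = 0; j < N; j++)
            c[j] = a[j] + b[j];
    }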
Parallel Computer Architecture
Shared Memory Architectures
Distributed Memory Architectures
Shared Memory Systems
Multiple processors connected to/sharing the same pool of memory (SMP)
Every processor has, potentially, access to and control of every memory location (a short software sketch of this model follows the hardware examples below)
Shared Memory Computers [diagram: six processors sharing one memory]
Shared Memory Computers [diagram: multiple processors sharing memory]
Shared Memory Computer [diagram: processors connected to memory through a switch]
Shared Memory Computers
SGI Origin2000 (Balder) at NCSA – 256 250 MHz R10000 processors, 128 GB memory
Shared Memory Computers
Rachel at PSC – 64 1.15 GHz EV7 processors, 256 GB of shared memory
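On the software side, shared memory means every thread can read and write the same data directly. A minimal OpenMP sketch of that model (illustrative, not taken from the course materials; compile with a flag such as -fopenmp):

    #include <stdio.h>
    #include <omp.h>

    #define N 1000000

    int main(void) {
        static double x[N];
        double sum = 0.0;

        /* every thread works on the same shared array x */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            x[i] = i * 0.5;

        /* the reduction clause combines each thread's partial sum safely */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++)
            sum += x[i];

        printf("sum = %.0f\n", sum);
        return 0;
    }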
Distributed Memory Systems
Multiple processors, each with its own memory, interconnected to share/exchange data and processing
The modern architectural approach to supercomputers – supercomputers and clusters are similar (a short software sketch of this model follows the examples below)
Clusters – distributed memory [diagram: processors, each with its own memory, connected by an interconnect]
Cluster – Distributed Memory with SMP [diagram: nodes of two processors sharing a memory within each node, nodes connected by an interconnect]
Distributed Memory Supercomputer
BlueGene/L (DOE/IBM) – 32,768 0.7 GHz PowerPC 440 processors, 70 TFLOPS
Distributed Memory Supercomputer
Thunder at LLNL (number 5 on the Top500) – 4,096 1.4 GHz Itanium processors, 20 TFLOPS
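In the matching software model, each process computes on its own local memory and shares data only through explicit messages. A minimal MPI sketch (illustrative; assumes an MPI library is installed and the program is launched with mpirun):

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[]) {
        int rank, size, local, total;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which process am I? */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many processes? */

        local = rank;  /* each process computes in its own memory */

        /* data moves only by explicit messages, e.g. this reduction */
        MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum of ranks across %d processes = %d\n", size, total);

        MPI_Finalize();
        return 0;
    }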
Grid Computing Systems
What is a grid? It means different things to different people
Distributed processors – around campus, around the state, around the world
Grid Computing Systems Widely distributed Loosely connected (i.e. Internet) No central management
Grid Computing Systems
Connected clusters/other dedicated scientific computers (e.g., over I2/Abilene)
Grid Computing Systems – harvested idle cycles [diagram: idle machines connected via the Internet to a control/scheduler]
Grid Computing Systems
Dedicated grids – TeraGrid, Sabre, NASA Information Power Grid
Cycle-harvesting grids – Condor, (Parabon), SETI@home
* Global Grid Forum
Let’s revisit speedup… we can achieve speedup (theoretically) by using more processors, but a number of factors may limit speedup:
Interprocessor communication
Interprocess synchronization
Load balance
Amdahl’s Law
According to Amdahl’s Law: Speedup = 1 / (S + (1 - S)/N)
where S is the purely sequential part of the program and N is the number of processors
Amdahl’s Law – what does it mean?
Amdahl’s law says – part of a program is parallelizable, and part of the program must be sequential (S)
Speedup is constrained by the portion of the program that must remain sequential relative to the part that is parallelized
Note: if S is very small, the problem is “embarrassingly parallel”
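A small C sketch tabulating the formula (the 5% sequential fraction is an illustrative value): note that speedup can never exceed 1/S, no matter how many processors are added.

    #include <stdio.h>

    /* Amdahl's Law: speedup = 1 / (S + (1 - S) / N) */
    double amdahl(double s, int n) {
        return 1.0 / (s + (1.0 - s) / n);
    }

    int main(void) {
        double s = 0.05;                  /* 5% purely sequential */
        int n[] = {1, 10, 100, 1000};

        for (int i = 0; i < 4; i++)
            printf("N = %4d: speedup = %5.2f\n", n[i], amdahl(s, n[i]));
        /* prints 1.00, 6.90, 16.81, 19.63 – approaching 1/S = 20 */
        return 0;
    }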
Software models for parallel computing Shared Memory Distributed Memory Data Parallel
Flynn’s Taxonomy
Single Instruction/Single Data – SISD
Multiple Instruction/Single Data – MISD
Single Instruction/Multiple Data – SIMD
Multiple Instruction/Multiple Data – MIMD
Single Program/Multiple Data – SPMD (a common programming style under MIMD)
Next: Cluster Computer Architecture, Linux