INTRODUCTION-2 (Computer Engg, IIT(BHU), 3/12/2013)

Parallel Computing
● Parallel computing: the use of multiple computers or processors working together on a common task.
● Parallel computer: a computer that contains multiple processors:
➔ each processor works on its own section of the problem
➔ processors can exchange information with other processors

Parallel vs. Serial Computers
Two big advantages of parallel computers:
1. total performance
2. total memory
● Parallel computers enable us to solve problems that:
➔ benefit from, or require, a fast solution
➔ require large amounts of memory
➔ an example that requires both: weather forecasting

Parallel vs. Serial Computers
Some benefits of parallel computing include:
● more data points
➔ bigger domains
➔ better spatial resolution
➔ more particles
● more time steps
➔ longer runs
➔ better temporal resolution
● faster execution
➔ faster time to solution
➔ more solutions in the same time
➔ larger simulations in real time

Serial Processor Performance
Although Moore's Law is popularly taken to 'predict' that single-processor performance doubles every 18 months, eventually physical limits on manufacturing technology will be reached.

Types of Parallel Processor
The simplest and most useful way to classify modern parallel computers is by their memory model:
➔ shared memory
➔ distributed memory

Shared vs. Distributed Memory
● Shared memory: single address space. All processors have access to a pool of shared memory. (Ex: SGI Origin, Sun E10000)
● Distributed memory: each processor has its own local memory. Message passing must be used to exchange data between processors. (Ex: CRAY T3E, IBM SP, clusters)

Shared Memory
● Uniform memory access (UMA): each processor has uniform access time to memory. Also known as symmetric multiprocessors, or SMPs. (Ex: Sun E10000)
● Non-uniform memory access (NUMA): the time for a memory access depends on the location of the data; local access is faster than non-local access. Easier to scale than SMPs. (Ex: SGI Origin)

Distributed Memory
Processor-memory nodes are connected by some type of interconnect network.
➔ Massively Parallel Processor (MPP): tightly integrated, single system image.
➔ Cluster: individual computers tied together by software.

Processor, Memory & Network
Both shared- and distributed-memory systems have:
➔ processors: now generally commodity RISC processors
➔ memory: now generally commodity DRAM
➔ network/interconnect: between the processors and memory (bus, crossbar, fat tree, torus, hypercube, etc.)

Processor-Related Terms
● Clock period (cp): the minimum time interval between successive actions in the processor. Fixed; depends on the design of the processor. Measured in nanoseconds (~1-5 ns for the fastest processors). The inverse of the clock frequency (MHz); a small worked example follows this list.
● Instruction: an action executed by a processor, such as a mathematical operation or a memory operation.
● Register: a small, extremely fast location for storing data or instructions in the processor.
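As a quick illustration of the clock-period/frequency relationship (an added example, not from the original slide): a 500 MHz processor has a clock period of 1 / (500 x 10^6 Hz) = 2 ns. A minimal C sketch of the conversion:

    #include <stdio.h>

    /* Clock period is the inverse of clock frequency:
       period_ns = 1000 / freq_MHz, since 1 MHz = 1e6 Hz and 1 ns = 1e-9 s. */
    double clock_period_ns(double freq_mhz) {
        return 1000.0 / freq_mhz;
    }

    int main(void) {
        printf("500 MHz  -> %.2f ns\n", clock_period_ns(500.0));   /* 2.00 ns */
        printf("1000 MHz -> %.2f ns\n", clock_period_ns(1000.0));  /* 1.00 ns */
        return 0;
    }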

Processor-Related Terms
● Functional Unit (FU): a hardware element that performs an operation on an operand or pair of operands. Common FUs are ADD, MULT, INV, SQRT, etc.
● Pipeline: a technique enabling multiple instructions to be overlapped in execution.
● Superscalar: multiple instructions can be issued per clock period.
● Flops: floating-point operations per second (a rough peak-flops calculation follows this list).
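These terms combine into a rough peak-performance estimate: peak flops = processors x clock frequency x floating-point operations per clock. The sketch below is an added illustration with made-up machine parameters, not figures from the slides.

    #include <stdio.h>

    /* Rough peak-performance estimate:
       peak flops = processors * clock frequency (Hz) * flops per cycle.
       All numbers below are hypothetical, chosen only to show the arithmetic. */
    int main(void) {
        double processors      = 16;     /* number of processors           */
        double clock_hz        = 500e6;  /* 500 MHz clock                  */
        double flops_per_cycle = 2;      /* e.g. a fused multiply-add unit */

        double peak = processors * clock_hz * flops_per_cycle;
        printf("Peak: %.1f Gflops\n", peak / 1e9);   /* 16.0 Gflops */
        return 0;
    }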

Processor-Related Terms
● Cache: fast memory (SRAM) near the processor. Helps keep instructions and data close to the functional units so the processor can execute more instructions more rapidly.
● Translation Lookaside Buffer (TLB): keeps the addresses of pages (blocks of memory) in main memory that have recently been accessed (a cache for memory addresses).

Memory-Related Terms
● SRAM: Static Random Access Memory. Very fast (~10 nanoseconds); made using the same kind of circuitry as processors, so its speed is comparable.
● DRAM: Dynamic RAM. Longer access times (~100 nanoseconds), but holds more bits and is much less expensive (~10x cheaper).
● Memory hierarchy: the hierarchy of memory in a parallel system, from registers to cache to local memory to remote memory. More on this later (a small cache-effect demo follows this list).
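To make the memory hierarchy concrete, here is a small added C experiment (hypothetical, not from the slides): summing a large array with stride 1 reuses cache lines, while a large stride touches a new line on almost every access and typically runs noticeably slower.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (1 << 24)   /* 16M ints: larger than typical caches */

    /* Sum every element, visiting them with the given stride; a large
       stride touches a new cache line on almost every access. */
    long sum_with_stride(const int *a, int stride) {
        long s = 0;
        for (int start = 0; start < stride; start++)
            for (int i = start; i < N; i += stride)
                s += a[i];
        return s;
    }

    int main(void) {
        int *a = malloc(N * sizeof *a);
        for (int i = 0; i < N; i++) a[i] = 1;

        for (int stride = 1; stride <= 16; stride *= 4) {
            clock_t t0 = clock();
            long s = sum_with_stride(a, stride);
            double ms = 1000.0 * (clock() - t0) / CLOCKS_PER_SEC;
            printf("stride %2d: sum=%ld, %.1f ms\n", stride, s, ms);
        }
        free(a);
        return 0;
    }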

Interconnect-Related Terms
● Latency:
➔ Networks: how long does it take to start sending a "message"? Measured in microseconds.
➔ Processors: how long does it take to output the results of an operation (such as a floating-point add or divide) that is pipelined?
● Bandwidth: what data rate can be sustained once the message is started? Measured in Mbytes/sec or Gbytes/sec (a simple cost model combining latency and bandwidth follows this list).
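A standard first-order cost model ties these two terms together: the time to send n bytes is roughly latency + n / bandwidth. The following C sketch is an added illustration using hypothetical latency and bandwidth values.

    #include <stdio.h>

    /* First-order message cost model: time = latency + bytes / bandwidth.
       Since 1 Mbyte/sec equals 1 byte per microsecond, bytes divided by a
       bandwidth given in Mbytes/sec comes out directly in microseconds. */
    double message_time_us(double bytes, double latency_us, double bw_mbytes_s) {
        return latency_us + bytes / bw_mbytes_s;
    }

    int main(void) {
        double latency_us = 10.0;   /* hypothetical 10 microsecond startup cost   */
        double bw         = 100.0;  /* hypothetical 100 Mbytes/sec sustained rate */

        printf("100 bytes: %.1f us\n", message_time_us(100.0, latency_us, bw));
        printf("1 Mbyte:   %.1f us\n", message_time_us(1e6,   latency_us, bw));
        return 0;
    }

With these numbers, a 100-byte message costs about 11 microseconds (latency-dominated), while a 1 Mbyte message costs about 10,010 microseconds (bandwidth-dominated).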

Interconnect-Related Terms
● Topology: the manner in which the nodes are connected. The best choice would be a fully connected network (every processor to every other), but this is infeasible for cost and scaling reasons. Instead, processors are arranged in some variation of a grid, torus, or hypercube.

Putting the Pieces Together
● Shared memory architectures:
➔ Uniform Memory Access (UMA): Symmetric Multi-Processors (SMPs). Ex: Sun E10000
➔ Non-Uniform Memory Access (NUMA): most common are Distributed Shared Memory (DSM), or cc-NUMA (cache coherent NUMA), systems. Ex: SGI Origin 2000
● Distributed memory architectures:
➔ Massively Parallel Processor (MPP): tightly integrated system, single system image. Ex: CRAY T3E, IBM SP
➔ Clusters: commodity nodes connected by a commodity interconnect. Ex: Beowulf clusters

Symmetric Multiprocessors (SMPs)
● SMPs connect processors to global shared memory using one of:
➔ bus
➔ crossbar
● This provides a simple programming model, but has problems:
➔ buses can become saturated
➔ crossbar size must increase with the number of processors
● Either problem grows with the number of processors, limiting the maximum size of SMPs.

Shared Memory Programming
Programming models are easier since message passing is not necessary. Techniques include:
➔ autoparallelization via compiler options
➔ loop-level parallelism via compiler directives
➔ OpenMP (see the sketch after this list)
➔ pthreads
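A minimal loop-level OpenMP sketch (an added illustration, not taken from the slides): the directive asks the compiler to split the loop iterations across threads that all see the same shared arrays.

    #include <stdio.h>
    #include <omp.h>

    #define N 1000000

    int main(void) {
        static double a[N], b[N], c[N];
        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

        /* Loop-level parallelism: each thread works on its share of the
           iterations, reading and writing the same shared arrays. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];

        printf("c[N-1] = %f (threads available: %d)\n",
               c[N-1], omp_get_max_threads());
        return 0;
    }

Built with an OpenMP-capable compiler (e.g. gcc -fopenmp); if the directive is ignored, the same source still runs serially.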

Massively Parallel Processors
● Each processor has its own memory:
➔ memory is not shared globally
➔ this adds another layer to the memory hierarchy (remote memory)
● Processor/memory nodes are connected by an interconnect network:
➔ many possible topologies
➔ processors must pass data via messages
➔ communication overhead must be minimized

Types of Interconnections
● Fully connected
➔ not feasible
● Array and torus
➔ Intel Paragon (2D array), CRAY T3E (3D torus)
● Crossbar
➔ IBM SP (8 nodes)
● Hypercube and fat tree
➔ SGI Origin 2000 (hypercube), Meiko CS-2 (fat tree)
● Combinations of some of the above
➔ IBM SP (crossbar & fully connected for up to 80 nodes)
➔ IBM SP (fat tree for > 80 nodes)

Distributed Memory Programming
● Message passing is most efficient:
➔ MPI (see the example below)
➔ MPI-2
➔ active/one-sided messages: vendor libraries such as SHMEM (T3E) and LAPI (SP); also coming in MPI-2
● Shared memory models can be implemented in software, but are not as efficient.
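A minimal MPI sketch (an added illustration, not from the slides): each process keeps its data in its own local memory, and values move only through explicit messages, here a collective reduction.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Each process computes a partial result in its own local memory. */
        double local = rank + 1.0;
        double total = 0.0;

        /* Data moves only via explicit messages; here a collective
           reduction sums the partial results onto rank 0. */
        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("Sum over %d processes = %f\n", size, total);

        MPI_Finalize();
        return 0;
    }

Typically launched with something like mpirun -np 4 ./a.out; with 4 processes the printed sum is 10.0.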

Distributed Shared Memory
● More generally called cc-NUMA (cache coherent NUMA).
● Consists of m SMPs with n processors in a global address space:
➔ each processor has some local memory (SMP)
➔ all processors can access all memory: extra "directory" hardware on each SMP tracks values stored in all SMPs
➔ hardware guarantees cache coherency
➔ access to memory on other SMPs is slower (NUMA)

Distributed Shared Memory
● Easier to build because slower access to remote memory is tolerated (no expensive global bus/crossbar)
● Similar cache problems to SMPs
● Code writers should be aware of data distribution
● Load balance: minimize access to "far" memory (a block-decomposition sketch follows this slide)
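One common way to keep accesses "near" and the work balanced (a generic added sketch, not something prescribed by the slides) is a block decomposition: each of p workers gets a contiguous chunk of the n items, with any remainder spread one item at a time.

    #include <stdio.h>

    /* Half-open range [lo, hi) of a block decomposition of n items over
       p workers for a given worker id; the remainder is spread so no
       worker gets more than one extra item (load balance). */
    void block_range(long n, int p, int id, long *lo, long *hi) {
        long base = n / p, rem = n % p;
        *lo = id * base + (id < rem ? id : rem);
        *hi = *lo + base + (id < rem ? 1 : 0);
    }

    int main(void) {
        long lo, hi;
        for (int id = 0; id < 4; id++) {          /* 10 items over 4 workers */
            block_range(10, 4, id, &lo, &hi);
            printf("worker %d: [%ld, %ld)\n", id, lo, hi);
        }
        return 0;
    }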