Parallel Computer Architecture

Prepared 7/28/2011 by T. O’Neil for 3460:677, Fall 2011, The University of Akron.

Parallel Computer Architecture
A parallel computer is a collection of processing elements that cooperate to solve large problems fast. Broad issues involved:
- Resource allocation: the number of processing elements (PEs), the computing power of each element, and the amount of physical memory used.
- Data access, communication, and synchronization: how the elements cooperate and communicate, how data is transmitted between processors, and the abstractions and primitives provided for cooperation.
- Performance and scalability: the performance gain from parallelism (speedup) and the scalability of performance to larger systems and problems.

The Need and Feasibility of Parallel Computing
- Application demands: ever more computing cycles.
  - Scientific computing: CFD, biology, chemistry, physics, ...
  - General-purpose computing: video, graphics, CAD, databases, transaction processing, gaming, ...
  - Mainstream multithreaded programs are similar to parallel programs.
- Technology trends: the number of transistors per chip is growing rapidly, while clock rates are expected to rise only slowly.
- Architecture trends: instruction-level parallelism is valuable but limited; coarser-grained parallelism, as in multiprocessors, is the most viable approach.
- Economics: today's microprocessors already include multiprocessor support, eliminating the need to design expensive custom PEs and lowering parallel system cost; multiprocessor systems offer a cost-effective replacement for uniprocessor systems in mainstream computing.

Scientific Computing Demand

Scientific Supercomputing Trends
- A proving ground and driver for innovative architecture and advanced techniques, though the market is much smaller than the commercial segment; dominated by vector machines starting in the 1970s.
- Meanwhile, microprocessors have made huge gains in floating-point performance: high clock rates, pipelined floating-point units, instruction-level parallelism, and effective use of caches.
- Large-scale multiprocessors are replacing vector supercomputers; this transition is already well under way.

Raw Uniprocessor Performance: LINPACK

Raw Parallel Performance: LINPACK

General Technology Trends
- Microprocessor performance increases 50% - 100% per year.
- Transistor count doubles every 3 years.
- DRAM size quadruples every 3 years.
(Chart: integer and floating-point performance, 1987-1992, for machines including the Sun 4/260, MIPS M/120, MIPS M2000, IBM RS6000-540, HP 9000/750, and DEC Alpha.)

Clock Frequency Growth Rate
Clock frequency is currently increasing about 30% per year.

Transistor Count Growth Rate
- 100 million transistors per chip by the early 2000s.
- Transistor count grows much faster than clock rate: currently about 40% per year.

System Attributes to Performance
- Performance benchmarking is program-mix dependent; ideal performance requires a perfect machine/program match.
- Performance measures:
  - Cycles per instruction (CPI).
  - Total CPU time: T = C x t = C / f = Ic x CPI x t = Ic x (p + m x k) x t,
    where Ic = instruction count, t = CPU cycle time, p = instruction decode cycles, m = memory cycles, k = ratio between memory and processor cycle times, C = total program clock cycles, and f = clock rate.
  - MIPS rate = Ic / (T x 10^6) = f / (CPI x 10^6) = (f x Ic) / (C x 10^6)
  - Throughput rate: Wp = f / (Ic x CPI) = (MIPS x 10^6) / Ic
- The performance factors (Ic, p, m, k, t) are influenced by the instruction-set architecture, compiler design, CPU implementation and control, and the cache and memory hierarchy.
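As a quick worked example (with illustrative numbers, not taken from the slides): a program with Ic = 200 million instructions running on a 100 MHz processor (t = 10 ns) with an effective CPI of 2 takes T = 200 x 10^6 x 2 x 10 ns = 4 seconds, giving a MIPS rate of f / (CPI x 10^6) = 100 x 10^6 / (2 x 10^6) = 50 MIPS.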

CPU Performance Trends
The microprocessor is currently the most natural building block for multiprocessor systems in terms of cost and performance.

Parallelism in Microprocessor VLSI Generations

The Goal of Parallel Computing
- The goal of applications in using parallel machines is speedup:
  Speedup(p processors) = Performance(p processors) / Performance(1 processor)
- For a fixed problem size (input data set), performance = 1 / time, so:
  Speedup, fixed problem (p processors) = Time(1 processor) / Time(p processors)
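For example (with hypothetical numbers, not from the slides): if a fixed-size problem takes 64 seconds on one processor and 8 seconds on 16 processors, the speedup is 64 / 8 = 8, i.e. a parallel efficiency of 8/16 = 50%.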

Elements of Modern Computers
An overview of the elements covered in the following slides: computing problems, algorithms and data structures, hardware architecture, operating system, applications software, mapping, programming, high-level languages, binding (compile, load), and performance evaluation.

Elements of Modern Computers
- Computing problems:
  - Numerical computing: science and technology pose numerical problems that demand intensive integer and floating-point computations.
  - Logical reasoning: artificial intelligence (AI) demands logic inference, symbolic manipulation, and searches over large spaces.
- Algorithms and data structures:
  - Special algorithms and data structures are needed to specify the computations and communication present in computing problems.
  - Most numerical algorithms are deterministic and use regular data structures.
  - Symbolic processing may use heuristics or non-deterministic searches.
  - Parallel algorithm development requires interdisciplinary interaction.

Elements of Modern Computers
- Hardware resources:
  - Processors, memory, and peripheral devices form the hardware core of a computer system.
  - The processor instruction set, processor connectivity, and memory organization influence the system architecture.
- Operating system:
  - Manages the allocation of resources to running processes.
  - Mapping matches algorithmic structures with the hardware architecture and vice versa: processor scheduling, memory mapping, interprocessor communication.
  - Parallelism can be exploited at algorithm design, program writing, compilation, and run time.

Elements of Modern Computers
- System software support:
  - Needed for the development of efficient programs in high-level languages (HLLs): assemblers, loaders, portable parallel programming languages, and user interfaces and tools.
- Compiler support:
  - Preprocessor compiler: a sequential compiler plus a low-level library for the target parallel computer.
  - Precompiler: some program flow analysis, dependence checking, and limited optimizations for parallelism detection.
  - Parallelizing compiler: can automatically detect parallelism in source code and transform sequential code into parallel constructs.

Approaches to Parallel Programming
(a) Implicit parallelism: source code written in sequential languages (C, C++, Fortran, Lisp, ...) is handed by the programmer to a parallelizing compiler, which produces parallel object code executed by the runtime system.
(b) Explicit parallelism: source code written in concurrent dialects of C, C++, Fortran, Lisp, ... is handed by the programmer to a concurrency-preserving compiler, which produces concurrent object code executed by the runtime system.
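As a minimal sketch of the contrast (my own example; the slides do not prescribe a particular concurrent dialect, and OpenMP is used here only as a stand-in for explicit annotation): the loop below can either be left unannotated and given to an auto-parallelizing compiler (implicit), or marked parallel by the programmer (explicit).

    /* Explicitly parallel vector add in C using an OpenMP directive;
     * without the pragma, the same sequential loop is a candidate for
     * an auto-parallelizing compiler. (Illustrative sketch only.) */
    #include <stdio.h>
    #define N 1000000

    int main(void) {
        static double a[N], b[N], c[N];
        for (int i = 0; i < N; i++) { b[i] = i; c[i] = 2.0 * i; }

        #pragma omp parallel for      /* programmer asserts independent iterations */
        for (int i = 0; i < N; i++)
            a[i] = b[i] + c[i];

        printf("a[42] = %f\n", a[42]);
        return 0;
    }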

Evolution of Computer Architecture
- Scalar sequential execution evolved through lookahead and I/E overlap to functional parallelism (multiple functional units, pipelining).
- Vector processing, implicit or explicit; explicit vector machines are register-to-register or memory-to-memory.
- SIMD: processor arrays and associative processors.
- MIMD: multiprocessors and multicomputers, leading to massively parallel processors (MPPs).
Legend: I/E = instruction fetch and execute; SIMD = single instruction stream over multiple data streams; MIMD = multiple instruction streams over multiple data streams.

Parallel Architectures History
Historically, parallel architectures were tied to programming models, producing divergent architectures with no predictable pattern of growth: application software and system software were built directly on architectures such as SIMD, message passing, shared memory, dataflow, and systolic arrays.

Programming Models
- The programming methodology used in coding applications; it specifies communication and synchronization.
- Examples:
  - Multiprogramming: no communication or synchronization at the program level.
  - Shared memory address space.
  - Message passing: explicit point-to-point communication.
  - Data parallel: more regimented, global actions on data; implemented with a shared address space or with message passing.

Flynn's 1972 Classification of Computer Architecture
- Single Instruction stream over a Single Data stream (SISD): conventional sequential machines.
- Single Instruction stream over Multiple Data streams (SIMD): vector computers and arrays of synchronized processing elements.
- Multiple Instruction streams over a Single Data stream (MISD): systolic arrays for pipelined execution.
- Multiple Instruction streams over Multiple Data streams (MIMD): parallel computers, i.e. shared-memory multiprocessors, and multicomputers with unshared distributed memory that use message passing instead.

Flynn's Classification of Computer Architecture
(Fig. 1.3, page 12 in Advanced Computer Architecture: Parallelism, Scalability, Programmability, Hwang, 1993.)

Current Trends in Parallel Architectures
- The extension of "computer architecture" to support communication and cooperation:
  - Old: instruction set architecture.
  - New: communication architecture.
- The communication architecture defines:
  - Critical abstractions, boundaries, and primitives (interfaces).
  - Organizational structures that implement the interfaces (in hardware or software).
- Compilers, libraries, and the OS are important bridges today.

Modern Parallel Architecture Layered Framework
Parallel applications (CAD, databases, scientific modeling) are written to programming models (multiprogramming, shared address space, message passing, data parallel). The programming models rest on a communication abstraction that forms the user/system boundary and is realized by compilation or libraries together with operating system support; below the hardware/software boundary lie the communication hardware and the physical communication medium.

Shared Address Space Parallel Architectures
- Any processor can directly reference any memory location; communication occurs implicitly as a result of ordinary loads and stores.
- Convenient:
  - Location transparency.
  - Programming model similar to time-sharing on uniprocessors, except that processes run on different processors.
  - Good throughput on multiprogrammed workloads.
- Naturally provided on a wide range of platforms, across a wide range of scale: from a few to hundreds of processors.
- Popularly known as shared memory machines or the shared memory model, although the term is ambiguous: the memory may be physically distributed among the processors.

Shared Address Space (SAS) Model
- A process is a virtual address space plus one or more threads of control.
- Portions of the address spaces of different processes are shared. (Figure: the virtual address spaces of a collection of processes communicating via shared addresses are mapped onto the machine's physical address space; each virtual address space has a shared portion, mapped to common physical addresses, and a private portion.)
- Writes to a shared address are visible to other threads, including threads in other processes.
- A natural extension of the uniprocessor model: conventional memory operations are used for communication, while special atomic operations are needed for synchronization.
- The OS itself uses shared memory to coordinate processes.
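A minimal sketch of this model in C with POSIX threads (my own example, not from the slides): the two threads communicate through an ordinary shared variable via plain loads and stores, and a mutex plays the role of the special synchronization operation.

    /* Shared address space sketch: two threads share "counter" and
     * synchronize with a mutex. (Illustrative example only.) */
    #include <pthread.h>
    #include <stdio.h>

    static long counter = 0;                 /* shared data: plain loads/stores */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg) {
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);       /* synchronization primitive */
            counter++;                       /* communication via shared memory */
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("counter = %ld\n", counter);  /* expected: 200000 */
        return 0;
    }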

Models of Shared-Memory Multiprocessors
- Uniform Memory Access (UMA) model: the physical memory is shared by all processors, and all processors have equal access to all memory addresses.
- Distributed-memory or Non-Uniform Memory Access (NUMA) model: the shared memory is physically distributed locally among the processors.
- Cache-Only Memory Architecture (COMA) model: a special case of a NUMA machine in which all of the distributed main memory is converted into caches; there is no memory hierarchy at each processor.

Models of Shared-Memory Multiprocessors
(Diagrams of the three models. UMA: processors, each with a cache, share memory modules through an interconnect such as a bus, crossbar, or multistage network. NUMA / distributed memory: each node pairs a processor with local memory, and the nodes communicate over a network. COMA: each node holds a processor, cache, and cache directory, connected by a network. Legend: P = processor, M = memory, C/$ = cache, D = cache directory.)

Uniform Memory Access Example: Intel Pentium Pro Quad
- All coherence and multiprocessing glue is in the processor module.
- Highly integrated, targeted at high volume.
- Low latency and bandwidth.

Uniform Memory Access Example: SUN Enterprise
- 16 cards of either type: processors + memory, or I/O.
- All memory is accessed over the bus, so access is symmetric.
- Higher bandwidth, higher latency bus.

Distributed Shared-Memory Multiprocessor System Example: Cray T3E
- Scales up to 1024 processors, with 480 MB/s links.
- The memory controller generates communication requests for nonlocal references.
- No hardware mechanism for coherence (SGI Origin and others provide this).

Message-Passing Multicomputers
- Comprised of multiple autonomous computers (nodes); each node consists of a processor, local memory, and attached storage and I/O peripherals.
- The programming model is further removed from basic hardware operations.
- Local memory is accessible only by the local processor.
- A message-passing network provides point-to-point static connections among the nodes; inter-node communication is carried out by message passing through this static connection network.
- Process communication is achieved using a message-passing programming environment.

Message-Passing Abstraction
(Figure: process P executes Send(X, Q, t), naming the local address X of the data, the destination process Q, and a tag t; process Q executes a matching Receive(Y, P, t) into its local address Y. Each process operates in its own local address space.)
- Send specifies the buffer to be transmitted and the receiving process.
- Receive specifies the sending process and the application storage to receive into.
- A memory-to-memory copy, but the processes must be named.
- Optional tag on the send and a matching rule on the receive.
- The user process names local data and entities in the process/tag space.
- In the simplest form, the send/receive match achieves a pairwise synchronization event.
- Many overheads: copying, buffer management, protection.
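A minimal sketch of this send/receive pairing using MPI, which a later slide introduces as a standard message-passing library (the program itself is my illustration, not taken from the slides):

    /* Rank 0 sends its local buffer x to rank 1 with tag 99; rank 1
     * posts a matching receive into its local buffer y. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            double x = 3.14;                       /* local address space of P */
            MPI_Send(&x, 1, MPI_DOUBLE, 1, 99, MPI_COMM_WORLD);
        } else if (rank == 1) {
            double y;                              /* local address space of Q */
            MPI_Recv(&y, 1, MPI_DOUBLE, 0, 99, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received %f\n", y);
        }

        MPI_Finalize();
        return 0;
    }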

Message-Passing Example: IBM SP-2
- Made out of essentially complete RS6000 workstations.
- The network interface is integrated on the I/O bus (bandwidth limited by the I/O bus).

Message-Passing Example: Intel Paragon

Message-Passing Programming Tools
Message-passing programming libraries include:
- Message Passing Interface (MPI): provides a standard for writing concurrent message-passing programs. MPI implementations include parallel libraries usable from existing programming languages.
- Parallel Virtual Machine (PVM): enables a collection of heterogeneous computers to be used as a coherent and flexible concurrent computational resource. PVM support software executes on each machine in a user-configurable pool and provides a computational environment for concurrent applications. User programs written, for example, in C, Fortran, or Java access PVM through calls to PVM library routines.

Data Parallel Systems
- SIMD in Flynn's taxonomy.
- Programming model:
  - Operations are performed in parallel on each element of a data structure.
  - Logically a single thread of control performs sequential or parallel steps.
  - Conceptually, a processor is associated with each data element.
- Architectural model:
  - An array of many simple, cheap processors, each with little memory; the processors do not sequence through instructions themselves but are attached to a control processor that issues the instructions.
  - Specialized and general communication, and cheap global synchronization.
- Example machines: Thinking Machines CM-1 and CM-2 (and CM-5), MasPar MP-1 and MP-2.
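A small sequential C sketch of the data-parallel style (my own illustration, not from the slides): logically there is one thread of control and the same operation is applied to every element; on a SIMD machine each (virtual) PE would hold one element, and the conditional step would be realized with per-PE enable masks.

    /* Data-parallel style, emulated sequentially: step 1 applies one
     * operation to all elements; step 2 applies an operation only where
     * the condition (the "mask") holds. */
    #include <stdio.h>
    #define N 8

    int main(void) {
        float a[N] = {1, -2, 3, -4, 5, -6, 7, -8};
        float b[N];

        /* Step 1 (all elements / all PEs active): b = 2 * a */
        for (int i = 0; i < N; i++) b[i] = 2.0f * a[i];

        /* Step 2 (masked): only elements with a < 0 are updated */
        for (int i = 0; i < N; i++)
            if (a[i] < 0.0f) b[i] = 0.0f;

        for (int i = 0; i < N; i++) printf("%g ", b[i]);
        printf("\n");
        return 0;
    }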

Dataflow Architectures
- Represent computation as a graph of essential dependences.
- A logical processor sits at each node and is activated by the availability of its operands.
- Messages (tokens) carrying the tag of the next instruction are sent to the next processor.
- The tag is compared with others in a matching store; a match fires execution.

Systolic Architectures
- Replace a single processor with an array of regular processing elements.
- Orchestrate data flow for high throughput with less memory access.
- Different from pipelining: nonlinear array structures, multidirectional data flow, and each PE may have a (small) local instruction and data memory.
- Different from SIMD: each PE may do something different.
- Initial motivation: VLSI enables inexpensive special-purpose chips; algorithms are represented directly by chips connected in a regular pattern.