Convergence of Parallel Architectures

Slides:



Advertisements
Similar presentations
L.N. Bhuyan Adapted from Patterson’s slides
Advertisements

1 Uniform memory access (UMA) Each processor has uniform access time to memory - also known as symmetric multiprocessors (SMPs) (example: SUN ES1000) Non-uniform.
Multiprocessors— Large vs. Small Scale Multiprocessors— Large vs. Small Scale.
CS 213: Parallel Processing Architectures Laxmi Narayan Bhuyan Lecture3.
Multiple Processor Systems
ECE669 L15: Mid-term Review March 25, 2004 ECE 669 Parallel Computer Architecture Lecture 15 Mid-term Review.
Fundamental Design Issues for Parallel Architecture Todd C. Mowry CS 495 January 22, 2002.
Multiprocessors ELEC 6200: Computer Architecture and Design Instructor : Agrawal Name: Nam.
1 Multiprocessors. 2 Idea: create powerful computers by connecting many smaller ones good news: works for timesharing (better than supercomputer) bad.
ECE669 L2: Architectural Perspective February 3, 2004 ECE 669 Parallel Computer Architecture Lecture 2 Architectural Perspective.
EECC756 - Shaaban #1 lec # 2 Spring Parallel Architectures History Application Software System Software SIMD Message Passing Shared Memory.
1 Lecture 1: Parallel Architecture Intro Course organization:  ~5 lectures based on Culler-Singh textbook  ~5 lectures based on Larus-Rajwar textbook.
CS 258 Parallel Computer Architecture Lecture 2 Convergence of Parallel Architectures January 25, 2002 Prof John D. Kubiatowicz
Multiprocessors CSE 471 Aut 011 Multiprocessors - Flynn’s Taxonomy (1966) Single Instruction stream, Single Data stream (SISD) –Conventional uniprocessor.
Parallel Processing Architectures Laxmi Narayan Bhuyan
EECC756 - Shaaban #1 lec # 2 Spring Parallel Architectures History Application Software System Software SIMD Message Passing Shared Memory.
Computer architecture II
0 Introduction to parallel processing Chapter 1 from Culler & Singh Winter 2007.
Multiple Processor Systems. Multiprocessor Systems Continuous need for faster and powerful computers –shared memory model ( access nsec) –message passing.
August 15, 2001Systems Architecture II1 Systems Architecture II (CS ) Lecture 12: Multiprocessors: Non-Uniform Memory Access * Jeremy R. Johnson.
Spring 2003CSE P5481 Issues in Multiprocessors Which programming model for interprocessor communication shared memory regular loads & stores message passing.
Intro to Parallel Architecture Todd C. Mowry CS 740 September 19, 2007.
INEL6067 Technology ---> Limitations & Opportunities Wires -Area -Propagation speed Clock Power VLSI -I/O pin limitations -Chip area -Chip crossing delay.
1 Lecture 1: Parallel Architecture Intro Course organization:  ~18 parallel architecture lectures (based on text)  ~10 (recent) paper presentations 
3/12/2013Computer Engg, IIT(BHU)1 PARALLEL COMPUTERS- 2.
Spring EE 437 Lillevik 437s06-l22 University of Portland School of Engineering Advanced Computer Architecture Lecture 22 Distributed computer Interconnection.
Evolution and Convergence of Parallel Architectures Todd C. Mowry CS 495 January 17, 2002.
Diversity and Convergence of Parallel Architectures Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals.
CDA-5155 Computer Architecture Principles Fall 2000 Multiprocessor Architectures.
Background Computer System Architectures Computer System Software.
740: Computer Architecture Programming Models and Architectures Prof. Onur Mutlu Carnegie Mellon University.
Lecture 13 Parallel Processing. 2 What is Parallel Computing? Traditionally software has been written for serial computation. Parallel computing is the.
Feeding Parallel Machines – Any Silver Bullets? Novica Nosović ETF Sarajevo 8th Workshop “Software Engineering Education and Reverse Engineering” Durres,
COMP7330/7336 Advanced Parallel and Distributed Computing Data Parallelism Dr. Xiao Qin Auburn University
Advanced Computer Architecture CS 704 Advanced Computer Architecture Lecture 34 Multiprocessors (Shared Memory Architectures) Prof. Dr. M. Ashraf Chughtai.
Group Members Hamza Zahid (131391) Fahad Nadeem khan Abdual Hannan AIR UNIVERSITY MULTAN CAMPUS.
Fundamental Design Issues
Auburn University COMP7330/7336 Advanced Parallel and Distributed Computing Fundamental Design Issues Dr. Xiao Qin Auburn.
These slides are based on the book:
Convergence of Parallel Architectures
Overview Parallel Processing Pipelining
18-447: Computer Architecture Lecture 30B: Multiprocessors
CMSC 611: Advanced Computer Architecture
CS5102 High Performance Computer Systems Thread-Level Parallelism
Lecture 21 Synchronization
Parallel Computers Definition: “A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast.”
Auburn University COMP7330/7336 Advanced Parallel and Distributed Computing Parallel Programming Models Dr. Xiao Qin Auburn.
Course Outline Introduction in algorithms and applications
Conventional Computer Architecture Abstraction
CMSC 611: Advanced Computer Architecture
Shared Memory Multiprocessors
Lecture 1: Parallel Architecture Intro
Chapter 17 Parallel Processing
Multiprocessor Architectures and Parallel Programs
Multiprocessors - Flynn’s taxonomy (1966)
Multiple Processor Systems
CS 213: Parallel Processing Architectures
Computer Science Division
Introduction to Multiprocessors
Parallel Processing Architectures
Lecture 24: Memory, VM, Multiproc
Parallel Computer Architecture
Distributed Systems CS
Latency Tolerance: what to do when it just won’t go away
Chapter 4 Multiprocessors
An Overview of MIMD Architectures
Lecture 24: Virtual Memory, Multiprocessors
Lecture 23: Virtual Memory, Multiprocessors
Database System Architectures
Prof. Onur Mutlu Carnegie Mellon University 9/12/2012
Presentation transcript:

Convergence of Parallel Architectures CS 258, Spring 99 David E. Culler Computer Science Division U.C. Berkeley CS258 S99

Recap of Lecture 1 Parallel Comp. Architecture driven by familiar technological and economic forces application/platform cycle, but focused on the most demanding applications hardware/software learning curve More attractive than ever because ‘best’ building block - the microprocessor - is also the fastest BB. History of microprocessor architecture is parallelism translates area and denisty into performance The Future is higher levels of parallelism Parallel Architecture concepts apply at many levels Communication also on exponential curve => Quantitative Engineering approach New Applications More Performance Speedup 12/5/2018 CS258 S99 L2

History Parallel architectures tied closely to programming models Divergent architectures, with no predictable pattern of growth. Mid 80s rennaisance Application Software System Software Systolic Arrays SIMD Architecture Message Passing Dataflow Shared Memory 12/5/2018 CS258 S99 L2

Plan for Today Look at major programming models where did they come from? The 80s architectural rennaisance! What do they provide? How have they converged? Extract general structure and fundamental issues Reexamine traditional camps from new perspective (next week) Systolic Arrays SIMD Generic Architecture Message Passing Dataflow Shared Memory 12/5/2018 CS258 S99 L2

Administrivia Mix of HW, Exam, Project load HW 1 due date moved out to Fri 1/29 added 1.18 Hands-on session with parallel machines in week 3 12/5/2018 CS258 S99 L2

Programming Model Conceptualization of the machine that programmer uses in coding applications How parts cooperate and coordinate their activities Specifies communication and synchronization operations Multiprogramming no communication or synch. at program level Shared address space like bulletin board Message passing like letters or phone calls, explicit point to point Data parallel: more regimented, global actions on data Implemented with shared address space or message passing 12/5/2018 CS258 S99 L2

Shared Memory => Shared Addr. Space Bottom-up engineering factors Programming concepts Why its attactive. 12/5/2018 CS258 S99 L2

Adding Processing Capacity Memory capacity increased by adding modules I/O by controllers and devices Add processors for processing! For higher-throughput multiprogramming, or parallel programs 12/5/2018 CS258 S99 L2

Historical Development “Mainframe” approach Motivated by multiprogramming Extends crossbar used for Mem and I/O Processor cost-limited => crossbar Bandwidth scales with p High incremental cost use multistage instead “Minicomputer” approach Almost all microprocessor systems have bus Motivated by multiprogramming, TP Used heavily for parallel computing Called symmetric multiprocessor (SMP) Latency larger than for uniprocessor Bus is bandwidth bottleneck caching is key: coherence problem Low incremental cost 12/5/2018 CS258 S99 L2

Shared Physical Memory Any processor can directly reference any memory location Any I/O controller - any memory Operating system can run on any processor, or all. OS uses shared memory to coordinate Communication occurs implicitly as result of loads and stores What about application processes? 12/5/2018 CS258 S99 L2

Shared Virtual Address Space Process = address space plus thread of control Virtual-to-physical mapping can be established so that processes shared portions of address space. User-kernel or multiple processes Multiple threads of control on one address space. Popular approach to structuring OS’s Now standard application capability (ex: POSIX threads) Writes to shared address visible to other threads Natural extension of uniprocessors model conventional memory operations for communication special atomic operations for synchronization also load/stores 12/5/2018 CS258 S99 L2

Structured Shared Address Space o r e P 1 2 n L a d p i v Virtual address spaces for a collection of processes communicating via shared addresses Machine physical address space Shared portion of address space Private portion Common physical addresses Add hoc parallelism used in system code Most parallel applications have structured SAS Same program on each processor shared variable X means the same thing to each thread 12/5/2018 CS258 S99 L2

Engineering: Intel Pentium Pro Quad All coherence and multiprocessing glue in processor module Highly integrated, targeted at high volume Low latency and bandwidth 12/5/2018 CS258 S99 L2

Engineering: SUN Enterprise Proc + mem card - I/O card 16 cards of either type All memory accessed over bus, so symmetric Higher bandwidth, higher latency bus 12/5/2018 CS258 S99 L2

Scaling Up “Dance hall” Distributed memory ° ° ° Network Network $ ° ° ° $ M M ° ° ° $ $ $ M $ P P P P P P “Dance hall” Distributed memory Problem is interconnect: cost (crossbar) or bandwidth (bus) Dance-hall: bandwidth still scalable, but lower cost than crossbar latencies to memory uniform, but uniformly large Distributed memory or non-uniform memory access (NUMA) Construct shared address space out of simple message transactions across a general-purpose network (e.g. read-request, read-response) Caching shared (particularly nonlocal) data? 12/5/2018 CS258 S99 L2

Engineering: Cray T3E Scale up to 1024 processors, 480MB/s links Memory controller generates request message for non-local references No hardware mechanism for coherence SGI Origin etc. provide this 12/5/2018 CS258 S99 L2

Systolic Arrays SIMD Generic Architecture Message Passing Dataflow ° ° ° Network P $ Systolic Arrays SIMD Generic Architecture Message Passing Dataflow Shared Memory 12/5/2018 CS258 S99 L2

Message Passing Architectures Complete computer as building block, including I/O Communication via explicit I/O operations Programming model direct access only to private address space (local memory), communication via explicit messages (send/receive) High-level block diagram Communication integration? Mem, I/O, LAN, Cluster Easier to build and scale than SAS Programming model more removed from basic hardware operations Library or OS intervention M ° ° ° Network P $ 12/5/2018 CS258 S99 L2

Message-Passing Abstraction Pr ocess P Q Addr ess Y X Send X, Q, t Receive , t Match Local pr addr ess space Send specifies buffer to be transmitted and receiving process Recv specifies sending process and application storage to receive into Memory to memory copy, but need to name processes Optional tag on send and matching rule on receive User process names local data and entities in process/tag space too In simplest form, the send/recv match achieves pairwise synch event Other variants too Many overheads: copying, buffer management, protection 12/5/2018 CS258 S99 L2

Evolution of Message-Passing Machines Early machines: FIFO on each link HW close to prog. Model; synchronous ops topology central (hypercube algorithms) CalTech Cosmic Cube (Seitz, CACM Jan 95) 12/5/2018 CS258 S99 L2

Diminishing Role of Topology Shift to general links DMA, enabling non-blocking ops Buffered by system at destination until recv Store&forward routing Diminishing role of topology Any-to-any pipelined routing node-network interface dominates communication time Simplifies programming Allows richer design space grids vs hypercubes Intel iPSC/1 -> iPSC/2 -> iPSC/860 H x (T0 + n/B) vs T0 + HD + n/B 12/5/2018 CS258 S99 L2

Example Intel Paragon 12/5/2018 CS258 S99 L2

Building on the mainstream: IBM SP-2 Made out of essentially complete RS6000 workstations Network interface integrated in I/O bus (bw limited by I/O bus) 12/5/2018 CS258 S99 L2

Berkeley NOW 100 Sun Ultra2 workstations Inteligent network interface proc + mem Myrinet Network 160 MB/s per link 300 ns per hop 12/5/2018 CS258 S99 L2

Toward Architectural Convergence Evolution and role of software have blurred boundary Send/recv supported on SAS machines via buffers Can construct global address space on MP (GA -> P | LA) Page-based (or finer-grained) shared virtual memory Hardware organization converging too Tighter NI integration even for MP (low-latency, high-bandwidth) Hardware SAS passes messages Even clusters of workstations/SMPs are parallel systems Emergence of fast system area networks (SAN) Programming models distinct, but organizations converging Nodes connected by general network and communication assists Implementations also converging, at least in high-end machines 12/5/2018 CS258 S99 L2