ECE 259 / CPS 221 Advanced Computer Architecture II (Parallel Computer Architecture) Machine Organizations Copyright 2004 Daniel J. Sorin Duke University.


ECE 259 / CPS 221 Advanced Computer Architecture II (Parallel Computer Architecture) Machine Organizations Copyright 2004 Daniel J. Sorin Duke University Slides are derived from work by Sarita Adve (Illinois), Babak Falsafi (CMU), Mark Hill (Wisconsin), Alvy Lebeck (Duke), Steve Reinhardt (Michigan), and J. P. Singh (Princeton). Thanks!

2 (C) 2004 Daniel J. Sorin from Adve, Falsafi, Hill, Lebeck, Reinhardt, Singh ECE 259 / CPS 221 Outline System Organization Taxonomy Distributed Memory Systems (w/o cache coherence) Clusters and Networks of Workstations Presentations of Research Papers

3 (C) 2004 Daniel J. Sorin from Adve, Falsafi, Hill, Lebeck, Reinhardt, Singh ECE 259 / CPS 221 A Taxonomy of System Organizations Flynn’s Taxonomy of Organizations –SISD = Single Instruction, Single Data = uniprocessor –SIMD = Single Instruction, Multiple Data –MISD = Multiple Instruction, Single Data = doesn’t exist –MIMD = Multiple Instruction, Multiple Data This course addresses SIMD and MIMD machines

4 (C) 2004 Daniel J. Sorin from Adve, Falsafi, Hill, Lebeck, Reinhardt, Singh ECE 259 / CPS 221 SIMD Single Instruction, Multiple Data E.g., every processor executes an Add on different data Matches certain programming models –E.g., High Performance Fortran’s “Forall” For data parallel algorithms, SIMD is very effective –Else, it doesn’t help much (Amdahl’s Law!) Note: SPMD is a programming model, not an organization SIMD hardware implementations –Vector machines –Instruction set extensions, like Intel’s MMX –Digital Signal Processors (DSPs)
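The slide's "every processor executes an Add on different data" can be sketched as a toy lockstep model; the function name and lane representation below are illustrative, not any real SIMD API:

```python
# Toy SIMD model: one instruction (here, add) broadcast to many lanes,
# each lane holding its own data element. A data-parallel "forall"
# maps directly onto this pattern.

def simd_add(lanes_a, lanes_b):
    """Apply the single instruction 'add' across all lanes in lockstep."""
    assert len(lanes_a) == len(lanes_b)
    return [a + b for a, b in zip(lanes_a, lanes_b)]

result = simd_add([1, 2, 3, 4], [10, 20, 30, 40])
print(result)  # [11, 22, 33, 44]
```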

5 (C) 2004 Daniel J. Sorin from Adve, Falsafi, Hill, Lebeck, Reinhardt, Singh ECE 259 / CPS 221 MIMD Multiple Instruction, Multiple Data Completely general multiprocessor system model –Lots of ways to design MIMD machines Every processor executes own program w/own data MIMD machines can handle many program models –Shared memory –Message passing –SPMD (using shared memory or message passing)
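The SPMD model the slide mentions (one program, many data partitions, running on MIMD hardware) can be sketched with threads standing in for processors; the partitioning scheme and names here are illustrative:

```python
# Toy SPMD on a MIMD model: every "processor" runs the same program
# on its own partition of the data, then partial results are combined.
import threading

def spmd_sum(data, nprocs):
    partial = [0] * nprocs
    def program(rank):                  # the same program on every processor
        chunk = data[rank::nprocs]      # each rank works on its own data
        partial[rank] = sum(chunk)
    threads = [threading.Thread(target=program, args=(r,)) for r in range(nprocs)]
    for t in threads: t.start()
    for t in threads: t.join()
    return sum(partial)                 # combine partial results

print(spmd_sum(list(range(100)), 4))  # 4950
```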

6 (C) 2004 Daniel J. Sorin from Adve, Falsafi, Hill, Lebeck, Reinhardt, Singh ECE 259 / CPS 221 Outline System Organization Taxonomy Distributed Memory Systems (w/o cache coherence) Clusters and Networks of Workstations Presentations of Research Papers

7 (C) 2004 Daniel J. Sorin from Adve, Falsafi, Hill, Lebeck, Reinhardt, Singh ECE 259 / CPS 221 Distributed Memory Machines Multiple nodes connected by network: –Node = processor, caches, memory, communication assist Design issue #1: Communication assist –Where is it? (I/O bus, memory bus, processor registers) –What does it know? »Does it just move bytes, or does it perform some functions? –Is it programmable? –Does it run user code? Design issue #2: Network transactions –Input & output buffering –Action on remote node These two issues are not independent!

8 (C) 2004 Daniel J. Sorin from Adve, Falsafi, Hill, Lebeck, Reinhardt, Singh ECE 259 / CPS 221 Ex 1: Massively Parallel Processor (MPP) Network interface typically close to processor –Connected to memory bus: »Locked to specific processor architecture/bus protocol –Connected to registers/cache: »Only in research machines Time-to-market is long –Processor already available, or must work closely with processor designers Maximize performance (cost is no object!) [Figure: node with processor, cache, main memory, and network interface on the memory bus; I/O bridge to an I/O bus with disk controller and disk]

9 (C) 2004 Daniel J. Sorin from Adve, Falsafi, Hill, Lebeck, Reinhardt, Singh ECE 259 / CPS 221 Ex 2: Network of Workstations Network interface on I/O bus Standards (e.g., PCI) => longer life, faster to market Slow (microseconds) to access network interface “System Area Network” (SAN): between LAN & MPP [Figure: node with processor, cache, core chip set, and main memory; I/O bus with disk controller, graphics controller, and a network interface that interrupts the processor]

10 (C) 2004 Daniel J. Sorin from Adve, Falsafi, Hill, Lebeck, Reinhardt, Singh ECE 259 / CPS 221 Spectrum of Communication Assist Designs
1) None: physical bit stream –Physical DMA using OS: nCUBE, iPSC, ...
2) User-level communication –User-level port: CM-5, *T –User-level handler: J-Machine, Monsoon, ...
3) Communication co-processor –Processing, translation: Paragon, Meiko CS-2, Myrinet –Reflective memory: Memory Channel, SHRIMP
4) Global physical address –Proc + memory controller: RP3, BBN, T3D, T3E
5) Cache-to-cache (later) –Cache controller: most current MPs (Sun, HP)
Increasing HW support, specialization, intrusiveness, performance (???)

11 (C) 2004 Daniel J. Sorin from Adve, Falsafi, Hill, Lebeck, Reinhardt, Singh ECE 259 / CPS 221 Network Transaction Interpretation 1)Simplest MP: communication assist doesn’t interpret much if anything –DMA from/to buffer, then interrupt or set flag on completion 2)User-level messaging: get the OS out of the way –Assist does protection checks to allow direct user access to network (e.g., via memory mapped I/O) 3)Dedicated communication co-processor: get the CPU out of the way (if possible) –Basic protection plus address translation: user-level bulk DMA 4)Global physical address space (NUMA): everything in hardware –Complexity increases, but performance does too (if done right) 5)Cache coherence: even more so –Stay tuned

12 (C) 2004 Daniel J. Sorin from Adve, Falsafi, Hill, Lebeck, Reinhardt, Singh ECE 259 / CPS 221 (1) No Comm Assist: Physical DMA (C&S 7.3) Physical addresses: OS must initiate transfers –System call per message on both ends: ouch! Sending OS copies data to OS buffer w/ header/trailer Receiver copies packet into OS buffer, then interprets –User message then copied (or mapped) into user space
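The cost the slide complains about (a system call per message, plus a copy on each side) can be made concrete with a toy model; all names and the framing bytes are illustrative:

```python
# Toy model of the physical-DMA path: the OS on each side must get
# involved, and data is copied twice (user -> OS buffer with header and
# trailer on the sender; OS buffer -> user space on the receiver).

HEADER, TRAILER = b"HDR", b"TRL"

def os_send(user_buf, wire):
    # "System call" #1: OS copies user data into an OS buffer, adds framing
    wire.append(HEADER + bytes(user_buf) + TRAILER)

def os_receive(wire):
    # "System call" #2: packet DMA'd into an OS buffer, then interpreted
    packet = wire.pop(0)
    assert packet.startswith(HEADER) and packet.endswith(TRAILER)
    return packet[len(HEADER):-len(TRAILER)]   # copied (or mapped) to user

network = []
os_send(b"hello", network)
print(os_receive(network))  # b'hello'
```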

13 (C) 2004 Daniel J. Sorin from Adve, Falsafi, Hill, Lebeck, Reinhardt, Singh ECE 259 / CPS 221 (2) User Level Communication (C&S 7.4) Problem with raw DMA: too much OS activity Solution: let user-to-user communication skip the OS –Can’t skip OS if communication is user-system or system-user Two flavors: –User-level ports => map network into user space –User-level handlers => message reception triggers user handler

14 (C) 2004 Daniel J. Sorin from Adve, Falsafi, Hill, Lebeck, Reinhardt, Singh ECE 259 / CPS 221 (2a) User Level Network Ports (C&S 7.4.1) Map network hardware into user’s address space –Talk directly to network via loads & stores User-to-user communication without OS intervention – Low latency Communication assist must protect users & system DMA hard … CPU involvement (copying) becomes bottleneck

15 (C) 2004 Daniel J. Sorin from Adve, Falsafi, Hill, Lebeck, Reinhardt, Singh ECE 259 / CPS 221 (2a) User Level Network Ports Appears to user as logical message queues plus status register Sender and receiver must coordinate –What happens if receiver doesn’t pop messages?
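The slide's picture (logical message queues plus a status register, accessed with ordinary loads and stores) and its closing question about a receiver that doesn't pop can be sketched with a bounded queue; the class and depth are illustrative:

```python
# Toy user-level network port: the NI appears to the user as message
# queues plus a status register, with no OS call on the fast path.
# A bounded queue shows what happens when the receiver doesn't pop:
# the sender sees "full" and must back off or block.
from collections import deque

class UserPort:
    def __init__(self, depth):
        self.inq = deque()
        self.depth = depth
    def status(self):                  # status register: full / not full
        return "FULL" if len(self.inq) >= self.depth else "OK"
    def send(self, msg):               # a plain store into the mapped queue
        if self.status() == "FULL":
            return False               # sender and receiver must coordinate
        self.inq.append(msg)
        return True
    def recv(self):                    # a plain load from the mapped queue
        return self.inq.popleft() if self.inq else None

port = UserPort(depth=2)
print(port.send("m1"), port.send("m2"), port.send("m3"))  # True True False
```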

16 (C) 2004 Daniel J. Sorin from Adve, Falsafi, Hill, Lebeck, Reinhardt, Singh ECE 259 / CPS 221 Example: Thinking Machines CM-5 (C&S 7.4.2) Input and output FIFO for each network Two networks –Request & Response Save/restore network buffers on context switch More later on CM5!

17 (C) 2004 Daniel J. Sorin from Adve, Falsafi, Hill, Lebeck, Reinhardt, Singh ECE 259 / CPS 221 (2b) User Level Handlers (C&S 7.4.3) Like user-level ports, but tighter coupling between port and processor Ports are mapped to processor registers, instead of to special memory regions Hardware support to vector to address specified in message: incoming message directs execution Examples: J-Machine, Monsoon, *T (MIT), iWARP (CMU)

18 (C) 2004 Daniel J. Sorin from Adve, Falsafi, Hill, Lebeck, Reinhardt, Singh ECE 259 / CPS 221 Example: J-Machine (MIT) Each node a small message-driven processor HW support to queue msgs and dispatch to msg handler task

19 (C) 2004 Daniel J. Sorin from Adve, Falsafi, Hill, Lebeck, Reinhardt, Singh ECE 259 / CPS 221 (Brief) Example: Active Messages Messages are “active” – they contain instructions for what the receiving processor should do Optional reading assignment on Active Messages
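The active-message idea (the message names the handler the receiving processor should run on arrival) can be sketched as a dispatch table; the handler names and message layout are illustrative, not the actual Active Messages interface:

```python
# Toy active message: each message carries a handler id plus arguments,
# and delivery dispatches straight to that user-level handler,
# integrating communication with computation.

results = {}

def handler_store(key, value):        # runs on the receiver at msg arrival
    results[key] = value

HANDLERS = {"store": handler_store}

def deliver(msg):
    name, args = msg                  # message = (handler id, arguments)
    HANDLERS[name](*args)             # vector directly to the handler

deliver(("store", ("x", 42)))
print(results)  # {'x': 42}
```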

20 (C) 2004 Daniel J. Sorin from Adve, Falsafi, Hill, Lebeck, Reinhardt, Singh ECE 259 / CPS 221 (3) Dedicated Communication Co-Processor Use a programmable message processor (MP) –MP runs software that does communication assist work MP and CPU communicate via shared memory [Figure: nodes of (processor, memory, MP, NI) connected by a network; user code runs on P, system communication work on the MP]

21 (C) 2004 Daniel J. Sorin from Adve, Falsafi, Hill, Lebeck, Reinhardt, Singh ECE 259 / CPS 221 (3) Dedicated Communication Co-Processor User processor stores cmd / msg / data into shared output queue –Must still check for output queue full (or have it grow dynamically) Message processors make transaction happen –Checking, translation, scheduling, transport, interpretation Avoid system call overhead Multiple bus crossings likely bottleneck [Figure: nodes of (processor, memory, MP, NI) on a network; P stores into the shared queue, the MP performs the network transaction]
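The division of labor on this slide (user CPU stores a command into a shared output queue; the message processor drains it and performs the transaction, with queue-full still checked) can be sketched as follows; all names and the queue depth are illustrative:

```python
# Toy communication co-processor: the user CPU only touches shared
# memory; the message processor (MP) does the checking, translation,
# and transport, avoiding system call overhead on the fast path.
from collections import deque

QUEUE_DEPTH = 4
out_queue = deque()
network = []

def cpu_send(dest, data):              # user CPU: store into shared queue
    if len(out_queue) >= QUEUE_DEPTH:
        return False                   # output queue full: caller backs off
    out_queue.append((dest, data))
    return True

def mp_run():                          # message processor drains the queue
    while out_queue:
        dest, data = out_queue.popleft()
        network.append({"dest": dest, "payload": data})

cpu_send(1, "hi")
mp_run()
print(network)  # [{'dest': 1, 'payload': 'hi'}]
```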

22 (C) 2004 Daniel J. Sorin from Adve, Falsafi, Hill, Lebeck, Reinhardt, Singh ECE 259 / CPS 221 Example: Intel Paragon [Figure: Paragon node: two i860 XP processors (50 MHz, 16 KB 4-way caches, 32 B blocks, MESI), one serving as message processor; send and receive DMA engines; NI onto a network linking compute, I/O, and service nodes; 2048 B packets of route, MP handler, variable data, and EOP]

23 (C) 2004 Daniel J. Sorin from Adve, Falsafi, Hill, Lebeck, Reinhardt, Singh ECE 259 / CPS 221 Dedicated MP w/specialized NI: Meiko CS-2 Integrate message processor into network interface –Active messages-like capability –Dedicated threads for DMA, reply handling, simple remote memory access –Supports user-level virtual DMA »Own page table »Can take a page fault, signal OS, restart Meanwhile, nack other node Problem: processor is slow, time-slices threads –Fundamental issue with building your own CPU

24 (C) 2004 Daniel J. Sorin from Adve, Falsafi, Hill, Lebeck, Reinhardt, Singh ECE 259 / CPS 221 (4) Shared Physical Address Space Implement shared address space model in hardware w/o caching –Actual caching must be done by copying from remote memory to local –Programming paradigm looks more like message passing than Pthreads »Nevertheless, low latency & low overhead transfers thanks to HW interpretation; high bandwidth too if done right »Result: great platform for MPI & compiled data-parallel codes Implementation: –“Pseudo-memory” acts as memory controller for remote mem, converts accesses to network transactions (requests) –“Pseudo-CPU” on remote node receives requests, performs on local memory, sends reply –Split-transaction or retry-capable bus required (or dual-ported mem)
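The pseudo-memory / pseudo-CPU pair described above can be sketched with a toy address split; the node-bit layout and names are illustrative, not any particular machine's encoding:

```python
# Toy shared physical address space without caching: the "pseudo-memory"
# on the requesting node turns accesses to remote addresses into network
# requests; the "pseudo-CPU" on the home node performs the access on its
# local memory and replies.

NODE_BITS = 8                         # high bits of the address pick the node
OFFSET_MASK = (1 << NODE_BITS) - 1

memories = {0: [10, 11, 12, 13], 1: [20, 21, 22, 23]}   # per-node memory

def pseudo_cpu(node, offset):         # home node services the request
    return memories[node][offset]

def load(my_node, addr):              # pseudo-memory on the requester
    node, offset = addr >> NODE_BITS, addr & OFFSET_MASK
    if node == my_node:
        return memories[node][offset]      # ordinary local access
    return pseudo_cpu(node, offset)        # converted to network transaction

print(load(0, (1 << NODE_BITS) | 2))  # remote load from node 1 -> 22
```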

25 (C) 2004 Daniel J. Sorin from Adve, Falsafi, Hill, Lebeck, Reinhardt, Singh ECE 259 / CPS 221 Example: Cray T3D Up to 2,048 Alpha microprocessors –No off-chip L2 to avoid inherent latency In addition to remote memory ops, includes: –Prefetch buffer (hide remote latency) –DMA engine (requires OS trap) –Synchronization operations (swap, fetch&inc, global AND/OR) –Global synch operations (barrier, eureka) –Message queue (requires OS trap on receiver) Big problem: physical address space –21064 supports only 32 bits –2K-node machine limited to 2M per node –External “DTB annex” provides segment-like registers for extended addressing, but management is expensive & ugly

26 (C) 2004 Daniel J. Sorin from Adve, Falsafi, Hill, Lebeck, Reinhardt, Singh ECE 259 / CPS 221 Example: Cray T3E Similar to T3D, but uses the Alpha 21164 (with on-chip L2) instead of the 21064 –Still has physical address space problems E-registers for remote communication and synchronization –512 user, 128 system; 64 bits each –Replace/unify DTB Annex, prefetch queue, block transfer engine, remote load/store, and message queue –Address specifies source or destination E-register and command –Data contains pointer to block of 4 E-regs and index for centrifuge Centrifuge –Supports data distributions used in data-parallel languages (HPF) –4 E-regs for global memory operation: mask, base, two arguments

27 (C) 2004 Daniel J. Sorin from Adve, Falsafi, Hill, Lebeck, Reinhardt, Singh ECE 259 / CPS 221 Cray T3E (continued) Atomic Memory operations –E-registers & centrifuge used –Fetch&Increment, Fetch&Add, Compare&Swap, Masked_Swap Messaging –Arbitrary number of queues (user or system) –64-byte messages –Create message queue by storing message control word to memory location Message Send –Construct data in aligned block of 8 E-regs –Send like put, but destination must be message control word –Processor is responsible for queue space (buffer management) Barrier and Eureka synchronization DISCUSSION OF PAPER
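The semantics of the atomic memory operations listed on this slide can be written out on a plain dict standing in for memory; the real operations go through E-registers and the centrifuge, so this sketch shows only what each op returns and leaves behind:

```python
# Semantics of T3E-style atomic memory operations: each returns the
# old value and updates memory in one indivisible step.

mem = {}

def fetch_and_add(addr, v):
    old = mem.get(addr, 0)
    mem[addr] = old + v
    return old                         # Fetch&Increment is the v == 1 case

def compare_and_swap(addr, expect, new):
    old = mem.get(addr, 0)
    if old == expect:
        mem[addr] = new
    return old

def masked_swap(addr, mask, new):
    old = mem.get(addr, 0)
    mem[addr] = (old & ~mask) | (new & mask)   # swap only the masked bits
    return old

print(fetch_and_add(0x100, 1), fetch_and_add(0x100, 1))  # 0 1
```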

28 (C) 2004 Daniel J. Sorin from Adve, Falsafi, Hill, Lebeck, Reinhardt, Singh ECE 259 / CPS 221 (5) Cache Coherent Shared Memory Hardware management of caching shared memory –Less burden on programmer to manage caches –But tough to do on very big systems (e.g., Cray T3E) Cache coherent systems are focus of this course

29 (C) 2004 Daniel J. Sorin from Adve, Falsafi, Hill, Lebeck, Reinhardt, Singh ECE 259 / CPS 221 Summary of Distributed Memory Machines Convergence of architectures –Everything “looks basically the same” –Processor, cache, memory, communication assist Communication Assist –Where is it? (I/O bus, memory bus, processor registers) –What does it know? »Does it just move bytes, or does it perform some functions? –Is it programmable? –Does it run user code? Network transaction –Input & output buffering –Action on remote node

30 (C) 2004 Daniel J. Sorin from Adve, Falsafi, Hill, Lebeck, Reinhardt, Singh ECE 259 / CPS 221 Outline System Organization Taxonomy Distributed Memory Systems (w/o cache coherence) Clusters and Networks of Workstations Presentations of Research Papers

31 (C) 2004 Daniel J. Sorin from Adve, Falsafi, Hill, Lebeck, Reinhardt, Singh ECE 259 / CPS 221 Clusters & Networks of Workstations (C&S 7.7) Connect a bunch of commodity machines together Options: –Low perf comm, high throughput: Condor-style job management –High perf comm: fast SAN and LAN technology Nothing fundamentally different from what we’ve studied already In general, though, network interface further from processor (i.e., on the I/O bus) –Less integration, more software support

32 (C) 2004 Daniel J. Sorin from Adve, Falsafi, Hill, Lebeck, Reinhardt, Singh ECE 259 / CPS 221 Network of Workstations (NOW) Network interface on I/O bus Standards (e.g., PCI) => longer life, faster to market Slow (microseconds) to access network interface “System Area Network” (SAN): between LAN & MPP [Figure: node with processor, cache, core chip set, and main memory; I/O bus with disk controller, graphics controller, and a network interface that interrupts the processor]

33 (C) 2004 Daniel J. Sorin from Adve, Falsafi, Hill, Lebeck, Reinhardt, Singh ECE 259 / CPS 221 Myricom Myrinet (Berkeley NOW) Programmable network interface on I/O Bus (Sun SBUS or PCI) –Embedded custom CPU (“Lanai”, ~40 MHz RISC CPU) –256KB SRAM –3 DMA engines: to network, from network, to/from host memory Downloadable firmware executes in kernel mode –Includes source-based routing protocol SRAM pages can be mapped into user space –Separate pages for separate processes –Firmware can define status words, queues, etc. »Data for short messages or pointers for long ones »Firmware can do address translation too … w/OS help –Poll to check for sends from user Bottom line: I/O bus still bottleneck, CPU could be faster

34 (C) 2004 Daniel J. Sorin from Adve, Falsafi, Hill, Lebeck, Reinhardt, Singh ECE 259 / CPS 221 Example: DEC Memory Channel (Princeton SHRIMP) Reflective Memory Writes on sender appear in receiver’s memory –Must specify send & receive regions (not arbitrary locations) Receive region is pinned in memory Requires duplicate writes, really just message buffers [Figure: sender’s virtual send region mapped through physical memory to the receiver’s virtual receive region]
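The reflective-memory behavior above can be sketched as a pair of mirrored regions; the region names, size, and mirroring function are illustrative:

```python
# Toy reflective memory (Memory Channel / SHRIMP style): a store into
# the sender's mapped send region is duplicated over the network into
# the receiver's pinned receive region. Only mapped regions reflect;
# arbitrary locations do not.

REGION_SIZE = 8
send_region = [0] * REGION_SIZE        # sender's mapped pages
recv_region = [0] * REGION_SIZE        # receiver's pinned pages

def reflected_store(offset, value):
    send_region[offset] = value        # the ordinary (duplicate) local write
    recv_region[offset] = value        # hardware mirrors it to the receiver

reflected_store(3, 99)
print(recv_region[3])  # 99
```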

35 (C) 2004 Daniel J. Sorin from Adve, Falsafi, Hill, Lebeck, Reinhardt, Singh ECE 259 / CPS 221 Outline System Organization Taxonomy Distributed Memory Systems (w/o cache coherence) Clusters and Networks of Workstations Presentations of Research Papers

36 (C) 2004 Daniel J. Sorin from Adve, Falsafi, Hill, Lebeck, Reinhardt, Singh ECE 259 / CPS 221 Thinking Machines CM-5 PRESENTATION