ECE 669 Parallel Computer Architecture
Lecture 19: Processor Design (April 8, 2004)

Overview
°Special features in microprocessors provide support for parallel processing (bus snooping already discussed)
°Memory latency is getting worse, so multi-process support is important
°Provide for rapid context switches inside the processor
°Support for prefetching, which directly affects processor utilization

Why are traditional RISCs ill-suited for multiprocessing?
°Cannot handle asynchrony well
  -Traps
  -Context switches
°Cannot deal with pipelined memories (multiple outstanding requests)
°Inadequate support for synchronization (e.g. the R2000 has no synchronization instruction; SGI had to memory-map synchronization)

Three major topics
°Pipelining the processor-memory-network path
  -Fast context switching (multithreading)
  -Prefetching
°Synchronization
°Messages

Pipelining – Multithreading
°Resources: memory bus, ALU
°Overlap memory and ALU usage for more effective use of resources
[Figure: resource-usage timelines comparing prefetch, cache, and a general pipeline, each overlapping fetch (instruction or operand) with execute]

RISC Issues
°1 instruction/cycle means huge memory bandwidth requirements
  -Caches: one unified data cache, or separate I & D caches
  -Lots of registers and state
°Pipeline hazards handled by the compiler, reservation bits, bypass paths, interlocks: more state!
°Other "stuff", e.g. register windows: even more state!

Fundamental conflict
°Better single-thread (sequential) performance means more on-chip state
°More on-chip state makes it harder to handle asynchronous events
  -Traps
  -Context switches
  -Synchronization faults
  -Message arrivals
°But why is this a problem in MPs? It makes pipelining processor-memory-network harder. Consider...

Ignore communication system latency (T = 0)
°Then the maximum bandwidth per node limits the maximum processor speed
°Processor and network are matched when the processor request rate equals the network bandwidth; if the processor has a higher request rate, it suffers idle time
[Figure: matched timelines of processor requests and network requests/responses; t is the cache miss interval]

Now, include network latency
°Each request suffers T cycles of latency
°Processor utilization = t / (t + T), where t is the cache miss interval (run length) and T is the latency
°Network bandwidth is also wasted because of lost issue opportunities! FIX?
[Figure: timeline showing the processor idle for the T-cycle latency after each request]

One solution
°Overlap communication with computation: "multithread" the processor
  -Needs a rapid context switch (see HEP, Sparcle)
  -And/or allow multiple outstanding requests -- non-blocking memory
°With p contexts, run length t between misses, latency T, and context switch interval Z:
    Processor utilization = pt / (t + T)   if pt < (t + T)  (too few threads to hide latency)
    Processor utilization = t / (t + Z)    otherwise  (saturated: limited by switch overhead)
°Note that p = 1 recovers the previous slide's t / (t + T)
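As a quick sanity check of the model above (a minimal sketch in C; the function and parameter names just follow the slide's notation, and the saturated case uses the reconstructed t / (t + Z) bound):

    #include <stdio.h>

    /* Multithreaded utilization model:
     *   p = number of hardware contexts
     *   t = run length between cache misses (cycles)
     *   T = miss latency (cycles)
     *   Z = context switch interval (cycles)
     * Utilization is the smaller of the linear regime pt/(t+T)
     * and the saturated bound t/(t+Z). */
    double utilization(int p, double t, double T, double Z) {
        double linear    = p * t / (t + T);
        double saturated = t / (t + Z);
        return linear < saturated ? linear : saturated;
    }

    int main(void) {
        /* Example: 10-cycle runs, 90-cycle latency, 2-cycle switch. */
        for (int p = 1; p <= 10; p++)
            printf("p = %2d  utilization = %.2f\n",
                   p, utilization(p, 10.0, 90.0, 2.0));
        return 0;
    }

With these numbers, utilization climbs linearly with p until roughly p = 9, then flattens at t / (t + Z) = 0.83: adding threads past saturation buys nothing.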

Caveat!
°Of course, the previous analysis assumed network bandwidth was not a limitation.
°When it is, computation speed (processor utilization) is limited by network bandwidth: the processor must wait for the next issue opportunity.
°Lessons:
  -Multithreading allows full utilization of the network bandwidth.
  -Processor utilization will reach 1 only if network bandwidth is not a limitation.

The same applies to synchronization delays
°Without multithreading, all cycles between a synchronization fault and its satisfaction are wasted processor cycles
[Figure: timelines for processes 1-3; the processor idles from "synchronization fault 1" until "synchronization fault 1 satisfied" instead of running another process]

Requirements for latency tolerance (communication or synchronization)
°Processors must switch contexts fast
°The memory system must allow multiple outstanding requests
°Processors must handle traps fast (especially synchronization traps)
°But caution: latency-tolerant processors are no excuse for not exploiting locality and trying to minimize latency. Consider...

Fine multithreading versus block multithreading
°Block multithreading
  1. Switch on a cache miss or synchronization fault
  2. Long runs between switches because of caches
  3. Fewer requests in the network
°Fine multithreading
  1. Switch on each memory request
  2. Short runs need a very fast context switch: minimal processor state, hence poor single-thread performance
  3. Needs a huge amount of network bandwidth and lots of threads

How to implement fast context switches?
°Keep almost no state: switch by putting a new value into the PC
°Minimizes processor state
°Very poor single-thread performance
[Figure: processor holding only a PC, fetching instructions and data directly from memory]

How to implement fast context switches?
°Dedicate a special state memory to hold process state (register sets for processes i, j, ...), with a high-bandwidth path between it and the registers
°Is this the best use of expensive off-chip bandwidth?
[Figure: processor (PC, registers) linked by a high-bandwidth transfer path to a special state memory holding per-process register sets]

How to implement fast context switches?
°Include a few (say 4) register frames, one per resident process context; switch by bumping the FP (frame pointer)
°Switches among the 4 resident processes are fast; otherwise invoke a software loader/unloader
  -Sparcle uses SPARC register windows this way (see the sketch below)
[Figure: processor with PC, FP, and register frames for processes i and j, backed by memory]
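A rough software analogue of the frame-pointer scheme (purely illustrative: the frame count, Frame layout, and function names below are assumptions, not Sparcle's actual mechanism):

    #define NFRAMES 4    /* resident register frames, as on the slide */
    #define NREGS   32

    typedef struct { unsigned pc; unsigned regs[NREGS]; } Frame;

    static Frame frames[NFRAMES];   /* the on-chip register frames */
    static int   fp;                /* frame pointer: selects the active frame */

    /* Fast path: switching among the 4 resident contexts is just
     * a frame-pointer bump -- no register traffic at all. */
    void switch_context(void) {
        fp = (fp + 1) % NFRAMES;
    }

    /* Slow path: a non-resident thread needs the software
     * loader/unloader to spill the victim frame and fill it with
     * the incoming thread's saved state. */
    void load_context(const Frame *incoming, Frame *victim_save) {
        *victim_save = frames[fp];  /* unload the victim to memory */
        frames[fp]   = *incoming;   /* load the new context */
    }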

How to implement fast context switches?
°Block register files: fast transfer of the registers to the on-chip data cache via a wide path
[Figure: processor (PC, registers) with a wide path to the on-chip data cache, backed by memory]

How to implement fast context switches?
°Fast traps are also needed, along with dedicated synchronous trap lines for synchronization faults, cache misses, ...
°Need trap vector spreading to inline common trap code
[Figure: processor (PC, registers) with a dedicated trap frame, backed by memory]

Pipelining processor - memory - network
°Prefetching: issue the loads for A and B early so their memory latencies overlap with each other and with computation, instead of serializing LD A then LD B
[Figure: timelines contrasting serialized LD A, LD B with prefetched, overlapped loads]
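As a concrete sketch of software prefetching (using GCC/Clang's __builtin_prefetch; the loop, array name, and prefetch distance of 8 are illustrative assumptions):

    /* Software prefetching sketch: pull array elements into the cache
     * PREFETCH_DIST iterations ahead, so the memory access pipeline
     * stays busy while the ALU works on the current element.
     * (The distance of 8 is a guess, to be tuned to the actual latency.) */
    #define PREFETCH_DIST 8

    double sum(const double *a, int n) {
        double s = 0.0;
        for (int i = 0; i < n; i++) {
            if (i + PREFETCH_DIST < n)
                __builtin_prefetch(&a[i + PREFETCH_DIST]);
            s += a[i];   /* compute overlaps with the outstanding prefetch */
        }
        return s;
    }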

Synchronization
°Key issue: what hardware support to provide, and what to do in software
°Consider atomic update of the "bound" variable in traveling salesman

Synchronization
°Need a mechanism to lock out other requests to L
[Figure: several processors P sharing memory that holds "bound" and the lock variable L]
°Updating the bound:
    while (Lock(L) == 1) ;    /* spin */
    read bound
    incr bound
    store bound
    unlock(L)
°Lock(L) is an atomic test&set:
    read L
    if (L == 1) return 1;
    else { L = 1; store L; return 0; }
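The same lock in modern C, as a sketch (C11 atomics stand in for whatever test&set instruction the machine provides; update_bound and the minimizing update are illustrative):

    #include <stdatomic.h>

    atomic_flag L = ATOMIC_FLAG_INIT;   /* the lock variable L */
    int bound;                          /* shared traveling-salesman bound */

    void update_bound(int candidate) {
        /* atomic_flag_test_and_set is the atomic test&set: it sets the
         * flag and returns its previous value in one indivisible step. */
        while (atomic_flag_test_and_set(&L))
            ;                           /* Lock(L) == 1: spin and retry */
        if (candidate < bound)          /* critical section: read bound, */
            bound = candidate;          /* update it, store it back */
        atomic_flag_clear(&L);          /* unlock(L) */
    }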

°In uniprocessors: raise the interrupt level to the maximum to gain uninterrupted access
°In multiprocessors: need an instruction to prevent access to L. Methods:
  -Keep synchronization variables in memory; do not release the bus
  -Keep synchronization variables in the cache; prevent outside invalidations
°Usually some data fetches can be memory-mapped so that the cache controller locks out other requests

Data-parallel synchronization
°Can also let the controller do the update of L, e.g. Sparcle (in the Alewife machine)
°Each memory word carries a full/empty bit, as in HEP, checked by the cache controller
°Example instructions:
  -ldent: load; trap if full; set empty
  -ldet: trap if f/e bit = 1
[Figure: processor and cache controller accessing a memory word L tagged with its full/empty bit]

Given a primitive atomic operation, higher forms can be synthesized in software. E.g. producer-consumer on a shared word D:
°Producer: stf D -- store if f/e = 0, set f/e = 1; trap otherwise... retry
°Consumer: lde D -- load if f/e = 1, set f/e = 0; trap otherwise... retry
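Commodity hardware lacks per-word full/empty bits, but the same protocol can be emulated for one producer and one consumer with an atomic flag beside the data (a sketch; all names are illustrative):

    #include <stdatomic.h>

    /* Software emulation of a full/empty-tagged word D:
     * 'full' plays the role of the f/e bit. Correct for a single
     * producer and a single consumer. */
    typedef struct {
        _Atomic int full;   /* 0 = empty, 1 = full */
        int value;          /* the data word D */
    } FEWord;

    /* Producer (stf D): wait until empty, deposit value, mark full. */
    void produce(FEWord *d, int v) {
        while (atomic_load_explicit(&d->full, memory_order_acquire))
            ;               /* f/e = 1: retry (hardware would trap) */
        d->value = v;
        atomic_store_explicit(&d->full, 1, memory_order_release);
    }

    /* Consumer (lde D): wait until full, take value, mark empty. */
    int consume(FEWord *d) {
        while (!atomic_load_explicit(&d->full, memory_order_acquire))
            ;               /* f/e = 0: retry (hardware would trap) */
        int v = d->value;
        atomic_store_explicit(&d->full, 0, memory_order_release);
        return v;
    }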

Some machines provide massive hardware support for synchronization, e.g. Ultracomputer, RP3
°Combining networks: say each processor wants a unique i; fetch&add requests F&A(L,1), F&A(L,2), ... to the same location L (here L = 5) are combined in the switches on the way to memory
  -But the switches become processors -- slow, expensive
°Software combining: implement the combining tree in software, using a tree data structure on the shared variable
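The fetch&add primitive itself maps directly onto C11 atomics (a sketch of the "unique i" use from the slide; combining, in hardware or software, only reduces contention on L and does not change the semantics):

    #include <stdatomic.h>

    _Atomic int L = 0;   /* the shared counter */

    /* F&A(L, 1): atomically return the old value of L and add 1,
     * so every thread that calls next_index() gets a unique i. */
    int next_index(void) {
        return atomic_fetch_add(&L, 1);
    }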

Summary
°Processor support for parallel processing is growing
°Latency tolerance is supported by fast context switching, and also by more advanced software systems
°Maintaining processor utilization is key, and ties to network performance
°Important to maintain RISC single-thread performance
°Even uniprocessors can benefit from fast context switches (register windows)