Advanced Computer Architectures


Advanced Computer Architectures Lecture 27: Introduction to Multiprocessors

Introduction Initial computer performance improvements came mainly from innovative manufacturing techniques. In later years, most improvements came from exploitation of instruction-level parallelism (ILP), using both software and hardware techniques: pipelining, dynamic instruction scheduling, out-of-order execution, VLIW, vector processing, etc. ILP now appears largely exploited: further performance improvements from ILP appear limited.

Thread and Process-Level Parallelism The way to achieve higher performance: of late, the focus has shifted to exploiting thread- and process-level parallelism, i.e., parallelism existing across multiple processes or threads, which cannot be exploited by any ILP processor. Consider a banking application: individual transactions can be executed in parallel.

Processes versus Threads A process is a program in execution. An application normally consists of multiple processes. Threads: a process consists of one or more threads. Threads belonging to the same process share data and code space.

Single and Multithreaded Processes

How can Threads be Created? By using any of the popular thread libraries: POSIX Pthreads, Win32 threads, Java threads, etc.
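As an illustration, here is a minimal sketch using POSIX Pthreads (one of the libraries listed above); the worker function and the thread count are hypothetical, chosen only to show the API shape.

#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4   /* hypothetical thread count */

/* Each thread runs this function; the argument identifies the thread. */
static void *worker(void *arg) {
    long id = (long)arg;
    printf("thread %ld running\n", id);
    return NULL;
}

int main(void) {
    pthread_t tid[NUM_THREADS];

    /* Create the threads... */
    for (long i = 0; i < NUM_THREADS; i++)
        pthread_create(&tid[i], NULL, worker, (void *)i);

    /* ...and wait for all of them to finish. */
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(tid[i], NULL);
    return 0;
}

Compile with the -pthread flag (e.g., gcc demo.c -pthread); the kernel or a user-level library then schedules the threads onto the available processors.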

User Threads Thread management is done in user space. User threads are supported and managed without kernel support and are invisible to the kernel. If one thread blocks, the entire process blocks. Limited benefits of threading.

Kernel Threads Kernel threads are supported and managed directly by the OS. The kernel creates Lightweight Processes (LWPs). Most modern OSes support kernel threads: Windows XP/2000, Solaris, Linux, Mac OS, etc.

Benefits of Threading Responsiveness. Threads share code and data, so thread creation and switching are much more efficient than for processes. As an example, in Solaris: creating a thread is about 30x less costly than creating a process, and context switching between threads is about 5x faster than between processes.

Benefits of Threading cont… Truly concurrent execution: possible with processors supporting concurrent execution of threads: SMP, multi-core, SMT (hyper-threading), etc.

A Few Thread Examples Independent threads occur naturally in several applications. Web server: different HTTP requests are the threads. File server. Name server. Banking: independent transactions. Desktop applications: file loading, display, computations, etc. can be threads.

Reflection on Threading Come to think of it, threading is inherent to any server application. Threads are also easily identifiable in traditional applications: banking, scientific computations, etc.

Thread-level Parallelism --- Cons cont… Threads with severe dependencies: May make multithreading an exercise in futility. Also not as “programmer friendly” as ILP.

Thread vs. Process-Level Parallelism Threads are lightweight (or fine-grained): threads share address space, data, files, etc. Even when the extent of data sharing and synchronization is low, exploitation of thread-level parallelism is meaningful only when communication latency is low. Consequently, shared memory architectures (UMA) are a popular way to exploit thread-level parallelism.

A Broad Classification of Computers Shared-memory multiprocessors (also called UMA). Distributed memory computers (also called NUMA): Distributed Shared-Memory (DSM) architectures, clusters, grids, etc.

UMA vs. NUMA Computers [Figure: (a) UMA model: processors P1 through Pn, each with a cache, connected over a shared bus to main memory; latency = 100s of ns. (b) NUMA model: processor/memory nodes connected over a network; latency = several milliseconds to seconds.]

Distributed Memory Computers Distributed memory computers use: Message Passing Model Explicit message send and receive instructions have to be written by the programmer. Send: specifies local buffer + receiving process (id) on remote computer (address). Receive: specifies sending process on remote computer + local buffer to place data.
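A minimal sketch of such explicit send/receive code, written here against the widely used MPI library (the slide itself does not prescribe a particular API; the ranks, tag, and payload below are illustrative only):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which process am I? */

    if (rank == 0) {
        value = 42;
        /* Send: local buffer + id (rank) of the receiving process. */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Receive: id (rank) of the sending process + local buffer for the data. */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}

Run with at least two processes (e.g., mpirun -np 2 ./a.out); note that sender and receiver must name each other explicitly, which is exactly the burden discussed in the following slides.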

Advantages of Message-Passing Communication Hardware for communication and synchronization is much simpler compared to communication in a shared memory model. Explicit communication: programs are simpler to understand, which helps reduce maintenance and development costs. Synchronization is implicit: naturally associated with sending/receiving messages. Easier to debug.

Disadvantages of Message-Passing Communication The programmer has to write explicit message passing constructs, and must precisely identify the processes (or threads) with which communication is to occur. Explicit calls to the operating system: higher overhead.

DSM Physically separate memories are accessed as one logical address space. Processors running on a multicomputer system share their memory. Implemented by the operating system. DSM multiprocessors are NUMA: access time depends on the exact location of the data.

Distributed Shared-Memory Architecture (DSM) Underlying mechanism is message passing: Shared memory convenience provided to the programmer by the operating system. Basically, an operating system facility takes care of message passing implicitly. Advantage of DSM: Ease of programming

Disadvantage of DSM High communication cost: a program not specifically optimized for DSM by the programmer will perform extremely poorly. Data (variables) accessed by specific program segments have to be collocated. Useful only for process-level (coarse-grained) parallelism.

Advanced Computer Architectures Lecture 29: Symmetric Multiprocessors (SMPs)

Symmetric Multiprocessors (SMPs) SMPs are a popular shared memory multiprocessor architecture: Processors share Memory and I/O. Bus based: access time for all memory locations is equal --- "Symmetric MP". [Figure: four processors, each with a private cache, connected by a bus to main memory and the I/O system.]

SMPs: Some Insights In any multiprocessor, main memory access is a bottleneck: Multilevel caches reduce the memory demand of a processor. Multilevel caches in fact make it possible for more than one processor to meaningfully share the memory bus. Hence multilevel caches are a must in a multiprocessor!

Different SMP Organizations Processor and cache on separate extension boards (1980s): plugged onto the backplane. Integrated on the main board (1990s): 4 or 6 processors placed per board. Integrated on the same chip (multi-core) (2000s): dual core (IBM, Intel, AMD), quad core, etc.

Pros of SMPs Ease of programming: Especially when communication patterns are complex or vary dynamically during execution.

Cons of SMPs As the number of processors increases, contention for the bus increases, so scalability of the SMP model is restricted. One way out may be to use switches (crossbar, multistage networks, etc.) instead of a bus; switches set up parallel point-to-point connections. Again, switches are not without disadvantages: they make implementation of cache coherence difficult.

Why Multicores? Recall the constraints on further increases in circuit complexity: clock skew and temperature. The use of more complex techniques to improve single-thread performance is therefore limited, so any additional transistors have to be used in a different core.

Why Multicores? Cont… Multiple cores on the same physical packaging: Execute different threads. Switched off, if no thread to execute (power saving). Dual core, quad core, etc.

Cache Organizations for Multicores L1 caches are always private to a core. L2 caches can be private or shared: which is better? [Figure: left, four cores each with a private L1 and a private L2; right, four cores each with a private L1 sharing a single L2.]

L2 Organizations Advantages of a shared L2 cache: Efficient dynamic use of space by each core Data shared by multiple cores is not replicated. Every block has a fixed “home” – hence, easy to find the latest copy. Advantages of a private L2 cache: Quick access to private L2 Private bus to private L2, less contention.

An Important Problem with Shared-Memory: Coherence When shared data are cached, they are replicated in multiple caches, and the data in the caches of different processors may become inconsistent. How is cache coherency enforced? How does a processor learn of changes made in the caches of other processors?

The Cache Coherency Problem [Figure: processors P1, P2, and P3; memory location U initially holds 5. P1 and P3 read U into their caches (steps 1-3), then P3 writes U:7 into its own cache (step 4). Step 5: what value will P1 and P2 read?]
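The scenario can be mimicked in software: the toy C program below (purely illustrative, not real hardware) gives each processor a private copy of U and never propagates P3's write, so P1 keeps reading the stale value 5.

#include <stdio.h>

int memory_U = 5;                       /* main memory copy of U */

struct cache { int has_U; int U; };     /* one private cache per processor */

static void cache_read(struct cache *c, const char *name) {
    if (!c->has_U) { c->U = memory_U; c->has_U = 1; }   /* read miss: fill from memory */
    printf("%s reads U = %d\n", name, c->U);
}

int main(void) {
    struct cache p1 = {0, 0}, p2 = {0, 0}, p3 = {0, 0};

    cache_read(&p1, "P1");   /* P1 caches U = 5 */
    cache_read(&p3, "P3");   /* P3 caches U = 5 */

    p3.U = 7;                /* P3 writes U = 7 into its own (write-back) cache;
                                no invalidate or update is sent to the others  */

    cache_read(&p1, "P1");   /* stale: P1 still sees 5 */
    cache_read(&p2, "P2");   /* P2 misses and fills from memory, which with a
                                write-back cache still holds the old value 5  */
    return 0;
}

With a coherence protocol, P3's write would either invalidate or update the copy in P1's cache, and the snooping or directory hardware would supply the fresh value on P2's miss.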

Cache Coherence Solutions (Protocols) The key to maintaining cache coherence: track the state of sharing of every data block. Based on this idea, the following can be an overall solution: dynamically recognize any potential inconsistency at run-time and carry out a preventive action.

Basic Idea Behind Cache Coherency Protocols [Figure: processors with private caches connected by a bus to main memory and the I/O system.]

Pros and Cons of the Solution Pro: consistency maintenance becomes transparent to programmers, compilers, as well as to the operating system. Con: increased hardware complexity.

Two Important Cache Coherency Protocols Snooping protocol: each cache "snoops" the bus to find out which data is being used by whom. Directory-based protocol: keep track of the sharing state of each data block using a directory. A directory is a centralized record of the sharing state of all memory blocks; it allows the coherency protocol to avoid broadcasts.

Snoopy and Directory-Based Protocols [Figure: processors with private caches connected by a bus to main memory and the I/O system.]

Snooping vs. Directory-based Protocols The snooping protocol reduces memory traffic and is more efficient. However, it requires broadcasts: it can meaningfully be implemented only when there is a shared bus, and even with a shared bus, scalability is a problem. Some workarounds have been tried: the Sun Enterprise server has up to 4 buses.

Snooping Protocol As soon as a request for any data block by a processor is put out on the bus: Other processors “snoop” to check if they have a copy and respond accordingly. Works well with bus interconnection: All transmissions on a bus are essentially broadcast: Snooping is therefore effortless. Dominates almost all small scale machines.

Categories of Snoopy Protocols Essentially two types: the Write Invalidate Protocol and the Write Broadcast Protocol. Write invalidate: when one processor writes to its cache, all other processors holding a copy of that data block invalidate their copy. Write broadcast: when one processor writes to its cache, all other processors holding a copy of that data block update their copy with the newly written value.

Write Invalidate vs. Write Update Protocols [Figure: processors with private caches connected by a bus to main memory and the I/O system.]

Write Invalidate Protocol Handling a write to shared data: an invalidate command is sent on the bus --- all caches snoop and invalidate any copies they have. Handling a read miss: with write-through, memory is always up-to-date; with write-back, snooping finds the most recent copy.

Write Invalidate in Write-Through Caches Simple implementation. Writes: a write to shared data is broadcast on the bus; other processors snoop and invalidate any copies they hold. Read miss: memory is always up-to-date. Concurrent writes: write serialization is automatically achieved since the bus serializes requests; the bus provides the basic arbitration support.

Write Invalidate versus Broadcast cont… Invalidate exploits spatial locality: only one bus transaction is needed for any number of writes to the same block, which is obviously more efficient. Broadcast, on the other hand, has lower latency between a write and a subsequent read of the data by another processor.

An Example Snoopy Protocol Assume: invalidation protocol, write-back cache. Each block of memory is in one of the following states: Shared: clean in all caches and up-to-date in memory; the block can be read. Exclusive: this cache has the only copy; it is writable and dirty. Invalid: the data present in the block is obsolete and cannot be used.

Advanced Computer Architectures Lecture 30: Cache Coherence Protocols

Implementation of the Snooping Protocol A cache controller at every processor implements the protocol. It has to perform specific actions when the local processor makes certain requests, and also when certain addresses appear on the bus. The exact actions of the cache controller depend on the state of the cache block. Two FSMs can show the different types of actions to be performed by a controller.

Snoopy-Cache State Machine-I State machine considering only CPU requests, for each cache block (states: Invalid, Shared (read only), Exclusive (read/write)):
Invalid, CPU read: place read miss on bus; go to Shared.
Invalid, CPU write: place write miss on bus; go to Exclusive.
Shared, CPU read hit: remain in Shared.
Shared, CPU read miss: place read miss on bus; remain in Shared.
Shared, CPU write: place write miss on bus; go to Exclusive.
Exclusive, CPU read hit or CPU write hit: remain in Exclusive.
Exclusive, CPU read miss: write back block, place read miss on bus; go to Shared.
Exclusive, CPU write miss: write back cache block, place write miss on bus; remain in Exclusive.

Snoopy-Cache State Machine-II State machine considering only bus requests, for each cache block:
Shared, write miss for this block: go to Invalid.
Exclusive, write miss for this block: write back block (abort memory access); go to Invalid.
Exclusive, read miss for this block: write back block (abort memory access); go to Shared.

Combined Snoopy-Cache State Machine State machine considering both CPU requests and bus requests for each cache block: the union of the transitions of the two machines above, over the same three states (Invalid, Shared (read only), Exclusive (read/write)).
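To make the combined machine concrete, here is a compact sketch in C, assuming the write-invalidate, write-back protocol above; bus messages are only printed, and the event names are illustrative rather than taken from any particular controller.

#include <stdio.h>

/* States of one cache block under the write-invalidate, write-back protocol. */
typedef enum { INVALID, SHARED, EXCLUSIVE } State;

/* Events seen by the cache controller for this block (names are illustrative). */
typedef enum { CPU_READ, CPU_WRITE, BUS_READ_MISS, BUS_WRITE_MISS } Event;

static State step(State s, Event e) {
    switch (e) {
    case CPU_READ:
        if (s == INVALID) { puts("place read miss on bus"); return SHARED; }
        return s;                                    /* read hit in Shared or Exclusive */
    case CPU_WRITE:
        if (s != EXCLUSIVE) puts("place write miss on bus");
        return EXCLUSIVE;                            /* write hit or write miss */
    case BUS_READ_MISS:                              /* another processor reads the block */
        if (s == EXCLUSIVE) { puts("write back block"); return SHARED; }
        return s;
    case BUS_WRITE_MISS:                             /* another processor writes the block */
        if (s == EXCLUSIVE) puts("write back block");
        return INVALID;
    }
    return s;
}

int main(void) {
    State s = INVALID;
    s = step(s, CPU_READ);       /* Invalid   -> Shared                */
    s = step(s, CPU_WRITE);      /* Shared    -> Exclusive             */
    s = step(s, BUS_READ_MISS);  /* Exclusive -> Shared (write back)   */
    s = step(s, BUS_WRITE_MISS); /* Shared    -> Invalid               */
    printf("final state: %d\n", s);
    return 0;
}

The main() sequence walks one block through Invalid, Shared, and Exclusive and back, mirroring the transitions in the diagrams above.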

Directory-based Solution In NUMA computers, messages have long latency, and broadcast is inefficient --- all messages have explicit responses. The main memory controller keeps track of which processors have cached copies of which memory locations. On a write, only the sharers need to be informed, not everyone. On a dirty read, the request is forwarded to the owner.

Directory Protocol Three states, as in the snoopy protocol: Shared: one or more processors have the data, and memory is up-to-date. Uncached: no processor has the block. Exclusive: one processor (the owner) has the block. In addition to the cache state, the directory must track which processors have the data when in the shared state. This is usually implemented using a bit vector: bit i is 1 if processor i has a copy.
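A directory entry can therefore be as simple as a state field plus a sharer bit vector. The sketch below assumes at most 64 processors; the struct and helper names are hypothetical.

#include <stdint.h>
#include <stdio.h>

typedef enum { UNCACHED, SHARED, EXCLUSIVE } DirState;

/* One directory entry per memory block: sharing state + bit vector of sharers. */
typedef struct {
    DirState state;
    uint64_t sharers;   /* bit i set => processor i has a copy (up to 64 CPUs) */
} DirEntry;

static void add_sharer(DirEntry *e, int cpu)      { e->sharers |= (1ULL << cpu); }
static int  is_sharer(const DirEntry *e, int cpu) { return (e->sharers >> cpu) & 1; }

int main(void) {
    DirEntry u = { UNCACHED, 0 };

    /* Processors 1 and 3 read the block: it becomes Shared with two sharers. */
    u.state = SHARED;
    add_sharer(&u, 1);
    add_sharer(&u, 3);

    printf("P3 sharer: %d, P2 sharer: %d\n", is_sharer(&u, 3), is_sharer(&u, 2));
    return 0;
}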

Directory Behavior On a read: Unused (uncached): give an (exclusive) copy to the requester and record it as owner. Exclusive or shared: send a share message to the current exclusive owner (if any), and return the value to the requester. Exclusive dirty: forward the read request to the exclusive owner.

Directory Behavior On Write On a write: send invalidate messages to all hosts caching the value. On a write-through or write-back: update the value in memory.

CPU-Cache State Machine State machine for CPU requests for each memory block, in the directory protocol (Uncached = the block is only in memory):
Uncached, CPU read: send read miss message to home directory; go to Shared (read only).
Uncached, CPU write: send write miss message to home directory; go to Exclusive (read/write).
Shared, CPU read hit: remain in Shared.
Shared, CPU write: send write miss message to home directory; go to Exclusive.
Shared, invalidate or miss due to address conflict: go to Uncached.
Exclusive, CPU read hit or CPU write hit: remain in Exclusive.
Exclusive, fetch from home directory: send data write back message to home directory; go to Shared.
Exclusive, fetch/invalidate or miss due to address conflict: send data write back message to home directory; go to Uncached.

State Transition Diagram for the Directory Tracks all copies of a memory block. Same states as in the transition diagram for an individual cache. Memory controller actions: update the directory state and send messages to satisfy requests. Each transition also indicates an action that updates the sharing set, Sharers, as well as sending a message.

Directory State Machine State machine for directory requests for each memory block (Uncached = the block is only in memory):
Uncached, read miss: Sharers = {P}; send Data Value Reply; go to Shared (read only).
Uncached, write miss: Sharers = {P}; send Data Value Reply; go to Exclusive (read/write).
Shared, read miss: Sharers += {P}; send Data Value Reply; remain in Shared.
Shared, write miss: send Invalidate to Sharers; then Sharers = {P}; send Data Value Reply; go to Exclusive.
Exclusive, read miss: Sharers += {P}; send Fetch to the owner (owner writes back the block); send Data Value Reply to the remote cache; go to Shared.
Exclusive, write miss: send Fetch/Invalidate to the owner (owner writes back the block); Sharers = {P}; send Data Value Reply to the remote cache; remain in Exclusive.
Exclusive, Data Write Back: Sharers = {}; write back the block; go to Uncached.
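The same machine can be expressed as a short C sketch: one handler per incoming message type, operating on the kind of directory entry introduced earlier. Message sends are only printed, and all names are illustrative.

#include <stdint.h>
#include <stdio.h>

typedef enum { UNCACHED, SHARED, EXCLUSIVE } DirState;
typedef struct { DirState state; uint64_t sharers; } DirEntry;

/* Read miss from processor p: reply with data, add p to the sharing set. */
static void read_miss(DirEntry *e, int p) {
    if (e->state == EXCLUSIVE)
        puts("send Fetch to owner; owner writes block back");
    puts("send Data Value Reply");
    e->sharers |= (1ULL << p);            /* Sharers += {P} */
    e->state = SHARED;
}

/* Write miss from processor p: invalidate other copies, make p the owner. */
static void write_miss(DirEntry *e, int p) {
    if (e->state == SHARED)
        puts("send Invalidate to all sharers");
    else if (e->state == EXCLUSIVE)
        puts("send Fetch/Invalidate to owner");
    puts("send Data Value Reply");
    e->sharers = (1ULL << p);             /* Sharers = {P} */
    e->state = EXCLUSIVE;
}

/* Owner evicts a dirty block: memory becomes up-to-date, entry is uncached. */
static void data_write_back(DirEntry *e) {
    puts("write block back to memory");
    e->sharers = 0;                        /* Sharers = {} */
    e->state = UNCACHED;
}

int main(void) {
    DirEntry blk = { UNCACHED, 0 };
    read_miss(&blk, 1);     /* Uncached  -> Shared,    Sharers = {P1}           */
    write_miss(&blk, 3);    /* Shared    -> Exclusive, P1 invalidated, owner P3 */
    data_write_back(&blk);  /* Exclusive -> Uncached                            */
    return 0;
}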