COMP 740: Computer Architecture and Implementation

1 COMP 740: Computer Architecture and Implementation
Montek Singh Nov 14, 2016 Topic: Intro to Multiprocessors and Thread-Level Parallelism

2 Outline
Motivation
Multiprocessors
  SISD, SIMD, MIMD, and MISD
  Memory organization
  Communication mechanisms
Multithreading

3 Motivation
Instruction-Level Parallelism (ILP): what we have covered so far:
  simple pipelining
  dynamic scheduling: scoreboarding and Tomasulo's algorithm
  dynamic branch prediction
  multiple-issue architectures: superscalar, VLIW
  hardware-based speculation
  compiler techniques and software approaches
Bottom line: there just aren't enough instructions that can actually be executed in parallel!
  instruction issue: limit on maximum issue count
  branch prediction: imperfect
  # registers: finite
  functional units: limited in number
  data dependencies: hard to detect dependencies via memory
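The dependence limit above can be seen in two toy loops (an illustrative sketch; the function names are ours, not from the slides). Independent iterations expose ILP, while a loop-carried dependence forms a serial chain that no issue width can break.

```c
#include <stddef.h>

/* Independent iterations: every c[i] can, in principle, execute in
 * parallel, so a wide-issue machine can exploit ILP here. */
void add_independent(const int *a, const int *b, int *c, int n) {
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* Loop-carried dependence: each iteration reads the value the previous
 * one wrote, forming a serial chain regardless of issue width. */
int prefix_sum_last(const int *a, int n) {
    int s = 0;
    for (int i = 0; i < n; i++)
        s = s + a[i];   /* s depends on the previous iteration's s */
    return s;
}
```

A dependence through memory (e.g., `a[i] = a[f(i)] + 1` with an unknown `f`) is even worse: the hardware cannot always tell at issue time whether two accesses conflict, which is the "dependencies via memory" point above.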

4 So, what do we do?
Key Idea: increase the number of running processes
  multiple processes at a given "point" in time
    i.e., at the granularity of one (or a few) clock cycles
    not sufficient to have multiple processes at the OS level!
Two Approaches:
  multiple CPUs, each executing a distinct process: "multiprocessors" or "parallel architectures"
  single CPU executing multiple processes ("threads"): "multithreading" or "thread-level parallelism"

5 Taxonomy of Parallel Architectures
Flynn's Classification:
  SISD: single instruction stream, single data stream
    uniprocessor
  SIMD: single instruction stream, multiple data streams
    same instruction executed by multiple processors, each with its own data memory
    Ex: multimedia processors, vector architectures
  MISD: multiple instruction streams, single data stream
    successive functional units operate on the same stream of data
    rarely found in general-purpose commercial designs; special-purpose stream processors (digital filters, etc.)
  MIMD: multiple instruction streams, multiple data streams
    each processor has its own instruction and data streams
    most popular form of parallel processing
      single-user: high performance for one application
      multiprogrammed: running many tasks simultaneously (e.g., servers)
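The SIMD model can be sketched in plain scalar C (an emulation of the semantics, not real vector instructions): one "instruction" is applied to several data lanes at once.

```c
/* One operation applied to four data lanes at once: the SIMD model,
 * emulated here with a scalar loop. A real SIMD machine would perform
 * all four lane additions with a single instruction. */
typedef struct { int lane[4]; } vec4;

vec4 vec4_add(vec4 x, vec4 y) {
    vec4 r;
    for (int i = 0; i < 4; i++)
        r.lane[i] = x.lane[i] + y.lane[i];  /* same op, 4 data streams */
    return r;
}
```

Compilers targeting SSE/AVX or similar vector ISAs will often turn exactly this kind of loop into one vector instruction, which is why the taxonomy lists multimedia and vector architectures under SIMD.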

6 Multiprocessor: Memory Organization
Centralized, shared-memory multiprocessor:
  usually few processors
  share a single memory & bus
  use large caches

7 Multiprocessor: Memory Organization
Distributed-memory multiprocessor:
  can support large processor counts
    cost-effective way to scale memory bandwidth
    works well if most accesses are to the local memory node
  requires an interconnection network
    communication between processors becomes more complicated and slower

8 Multiprocessor: Hybrid Organization
Use a distributed-memory organization at the top level
Each node itself may be a shared-memory multiprocessor (2-8 processors)

9 Communication Mechanisms
Shared-Memory Communication:
  around for a long time, so well understood and standardized
  memory-mapped
  ease of programming when communication patterns are complex or dynamically varying
  better use of bandwidth when items are small
  Problem: cache coherence is harder
    use "snoopy" and other protocols
Message-Passing Communication:
  simpler hardware, because keeping caches coherent is easier
  communication is explicit, simpler to understand
    focuses programmer attention on communication
  synchronization is naturally associated with communication
    fewer errors due to incorrect synchronization
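Shared-memory communication with explicit synchronization can be sketched with POSIX threads (an illustrative example; the names are ours): two threads communicate through a single shared counter, and a mutex supplies the synchronization that the shared-memory model does not give for free.

```c
#include <pthread.h>
#include <stddef.h>

/* Shared-memory communication: both threads read and write the same
 * memory location. The mutex is the explicit synchronization needed to
 * keep the updates from being lost. */
static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);
        counter++;                 /* communicate through shared memory */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

long run_two_workers(void) {
    pthread_t t1, t2;
    counter = 0;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return counter;   /* 200000 with correct synchronization */
}
```

Without the mutex the two increments can interleave and lose updates, which is exactly the class of synchronization error the slide says message passing makes harder to commit.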

10 Multi-threading

11 Performance Beyond Single Thread
Motivation: some applications have much higher natural parallelism, e.g., database or scientific workloads
  explicit thread-level parallelism or data-level parallelism
What is a thread?
  a process with its own instructions and data
  a thread may be a process, part of a parallel program of multiple processes, or an independent program
  each thread has all the state (instructions, data, PC, register state, and so on) necessary to allow it to execute
What is data-level parallelism?
  performing identical operations on lots of data

12 Multithreading
Threads: multiple processes that share code and data (and much of their address space)
  recently, the term has come to include processes that may run on different processors and even have disjoint address spaces, as long as they share the code
Multithreading: exploit thread-level parallelism within a processor
  fine-grain multithreading: switch between threads on each instruction!
  coarse-grain multithreading: switch to a different thread only if the current thread has a costly stall
    e.g., switch only on a level-2 cache miss

13 Thread-Level Parallelism (TLP)
ILP vs. TLP:
  ILP exploits implicit parallel operations within a loop or straight-line code segment
  TLP is explicitly represented by the use of multiple threads of execution that are inherently parallel
  each thread needs its own PC and its own register file
Goal: use multiple instruction streams to improve
  throughput of computers that run many programs
  execution time of multithreaded programs
TLP could be more cost-effective to exploit than ILP
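A minimal TLP sketch using POSIX threads (illustrative names, not from the slides): each thread carries its own PC and stack, the two instruction streams are inherently parallel, and memory is shared, so the threads can cooperatively process one array.

```c
#include <pthread.h>
#include <stddef.h>

/* Each thread sums a disjoint half of the array: two explicit,
 * inherently parallel instruction streams sharing one data structure. */
typedef struct { const int *a; int lo, hi; long sum; } chunk;

static void *sum_chunk(void *p) {
    chunk *c = p;
    c->sum = 0;
    for (int i = c->lo; i < c->hi; i++)
        c->sum += c->a[i];
    return NULL;
}

long parallel_sum(const int *a, int n) {
    pthread_t t0, t1;
    chunk c0 = { a, 0, n / 2, 0 };
    chunk c1 = { a, n / 2, n, 0 };
    pthread_create(&t0, NULL, sum_chunk, &c0);
    pthread_create(&t1, NULL, sum_chunk, &c1);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    return c0.sum + c1.sum;
}
```

Because the chunks are disjoint, no locking is needed here; the parallelism is stated explicitly by the programmer rather than discovered implicitly by the hardware, which is the ILP/TLP contrast above.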

14 Multithreaded Execution
Multithreading: multiple threads share the functional units of one processor via overlapping
  processor must duplicate the independent state of each thread
    e.g., a separate copy of the register file, a separate PC, and, for running independent programs, a separate page table
  memory is shared through the virtual memory mechanisms, which already support multiple processes
  HW for fast thread switch; must be much faster than a full process switch (100s to 1000s of clocks)
When to switch?
  alternate instructions per thread (fine grain)
  when a thread is stalled, perhaps for a cache miss, another thread can be executed (coarse grain)

15 Fine-Grain Multithreading
switch between threads on each instruction!
  multiple threads executed in an interleaved manner
  interleaving is usually round-robin
  CPU must be capable of switching threads on every cycle!
    fast, frequent switches
main disadvantage: slows down the execution of individual threads
  that is, latency is traded off for better throughput
example: Sun's Niagara
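The latency-for-throughput trade can be captured in a one-line timing model (a simplification we introduce, not a description of Niagara): with round-robin issue over `n_threads` threads, a thread gets one issue slot every `n_threads` cycles.

```c
/* Round-robin fine-grain issue model: the cycle in which a given
 * thread's instruction number `instr_index` issues. With 2 threads a
 * thread's instructions issue at cycles t, t+2, t+4, ... so per-thread
 * latency doubles, while every issue slot stays busy (throughput). */
int issue_cycle(int n_threads, int thread_id, int instr_index) {
    return instr_index * n_threads + thread_id;
}
```

For example, instruction 3 of thread 0 issues at cycle 6 under 2-way interleaving, versus cycle 3 when the thread runs alone: the individual thread is slower, but no cycle is idle as long as every thread has work.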

16 Coarse-Grain Multithreading
switch only if the current thread has a costly stall
  e.g., a level-2 cache miss
  can accommodate slightly costlier switches
  less likely to slow down an individual thread
    a thread is switched "off" only when it has a costly stall
main disadvantage: limited ability to overcome throughput losses
  shorter stalls are ignored, and there may be plenty of those
  issues instructions from a single thread; every switch involves emptying and restarting the instruction pipeline
  hence, better for reducing the penalty of high-cost stalls, where pipeline refill time << stall time
example: IBM AS/400
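The pipeline-refill condition can be stated as a back-of-the-envelope model (illustrative numbers, not AS/400 figures): switching hides a stall of `stall` cycles but costs `refill` cycles to drain and restart the pipeline, so it only pays when the stall is long.

```c
/* Switch-on-stall payoff model: cycles recovered by switching to
 * another thread during a stall, net of the pipeline refill cost.
 * Short stalls (stall <= refill) are not worth a switch. */
int cycles_saved_by_switch(int stall, int refill) {
    int saved = stall - refill;
    return saved > 0 ? saved : 0;
}
```

With, say, a 200-cycle L2 miss and a 20-cycle refill the switch recovers 180 cycles, while a 5-cycle L1 miss recovers nothing, which is why coarse-grain multithreading ignores short stalls and loses the throughput they waste.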

17 Simultaneous Multithreading (SMT)
Example: Intel Pentium with "Hyper-Threading"
Key Idea: exploit ILP across multiple threads!
  i.e., convert thread-level parallelism into more ILP
  exploit the following features of modern processors:
    multiple functional units: modern processors typically have more functional units available than a single thread can utilize
    register renaming and dynamic scheduling: multiple instructions from independent threads can co-exist and co-execute!

18 Multithreaded Categories
[Figure: issue-slot diagrams over time (processor cycles) comparing Superscalar, Coarse-Grained, Fine-Grained, Multiprocessing, and Simultaneous Multithreading; shaded boxes mark slots filled by Threads 1-5, white boxes mark idle slots]
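The issue-slot comparison among these categories can be approximated with a small accounting model (our simplification; `width` and the per-thread `ilp` values are assumed parameters): a superscalar running one thread fills only as many slots as that thread's ILP allows, while SMT fills the leftovers from a second thread.

```c
/* Issue slots used per cycle on a `width`-wide machine.
 * Single-threaded superscalar: limited by the one thread's ILP. */
int used_slots_superscalar(int width, int ilp) {
    return ilp < width ? ilp : width;
}

/* SMT: independent instructions from two threads share the same
 * cycle's issue slots, so idle slots can be filled. */
int used_slots_smt(int width, int ilp_a, int ilp_b) {
    int used = ilp_a + ilp_b;
    return used < width ? used : width;
}
```

On a 4-wide machine, a thread with ILP of 2 leaves half the slots idle; adding a second such thread under SMT fills the machine, which is the "convert TLP into more ILP" idea from the previous slide.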

19 SMT: Design Challenges
dealing with the large register file needed to hold multiple contexts
maintaining low overhead on the clock cycle
  fast instruction issue: choosing what to issue
  instruction commit: choosing what to commit
keeping cache conflicts within acceptable bounds

20 Example: Power 4
Single-threaded predecessor to the Power 5
8 execution units in the out-of-order engine; each may issue an instruction every cycle

21 Power 4 vs. Power 5
[Figure: pipeline diagrams comparing Power 4 and Power 5; the Power 5 adds per-thread resources: 2 fetches (one PC per thread), 2 initial decodes, and 2 commits]

22 Power 5 Data Flow
Why only 2 threads? With 4, one of the shared resources (physical registers, cache, memory bandwidth) would be prone to bottleneck.

23 Power 5 Performance
On 8-processor IBM servers:
  ST baseline with 8 threads
  SMT with 16 threads
Note: few benchmarks show a performance loss

24 Changes in Power 5 to support SMT
Increased associativity of the L1 instruction cache and the instruction address translation buffers
Added per-thread load and store queues
Increased the size of the L2 cache (to 1.92 MB) and of the L3 cache
Added separate instruction prefetch and buffering per thread
Increased the number of virtual registers from 152 to 240
Increased the size of several issue queues
The Power 5 core is about 24% larger than the Power 4 core because of the addition of SMT support

