Single-Chip Multiprocessors: Redefining the Microarchitecture of Multiprocessors Guri Sohi University of Wisconsin
2 Outline
- Waves of innovation in architecture
- Innovation in uniprocessors
- Opportunities for innovation
- Example CMP innovations
- Parallel processing models
- CMP memory hierarchies
3 Waves of Research and Innovation
- A new direction is proposed or new opportunity becomes available
- The center of gravity of the research community shifts to that direction
  - SIMD architectures in the 1960s
  - HLL computer architectures in the 1970s
  - RISC architectures in the early 1980s
  - Shared-memory MPs in the late 1980s
  - OOO speculative execution processors in the 1990s
4 Waves
- Wave is especially strong when coupled with a “step function” change in technology
  - Integration of a processor on a single chip
  - Integration of a multiprocessor on a chip
5 Uniprocessor Innovation Wave
- Integration of processor on a single chip
  - The inflexion point
  - Argued for different architecture (RISC)
- More transistors allow for different models
  - Speculative execution
- Then the rebirth of uniprocessors
  - Continue the journey of innovation
  - Totally rethink uniprocessor microarchitecture
6 The Next Wave
- Can integrate a simple multiprocessor on a chip
  - Basic microarchitecture similar to traditional MP
- Rethink the microarchitecture and usage of chip multiprocessors
7 Broad Areas for Innovation
- Overcoming traditional barriers
- New opportunities to use CMPs for parallel/multithreaded execution
- Novel use of on-chip resources (e.g., on-chip memory hierarchies and interconnect)
8 Remainder of Talk: Roadmap
- Summary of traditional parallel processing
- Revisiting traditional barriers and overheads to parallel processing
- Novel CMP applications and workloads
- Novel CMP microarchitectures
9 Multiprocessor Architecture
- Take a state-of-the-art uniprocessor
- Connect several together with a suitable network
  - Have to live with defined interfaces
- Expend hardware to provide cache coherence and streamline inter-node communication
10 Software Responsibilities
- Reason about parallelism, execution times and overheads
  - This is hard
- Use synchronization to ease reasoning
  - Parallel code trends toward serial with the use of synchronization
- Very difficult to parallelize transparently
11 Net Result
- Difficult to get parallelism speedup
  - Computation is serial
  - Inter-node communication latencies exacerbate the problem
- Multiprocessors rarely used for parallel execution
  - Typical use: improve throughput
- This will have to change
  - Will need to rethink “parallelization”
12 Rethinking Parallelization
- Speculative multithreading
- Speculation to overcome other performance barriers
- Revisiting computation models for parallelization
- New parallelization opportunities
  - New types of workloads (e.g., multimedia)
  - New variants of parallelization models
13 Speculative Multithreading
- Speculatively parallelize an application
  - Use speculation to overcome ambiguous dependences
  - Use hardware support to recover from mis-speculation
- E.g., multiscalar
- Use hardware to overcome barriers
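To make this concrete, a minimal software sketch of the idea follows (illustrative only: multiscalar performs the dependence tracking and squashing in hardware, and every name below is invented). Tasks run in parallel while buffering stores and recording read/write sets; a task that read a location an earlier task wrote is squashed and re-executed at commit time:

    // Illustrative sketch only -- not multiscalar's actual mechanism.
    #include <cstdio>
    #include <functional>
    #include <thread>
    #include <utility>
    #include <vector>

    struct Task {
        std::vector<int> reads, writes;           // recorded dependence sets
        std::vector<std::pair<int, int>> buf;     // buffered (index, value) stores
    };

    int main() {
        std::vector<int> data = {1, 2, 3, 4, 5, 6, 7, 8};
        auto body = [&](int i, Task& t) {         // loop body with an ambiguous
            int src = (i * 3) % (int)data.size(); // dependence: src is unknown
            t.reads.push_back(src);               // until run time
            t.writes.push_back(i);
            t.buf.emplace_back(i, data[src] + 1); // store is buffered, not applied
        };
        std::vector<Task> tasks(data.size());
        std::vector<std::thread> pool;
        for (int i = 0; i < (int)data.size(); ++i)   // run all tasks speculatively
            pool.emplace_back(body, i, std::ref(tasks[i]));
        for (auto& th : pool) th.join();
        for (int i = 0; i < (int)data.size(); ++i) { // commit in program order
            bool conflict = false;                   // squash task i if it read a
            for (int j = 0; j < i; ++j)              // location an earlier task
                for (int r : tasks[i].reads)         // wrote
                    for (int w : tasks[j].writes)
                        if (r == w) conflict = true;
            if (conflict) { Task redo; body(i, redo); tasks[i] = redo; }
            for (auto& [idx, val] : tasks[i].buf) data[idx] = val;
        }
        for (int v : data) std::printf("%d ", v);    // sequential semantics kept
    }

The point of the sketch: parallelism is extracted without proving independence up front; speculation plus recovery substitutes for the proof.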
14 Overcoming Barriers: Memory Models
- Weak models proposed to overcome performance limitations of SC
- Speculation used to overcome “maybe” dependences
- Series of papers showing SC can achieve the performance of weak models
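The store-buffering litmus test below illustrates what is at stake (a C++ sketch, not taken from the cited papers). Sequential consistency forbids the outcome r1 == r2 == 0; weaker models allow it:

    #include <atomic>
    #include <cstdio>
    #include <thread>

    std::atomic<int> x{0}, y{0};
    int r1, r2;

    int main() {
        // Under seq_cst all four accesses fall in one total order, so at
        // least one load must observe the other thread's store:
        // r1 == r2 == 0 is impossible (the SC guarantee).
        std::thread t1([] { x.store(1, std::memory_order_seq_cst);
                            r1 = y.load(std::memory_order_seq_cst); });
        std::thread t2([] { y.store(1, std::memory_order_seq_cst);
                            r2 = x.load(std::memory_order_seq_cst); });
        t1.join(); t2.join();
        // With memory_order_relaxed instead, r1 == r2 == 0 is allowed: store
        // buffers can reorder each store past the other thread's load. That
        // reordering is the performance weak models expose -- and the cited
        // work recovers for SC by speculating past it.
        std::printf("r1=%d r2=%d\n", r1, r2);
    }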
15 Implications
- Strong memory models do not necessarily mean low performance
- Programmer does not have to reason about weak models
- More likely to have parallel programs written
16 Overcoming Barriers: Synchronization
- Synchronization to avoid “maybe” dependences
  - Causes serialization
- Speculate to overcome serialization
- Recent work on techniques to dynamically elide synchronization constructs
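A rough software analogue of elision (the hardware proposals elide the lock transparently; nothing below is their actual interface): update optimistically and retry only when another thread truly interfered, so independent critical sections no longer serialize:

    #include <atomic>
    #include <cstdio>
    #include <thread>

    std::atomic<long> counter{0};   // state a lock would normally protect

    void add(long delta) {
        long seen = counter.load(std::memory_order_relaxed);
        // Speculate that no one interferes; publish the update only if
        // that assumption held. On failure 'seen' is refreshed and the
        // "critical section" is replayed -- serialization happens only
        // on a true conflict.
        while (!counter.compare_exchange_weak(seen, seen + delta,
                                              std::memory_order_acq_rel))
            ;
    }

    int main() {
        std::thread a([] { for (int i = 0; i < 100000; ++i) add(1); });
        std::thread b([] { for (int i = 0; i < 100000; ++i) add(1); });
        a.join(); b.join();
        std::printf("%ld\n", counter.load());   // always 200000
    }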
17 Implications
- Programmer can make liberal use of synchronization to ease programming
- Little performance impact of synchronization
- More likely to have parallel programs written
18 Revisiting Parallelization Models
- Transactions simplify writing of parallel code
  - Very high overhead to implement semantics in software
- Hardware support for transactions will exist
- Speculative multithreading is ordered transactions
  - No software overhead to implement semantics
- More applications likely to be written with transactions
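A small sketch of the ordered-transactions view (illustrative; hardware would provide the ordering without this spin-wait): transactions execute speculatively in parallel but commit strictly in program order, which preserves sequential semantics:

    #include <atomic>
    #include <cstdio>
    #include <thread>
    #include <vector>

    std::atomic<int> commit_turn{0};   // stands in for hardware ordering support
    std::vector<int> committed;        // architectural state, touched only at commit

    void transaction(int id) {
        int result = id * id;          // speculative work on private state
        while (commit_turn.load(std::memory_order_acquire) != id)
            ;                          // wait: commits happen in program order
        committed.push_back(result);   // safe: exactly one committer at a time
        commit_turn.store(id + 1, std::memory_order_release);
    }

    int main() {
        std::vector<std::thread> pool;
        for (int i = 0; i < 4; ++i) pool.emplace_back(transaction, i);
        for (auto& t : pool) t.join();
        for (int v : committed) std::printf("%d ", v);   // always "0 1 4 9"
    }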
19 New Opportunities for CMPs
- New opportunities for parallelism
  - Emergence of multimedia workloads: amenable to traditional parallel processing
  - Parallel execution of overhead code
  - Program demultiplexing
  - Separating user and OS
20 New Opportunities for CMPs
- New opportunities for microarchitecture innovation
  - Instruction memory hierarchies
  - Data memory hierarchies
21 Reliable Systems
- Software is unreliable and error-prone
- Hardware will be unreliable and error-prone
- Improving hardware/software reliability will result in significant software redundancy
- Redundancy will be a source of parallelism
22 Software Reliability & Security
- Reliability & security via dynamic monitoring
  - Many academic proposals for C/C++ code: CCured, Cyclone, SafeC, etc.
  - VM performs checks for Java/C#
- High overheads!
  - Encourages use of unsafe code
23 Software Reliability & Security
- Divide program into tasks
- Fork a monitor thread to check the computation of each task
- Instrument the monitor thread with safety-checking code
[Figure: program tasks A, B, C, D with forked monitor threads A’, B’, C’ running the monitoring code]
24 Software Reliability & Security
- Commit/abort at task granularity
- Precise error detection achieved by re-executing code with inlined checks
[Figure: tasks A–D running alongside monitors A’–C’; a monitor that passes its checks commits its task, one that fails aborts it]
25 Software Reliability & Security
- Fine-grained instrumentation
  - Flexible, arbitrary safety checking
- Coarse-grained verification
  - Amortizes thread startup latency over many instructions
- Initial results: software-based fault isolation (Wahbe et al., SOSP 1993)
  - Assumes no misspeculation due to traps, interrupts, or speculative storage overflow
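A minimal sketch of the task/monitor split using ordinary C++ threads (all names invented; the proposal holds a task's effects speculative in hardware until its monitor agrees): the main thread runs the uninstrumented task while a forked monitor re-executes it with checks inlined:

    #include <cstdio>
    #include <future>
    #include <vector>

    int task(const std::vector<int>& v) {           // fast, uninstrumented task
        int sum = 0;
        for (size_t i = 0; i < v.size(); ++i) sum += v[i];
        return sum;
    }

    int task_checked(const std::vector<int>& v) {   // monitor copy of the task
        int sum = 0;
        for (size_t i = 0; i < v.size(); ++i)
            sum += v.at(i);       // at() is the inlined safety (bounds) check
        return sum;
    }

    int main() {
        std::vector<int> data = {3, 1, 4, 1, 5};
        auto monitor = std::async(std::launch::async, task_checked,
                                  std::cref(data)); // fork the monitor thread
        int fast = task(data);                      // program races ahead
        if (monitor.get() == fast) std::puts("COMMIT");
        else                       std::puts("ABORT: re-execute with checks");
    }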
26 Software Reliability & Security
27 Program De-multiplexing
- New opportunities for parallelism
  - 2-4X parallelism
- A program is a multiplexing of methods (or functions) onto a single control flow
- De-multiplex the methods of the program
- Execute methods in parallel in dataflow fashion
28 Data-flow Execution
Data-flow machines
- Dataflow in programs
- No control flow, no PC
- Nodes are instructions on FUs
OoO superscalar machines
- Sequential programs
- Limited dataflow via the ROB
- Nodes are instructions on FUs
[Figure: dataflow graph with nodes 1–7]
29 Program Demultiplexing
Demux’ed execution
- Nodes are methods (M)
- Processors are the FUs
[Figure: the sequential program’s dataflow graph (nodes 1–7) executed with methods as the nodes]
30 Program Demultiplexing
Triggers
- Usually fire after the data dependences of M are satisfied
- Chosen with software support
- Handlers generate parameters
PD
- Data-flow based speculative execution
- Differs from control-flow speculation
[Figure: method 6 of the sequential program is demux’ed and executed when its trigger fires; on the call, its result is used]
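A rough software analogue of demultiplexed execution using std::future (the proposal uses hardware triggers and speculation; this sketch launches the method only once its data dependence is genuinely satisfied, and expensive_method is a made-up example):

    #include <cstdio>
    #include <future>

    int expensive_method(int param) { return param * 2 + 1; }  // hypothetical M

    int main() {
        int x = 20;
        x += 1;   // last write to x: M's data dependence is now satisfied
        // "On trigger, execute": M is demultiplexed out of the control
        // flow and starts early, dataflow-style, on another core.
        auto demuxed = std::async(std::launch::async, expensive_method, x);
        int unrelated = 0;                  // sequential work overlaps with M
        for (int i = 0; i < 1000; ++i) unrelated += i;
        // "On call, use": the original call site consumes the early result.
        std::printf("%d %d\n", demuxed.get(), unrelated);
    }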
31 Data-flow Execution of Methods
[Figure: timeline contrasting the earliest possible execution time of M with M’s execution time plus demultiplexing overheads]
32 Impact of Parallelization
- Expect different characteristics for code on each core
- More reliance on inter-core parallelism
- Less reliance on intra-core parallelism
- May have specialized cores
33 Microarchitectural Implications
- Processor cores
  - Skinny, less complex
  - Will SMT go away?
  - Perhaps specialized
- Memory structures (i-caches, TLBs, d-caches)
  - Different organizations possible
34 Microarchitectural Implications
- Novel memory hierarchy opportunities
  - Use on-chip memory hierarchy to improve latency and attenuate off-chip bandwidth
- Pressure on non-core techniques to tolerate longer latencies
  - Helper threads, pre-execution
35 Instruction Memory Hierarchy
Computation spreading
36 Conventional CMP Organizations
[Figure: private organization — each processor P with its own L1I/L1D and L2, connected by an interconnect network]
37 Code Commonality
- A large fraction of the executed code is common to all the processors, at both 8KB page (left) and 64-byte cache block (right) granularity
- Poor utilization of the L1 I-cache due to replication
38 Removing Duplication
- Avoiding duplication significantly reduces misses
- A shared L1 I-cache may be impractical
[Figure: MPKI with 64K caches per benchmark: 18.4, 17.9, 23.3, 3.91, 19.1, 26.8, 22.1, 4.211]
39 Computation Spreading
- Avoid reference spreading by distributing the computation based on code regions
- Each processor is responsible for a particular region of code (for multiple threads)
- References to a particular code region are localized
- Computation from one thread is carried out on different processors
40 Computation Spreading Example
[Figure: in the canonical model, each processor P1–P3 executes all code sections (A, B, C) of its own thread T1–T3 over time; with computation spreading, each processor executes one code section across all threads (P1: A, A’, A’’; P2: B, B’, B’’; P3: C, C’, C’’)]
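A software sketch of the same idea (the queues stand in for the proposed migration of computation; region names are invented): each core owns one code region, and work migrates between cores rather than each core executing every region:

    #include <condition_variable>
    #include <cstdio>
    #include <mutex>
    #include <queue>
    #include <thread>

    struct Channel {                    // stands in for migrating a thread
        std::queue<int> q; std::mutex m; std::condition_variable cv;
        void push(int v) { { std::lock_guard<std::mutex> g(m); q.push(v); }
                           cv.notify_one(); }
        int pop() { std::unique_lock<std::mutex> l(m);
                    cv.wait(l, [&] { return !q.empty(); });
                    int v = q.front(); q.pop(); return v; }
    };

    Channel to_a, to_b;

    int main() {
        std::thread core_a([] {         // this core executes only region A,
            for (int n = 0; n < 3; ++n) // so its I-cache holds only A's code
                to_b.push(to_a.pop() + 10);
        });
        std::thread core_b([] {         // this core executes only region B
            for (int n = 0; n < 3; ++n)
                std::printf("%d\n", to_b.pop() * 2);
        });
        for (int t = 0; t < 3; ++t) to_a.push(t);  // work from "threads" T1-T3
        core_a.join(); core_b.join();
    }

Note the trade the next slide reports: instruction references localize per core, but a thread's data now follows it across cores.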
41 L1 Cache Performance
- Significant reduction in instruction misses
- Data cache performance deteriorates
  - Migration of computation disrupts locality of data reference
42 Data Memory Hierarchy
Co-operative caching
43 Conventional CMP Organizations
[Figure: private organization — each processor P with its own L1I/L1D and L2, connected by an interconnect network]
44 Conventional CMP Organizations
[Figure: shared organization — processors with private L1I/L1D sharing one L2 through the interconnect network]
45 A Spectrum of CMP Cache Designs
- Shared caches: unconstrained capacity sharing; best capacity
- Configurable/malleable caches (Liu et al. HPCA’04, Huh et al. ICS’05, etc.)
- Cooperative caching
- Hybrid schemes (CMP-NUCA, victim replication, CMP-NuRAPID, etc.)
- Private caches: no capacity sharing; best latency
46 CMP Cooperative Caching
- Basic idea: private caches cooperatively form an aggregate global cache
  - Use private caches for fast access / isolation
  - Share capacity through cooperation
  - Mitigate interference via cooperation throttling
- Inspired by cooperative file/web caches
[Figure: four processors whose private L1 and L2 caches cooperate as one aggregate cache]
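A toy model of the capacity-sharing part of the idea (invented interface; the real proposals add cooperative placement and replacement policies such as C-Clean and C-1Fwd): on a local miss, the sibling private caches are probed before an off-chip access is counted:

    #include <cstdio>
    #include <unordered_map>
    #include <vector>

    struct Chip {
        std::vector<std::unordered_map<long, int>> l2;  // one private L2 per core
        long off_chip = 0;
        explicit Chip(int cores) : l2(cores) {}
        int read(int core, long addr) {
            auto hit = l2[core].find(addr);
            if (hit != l2[core].end()) return hit->second;  // fast local hit
            for (auto& sibling : l2)                // cooperation: probe the
                if (auto h = sibling.find(addr);    // other private caches
                    h != sibling.end()) {
                    l2[core][addr] = h->second;     // on-chip transfer instead
                    return h->second;               // of a memory access
                }
            ++off_chip;                             // no on-chip copy exists
            return l2[core][addr] = 0;              // fetch from memory
        }
    };

    int main() {
        Chip chip(4);
        chip.read(0, 0x100);   // off-chip fetch into core 0's private L2
        chip.read(1, 0x100);   // served by core 0's cache, not by memory
        std::printf("off-chip accesses: %ld\n", chip.off_chip);  // prints 1
    }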
47 Reduction of Off-chip Accesses
- Trace-based simulation; 1MB L2 cache per core
- Shared (LRU): most reduction
- C-Clean: good for commercial workloads
- C-Singlet: good for all benchmarks
- C-1Fwd: good for heterogeneous workloads
[Figure: reduction of off-chip access rate; baseline private-cache MPKI per benchmark: 3.21, 11.63, 5.53, 1.67, 0.50, 5.97, 15.84, 5.28]
48 Cycles Spent on L1 Misses
- Trace-based simulation; no miss overlapping
- Latencies: 10 cycles local L2, 30 cycles next L2, 40 cycles remote L2, 300 cycles off-chip
[Figure: cycles spent on L1 misses for OLTP, Apache, ECperf, JBB, Barnes, Ocean, Mix1, and Mix2 under four schemes (P: private, C: C-Clean, F: C-1Fwd, S: shared)]
49 Summary
- Start of a new wave of innovation to rethink parallelism and parallel architecture
- New opportunities for innovation in CMPs
- New opportunities for parallelizing applications
- Expect little resemblance between MPs today and CMPs in 15 years
- We need to invent and define the differences