Lecture: SMT, Cache Hierarchies

Presentation transcript:

Lecture: SMT, Cache Hierarchies

Topics: memory dependence wrap-up, SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.1)

Problem 1

Consider the following LSQ and when operands are available. Estimate when the address calculation and memory accesses happen for each ld/st. Assume no memory dependence prediction. (Ad.Op / St.Op: cycle the address operand / store data is ready; Ad.Val: the address; Ad.Cal / Mem.Acc: cycle of address calculation / memory access.)

                    Ad.Op   St.Op   Ad.Val   Ad.Cal   Mem.Acc
LD  R1  ← [R2]        3              abcd
LD  R3  ← [R4]        6              adde
ST  R5  → [R6]        4       7      abba
LD  R7  ← [R8]        2              abce
ST  R9  → [R10]       8       3      abba
LD  R11 ← [R12]       1              abba

Problem 1 (answer)

Consider the following LSQ and when operands are available. Estimate when the address calculation and memory accesses happen for each ld/st. Assume no memory dependence prediction.

                    Ad.Op   St.Op   Ad.Val   Ad.Cal   Mem.Acc
LD  R1  ← [R2]        3              abcd      4        5
LD  R3  ← [R4]        6              adde      7        8
ST  R5  → [R6]        4       7      abba      5       commit
LD  R7  ← [R8]        2              abce      3        6
ST  R9  → [R10]       8       3      abba      9       commit
LD  R11 ← [R12]       1              abba      2        10

Problem 2

Consider the following LSQ and when operands are available. Estimate when the address calculation and memory accesses happen for each ld/st. Assume no memory dependence prediction.

                    Ad.Op   St.Op   Ad.Val   Ad.Cal   Mem.Acc
LD  R1  ← [R2]        3              abcd
LD  R3  ← [R4]        6              adde
ST  R5  → [R6]        5       7      abba
LD  R7  ← [R8]        2              abce
ST  R9  → [R10]       1       4      abba
LD  R11 ← [R12]       2              abba

Problem 2 (answer)

Consider the following LSQ and when operands are available. Estimate when the address calculation and memory accesses happen for each ld/st. Assume no memory dependence prediction.

                    Ad.Op   St.Op   Ad.Val   Ad.Cal   Mem.Acc
LD  R1  ← [R2]        3              abcd      4        5
LD  R3  ← [R4]        6              adde      7        8
ST  R5  → [R6]        5       7      abba      6       commit
LD  R7  ← [R8]        2              abce      3        7
ST  R9  → [R10]       1       4      abba      2       commit
LD  R11 ← [R12]       2              abba      3        5
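The no-prediction answers above follow a consistent set of timing rules. The sketch below is my reconstruction of those rules (not code from the lecture), assuming a 1-cycle address calculation and a 1-cycle cache access: a store accesses memory only at commit, and a load either forwards from the youngest earlier store to the same address once that store's data is ready, or waits until every earlier store address is known before accessing the cache.

def lsq_timing(entries):
    """Each entry: op ('LD'/'ST'), ad_op (cycle the address operand is ready),
    st_op (cycle the store data is ready, stores only), addr (address value)."""
    ad_cal = [e["ad_op"] + 1 for e in entries]          # 1-cycle address calculation
    result = []
    for i, e in enumerate(entries):
        if e["op"] == "ST":
            result.append((ad_cal[i], "commit"))
            continue
        earliest = ad_cal[i]                            # a load needs its own address first
        access = None
        for j in range(i - 1, -1, -1):                  # walk earlier stores, youngest first
            if entries[j]["op"] != "ST":
                continue
            earliest = max(earliest, ad_cal[j])         # must see this store's address
            if entries[j]["addr"] == e["addr"]:         # youngest matching store:
                access = max(earliest, entries[j]["st_op"]) + 1   # forward its data
                break
        if access is None:
            access = earliest + 1                       # no match: access the cache
        result.append((ad_cal[i], access))
    return result

# Problem 1: prints [(4, 5), (7, 8), (5, 'commit'), (3, 6), (9, 'commit'), (2, 10)]
problem1 = [
    {"op": "LD", "ad_op": 3, "addr": "abcd"},
    {"op": "LD", "ad_op": 6, "addr": "adde"},
    {"op": "ST", "ad_op": 4, "st_op": 7, "addr": "abba"},
    {"op": "LD", "ad_op": 2, "addr": "abce"},
    {"op": "ST", "ad_op": 8, "st_op": 3, "addr": "abba"},
    {"op": "LD", "ad_op": 1, "addr": "abba"},
]
print(lsq_timing(problem1))

With memory dependence prediction (Problem 3 below), a load instead issues one cycle after its own address calculation and is replayed later if an earlier store turns out to write the same address.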

Problem 3

Consider the following LSQ and when operands are available. Estimate when the address calculation and memory accesses happen for each ld/st. Assume memory dependence prediction.

                    Ad.Op   St.Op   Ad.Val   Ad.Cal   Mem.Acc
LD  R1  ← [R2]        3              abcd
LD  R3  ← [R4]        6              adde
ST  R5  → [R6]        4       7      abba
LD  R7  ← [R8]        2              abce
ST  R9  → [R10]       8       3      abba
LD  R11 ← [R12]       1              abba

Problem 3 (answer)

Consider the following LSQ and when operands are available. Estimate when the address calculation and memory accesses happen for each ld/st. Assume memory dependence prediction.

                    Ad.Op   St.Op   Ad.Val   Ad.Cal   Mem.Acc
LD  R1  ← [R2]        3              abcd      4        5
LD  R3  ← [R4]        6              adde      7        8
ST  R5  → [R6]        4       7      abba      5       commit
LD  R7  ← [R8]        2              abce      3        4
ST  R9  → [R10]       8       3      abba      9       commit
LD  R11 ← [R12]       1              abba      2       3/10

(With prediction, LD R7 no longer waits for the unresolved ST R5 and accesses memory at cycle 4; LD R11 accesses speculatively at cycle 3 and is redone at cycle 10 once ST R9 is found to write the same address, abba.)

Thread-Level Parallelism

Motivation:
- a single thread leaves a processor under-utilized for most of the time
- by doubling processor area, single-thread performance barely improves

Strategies for thread-level parallelism:
- multiple threads share the same large processor → reduces under-utilization, efficient resource allocation: Simultaneous Multi-Threading (SMT)
- each thread executes on its own mini processor → simple design, low interference between threads: Chip Multi-Processing (CMP) or multi-core

How are Resources Shared?

[Figure: issue-slot occupancy over consecutive cycles for a superscalar core, fine-grained multithreading, and simultaneous multithreading; each box is an issue slot for a functional unit, filled by Thread 1-4 or left idle. Peak throughput is 4 IPC.]

- Superscalar processor has high under-utilization – not enough work every cycle, especially when there is a cache miss
- Fine-grained multithreading can only issue instructions from a single thread in a cycle – cannot find max work every cycle, but cache misses can be tolerated
- Simultaneous multithreading can issue instructions from any thread every cycle – has the highest probability of finding work for every issue slot

What Resources are Shared?

- Multiple threads are simultaneously active (in other words, a new thread can start without a context switch)
- For correctness, each thread needs its own PC, IFQ, logical regs (and its own mappings from logical to phys regs)
- For performance, each thread could have its own ROB/LSQ (so that a stall in one thread does not stall commit in other threads), I-cache, branch predictor, D-cache, etc. (for low interference), although note that more sharing → better utilization of resources
- Each additional thread costs a PC, IFQ, rename tables, and ROB – cheap!
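To make the private-versus-shared split concrete, here is a small illustrative sketch (the structure and field names are my own, not from the lecture or any real core): the state that must be replicated per thread versus the structures all threads share.

from dataclasses import dataclass, field

@dataclass
class ThreadContext:                 # replicated per thread
    pc: int = 0
    ifq: list = field(default_factory=list)            # instruction fetch queue
    rename_map: dict = field(default_factory=dict)     # logical -> physical register
    rob: list = field(default_factory=list)            # per-thread ROB avoids cross-thread commit stalls

@dataclass
class SMTCore:                       # shared by all threads
    num_threads: int = 2
    phys_regs: list = field(default_factory=lambda: [0] * 256)   # one shared physical register file
    issue_queue: list = field(default_factory=list)              # shared issue queue
    def __post_init__(self):
        self.threads = [ThreadContext() for _ in range(self.num_threads)]

core = SMTCore(num_threads=4)        # adding a thread only adds a ThreadContext - cheap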

Pipeline Structure

[Figure: SMT pipeline structure. The front end – I-Cache, branch predictor, fetch, rename, ROB – can be private per thread or shared; the execution engine – issue queue, register file, functional units, D-Cache – is shared.]

Resource Sharing

Thread-1 (logical code and its renamed version):
  R1 ← R1 + R2        P65 ← P1 + P2
  R3 ← R1 + R4        P66 ← P65 + P4
  R5 ← R1 + R3        P67 ← P65 + P66

Thread-2 (logical code and its renamed version):
  R2 ← R1 + R2        P76 ← P33 + P34
  R5 ← R1 + R2        P77 ← P33 + P76
  R3 ← R5 + R3        P78 ← P77 + P35

[Figure: each thread goes through its own Instr Fetch and Instr Rename; the renamed instructions from both threads enter a shared Issue Queue, read a shared Register File, and execute on shared FUs.]
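A minimal renaming sketch (assumed free-list management, not the lecture's exact mechanism) showing why the two threads above end up in disjoint physical registers: each thread has its own map table, but all mappings draw from one shared physical register file.

class Renamer:
    def __init__(self, num_threads, num_phys_regs):
        self.free_list = list(range(1, num_phys_regs))          # shared pool of physical registers
        self.map_table = [dict() for _ in range(num_threads)]   # one rename table per thread

    def rename(self, tid, dst, srcs):
        table = self.map_table[tid]
        # sources read their current mapping (first use allocates one in this sketch)
        phys_srcs = [table.setdefault(s, self.free_list.pop(0)) for s in srcs]
        table[dst] = self.free_list.pop(0)                      # destination gets a fresh register
        return table[dst], phys_srcs

r = Renamer(num_threads=2, num_phys_regs=128)
print(r.rename(0, "R1", ["R1", "R2"]))   # thread 0: R1 <- R1 + R2
print(r.rename(1, "R2", ["R1", "R2"]))   # thread 1: R2 <- R1 + R2, lands in different physical registers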

Performance Implications of SMT

- Single-thread performance is likely to go down (caches, branch predictors, registers, etc. are shared) – this effect can be mitigated by trying to prioritize one thread
- While fetching instructions, thread priority can dramatically influence total throughput – a widely accepted heuristic (ICOUNT): fetch such that each thread has an equal share of processor resources
- With eight threads in a processor with many resources, SMT yields throughput improvements of roughly 2-4
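A sketch of the ICOUNT idea (my paraphrase of the heuristic, not the paper's implementation): each cycle, fetch from the thread with the fewest instructions currently occupying the front end and issue queue, which keeps the threads' shares of the pipeline roughly equal.

def icount_pick(inflight, stalled):
    """inflight[t]: instructions of thread t in decode/rename/issue queue;
    stalled[t]: True if thread t cannot fetch this cycle (e.g., I-cache miss)."""
    candidates = [t for t in range(len(inflight)) if not stalled[t]]
    if not candidates:
        return None
    return min(candidates, key=lambda t: inflight[t])

print(icount_pick([12, 3, 7, 9], [False, False, True, False]))   # fetch from thread 1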

Pentium4 Hyper-Threading

- Two threads – the Linux operating system operates as if it is executing on a two-processor system
- When there is only one available thread, it behaves like a regular single-threaded superscalar processor
- Statically divided resources: ROB, LSQ, issue queue – a slow thread will not cripple throughput (might not scale)
- Dynamically shared: trace cache and decode (fine-grained multi-threaded, round-robin), FUs, data cache, branch predictor

Multi-Programmed Speedup

- sixtrack and eon do not degrade their partners (small working sets?)
- swim and art degrade their partners (cache contention?)
- Best combination: swim & sixtrack; worst combination: swim & art
- Static partitioning ensures low interference – worst slowdown is 0.9

The Cache Hierarchy

[Figure: the cache hierarchy – Core, L1, L2, L3, off-chip memory.]

Problem 1

Memory access time: Assume a program that has cache access times of 1-cyc (L1), 10-cyc (L2), 30-cyc (L3), and 300-cyc (memory), and MPKIs of 20 (L1), 10 (L2), and 5 (L3). Should you get rid of the L3?

Problem 1 (answer)

Memory access time: Assume a program that has cache access times of 1-cyc (L1), 10-cyc (L2), 30-cyc (L3), and 300-cyc (memory), and MPKIs of 20 (L1), 10 (L2), and 5 (L3). Should you get rid of the L3?

Per 1000 instructions (each instruction pays the 1-cycle L1 access, and each level's MPKI pays the latency of the next level):
With L3:    1000 + 10×20 + 30×10 + 300×5 = 3000 cycles
Without L3: 1000 + 10×20 + 300×10        = 4200 cycles
So keep the L3.
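The same arithmetic as a short script (a sketch assuming one L1 access per instruction, so all figures are cycles per 1000 instructions):

def memory_cycles(l1_latency, misses_and_penalties, instructions=1000):
    """misses_and_penalties: (MPKI seen at a level, latency of the next level down) pairs."""
    cycles = instructions * l1_latency          # every access pays the L1 latency
    for mpki, penalty in misses_and_penalties:
        cycles += mpki * penalty
    return cycles

print(memory_cycles(1, [(20, 10), (10, 30), (5, 300)]))   # with the L3:    3000
print(memory_cycles(1, [(20, 10), (10, 300)]))            # without the L3: 4200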

Accessing the Cache

[Figure: a direct-mapped data array of eight 8-byte words. The byte address 101000 is split into a byte offset within the 8-byte word and a 3-bit index (8 words: 3 index bits) that selects one set of the data array.]

Direct-mapped cache: each address maps to a unique location in the cache.

The Tag Array

[Figure: the same direct-mapped cache with a tag array alongside the data array; for byte address 101000, the tag bits of the address are compared against the tag stored in the indexed entry to detect a hit.]

Direct-mapped cache: each address maps to a unique location in the cache.

Increasing Line Size

[Figure: byte address 10100000 split into tag and offset for a 32-byte cache line (block) size; tag array and data array.]

A large cache line size → smaller tag array, fewer misses because of spatial locality

Associativity

[Figure: a 2-way set-associative cache; byte address 10100000 indexes into Way-1 and Way-2 of the tag and data arrays, and both stored tags are compared against the address tag.]

Set associativity → fewer conflicts; wasted power because multiple data and tags are read
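The address fields in the last few figures can be computed mechanically; the helper below is a generic sketch (not from the slides) that splits a byte address into tag, index, and offset for a given power-of-two cache geometry.

def split_address(addr, cache_bytes, block_bytes, ways):
    num_sets = cache_bytes // (block_bytes * ways)
    offset_bits = (block_bytes - 1).bit_length()          # log2(block size)
    index_bits = (num_sets - 1).bit_length()              # log2(number of sets)
    offset = addr & (block_bytes - 1)
    index = (addr >> offset_bits) & (num_sets - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset

# The "Accessing the Cache" example: eight 8-byte words, direct-mapped, address 101000
print(split_address(0b101000, cache_bytes=64, block_bytes=8, ways=1))   # (tag, index, offset) = (0, 5, 0)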

Problem 2

Assume a direct-mapped cache with just 4 sets. Assume that block A maps to set 0, B to 1, C to 2, D to 3, E to 0, and so on. For the following access pattern, estimate the hits and misses:

A B B E C C A D B F A E G C G A

Problem 2 (answer)

Assume a direct-mapped cache with just 4 sets. Assume that block A maps to set 0, B to 1, C to 2, D to 3, E to 0, and so on. For the following access pattern, estimate the hits and misses:

Accesses: A B B E C C A D B F A E G C G A
Result:   M M H M M H M M H M H M M M M M

Problem 3

Assume a 2-way set-associative cache with just 2 sets. Assume that block A maps to set 0, B to 1, C to 0, D to 1, E to 0, and so on. For the following access pattern, estimate the hits and misses:

A B B E C C A D B F A E G C G A

Problem 3 (answer)

Assume a 2-way set-associative cache with just 2 sets. Assume that block A maps to set 0, B to 1, C to 0, D to 1, E to 0, and so on. For the following access pattern, estimate the hits and misses:

Accesses: A B B E C C A D B F A E G C G A
Result:   M M H M M H M M H M H M M M H M
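Both answers can be checked with a tiny set simulator; the sketch below is generic (not from the lecture) and assumes LRU replacement, which matches the results shown. The helper name set_of is my own: it maps each block to its set.

def simulate(trace, num_sets, ways, set_of):
    sets = [[] for _ in range(num_sets)]          # each set ordered LRU -> MRU
    out = []
    for block in trace:
        s = sets[set_of[block]]
        if block in s:
            s.remove(block)                       # hit: refresh to MRU position
            out.append("H")
        else:
            if len(s) == ways:
                s.pop(0)                          # evict the least recently used block
            out.append("M")
        s.append(block)
    return " ".join(out)

trace = "A B B E C C A D B F A E G C G A".split()
dm = {b: i % 4 for i, b in enumerate("ABCDEFG")}  # Problem 2: A->0, B->1, C->2, D->3, E->0, ...
sa = {b: i % 2 for i, b in enumerate("ABCDEFG")}  # Problem 3: A->0, B->1, C->0, D->1, E->0, ...
print(simulate(trace, num_sets=4, ways=1, set_of=dm))  # M M H M M H M M H M H M M M M M
print(simulate(trace, num_sets=2, ways=2, set_of=sa))  # M M H M M H M M H M H M M M H M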

Problem 4

64 KB 16-way set-associative data cache array with 64-byte line size, assume a 40-bit address

How many sets?
How many index bits, offset bits, tag bits?
How large is the tag array?

Problem 4 (answer)

64 KB 16-way set-associative data cache array with 64-byte line size, assume a 40-bit address

How many sets? 64 (64 KB / (64 B per line × 16 ways))
How many index bits (6), offset bits (6), tag bits (40 - 6 - 6 = 28)?
How large is the tag array? 1024 lines × 28 bits = 28 Kb
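The same numbers, worked out step by step (a quick check, not from the slides; sizes in bits unless noted):

cache_bytes, ways, block_bytes, addr_bits = 64 * 1024, 16, 64, 40
num_blocks  = cache_bytes // block_bytes              # 1024 lines in the cache
num_sets    = num_blocks // ways                      # 64 sets
offset_bits = 6                                       # log2(64-byte block)
index_bits  = 6                                       # log2(64 sets)
tag_bits    = addr_bits - index_bits - offset_bits    # 40 - 6 - 6 = 28 bits
tag_array   = num_blocks * tag_bits                   # 1024 x 28 = 28,672 bits = 28 Kb
print(num_sets, tag_bits, tag_array)                  # 64 28 28672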
