1 Lecture 11: SMT and Caching Basics
Today: SMT, cache access basics (Sections 3.5, 5.1)

2 Thread-Level Parallelism
Motivation:
- a single thread leaves a processor under-utilized for most of the time
- by doubling processor area, single thread performance barely improves
Strategies for thread-level parallelism:
- multiple threads share the same large processor → reduces under-utilization, efficient resource allocation: Simultaneous Multi-Threading (SMT)
- each thread executes on its own mini processor → simple design, low interference between threads: Chip Multi-Processing (CMP)

3 How are Resources Shared?
[Figure: issue slots over successive cycles for a superscalar, a fine-grained multithreaded, and a simultaneous multithreaded processor; each box is an issue slot for a functional unit, filled by Thread 1-4 or left idle. Peak throughput is 4 IPC.]
- A superscalar processor has high under-utilization – not enough work every cycle, especially when there is a cache miss
- Fine-grained multithreading can only issue instructions from a single thread in a cycle – it cannot find maximum work every cycle, but cache misses can be tolerated
- Simultaneous multithreading can issue instructions from any thread every cycle – it has the highest probability of finding work for every issue slot

4 What Resources are Shared?
- Multiple threads are simultaneously active (in other words, a new thread can start without a context switch)
- For correctness, each thread needs its own PC, its own logical regs (and its own mapping from logical to phys regs)
- For performance, each thread could have its own ROB (so that a stall in one thread does not stall commit in other threads), I-cache, branch predictor, D-cache, etc. (for low interference), although note that more sharing → better utilization of resources
- Each additional thread costs a PC, rename table, and ROB – cheap!
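A minimal sketch (not from the lecture; all names and sizes are illustrative assumptions) of which state an SMT core replicates per thread and which it shares:

/* Per-thread vs. shared SMT state; sizes and names are assumptions. */
#include <stdint.h>

#define NUM_THREADS   4
#define NUM_LOGICAL  32
#define NUM_PHYSICAL 160

typedef struct {
    uint64_t pc;                        /* per-thread program counter        */
    uint8_t  rename_map[NUM_LOGICAL];   /* logical -> physical register map  */
    /* a per-thread ROB (or ROB partition) keeps one thread's stall from
     * blocking commit in the others                                         */
} ThreadContext;

typedef struct {
    ThreadContext thread[NUM_THREADS];  /* cheap to replicate                */
    uint64_t phys_regs[NUM_PHYSICAL];   /* shared physical register file     */
    /* issue queue, FUs, caches, branch predictor are shared:
     * more sharing -> better utilization, but more interference             */
} SmtCore;

int main(void) { SmtCore core = {0}; (void)core; return 0; }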

5 Pipeline Structure
[Figure: multiple front ends (I-Cache, Bpred, Rename, ROB) feed one execution engine (Regs, IQ, FUs, DCache); the front-end structures may be private or shared per thread, while the execution engine is shared.]
What about the RAS, LSQ?

6 Resource Sharing
Thread-1 (logical)        Thread-1 (after rename)
R1 ← R1 + R2              P73 ← P1 + P2
R3 ← R1 + R4              P74 ← P73 + P4
R5 ← R1 + R3              P75 ← P73 + P74

Thread-2 (logical)        Thread-2 (after rename)
R2 ← R1 + R2              P76 ← P33 + P34
R5 ← R1 + R2              P77 ← P33 + P76
R3 ← R5 + R3              P78 ← P77 + P35

[Figure: both threads flow through instruction fetch and instruction rename; their renamed instructions sit together in the shared issue queue, read the shared register file, and issue to the shared FUs.]
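The renaming above can be reproduced with a small sketch; the helper names, the per-thread starting maps (P1... for thread-1, P33... for thread-2), and the use of a simple counter instead of a real free list are illustrative assumptions:

/* Each thread has its own logical->physical map, so both can use "R1"
 * without conflict, and the renamed instructions can share one issue
 * queue, register file, and set of FUs. */
#include <stdio.h>

#define NUM_LOGICAL 32

typedef struct { int map[NUM_LOGICAL]; } RenameTable;

static int next_free = 73;              /* next free physical reg (assumed) */

/* Rename "dst <- src1 + src2" for one thread and print the result. */
static void rename_add(RenameTable *rt, int dst, int src1, int src2) {
    int p1 = rt->map[src1], p2 = rt->map[src2];
    int pd = next_free++;               /* allocate a fresh physical register */
    rt->map[dst] = pd;
    printf("P%d <- P%d + P%d\n", pd, p1, p2);
}

int main(void) {
    RenameTable t1, t2;
    for (int i = 0; i < NUM_LOGICAL; i++) { t1.map[i] = i; t2.map[i] = 32 + i; }

    rename_add(&t1, 1, 1, 2);   /* thread-1: R1 <- R1 + R2  =>  P73 <- P1 + P2   */
    rename_add(&t1, 3, 1, 4);   /* thread-1: R3 <- R1 + R4  =>  P74 <- P73 + P4  */
    rename_add(&t1, 5, 1, 3);   /* thread-1: R5 <- R1 + R3  =>  P75 <- P73 + P74 */
    rename_add(&t2, 2, 1, 2);   /* thread-2: R2 <- R1 + R2  =>  P76 <- P33 + P34 */
    rename_add(&t2, 5, 1, 2);   /* thread-2: R5 <- R1 + R2  =>  P77 <- P33 + P76 */
    rename_add(&t2, 3, 5, 3);   /* thread-2: R3 <- R5 + R3  =>  P78 <- P77 + P35 */
    return 0;
}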

7 Performance Implications of SMT
- Single thread performance is likely to go down (caches, branch predictors, registers, etc. are shared) – this effect can be mitigated by trying to prioritize one thread
- While fetching instructions, thread priority can dramatically influence total throughput – a widely accepted heuristic (ICOUNT): fetch such that each thread has an equal share of processor resources
- With eight threads in a processor with many resources, SMT yields throughput improvements of roughly 2-4x
- The Alpha 21464 and Intel Pentium 4 are examples of SMT processors
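A minimal sketch of the ICOUNT idea, assuming a simple per-thread counter of instructions in the pre-issue stages (the interface is illustrative, not the original paper's code):

/* Each cycle, fetch from the thread with the fewest instructions in the
 * decode/rename/issue-queue stages, so no thread hogs front-end resources. */
#define NUM_THREADS 4

/* icount[t] = instructions thread t currently has in the pre-issue pipeline */
static int pick_fetch_thread(const int icount[NUM_THREADS]) {
    int best = 0;
    for (int t = 1; t < NUM_THREADS; t++)
        if (icount[t] < icount[best])
            best = t;
    return best;            /* least-represented thread gets fetch priority */
}

int main(void) {
    int icount[NUM_THREADS] = {5, 2, 7, 4};
    return pick_fetch_thread(icount);   /* thread 1 has the fewest in flight */
}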

8 Pentium 4 Hyper-Threading
- Two threads – the Linux operating system operates as if it is executing on a two-processor system
- When there is only one available thread, it behaves like a regular single-threaded superscalar processor
- Statically divided resources: ROB, LSQ, issue queue – a slow thread will not cripple throughput (might not scale)
- Dynamically shared: trace cache and decode (fine-grained multi-threaded, round-robin), FUs, data cache, bpred

9 Multi-Programmed Speedup
- sixtrack and eon do not degrade their partners (small working sets?)
- swim and art degrade their partners (cache contention?)
- Best combination: swim & sixtrack; worst combination: swim & art
- Static partitioning ensures low interference – worst slowdown is 0.9

10 Memory Hierarchy
As you go further, capacity and latency increase:
- Registers: 1 KB, 1 cycle
- L1 data or instruction cache: 32 KB, 2 cycles
- L2 cache: 2 MB, 15 cycles
- Memory: 1 GB, 300 cycles
- Disk: 80 GB, 10 M cycles

11 Accessing the Cache
[Figure: the byte address is split into an offset within the block and an index that selects one of the sets (rows) of the data array.]
- 8-byte words
- Direct-mapped cache: each address maps to a unique location in the cache
- 8 words: 3 index bits

12 The Tag Array
[Figure: the byte address is split into offset, index, and tag; the index selects a row of both the tag array and the data array, and the stored tag is compared against the address tag to detect a hit.]
- 8-byte words
- Direct-mapped cache: each address maps to a unique location in the cache
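A minimal direct-mapped lookup sketch under the slide's parameters (8 sets of 8-byte words; the 32-bit address width is an added assumption):

#include <stdint.h>
#include <stdbool.h>

#define OFFSET_BITS 3          /* 8-byte block */
#define INDEX_BITS  3          /* 8 sets       */
#define NUM_SETS    (1 << INDEX_BITS)

typedef struct {
    bool     valid;
    uint32_t tag;
    uint8_t  data[1 << OFFSET_BITS];
} Line;

static Line cache[NUM_SETS];

/* Returns true on a hit: the index picks one line, the tag confirms it. */
static bool lookup(uint32_t addr) {
    uint32_t index = (addr >> OFFSET_BITS) & (NUM_SETS - 1);
    uint32_t tag   =  addr >> (OFFSET_BITS + INDEX_BITS);
    return cache[index].valid && cache[index].tag == tag;
}

int main(void) { return lookup(0x40) ? 1 : 0; }  /* cold cache: a miss */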

13 Increasing Line Size
[Figure: with a larger block, more offset bits select a byte within the line and the index selects a wider row of the data array.]
- 32-byte cache line size or block size
- A large cache line size → smaller tag array, fewer misses because of spatial locality

14 Associativity
[Figure: a 2-way set-associative cache; the index selects one set, the tags and data of Way-1 and Way-2 are both read, and two comparators check them against the address tag.]
- Set associativity → fewer conflicts; wasted power because multiple data and tags are read
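A minimal set-associative lookup sketch (2 ways as in the figure; the set count, line size, and address width are illustrative assumptions):

#include <stdint.h>
#include <stdbool.h>

#define OFFSET_BITS 5                       /* 32-byte lines         */
#define INDEX_BITS  3                       /* 8 sets (illustrative) */
#define NUM_SETS    (1 << INDEX_BITS)
#define NUM_WAYS    2

typedef struct { bool valid; uint32_t tag; } TagEntry;

static TagEntry tags[NUM_SETS][NUM_WAYS];

/* Returns the hitting way, or -1 on a miss. */
static int lookup(uint32_t addr) {
    uint32_t index = (addr >> OFFSET_BITS) & (NUM_SETS - 1);
    uint32_t tag   =  addr >> (OFFSET_BITS + INDEX_BITS);
    for (int w = 0; w < NUM_WAYS; w++)      /* hardware compares ways in parallel */
        if (tags[index][w].valid && tags[index][w].tag == tag)
            return w;
    return -1;
}

int main(void) { return lookup(0x100) < 0 ? 0 : 1; }  /* cold cache: a miss */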

15 Example
32 KB 4-way set-associative data cache array with 32-byte line size
- How many sets?
- How many index bits, offset bits, tag bits?
- How large is the tag array?
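A worked answer sketch; the slide only poses the questions, and the 32-bit address width assumed here is not stated on the slide:

#include <stdio.h>

int main(void) {
    int cache_bytes = 32 * 1024, line_bytes = 32, ways = 4, addr_bits = 32;

    int lines  = cache_bytes / line_bytes;        /* 1024 lines           */
    int sets   = lines / ways;                    /*  256 sets            */
    int offset = 5;                               /* log2(32-byte line)   */
    int index  = 8;                               /* log2(256 sets)       */
    int tag    = addr_bits - index - offset;      /* 32 - 8 - 5 = 19 bits */

    /* Tag array: one tag per line -> 1024 x 19 bits ~ 19 Kbits ~ 2.4 KB
     * (valid/dirty/LRU bits would add a little more).                    */
    printf("sets=%d offset=%d index=%d tag=%d bits\n", sets, offset, index, tag);
    printf("tag array = %d bits\n", lines * tag);
    return 0;
}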

16 Cache Misses
- On a write miss, you may either choose to bring the block into the cache (write-allocate) or not (write-no-allocate)
- On a read miss, you always bring the block in (spatial and temporal locality) – but which block do you replace?
  - no choice for a direct-mapped cache
  - randomly pick one of the ways to replace
  - replace the way that was least-recently used (LRU)
  - FIFO replacement (round-robin)
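A minimal LRU victim-selection sketch for one set, assuming per-way timestamps (real caches typically use cheaper pseudo-LRU approximations):

#include <stdint.h>

#define NUM_WAYS 4

typedef struct { uint64_t last_used[NUM_WAYS]; } SetLru;

/* Call on every hit/fill so the way's timestamp stays current. */
static void lru_touch(SetLru *s, int way, uint64_t now) {
    s->last_used[way] = now;
}

/* On a miss, evict the way that was least recently used. */
static int lru_victim(const SetLru *s) {
    int victim = 0;
    for (int w = 1; w < NUM_WAYS; w++)
        if (s->last_used[w] < s->last_used[victim])
            victim = w;
    return victim;
}

int main(void) {
    SetLru set = {{10, 3, 7, 5}};
    lru_touch(&set, 1, 20);
    return lru_victim(&set);   /* way 3 (timestamp 5) is now the LRU victim */
}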

17 Writes
When you write into a block, do you also update the copy in L2?
- write-through: every write to L1 → write to L2
- write-back: mark the block as dirty; when the block gets replaced from L1, write it to L2
Write-back coalesces multiple writes to an L1 block into one L2 write
Write-through simplifies coherency protocols in a multiprocessor system as the L2 always has a current copy of the data
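A minimal sketch contrasting the two policies on an L1 store hit; write_to_l2 is an assumed stand-in for the L2 interface:

#include <stdint.h>
#include <stdbool.h>

typedef struct { bool valid, dirty; uint32_t tag; uint8_t data[32]; } Line;

/* Stub for the L2 write port (assumed). */
static void write_to_l2(const Line *line) { (void)line; }

/* Write-through: L2 is updated on every store, so evictions are clean. */
static void store_write_through(Line *line, int offset, uint8_t value) {
    line->data[offset] = value;
    write_to_l2(line);
}

/* Write-back: only mark the line dirty; L2 sees one write at eviction time. */
static void store_write_back(Line *line, int offset, uint8_t value) {
    line->data[offset] = value;
    line->dirty = true;
}

static void evict(Line *line) {
    if (line->dirty) write_to_l2(line);     /* coalesced write-back */
    line->valid = false;
    line->dirty = false;
}

int main(void) {
    Line line = { .valid = true };
    store_write_through(&line, 0, 42);  /* L2 updated on this store            */
    store_write_back(&line, 1, 43);     /* only the dirty bit is set           */
    store_write_back(&line, 2, 44);     /* second store coalesces into the line */
    evict(&line);                       /* one L2 write covers both stores      */
    return 0;
}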
