CMPT 886: Computer Architecture Primer


1 CMPT 886: Computer Architecture Primer
Dr. Alexandra Fedorova, School of Computing Science, SFU

2 Outline
Caches
Branch prediction
Out-of-order execution
Instruction-level parallelism

3 Caches
Level 1 / Level 2 / Level 3
Instruction, data, or unified

4 Direct-Mapped Cache
Line size = 32 bytes
Each address maps to exactly one cache line, so two addresses with the same index bits conflict: loading one evicts the other (cache eviction; see the sketch below)
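As a concrete illustration of the mapping, here is a minimal C sketch of how an address splits into tag, index, and offset, assuming a 32 KB direct-mapped cache (the slide gives only the 32-byte line size; the total size is an assumption):

    #include <stdint.h>

    #define LINE_SIZE 32        /* from the slide */
    #define NUM_LINES 1024      /* assumed: 32 KB / 32-byte lines */

    /* With 32-byte lines, the low 5 bits of an address select a byte
     * within the line, the next 10 bits select one of 1024 lines, and
     * the remaining bits form the tag. Two addresses with equal index
     * bits compete for the same line: loading one evicts the other. */
    void decompose(uint64_t addr,
                   uint64_t *tag, uint64_t *index, uint64_t *offset) {
        *offset = addr % LINE_SIZE;
        *index  = (addr / LINE_SIZE) % NUM_LINES;
        *tag    = addr / ((uint64_t)LINE_SIZE * NUM_LINES);
    }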

5 Set-Associative Cache
In a 4-way set-associative cache, the data can go into any of the four lines of its set
When the entire set is full, which line should we replace? LRU – least recently used, tracked with an LRU stack (see the sketch below)
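A minimal sketch of the LRU stack for one 4-way set; the array representation is illustrative, and real hardware uses compact approximations:

    #define WAYS 4

    /* stack[0] holds the most recently used way, stack[WAYS-1] the
     * least recently used one. On every access, move the touched way
     * to the front; on a miss, replace the way at the back. */
    void touch(int stack[WAYS], int way) {
        int pos = 0;
        while (stack[pos] != way)       /* find the way's position */
            pos++;
        for (; pos > 0; pos--)          /* shift the others down */
            stack[pos] = stack[pos - 1];
        stack[0] = way;                 /* now most recently used */
    }

    int victim(const int stack[WAYS]) {
        return stack[WAYS - 1];         /* least recently used way */
    }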

6 Cache Hit/Miss
Cache hit – the data is found in the cache
Cache miss – the data is not in the cache
Miss rate can be measured as misses per instruction, misses per cycle, or misses per access (the last is also called the miss ratio)
Hit rate is the complement: hit ratio = 1 − miss ratio (e.g., 50 misses in 1000 accesses means a 5% miss ratio and a 95% hit ratio)

7 Cache Miss Latency
How long you have to wait if you miss in the cache
Miss in L1 → pay the L2 latency (~20 cycles)
Miss in L2 → pay the memory latency (~300 cycles, if there is no L3)
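These numbers support a back-of-the-envelope average memory access time (AMAT) estimate; the 2-cycle L1 hit time and the miss rates below are illustrative assumptions, not from the slides:

    AMAT = L1 hit time + L1 miss rate × (L2 latency + L2 miss rate × memory latency)
         = 2 + 0.05 × (20 + 0.10 × 300)
         = 2 + 0.05 × 50
         = 4.5 cycles per access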

8 Writing in Cache
Write through – every write goes directly to memory
Write back – write to memory later, when the line is evicted (see the sketch below)
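A minimal C sketch of the write-back store path, assuming a direct-mapped cache with a dirty bit per line; the names and sizes are illustrative:

    #include <stdint.h>
    #include <string.h>

    #define LINE_SIZE 32
    #define NUM_LINES 1024

    struct cache_line {
        uint64_t tag;
        int      valid;
        int      dirty;             /* set on store, checked on eviction */
        uint8_t  data[LINE_SIZE];
    };

    static struct cache_line cache[NUM_LINES];

    /* Stand-ins for the DRAM interface. */
    static void mem_write_line(uint64_t addr, const uint8_t *d) { (void)addr; (void)d; }
    static void mem_read_line(uint64_t addr, uint8_t *d) { (void)addr; memset(d, 0, LINE_SIZE); }

    void store_byte(uint64_t addr, uint8_t value) {
        uint64_t index = (addr / LINE_SIZE) % NUM_LINES;
        uint64_t tag   = addr / ((uint64_t)LINE_SIZE * NUM_LINES);
        struct cache_line *line = &cache[index];

        if (!line->valid || line->tag != tag) {         /* store miss */
            if (line->valid && line->dirty)             /* write back the victim */
                mem_write_line((line->tag * NUM_LINES + index) * LINE_SIZE,
                               line->data);
            mem_read_line(addr - addr % LINE_SIZE, line->data);
            line->tag = tag;
            line->valid = 1;
            line->dirty = 0;
        }
        line->data[addr % LINE_SIZE] = value;  /* write hits only the cache */
        line->dirty = 1;                       /* memory is stale until eviction */
    }

A write-through cache would instead call mem_write_line on every store and would not need the dirty bit.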

9 Caches on Multiprocessor Systems
[Figure: several processors, each with its own cache, sharing memory over a bus. © Herlihy-Shavit 2007]

10 Processor Issues Load Request
[Figure: a processor issues a load, the request goes over the bus, and memory supplies the data to its cache. © Herlihy-Shavit 2007]

11 Another Processor Issues Load Request
[Figure: a second processor requests the same data over the bus ("I want data" / "I got data"); two caches now hold copies. © Herlihy-Shavit 2007]

12 Processor Modifies Data
[Figure: a processor modifies its cached copy; now the other copies are invalid. © Herlihy-Shavit 2007]

13 Send Invalidation Message to Others
Other caches lose read permission
No need to update memory yet: the cache holding the modified line can provide valid data
[Figure: the writer broadcasts "Invalidate!" over the bus. © Herlihy-Shavit 2007]

14 Processor Asks for Data
[Figure: another processor requests the data ("I want data"); the cache with the valid copy supplies it over the bus. © Herlihy-Shavit 2007]
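Slides 9–14 walk through an invalidation-based coherence protocol. The deck does not name the protocol; here is a minimal per-line state machine assuming MSI-style states, with the slide each transition corresponds to noted in the comments:

    enum state { INVALID, SHARED, MODIFIED };
    enum event { LOCAL_READ, LOCAL_WRITE, REMOTE_READ, REMOTE_WRITE };

    enum state next_state(enum state s, enum event e) {
        switch (s) {
        case INVALID:
            if (e == LOCAL_READ)   return SHARED;    /* slides 10-11: load fills the cache */
            if (e == LOCAL_WRITE)  return MODIFIED;  /* gain write permission */
            return INVALID;
        case SHARED:
            if (e == LOCAL_WRITE)  return MODIFIED;  /* slide 13: invalidate other copies */
            if (e == REMOTE_WRITE) return INVALID;   /* slide 12: our copy becomes invalid */
            return SHARED;
        case MODIFIED:
            if (e == REMOTE_READ)  return SHARED;    /* slide 14: we supply the data */
            if (e == REMOTE_WRITE) return INVALID;
            return MODIFIED;
        }
        return INVALID;
    }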

15 Shared Caches
Filled on demand
No control over cache shares
An aggressive thread can grab a large cache share and hurt others (see the sketch below)
[Figure: a shared cache in which Thread 1 occupies almost all lines and Thread 2 only a few.]
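As an illustration of the aggressive-thread point, compare a co-runner with a small, reused working set against one that streams through a buffer far larger than the shared cache; the sizes and stride are illustrative:

    /* Cache-friendly: a small working set with high reuse. */
    long polite(const int *small, int n, int iters) {
        long sum = 0;
        for (int it = 0; it < iters; it++)
            for (int i = 0; i < n; i++)    /* n * 4 bytes fits in cache */
                sum += small[i];
        return sum;
    }

    /* Cache-hostile: touches each 32-byte line of a huge buffer once,
     * constantly allocating fresh lines and evicting the co-runner's. */
    long aggressive(const int *big, long nbytes) {
        long sum = 0;
        for (long i = 0; i < nbytes / 4; i += 8)   /* one int per line */
            sum += big[i];
        return sum;
    }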

16 Outline
Caches
Branch prediction
Out-of-order execution
Instruction-level parallelism

17 Branching and CPU Pipeline

18 Branching Hurts Pipelining

19 Branch Prediction
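Slides 17–19 are figures in the original deck. As a concrete illustration of what a branch predictor buys, the classic experiment below runs the same loop over unsorted and then sorted data: the branch outcome is near-random in the first case and perfectly predictable in the second, so the sorted run is typically much faster on real hardware (the array size and threshold are illustrative):

    #include <stdio.h>
    #include <stdlib.h>

    #define N (1 << 20)

    /* The branch `data[i] >= 128` is taken essentially at random on
     * unsorted data (frequent mispredictions) but becomes long runs of
     * not-taken followed by long runs of taken once the data is sorted. */
    long sum_above(const int *data, int n) {
        long sum = 0;
        for (int i = 0; i < n; i++)
            if (data[i] >= 128)
                sum += data[i];
        return sum;
    }

    static int cmp_int(const void *a, const void *b) {
        return *(const int *)a - *(const int *)b;
    }

    int main(void) {
        int *data = malloc(N * sizeof *data);
        for (int i = 0; i < N; i++)
            data[i] = rand() % 256;

        long u = sum_above(data, N);          /* many mispredictions */
        qsort(data, N, sizeof *data, cmp_int);
        long s = sum_above(data, N);          /* few mispredictions */

        printf("%ld %ld\n", u, s);            /* same result, different speed */
        free(data);
        return 0;
    }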

20 Outline
Caches
Branch prediction
Out-of-order execution
Instruction-level parallelism

21 Out-of-order Execution
Modern CPUs are superscalar: they can issue more than one instruction per clock cycle
If consecutive instructions depend on each other, instruction-level parallelism is limited
To keep the processor going at full speed, issue instructions out of order (see the sketch below)
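A sketch of the dependence point: both loops below perform the same additions, but the first forms one long dependent chain while the second exposes four independent chains that a superscalar, out-of-order core can keep in flight at once (the unroll factor is illustrative):

    /* One accumulator: every add depends on the previous add's result,
     * so the adds complete one at a time regardless of issue width. */
    double sum_dependent(const double *a, int n) {
        double s = 0.0;
        for (int i = 0; i < n; i++)
            s += a[i];
        return s;
    }

    /* Four independent accumulators: up to four adds can be in flight. */
    double sum_independent(const double *a, int n) {
        double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        int i;
        for (i = 0; i + 4 <= n; i += 4) {
            s0 += a[i];
            s1 += a[i + 1];
            s2 += a[i + 2];
            s3 += a[i + 3];
        }
        for (; i < n; i++)      /* leftover elements */
            s0 += a[i];
        return (s0 + s1) + (s2 + s3);
    }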

22 Speculative Execution
Out-of-order execution by itself is limited to a basic block (the straight-line code between branches)
To reorder instructions across branches, use speculative execution: predict the branch, execute past it, and discard the results if the prediction was wrong (see the sketch below)
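A small example of what speculation buys, with the hardware behavior described in comments; the scenario is illustrative:

    /* The load of table[i] sits in a different basic block from the
     * bounds check, so pure out-of-order execution could not start it
     * until the branch resolved. A speculative CPU predicts the branch,
     * issues the load early, and squashes it on a misprediction. */
    int lookup(const int *table, int size, int i) {
        if (i < size)          /* predicted, say, taken */
            return table[i];   /* load can issue before the compare resolves */
        return -1;
    }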

23 Outline
Caches
Branch prediction
Out-of-order execution
Instruction-level parallelism

24 Instruction-Level Parallelism
Many programs fail to keep the processor busy:
Code with lots of loads
Code with frequent, unpredictable branches
CPU cycles are wasted: power is consumed, but no useful work is done
Running multiple hardware threads on the chip helps hide these stalls (see the sketch below)
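A sketch of the "lots of loads" case: pointer chasing serializes its loads, since each address depends on the previous load's result, so the core stalls for a full miss latency on every hop; a second hardware thread could run during those idle cycles (the data structure is illustrative):

    #include <stddef.h>

    struct node {
        struct node *next;
        long payload;
    };

    /* Only one of these loads can be in flight at a time. If each hop
     * misses in cache (~300 cycles, per slide 7), the pipeline sits
     * idle for most of the traversal. */
    long chase(const struct node *p) {
        long sum = 0;
        while (p != NULL) {
            sum += p->payload;
            p = p->next;       /* serialized, potentially missing load */
        }
        return sum;
    }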

