Introduction to computer architecture (April 7th)
Access to main memory
–E.g. 1: individual (strided) memory accesses:

    for (j = 0; j < m; j++)
        for (i = 0; i < n; i++)
            // some computation (e.g. on a[i][j])

–E.g. 2: consecutive memory accesses:

    for (i = 0; i < n; i++)
        for (j = 0; j < m; j++)
            // some computation (e.g. on a[i][j])

Accessing consecutive rather than random memory addresses is faster: with a row-major array, the second loop order walks through contiguous addresses, as the sketch below illustrates.
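A self-contained C sketch of the two loop orders above. The matrix size and the use of clock() are illustrative assumptions; actual timings depend on the cache hierarchy and on compiler optimizations.

    #include <stdio.h>
    #include <time.h>

    #define N 2048
    #define M 2048

    int main(void)
    {
        static int a[N][M];              /* row-major storage, zero-initialized */
        long sum = 0;
        clock_t t;

        t = clock();
        for (int j = 0; j < M; j++)      /* E.g. 1: column order ->          */
            for (int i = 0; i < N; i++)  /* jumps M ints between accesses    */
                sum += a[i][j];
        printf("column order: %.3f s\n", (double)(clock() - t) / CLOCKS_PER_SEC);

        t = clock();
        for (int i = 0; i < N; i++)      /* E.g. 2: row order ->             */
            for (int j = 0; j < M; j++)  /* consecutive addresses            */
                sum += a[i][j];
        printf("row order:    %.3f s\n", (double)(clock() - t) / CLOCKS_PER_SEC);

        return (int)(sum & 1);           /* keep sum live against the optimizer */
    }

On typical hardware the row-order pass runs several times faster, even though both loops do the same arithmetic.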
Access to main memory
Locality of references! During the execution of a program, memory references tend to cluster in specific memory regions:
–Loops (temporal locality, i.e. "we are likely to need this data again in the near future")
–Sequential instruction execution (spatial locality, i.e. "other data close by will probably be needed soon")
Goal: memory with cost as low as the cheapest level and speed as fast as the fastest level. Take advantage of the locality principle! If an access to main memory transfers ONLY the data at that address, I can't take advantage of possible (and probable) contiguous memory accesses.
[Figure: memory hierarchy pyramid, fastest and most expensive at the top: registers, SRAM (cache), DRAM (main memory), magnetic disk, CD-ROM/DVD, tapes]
Goal: take advantage of the locality principle. If an access to main memory transfers the data at that address PLUS contiguous data, I can take advantage of contiguous accesses:
–Transfer a whole memory block.
–But where do I store it?
Cache memory
A small, fast memory between the CPU and main memory.
[Figure: CPU <-> cache (word transfers) <-> main memory (block transfers)]
Cache operation
1. The CPU requests access to a given address in main memory.
2. The cache checks whether it already holds the data from that address:
–Yes (cache hit): 3.A.1 Send the data to the CPU from the cache (fast).
–No (cache miss): 3.B.1 Initiate the transfer of the main memory block containing that address into the cache... 3.B.2 ...then send the requested data to the CPU from the cache.
The cache has tags to identify which main memory blocks reside in which cache lines.
[Figure: CPU executing lw $t0 ($a0) / li $t1 4; arrows 1, 2, 3.A.1, 3.B.1, 3.B.2 between CPU, cache and main memory]
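A minimal C sketch of steps 1 through 3 for a hypothetical direct-mapped cache. LINESIZE, NLINES and the simulated main_memory array are illustrative assumptions, not values from the slides.

    #include <stdint.h>
    #include <string.h>

    #define LINESIZE 64                 /* bytes per block (assumed) */
    #define NLINES   1024               /* cache lines (assumed)     */
    #define MEMSIZE  (1u << 20)         /* 1 MB of simulated memory  */

    struct line {
        int      valid;                 /* does this line hold a block?  */
        uint32_t tag;                   /* which memory block it caches  */
        uint8_t  data[LINESIZE];
    };

    static struct line cache[NLINES];
    static uint8_t main_memory[MEMSIZE];

    /* Step 1: the CPU requests the byte at `addr`. */
    uint8_t read_byte(uint32_t addr)
    {
        uint32_t offset = addr % LINESIZE;   /* byte within the block  */
        uint32_t block  = addr / LINESIZE;   /* memory block number    */
        uint32_t index  = block % NLINES;    /* cache line it maps to  */
        uint32_t tag    = block / NLINES;    /* identifies the block   */
        struct line *l  = &cache[index];

        /* Step 2: tag and valid-bit check. */
        if (!l->valid || l->tag != tag) {
            /* Step 3.B: miss -- fetch the whole block, then serve. */
            memcpy(l->data, &main_memory[block * LINESIZE], LINESIZE);
            l->tag   = tag;
            l->valid = 1;
        }
        /* Step 3.A: hit (or freshly filled line) -- serve from cache. */
        return l->data[offset];
    }

    int main(void)
    {
        main_memory[123] = 42;
        return read_byte(123) == 42 ? 0 : 1;  /* first access takes the miss path */
    }

Note that a miss transfers the whole block, so later reads of neighboring addresses hit in the cache: this is exactly how the locality principle pays off.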
Cache performance
Average memory access time Tm, where Pca is the probability of a cache hit (the requested data is already in the cache), Tca the cache access time and Tmp the main memory access time:

    Tm = Pca * Tca + (1-Pca) * (Tmp + Tca) ~ Pca * Tca + (1-Pca) * Tmp

A hit costs Tca; a miss costs the cache check plus the main memory transfer, Tmp + Tca, which is approximately Tmp since Tca << Tmp.
Example
1. Tca: cache access time -> 10 ns
2. Tmp: main memory access time -> 120 ns
   –memory latency determines the time to retrieve the first word
   –memory bandwidth determines the time to retrieve the rest of the block
3. Pca: probability that the requested data is already in the cache -> 10%, 20%, ..., 80%, 90%, 100%
With X the hit ratio Pca:

    Tm = X * 10 + (1-X) * (120 + 10) ~ X * 10 + (1-X) * 120

For instance, at X = 90% this gives 0.9*10 + 0.1*130 = 22 ns (approx. 21 ns); at X = 10% it gives 0.1*10 + 0.9*130 = 118 ns (approx. 109 ns). The cache only pays off with a high hit ratio.
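A small C program, as a sketch, tabulating the example for all the hit ratios listed above; the constants are the slide's 10 ns and 120 ns.

    #include <stdio.h>

    /* Tabulates the average access time Tm for hit ratios from
     * 10% to 100%, using Tca = 10 ns and Tmp = 120 ns. */
    int main(void)
    {
        const double tca = 10.0, tmp = 120.0;
        for (int p = 10; p <= 100; p += 10) {
            double x = p / 100.0;
            double exact  = x * tca + (1 - x) * (tmp + tca);
            double approx = x * tca + (1 - x) * tmp;
            printf("Pca = %3d%%  Tm = %6.1f ns  (approx. %6.1f ns)\n",
                   p, exact, approx);
        }
        return 0;
    }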
RAM
Dynamic RAM (DRAM)
–Stores each bit of data in a separate capacitor.
–Capacitors leak! Needs periodic refresh.
–+: one transistor and one capacitor per bit -> simple, high density
–-: refresh and precharge circuitry; slower (2%-3% of cycle time spent on refresh); more error prone
–Used for main memory.
Static RAM (SRAM)
–Stores each bit of data in a bistable latching circuit: 4-6 transistors per bit (cell). Does not need refresh.
–+: no refresh -> faster, lower power
–-: more complex, less dense (more space for the same capacity), more expensive
–Used for cache memory.
Cache levels
Three levels are usual:
–L1: internal cache, closest to the CPU. Small (8 KB - 128 KB), fast.
–L2: external cache, between L1 and L3 (or between L1 and main memory). Medium size (256 KB - 4 MB), slower than L1.
–L3: right before main memory. Larger and slower than L2.
Example: Intel Itanium 2
Example: AMD quad-core
Example: Intel Core i7
Cache structure and design
Divide main memory and the cache into equally sized blocks; each main memory block has a corresponding line in the cache.
[Figure: a 64 KB cache holds 2^14 4-byte lines; a 16 MB main memory holds 2^22 4-byte blocks]
Must decide:
–Cache/block size
–Block placement
–Block identification
–Block replacement
–Write strategy
It is usual to have different designs for L1/L2/L3.
Cache and block size
Chosen based on benchmark analysis (of the most-used program classes).
Block placement
What cache address corresponds to a given main memory address?
–Direct mapped: each block has only ONE place it can appear in the cache: line = block_addr MOD nr_lines_in_cache
–Fully associative: a block can be placed anywhere in the cache.
–Set associative: a block has a restricted set of places in the cache: select the set with block_addr MOD nr_sets_in_cache, then place the block anywhere within the n-line set (n-way associative).
Usually caches are direct mapped, 2-way or 4-way set associative.
Block placement
[Figure: where can main memory block 12 go in an 8-line cache? Direct mapped: only line 12 mod 8 = 4. 2-way set associative (4 sets): anywhere in set 12 mod 4 = 0, i.e. lines 0-1. Fully associative: any of lines 0-7.]
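A minimal C sketch of the three placement rules applied to the example above. The 8-line geometry comes from the figure; the function name and the convention that set s occupies consecutive lines are made up for illustration.

    #include <stdio.h>

    #define NLINES 8   /* cache size in lines, as in the figure above */

    /* Prints the lines where a given memory block may be placed;
     * `ways` is the set associativity (1 = direct mapped). */
    static void placement(unsigned block, unsigned ways)
    {
        unsigned nsets = NLINES / ways;
        unsigned set = block % nsets;

        printf("block %u, %u-way: lines %u..%u\n",
               block, ways, set * ways, set * ways + ways - 1);
    }

    int main(void)
    {
        placement(12, 1);       /* direct mapped: 12 mod 8 = 4 -> line 4 */
        placement(12, 2);       /* 2-way: set 12 mod 4 = 0 -> lines 0-1  */
        placement(12, NLINES);  /* fully associative: one set -> lines 0-7 */
        return 0;
    }

Direct mapped and fully associative are thus the two extremes of set associativity: 1-way and NLINES-way.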
Block identification
Caches have a tag on each block frame that records the main memory block address, plus a valid bit on the tag (does the frame hold a valid address or not?).
Block replacement
On a cache miss, if all candidate lines are full, which block do I evict?
–Direct mapped: there is only one candidate line; replace it.
–Associative:
Random
LRU (least recently used): replace the block that has gone unused the longest.
FIFO: approximates LRU by evicting the oldest block -> simpler to implement.
…
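A minimal sketch of LRU bookkeeping in C, using a per-line timestamp. The counter-based approach and all names are illustrative assumptions; real caches use cheaper approximations such as pseudo-LRU bits.

    #include <stdint.h>

    #define WAYS 4

    struct line {
        int      valid;
        uint32_t tag;
        uint64_t last_used;   /* timestamp of the most recent access */
    };

    static uint64_t now;      /* global access counter */

    /* Returns the index of the line to evict within one set:
     * an invalid (free) line if any, otherwise the least
     * recently used one. */
    int pick_victim(struct line set[WAYS])
    {
        int victim = 0;
        for (int i = 0; i < WAYS; i++) {
            if (!set[i].valid)
                return i;                  /* free line: no eviction needed */
            if (set[i].last_used < set[victim].last_used)
                victim = i;                /* older than the current victim */
        }
        return victim;
    }

    /* On every access (hit or fill), refresh the line's timestamp. */
    void touch(struct line *l) { l->last_used = ++now; }

FIFO would instead record the fill time only and never update it on hits, which is why it merely approximates LRU.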