1
Computer Architecture Key Points
John Morris, Electrical & Computer Engineering / Computer Science, The University of Auckland
(Photo: Iolanthe II drifts off Waiheke Island)
2
Memory Bottleneck
* State-of-the-art processor: f = 3 GHz, t_clock = 330 ps, 1-2 instructions per cycle, ~25% memory references
* Memory response needed: ~4 instructions × 330 ps ≈ 1.3 ns!
* Bulk semiconductor RAM: 100 ns+ for a 'random' access!
* The processor will spend most of its time waiting for memory!
3
Memory Bottleneck
Assume:
* Clock speed f = 3 GHz, cycle time t_cyc = 1/f = 330 ps
* 32-bit = 4-byte machine word, so internal bandwidth = (bytes per word) × f = 4 × f = 12 GB/s
* 64-bit PCI bus, f_bus = 32 MHz
(Arrow width in the diagram roughly indicates data bandwidth)
The bus is clearly a bottleneck! It would need to carry 3 GB/s for 25% load/store instructions!
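As a quick check of these numbers, here is a minimal C sketch that reproduces the slide's bandwidth arithmetic; the 256 MB/s PCI figure is derived from the slide's bus width and clock, not stated on it.

```c
/* Back-of-envelope bandwidth check for the figures on this slide.
 * All inputs come from the slide; nothing is measured. */
#include <stdio.h>

int main(void) {
    double f        = 3e9;    /* core clock, Hz                 */
    double word     = 4;      /* bytes per 32-bit word          */
    double mem_frac = 0.25;   /* fraction of load/store ops     */

    double internal_bw = word * f;               /* 12 GB/s */
    double needed_bw   = internal_bw * mem_frac; /*  3 GB/s */
    double pci_bw      = 8 * 32e6;               /* 64-bit PCI at 32 MHz: 256 MB/s */

    printf("internal : %.1f GB/s\n", internal_bw / 1e9);
    printf("needed   : %.1f GB/s\n", needed_bw   / 1e9);
    printf("PCI bus  : %.3f GB/s\n", pci_bw      / 1e9);
    return 0;
}
```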
4
Cache
* Small, fast memory: typically ~50 kbytes (1998), 2-cycle access time, same die as the processor
* "Off-chip" cache possible: a custom cache chip closely coupled to the processor, using fast static RAM (SRAM) rather than slower dynamic RAM
* Several levels possible
* The 2nd level of the memory hierarchy
* "Caches" the most recently used memory locations "closer" to the processor (closer = closer in time)
5
Memory Bottleneck
Assume:
* Clock speed f = 3 GHz, cycle time t_cyc = 1/f = 330 ps
* 32-bit = 4-byte machine word, so internal bandwidth = (bytes per word) × f = 4 × f = 12 GB/s
* 64-bit PCI bus, f_bus = 32 MHz
(Arrow width in the diagram roughly indicates data bandwidth)
A cache provides a small, very fast memory 'close' to the processor
'Close' = close in time, ie with a high-bandwidth connection
6
Memory Bottleneck
Same assumptions as before: f = 3 GHz, t_cyc = 330 ps, 4-byte word, internal bandwidth 12 GB/s, 64-bit PCI bus at f_bus = 32 MHz
(Diagram: the hierarchy runs from small, fast memory near the processor, through larger, slower memory, to very large, very slow storage; arrow width roughly indicates data bandwidth)
7
Memory hierarchy & performance
The usual metric is machine cycle time, t_cyc = 1/f
Visible to the programmer:
* Registers: < 1 cycle latency (respond in the same cycle)
Transparent to the programmer:
* Level 1 (L1) cache: 2-cycle latency
* L2 cache: 5-6 cycles
* L3 cache: about 10 cycles
* Main memory: 100+ cycles for a random access
* Disc: > 1 ms, or > 10^6 cycles
Effective memory access time: t_eff = Σ_i f_i t_i, where f_i = fraction of hits at level i and t_i = access time at level i
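A minimal sketch of the effective-access-time formula; the hit fractions and latencies below are illustrative values, not taken from the slide.

```c
/* Effective access time t_eff = sum_i f_i * t_i.
 * Hit fractions and latencies are illustrative assumptions. */
#include <stdio.h>

int main(void) {
    /* fraction of accesses satisfied at each level, and latency in cycles */
    double f[] = {0.90, 0.06, 0.03, 0.01};   /* L1, L2, L3, main memory */
    double t[] = {2.0,  6.0,  10.0, 100.0};  /* cycles                  */

    double t_eff = 0.0;
    for (int i = 0; i < 4; i++)
        t_eff += f[i] * t[i];

    printf("t_eff = %.2f cycles\n", t_eff);  /* ~3.46 cycles with these values */
    return 0;
}
```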
8
Cache - organisation
Direct-mapped cache: each word in the cache has a tag
Assume:
* cache size: 2^k words
* machine words: p bits
* byte-addressed memory
* m = log2(p/8) bits are not used to address words (m = 2 for 32-bit machines)
Address format (p bits): tag (p-k-m bits) | cache address (k bits) | byte address (m bits)
9
Cache - organisation
Direct-mapped cache
(Diagram: the memory address is split into tag (p-k-m bits), cache address (k bits) and byte address (m bits); the k-bit cache address selects one of 2^k cache lines, each holding a tag (p-k-m bits) and a p-bit data word; the stored tag is compared with the address tag to produce "Hit?", and on a hit the data word goes to the CPU, otherwise it is fetched from memory)
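A small C sketch of the address split described above; the cache geometry (k = 10 index bits, m = 2 byte-offset bits) is an illustrative assumption, not a value from the slide.

```c
/* Splitting a byte address into (tag, cache index, byte offset) for a
 * direct-mapped cache, following the slide's p / k / m notation. */
#include <stdint.h>
#include <stdio.h>

#define M 2     /* byte-offset bits: log2(4 bytes per 32-bit word) */
#define K 10    /* index bits: 2^10 = 1024 cache lines (assumed)   */

int main(void) {
    uint32_t addr = 0x12345678;

    uint32_t byte_off = addr & ((1u << M) - 1);
    uint32_t index    = (addr >> M) & ((1u << K) - 1);
    uint32_t tag      = addr >> (M + K);

    printf("addr=0x%08x  tag=0x%x  index=%u  byte=%u\n",
           addr, tag, index, byte_off);
    return 0;
}
```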
10
Cache - Conflicts
Conflicts: two addresses separated by 2^(k+m) bytes (or any multiple of it) will hit the same cache location
(Address format, p bits: tag (p-k-m bits) | cache address (k bits) | byte address (m bits) - addresses in which these k bits are the same map to the same cache line)
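A short sketch showing why two addresses 2^(k+m) bytes apart collide, reusing the same illustrative geometry as the earlier example.

```c
/* Do two byte addresses map to the same direct-mapped cache line?
 * K and M are the same illustrative sizes as in the previous sketch. */
#include <stdint.h>
#include <stdio.h>

#define M 2
#define K 10

static uint32_t line_of(uint32_t addr) {
    return (addr >> M) & ((1u << K) - 1);
}

int main(void) {
    uint32_t a = 0x00010040;
    uint32_t b = a + (1u << (K + M));   /* 2^(k+m) bytes further on */

    printf("line(a)=%u line(b)=%u -> %s\n",
           line_of(a), line_of(b),
           line_of(a) == line_of(b) ? "conflict" : "no conflict");
    return 0;
}
```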
11
Cache - Conflicts
When a word is modified in cache:
* Write-back cache
  - Only writes data back when needed
  - Misses cost two memory accesses: write the modified word back, then read the new word
* Write-through cache
  - A low-priority write to main memory is queued; the processor is delayed by the read only
  - The memory write occurs in parallel with other work; instruction and necessary data fetches take priority
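A toy model of the two write paths described here (a sketch, not the slide's hardware); queue_memory_write is a hypothetical stand-in for the low-priority write queue.

```c
/* Contrast of write-back and write-through store paths.
 * 'dirty' is only meaningful for the write-back policy. */
#include <stdbool.h>

typedef struct {
    unsigned tag;
    unsigned data;
    bool     valid;
    bool     dirty;
} cache_line_t;

/* stand-in for the low-priority write queue described on the slide */
static void queue_memory_write(unsigned addr, unsigned value) {
    (void)addr; (void)value;
}

/* write-back: update the line only; memory is written when the line is evicted */
void write_back_store(cache_line_t *line, unsigned tag, unsigned value) {
    line->tag   = tag;
    line->data  = value;
    line->valid = true;
    line->dirty = true;                 /* memory copy is now stale */
}

/* write-through: update the line and queue the memory write immediately */
void write_through_store(cache_line_t *line, unsigned tag, unsigned value) {
    line->tag   = tag;
    line->data  = value;
    line->valid = true;                 /* never dirty: memory is kept up to date */
    queue_memory_write(tag, value);
}
```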
12
Cache - Write-through or write-back?
Write-through seems a good idea! but...
* Multiple writes to the same location waste memory bus bandwidth
* Typical programs do better with write-back caches
However, often you can easily predict which will be best
* Some processors (eg PowerPC) allow you to classify memory regions as write-back or write-through
13
Cache - more bits
Cache lines need some status bits in addition to the tag bits:
* Valid (V): all set to false on power-up, set to true as words are loaded into the cache
* Dirty (M): needed by a write-back cache; a write-through cache always queues the write, so lines are never 'dirty'
Cache line format: Tag (p-k-m bits) | V (1 bit) | M (1 bit) | Data (p bits)
14
Cache – Improving Performance
Conflicts (addresses 2^(k+m) bytes apart):
* Degrade cache performance - lower hit rate
* Murphy's Law operates: addresses are never random!
* Some locations 'thrash' in the cache - continually replaced and restored
Put another way: ideal cache performance depends on uniform access to all parts of memory, which never happens in real programs!
15
Cache - Fully Associative
* All tags are compared at the same time
* Words can use any cache line
16
Cache - Fully Associative
* Associative: each tag is compared at the same time; any match → hit
* Avoids 'unnecessary' flushing
* Replacement: Least Recently Used (LRU)
  - Needs extra status bits: cycles since last accessed
* Hardware cost is high
  - Extra comparators
  - Wider tags: p-m bits vs p-k-m bits
17
Cache - Set Associative
Each line - two words → two comparators only
2-way set associative
18
Cache - Set Associative
n-way set associative caches
* n can be small: 2, 4, 8
* Best performance for reasonable hardware cost
* Used by most high-performance processors
Replacement policy
* LRU - choice from n
* Reasonable LRU approximation with 1 or 2 bits: set on access, cleared / decremented by a timer; choose a cleared word for replacement
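A sketch of an n-way set-associative lookup with the 1-bit "recently used" approximation mentioned above; the geometry (4 ways, 64 sets) is an assumption for illustration, not from the slide.

```c
/* Sketch of an n-way set-associative lookup. */
#include <stdbool.h>
#include <stdint.h>

#define NWAY  4            /* n ways (assumed)  */
#define SETS  64           /* 2^k sets (assumed) */
#define M     2            /* byte-offset bits   */

typedef struct {
    uint32_t tag;
    bool     valid;
    uint8_t  lru;          /* 1-2 bit LRU approximation from the slide */
} way_t;

static way_t cache[SETS][NWAY];

/* returns true on a hit; on a miss a real cache would pick a way whose
 * lru bit has been cleared by the timer and refill it */
bool lookup(uint32_t addr) {
    uint32_t set = (addr >> M) % SETS;
    uint32_t tag = addr >> M;          /* full tag kept for simplicity */

    for (int w = 0; w < NWAY; w++) {
        if (cache[set][w].valid && cache[set][w].tag == tag) {
            cache[set][w].lru = 1;     /* mark as recently used */
            return true;
        }
    }
    return false;
}
```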
19
Cache - Locality of Reference
Temporal locality: the same location will be referenced again soon
* Access the same data again
* Program loops - access the same instruction again
* The caches described so far exploit temporal locality
Spatial locality: nearby locations will be referenced soon
* The next element of an array
* The next instruction of a program
20
Cache - Line Length
Spatial locality → use very long cache lines: fetch one datum and its neighbours are fetched too
PowerPC 601 (Motorola/Apple/IBM), the first of the single-chip Power processors:
* 64 sets, 8-way set associative, 32 bytes per line
* 32 bytes (8 instructions) fetched into the instruction buffer in one cycle
* 64 × 8 × 32 = 16 kbytes total
21
Cache - Separate I- and D-caches
Unified cache: instructions and data in the same cache
Two caches - * Instructions * Data - increase the total bandwidth
MIPS R10000:
* 32 kbyte instruction cache, 32 kbyte data cache
* The instruction cache is pre-decoded! (32 → 36 bits)
* Data cache: 8-word (64-byte) line, 2-way set associative, 256 sets
* Replacement policy?
22
COMPSYS 304 Computer Architecture
Memory Management Units
(Photo: Reefed down - heading for Great Barrier Island)
23
Memory Management Unit
Virtual Address Space
* Each user has a "private" address space
(Diagram: User D's Address Space)
24
Virtual Addresses
Mappings between user space and physical memory are created by the OS
25
Memory Management Unit (MMU)
* Responsible for VIRTUAL → PHYSICAL address mapping
* Sits between CPU and cache
* The cache operates on physical addresses (mostly - there is some research on VA caches)
(Diagram: CPU → MMU → Cache → Main Memory; the CPU issues virtual addresses (VA), the MMU translates them to physical addresses (PA), and data or instructions (D or I) flow back)
26
MMU - operation
(Diagram: a virtual address is split into a (q-k)-bit virtual page number and a k-bit page offset; the page number indexes the page table, and the physical page address it yields is concatenated with the offset to form the physical address)
27
MMU - Virtual memory space
Page Table Entries can also point to disc blocks
* Valid bit set: the page is in memory and the address is a physical page address
* Valid bit cleared: the page is "swapped out" and the address is a disc block address
* The MMU hardware generates a page fault when a swapped-out page is requested
Allows the virtual memory space to be larger than physical memory
* Only the "working set" is in physical memory; the remainder is on the paging disc
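A minimal sketch of the translation and valid-bit test described above; the page size (4 kbytes, k = 12) and table size are illustrative assumptions, not values from the slides.

```c
/* VA -> PA translation with a valid bit, as described on this slide. */
#include <stdbool.h>
#include <stdint.h>

#define K          12                 /* offset bits: assumed 4 kbyte pages */
#define NUM_PAGES  1024               /* 2^(q-k) virtual pages (assumed)    */

typedef struct {
    bool     valid;                   /* set: in memory; clear: on disc     */
    uint32_t frame_or_block;          /* physical page OR disc block number */
} pte_t;

static pte_t page_table[NUM_PAGES];

/* returns true and fills *pa on a hit; false means a page fault -
 * the OS must fetch frame_or_block from the paging disc */
bool translate(uint32_t va, uint32_t *pa) {
    uint32_t vpn    = va >> K;
    uint32_t offset = va & ((1u << K) - 1);

    pte_t pte = page_table[vpn];
    if (!pte.valid)
        return false;                 /* page fault */

    *pa = (pte.frame_or_block << K) | offset;
    return true;
}
```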
28
Page Fault
(Diagram: the (q-k)-bit virtual page number selects a Page Table Entry whose valid bit is cleared, so the MMU raises a page fault and the entry's address is used as a disc block address)
29
MMU – Page faults
Very expensive!
* Gap in access times: main memory ~100+ ns, disc ~1+ ms - a factor of 10^4 slower!!
* May require write-back of the old (but modified) page
* May require reading of Page Table Entries from disc!
A good way to make a system thrash!
30
MMU – Access control
Provides additional protection to the programmer
* Pages can be marked Read-only or Execute-only
* Can prevent wayward programs from corrupting their own program code or vital data
Protection is in hardware!
* The MMU will raise an exception if an illegal access is attempted
* The OS traps the exception and processes it
31
MMU - Inverted page tables
A scheme which saves memory for page tables
* One PTE per page of physical memory
* A hash function is used - collisions are probable, so it is possibly slower
Sharing
* Map virtual pages from several users to the same physical page
* Good for sharing program code; data too (read/write control provided by the OS)
* Saves physical memory - reduces pressure on main memory
32
MMU - TLB
A cache for page table entries
* Enables the MMU to translate VA → PA in time!
* Can be quite small: 50-100 entries
* Often fully associative - the small size avoids one 'cost' of an FA cache, since only 50-100 comparators are needed
TLB coverage
* The amount of memory covered by the TLB entries
* The size of program for which VA → PA translation will be fast
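A one-line coverage calculation: coverage = number of TLB entries × page size. The 64-entry TLB and 4 kbyte page size are assumptions within the slide's 50-100 entry range.

```c
/* TLB coverage = entries * page size (illustrative values). */
#include <stdio.h>

int main(void) {
    unsigned entries   = 64;            /* within the slide's 50-100 range */
    unsigned page_size = 4 * 1024;      /* assumed 4 kbyte pages           */

    printf("TLB coverage = %u kbytes\n", entries * page_size / 1024);  /* 256 */
    return 0;
}
```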
33
Memory Hierarchy - Operation
34
System Interface Unit
Tasks
* Control the bus
  - Match cache line length to bus width
  - Follow the bus protocol: Request / Grant / Data cycles
* Manage 'burst' transactions
  - Burst transactions → greater bus efficiency: more 'work' (data cycles) per transaction
  - The overhead (request | grant | address) is a smaller fraction of the total bus cycles per transaction
* Maintain transaction queues
  - Read (high priority), Write (low priority)
  - Reads check the write queue for the latest copy of the data
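A quick calculation showing how longer bursts shrink the overhead fraction; the 3-cycle overhead and burst lengths are illustrative, not from the slide.

```c
/* Bus efficiency for a burst: data cycles / (overhead + data cycles). */
#include <stdio.h>

int main(void) {
    unsigned overhead = 3;                      /* request + grant + address (assumed) */
    for (unsigned burst = 1; burst <= 8; burst *= 2) {
        double eff = (double)burst / (overhead + burst);
        printf("burst of %u data cycles: efficiency %.0f%%\n", burst, eff * 100);
    }
    return 0;   /* 1 -> 25%, 2 -> 40%, 4 -> 57%, 8 -> 73% */
}
```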
35
System Interface Unit: Bus efficiency
Split-phase transactions
* Separate address and data buses
* Separate address and data phases
* Overlap → greater bus utilization: multiple transactions 'in flight' at any time
* Slow peripheral devices don't 'hog' the bus and prevent fast transactions (eg memory) from accessing the bus
(Diagram: overhead cycles vs 'work' cycles - the 2nd transaction starts before the 1st completes)
36
System Interface Unit: Bus efficiency
Single-purpose bus (graphics, memory)
* Simpler, faster
* Single direction (CPU → graphics buffer)
* Single device (eg memory) → simpler protocol (only one type of device)
* Point-to-point wiring: shorter, faster
* Single driver (no need for a delay when switching from read to write)
37
Superscalar Processors
Superpipelined
* Deep pipeline (>5 stages)
* Hazards and dependencies limit depth
* Each stage has overhead: registers needed → larger circuit, speed reduction
* >8 stages → decrease in efficiency
vs Superscalar → next slide
38
Superscalar Processors
Superscalar
* Multiple functional units: integer ALUs, FPUs, branch, load/store
  - Floating point typically has 3 internal stages
  - Usually several integer ALUs per FPU: addressing and loop calculations need an integer ALU
* The instruction issue unit is now more complex
  - It determines which instructions can be issued in each cycle: what data is ready? which functional units are free?
  - It typically tries to issue 4 instructions / cycle and achieves 2-3 instructions / cycle on average
* Out-of-order execution
  - Instructions are executed when their data is available
  - Dependent instructions may stall while later ones execute
* Number of functional units > instruction issue width (eg 6 FUs, max 4 instructions / cycle)
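A software-level illustration of the data dependencies the issue unit must respect; this is an added example under that assumption, not something taken from the slide.

```c
/* A dependency chain limits issue; independent operations can go to
 * separate functional units. */

/* every addition depends on the previous one - a chain that cannot be
 * issued in parallel, whatever the number of FPUs */
double sum_chain(const double *x, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* two independent accumulators - the two additions in each iteration have
 * no dependency on each other and can be issued in the same cycle */
double sum_split(const double *x, int n) {
    double s0 = 0.0, s1 = 0.0;
    for (int i = 0; i + 1 < n; i += 2) {
        s0 += x[i];
        s1 += x[i + 1];
    }
    if (n & 1) s0 += x[n - 1];   /* leftover element for odd n */
    return s0 + s1;
}
```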
39
Speculation
Data prefetch
* Try to get data into the cache well in advance → no stall for the memory read when the data is actually needed
* PowerPC: dcbt – data cache block touch; advice for the system – a low-priority read
* Pentium: prefetcht x (x = 0, 1, 2); the semantics vary between Pentium 3 and Pentium 4 (Pentium 4 fetches into the L2 cache only)
* The compiler can detect many patterns, eg sequential access of array elements:
  for( j=0; j<n; j++ ) sum = sum + x[j];
* The programmer can insert pre-fetch instructions
* Speculative because the data may not be needed
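A sketch of programmer-inserted prefetch for the loop above, using the GCC/Clang __builtin_prefetch built-in (not mentioned on the slide, but it typically compiles to an instruction such as dcbt on PowerPC or one of the prefetcht forms on x86); the look-ahead distance is a tuning guess.

```c
/* Programmer-inserted prefetch for the sequential-sum loop on the slide. */
#define PREFETCH_AHEAD 16   /* assumed look-ahead distance, in elements */

double sum_with_prefetch(const double *x, int n) {
    double sum = 0.0;
    for (int j = 0; j < n; j++) {
        if (j + PREFETCH_AHEAD < n)
            __builtin_prefetch(&x[j + PREFETCH_AHEAD], 0, 1); /* read, modest locality */
        sum = sum + x[j];
    }
    return sum;
}
```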
40
Speculation - branching
Branches are expensive
* They stall the pipeline, and fetching useless instructions wastes bandwidth!
* Couple the Branch unit with the Instruction Issue unit
Conditional branches: if ( cond ) s1 else s2
* Execute both s1 and s2 if functional units and data are available - use idle resources!
* Squash the results from the wrong branch when the value of cond is known
* MIPS allows 4 streams of speculative execution
* Pentium 4: up to 126 instructions 'in flight'?
  - From a web article by an obvious Intel fan - it starts with "The Pentium still kicks butt.", not a good flag for an objective article!
  - It probably counts instruction issue unit buffers + system interface transactions too!