Computer Systems Principles: Architecture
Emery Berger and Mark Corner
Department of Computer Science, University of Massachusetts Amherst
Architecture

Von Neumann and the "von Neumann architecture"

Fetch, Decode, Execute
The Memory Hierarchy
Registers
Caches
–Associativity
–Misses
"Locality"
(Figure: registers → L1 → L2 → RAM)
Registers
Register = a dedicated name for one word of memory managed by the CPU
–General-purpose: "AX", "BX", "CX" on x86
–Special-purpose: "SP" = stack pointer, "FP" = frame pointer, "PC" = program counter
Changing processes means saving the current registers and loading the saved registers of the next process = a context switch
(Figure: a stack frame with SP, FP, arg0, arg1, arg2)
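To see the save-registers/load-registers idea from user code, here is a minimal sketch assuming a POSIX system with the ucontext API (getcontext/makecontext/swapcontext; Linux/glibc has it). swapcontext saves the current registers into one context and restores another, which is the core of what a context switch does; a real OS does the same thing in the kernel, for whole processes.

```cpp
// Minimal user-level "context switch" sketch using POSIX ucontext.
#include <ucontext.h>
#include <cstdio>

static ucontext_t main_ctx, func_ctx;
static char func_stack[64 * 1024];      // stack for the second context

static void in_other_context() {
    std::puts("running with saved-and-restored registers");
    swapcontext(&func_ctx, &main_ctx);  // save ours, restore main's
}

int main() {
    getcontext(&func_ctx);              // initialize from current registers
    func_ctx.uc_stack.ss_sp = func_stack;
    func_ctx.uc_stack.ss_size = sizeof(func_stack);
    func_ctx.uc_link = &main_ctx;
    makecontext(&func_ctx, in_other_context, 0);
    swapcontext(&main_ctx, &func_ctx);  // save main's registers, load func's
    std::puts("back in main");
    return 0;
}
```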
Caches
Access to main memory is "expensive"
–~100 cycles (slow, but relatively cheap in $)
Caches: small, fast, expensive memory
–Hold recently accessed data (D$) or instructions (I$)
–Come in different sizes & locations:
 Level 1 (L1): on-chip, smallish
 Level 2 (L2): on or next to the chip, larger
 Level 3 (L3): pretty large, on the bus
–Managed in lines of memory (runs of adjacent bytes)
Memory Hierarchy
Higher = small, fast, more $, lower latency
Lower = large, slow, less $, higher latency

Level                     Latency
registers                 1 cycle
L1 (D$, I$ separate)      2 cycles
L2 (D$, I$ unified)       7 cycles
RAM                       100 cycles
Disk                      40,000,000 cycles
Network                   200,000,000+ cycles

Lines move up the hierarchy on a load and down on an evict.
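To make the latency gap concrete, here is a minimal sketch (a rough illustration, not a calibrated benchmark; the array size and stride are arbitrary choices) that sums the same array twice: sequentially, where each fetched cache line is fully reused, and with a large stride, where nearly every access touches a new line. The strided version typically runs several times slower.

```cpp
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    const std::size_t N = 1 << 24;        // 16M ints, far bigger than L2
    std::vector<int> a(N, 1);
    auto time_sum = [&](std::size_t stride) {
        auto t0 = std::chrono::steady_clock::now();
        long long sum = 0;
        for (std::size_t start = 0; start < stride; ++start)
            for (std::size_t i = start; i < N; i += stride)
                sum += a[i];              // same N adds either way
        auto t1 = std::chrono::steady_clock::now();
        std::printf("stride %4zu: sum=%lld, %lld ms\n", stride, sum,
            (long long)std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count());
    };
    time_sum(1);      // sequential: one miss per cache line
    time_sum(1024);   // strided: roughly one miss per access
    return 0;
}
```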
"Locality"
(Figure-only slides: "Level 0 Cache", "Level 1 Cache", "RAM", "Disk", "Book Hierarchy")
Orders of Magnitude (cycles of latency)
10^0        registers, L1
10^1        L2
10^2        RAM
10^3–10^6   (nothing: the gap between RAM and disk)
10^7        Disk
10^8–10^9   Network
Cache Jargon
A cache is initially cold
Accessing data initially misses
–Fetch from the lower level of the hierarchy
–Bring the line into the cache (populate the cache)
–Next access: a hit
Warmed up: the cache holds the most-frequently used data
–Context switch implications?
LRU: Least Recently Used
–Use the past as a predictor of the future
Cache Details
An ideal cache would be fully associative
–That is, one big LRU (least-recently-used) queue
–Generally too expensive
Instead, hash memory addresses into separate bins, each divided into ways
–1-way = direct-mapped
–2-way = 2 entries per bin
–4-way = 4 entries per bin, etc.
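The binning works by stripping the line-offset bits of the address and then taking the low bits of the line number as the bin (set) index; the rest is stored as a tag to identify the line. A minimal sketch (the constants are illustrative, not any particular CPU's):

```cpp
#include <cstdint>
#include <cstdio>

const uint64_t LINE_BYTES = 64;   // bytes per cache line
const uint64_t NUM_SETS   = 64;   // bins; associativity = lines per bin

int main() {
    uint64_t addr = 0x7ffe12345678;
    uint64_t line = addr / LINE_BYTES;   // which line of memory
    uint64_t set  = line % NUM_SETS;     // which bin it hashes to
    uint64_t tag  = line / NUM_SETS;     // stored to identify the line
    std::printf("addr %#llx -> set %llu, tag %#llx\n",
                (unsigned long long)addr, (unsigned long long)set,
                (unsigned long long)tag);
}
```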
Associativity Example
Hash memory addresses to different indices in the cache
Miss Classification
First access = compulsory miss
–Unavoidable without prefetching
Too many items mapping to one bin = conflict miss
–Avoidable if we had higher associativity
No space in the cache = capacity miss
–Avoidable if the cache were larger
Invalidated = coherence miss
–Avoidable if the cache were unshared
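A sketch of how a simulator can tell these categories apart, under one common set of definitions (compulsory = first-ever access; capacity = a fully associative LRU cache of the same total size would also miss; conflict = the rest). The geometry matches the activity that follows (8 slots, 2-way, hash(x) = x % 4); the trace values are hypothetical.

```cpp
#include <algorithm>
#include <cstdio>
#include <deque>
#include <set>
#include <vector>

const int SETS = 4, WAYS = 2;             // 8 slots, 2-way

int main() {
    std::vector<int> trace = {0, 4, 8, 0, 4, 8, 12, 0};  // hypothetical
    std::vector<std::deque<int>> bins(SETS);
    std::deque<int> full;                 // fully associative shadow cache
    std::set<int> seen;                   // everything ever accessed
    int compulsory = 0, conflict = 0, capacity = 0;
    for (int x : trace) {
        auto& bin = bins[x % SETS];
        auto it = std::find(bin.begin(), bin.end(), x);
        bool hit = (it != bin.end());
        // Maintain the shadow cache (same total size, SETS*WAYS lines).
        auto fit = std::find(full.begin(), full.end(), x);
        bool full_hit = (fit != full.end());
        if (full_hit) full.erase(fit);
        else if (full.size() == SETS * WAYS) full.pop_back();
        full.push_front(x);
        // Classify, then update the set-associative cache (LRU per bin).
        if (hit) bin.erase(it);
        else if (!seen.count(x)) ++compulsory;
        else if (full_hit) ++conflict;    // associativity would have saved it
        else ++capacity;
        if (!hit && bin.size() == WAYS) bin.pop_back();
        bin.push_front(x);
        seen.insert(x);
    }
    std::printf("compulsory %d conflict %d capacity %d\n",
                compulsory, conflict, capacity);
}
```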
Quick Activity
A cache with 8 slots, 2-way associativity
–Assume hash(x) = x % 4 (modulus)
How many misses?
–# compulsory misses?
–# conflict misses?
–# capacity misses?
Locality
Locality = re-use of recently used items
–Temporal locality: re-use in time
–Spatial locality: use of nearby items
 In the same cache line or the same page (4K chunk)
Intuitively, greater locality = fewer misses
–The # of misses depends on cache layout, # of levels, associativity…
–Machine-specific
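A minimal sketch of spatial locality in action: the two functions below do the same N² additions, but the row-by-row walk touches consecutive addresses (staying within each cache line), while the column-by-column walk jumps N*sizeof(int) bytes per step and typically misses far more often.

```cpp
#include <cstdio>

const int N = 2048;
static int m[N][N];

long long sum_rows() {                // good spatial locality
    long long s = 0;
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            s += m[i][j];             // adjacent in memory
    return s;
}

long long sum_cols() {                // poor spatial locality
    long long s = 0;
    for (int j = 0; j < N; ++j)
        for (int i = 0; i < N; ++i)
            s += m[i][j];             // N*sizeof(int) bytes apart
    return s;
}

int main() {
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            m[i][j] = 1;
    std::printf("%lld %lld\n", sum_rows(), sum_cols());
}
```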
Quantifying Locality
Instead of counting misses, compute the hit curve from an LRU histogram
–Assume a perfect LRU cache
–Ignore compulsory misses
–Start with the total misses on the right-hand side
–Subtract the histogram values
–Normalize
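A minimal sketch of that recipe on a hypothetical histogram. Here hist[d] counts accesses with LRU reuse distance d (the number of distinct lines touched since the last access to that line); a perfect LRU cache of size C hits exactly the accesses with d ≤ C.

```cpp
#include <cstdio>
#include <vector>

int main() {
    // Hypothetical histogram: reuse distance -> count.
    // Compulsory (first-ever) accesses are ignored, per the slide.
    std::vector<long long> hist = {0, 50, 30, 10, 5, 5};
    long long total = 0;
    for (long long c : hist) total += c;

    long long misses = total;            // a size-0 cache misses everything
    for (std::size_t c = 1; c < hist.size(); ++c) {
        misses -= hist[c];               // distances <= c now hit
        double hit_rate = (double)(total - misses) / total;  // normalize
        std::printf("cache size %zu: hit rate %.2f\n", c, hit_rate);
    }
}
```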
Hit Curve Exercise
Derive the hit curve for the following trace:
What can we do with this?
What would be the hit rate with a cache size of 4 or 9?
Simple Cache Simulator
The only argument is N, the length of the LRU queue
–Read addresses (ints) from cin
–Output hits & misses to cout
Handy std::deque operations:
–push_front(v) = put v on the front of the queue
–pop_back() = remove the back of the queue
–erase(i) = erase an element (given an iterator i)
–size() = number of elements
–for (deque<int>::iterator i = q.begin(); i != q.end(); ++i) cout << *i << endl;
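A runnable version of the simulator described above, as one straightforward sketch: a perfect LRU cache of N lines kept in a std::deque, newest at the front.

```cpp
#include <algorithm>
#include <cstdlib>
#include <deque>
#include <iostream>

int main(int argc, char* argv[]) {
    if (argc != 2) { std::cerr << "usage: sim N\n"; return 1; }
    std::size_t N = std::atoi(argv[1]);   // length of the LRU queue
    std::deque<int> q;
    long long hits = 0, misses = 0;
    int addr;
    while (std::cin >> addr) {
        auto it = std::find(q.begin(), q.end(), addr);
        if (it != q.end()) {              // hit: move to the front
            ++hits;
            q.erase(it);
        } else {                          // miss: evict the LRU line if full
            ++misses;
            if (q.size() == N) q.pop_back();
        }
        q.push_front(addr);
    }
    std::cout << "hits: " << hits << " misses: " << misses << std::endl;
    return 0;
}
```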
Important CPU Internals
Other issues that affect performance:
–Pipelining
–Branches & prediction
–System calls (kernel crossings)
Scalar Architecture
Straight-up sequential execution
–Fetch an instruction
–Decode it
–Execute it
Problem: an I-cache or D-cache miss
–Result: a stall; everything stops
–How long do we wait for the miss? A long time, compared to CPU speed
Superscalar Architectures
Out-of-order processors
–Keep a pipeline of instructions in flight
–Instead of stalling on a load, guess!
 Branch prediction
 Value prediction
–Predictors are based on history and on location in the program
–Speculatively execute instructions
 Actual results are checked asynchronously
 If mispredicted, squash the speculated instructions
Accurate prediction = massive speedup
–Hides the latency of the memory hierarchy
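A classic way to see branch prediction from user code: run the same data-dependent branch over shuffled and sorted copies of an array. On sorted data the branch becomes predictable (all not-taken, then all taken), so the loop typically runs much faster. This is a sketch, not a guarantee: an optimizing compiler may turn the branch into a conditional move and erase the effect.

```cpp
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <random>
#include <vector>

long long count_big(const std::vector<int>& v) {
    long long s = 0;
    for (int x : v)
        if (x >= 128) s += x;    // hard to predict on random data
    return s;
}

int main() {
    std::vector<int> v(1 << 22);
    std::mt19937 gen(1);
    for (int& x : v) x = gen() % 256;
    std::vector<int> sorted = v;
    std::sort(sorted.begin(), sorted.end());
    for (const auto* data : {&v, &sorted}) {
        auto t0 = std::chrono::steady_clock::now();
        long long s = count_big(*data);
        auto t1 = std::chrono::steady_clock::now();
        std::printf("sum %lld in %lld ms\n", s,
            (long long)std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count());
    }
}
```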
Pipelining and Branches
(Pipeline stages: instruction fetch, instruction decode, execute, memory access, write back)
Pipelining overlaps instructions to exploit parallelism, allowing the clock rate to be increased.
Branches cause bubbles in the pipeline: while a branch is unresolved, some stages are left idle.

Branch Prediction
A branch predictor allows the processor to speculatively fetch and execute instructions down the predicted path.
Kernel Mode
Protects the OS from users
–"kernel" = English for nucleus (think atom)
–Only privileged code executes in the kernel
A system call is expensive because it:
–Enters kernel mode
 Flushes the pipeline and saves the context (where we are in user land)
–Executes code "in kernel land"
–Returns to user mode, restoring the context
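A minimal, Linux-specific sketch (not a rigorous benchmark) of that cost: time a loop of direct getpid system calls against a loop of plain function calls. The syscall loop crosses into the kernel on every iteration; the function loop stays in user mode.

```cpp
#include <chrono>
#include <cstdio>
#include <sys/syscall.h>
#include <unistd.h>

__attribute__((noinline)) long plain_call(long x) { return x + 1; }

int main() {
    const int N = 1'000'000;
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < N; ++i)
        syscall(SYS_getpid);              // enters and leaves the kernel
    auto t1 = std::chrono::steady_clock::now();
    long acc = 0;
    for (int i = 0; i < N; ++i)
        acc = plain_call(acc);            // stays in user mode
    auto t2 = std::chrono::steady_clock::now();
    using ns = std::chrono::nanoseconds;
    std::printf("syscall: %lld ns/call, function: %lld ns/call (acc=%ld)\n",
        (long long)std::chrono::duration_cast<ns>(t1 - t0).count() / N,
        (long long)std::chrono::duration_cast<ns>(t2 - t1).count() / N, acc);
}
```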
Timers & Interrupts
The OS needs to respond to events periodically
–e.g., to change the executing process
Quantum: the time limit for a process's execution
Fairness: when the timer goes off, an interrupt fires
–The current process stops
–The OS takes control through the interrupt handler
–The scheduler chooses the next process
Interrupts also signal I/O events
–Network packet arrival, disk read completion, …
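A minimal user-level analogue of the periodic timer interrupt, assuming POSIX setitimer and SIGALRM (this is how many user-level thread schedulers fake a quantum; the kernel's real timer interrupt works analogously, in hardware).

```cpp
#include <csignal>
#include <cstdio>
#include <sys/time.h>
#include <unistd.h>

volatile sig_atomic_t ticks = 0;

void on_timer(int) { ++ticks; }           // our "interrupt handler"

int main() {
    std::signal(SIGALRM, on_timer);
    itimerval quantum{};
    quantum.it_interval.tv_usec = 10000;  // fire every 10 ms thereafter
    quantum.it_value.tv_usec = 10000;     // first firing in 10 ms
    setitimer(ITIMER_REAL, &quantum, nullptr);
    while (ticks < 10)                    // "run" until 10 quanta elapse
        pause();                          // sleep until the next signal
    std::printf("received %d timer interrupts\n", (int)ticks);
}
```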
The End