Slide 1: Computer Systems Principles - Architecture
Emery Berger and Mark Corner
University of Massachusetts Amherst, Department of Computer Science
Slide 2: Architecture
Slide 3: Von Neumann
Slide 4: "von Neumann architecture"
Slide 5: Fetch, Decode, Execute
Slide 6: The Memory Hierarchy
Registers
Caches
– Associativity
– Misses
"Locality"
(Diagram: registers, L1, L2, RAM)
Slides 7-8: Registers
(Diagram: a stack frame with SP, FP, and arguments)
Register = dedicated name for one word of memory managed by the CPU
– General-purpose: "AX", "BX", "CX" on x86
– Special-purpose:
  – "SP" = stack pointer
  – "FP" = frame pointer
  – "PC" = program counter
Change processes: save current registers & load saved registers = context switch (see the sketch below)
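A user-level illustration of that last point, assuming the POSIX ucontext API (getcontext, makecontext, swapcontext; deprecated but still available on Linux with glibc): swapcontext saves the calling context's registers, including SP and PC, and loads another saved set, which is exactly the save-and-restore step of a context switch. This is only a sketch, not the kernel's actual context-switch code.

    // Sketch: save this context's registers, load another's (the core of a context switch).
    #include <ucontext.h>
    #include <iostream>

    static ucontext_t main_ctx, work_ctx;
    static char work_stack[256 * 1024];      // the second context gets its own stack (its own SP)

    void worker() {
        std::cout << "worker: running on its own saved register state" << std::endl;
        swapcontext(&work_ctx, &main_ctx);   // save worker's registers, restore main's
    }

    int main() {
        getcontext(&work_ctx);                        // capture a register context
        work_ctx.uc_stack.ss_sp = work_stack;
        work_ctx.uc_stack.ss_size = sizeof(work_stack);
        work_ctx.uc_link = &main_ctx;
        makecontext(&work_ctx, worker, 0);            // point its PC at worker()

        std::cout << "main: switching contexts" << std::endl;
        swapcontext(&main_ctx, &work_ctx);            // save main's registers, run worker
        std::cout << "main: back again" << std::endl;
        return 0;
    }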
Slide 9: Caches
Access to main memory is "expensive": ~100 cycles (slow, but relatively cheap in $)
Caches: small, fast, expensive memory
– Hold recently accessed data (D$) or instructions (I$)
– Different sizes & locations:
  – Level 1 (L1): on-chip, smallish
  – Level 2 (L2): on or next to the chip, larger
  – Level 3 (L3): pretty large, on the bus
– Manage memory in lines (32-128 bytes)
Slide 10: Memory Hierarchy
Higher levels = small, fast, more $, lower latency; lower levels = large, slow, less $, higher latency
– Registers: 1-cycle latency
– L1 (separate D$ and I$): 2-cycle latency
– L2 (unified D$/I$): 7-cycle latency
– RAM: 100-cycle latency
– Disk: 40,000,000-cycle latency
– Network: 200,000,000+ cycle latency
(Diagram: lines are loaded into a level on access and evicted to make room)
Slide 11: "Locality"
Slide 12: "Level 0 Cache"
Slide 13: "Level 1 Cache"
Slide 14: "RAM"
Slides 15-18: "Disk"
Slide 19: "Book Hierarchy"
Slides 20-29: Orders of Magnitude
Approximate latency in cycles (log scale):
– 10^0: registers, L1
– 10^1: L2
– 10^2: RAM
– 10^3 to 10^6: (nothing; the gap between RAM and disk)
– 10^7: Disk
– 10^8 to 10^9: Network
Slide 30: Cache Jargon
The cache is initially cold; accessing data initially misses
– Fetch from a lower level in the hierarchy
– Bring the line into the cache (populate the cache)
– Next access: hit
Warmed up: the cache holds the most frequently used data
– Context switch implications?
LRU (Least Recently Used): use the past as a predictor of the future
Slide 31: Cache Details
An ideal cache would be fully associative
– That is, an LRU (least recently used) queue
– Generally too expensive
Instead, partition memory addresses into separate bins, each divided into ways
– 1-way = direct-mapped
– 2-way = 2 entries per bin
– 4-way = 4 entries per bin, etc.
Slide 32: Associativity Example
Hash memory addresses to different indices in the cache (a sketch of the index computation follows)
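A minimal sketch of that hashing; the 64-byte line size and 256 sets are assumptions for illustration, not the course's parameters.

    // How a set-associative cache splits an address into offset, set index, and tag.
    #include <cstdint>
    #include <iostream>

    int main() {
        const uint64_t LINE_SIZE = 64;    // bytes per cache line (assumed)
        const uint64_t NUM_SETS  = 256;   // number of bins/sets (assumed)

        uint64_t addr   = 0x7ffe12345678; // an arbitrary example address
        uint64_t offset = addr % LINE_SIZE;               // byte within the line
        uint64_t set    = (addr / LINE_SIZE) % NUM_SETS;  // which bin the line hashes to
        uint64_t tag    = addr / (LINE_SIZE * NUM_SETS);  // distinguishes lines within a bin

        // An N-way cache keeps up to N lines (compared by tag) per bin;
        // a direct-mapped (1-way) cache keeps only one, so two hot lines
        // that hash to the same bin keep evicting each other.
        std::cout << "offset=" << offset << " set=" << set
                  << " tag=0x" << std::hex << tag << std::endl;
        return 0;
    }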
Slide 33: Miss Classification
– First access = compulsory miss: unavoidable without prefetching
– Too many items mapping to one bin = conflict miss: avoidable if we had higher associativity
– No space left in the cache = capacity miss: avoidable if the cache were larger
– Invalidated by another cache = coherence miss: avoidable if the cache were unshared
Slide 34: Quick Activity
Trace: 3 7 11 2 3 7 7 9 9 6 1 3 7 2 5 8 1 0
Cache with 8 slots, 2-way associativity
– Assume hash(x) = x % 4 (modulus)
How many misses?
– # compulsory misses? 10
– # conflict misses? 2
– # capacity misses? 0
(A sketch that replays this trace follows.)
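A throwaway sketch that replays the activity's trace on the cache described above (4 bins of 2 ways, hash(x) = x % 4); LRU replacement within each bin is an assumption, since the slide does not name a replacement policy.

    // Replay the trace on a 2-way set-associative cache and count misses.
    #include <cstddef>
    #include <iostream>
    #include <vector>

    int main() {
        const int trace[] = {3, 7, 11, 2, 3, 7, 7, 9, 9, 6, 1, 3, 7, 2, 5, 8, 1, 0};
        const int SETS = 4, WAYS = 2;
        std::vector<std::vector<int>> sets(SETS);   // each bin: most recently used first
        int misses = 0;

        for (int addr : trace) {
            std::vector<int>& set = sets[addr % SETS];
            bool hit = false;
            for (std::size_t i = 0; i < set.size(); ++i) {
                if (set[i] == addr) {                  // found: it will move to the MRU slot
                    set.erase(set.begin() + i);
                    hit = true;
                    break;
                }
            }
            if (!hit) {
                ++misses;
                if ((int)set.size() == WAYS)           // bin full: evict its LRU entry
                    set.pop_back();
            }
            set.insert(set.begin(), addr);             // newest entry goes in front
        }
        std::cout << "misses: " << misses << std::endl; // 12 = 10 compulsory + 2 conflict
        return 0;
    }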
Slide 35: Locality
Locality = re-use of recently used items
– Temporal locality: re-use in time
– Spatial locality: use of nearby items (in the same cache line or the same page, a 4K chunk)
Intuitively, greater locality = fewer misses
– The number of misses depends on cache layout, number of levels, associativity, ...
– Machine-specific
(An illustration follows.)
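An illustration (mine, not from the slides) of spatial locality: both loops below touch the same n*n elements, but the row-major loop walks consecutive addresses within each cache line, while the column-major loop strides n doubles per access and misses far more often on large matrices.

    // Row-major vs. column-major traversal of the same matrix.
    #include <cstddef>
    #include <iostream>
    #include <vector>

    int main() {
        const std::size_t n = 2048;
        std::vector<double> m(n * n, 1.0);

        double row_sum = 0.0;
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t j = 0; j < n; ++j)
                row_sum += m[i * n + j];   // adjacent elements: good spatial locality

        double col_sum = 0.0;
        for (std::size_t j = 0; j < n; ++j)
            for (std::size_t i = 0; i < n; ++i)
                col_sum += m[i * n + j];   // stride-n accesses: poor spatial locality

        std::cout << row_sum << " " << col_sum << std::endl;
        return 0;
    }

Timing the two loops shows the gap on any recent machine; its exact size is machine-specific, as the slide notes.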
Slides 36-41: Quantifying Locality
Instead of counting misses, compute the hit curve from an LRU histogram
– Assume a perfect LRU cache
– Ignore compulsory misses
These slides step through the example trace 3 7 7 2 3 7, building up the LRU histogram one access at a time.
Slide 42: Quantifying Locality
Instead of counting misses, compute the hit curve from an LRU histogram
– Start with the total misses on the right-hand side
– Subtract histogram values
For the example trace, hits by LRU queue size:
Queue size: 1 2 3 4 5 6
Hits:       1 1 3 3 3 3
Slide 43: Quantifying Locality
Instead of counting misses, compute the hit curve from an LRU histogram
– Start with the total misses on the right-hand side
– Subtract histogram values
– Normalize
Normalized hit curve for the example (each value divided by the 3 total hits): .3 .3 1 1 1 1
(A sketch of this computation follows.)
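A small sketch (my code, not the course's) of the computation these slides describe: find each access's LRU stack depth, histogram the depths, and take a running sum to get hits as a function of cache size, ignoring compulsory (first) accesses.

    // Compute the hit curve of a trace under a perfect LRU cache.
    #include <cstddef>
    #include <iostream>
    #include <list>
    #include <vector>

    int main() {
        const std::vector<int> trace = {3, 7, 7, 2, 3, 7};   // the slides' example trace
        std::list<int> stack;                                // front = most recently used
        std::vector<long> histogram(trace.size() + 1, 0);    // histogram[d] = reuses at depth d

        for (int addr : trace) {
            int depth = 1;
            std::list<int>::iterator it = stack.begin();
            for (; it != stack.end() && *it != addr; ++it) ++depth;
            if (it != stack.end()) {        // reuse: record its LRU stack depth
                ++histogram[depth];
                stack.erase(it);
            }                               // first access: compulsory miss, ignored
            stack.push_front(addr);
        }

        long hits = 0;                      // running sum of the histogram = hit curve
        for (std::size_t size = 1; size <= trace.size(); ++size) {
            hits += histogram[size];
            std::cout << "cache size " << size << ": " << hits << " hits" << std::endl;
        }
        return 0;                           // prints 1 1 3 3 3 3, matching Slide 42
    }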
Slides 44-46: Hit Curve Exercise
Derive the hit curve for the following trace:
3 5 4 2 8 3 6 9 9 6 1 3 7 2 5 8 1 0
The slides then work through the answer as a table; the rows shown are:
1 2 3 4 5 6 7 8 9
1 2 2 2 3 3 4 5 6
Slide 47: What Can We Do With This?
What would the hit rate be with a cache size of 4 or 9?
Slide 48: Simple Cache Simulator
The only argument is N, the length of the LRU queue
– Read addresses (ints) from cin
– Output hits & misses to cout
Queue operations:
– push_front(v) = put v on the front of the queue
– pop_back() = remove the element at the back of the queue
– erase(i) = erase the element at iterator i
– size() = number of elements
– Iterating over the queue (e.g., with a std::deque<int> named q): for (deque<int>::iterator i = q.begin(); i != q.end(); ++i) cout << *i << endl;
(A full sketch follows.)
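A minimal sketch of the simulator described above; it uses std::deque<int> for the LRU queue, since the slide's exact container type was not preserved, and it assumes N is at least 1.

    // LRU cache simulator: one argument N, addresses on stdin, hit/miss counts on stdout.
    #include <cstddef>
    #include <cstdlib>
    #include <deque>
    #include <iostream>

    int main(int argc, char* argv[]) {
        if (argc < 2 || std::atoi(argv[1]) < 1) {
            std::cerr << "usage: " << argv[0] << " N" << std::endl;
            return 1;
        }
        const std::size_t N = std::atoi(argv[1]);   // length of the LRU queue
        std::deque<int> q;                          // front = most recently used
        long hits = 0, misses = 0;

        int addr;
        while (std::cin >> addr) {
            std::deque<int>::iterator it = q.begin();
            while (it != q.end() && *it != addr) ++it;
            if (it != q.end()) {
                ++hits;
                q.erase(it);                        // remove from its old position...
            } else {
                ++misses;
                if (q.size() == N)                  // queue full: evict least recently used
                    q.pop_back();
            }
            q.push_front(addr);                     // ...and make it the most recently used
        }
        std::cout << "hits: " << hits << " misses: " << misses << std::endl;
        return 0;
    }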
Slide 49: Important CPU Internals
Other issues that affect performance:
– Pipelining
– Branches & prediction
– System calls (kernel crossings)
Slide 50: Scalar Architecture
Straight-up sequential execution: fetch an instruction, decode it, execute it
Problem: an I-cache or D-cache miss
– Result: a stall; everything stops
– How long do we wait for the miss? A long time, compared to the CPU
Slide 51: Superscalar Architectures
Out-of-order processors:
– Keep a pipeline of instructions in flight
– Instead of stalling on a load, guess!
  – Branch prediction
  – Value prediction
  – Predictors are based on history and on location in the program
– Speculatively execute instructions
  – Actual results are checked asynchronously
  – If mispredicted, squash the instructions
Accurate prediction = massive speedup: it hides the latency of the memory hierarchy
(A toy branch predictor is sketched below.)
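A toy illustration of history-based prediction, not the hardware design from the slides: a table of 2-bit saturating counters, indexed by (hypothetical) branch-address bits, that drifts toward "taken" or "not taken" as outcomes arrive. The table size and the outcome stream are made up.

    // 2-bit saturating-counter branch predictor over a made-up outcome stream.
    #include <cstdint>
    #include <iostream>
    #include <vector>

    int main() {
        std::vector<uint8_t> counters(1024, 1);     // 0-1 predict not taken, 2-3 predict taken
        const uint64_t branch_addr = 0x400;         // hypothetical branch address
        const bool outcomes[] = {true, true, true, false, true, true, false, true};
        int mispredictions = 0;

        for (bool taken : outcomes) {
            uint8_t& c = counters[branch_addr % counters.size()];
            bool prediction = (c >= 2);             // what the fetch stage would speculate
            if (prediction != taken) ++mispredictions;   // a real CPU would squash and refetch
            if (taken && c < 3) ++c;                // update history, saturating at 0 and 3
            if (!taken && c > 0) --c;
        }
        std::cout << "mispredictions: " << mispredictions << std::endl;
        return 0;
    }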
Slide 52: Pipelining
Pipelining overlaps instructions to exploit parallelism and allows the clock rate to be increased.
Slide 53: Pipelining
Branches cause bubbles in the pipeline: stages are left idle.
Slide 54: Pipelining and Branches
(Diagram: the five pipeline stages (instruction fetch, instruction decode, execute, memory access, write back), with an unresolved branch)
Pipelining overlaps instructions to exploit parallelism, allowing the clock rate to be increased. Branches cause bubbles in the pipeline, where some stages are left idle.
Slide 55: Branch Prediction
(Diagram: the same five pipeline stages, with speculative execution past the unresolved branch)
A branch predictor allows the processor to speculatively fetch and execute instructions down the predicted path.
Slide 56: Kernel Mode
Kernel mode protects the OS from users
– "Kernel" = English for nucleus (think atom)
– Only privileged code executes in the kernel
A system call is expensive because it:
– Enters kernel mode: flushes the pipeline and saves the context (where we are in user land)
– Executes code "in kernel land"
– Returns to user mode, restoring the context
(A rough timing sketch follows.)
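A rough, Linux-specific sketch of the cost difference: time a loop of real system calls (forced via syscall(SYS_getpid)) against a loop that stays in user land. The absolute numbers vary widely by machine and kernel; the point is the gap.

    // Compare the cost of entering the kernel with ordinary user-land work.
    #include <chrono>
    #include <iostream>
    #include <sys/syscall.h>
    #include <unistd.h>

    int main() {
        const int N = 1000000;
        using clk = std::chrono::steady_clock;

        auto t0 = clk::now();
        for (int i = 0; i < N; ++i)
            syscall(SYS_getpid);            // enters and leaves the kernel every iteration
        auto t1 = clk::now();

        volatile long sink = 0;
        auto t2 = clk::now();
        for (int i = 0; i < N; ++i)
            sink = sink + i;                // stays entirely in user land
        auto t3 = clk::now();

        using ms = std::chrono::milliseconds;
        std::cout << "syscall loop:   " << std::chrono::duration_cast<ms>(t1 - t0).count() << " ms\n"
                  << "user-land loop: " << std::chrono::duration_cast<ms>(t3 - t2).count() << " ms\n";
        return 0;
    }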
Slide 57: Timers & Interrupts
The OS needs to respond to events periodically, e.g. to change the executing process
– Quantum: the time limit for process execution
– Fairness: when the timer goes off, an interrupt fires
  – The current process stops
  – The OS takes control through the interrupt handler
  – The scheduler chooses the next process
Interrupts also signal I/O events: network packet arrival, disk read completion, ...
(A user-level analogy is sketched below.)
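A user-level analogy only, assuming POSIX signals and setitimer (this is not the kernel's timer interrupt): a periodic SIGALRM plays the role of the timer that fires at the end of each quantum, and the signal handler stands in for the OS interrupt handler.

    // Periodic "timer interrupts" delivered to a busy process via SIGALRM.
    #include <csignal>
    #include <cstdio>
    #include <sys/time.h>

    volatile sig_atomic_t ticks = 0;

    void on_tick(int) { ticks = ticks + 1; }     // plays the role of the interrupt handler

    int main() {
        std::signal(SIGALRM, on_tick);

        itimerval quantum = {};
        quantum.it_value.tv_usec = 10000;        // first tick after 10 ms
        quantum.it_interval.tv_usec = 10000;     // then every 10 ms (the "quantum")
        setitimer(ITIMER_REAL, &quantum, nullptr);

        while (ticks < 100) {                    // the "current process" just keeps computing
            // busy work until interrupted enough times
        }
        std::printf("received %d timer ticks\n", (int)ticks);
        return 0;
    }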
Slide 58: The End
Slide 59: Branch Prediction
A branch predictor allows the processor to speculatively fetch and execute instructions down the predicted path.