Computer Systems Principles: Architecture
Emery Berger and Mark Corner
Department of Computer Science, University of Massachusetts Amherst
Architecture

Von Neumann and the "von Neumann architecture"

Fetch, Decode, Execute
The Memory Hierarchy
Registers
Caches
–Associativity
–Misses
"Locality"
(Figure: registers → L1 → L2 → RAM)
Registers
Register = a dedicated name for one word of memory managed by the CPU
–General-purpose: "AX", "BX", "CX" on x86
–Special-purpose: "SP" = stack pointer, "FP" = frame pointer, "PC" = program counter
Changing processes means saving the current registers and loading the saved registers of the next process = a context switch
(Figure: a stack frame with SP, FP, arg0, arg1, arg2)
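To see the save-registers/load-registers idea from user code, here is a minimal sketch assuming a POSIX system with the ucontext API (getcontext/makecontext/swapcontext; Linux/glibc has it). swapcontext saves the current registers into one context and restores another, which is the core of what a context switch does; a real OS does the same thing in the kernel, for whole processes.

```cpp
// Minimal user-level "context switch" sketch using POSIX ucontext.
#include <ucontext.h>
#include <cstdio>

static ucontext_t main_ctx, func_ctx;
static char func_stack[64 * 1024];      // stack for the second context

static void in_other_context() {
    std::puts("running with saved-and-restored registers");
    swapcontext(&func_ctx, &main_ctx);  // save ours, restore main's
}

int main() {
    getcontext(&func_ctx);              // initialize from current registers
    func_ctx.uc_stack.ss_sp = func_stack;
    func_ctx.uc_stack.ss_size = sizeof(func_stack);
    func_ctx.uc_link = &main_ctx;
    makecontext(&func_ctx, in_other_context, 0);
    swapcontext(&main_ctx, &func_ctx);  // save main's registers, load func's
    std::puts("back in main");
    return 0;
}
```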
Caches
Access to main memory is "expensive"
–~100 cycles (slow, but relatively cheap in $)
Caches: small, fast, expensive memory
–Hold recently accessed data (D$) or instructions (I$)
–Come in different sizes & locations:
 Level 1 (L1): on-chip, smallish
 Level 2 (L2): on or next to the chip, larger
 Level 3 (L3): pretty large, on the bus
–Managed in lines of memory (runs of adjacent bytes)
Memory Hierarchy
Higher = small, fast, more $, lower latency
Lower = large, slow, less $, higher latency

Level                     Latency
registers                 1 cycle
L1 (D$, I$ separate)      2 cycles
L2 (D$, I$ unified)       7 cycles
RAM                       100 cycles
Disk                      40,000,000 cycles
Network                   200,000,000+ cycles

Lines move up the hierarchy on a load and down on an evict.
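To make the latency gap concrete, here is a minimal sketch (a rough illustration, not a calibrated benchmark; the array size and stride are arbitrary choices) that sums the same array twice: sequentially, where each fetched cache line is fully reused, and with a large stride, where nearly every access touches a new line. The strided version typically runs several times slower.

```cpp
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    const std::size_t N = 1 << 24;        // 16M ints, far bigger than L2
    std::vector<int> a(N, 1);
    auto time_sum = [&](std::size_t stride) {
        auto t0 = std::chrono::steady_clock::now();
        long long sum = 0;
        for (std::size_t start = 0; start < stride; ++start)
            for (std::size_t i = start; i < N; i += stride)
                sum += a[i];              // same N adds either way
        auto t1 = std::chrono::steady_clock::now();
        std::printf("stride %4zu: sum=%lld, %lld ms\n", stride, sum,
            (long long)std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count());
    };
    time_sum(1);      // sequential: one miss per cache line
    time_sum(1024);   // strided: roughly one miss per access
    return 0;
}
```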
"Locality"
(Figure-only slides: "Level 0 Cache", "Level 1 Cache", "RAM", "Disk", "Book Hierarchy")
Orders of Magnitude (cycles of latency)
10^0        registers, L1
10^1        L2
10^2        RAM
10^3–10^6   (nothing: the gap between RAM and disk)
10^7        Disk
10^8–10^9   Network
Cache Jargon
A cache is initially cold
Accessing data initially misses
–Fetch from the lower level of the hierarchy
–Bring the line into the cache (populate the cache)
–Next access: a hit
Warmed up: the cache holds the most-frequently used data
–Context switch implications?
LRU: Least Recently Used
–Use the past as a predictor of the future
Cache Details
An ideal cache would be fully associative
–That is, one big LRU (least-recently-used) queue
–Generally too expensive
Instead, hash memory addresses into separate bins, each divided into ways
–1-way = direct-mapped
–2-way = 2 entries per bin
–4-way = 4 entries per bin, etc.
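The binning works by stripping the line-offset bits of the address and then taking the low bits of the line number as the bin (set) index; the rest is stored as a tag to identify the line. A minimal sketch (the constants are illustrative, not any particular CPU's):

```cpp
#include <cstdint>
#include <cstdio>

const uint64_t LINE_BYTES = 64;   // bytes per cache line
const uint64_t NUM_SETS   = 64;   // bins; associativity = lines per bin

int main() {
    uint64_t addr = 0x7ffe12345678;
    uint64_t line = addr / LINE_BYTES;   // which line of memory
    uint64_t set  = line % NUM_SETS;     // which bin it hashes to
    uint64_t tag  = line / NUM_SETS;     // stored to identify the line
    std::printf("addr %#llx -> set %llu, tag %#llx\n",
                (unsigned long long)addr, (unsigned long long)set,
                (unsigned long long)tag);
}
```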
Associativity Example
Hash memory addresses to different indices in the cache
Miss Classification
First access = compulsory miss
–Unavoidable without prefetching
Too many items mapping to one bin = conflict miss
–Avoidable if we had higher associativity
No space in the cache = capacity miss
–Avoidable if the cache were larger
Invalidated = coherence miss
–Avoidable if the cache were unshared
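A sketch of how a simulator can tell these categories apart, under one common set of definitions (compulsory = first-ever access; capacity = a fully associative LRU cache of the same total size would also miss; conflict = the rest). The geometry matches the activity that follows (8 slots, 2-way, hash(x) = x % 4); the trace values are hypothetical.

```cpp
#include <algorithm>
#include <cstdio>
#include <deque>
#include <set>
#include <vector>

const int SETS = 4, WAYS = 2;             // 8 slots, 2-way

int main() {
    std::vector<int> trace = {0, 4, 8, 0, 4, 8, 12, 0};  // hypothetical
    std::vector<std::deque<int>> bins(SETS);
    std::deque<int> full;                 // fully associative shadow cache
    std::set<int> seen;                   // everything ever accessed
    int compulsory = 0, conflict = 0, capacity = 0;
    for (int x : trace) {
        auto& bin = bins[x % SETS];
        auto it = std::find(bin.begin(), bin.end(), x);
        bool hit = (it != bin.end());
        // Maintain the shadow cache (same total size, SETS*WAYS lines).
        auto fit = std::find(full.begin(), full.end(), x);
        bool full_hit = (fit != full.end());
        if (full_hit) full.erase(fit);
        else if (full.size() == SETS * WAYS) full.pop_back();
        full.push_front(x);
        // Classify, then update the set-associative cache (LRU per bin).
        if (hit) bin.erase(it);
        else if (!seen.count(x)) ++compulsory;
        else if (full_hit) ++conflict;    // associativity would have saved it
        else ++capacity;
        if (!hit && bin.size() == WAYS) bin.pop_back();
        bin.push_front(x);
        seen.insert(x);
    }
    std::printf("compulsory %d conflict %d capacity %d\n",
                compulsory, conflict, capacity);
}
```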
Quick Activity
A cache with 8 slots, 2-way associativity
–Assume hash(x) = x % 4 (modulus)
How many misses?
–# compulsory misses?
–# conflict misses?
–# capacity misses?
Locality
Locality = re-use of recently used items
–Temporal locality: re-use in time
–Spatial locality: use of nearby items
 In the same cache line or the same page (4K chunk)
Intuitively, greater locality = fewer misses
–The # of misses depends on cache layout, # of levels, associativity…
–Machine-specific
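A minimal sketch of spatial locality in action: the two functions below do the same N² additions, but the row-by-row walk touches consecutive addresses (staying within each cache line), while the column-by-column walk jumps N*sizeof(int) bytes per step and typically misses far more often.

```cpp
#include <cstdio>

const int N = 2048;
static int m[N][N];

long long sum_rows() {                // good spatial locality
    long long s = 0;
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            s += m[i][j];             // adjacent in memory
    return s;
}

long long sum_cols() {                // poor spatial locality
    long long s = 0;
    for (int j = 0; j < N; ++j)
        for (int i = 0; i < N; ++i)
            s += m[i][j];             // N*sizeof(int) bytes apart
    return s;
}

int main() {
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            m[i][j] = 1;
    std::printf("%lld %lld\n", sum_rows(), sum_cols());
}
```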
Quantifying Locality
Instead of counting misses, compute the hit curve from an LRU histogram
–Assume a perfect LRU cache
–Ignore compulsory misses
–Start with the total misses on the right-hand side
–Subtract the histogram values
–Normalize
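A minimal sketch of that recipe on a hypothetical histogram. Here hist[d] counts accesses with LRU reuse distance d (the number of distinct lines touched since the last access to that line); a perfect LRU cache of size C hits exactly the accesses with d ≤ C.

```cpp
#include <cstdio>
#include <vector>

int main() {
    // Hypothetical histogram: reuse distance -> count.
    // Compulsory (first-ever) accesses are ignored, per the slide.
    std::vector<long long> hist = {0, 50, 30, 10, 5, 5};
    long long total = 0;
    for (long long c : hist) total += c;

    long long misses = total;            // a size-0 cache misses everything
    for (std::size_t c = 1; c < hist.size(); ++c) {
        misses -= hist[c];               // distances <= c now hit
        double hit_rate = (double)(total - misses) / total;  // normalize
        std::printf("cache size %zu: hit rate %.2f\n", c, hit_rate);
    }
}
```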
Hit Curve Exercise
Derive the hit curve for the following trace:
What can we do with this?
What would be the hit rate with a cache size of 4 or 9?
Simple Cache Simulator
The only argument is N, the length of the LRU queue
–Read addresses (ints) from cin
–Output hits & misses to cout
Handy std::deque operations:
–push_front(v) = put v on the front of the queue
–pop_back() = remove the back of the queue
–erase(i) = erase an element (given an iterator i)
–size() = number of elements
–for (deque<int>::iterator i = q.begin(); i != q.end(); ++i) cout << *i << endl;
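A runnable version of the simulator described above, as one straightforward sketch: a perfect LRU cache of N lines kept in a std::deque, newest at the front.

```cpp
#include <algorithm>
#include <cstdlib>
#include <deque>
#include <iostream>

int main(int argc, char* argv[]) {
    if (argc != 2) { std::cerr << "usage: sim N\n"; return 1; }
    std::size_t N = std::atoi(argv[1]);   // length of the LRU queue
    std::deque<int> q;
    long long hits = 0, misses = 0;
    int addr;
    while (std::cin >> addr) {
        auto it = std::find(q.begin(), q.end(), addr);
        if (it != q.end()) {              // hit: move to the front
            ++hits;
            q.erase(it);
        } else {                          // miss: evict the LRU line if full
            ++misses;
            if (q.size() == N) q.pop_back();
        }
        q.push_front(addr);
    }
    std::cout << "hits: " << hits << " misses: " << misses << std::endl;
    return 0;
}
```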
Important CPU Internals
Other issues that affect performance:
–Pipelining
–Branches & prediction
–System calls (kernel crossings)
Scalar Architecture
Straight-up sequential execution
–Fetch an instruction
–Decode it
–Execute it
Problem: an I-cache or D-cache miss
–Result: a stall; everything stops
–How long do we wait for the miss? A long time, compared to CPU speed
Superscalar Architectures
Out-of-order processors
–Keep a pipeline of instructions in flight
–Instead of stalling on a load, guess!
 Branch prediction
 Value prediction
–Predictors are based on history and on location in the program
–Speculatively execute instructions
 Actual results are checked asynchronously
 If mispredicted, squash the speculated instructions
Accurate prediction = massive speedup
–Hides the latency of the memory hierarchy
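A classic way to see branch prediction from user code: run the same data-dependent branch over shuffled and sorted copies of an array. On sorted data the branch becomes predictable (all not-taken, then all taken), so the loop typically runs much faster. This is a sketch, not a guarantee: an optimizing compiler may turn the branch into a conditional move and erase the effect.

```cpp
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <random>
#include <vector>

long long count_big(const std::vector<int>& v) {
    long long s = 0;
    for (int x : v)
        if (x >= 128) s += x;    // hard to predict on random data
    return s;
}

int main() {
    std::vector<int> v(1 << 22);
    std::mt19937 gen(1);
    for (int& x : v) x = gen() % 256;
    std::vector<int> sorted = v;
    std::sort(sorted.begin(), sorted.end());
    for (const auto* data : {&v, &sorted}) {
        auto t0 = std::chrono::steady_clock::now();
        long long s = count_big(*data);
        auto t1 = std::chrono::steady_clock::now();
        std::printf("sum %lld in %lld ms\n", s,
            (long long)std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count());
    }
}
```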
Pipelining and Branches
(Pipeline stages: instruction fetch, instruction decode, execute, memory access, write back)
Pipelining overlaps instructions to exploit parallelism, allowing the clock rate to be increased.
Branches cause bubbles in the pipeline: while a branch is unresolved, some stages are left idle.

Branch Prediction
A branch predictor allows the processor to speculatively fetch and execute instructions down the predicted path.
Kernel Mode
Protects the OS from users
–"kernel" = English for nucleus (think atom)
–Only privileged code executes in the kernel
A system call is expensive because it:
–Enters kernel mode
 Flushes the pipeline and saves the context (where we are in user land)
–Executes code "in kernel land"
–Returns to user mode, restoring the context
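A minimal, Linux-specific sketch (not a rigorous benchmark) of that cost: time a loop of direct getpid system calls against a loop of plain function calls. The syscall loop crosses into the kernel on every iteration; the function loop stays in user mode.

```cpp
#include <chrono>
#include <cstdio>
#include <sys/syscall.h>
#include <unistd.h>

__attribute__((noinline)) long plain_call(long x) { return x + 1; }

int main() {
    const int N = 1'000'000;
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < N; ++i)
        syscall(SYS_getpid);              // enters and leaves the kernel
    auto t1 = std::chrono::steady_clock::now();
    long acc = 0;
    for (int i = 0; i < N; ++i)
        acc = plain_call(acc);            // stays in user mode
    auto t2 = std::chrono::steady_clock::now();
    using ns = std::chrono::nanoseconds;
    std::printf("syscall: %lld ns/call, function: %lld ns/call (acc=%ld)\n",
        (long long)std::chrono::duration_cast<ns>(t1 - t0).count() / N,
        (long long)std::chrono::duration_cast<ns>(t2 - t1).count() / N, acc);
}
```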
Timers & Interrupts
The OS needs to respond to events periodically
–e.g., to change the executing process
Quantum: the time limit for a process's execution
Fairness: when the timer goes off, an interrupt fires
–The current process stops
–The OS takes control through the interrupt handler
–The scheduler chooses the next process
Interrupts also signal I/O events
–Network packet arrival, disk read completion, …
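A minimal user-level analogue of the periodic timer interrupt, assuming POSIX setitimer and SIGALRM (this is how many user-level thread schedulers fake a quantum; the kernel's real timer interrupt works analogously, in hardware).

```cpp
#include <csignal>
#include <cstdio>
#include <sys/time.h>
#include <unistd.h>

volatile sig_atomic_t ticks = 0;

void on_timer(int) { ++ticks; }           // our "interrupt handler"

int main() {
    std::signal(SIGALRM, on_timer);
    itimerval quantum{};
    quantum.it_interval.tv_usec = 10000;  // fire every 10 ms thereafter
    quantum.it_value.tv_usec = 10000;     // first firing in 10 ms
    setitimer(ITIMER_REAL, &quantum, nullptr);
    while (ticks < 10)                    // "run" until 10 quanta elapse
        pause();                          // sleep until the next signal
    std::printf("received %d timer interrupts\n", (int)ticks);
}
```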
The End