Chapter 13 (In Part): Computer Architecture, Pipelines and Caches. Diagrams are from Computer Architecture: A Quantitative Approach, 2nd ed., Hennessy and Patterson. Use these note slides as your primary study source.

Instruction Execution

1. Instruction Fetch (IF)
   IR ← M[PC]
   NPC ← PC + 4
2. Instruction Decode / Register Fetch (ID)
   A ← Register[Rs1]
   B ← Register[Rs2]
   Imm ← (IR16)^16 ## IR16..31   (sign-extended immediate)

Instruction Execution
3. Execution (EX)
   The instruction has been decoded, so execution can be split according to instruction type.
   Reg-Reg:  ALUout ← A op B
   Reg-Imm:  ALUout ← A op Imm
   Branch:   ALUout ← NPC + Imm   (target); cond ← (A {=, !=} 0)
   LD/ST:    ALUout ← A op Imm    (effective address)
   Jump:     ??

Instruction Execution
4. Memory Access / Branch Completion (MEM)
   Load:     LMD ← Mem[ALUout]   (LMD = Load Memory Data = MDR)
   Store:    Mem[ALUout] ← B
   Branch:   PC ← (cond) ? ALUout : NPC
   Jump/JAL: ??
   JR:       PC ← A
   Else:     PC ← NPC
5. Write-Back (WB)
   ALU instruction: Rd ← ALUout
   Load:            Rd ← LMD
   JAL:             R31 ← old PC
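As a reading aid, here is a minimal sketch of the five steps as a single-cycle interpreter in Python. The instruction record and the opcode names (add, addi, lw, sw, beqz) are illustrative assumptions, not the real DLX encoding:

```python
from collections import namedtuple

# Illustrative instruction record; field names are assumptions, not the
# real DLX instruction format.
Instr = namedtuple("Instr", "op rd rs1 rs2 imm")

def step(pc, regs, imem, dmem):
    """Execute one instruction, organized as the five steps above."""
    # 1. IF: fetch the instruction, compute the next PC
    ir = imem[pc]                 # imem: dict keyed by byte address
    npc = pc + 4

    # 2. ID: read source registers, fetch the (already sign-extended) immediate
    a, b, imm = regs[ir.rs1], regs[ir.rs2], ir.imm

    # 3. EX: split by instruction type
    cond = False
    if ir.op == "add":            # Reg-Reg ALU
        alu_out = a + b
    elif ir.op == "addi":         # Reg-Imm ALU
        alu_out = a + imm
    elif ir.op in ("lw", "sw"):   # effective address for load/store
        alu_out = a + imm
    elif ir.op == "beqz":         # branch target and condition
        alu_out, cond = npc + imm, (a == 0)

    # 4. MEM: memory access / branch completion
    if ir.op == "lw":
        lmd = dmem[alu_out]       # LMD = load memory data
    elif ir.op == "sw":
        dmem[alu_out] = b
    pc = alu_out if cond else npc

    # 5. WB: write the result back to the register file
    if ir.op in ("add", "addi"):
        regs[ir.rd] = alu_out
    elif ir.op == "lw":
        regs[ir.rd] = lmd
    return pc
```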

What is Pipelining?? Pipelining is an implementation technique whereby multiple instructions are overlapped in execution.

Pipelining

Time →
Instr 1:  IF  ID  EX  MEM  WB
Instr 2:      IF  ID  EX   MEM  WB
Instr 3:          IF  ID   EX   MEM  WB
Instr 4:              IF   ID   EX   MEM
Instr 5:                   IF   ID   EX

What are the overheads of pipelining? What are the complexities? What types of pipelines are there?

Pipelining

Pipeline Overhead
- Latches between stages
  - Must latch data and control
- Expansion of stage delay
  - Original time was IF + ID + EX + MEM + WB, with phases skipped for some instructions
  - New best time is MAX(IF, ID, EX, MEM, WB) + latch delay
- More memory ports or separate memories
- Requires complicated control to make it as fast as possible
  - But so do non-pipelined high-performance designs
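A small worked example of the timing claim, with assumed stage delays (the numbers are illustrative, not from the slides):

```python
# Assumed stage delays in ns (illustrative values, not from the text).
IF, ID, EX, MEM, WB = 10, 8, 10, 10, 7
LATCH = 1  # latch delay added at each stage boundary

unpipelined = IF + ID + EX + MEM + WB         # 45 ns per instruction
pipelined = max(IF, ID, EX, MEM, WB) + LATCH  # 11 ns per cycle

# Once the pipe is full, one instruction completes per cycle, so the
# speedup is bounded by the slowest stage plus the latch overhead.
print(unpipelined / pipelined)                # ~4.09x, not the ideal 5x
```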

Pipelining
Pipeline complexities
- 3 or 5 or 10 instructions are executing at the same time
- What if the instructions are dependent? Must have hardware or software ways to ensure proper execution.
- What if an interrupt occurs? Would like to be able to figure out which instruction caused it!

Pipeline Stalls – Structural
What causes pipeline stalls?
1. Structural Hazard
- Inability of the hardware to perform two tasks at once
- Example: memory is needed for both instructions and data
  - SpecInt92 is 35% loads and stores
  - If this remains a hazard, each load or store will stall one IF

Pipeline Stalls – Data
2.1 Data Hazards – Read After Write (RAW)
Data is needed before it is written.

ADD R1,R2,R3   IF  ID  EX  MEM  WB          (calc done in EX; R1 written in WB)
ADD R4,R1,R5       IF  ID  EX   MEM  WB     (R1 read in ID, used in EX: before it is written)
ADD R4,R1,R5           IF  ID   EX   MEM  WB    (R1 read in ID: still before WB)
ADD R4,R2,R3               IF   ID   EX   MEM  WB

Pipeline Stalls – Data
- Data must be forwarded to other instructions in the pipeline if it is ready for use but not yet stored.
- In the 5-stage MIPS pipeline, this means forwarding to the next three instructions.
- Can reduce this to 2 instructions if the register bank can process a simultaneous read and write.
  - The text indicates this as writing in the first half of WB and reading in the second half of ID.
  - More likely, both occur at once, with the read receiving the data as it is written.
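A sketch of the forwarding decision in Python, assuming hypothetical pipeline-register records that carry each in-flight instruction's destination register and result; real hardware does this with comparators and muxes:

```python
# Hypothetical pipeline-register records: each carries the destination
# register (rd) and result of an older, still-in-flight instruction.
def forward_operands(ex, ex_mem, mem_wb, regs):
    """Pick each ALU operand from the youngest in-flight producer."""
    def pick(rs):
        if ex_mem is not None and ex_mem["rd"] == rs:
            return ex_mem["alu_out"]       # produced one cycle ago
        if mem_wb is not None and mem_wb["rd"] == rs:
            return mem_wb["result"]        # about to be written back
        return regs[rs]                    # no RAW hazard: register file
    return pick(ex["rs1"]), pick(ex["rs2"])
```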

Pipeline Stalls – Data
Load delays: the data is available at the end of MEM, not in time for the next instruction's EX phase.

Pipeline Stalls – Data So insert a bubble…

Pipeline Stalls – Data
A store requires two registers at two different times:
- Rs1 is required for address calculation during EX
- The data to be stored (Rd) is required during MEM
Data can be loaded and immediately stored.
Data from the ALU must be forwarded to MEM for stores.

Pipeline Stalls – Data
2.2 Data Hazards – Write After Write (WAW)
Out-of-order completion of instructions.

FDIV R1, R2, R3   IF  ID/RF  ISSUE  FDIV1  FDIV2  FDIV3  FDIV4  WB
FMUL R1, R2, R3       IF     ID/RF  ISSUE  FMUL1  FMUL2  WB

The length of the FDIV pipe could cause a WAW error: the later FMUL writes R1 first, and the FDIV then overwrites it.

Pipeline Stalls – Data
2.3 Data Hazards – Write After Read (WAR)
A slow reader followed by a fast writer.

FMADD R1, R2, R3, R4   IF  ID  IS  FM1  FM2  FM3  FM4  WB   (reads registers for the multiply, then for the add)
FLOAD R4, #300(R5)         IF  ID  IS  F1   WB

If the FLOAD writes R4 before the FMADD has read it for the add, the FMADD uses the wrong value.

Pipeline Stalls – Data
2.4 Data Hazards – Read After Read (RAR)
A read after a read is not a hazard, assuming that a read does not change any state.

Pipeline Stalls – Control
3. Control Hazards
Branch, Jump, Interrupt
Or why computers like straight-line code…

3.1 Control Hazards – Jump

Jump         IF  ID  EX  MEM  WB
Next instr       IF  ID  EX   MEM  WB

When is knowledge of the jump available? When is the target available?
With absolute addressing, may be able to fit the jump into the IF phase with zero overhead.

Pipeline Stalls – Control

This relies on performing a simple decode (2 bits for SPARC, 6 bits for others) in IF. If IF is currently the slowest phase, this will slow the clock. Relative jumps would require an adder and its delay.

Pipeline Stalls – Control
3.2 Control Hazards – Branch

Branch     IF  ID  EX  MEM  WB
Instr +1       IF  ID  EX   MEM  WB
Instr +2           IF  ID   EX   MEM  WB
Instr +3               IF   ID   EX   MEM  WB
Instr +4                    IF   ID   EX   MEM  WB

The branch is resolved in MEM and the information forwarded to IF, causing three stalls. 16% of the SpecInt instructions are branches, and about 67% are taken:

CPI = CPI_IDEAL + STALLS = 1 + 3(0.67)(0.16) = 1.32

Must:
- Resolve the branch sooner
- Update the PC sooner
- Fetch the correct next instruction more frequently
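The slide's CPI arithmetic, plus the same formula applied to earlier branch resolution (the 1-stall case assumes the branch is resolved in ID, a standard refinement the slide does not compute):

```python
branch_freq, taken_frac = 0.16, 0.67       # figures from the slide

def cpi(stalls_per_taken_branch):
    return 1.0 + stalls_per_taken_branch * taken_frac * branch_freq

print(cpi(3))   # resolved in MEM: 1 + 3(0.67)(0.16) = 1.32
print(cpi(1))   # resolved in ID (assumed refinement): ~1.11
```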

Pipeline Stalls – Control
Delayed branch and the branch delay slot
- Ideally, fill the slot with an instruction from before the branch
- Otherwise from the target (since branches are more often taken than not)
- Or from the fall-through path
- Slots are filled 50-70% of the time
- Interrupt recovery is complicated

Memory Hierarchy

Level                    Speed                        Size            Cost
CPU registers            3-10 accesses/cycle          ? words
On-chip cache            1-2 accesses/cycle, 5-10 ns  16KB-2MB
Off-chip cache (SRAM)    5-20 cycles/access, ? ns     1MB-16MB        $6/MB
Main memory (DRAM)       ? cycles/access, ? ns        64MB-many GB    $0.28/MB
Disk buffer
Disk or network          1M-2M cycles/access          4GB-many TB     $0.65/GB

Transfer units between levels: words; blocks or lines (8-128 bytes); pages (1KB-16KB); sectors.

Generalized Caches
Movement of technology: a table comparing machines (VAX 11/780, two Alpha generations, Pentium IV) by CPI, clock period (ns), main-memory access time (ns), and miss penalty in instruction cycles. Most entries did not survive the transcript; the Pentium IV row shows a 0.5 ns clock and ~5 ns-class DDR main memory.

Cache Example
Example cache: the Alpha 21064 8 KB data cache, with 34-bit addressing and 256-bit lines (32 bytes).
Block placement: direct-mapped
- One possible place for each address
- Multiple addresses map to each possible place
Address bits 33…0 are parsed as | Tag | Cache Index | Offset |
Each cache line includes valid bit, tag, and data.

Cache Example: 21064

Cache operation
1. Send the address to the cache
2. Parse the address into offset, index, and tag
3. Decode the index into a line of the cache; prepare the cache for reading (precharge)
4. Read the line of the cache: valid, tag, data
5. Compare the tag with the tag field of the address; miss if no match
6. Select the word according to the byte offset and read or write
If there is a miss…
- Stall the processor while reading the line in from the next level of the hierarchy
- Which in turn may miss and read from main memory
- Which in turn may miss and read from disk
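A sketch of this lookup in Python for the 21064-style cache above (8 KB direct-mapped, 32-byte lines, 34-bit addresses, so 5 offset bits, 8 index bits, and 21 tag bits); the line structure is illustrative:

```python
OFFSET_BITS, INDEX_BITS = 5, 8             # 32-byte lines; 8KB/32B = 256 lines
NUM_LINES = 1 << INDEX_BITS

# Illustrative line structure: valid bit, tag, and 32 bytes of data.
cache = [{"valid": False, "tag": 0, "data": bytearray(32)}
         for _ in range(NUM_LINES)]

def lookup(addr):
    """Parse a 34-bit address and probe the direct-mapped cache."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)   # remaining 21 bits

    line = cache[index]
    if line["valid"] and line["tag"] == tag:   # tag match: hit
        return line["data"][offset]
    return None                                # miss: go to the next level
```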

Virtual Memory

Block       1KB-16KB (page/segment)
Hit         ? cycles (a DRAM access)
Miss        700,000-6,000,000 cycles (page fault)
Miss rate   1 in 0.1-10 million

Differences from cache:
- The miss strategy is implemented in software
- The hit/miss time factor is 10,000+ (vs. a far smaller factor for cache)
Critical concerns are:
- Fast address translation
- A miss ratio as low as possible without ideal knowledge

Virtual Memory
Q0: Fetch strategy
- Swap pages on a task switch
- May pre-fetch the next page if extra transfer time is the only issue
- May include a disk cache
Q1: Block placement
- Anywhere (fully associative): random access is easily available, and the time to place a block well is tiny compared to the miss penalty.

Virtual Memory
Q2: Finding a block
- Page table: a list of VPNs (Virtual Page Numbers) and physical addresses (or disk locations)
  - Consider a 32-bit VA, 30-bit PA, and 2KB pages. The page table has 2^32 / 2^11 = 2^21 entries, for perhaps 2^25 bytes (at 16 bytes per entry), or 2^14 pages.
  - The page table must itself be in virtual memory (segmented paging); the system page table must always be in memory.
- Translation look-aside buffer (TLB): a cache of address translations
  - Hit in 1 cycle (no stalls in the pipeline)
  - A miss results in a page table access (which could lead to a page fault), perhaps via OS instructions.
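The page-table arithmetic, spelled out (the 16-byte entry size is an inference from the slide's 2^25-byte total):

```python
va_bits, page_bytes = 32, 2**11            # 32-bit VA, 2KB pages

entries = 2**va_bits // page_bytes         # 2**21 page-table entries
entry_bytes = 16                           # inferred from the 2**25-byte total
table_bytes = entries * entry_bytes        # 2**25 bytes (32 MB)
table_pages = table_bytes // page_bytes    # 2**14 pages of page table
print(entries == 2**21, table_bytes == 2**25, table_pages == 2**14)
```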

Virtual Memory
Q3: Page replacement
- LRU is used most often (really, approximations of LRU with a fixed time window)
- The TLB will support determining which translations have been used
Q4: Write policy
- Write through or write back? (See the sketch below.)
- Write through: data is written both to the block in the cache and to the block in lower-level memory.
- Write back: data is written only to the block in the cache, and written to the lower level only when the block is replaced.
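A minimal sketch of the two write policies, assuming a hypothetical one-line cache and a dict standing in for the next level of the hierarchy:

```python
# Hypothetical one-line cache; a dict stands in for lower-level memory.
def write_through(line, offset, value, lower):
    line["data"][offset] = value
    lower[line["base_addr"] + offset] = value   # update lower level immediately

def write_back(line, offset, value):
    line["data"][offset] = value
    line["dirty"] = True     # lower level is updated only on replacement

def evict(line, lower):
    if line.get("dirty"):    # write-back: flush the dirty line when replaced
        for i, byte in enumerate(line["data"]):
            lower[line["base_addr"] + i] = byte
```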

Virtual Memory
Memory protection
- Must index page table entries by PID
- Flush the TLB on a task switch
- Verify access to a page before loading it into the TLB
- Provide the OS access to all memory, physical and virtual
- Provide some untranslated addresses to the OS for I/O buffers

Address Translation
TLB: typically 8-32 entries
- Set-associative or fully associative
- Random or LRU replacement
- Two or more ports (instruction and data)
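A sketch of a fully associative TLB in front of a page table, in Python; dicts stand in for the hardware CAM and the OS-maintained table:

```python
PAGE_BITS = 11                  # 2KB pages, as in the earlier example

tlb = {}                        # VPN -> PPN; stands in for the hardware CAM
page_table = {}                 # VPN -> PPN, maintained by the OS

def translate(vaddr):
    vpn, offset = vaddr >> PAGE_BITS, vaddr & ((1 << PAGE_BITS) - 1)
    if vpn in tlb:                         # TLB hit: 1 cycle, no stall
        ppn = tlb[vpn]
    elif vpn in page_table:                # TLB miss: walk the page table
        ppn = tlb[vpn] = page_table[vpn]
    else:                                  # page fault: OS brings page in
        raise LookupError("page fault")
    return (ppn << PAGE_BITS) | offset
```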

Summary: Pipelines and Memory Hierarchies
- Why use a pipeline? It seems so complicated. What are some of the overheads?
- What is the deal with memory hierarchies? Why are the caches so small? Why not make them larger?
- Do I have to worry about any of this when I am writing code?
- RISC and CISC again: what is the big deal, and what is the difference?

Other Architectures
Motorola 68000 Family (circa late ’70s)
- Instructions can specify 8-, 16-, or 32-bit operands; registers are 32 bits.
- Has two register files, A and D, each with eight 32-bit registers (A is for addresses, D is for data). This saves bits in the opcode, since whether A or D is used is implicit in the instruction.
- Uses a two-address instruction set.
- Instructions are usually 16 bits but can carry up to 4 more 16-bit words.
- Supports many addressing modes.

Other Architectures
Intel x86 Family (circa late ’70s)
- Probably the least elegant design available.
- Uses 16-bit words, so only 64K unique addresses can be provided directly.
- Uses segment addressing to overcome this limit: shift the segment address left by 4 bits (multiply by 16) and add a 16-bit offset, giving an effective 20-bit address and thus 1MB of memory.
- Has 4 segment registers used to hold segment addresses: CS, DS, SS, ES.
- Has 4 general registers: AX, BX, CX, and DX.
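The segment arithmetic as a one-line sketch:

```python
def real_mode_address(segment, offset):
    """Physical address = (segment << 4) + offset, 20 bits (1MB)."""
    return ((segment << 4) + offset) & 0xFFFFF

print(hex(real_mode_address(0xF000, 0xFFF0)))  # 0xffff0
```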

Other Architectures
Sun SPARC Architecture (late 1980s)
- Scalable Processor ARChitecture, developed around the same time as the MIPS architecture.
- Very similar to MIPS except for register usage.
- Has a set of 8 global registers, like MIPS.
- Has a large set of registers, the register file, of which only a subset can be accessed at any time.
- Is a load/store architecture with three-address instructions.
- Has 32-bit registers, and all instructions are 32 bits.
- Designed to support procedure calls and returns efficiently.
- The number of registers in the register file varies with the implementation, but at most 24 can be accessed at once; those 24 plus the 8 globals give a register window.