COMP 3221 Microprocessors and Embedded Systems
Lecture 39: Cache & Virtual Memory Review
Saeid Nooshabadi (saeid@unsw.edu.au)
http://www.cse.unsw.edu.au/~cs3221
November, 2003

Review (#1/3)
Apply the Principle of Locality recursively:
- Reduce miss penalty? Add a (L2) cache.
- Manage memory to disk? Treat main memory as a cache for disk. Protection was included as a bonus; now it is critical. Use a page table of mappings rather than tag/data as in a cache.
- Virtual-to-physical memory translation too slow? Add a cache of virtual-to-physical address translations, called a TLB.

Review (#2/3)
- Virtual memory allows protected sharing of memory between processes, with less swapping to disk and less fragmentation than always-swap or base/bound segmentation.
- Spatial locality means the working set of pages is all that must be in memory for a process to run fairly well.
- A TLB reduces the performance cost of VM.
- A more compact representation is needed to reduce the memory cost of a simple 1-level page table (especially for 32- to 64-bit addresses).

Why Caches?
[Figure: processor vs. DRAM performance, 1980-2000 (y-axis: performance, x-axis: time). CPU performance grows ~60%/year ("Moore's Law") while DRAM improves only ~7%/year, so the processor-memory performance gap grows ~50%/year.]
Latency cliché: x86 did not have on-chip cache until 1989 (the first Intel CPU with on-chip cache); by 1999, the cache "tax" in die area was 37% of the Alpha 21164, 61% of the StrongARM SA-110, and 64% of the Pentium Pro.

Memory Hierarchy Pyramid
[Figure: pyramid with the CPU at the top ("upper" levels) and Level 1, Level 2, Level 3, ..., Level n below ("lower" levels); the size of memory at each level grows, and cost per MB falls, with increasing distance from the CPU.]
Principle of Locality (in time, in space) + a hierarchy of memories of different speed and cost; exploit both to improve cost-performance.

Why virtual memory? (#1/2)
- Protection: regions of the address space can be read-only, execute-only, ...
- Flexibility: portions of a program can be placed anywhere, without relocation (changing addresses).
- Expandability: can leave room in the virtual address space for objects to grow.
- Storage management: allocation/deallocation of variable-sized blocks is costly and leads to (external) fragmentation; paging solves this.

Why virtual memory? (#2/2)
- Generality: ability to run programs larger than the size of physical memory.
- Storage efficiency: retain only the most important portions of the program in memory.
- Concurrent I/O: execute other processes while loading/dumping a page.

Virtual Memory Review (#1/4)
A user program's view of memory:
- Contiguous
- Starts from some set address
- Infinitely large
- Is the only running program
Reality:
- Non-contiguous
- Starts wherever available memory is
- Finite size
- Many programs running at a time

Virtual Memory Review (#2/4)
Virtual memory provides:
- the illusion of contiguous memory
- all programs starting at the same set address
- the illusion of infinite memory
- protection

Virtual Memory Review (#3/4)
Implementation:
- Divide memory into "chunks" (pages).
- The operating system controls a page table that maps virtual addresses into physical addresses.
- Think of main memory as a cache for disk.
- The TLB is a cache for the page table.

Why a Translation Lookaside Buffer (TLB)?
- Paging is the most popular implementation of virtual memory (vs. base/bounds segmentation).
- Every paged virtual memory access must be checked against a page table entry in memory to provide protection.
- A cache of page table entries makes address translation fast: in the common case, translation needs no memory access.

Virtual Memory Review (#4/4)
Let's say we're fetching some data:
1. Check the TLB (input: VPN, output: PPN)
   - hit: use the translation
   - miss: check the page table (in memory)
     - page table hit: fetch the translation, return it to the TLB
     - page table miss: page fault; fetch the page from disk into memory, return the translation to the TLB
2. Check the cache (input: PPN, output: data)
   - hit: return the value
   - miss: fetch the value from memory
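
The same flow can be written out as a minimal C sketch (not from the original slides; the toy direct-mapped software TLB, identity page table, and always-miss cache below are assumptions chosen only to make the sequence of checks concrete and runnable):

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define PAGE_SHIFT 14              /* 16 KB pages, as in the exercises below */
#define TLB_SIZE   8               /* toy direct-mapped software TLB */

struct tlb_entry { bool valid; uint64_t vpn, ppn; };
static struct tlb_entry tlb[TLB_SIZE];

/* Toy "page table": identity-maps every page (stands in for an OS walk). */
static bool pagetable_lookup(uint64_t vpn, uint64_t *ppn) { *ppn = vpn; return true; }

/* Toy stand-ins for physical memory and the data cache. */
static uint32_t memory_read(uint64_t paddr) { return (uint32_t)paddr; }
static bool cache_lookup(uint64_t paddr, uint32_t *data) { (void)paddr; (void)data; return false; }

static uint32_t load_word(uint64_t vaddr) {
    uint64_t vpn    = vaddr >> PAGE_SHIFT;
    uint64_t offset = vaddr & ((1ULL << PAGE_SHIFT) - 1);
    uint64_t ppn;

    /* Step 1: check the TLB (input: VPN, output: PPN). */
    struct tlb_entry *e = &tlb[vpn % TLB_SIZE];
    if (e->valid && e->vpn == vpn) {
        ppn = e->ppn;                             /* TLB hit */
    } else {
        if (!pagetable_lookup(vpn, &ppn)) {       /* TLB miss: walk page table */
            /* page table miss: the OS would fetch the page from disk here */
        }
        e->valid = true; e->vpn = vpn; e->ppn = ppn;   /* refill the TLB */
    }
    uint64_t paddr = (ppn << PAGE_SHIFT) | offset;

    /* Step 2: check the cache with the physical address. */
    uint32_t data;
    if (!cache_lookup(paddr, &data))
        data = memory_read(paddr);                /* cache miss: go to memory */
    return data;
}

int main(void) {
    printf("0x%x\n", (unsigned)load_word(0x12345678));
    return 0;
}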

Paging/Virtual Memory Review
[Figure: User A and User B each see their own virtual memory (code, static, heap, and stack segments, conceptually unbounded). A's page table and B's page table map these, with a shared TLB caching translations, onto a single 64 MB physical memory.]

Three Advantages of Virtual Memory
1) Translation:
- A program can be given a consistent view of memory, even though physical memory is scrambled.
- Makes multiple processes reasonable.
- Only the most important part of a program (its "working set") must be in physical memory.
- Contiguous structures (like stacks) use only as much physical memory as necessary, yet can still grow later.

Three Advantages of Virtual Memory
2) Protection:
- Different processes are protected from each other.
- Different pages can be given special behavior (read only, invisible to user programs, etc.).
- Privileged data is protected from user programs.
- Very important for protection from malicious programs (far more "viruses" under Microsoft Windows).
3) Sharing:
- Can map the same physical page into multiple processes ("shared memory").

4 Questions for Memory Hierarchy
- Q1: Where can a block be placed in the upper level? (Block placement)
- Q2: How is a block found if it is in the upper level? (Block identification)
- Q3: Which block should be replaced on a miss? (Block replacement)
- Q4: What happens on a write? (Write strategy)

Q1: Where is a block placed in the upper level?
[Figure: block 12 of memory placed into an 8-block cache under three organizations.]
- Fully associative: block 12 can go anywhere.
- Direct mapped: block 12 can go only into block 4 (12 mod 8).
- 2-way set associative: block 12 can go anywhere in set 0 (12 mod 4).
Set-associative mapping: set = block number mod number of sets.
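
As a small illustration (not from the slides), the three placement rules can be computed directly; the numbers match the slide's example (8-block cache, block address 12):

#include <stdio.h>

int main(void) {
    int block = 12, cache_blocks = 8, ways = 2;
    int sets = cache_blocks / ways;                 /* 4 sets when 2-way */

    /* Direct mapped: exactly one legal frame. */
    printf("direct mapped: frame %d\n", block % cache_blocks);  /* 12 mod 8 = 4 */

    /* 2-way set associative: any frame within one set. */
    printf("2-way set assoc: set %d\n", block % sets);          /* 12 mod 4 = 0 */

    /* Fully associative: any of the 8 frames (one big set). */
    printf("fully assoc: any of %d frames\n", cache_blocks);
    return 0;
}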

Q2: How is a block found in the upper level?
The block address is divided into a tag and an index; the remaining low-order bits are the block offset. Lookup uses direct indexing (the index selects the set, the block offset selects data within the block), followed by tag comparison. Increasing associativity shrinks the index and expands the tag.
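
A short sketch of block identification: extracting the tag, index, and offset fields with shifts and masks. The field widths below are one possible choice (they happen to match Exercise 3 later in these slides); the address value is arbitrary:

#include <stdint.h>
#include <stdio.h>

int main(void) {
    /* Example widths: 64-byte blocks (6 offset bits), 512 sets (9 index bits). */
    const int offset_bits = 6, index_bits = 9;
    uint64_t addr = 0x12345678;

    uint64_t offset = addr & ((1ULL << offset_bits) - 1);
    uint64_t index  = (addr >> offset_bits) & ((1ULL << index_bits) - 1);
    uint64_t tag    = addr >> (offset_bits + index_bits);

    printf("tag=%llx index=%llx offset=%llx\n",
           (unsigned long long)tag, (unsigned long long)index,
           (unsigned long long)offset);
    return 0;
}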

Q3: Which block is replaced on a miss?
- Easy for direct mapped: there is only one candidate.
- Set associative or fully associative: Random, or LRU (least recently used).
Miss rates, LRU vs. Random:

  Size     2-way LRU/Ran     4-way LRU/Ran     8-way LRU/Ran
  16 KB    5.2%  / 5.7%      4.7%  / 5.3%      4.4%  / 5.0%
  64 KB    1.9%  / 2.0%      1.5%  / 1.7%      1.4%  / 1.5%
  256 KB   1.15% / 1.17%     1.13% / 1.13%     1.12% / 1.12%
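
For a 2-way set-associative cache, LRU is especially cheap: a single bit per set suffices. A minimal sketch, with all structure assumed for illustration:

#include <stdio.h>
#include <stdbool.h>

#define SETS 512

static bool mru_is_way1[SETS];   /* one LRU bit per set: which way was used last */

/* Record an access to `way` (0 or 1) of `set`. */
static void lru_touch(int set, int way) { mru_is_way1[set] = (way == 1); }

/* On a miss, evict the way that was NOT most recently used. */
static int lru_victim(int set) { return mru_is_way1[set] ? 0 : 1; }

int main(void) {
    lru_touch(7, 0);                                     /* way 0 of set 7 used */
    printf("victim in set 7: way %d\n", lru_victim(7));  /* -> way 1 */
    return 0;
}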

Q4: What happens on a write?
- Write through: the information is written both to the block in the cache and to the block in the lower-level memory.
- Write back: the information is written only to the block in the cache; the modified cache block is written to main memory only when it is replaced. Needs a dirty bit: is the block clean or dirty?
Pros and cons of each?
- WT: read misses cannot result in writes.
- WB: repeated writes to a block cause only one write to memory.
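
A sketch contrasting the two policies on a write hit (the structures are hypothetical, and write-miss allocation policy is ignored for brevity):

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

struct line { bool valid, dirty; uint64_t tag; uint8_t data[64]; };

/* Stub for the lower level of the hierarchy. */
static void memory_write(uint64_t paddr, uint8_t byte) {
    printf("memory[0x%llx] <- %u\n", (unsigned long long)paddr, (unsigned)byte);
}

/* Write through: update the cache block AND lower-level memory. */
static void write_through(struct line *l, uint64_t paddr, int off, uint8_t byte) {
    l->data[off] = byte;
    memory_write(paddr, byte);       /* lower level always up to date */
}

/* Write back: update only the cache block and mark it dirty;
   memory is updated only when the block is later evicted. */
static void write_back(struct line *l, int off, uint8_t byte) {
    l->data[off] = byte;
    l->dirty = true;
}

int main(void) {
    struct line l = {0};
    write_through(&l, 0x1000, 0, 42);
    write_back(&l, 1, 43);
    printf("dirty=%d\n", l.dirty);
    return 0;
}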

3D Graphics for Mobile Phones
Developed in collaboration with Imagination Technologies, MBX 2D and 3D accelerator cores deliver PC- and console-quality 3D graphics on embedded ARM-based devices. Supporting the feature set and performance level of commodity PC hardware, MBX cores use a unique screen-tiling technology to reduce memory bandwidth and power consumption to levels suited to mobile devices, providing excellent price-performance for embedded SoC devices.
- 660K gates (870K with optional VGP geometry processor)
- 80 MHz operation in 0.18 µm process; over 120 MHz in 0.13 µm process
- Up to 500 megapixel/sec effective fill rate
- Up to 2.5 million triangle/sec rendering rate
- Suited to QVGA (320x240) up to VGA (640x480) resolution screens
- <1 mW/MHz in 0.13 µm process and <2 mW in 0.18 µm process
- Optional VGP floating-point geometry engine compatible with the Microsoft VertexShader specification
- 2D and 3D graphics acceleration and video acceleration
- Screen tiling and deferred texturing: only visible pixels are rendered
- Internal Z-buffer tile within the MBX core
- Flat and Gouraud shading
- Bilinear, trilinear and anisotropic texture filtering
- Ambient, diffuse and specular lighting
- 2-layer multi-texturing
- Full-scene anti-aliasing: 2x2, 2x1 or 1x2
- YUV to RGB conversion
- Environmental bump mapping
- Various texture formats, plus texture compression
- Alpha blending
- 32-bit internal rendering, even with a 16-bit frame buffer
- Render to texture and render to window (multiple windows)
- Optional Vertex Geometry Processor (VGP): hardware transform and lighting (TnL) engine
- AMBA® AHB interface
- ARM multi-port memory controller interface
http://news.zdnet.co.uk/0,39020330,39117384,00.htm

Address Translation & 3 Exercises
[Figure: TLB organization. The virtual address is split into a virtual page number (VPN) and a page offset; the VPN is further split into a TLB tag and a TLB index (VPN = TLB tag + index). The index selects a TLB set, the stored TLB tags are compared against the VPN tag, and a hit yields the physical page number (PPN). Physical address = PPN concatenated with the page offset.]

Address Translation Exercise 1 (#1/2)
40-bit VA, 16 KB pages, 36-bit PA.
- Number of bits in the virtual page number? a) 18; b) 20; c) 22; d) 24; e) 26; f) 28
  Answer: e) 26 (40 - 14 = 26)
- Number of bits in the page offset? a) 8; b) 10; c) 12; d) 14; e) 16; f) 18
  Answer: d) 14 (16 KB = 2^14 B)
- Number of bits in the physical page number?
  Answer: c) 22 (36 - 14 = 22)

Address Translation Exercise 1 (#2/2)
40-bit virtual address, 36-bit physical address, 16 KB (2^14 B) pages:
  Virtual address  = Virtual Page Number (26 bits) | Page Offset (14 bits)
  Physical address = Physical Page Number (22 bits) | Page Offset (14 bits)
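
The bit-width arithmetic in Exercise 1 is mechanical; a small sketch (not part of the original exercise) that reproduces the answers:

#include <stdio.h>

/* log2 of a power-of-two size, computed by shifting. */
static int log2_bits(unsigned long bytes) {
    int n = 0;
    while (bytes > 1) { bytes >>= 1; n++; }
    return n;
}

int main(void) {
    int va_bits = 40, pa_bits = 36;
    int offset  = log2_bits(16 * 1024);            /* 16 KB pages -> 14 bits */
    printf("page offset: %d bits\n", offset);      /* 14 */
    printf("VPN: %d bits\n", va_bits - offset);    /* 26 */
    printf("PPN: %d bits\n", pa_bits - offset);    /* 22 */
    return 0;
}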

Address Translation Exercise 2 (#1/2)
40-bit VA, 16 KB pages, 36-bit PA. 2-way set-associative TLB: 256 "slots" (sets), 2 entries per slot.
- Number of bits in the TLB index? a) 8; b) 10; c) 12; d) 14; e) 16; f) 18
  Answer: a) 8 (256 = 2^8 sets)
- Number of bits in the TLB tag? a) 18; b) 20; c) 22; d) 24; e) 26; f) 28
  Answer: a) 18 (26-bit VPN - 8-bit index)
- Approximate number of bits in a TLB entry? a) 32; b) 36; c) 40; d) 42; e) 44; f) 46
  Answer: f) 46

Address Translation Exercise 2 (#2/2)
2-way set-associative TLB, 256 (2^8) "slots", 2 TLB entries per slot => 8-bit index.
Virtual address breakdown:
  Virtual Page Number (26 bits) = TLB Tag (18 bits) | TLB Index (8 bits), followed by Page Offset (14 bits)
TLB entry: valid bit, dirty bit, access control (2-3 bits?), TLB tag, physical page number:
  V | D | Access (3 bits) | TLB Tag (18 bits) | Physical Page No. (22 bits)
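
Continuing in the same style, a sketch (not part of the original exercise) reproducing Exercise 2's TLB numbers; the 3-bit access-control field follows the slide's own assumption:

#include <stdio.h>

int main(void) {
    int vpn_bits = 26;                        /* from Exercise 1 */
    int sets = 256, index = 0;                /* 2-way TLB, 256 sets */
    while ((1 << index) < sets) index++;      /* index = 8 */
    int tag = vpn_bits - index;               /* tag = 18 */
    int ppn_bits = 22;
    int entry = 1 + 1 + 3 + tag + ppn_bits;   /* V + D + access + tag + PPN = 45 */
    printf("TLB: index=%d tag=%d entry=%d (~46) bits\n", index, tag, entry);
    return 0;
}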

Address Translation Exercise 3 (#1/2)
40-bit VA, 16 KB pages, 36-bit PA. 2-way set-associative TLB: 256 "slots", 2 per slot. 64 KB data cache, 64-byte blocks, 2-way set associative.
- Number of bits in the cache offset? a) 6; b) 8; c) 10; d) 12; e) 14; f) 16
  Answer: a) 6 (64 B = 2^6 B)
- Number of bits in the cache index? a) 6; b) 9; c) 10; d) 12; e) 14; f) 16
  Answer: b) 9 (64 KB / 64 B = 1024 blocks; 1024 / 2 ways = 512 = 2^9 sets)
- Number of bits in the cache tag? a) 18; b) 20; c) 21; d) 24; e) 26; f) 28
  Answer: c) 21 (36 - 9 - 6 = 21)
- Approximate number of bits in a cache entry? (see next slide)

Address Translation Exercise 3 (#2/2)
2-way set-associative data cache: 64 KB / 64 B = 1K (2^10) blocks, 2 blocks per slot => 512 slots => 9-bit index.
Physical address (36 bits) breakdown:
  Cache Tag (21 bits) | Cache Index (9 bits) | Block Offset (6 bits)
Cache entry: valid bit, dirty bit, cache tag + 64 bytes of data:
  V | D | Cache Tag (21 bits) | Cache Data (64 bytes)
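
And the same arithmetic for Exercise 3's cache, as a sketch (not part of the original exercise); the last line answers the slide's remaining question using the V + D + tag + data breakdown above:

#include <stdio.h>

int main(void) {
    int pa_bits = 36;
    int cache_bytes = 64 * 1024, block_bytes = 64, ways = 2;

    int offset = 0, index = 0;
    while ((1 << offset) < block_bytes) offset++;   /* offset = 6 */
    int sets = cache_bytes / block_bytes / ways;    /* 512 sets  */
    while ((1 << index) < sets) index++;            /* index = 9 */
    int tag = pa_bits - index - offset;             /* tag = 21  */

    int entry = 1 + 1 + tag + 8 * block_bytes;      /* V + D + tag + data */
    printf("offset=%d index=%d tag=%d entry~=%d bits\n",
           offset, index, tag, entry);              /* 6, 9, 21, 535 */
    return 0;
}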

Cache/VM/TLB Summary (#1/3)
The Principle of Locality: programs access a relatively small portion of the address space at any instant of time.
- Temporal locality: locality in time.
- Spatial locality: locality in space.
Caches, TLBs, and virtual memory can all be understood by examining how they deal with 4 questions:
1) Where can a block be placed?
2) How is a block found?
3) Which block is replaced on a miss?
4) How are writes handled?

Cache/VM/TLB Summary (#2/3)
Virtual memory allows protected sharing of memory between processes, with less swapping to disk and less fragmentation than always-swap or base/bound segmentation. Three problems and their fixes:
1) Not enough memory: spatial locality means a small working set of pages is OK.
2) Translation cost: a TLB reduces the performance cost of VM.
3) Page table size: a more compact representation is needed to reduce the memory cost of a simple 1-level page table, especially for 64-bit addresses (see COMP3231).

Cache/VM/TLB Summary (#3/3)
- Virtual memory was controversial at the time: can software automatically manage 64 KB across many programs? 1000X DRAM growth removed the controversy.
- Today VM allows many processes to share a single memory without having to swap all processes to disk; VM protection today is more important than the memory-hierarchy benefit.
- Today CPU time is a function of (ops, cache misses) vs. just f(ops): what does this mean to compilers, data structures, algorithms?

[Speaker notes] Let's do a short review of what you learned last time. Virtual memory was originally invented as another level of the memory hierarchy, so that programmers, faced with main memory much smaller than their programs, did not have to manage loading and unloading portions of their program in and out of memory. It was a controversial proposal at the time because very few programmers believed software could manage the limited memory resource as well as a human. This all changed as DRAM sizes grew exponentially over the last few decades. Nowadays, the main function of virtual memory is to allow multiple processes to share the same main memory, so we don't have to swap all the non-active processes to disk. Consequently, the most important function of virtual memory these days is to provide memory protection. The most common technique (but, we would like to emphasize, not the only technique) for translating virtual memory addresses to physical memory addresses is to use a page table. The TLB, or translation lookaside buffer, is one of the most popular hardware techniques for reducing address translation time. Since the TLB is so effective at reducing address translation time, TLB misses have a significant negative impact on processor performance.