CSC 7080 Graduate Computer Architecture Lec 12 – Advanced Memory Hierarchy 2 Dr. Khalaf Notes adapted from: David Patterson Electrical Engineering and.

Slides:



Advertisements
Similar presentations
CS136, Advanced Architecture
Advertisements

1 Adapted from UCB CS252 S01, Revised by Zhao Zhang in IASTATE CPRE 585, 2004 Lecture 14: Hardware Approaches for Cache Optimizations Cache performance.
Bart Miller. Outline Definition and goals Paravirtualization System Architecture The Virtual Machine Interface Memory Management CPU Device I/O Network,
CSIE30300 Computer Architecture Unit 10: Virtual Memory Hsin-Chou Chi [Adapted from material by and
Virtual Memory Hardware Support
CSC 4250 Computer Architectures December 8, 2006 Chapter 5. Memory Hierarchy.
CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture Virtual Memory I Steve Ko Computer Sciences and Engineering University at Buffalo.
Virtual Memory Adapted from lecture notes of Dr. Patterson and Dr. Kubiatowicz of UC Berkeley.
CPSC 614 Computer Architecture Advanced Memory Hierarchy 2 Hank Walker Dept. of Computer Science Texas A&M University
S.1 Review: The Memory Hierarchy Increasing distance from the processor in access time L1$ L2$ Main Memory Secondary Memory Processor (Relative) size of.
1  1998 Morgan Kaufmann Publishers Chapter Seven Large and Fast: Exploiting Memory Hierarchy.
1 Chapter Seven Large and Fast: Exploiting Memory Hierarchy.
EENG449b/Savvides Lec /13/04 April 13, 2004 Prof. Andreas Savvides Spring EENG 449bG/CPSC 439bG Computer.
1 COMP 206: Computer Architecture and Implementation Montek Singh Wed, Nov 9, 2005 Topic: Caches (contd.)
Memory: Virtual MemoryCSCE430/830 Memory Hierarchy: Virtual Memory CSCE430/830 Computer Architecture Lecturer: Prof. Hong Jiang Courtesy of Yifeng Zhu.
1 SRAM: –value is stored on a pair of inverting gates –very fast but takes up more space than DRAM (4 to 6 transistors) DRAM: –value is stored as a charge.
1  1998 Morgan Kaufmann Publishers Chapter Seven Large and Fast: Exploiting Memory Hierarchy (Part II)
1 CSE SUNY New Paltz Chapter Seven Exploiting Memory Hierarchy.
Zen and the Art of Virtualization Paul Barham, et al. University of Cambridge, Microsoft Research Cambridge Published by ACM SOSP’03 Presented by Tina.
Chapter 2: Memory Hierarchy Design David Patterson Electrical Engineering and Computer Sciences University of California, Berkeley
Chapter 5. Outline (2nd part)
Review of Mem. HierarchyCSCE430/830 Review of Memory Hierarchy & Storage CSCE430/830 Computer Architecture Lecturer: Prof. Hong Jiang Fall, 2008 Portions.
Advanced Memory Hierarchy. 10/3/20152 Outline 11 Advanced Cache Optimizations Memory Technology and DRAM optimizations Virtual Machines Xen VM: Design.
CSE431 L22 TLBs.1Irwin, PSU, 2005 CSE 431 Computer Architecture Fall 2005 Lecture 22. Virtual Memory Hardware Support Mary Jane Irwin (
CPE 731 Advanced Computer Architecture Advanced Memory Hierarchy Dr. Gheith Abandah Adapted from the slides of Prof. David Patterson, University of California,
Lecture 19: Virtual Memory
1  2004 Morgan Kaufmann Publishers Multilevel cache Used to reduce miss penalty to main memory First level designed –to reduce hit time –to be of small.
Lecture 9: Memory Hierarchy Virtual Memory Kai Bu
 Virtual machine systems: simulators for multiple copies of a machine on itself.  Virtual machine (VM): the simulated machine.  Virtual machine monitor.
1 Virtual Memory Main memory can act as a cache for the secondary storage (disk) Advantages: –illusion of having more physical memory –program relocation.
Virtual Memory. Virtual Memory: Topics Why virtual memory? Virtual to physical address translation Page Table Translation Lookaside Buffer (TLB)
Virtual Memory Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University.
Memory Architecture Chapter 5 in Hennessy & Patterson.
Chapter 5 Memory III CSE 820. Michigan State University Computer Science and Engineering Miss Rate Reduction (cont’d)
Chapter 5: Memory Hierarchy Design
Multilevel Caches Microprocessors are getting faster and including a small high speed cache on the same chip.
Full and Para Virtualization
Processor Structure and Function Chapter8:. CPU Structure  CPU must:  Fetch instructions –Read instruction from memory  Interpret instructions –Instruction.
1 CPRE 585 Term Review Performance evaluation, ISA design, dynamically scheduled pipeline, and memory hierarchy.
1 Chapter Seven CACHE MEMORY AND VIRTUAL MEMORY. 2 SRAM: –value is stored on a pair of inverting gates –very fast but takes up more space than DRAM (4.
CS.305 Computer Architecture Memory: Virtual Adapted from Computer Organization and Design, Patterson & Hennessy, © 2005, and from slides kindly made available.
CSC 7080 Graduate Computer Architecture Lec 8 – Multiprocessors & Thread- Level Parallelism (3) – Sun T1 Dr. Khalaf Notes adapted from: David Patterson.
Protection of Processes Security and privacy of data is challenging currently. Protecting information – Not limited to hardware. – Depends on innovation.
1 Adapted from UC Berkeley CS252 S01 Lecture 18: Reducing Cache Hit Time and Main Memory Design Virtucal Cache, pipelined cache, cache summary, main memory.
Constructive Computer Architecture Virtual Memory: From Address Translation to Demand Paging Arvind Computer Science & Artificial Intelligence Lab. Massachusetts.
1 Chapter Seven. 2 SRAM: –value is stored on a pair of inverting gates –very fast but takes up more space than DRAM (4 to 6 transistors) DRAM: –value.
1 Adapted from UC Berkeley CS252 S01 Lecture 17: Reducing Cache Miss Penalty and Reducing Cache Hit Time Hardware prefetching and stream buffer, software.
CS203 – Advanced Computer Architecture Virtual Memory.
Memory Hierarchy— Five Ways to Reduce Miss Penalty.
CS161 – Design and Architecture of Computer
CMSC 611: Advanced Computer Architecture
ECE232: Hardware Organization and Design
Memory COMPUTER ARCHITECTURE
CS161 – Design and Architecture of Computer
CS352H: Computer Systems Architecture
From Address Translation to Demand Paging
From Address Translation to Demand Paging
CSC 4250 Computer Architectures
CS 704 Advanced Computer Architecture
Virtual Memory Use main memory as a “cache” for secondary (disk) storage Managed jointly by CPU hardware and the operating system (OS) Programs share main.
5.2 Eleven Advanced Optimizations of Cache Performance
Morgan Kaufmann Publishers Large and Fast: Exploiting Memory Hierarchy
OS Virtualization.
Lecture 14 Virtual Memory and the Alpha Memory Hierarchy
CMSC 611: Advanced Computer Architecture
Lecture 14: Reducing Cache Misses
David Patterson Electrical Engineering and Computer Sciences
Virtual Memory Overcoming main memory size limitation
Virtual Memory Lecture notes from MKP and S. Yalamanchili.
Virtual Memory Use main memory as a “cache” for secondary (disk) storage Managed jointly by CPU hardware and the operating system (OS) Programs share main.
Presentation transcript:

CSC 7080 Graduate Computer Architecture Lec 12 – Advanced Memory Hierarchy 2 Dr. Khalaf Notes adapted from: David Patterson Electrical Engineering and Computer Sciences University of California, Berkeley

5/1/07CSC LSU - Dr. Khalaf2 Review [1] Memory wall inspires optimizations since so much performance lost there –Reducing hit time: Small and simple caches, Way prediction, Trace caches –Increasing cache bandwidth: Pipelined caches, Multibanked caches, Nonblocking caches –Reducing Miss Penalty: Critical word first, Merging write buffers –Reducing Miss Rate: Compiler optimizations –Reducing miss penalty or miss rate via parallelism: Hardware prefetching, Compiler prefetching “Auto-tuners” search replacing static compilation to explore optimization space? DRAM – Continuing Bandwidth innovations: Fast page mode, Synchronous, Double Data Rate

5/1/07CSC LSU - Dr. Khalaf3 Review [2/2] VM Monitor presents a SW interface to guest software, isolates state of guests, and protects itself from guest software (including guest OSes) Virtual Machine Revival –Overcome security flaws of large OSes –Manage Software, Manage Hardware –Processor performance no longer highest priority

5/1/07CSC LSU - Dr. Khalaf4 Outline Review AMD Opteron Memory Hierarchy Opteron Memory Performance vs. Pentium 4 Fallacies and Pitfalls Conclusion

5/1/07CSC LSU - Dr. Khalaf5 Protection and Instruction Set Architecture Example Problem: 80x86 POPF instruction loads flag registers from top of stack in memory –One such flag is Interrupt Enable (IE) –In system mode, POPF changes IE –In user mode, POPF simply changes all flags except IE –Problem: guest OS runs in user mode inside a VM, so it expects to see changed a IE, but it won’t Historically, IBM mainframe HW and VMM took 3 steps: 1.Reduce cost of processor virtualization –Intel/AMD proposed ISA changes to reduce this cost 2.Reduce interrupt overhead cost due to virtualization 3.Reduce interrupt cost by steering interrupts to proper VM directly without invoking VMM 2. and 3. not yet addressed by Intel/AMD; in the future?

5/1/07CSC LSU - Dr. Khalaf6 80x86 VM Challenges 18 instructions cause problems for virtualization: 1.Read control registers in user model that reveal that the guest operating system in running in a virtual machine (such as POPF), and 2.Check protection as required by the segmented architecture but assume that the operating system is running at the highest privilege level Virtual memory: 80x86 TLBs do not support process ID tags  more expensive for VMM and guest OSes to share the TLB –each address space change typically requires a TLB flush

5/1/07CSC LSU - Dr. Khalaf7 Intel/AMD address 80x86 VM Challenges Goal is direct execution of VMs on 80x86 Intel's VT-x –A new execution mode for running VMs –An architected definition of the VM state –Instructions to swap VMs rapidly –Large set of parameters to select the circumstances where a VMM must be invoked –VT-x adds 11 new instructions to 80x86 Xen 3.0 plan proposes to use VT-x to run Windows on Xen AMD’s Pacifica makes similar proposals –Plus indirection level in page table like IBM VM 370 Ironic adding a new mode –If OS start using mode in kernel, new mode would cause performance problems for VMM since  100 times too slow

5/1/07CSC LSU - Dr. Khalaf8 Outline Review AMD Opteron Memory Hierarchy Opteron Memory Performance vs. Pentium 4 Fallacies and Pitfalls Discuss “Virtual Machines” paper Conclusion

5/1/07CSC LSU - Dr. Khalaf9 AMD Opteron Memory Hierarchy 12-stage integer pipeline yields a maximum clock rate of 2.8 GHz and fastest memory PC3200 DDR SDRAM 48-bit virtual and 40-bit physical addresses I and D cache: 64 KB, 2-way set associative, 64-B block, LRU L2 cache: 1 MB, 16-way, 64-B block, pseudo LRU Data and L2 caches use write back, write allocate L1 caches are virtually indexed and physically tagged L1 I TLB and L1 D TLB: fully associative, 40 entries –32 entries for 4 KB pages and 8 for 2 MB or 4 MB pages L2 I TLB and L1 D TLB: 4-way, 512 entities of 4 KB pages Memory controller allows up to 10 cache misses –8 from D cache and 2 from I cache

5/1/07CSC LSU - Dr. Khalaf10 Opteron Memory Hierarchy Performance For SPEC2000 –I cache misses per instruction is 0.01% to 0.09% –D cache misses per instruction are 1.34% to 1.43% –L2 cache misses per instruction are 0.23% to 0.36% Commercial benchmark (“TPC-C-like”) –I cache misses per instruction is 1.83% (100X!) –D cache misses per instruction are 1.39% (  same) –L2 cache misses per instruction are 0.62% (2X to 3X) How compare to ideal CPI of 0.33?

5/1/07CSC LSU - Dr. Khalaf11 CPI breakdown for Integer Programs CPI above base attributable to memory  50% L2 cache misses  25% overall (50% memory CPI) –Assumes misses are not overlapped with the execution pipeline or with each other, so the pipeline stall portion is a lower bound

5/1/07CSC LSU - Dr. Khalaf12 CPI breakdown for Floating Pt. Programs CPI above base attributable to memory  60% L2 cache misses  40% overall (70% memory CPI) –Assumes misses are not overlapped with the execution pipeline or with each other, so the pipeline stall portion is a lower bound

5/1/07CSC LSU - Dr. Khalaf13 Pentium 4 vs. Opteron Memory Hierarchy CPUPentium 4 (3.2 GHz*)Opteron (2.8 GHz*) Instruction Cache Trace Cache (8K micro-ops) 2-way associative, 64 KB, 64B block Data Cache 8-way associative, 16 KB, 64B block, inclusive in L2 2-way associative, 64 KB, 64B block, exclusive to L2 L2 cache 8-way associative, 2 MB, 128B block 16-way associative, 1 MB, 64B block Prefetch8 streams to L21 stream to L2 Memory200 MHz x 64 bits200 MHz x 128 bits *Clock rate for this comparison in 2005; faster versions existed

5/1/07CSC LSU - Dr. Khalaf14 Misses Per Instruction: Pentium 4 vs. Opteron  Opteron better  Pentium better D cache miss: P4 is 2.3X to 3.4X vs. Opteron L2 cache miss: P4 is 0.5X to 1.5X vs. Opteron Note: Same ISA, but not same instruction count 2.3X 3.4X 0.5X 1.5X

5/1/07CSC LSU - Dr. Khalaf15 Fallacies and Pitfalls Not delivering high memory bandwidth in a cache-based system –10 Fastest computers at Stream benchmark [McCalpin 2005] –Only 4/10 computers rely on data caches, and their memory BW per processor is 7X to 25X slower than NEC SX7

5/1/07CSC LSU - Dr. Khalaf16 And in Conclusion Virtual Machine Revival –Overcome security flaws of modern OSes –Processor performance no longer highest priority –Manage Software, Manage Hardware “… VMMs give OS developers another opportunity to develop functionality no longer practical in today’s complex and ossified operating systems, where innovation moves at geologic pace.” [Rosenblum and Garfinkel, 2005] Virtualization challenges for processor, virtual memory, I/O –Paravirtualization, ISA upgrades to cope with those difficulties Xen as example VMM using paravirtualization –2005 performance on non-I/O bound, I/O intensive apps: 80% of native Linux without driver VM, 34% with driver VM Opteron memory hierarchy still critical to performance