Exploiting eDRAM Bandwidth with Data Prefetching: Simulation and Measurements
Presented by David Wolinsky

Outline
- eDRAM
- Prefetching
- Blue Gene/L Overview
- Cache System Overview
- System Analysis
- Conclusion

eDRAM
Embedded DRAM is DRAM integrated directly on the processor chip, enabling wider buses and higher operating speeds.
Benefits over SRAM:
- Reduced cost
- Greater density (more capacity)
- Lower power
Its higher latency is hidden by the increased bandwidth.

Prefetching (1)
- Fetches data into the cache before it is needed, using streaming.
- Once a stream is valid, a new line is prefetched for each subsequent data request.
- Two streaming methods: N-deep history and optimistic.

Prefetching (2)
N-deep history streaming:
- On an L1 miss, L2 records the address in a stream detection buffer.
- If a later L2 request matches an entry (the next sequential line), a stream is established.
- Subsequent accesses on the stream trigger prefetch requests.
Optimistic streaming:
- Issues a prefetch of the next line for each new L2 request not already in the prefetch buffer.
In both schemes, streams are maintained until displaced, and the prefetch buffer lines are fully associative. A sketch of both policies appears below.
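
Below is a minimal sketch in C of the two detection policies as I read them from this slide; the constants, names, and data structures are illustrative assumptions, not the actual BG/L hardware logic.

    #include <stdbool.h>
    #include <stdint.h>

    #define LINE_BYTES     128  /* L3/prefetch line size from the talk */
    #define DETECTOR_SLOTS  16  /* stream detector size chosen for BG/L */

    static uint64_t detector[DETECTOR_SLOTS];  /* line addresses of recent misses */
    static bool     valid[DETECTOR_SLOTS];
    static int      next_slot = 0;

    /* N-deep history: record each miss; establish a stream only once the
       next sequential line is itself requested. */
    bool history_should_prefetch(uint64_t addr)
    {
        uint64_t line = addr / LINE_BYTES;
        for (int i = 0; i < DETECTOR_SLOTS; i++) {
            if (valid[i] && detector[i] == line - 1)
                return true;                 /* sequential pattern confirmed */
        }
        detector[next_slot] = line;          /* remember miss for later matches */
        valid[next_slot] = true;
        next_slot = (next_slot + 1) % DETECTOR_SLOTS;
        return false;
    }

    /* Optimistic: prefetch the next line for every new request. */
    bool optimistic_should_prefetch(uint64_t addr)
    {
        (void)addr;                          /* no history consulted */
        return true;                         /* always fetch line + 1 */
    }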

Blue Gene/L Overview
- 65,536 nodes
- 2 PPC440 cores per node (system-on-chip)
- SIMD FPU on each chip

Cache System Overview
- 32 kB L1 cache with 32 B lines
- Private 2 kB prefetch SRAM with 128 B lines
- 4 MB L3 eDRAM with 128 B lines
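
For scale, the capacities above imply the following line counts per level; this is simple arithmetic from the slide, shown as a quick C check:

    #include <stdio.h>

    int main(void)
    {
        printf("L1:       %d lines\n", 32 * 1024 / 32);         /*  1024 lines of  32 B */
        printf("prefetch: %d lines\n", 2 * 1024 / 128);         /*    16 lines of 128 B */
        printf("L3:       %d lines\n", 4 * 1024 * 1024 / 128);  /* 32768 lines of 128 B */
        return 0;
    }

The 16-line prefetch SRAM is the same scale as the 15 line buffers analyzed below, though the slides do not say whether these are the same structure.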

System Analysis (1): Optimal stream detector size
- NAS benchmarks: optimal size is 16
- Splash-2 benchmarks: optimal size is 8
- The actual hardware uses 16

System Analysis (2): Optimal number of line buffers
- 15 was chosen for the best performance/cost ratio

System Analysis (3): Line buffer replacement policy
- LRU is fastest but expensive to implement.
- RRMRU uses round robin but never replaces the 3 most recently used lines; BG/L uses RRMRU (see the sketch below).
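
A minimal sketch of how RRMRU victim selection could work, assuming 15 line buffers and a protected list of the 3 most recently used entries; the counts come from the talk, but the code itself is my own illustration:

    #include <stdbool.h>

    #define NUM_BUFFERS 15  /* line buffer count from the talk */
    #define PROTECTED    3  /* MRU entries RRMRU never evicts */

    static int mru[PROTECTED] = { -1, -1, -1 };  /* most recently used indices */
    static int rr_ptr = 0;                       /* round-robin pointer */

    static bool is_protected(int idx)
    {
        for (int i = 0; i < PROTECTED; i++)
            if (mru[i] == idx)
                return true;
        return false;
    }

    /* Round-robin over the buffers, skipping the 3 most recently used. */
    int rrmru_pick_victim(void)
    {
        while (is_protected(rr_ptr))
            rr_ptr = (rr_ptr + 1) % NUM_BUFFERS;
        int victim = rr_ptr;
        rr_ptr = (rr_ptr + 1) % NUM_BUFFERS;
        return victim;
    }

    /* Call on every access so the MRU list stays current. */
    void rrmru_touch(int idx)
    {
        int i = 0;
        while (i < PROTECTED && mru[i] != idx)
            i++;                             /* find idx if already listed */
        if (i == PROTECTED)
            i = PROTECTED - 1;               /* otherwise drop the oldest */
        for (; i > 0; i--)
            mru[i] = mru[i - 1];
        mru[0] = idx;
    }

Unlike true LRU, this needs only a pointer and three registers rather than a full recency ordering, which matches the slide's point that full LRU is expensive in hardware.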

System Analysis (4): Bidirectional vs. unidirectional streams
- Results show that bidirectional streams help a little, but are not performance/cost optimal (see the sketch below).
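
For reference, supporting bidirectional streams only changes the detector's match test: a stream can be established on a hit in either direction. A sketch under the same illustrative assumptions as the earlier detector code:

    #include <stdint.h>

    typedef enum { DIR_NONE, DIR_UP, DIR_DOWN } stream_dir;

    /* Match recorded misses in both directions; the direction is returned
       so later prefetches can walk the stream the right way. */
    stream_dir bidir_match(uint64_t line, const uint64_t *detector, int n)
    {
        for (int i = 0; i < n; i++) {
            if (detector[i] == line - 1) return DIR_UP;    /* ascending stream  */
            if (detector[i] == line + 1) return DIR_DOWN;  /* descending stream */
        }
        return DIR_NONE;
    }

Roughly, the extra comparison per entry (plus the direction state) is the cost side of the trade-off the slide describes.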

System Analysis (5): Prefetching in simulator and hardware
- The stream detectors are very competitive with an optimal detector.
- Prefetching performs far better than no prefetching.
- The simulator was within 10-20% of the hardware's measured performance.

Conclusion
- No direct comparison to L2 SRAM caches was made.
- The scheme does not work well on Linux with standard 4 kB pages.
- The simulator provides a reasonably accurate way to test different memory configurations before implementing them in hardware.
- eDRAM with prefetching is much faster than without prefetching.