Software and Hardware Support for Locality-Aware High Performance Computing
Xiaodong Zhang, National Science Foundation and College of William and Mary
(This talk does not necessarily reflect NSF's official opinions.)

Acknowledgement
- Participants of the project:
  - David Bryan, Jefferson Labs (DOE)
  - Stefan Kubricht, Vsys Inc.
  - Song Jiang and Zhichun Zhu, William and Mary
  - Li Xiao, Michigan State University
  - Yong Yan, HP Labs
  - Zhao Zhang, Iowa State University
- Sponsors of the project:
  - Air Force Office of Scientific Research
  - National Science Foundation
  - Sun Microsystems Inc.

CPU-DRAM Gap
(Figure: processor performance improves roughly 60% per year while DRAM latency improves about 7% per year, so the processor-memory gap grows by about 50% per year.)

Cache Miss Penalty
- A cache miss costs hundreds of CPU instructions (thousands in the future).
- At 2 GHz with a 2.5 average issue rate, the CPU could issue 350 instructions during a 70 ns access latency.
- Even a small cache miss rate leads to a large memory stall time in total execution time.
- On average, memory stalls account for 62% of execution time for SPEC2000.
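As a quick check of the arithmetic above, a minimal sketch using the slide's own example values (2 GHz, 2.5 issue rate, 70 ns):

```c
#include <stdio.h>

/* Instructions a CPU could have issued while waiting on one cache miss,
 * using the slide's example: 2 GHz clock, 2.5 average issue rate, 70 ns latency. */
int main(void) {
    double clock_hz   = 2e9;   /* 2 GHz */
    double issue_rate = 2.5;   /* instructions per cycle */
    double miss_ns    = 70.0;  /* DRAM access latency in ns */

    double cycles_lost = miss_ns * 1e-9 * clock_hz;   /* 140 cycles      */
    double insts_lost  = cycles_lost * issue_rate;    /* 350 instructions */
    printf("cycles lost: %.0f, instructions lost: %.0f\n", cycles_lost, insts_lost);
    return 0;
}
```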

I/O Bottleneck is Much Worse
- Disk access time is limited by mechanical delays.
- A fast Seagate Cheetah X15 disk (15,000 rpm):
  - average seek time: 3.9 ms; rotational latency: 2 ms
  - internal transfer time for a strip unit (8 KB): 0.16 ms
  - total disk latency: 6.06 ms
- The external transfer rate increases 40% per year.
  - from disk to DRAM: 160 MB/s (UltraSCSI I/O bus)
- Getting 8 KB from disk to DRAM takes several milliseconds, more than 22 million CPU cycles at 2 GHz!

Memory Hierarchy with Multi-level Caching
(Figure: the hierarchy spans CPU registers, the TLB, the L1/L2/L3 caches, the DRAM row buffer, the I/O controller buffer, the OS buffer cache in DRAM, and the disk cache; the levels are managed by algorithm implementation, the compiler, the microarchitecture, and the operating system.)

Other System Effects on Locality
Locality exploitation is not guaranteed by the buffers alone!
- Initial and runtime data placement: static and dynamic data allocation, and interleaving.
- Data replacement at different caching levels: LRU is widely used but sometimes fails.
- Locality-aware memory access scheduling: reorder access sequences to reuse cached data.

Outline
- Cache optimization at the application level
  - Designing fast and high-associativity caches
  - Exploiting multiprocessor cache locality at runtime
- Exploiting locality in the DRAM row buffer
  - Fine-grain memory access scheduling
- Efficient replacement in the buffer cache
- Conclusion

Application Software Effort: Algorithm Restructuring for Cache Optimization
- Traditional algorithm design means giving a sequence of computing steps that minimizes CPU operations.
- It ignores:
  - inherent parallelism and interactions (e.g., ILP, pipelining, and multiprogramming),
  - the memory hierarchy where data are laid out, and
  - the increasingly high cost of data access.

Mutually Adaptive Between Algorithms and Architecture
- Restructure commonly used algorithms to use caches and the TLB effectively, minimizing cache and TLB misses.
- A highly optimized application library is very useful.
- Restructuring techniques (a blocking sketch follows below):
  - data blocking: grouping data in the cache for repeated use
  - data padding to avoid conflict misses
  - using registers as fast data buffers
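To make the data-blocking technique concrete, a minimal sketch of a blocked matrix transpose; the tile size B is a hypothetical tuning parameter, not a value from the talk:

```c
/* Cache blocking (tiling) for a matrix transpose: each B x B tile is
 * touched while it is resident in the cache, instead of striding through
 * whole rows and columns.  B is a hypothetical tuning parameter chosen
 * so that two tiles fit in the target cache level. */
#define B 64

void transpose_blocked(int n, const double *src, double *dst) {
    for (int ii = 0; ii < n; ii += B)
        for (int jj = 0; jj < n; jj += B)
            for (int i = ii; i < ii + B && i < n; i++)
                for (int j = jj; j < jj + B && j < n; j++)
                    dst[j * n + i] = src[i * n + j];
}
```

A call to transpose_blocked(n, src, dst) would replace a straightforward row-by-row transpose; padding the leading dimension is the complementary trick for avoiding conflict misses.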

Two Case Studies
- Bit-reversals:
  - basic operations in FFT and other applications
  - data layout and access patterns cause heavy conflict misses
- Sorting: merge sort, quicksort, and insertion sort.
  - TLB and cache misses are sensitive to the operations.
- Our library outperforms system-level approaches: we know exactly where to pad and block!
- Usage of the two libraries (both are open source):
  - bit-reversals: an alternative in Sun's scientific library
  - the sorting codes are used as a benchmark for testing compilers

Microarchitecture Effort: Exploit DRAM Row Buffer Locality
- DRAM features:
  - high density and high capacity
  - low cost but slow access (compared to SRAM)
  - non-uniform access latency
- The row buffer serves as a fast cache, but its access patterns have received little attention.
- Reusing row-buffer data minimizes the DRAM latency.

Locality Exploitation in the Row Buffer
(Figure: the memory hierarchy again, highlighting the DRAM row buffer between the CPU caches and the DRAM core.)

DRAM Access = Latency + Bandwidth Time
(Figure: a DRAM access goes through precharge, row access into the row buffer, and column access, followed by bus transfer; the precharge, row-access, and column-access phases make up the DRAM latency, and the bus transfer is the bandwidth time.)

Nonuniform DRAM Access Latency
- Case 1: row buffer hit (column access only, 20+ ns)
- Case 2: row buffer miss, core already precharged (row access + column access, 40+ ns)
- Case 3: row buffer miss, not precharged (precharge + row access + column access, ~70 ns)
Row buffer misses come from a sequence of accesses to different pages in the same bank.
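A minimal sketch of how the three cases translate into access latency; the nanosecond values are the slide's approximate figures, and the bank bookkeeping is simplified:

```c
/* Approximate DRAM access latency for one request, given the state of the
 * target bank's row buffer.  Latency figures follow the slide's three cases. */
typedef struct {
    int open_row;     /* row currently held in the row buffer, -1 if none */
    int precharged;   /* nonzero if the bank has already been precharged  */
} bank_state;

int dram_latency_ns(bank_state *bank, int row) {
    if (bank->open_row == row)          /* case 1: row buffer hit            */
        return 20;
    int ns = bank->precharged ? 40      /* case 2: miss, already precharged  */
                              : 70;     /* case 3: miss, must precharge too  */
    bank->open_row = row;               /* the new row is now open           */
    bank->precharged = 0;
    return ns;
}
```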

Amdahl's Law Applies in DRAM
- As bandwidth improves, DRAM latency increasingly determines the cache miss penalty.
- Time (ns) to fetch a 128-byte cache block = DRAM latency + transfer time over the bus (table of values omitted).
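As a rough model of this point, a minimal sketch; the latency and bandwidth values are hypothetical, chosen only to show that latency dominates the fetch time as bandwidth improves:

```c
#include <stdio.h>

/* Time to fetch a 128-byte cache block = DRAM latency + transfer time.
 * The latency and bandwidth numbers below are hypothetical illustrations. */
int main(void) {
    double latency_ns = 60.0;                       /* fixed DRAM latency      */
    double bw_GBps[]  = { 1.6, 3.2, 6.4, 12.8 };    /* improving bus bandwidth */
    for (int i = 0; i < 4; i++) {
        double xfer_ns = 128.0 / bw_GBps[i];        /* bytes / (GB/s) == ns    */
        printf("bw %5.1f GB/s: fetch = %5.1f ns (latency share %.0f%%)\n",
               bw_GBps[i], latency_ns + xfer_ns,
               100.0 * latency_ns / (latency_ns + xfer_ns));
    }
    return 0;
}
```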

Row Buffer Locality Benefit
Objective: serve as many memory requests as possible without accessing the DRAM core, reducing latency by up to 67%.

Row Buffer Misses Are Surprisingly High
- Standard configuration:
  - conventional cache mapping
  - page interleaving for DRAM memories
  - 32 DRAM banks, 2 KB page size
  - SPEC95 and SPEC2000
- What is the reason behind this?

Conventional Page Interleaving
(Figure: consecutive pages are interleaved across banks: page 0 to bank 0, page 1 to bank 1, page 2 to bank 2, page 3 to bank 3, page 4 back to bank 0, and so on.)
Address format: | page index (r bits) | bank (k bits) | page offset (p bits) |

Address Mapping Symmetry
- Cache view of an address: | cache tag (t bits) | cache set index (s bits) | block offset (b bits) |
- DRAM view of an address:  | page index (r bits) | bank (k bits) | page offset (p bits) |
- Cache-conflicting addresses: same cache set index, different tags.
- Row-buffer-conflicting addresses: same bank index, different pages.
- In the conventional mapping, the bank index bits fall inside the cache set index bits.
- Property: for all x and y, if x and y conflict in the cache, they also conflict in the row buffer.
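A minimal sketch of the two decompositions; the bit widths (64 B blocks, 4096 sets, 2 KB pages, 32 banks) are hypothetical examples, not necessarily the talk's exact configuration:

```c
#include <stdint.h>

/* Hypothetical address layout: 64 B cache blocks (b = 6), 4096 cache sets
 * (s = 12), 2 KB DRAM pages (p = 11), 32 banks (k = 5).  With conventional
 * page interleaving the k bank bits sit inside the s cache-index bits. */
static inline uint64_t cache_set(uint64_t addr) { return (addr >> 6)  & 0xFFF; }
static inline uint64_t bank_idx (uint64_t addr) { return (addr >> 11) & 0x1F;  }

/* If cache_set(x) == cache_set(y), then bits 6..17 of x and y are equal,
 * which includes the bank bits 11..15, hence bank_idx(x) == bank_idx(y):
 * a cache conflict is also a row-buffer conflict whenever the pages differ. */
```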

Sources of Misses
- Symmetry: invariance of results under transformations.
- Address mapping symmetry propagates conflicts from the cache address space to the memory address space:
  - cache-conflicting addresses are also row-buffer-conflicting addresses;
  - a cache write-back address conflicts in the row buffer with the address of the block to be fetched.
- Cache conflict misses are therefore also row-buffer conflict misses.

Breaking the Symmetry by Permutation-based Page Interleaving
(Figure: the k bank-index bits of the conventional address are XORed with k bits taken from the L2 cache tag; the result becomes the new bank index, while the page index and page offset are unchanged.)
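A minimal sketch of that mapping, reusing the hypothetical bit widths from the earlier sketch; the position of the L2 tag bits is also an assumption:

```c
#include <stdint.h>

/* Permutation-based page interleaving: XOR the conventional bank index with
 * k bits of the L2 cache tag.  Bit positions are hypothetical (k = 5 bank
 * bits above an 11-bit page offset; the L2 tag is assumed to start at bit 23). */
static inline uint64_t bank_conventional(uint64_t addr) {
    return (addr >> 11) & 0x1F;
}

static inline uint64_t bank_permuted(uint64_t addr) {
    uint64_t tag_bits = (addr >> 23) & 0x1F;         /* k bits from the L2 tag */
    return bank_conventional(addr) ^ tag_bits;       /* new bank index         */
}

/* Two cache-conflicting addresses share the set index but differ in the tag,
 * so after the XOR they usually land in different banks; addresses within one
 * page share both fields, so spatial locality in the row buffer is preserved. */
```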

Permutation Property (1)
Conflicting addresses are distributed onto different banks.
(Figure: under conventional interleaving, L2-conflicting addresses map to the same bank index; under permutation-based interleaving, the XOR with their differing tag bits spreads them across different banks.)

Permutation Property (2)
The spatial locality of memory references is preserved.
(Figure: addresses within one page have identical tag and bank bits, so they map to the same bank under both conventional and permutation-based interleaving.)

Permutation Property (3)
Pages are uniformly mapped onto ALL memory banks.
(Figure: a table of page addresses 0, P, 2P, ... and C, C+P, ..., 2C, 2C+P, ... showing that each group of pages is spread over banks 0-3 in a rotated order, so every bank receives the same number of pages.)

Row-buffer Miss Rates

Comparison of Memory Stall Time

Improvement of IPC

Where to Break the Symmetry?
- Breaking the symmetry at the bottom level (the DRAM address) is most effective:
  - it is far from the critical path (little overhead);
  - it reduces both address conflicts and write-back conflicts.
- Our experiments confirm this (a 30% difference).

System Software Effort: Efficient Buffer Cache Replacement
- The buffer cache borrows a variable amount of space in DRAM.
- Accessing I/O data in the buffer cache is about a million times faster than accessing it on disk.
- The performance of data-intensive applications relies on exploiting buffer cache locality.
- Buffer cache replacement is a key factor.

Locality Exploitation in the Buffer Cache
(Figure: the memory hierarchy again, highlighting the OS buffer cache maintained in DRAM.)

The Problem of LRU Replacement
- File scanning: one-time-accessed blocks are not replaced in a timely manner.
- Loop-like accesses: the blocks to be accessed soonest can, unfortunately, be replaced.
- Accesses with distinct frequencies: frequently accessed blocks can, unfortunately, be replaced.
The common cause: an inability to cope with weak access locality.

Reasons LRU Fails, Yet Remains Powerful
- Why does LRU fail sometimes?
  - A recently used block will not necessarily be used again soon.
  - The prediction is based on a single source of information (recency).
- Why is it so widely used?
  - Simplicity: an easy and simple data structure.
  - It works well for accesses that follow the LRU assumption.

Our Objectives and Contributions
Significant efforts have been made to improve or replace LRU, but they are either case by case or carry high runtime overhead.
Our objectives:
- Address the limits of LRU fundamentally.
- Retain the low-overhead and strong-locality merits of LRU.

Related Work
- Aided by user-level hints: application-hinted caching and prefetching [OSDI, SOSP, ...]
  - relies on users' understanding of data access patterns.
- Detection of and adaptation to access regularities: SEQ, EELRU, DEAR, AFC, UBM [OSDI, SIGMETRICS, ...]
  - case-by-case approaches.
- Tracing and using deeper history information: LRFU, LRU-k, 2Q [VLDB, SIGMETRICS, SIGMOD, ...]
  - high implementation cost and runtime overhead.

Observation of Data Flow in the LRU Stack
- Blocks are ordered by recency in the LRU stack.
- Blocks enter at the stack top and leave from its bottom.
- The stack is long and the bottom is the only exit, so a block evicted from the bottom should often have been evicted much earlier!

Inter-Reference Recency (IRR)
- IRR of a block: the number of other unique blocks accessed between the two most recent consecutive references to the block.
- Recency (R): the number of other unique blocks accessed since the block's last reference.
(Figure: an example access sequence in which the marked block has IRR = 3 and R = 2.)
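A minimal sketch of the two definitions computed over a small, hypothetical access trace:

```c
#include <stdio.h>

/* Count unique blocks accessed strictly between positions (lo, hi) of a trace,
 * excluding the block of interest itself. */
static int unique_between(const int *trace, int lo, int hi, int target) {
    int seen[256] = {0}, count = 0;            /* block ids assumed < 256 */
    for (int i = lo + 1; i < hi; i++)
        if (trace[i] != target && !seen[trace[i]]) { seen[trace[i]] = 1; count++; }
    return count;
}

int main(void) {
    int trace[] = { 1, 2, 3, 4, 3, 1, 5, 2 };  /* a hypothetical access sequence */
    int n = 8, target = 3;

    int last = -1, prev = -1;                  /* last two references to target  */
    for (int i = 0; i < n; i++)
        if (trace[i] == target) { prev = last; last = i; }

    /* IRR: unique other blocks between the last two references (here 1: {4}). */
    /* R:   unique other blocks after the last reference (here 3: {1, 5, 2}).  */
    printf("IRR = %d, R = %d\n",
           unique_between(trace, prev, last, target),
           unique_between(trace, last, n, target));
    return 0;
}
```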

Basic Ideas of LIRS
- A block with a high IRR is unlikely to be used again soon, so high-IRR blocks are selected for replacement.
- Recency is used as a secondary reference.
- LIRS: the Low Inter-reference Recency Set algorithm keeps low-IRR blocks in the buffer cache.
- Foundations of LIRS:
  - effectively use multiple sources of access information;
  - responsively determine and change the status of each block;
  - a low-cost implementation.

Data Structure: Keep LIR Blocks in the Cache
- Blocks are classified as low-IRR (LIR) blocks and high-IRR (HIR) blocks.
- The LIR block set (of size L_lirs) and the HIR block set (of size L_hirs) partition the physical cache: cache size L = L_lirs + L_hirs.

Replacement Operations of LIRS
Example: L_lirs = 2, L_hirs = 1; LIR block set = {A, B}, HIR block set = {C, D, E}.
E becomes the resident HIR block, determined by its low recency.

Replace an HIR Block
D is referenced at time 10. Which block is replaced? The resident HIR block E is replaced!

How Is the LIR Set Updated? LIR Block Recency Is Used
Leaving D in the HIR set would be the natural choice, but it ignores what D's recency tells us.

After D Is Referenced at Time 10
D enters the LIR set and B steps down to the HIR set, because D's IRR is smaller than R_max (the maximum recency) of the LIR set.

The Power of LIRS Replacement
- File scanning: one-time-accessed blocks are replaced in a timely manner (due to their high IRRs).
- Loop-like accesses: the blocks to be accessed soonest will NOT be replaced (due to their low IRRs).
- Accesses with distinct frequencies: frequently accessed blocks will NOT be replaced (thanks to dynamic status changes).
In short: LIRS can cope with weak access locality.

LIRS Efficiency: O(1)
Can the complexity of LIRS match that of LRU? Yes: the efficiency is achieved by the LIRS stack.
- Both recencies and the useful IRRs are recorded automatically in the stack.
- R_max (the maximum recency of the LIR blocks) belongs to the block at the stack bottom and is larger than the new IRR of any HIR block in the stack, so no comparison operations are needed.

LIRS Operations
(Running example: cache size L = 5, L_lir = 3, L_hir = 2; a LIRS stack S plus a small LRU stack Q that holds the resident HIR blocks.)
- Initialization: all referenced blocks are given LIR status until the LIR block set is full; resident HIR blocks are placed in the small LRU stack Q.
- Three cases are handled, illustrated in the next slides:
  - accessing an LIR block (a hit)
  - accessing a resident HIR block (a hit)
  - accessing a non-resident HIR block (a miss)

Access an LIR Block (a Hit)
(Figure: example states of the LIRS stack S and the resident-HIR stack Q before and after accesses to blocks 4 and 8; the accessed LIR block moves to the top of S.)

Access a Resident HIR Block (a Hit)
(Figure: example states of stacks S and Q before and after the accesses; a resident HIR block that is hit while it is in S is promoted to LIR status, and the LIR block at the bottom of S is demoted to a resident HIR block in Q.)

Access a Non-Resident HIR Block (a Miss)
(Figure: example stack states; the resident HIR block at the LRU end of Q is evicted to make room, and the missed block is fetched and placed at the top of S.)

Access a Non-Resident HIR Block (a Miss), Continued
(Figure: further example accesses showing the stack S and the queue Q being updated in the same way.)

The LIRS Stack Simplifies Replacement
- Blocks in the stack are ordered by recency, with the R_max LIR block at the bottom.
- There is no need to track each HIR block's IRR: when an HIR block in the stack is accessed again, its new IRR equals its recency, which is necessarily smaller than R_max.
- A small LRU stack stores the resident HIR blocks.
- The additional pruning and demoting operations take constant time.
- Although LIRS operations are much more dynamic than LRU's, its complexity is identical to LRU's.
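Pulling the preceding slides together, a simplified sketch of the three access cases; this is an illustration of the algorithm as described above, not the authors' implementation. The stack S and queue Q come from the slides, the function and variable names are illustrative, and lists are handled with O(n) array operations for clarity:

```c
#include <stdio.h>
#include <string.h>

/* Simplified LIRS sketch.  NBLOCKS, L_LIR, and L_HIR are hypothetical example
 * parameters. */
#define NBLOCKS 16
#define L_LIR    3   /* capacity of the LIR block set          */
#define L_HIR    2   /* capacity of the resident HIR block set */

static int is_lir[NBLOCKS], resident[NBLOCKS], in_S[NBLOCKS];
static int S[NBLOCKS], s_len;   /* LIRS stack S, S[s_len-1] is the top      */
static int Q[NBLOCKS], q_len;   /* resident HIR blocks, Q[0] is the LRU end */
static int n_lir;

static void list_remove(int *a, int *len, int b) {
    int j = 0;
    for (int i = 0; i < *len; i++) if (a[i] != b) a[j++] = a[i];
    *len = j;
}

/* Move block b to the top of stack S. */
static void stack_push(int b) { list_remove(S, &s_len, b); S[s_len++] = b; in_S[b] = 1; }

/* Stack pruning: remove HIR blocks from the bottom until an LIR block is there. */
static void prune(void) {
    int k = 0;
    while (k < s_len && !is_lir[S[k]]) in_S[S[k++]] = 0;
    memmove(S, S + k, (size_t)(s_len - k) * sizeof(int));
    s_len -= k;
}

/* Demote the LIR block at the stack bottom to a resident HIR block in Q. */
static void demote_bottom(void) {
    int b = S[0];
    is_lir[b] = 0; n_lir--;
    Q[q_len++] = b;
}

void access_block(int b) {
    if (is_lir[b]) {                          /* case 1: hit on an LIR block         */
        /* nothing beyond moving b to the stack top, done below                      */
    } else if (resident[b]) {                 /* case 2: hit on a resident HIR block */
        if (in_S[b]) {                        /* its new IRR < Rmax: promote to LIR  */
            is_lir[b] = 1; n_lir++;
            list_remove(Q, &q_len, b);
            demote_bottom();
        } else {                              /* stays HIR, becomes MRU in Q         */
            list_remove(Q, &q_len, b); Q[q_len++] = b;
        }
    } else {                                  /* case 3: miss                        */
        resident[b] = 1;
        if (n_lir < L_LIR) {                  /* warm-up: fill the LIR set first     */
            is_lir[b] = 1; n_lir++;
        } else {
            if (q_len == L_HIR) {             /* evict the LRU resident HIR block    */
                resident[Q[0]] = 0;
                list_remove(Q, &q_len, Q[0]);
            }
            if (in_S[b]) {                    /* recency beats Rmax: promote to LIR  */
                is_lir[b] = 1; n_lir++;
                demote_bottom();
            } else {
                Q[q_len++] = b;               /* joins the resident HIR set          */
            }
        }
    }
    stack_push(b);                            /* b becomes the most recent block     */
    prune();                                  /* keep an LIR block at the bottom     */
}

int main(void) {
    int trace[] = { 1, 2, 3, 4, 3, 1, 5, 2, 4, 3 };   /* a hypothetical access trace */
    for (int i = 0; i < 10; i++) access_block(trace[i]);
    printf("resident blocks:");
    for (int b = 0; b < NBLOCKS; b++)
        if (resident[b]) printf(" %d%s", b, is_lir[b] ? "(LIR)" : "(HIR)");
    printf("\n");   /* expected: 1(LIR) 2(LIR) 3(LIR) 4(HIR) 5(HIR) */
    return 0;
}
```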

Performance Evaluation
- Trace-driven simulations on different access patterns show that LIRS outperforms existing replacement algorithms in almost all cases.
- The performance of LIRS is not sensitive to its only parameter, L_hirs.
- Performance is not affected even when the LIRS stack size is bounded.
- The time/space overhead is as low as LRU's.
- LRU can be regarded as a special case of LIRS.

Selected Workload Traces
- 2-pools: a synthetic trace simulating the distinct-frequency case.
- cpp: a GNU C compiler preprocessor trace.
- cs: a trace of an interactive C source program examination tool.
- glimpse: a trace of a text information retrieval utility.
- link: a UNIX link-editor trace.
- postgres: a trace of join queries among four relations in a relational database system.
- sprite: from the Sprite network file system.
- multi1: two workloads, cs and cpp, executed together.
- multi2: three workloads, cs, cpp, and postgres, executed together.
- multi3: four workloads, cpp, gnuplot, glimpse, and postgres, executed together.
The traces were chosen for (1) various patterns, (2) non-regular accesses, and (3) large sizes.

Looping Pattern: postgres (Time-space map)

Looping Pattern: postgres (Hit Rates)

Potential Impact of LIRS
- A LIRS patent has been filed and is pending approval.
- LIRS has been positively evaluated by IBM Almaden Research.
- A potential adoption by LaserFiche for digital libraries.
- The trace-driven simulation package has been distributed to many universities for research and classroom teaching.

Conclusion
Locality-aware research is long-term and multidisciplinary.
- Application software support
  - +: optimization is effective for architecture-dependent libraries.
  - -: cache optimization only, and case by case.
- Hardware support
  - +: touches fundamental problems, such as address symmetry.
  - -: the optimization space is very limited due to cost considerations.
- System software support
  - +: a key to locality optimization for I/O and virtual memory.
  - -: lacks application knowledge and requires kernel modifications.

Selected References
- Application software for cache optimization:
  - Cache-effective sortings, ACM Journal on Experimental Algorithmics.
  - Fast bit-reversals, SIAM Journal on Scientific Computing, 2001.
- Fast and high-associativity cache designs:
  - Multicolumn caches, IEEE Micro, 1997.
  - Low-power caches, IEEE Micro.
- Hardware support for DRAM locality exploitation:
  - Permutation-based page interleaving, MICRO-33.
  - Fine-grain memory access scheduling, HPCA-8.
- System software support for buffer cache optimization:
  - LIRS replacement, SIGMETRICS '02.
  - TPF systems, Software: Practice & Experience, 2002.