Modern Computer Architecture. Lecturer: Prof. Zhang Gang, School of Computer Science, Tianjin University. Slides, homework, and discussion site: http://glearning.tju.edu.cn/ Email: gzhang@tju.edu.cn 2017

The Main Contents
Chapter 1. Fundamentals of Quantitative Design and Analysis
Chapter 2. Memory Hierarchy Design
Chapter 3. Instruction-Level Parallelism and Its Exploitation
Chapter 4. Data-Level Parallelism in Vector, SIMD, and GPU Architectures
Chapter 5. Thread-Level Parallelism
Chapter 6. Warehouse-Scale Computers to Exploit Request-Level and Data-Level Parallelism
Appendix A. Pipelining: Basic and Intermediate Concepts

Many Levels in Memory Hierarchy
Pipeline registers
Register file (invisible only to high-level language programmers)
1st-level cache (on-chip)
2nd-level cache (on same MCM as CPU); there can also be a 3rd (or more) cache level here; caches are usually made invisible to the programmer (even assembly programmers)
Physical memory (usually mounted on same board as CPU)
Virtual memory (on hard disk, often in same enclosure as CPU)
Disk files (on hard disk, often in same enclosure as CPU)
Network-accessible disk files (often in the same building as the CPU)
Tape backup/archive system (often in the same building as the CPU)
Data warehouse: robotically accessed room full of shelves of tapes (usually on the same planet as the CPU)

Simple Hierarchy Example
Note the many orders-of-magnitude changes in characteristics between levels: ×128 → ×8192 → ×200 → ×4 → ×100 → ×50,000 (for random access time). (Figure: example hierarchy with a 2 GB physical memory and a 1 TB disk with ~10 ms access time.)

Memory Hierarchy

Why More on Memory Hierarchy?
The processor-memory performance gap keeps growing. (Figure: performance on the y-axis vs. time on the x-axis; the gap is one of latency.) Cliché: note that x86 didn't have an on-chip cache until 1989.

Review of Caches
Blocks or lines? How is a block found in a cache? How is it mapped? Write strategies? Write buffer? Miss and miss rate? Hit and hit rate?

Memory Hierarchy Basics

Three Types of Misses
Compulsory: during a program, the very first access to a block will not be in the cache (unless prefetched).
Capacity: the working set of blocks accessed by the program is too large to fit in the cache.
Conflict: unless the cache is fully associative, blocks may sometimes be evicted too early (compared to a fully associative cache) because too many frequently accessed blocks map to the same limited set of frames.
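
The classification can be made concrete with a toy simulator - a sketch of my own, not from the lecture, with made-up sizes and trace. The usual operational definition: a miss is compulsory on first touch, a capacity miss if even a fully associative LRU cache of the same total size would also miss, and a conflict miss if only the direct-mapped cache misses. The helper classify() and all constants are hypothetical.

#include <stdio.h>

#define NBLOCKS 4        /* tiny 4-block caches, illustrative only */
#define MAXBLK  1024

static int dm[NBLOCKS];  /* direct-mapped cache: one tag per frame        */
static int fa[NBLOCKS];  /* fully associative cache, LRU order (0 = MRU)  */
static int seen[MAXBLK]; /* first-touch tracking for compulsory misses    */

const char *classify(int blk)
{
    /* direct-mapped lookup */
    int frame = blk % NBLOCKS;
    int dm_hit = (dm[frame] == blk);
    dm[frame] = blk;

    /* fully associative LRU lookup: find blk, then move it to the front */
    int fa_hit = 0, pos = NBLOCKS - 1;
    for (int i = 0; i < NBLOCKS; i++)
        if (fa[i] == blk) { fa_hit = 1; pos = i; break; }
    for (int i = pos; i > 0; i--) fa[i] = fa[i-1];
    fa[0] = blk;

    if (dm_hit)     return "hit";
    if (!seen[blk]) { seen[blk] = 1; return "compulsory miss"; }
    if (!fa_hit)    return "capacity miss";
    return "conflict miss";    /* only the direct-mapped cache missed */
}

int main(void)
{
    for (int i = 0; i < NBLOCKS; i++) { dm[i] = -1; fa[i] = -1; }
    /* blocks 0 and 4 collide in frame 0 of the direct-mapped cache */
    int trace[] = { 0, 4, 0, 4, 1, 2, 3, 5, 0 };
    for (int i = 0; i < 9; i++)
        printf("block %d: %s\n", trace[i], classify(trace[i]));
    return 0;
}

Running the trace shows all three categories: the ping-ponging between blocks 0 and 4 produces conflict misses, while the final access to block 0 misses even fully associatively (capacity).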

An Alternative Metric
(Average memory access time) = (Hit time) + (Miss rate) × (Miss penalty)
or, in symbols: T_acc = T_hit + (miss rate) × T_+miss.
The times T_acc, T_hit, and T_+miss can be expressed either in real time (e.g., nanoseconds) or in numbers of clock cycles. T_+miss means the extra (not total) time for a miss, in addition to T_hit, which is incurred by all accesses. (Figure: CPU, cache, and lower levels of the hierarchy, annotated with hit time and miss penalty.)

6 Basic Cache Optimizations
Reducing hit time:
1. Avoiding address translation during cache indexing
Reducing miss penalty:
2. Giving reads priority over writes (e.g., a read completes before earlier writes still in the write buffer)
3. Multilevel caches
Reducing miss rate:
4. Larger block size (compulsory misses)
5. Larger cache size (capacity misses)
6. Higher associativity (conflict misses)

Multiple-Level Caches
Avg mem access time = Hit time(L1) + Miss rate(L1) × Miss penalty(L1)
Miss penalty(L1) = Hit time(L2) + Miss rate(L2) × Miss penalty(L2)
Plugging the second equation into the first:
Avg mem access time = Hit time(L1) + Miss rate(L1) × (Hit time(L2) + Miss rate(L2) × Miss penalty(L2))
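
A minimal sketch of the two-level formula in C. The function name amat_two_level is my own, and the numbers in main() are made-up illustrative values, not measurements of any real machine.

#include <stdio.h>

/* Average memory access time for a two-level cache hierarchy:
   AMAT = HitTime(L1) + MissRate(L1) *
          (HitTime(L2) + MissRate(L2) * MissPenalty(L2))            */
double amat_two_level(double hit_l1, double mr_l1,
                      double hit_l2, double mr_l2,
                      double penalty_l2)
{
    double penalty_l1 = hit_l2 + mr_l2 * penalty_l2; /* L1 miss penalty */
    return hit_l1 + mr_l1 * penalty_l1;
}

int main(void)
{
    /* illustrative values, in cycles: 1-cycle L1 hit, 5% L1 misses,
       10-cycle L2 hit, 20% L2 misses, 100-cycle memory access       */
    double t = amat_two_level(1.0, 0.05, 10.0, 0.20, 100.0);
    printf("AMAT = %.2f cycles\n", t);  /* 1 + 0.05*(10 + 0.20*100) = 2.50 */
    return 0;
}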

Ten Advanced Optimizations of Cache Performance

1. Small & Simple First-Level Caches - Reduce Hit Time and Power
The critical timing cost of a hit is indexing: smaller is faster. Keep L2 small enough to fit on the processor chip. Direct mapping is simple and lets the tag check overlap with the data transmission. CACTI can simulate the impact of size and associativity on hit time (e.g., Fig. 2.4, access time vs. size and associativity), which suggests:
Direct mapped is 1.2 - 1.5x faster than 2-way set associative
2-way is 1.02 - 1.1x faster than 4-way
4-way is 1.0 - 1.08x faster than fully associative
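
To make the indexing cost concrete, here is a small sketch (parameters and helper names are illustrative assumptions, not from the lecture) of how a direct-mapped cache splits an address. A smaller cache means fewer index bits, hence a smaller decoder and RAM array to drive, hence a faster hit.

#include <stdio.h>

/* Split a 32-bit address into tag / index / block-offset fields for
   a hypothetical 32 KB direct-mapped cache with 64-byte blocks:
   32K / 64 = 512 sets, so 9 index bits and 6 offset bits.          */
#define OFFSET_BITS 6                         /* log2(64)           */
#define INDEX_BITS  9                         /* log2(32K / 64)     */
#define INDEX_MASK  ((1u << INDEX_BITS) - 1)

unsigned cache_index(unsigned addr) { return (addr >> OFFSET_BITS) & INDEX_MASK; }
unsigned cache_tag(unsigned addr)   { return addr >> (OFFSET_BITS + INDEX_BITS); }

int main(void)
{
    unsigned a = 0x12345678;
    printf("index=%u tag=0x%x\n", cache_index(a), cache_tag(a));
    return 0;
}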

Access Time versus Cache Size and Associativity (figure; access time in picoseconds)

Energy per read vs. size and associativity

2. Way Prediction - Reduce Hit Time
Extra bits are kept in the cache to predict the way (the block within the set) of the next cache access. The multiplexor is set early to select the desired block, and only a single tag comparison is done that cycle, in parallel with reading the cache data. On a miss, the other blocks are checked for matches in the next cycle. This saves pipeline stages. Prediction accuracy is about 85% of accesses for 2-way caches, a good match for speculative processors. Used by the Pentium 4.
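
A minimal C sketch of the idea for a 2-way set-associative cache. The structures and the lookup() helper are hypothetical (real hardware does this with muxes and comparators, not software), but the control flow mirrors the slide: one tag compare on the fast path, an extra cycle on a misprediction.

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define NSETS 256

struct set {
    uint32_t tag[2];
    bool     valid[2];
    int      pred_way;   /* the extra prediction bit kept per set */
};

static struct set cache[NSETS];

/* Returns the matching way, or -1 on a miss. A correct prediction
   costs a single tag compare; a wrong one costs an extra cycle to
   check the other way, and the predictor is updated.               */
int lookup(uint32_t set_idx, uint32_t tag, int *extra_cycles)
{
    struct set *s = &cache[set_idx];
    int w = s->pred_way;
    *extra_cycles = 0;

    if (s->valid[w] && s->tag[w] == tag)  /* predicted way hits     */
        return w;

    *extra_cycles = 1;                    /* re-check the other way */
    w = 1 - w;
    if (s->valid[w] && s->tag[w] == tag) {
        s->pred_way = w;                  /* update the prediction  */
        return w;
    }
    return -1;                            /* genuine cache miss     */
}

int main(void)
{
    cache[3] = (struct set){ .tag = {0xAA, 0xBB},
                             .valid = {true, true}, .pred_way = 0 };
    int extra;
    int way = lookup(3, 0xBB, &extra);
    printf("way=%d extra_cycles=%d\n", way, extra); /* way=1, 1 extra */
    return 0;
}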

3. Pipelined Cache Access - Increase Cache Bandwidth
Pipelining cache access makes the effective latency of an L1 hit multiple clock cycles, but allows a fast clock and high bandwidth, at the cost of slow hits: a Pentium cache hit took 1 cycle, while a Core i7 cache hit takes 4 cycles. The price is more pipeline stages.

4. Nonblocking Cache (Hit Under Miss) - Increase Cache Bandwidth
With out-of-order execution, the processor need not stall on a cache miss; it can continue fetching instructions while waiting for the cache data. If the cache does not block, it can supply data for hits while processing a miss, which reduces the effective miss penalty. Overlapping multiple misses ("hit under multiple misses" or "miss under miss") requires a memory system that can service multiple misses simultaneously.
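
A sketch of the bookkeeping behind "hit under miss": miss status holding registers (MSHRs) track outstanding misses so the cache only stalls when they run out. The structure and the record_miss() helper are hypothetical illustrations, not a description of any specific design.

#include <stdint.h>
#include <stdbool.h>

#define NMSHR 4    /* max outstanding misses ("miss under miss") */

struct mshr {
    bool     busy;
    uint64_t block_addr;   /* block currently being fetched      */
};

static struct mshr mshrs[NMSHR];

/* On a miss: merge with an existing MSHR for the same block, or
   allocate a free one. Returns false only when all MSHRs are busy,
   which is the one case where the cache actually has to stall.    */
bool record_miss(uint64_t block_addr)
{
    for (int i = 0; i < NMSHR; i++)        /* merge duplicate miss */
        if (mshrs[i].busy && mshrs[i].block_addr == block_addr)
            return true;
    for (int i = 0; i < NMSHR; i++)        /* allocate a free MSHR */
        if (!mshrs[i].busy) {
            mshrs[i].busy = true;
            mshrs[i].block_addr = block_addr;
            return true;
        }
    return false;  /* structural stall: too many outstanding misses */
}

int main(void)
{
    record_miss(0x100);   /* first miss: allocate an MSHR           */
    record_miss(0x100);   /* second miss to same block: merged      */
    return 0;
}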

5. Multibanked Caches - Increase Cache Bandwidth
Independent banks support simultaneous accesses; the idea was originally used in main memory. The AMD Opteron has 2 banks of L2, and the Sun Niagara has 4 banks of L2. Banking works best when accesses are spread across the banks, so addresses are spread sequentially across banks (sequential interleaving).
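
A minimal sketch of sequential interleaving: consecutive block addresses map to consecutive banks, so a streaming access pattern exercises all banks in turn. The bank count and block size are illustrative assumptions, not details of the Opteron or Niagara.

#include <stdio.h>
#include <stdint.h>

#define NBANKS      4      /* illustrative; e.g., Niagara has 4 L2 banks */
#define BLOCK_BYTES 64

/* sequential interleaving: bank = block address mod number of banks */
unsigned bank_of(uint64_t addr)
{
    return (addr / BLOCK_BYTES) % NBANKS;
}

int main(void)
{
    for (uint64_t a = 0; a < 8 * BLOCK_BYTES; a += BLOCK_BYTES)
        printf("block at 0x%llx -> bank %u\n",
               (unsigned long long)a, bank_of(a));  /* banks 0,1,2,3,0,... */
    return 0;
}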

6. Critical Word First / Early Restart - Reduce Miss Penalty
Don't wait for the full block before restarting the CPU.
Early restart: as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution.
Critical word first: request the missed word first from memory and send it to the CPU as soon as it arrives; the CPU continues execution while the rest of the words in the block are filled in.
Spatial locality means the CPU tends to want the next sequential word anyway, so the benefit of early restart alone is unclear. With today's longer blocks, critical word first is the widely used variant.

6. Critical Word First / Early Restart - Reduce Miss Penalty (cont'd)
The processor needs only one word of the block, so give it what it needs first. How is the block retrieved from memory?
Critical word first: get the requested word first, return it, then continue with the rest of the memory transfer.
Early restart: fetch in normal order, but as soon as the requested word arrives, return it.
These techniques benefit only large blocks: with small blocks there is little transfer left to overlap. The disadvantage is spatial locality: the next access is often the next sequential word, which may still be in flight. The miss penalty therefore becomes hard to estimate.
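
A tiny sketch of the two return orders for an 8-word block when the CPU requested word 5. Block size and the helper name are illustrative assumptions.

#include <stdio.h>

#define WORDS_PER_BLOCK 8

/* Critical word first: fetch the needed word immediately, then wrap
   around the block. Early restart would instead fetch 0..7 in order
   and simply resume the CPU when word 'needed' arrives.             */
void critical_word_first(int needed)
{
    for (int k = 0; k < WORDS_PER_BLOCK; k++)
        printf("%d ", (needed + k) % WORDS_PER_BLOCK);
    printf("\n");
}

int main(void)
{
    critical_word_first(5);   /* prints: 5 6 7 0 1 2 3 4 */
    return 0;
}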

7. Merging Write Buffers - Reduce Miss Penalty
Write-through relies on write buffers: all stores are sent to the lower level. Write-back uses a simple buffer when a block is replaced. Three cases arise:
Write buffer is empty: data and address are written from the cache block to the buffer; as far as the cache is concerned, the write is done.
Write buffer already contains modified blocks: check whether the new block's address matches one already in the buffer; if so, combine the newly modified data with that entry (write merging).
Buffer is full and there is no address match: the CPU must wait for an empty buffer entry.
Write merging uses memory more efficiently, since merged entries go out as multi-word writes.
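
A sketch of a merging write buffer, following the three cases above: each entry covers one block and has per-word valid bits, so a new store to an already-buffered block merges instead of taking a fresh slot. The layout and the buffer_write() helper are hypothetical.

#include <stdint.h>
#include <stdbool.h>

#define NENTRIES        4
#define WORDS_PER_BLOCK 4   /* 4 x 4-byte words = 16-byte entries */

struct wb_entry {
    bool     busy;
    uint64_t block_addr;
    uint8_t  valid;                   /* one valid bit per word    */
    uint32_t data[WORDS_PER_BLOCK];
};

static struct wb_entry wb[NENTRIES];

/* Returns false only when the buffer is full and nothing merges -
   the one case where the CPU must wait.                           */
bool buffer_write(uint64_t addr, uint32_t value)
{
    uint64_t block = addr / (4 * WORDS_PER_BLOCK);
    unsigned word  = (addr / 4) % WORDS_PER_BLOCK;

    for (int i = 0; i < NENTRIES; i++)      /* case 2: try to merge */
        if (wb[i].busy && wb[i].block_addr == block) {
            wb[i].data[word] = value;
            wb[i].valid |= 1u << word;
            return true;
        }
    for (int i = 0; i < NENTRIES; i++)      /* case 1: new entry    */
        if (!wb[i].busy) {
            wb[i].busy = true;
            wb[i].block_addr = block;
            wb[i].valid = 1u << word;
            wb[i].data[word] = value;
            return true;
        }
    return false;                           /* case 3: buffer full  */
}

int main(void)
{
    buffer_write(0x1000, 1);  /* allocates an entry                 */
    buffer_write(0x1004, 2);  /* same 16-byte block: merges into it */
    return 0;
}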

7. Merging Write Buffers - Reduce Miss Penalty (figure: write-buffer contents with and without write merging)

8. Compiler Optimizations - Reduce Miss Rate
Compiler research has produced improvements for both instruction misses and data misses. Optimizations include code and data rearrangement:
Reorder procedures: may reduce conflict misses.
Align basic blocks to the beginning of a cache block: decreases the chance of a cache miss on straight-line code.
Branch straightening: change the sense of the branch test and swap the basic blocks of the branch.
Data: arrange to improve spatial and temporal locality, e.g., process arrays block by block.

8. Compiler Optimizations - Reduce Miss Rate (cont'd)
Instructions:
Reorder procedures in memory so as to reduce conflict misses; use profiling to look at conflicts (using tools developed for the purpose).
Data (self-study; read the relevant sections of the textbook):
Merging arrays: improve spatial locality by using a single array of compound elements instead of two arrays.
Loop interchange: change the nesting of loops to access data in the order it is stored in memory.
Loop fusion: combine two independent loops that have the same looping structure and overlapping variables.
Blocking: improve temporal locality by accessing "blocks" of data repeatedly instead of going down whole columns or rows.

Merging Arrays Example

/* Before: 2 sequential arrays */
int val[SIZE];
int key[SIZE];

/* After: 1 array of structures */
struct merge {
    int val;
    int key;
};
struct merge merged_array[SIZE];

Reduces conflicts between val and key; improves spatial locality.

Loop Interchange Example

/* Before */
for (k = 0; k < 100; k = k+1)
    for (j = 0; j < 100; j = j+1)
        for (i = 0; i < 5000; i = i+1)
            x[i][j] = 2 * x[i][j];

/* After */
for (k = 0; k < 100; k = k+1)
    for (i = 0; i < 5000; i = i+1)
        for (j = 0; j < 100; j = j+1)
            x[i][j] = 2 * x[i][j];

Sequential accesses instead of striding through memory every 100 words; improved spatial locality.

Loop Fusion Example

/* Before */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        d[i][j] = a[i][j] + c[i][j];

/* After */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
    {   a[i][j] = 1/b[i][j] * c[i][j];
        d[i][j] = a[i][j] + c[i][j]; }

Two misses per access to a and c versus one miss per access; improves temporal locality.

Blocking Example

/* Before */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
    {   r = 0;
        for (k = 0; k < N; k = k+1)
            r = r + y[i][k]*z[k][j];
        x[i][j] = r;
    }

The two inner loops read all N×N elements of z, read N elements of one row of y repeatedly, and write N elements of one row of x. Capacity misses are a function of N and the cache size: roughly 2N³ + N² words are accessed (assuming no conflicts; otherwise it is worse). Idea: compute on a B×B submatrix that fits in the cache.

Blocking Example (cont'd)

/* After */
for (jj = 0; jj < N; jj = jj+B)
    for (kk = 0; kk < N; kk = kk+B)
        for (i = 0; i < N; i = i+1)
            for (j = jj; j < min(jj+B, N); j = j+1)
            {   r = 0;
                for (k = kk; k < min(kk+B, N); k = k+1)
                    r = r + y[i][k]*z[k][j];
                x[i][j] = x[i][j] + r;
            }

B is called the blocking factor. Capacity misses fall from 2N³ + N² to 2N³/B + N². Conflict misses, too?

Loop Blocking - Matrix Multiply (figure: array access patterns before and after blocking)

9. Hardware Prefetching of Instructions & Data - Reduce Miss Penalty or Miss Rate
The processor fetches two blocks on an instruction miss: the requested block is placed in the I-cache, and the next sequential block is placed in an instruction stream buffer. A similar approach can be applied to data accesses. The Intel Core i7 supports hardware prefetching into both L1 and L2.
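
A minimal sketch of the next-block scheme with a one-entry stream buffer. The state and the on_icache_miss() helper are hypothetical simplifications: on a miss, the missed block goes to the I-cache and block+1 goes to the buffer; a later miss that hits the buffer is served from there and triggers the next prefetch.

#include <stdint.h>
#include <stdbool.h>

static uint64_t stream_buf_block;          /* block held by the buffer */
static bool     stream_buf_valid = false;

void on_icache_miss(uint64_t block)
{
    if (stream_buf_valid && stream_buf_block == block) {
        /* stream-buffer hit: move block into the I-cache,
           then prefetch the next sequential block            */
        stream_buf_block = block + 1;
    } else {
        /* true miss: fetch 'block' from memory into the
           I-cache, and prefetch block+1 into the buffer      */
        stream_buf_block = block + 1;
        stream_buf_valid = true;
    }
}

int main(void)
{
    on_icache_miss(100);   /* miss: fetch 100, prefetch 101   */
    on_icache_miss(101);   /* served from the stream buffer   */
    return 0;
}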

10. Compiler-Controlled Prefetch - Reduce Miss Penalty or Miss Rate
The compiler inserts instructions to prefetch, either into a register or into the cache. Faulting or nonfaulting? That is, should a prefetch be allowed to cause a page fault or memory-protection fault? Here we assume a nonfaulting cache prefetch: it does not change the contents of registers or memory and does not cause a memory fault. Goal: overlap execution with prefetching.

10. Compiler-Controlled Prefetch - Example

/* Before */
for (i = 0; i < 3; i = i+1)
    for (j = 0; j < 100; j = j+1)
        a[i][j] = b[j][0] * b[j+1][0];

/* After */
for (j = 0; j < 100; j = j+1) {
    prefetch(b[j+7][0]);   /* b(j,0) needed 7 iterations later */
    prefetch(a[0][j+7]);   /* a(0,j) needed 7 iterations later */
    a[0][j] = b[j][0] * b[j+1][0];
}
for (i = 1; i < 3; i = i+1)
    for (j = 0; j < 100; j = j+1) {
        prefetch(a[i][j+7]);
        a[i][j] = b[j][0] * b[j+1][0];
    }

The prefetch distance (7 iterations here) is chosen so that the prefetched data arrives from memory at roughly the time it is needed.

Homework 4 (Exercise 2.1, 5th edition)