ECE 454 Computer Systems Programming
Memory Performance (Part I: Review of the Memory Hierarchy)
Ding Yuan, ECE Dept., University of Toronto
http://www.eecg.toronto.edu/~yuan

Content
- Cache basics and organization
- Optimizing for caches (next lecture)
  - Tiling/blocking
  - Loop reordering

Matrix Multiply
What is the range of performance due to optimization?

    /* Multiply n x n matrices a and b, accumulating into c
       (c is assumed to be already set to zero). Matrices are
       stored row-major as flat arrays, so element (i, j) lives
       at index i*n + j. */
    void mmm(double *a, double *b, double *c, int n) {
        int i, j, k;
        for (i = 0; i < n; i++)
            for (j = 0; j < n; j++)
                for (k = 0; k < n; k++)
                    c[i*n + j] += a[i*n + k] * b[k*n + j];   // work
    }
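
The slide's 4 x 4 declarations can drive this directly, since a C 2D array is contiguous in row-major order; a minimal driver sketch (test values are mine, not from the slides):

    #include <stdio.h>

    /* mmm() as defined above */

    int main(void) {
        double a[4][4], b[4][4], c[4][4] = {{0}};   /* c already set to zero */
        for (int i = 0; i < 4; i++)
            for (int j = 0; j < 4; j++) {
                a[i][j] = i + j;       /* arbitrary test values */
                b[i][j] = (i == j);    /* identity, so c should equal a */
            }
        mmm(&a[0][0], &b[0][0], &c[0][0], 4);   /* flat view of the 2D arrays */
        printf("c[1][2] = %.1f\n", c[1][2]);    /* expect 3.0 */
        return 0;
    }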

MMM Performance
[Chart: performance of the naive triple loop vs. the best optimized code; the best code is roughly 160x faster]
- Measured on a standard desktop computer, standard compiler, using optimization flags
- Both implementations have exactly the same operations count (2n^3)
- What is going on?
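
To reproduce the comparison on your own machine, a minimal timing harness might look like the following (a sketch: it assumes POSIX clock_gettime and picks n = 1024 arbitrarily):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* mmm() as defined on the previous slide */

    int main(void) {
        int n = 1024;                               /* assumed problem size */
        double *a = malloc(n * n * sizeof(double));
        double *b = malloc(n * n * sizeof(double));
        double *c = calloc(n * n, sizeof(double)); /* zero-initialized */
        for (int i = 0; i < n * n; i++) { a[i] = 1.0; b[i] = 2.0; }

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        mmm(a, b, c, n);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        /* 2*n^3 floating-point operations in total */
        printf("%.2f s, %.2f MFLOPS\n", secs, 2.0 * n * n * n / secs / 1e6);
        free(a); free(b); free(c);
        return 0;
    }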

Problem: Processor-Memory Bottleneck
- L1 cache reference: 0.5 ns* (L1 cache size: tens of KB)
- Main memory reference: 100 ns (memory size: GBs), 200x slower!
*1 ns = 1/1,000,000,000 second. For a 2.7 GHz CPU (my laptop), 1 cycle = 0.37 ns.

Memory Hierarchy
From smaller, faster, costlier per byte (top) to larger, slower, cheaper per byte (bottom):
- CPU registers: hold words retrieved from the L1 cache
- On-chip L1 cache (SRAM): holds cache lines retrieved from the L2 cache
- On-chip L2 cache (SRAM): holds cache lines retrieved from main memory
- Main memory (DRAM): holds disk blocks retrieved from local disks
- Local secondary storage (local disks): holds files retrieved from disks on remote network servers
- Remote secondary storage (tapes, distributed file systems, Web servers)

Cache Basics (review (hopefully!))

General Cache Mechanics
[Diagram: a small cache holding blocks 8, 9, 14, and 3 of a larger memory partitioned into 16 numbered blocks; blocks 4 and 10 are shown being copied into the cache]
- Cache: smaller, faster, more expensive memory; caches a subset of the blocks
- Data is copied between levels in block-sized transfer units
- Memory: larger, slower, cheaper memory, viewed as partitioned into "blocks"

General Cache Concepts: Hit
[Diagram: a request for block 14; the cache holds blocks 8, 9, 14, and 3]
- Data in block b is needed
- Block b is in the cache: hit!

General Cache Concepts: Miss
[Diagram: a request for block 12; the cache holds blocks 8, 9, 14, and 3]
- Data in block b is needed
- Block b is not in the cache: miss!
- Block b is fetched from memory
- Block b is stored in the cache
  - Placement policy: determines where b goes
  - Replacement policy: determines which block gets evicted (the victim)

Cache Performance Metrics
- Miss rate
  - Fraction of memory references not found in the cache: (misses / accesses) = 1 - hit rate
  - Typical numbers: 3-10% for L1; can be quite small (e.g., < 1%) for L2, depending on size, etc.
- Hit time
  - Time to deliver a line in the cache to the processor (includes time to determine whether the line is in the cache)
  - Typical numbers: 1-3 clock cycles for L1, 5-20 clock cycles for L2
- Miss penalty
  - Additional time required because of a miss; typically 50-400 cycles for main memory

Let's think about those numbers
- Huge difference between a hit and a miss: could be 100x, if just L1 and main memory
- Would you believe 99% hits is twice as good as 97%?
- Consider: cache hit time of 1 cycle, miss penalty of 100 cycles
- Average access time:
  - 97% hits: 0.97 * 1 cycle + 0.03 * 100 cycles = 3.97 cycles
  - 99% hits: 0.99 * 1 cycle + 0.01 * 100 cycles = 1.99 cycles
- This is why "miss rate" is used instead of "hit rate"
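
The same arithmetic as a tiny program (a sketch; the function name is mine, and the accounting follows the slide's formula, where a miss costs the full penalty instead of the hit time):

    #include <stdio.h>

    /* Average access time per the slide's accounting: a hit costs
       hit_time, a miss costs the full miss penalty. */
    static double avg_access(double hit_rate, double hit_time, double miss_penalty) {
        return hit_rate * hit_time + (1.0 - hit_rate) * miss_penalty;
    }

    int main(void) {
        printf("97%% hits: %.2f cycles\n", avg_access(0.97, 1.0, 100.0)); /* 3.97 */
        printf("99%% hits: %.2f cycles\n", avg_access(0.99, 1.0, 100.0)); /* 1.99 */
        return 0;
    }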

Types of Cache Misses
- Cold (compulsory) miss
  - Occurs on the first access to a block
  - Can't do too much about these (except prefetching: more later)
- Conflict miss
  - Most hardware caches limit blocks to a small subset (sometimes a singleton) of the available cache slots, e.g., block i must be placed in slot (i mod 4)
  - Conflict misses occur when the cache is large enough, but multiple data objects all map to the same slot, e.g., referencing blocks 0, 8, 0, 8, ... would miss every time (see the sketch after this slide)
  - Conflict misses are less of a problem these days (more later)
- Capacity miss
  - Occurs when the set of active cache blocks (the working set) is larger than the cache
  - This is where to focus nowadays
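
As a concrete illustration of the "blocks 0, 8, 0, 8, ..." pattern (a sketch, not from the slides): on a direct-mapped cache of CACHE_BYTES, two arrays placed exactly CACHE_BYTES apart map to the same slots and evict each other on every access.

    #include <stdio.h>
    #include <stdlib.h>

    enum { CACHE_BYTES = 32 * 1024 };   /* assumed cache size */

    int main(void) {
        int n = CACHE_BYTES / sizeof(double);
        /* One allocation, so the two halves are exactly CACHE_BYTES apart:
           on a direct-mapped cache of CACHE_BYTES, x[i] and y[i] fall in
           the same slot and would evict each other on every iteration. */
        double *buf = calloc(2 * n, sizeof(double));
        double *x = buf, *y = buf + n;
        double sum = 0.0;
        for (int rep = 0; rep < 1000; rep++)
            for (int i = 0; i < n; i++)
                sum += x[i] * y[i];     /* alternating conflicting accesses */
        printf("%f\n", sum);
        free(buf);
        return 0;
    }

On an 8-way set-associative L1 like the ones described later in this deck, the two streams fit in different ways of the same set, which is exactly why conflict misses are less of a problem today.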

Why Caches Work
Locality: programs tend to use data and instructions with addresses near or equal to those they have used recently.
- Temporal locality: recently referenced items are likely to be referenced again in the near future
- Spatial locality: items with nearby addresses tend to be referenced close together in time

Example: Locality?

    sum = 0;
    for (i = 0; i < n; i++)
        sum += a[i];
    return sum;

- Data
  - Temporal: sum is referenced in each iteration
  - Spatial: array a[] is accessed in a stride-1 pattern
- Instructions
  - Temporal: cycle through the loop repeatedly
  - Spatial: reference instructions in sequence
Being able to assess the locality of code is a crucial skill for a programmer! (A row- vs. column-traversal sketch follows.)
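
A classic way to feel spatial locality (a sketch, not from the slides): traversing a 2D array row by row touches consecutive addresses, while traversing it column by column jumps a full row between accesses.

    #include <stdio.h>

    #define N 2048

    static double a[N][N];

    /* Good spatial locality: the inner loop walks consecutive addresses
       (C stores 2D arrays in row-major order). */
    double sum_rowwise(void) {
        double sum = 0.0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum += a[i][j];
        return sum;
    }

    /* Poor spatial locality: consecutive accesses are N*sizeof(double)
       bytes apart, so each one may touch a different cache line. */
    double sum_colwise(void) {
        double sum = 0.0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                sum += a[i][j];
        return sum;
    }

    int main(void) {
        printf("row-wise:    %f\n", sum_rowwise());
        printf("column-wise: %f\n", sum_colwise());
        return 0;
    }

Both functions perform the same N^2 additions, but on typical hardware the column-wise version can run several times slower because it misses in the cache on almost every access.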

Cache Organization

General Cache Organization (S, E, B)
[Diagram: the cache as a grid of blocks, S sets tall and E blocks wide; each block holds a valid bit, a tag, and B data bytes]
- S = 2^s sets
- E = 2^e blocks per set
- B = 2^b bytes per cache block (the data)
- Each block also stores a valid bit and a tag
- Cache size: S x E x B data bytes

Example: Direct Mapped Cache (E = 1)
Direct mapped: one block per set. Assume: cache block size 8 bytes.
[Diagram: an int's address splits into t tag bits, s set-index bits ("0...01"), and a 3-bit block offset ("100"); each of the S = 2^s sets holds one block with a valid bit, a tag, and bytes 0-7]
- The set-index bits select one set ("find set")
- Valid bit set and tag matches the address's tag bits? Assume yes: hit; the block offset then locates the int (4 bytes) within the block
- No match: the old line is evicted and replaced
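
The tag/set/offset split is easy to express in code; a sketch (the helper and its parameter names are mine; s and b follow the slide's notation):

    #include <stdint.h>
    #include <stdio.h>

    /* Split an address into (tag, set index, block offset), as in the
       diagram: b offset bits, then s set-index bits, then tag bits. */
    typedef struct { uint64_t tag, set, offset; } addr_parts;

    addr_parts split_address(uint64_t addr, unsigned s, unsigned b) {
        addr_parts p;
        p.offset = addr & ((1ULL << b) - 1);          /* low b bits     */
        p.set    = (addr >> b) & ((1ULL << s) - 1);   /* next s bits    */
        p.tag    = addr >> (s + b);                   /* remaining bits */
        return p;
    }

    int main(void) {
        /* Slide's example: 8-byte blocks (b = 3); s = 2 (S = 4 sets) is assumed. */
        int x = 42;
        addr_parts p = split_address((uint64_t)(uintptr_t)&x, 2, 3);
        printf("tag=%llx set=%llu offset=%llu\n",
               (unsigned long long)p.tag,
               (unsigned long long)p.set,
               (unsigned long long)p.offset);
        return 0;
    }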

E-way Set Associative Cache (E = 2)
E = 2: two lines per set. Assume: cache block size 8 bytes.
[Diagram: a short int's address splits into t tag bits, set-index bits ("0...01"), and a 3-bit block offset ("100"); each set holds two blocks]
- The set-index bits select one set ("find set")
- Compare both lines in the set: valid bit set and tag match? Yes: hit; the block offset then locates the short int (2 bytes) within the block
- No match: one line in the set is selected for eviction and replacement
  - Replacement policies: random, least recently used (LRU), ...

Core 2: Cache Associativity
L1/L2 caches use 64 B blocks. (Diagram not drawn to scale.)
- L1 I-cache and D-cache: 32 KB each, 8-way associative, latency ~3 cycles
- L2 unified cache: 6 MB, 16-way associative, latency ~16 cycles
- Main memory: ~4 GB, latency ~100 cycles
- Disk: ~500 GB (?), latency in the 10s of millions of cycles
Punchline: conflict misses are less of an issue nowadays; staying within on-chip cache capacity is key.

What about writes?
Multiple copies of the data exist: L1, L2, main memory, disk.
- What to do on a write-hit?
  - Write-through: write immediately to memory
  - Write-back: defer the write to memory until the line is replaced (needs a dirty bit to record whether the line differs from memory)
- What to do on a write-miss?
  - Write-allocate: load the block into the cache, update the line in the cache (good if more writes to the location follow)
  - No-write-allocate: write immediately to memory, bypassing the cache
- Typical pairings:
  - Write-through + no-write-allocate
  - Write-back + write-allocate

Understanding/Profiling Memory

Recall: UG Machine Memory Hierarchy
[Diagram: a multi-chip module with two processor chips; each chip has per-core L1 caches and a shared L2 cache]
- Per core: 32 KB, 8-way data cache and 32 KB, 8-way instruction cache
- Unified L2 cache: 12 MB total (2 x 6 MB), 16-way

Get Memory System Details: lstopo
Running lstopo on a UG machine gives:

    Machine (3829MB) + Socket #0
      L2 #0 (6144KB)
        L1 #0 (32KB) + Core #0 + PU #0 (phys=0)
        L1 #1 (32KB) + Core #1 + PU #1 (phys=1)
      L2 #1 (6144KB)
        L1 #2 (32KB) + Core #2 + PU #2 (phys=2)
        L1 #3 (32KB) + Core #3 + PU #3 (phys=3)

In other words: 4 GB RAM, 2 x 6 MB L2 cache, 2 cores per L2, 32 KB L1 cache per core.

Get More Cache Details: L1 dcache
ls /sys/devices/system/cpu/cpu0/cache/index0

    coherency_line_size: 64      // 64 B cache lines
    level: 1                     // L1 cache
    number_of_sets
    physical_line_partition
    shared_cpu_list
    shared_cpu_map
    size:
    type: data                   // data cache
    ways_of_associativity: 8     // 8-way set associative
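
If you prefer to query these values from a program, glibc exposes several of them through sysconf; a minimal sketch (assumes Linux with glibc, where these _SC_LEVEL* constants are available):

    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        /* These report the same data the sysfs files expose;
           sysconf returns -1 if a value is unavailable. */
        printf("L1d size:  %ld bytes\n", sysconf(_SC_LEVEL1_DCACHE_SIZE));
        printf("L1d assoc: %ld-way\n",   sysconf(_SC_LEVEL1_DCACHE_ASSOC));
        printf("L1d line:  %ld bytes\n", sysconf(_SC_LEVEL1_DCACHE_LINESIZE));
        printf("L2 size:   %ld bytes\n", sysconf(_SC_LEVEL2_CACHE_SIZE));
        return 0;
    }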

Get More Cache Details: L2 cache
ls /sys/devices/system/cpu/cpu0/cache/index2

    coherency_line_size: 64      // 64 B cache lines
    level: 2                     // L2 cache
    number_of_sets
    physical_line_partition
    shared_cpu_list
    shared_cpu_map
    size: 6144K
    type: Unified                // unified cache: holds both instructions and data
    ways_of_associativity: 24    // 24-way set associative

Access Hardware Counters: perf
The 'perf' tool makes hardware performance counters far easier to access than they used to be.
To measure L1 cache load misses for program foo, run:

    perf stat -e L1-dcache-load-misses foo
    7803 L1-dcache-load-misses # 0.000 M/sec

To see a list of all events you can measure:

    perf list

Note: you can measure multiple events at once.
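
For example, you could run the row-wise vs. column-wise traversal sketch from the locality slide under perf (the binary names here are hypothetical; -e accepts a comma-separated event list):

    perf stat -e L1-dcache-load-misses,L1-dcache-loads ./sum_rowwise
    perf stat -e L1-dcache-load-misses,L1-dcache-loads ./sum_colwise

The column-wise run should show a much higher L1-dcache-load-misses count for roughly the same number of loads, matching the locality argument above.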