CS 295: Modern Systems Cache And Memory System

CS 295: Modern Systems - Cache and Memory System. Sang-Woo Jun, Spring 2019

Watch the video "CPU Caches and Why You Care" by Scott Meyers. His books are great!

Some History
80386 (1985): last Intel desktop CPU with no on-chip cache (optional on-board cache chip, though!)
80486 (1989): 4 KB on-chip cache
Coffee Lake (2017): 64 KiB L1 per core, 256 KiB L2 per core, up to 2 MiB L3 per core (shared)
What is the Y-axis of the chart? Most likely the reciprocal of normalized latency.
Source: ExtremeTech, "How L1 and L2 CPU Caches Work, and Why They’re an Essential Part of Modern Chips," 2018

Memory System Architecture
[Diagram: two CPU packages connected by QPI/UPI. Each package contains several cores, each with private L1 I$, L1 D$, and L2 caches, a shared L3 cache per package, and locally attached DRAM.]

Memory System Bandwidth Snapshot
Core-to-cache bandwidth estimate: 64 bytes/cycle ~= 200 GB/s per core
DDR4 2666 MHz DRAM: 128 GB/s
Ultra Path Interconnect (QPI/UPI) between packages: 20.8 GB/s unidirectional
The memory/PCIe controller used to be on a separate "north bridge" chip, but is now integrated on-die. All sorts of things are now on-die, even network controllers! (Specialization!)

Cache Architecture Details
Numbers from modern Xeon processors (Broadwell to Kaby Lake):

Cache level | Size            | Latency (cycles)
L1          | 64 KiB          | < 5
L2          | 256 KiB         | < 20
L3          | ~2 MiB per core | < 50
DRAM        | 100s of GB      | > 100*

The L1 cache is typically divided into separate instruction and data caches, each 32 KiB. L1 cache accesses can be hidden in the pipeline; accesses to all other levels take a performance hit.
* DRAM subsystems are complicated entities themselves, and the latency/bandwidth of the same module varies by situation.
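The latency gap in this table can be observed directly with a pointer-chasing microbenchmark: each load depends on the previous one, so the measured time per hop tracks the latency of whatever level the working set fits into. Below is a minimal sketch of such a measurement (not from the slides; the working-set sizes, hop count, and helper names are illustrative choices):

// Hypothetical microbenchmark sketch; compile with e.g. gcc -O2 chase.c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

// Walk a random cycle of indices through a buffer of n slots. Every load
// depends on the previous one, so time-per-hop tracks the latency of the
// cache level the working set fits into.
static double ns_per_hop(size_t n, size_t hops) {
    size_t *next = malloc(n * sizeof(size_t));
    for (size_t i = 0; i < n; i++) next[i] = i;
    // Sattolo's algorithm: builds a single cycle, which also defeats the prefetcher
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = rand() % i;
        size_t t = next[i]; next[i] = next[j]; next[j] = t;
    }
    struct timespec t0, t1;
    size_t p = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < hops; i++) p = next[p];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    if (p == (size_t)-1) printf("never\n");  // keep the chase loop from being optimized out
    free(next);
    return ns / hops;
}

int main(void) {
    // Working sets chosen to land in L1, L2, L3, and DRAM respectively
    size_t kib[] = {16, 128, 1024, 65536};
    for (int i = 0; i < 4; i++)
        printf("%6zu KiB working set: %.1f ns per access\n",
               kib[i], ns_per_hop(kib[i] * 1024 / sizeof(size_t), 20 * 1000 * 1000));
    return 0;
}

As the working set grows past the L1, L2, and L3 sizes, the reported nanoseconds per access should step up roughly in line with the table above.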

Cache Lines Recap
Caches are managed in cache line granularity, typically 64 bytes for modern CPUs (64 bytes == 16 4-byte integers).
Reading and writing happen in cache line granularity:
Read one byte not in cache -> all 64 bytes are read from memory.
Write one byte -> eventually all 64 bytes are written back to memory.
Inefficient cache access patterns really hurt performance!
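As a hedged illustration of line granularity (the buffer size and strides below are my choices, not from the slides): both loops fetch exactly the same set of 64-byte lines, so on a large, memory-bound array they take roughly the same time even though the second loop does one sixteenth of the additions.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (64 * 1024 * 1024)   // 256 MiB of ints: far larger than any cache

static double seconds(void) {
    struct timespec t;
    clock_gettime(CLOCK_MONOTONIC, &t);
    return t.tv_sec + t.tv_nsec * 1e-9;
}

int main(void) {
    int *a = malloc((size_t)N * sizeof(int));
    for (int i = 0; i < N; i++) a[i] = i;
    long sum = 0;

    double t0 = seconds();
    for (int i = 0; i < N; i++) sum += a[i];       // touch all 16 ints in each line
    double t1 = seconds();
    for (int i = 0; i < N; i += 16) sum += a[i];   // touch 1 int per 64-byte line
    double t2 = seconds();

    // Both loops pull the same N/16 cache lines from memory, so their times are
    // similar even though the second does 16x fewer additions.
    printf("stride 1: %.3f s, stride 16: %.3f s (sum %ld)\n", t1 - t0, t2 - t1, sum);
    free(a);
    return 0;
}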

Cache Line Effects Example
Multiplying two 2048 x 2048 matrices: 16 MiB each, doesn't fit in any cache.
Machine: Intel i5-7400 @ 3.00 GHz. The time to transpose B is also counted.
A x B (naive): 63.19 seconds, vs. A x BT (B transposed first): 10.39 seconds, a 6x performance improvement!
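The slide does not include the code, but the two variants being compared look roughly like the sketch below (row-major float matrices stored as flat N*N arrays, and the function names, are my assumptions). In the naive version the inner loop walks B down a column, so consecutive accesses are N*4 bytes apart and essentially every access misses; transposing B first makes both inner-loop operands sequential.

#define N 2048
// Assumed layout: row-major matrices of N*N floats.

// Naive: b[k*N + j] strides through memory N elements at a time,
// touching a new cache line on every iteration of the k loop.
void multiply_naive(const float *a, const float *b, float *c) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            float sum = 0;
            for (int k = 0; k < N; k++)
                sum += a[i * N + k] * b[k * N + j];   // column walk: cache-hostile
            c[i * N + j] = sum;
        }
}

// Transposed: copy B into BT once (O(N^2) work, counted in the slide's 10.39 s),
// then both operands of the inner loop are read sequentially.
void multiply_transposed(const float *a, const float *b, float *c, float *bt) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            bt[j * N + i] = b[i * N + j];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            float sum = 0;
            for (int k = 0; k < N; k++)
                sum += a[i * N + k] * bt[j * N + k];  // both sequential: cache-friendly
            c[i * N + j] = sum;
        }
}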

Cache Prefetching
The CPU speculatively prefetches cache lines: while the CPU is working on the loaded 64 bytes, 64 more bytes are already being loaded.
The hardware prefetcher is usually not very complex/smart: sequential prefetching (N lines forward or backward) and strided prefetching.
Programmers can also provide prefetch hints: __builtin_prefetch(address, r/w, temporal_locality); for GCC.
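Prefetch hints are most useful for access patterns the hardware prefetcher cannot predict, such as indexed gathers or pointer chasing. A hedged sketch using GCC's builtin (the gather loop and the lookahead distance of 8 iterations are made up for illustration):

// Gather elements of 'data' through an index array. The hardware prefetcher
// cannot guess data[idx[i]], so we request the line a few iterations early.
long gather_sum(const int *data, const int *idx, int n) {
    long sum = 0;
    for (int i = 0; i < n; i++) {
        if (i + 8 < n)
            // args: address, 0 = read (1 = write), 3 = high temporal locality
            __builtin_prefetch(&data[idx[i + 8]], 0, 3);
        sum += data[idx[i]];
    }
    return sum;
}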

Cache Coherence Recap
We won't go into architectural details. Simply put: when a core writes a cache line, all other instances of that cache line need to be invalidated. Emphasis on cache line!

Issue #1: Capacity Considerations: Matrix Multiply
Performance is best when the working set fits into the cache, but as shown, even 2048 x 2048 doesn't fit in any cache -> 2048 * 2048 * 2048 elements are read from memory for matrix B.
Solution: divide and conquer! Blocked matrix multiply: with a 32 x 32 block size, only 2048 * 2048 * (2048/32) elements are read. (See the sketch below.)
C1 sub-matrix = A1×B1 + A1×B2 + A1×B3 + …
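A sketch of what the blocked version might look like, built on the transposed layout from the earlier sketch (the block size of 32 matches the slide; the code itself is my reconstruction): each 32 x 32 tile of 4-byte elements is 4 KiB, so the tiles of A, BT, and C being worked on all stay cache-resident while each of their elements is reused 32 times.

#define N  2048
#define BS 32    // 32 x 32 x 4-byte tile = 4 KiB, comfortably inside L1

// Blocked multiply over the transposed matrix BT from the previous sketch.
// Each (ii, jj, kk) step reuses one BS x BS tile of A and one of BT many times
// while they are still in cache, so B is streamed from DRAM only N/BS times
// instead of N times: 2048 * 2048 * (2048/32) element reads, as on the slide.
// The caller must zero-initialize c before the first call.
void multiply_blocked(const float *a, const float *bt, float *c) {
    for (int ii = 0; ii < N; ii += BS)
        for (int jj = 0; jj < N; jj += BS)
            for (int kk = 0; kk < N; kk += BS)
                for (int i = ii; i < ii + BS; i++)
                    for (int j = jj; j < jj + BS; j++) {
                        float sum = c[i * N + j];
                        for (int k = kk; k < kk + BS; k++)
                            sum += a[i * N + k] * bt[j * N + k];
                        c[i * N + j] = sum;
                    }
}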

Blocked Matrix Multiply Evaluations

Benchmark               | Elapsed (s) | Normalized performance
Naïve                   | 63.19       | 1
Transposed              | 10.39       | 6.08
Blocked Transposed      | 7.35        | 8.60
AVX Transposed          | 2.20        | 28.72
Blocked AVX Transposed  | 1.50        | 42.13
8-Thread AVX Transposed | 1.09        | 57.97

(The slide annotates some configurations as bottlenecked by computation and others by memory; the 8-thread AVX version is bottlenecked by memory and is not scaling.)
AVX Transposed reads from DRAM at 14.55 GB/s: 2048^3 * 4 bytes / 2.20 s = 14.55 GB/s.
The machine has 1x DDR4 2400 MHz -> 18.75 GB/s peak. Pretty close, considering DRAM is also used for other things (OS, etc.).
The multithreaded version gets 32 GB/s of effective bandwidth, above DRAM peak, thanks to cache effects with small chunks.

Issue #2: False Sharing
Different memory locations, written to by different cores, can map to the same cache line. Example: core 1 performing "results[0]++;" while core 2 performs "results[1]++;".
Remember cache coherence: every time a cache line is written, all other instances need to be invalidated! The "results" cache line is ping-ponged across the cache coherence protocol on every write.
This is bad when it happens on-chip, and terrible over the processor interconnect (QPI/UPI). Remember the case in the Scott Meyers video. (A sketch follows below.)
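A minimal pthread-based sketch of the pattern and the usual fix (the iteration count, names, and choice of pthreads are mine): in the shared layout, results[0] and results[1] live in the same 64-byte line, so every increment on one core invalidates the line in the other core's cache; aligning each counter to its own line removes the ping-pong.

// Compile with e.g. gcc -O2 -pthread false_sharing.c
#include <pthread.h>
#include <stdio.h>

#define ITERS 100000000L

// Shares one 64-byte line between the two counters -> false sharing.
long results[2];

// Fix: align each counter to its own cache line (sizeof(struct counter) == 64).
struct counter { _Alignas(64) long value; };
struct counter padded[2];

static void *bump(void *arg) {
    volatile long *p = arg;   // volatile so -O2 keeps the per-iteration stores
    for (long i = 0; i < ITERS; i++) (*p)++;
    return NULL;
}

int main(void) {
    pthread_t t[2];
    // Pass &results[0]/&results[1] to see the slow, falsely-shared version;
    // pass &padded[0].value/&padded[1].value to see the padded fix instead.
    pthread_create(&t[0], NULL, bump, &results[0]);
    pthread_create(&t[1], NULL, bump, &results[1]);
    pthread_join(t[0], NULL);
    pthread_join(t[1], NULL);
    printf("%ld %ld\n", results[0], results[1]);
    return 0;
}

On a typical multicore machine the padded version runs several times faster, even though both versions execute exactly the same number of increments.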

Results From Scott Meyers Video Again

Issue #3: Instruction Cache Effects
Instruction cache misses can also affect performance.
"Linux was routing packets at ~30 Mbps [wired], and wireless at ~20. Windows CE was crawling at barely 12 Mbps wired and 6 Mbps wireless. […] After we changed the routing algorithm to be more cache-local, we started doing 35 Mbps [wired], and 25 Mbps wireless – 20% better than Linux." – Sergey Solyanik, Microsoft
"[By organizing function calls in a cache-friendly way, we] achieved a 34% reduction in instruction cache misses and a 5% improvement in overall performance." – Mircea Livadariu and Amir Kleen, Freescale

Improving Instruction Cache Locality #1
Unfortunately, there is not much we can do explicitly. We can:
Be careful with loop unrolling and inlining. They reduce branching overhead, but they also reduce the effective I$ size. When gcc's -O3 performs slower than -O2, this is usually what's happening. Inlining is typically good only for very small functions.
Move conditionals to the front as much as possible: long runs of code with no branches are a good fit for the instruction cache and prefetcher. (See the sketch below.)
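One concrete way to move conditionals to the front is to hoist the rare rejection checks to the top of a function and mark them unlikely, so the compiler can lay out the common case as one straight, fall-through run of instructions. A hedged sketch using GCC's __builtin_expect (the packet handler and its checks are invented for illustration):

#include <stddef.h>

#define unlikely(x) __builtin_expect(!!(x), 0)
#define HDR_LEN 20

struct packet { int len; int csum_bad; /* ... payload ... */ };

void route(const struct packet *p);     // hot-path work, defined elsewhere
void account(const struct packet *p);

// All the rare rejection checks come first and are marked unlikely, so gcc
// places their (cold) bodies out of line and the common path below is one
// contiguous fall-through sequence: friendly to the I-cache and prefetcher.
int handle_packet(const struct packet *p) {
    if (unlikely(p == NULL))        return -1;
    if (unlikely(p->len < HDR_LEN)) return -1;
    if (unlikely(p->csum_bad))      return -1;

    route(p);
    account(p);
    return 0;
}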

Improving Instruction Cache Locality #2
Organize function calls to create temporal locality: take the sequential algorithm and change its ordering for cache locality, balancing the batch size to limit the extra memory footprint. (See the sketch below.)
Livadariu et al., "Optimizing for instruction caches," EETimes
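The reordering idea can be sketched as follows (the stage names, item type, and batch size are all invented for illustration; this is not the article's code): instead of pushing each item through every stage, so that each stage's code evicts the previous stage's from the I-cache, run one stage over a whole batch before moving on. The batch size is the balance knob: larger batches reuse each stage's instructions more, but keep more data live at once.

#define BATCH 64   // tuning knob: larger = better I-cache reuse, more data footprint

typedef struct { int id; float payload[16]; } item_t;

void stage_parse(item_t *it);    // each stage is assumed to be a sizable
void stage_filter(item_t *it);   // chunk of code, defined elsewhere
void stage_emit(item_t *it);

// Item-at-a-time: all three stages' instructions compete for the I-cache
// on every single item.
void process_sequential(item_t *items, int n) {
    for (int i = 0; i < n; i++) {
        stage_parse(&items[i]);
        stage_filter(&items[i]);
        stage_emit(&items[i]);
    }
}

// Stage-at-a-time over small batches: one stage's instructions stay hot in the
// I-cache while they are reused BATCH times before the next stage's code loads.
void process_batched(item_t *items, int n) {
    for (int base = 0; base < n; base += BATCH) {
        int end = (base + BATCH < n) ? base + BATCH : n;
        for (int i = base; i < end; i++) stage_parse(&items[i]);
        for (int i = base; i < end; i++) stage_filter(&items[i]);
        for (int i = base; i < end; i++) stage_emit(&items[i]);
    }
}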

https://frankdenneman.nl/2016/07/11/numa-deep-dive-part-3-cache-coherency/