Lecture 5a: CPU architecture 101 (Boris)

2 High-level Computer Architecture: Haswell motherboard

3 High-level Computer Architecture

4 High-level CPU architecture: Haswell (4th-generation Core) = CPU + GPU + L3$ + system I/O

5 Core u-architecture: Front End, Out-of-Order, Execution, Memory

6 Core: Front-End
The front-end:
– brings instructions into the core
– performs branch prediction
– translates variable-length x86 instructions into fixed-size u-ops

7 Core: Out-of-Order
• Register renaming:
– maps architectural x86 registers onto the physical register files (PRFs)
– allocates other resources: load, store, and branch buffer entries, and scheduler entries
• Scheduling: instructions are issued for execution in an order governed by the availability of their input data, rather than by their original program order

8 Core: Execution
• Parallel execution of multiple instructions
• AVX2 instructions:
– 256b integer operations
– 256b FMA (fused multiply-add)
– 256b vector loads (gather)

9 Core: Memory subsystem
Translates virtual addresses to physical:
• 2-level TLB (translation look-aside buffer)
2 levels of data cache per core:
• L1$ = 32KB, L2$ = 256KB
• Cache line = 64B

10 Virtual Address Translation
• Translation is done per page (1 page = 4KB)
– TLB (translation look-aside buffer): a cache of recently translated pages
• Example: a 1024x1024 array of 4-byte elements (one row = 4KB = 1 page)
– traversing a row touches 1 page => 1 TLB entry
– traversing a column touches 1024 pages => 1024 TLB entries

11 Prefetching

12 Array Prefetching
Data can be speculatively loaded into the data cache using SW prefetching or HW prefetching.
• Explicit "fetch" instructions
– Streaming SIMD Extensions (SSE) prefetch instructions enable software-controlled prefetching. These instructions are hints to bring a cache line of data into the desired level of the cache hierarchy.
– Cons: additional instructions executed
• Hardware-based
– dedicated prefetch hardware
– Cons: unnecessary prefetches (no compile-time information)

13 SW Prefetching Example: Vector Product
• No prefetching:

    for (i = 0; i < N; i++) {
        sum += a[i]*b[i];
    }

Assume each cache line holds 4 elements => 2 misses per 4 iterations (one line for a, one for b).
• Simple prefetching (fetch() stands for a prefetch hint, e.g. GCC/Clang's __builtin_prefetch):

    for (i = 0; i < N; i++) {
        fetch(&a[i+1]);
        fetch(&b[i+1]);
        sum += a[i]*b[i];
    }

• Problem: unnecessary prefetch operations — each cache line is prefetched 4 times, once per element.

14 SW Prefetching Example: Vector Product (Cont.)
• Prefetching + loop unrolling (one prefetch per cache line instead of per element):

    for (i = 0; i < N; i += 4) {
        fetch(&a[i+4]);
        fetch(&b[i+4]);
        sum += a[i]*b[i];
        sum += a[i+1]*b[i+1];
        sum += a[i+2]*b[i+2];
        sum += a[i+3]*b[i+3];
    }

• Problem: the first and last iterations. Fix with a prologue and epilogue:

    fetch(&sum);
    fetch(&a[0]);
    fetch(&b[0]);
    for (i = 0; i < N-4; i += 4) {
        fetch(&a[i+4]);
        fetch(&b[i+4]);
        sum += a[i]*b[i];
        sum += a[i+1]*b[i+1];
        sum += a[i+2]*b[i+2];
        sum += a[i+3]*b[i+3];
    }
    for (i = N-4; i < N; i++)
        sum = sum + a[i]*b[i];

15 HW prefetchers
SW prefetching is difficult: you need to know a lot about the HW
– cache line size, latency of operations, time required for a DRAM access, ...
Good news: there are a lot of HW prefetchers
– L1$ (DCU): streaming and IP-based prefetchers
– L2$: spatial (pair-of-cache-lines) prefetcher, streamer, ...

16 Core: SMT (Simultaneous Multi-Threading)
A core supports 2 active logical threads:
– if one thread is stalled (e.g. on a TLB or cache miss), the other thread can keep working => better utilization of execution units
– all resources (register files, buffers, caches) are shared between the 2 threads
– can be very useful when working with large graphs or sparse matrices
The OS sees two virtual cores where in fact there is one physical core running 2 SMT threads.
SMT can decrease performance if any of the shared resources is the performance bottleneck:
– for example, dense matrix multiplication or convolutional NNs

17 Basic Rules of Thumb for Fast Code
• Arrays are good
– access by row is much faster than access by column
– vectorization can improve the speed of your code by up to 10x
• Branches are bad
– a computation often costs less than a branch misprediction
• Think memory
– a cache miss is expensive
– align data to cache lines
– prefetchers: sometimes good, sometimes bad
– a page miss is expensive, and the TLB (the cache for virtual-to-physical address translation) is small
• There are many cores inside, use them
– OpenMP, pthreads, ...
– SMT: sometimes good, sometimes bad