Lecture 5a: CPU architecture 101


1 Lecture 5a: CPU architecture 101
boris.ginsburg@gmail.com

2 High-level Computer Architecture: Haswell motherboard

3 High-level Computer Architecture

4 High-level CPU architecture
Haswell (4th generation Core) = CPU + GPU + L3$ + System I/O

5 Core u-architecture: Front End, Out-of-Order, Execution, Memory

6 Core: Front-End
http://www.realworldtech.com/haswell-cpu/
Front-end:
• brings instructions into the core
• performs branch prediction
• translates variable-length instructions into fixed-size u-ops

7 Core: Out-of-Order
• Register renaming:
  – maps architectural x86 registers onto the physical register files (PRFs)
  – allocates other resources: load, store, and branch buffer entries, and scheduler entries
• Scheduling: instructions are scheduled for execution in an order governed by the availability of their input data, rather than by their original order in the program

8 Core: Execution
http://www.realworldtech.com/haswell-cpu/
• Parallel execution of multiple instructions
• AVX2 instructions:
  – 256b integer operations
  – 256b FMA (fused multiply-add)
  – 256b vector load (gather)

9 Core: Memory subsystem
http://www.realworldtech.com/haswell-cpu/
Translate virtual address to physical:
• 2-level TLB (translation look-aside buffer)
2-level Data$ per core:
• L1$ = 32KB, L2$ = 256KB
• Cache line = 64B
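The cache figures above can be turned into simple arithmetic. The following is a minimal sketch, assuming the 64B line size from the slide; the helper name `cache_lines_spanned` is hypothetical and only illustrates why the starting address of a buffer matters.

```c
#include <stddef.h>
#include <stdint.h>

/* Cache line size taken from the slide (64B). */
#define CACHE_LINE_BYTES 64

/* Hypothetical helper: number of cache lines covered by a buffer of
   `nbytes` bytes that starts at address `addr`. */
size_t cache_lines_spanned(uintptr_t addr, size_t nbytes)
{
    if (nbytes == 0) {
        return 0;
    }
    uintptr_t first = addr / CACHE_LINE_BYTES;
    uintptr_t last = (addr + nbytes - 1) / CACHE_LINE_BYTES;
    return (size_t)(last - first + 1);
}
```

With these numbers, a 32KB L1$ holds 512 lines, and a 64-byte object that starts in the middle of a line occupies two lines instead of one.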

10 Virtual Address Translation
• Translation is done per page; 1 page = 4K
  – TLB (translation look-aside buffer): a cache of translated pages
• Example:
  – Array 1024x1024 with 4-byte elements, so one row fills exactly one page
  – Row = 1 page → 1 entry in TLB
  – Column = 1024 pages → 1024 entries in TLB
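The row-versus-column example can be checked with a few lines of arithmetic. This is a sketch under stated assumptions: the array starts on a page boundary, elements are 4-byte `float`s, and the names `pages_touched`, `row_walk_pages`, and `column_walk_pages` are made up for illustration.

```c
#include <stddef.h>

#define DIM 1024         /* the array is DIM x DIM, as in the slide */
#define PAGE_BYTES 4096  /* 4K pages, as in the slide */

/* Count the distinct pages touched by `count` accesses that start at byte
   offset `start` (relative to a page-aligned base) and advance by
   `stride_bytes` each step. The walk is monotonic, so comparing with the
   previous page is enough to detect a new one. */
size_t pages_touched(size_t start, size_t stride_bytes, size_t count)
{
    size_t pages = 0;
    size_t last_page = (size_t)-1;
    for (size_t k = 0; k < count; k++) {
        size_t page = (start + k * stride_bytes) / PAGE_BYTES;
        if (page != last_page) {
            pages++;
            last_page = page;
        }
    }
    return pages;
}

/* Walking one row: consecutive elements, stride of one element. */
size_t row_walk_pages(void)
{
    return pages_touched(0, sizeof(float), DIM);
}

/* Walking one column: stride of one full row. */
size_t column_walk_pages(void)
{
    return pages_touched(0, DIM * sizeof(float), DIM);
}
```

Under these assumptions a row walk stays inside a single page, while a column walk lands on a new page at every step and therefore needs a new translation each time.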

11 Prefetching

12 Array Prefetching
Data can be speculatively loaded into the DCache using SW prefetching or HW prefetching.
• Explicit "fetch" instructions
  – Streaming SIMD Extensions (SSE) provide prefetch instructions that enable software-controlled prefetching. These instructions are hints to bring a cache line of data into the desired levels of the cache hierarchy.
  – Cons: additional instructions executed
• Hardware-based
  – Special hardware
  – Cons: unnecessary prefetches (no compile-time information)

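13 SW Prefetching Example: Vector Product
• No prefetching
    for (i = 0; i < N; i++) {
        sum += a[i]*b[i];
    }
  – Assume each cache line holds 4 elements → 2 misses per 4 iterations (one for a, one for b)
• Simple prefetching
    for (i = 0; i < N; i++) {
        fetch(&a[i+1]);   /* hint: bring the line holding a[i+1] into the cache */
        fetch(&b[i+1]);
        sum += a[i]*b[i];
    }
  – Problem: unnecessary prefetch operations (in 3 of every 4 iterations the requested element is in the same line as the current one)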

14 SW Prefetching Example: Vector Product (Cont.)
• Prefetching + loop unrolling
    for (i = 0; i < N; i+=4) {
        fetch(&a[i+4]);   /* one prefetch per cache line of 4 elements */
        fetch(&b[i+4]);
        sum += a[i]*b[i];
        sum += a[i+1]*b[i+1];
        sum += a[i+2]*b[i+2];
        sum += a[i+3]*b[i+3];
    }
  – Problem: first and last iterations (the first lines are never prefetched, and the last iteration prefetches past the end of the arrays); handle them separately:
    /* assumes N is a multiple of 4 and N >= 4 */
    fetch(&sum);
    fetch(&a[0]);
    fetch(&b[0]);
    for (i = 0; i < N-4; i+=4) {
        fetch(&a[i+4]);
        fetch(&b[i+4]);
        sum += a[i]*b[i];
        sum += a[i+1]*b[i+1];
        sum += a[i+2]*b[i+2];
        sum += a[i+3]*b[i+3];
    }
    for (i = N-4; i < N; i++)
        sum = sum + a[i]*b[i];
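The slides' `fetch` is pseudocode. One possible concrete form, offered as a sketch rather than as the lecture's own implementation, uses the GCC/Clang builtin `__builtin_prefetch`. The function name `dot_prefetch` and the look-ahead distance `PREFETCH_AHEAD` are assumptions made for illustration; a useful distance depends on the machine and would need to be measured.

```c
#include <stddef.h>

/* Assumed look-ahead distance in elements (hypothetical tuning value).
   With 64B lines and 4-byte floats, 16 elements is one cache line. */
#define PREFETCH_AHEAD 16

/* Dot product with 4x unrolling and software prefetch hints.
   Works for any n; the tail loop handles the leftover elements. */
float dot_prefetch(const float *a, const float *b, size_t n)
{
    float sum = 0.0f;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        /* Only form addresses that lie inside the arrays. */
        if (i + PREFETCH_AHEAD < n) {
            __builtin_prefetch(&a[i + PREFETCH_AHEAD], 0, 3); /* read, keep in all levels */
            __builtin_prefetch(&b[i + PREFETCH_AHEAD], 0, 3);
        }
        sum += a[i] * b[i];
        sum += a[i + 1] * b[i + 1];
        sum += a[i + 2] * b[i + 2];
        sum += a[i + 3] * b[i + 3];
    }

    /* Remaining 0..3 elements. */
    for (; i < n; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}
```

The bounds check replaces the separate prologue and epilogue loops of the slide at the cost of one extra branch per unrolled iteration. For a simple sequential access pattern like this one, the hardware prefetchers may already do the job, so any gain should be verified by measurement.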

15 HW prefetchers
SW prefetching is difficult; you need to know a lot about the HW:
– cache line size, latency of operations, time required for DRAM access, …
Good news: there are a lot of HW prefetchers:
– L1$ (DCU): streaming and IP-based prefetchers
– L2$: spatial (pair of cache lines) prefetcher, streamer, …

16 Core: SMT (Simultaneous Multi-Threading)
The core supports 2 active logical threads:
– if one thread is stalled (e.g. on a TLB or cache miss), the other thread can work → better utilization of execution units
– all resources (RF, buffers, caches) are shared between the 2 threads
– can be very useful when working with large graphs or sparse matrices
The OS sees two virtual cores where in fact there is one physical core with 2 SMT threads.
SMT can decrease performance if any of the shared resources is a performance bottleneck:
– for example, for dense matrix multiplication or convolutional NNs

17 Basic Rules of Thumb for Fast Code
• Arrays are good
  – access by row is much faster than access by column
  – vectorization can improve the speed of your code by 10x
• Branches are bad
  – compute costs less than a branch misprediction
• Think memory
  – a cache miss is expensive
  – cache line alignment
  – prefetchers: sometimes good, sometimes bad
  – a page miss is expensive, and the TLB (cache for translation of virtual addresses to physical) is small
• There are many cores inside; use them
  – OpenMP, pthreads, …
  – SMT: sometimes good, sometimes bad
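The "branches are bad" rule can be illustrated by trading a data-dependent branch for a little extra arithmetic. The sketch below is an illustration only; the function names are hypothetical, and a modern compiler may already turn the first version into branch-free code on its own.

```c
#include <stddef.h>
#include <stdint.h>

/* Branchy version: the `if` depends on the data, so on unpredictable
   input the branch predictor will often guess wrong. */
int64_t sum_at_least_branchy(const int32_t *x, size_t n, int32_t threshold)
{
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (x[i] >= threshold) {
            sum += x[i];
        }
    }
    return sum;
}

/* Branch-free version: the comparison yields 0 or 1, which is turned
   into a mask of all zeros or all ones and applied to the element. */
int64_t sum_at_least_branchless(const int32_t *x, size_t n, int32_t threshold)
{
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        int64_t mask = -(int64_t)(x[i] >= threshold);
        sum += (int64_t)x[i] & mask;
    }
    return sum;
}
```

Both functions return the same result; the second always performs the add, which is cheap compared with a mispredicted branch and is also easier for the compiler to vectorize.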

