1 Lecture 5a: CPU architecture 101 boris.
2 High-level Computer Architecture Haswell motherboard
3 High-level Computer Architecture
4 High-level CPU architecture Haswell (4th generation Core) = CPU + GPU + L3$ + System I/O
5 Core µ-architecture: Front End, Out-of-Order, Execution, Memory
6 Core: Front-End
The front end:
– brings instructions into the core
– performs branch prediction
– translates variable-length instructions into fixed-size µ-ops
7 Core: Out-of-Order
Register renaming:
– maps architectural x86 registers onto the physical register files (PRFs)
– allocates other resources: load, store, and branch buffer entries, and scheduler entries
Scheduling: instructions are issued for execution in an order governed by the availability of their input data rather than by their original order in the program
8 Core: Execution
Parallel execution of multiple instructions
AVX2 instructions:
– 256-bit integer operations
– 256-bit FMA (fused multiply-add)
– 256-bit vector load (gather)
9 Core: Memory subsystem
Translates virtual addresses to physical addresses: 2-level TLB (translation look-aside buffer)
2-level data cache per core: L1$ = 32 KB, L2$ = 256 KB
Cache line = 64 B
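The sizes above translate into simple cache-line arithmetic. The sketch below is only an illustration: the helper names `lines_in_cache` and `lines_touched` are made up for this example, and the constants are the figures quoted on the slide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sizes quoted on the slide (per core). */
enum {
    CACHE_LINE_BYTES = 64,
    L1D_BYTES = 32 * 1024,
    L2_BYTES  = 256 * 1024
};

/* Number of cache lines a cache of the given size can hold. */
static size_t lines_in_cache(size_t cache_bytes)
{
    assert(cache_bytes % CACHE_LINE_BYTES == 0);
    return cache_bytes / CACHE_LINE_BYTES;
}

/* Number of distinct cache lines covered by the byte range
   [addr, addr + len). A small object that straddles a line boundary
   costs two lines instead of one, which is why alignment matters. */
static size_t lines_touched(uintptr_t addr, size_t len)
{
    if (len == 0)
        return 0;
    uintptr_t first = addr / CACHE_LINE_BYTES;
    uintptr_t last  = (addr + len - 1) / CACHE_LINE_BYTES;
    return (size_t)(last - first + 1);
}
```

With these numbers, L1$ holds 512 lines and L2$ holds 4096 lines; an 8-byte value starting at byte offset 60 spans two lines.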
10 Virtual Address Translation
Translation is done per page; 1 page = 4 KB
– TLB (translation look-aside buffer): a cache of translated pages
Example:
– Array of 1024x1024 elements (a row fills one page, which implies 4-byte elements)
– Traversing a row touches 1 page: 1 entry in the TLB
– Traversing a column touches 1024 pages: 1024 entries in the TLB
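The row/column example can be checked by counting pages directly. This is a minimal sketch under stated assumptions: 4-byte elements, row-major layout, and an array that starts on a page boundary. The function name `pages_touched` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum { PAGE_BYTES = 4096 };

/* Count the distinct 4 KB pages touched when visiting `count` elements
   of `elem_bytes` bytes each, starting at byte offset 0 and stepping by
   `stride_elems` elements. Offsets only grow, so comparing each page
   number with the previous one is enough to count distinct pages. */
static size_t pages_touched(size_t count, size_t stride_elems, size_t elem_bytes)
{
    assert(elem_bytes > 0);
    size_t pages = 0;
    size_t prev_page = (size_t)-1;   /* sentinel: no page seen yet */
    for (size_t k = 0; k < count; k++) {
        size_t page = (k * stride_elems * elem_bytes) / PAGE_BYTES;
        if (page != prev_page) {
            pages++;
            prev_page = page;
        }
    }
    return pages;
}
```

For a 1024x1024 array, a row walk is `pages_touched(1024, 1, 4)` and gives 1 page; a column walk is `pages_touched(1024, 1024, 4)` and gives 1024 pages, one new translation per element.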
11 Prefetching
12 Array Prefetching
Data can be speculatively loaded into the data cache using SW prefetching or HW prefetching
Explicit "fetch" instructions (software):
– Streaming SIMD Extensions (SSE) prefetch instructions enable software-controlled prefetching. These instructions are hints to bring a cache line of data into the desired level of the cache hierarchy.
– Cons: additional instructions are executed
Hardware-based:
– Special hardware
– Cons: unnecessary prefetches (no compile-time information)
13 SW Prefetching Example: Vector Product
No prefetching:
    for (i = 0; i < N; i++) {
        sum += a[i]*b[i];
    }
Assume each cache line holds 4 elements: 2 misses per 4 iterations (one for a, one for b)
Simple prefetching:
    for (i = 0; i < N; i++) {
        fetch(&a[i+1]);
        fetch(&b[i+1]);
        sum += a[i]*b[i];
    }
Problem:
– Unnecessary prefetch operations: only 1 prefetch in 4 targets a new cache line; the others request a line that is already on its way
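To put a number on the unnecessary prefetches, the sketch below replays the address stream of the simple-prefetching loop for one array and counts how many prefetches target a new cache line. It is a counting model only, not a hardware simulation; `prefetch_stats` and `simple_prefetch_stats` are names invented for this example, and the array is assumed to start on a cache-line boundary.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    size_t issued;   /* prefetch operations executed */
    size_t useful;   /* prefetches that target a not-yet-requested line */
} prefetch_stats;

/* Model of: for (i = 0; i < n; i++) { fetch(&a[i+1]); ... } */
static prefetch_stats simple_prefetch_stats(size_t n, size_t elems_per_line)
{
    assert(elems_per_line > 0);
    prefetch_stats s = {0, 0};
    size_t last_line = 0;   /* line of a[0], brought in by the first load */
    for (size_t i = 0; i < n; i++) {
        size_t line = (i + 1) / elems_per_line;   /* line holding a[i+1] */
        s.issued++;
        if (line != last_line) {
            s.useful++;
            last_line = line;
        }
    }
    return s;
}
```

With n = 16 and 4 elements per line, 16 prefetches are issued but only 4 target a new line (the last of those points just past the end of the array). The same holds for b, so 24 of the 32 prefetch instructions are wasted work.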
14 SW Prefetching Example: Vector Product (Cont.)
Prefetching + loop unrolling:
    for (i = 0; i < N; i+=4) {
        fetch(&a[i+4]);
        fetch(&b[i+4]);
        sum += a[i]*b[i];
        sum += a[i+1]*b[i+1];
        sum += a[i+2]*b[i+2];
        sum += a[i+3]*b[i+3];
    }
Problem:
– First and last iterations: the first cache lines are never prefetched, and the last iteration prefetches past the end of the arrays
Fix: prefetch before the loop and peel off the last 4 elements (assumes N is a multiple of 4):
    fetch(&sum);
    fetch(&a[0]);
    fetch(&b[0]);
    for (i = 0; i < N-4; i+=4) {
        fetch(&a[i+4]);
        fetch(&b[i+4]);
        sum += a[i]*b[i];
        sum += a[i+1]*b[i+1];
        sum += a[i+2]*b[i+2];
        sum += a[i+3]*b[i+3];
    }
    for (i = N-4; i < N; i++)
        sum += a[i]*b[i];
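The `fetch` calls above are pseudocode. A compilable sketch of the same loop structure is shown below, assuming GCC or Clang, where `__builtin_prefetch` is available as a prefetch hint. The function name `dot_prefetch` and the constant `UNROLL` are choices made for this example, and the 4-element unroll only mirrors the slide's assumption; with 64-byte lines and 8-byte doubles a real line holds 8 elements. Unlike the slide version, this one accepts any N.

```c
#include <assert.h>
#include <stddef.h>

#define UNROLL 4   /* mirrors the slide's "4 elements per cache line" */

static double dot_prefetch(const double *a, const double *b, size_t n)
{
    assert(a != NULL && b != NULL);
    double sum = 0.0;
    size_t i = 0;

    /* Prologue: request the first lines before the loop starts. */
    if (n > 0) {
        __builtin_prefetch(&a[0]);
        __builtin_prefetch(&b[0]);
    }

    /* Main loop: runs only while the prefetch target a[i + UNROLL]
       and the block after it are still inside the arrays. */
    for (; i + 2 * UNROLL <= n; i += UNROLL) {
        __builtin_prefetch(&a[i + UNROLL]);
        __builtin_prefetch(&b[i + UNROLL]);
        sum += a[i]     * b[i];
        sum += a[i + 1] * b[i + 1];
        sum += a[i + 2] * b[i + 2];
        sum += a[i + 3] * b[i + 3];
    }

    /* Epilogue: remaining elements, no prefetching. */
    for (; i < n; i++)
        sum += a[i] * b[i];

    return sum;
}
```

Whether this beats the plain loop depends on the machine: for a simple sequential access pattern like this one, hardware prefetchers often do the job already, so it is worth measuring before keeping the extra instructions.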
15 HW prefetchers
SW prefetching is difficult; you need to know a lot about the HW:
– cache line size, latency of operations, time required for a DRAM access, ...
Good news: there are a lot of HW prefetchers
– L1$ (DCU): streaming and IP-based prefetchers
– L2$: spatial (pair of cache lines) prefetcher, streamer, ...
16 Core: SMT (Simultaneous Multi-Threading)
The core supports 2 active logical threads per core:
– if one thread is stalled (e.g. on a TLB or cache miss), the other thread can run: better utilization of the execution units
– All resources (RF, buffers, caches) are shared between the 2 threads
– Can be very useful when working with large graphs or sparse matrices
The OS sees two virtual cores where in fact there is one physical core with 2 SMT threads.
SMT can decrease performance if any of the shared resources is a performance bottleneck:
– For example, for dense matrix multiplication or convolutional NNs
17 Basic Rules of Thumb for Fast Code
Arrays are good
– access by row is much faster than access by column
– vectorization can improve the speed of your code by 10x
Branches are bad
– Computation costs less than a branch misprediction
Think memory
– A cache miss is expensive
– Cache line alignment
– Prefetchers: sometimes good, sometimes bad
– A page miss is expensive, and the TLB (cache for virtual-to-physical address translations) is small
There are many cores inside; use them
– OpenMP, pthreads, ...
– SMT: sometimes good, sometimes bad
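As an illustration of the "branches are bad" rule, the sketch below computes the same conditional sum twice: once with a data-dependent branch and once with arithmetic only. The function names are invented for this example. Compilers may already turn the first form into branch-free code, so this shows the idea rather than a guaranteed speedup.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sum of all elements >= t, using a data-dependent branch.
   On unpredictable data the branch is mispredicted often. */
static int64_t sum_ge_branchy(const int32_t *x, size_t n, int32_t t)
{
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (x[i] >= t)
            sum += x[i];
    }
    return sum;
}

/* Same result without a branch in the loop body: the comparison
   yields 0 or 1, and multiplying by it keeps or discards x[i]. */
static int64_t sum_ge_branchless(const int32_t *x, size_t n, int32_t t)
{
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += (int64_t)x[i] * (x[i] >= t);
    return sum;
}
```

The branchless version always does the multiply and add, trading a little extra computation for the absence of mispredictions; it also tends to be easier for the compiler to vectorize.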