Time-predictability of a computer system
Master project in progress
By Wouter van der Put
2 How long does it take?
3 Goal – Problem, approach and final goal
- Problem: How to meet timing requirements on an x86 multi-core, multi-CPU computer system?
- Method: Investigate, characterise and give advice to increase the time-predictability of x86 multi-core, multi-CPU computer systems
- Final goal: Advise how to maximise time-predictability, minimise latency and maximise throughput
4 Overview – Time-predictability
- Influenced by (bottom-up approach):
  - Hardware: processor (architecture), memory (hierarchy), system architecture (motherboard)
  - Software: operating system (scheduling), algorithms and their data (regularity)
- Approach:
  - Theory: explore (CPU) architectures
  - Practice: perform measurements
  - Conclusion
- Focus on a contemporary architecture: quad-core dual-CPU Intel Nehalem server (next slide)
5 Time-predictability
[Figure: white = observed behaviour, black = reality]
6 Overview – Nehalem/Tylersburg architecture
[Diagram: two Nehalem-EP CPUs with DDR3 memory, linked by QPI to each other and to the Tylersburg 36D IOH; the IOH provides PCIe lanes (2x16, 8x4, 4x8, 1x4, 2x2) and an ESI link to the ICH10R]
7 Overview – Processor (architecture diagram repeated)
8 Processor – Theory – Time-predictability
- Designed to improve average-case latency:
  - Memory access: caches reduce the average access time
  - Hazards: prediction reduces the average impact
- Complexity increases:
  - Time-predictability becomes almost impossible to describe
  - The Instruction Set Architecture expands (next slide)
9 Processor – Theory – Historical overview
10 Processor – Theory – Nehalem architecture
- Found in novel processors: Core i7 & Xeon 5500 series
- 3 cache levels
- 2 TLB levels
- 2 branch predictors
- Out-of-order execution
- Simultaneous multithreading
- Loop stream decoder
- Dynamic frequency scaling
11 Processor – Theory – Nehalem pipeline (1/2)
[Pipeline diagram: instruction fetch and predecode → instruction queue → decode (with microcode ROM) → rename/allocate → scheduler → execution unit clusters 0, 1 and 5 plus load/store units → retirement unit (re-order buffer); backed by the L1D cache and DTLB, the L2 cache, and an L3 cache inclusive of all cores; QPI interconnect]
12 Processor – Theory – Nehalem pipeline (2/2)
13 Processor – Theory – Hazards
- Negative impact on time-predictability:
  - Data hazards: RAW, WAR and WAW dependencies
  - Structural hazards: functional unit in use
    - Stall
    - SMT
  - Control hazards: exception and interrupt handling (irregular)
    - Branch hazards: branch misprediction penalty (next slide)
14 Processor – Practice – Branch prediction

    for (a = 0; a < N; a++) {       // loop bound elided on the slide
        if (random < BranchPoint)   // BranchPoint swept from 0% to 100%
            DoSomething;
        else
            DoSomething;
    }

Result: making the branch predictable lowers latency by up to 30%.
15 Processor – Conclusion
- Branch prediction:
  - Make your branches predictable (lowers latency by up to 30%)
  - If branches are input-dependent, time-predictability decreases
- Other features increase throughput but decrease time-predictability:
  - Out-of-order execution
  - Simultaneous multithreading
  - Loop stream decoder
  - Dynamic frequency scaling
16 Overview – Memory hierarchy (architecture diagram repeated)
17 Memory hierarchy – Theory – Overview (1/2)

| Level | Capacity    | Associativity (ways) | Line size (bytes) | Access latency (clocks) | Access throughput (clocks) | Write update policy |
| L1D   | 4 x 32 KiB  | 8 | 64  | 4  | 1      | Writeback |
| L1I   | 4 x 32 KiB  | 4 | N/A |    |        |           |
| L2U   | 4 x 256 KiB | 8 | 64  | 10 | Varies | Writeback |
| L3U   | 1 x 8 MiB   |   |     |    | Varies | Writeback |

(Blank cells were not given on the slide.)
18 Memory hierarchy – Theory – Overview (2/2)

|         | Hit rate | Access time (clock cycles) |
| L1$     | 95%      | 4     |
| L2$     | 95%      | 10    |
| L3$     | 95%      | 40    |
| Mem     |          | 100   |
| Minimum |          | 4     |
| Average |          | 4.383 |
| Maximum |          | 100   |

- Goal: minimise the average latency
- Result:
  - The program (and its input) influences the hit rate and thus the average latency
  - Input may influence time-predictability
19 Memory hierarchy – Theory – Caches (1/2)
- Negative impact on time-predictability:
  - Locality of reference:
    - Temporal locality
    - Spatial locality (sequential, equidistant and branch locality)
  - Write policy:
    - Write-through (latency: write = 1, read = 1)
    - Write-back (latency: write = 0, read = 2)
20 Memory hierarchy – Theory – Caches (2/2)
- Negative impact on time-predictability:
  - Cache types:
    - Instruction cache
    - Data cache
    - Translation Lookaside Buffer (TLB)
  - (Non-)blocking caches
  - Placement policy (associativity):
    - Fully associative
    - N-way set associative
    - Direct mapped (1-way associative)
21 Memory hierarchy – Practice – Method

Assembly (no compiler); sizes, offsets and counts elided on the slide:

    .code
    start:
        mov eax, alloc(...)    ; Begin: allocate a variable number of bytes
        mov ecx, 0             ; For ecx = 0 to BIG_NUMBER (run for 10 s)
    loopy:
        mov ebx, [eax + ...]   ; Read random data from the array
        ...                    ; (100,000x)
        mov ebx, [eax + ...]   ; Read random data from the array
        inc ecx                ; Next ecx
        cmp ecx, ...
        jnz loopy
        free eax               ; Free the memory
        exit                   ; End
    end start
22 Memory hierarchy Practice – Results (1/3)
23 Memory hierarchy Practice – Results (2/3)
24 Memory hierarchy Practice – Results (3/3)
25 Memory hierarchy – Conclusion
- Stay in the cache (here 4 x 32 KiB L1 / 2 x 6 MiB L2), e.g. by splitting a large dataset into smaller pieces
- Possible speed gain of more than 50x!
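Splitting a large dataset into cache-sized pieces (often called cache blocking or tiling) can be sketched as follows. This is an illustrative example, not code from the project: each 32 KiB block of a large array is processed completely before the next block is touched, so repeated passes hit in L1 instead of streaming the whole array from memory every pass:

```c
#include <stddef.h>

#define BLOCK_BYTES (32 * 1024)   /* sized to fit one 32 KiB L1D cache */

/* Process a large array block by block: all `passes` over one block
 * finish before moving on, so the working set stays cache-resident. */
static long blocked_sum(const int *data, size_t n, int passes) {
    const size_t block = BLOCK_BYTES / sizeof(int);
    long sum = 0;
    for (size_t start = 0; start < n; start += block) {
        size_t end = start + block < n ? start + block : n;
        for (int p = 0; p < passes; p++)
            for (size_t i = start; i < end; i++)
                sum += data[i];
    }
    return sum;
}
```

Compared with looping `passes` times over the entire array, the result is identical, but the memory traffic drops by roughly a factor of `passes`, which is where speed-ups of the magnitude the slide reports come from.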
26 Overview – System architecture (architecture diagram repeated)
27 System architecture – Theory – Layout and limits
- Limits:
  - DDR: ... GB/s
  - QPI: 2 x 13 GB/s
  - PCIe Gen2 16x: 8 GB/s
  - 10GbE: 1 GB/s
  - SATA II: ... MB/s
  - USB: ... MB/s
28 System architecture Practice – Results (1/4)
29 System architecture Practice – Results (2/4)
30 System architecture Practice – Results (3/4)
31 System architecture Practice – Results (4/4)
32 System architecture – Conclusion
- Divide the load between NUMA nodes:
  - Cores in one node compete for memory bandwidth
  - Throughput increases with the number of nodes
- Run one process on one core, to increase time-predictability
- Run time-critical processes on a core (and CPU) without interrupts:
  - Interrupts increase latency and decrease time-predictability
33 Overview – Operating system (architecture diagram repeated)
34 Operating System
- Theory:
  - Multitasking:
    - Context switches
    - Virtual addressing (RAM → L2 TLB → L1 TLB)
    - Different process priorities (highly unpredictable)
    - Kernel
  - General-purpose vs. real-time OS:
    - An RTOS focuses on predictable latency (not minimum latency)
- Practice: low priority
- Conclusion: run your program at high priority (on an RTOS)
35 Conclusion
- Processor: make your branches predictable (30%)
- Memory hierarchy: stay in the cache (50x)
- System architecture:
  - Divide the load between NUMA nodes (Nx)
  - Avoid cores (and CPUs) that handle interrupts
  - Run one process on one core
- Operating System: run your program at high priority (on an RTOS)