Slide 1: Time-predictability of a computer system. Master's project in progress, by Wouter van der Put.
Slide 2: How long does it take?
Slide 3: Goal – Problem, approach and final goal
- Problem: how to meet timing requirements on an x86 multi-core, multi-CPU computer system?
- Method: investigate and characterise the time-predictability of x86 multi-core, multi-CPU computer systems, and give advice on how to increase it
- Final goal: advise how to maximise time-predictability, minimise latency and maximise throughput
Slide 4: Overview – Time-predictability
- Influenced by (bottom-up approach):
  - Hardware: processor (architecture), memory (hierarchy), system architecture (motherboard)
  - Software: operating system (scheduling), algorithms and their data (regularity)
- Approach: theory (explore CPU architectures), practice (perform measurements), conclusion
- Focus on a contemporary architecture: a quad-core, dual-CPU Intel Nehalem server (next slide)
Slide 5: Time-predictability. [Figure: white = observed behaviour, black = reality]
Slide 6: Overview – Nehalem/Tylersburg architecture. [Diagram: two Nehalem-EP CPUs with DDR3 memory, connected by QPI links to a Tylersburg 36D IOH, which provides the PCIe lanes and an ESI link to the ICH10R]
Slide 7: Overview – Processor. [Same architecture diagram, with the Nehalem-EP processors highlighted]
Slide 8: Processor: Theory – Time-predictability
- Processors are designed to improve average-case latency:
  - Memory access: caches reduce the average access time
  - Hazards: prediction reduces their average impact
- Complexity increases:
  - Time-predictability becomes almost impossible to describe
  - The Instruction Set Architecture keeps expanding (next slide)
Slide 9: Processor: Theory – Historical overview. [Figure]
Slide 10: Processor: Theory – Nehalem architecture
- Found in recent processors: Core i7 and the Xeon 5500 series
- 3 cache levels, 2 TLB levels, 2 branch predictors
- Out-of-order execution and simultaneous multithreading (SMT)
- Loop stream decoder and dynamic frequency scaling
Slide 11: Processor: Theory – Nehalem pipeline (1/2)
[Diagram: instruction fetch and predecode → instruction queue → decode (with microcode ROM) → rename/allocate → scheduler → execution unit clusters 0, 1 and 5 plus load/store units → retirement unit (re-order buffer); backed by the L1D cache and DTLB, the L2 cache, and an L3 cache inclusive of all cores, connected via QPI]
Slide 12: Processor: Theory – Nehalem pipeline (2/2). [Figure]
Slide 13: Processor: Theory – Hazards
- Negative impact on time-predictability:
  - Data hazards: RAW, WAR and WAW dependencies
  - Structural hazards: a functional unit is already in use, causing a stall; SMT makes this worse
  - Control hazards: exception and interrupt handling (irregular); branch hazards, with their branch misprediction penalty (next slide)
Slide 14: Processor: Practice – Branch prediction

    for (a = 0; a < 99999999; a++) {
        if (random < BranchPoint)   /* BranchPoint swept from 0% to 100% */
            DoSomething;
        else
            DoSomething;            /* identical work in both arms */
    }

Result: a fully predictable branch lowers latency by up to 30%.
Slide 15: Processor: Conclusion
- Branch prediction:
  - Make your branches predictable: lowers latency by up to 30%
  - If branches are input-dependent, time-predictability decreases
- Other features increase throughput but decrease time-predictability: out-of-order execution, simultaneous multithreading, the loop stream decoder and dynamic frequency scaling
Slide 16: Overview – Memory hierarchy. [Same architecture diagram, with the DDR3 memory highlighted]
Slide 17: Memory hierarchy: Theory – Overview (1/2)

Level | Capacity    | Associativity (ways) | Line size (bytes) | Access latency (clocks) | Access throughput (clocks) | Write policy
L1D   | 4 x 32 KiB  | 8                    | 64                | 4                       | 1                          | Write-back
L1I   | 4 x 32 KiB  | 4                    | 64                | N/A                     | N/A                        | N/A
L2U   | 4 x 256 KiB | 8                    | 64                | 10                      | Varies                     | Write-back
L3U   | 1 x 8 MiB   | 16                   | 64                | 35-40                   | Varies                     | Write-back
Slide 18: Memory hierarchy: Theory – Overview (2/2)

Level | Hit rate | Access time
L1$   | 95%      | 4 clock cycles
L2$   | 95%      | 10 clock cycles
L3$   | 95%      | 40 clock cycles
Mem   |          | 100 clock cycles

Minimum: 4 clock cycles; average: 4.383 clock cycles; maximum: 100 clock cycles.
- Goal: minimise average latency
- Result: the program (and its input) influences the hit rates and thus the average latency; the input may therefore influence time-predictability
Slide 19: Memory hierarchy: Theory – Caches (1/2)
- Negative impact on time-predictability:
  - Locality of reference: temporal locality; spatial locality (sequential, equidistant and branch locality)
  - Write policy: write-through (latency: write = 1, read = 1) versus write-back (latency: write = 0, read = 2)
Slide 20: Memory hierarchy: Theory – Caches (2/2)
- Negative impact on time-predictability:
  - Cache types: instruction cache, data cache, Translation Lookaside Buffer (TLB)
  - (Non-)blocking caches
  - Placement policy: fully associative, N-way set associative, or direct mapped (1-way set associative)
Slide 21: Memory hierarchy: Practice – Method

Assembly (no compiler in the way):

    .code
    start:
        mov eax, alloc(1073741824)   ; allocate 1 GiB
        mov ecx, 0
    loopy:
        mov ebx, [eax+911191543]     ; read from a "random" offset
        mov ebx, [eax+343523495]
        ... (100,000x)
        mov ebx, [eax+261645419]
        mov ebx, [eax+275857221]
        inc ecx
        cmp ecx, 80000000
        jnz loopy
        free eax
        exit
    end start

In pseudocode: allocate a variable number of bytes; for ecx = 0 to BIG_NUMBER (runs for about 10 s), read random data from the array 100,000 times per iteration; free the memory.
Slide 22: Memory hierarchy: Practice – Results (1/3). [Graph of measurement results]
Slide 23: Memory hierarchy: Practice – Results (2/3). [Graph of measurement results]
Slide 24: Memory hierarchy: Practice – Results (3/3). [Graph of measurement results]
Slide 25: Memory hierarchy: Conclusion
- Stay in the cache (here 4 x 32 KiB L1 / 2 x 6 MiB L2), e.g. by splitting a large dataset into smaller pieces
- Possible speed gain of more than 50x!
Slide 26: Overview – System architecture. [Same architecture diagram, with the Tylersburg 36D IOH and its links highlighted]
Slide 27: System architecture: Theory – Layout and limits
- Bandwidth limits:
  - DDR3: 32 GB/s
  - QPI: 2 x 13 GB/s
  - PCIe Gen2 x16: 8 GB/s
  - 10GbE: 1 GB/s
  - SATA II: 500 MB/s
  - USB 2.0: 60 MB/s
Slide 28: System architecture: Practice – Results (1/4). [Graph of measurement results]
Slide 29: System architecture: Practice – Results (2/4). [Graph of measurement results]
Slide 30: System architecture: Practice – Results (3/4). [Graph of measurement results]
Slide 31: System architecture: Practice – Results (4/4). [Graph of measurement results]
Slide 32: System architecture: Conclusion
- Divide the load between NUMA nodes: cores within one node compete for memory bandwidth, so total throughput can scale with the number of nodes
- Run one process per core, to increase time-predictability
- Run time-critical processes on a core (and CPU) that does not service interrupts: interrupts increase latency and decrease time-predictability
Slide 33: Overview – Operating system. [Same architecture diagram]
Slide 34: Operating System
- Theory:
  - Multitasking: context switches; virtual addressing (RAM → L2 TLB → L1 TLB); differing process priorities (highly unpredictable); the kernel itself
  - General-purpose versus real-time OS: an RTOS focuses on predictable latency, not minimum latency
- Practice: measurements at low priority
- Conclusion: run your program at high priority (preferably on an RTOS)
Slide 35: Conclusion
- Processor: make your branches predictable (30%)
- Memory hierarchy: stay in the cache (50x)
- System architecture: divide the load between NUMA nodes (Nx); avoid cores (and CPUs) that service interrupts; run one process per core
- Operating system: run your program at high priority (on an RTOS)