
Time-predictability of a computer system. Master project in progress. By Wouter van der Put.




1 Time-predictability of a computer system. Master project in progress. By Wouter van der Put

2 How long does it take?

3 Goal – Problem, approach and final goal
- Problem: how to meet timing requirements on an x86 multi-core multi-CPU computer system?
- Method: investigate, characterise and give advice to increase the time-predictability of x86 multi-core multi-CPU computer systems
- Final goal: advise how to maximise time-predictability, minimise latency and maximise throughput

4 Overview – Time-predictability
- Influenced by (bottom-up approach):
  - Hardware: processor (architecture), memory (hierarchy), system architecture (motherboard)
  - Software: operating system (scheduling), algorithms and their data (regularity)
- Approach:
  - Theory: explore (CPU) architectures
  - Practice: perform measurements
  - Conclusion
- Focus on a contemporary architecture: quad-core dual-CPU Intel Nehalem server (next slide)

5 Time-predictability
[Diagram: white = observed, black = reality]

6 Overview – Nehalem/Tylersburg architecture
[Block diagram: two Nehalem-EP processors, each with its own DDR3 memory, linked by QPI to each other and to a Tylersburg 36D IOH; the IOH fans out PCIe lanes (2x16, 8x4, 4x8, 2x2, 1x4) and an ESI link to the ICH10R southbridge]

7 Overview – Processor
[Architecture diagram as on slide 6]

8 Processor: Theory – Time-predictability
- Designed to improve average-case latency:
  - Memory access: caches reduce the average access time
  - Hazards: prediction reduces the average impact
- Complexity increases:
  - Time-predictability becomes almost impossible to describe
  - The Instruction Set Architecture expands (next slide)

9 Processor: Theory – Historical overview

10 Processor: Theory – Nehalem architecture
- In novel processors: Core i7 & Xeon 5500 series
- 3 cache levels
- 2 TLB levels
- 2 branch predictors
- Out-of-order execution
- Simultaneous multithreading
- Loop stream decoder
- Dynamic frequency scaling

11 Processor: Theory – Nehalem pipeline (1/2)
[Pipeline diagram: instruction fetch and predecode → instruction queue → decode (with microcode ROM) → rename/alloc → scheduler → execution unit clusters 0, 1 and 5 plus load/store units → retirement unit (re-order buffer); L1D cache and DTLB, L2 cache, inclusive L3 cache shared by all cores, QPI]

12 Processor: Theory – Nehalem pipeline (2/2)

13 Processor: Theory – Hazards
- Negative impact on time-predictability:
  - Data hazards: RAW, WAR and WAW
  - Structural hazards: functional unit in use (stall, SMT)
  - Control hazards:
    - Exception and interrupt handling (irregular)
    - Branch hazards: branch misprediction penalty (next page)

14 Processor: Practice – Branch prediction

for (a = 0; a < 99999999; a++) {
    if (random < BranchPoint)   // BranchPoint swept from 0% to 100%
        DoSomething;            // identical work in both branches:
    else                        // only the branch direction varies
        DoSomething;
}

Result: a fully predictable branch lowers latency by up to 30%
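Why predictability pays: the hardware learns each branch's history. A toy model of the textbook 2-bit saturating-counter predictor (a deliberate simplification; Nehalem's two real predictors are more elaborate and not publicly documented at this level) shows why a fixed-direction branch is essentially free while an alternating one is mispredicted half the time:

```python
def simulate_2bit_predictor(outcomes):
    """Simulate a single 2-bit saturating-counter branch predictor.

    States 0-1 predict not-taken, states 2-3 predict taken; each
    actual outcome nudges the counter one step toward itself.
    Returns the fraction of correctly predicted branches.
    """
    state = 2                          # start weakly predicting "taken"
    correct = 0
    for taken in outcomes:
        predicted = state >= 2
        if predicted == taken:
            correct += 1
        # saturating update toward the actual outcome
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return correct / len(outcomes)

regular = simulate_2bit_predictor([True] * 1000)            # always taken
alternating = simulate_2bit_predictor([True, False] * 500)  # T,N,T,N,...
# regular -> 1.0 (every branch predicted); alternating -> 0.5
```

This mirrors the experiment above: at BranchPoint = 0% or 100% the branch direction is constant and prediction is perfect; near 50% with random input, mispredictions (and their penalty) dominate.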

15 Processor: Conclusion
- Branch prediction:
  - Make your branches predictable (lowers latency by up to 30%)
  - If branches are input-dependent, time-predictability decreases
- Other features increase throughput but decrease time-predictability:
  - Out-of-order execution
  - Simultaneous multithreading
  - Loop stream decoder
  - Dynamic frequency scaling

16 Overview – Memory hierarchy
[Architecture diagram as on slide 6]

17 Memory hierarchy: Theory – Overview (1/2)

Level  Capacity     Associativity  Line size  Access latency  Access throughput  Write update
                    (ways)         (bytes)    (clocks)        (clocks)           policy
L1D    4 x 32 KiB   8              64         4               1                  Writeback
L1I    4 x 32 KiB   4              N/A        N/A             N/A                N/A
L2U    4 x 256 KiB  8              64         10              Varies             Writeback
L3U    1 x 8 MiB    16             64         35-40           Varies             Writeback

18 Memory hierarchy: Theory – Overview (2/2)

Level  Hit rate  Access time
L1$    95%       4 clock cycles
L2$    95%       10 clock cycles
L3$    95%       40 clock cycles
Mem    -         100 clock cycles

Minimum: 4 clock cycles
Average: 4.383 clock cycles
Maximum: 100 clock cycles

- Goal: minimise average latency
- Result:
  - The program (and its input) influences the hit rate and thus the average latency
  - The input may therefore influence time-predictability
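The 4.383-cycle average follows directly from the table: each level is consulted only when every level above it misses. A quick check of that arithmetic (the 95% hit rates are the slide's illustrative figures, not measured Nehalem values):

```python
def expected_latency(levels, memory_latency):
    """Expected access latency for a cache hierarchy.

    `levels` is a list of (hit_rate, latency_in_cycles) pairs from
    fastest to slowest; `memory_latency` is paid when all levels miss.
    """
    expected = 0.0
    miss_prob = 1.0                     # probability of reaching this level
    for hit_rate, latency in levels:
        expected += miss_prob * hit_rate * latency
        miss_prob *= (1.0 - hit_rate)
    return expected + miss_prob * memory_latency

avg = expected_latency([(0.95, 4), (0.95, 10), (0.95, 40)], 100)
# 0.95*4 + 0.05*0.95*10 + 0.05**2*0.95*40 + 0.05**3*100 = 4.3825
```

The spread between the 4-cycle minimum and the 100-cycle maximum is the slide's point: the average is low, but any single access may cost 25x the average, which is what hurts time-predictability.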

19 Memory hierarchy: Theory – Caches (1/2)
- Negative impact on time-predictability:
  - Locality of reference:
    - Temporal locality
    - Spatial locality (sequential locality, equidistant locality, branch locality)
  - Write policy:
    - Write-through (latency: write = 1, read = 1)
    - Write-back (latency: write = 0, read = 2)

20 Memory hierarchy: Theory – Caches (2/2)
- Negative impact on time-predictability:
  - Cache types: instruction cache, data cache, Translation Lookaside Buffer (TLB)
  - (Non-)blocking caches
  - Mapping (associativity): fully associative, N-way set associative, direct mapped (1-way associative)

21 Memory hierarchy: Practice – Method

Assembly (no compiler):

.code
start:
    mov eax, alloc(1073741824)
    mov ecx, 0
loopy:
    mov ebx, [eax+911191543]
    mov ebx, [eax+343523495]
    ... (100,000x)
    mov ebx, [eax+261645419]
    mov ebx, [eax+275857221]
    inc ecx
    cmp ecx, 80000000
    jnz loopy
    free eax
    exit
end start

Pseudocode:

Begin
    Allocate variable number of bytes
    For ecx = 0 to BIG_NUMBER (run 10 s)
        Read random data from array
        ... (100,000x)
        Read random data from array
    Next ecx
    Free memory
End
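The same measurement idea can be sketched in a high-level language. This Python stand-in replaces the hard-coded addresses with a fixed pseudo-random offset list; note that interpreter overhead dominates in Python, so the cache effect it exposes is far weaker than in the assembly version:

```python
import random
import time

def measure_random_reads(size_bytes, reads=100_000, seed=0):
    """Time `reads` pseudo-random byte reads from a buffer of `size_bytes`.

    The fixed offset list plays the role of the slide's hard-coded
    addresses. Returns (elapsed_seconds, checksum); the checksum keeps
    the reads from being optimised away.
    """
    rng = random.Random(seed)
    buf = bytearray(size_bytes)
    offsets = [rng.randrange(size_bytes) for _ in range(reads)]
    checksum = 0
    start = time.perf_counter()
    for off in offsets:
        checksum += buf[off]          # one data-dependent read per iteration
    elapsed = time.perf_counter() - start
    return elapsed, checksum

small = measure_random_reads(32 * 1024)         # fits in one 32 KiB L1
large = measure_random_reads(64 * 1024 * 1024)  # far exceeds the 8 MiB L3
```

Varying `size_bytes` across the cache-level boundaries reproduces the shape of the results on the next slides: latency per read jumps each time the working set outgrows a level.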

22 Memory hierarchy: Practice – Results (1/3)

23 Memory hierarchy: Practice – Results (2/3)

24 Memory hierarchy: Practice – Results (3/3)

25 Memory hierarchy: Conclusion
- Stay in the cache (here 4 x 32 KiB L1 / 2 x 6 MiB L2), e.g. by splitting a large dataset into smaller pieces
- Possible speed gain of more than 50x!
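The "split a large dataset into smaller pieces" advice is loop blocking. A minimal sketch, assuming element-wise independent work and a hypothetical block size of 8192 elements (chosen to fit 4-byte elements in a 32 KiB L1):

```python
def process_naive(data, passes):
    """Sweep the whole dataset once per pass (evicts the cache each pass)."""
    out = list(data)
    for _ in range(passes):
        for i in range(len(out)):
            out[i] = (out[i] * 3 + 1) % 251   # placeholder per-element work
    return out

def process_blocked(data, passes, block=8192):
    """Finish all passes on one cache-sized block before moving on.

    Because the per-element work is independent, the result is identical
    to the naive version; only the traversal order (and cache reuse) changes.
    """
    out = list(data)
    for start in range(0, len(out), block):
        for _ in range(passes):
            for i in range(start, min(start + block, len(out))):
                out[i] = (out[i] * 3 + 1) % 251
    return out

naive = process_naive(range(20_000), passes=3)
blocked = process_blocked(range(20_000), passes=3, block=4096)
# identical results; the blocked order is what keeps the working set in cache
```

In Python the interpreter hides most of the gain; in the assembly setting of these slides, the same restructuring is where the >50x difference between L1-resident and memory-resident working sets comes from.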

26 Overview – System architecture
[Architecture diagram as on slide 6]

27 System architecture: Theory – Layout and limits
- Bandwidth limits:
  - DDR3: 32 GB/s
  - QPI: 2 x 13 GB/s
  - PCIe Gen2 x16: 8 GB/s
  - 10GbE: 1 GB/s
  - SATA II: 500 MB/s
  - USB 2.0: 60 MB/s

28 System architecture: Practice – Results (1/4)

29 System architecture: Practice – Results (2/4)

30 System architecture: Practice – Results (3/4)

31 System architecture: Practice – Results (4/4)

32 System architecture: Conclusion
- Divide load between NUMA nodes:
  - Cores in one node compete for memory bandwidth
  - Throughput can increase by up to the number of nodes
- Run one process on one core, to increase time-predictability
- Run the time-critical process on a core (and CPU) without interrupts:
  - Interrupts increase latency and decrease time-predictability
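"Run one process on one core" is done by setting CPU affinity. A minimal, Linux-specific sketch using Python's real `os.sched_setaffinity` API (on other systems it degrades to a no-op; choosing which core belongs to which NUMA node is left to the reader or to tools like `numactl`):

```python
import os

def pin_to_core(cpu=None):
    """Pin the calling process to a single core (Linux-only sketch).

    If `cpu` is None, the lowest core in the currently allowed set is
    used. Returns the core pinned to, or None where the API is missing.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None                            # non-Linux: no-op
    allowed = os.sched_getaffinity(0)          # cores we may run on
    if cpu is None:
        cpu = min(allowed)
    if cpu not in allowed:
        raise ValueError(f"core {cpu} not in allowed set {allowed}")
    os.sched_setaffinity(0, {cpu})             # restrict to that one core
    return cpu

pinned = pin_to_core()
```

The same effect is available from the shell with `taskset`; keeping the time-critical process off the core that services interrupts then follows the slide's last point.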

33 Overview – Operating system
[Architecture diagram as on slide 6]

34 Operating System
- Theory:
  - Multitasking: context switches, virtual addressing (RAM → L2 TLB → L1 TLB), different process priorities (highly unpredictable), kernel
  - General-purpose vs real-time OS: an RTOS focuses on predictable latency (not minimum latency)
- Practice: low priority
- Conclusion: run your program at high priority (on an RTOS)
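Priorities can be adjusted through the POSIX nice value. Raising priority (negative nice, or a real-time scheduling class via `sched_setscheduler`) requires root, so this sketch shows the unprivileged direction: demoting non-critical work so the time-critical process wins by default:

```python
import os

def set_background_priority(increment=10):
    """Lower the calling process's scheduling priority (POSIX sketch).

    `os.nice(increment)` adds `increment` to the nice value (higher
    nice = lower priority) and returns the new value. Unprivileged
    processes can only move in this direction.
    """
    return os.nice(increment)
```

On a stock general-purpose OS even the highest priority leaves kernel-induced jitter; the slide's conclusion is that predictable latency ultimately needs an RTOS, not just a priority tweak.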

35 Conclusion
- Processor: make your branches predictable (up to 30% lower latency)
- Memory hierarchy: stay in the cache (up to 50x)
- System architecture:
  - Divide load between NUMA nodes (Nx)
  - Avoid the interrupted core (and CPU)
  - Run one process on one core
- Operating System: run your program at high priority (on an RTOS)

