Published by Adrian Ford. Modified over 9 years ago.
1
Lessons from HLT benchmarking (for the new farm)
Rainer Schwemmer, LHCb Computing Workshop 2014
2
The Setup
Goals
– Determine the most cost-efficient compute platform for next year’s farm upgrade
– Help to estimate what can be expected from the new farm
CPU time measurement:
– Run Moore similar to how it runs at P8 during data taking (actually deferred processing)
– Buffer Manager
– File Reader (instead of MEPRx)
– Variable instances of HLT1/HLT1+2
– Measure how many triggers were processed over a certain amount of time (typically 1 hour)
Memory measurement:
– Intel Performance Counter Monitor
– Profiles the entire system for cache behaviour, IPC and other interesting stats
3
Results
Had access to quite a few next-generation prototype systems
4
Results
Machines that are interesting for the new farm
[Figure labels: Current farm nodes; New Farm Node (x800)]
5
Interesting little detail
HLT seems to run faster on the first socket than on the second
– Effect is also visible on current farm nodes, but to a lesser extent
[Figure annotations: ~630, ~560]
6
NUMA Architecture
Non-Uniform Memory Access hits us when we are launching applications as forks of a master process
[Diagram: HLT Master (>50% of mem) and its forked HLT Slave processes]
7
NUMA Architecture
When off-core/off-socket instances access master memory, they incur additional latency due to the socket-to-socket/core-to-core interconnect
[Diagram: HLT Master (>50% of mem) and its forked HLT Slave processes]
8
NUMA Architecture
Solution: Launch one master process per NUMA node
Disadvantage: Every additional master needs memory but does not participate in data processing
[Diagram: one HLT Master per NUMA node, each with its own HLT Slave processes]
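The "one master per NUMA node" idea can be sketched roughly as follows. This is a minimal illustration, not the actual HLT launch code: the function names (`parse_cpulist`, `plan_masters`, `launch`), the one-slave-per-CPU choice and the example CPU lists are all assumptions. On Linux the per-node CPU lists could be read from `/sys/devices/system/node/node*/cpulist`. Pinning with `os.sched_setaffinity` only restricts where a process runs; whether its memory then ends up on the local node depends on the kernel's allocation policy.

```python
import os


def parse_cpulist(text):
    """Parse a Linux cpulist string such as '0-7,16-23' into a sorted list of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def plan_masters(node_cpulists):
    """Build one launch entry per NUMA node.

    node_cpulists maps a node id to its cpulist string.
    Assumption: one slave per logical CPU of the node.
    """
    plan = []
    for node, cpulist in sorted(node_cpulists.items()):
        cpus = parse_cpulist(cpulist)
        plan.append({"node": node, "cpus": cpus, "slaves": len(cpus)})
    return plan


def launch(plan, run_master):
    """Fork one master per NUMA node and pin it to that node's CPUs (Linux only).

    run_master is a hypothetical callable that loads the application and
    forks its own slaves; children forked after pinning inherit the affinity.
    """
    pids = []
    for entry in plan:
        pid = os.fork()
        if pid == 0:
            os.sched_setaffinity(0, entry["cpus"])  # 0 means the calling process
            run_master(entry)
            os._exit(0)
        pids.append(pid)
    for pid in pids:
        os.waitpid(pid, 0)


# Hypothetical two-socket machine with hyper-threading.
example_plan = plan_masters({0: "0-7,16-23", 1: "8-15,24-31"})
for entry in example_plan:
    print(entry["node"], entry["slaves"], entry["cpus"][:4])
```

The cost mentioned on the slide shows up directly in this layout: every entry in the plan corresponds to one extra master process that holds memory but processes no events.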
9
The numbers
The raw numbers for HLT1/HLT1+2 without/with NUMA awareness
Values are in Hz
Core Value: HLT1+2_Classic / HLT1+2_Single / N_Cores
– If you want to compute fully loaded performance from single core performance
– Warning: New machines are not 1.0 anymore!
I have most of these values for most of the Haswell, AMD and Atom cores
Can provide a spreadsheet if you are interested

CPU | HLT1 Single | HLT1 Classic | HLT1 NUMA | HLT1+2 Single | HLT1+2 Classic | HLT1+2 NUMA | NUMA Gain 1 | NUMA Gain 1+2 | Core Value
DELL | 89.27 | 1049 | 1150 | 48.9 | 599.6 | 648.8 | 1.096 | 1.082 | 1.02
SM | - | - | - | - | 590 | 627 | - | ~1.063 | -
AMD | 40.24 | 1105 | 1150 | 31.76 | 632.35 | 682 | 1.04 | ~1.079 | 0.62
E_2630 (8 cores) | - | 1689* | - | 64.67 | 865 | 986 | - | ~1.14 | 0.84
E_2650 (10 cores) | - | 2066* | - | 62.83 | 1129 | 1210 | - | ~1.071 | 0.90
* Single fully loaded HLT1 value; the flattened table does not show whether it belongs to the Classic or the NUMA column.
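As a quick check of how the derived columns relate to the raw rates, the short sketch below recomputes the NUMA gain and the Core Value from the table. The function names are illustrative, and the core counts (12 for DELL, 16 for the E_2630, i.e. two sockets of 8 cores) are assumptions that are not stated on the slide; they were chosen because they are consistent with the Core Values given there.

```python
def numa_gain(classic_hz, numa_hz):
    """Ratio of the NUMA-aware rate to the classic (single master) rate."""
    return numa_hz / classic_hz


def core_value(classic_hz, single_hz, n_cores):
    """Core Value as defined on the slide: Classic / Single / N_Cores."""
    return classic_hz / single_hz / n_cores


def fully_loaded_estimate(single_hz, n_cores, core_val):
    """Estimate the fully loaded rate from the single-instance rate."""
    return single_hz * n_cores * core_val


# DELL row; n_cores=12 is an assumption, not a number from the slide.
dell_gain_hlt1 = numa_gain(1049, 1150)
dell_gain_hlt12 = numa_gain(599.6, 648.8)
dell_core = core_value(599.6, 48.9, n_cores=12)

# E_2630 row; n_cores=16 assumes two sockets with 8 cores each.
e2630_gain_hlt12 = numa_gain(865, 986)
e2630_core = core_value(865, 64.67, n_cores=16)

print(round(dell_gain_hlt1, 3), round(dell_gain_hlt12, 3), round(dell_core, 2))
print(round(e2630_gain_hlt12, 2), round(e2630_core, 2))
```

Under these assumptions the DELL row gives gains of about 1.096 and 1.082 and a Core Value of about 1.02, and the E_2630 row gives a gain of about 1.14 and a Core Value of about 0.84, matching the table.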
10
Frequency Scaling
Benchmark performance vs. core frequency
– Results from 2010
Performance scales more or less linearly with frequency
– Dashed: linear extrapolation based on the lowest measurement point
– Solid: measured performance
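One way to read the dashed curve is as a proportional extrapolation from the lowest-frequency measurement, compared against each measured point. The sketch below assumes that interpretation; the frequencies and rates are made-up placeholder numbers, not values from the plots.

```python
def linear_extrapolation(f_ref_ghz, rate_ref_hz, f_ghz):
    """Rate expected at f_ghz if performance scaled proportionally with frequency."""
    return rate_ref_hz * f_ghz / f_ref_ghz


def scaling_efficiency(f_ref_ghz, rate_ref_hz, f_ghz, rate_measured_hz):
    """Measured rate divided by the extrapolated rate (1.0 means perfect scaling)."""
    return rate_measured_hz / linear_extrapolation(f_ref_ghz, rate_ref_hz, f_ghz)


# Hypothetical (frequency in GHz, trigger rate in Hz) measurements.
points = [(1.2, 300.0), (1.8, 440.0), (2.4, 560.0)]

# Reference point: the lowest measured frequency.
f_ref, rate_ref = min(points)

for f, rate in points:
    expected = linear_extrapolation(f_ref, rate_ref, f)
    eff = scaling_efficiency(f_ref, rate_ref, f, rate)
    print(f"{f:.1f} GHz: measured {rate:.0f} Hz, extrapolated {expected:.0f} Hz, efficiency {eff:.2f}")
```

An efficiency close to 1.0 at every point corresponds to the solid curve following the dashed one; values well below 1.0 indicate that something other than core frequency limits throughput.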
11
Frequency Scaling
Benchmark performance vs. core frequency
– Results from 2010 + 2014
Performance does not scale with frequency at all anymore
– There is a good chance the extrapolation curve is underestimated and could be better
12
Frequency Scaling
Cause unclear so far
– No profiler for Haswell yet
Suspect high memory latency and bad data locality in the application
DDR4 is still in its infancy; it might get better, but we won’t profit from it with the new farm
13
Conclusion
Forking is good for start-up, but we lost quite a bit of performance
– This did not go unnoticed (Johannes, ca. 2011)
– We did not keep track of performance values, so it was blamed on changes in the application
Running the HLT NUMA-aware can give us 14% more performance on new machines
– 6%-8% on old farm nodes
– Memory consumption will go up
Performance does not scale with CPU frequency anymore
– Probably an issue with memory latency and a bad data access pattern
– No plan to upgrade the farm again until the next shutdown
– Will not profit from better memory until after LS2
– Some more % can probably be gained by optimizing data structures and access patterns
14
Questions?