Published by Kristian Vernon White. Modified over 8 years ago.
1 In-depth Lustre monitoring: For performance and profit
Lustre Community Conference 2016
Malcolm Haak, SGI
August 2016
nci.org.au @NCInews Providing Australian researchers with world-class computing services
2 Agenda
Current state of monitoring
– Traditional approach
– Limitations
– Scalability
Design goals
– Client side
– Server side
Architecture
– Monitoring platform
Performance
– Identifying how much you have
– Identifying where it's all gone
What else?
– Use cases
3 Current state of monitoring
Watching the watchers
4 Current monitoring solutions
Only really do one thing (without plugins or customization)
Often require status logic in the ‘agent’/script
Graphing can be inflexible
Binary-only agents and/or lots of ‘hand-rolling’ of scripts required
More than one is usually required to get a full picture of performance and alerting
5 Other issues
Data reusability is low: one data point serves one alert type
No ‘official’ agents for some products
Bundled monitoring offerings are often limited in scope/output
Lacking an API for integration with other products or NOC/status views
Combination or complex tests not possible
Very manual configuration
Reconfiguration often invalidates existing data
6 Other issues
Heavy clients behave poorly during heavy load
Segregation of criticality
Merging of data from external and out-of-band sources
Integration and automation of existing workflows
7 Design goals
Let’s build a bike shed
8 Expressed goals
Don’t reinvent the wheel
Monitor everything. More data means more useful alerts/views
Allow for actions to be taken based on data
Alerts and performance presented from the same interface
Easy access to information
Well-documented API
Move logic out of monitored machines
9 Expressed goals
Separate monitoring of different criticalities, with the more critical information using the ‘lightest’ possible client
Ease of use as ‘10 foot view’
Ability to separate alerting based on team or personal responsibilities
Room for future adaptability
10 Architecture
What colour is your bike shed?
11 Architecture
Based on Zabbix
Full ‘automated’ configuration via setup scripts with each ‘plugin’
Modular service written in Python for ease of expansion
Reusable plugins also function as standalone command line tools for debugging and lightweight usage
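The idea of a plugin that is both an importable module and a standalone command line tool can be sketched as follows. This is a minimal illustration only, not the code used in the talk: the names `collect_load` and `main`, the `--key` option, and the choice of load average as the metric are all assumptions made for the example.

```python
import argparse
import json
import os

KEYS = ("load1", "load5", "load15")


def collect_load(reader=os.getloadavg):
    """Return raw load averages as a dict; no alert logic lives in the plugin."""
    return dict(zip(KEYS, reader()))


def main(argv=None, reader=os.getloadavg):
    """Command line entry point (hypothetical): print one metric or all as JSON."""
    parser = argparse.ArgumentParser(description="Example load-average plugin")
    parser.add_argument("--key", choices=KEYS, help="print a single metric")
    args = parser.parse_args(argv)
    data = collect_load(reader)
    if args.key:
        print(data[args.key])
    else:
        print(json.dumps(data, sort_keys=True))
    return 0
```

A monitoring service could import `collect_load` directly, while an administrator debugging a host could call `main(["--key", "load1"])` from a small wrapper script. Keeping thresholds out of the plugin matches the stated goal of moving logic off the monitored machines.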
12 Current Monitoring Plugins
Storage Arrays
– IS 4X00/5X00
– IS 17000/17500
Lustre
– Server side performance metrics
– Client via Server side metrics
13 Storage Arrays
Alerting on faults at array and disk level
Inventory of drives in array
Alerting on changes
Configurable history of drive S/N
Disk R/W latency
In progress
– Drive side bandwidth (dependent on array support)
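Alerting on inventory changes amounts to comparing the current drive list against the last recorded one. The sketch below assumes the inventory is available as a mapping of slot to drive serial number; that data shape and the function name `diff_inventory` are illustrative assumptions, not the plugin's actual interface.

```python
def diff_inventory(previous, current):
    """Compare two {slot: serial} maps and describe every change found."""
    events = []
    for slot in sorted(previous.keys() | current.keys()):
        old, new = previous.get(slot), current.get(slot)
        if old == new:
            continue  # same drive still in this slot
        if old is None:
            events.append(f"slot {slot}: drive {new} added")
        elif new is None:
            events.append(f"slot {slot}: drive {old} removed")
        else:
            events.append(f"slot {slot}: drive {old} replaced by {new}")
    return events
```

Storing each `current` snapshot alongside its events gives a simple per-slot history of serial numbers.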
14 Lustre
Collectable data
– MDS operations (Open/Close/etc)
– Per OSS bandwidth/IPC rates
– Per LNET router bandwidth/IPC rates
– Per OST bandwidth
– Per OST read/write IO size/latency breakdown
– Per client MDS operations
– Per client bandwidth/IPC rates
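Per OST bandwidth can be derived from two snapshots of an OST's cumulative counters. The sketch below assumes the server-side `stats` text layout in which `read_bytes` and `write_bytes` lines end with a running byte total and `snapshot_time` gives the sample time in seconds. The layout can differ between Lustre versions, and the sample values are invented for illustration.

```python
def parse_stats(text):
    """Extract the snapshot time and cumulative byte totals from a stats dump."""
    snapshot = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "snapshot_time":
            snapshot["time"] = float(fields[1])
        elif fields[0] in ("read_bytes", "write_bytes") and len(fields) >= 7:
            # Assumed layout: name count "samples" [unit] min max sum
            snapshot[fields[0]] = int(fields[6])
    return snapshot


def bandwidth(before, after):
    """Return bytes per second for reads and writes between two snapshots."""
    elapsed = after["time"] - before["time"]
    if elapsed <= 0:
        raise ValueError("snapshots must be in increasing time order")
    return {
        name: (after.get(name, 0) - before.get(name, 0)) / elapsed
        for name in ("read_bytes", "write_bytes")
    }


# Invented sample data, ten seconds apart.
FIRST = """snapshot_time 1470000000.000000 secs.usecs
read_bytes 100 samples [bytes] 4096 1048576 104857600
write_bytes 50 samples [bytes] 4096 1048576 52428800
"""
SECOND = """snapshot_time 1470000010.000000 secs.usecs
read_bytes 200 samples [bytes] 4096 1048576 209715200
write_bytes 100 samples [bytes] 4096 1048576 104857600
"""
rates = bandwidth(parse_stats(FIRST), parse_stats(SECOND))
```

Because the counters are cumulative, a server restart or failover resets them; a production collector would need to detect a negative delta and discard that interval.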
15 Other Information
CPU load/context switches/running processes
Memory/swap utilization
Free space in filesystems (useful for Lustre)
Checksums on important files (/etc/group, /etc/passwd, … etc.)
Multipath (XVM and multipathd) status
Pacemaker resource status
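Checksum monitoring of important files can be kept very light on the monitored host: the agent only reports a digest, and the comparison against a known-good value happens elsewhere. The sketch below is one possible shape for this, assuming SHA-256 and a baseline dictionary; the function names are hypothetical.

```python
import hashlib


def file_sha256(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def changed_files(baseline):
    """Return the paths whose current digest differs from the baseline.

    `baseline` maps path -> expected hex digest (an assumed data shape).
    """
    return sorted(
        path for path, expected in baseline.items()
        if file_sha256(path) != expected
    )
```

In a setup like the one described, `file_sha256` would run on the monitored machine and `changed_files`-style logic would live on the monitoring server, so an alert rule can change without touching the client.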
16 Performance
Where did all my IOPS go?
17 How much do we have?
MDS performance
OSS performance
LNET performance
Array performance
18 Uncontested IOR run
19 MDS performance on an average day
20 LNET throughput on a normal day
21 R/W sizes
22 Where did it go?
MDS performance
OSS performance
LNET performance
Array performance
23 “Lustre feels slow”
24 ~99% Read bandwidth consumed
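One way to turn "Lustre feels slow" into a number is to compare current per client read bandwidth against a measured peak, such as the result of an uncontested benchmark run. The sketch below assumes both values are already collected in bytes per second; the function name and the sample figures are invented for illustration.

```python
def bandwidth_shares(per_client_bps, peak_bps):
    """Return (total utilisation, [(client, share of peak)]) with largest share first."""
    if peak_bps <= 0:
        raise ValueError("peak bandwidth must be positive")
    shares = sorted(
        ((client, bps / peak_bps) for client, bps in per_client_bps.items()),
        key=lambda item: item[1],
        reverse=True,
    )
    total = sum(per_client_bps.values()) / peak_bps
    return total, shares


# Invented sample: one client is consuming most of a 1000-unit peak.
total, shares = bandwidth_shares({"node17": 900, "node03": 90}, peak_bps=1000)
```

Sorting by share puts the heaviest consumer first, which is usually the first question when the total is close to the measured peak.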
25 OST Failover
26 Future
And then?
27 Profiling workloads
28 Peaks at front are cache saturation
29 Total RPCs
30 At a glance health status
31 Client side NFS performance
32 Data view of IS4600 Array
33 Questions?
34 NCI Contacts
General enquiries: +61 2 6125 9800
Media enquiries: +61 2 6125 4389
Help desk: help@nci.org.au
Address: NCI, Building 143, Ward Road, The Australian National University, Canberra ACT 0200