
1 In depth Lustre monitoring: For performance and profit
Lustre Community Conference 2016
Malcolm Haak, SGI, August 2016
nci.org.au @NCInews
Providing Australian researchers with world-class computing services

2 Agenda
Current state of monitoring
  – Traditional approach
  – Limitations
  – Scalability
Design goals
  – Client side
  – Server side
Architecture
  – Monitoring platform
Performance
  – Identifying how much you have
  – Identifying where it's all gone
What else?
  – Use cases

3 Current state of monitoring: Watching the watchers

4 Current monitoring solutions
Only really do one thing (without plugins or customization)
Often require status logic in the ‘agent’/script
Graphing can be inflexible
Binary-only agents and/or lots of ‘hand-rolling’ of scripts required
More than one usually required to get the full picture of performance and alerting

5 Other issues
Data reusability is low: one data point for one alert type
No ‘official’ agents for some products
Bundled monitoring offerings are often limited in scope/output
Lacking API for integration with other products or NOC/status views
Combination or complex tests not possible
Very manual configuration
Reconfiguration often invalidates existing data

6 Other issues
Heavy clients behave poorly during heavy load
Segregation of criticality
Merging of data from external and out-of-band sources
Integration and automation of existing workflows

7 Design goals: Let’s build a bike shed

8 Expressed goals
Don’t reinvent the wheel
Monitor everything: more data means more useful alerts/views
Allow for actions to be taken based on data
Alerts and performance presented from the same interface
Easy access to information
Well-documented API
Move logic out of monitored machines

9 Expressed goals
Separate monitoring of different criticalities, with the more critical information using the ‘lightest’ possible client
Ease of use as a ‘10 foot view’
Ability to separate alerting based on team or personal responsibilities
Room for future adaptability
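
The goals of moving logic out of monitored machines and reusing one data point for several alerts can be pictured with a small sketch. This is not the platform's code: the sample values, rule names and thresholds below are all hypothetical. In Zabbix itself this role is played by trigger expressions evaluated on the server; the sketch only shows the idea that the monitored host ships raw numbers and every decision is made centrally.

```python
# Hypothetical raw sample as a lightweight client might report it.
# The client sends numbers only; it holds no thresholds or status logic.
SAMPLE = {"free_percent": 4.0, "load1": 0.5}

# Hypothetical server-side rules. Several rules reuse the same data point,
# and one rule combines two data points.
RULES = {
    "disk_nearly_full": lambda s: s["free_percent"] < 5.0,
    "disk_low": lambda s: s["free_percent"] < 20.0,
    "busy_and_low_on_space": lambda s: s["load1"] > 8.0 and s["free_percent"] < 20.0,
}


def evaluate(sample, rules):
    """Return the names of all rules whose condition holds for this sample."""
    return [name for name, predicate in rules.items() if predicate(sample)]


if __name__ == "__main__":
    print(evaluate(SAMPLE, RULES))
```

Changing a threshold in this arrangement means editing a rule on the server; nothing on the monitored machine is touched and the stored data stays valid.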

10 Architecture: What colour is your bike shed?

11 Architecture
Based on Zabbix
Full ‘automated’ configuration via setup scripts with each ‘plugin’
Modular service written in Python for ease of expansion
Reusable plugins also function as standalone command line tools for debugging and lightweight usage
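
The idea of a plugin that is importable by a service and also usable as a standalone command line tool can be sketched as follows. The plugin name, its `collect` function and the `--path` option are assumptions made for illustration; the slides do not show the real plugin interface. Free space is used as the metric only because it needs nothing beyond the Python standard library.

```python
import argparse
import json
import shutil


def collect(path="/"):
    """Return raw usage figures for the filesystem containing `path`."""
    usage = shutil.disk_usage(path)
    free_percent = 100.0 * usage.free / usage.total if usage.total else 0.0
    return {
        "path": path,
        "total_bytes": usage.total,
        "free_bytes": usage.free,
        "free_percent": round(free_percent, 2),
    }


def main(argv=None):
    """Command line entry point: print the same data the service would collect."""
    parser = argparse.ArgumentParser(description="Hypothetical free-space plugin")
    parser.add_argument("--path", default="/", help="path on the filesystem to inspect")
    args = parser.parse_args(argv)
    print(json.dumps(collect(args.path), indent=2))
    return 0


if __name__ == "__main__":
    main()
```

A long-running service would import the module and call `collect()` on a schedule, while an administrator debugging a host could run the same file directly and read the JSON output.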

12 Current monitoring plugins
Storage arrays
  – IS 4X00/5X00
  – IS 17000/17500
Lustre
  – Server side performance metrics
  – Client via server side metrics

13 Storage arrays
Alerting on faults at array and disk level
Inventory of drives in array
Alerting on changes
Configurable history of drive S/N
Disk R/W latency
In progress
  – Drive side bandwidth (dependent on array support)

14 Lustre
Collectable data
  – MDS operations (Open/Close/etc.)
  – Per OSS bandwidth/IPC rates
  – Per LNET router bandwidth/IPC rates
  – Per OST bandwidth
  – Per OST read/write IO size/latency breakdown
  – Per client MDS operations
  – Per client bandwidth/IPC rates
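
Bandwidth figures such as the per-OST ones listed above are typically derived from cumulative counters by sampling twice and dividing the difference by the elapsed time. The sketch below assumes the text layout commonly seen in Lustre `stats` files (counter name, sample count, the word `samples`, a unit, then minimum, maximum and sum). The two snapshots are synthetic values invented for the example, and the function names are hypothetical rather than taken from the presented plugins.

```python
# Synthetic snapshots taken 10 seconds apart (values invented for illustration).
SAMPLE_T0 = """\
snapshot_time             1470000000.000000 secs.usecs
read_bytes                1000 samples [bytes] 4096 1048576 1048576000
write_bytes               500 samples [bytes] 4096 1048576 524288000
get_info                  42 samples [reqs]
"""

SAMPLE_T1 = """\
snapshot_time             1470000010.000000 secs.usecs
read_bytes                2000 samples [bytes] 4096 1048576 2097152000
write_bytes               750 samples [bytes] 4096 1048576 786432000
get_info                  50 samples [reqs]
"""


def parse_stats(text):
    """Parse stats text into {counter: {"samples": int, "sum": int or None}}."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        name = fields[0]
        if name == "snapshot_time":
            stats[name] = float(fields[1])
        elif len(fields) >= 7 and fields[2] == "samples":
            # name count "samples" [unit] min max sum
            stats[name] = {"samples": int(fields[1]), "sum": int(fields[6])}
        elif len(fields) >= 3 and fields[2] == "samples":
            # Counters that only record how often an operation happened.
            stats[name] = {"samples": int(fields[1]), "sum": None}
    return stats


def rate_per_second(prev, curr, counter):
    """Average rate of a summed counter between two parsed snapshots."""
    elapsed = curr["snapshot_time"] - prev["snapshot_time"]
    if elapsed <= 0:
        raise ValueError("snapshots must be in increasing time order")
    return (curr[counter]["sum"] - prev[counter]["sum"]) / elapsed


if __name__ == "__main__":
    prev, curr = parse_stats(SAMPLE_T0), parse_stats(SAMPLE_T1)
    print(rate_per_second(prev, curr, "read_bytes") / 2**20, "MiB/s")  # 100.0 MiB/s
```

Keeping the raw cumulative counters and computing rates afterwards means the same stored data can serve both graphs and alerts.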

15 Other information
CPU load/context switches/running processes
Memory/swap utilization
Free space in filesystems (useful for Lustre)
Checksums on important files (/etc/group, /etc/passwd, etc.)
Multipath (XVM and multipathd) status
Pacemaker resource status
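
Checksum monitoring of important files can be sketched as: record a digest per file once, then report any file whose current digest differs. The choice of SHA-256 and the function names here are assumptions for illustration; the slides do not say which hash or mechanism the platform uses.

```python
import hashlib


def file_sha256(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot(paths):
    """Record the current digest of every path as a baseline."""
    return {path: file_sha256(path) for path in paths}


def changed_files(baseline, paths):
    """Return the paths whose current digest differs from the baseline."""
    return [path for path in paths if baseline.get(path) != file_sha256(path)]
```

A monitored host would only need to report the digests; comparing them against the stored baseline and raising an alert can happen on the monitoring server.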

16 Performance: Where did all my IOPS go?

17 How much do we have?
MDS performance
OSS performance
LNET performance
Array performance

18 Uncontested IOR run

19 MDS performance on an average day

20 LNET throughput on a normal day

21 R/W sizes

22 Where did it go?
MDS performance
OSS performance
LNET performance
Array performance

23 “Lustre feels slow”

24 ~99% read bandwidth consumed

25 OST failover

26 Future: And then?

27 Profiling workloads

28 Peaks at front are cache saturation

29 Total RPCs

30 At a glance health status

31 Client side NFS performance

32 Data view of IS4600 array

33 Questions?

34 NCI Contacts
General enquiries: +61 2 6125 9800
Media enquiries: +61 2 6125 4389
Help desk: help@nci.org.au
Address: NCI, Building 143, Ward Road, The Australian National University, Canberra ACT 0200

