Providing Australian researchers with world-class computing services
Lustre Community Conference 2016
In-depth Lustre monitoring: for performance and profit
Malcolm Haak, SGI – August 2016

2 Agenda
Current state of monitoring
– Traditional approach
– Limitations
– Scalability
Design goals
– Client side
– Server side
Architecture
– Monitoring platform
Performance
– Identifying how much you have
– Identifying where it has all gone
What else?
– Use cases

3 Current state of monitoring
Watching the watchers

4 Current monitoring solutions
Only really do one thing (without plugins or customization)
Often require status logic in the 'agent'/script
Graphing can be inflexible
Binary-only agents and/or lots of 'hand-rolling' of scripts required
More than one is usually required to get a full picture of performance and alerting

5 Other issues
Data reusability is low: one data point for one alert type
No 'official' agents for some products
Bundled monitoring offerings are often limited in scope/output
Lacking an API for integration with other products or NOC/status views
Combination or complex tests not possible
Very manual configuration
Reconfiguration often invalidates existing data

6 Other issues
Heavy clients behave poorly during heavy load
Segregation of criticality
Merging of data from external and out-of-band sources
Integration and automation of existing workflows

7 Design goals
Let's build a bike shed

8 Expressed goals
Don't reinvent the wheel
Monitor everything: more data means more useful alerts/views
Allow for actions to be taken based on data
Alerts and performance presented from the same interface
Easy access to information
Well-documented API
Move logic out of monitored machines

9 Expressed goals
Separate monitoring of different criticalities, with the more critical information using the 'lightest' possible client
Ease of use as a '10-foot view'
Ability to separate alerting based on team or personal responsibilities
Room for future adaptability

10 Architecture
What colour is your bike shed?

11 Architecture
Based on Zabbix
Full 'automated' configuration via setup scripts with each 'plugin'
Modular service written in Python for ease of expansion
Reusable plugins also function as standalone command-line tools for debugging and lightweight usage
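A minimal sketch of what such a dual-purpose plugin could look like (the module layout and metric keys below are illustrative assumptions, not the actual NCI/SGI plugins). Run standalone it prints values for debugging; the same collect() function can be called by the modular service and its output forwarded to Zabbix, for example as trapper items fed to zabbix_sender.

    #!/usr/bin/env python
    """Hypothetical example plugin: collect a few node metrics.

    Usable standalone for debugging, or imported by a monitoring service
    that forwards the values to Zabbix trapper items.
    """
    import argparse
    import os


    def collect():
        """Return a dict of metric_key -> value (keys are illustrative only)."""
        load1, load5, load15 = os.getloadavg()
        return {
            "example.cpu.load1": load1,
            "example.cpu.load5": load5,
            "example.cpu.load15": load15,
        }


    def main():
        parser = argparse.ArgumentParser(description="Example monitoring plugin")
        parser.add_argument("--key", help="print a single metric instead of all")
        args = parser.parse_args()

        metrics = collect()
        if args.key:
            # Single-value mode: convenient for ad-hoc debugging or agent checks.
            print(metrics[args.key])
        else:
            # "hostname key value" lines suit batch submission (e.g. zabbix_sender -i -).
            for key, value in sorted(metrics.items()):
                print("{0} {1} {2}".format(os.uname()[1], key, value))


    if __name__ == "__main__":
        main()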

12 Current monitoring plugins
Storage arrays
– IS 4X00/5X00
– IS 17000/17500
Lustre
– Server-side performance metrics
– Clients, via server-side metrics

13 Storage arrays
Alerting on faults at array and disk level
Inventory of drives in the array
Alerting on changes
Configurable history of drive S/N
Disk R/W latency
In progress:
– Drive-side bandwidth (dependent on array support)
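The drive-inventory and change-alerting bullets above map naturally onto Zabbix low-level discovery: a plugin emits a JSON document of {#MACRO} pairs describing each drive, Zabbix creates per-drive items and triggers from a prototype, and a changed serial number (a swapped drive) surfaces as an alert. The sketch below uses the real Zabbix LLD output format, but the array query is a stub; how the actual plugin talks to the IS array management interface is not shown in the slides.

    #!/usr/bin/env python
    """Hypothetical drive-discovery sketch in Zabbix low-level discovery format."""
    import json


    def query_array_drives():
        """Placeholder: a real plugin would query the storage array's
        management interface here. Static data is used for illustration."""
        return [
            {"slot": "0,1", "serial": "ZA123456", "state": "optimal"},
            {"slot": "0,2", "serial": "ZA654321", "state": "optimal"},
        ]


    def lld_output(drives):
        # Zabbix LLD expects {"data": [ {"{#MACRO}": value, ...}, ... ]}
        data = [
            {"{#DRIVESLOT}": d["slot"], "{#DRIVESERIAL}": d["serial"]}
            for d in drives
        ]
        return json.dumps({"data": data}, indent=4)


    if __name__ == "__main__":
        print(lld_output(query_array_drives()))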

14 Lustre
Collectable data:
– MDS operations (open/close/etc.)
– Per-OSS bandwidth/RPC rates
– Per-LNET-router bandwidth/RPC rates
– Per-OST bandwidth
– Per-OST read/write IO size/latency breakdown
– Per-client MDS operations
– Per-client bandwidth/RPC rates
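Most of these server-side numbers can be scraped from the Lustre stats files (exact paths vary with Lustre version; older servers expose them under /proc/fs/lustre, newer ones under /sys/kernel/debug/lustre). A rough sketch, under that assumption, of reading cumulative per-OST read/write byte counters on an OSS; the monitoring service samples these periodically and converts deltas into rates:

    #!/usr/bin/env python
    """Rough sketch: read cumulative per-OST byte counters on an OSS.

    Path and stats-file layout are version dependent; adjust for your Lustre release.
    """
    import glob
    import os

    OBDFILTER_GLOB = "/proc/fs/lustre/obdfilter/*/stats"  # assumption: older procfs layout


    def read_ost_bytes():
        """Return {ost_name: {"read_bytes": n, "write_bytes": n}} from cumulative counters."""
        results = {}
        for stats_path in glob.glob(OBDFILTER_GLOB):
            ost_name = os.path.basename(os.path.dirname(stats_path))
            counters = {"read_bytes": 0, "write_bytes": 0}
            with open(stats_path) as f:
                for line in f:
                    fields = line.split()
                    # Typical line: "write_bytes <count> samples [bytes] <min> <max> <sum>"
                    if fields and fields[0] in counters and len(fields) >= 7:
                        counters[fields[0]] = int(fields[6])  # cumulative byte total
            results[ost_name] = counters
        return results


    if __name__ == "__main__":
        for ost, counters in sorted(read_ost_bytes().items()):
            print(ost, counters["read_bytes"], counters["write_bytes"])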

15 Other information
CPU load/context switches/running processes
Memory/swap utilization
Free space in filesystems (useful for Lustre)
Checksums on important files (/etc/group, /etc/passwd, etc.)
Multipath (XVM and multipathd) status
Pacemaker resource status
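The file-checksum check is simple to implement; a minimal sketch (file list and hash choice are illustrative) whose output the monitoring server can compare against the previously stored value to alert on unexpected changes:

    #!/usr/bin/env python
    """Minimal sketch: checksum important files for change detection."""
    import hashlib

    WATCHED_FILES = ["/etc/passwd", "/etc/group"]  # illustrative list


    def file_checksum(path):
        """Return the SHA-256 hex digest of a file, reading in chunks."""
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(65536), b""):
                digest.update(chunk)
        return digest.hexdigest()


    if __name__ == "__main__":
        for path in WATCHED_FILES:
            print(path, file_checksum(path))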

16 Performance
Where did all my IOPS go?

17 How much do we have?
MDS performance
OSS performance
LNET performance
Array performance
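Because the Lustre and array statistics are cumulative counters, establishing "how much we have" amounts to sampling them under a known load (such as the uncontested IOR run on the next slide) and turning counter deltas into rates. A trivial helper for that conversion (the read_counter callable is whatever parses the relevant stats file):

    import time


    def sample_rate(read_counter, interval=10.0):
        """Sample a cumulative counter twice and return the average rate over the interval."""
        first = read_counter()
        time.sleep(interval)
        second = read_counter()
        return (second - first) / interval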

18 Uncontested IOR run

19 MDS performance on an average day

20 LNET throughput on a normal day

21 R/W sizes

22 Where did it go?
MDS performance
OSS performance
LNET performance
Array performance

23 “Lustre feels slow”

24 ~99% Read bandwidth consumed

25 OST Failover

26 Future
And then?

27 Profiling workloads

28 Peaks at front are cache saturation

29 Total RPCs

30 At a glance health status

31 Client side NFS performance

32 Data view of IS4600 Array

33 Questions?

Providing Australian researchers with world-class computing services
NCI Contacts
General enquiries:
Media enquiries:
Help desk:
Address: NCI, Building 143, Ward Road, The Australian National University, Canberra ACT 0200