Monitoring at GRIF. Frédéric SCHAER, cea.fr. HEPiX Workshop, April 27, 2012.

Presentation transcript:


Highlights
– Monitoring software
– Monitoring architecture
– Lessons learnt
– Foreseen evolutions

Monitoring Software
Used in the past
– Lemon, nagiosgraph
In use now
– Nagios+nrpe+pnp+rrdcached, ipmi, pakiti, cacti, custom graphing scripts, and since 2011: parts of check_mk
– Vendor HW monitoring sensors: using Nagios checks

Monitoring Architecture
Constraints
– Multiple physical locations: CEA, LAL, LLR, APC, LPNHE
– Protocol restrictions: no SSH, firewalls
– Variety of monitored systems and configurations
    Until recently: sl4/sl5/sl6, x86_64, i686
    Various hardware, different capabilities
– Number of admins
    1000 hosts + 16 admins = many config changes/day
– Quattor

Monitoring Architecture
– User access: HTTPS+X509, HTTP+iptables
– Reporting: RRD, BIRT
– Security: Pakiti
– Operations: Nagios, Check_mk

Reporting : RRD

Reporting : BIRT

Security
Pakiti setup
– Every GRIF node runs a cron job and reports back to the server at some random time, using SSL
– The server
    Periodically downloads RHEL OVAL patch definitions
    Computes vulnerabilities on each node report
    Displays the results to trusted users
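The Pakiti flow above can be sketched roughly as follows. This is only an illustration, not Pakiti's code: the one-hour reporting window, the function names and the tuple-based version comparison are assumptions (real RPM version comparison and OVAL evaluation are considerably more involved).

```python
import random

# Hypothetical reporting window: nodes spread their reports over one hour.
REPORT_WINDOW_SECONDS = 3600


def report_delay(rng):
    """Pick a random delay so that nodes do not all report at the same time."""
    return rng.uniform(0, REPORT_WINDOW_SECONDS)


def vulnerable_packages(installed, fixed_in):
    """Return the names of installed packages older than the first fixed version.

    installed: {package name: version tuple} as reported by a node (assumed format)
    fixed_in:  {package name: version tuple} derived from patch definitions (assumed format)
    """
    return sorted(
        name
        for name, version in installed.items()
        if name in fixed_in and version < fixed_in[name]
    )


if __name__ == "__main__":
    node_report = {"openssl": (0, 9, 8), "bash": (3, 2, 25)}
    definitions = {"openssl": (0, 9, 9), "bash": (3, 2, 25)}
    print(vulnerable_packages(node_report, definitions))
```

Spreading the reports over a window keeps a thousand nodes from hitting the server in the same second, which is presumably why the cron job reports at a random time.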

Security

Some (rough) operations history (columns: hosts, services, dependencies, hardware state)
– … : LEMON + … only: ~100 hosts, ~1000 services; KVM virtual machine, idle
– … : LEMON + … GRIF: ~600 hosts, ~8000 services, ~10000 dependencies; bi-CPU KVM VM, quite loaded
– 2012 : Nagios GRIF: … cores Opteron, 8GB RAM, dying
– 2012 : GRIF/LLR, and GRIF except GRIF/LLR: … cores Xeon, idle; 4 CPU, busy

Operations
2 Nagios servers in GRIF for now
– Running Nagios, with MK's mklivestatus broker
– Independent Nagios instances: no complex setup
– Generated configuration: quick reinstall
– Mainly using active checks through nrpe
– Mklivestatus exporting data
    Using xinetd: firewalled ports, only Nagios hosts allowed
    Using unix sockets
– MK's multisite (mod_python)
    Transparently aggregates data
    Efficiently displays data
    Is *fast*, amongst many other things
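As an illustration of how data exported by mklivestatus can be read and aggregated, here is a minimal sketch. It is not multisite's implementation: the function names and the site labels are hypothetical, and the socket address has to be supplied by the caller (a unix socket path, or a host/port pair for the xinetd-exported port).

```python
import json
import socket


def build_query(table, columns):
    """Build a Livestatus query asking for the given columns as JSON."""
    return f"GET {table}\nColumns: {' '.join(columns)}\nOutputFormat: json\n\n"


def query_livestatus(address, query):
    """Send one query and return the decoded rows.

    address is either a unix socket path (str) or a (host, port) tuple.
    """
    family = socket.AF_UNIX if isinstance(address, str) else socket.AF_INET
    with socket.socket(family, socket.SOCK_STREAM) as sock:
        sock.connect(address)
        sock.sendall(query.encode("utf-8"))
        sock.shutdown(socket.SHUT_WR)  # signal the end of the request
        chunks = []
        while True:
            data = sock.recv(65536)
            if not data:
                break
            chunks.append(data)
    return json.loads(b"".join(chunks).decode("utf-8"))


def merge_sites(results_by_site):
    """Aggregate rows from several independent Nagios servers, tagging each row with its site."""
    merged = []
    for site, rows in sorted(results_by_site.items()):
        for row in rows:
            merged.append([site] + list(row))
    return merged


if __name__ == "__main__":
    print(build_query("hosts", ["name", "state"]))
    print(merge_sites({"site-b": [["node02", 0]], "site-a": [["node01", 1]]}))
```

Because each Nagios server stays independent, aggregation happens only at display time, which matches the "no complex setup" point above.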

Operations

Nagios/check_mk config
– Automatically generated using quattor information
– No {warning,recovery,flapping,unknown}
– Uses host dependencies (downtimes, network outages)
– Uses service dependencies (mainly on nrpe)
– Uses the « large setups tweaks » Nagios option (use_large_installation_tweaks)
– Uses the rrdcached daemon for PNP
– Required many optimizations (and always will)
NRPE
– Arguments forbidden
– Firewalled
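Generating the configuration from an inventory can be sketched as below. The inventory dictionary, the host names, the addresses and the `generic-host` template name are hypothetical stand-ins, not the format Quattor actually provides; only the Nagios object syntax (`define host`, `define hostdependency`) is standard.

```python
def host_definition(name, address, template="generic-host"):
    """Render one Nagios host object."""
    return (
        "define host {\n"
        f"    use        {template}\n"
        f"    host_name  {name}\n"
        f"    address    {address}\n"
        "}\n"
    )


def host_dependency(master, dependent):
    """Render a host dependency: no notifications for `dependent` while `master` is down or unreachable."""
    return (
        "define hostdependency {\n"
        f"    host_name                      {master}\n"
        f"    dependent_host_name            {dependent}\n"
        "    notification_failure_criteria  d,u\n"
        "}\n"
    )


def generate_config(inventory):
    """Build the whole configuration text from an inventory dictionary (assumed format)."""
    parts = [
        host_definition(name, inventory[name]["address"])
        for name in sorted(inventory)
    ]
    for name in sorted(inventory):
        master = inventory[name].get("depends_on")
        if master:
            parts.append(host_dependency(master, name))
    return "\n".join(parts)


if __name__ == "__main__":
    inventory = {
        "switch01": {"address": "192.0.2.1"},
        "node01": {"address": "192.0.2.10", "depends_on": "switch01"},
    }
    print(generate_config(inventory))
```

Keeping the generator a pure function of the inventory is what makes a quick reinstall possible: the same input always yields the same configuration files.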

Bonus: nrpe trick
Compression
– On the NRPE side: check output | bzip2 | uuencode
– On the Nagios side: check_nrpe | uudecode | bunzip2
Compressing check output makes it possible to retrieve up to ~500K of data (with the op5 patch)
– Still not enough under high load for check_mk agents, because of the ps output
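The round trip can be illustrated with a short sketch. It uses Python's `bz2` and `binascii` modules as stand-ins for the `bzip2` and `uuencode` commands (without the `begin`/`end` header lines that the command adds), so it shows the principle rather than the exact pipeline used at GRIF; the function names are hypothetical.

```python
import binascii
import bz2

UU_LINE_BYTES = 45  # a uuencoded line carries at most 45 bytes of payload


def pack(output):
    """Compress check output, then uuencode it into printable text lines."""
    compressed = bz2.compress(output)
    lines = [
        binascii.b2a_uu(compressed[i:i + UU_LINE_BYTES])
        for i in range(0, len(compressed), UU_LINE_BYTES)
    ]
    return b"".join(lines)


def unpack(packed):
    """Reverse of pack: uudecode first, then decompress."""
    compressed = b"".join(binascii.a2b_uu(line) for line in packed.splitlines())
    return bz2.decompress(compressed)


if __name__ == "__main__":
    ps_like = b"root         1  0.0  0.1 /sbin/init\n" * 2000
    packed = pack(ps_like)
    print(len(ps_like), "bytes of output packed into", len(packed), "bytes")
    assert unpack(packed) == ps_like
```

Decoding must undo the steps in reverse order: the text layer is removed first, then the compression.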

Operations
Patched/packaged software (SL5/6 32/64), listed as software: patches
– Nagios: startup script 15 min -> 5 min; allow SELinux
– Check_mk: fix X509 SSL+FakeBasicAuth usage; change instant check retry button timeout; agent patches (disable ipmi, reduce data size, fix $PATH); fix hardcoded configuration paths, wrong links
– Nrpe: allow up to 64K output (op5 patch); create system user/group
– Nagios plugins, GRIF plugins, rrdtool

Lessons Learnt
Nagios recovery actions must be temporary, and bugs must be fixed ASAP.
What should not happen (when the almighty Nagios crashed):
– Or when /dev/null becomes a regular file, or…
Bank holiday…

Lessons Learnt
Nagios is powerful, but has its limits
– With no patch, a restart takes 15 minutes!
– The number of active checks is limited by the fork capabilities of the server: more expensive hardware (see Evolutions)?
Monitoring is an ongoing process
– Test output must constantly be interpreted/correlated
– Too much monitoring kills monitoring (admins)
– But there's always a need to monitor more things

Evolutions
– Divide the Nagios load across multiple, distributed hardware in GRIF
– Fork issue?
    Big Nagios performance boost when using hugepages memory (man libhugetlbfs)… at the expense of Nagios crashes
    But hugetlb seems worth investigating
– Reduce load with "active" passive checks (using check_multi? MK's check_mrpe?)
– Many things aren't monitored yet
    Memory and CPU cores that just disappear, exotic filesystem configurations, new services… and more to come
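One way to picture the "active" passive checks idea: a single run on the node executes several local checks and hands all the results to Nagios as passive results, so the server forks once per host instead of once per service. The sketch below is an assumption-laden illustration, not how check_multi or check_mrpe work internally; the check functions and host name are hypothetical, and only the `PROCESS_SERVICE_CHECK_RESULT` external command format is standard Nagios.

```python
import time


def run_local_checks(checks):
    """Run every local check once; each check returns (return code, plugin output)."""
    results = []
    for service, check in sorted(checks.items()):
        code, output = check()
        results.append((service, code, output))
    return results


def passive_result_lines(host, results, now=None):
    """Format results as Nagios external commands for passive service checks."""
    timestamp = int(time.time() if now is None else now)
    return [
        f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;{host};{service};{code};{output}"
        for service, code, output in results
    ]


if __name__ == "__main__":
    # Hypothetical checks standing in for real plugins.
    checks = {
        "load": lambda: (0, "OK - load 0.1"),
        "disk": lambda: (2, "CRITICAL - / full"),
    }
    for line in passive_result_lines("node01", run_local_checks(checks)):
        print(line)
```

With roughly 600 hosts and 8000 services, batching per host would cut the number of server-side check executions per cycle by about an order of magnitude.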

Evolutions
Check_mk
– Using check_mk agents and service auto-discovery might help reduce load (passive checks)
– But the agents' large output is incompatible with huge server loads
– And some of MK's probes must be tweaked (autofs, virtual NICs…)
Rework Nagios config generation, so that it can be shared with the quattor community?

Questions ?