

NGOP Prototype Status Report T.Levshina

Next Generation Operation Group
Integrated Systems Development Department: Krzysztof Genser, Terry Jones, Tanya Levshina, Igor Mandrichenko, Don Petravick
Operating Systems Support Department: Troy Dawson, Jim Fromm, Lisa Giacchetti, Marc Mengel, Ken Schumacher, Steven Timm
Computing Services Department: Rick Thies, Rich Thompson

Presentation Highlights
– NGOP project phases
– Status of the Framework
– Status of the prototype deployment
– Near future milestones

NGOP Project Phases (since last HEPIX)
– December 2000: First prototype implementation was released.
– January 2001: Prototype was installed on the farms. Classes were held for farm administrators.
– February 2001: NGOP server node was installed in the operator console area. Monitoring by operators started.
– March 2001: New release (“Swatch” and “PlugIns” Agents). NGOP was evaluated by system administrators, operators and others. A strategy meeting was held.
– April 2001: “Xfalive” service (low-level ping) was provided for all nodes monitored by the Computing Services Department.

NGOP Architecture
[Architecture diagram. Monitored objects: Host, Element, Cluster, System. NGOP components: Sensor (s), Agent, Server, Monitoring Agent (MA), Monitoring Data Storage, Clients. Services shown: Central Server, Configuration File Management Service (with persistent configuration data), Archive Service (with archive), Performance Data Storage Service (with performance data for Clusters A and B), Data Analyzer. Clients shown: Administrator, Monitor, Report Generator, Router, Action Client. Clusters B1 and B2 are drawn with their sensors and MAs. The legend covers connection types (TCP, and UDP between a Monitored Element and its MA) and marks components not implemented in the prototype yet.]

Monitor Data Flow and NGOP Components Interaction
[Data-flow diagram. Monitored Elements report to their Monitoring Agents (MAs), which send events to the Central Server. The Central Server is connected to the Monitor, the Action Client (Action Request), the Archiver and the Configuration Service (backed by CVS).]
Example events shown on the slide:
– ID=swap.nodeA State=Up Value=98 SevLevel=Error Dscrb=“swap > 95 %”
– ID=syslogd.nodeB State=Down Dscrb=“syslogd is down”
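The two example events on this slide suggest a small key=value record. The following is a minimal sketch of such a record, assuming a Python dataclass whose field names mirror the slide's keys (ID, State, Value, SevLevel, Dscrb). The class name `Event` and the method `to_line` are hypothetical and are not part of the NGOP MA API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Event:
    """One report about a monitored element (hypothetical class; keys follow the slide)."""
    id: str                          # e.g. "swap.nodeA"
    state: str                       # "Up" or "Down" on the slide
    description: str                 # the slide's Dscrb field
    value: Optional[float] = None    # measured value, when there is one
    sev_level: Optional[str] = None  # e.g. "Error"

    def to_line(self) -> str:
        """Render the event in the key=value form shown on the slide."""
        parts = [f"ID={self.id}", f"State={self.state}"]
        if self.value is not None:
            parts.append(f"Value={self.value:g}")
        if self.sev_level is not None:
            parts.append(f"SevLevel={self.sev_level}")
        parts.append(f'Dscrb="{self.description}"')
        return " ".join(parts)


swap_event = Event("swap.nodeA", "Up", "swap > 95 %", value=98, sev_level="Error")
daemon_event = Event("syslogd.nodeB", "Down", "syslogd is down")
print(swap_event.to_line())
print(daemon_event.to_line())
```

Optional fields are omitted from the rendered line, which matches the second slide example, where no value or severity level is given.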

Status of Framework (Implemented Components)
Monitoring Agent:
– MA API (only Python binding)
– PlugIns Agent (XML configuration is required)
– Several types of MAs are provided in the NGOP Prototype:
  Linux Node "health":
    – System daemons presence
    – Critical file systems presence and size
    – CPU load
    – Memory utilization
    – Swap utilization
    – Number of users
    – Number of users’ processes
    – Number of processors
    – Baseboard temperature
    – Fan speed
  “Xfalive”:
    – Node availability (low-level ping)
    – Node reset
  FBS:
    – FBS daemons presence
    – Resources (“cpu” and scratch disk availability)
  “Swatch”:
    – Watches a log file (e.g. syslog or console log) for lines matching a regular expression
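The “Swatch” agent's job, watching a log for lines that match a regular expression, can be illustrated as follows. This is only a sketch of the idea, not the NGOP agent: the pattern names, the regular expressions and the sample log lines are all assumptions made for illustration.

```python
import re

# Hypothetical patterns; the real agent is configured through XML files.
PATTERNS = {
    "nfs_timeout": re.compile(r"nfs.*timed out", re.IGNORECASE),
    "drive_timeout": re.compile(r"\b(hd|sd)[a-z]\b.*timeout", re.IGNORECASE),
}


def scan(lines, patterns=PATTERNS):
    """Yield (pattern_name, line) for every line matching one of the patterns."""
    for line in lines:
        for name, pattern in patterns.items():
            if pattern.search(line):
                yield name, line.rstrip("\n")
                break  # report each line once, under the first pattern that matches


# Made-up log lines, for illustration only.
sample_log = [
    "May  2 10:01:07 nodeA kernel: nfs: server fs1 not responding, timed out\n",
    "May  2 10:02:11 nodeA kernel: hda: lost interrupt, timeout\n",
    "May  2 10:03:00 nodeA sshd[412]: session opened\n",
]

matches = list(scan(sample_log))
for name, line in matches:
    print(name, "|", line)
```

A real agent would follow the file as it grows and turn each match into an event for the central server; here a fixed list of lines stands in for the log.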

Status of Framework (Implemented Components)
NGOP Central Server (NCS):
– Gathers events from MAs
– Scalable (so far ~512 nodes)
– Provides users with requested information
– Handles multiple users
– Primitive locking mechanism to prevent simultaneous actions
– Action broadcasting
– Stores information locally and forwards it to Archive Storage
NGOP Configuration File Management Service:
– Provides a central repository for system configuration and monitoring rules
– Performs configuration sanity check
– Provides clients with component subscription list
– Allows dynamic reconfiguration
– Notifies clients about new configuration
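The slides do not say how the NCS lock works. As one possible reading of "primitive locking mechanism to prevent simultaneous actions", here is a sketch of a per-target lock table. The class `ActionLocks`, its methods and the user names are hypothetical.

```python
import threading


class ActionLocks:
    """Hypothetical lock table: at most one user may act on a target at a time."""

    def __init__(self):
        self._guard = threading.Lock()  # protects the table itself
        self._owners = {}               # target name -> user holding the lock

    def acquire(self, target, user):
        """Return True if `user` obtained the lock on `target`, False if it is taken."""
        with self._guard:
            if target in self._owners:
                return False
            self._owners[target] = user
            return True

    def release(self, target, user):
        """Release the lock; only the current owner may do so."""
        with self._guard:
            if self._owners.get(target) != user:
                return False
            del self._owners[target]
            return True


locks = ActionLocks()
print(locks.acquire("nodeA", "alice"))  # first request succeeds
print(locks.acquire("nodeA", "bob"))    # refused while alice holds the lock
print(locks.release("nodeA", "alice"))  # owner releases
```

Refusing the second request outright, rather than queueing it, keeps the mechanism simple; whether NGOP refuses or queues is not stated in the slides.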

Status of Framework (Implemented Components)
Archive Server:
– Handles archive storage (Oracle)
– Provides a means to read and query the data (FNAL web interface: MISWEB)
– Performs data roll out
– Performs clean up procedure
Action Client:
– Performs centralized actions
– Verifies user authorization to perform the action
– Notifies NCS about action exit status
Monitoring Client:
– Allows configuration of custom-built system views
– Defines rules that determine the status of the system and its components
– Requests and receives information about monitored objects
– Determines the status of the system based on the rules and the obtained information
– Initiates requests to perform actions
– All configuration files are written in XML
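To illustrate how a Monitoring Client might determine the status of a system from the status of its components, here is a sketch that uses a "worst status wins" rule over a nested view. This rule, the status names other than "Error", and the example view are assumptions; in NGOP the actual rules are defined in XML configuration files.

```python
# Assumed ordering from best to worst; only "Error" appears in the slides.
SEVERITY_ORDER = ["Good", "Warning", "Error"]


def status_of(node):
    """Return the status of a view node.

    A node is either a status string (a monitored element) or a dict mapping
    component names to child nodes (a host, cluster or system).
    """
    if isinstance(node, str):
        return node
    child_statuses = (status_of(child) for child in node.values())
    # "Worst status wins" rule (an assumption); an empty view counts as "Good".
    return max(child_statuses, key=SEVERITY_ORDER.index, default="Good")


# Hypothetical view: one farm with two hosts.
farm = {
    "nodeA": {"swap": "Error", "cpuLoad": "Good"},
    "nodeB": {"syslogd": "Warning"},
}

print(status_of(farm))           # status of the whole farm
print(status_of(farm["nodeB"]))  # status of a single host
```

The nesting follows the monitored-object hierarchy named in the architecture slide (element, host, cluster, system).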

Status of Framework (Not yet implemented components)
– Sensor Agent: an agent that collects performance data and generates events at a higher rate than a monitoring agent.
– Performance Data Storage Service: a service that provides persistent storage of performance data, as well as a means to read and query the data. Performance data will need to be consolidated.
– Looping Monitoring Agent: an agent that is capable of receiving information from NCS, analyzing it, deriving new events and sending them back to NCS.

CFMS Admin

NGOP Monitor (Configuration)

NGOP Monitor (Display)

NGOP Monitor (Display)

Prototype Statistics
Some implementation details:
– Written primarily in Python (some modules in C): ~10,000 lines of Python code and ~1,000 lines of C code
– Uses XML (and partially MathML) for all configuration files: ~600 configuration files
Some deployment details:
– Monitoring 512 nodes, checking for nodes being down and node resets
– Monitoring four farms (CDF, D0, two Fixed Target experiment farms): 270 nodes out of 512
– Number of Monitoring Agents: ~557 (270 local MAs monitor operating system and sensor data on the farms, 270 local MAs monitor syslog on the farms, 4 MAs monitor FBS on the corresponding farms, 13 MAs perform the “xfalive” service)
– Number of Monitored Objects: ~6,500
– About 5 instances of “ngop monitor” (GUI) are running simultaneously
– A local event log has been kept since January 12. The rate is ~13 events per hour

Current Configuration
[Deployment diagram. Central components: NGOP Central Server, Config File Management Server, Archive Service, NGOP Action Client, MAs (Ping); host label FNCDUH. Three user nodes each run an NGOP Monitor.
Farms, each with MA (OSHealth), Swatch and an FBS MA:
– Old FixTarget Farm: MA (OFT_FBS); hosts fnpc, fnsfh
– CDF Farm: MA (CDF_FBS); hosts fncdf, cdffarm1
– FixTarget Farm: MA (FT_FBS); hosts Fnpc, fnsfo
– D0 Farm: MA (D0_FBS); hosts fnd, d0bbin
Other monitored groups: WWW, Mail Servers, License Servers, Enstore, CMS, SDSS, Division Servers, MISCOMP, Kerberos, D0, CDF, BTEV, FNALU, KTEV, MINOS, ODS, HPPC, PPD.]

Summary Of Occurred Events
Detected Problems:
– Node reset
– Node is down
– One CPU is missing after reboot
– File system not mounted
– System daemon is dead
– FBS Batch Manager is down
Raised Alarms:
– Memory usage is high
– Swap usage is high
– CPU load is high
– File system is full
– Baseboard temperature is high
– Specific messages found in syslog: nfs timeouts, drive timeouts …

Report Generator (MISCOMP Web Query Interface)
Monitoring Agent id | Monitored Object id | Event type | Event name | Event value | Description
fnpc242_health | OSHealth fnpc242 cpuLoad fnpc242 | sysUsage | cpuLoad | 5.88 | Average load on the node is less or equal to 8 and greater than 5
fnpc208._health | OSHealth fnpc208 Memory fnpc208 | sysUsage | memory | 86 | Memory usage is greater or equal to 80% and less 95%
fnpc204_health | Hardware fnpc204 baseTemp_1 Fnpc204 | Hardware | baseTemp_1 | 45.0 | Temperature is between 45C and 50C
fnpc108_health | OSHealth fnpc108 rstatd fnpc108 | Daemon | rstatd | 0 | rstatd is not running
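The descriptions in this report imply threshold bands for each event name. The sketch below restates those bands as Python predicates and returns the matching description for a value. The event names, bounds and description strings come from the report; the `RULES` table, the `describe` function and the treatment of 50C as an inclusive bound are assumptions, and NGOP itself expresses such rules in XML rather than Python.

```python
# Event name -> (predicate on the value, description copied from the report).
RULES = {
    "cpuLoad": (lambda v: 5 < v <= 8,
                "Average load on the node is less or equal to 8 and greater than 5"),
    "memory": (lambda v: 80 <= v < 95,
               "Memory usage is greater or equal to 80% and less 95%"),
    # The report says "between 45C and 50C"; inclusive bounds are an assumption.
    "baseTemp_1": (lambda v: 45 <= v <= 50,
                   "Temperature is between 45C and 50C"),
    "rstatd": (lambda v: v == 0,
               "rstatd is not running"),
}


def describe(event_name, value):
    """Return the report description if `value` falls in the band, else None."""
    rule = RULES.get(event_name)
    if rule is None:
        return None
    predicate, description = rule
    return description if predicate(value) else None


# The four rows of the report.
for name, value in [("cpuLoad", 5.88), ("memory", 86), ("baseTemp_1", 45.0), ("rstatd", 0)]:
    print(name, value, "->", describe(name, value))
```

Values outside a band, such as a CPU load of 3.0, return None here; the report shows only one band per event name, so the other bands are left undefined.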

Next Milestone: From Prototype to Production System (for ~600 nodes)
– Goal 1: Gradually give the System Managers a Framework to develop and evolve tools to locally monitor their systems and enable them to send filtered information to the CSD operators
– Goal 2: Make sure all production systems can be supported by NGOP (excluding Windows2000 in the first phase)

Wish List: Improve the Production System
– Provide Monitoring Client API
– Implement Looping Agents
– Implement historical rules and escalating alarms
– Implement “snapshot” (“give me the updated system status now”) feature
– Provide Monitoring Agent API bindings other than Python
– Fully Kerberize
– Provide standard Win2000 Monitoring Agents
– Design and provide dynamic handling of configuration changes for the Monitoring Client
– Allow for easier handling of multiple configurations
– Improve Admin (Configuration Client) GUI
– Provide Configuration GUI (hoping for a good free XML editor though)
– Provide Performance Data Framework
– Redesign/rewrite GUI (for scalability and friendliness)
– Provide GUI for non-Linux platforms if really needed
– Work on scalability up to hosts