Thomas Finnern Evaluation of a new Grid Engine Monitoring and Reporting Setup.

Presentation transcript:

Thomas Finnern Evaluation of a new Grid Engine Monitoring and Reporting Setup

Thomas Finnern | SoGe and Splunk | Page 2
Abstract for Conference
> Title
  - Evaluation of a new Grid Engine Monitoring and Reporting Setup
> Summary
  - Dashboards and Event Correlation for Grid Engine with Splunk
> Content
  - Splunk is a commercial software platform for collecting, searching, monitoring and analyzing machine data. It provides interactive real-time dashboards that integrate multiple charts, reports and tables. We have been working on a Grid Engine setup based on the free branch, supporting standard reporting and simple job and fair-share debugging with easy chart generation and smart event correlation features. On top of this, we are trying to understand the added value of the enterprise branch, which supports integrated user authentication and role-based access controls. We plan to share our work in a publicly available Splunk Grid Engine app.

Outline
> Grid Engine Data and Splunk
  - Integration into Splunk
  - Job Data: submit, prolog, epilog plus Info on Errors
  - SoGE (Son of Grid Engine) Data: Messages and Accounting
  - System Data: Resources and Usage (numbers, projects)
> Needs
  - Job Inspection (View 1)
  - Accounting and Reports (View 2)
  - Weekly Project Views (View 3)
  - Realtime System Data (View 4)
  - Role Based Access
  - Grid Engine App …
> Status and Conclusions

Integration into Splunk
> Splunk Indexing
  - Index: Separate Directory used by all SoGE Data
  - Index: Events indexed by Time, Originating Host and Sourcetype
  - ASCII Store for Data Sets
  - Field Mapping on the Fly during Analysis
> Splunk Data Input
  - Extra Syslog Port mapped to SoGE Index
    - Submit, Jobstart, Jobend, System Data
  - Splunk Forwarder Running on Grid Engine Master
    - Reliable Upload to Splunk Server
    - Configured for Message File
    - Configured for Accounting File

Grid Engine Data and Splunk

Job Data via UDP Syslog
> Setup
  - Perl Script in JSV (Job Submission Verifier, on Server), Prolog and Epilog
> Submit
  - eventtype="sgelog" sgeevent="submit"
  - sgeuser sgejobid sgeroot sgecell sgehost
> Prolog, Epilog
  - eventtype=sgelog sgeevent={prolog|epilog}
  - sgeuser sgejobid sgeroot sgecell sgehost sgequeue sgeslots sgetaskid sgearch sgehosts sgepe sgesubmithost
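The slide only names the fields, not the script itself. As a rough illustration of how a submit, prolog or epilog hook could emit such an event over UDP syslog, here is a minimal Python sketch (the original uses Perl). The function names, the sample values and the syslog priority are assumptions made for this example; the receiving host and port are left to the caller because the slides do not state them.

```python
import socket


def format_job_event(event, **fields):
    """Build a key="value" payload in the style shown on the slide."""
    parts = ['eventtype="sgelog"', f'sgeevent="{event}"']
    parts += [f'{key}="{value}"' for key, value in fields.items()]
    return " ".join(parts)


def build_datagram(payload, pri=14):
    """Prefix the syslog priority: 14 = facility user (1) * 8 + severity info (6)."""
    return f"<{pri}>{payload}".encode("utf-8")


def send_udp_syslog(payload, host, port):
    """Send one datagram; UDP gives no delivery guarantee."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(build_datagram(payload), (host, port))


# Hypothetical values, for illustration only.
line = format_job_event("submit", sgeuser="alice", sgejobid=4711, sgehost="submit01")
```

Because UDP is fire-and-forget, a hook written this way cannot block job submission when the Splunk server is unreachable, at the cost of possibly losing events.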

Job Inspection (View 1): Find active users, jobs and hosts
> index=bird | transaction sgejobid startswith=submit span= | search sgejobid=* sgeuser=* | chart values(sgeproject) values(sgeuser) values(sgejobid) values(sgehost) by sgeuser
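To make the intent of the search easier to follow outside Splunk, the sketch below imitates it in Python: events are grouped per sgejobid, a group only starts at a submit event, and the distinct jobs and hosts are then listed per user. This is a simplified approximation of what transaction and chart values(...) do, not Splunk's actual semantics; the function names and the sample events are hypothetical.

```python
from collections import defaultdict


def group_by_job(events):
    """Group events by sgejobid; a group is only opened by a submit event."""
    jobs = defaultdict(list)
    for event in events:
        job_id = event.get("sgejobid")
        if job_id is None:
            continue
        # Events seen before the job's submit event are ignored.
        if event.get("sgeevent") == "submit" or job_id in jobs:
            jobs[job_id].append(event)
    return dict(jobs)


def users_overview(jobs):
    """Per user, list the distinct job ids and hosts seen in their jobs."""
    overview = {}
    for job_id, events in jobs.items():
        users = {e["sgeuser"] for e in events if "sgeuser" in e}
        hosts = {e["sgehost"] for e in events if "sgehost" in e}
        for user in users:
            entry = overview.setdefault(user, {"jobs": set(), "hosts": set()})
            entry["jobs"].add(job_id)
            entry["hosts"].update(hosts)
    return {u: {k: sorted(v) for k, v in e.items()} for u, e in overview.items()}


# Hypothetical sample events.
events = [
    {"sgeevent": "submit", "sgejobid": "101", "sgeuser": "alice", "sgehost": "submit01"},
    {"sgeevent": "prolog", "sgejobid": "101", "sgeuser": "alice", "sgehost": "node07"},
    {"sgeevent": "epilog", "sgejobid": "100", "sgeuser": "bob", "sgehost": "node03"},
    {"sgeevent": "submit", "sgejobid": "102", "sgeuser": "bob", "sgehost": "submit01"},
]
jobs = group_by_job(events)
overview = users_overview(jobs)
```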

Job Data via TCP Splunk Forwarder
> Setup
  - Splunk Forwarder (same RPM as on Splunk Master)
  - Reliable Data Collection over Splunk Protocol for accounting and messages files
> Grid Engine Messages
  - Running on qmaster, optionally on workernode
  - sourcetype=ge_messages sgeevent="{I,W,E}" source sgejobid sgettaskid sgemessage sgescope
> Grid Engine Accounting
  - Running on qmaster
  - sourcetype=ge_accounting source sgejobid sgedistro sgeuser sgear_submission_time sgearid sgecpu sgedepartment sgeend_time sgeexit_status sgefailed sgegranted_pe sgegroup sgehost sgeio sgeiow sgejobname sgemaxvmem sgemem sgepe_taskid sgepriority sgeproject sgequeue sgeru_idrss sgeru_inblock sgeru_ismrss sgeru_isrss sgeru_ixrss sgeru_majflt sgeru_maxrss sgeru_minflt sgeru_msgrcv sgeru_msgsnd sgeru_nivcsw sgeru_nsignals sgeru_nswap sgeru_nvcsw sgeru_oublock sgeru_stime sgeru_utime sgeru_wallclock sgeslots sgestart_time sgesubmission_time sgetask_number

Reports and Accounting (View 2): Grid Engine Accounting (30d)
> Pie "Sum CPU Seconds by Project"
> Timechart "Sum CPU Seconds by Project"
> Queue Timing "Wait Times and Wall Times by Queue"
> Timechart "Jobs in Error by Project"
> Table "All Values"
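The reports in this view are aggregations over the accounting fields. As a hedged illustration of the arithmetic behind "Sum CPU Seconds by Project" and "Wait Times and Wall Times by Queue", the Python sketch below works on records that use the field names from the accounting slide. It assumes the time fields hold epoch seconds and defines wait time as start minus submission and wall time as end minus start; the function names and sample records are hypothetical, and the actual dashboards are built in Splunk, not Python.

```python
from collections import defaultdict


def cpu_seconds_by_project(records):
    """Sum the sgecpu field per sgeproject."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec["sgeproject"]] += float(rec["sgecpu"])
    return dict(totals)


def mean_wait_and_wall_by_queue(records):
    """Return {queue: (mean wait seconds, mean wall seconds)}."""
    sums = defaultdict(lambda: [0, 0, 0])  # wait, wall, job count
    for rec in records:
        start = int(rec["sgestart_time"])
        acc = sums[rec["sgequeue"]]
        acc[0] += start - int(rec["sgesubmission_time"])
        acc[1] += int(rec["sgeend_time"]) - start
        acc[2] += 1
    return {q: (wait / n, wall / n) for q, (wait, wall, n) in sums.items()}


# Hypothetical accounting records.
records = [
    {"sgeproject": "atlas", "sgecpu": "120.5", "sgequeue": "short.q",
     "sgesubmission_time": "1000", "sgestart_time": "1060", "sgeend_time": "1360"},
    {"sgeproject": "atlas", "sgecpu": "79.5", "sgequeue": "long.q",
     "sgesubmission_time": "2000", "sgestart_time": "2600", "sgeend_time": "6200"},
    {"sgeproject": "cms", "sgecpu": "300", "sgequeue": "short.q",
     "sgesubmission_time": "3000", "sgestart_time": "3020", "sgeend_time": "3120"},
]
cpu = cpu_seconds_by_project(records)
timing = mean_wait_and_wall_by_queue(records)
```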

Accounting Analysis

System Data via UDP Syslog
> Setup
  - Perl Script running Commands qhost and qstat in Cron Job on Master and Slave
  - qhost provides Worker Node Resources, qstat shows Project Data
> sgeevent="numbers"
  - sgelog: sgehost="global" sgeShareTotal sgeSumProject- sgeSumJobs-Error sgeSumJobs-Waiting sgeSumJobs-SlotRun sgeSumJobs-Running sgeSumShares- sgeSumShare-Store sgeSumShare-Memory sgeSumShare-Cores sgeSumShare-Hosts sgeSumCores-Total sgeSumCores-Available sgeSumDistro- sgeSumQueue-l sgeSumQueue-total sgeSumStore-mem_used sgeSumStore-h_vmem sgeSumStore-h_ftotal sgeSumStore-mem_total sgeSumStore-h_fused
> sgeevent="projects"
  - sgelog: sgehost="global" sgeproject sgeshare sgestckt sgeovrts SlotInError sgeotckt sgetckts sgeftckt JobsInError SlotRunning sgemem JobsRunning sgeio sgecpu JobsWaiting
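The slide lists field names but not the wire format of a "numbers" event. Assuming the events are sent as space-separated key=value pairs (as the eventtype="sgelog" examples suggest), the following Python sketch shows how such a line could be split into fields and how a derived value such as the fraction of used cores could be computed from sgeSumCores-Total and sgeSumCores-Available. The sample payload, the numbers in it and the function names are hypothetical.

```python
import re

# key=value or key="quoted value"; keys may contain hyphens.
PAIR = re.compile(r'([\w-]+)=(?:"([^"]*)"|(\S+))')


def parse_fields(payload):
    """Split a key=value syslog payload into a dict of strings."""
    fields = {}
    for key, quoted, bare in PAIR.findall(payload):
        fields[key] = bare if bare else quoted
    return fields


def used_core_fraction(fields):
    """Fraction of cores in use, derived from total and available cores."""
    total = float(fields["sgeSumCores-Total"])
    available = float(fields["sgeSumCores-Available"])
    return (total - available) / total


# Hypothetical payload, for illustration only.
payload = ('eventtype="sgelog" sgeevent="numbers" sgehost="global" '
           'sgeSumCores-Total=2000 sgeSumCores-Available=500 sgeSumJobs-Running=1200')
fields = parse_fields(payload)
usage = used_core_fraction(fields)
```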

Weekly Project View (View 3): Project Info plus System Info
> Pies "Slots, CPU, IO and MEM by Project"
> Timechart "Slots, Waiting Jobs and Tickets by Project"

Realtime System Data (View 4): System Health and Trends
> Last 15 Minutes: Jobs, Slots, Waits and Errors
> Timechart 24 Hours: Jobs, Slots, Waits and Errors
> Timechart "Core Usage by Queue"
> Sparklines: Trends for Unavailable Cores, Distros, Memory Usage
> Table "All Summed Values"

Enterprise vs. Free Version
> Enterprise
  - Price per Data Volume …
  - Limit / License Handling
  - More Data (> 500 MB)
  - Role Based Data Access
  - Auto Report
  - HA
> Free
  - Limit on Data Volume …
  - Limit / License Handling
  - 500 MB
  - Interactive Analysis
  - Overall Correlations
  - Easy Debugging
  - Puppet Install

Status and Conclusion
> Status
  - Work in Progress
  - Fine-tuning Reports
  - Checking Data Consistency
  - Still Learning Splunk
  - …
> Conclusions
  - Would like to buy
    - Role Based Data Access
    - High Availability
    - …
> Thank you for Listening