A Marriage of Open Source and Custom William Deck

Slides:



Advertisements
Similar presentations
Building a secure Condor ® pool in an open academic environment Bruce Beckles University of Cambridge Computing Service.
Advertisements

Overview of Wisconsin Campus Grid Dan Bradley Center for High-Throughput Computing.
Setting up of condor scheduler on computing cluster Raman Sehgal NPD-BARC.
Included with Windchill PDMLink Free! Easy to setup and install Excellent high-level overview of your system immediately post setup Collaboration with.
Condor and GridShell How to Execute 1 Million Jobs on the Teragrid Jeffrey P. Gardner - PSC Edward Walker - TACC Miron Livney - U. Wisconsin Todd Tannenbaum.
Lincoln Bryant Suchandra Thapa HTCondor Week 2015 May 21, 2015 A Year of HTCondor Monitoring.
Evaluation of NoSQL databases for DIRAC monitoring and beyond
HTCondor at Cycle Computing: Better Answers. Faster. Ben Cotton Senior Support Engineer.
Lower costs and improve predictability Automation Enable service owners to focus on work that adds business value Reduce error-prone manual activities.
DIRAC API DIRAC Project. Overview  DIRAC API  Why APIs are important?  Why advanced users prefer APIs?  How it is done?  What is local mode what.
Communicating with Users about HTCondor and High Throughput Computing Lauren Michael, Research Computing Facilitator HTCondor Week 2015.
CERN - IT Department CH-1211 Genève 23 Switzerland t Monitoring the ATLAS Distributed Data Management System Ricardo Rocha (CERN) on behalf.
Track 1: Cluster and Grid Computing NBCR Summer Institute Session 2.2: Cluster and Grid Computing: Case studies Condor introduction August 9, 2006 Nadya.
HTCondor workflows at Utility Supercomputing Scale: How? Ian D. Alderman Cycle Computing.
Introduction to Cloud Computing
1 HawkEye A Monitoring and Management Tool for Distributed Systems Todd Tannenbaum Department of Computer Sciences University of.
Oracle Challenges Parallelism Limitations Parallelism is the ability for a single query to be run across multiple processors or servers. Large queries.
Informix IDS Administration with the New Server Studio 4.0 By Lester Knutsen My experience with the beta of Server Studio and the new Informix database.
1 Evolution of OSG to support virtualization and multi-core applications (Perspective of a Condor Guy) Dan Bradley University of Wisconsin Workshop on.
AppDynamics Ohio User Group. What is ExactTarget? Software as a Service Marketing 500 million s sent a day 200 million web transactions a day.
Eyes Off Glass Dinesh Gode Sr. Technical Specialist Oct 9, 2007.
Workflow Project Status Update Luciano Piccoli - Fermilab, IIT Nov
Ganga A quick tutorial Asterios Katsifodimos Trainer, University of Cyprus Nicosia, Feb 16, 2009.
Grid Computing at Yahoo! Sameer Paranjpye Mahadev Konar Yahoo!
Streamlining Monitoring Infrastructure in IT-DB-IMS Charles Newey ›
INFSO-RI Enabling Grids for E-sciencE ARDA Experiment Dashboard Ricardo Rocha (ARDA – CERN) on behalf of the Dashboard Team.
Centralized Logfile Search (a.k.a. Tracing) Vito Baggiolini with Gergo Horanyi, Felix Ehm, Stephen Page.
ISG We build general capability Introduction to Olympus Shawn T. Brown, PhD ISG MISSION 2.0 Lead Director of Public Health Applications Pittsburgh Supercomputing.
Computing Facilities CERN IT Department CH-1211 Geneva 23 Switzerland t CF Agile Infrastructure Monitoring HEPiX Spring th April.
SPI NIGHTLIES Alex Hodgkins. SPI nightlies  Build and test various software projects each night  Provide a nightlies summary page that displays all.
CERN IT Department CH-1211 Genève 23 Switzerland t CERN IT Monitoring and Data Analytics Pedro Andrade (IT-GT) Openlab Workshop on Data Analytics.
Experiments in Utility Computing: Hadoop and Condor Sameer Paranjpye Y! Web Search.
Monitoring with InfluxDB & Grafana
Condor Project Computer Sciences Department University of Wisconsin-Madison Running Interpreted Jobs.
An operating system (OS) is a collection of system programs that together control the operation of a computer system.
Alfresco Monitoring with OpenSource Tools Miguel Rodriguez Technical Account Manager.
The EPIKH Project (Exchange Programme to advance e-Infrastructure Know-How) gLite Grid Introduction Salma Saber Electronic.
GHRC Dashboard *** work in progress *** Ajinkya Kulkarni Rahul Ramachandran Helen Conover.
Elasticsearch – An Open Source Log Analysis Tool Rob Appleyard and James Adams, STFC Application-Level Logging for a Large Tier 1 Storage System.
1 RIC 2009 Symbolic Nuclear Analysis Package - SNAP version 1.0: Features and Applications Chester Gingrich RES/DSA/CDB 3/12/09.
Grid Computing: An Overview and Tutorial Kenny Daily BIT Presentation 22/09/2016.
Monitoring Primer HTCondor Workshop Barcelona Todd Tannenbaum Center for High Throughput Computing University of Wisconsin-Madison.
Monitoring Primer HTCondor Week 2017 Todd Tannenbaum Center for High Throughput Computing University of Wisconsin-Madison.
GRID COMPUTING.
Welcome to Indiana University Clusters
Understanding the New PTC System Monitor (PSM/Dynatrace) Application’s Capabilities and Advanced Usage Stephen Vaillancourt PTC Technical Support –Technical.
Welcome to Indiana University Clusters
HTCondor at RAL: An Update
WinCC-OA Log Analysis SCADA Application Service - Reporting
Monitoring with Clustered Graphite & Grafana
Operating a glideinWMS frontend by Igor Sfiligoi (UCSD)
Log Management Systems
HTCondor and LSST Stephen Pietrowicz Senior Research Programmer National Center for Supercomputing Applications HTCondor Week May 2-5, 2017.
SWITCHdrive Experience with running Owncloud on top of Openstack/Ceph
CREAM-CE/HTCondor site
Condor: Job Management
Oracle OpenWorld 2018: CATCH THE WAVE!
Introduction to Makeflow and Work Queue
Workflows with ENVI and Esri Agriculture workflows for ICARDA
SDM workshop Strawman report History and Progress and Goal.
What’s Different About Overlay Systems?
Module 01 ETICS Overview ETICS Online Tutorials
Get your ETL flow under statistical process control
The Condor JobRouter.
End to End Monitoring Solution using Open Source Technology where webMethods 9.10 is used as ESB IBM Confidential.
Open Source Continuous Integration System
Learn ELK in Docker in 90 minutes
Features Overview.
7.3 Example Use Cases Spirent Automation Platform Technologies.
Indexing with ElasticSearch
Presentation transcript:

A Marriage of Open Source and Custom William Deck Monitoring HTCondor A Marriage of Open Source and Custom William Deck

Goals Help those looking to augment/roll out their own cluster Monitoring Possibly a different perspective on monitoring Get some of this work out in the open

Who am I? Originally a mainframe developer Moved to SIG January of 2015 First task: Monitor the “cluster”

Our Cluster Mixed Cluster (Several) Storage ~1700 cores Lustre: 2.8 PB 1100 Windows 600 Linux (SLES11/SLES12) Storage Lustre: 2.8 PB GPFS: 8.4 PB independent clusters across the world

Our Workload ~100K jobs/week Mixed OS Workloads (Majority Windows) Daily Workflows (1 – 10K) One Off Workflows (+100K) Enough about me!

What is the Problem? Many moving parts Multiple groups using condor differently Single submit files DAGs Someone found out about condor_run Monitor 1 – 100k jobs Hard for users to self-support Correlate job failures to infrastructure problems cluster begets problems goal is to keep the error/retry rate down while giving users the ability to track down errors

Monitor Evolution HTCondor’s Job Monitor CycleComputing Solution Custom Scripts parsing condor_* commands Python bindings with Elastic, Grafana, and Conmon

First Solution – Tales of Condor Modelled after CycleComputing monitoring Parsing condor_status, condor_history output Put information into csv files Simple webpage displayed Total jobs vs running jobs Total jobs/User vs running jobs/User Coorelation between condor monitoring and infrastructure

First Solution: Revolutionary command grep For everything else we used grep Painful and tedious Good news! Really good at grep ssh and grep condor logs

Enter: Monmon Custom webpage designed around our workflow creation scripts Parses nodestatus/HTCondor log files Specific for one group’s workflows Useful for the user to drill down and share with others

Enter Elastic, Grafana, and Conmon Elastic (ELK) (https://www.elastic.co/) Elasticsearch - Search/analytics engine Logstash - Ingest of data Kibana - Frontend Grafana - (https://grafana.com/) extension of Kibana (3.0) Multiple backends (graphite, ES, influxdb ...)

Enter Elastic, Grafana, and Conmon Poll periodically using python bindings Insert into ES Augmented Job/Host ClassAd Custom subset of ClassAd Use Grafana frontend Tales of Condor 2.0 More Complex Dashboards Extended Monmon -> Conmon

ES Job Ad Example Searchable Job ClassAd Common Uses condor_history Used Resources search: kibana frontend, curl, python complete history of all schedds completed jobs Used/requested resources cluster utilization possiblities are endless ES led to Grafana

Grafana: One Stop Shop

Grafana Dashboards Easily create dashboards from ES and performance metrics Single Pane of Glass for Condor and Infrastructure

Tales of Condor 2.0

Overall Cluster Health This dashboard has every layer of our stack from NetApp/Storage lustre (infrastructure) exec Job stats schedds workstations application results, A rousing game of find the bottleneck

Grafana: Error Rates Find blackhole exec Postscript vs. ExitCode Errors by: User DAG Is it User or infrastructure Good to keep track of postcheck vs programmatic failures

Conmon Uses Flask ES Queries Display job classad Workflow overview Grid View DAG/Job Logs Workflow Analysis

Conmon: Home (DAGs)

Conmon: Workflow (DAGs)

Conmon: Grid (DAGs)

Workflow Analysis Information from ES Easy to see long legs

Individual Jobs

Conmon: Benefits Loading Condor information from ES Can handle multiple submission/workflows types (DAGs, submit) User can click through jobs Search/Filter Shareable Consistent views

Questions? Late to the party. Based on Todd’s talk it looks like a lot of what I did a while back is in condor.