Cluman: Advanced Cluster Management for Large-scale Infrastructures
Computing Facilities (CF), CERN IT Department, CH-1211 Geneva 23, Switzerland, www.cern.ch/it


Cluman: Advanced Cluster Management for Large-scale Infrastructures
Ivan Fedorko, Marian Babik, David Rodriguez (CERN)
CHEP 2010, Taipei, Taiwan

Motivation I
- rapid growth of virtualization and cloud computing services
  o increased complexity of management
  o scalability
  o increasing capacity (new nodes every day)
  o hardware lifecycle
  o large variability of configuration
  o facing scalability issues with existing infrastructure software
- Cluman
  o investigates possible extensions to existing software
      * advanced visualization and administrative job management

Motivation II
- visualization
  o extend monitoring system (Lemon) visualization of fabrics
- administration
  o run complex large-scale reconfigurations
      * Castor
      * Batch
  o improve performance and security measures
  o allow end-users to reconfigure their fabrics
  o introduce dynamic clusters
  o enable web-based administration
[Timeline figure, ending in 2010: Project initiated, First prototype, Phase II, Initial prototype, Production CERN CC]

Visualization
- Interactive visualizations
  o interactive high-density visualization of fabrics (clusters, OS, racks, etc.)
  o with monitoring information
      * e.g. show me the load average of cluster lxbatch (per node)
      * e.g. show me which nodes in lxbatch have a configuration error
[Figure labels: 3 subclusters of cluster lxbatch; node]
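
The two example queries above can be read as simple filters over per-node monitoring records. The following is only a minimal sketch under that assumption: the `Node` record, its field names, and the sample values are hypothetical and do not reflect the actual Cluman or Lemon data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    # Hypothetical per-node record; field names are illustrative only.
    name: str
    cluster: str
    subcluster: str
    load_avg: float
    config_error: bool


# Invented sample data, not real monitoring output.
NODES = [
    Node("lxb0001", "lxbatch", "sub1", 0.8, False),
    Node("lxb0002", "lxbatch", "sub1", 7.5, True),
    Node("lxb0003", "lxbatch", "sub2", 2.1, False),
    Node("c2p0001", "c2public", "default", 0.3, True),
]


def load_per_node(nodes, cluster):
    """'Show me the load average of a cluster (per node)'."""
    return {n.name: n.load_avg for n in nodes if n.cluster == cluster}


def nodes_with_config_error(nodes, cluster):
    """'Show me which nodes in a cluster have a configuration error'."""
    return sorted(n.name for n in nodes if n.cluster == cluster and n.config_error)


print(load_per_node(NODES, "lxbatch"))
print(nodes_with_config_error(NODES, "lxbatch"))
```

A high-density view would then map each returned value to a colored cell instead of printing it.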

Administration
- web and command line interface to manage fabrics
- run actions on a selection of nodes or clusters
      * administrative action = Linux shell script
      * support multiple backends (currently cluman agent)
      * interface with existing fabric management tools
- follow action's lifecycle
- inspect action's output
- keep administrative log (who did what and where)
- support fine-grained authorization
  o e.g. who can do what and where
- e.g. run reconfiguration on cluster c2public -> get status -> see what the error is where reconfiguration failed
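
One way to picture "follow the action's lifecycle, inspect its output, keep an administrative log" is a small state machine that records every transition. This is a sketch only: the `Action` class, the `failed` state, and the log tuple layout are assumptions made for illustration. The states scheduled, running and finished are the ones named in the benchmark description.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Allowed lifecycle transitions. "failed" is an assumed extra state.
TRANSITIONS = {
    "scheduled": {"running"},
    "running": {"finished", "failed"},
    "finished": set(),
    "failed": set(),
}


@dataclass
class Action:
    user: str      # who
    script: str    # what (an administrative action is a Linux shell script)
    target: str    # where (node or cluster)
    state: str = "scheduled"
    output: str = ""
    log: list = field(default_factory=list)

    def __post_init__(self):
        self._record(self.state)

    def _record(self, state):
        # Administrative log entry: (when, who, what, where, state).
        self.log.append(
            (datetime.now(timezone.utc), self.user, self.script, self.target, state)
        )

    def advance(self, new_state, output=""):
        """Move to new_state if the lifecycle allows it, keeping any output."""
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        self.state = new_state
        self.output += output
        self._record(new_state)


action = Action("alice", "reconfigure.sh", "c2public")
action.advance("running")
action.advance("failed", "hypothetical error text from one node")
print(action.state, [entry[4] for entry in action.log])
```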

Approach
- lightweight 3-tier web architecture based on REST (Representational State Transfer)
  o frontend
  o middleware
  o database
- REST is a lightweight web service architecture
- Key principles
  o Scalability of component interactions
  o Generality of interface (easily extendable)
  o Independent deployment of components (distributed)
  o Intermediate components to reduce latency, enforce security, etc.

Architecture
- Frontends
  o cluman-lib
  o cluman-web (Django/GWT)
  o cluman-shell
- Cluman-agent: Linux daemon controlled by the middleware, managing the actions on a node
- Cluman-api: auth/authorization, visualization, job-management, cluman-queries

Performance
- job management
  o requires a threaded Django REST API with a database connection pool
  o benchmarked for 10k jobs** (~150 req/s)
      * Oracle shared server (SHS): used in production instance
      * Oracle server pool (POOLS): promising candidate with 11g
      * Oracle connection manager (CMAN)
  o possibility to reach 100k jobs with additional infrastructure and Oracle server pool
  o Django 1.3 adds Oracle client session pool
- detailed benchmark log at TRAC wiki
** For each job a typical lifecycle is simulated (i.e. 3 requests as scheduled -> running -> finished, with a random timeout of at most 5 s between the running and finished states)
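
To make the benchmark footnote concrete: 10k jobs at 3 requests each give 30,000 requests, which at a sustained ~150 req/s would take about 200 s. The sketch below generates such a request schedule without contacting any server. It is not the benchmark code used by the authors. The function names, the uniform distribution of the delay, and the choice to start all jobs at time zero are assumptions.

```python
import random

STATES = ("scheduled", "running", "finished")


def simulate_requests(n_jobs, max_delay_s=5.0, seed=0):
    """Return a time-ordered list of (time_s, job_id, state) requests.

    Each job issues one request per lifecycle state. Only the step from
    running to finished is delayed, by a random amount up to max_delay_s.
    """
    rng = random.Random(seed)
    requests = []
    for job_id in range(n_jobs):
        t = 0.0  # assumption: every job is scheduled at time zero
        for state in STATES:
            if state == "finished":
                t += rng.uniform(0.0, max_delay_s)
            requests.append((t, job_id, state))
    return sorted(requests)


def seconds_at_rate(n_requests, rate_per_s):
    """Time needed to serve n_requests at a constant request rate."""
    return n_requests / rate_per_s


reqs = simulate_requests(10_000)
print(len(reqs), "requests,", seconds_at_rate(len(reqs), 150), "s at 150 req/s")
```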

Demo
- 1st demo
  o Dynamic cluster selection, cluster visualization with monitoring information
- 2nd demo
  o Management action, queues and job status, logging, on-behalf action

References
- TRAC
- Mailing list

Thank you
From now on: Backup

Project status
- project initiated by Sebastian Lopienski in 2008
  o with Filipe Manana
  o first prototype released in 10/2008
- phase II in 04/2009; second prototype; v0.7 milestone (10/2009)
- v0.8 milestone (4/2010)
  o additional functionality and pre-production testing
- v0.9 milestone (8/2010)
  o first production release
  o most of the features implemented (except for automation)
- roadmap on cluman TRAC

Implementation I
- Middleware (Django)
  o cluman-api (REST)
      * auth/authorization API
      * visualization API
      * job-management API
      * cluman-queries API
  o cluman-event-server
      * standalone threaded pool server performing non-blocking operations
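
The idea of a threaded pool that keeps callers from blocking on slow per-node operations can be illustrated with Python's standard `concurrent.futures`. This is a generic sketch of the pattern, not the cluman-event-server implementation: `notify_agent`, its simulated delay, and the returned strings are placeholders.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def notify_agent(node):
    # Placeholder for a slow per-node operation (e.g. a network call).
    time.sleep(0.05)
    return f"{node}: ok"


def dispatch(nodes, workers=8):
    """Submit one task per node to a thread pool and collect the results.

    submit() returns a Future immediately, so the loop that schedules the
    work never waits on an individual node.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {node: pool.submit(notify_agent, node) for node in nodes}
        return {node: fut.result() for node, fut in futures.items()}


print(dispatch(["node-a", "node-b", "node-c"]))
```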

Implementation II
- Clients
  o cluman-lib
      * Cluman REST API libraries (Python, JavaScript)
  o cluman-web
      * Django/GWT application (JavaScript)
  o cluman-shell
      * command line shell (Python)
  o cluman-agent
      * Linux system-level access (Python)

Security
- Centralized model: middleware performs both authentication and authorization for all clients
  o SSL, HTTPS
  o authentication API (django-shibboleth): integrates SSO and Django auth model
  o authorization API: role-based access control to resources
  o extensive logging kept by Apache
  o additional security assertions
- Designed in collaboration with security group
- Cluman agent code reviewed before 0.8 release
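
Role-based access control in the sense of "who can do what and where" can be sketched as a lookup from users to roles and from roles to (action, cluster) grants. Everything named here is hypothetical: the users, roles, action names, and the `*` wildcard are illustrative and are not taken from the Cluman authorization API.

```python
from typing import NamedTuple


class Grant(NamedTuple):
    role: str     # who (via role membership)
    action: str   # what
    cluster: str  # where; "*" means any cluster (assumed convention)


# Hypothetical role assignments and grants.
ROLES = {
    "alice": {"castor-admin"},
    "bob": {"operator"},
}
GRANTS = [
    Grant("castor-admin", "reconfigure", "c2public"),
    Grant("operator", "get-status", "*"),
]


def is_allowed(user, action, cluster):
    """True if any role held by the user grants the action on the cluster."""
    roles = ROLES.get(user, set())
    return any(
        g.role in roles and g.action == action and g.cluster in ("*", cluster)
        for g in GRANTS
    )


print(is_allowed("alice", "reconfigure", "c2public"))
print(is_allowed("bob", "reconfigure", "c2public"))
```

In a centralized model such a check would run in the middleware before any action reaches an agent.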

Performance
- For each job a typical lifecycle is simulated (i.e. 3 requests as scheduled -> running -> finished, with a random timeout of at most 5 s between the running and finished states)
- Target was 10k jobs
- Various Oracle connection technologies
- [Figure annotation: this number represents the number of handled nodes]
- Production CLUMAN instance is able to handle 10k jobs (i.e. ~150 req/s) on 500 nodes.

Related work
- RedHat, IBM, HP, Oracle CMS Rocks, Scali, CMAv3
- virtualization management systems
  o Windows Virtual Control Center
  o RedHat Enterprise
  o Oracle VM
  o Open Nebula
- cluster management software
  o Cfengine
  o Quattor
  o Puppet

Backup slide
- Visualization
  o how to visualize trees with high fanout (clusters with 5-6 sub-clusters and 3k+ nodes overall)
  o how to create a selection model for such scales
  o recently: not 3k but 10-30k
- Action management
  o push-model for 5k-10k jobs
  o lifecycle
  o authentication and authorization
  o inspect stderr/stdout