Published by Camron Snow. Modified over 8 years ago.
1
Cluman: Advanced Cluster Management for Large-scale Infrastructures
Ivan Fedorko, Marian Babik, David Rodriguez, CERN
CHEP 2010, Taipei, Taiwan
Computing Facilities (CF), CERN IT Department, CH-1211 Geneva 23, Switzerland, www.cern.ch/it
2
Motivation I
rapid growth of virtualization and cloud computing services
  o increased complexity of management
  o scalability issues
@CERN
  o increasing capacity (new nodes every day)
  o hw lifecycle
  o large variability of configuration
  o facing scalability issues with existing infrastructure software
Cluman
  o investigates possible extensions to existing software
      advanced visualization and administrative job management
3
Motivation II
visualization
  o extend monitoring system (Lemon) visualization of fabrics
administration
  o run complex large-scale reconfigurations
      Castor
      Batch
  o improve performance and security measures
  o allow end-users to reconfigure their fabrics
  o introduce dynamic clusters
  o enable web-based administration
Timeline
  2008     Project initiated
  10/2008  First prototype
  04/2009  Phase II
  10/2009  Initial prototype
  08/2010  Production CERN CC
4
Visualization
Interactive visualizations
  o interactive high-density visualization of fabrics (clusters, OS, racks, etc.)
  o with monitoring information
      e.g. show me load average of cluster lxbatch (per node)
      e.g. show me which nodes in lxbatch have configuration error
[Figure: 3 subclusters of cluster lxbatch; callout label: node]
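The two example queries on this slide can be read as filters over per-node records. A minimal Python sketch, assuming a hypothetical in-memory list of node records: the field names, node names, subcluster names and values below are invented for illustration and are not Cluman's or Lemon's actual data model.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    # Hypothetical record; field names are illustrative only.
    name: str
    cluster: str
    subcluster: str
    load_avg: float
    config_error: bool


# Invented sample data standing in for monitoring information.
NODES = [
    Node("node001", "lxbatch", "sub_a", 0.7, False),
    Node("node002", "lxbatch", "sub_a", 3.2, True),
    Node("node003", "lxbatch", "sub_b", 1.1, False),
    Node("node004", "lxplus", "sub_c", 0.2, True),
]


def load_per_node(nodes, cluster):
    """'Show me load average of cluster X (per node)'."""
    return {n.name: n.load_avg for n in nodes if n.cluster == cluster}


def nodes_with_config_error(nodes, cluster):
    """'Show me which nodes in cluster X have a configuration error'."""
    return [n.name for n in nodes if n.cluster == cluster and n.config_error]


def group_by_subcluster(nodes, cluster):
    """Group node names by subcluster, as a grid view might lay them out."""
    groups = defaultdict(list)
    for n in nodes:
        if n.cluster == cluster:
            groups[n.subcluster].append(n.name)
    return dict(groups)
```

A real deployment would presumably evaluate such selections in the middleware or database rather than in client memory; the sketch only shows the shape of the query.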
5
Administration
web and command line interface to manage fabrics
run actions on selection of nodes or clusters
administrative action = Linux shell script
support multiple backends (currently cluman agent)
interface with existing fabric management tools
follow action's lifecycle
inspect action's output
keep administrative log (who did what and where)
support fine-grained authorization
  o e.g. who can do what and where
e.g. run reconfiguration on cluster c2public -> get status -> see what the error is where reconfiguration failed
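The flow on this slide (run an action on a selection of nodes, log who did what and where, then inspect failures) can be sketched as below. This is an illustration under stated assumptions, not Cluman's implementation: `EchoBackend`, `AdminLog`, `run_action` and `failed_nodes` are hypothetical names, and the backend only fabricates results instead of contacting an agent.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ActionResult:
    node: str
    exit_code: int
    stdout: str
    stderr: str


class EchoBackend:
    """Stand-in backend (hypothetical): pretends to run a script on a node.

    A real backend would hand the script to an agent on the node; this one
    only fabricates a result so the sketch runs anywhere.
    """

    def __init__(self, failing_nodes=()):
        self.failing_nodes = set(failing_nodes)

    def run(self, node, script):
        if node in self.failing_nodes:
            return ActionResult(node, 1, "", "simulated failure")
        return ActionResult(node, 0, f"ran {len(script)} bytes of script", "")


@dataclass
class AdminLog:
    entries: list = field(default_factory=list)

    def record(self, user, action, nodes):
        # who did what and where (plus when)
        self.entries.append({
            "who": user,
            "what": action,
            "where": list(nodes),
            "when": datetime.now(timezone.utc).isoformat(),
        })


def run_action(user, action_name, script, nodes, backend, log):
    """Log the request, then run the script on every selected node."""
    log.record(user, action_name, nodes)
    return {node: backend.run(node, script) for node in nodes}


def failed_nodes(results):
    """Nodes where the action exited non-zero, mapped to their stderr."""
    return {n: r.stderr for n, r in results.items() if r.exit_code != 0}
```

For example, calling `run_action` for a user on two invented node names of cluster c2public, with one of them listed in `failing_nodes`, and then `failed_nodes(results)` corresponds to the "get status -> see the error" step. Keeping the backend behind a single `run` method is one way to leave room for the multiple backends the slide mentions.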
6
Approach
lightweight 3-tier web architecture based on REST (Representational State Transfer)
  o frontend
  o middleware
  o database
REST is a lightweight web service architecture
Key principles
  o Scalability of component interactions
  o Generality of interface (easily extendable)
  o Independent deployment of components (distributed)
  o Intermediate components to reduce latency, enforce security, etc.
7
Architecture
Frontends
  o cluman-lib
  o cluman-web (Django/GWT)
  o cluman-shell
Cluman-agent: Linux daemon controlled by the middleware, managing the actions on a node
Cluman-api:
  o auth/authorization
  o visualization
  o job-management
  o cluman-queries
8
Performance
job management
  o requires threaded Django REST API with database connection pool
  o benchmarked for 10k jobs** (~150 req/s)
      Oracle shared server (SHS): used in production instance
      Oracle server pool (POOLS): promising candidate with 11g
      Oracle connection manager (CMAN)
  o possibility to reach 100k jobs with additional infrastructure and Oracle server pool
  o django 1.3 adds oracle client session pool
detailed benchmark log at TRAC wiki
**For each job a typical lifecycle is simulated (i.e. 3 requests as scheduled -> running -> finished, with a random timeout of at most 5 s between the running and finished states)
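The footnote's job lifecycle can be sketched as a small state machine that counts one request per state change. This is a minimal illustration, not the benchmark code from the TRAC wiki: the `Job` class and `simulate` function are hypothetical, the random delay is drawn but never slept, and the final division is plain arithmetic on the slide's figures (the slides do not state how long the benchmark ran).

```python
import random

# Lifecycle from the benchmark footnote: scheduled -> running -> finished.
TRANSITIONS = {None: "scheduled", "scheduled": "running", "running": "finished"}
REQUESTS_PER_JOB = 3  # one request per state change, as in the footnote


class Job:
    def __init__(self, job_id):
        self.job_id = job_id
        self.state = None  # not yet reported

    def advance(self):
        """Move to the next lifecycle state; each call stands for one request."""
        if self.state not in TRANSITIONS:
            raise ValueError(f"job {self.job_id} already finished")
        self.state = TRANSITIONS[self.state]
        return self.state


def simulate(n_jobs, max_delay_s=5.0, seed=0):
    """Walk n_jobs through the lifecycle without really sleeping.

    Returns (total_requests, delays); delays are the random waits that a
    real benchmark client would sleep between 'running' and 'finished'.
    """
    rng = random.Random(seed)
    total_requests = 0
    delays = []
    for job_id in range(n_jobs):
        job = Job(job_id)
        job.advance()  # -> scheduled
        job.advance()  # -> running
        delays.append(rng.uniform(0.0, max_delay_s))
        job.advance()  # -> finished
        total_requests += REQUESTS_PER_JOB
    return total_requests, delays


total, delays = simulate(10_000)
# Back-of-the-envelope only: time to serve all requests at a steady 150 req/s.
seconds_at_150 = total / 150
```

Under these assumptions 10k jobs amount to 30,000 requests, which a steady 150 req/s would absorb in 200 seconds; real runs overlap jobs and include the random waits, so this is only an order-of-magnitude check.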
9
Demo
1st demo
  Dynamic clusters selection, cluster visualization with monitoring information
2nd demo
  Management action, queues and job status, logging, on behalf action
10
References
TRAC
  o https://svnweb.cern.ch/trac/Cluman/
Mailing list
  o project-cluman@cern.ch
11
Thank you
From now on: Backup
12
Project status
project initiated by Sebastian Lopienski in 2008
  o with Filipe Manana
  o first prototype released in 10/2008
phase II in 04/2009
second prototype v0.7 milestone (10/2009)
v0.8 milestone (4/2010)
  o additional functionality and pre-production testing
v0.9 milestone (8/2010)
  o first production release
  o most of the features implemented (except for automation)
roadmap on cluman TRAC
13
Implementation I
Middleware (Django)
  o cluman-api (REST)
      auth/authorization API
      visualization API
      job-management API
      cluman-queries API
  o cluman-event-server
      standalone threaded pool server performing non-blocking operations
14
Implementation II
Clients
  o cluman-lib
      Cluman REST API libraries (python, javascript)
  o cluman-web
      Django/GWT application (javascript)
  o cluman-shell
      command line shell (python)
  o cluman-agent
      Linux system-level access (python)
15
Security
Centralized model: middleware performs both authentication and authorization for all clients
  o SSL, HTTPS
  o authentication API (django-shibboleth): integrates SSO and Django auth model
  o authorization API: role-based access control to resources
  o extensive logging kept by apache
  o additional security assertions
Designed in collaboration with security group
Cluman agent code reviewed before 0.8 release
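Role-based access control to resources ("who can do what and where") can be illustrated with a small lookup. This is a generic sketch of the technique, not Cluman's authorization API: the role names, user names, action names and the `is_authorized` helper are all hypothetical; only the cluster names lxbatch and c2public come from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Permission:
    action: str    # what, e.g. "reconfigure"
    resource: str  # where, e.g. a cluster name, or "*" for any


# Hypothetical roles and assignments, invented for illustration.
ROLE_PERMISSIONS = {
    "batch-admin": {Permission("reconfigure", "lxbatch"),
                    Permission("get-status", "lxbatch")},
    "observer": {Permission("get-status", "*")},
}
USER_ROLES = {
    "alice": {"batch-admin"},
    "bob": {"observer"},
}


def is_authorized(user, action, resource):
    """True if any of the user's roles grants `action` on `resource`."""
    for role in USER_ROLES.get(user, ()):
        for perm in ROLE_PERMISSIONS.get(role, ()):
            if perm.action == action and perm.resource in (resource, "*"):
                return True
    return False
```

In a centralized model such a check would sit in the middleware, after authentication and before any action is queued, so that every client goes through the same decision point.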
16
Performance
For each job a typical lifecycle is simulated (i.e. 3 requests as scheduled -> running -> finished, with a random timeout of at most 5 s between the running and finished states)
Target was 10k jobs
Various Oracle connection technologies
[Figure annotation: this number represents the number of handled nodes]
Production CLUMAN instance is able to handle 10k jobs (i.e. ~150 req/s) on 500 nodes.
17
Related work
RedHat, IBM, HP, Oracle CMS
Rocks, Scali, CMAv3
virtualization management systems
  o Windows Virtual Control Center
  o RedHat Enterprise
  o Oracle VM
  o Open Nebula
cluster management software
  o Cfengine
  o Quattor
  o Puppet
18
Backup slide
Visualization
  o how to visualize trees with high fanout (clusters with 5-6 sub-clusters and 3k+ nodes overall)
  o how to create a selection model for such scales
  o recently: not 3k but 10-30k
Action management
  o push-model for 5k-10k jobs
  o lifecycle
  o authentication and authorization
  o inspect stderr/stdout