AliEn central services (structure and operation)

AliEn central services (structure and operation) Costin.Grigoras@cern.ch

Central machines
5 32-bit
15 64-bit
6 Macs
-----------
26 machines on 2 * 1Gbit uplinks
The machines have different roles; the MonALISA services running on them report machine monitoring and each service's status at: http://pcalimonitor.cern.ch/stats?page=machines/machines
19.2 kVA UPSs (15m..50m backup): http://pcalimonitor.cern.ch/stats?page=ups/ups
11.07.2008, ALICE Offline Week - July 2008

AliEn services – User interaction
3 Authen
3 Proxy
2 User API Services
4 Jobs API Services

AliEn services – Internal services
PackMan, IS, Logger, TransferMgr
MonALISA repository
PackMan, Optimizers (Transfer, Catalogue, Jobs)
MySQL – Catalogue, LDAP master
MySQL – Task Queue, LDAP slave
(currently there are >44M entries in the catalogue, ~100x more than what you have on a PC)
Alice::CERN::SE xrootd redirector

AliEn services – backup
pcalienstorage: 9TB raw / 6TB available for backup
MySQL slave for both catalogue and task queue DBs (weekly stop / take snapshot / restart)
/backup on all machines is mounted over NFS from this machine
/opt/alien on all central machines is also mounted from this machine over NFS, with different base paths for each architecture

Build servers
32-bit SLC4
64-bit SLC4
32-bit OSX 10.5
64-bit OSX 10.5
(+ Itanium build server in CC)

DNS load balancing
Each machine reports its full status through ML (MonALISA) to the central repository, including:
  Operational status of each service (tested every 15m)
  Load on the machine, CPU, memory and swap utilisation
  No. of connected sockets
A weighted score is computed from the parameters above; every minute the CERN DNS aliases are updated with the IP addresses of the machines that are not overloaded.
The DNS aliases are queried by users or site services when connecting to the central services; by using them we distribute the load evenly between the active machines and limit the damage that can be caused to the central services.
TODO: faster reaction times to services not working / overloaded
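The scoring step described on this slide could look roughly like the sketch below. The slide does not give the actual weights, normalisation limits or overload threshold, so every constant and name here (MachineStatus, MAX_LOAD, MAX_SOCKETS, OVERLOAD_THRESHOLD, the weights) is a hypothetical choice for illustration, not the formula used by the central services.

```python
from dataclasses import dataclass


@dataclass
class MachineStatus:
    ip: str
    services_ok: bool  # operational status of the services on the machine
    load: float        # load average
    cpu: float         # CPU utilisation, 0..1
    memory: float      # memory utilisation, 0..1
    swap: float        # swap utilisation, 0..1
    sockets: int       # number of connected sockets


# Hypothetical normalisation limits and threshold (not from the slides).
MAX_LOAD = 16.0
MAX_SOCKETS = 5000
OVERLOAD_THRESHOLD = 0.8


def score(m: MachineStatus) -> float:
    """Return a busyness score in [0, 1]; 1.0 if any service is down."""
    if not m.services_ok:
        return 1.0
    # (weight, normalised value) pairs; the weights are assumptions.
    parts = [
        (0.40, min(m.load / MAX_LOAD, 1.0)),
        (0.20, m.cpu),
        (0.15, m.memory),
        (0.15, m.swap),
        (0.10, min(m.sockets / MAX_SOCKETS, 1.0)),
    ]
    return sum(weight * value for weight, value in parts)


def select_alias_members(machines: list[MachineStatus]) -> list[str]:
    """IPs of the machines that are not overloaded, least busy first."""
    healthy = [m for m in machines if score(m) < OVERLOAD_THRESHOLD]
    healthy.sort(key=score)
    return [m.ip for m in healthy]
```

Running select_alias_members once a minute and publishing the result to the DNS alias would reproduce the behaviour described above; the DNS update itself is site-specific and is left out of the sketch.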

DNS load balancing in action
Wed Jul 9 07:23:24 CEST 2008 : alice-proxy 137.138.99.136 137.138.99.137 137.138.99.141
Thu Jul 10 13:40:38 CEST 2008 : alice-proxy 137.138.99.137 137.138.99.141
Thu Jul 10 13:44:52 CEST 2008 : alice-proxy 137.138.99.136 137.138.99.137 137.138.99.141
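In the log above, 137.138.99.136 is missing from the alice-proxy alias in the 13:40:38 entry and is back in the 13:44:52 entry. A small sketch of how such entries could be compared automatically is shown below; it assumes the "timestamp : alias ip ip ..." layout seen in these three lines, and the function names are hypothetical.

```python
def parse_entry(line: str) -> tuple[str, str, set[str]]:
    """Split 'timestamp : alias ip ip ...' into its three parts."""
    timestamp, _, rest = line.partition(" : ")
    alias, *ips = rest.split()
    return timestamp, alias, set(ips)


def alias_changes(lines: list[str]) -> list[tuple[str, list[str], list[str]]]:
    """Return (timestamp, removed_ips, added_ips) for each membership change."""
    changes = []
    previous = None
    for line in lines:
        timestamp, _alias, ips = parse_entry(line)
        if previous is not None and ips != previous:
            changes.append(
                (timestamp, sorted(previous - ips), sorted(ips - previous))
            )
        previous = ips
    return changes


LOG = [
    "Wed Jul 9 07:23:24 CEST 2008 : alice-proxy 137.138.99.136 137.138.99.137 137.138.99.141",
    "Thu Jul 10 13:40:38 CEST 2008 : alice-proxy 137.138.99.137 137.138.99.141",
    "Thu Jul 10 13:44:52 CEST 2008 : alice-proxy 137.138.99.136 137.138.99.137 137.138.99.141",
]

for timestamp, removed, added in alias_changes(LOG):
    print(timestamp, "removed:", removed, "added:", added)
```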

Making use of the Macs
6 8-core machines with 8GB of RAM... sounds very tempting!
Pablo managed to start both Authen and Proxy on alimacx01 in almost no time, BUT...
The services kept crashing very quickly:
  Default ulimit -u : 266
  Max ulimit -u : 2500
With these constraints, we cannot use the machines for anything that spawns many processes (e.g. Proxy). Authen runs fine though, as would probably several other central services.
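The per-user process limit reported by ulimit -u can also be checked from a script before deciding which service to place on a machine. The sketch below uses Python's standard resource module (Unix only); the helper names and the process counts in the usage lines are hypothetical, only the limits 266 and 2500 come from the slide.

```python
import resource


def process_limits() -> tuple[int, int]:
    """Return the (soft, hard) per-user process limits, as in `ulimit -u`."""
    return resource.getrlimit(resource.RLIMIT_NPROC)


def can_host(required_processes: int, soft_limit: int) -> bool:
    """True if a service needing `required_processes` fits under the limit."""
    if soft_limit == resource.RLIM_INFINITY:
        return True
    return required_processes <= soft_limit


if __name__ == "__main__":
    soft, hard = process_limits()
    print(f"soft limit: {soft}, hard limit: {hard}")
    # Hypothetical process counts, checked against the limits from the slide.
    print(can_host(200, 266))    # fits under the default limit
    print(can_host(3000, 2500))  # does not fit even under the maximum
```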

Running jobs profile

Load comparison
"The more jobs, the less problems"? (Not quite: the load is higher when many jobs start / finish, or worse, when an SE is not available and causes an avalanche of failing jobs.)

Load at >10k jobs

Running jobs vs. load (last 6 months, 2-hour averages)

Future plans
Upgrade old central machines (2+ years) with more modern hardware (8 cores, 16-32GB RAM, fast SAS drives)
Use all available resources (especially the Macs) to be prepared to run at least 2x more jobs
Install two additional power lines (16A) to accommodate the greedy hardware
Maybe install an additional AC unit...

Last slide :)