4.12.2008, Prague. Jan Švec, Institute of Physics AS CR.

• History and main users
• Hardware description
• Networking
• Software and job distribution
• Management
• Monitoring
• User support

• HEP
  ○ LHC: ATLAS, ALICE
  ○ Tevatron: D0
  ○ RHIC: STAR
  ○ H1, CALICE
• Astrophysics: Auger
• Solid state physics
• Others (HEP section of the Institute of Physics)
• Active participation in grid projects since 2001 (EDG, EGEE, EGEE II, EGEE III (running), EGI) in collaboration with CESNET
• Czech Tier-2 site, connected to 2 Tier-1 sites (Forschungszentrum Karlsruhe, Germany; ASGC, Taiwan) and to Tier-3 sites (MFF Troja, FJFI ČVUT, ÚJF Řež, ÚJF Bulovka, UTEF)

• First computers bought in 2001 (2 racks), placed in the main building
  ○ insufficient cooling, little space, small UPS, inconvenient access (2nd floor)
• New server room opened in 2004
  ○ server room and adjacent office
  ○ 18 racks
  ○ 200 kVA UPS
  ○ 350 kVA diesel generator
  ○ 2 cooling units, water cooling planned for 2009
  ○ automatic fire suppression system (Argonite gas)
  ○ good access

• 35x dual PIII 1.13 GHz
• 67x dual Xeon 3.06 GHz
• 5x dual Xeon 2.8 GHz: redundant components, key services
• 3x dual Opteron 1.6 GHz: file servers
• 36x bl35p: dual Opteron 275, 280
• 6x bl20p: dual Xeon 5160
• 8x bl460: dual Xeon 5160
• 12x bl465: dual Opteron 2220

• HP Netserver, TB SCSI
• Easy Stor, 10 TB ATA
• Easy Stor, 30 TB SATA
• Promise VTrak M610p, 13 TB SATA
• HP EVA, TB FATA (SATA over FC)
• Overland Ultramus, 144 TB SATA: DPM pool
• Overland Ultramus, 12 TB Fibre Channel 4 Gb: tape library cache
• 100 TB tape library, LTO4 (expandable to 400 TB)

• SGI Altix ICE 8200
• 512 cores, Intel Xeon 2.5 GHz
• 1 GB RAM per core
• diskless nodes
• external SAS disk array, 7.2 TB
• InfiniBand 4x (20 Gbps)
• SUSE Linux Enterprise Server
• Torque + Maui
• SGI ProPack

• IBM iDataPlex
• 672 cores, Intel Xeon 2.83 GHz
• 2 GB RAM per core
• local SAS disks, 300 GB
• Scientific Linux CERN 4
• Torque + Maui
• first iDataPlex installation in Europe

• Scientific Linux (CERN) 4 and 5
• SUSE Linux (SLES 10, openSUSE 11)
• 32-bit; 64-bit testing in progress
• Job management: PBSPro 9.x and Torque with the Maui scheduler
  ○ fair-share scheduling, with cputime and walltime multipliers
• Legato NetWorker: tape backup (user home directories, configuration)
• gLite grid middleware (CE, SE, UI, MON box, site BDII, …)
• Job submission (a sketch of both paths follows below)
  ○ local: "prak" interface (for experiments without grid support); no special requirements
  ○ grid: UI interface (ATLAS, ALICE, Auger); requires an X.509 certificate signed by a grid certification authority (CESNET, CERN)
• Interface hosts are being merged
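To make the two submission paths concrete, here is a minimal sketch of how a job could be handed either to the local Torque/PBS batch system or submitted from a gLite UI. It is not the actual "prak" tooling; the queue name and the job.sh/job.jdl file names are assumptions.

```python
#!/usr/bin/env python
# Minimal sketch of the two submission paths described above.
# Assumptions (hypothetical): queue "lcgatlas", files job.sh / job.jdl.
import subprocess
import sys

def submit_local(script="job.sh", queue="lcgatlas"):
    """Local path: hand the script to the Torque/PBSPro server with qsub."""
    return subprocess.call(["qsub", "-q", queue, script])

def submit_grid(jdl="job.jdl"):
    """Grid path: run on a gLite UI host; needs a valid X.509 proxy
    (voms-proxy-init) from a recognised CA (e.g. CESNET, CERN)."""
    return subprocess.call(["glite-wms-job-submit", "-a", jdl])

if __name__ == "__main__":
    mode = sys.argv[1] if len(sys.argv) > 1 else "local"
    sys.exit(submit_grid() if mode == "grid" else submit_local())
```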

• Installation via PXE + kickstart
• Systems update automatically from the SLC repositories
• gLite middleware configured with YAIM (integrated into cfengine); see the sketch below
• Local site changes managed with cfengine
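One way to drive YAIM from a configuration-management run is to re-run it only when the site configuration actually changes. The sketch below illustrates that idea; the file locations, state file and node type are assumptions, not the site's actual layout.

```python
#!/usr/bin/env python
# Sketch: re-run YAIM only when site-info.def has changed.
# Hypothetical paths and node type; the real layout may differ.
import hashlib
import os
import subprocess

SITE_INFO = "/opt/glite/yaim/etc/site-info.def"   # assumed location
STAMP     = "/var/run/yaim.site-info.md5"         # checksum of last applied config
NODE_TYPE = "glite-CE"                            # example node type

def md5(path):
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

def main():
    current = md5(SITE_INFO)
    previous = open(STAMP).read().strip() if os.path.exists(STAMP) else None
    if current == previous:
        return 0                                  # configuration unchanged, nothing to do
    rc = subprocess.call(["/opt/glite/yaim/bin/yaim",
                          "-c", "-s", SITE_INFO, "-n", NODE_TYPE])
    if rc == 0:
        with open(STAMP, "w") as f:
            f.write(current + "\n")
    return rc

if __name__ == "__main__":
    raise SystemExit(main())
```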

• Manual administration is tedious and error-prone
  ○ configuration is scattered among several places (kickstart post-install vs. already installed nodes)
  ○ ad-hoc changes with no revision history; relies on communication among sysadmins
  ○ machines temporarily offline -> conflicting changes
  ○ problems when reinstalling -> things went missing
  ○ too much work
• Cfengine to the rescue!
  ○ manage hundreds of boxes from a central place
  ○ change tracking with Subversion
  ○ describe the end result, not the process

• Cfengine architecture
  ○ a central server running cfservd: serves policy files and other data to the nodes
  ○ cfagent on each node: performs the actual changes
  ○ cfexecd (run from cron): a thin wrapper around cfagent (see the sketch below)
• Configuration kept in one place, easily managed with an SCM
• Easy to shoot yourself in the foot
  ○ any change can have a huge impact: think before making changes!
  ○ interactions among rules may have unexpected results => TESTING IS IMPORTANT!
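Purely to illustrate the "thin wrapper around cfagent" role that cfexecd plays, a cron job along the following lines would add a random splay delay, run the agent and keep a log. This is a conceptual sketch, not cfexecd itself; the binary and log paths are assumptions.

```python
#!/usr/bin/env python
# Conceptual stand-in for cfexecd: splay, run cfagent, keep a log.
# Paths are assumptions; the real cfexecd does this (and more) natively.
import random
import subprocess
import time

CFAGENT = "/usr/sbin/cfagent"           # assumed install path
LOGFILE = "/var/log/cfagent-run.log"
SPLAY   = 300                           # spread runs over 5 minutes

def main():
    time.sleep(random.randint(0, SPLAY))   # avoid all nodes hitting cfservd at once
    with open(LOGFILE, "a") as log:
        log.write("=== cfagent run at %s ===\n" % time.ctime())
        rc = subprocess.call([CFAGENT, "-q"], stdout=log, stderr=subprocess.STDOUT)
        log.write("=== exit code %d ===\n" % rc)
    return rc

if __name__ == "__main__":
    raise SystemExit(main())
```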

• Flexible monitoring is crucial for reliability
• Tools: Nagios, Munin (graphs of quantities over time), PBS graphs, MRTG, CEF monitoring, …
• Why Nagios
  ○ de facto standard for monitoring
  ○ open source
  ○ easy to write new sensors (see the plugin sketch below)
  ○ static configuration is not a problem
  ○ lots of add-ons: Nuvola (nicer look for Nagios), NagiosReport (developed locally, summarizes problems at the site), NagiosGrapher (generates graphs from Nagios output), …
• Plugins
  ○ default plugins (part of the Nagios installation): ping, disk, procs, load, swap, ldap, …
  ○ SRCE plugins (developed by E. Imamagic): cert, dpm, gridftp, srm, …
  ○ locally developed: hpacucli, ups, jobs, gstat
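Writing a new sensor really is a small job: a Nagios plugin only has to print one status line and exit with 0/1/2/3 for OK/WARNING/CRITICAL/UNKNOWN. The sketch below counts queued batch jobs via qstat; the thresholds and the reliance on plain qstat output are assumptions, not the site's actual "jobs" plugin.

```python
#!/usr/bin/env python
# Minimal Nagios plugin sketch: warn/crit on the number of queued batch jobs.
# Thresholds and the dependence on plain "qstat" output are assumptions.
import subprocess
import sys

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
WARN_AT, CRIT_AT = 2000, 5000            # example thresholds

def main():
    try:
        out = subprocess.check_output(["qstat"]).decode()
    except Exception as exc:
        print("JOBS UNKNOWN - cannot run qstat: %s" % exc)
        return UNKNOWN
    # Count lines whose job-state column (5th field) is "Q" (queued).
    queued = sum(1 for line in out.splitlines()
                 if len(line.split()) >= 5 and line.split()[4] == "Q")
    if queued >= CRIT_AT:
        status, code = "CRITICAL", CRITICAL
    elif queued >= WARN_AT:
        status, code = "WARNING", WARNING
    else:
        status, code = "OK", OK
    # Text after "|" is Nagios performance data (used e.g. by NagiosGrapher).
    print("JOBS %s - %d jobs queued | queued=%d" % (status, queued, queued))
    return code

if __name__ == "__main__":
    sys.exit(main())
```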

Nagios summary generated at 11/28/ :10:02 in seconds.
======================================================================
Hosts in trouble: golias123, golias131
Hosts in downtime (not monitored): golias01, golias02, golias38
======================================================================
golias123:
==========
CFAGENT: CHECK_NRPE: Socket timeout after 30 seconds.

golias131:
==========
CFAGENT: CHECK_NRPE: Socket timeout after 30 seconds.

downtimes:
==========
golias01: Down for tests of cfengine installation.
golias02: Down for golias02-golias199 synchronization tests.
golias38: Disk failure.

• User support
• Mailing list (primarily used for news and downtime announcements)
• Wiki pages with documentation (user and admin sections)
• RT (Request Tracker) system developed by Best Practical Solutions
  ○ operated by CESNET in cooperation with FZU
  ○ used by users for communication with the administrators:
    1. the user sends e-mail to the support address
    2. the RT system creates a ticket with a unique number
    3. all administrators are notified by RT that a new ticket has been created
    4. administrators can discuss the problem with the user (using the reply function) or among themselves (using the comment function); all communication is saved as part of the ticket
    5. each week RT automatically reminds the administrators of all open tickets