New CERN CAF facility: parameters, usage statistics, user support Marco MEONI Jan Fiete GROSSE-OETRINGHAUS CERN - Offline Week – 24.10.2008.

Presentation transcript:

New CERN CAF facility: parameters, usage statistics, user support
Marco MEONI, Jan Fiete GROSSE-OETRINGHAUS
CERN - Offline Week - 24.10.2008

Outline
- New CAF: features
- CAF1 vs CAF2: processing rate comparison
- Current statistics
  - Users, Groups
  - Machines, Files, Disks, Datasets, CPUs
- Staging problems
- Conclusions

New CAF
Timeline
- Startup of the new CAF cluster
- First day with users on the new cluster
- Old CAF dismissed by IT
Usage
- 26 workers instead of 33 (but much faster, see later)
- Head node is « alicecaf » instead of « lxb6046 » (see the connection sketch below)
- GSI-based authentication, AliEn certificate needed
  - Announced since July, but many last-minute users with AliEn account != AFS account or with an unknown server certificate
- Datasets cleaned up; only the latest data production staged (First Physics, stage 3)
- AF v4-15 meta package redistributed
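As a hedged illustration of the connection change above (not taken from the slides), a ROOT macro along these lines would open a session on the new head node; only "alicecaf.cern.ch" comes from the slide, everything else is illustrative and assumes a standard ROOT + AliEn client setup.

```cpp
// connect_caf.C -- minimal sketch: authenticate and open a PROOF session on the new CAF.
#include <cstdio>
#include "TGrid.h"
#include "TProof.h"

void connect_caf()
{
   // GSI authentication needs a valid Grid certificate; connecting to AliEn
   // first is a convenient way to check that the certificate works.
   if (!TGrid::Connect("alien://")) {
      std::printf("AliEn authentication failed -- check your certificate\n");
      return;
   }

   // Open the PROOF session on the new head node (the old CAF used "lxb6046").
   TProof *p = TProof::Open("alicecaf.cern.ch");
   if (p && p->IsValid())
      std::printf("Connected to CAF with %d parallel workers\n", p->GetParallel());
}
```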

Technical Differences
cmsd (Cluster Management Service Daemon)
- Why? olbd is not supported any longer
- What? Dynamic load balancing of files and of the data name-space
- How? The stager daemon benefits from:
  - bulk prepare, which replaces the per-file touch (see the sketch below)
  - bulk prepare also allows "co-locating" files on the same node
GSI authentication
- Secure communication using user certificates and LDAP-based configuration management
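As a rough sketch of the bulk-prepare idea (an assumption about client-side usage, not the CAF stager daemon itself), ROOT's TFileStager interface lets one submit a single staging request for a whole list of files; the stager URL and file paths below are placeholders.

```cpp
// bulk_stage.C -- illustrative only: one bulk 'prepare' for several files,
// instead of touching each file individually. Host and paths are placeholders.
#include <cstdio>
#include "TFileStager.h"
#include "TList.h"
#include "TObjString.h"

void bulk_stage()
{
   TFileStager *stager = TFileStager::Open("root://alicecaf.cern.ch");
   if (!stager) return;

   // Collect the files to be staged...
   TList files;
   files.SetOwner(kTRUE);
   files.Add(new TObjString("root://alicecaf.cern.ch//castor/cern.ch/placeholder1.root"));
   files.Add(new TObjString("root://alicecaf.cern.ch//castor/cern.ch/placeholder2.root"));

   // ...and send a single bulk request for all of them.
   Bool_t ok = stager->Stage(&files);
   std::printf("bulk prepare %s\n", ok ? "submitted" : "failed");
}
```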

Architectural Differences

                    New CAF        Old CAF
Architecture        AMD 64         Intel 32
Machines            13 x 8-core    33 x dual CPU
Space for staging   13 x 2.33 TB   33 x 200 GB
Workers             26 (2/node)    33 (1/node)
Mperf               -              -

Why « only » 26 workers?
- You could use all 104 cores if you were alone
- With 26 workers, 4 users can effectively run concurrently (104 / 26 = 4)
- We estimate an average of 8 concurrent users
Processing units are 6.5x faster than on the old CAF

Outline
- CAF2: features
- CAF1 vs CAF2: processing rate comparison
- Current statistics
  - Users, Groups
  - Machines, Files, Disks, Datasets, CPUs
- Staging problems
- Conclusions

CAF1 vs CAF2 (Processing Rate)
Test dataset
- First Physics (stage 3): pp, Pythia6, 5 kG, 10 TeV
- /COMMON/COMMON/LHC08c11_10TeV_0.5T
- 1840 files, 276k events
Test task
- Tutorial task that runs over ESDs and displays the Pt distribution (a sketch of such a query follows below)
Other comparison test: RAW data reconstruction (Cvetan)
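For reference, a query of the kind used in the comparison would look roughly like this from a ROOT session (a sketch: only the dataset name comes from the slide; "PtSelector.C" is a hypothetical TSelector that fills a Pt histogram from the ESD tree).

```cpp
// run_pt_query.C -- sketch of a tutorial-style query on the staged dataset.
#include "TProof.h"

void run_pt_query()
{
   TProof *p = TProof::Open("alicecaf.cern.ch");
   if (!p || !p->IsValid()) return;

   // Run a TSelector over the ESD events of the dataset quoted above.
   p->Process("/COMMON/COMMON/LHC08c11_10TeV_0.5T", "PtSelector.C+");
}
```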

Reminder
- The test depends on the file distribution of the dataset used
- Parallel code:
  - creation of the workers
  - file validation (workers opening the files)
  - event loop (execution of the selector on the dataset)
- Serial code:
  - initialization of the PROOF master, session and query objects
  - file lookup
  - packetizer (distribution of file slices)
  - merging (the biggest task; see the note below)
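The split above is the usual Amdahl picture. As a hedged aside (the slide itself gives no formula), with a serial fraction s of the query the speedup and parallel efficiency on N workers are bounded by:

```latex
\[
  S(N) \;=\; \frac{T(1)}{T(N)} \;\le\; \frac{1}{\,s + \frac{1-s}{N}\,},
  \qquad
  E(N) \;=\; \frac{S(N)}{N}.
\]
```

A sizeable merging step keeps E(N) well below 1 as N grows, which is qualitatively consistent with the efficiencies of order 30% quoted for 104 workers in the table that follows (assuming that column uses a comparable definition).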

Benchmark results (task executed 5 times and averaged; "-" marks values not recovered from the slide)

#workers  #events  Size (GB)  Init time  Proc time  Ev/s  MB/s  Speedup  Efficiency
   33       2k        -          -          3s        -     -      -         -
   33      20k       1.35        -         17s        -     -      -         -
   33     120k       8.1         -          -         -     -      -         -
   33     200k      13.5         -          -         -     -      -         -
   33     276k      18.7         -          -         -     -      -         -
   26       2k        -          -          2s        -     -      -         -
   26      20k       1.35        -          6s        -     -      -         -
   26     120k       8.1         -          -         -     -      -         -
   26     200k      13.5         -          -         -     -      -         -
   26     276k      18.7         -          -         -     -      -         -
  104       2k        -          -          2s        -     -      -         -
  104      20k       1.35        -          5s        -     -      -        27%
  104     120k       8.1         -          -         -     -      -        35%
  104     200k      13.5         -          -         -     -      -        32%
  104     276k      18.7         -          -         -     -      -        30%

Processing Rate Comparison (1)
[Plots: processing rate vs time for 104 workers with 200k events and 104 workers with 276k events]
- The final average rate is the only important information
- The final tail reflects the fact that the workers stop working one by one
  - data unevenly distributed
- A longer tail shows a worker overloaded on the last packet(s)
  - at most 3 workers help on the same « slow » packet

Processing Rate Comparison (2)
[Plots: Events/sec and MB/sec vs #events, for 104, 26 and 33 workers]

Outline
- CAF2: features
- CAF1 vs CAF2: processing rate comparison
- Current statistics
  - Users/Groups
  - Machines, Files, Disks, Datasets, CPUs
- Staging problems
- Conclusions

CAF Usage
- Available resources in CAF must be used fairly
- Highest attention to how disks and CPUs are used
- Users are grouped (sub-detectors / physics working groups)
- Each group
  - has a disk space quota, used to stage datasets from AliEn (a staging sketch follows below)
  - has a CPU fairshare target (priority) to regulate concurrent queries
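To make the quota/staging workflow above concrete, here is a hedged sketch (not from the slides) of how a group member could register and stage a dataset through the PROOF dataset interface; the dataset name, group area and input list are hypothetical.

```cpp
// stage_dataset.C -- illustrative dataset registration on CAF; only the
// general PROOF dataset API is assumed, all names are hypothetical.
#include "TFileCollection.h"
#include "TProof.h"

void stage_dataset()
{
   TProof *p = TProof::Open("alicecaf.cern.ch");
   if (!p || !p->IsValid()) return;

   // Build a file collection from a text file with one file URL per line.
   TFileCollection *fc = new TFileCollection("LHC08c_sample");
   fc->AddFromFile("filelist.txt");

   // Register it in the group's dataset area (counts against the group quota)
   // and ask the cluster to verify/stage it.
   p->RegisterDataSet("/PWG2/myuser/LHC08c_sample", fc);
   p->VerifyDataSet("/PWG2/myuser/LHC08c_sample");

   // Show what is registered and which fraction is already staged.
   p->ShowDataSets();
}
```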

CAF Groups

Group    #Users
PWG0     21 (5)
PWG1     3 (1)
PWG2     39 (21)
PWG3     18 (8)
PWG4     30 (17)
EMCAL    2 (1)
HMPID    1 (1)
ITS      6 (3)
T0       2 (1)
MUON     4 (3)
PHOS     4 (1)
TPC      3 (2)
TOF      1 (1)
TRD      4 (0)
ZDC      1 (1)
VZERO    2 (0)
ACORDE   1 (0)
PMD      3 (0)
DEFAULT  -

19 registered groups, 145 (60) registered users
In brackets: the situation at the previous offline week

CAF Status Table

Files Distribution
- Nodes with more files can produce tails in the processing rate
- Above a defined threshold, no more files are stored on a node
- Files per node: min 1727, max 1863 (max difference: 8%)

Disk Usage
- Per node: max 116, min 105 (max difference: 10%)

Dataset Monitoring
- 28 TB disk space for staging
  - PWG0: 4 TB
  - PWG1: 1 TB
  - PWG2: 1 TB
  - PWG3: 1 TB
  - PWG4: 1 TB
  - ITS: 0.2 TB
  - COMMON: 2 TB

CPU Quotas
- The default group is no longer the most consuming one

Outline
- CAF2: features
- CAF1 vs CAF2: processing rate comparison
- Current statistics
  - Users, Groups
  - Machines, Files, Disks, Datasets, CPUs
- File staging
- Conclusions

File Stager
- CAF makes intensive use of 'prepare'
- 0-size files in Castor2 cannot be staged, but their replicas are OK
- A check at stager level avoids spawning endless 'prepare' requests on the same empty file
Staging flow per file (see the sketch below):
- Loop over the replicas (the CERN one, if any, is taken first)
- Unable to get replica[i] online in Castor and size == 0? -> file corrupted, skip it
- replica[i] not staged? -> add it to StageLIST; otherwise skip it
- Copy the replica if needed (API service)
- Stage StageLIST, then STOP
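A simplified, standalone sketch of that decision logic (my reading of the flowchart, not the actual CAF stager code); the replica URLs and flags are invented for illustration.

```cpp
// stager_logic.cpp -- toy model of the per-file staging decision described above.
#include <cstdio>
#include <string>
#include <vector>

struct Replica {
   std::string url;   // placeholder replica URL
   bool online;       // can the replica be brought online in Castor?
   long long size;    // size reported by the catalogue
   bool staged;       // already present on a CAF disk server?
};

int main()
{
   // The CERN replica (if any) is ordered first, as on the slide.
   std::vector<Replica> replicas = {
      {"root://castor.cern.ch//castor/cern.ch/fileA.root", false, 0, false},      // 0-size: corrupted
      {"root://se.remote.org//alice/fileA.root",           true,  123456, false}  // needs staging
   };

   std::vector<std::string> stageList;
   for (const Replica &r : replicas) {
      if (!r.online && r.size == 0) {           // 0-size file in Castor2
         std::printf("file corrupted, skipping %s\n", r.url.c_str());
         continue;                              // avoids endless 'prepare' on it
      }
      if (!r.staged)
         stageList.push_back(r.url);            // add to StageLIST
      // else: already staged, nothing to do
   }

   // One bulk 'prepare' for everything collected (not implemented in this toy).
   std::printf("would stage %zu replica(s)\n", stageList.size());
   return 0;
}
```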

Outline
- CAF2: features
- CAF1 vs CAF2: processing rate comparison
- Current statistics
  - Files distribution
  - Users/Groups
- Staging
- Conclusions

CAF Usage
- Subscribe to the mailing list using CERN SIMBA
- Web page with documentation
- CAF tutorial once a month
New CAF
- Faster machines, more space, more fun
- Shaky behaviour due to the higher user activity is under intensive investigation
Credits
- PROOF team and IT for the prompt support
- If (ever) you cannot connect, just drop a mail and wait for…
  … « please try again »