ANL T3g infrastructure – S. Chekanov (HEP Division, ANL), ANL ASC Jamboree, September 2009


Slide 2: ANL ASC T3g
– ~90 registered users
– Twiki with:
  – ASC computer workbook
  – Tier3 setup guide
– Could provide Indico service (not activated)
– CVS: together with CERN, migrated to SVN
– SVN and SVN browser:
  – 6 supported packages: HighETJets, PromptGamma, InvMass, CosmicAnalysis, JetAnalysis
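For reference, a checkout of one of the supported packages from the ANL SVN would look roughly like the command below; the repository URL is a placeholder (the real address is documented on the ASC Twiki), so treat this as a sketch rather than the exact command.

    # hypothetical checkout of a supported analysis package (URL is a placeholder)
    svn co https://svn.example.anl.gov/asc/PromptGamma/trunk PromptGamma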

Slide 3: ANL T3g computing
– About 50 cores chained by Condor
– Interactive nodes: atlas16,17,18 (16 cores), SL4.8
– User scratch disk space (~5 TB), excluding NFS and data storage
– PC farm prototype:
  – ArCond/Condor
  – 24 cores, 6 TB data storage
– Software:
  – Grid, pathena, OSG-client, dq2-get
  – All major ATLAS releases; atlas18 contains locally-installed releases
  – SL5.3: validation computers (atlas11, atlas50) + all desktop computers (one release only; setup is exactly the same)
  – Change: set_atlas.sh to set_atlas_sl5.sh (gcc432!)
– Migration to SL5.3 in ~1 month
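As a sketch of the SL4 → SL5 setup change mentioned above: only the setup script name changes, the rest of the procedure stays the same. The installation path below is an assumption, not the actual ANL location.

    # SL4.8 interactive nodes (path is hypothetical)
    source /share/atlas/set_atlas.sh
    # SL5.3 validation and desktop machines: same setup, different script (gcc 4.3.2)
    source /share/atlas/set_atlas_sl5.sh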

Slide 4: ANL T3g cluster design
– Prototype was designed (24 cores) and operational since Sep. 2008
– Fully satisfies the T3g requirements:
  – Grid access
  – 24-core cluster with ArCond/Condor
  – 2 TB per 8 cores, data "pre-staged" (local to disks), no network load
  – Other 25 cores (atlas16-27) can also be used by Condor, but with no local data
– (Diagram labels: interactive worker nodes atlas16-18; atlas50-53)
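For the cores without local data, jobs can be steered through plain Condor, while ArCond wraps Condor for the data-local PC farm nodes. A minimal submit description is sketched below; the executable and file names are hypothetical.

    # run_analysis.sub -- minimal Condor submit description (hypothetical names)
    universe   = vanilla
    executable = run_analysis.sh        # user analysis wrapper script
    arguments  = $(Process)
    output     = job_$(Process).out
    error      = job_$(Process).err
    log        = job.log
    queue 8                             # submit 8 job instances

It would be submitted with condor_submit run_analysis.sub.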

Slide 5: Data sets on the PC farm
– Since Sep. 2008, we store AOD MC files:
  – ~4M Monte Carlo AOD events (+ a few ESD sets)
  – Corresponds to ~25% of the total capacity of the PC farm prototype

Slide 6: Benchmarking results for 24 cores (Xeon 2.3 GHz)
– Running over AOD files: 0.5M events/h
– Fast MC simulation and on-the-fly analysis: 1.5M events/h
– Running over C++/ROOT ntuples: 1000M events/h (1M events/min for 1 core)
– Generating MC truth ntuples: 2.5M events/h
– AOD production (generating & reconstructing MC events): 120 events/h
– Most tests done with the PromptGamma package (ANL SVN): all AOD containers are accessed, and jets/gamma/e/muons/taus/missET are written to ntuples
– Data local to each CPU (3 nodes, 8 cores per node, 33% of the data on each box)
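For orientation, the per-core arithmetic behind the ntuple rate (an estimate, not part of the original slide):

    1M events/min/core x 60 min x 24 cores ≈ 1440M events/h

so the measured ~1000M events/h sits below the naive CPU scaling, in line with the "I/O limit?" question raised on the last slide.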

Slide 7: Getting data from Tier1/2 to ASC ANL
– SL5.3 TCP tuning recommended by ESnet (R. Yoshida; checks by D. Benjamin for Duke's T3)
– Satisfactory for MidWest Tier2 (UChicago): ~50 MB/s (4.5 TB/day; other sites ~3 TB/day)
– For a single thread, the network speed is < 120 Mbps (on a 1 Gbps uplink!)
– Recent stress tests using dq2-get (default: 3 threads)
  – Data: user.RichardHawking topmix_Egamma.AOD.v2 (125 GB)
  – A bash script with dq2-get is used for benchmarking (3 threads)
– (Plot legend: brown color means at least one file has 0 size)
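The benchmarking script itself is not shown on the slide; a minimal sketch of such a wrapper is given below. The dataset name is a placeholder and the timing logic is an assumption.

    #!/bin/bash
    # Hypothetical dq2-get benchmark: time one download and report the average rate.
    DATASET="user.SomeUser.topmix_Egamma.AOD.v2"   # placeholder (~125 GB set on the slide)
    START=$(date +%s)
    dq2-get "$DATASET"                             # default behaviour: 3 threads
    END=$(date +%s)
    # dq2-get places the files in a directory named after the dataset
    SIZE_MB=$(du -sm "$DATASET" | awk '{print $1}')
    echo "average rate: $(( SIZE_MB / (END - START) )) MB/s"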

Slide 8: Getting data from Tier1/2 to ASC ANL
– Even after TCP tuning, the network bandwidth is ~100 Mbps for a single-thread download (~300 Mbps for dq2-get)
  – Reason: packet losses in the 10 Gbps → 1 Gbps switches
– Possible solution: take advantage of the PC farm design and use multiple dq2-get threads on each PC farm box (see the sketch below)
  – Split the dataset into equal subsets and create a file list
  – Run dq2-get on each PC farm node in parallel using the file list
– Use the dq2-get front-end included in the ArCond package:
  – arc_ssh -h hosts-file -l -o /tmp/log "exec send_dq2.sh"
  – Gets a list of files, splits it into ranges depending on the number of slaves, and executes dq2-get on each slave using this list
  – Tested using 5 Linux boxes (five dq2-get threads)
– 4 TB/day from BNL/SLAC achieved using 2-3 dq2-get threads; ~5 TB/day from UChicago using a single dq2-get
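A rough sketch of the parallel-download idea follows. This is not the ArCond send_dq2.sh itself: the host names, the scratch path, the pre-made file list, and the use of a dq2-get -f option with a comma-separated file list are all assumptions.

    #!/bin/bash
    # Hypothetical parallel fetch: split a file list across PC farm nodes and
    # run one dq2-get per node over ssh. All names and paths are illustrative.
    DATASET="user.SomeUser.topmix_Egamma.AOD.v2"
    NODES=(atlas50 atlas51 atlas52 atlas53)        # hypothetical PC farm hosts
    # files.txt is assumed to hold one dataset file name per line

    NFILES=$(wc -l < files.txt)
    split -l $(( (NFILES + ${#NODES[@]} - 1) / ${#NODES[@]} )) files.txt chunk_

    i=0
    for chunk in chunk_*; do
      node=${NODES[$i]}
      # each node fetches only its own subset of files
      ssh "$node" "cd /data/scratch && dq2-get -f $(paste -sd, "$chunk") $DATASET" \
          > /tmp/dq2_$node.log 2>&1 &
      i=$((i+1))
    done
    wait    # block until every node has finished its download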

Slide 9: ANL Tier3 future (~1 year)
– (Diagram) 10 Gbps uplink; file server (~100 TB); ~100 cores with ArCond/Condor and ~40 TB local disk; ~100 cores with xrootd/ArCond/Condor and ~40 TB local disk
– PC farm nodes are based on Dell PowerEdge R710: 2 processors, Xeon 5520 (8 cores), … GB per core, plus 4 TB local data disks

Slide 10: Future (~2 years from now)
– (Diagram) 10 Gbps uplink; file server (~100 TB); ~100 cores with ArCond and ~40 TB local disk; ~200 cores with xrootd/ArCond and ~80 TB local disk
– All based on Xeon 5520, … GB per core

Slide 11: Expected performance: 24 Xeon 5404 cores (now) vs 200 Xeon 5520 cores (future)
– Some reviews claim that 5500-series (Nehalem) processors are 50%-100% faster than Harpertown Xeons (5400 series); assume 50% (benchmarks are coming):
– Running over AOD files: 0.5M events/h → 6M events/h
– Fast MC simulation and on-the-fly analysis: 1.5M events/h → 18M events/h
– Running over C++/ROOT ntuples: 1000M events/h (1M events/min for 1 core) → 10B? I/O limit?
– Generating MC truth ntuples: 2.5M events/h → 30M events/h
– AOD production (generating & reconstructing MC events): 120 events/h → 1400 events/h
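The scaling behind these projections (a consistency check, not part of the original slide): going from 24 to 200 cores with the assumed 50% per-core gain multiplies the rates by

    (200 / 24) x 1.5 ≈ 12.5

which takes 0.5M events/h to about 6M events/h, matching the numbers quoted above.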