CDF Monte Carlo Production on LCG GRID via LcgCAF

Presentation transcript:

CDF Monte Carlo Production on LCG GRID via LcgCAF
Authors: Gabriele Compostella, Donatella Lucchesi, Simone Pagan Griso, Igor Sfiligoi
3rd IEEE International Conference on e-Science and GRID Computing, Bangalore, India, December 10th-13th, 2007
OUTLINE:
✔ CDF Computing Model
✔ CDF transition to GRID
✔ LcgCAF description
✔ Performance on Monte Carlo production

The CDF II experiment
CDF II (Collider Detector at Fermilab) has been taking data since 2001. Data taking is expected to last until the end of 2009, with a strong desire to continue also in 2010.

The CDF II Computing Model
[Diagram: data flows from the detector (7 MHz beam crossing, 0.75 million channels) through the CDF Level 3 trigger (~100 Hz) to the Production Farm, robotic tape storage, and disk cache, managed by the Data Handling (DH) services and feeding the GRID and the CDF Central Analysis Facility (CAF).]

The CDF II Computing Model cont'd
[Diagram: user jobs go from the User Desktop to the CDF Central Analysis Facility (CAF), to decentralized CAFs (dCAF, the Remote Analysis System), and to the GRID for Monte Carlo data generation, backed by robotic tape storage, disk cache, and the Data Handling services.]

The CAF Model
[Diagram: the User Desktop talks to the Head Node daemons (Submitter, Monitor, Mailer) in front of a batch system; on the Worker Nodes a job wrapper (CafExe) runs the user job (MyJob.sh, MyJob.exe).]
Three classes of daemons:
➢ submitter: accepts user requests and submits them to the batch system
➢ monitor: allows the user to interact with running jobs, plus classical batch monitoring
➢ mailer: notifies the user with a job summary
A minimal sketch of this three-daemon split follows.
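The following is a minimal Python sketch of the three-daemon split, under stated assumptions: the batch commands (batch_submit, batch_status) and the mail addresses are hypothetical stand-ins, not the real CAF interfaces.

    # Minimal sketch of the three CAF head-node daemons; batch_submit,
    # batch_status, and the addresses are hypothetical placeholders.
    import queue
    import smtplib
    import subprocess
    from email.message import EmailMessage

    submit_queue = queue.Queue()  # FIFO of pending user requests

    def submitter():
        """Accept user requests and hand them to the batch system."""
        while True:
            job = submit_queue.get()  # blocks until a request arrives
            subprocess.run(["batch_submit", job["wrapper"], job["tarball"]])

    def monitor(job_id):
        """Interactive view of a running job (tail/top/kill in the real CAF)."""
        out = subprocess.run(["batch_status", job_id],
                             capture_output=True, text=True)
        return out.stdout

    def mailer(user, summary):
        """Notify the user with a job summary once the job completes."""
        msg = EmailMessage()
        msg["From"] = "caf@headnode.example"  # placeholder sender
        msg["To"] = user
        msg["Subject"] = "CAF job summary"
        msg.set_content(summary)
        with smtplib.SMTP("localhost") as s:
            s.send_message(msg)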

CDF has to exploit the GRID
● Expected CPU needs in 2008: ~6500 kSI2k, with a logging data rate of 30 MB/s (CDF Computing Model)
● On site (Fermilab) only ~5000 kSI2k are available
● The missing resources have to be found at outside sites, i.e. on the GRID → CDF adapted to the GRID
● Up to now resources have been exploited in an opportunistic way
● LHC is starting, so CDF needs guaranteed resources → CDF Computing Centers have been proposed and created in countries with big computing centers and CDF representatives: CNAF (Italy), Lyon (France), KISTI (Korea)

The CDF transition to the GRID
● CDF has to cope with two different GRIDs: OSG and LCG
● The GRID strategy is completely different from the dedicated-resources one
● The CAF model was very successful: keep it!
● Need to address:
➢ job submission and execution in the new environment
➢ authentication
➢ code distribution and remote DB access
➢ output retrieval
➢ monitoring

The two ways to access the GRID
● NaMCAF: used to access OSG sites
[Diagram: the User Desktop connects over a secure Kerberos connection to the head-node daemons (Submitter, Monitor, Mailer), which use Condor to build virtual private CDF worker nodes.]
➢ Based on Condor glide-ins, a pilot-job technique
➢ Exploits all the Condor features
➢ In production since late 2005 → large experience with pilot jobs at CDF!

LcgCAF in a nutshell
[Diagram: the User Desktop connects over a secure Kerberos connection to the LcgCAF head node, which submits jobs through the WMS to GRID sites (CEs); the job output is written to a CDF Storage Element (SE).]

Job Submission and Execution
[Diagram: a user submission enters the Submitter and the LcgCAF FIFO queue; the Job Manager submits through the Grid UI to the WMS, a web server exports the user job over HTTP, and the job wrapper runs on the WN.]
● The user job is enqueued in a local queue and the user tarball is stored in a defined location
● Jobs in the submission queue are submitted to the gLite WMS
● The LcgCAF wrapper is sent to the WN using the InputSandbox
A hedged sketch of the WMS submission step follows.
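As a sketch of the submission step, the snippet below builds a JDL description and hands it to the standard gLite CLI; the wrapper name, tarball URL, and VO string are illustrative assumptions, not LcgCAF's actual configuration.

    # Hedged sketch: describe the LcgCAF wrapper in a JDL file and submit it
    # to the gLite WMS with an auto-delegated proxy (-a). Names are examples.
    import subprocess
    import textwrap

    jdl = textwrap.dedent("""\
        Executable          = "lcgcaf_wrapper.sh";
        Arguments           = "http://headnode.example/jobs/user_tarball.tgz";
        StdOutput           = "job.out";
        StdError            = "job.err";
        InputSandbox        = {"lcgcaf_wrapper.sh"};
        OutputSandbox       = {"job.out", "job.err"};
        VirtualOrganisation = "cdf";
    """)

    with open("job.jdl", "w") as f:
        f.write(jdl)

    subprocess.run(["glite-wms-job-submit", "-a", "job.jdl"], check=True)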

Job Submission and Execution cont'd
Workload Management System:
– accepts the submission request
– matches available resources
– submits to the CEs
– automatic retry for Grid-specific failures
– keeps track of the job status
LcgCAF wrapper on the WN:
– gets the needed "support" software and the user job (HTTP)
– runs the user job
– forks monitoring processes
– when the job is completed, retrieves the output
A sketch of the wrapper logic is shown below.
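The sketch below mirrors the wrapper's steps in Python under stated assumptions: the download URL, job script name, and output layout are hypothetical, and the monitor is a stub for the real reporting processes.

    # Hedged sketch of the LcgCAF worker-node wrapper: fetch software and job
    # over HTTP, run the job, fork a monitor, then stage the output.
    import multiprocessing
    import subprocess
    import time
    import urllib.request

    def monitor(pid):
        """Stub for the forked monitoring process (reports to the head node)."""
        while True:
            print(f"monitoring pid {pid}")  # the real wrapper ships this home
            time.sleep(60)

    # Get the "support" software and user job over HTTP (cache-friendly)
    urllib.request.urlretrieve(
        "http://headnode.example/jobs/user_tarball.tgz", "job.tgz")
    subprocess.run(["tar", "xzf", "job.tgz"], check=True)

    job = subprocess.Popen(["./MyJob.sh"])  # run the user job
    mon = multiprocessing.Process(target=monitor, args=(job.pid,), daemon=True)
    mon.start()
    job.wait()

    # When the job is completed, pack the output for retrieval
    subprocess.run(["tar", "czf", "output.tgz", "out/"], check=True)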

Authentication
● On the LcgCAF head node:
- the user is authenticated to LcgCAF with a Kerberos ticket
- the Kerberized Certification Authority turns the Kerberos ticket into a Grid certificate
- VOMS (Bologna, Italy) is contacted to get a valid Grid proxy
● The user job is submitted and executed with the user's credentials
● During execution on the WN, KDispenser keeps the Kerberos ticket valid
[Diagram: Kerberos V, the CDF default authentication method, from the User Desktop to the LcgCAF head node; the FNAL KCA and the CNAF VOMS provide the Grid proxy used towards the Grid sites, and a Grid proxy or Kerberos (via KDispenser) is used towards the SEs.]
A hedged sketch of the credential chain follows.
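The chain can be sketched with the standard client tools; the principal and VO name below are illustrative, and the exact flags used by LcgCAF may differ.

    # Hedged sketch of the Kerberos -> KCA -> VOMS credential chain using
    # standard clients (kinit, kx509, voms-proxy-init); names are examples.
    import subprocess

    # 1. Kerberos ticket: the CDF default authentication method
    subprocess.run(["kinit", "user@FNAL.GOV"], check=True)

    # 2. Kerberized CA: convert the ticket into a short-lived X.509 certificate
    subprocess.run(["kx509"], check=True)

    # 3. VOMS: obtain a Grid proxy carrying the CDF VO attributes
    subprocess.run(["voms-proxy-init", "-voms", "cdf"], check=True)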

Database access
Each Monte Carlo simulation job needs:
– FNAL DB access for retrieving run conditions
– CDF-specific software
To access the DB at FNAL:
● DB queries are translated into HTTP requests using Frontier
● squid proxies are used as caches to improve scalability and performance:
➢ 60% improvement in speed for usual CDF jobs
➢ 90% of requests retrieved from the cache
[Diagram: the Frontier library at the LCG sites (CNAF, Gridka, ...) sends DB queries over HTTP through a local proxy cache to a Tomcat/Frontier server in front of the FNAL Oracle DB server.]
A hedged sketch of such a query follows.
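Because the query travels as a plain HTTP request, repeated queries can be served by the site squid; the host names and query encoding below are illustrative only.

    # Hedged sketch of a Frontier-style DB query: an HTTP GET routed through
    # the site squid cache, so identical requests hit the cache, not FNAL.
    import urllib.request

    proxy = urllib.request.ProxyHandler(
        {"http": "http://squid.site.example:3128"})  # local squid cache
    opener = urllib.request.build_opener(proxy)

    url = ("http://frontier.fnal.example:8000/frontier/Frontier"
           "?encoded_run_conditions_query")  # placeholder encoded query
    with opener.open(url) as resp:
        payload = resp.read()  # served from cache after the first request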

Code distribution: Parrot
● Parrot sets up a virtual file system to access the CDF software:
➢ traps the program's system calls and retrieves the needed files
➢ uses the HTTP protocol for easy caching near bigger sites
➢ CDF software is exported via an Apache server at CNAF
● No CDF-specific requirements on the WNs
● Easy caching with squid improves performance
[Diagram: Parrot at the LCG sites (CNAF, Gridka, ...) resolves /home/cdfsoft over HTTP through a proxy cache from the CNAF server.]
A hedged usage sketch follows.
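As a usage sketch: Parrot (from cctools) exposes remote HTTP trees as paths under /http/, so an unmodified executable can read the CDF software area; the server name, script path, and proxy setting are assumptions.

    # Hedged sketch of running a job under Parrot: system calls on /http/...
    # paths are trapped and resolved over HTTP (cacheable by the site squid).
    import os
    import subprocess

    os.environ["HTTP_PROXY"] = "http://squid.site.example:3128"  # assumed proxy
    subprocess.run([
        "parrot_run",
        "/http/cdfsoft.cnaf.example/home/cdfsoft/setup_and_run.sh",
    ], check=True)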

Monitor
● Information on user jobs is collected from:
➢ the WMS: job status via standard grid commands
➢ the WN: an ad-hoc monitoring process collects information about the job execution and sends it to the LcgCAF head node
● Information is stored and organized in a local file-based database for real-time monitoring and historical analysis
● The user requests information about his/her jobs from the head node ("pull mode")
[Diagram: the LcgCAF head node gathers status from the WMS and the WNs into the CDF Information System, which the user queries with direct requests.]
A hedged sketch of the WMS-side collector follows.
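A minimal collector could poll the standard gLite status command and keep results in a local file-based store; the job identifier and database file name below are placeholders.

    # Hedged sketch of the head-node collector: poll job status from the WMS
    # and record it in a simple file-based database for later queries.
    import shelve
    import subprocess
    import time

    def poll(job_ids, dbfile="lcgcaf_status.db"):
        with shelve.open(dbfile) as db:  # local file-based database
            for jid in job_ids:
                out = subprocess.run(["glite-wms-job-status", jid],
                                     capture_output=True, text=True)
                db[jid] = {"ts": time.time(), "status": out.stdout}

    poll(["https://wms.example:9000/abc123"])  # placeholder grid job id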

Monitor: WEB based
Complete overview:
● all users
● running/pending jobs
● job history since day zero
Single-job info:
● CPU and memory usage
● running processes

Monitor: Interactive
Unique to CDF! Available commands:
CafMon jobs
CafMon kill
CafMon dir
CafMon tail
CafMon ps
CafMon top

Data Movement: Now
[Diagram: the Worker Node copies the user output over GridFTP to a CDF SE, or via rcp to CDF storage; files then go to tape.]
User output is copied to CDF Storage Elements using
➢ Grid-specific tools (GSI authentication using a Grid proxy)
or to CDF storage locations with
➢ rcp-like tools (Kerberos V authentication, the CDF default)
Files are then transferred (after validation) to tape.
A hedged sketch of the Grid-tool copy follows.
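The Grid-tool path can be sketched with globus-url-copy, the standard GSI-authenticated GridFTP client; the source path and SE host are illustrative.

    # Hedged sketch of copying the job output from the WN to a CDF Storage
    # Element with GridFTP (GSI authentication via the user's Grid proxy).
    import subprocess

    src = "file:///scratch/job/output.tgz"                  # assumed WN path
    dst = "gsiftp://cdf-se.cnaf.example/cdf/mc/output.tgz"  # assumed SE path
    subprocess.run(["globus-url-copy", src, dst], check=True)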

Data Movement: In Progress
● The current mechanism leads to inefficient use of the remote WNs
● A framework is needed to ship Monte Carlo data from remote computing sites to Fermilab, and vice versa for data
● The new mechanism has to be interfaced with SAM, the Fermilab Run II data handling framework for CDF, D0, and MINOS
[Diagram: Monte Carlo data upload prototype model; the Italian sites collect outputs locally at CNAF, with FNAL as the storage destination. Transfer tests between the CNAF T1 and the Fermilab T1.]

CDF usage of the LCG GRID: EU
CDF VO usage in the EU has been around 2% in 2007. CDF still has dedicated farms, which will disappear in 2008. The major contributor of resources in 2007 is the CNAF T1; other sites are growing!

CDF usage of the LCG GRID: Italy
CDF VO usage in Italy is of the same order as that of one LHC experiment. We need to keep it at this level also when LHC starts. CNAF provides ~90% of the resources, but the Italian Tier-2 sites are starting to give important contributions.

LcgCAF Usage
● GRID resource usage:
➢ many resources available → CDF can use them → many jobs to manage at the same time
➢ few resources available and job matching not so smart → jobs stay in the queue for too long

LcgCAF job efficiency
Efficiency determined by selecting on the "exit code" of LcgCAF jobs since January.
Failures: 6.5%, due mainly to:
- output retrieval
- GRID site misconfiguration
Overall efficiency: 93.5% = 88.9% (success) + 4.6% (user abort)
(accessing only a few "friendly" sites)

Open Issues
Output retrieval
● Issues:
- temporary unavailability of the destination host
- WN problems, like an unsynchronized clock (a Kerberos requirement)
● Solution:
- data movement! At the prototype level; data will be copied to an SE through GRID tools

Open Issues cont'd
GRID site misconfigurations
● Issues:
- CE misconfiguration: middleware not updated or not updated properly, missing certificates, ...
- WN misconfiguration: SL3/SL4 libraries, broken hardware, ...
● Solution:
- should come from the GRID! For the moment only big and/or "friendly" sites are selected, working with the local administrators.
WMS stability
● Stability problems in the past with v3.0; solved with v3.1
● Resource matching criteria are still not adequate

Summary and Conclusions
✔ CDF adapted its computing model to the GRID using portals
✔ LcgCAF has been successfully accessing European resources using the LCG/gLite middleware for almost a year:
- completely transparent to the user
- good use of caching (CDF software, user job, DB requests) → no special requests to sites, minimized data transfer during the job lifetime, improved performance
✔ Easy access to any site → a lot of CPU power available soon
➢ Expected improvements:
LcgCAF: data transfer from the WNs to FNAL and a unified monitor
GRID: stability and better resource matching