CDF Monte Carlo Production on the LCG Grid via LcgCAF
Authors: Gabriele Compostella, Donatella Lucchesi, Simone Pagan Griso, Igor Sfiligoi
3rd IEEE International Conference on e-Science and Grid Computing, Bangalore, India, December 10th-13th, 2007
OUTLINE:
✔ CDF Computing Model
✔ CDF transition to the Grid
✔ LcgCAF description
✔ Performance on Monte Carlo production
The CDF II Experiment (e-Science 07, December 11, 2007)
CDF II (Collider Detector at Fermilab) has been taking data since 2001. Data taking is expected to last until the end of 2009, with a strong desire to continue into 2010.
The CDF II Computing Model
[Diagram] Data flow: 7 MHz beam crossing, 0.75 million channels → Level 3 trigger (~100 Hz) → Production Farm with disk cache → robotic tape storage, served by Data Handling (DH) services → the Grid and the CDF Central Analysis Facility (CAF).
The CDF II Computing Model, cont'd
[Diagram] The CDF Central Analysis Facility (CAF) and decentralized CAFs (dCAFs) at remote analysis sites share robotic tape storage and disk caches through Data Handling services; user jobs from desktops run on the CAF, the dCAFs and the Grid for data analysis and Monte Carlo generation.
The CAF Model
[Diagram] User desktop → head node (submitter, monitor, mailer) → batch system → worker nodes, where the CafExe job wrapper runs the user job (MyJob.sh / MyJob.exe).
Three classes of daemons:
➢ submitter: accepts user requests and submits them to the batch system
➢ monitor: lets the user interact with running jobs, as in a classical batch system
➢ mailer: notifies the user with a job summary on completion
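The three-daemon split above can be sketched as follows. This is a toy illustration with hypothetical class names, not the real CAF code (which is built on a production batch system):

```python
# Minimal sketch of the CAF head-node daemon model. All names here are
# illustrative; the real submitter/monitor/mailer are long-running daemons.
from queue import Queue

class Submitter:
    """Accepts user requests and hands them to a (stand-in) batch system."""
    def __init__(self, batch):
        self.pending = Queue()   # requests accepted but not yet submitted
        self.batch = batch
    def accept(self, job):
        self.pending.put(job)
    def drain(self):
        while not self.pending.empty():
            self.batch.append(self.pending.get())

class Mailer:
    """Builds the end-of-job summary that is mailed to the user."""
    def summary(self, job, status):
        return f"Job {job} finished with status: {status}"

batch_system = []                # stand-in for the real batch queue
sub = Submitter(batch_system)
sub.accept("MyJob.sh")
sub.drain()
print(batch_system)              # ['MyJob.sh']
print(Mailer().summary("MyJob.sh", "ok"))
```

The monitor daemon is omitted here; its role is interactive and is covered on the monitoring slides.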
CDF Has to Exploit the Grid
● Expected CPU needs in 2008: ~6500 kSI2k, at a logging data rate of 30 MB/s (CDF Computing Model)
● Available on site (Fermilab): ~5000 kSI2k
● The missing resources have to be found at outside sites: CDF adapted to the Grid
● Up to now, resources have been exploited opportunistically
● With the LHC starting, CDF needs guaranteed resources
CDF Computing Centers have been proposed and created in countries that have large computing centers and CDF representatives: CNAF (Italy), Lyon (France), KISTI (Korea)
The CDF Transition to the Grid
● CDF has to cope with two different Grids: OSG and LCG
● The Grid strategy is completely different from the dedicated-resources one
● The CAF model was very successful: keep it!
● Need to address:
➢ Job submission and execution in the new environment
➢ Authentication
➢ Code distribution and remote DB access
➢ Output retrieval
➢ Monitoring
The Two Ways to Access the Grid
● NaMCAF: used to access OSG sites
[Diagram] The user desktop connects via a secure Kerberos connection to the head node (submitter, monitor, mailer); Condor glide-ins turn Grid resources into virtual private CDF worker nodes.
➢ Based on Condor glide-ins, a pilot-job technique
➢ Exploits all the Condor features
➢ In production since late 2005: large experience with pilot jobs at CDF!
LcgCAF in a Nutshell
[Diagram] The user desktop connects via a secure Kerberos connection to the LcgCAF head node, which submits through the WMS to Grid sites (CEs); the job output is stored on a CDF Storage Element (SE).
Job Submission and Execution
[Diagram] User submission → submitter → LcgCAF queue (FIFO) → job manager → Grid UI → gLite WMS → worker node (WN), where the job wrapper fetches the user job from a web server over HTTP.
● The user job is enqueued in a local queue and the user tarball is stored in a defined location
● Jobs in the submission queue are submitted to the gLite WMS
● The LcgCAF wrapper is sent to the WN using the InputSandbox
Job Submission and Execution, cont'd
Workload Management System:
– accepts submission requests
– matches available resources
– submits to CEs
– automatically retries after Grid-specific failures
– keeps track of the job status
LcgCAF wrapper on the WN:
– gets the "support" software needed and the user job (HTTP)
– runs the user job
– forks monitoring processes
– when the job is completed, retrieves the output
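The wrapper's steps on the worker node can be sketched as a small pipeline. This is a hypothetical outline of the control flow only; the fetch/run/report callables stand in for the real HTTP download, job execution, and output retrieval:

```python
# Sketch of an LcgCAF-style worker-node wrapper: fetch the user job,
# run it, report the result. The callables are toy stand-ins so the
# sketch runs without a Grid; none of this is the real implementation.
def run_wrapped_job(fetch, run, report):
    """fetch() -> path of the user tarball; run(path) -> exit code;
    report(status) hands the result back to the head node."""
    path = fetch()                 # 1. get support software + user job (HTTP)
    status = run(path)             # 2. run the user job (monitored)
    report("done" if status == 0 else f"failed:{status}")  # 3. return status
    return status

log = []
rc = run_wrapped_job(
    fetch=lambda: "/tmp/user_job.tgz",   # pretend download location
    run=lambda path: 0,                  # pretend the job succeeded
    report=log.append,
)
print(rc, log)   # 0 ['done']
```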
Authentication
● On the LcgCAF head node:
– the user authenticates to LcgCAF with a Kerberos ticket
– a Kerberized Certification Authority (FNAL KCA) turns the Kerberos credential into a Grid certificate
– the VOMS server at CNAF (Bologna, Italy) issues a valid Grid proxy
● The user job is submitted and executed with the user's credentials
● During execution on the WN, KDispenser keeps the Kerberos ticket valid
[Diagram] User desktop → LcgCAF head node (Kerberos V, the CDF default authentication method) → Grid sites and SEs (Grid proxy, or Kerberos via KDispenser), with the FNAL KCA and the CNAF VOMS server providing the certificate and proxy.
Database Access
Each Monte Carlo simulation job needs:
– FNAL DB access to retrieve the run conditions
– CDF-specific software
To access the DB at FNAL:
● Translate DB queries into HTTP requests using Frontier
● Use squid proxies as caches to improve scalability and performance:
➢ 60% speed improvement for typical CDF jobs
➢ 90% of requests retrieved from the cache
[Diagram] Frontier libraries at LCG sites (e.g. CNAF, GridKa) send DB queries as HTTP requests through a local squid proxy cache to a Tomcat/Frontier server in front of the FNAL Oracle DB.
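The caching idea behind Frontier/squid can be shown with a toy proxy: once a query has been translated to an HTTP-style request, repeats are answered locally and never reach the Oracle server. The class and backend below are illustrative stand-ins, not Frontier code:

```python
# Toy sketch of the Frontier/squid pattern: a local cache in front of a
# remote DB backend. Repeated run-condition lookups become cache hits.
class CachingProxy:
    def __init__(self, backend):
        self.backend = backend       # stand-in for Tomcat/Frontier + Oracle
        self.cache = {}
        self.hits = self.misses = 0
    def query(self, q):
        if q in self.cache:
            self.hits += 1           # answered locally, no trip to FNAL
            return self.cache[q]
        self.misses += 1
        self.cache[q] = self.backend(q)
        return self.cache[q]

proxy = CachingProxy(backend=lambda run: f"run conditions for run {run}")
for run in [1001, 1001, 1002, 1001, 1002]:   # many jobs ask for the same runs
    proxy.query(run)
print(proxy.hits, proxy.misses)   # 3 2
```

With realistic workloads the hit fraction is much higher, which is where the quoted 90%-from-cache figure comes from.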
Code Distribution: Parrot
● Parrot sets up a virtual file system to access the CDF software:
➢ it traps the program's system calls and retrieves the needed files
➢ it uses the HTTP protocol for easy caching near larger sites
➢ the CDF software is exported via an Apache server at CNAF
● No CDF-specific requirements on the WNs
● Easy caching with squid improves performance
[Diagram] Parrot on worker nodes at LCG sites (e.g. CNAF, GridKa) resolves /home/cdfsoft through a local squid proxy cache to the CNAF HTTP server.
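The path-to-URL mapping at the heart of this can be sketched as below. Parrot actually does this transparently by trapping system calls; here we only show the resolution step, and the server URL is a made-up placeholder:

```python
# Hypothetical sketch of Parrot's virtual mount idea: paths under
# /home/cdfsoft are fetched from a remote HTTP server instead of local
# disk. The URL is illustrative; the real trapping uses system calls.
MOUNT = {"/home/cdfsoft": "http://cdfsoft.example.org"}  # placeholder server

def resolve(path):
    """Map a virtual path to the URL it would be fetched from."""
    for prefix, base in MOUNT.items():
        if path.startswith(prefix + "/"):
            return base + path[len(prefix):]
    return path    # not under a virtual mount: plain local access

print(resolve("/home/cdfsoft/setup.sh"))   # remote, goes through the cache
print(resolve("/etc/hosts"))               # local, untouched
```

Because the fetches are plain HTTP, the same squid proxies used for Frontier can cache the software files near each site.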
Monitoring
● Collect information on user jobs from:
➢ the WMS: job status, using standard Grid commands
➢ the WN: an ad-hoc monitoring process collects information about the job execution and sends it to the LcgCAF head node
● The information is stored and organized in a local file-based database for real-time monitoring and historical analysis
● The user requests information about his/her jobs from the head node
[Diagram] The LcgCAF head node gathers status from the WMS and the WNs into the CDF Information System; the user queries it with a direct request ("pull mode").
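The file-based store and pull-mode query can be sketched like this. The JSON file layout and function names are assumptions for illustration, not the real LcgCAF format:

```python
# Sketch of the pull-mode monitor: per-job reports land in a small
# file-based store on the head node; user queries read from it.
import json, os, tempfile

store = os.path.join(tempfile.mkdtemp(), "jobs.json")

def _load():
    if os.path.exists(store):
        with open(store) as f:
            return json.load(f)
    return {}

def report(job_id, status):          # called when a WN monitor phones home
    db = _load()
    db[job_id] = status
    with open(store, "w") as f:
        json.dump(db, f)

def query(job_id):                   # called when the user asks the head node
    return _load().get(job_id, "unknown")

report("job-42", "running")
report("job-42", "done")             # later update overwrites the status
print(query("job-42"), query("job-99"))   # done unknown
```

Keeping the store on disk is what makes both real-time views and historical analysis possible from the same data.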
Monitor: Web Based
Complete overview:
● all users
● running/pending jobs
Job history since day zero
Single-job info:
● CPU and memory usage
● running processes
Monitor: Interactive
Unique to CDF! Available commands:
CafMon jobs, CafMon kill, CafMon dir, CafMon tail, CafMon ps, CafMon top
Data Movement: Now
[Diagram] Worker node → CDF SE (GridFTP) or CDF storage (rcp) → tape.
The user output is copied to CDF Storage Elements using:
➢ Grid-specific tools (GSI authentication with a Grid proxy)
or to CDF storage locations with:
➢ rcp-like tools (Kerberos V authentication, the CDF default)
Files are then transferred to tape after validation.
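The choice between the two copy paths can be sketched as a simple dispatch on the destination type. The command lines below are illustrative stand-ins (a real GridFTP transfer would use a client such as globus-url-copy with a valid proxy):

```python
# Toy sketch of the two output-copy paths described above. The command
# strings are illustrative only; nothing here actually moves data.
def copy_command(dest_is_se, src, dest):
    """Pick the transfer tool from the destination type:
    Grid SE -> GridFTP (GSI proxy); CDF storage -> Kerberized rcp."""
    if dest_is_se:
        return ["globus-url-copy", f"file://{src}", f"gsiftp://{dest}"]
    return ["rcp", src, dest]

print(copy_command(True, "/tmp/out.root", "se.example.org/cdf/out.root"))
print(copy_command(False, "/tmp/out.root", "host.example.gov:/cdf/out.root"))
```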
Data Movement: In Progress
● The current mechanism leads to inefficient use of the remote WNs
● A framework is needed to ship Monte Carlo output from remote computing sites to Fermilab, and data in the opposite direction
● The new mechanism has to be interfaced with SAM, the Fermilab Run II data handling framework for CDF, D0 and MINOS
[Diagram] Monte Carlo data upload prototype model: the Italian sites collect their outputs locally at CNAF, with Fermilab as the storage destination; transfer tests run between the CNAF T1 and the Fermilab T1.
CDF Usage of the LCG Grid: EU
● CDF VO usage in the EU has been around 2% in 2007
● CDF still has dedicated farms that will disappear in 2008
● The major contributor of resources in 2007 is the CNAF T1
● Other sites are growing!
CDF Usage of the LCG Grid: Italy
● CDF VO usage in Italy is of the same order as that of one LHC experiment. It needs to stay at this level when the LHC starts.
● CNAF provides ~90% of the resources, but the Italian Tier-2 sites are starting to give important contributions
LcgCAF Usage
● Grid resource usage:
➢ many resources available: CDF can use them, but there are many jobs to manage at the same time
➢ few resources available: job matching is not so smart, and jobs sit in the queue for too long
LcgCAF Job Efficiency
Efficiency selected by "exit code" for LcgCAF jobs since January 2007, accessing only a few "friendly" sites.
Overall efficiency: 93.5% = 88.9% (success) + 4.6% (user abort)
Failures: 6.5%, due mainly to:
– output retrieval
– Grid site misconfiguration
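As a quick sanity check, the quoted fractions are self-consistent (success plus user aborts gives the overall 93.5%, and adding the failures accounts for all jobs):

```python
# Check that the quoted efficiency numbers add up.
success, user_abort, failures = 88.9, 4.6, 6.5

overall = round(success + user_abort, 1)   # round to the quoted precision
print(overall)                             # 93.5
print(overall + failures == 100.0)         # True: all jobs accounted for
```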
Open Issues
Output retrieval
● Issues:
– temporary unavailability of the destination host
– WN problems, like an unsynchronized clock (a Kerberos requirement)
● Solution: data movement! Currently at the prototype level; data will be copied to an SE through Grid tools
Open Issues, cont'd
Grid site misconfigurations
● Issues:
– CE misconfiguration: middleware not updated, or not updated properly; missing certificates; ...
– WN misconfiguration: SL3/SL4 libraries, broken hardware, ...
● The solution should come from the Grid! For the moment, only large and/or "friendly" sites are selected, working with the local administrators.
WMS stability
● Stability problems in the past with v3.0, solved with v3.1
● The resource-matching criteria are still not adequate
Summary and Conclusions
✔ CDF adapted its computing model to the Grid using portals
✔ LcgCAF has successfully accessed European resources using the LCG/gLite middleware for almost a year:
– completely transparent to the user
– good use of caching (CDF software, user job, DB requests): no special requests to sites, minimized data transfer during the job lifetime, improved performance
✔ Easy access to any site: a lot of CPU power will be available soon
➢ Expected improvements:
– LcgCAF: data transfer from the WN to FNAL, and a unified monitor
– Grid: stability and better resource matching