
1 Distributed Computing and Data Analysis for CMS in view of the LHC startup
Peter Kreuzer, RWTH-Aachen IIIa
International Symposium on Grid Computing (ISGC), Taipei, April 9, 2008

2 Outline
Brief overview of the Worldwide LHC Computing Grid (WLCG)
Distributed Computing challenges at CMS:
– Simulation
– Reconstruction
– Analysis
The physicist view
The road to the LHC startup

3 From local to distributed Analysis
Before: centrally organised analysis.
Over the last 20 years the amount of data and the number of physicists per experiment have grown drastically (x 10).
Example CMS: 4-6 PBytes of data per year, 2900 scientists, 40 countries, 184 institutes!
Solution: a "tiered" Computing Model.

4 Worldwide LHC Computing Grid
The level of distribution is motivated by the desire to leverage and empower resources and to share load, infrastructure and funding.
Tier-0 at CERN: Prompt Reconstruction, Calibration and low-latency work, Archiving
Tier-1s at large national labs or universities: Re-Reconstruction, Physics "skimming", Data Serving, Archiving
Tier-2s primarily at universities: Simulation, User Analysis
Tier-3s at institutes with modest infrastructure: Local User Analysis, Opportunistic Simulation
Aggregate rate from CERN to Tier-1s: > 1.0 GByte/s; transfer rate to Tier-2s: 50-500 MBytes/s

5 WLCG Infrastructure
Grid infrastructures: EGEE (Enabling Grids for E-sciencE) and OSG (Open Science Grid)
WLCG: 1 Tier-0 + 11 Tier-1 + 67 Tier-2
Tier-0 to Tier-1: dedicated 10 Gb/s optical network
CMS: 1 Tier-0 + 7 Tier-1 + 35 Tier-2

6 Examples of Sites
T2 RWTH (Aachen): CPU 540 KSI2k (= 360 cores), disk 100 TB, network (WAN) 2 Gbit/s (2009: 450 cores & 150 TB)
T1 ASGC: CPU 2.4 MSI2k (~1800 cores), disk 930 TB → 1.5 PB, tape 586 TB → 800 TB, network 10 Gbit/s
T2 Taiwan: CPU 150 KSI2k, disk 19 TB → 62 TB, network up to 10 Gbit/s

7 Pledged WLCG Resources
[Charts of pledged CPU (in MSI2k) and disk storage (in PetaBytes) for CERN, the Tier-1s and the Tier-2s; reference: LCG Project Planning, 1.3.08]
CPU: 66,000 cores in 2008, growing to 250,000 cores (1 MSI2k = 670 cores)
Disk storage: 40 PetaBytes in 2008 (plus 33 PBytes of tape storage in 2008)
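
The KSI2k/MSI2k figures here and on the previous slide are benchmark units rather than core counts; the conversion quoted on this slide can be applied directly. A minimal Python sketch re-deriving two of the numbers shown (the core equivalence is approximate, since it depends on the actual hardware):

    # 1 MSI2k ~ 670 cores, as quoted on this slide.
    CORES_PER_MSI2K = 670

    def msi2k_to_cores(msi2k):
        return msi2k * CORES_PER_MSI2K

    def cores_to_msi2k(cores):
        return cores / CORES_PER_MSI2K

    print(f"{msi2k_to_cores(0.540):.0f} cores")   # T2 RWTH pledge of 540 KSI2k -> ~362 (slide 6 quotes 360)
    print(f"{cores_to_msi2k(66000):.1f} MSI2k")   # the 66,000 cores of 2008 -> ~98.5 MSI2k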

8 Challenges for Experiments: Example CMS
Scale up and test the distributed computing infrastructure:
– Mass Storage Systems and Computing Elements
– Data Transfer
– Calibration and Reconstruction
– Event "skimming"
– Simulation
– Distributed Data Analysis
Test the CMS Software Analysis Framework.
Operate in quasi-real data-taking conditions, simultaneously at the various Tier levels.
→ Computing & Software Analysis (CSA) Challenge

9 CMS Computing and Software Analysis Challenges
CMS scaling-up over the last 4 years (test, goal in jobs/day, scale):
– DC04: 15,000 jobs/day (5%)
– 2005-2006: new Data Model and new Software Framework
– CSA06: 50,000 jobs/day (25%)
– CSA07: 100,000 jobs/day (50%)
– CSA08: 150,000 jobs/day (100%)
Requires hundreds of millions of simulated events as input.

10 The CSA07 Data Challenge
[Diagram of the CSA07 data flows between the HLT, the Tier-0 (CASTOR), the CAF, the Tier-1s and the Tier-2s; indicated rates: 300 MB/s, ~10 MB/s and 20-200 MB/s]
Tier-0: Reconstruction at 100 Hz
CAF: Calibration & Express Analysis
Tier-1s: Re-Reconstruction and skims, 25k jobs/day
Tier-2s: Simulation at 50M events/month (100M simulated events), Analysis at 75k jobs/day

11 In this presentation
Mainly covering the CMS Simulation, Reconstruction and Analysis challenges.
Data transfer challenges are covered in the talk by Daniele Bonacorsi during this session.

12 CMS Simulation System
[Diagram of the CMS simulation system]
A CMS physicist places a request ("Please simulate new physics") through ProdRequest with the Production Manager.
ProdAgent instances execute the request on Grid resources at Tier-1 and Tier-2 sites.
The resulting data are registered in the global Data Bookkeeping System (DBS), which answers the question "Where are my data?".
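
The "Where are my data?" lookup amounts to asking the bookkeeping system which sites host a given dataset. The toy sketch below is not the real DBS interface; it only illustrates the CMS dataset naming convention (/PrimaryDataset/ProcessedDataset/DataTier) and the kind of query involved, with invented catalogue entries:

    # Toy catalogue standing in for DBS; the dataset paths and site names are
    # invented examples in the CMS naming style, not real CSA07 entries.
    toy_catalogue = {
        "/SUSY_LM1/CSA07-example-v1/RECO": ["T1_DE_FZK", "T2_IT_Legnaro"],
        "/TTbar/CSA07-example-v1/AOD":     ["T1_US_FNAL", "T2_ES_IFCA"],
    }

    def where_are_my_data(pattern, catalogue):
        # Return {dataset_path: hosting_sites} for paths containing `pattern`.
        return {path: sites for path, sites in catalogue.items() if pattern in path}

    print(where_are_my_data("SUSY_LM1", toy_catalogue))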

13 ProdAgent workflows
Data processing / bookkeeping / tracking / monitoring are handled in local scope; the output is promoted to the global-scope DBS and to the data transfer system PhEDEx.
Scaling is achieved by running multiple ProdAgent instances in parallel.
1) Processing: ProdAgent submits processing jobs through the Grid WMS to Tier-1/Tier-2 sites; each job writes a small output file to the site Storage Element (SE) and registers it in the local DBS.
2) Merging: ProdAgent submits merge jobs that combine the small processing outputs into large files on the SE, which are registered in the local DBS, promoted to the global DBS and handed to PhEDEx.
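
A conceptual sketch of this two-step pattern (this is not the real ProdAgent code; the class names, file sizes and the 2 GB merge threshold are invented for illustration):

    from dataclasses import dataclass, field

    @dataclass
    class OutputFile:
        name: str
        size_gb: float

    @dataclass
    class Catalogue:                      # stands in for a local- or global-scope DBS
        files: list = field(default_factory=list)
        def register(self, f):
            self.files.append(f)

    def run_production(n_jobs, merge_target_gb=2.0, processing_output_gb=0.2):
        local_dbs, global_dbs, unmerged = Catalogue(), Catalogue(), []
        for i in range(n_jobs):
            # 1) Processing: each Grid job writes a small file to the site SE,
            #    tracked in the local-scope catalogue only.
            small = OutputFile(f"processing_{i}.root", processing_output_gb)
            local_dbs.register(small)
            unmerged.append(small)
            # 2) Merging: once enough unmerged output has accumulated, a merge job
            #    produces one large file, which is promoted to global scope
            #    (and would be handed to PhEDEx for transfer).
            if sum(f.size_gb for f in unmerged) >= merge_target_gb:
                merged = OutputFile(f"merged_upto_{i}.root", sum(f.size_gb for f in unmerged))
                local_dbs.register(merged)
                global_dbs.register(merged)
                unmerged = []
        return local_dbs, global_dbs

    local_dbs, global_dbs = run_production(100)
    print(len(global_dbs.files), "merged files promoted to global scope")  # 10 with these toy numbers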

14 CMS Simulation Performance (June - November 2007)
[Plot of simulated events per month (M evts/month, roughly 30-70), January to October 2007]
~250M events produced in 5 months (overall 2007-08: 450M)
Tier-2 sites alone: ~72%; OSG alone: ~50%; 20k jobs/day reached (~75%)
Production rate increased by a factor 1.8

15 Utilization of CMS Resources (June - November 2007)
[Plot of used job slots versus time, against ~5000 available job slots]
Average utilization ~50%; in the best production periods 75%.
The unused capacity corresponds to missing requests.

16 CSA07 Simulation lessons
Major boost in the scale and reliability of the production machinery.
Still too many manual operations. From 2008 on:
– deploy the ProdManager component (in CSA07 this role was "human"!)
– deploy the Resource Monitor
– deploy the CleanUpSchedule component
Further improvements in scale and reliability:
– gLite WMS bulk submission: 20k jobs/day with 1 WMS server
– Condor-G JobRouter + bulk submission: 100k jobs/day, able to saturate all OSG resources in ~1 hour
– threaded JobTracking and central job log archival
Introduced a task force for CMS Site Commissioning:
– help detect site issues via a stress-test tool (enforce metrics)
– couple the site state to the production and analysis machinery
– regular CMS Site Availability Monitoring (SAM) checks
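
Why bulk submission raises the ceiling: the overhead of talking to the WMS is paid once per batch instead of once per job. A purely illustrative back-of-envelope in Python (the overhead values are invented assumptions, not measured gLite or Condor-G numbers):

    SECONDS_PER_DAY = 86400

    def jobs_per_day_single(overhead_per_job_s):
        # one WMS interaction per job
        return SECONDS_PER_DAY / overhead_per_job_s

    def jobs_per_day_bulk(overhead_per_batch_s, batch_size):
        # one WMS interaction per batch of jobs
        return SECONDS_PER_DAY / overhead_per_batch_s * batch_size

    print(f"one-by-one, 10 s/job        : {jobs_per_day_single(10):8.0f} jobs/day")
    print(f"bulk, 60 s per 100-job batch: {jobs_per_day_bulk(60, 100):8.0f} jobs/day")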

17 CMS Site Availability Monitoring
An important tool to protect the CMS use cases at sites (ARDA "Dashboard").
[Site availability ranking plot for 03/22/08 - 04/03/08, availability ranging from 0% to 100%]
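
A hypothetical sketch of how such an availability figure can be derived from periodic test results: a site counts as available in a time bin only if all of its critical tests passed. The test names and results below are invented; this is not the real SAM/Dashboard data model.

    def availability(bins):
        # bins: list of {test_name: passed} dicts, one per monitoring interval.
        if not bins:
            return 0.0
        return sum(all(b.values()) for b in bins) / len(bins)

    site_bins = [
        {"job-submission": True, "srm-put": True},
        {"job-submission": True, "srm-put": False},   # one failed storage test
        {"job-submission": True, "srm-put": True},
        {"job-submission": True, "srm-put": True},
    ]
    print(f"availability over the period: {availability(site_bins):.0%}")   # 75%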

18 CSA07 Reconstruction & Skimming
0) Preparation of "Primary Datasets", mimicking real CMS detector + trigger data
1) Archiving and Reconstruction at the CERN T0
2) Archiving and Re-Reconstruction at the T1s
3) Skimming at the T1s
4) Express analysis & Calibration at the CERN Analysis Facility
→ 3 different calibrations: 10 pb^-1, 100 pb^-1, 0 pb^-1

19 Produced CSA07 Data Volumes
Total CSA07 event counts:
– 80M GEN-SIM
– 80M DIGI-RAW
– 80M HLT
– 330M RECO (3 different calibrations)
– 250M AOD
– 100M skims
Total: 920M events
Total data volume: ~2 PB, corresponding to the expected 2008 volume!
CMS data in CASTOR at CERN: 3.7 PB
[Plot of accumulated DIGI-RAW-HLT-RECO events (x 1e+8) from 10/'07 to 02/'08]
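
The totals above follow directly from the per-tier counts; a quick sanity check in Python (the average event size is derived here, not quoted on the slide):

    event_counts_M = {
        "GEN-SIM": 80, "DIGI-RAW": 80, "HLT": 80,
        "RECO (3 calibrations)": 330, "AOD": 250, "skims": 100,
    }
    print(f"total events: {sum(event_counts_M.values())}M")          # 920M, as quoted

    # ~2 PB for 920M events gives the average size of a stored event:
    avg_MB_per_event = 2e15 / (920e6 * 1e6)
    print(f"average stored event size: ~{avg_MB_per_event:.1f} MB")  # ~2.2 MB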

20 CSA07 Reconstruction lessons
T0 Reconstruction at 100 Hz only in bursts, mainly due to the stream-splitting activity; heavy load on CASTOR.
Useful feedback to the ProdAgent developers to prepare for 2008 data taking (repacker, ...).
T1 Processing: the submission rate was the main limitation. Now based on gLite bulk submission and reaching 12-14k jobs/day with 1 ProdAgent instance.
Further rate improvement is expected with the T1 resource up-scaling.
[Plot of T0 and T1 processing, reaching ~2k running jobs]

21 CMS Analysis System
CRAB = CMS Remote Analysis Builder: an interface to the Grid for CMS physicists.
Challenge: match processing resources with large quantities of data = "chaotic" processing.
[Diagram: a CMS physicist asks "Please analyse datasets X/Y"; CRAB locates the data via the global Data Bookkeeping System (DBS) and submits jobs through a CRAB server to Grid resources at Tier-1/Tier-2 sites, answering "Where are my jobs?"]

22 CRAB Architecture
An easy and transparent means for CMS users to submit analysis jobs via the Grid (LCG RB, gLite WMS, Condor-G).
CSA07 analysis: direct submission by the user to the Grid. Simple, but lacking automation and scalability → 2008: CRAB server.
Other new feature: a local DBS for "private" user data.
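
For orientation, a sketch of the kind of configuration a user hands to CRAB: a dataset path, a CMSSW parameter set and a job-splitting rule. The section and key names follow the crab.cfg style of that era as recalled here and should be treated as illustrative rather than as a verified template; the dataset path and file names are invented placeholders.

    import textwrap

    # A minimal, illustrative CRAB configuration (not a verified template).
    crab_cfg = textwrap.dedent("""\
        [CRAB]
        jobtype   = cmssw
        scheduler = glite

        [CMSSW]
        datasetpath            = /SUSY_LM1/CSA07-example-v1/RECO
        pset                   = susy_analysis_cfg.py
        total_number_of_events = -1
        events_per_job         = 10000
        output_file            = susy_histos.root

        [USER]
        return_data = 1
        """)

    with open("crab.cfg", "w") as f:
        f.write(crab_cfg)
    print("wrote crab.cfg; jobs are then created, submitted and tracked with the CRAB client")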

23 CSA07 Analysis
100k jobs/day was not achieved:
- mainly due to lacking data during the challenge
- still limited by the data distribution: 55% of jobs at the 3 largest Tier-1s
- and a failure rate that was too high
[Pie chart of job outcomes: 53% successful jobs, 20% failed jobs, 27% unknown]
Main failure causes: data access, remote stage-out, manual user settings.
20k jobs/day achieved, plus regularly ~30k/day of JobRobot submissions.
[Plot of the number of jobs versus time]

24 CMS Grid Users over the last year
[CRAB server plot showing distinct users per month]
300 users during February 2008; the 20 most active users carry 1/3 of the jobs.

25 The Physicist View
SUSY search in di-lepton + jets + MET. Goal: simulate an excess over the Standard Model ("LM1" at 1 fb^-1).
Infrastructure:
– 1 desktop PC
– CMS software environment ("CMSSW", "CRAB", "Discovery" GUI, ...)
– Grid certificate + membership of a Virtual Organisation (CMS)
Input data (CSA07 simulation/production):
– Signal (RECO): 120k events = 360 GB
– Skimmed background (AOD): 3.3M events = 721 GB (WW / WZ / ZZ / single top / ttbar / Z / W + jets); ~1.1 TB together with the signal
– Unskimmed background: 27M events = 4 TB (for detailed studies only)
Location of input data:
– T0/T1: CERN (CH), FNAL (US), FZK (Germany)
– T2: Legnaro (Italy), UCSD (US), IFCA (Spain)
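
A quick re-derivation of the data-volume bookkeeping from the numbers above (the per-event sizes are computed here, not quoted on the slide):

    signal_events, signal_GB = 120e3, 360.0   # RECO signal sample
    bkg_events,    bkg_GB    = 3.3e6, 721.0   # skimmed AOD background

    print(f"signal + skimmed background: {(signal_GB + bkg_GB) / 1000:.2f} TB")  # ~1.08 TB, the '~1.1 TB' quoted
    print(f"RECO event size: ~{signal_GB * 1000 / signal_events:.1f} MB/event")  # ~3.0 MB/event
    print(f"AOD event size:  ~{bkg_GB * 1000 / bkg_events:.2f} MB/event")        # ~0.22 MB/event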

26 Grid Analysis Result
[Di-lepton invariant mass plot (in GeV) showing the Z peak from SUSY cascades, the end-point and the signal]
Analysis latency:
– Signal + background = 322 jobs → 22 h to produce this result!
– Detailed studies = 1300 jobs → ~3.5 days
(Georgia Karapostoli, Athens Univ.)

27 CSA07 Analysis lessons
Improve analysis scalability, automation and reliability:
– CRAB server
– automate job re-submission
– optimize the job distribution
– decrease the failure rate
Move analysis to the Tier-2s:
– to protect the Tier-0/1 LSF and storage systems
– to make use of all available Grid resources
Encourage Tier-2 to Physics-group associations:
– in close collaboration with the sites
– with a solid overall Data Management strategy
– assess local-scope data management for Physics groups & the storage of user data
Aim for 500 users by June, enough to exceed the capacity of several gLite WMS.

28 Goals for CSA08 (May '08)
"Play through" the first 3 months of data taking.
Simulation:
– 150M events at 1 pb^-1 ("S43")
– 150M events at 10 pb^-1 ("S156")
Tier-0: prompt reconstruction
– S43 with the startup calibration
– S156 with an improved calibration
CERN Analysis Facility (CAF):
– demonstrate low-turnaround Alignment & Calibration workflows
– coordinated and time-critical physics analyses
– proof-of-principle of the CAF Data and Workflow Management Systems
Tier-1: re-reconstruction with new calibration constants
– S43: with improved constants based on 1 pb^-1
– S156: with improved constants based on 10 pb^-1
Tier-2:
– iCSA08 simulation (GEN-SIM-DIGI-RAW-HLT)
– repeat the CAF-based physics analyses with the re-reconstructed data (?)

29 2008 Schedule
[Timeline (January to October 2008) showing detector installation, commissioning and operation in parallel with the preparation of software, computing and physics analysis; the exercises must be kept mostly non-overlapping]
Software/computing milestones: 2007 physics analyses results; CMSSW 1.8.0 sample production; CMSSW 2.0 release [production start-up MC samples]; 2 weeks of 2.0 testing; iCSA08 sample generation; CCRC'08-1; iCSA08 / CCRC'08-2; CMSSW 2.1 release [all basic software components ready for LHC, new T0 production tools]; fCSA08 or beam!
Detector milestones: private global runs (2 days/week) & private mini-DAQ; cooldown of the magnet; beam-pipe baked out; pixels installed; low-i test; CMS closed; initial CMS ready for run.
Commissioning runs: "CROT", CR 0T, GRUMM, pre CR 4T, CR 4T, "CRAFT".
CCRC = Common-VO Computing Readiness Challenge; CR = Commissioning Run

30 Where do we stand?
WLCG: major up-scaling over the last 2 years!
CMS: impressive results and valuable lessons from CSA07:
– major boost in Simulation
– produced ~2 PBytes of data in T0/T1 Reconstruction and Skimming
– Analysis: the number of CMS Grid users is ramping up fast!
– Software: addressed memory-footprint and data-size issues
Further challenges for CMS: scale from 50% to 100%:
– simultaneous and continuous operations at all Tier levels
– analysis distribution and automation
– transfer rates (see the talk by D. Bonacorsi)
– up-scale and commission the CERN Analysis Facility (CAF)
→ CSA08, CCRC08, Commissioning Runs
Challenging and motivating goals in view of Day-1 of the LHC!

