
1 CHEP '04: LHC Computing Models and Data Challenges
David Stickland, Princeton University
For the LHC experiments: ALICE, ATLAS, CMS and LHCb (though with entirely my own bias and errors)

2 Outline
Elements of a Computing Model
Recap of LHC computing scales
Recent/current LHC Data Challenges
Critical issues

3 The Real Challenge...
Quoting from David Williams' opening talk:
– Computing hardware isn't the biggest challenge
– Enabling the collaboration to bring its combined intellect to bear on the problems is the real challenge
Empowering the intellectual capabilities of very large collaborations to contribute to the analysis of the new energy frontier:
– Not (just) from a sociological/political perspective
– You never know where the critical ideas or work will come from
– The simplistic vision of central control is elitist and illusory

4 Goals of an (Offline) Computing Model
Worldwide collaborator access and the ability to contribute to the analysis of the experiment:
– Safe data storage
– Feedback to the running experiment
– Reconstruction according to a priority scheme, with graceful fallback solutions: introduce no deadtime
– Optimum (or at least acceptable) usage of computing resources
– Efficient data management

5 Where Are LHC Computing Models Today?
In a state of flux!
– The next 12 months are the last chance to influence initial purchasing planning
Basic principles
– Were contained in MONARC and went into the first assessment of LHC computing requirements
– a.k.a. the "Hoffmann Report"; instrumental in the establishment of the LCG
But now we have two critical milestones to meet:
– Computing Model papers (Dec 2004)
– Experiment and LCG TDRs (July 2005)
And we more or less know what will, or won't, be ready in time
– Restrict enthusiasm to attainable goals
A process is now under way to review the models in all four experiments
– Very interested in the experience of running experiments (this conference!)
– Need the maximum expertise and experience included in these final computing models

6 LHC Computing Scale (circa 2008)
CERN T0/T1:
– Disk space: 5 PB
– Mass storage space: 20 PB
– Processing power: 20 MSI2K
– WAN: ~5 x 10 Gb/s (?)
Tier-1s (sum of ~10):
– Disk space: 20 PB
– Mass storage space: 20 PB
– Processing power: 45 MSI2K
– WAN: ~1 x 10 Gb/s per Tier-1 (?)
Tier-2s (sum of ~40):
– Disk space: 12 PB
– Mass storage space: 5 PB
– Processing power: 40 MSI2K
– WAN: ~0.2 x 10 Gb/s per Tier-2 (?)
Cost sharing: ~30% at CERN, 40% at Tier-1s, 30% at Tier-2s
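As a rough cross-check of these figures, the minimal sketch below sums the quoted per-tier capacities and prints each tier's fraction of the total. Note that the 30/40/30 cost split also folds in the relative unit costs of CPU, disk, tape and network, which are not given on the slide, so raw-capacity fractions are not expected to reproduce it exactly.

```python
# Per-tier capacities as quoted above (2008-era estimates).
tiers = {
    "CERN T0/T1": {"disk_PB": 5,  "mss_PB": 20, "cpu_MSI2K": 20},
    "Tier-1s":    {"disk_PB": 20, "mss_PB": 20, "cpu_MSI2K": 45},
    "Tier-2s":    {"disk_PB": 12, "mss_PB": 5,  "cpu_MSI2K": 40},
}

for quantity in ("disk_PB", "mss_PB", "cpu_MSI2K"):
    total = sum(t[quantity] for t in tiers.values())
    shares = ", ".join(f"{name} {100 * t[quantity] / total:.0f}%"
                       for name, t in tiers.items())
    print(f"{quantity}: total {total} ({shares})")
```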

7 Elements of a Computing Model (I)
Data Model
– Event data sizes, formats, streaming
– Data "tiers" (DST/ESD/AOD etc.): roles, accessibility, distribution, ...
– Calibration/conditions data: flow, latencies, update frequency
– Simulation: sizes, distribution
– File size
Analysis Model
– Canonical group needs in terms of data, streams, re-processing, calibrations
– Data movement, job movement, priority management
– Interactive analysis
Computing Strategy and Deployment
– Roles of the computing tiers
– Data distribution between tiers
– Data management architecture
– Databases: masters, updates, hierarchy
– Active/passive experiment policy
Computing Specifications
– Profiles (tier N & time): processors, storage, network (wide/local), database services, specialized servers, middleware requirements

8 Common Themes
Move a copy of the raw data away from CERN in "real time"
– Second secure copy: 1 copy at CERN, 1 copy spread over N sites
– Flexibility: serve raw data even if the Tier-0 is saturated with DAQ
– Ability to run even primary reconstruction offsite
Streaming online and offline
– (Maybe not a common theme yet)
Tier-1 centres in-line to the online system and the Tier-0
Simulation at Tier-2 centres
– Except LHCb: if the simulation load remains high, use Tier-1
ESD distributed as n copies over N Tier-1 sites
– Tier-2 centres run complex selections at Tier-1 and download skims
AOD distributed to all (?) Tier-2 centres
– Maybe not a common theme: how useful is the AOD, and how early in LHC running? Some Run II experience indicates long-term usage of "raw" data
Horizontal streaming: RAW, ESD, AOD, TAG (illustrated in the sketch below)
Vertical streaming: trigger streams, physics streams, analysis skims
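To make the horizontal/vertical distinction concrete, here is a minimal sketch: the tier enumeration models the "horizontal" chain of successively smaller event formats, while the stream labels and the skim selection model the "vertical" partitioning. The per-event sizes and stream names are purely illustrative placeholders, not experiment numbers.

```python
from dataclasses import dataclass
from enum import Enum

class Tier(Enum):          # "horizontal" streaming: successively smaller formats
    RAW = "RAW"
    ESD = "ESD"
    AOD = "AOD"
    TAG = "TAG"

# Illustrative per-event sizes in MB (placeholders, not official numbers).
NOMINAL_SIZE_MB = {Tier.RAW: 1.5, Tier.ESD: 0.5, Tier.AOD: 0.05, Tier.TAG: 0.001}

@dataclass
class EventSummary:
    run: int
    event: int
    trigger_stream: str    # "vertical" streaming: trigger / physics / analysis selection
    physics_stream: str

def select_stream(events, physics_stream):
    """Build an analysis skim: keep only events in the requested physics stream."""
    return [e for e in events if e.physics_stream == physics_stream]

if __name__ == "__main__":
    evts = [EventSummary(1, i, "muon", "W->munu" if i % 3 else "minbias")
            for i in range(10)]
    skim = select_stream(evts, "W->munu")
    print(len(skim), "events in skim;",
          f"AOD volume ~{len(skim) * NOMINAL_SIZE_MB[Tier.AOD]:.2f} MB")
```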

9 Purpose and Structure of ALICE PDC04
Test and validate the ALICE offline computing model:
– Produce and analyse ~10% of the data sample collected in a standard data-taking year
– Use the entire ALICE offline framework: AliEn, AliRoot, LCG, PROOF, ...
– Experiment with Grid-enabled distributed computing
– Triple purpose: test of the middleware, test of the software, and physics analysis of the produced data for the ALICE PPR
Three phases:
– Phase I: distributed production of underlying Pb+Pb events with different centralities (impact parameters) and of p+p events
– Phase II: distributed production mixing different signal events into the underlying Pb+Pb events (reused several times)
– Phase III: distributed analysis
Principles:
– True Grid data production and analysis: all jobs are run on the Grid, using only AliEn for access and control of native computing resources and, through an interface, the LCG resources
– In Phase III: gLite + ARDA

10 Job Structure and Production (Phase I)
[Diagram] Central servers handle master job submission, the job optimizer (splitting into sub-jobs), the RB, the file catalogue, process monitoring and control, and the SE. Sub-jobs are processed on AliEn CEs directly, and on LCG CEs through the AliEn-LCG interface and the LCG RB; to AliEn, LCG appears as a single CE. Output files go to storage at CERN CASTOR (disk servers and tape) via the AIOD file transfer system.
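The splitting step in the diagram can be sketched as follows (hypothetical names and interface, not the actual AliEn job optimizer): a master production job is broken into sub-jobs of a fixed event count and assigned to the available computing elements, with LCG treated as just one more CE.

```python
from dataclasses import dataclass

@dataclass
class SubJob:
    master_id: str
    index: int
    first_event: int
    n_events: int
    target_ce: str          # an AliEn CE, or "LCG" treated as a single CE

def split_master_job(master_id, total_events, events_per_job, ces):
    """Hypothetical job optimizer: split a master production job into sub-jobs
    and assign them round-robin to the available computing elements."""
    subjobs, first, i = [], 0, 0
    while first < total_events:
        n = min(events_per_job, total_events - first)
        subjobs.append(SubJob(master_id, i, first, n, ces[i % len(ces)]))
        first += n
        i += 1
    return subjobs

if __name__ == "__main__":
    ces = ["Torino", "CNAF", "OSC", "LCG"]   # LCG appears as one AliEn CE
    for sj in split_master_job("PbPb-central-001", total_events=1000,
                               events_per_job=300, ces=ces):
        print(sj)
```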

11 Phase I CPU Contributions
[Chart] CEs: 15 controlled directly through AliEn, plus CERN-LCG and Torino-LCG (Grid.it).

12 Issues
Too many files in the MSS stager (also seen by CMS)
– Solved by splitting the data over two stagers
Persistent problems with local configurations reducing the availability of Grid sites
– Frequent "black hole" sites
– Problems often come back (e.g. NFS mounts!)
– Local disk space on the WNs
Quality of the information in the Information Index (II)
– The Workload Management System does not ensure an even distribution of jobs across the different centres
Lack of support for bulk operations makes the WMS response time critical
The keyhole approach and the lack of appropriate monitoring and reporting tools make debugging difficult

13 Phase II (started 1 July): Statistics
In addition to Phase I:
– Distributed production of signal events and merging with Phase I events
– Stress of the network and file transfer tools
– Storage at remote SEs and stability (crucial for Phase III)
Conditions, jobs, ...:
– 110 conditions in total
– 1 million jobs
– 10 TB of produced data
– 200 TB transferred from CERN
– 500 MSI2k hours of CPU
To end by 30 September

14 Structure of Event Production in Phase II
[Diagram] As in Phase I, central servers handle master job submission, the job optimizer (N sub-jobs), the RB, the file catalogue, process monitoring and control, and the SE; sub-jobs run on AliEn CEs and, through the AliEn-LCG interface and RB, on LCG CEs. Underlying-event input files are read from CERN CASTOR. The primary copy of the output goes to local SEs, with a backup copy in CERN CASTOR; a zip archive of the output files is registered in the AliEn file catalogue, with the LCG SE entry recorded as LCG LFN = AliEn PFN via edg(lcg) copy&register.
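A minimal sketch of the double bookkeeping described above, with hypothetical catalogue interfaces rather than the real AliEn or edg/lcg client APIs: the output archive is copied to an LCG SE, registered in the LCG catalogue, and then entered in the AliEn file catalogue with the LCG LFN stored as the AliEn PFN.

```python
# Hypothetical catalogue clients; the real tools were AliEn commands and the
# edg/lcg replica-manager "copy and register" utilities.
class LCGCatalogue:
    def __init__(self):
        self.replicas = {}          # LFN -> storage URL on an LCG SE
    def copy_and_register(self, local_path, se, lfn):
        surl = f"srm://{se}/{lfn}"
        self.replicas[lfn] = surl   # pretend the transfer succeeded
        return surl

class AliEnCatalogue:
    def __init__(self):
        self.entries = {}           # AliEn LFN -> PFN
    def register(self, alien_lfn, pfn):
        self.entries[alien_lfn] = pfn

def register_output(archive_path, run, lcg, alien, se="lcg-se.example.org"):
    """Register a zipped output archive: primary copy on an LCG SE, with the
    AliEn PFN set to the LCG LFN so AliEn can resolve it later."""
    lcg_lfn = f"/grid/alice/pdc04/run{run}/output.zip"
    lcg.copy_and_register(archive_path, se, lcg_lfn)
    alien.register(f"/alice/sim/pdc04/run{run}/output.zip", pfn=f"LCG:{lcg_lfn}")

if __name__ == "__main__":
    lcg, alien = LCGCatalogue(), AliEnCatalogue()
    register_output("/tmp/output.zip", run=1234, lcg=lcg, alien=alien)
    print(alien.entries)
```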

15 Structure of Analysis in Phase III
[Diagram] A user query is matched against the file catalogue metadata to obtain the list of LFNs (lfn1, lfn2, ...). The job splitter groups the LFNs into sub-jobs, which the central servers dispatch via the RB to AliEn CEs and, through the AliEn-LCG interface, to LCG CEs. Each sub-job resolves its input files: for LCG-resident data, PFN = (LCG SE:) LCG LFN; otherwise PFN = AliEn PFN. Input files are read from the local SEs holding the primary copies.
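Continuing the sketch from the previous slide, the analysis side queries the metadata catalogue for matching LFNs and then resolves each LFN to a PFN, distinguishing LCG-resident entries from native AliEn ones. Again, these are hypothetical interfaces, not the real AliEn query API.

```python
def query_lfns(metadata_catalogue, selection):
    """Return the LFNs whose metadata match the user's selection (hypothetical)."""
    return [lfn for lfn, meta in metadata_catalogue.items()
            if all(meta.get(k) == v for k, v in selection.items())]

def resolve_pfn(lfn, alien_entries):
    """Resolve an LFN to a physical file name: LCG-resident entries carry the
    LCG LFN as the AliEn PFN, otherwise the AliEn PFN is used directly."""
    pfn = alien_entries[lfn]
    return pfn[len("LCG:"):] if pfn.startswith("LCG:") else pfn

if __name__ == "__main__":
    metadata = {"/alice/sim/pdc04/run1234/output.zip":
                {"centrality": "0-5%", "signal": "jets"}}
    entries = {"/alice/sim/pdc04/run1234/output.zip":
               "LCG:/grid/alice/pdc04/run1234/output.zip"}
    for lfn in query_lfns(metadata, {"signal": "jets"}):
        print(lfn, "->", resolve_pfn(lfn, entries))
```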

16 ALICE DC04 Conclusions
The ALICE DC04 started out with (almost unrealistically) ambitious objectives.
ALICE is coming very close to reaching these objectives, and LCG has played an important role.
ALICE is ready and willing to move to gLite as soon as possible and to contribute to its evolution with feedback.

17 ATLAS DC2 Operation
Consider DC2 as a three-part operation:
– Part I: production of simulated data (July-September 2004), running on "the Grid" worldwide
– Part II: test of Tier-0 operation (November 2004): do in 10 days what "should" be done in 1 day when real data taking starts; the input is raw-data-like, and the output (ESD+AOD) will be distributed to Tier-1s in real time for analysis
– Part III: test of distributed analysis on the Grid: access to event and non-event data from anywhere in the world, in both organized and chaotic ways
Requests:
– ~30 physics channels (~10 million events)
– Several million events for calibration (single particles and physics samples)

18 ATLAS Production System
[Diagram] A common supervisor (Windmill) takes job definitions from the production database (prodDB) and hands them, via jabber/soap messaging, to per-flavour executors: Lexor for LCG, Dulcinea for NorduGrid (NG), Capone for Grid3, plus an LSF executor for legacy batch. Data management is handled by Don Quijote (dms) with the RLS catalogue; AMI provides the metadata interface.
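A minimal sketch of the supervisor/executor split shown in the diagram, using generic interfaces rather than the actual Windmill or executor code: the supervisor pulls job definitions from the production database and dispatches each to the executor for its Grid flavour.

```python
from abc import ABC, abstractmethod

class Executor(ABC):
    """One executor per Grid flavour (in DC2: Lexor/LCG, Dulcinea/NG, Capone/Grid3, LSF)."""
    @abstractmethod
    def submit(self, job): ...
    @abstractmethod
    def status(self, job_id): ...

class FakeLCGExecutor(Executor):
    """Stand-in executor; a real one would talk to the LCG resource broker."""
    def __init__(self):
        self._jobs = {}
    def submit(self, job):
        job_id = f"lcg-{len(self._jobs)}"
        self._jobs[job_id] = "RUNNING"
        return job_id
    def status(self, job_id):
        return self._jobs.get(job_id, "UNKNOWN")

class Supervisor:
    """Pulls job definitions from the production DB and dispatches them to executors."""
    def __init__(self, prod_db, executors):
        self.prod_db = prod_db            # list of pending job definitions (stand-in for prodDB)
        self.executors = executors        # flavour name -> Executor
        self.assigned = []
    def run_once(self):
        while self.prod_db:
            job = self.prod_db.pop(0)
            flavour = job.get("flavour", "LCG")
            job_id = self.executors[flavour].submit(job)
            self.assigned.append((job_id, flavour))

if __name__ == "__main__":
    sup = Supervisor([{"transform": "g4sim", "flavour": "LCG"}],
                     {"LCG": FakeLCGExecutor()})
    sup.run_once()
    print(sup.assigned)
```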

19 CPU Usage & Jobs
[Charts: CPU usage and job counts per site/Grid flavour]

20 ATLAS DC2 Status
Major efforts in the past few months:
– Redesign of the ATLAS Event Data Model and Detector Description
– Integration of the LCG components (Geant4, POOL, ...)
– Introduction of the Production System, interfaced with 3 Grid flavours (and "legacy" systems)
Delays in all activities have affected the schedule of DC2
– Note that the Combined Test Beam is ATLAS's first priority
– The DC2 schedule was revisited to wait for the readiness of the software and of the Production System
DC2:
– About 80% of the Geant4 simulation foreseen for Phase I has been completed, using only the Grid and using the 3 flavours coherently
– The 3 Grids have been proven usable for a real production, and this is a major achievement
BUT:
– Phase I is progressing more slowly than expected, and it is clear that all the elements involved (Grid middleware, the Production System, deployment and monitoring tools across the sites) need improvement
– It is a key goal of the Data Challenges to identify these problems as early as possible

21 Testing the CMS Computing Model in DC04
Focused on organized (CMS-managed) data flow/access
Functional DST with streams for physics and calibration
– DST size OK, almost usable by "all" analyses (new version ready now)
Tier-0 farm reconstruction
– 500 CPUs; ran at 25 Hz; reconstruction time within estimates
Tier-0 buffer management and distribution to Tier-1s
– TMDB: a CMS-built agent system communicating via a central database (sketched below)
– Manages dynamic dataset "state", not a file catalog
Tier-1 managed import of selected data from Tier-0
– The TMDB system worked
Tier-2 managed import of selected data from Tier-1
– Metadata-based selection OK; local Tier-1 TMDB OK
Real-time analysis access at Tier-1 and Tier-2
– Achieved 20-minute latency from Tier-0 reconstruction to job launch at Tier-1 and Tier-2
Catalog services, replica management
– Significant performance problems found and being addressed
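Here is a minimal sketch of the agent-plus-central-database idea behind TMDB; the schema, dataset name and state names are illustrative, not the real TMDB design. Each site agent polls a shared table for datasets assigned to its site and advances their state as the corresponding transfer work completes.

```python
import sqlite3

STATES = ["Available_at_T0", "Assigned", "Transferring", "Done"]

def setup(conn):
    conn.execute("CREATE TABLE datasets (name TEXT, site TEXT, state TEXT)")
    conn.execute("INSERT INTO datasets VALUES "
                 "('example_ttbar_DST', 'T1_FNAL', 'Available_at_T0')")
    conn.commit()

def agent_step(conn, site):
    """One polling cycle of a site agent: claim the next dataset for this site
    and advance its state by one step (stand-in for the actual transfer work)."""
    row = conn.execute(
        "SELECT name, state FROM datasets WHERE site=? AND state!='Done' LIMIT 1",
        (site,)).fetchone()
    if row is None:
        return None
    name, state = row
    next_state = STATES[STATES.index(state) + 1]
    conn.execute("UPDATE datasets SET state=? WHERE name=? AND site=?",
                 (next_state, name, site))
    conn.commit()
    return name, next_state

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    setup(conn)
    while (step := agent_step(conn, "T1_FNAL")) is not None:
        print(step)
```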

22 DC04 Data Challenge
Focused on organized (CMS-managed) data flow/access.
T0 at CERN in DC04:
– 25 Hz reconstruction
– Events filtered into streams
– Record raw data and DST
– Distribute raw data and DST to T1s
T1 centres in DC04 (PIC Barcelona, FZK Karlsruhe, CNAF Bologna, RAL Oxford, IN2P3 Lyon, FNAL Chicago):
– Pull data from T0 to T1 and store
– Make data available to the PRS
– Demonstrate quasi-real-time analysis of DSTs
T2 centres in DC04 (Legnaro, CIEMAT Madrid, Florida, IC London, Caltech):
– Pre-challenge production at > 30 sites
– Modest tests of DST analysis

23 DC04 Layout
[Diagram] At the Tier-0, a fake on-line process feeds Castor and RefDB; ORCA RECO jobs register their products in the POOL RLS catalogue and write to the GDB; the Tier-0 data distribution agents, steered by the TMDB, move data through the EB and out over LCG-2 services. Each Tier-1 runs a Tier-1 agent with T1 storage and MSS, serving ORCA analysis jobs and ORCA Grid jobs; Tier-2 sites hold T2 storage where physicists run local ORCA jobs.

24 Next Steps
The Physics TDR requires physicist access to DC04 data:
– Re-reconstruction passes
– Alignment studies
– Luminosity effects
Estimated throughput required: ~10M events/month (see the rate sketch below)
CMS "summer timeout" to focus new effort on:
– DST format/contents
– Data Management "RTAG"
– Workload Management deployment for physicist data access now
– A cross-project coordination group focused on end-user analysis
Use the requirements of the Physics TDR to build understanding of the analysis model, while doing the analysis
– Make it work for the Physics TDR
Component data challenges in 2005
– Not a big bang where everything has to work at the same time
Readiness challenge in 2006
– 100% of startup scale
– Concurrent production, distribution, ordered and chaotic analysis
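For scale, a quick back-of-the-envelope conversion of the 10M events/month requirement into a sustained rate, using only the number quoted above:

```python
events_per_month = 10_000_000
seconds_per_month = 30 * 24 * 3600           # ~2.6 million seconds

sustained_rate_hz = events_per_month / seconds_per_month
print(f"{sustained_rate_hz:.1f} Hz sustained")   # ~3.9 Hz, versus the 25 Hz DC04 Tier-0 rate
```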

25 LHCb DC'04 Aims
Gather information for the LHCb Computing TDR.
Physics goals:
– HLT studies, consolidating efficiencies
– B/S studies, consolidating background estimates and background properties
Requires a quantitative increase in the number of signal and background events:
– 30 x 10^6 signal events (~80 physics channels)
– 15 x 10^6 specific-background events
– 125 x 10^6 background events (B-inclusive + minimum bias, 1:1.8)
DC'04 is split into 3 phases:
– Production: MC simulation (done)
– Stripping: event pre-selection (to start soon)
– Analysis (in preparation)
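Adding up the requested samples is simple arithmetic on the numbers above; the sketch below also decomposes the inclusive background assuming the 1:1.8 ratio is read as B-inclusive to minimum bias, which is my interpretation of the slide. The 186 M events quoted later for Phase 1 is of the same order as this request.

```python
signal = 30e6
specific_background = 15e6
inclusive_background = 125e6                  # B-inclusive + minimum bias, ratio 1:1.8

total_requested = signal + specific_background + inclusive_background
b_inclusive = inclusive_background * 1.0 / (1 + 1.8)
min_bias = inclusive_background * 1.8 / (1 + 1.8)

print(f"total requested: {total_requested/1e6:.0f} M events")                    # 170 M
print(f"B-inclusive ~{b_inclusive/1e6:.0f} M, min. bias ~{min_bias/1e6:.0f} M")  # ~45 M / ~80 M
```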

26 DIRAC Services & Resources
[Diagram] User interfaces (the GANGA UI, a user CLI, the BK query web page and the FileCatalog browser) and the production manager talk to the DIRAC services: the Job Management Service, JobMonitorSvc, JobAccountingSvc (with its AccountingDB), InformationSvc, FileCatalogSvc, MonitoringSvc and BookkeepingSvc. The DIRAC resources comprise DIRAC CEs, DIRAC sites running agents (CE 1, CE 2, CE 3), the LCG Resource Broker path, and DIRAC storage (disk files accessed via gridftp, bbftp, rfio).

27 Phase 1 Completed
[Plot: daily production rate] DIRAC alone at first; LCG in action at 1.8 x 10^6 events/day; LCG paused, then restarted at 3-5 x 10^6 events/day. In total, 186 M events produced.

28 LCG Performance (I)
[Plot: submitted, cancelled, and aborted-before-running jobs] 211k jobs submitted; of those that reached the running stage, 113k finished successfully (Done) and 34k aborted.

29 LCG Performance (II)
[Table: LCG job submission summary] Overall LCG efficiency: 61%.

30 LHCb DC'04 Status
LHCb DC'04 Phase 1 is over. The production target has been achieved:
– 186 M events in 424 CPU-years
– ~50% on LCG resources (75-80% in the last weeks)
The LHCb strategy was right:
– Submitting "empty" DIRAC agents to LCG has proven to be very flexible, allowing a good success rate (see the sketch below)
There is big room for improvement, on both DIRAC and LCG:
– DIRAC needs to improve the reliability of its servers: a big step was already made during the DC
– LCG needs to improve the single-job efficiency: ~40% aborted jobs
– In both cases extra protection against external failures (network, unexpected shutdowns, ...) must be built in
Congratulations and warm thanks to the complete LCG team for their support and dedication.
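A minimal sketch of the "empty agent" (pilot) pattern referred to above, with hypothetical interfaces rather than the actual DIRAC code: the agent is submitted as an ordinary Grid job, and only once it is running on a healthy worker node does it pull real work from the central task queue, so an unhealthy node wastes a batch slot rather than a production job.

```python
import shutil

# Stand-in for the central DIRAC task queue; the real agent would contact the
# Job Management Service over the network.
TASK_QUEUE = [{"job_id": 42, "application": "Gauss", "events": 500}]

def worker_node_ok(min_disk_gb=2.0):
    """Sanity checks the pilot performs before asking for real work."""
    free_gb = shutil.disk_usage(".").free / 1e9
    return free_gb >= min_disk_gb

def pilot_main():
    """The "empty" agent submitted to LCG: exit quietly if the node is unhealthy,
    otherwise pull a payload from the central queue and run it."""
    if not worker_node_ok():
        return "no-op: unhealthy worker node"
    if not TASK_QUEUE:
        return "no work available"
    payload = TASK_QUEUE.pop(0)
    return f"running job {payload['job_id']} ({payload['application']})"

if __name__ == "__main__":
    print(pilot_main())
```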

31 Personal Observations on Data Challenge Results
Tier-0 operations demonstrated at 25% scale
– The job couplings from the Objectivity era are gone
Directed data flow/management T0 > T1 > T2: worked (intermittently)
Massive simulation on LCG, Grid3, NorduGrid: worked
Beginning to get experience with input-data-intensive jobs
Not many users out there yet stressing the chaotic side
– The next 6 months are critical: we have to see broad and growing adoption; not having a personal Grid user certificate will have to seem odd
Many problems are classical computer-centre ones
– Full disks, reboots, software installation, dead disks, ...
– Actually this is bad news: there is no middleware silver bullet, and it is hard work getting so many centres up to the required performance

32 Critical Issues for Early 2005
Data management
– Building experiment data management solutions
Demonstrating end-user access to remote resources
– Data and processing
Managing conditions and calibration databases
– And their global distribution
Managing network expectations
– Analysis can place (currently) impossible loads on the network and data management components
Planning for the future, while maintaining priority controls
Determining the pragmatic mix of Grid responsibilities and experiment responsibilities
– Recall the "Data" in DataGrid: LHC is data intensive
– Configuring the experiment and Grid software to use generic resources is wise
– But (I think) data location will require a more ordered approach in practice

