Published by Harold Daniels. Modified over 9 years ago.
1
US ATLAS Transparent Distributed Facility Workshop, UNC 3/4/2008

FDR - Results so far, schedule and challenges
Jim Cochran, Iowa State University

Outline
- What is the FDR?
- Event mixing: a moving target
- Current schedule
- FDR-1 operation: the week of February 4 (Tier 0)
- Data distribution
- Preparation for US analysis effort
- Early user feedback
- Lessons learned (so far)
- Plans for FDR-2

Much (most?) material stolen from talks by Michael Wilson, Ian Hinchcliffe, Dave Charlton, Alexei Klimentov, Kors Bos, …
2
What is the FDR?
- The Full-Dress Rehearsal (FDR) is an attempt to test the complete chain (or as much of it as we can actually test without real collisions)
- There are many steps between DAQ output and final plots
- The individual steps have been tested; the full chain has not
- … as we'll see, the FDR is not a totally realistic test
- Basic idea: feed (appropriately mixed*) raw events into the system at DAQ output
- Treat them, as much as possible, as if they were real data (calibration, streaming, reco, …)
- To happen in two parts:

         Instantaneous luminosity (cm^-2 s^-1)   Integrated luminosity (pb^-1)
  FDR-1  ~10^31                                  0.36
  FDR-2  ~10^33                                  20-25 (13.3 in the current plan; details later)
3
more specifically …
- Mixed data in bytestream format is copied to and streamed from the SFOs
- What's an SFO?
  - Receives event data from Event Filter nodes and writes raw data files
  - SFOs are the output of TDAQ and reside at Point1 on the surface
  - Data is copied from SFO disks to T0
  - SFOs have a 24 hr buffer and a data rate of up to 600 MB/s
  - There will be 5-10 SFO PCs
- Diagram labels: "Insert mixed (bytestream) data here"; T0
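As a rough sanity check on the SFO figures above, the following sketch estimates how much disk a 24-hour buffer implies at 600 MB/s. It assumes that 600 MB/s is the aggregate rate over all SFOs and that the buffer is shared evenly among them; the slide states neither, and the function names are illustrative only.

```python
SECONDS_PER_HOUR = 3600


def buffer_size_tb(rate_mb_per_s: float, hours: float) -> float:
    """Disk volume in decimal TB needed to hold `hours` of data at the given rate."""
    return rate_mb_per_s * hours * SECONDS_PER_HOUR / 1e6


def per_sfo_tb(total_tb: float, n_sfos: int) -> float:
    """Even split of the total buffer over n_sfos machines (an assumption)."""
    return total_tb / n_sfos


total = buffer_size_tb(600, 24)
print(f"total buffer: {total:.2f} TB")
for n in (5, 10):  # the slide quotes 5-10 SFO PCs
    print(f"{n} SFOs: {per_sfo_tb(total, n):.2f} TB each")
```

Under these assumptions the full buffer is a few tens of TB, so each SFO PC would need several TB of local disk.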
4
Motivation
- If the LHC turned on tomorrow, what would happen to the recorded data?
  - We know we are not ready: what don't we know?
  - Real-time tests of hardware, software, databases, storage media, networks, data flow, data integrity and quality checks, calibrations, etc.
  - What are the flaws in our processing models?
  - Are we able to identify and correct routine running problems? Where are resources needed?
  - Tests of distributed computing and analysis machinery (T0, T1, T2, T3), which especially should be tested under heavy user load
- known unknowns vs unknown unknowns …
- Diagram labels (SFO, Tier0, Tier1): Metadata; Trigger config.; Lumi. info; AOD, ESD, TAG prod.; DQ monitoring; Calibrations; (ongoing) MC prod.; Import; Archiving and Export; DPD prod.; Reprocessing
5
Main FDR steps
- Sample preparation
- Calibration stream preparation
- SFO preparation
- FDR run
  - SFO to Tier-0 part
  - Tier-0 operations
  - Data quality, calibration and alignment operations
  - Data transfers
  - Tier-1 operations
  - Tier-2 operations
- After main "FDR run"
  - Perform distributed analysis, use TAGs, produce DPDs, etc.
  - Re-reconstruct from BS after a time
  - Remake DPDs as needed for analysis
6
Schedule

FDR-1
  Sample preparation               1/3/08 (actual 1/18/08)
  Calibration stream preparation
  SFO preparation
  FDR run                          Week of Feb 4
  After main "FDR run"             Feb 11 and beyond
  (Re)processing at T1             ~March 17 ?

FDR-2 (details later)
  Simulation                       already started
  Digitization                     starts April 1
  90% of RDO at BNL                April 20
  Remaining 10% of RDO at BNL      April 30
  Mixing complete at BNL           May 14
  Data on SFO                      May 19

- Simulation and digitization are done at T2s
- Original estimate for mixing was 30 days
7
Sample Preparation: Details
- Make use of previously generated events where possible; otherwise generate new ones
- Mix events randomly to get the correct physics mixture as expected at HLT output
  - Fakes will be discussed later
- Event mixing also runs the trigger simulation; only events passing all trigger levels are kept
- Trigger information is written into the event for later use in analysis
- Events are written into physics streams (+ Express & Calibration)
- Format of written events is bytestream (BS = RAW)
- Files for most streams respect luminosity block (LB) boundaries
  - not the express stream
- Files for physics streams are therefore written per stream, per LB, per SFO
- O(10^7) events for each of FDR-1 and FDR-2
- All MC-truth information is lost in this processing
8
Event Mixing: FDR-1
An evolving compromise between realistic simulation and practical limitations
- Goal: include all significant standard-model processes in known proportions ("mixing") with background events passing trigger chains ("fakes")
  - 10 one-hour runs at ~10^31 cm^-2 s^-1 (lumi. varying across run), 1 one-hour run at ~10^32 cm^-2 s^-1 (constant lumi.)
  - Produce files in bytestream format, split across streams, SFOs, and luminosity blocks (2 minutes/LB)
- Actually achieved by Thurs Feb 7:
  - 8 runs at ~10^31 with no fakes; 2 runs at ~10^31 with some e/γ fakes
  - 1 run at ~10^32 finished late and was sent to Tier1 asynchronously
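To make the "per stream, per LB, per SFO" splitting concrete, the sketch below counts the physics-stream files a single one-hour run would produce with 2-minute luminosity blocks. The four stream names and the figure of five SFOs are taken from the FDR-1 data-playing slide; the file-key pattern is purely hypothetical and is not the real ATLAS naming convention.

```python
from itertools import product

RUN_MINUTES = 60   # one-hour runs
LB_MINUTES = 2     # 2 minutes per luminosity block
STREAMS = ["Muon", "Jet", "Egamma", "Minbias"]  # physics streams named in the talk
N_SFOS = 5         # FDR-1 data was played from 5 SFOs


def lumi_blocks(run_minutes: int = RUN_MINUTES, lb_minutes: int = LB_MINUTES) -> int:
    """Number of luminosity blocks in one run."""
    return run_minutes // lb_minutes


def file_keys(run_number: int, streams=STREAMS, n_sfos: int = N_SFOS) -> list[str]:
    """One hypothetical key per (stream, LB, SFO) combination."""
    n_lb = lumi_blocks()
    return [
        f"run{run_number:07d}.{stream}.lb{lb:04d}.SFO-{sfo}"
        for stream, lb, sfo in product(streams, range(1, n_lb + 1), range(1, n_sfos + 1))
    ]


keys = file_keys(3062)
print(lumi_blocks(), "LBs per run,", len(keys), "physics-stream files per run")
print(keys[0])
```

The count grows as streams × LBs × SFOs, which is why even a short rehearsal produces hundreds of small files per run.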
9
Event Mixing: Issues
- At 10^32 we have an event rate from minbias of 70 kHz and a trigger output of 200 Hz
- The only way to make 10 hr of unbiased data is to start with 2.5B events!
- But we only need realistic event rates after the Event Filter: 8M events in total
- This low rate is possible only because of trigger rejection and prescales
  - need to prefilter and make samples with thresholds matched to triggers
  - not trivial to get right!
- Mixing logistics:
  - Data was not available until Jan 18 (Jan 3 was the original plan)
  - All data was prestaged to CASTOR disk; output was copied back to CASTOR
  - Could not get all the data onto the compute nodes fast enough
  - CPU time to process events was too long
- To get data in time for T0, we had to change strategy: reduce fake samples drastically
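The event counts quoted above follow from simple rate arithmetic, sketched here. The 70 kHz and 200 Hz rates and the 10-hour duration come from the slide; treating the rates as constant over the whole period is a simplifying assumption.

```python
SECONDS_PER_HOUR = 3600


def events(rate_hz: int, hours: int) -> int:
    """Events accumulated at a constant rate over the given number of hours."""
    return rate_hz * hours * SECONDS_PER_HOUR


unbiased = events(70_000, 10)   # minimum-bias input needed for 10 h
recorded = events(200, 10)      # events surviving the trigger in 10 h
rejection = 70_000 // 200       # overall trigger rejection factor

print(f"unbiased input: {unbiased:.2e} events")
print(f"after trigger:  {recorded:.2e} events")
print(f"rejection factor: {rejection}")
```

The first number reproduces the 2.5B figure; the second is of the same order as the roughly 8M events quoted as the total actually needed.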
10
Lessons from FDR-1 sample preparation
- Fake samples must be matched in detail to the trigger menu
  - Fakes need to be prefiltered using the same menu as will be used for mixing
- Tier0 is not appropriate for mixing (as noted, will use BNL for FDR-2)
- A number of lessons about computing infrastructure:
  - Fast turn-around on MC production not yet reliable
  - Data distribution is labor intensive
  - Software fixes have taken time
- Reallocating Tier0 resources (computers, people) affects operations
  - Tier0 managers need time to prepare and maintain infrastructure
  - Disk space and CPU capacities are large but limited
11
FDR-1 Operation: Trigger menu
- The trigger menu for L = 10^31 selects mostly fakes
  - The rate of interesting events selected at this lumi. is much lower than the nominal 200 Hz
  - Thus, because 8/10 low-lumi. runs do not include enhanced-fake (e.g., EM-like jets) samples, the overall rate is closer to 10 Hz
  - Achieved rate is ~50 Hz for runs with enhanced-fake samples
- Conclusions
  - This is a menu for first data and detector commissioning
  - Not optimal for the final FDR-1 mixed sample
12
FDR-1 data playing
- Data played from 5 SFOs during 5-8 Feb (Tues-Fri)
  - Same events replayed every day with different run numbers
  - Reconstruction with a different AtlasPoint1 patch every day (for monitoring updates and fixes)
    - Flexible: allowed us to run anything at all
    - Dangerous: not reproducible; software not properly validated
    - Will build AtlasPoint1 13.0.40.2 for Tier1 reconstruction
- Streams (the slide divides these into exported and non-exported streams):
  - Express
  - ID tracks for alignment
  - Muon and b physics ("Muon")
  - Jets, tau, and missing ET ("Jet")
  - Electron and photon ("Egamma")
  - Minimum bias ("Minbias")
13
Express stream
- First use of express stream was successful
  - Primary purposes: data quality and calibrations
  - Sufficient to detect problems and details, allowing timely validation of data (if DQ is operational)
  - Monitoring histograms (detector and performance) available hours after run appeared; shifters spotted problems shortly thereafter
  - Histograms moved to AFS to avoid overloading CASTOR with too many requests
  - ESDs and AODs not exported; consider temporary storage so users can access them if needed for DQ
14
Data quality
- Data quality was checked by central shifters and system experts
  - Included: Pixel, LAr, Tile, MDT, RPC, L1RPC, ID alignment, e/gamma, jets, missing ET, muon tracks
  - Two shifts per day using a desk in the Tier0 control room
  - Exercised ability to spot problems with 10-minute granularity
  - Lots of room for improvement; core functionality came online during the week
- Run summary and histograms:
  - http://atlasdqm.web.cern.ch/atlasdqm/results.html
  - http://sroe.home.cern.ch/sroe/runlist/query.html
15
Some unexpected features
- Shoulders at ~150 GeV and ~600 GeV
- TRT timing not configured for one wheel (shift of 15 ns)
16
Expected DQ flags
Problems were introduced for short time intervals to test data-quality monitoring:
- Run 3062, dead LAr crate (or crates?)
- Run 3062, minutes 20-30, hot LAr cells
- What about the noisy barrel crate?
17
Calibrations
- Tested the calibration loop: determine calibration constants within 24 hrs of data being recorded and apply new constants during bulk reconstruction
  - Pixel calibration planned to run on express stream
    - Software only working in 13.X.0; postponed until FDR-2
  - TRT calibration running on ID-alignment stream
    - First of many iterations completed on Tier0 promptly; remaining iterations finishing on lxbatch
  - ID alignment
18
ID-alignment stream
- First test of a calibration stream
  - An efficient and dedicated effort had this ready for testing at the beginning of January
  - Used by both ID alignment and TRT calibration in FDR-1
  - Uses event fragments from L2 ("L2_trck10i_calib")
    - Isolated tracks with pT > 10 GeV (5 GeV for FDR-1)
    - Select tracks at 60 Hz (L = 10^31) with no additional TDAQ load
    - 600k tracks selected for FDR-1 (from a dijet sample; uses 26 GB)
  - Attempted to include 50k cosmic events; postponed due to software incompatibilities
19
ID alignment procedure
- The ID alignment group produced constants under tight computing and time constraints
  - Only had the ID alignment stream from the beginning of January for testing
  - The alignment procedure necessarily involves iterations
    - This is not suitable for Tier0; full implications are still being considered
  - Calibration model is being adapted: ID alignment ran outside of production, with constants fed back before bulk reconstruction
  - Will expect a full test of the new model in FDR-2
20
ID alignment updating
- A new ID alignment was calculated by 16:00 Thursday Feb 7 (two-day turnaround)
  - Used ID-alignment stream; did not use cosmics or express stream this time
  - Only two iterations at level 2 due to time constraints
- Figure: three panels (d0, z0, Q/pT), each with legend entries Perfect, Nominal, Aligned
21
Daily meeting
- Daily operations meeting at 16:00 from 4-8 Feb
  - Review first look at fresh data and shift report
  - Complete look at previous day's data (incl. overnight)
    - Calibration signoff: upload new constants?
    - If new constants, reprocess express stream ⇒ need to consider resources for this carefully
    - Process physics streams if data quality is understood
  - Ensure that all data-quality flags have been set
    - Need reports from ALL systems and groups
22
Flagging data quality
- Data-quality assessment has both automatic & manual components
  - Automatic tools will be run to check DCS, histograms
  - Expert system shifters need to check the automatic assessments, and possibly override them, for every run
  - After the daily meeting, all assessments are combined into a final flag, which is written into both the DB and the AODs
  - Assessments may span any period from one LB to an entire run
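One way to picture the combination of automatic and manual assessments is sketched below. This is not the ATLAS data-quality implementation: the `Flag` values, the `Override` record and the rule that a shifter override simply replaces the automatic flag are all assumptions made for illustration. The example marks luminosity blocks 11-15 bad, which would correspond to minutes 20-30 of a run if blocks are 2 minutes long and numbered from 1.

```python
from dataclasses import dataclass
from enum import Enum


class Flag(Enum):
    # Hypothetical flag values; the talk does not name the real ones.
    UNDEFINED = 0
    GOOD = 1
    BAD = 2


@dataclass(frozen=True)
class Override:
    """A shifter decision covering an inclusive range of luminosity blocks."""
    first_lb: int
    last_lb: int
    flag: Flag


def final_flags(automatic: dict[int, Flag], overrides: list[Override]) -> dict[int, Flag]:
    """Start from the automatic per-LB flags, then apply overrides in order."""
    result = dict(automatic)
    for ov in overrides:
        for lb in range(ov.first_lb, ov.last_lb + 1):
            if lb in result:  # ignore LBs outside the run
                result[lb] = ov.flag
    return result


# A 30-LB run where the automatic checks saw nothing wrong,
# but a shifter flags LBs 11-15 after spotting hot cells.
automatic = {lb: Flag.GOOD for lb in range(1, 31)}
final = final_flags(automatic, [Override(11, 15, Flag.BAD)])
bad_lbs = sorted(lb for lb, flag in final.items() if flag is Flag.BAD)
print(bad_lbs)
```

Keeping the flag per luminosity block lets a single assessment cover anything from one block to the whole run, as the slide requires.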
23
External participation
- Information flow to those outside CERN should be improved
  - Ideally, a portal where collaborators can check the real-time status of:
    - Runs recorded
    - Runs with express stream processed
    - Runs waiting for data-quality signoff
    - List of bulk streams processed, and then exported
  - How can offsite colleagues participate in meaningful data checks?
24
Data Distribution
26
Dataset replication to T2s started immediately after T1s had complete replicas
27
Data Distribution: Summary
- Data replication from CERN to Tier-1s and within all clouds is relatively stable
- Notes:
  - Site problems were fixed by the operations team (central and regional) within 24 h
  - Data export from CERN was delayed by 3 days (several technical issues, all under discussion)
  - Data replication monitoring and book-keeping still have room for improvement
  - Not enough data to be a transfer challenge; will use other data (M5) for CCRC
28
Preparation for US analysis effort (and some results)
- Ongoing effort in US to prepare user analysis queues
  - have been tested and are mostly ready to go (modulo scl3 vs scl4 issues etc.)
  - experts have been very responsive
- User participation in AOD analysis frustrated/delayed by
  - lack of a standard analysis package; should be available this week (including trigger info)
  - non-existence of the tags (?)
  - confusion about where to obtain lumi info (not a serious issue for now)
- A large set of people have expressed interest in FDR participation
  - expect a "surge" in activity once analysis doesn't require expert status
- Primary DPDs to be generated during production
  - perhaps during T0 reprocessing?
- Individual groups are starting to produce their own secondary & tertiary DPDs
- Tutorial today
29
Dimuon mass (Lashkar Kashif, Harvard)
30
Looking for J/ψ and Upsilon to ee (Andy Nelson, ISU)
32
FDR-2 plan (March 4, 2008)
1. 7 h of 10^32 data (2.5 pb^-1)
   - with fakes where possible
   - some fakes will be reused many times (JF samples); aim to redigitize so the repeated events are not quite the same
2. 3 h of 10^33 data (10.8 pb^-1)
   - probably no fakes
   - with 75 ns pileup
   - for some of this data, the beamspot will move
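The integrated luminosities quoted in this plan can be checked with a few lines of arithmetic. The sketch assumes constant instantaneous luminosity over each period and uses the standard conversion 1 pb = 10^-36 cm^2; the function name is illustrative.

```python
CM2_PER_PB = 1e-36          # 1 picobarn expressed in cm^2
SECONDS_PER_HOUR = 3600


def integrated_lumi_pb(inst_lumi_cm2_s: float, hours: float) -> float:
    """Integrated luminosity in pb^-1 for a constant instantaneous luminosity."""
    return inst_lumi_cm2_s * hours * SECONDS_PER_HOUR * CM2_PER_PB


part1 = integrated_lumi_pb(1e32, 7)   # 7 h at 10^32 cm^-2 s^-1
part2 = integrated_lumi_pb(1e33, 3)   # 3 h at 10^33 cm^-2 s^-1

print(f"part 1: {part1:.2f} pb^-1")
print(f"part 2: {part2:.2f} pb^-1")
print(f"total:  {part1 + part2:.2f} pb^-1")
```

The two parts come out close to the quoted 2.5 and 10.8 pb^-1, for a total of about 13.3 pb^-1.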
33
FDR-2 Simulation Production
Simulation of physics and background for the FDR-2 data sample
- Need to produce before May (1/3 during CCRC-1)
  - 0.5 M minimum bias and cavern events
  - 10 M physics events
  - 100 M fake events
- Simulation HITS (1.5 MB/ev), Digitization RDO (2.5 MB/ev)
- Reconstruction ESD (1 MB/ev), AOD (0.2 MB/ev), TAG (1 kB/ev)
- Simulation + Digitization is done at the T2s
- HITS and RDOs uploaded to T1
  - HITS to tape at T1
  - RDO to CERN for mixing
- Reconstruction done at T1
  - ESD, AOD, TAG archived to tape at T1
  - ESD copied to other T1s by share (3 full copies world-wide)
  - AOD and TAG copied to each other T1
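The per-event sizes above translate into storage volumes as sketched below. The sizes and the 10 M physics-event count come from the slide; applying every format to the full physics sample and counting ESD as exactly three full copies are simplifying assumptions, and the helper names are illustrative.

```python
# Per-event sizes in MB, as quoted on the slide (TAG is 1 kB = 0.001 MB).
SIZE_MB = {"HITS": 1.5, "RDO": 2.5, "ESD": 1.0, "AOD": 0.2, "TAG": 0.001}


def volume_tb(n_events: int, fmt: str, copies: int = 1) -> float:
    """Storage in decimal TB for n_events in the given format, times the number of copies."""
    return n_events * SIZE_MB[fmt] * copies / 1e6


N_PHYSICS = 10_000_000  # 10 M physics events

for fmt in SIZE_MB:
    print(f"{fmt:4s}: {volume_tb(N_PHYSICS, fmt):8.2f} TB")

# ESD is replicated so that three full copies exist world-wide.
print(f"ESD x3: {volume_tb(N_PHYSICS, 'ESD', copies=3):.2f} TB")
```

Even for the physics sample alone, HITS and RDO dominate the volume, which is consistent with sending HITS to tape and shipping only RDO onward for mixing.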
35
FDR Lessons Learned (so far)
- Fake samples must match trigger menu items; purity and rate are both important
- Software validation must be improved
- Calibration procedures need detailed advance planning and coordination
- Signoff procedure for bulk processing needs both automation and expert attention
- Express-stream reprocessing needs consideration and resources
- Always best to try things early
  - Finding problems was the goal; most were overcome
  - For FDR-2 expect basics to run more smoothly
    - Will focus on the operational details and the calibration model
    - Tune the existing system; commission new components
    - More physics content in streams
    - Developers, users and experts have a much better sense of how the system is intended to work; can be more efficient in the future
  - All these problems must be faced sometime: better now than later