1
Data production models for the CDF experiment
S. Hou, for the CDF data production team
2
CDF collaboration
Collider Detector experiment at the Fermilab Tevatron collider
Study collisions of 1 TeV protons with 1 TeV anti-protons
3
Trigger, data acquisition
Sub-detector signals trigger CDF detector data taking.

Rate                    2005 achieved           2006 upgrade
Tevatron luminosity     1.8x10^32 cm^-2 s^-1    3x10^32 cm^-2 s^-1
Level-1 acceptance      27 kHz                  40 kHz
Level-2 acceptance      850 Hz                  1 kHz
Event Builder (EVB)     850 x 0.2 MB/s          500 MB/s
Level-3 acceptance      110 Hz                  150 Hz
To-tape storage rate    20 MB/s                 40 MB/s

8 data logging streams; event size ~140 kByte; average data rate ~5 M events/day.
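A quick back-of-envelope check of the 2005 figures above (a minimal sketch; the 50% data-taking duty cycle is our assumption, not from the slide):

```python
# Check the 2005 numbers quoted above (values from the slide; the duty-cycle
# assumption is ours, not from the source).
EVENT_SIZE_MB = 0.14     # ~140 kByte per event
L3_RATE_HZ = 110         # Level-3 acceptance, 2005
SECONDS_PER_DAY = 86_400
DUTY_CYCLE = 0.5         # assumed fraction of the day spent taking data

events_per_day = L3_RATE_HZ * SECONDS_PER_DAY * DUTY_CYCLE
print(f"events/day   ~ {events_per_day/1e6:.1f} M")        # ~4.8 M, matching "~5 M events/day"
print(f"logging MB/s ~ {L3_RATE_HZ * EVENT_SIZE_MB:.1f}")  # ~15 MB/s, within the 20 MB/s budget
```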
4
Data logging rate up to Sep 2005
1 fb^-1 of data recorded.
The data logging rate increases with the luminosity of the proton and anti-proton beams; the total data volume grows with integrated luminosity.
Good-run raw data:
Feb 2002 - Dec 2004: 1017 M events = 201 k files = 185 TByte
Dec 2004 - Sep 2005: 756 M events = 102 k files = 95 TByte
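These tallies imply an average event size that can be checked directly (a small sketch using only the numbers above):

```python
# Consistency check of the good-run tallies: average event size per period
# (event counts and volumes from the slide).
periods = {
    "Feb 2002 - Dec 2004": (1017e6, 185e12),  # events, bytes
    "Dec 2004 - Sep 2005": (756e6, 95e12),
}
for label, (n_events, n_bytes) in periods.items():
    print(f"{label}: {n_bytes / n_events / 1e3:.0f} kB/event")
# ~182 kB and ~126 kB per event, bracketing the ~140 kByte quoted earlier
```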
5
Data flow
[Diagram: raw datasets flow from the CDF DAQ via Enstore to the production farm; production datasets are stored back in Enstore and served through dCache to the CDF Analysis Farm, remote CAFs, and user desktops.]
6
Data flow, Enstore storage
[Diagram: sub-detectors feed the Level-1/2 trigger and DAQ, then the Level-3 farm; the run splitter, calibration database, and file catalog connect 8 raw datasets to 52 production datasets.]
Data logging is divided by trigger table into 6 physics and 2 monitoring streams.
Events are split by trigger table into 52 production datasets (see the sketch below).
Enstore tape library storage: 18 STK 9940B drives, 200 GB/tape, 30 MByte/s read/write, steady R/W rate ~1 TByte/drive/day.
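A minimal sketch of the event-splitting idea, assuming a hypothetical trigger-bit-to-dataset map (the names below are invented; CDF's actual trigger table routed 8 raw streams into 52 production datasets):

```python
# Route each event to production datasets according to which trigger bits
# fired. Trigger and dataset names here are invented for illustration.
TRIGGER_TABLE = {
    "JET_20":   "qcd_jets",
    "EM_8":     "electron",
    "MUON_CMU": "muon",
}

def split_events(events):
    """Group events into per-dataset lists keyed by the trigger table.
    An event firing several triggers lands in several datasets."""
    datasets = {}
    for event in events:
        for trigger in event["fired_triggers"]:
            name = TRIGGER_TABLE.get(trigger)
            if name is not None:
                datasets.setdefault(name, []).append(event)
    return datasets

events = [
    {"run": 151845, "event": 1, "fired_triggers": ["JET_20"]},
    {"run": 151845, "event": 2, "fired_triggers": ["EM_8", "MUON_CMU"]},
]
for name, evts in split_events(events).items():
    print(name, len(evts))
```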
7
Computing facility
[Diagram: the Enstore tape library, file-servers, and dCache file-servers connect to the production farm and analysis farm over 2 Gbit links, with a 10 Gbit link toward remote sites via StarLight; the CDF online DAQ (2 Gbit), Oracle DB servers, and offline users attach to the same fabric.]
8
Data processing tasks
Raw data event reconstruction:
- apply detector calibration
- calculate the detected physics content
- write output to the assigned trigger datasets
One input file is processed by one binary job, which splits the output into per-dataset files.
Concatenation of output files: a raw data file is 1 GByte, while output file sizes vary from 5 MByte to 1 GByte; small files of the same dataset are concatenated in data-taking sequence into 1 GByte files (a sketch follows).
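A sketch of the concatenation step under the stated constraints: variable-size outputs of one dataset packed into ~1 GByte merged files while preserving run order (the file sizes below are illustrative):

```python
# Pack variable-size output files (5 MB - 1 GB) of one dataset into ~1 GB
# merged files, preserving data-taking (run) order.
TARGET_MB = 1_000  # target merged-file size

def concatenate(files, target=TARGET_MB):
    """files: list of (run, size_MB), already sorted by run sequence.
    Returns merge groups, each close to `target` MB."""
    groups, current, size = [], [], 0
    for run, mb in files:
        if current and size + mb > target:  # would overflow: close the group
            groups.append(current)
            current, size = [], 0
        current.append((run, mb))
        size += mb
    if current:
        groups.append(current)
    return groups

outputs = [(151845, 350), (151846, 5), (151846, 700), (151847, 400), (151848, 900)]
for i, group in enumerate(concatenate(outputs)):
    print(f"merged file {i}: {sum(mb for _, mb in group)} MB, runs {[r for r, _ in group]}")
```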
9
Production farm, 1st model
[Diagram: a numbered pipeline through stager, worker, and concatenator, with register-input, register-output, and register-concatenated steps recorded in MySQL via the run-splitter and calibration inputs; direct I/O to the Enstore tape library through a custom I/O node.]
- FBS batch system
- dfarm: the collection of all worker IDE disks, buffering input and output files
- Farm Processing system with MySQL for bookkeeping
- Concatenation follows a rigid run sequence; output is truncated to 1 GB files
Performance: peak rate of 1 TB input/day; used to process data up to Dec 2004. A toy version of the bookkeeping follows.
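A toy version of the numbered bookkeeping steps, with a dict standing in for the MySQL tables (the state names are ours, not CDF's):

```python
# Each file advances through the pipeline stages from the diagram, with every
# transition recorded. CDF used MySQL; a dict stands in here.
STEPS = ["staged_in", "registered_input", "processed",
         "registered_output", "concatenated", "registered_concatenated"]

book = {}  # filename -> current state

def advance(filename):
    """Move a file to its next state, starting it if unseen."""
    state = book.get(filename)
    nxt = STEPS[0] if state is None else STEPS[STEPS.index(state) + 1]
    book[filename] = nxt
    return nxt

for _ in STEPS:
    advance("raw_b0.1gb")
print(book)  # {'raw_b0.1gb': 'registered_concatenated'}
```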
10
Upgrade to CAF & SAM data handling
Condor batch system dressed for CDF as the CAF (CDF Analysis Farm) package:
- interface for job submission and monitoring
- uniform platform across the other CDF computing facilities
- compatible with distributed computing development
Data handling system SAM (Sequential Access via Metadata):
- database application for file metadata
- provides file locations; loads files from tape to caches
dCache (a joint DESY+FNAL project) virtualizes disk usage, loading files from tape so that files appear to the user as always on disk.
11
Production farm, upgrade
[Diagram: a numbered pipeline in which workers take input URLs from dCache, declare metadata to SAM/DB via the run-splitter and calibration inputs, and write merged output to the fileservers.]
Upgrade to a distributed computing infrastructure: SAM data handling and the Condor CAF.
A CAF submission runs two parallel operations (sketched below):
- SAM project: activates data handling to deliver the files of the assigned SAM dataset, and tracks file consumption status
- Condor batch job: consumes the files of the associated SAM project, and declares SAM metadata for bookkeeping
Concatenation of output: merge output files sorted in run sequence, store to Enstore via SAM, and declare metadata and parentage for bookkeeping.
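A schematic of the parallel SAM-project/Condor-job pair described above; all function names are placeholders, not the real SAM client API:

```python
# SAM project delivers the files of a dataset; the Condor job consumes them
# and declares metadata for each output. Names are placeholders.
from queue import Queue

def run_sam_project(dataset_files, delivery: Queue):
    """SAM side: locate each file of the dataset and hand it to the consumer."""
    for f in dataset_files:
        delivery.put(f)    # in reality: stage from Enstore/dCache
    delivery.put(None)     # end-of-dataset marker

def run_batch_job(delivery: Queue, metadata_db: list):
    """Condor side: consume files as delivered, reconstruct, declare metadata."""
    while (f := delivery.get()) is not None:
        output = f.replace("raw", "prod")                  # stand-in for reconstruction
        metadata_db.append({"file": output, "parent": f})  # bookkeeping record

q, db = Queue(), []
run_sam_project(["raw_0001", "raw_0002"], q)
run_batch_job(q, db)
print(db)
```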
12
Production challenge
Operation tasks run as "cron jobs":
- resource monitoring
- submission and monitoring of SAM projects and of binary jobs on the CAF farms
- concatenation and store
Service interfaces and monitoring:
- Enstore tape I/O
- SAM data handling, DB service
- CDF online, calibration DB, software
Every event collected must be processed in a timely manner, interfacing to data handling, the database, and multiple CAFs, with precision bookkeeping on millions of files: zero tolerance for error, every event is counted. A sketch of one operation cycle follows.
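A minimal sketch of one unattended operation cycle as a cron-driven script (the task names mirror the slide; the checks themselves are stubs):

```python
# One cycle of the unattended operations. Cron would invoke cycle() every few
# minutes; each step below is a stub standing in for the real task.
def resources_ok():
    return True  # stub: are DBs, SAM, Enstore, and fileservers all up?

def submit_projects():
    print("submit SAM projects and matching CAF condor jobs")

def concatenate_and_store():
    print("merge outputs in run sequence, store to tape, declare metadata")

def cycle():
    if not resources_ok():  # prohibit the cycle if a required service is missing
        print("required service down, skipping this cycle")
        return
    submit_projects()
    concatenate_and_store()

cycle()
```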
13
Resource monitoring
Monitored resources: CDF DB, SAM DB and data handling, the CAF Condor batch system, and fileserver storage.
Cron jobs are prohibited from running when a required service is missing.
14
CAF Condor monitoring
A tarball (archived execution binary) is distributed to the worker CPUs; input files are copied via SAM from dCache; at the end of a job, output files are copied to the assigned fileserver. CPU engagement is monitored.
15
Farm monitoring
Monitored: worker CPUs (Ganglia), input (rcp) waiting, and traffic to the fileservers (xfs).
Bandwidth limits:
- input: Enstore loading to dCache
- output: multiple workers to fileservers
- 1 Gbit network port to IDE disk: 40 MB/s
- one output dataset to Enstore: 30 MB/s
16
SAM project monitor
Input is delivered by the SAM data handling system. Input files are organized in datasets; each dataset is submitted as a SAM project, and each project is associated with a CAF Condor job. The SAM projects are monitored.
17
Monitoring a SAM project
Consumption of a dataset is monitored: file delivery by SAM from the registered locations (dCache, samcache, Enstore, etc.) and consumption by the CAF workers.
18
Bookkeeping via SAM metadata
Each output file has a bookkeeping metadata record. Parent-daughter relations are tagged after completion. Automatic recovery runs on datasets having incomplete daughters, as sketched below.
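A sketch of the parentage-based recovery: any parent (raw) file with no registered daughter (output) files is flagged for reprocessing (the records below are illustrative):

```python
# Parent-daughter bookkeeping for automatic recovery: after a dataset is
# processed, parents without registered daughters are resubmitted.
metadata = [
    {"file": "prod_0001", "parent": "raw_0001"},
    {"file": "prod_0002", "parent": "raw_0002"},
    # raw_0003 produced no registered daughter -> needs recovery
]
parents_processed = {"raw_0001", "raw_0002", "raw_0003"}

def recovery_set(parents, records):
    """Return parents with no registered daughters: these must be redone."""
    have_daughters = {rec["parent"] for rec in records}
    return parents - have_daughters

print(recovery_set(parents_processed, metadata))  # {'raw_0003'}
```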
19
Production stability
CAF Condor is very reliable; worker hardware failures and RAID degradations are occasional.
Service is 24x7: Oracle and Enstore service, with SAM and dCache shift support.
[Plots: farm CPU usage (CAF + farm, max = 540 jobs) across 6 streams writing to 6 fileservers, with rougher CPU usage at the end as streams finish up; network traffic to/from the production farm, in/out bits per second with peak values.]
20
Production rate
Peak performance: jobs distributed to two CAFs (the analysis and production farms) use 540 CPUs to match the 6 I/O streams, with 8 dCache input fileservers and 6 output fileservers, giving a uniform processing speed of 25 M events/day (3 TB input, 4 TB output per day).
[Plots: integrated output event logging; daily file consumption.]
21
Scaling capacity
At the peak performance of 3 TB input and 4 TB output per day, the farm switch (2 Gbit capacity) sees the entire traffic; the average load is 800 Mbit/s, saturated by simultaneous network traffic to one fileserver's Gbit link (40 MB/s, corresponding to about 100 jobs per data stream for CDF; see the arithmetic below).
Scaling on CPU:
- add more CPUs to a CAF
- distribute jobs to multiple CAFs
Scaling on network I/O:
- limited by the 6-data-stream algorithm; split further
- scale by fileservers (more Gbit links)
- scale by tape drives
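The arithmetic behind these limits (slide values; the per-job output rate is an assumption we introduce to recover the 100-jobs-per-stream figure):

```python
# Back-of-envelope for the scaling numbers quoted above.
TB = 1e12
io_per_day = (3 + 4) * TB             # 3 TB in + 4 TB out across the switch
avg_MBps = io_per_day / 86_400 / 1e6  # ~81 MB/s averaged over a day
print(f"average switch load ~ {avg_MBps * 8:.0f} Mbit/s")  # ~650, same order as the 800 Mbit/s observed

link_MBps = 40   # one Gbit fileserver link, IDE-limited (from the slide)
job_MBps = 0.4   # assumed average per-job output rate (not from the slide)
print(f"jobs to saturate one stream ~ {link_MBps / job_MBps:.0f}")  # ~100 jobs/stream
```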
22
Summary
The upgraded CDF production farm has reached a reliable rate of 3 TByte/day.
Capacity is scalable by adding CPUs and I/O ports.
Operation is easy and reliable, tolerant of faults through automatic error recovery, with zero data loss.